Methodology

How we score

Every tool we review is scored across six axes — capability, reliability, speed, pricing, developer experience, and trust. Each axis gets a 0–10 score backed by evidence we can point to, weighted equally into the composite. That evidence comes from one of two methods — research-based synthesis or hands-on testing — and we tell you which one produced every review.
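Equal weighting means the composite is a plain average of the six axis scores. A minimal Python sketch, assuming the composite is rounded to one decimal place; the axis keys, the `composite` function name, and the sample scores are illustrative and not taken from any published review:

```python
AXES = ("capability", "reliability", "speed", "pricing", "dx", "trust")


def composite(scores: dict[str, float]) -> float:
    """Equal-weight composite: the arithmetic mean of the six axis scores."""
    missing = [axis for axis in AXES if axis not in scores]
    if missing:
        raise ValueError(f"missing axis scores: {missing}")
    for axis in AXES:
        if not 0 <= scores[axis] <= 10:
            raise ValueError(f"{axis} score must be between 0 and 10")
    # Rounding to one decimal place is an assumption, not a documented rule.
    return round(sum(scores[axis] for axis in AXES) / len(AXES), 1)


# Hypothetical scores, for illustration only.
example = {"capability": 8.0, "reliability": 7.0, "speed": 6.0,
           "pricing": 5.0, "dx": 9.0, "trust": 7.0}
print(composite(example))  # 42.0 / 6 = 7.0
```

Because every axis carries the same weight, a one-point change on any single axis moves the composite by one sixth of a point.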

How we review

Two kinds of reviews

Not every tool gets the same treatment, and pretending otherwise would be dishonest. Some reviews are built from research; some are built from hands-on use. Both run through the same six-axis rubric, the same editorial review, and the same scoring discipline — but the evidence underneath them is different, so we label every review with the method that produced it.

RESEARCH-BASED

Research-based review

Most of our reviews are research-based. We synthesize evidence from multiple independent sources rather than relying on a single demo run:

Aggregated user sentiment from Reddit and Hacker News — weighted toward specific, reproducible complaints and praise, not star averages.
Expert write-ups and independent technical analyses.
The vendor's own documentation, changelog, and pricing pages — read critically.
Our rating aggregator's blended score across review platforms.

A research-based review does not involve extended hands-on use of the tool. It's a structured read of what the tool is, what it costs, and what people who depend on it actually report, scored on the same six axes, with the rationale for each score written down. These reviews carry a 'Research-based review' tag on the tool page.

HANDS-ON

Hands-on tested review

Some reviews are hands-on: we run the tool on representative workloads and measure what happens. When a review is hands-on, the tool page shows how long we spent and what we tested. The detail below — wall-clock timing, the test window, quarterly re-testing — describes how a hands-on review works. If a tool page doesn't show a testing duration, the review is research-based, and you should read the scores as an evidence synthesis rather than a lab result.

How to tell which is which: check the meta line at the top of any tool review. A 'Research-based review' tag means we synthesized sources. A testing-duration tag (e.g. '14 hours tested') means we ran it ourselves.
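The labeling rule can be sketched in a few lines of Python. This assumes the duration tag always follows the pattern of the example ('14 hours tested'); the `review_method` name and the regular expression are hypothetical, not the site's actual markup logic:

```python
import re

# Matches tags such as "14 hours tested" or "1 hour tested" (assumed format).
_DURATION_TAG = re.compile(r"\d+(?:\.\d+)?\s+hours?\s+tested", re.IGNORECASE)


def review_method(meta_line: str) -> str:
    """Classify a review from the meta line at the top of a tool page."""
    if _DURATION_TAG.search(meta_line):
        return "hands-on"
    # No testing duration shown: treat the review as research-based.
    return "research-based"


print(review_method("14 hours tested"))        # hands-on
print(review_method("Research-based review"))  # research-based
```

Defaulting to research-based when no duration appears mirrors the rule stated for tool pages that show no testing duration.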

The six axes

What we look at

01 CAPABILITY
Capability
Can the tool actually do what it claims?
Evidence we look for: Output quality on the headline use cases, judged from documented results and user reports — not vendor demos or synthetic benchmarks. Coverage of the edge cases vendors don't market. On hands-on reviews, we verify this on real workloads ourselves.
02 RELIABILITY
Reliability
Does it work when you need it to?
Evidence we look for: Documented failure modes and how the tool handles them. Whether output quality holds up across sessions and at scale, per user reports and incident history — not just the demo run. On hands-on reviews, we add uptime observed over our own test window.
03 SPEED
Speed
How fast does it complete real tasks?
Evidence we look for: Whether real-world speed matches the marketing under normal load, drawn from user reports of latency and throughput on interactive operations. On hands-on reviews, we add wall-clock timing on representative workloads we run ourselves.
04 PRICING
Pricing
Is the value-per-dollar reasonable?
Evidence we look for: Cost vs. capability vs. direct competitors. Whether the free tier is genuinely useful or a teaser. Hidden costs in usage-based billing — the kind that surprise you at month-end.
05 DEVELOPER EXPERIENCE
DX
How clear is the path from "I want X" to "X is done"?
Evidence we look for: Onboarding friction. Documentation quality and accuracy. Error messages that tell you what's wrong. Time to first useful output for a new user who hasn't read the docs.
06 TRUST
Trust
Would we recommend this to someone we know?
Evidence we look for: Data handling and privacy posture. Vendor business model and incentive alignment. Whether the tool is built to serve users — or to capture them.

Scoring scale

What the numbers mean

9–10: Best-in-class. Sets the standard. Competitors benchmark against it.

7–8: Solid. Delivers on its claims with minor caveats.

5–6: Acceptable. Works, but with friction or gaps worth knowing about.

3–4: Weak. Underdelivers relative to claims or price.

1–2: Avoid. Significant problems. We'd steer a friend away.
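The bands can be read as a simple lookup. A hedged Python sketch: the scale lists whole-number ranges, so this assumes a fractional score such as 8.5 belongs to the band whose lower bound it has reached, and that anything below 3 falls under "Avoid". The `band` function and `BANDS` table are illustrative names:

```python
# (lower bound, label) pairs, highest band first.
BANDS = [
    (9.0, "Best-in-class"),
    (7.0, "Solid"),
    (5.0, "Acceptable"),
    (3.0, "Weak"),
]


def band(score: float) -> str:
    """Map a 0-10 axis score to its label on the scoring scale."""
    if not 0 <= score <= 10:
        raise ValueError("score must be between 0 and 10")
    for floor, label in BANDS:
        if score >= floor:
            return label
    # Assumption: scores below 3, including any below 1, read as "Avoid".
    return "Avoid"


print(band(9.0))  # Best-in-class
print(band(6.0))  # Acceptable
print(band(8.5))  # Solid (under the lower-bound assumption)
```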

Worked example

Cursor — how we got to its score

This walkthrough shows the six-axis rubric in action. The method behind any given review — research-based or hands-on — is labeled on that tool's page.

Update (June 2026): SpaceX exercised its option to acquire Anysphere, the parent company of Cursor, in an all-stock transaction announced June 16, 2026, valuing the deal at approxi

—

Capability: 9.0

Reliability: 6.0

Speed: 8.0

Pricing: 5.0

DX: 8.0

Read the full Cursor review →

Out of scope

What we don't score on

Hype — Being on the front page of Hacker News doesn't move a score.
Aesthetic — A beautiful UI is nice. It's not part of the rubric.
Popularity — Usage volume is informative context, not a scoring input.
Vendor reputation — We score the tool, not the company's PR cycle.
Our affiliate rate — Tools with high commissions don't get higher scores. Tools with zero commissions don't get lower ones. You can verify this on our transparency page.

Cadence

How often we revisit

Tools change. Models update, prices shift, vendors pivot. We revisit every reviewed tool at least once per quarter — re-running the research synthesis, and re-testing the hands-on reviews — sooner if any of these triggers fire:

The vendor ships a major model or product upgrade.
Pricing changes by more than 20%.
A reader flags a regression we can reproduce.
A new competitor invalidates the prior comparison set.
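The quarterly floor and the four triggers can be combined into a single check. A minimal Python sketch under stated assumptions: a quarter is approximated as 90 days, the pricing change is measured relative to the old price in either direction, and the `ToolStatus` fields and `revisit_reasons` function are hypothetical names:

```python
from dataclasses import dataclass
from datetime import date, timedelta

QUARTER = timedelta(days=90)  # assumption: "once per quarter" as 90 days


@dataclass
class ToolStatus:
    last_reviewed: date
    old_price: float
    new_price: float
    major_upgrade: bool = False
    reproduced_regression: bool = False
    comparison_set_invalidated: bool = False


def revisit_reasons(status: ToolStatus, today: date) -> list[str]:
    """Return every reason a tool is due for a revisit (empty if none)."""
    reasons = []
    if today - status.last_reviewed >= QUARTER:
        reasons.append("quarterly revisit due")
    if status.major_upgrade:
        reasons.append("major model or product upgrade")
    # Relative change against the old price; strictly more than 20%.
    if status.old_price > 0:
        change = abs(status.new_price - status.old_price) / status.old_price
        if change > 0.20:
            reasons.append("pricing changed by more than 20%")
    if status.reproduced_regression:
        reasons.append("reader-flagged regression reproduced")
    if status.comparison_set_invalidated:
        reasons.append("new competitor invalidates comparison set")
    return reasons


status = ToolStatus(date(2025, 1, 1), old_price=20.0, new_price=25.0)
print(revisit_reasons(status, date(2025, 2, 1)))
```

Returning a list of reasons, instead of a single boolean, keeps a record of why a revisit was triggered.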

When a re-test changes a score, the prior score stays on the tool page as a delta indicator (▲/▼). Tools that regress past a threshold get a public dropped notice at /v2/dropped.
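A delta indicator of this kind could be derived as follows. This is a sketch only: the exact display format, the one-decimal rounding, and the `delta_indicator` name are assumptions, and the regression threshold for a dropped notice is not specified here, so it is left out:

```python
def delta_indicator(prior: float, current: float) -> str:
    """Format a score change as an up or down marker with the prior score."""
    diff = round(current - prior, 1)
    if diff > 0:
        return f"▲ {diff:.1f} (was {prior:.1f})"
    if diff < 0:
        return f"▼ {abs(diff):.1f} (was {prior:.1f})"
    return ""  # unchanged scores show no indicator (assumed behavior)


print(delta_indicator(6.0, 7.5))  # ▲ 1.5 (was 6.0)
print(delta_indicator(6.0, 5.0))  # ▼ 1.0 (was 6.0)
```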
