Methodology

How we score

Every tool we review is scored across six axes: capability, reliability, speed, pricing, developer experience, and trust. Each axis gets a 0–10 score backed by evidence we can point to, and the six axis scores are weighted equally into the composite. That evidence comes from one of two methods, research-based synthesis or hands-on testing, and we tell you which one produced every review.
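
As a rough illustration of equal weighting, the composite can be read as the plain average of the six axis scores. The sketch below assumes that reading; the function name, the axis keys, the one-decimal rounding, and the sample scores are all hypothetical and not taken from our scoring system.

```python
# Hypothetical axis keys; the six axes themselves come from the rubric.
AXES = (
    "capability",
    "reliability",
    "speed",
    "pricing",
    "developer_experience",
    "trust",
)


def composite_score(scores: dict[str, float]) -> float:
    """Equal-weight composite: the arithmetic mean of the six axis scores."""
    missing = [axis for axis in AXES if axis not in scores]
    if missing:
        raise ValueError(f"missing axis scores: {missing}")
    for axis in AXES:
        if not 0 <= scores[axis] <= 10:
            raise ValueError(f"{axis} score must be between 0 and 10")
    # Rounding to one decimal is an assumption made for display purposes.
    return round(sum(scores[axis] for axis in AXES) / len(AXES), 1)


# Made-up scores for an imaginary tool, not a real review.
example = {
    "capability": 8.0,
    "reliability": 7.0,
    "speed": 6.0,
    "pricing": 5.0,
    "developer_experience": 9.0,
    "trust": 7.0,
}
print(composite_score(example))  # → 7.0
```

Because every axis carries the same weight, a weak score on one axis pulls the composite down exactly as much as a strong score on another pulls it up.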

How we review

Two kinds of reviews

Not every tool gets the same treatment, and pretending otherwise would be dishonest. Some reviews are built from research; some are built from hands-on use. Both run through the same six-axis rubric, the same editorial review, and the same scoring discipline — but the evidence underneath them is different, so we label every review with the method that produced it.

RESEARCH-BASED

Research-based review

Most of our reviews are research-based. We synthesize evidence from multiple independent sources rather than relying on a single demo run:

  • Aggregated user sentiment from Reddit, Hacker News, Trustpilot, G2, and Capterra — weighted toward specific, reproducible complaints and praise, not star averages.
  • Expert write-ups and independent technical analyses.
  • The vendor's own documentation, changelog, and pricing pages — read critically.
  • Our rating aggregator's blended score across review platforms.

A research-based review does not mean we used the tool for hours on end. It is a structured read of what the tool is, what it costs, and what people who depend on it actually report, scored on the same six axes, with the rationale for each score written down. These reviews carry a 'Research-based review' tag on the tool page.

HANDS-ON

Hands-on tested review

Some reviews are hands-on: we run the tool on representative workloads and measure what happens. When a review is hands-on, the tool page shows how long we spent and what we tested. The detail below — wall-clock timing, the test window, quarterly re-testing — describes how a hands-on review works. If a tool page doesn't show a testing duration, the review is research-based, and you should read the scores as an evidence synthesis rather than a lab result.

How to tell which is which: check the meta line at the top of any tool review. A 'Research-based review' tag means we synthesized sources. A testing-duration tag (e.g. '14 hours tested') means we ran it ourselves.
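
If you wanted to apply that rule programmatically, a minimal sketch might look like the following. It assumes the meta line is available as plain text and that duration tags follow the '14 hours tested' shape shown above; the function name and the regular expression are illustrative, not part of our site.

```python
import re

# Assumed shape of a testing-duration tag, based on the '14 hours tested' example.
DURATION_TAG = re.compile(r"\d+(?:\.\d+)?\s*hours?\s+tested", re.IGNORECASE)


def review_method(meta_line: str) -> str:
    """Classify a review from its meta line.

    A testing-duration tag means hands-on. Anything without one is treated
    as research-based, matching the rule stated for tool pages.
    """
    if DURATION_TAG.search(meta_line):
        return "hands-on"
    return "research-based"


print(review_method("14 hours tested"))        # → hands-on
print(review_method("Research-based review"))  # → research-based
```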

The six axes

What we look at

  1. 01 CAPABILITY

    Capability

    Can the tool actually do what it claims?

    Evidence we look for: Output quality on the headline use cases, judged from documented results and user reports — not vendor demos or synthetic benchmarks. Coverage of the edge cases vendors don't market. On hands-on reviews, we verify this on real workloads ourselves.

  2. 02 RELIABILITY

    Reliability

    Does it work when you need it to?

    Evidence we look for: Documented failure modes and how the tool handles them. Whether output quality holds up across sessions and at scale, per user reports and incident history — not just the demo run. On hands-on reviews, we add uptime observed over our own test window.

  3. 03 SPEED

    Speed

    How fast does it complete real tasks?

    Evidence we look for: Whether real-world speed matches the marketing under normal load, drawn from user reports of latency and throughput on interactive operations. On hands-on reviews, we add wall-clock timing on representative workloads we run ourselves.

  4. 04 PRICING

    Pricing

    Is the value-per-dollar reasonable?

    Evidence we look for: Cost vs. capability vs. direct competitors. Whether the free tier is genuinely useful or a teaser. Hidden costs in usage-based billing — the kind that surprise you at month-end.

  5. 05 DEVELOPER EXPERIENCE

    DX

    How clear is the path from "I want X" to "X is done"?

    Evidence we look for: Onboarding friction. Documentation quality and accuracy. Error messages that tell you what's wrong. Time to first useful output for a new user who hasn't read the docs.

  6. 06 TRUST

    Trust

    Would we recommend this to someone we know?

    Evidence we look for: Data handling and privacy posture. Vendor business model and incentive alignment. Whether the tool is built to serve users — or to capture them.
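
For the Speed axis on hands-on reviews, wall-clock timing of a representative workload can be taken roughly as sketched below. This is a generic illustration, not our actual test harness: the function name, the default run count, and the choice of the median as the summary are assumptions.

```python
import time
from statistics import median
from typing import Callable


def time_workload(task: Callable[[], object], runs: int = 5) -> float:
    """Run `task` several times and return the median wall-clock seconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    durations = []
    for _ in range(runs):
        start = time.perf_counter()  # monotonic, high-resolution clock
        task()
        durations.append(time.perf_counter() - start)
    # The median is less sensitive to one slow outlier run than the mean.
    return median(durations)


# Stand-in workload; a real test would call the tool under review instead.
elapsed = time_workload(lambda: sum(range(100_000)), runs=3)
print(f"median wall-clock time: {elapsed:.6f} s")
```

Repeating the run and reporting a middle value is one way to keep a single cold start or network hiccup from defining the result.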

Scoring scale

What the numbers mean

  • 9–10 (Best-in-class): Sets the standard. Competitors benchmark against it.
  • 7–8 (Solid): Delivers on its claims with minor caveats.
  • 5–6 (Acceptable): Works, but with friction or gaps worth knowing about.
  • 3–4 (Weak): Underdelivers relative to claims or price.
  • 1–2 (Avoid): Significant problems. We'd steer a friend away.

Worked example

Cursor — how we got to its score

This walkthrough shows the six-axis rubric in action. The method behind any given review — research-based or hands-on — is labeled on that tool's page.

Cursor is an AI-native code editor built as a fork of VS Code, and it has become the dominant AI-first IDE in developer communities as of 2026. Testing reveals that its core strength is treating AI…

  • Capability: 9.0
  • Reliability: 6.0
  • Speed: 8.0
  • Pricing: 5.0
  • DX: 8.0

Read the full Cursor review →

Out of scope

What we don't score on

Cadence

How often we revisit

Tools change. Models update, prices shift, vendors pivot. We revisit every reviewed tool at least once per quarter — re-running the research synthesis, and re-testing the hands-on reviews — sooner if any of these triggers fire:

When a re-test changes a score, the prior score stays on the tool page as a delta indicator (▲/▼). Tools that regress past a threshold get a public dropped notice at /v2/dropped.
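
The delta indicator described above can be thought of as a simple comparison between the prior and the current score. The sketch below is illustrative only: the function name, the one-decimal formatting, and the empty string for an unchanged score are assumptions, and it does not model the dropped-notice threshold.

```python
def delta_indicator(prior: float, current: float) -> str:
    """Return a ▲/▼ marker with the size of the change, or '' if unchanged."""
    delta = round(current - prior, 1)
    if delta > 0:
        return f"▲ {delta:.1f}"
    if delta < 0:
        return f"▼ {abs(delta):.1f}"
    return ""


print(delta_indicator(6.0, 7.5))  # → ▲ 1.5
print(delta_indicator(8.0, 6.0))  # → ▼ 2.0
```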