How we score
Every tool we review is scored across six axes: capability, reliability, speed, pricing, developer experience, and trust. Each axis gets a 0–10 score backed by evidence we can point to, and the six axis scores are weighted equally in the composite. That evidence comes from one of two methods, research-based synthesis or hands-on testing, and we tell you which one produced every review.
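As a rough illustration of equal weighting, the composite can be sketched as the plain mean of the six axis scores. The function name, the axis keys, the one-decimal rounding, and the sample numbers below are assumptions made for this sketch, not figures from a real review.

```python
from statistics import mean

# The six axes named above; the key spellings are assumptions for this sketch.
AXES = (
    "capability",
    "reliability",
    "speed",
    "pricing",
    "developer_experience",
    "trust",
)


def composite_score(scores: dict[str, float]) -> float:
    """Equal-weight composite: the plain mean of the six 0-10 axis scores."""
    missing = [axis for axis in AXES if axis not in scores]
    if missing:
        raise ValueError(f"missing axis scores: {missing}")
    for axis in AXES:
        if not 0 <= scores[axis] <= 10:
            raise ValueError(f"{axis} score must be between 0 and 10")
    # Rounding to one decimal is an assumption made for display only.
    return round(mean(scores[axis] for axis in AXES), 1)


# Invented scores, used only to show the arithmetic: 45 / 6 = 7.5
example = {
    "capability": 9,
    "reliability": 8,
    "speed": 7,
    "pricing": 6,
    "developer_experience": 8,
    "trust": 7,
}
print(composite_score(example))  # 7.5
```

Because every axis carries the same weight, a weak score on one axis pulls the composite down exactly as much as a strong score on another pulls it up.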
Two kinds of reviews
Not every tool gets the same treatment, and pretending otherwise would be dishonest. Some reviews are built from research; some are built from hands-on use. Both run through the same six-axis rubric, the same editorial review, and the same scoring discipline — but the evidence underneath them is different, so we label every review with the method that produced it.
Research-based review
Most of our reviews are research-based. We synthesize evidence from multiple independent sources rather than relying on a single demo run:
- Aggregated user sentiment from Reddit, Hacker News, Trustpilot, G2, and Capterra — weighted toward specific, reproducible complaints and praise, not star averages.
- Expert write-ups and independent technical analyses.
- The vendor's own documentation, changelog, and pricing pages — read critically.
- Our rating aggregator's blended score across review platforms.
A research-based review is not us using the tool for hours on end. It's a structured read of what the tool is, what it costs, and what people who depend on it actually report — scored on the same six axes, with the rationale for each score written down. These reviews carry a 'Research-based review' tag on the tool page.
Hands-on tested review
Some reviews are hands-on: we run the tool on representative workloads and measure what happens. When a review is hands-on, the tool page shows how long we spent and what we tested. The detail below — wall-clock timing, the test window, quarterly re-testing — describes how a hands-on review works. If a tool page doesn't show a testing duration, the review is research-based, and you should read the scores as an evidence synthesis rather than a lab result.
How to tell which is which: check the meta line at the top of any tool review. A 'Research-based review' tag means we synthesized sources. A testing-duration tag (e.g. '14 hours tested') means we ran it ourselves.
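The labeling rule can be sketched in a few lines: a testing-duration tag means hands-on, and anything without one is read as research-based. The function name and the exact tag pattern are assumptions based on the '14 hours tested' example, not a description of how our pages are built.

```python
import re

# Matches a testing-duration tag such as "14 hours tested".
# The pattern is an assumption inferred from the example tag in the text.
_DURATION_TAG = re.compile(r"\d+(?:\.\d+)?\s*hours?\s+tested", re.IGNORECASE)


def review_method(meta_line: str) -> str:
    """Classify a review from the meta line at the top of a tool page."""
    if _DURATION_TAG.search(meta_line):
        return "hands-on"
    # No testing duration shown: treat the review as research-based.
    return "research-based"


print(review_method("14 hours tested"))        # hands-on
print(review_method("Research-based review"))  # research-based
```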
What we look at
- 01 Capability
Can the tool actually do what it claims?
Evidence we look for: Output quality on the headline use cases, judged from documented results and user reports — not vendor demos or synthetic benchmarks. Coverage of the edge cases vendors don't market. On hands-on reviews, we verify this on real workloads ourselves.
- 02 Reliability
Does it work when you need it to?
Evidence we look for: Documented failure modes and how the tool handles them. Whether output quality holds up across sessions and at scale, per user reports and incident history — not just the demo run. On hands-on reviews, we add uptime observed over our own test window.
- 03 Speed
How fast does it complete real tasks?
Evidence we look for: Whether real-world speed matches the marketing under normal load, drawn from user reports of latency and throughput on interactive operations. On hands-on reviews, we add wall-clock timing on representative workloads we run ourselves.
- 04 Pricing
Is the value-per-dollar reasonable?
Evidence we look for: Cost vs. capability vs. direct competitors. Whether the free tier is genuinely useful or a teaser. Hidden costs in usage-based billing — the kind that surprise you at month-end.
- 05 Developer experience (DX)
How clear is the path from "I want X" to "X is done"?
Evidence we look for: Onboarding friction. Documentation quality and accuracy. Error messages that tell you what's wrong. Time to first useful output for a new user who hasn't read the docs.
- 06 Trust
Would we recommend this to someone we know?
Evidence we look for: Data handling and privacy posture. Vendor business model and incentive alignment. Whether the tool is built to serve users — or to capture them.
What the numbers mean
Cursor — how we got to its score
This walkthrough shows the six-axis rubric in action. The method behind any given review — research-based or hands-on — is labeled on that tool's page.
Cursor is an AI-native code editor built as a fork of VS Code, and it has become the dominant AI-first IDE in developer communities as of 2026. Testing reveals that its core strength is treating AI
What we don't score on
- Hype — Being on the front page of Hacker News doesn't move a score.
- Aesthetic — A beautiful UI is nice. It's not part of the rubric.
- Popularity — Usage volume is informative context, not a scoring input.
- Vendor reputation — We score the tool, not the company's PR cycle.
- Our affiliate rate — Tools with high commissions don't get higher scores. Tools with zero commissions don't get lower ones. You can verify this on our transparency page.
How often we revisit
Tools change. Models update, prices shift, vendors pivot. We revisit every reviewed tool at least once per quarter — re-running the research synthesis, and re-testing the hands-on reviews — sooner if any of these triggers fire:
- The vendor ships a major model or product upgrade.
- Pricing changes by more than 20%.
- A reader flags a regression we can reproduce.
- A new competitor invalidates the prior comparison set.
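The revisit rule above combines a quarterly clock with four early triggers, and it can be sketched as a simple check. Everything named here (`ToolStatus`, `revisit_reasons`, the field names, and the 91-day approximation of a quarter) is hypothetical and exists only for illustration. The 20% threshold comes from the list above.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumption: "once per quarter" is approximated as 91 days.
QUARTER = timedelta(days=91)
PRICE_CHANGE_THRESHOLD_PCT = 20.0  # from the trigger list above


@dataclass
class ToolStatus:
    """Hypothetical record of what has changed since the last review."""
    last_reviewed: date
    old_price: float
    new_price: float
    major_upgrade_shipped: bool = False
    reproducible_regression_flagged: bool = False
    comparison_set_invalidated: bool = False


def price_change_pct(old_price: float, new_price: float) -> float:
    """Absolute price change as a percentage of the old price."""
    if old_price == 0:
        # A move away from free has no finite percentage; treat it as unbounded.
        return float("inf") if new_price != 0 else 0.0
    return abs(new_price - old_price) * 100 / old_price


def revisit_reasons(status: ToolStatus, today: date) -> list[str]:
    """Return every reason a tool is due for a revisit; empty means not yet."""
    reasons = []
    if today - status.last_reviewed >= QUARTER:
        reasons.append("quarterly revisit due")
    if status.major_upgrade_shipped:
        reasons.append("major model or product upgrade")
    if price_change_pct(status.old_price, status.new_price) > PRICE_CHANGE_THRESHOLD_PCT:
        reasons.append("pricing changed by more than 20%")
    if status.reproducible_regression_flagged:
        reasons.append("reader-flagged regression reproduced")
    if status.comparison_set_invalidated:
        reasons.append("new competitor invalidates comparison set")
    return reasons


# Invented example: a price move from 20 to 25 is a 25% change, so it fires early.
status = ToolStatus(last_reviewed=date(2026, 1, 5), old_price=20.0, new_price=25.0)
print(revisit_reasons(status, today=date(2026, 2, 1)))
```

Note that the price trigger is strict: a change of exactly 20% does not fire it, because the rule says "more than 20%".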
When a revisit changes a score, the prior score stays on the tool page as a delta indicator (▲/▼). Tools that regress past a threshold get a public "dropped" notice at /v2/dropped.
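A delta indicator of this kind can be sketched as a comparison of the prior and current scores. The function name and the "from X" display format are assumptions for illustration. Only the ▲/▼ symbols come from the description above.

```python
def delta_indicator(prior: float, current: float) -> str:
    """Return a marker showing the direction of a score change."""
    if current > prior:
        return f"▲ from {prior:.1f}"
    if current < prior:
        return f"▼ from {prior:.1f}"
    return ""  # unchanged score: no indicator


print(delta_indicator(7.5, 8.0))  # ▲ from 7.5
print(delta_indicator(8.0, 7.5))  # ▼ from 8.0
```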