AI Code Review Benchmark

We evaluated Kody and other AI code review tools on the same PRs across five open-source projects. The goal is to give you a clear picture of how each tool performs in real reviews.

How We Built This Benchmark

We used the same public repositories from an existing benchmark and added Kody, our code review agent. To keep the comparison meaningful, we focused only on Critical-, High-, and Medium-severity issues.

We ran the exact same pull requests through four AI code review tools (Kodus, Coderabbit, GitHub Copilot, and Cursor BugBot) with no additional setup or custom configuration, specifically to avoid skewing the results.

All tools were evaluated using the same dataset under the same conditions.
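
To make the scoring concrete, here is a minimal sketch of how per-severity detection rates like the ones in the TL;DR could be computed. The `Issue` structure, the (pr, issue_id) matching, and the example data are assumptions for illustration only, not the benchmark's actual scripts or schema.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical representation of a known (ground-truth) bug in a benchmark PR.
# Not the benchmark's actual schema.
@dataclass(frozen=True)
class Issue:
    pr: str          # pull request identifier
    issue_id: str    # identifier of the known bug
    severity: str    # "critical", "high", or "medium"

def detection_rates(known_issues: list[Issue],
                    tool_findings: set[tuple[str, str]]) -> dict[str, float]:
    """Share of known issues per severity that a tool flagged.

    `tool_findings` holds (pr, issue_id) pairs that a reviewer matched
    between the tool's comments and the ground-truth issues.
    """
    totals: dict[str, int] = defaultdict(int)
    found: dict[str, int] = defaultdict(int)
    for issue in known_issues:
        totals[issue.severity] += 1
        if (issue.pr, issue.issue_id) in tool_findings:
            found[issue.severity] += 1
    return {sev: found[sev] / totals[sev] for sev in totals}

# Made-up example: two known issues, one detected by the tool.
issues = [
    Issue("example-repo#123", "null-deref", "high"),
    Issue("example-repo#456", "race-condition", "critical"),
]
print(detection_rates(issues, {("example-repo#123", "null-deref")}))
# {'high': 1.0, 'critical': 0.0}
```

The per-severity split matters because an overall average can hide a tool that only catches easy, medium-severity bugs while missing critical ones.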

Repositories Analyzed

  • Sentry

  • Cal.com

  • Grafana

  • Discourse

  • Keycloak

TL;DR

  • For critical issues, Kodus (6%) and GitHub Copilot (62%) delivered the best results. Even so, the numbers show there's still plenty of room for improvement in detecting this class of issue.

  • For high-severity issues, the gap between tools became more noticeable. Coderabbit had its worst performance here (31%), falling well below the others. Cursor BugBot (50%) and Kodus (81%) performed better, though results still varied from PR to PR.

  • For medium-severity issues, all tools performed at a higher level. Kodus detected 89% of the cases, and Cursor achieved its best performance in this category, finding 67% of the bugs.

Overall, Kodus was the most consistent tool across all three categories (critical, high, and medium), identifying 79% of the issues, while the others fluctuated more depending on the type of problem.

Don’t take our word for it. Try Kody on your next PR.

Spin it up in under 2 minutes—cloud or self-hosted, no credit card.