Why Relying Only on Claude for Code Security Review Fails Growing Teams

The first time you see an AI comment on a pull request, the speed of the feedback loop stands out. A full review appears in seconds, pointing out potential issues before a human reviewer has even opened the file. The appeal of using a tool like Claude for code security review, a critical part of security in the SDLC, is clear: catch problems early and reduce the team’s manual workload.

In practice, however, this speed often creates a false sense of security. It works well at first, but starts to break down as the team grows and systems become more complex.

The problem is that these tools operate with a critical blind spot. They see the code, but they do not see the system. They can analyze syntax, but they do not understand intent, history, or the architectural contracts that keep a complex application working.

The Critical Blind Spot

A good security review depends on context that is not in the diff. That context lives outside the isolated code. By nature, an LLM does not have access to it. It analyzes only a slice of the code, in isolation, and misses the broader view where, in practice, the most serious vulnerabilities usually live.

Architectural and data flow risks that go unnoticed

Many critical security flaws are not in the code itself, but in how data flows between components. An LLM does not know the system’s trust boundaries. It does not know, for example, that UserService is internal only, or that any data coming from a publicly exposed APIGateway must be revalidated, regardless of prior validations.

Consider an authorization flaw that slips through. A developer adds a new endpoint that correctly checks whether the user has the admin role. Looking only at the diff, the code looks correct.

But a senior engineer knows an implicit system rule: an admin from Tenant A should never access data from Tenant B. The code does not check the tenant ID.
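A minimal sketch of that shape of flaw, with hypothetical User and Report types standing in for the real models:

    # Illustrative only: the role check is present, the tenant check is not.
    from dataclasses import dataclass, field

    @dataclass
    class User:
        id: str
        tenant_id: str
        roles: set = field(default_factory=set)

    @dataclass
    class Report:
        id: str
        tenant_id: str
        data: dict = field(default_factory=dict)

    def get_report(user: User, report: Report) -> dict:
        # This is all the diff shows, and in isolation it looks correct.
        if "admin" not in user.roles:
            raise PermissionError("admin role required")
        # Missing: the implicit rule that admins never cross tenant boundaries.
        # if report.tenant_id != user.tenant_id:
        #     raise PermissionError("cross-tenant access denied")
        return report.data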

Claude will not flag this because it does not understand your multi-tenancy model or the internal rules around data isolation and sensitivity. It sees a valid role check and moves on, letting a potential cross-tenant data leak slip through.

Ignoring repository history and the evolution of threats

A codebase is a living document. The history of commits, pull requests, and incident reports contains valuable security context. A human reviewer may remember a past incident involving incomplete input validation on a specific data model and will be extra alert to similar changes. An LLM has no memory of this.

For example, a team may have fixed a denial of service vulnerability by adding a hard size limit to a free text field. Six months later, a new developer, working on another feature, adds a similar field but forgets the size validation. The code is syntactically correct, but it reintroduces a known vulnerability pattern. An experienced reviewer spots this immediately. An LLM sees only the new code, with no access to lessons learned in the past.
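A sketch of what that regression can look like, with hypothetical field names and limits:

    # Illustrative only: both functions are syntactically fine; only one is safe.
    MAX_BIO_LENGTH = 2_000  # limit added after the earlier denial-of-service fix

    def set_bio(user: dict, bio: str) -> None:
        if len(bio) > MAX_BIO_LENGTH:
            raise ValueError("bio too long")
        user["bio"] = bio

    def set_signature(user: dict, signature: str) -> None:
        # Added six months later: no size limit, same vulnerability pattern as before.
        user["signature"] = signature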

Inability to Learn Team-Specific Security Policies

Every engineering team develops its own set of security conventions and policies. They are often domain-specific and not always explicit in the code.

  • Your company policy might prohibit storing any form of PII in Redis.
  • Or you may have a rule to use a specific internal library for all cryptographic operations, because standard libraries were misused in the past.
  • Your team may have decided to use UUIDv7 for all new primary keys for performance reasons.

An LLM has no knowledge of these internal standards.
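For instance, assuming the hypothetical Redis policy above, a change like this sails through because nothing in the diff marks the cached values as PII:

    # Illustrative only: caches an email and full name in Redis, which violates the
    # (hypothetical) internal policy against storing PII there.
    import json
    import redis

    cache = redis.Redis(host="localhost", port=6379)

    def cache_user_profile(user_id: str, profile: dict) -> None:
        # profile includes "email" and "full_name": PII under the team's policy.
        cache.setex(f"user:{user_id}:profile", 3600, json.dumps(profile))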

It may even suggest a solution that directly violates these rules, creating more work for the reviewer, who now has to fix both the code and the AI’s suggestion. The confident and authoritative tone of an LLM can lead more junior developers to assume its suggestions represent code quality best practices, even when they contradict standards already established by the team.

Scaling Traps: When LLM Limitations Add Up

For a small team working on a monolith, some of these gaps may be manageable. But as the organization takes on the challenge of scaling code review in a growing team, with more engineers, more teams, and more microservices, these limitations add up into systemic problems that automation cannot solve.

The Human Verification Bottleneck

Instead of removing a bottleneck, the AI often creates a new one: reviewing the AI’s own output. With a constant stream of low-impact or irrelevant suggestions, engineers quickly develop alert fatigue and start treating AI comments like linter noise, something easy to ignore.

In practice, every AI-generated comment still requires someone to assess its validity, impact, and context. This slows reviews down and pulls attention away from what actually matters. The cognitive load of filtering AI noise can easily outweigh the benefit of catching a few obvious issues.

Architectural understanding gaps in LLM-based code security reviews

In distributed systems, the most dangerous bugs usually live in the interactions between services. An LLM reviewing a change in a single repository has no visibility into how that change might break an implicit contract with a downstream consumer. It does not notice, for example, that removing a field from a JSON response can cause silent failures in another team’s service that depends on that field.
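A sketch of why that failure tends to be silent, with a hypothetical consumer living in another repository:

    # Illustrative only: the producer drops a field; the consumer degrades quietly.

    # Producer, after a "harmless" cleanup that removed billing_email from the payload.
    def build_user_payload(user: dict) -> dict:
        return {"id": user["id"], "name": user["name"]}

    # Consumer, in a different repository the reviewer of the producer's diff never sees.
    def send_invoice(payload: dict) -> None:
        email = payload.get("billing_email")  # now always None: no exception, no alert
        if email:
            print(f"sending invoice to {email}")
        # Invoices quietly stop going out.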

The same applies to cryptography errors. An LLM can flag obvious problems, like the use of an obsolete algorithm such as DES. But it tends to miss harder-to-detect flaws, like reusing an initialization vector (IV) in a block cipher. Identifying this type of issue requires understanding application state and data flow across multiple requests, which goes far beyond static analysis of a code snippet.
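As a sketch, using AES-GCM's nonce as a stand-in for the IV problem and the widely used cryptography package, the flaw can look perfectly tidy at the snippet level:

    # Illustrative only: AES-GCM with a nonce generated once and then reused.
    # Each call looks fine in isolation; the vulnerability only exists across calls.
    import os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    KEY = AESGCM.generate_key(bit_length=256)
    NONCE = os.urandom(12)  # created at import time and silently shared by every call

    def encrypt(plaintext: bytes) -> bytes:
        # Reusing the same key and nonce breaks AES-GCM's confidentiality and
        # integrity guarantees, but no single snippet shows more than one call.
        return AESGCM(KEY).encrypt(NONCE, plaintext, None)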

Hallucinations

LLMs can be wrong with a lot of confidence. It is not uncommon to see recommendations for security libraries that do not exist, incorrect interpretations of details from a real CVE, or broken code snippets presented as a “fix.”

In security, this is especially dangerous. A developer may accept an explanation that sounds plausible but is wrong, and end up introducing a new vulnerability while trying to fix another one. This false sense of confidence undermines learning and can lead to a worse security outcome than the original issue.

Why human expertise still matters

This does not mean AI tools have no place. The problem is treating them as replacements for human judgment rather than as a complement. Human reviewers provide essential context that machines cannot.

Beyond Syntax: Business Logic and Intent

A senior engineer understands the why behind the code. They connect the proposed change to its business goal and can ask critical questions that an LLM would never ask.

“What happens if a user uploads a file with more than 255 characters in the name?” or “Is this new user permission aligned with the company’s GDPR compliance requirements?”

This kind of reasoning about real world impact is the foundation of a good security review.

Mentorship and Building a Security Culture

Code reviews are one of the main mechanisms for knowledge transfer within a team. When a senior engineer points out a security flaw, they do not just say “this is wrong.” They explain the risk, reference a past decision or an internal document, and use the review as a learning moment.

This process raises security awareness across the entire team and strengthens a culture of shared responsibility. An automated bot comment offers none of that. It just feels like another checklist item to clear.

A Hybrid Review Model

The goal is not to reject new tools, but to be intentional about how they are used. A healthy security posture uses automation to augment human judgment, not to replace it.

Augment, Not Replace: Where LLMs Make Sense

The best use of LLMs in code review is as a first automated pass for a very specific class of problems. For example:

  • Hardcoded secrets and API keys

  • Use of known insecure libraries or functions (such as strcpy in C or pickle in Python)

  • Common patterns indicating SQL injection or XSS

The output should be treated as a suggestion, not a verdict. Final authority still rests with the human reviewer.
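For illustration, this is the kind of context-free pattern a first pass can reliably flag (every value below is a fabricated placeholder):

    # Illustrative only: several obvious, context-free findings in one place.
    import pickle
    import sqlite3

    API_KEY = "sk_live_FAKE_PLACEHOLDER"  # hardcoded secret: easy to flag

    def load_session(blob: bytes):
        return pickle.loads(blob)  # deserializing untrusted input with pickle

    def find_user(conn: sqlite3.Connection, name: str):
        # String-built SQL: a classic injection pattern, detectable without system context.
        return conn.execute(f"SELECT * FROM users WHERE name = '{name}'").fetchall()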

Invest in Context

Getting consistently useful results from an LLM requires significant investment in providing the right context. This includes architectural diagrams, data flow information, and internal team policies, often guided by advanced prompt engineering practices.
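One common approach, sketched here with a hypothetical file name and prompt wording, is to keep team policies in a versioned document and inject them into the review prompt:

    # Illustrative only: prepend internal policies to the diff before requesting a review.
    from pathlib import Path

    def build_review_prompt(diff: str) -> str:
        policies = Path("docs/security-policies.md").read_text()  # hypothetical file
        return (
            "You are reviewing a pull request. Apply these internal policies strictly:\n"
            f"{policies}\n\n"
            "Diff under review:\n"
            f"{diff}\n"
        )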

That context also needs to be kept up to date, which creates an ongoing maintenance burden. Before making an LLM a mandatory step in CI/CD, it is necessary to understand that cost and those limits.

Cultivate a Strong Security Posture to Scale

In the end, a strong security culture depends on human judgment. Automation works well for simple, repetitive, and context-free tasks. This frees more experienced engineers to focus on complex, dependency-heavy risks, where experience really matters. Balancing the efficiency of automation with the judgment of those who know the system is the only way to build a security practice that truly scales.
