AI smart contract audit tools struggle to find real bugs

Just days after AI was celebrated for finding a multi-million-dollar bug in a crypto lending protocol, human Web3 auditors tested three AI-powered smart contract audit tools and found that claims of their “game-changing” impact on the industry may be premature.
Current web-based AI audit tools can reveal real issues but still suffer from false positives, duplication, and blind spots around economic and design reasoning, the authors of the test, Lyuboslav Lyubenov and Radoslav Radev, concluded.
They tested three tools, AlmanaxAI, AuditAgent, and SavantChat, against judge-adjudicated ground truth from three public Sherlock contests involving protocols such as Yearn yBOLD, Crestal Network, and CAP Protocol. After measuring the precision, recall, and quality of the findings, they concluded that none of the tools achieved both high precision and high recall across all contests.
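For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them: precision is the share of a tool's reported findings that match a confirmed issue, and recall is the share of confirmed issues that the tool reported. The finding IDs and the function name are hypothetical, chosen only for illustration. They are not taken from the test, and the authors' exact matching rules may differ.

```python
def precision_recall(reported, ground_truth):
    """Return (precision, recall) for reported finding IDs vs. confirmed ones."""
    reported = set(reported)
    ground_truth = set(ground_truth)
    # A true positive is a reported finding that matches a confirmed issue.
    true_positives = len(reported & ground_truth)
    precision = true_positives / len(reported) if reported else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Hypothetical data: four judge-confirmed issues, six tool reports.
ground_truth = {"H-1", "H-2", "M-1", "M-2"}
reported = {"H-1", "M-2", "X-1", "X-2", "X-3", "X-4"}

precision, recall = precision_recall(reported, ground_truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.33 recall=0.50
```

In this made-up case the tool finds half of the confirmed issues, but two thirds of its reports are noise that a human would still need to triage.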
The test showed that the three tools primarily found template issues, such as access control, reentrancy patterns, and basic math errors, but couldn’t reliably discover business logic flaws, cross-contract integration issues, or more complex economic vulnerabilities.
"AuditAgent provided the best recall in this pilot, SavantChat provided high-quality PoCs for one contest, and AlmanaxAI provided limited coverage. These tools are far from actually discovering significant, novel bugs in production systems," Lyubenov and Radev said.
What’s more, they excluded seven AI tools, among them LISA, Bughunter.lve, and Finite Monkey, for reasons that included failing to produce actionable vulnerabilities, generating generic assistant-like text, repeatedly failing to run projects, or being unstable.
The testers also found that AI-powered tools have different trade-offs. For example, higher recall typically incurs a higher false-positive volume and greater triage costs.
Meanwhile, among the common failures and blind spots, they noted that economic and accounting reasoning remains a recurring weakness.
Other issues include template over-reporting creating many false positives, verbose duplication inflating function point counts, and operational limitations – such as upload limits, credit systems, invitation requirements, and instability – "significantly hindering real-world use."