
Audit tournament — making AI models compete to find real bugs

June 16, 2026

The idea

I ship apps solo with a mixed AI agent fleet: one reasoning model for hard calls, cheaper workers for parallel grunt work, and me as the final reviewer. That setup raises a harder question: how do you trust an AI to audit code it could also be writing? My answer is an audit tournament — several models, each in its own isolated git worktree, reviewing the same code blind, then a separate adjudication pass that decides what is real and what is confidently imagined.

The tournament is not a leaderboard of which model sounds smartest. It is a filter. The product is not the raw findings; it is the smaller set of findings that survive disagreement.

The setup

I took a public blind forensic-audit harness and forked it to my own Forgejo instance so I could change the parts that matter for my workflow. It runs three workers in parallel: Claude Opus, Claude Sonnet, and Kimi. Each worker gets an identical prompt and an identical snapshot of the target repo, but each runs in its own worktree so they cannot see each other's notes, guesses, or framing. After they finish, an adjudicator — Opus, because it is strongest under ambiguity — reads all three reports and classifies every finding as accepted, rejected, or needing evidence.
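A minimal sketch of the isolation step, assuming plain `git worktree` is what creates each worker's checkout. The worker names come from the setup above; the paths, the commit id, and both function names are hypothetical, and this is not the harness's actual code.

```python
import subprocess
from pathlib import Path

# Worker names are from the post; the directory layout below is an assumption.
WORKERS = ["opus", "sonnet", "kimi"]


def worktree_command(repo: Path, root: Path, worker: str, commit: str) -> list[str]:
    """Build the git command that gives one worker its own detached checkout."""
    return [
        "git", "-C", str(repo),
        "worktree", "add", "--detach",
        str(root / worker), commit,
    ]


def create_worktrees(repo: Path, root: Path, commit: str) -> dict[str, Path]:
    """Create one worktree per worker, all pinned to the same commit."""
    paths = {}
    for worker in WORKERS:
        # check=True aborts the run if any checkout fails to materialize.
        subprocess.run(worktree_command(repo, root, worker, commit), check=True)
        paths[worker] = root / worker
    return paths


# Same snapshot for every worker, different directory for each.
commands = [
    worktree_command(Path("target"), Path("/tmp/audit"), w, "abc1234")
    for w in WORKERS
]
```

Pinning every worktree to one commit keeps the snapshot identical. Keeping the worktree root outside the repository under review is one way to avoid a worker auditing its neighbors' checkouts.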

The original harness ran workers serially. I made them concurrent, so a run now takes roughly as long as its slowest worker rather than the sum of all three. The larger gain came from a second change: I cut out a middleman layer and pointed opencode directly at a local OpenAI-compatible proxy. The same task that once threatened to time out at 600 seconds now finishes in about 13 seconds for Kimi. The bottleneck was never the model. It was the scaffolding around it.
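The serial-to-concurrent change can be sketched with a thread pool. This is an illustration only: `run_all` and `fake_worker` are hypothetical names, and the sleep stands in for the real model call, which would block on a subprocess or an HTTP request.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_all(workers, run_one):
    """Run run_one(worker) for every worker at once; return {worker: report}."""
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        futures = {w: pool.submit(run_one, w) for w in workers}
        # result() re-raises any exception a worker hit, so failures are not silent.
        return {w: f.result() for w, f in futures.items()}


def fake_worker(name):
    # Stand-in for the real model call (assumption: it is I/O-bound).
    time.sleep(0.2)
    return f"report from {name}"


start = time.monotonic()
reports = run_all(["opus", "sonnet", "kimi"], fake_worker)
elapsed = time.monotonic() - start
```

Threads are enough here because each worker spends its time waiting on I/O. The three 0.2-second stand-ins finish in about the time of one, not the 0.6 seconds a serial loop would need.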

What the numbers actually say

When I aimed the tournament at its own code, the results were instructive: nine real findings accepted, zero real findings missed, and four hallucinated bugs caught before they reached me. That ratio, nine real to four imagined, is the honest result. The models did find genuine issues (a subprocess cleanup leak, prompts leaking into process listings, worktree directory handling that could break under self-audit), but they also produced confident false positives that contradicted each other.

That contradiction is the signal. If two models independently flag the same pattern, it is probably real. If one model reports a subtle bug the others call fine, it needs a human or a verification command. If all three disagree on severity, that usually means the code is ambiguous, not that one model is smarter than the others.
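Those three rules can be written down as a small triage function. This is a sketch under one large assumption: that findings from different models can be matched by a normalized key, which real reports will not hand over for free. The keys, severities, and the `triage` name are invented for illustration and are not results from the run above.

```python
from collections import defaultdict


def triage(reports):
    """Label each finding by how many models raised it and whether they agree.

    reports maps a model name to a list of (key, severity) pairs, where key is
    a normalized identifier for the finding (an assumption of this sketch).
    """
    votes = defaultdict(dict)
    for model, findings in reports.items():
        for key, severity in findings:
            votes[key][model] = severity

    labels = {}
    for key, by_model in votes.items():
        severities = set(by_model.values())
        if len(by_model) == 1:
            # Only one model raised it: a human or a command has to check.
            labels[key] = "needs verification"
        elif len(by_model) == len(reports) and len(severities) == len(reports):
            # Everyone flags it, nobody agrees on severity.
            labels[key] = "ambiguous code"
        else:
            # At least two models flagged it independently.
            labels[key] = "probably real"
    return labels


# Invented example data.
reports = {
    "opus": [("runner.py:cleanup", "high"), ("cli.py:argv", "low")],
    "sonnet": [("runner.py:cleanup", "high"), ("cli.py:argv", "medium"),
               ("util.py:style", "high")],
    "kimi": [("cli.py:argv", "high")],
}
labels = triage(reports)
```

The labels are a sorting aid for the adjudication pass, not a verdict: "probably real" still means someone has to confirm it.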

Why adjudication is the actual product

The hardest part of AI-assisted code review is not generating findings. It is suppressing the noise. Each model has its own hallucination profile. Opus is cautious and tends to over-qualify; Sonnet is faster but occasionally frames a style issue as a security issue; Kimi is direct and can be confidently wrong about context it does not have. Running them against each other does not eliminate these biases — it makes them visible.

I added an evidence execution layer that optionally runs the adjudicator's proposed verification command for each accepted finding. It is off by default because running arbitrary shell commands on a hunch is dangerous, but when enabled it turns "looks like a bug" into "this command proves the bug." That separation of proposing from proving is what keeps the loop honest.
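A minimal sketch of an opt-in execution gate, assuming the adjudicator proposes each verification command as a single string. `run_evidence` and its parameters are hypothetical, and the sketch does no sandboxing of its own; it only shows the off-by-default switch and the split between proposing and running.

```python
import shlex
import subprocess
import sys


def run_evidence(command, enabled=False, timeout=30):
    """Run a proposed verification command only when explicitly enabled.

    Returns None when disabled, else (exit_code, combined_output).
    Raises subprocess.TimeoutExpired if the command runs past the timeout.
    """
    if not enabled:
        return None
    proc = subprocess.run(
        shlex.split(command),  # argument list, so no shell interprets the string
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout + proc.stderr


# Harmless stand-in for a real verification command.
cmd = f"{shlex.quote(sys.executable)} -c 'print(2 + 2)'"
skipped = run_evidence(cmd)                      # default: nothing runs
code, output = run_evidence(cmd, enabled=True)   # explicit opt-in
```

Splitting the string into an argument list avoids shell expansion, but it does not make a destructive command safe. Reading each command before enabling the gate is still the real protection.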

I also added lens-specific workers: instead of three generalists, I can spin up one worker focused on security, one on concurrency, one on error handling. The shared base prompt stays the same, but each lens gets a specialized profile. This reduced the number of shallow findings and made the disagreements more meaningful.
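One way to picture the lens setup is a shared base prompt with a short profile appended per worker. The lens names come from the paragraph above; the prompt wording, `BASE_PROMPT`, `LENSES`, and `build_prompt` are placeholders, not the text the harness uses.

```python
# Placeholder wording; the real base prompt is not shown in the post.
BASE_PROMPT = "Audit this repository. Report each finding with file, line, and evidence."

# One profile per lens; the descriptions are illustrative assumptions.
LENSES = {
    "security": "Focus on injection, secrets exposure, and unsafe subprocess use.",
    "concurrency": "Focus on races, shared state, and cleanup of child processes.",
    "error-handling": "Focus on swallowed exceptions, missing timeouts, and partial failure.",
}


def build_prompt(lens):
    """Return the shared base prompt followed by one lens profile."""
    return f"{BASE_PROMPT}\n\nLens: {lens}\n{LENSES[lens]}"


prompts = {lens: build_prompt(lens) for lens in LENSES}
```

Because every prompt starts with the same base text, differences between reports can be traced to the lens rather than to drift in the instructions.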

Where it still breaks

No substitute for knowing the system. I can only adjudicate what I understand. When a model flags something in a dependency or language feature I have not touched recently, I have to look it up myself or defer the finding.
Verification commands can be destructive. The execution layer is opt-in and sandboxed, but a bad command is still a bad command. I review every suggested command before it runs.
Tests matter. One real finding got deferred because the box I was on had no network, no pip, and no pytest, so I could not run a regression test for it. A finding without a regression test is a finding that will come back.

Practical takeaways

If you want to try this yourself, start smaller than a full tournament. Pick one file you know well, give the same prompt to two different models, and mark the places where they disagree. That disagreement list is more valuable than either report alone.

Use isolated worktrees or throwaway clones. Parallel review only works if the reviewers cannot influence each other. Treat the adjudicator as a filter, not an authority: its job is to organize the disagreement, not to resolve it for you. And always have a human gate at the end who can say, "That looks correct, but it is not what I meant."

Models compete, adjudication decides, a human confirms. The tournament does not replace judgment — it gives judgment something solid to judge.
