A code-review agent benchmarked on SWE-bench Lite (300 real GitHub issues). The PR diff, tree-sitter AST, linked issue, and git history flow into five parallel analyzers (bug detection, style, test gaps, security, performance). Their comments are synthesized, ranked by severity, and posted inline. Bug-detection F1 is 0.75, against 0.30 for a ruff-only baseline. Ships as a GitHub Action on the marketplace.
Engineering teams ship in PR-sized chunks, and senior review time is the throughput limit. Linters catch 20% of issues but miss every semantic bug. A naive Claude pass over the diff finds bugs but has a 58% false-positive rate. The right answer is a structured pipeline.
A naive review prompt sees only the diff hunk. It doesn't know what's imported in the file, what the surrounding code does, whether the function is called from tests, or whether a similar pattern exists elsewhere in the repo.
Result: bugs that depend on context (missing imports, breaking signature changes, callers in other files) get missed, while cosmetic issues (variable names, style) get over-flagged. A 58% false-positive rate is unusable in production.
The Context Builder assembles the tree-sitter AST of each changed file, the files that import it and the files it imports, the linked issue, and git blame for the changed lines.
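The importer/importee part of the context can be sketched as follows. This is a minimal illustration, not the project's implementation: it swaps in Python's standard-library `ast` module for tree-sitter so the example runs without extra dependencies, and `ReviewContext`, `imported_modules`, `build_context`, and the in-memory `repo` dict are all hypothetical names.

```python
import ast
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ReviewContext:
    changed_file: str
    importees: Set[str] = field(default_factory=set)  # modules the changed file imports
    importers: Set[str] = field(default_factory=set)  # files that import the changed module


def imported_modules(source: str) -> Set[str]:
    """Collect the module names imported anywhere in a Python source string."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module)  # relative "from . import x" has no module name
    return modules


def build_context(changed_file: str, repo: Dict[str, str]) -> ReviewContext:
    """Assumption: repo maps file path -> source text, and a path such as
    'app/managers.py' corresponds to the module name 'app.managers'."""
    stem = changed_file[:-3] if changed_file.endswith(".py") else changed_file
    module_name = stem.replace("/", ".")
    ctx = ReviewContext(changed_file, importees=imported_modules(repo[changed_file]))
    for path, source in repo.items():
        if path != changed_file and module_name in imported_modules(source):
            ctx.importers.add(path)
    return ctx


repo = {
    "app/managers.py": "import os\nfrom django.db import models\n",
    "app/views.py": "from app.managers import UserManager\n",
    "app/utils.py": "import json\n",
}
ctx = build_context("app/managers.py", repo)
print(sorted(ctx.importees), sorted(ctx.importers))
```

The linked issue and git blame would be attached to the same context object; they are left out here because fetching them needs network or repository access.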
Five analyzers run in parallel: bug detector (Claude + static analysis), style checker (ruff + Claude), test gap analyzer (Claude over coverage delta), security scanner (semgrep + Claude), performance reviewer (Claude over hot paths).
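One way to run the five analyzers concurrently is `asyncio.gather`. The sketch below assumes each analyzer is an async function that takes a context dict and returns a list of comments; the `Comment` fields, the stub analyzers, and `run_analyzers` are hypothetical placeholders for the real Claude and tool calls.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict, List


@dataclass(frozen=True)
class Comment:
    analyzer: str
    path: str
    line: int
    severity: str  # assumed scale: "critical", "major", "minor"
    message: str


Analyzer = Callable[[dict], Awaitable[List[Comment]]]


def make_stub(name: str, severity: str) -> Analyzer:
    """Build a placeholder analyzer; a real one would call Claude and/or a tool."""
    async def analyze(context: dict) -> List[Comment]:
        await asyncio.sleep(0)  # stands in for network or subprocess I/O
        return [Comment(name, context["path"], 1, severity, f"{name} finding")]
    return analyze


ANALYZERS: Dict[str, Analyzer] = {
    "bug": make_stub("bug", "critical"),
    "style": make_stub("style", "minor"),
    "test_gap": make_stub("test_gap", "major"),
    "security": make_stub("security", "critical"),
    "performance": make_stub("performance", "major"),
}


async def run_analyzers(context: dict, analyzers: Dict[str, Analyzer] = ANALYZERS) -> List[Comment]:
    """Run every analyzer concurrently and flatten their comments."""
    results = await asyncio.gather(
        *(fn(context) for fn in analyzers.values()),
        return_exceptions=True,  # a crashed analyzer is returned, not raised
    )
    comments: List[Comment] = []
    for result in results:
        if isinstance(result, Exception):
            continue  # design choice: skip the failed analyzer, keep the rest
        comments.extend(result)
    return comments


comments = asyncio.run(run_analyzers({"path": "app/managers.py"}))
print([c.analyzer for c in comments])
```

Passing `return_exceptions=True` is a design choice for this sketch: one failing analyzer degrades the review instead of cancelling it.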
The Synthesizer deduplicates the analyzers' comments, ranks them by severity, and emits the top-N as inline comments. The false-positive rate drops from 58% to 29%, and bug-detection F1 climbs to 0.75.
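The dedupe, rank, and top-N steps might look like the sketch below. The source does not say how duplicates are identified, so this example assumes two comments are duplicates when they target the same file and line, and keeps the more severe one. `SEVERITY_RANK`, `Comment`, and `synthesize` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Assumed three-level scale; lower rank sorts first.
SEVERITY_RANK = {"critical": 0, "major": 1, "minor": 2}


@dataclass(frozen=True)
class Comment:
    path: str
    line: int
    severity: str
    message: str


def synthesize(comments: List[Comment], top_n: int = 10) -> List[Comment]:
    """Keep one comment per (path, line), rank by severity, return the top N."""
    best: Dict[Tuple[str, int], Comment] = {}
    for c in comments:
        key = (c.path, c.line)
        kept = best.get(key)
        if kept is None or SEVERITY_RANK[c.severity] < SEVERITY_RANK[kept.severity]:
            best[key] = c
    ranked = sorted(
        best.values(),
        key=lambda c: (SEVERITY_RANK[c.severity], c.path, c.line),
    )
    return ranked[:top_n]


raw = [
    Comment("a.py", 10, "minor", "rename variable"),
    Comment("a.py", 10, "critical", "return type changed"),
    Comment("a.py", 3, "major", "missing import"),
    Comment("b.py", 1, "minor", "style nit"),
]
for c in synthesize(raw, top_n=2):
    print(c.severity, c.path, c.line, c.message)
```

Capping the output at `top_n` is what keeps low-severity style comments from crowding out the findings a reviewer needs to see first.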
Install tree-sitter, ruff, mypy, semgrep, and PyGithub. Pull SWE-bench Lite. Run the five-analyzer pipeline over all 300 issues. Compare three configurations: ruff-only, single-pass Claude, and the full pipeline. Score comment quality with an LLM-as-judge. Publish v1.0.0 to GitHub Marketplace.
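Comparing the three configurations comes down to scoring each one's flagged locations against the known bug locations. The sketch below assumes findings and ground truth are both sets of `(path, line)` pairs and that a finding counts only on an exact match; the function names and the toy data are illustrative, not benchmark results.

```python
from typing import Set, Tuple

Location = Tuple[str, int]


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard definitions; each ratio is 0.0 when its denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def score(predicted: Set[Location], gold: Set[Location]) -> Tuple[float, float, float]:
    """Score one configuration's findings against the ground-truth locations."""
    tp = len(predicted & gold)   # flagged and real
    fp = len(predicted - gold)   # flagged but not real
    fn = len(gold - predicted)   # real but not flagged
    return precision_recall_f1(tp, fp, fn)


# Toy data for illustration only.
gold = {("a.py", 3), ("a.py", 10), ("b.py", 7), ("c.py", 1)}
predicted = {("a.py", 3), ("a.py", 10), ("b.py", 7), ("b.py", 99)}
precision, recall, f1 = score(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

If the false-positive rate is read as FP / (TP + FP), it equals one minus precision, so the same counts yield both headline metrics.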
A Django manager change fixes an N+1 query but breaks the return type, is missing imports, adds an unsafe mass deactivation, and lacks tests. Watch each analyzer fire and emit comments at the correct line numbers, with severity badges and actionable suggestions. Verdict: CHANGES REQUESTED.
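The final step, turning ranked comments into a review with a verdict, could be sketched as below. The payload keys (`event`, `body`, `comments`, and per-comment `path`, `line`, `side`, `body`) follow GitHub's REST API for creating a pull request review. The badge strings, the input dict shape, and the rule that any critical or major finding requests changes are assumptions made for this example.

```python
from typing import Dict, List

# Hypothetical badge text; the real Action may render these differently.
BADGES = {"critical": "[CRITICAL]", "major": "[MAJOR]", "minor": "[MINOR]"}


def build_review(comments: List[Dict]) -> Dict:
    """comments: dicts with path, line, severity, message, suggestion keys."""
    # Assumed verdict rule: any critical or major finding blocks the PR.
    blocking = any(c["severity"] in ("critical", "major") for c in comments)
    return {
        "event": "REQUEST_CHANGES" if blocking else "COMMENT",
        "body": f"{len(comments)} finding(s)",
        "comments": [
            {
                "path": c["path"],
                "line": c["line"],
                "side": "RIGHT",  # comment on the new version of the file
                "body": f"{BADGES[c['severity']]} {c['message']}\n\n"
                        f"Suggestion: {c['suggestion']}",
            }
            for c in comments
        ],
    }


findings = [
    {"path": "app/managers.py", "line": 12, "severity": "critical",
     "message": "Return type changed from QuerySet to list.",
     "suggestion": "Return the QuerySet so callers can keep chaining."},
    {"path": "app/managers.py", "line": 30, "severity": "minor",
     "message": "Unused local variable.",
     "suggestion": "Remove it."},
]
review = build_review(findings)
print(review["event"], len(review["comments"]))
```

With PyGithub, a payload like this would typically be passed to `PullRequest.create_review(body=..., event=..., comments=...)`, assuming the installed version accepts line-based comment dicts.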