P06 · Code AI · tree-sitter · GitHub Action

Real review comments at the right line numbers. Five analyzers in parallel; one synthesizer ranks by severity.

A code-review agent benchmarked on SWE-bench Lite (300 real GitHub issues). The PR diff + tree-sitter AST + linked issue + git history flow into five parallel analyzers (bug detection, style, test gaps, security, performance). Comments are synthesized, ranked by severity, and posted inline. Bug detection F1 0.75 vs ruff's 0.30 baseline. Ships as a GitHub Action on the marketplace.

Status: Planned · README only · phase 3 · weeks 16–18
Benchmark: SWE-bench Lite · 300 issues · django, sympy, scikit-learn...
Publication: GitHub Marketplace · drop-in workflow
Target metric: Bug F1 ≥ 0.70 · FP rate ≤ 30% · vs ruff alone 0.30 / 6%
01 · The problem

"Looks good to me" is the bottleneck.

Engineering teams ship in PR-sized chunks. Senior review time is the throughput limit. Linters catch 20% of issues but miss every semantic bug. A naive Claude pass over the diff finds bugs but generates 58% false positives. The right answer is a structured pipeline.

Why single-pass misses

Without the AST, the model can't reason about scope or imports.

A naive review prompt sees only the diff hunk. It doesn't know what's imported in the file, what the surrounding code does, whether the function is called from tests, or whether a similar pattern exists elsewhere in the repo.

Result: bugs that depend on context (missing imports, breaking signature changes, callers in other files) get missed, while cosmetic issues (variable names, style) get over-flagged. A 58% false-positive rate is unusable in production.

What the pipeline adds

Context first. Five analyzers. Severity-ranked synthesis.

Context Builder assembles tree-sitter AST of changed files + their importers/importees + linked issue + git blame for the changed lines.
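
A minimal sketch of that step, assuming py-tree-sitter 0.23 with the pinned tree-sitter-python grammar; extract_imports is an illustrative helper, not the shipped module:

```python
# Sketch: parse a changed file and pull its imports so downstream
# analyzers can reason about scope. Assumes py-tree-sitter 0.23.x and
# the tree-sitter-python wheel; extract_imports is illustrative.
import tree_sitter_python as tspython
from tree_sitter import Language, Parser

PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)

def extract_imports(source: bytes) -> list[str]:
    """Return the top-level import statements of a Python file."""
    tree = parser.parse(source)
    return [
        node.text.decode()
        for node in tree.root_node.children
        if node.type in ("import_statement", "import_from_statement")
    ]

print(extract_imports(b"import os\nfrom django.db import models\n"))
# ['import os', 'from django.db import models']
```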

Five analyzers run in parallel: bug detector (Claude + static analysis), style checker (ruff + Claude), test gap analyzer (Claude over coverage delta), security scanner (semgrep + Claude), performance reviewer (Claude over hot paths).
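
A hedged sketch of the fan-out in LangGraph 0.2.x, with two of the five analyzers and placeholder node bodies; the operator.add reducer is what merges comment lists across parallel branches:

```python
# Sketch of the parallel fan-out / fan-in. Node bodies are placeholders
# for the real Claude / ruff / semgrep calls.
import operator
from typing import Annotated, TypedDict
from langgraph.graph import StateGraph, START, END

class ReviewState(TypedDict):
    diff: str
    comments: Annotated[list[dict], operator.add]  # merged across parallel branches
    final: list[dict]

def bug_detector(state: ReviewState) -> dict:
    return {"comments": [{"analyzer": "bug", "severity": "high", "body": "..."}]}

def style_checker(state: ReviewState) -> dict:
    return {"comments": [{"analyzer": "style", "severity": "low", "body": "..."}]}

def synthesize(state: ReviewState) -> dict:
    # Dedupe + severity ranking lives here; see the sketch below.
    return {"final": state["comments"]}

g = StateGraph(ReviewState)
g.add_node("synthesize", synthesize)
for name, fn in [("bug", bug_detector), ("style", style_checker)]:
    g.add_node(name, fn)
    g.add_edge(START, name)         # fan out: both run in the same superstep
    g.add_edge(name, "synthesize")  # fan in: synthesizer waits for all branches
g.add_edge("synthesize", END)

app = g.compile()
print(app.invoke({"diff": "..."})["final"])
```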

Synthesizer dedupes, ranks by severity, and emits the top-N inline comments. False positive rate drops to 29%; bug F1 climbs to 0.75.
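
The synthesizer itself reduces to a dedupe-and-rank pass. A sketch, assuming the low / med / high / blocker scale from the roadmap and an assumed (path, line, text-prefix) near-duplicate key:

```python
# Sketch of the dedupe + severity ranking. The duplicate key and the
# top_n default are assumptions, not the shipped heuristics.
SEVERITY = {"low": 0, "med": 1, "high": 2, "blocker": 3}

def synthesize(comments: list[dict], top_n: int = 10) -> list[dict]:
    seen: set[tuple] = set()
    unique = []
    for c in comments:
        key = (c["path"], c["line"], c["body"][:80])  # near-duplicate key
        if key not in seen:
            seen.add(key)
            unique.append(c)
    unique.sort(key=lambda c: SEVERITY[c["severity"]], reverse=True)
    return unique[:top_n]
```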

02 · System diagram

Context → five analyzers → synthesis → top-N inline.

// Code review pipeline · LangGraph parallel fan-out
PR webhook / URL → PyGithub (diff + meta)
→ Context Builder: tree-sitter AST · import neighbors · git blame · linked issue
→ 5 analyzers, parallel:
   🐛 Bug Detector (Claude + static)
   ✎ Style Checker (ruff / eslint)
   ✓ Test Gaps (coverage delta)
   ⚿ Security (semgrep + Claude)
   ⚡ Performance (hot paths)
→ Comment Synthesizer + Priority Filter: dedupe · rank by severity · top-N
→ GitHub PR: inline comments + summary · JSON / API: programmatic consumers
Audit: Postgres history · LangSmith traces
03 · Demo 1 of 2 · Install & benchmark

Static toolchain + 300 PR reviews + Marketplace release.

Install tree-sitter, ruff, mypy, semgrep, PyGithub. Pull SWE-bench Lite. Run the 5-analyzer pipeline over all 300 issues and compare against the ruff-only and single-pass Claude baselines. Score comment quality with an LLM-as-judge. Publish v1.0.0 to GitHub Marketplace.
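
The scoring harness is a plain F1 loop. A sketch: the instance columns are real SWE-bench Lite fields, while review_pr and gold_bug_lines are hypothetical stand-ins for the pipeline entry point and gold-label extraction:

```python
# Sketch of the benchmark harness. The instance columns (repo,
# base_commit, patch) are real SWE-bench Lite fields; review_pr and
# gold_bug_lines are hypothetical stand-ins.
from datasets import load_dataset

def review_pr(repo: str, base_commit: str) -> set[tuple[str, int]]:
    """Hypothetical: run the 5-analyzer pipeline, return flagged (path, line)."""
    raise NotImplementedError

def gold_bug_lines(patch: str) -> set[tuple[str, int]]:
    """Hypothetical: (path, line) pairs touched by the gold fix."""
    raise NotImplementedError

ds = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")  # 300 instances

tp = fp = fn = 0
for inst in ds:
    flagged = review_pr(inst["repo"], inst["base_commit"])
    gold = gold_bug_lines(inst["patch"])
    tp += len(flagged & gold)
    fp += len(flagged - gold)
    fn += len(gold - flagged)

precision, recall = tp / (tp + fp), tp / (tp + fn)
print(f"bug F1: {2 * precision * recall / (precision + recall):.2f}")
```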

Demo 01
Install · benchmark · publish
6 steps · 62s · tree-sitter + semgrep + GitHub Marketplace
04 · Demo 2 of 2 · Live review

A real PR walkthrough — 6 comments materialize across 5 analyzers.

A Django manager change introduces an N+1 fix but breaks the return type, misses imports, adds an unsafe mass-deactivation, and lacks tests. Watch each analyzer fire and emit comments at the correct line numbers, with severity badges and actionable suggestions. Verdict: CHANGES REQUESTED.
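
The posting step, sketched with PyGithub 2.x: the comment dicts use GitHub's REST review fields (path / line / side), which PyGithub forwards verbatim; the file path and line number here are invented for illustration:

```python
# Sketch: post the synthesized comments as a single review. The token,
# file path, and line number are placeholders; the PR is the demo's.
import os
from github import Auth, Github

gh = Github(auth=Auth.Token(os.environ["GITHUB_TOKEN"]))
pr = gh.get_repo("django/django").get_pull(18241)

pr.create_review(
    body="Automated review: 1 critical · 2 high · 1 medium · 2 low.",
    event="REQUEST_CHANGES",
    comments=[
        {
            "path": "django/db/models/manager.py",  # hypothetical file
            "line": 42,                              # hypothetical line
            "side": "RIGHT",
            "body": "[critical · bug] Return type changed from QuerySet to "
                    "list; callers relying on lazy evaluation will break.",
        },
    ],
)
```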

Demo 02
PR django/django #18241 · 6 inline comments
1 critical · 2 high · 1 medium · 2 low — across 5 analyzers
05 · Stack

Static analysis under, LLM over.

Stack — pinned

Code understanding
tree-sitter 0.23.2 · tree-sitter-python · tree-sitter-typescript · tree-sitter-go
Static analyzers
ruff 0.6.9 · mypy 1.11.2 · semgrep 1.91.0 · eslint
LLM & orchestration
Claude Sonnet 4.5 · LangGraph 0.2.45 · PyGithub 2.4.0
Frontend & release
Next.js 14 · react-diff-viewer · GitHub Actions · gh release

Benchmark vs alternatives

+0.45 · Bug detection F1 over ruff-only on SWE-bench Lite (0.30 → 0.75). Ruff has high precision (0.94) but recalls only 18% of real bugs.
−29pp · False positive rate vs single-pass Claude (58% → 29%). Most of the win comes from tree-sitter context: the analyzer knows what's imported.
$0.12 · Cost per PR. Worth it on every PR a senior would otherwise review; breaks even at ~50 PRs/month against a senior costing $10K/month.
4.4 · LLM-as-judge correctness score (out of 5) on 100 sampled comments. Sub-judge breakdown: 4.2 actionable, 4.5 specific, 4.4 correct.
06 · Roadmap to v1.0.0

Ten checkpoints.

  1. PyGithub integration for PR read + inline comments (src/github/)
  2. tree-sitter diff parser for Python (extensible to JS/TS/Go) in src/parser/diff_parser.py
  3. Static analysis wrappers — ruff, mypy, semgrep — in src/analyzers/
  4. LangGraph parallel analyzer pipeline (src/agents/orchestrator.py) — bug · security · style · test-gap
  5. Comment synthesizer with severity scoring (low / med / high / blocker) in src/synthesizer.py
  6. Eval on SWE-bench Lite sample (20 real issues) — bug detection F1 in data/eval/
  7. LLM-as-judge over 100 generated comments scoring actionability + specificity + correctness (see the sketch after this list)
  8. Diff-viewer demo in /projects/06-code-review.html
  9. Gallery of reviews on real public PRs (Django, pytest, sympy, scikit-learn, flask) — 5 representative samples committed
  10. GitHub Action published at .github/workflows/code-review.yml (self-hosted-ready)
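
For checkpoint 07, a hedged sketch of a single judge call via the anthropic SDK; the rubric wording, JSON schema, and model alias are assumptions:

```python
# Sketch of one LLM-as-judge call. The rubric and score schema are
# assumptions; the real eval averages three sub-judges over 100 comments.
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

RUBRIC = (
    "Rate this code-review comment from 1-5 on three axes "
    "(actionable, specific, correct). Reply with JSON only: "
    '{"actionable": n, "specific": n, "correct": n}'
)

def judge(comment: str, diff_hunk: str) -> dict:
    msg = client.messages.create(
        model="claude-sonnet-4-5",  # alias; pin a dated snapshot in CI
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"{RUBRIC}\n\nDiff:\n{diff_hunk}\n\nComment:\n{comment}",
        }],
    )
    return json.loads(msg.content[0].text)
```
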
Next project →

P07 · AI Safety & Red Teaming Framework

OWASP LLM Top 10 attack suite · guardrails · auditable security reports