A red-team framework applied to the other projects in this portfolio. 500 attacks drawn from HarmBench, JailbreakBench, prompt-injection variants (DAN, instruction override, base64, unicode bypass) plus Claude-generated adversarials. Categorized against OWASP LLM Top 10. A guardrails layer (Presidio · toxic-bert · DeBERTa prompt-injection detector · refusal injection) drops baseline attack success rate from 41% to 4%. Auditable PDF reports for P01 and P03.
Banking, insurance, healthcare, government — every one of them requires you to prove the system doesn't leak PII, doesn't comply with malicious intent, and doesn't hallucinate facts the customer will act on. You can't fix what you haven't measured.
A handful of one-off prompts catch the obvious vulnerabilities. They don't catch the long tail: base64-encoded injection, roleplay jailbreaks, unicode bypass, training-data extraction via "repeat the word forever", cross-session memory queries, instruction-overrides hidden in translation requests.
Worse: without a baseline, you can't quantify the value of your guardrails. The CISO asks "how much risk did we reduce?" and the answer is a shrug.
Attack Generator pulls from public benchmarks (HarmBench, JailbreakBench), expands each base attack into 5 encoding variants (plaintext, base64, rot13, unicode-confusable, roleplay-wrap), and Claude generates new adversarials when coverage is thin.
Guardrails Layer: input-side prompt-injection classifier (DeBERTa) + jailbreak detector; output-side PII redaction (Presidio), toxicity filter (toxic-bert), HTML/URL-scheme sanitizer; policy-side refusal injection for high-risk categories.
Quantified report: per-OWASP-category attack success rate before / after, false-positive rate on 1,000 legitimate queries, latency overhead of the guardrails stack, PDF deliverable for the audit.
Walks through dataset downloads, baseline red-team run against P01 (no guardrails), per-OWASP-category breakdown of attack success, guardrails install at middleware layer, re-run with guardrails on, side-by-side reduction table, PDF report generation.
Top counter tracks attempts vs verdicts. Left panel shows per-OWASP attack success rate before (red) and after guardrails (green) — the reduction shown as a delta in pp. Right panel is the live feed of each attack with payload, verdict, and which guard caught it. 0 breaches across the run.