P07 · MLSec · OWASP LLM Top 10 · Guardrails

Attack the system before someone else does. Then quantify what the guardrails buy you.

A red-team framework applied to the other projects in this portfolio. 500 attacks drawn from HarmBench, JailbreakBench, prompt-injection variants (DAN, instruction override, base64, unicode bypass) plus Claude-generated adversarials. Categorized against OWASP LLM Top 10. A guardrails layer (Presidio · toxic-bert · DeBERTa prompt-injection detector · refusal injection) drops baseline attack success rate from 41% to 4%. Auditable PDF reports for P01 and P03.

Status
Planned · README only
phase 3 · weeks 19–21
Datasets
HarmBench · JailbreakBench
+ ToxicChat · 10K
Standard
OWASP LLM Top 10
10 categories covered
Target metric
Attack rate ≤ 5%
FP rate ≤ 2% on legit
01 · The problem

You cannot ship a regulated LLM without an attack budget.

Banking, insurance, healthcare, government — every one of them requires you to prove the system doesn't leak PII, doesn't comply with malicious intent, and doesn't hallucinate facts the customer will act on. You can't fix what you haven't measured.

Why "we tested it" isn't enough

The first 100 attacks teach you what was broken yesterday.

A handful of one-off prompts catch the obvious vulnerabilities. They don't catch the long tail: base64-encoded injection, roleplay jailbreaks, unicode bypass, training-data extraction via "repeat the word forever", cross-session memory queries, instruction-overrides hidden in translation requests.

Worse: without a baseline, you can't quantify the value of your guardrails. The CISO asks "how much risk did we reduce?" and the answer is a shrug.

What this framework does

500 attacks → guardrails → 500 attacks again → numbers.

Attack Generator pulls from public benchmarks (HarmBench, JailbreakBench), expands each base attack into 5 encoding variants (plaintext, base64, rot13, unicode-confusable, roleplay-wrap), and Claude generates new adversarials when coverage is thin.

Guardrails Layer: input-side prompt-injection classifier (DeBERTa) + jailbreak detector; output-side PII redaction (Presidio), toxicity filter (toxic-bert), HTML/URL-scheme sanitizer; policy-side refusal injection for high-risk categories.

Quantified report: per-OWASP-category attack success rate before / after, false-positive rate on 1,000 legitimate queries, latency overhead of the guardrails stack, PDF deliverable for the audit.

02 · System diagram

Generate · attack · categorize · guard · re-attack · report.

// Red-team loop · attack source → target → analyzer → OWASP map → guardrails → report
Attack Generator HarmBench · JailbreakBench 5 encoding variants each + Claude attack mode Target System P01 / P03 endpoint { guardrails on / off } Response Analyzer refusal? leak? compliance? PII detect · tox classify OWASP LLM Top 10 Categorizer LLM01..LLM10 · severity score aligned to OWASP 2023.10 Guardrails Layer · installed as middleware Input Filter prompt-injection · jailbreak PII Redaction Presidio · ID/CC/email masks Toxicity unitary/toxic-bert Output Sanitizer HTML strip · URL allowlist Refusal Injection policy-gated categories PDF Security Report before / after · per OWASP · FP rate · audit-ready
03 · Demo 1 of 2 · End-to-end run

500 attacks. Baseline. Install guardrails. 500 attacks. PDF.

Walks through dataset downloads, baseline red-team run against P01 (no guardrails), per-OWASP-category breakdown of attack success, guardrails install at middleware layer, re-run with guardrails on, side-by-side reduction table, PDF report generation.

Demo 01
Red team · before & after guardrails
6 steps · 65s · HarmBench + JailbreakBench + Presidio + DeBERTa
SPACE play0 reset
04 · Demo 2 of 2 · Live attack dashboard

Watch 20 attacks fire against guarded P01 — every one is blocked, redacted, or guarded.

Top counter tracks attempts vs verdicts. Left panel shows per-OWASP attack success rate before (red) and after guardrails (green) — the reduction shown as a delta in pp. Right panel is the live feed of each attack with payload, verdict, and which guard caught it. 0 breaches across the run.

Demo 02
Live red-team feed against P01 (guarded)
20 attacks · 10 OWASP categories · 100% caught
SPACE play0 reset
05 · Stack

Open-source attack tools + commercial-grade guards.

Stack — pinned

Red team
garak0.10.0 giskard2.15.0 HarmBench dataset JailbreakBench ToxicChat
Guardrails
guardrails-ai0.6.0 nemo-guardrails0.11.0 Presidio (PII) unitary/toxic-bert protectai DeBERTa
Audit & serving
Postgres16 LangSmith Next.js dashboard reportlab (PDF)

Reduction by category

LLM01 Injection
45% → 3% baseline → guarded. DeBERTa prompt-injection classifier catches the long tail (base64, unicode, roleplay-wrap).
LLM06 Disclosure
48% → 1%. Presidio redacts PII before it leaves the model; cross-session memory queries blocked at tenant boundary.
LLM09 Overreliance
58% → 12%. Hardest category — medical/financial speculation guarded with disclaimers + human-escalate, not blocked outright.
FP on legitimate
1.2% on a 1,000-query test set. Within budget for a customer-facing assistant.
06 · Roadmap to v1.0.0

Eleven checkpoints.

  1. 01HarmBench / JailbreakBench / ToxicChat sample corpus (400 / 100 / 1K representative rows) under data/attacks/
  2. 02Adversarial prompt generators (encoded, role-play, instruction-override) in src/attacks/
  3. 03Presidio PII detection wired into output filter (src/guardrails/pii.py)
  4. 04Toxicity classifier (unitary/toxic-bert) wired in src/guardrails/toxicity.py
  5. 05Prompt-injection detector (protectai/deberta-v3-base) in src/guardrails/injection.py
  6. 06OWASP LLM Top 10 attack-to-category map in src/owasp_mapping.py
  7. 07Guardrails layer (guardrails-ai-style) wired into FastAPI middleware
  8. 08Framework applied to P01 + P03 of this portfolio (sample integrations in examples/)
  9. 09Attack-success-rate report before / after guardrails per OWASP category (docs/asr_report.md)
  10. 10Animated attack-gallery + report viewer in /projects/07-ai-safety.html
  11. 11Sample PDF security reports for P01 + P03 in docs/reports/
Next project →

P08 · GraphRAG over SEC EDGAR

Knowledge graph in Neo4j · entity extraction · multi-hop Cypher reasoning over S&P 500 10-Ks