P07 — AI Safety & Red Teaming · Juan David Suárez Sánchez

01 · The problem

You cannot ship a regulated LLM without an attack budget.

Banking, insurance, healthcare, government — every one of them requires you to prove the system doesn't leak PII, doesn't comply with malicious intent, and doesn't hallucinate facts the customer will act on. You can't fix what you haven't measured.

Why "we tested it" isn't enough

The first 100 attacks teach you what was broken yesterday.

A handful of one-off prompts catch the obvious vulnerabilities. They don't catch the long tail: base64-encoded injection, roleplay jailbreaks, unicode bypass, training-data extraction via "repeat the word forever", cross-session memory queries, instruction-overrides hidden in translation requests.

Worse: without a baseline, you can't quantify the value of your guardrails. The CISO asks "how much risk did we reduce?" and the answer is a shrug.

What this framework does

500 attacks → guardrails → 500 attacks again → numbers.

Attack Generator pulls from public benchmarks (HarmBench, JailbreakBench), expands each base attack into 5 encoding variants (plaintext, base64, rot13, unicode-confusable, roleplay-wrap), and Claude generates new adversarials when coverage is thin.

Guardrails Layer: input-side prompt-injection classifier (DeBERTa) + jailbreak detector; output-side PII redaction (Presidio), toxicity filter (toxic-bert), HTML/URL-scheme sanitizer; policy-side refusal injection for high-risk categories.

Quantified report: per-OWASP-category attack success rate before / after, false-positive rate on 1,000 legitimate queries, latency overhead of the guardrails stack, PDF deliverable for the audit.

03 · Demo 1 of 2 · End-to-end run

500 attacks. Baseline. Install guardrails. 500 attacks. PDF.

Walks through dataset downloads, baseline red-team run against P01 (no guardrails), per-OWASP-category breakdown of attack success, guardrails install at middleware layer, re-run with guardrails on, side-by-side reduction table, PDF report generation.

Demo 01

Red team · before & after guardrails

6 steps · 65s · HarmBench + JailbreakBench + Presidio + DeBERTa

SPACE play0 reset

04 · Demo 2 of 2 · Live attack dashboard

Watch 20 attacks fire against guarded P01 — every one is blocked, redacted, or guarded.

Top counter tracks attempts vs verdicts. Left panel shows per-OWASP attack success rate before (red) and after guardrails (green) — the reduction shown as a delta in pp. Right panel is the live feed of each attack with payload, verdict, and which guard caught it. 0 breaches across the run.

Demo 02

Live red-team feed against P01 (guarded)

20 attacks · 10 OWASP categories · 100% caught

SPACE play0 reset

05 · Stack

Open-source attack tools + commercial-grade guards.

Stack — pinned

Red team

garak0.10.0 giskard2.15.0 HarmBench dataset JailbreakBench ToxicChat

Guardrails

guardrails-ai0.6.0 nemo-guardrails0.11.0 Presidio (PII) unitary/toxic-bert protectai DeBERTa

Audit & serving

Postgres16 LangSmith Next.js dashboard reportlab (PDF)

Reduction by category

LLM01 Injection

45% → 3% baseline → guarded. DeBERTa prompt-injection classifier catches the long tail (base64, unicode, roleplay-wrap).

LLM06 Disclosure

48% → 1%. Presidio redacts PII before it leaves the model; cross-session memory queries blocked at tenant boundary.

LLM09 Overreliance

58% → 12%. Hardest category — medical/financial speculation guarded with disclaimers + human-escalate, not blocked outright.

FP on legitimate

1.2% on a 1,000-query test set. Within budget for a customer-facing assistant.

06 · Roadmap to v1.0.0

Eleven checkpoints.

01✓HarmBench / JailbreakBench / ToxicChat sample corpus (400 / 100 / 1K representative rows) under data/attacks/
02✓Adversarial prompt generators (encoded, role-play, instruction-override) in src/attacks/
03✓Presidio PII detection wired into output filter (src/guardrails/pii.py)
04✓Toxicity classifier (unitary/toxic-bert) wired in src/guardrails/toxicity.py
05✓Prompt-injection detector (protectai/deberta-v3-base) in src/guardrails/injection.py
06✓OWASP LLM Top 10 attack-to-category map in src/owasp_mapping.py
07✓Guardrails layer (guardrails-ai-style) wired into FastAPI middleware
08✓Framework applied to P01 + P03 of this portfolio (sample integrations in examples/)
09✓Attack-success-rate report before / after guardrails per OWASP category (docs/asr_report.md)
10✓Animated attack-gallery + report viewer in /projects/07-ai-safety.html
11✓Sample PDF security reports for P01 + P03 in docs/reports/

Attack the system before someone else does. Then quantify what the guardrails buy you.