Production triage for customer support: a LoRA-fine-tuned DistilBERT classifies intent in 85ms for $0.0004 per ticket. Claude handles the reasoning the small model can't — drafting responses grounded in similar resolved tickets retrieved from Qdrant. A confidence estimator decides auto-resolve vs human-suggest vs supervisor escalation. The fine-tuned adapter ships to HuggingFace Hub with a full model card.
Zero-shot Claude over every ticket costs $0.012 each and tops out at 0.78 macro-F1. A classifier alone is fast and cheap but can't draft a response. The right answer is both, wired together with a confidence-gated router.
Zero-shot Claude reaches only 0.78 macro-F1 on intent, generates fine responses, costs $0.012/ticket, and has a 1.8s p95 latency. At 1k tickets/hour that is about $12 an hour, or roughly $8.6K/month for a queue that runs around the clock.
Classifier-only hits 0.94 macro-F1 in 85ms for $0.0004/ticket, but it can't actually answer the customer. It just labels.
And neither has a sense of when to stop and let a human in. The roughly 6% of tickets the classifier gets wrong cause the worst incidents.
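The cost gap between the two baselines is easiest to see at queue scale. A minimal sketch, using only the per-ticket prices and the 1k/hour rate quoted above; the function name `queue_cost` and the dictionary keys are illustrative, not part of the project:

```python
# Per-ticket prices quoted in the text; everything else here is illustrative.
COST_PER_TICKET = {"zero_shot_claude": 0.012, "classifier_only": 0.0004}

def queue_cost(cost_per_ticket: float, tickets_per_hour: int = 1_000,
               hours: float = 1.0) -> float:
    """Return the spend in dollars for a queue running for `hours`."""
    return cost_per_ticket * tickets_per_hour * hours

for name, price in COST_PER_TICKET.items():
    print(f"{name}: ${queue_cost(price):,.2f} per hour at 1k tickets/hour")

# How many times more expensive zero-shot Claude is per ticket.
ratio = COST_PER_TICKET["zero_shot_claude"] / COST_PER_TICKET["classifier_only"]
print(f"price ratio: {ratio:.0f}x")
```

Passing a different `hours` value scales the same figure to a day or a month, depending on how many hours the queue is actually staffed.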
DistilBERT + LoRA classifies intent in 85ms. Claude reads the top-5 similar resolved tickets from Qdrant and drafts a response grounded in those past resolutions.
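Grounding the draft in past resolutions mostly comes down to how the prompt is assembled. The sketch below assumes the similarity search has already returned candidates; the `ResolvedTicket` shape, the `build_grounded_prompt` helper, and the prompt wording are hypothetical and not taken from the repository. It deliberately leaves out the Qdrant query and the Claude call.

```python
from dataclasses import dataclass

@dataclass
class ResolvedTicket:
    """A past ticket returned by the similarity search (hypothetical shape)."""
    ticket_id: str
    question: str
    resolution: str
    score: float  # similarity score from the vector search, higher is closer

def build_grounded_prompt(ticket_text: str, intent: str,
                          similar: list[ResolvedTicket], top_k: int = 5) -> str:
    """Assemble a drafting prompt that cites the top_k most similar resolutions."""
    # Keep only the closest matches, best first.
    ranked = sorted(similar, key=lambda t: t.score, reverse=True)[:top_k]
    context = "\n\n".join(
        f"[{i}] (ticket {t.ticket_id}, similarity {t.score:.2f})\n"
        f"Customer: {t.question}\nResolution: {t.resolution}"
        for i, t in enumerate(ranked, start=1)
    )
    return (
        f"Predicted intent: {intent}\n\n"
        f"Resolved tickets for reference:\n{context}\n\n"
        f"New ticket:\n{ticket_text}\n\n"
        "Draft a reply that relies only on the resolutions above. "
        "If none of them applies, say so instead of guessing."
    )
```

Numbering the references lets the drafted reply point back to the resolution it relied on, which is useful when a human reviews a suggested response.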
A confidence estimator combines intent confidence, sentiment, similarity to past resolutions, and priority into a single score and routes on it: > 0.85 auto-resolve · 0.6–0.85 suggest to a human · < 0.6 escalate to a supervisor.
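The routing rule can be sketched in a few lines. The thresholds come from the text; the weighted-sum formula, the weights, and the names `Signals`, `confidence_score`, and `route` are assumptions made for illustration, since the source does not say how the four signals are combined.

```python
from dataclasses import dataclass

AUTO_THRESHOLD = 0.85     # from the text: above this, auto-resolve
SUGGEST_THRESHOLD = 0.6   # from the text: below this, escalate

# Assumed weights; the real estimator may combine the signals differently.
WEIGHTS = {"intent": 0.4, "similarity": 0.3, "sentiment": 0.2, "priority": 0.1}

@dataclass
class Signals:
    intent_confidence: float  # classifier probability of the predicted intent, 0..1
    similarity: float         # best similarity to a past resolution, 0..1
    sentiment: float          # 0 = very negative, 1 = very positive
    priority_risk: float      # 0 = low priority, 1 = urgent

def confidence_score(s: Signals) -> float:
    """Weighted sum of the four signals; urgent tickets lower the score."""
    return (
        WEIGHTS["intent"] * s.intent_confidence
        + WEIGHTS["similarity"] * s.similarity
        + WEIGHTS["sentiment"] * s.sentiment
        + WEIGHTS["priority"] * (1.0 - s.priority_risk)
    )

def route(score: float) -> str:
    """Map a confidence score to one of the three routing decisions."""
    if score > AUTO_THRESHOLD:
        return "auto"
    if score >= SUGGEST_THRESHOLD:
        return "suggest"
    return "escalate"

calm_refund = Signals(intent_confidence=0.97, similarity=0.92,
                      sentiment=0.8, priority_risk=0.1)
print(route(confidence_score(calm_refund)))
```

Scores of exactly 0.85 and exactly 0.6 fall into the suggest band here, which matches the "> 0.85" and "< 0.6" wording of the thresholds.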
Net: same macro-F1 as the classifier alone (0.94), 50% lower cost and 33% lower latency than zero-shot Claude, and 19 extra auto-resolutions per hundred tickets.
Each stage logs to LangSmith. Postgres stores the audit trail. Qdrant holds embedded resolutions for semantic retrieval.
The walkthrough covers env setup, the Docker stack, a LoRA fine-tune of DistilBERT (3 epochs · r=16 · 0.66% trainable params), eval on 27 intents with a confusion matrix, pushing the adapter and model card to HuggingFace Hub, ingesting historical resolutions, the head-to-head benchmark (zero-shot vs classifier-only vs hybrid), and launching the Kanban UI.
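The small trainable fraction follows from how LoRA works: each adapted weight matrix stays frozen, and only two low-rank factors are trained alongside it. A rough illustration for a single attention projection, assuming the base DistilBERT checkpoint with hidden size 768; the helper `lora_params` is hypothetical. Which layers the project adapts is not stated here, so this does not attempt to reproduce the overall 0.66% figure.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters LoRA adds to one d_in x d_out linear layer.

    The update is B @ A with A of shape (r, d_in) and B of shape (d_out, r).
    """
    return r * (d_in + d_out)

HIDDEN = 768  # assumption: hidden size of the base DistilBERT checkpoint
RANK = 16     # r from the walkthrough

full = HIDDEN * HIDDEN                        # frozen weights in one projection
adapter = lora_params(HIDDEN, HIDDEN, RANK)   # trainable weights LoRA adds to it
print(f"adapter: {adapter:,} params vs frozen: {full:,} ({adapter / full:.2%})")
# prints: adapter: 24,576 params vs frozen: 589,824 (4.17%)
```

The model-wide fraction is lower than this per-layer ratio because the embeddings, the feed-forward layers, and any projections left unadapted contribute frozen parameters only.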
Watch tickets flow through the Kanban: a delivery question and a refund auto-resolve, a website bug gets suggested to a human, a health-related lawsuit threat escalates, a positive discount request auto-resolves with a code. The detail panel below shows the classifier output, similar tickets retrieved, drafted response, and routing decision for the currently active ticket.
Same stack philosophy as P01 — versions are explicit, alternatives are wired and toggleable.
From 02-support-triage/README.md on main.