P02 · Fine-tuning · LoRA · DistilBERT · Claude

Cheap classifier where it suffices. Claude reasoning where it matters. One pipeline, three routings.

Production triage for customer support: a LoRA-fine-tuned DistilBERT classifies intent in 85ms for $0.0004 per ticket. Claude handles the reasoning the small model can't — drafting responses grounded in similar resolved tickets retrieved from Qdrant. A confidence estimator decides auto-resolve vs human-suggest vs supervisor escalation. The fine-tuned adapter ships to HuggingFace Hub with a full model card.

Status
Planned · README only
queued after P01 wraps
Datasets
Bitext · 26,872 queries
+ Twitter CS · 3M tweets
Publication
HuggingFace Hub adapter
benchmarks + model card
Target metric
Macro-F1 ≥ 0.93 · 27 intents
auto-resolve ≥ 70%
01 · The problem

One model is either too dumb or too expensive.

Zero-shot Claude over every ticket costs $0.012 each and tops out at 0.78 macro-F1. A classifier alone is fast and cheap but can't draft a response. The right answer is both, wired together with a confidence-gated router.

Why single-model fails

You either pay too much or you can't escalate well.

Zero-shot Claude gets the intent right ~78% of the time, generates fine responses, costs $0.012/ticket, and has 1.8s p95 — that's a $12K/month bill for a 1k/hour queue.

Classifier-only hits 0.94 macro-F1 in 85ms for $0.0004/ticket, but it can't actually answer the customer. It just labels.

And neither has a sense of when to stop and let a human in. The 6% of tickets where the model is wrong cause the worst incidents.

The hybrid wins on every axis

Classifier scopes the problem. Claude solves it. Confidence routes it.

DistilBERT + LoRA classifies intent in 85ms. Claude reads the top-5 similar resolved tickets from Qdrant and drafts a response grounded in those past resolutions.

Confidence Estimator combines intent confidence, sentiment, similarity to past resolutions, and priority into a single score: > 0.85 auto · 0.6–0.85 suggest · < 0.6 escalate.

Net: same accuracy as Claude (0.94), 50% lower cost, 33% faster, and 19 extra auto-resolutions per hundred tickets.

02 · System diagram

Five stages, one confidence gate, three routings.

Each stage logs to LangSmith. Postgres stores the audit trail. Qdrant holds embedded resolutions for semantic retrieval.

// Triage pipeline · classifier → reasoner → confidence-gated routing
Ticket in email · slack · chat Intent Classifier DistilBERT + LoRA (r=16) → 27 intents · 85ms · F1 0.94 Sentiment + Priority Claude · structured sentiment · P0–P3 · urgency Similar Resolved Tickets Qdrant · mpnet-base-v2 → top-5 past resolutions Solution Drafter Claude Sonnet 4.5 response grounded in past wins Confidence Estimator f(intent_conf, sentiment, sim, priority) → score ∈ [0, 1] AUTO-RESOLVE score > 0.85 → send draft as-is SUGGEST 0.6 ≤ score ≤ 0.85 → human reviews draft ESCALATE score < 0.6 or sensitive → supervisor inbox Audit Logger Postgres · all decisions · LangSmith trace_id
03 · Demo 1 of 2 · Train & deploy

From git clone to a published HF Hub model.

Press play. Walks through env setup, Docker stack, LoRA fine-tune of DistilBERT (3 epochs · r=16 · 0.66% trainable params), eval on 27 intents with confusion matrix, push adapter + model card to HuggingFace Hub, ingest historical resolutions, run the head-to-head benchmark (zero-shot vs classifier-only vs hybrid), and launch the Kanban UI.

Demo 01
Train · publish · benchmark
9 steps · 75s · uv + PEFT + HF Hub + LangSmith
SPACE play / pause seek 0 reset
04 · Demo 2 of 2 · System in motion

Five tickets, three routings, one supervisor escalation.

Watch tickets flow through the Kanban: a delivery question and a refund auto-resolve, a website bug gets suggested to a human, a health-related lawsuit threat escalates, a positive discount request auto-resolves with a code. The detail panel below shows the classifier output, similar tickets retrieved, drafted response, and routing decision for the currently active ticket.

Demo 02
Kanban — 5 tickets through the full pipeline
auto-resolve × 3 · suggest × 1 · escalate × 1
SPACE play / pause seek 0 reset
05 · Stack

Pinned dependencies, public weights.

Same stack philosophy as P01 — versions are explicit, alternatives are wired and toggleable.

Stack — pinned

Training
transformers4.46.0 peft0.13.0 accelerate1.0.1 torch2.5.0 bitsandbytes0.44.1
Base models
distilbert-base-uncased roberta-base (fallback) all-mpnet-base-v2
Reasoning
Claude Sonnet 4.5 LangGraph0.2.45 Pydanticv2
Storage & serving
Qdrant1.12 Postgres16 HuggingFace Hub Next.js14 shadcn/ui

Publication checklist

HF model card
Required: intended use, training details (LoRA r/alpha/lr/epochs), data card with bias notes, eval table with macro-F1 by intent, license, citation BibTeX.
Reproducibility
Training script committed with seed=42, exact dependency lockfile, expected loss curve documented. A stranger can pull the dataset and re-run.
Limitations
Bitext is English-only and skewed to e-commerce. Spanish customers degrade to zero-shot Claude until v2 covers Spanish CS data.
License
MIT for code; CC-BY-4.0 for the adapter; Bitext upstream license preserved in attribution.
06 · Roadmap to v1.0.0

Ten checkpoints. Each one is a separate PR.

From 02-support-triage/README.md on main.

  1. 01Load Bitext (27K queries × 27 intents), 80/10/10 stratified split
  2. 02LoRA fine-tune DistilBERT, evaluate macro-F1 on test split
  3. 03HuggingFace Hub push script + full MODEL_CARD.md ready in models/intent-classifier-lora/ (token-gated publish)
  4. 04Qdrant retrieval module (src/retrieval/similar_tickets.py) + ingest CLI for resolved-tickets corpus
  5. 05LangGraph agent: classifier → sentiment → similar tickets → drafter → confidence scorer
  6. 06End-to-end eval runner in src/eval/runner.py over 200-sample test split
  7. 07Comparative report: zero-shot Claude vs LoRA classifier-only vs hybrid pipeline (docs/pipeline_comparison.md)
  8. 08Animated kanban demo (pending → drafted → suggested → resolved → escalated) in /projects/02-support-triage.html
  9. 09LangSmith trace hooks wired in orchestrator; sample runs documented in docs/trace_gallery.md
  10. 10Confusion matrix rendered to reports/confusion_matrix.png and embedded in MODEL_CARD.md
Next project →

P03 · B2B Sales Intelligence Agent

Planner-Executor-Reflector loop · Tavily + HN · personalized outreach with measurable personalization score