P01 · Hybrid RAG · LangGraph · Multi-turn

Customers ask in plain English. The agent answers with citations, manages the cart, and knows when to call a human.

A production-grade conversational commerce agent over the Instacart catalog (49,688 products, 134 aisles, 21 departments). Hybrid retrieval — BM25 + dense (voyage-3) + RRF fusion + Cohere rerank v3 — feeds a LangGraph supervisor that routes between product search, cart ops, refund handling, and human escalation. Three vector databases benchmarked head-to-head with RAGAS.

Status
Scaffold · 43 files in src/
v0.1.0 · roadmap below
Role
Solo AI engineer
design + impl + deploy
Dataset
Instacart · 3.4M orders
CC BY-NC-SA · Kaggle
Target metric
P@5 ≥ 0.80 · Faithfulness ≥ 0.90
vs. ≈0.61 BM25 baseline
01 · The problem

Generic vector search loses 30% of intent before it leaves the retriever.

The defaults — single-pass dense retrieval, no reranking, no structured filters — ship sub-70% precision on real catalog queries. Below is what production-grade looks like.

The naive baseline fails because

Catalog queries are 4 problems wearing one trench coat.

"Lactose-free breakfast for kids under $10" is simultaneously a semantic match (breakfast, lactose-free), an attribute filter (price), a contextual constraint (children), and a ranking task (best fit first). A single embedding similarity scores all four at once and gets none right.

Add multi-turn context — "now add the first two to my cart" — and the failure mode multiplies. A real assistant must classify intent, route to the right tool, preserve state, and know when to escalate.

What this project does instead

A LangGraph supervisor over five specialized handlers.

Intent Router classifies between product_search, cart_op, order_status, refund, and escalate. Each route runs only the tools it needs.
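The router's contract can be sketched as a typed decision object that the LLM's JSON reply must validate against before any tool runs. This is a minimal, dependency-free sketch using stdlib dataclasses in place of the project's actual Pydantic models; names like RouterDecision and from_json are illustrative, not the repo's real identifiers.

```python
from dataclasses import dataclass, field
from enum import Enum
import json

class Intent(str, Enum):
    PRODUCT_SEARCH = "product_search"
    CART_OP = "cart_op"
    ORDER_STATUS = "order_status"
    REFUND = "refund"
    ESCALATE = "escalate"

@dataclass
class RouterDecision:
    """Structured output the intent router expects from the LLM."""
    intent: Intent
    filters: dict = field(default_factory=dict)            # e.g. {"price_max": "10"}
    escalation_signals: list = field(default_factory=list)  # e.g. ["refund > $100"]

    @classmethod
    def from_json(cls, raw: str) -> "RouterDecision":
        # Validate the raw LLM reply before any downstream tool is invoked.
        data = json.loads(raw)
        return cls(
            intent=Intent(data["intent"]),
            filters=data.get("filters", {}),
            escalation_signals=data.get("escalation_signals", []),
        )

decision = RouterDecision.from_json(
    '{"intent": "product_search", "filters": {"price_max": "10"}}'
)
```

An invalid intent string raises a ValueError at the boundary, so a malformed classification fails fast instead of routing to the wrong handler.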

Hybrid Retriever runs BM25 + dense vectors in parallel, fuses with Reciprocal Rank Fusion (k=60), and reranks the top-20 with Cohere v3 to a final top-5.
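The fusion step itself is only a few lines. A sketch of Reciprocal Rank Fusion with the k=60 constant described above; rrf_fuse is an illustrative name, not the repo's actual function:

```python
from collections import defaultdict

def rrf_fuse(bm25_ids, dense_ids, k=60, top_n=20):
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank).

    Rank-agnostic: raw BM25 and cosine scores never need calibration,
    only each document's position in each ranked list matters.
    """
    scores = defaultdict(float)
    for ranking in (bm25_ids, dense_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Top-20 fused candidates go on to the Cohere reranker.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

fused = rrf_fuse(["a", "b", "c"], ["a", "c", "d"])
```

A document ranked in both lists ("a" above) accumulates two reciprocal-rank contributions and outranks documents that appear in only one.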

Cart Manager persists stateful Pydantic models to Postgres. Refund Handler applies an explicit policy: value > $100 or a sensitive category triggers the human supervisor. Synthesizer generates responses with a product-id citation for every claim.

02 · System diagram

One supervisor, five handlers, full audit log.

Every node logs its input, output, and tokens to LangSmith. Postgres is the source of truth for state; Redis caches sessions; Qdrant stores embeddings.


// LangGraph state machine — supervisor pattern
User Message (chat · voice · email)
  → Intent Router (Claude · structured): { intent, filters, escalation_signals }
      → Hybrid Retriever: BM25 sparse ∥ Dense (Qdrant) → RRF fusion (k=60) → Cohere v3 rerank (top 20 → top 5)
      → Structured SQL: price · stock · cat → Postgres 16
      → Cart Manager: add · remove · update (Pydantic · Redis TTL)
      → Refund Handler: policy check → escalate if >$100
          → Human Escalation: supervisor inbox (Slack · email · pager)
  → Response Synthesizer (Claude · citations · audit log)
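The per-node audit trail can be sketched as a plain-Python decorator, a dependency-free stand-in for the real LangSmith callback wiring. Names like audited and AUDIT_LOG are illustrative:

```python
import functools
import time

AUDIT_LOG: list[dict] = []

def audited(node_name):
    """Wrap a graph node so every call records input, output, and latency.

    The production system forwards these records to LangSmith; this
    sketch appends them to an in-process list instead.
    """
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(state):
            t0 = time.perf_counter()
            out = fn(state)
            AUDIT_LOG.append({
                "node": node_name,
                "input": state,
                "output": out,
                "ms": round((time.perf_counter() - t0) * 1000, 2),
            })
            return out
        return wrapper
    return deco

@audited("intent_router")
def route(state):
    # Stand-in node body: a real router would call the LLM here.
    return {**state, "intent": "product_search"}

route({"message": "gluten-free pasta under $5"})
```

Because every node passes through the same wrapper, the audit log reconstructs the full path a conversation took through the graph.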
03 · Demo 1 of 2 · Install & benchmark

From git clone to a benchmarked system.

Press play. The terminal walks through repo clone, env config, the real Docker stack (postgres + qdrant + redis + api + streamlit), uv pip install, dataset download with checksum, hybrid index build across three vector DBs in parallel, a 100-query benchmark, a RAGAS faithfulness eval, and Streamlit launch.

Demo 01
Install & benchmark — full transcript
8 steps · 88s · zsh + docker + uv + ragas
04 · Demo 2 of 2 · System in motion

Watch four turns flow through every node — including the supervisor escalation.

Chat on the left, the LangGraph state machine on the right. Nodes pulse purple when active, turn green when done, and yellow when a refund policy triggers human handoff. Live latency and cost stats stream into the top bar.

Demo 02
Query lifecycle — 4 turns end-to-end
product_search → cart_op → refund (auto) → refund (escalate)
05 · Target metrics

The reranker should pay for itself at +0.08 P@5.

Eval set: 100 ground-truth queries (30 lookup · 30 semantic · 20 compare · 20 multi-turn). Numbers below are targets derived from published RAG benchmarks and the spec's success criteria. Real numbers replace these once eval/runs/*.json is committed.

Retrieval benchmark — 4 systems × 100 queries

System                           P@5     Recall@10   nDCG@10   Faithfulness   p95 latency   $ / 1k queries
BM25 only (baseline)             0.612   0.704       0.658     —              180 ms        $0.04
Dense only (Qdrant + voyage-3)   0.678   0.741       0.712     —              240 ms        $0.18
Hybrid + RRF (k=60)              0.741   0.812       0.788     —              290 ms        $0.22
Hybrid + Cohere rerank v3 ★      0.823   0.849       0.844     0.91           480 ms        $1.21

All values are targets; ★ marks the target winner.
06 · Stack & decisions

Every choice has a "why this, not that".

Pinned versions from pyproject.toml. Five architectural decisions are documented in docs/decisions.md — for each, the trade-off is explicit and the rejected alternative is named.

Stack — pinned versions

Runtime
Python 3.11 · FastAPI 0.115.0 · uvicorn 0.32.0 · Pydantic 2.9.2
Orchestration
LangGraph 0.2.45 · langchain-anthropic 0.2.4 · anthropic 0.39.0
Retrieval
voyageai 0.3.2 · qdrant-client 1.12.0 · pgvector 0.3.6 · chromadb 0.5.13 · rank-bm25 0.2.2 · sentence-transformers 3.2.1
Storage
pgvector/pgvector:pg16 · qdrant/qdrant:v1.12.0 · redis:7-alpine · psycopg 3.2.3
Eval & quality
RAGAS 0.2.6 · pytest 8.3.3 · mypy --strict · ruff 0.6.9 · black 24.10.0

Decisions log

LangGraph over CrewAI
Explicit state, Postgres checkpoints, time-travel debugging. CrewAI hides the state machine — a bad fit for audit-grade workflows.
RRF over linear fusion
RRF (k=60) is rank-agnostic and needs no score calibration between BM25 and cosine scores. Linear fusion requires query-by-query weight tuning.
Cohere v3 over local ms-marco
Cohere is expected to win by +4 nDCG points on the eval set; ms-marco-MiniLM is the fallback if API cost becomes a constraint. Both are wired and toggleable.
Streamlit over Next.js
For a portfolio demo, Streamlit ships in a day with built-in state. Next.js becomes worth it once the product needs auth, payments, and real users.
Policy as code over LLM-as-judge
Refund policy encoded as Python (value > $100 OR sensitive_category → escalate). Deterministic, auditable, testable. LLM-as-judge is for evaluation, not policy.
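That policy fits in a handful of deterministic, unit-testable lines. A minimal sketch; the category set and names like RefundRequest and needs_human are illustrative, not the repo's actual code:

```python
from dataclasses import dataclass

# Illustrative set; the real sensitive-category list lives in config.
SENSITIVE_CATEGORIES = {"alcohol", "pharmacy"}

@dataclass
class RefundRequest:
    order_value: float
    category: str

def needs_human(req: RefundRequest, threshold: float = 100.0) -> bool:
    """Deterministic escalation rule from the decisions log:
    value > $100 OR sensitive category → route to the human supervisor."""
    return req.order_value > threshold or req.category in SENSITIVE_CATEGORIES
```

Because the rule is plain Python, every branch is covered by a one-line pytest assertion and every escalation is explainable after the fact.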
07 · Roadmap to v1.0.0

Eleven checkpoints between scaffold and a metric-backed release.

Taken directly from 01-ecommerce-assistant/README.md on main. Status reflects what's committed to the repo.

  1. 01 · Data pipeline — scripts/download_instacart.py pulls the Kaggle dataset, builds processed parquet
  2. 02 · Vector ingestion — embed name + aisle + dept with voyage-3, index into Qdrant + pgvector + Chroma in parallel
  3. 03 · Hybrid retrieval module — rank-bm25 sparse + dense + RRF fusion
  4. 04 · Reranking — local ms-marco-MiniLM-L-6-v2 cross-encoder shipped; Cohere v3 wrapper drops in with a key
  5. 05 · LangGraph agent — Intent Router → Retriever → Cart / Refund handlers → Synthesizer (all 5 nodes in src/agents/)
  6. 06 · Eval set — 100 hand-labeled queries committed under data/eval/ with relevance scores
  7. 07 · RAGAS evaluation — src/eval/runner.py computes Faithfulness, Answer Relevance, Context Precision, Context Recall
  8. 08 · Vector DB benchmark — Qdrant / pgvector / Chroma scored on P@5, Recall@10, nDCG, p95, $/1K; report in docs/benchmark.md
  9. 09 · Streamlit demo — multi-turn chat with 5 pre-baked queries, deployed to community cloud
  10. 10 · Observability — LangSmith hooks wired in the orchestrator; trace gallery seeded with sample runs
  11. 11 · Definition of Done — all 12 universal blocks + project-specific items reviewed and tracked in docs/
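Step 02's embedding input can be sketched as a one-line field concatenation per product. The field names below assume the joined Instacart tables (product_name plus resolved aisle and department); the actual key names in the processed parquet may differ:

```python
def embedding_text(product: dict) -> str:
    """Build the single string per product that step 02 sends to voyage-3:
    product name plus its aisle and department for extra semantic context."""
    return (
        f"{product['product_name']} | "
        f"aisle: {product['aisle']} | "
        f"department: {product['department']}"
    )

doc = embedding_text({
    "product_name": "Organic Whole Milk",
    "aisle": "milk",
    "department": "dairy eggs",
})
```

Including aisle and department in the embedded text is what lets a query like "breakfast" match products whose names never contain the word.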
Next project →

P02 · Customer Support Triage Agent

DistilBERT + LoRA fine-tune · Claude reasoning · published to HF Hub