A production-grade conversational commerce agent over the Instacart catalog (49,688 products, 134 aisles, 21 departments). Hybrid retrieval — BM25 + dense (voyage-3) + RRF fusion + Cohere rerank v3 — feeds a LangGraph supervisor that routes between product search, cart ops, refund handling, and human escalation. Three vector databases benchmarked head-to-head with RAGAS.
The default RAG setup (single-pass dense retrieval, no reranking, no structured filters) ships sub-70% precision on real catalog queries. Below is what production-grade looks like.
"Lactose-free breakfast for kids under $10" is simultaneously a semantic match (breakfast, lactose-free), an attribute filter (price), a contextual constraint (children), and a ranking task (best fit first). A single embedding similarity scores all four at once and gets none right.
Add multi-turn context — "now add the first two to my cart" — and the failure mode multiplies. A real assistant must classify intent, route to the right tool, preserve state, and know when to escalate.
Intent Router classifies each turn as product_search, cart_op, order_status, refund, or escalate. Each route runs only the tools it needs.
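A minimal sketch of that routing pattern, assuming a LangGraph StateGraph with a classifier node feeding conditional edges; the state shape, stub nodes, and the keyword fallback classifier are illustrative, not the repo's implementation.

```python
from typing import TypedDict

from langgraph.graph import END, StateGraph


class AgentState(TypedDict):
    # Hypothetical state shape; the real graph carries cart, history, etc.
    query: str
    intent: str
    response: str


ROUTES = ["product_search", "cart_op", "order_status", "refund", "escalate"]


def classify_intent(state: AgentState) -> AgentState:
    # Placeholder: the real router would call an LLM with a constrained output schema.
    q = state["query"].lower()
    intent = "cart_op" if "cart" in q else "product_search"
    return {**state, "intent": intent}


def route(state: AgentState) -> str:
    # Unknown intents fall through to human escalation.
    return state["intent"] if state["intent"] in ROUTES else "escalate"


def make_stub(name: str):
    def node(state: AgentState) -> AgentState:
        return {**state, "response": f"[{name}] handled: {state['query']}"}
    return node


graph = StateGraph(AgentState)
graph.add_node("router", classify_intent)
for name in ROUTES:
    graph.add_node(name, make_stub(name))
graph.set_entry_point("router")
graph.add_conditional_edges("router", route, {name: name for name in ROUTES})
for name in ROUTES:
    graph.add_edge(name, END)

app = graph.compile()
print(app.invoke({"query": "add milk to my cart", "intent": "", "response": ""}))
```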
Hybrid Retriever runs BM25 + dense vectors in parallel, fuses with Reciprocal Rank Fusion (k=60), and reranks the top-20 with Cohere v3 to a final top-5.
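A minimal sketch of the fusion step with toy product ids: two ranked lists stand in for the BM25 and dense retriever output, and RRF scores each document as the sum of 1/(k + rank) across lists. The Cohere rerank stage is left as a comment so the snippet runs offline; names here are illustrative.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank_in_list).
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)


# Toy rankings standing in for the two retrievers run in parallel.
bm25_ids = ["p1", "p7", "p3", "p9"]
dense_ids = ["p3", "p1", "p4", "p8"]

candidates = rrf_fuse([bm25_ids, dense_ids], k=60)[:20]
print(candidates)  # e.g. ['p1', 'p3', 'p7', 'p4', ...]

# The final stage would rerank these candidates with Cohere rerank v3, e.g.
#   co.rerank(model="rerank-english-v3.0", query=query, documents=texts, top_n=5)
# per Cohere's Python SDK, and keep the top-5.
```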
Cart Manager keeps state in Pydantic models backed by Postgres. Refund Handler applies an explicit policy: value over $100 or a sensitive category triggers handoff to a human supervisor. Synthesizer generates responses with a product-id citation for every claim.
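A minimal sketch of that escalation check, assuming hypothetical field names and a hypothetical sensitive-category set; only the $100 threshold and the escalation rule come from the policy above.

```python
from decimal import Decimal

from pydantic import BaseModel

# Illustrative policy constants; the repo's actual category list may differ.
SENSITIVE_CATEGORIES = {"alcohol", "baby formula", "pharmacy"}
ESCALATION_THRESHOLD = Decimal("100")


class RefundRequest(BaseModel):
    order_id: str
    amount: Decimal
    category: str
    reason: str


def needs_human(req: RefundRequest) -> bool:
    # Policy: value over $100 or a sensitive category goes to a human supervisor.
    return req.amount > ESCALATION_THRESHOLD or req.category in SENSITIVE_CATEGORIES


req = RefundRequest(order_id="o-1042", amount=Decimal("129.99"),
                    category="produce", reason="spoiled")
print(needs_human(req))  # True: amount exceeds the $100 threshold
```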
Every node logs its input, output, and tokens to LangSmith. Postgres is the source of truth for state; Redis caches sessions; Qdrant stores embeddings.
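One plausible way to wire those services together is through environment variables; the variable names and defaults below are assumptions for illustration, not the repo's settings module.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    # Hypothetical connection strings; defaults match a local docker-compose stack.
    postgres_dsn: str = os.getenv("POSTGRES_DSN", "postgresql://app:app@localhost:5432/instacart")
    redis_url: str = os.getenv("REDIS_URL", "redis://localhost:6379/0")
    qdrant_url: str = os.getenv("QDRANT_URL", "http://localhost:6333")
    langsmith_project: str = os.getenv("LANGCHAIN_PROJECT", "ecommerce-assistant")


settings = Settings()
print(settings.qdrant_url)
```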
Press play. The terminal walkthrough covers repo clone, env config, bringing up the full Docker stack (postgres + qdrant + redis + api + streamlit), uv pip install, dataset download with checksum verification, hybrid index builds across three vector DBs in parallel, a 100-query benchmark, a RAGAS faithfulness eval, and the Streamlit launch.
Chat on the left, the LangGraph state machine on the right. Nodes pulse purple when active, turn green when done, and yellow when a refund policy triggers human handoff. Live latency and cost stats stream into the top bar.
Eval set: 100 ground-truth queries (30 lookup · 30 semantic · 20 compare · 20 multi-turn). Numbers below are targets derived from published RAG benchmarks and the spec's success criteria. Real numbers replace these once eval/runs/*.json is committed.
| System | P@5 | Recall@10 | nDCG@10 | Faithfulness | p95 latency | $ / 1k queries |
|---|---|---|---|---|---|---|
| BM25 only (baseline) | 0.612 | 0.704 | 0.658 | — | 180 ms | $0.04 |
| Dense only (Qdrant + voyage-3) | 0.678 | 0.741 | 0.712 | — | 240 ms | $0.18 |
| Hybrid + RRF (k=60) | 0.741 | 0.812 | 0.788 | — | 290 ms | $0.22 |
| Hybrid + Cohere rerank v3 | 0.823 | 0.849 | 0.844 | 0.91 | 480 ms | $1.21 |
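For concreteness, the three ranking columns can be computed per query from binary relevance judgments; the sketch below uses toy product ids and is not the repo's eval harness.

```python
import math


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return sum(1 for d in retrieved[:k] if d in relevant) / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return sum(1 for d in retrieved[:k] if d in relevant) / max(len(relevant), 1)


def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Binary-gain nDCG: DCG over hit positions, normalized by the ideal DCG.
    dcg = sum(1.0 / math.log2(i + 2) for i, d in enumerate(retrieved[:k]) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0


# Toy query: two of the top five retrieved ids are in the ground-truth set.
retrieved = ["p1", "p9", "p3", "p2", "p8"]
relevant = {"p1", "p3", "p4"}
print(precision_at_k(retrieved, relevant, 5))   # 0.4
print(round(recall_at_k(retrieved, relevant, 10), 3))  # 0.667
print(round(ndcg_at_k(retrieved, relevant, 10), 3))    # 0.704
```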
Dependency versions are pinned in pyproject.toml. Five architectural decisions are documented in docs/decisions.md; for each, the trade-off is explicit and the rejected alternative is named.
Taken directly from 01-ecommerce-assistant/README.md on main. Status reflects what's committed to the repo.