P01 · Hybrid RAG · LangGraph · Multi-turn

Customers ask in plain English. The agent answers with citations, manages the cart, and knows when to call a human.

A production-grade conversational commerce agent over the Instacart catalog (49,688 products, 134 aisles, 21 departments). Hybrid retrieval — BM25 + dense (voyage-3) + RRF fusion + Cohere rerank v3 — feeds a LangGraph supervisor that routes between product search, cart ops, refund handling, and human escalation. Three vector databases benchmarked head-to-head with RAGAS.

Status
Scaffold · 43 files in src/
v0.1.0 · roadmap below
Role
Solo AI engineer
design + impl + deploy
Dataset
Instacart · 3.4M orders
CC BY-NC-SA · Kaggle
Target metric
P@5 ≥ 0.80 · Faithfulness ≥ 0.90
vs. ≈0.61 BM25 baseline
01 · The problem

Generic vector search loses 30% of intent before it leaves the retriever.

The defaults — single-pass dense retrieval, no reranking, no structured filters — ship sub-70% precision on real catalog queries. Below is what production-grade looks like.

The naive baseline fails because

Catalog queries are 4 problems wearing one trench coat.

"Lactose-free breakfast for kids under $10" is simultaneously a semantic match (breakfast, lactose-free), an attribute filter (price), a contextual constraint (children), and a ranking task (best fit first). A single embedding similarity scores all four at once and gets none right.

Add multi-turn context — "now add the first two to my cart" — and the failure mode multiplies. A real assistant must classify intent, route to the right tool, preserve state, and know when to escalate.

What this project does instead

A LangGraph supervisor over five specialized handlers.

Intent Router classifies between product_search, cart_op, order_status, refund, and escalate. Each route runs only the tools it needs.
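The router's contract can be sketched as a typed decision object that the LLM's JSON reply must validate against before any tool runs. This is a minimal, dependency-free sketch using stdlib dataclasses in place of the project's actual Pydantic models; names like RouterDecision and from_json are illustrative, not the repo's real identifiers.

```python
from dataclasses import dataclass, field
from enum import Enum
import json

class Intent(str, Enum):
    PRODUCT_SEARCH = "product_search"
    CART_OP = "cart_op"
    ORDER_STATUS = "order_status"
    REFUND = "refund"
    ESCALATE = "escalate"

@dataclass
class RouterDecision:
    """Structured output the intent router expects from the LLM."""
    intent: Intent
    filters: dict = field(default_factory=dict)            # e.g. {"price_max": "10"}
    escalation_signals: list = field(default_factory=list)  # e.g. ["refund > $100"]

    @classmethod
    def from_json(cls, raw: str) -> "RouterDecision":
        # Validate the raw LLM reply before any downstream tool is invoked.
        data = json.loads(raw)
        return cls(
            intent=Intent(data["intent"]),
            filters=data.get("filters", {}),
            escalation_signals=data.get("escalation_signals", []),
        )

decision = RouterDecision.from_json(
    '{"intent": "product_search", "filters": {"price_max": "10"}}'
)
```

An invalid intent string raises a ValueError at the boundary, so a malformed classification fails fast instead of routing to the wrong handler.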

Hybrid Retriever runs BM25 + dense vectors in parallel, fuses with Reciprocal Rank Fusion (k=60), and reranks the top-20 with Cohere v3 to a final top-5.
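The fusion step itself is only a few lines. A sketch of Reciprocal Rank Fusion with the k=60 constant described above; rrf_fuse is an illustrative name, not the repo's actual function:

```python
from collections import defaultdict

def rrf_fuse(bm25_ids, dense_ids, k=60, top_n=20):
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank).

    Rank-agnostic: raw BM25 and cosine scores never need calibration,
    only each document's position in each ranked list matters.
    """
    scores = defaultdict(float)
    for ranking in (bm25_ids, dense_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Top-20 fused candidates go on to the Cohere reranker.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

fused = rrf_fuse(["a", "b", "c"], ["a", "c", "d"])
```

A document ranked in both lists ("a" above) accumulates two reciprocal-rank contributions and outranks documents that appear in only one.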

Cart Manager persists stateful Pydantic models to Postgres. Refund Handler applies an explicit policy: value > $100 or a sensitive category triggers the human supervisor. Synthesizer generates responses with a product-id citation for every claim.

02 · System diagram

One supervisor, five handlers, full audit log.

Every node logs its input, output, and tokens to LangSmith. Postgres is the source of truth for state; Redis caches sessions; Qdrant stores embeddings.


// LangGraph state machine — supervisor pattern
User Message (chat · voice · email)
  → Intent Router (Claude · structured): { intent, filters, escalation_signals }
      → Hybrid Retriever: BM25 sparse ∥ Dense (Qdrant) → RRF fusion (k=60) → Cohere v3 rerank (top 20 → top 5)
      → Structured SQL: price · stock · cat → Postgres 16
      → Cart Manager: add · remove · update (Pydantic · Redis TTL)
      → Refund Handler: policy check → escalate if >$100
          → Human Escalation: supervisor inbox (Slack · email · pager)
  → Response Synthesizer (Claude · citations · audit log)
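The per-node audit trail can be sketched as a plain-Python decorator, a dependency-free stand-in for the real LangSmith callback wiring. Names like audited and AUDIT_LOG are illustrative:

```python
import functools
import time

AUDIT_LOG: list[dict] = []

def audited(node_name):
    """Wrap a graph node so every call records input, output, and latency.

    The production system forwards these records to LangSmith; this
    sketch appends them to an in-process list instead.
    """
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(state):
            t0 = time.perf_counter()
            out = fn(state)
            AUDIT_LOG.append({
                "node": node_name,
                "input": state,
                "output": out,
                "ms": round((time.perf_counter() - t0) * 1000, 2),
            })
            return out
        return wrapper
    return deco

@audited("intent_router")
def route(state):
    # Stand-in node body: a real router would call the LLM here.
    return {**state, "intent": "product_search"}

route({"message": "gluten-free pasta under $5"})
```

Because every node passes through the same wrapper, the audit log reconstructs the full path a conversation took through the graph.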
03 · Demo 1 of 2 · Install & benchmark

From git clone to a benchmarked system.

Press play. The terminal walks through repo clone, env config, the real Docker stack (postgres + qdrant + redis + api + streamlit), uv pip install, dataset download with checksum, hybrid index build across three vector DBs in parallel, a 100-query benchmark, a RAGAS faithfulness eval, and Streamlit launch.

Demo 01
Install & benchmark — full transcript
8 steps · 88s · zsh + docker + uv + ragas
04 · Demo 2 of 2 · System in motion

Watch four turns flow through every node — including the supervisor escalation.

Chat on the left, the LangGraph state machine on the right. Nodes pulse purple when active, turn green when done, and yellow when a refund policy triggers human handoff. Live latency and cost stats stream into the top bar.

Demo 02
Query lifecycle — 4 turns end-to-end
product_search → cart_op → refund (auto) → refund (escalate)
05 · Target metrics

The reranker should pay for itself at +0.08 P@5.

Eval set: 100 ground-truth queries (30 lookup · 30 semantic · 20 compare · 20 multi-turn). Numbers below are targets derived from published RAG benchmarks and the spec's success criteria. Real numbers replace these once eval/runs/*.json is committed.

Retrieval benchmark — 4 systems × 100 queries

System                           P@5     Recall@10   nDCG@10   Faithfulness   p95 latency   $ / 1k queries
BM25 only (baseline)             0.612   0.704       0.658     —              180 ms        $0.04
Dense only (Qdrant + voyage-3)   0.678   0.741       0.712     —              240 ms        $0.18
Hybrid + RRF (k=60)              0.741   0.812       0.788     —              290 ms        $0.22
Hybrid + Cohere rerank v3 ★      0.823   0.849       0.844     0.91           480 ms        $1.21

All values are targets; ★ marks the target winner.
06 · Stack & decisions

Every choice has a "why this, not that".

Pinned versions from pyproject.toml. Five architectural decisions are documented in docs/decisions.md — for each, the trade-off is explicit and the rejected alternative is named.

Stack — pinned versions

Runtime
Python 3.11 · FastAPI 0.115.0 · uvicorn 0.32.0 · Pydantic 2.9.2
Orchestration
LangGraph 0.2.45 · langchain-anthropic 0.2.4 · anthropic 0.39.0
Retrieval
voyageai 0.3.2 · qdrant-client 1.12.0 · pgvector 0.3.6 · chromadb 0.5.13 · rank-bm25 0.2.2 · sentence-transformers 3.2.1
Storage
pgvector/pgvector:pg16 · qdrant/qdrant:v1.12.0 · redis:7-alpine · psycopg 3.2.3
Eval & quality
RAGAS 0.2.6 · pytest 8.3.3 · mypy --strict · ruff 0.6.9 · black 24.10.0

Decisions log

LangGraph over CrewAI
Explicit state, Postgres checkpoints, time-travel debugging. CrewAI hides the state machine — a bad fit for audit-grade workflows.
RRF over linear fusion
RRF (k=60) is rank-agnostic and needs no score calibration between BM25 and cosine scores. Linear fusion requires query-by-query weight tuning.
Cohere v3 over local ms-marco
Cohere is expected to win by +4 nDCG points on the eval set; ms-marco-MiniLM is the fallback if API cost becomes a constraint. Both are wired and toggleable.
Streamlit over Next.js
For a portfolio demo, Streamlit ships in a day with built-in state. Next.js becomes worth it once the product needs auth, payments, and real users.
Policy as code over LLM-as-judge
Refund policy encoded as Python (value > $100 OR sensitive_category → escalate). Deterministic, auditable, testable. LLM-as-judge is for evaluation, not policy.
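That policy fits in a handful of deterministic, unit-testable lines. A minimal sketch; the category set and names like RefundRequest and needs_human are illustrative, not the repo's actual code:

```python
from dataclasses import dataclass

# Illustrative set; the real sensitive-category list lives in config.
SENSITIVE_CATEGORIES = {"alcohol", "pharmacy"}

@dataclass
class RefundRequest:
    order_value: float
    category: str

def needs_human(req: RefundRequest, threshold: float = 100.0) -> bool:
    """Deterministic escalation rule from the decisions log:
    value > $100 OR sensitive category → route to the human supervisor."""
    return req.order_value > threshold or req.category in SENSITIVE_CATEGORIES
```

Because the rule is plain Python, every branch is covered by a one-line pytest assertion and every escalation is explainable after the fact.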
07 · Roadmap to v1.0.0

Eleven checkpoints between scaffold and a metric-backed release.

Taken directly from 01-ecommerce-assistant/README.md on main. Status reflects what's committed to the repo.

  1. 01 · Data pipeline — scripts/download_instacart.py pulls the Kaggle dataset, builds processed parquet
  2. 02 · Vector ingestion — embed name + aisle + dept with voyage-3, index into Qdrant + pgvector + Chroma in parallel
  3. 03 · Hybrid retrieval module — rank-bm25 sparse + dense + RRF fusion
  4. 04 · Reranking — local ms-marco-MiniLM-L-6-v2 cross-encoder shipped; Cohere v3 wrapper drops in with a key
  5. 05 · LangGraph agent — Intent Router → Retriever → Cart / Refund handlers → Synthesizer (all 5 nodes in src/agents/)
  6. 06 · Eval set — 100 hand-labeled queries committed under data/eval/ with relevance scores
  7. 07 · RAGAS evaluation — src/eval/runner.py computes Faithfulness, Answer Relevance, Context Precision, Context Recall
  8. 08 · Vector DB benchmark — Qdrant / pgvector / Chroma scored on P@5, Recall@10, nDCG, p95, $/1K; report in docs/benchmark.md
  9. 09 · Streamlit demo — multi-turn chat with 5 pre-baked queries, deployed to community cloud
  10. 10 · Observability — LangSmith hooks wired in the orchestrator; trace gallery seeded with sample runs
  11. 11 · Definition of Done — all 12 universal blocks + project-specific items reviewed and tracked in docs/
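Step 02's embedding input can be sketched as a one-line field concatenation per product. The field names below assume the joined Instacart tables (product_name plus resolved aisle and department); the actual key names in the processed parquet may differ:

```python
def embedding_text(product: dict) -> str:
    """Build the single string per product that step 02 sends to voyage-3:
    product name plus its aisle and department for extra semantic context."""
    return (
        f"{product['product_name']} | "
        f"aisle: {product['aisle']} | "
        f"department: {product['department']}"
    )

doc = embedding_text({
    "product_name": "Organic Whole Milk",
    "aisle": "milk",
    "department": "dairy eggs",
})
```

Including aisle and department in the embedded text is what lets a query like "breakfast" match products whose names never contain the word.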
Next project →

P02 · Customer Support Triage Agent

DistilBERT + LoRA fine-tune · Claude reasoning · published to HF Hub