P03 · Agent loop · Plan-Execute-Reflect

For each target company, the agent plans research, runs it, reflects on gaps, and writes outreach that doesn't read like a template.

A plan-execute-reflect loop over the public web. The agent researches a company across Tavily, Hacker News, and the live site; merges findings into a Pydantic-validated CompanyProfile; extracts the pain point, hiring signal, and recent-news hook; and writes a personalized outreach email that Claude Opus scores on personalization, accuracy, and CTA clarity. Batched at 8 workers, 100 YC companies finish in about 10 minutes.

Status
Planned · README only · phase 1 · weeks 7–9
Targets
YC companies · 5,124 · batch of 100 with seed=42
Tools
Tavily · HN · live web · selectolax + httpx
Target metric
Personalization ≥ 4.5/5 · vs 1.4 (template) · 2.7 (single-search)
01 · The problem

Single-pass research writes outreach that gets ignored.

Sales tools that send one query to an LLM with "Write me an email to Acme Corp" produce templates with the name swapped. The difference between 1.4 and 4.5 on a 5-point personalization scale is whether the recipient replies.

Why one-shot fails

The LLM doesn't know what it doesn't know.

Without a reflection step, a single Tavily search returns 8 generic hits. The model synthesizes whatever's at the top — usually a press release from two years ago. The email reads "I saw your Series B" when the company just closed Series D.

Worse: the model never asks itself "is this enough?" So it never goes back for the hiring page, the recent HN thread, the careers post.

What the loop adds

Plan, execute, reflect, refine — capped at 3 iterations.

Planner writes a research plan listing what to look for (funding, hiring, recent product, technical posture).

Executor runs tool calls in parallel: Tavily for news, HN for technical signal, website fetch for primary sources.

Reflector reads the harvest and asks: "Do I know enough? Are there contradictions? Is the recent news from this quarter or 2023?" If gaps remain, it sends refined queries back to the Executor. Max 3 loops keeps cost bounded.
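A minimal sketch of that cycle as a LangGraph StateGraph, assuming LangGraph 0.2.x. The state fields and node bodies are illustrative stubs; only the planner → executor → reflector wiring, the gap-driven loop-back, and the 3-iteration cap come from the design above.

```python
# Sketch of the plan-execute-reflect cycle, assuming LangGraph 0.2.x.
# State fields and node bodies are illustrative stubs, not the shipped code.
from typing import TypedDict

from langgraph.graph import END, START, StateGraph

MAX_ITERATIONS = 3  # hard cap: keeps cost bounded

class ResearchState(TypedDict):
    company: str
    plan: list[str]       # queries the Planner wants answered
    findings: list[dict]  # raw tool results accumulated by the Executor
    gaps: list[str]       # what the Reflector still wants
    iteration: int

def planner(state: ResearchState) -> dict:
    # Claude drafts what to look for: funding, hiring, recent product, technical posture.
    return {"plan": [f"{state['company']} funding", f"{state['company']} hiring"]}

def executor(state: ResearchState) -> dict:
    # Would fan out Tavily / HN / website calls in parallel and append to findings.
    return {"iteration": state["iteration"] + 1}

def reflector(state: ResearchState) -> dict:
    # Claude reviews the harvest: enough? contradictions? news from this quarter?
    return {"gaps": []}  # empty list means "proceed"

def route(state: ResearchState) -> str:
    # Loop back with refined queries while gaps remain and the cap is not hit.
    return "executor" if state["gaps"] and state["iteration"] < MAX_ITERATIONS else "profile"

graph = StateGraph(ResearchState)
graph.add_node("planner", planner)
graph.add_node("executor", executor)
graph.add_node("reflector", reflector)
graph.add_node("profile", lambda state: {})  # CompanyProfile extraction, elided here
graph.add_edge(START, "planner")
graph.add_edge("planner", "executor")
graph.add_edge("executor", "reflector")
graph.add_conditional_edges("reflector", route, ["executor", "profile"])
graph.add_edge("profile", END)
app = graph.compile()

result = app.invoke({"company": "Acme", "plan": [], "findings": [], "gaps": [], "iteration": 0})
```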

Net result: 4.4 accuracy and 4.6 personalization at $0.04 per email, beating the single-search baseline by +1.9 points of personalization.

02 · System diagram

One loop, three tools, structured output.

Loop body is a LangGraph cycle. The Reflector decides whether to refine or proceed. Profile + Email are Pydantic models — invalid outputs are retried, not silently coerced.

// LangGraph cycle · planner → executor → reflector → (loop) → profile → email

Target: company name · website · industry
→ Planner (Claude): research plan
→ Executor (parallel): Tavily · HN · website fetch (httpx + selectolax)
→ Reflector: enough? gaps? contradictions? · max 3 loops · refine query ↺ / proceed ↓
→ CompanyProfile (Pydantic): funding · team · pain · hiring · news · validated or retried
→ Personalization Extractor: pick 1 hook + 1 pain + 1 signal
→ Email Writer: subject · hook · value · CTA
→ LLM-as-judge (Claude Opus): personalization · accuracy · CTA
→ Persistence + Audit (Postgres): company_profile · outreach_email · research_log · LangSmith trace_id
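A sketch of the two structured outputs and the validate-or-retry step, assuming Pydantic 2.x. Field names follow the diagram; the retry helper, its attempt count, and the regenerate callback are assumptions.

```python
# Sketch of the structured outputs, assuming Pydantic 2.x.
# Field names follow the diagram; optionality and the retry helper are assumptions.
from pydantic import BaseModel, ValidationError

class CompanyProfile(BaseModel):
    name: str
    funding: str | None = None        # e.g. "Series D, closed this quarter"
    team: str | None = None
    pain_point: str | None = None
    hiring_signal: str | None = None
    recent_news: str | None = None

class OutreachEmail(BaseModel):
    subject: str
    hook: str   # the one personalization hook
    value: str  # why this product, for this pain
    cta: str

def parse_or_retry(raw_json: str, regenerate, attempts: int = 2) -> OutreachEmail:
    """Validate LLM output; on failure, regenerate rather than silently coerce."""
    for _ in range(attempts):
        try:
            return OutreachEmail.model_validate_json(raw_json)
        except ValidationError as err:
            raw_json = regenerate(error=str(err))  # hypothetical callback: re-prompt with the error
    raise ValueError("LLM output failed validation after retries")
```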
03 · Demo 1 of 2 · Batch run

Setup, sample 100 YC companies, run them, score the outputs.

End-to-end: env config, dependency install, YC dataset load, 8-worker parallel batch through plan-execute-reflect with up to 3 loops per company, LLM-as-judge scoring with Claude Opus, persistence to Postgres, traces pushed to LangSmith, Next.js gallery launch.
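A minimal sketch of the 8-worker batch with the seeded sample, assuming an async per-company entrypoint; src/eval/runner.py is still planned, so the names here are placeholders.

```python
# Sketch of the 8-worker batch run. research_company and yc_companies are
# placeholders for the planned src/eval/runner.py and dataset loader.
import asyncio
import random

async def research_company(name: str) -> dict:
    # Stand-in for a full plan-execute-reflect run (max 3 loops, ~18s p95).
    await asyncio.sleep(0)
    return {"company": name}

async def run_batch(companies: list[str], workers: int = 8) -> list[dict]:
    sem = asyncio.Semaphore(workers)  # 8 concurrent companies, per the setup above

    async def bounded(name: str) -> dict:
        async with sem:
            return await research_company(name)

    return await asyncio.gather(*(bounded(c) for c in companies))

# "batch of 100 with seed=42": a reproducible sample of the 5,124-company dataset.
yc_companies = [f"yc-{i}" for i in range(5124)]  # placeholder rows
sample = random.Random(42).sample(yc_companies, 100)
results = asyncio.run(run_batch(sample))
```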

Demo 01
Batch · 100 companies · 10.7 minutes
7 steps · 58s · uv + Tavily + LangGraph + Claude Opus judge
04 · Demo 2 of 2 · Agent loop in motion

Two companies, end-to-end — research log, profile, email.

Watch Tavily / HN / website queries fire on the left, the CompanyProfile fill field-by-field in the middle, and the personalized outreach stream into the right pane. Then the LLM judge scores it. Anthropic processed first, Linear second.

Demo 02
Plan → Execute → Reflect → Profile → Email → Score
2 companies · 4–6 reflection loops · LLM-as-judge stamps
05 · Stack

Public web in, structured outputs out.

Stack — pinned

Agent loop
LangGraph 0.2.45 · Claude Sonnet 4.5 · Claude Opus (judge) · Pydantic 2.9.2
Tools
tavily-python 0.5.0 · HN Algolia API · httpx 0.27.2 · selectolax 0.3.27
Storage
Postgres 16 · LangSmith traces · Next.js 14
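A sketch of the tool layer with the pinned libraries: an httpx + selectolax page fetch and the public HN Algolia search endpoint. Timeouts, redirect handling, and the tag-stripping choice are assumptions.

```python
# Sketch of the tool layer: httpx fetch + selectolax parse, plus HN Algolia search.
import httpx
from selectolax.parser import HTMLParser

async def fetch_page_text(url: str) -> str:
    """Fetch a page and reduce it to visible text (primary-source signal)."""
    async with httpx.AsyncClient(timeout=10, follow_redirects=True) as client:
        resp = await client.get(url)
        resp.raise_for_status()
    tree = HTMLParser(resp.text)
    for tag in ("script", "style"):  # drop non-content nodes before extracting text
        for node in tree.css(tag):
            node.decompose()
    return tree.body.text(separator="\n") if tree.body else ""

async def search_hn(query: str, hits: int = 5) -> list[dict]:
    """Technical signal: recent Hacker News stories via the public Algolia API."""
    async with httpx.AsyncClient(timeout=10) as client:
        resp = await client.get(
            "https://hn.algolia.com/api/v1/search",
            params={"query": query, "tags": "story", "hitsPerPage": hits},
        )
        resp.raise_for_status()
    return resp.json()["hits"]
```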

Why this loop, not a one-shot

+1.9 · Personalization score over the single-search baseline on the 100-company eval set, judged by Claude Opus with an explicit 5-point rubric (sketch below).
2.13 · Avg reflection loops per company (cap = 3). Past loop 2, the marginal accuracy gain is < 0.1; diminishing returns documented.
$0.041 · Per company: Tavily ($0.001 × 12 calls) + Claude tokens ($0.029). Still < $5 per 100 leads.
18.4s · p95 wall time per company. Parallelizable to 8 workers without hitting Tavily free-tier rate limits.
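A sketch of the judge call behind these numbers, assuming the anthropic Python SDK. The rubric wording, JSON score schema, and exact model ID are assumptions; the text pins only Claude Opus, a 5-point rubric, and the three axes.

```python
# Sketch of the LLM-as-judge scorer, assuming the anthropic Python SDK.
import json

import anthropic

RUBRIC = """Score this outreach email 1-5 on each axis. Reply with JSON only:
{"personalization": n, "accuracy": n, "cta_clarity": n}
5 = cites a detail only this company matches; 1 = name-swapped template."""

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge(email: str, profile_json: str) -> dict[str, int]:
    resp = client.messages.create(
        model="claude-opus-4-20250514",  # "Claude Opus" per the text; exact ID assumed
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"{RUBRIC}\n\nProfile:\n{profile_json}\n\nEmail:\n{email}",
        }],
    )
    return json.loads(resp.content[0].text)  # malformed JSON would be retried upstream
```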
06 · Roadmap to v1.0.0

Ten checkpoints.

  1. YC companies dataset loader in scripts/load_yc.py with ~5K-row schema and metadata fields
  2. Tavily wrapper with retry + rate-limit handling (src/web/tavily_client.py) · see the retry sketch after this list
  3. selectolax HTML fetcher in src/web/fetcher.py
  4. LangGraph plan-execute-reflect loop (src/agents/orchestrator.py) with configurable max_iterations
  5. Pydantic v2 schemas: CompanyProfile, OutreachEmail in src/schemas/
  6. Quality scorer using Claude as LLM judge with an explicit rubric (src/eval/judge.py)
  7. Eval set of 200 sample companies with manually spot-checked ground truth (data/eval/)
  8. Batch processing pipeline (src/eval/runner.py); results persist to Postgres via SQLAlchemy
  9. Animated lead-qualification gallery in /projects/03-sales-intelligence.html
  10. LangSmith trace hooks wired; sample run logs in docs/trace_gallery.md
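For checkpoint 02, a retry-wrapper sketch around tavily-python. tenacity is not in the pinned stack, and the backoff policy, attempt count, and result shape are assumptions about the eventual src/web/tavily_client.py.

```python
# Sketch of the Tavily wrapper (checkpoint 02), assuming tavily-python 0.5.x.
# tenacity is not in the pinned stack; backoff policy and limits are assumptions.
import os

from tavily import TavilyClient
from tenacity import retry, stop_after_attempt, wait_exponential

client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])

@retry(wait=wait_exponential(min=1, max=30), stop=stop_after_attempt(5))
def tavily_search(query: str, max_results: int = 8) -> list[dict]:
    """One Tavily search call with exponential backoff on transient failures."""
    return client.search(query, max_results=max_results)["results"]
```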
Next project →

P04 · Document Intelligence Pipeline

Layout analysis · OCR · Claude Vision fallback · confidence-gated routing for IDP