P03 · Agent loop · Plan-Execute-Reflect

For each target company, the agent plans research, runs it, reflects on gaps, and writes outreach that doesn't read like a template.

A plan-execute-reflect loop over the public web. The agent researches a company across Tavily, Hacker News, and the live site; merges findings into a Pydantic-validated CompanyProfile; extracts the pain point, hiring signal, and recent-news hook; and writes a personalized outreach email that Claude Opus scores on personalization, accuracy, and CTA clarity. Batched at 8 workers, 100 YC companies finish in about 10 minutes.

Status
Planned · README only · phase 1 · weeks 7–9
Targets
YC companies · 5,124 · batch of 100 with seed=42
Tools
Tavily · HN · live web · selectolax + httpx
Target metric
Personalization ≥ 4.5/5 · vs 1.4 (template) · 2.7 (single-search)
01 · The problem

Single-pass research writes outreach that gets ignored.

Sales tools that send one query to an LLM with "Write me an email to Acme Corp" produce templates with the name swapped. The difference between 1.4 and 4.5 on a 5-point personalization scale is whether the recipient replies.

Why one-shot fails

The LLM doesn't know what it doesn't know.

Without a reflection step, a single Tavily search returns 8 generic hits. The model synthesizes whatever's at the top — usually a press release from two years ago. The email reads "I saw your Series B" when the company just closed Series D.

Worse: the model never asks itself "is this enough?" So it never goes back for the hiring page, the recent HN thread, the careers post.

What the loop adds

Plan, execute, reflect, refine — capped at 3 iterations.

Planner writes a research plan listing what to look for (funding, hiring, recent product, technical posture).

Executor runs tool calls in parallel: Tavily for news, HN for technical signal, website fetch for primary sources.

Reflector reads the harvest and asks: "Do I know enough? Are there contradictions? Is the recent news from this quarter or 2023?" If gaps remain, it sends refined queries back to the Executor. Max 3 loops keeps cost bounded.
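A minimal sketch of that cycle as a LangGraph StateGraph, assuming LangGraph 0.2.x. The state fields and node bodies are illustrative stubs; only the planner → executor → reflector wiring, the gap-driven loop-back, and the 3-iteration cap come from the design above.

```python
# Sketch of the plan-execute-reflect cycle, assuming LangGraph 0.2.x.
# State fields and node bodies are illustrative stubs, not the shipped code.
from typing import TypedDict

from langgraph.graph import END, START, StateGraph

MAX_ITERATIONS = 3  # hard cap: keeps cost bounded

class ResearchState(TypedDict):
    company: str
    plan: list[str]       # queries the Planner wants answered
    findings: list[dict]  # raw tool results accumulated by the Executor
    gaps: list[str]       # what the Reflector still wants
    iteration: int

def planner(state: ResearchState) -> dict:
    # Claude drafts what to look for: funding, hiring, recent product, technical posture.
    return {"plan": [f"{state['company']} funding", f"{state['company']} hiring"]}

def executor(state: ResearchState) -> dict:
    # Would fan out Tavily / HN / website calls in parallel and append to findings.
    return {"iteration": state["iteration"] + 1}

def reflector(state: ResearchState) -> dict:
    # Claude reviews the harvest: enough? contradictions? news from this quarter?
    return {"gaps": []}  # empty list means "proceed"

def route(state: ResearchState) -> str:
    # Loop back with refined queries while gaps remain and the cap is not hit.
    return "executor" if state["gaps"] and state["iteration"] < MAX_ITERATIONS else "profile"

graph = StateGraph(ResearchState)
graph.add_node("planner", planner)
graph.add_node("executor", executor)
graph.add_node("reflector", reflector)
graph.add_node("profile", lambda state: {})  # CompanyProfile extraction, elided here
graph.add_edge(START, "planner")
graph.add_edge("planner", "executor")
graph.add_edge("executor", "reflector")
graph.add_conditional_edges("reflector", route, ["executor", "profile"])
graph.add_edge("profile", END)
app = graph.compile()

result = app.invoke({"company": "Acme", "plan": [], "findings": [], "gaps": [], "iteration": 0})
```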

Net result: 4.4 accuracy and 4.6 personalization at $0.04 per email, beating the single-search baseline by +1.9 points of personalization.

02 · System diagram

One loop, three tools, structured output.

Loop body is a LangGraph cycle. The Reflector decides whether to refine or proceed. Profile + Email are Pydantic models — invalid outputs are retried, not silently coerced.

// LangGraph cycle · planner → executor → reflector → (loop) → profile → email

Target: company name · website · industry
→ Planner (Claude): research plan
→ Executor (parallel): Tavily · HN · website fetch (httpx + selectolax)
→ Reflector: enough? gaps? contradictions? · max 3 loops · refine query ↺ / proceed ↓
→ CompanyProfile (Pydantic): funding · team · pain · hiring · news · validated or retried
→ Personalization Extractor: pick 1 hook + 1 pain + 1 signal
→ Email Writer: subject · hook · value · CTA
→ LLM-as-judge (Claude Opus): personalization · accuracy · CTA
→ Persistence + Audit (Postgres): company_profile · outreach_email · research_log · LangSmith trace_id
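A sketch of the two structured outputs and the validate-or-retry step, assuming Pydantic 2.x. Field names follow the diagram; the retry helper, its attempt count, and the regenerate callback are assumptions.

```python
# Sketch of the structured outputs, assuming Pydantic 2.x.
# Field names follow the diagram; optionality and the retry helper are assumptions.
from pydantic import BaseModel, ValidationError

class CompanyProfile(BaseModel):
    name: str
    funding: str | None = None        # e.g. "Series D, closed this quarter"
    team: str | None = None
    pain_point: str | None = None
    hiring_signal: str | None = None
    recent_news: str | None = None

class OutreachEmail(BaseModel):
    subject: str
    hook: str   # the one personalization hook
    value: str  # why this product, for this pain
    cta: str

def parse_or_retry(raw_json: str, regenerate, attempts: int = 2) -> OutreachEmail:
    """Validate LLM output; on failure, regenerate rather than silently coerce."""
    for _ in range(attempts):
        try:
            return OutreachEmail.model_validate_json(raw_json)
        except ValidationError as err:
            raw_json = regenerate(error=str(err))  # hypothetical callback: re-prompt with the error
    raise ValueError("LLM output failed validation after retries")
```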
03 · Demo 1 of 2 · Batch run

Setup, sample 100 YC companies, run them, score the outputs.

End-to-end: env config, dependency install, YC dataset load, 8-worker parallel batch through plan-execute-reflect with up to 3 loops per company, LLM-as-judge scoring with Claude Opus, persistence to Postgres, traces pushed to LangSmith, Next.js gallery launch.
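A minimal sketch of the 8-worker batch with the seeded sample, assuming an async per-company entrypoint; src/eval/runner.py is still planned, so the names here are placeholders.

```python
# Sketch of the 8-worker batch run. research_company and yc_companies are
# placeholders for the planned src/eval/runner.py and dataset loader.
import asyncio
import random

async def research_company(name: str) -> dict:
    # Stand-in for a full plan-execute-reflect run (max 3 loops, ~18s p95).
    await asyncio.sleep(0)
    return {"company": name}

async def run_batch(companies: list[str], workers: int = 8) -> list[dict]:
    sem = asyncio.Semaphore(workers)  # 8 concurrent companies, per the setup above

    async def bounded(name: str) -> dict:
        async with sem:
            return await research_company(name)

    return await asyncio.gather(*(bounded(c) for c in companies))

# "batch of 100 with seed=42": a reproducible sample of the 5,124-company dataset.
yc_companies = [f"yc-{i}" for i in range(5124)]  # placeholder rows
sample = random.Random(42).sample(yc_companies, 100)
results = asyncio.run(run_batch(sample))
```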

Demo 01
Batch · 100 companies · 10.7 minutes
7 steps · 58s · uv + Tavily + LangGraph + Claude Opus judge
04 · Demo 2 of 2 · Agent loop in motion

Two companies, end-to-end — research log, profile, email.

Watch Tavily / HN / website queries fire on the left, the CompanyProfile fill field-by-field in the middle, and the personalized outreach stream into the right pane. Then the LLM judge scores it. Anthropic processed first, Linear second.

Demo 02
Plan → Execute → Reflect → Profile → Email → Score
2 companies · 4–6 reflection loops · LLM-as-judge stamps
05 · Stack

Public web in, structured outputs out.

Stack — pinned

Agent loop
LangGraph 0.2.45 · Claude Sonnet 4.5 · Claude Opus (judge) · Pydantic 2.9.2
Tools
tavily-python 0.5.0 · HN Algolia API · httpx 0.27.2 · selectolax 0.3.27
Storage
Postgres 16 · LangSmith traces · Next.js 14
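A sketch of the tool layer with the pinned libraries: an httpx + selectolax page fetch and the public HN Algolia search endpoint. Timeouts, redirect handling, and the tag-stripping choice are assumptions.

```python
# Sketch of the tool layer: httpx fetch + selectolax parse, plus HN Algolia search.
import httpx
from selectolax.parser import HTMLParser

async def fetch_page_text(url: str) -> str:
    """Fetch a page and reduce it to visible text (primary-source signal)."""
    async with httpx.AsyncClient(timeout=10, follow_redirects=True) as client:
        resp = await client.get(url)
        resp.raise_for_status()
    tree = HTMLParser(resp.text)
    for tag in ("script", "style"):  # drop non-content nodes before extracting text
        for node in tree.css(tag):
            node.decompose()
    return tree.body.text(separator="\n") if tree.body else ""

async def search_hn(query: str, hits: int = 5) -> list[dict]:
    """Technical signal: recent Hacker News stories via the public Algolia API."""
    async with httpx.AsyncClient(timeout=10) as client:
        resp = await client.get(
            "https://hn.algolia.com/api/v1/search",
            params={"query": query, "tags": "story", "hitsPerPage": hits},
        )
        resp.raise_for_status()
    return resp.json()["hits"]
```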

Why this loop, not a one-shot

+1.9 · Personalization score over the single-search baseline on the 100-company eval set, judged by Claude Opus with an explicit 5-point rubric (sketch below).
2.13 · Avg reflection loops per company (cap = 3). Past loop 2, the marginal accuracy gain is < 0.1; diminishing returns documented.
$0.041 · Per company: Tavily ($0.001 × 12 calls) + Claude tokens ($0.029). Still < $5 per 100 leads.
18.4s · p95 wall time per company. Parallelizable to 8 workers without hitting Tavily free-tier rate limits.
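A sketch of the judge call behind these numbers, assuming the anthropic Python SDK. The rubric wording, JSON score schema, and exact model ID are assumptions; the text pins only Claude Opus, a 5-point rubric, and the three axes.

```python
# Sketch of the LLM-as-judge scorer, assuming the anthropic Python SDK.
import json

import anthropic

RUBRIC = """Score this outreach email 1-5 on each axis. Reply with JSON only:
{"personalization": n, "accuracy": n, "cta_clarity": n}
5 = cites a detail only this company matches; 1 = name-swapped template."""

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge(email: str, profile_json: str) -> dict[str, int]:
    resp = client.messages.create(
        model="claude-opus-4-20250514",  # "Claude Opus" per the text; exact ID assumed
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"{RUBRIC}\n\nProfile:\n{profile_json}\n\nEmail:\n{email}",
        }],
    )
    return json.loads(resp.content[0].text)  # malformed JSON would be retried upstream
```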
06 · Roadmap to v1.0.0

Ten checkpoints.

  1. YC companies dataset loader in scripts/load_yc.py with ~5K-row schema and metadata fields
  2. Tavily wrapper with retry + rate-limit handling (src/web/tavily_client.py) · see the retry sketch after this list
  3. selectolax HTML fetcher in src/web/fetcher.py
  4. LangGraph plan-execute-reflect loop (src/agents/orchestrator.py) with configurable max_iterations
  5. Pydantic v2 schemas: CompanyProfile, OutreachEmail in src/schemas/
  6. Quality scorer using Claude as LLM judge with an explicit rubric (src/eval/judge.py)
  7. Eval set of 200 sample companies with manually spot-checked ground truth (data/eval/)
  8. Batch processing pipeline (src/eval/runner.py); results persist to Postgres via SQLAlchemy
  9. Animated lead-qualification gallery in /projects/03-sales-intelligence.html
  10. LangSmith trace hooks wired; sample run logs in docs/trace_gallery.md
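For checkpoint 02, a retry-wrapper sketch around tavily-python. tenacity is not in the pinned stack, and the backoff policy, attempt count, and result shape are assumptions about the eventual src/web/tavily_client.py.

```python
# Sketch of the Tavily wrapper (checkpoint 02), assuming tavily-python 0.5.x.
# tenacity is not in the pinned stack; backoff policy and limits are assumptions.
import os

from tavily import TavilyClient
from tenacity import retry, stop_after_attempt, wait_exponential

client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])

@retry(wait=wait_exponential(min=1, max=30), stop=stop_after_attempt(5))
def tavily_search(query: str, max_results: int = 8) -> list[dict]:
    """One Tavily search call with exponential backoff on transient failures."""
    return client.search(query, max_results=max_results)["results"]
```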
Next project →

P04 · Document Intelligence Pipeline

Layout analysis · OCR · Claude Vision fallback · confidence-gated routing for IDP