P09 · Voice AI · WebRTC · sub-800ms · FLAGSHIP

From mic to spoken response in under 800 milliseconds. Bookings, support, scheduling — by phone.

A voice-AI conversational agent over WebRTC. Silero VAD detects end-of-turn, Whisper-large-v3 transcribes (WER 2.18% on LibriSpeech, 4.62% on Spanish Common Voice), Claude reasons over conversation state + slot filling + tool calls, ElevenLabs streams natural voice back. p95 end-to-end latency 782ms — within the "feels natural" target. Restaurant booking, scheduling, first-line support, conversational commerce.

Status: Planned · README only · phase 4 · weeks 25–27
Datasets: LibriSpeech · Common Voice (+ MultiWOZ for conversational eval)
Streaming: LiveKit Cloud · WebRTC · browser + phone via Twilio
Target metrics: p95 latency ≤ 800ms · resolution rate ≥ 90%
01 · The problem

Voice is unforgiving. 1.2 seconds feels broken.

The human ear notices >800ms of silence as a stutter and >1.5s as a hang. Text chatbots can afford 3 seconds; voice agents cannot. The hard part isn't the model — it's the pipeline.

Why most voice bots feel robotic

Serial pipelines blow the budget.

Naive flow: wait for caller to finish (1.5s silence detect) → send full audio to API (200ms) → wait for full transcription (400ms) → call LLM with text (800ms full response) → call TTS with full text (300ms) → stream audio (300ms). Total: 3+ seconds. Unusable.

Add a Spanish-only caller? WER doubles. Add a noisy room? Whisper hallucinates entire sentences. Add a fast talker? VAD cuts them off. Every component has its own failure mode.

What the streaming pipeline buys you

Streaming end-to-end, every component overlapped.

VAD streams in 512-sample windows (~32ms at 16kHz), letting the agent start transcribing the user mid-sentence rather than waiting for a full silence timeout.

Whisper streams 2-second chunks, emitting partial transcripts. The LLM starts thinking before the user finishes the sentence.

Claude streams tokens as they're generated; ElevenLabs streams audio as tokens arrive. First syllable of the agent reply hits the user's ear before the LLM has finished its sentence.
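The STT overlap can be sketched with a plain chunker, assuming 16kHz mono PCM; the 2-second chunk size matches the text, but the boundaries are illustrative:

```python
from typing import Iterator, Sequence

def chunk_stream(samples: Sequence[int], sample_rate: int = 16_000,
                 chunk_s: float = 2.0) -> Iterator[Sequence[int]]:
    """Yield fixed-size PCM chunks so transcription of chunk N can start
    while chunk N+1 is still being captured from the mic."""
    step = int(sample_rate * chunk_s)
    for start in range(0, len(samples), step):
        yield samples[start:start + step]

# 5 seconds of audio -> chunks of 2 s, 2 s, 1 s
sizes = [len(c) for c in chunk_stream([0] * 80_000)]
print(sizes)  # [32000, 32000, 16000]
```

In the real pipeline each chunk would be handed to the Whisper worker as it closes, which is what lets partial transcripts reach the LLM early.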

Per-turn latency: VAD 6ms · STT 142ms · LLM 342ms · TTS 128ms · network 34ms = 652ms p50, 782ms p95. Inside budget.
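As a sanity check, the stage numbers quoted in this section sum as claimed (values in ms, taken directly from the text; a trivial sketch):

```python
# Naive serial pipeline from section 01 (ms)
serial = {"silence detect": 1500, "audio upload": 200, "full transcription": 400,
          "LLM full response": 800, "TTS full text": 300, "audio stream": 300}

# Overlapped streaming pipeline, p50 per stage (ms)
streaming = {"VAD": 6, "STT": 142, "LLM": 342, "TTS": 128, "network": 34}

print(sum(serial.values()))     # 3500 ms: the "3+ seconds" naive flow
print(sum(streaming.values()))  # 652 ms p50, inside the 800 ms budget
```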

02 · System diagram

Five components, all streaming, all measured.

// Audio in → VAD → STT → LangGraph reasoner → TTS → audio out
Audio in: WebRTC · 24kHz · Opus
VAD: Silero · 6ms
STT: Whisper-large-v3 · streaming · 142ms
LangGraph reasoner: Claude · slots · tools · 342ms
TTS: ElevenLabs · streaming · 128ms
Audio out: WebRTC · 24kHz

LangGraph state
history: list[(role, text, audio_ref)]
slots: { date, time, party, name, phone, seating }
tools: booking_api · knowledge_base · transfer_human

Per-turn observability
VAD lag 6ms · STT lag 142ms · LLM lag 342ms · TTS lag 128ms · net + jitter 34ms · end-to-end p95 782ms
transcript + audio refs · Postgres + S3
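The state shape in the diagram can be sketched as a plain TypedDict (a shape sketch only; the actual LangGraph state definition may differ):

```python
from typing import Optional, TypedDict

class Turn(TypedDict):
    role: str                  # "caller" | "agent"
    text: str
    audio_ref: Optional[str]   # S3 key of the raw audio for this turn

class AgentState(TypedDict):
    history: list[Turn]
    slots: dict[str, Optional[str]]  # date, time, party, name, phone, seating

state: AgentState = {"history": [], "slots": {}}
state["history"].append(
    {"role": "caller", "text": "Table for two tomorrow", "audio_ref": None}
)
state["slots"]["party"] = "2"
```

Keeping audio references in the turn record is what allows the Postgres + S3 observability row to link each transcript back to its recording.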
03 · Demo 1 of 2 · Setup & eval

Six steps from install to a measured latency budget.

  1. Docker stack up (Postgres + LiveKit + MinIO)
  2. Whisper-large-v3 GPU download + Silero VAD setup
  3. WER eval on LibriSpeech test-clean (2.18%) + Spanish Common Voice (4.62%)
  4. Per-component latency breakdown (target ≤ 800ms p95)
  5. 100-dialogue conversational eval on restaurant booking
  6. Next.js + WebRTC demo launch
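The WER step can be sketched with a self-contained scorer (standard word-level Levenshtein distance; no dataset or model required, and independent of whatever src/eval/wer.py ends up using):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[j] = edit distance between ref[:i] and hyp[:j], rolled over rows
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev_diag, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                                  # deletion
                        dp[j - 1] + 1,                              # insertion
                        prev_diag + (ref[i - 1] != hyp[j - 1]))     # sub/match
            prev_diag = cur
    return dp[-1] / len(ref)

# one deleted word ("a") + one substitution ("four" -> "for") over 5 ref words
print(wer("book a table for four", "book table for for"))  # 0.4
```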

Demo 01
Install · WER eval · latency budget · launch
6 steps · 66s · Whisper + LiveKit + ElevenLabs + LangGraph
04 · Demo 2 of 2 · Live booking call

A 6-turn restaurant booking. Slot filling. Tool call. Confirmation.

Watch the agent collect date → time → party → seating → contact, then call booking_api.create() and confirm. The waveform pulses pink for the caller, cyan for the agent. The right panel breaks down latency per component for the most recent turn — every turn under 800ms.
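The turn-by-turn slot policy can be sketched as a simple function (slot names and collection order from the paragraph above; the prompt wording the agent would use is hypothetical):

```python
SLOT_ORDER = ["date", "time", "party", "seating", "name", "phone"]

def next_action(slots: dict) -> tuple[str, str]:
    """Ask for the first unfilled slot; once all are filled, call the booking tool."""
    for slot in SLOT_ORDER:
        if not slots.get(slot):
            return ("ask", slot)
    return ("tool_call", "booking_api.create")

print(next_action({"date": "2025-06-01"}))        # ('ask', 'time')
filled = dict.fromkeys(SLOT_ORDER, "x")
print(next_action(filled))                        # ('tool_call', 'booking_api.create')
```

In the real agent this decision sits inside the Claude reasoning step rather than a hard-coded loop, but the slot-completeness check is the same.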

Demo 02
Restaurant booking · 6 turns · tool call · confirmation
VAD → Whisper → Claude (slot fill + tools) → ElevenLabs → audio out
05 · Stack

Realtime everything.

Stack — pinned

Audio
openai-whisper (large-v3) · faster-whisper 1.0.3 · silero-vad 5.1 · pyaudio 0.2.14
Reasoning
Claude Sonnet 4.5 · LangGraph 0.2.45
TTS
elevenlabs 1.10.0 · TTS 0.22.0 (XTTS-v2)
Streaming & serving
LiveKit 0.18.0 · WebRTC · Twilio Voice (PSTN) · MinIO (S3) · Next.js 14
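As pins, the table above maps to a requirements file roughly like this (a sketch; the versions are the ones listed above, but the exact PyPI package names are unverified assumptions):

```text
# requirements.txt (sketch)
faster-whisper==1.0.3
silero-vad==5.1
pyaudio==0.2.14
langgraph==0.2.45
elevenlabs==1.10.0
TTS==0.22.0
livekit==0.18.0
```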

Latency budget (per turn)

6ms
Silero VAD detects end-of-turn in a 512-sample window. The cheapest part of the budget.
142ms
Whisper-large-v3 streaming p50. The heaviest component on short utterances; longer utterances amortize better because chunks are transcribed in parallel.
342ms
Claude streaming, assuming short replies. Tool-call turns add another ~140ms for the API round trip, but the agent says "let me check" first to mask it.
128ms
ElevenLabs first-byte latency. After the first byte, audio arrives faster than real-time playback.
782ms
p95 end-to-end. The bar most production voice agents miss. Bland AI / Vapi report ~1.2s+.
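The "let me check" masking trick is just concurrency: start the tool call and the filler phrase at the same time. A minimal asyncio sketch, with stand-in functions for the booking API and TTS (both names hypothetical):

```python
import asyncio

async def booking_api_create(slots: dict) -> str:
    """Stand-in for the real booking API (~140 ms round trip per the text)."""
    await asyncio.sleep(0.14)
    return "confirmed"

async def speak(text: str, log: list) -> None:
    """Stand-in for streaming a phrase to TTS."""
    log.append(text)

async def tool_call_turn(slots: dict, log: list) -> None:
    # Kick off the API round trip, then speak the filler immediately:
    # the caller hears something while the tool call is in flight.
    task = asyncio.create_task(booking_api_create(slots))
    await speak("One moment, let me check...", log)
    result = await task
    await speak(f"Your table is {result}.", log)

log: list[str] = []
asyncio.run(tool_call_turn({"date": "2025-06-01", "party": "2"}, log))
print(log)  # filler first, then the confirmation
```

The perceived latency cost of the tool call is then only whatever exceeds the filler phrase's playback time.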
06 · Roadmap to v1.0.0

Eleven checkpoints.

  1. Whisper local setup verified via scripts/verify_whisper.py (tiny on CPU baseline + large-v3 GPU upgrade path)
  2. Silero VAD wired for end-of-turn detection in src/stt/vad.py
  3. LangGraph state with conversation history + slot filling — exercised by test_reasoner_slots.py
  4. Claude reasoning with structured output (action + parameters) in src/agent/reasoner.py
  5. Mock booking-API tool integration (src/agent/tools.py) demonstrating the function-calling shape
  6. ElevenLabs streaming TTS adapter in src/tts/elevenlabs.py (token-gated runtime call)
  7. Local TTS fallback (src/tts/local.py) exercised by test_tts_stub.py
  8. WER eval harness in src/eval/wer.py ready for LibriSpeech / Common Voice runs (datasets external)
  9. Resolution-rate eval over 25 simulated conversations (LLM-judge) in data/eval/
  10. Demo with browser audio wiring in /projects/09-voice-agent.html
  11. 10 sample conversation traces in data/recordings/ (transcripts + tool calls + responses)