A voice-AI conversational agent over WebRTC. Silero VAD detects end-of-turn, Whisper-large-v3 transcribes (WER 2.18% on LibriSpeech, 4.62% on Spanish Common Voice), Claude reasons over conversation state + slot filling + tool calls, ElevenLabs streams natural voice back. p95 end-to-end latency 782ms — within the "feels natural" target. Restaurant booking, scheduling, first-line support, conversational commerce.
The human ear notices >800ms of silence as a stutter and >1.5s as a hang. Text chatbots can afford 3 seconds; voice agents cannot. The hard part isn't the model — it's the pipeline.
Naive flow: wait for caller to finish (1.5s silence detect) → send full audio to API (200ms) → wait for full transcription (400ms) → call LLM with text (800ms full response) → call TTS with full text (300ms) → stream audio (300ms). Total: ~3.5 seconds. Unusable.
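The naive budget is just the sum of its stages, since nothing overlaps. A quick sanity check, using the stage estimates above (the stage names are labels chosen here, not anything from the codebase):

```python
# Naive sequential pipeline: each stage blocks on the previous one,
# so end-to-end latency is the plain sum. Timings in milliseconds,
# taken from the estimates in the text.
NAIVE_STAGES = {
    "silence_detect": 1500,     # wait for 1.5s of silence
    "audio_upload": 200,
    "full_transcription": 400,
    "llm_full_response": 800,
    "tts_full_text": 300,
    "audio_streaming": 300,
}

def total_latency_ms(stages: dict) -> int:
    """Sequential pipeline latency: the sum of all stage timings."""
    return sum(stages.values())
```

`total_latency_ms(NAIVE_STAGES)` comes to 3500ms, more than four times the 800ms budget, which is why every stage below is converted to a stream.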
Add a Spanish-only caller? WER doubles. Add a noisy room? Whisper hallucinates entire sentences. Add a fast talker? VAD cuts them off. Every component has its own failure mode.
VAD streams in 512-sample windows (~32ms), letting the agent start transcribing the user mid-sentence rather than waiting for silence.
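A minimal sketch of the windowing and end-of-turn logic. `vad_prob` is a placeholder for the per-window speech probability; with Silero VAD that would be the model's output for the window, but any callable mapping a window to a probability works for illustration:

```python
from typing import Callable, Iterator, Sequence

SAMPLE_RATE = 16_000
WINDOW = 512                  # 512 samples at 16 kHz ≈ 32 ms per window
END_OF_TURN_WINDOWS = 10      # ~320 ms of consecutive silence ends the turn

def stream_windows(audio: Sequence[float]) -> Iterator[Sequence[float]]:
    """Yield fixed-size ~32 ms windows from a mono 16 kHz buffer."""
    for start in range(0, len(audio) - WINDOW + 1, WINDOW):
        yield audio[start:start + WINDOW]

def detect_end_of_turn(
    audio: Sequence[float],
    vad_prob: Callable[[Sequence[float]], float],
) -> bool:
    """True once END_OF_TURN_WINDOWS consecutive windows look silent.

    `vad_prob` stands in for the VAD model's speech probability per window.
    Because windows are scored as they arrive, downstream transcription can
    start mid-utterance instead of waiting for the turn to end.
    """
    silent_run = 0
    for win in stream_windows(audio):
        silent_run = silent_run + 1 if vad_prob(win) < 0.5 else 0
        if silent_run >= END_OF_TURN_WINDOWS:
            return True
    return False
```

The threshold of 10 silent windows (~320ms) is an illustrative choice; a production agent would tune it against the fast-talker cutoff failure mode mentioned above.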
Whisper streams 2-second chunks, emitting partial transcripts. The LLM starts thinking before the user finishes the sentence.
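The chunked-transcription loop can be sketched as follows; `transcribe` is a stand-in for a Whisper call on the audio accumulated so far (re-sending the full prefix each time keeps the decoder's context intact; a real implementation would read from a live microphone buffer):

```python
from typing import Callable, Iterator, Sequence

SAMPLE_RATE = 16_000
CHUNK_SECONDS = 2
CHUNK = SAMPLE_RATE * CHUNK_SECONDS   # 32,000 samples per partial-transcript call

def partial_transcripts(
    samples: Sequence[float],
    transcribe: Callable[[Sequence[float]], str],
) -> Iterator[str]:
    """Emit a growing partial transcript every 2 s of audio.

    Each yield is the transcript of everything heard so far, so the LLM can
    begin reasoning on the partial text before the user finishes speaking.
    """
    for end in range(CHUNK, len(samples) + 1, CHUNK):
        yield transcribe(samples[:end])
```

The prefix-resend strategy trades compute for accuracy; alternatives like sliding windows with local agreement exist, but this is the simplest form of the idea.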
Claude streams tokens as they're generated; ElevenLabs streams audio as tokens arrive. First syllable of the agent reply hits the user's ear before the LLM has finished its sentence.
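The glue between the two streams is a chunker that flushes LLM tokens to the TTS engine at sentence boundaries, so synthesis of sentence one starts while sentence two is still being generated. A sketch, assuming a simple trailing-punctuation heuristic for boundaries:

```python
from typing import Iterable, Iterator

SENTENCE_ENDS = (".", "!", "?")

def tts_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Group a streamed token sequence into sentence-sized TTS requests.

    Flushing at sentence boundaries means the first audio can play before
    the LLM has finished its reply. The boundary check here (trailing
    . ! ?) is a simplification; abbreviations and numbers need real
    sentence segmentation.
    """
    buffer = []
    for tok in tokens:
        buffer.append(tok)
        if tok.rstrip().endswith(SENTENCE_ENDS):
            yield "".join(buffer)
            buffer = []
    if buffer:                 # flush any trailing partial sentence
        yield "".join(buffer)
```

Each yielded chunk would be handed to the TTS stream as its own request, with the resulting audio frames forwarded to WebRTC as they arrive.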
Per-turn latency: VAD 6ms · STT 142ms · LLM 342ms · TTS 128ms · network 34ms = 652ms p50, 782ms p95. Inside budget.
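A per-turn timer along these lines produces that breakdown; this is a generic sketch, not the project's instrumentation:

```python
import time
from contextlib import contextmanager

BUDGET_MS = 800   # p95 end-to-end target from the text

class TurnTimer:
    """Collect per-component wall-clock timings for a single turn."""

    def __init__(self) -> None:
        self.timings_ms = {}

    @contextmanager
    def stage(self, name: str):
        """Time a pipeline stage: `with timer.stage("stt"): ...`"""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings_ms[name] = (time.perf_counter() - start) * 1000

    def total_ms(self) -> float:
        return sum(self.timings_ms.values())

    def within_budget(self) -> bool:
        return self.total_ms() <= BUDGET_MS
```

Plugging in the p50 figures above (6 + 142 + 342 + 128 + 34) gives 652ms, comfortably inside the 800ms budget. Note that because stages overlap in the streaming design, summed stage timings are an upper bound on wall-clock turn latency.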
Docker stack (postgres + LiveKit + MinIO), Whisper-large-v3 GPU download, Silero VAD, WER eval on LibriSpeech test-clean (2.18%) + Spanish Common Voice (4.62%), per-component latency breakdown (target ≤ 800ms p95), 100-dialogue conversational eval on restaurant booking, Next.js + WebRTC demo launch.
Watch the agent collect date → time → party size → seating → contact, then call booking_api.create() and confirm. The waveform pulses pink for the caller, cyan for the agent. The right panel breaks down latency per component for the most recent turn — every turn under 800ms.
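The slot-filling loop behind that flow can be sketched as a small state object; the slot names mirror the demo, while the field values and the `booking_api.create()` signature are assumptions:

```python
from dataclasses import dataclass, field
from typing import Optional

# Slot order matches the demo flow: date → time → party size → seating → contact.
REQUIRED_SLOTS = ("date", "time", "party_size", "seating", "contact")

@dataclass
class BookingState:
    """Tracks which reservation slots the agent has collected so far."""
    slots: dict = field(default_factory=dict)

    def fill(self, name: str, value: str) -> None:
        self.slots[name] = value

    def next_missing(self) -> Optional[str]:
        """The next slot to ask the caller about, or None when complete."""
        for s in REQUIRED_SLOTS:
            if s not in self.slots:
                return s
        return None

    def ready_to_book(self) -> bool:
        return self.next_missing() is None
```

Once `ready_to_book()` returns True, the agent would issue the tool call — something like `booking_api.create(**state.slots)`, with the exact signature depending on the booking backend — and speak the confirmation.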