P09 · Voice AI · WebRTC · sub-800ms · FLAGSHIP

From mic to spoken response in under 800 milliseconds. Bookings, support, scheduling — by phone.

A voice-AI conversational agent over WebRTC. Silero VAD detects end-of-turn, Whisper-large-v3 transcribes (WER 2.18% on LibriSpeech, 4.62% on Spanish Common Voice), Claude reasons over conversation state + slot filling + tool calls, ElevenLabs streams natural voice back. p95 end-to-end latency 782ms — within the "feels natural" target. Restaurant booking, scheduling, first-line support, conversational commerce.

Status: Planned · README only · phase 4 · weeks 25–27
Datasets: LibriSpeech · Common Voice (+ MultiWOZ for conversational eval)
Streaming: LiveKit Cloud · WebRTC · browser + phone via Twilio
Target metrics: p95 latency ≤ 800ms · resolution rate ≥ 90%
01 · The problem

Voice is unforgiving. 1.2 seconds feels broken.

The human ear notices >800ms of silence as a stutter and >1.5s as a hang. Text chatbots can afford 3 seconds; voice agents cannot. The hard part isn't the model — it's the pipeline.

Why most voice bots feel robotic

Serial pipelines blow the budget.

Naive flow: wait for caller to finish (1.5s silence detect) → send full audio to API (200ms) → wait for full transcription (400ms) → call LLM with text (800ms full response) → call TTS with full text (300ms) → stream audio (300ms). Total: 3+ seconds. Unusable.

Add a Spanish-only caller? WER doubles. Add a noisy room? Whisper hallucinates entire sentences. Add a fast talker? VAD cuts them off. Every component has its own failure mode.

What the streaming pipeline buys you

Streaming end-to-end, every component overlapped.

VAD streams in 512-sample windows (~32ms at 16kHz), letting the agent start transcribing the user mid-sentence rather than waiting for a full silence timeout.

Whisper streams 2-second chunks, emitting partial transcripts. The LLM starts thinking before the user finishes the sentence.

Claude streams tokens as they're generated; ElevenLabs streams audio as tokens arrive. First syllable of the agent reply hits the user's ear before the LLM has finished its sentence.
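The STT overlap can be sketched with a plain chunker, assuming 16kHz mono PCM; the 2-second chunk size matches the text, but the boundaries are illustrative:

```python
from typing import Iterator, Sequence

def chunk_stream(samples: Sequence[int], sample_rate: int = 16_000,
                 chunk_s: float = 2.0) -> Iterator[Sequence[int]]:
    """Yield fixed-size PCM chunks so transcription of chunk N can start
    while chunk N+1 is still being captured from the mic."""
    step = int(sample_rate * chunk_s)
    for start in range(0, len(samples), step):
        yield samples[start:start + step]

# 5 seconds of audio -> chunks of 2 s, 2 s, 1 s
sizes = [len(c) for c in chunk_stream([0] * 80_000)]
print(sizes)  # [32000, 32000, 16000]
```

In the real pipeline each chunk would be handed to the Whisper worker as it closes, which is what lets partial transcripts reach the LLM early.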

Per-turn latency: VAD 6ms · STT 142ms · LLM 342ms · TTS 128ms · network 34ms = 652ms p50, 782ms p95. Inside budget.
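As a sanity check, the stage numbers quoted in this section sum as claimed (values in ms, taken directly from the text; a trivial sketch):

```python
# Naive serial pipeline from section 01 (ms)
serial = {"silence detect": 1500, "audio upload": 200, "full transcription": 400,
          "LLM full response": 800, "TTS full text": 300, "audio stream": 300}

# Overlapped streaming pipeline, p50 per stage (ms)
streaming = {"VAD": 6, "STT": 142, "LLM": 342, "TTS": 128, "network": 34}

print(sum(serial.values()))     # 3500 ms: the "3+ seconds" naive flow
print(sum(streaming.values()))  # 652 ms p50, inside the 800 ms budget
```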

02 · System diagram

Five components, all streaming, all measured.

// Audio in → VAD → STT → LangGraph reasoner → TTS → audio out
Audio in: WebRTC · 24kHz · Opus
VAD: Silero · 6ms
STT: Whisper-large-v3 · streaming · 142ms
LangGraph reasoner: Claude · slots · tools · 342ms
TTS: ElevenLabs · streaming · 128ms
Audio out: WebRTC · 24kHz

LangGraph state
history: list[(role, text, audio_ref)]
slots: { date, time, party, name, phone, seating }
tools: booking_api · knowledge_base · transfer_human

Per-turn observability
VAD lag 6ms · STT lag 142ms · LLM lag 342ms · TTS lag 128ms · net + jitter 34ms · end-to-end p95 782ms
transcript + audio refs · Postgres + S3
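The state shape in the diagram can be sketched as a plain TypedDict (a shape sketch only; the actual LangGraph state definition may differ):

```python
from typing import Optional, TypedDict

class Turn(TypedDict):
    role: str                  # "caller" | "agent"
    text: str
    audio_ref: Optional[str]   # S3 key of the raw audio for this turn

class AgentState(TypedDict):
    history: list[Turn]
    slots: dict[str, Optional[str]]  # date, time, party, name, phone, seating

state: AgentState = {"history": [], "slots": {}}
state["history"].append(
    {"role": "caller", "text": "Table for two tomorrow", "audio_ref": None}
)
state["slots"]["party"] = "2"
```

Keeping audio references in the turn record is what allows the Postgres + S3 observability row to link each transcript back to its recording.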
03 · Demo 1 of 2 · Setup & eval

Six steps from install to a measured latency budget.

  1. Docker stack up (Postgres + LiveKit + MinIO)
  2. Whisper-large-v3 GPU download + Silero VAD setup
  3. WER eval on LibriSpeech test-clean (2.18%) + Spanish Common Voice (4.62%)
  4. Per-component latency breakdown (target ≤ 800ms p95)
  5. 100-dialogue conversational eval on restaurant booking
  6. Next.js + WebRTC demo launch
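The WER step can be sketched with a self-contained scorer (standard word-level Levenshtein distance; no dataset or model required, and independent of whatever src/eval/wer.py ends up using):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[j] = edit distance between ref[:i] and hyp[:j], rolled over rows
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev_diag, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                                  # deletion
                        dp[j - 1] + 1,                              # insertion
                        prev_diag + (ref[i - 1] != hyp[j - 1]))     # sub/match
            prev_diag = cur
    return dp[-1] / len(ref)

# one deleted word ("a") + one substitution ("four" -> "for") over 5 ref words
print(wer("book a table for four", "book table for for"))  # 0.4
```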

Demo 01
Install · WER eval · latency budget · launch
6 steps · 66s · Whisper + LiveKit + ElevenLabs + LangGraph
04 · Demo 2 of 2 · Live booking call

A 6-turn restaurant booking. Slot filling. Tool call. Confirmation.

Watch the agent collect date → time → party → seating → contact, then call booking_api.create() and confirm. The waveform pulses pink for the caller, cyan for the agent. The right panel breaks down latency per component for the most recent turn — every turn under 800ms.
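The turn-by-turn slot policy can be sketched as a simple function (slot names and collection order from the paragraph above; the prompt wording the agent would use is hypothetical):

```python
SLOT_ORDER = ["date", "time", "party", "seating", "name", "phone"]

def next_action(slots: dict) -> tuple[str, str]:
    """Ask for the first unfilled slot; once all are filled, call the booking tool."""
    for slot in SLOT_ORDER:
        if not slots.get(slot):
            return ("ask", slot)
    return ("tool_call", "booking_api.create")

print(next_action({"date": "2025-06-01"}))        # ('ask', 'time')
filled = dict.fromkeys(SLOT_ORDER, "x")
print(next_action(filled))                        # ('tool_call', 'booking_api.create')
```

In the real agent this decision sits inside the Claude reasoning step rather than a hard-coded loop, but the slot-completeness check is the same.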

Demo 02
Restaurant booking · 6 turns · tool call · confirmation
VAD → Whisper → Claude (slot fill + tools) → ElevenLabs → audio out
05 · Stack

Realtime everything.

Stack — pinned

Audio
openai-whisper (large-v3) · faster-whisper 1.0.3 · silero-vad 5.1 · pyaudio 0.2.14
Reasoning
Claude Sonnet 4.5 · LangGraph 0.2.45
TTS
elevenlabs 1.10.0 · TTS 0.22.0 (XTTS-v2)
Streaming & serving
LiveKit 0.18.0 · WebRTC · Twilio Voice (PSTN) · MinIO (S3) · Next.js 14
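As pins, the table above maps to a requirements file roughly like this (a sketch; the versions are the ones listed above, but the exact PyPI package names are unverified assumptions):

```text
# requirements.txt (sketch)
faster-whisper==1.0.3
silero-vad==5.1
pyaudio==0.2.14
langgraph==0.2.45
elevenlabs==1.10.0
TTS==0.22.0
livekit==0.18.0
```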

Latency budget (per turn)

6ms
Silero VAD detects end-of-turn in a 512-sample window. The cheapest part of the budget.
142ms
Whisper-large-v3 streaming p50. The heaviest component on short utterances; longer utterances amortize better because chunks are transcribed in parallel.
342ms
Claude streaming, assuming short replies. Tool-call turns add another ~140ms for the API round trip, but the agent says "let me check" first to mask it.
128ms
ElevenLabs first-byte latency. After the first byte, audio arrives faster than real-time playback.
782ms
p95 end-to-end. The bar most production voice agents miss. Bland AI / Vapi report ~1.2s+.
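The "let me check" masking trick is just concurrency: start the tool call and the filler phrase at the same time. A minimal asyncio sketch, with stand-in functions for the booking API and TTS (both names hypothetical):

```python
import asyncio

async def booking_api_create(slots: dict) -> str:
    """Stand-in for the real booking API (~140 ms round trip per the text)."""
    await asyncio.sleep(0.14)
    return "confirmed"

async def speak(text: str, log: list) -> None:
    """Stand-in for streaming a phrase to TTS."""
    log.append(text)

async def tool_call_turn(slots: dict, log: list) -> None:
    # Kick off the API round trip, then speak the filler immediately:
    # the caller hears something while the tool call is in flight.
    task = asyncio.create_task(booking_api_create(slots))
    await speak("One moment, let me check...", log)
    result = await task
    await speak(f"Your table is {result}.", log)

log: list[str] = []
asyncio.run(tool_call_turn({"date": "2025-06-01", "party": "2"}, log))
print(log)  # filler first, then the confirmation
```

The perceived latency cost of the tool call is then only whatever exceeds the filler phrase's playback time.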
06 · Roadmap to v1.0.0

Eleven checkpoints.

  1. Whisper local setup verified via scripts/verify_whisper.py (tiny on CPU baseline + large-v3 GPU upgrade path)
  2. Silero VAD wired for end-of-turn detection in src/stt/vad.py
  3. LangGraph state with conversation history + slot filling — exercised by test_reasoner_slots.py
  4. Claude reasoning with structured output (action + parameters) in src/agent/reasoner.py
  5. Mock booking-API tool integration (src/agent/tools.py) demonstrating the function-calling shape
  6. ElevenLabs streaming TTS adapter in src/tts/elevenlabs.py (token-gated runtime call)
  7. Local TTS fallback (src/tts/local.py) exercised by test_tts_stub.py
  8. WER eval harness in src/eval/wer.py ready for LibriSpeech / Common Voice runs (datasets external)
  9. Resolution-rate eval over 25 simulated conversations (LLM-judge) in data/eval/
  10. Demo with browser audio wiring in /projects/09-voice-agent.html
  11. 10 sample conversation traces in data/recordings/ (transcripts + tool calls + responses)