Short Answer
Voice agent evals are repeatable tests for AI phone calls. Start with realistic caller scenarios, expected outcomes, prohibited answers, and evidence artifacts. Use Vapi evals or LiveKit behavior tests near the runtime, Hamming for dedicated QA, and Retell AI post-call analysis for production review. Always separate text behavior, voice pipeline, tool-call, and human handoff checks.
Eval Layers
| Eval layer | What it tests | Why it matters |
|---|---|---|
| Prompt behavior | The agent follows instructions, policy, tone, and approved boundaries. | Cheap to run and useful before audio testing. |
| Tool behavior | The agent calls calendars, CRMs, tickets, or APIs correctly. | Business outcomes fail when tools fail. |
| Voice behavior | The agent handles audio, turn-taking, interruption, and latency. | Callers judge the spoken experience. |
| Handoff behavior | The agent transfers or creates callbacks with context. | Human fallback protects sensitive and high-value calls. |
| Post-call output | The transcript, summary, fields, and disposition are useful. | Staff need reliable records after the call. |
| Regression behavior | Previously observed failures stay fixed. | Voice agents can regress after small prompt or model changes. |
Do not collapse these into one score. A call can sound good and still create a bad appointment, unsafe escalation, or useless summary.
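One way to keep the layers apart is to record a separate result per layer and report the failing layers by name instead of averaging them. The sketch below is illustrative only: the layer identifiers, the `LayerResult` type, and the helper names are assumptions, not part of any vendor API.

```python
from dataclasses import dataclass

# Layer names mirror the table above; the identifiers are illustrative.
LAYERS = ("prompt", "tool", "voice", "handoff", "post_call", "regression")


@dataclass(frozen=True)
class LayerResult:
    layer: str
    passed: bool
    note: str = ""


def failed_layers(results):
    """Return the failing layer names in table order."""
    failed = {r.layer for r in results if not r.passed}
    return [layer for layer in LAYERS if layer in failed]


def unchecked_layers(results):
    """Return layers that have no result at all."""
    checked = {r.layer for r in results}
    return [layer for layer in LAYERS if layer not in checked]


def call_passes(results):
    """Pass only when every layer was checked and none failed."""
    return not unchecked_layers(results) and not failed_layers(results)


# A call that sounded fine but booked the wrong slot.
results = [
    LayerResult("prompt", True),
    LayerResult("tool", False, "booked Friday; caller corrected to Thursday"),
    LayerResult("voice", True),
    LayerResult("handoff", True),
    LayerResult("post_call", False, "summary repeats the wrong day"),
    LayerResult("regression", True),
]

print(failed_layers(results))
print(call_passes(results))
```

Reporting the failing layers by name keeps a good voice result from hiding a bad tool or post-call result.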
Write The Scenario First
A good eval starts with the caller, not the model.
| Scenario field | Example |
|---|---|
| Caller role | New dental patient with tooth pain. |
| Goal | Book the earliest safe appointment or route to staff. |
| Messy detail | Caller changes Friday morning to Thursday afternoon. |
| Tool need | Calendar lookup and appointment creation. |
| Prohibited behavior | Do not promise insurance coverage. Do not provide medical advice. |
| Pass evidence | Correct appointment, accurate summary, sensitive-topic fallback, transcript timestamps. |
The scenario should feel ordinary, not adversarial. Real callers are tired, interrupted, noisy, and imprecise. That is enough to expose weak agents.
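The scenario table can be stored as a small versionable record so that incomplete scenarios are caught before they are run. This is a minimal sketch: the `Scenario` class and its field names are hypothetical and simply mirror the table rows above.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Scenario:
    # Field names follow the scenario table; they are not a vendor schema.
    caller_role: str
    goal: str
    messy_detail: str
    tool_need: tuple[str, ...]
    prohibited_behavior: tuple[str, ...]
    pass_evidence: tuple[str, ...]

    def missing_fields(self) -> list[str]:
        """Names of fields left empty (empty string or empty tuple)."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


dental = Scenario(
    caller_role="New dental patient with tooth pain",
    goal="Book the earliest safe appointment or route to staff",
    messy_detail="Caller changes Friday morning to Thursday afternoon",
    tool_need=("calendar lookup", "appointment creation"),
    prohibited_behavior=(
        "promise insurance coverage",
        "provide medical advice",
    ),
    pass_evidence=(
        "correct appointment",
        "accurate summary",
        "sensitive-topic fallback",
        "transcript timestamps",
    ),
)

print(dental.missing_fields())
```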
Build A Rubric
Use a rubric with explicit pass/fail or 1-5 scoring:
| Criterion | Strong pass signal |
|---|---|
| Intent capture | The agent understands the caller’s actual task. |
| State update | The final record uses corrected details. |
| Tool correctness | The right tool is called with the right arguments. |
| Failure honesty | The agent does not claim success after a timeout or bad response. |
| Interruption recovery | The agent stops, listens, and resumes with minimal repetition. |
| Handoff context | The human receives reason, summary, and known caller details. |
| Summary quality | Staff can act without replaying the full call. |
| Policy compliance | Sensitive topics trigger approved language and escalation. |
The rubric should be understandable by operations, not only engineering. If staff cannot score the call, the eval is too abstract.
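As a rough illustration, the rubric above can be scored per criterion on the 1-5 scale, with the failing criteria listed by name so a reviewer can see what went wrong. The criterion identifiers and the pass threshold of 4 are assumptions chosen for this sketch, not a standard.

```python
# Criterion names follow the rubric table; identifiers are illustrative.
CRITERIA = (
    "intent_capture",
    "state_update",
    "tool_correctness",
    "failure_honesty",
    "interruption_recovery",
    "handoff_context",
    "summary_quality",
    "policy_compliance",
)

PASS_THRESHOLD = 4  # assumption: a score of 4 or 5 counts as a pass


def review(scores):
    """Validate 1-5 scores and list the criteria that fall below the threshold."""
    unscored = [c for c in CRITERIA if c not in scores]
    if unscored:
        raise ValueError(f"unscored criteria: {unscored}")
    for criterion in CRITERIA:
        if not 1 <= scores[criterion] <= 5:
            raise ValueError(f"{criterion} must be scored 1-5")
    failing = [c for c in CRITERIA if scores[c] < PASS_THRESHOLD]
    return {"passed": not failing, "failing": failing}


# The agent kept the original Friday slot after the caller corrected it.
scores = {criterion: 5 for criterion in CRITERIA}
scores["state_update"] = 2

print(review(scores))
```

Refusing to score a call with unscored criteria keeps a partial review from being read as a pass.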
Text Evals Versus Voice Tests
Text evals are useful for prompt behavior, tool selection, data extraction, and policy checks. Voice tests are needed for telephony, speech recognition, interruption, latency, audio quality, caller emotion, and transfer timing.
Use both:
- Text evals for fast regression checks before deploy.
- Voice tests for end-to-end launch proof.
- Production QA for live caller behavior that scripts missed.
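The split described above can be written down as a lookup that routes each check to a text or voice harness. The check names below are hypothetical labels taken from the paragraph above; anything unclassified raises an error instead of silently defaulting to the cheaper text run.

```python
# Which harness each kind of check belongs to, per the split described above.
TEXT_CHECKS = {"prompt_behavior", "tool_selection", "data_extraction", "policy"}
VOICE_CHECKS = {
    "telephony",
    "speech_recognition",
    "interruption",
    "latency",
    "audio_quality",
    "caller_emotion",
    "transfer_timing",
}


def harness_for(check):
    """Return 'text' or 'voice' for a known check name."""
    if check in TEXT_CHECKS:
        return "text"
    if check in VOICE_CHECKS:
        return "voice"
    raise ValueError(f"unclassified check: {check}")


def plan(checks):
    """Group checks by harness, preserving the order they were given in."""
    grouped = {"text": [], "voice": []}
    for check in checks:
        grouped[harness_for(check)].append(check)
    return grouped


print(plan(["policy", "latency", "tool_selection", "interruption"]))
```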
Vapi Voice Testing is a good example of an official voice-test surface: its documentation describes simulated phone conversations, recordings, transcripts, and rubric-based assessment. LiveKit's testing and evaluation documentation is a good example of framework-adjacent behavior tests for teams that own the runtime.
Source-Backed Evidence
The Vapi Evals documentation covers mock conversations, automated validation, tool-call testing, and debugging. Vapi's debugging docs also point buyers toward call logs, API logs, webhook logs, voice test suites, and tool testing.
Hamming positions voice-agent QA around automated tests, production monitoring, replayable production failures, metrics, integrations, and CI/CD. Retell post-call analysis and Retell webhooks are useful sources for the post-call evidence that evals should verify after a call.
Minimum Eval Suite
Start with these seven evals:
- Normal task completion.
- Caller correction.
- Caller interruption.
- Noisy caller or spelling correction.
- Failed tool call.
- Sensitive-topic escalation.
- Human transfer or callback fallback.
For each eval, define the expected result, prohibited behavior, scoring rubric, evidence artifact, and an owner. The AI voice agent call test script can provide the caller language.
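A small completeness check can flag evals that are missing from the suite or defined without all five parts. This sketch assumes the suite is kept as a plain dictionary; the eval and field identifiers are illustrative names for the items listed above.

```python
# The seven starter evals and the five parts each definition needs.
REQUIRED_EVALS = (
    "normal_task_completion",
    "caller_correction",
    "caller_interruption",
    "noisy_caller_or_spelling_correction",
    "failed_tool_call",
    "sensitive_topic_escalation",
    "human_transfer_or_callback",
)
REQUIRED_FIELDS = (
    "expected_result",
    "prohibited_behavior",
    "scoring_rubric",
    "evidence_artifact",
    "owner",
)


def suite_gaps(suite):
    """Map each incomplete or missing eval to what it still needs."""
    gaps = {}
    for name in REQUIRED_EVALS:
        definition = suite.get(name)
        if definition is None:
            gaps[name] = ["<eval not defined>"]
            continue
        missing = [f for f in REQUIRED_FIELDS if not definition.get(f)]
        if missing:
            gaps[name] = missing
    return gaps


# A draft suite: one eval defined, and it has no owner yet.
draft_suite = {
    "failed_tool_call": {
        "expected_result": "Agent says booking failed and offers a callback",
        "prohibited_behavior": "Claiming the appointment was booked",
        "scoring_rubric": "failure_honesty >= 4",
        "evidence_artifact": "tool-call response plus transcript",
        "owner": "",
    },
}

for name, missing in suite_gaps(draft_suite).items():
    print(name, missing)
```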
What To Save
Every eval run should save:
- Scenario version
- Assistant, prompt, model, voice, and tool versions
- Transcript and timestamps
- Recording or replay policy
- Tool-call request and response
- Transfer or callback evidence
- Summary and structured fields
- Score, failure reason, and reviewer notes
- Cost if the test used phone minutes or paid model calls
This is the difference between “we tested it” and “we can prove what changed.”
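The saved items can be captured in one record per run and serialized, so two runs can be diffed later. The following is a minimal sketch using only the Python standard library; the `EvalRun` field names and all example values are made up for illustration and are not a vendor schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class EvalRun:
    scenario_version: str
    versions: dict            # assistant, prompt, model, voice, tool
    transcript: list          # entries with a timestamp, speaker, and text
    recording_policy: str     # where the recording lives, or why none was kept
    tool_calls: list          # request and response for each call
    handoff_evidence: Optional[str]
    summary: str
    structured_fields: dict
    score: int
    failure_reason: Optional[str] = None
    reviewer_notes: str = ""
    cost_usd: Optional[float] = None  # only when phone minutes or paid calls were used


def to_json(run):
    """Serialize with sorted keys so two runs diff cleanly."""
    return json.dumps(asdict(run), sort_keys=True, indent=2)


run = EvalRun(
    scenario_version="dental-new-patient-v3",
    versions={"assistant": "a12", "prompt": "p7", "model": "m2",
              "voice": "v1", "tool": "calendar-4"},
    transcript=[{"t": 0.0, "speaker": "caller", "text": "Hi, I have tooth pain."}],
    recording_policy="retained 30 days",
    tool_calls=[{"request": {"day": "Thursday"}, "response": {"status": "ok"}}],
    handoff_evidence=None,
    summary="Booked Thursday afternoon after caller correction.",
    structured_fields={"appointment_day": "Thursday"},
    score=5,
)

print(to_json(run))
```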
Regression Rules
Create a regression eval when:
- A live caller corrected information and the agent ignored it.
- Staff corrected a summary or structured field.
- A tool returned an error and the agent claimed success.
- A transfer failed.
- The agent used unsafe policy language.
- A prompt or model update broke a workflow that used to pass.
Regression cases should run before future changes ship. If the test is expensive, run a smaller text eval first and a full voice test before production.
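The cheap-first ordering described above can be sketched as a two-stage gate: run the text evals, and only place the expensive voice calls if they all pass. The function and check names here are hypothetical, and each check is assumed to be a zero-argument callable that returns `True` on pass.

```python
def run_staged(text_evals, voice_tests):
    """Run text evals first; run voice tests only if every text eval passes.

    Both arguments map a case name to a zero-argument callable returning bool.
    """
    text_failures = [name for name, check in text_evals.items() if not check()]
    if text_failures:
        # Stop here: no phone minutes are spent on a build that fails cheaply.
        return {"stage": "text", "failures": text_failures, "voice_ran": False}
    voice_failures = [name for name, check in voice_tests.items() if not check()]
    return {"stage": "voice", "failures": voice_failures, "voice_ran": True}


voice_calls_placed = []


def fake_voice_test():
    """Stand-in for a real phone-call test; records that it was invoked."""
    voice_calls_placed.append("caller_correction_voice")
    return True


text_evals = {
    "caller_correction_text": lambda: True,
    "failed_tool_call_text": lambda: False,  # agent claimed success after an error
}
voice_tests = {"caller_correction_voice": fake_voice_test}

print(run_staged(text_evals, voice_tests))
```

In this example the voice test is never invoked, because a text regression case already failed.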
Vendor Questions
- Can we run evals before every prompt change?
- Can evals include tool calls and webhook failures?
- Can a production call become a replayable test?
- Can evals tie results to the exact assistant version?
- Can we test actual phone calls, not only text?
- Can staff review and override automated scores?
- Can QA results export for compliance or operations?
- Can the system separate model, telephony, tool, and handoff failure causes?
Related Reading
- Best Voice Agent Testing Tools
- Voice Agent Testing and QA Stack
- Hamming vs Vapi Evals
- AI Voice Agent Evaluation Scorecard
- Voice AI Observability Guide
Buyer FAQs
What are voice agent evals?
Voice agent evals are repeatable tests that score an AI phone agent against expected caller outcomes, approved policies, tool behavior, handoff quality, transcript evidence, and post-call output.
How are voice agent evals different from chatbot evals?
Voice agent evals must account for audio quality, latency, interruptions, speech recognition errors, phone routing, transfer behavior, and spoken recovery. Text-only evals are useful but do not prove the whole phone experience.
What should a voice agent eval rubric include?
Include intent accuracy, caller correction, interruption recovery, tool-call correctness, truthful failure handling, escalation judgment, summary accuracy, transfer context, latency, and compliance-sensitive language.
How often should voice agent evals run?
Run critical evals before every prompt, workflow, model, voice, tool, or routing change. Also rerun evals after production incidents so previously observed failures do not return silently.
