
Voice Agent Evals: How to Test AI Phone Agents

How to design voice agent evals for AI phone calls: scenario scripts, voice tests, tool checks, rubrics, regression suites, and QA review.

Figure: Voice agent eval scorecard with scenario scripts, call transcripts, tool-call checks, and reviewer rubrics. Good evals measure caller outcomes, tool behavior, and safe recovery, not only answer style.

Short Answer

Voice agent evals are repeatable tests for AI phone calls. Start with realistic caller scenarios, expected outcomes, prohibited answers, and evidence artifacts. Use Vapi evals or LiveKit behavior tests near the runtime, Hamming for dedicated QA, and Retell AI post-call analysis for production review. Always separate text behavior, voice pipeline, tool-call, and human handoff checks.

Eval Layers

Eval layer	What it tests	Why it matters
Prompt behavior	The agent follows instructions, policy, tone, and approved boundaries.	Cheap to run and useful before audio testing.
Tool behavior	The agent calls calendars, CRMs, tickets, or APIs correctly.	Business outcomes fail when tools fail.
Voice behavior	The agent handles audio, turn-taking, interruption, and latency.	Callers judge the spoken experience.
Handoff behavior	The agent transfers or creates callbacks with context.	Human fallback protects sensitive and high-value calls.
Post-call output	The transcript, summary, fields, and disposition are useful.	Staff need reliable records after the call.
Regression behavior	Previously observed failures stay fixed.	Voice agents can regress after small prompt or model changes.

Do not collapse these into one score. A call can sound good and still create a bad appointment, unsafe escalation, or useless summary.
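One way to keep the layers apart is to record a result per layer and report which layers failed instead of averaging them. The sketch below is illustrative only: the layer keys mirror the table above, while `LayerResult`, `failing_layers`, and `call_passes` are hypothetical names, not part of any vendor API.

```python
from dataclasses import dataclass

# Layer keys follow the Eval Layers table; the structure itself is an assumption.
LAYERS = ("prompt", "tool", "voice", "handoff", "post_call", "regression")


@dataclass
class LayerResult:
    layer: str
    passed: bool
    note: str = ""


def failing_layers(results: list[LayerResult]) -> list[str]:
    """Return the names of the layers that failed, in report order."""
    return [r.layer for r in results if not r.passed]


def call_passes(results: list[LayerResult]) -> bool:
    """A call passes only if every layer was scored and none of them failed."""
    scored = {r.layer for r in results}
    return scored == set(LAYERS) and not failing_layers(results)


# A call that sounded fine but booked the wrong slot.
example = [
    LayerResult("prompt", True),
    LayerResult("tool", False, "appointment created for Friday, caller asked for Thursday"),
    LayerResult("voice", True),
    LayerResult("handoff", True),
    LayerResult("post_call", True),
    LayerResult("regression", True),
]
```

Here `failing_layers(example)` names the tool layer even though five of six layers passed, which a single blended score would hide.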

Write The Scenario First

A good eval starts with the caller, not the model.

Scenario field	Example
Caller role	New dental patient with tooth pain.
Goal	Book the earliest safe appointment or route to staff.
Messy detail	Caller changes Friday morning to Thursday afternoon.
Tool need	Calendar lookup and appointment creation.
Prohibited behavior	Do not promise insurance coverage. Do not provide medical advice.
Pass evidence	Correct appointment, accurate summary, sensitive-topic fallback, transcript timestamps.
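The scenario fields above can be captured as a small versioned record so that every run points back to an exact script. This is a minimal sketch; the `Scenario` class, its field names, and the `missing_fields` helper are assumptions made for illustration.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class Scenario:
    scenario_id: str
    version: int
    caller_role: str
    goal: str
    messy_detail: str
    tool_needs: tuple[str, ...]
    prohibited: tuple[str, ...]
    pass_evidence: tuple[str, ...]


def missing_fields(scenario: Scenario) -> list[str]:
    """List the fields that were left empty, so incomplete scenarios are caught early."""
    return [f.name for f in fields(scenario) if getattr(scenario, f.name) in ("", ())]


# The dental example from the table, written out as data.
dental = Scenario(
    scenario_id="dental-new-patient",
    version=1,
    caller_role="New dental patient with tooth pain.",
    goal="Book the earliest safe appointment or route to staff.",
    messy_detail="Caller changes Friday morning to Thursday afternoon.",
    tool_needs=("calendar_lookup", "appointment_creation"),
    prohibited=("Promise insurance coverage.", "Provide medical advice."),
    pass_evidence=(
        "Correct appointment",
        "Accurate summary",
        "Sensitive-topic fallback",
        "Transcript timestamps",
    ),
)

# A copy with no prohibited behavior defined should be flagged before it is used.
incomplete = replace(dental, prohibited=())
```

Keeping the record frozen and versioned means a changed scenario becomes a new version rather than a silent edit to an old test.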

The scenario should feel ordinary, not adversarial. Real callers are tired, interrupted, noisy, and imprecise. That is enough to expose weak agents.

Build A Rubric

Use a rubric with explicit pass/fail or 1-5 scoring:

Criterion	Strong pass signal
Intent capture	The agent understands the caller’s actual task.
State update	The final record uses corrected details.
Tool correctness	The right tool is called with the right arguments.
Failure honesty	The agent does not claim success after a timeout or bad response.
Interruption recovery	The agent stops, listens, and resumes with minimal repetition.
Handoff context	The human receives reason, summary, and known caller details.
Summary quality	Staff can act without replaying the full call.
Policy compliance	Sensitive topics trigger approved language and escalation.

The rubric should be understandable by operations, not only engineering. If staff cannot score the call, the eval is too abstract.
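A reviewer scorecard built on this rubric can be checked mechanically before anyone trusts the result. The sketch below assumes 1-5 scoring and a pass threshold of 4; the threshold, the criterion keys, and the `review` function are illustrative choices, not a standard.

```python
# Criterion keys follow the rubric table above.
CRITERIA = (
    "intent_capture",
    "state_update",
    "tool_correctness",
    "failure_honesty",
    "interruption_recovery",
    "handoff_context",
    "summary_quality",
    "policy_compliance",
)
PASS_THRESHOLD = 4  # assumption: a score of 4 or 5 counts as a pass


def review(scores: dict[str, int]) -> dict:
    """Validate a 1-5 scorecard and list the criteria that fall below the threshold."""
    unscored = [c for c in CRITERIA if c not in scores]
    out_of_range = [c for c in CRITERIA if c in scores and not 1 <= scores[c] <= 5]
    below = [c for c in CRITERIA if c in scores and scores[c] < PASS_THRESHOLD]
    complete = not unscored and not out_of_range
    return {
        "complete": complete,
        "unscored": unscored,
        "out_of_range": out_of_range,
        "below_threshold": below,
        "passed": complete and not below,
    }


# The agent sounded fine but saved the uncorrected appointment time.
scorecard = {criterion: 5 for criterion in CRITERIA}
scorecard["state_update"] = 2
result = review(scorecard)
```

Reporting the failing criteria by name, rather than a total, keeps the output readable for operations staff.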

Text Evals Versus Voice Tests

Text evals are useful for prompt behavior, tool selection, data extraction, and policy checks. Voice tests are needed for telephony, speech recognition, interruption, latency, audio quality, caller emotion, and transfer timing.

Use both:

Text evals for fast regression checks before deploy.
Voice tests for end-to-end launch proof.
Production QA for live caller behavior that scripts missed.

Vapi Voice Testing is a good example of an official voice-test surface because it describes simulated phone conversations, recordings, transcripts, and rubric-based assessment. LiveKit testing and evaluation is a good example of framework-adjacent behavior tests for teams that own the runtime.
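A text eval for tool selection can be as simple as comparing the recorded tool calls with what the scenario expected. The sketch below is vendor-neutral and does not use the Vapi or LiveKit APIs; the call format (a `name` plus an `arguments` dictionary) and the function name `check_tool_call` are assumptions.

```python
def check_tool_call(expected: dict, actual_calls: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the tool check passed."""
    matches = [call for call in actual_calls if call.get("name") == expected["name"]]
    if not matches:
        return [f"tool {expected['name']!r} was never called"]

    # Score the last matching call, since it should reflect any caller corrections.
    final_arguments = matches[-1].get("arguments", {})
    problems = []
    for key, want in expected["arguments"].items():
        got = final_arguments.get(key)
        if got != want:
            problems.append(f"{key}: expected {want!r}, got {got!r}")
    return problems


# The caller changed Friday morning to Thursday afternoon.
expected = {
    "name": "create_appointment",
    "arguments": {"day": "thursday", "part_of_day": "afternoon"},
}
recorded = [
    {"name": "lookup_calendar", "arguments": {"day": "friday"}},
    {"name": "create_appointment", "arguments": {"day": "friday", "part_of_day": "morning"}},
]
problems = check_tool_call(expected, recorded)
```

A check like this runs in milliseconds, which is why it suits pre-deploy regression runs. It says nothing about audio, latency, or turn-taking, so it does not replace a voice test.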

Source-Backed Evidence

Vapi Evals documents evals with mock conversations, automated validation, tool-call testing, and debugging. Vapi debugging docs also point buyers toward call logs, API logs, webhook logs, voice test suites, and tool testing.

Hamming positions voice-agent QA around automated tests, production monitoring, replayable production failures, metrics, integrations, and CI/CD. Retell post-call analysis and Retell webhooks are useful sources for the post-call evidence that evals should verify after a call.

Minimum Eval Suite

Start with these seven evals:

Normal task completion.
Caller correction.
Caller interruption.
Noisy caller or spelling correction.
Failed tool call.
Sensitive-topic escalation.
Human transfer or callback fallback.

For each eval, define the expected result, prohibited behavior, scoring rubric, evidence artifact, and owner. The AI voice agent call test script can supply the caller language.
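A suite definition can be checked for coverage before any call is placed. The sketch below assumes the suite is a dictionary keyed by eval name; the keys and the `suite_gaps` helper are hypothetical, chosen to mirror the seven evals and five required fields listed in this section.

```python
REQUIRED_EVALS = (
    "normal_task_completion",
    "caller_correction",
    "caller_interruption",
    "noisy_caller_or_spelling_correction",
    "failed_tool_call",
    "sensitive_topic_escalation",
    "human_transfer_or_callback_fallback",
)
REQUIRED_FIELDS = (
    "expected_result",
    "prohibited_behavior",
    "scoring_rubric",
    "evidence_artifact",
    "owner",
)


def suite_gaps(suite: dict[str, dict]) -> dict[str, list[str]]:
    """Map each required eval to the fields it still lacks; absent evals lack all of them."""
    gaps = {}
    for name in REQUIRED_EVALS:
        definition = suite.get(name, {})
        missing = [f for f in REQUIRED_FIELDS if not definition.get(f)]
        if missing:
            gaps[name] = missing
    return gaps


# A draft suite with one eval written and no owner assigned yet.
draft_suite = {
    "failed_tool_call": {
        "expected_result": "Agent admits the booking failed and offers a callback.",
        "prohibited_behavior": "Claiming the appointment was created.",
        "scoring_rubric": "failure_honesty, handoff_context",
        "evidence_artifact": "Tool-call request and response, transcript",
        "owner": "",
    }
}
gaps = suite_gaps(draft_suite)
```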

What To Save

Every eval run should save:

Scenario version
Assistant, prompt, model, voice, and tool version
Transcript and timestamps
Recording or replay policy
Tool-call request and response
Transfer or callback evidence
Summary and structured fields
Score, failure reason, and reviewer notes
Cost if the test used phone minutes or paid model calls

This is the difference between “we tested it” and “we can prove what changed.”
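The saved items can be stored as one record per run, which makes it possible to compare two runs and see exactly which version changed. This is a sketch under stated assumptions: the `EvalRun` fields follow the list above, but the class, the identifiers in the example, and the `changed_versions` helper are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, replace
from typing import Optional


@dataclass
class EvalRun:
    scenario_version: str
    versions: dict                    # assistant, prompt, model, voice, and tool versions
    transcript: list                  # entries with timestamps
    recording_ref: Optional[str]      # recording location, or None if policy forbids storage
    tool_calls: list                  # request and response pairs
    handoff_evidence: Optional[dict]  # transfer or callback evidence
    summary: str
    structured_fields: dict
    score: dict
    failure_reason: Optional[str]
    reviewer_notes: str
    cost_usd: Optional[float]         # phone minutes or paid model calls, if any


def to_json(run: EvalRun) -> str:
    """Serialize a run so it can be archived alongside the scenario."""
    return json.dumps(asdict(run), indent=2, sort_keys=True)


def changed_versions(before: EvalRun, after: EvalRun) -> dict:
    """Return {component: (old, new)} for every version that differs between two runs."""
    keys = sorted(set(before.versions) | set(after.versions))
    return {
        key: (before.versions.get(key), after.versions.get(key))
        for key in keys
        if before.versions.get(key) != after.versions.get(key)
    }


baseline = EvalRun(
    scenario_version="dental-new-patient-v3",
    versions={"assistant": "a-7", "prompt": "p-12", "model": "m-1", "voice": "v-2", "tool": "t-4"},
    transcript=[{"t": 0.0, "speaker": "caller", "text": "Hi, I have a toothache."}],
    recording_ref=None,
    tool_calls=[],
    handoff_evidence=None,
    summary="Caller booked Thursday afternoon.",
    structured_fields={"day": "thursday"},
    score={"state_update": 5},
    failure_reason=None,
    reviewer_notes="",
    cost_usd=None,
)
candidate = replace(baseline, versions={**baseline.versions, "prompt": "p-13"})
```

With both records saved, `changed_versions(baseline, candidate)` shows that only the prompt moved between the two runs.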

Regression Rules

Create a regression eval when:

A live caller corrected information and the agent ignored it.
Staff corrected a summary or structured field.
A tool returned an error and the agent claimed success.
A transfer failed.
The agent used unsafe policy language.
A prompt or model update broke a workflow that used to pass.

Regression cases should run before future changes ship. If the test is expensive, run a smaller text eval first and a full voice test before production.
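The "cheap first, expensive second" ordering can be expressed as a two-stage gate: the voice tests only run once every text eval has passed. The sketch below assumes each check is a zero-argument function returning `True` on pass; `run_gate` and the check names are hypothetical.

```python
from typing import Callable

Check = Callable[[], bool]


def run_gate(text_evals: dict[str, Check], voice_tests: dict[str, Check]) -> dict:
    """Run text evals first; skip the costlier voice tests if any text eval fails."""
    text_failures = [name for name, check in text_evals.items() if not check()]
    if text_failures:
        return {"stage": "text", "failures": text_failures, "ship": False}

    voice_failures = [name for name, check in voice_tests.items() if not check()]
    return {"stage": "voice", "failures": voice_failures, "ship": not voice_failures}


# Track whether the expensive stage was reached.
voice_runs = []


def full_call_test() -> bool:
    voice_runs.append("full_call")
    return True


blocked = run_gate(
    text_evals={"caller_correction": lambda: False},
    voice_tests={"full_call": full_call_test},
)
```

In this example the failed text eval stops the gate, so no phone minutes are spent on a change that was already known to be broken.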

Vendor Questions

Can we run evals before every prompt change?
Can evals include tool calls and webhook failures?
Can a production call become a replayable test?
Can evals tie results to the exact assistant version?
Can we test actual phone calls, not only text?
Can staff review and override automated scores?
Can QA results export for compliance or operations?
Can the system separate model, telephony, tool, and handoff failure causes?

Buyer FAQs

What are voice agent evals?

Voice agent evals are repeatable tests that score an AI phone agent against expected caller outcomes, approved policies, tool behavior, handoff quality, transcript evidence, and post-call output.

How are voice agent evals different from chatbot evals?

Voice agent evals must account for audio quality, latency, interruptions, speech recognition errors, phone routing, transfer behavior, and spoken recovery. Text-only evals are useful but do not prove the whole phone experience.

What should a voice agent eval rubric include?

Include intent accuracy, caller correction, interruption recovery, tool-call correctness, truthful failure handling, escalation judgment, summary accuracy, transfer context, latency, and compliance-sensitive language.

How often should voice agent evals run?

Run critical evals before every prompt, workflow, model, voice, tool, or routing change. Also rerun evals after production incidents so previously observed failures do not return silently.