Voice Agent Index
Voice agent eval scorecard with scenario scripts, call transcripts, tool-call checks, and reviewer rubrics.
Good evals measure caller outcomes, tool behavior, and safe recovery, not only answer style.

Short Answer

Voice agent evals are repeatable tests for AI phone calls. Start with realistic caller scenarios, expected outcomes, prohibited answers, and evidence artifacts. Use Vapi evals or LiveKit behavior tests near the runtime, Hamming for dedicated QA, and Retell AI post-call analysis for production review. Always separate text behavior, voice pipeline, tool-call, and human handoff checks.

Eval Layers

Eval layer | What it tests | Why it matters
Prompt behavior | The agent follows instructions, policy, tone, and approved boundaries. | Cheap to run and useful before audio testing.
Tool behavior | The agent calls calendars, CRMs, tickets, or APIs correctly. | Business outcomes fail when tools fail.
Voice behavior | The agent handles audio, turn-taking, interruption, and latency. | Callers judge the spoken experience.
Handoff behavior | The agent transfers or creates callbacks with context. | Human fallback protects sensitive and high-value calls.
Post-call output | The transcript, summary, fields, and disposition are useful. | Staff need reliable records after the call.
Regression behavior | Previously observed failures stay fixed. | Voice agents can regress after small prompt or model changes.

Do not collapse these into one score. A call can sound good and still create a bad appointment, unsafe escalation, or useless summary.
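One way to keep the layers separate is to record a verdict per layer and report which ones failed, rather than averaging them. The sketch below is a minimal illustration; the layer identifiers, `LayerResult`, and `call_verdict` are hypothetical names, not part of any vendor's API.

```python
from dataclasses import dataclass

# Layer names mirror the eval layers listed above; the identifiers are illustrative.
LAYERS = ("prompt", "tool", "voice", "handoff", "post_call", "regression")


@dataclass(frozen=True)
class LayerResult:
    layer: str
    passed: bool
    note: str = ""


def call_verdict(results):
    """Report failing and unchecked layers instead of one averaged score."""
    checked = {r.layer for r in results}
    failed = [r.layer for r in results if not r.passed]
    unchecked = [name for name in LAYERS if name not in checked]
    return {
        "passed": not failed and not unchecked,
        "failed_layers": failed,
        "unchecked_layers": unchecked,
    }


# A call that sounded fine but booked the wrong appointment.
results = [
    LayerResult("prompt", True),
    LayerResult("tool", False, "appointment booked for the wrong day"),
    LayerResult("voice", True),
    LayerResult("handoff", True),
    LayerResult("post_call", True),
    LayerResult("regression", True),
]
verdict = call_verdict(results)
print(verdict)
```

Five passing layers do not rescue the call here: the verdict fails and names the tool layer as the reason.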

Write The Scenario First

A good eval starts with the caller, not the model.

Scenario field | Example
Caller role | New dental patient with tooth pain.
Goal | Book the earliest safe appointment or route to staff.
Messy detail | Caller changes Friday morning to Thursday afternoon.
Tool need | Calendar lookup and appointment creation.
Prohibited behavior | Do not promise insurance coverage. Do not provide medical advice.
Pass evidence | Correct appointment, accurate summary, sensitive-topic fallback, transcript timestamps.

The scenario should feel ordinary, not adversarial. Real callers are tired, interrupted, noisy, and imprecise. That is enough to expose weak agents.
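The scenario fields above can be captured as a small structured record so every eval starts from the same template. This is a sketch under assumptions: the `Scenario` class, its field names, and the `version` field are hypothetical, and the values are the dental example from the table.

```python
from dataclasses import dataclass, fields


@dataclass
class Scenario:
    caller_role: str
    goal: str
    messy_detail: str
    tool_need: list
    prohibited_behavior: list
    pass_evidence: list
    version: str = "1"  # assumed field, so later runs can cite the scenario version

    def missing_fields(self):
        """Return the names of fields left empty, so gaps are caught before a run."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


dental = Scenario(
    caller_role="New dental patient with tooth pain.",
    goal="Book the earliest safe appointment or route to staff.",
    messy_detail="Caller changes Friday morning to Thursday afternoon.",
    tool_need=["calendar lookup", "appointment creation"],
    prohibited_behavior=["promise insurance coverage", "provide medical advice"],
    pass_evidence=[
        "correct appointment",
        "accurate summary",
        "sensitive-topic fallback",
        "transcript timestamps",
    ],
)
print(dental.missing_fields())
```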

Build A Rubric

Use a rubric with explicit pass/fail or 1-5 scoring:

Criterion | Strong pass signal
Intent capture | The agent understands the caller’s actual task.
State update | The final record uses corrected details.
Tool correctness | The right tool is called with the right arguments.
Failure honesty | The agent does not claim success after a timeout or bad response.
Interruption recovery | The agent stops, listens, and resumes with minimal repetition.
Handoff context | The human receives reason, summary, and known caller details.
Summary quality | Staff can act without replaying the full call.
Policy compliance | Sensitive topics trigger approved language and escalation.

The rubric should be understandable by operations, not only engineering. If staff cannot score the call, the eval is too abstract.
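Some rubric rows, such as tool correctness, state update, and failure honesty, can be checked mechanically before a human reviews the rest. The sketch below assumes tool calls are logged as dictionaries with `name` and `arguments` keys; that shape, the `create_appointment` tool name, and both helper functions are hypothetical.

```python
def check_tool_call(expected, actual):
    """Compare tool name and arguments; return a list of readable mismatches."""
    problems = []
    if actual.get("name") != expected["name"]:
        problems.append(f"wrong tool: {actual.get('name')!r}")
    actual_args = actual.get("arguments", {})
    for key, want in expected["arguments"].items():
        got = actual_args.get(key)
        if got != want:
            problems.append(f"{key}: expected {want!r}, got {got!r}")
    return problems


def claims_false_success(tool_succeeded, agent_confirmed):
    """Failure honesty: flag calls where the agent confirmed a result the tool never produced."""
    return agent_confirmed and not tool_succeeded


# The caller corrected Friday morning to Thursday afternoon; the agent booked the old slot.
expected = {"name": "create_appointment", "arguments": {"day": "Thursday", "period": "afternoon"}}
actual = {"name": "create_appointment", "arguments": {"day": "Friday", "period": "morning"}}

problems = check_tool_call(expected, actual)
for line in problems:
    print(line)
```

The mismatch messages are written in plain language so operations staff can read them next to the transcript without help from engineering.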

Text Evals Versus Voice Tests

Text evals are useful for prompt behavior, tool selection, data extraction, and policy checks. Voice tests are needed for telephony, speech recognition, interruption, latency, audio quality, caller emotion, and transfer timing.

Use both:

  • Text evals for fast regression checks before deploy.
  • Voice tests for end-to-end launch proof.
  • Production QA for live caller behavior that scripts missed.

Vapi Voice Testing is a good example of an official voice-test surface because it describes simulated phone conversations, recordings, transcripts, and rubric-based assessment. LiveKit testing and evaluation is a good example of framework-adjacent behavior tests for teams that own the runtime.

Source-Backed Evidence

Vapi Evals documents evals with mock conversations, automated validation, tool-call testing, and debugging. Vapi debugging docs also point buyers toward call logs, API logs, webhook logs, voice test suites, and tool testing.

Hamming positions voice-agent QA around automated tests, production monitoring, replayable production failures, metrics, integrations, and CI/CD. Retell post-call analysis and Retell webhooks are useful sources for the post-call evidence that evals should verify after a call.

Minimum Eval Suite

Start with these seven evals:

  1. Normal task completion.
  2. Caller correction.
  3. Caller interruption.
  4. Noisy caller or spelling correction.
  5. Failed tool call.
  6. Sensitive-topic escalation.
  7. Human transfer or callback fallback.

For each eval, define expected result, prohibited behavior, scoring rubric, evidence artifact, and owner. The AI voice agent call test script can provide the caller language.
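A quick completeness check can confirm that each of the seven evals has all five parts defined before anyone runs a call. The identifiers and the `incomplete_evals` helper below are illustrative assumptions, and the sample definitions are placeholders.

```python
REQUIRED_FIELDS = (
    "expected_result",
    "prohibited_behavior",
    "rubric",
    "evidence_artifact",
    "owner",
)

MINIMUM_SUITE = [
    "normal_task_completion",
    "caller_correction",
    "caller_interruption",
    "noisy_caller_or_spelling_correction",
    "failed_tool_call",
    "sensitive_topic_escalation",
    "human_transfer_or_callback_fallback",
]


def incomplete_evals(definitions):
    """Map each eval in the minimum suite to the required fields it still lacks."""
    gaps = {}
    for name in MINIMUM_SUITE:
        spec = definitions.get(name, {})
        missing = [f for f in REQUIRED_FIELDS if not spec.get(f)]
        if missing:
            gaps[name] = missing
    return gaps


# Placeholder definitions: one complete eval, one with no owner, five not written yet.
definitions = {
    "normal_task_completion": {
        "expected_result": "appointment created",
        "prohibited_behavior": "no insurance promises",
        "rubric": "pass/fail per criterion",
        "evidence_artifact": "transcript and tool log",
        "owner": "ops lead",
    },
    "caller_correction": {
        "expected_result": "final record uses corrected time",
        "prohibited_behavior": "no medical advice",
        "rubric": "pass/fail per criterion",
        "evidence_artifact": "transcript and tool log",
    },
}
gaps = incomplete_evals(definitions)
print(gaps)
```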

What To Save

Every eval run should save:

  • Scenario version
  • Assistant, prompt, model, voice, and tool version
  • Transcript and timestamps
  • Recording or replay policy
  • Tool-call request and response
  • Transfer or callback evidence
  • Summary and structured fields
  • Score, failure reason, and reviewer notes
  • Cost if the test used phone minutes or paid model calls

This is the difference between “we tested it” and “we can prove what changed.”
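The saved items above map naturally onto one record per run that can be serialized and compared across versions. The following is a minimal sketch; `EvalRunRecord`, its field names, and every value in the example are made-up placeholders rather than output from a real run or a vendor schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class EvalRunRecord:
    scenario_version: str
    versions: dict                      # assistant, prompt, model, voice, tool
    transcript: list                    # turns with timestamps
    recording_ref: Optional[str]        # None when the replay policy forbids storage
    tool_calls: list                    # request and response pairs
    handoff_evidence: Optional[dict]    # transfer or callback details, if any
    summary: str
    structured_fields: dict
    score: str
    failure_reason: Optional[str] = None
    reviewer_notes: str = ""
    cost_usd: Optional[float] = None    # phone minutes or paid model calls


# Placeholder values for illustration only.
record = EvalRunRecord(
    scenario_version="dental-new-patient-v3",
    versions={"assistant": "a12", "prompt": "p7", "model": "m2", "voice": "v1", "tool": "t4"},
    transcript=[{"t": 0.0, "speaker": "caller", "text": "Hi, I have tooth pain."}],
    recording_ref=None,
    tool_calls=[{"request": {"name": "create_appointment"}, "response": {"status": "timeout"}}],
    handoff_evidence=None,
    summary="Caller wanted the earliest appointment; booking timed out.",
    structured_fields={"intent": "book_appointment"},
    score="fail",
    failure_reason="agent confirmed a booking after a tool timeout",
)

saved = json.dumps(asdict(record), indent=2, sort_keys=True)
print(saved)
```

Storing the version identifiers next to the score is what lets two runs be compared when a prompt or model change is suspected.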

Regression Rules

Create a regression eval when:

  • A live caller corrected information and the agent ignored it.
  • Staff corrected a summary or structured field.
  • A tool returned an error and the agent claimed success.
  • A transfer failed.
  • The agent used unsafe policy language.
  • A prompt or model update broke a workflow that used to pass.

Regression cases should run before future changes ship. If the full voice test is expensive, run a smaller text eval first and save the full voice test for the final check before production.
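The cheap-first ordering can be sketched as a staged runner that only spends on a voice test once the text eval passes. The `run_regression` function and the stub evaluators are hypothetical stand-ins for whatever text and voice harness a team actually uses.

```python
def run_regression(case, text_eval, voice_test):
    """Run the cheap text eval first; only pay for the voice test if it passes."""
    if not text_eval(case):
        return "failed_text"
    if not voice_test(case):
        return "failed_voice"
    return "passed"


# Stubs standing in for real harness calls.
voice_calls = []


def stub_text_eval(case):
    return case["text_ok"]


def stub_voice_test(case):
    voice_calls.append(case["name"])  # track which cases reached the expensive stage
    return True


outcome_bad = run_regression(
    {"name": "caller-correction", "text_ok": False}, stub_text_eval, stub_voice_test
)
outcome_good = run_regression(
    {"name": "failed-tool-call", "text_ok": True}, stub_text_eval, stub_voice_test
)
print(outcome_bad, outcome_good, voice_calls)
```

The case that fails the text eval never reaches the voice stage, so no phone minutes are spent on it.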

Vendor Questions

  • Can we run evals before every prompt change?
  • Can evals include tool calls and webhook failures?
  • Can a production call become a replayable test?
  • Can evals tie results to the exact assistant version?
  • Can we test actual phone calls, not only text?
  • Can staff review and override automated scores?
  • Can QA results export for compliance or operations?
  • Can the system separate model, telephony, tool, and handoff failure causes?

Buyer FAQs

What are voice agent evals?

Voice agent evals are repeatable tests that score an AI phone agent against expected caller outcomes, approved policies, tool behavior, handoff quality, transcript evidence, and post-call output.

How are voice agent evals different from chatbot evals?

Voice agent evals must account for audio quality, latency, interruptions, speech recognition errors, phone routing, transfer behavior, and spoken recovery. Text-only evals are useful but do not prove the whole phone experience.

What should a voice agent eval rubric include?

Include intent accuracy, caller correction, interruption recovery, tool-call correctness, truthful failure handling, escalation judgment, summary accuracy, transfer context, latency, and compliance-sensitive language.

How often should voice agent evals run?

Run critical evals before every prompt, workflow, model, voice, tool, or routing change. Also rerun evals after production incidents so previously observed failures do not return silently.