Voice Agent Index
Human handoff is the safety valve for sensitive, urgent, or high-value calls and for confused callers.

What This Benchmark Measures

This benchmark tests whether an AI voice agent can stop automation at the right time and involve a human cleanly.

The handoff path should prove:

  • The agent recognizes explicit human requests.
  • The agent recognizes urgency and sensitivity.
  • The agent transfers the call or creates a fallback quickly.
  • The receiving person gets context.
  • The caller hears honest expectations.
  • Staff can review the handoff evidence later.

Use this protocol with the human handoff playbook.

Core Scenario

Call the agent as a realistic customer:

  1. Start with a normal request.
  2. Give enough detail for the agent to collect fields.
  3. Add uncertainty or urgency.
  4. Say, “Can I talk to a person?”
  5. If the transfer fails, ask what happens next.

Do not warn the vendor which phrase will trigger escalation. Real callers do not follow the ideal script.

Evidence Checklist

| Evidence | Pass condition |
|---|---|
| Transfer trigger | Agent starts handoff after explicit request or urgency signal. |
| Transfer language | Agent explains the next step without arguing or over-apologizing. |
| Context packet | Human receives caller name, number, intent, collected details, and escalation reason. |
| Failed transfer path | Agent creates callback, ticket, alert, or other fallback. |
| Promise control | Agent does not promise immediate response unless the business staffs it. |
| Post-call review | Transcript and handoff reason are available for QA. |
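
One way to record the checklist per test call is a small record with one pass/fail flag per evidence item. This is a minimal sketch: the class name, field names, and helper are illustrative assumptions, not part of the benchmark or any vendor schema.

```python
from dataclasses import dataclass, fields


@dataclass
class HandoffEvidence:
    """Evidence from one test call. Field names are hypothetical."""
    transfer_trigger: bool      # handoff started after explicit request or urgency signal
    transfer_language: bool     # next step explained without arguing or over-apologizing
    context_packet: bool        # human received name, number, intent, details, reason
    failed_transfer_path: bool  # callback, ticket, alert, or other fallback was created
    promise_control: bool       # no promise of immediate response unless staffed
    post_call_review: bool      # transcript and handoff reason available for QA


def missing_evidence(evidence: HandoffEvidence) -> list[str]:
    """Return the names of checklist items that did not pass."""
    return [f.name for f in fields(evidence) if not getattr(evidence, f.name)]


# Example: a call where the human got no context and QA had no transcript.
call = HandoffEvidence(True, True, False, True, True, False)
print(missing_evidence(call))  # ['context_packet', 'post_call_review']
```

Listing the failed items by name, rather than only counting passes, keeps the review tied to specific evidence.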

Scoring Rubric

| Score | Meaning |
|---|---|
| 1 | No clear human path, or caller is trapped in automation. |
| 2 | Transfer exists but lacks context or fallback. |
| 3 | Transfer works for explicit human requests and creates a basic note. |
| 4 | Transfer includes useful context and a reliable failed-transfer fallback. |
| 5 | Handoff is configurable by intent, urgency, hours, team, and compliance sensitivity. |

The score should reward judgment. An agent that transfers a sensitive call quickly may deserve a higher score than an agent that keeps the caller in automation longer.
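
The rubric can be read as a ladder in which each score requires everything below it. The sketch below encodes one such reading. The function name and the five boolean inputs are assumptions made for illustration, and a reviewer may still adjust the result on judgment.

```python
def rubric_score(
    has_human_path: bool,
    basic_note: bool,
    useful_context: bool,
    reliable_fallback: bool,
    configurable_routing: bool,
) -> int:
    """Map reviewer observations to a 1-5 score (one possible reading of the rubric)."""
    if not has_human_path:
        return 1  # no clear human path, or caller trapped in automation
    if not basic_note:
        return 2  # transfer exists but carries no usable note
    if not (useful_context and reliable_fallback):
        return 3  # works for explicit requests with a basic note only
    if not configurable_routing:
        return 4  # useful context plus a reliable failed-transfer fallback
    return 5      # also configurable by intent, urgency, hours, team, sensitivity


# Example: good context, but no fallback when the transfer fails.
print(rubric_score(True, True, True, False, False))  # 3
```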

Failure Cases To Test

Run at least two:

  • Caller asks for a person immediately.
  • Caller becomes frustrated after a wrong answer.
  • Caller reports an urgent service issue.
  • Caller gives sensitive medical, legal, payment, or safety context.
  • Transfer destination does not answer.
  • Caller hangs up during transfer.
  • Staff receives incomplete context.

If the agent cannot explain what happens when a human is unavailable, the handoff system is not ready.
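
The failed-transfer path can be sketched as a transfer attempt followed by a fallback that always leaves a real next step. Everything here is hypothetical: `try_transfer` and `create_callback` stand in for whatever telephony and ticketing calls a given platform provides, and the caller messages are example wording only.

```python
from typing import Callable


def hand_off(
    try_transfer: Callable[[], bool],
    create_callback: Callable[[], str],
) -> dict:
    """Attempt a live transfer; if no one answers, create a callback instead.

    try_transfer returns True when a human picks up.
    create_callback returns a reference ID for the fallback record.
    """
    if try_transfer():
        return {
            "outcome": "transferred",
            "caller_message": "Connecting you to a team member now.",
        }
    reference = create_callback()
    return {
        "outcome": "callback_created",
        "reference": reference,
        # No response time is promised here, since staffing is unknown.
        "caller_message": (
            "No one is available right now. I have logged a callback "
            "request with your details."
        ),
    }


# Example: the transfer destination does not answer.
result = hand_off(lambda: False, lambda: "CB-0001")
print(result["outcome"], result["reference"])  # callback_created CB-0001
```

The fallback message states what was done and avoids a time commitment, which matches the promise-control item in the evidence checklist.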

What The Human Should See

A useful handoff packet includes:

| Field | Example |
|---|---|
| Caller identity | Name and callback number. |
| Intent | New appointment, urgent dispatch, complaint, quote, billing, support. |
| Collected fields | Address, preferred time, account detail, issue category, service type. |
| Escalation reason | Asked for person, urgent, confused, high value, sensitive topic. |
| Confidence note | What the AI is unsure about. |
| Next action | Answer now, callback, create ticket, send manager alert. |

The human should not have to restart the call from zero.
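
A handoff packet along these lines could be represented as a small record plus a completeness check. This is a sketch under stated assumptions: the class, the field names, and the choice of which fields count as required are illustrative, and the sample values are invented.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class HandoffPacket:
    """Context passed to the receiving human. Field names are hypothetical."""
    caller_name: str
    callback_number: str
    intent: str
    collected_fields: dict = field(default_factory=dict)
    escalation_reason: str = ""
    confidence_note: str = ""
    next_action: str = ""


# Assumption: these fields must be non-empty before a handoff counts as complete.
REQUIRED = ("caller_name", "callback_number", "intent",
            "escalation_reason", "next_action")


def missing_fields(packet: HandoffPacket) -> list[str]:
    """Return required fields that are empty or whitespace-only."""
    data = asdict(packet)
    return [name for name in REQUIRED if not str(data[name]).strip()]


# Example with invented values; next_action was left blank.
packet = HandoffPacket(
    caller_name="Example Caller",
    callback_number="555-0100",
    intent="urgent dispatch",
    collected_fields={"address": "12 Example St", "issue_category": "no heat"},
    escalation_reason="asked for person",
    confidence_note="unsure whether the address is a home or a business",
)
print(missing_fields(packet))  # ['next_action']
```

Running a check like this on every handoff during testing surfaces the "staff receives incomplete context" failure case before launch.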

Pass Bar

Before launch, a buyer should be able to say:

  • The agent transfers when asked.
  • The agent detects urgent or sensitive contexts.
  • The transfer includes context.
  • Failed transfer creates a real next step.
  • Staff know what the AI promised.
  • QA can review every handoff.

That pass bar is the difference between an AI phone agent and an operational support system.

Benchmark FAQs

What is a good human handoff result?

A good result transfers or routes the caller quickly, gives the human the caller's identity and the reason for escalation, and creates a reliable callback or alert if no one answers.

Should every human handoff reduce the vendor score?

No. Handoff is not failure. The benchmark rewards agents that know when to stop, pass context, and protect the caller experience.