What This Benchmark Measures
This benchmark tests whether an AI voice agent can stop automation at the right time and involve a human cleanly.
The handoff path should prove:
- The agent recognizes explicit human requests.
- The agent recognizes urgency and sensitivity.
- The agent transfers the call or creates a fallback quickly.
- The receiving person gets context.
- The caller hears honest expectations.
- Staff can review the handoff evidence later.
Use this protocol with the human handoff playbook.
Core Scenario
Call the agent as a realistic customer:
- Start with a normal request.
- Give enough detail for the agent to collect fields.
- Add uncertainty or urgency.
- Say, “Can I talk to a person?”
- If the transfer fails, ask what happens next.
Do not warn the vendor which phrase will trigger escalation. Real callers do not follow the ideal script.
Evidence Checklist
| Evidence | Pass condition |
|---|---|
| Transfer trigger | Agent starts handoff after explicit request or urgency signal. |
| Transfer language | Agent explains the next step without arguing or over-apologizing. |
| Context packet | Human receives caller name, number, intent, collected details, and escalation reason. |
| Failed transfer path | Agent creates callback, ticket, alert, or other fallback. |
| Promise control | Agent does not promise an immediate response unless the business has staff available to deliver it. |
| Post-call review | Transcript and handoff reason are available for QA. |
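The checklist can be recorded per test call so that failures are easy to compare across vendors. The following is a minimal sketch, not part of the protocol itself: the `HandoffEvidence` class, its field names, and `failed_checks` are hypothetical names chosen to mirror the six rows above.

```python
from dataclasses import dataclass, fields


@dataclass
class HandoffEvidence:
    """Results of one test call; each flag mirrors one checklist row.

    Field names are illustrative assumptions, not a vendor schema.
    """
    transfer_trigger: bool
    transfer_language: bool
    context_packet: bool
    failed_transfer_path: bool
    promise_control: bool
    post_call_review: bool


def failed_checks(evidence: HandoffEvidence) -> list[str]:
    """Return the names of checklist rows that did not pass."""
    return [f.name for f in fields(evidence) if not getattr(evidence, f.name)]


# Example: the transfer worked, but the human got no context
# and no transcript was available afterwards.
call = HandoffEvidence(
    transfer_trigger=True,
    transfer_language=True,
    context_packet=False,
    failed_transfer_path=True,
    promise_control=True,
    post_call_review=False,
)
print(failed_checks(call))  # ['context_packet', 'post_call_review']
```

Keeping each row as a separate flag, rather than a single pass/fail, preserves which part of the handoff broke.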
Scoring Rubric
| Score | Meaning |
|---|---|
| 1 | No clear human path, or caller is trapped in automation. |
| 2 | Transfer exists but lacks context or fallback. |
| 3 | Transfer works for explicit human requests and creates a basic note. |
| 4 | Transfer includes useful context and a reliable failed-transfer fallback. |
| 5 | Handoff is configurable by intent, urgency, hours, team, and compliance sensitivity. |
The score should reward judgment. An agent that transfers a sensitive call quickly may deserve a higher score than one that keeps the caller in automation longer.
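One way to apply the rubric consistently is to map observed capabilities to a score in a fixed order. The sketch below is one possible reading of the table, not an official scoring rule: the function name, the five boolean inputs, and the decision to treat a missing note as a 2 are all assumptions, and a reviewer may still adjust the result based on judgment.

```python
def rubric_score(
    transfer_works: bool,
    has_basic_note: bool,
    has_useful_context: bool,
    has_reliable_fallback: bool,
    configurable_routing: bool,
) -> int:
    """Map observed handoff capabilities to the 1-5 rubric.

    Assumed interpretation: each level requires everything below it.
    """
    if not transfer_works:
        return 1  # no clear human path, caller trapped in automation
    if not has_basic_note:
        return 2  # transfer exists, but nothing is passed along
    if not (has_useful_context and has_reliable_fallback):
        return 3  # explicit requests work, basic note only
    if not configurable_routing:
        return 4  # useful context plus failed-transfer fallback
    return 5      # routing configurable by intent, urgency, hours, team


# Example: good context and fallback, but fixed routing rules.
print(rubric_score(True, True, True, True, False))  # 4
```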
Failure Cases To Test
Run at least two of these cases:
- Caller asks for a person immediately.
- Caller becomes frustrated after a wrong answer.
- Caller reports an urgent service issue.
- Caller gives sensitive medical, legal, payment, or safety context.
- Transfer destination does not answer.
- Caller hangs up during transfer.
- Staff receives incomplete context.
If the agent cannot explain what happens when a human is unavailable, the handoff system is not ready.
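Because the vendor should not know in advance which cases will be run, the selection can be randomized. The following sketch assumes the seven cases above are kept as a simple list; `FAILURE_CASES` and `pick_cases` are hypothetical names, and the optional seed exists only so a tester can reproduce a run later.

```python
import random
from typing import Optional

FAILURE_CASES = [
    "Caller asks for a person immediately.",
    "Caller becomes frustrated after a wrong answer.",
    "Caller reports an urgent service issue.",
    "Caller gives sensitive medical, legal, payment, or safety context.",
    "Transfer destination does not answer.",
    "Caller hangs up during transfer.",
    "Staff receives incomplete context.",
]


def pick_cases(k: int = 2, seed: Optional[int] = None) -> list[str]:
    """Choose k distinct failure cases; the protocol asks for at least two."""
    if k < 2:
        raise ValueError("run at least two failure cases")
    if k > len(FAILURE_CASES):
        raise ValueError("k exceeds the number of defined cases")
    # A private Random instance avoids disturbing global random state.
    return random.Random(seed).sample(FAILURE_CASES, k)


for case in pick_cases(k=3, seed=7):
    print(case)
```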
What The Human Should See
A useful handoff packet includes:
| Field | Example |
|---|---|
| Caller identity | Name and callback number. |
| Intent | New appointment, urgent dispatch, complaint, quote, billing, support. |
| Collected fields | Address, preferred time, account detail, issue category, service type. |
| Escalation reason | Asked for person, urgent, confused, high value, sensitive topic. |
| Confidence note | What the AI is unsure about. |
| Next action | Answer now, callback, create ticket, send manager alert. |
The human should not have to restart the call from zero.
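The packet fields in the table above can be expressed as a small data structure with a completeness check. This is an illustrative sketch only: `HandoffPacket`, its field names, and `missing_fields` are assumptions rather than any vendor's actual schema, and the example values are made up.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class HandoffPacket:
    """What the receiving human should see; names are hypothetical."""
    caller_name: str = ""
    callback_number: str = ""
    intent: str = ""
    collected_fields: dict = field(default_factory=dict)
    escalation_reason: str = ""
    confidence_note: str = ""
    next_action: str = ""


def missing_fields(packet: HandoffPacket) -> list[str]:
    """List packet fields that are empty, in declaration order."""
    return [name for name, value in asdict(packet).items() if not value]


# Example: the agent captured the request but not what happens next.
packet = HandoffPacket(
    caller_name="Dana Reyes",
    callback_number="555-0100",
    intent="urgent dispatch",
    collected_fields={"address": "12 Example St", "issue_category": "leak"},
    escalation_reason="asked for person",
)
print(missing_fields(packet))  # ['confidence_note', 'next_action']
```

A tester can run a check like this on the packet staff actually receive and treat any missing field as evidence for the "Context packet" row of the checklist.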
Pass Bar
Before launch, a buyer should be able to say:
- The agent transfers when asked.
- The agent detects urgent or sensitive contexts.
- The transfer includes context.
- A failed transfer creates a real next step.
- Staff know what the AI promised.
- QA can review every handoff.
That pass bar is the difference between an AI phone agent and an operational support system.
Benchmark FAQs
What is a good human handoff result?
A good result transfers or routes the caller quickly, gives the receiving human the caller's identity and the reason for escalation, and creates a reliable callback or alert if no one answers.
Should every human handoff reduce the vendor score?
No. Handoff is not failure. The benchmark rewards agents that know when to stop, pass context, and protect the caller experience.
