What This Benchmark Measures
Latency is not one number. A caller feels the full chain: phone routing, speech detection, transcription, model response, voice generation, tool calls, transfer, and post-call processing.
This protocol measures six moments:
| Moment | Start | Stop |
|---|---|---|
| Answer to greeting | Call connects | First agent audio begins |
| Caller stop to response | Caller finishes a request | Agent begins relevant answer |
| Barge-in recovery | Caller interrupts while agent speaks | Agent stops and responds to correction |
| Tool-call wait | Agent says it will check something | Agent returns with a result or failure path |
| Transfer start | Escalation trigger occurs | Human ring, queue, or callback path begins |
| Post-call artifact | Call ends | Transcript, summary, and structured fields are available |
Run the benchmark on the same network, phone path, and test prompt for every vendor where practical. Where that is not possible, document the difference so the results can be compared fairly.
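One way to keep the six moments consistent across vendors is to treat each as a start/stop pair and compute the duration from logged timestamps. The sketch below assumes timestamps in seconds from the start of the call recording. The event labels (`call_connected`, `first_agent_audio`, and so on) are hypothetical names chosen for illustration and are not part of the protocol.

```python
from typing import Dict, Optional

# Hypothetical event labels: the protocol names the moments, not these keys.
MOMENTS = {
    "answer_to_greeting": ("call_connected", "first_agent_audio"),
    "caller_stop_to_response": ("caller_finished", "agent_answer_began"),
    "barge_in_recovery": ("caller_interrupted", "agent_responded_to_correction"),
    "tool_call_wait": ("agent_announced_check", "agent_returned_result"),
    "transfer_start": ("escalation_triggered", "handoff_path_began"),
    "post_call_artifact": ("call_ended", "artifacts_available"),
}


def moment_durations(events: Dict[str, float]) -> Dict[str, Optional[float]]:
    """Return seconds for each moment, or None when a timestamp is missing."""
    durations: Dict[str, Optional[float]] = {}
    for name, (start_key, stop_key) in MOMENTS.items():
        start = events.get(start_key)
        stop = events.get(stop_key)
        if start is None or stop is None:
            # Missing evidence stays visible instead of being guessed.
            durations[name] = None
        else:
            durations[name] = round(stop - start, 3)
    return durations
```

Returning `None` for a moment with a missing timestamp keeps gaps in the evidence explicit, which matters because the rubric treats missing timing evidence as a weakness in itself.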
Test Script
Run this scenario three times:
- Call the test number.
- Ask a simple factual question.
- Interrupt the answer with a correction.
- Ask the agent to check a calendar, CRM, order, or equivalent tool.
- Ask for a human.
- End the call and wait for the transcript or summary.
Use a stopwatch, call recording, or transcript timestamps. The buyer does not need lab-grade instrumentation for early vendor screening, but the timing method should be the same across vendors.
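If the timing comes from a call recording or transcript, a small helper can turn timestamp strings into intervals the same way for every vendor. This is a minimal sketch that assumes timestamps look like `MM:SS.s` or `HH:MM:SS.s`. Real transcript formats vary by vendor, so the parsing rule here is an assumption to adjust as needed.

```python
def to_seconds(stamp: str) -> float:
    """Convert 'MM:SS.s' or 'HH:MM:SS.s' (assumed format) into seconds."""
    seconds = 0.0
    for part in stamp.strip().split(":"):
        # Each colon-separated field is worth 60x the field to its right.
        seconds = seconds * 60 + float(part)
    return seconds


def interval(start_stamp: str, stop_stamp: str) -> float:
    """Seconds elapsed between two timestamps taken from the same recording."""
    delta = to_seconds(stop_stamp) - to_seconds(start_stamp)
    if delta < 0:
        raise ValueError("stop timestamp precedes start timestamp")
    return round(delta, 3)
```

For example, `interval("00:14.2", "00:15.9")` gives the caller-stop-to-response delay for a turn where the caller finished at 14.2 seconds and the agent began answering at 15.9 seconds.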
Scoring Rubric
| Score | Meaning |
|---|---|
| 1 | Long pauses, frequent talk-over, or no usable timing evidence. |
| 2 | Usable in a demo, but tool waits or interruptions feel awkward. |
| 3 | Acceptable for low-risk calls with occasional pauses and recoverable delays. |
| 4 | Natural timing in most turns, clean interruption recovery, and clear tool-call language. |
| 5 | Consistently natural pacing, logged timing, fast recovery, and no unexplained silence. |
Score the worst credible run. If two calls feel smooth and the third contains a long unexplained pause, the buyer should plan for that pause in production.
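Scoring the worst credible run amounts to taking the minimum rubric score across the runs, not the average. The sketch below assumes the tester has already excluded any run that was not credible (for example, one spoiled by a fault on the tester's own line). The function name is hypothetical.

```python
from typing import List


def benchmark_score(run_scores: List[int]) -> int:
    """Return the vendor's score: the lowest rubric score among credible runs."""
    if not run_scores:
        raise ValueError("at least one scored run is required")
    if any(score < 1 or score > 5 for score in run_scores):
        raise ValueError("rubric scores range from 1 to 5")
    # The worst run, not the mean, is what the buyer should plan for.
    return min(run_scores)
```

With two smooth calls scored 4 and a third scored 2 because of a long unexplained pause, `benchmark_score([4, 4, 2])` returns 2.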
What To Record
Capture:
- Connection method and phone path.
- Greeting delay.
- Average caller-stop-to-agent-response time.
- Worst caller-stop-to-agent-response time.
- Barge-in recovery time.
- Tool-call wait time.
- Transfer start time.
- Transcript availability time.
- Any silence longer than 3 seconds.
- Whether the agent explained waits honestly.
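The numeric items in this checklist can be summarized per call with a few lines of code. The sketch below assumes the tester has already collected response times and silence durations in seconds. The function and field names are hypothetical, and the 3-second threshold comes from the checklist above.

```python
from statistics import mean
from typing import Dict, List

SILENCE_THRESHOLD_S = 3.0  # checklist threshold for a silence worth recording


def summarize_call(response_times: List[float], silences: List[float]) -> Dict[str, object]:
    """Summarize caller-stop-to-agent-response times and long silences for one call."""
    if not response_times:
        raise ValueError("at least one response time is required")
    return {
        "average_response_s": round(mean(response_times), 3),
        "worst_response_s": max(response_times),
        # Keep only silences strictly above the threshold.
        "long_silences_s": [s for s in silences if s > SILENCE_THRESHOLD_S],
    }
```

Recording both the average and the worst response time keeps a single bad turn from being hidden by otherwise fast ones.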
Red Flags
Watch for:
- The agent talks over corrections.
- The agent repeats a full sentence after interruption.
- The agent goes silent while a tool runs instead of acknowledging the wait.
- The agent claims an action succeeded before the tool confirms it.
- The agent delays transfer after the caller asks for a person.
- The transcript arrives too late for staff follow-up.
Latency problems often become trust problems. A caller may forgive a small pause, but not a wrong booking or a stalled urgent transfer.
Better Than Raw Speed
The best voice agent is not always the one with the smallest response number. Buyers should prefer systems that:
- Acknowledge tool waits briefly.
- Stop speaking when interrupted.
- Confirm only when needed.
- Escalate without extra debate.
- Show timing data after the call.
Use the latency and architecture guide for deeper stack review, then run this benchmark during vendor demos.
Benchmark FAQs
What latency number matters most?
The most useful number is perceived response delay after the caller stops speaking, but buyers should also measure greeting speed, barge-in recovery, tool-call waits, transfer start, and post-call artifact availability.
Should the fastest vendor always win?
No. Predictable timing, clean interruption handling, and honest tool-call language can matter more than the lowest raw response number.
