What This Benchmark Measures
This benchmark tests whether the AI voice agent can survive normal caller messiness:
- Background noise.
- Speakerphone audio.
- Accents or fast speech.
- Spelled names.
- Street addresses.
- Phone numbers.
- Corrections.
- Interruptions.
- Low-confidence fields.
The goal is not perfect transcription. The goal is safe recovery when the agent is uncertain.
Test Setup
Run the same scenario three ways:
- Quiet room.
- Mild background noise.
- Caller on speakerphone or with deliberate interruptions.
Use the same caller request each time. Change only the audio condition and correction pattern.
Caller Script
Use fields that matter operationally:
- Name with spelling.
- Callback number.
- Address or location.
- Desired date or time.
- Service need.
- One correction after the agent repeats back a field.
Example correction phrases:
- “No, that was 512, not 215.”
- “The street is Cedar, C-E-D-A-R.”
- “Actually, make that Thursday afternoon.”
- “Wait, I need the downtown location.”
- “Can you repeat what you have?”
Scoring Rubric
| Score | Meaning |
|---|---|
| 1 | Agent captures wrong details and does not recover. |
| 2 | Agent asks repeats but still loses critical fields. |
| 3 | Agent confirms enough fields for low-risk workflows. |
| 4 | Agent handles corrections, marks uncertainty, and routes unclear cases. |
| 5 | Agent reliably captures critical details, logs confidence, and uses human fallback when needed. |
Fields To Verify
| Field | Pass condition |
|---|---|
| Name | Correct or marked uncertain. |
| Phone | Confirmed before follow-up or transfer. |
| Address | Repeated back if used for service, dispatch, or eligibility. |
| Date/time | Updated after correction. |
| Intent | Still understood after noisy audio. |
| Urgency | Not lost because another field was unclear. |
Red Flags
Watch for:
- The agent confidently repeats a wrong address.
- Corrections do not update the final summary.
- The transcript looks clean but the structured fields are wrong.
- The agent asks the same clarification too many times.
- Low-confidence fields are not visible to staff.
- No human fallback exists for bad audio.
Evidence Packet
Ask for:
- Audio or transcript.
- Structured fields extracted from the call.
- Confidence or uncertainty notes.
- Final summary.
- Tool-call payload if the fields were used.
- Transfer or callback packet if uncertainty required handoff.
Pass Bar
Before launch, the buyer should verify:
- Critical fields are confirmed.
- Corrections update the final record.
- Noisy details are marked uncertain when needed.
- Staff can see what needs review.
- The agent escalates when bad audio makes automation unsafe.
This benchmark is especially important for dispatch, healthcare, legal intake, and local services because one wrong address, phone number, or appointment time can break the entire workflow.
Benchmark FAQs
Why test noisy callers?
Reception calls often include background noise, poor phone audio, accents, names, addresses, and spelling corrections. A smooth quiet-room demo does not prove the agent can collect reliable operational details.
What should happen when confidence is low?
The agent should confirm critical fields, mark uncertainty in the handoff packet, or route to a human instead of pretending the data is reliable.
