Virtual Personas: a new way to test Voice Agents before launch

Testing voice agents with real customers is slow and expensive. Discover how rich life-story backstories let you simulate thousands of diverse callers and find failure points before going live.

The problem for those building voice agents

Anyone developing AI voice systems for customer support knows this scene well. You've built a Voice Agent. It works in internal testing. You launch it in production. And you discover that half of your customers react in ways you didn't anticipate.

Why does this happen? Because testing a voice agent is expensive. You need real people. You need hours of calls. You need representative samples. And often you find out too late that the bot works well with one type of customer but terribly with another.

What if there was a way to test Voice Agents with thousands of "virtual customers" before launch?

The idea: give the model a life

A group of Berkeley researchers published a paper introducing a method called Anthology (Moon et al., 2024). The idea is simple. Instead of telling the model "you are a 45-year-old man from Texas", you give it a complete life story.

A rich backstory. Detailed. With cultural references. With past experiences. With a way of speaking.

The model stops behaving like a stereotype. It starts behaving like a person.

Why demographic labels aren't enough

Older methods used prompts like this: "I'm a 25-year-old Californian without a high school diploma."

The result? The model falls into stereotypes. It responds the way it thinks "someone from California without a diploma" should respond. Not like a real person.

There's another problem. With demographic labels you can only approximate the population on average. Not individuals. And in customer support, individuals matter. A lot.

A customer angry over their fourth return in a row doesn't behave like "the average dissatisfied user". They behave like that customer, with that story.

The problem goes even deeper. Recent studies show that LLMs used for customer service and marketing absorb and amplify the biases present in their training data. Testing Voice Agents only with generic prompts means not seeing these problems until they surface with real customers.

What changes with backstories

Anthology generates backstories with open-ended prompts, such as "tell me about yourself". The LLM produces hundreds or thousands of plausible lives. Different from each other. Rich in detail.

These personas are then used to simulate responses.
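
A minimal sketch of the generation step, assuming a hypothetical `complete(prompt)` function that wraps whichever LLM you use. It is not part of Anthology or of any specific library, and the prompt wording and the deduplication loop are illustrative assumptions rather than the paper's exact procedure.

```python
from typing import Callable, List

# Open-ended prompt in the spirit of "tell me about yourself".
# The exact wording is an assumption, not the prompt used in the paper.
BACKSTORY_PROMPT = "Question: Tell me about yourself.\nAnswer:"


def generate_backstories(complete: Callable[[str], str], n: int) -> List[str]:
    """Sample up to n backstories, dropping empty and duplicate ones.

    `complete` is a hypothetical callable: prompt in, generated text out.
    """
    seen = set()
    backstories: List[str] = []
    attempts = 0
    # Cap the attempts so a model that keeps repeating itself cannot loop forever.
    while len(backstories) < n and attempts < n * 3:
        attempts += 1
        text = complete(BACKSTORY_PROMPT).strip()
        if text and text not in seen:
            seen.add(text)
            backstories.append(text)
    return backstories
```

Sampling with a non-zero temperature matters here: with deterministic decoding every call would return the same life story.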

In the original study, the researchers tested the method on three Pew Research Center surveys. They measured three things:

  • How similar virtual responses are to human ones (Wasserstein distance).

  • How closely the correlations between their answers match those found in human answers (Frobenius norm).

  • How internally consistent they are (Cronbach's alpha).
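
The three measurements above can be sketched with NumPy as follows. This is a textbook version of each metric under simple assumptions (ordered answer options with unit spacing, respondents as rows, questions as columns, no constant columns); the paper's exact computation may differ. The function names are ours.

```python
import numpy as np


def wasserstein_1d(p, q):
    """W1 distance between two distributions over the same ordered
    answer options, assuming unit spacing between adjacent options."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    # In one dimension, W1 is the summed absolute difference of the CDFs.
    return float(np.abs(np.cumsum(p) - np.cumsum(q)).sum())


def correlation_gap(human, virtual):
    """Frobenius norm of the difference between the two
    question-by-question correlation matrices."""
    c_h = np.corrcoef(np.asarray(human, dtype=float), rowvar=False)
    c_v = np.corrcoef(np.asarray(virtual, dtype=float), rowvar=False)
    return float(np.linalg.norm(c_h - c_v, ord="fro"))


def cronbach_alpha(responses):
    """Cronbach's alpha for a respondents-by-items matrix."""
    x = np.asarray(responses, dtype=float)
    k = x.shape[1]
    item_var = x.var(axis=0, ddof=1).sum()      # sum of per-item variances
    total_var = x.sum(axis=1).var(ddof=1)       # variance of respondents' totals
    return float(k / (k - 1) * (1 - item_var / total_var))
```

Lower is better for the first two (closer to human responses); higher is better for Cronbach's alpha.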

On Llama-3-70B and Mixtral-8x22B, Anthology beat all previous methods. Across all metrics. The results published on the Berkeley blog show up to 18% improvement in alignment with human distributions and 27% improvement in consistency.

Concrete application: testing voice agents in customer support

Now let's transfer the idea to a virtual call center. Here's how it would work.

Step 1: generate a population of virtual customers.

Not demographic lists. Complete stories. For example:

"My name is Maria. I'm 58. I live in a small town of three thousand people in Calabria. I've had the same SIM card for fifteen years. I only switched to digital during Covid because my daughter forced me to. I don't trust automated systems. When something doesn't work I call right away, I never write on apps."

Maria will speak to the Voice Agent in a specific way. She'll be impatient. She'll use simple words. She'll want to talk to a human right away.

Step 2: run the voice agent against a thousand different Marias.

Each with a different backstory. Different ages. Different patience levels. Different familiarity with technology. Different reasons for calling.
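
A rough sketch of that loop, working on text only. Both `caller` and `agent` are hypothetical callables you would supply: the first is an LLM conditioned on a backstory, the second is your Voice Agent's dialogue logic. The `[hangs up]` marker and the turn limit are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Transcript = List[Tuple[str, str]]  # (speaker, utterance) pairs

# Hypothetical stand-ins:
#   caller(backstory, transcript) -> what the simulated customer says next
#   agent(transcript)             -> the Voice Agent's reply (text only here)
Caller = Callable[[str, Transcript], str]
Agent = Callable[[Transcript], str]


@dataclass
class Call:
    backstory: str
    transcript: Transcript = field(default_factory=list)


def simulate_call(backstory: str, caller: Caller, agent: Agent,
                  max_turns: int = 6) -> Call:
    """Alternate caller and agent turns until a hang-up or the turn limit."""
    call = Call(backstory)
    for _ in range(max_turns):
        utterance = caller(backstory, call.transcript)
        call.transcript.append(("caller", utterance))
        if utterance == "[hangs up]":  # illustrative end-of-call marker
            break
        call.transcript.append(("agent", agent(call.transcript)))
    return call


def run_population(backstories: List[str], caller: Caller,
                   agent: Agent) -> List[Call]:
    """Run one simulated call per backstory."""
    return [simulate_call(b, caller, agent) for b in backstories]
```

A full voice test would add speech synthesis and recognition around each turn; the text-only version is usually enough to find dialogue-level failures first.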

Step 3: measure where the bot fails.

Does the Voice Agent understand Maria when she says "you're not listening to me" in a regional dialect? Does it recognize that she's angry even if she doesn't use explicit words? Does it know when to pass her to a human operator?
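
One simple way to turn those questions into numbers, assuming each simulated call has already been labelled with a segment (for example "dialect" or "standard") and a pass/fail outcome. How you define a failure and a segment is up to you; the names below are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def failure_rates(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """results: (segment, failed) pairs, one per simulated call."""
    calls = defaultdict(int)
    failures = defaultdict(int)
    for segment, failed in results:
        calls[segment] += 1
        failures[segment] += int(failed)
    return {s: failures[s] / calls[s] for s in calls}


def worst_gap(rates: Dict[str, float]) -> float:
    """Difference between the worst and the best segment failure rate."""
    return max(rates.values()) - min(rates.values())
```

A large gap between segments is often more informative than the overall failure rate, because it shows which kind of customer the bot is letting down.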

Concrete example scenarios

Let's look at three practical cases where virtual personas would make a difference.

Case 1: frustration detection.

A young, tech-savvy customer says "this thing isn't working". An elderly customer says "this is the third time I'm calling you about the same issue". Same level of anger. Different expressions.

A Voice Agent tested only on average transcripts can miss the second one without anyone noticing. Tested against a thousand personas with different backstories, the miss surfaces before launch.
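
As a sketch, this case can be written as a paired check: two utterances we assume carry the same frustration, and a detector that should flag both. The `keyword_detector` below is a deliberately naive baseline invented for illustration, not a real product component.

```python
from typing import Callable, List

# Utterances from the example in this section; treating both as equally
# frustrated is our labelling assumption.
FRUSTRATED_UTTERANCES: List[str] = [
    "this thing isn't working",
    "this is the third time I'm calling you about the same issue",
]


def missed_cases(detector: Callable[[str], bool]) -> List[str]:
    """Return the frustrated utterances the detector failed to flag."""
    return [u for u in FRUSTRATED_UTTERANCES if not detector(u)]


def keyword_detector(utterance: str) -> bool:
    """Naive baseline: flags only explicit complaint words."""
    keywords = ("isn't working", "broken", "angry")
    return any(k in utterance.lower() for k in keywords)
```

Calling `missed_cases(keyword_detector)` returns only the elderly customer's sentence, which is exactly the blind spot described here.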

Case 2: code-switching and dialects.

In Italy a customer can switch from Italian to dialect in the middle of a sentence. Especially when upset. With geographically diverse backstories, you discover early on whether the bot collapses when a gentleman from Rome says "ma che state a dì".

Case 3: implicit escalation signals.

Some customers openly threaten to switch providers. Others drop phrases like "yeah, I get how this works now". The second is an equally strong churn signal. A Voice Agent tested against diverse personas can be tuned to recognize it before launch.

What the research says

The idea of treating LLMs as agent models comes from a paper cited in the Anthology study: "Language Models as Agent Models" (Andreas, 2022). The intuition is that an LLM, given a context, generates text consistent with the agent that likely produced that context.

So if you condition it well, the model speaks like a specific person. Not like the average of all people.

The Anthology study shows that this works better with detailed backstories. The metrics confirm it on real surveys from the Pew Research Center.

For customer support this means one important thing. Virtual personas are not a toy. They are a measurable testing tool. Recent research on contact center Quality Assurance systems shows that evaluating Voice Agents without diverse profiles hides systematic treatment disparities of up to 16% between different groups.

The honest limitations

Let's be clear: there are open problems.

The backstories are generated by the LLM itself. So they capture the model's representation of human diversity. Not necessarily true human diversity. If the model has biases, those biases are reflected in the personas.

The bias risk must be monitored. The Anthology authors say this explicitly, referencing the Belmont Report principles on research ethics. Personas can perpetuate subtle stereotypes. They must be used with caution, especially in sensitive sectors.

They don't replace real-world testing. They are a first filter. They catch the big bugs. For the fine ones you still need real users.

In summary

Customer support is becoming the field where Voice Agent quality is decided. Misreading the person on the other end of the line costs you customers.

Anthology suggests a concrete path. Generate a thousand plausible lives. Test the Voice Agent on them. Find where it fails before it fails with real customers.

It's not the total solution. But it's a serious step in the right direction.

And above all, it's measurable. The original study demonstrates it works better than previous methods on real benchmarks from real surveys. So it's not just a nice idea. It's a tested idea.

References cited: