
You Can’t Script the Way People Actually Talk

Vivek Sinha VP, Product Management

As VP, Product Management at Five9, Vivek leads the strategic pivot to Agentic AI for the intelligent contact center. He is a seasoned product leader with prior roles driving AI and automation initiatives at Google, Adobe, and Marketo.

June 22, 2026

For about thirty years, the hardest constraint in the contact center was never the customer. It was the technology. We made people speak a machine’s language: press 1 for billing, say it in three words, choose option 3, repeat your account number twice. If the caller didn’t fit the script, the script won. The human adapted to the system.

Agentic AI finally flips that relationship. You define the goal; the agent reasons its way to the path, even when the caller doesn’t talk in a straight line. And it matters most on voice, still the highest-cost, most complex channel in customer experience, and the one where “adapt to the machine” has always felt the most broken.

But you don’t get there by adding a language model to yesterday’s architecture. You get there by changing the architecture, and the underlying pieces are now ready to do it. That’s what this post is about.

You can’t script reality

Start with a real caller:

IVA “Thanks for calling. In a few words, tell me what you’re calling about.”

Caller “Yeah, hi...so I...[dog barking] sorry, hold on...so I got my bill and it’s way higher than norm...actually, is this even the right number for billing? I added a line last month for my daughter and —”

IVA “I think you said billing. Is that right? Say yes or no.”

Caller “Well, yeah, but it’s about the new line, she’s headed to college so...”

IVA “I’m sorry, I didn’t get that. You can say: pay my bill, make a change, or technical support.”

Caller “No...I don’t want to pay it, I want to know why it went up...”

IVA “Okay. Transferring you to a representative. Estimated wait time is 23 minutes.”

Everyone reading this has built that flow. It isn’t a bad flow. It’s a trapped one. In eight seconds the caller was interrupted by her dog, interrupted herself, questioned whether she had the right number, told you exactly why her bill went up (a new line) and pivoted. That’s not an edge case. That’s how people talk. The scripted system did the only thing it can do: it flattened all of it into one keyword, guessed “billing,” and dead-ended her into a 23-minute hold.

Here’s the structural problem. To handle that call well with a script, you have to predict every path in advance. “Why is my bill so high?” alone fans out into a new line, proration, an expired promo, a roaming charge, autopay confusion, a price increase, each with its own follow-ups.

Figure 1 — Every branch is one you must author, test, and maintain.

Every branch you add is a branch you now own, test, and maintain. The effort curve goes vertical long before the conversation gets genuinely complex. That’s the ceiling scripted automation has always hit. That’s why bolting a large language model onto a scripted foundation doesn’t rescue it. The foundation itself assumes the paths were predicted. A smarter model on top of a predict-everything architecture is still a predict-everything architecture.
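To put rough numbers on that curve, here is a back-of-the-envelope sketch in Python. It assumes, purely for illustration, that every question fans out into the same number of follow-ups at every level; real call flows are lumpier, and the function names are made up for this example.

```python
def scripted_paths(branching: int, depth: int) -> int:
    """End-to-end paths in a decision tree with uniform fan-out (a simplification)."""
    return branching ** depth


def scripted_nodes(branching: int, depth: int) -> int:
    """Every prompt you would have to author, counting the opening question."""
    return sum(branching ** level for level in range(depth + 1))


# Six causes for "Why is my bill so high?", each with six follow-ups,
# three levels deep.
paths = scripted_paths(6, 3)
nodes = scripted_nodes(6, 3)
print(paths, nodes)
```

Under those assumptions, three levels of follow-ups already mean 216 distinct paths and 259 prompts to write, test, and maintain, and that is one intent.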

From scripts to goals

Agentic AI flips the unit of design. Instead of authoring the path, you define the outcome (“resolve the billing question, reduce the repeat call”) and the agent reasons the path in real time: understand, plan, act, check, adjust. When the caller pivots, the agent re-plans instead of breaking. That isn’t a smarter script. It’s a different execution layer, and it can’t be retrofitted onto the old one.
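The shape of that loop can be sketched in a few lines of Python. This is a toy, not how Five9 Voice AI Agent is implemented: keyword matching stands in for the language model, and every name here (CallState, understand, plan, act, run_call) is hypothetical. The point is only that the next step is chosen from the current state on every turn rather than from a fixed position in a script.

```python
from dataclasses import dataclass, field


@dataclass
class CallState:
    goal: str
    facts: dict = field(default_factory=dict)
    resolved: bool = False


def understand(state: CallState, utterance: str) -> None:
    """Fold what the caller just said into the state (keyword stand-in for a model)."""
    text = utterance.lower()
    if "bill" in text:
        state.facts["topic"] = "billing"
    if "added a line" in text or "new line" in text:
        state.facts["cause"] = "new_line"


def plan(state: CallState) -> str:
    """Choose the next step from what is known so far."""
    if "topic" not in state.facts:
        return "ask_topic"
    if "cause" not in state.facts:
        return "ask_what_changed"
    return "explain_charge"


def act(state: CallState, step: str) -> None:
    """Carry out the step; in this toy only explaining the charge meets the goal."""
    if step == "explain_charge":
        state.resolved = True


def run_call(goal: str, utterances: list[str]) -> tuple[CallState, list[str]]:
    state = CallState(goal=goal)
    steps = []
    for utterance in utterances:
        understand(state, utterance)
        step = plan(state)      # re-planned on every turn
        act(state, step)
        steps.append(step)
        if state.resolved:      # check: stop once the goal is met
            break
    return state, steps


state, steps = run_call(
    "resolve the billing question",
    ["Yeah, hi, sorry, hold on",
     "I got my bill and it's way higher",
     "I added a line last month for my daughter"],
)
print(steps)
```

If the caller says everything in one breath, the same loop goes straight to explaining the charge; nothing had to be authored for that ordering.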

It’s also not magic. Five9 Voice AI Agent improves through a deliberate, repeatable loop: evaluations, performance reports, and conversation reviews that an agent designer uses to refine the agent over time. The system gets better because humans, equipped with AI-driven monitoring, evaluations, and testing, inspect how it’s doing and tune it.

What “purpose-built” actually means

Purpose-built isn’t a slogan; it’s a set of concrete architectural choices, each aimed at the same thing: making sure the next call like the one above goes differently.

A voice-native execution layer. Voice can’t be an afterthought layered on a text agent. It’s synchronous and real-time: there’s no “please rephrase,” the audio is messy with accents and background noise and people talking over each other, and the agent has to manage a live spoken conversation as it happens. That demands an engine built for voice from the ground up, not a chatbot with a phone number.

A multi-agent orchestrator. This is the part that lets you adopt agentic AI without betting the entire contact center.

Figure 2 — An orchestrator coordinates specialized agents, which reuse shared horizontal agents.

An orchestrator sits on top and reads caller intent, routing each request to the right agent and re-routing mid-call when the caller jumps. Beneath it sit specialized agents (small, individually owned sub-agents that each do one job well, like billing dispute or appointment scheduling) and horizontal agents, shared capabilities like identity verification, payment, and knowledge lookup that every specialized agent can reuse. The payoff is practical: you add one agent at a time, reuse the pieces you’ve already hardened, and handle multiple intents in a single call. This is available and in production today.
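A minimal Python sketch of that layering follows. It is illustrative only: the function names, the session dictionary, and the keyword-based detect_intent are assumptions made for this example, not the product’s interfaces, and a real orchestrator would use a model to read intent.

```python
def verify_identity(session: dict) -> None:
    """Horizontal agent: a shared capability any specialized agent can reuse."""
    if not session.get("verified"):
        session["verified"] = True
        session["verify_calls"] = session.get("verify_calls", 0) + 1


def billing_dispute(session: dict, request: str) -> str:
    verify_identity(session)
    return "billing_dispute"


def appointment_scheduling(session: dict, request: str) -> str:
    verify_identity(session)
    return "appointment_scheduling"


# Specialized agents, registered by the intent they own.
SPECIALIZED = {
    "billing": billing_dispute,
    "appointment": appointment_scheduling,
}


def detect_intent(request: str) -> str:
    """Keyword stand-in for model-based intent detection."""
    text = request.lower()
    if "bill" in text or "charge" in text:
        return "billing"
    if "appointment" in text or "schedule" in text:
        return "appointment"
    return "unknown"


def orchestrate(requests: list[str]) -> tuple[dict, list[str]]:
    """Route each request in one call; the session carries over between agents."""
    session: dict = {}
    handled = []
    for request in requests:
        agent = SPECIALIZED.get(detect_intent(request))
        handled.append(agent(session, request) if agent else "clarify")
    return session, handled


session, handled = orchestrate([
    "Why did my bill go up?",
    "Also, can I schedule an appointment?",
    "um, hold on",
])
print(handled, session["verify_calls"])
```

In this sketch, adding a new specialized agent is one new function and one new registry entry, and identity verification runs once even though two agents depend on it.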

A native Tools Server. This is where one of the most common misconceptions gets cleared up. The Tools Server isn’t just “the agent makes an API call.” It’s where you sculpt the deterministic parts of a workflow: the steps that must always happen the same way, in the same order, regardless of what the model decides in the moment. When a ticket is raised, a confirmation email must follow; that’s a business rule, not a judgment call. Rather than ask the language model to reason about and fire each call separately (slower, and one more place to get it wrong), you chain them in the Tools Server so create_ticket → send_email runs as a single reliable unit. The agent decides what needs to happen; the Tools Server guarantees how. Keep the model out of the steps that don’t need reasoning, and the whole system gets faster, cheaper, and more predictable.
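As a rough illustration of the idea, here is what chaining those two steps into one unit could look like in Python. The create_ticket and send_email names come from the example above; their signatures, return values, and the wrapper raise_ticket_with_confirmation are assumptions for this sketch and not the actual Tools Server interface.

```python
call_log: list[str] = []  # records the order in which the steps ran


def create_ticket(summary: str) -> dict:
    """Stand-in for a ticketing API call (hypothetical signature)."""
    call_log.append("create_ticket")
    return {"ticket_id": "T-1", "summary": summary}


def send_email(to: str, ticket: dict) -> dict:
    """Stand-in for an email API call (hypothetical signature)."""
    call_log.append("send_email")
    return {"to": to, "subject": f"Ticket {ticket['ticket_id']} created"}


def raise_ticket_with_confirmation(summary: str, email: str) -> dict:
    """The single tool the agent sees: the order is fixed here, not by the model."""
    ticket = create_ticket(summary)
    confirmation = send_email(email, ticket)
    return {"ticket": ticket, "confirmation": confirmation}


result = raise_ticket_with_confirmation(
    "Bill increased after adding a line", "caller@example.com"
)
print(call_log, result["confirmation"]["subject"])
```

The agent makes one decision (a ticket is needed) and one call; the confirmation email cannot be skipped or reordered because the model never sees it as a separate step.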

Agents built to be owned and tested. Each agent is a building block of the multi-agent system, and it’s more than a prompt. It’s a defined unit with four parts: a goal that states the outcome, an operating procedure that lays out how to pursue it, the tools it’s allowed to use, and the knowledge it can draw on.

Figure 3 — An agent is a defined unit: goal, operating procedure, tools, and knowledge.

Structuring agents this way is what makes them testable, ownable, and safe to add incrementally — the foundation the orchestrator coordinates. You can build one agent, test it end to end, and deploy it without touching anything else. That’s what makes the multi-agent architecture practical at enterprise scale, not just elegant in a diagram.
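One way to picture that unit is as a small, explicit data structure. The Python sketch below is an assumption about shape only: AgentSpec, its field names, and the sample billing agent are invented for illustration and are not the product’s configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSpec:
    """An agent as a defined unit: goal, operating procedure, tools, knowledge."""
    name: str
    goal: str
    operating_procedure: tuple[str, ...]
    tools: frozenset[str]
    knowledge: tuple[str, ...]

    def can_use(self, tool: str) -> bool:
        """An agent may call only the tools it was explicitly given."""
        return tool in self.tools


billing_agent = AgentSpec(
    name="billing_dispute",
    goal="Resolve the billing question and reduce the repeat call",
    operating_procedure=(
        "Confirm which charge the caller is asking about",
        "Look up what changed on the account",
        "Explain the change in plain language",
    ),
    tools=frozenset({"lookup_invoice", "raise_ticket_with_confirmation"}),
    knowledge=("billing_faq", "plan_pricing"),
)

print(billing_agent.can_use("lookup_invoice"), billing_agent.can_use("issue_refund"))
```

Because everything the agent is allowed to do is declared in one place, it can be reviewed, tested, and owned on its own, which is what makes adding agents one at a time workable.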

Guardrails as a property, not a patch. Because all of this is built as one execution layer, control lives inside the architecture rather than being bolted on around it. We will cover guardrails in greater depth in a subsequent post.

Why a platform, not a point solution

Here’s what the architecture makes possible that a standalone AI tool can’t: a full spectrum from fully scripted to fully agentic, on one platform, under one admin experience, with carrier-grade voice underneath. You don’t have to choose a lane, and you don’t have to rip out what already works. You extend it. The high-governance flows that should stay deterministic can stay deterministic; the conversations that benefit from reasoning can become agentic; and the orchestrator lets both run in the same call.

The bottom line

This isn’t a feature update. It’s a new foundation: the first time the system adapts to the human instead of forcing the human to adapt to it.

In the next post, we’ll get out of the architecture and into the experience: what resolution-first voice AI actually sounds like to the person on the other end of the line. And after that, how enterprises put it into production with the control and proof they need to trust it.

That caller who added a new line, lost her train of thought, and still got dead-ended into hold? With Five9 Voice AI Agent, that same call, same dog, same pivot, same confusion about the right number, ends in two minutes. No transfer. No hold. No “I didn’t understand that.” She’s who this is for. Getting her to a resolution in the first call isn’t a better metric on a dashboard. It’s the whole point.

If you’re working out where to start, The Enterprise Guide to Voice AI Agents Buyer’s Guide is a practical guide to assessing voice AI capabilities and building a path forward.