The Voice AI Stack Is Collapsing Into One Model. We're Not Following (Yet)

The voice AI industry is consolidating in front of us. A year ago, building a voice agent meant assembling three things: a speech-to-text engine, a language model, and a text-to-speech voice. Today, the largest AI providers will sell you all three behind a single API. Audio goes in, audio comes out, and everything in between — transcription, reasoning, voice synthesis — happens inside one model you never see into.

It's genuinely impressive engineering. And it's reshaping how the market thinks about voice infrastructure: one vendor, one bill, one connection. The one-stop shop has arrived.

So why haven't we moved?

The short answer: we bet on the ability to swap components over the convenience of not having to. Here is the longer version.

What convergence buys you

Let's be honest about what the unified models do well, because they do it very well.

Latency is the headline. When audio flows through a single model rather than hopping between three providers, response times drop and conversations feel more natural. The integration story is simpler too — one WebSocket, one set of credentials, one latency budget. For a prototype, a consumer app, or a demo that needs to impress in a boardroom next week, the unified stack is hard to beat.

The direction of travel is also clear. Audio-native models are where the big labs are investing, and they will keep improving. Anyone telling you otherwise is selling something.

What convergence costs you

But here is what you give up when you buy the bundle: you can't get between the stages.

Figure: The unified stack. One connection, one vendor, and nothing you can reach between the stages.

In a composable pipeline, the moment a caller finishes speaking, we have text — and text is where the real work happens. We can validate an order number against your system before the agent responds. We can route based on what was actually said. We can log, audit, and correct. In a unified model, that seam doesn't exist. The model hears, thinks, and speaks in one pass, and you trust the pass.

That's a fine trade for a consumer assistant. It's a harder one to defend when the call is a customer booking a job, querying an invoice, or reading out a vehicle registration.
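To make the seam concrete, here is a minimal Python sketch of a composable turn handler. It is an illustration under stated assumptions, not a description of any production system: the class name, the hook signature, the stub stages, and the "one letter plus four digits" order format are all hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stage signatures; real providers would sit behind these.
Transcriber = Callable[[bytes], str]
Reasoner = Callable[[str], str]
Synthesizer = Callable[[str], bytes]
SeamHook = Callable[[str], str]  # takes transcript text, returns (possibly edited) text


@dataclass
class ComposablePipeline:
    transcribe: Transcriber
    reason: Reasoner
    synthesize: Synthesizer
    seam_hooks: List[SeamHook] = field(default_factory=list)
    audit_log: List[str] = field(default_factory=list)

    def handle_turn(self, audio: bytes) -> bytes:
        text = self.transcribe(audio)
        self.audit_log.append(f"heard: {text}")
        # The seam: plain text that can be validated, corrected, or routed on.
        for hook in self.seam_hooks:
            text = hook(text)
        self.audit_log.append(f"to_model: {text}")
        reply = self.reason(text)
        self.audit_log.append(f"reply: {reply}")
        return self.synthesize(reply)


# Stand-in for a lookup against the customer's own system (assumption).
KNOWN_ORDERS = {"A1234"}


def flag_unknown_orders(text: str) -> str:
    """Annotate order references that the lookup does not recognise."""
    for ref in re.findall(r"\b[A-Z]\d{4}\b", text):  # assumed reference format
        if ref not in KNOWN_ORDERS:
            text += f" [order {ref} not found]"
    return text


# Stub stages so the sketch runs without any provider: audio is just UTF-8 text.
pipeline = ComposablePipeline(
    transcribe=lambda audio: audio.decode("utf-8"),
    reason=lambda text: f"Understood: {text}",
    synthesize=lambda reply: reply.encode("utf-8"),
    seam_hooks=[flag_unknown_orders],
)
```

With these stubs, `pipeline.handle_turn(b"my order is B9999")` returns `b"Understood: my order is B9999 [order B9999 not found]"`, and the audit log records what was heard, what reached the model, and what was said back. A unified model offers no equivalent place to attach `flag_unknown_orders`.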

And everything else is welded together on top of that. The reasoning model inside a unified stack can't be swapped. If a competing model ships next quarter that's measurably better at understanding a tradie describing a job site, you can't use it — you get whatever your vendor built in. The voices are fixed, often to a handful of options. The pricing is theirs to set, and when there's no second supplier in your stack, there's no pressure keeping it honest.

Why we built composable — and stayed composable

When we designed our platform, we took a best-of-breed approach: choose the strongest speech recognition, the strongest reasoning model, and the strongest voice available — independently — and orchestrate them ourselves.

That decision still holds, and not out of habit. Business phone calls in New Zealand have specific demands. Speech recognition has to handle our accents, our place names, and the alphanumeric strings that fill real calls — order numbers, registrations, addresses. The reasoning layer has to call into the systems our customers actually run, from job management to accounting. And the economics have to be predictable enough to build a wholesale price on.

Here's what that looks like in practice. A caller says they need a job done in Whangārei, or gives an address in Ōtorohanga or Paraparaumu. Speech models trained predominantly on North American audio mangle these names routinely — and in a unified stack, whatever the model misheard is what it reasons with. You never see it happen.

Our pipeline has a seam there, and we use it. Every call passes through an augmentation layer between speech recognition and reasoning: a curated set of New Zealand place names and business vocabulary that corrects the transcript and enriches the context before the reasoning model ever sees it. We can insert, modify, and validate text at that point — fix the place name, normalise the registration plate, check the booking reference against your system. The model receives clean, grounded input; the caller gets an agent that doesn't ask Northland to repeat itself.

Figure: Our composable pipeline. Every component is chosen on merit and swappable, with open text seams where the transcript is corrected and enriched before the model responds.
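As a rough illustration of the kind of correction described above, the sketch below fuzzy-matches transcript words against a small place-name list and tidies a spoken registration plate. It is an assumption-laden toy, not the actual augmentation layer: the three-name gazetteer, the 0.8 similarity cutoff, and both function names are hypothetical, and it uses only Python's standard `difflib`, `re`, and `unicodedata` modules.

```python
import difflib
import re
import unicodedata

# Tiny illustrative gazetteer; a real curated list would be far larger.
PLACE_NAMES = ["Whangārei", "Ōtorohanga", "Paraparaumu"]


def _fold(text: str) -> str:
    """Lowercase and strip macrons so 'Whangārei' and 'whangarei' compare equal."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


# Map folded spellings back to the canonical form.
_FOLDED = {_fold(name): name for name in PLACE_NAMES}


def correct_place_names(transcript: str, cutoff: float = 0.8) -> str:
    """Replace words that closely resemble a known place name with its canonical spelling."""
    transcript = unicodedata.normalize("NFC", transcript)

    def fix(match: re.Match) -> str:
        word = match.group(0)
        hits = difflib.get_close_matches(_fold(word), list(_FOLDED), n=1, cutoff=cutoff)
        return _FOLDED[hits[0]] if hits else word

    return re.sub(r"\w+", fix, transcript)


DIGIT_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def normalise_plate(spoken: str) -> str:
    """Collapse a spelled-out plate such as 'a b c one two three' into 'ABC123'."""
    parts = []
    for token in re.findall(r"[A-Za-z0-9]+", spoken):
        parts.append(DIGIT_WORDS.get(token.lower(), token.upper()))
    return "".join(parts)
```

For example, `correct_place_names("a job in Fangarei")` yields `"a job in Whangārei"`, and `normalise_plate("a b c one two three")` yields `"ABC123"`. String similarity alone will produce false positives on a large vocabulary, so a realistic version would likely need context (is the caller giving an address?) before rewriting a word.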

There's a data sovereignty dimension here too. A unified offshore model gives you no control over where audio is processed. That is a real constraint under New Zealand's Privacy Act 2020, which sets conditions on disclosing personal information overseas, and one that MSP and telco partners can't easily waive on their customers' behalf.

Those requirements are best served by evaluating each component on its own merits, against real call audio, on a regular cycle. That's work we do so our partners don't have to. When one provider's transcription accuracy slips or another's pricing moves, we see it in our benchmarks — and because nothing in our stack is welded to anything else, we can act on it.
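One common way to score transcription in a benchmark like this is word error rate (WER): the word-level edit distance between a human reference transcript and the provider's output, divided by the reference length. The sketch below is a generic illustration of that metric, not a description of our benchmark suite; the provider names and sample transcripts are made up.

```python
from typing import Dict, List, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def rank_providers(
    references: List[str],
    hypotheses_by_provider: Dict[str, List[str]],
) -> List[Tuple[str, float]]:
    """Return (provider, mean per-utterance WER) pairs, best first."""
    scores = {}
    for name, hypotheses in hypotheses_by_provider.items():
        errors = [word_error_rate(r, h) for r, h in zip(references, hypotheses)]
        scores[name] = sum(errors) / len(errors)
    return sorted(scores.items(), key=lambda item: item[1])


# Made-up sample: one reference utterance, two hypothetical providers.
references = ["book a job in paraparaumu"]
results = rank_providers(
    references,
    {
        "provider_a": ["book a job in paraparaumu"],
        "provider_b": ["book a job in paraparam"],
    },
)
```

Here `provider_b` gets one word in five wrong, so it scores 0.2 against 0.0 for `provider_a`. Averaging per-utterance WER is one choice among several; a corpus-level figure (total errors over total reference words) weights long calls more heavily.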

Optionality is the product

This is the part worth being direct about: our architecture isn't a hedge against being wrong. It's a deliberate bet that in a market moving this fast, the ability to change components is worth more than the convenience of not having to.

When a better speech model ships, our partners get it without a migration, a re-integration, or a renegotiation. When a vendor reprices, we have alternatives in the stack the same week. The market can shift under us — that's exactly what it's been doing — and the ground our partners stand on doesn't move.

The "(yet)" is doing real work

None of this is dogma. We benchmark the unified models alongside everything else, and we expect them to keep closing the gap. The day a converged stack wins our tests for production business calling (accuracy on NZ phone audio, reliability of tool calls, voice quality, and cost per minute), we'll adopt it.

And here's the quiet advantage of composability: when that day comes, we can. Swapping in a unified model is just another component change on our side of the line. Our partners' integrations, numbers, and workflows won't notice.
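In code terms, the idea is that callers depend on a narrow interface rather than on the stages behind it. The following sketch assumes a hypothetical `VoiceAgent` interface with a single `handle_turn` method; the class names and the stub models are illustrative only.

```python
from typing import Callable, Protocol


class VoiceAgent(Protocol):
    """What the rest of the platform depends on: audio in, audio out."""

    def handle_turn(self, audio: bytes) -> bytes: ...


class PipelineAgent:
    """Three separately chosen stages, orchestrated in sequence."""

    def __init__(
        self,
        transcribe: Callable[[bytes], str],
        reason: Callable[[str], str],
        synthesize: Callable[[str], bytes],
    ) -> None:
        self.transcribe = transcribe
        self.reason = reason
        self.synthesize = synthesize

    def handle_turn(self, audio: bytes) -> bytes:
        return self.synthesize(self.reason(self.transcribe(audio)))


class UnifiedModelAgent:
    """A single audio-to-audio model wrapped behind the same interface."""

    def __init__(self, speech_to_speech: Callable[[bytes], bytes]) -> None:
        self.speech_to_speech = speech_to_speech

    def handle_turn(self, audio: bytes) -> bytes:
        return self.speech_to_speech(audio)


def answer_call(agent: VoiceAgent, audio: bytes) -> bytes:
    """Integration code sees only the interface, not what sits behind it."""
    return agent.handle_turn(audio)


# Stub models so the sketch runs: "audio" is UTF-8 text, the "reply" is upper case.
composed = PipelineAgent(
    transcribe=lambda audio: audio.decode("utf-8"),
    reason=lambda text: text.upper(),
    synthesize=lambda reply: reply.encode("utf-8"),
)
unified = UnifiedModelAgent(speech_to_speech=lambda audio: audio.upper())
```

`answer_call(composed, b"hello")` and `answer_call(unified, b"hello")` both run through the same call site, which is the property that keeps a swap invisible to integrations. The trade-off is that `UnifiedModelAgent` exposes no intermediate text, so any correction or validation step would have to move elsewhere.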

That's the position we think a wholesale voice platform should hold right now. Watch the convergence closely. Benchmark it honestly. And keep the architecture that lets you say yes to it the moment it earns the switch — not a day before.
