Why the Shift from Nova 2 to Flux Matters

When you phone an AI voice agent, there is a very specific moment that separates calls that feel natural from calls that feel off. It is not the AI's accent. It is not the vocabulary. It is the gap between you finishing your sentence and the agent starting to respond.

That gap — the turn boundary — is where almost every voice AI still betrays itself. Get it too short and the agent interrupts you mid-thought. Get it too long and you get that familiar, slightly robotic pause that tells you it does not quite understand when you have stopped talking. For the last two years, the whole industry has been wrestling with this narrow problem. Deepgram's new Flux model is the most interesting answer anyone has shipped so far, and we are in the middle of moving our platform from their previous model (Nova 2) across to it. Here is what is actually changing, and why it matters.

How speech-to-text has worked until now

When you speak to a voice agent, a speech-to-text service transcribes your words into text that the AI can read. Nova 2, an earlier Deepgram flagship and the model we run today, does this very well. Accuracy is high, latency is low, and it handles accents gracefully.

The problem is that Nova 2, like almost every speech-to-text model before it, does not really know when you have finished speaking. It just transcribes the words. Figuring out that the caller has stopped is a separate job, handled by a technique called endpointing. Endpointing listens for silence. If the caller goes quiet for, say, 700 milliseconds, the system assumes they are done and hands the transcript to the AI.
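The silence-based approach described above can be sketched in a few lines. This is a minimal illustration, not our production code or Deepgram's: the 20 ms frame length, the class name SilenceEndpointer, and the idea that some upstream detector has already labelled each frame as speech or silence are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

FRAME_MS = 20  # assumed frame length; real pipelines vary


@dataclass
class SilenceEndpointer:
    """Declares end of turn after `threshold_ms` of continuous silence."""
    threshold_ms: int = 700
    _silence_ms: int = 0
    _heard_speech: bool = False

    def feed(self, is_speech: bool) -> bool:
        """Consume one frame; return True once the turn is considered over."""
        if is_speech:
            self._heard_speech = True
            self._silence_ms = 0  # any speech resets the silence timer
            return False
        if not self._heard_speech:
            return False  # ignore silence before the caller starts talking
        self._silence_ms += FRAME_MS
        return self._silence_ms >= self.threshold_ms


def end_of_turn_ms(frames: Iterable[bool], threshold_ms: int) -> Optional[int]:
    """Return the time in ms at which the endpointer fires, or None."""
    endpointer = SilenceEndpointer(threshold_ms=threshold_ms)
    for index, is_speech in enumerate(frames):
        if endpointer.feed(is_speech):
            return (index + 1) * FRAME_MS
    return None
```

Take a caller who speaks for one second, pauses for 500 ms, speaks for another second, then stops. With a 300 ms threshold this sketch fires at 1,300 ms, in the middle of the sentence. With a 700 ms threshold it waits until 3,200 ms, which is 700 ms after the caller actually finished.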

This works. It also creates every awkward moment you have ever had on an AI call.

Silence threshold too short — 300 ms, say — and the AI cuts you off every time you pause to think or search for a word.
Silence threshold too long — 1,200 ms — and every caller experiences a noticeable, unnatural delay after they stop speaking, which makes the agent feel slow and dim.
Everyone is different. Thoughtful speakers often pause longer. Callers with strong accents sometimes pause where a native speaker would not. A tradie describing a job tends to pause mid-sentence to check an address. A silence threshold that is perfect for one caller will be wrong for the next.

No single number works. Vendors spend a lot of time tuning this, and the best you can do is pick a compromise that is wrong for everyone, but not badly wrong for most people.
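The trade-off can be made concrete with a little arithmetic. The three callers and their pause lengths below are made-up numbers chosen only to illustrate the point, and the function name interrupted_by is hypothetical.

```python
from typing import Dict, List

# Made-up illustrative callers: longest mid-turn pause in milliseconds.
CALLERS: Dict[str, int] = {
    "quick talker": 250,
    "thoughtful speaker": 900,
    "tradie checking an address": 1100,
}


def interrupted_by(threshold_ms: int, pauses: Dict[str, int] = CALLERS) -> List[str]:
    """Callers whose longest pause meets or exceeds the threshold get cut off."""
    return [name for name, pause in pauses.items() if pause >= threshold_ms]


for threshold_ms in (300, 700, 1200):
    cut_off = interrupted_by(threshold_ms)
    # Every caller also waits `threshold_ms` of dead air after their final word.
    print(f"{threshold_ms} ms: interrupts {cut_off or 'nobody'}, "
          f"adds {threshold_ms} ms of dead air per turn")
```

In this toy data set, the only threshold that interrupts nobody is also the one that makes every caller wait more than a second for a reply.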

Here is how the two approaches handle different types of speakers:

Nova 2 uses a fixed 700 ms compromise that is wrong for most callers. Flux adapts the turn boundary to each caller's speech pattern in real time.

What Flux does differently

Flux is the first Deepgram model with turn detection built into the model itself. It is not just transcribing your words and waiting for silence. It is also — in parallel, on the same audio stream — predicting whether you have actually finished the thought you were trying to express.

The model uses signals that silence detection cannot see: sentence intonation (did your voice drop at the end, or is it still rising?), grammatical completeness, speech rhythm, and the characteristic micro-patterns of people thinking versus people finishing. It emits two distinct signals during a call:

speech_final — "I have a stable transcript of what you just said, but you might not be done."
utterance_end — "I think you have actually finished your turn now."

The voice agent can react to these differently. It can keep listening through a pause when the model's turn-detection says you are still going, and respond quickly when the model says you are clearly done. The threshold is no longer a single blunt number; it is a prediction the model makes fresh for every single turn, based on you.
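A rough sketch of how an agent might treat the two signals differently is shown here. The signal names are the ones used in this article. Everything else (the TurnTracker class, the on_event method, and the idea that each event arrives as a kind plus a transcript string) is an assumption for illustration and is not Deepgram's SDK or wire format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TurnTracker:
    """Accumulates stable transcript segments until the turn ends."""
    segments: List[str] = field(default_factory=list)

    def on_event(self, kind: str, transcript: str = "") -> Optional[str]:
        """Return the full turn text when the turn ends, otherwise None."""
        if kind == "speech_final":
            # Stable text, but the caller may continue: store it and keep listening.
            if transcript:
                self.segments.append(transcript)
            return None
        if kind == "utterance_end":
            # The model believes the turn is over: hand everything to the agent.
            turn = " ".join(self.segments)
            self.segments.clear()
            return turn or None
        return None  # ignore event kinds this sketch does not know about
```

Feeding it speech_final with "I need a plumber at", then speech_final with "42 Queen Street", then utterance_end returns the single turn "I need a plumber at 42 Queen Street". The agent only starts replying on that last event, so a pause between the two segments does not trigger a response.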

Flux classifies pause intent in real time: thinking pauses, completion pauses, and mid-sentence word-search pauses each get different treatment.

What this means on the call

Nothing, visibly. That is the point.

You should notice fewer "sorry, I wasn't finished" moments. The agent should cut in less often when you pause to check a job number. It should also come back faster when you clearly have finished, because the system is no longer holding open a safety buffer of dead air to avoid getting it wrong. Both ends of the trade-off improve at once.

None of this is magic. Flux will still get some turns wrong. But on average, and in the ways that matter to callers who are not forensically listening for flaws, it pushes the experience a meaningful step closer to talking to a person.

Why we are rolling it out carefully

New speech models usually look spectacular in demos and then reveal quirks under real traffic. At dareena.ai, we have seen this enough times to be cautious.

We are currently evaluating Flux with a small number of controlled tenants. We are measuring the same call-quality metrics we always track: barge-in rate, silence-prompt rate, average latency to first audio, and word error rate on real NZ transcripts. The decision to move the broader base across to Flux will be made on that data, not on the press release. If the numbers support it, we will roll forward. If they do not, we stay on Nova 2 for now and revisit when Deepgram iterates.
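Of those metrics, word error rate has a standard definition: the number of word substitutions, deletions, and insertions needed to turn the model's transcript into a human reference, divided by the number of words in the reference. The sketch below computes it with a plain edit-distance table. The function name and the simple lowercase-and-split normalisation are assumptions for illustration, and our own scoring pipeline may normalise text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring the hypothesis "the job is ponsonby" against the reference "the job is in ponsonby" gives one deletion over five reference words, a word error rate of 0.2.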

There is also a second phase waiting in the wings. Flux supports native end-of-turn events that can drive the agent directly, bypassing our own endpointing logic entirely. That is a bigger architectural change than just swapping one model for another, and it is gated on Phase 1 showing us what we hope it will show us.

The bigger picture

The turn boundary is, increasingly, the last visible tell of a voice AI. Models are good enough now that accuracy, latency, and voice naturalness are largely solved — what remains is the rhythm of the conversation itself. Flux is a real step forward on that front. It is not the end of the road, but it is probably the biggest step we will see this year on the specific problem of making an AI voice agent feel like it is actually listening to you, rather than waiting for you to stop making sound.
