When you phone an AI voice agent, there is a very specific moment that separates calls that feel natural from calls that feel off. It is not the AI's accent. It is not the vocabulary. It is the gap between you finishing your sentence and the agent starting to respond.

That gap — the turn boundary — is where almost every voice AI still betrays itself. Get it too short and the agent interrupts you mid-thought. Get it too long and you get that familiar, slightly robotic pause that tells you it does not quite understand when you have stopped talking. For the last two years, the whole industry has been wrestling with this narrow problem. Deepgram's new Flux model is the most interesting answer anyone has shipped so far, and we are in the middle of moving our platform from their previous model (Nova 2) across to it. Here is what is actually changing, and why it matters.

How speech-to-text has worked until now

When you speak to a voice agent, a speech-to-text service transcribes your words into text that the AI can read. Nova 2 — Deepgram's previous flagship — does this very well. Accuracy is high, latency is low, and it handles accents gracefully.

The problem is that Nova 2, like almost every speech-to-text model before it, does not really know when you have finished speaking. It just transcribes the words. Figuring out that the caller has stopped is a separate job, handled by a technique called endpointing. Endpointing listens for silence. If the caller goes quiet for, say, 700 milliseconds, the system assumes they are done and hands the transcript to the AI.
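As a rough illustration of the silence-based endpointing described above, the Python sketch below assumes the audio has already been split into frames labelled as speech or silence. The `Frame` type, the `find_endpoint` function, and the 100 ms frame size are assumptions made for this example; they are not part of Deepgram's API or of our production code.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    """One chunk of audio, already classified as speech or silence."""
    duration_ms: int
    is_speech: bool


def find_endpoint(frames, silence_threshold_ms=700):
    """Return the time (ms from start) at which the turn is declared over.

    The turn ends once the caller has spoken at least once and has then
    stayed silent for silence_threshold_ms of continuous audio.
    Returns None if that never happens.
    """
    elapsed_ms = 0
    silence_ms = 0
    heard_speech = False
    for frame in frames:
        elapsed_ms += frame.duration_ms
        if frame.is_speech:
            heard_speech = True
            silence_ms = 0  # any speech resets the silence counter
        elif heard_speech:
            silence_ms += frame.duration_ms
            if silence_ms >= silence_threshold_ms:
                return elapsed_ms
    return None


# A caller speaks for 1 s, pauses for 1.2 s to think, then speaks again.
frames = (
    [Frame(100, True)] * 10
    + [Frame(100, False)] * 12
    + [Frame(100, True)] * 5
)

cut_in_at = find_endpoint(frames)            # 1700: fires mid-pause
patient = find_endpoint(frames, 1300)        # None: waits through the pause
```

With the 700 ms default, the endpoint fires 1.7 seconds in, while the caller is still thinking. Raising the threshold avoids that interruption, but only by making every caller wait longer.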

This works. It also creates every awkward moment you have ever had on an AI call.

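No single number works. A short threshold cuts off callers who pause to think, and a long one leaves fast talkers waiting in dead air. Vendors spend a lot of time tuning this, and the best you can do is pick a compromise that is wrong for everyone, but not badly wrong for most people.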

Here is how the two approaches handle different types of speakers:

Nova 2: fixed 700 ms threshold (one setting for everyone, wrong for most)

Thoughtful speaker. Natural pause: 1200 ms. Gets interrupted at 700 ms. Feels rushed and frustrated: "You cut me off again!"
Average caller. Natural pause: 650 ms. 700 ms works, mostly. The slight delay feels robotic, but acceptable.
Fast speaker. Natural pause: 400 ms. An awkward 300 ms delay. The AI feels slow and dim: "Hello? Are you there?"

Flux: adaptive turn detection (analyses each caller's speech patterns in real time)

Thoughtful speaker. Flux detects slower speech rhythm, thoughtful pauses, and rising intonation, so it waits 1200 ms. "Perfect timing!" Natural conversation.
Average caller. Flux detects standard cadence, a complete thought, and a falling tone, so it responds at 650 ms. Snappy and natural, with no robotic delay.
Fast speaker. Flux detects a rapid speech rate, crisp endings, and that the caller is done speaking, so it responds at 400 ms. Keeps up perfectly and feels intelligent.

The key insight: fixed thresholds are wrong for most people. Flux adapts to each caller's natural rhythm.

With Nova 2, endpointing relies on a fixed threshold (700 ms in this example), a compromise that is wrong for most callers. Flux adapts the turn boundary to each caller's speech pattern in real time.
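The arithmetic behind that comparison can be sketched in a few lines of Python. The three natural-pause values are the illustrative figures used above; the `outcome` and `evaluate` helpers are names invented for this example, and the model is deliberately simplistic (one pause length per speaker).

```python
# Illustrative natural pause lengths (ms) for the three speaker types above.
NATURAL_PAUSE_MS = {"thoughtful": 1200, "average": 650, "fast": 400}


def outcome(threshold_ms, natural_pause_ms):
    """Describe what a fixed silence threshold does to one speaker."""
    if threshold_ms < natural_pause_ms:
        # The threshold expires before the speaker's pause is over.
        return "interrupted"
    # Otherwise the speaker waits out the rest of the threshold.
    return f"{threshold_ms - natural_pause_ms} ms extra delay"


def evaluate(threshold_ms):
    """Apply one fixed threshold to every speaker type."""
    return {
        name: outcome(threshold_ms, pause)
        for name, pause in NATURAL_PAUSE_MS.items()
    }


for threshold in (400, 700, 1200):
    print(threshold, evaluate(threshold))
```

At 400 ms two of the three speakers are interrupted. At 1200 ms nobody is interrupted, but the fast speaker sits through 800 ms of dead air. At 700 ms the thoughtful speaker is cut off and the fast speaker still waits 300 ms, which is the compromise described above.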

What Flux does differently

Flux is the first Deepgram model with turn detection built into the model itself. It is not just transcribing your words and waiting for silence. It is also — in parallel, on the same audio stream — predicting whether you have actually finished the thought you were trying to express.

The model uses signals that silence detection cannot see: sentence intonation (did your voice drop at the end, or is it still rising?), grammatical completeness, speech rhythm, and the characteristic micro-patterns of people thinking versus people finishing. It emits two distinct signals during a call.

The voice agent can react to these differently. It can keep listening through a pause when the model's turn-detection says you are still going, and respond quickly when the model says you are clearly done. The threshold is no longer a single blunt number; it is a prediction the model makes fresh for every single turn, based on you.
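To make that concrete, here is a minimal Python sketch of an agent choosing what to do during a pause. It is an illustration only: the `TurnPrediction` labels are placeholders taken from the wording above ("still going", "clearly done"), not the names of Flux's actual events, and the `max_wait_ms` safety net that prompts a silent caller is an assumption about agent design rather than anything Flux provides.

```python
from enum import Enum


class TurnPrediction(Enum):
    """Placeholder labels for what the model says about the current pause."""
    STILL_GOING = "still_going"
    CLEARLY_DONE = "clearly_done"


def next_action(prediction, silence_ms, max_wait_ms=3000):
    """Decide what the agent does while the caller is silent.

    prediction  -- the model's view of the turn (hypothetical labels)
    silence_ms  -- how long the caller has been silent so far
    max_wait_ms -- assumed upper bound before the agent checks in
    """
    if prediction is TurnPrediction.CLEARLY_DONE:
        return "respond"          # no fixed buffer of dead air to sit through
    if silence_ms >= max_wait_ms:
        return "prompt"           # caller has gone quiet for a long time
    return "keep_listening"       # pause is part of an unfinished thought


# A 1200 ms thinking pause is waited out; a finished sentence gets a fast reply.
thinking = next_action(TurnPrediction.STILL_GOING, silence_ms=1200)
finished = next_action(TurnPrediction.CLEARLY_DONE, silence_ms=400)
```

The point of the design is that the length of the silence no longer decides the outcome by itself; it only matters as a fallback when the caller has been quiet for an unusually long time.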

Thinking pause: "Can you check the hot water cylinder in... [thinking] ...Dunedin?" The caller pauses for 1200 ms after "Can you check...". Flux detects rising intonation and an incomplete thought, so it waits out the 1200 ms. Perfect timing.

Completion pause: "The address is 123 Queen Street." [done speaking] 400 ms of silence follows. Flux detects falling intonation and a complete sentence, so it responds at 400 ms. Snappy.

Mid-word pause: "The water heater is in the base... [searching] ...ment." The caller searches for the word for 800 ms. Flux detects a hesitation pattern mid-phrase, so it waits 800 ms. No interruption.

Flux adapts pause length based on why the caller paused. Thinking: wait longer. Done speaking: respond quickly. Word searching: patience.

Flux classifies pause intent in real time: thinking pauses, completion pauses, and mid-word pauses while the caller searches for a word each get different treatment.

What this means on the call

Nothing, visibly. That is the point.

You should notice fewer "sorry, I wasn't finished" moments. The agent should cut in less often when you pause to check a job number. It should also come back faster when you clearly have finished, because the system is no longer holding open a safety buffer of dead air to avoid getting it wrong. Both ends of the trade-off improve at once.

None of this is magic. Flux will still get some turns wrong. But on average, and in the ways that matter to callers who are not forensically listening for flaws, it pushes the experience a meaningful step closer to talking to a person.

Why we are rolling it out carefully

New speech models usually look spectacular in demos and then reveal quirks under real traffic. At dareena.ai, we have seen this enough times to be cautious.

We are currently evaluating Flux with a small number of controlled tenants. We are measuring the same call-quality metrics we always track — barge-in rate, silence-prompt rate, average latency to first audio, word error rate on real NZ transcripts. The decision to move the broader base across to Flux will be made on that data, not on the press release. If the numbers support it, we will roll forward. If they do not, we stay on Nova 2 for now and revisit when Deepgram iterates.
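Two of those metrics are simple enough to sketch. The Python below computes word error rate in the standard way, as word-level edit distance divided by the number of reference words, and a barge-in rate under an assumed definition (interrupted caller turns divided by total caller turns). Both function names are invented for this example, and this is not a description of our actual measurement pipeline.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missed)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def barge_in_rate(interrupted_turns, total_turns):
    """Share of caller turns the agent cut into (assumed definition)."""
    if total_turns <= 0:
        raise ValueError("total_turns must be positive")
    return interrupted_turns / total_turns


# One substituted word out of six reference words.
wer = word_error_rate(
    "the address is 123 queen street",
    "the address is 123 queens street",
)
rate = barge_in_rate(interrupted_turns=3, total_turns=120)
```

Comparing these numbers between the Nova 2 and Flux tenant groups over the same period is what turns "it feels better" into something that can be decided on.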

There is also a second phase waiting in the wings. Flux supports native end-of-turn events that can drive the agent directly, bypassing our own endpointing logic entirely. That is a bigger architectural change than just swapping one model for another, and it is gated on the current evaluation (Phase 1) showing us what we hope it will show us.

The bigger picture

The turn boundary is, increasingly, the last visible tell of a voice AI. Models are good enough now that accuracy, latency, and voice naturalness are largely solved — what remains is the rhythm of the conversation itself. Flux is a real step forward on that front. It is not the end of the road, but it is probably the biggest step we will see this year on the specific problem of making an AI voice agent feel like it is actually listening to you, rather than waiting for you to stop making sound.

Try a call for yourself

See how it sounds — free for 7 days, no credit card required.
