Sakana AI's KAME Lets Voice AI Think and Speak at the Same Time

Akram Chauhan

You know that awkward pause? The one you get when you ask your voice assistant a tricky question. You finish speaking, and then... silence. For a solid two seconds, you're just waiting, wondering if it even heard you. It’s just long enough to completely break the flow of a conversation and remind you that you’re talking to a machine.

For years, AI developers have been stuck with a frustrating choice: make an AI that’s fast, or make one that’s smart. You could have one, but not both.

On one hand, you have the "fast talkers." These are direct speech-to-speech (S2S) models. They start responding almost instantly, which feels incredibly natural. The problem? Their answers are often pretty shallow. They’re great at "what's the weather?" but stumble on anything that requires real thought.

On the other hand, you have the "deep thinkers." These systems take your speech, turn it into text, feed that text to a powerful Large Language Model (LLM) like GPT-4, get a text response, and then turn that back into speech. The answers are brilliant, but the process is painfully slow: a median delay of around 2.1 seconds, which is exactly what creates that cringey, unnatural pause.

It felt like we were stuck. Do we want a conversation that flows or one that’s actually intelligent? Well, a Tokyo-based AI lab called Sakana AI just came out with something that says: why not both? They call it KAME, and it’s a genuinely clever approach that might just fix the most annoying thing about talking to AI.

So, What's KAME's Secret? It Speaks While It Thinks

The magic behind KAME is that it doesn’t choose between being fast or smart. It does both at the same time by running two different systems in parallel.

Think of it like a two-person team.

  1. The Front-End (The Talker): This is a super-fast S2S model based on an architecture called Moshi. Its only job is to start talking immediately. As soon as you start speaking, it's already formulating a response and getting ready to talk back. It keeps the conversation moving with almost zero delay.

  2. The Back-End (The Thinker): This is where the big brain lives. As you’re talking, your speech is also being converted to text and streamed to a powerful LLM (like GPT-4 or Claude). This LLM starts thinking about your query long before you’ve even finished your sentence.

Here’s where it gets really cool. The Front-End talker doesn't just go off on its own. It has a special, private line of communication to the Back-End thinker. As the thinker comes up with better and more accurate answers, it sends little updates, called "oracles," over to the talker.

Imagine you're giving a presentation, and a colleague is feeding you better information through an earpiece. You might start a sentence one way, but then you get a new piece of data and seamlessly correct yourself mid-thought. That's exactly what KAME does. The Front-End model starts speaking with its best initial guess, and if a better "oracle" from the LLM arrives, it can pivot and adjust its response on the fly.

This whole process is asynchronous, meaning the two parts work independently. The talker doesn't have to wait for the thinker. It just starts talking and gets smarter as the conversation goes on. The result? The AI starts speaking instantly, but the answer it gives gets progressively more intelligent. It’s a shift from the old "think, then speak" model to a much more human "speak while you think."

Okay, But How Do You Train an AI to Do That?

This is a great question because, as you can imagine, there aren't any training datasets of conversations where one person is getting secret "oracle" messages. Sakana AI’s team had to get creative.

They developed a technique called Simulated Oracle Augmentation. It’s a bit of a mouthful, but the concept is pretty straightforward. They took a standard conversational dataset (a user's question and the correct answer) and used a "simulator" LLM to create the kind of oracles a real-time system would generate.

They created a scale of "hint levels" from 0 to 5.

  • Hint Level 0: The AI gets no help at all.
  • Hint Level 5: The AI is given the perfect, final answer.
  • Levels 1-4: The AI gets progressively better hints, mimicking how an LLM would refine its answer as it gets more of the user's question.

By training KAME on tens of thousands of these synthetic dialogues, they taught the front-end model how to gracefully incorporate these incoming hints and correct its course mid-sentence. It learned how to start with a rough idea and refine it into a polished answer, all while speaking.
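Here is a rough Python sketch of what building that kind of training data could look like. It is a toy under stated assumptions, not the team's pipeline: the real method uses a simulator LLM to write the hints, while the hypothetical `simulate_oracle` below just reveals a bigger slice of the reference answer at each level. The `TrainingExample` and `augment` names are also made up for this example.

```python
from dataclasses import dataclass

MAX_HINT_LEVEL = 5  # the article describes hint levels 0 through 5


@dataclass
class TrainingExample:
    question: str
    hint_level: int
    oracle: str   # the hint the front-end would see
    target: str   # the answer the front-end should end up speaking


def simulate_oracle(answer: str, level: int) -> str:
    """Toy stand-in for the simulator LLM (an assumption, not the real method).

    Level 0 reveals nothing, level 5 reveals the full reference answer,
    and levels 1-4 reveal a growing prefix of it.
    """
    if not 0 <= level <= MAX_HINT_LEVEL:
        raise ValueError(f"hint level must be in 0..{MAX_HINT_LEVEL}")
    words = answer.split()
    keep = len(words) * level // MAX_HINT_LEVEL
    return " ".join(words[:keep])


def augment(question: str, answer: str) -> list[TrainingExample]:
    """Turn one (question, answer) pair into one example per hint level."""
    return [
        TrainingExample(question, level, simulate_oracle(answer, level), answer)
        for level in range(MAX_HINT_LEVEL + 1)
    ]
```

Running `augment("What is the capital of France?", "The capital of France is Paris")` yields six examples for the same question, each with a more complete hint than the last. A model trained on sets like this sees everything from "no help" to "the full answer handed over," which is the range it has to cope with at runtime.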

The Big Question: Does It Actually Work?

Alright, let's get to the numbers. Does this fancy setup actually bridge the gap? The results are pretty striking.

The team tested KAME on a benchmark designed for multi-turn Q&A. Here’s how it stacked up:

  • Moshi (The Fast Talker): Scored a 2.05. Super-fast, but not winning any trivia nights.
  • Unmute (The Slow Thinker, using GPT-4): Scored a 7.70. A genius-level response, but with that painful 2.1-second delay.
  • KAME (with GPT-4 as its brain): Scored a 6.43 with near-zero delay.

Look at that jump. KAME scores more than three times higher than the fast-only model and closes roughly three quarters of the gap to the slow-but-smart system, all while keeping the conversation fluid and immediate.

Now, you might ask, why isn't it a perfect 7.70? The researchers explain this is actually a feature, not a bug. The score is slightly lower because KAME has to start speaking before it has all the information. That’s the whole point! It’s making a small trade-off in initial perfection for a huge gain in conversational flow. In fact, when they looked at the text from the final oracle KAME received, its quality score was right up there with the slow system. This suggests the "thinker" in the back is just as smart; the system is just designed to prioritize a natural, immediate response.
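If you want to put a number on that trade-off, a quick back-of-the-envelope calculation helps. The three scores below come straight from the results above; the "share of the gap closed" metric is just one convenient way to read them, not something the researchers report.

```python
# Benchmark scores quoted in this article.
moshi_score = 2.05   # fast, direct speech-to-speech baseline
unmute_score = 7.70  # slow cascaded system using GPT-4
kame_score = 6.43    # KAME with GPT-4 as its back-end

# How much of the distance between the fast baseline and the slow system
# does KAME cover?
gap_closed = (kame_score - moshi_score) / (unmute_score - moshi_score)

# What it gives up relative to the slow system, in benchmark points.
points_given_up = unmute_score - kame_score

print(f"Share of the quality gap closed: {gap_closed:.1%}")
print(f"Points given up for near-zero delay: {points_given_up:.2f}")
```

That works out to a bit more than three quarters of the gap closed, at a cost of 1.27 points.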

The Best Part? You Can Swap Out Its Brain

Perhaps one of the most practical and exciting things about KAME is that it’s "back-end agnostic."

What does that mean? It means the fast-talking front-end isn't permanently tied to one specific LLM. The team trained it using one model, but at runtime, you can swap in whatever back-end LLM you want without any retraining.

This is huge. Imagine being able to plug in:

  • GPT-4 for its incredible knowledge of the humanities.
  • Claude Opus for a particularly tricky reasoning problem.
  • Gemini for another specialized task.

You can route queries to the best "brain" for the job without having to rebuild the entire system. It’s like having a universal socket for intelligence, letting you always use the best tool available.
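In code, "back-end agnostic" boils down to the front-end depending on an interface rather than on a specific model. The Python sketch below shows that idea under loudly stated assumptions: the `Backend` protocol, the `StubBackend` class, the `FrontEnd` wrapper, and the `route` helper are all hypothetical names invented here, and the stubs return canned strings instead of calling any real LLM API.

```python
from typing import Protocol


class Backend(Protocol):
    """Anything that can turn a partial transcript into a text hint (an "oracle")."""

    def oracle(self, partial_transcript: str) -> str: ...


class StubBackend:
    """Placeholder back-end; a real one would call an LLM here."""

    def __init__(self, name: str) -> None:
        self.name = name

    def oracle(self, partial_transcript: str) -> str:
        return f"[{self.name}] hint for: {partial_transcript}"


class FrontEnd:
    """Hypothetical talker that only knows about the Backend interface."""

    def __init__(self, backend: Backend) -> None:
        self.backend = backend

    def next_hint(self, partial_transcript: str) -> str:
        return self.backend.oracle(partial_transcript)


def route(topic: str, backends: dict[str, Backend], default: Backend) -> Backend:
    """Pick a back-end by topic, falling back to a default."""
    return backends.get(topic, default)
```

Swapping brains is then a one-line change: set `front_end.backend = route("reasoning", backends, default)` and the same front-end keeps talking, now fed by a different model. Nothing about the front-end has to be retrained or rebuilt, which is the property the article is describing.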

Ultimately, KAME feels like a real step forward. It tackles the core tension in voice AI head-on and provides an elegant solution. By letting an AI speak and think in parallel, Sakana AI is paving the way for voice assistants that don't just feel faster or smarter, but fundamentally more natural and human. And hopefully, it means the end of that long, awkward pause.


© 2026 Aicosoft. All rights reserved.