Have you ever asked an AI a question and gotten back a wall of text? For the last few years, we’ve all been working under a simple assumption: if you want a better, more thoughtful answer from an AI, you just make it "think" longer by generating a more detailed response.
It makes intuitive sense, right? More steps, more reasoning, a better outcome.
Well, it turns out we might have been completely wrong.
A fascinating new paper from researchers at Google and the University of Virginia just dropped, and it basically flips this idea on its head. They found that just making a Large Language Model (LLM) write more can actually make it less accurate. It’s like that person we all know who talks a lot but doesn’t actually say anything useful.
Instead, they’ve come up with a brilliant new way to measure an AI’s effort—not by how long it talks, but by how hard it’s thinking. And this one little shift is leading to better answers for about half the cost.
Let’s break down what’s going on here.
The Big Problem: Why Longer AI Answers Can Be a Trap
For a while now, engineers have used the sheer number of "tokens" (which are basically words or pieces of words) in an AI's response as a rough guide for how much effort it put in. The habit grows out of Chain-of-Thought (CoT) prompting, where the goal is to get the model to show its work step by step.
But the Google team found something pretty shocking. When they analyzed the models' responses, the raw token count had a negative correlation with accuracy. The number they found was r = -0.59, which in plain English means that the longer the model rambled on, the more likely it was to be wrong.
Why? They call it "overthinking."
The model gets stuck in loops, repeats the same reasoning steps over and over, or worse, it makes a tiny mistake early on and then confidently builds a mountain of flawed logic on top of it. All that extra text isn't deep reasoning; it's just noise. And it's incredibly expensive, wasting precious computing power on words that don't help at all.
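If the r value mentioned above feels abstract, here is a minimal sketch of how a correlation like that gets computed. The data below is made up purely for illustration (it is not from the paper), and `pearson_r` is just a hand-rolled helper name, not anything the researchers published.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up toy data, NOT from the paper: response length vs. correctness.
token_counts = [800, 1200, 2500, 4000, 6000, 9000]
is_correct = [1, 1, 1, 0, 1, 0]  # 1 = right answer, 0 = wrong answer

r = pearson_r(token_counts, is_correct)
print(f"r = {r:.2f}")  # negative: longer toy responses are wrong more often
```

A value near -1 means "longer almost always goes with wrong," a value near 0 means length tells you nothing, and a value near +1 means longer goes with right.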
So, if length is a bad metric, how do we know when an AI is actually trying?
It's Not About Length, It's About Depth
This is where things get really cool. The researchers argued that the real "thinking" doesn't just happen in the final words we see. It happens deep inside the complex layers of the neural network before a word is ever generated.
Think of an AI model like a brilliant but very fast student doing a test.
- Shallow Tokens: For an easy question like "What is 2+2?", the answer "4" comes almost instantly. The student doesn't have to think hard. For an AI, these are "shallow" tokens. The model pretty much knows the right word in its first few layers of processing. Its internal "guess" is stable from the start.
- Deep-Thinking Tokens: Now, for a hard question, like a complex bit of logic or a tricky math symbol in an equation, the student hesitates. They might scribble something, erase it, and rethink their approach. The final answer only solidifies after a lot of mental effort. For an AI, these are "deep-thinking" tokens. The model's internal prediction for that token keeps changing and shifting as it gets processed through deeper and deeper layers of the network. The correct answer only "settles" at the very end.
The researchers developed a clever way to peek inside the model's "brain" and watch this process happen. They could see which tokens were easy and which ones required this intense, deep-layer revision.
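To give a feel for what "watching a prediction settle" could look like, here is a simplified sketch. It assumes we can already get a probability distribution over the vocabulary at every layer for one token. The Jensen-Shannon distance measure, the `threshold` value, and the function names are my own illustrative choices, so treat this as a toy model of the idea rather than the researchers' exact procedure.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2, so the result lies in [0, 1])."""
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def settling_depth(layer_dists, threshold=0.1):
    """1-based index of the earliest layer from which every later layer's
    distribution stays within `threshold` of the final layer's distribution.
    `threshold` is a hypothetical value chosen for this toy example."""
    final = layer_dists[-1]
    depth = len(layer_dists)
    # Walk backwards from the last layer while the guess still matches the final one.
    for i in range(len(layer_dists) - 1, -1, -1):
        if js_divergence(layer_dists[i], final) <= threshold:
            depth = i + 1
        else:
            break
    return depth

# Toy 3-word vocabulary, 4 layers. Each row is one layer's distribution.
easy_token = [[0.90, 0.05, 0.05], [0.92, 0.04, 0.04],
              [0.95, 0.03, 0.02], [0.96, 0.02, 0.02]]
hard_token = [[0.10, 0.80, 0.10], [0.50, 0.30, 0.20],
              [0.30, 0.30, 0.40], [0.05, 0.05, 0.90]]

print(settling_depth(easy_token))  # 1: the guess is stable from the first layer
print(settling_depth(hard_token))  # 4: the guess only settles at the last layer
```

The easy token barely moves between layers, so it counts as settled right away. The hard token keeps changing its mind until the final layer, which is exactly the "scribble, erase, rethink" pattern described above.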
This led them to a brand-new metric: the Deep-Thinking Ratio (DTR). It’s simply the percentage of these "hard" tokens in a given response. And guess what? Unlike token count, DTR has a strong positive correlation with accuracy (an average of r = 0.683).
When an AI's response has a high DTR, it's a great sign that it's genuinely grappling with the hard parts of the problem instead of just spitting out easy, filler words.
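Once you know how deep each token had to go before it settled, the ratio itself is simple arithmetic. The sketch below assumes you already have one settling depth per token. The `deep_fraction` cutoff (how late in the network a token must settle to count as "deep") is a hypothetical parameter I picked for illustration, not a value taken from the paper.

```python
def deep_thinking_ratio(settling_depths, num_layers, deep_fraction=0.8):
    """Fraction of tokens whose prediction only settled in the late layers.

    settling_depths: one 1-based layer index per generated token.
    deep_fraction:   hypothetical cutoff; a token counts as "deep-thinking"
                     if it settled at or beyond this fraction of the layers.
    """
    if not settling_depths:
        return 0.0
    cutoff = deep_fraction * num_layers
    deep_tokens = sum(1 for depth in settling_depths if depth >= cutoff)
    return deep_tokens / len(settling_depths)

# Toy response of 8 tokens from an imaginary 32-layer model.
depths = [2, 3, 30, 4, 31, 32, 5, 3]
dtr = deep_thinking_ratio(depths, num_layers=32)
print(dtr)  # 0.375: three of the eight tokens settled in the last layers
```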
Think@n: Getting Better Answers for Half the Price
Okay, so having a good metric is great, but what can you do with it? This is where the research becomes incredibly practical. The team created a new strategy called Think@n.
Here’s how a lot of high-accuracy AI pipelines work today. If you need a super-accurate answer, you use a brute-force method called Self-Consistency. You ask the AI the same question, say, 48 times, get 48 slightly different answers, and then just hold a "majority vote" to pick the most common one.
It works, but it's wildly expensive. You’re paying to generate 48 full, often very long, responses.
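The voting step of Self-Consistency is about as simple as it sounds. Here is a minimal sketch, assuming the final answers have already been extracted from each sampled response as strings. The sample answers are invented for illustration.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer among the sampled responses."""
    counts = Counter(answers)
    answer, _count = counts.most_common(1)[0]
    return answer

# Invented final answers from five sampled responses to the same question.
sampled_answers = ["42", "42", "17", "42", "36"]
print(majority_vote(sampled_answers))  # 42
```

The voting is cheap. The expensive part is generating every one of those full responses just to read off the last line.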
Think@n is so much smarter. It works like this:
- Start the Race: The model starts generating multiple different answers, just like before.
- The Checkpoint: But, and this is the key, generation pauses after just 50 tokens. A tiny little prefix.
- The Smart Filter: It then quickly calculates the DTR for each of those short snippets.
- Cut the Losers: It immediately throws away all the "unpromising" candidates, meaning the ones whose low DTR suggests they are just rambling.
- Finish Strong: It then focuses all its expensive computing power on finishing only the handful of "deep-thinking" candidates that look promising.
It’s like being a manager who gives a small test to a dozen job applicants. You don't need to wait for them all to finish a two-week project. After a few hours, you can already tell who's on the right track and who's just spinning their wheels. You then invest your time in the ones who show real promise.
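The five steps above can be sketched in a few lines. Everything model-related here is a stand-in: `start`, `score`, and `finish` are hypothetical callables you would wire up to a real model, `keep_fraction` is a made-up knob, and picking the final answer by majority vote among the survivors is my assumption about the last step.

```python
from collections import Counter

def think_at_n(prompt, n, start, score, finish,
               prefix_tokens=50, keep_fraction=0.5):
    """Sketch of prefix-based filtering (all callables are hypothetical).

    start(prompt, k)       -> a short prefix of about k tokens
    score(prefix)          -> DTR estimate for that prefix
    finish(prompt, prefix) -> final answer after completing the response
    """
    # Start the race, but stop every candidate at the checkpoint.
    prefixes = [start(prompt, prefix_tokens) for _ in range(n)]
    # Smart filter: rank the short prefixes by their DTR score.
    ranked = sorted(prefixes, key=score, reverse=True)
    # Cut the losers: keep only the top-scoring share (at least one).
    survivors = ranked[:max(1, int(n * keep_fraction))]
    # Finish strong: pay for full generation only on the survivors.
    answers = [finish(prompt, prefix) for prefix in survivors]
    return Counter(answers).most_common(1)[0][0]

# Toy stand-ins so the sketch runs without a model; values are invented.
toy_dtr = {"A": 0.10, "B": 0.40, "C": 0.35, "D": 0.05}
toy_answer = {"A": "17", "B": "42", "C": "42", "D": "36"}
names = iter("ABCD")

result = think_at_n(
    "toy question", 4,
    start=lambda prompt, k: next(names),
    score=toy_dtr.get,
    finish=lambda prompt, prefix: toy_answer[prefix],
)
print(result)  # 42: only candidates B and C were finished
```

In the toy run, candidates A and D are dropped at the checkpoint, so only two full responses are ever generated instead of four.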
The Results Are Kind of Mind-Blowing
The team tested Think@n on a really tough math competition benchmark called AIME. The results speak for themselves.
- The Old Way (Majority Vote): Achieved 92.7% accuracy but cost a whopping 307,600 tokens on average to get there.
- The New Way (Think@n): Achieved a higher accuracy of 94.7% and only cost 155,400 tokens on average.
They got a better result while cutting the total cost by 49%. That’s not a small improvement; that's a massive leap in efficiency.
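If you want to check the savings yourself, the arithmetic from the two token counts above is quick:

```python
baseline_tokens = 307_600  # majority vote, average tokens per problem
think_tokens = 155_400     # Think@n, average tokens per problem

saved = baseline_tokens - think_tokens
reduction = saved / baseline_tokens

print(saved)               # 152200 tokens saved on average
print(f"{reduction:.1%}")  # 49.5%, quoted above as 49%
```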
What This Means for All of Us
This is more than just an academic exercise. It's a fundamental shift in how we can build and deploy powerful AI systems. For years, the story has been that better AI requires bigger models and more and more computing power. This research shows that sometimes, being smarter is more important than being bigger.
Here are the big takeaways for me:
- Stop counting words: We have to move past the idea that a longer AI response is a better one. It’s often a sign of confusion, not intelligence.
- Effort can be measured: The Deep-Thinking Ratio gives us a powerful new tool to look under the hood and see if an AI is genuinely working on a problem.
- Efficiency is possible: We don’t have to just throw more money and energy at AI to get better results. Clever techniques like Think@n can give us better performance for a fraction of the cost.
This is the kind of research that gets me excited. It’s elegant, practical, and it solves a real-world problem that everyone in the AI space is facing: the staggering cost of inference. By learning to tell the difference between an AI that’s just talking and one that’s actually thinking, we’re paving the way for more capable and accessible AI for everyone.