Markovian Thinking: The AI Breakthrough That Crushes the 'Quadratic Curse' and Unlocks Million-Token Reasoning

Akram Chauhan

Have you ever watched an AI start strong on a complex problem, only to see it lose the plot halfway through? It’s a common frustration. Large language models (LLMs) are incredible, but they have a critical weakness when it comes to deep, multi-step reasoning: the longer they "think," the more computationally expensive and unwieldy the process becomes, until pushing it any further is no longer practical.

This isn't just a minor bug; it's a fundamental roadblock known as the "quadratic curse." For an AI to solve a tough math problem or debug a huge chunk of code, it needs to generate a long internal monologue, a "chain of thought." But with today's transformer models, every new token of thought has to attend to the whole growing pile of context that came before it. The cost doesn't just grow with the length of the reasoning; it grows with the square of that length.

This bottleneck has kept us from seeing what AI is truly capable of. We've been stuck with models that are great at short sprints but can't run a marathon. But what if we could teach an AI to think like a long-distance runner, pacing itself and conserving energy? Researchers at Mila have done just that, with a technique called Markovian Thinking that might just be the key to unlocking AI that can reason across hundreds of thousands, or even millions, of tokens.

The Vicious Cycle of AI 'Thinking'

To really get why this new approach is such a big deal, we first need to understand the problem it solves. When you ask an LLM a complex question, it doesn't just spit out an answer. The best models use a process called Chain-of-Thought (CoT) reasoning. They essentially talk themselves through the problem, generating intermediate steps to arrive at a final solution.

Think of it as showing your work in math class. The more steps you write down, the more likely you are to get the right answer. Researchers found that training models to produce longer chains of thought (LongCoT) dramatically improves their reasoning abilities.

But here’s the catch. With standard transformer architectures, every time the model generates a new "thought" token, it has to pay attention to all the previous tokens in its context window—the original prompt plus its entire chain of thought so far.

Imagine trying to write a novel by re-reading the entire book from page one every time you want to add a new sentence. It would be agonizingly slow, and by the time you reached Chapter 20, you'd spend all your energy just re-reading, not writing. That’s the quadratic curse in a nutshell. The computational cost scales quadratically with the length of the reasoning chain, making truly long-form thinking prohibitively expensive.
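To make the "re-reading the whole book" picture concrete, here is a minimal back-of-the-envelope sketch. It assumes a deliberately simplified cost model that is not from the research itself: each new thinking token pays one unit of work for every token already in view, and layers, heads, and constant factors are ignored. The function name and the example sizes are illustrative.

```python
def longcot_attention_cost(prompt_tokens: int, thinking_tokens: int) -> int:
    """Count token-to-token attention lookups for one full reasoning trace.

    Simplifying assumption: the t-th thinking token attends to the prompt
    plus every thinking token generated before it.
    """
    return sum(prompt_tokens + t for t in range(thinking_tokens))


if __name__ == "__main__":
    prompt = 100  # illustrative prompt length
    for length in (8_000, 16_000, 32_000):
        cost = longcot_attention_cost(prompt, length)
        print(f"{length:>6} thinking tokens -> {cost:,} lookups")
```

Under this toy model, doubling the length of the reasoning chain roughly quadruples the total work, which is the quadratic growth the analogy describes.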

Most attempts to fix this have been workarounds. They try to limit how much the model thinks, either by encouraging shorter solutions or just cutting the process off early. That’s not a real solution; it’s just avoiding the problem.

A Smarter Way to Reason: Introducing Markovian Thinking

Instead of trying to tame the quadratic beast, the Mila team decided to sidestep it entirely. Their solution, which they call a "Markovian Thinker," fundamentally changes the rules of the game.

The core idea is brilliantly simple: separate "how long the model thinks" from "how much context it has to process at any given moment." A Markovian Thinker reasons while keeping its active context window at a fixed, manageable size. This transforms the explosive quadratic cost into a predictable, linear one. The model can think for as long as it needs to without its computational backpack getting heavier.
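Here is a rough sketch of why a fixed-size context makes the cost linear. It assumes a toy cost model (one unit of work per token in view for every generated token) and a reset scheme in which the visible reasoning drops back to a fixed carryover whenever it fills the chunk. The function name, the parameter names, and the 4,000-token carryover are illustrative assumptions, not figures from the researchers.

```python
def chunked_attention_cost(
    query_tokens: int,
    thinking_tokens: int,
    chunk_tokens: int,
    carryover_tokens: int,
) -> int:
    """Count attention lookups when the visible reasoning is capped per chunk.

    Assumption: once `chunk_tokens` reasoning tokens are in view, the context
    is reset to the query plus the last `carryover_tokens` reasoning tokens.
    """
    if not 0 <= carryover_tokens < chunk_tokens:
        raise ValueError("carryover must be smaller than the chunk size")

    cost = 0
    visible_thinking = 0  # reasoning tokens currently in the context
    for _ in range(thinking_tokens):
        if visible_thinking == chunk_tokens:
            visible_thinking = carryover_tokens  # reset, keep only the carryover
        cost += query_tokens + visible_thinking
        visible_thinking += 1
    return cost


if __name__ == "__main__":
    for length in (24_000, 48_000, 96_000):
        cost = chunked_attention_cost(0, length, 8_000, 4_000)
        print(f"{length:>6} thinking tokens -> {cost:,} lookups")
```

Because the context never grows past the chunk size, each additional block of reasoning costs about the same as the one before it, so doubling the thinking length only about doubles the total work.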

To put this theory into practice, they built an environment called Delethink. This system forces the model to break its reasoning process down into bite-sized pieces.

How Delethink Teaches AI to Think in Chunks

So, how does Delethink actually work? It structures the AI's reasoning process into a sequence of fixed-size chunks, say, 8,000 tokens at a time.

Here's a step-by-step breakdown:

  1. Reasoning in a Chunk: The model starts with the original prompt and reasons normally within the first 8,000-token block, using its standard attention mechanism.
  2. The Reset: Once it hits the 8,000-token limit, Delethink performs a reset. The massive chain of thought is wiped from the active context.
  3. The Carryover: A new prompt is created for the next chunk. This prompt includes the original query plus a small, crucial piece of information from the end of the previous chunk—a "carryover."
  4. Repeat: The model then continues its reasoning in the new chunk, and this process repeats as many times as needed.

The carryover is the secret sauce. It could be the last few hundred tokens of the previous thought process or a condensed summary. This forces the model to learn a new, vital skill: how to embed a summary of its progress—a "textual Markovian state"—into that carryover to seamlessly continue its work.

A common question is, "Doesn't the model forget important details from earlier steps?" According to co-author Amirhossein Kazemnejad, the training process handles this. The model is "forced to learn to carry forward the task-critical state." It learns what to remember.

And to be crystal clear, the original input isn't touched. If you gave the AI a 100-page document to analyze, that document remains in the prompt for every single chunk. This approach is purely for managing the model's "scratchpad" of reasoning, not its source material.
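The four steps above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the researchers' implementation: `delethink_trace`, `toy_counter`, the `<done>` stop token, and the tiny chunk sizes are all hypothetical, and a trivial counting function stands in for a real language model. The carryover here is simply the last few tokens of the previous chunk.

```python
from typing import Callable, List


def delethink_trace(
    query: List[str],
    generate: Callable[[List[str], int], List[str]],
    chunk_tokens: int,
    carryover_tokens: int,
    max_chunks: int,
    stop_token: str = "<done>",
) -> List[str]:
    """Reason in fixed-size chunks; return every reasoning token produced."""
    if not 0 < carryover_tokens < chunk_tokens:
        raise ValueError("carryover must be positive and smaller than the chunk")

    trace: List[str] = []
    carryover: List[str] = []
    for _ in range(max_chunks):
        # Reset: the context holds only the original query plus the carryover.
        context = query + carryover
        new_tokens = generate(context, chunk_tokens - len(carryover))
        trace.extend(new_tokens)
        if new_tokens and new_tokens[-1] == stop_token:
            break
        # Carry the tail of this chunk into the next one.
        carryover = (carryover + new_tokens)[-carryover_tokens:]
    return trace


def toy_counter(context: List[str], budget: int) -> List[str]:
    """Stand-in for a model: keeps counting from the last number in view."""
    target = int(context[2])                   # the query is always in view
    seen = [int(tok) for tok in context[3:]]   # carryover, if any
    value = seen[-1] if seen else 0
    out: List[str] = []
    while len(out) < budget and value < target:
        value += 1
        out.append(str(value))
    if value == target:
        out.append("<done>")
    return out


if __name__ == "__main__":
    result = delethink_trace(["count", "to", "20"], toy_counter,
                             chunk_tokens=8, carryover_tokens=2, max_chunks=10)
    print(" ".join(result))
```

The toy model only ever sees the query and the last two numbers it wrote, yet it still finishes the count. That is the idea in miniature: whatever the model needs to continue has to live in the carryover.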

The Results Are In: Cheaper, Longer, and Smarter Reasoning

This all sounds great in theory, but does it work in practice? The researchers put it to the test by training a 1.5-billion-parameter model with Delethink on a dataset of tough, competition-level math problems.

The results were stunning.

Matching Performance for a Fraction of the Cost

First, they compared their model (trained to reason in 8,000-token chunks up to a 24,000-token total) against a model trained with the standard LongCoT method. On math, coding, and even PhD-level questions, the Delethink-trained model performed just as well, and in some cases slightly better. The researchers put it plainly: "Delethink uses its thinking tokens as effectively as LongCoT-RL with reduced compute."
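To see how 8,000-token chunks add up to a 24,000-token budget, here is a small arithmetic sketch. The article does not say how large the carryover is, so the 4,000-token figure below is purely an assumption for illustration, as is the helper name `chunks_needed`. The first chunk contributes a full chunk of reasoning, and every later chunk contributes the chunk size minus the carryover.

```python
import math


def chunks_needed(total_thinking: int, chunk_tokens: int, carryover_tokens: int) -> int:
    """Number of chunks required to produce `total_thinking` reasoning tokens."""
    if not 0 <= carryover_tokens < chunk_tokens:
        raise ValueError("carryover must be smaller than the chunk size")
    if total_thinking <= chunk_tokens:
        return 1
    fresh_per_chunk = chunk_tokens - carryover_tokens  # new tokens after a reset
    return 1 + math.ceil((total_thinking - chunk_tokens) / fresh_per_chunk)


if __name__ == "__main__":
    # Assumed carryover of 4,000 tokens (half a chunk); not stated in the article.
    print(chunks_needed(24_000, 8_000, 4_000))   # 5
    print(chunks_needed(140_000, 8_000, 4_000))  # 34
```

Under that assumption, a 24,000-token budget takes five chunks. A longer run just takes more chunks of the same size, and the context the model processes at any moment stays the same.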

Breaking Through the Training Barrier

The real magic happened when they pushed the models beyond their training limits. The standard LongCoT model hit a wall, its performance plateauing right around its 24,000-token training budget.

The Delethink model, however, just kept going. Because its costs are linear, it could continue reasoning far beyond its training. Some math problems were only solved after the model generated a staggering 140,000 tokens of thought—something that would be computationally nightmarish for a standard model.

The cost savings are substantial. The researchers estimate that training a model for a 96,000-token thinking length would take 27 H100-GPU-months with standard LongCoT training. With Delethink, it would take just 7, roughly a fourfold reduction. That’s a game-changer for any organization footing the AI bill.

The Secret Superpower: Your AI Might Already Know How to Do This

Perhaps the most exciting discovery is that this isn't some esoteric capability that requires building models from scratch. The researchers found that many off-the-shelf models, including large ones like GPT-OSS 120B, already have a latent ability to think in this Markovian way, even with zero specific training.

What does this mean for developers? You can take an existing, powerful model and wrap it in a "delethink-tracing" process today. As Kazemnejad notes, "...without Delethink-RL— these models can already run a delethink-tracing wrapper and perform competitively with LongCoT on our benchmarked tasks."

This latent ability explains why training with Delethink is so effective—it’s not teaching the model a completely alien skill but rather strengthening an innate one. It also means the approach is compatible with and scales alongside the state-of-the-art models we already use.

A New Horizon for Artificial Intelligence

The success of Markovian Thinking is more than just an efficiency hack; it’s a paradigm shift. It "removes a key bottleneck," opening the door for models that can "think for millions of tokens." This isn't just about solving harder math problems. It's about enabling entirely new classes of AI applications.

Imagine an AI agent that can spend two weeks meticulously debugging an entire enterprise codebase, keeping track of its progress the whole time. Or a research assistant that can pore over thousands of scientific papers, forming and testing hypotheses over days, not minutes. This is the future that Markovian Thinking makes possible.

By freeing AI from the shackles of the quadratic curse, we're not just making it cheaper or faster. We're giving it the room it needs to tackle the kind of long-horizon, complex, and deeply creative problems that have, until now, remained exclusively in the human domain. The era of the AI marathon runner has officially begun.

Aicosoft

© 2026 Aicosoft. All rights reserved.