Ever since 2017, the world of AI has been dominated by one single, brilliant idea. It came from a Google paper with a bold title: "Attention Is All You Need." And for the better part of a decade, its authors were right.
That paper introduced the "transformer" architecture, and its core engine—the attention mechanism—became the blueprint for every major AI you've heard of. GPT-4, Claude, Gemini, Llama... they're all transformers at heart. Think of attention as the model's superpower: it can look back across thousands of words or lines of code and pinpoint exactly what's relevant to its next step.
But that superpower has a dark side. It's incredibly expensive. The math behind attention means that if you double the amount of text you're feeding the model, you roughly quadruple the compute attention needs (and, in a naive implementation, the memory too). As we push for AI that can read entire novels or analyze hours of video, this is becoming a massive bottleneck. Attention is the transformer's Achilles' heel.
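To put a number on that, here's a back-of-the-envelope sketch. It only counts query-key comparisons, assuming each token looks at itself and everything before it, and it ignores the rest of the model. The function name is mine, purely for illustration.

```python
def attention_comparisons(num_tokens: int) -> int:
    """Count query-key pairs when each token attends to itself and all earlier tokens."""
    return num_tokens * (num_tokens + 1) // 2


for n in (1_000, 2_000, 4_000):
    ratio = attention_comparisons(2 * n) / attention_comparisons(n)
    print(f"{n:>5} -> {2 * n:>5} tokens: {ratio:.2f}x the comparisons")
```

Each doubling lands just shy of 4x, and the ratio creeps closer to 4 as the sequence grows.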
And now, a little-known startup called Manifest AI claims they’ve found a way to cut that heel right off. They've just released a new model, Brumby-14B-Base, that performs like a top-tier transformer but was trained for just $4,000. How? By throwing out attention completely.
So, What Replaced Attention?
Let's get into the nerdy stuff, but I'll make it painless, I promise.
In a standard transformer, when the model is trying to figure out the next word, it performs a massive comparison. Every single word in the input is compared against every other word to calculate a "relevance" score. It’s like trying to have a group conversation where, before anyone speaks, they have to stop and explicitly consider their entire conversational history with every single person in the room. It's thorough, but wildly inefficient.
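If you'd rather see that in code, here's a minimal pure-Python sketch of standard softmax attention. I'm assuming the causal setup used by decoder-only models, where each token only compares itself against tokens at or before its own position. Real implementations are batched, multi-headed, and run on GPUs; this is just the core idea.

```python
import math


def causal_attention(queries, keys, values):
    """Naive causal softmax attention over lists of equal-length float vectors."""
    dim = len(queries[0])
    outputs = []
    for t, q in enumerate(queries):
        # Score the current token against itself and every earlier token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys[: t + 1]]
        # Softmax turns the scores into relevance weights that sum to 1.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # The output is a relevance-weighted average of the visible values.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0], [2.0], [3.0]]
out = causal_attention(q, k, v)
print(out)
```

Notice the inner list of `scores` gets one entry longer with every token. That growing look-back is where the cost comes from.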
Manifest AI’s model swaps this out for something they call Power Retention.
Instead of that chaotic all-hands meeting, Power Retention works more like a human brain's working memory. It maintains a compressed summary of everything it's seen so far. As each new piece of information (a word or token) comes in, the model doesn't look back at the entire history. It just updates its summary.
This is a huge deal. It means the cost of processing one more word is always the same, whether you're on word number 100 or word number 1,000,000. This is the holy grail for handling long contexts—the cost stays flat instead of exploding. It gets the efficiency of older models like RNNs but, thanks to some clever math (the "power" in Power Retention), it keeps the expressive ability that made transformers so smart in the first place.
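Here's a toy sketch of how a fixed-size summary can stand in for a full look-back. To be clear about my assumptions: I'm treating the "power" as raising the query-key dot product to the power 2, and I've left out normalization, gating, and every other detail of Manifest AI's actual layer, which this is not. It's ordinary linear attention with a degree-2 feature map, and all the function names are hypothetical.

```python
def power_features(x):
    """Degree-2 feature map: all pairwise products, so phi(q) . phi(k) == (q . k) ** 2."""
    return [a * b for a in x for b in x]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def retention_recurrent(queries, keys, values):
    """Process tokens one at a time, carrying only a fixed-size state matrix."""
    feat_dim = len(queries[0]) ** 2
    val_dim = len(values[0])
    state = [[0.0] * val_dim for _ in range(feat_dim)]  # the compressed summary
    outputs = []
    for q, k, v in zip(queries, keys, values):
        phi_k = power_features(k)
        # Fold the new token into the summary: state += outer(phi(k), v).
        for i in range(feat_dim):
            for j in range(val_dim):
                state[i][j] += phi_k[i] * v[j]
        # Read from the summary alone; no loop over earlier tokens.
        phi_q = power_features(q)
        outputs.append(
            [sum(phi_q[i] * state[i][j] for i in range(feat_dim)) for j in range(val_dim)]
        )
    return outputs


def retention_quadratic(queries, keys, values):
    """Reference version: revisit every earlier token, weighted by (q . k) ** 2."""
    outputs = []
    for t, q in enumerate(queries):
        weights = [dot(q, k) ** 2 for k in keys[: t + 1]]
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


q = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]
k = [[1.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
fast = retention_recurrent(q, k, v)
slow = retention_quadratic(q, k, v)
print(fast)
print(slow)
```

The two functions produce matching outputs, but the recurrent one does the same amount of work for every token because `state` never changes size. The trade-off is that the state grows with the square of the vector dimension here, so the summary is only "compressed" relative to a long history.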
The $4,000 Question: How Is That Even Possible?
Okay, a 14-billion-parameter model trained for four grand sounds completely insane. Training a model that size from scratch would normally cost hundreds of thousands of dollars, if not millions. So what's the catch?
Here’s the brilliant part: they didn't train it from scratch.
Manifest AI took an existing, powerful open-source model, Qwen3-14B-Base, and essentially performed a brain transplant. They kept all the knowledge the model had already learned about language, grammar, and facts, but they surgically removed its attention layers and replaced them with their new Power Retention layers.
Think of it like this: imagine a world-class concert pianist who has spent decades mastering the piano. They understand music theory, harmony, rhythm—everything. Now, you hand them a guitar for the first time. They still have all their musical genius, but their hands don't know the fretboard. They have to retrain their muscle memory.
That's what Manifest AI did. Brumby started with the "mind" of Qwen3, but it was initially confused about how to use that knowledge with its new Power Retention "body." The $4,000 and 60 hours of training on 32 NVIDIA H100s were the "guitar lessons": a short, intense period for the model to adapt its existing knowledge to the new architecture.
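Those numbers are easy to sanity-check. The sketch below assumes all 32 GPUs ran for the full 60 hours and that the $4,000 covers GPU time only. The per-hour rate it prints is my own derivation, not a price anyone quoted.

```python
gpus = 32
hours = 60
total_cost_usd = 4_000

gpu_hours = gpus * hours
implied_rate = total_cost_usd / gpu_hours

print(f"{gpu_hours} GPU-hours at about ${implied_rate:.2f} per H100-hour")
# prints: 1920 GPU-hours at about $2.08 per H100-hour
```

That's in the range you'd expect to pay for rented H100s, so the headline figure is at least internally consistent.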
And it worked. In just a few thousand training steps, the model’s performance shot right back up to where the original Qwen3 was. This is the key takeaway: you don't have to start from zero. You can build on the shoulders of giants, converting these massive, expensive transformers into something far more efficient for a tiny fraction of the cost.
But Does It Actually Perform?
A cheap model is useless if it's dumb. So, how does Brumby stack up against its transformer cousins? The results are fascinating.
Manifest AI released a bunch of benchmark scores, and here's the gist:
| Task | Brumby-14B | Qwen3-14B | GLM-4.5-Air |
| :--- | :--- | :--- | :--- |
| GSM8K (Math) | 0.88 | 0.84 | 0.83 |
| MATH (Math) | 0.62 | 0.54 | 0.47 |
| MMLU (General) | 0.71 | 0.78 | 0.77 |
| HellaSwag (Commonsense) | 0.77 | 0.81 | 0.85 |
What we're seeing is a really interesting pattern. On tasks that require complex, step-by-step reasoning, like math problems (GSM8K and MATH), Brumby actually outperforms the transformers. This makes perfect sense! These are exactly the kinds of problems where holding a long, coherent "train of thought" is crucial—something the retention mechanism is built for.
On the flip side, it lags slightly on knowledge-heavy benchmarks like MMLU. This also makes sense. The original Qwen3 model spent its entire life learning facts with an attention mechanism, and Brumby is still learning how to access all of that knowledge perfectly with its new brain. Still, it's incredibly close.
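To see the gaps at a glance, here's a quick script over the scores listed above. It just subtracts Qwen3-14B's score from Brumby's on each task; the dictionary layout is my own.

```python
scores = {
    "GSM8K": {"Brumby-14B": 0.88, "Qwen3-14B": 0.84, "GLM-4.5-Air": 0.83},
    "MATH": {"Brumby-14B": 0.62, "Qwen3-14B": 0.54, "GLM-4.5-Air": 0.47},
    "MMLU": {"Brumby-14B": 0.71, "Qwen3-14B": 0.78, "GLM-4.5-Air": 0.77},
    "HellaSwag": {"Brumby-14B": 0.77, "Qwen3-14B": 0.81, "GLM-4.5-Air": 0.85},
}

# Positive means Brumby is ahead of the model it was converted from.
gaps = {
    task: round(row["Brumby-14B"] - row["Qwen3-14B"], 2)
    for task, row in scores.items()
}

for task, gap in gaps.items():
    print(f"{task:<10} {gap:+.2f}")
```

Brumby comes out +0.04 and +0.08 ahead on the two math tasks, and 0.07 and 0.04 behind on MMLU and HellaSwag.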
It's Not Just Cheaper, It's Faster
The benefits don't stop at training costs. The Power Retention design is also a beast when it comes to inference (actually running the model once it's trained).
Because it avoids that massive all-to-all comparison, it's way more efficient on the hardware. Jacob Buckman, Manifest AI's founder, says their code achieves 80-85% hardware utilization. That’s a fancy way of saying you’re getting more performance out of your very expensive GPUs compared to other popular methods like FlashAttention2 (70-75%) or another post-transformer model called Mamba (50-60%).
On very long sequences, they claim it can be hundreds of times faster than attention. That’s not a typo. While Buckman admits they haven't stress-tested it at production scale yet, the potential here is enormous.
Of Course, Tech Twitter Had Thoughts
As you can imagine, dropping a claim like "we trained a 14B model for $4k" set the internet on fire.
Almost immediately, researchers on X (formerly Twitter) jumped in, with some calling the claim misleading. One researcher from Meta pointed out, "They shuffled around the weights of Qwen, fine-tuned it a bit, and called it 'training a foundation model for $4k.'"
Buckman quickly responded, clarifying that he explained the whole retraining process in a thread and wasn't trying to hide anything. Looking back, he acknowledged that the headline-grabbing number, while technically true in its context, was bound to stir the pot.
But here's what I think: the controversy misses the point. The fact that you can take a transformer that cost a fortune to train and give it a more efficient engine for the price of a high-end gaming PC is the story. It opens up a whole new world of experimentation.
So, Is the Transformer Era Really Ending?
Let’s be clear: transformers aren't going away tomorrow. They are incredibly powerful and form the foundation of our entire AI ecosystem. As Buckman himself said, "The end of the transformer era is not yet here."
But what Brumby and Manifest AI have done is show everyone that there's a path forward. They’ve put a significant crack in the transformer's wall. For years, the community has been locked into this one architectural idea, pouring more and more money into making it bigger. Brumby is proof that we can build things differently.
This could democratize AI development again. Small labs, startups, and even individual researchers could start tinkering with massive models, converting them to new architectures without needing a VC's checkbook.
This release feels less like an ending and more like a new beginning. It's, as Buckman put it, "one step forward in a long march" toward a future where AI is not only more powerful but also more accessible and sustainable for everyone. And that’s a future worth getting excited about.




