If you've ever tried to train a large language model on a really, really long piece of text, you know the pain. It’s slow. It’s expensive. And the reason for this headache has a name: attention.
At the heart of every Transformer model is a mechanism called scaled dot-product attention. Think of it as a process where every single word (or token) in a sequence has a "conversation" with every other token to figure out what's important. For a short sentence, that's no big deal. But as the sequence length (N) grows, the number of these conversations grows quadratically: N tokens each talking to up to N others means O(N²) work.
This means that if you double your context length, you don't just double the work; you quadruple it. It's a brutal scaling law. We’ve seen amazing tools like FlashAttention come along, which cleverly avoid building the massive N×N attention matrix in memory, but they don't change the underlying math. The compute cost is still quadratic, and it's a massive bottleneck.
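To make the "double the length, quadruple the work" point concrete, here is a rough back-of-the-envelope sketch. The function name and the head dimension of 128 are assumptions chosen for illustration; it counts only the two big matrix products in one attention head and ignores everything else in the model.

```python
def attention_multiply_adds(n_tokens: int, head_dim: int = 128) -> int:
    """Rough multiply-add count for one attention head (illustrative only).

    Q @ K^T costs n * n * d multiply-adds, and so does weights @ V.
    """
    return 2 * n_tokens * n_tokens * head_dim


for n in (8_192, 16_384, 32_768):
    print(f"N={n:>6}: {attention_multiply_adds(n):,} multiply-adds")

# Doubling the sequence length multiplies the count by exactly four.
ratio = attention_multiply_adds(16_384) / attention_multiply_adds(8_192)
print(ratio)
```

Memory-saving kernels change how this work is scheduled, not how much of it there is.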
But what if you could cheat? What if you could get the benefits of long-context training without paying the full price, at least for most of the training run? That's the core idea behind a new method from Nous Research called Lighthouse Attention, and it's a seriously pragmatic and clever approach.
So, What's Lighthouse Attention's Big Idea?
Let's get one thing straight: Lighthouse isn't a replacement for standard attention. Instead, think of it as a temporary "scaffolding" you use during the most expensive part of construction—the pretraining.
Here’s the deal. Most previous attempts at "sparse" attention (where you only pay attention to a few things) made two common design choices that Lighthouse flips on their head.
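- They were asymmetric. They'd compress the "memory" of the model (the keys and values) but leave the "questions" (the queries) at full resolution. This helps, but with all N queries still in play, the cost keeps scaling with N times the compressed length rather than with the compressed length squared.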
- They required custom kernels. The logic for selecting which tokens to focus on was baked right into the attention code itself. This meant you couldn't use the super-optimized, hardware-accelerated kernels like FlashAttention that everyone relies on.
Lighthouse takes a different path. It pools everything symmetrically—queries, keys, and values. And it puts the selection logic entirely outside the attention kernel. It basically pre-selects a small, "all-star" team of tokens and then runs standard, off-the-shelf FlashAttention on just that group.
The real test, though, is what happens after training. An inference-only sparse method is easy; it just has to be as good as the dense model it's based on. But a training-time method faces a much harder question: After you turn it off, is the model still a competent, fully-dense model? Lighthouse treats this "recoverability" as its number one goal.
Let's Walk Through the Four-Stage Pipeline
So how does this all work in practice? A Lighthouse layer wraps around the standard attention call and breaks it into four stages.
- Build a Pyramid: First, it takes the queries, keys, and values (Q, K, V) and creates a multi-level pyramid by average-pooling them. Imagine you have a long document. Level 0 is the raw text. Level 1 might be a summary of every 4 tokens. Level 2 a summary of every 16 tokens, and so on. This gives the model a multi-scale view of the sequence, and it's a very cheap operation (linear time, O(N)).
- Score and Select: Next, it needs to pick the most important tokens from this pyramid. Instead of some complex, learnable process, it uses a dead-simple, parameter-free scorer: the L2 norm (basically, the vector's length). It scores each pyramid entry and then uses a super-fast Top-K algorithm to select the k most "important" entries across all levels. Crucially, this selection step is non-differentiable. No gradients flow back to teach the scorer how to score better. We'll get to why that's a genius move in a second.
- Gather and Attend: Now for the main event. The system gathers the chosen Q, K, and V entries into a brand new, much smaller, and completely dense subsequence. If your original sequence was a million tokens long, this new one might only be 65,000. Then, it just hands this small, dense sequence over to standard FlashAttention. No custom code needed.
- Scatter Back: Finally, once FlashAttention has done its work, the results for each summarized token are scattered back to the original positions they represent in the full sequence.
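The four stages above can be sketched in a few dozen lines of NumPy. This is a toy illustration under stated assumptions, not the Nous Research implementation: all function names are made up for this sketch, the score is assumed to be the sum of the pooled query and key norms, attention is non-causal plain softmax instead of FlashAttention, finer levels are assumed to overwrite coarser ones when their spans overlap, and positions that no selected entry covers are simply left at zero.

```python
import numpy as np


def avg_pool(x, factor):
    """Average every `factor` consecutive rows of x (shape (n, d)); drops a ragged tail."""
    n = (x.shape[0] // factor) * factor
    return x[:n].reshape(-1, factor, x.shape[1]).mean(axis=1)


def build_pyramid(x, factor=4, levels=2):
    """Stage 1: level 0 is x itself, level l summarizes factor**l tokens per entry."""
    pyramid = [x]
    for _ in range(levels):
        pyramid.append(avg_pool(pyramid[-1], factor))
    return pyramid


def dense_attention(q, k, v):
    """Plain softmax attention, standing in for an off-the-shelf fused kernel."""
    scores = q @ k.T / np.sqrt(q.shape[1])
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v


def lighthouse_sketch(q, k, v, top_k, factor=4, levels=2):
    n, d = q.shape
    # Stage 1: pool Q, K, and V symmetrically and flatten all levels into one list.
    qp, kp, vp = (build_pyramid(t, factor, levels) for t in (q, k, v))
    q_all, k_all, v_all = (np.concatenate(p) for p in (qp, kp, vp))
    # Remember which original positions each pyramid entry stands for.
    starts = np.concatenate([np.arange(len(lvl)) * factor**l for l, lvl in enumerate(qp)])
    spans = np.concatenate([np.full(len(lvl), factor**l) for l, lvl in enumerate(qp)])

    # Stage 2: parameter-free L2-norm score, then top-k (indices are plain integers).
    scores = np.linalg.norm(q_all, axis=1) + np.linalg.norm(k_all, axis=1)
    idx = np.argpartition(-scores, top_k - 1)[:top_k]

    # Stage 3: gather a small dense subsequence and run ordinary attention on it.
    out_sel = dense_attention(q_all[idx], k_all[idx], v_all[idx])

    # Stage 4: scatter each result back over the span it represents,
    # coarse entries first so that finer entries overwrite them.
    out = np.zeros((n, d))
    for j in np.argsort(-spans[idx], kind="stable"):
        start = starts[idx[j]]
        out[start:start + spans[idx[j]]] = out_sel[j]
    return out, idx


rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(64, 8)) for _ in range(3))
out, idx = lighthouse_sketch(q, k, v, top_k=16)
print(out.shape, len(idx))
```

The point to notice is that `dense_attention` never sees anything sparse: by the time it runs, the selection has already happened and the input is just a shorter, ordinary sequence.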
The result? The attention calculation, which used to be O(N²), becomes O(S²), where S is the size of that much smaller, selected subsequence. That's a massive computational win. On a B200 GPU at 512K context, this process is a whopping 21x faster on the forward pass.
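As a rough sanity check on what O(N²) versus O(S²) means, take the illustrative numbers from the pipeline description (a million-token sequence reduced to about 65,000 selected entries) and compare only the quadratic attention term:

```latex
\frac{\text{dense attention cost}}{\text{selected attention cost}}
  \approx \frac{N^2}{S^2}
  = \left(\frac{10^{6}}{6.5 \times 10^{4}}\right)^{2}
  \approx 15.4^{2}
  \approx 237
```

This is an idealized ratio for the attention matrix products alone. It ignores the pyramid, the selection, and the gather and scatter steps, and it uses a different sequence length than the 512K benchmark, so it should not be read as a prediction of the measured 21x figure.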
The "Non-Differentiable" Trick: A Stroke of Genius
Okay, let's go back to that "non-differentiable" selection step. This might sound weird. In deep learning, we usually want gradients flowing everywhere so the model can learn. But by cutting them off here, the researchers did something really smart.
Gradients don't flow back to the scorer. This means the model isn't learning how to get selected.
Instead, the gradients flow through the "Gather" and "Scatter" steps, back through the FlashAttention calculation, and into the projection matrices (the Wq, Wk, Wv weights). What does this teach the model? It learns to produce Q, K, and V values that are useful when they get selected.
This avoids a ton of optimization headaches. The model isn't trying to game a scoring system; it's just learning to produce high-quality representations that work well when chosen by a simple, fixed heuristic.
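Here is a small NumPy sketch of that gradient path, written out by hand so no autograd library is needed. The function names and the toy loss are assumptions made for illustration. The selected indices are treated as constants, so the backward pass of the gather is just a scatter: selected rows receive a gradient, unselected rows receive exactly zero, and nothing is computed for the scorer.

```python
import numpy as np


def select_top_k(x, k):
    """Pick the k rows with the largest L2 norm. The result is a set of integer
    indices, which the backward pass below treats as constants."""
    return np.argpartition(-np.linalg.norm(x, axis=1), k - 1)[:k]


def forward(x, w, idx):
    """Toy loss: a weighted sum over the gathered rows."""
    return float((x[idx] * w).sum())


def backward(x, w, idx):
    """Gradient of the toy loss with respect to x, with idx held fixed.
    The backward of a gather is a scatter into the selected rows."""
    grad = np.zeros_like(x)
    grad[idx] = w
    return grad


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))   # stand-in for projected Q, K, or V rows
w = rng.normal(size=(3, 4))   # stand-in for whatever the downstream loss wants
idx = select_top_k(x, 3)
grad = backward(x, w, idx)

print("selected rows:", sorted(idx.tolist()))
print("rows with nonzero gradient:", np.flatnonzero(np.abs(grad).sum(axis=1)).tolist())
```

In a real model the gradient would keep flowing from `x` into the projection weights that produced it, which is the "learn to be useful when selected" signal described above.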
The Two-Phase Training Plan: Train Fast, Then "Recover"
This is where it all comes together. The claim is that you can use this sparse training method and still end up with a fully functional dense model. Here's how they demonstrate it:
- Stage 1: Lighthouse Pretraining. For the vast majority of the training (say, the first 12,000 steps), you train the model with Lighthouse Attention enabled. This is the fast part, where you're getting almost 2x the throughput of a normal training run.
- Stage 2: Dense SDPA Resumption. Then, you take that checkpoint, turn off Lighthouse, and continue training with standard, dense attention for a short period (maybe 4,000 steps).
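In a training loop, the two-phase plan boils down to a switch keyed on the step counter. A minimal sketch, assuming the example step counts from the text; the constant and function names are hypothetical and not taken from any real training framework.

```python
LIGHTHOUSE_STEPS = 12_000  # Stage 1 length, using the example figure from the text
DENSE_STEPS = 4_000        # Stage 2 length, using the example figure from the text


def attention_mode(step: int) -> str:
    """Return which attention path a given (0-indexed) training step should use."""
    total = LIGHTHOUSE_STEPS + DENSE_STEPS
    if not 0 <= step < total:
        raise ValueError(f"step {step} is outside the {total}-step schedule")
    return "lighthouse" if step < LIGHTHOUSE_STEPS else "dense"


for step in (0, 11_999, 12_000, 15_999):
    print(step, attention_mode(step))
```

Because Lighthouse is a wrapper around the attention call and not a different set of weights, the checkpoint saved at the end of Stage 1 can be loaded straight into the dense model for Stage 2.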
When you first switch to dense attention, the loss spikes. The model is seeing a totally different attention pattern than it was trained on. But here's the magic: within about 1,000-1,500 steps, it "recovers." The loss drops back down and, incredibly, ends up lower than a baseline model that was trained with dense attention from the very beginning.
On a 50-billion-token training run, the Lighthouse-trained models finished in 22-27 hours, compared to 38 hours for the dense baseline, and achieved a better final loss. You get a better model, faster.
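The wall-clock numbers above translate into end-to-end speedups with one line of arithmetic. The variable names are just for this sketch; the hours are the ones quoted in the text.

```python
dense_hours = 38.0
lighthouse_hours = (22.0, 27.0)  # fastest and slowest Lighthouse runs quoted above

speedups = [dense_hours / h for h in lighthouse_hours]
hours_saved = [dense_hours - h for h in lighthouse_hours]

print([round(s, 2) for s in speedups])  # [1.73, 1.41]
print(hours_saved)                      # [16.0, 11.0]
```

So the end-to-end gain, dense recovery phase included, lands between roughly 1.4x and 1.7x depending on the configuration.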
But Does It Hurt Retrieval?
A low training loss is great, but can the model still find a needle in a haystack? The team tested this with a simplified Needle-in-a-Haystack (NIAH) evaluation. They hid a single digit in a long string of random text and asked the model to find it.
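To give a feel for what such a test looks like, here is a toy generator for a single-digit needle task. It is only a sketch in the spirit of the description above: the helper names, the character-level haystack of random lowercase letters, and the exact-match scoring are all assumptions, and the actual evaluation's prompt format is not specified here.

```python
import random
import string


def make_niah_example(haystack_len: int, seed: int = 0):
    """Build a random lowercase haystack with exactly one digit hidden in it."""
    rng = random.Random(seed)
    needle = rng.choice(string.digits)
    # Letters only, so the needle is the only digit in the whole string.
    chars = [rng.choice(string.ascii_lowercase) for _ in range(haystack_len)]
    position = rng.randrange(haystack_len)
    chars[position] = needle
    return "".join(chars), needle, position


def is_correct(model_answer: str, needle: str) -> bool:
    """Exact-match scoring after trimming whitespace."""
    return model_answer.strip() == needle


text, needle, position = make_niah_example(1_000, seed=42)
print(f"needle {needle!r} hidden at position {position} of {len(text)}")
```

Sweeping `haystack_len` and the needle position is what turns a generator like this into a retrieval benchmark.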
The results were fantastic. Three out of four tested Lighthouse configurations either matched or beat the dense baseline's retrieval score. It turns out that for retrieval, you want to select a slightly larger number of tokens (a bigger k), but the capability is absolutely there. This gives you a knob to tune: optimize for loss with a smaller k, or optimize for retrieval with a larger k.
Scaling Up to a Million Tokens and Beyond
For truly massive contexts—like one million tokens—you can't fit everything on one GPU anyway. You have to use a technique called Context Parallelism (CP), where the sequence is split across multiple GPUs.
Lighthouse plays beautifully with this. The first two stages (pyramid building and selection) can happen locally on each GPU with no communication. And because the third stage creates a dense subsequence, it can participate in standard "ring attention" just like any normal model. This is something that other sparse methods, which produce scattered indices, just can't do without a lot of custom engineering.
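The reason a dense subsequence matters here can be shown with a single-process simulation. In the sketch below, each simulated rank selects its own top entries locally, and the resulting equal-sized dense blocks go through a ring-style schedule in which every rank sees one key/value block per step. Everything about it is an assumption made for illustration: the function names, the omission of the pyramid, the non-causal attention, and the use of a Python loop in place of real inter-GPU communication.

```python
import numpy as np


def local_select(q, k, v, k_local):
    """Per-rank selection with no communication: keep the k_local rows with the
    largest combined query and key norm, preserving their original order."""
    scores = np.linalg.norm(q, axis=1) + np.linalg.norm(k, axis=1)
    idx = np.sort(np.argpartition(-scores, k_local - 1)[:k_local])
    return q[idx], k[idx], v[idx]


def ring_attention(q_blocks, k_blocks, v_blocks):
    """Simulated ring attention: each rank keeps its queries and folds in one
    K/V block per step using a running (online) softmax."""
    n_ranks = len(q_blocks)
    outputs = []
    for rank in range(n_ranks):
        q = q_blocks[rank]
        running_max = np.full((q.shape[0], 1), -np.inf)
        denom = np.zeros((q.shape[0], 1))
        acc = np.zeros_like(q)
        for step in range(n_ranks):
            src = (rank + step) % n_ranks  # block that "arrives" at this step
            s = q @ k_blocks[src].T / np.sqrt(q.shape[1])
            new_max = np.maximum(running_max, s.max(axis=1, keepdims=True))
            rescale = np.exp(running_max - new_max)
            w = np.exp(s - new_max)
            denom = denom * rescale + w.sum(axis=1, keepdims=True)
            acc = acc * rescale + w @ v_blocks[src]
            running_max = new_max
        outputs.append(acc / denom)
    return np.concatenate(outputs)


rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(64, 8)) for _ in range(3))
shards = [local_select(qs, ks, vs, k_local=4)
          for qs, ks, vs in zip(np.split(q, 4), np.split(k, 4), np.split(v, 4))]
q_blocks, k_blocks, v_blocks = (list(t) for t in zip(*shards))
out = ring_attention(q_blocks, k_blocks, v_blocks)
print(out.shape)
```

Since every rank ends up with an ordinary dense block of the same shape, the ring schedule needs no knowledge of how the entries were chosen.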
Lighthouse preserves its speed advantage even when scaled up to 32 Blackwell GPUs for 1M-token training.
A Pragmatic Solution for a Massive Problem
What I love about Lighthouse Attention is how practical it is. It's not trying to reinvent the wheel with a fundamentally new architecture. It's a clever, training-only wrapper that lets us use our existing, highly-optimized tools like FlashAttention more efficiently to solve a very real, very expensive problem.
Of course, it has its limits. Because it pools and selects over all the queries at once, it needs the whole sequence up front. That holds during training, but autoregressive decoding produces one query at a time, so it's a training-only method. The final model you deploy for inference is a standard, dense-attention model. But that's the whole point. You get the speed of sparse training and the quality and compatibility of a dense model at the end.
For anyone staring down the barrel of a multi-million-dollar, multi-week training run for a long-context model, a speedup of up to roughly 1.7x with a better final result isn't just a nice-to-have; it's a total game-changer.




