Aicosoft - AI & Technology News, Insights & Innovation

We’ve all been there. You’re fine-tuning a large language model with LoRA, feeling pretty good about how efficient it is. You want to teach the model a new persona or a specific JSON format. It works like a charm.

Then, you try to teach it something new. Real, hard facts. Maybe it's internal company documentation, complex medical data, or detailed financial stats. The model sounds right—the tone is perfect, the formatting is on point—but the facts are just… wrong. It hallucinates details, mixes up numbers, and confidently gives you garbage.

What’s going on? It feels like you’re hitting a wall, but it’s not your fault. You’ve just stumbled upon a quiet, little assumption baked into LoRA that completely breaks down in production.

The thing is, LoRA assumes that whatever you're teaching can be expressed as a "simple" change: a weight update with low intrinsic rank. And it turns out, there are two very different kinds of information you can teach a model.

Style vs. Facts: The Two Kinds of Learning

Think of it like this. Fine-tuning for style—like a specific tone of voice, a writing format, or a persona—is a relatively simple change. You’re not fundamentally altering the model's worldview; you're just tweaking how it presents information. This kind of update is "low-rank," meaning the important changes are concentrated in just a few conceptual directions. LoRA is fantastic at this.

But teaching a model new facts is a whole different ballgame. This information is complex, dense, and spread out. Think of it as adding a new volume to an encyclopedia. The knowledge touches thousands of interconnected concepts. This is a "high-rank" update.

And here’s the kicker: a rank-r adapter can only move the weights along r directions, so standard LoRA with its default low-rank setup (like a rank of 8 or 16) cannot represent all that complexity, no matter how long you train. It tries to squeeze that entire encyclopedia volume into a single pamphlet. It loses crucial details, which is why your model sounds right but gets the facts wrong.
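To make the rank ceiling concrete, here is a minimal sketch (not part of the simulation that follows). The shapes, the seed, and the random factors are assumptions chosen for illustration; a real adapter would learn B and A during training.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed for this sketch
d, k, r = 64, 64, 8              # hypothetical layer size and LoRA rank

B = rng.standard_normal((d, r))  # first factor, shape (d, r)
A = rng.standard_normal((r, k))  # second factor, shape (r, k)
delta_w = B @ A                  # full-size (d, k) update built from the two factors

# The update has d * k entries, but its rank can never exceed r.
update_rank = np.linalg.matrix_rank(delta_w)
print("update shape:", delta_w.shape)
print("update rank :", update_rank)
```

However the two factors are filled in, the product stays inside an r-dimensional subspace. That is the "pamphlet" limit in matrix form.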

You might think, "Okay, I'll just increase the rank!" But that's where you hit the second problem: instability. Standard LoRA wasn't built for high ranks, and when you push it, the whole training process can fall apart.

Let's watch it happen. No complicated training loops, no fancy frameworks. We’re going to demonstrate it from scratch with a little NumPy.

Let’s Build a Simulation to See It Break

To really get what’s happening, we’re going to simulate the whole process. We’ll create a fake "weight matrix" from a pretend pre-trained model. Then, we'll create two different updates: one simple "style" update and one complex "facts" update.

This lets us see exactly how much information LoRA can actually capture at different ranks.

import numpy as np

np.random.seed(42)

# Let's define the size of our pretend weight matrix
d, k = 64, 64 

# And the LoRA ranks we'll test
r_low = 4   # A typical, standard LoRA rank
r_high = 32 # A higher rank for capturing more info

print(f"Weight matrix shape : ({d} x {k})")
print(f"Low rank (standard): r = {r_low}")
print(f"High rank (RS-LoRA) : r = {r_high}")
print(f"Max possible rank : {min(d, k)}")

Creating Our "Style" and "Fact" Updates

Now for the fun part. We'll generate two update matrices.

Our style update is designed to be low-rank. Most of its important information is packed into just a few dimensions. You’ll see this when we look at its "singular values"—a mathematical way of measuring how important each dimension is. For style, a few values will be big, and the rest will drop off a cliff.

Our fact update is the opposite. It’s high-rank. The information is spread evenly across many dimensions. Its singular values will decay very slowly, meaning every dimension holds a piece of the puzzle.
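If singular values are new to you, this small sketch may help. It builds a matrix from a known set of singular values, recovers them with SVD, and shows that dropping a direction costs exactly that direction's singular value. The spectrum and matrix sizes here are assumptions picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
known = np.array([5.0, 3.5, 2.0, 0.5])   # hypothetical spectrum, largest first

# Orthonormal columns from QR, then M = U diag(known) V^T.
U, _ = np.linalg.qr(rng.standard_normal((16, 4)))
V, _ = np.linalg.qr(rng.standard_normal((12, 4)))
M = (U * known) @ V.T

# SVD hands back the same values we put in.
recovered = np.linalg.svd(M, compute_uv=False)
print("recovered:", recovered[:4].round(6))

# Keep only the top 3 directions and measure what was lost (Frobenius norm).
U2, S2, Vt2 = np.linalg.svd(M, full_matrices=False)
rank3 = (U2[:, :3] * S2[:3]) @ Vt2[:3, :]
dropped_error = np.linalg.norm(M - rank3, "fro")
print("error after dropping the weakest direction:", round(dropped_error, 6))
```

A big singular value means a direction you cannot afford to drop; a tiny one means you can cut it and barely notice.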

def make_low_rank_delta(d, k, true_rank, noise=0.01):
    """Simulates a style update -- low intrinsic rank."""
    U = np.random.randn(d, true_rank)
    S = np.linspace(5, 0.5, true_rank)  # These values drop off fast
    V = np.random.randn(k, true_rank)
    U, _ = np.linalg.qr(U)
    V, _ = np.linalg.qr(V)
    delta = (U[:, :true_rank] * S) @ V[:, :true_rank].T
    delta += noise * np.random.randn(d, k)
    return delta

def make_high_rank_delta(d, k, noise=0.01):
    """Simulates a fact/knowledge update -- high intrinsic rank."""
    U = np.random.randn(d, d)
    S = np.linspace(3, 0.5, min(d, k)) # These values drop off slowly
    V = np.random.randn(k, k)
    U, _ = np.linalg.qr(U)
    V, _ = np.linalg.qr(V)
    delta = (U[:, :min(d,k)] * S) @ V[:, :min(d,k)].T
    delta += noise * np.random.randn(d, k)
    return delta

# Create the two updates
delta_style = make_low_rank_delta(d, k, true_rank=4)
delta_facts = make_high_rank_delta(d, k)

# Let's look at their singular values
print("\nStyle update -- top 10 singular values:", np.linalg.svd(delta_style, compute_uv=False)[:10].round(2))
print("Facts update -- top 10 singular values:", np.linalg.svd(delta_facts, compute_uv=False)[:10].round(2))
print("\nNotice: Style decays fast → low-rank. Facts decay slowly → high-rank.")

See that? The singular values for the style update fall from 5.0 to about 0.5 within the first four components, then collapse to the noise floor. But for the facts update, they barely budge across the top ten. This is the core of the problem, in numbers.

The Test: Can Standard LoRA Handle the Truth?

Okay, let's see how well standard LoRA can reconstruct these two different updates. We’ll also test a slightly modified version called RS-LoRA (Rank-Stabilized LoRA), which uses a different scaling factor.

The only difference is tiny:

Standard LoRA scales the update by alpha / r (where r is the rank).
RS-LoRA scales it by alpha / sqrt(r).

It seems trivial, but as you'll see, that one-line change makes all the difference. We'll measure the "reconstruction error" (lower is better).
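For context, here is a rough sketch of where that scale factor sits in an adapter's forward pass. The function name `lora_forward`, the `rank_stabilized` flag, and the random factors are assumptions for illustration, not any library's API. Real adapters typically start with B at zero; random values are used here only so the update is nonzero.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha, r, rank_stabilized=False):
    """Frozen weight W plus a scaled low-rank update B @ A, applied to input x."""
    scale = alpha / np.sqrt(r) if rank_stabilized else alpha / r
    return W @ x + scale * (B @ (A @ x))

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 64, 32, 16          # hypothetical sizes
W = rng.standard_normal((d, k))          # stand-in for a frozen pre-trained weight
A = rng.standard_normal((r, k))
B = rng.standard_normal((d, r))          # random only for this demo
x = rng.standard_normal(k)

base = W @ x
std_update = lora_forward(x, W, A, B, alpha, r) - base
rs_update = lora_forward(x, W, A, B, alpha, r, rank_stabilized=True) - base

# Same factors, different scale: the two updates differ by a factor of sqrt(r).
ratio = np.linalg.norm(rs_update) / np.linalg.norm(std_update)
print("update norm ratio:", ratio, "sqrt(r):", np.sqrt(r))
```

Nothing about the adapter's shape or parameter count changes. Only the multiplier in front of B @ A does.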

def lora_approx_standard(delta, r, alpha=16):
    """Approximate delta using rank-r LoRA with standard alpha/r scaling."""
    U, S, Vt = np.linalg.svd(delta, full_matrices=False)
    B = U[:, :r] * S[:r]
    A = Vt[:r, :]
    scaling = alpha / r
    delta_approx = scaling * (B @ A)
    error = np.linalg.norm(delta - delta_approx, 'fro') / np.linalg.norm(delta, 'fro')
    return delta_approx, error

def lora_approx_rslora(delta, r, alpha=16):
    """Approximate delta using rank-r LoRA with RS-LoRA sqrt(r) scaling."""
    U, S, Vt = np.linalg.svd(delta, full_matrices=False)
    B = U[:, :r] * S[:r]
    A = Vt[:r, :]
    scaling = alpha / np.sqrt(r) # <-- The only change is here!
    delta_approx = scaling * (B @ A)
    error = np.linalg.norm(delta - delta_approx, 'fro') / np.linalg.norm(delta, 'fro')
    return delta_approx, error

# Let's run the test across different ranks
ranks = [2, 4, 8, 16, 32, 48]
style_errors_standard, facts_errors_standard = [], []
style_errors_rslora, facts_errors_rslora = [], []

for r in ranks:
    _, e = lora_approx_standard(delta_style, r)
    style_errors_standard.append(e)
    _, e = lora_approx_standard(delta_facts, r)
    facts_errors_standard.append(e)
    _, e = lora_approx_rslora(delta_style, r)
    style_errors_rslora.append(e)
    _, e = lora_approx_rslora(delta_facts, r)
    facts_errors_rslora.append(e)

print("Rank | Style Err (std) | Facts Err (std) | Facts Err (RS-LoRA)")
print("-" * 60)
for i, r in enumerate(ranks):
    print(f" {r:2d} | {style_errors_standard[i]:.3f}           | {facts_errors_standard[i]:.3f}         | {facts_errors_rslora[i]:.3f}")

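The results are clear once you know how to read them. For the style update, the standard-LoRA error falls quickly up to r = 16. But for facts? At low ranks the error stays stubbornly high under either scaling, because most of the update lives in directions the adapter cannot reach. You’re losing information.

Now look at what happens as the rank increases. With standard LoRA and alpha = 16, the scale alpha / r equals 1 at r = 16 and shrinks below 1 after that, so the error stops improving consistently: among the ranks tested, the facts error bottoms out at r = 32 and rises again at r = 48, and the style error rises after r = 16. With RS-LoRA's scaling, the facts error drops steadily at every step. One caveat: this toy test multiplies a fixed SVD truncation by the scale instead of training A and B, so the absolute error values mix truncation loss with scale mismatch (alpha / sqrt(r) stays above 1 at every rank here, which keeps the RS-LoRA numbers high). Read the trend across ranks, not the magnitudes.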
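One way to read the table above: because each approximation is a scaled SVD truncation, its relative error has a closed form. If c_r is the fraction of squared Frobenius norm held by the top r singular values and s is the scale, then error² = (s - 1)² · c_r + (1 - c_r). The sketch below checks that identity numerically. The helper names and the random stand-in matrix are assumptions for illustration, not the article's `delta_facts`.

```python
import numpy as np

def scaled_truncation_error(delta, r, scale):
    """Relative Frobenius error of scale * (best rank-r approximation of delta)."""
    U, S, Vt = np.linalg.svd(delta, full_matrices=False)
    approx = scale * ((U[:, :r] * S[:r]) @ Vt[:r, :])
    return np.linalg.norm(delta - approx, "fro") / np.linalg.norm(delta, "fro")

def predicted_error(delta, r, scale):
    """Closed form: sqrt((scale - 1)^2 * c_r + (1 - c_r))."""
    S = np.linalg.svd(delta, compute_uv=False)
    c_r = np.sum(S[:r] ** 2) / np.sum(S ** 2)   # energy captured by the top r
    return np.sqrt((scale - 1.0) ** 2 * c_r + (1.0 - c_r))

rng = np.random.default_rng(0)
delta = rng.standard_normal((64, 64))   # stand-in update matrix
alpha = 16

for r in [4, 16, 48]:
    for name, scale in [("alpha/r", alpha / r), ("alpha/sqrt(r)", alpha / np.sqrt(r))]:
        measured = scaled_truncation_error(delta, r, scale)
        predicted = predicted_error(delta, r, scale)
        print(f"r={r:2d} {name:>13}: measured={measured:.4f} predicted={predicted:.4f}")
```

The formula shows that the error equals pure truncation loss only when the scale is exactly 1. Any other scale adds a mismatch term on top, which is why the magnitudes in this kind of toy test depend heavily on alpha.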

So, Why Does Standard LoRA Break at High Ranks?

This brings us to the second failure: scaling collapse.

The alpha / r scaling in standard LoRA is the culprit. As you increase the rank r to capture more information, you’re also dividing the entire update by a larger and larger number. The learning signal gets weaker and weaker until it practically vanishes.

Imagine you're adding more people to a team (increasing the rank) to get more work done. But for every new person you add, you cut everyone's pay. Pretty soon, nobody has any motivation to do anything.

That’s standard LoRA. You add capacity, but you kill the impact.

Let’s see the numbers:

alpha = 16
rs = np.arange(1, 65)
standard_scale = alpha / rs
rslora_scale = alpha / np.sqrt(rs)

print("\nRank | Standard Scale (alpha/r) | RS-LoRA Scale (alpha/sqrt(r))")
print("-" * 55)
for r in [1, 4, 8, 16, 32, 64]:
    print(f" {r:2d} | {alpha/r:.4f}                 | {alpha/np.sqrt(r):.4f}")

print("\nStandard scaling vanishes as rank grows.")
print("RS-LoRA scaling stays meaningful at high ranks.")

At rank 64, the standard LoRA signal is scaled down to almost nothing (0.25), while the RS-LoRA signal is still meaningful (2.0). This small change keeps the learning stable, allowing the model to actually benefit from the higher rank.
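Another way to look at the same numbers, as a small sketch: the gap between the two scalings is exactly sqrt(r), so at any fixed rank you could get the RS-LoRA scale from standard LoRA by raising alpha to alpha * sqrt(r). The variable names below are assumptions for illustration.

```python
import numpy as np

alpha = 16
ranks = np.array([1, 4, 8, 16, 32, 64])

standard = alpha / ranks
rank_stabilized = alpha / np.sqrt(ranks)

# The gap between the two scalings is sqrt(r), so it widens as rank grows.
ratio = rank_stabilized / standard

# Alpha a standard-LoRA run would need at each rank to match the RS-LoRA scale.
matching_alpha = alpha * np.sqrt(ranks)

for r, ratio_r, a in zip(ranks, ratio, matching_alpha):
    print(f"r={r:2d}  ratio={ratio_r:5.2f}  matching alpha={a:6.1f}")
```

The practical difference is that with alpha / sqrt(r) you don't have to retune alpha every time you change the rank.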

The "Long Tail" Problem in Plain Sight

Let's look at one last thing: how much information is actually captured at each rank. This really drives the point home.

For our style update, you can capture about 99% of the information with just a rank of 4. The rest is just noise. It’s got a short, stubby tail.

For our facts update, a rank of 8 only gets you about 28% of the way there. Most of the knowledge is in the "long tail" of dimensions. If you cut it off, you lose the information. This is exactly why your model might sound confident but be factually incorrect. It only learned a fraction of the necessary knowledge.

sv_style = np.linalg.svd(delta_style, compute_uv=False)
sv_facts = np.linalg.svd(delta_facts, compute_uv=False)

print("Cumulative variance captured by top-r components:\n")
print(f"{'Rank':>5} | {'Style (%)':>10} | {'Facts (%)':>10}")
print("-" * 32)
total_style = np.sum(sv_style**2)
total_facts = np.sum(sv_facts**2)

for r in [2, 4, 8, 16, 32]:
    cs = 100 * np.sum(sv_style[:r]**2) / total_style
    cf = 100 * np.sum(sv_facts[:r]**2) / total_facts
    print(f" {r:3d} | {cs:9.1f}% | {cf:9.1f}%")
    
print("\nWith r=8, style is nearly fully captured.")
print("With r=8, facts are still poorly captured -- the tail matters!")
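You can also turn the question around and ask how much rank a given update needs. Here is a minimal sketch, assuming a hypothetical helper `rank_for_energy` and two small diagonal matrices whose singular values are known by construction. The 90% target is an arbitrary choice, not a recommendation.

```python
import numpy as np

def rank_for_energy(matrix, target=0.90):
    """Smallest r whose top-r singular values hold at least `target`
    of the squared Frobenius norm."""
    s2 = np.linalg.svd(matrix, compute_uv=False) ** 2
    cumulative = np.cumsum(s2) / np.sum(s2)
    r = int(np.searchsorted(cumulative, target)) + 1
    return min(r, len(s2))   # guard against rounding when target is close to 1

# Diagonal matrices make the singular values easy to read off.
fast_decay = np.diag([5.0, 3.5, 2.0, 0.5, 0.01, 0.01, 0.01, 0.01])  # style-like
slow_decay = np.diag(np.linspace(3, 0.5, 8))                        # facts-like

print("rank needed, fast decay:", rank_for_energy(fast_decay))
print("rank needed, slow decay:", rank_for_energy(slow_decay))
```

Running the same helper on an estimate of your own weight update, if you have one, gives a rough starting point for choosing r instead of guessing.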

So, What's the Takeaway?

LoRA is a fantastic tool, but it's not a magic bullet. It was designed with a specific assumption that doesn't hold up when you're trying to inject dense, factual knowledge into a model.

If you’re only tweaking style, persona, or format, a low-rank standard LoRA is probably fine. But if you're finding that your model is struggling to learn and retain complex information, you're not going crazy. You're likely hitting this exact wall.

The solution isn't to just blindly crank up the rank. The solution is to use a method built for it. RS-LoRA, with its simple but brilliant scaling adjustment, allows you to use higher ranks effectively, giving your model the capacity it needs to learn facts without breaking the training process.

So next time you're fine-tuning, remember the difference between style and facts. It might just be the key to unlocking your model's true potential.

The Hidden LoRA Problem That's Silently Breaking Your AI Fine-Tuning
