Why Your AI Model Fails on Rare Words: A Showdown Between SGD and Adam

Akram Chauhan

Have you ever tried to have a conversation in a really loud room? You can easily hear the person shouting right next to you, but you have to strain to catch what the quiet person across the room is saying. You might miss their most important points entirely.

Believe it or not, this is a massive problem for the AI models we use every day, especially giant language models. They're trained on mountains of text where some words, like "the" or "a," are the loud people in the room—they show up constantly. But other words, rare but often crucial ones like "photosynthesis" or "thalweg," are the quiet ones. They only appear once in a blue moon.

This creates a hidden challenge for how these models learn. The old-school, workhorse training method, called Stochastic Gradient Descent (or SGD), is a bit like a person who can't tune out the noise. It uses one fixed learning rate for every word, no matter how rarely that word shows up. The result? The model gets really, really good at understanding "the," but the rare, quiet words get left behind, barely learned at all.

This is where a smarter algorithm called Adam comes in. And today, we're going to pull back the curtain and see exactly how Adam saves the day. We'll even build a tiny "lab" with some code to watch this battle play out in real time.

The Heart of the Problem: Everyone Gets the Same Paycheck

To understand why SGD struggles, you have to know a little about how models learn. Think of it like tuning a massive soundboard with millions of knobs. Each knob (a "parameter" or "weight") controls how much the model pays attention to a specific word or feature.

When we train the model, we show it an example, see how wrong its prediction is, and then calculate a "gradient." This gradient is just a set of directions telling us which way, and how far, to turn each knob to get a better result. SGD takes these directions and scales every one of them by the same fixed value, which we call the "learning rate." A knob with a big gradient moves a lot, a knob with a small gradient moves a little, and a knob with no gradient doesn't move at all.

Now, imagine the word "the" appears in 95% of your sentences. Its knob gets a little nudge almost every single time. It learns fast.

But what about the word "thalweg" (it's a real word, I promise: it means the lowest point in a valley)? It might only appear in 0.1% of sentences. Its knob might go thousands of training steps without getting a single nudge. Since SGD applies the same learning rate to it as to "the," those few nudges are nowhere near enough, and it barely moves from its starting position. It remains unlearned.

This isn't just a small flaw; it's a fundamental weakness. SGD is blind to frequency.
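Here's a minimal sketch of that update rule, just to make the "one learning rate for everyone" idea concrete. The two-weight setup and the gradient values are made up for illustration; they aren't taken from the experiment below.

```python
import numpy as np

def sgd_step(w, grad, lr):
    """One plain SGD update: every weight is scaled by the same learning rate."""
    return w - lr * grad

lr = 0.05
w = np.zeros(2)  # [weight for "the", weight for "thalweg"]

# Hypothetical gradients for one batch: "the" appears, "thalweg" does not,
# so the second gradient entry is exactly zero.
grad = np.array([-1.0, 0.0])

w = sgd_step(w, grad, lr)
print(w)
```

The first weight moves by 0.05 and the second stays at zero. Nothing in the rule compensates for the fact that the second weight almost never gets a turn.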

Let's Build a Lab to See This in Action

Talk is cheap, right? Let's prove it. We're going to set up a super simple experiment to isolate this exact problem. No complex neural networks, just a straightforward test.

Here’s the plan:

  1. Create a tiny vocabulary: We'll have six words, ranging from extremely common to ridiculously rare.
  2. Set a simple goal: The "correct" answer (the target weight) for every single word will be 1.0. This is key. It means no word is inherently more important than another; the only difference is how often the model sees them.
  3. Train two models: We'll train one model with classic SGD and another with Adam, using the exact same data and settings.

This setup creates a perfect cage match. The only difference between the two models will be the training algorithm they use.

Here's the quick Python setup using NumPy. We're defining our words and their appearance probabilities, which span a huge range.

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec

np.random.seed(42)

TOKENS = ["the", "model", "embedding", "tokenization", "xenobiotic", "thalweg"]
# Appearance probability per sample -- spans roughly 3 orders of magnitude
FREQ = np.array([0.95, 0.60, 0.20, 0.05, 0.005, 0.001])

TRUE_W = np.ones(6) # all weights should reach 1.0
N_STEPS = 3000
LR = 0.05
BATCH_SIZE = 32 # samples per step

def sample_batch(batch_size):
    """
    Each sample is a sparse binary feature vector.
    Token i appears in the sample with probability FREQ[i].
    Target y = x @ TRUE_W + small noise.
    """
    X = (np.random.rand(batch_size, 6) < FREQ).astype(float)
    y = X @ TRUE_W + np.random.randn(batch_size) * 0.1
    return X, y
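Before training anything, it's worth a quick sanity check that a sampler like this really produces tokens at the intended rates. The sketch below is standalone and uses a local `np.random.default_rng` generator and a large sample count of my own choosing, rather than the global seed and batch size from the setup above.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: local generator, not the global seed
freq = np.array([0.95, 0.60, 0.20, 0.05, 0.005, 0.001])

# Draw many sparse binary feature vectors the same way sample_batch does.
n_samples = 200_000
X = (rng.random((n_samples, 6)) < freq).astype(float)

# The column means are the observed appearance rates.
empirical = X.mean(axis=0)
for p, e in zip(freq, empirical):
    print(f"target {p:.3f}   observed {e:.4f}")
```

With this many samples, the observed rates should land very close to the targets, including the rarest token.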

Round 1: SGD Steps into the Ring

First up is standard, mini-batch SGD. We’ll initialize all our word weights to zero. In a loop, we'll grab a batch of data, calculate how wrong the model is, and then nudge the weights in the right direction using our fixed learning rate.

The crucial part is that a word's weight only gets updated if that word actually appears in the batch. For our rare words, that’s not going to happen very often.

def train_sgd(n_steps, lr, batch_size):
    w = np.zeros(6)
    history = np.zeros((n_steps, 6))      # weight trajectory per token
    grad_counts = np.zeros(6)             # how many non-zero gradients each weight got

    for t in range(n_steps):
        X, y = sample_batch(batch_size)
        error = X @ w - y
        grad = (X.T @ error) / batch_size
        w -= lr * grad

        grad_counts += (np.abs(grad) > 1e-9).astype(float)
        history[t] = w.copy()
    return history, grad_counts

You can already see where this is going. SGD is going to spend all its time and energy on the common words, while the rare ones are left in the dust.
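To put a number on it, here's a small standalone sketch that computes the averaged gradient at the starting point (all weights at zero) over a large sample. The sample size and the local random generator are my own choices for the illustration. Because the gradient is averaged over samples, a token that shows up rarely contributes to very few of them, so its gradient entry comes out tiny.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: local generator for this sketch
freq = np.array([0.95, 0.60, 0.20, 0.05, 0.005, 0.001])
true_w = np.ones(6)

# One large "batch" so the averages are stable.
n = 200_000
X = (rng.random((n, 6)) < freq).astype(float)
y = X @ true_w + rng.standard_normal(n) * 0.1

# Same gradient formula as in train_sgd, evaluated at w = 0.
w = np.zeros(6)
grad = X.T @ (X @ w - y) / n

# Gradient size of each token relative to the most common one.
relative = np.abs(grad) / np.abs(grad[0])
for p, r in zip(freq, relative):
    print(f"freq {p:.3f}   relative gradient size {r:.4f}")
```

Every weight is equally far from its target here, yet the gradient sizes fall off with frequency. Multiply them all by the same learning rate and the rare tokens crawl.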

Round 2: Adam Enters with a Secret Weapon

Now, let's train the same model with Adam. People often describe Adam as "SGD with momentum," but that's only half the story. Its real superpower is something else entirely.

Adam is like a smart personal tutor. For every single knob (parameter), it keeps two extra pieces of information:

  • m (momentum): A running average of the direction the knob has been turning. This helps it move more smoothly and quickly.
  • v (second moment): A running average of the squared size of the turns (the squared gradients). This is the secret sauce. It's often loosely called the "variance," and it's essentially a measure of how much gradient signal that knob has been getting.

Before Adam makes an update, it looks at this v value. If a knob has been getting updated constantly (like for the word "the"), its v value will be high. If a knob has been sitting idle (like for "thalweg"), its v value will be tiny.

Adam then divides the update by the square root of this v value. The result? Knobs that get updated all the time get relatively smaller nudges, and knobs that rarely get updated get relatively bigger ones. It automatically boosts the signal for the quiet ones.
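To see that scaling in isolation, here's a minimal one-parameter sketch. It feeds two made-up gradient streams through the running average of squared gradients and compares the resulting scale factor, lr / (sqrt(v_hat) + eps). The streams (a gradient of 1 every step versus a gradient of 1 every 100th step) and the helper name `adam_scale` are assumptions for illustration only.

```python
import numpy as np

def adam_scale(grads, lr=0.05, beta2=0.999, eps=1e-8):
    """Return lr / (sqrt(v_hat) + eps) after feeding a 1-D gradient sequence."""
    v = 0.0
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1 - beta2) * g ** 2
    v_hat = v / (1 - beta2 ** t)  # bias correction at the final step
    return lr / (np.sqrt(v_hat) + eps)

n = 1000
dense = np.ones(n)       # hypothetical common token: gradient every step
sparse = np.zeros(n)
sparse[::100] = 1.0      # hypothetical rare token: gradient every 100th step

dense_scale = adam_scale(dense)
sparse_scale = adam_scale(sparse)
print(f"dense scale:  {dense_scale:.4f}")
print(f"sparse scale: {sparse_scale:.4f}")
```

The individual gradients are the same size in both streams, but the sparse stream ends up with a scale factor about an order of magnitude larger, purely because it gets gradients less often.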

def train_adam(n_steps, lr, batch_size, beta1=0.9, beta2=0.999, eps=1e-8):
    w = np.zeros(6)
    m = np.zeros(6)
    v = np.zeros(6)
    history = np.zeros((n_steps, 6))
    v_history = np.zeros((n_steps, 6))    # track variance accumulation

    for t in range(1, n_steps + 1):
        X, y = sample_batch(batch_size)
        error = X @ w - y
        grad = (X.T @ error) / batch_size

        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2

        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
        history[t-1] = w.copy()
        v_history[t-1] = v_hat.copy()
    return history, v_history

The Moment of Truth: Let's See the Results

Okay, we've trained both models under identical conditions. The only difference was the optimizer. Let's see how they did. Remember, the goal was for every word's weight to reach 1.0.

print("Training SGD...")
sgd_history, sgd_grad_counts = train_sgd(N_STEPS, LR, BATCH_SIZE)
print("Training Adam...")
adam_history, adam_v_history = train_adam(N_STEPS, LR, BATCH_SIZE)

sgd_final = sgd_history[-1]
adam_final = adam_history[-1]

print("=" * 62)
print(f"{'Token':<16} {'Freq':>6} {'SGD w':>8} {'Adam w':>8} {'SGD grads':>10}")
print("-" * 62)
for i, token in enumerate(TOKENS):
    sgd_err = abs(sgd_final[i] - TRUE_W[i])
    flag = " ← fails" if sgd_err > 0.3 else ""
    print(
        f"{token:<16} {FREQ[i]:>6.3f} {sgd_final[i]:>8.4f} "
        f"{adam_final[i]:>8.4f} {int(sgd_grad_counts[i]):>10}{flag}"
    )
print("\nTrue weight for all tokens: 1.0")

The results are just brutal.

| Token        | Freq  | SGD w  | Adam w | SGD grads  |
|--------------|-------|--------|--------|------------|
| the          | 0.950 | 1.0069 | 1.0028 | 2848       |
| model        | 0.600 | 1.0084 | 1.0044 | 1806       |
| embedding    | 0.200 | 0.9943 | 0.9822 | 610        |
| tokenization | 0.050 | 0.9416 | 0.9707 | 154        |
| xenobiotic   | 0.005 | 0.5312 | 0.9859 | 15 ← fails |
| thalweg      | 0.001 | 0.1541 | 0.9880 | 3 ← fails  |

Look at that. For the common words, both optimizers did great. But for the rare ones, SGD completely fell apart. The word "xenobiotic" only got 15 updates out of 3000, and its final weight is stuck at 0.53. And poor "thalweg" only got 3 updates, ending at a pathetic 0.15.

But Adam? It nailed it. Even with the same tiny number of updates, it learned the correct weight for every single word.

Uncovering Adam's "Automatic Equalizer"

So, how did Adam pull this off? Let's look at that secret weapon: the "effective learning rate."

Even though we set the main learning rate to 0.05 for everyone, Adam creates a custom one for each word on the fly by dividing by the square root of that v value we talked about.

Let's see what those effective learning rates look like at the end of training.

eps = 1e-8
adam_v_final = adam_v_history[-1]
effective_lr = LR / (np.sqrt(adam_v_final) + eps)

print("=" * 55)
print("Adam Effective Learning Rate (final step)")
print("=" * 55)
for i, token in enumerate(TOKENS):
    print(f" {token:<16} v_hat={adam_v_final[i]:.6f} lr_eff={effective_lr[i]:.4f}")
print(f"\nNominal LR: {LR}")

The results are mind-blowing:

| Token        | v_hat    | lr_eff  |
|--------------|----------|---------|
| the          | 0.000788 | 1.7779  |
| model        | 0.000495 | 2.2471  |
| embedding    | 0.000171 | 3.8277  |
| tokenization | 0.000042 | 7.6980  |
| xenobiotic   | 0.000004 | 24.3989 |
| thalweg      | 0.000001 | 41.5227 |

Wow. Even the common word "the" ends up with a scale factor of about 1.78, well above the nominal 0.05, because its squared gradients are small by the end of training. But look at "thalweg"! Because it received so little gradient signal, its second-moment estimate (v_hat) is tiny, and Adam automatically cranked its effective learning rate up to over 41, roughly 23 times higher than for "the." It gets a much bigger push per unit of gradient, to make up for lost time.

This is the magic of Adam. It didn't need us to tell it which words were rare. It figured it out all by itself, just by watching the gradients. It's an automatic equalizer that ensures every voice, loud or quiet, gets heard.
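One caveat worth checking for yourself: that big scale factor multiplies the momentum term m_hat, which is itself small for a rarely seen token, so the weight doesn't actually jump by 41 units. Here's a minimal one-parameter sketch that tracks the actual step, lr * m_hat / (sqrt(v_hat) + eps), for a made-up sparse gradient stream. The stream and the helper name `adam_steps` are assumptions for illustration, not part of the experiment above.

```python
import numpy as np

def adam_steps(grads, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the absolute Adam step taken at each position of a 1-D gradient sequence."""
    m, v = 0.0, 0.0
    steps = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        steps.append(abs(lr * m_hat / (np.sqrt(v_hat) + eps)))
    return np.array(steps)

lr = 0.05
sparse = np.zeros(1000)
sparse[::100] = 1.0  # hypothetical rare token: a gradient every 100th step

steps = adam_steps(sparse, lr=lr)
print(f"largest step: {steps.max():.4f}  (nominal lr: {lr})")
```

In this toy stream the largest step stays on the order of the nominal learning rate. So the equalizer works by normalizing each knob's gradient scale, not by taking wild leaps.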

The visualization tells the whole story perfectly.
