A Hands-On Guide to Shrinking LLMs: Benchmarking FP8, GPTQ, and SmoothQuant

Akram Chauhan

So, you’ve got this incredible, instruction-tuned Large Language Model. It's smart, it's capable, but let's be honest—it's also a bit of a beast. These models are notoriously huge, demanding a ton of disk space and GPU memory, which can make deploying them a real headache.

What if we could put our model on a diet? What if we could shrink it down and make it faster and more efficient, all without having to retrain it from scratch?

That’s exactly what we're going to do today. We're diving into the world of post-training quantization (PTQ). Think of it like compressing a giant, high-resolution image into a much smaller JPEG. The goal is to make the file smaller and faster to load, while losing as little quality as possible. We’re going to do that with our LLM, and we'll use a fantastic tool called llmcompressor to make it happen.
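To make the JPEG analogy concrete, here is a minimal sketch of the round trip at the heart of quantization: map floats to small integers with a scale factor, then map them back. It is a toy absmax scheme in plain Python, not what llmcompressor does internally, and the function names and weight values are made up for illustration.

```python
def quantize_absmax(values, bits=8):
    """Map floats to signed integers using a single absmax scale."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8 bits
    scale = max(abs(v) for v in values) / qmax      # assumes at least one non-zero value
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the integers and the scale."""
    return [qi * scale for qi in q]

weights = [0.42, -1.27, 0.03, 0.88, -0.56]          # made-up values for illustration
q, scale = quantize_absmax(weights, bits=8)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print(q)
print(f"scale={scale:.5f}  max round-trip error={max_err:.5f}")
```

The integers plus one scale take far less space than the original floats, and the round-trip error stays within half a quantization step. Every recipe in this lab is a more careful way of choosing those scales and roundings.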

We’ll take a base model, test its performance, and then try out three different "compression recipes" to see how they stack up. This is a hands-on lab, so roll up your sleeves and let's get started.

Before We Start: Getting Our Lab Set Up

First things first, we need to get our environment ready. This is like getting all your ingredients and tools out on the counter before you start cooking. We'll install the necessary libraries, import the right packages, and make sure we have a GPU ready to do the heavy lifting.

We'll also write a few helper functions. These are little tools we'll use over and over again to make our lives easier. We'll have one to:

  • Clean up memory: Because GPUs can get cluttered fast.
  • Measure disk size: To see exactly how much smaller our models get.
  • Time text generation: A stopwatch to measure latency (how long it takes to respond) and throughput (how many tokens it generates per second).
  • Check perplexity: This measures how "surprised" the model is by text it hasn't seen. Technically it's the exponential of the average negative log-likelihood per token. A lower score is better: it means the model assigns higher probability to the words that actually come next.
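If perplexity still feels abstract, this small sketch computes it by hand from a list of probabilities that a model assigned to the correct next token. The probabilities are invented for illustration; the real helper in this lab gets the same quantity from the model's loss.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-likelihood over the observed tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Made-up probabilities assigned to the correct next token at each position.
confident = [0.9, 0.8, 0.95, 0.85]
unsure = [0.3, 0.2, 0.25, 0.4]

print(f"confident model: {perplexity(confident):.2f}")
print(f"unsure model:    {perplexity(unsure):.2f}")
```

A handy way to read the number: a perplexity of 4 means the model is, on average, as uncertain as if it were picking uniformly among 4 tokens.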

Finally, we’ll wrap all this up in a neat benchmark function. This will let us run the same set of tests on each version of our model and keep the results organized. It’s our standardized test for every model we create.

Here's the code that gets all of that ready. We're using the Qwen/Qwen2.5-0.5B-Instruct model, a great little instruction-tuned model for this experiment.

# Install our tools
import subprocess, sys
def pip(*pkgs):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", *pkgs])
pip("llmcompressor", "compressed-tensors", "transformers>=4.45", "accelerate", "datasets")

# Import everything we need
import os, gc, time, json, math
from pathlib import Path
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

# Check for a GPU
assert torch.cuda.is_available(), \
    "Enable a GPU: Runtime > Change runtime type > T4 GPU"
print("GPU:", torch.cuda.get_device_name(0), "| CUDA:", torch.version.cuda, "| torch:", torch.__version__)

# Define our model and workspace
MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"
# /content is the Colab default; parents=True also creates it when running elsewhere
WORKDIR = Path("/content/quant_lab"); WORKDIR.mkdir(parents=True, exist_ok=True)
os.chdir(WORKDIR)

# Our helper functions
def free_mem():
    gc.collect(); torch.cuda.empty_cache()

def dir_size_gb(path):
    total = 0
    for root, _, files in os.walk(path):
        for f in files:
            total += os.path.getsize(os.path.join(root, f))
    return total / 1e9

def time_generation(model, tok, prompt, max_new_tokens=64):
    """Greedy decode; reports latency & tokens/sec after a brief warmup."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    gen_kwargs = dict(do_sample=False, pad_token_id=tok.eos_token_id)
    _ = model.generate(**inputs, max_new_tokens=4, **gen_kwargs) # Warmup
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, **gen_kwargs)
    torch.cuda.synchronize()
    dt = time.perf_counter() - t0
    new_ids = out[0][inputs["input_ids"].shape[1]:]
    # Count the tokens actually produced: generation can stop early at EOS.
    return tok.decode(new_ids, skip_special_tokens=True), dt, len(new_ids) / dt

@torch.no_grad()
def wikitext_ppl(model, tok, seq_len=512, max_chunks=20, stride=512):
    """Light WikiText-2 perplexity probe (fast, indicative)."""
    ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    text = "\n\n".join(t for t in ds["text"][:400] if t.strip())
    enc = tok(text, return_tensors="pt").input_ids.to(model.device)
    nll_sum, tok_count = 0.0, 0
    for begin in range(0, enc.size(1) - seq_len, stride):
        chunk = enc[:, begin:begin+seq_len]
        out = model(chunk, labels=chunk)
        nll_sum += out.loss.float().item() * seq_len
        tok_count += seq_len
        if tok_count // seq_len >= max_chunks: break
    return math.exp(nll_sum / tok_count)

# The master benchmark function
results = {}
PROMPT = ("<|im_start|>user\nIn two sentences, explain why post-training "
          "quantization works for large language models.<|im_end|>\n"
          "<|im_start|>assistant\n")

def benchmark(label, model_path_or_id):
    free_mem()
    print(f"\n──── benchmarking: {label} ────")
    tok = AutoTokenizer.from_pretrained(model_path_or_id)
    m = AutoModelForCausalLM.from_pretrained(
        model_path_or_id, torch_dtype="auto", device_map="cuda").eval()
    sample, dt, tps = time_generation(m, tok, PROMPT)
    ppl = wikitext_ppl(m, tok)
    size = dir_size_gb(model_path_or_id) if os.path.isdir(str(model_path_or_id)) else None
    results[label] = {
        "size_gb": size,
        "ppl": round(ppl, 3),
        "latency_s": round(dt, 3),
        "tok_per_s": round(tps, 1),
        "sample": sample.strip().replace("\n", " ")[:180],
    }
    print(json.dumps(results[label], indent=2))
    del m; free_mem()

The Starting Line: Our FP16 Baseline

Before we start shrinking anything, we need to know what we're starting with. We'll run our benchmark on the original, untouched model, loaded in 16-bit floating-point precision, which is pretty standard for released checkpoints. (We call it FP16 for short; torch_dtype="auto" loads whichever 16-bit format the checkpoint's config specifies, which for this Qwen2.5 model is bfloat16.) This is our "control group": every compressed version will be compared against these numbers.

print("\n════════════ Baseline (FP16) ════════════")
benchmark("00_fp16_baseline", MODEL_ID)

Recipe 1: The Quick and Easy FP8 Method

Alright, let's try our first compression technique: FP8 dynamic quantization. This is one of the simplest methods out there. We're telling llmcompressor to convert most of the model's linear layers—the real workhorses of an LLM—to 8-bit floating-point numbers.

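The "dynamic" part refers to the activations: the weights are converted to FP8 once, up front, while the scaling factors for the activations are computed on the fly at inference time. Because nothing has to be measured in advance, the method needs no sample data. It's fast and data-free. We do, however, tell it to ignore the lm_head, which is the final output layer. You often want to keep that part in higher precision to maintain output quality.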
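Here is a rough sketch of what "computing the scale on the fly" looks like for one token's activations. It simulates FP8 in plain Python, assuming the common E4M3 layout (3 mantissa bits, largest finite value 448); the helper names and activation values are hypothetical, and real kernels do this in hardware rather than with Python floats.

```python
import math

FP8_E4M3_MAX = 448.0   # largest finite value in the E4M3 format (assumed layout)

def round_to_e4m3(x):
    """Round a float to the nearest E4M3-representable value (simplified simulation)."""
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))
    exp = max(math.floor(math.log2(abs(x))), -6)   # -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)                        # 3 mantissa bits: 8 steps per power of two
    return round(x / step) * step

def fake_quant_dynamic(row):
    """Per-token dynamic scaling: the scale is derived from this row alone."""
    scale = (max(abs(v) for v in row) / FP8_E4M3_MAX) or 1.0
    return [round_to_e4m3(v / scale) * scale for v in row]

token_a = [0.02, -0.5, 0.11, 0.3]       # made-up activations for one token
token_b = [4.0, -120.0, 33.0, 7.5]      # a token with much larger magnitudes

for row in (token_a, token_b):
    deq = fake_quant_dynamic(row)
    rel = max(abs(a - b) for a, b in zip(row, deq)) / max(abs(v) for v in row)
    print(f"max error relative to row max: {rel:.4f}")
```

Because each token gets its own scale at run time, a token with tiny values and a token with huge values both use the full FP8 range, which is why no calibration pass is needed.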

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

print("\n════════════ Recipe 1: FP8_DYNAMIC ════════════")
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tok = AutoTokenizer.from_pretrained(MODEL_ID)

# Define the FP8 recipe
recipe_fp8 = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

# Apply it with a single command
oneshot(model=model, recipe=recipe_fp8)

# Save our new, smaller model
FP8_DIR = "Qwen2.5-0.5B-FP8-Dynamic"
model.save_pretrained(FP8_DIR, save_compressed=True)
tok.save_pretrained(FP8_DIR)
del model; free_mem()

# And benchmark it!
benchmark("01_fp8_dynamic", FP8_DIR)

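After running this, the saved folder should come out noticeably smaller than the roughly 1 GB that half a billion 16-bit parameters occupy, since each quantized weight now takes one byte instead of two. (The baseline was loaded straight from the Hub, so the benchmark doesn't measure its size directly.) But how does it affect speed and that perplexity score? We'll find out soon.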

Getting Smarter: Why We Need a "Study Guide" for Our Model

Our next two methods are a bit more sophisticated. They are "calibrated" quantization techniques, which means they use a small amount of data to make smarter decisions about how to compress the model.

Think of it like giving the model a "study guide" before a big test. By seeing a few examples of the kind of data it will process, it can adjust its compression to minimize the loss of important information.
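To see why a study guide helps, consider this toy sketch. It compares plain round-to-nearest with a brute-force search that picks, for each weight, whether to round up or down so that the layer's output on some calibration inputs changes as little as possible. This is not GPTQ (which reaches a similar goal far more efficiently); the function names, weights, and inputs are all made up for illustration.

```python
from itertools import product
import math

def output_error(w, q, inputs):
    """Sum of squared differences between x.w and x.q over the calibration inputs."""
    return sum(
        sum(x_i * (w_i - q_i) for x_i, w_i, q_i in zip(x, w, q)) ** 2
        for x in inputs
    )

def round_nearest(w, step):
    """Data-free rounding: each weight snaps to its nearest grid point."""
    return [round(v / step) * step for v in w]

def round_calibrated(w, step, inputs):
    """Try floor/ceil for every weight; keep the combination with the lowest output error."""
    choices = [(math.floor(v / step) * step, math.ceil(v / step) * step) for v in w]
    return min(product(*choices), key=lambda q: output_error(w, q, inputs))

weights = [0.37, -0.62, 0.13, 0.88]          # made-up weights for one output neuron
step = 0.25                                  # deliberately coarse grid
calib = [[1.0, 0.9, 1.1, 1.0],               # made-up calibration inputs
         [0.8, 1.0, 1.0, 1.2],
         [1.2, 1.1, 0.9, 0.9]]

q_rtn = round_nearest(weights, step)
q_cal = round_calibrated(weights, step, calib)
err_rtn = output_error(weights, q_rtn, calib)
err_cal = output_error(weights, q_cal, calib)

print("nearest   :", q_rtn, f"output error={err_rtn:.4f}")
print("calibrated:", list(q_cal), f"output error={err_cal:.4f}")
```

The calibrated version accepts a slightly worse rounding on one weight so that the errors cancel out on realistic inputs. The exhaustive search only works for a handful of weights, which is exactly why practical methods need something smarter.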

For this, we'll grab a few hundred samples from the ultrachat_200k dataset, which is full of instruction-style conversations. We'll format them just like the model expects and tokenize them. This becomes our reusable calibration set.

NUM_CALIB_SAMPLES = 256
MAX_SEQ_LEN = 1024
tok = AutoTokenizer.from_pretrained(MODEL_ID)

raw = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_CALIB_SAMPLES}]")

def to_text(ex):
    return {"text": tok.apply_chat_template(ex["messages"], tokenize=False)}

def tokenize(ex):
    return tok(ex["text"], padding=False, truncation=True, max_length=MAX_SEQ_LEN, add_special_tokens=False)

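# Drop the raw columns and the intermediate "text" column so only token fields remain
calib_ds = (raw.shuffle(seed=42)
            .map(to_text, remove_columns=raw.column_names)
            .map(tokenize, remove_columns=["text"]))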

print("Calibration set:", len(calib_ds), "samples, max_seq_len =", MAX_SEQ_LEN)

Recipe 2: Going Hardcore with GPTQ (4-bit Weights!)

Now for the heavy hitter: GPTQ. This method is known for achieving some serious compression. We're going to use a W4A16 scheme, which means we’re crushing the model's Weights down to just 4 bits, while keeping the Activations (the data flowing through the model) at 16 bits.

This is an aggressive shrink! But thanks to our calibration dataset, GPTQ can do this intelligently, working layer by layer to reduce the error that compression introduces. This is where that "study guide" really pays off.
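To get a feel for what 4-bit storage means, here is a small sketch of group-wise 4-bit rounding plus the storage arithmetic. It uses plain round-to-nearest rather than GPTQ's error-compensating updates, and it assumes one 16-bit scale per group of 128 weights and a symmetric range of [-7, 7]; those details, the function names, and the sample weights are illustrative assumptions.

```python
def quantize_groupwise_int4(weights, group_size):
    """Symmetric 4-bit round-to-nearest with one scale per group of weights."""
    qmax = 7                                    # symmetric range used in this sketch: [-7, 7]
    q, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = (max(abs(v) for v in group) / qmax) or 1.0
        scales.append(scale)
        q.extend(max(-qmax, min(qmax, round(v / scale))) for v in group)
    return q, scales

def effective_bits(weight_bits, scale_bits, group_size):
    """Storage per weight once the per-group scale is spread over the group."""
    return weight_bits + scale_bits / group_size

# Made-up weights: the second group of four contains much larger values.
weights = [0.1, -0.2, 0.05, 0.15, 3.0, -2.5, 1.0, 0.5]
q, scales = quantize_groupwise_int4(weights, group_size=4)
print("codes :", q)
print("scales:", [round(s, 4) for s in scales])

bits = effective_bits(weight_bits=4, scale_bits=16, group_size=128)
print(f"effective bits per weight: {bits}")
print(f"shrink factor vs 16-bit for quantized layers: {16 / bits:.2f}x")
```

Giving each group its own scale keeps the large weights in one group from wiping out the resolution of the small weights in another. Keep in mind that the shrink factor applies only to the quantized Linear layers; the embedding table and lm_head stay in 16 bits, so the whole-model reduction will be smaller.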

from llmcompressor.modifiers.quantization import GPTQModifier

print("\n════════════ Recipe 2: GPTQ W4A16 ════════════")
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

# Define the GPTQ recipe
recipe_w4a16 = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    dampening_frac=0.01,
)

# Apply it, but this time we pass in our dataset
oneshot(
    model=model,
    dataset=calib_ds,
    recipe=recipe_w4a16,
    max_seq_length=MAX_SEQ_LEN,
    num_calibration_samples=NUM_CALIB_SAMPLES,
)

# Save and benchmark
W4A16_DIR = "Qwen2.5-0.5B-W4A16-G128"
model.save_pretrained(W4A16_DIR, save_compressed=True)
tok.save_pretrained(W4A16_DIR)
del model; free_mem()

benchmark("02_gptq_w4a16", W4A16_DIR)

Recipe 3: The Pro Move with SmoothQuant + GPTQ

Our final recipe is a two-step combo. We're going to use SmoothQuant first, and then apply GPTQ.

So, what’s SmoothQuant? Some layers in an LLM are harder to quantize than others because they have "activation outliers"—basically, a few unusually large values that throw off the whole compression process. SmoothQuant is a clever technique that "smooths out" these spikes by shifting some of the difficulty from the activations to the weights. It's like sanding a piece of rough wood before you paint it; it makes the next step much more effective.

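After smoothing, we'll apply GPTQ with a W8A8 scheme, which means both the weights and the activations are quantized to 8-bit integers. The weights aren't squeezed as hard as in the 4-bit version, but now the activations are quantized too, which is exactly where those outliers would hurt. Because we've pre-processed the model with SmoothQuant, we expect to get excellent performance with very little quality loss.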
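The smoothing trick is easy to check on paper. The sketch below applies the per-channel scaling from the SmoothQuant paper, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), to a tiny made-up activation matrix with one outlier channel. It is a plain-Python illustration with hypothetical helper names, not llmcompressor's implementation, and it assumes no channel is entirely zero.

```python
def smooth(X, W, alpha=0.8):
    """Divide each activation channel by s_j and multiply the matching weight row by s_j.

    X: tokens x in_channels, W: in_channels x out_channels (lists of rows).
    """
    n_in = len(W)
    s = []
    for j in range(n_in):
        act_max = max(abs(row[j]) for row in X)
        w_max = max(abs(v) for v in W[j])
        s.append(act_max ** alpha / w_max ** (1 - alpha))
    X_s = [[row[j] / s[j] for j in range(n_in)] for row in X]
    W_s = [[v * s[j] for v in W[j]] for j in range(n_in)]
    return X_s, W_s, s

def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Made-up data: channel 1 of the activations is the outlier.
X = [[0.5, 40.0, -0.3],
     [-0.2, -35.0, 0.4]]
W = [[0.2, -0.1],
     [0.05, 0.1],
     [-0.3, 0.25]]

X_s, W_s, s = smooth(X, W, alpha=0.8)
Y, Y_s = matmul(X, W), matmul(X_s, W_s)

print("largest activation before:", max(abs(v) for row in X for v in row))
print("largest activation after :", round(max(abs(v) for row in X_s for v in row), 3))
print("outputs unchanged        :", all(abs(a - b) < 1e-9
                                        for ra, rb in zip(Y, Y_s) for a, b in zip(ra, rb)))
```

The layer's output is mathematically identical, but the activation range is much tighter, so 8-bit integers can represent it with far less rounding error. The alpha here plays the same role as smoothing_strength in the recipe: higher values move more of the difficulty onto the weights.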

from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

print("\n════════════ Recipe 3: SmoothQuant + GPTQ W8A8 ════════════")
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

# A two-part recipe!
recipe_w8a8 = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

# Apply the combo recipe
oneshot(
    model=model,
    dataset=calib_ds,
    recipe=recipe_w8a8,
    max_seq_length=MAX_SEQ_LEN,
    num_calibration_samples=NUM_CALIB_SAMPLES,
)

# Save and benchmark our final model
W8A8_DIR = "Qwen2.5-0.5B-W8A8-SmoothQuant"
model.save_pretrained(W8A8_DIR, save_compressed=True)
tok.save_pretrained(W8A8_DIR)
del model; free_mem()

benchmark("03_smoothquant_w8a8", W8A8_DIR)

The Moment of Truth: Let's Compare the Results

Okay, the experiments are done. We’ve created three different compressed versions of our model. Now it's time to line them all up and see who wins. We'll print a clean summary table showing the size, perplexity, speed, and latency for each one.

We'll also look at the actual text each model generated for our prompt. After all, the numbers are important, but what really matters is whether the model still gives good answers.

print("\n══════════════════════ FINAL SUMMARY ══════════════════════")
print(f"{'Variant':<26}{'Size GB':>9}{'PPL':>10}{'tok/s':>9}{'Latency':>11}")
print("-" * 65)
for k, v in results.items():
    size = f"{v['size_gb']:.3f}" if v['size_gb'] else " (hub) "
    print(f"{k:<26}{size:>9}{v['ppl']:>10.2f}{v['tok_per_s']:>9.1f}"
          f"{v['latency_s']:>10.2f}s")

print("\nSample completions (greedy, 64 new tokens):")
for k, v in results.items():
    print(f"\n[{k}]\n → {v['sample']}")

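Looking at the final table, you get a clear picture of the trade-offs. The 4-bit GPTQ model is tiny, but did its perplexity score jump up? The FP8 model was easy to make, but is it as fast as the others? And how did our sophisticated SmoothQuant combo fare? One caveat on the speed columns: these timings come from plain transformers on a single short prompt, so treat them as indicative. Quantized checkpoints generally need an inference engine with dedicated low-precision kernels, such as vLLM, before the smaller weights translate into faster generation.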
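If you'd rather read the trade-offs as relative changes than as raw numbers, a small helper like the one below can do the arithmetic. It is a sketch that assumes a dictionary shaped like our results; the function name is hypothetical and the numbers in the demo are placeholders, not measurements from this lab.

```python
def compare_to_baseline(results, baseline_key):
    """Return perplexity and throughput changes of each variant relative to the baseline."""
    base = results[baseline_key]
    rows = {}
    for name, r in results.items():
        if name == baseline_key:
            continue
        rows[name] = {
            "ppl_change_pct": 100.0 * (r["ppl"] - base["ppl"]) / base["ppl"],
            "speedup": r["tok_per_s"] / base["tok_per_s"],
        }
    return rows

# Placeholder numbers purely to exercise the function; substitute your own results dict.
demo = {
    "baseline": {"ppl": 10.0, "tok_per_s": 20.0},
    "variant":  {"ppl": 11.0, "tok_per_s": 30.0},
}
for name, row in compare_to_baseline(demo, "baseline").items():
    print(f"{name}: perplexity {row['ppl_change_pct']:+.1f}%, speed x{row['speedup']:.2f}")
```

In the lab itself you would call compare_to_baseline(results, "00_fp16_baseline") after all four benchmarks have run.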

This is the core of model optimization. There's no single "best" answer—it's about finding the right balance of size, speed, and quality for your specific needs. By running a simple benchmark like this, you can make an informed decision instead of just guessing. You now have a repeatable process for testing any quantization recipe on any model, which is a seriously powerful skill to have.

© 2026 Aicosoft. All rights reserved.