Ever ask a chatbot a simple question and get a bizarre, rambling, or just plain wrong answer? Yeah, we’ve all been there. It makes you wonder what’s going on under the hood. How do you take a powerful, pre-trained Large Language Model (LLM)—which is basically a super-smart but very feral parrot—and teach it to be a genuinely helpful assistant?
The answer isn't more pre-training. It's a series of clever post-training techniques, often called "alignment." Think of it as sending the brilliant-but-unruly model to finishing school.
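And here's the cool part: you don't need a warehouse full of servers to do it anymore. In this walkthrough, we're going to roll up our sleeves and do it ourselves. We'll take a small, open-source model (Qwen2.5-0.5B-Instruct, which has already had some instruction tuning) and walk it through each major alignment technique, step by step. To keep every stage self-contained on a small GPU, each one starts from the same base checkpoint rather than chaining onto the previous stage. We'll start by teaching it basic manners and end with teaching it how to reason.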
Let's get started.
First, Let's Get Our Tools Ready
Before we can start training, we need to set up our workshop. We're going to lean heavily on some amazing tools from Hugging Face, especially the TRL (Transformer Reinforcement Learning) library. It’s a game-changer for making these advanced techniques accessible.
We'll also use PEFT (Parameter-Efficient Fine-Tuning) and a technique called LoRA. This is our secret weapon for doing this on limited hardware, like a single Google Colab T4 GPU. LoRA attaches tiny, trainable adapter matrices to the frozen base model. Instead of updating all of the model's roughly half a billion parameters, we only train on the order of a million adapter weights. It's incredibly efficient.
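To make the "tiny adapters" idea concrete, here is a rough parameter count. A rank-r LoRA adapter on a linear layer of shape (d_in, d_out) adds r * (d_in + d_out) trainable weights. The layer shapes below are assumptions about the attention projections in Qwen2.5-0.5B (hidden size 896, key/value width 128, 24 layers), so check the model's config before relying on them. `lora_param_count` is a hypothetical helper written for this sketch, not part of PEFT.

```python
def lora_param_count(shapes, r):
    """Trainable weights added by rank-r LoRA adapters on the given linear layers.

    Each adapter is two small matrices, A (r x d_in) and B (d_out x r),
    so one layer adds r * (d_in + d_out) parameters.
    """
    return sum(r * (d_in + d_out) for d_in, d_out in shapes)


# ASSUMED shapes for one attention block (verify against the model config).
HIDDEN, KV_WIDTH, NUM_LAYERS = 896, 128, 24
attention_shapes = [
    (HIDDEN, HIDDEN),    # q_proj
    (HIDDEN, KV_WIDTH),  # k_proj
    (HIDDEN, KV_WIDTH),  # v_proj
    (HIDDEN, HIDDEN),    # o_proj
]

per_layer = lora_param_count(attention_shapes, r=8)
total = per_layer * NUM_LAYERS
print(f"LoRA weights per layer: {per_layer:,}")
print(f"LoRA weights in total:  {total:,}")
```

Under those assumptions the adapters come to roughly 1.1 million weights, a fraction of a percent of the frozen base model.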
Here’s the code to get everything installed and configured. We're also setting up a few helper functions to keep our code clean and manage memory, which is crucial when you're working with a tight VRAM budget.
import subprocess, sys

# Install (or upgrade) the libraries used in this walkthrough.
subprocess.check_call([
    sys.executable, "-m", "pip", "install", "-q", "-U",
    "torchao>=0.16", "trl>=0.20", "transformers>=4.45", "datasets",
    "peft>=0.13", "accelerate", "bitsandbytes",
])

# Drop any cached torchao/peft modules so the freshly installed versions get imported.
import sys as _sys
for _m in [m for m in list(_sys.modules) if m.startswith(("torchao", "peft"))]:
    _sys.modules.pop(_m, None)

# If torchao still fails to import, register a stub module with a version string
# so later imports that look for it do not crash.
try:
    import torchao
except Exception:
    import types
    _fake = types.ModuleType("torchao")
    _fake.__version__ = "0.16.1"
    _sys.modules["torchao"] = _fake

import os, re, gc, torch, warnings
warnings.filterwarnings("ignore")
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["WANDB_DISABLED"] = "true"
os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"

from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig

print(f"torch={torch.__version__} cuda={torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)} "
          f"({torch.cuda.get_device_properties(0).total_memory/1e9:.1f} GB)")

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
BF16_OK = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

# Shared LoRA settings: rank-8 adapters on the attention projections only.
LORA_CFG = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

def cleanup():
    """Release VRAM between training stages (Colab T4 is tight)."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

def chat_generate(model, tokenizer, prompt, max_new_tokens=120):
    """Helper: format as chat, generate, decode just the assistant turn."""
    msgs = [{"role": "user", "content": prompt}]
    # return_dict=True also returns the attention mask, which generate() expects.
    inputs = tokenizer.apply_chat_template(
        msgs, return_tensors="pt", add_generation_prompt=True, return_dict=True
    ).to(model.device)
    model.eval()  # trainers leave the model in train mode; switch off dropout
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Slice off the prompt tokens so only the newly generated reply is decoded.
    prompt_len = inputs["input_ids"].shape[-1]
    return tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True)

With our environment set up, we're ready for the first stage of our model's education.
Step 1: Teaching the Basics with Supervised Fine-Tuning (SFT)
Our journey begins with Supervised Fine-Tuning, or SFT. This is the most straightforward step.
Imagine you're teaching a child to answer questions politely. You'd give them a bunch of examples: a question, followed by a perfect answer. SFT is exactly that. We take a dataset of high-quality prompts and responses and simply show them to the model, over and over again. The model's job is to learn to imitate that style and format.
We're not teaching it new knowledge here; we're teaching it how to behave. We want it to understand the "chat" format and produce helpful, well-structured answers.
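Under the hood, "imitate the example" means minimizing next-token cross-entropy on the reference answer. The sketch below uses made-up probabilities instead of a real model, and `sft_loss` is a hypothetical helper, not TRL's implementation. It shows the loss dropping as the model puts more probability on the tokens of the reference answer. The mask picks which tokens count toward the loss; whether prompt tokens are masked out is a configuration choice we are assuming here.

```python
import math


def sft_loss(token_probs, loss_mask):
    """Mean negative log-likelihood over the tokens selected by loss_mask.

    token_probs[i] is the probability the model gave to the i-th reference token.
    loss_mask[i] is 1 for tokens we train on (e.g. the answer), 0 otherwise.
    """
    picked = [-math.log(p) for p, m in zip(token_probs, loss_mask) if m]
    return sum(picked) / len(picked)


# Toy sequence: 2 prompt tokens (masked out) followed by 3 answer tokens.
mask = [0, 0, 1, 1, 1]
before = sft_loss([0.9, 0.9, 0.10, 0.20, 0.10], mask)  # unsure about the answer
after = sft_loss([0.9, 0.9, 0.80, 0.90, 0.70], mask)   # imitates the answer well

print(f"loss before: {before:.3f}")
print(f"loss after:  {after:.3f}")
```

Training pushes the model from the "before" situation toward the "after" one, one batch of examples at a time.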
Let's do it. We'll grab a small conversational dataset and use TRL's SFTTrainer to train our model.
print("\n" + "="*72 + "\nPART 1: Supervised Fine-Tuning (SFT)\n" + "="*72)
from trl import SFTTrainer, SFTConfig

sft_ds = load_dataset("trl-lib/Capybara", split="train[:300]")
print(f"SFT dataset rows: {len(sft_ds)}")
print(f"Example messages: {sft_ds[0]['messages'][:1]}")

sft_args = SFTConfig(
    output_dir="./sft_out",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    logging_steps=10,
    save_strategy="no",
    bf16=BF16_OK,
    fp16=not BF16_OK,
    max_length=768,
    gradient_checkpointing=True,
    report_to="none",
)

sft_trainer = SFTTrainer(
    model=MODEL_NAME,
    args=sft_args,
    train_dataset=sft_ds,
    peft_config=LORA_CFG,
)
sft_trainer.train()

print("\n[SFT inference]")
print("Q: Explain the bias-variance tradeoff in two sentences.")
print("A:", chat_generate(sft_trainer.model, sft_trainer.processing_class, "Explain the bias-variance tradeoff in two sentences."))

sft_trainer.save_model("./sft_out/final")
del sft_trainer; cleanup()

After a short training run on 300 examples, don't expect a dramatic change, especially since this model was already instruction-tuned. The point is to see the SFT mechanics work end to end: the model learns to follow the chat format and answer in the style of the examples. It's a great start, but just imitating good answers isn't enough. We need to teach it to prefer good answers over bad ones.
Step 2: Building a "Taste Tester" with Reward Modeling
This is where things get more interesting. We now want to give our model a sense of "good" versus "bad." To do that, we first need to build a judge—another model whose sole job is to score responses. This is called a Reward Model (RM).
We use a dataset that contains a prompt and two possible answers: one that was "chosen" by a human as better, and one that was "rejected." We then train a model to predict which answer is the chosen one. Essentially, we're training it to have the same preferences as a human.
This reward model doesn't generate text. It just reads a prompt and a response and spits out a single number: a score. A higher score means a better answer.
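The training signal behind "score the chosen answer higher" can be written in a few lines. This is the standard Bradley-Terry pairwise loss; we are assuming, rather than confirming from TRL's source, that it matches what the trainer optimizes by default. `pairwise_reward_loss` is a hypothetical helper and the scores are made up.

```python
import math


def pairwise_reward_loss(score_chosen, score_rejected):
    """Bradley-Terry style loss: -log(sigmoid(score_chosen - score_rejected))."""
    margin = score_chosen - score_rejected
    # log1p(exp(-margin)) equals -log(sigmoid(margin)); fine for moderate margins.
    return math.log1p(math.exp(-margin))


# Judge agrees with the human (chosen scored well above rejected): small loss.
print(f"{pairwise_reward_loss(2.0, -1.0):.3f}")
# Judge cannot tell the two answers apart: loss is log(2).
print(f"{pairwise_reward_loss(0.0, 0.0):.3f}")
# Judge prefers the rejected answer: large loss.
print(f"{pairwise_reward_loss(-1.0, 2.0):.3f}")
```

Only the gap between the two scores matters, which is why a reward model's raw numbers are meaningful for ranking answers, not as absolute grades.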
print("\n" + "="*72 + "\nPART 2: Reward Modeling\n" + "="*72)
from trl import RewardTrainer, RewardConfig

rm_ds = load_dataset("trl-lib/ultrafeedback_binarized", split="train[:300]")
print(f"RM dataset rows: {len(rm_ds)} keys: {list(rm_ds[0].keys())}")

rm_args = RewardConfig(
    output_dir="./rm_out",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    learning_rate=1e-4,
    logging_steps=10,
    save_strategy="no",
    bf16=BF16_OK,
    fp16=not BF16_OK,
    max_length=512,
    gradient_checkpointing=True,
    report_to="none",
)

rm_lora = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="SEQ_CLS",
)

rm_trainer = RewardTrainer(
    model=MODEL_NAME,
    args=rm_args,
    train_dataset=rm_ds,
    peft_config=rm_lora,
)
rm_trainer.train()
del rm_trainer; cleanup()

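Once this training is done, we have a judge that can assign a score to any response. With only 300 training pairs it's a rough one, but it shows the mechanics. In classic RLHF, this judge drives the next stage of training. As we're about to see, DPO takes a shortcut that doesn't need it.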
Step 3: The Shortcut - Aligning Directly with DPO
Okay, so we have a way to judge responses. The traditional next step (called RLHF) would be to have our main model generate answers, get them scored by the reward model, and then use that feedback to update itself. It works, but it's complex and can be unstable.
Direct Preference Optimization (DPO) is a more elegant and direct approach. It’s a brilliant little piece of math that lets us skip training a separate reward model. Instead, DPO directly optimizes our main model on the preference data (the chosen vs. rejected pairs).
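Think of it like this: DPO treats the model itself as an implicit reward function and optimizes it directly on the preference pairs. In practice it tends to be simpler and more stable than a full RLHF loop, with less fuss. We're telling the model, "Hey, relative to where you started, increase the probability of generating answers like the 'chosen' one and decrease the probability of ones like the 'rejected' one." The "where you started" part is a frozen reference copy of the model, and the beta setting controls how strongly we're held close to it.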
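Here is the DPO objective for a single preference pair, written out as a minimal sketch. The inputs are sequence log-probabilities from the model being trained and from a frozen reference copy; the numbers are made up, and `dpo_loss` is a hypothetical helper rather than TRL's code.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probabilities."""
    # How much more (in log space) the policy likes each answer than the reference does.
    chosen_shift = logp_chosen - ref_logp_chosen
    rejected_shift = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_shift - rejected_shift)
    return math.log1p(math.exp(-logits))  # -log(sigmoid(logits))


# Policy identical to the reference: no preference learned yet, loss is log(2).
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy raised the chosen answer and lowered the rejected one: loss drops.
better = dpo_loss(-8.0, -14.0, -10.0, -12.0)
# Policy moved the wrong way: loss rises.
worse = dpo_loss(-12.0, -10.0, -10.0, -12.0)

print(f"start={start:.4f} better={better:.4f} worse={worse:.4f}")
```

Notice that no reward model appears anywhere: the log-probability shifts relative to the reference play the role of the reward.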
print("\n" + "="*72 + "\nPART 3: Direct Preference Optimization (DPO)\n" + "="*72)
from trl import DPOTrainer, DPOConfig

dpo_ds = load_dataset("trl-lib/ultrafeedback_binarized", split="train[:300]")

dpo_args = DPOConfig(
    output_dir="./dpo_out",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=5e-6,
    logging_steps=10,
    save_strategy="no",
    bf16=BF16_OK,
    fp16=not BF16_OK,
    max_length=512,
    max_prompt_length=256,
    beta=0.1,
    gradient_checkpointing=True,
    report_to="none",
)

dpo_trainer = DPOTrainer(
    model=MODEL_NAME,
    args=dpo_args,
    train_dataset=dpo_ds,
    peft_config=LORA_CFG,
)
dpo_trainer.train()
del dpo_trainer; cleanup()

After DPO, our model isn't just imitating; it has been nudged toward the kinds of responses people actually preferred. With 300 pairs and one epoch the shift is small, but the recipe is the same one used at scale. But what if "preference" isn't enough? What if we need the model to be factually, verifiably correct?
Step 4: Leveling Up to Real Reasoning with GRPO
This is where we push into the really cool stuff. So far, we've relied on human preference data, which is great for style, tone, and helpfulness. But it's not great for things that have a single right answer, like math or logic problems.
Enter Group Relative Policy Optimization (GRPO). This technique is designed for tasks where we can programmatically check if an answer is correct.
Here's the idea: for a given prompt (like a math problem), we have the model generate multiple different answers. Then, we use custom reward functions that we write ourselves to score each one. For a math problem, we can have one function that gives a +1 reward if the final number is correct and 0 otherwise. We could even add another function that gives a small bonus for being concise.
The model then learns to favor the kinds of outputs that score highest according to our custom, verifiable rules. We're no longer just teaching it style; we're teaching it to reason its way to a correct answer.
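The "group relative" part is easy to see in isolation. Each answer's reward is compared with the other answers sampled for the same prompt, so an answer is pushed up only if it beat its siblings. This is a minimal sketch: `group_advantages` is a hypothetical helper, and details such as the standard-deviation estimator and the epsilon value are assumptions that real implementations may handle differently.

```python
import statistics


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt."""
    mean = statistics.mean(rewards)
    std = statistics.stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one math prompt: two correct, two wrong.
mixed = group_advantages([1.0, 0.0, 1.0, 0.0])
# All four wrong: no answer stands out, so there is nothing to learn from.
all_wrong = group_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(all_wrong)
```

The second case is worth remembering: if every sample in a group gets the same reward, the advantages are all zero and that prompt contributes no learning signal. That's one reason to keep the training problems at a difficulty the model sometimes gets right.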
print("\n" + "="*72 + "\nPART 4: GRPO with verifiable math rewards\n" + "="*72)
from trl import GRPOTrainer, GRPOConfig
import operator, random

random.seed(0)
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def make_math_problem():
    a, b = random.randint(1, 50), random.randint(1, 50)
    op = random.choice(["+", "-", "*"])
    expr = f"{a} {op} {b}"
    return {
        "prompt": f"Solve this and end your reply with only the final number. {expr} =",
        # Compute the gold answer with a lookup table instead of eval().
        "answer": str(OPS[op](a, b)),
    }

grpo_ds = Dataset.from_list([make_math_problem() for _ in range(200)])
print(f"GRPO dataset rows: {len(grpo_ds)}")
print(f"Example: {grpo_ds[0]}")

def correctness_reward(completions, **kwargs):
    """+1 if the last number in the completion matches the gold answer."""
    answers = kwargs["answer"]
    rewards = []
    for c, gold in zip(completions, answers):
        nums = re.findall(r"-?\d+", c)
        rewards.append(1.0 if nums and nums[-1] == gold else 0.0)
    return rewards

def brevity_reward(completions, **kwargs):
    """Small bonus for short answers; discourages rambling."""
    return [max(0.0, 1.0 - len(c) / 200) * 0.2 for c in completions]

grpo_args = GRPOConfig(
    output_dir="./grpo_out",
    learning_rate=1e-5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    num_generations=4,
    max_prompt_length=128,
    max_completion_length=96,
    logging_steps=2,
    save_strategy="no",
    bf16=BF16_OK,
    fp16=not BF16_OK,
    gradient_checkpointing=True,
    max_steps=15,
    report_to="none",
)

grpo_trainer = GRPOTrainer(
    model=MODEL_NAME,
    args=grpo_args,
    train_dataset=grpo_ds,
    reward_funcs=[correctness_reward, brevity_reward],
    peft_config=LORA_CFG,
)
grpo_trainer.train()

print("\n[GRPO inference]")
for q in ["What is 17 + 28?", "What is 9 * 7?", "What is 100 - 47?"]:
    a = chat_generate(grpo_trainer.model, grpo_trainer.processing_class, q, 60)
    print(f"Q: {q}\nA: {a}\n")

del grpo_trainer; cleanup()
print("\n✓ Tutorial complete: you've trained 4 post-training algorithms!")

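Check out the inference results, and keep an eye on the reward values in the training logs. With only 15 steps you shouldn't expect a dramatic jump, but the logged rewards tell you whether the model is being pushed toward correct, concise answers by our custom reward logic. Scale up the steps and the problem set, and this becomes incredibly powerful.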
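Because the model optimizes exactly what the reward functions measure, it pays to probe them by hand on a few completions. The sketch below repeats the two reward functions from this step so it runs on its own; the sample completions are invented for illustration.

```python
import re


def correctness_reward(completions, **kwargs):
    """Same rule as above: compare the last integer found to the gold answer."""
    rewards = []
    for c, gold in zip(completions, kwargs["answer"]):
        nums = re.findall(r"-?\d+", c)
        rewards.append(1.0 if nums and nums[-1] == gold else 0.0)
    return rewards


def brevity_reward(completions, **kwargs):
    """Same rule as above: up to 0.2 bonus, shrinking to 0 at 200 characters."""
    return [max(0.0, 1.0 - len(c) / 200) * 0.2 for c in completions]


samples = [
    "45",                                   # correct and terse
    "17 + 28 = 45",                         # correct, shows its work
    "The answer is 45, found in 2 steps.",  # right value, but the last number is 2
    "I think it's 46",                      # wrong
]
gold = ["45"] * len(samples)

correct = correctness_reward(samples, answer=gold)
brief = brevity_reward(samples)
for text, c, b in zip(samples, correct, brief):
    print(f"correctness={c:.1f} brevity={b:.3f}  {text!r}")
```

The third sample is the interesting one: the answer is right, but the "last number wins" rule scores it as wrong. Quirks like this shape what the model learns, so it's worth finding them before training rather than after.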
So, What Did We Just Accomplish?
Think about the journey we just took. We started with a small, off-the-shelf instruct model and took it through every course in the finishing school curriculum.
- SFT taught it how to follow instructions and format its answers.
- Reward Modeling and DPO gave it a sense of taste, aligning it with human preferences for what makes a "good" answer.
- GRPO pushed it beyond preference into the realm of verifiable reasoning, teaching it to find objectively correct solutions.
The most amazing part? We did all of this using open-source tools on a single, modest GPU like a Colab T4. The techniques that were once the exclusive domain of massive AI labs are now in your hands. This is your foundation for experimenting, for designing your own reward functions, and for building AI systems that are not just powerful, but also helpful, aligned, and trustworthy.
Now it's your turn to play. What will you teach your model?




