How to Make Your LLM Behave: A Practical Guide to DPO and QLoRA

Akram Chauhan
Akram Chauhan
9 min read100 views
How to Make Your LLM Behave: A Practical Guide to DPO and QLoRA

Let’s be honest. Large language models can be incredible, but sometimes they’re just… off. They can be confidently wrong, a little unhinged, or just plain unhelpful. So, how do we nudge them in the right direction? How do we teach them to have the style, tone, and helpfulness we actually want?

For a while, the gold standard was a technique called Reinforcement Learning from Human Feedback (RLHF). It’s the method that powered a lot of the early magic in models like ChatGPT. But RLHF is a beast. It’s complicated and expensive, often involving training a whole separate "reward model" just to guide the main one. It’s like hiring a personal tutor for your AI, which is great if you have the budget of a tech giant, but not so practical for the rest of us.

This is where a newer, smarter technique comes in: Direct Preference Optimization (DPO).

And here’s the best part: DPO is surprisingly straightforward. Instead of a complex reward system, you just show the model two responses to a prompt—one that a human preferred and one they didn't—and tell it, "Hey, be more like this one, and less like that one."

It’s a more direct, stable, and efficient way to shape a model's behavior. In this guide, I’m going to walk you through exactly how to do this. We'll combine DPO with another brilliant trick called QLoRA to fine-tune a powerful language model on a single, everyday GPU. No supercomputer required.

So, What's the Game Plan?

Before we dive into the code, let's get a clear picture of what we're about to do. Think of it like teaching a talented but raw musician how to play for an audience.

The musician (our base LLM) already knows how to play the notes (it’s been pre-trained on tons of text). Our job isn't to teach it music from scratch. It's to refine its style.

Here’s our toolkit for the job:

  • The Technique (DPO): This is our teaching method. We’ll use a special dataset filled with prompts, each having a "chosen" (good) response and a "rejected" (not-so-good) response. DPO directly teaches the model to increase the probability of generating the good stuff and decrease the probability of the bad. Simple.
  • The Efficiency Hack (QLoRA): Training a full LLM is like trying to move a mountain. QLoRA allows us to perform precision surgery instead. We freeze most of the model and only train a few, tiny adapter layers. It’s incredibly memory-efficient and the key to making this work on a consumer GPU.
  • The Textbook (UltraFeedback Dataset): This is our sheet music. It's a massive collection of human preferences, already neatly packaged into the chosen vs. rejected format we need.

Our goal is to take a base model, apply our DPO training using the UltraFeedback data, and end up with a fine-tuned model that’s better aligned with what humans find helpful and appropriate.

Alright, let's get our hands dirty.

Step 1: Setting Up Your Workshop

First things first, we need to get our environment ready. This means installing all the necessary Python libraries. You can think of this as laying out all your tools on the workbench before you start a project.

import os
import random
import torch

# Installing the essentials from the Python package index
!pip -q install -U "transformers>=4.45.0" "datasets>=2.19.0" "accelerate>=0.33.0" "trl>=0.27.0" "peft>=0.12.0" "bitsandbytes>=0.43.0"

# Setting a seed for reproducibility. This is super important!
SEED = 42
random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

# Defining our key parameters
MODEL_NAME = "Qwen/Qwen2-0.5B-Instruct"
DATASET_NAME = "HuggingFaceH4/ultrafeedback_binarized"
OUTPUT_DIR = "dpo_qwen2_ultrafeedback"

# Check if we have a GPU to work with
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

We're pulling in libraries like transformers and datasets from Hugging Face, peft for our QLoRA magic, trl (Transformer Reinforcement Learning) which contains our DPO trainer, and bitsandbytes for the quantization.

Setting a SEED is crucial. It ensures that anyone who runs this code gets the exact same results. It’s the scientific equivalent of making sure your measurements are repeatable.

Step 2: Loading the Model (Without Breaking the Bank)

Now, we need to load our base model. We’re using Qwen/Qwen2-0.5B-Instruct, a solid open-source model. But even a "small" 0.5 billion parameter model can be too much for a standard GPU's memory.

This is where 4-bit quantization comes in. We’re going to load the model in a super-compressed format.

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Configuration for 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load the model with our quantization config
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
)

model.config.use_cache = False

Think of BitsAndBytesConfig as the settings for our compression tool. We're telling it to load the model's weights using only 4 bits per parameter instead of the usual 32 or 16. It’s like turning a massive WAV audio file into a compact MP3. You lose a tiny, almost imperceptible bit of fidelity, but the file size plummets, making it manageable.

We also disable use_cache, which is a setting for faster inference but can cause issues during training.

Step 3: Adding Our "Tuning Knobs" with QLoRA

Okay, our model is loaded and compressed. But right now, it's frozen. We can't train it. Instead of unfreezing the whole thing, we're going to use Parameter-Efficient Fine-Tuning (PEFT), specifically a method called LoRA (Low-Rank Adaptation).

This is the clever part. We're going to attach tiny, trainable modules (adapters) to key parts of the model.

from peft import LoraConfig, get_peft_model

# LoRA configuration
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Apply the LoRA adapters to our model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

This config tells PEFT to add adapters to the attention layers of the model (q_proj, k_proj, etc.). When you run print_trainable_parameters(), you'll see something amazing: we’re only going to train a tiny fraction of the total parameters, maybe just 1-2%!

It’s like adding a small equalizer to a massive stereo system. We're not rebuilding the amplifier; we're just tweaking the bass and treble knobs to get the sound just right. This is what makes the whole process so efficient.

Step 4: Preparing the Preference Data

We have our model ready for surgery. Now we need our instruction manual—the dataset. We'll use the ultrafeedback_binarized dataset, which is perfect for DPO because every example contains a prompt, a "chosen" response, and a "rejected" one.

The raw data is in a chat format, so we need to process it into a simple structure our trainer can understand.

from datasets import load_dataset

# Load the dataset
ds = load_dataset(DATASET_NAME, split="train_prefs")

def format_example(example):
    # The prompt is the conversation leading up to the AI's response
    prompt = tokenizer.apply_chat_template(
        example["prompt_messages"], tokenize=False, add_generation_prompt=True
    )
    # The chosen and rejected responses are the final AI messages
    chosen = example["chosen"][-1]["content"]
    rejected = example["rejected"][-1]["content"]
    
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

# Format the dataset and shuffle it
train_ds = ds.map(format_example).shuffle(seed=SEED)

# Let's look at an example
print(train_ds[0])

This format_example function is our little helper. It takes the conversational history, formats it into a single prompt string using the model's specific chat template, and then pulls out the final chosen and rejected answers.

The result is a clean dataset where each row has three simple columns: prompt, chosen, and rejected. Exactly what DPO needs.

Step 5: Kicking Off the DPO Training

This is the moment we've been building up to. We'll use the DPOTrainer from the TRL library, which handles all the complex logic of DPO behind the scenes. We just need to give it our model, our dataset, and some training configuration.

from trl import DPOTrainer, DPOConfig

training_args = DPOConfig(
    output_dir=OUTPUT_DIR,
    beta=0.1,  # A key DPO hyperparameter
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    logging_steps=10,
    save_steps=200,
    max_length=1024, # Max length of prompt + completion
    max_prompt_length=512, # Max length of prompt
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    tokenizer=tokenizer,
)

# And... train!
trainer.train()
trainer.save_model(OUTPUT_DIR)

The DPOConfig holds all our settings. Most are standard for model training, but beta is specific to DPO. It controls how much we stick to the original model's style versus how strongly we enforce the new preferences. A lower beta means a gentler nudge.

Then, we just initialize the DPOTrainer and call trainer.train(). The trainer will now loop through our dataset, showing the model each prompt and its good/bad responses, and gently adjusting the LoRA adapters to make the model more likely to produce outputs like the "chosen" ones.

Step 6: The Moment of Truth: Did It Work?

Training is done. We've saved our tiny LoRA adapters. But did we actually improve the model? There's only one way to find out: a side-by-side comparison.

Let's load our original base model and then load a second version with our newly trained DPO adapters merged in. Then we'll give them the same prompt and see how they respond.

from peft import PeftModel
from transformers import pipeline

# Load the base model again for comparison
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
)

# Load our DPO-tuned model
dpo_model = PeftModel.from_pretrained(base_model, OUTPUT_DIR)

# Let's grab a prompt from our dataset
sample = train_ds[10]
prompt = sample['prompt']

print("--- PROMPT ---")
print(prompt)

# Generate with the base model
base_pipe = pipeline("text-generation", model=base_model, tokenizer=tokenizer)
base_output = base_pipe(prompt, max_new_tokens=256, do_sample=True, top_p=0.9)
print("\n--- BASE MODEL RESPONSE ---")
print(base_output[0]['generated_text'][len(prompt):])

# Generate with the DPO model
dpo_pipe = pipeline("text-generation", model=dpo_model, tokenizer=tokenizer)
dpo_output = dpo_pipe(prompt, max_new_tokens=256, do_sample=True, top_p=0.9)
print("\n--- DPO MODEL RESPONSE ---")
print(dpo_output[0]['generated_text'][len(prompt):])

When you run this, the difference is often night and day. You might see the base model give a generic or slightly off-topic answer, while the DPO-tuned model provides a more helpful, detailed, and well-formatted response. It's learned the style of the preferred answers in the UltraFeedback dataset.

This is the real payoff. We didn't just teach the model new facts; we taught it how to behave. And we did it all in a way that’s accessible to almost anyone with a decent GPU. This is a huge shift, putting the power to shape and improve AI into the hands of more developers, researchers, and creators than ever before. So go ahead, give it a try. You might be surprised at how much better you can make your model with just a little bit of preference tuning.

Tags

AI Machine Learning Deep Learning ChatGPT Generative AI NLP AI Safety AI Engineering Transformers AI Research Model Optimization Large Language Models (LLMs) Direct Preference Optimization (DPO) QLoRA Ultra-Feedback LLM Alignment Human Preferences AI Fine-tuning Reinforcement Learning from Human Feedback (RLHF) Reward Model

Stay Updated

Get the latest articles and insights delivered straight to your inbox.

We respect your privacy. Unsubscribe at any time.

Aicosoft

AI & Technology News, Insights & Innovation

AICOSOFT delivers cutting-edge AI news, technology breakthroughs, and innovation insights. Stay informed about artificial intelligence, machine learning, robotics, and the latest tech trends shaping tomorrow.

Connect With Us

© 2026 Aicosoft. All rights reserved.