Chapter 15: Finetuning II — RLHF, PPO, and DPO

Supervised fine-tuning teaches a model to imitate examples, but imitation has limits: it can’t directly encode what humans prefer. Reinforcement Learning from Human Feedback (RLHF) closes this gap by training a reward model on human preference data and then using PPO to optimise the language model against this reward.

The RLHF pipeline has three stages: (1) SFT: fine-tune a base model on high-quality demonstrations (Chapter 14); (2) Reward modelling: train a scalar reward model on pairs of model outputs ranked by humans; (3) PPO: use proximal policy optimisation to maximise the reward while a KL-divergence penalty prevents the policy from drifting too far from the SFT model.

Direct Preference Optimisation (DPO) (Rafailov et al., 2023) elegantly sidesteps the reward model entirely. Given preference pairs (chosen response, rejected response), DPO derives a closed-form loss that directly trains the language model to be more like the chosen response and less like the rejected one. It is simpler, more stable, and often matches PPO quality.

1. Prepare Preference Data

from datasets import load_dataset
import torch
import torch.nn as nn
import os

DATA_DIR = "data"
os.makedirs(DATA_DIR, exist_ok=True)

# Anthropic/hh-rlhf: human preference data with chosen/rejected responses
print("Loading preference dataset …")
try:
    hh_dataset = load_dataset("Anthropic/hh-rlhf", split="train[:2000]")
    print(f"Loaded {len(hh_dataset)} preference pairs")
    print(f"Columns: {hh_dataset.column_names}")
    print(f"\nSample chosen  : {hh_dataset[0]['chosen'][:200]}")
    print(f"Sample rejected: {hh_dataset[0]['rejected'][:200]}")
except Exception as e:
    print(f"Could not load hh-rlhf: {e}")
    # Create synthetic preference data for demonstration
    hh_dataset = [
        {"chosen":   "The sky is blue because of Rayleigh scattering of sunlight.",
         "rejected": "The sky is blue because God painted it."},
        {"chosen":   "Water boils at 100°C at standard atmospheric pressure.",
         "rejected": "Water boils when it feels like it."},
    ]
    print("Using synthetic preference data for demonstration.")

2. Reward Model

from transformers import AutoModel, AutoTokenizer

class RewardModel(nn.Module):
    """
    Reward model built on top of a pretrained encoder.
    Outputs a scalar reward score for a given (prompt, response) string.
    """

    def __init__(self, base_model_name: str = "gpt2"):
        super().__init__()
        from transformers import AutoModelForSequenceClassification
        # Use a sequence classification head that outputs a single scalar
        self.model = AutoModelForSequenceClassification.from_pretrained(
            base_model_name,
            num_labels=1,
        )

    def forward(self, input_ids: torch.Tensor,
                attention_mask: torch.Tensor = None) -> torch.Tensor:
        """Returns reward scalars of shape (B,)."""
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        return outputs.logits.squeeze(-1)


# Bradley-Terry reward model training objective:
# maximize log σ(r_chosen - r_rejected)
def reward_model_loss(
    reward_chosen:   torch.Tensor,   # (B,) reward scores for chosen responses
    reward_rejected: torch.Tensor,   # (B,) reward scores for rejected responses
) -> torch.Tensor:
    """
    Bradley-Terry pairwise ranking loss.
    We want reward_chosen > reward_rejected.
    Loss = -mean(log sigmoid(r_w - r_l))
    """
    logits = reward_chosen - reward_rejected
    loss   = -nn.functional.logsigmoid(logits).mean()
    return loss


# Demo: random rewards
torch.manual_seed(42)
B = 8
r_chosen   = torch.randn(B) + 1.0   # biased positive
r_rejected = torch.randn(B) - 1.0   # biased negative
loss_rm    = reward_model_loss(r_chosen, r_rejected)
acc_rm     = (r_chosen > r_rejected).float().mean()
print(f"Reward model loss: {loss_rm.item():.4f}")
print(f"Ranking accuracy : {acc_rm.item():.2%}")

3. PPO Training (Outline)

The PPO objective maximises:

\[\mathcal{L}_\text{PPO} = \mathbb{E}\!\left[\min\!\left(r_t A_t,\; \mathrm{clip}(r_t, 1{-}\varepsilon, 1{+}\varepsilon)\, A_t\right)\right] - \beta\, \mathrm{KL}(\pi_\theta \| \pi_\text{ref})\]

where $r_t = \pi_\theta(a_t

s_t) / \pi_\text{ref}(a_t

s_t)$ is the probability ratio, $A_t = \text{reward_model}(\text{response}) - \text{baseline}$ is the advantage, $\beta$ is the KL penalty coefficient (typically 0.01–0.1), and $\pi_\text{ref}$ is the frozen SFT model that prevents reward hacking.

Training loop (one iteration):

ROLLOUT — sample responses from current policy $\pi_\theta$
SCORE — get reward $r = \text{reward_model}(\text{prompt}, \text{response})$
KL PENALTY — compute $\mathrm{KL}(\pi_\theta | \pi_\text{ref})$ per token
ADVANTAGE — compute $A = r - V(s)$ where $V$ is a value head
PPO UPDATE — update $\pi_\theta$ using clipped surrogate + value loss

Libraries implementing full PPO for LLMs: TRL (trl.PPOTrainer), DeepSpeed-Chat (deepspeed.runtime.rlhf), OpenRLHF.

4. DPO Loss Implementation

def dpo_loss(
    policy_log_probs_chosen:   torch.Tensor,   # (B,) log π_θ(y_w|x)
    policy_log_probs_rejected: torch.Tensor,   # (B,) log π_θ(y_l|x)
    ref_log_probs_chosen:      torch.Tensor,   # (B,) log π_ref(y_w|x)
    ref_log_probs_rejected:    torch.Tensor,   # (B,) log π_ref(y_l|x)
    beta: float = 0.1,
) -> torch.Tensor:
    """
    Direct Preference Optimisation loss (Rafailov et al., 2023).

    DPO eliminates the reward model by showing that the optimal policy
    under the RLHF objective satisfies:

      r*(x,y) = β * log[π*(y|x) / π_ref(y|x)] + β * log Z(x)

    Substituting into the Bradley-Terry preference model and simplifying:

      L_DPO = -E[ log σ(β * (log π(y_w|x) - log π_ref(y_w|x))
                         - β * (log π(y_l|x) - log π_ref(y_l|x))) ]
    """
    log_ratio_chosen   = policy_log_probs_chosen   - ref_log_probs_chosen
    log_ratio_rejected = policy_log_probs_rejected - ref_log_probs_rejected

    # The implicit reward difference
    reward_diff = beta * (log_ratio_chosen - log_ratio_rejected)

    # DPO loss: negative log sigmoid of reward difference
    loss = -nn.functional.logsigmoid(reward_diff).mean()
    return loss


# Demo: policy that correctly prefers chosen over rejected
torch.manual_seed(42)
B = 16
# Simulate log-probs: chosen has higher probability under policy than ref
policy_chosen   = torch.randn(B) - 0.5    # policy mildly prefers chosen
policy_rejected = torch.randn(B) - 1.0
ref_chosen      = torch.randn(B) - 1.0    # reference assigns roughly equal probs
ref_rejected    = torch.randn(B) - 1.0

loss_dpo = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
print(f"DPO loss: {loss_dpo.item():.4f}")

5. DPO Training with TRL

To run DPO with the TRL library, install trl and peft, then:

from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import get_peft_model, LoraConfig
from datasets import load_dataset

model     = AutoModelForCausalLM.from_pretrained("gpt2")
ref_model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Apply LoRA to keep training cheap
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"])
model = get_peft_model(model, lora_config)

# Dataset must have columns: prompt, chosen, rejected
dataset = load_dataset("Anthropic/hh-rlhf", split="train[:5000]")

dpo_config = DPOConfig(
    beta=0.1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    num_train_epochs=1,
    output_dir="data/dpo_output",
)

trainer = DPOTrainer(
    model=model, ref_model=ref_model,
    args=dpo_config, tokenizer=tokenizer,
    train_dataset=dataset,
)
trainer.train()

6. Reward Hacking and KL Penalty

import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np

# Demonstrate the trade-off between maximising reward and staying close to SFT
beta_values = np.linspace(0.001, 1.0, 100)

# Simulated: as beta → 0, policy drifts more but gets higher reward
# as beta → ∞, policy stays close to reference but reward is unconstrained
simulated_reward = 5.0 / (1.0 + beta_values * 10)   # decreasing reward
simulated_kl     = 1.0 / (beta_values + 0.1) - 0.9   # increasing KL

fig, ax1 = plt.subplots(figsize=(8, 4))
ax2 = ax1.twinx()
ax1.plot(beta_values, simulated_reward, color="steelblue", label="Reward")
ax2.plot(beta_values, simulated_kl.clip(0), color="coral", label="KL divergence")
ax1.set_xlabel("Beta (KL penalty coefficient)")
ax1.set_ylabel("Reward (higher is better)", color="steelblue")
ax2.set_ylabel("KL(π || π_ref) (lower is better)", color="coral")
ax1.set_title("Trade-off: Reward vs KL divergence under RLHF")
plt.tight_layout()
plt.savefig("data/ch15_rlhf_tradeoff.png", dpi=100)
print("Saved → data/ch15_rlhf_tradeoff.png")

7. Summary

Method	Requires reward model	Complexity	Quality
SFT only	No	Low	Good
RLHF + PPO	Yes (separate training)	High	Best
DPO	No (implicit)	Medium	Near-best
RLAIF	Yes (AI judge)	Medium	Good

Key RLHF hyperparameters:

beta (KL coefficient): 0.01–0.1 — controls how far from SFT policy can drift
Reward normalisation: subtract mean, divide by std to stabilise training
Clip ratio ε: 0.1–0.2 for PPO — prevents too-large policy updates

Chapter 16 covers deployment: building a production API server with streaming generation and a simple web frontend.