RLHF and Alignment
Reinforcement Learning from Human Feedback (RLHF) aligns language models with human preferences using reward modeling and policy optimization. Direct Preference Optimization (DPO) offers a simpler, more stable alternative that directly optimizes for preferred outcomes without a separate reward model.
Core Concepts
Three-Stage RLHF Pipeline
Stage 1: Supervised Fine-tuning (SFT)
Pre-trained LM → Fine-tune on high-quality demonstrations → SFT Model
Stage 2: Reward Model Training
SFT Model → Generate pairs of outputs → Human preference labels → Reward Model
Stage 3: PPO Policy Optimization
SFT Model + Reward Model → PPO training → Aligned Model
Reward Modeling
Train a neural network to predict human preferences:
Input: "Summarize: ..." + "Output 1: ..." + "Output 2: ..."
Task: Predict which output humans prefer
Output: Scalar reward score
The reward model learns to score quality, helpfulness, safety, and alignment.
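A minimal sketch of reward-model training with a pairwise (Bradley-Terry) loss; the base checkpoint and the 'chosen'/'rejected' field names are illustrative assumptions, not tied to a specific library:
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical setup: a sequence classifier with a single scalar output head
# acts as the reward model; 'chosen'/'rejected' are assumed batch field names.
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
reward_model = AutoModelForSequenceClassification.from_pretrained(
    'distilbert-base-uncased', num_labels=1
)

def pairwise_reward_loss(batch):
    # Score both completions of each preference pair
    chosen = tokenizer(batch['chosen'], padding=True, truncation=True, return_tensors='pt')
    rejected = tokenizer(batch['rejected'], padding=True, truncation=True, return_tensors='pt')
    chosen_rewards = reward_model(**chosen).logits.squeeze(-1)
    rejected_rewards = reward_model(**rejected).logits.squeeze(-1)
    # Bradley-Terry objective: the preferred completion should score higher
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
At inference time the same model assigns a scalar score to any prompt-response pair, which the PPO stage then maximizes.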
PPO: Proximal Policy Optimization
Optimize the language model against the reward signal while keeping the policy close to the SFT model via a KL penalty:
Loss = -E[log π(action|state) * advantage] + β * KL(π || π_sft)
Key challenge: prevent reward hacking (the policy exploiting flaws in the reward model) and excessive divergence from the base model.
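A minimal sketch of how the KL penalty is typically folded into the per-token reward (roughly what libraries such as TRL do internally; the function and tensor names here are illustrative):
import torch

def kl_penalized_rewards(policy_logprobs, ref_logprobs, sequence_reward, beta=0.1):
    # policy_logprobs, ref_logprobs: (seq_len,) log-probs of the sampled tokens
    # under the current policy and the frozen SFT reference (assumed detached).
    # sequence_reward: scalar reward-model score for the full response.
    # Approximate per-token KL(π || π_sft) on the sampled tokens
    kl = policy_logprobs - ref_logprobs
    rewards = -beta * kl                  # penalize drifting from the SFT model
    rewards[-1] += sequence_reward        # reward-model score applied at the final token
    return rewards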
Direct Preference Optimization (DPO)
A simpler alternative that optimizes the policy directly on preference pairs against a frozen reference model (typically the SFT model), with no explicit reward model; a minimal code sketch of the loss follows the advantages below:
Loss = -log σ(β * [log(π(y_preferred|x) / π_ref(y_preferred|x)) - log(π(y_rejected|x) / π_ref(y_rejected|x))])
Advantages:
- No separate reward model needed
- More stable training
- Fewer hyperparameters
- Better sample efficiency
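A minimal sketch of the loss from precomputed sequence log-probabilities (the four log-probability tensors, one value per preference pair, are assumed to be computed elsewhere):
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratios of policy vs. frozen reference for each response
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # The preferred response should have the larger implicit-reward log-ratio
    return -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()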
Practical Implementation
RLHF with TRL Library
# Note: the TRL PPOTrainer API changes between versions; this sketch follows
# the classic interface (trl < 0.12), where the policy carries a value head.
import torch
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead, create_reference_model
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# SFT model wrapped with a value head for PPO
sft_model = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
# Frozen copy of the SFT model used as the KL reference
ref_model = create_reference_model(sft_model)
tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token

# Reward model (pre-trained on preference data) and its own tokenizer
reward_model = AutoModelForSequenceClassification.from_pretrained(
    'distilbert-base-uncased', num_labels=1
)
reward_tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

# PPO configuration
ppo_config = PPOConfig(
    model_name='gpt2',
    learning_rate=1.41e-5,
    batch_size=128,
    mini_batch_size=4,
    gradient_accumulation_steps=4,
    ppo_epochs=4,
    target_kl=0.1,
)

# PPO trainer: policy, frozen reference, tokenizer, and a prompt dataset
# with 'query' (text) and 'input_ids' (tokenized prompt) columns
ppo_trainer = PPOTrainer(ppo_config, sft_model, ref_model, tokenizer, dataset=dataset)

# Training loop
for epoch in range(10):
    for batch in ppo_trainer.dataloader:
        query_tensors = batch['input_ids']
        # Generate responses with the current policy
        response_tensors = ppo_trainer.generate(query_tensors, return_prompt=False, max_new_tokens=64)
        batch['response'] = tokenizer.batch_decode(response_tensors)
        # Score each (query, response) pair with the reward model
        texts = [q + r for q, r in zip(batch['query'], batch['response'])]
        with torch.no_grad():
            encodings = reward_tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
            rewards = [score for score in reward_model(**encodings).logits.squeeze(-1)]
        # PPO update (expects lists of query/response tensors and scalar rewards)
        stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
        ppo_trainer.log_stats(stats, batch, rewards)
Direct Preference Optimization
import torch
import torch.nn.functional as F
from transformers import Trainer, TrainingArguments

# Minimal DPO trainer built on the HF Trainer (TRL ships a production-ready
# DPOTrainer; this version is just to make the loss explicit).
class DPOTrainer(Trainer):
    def __init__(self, *args, ref_model=None, beta=0.1, **kwargs):
        super().__init__(*args, **kwargs)
        # Frozen SFT reference model (assumed to be on the same device as the policy)
        self.ref_model = ref_model.eval()
        self.beta = beta  # values around 0.1-0.5 are typical

    @staticmethod
    def _sequence_log_prob(logits, labels):
        # Shift so logits at position t predict the token at t+1, then sum
        # per-token log-probabilities (padding should be masked in real data)
        logits, labels = logits[:, :-1, :], labels[:, 1:]
        token_log_probs = -F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            labels.reshape(-1),
            reduction='none',
        ).view_as(labels)
        return token_log_probs.sum(dim=1)

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # Policy log-probabilities for preferred and rejected responses
        preferred_logps = self._sequence_log_prob(
            model(inputs['preferred_input_ids']).logits, inputs['preferred_input_ids'])
        rejected_logps = self._sequence_log_prob(
            model(inputs['rejected_input_ids']).logits, inputs['rejected_input_ids'])
        # Reference-model log-probabilities (no gradients)
        with torch.no_grad():
            ref_preferred_logps = self._sequence_log_prob(
                self.ref_model(inputs['preferred_input_ids']).logits, inputs['preferred_input_ids'])
            ref_rejected_logps = self._sequence_log_prob(
                self.ref_model(inputs['rejected_input_ids']).logits, inputs['rejected_input_ids'])
        # DPO loss: widen the margin between preferred and rejected log-ratios
        margin = (preferred_logps - ref_preferred_logps) - (rejected_logps - ref_rejected_logps)
        loss = -F.logsigmoid(self.beta * margin).mean()
        return (loss, None) if return_outputs else loss

# Training
training_args = TrainingArguments(
    output_dir='./dpo_model',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=2e-4,
)
trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,  # frozen copy of the SFT model
    args=training_args,
    train_dataset=preference_dataset,
)
trainer.train()
Advanced Techniques
Multi-Objective Alignment
# Combine multiple reward signals with fixed weights; the score_* functions
# stand for separate scoring models (e.g., classifiers or reward models)
def compute_total_reward(outputs):
helpfulness = score_helpfulness(outputs)
harmlessness = score_harmlessness(outputs)
honesty = score_honesty(outputs)
total = (
0.5 * helpfulness +
0.3 * harmlessness +
0.2 * honesty
)
return total
Iterated DPO
# Online / iterated DPO: collect fresh preference labels from the current policy
for iteration in range(5):
    # Generate candidate responses with the current model
    generations = model.generate(prompts)
    # Collect human preference labels (collect_human_feedback is a placeholder
    # for your labeling pipeline)
    preferences = collect_human_feedback(generations)
    # Run another round of DPO on the newly collected pairs
    dpo_trainer.train_dataset = preferences
    dpo_trainer.train()
Production Considerations
Alignment Evaluation
def evaluate_alignment(model, test_set):
helpful_score = 0
harmless_score = 0
honest_score = 0
for prompt in test_set:
output = model.generate(prompt)
helpful_score += score_helpfulness(output)
harmless_score += score_harmlessness(output)
honest_score += score_honesty(output)
return {
'helpful': helpful_score / len(test_set),
'harmless': harmless_score / len(test_set),
'honest': honest_score / len(test_set),
}
Key Takeaway
RLHF and DPO align models with human preferences, enabling safe, helpful systems. DPO offers simpler, more stable training than traditional RLHF, making alignment accessible to practitioners.
Practical Exercise
Task: Implement DPO training on an instruction-following task (a starter sketch follows the evaluation criteria below).
Requirements:
- Collect pairwise preference data
- Implement DPO loss and trainer
- Train on 7B model with LoRA
- Evaluate on multiple alignment dimensions
- Compare with SFT baseline
Evaluation:
- Human preference agreement > 70%
- Reduced harmful outputs
- Improved instruction following
- Training stability vs RLHF
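A possible starting point for the exercise, assuming TRL's built-in DPOTrainer together with peft for LoRA; the model and dataset names are placeholders, and exact argument names vary across TRL versions:
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer, DPOConfig

model_name = 'your-7b-sft-checkpoint'          # placeholder for your SFT model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Pairwise preference data with 'prompt', 'chosen', 'rejected' columns (placeholder name)
dataset = load_dataset('your_preference_dataset', split='train')

# LoRA keeps the 7B model trainable on a single GPU
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type='CAUSAL_LM')

training_args = DPOConfig(output_dir='./dpo_exercise', beta=0.1,
                          per_device_train_batch_size=2, learning_rate=5e-5)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,      # 'tokenizer=' in older TRL versions
    peft_config=peft_config,         # with LoRA, no separate ref_model is needed
)
trainer.train()
Compare the resulting adapter against the SFT baseline on the evaluation criteria above before drawing conclusions about stability.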