Reinforcement Learning from Human Feedback (RLHF): A Practical Guide with PyTorch Examples
Reinforcement Learning from Human Feedback (RLHF) is a cutting-edge technique that has revolutionized the way we train machine learning models, especially in natural language processing (NLP). By incorporating human feedback into the training process, RLHF enables models to generate outputs that are more aligned with human preferences and values. In this article, we’ll break down RLHF into simple terms and provide practical PyTorch code examples to help you understand how it works.
What is RLHF?
RLHF is a method used to fine-tune machine learning models, particularly language models, by leveraging human feedback. It involves three key steps:
- Supervised Fine-Tuning (SFT): Fine-tune a pre-trained model on high-quality, human-labeled data.
- Reward Model Training: Train a reward model to predict human preferences based on ranked outputs.
- Reinforcement Learning (RL): Use the reward model to guide the fine-tuning process, optimizing the model to produce outputs that humans prefer.
RLHF has been successfully used in models like OpenAI’s ChatGPT to make them more helpful, engaging, and aligned with user expectations.
RLHF in Practice: A Step-by-Step Guide
Let’s dive into the technical details and implement RLHF using PyTorch. We’ll use a simplified example to demonstrate the core concepts.
Step 1: Supervised Fine-Tuning (SFT)
First, we fine-tune a pre-trained language model on a dataset of human-labeled examples. For simplicity, let’s assume we have a small dataset of prompts and responses.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pre-trained GPT-2 model and tokenizer
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no padding token by default

# Example dataset (prompts and responses)
prompts = ["What is RLHF?", "Explain reinforcement learning."]
responses = ["RLHF is a technique for aligning models with human preferences.",
             "Reinforcement learning is a type of machine learning where an agent learns by interacting with an environment."]

# For causal language modeling, fine-tune on the concatenated prompt + response text
texts = [p + " " + r for p, r in zip(prompts, responses)]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
labels = inputs.input_ids.clone()
labels[inputs.attention_mask == 0] = -100  # ignore padding positions in the loss

# Fine-tune the model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for epoch in range(3):  # Fine-tune for 3 epochs
    outputs = model(**inputs, labels=labels)
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"Epoch {epoch+1}, Loss: {loss.item()}")
Step 2: Reward Model Training
Next, we train a reward model to predict human preferences. We’ll use a simple neural network that takes embeddings of the model’s outputs as input and predicts a scalar reward score. Note that this setup is purely for demonstration: in practice, the dataset is built from human preference comparisons (e.g., ranked pairs of responses to the same prompt), and the reward model is trained with a Bradley-Terry-style pairwise objective; a sketch of that loss follows the training loop below.
import torch.nn as nn
class RewardModel(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(RewardModel, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return self.fc2(x)
# Example: embed each tokenized sequence by average-pooling the final hidden states.
# Use the base transformer (model.transformer), since GPT2LMHeadModel itself returns logits.
with torch.no_grad():
    output_embeddings = model.transformer(**inputs).last_hidden_state.mean(dim=1)
# Initialize reward model
reward_model = RewardModel(input_size=output_embeddings.size(1), hidden_size=64)
# Example human feedback (1 = preferred, 0 = not preferred)
preferred_outputs = output_embeddings[0].unsqueeze(0) # Preferred output
non_preferred_outputs = output_embeddings[1].unsqueeze(0) # Non-preferred output
# Train the reward model
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)
criterion = nn.MSELoss()
for epoch in range(10):  # Train for 10 epochs
    preferred_reward = reward_model(preferred_outputs)
    non_preferred_reward = reward_model(non_preferred_outputs)
    # Regress the preferred output toward a reward of 1 and the non-preferred output toward 0
    loss = criterion(preferred_reward, torch.tensor([[1.0]])) + criterion(non_preferred_reward, torch.tensor([[0.0]]))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"Epoch {epoch+1}, Loss: {loss.item()}")
Step 3: Reinforcement Learning (RL)
Finally, we use the reward model to fine-tune the language model with reinforcement learning. We’ll use the Proximal Policy Optimization (PPO) algorithm for this step. If you don’t know what PPO is, that’s okay. Think of it like this: the model first tries some actions (in an LLM’s case, predicting the next token) and receives a reward for them. Guided by that reward, it gradually shifts its next-token predictions toward what the human feedback indicated, while making sure each update doesn’t change the policy too drastically.
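Formally, PPO maximizes the clipped surrogate objective

L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,A_t,\ \mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,A_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}

where A_t is the advantage estimate and \epsilon is the clipping threshold (clip_epsilon in the code below). The torch.clamp and torch.min calls in the code implement exactly this objective.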
from torch.distributions import Categorical

# PPO hyperparameters
clip_epsilon = 0.2
gamma = 0.99  # discount factor (not used in this single-step toy example)

# Optimizer for the policy, i.e. the language model itself
policy_optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Generate model outputs
outputs = model(**inputs)
logits = outputs.logits
probs = torch.softmax(logits, dim=-1)
dist = Categorical(probs)

# Sample actions (tokens) from the model
actions = dist.sample()

# Compute rewards using the reward model (no gradients flow through the reward model)
with torch.no_grad():
    output_embeddings = model.transformer(**inputs).last_hidden_state.mean(dim=1)
    rewards = reward_model(output_embeddings)  # shape: (batch_size, 1)

# Log-probabilities of the sampled tokens under the "old" policy,
# detached so they serve as a fixed reference for the probability ratio
old_log_probs = dist.log_prob(actions).detach()

# Advantages: a full PPO implementation uses a learned value function (e.g. with GAE);
# here we simply subtract the mean reward as a crude baseline
advantages = rewards - rewards.mean()  # broadcasts over the token dimension

# PPO clipped surrogate objective
ratio = torch.exp(dist.log_prob(actions) - old_log_probs)
surr1 = ratio * advantages
surr2 = torch.clamp(ratio, 1 - clip_epsilon, 1 + clip_epsilon) * advantages
ppo_loss = -torch.min(surr1, surr2).mean()

# Update the language model
policy_optimizer.zero_grad()
ppo_loss.backward()
policy_optimizer.step()
print(f"PPO Loss: {ppo_loss.item()}")
Clipping is used because excessively large policy updates can destabilize training and hurt performance; keeping the ratio within [1 − ε, 1 + ε] ensures each update stays close to the old policy.
Bonus: In practice, RLHF also compares the updated model’s output distribution with that of the original (frozen reference) model. Adding a KL-divergence penalty to the reward keeps the fine-tuned policy close to the reference model, which preserves its general language ability and helps prevent reward hacking.
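As a rough illustration of that idea, here is a minimal sketch of a per-token KL penalty, continuing from the PPO code above (it reuses dist, actions, rewards, model, and inputs). The kl_coef value is arbitrary, and the reference copy is taken here for simplicity; in a real pipeline you would snapshot the model before RL fine-tuning and fold the penalized reward back into the PPO objective.

import copy

# Frozen reference copy of the model (ideally a snapshot taken before RL fine-tuning)
ref_model = copy.deepcopy(model)
ref_model.eval()
for param in ref_model.parameters():
    param.requires_grad_(False)

kl_coef = 0.1  # hypothetical penalty strength

# Log-probabilities of the sampled tokens under the current policy and the reference policy
policy_log_probs = dist.log_prob(actions)
with torch.no_grad():
    ref_logits = ref_model(**inputs).logits
    ref_dist = Categorical(torch.softmax(ref_logits, dim=-1))
    ref_log_probs = ref_dist.log_prob(actions)

# Per-sequence KL estimate: mean over tokens of log pi_theta(a) - log pi_ref(a)
kl_penalty = (policy_log_probs - ref_log_probs).mean(dim=-1, keepdim=True)

# Penalized reward: discourage drifting too far from the reference model
penalized_rewards = rewards - kl_coef * kl_penalty.detach()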
Challenges and Future Directions
While RLHF is powerful, it comes with challenges:
- Scalability: Collecting human feedback is resource-intensive.
- Bias: Human annotators may introduce biases into the reward model.
- Reward Hacking: The model might exploit the reward model to maximize rewards without truly aligning with human values.
Future research aims to address these challenges by improving reward models, reducing reliance on human feedback, and ensuring fairness and robustness.
Conclusion
RLHF is a game-changer for aligning machine learning models with human preferences. By combining supervised learning, reward modeling, and reinforcement learning, RLHF enables models to generate outputs that are more useful, ethical, and aligned with human values. With the PyTorch examples provided, you can start experimenting with RLHF and explore its potential in your own projects.
Happy coding! 🚀