Shubham's blog

Notes on RLHF | Nathan Lambert's book

Overview

Nathan Lambert is a distinguished researcher at AI2 (the Allen Institute for AI) who recently published a book on RLHF (Reinforcement Learning from Human Feedback). The book can be found here

Core Concept

RLHF enables a base model to learn from human preferences. These preferences are encoded in a separate reward model, and the training objective is to align the base model with them through that reward model. A Supervised Fine-Tuning (SFT) step precedes RLHF to give the policy a good starting point for alignment.

Three Key Steps:

  1. Train a language model to understand human instruction data
  2. Gather human preference data to train a reward model
  3. Optimize the LM using an RL optimizer by sampling generations and rating them against the reward model
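
To make step 2 concrete, here is a minimal sketch (mine, not from the book) of the standard Bradley-Terry pairwise loss used to train reward models: the model is pushed to score the chosen response above the rejected one.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_scores: torch.Tensor,
                         rejected_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: -log sigmoid(r_chosen - r_rejected).

    The reward model is trained to assign a higher scalar score to the
    human-preferred completion than to the rejected one.
    """
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy usage with made-up scores for a batch of three preference pairs
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.5, 0.9, 1.1])
print(pairwise_reward_loss(chosen, rejected))  # smaller when chosen > rejected
```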

Post-Training

Post-training encompasses all steps after obtaining the base model, including supervised fine-tuning and RLHF.

Key Points:

Chapter 3: Definitions and Background

Language Modeling Overview

Key RL Definitions

Chapter 4: Training Overview

Optimization Objective

J(π_θ) = E_{τ∼π_θ}[ Σ_{t=0}^∞ γ^t r_t ]: the expected (discounted) sum of rewards over an infinite time horizon, taken over trajectories sampled from the policy

Regularization

Basic RLHF Pipeline

  1. Instruction fine-tuning (IFT) on ~10K examples
  2. Train a reward model on ~100K pairwise preference comparisons (starting from the instruction-tuned checkpoint)
  3. Train IFT model with RLHF

Standard objective: maximize the reward-model score while penalizing KL distance from the reference policy, i.e. maximize E[r(x, y)] − β · D_KL(π_θ ‖ π_ref)
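
As a rough sketch of how that objective shows up in PPO-style implementations (my illustration; the KL estimator and the value of β vary across codebases), the reward fed to the RL optimizer is the reward-model score minus a per-token KL penalty:

```python
import torch

def kl_shaped_rewards(rm_score: torch.Tensor,
                      logprobs_policy: torch.Tensor,
                      logprobs_ref: torch.Tensor,
                      beta: float = 0.05) -> torch.Tensor:
    """Per-token reward fed to the RL optimizer:
    reward-model score minus beta * KL penalty against the reference model."""
    # Simple per-token KL estimate: log pi_theta(token) - log pi_ref(token)
    kl = logprobs_policy - logprobs_ref
    return rm_score - beta * kl

# Toy usage: a 4-token response whose reward-model score lands on the last token
rm = torch.tensor([0.0, 0.0, 0.0, 1.8])
lp_policy = torch.tensor([-1.1, -0.4, -2.0, -0.7])
lp_ref = torch.tensor([-1.3, -0.5, -1.5, -0.9])
print(kl_shaped_rewards(rm, lp_policy, lp_ref))
```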

Modern Approaches

Today's models involve multiple post-training iterations with RLVR (RL with verifiable rewards) to boost reasoning behavior.
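
As a concrete (and purely illustrative) example of a verifiable reward, a string-match check against a known answer can stand in for a learned reward model:

```python
def verifiable_reward(model_answer: str, reference_answer: str) -> float:
    """RLVR-style reward: 1.0 if the model's final answer matches the known
    ground truth, 0.0 otherwise. No learned reward model is involved."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

print(verifiable_reward(" 42 ", "42"))  # 1.0
print(verifiable_reward("41", "42"))    # 0.0
```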

Chapters 5 & 6: The Nature of Preferences

Challenges in Quantifying Preferences

Humans find it easier to choose between options (to rank or rate them) than to generate answers from scratch.

Preference Collection Methods


Note: Post-training is capability-specific—the behaviors a model needs to exhibit determine the data types and training iterations required.

7. Reward Modeling

Key Takeaways:

8. Regularization

Key Takeaways:

9. Instruction Finetuning

Key Takeaways:

10. Rejection Sampling

Key Takeaways:

11. Reinforcement Learning (Policy Gradient Algorithms)

Key Takeaways:

12. Direct Alignment Algorithms

Key Takeaways:

13. Constitutional AI & AI Feedback

Key Takeaways:

14. Reasoning Training & Inference-Time Scaling

Key Takeaways:

15. Tool Use & Function Calling

Key Takeaways:

16. Synthetic Data & Distillation

Key Takeaways:

17. Evaluation

Key Takeaways:

18. Over-Optimization

Key Takeaways:

19. Style and Information

Key Takeaways:

20. Product, UX, and Model Character

Key Takeaways:


Overall Key Insights:

  1. RLHF has evolved: From simple 3-stage pipeline to complex multi-stage post-training
  2. Synthetic data dominates: AI feedback largely replaced human data (except at capability frontiers)
  3. Algorithm choice matters less than data: DPO vs. PPO less important than quality data
  4. Reasoning revolution: RLVR enables inference-time scaling through RL on verifiable rewards
  5. Implementation details critical: Loss aggregation, masking, KL penalties significantly affect outcomes
  6. Over-optimization inevitable: Proxy objectives always deviate from true goals
  7. Evaluation is complex: No single ground truth; labs optimize private metrics, report public ones
  8. Style undervalued: Format and presentation as important as content for utility
  9. Open questions remain: Best practices still emerging, especially for reasoning and tool use
  10. Future is agentic: Tool use, multi-step reasoning, and RL-based training increasingly central

Building Intuition for RL as a Reward Landscape Traversal

Translating abstract RL concepts into a concrete geometric space is one of the best ways to understand them, especially for the high-dimensional continuous spaces an LLM inhabits. Building this mental model for myself helped me understand RL better.

The Core Stage: The 3D Environment

  1. The "Ground Floor" (The 2D Plane): The State Space Imagine an infinitely wide, flat map stretched out on the floor.
    • Every possible coordinate point (x,y) on this map is a unique State (s).
    • Note: In reality, an LLM state space is thousands of dimensions, but for our mental model, compressing it to a 2D plane works perfectly.
  2. The "Elevation" (The Z-Axis): The Reward Function Now, imagine this map is not flat. It has hills, mountains, deep valleys, and flat plains.
    • The altitude (Z-height) at any specific coordinate (x,y) is the Immediate Reward (r) you get for stepping onto that exact spot.
    • Goal: The agent wants to spend its time at the highest possible altitudes.
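
A toy version of this landscape in code (entirely made up; just two Gaussian hills on an otherwise flat plain):

```python
import numpy as np

def reward(x: float, y: float) -> float:
    """The 'altitude' at state (x, y): two Gaussian hills on a flat plain."""
    lower_hill = 1.0 * np.exp(-((x - 2.0) ** 2 + (y - 1.0) ** 2))
    high_mountain = 3.0 * np.exp(-((x + 1.0) ** 2 + (y + 2.0) ** 2) / 4.0)
    return float(lower_hill + high_mountain)

print(reward(2.0, 1.0))    # on top of the lower hill
print(reward(-1.0, -2.0))  # on the summit of the higher mountain
print(reward(8.0, 8.0))    # far out on the flat plain, roughly zero
```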

The Moving Parts: The Agent and Trajectories

Now, let's place the agent into this 3D terrain.

  1. You Are Here: The Current State (s_t). Place a single pin on the map. That is where you are right now.
  2. The Steering Wheel: The Policy (π_θ / the LLM). You are standing at the current state pin. You hold a compass and a steering wheel. This is your Policy.
    • The Policy looks at the surrounding terrain (the state) and decides which direction to step next.
    • The Action (a_t) is the actual step you take—a small vector pushing you from your current (x,y) to a new (x,y).
  3. The Path Taken: The Trajectory (τ). As you take repeated steps driven by your policy, you trace a curved line across the landscape.
    • This line is your trajectory: State → Action → State → Action → ...
    • As you walk along this curved line, you are constantly moving up and down in altitude (collecting immediate rewards).
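
Continuing the toy example (the random-walk "policy" below is deliberately naive), a trajectory is just the sequence of states visited and the immediate rewards collected along the way:

```python
import numpy as np

def reward(x: float, y: float) -> float:
    """A single Gaussian hill at the origin (kept simpler than the sketch above)."""
    return float(np.exp(-(x ** 2 + y ** 2)))

def rollout(steps: int = 10, step_size: float = 0.5, seed: int = 0):
    """Trace a trajectory tau: state, action, state, action, ...

    The 'policy' here just picks a random direction each step; a trained
    policy would instead pick directions that climb toward high reward.
    """
    rng = np.random.default_rng(seed)
    x, y = -2.0, -2.0                                   # starting state s_0
    states, rewards = [(x, y)], []
    for _ in range(steps):
        angle = rng.uniform(0.0, 2.0 * np.pi)           # action a_t: a direction
        x, y = x + step_size * np.cos(angle), y + step_size * np.sin(angle)
        states.append((x, y))                           # next state
        rewards.append(reward(x, y))                    # immediate reward r_t
    return states, rewards

states, rewards = rollout()
print(f"visited {len(states)} states, total altitude collected: {sum(rewards):.3f}")
```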

The Optimization Concepts: Value and Constraints

Here is where we integrate the more advanced concepts: Value functions and KL divergence.

6. The "Summit Potential": The Value Function (V(s))

If you are standing at a specific point on the map, the Value Function is not the altitude you are currently at. It is an estimate of how much total altitude you will accumulate from this point forward if you keep following your current policy.
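
A rough Monte Carlo sketch of this idea (not an efficient estimator, just an illustration): estimate V(s) as the average discounted altitude collected over many rollouts that start at s and follow the current policy.

```python
import numpy as np

def reward(x: float, y: float) -> float:
    """Same toy landscape: a single Gaussian hill at the origin."""
    return float(np.exp(-(x ** 2 + y ** 2)))

def value_estimate(x0: float, y0: float, gamma: float = 0.95,
                   steps: int = 50, n_rollouts: int = 200, seed: int = 0) -> float:
    """Monte Carlo estimate of V(s): average discounted return of rollouts
    that start at (x0, y0) and follow a fixed (here: random-walk) policy."""
    rng = np.random.default_rng(seed)
    returns = []
    for _ in range(n_rollouts):
        x, y, total = x0, y0, 0.0
        for t in range(steps):
            angle = rng.uniform(0.0, 2.0 * np.pi)   # random-walk policy
            x += 0.3 * np.cos(angle)
            y += 0.3 * np.sin(angle)
            total += (gamma ** t) * reward(x, y)    # discounted immediate reward
        returns.append(total)
    return float(np.mean(returns))

# A state near the summit has a much higher value than one far out on the plain
print(value_estimate(0.0, 0.0), value_estimate(5.0, 5.0))
```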

7. The "Safety Corridor": KL Divergence Constraint

This is the crucial part for modern RL (like PPO or TRPO). You have an old "baseline" model that you know works okay. You want to improve it, but you don't want to completely break it. The KL divergence constraint is the safety corridor drawn around the baseline's path: the new policy is free to explore the terrain, but it pays a growing penalty the further its steps drift from where the reference policy would have walked.
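
A tiny numeric illustration of that constraint (my own sketch; for an LLM the "directions" would be next tokens, with the two distributions coming from the policy and the frozen reference model):

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """KL(p || q) for two discrete distributions over the same set of actions."""
    return float(np.sum(p * np.log(p / q)))

# Probability of stepping in each of four directions (N, E, S, W)
reference_policy = np.array([0.25, 0.25, 0.25, 0.25])   # the baseline you trust
policy_small_detour = np.array([0.30, 0.25, 0.25, 0.20])
policy_big_detour = np.array([0.85, 0.05, 0.05, 0.05])

print(kl_divergence(policy_small_detour, reference_policy))  # ~0.01: inside the corridor
print(kl_divergence(policy_big_detour, reference_policy))    # ~0.8: wandering far off the path

# Training maximizes reward - beta * KL(policy || reference), so a big detour
# from the baseline is only worth it if the reward gain outweighs the penalty.
```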


Summary of a Geometric Mental Model

(Figure: Gemini-generated image summarizing the geometric mental model.)