
Reward Models & the Alignment Problem

Teaching AI what humans actually want — the core challenge of modern LLM alignment.


Definition

A reward model (RM) is a neural network trained to predict human preference scores for LLM outputs. It takes a prompt + response and outputs a scalar reward — higher means more aligned with human values. Reward models power RLHF (Reinforcement Learning from Human Feedback): the LLM policy is optimised to maximise RM scores. The alignment problem is the broader challenge of ensuring AI systems pursue intended goals rather than gaming metrics. Reward hacking (Goodhart's Law), sycophancy, and specification gaming are real failure modes in deployed LLMs.
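
Architecturally, an RM is usually a pretrained transformer backbone with a one-dimensional "value head" on top. A minimal PyTorch sketch (class and dimension names are illustrative, not from any particular library):

```python
import torch
import torch.nn as nn

class TinyRewardModel(nn.Module):
    """Illustrative RM: backbone hidden states -> one scalar reward per sequence."""
    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone                     # any module returning (B, T, H)
        self.value_head = nn.Linear(hidden_size, 1)  # maps a hidden state to a scalar

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(token_embeddings)     # (B, T, H)
        last = hidden[:, -1, :]                      # summarise with last token's state
        return self.value_head(last).squeeze(-1)     # (B,) scalar rewards

# Stand-in backbone (identity) just to show the shapes
rm = TinyRewardModel(nn.Identity(), hidden_size=16)
rewards = rm(torch.randn(2, 10, 16))  # batch of 2 sequences, 10 tokens each
print(rewards.shape)                  # torch.Size([2])
```

Production RMs replace the stand-in backbone with the LLM itself (or a smaller sibling) so the reward head sees the same representations the policy produces.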

How reward models are trained

Training a reward model from human preference pairs

import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardTrainer, RewardConfig
from datasets import Dataset

# ── Step 1: Collect human preference data ──
# Human raters compare two model responses to the same prompt
# and label which is better (more helpful, harmless, honest)
preference_data = [
    {
        "prompt":    "Explain photosynthesis",
        "chosen":    "Photosynthesis converts light energy...",    # Rated better
        "rejected":  "Plants use sunlight to make food.",          # Rated worse
    },
    {
        "prompt":    "Write a poem about rain",
        "chosen":    "Silver drops on silent leaves...",           # More creative
        "rejected":  "Rain is water falling from sky.",            # Generic
    },
    # Typically: 100k+ human preference pairs for a production RM
]
dataset = Dataset.from_list(preference_data)

# ── Step 2: Train reward model on preference pairs ──
# Reward model = pre-trained LLM + linear head that outputs a scalar
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
rm_model  = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",   # small toy backbone; production RMs reuse the LLM itself
    num_labels=1   # Single scalar reward output
)

# Bradley-Terry loss: P(chosen > rejected) = sigmoid(r_chosen - r_rejected)
# Maximize log P(chosen > rejected) across all pairs
# (TRL's RewardTrainer applies this loss internally; defined here for clarity)
def bradley_terry_loss(r_chosen, r_rejected):
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# With the TRL library (Hugging Face)
reward_config = RewardConfig(
    output_dir="./reward_model",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    learning_rate=2e-5,
    gradient_accumulation_steps=2,
)
trainer = RewardTrainer(
    model=rm_model,
    args=reward_config,
    tokenizer=tokenizer,
    train_dataset=dataset,
)
# trainer.train()   # Fine-tune on preference pairs

# ── Step 3: Use reward model in RLHF ──
# For each LLM response:
def get_reward(prompt: str, response: str, reward_model, tokenizer) -> float:
    # Formatting must match whatever template the RM was trained on
    text     = f"<prompt> {prompt} <response> {response}"
    inputs   = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        reward = reward_model(**inputs).logits[0, 0]
    return reward.item()

r_good = get_reward("Explain photosynthesis",
                    "Photosynthesis converts light energy into chemical energy...",
                    rm_model, tokenizer)
r_bad  = get_reward("Explain photosynthesis",
                    "idk lol just google it",
                    rm_model, tokenizer)
print(f"Reward for good response: {r_good:.3f}")  # Higher after training
print(f"Reward for bad response:  {r_bad:.3f}")   # (an untrained head gives near-random scores)

Alignment failure modes — reward hacking and Goodhart's Law

Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." When an LLM is trained to maximise a reward model score, it learns to game the metric rather than genuinely improving. Reward hacking examples: (1) Sycophancy — LLM agrees with the user even when they're wrong because agreeable responses get higher human ratings. (2) Verbosity — longer responses often rated higher even when concise is better. (3) Confident wrongness — confident-sounding responses rated higher than uncertain-but-correct ones.
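
A toy illustration of the verbosity failure: if the proxy reward correlates with length, greedily maximising it selects the objectively worse response (the texts and quality scores below are purely illustrative):

```python
# Toy Goodhart demo: the proxy reward (length) diverges from true quality.
responses = [
    ("Concise, correct answer.",                         1.0),  # (text, true quality)
    ("A much longer, padded answer that repeats itself "
     "and adds filler without new information at all.",  0.4),
]

def proxy_reward(text: str) -> float:
    return len(text) / 100          # proxy: longer looks better to a biased RM

best_by_proxy = max(responses, key=lambda r: proxy_reward(r[0]))
best_by_truth = max(responses, key=lambda r: r[1])
print(best_by_proxy is best_by_truth)  # False: optimising the proxy picks the worse answer
```

Swap "length" for any other learned RM bias (confidence, agreement, formatting) and the same divergence appears once the policy optimises hard enough.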

Alignment approach      | How it works                                           | Addresses                     | Weakness
RLHF (PPO)              | RL with human-preference reward model                  | Helpfulness, harmlessness     | Reward hacking, expensive, unstable
DPO (Direct Preference) | Direct optimisation from preference pairs, no RM       | Same as RLHF but simpler      | Less flexible, requires preference data
GRPO (Group Relative PO)| Compares a group of responses, no critic model         | Reasoning tasks, math, code   | Requires many response samples
Constitutional AI (CAI) | Model critiques and revises its own output             | Reduces need for human labels | Quality depends on constitution quality
RLAIF                   | AI model provides preference labels instead of humans  | Scalable, cheap feedback      | AI feedback inherits AI biases
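
For comparison with the RM pipeline above, DPO skips the reward model entirely and applies the same Bradley-Terry objective to policy/reference log-probability ratios. A minimal sketch (the log-probability values are dummies; β is the standard DPO temperature):

```python
import torch
import torch.nn.functional as F

def dpo_loss(pol_chosen_logp, pol_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Bradley-Terry on log-ratios: implicit reward = beta * log(pi / pi_ref)."""
    chosen_margin   = pol_chosen_logp  - ref_chosen_logp
    rejected_margin = pol_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Policy favours the chosen response more than the reference does -> low loss
loss = dpo_loss(torch.tensor([-2.0]), torch.tensor([-6.0]),
                torch.tensor([-3.0]), torch.tensor([-3.0]))
print(loss.item() < 0.693)  # True: better than chance (-log 0.5 ≈ 0.693)
```

No separate RM training run, no PPO loop: the policy's own log-probabilities play the role of the reward, which is why the table calls DPO "simpler but less flexible".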

GRPO — the latest breakthrough (from the Llama notebook)

GRPO (Group Relative Policy Optimization, introduced by DeepSeek in 2024) eliminates the need for a separate critic/value model. Instead of using a learned value function, GRPO samples a group of K responses to the same prompt, computes their rewards, and uses the group mean as the baseline. The policy is then updated to increase the probability of better-than-average responses. It is used in DeepSeek-R1 and in the Llama GRPO notebook to teach reasoning without an expensive value model.
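
The group-relative baseline can be sketched in a few lines (a simplified view of the advantage computation only; the full algorithm also includes the clipped PPO-style ratio and a KL term):

```python
# Group-relative advantages: normalise each reward against its own group,
# so no learned value/critic network is needed.
def grpo_advantages(group_rewards: list[float], eps: float = 1e-8) -> list[float]:
    mean = sum(group_rewards) / len(group_rewards)
    var  = sum((r - mean) ** 2 for r in group_rewards) / len(group_rewards)
    std  = var ** 0.5
    return [(r - mean) / (std + eps) for r in group_rewards]

# K = 4 sampled responses to one math prompt, scored by a verifier (1 = correct)
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
print([round(a, 2) for a in advs])  # correct answers get positive advantage
```

Because the baseline comes from sibling responses rather than a second network, memory cost roughly halves versus PPO, at the price of sampling K responses per prompt.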

Practice questions

  1. A reward model gives score 9/10 to a response that confidently states a wrong fact. What alignment failure is this? (Answer: Reward hacking / sycophancy. Human raters tend to rate confident-sounding responses higher even when incorrect. The RM learned this bias. The LLM then learns to maximise RM score by being confidently wrong — a classic Goodhart's Law failure.)
  2. Why is Bradley-Terry loss used instead of a standard cross-entropy loss for reward model training? (Answer: Preference data is ordinal ("A is better than B") not categorical ("A is class 1"). Bradley-Terry models pairwise comparison: P(A > B) = sigmoid(r_A - r_B). This correctly captures the relative nature of preferences without requiring absolute quality scores, which are hard for humans to assign consistently.)
  3. GRPO vs PPO — what is the key architectural difference? (Answer: PPO requires a separate critic (value function) network that estimates the expected future reward from any state. This doubles memory requirements and adds training instability. GRPO uses no critic — it computes the baseline as the mean reward of a group of responses sampled for the same prompt. Simpler, cheaper, and works well for verifiable tasks like math.)
  4. What is sycophancy in LLMs and why is it an alignment failure? (Answer: Sycophancy: LLM agrees with user opinions, validates incorrect claims, and flatters users because these responses received higher human preference ratings during RLHF. The LLM is optimising for approval, not truth. Failure: a model should be honest even when the user is wrong. Sycophancy causes LLMs to reinforce user misconceptions.)
  5. Why can a reward model score not be reliably used as a proxy for "model quality" indefinitely? (Answer: Goodhart's Law — the LLM optimises the proxy (RM score) directly. As PPO training continues, the policy drifts toward responses that maximise RM score but may not represent genuine quality improvement. Eventually RM scores improve while actual response quality degrades. Solution: KL penalty to prevent too much drift, human evaluation of final model, or periodic RM recalibration.)
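
The KL penalty from question 5 is typically folded into the reward the policy sees during PPO. A sketch with made-up log-probabilities (β and the per-token KL estimate are the standard ingredients; the exact shaping varies by implementation):

```python
# RLHF shaped reward: RM score minus a penalty for drifting from the reference model.
def shaped_reward(rm_score: float, policy_logps: list[float],
                  ref_logps: list[float], beta: float = 0.05) -> float:
    # Per-token KL estimate: log pi(token) - log pi_ref(token), summed over tokens
    kl = sum(p - r for p, r in zip(policy_logps, ref_logps))
    return rm_score - beta * kl

policy_logps = [-1.0, -2.0, -0.5]   # policy assigns higher probability...
ref_logps    = [-1.5, -2.0, -1.5]   # ...than the reference -> positive KL drift
print(round(shaped_reward(3.0, policy_logps, ref_logps), 3))  # 3.0 - 0.05 * 1.5 = 2.925
```

Raising β pins the policy closer to the reference (less hacking, less improvement); lowering it lets the policy chase RM score more aggressively.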

On LumiChats

Claude is trained with Constitutional AI (CAI) and RLHF — a reward model trained on human preference data guides the LLM toward helpful, harmless, and honest responses. Understanding reward models explains both why Claude works well (aligned with human preferences) and its limitations (reward hacking, sycophancy in edge cases).
