GRPO (Group Relative Policy Optimization) is a reinforcement learning algorithm that teaches LLMs to reason by rewarding correct final answers. Introduced by DeepSeek (2024), it samples K responses to each prompt, computes a reward for each (e.g. correct/incorrect), and trains the model to increase the probability of responses that scored above the group mean. Unlike PPO, GRPO needs no value function, making it simpler and cheaper to run. Combined with verifiable reward signals (math answers, code execution), GRPO produces models that naturally develop chain-of-thought reasoning. The Llama GRPO notebook demonstrates this on a single GPU.
Real-life analogy: The study group
GRPO is like a student who tries 8 different approaches to a maths problem, checks which ones got the right answer, and updates their strategy to favour the approaches that worked. No teacher grades each approach individually (no value function) — the student just knows which ones succeeded. Over many problems, the student learns which reasoning patterns reliably produce correct answers. This is exactly how GRPO teaches an LLM to reason.
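The study-group intuition maps directly onto arithmetic. A toy sketch (numbers invented for illustration): 8 sampled approaches, two of which got the right answer.

```python
# Toy illustration of one GRPO group update: 8 sampled "approaches",
# rewarded 1.0 if correct, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]

baseline = sum(rewards) / len(rewards)        # group mean reward
advantages = [r - baseline for r in rewards]  # each approach vs. the group

# Correct approaches sit above the group mean (positive advantage,
# reinforced); incorrect ones sit below it (negative, suppressed).
print(baseline)                       # 0.25
print(advantages[0], advantages[1])   # 0.75 -0.25
```

No teacher (value function) scores each approach individually; the group's own average is the baseline.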
GRPO algorithm — from the Llama notebook
GRPO training for reasoning with Unsloth (from Llama GRPO notebook)
```python
# This is a simplified version of the GRPO training pattern from the
# Llama_FP8_GRPO.ipynb notebook using Unsloth + TRL
from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)  # Patch TRL for faster GRPO
from trl import GRPOConfig, GRPOTrainer
from datasets import load_dataset
import re

# ── Step 1: Load model with LoRA for efficient fine-tuning ──
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-Instruct-FP8-Block",
    max_seq_length=8192,
    load_in_4bit=False,  # FP8 already compressed
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj"],
    lora_alpha=16,
    use_gradient_checkpointing="unsloth",  # Save VRAM during backprop
)

# ── Step 2: Define reward functions ──
# GRPO uses verifiable rewards — no learned reward model needed.
# TRL passes extra dataset columns (here "answer") to each reward
# function as keyword arguments.
def reward_correct_answer(completions, answer, **kwargs):
    """Reward 1.0 if the final answer is correct, 0.0 otherwise."""
    rewards = []
    for completion, gold in zip(completions, answer):
        # Extract the model's final answer from its reasoning chain
        match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        if match:
            predicted = match.group(1).strip()
            rewards.append(1.0 if predicted == gold else 0.0)
        else:
            rewards.append(0.0)  # No answer tag = wrong format
    return rewards

def reward_correct_format(completions, **kwargs):
    """Reward 0.5 for using the expected reasoning format."""
    return [0.5 if "<think>" in c and "</think>" in c and "<answer>" in c
            else 0.0 for c in completions]

# Combine rewards (format + correctness)
def combined_reward(completions, answer, **kwargs):
    format_rewards = reward_correct_format(completions)
    answer_rewards = reward_correct_answer(completions, answer)
    return [f + a for f, a in zip(format_rewards, answer_rewards)]

# ── Step 3: Configure and run GRPO training ──
grpo_config = GRPOConfig(
    use_vllm=True,  # Fast sampling with vLLM
    learning_rate=5e-6,
    num_generations=8,  # K=8: sample 8 responses per prompt
    max_prompt_length=256,
    max_completion_length=1024,  # Allow long reasoning chains
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    save_steps=250,
    output_dir="./grpo_reasoning_model",
    bf16=True,
)

# Load math dataset (OpenR1-Math or GSM8K). GRPOTrainer expects a
# "prompt" column; GSM8K answers end in "#### <number>", so keep
# only the final number for exact-match rewards.
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(lambda x: {
    "prompt": x["question"],
    "answer": x["answer"].split("####")[-1].strip(),
})

grpo_trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[combined_reward],
    args=grpo_config,
    train_dataset=dataset,
)

# ── GRPO update rule (simplified) ──
# For each prompt, sample K=8 completions
# Compute rewards: r_1, r_2, ..., r_8
# Baseline = mean(r_1, ..., r_8)    ← the GROUP mean
# Advantage_i = r_i - baseline      ← relative to group
# Loss = -mean(A_i * log P(completion_i)) + KL_penalty
# High advantage → increase probability. Low advantage → decrease.
print("GRPO training: Teaching reasoning via verifiable rewards")
# grpo_trainer.train()  # Runs on a T4/L4 GPU
```

What GRPO learns vs standard fine-tuning
| Aspect | Standard SFT | GRPO / RL Training |
|---|---|---|
| Objective | Copy demonstration responses (imitation) | Maximise reward (correct answer) |
| Data needed | High-quality demonstrations | Questions + verifiable answers (easier to collect) |
| Reasoning style | Mimics training data style | Discovers its own reasoning strategies |
| Performance ceiling | Cannot exceed demonstration quality | Can exceed human demonstrations (AlphaGo, DeepSeek-R1) |
| Emergent behaviour | Rarely emergent | Self-reflection, backtracking, longer chains (emergent in DeepSeek-R1) |
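The objective difference in the first row of the table can be sketched numerically (all probabilities and advantages below are invented for illustration, not taken from the notebook):

```python
import math

# SFT: push up the probability of the demonstration, whatever its quality.
p_demo = 0.4                    # P(demonstration) under the model
sft_loss = -math.log(p_demo)    # cross-entropy on the demo alone

# GRPO: weight each sampled completion's log-prob by its advantage,
# so completions better than the group average are reinforced even
# if no demonstration ever showed them.
p_samples = [0.2, 0.3, 0.1]            # P(completion_i) under the policy
advantages = [0.875, -0.125, -0.125]   # reward_i minus group mean
grpo_loss = -sum(a * math.log(p)
                 for a, p in zip(advantages, p_samples)) / len(p_samples)
```

Minimising the SFT loss can only ever reproduce the demonstration; minimising the GRPO loss raises the probability of whichever completion earned the highest reward.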
DeepSeek-R1 — emergent reasoning without supervision
DeepSeek-R1's precursor, DeepSeek-R1-Zero, was trained using GRPO on math and code problems with only the final answer as the reward signal (no human demonstrations of reasoning steps). The model spontaneously developed extended chain-of-thought reasoning, self-correction ("wait, let me reconsider..."), and exploration of multiple approaches: behaviours that were never explicitly taught. This demonstrates that RL with verifiable rewards can produce capabilities that pure imitation learning cannot.
Practice questions
- GRPO samples K=8 responses. Response 3 gets reward 1.0, others get 0.0. What happens to the probability of response 3? (Answer: Advantage of response 3 = 1.0 - mean(1,0,0,0,0,0,0,0) = 1.0 - 0.125 = 0.875. All others have advantage = 0 - 0.125 = -0.125. Policy update increases P(response 3) and decreases P(others). Over many steps, the model learns to generate responses similar to response 3.)
- Why does GRPO use a KL penalty? (Answer: KL penalty = β × KL(π_current || π_reference). Without it, the policy could drift arbitrarily far from the pretrained model, potentially forgetting language abilities while optimising the reward. The KL penalty keeps the model close to its pretrained distribution, preventing reward hacking and catastrophic forgetting.)
- What makes math and code ideal domains for GRPO? (Answer: Verifiable rewards — the answer is either correct or incorrect (no ambiguity). For math: execute the computation and compare. For code: run the code and check test cases. No human reward model needed. This scales cheaply: generate millions of math problems with automated verification. Subjective tasks (essay quality, creativity) lack this property.)
- Standard SFT cannot exceed the quality of demonstrations. Why can GRPO? (Answer: SFT maximises likelihood of training demonstrations — the model learns to imitate. If demonstrations are imperfect, the model learns the imperfections too. GRPO directly optimises for correctness via rewards. If the model finds a better reasoning strategy than any demonstration, it gets rewarded and reinforced. This is how AlphaGo surpassed all human Go players.)
- What is the difference between GRPO and PPO? (Answer: PPO uses a value function (critic network) that estimates expected future reward from each state. This doubles model size, adds training complexity, and requires the critic to converge. GRPO replaces the value function with the group mean reward as baseline — no critic needed. GRPO is simpler, uses less memory, and works better for discrete reasoning tasks with sparse rewards.)
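The KL-penalty question above can be made concrete with a two-token toy vocabulary (a hypothetical sketch; the distributions and the β value are invented for illustration):

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

reference = [0.5, 0.5]    # pretrained (reference) model's next-token dist
close     = [0.6, 0.4]    # policy after a small update
far       = [0.99, 0.01]  # policy that has drifted heavily

beta = 0.04  # illustrative KL coefficient
small_penalty = beta * kl(close, reference)
large_penalty = beta * kl(far, reference)
# The penalty grows sharply with drift, anchoring the policy to the
# pretrained distribution while the reward is being optimised.
```

Because the penalty is near zero for small updates but large for big ones, the policy can improve on the reward while staying close to its pretrained language distribution.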
On LumiChats
GRPO is the cutting-edge algorithm that produces reasoning models like DeepSeek-R1. The same technique in the attached notebook (Llama GRPO) can be run on a free Google Colab T4 GPU — turning a standard 1B Llama into a reasoning model that shows its work. LumiChats's reasoning capabilities use similar RL-based techniques.
Try it free