Definition

Reasoning models are a class of large language models trained to perform extended chain-of-thought reasoning before producing a final answer. OpenAI's o1 (September 2024) was the first widely deployed reasoning model: OpenAI reported that it solved 83% of the problems on a qualifying exam for the International Mathematics Olympiad, compared with 13% for GPT-4o. DeepSeek R1 (January 2025) reached comparable performance as an open-weight model, setting off a wave of reasoning model development across the industry.

How reasoning models are trained: GRPO and process reward models

Standard LLMs are trained to predict the next token. Reasoning models are additionally trained with reinforcement learning to maximize the correctness of final answers, so the model learns to use its context window as a scratchpad. OpenAI has not published the details of its training process; DeepSeek R1 uses Group Relative Policy Optimization (GRPO), which eliminates the need for a separate critic model by using the average reward within a group of responses sampled for the same prompt as the baseline.

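GRPO objective: the advantage A_i of each response is computed relative to the group's average reward rather than from a learned value function: A_i = (r_i - mean(r_1..r_G)) / std(r_1..r_G), where r_1..r_G are the rewards of the G responses sampled for one prompt. This eliminates the critic network entirely, reducing training memory by roughly 50% compared with standard PPO, where the critic is typically a model of similar size to the policy.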
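The group-relative advantage can be illustrated with a minimal Python sketch. It covers only the advantage step, not the full GRPO objective (no clipping or KL term). The function name `grpo_advantages`, the `eps` guard, and the use of the population standard deviation are assumptions made for illustration, not details taken from DeepSeek's code.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Return group-relative advantages for one prompt's sampled responses.

    Each advantage is (reward - group mean) / (group std + eps).
    eps is an illustrative guard against division by zero when every
    response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four sampled answers, reward 1.0 if correct, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = grpo_advantages(rewards)
print(advantages)  # correct answers get positive values, incorrect negative
```

Because the baseline is just the group mean, no value network has to be trained or held in memory; the trade-off is that several responses must be sampled per prompt.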

Model	Creator	AIME 2024	MATH-500	SWE-bench Verified	Open weights?
o1	OpenAI	74.4%	96.4%	48.9%	No
o3-mini	OpenAI	90.0%	97.9%	49.3%	No
DeepSeek R1	DeepSeek	79.8%	97.3%	49.2%	Yes
Claude 3.7 Sonnet (thinking)	Anthropic	~80%	~97%	70.3%	No
Gemini 2.5 Pro	Google	92.0%	97.9%	Unreported	No

When to use a reasoning model vs a standard model

Use reasoning models for: math problems, formal proofs, multi-step coding tasks, complex logic puzzles, scientific analysis
Use standard models for: writing, summarization, simple Q&A, translation, classification (tasks where extended thinking wastes time and money)
Reasoning models are typically 5–20x more expensive and 5–10x slower than comparable standard models
The 'thinking' tokens are often hidden from users but still count toward your token bill
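To see how hidden thinking tokens affect the bill, here is a rough cost sketch. It assumes thinking tokens are billed at the same rate as visible output tokens, which should be checked against your provider's pricing. The function name, the token counts, and the per-million-token prices are hypothetical placeholders, not real price quotes.

```python
def estimate_cost_usd(input_tokens, thinking_tokens, answer_tokens,
                      input_price_per_m, output_price_per_m):
    """Estimate the cost of one request in USD.

    Assumption of this sketch: thinking tokens are billed as output
    tokens, even when they are never shown to the user.
    """
    billed_output = thinking_tokens + answer_tokens
    total = (input_tokens * input_price_per_m
             + billed_output * output_price_per_m)
    return total / 1_000_000


# Hypothetical request: 1,000 input tokens and a 500-token visible answer,
# priced at made-up rates of $1 (input) and $4 (output) per million tokens.
with_thinking = estimate_cost_usd(1_000, 4_000, 500, 1.0, 4.0)
without_thinking = estimate_cost_usd(1_000, 0, 500, 1.0, 4.0)
print(f"with thinking:    ${with_thinking:.4f}")
print(f"without thinking: ${without_thinking:.4f}")
```

With these placeholder numbers, the 4,000 hidden thinking tokens account for most of the cost even at identical per-token prices, which is one reason simple tasks are cheaper on a standard model.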

Practical rule

If a task could be solved by a smart person in 30 seconds, use a standard model. If it would take a PhD student 30 minutes of focused work, use a reasoning model.
