Chain-of-Thought (CoT) prompting is a technique where language models are prompted or trained to generate intermediate reasoning steps before producing a final answer. Instead of jumping directly to an answer, the model thinks through the problem step-by-step. CoT dramatically improves performance on multi-step reasoning, mathematics, logic, and complex analytical tasks.
The original CoT discovery
Chain-of-thought prompting was introduced by Wei et al. (Google Brain, 2022) with a startling finding: simply appending "Let's think step by step" to a prompt, or providing few-shot examples with step-by-step reasoning, dramatically improved LLM performance on math and reasoning benchmarks — with no weight updates at all.
| Model | GSM8K standard | GSM8K + CoT | Gain | Note |
|---|---|---|---|---|
| GPT-3 175B | 18% | 57% | +39pp | CoT only emerged in models >100B params |
| PaLM 540B | 17% | 56% | +39pp | Near GPT-3 level; scale drives CoT benefit |
| PaLM 2 | 80% | 91% | +11pp | Diminishing gains as base capability rises |
| GPT-4 | 87% | 92% | +5pp | Diminishing returns at frontier |
| GPT-4o + self-consistency | 92% | 97% | +5pp | Self-consistency further boosts hard problems |
Emergent capability
CoT showed no benefit for models under ~100B parameters — it only helps once the model is large enough to actually reason. This is an example of an "emergent capability": a behavior that appears suddenly at a scale threshold, not gradually. Smaller models that attempt CoT often produce fluent but meaningless or incorrect reasoning traces.
Zero-shot vs few-shot CoT
Three approaches to eliciting chain-of-thought reasoning, with different tradeoffs in setup cost vs performance.
All three CoT approaches using the OpenAI API — from simplest to most powerful
from openai import OpenAI
client = OpenAI()
MODEL = "gpt-4o"
def ask(messages):
return client.chat.completions.create(
model=MODEL, messages=messages, temperature=0
).choices[0].message.content
PROBLEM = "If a train travels 120 km in 1.5 hours, then stops for 30 minutes, then travels another 90 km in 1 hour, what is its average speed for the entire journey including the stop?"
# ─── Approach 1: Standard (no CoT) ───────────────────────────────────────────
standard = ask([{"role": "user", "content": PROBLEM}])
# Often gives wrong answer: jumps to (120+90)/(1.5+1) = 84 km/h, forgetting the stop
# ─── Approach 2: Zero-shot CoT ───────────────────────────────────────────────
zero_shot_cot = ask([{"role": "user", "content": PROBLEM + "\n\nLet's think step by step."}])
# Model breaks down: total distance=210km, total time=1.5+0.5+1=3h, avg=70 km/h ✓
# ─── Approach 3: Few-shot CoT ─────────────────────────────────────────────────
few_shot_system = """Solve math problems by thinking step by step.
Show each calculation on its own line.
Clearly label: Total distance, Total time, Final answer."""
few_shot_messages = [
{"role": "system", "content": few_shot_system},
{"role": "user", "content": "A car goes 60 km in 1 hour, stops 15 min, goes 45 km in 45 min. Avg speed?"},
{"role": "assistant", "content": "Total distance: 60 + 45 = 105 km\nTotal time: 1 + 0.25 + 0.75 = 2 hours\nAverage speed: 105 / 2 = 52.5 km/h"},
{"role": "user", "content": PROBLEM},
]
few_shot_cot = ask(few_shot_messages)
# Most reliable: follows demonstrated reasoning structure exactly
# ─── Approach 4: Self-consistency (best accuracy) ────────────────────────────
import re
from collections import Counter
def self_consistent_answer(problem: str, n_samples: int = 10) -> str:
"""Generate n independent CoT solutions and take majority vote."""
answers = []
for _ in range(n_samples):
resp = client.chat.completions.create(
model=MODEL,
messages=[{"role": "user", "content": problem + "\n\nThink step by step, then give the final numeric answer."}],
temperature=0.7, # diversity needed for voting to help
).choices[0].message.content
# Extract the last number mentioned as the answer
nums = re.findall(r'\d+\.?\d*', resp)
if nums: answers.append(nums[-1])
if answers:
most_common = Counter(answers).most_common(1)[0][0]
return most_common
return "No consensus"
answer = self_consistent_answer(PROBLEM, n_samples=15)
print(f"Self-consistent answer: {answer} km/h") # → 70Self-consistency tradeoff
Self-consistency (majority vote over 10–40 samples) consistently adds 5–10% accuracy on hard benchmarks but multiplies API cost by N. Use it when: the task is high-stakes (math exams, code generation), base accuracy is already 60–80% (voting helps), and you can afford the cost. For 90%+ accuracy tasks or latency-sensitive apps, single-path CoT is sufficient.
How CoT works internally
Why does writing out reasoning steps improve a model's final answer? The mechanism is not fully understood, but research has narrowed it down to three complementary explanations.
- Computation allocation: Each generated token is a full forward pass through the model. Generating 100 reasoning tokens means 100× more "compute" applied to the problem before giving an answer — essentially a soft form of multi-step computation that the model's fixed layers can't perform in a single pass.
- External working memory: LLMs have no internal state beyond the context window. Writing intermediate results to the context externalizes memory. Without CoT, intermediate values computed in early layers are lost before the answer layer is reached. With CoT, those values persist as tokens in context.
- Knowledge pathway activation: Reasoning through a problem step-by-step activates different, more relevant knowledge paths than jumping directly to an answer. The intermediate tokens serve as attention anchors that pull in more precise knowledge from the model's weights.
The faithfulness problem
Research (Turpin et al., 2023) found that models' verbalized CoT reasoning is sometimes "unfaithful" — the stated reasoning doesn't reflect what's actually driving the answer. When a biasing hint is added to the prompt, the model often changes its answer while rationalizing with different-looking but post-hoc reasoning. This matters for debugging: a correct CoT doesn't guarantee correct internal reasoning.
Reasoning models: o1, R1, and extended thinking
In late 2024, a new paradigm emerged: models trained (not just prompted) to reason extensively before answering. These models generate thousands of hidden "thinking" tokens — an internal scratchpad — before producing the visible response.
| Model | Lab | Approach | AIME 2024 score | Key capability |
|---|---|---|---|---|
| GPT-4o | OpenAI | Standard SFT + RLHF | 13% | Best general assistant without extended thinking |
| o1 | OpenAI | RL-trained to reason; hidden chain-of-thought | 74% | 5.7× better on AIME; beats PhD on many domains |
| o3 | OpenAI | Scaled-up o1; adaptive compute budget | 96% | Near-perfect on AIME; competitive coding champion level |
| DeepSeek-R1 | DeepSeek | Group Relative Policy Optimization (GRPO) on verifiable rewards | 79% | Open-weights reasoning model matching o1 |
| Claude 3.7 Sonnet | Anthropic | Extended thinking mode: configurable token budget for reasoning | ~80% | User-visible thinking traces; budget control |
| Gemini 2.0 Flash Thinking | Distilled reasoning into faster model | ~70% | Fastest reasoning model as of early 2025 |
How they're trained differently
Standard CoT is a prompting technique that works at inference time. Reasoning models are trained with RL using verifiable reward signals — math problems where you can check whether the answer is correct, code that either passes tests or fails. The RL process discovers reasoning strategies that maximize correctness, leading to emergent behaviors like self-correction, exploration, and backtracking that weren't explicitly programmed.
Limits and failure modes of CoT
CoT dramatically improves reasoning, but it is not a silver bullet. Understanding its failure modes is essential for reliable deployment.
| Failure mode | Description | Example | Mitigation |
|---|---|---|---|
| Plausible but wrong | Coherent reasoning steps lead to incorrect final answer | "3 × 4 = 12, therefore total is 14" (arithmetic slip) | Self-consistency; external verification |
| Error compounding | Early mistake amplifies through chain | Wrong unit conversion → all subsequent steps wrong | Structured problem decomposition; re-ask |
| Spurious reasoning | Stated reasoning is post-hoc rationalization, not actual | Model changes answer when hint added but claims different reasoning | Faithfulness probes; cross-check answers |
| Verbosity spiral | More steps ≠ more accuracy; model over-complicates | Simple addition solved in 15 verbose steps with error | Instruction: "Be concise, show only key steps" |
| Hallucinated facts mid-chain | Model invents intermediate values | "Wikipedia says X" — X does not exist | Grounding: tool calls for factual lookups within CoT |
Never deploy CoT alone for high-stakes decisions
In medical, legal, or financial contexts, CoT reasoning that reads as authoritative can be confidently wrong. Always: (1) verify numeric outputs independently, (2) use retrieval-grounded CoT for factual claims, (3) add human review for consequential decisions. A model that thinks through 10 steps and reaches a wrong answer is more dangerous than one that says "I'm not sure" — the confident reasoning creates false trust.
Practice questions
- What empirical finding by Wei et al. (2022) established chain-of-thought as a major prompting technique? (Answer: Wei et al. (Google, 2022) showed that appending 'Let's think step by step' or providing multi-step reasoning examples dramatically improved performance on arithmetic, commonsense, and symbolic reasoning benchmarks — but ONLY for models above ~100B parameters. For smaller models, CoT hurt performance. This scale threshold finding was critical: it meant CoT is an emergent capability of large models, not a general prompting technique. The paper showed 40–60% accuracy improvements on GSM8K with CoT vs direct answering.)
- What is the difference between zero-shot CoT and few-shot CoT prompting? (Answer: Zero-shot CoT: simply append 'Let's think step by step.' to the prompt — no examples provided. Few-shot CoT: provide 3–8 examples of (question, reasoning chain, answer) before the target question. Few-shot CoT outperforms zero-shot CoT on complex reasoning tasks because the examples demonstrate the expected reasoning format and depth. Zero-shot CoT is simpler (no example curation) and often sufficient for well-defined problems. Few-shot CoT is preferred for novel reasoning patterns where the model needs to see the expected structure.)
- What is self-consistency decoding and how does it improve CoT performance? (Answer: Self-consistency (Wang et al. 2022): sample k reasoning chains independently (temperature > 0), execute each to get k answers, take majority vote. The diversity of reasoning paths reduces reliance on any single chain that may contain errors. Key insight: multiple correct paths lead to correct answers; multiple incorrect paths rarely agree on the same wrong answer. GSM8K improvement: CoT+self-consistency (k=40): 88% vs CoT alone: 57%. Trade-off: k× more inference compute and API cost.)
- What is the 'unfaithful reasoning' problem in chain-of-thought? (Answer: CoT reasoning chains may not reflect the model's actual internal computation. Lanham et al. (2023): models sometimes give incorrect CoT but correct final answers (unused reasoning), and correct CoT but incorrect answers (reasoning not actually guiding the output). Faithfulness of CoT is debated: the reasoning might be post-hoc rationalisation of an answer computed through other mechanisms. This matters for safety: if a model's stated reasoning is unfaithful, we cannot use it to understand or verify model behaviour.)
- When does CoT hurt performance compared to direct answering? (Answer: CoT hurts for: (1) Simple factual questions — 'What is the capital of France?' Adding 'Let me reason...' wastes tokens and can introduce errors. (2) Tasks that are pattern-matched from training data — models can answer faster and more accurately without reasoning steps. (3) Small models (<10B) — they lack the capacity to reason effectively in CoT; forced reasoning introduces errors. Rule: use CoT for tasks requiring multi-step computation or reasoning. Skip CoT for simple retrieval, classification, or pattern matching.)