Definition

Chain-of-Thought (CoT) prompting is a technique where language models are prompted or trained to generate intermediate reasoning steps before producing a final answer. Instead of jumping directly to an answer, the model thinks through the problem step-by-step. CoT dramatically improves performance on multi-step reasoning, mathematics, logic, and complex analytical tasks.

The original CoT discovery

Chain-of-thought prompting was introduced by Wei et al. (Google Brain, 2022) with a startling finding: simply appending "Let's think step by step" to a prompt, or providing few-shot examples with step-by-step reasoning, dramatically improved LLM performance on math and reasoning benchmarks — with no weight updates at all.

Model	GSM8K standard	GSM8K + CoT	Gain	Note
GPT-3 175B	18%	57%	+39pp	CoT only emerged in models >100B params
PaLM 540B	17%	56%	+39pp	Near GPT-3 level; scale drives CoT benefit
PaLM 2	80%	91%	+11pp	Diminishing gains as base capability rises
GPT-4	87%	92%	+5pp	Diminishing returns at frontier
GPT-4o + self-consistency	92%	97%	+5pp	Self-consistency further boosts hard problems

Emergent capability

CoT showed no benefit for models under ~100B parameters — it only helps once the model is large enough to actually reason. This is an example of an "emergent capability": a behavior that appears suddenly at a scale threshold, not gradually. Smaller models that attempt CoT often produce fluent but meaningless or incorrect reasoning traces.

Zero-shot vs few-shot CoT

Three approaches to eliciting chain-of-thought reasoning, with different tradeoffs in setup cost vs performance.

All three CoT approaches using the OpenAI API — from simplest to most powerful

from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"

def ask(messages):
    return client.chat.completions.create(
        model=MODEL, messages=messages, temperature=0
    ).choices[0].message.content

PROBLEM = "If a train travels 120 km in 1.5 hours, then stops for 30 minutes, then travels another 90 km in 1 hour, what is its average speed for the entire journey including the stop?"

# ─── Approach 1: Standard (no CoT) ───────────────────────────────────────────
standard = ask([{"role": "user", "content": PROBLEM}])
# Often gives wrong answer: jumps to (120+90)/(1.5+1) = 84 km/h, forgetting the stop

# ─── Approach 2: Zero-shot CoT ───────────────────────────────────────────────
zero_shot_cot = ask([{"role": "user", "content": PROBLEM + "\n\nLet's think step by step."}])
# Model breaks down: total distance=210km, total time=1.5+0.5+1=3h, avg=70 km/h ✓

# ─── Approach 3: Few-shot CoT ─────────────────────────────────────────────────
few_shot_system = """Solve math problems by thinking step by step.
Show each calculation on its own line.
Clearly label: Total distance, Total time, Final answer."""

few_shot_messages = [
    {"role": "system",    "content": few_shot_system},
    {"role": "user",      "content": "A car goes 60 km in 1 hour, stops 15 min, goes 45 km in 45 min. Avg speed?"},
    {"role": "assistant", "content": "Total distance: 60 + 45 = 105 km\nTotal time: 1 + 0.25 + 0.75 = 2 hours\nAverage speed: 105 / 2 = 52.5 km/h"},
    {"role": "user",      "content": PROBLEM},
]
few_shot_cot = ask(few_shot_messages)
# Most reliable: follows demonstrated reasoning structure exactly

# ─── Approach 4: Self-consistency (best accuracy) ────────────────────────────
import re
from collections import Counter

def self_consistent_answer(problem: str, n_samples: int = 10) -> str:
    """Generate n independent CoT solutions and take majority vote."""
    answers = []
    for _ in range(n_samples):
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": problem + "\n\nThink step by step, then give the final numeric answer."}],
            temperature=0.7,   # diversity needed for voting to help
        ).choices[0].message.content
        # Extract the last number mentioned as the answer
        nums = re.findall(r'\d+\.?\d*', resp)
        if nums: answers.append(nums[-1])

    if answers:
        most_common = Counter(answers).most_common(1)[0][0]
        return most_common
    return "No consensus"

answer = self_consistent_answer(PROBLEM, n_samples=15)
print(f"Self-consistent answer: {answer} km/h")  # → 70

Self-consistency tradeoff

Self-consistency (majority vote over 10–40 samples) consistently adds 5–10% accuracy on hard benchmarks but multiplies API cost by N. Use it when: the task is high-stakes (math exams, code generation), base accuracy is already 60–80% (voting helps), and you can afford the cost. For 90%+ accuracy tasks or latency-sensitive apps, single-path CoT is sufficient.

How CoT works internally

Why does writing out reasoning steps improve a model's final answer? The mechanism is not fully understood, but research has narrowed it down to three complementary explanations.

Computation allocation: Each generated token is a full forward pass through the model. Generating 100 reasoning tokens means 100× more "compute" applied to the problem before giving an answer — essentially a soft form of multi-step computation that the model's fixed layers can't perform in a single pass.
External working memory: LLMs have no internal state beyond the context window. Writing intermediate results to the context externalizes memory. Without CoT, intermediate values computed in early layers are lost before the answer layer is reached. With CoT, those values persist as tokens in context.
Knowledge pathway activation: Reasoning through a problem step-by-step activates different, more relevant knowledge paths than jumping directly to an answer. The intermediate tokens serve as attention anchors that pull in more precise knowledge from the model's weights.

The faithfulness problem

Research (Turpin et al., 2023) found that models' verbalized CoT reasoning is sometimes "unfaithful" — the stated reasoning doesn't reflect what's actually driving the answer. When a biasing hint is added to the prompt, the model often changes its answer while rationalizing with different-looking but post-hoc reasoning. This matters for debugging: a correct CoT doesn't guarantee correct internal reasoning.

Reasoning models: o1, R1, and extended thinking

In late 2024, a new paradigm emerged: models trained (not just prompted) to reason extensively before answering. These models generate thousands of hidden "thinking" tokens — an internal scratchpad — before producing the visible response.

Model	Lab	Approach	AIME 2024 score	Key capability
GPT-4o	OpenAI	Standard SFT + RLHF	13%	Best general assistant without extended thinking
o1	OpenAI	RL-trained to reason; hidden chain-of-thought	74%	5.7× better on AIME; beats PhD on many domains
o3	OpenAI	Scaled-up o1; adaptive compute budget	96%	Near-perfect on AIME; competitive coding champion level
DeepSeek-R1	DeepSeek	Group Relative Policy Optimization (GRPO) on verifiable rewards	79%	Open-weights reasoning model matching o1
Claude 3.7 Sonnet	Anthropic	Extended thinking mode: configurable token budget for reasoning	~80%	User-visible thinking traces; budget control
Gemini 2.0 Flash Thinking	Google	Distilled reasoning into faster model	~70%	Fastest reasoning model as of early 2025

How they're trained differently

Standard CoT is a prompting technique that works at inference time. Reasoning models are trained with RL using verifiable reward signals — math problems where you can check whether the answer is correct, code that either passes tests or fails. The RL process discovers reasoning strategies that maximize correctness, leading to emergent behaviors like self-correction, exploration, and backtracking that weren't explicitly programmed.

Limits and failure modes of CoT

CoT dramatically improves reasoning, but it is not a silver bullet. Understanding its failure modes is essential for reliable deployment.

Failure mode	Description	Example	Mitigation
Plausible but wrong	Coherent reasoning steps lead to incorrect final answer	"3 × 4 = 12, therefore total is 14" (arithmetic slip)	Self-consistency; external verification
Error compounding	Early mistake amplifies through chain	Wrong unit conversion → all subsequent steps wrong	Structured problem decomposition; re-ask
Spurious reasoning	Stated reasoning is post-hoc rationalization, not actual	Model changes answer when hint added but claims different reasoning	Faithfulness probes; cross-check answers
Verbosity spiral	More steps ≠ more accuracy; model over-complicates	Simple addition solved in 15 verbose steps with error	Instruction: "Be concise, show only key steps"
Hallucinated facts mid-chain	Model invents intermediate values	"Wikipedia says X" — X does not exist	Grounding: tool calls for factual lookups within CoT

Never deploy CoT alone for high-stakes decisions

In medical, legal, or financial contexts, CoT reasoning that reads as authoritative can be confidently wrong. Always: (1) verify numeric outputs independently, (2) use retrieval-grounded CoT for factual claims, (3) add human review for consequential decisions. A model that thinks through 10 steps and reaches a wrong answer is more dangerous than one that says "I'm not sure" — the confident reasoning creates false trust.

Practice questions

What empirical finding by Wei et al. (2022) established chain-of-thought as a major prompting technique? (Answer: Wei et al. (Google, 2022) showed that appending 'Let's think step by step' or providing multi-step reasoning examples dramatically improved performance on arithmetic, commonsense, and symbolic reasoning benchmarks — but ONLY for models above ~100B parameters. For smaller models, CoT hurt performance. This scale threshold finding was critical: it meant CoT is an emergent capability of large models, not a general prompting technique. The paper showed 40–60% accuracy improvements on GSM8K with CoT vs direct answering.)
What is the difference between zero-shot CoT and few-shot CoT prompting? (Answer: Zero-shot CoT: simply append 'Let's think step by step.' to the prompt — no examples provided. Few-shot CoT: provide 3–8 examples of (question, reasoning chain, answer) before the target question. Few-shot CoT outperforms zero-shot CoT on complex reasoning tasks because the examples demonstrate the expected reasoning format and depth. Zero-shot CoT is simpler (no example curation) and often sufficient for well-defined problems. Few-shot CoT is preferred for novel reasoning patterns where the model needs to see the expected structure.)
What is self-consistency decoding and how does it improve CoT performance? (Answer: Self-consistency (Wang et al. 2022): sample k reasoning chains independently (temperature > 0), execute each to get k answers, take majority vote. The diversity of reasoning paths reduces reliance on any single chain that may contain errors. Key insight: multiple correct paths lead to correct answers; multiple incorrect paths rarely agree on the same wrong answer. GSM8K improvement: CoT+self-consistency (k=40): 88% vs CoT alone: 57%. Trade-off: k× more inference compute and API cost.)
What is the 'unfaithful reasoning' problem in chain-of-thought? (Answer: CoT reasoning chains may not reflect the model's actual internal computation. Lanham et al. (2023): models sometimes give incorrect CoT but correct final answers (unused reasoning), and correct CoT but incorrect answers (reasoning not actually guiding the output). Faithfulness of CoT is debated: the reasoning might be post-hoc rationalisation of an answer computed through other mechanisms. This matters for safety: if a model's stated reasoning is unfaithful, we cannot use it to understand or verify model behaviour.)
When does CoT hurt performance compared to direct answering? (Answer: CoT hurts for: (1) Simple factual questions — 'What is the capital of France?' Adding 'Let me reason...' wastes tokens and can introduce errors. (2) Tasks that are pattern-matched from training data — models can answer faster and more accurately without reasoning steps. (3) Small models (<10B) — they lack the capacity to reason effectively in CoT; forced reasoning introduces errors. Rule: use CoT for tasks requiring multi-step computation or reasoning. Skip CoT for simple retrieval, classification, or pattern matching.)

Chain-of-Thought (CoT) Reasoning

The original CoT discovery

Zero-shot vs few-shot CoT

How CoT works internally

Reasoning models: o1, R1, and extended thinking

Limits and failure modes of CoT

Practice questions

Try LumiChats for ₹69

Related Terms