
Temperature & Sampling

The dial that controls how creative — or chaotic — an AI's output is.


Definition

Temperature is a hyperparameter that controls the randomness of token selection during LLM text generation. At temperature 0 the model always picks the highest-probability next token (deterministic, repetitive). At higher temperatures it samples from a wider distribution, producing more varied, creative — but potentially less accurate — responses. Sampling strategies like top-k and top-p (nucleus sampling) work alongside temperature to shape output quality.

How temperature works mathematically

Before sampling, the model's raw output is a vector of logits — one unnormalized score per vocabulary token. The softmax function converts these into probabilities. Temperature T divides every logit before softmax, controlling the sharpness of the distribution:

P(token_i) = exp(z_i / T) / Σ_j exp(z_j / T)

Softmax with temperature T. As T → 0, the highest logit dominates and the distribution collapses to a one-hot. As T → ∞, all tokens become equally likely.

| Temperature | Effect | Distribution shape | Best for |
|---|---|---|---|
| 0 (or ≈0.01) | Always picks the top token — fully deterministic | Sharp spike on one token | Code generation, maths, factual Q&A |
| 0.3–0.5 | Mostly deterministic, small variation | Narrow peak with some spread | Summarization, classification, structured data |
| 0.7–1.0 | Balanced creativity and coherence | Moderate spread | General conversation, essays, explanations |
| 1.2–1.5 | Creative and diverse but occasionally off-track | Flat, wide distribution | Brainstorming, poetry, creative writing |
| >2.0 | Near-random gibberish | Almost uniform — no meaningful signal | Not useful in practice |
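The effect of the formula above is easy to verify directly. This is a minimal sketch (the logit values are made up for illustration) that applies the same softmax-with-temperature to one distribution at several temperatures:

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for a tiny 5-token vocabulary
logits = torch.tensor([4.0, 3.0, 2.0, 1.0, 0.0])

for T in [0.1, 0.7, 1.0, 2.0]:
    # Divide logits by T, then softmax — exactly P_i = exp(z_i/T) / Σ exp(z_j/T)
    probs = F.softmax(logits / T, dim=-1)
    print(f"T={T}: " + " ".join(f"{p:.3f}" for p in probs.tolist()))
```

At T=0.1 nearly all probability mass lands on the top token; at T=2.0 the distribution is visibly flatter, matching the table above.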

The temperature=0 myth

Even at temperature=0, most APIs are not perfectly deterministic due to floating-point non-determinism across GPU hardware and batching. For true reproducibility, also set a fixed seed if the API supports it.
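Locally, the effect of a fixed seed is easy to demonstrate. This sketch uses PyTorch generators, not any particular API's seed parameter:

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5, 0.1])
probs = torch.softmax(logits, dim=-1)

# Two draws with identically seeded generators produce identical samples
g1 = torch.Generator().manual_seed(42)
g2 = torch.Generator().manual_seed(42)
a = torch.multinomial(probs, num_samples=5, replacement=True, generator=g1)
b = torch.multinomial(probs, num_samples=5, replacement=True, generator=g2)
print(torch.equal(a, b))  # True
```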

Top-k and Top-p (nucleus) sampling

Temperature alone isn't enough — even a well-shaped distribution can assign tiny probability to catastrophic tokens. Top-k and top-p sampling truncate the distribution before sampling, preventing rare tokens from ever being picked.

| Strategy | How it works | Hyperparameter | Tradeoff |
|---|---|---|---|
| Greedy | Always pick the highest-probability token | None | Deterministic but repetitive |
| Temperature | Rescale all logits before softmax | T (0–2) | Global — every token is affected |
| Top-k | Restrict sampling to the k most likely tokens; zero out everything else and renormalize | k (e.g. 40–100) | Fixed vocabulary size regardless of how flat/sharp the distribution is |
| Top-p (nucleus) | Keep only the smallest set of tokens whose cumulative probability ≥ p | p (e.g. 0.9–0.95) | Adaptive — keeps more tokens when distribution is flat, fewer when it's sharp |
| Min-p | Keep tokens with probability > p × top-token probability | p (e.g. 0.05) | Newer; scales threshold relative to the model's confidence |
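Top-k is the simplest truncation to implement. Here is a sketch in the same style as the nucleus-sampling code in the next section (the function name is my own):

```python
import torch
import torch.nn.functional as F

def top_k_sample(logits: torch.Tensor, k: int = 40, temperature: float = 1.0) -> int:
    """Sample from only the k highest-probability tokens."""
    scaled = logits / max(temperature, 1e-8)
    # Keep the k largest logits; everything else is implicitly discarded
    topk_vals, topk_idx = torch.topk(scaled, k=min(k, scaled.numel()))
    # Softmax over the survivors renormalizes their probabilities to sum to 1
    probs = F.softmax(topk_vals, dim=-1)
    pos = torch.multinomial(probs, num_samples=1)
    return topk_idx[pos].item()
```

Note the tradeoff from the table: k is fixed, so a flat distribution gets truncated just as aggressively as a sharp one.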

Manual nucleus (top-p) sampling — a minimal version of what happens inside an LLM inference server

```python
import torch
import torch.nn.functional as F

def top_p_sample(logits: torch.Tensor, temperature: float = 0.9, top_p: float = 0.9) -> int:
    """
    Nucleus sampling with temperature.
    logits: raw unnormalized scores, shape (vocab_size,)
    """
    # 1. Apply temperature
    scaled = logits / max(temperature, 1e-8)

    # 2. Softmax → probabilities
    probs = F.softmax(scaled, dim=-1)

    # 3. Sort tokens by probability (highest first)
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)

    # 4. Compute cumulative sum; find the nucleus boundary
    cumulative = torch.cumsum(sorted_probs, dim=-1)

    # 5. Remove tokens once cumulative prob exceeds top_p
    #    (shift by 1 so we always keep at least 1 token)
    remove_mask = cumulative - sorted_probs > top_p
    sorted_probs[remove_mask] = 0.0

    # 6. Renormalize and sample
    sorted_probs /= sorted_probs.sum()
    sampled_pos = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_indices[sampled_pos].item()
```
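The adaptive behaviour of step 5 is worth seeing directly. The following self-contained sketch (with illustrative logits) counts how many tokens survive the nucleus cut for a sharp versus a flat distribution:

```python
import torch
import torch.nn.functional as F

def nucleus_size(logits: torch.Tensor, top_p: float = 0.9) -> int:
    """How many tokens survive top-p truncation for a given distribution."""
    probs = F.softmax(logits, dim=-1)
    sorted_probs, _ = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # A token is kept while the cumulative mass *before* it is still ≤ top_p
    return int((cumulative - sorted_probs <= top_p).sum())

sharp = torch.tensor([10.0, 1.0, 1.0, 1.0, 1.0])  # model is confident
flat = torch.zeros(5)                              # model is uncertain
print(nucleus_size(sharp), nucleus_size(flat))     # 1 5
```

A confident model collapses to near-greedy decoding (one surviving token); an uncertain one keeps the whole candidate set.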

Practical settings by task

| Task type | Recommended temperature | Top-p | Notes |
|---|---|---|---|
| Code generation | 0–0.2 | 0.95 | Low temperature essential; syntax errors compound |
| Factual Q&A / RAG | 0–0.3 | 0.9 | Accuracy over creativity; hallucinations increase with T |
| Summarization | 0.3–0.5 | 0.9 | Some variation acceptable, faithfulness important |
| Chat / customer support | 0.6–0.8 | 0.9 | Natural-sounding without losing coherence |
| Creative writing / brainstorming | 0.9–1.2 | 0.95 | Diversity is desirable; humans can filter |
| Roleplay / fiction | 1.0–1.3 | 0.95–1.0 | Unexpected word choices enhance immersion |

Top-p vs temperature: use both

The most robust production setting combines both: temperature controls the softness of the distribution, top-p prevents low-probability tokens from ever being sampled regardless of temperature. The combination outperforms either alone. A good default: temperature=0.7, top_p=0.9.

Practice questions

  1. What happens when you set temperature=0 in an LLM API call? (Answer: Temperature=0 selects the highest probability token at every step — fully deterministic greedy decoding. The same prompt will always produce the same output (with seed fixed). Use for: factual Q&A, code generation where consistency matters, tests. Avoid for: creative writing, brainstorming, where diversity of outputs is valuable.)
  2. Top-p=0.9 means the model samples from the smallest set of tokens whose cumulative probability reaches at least 90%. For a distribution where one token has probability 0.95, what tokens are eligible? (Answer: Just that one token — it alone accounts for 95% > 90% of probability mass. With top-p=0.9, the smallest set of tokens totalling ≥90% is just this single dominant token. This is the key advantage over top-k: top-p automatically collapses to near-greedy when the model is highly confident, giving creativity only when the model is genuinely uncertain about what comes next.)
  3. What is repetition penalty in LLM sampling and when is it necessary? (Answer: Repetition penalty discounts logits for tokens that have already appeared in the generated text: effective_logit = original_logit / penalty if token appeared, else original_logit. penalty > 1.0 reduces probability of repeating tokens. Default is 1.0 (no penalty). Necessary for models that fall into repetition loops (common without penalty for long generation). Over-penalisation can prevent legitimate word repetition (in lists, technical terms). Typical useful range: 1.1–1.3.)
  4. Why might you use min-p sampling instead of top-k or top-p? (Answer: Min-p: filter tokens whose probability is less than min_p × (probability of the top token). Unlike top-k (fixed count regardless of distribution) or top-p (fixed mass), min-p adapts relative to the strongest option. When top token is at 80%, min-p=0.05 keeps tokens with probability ≥4% — very few. When top token is at 10%, min-p=0.05 keeps tokens with probability ≥0.5% — many options. Maintains consistent relative confidence filtering across all probability distributions.)
  5. A customer service chatbot uses temperature=0.9 for all responses. What problem might arise? (Answer: High temperature introduces randomness — the bot may give inconsistent answers to the same question, different pricing information in different conversations, varying support procedures. For factual, policy-based responses (return policy, pricing, troubleshooting steps), temperature should be low (0.0–0.3). High temperature is appropriate for creative tasks, not deterministic information retrieval. Many production systems use temperature=0 for factual queries.)
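The repetition-penalty formula from question 3 can be sketched as follows. The helper name is my own; real inference libraries commonly use the same convention shown here, where negative logits are multiplied rather than divided so the penalty always pushes a repeated token's score down:

```python
import torch

def apply_repetition_penalty(logits: torch.Tensor, generated_ids: list,
                             penalty: float = 1.2) -> torch.Tensor:
    """Discount the logits of tokens that already appeared in the output."""
    out = logits.clone()
    for tok in set(generated_ids):
        if out[tok] > 0:
            out[tok] = out[tok] / penalty   # positive logit: divide to shrink
        else:
            out[tok] = out[tok] * penalty   # negative logit: multiply to shrink further
    return out
```

With penalty=1.2, a previously generated token with logit 3.0 drops to 2.5, while an unseen token is untouched — a mild nudge away from loops, inside the typical 1.1–1.3 range.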

On LumiChats

In LumiChats, you can adjust temperature for each conversation context — lower for precise research tasks, higher for creative brainstorming. The default is tuned for balanced accuracy and natural-sounding conversation.

