
Temperature & Sampling

The dial that controls how creative — or chaotic — an AI's output is.


Definition

Temperature is a hyperparameter that controls the randomness of token selection during LLM text generation. At temperature 0 the model always picks the highest-probability next token (deterministic, repetitive). At higher temperatures it samples from a wider distribution, producing more varied, creative — but potentially less accurate — responses. Sampling strategies like top-k and top-p (nucleus sampling) work alongside temperature to shape output quality.

How temperature works mathematically

Before sampling, the model's raw output is a vector of logits — one unnormalized score per vocabulary token. The softmax function converts these into probabilities. Temperature T divides every logit before softmax, controlling the sharpness of the distribution:

P(token_i) = exp(z_i / T) / Σ_j exp(z_j / T)

Softmax with temperature T. As T → 0, the highest logit dominates and the distribution collapses to a one-hot. As T → ∞, all tokens become equally likely.

| Temperature | Effect | Distribution shape | Best for |
|---|---|---|---|
| 0 (or ≈0.01) | Always picks the top token — fully deterministic | Sharp spike on one token | Code generation, maths, factual Q&A |
| 0.3–0.5 | Mostly deterministic, small variation | Narrow peak with some spread | Summarization, classification, structured data |
| 0.7–1.0 | Balanced creativity and coherence | Moderate spread | General conversation, essays, explanations |
| 1.2–1.5 | Creative and diverse but occasionally off-track | Flat, wide distribution | Brainstorming, poetry, creative writing |
| >2.0 | Near-random gibberish | Almost uniform — no meaningful signal | Not useful in practice |
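The effect of the formula above is easy to verify directly. This is a minimal sketch (the logit values are made up for illustration) that applies the same softmax-with-temperature to one distribution at several temperatures:

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for a tiny 5-token vocabulary
logits = torch.tensor([4.0, 3.0, 2.0, 1.0, 0.0])

for T in [0.1, 0.7, 1.0, 2.0]:
    # Divide logits by T, then softmax — exactly P_i = exp(z_i/T) / Σ exp(z_j/T)
    probs = F.softmax(logits / T, dim=-1)
    print(f"T={T}: " + " ".join(f"{p:.3f}" for p in probs.tolist()))
```

At T=0.1 nearly all probability mass lands on the top token; at T=2.0 the distribution is visibly flatter, matching the table above.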

The temperature=0 myth

Even at temperature=0, most APIs are not perfectly deterministic due to floating-point non-determinism across GPU hardware and batching. For true reproducibility, also set a fixed seed if the API supports it.
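Locally, the effect of a fixed seed is easy to demonstrate. This sketch uses PyTorch generators, not any particular API's seed parameter:

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5, 0.1])
probs = torch.softmax(logits, dim=-1)

# Two draws with identically seeded generators produce identical samples
g1 = torch.Generator().manual_seed(42)
g2 = torch.Generator().manual_seed(42)
a = torch.multinomial(probs, num_samples=5, replacement=True, generator=g1)
b = torch.multinomial(probs, num_samples=5, replacement=True, generator=g2)
print(torch.equal(a, b))  # True
```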

Top-k and Top-p (nucleus) sampling

Temperature alone isn't enough — even a well-shaped distribution can assign tiny probability to catastrophic tokens. Top-k and top-p sampling truncate the distribution before sampling, preventing rare tokens from ever being picked.

| Strategy | How it works | Hyperparameter | Tradeoff |
|---|---|---|---|
| Greedy | Always pick the highest-probability token | None | Deterministic but repetitive |
| Temperature | Rescale all logits before softmax | T (0–2) | Global — every token is affected |
| Top-k | Restrict sampling to the k most likely tokens; zero out everything else and renormalize | k (e.g. 40–100) | Fixed vocabulary size regardless of how flat/sharp the distribution is |
| Top-p (nucleus) | Keep only the smallest set of tokens whose cumulative probability ≥ p | p (e.g. 0.9–0.95) | Adaptive — keeps more tokens when distribution is flat, fewer when it's sharp |
| Min-p | Keep tokens with probability > p × top-token probability | p (e.g. 0.05) | Newer; scales threshold relative to the model's confidence |
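Top-k is the simplest truncation to implement. Here is a sketch in the same style as the nucleus-sampling code in the next section (the function name is my own):

```python
import torch
import torch.nn.functional as F

def top_k_sample(logits: torch.Tensor, k: int = 40, temperature: float = 1.0) -> int:
    """Sample from only the k highest-probability tokens."""
    scaled = logits / max(temperature, 1e-8)
    # Keep the k largest logits; everything else is implicitly discarded
    topk_vals, topk_idx = torch.topk(scaled, k=min(k, scaled.numel()))
    # Softmax over the survivors renormalizes their probabilities to sum to 1
    probs = F.softmax(topk_vals, dim=-1)
    pos = torch.multinomial(probs, num_samples=1)
    return topk_idx[pos].item()
```

Note the tradeoff from the table: k is fixed, so a flat distribution gets truncated just as aggressively as a sharp one.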

Manual nucleus (top-p) sampling — a minimal version of what happens inside an LLM inference server

```python
import torch
import torch.nn.functional as F

def top_p_sample(logits: torch.Tensor, temperature: float = 0.9, top_p: float = 0.9) -> int:
    """
    Nucleus sampling with temperature.
    logits: raw unnormalized scores, shape (vocab_size,)
    """
    # 1. Apply temperature
    scaled = logits / max(temperature, 1e-8)

    # 2. Softmax → probabilities
    probs = F.softmax(scaled, dim=-1)

    # 3. Sort tokens by probability (highest first)
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)

    # 4. Compute cumulative sum; find the nucleus boundary
    cumulative = torch.cumsum(sorted_probs, dim=-1)

    # 5. Remove tokens once cumulative prob exceeds top_p
    #    (shift by 1 so we always keep at least 1 token)
    remove_mask = cumulative - sorted_probs > top_p
    sorted_probs[remove_mask] = 0.0

    # 6. Renormalize and sample
    sorted_probs /= sorted_probs.sum()
    sampled_pos = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_indices[sampled_pos].item()
```
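The adaptive behaviour of step 5 is worth seeing directly. The following self-contained sketch (with illustrative logits) counts how many tokens survive the nucleus cut for a sharp versus a flat distribution:

```python
import torch
import torch.nn.functional as F

def nucleus_size(logits: torch.Tensor, top_p: float = 0.9) -> int:
    """How many tokens survive top-p truncation for a given distribution."""
    probs = F.softmax(logits, dim=-1)
    sorted_probs, _ = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # A token is kept while the cumulative mass *before* it is still ≤ top_p
    return int((cumulative - sorted_probs <= top_p).sum())

sharp = torch.tensor([10.0, 1.0, 1.0, 1.0, 1.0])  # model is confident
flat = torch.zeros(5)                              # model is uncertain
print(nucleus_size(sharp), nucleus_size(flat))     # 1 5
```

A confident model collapses to near-greedy decoding (one surviving token); an uncertain one keeps the whole candidate set.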

Practical settings by task

| Task type | Recommended temperature | Top-p | Notes |
|---|---|---|---|
| Code generation | 0–0.2 | 0.95 | Low temperature essential; syntax errors compound |
| Factual Q&A / RAG | 0–0.3 | 0.9 | Accuracy over creativity; hallucinations increase with T |
| Summarization | 0.3–0.5 | 0.9 | Some variation acceptable, faithfulness important |
| Chat / customer support | 0.6–0.8 | 0.9 | Natural-sounding without losing coherence |
| Creative writing / brainstorming | 0.9–1.2 | 0.95 | Diversity is desirable; humans can filter |
| Roleplay / fiction | 1.0–1.3 | 0.95–1.0 | Unexpected word choices enhance immersion |

Top-p vs temperature: use both

The most robust production setting combines both: temperature controls the softness of the distribution, top-p prevents low-probability tokens from ever being sampled regardless of temperature. The combination outperforms either alone. A good default: temperature=0.7, top_p=0.9.

Practice questions

  1. What happens when you set temperature=0 in an LLM API call? (Answer: Temperature=0 selects the highest probability token at every step — fully deterministic greedy decoding. The same prompt will always produce the same output (with seed fixed). Use for: factual Q&A, code generation where consistency matters, tests. Avoid for: creative writing, brainstorming, where diversity of outputs is valuable.)
  2. Top-p=0.9 means the model samples from the smallest set of tokens whose cumulative probability reaches at least 90%. For a distribution where one token has probability 0.95, what tokens are eligible? (Answer: Just that one token — it alone accounts for 95% > 90% of probability mass. With top-p=0.9, the smallest set of tokens totalling ≥90% is just this single dominant token. This is the key advantage over top-k: top-p automatically collapses to near-greedy when the model is highly confident, giving creativity only when the model is genuinely uncertain about what comes next.)
  3. What is repetition penalty in LLM sampling and when is it necessary? (Answer: Repetition penalty discounts logits for tokens that have already appeared in the generated text: effective_logit = original_logit / penalty if token appeared, else original_logit. penalty > 1.0 reduces probability of repeating tokens. Default is 1.0 (no penalty). Necessary for models that fall into repetition loops (common without penalty for long generation). Over-penalisation can prevent legitimate word repetition (in lists, technical terms). Typical useful range: 1.1–1.3.)
  4. Why might you use min-p sampling instead of top-k or top-p? (Answer: Min-p: filter tokens whose probability is less than min_p × (probability of the top token). Unlike top-k (fixed count regardless of distribution) or top-p (fixed mass), min-p adapts relative to the strongest option. When top token is at 80%, min-p=0.05 keeps tokens with probability ≥4% — very few. When top token is at 10%, min-p=0.05 keeps tokens with probability ≥0.5% — many options. Maintains consistent relative confidence filtering across all probability distributions.)
  5. A customer service chatbot uses temperature=0.9 for all responses. What problem might arise? (Answer: High temperature introduces randomness — the bot may give inconsistent answers to the same question, different pricing information in different conversations, varying support procedures. For factual, policy-based responses (return policy, pricing, troubleshooting steps), temperature should be low (0.0–0.3). High temperature is appropriate for creative tasks, not deterministic information retrieval. Many production systems use temperature=0 for factual queries.)
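The repetition-penalty formula from question 3 can be sketched as follows. The helper name is my own; real inference libraries commonly use the same convention shown here, where negative logits are multiplied rather than divided so the penalty always pushes a repeated token's score down:

```python
import torch

def apply_repetition_penalty(logits: torch.Tensor, generated_ids: list,
                             penalty: float = 1.2) -> torch.Tensor:
    """Discount the logits of tokens that already appeared in the output."""
    out = logits.clone()
    for tok in set(generated_ids):
        if out[tok] > 0:
            out[tok] = out[tok] / penalty   # positive logit: divide to shrink
        else:
            out[tok] = out[tok] * penalty   # negative logit: multiply to shrink further
    return out
```

With penalty=1.2, a previously generated token with logit 3.0 drops to 2.5, while an unseen token is untouched — a mild nudge away from loops, inside the typical 1.1–1.3 range.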

On LumiChats

In LumiChats, you can adjust temperature for each conversation context — lower for precise research tasks, higher for creative brainstorming. The default is tuned for balanced accuracy and natural-sounding conversation.

