
GPT & Decoder-Only Language Models

The auto-regressive architecture behind ChatGPT, Claude, Llama, and modern AI assistants.


Definition

The GPT (Generative Pre-trained Transformer) family uses a decoder-only transformer architecture with causal (unidirectional) self-attention. Models are pre-trained on next-token prediction over massive text corpora, then fine-tuned for downstream tasks. GPT-1 (2018) introduced the paradigm; GPT-2 (2019) demonstrated emergent generation quality; GPT-3 (2020, 175B parameters) showed few-shot in-context learning; GPT-4 (2023) achieved near-human performance on many benchmarks. The same architecture powers Claude (Anthropic), Llama (Meta), Gemini (Google), and Mistral: decoder-only transformers are the dominant paradigm for foundation models.

GPT architecture vs BERT architecture

| Property | GPT (Decoder-only) | BERT (Encoder-only) |
|---|---|---|
| Attention pattern | Causal (triangular): can only see past tokens | Bidirectional: sees all tokens |
| Pre-training objective | Causal LM: predict next token, P(wₜ \| w₁..wₜ₋₁) | Masked LM: predict 15% masked tokens |
| Token representation | Each token sees only left context | Each token sees full sentence context |
| Good for | Generation, completion, chat, code | Classification, NER, QA, understanding |
| Output layer | Vocabulary head → next-token probabilities | Task-specific head (classifier, span predictor) |
| Examples | GPT-4, Claude, Llama-3, Gemini, Mistral | BERT, RoBERTa, DistilBERT, ELECTRA |
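
The "causal (triangular)" attention pattern in the table can be demonstrated directly. A minimal sketch using `torch.tril` (the function names here are illustrative, not from any particular library):

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Lower-triangular boolean matrix: position i may attend to positions <= i
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def masked_attention_scores(scores: torch.Tensor) -> torch.Tensor:
    # Set disallowed (future) positions to -inf before softmax,
    # so they receive exactly zero attention weight
    mask = causal_mask(scores.size(-1))
    return torch.softmax(scores.masked_fill(~mask, float('-inf')), dim=-1)

scores = torch.zeros(4, 4)            # uniform raw scores, for illustration
weights = masked_attention_scores(scores)
print(weights)
# Row 0 attends only to token 0; row 3 spreads attention over tokens 0-3
```

A bidirectional (BERT-style) encoder simply omits this mask, which is why it cannot generate token-by-token.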

GPT-2 text generation with sampling strategies

from transformers import GPT2LMHeadModel, GPT2Tokenizer, pipeline
import torch

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model     = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

# ── Manual generation: greedy decoding ──
def greedy_generate(prompt: str, max_new: int = 30) -> str:
    inputs = tokenizer(prompt, return_tensors='pt')
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new,
            do_sample=False,          # Greedy: always pick highest probability token
            pad_token_id=tokenizer.eos_token_id
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# ── Sampling strategies ──
def sample_generate(prompt: str, strategy: str = 'top_p') -> str:
    inputs = tokenizer(prompt, return_tensors='pt')
    with torch.no_grad():
        if strategy == 'top_k':
            output_ids = model.generate(
                **inputs, max_new_tokens=50, do_sample=True,
                top_k=50,             # Sample from top-50 tokens
                temperature=0.8,
                pad_token_id=tokenizer.eos_token_id)
        elif strategy == 'top_p':
            output_ids = model.generate(
                **inputs, max_new_tokens=50, do_sample=True,
                top_p=0.92,           # Nucleus sampling: smallest set with 92% probability mass
                temperature=0.8,
                pad_token_id=tokenizer.eos_token_id)
        elif strategy == 'beam':
            output_ids = model.generate(
                **inputs, max_new_tokens=50, num_beams=5,  # Beam search: top-5 sequences
                early_stopping=True,
                pad_token_id=tokenizer.eos_token_id)
        else:
            raise ValueError(f"Unknown strategy: {strategy}")
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

prompt = "The history of artificial intelligence began"
print("Greedy:", greedy_generate(prompt, 20))
print("Top-K: ", sample_generate(prompt, 'top_k'))
print("Top-p: ", sample_generate(prompt, 'top_p'))
print("Beam:  ", sample_generate(prompt, 'beam'))

# ── Logits and next-token distribution ──
inputs = tokenizer("The capital of France is", return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits    # (1, seq_len, vocab_size)

# Probability distribution over next token
next_token_logits = logits[0, -1, :]  # Last token's prediction
probs = torch.softmax(next_token_logits, dim=-1)
top5  = probs.topk(5)
print("\nTop-5 next token probabilities:")
for prob, idx in zip(top5.values, top5.indices):
    token = tokenizer.decode([idx])
    print(f"  {prob:.3f}: '{token}'")
# ' Paris' should be highest probability
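
`model.generate` hides the autoregressive loop that the logits inspection above hints at. A self-contained sketch of that loop, with a toy deterministic next-token function standing in for a real model (the vocabulary and logits here are made up for illustration):

```python
import torch

vocab = ["<eos>", "the", "capital", "of", "France", "is", "Paris"]

def toy_next_logits(token_ids: list[int]) -> torch.Tensor:
    # Stand-in for a language model: deterministically favours the
    # token that follows the last one in `vocab`, wrapping to <eos>
    logits = torch.full((len(vocab),), -10.0)
    logits[(token_ids[-1] + 1) % len(vocab)] = 10.0
    return logits

def greedy_loop(prompt_ids: list[int], max_new: int = 10) -> list[int]:
    ids = list(prompt_ids)
    for _ in range(max_new):
        next_id = int(torch.argmax(toy_next_logits(ids)))  # greedy pick
        ids.append(next_id)                                # feed back in
        if vocab[next_id] == "<eos>":                      # stop condition
            break
    return ids

ids = greedy_loop([1, 2, 3, 4, 5])        # "the capital of France is"
print(" ".join(vocab[i] for i in ids))    # → the capital of France is Paris <eos>
```

With a real model, `toy_next_logits` is replaced by a forward pass over the growing sequence, and the argmax by one of the sampling strategies above; the append-and-repeat structure is identical.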

Scaling laws and the GPT evolution

| Model | Year | Parameters | Training data | Key capability |
|---|---|---|---|---|
| GPT-1 | 2018 | 117M | 7GB BooksCorpus | First large-scale LLM, zero-shot basics |
| GPT-2 | 2019 | 1.5B | 40GB WebText | Coherent long-form text generation |
| GPT-3 | 2020 | 175B | 570GB filtered web | In-context few-shot learning, instruction following |
| InstructGPT | 2022 | 175B | GPT-3 + RLHF | Aligned, helpful assistant behaviour |
| ChatGPT / GPT-4 | 2022-2023 | >1T (est.) | Multi-modal, massive scale | Near-human performance across domains |

Emergent capabilities at scale

GPT-3 and larger models exhibit emergent capabilities — abilities that appeared unpredictably at scale and were not present in smaller models: multi-step arithmetic, code generation, analogical reasoning, and in-context few-shot learning. This is why the field shifted from task-specific models to scaling general-purpose LLMs.

Practice questions

  1. Why can GPT generate text but BERT cannot? (Answer: GPT uses causal attention — each token only attends to previous tokens, enabling left-to-right generation. At each step, GPT predicts the next token from all previous ones. BERT uses bidirectional attention that requires the full sequence — you cannot generate token-by-token because each token's representation depends on all future tokens.)
  2. What is top-p (nucleus) sampling and why is it preferred over top-k? (Answer: Top-p samples from the smallest vocabulary subset whose cumulative probability exceeds p (e.g., 0.9). The number of tokens considered varies dynamically — large for uncertain predictions, small for confident ones. Top-k always samples from k tokens regardless of confidence level — can include many low-probability tokens when k is large or be over-restrictive when k is small.)
  3. GPT-3 demonstrates "in-context few-shot learning." What does this mean? (Answer: You provide a few examples in the prompt (e.g., 2-3 input-output pairs) and GPT-3 generalises the pattern to new inputs — WITHOUT any gradient updates or fine-tuning. The model learns from context at inference time. This is qualitatively different from traditional ML which requires labeled training data.)
  4. What is temperature in LLM sampling and what happens with temperature=0? (Answer: Temperature scales logits before softmax: logits/T. T=1: standard distribution. T<1: sharper — model becomes more confident, less diverse. T=0: equivalent to greedy decoding (always picks highest probability token). T>1: flatter distribution, more random/creative. T=0 for factual tasks, T=0.7-1.0 for creative writing.)
  5. What is the difference between beam search and greedy decoding? (Answer: Greedy: always picks the single highest-probability next token. Local optimal but not globally optimal. Beam search: tracks top-k complete sequences simultaneously. At each step, expands all k beams and keeps the k best overall sequences. Finds better overall sequences at cost of more computation. num_beams=5 is common.)
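
The temperature scaling in question 4 and the nucleus filtering in question 2 can be sketched from scratch in a few lines. This is a minimal illustration, not the `transformers` implementation; the function name `top_p_filter` is ours:

```python
import torch

def top_p_filter(logits: torch.Tensor, top_p: float = 0.9,
                 temperature: float = 1.0) -> torch.Tensor:
    # Temperature scales logits before softmax: lower T -> sharper distribution
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = probs.sort(descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    # Keep the smallest prefix whose cumulative mass reaches top_p
    # (the top token is always kept, since its prefix mass starts at 0)
    keep = cumulative - sorted_probs < top_p
    filtered = torch.zeros_like(probs)
    filtered[sorted_idx[keep]] = sorted_probs[keep]
    return filtered / filtered.sum()   # renormalise; sample from this instead

logits = torch.tensor([2.0, 1.0, 0.5, -1.0, -3.0])
dist = top_p_filter(logits, top_p=0.9)
print(dist)   # low-probability tail is zeroed out; survivors renormalised
```

Note how the number of surviving tokens depends on the shape of the distribution, which is exactly the adaptive behaviour that makes top-p preferable to a fixed top-k.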

On LumiChats

Claude is a decoder-only transformer — every response is generated by predicting the next token, then the next, until the response is complete. The sampling strategy, temperature, and beam width LumiChats uses can be configured for different use cases: deterministic (temperature=0) for code, creative (temperature=0.9) for writing.
