
GPT & Decoder-Only Language Models

The auto-regressive architecture behind ChatGPT, Claude, Llama, and modern AI assistants.


Definition

The GPT (Generative Pre-trained Transformer) family uses a decoder-only transformer architecture with causal (unidirectional) self-attention. Models are pre-trained on next-token prediction over massive text corpora, then fine-tuned for downstream tasks. GPT-1 (2018) introduced the paradigm; GPT-2 (2019) demonstrated emergent generation quality; GPT-3 (2020, 175B parameters) showed few-shot in-context learning; GPT-4 (2023) achieved near-human performance on many benchmarks. The same architecture powers Claude (Anthropic), Llama (Meta), Gemini (Google), and Mistral: decoder-only transformers are the dominant paradigm for foundation models.

GPT architecture vs BERT architecture

| Property | GPT (Decoder-only) | BERT (Encoder-only) |
|---|---|---|
| Attention pattern | Causal (triangular): can only see past tokens | Bidirectional: sees all tokens |
| Pre-training objective | Causal LM: predict next token, P(wₜ \| w₁..wₜ₋₁) | Masked LM: predict 15% masked tokens |
| Token representation | Each token sees only left context | Each token sees full sentence context |
| Good for | Generation, completion, chat, code | Classification, NER, QA, understanding |
| Output layer | Vocabulary head → next-token probabilities | Task-specific head (classifier, span predictor) |
| Examples | GPT-4, Claude, Llama-3, Gemini, Mistral | BERT, RoBERTa, DistilBERT, ELECTRA |
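
The "causal (triangular)" attention pattern in the table can be demonstrated directly. A minimal sketch using `torch.tril` (the function names here are illustrative, not from any particular library):

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Lower-triangular boolean matrix: position i may attend to positions <= i
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def masked_attention_scores(scores: torch.Tensor) -> torch.Tensor:
    # Set disallowed (future) positions to -inf before softmax,
    # so they receive exactly zero attention weight
    mask = causal_mask(scores.size(-1))
    return torch.softmax(scores.masked_fill(~mask, float('-inf')), dim=-1)

scores = torch.zeros(4, 4)            # uniform raw scores, for illustration
weights = masked_attention_scores(scores)
print(weights)
# Row 0 attends only to token 0; row 3 spreads attention over tokens 0-3
```

A bidirectional (BERT-style) encoder simply omits this mask, which is why it cannot generate token-by-token.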

GPT-2 text generation with sampling strategies

from transformers import GPT2LMHeadModel, GPT2Tokenizer, pipeline
import torch

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model     = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

# ── Manual generation: greedy decoding ──
def greedy_generate(prompt: str, max_new: int = 30) -> str:
    inputs = tokenizer(prompt, return_tensors='pt')
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new,
            do_sample=False,          # Greedy: always pick highest probability token
            pad_token_id=tokenizer.eos_token_id
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# ── Sampling strategies ──
def sample_generate(prompt: str, strategy: str = 'top_p') -> str:
    inputs = tokenizer(prompt, return_tensors='pt')
    with torch.no_grad():
        if strategy == 'top_k':
            output_ids = model.generate(
                **inputs, max_new_tokens=50, do_sample=True,
                top_k=50,             # Sample from top-50 tokens
                temperature=0.8,
                pad_token_id=tokenizer.eos_token_id)
        elif strategy == 'top_p':
            output_ids = model.generate(
                **inputs, max_new_tokens=50, do_sample=True,
                top_p=0.92,           # Nucleus sampling: smallest set with 92% probability mass
                temperature=0.8,
                pad_token_id=tokenizer.eos_token_id)
        elif strategy == 'beam':
            output_ids = model.generate(
                **inputs, max_new_tokens=50, num_beams=5,  # Beam search: top-5 sequences
                early_stopping=True,
                pad_token_id=tokenizer.eos_token_id)
        else:
            raise ValueError(f"Unknown strategy: {strategy}")
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

prompt = "The history of artificial intelligence began"
print("Greedy:", greedy_generate(prompt, 20))
print("Top-K: ", sample_generate(prompt, 'top_k'))
print("Top-p: ", sample_generate(prompt, 'top_p'))
print("Beam:  ", sample_generate(prompt, 'beam'))

# ── Logits and next-token distribution ──
inputs = tokenizer("The capital of France is", return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits    # (1, seq_len, vocab_size)

# Probability distribution over next token
next_token_logits = logits[0, -1, :]  # Last token's prediction
probs = torch.softmax(next_token_logits, dim=-1)
top5  = probs.topk(5)
print("\nTop-5 next token probabilities:")
for prob, idx in zip(top5.values, top5.indices):
    token = tokenizer.decode([idx])
    print(f"  {prob:.3f}: '{token}'")
# ' Paris' should be highest probability
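
`model.generate` hides the autoregressive loop that the logits inspection above hints at. A self-contained sketch of that loop, with a toy deterministic next-token function standing in for a real model (the vocabulary and logits here are made up for illustration):

```python
import torch

vocab = ["<eos>", "the", "capital", "of", "France", "is", "Paris"]

def toy_next_logits(token_ids: list[int]) -> torch.Tensor:
    # Stand-in for a language model: deterministically favours the
    # token that follows the last one in `vocab`, wrapping to <eos>
    logits = torch.full((len(vocab),), -10.0)
    logits[(token_ids[-1] + 1) % len(vocab)] = 10.0
    return logits

def greedy_loop(prompt_ids: list[int], max_new: int = 10) -> list[int]:
    ids = list(prompt_ids)
    for _ in range(max_new):
        next_id = int(torch.argmax(toy_next_logits(ids)))  # greedy pick
        ids.append(next_id)                                # feed back in
        if vocab[next_id] == "<eos>":                      # stop condition
            break
    return ids

ids = greedy_loop([1, 2, 3, 4, 5])        # "the capital of France is"
print(" ".join(vocab[i] for i in ids))    # → the capital of France is Paris <eos>
```

With a real model, `toy_next_logits` is replaced by a forward pass over the growing sequence, and the argmax by one of the sampling strategies above; the append-and-repeat structure is identical.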

Scaling laws and the GPT evolution

| Model | Year | Parameters | Training data | Key capability |
|---|---|---|---|---|
| GPT-1 | 2018 | 117M | 7GB BooksCorpus | First large-scale LLM, zero-shot basics |
| GPT-2 | 2019 | 1.5B | 40GB WebText | Coherent long-form text generation |
| GPT-3 | 2020 | 175B | 570GB filtered web | In-context few-shot learning, instruction following |
| InstructGPT | 2022 | 175B | GPT-3 + RLHF | Aligned, helpful assistant behaviour |
| ChatGPT / GPT-4 | 2022-2023 | >1T (est.) | Multi-modal, massive scale | Near-human performance across domains |

Emergent capabilities at scale

GPT-3 and larger models exhibit emergent capabilities — abilities that appeared unpredictably at scale and were not present in smaller models: multi-step arithmetic, code generation, analogical reasoning, and in-context few-shot learning. This is why the field shifted from task-specific models to scaling general-purpose LLMs.

Practice questions

  1. Why can GPT generate text but BERT cannot? (Answer: GPT uses causal attention — each token only attends to previous tokens, enabling left-to-right generation. At each step, GPT predicts the next token from all previous ones. BERT uses bidirectional attention that requires the full sequence — you cannot generate token-by-token because each token's representation depends on all future tokens.)
  2. What is top-p (nucleus) sampling and why is it preferred over top-k? (Answer: Top-p samples from the smallest vocabulary subset whose cumulative probability exceeds p (e.g., 0.9). The number of tokens considered varies dynamically — large for uncertain predictions, small for confident ones. Top-k always samples from k tokens regardless of confidence level — can include many low-probability tokens when k is large or be over-restrictive when k is small.)
  3. GPT-3 demonstrates "in-context few-shot learning." What does this mean? (Answer: You provide a few examples in the prompt (e.g., 2-3 input-output pairs) and GPT-3 generalises the pattern to new inputs — WITHOUT any gradient updates or fine-tuning. The model learns from context at inference time. This is qualitatively different from traditional ML which requires labeled training data.)
  4. What is temperature in LLM sampling and what happens with temperature=0? (Answer: Temperature scales logits before softmax: logits/T. T=1: standard distribution. T<1: sharper — model becomes more confident, less diverse. T=0: equivalent to greedy decoding (always picks highest probability token). T>1: flatter distribution, more random/creative. T=0 for factual tasks, T=0.7-1.0 for creative writing.)
  5. What is the difference between beam search and greedy decoding? (Answer: Greedy: always picks the single highest-probability next token. Local optimal but not globally optimal. Beam search: tracks top-k complete sequences simultaneously. At each step, expands all k beams and keeps the k best overall sequences. Finds better overall sequences at cost of more computation. num_beams=5 is common.)
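
The temperature scaling in question 4 and the nucleus filtering in question 2 can be sketched from scratch in a few lines. This is a minimal illustration, not the `transformers` implementation; the function name `top_p_filter` is ours:

```python
import torch

def top_p_filter(logits: torch.Tensor, top_p: float = 0.9,
                 temperature: float = 1.0) -> torch.Tensor:
    # Temperature scales logits before softmax: lower T -> sharper distribution
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = probs.sort(descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    # Keep the smallest prefix whose cumulative mass reaches top_p
    # (the top token is always kept, since its prefix mass starts at 0)
    keep = cumulative - sorted_probs < top_p
    filtered = torch.zeros_like(probs)
    filtered[sorted_idx[keep]] = sorted_probs[keep]
    return filtered / filtered.sum()   # renormalise; sample from this instead

logits = torch.tensor([2.0, 1.0, 0.5, -1.0, -3.0])
dist = top_p_filter(logits, top_p=0.9)
print(dist)   # low-probability tail is zeroed out; survivors renormalised
```

Note how the number of surviving tokens depends on the shape of the distribution, which is exactly the adaptive behaviour that makes top-p preferable to a fixed top-k.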

On LumiChats

Claude is a decoder-only transformer — every response is generated by predicting the next token, then the next, until the response is complete. The sampling strategy, temperature, and beam width LumiChats uses can be configured for different use cases: deterministic (temperature=0) for code, creative (temperature=0.9) for writing.
