Few-shot learning is the ability of a model to generalize to new tasks given only a few examples — provided either in the prompt (in-context learning) or as a tiny fine-tuning dataset. In-context learning (ICL) is a unique property of large language models: they can perform new tasks described entirely within the input prompt, without updating their weights.
Zero, one, and few-shot defined
The number of examples provided in the prompt defines the "shot" count: zero-shot supplies only a task description, one-shot adds a single worked example, and few-shot provides several.
Zero, one, and few-shot prompting in practice — the same task across all three levels
```python
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    ).choices[0].message.content.strip()

task_description = "Convert English to formal French."
test_input = "Hey, can you send me the report by Friday?"

# ─── Zero-shot: description only, no examples ─────────────────────────────
zero_shot = ask(f"{task_description}\n\n{test_input}")
# Works reasonably for high-resource language pairs; may be casual

# ─── One-shot: one example to establish register and format ───────────────
one_shot = ask(f"""{task_description}
English: Could you please review the attached document?
French: Pourriez-vous examiner le document ci-joint, s'il vous plaît ?
English: {test_input}
French:""")
# Model now knows: formal register, question mark placement, "vous" not "tu"

# ─── Few-shot: 3–5 examples lock in pattern, style, and edge cases ─────────
few_shot = ask(f"""{task_description}
English: The meeting has been rescheduled to 3 PM.
French: La réunion a été reportée à 15h00.
English: Please find the invoice attached to this email.
French: Veuillez trouver la facture en pièce jointe de ce courriel.
English: Could you please review the attached document?
French: Pourriez-vous examiner le document ci-joint, s'il vous plaît ?
English: {test_input}
French:""")
# Best result: consistent formal business register, idiomatically correct

print("Zero-shot:", zero_shot)
print("One-shot: ", one_shot)
print("Few-shot: ", few_shot)
```

GPT-3's breakthrough moment
When GPT-3 launched in 2020, its few-shot performance on tasks it had never been fine-tuned on — translation, arithmetic, question answering — was the defining surprise. A 175B model shown 3 examples of a new task often matched fine-tuned smaller models. This changed how practitioners thought about building NLP systems.
Why in-context learning works
ICL is still not fully understood theoretically. But experiments have revealed surprising facts about what examples actually do — and what they don't do.
- Scale is everything: ICL barely works below ~1B parameters and improves dramatically with scale. It is often described as an emergent capability: performance jumps sharply somewhere in the 10–100B parameter range rather than improving smoothly.
- Wrong labels still help: Surprisingly, providing few-shot examples with incorrect labels (saying "Positive" for a negative review) still improves performance on format-heavy tasks. The model appears to learn output format and structure from examples, not necessarily the semantic label mapping itself (Min et al., 2022).
- Recency bias: Models tend to be influenced more by the last few examples than earlier ones. Example order matters — put your clearest, most representative examples last.
- Example quality >> quantity: 3 high-quality, diverse examples consistently outperform 10 mediocre ones. Duplicates, noisy labels, and redundant examples actively hurt ICL performance.
- Distribution matters: Examples drawn from the same distribution as your actual inputs significantly outperform generic examples. Retrieval-augmented ICL (fetching examples similar to the current query) routinely outperforms random example selection.
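One practical consequence of the recency-bias finding: sort your demonstrations before building the prompt, so the strongest one lands last. A minimal sketch (the quality scores are assumed to come from elsewhere, e.g. held-out validation or similarity to the query):

```python
def order_for_recency(examples: list[str], scores: list[float]) -> list[str]:
    """Sort demonstrations in ascending score order so the highest-scoring
    (clearest, most representative) one sits last in the prompt, where
    recency bias gives it the most influence."""
    ranked = sorted(zip(scores, examples), key=lambda pair: pair[0])
    return [ex for _, ex in ranked]

demos = ["mediocre demo", "best demo", "ok demo"]
quality = [0.2, 0.9, 0.5]
print(order_for_recency(demos, quality))  # → ['mediocre demo', 'ok demo', 'best demo']
```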
ICL vs fine-tuning: when to use which
In-context learning and fine-tuning represent opposite ends of a tradeoff spectrum. Neither is universally better — the right choice depends on your data, latency requirements, and operational constraints.
| Dimension | In-context learning (ICL) | Fine-tuning |
|---|---|---|
| Data required | 0–20 examples in prompt | 100–100,000+ labeled examples |
| Compute needed | None (just inference) | GPU training required ($10–$1000+) |
| Time to deploy | Instant — change prompt, done | Hours to days of training + evaluation |
| Task switching | Instant — swap examples in prompt | Each task requires separate fine-tuned model |
| Context usage | Examples consume token budget | No context cost — knowledge baked into weights |
| Max performance | Lower ceiling — limited by prompt space | Higher ceiling — the model internalizes the task in its weights |
| Interpretability | Examples are visible and editable | Weights are opaque — hard to audit |
| Production cost | Higher per-call cost (longer prompts) | Lower per-call cost once trained |
Practical decision rule
Use ICL for: prototyping and iteration, tasks with <100 training examples, tasks that change frequently, and when interpretability matters. Switch to fine-tuning when: you have 500+ examples and a stable task, production latency is critical, the context window is limiting performance, or you need consistent behavior that ICL can't reliably provide.
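One reasonable way to encode this rule as a sketch (the 100- and 500-example thresholds follow the text but are rules of thumb; treating the operational pressures as a disjunction is an assumption, not a hard rule):

```python
def choose_strategy(n_examples: int, task_stable: bool,
                    latency_critical: bool = False,
                    context_limited: bool = False,
                    icl_inconsistent: bool = False) -> str:
    """Heuristic: fine-tuning needs enough data and a stable task, plus at
    least one operational pressure pushing you off in-context learning."""
    if (n_examples >= 500 and task_stable
            and (latency_critical or context_limited or icl_inconsistent)):
        return "fine-tuning"
    return "in-context learning"

print(choose_strategy(50, task_stable=False))                          # → in-context learning
print(choose_strategy(2000, task_stable=True, latency_critical=True))  # → fine-tuning
```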
Example selection strategies
Not all in-context examples are equally effective. The gap between random and retrieval-based example selection can reach 10–20 accuracy points on hard tasks.
Retrieval-augmented ICL — dynamically select the most relevant examples for each input query
```python
from openai import OpenAI
import numpy as np

client = OpenAI()

def embed(text: str) -> list[float]:
    """Get embedding vector for semantic similarity computation."""
    return client.embeddings.create(
        model="text-embedding-3-small", input=text
    ).data[0].embedding

def cosine_similarity(a: list, b: list) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Your labeled example pool (large)
example_pool = [
    {"input": "Translate 'bonjour' to English", "output": "Hello"},
    {"input": "Translate 'merci beaucoup' to English", "output": "Thank you very much"},
    {"input": "What does 'au revoir' mean?", "output": "Goodbye"},
    {"input": "How do you say 'library' in French?", "output": "bibliothèque"},
    # ... potentially hundreds of examples
]

# Pre-compute embeddings for all pool examples (do this once, cache it)
pool_embeddings = [embed(ex["input"]) for ex in example_pool]

def retrieve_examples(query: str, k: int = 3) -> list[dict]:
    """Retrieve the k most semantically similar examples to the query."""
    query_emb = embed(query)
    scores = [cosine_similarity(query_emb, ex_emb) for ex_emb in pool_embeddings]
    top_k_indices = np.argsort(scores)[-k:][::-1]
    return [example_pool[i] for i in top_k_indices]

def few_shot_with_retrieval(query: str, k: int = 3) -> str:
    """Build a few-shot prompt using retrieved examples."""
    examples = retrieve_examples(query, k=k)
    prompt_parts = []
    for ex in examples:
        prompt_parts.append(f"Input: {ex['input']}\nOutput: {ex['output']}")
    prompt_parts.append(f"Input: {query}\nOutput:")
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "\n\n".join(prompt_parts)}],
        temperature=0,
    ).choices[0].message.content.strip()

result = few_shot_with_retrieval("Translate 'café au lait' to English")
print(result)  # → Coffee with milk (retrieved similar translation examples)
```

Diversity + relevance = best ICL
Research (Zhang et al., 2022) shows that the optimal example set balances relevance (semantically similar to the query) and diversity (covering different sub-cases, edge cases, and output formats). A retrieval system that fetches the top-3 most similar examples can accidentally select 3 nearly identical examples — add a diversity constraint (e.g., maximum marginal relevance) for further improvement.
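A minimal sketch of maximum-marginal-relevance selection over precomputed embeddings (the toy 2-D vectors and the `lam` tradeoff weight are illustrative, not tuned values; `lam=1.0` reduces to plain top-k retrieval):

```python
import numpy as np

def mmr_select(query_emb, pool_embs, k: int = 3, lam: float = 0.7) -> list[int]:
    """Greedily pick examples that are similar to the query (relevance) but
    dissimilar to examples already chosen (diversity)."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    relevance = [cos(query_emb, e) for e in pool_embs]
    selected: list[int] = []
    candidates = list(range(len(pool_embs)))
    while candidates and len(selected) < k:
        best = max(
            candidates,
            key=lambda i: lam * relevance[i]
            - (1 - lam) * max((cos(pool_embs[i], pool_embs[j]) for j in selected),
                              default=0.0),
        )
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy 2-D "embeddings": two near-duplicates and one distinct vector
pool = np.array([[1.0, 0.0], [0.9, 0.3], [0.0, 1.0]])
query = np.array([1.0, 0.1])
print(mmr_select(query, pool, k=2, lam=1.0))  # → [0, 1] (pure relevance)
print(mmr_select(query, pool, k=2, lam=0.3))  # → [0, 2] (diversity wins)
```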
Meta-learning: learning to learn
Meta-learning is the broader ML paradigm that few-shot learning belongs to: training models that can rapidly adapt to new tasks with minimal examples. Unlike ICL (which requires no weight updates), meta-learning approaches explicitly optimize for fast task adaptation.
| Method | Core idea | How it adapts | Typical use case |
|---|---|---|---|
| MAML (Model-Agnostic Meta-Learning) | Optimize weights to be easily fine-tunable in 1–5 gradient steps | K-shot gradient descent at test time | Robotics, few-shot classification |
| Prototypical Networks | Learn embedding space where class = mean of its examples (prototype) | Classify by nearest prototype in embedding space | Image classification, NLP classification |
| Matching Networks | Attention-weighted sum over training examples for classification | Attention over support set at inference | Few-shot image/text classification |
| Reptile (OpenAI) | Simplified MAML: repeatedly fine-tune on tasks, update toward fine-tuned weights | Same as MAML but simpler implementation | On-device personalization |
| In-context learning (GPT style) | Pretraining implicitly learns to use context as a "task description" | No weight updates — uses context window | General NLP, instruction following |
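To make the Prototypical Networks row concrete, here is a minimal sketch of the inference step only; the learned embedding network is assumed and replaced by hand-made 2-D vectors:

```python
import numpy as np

def prototypical_classify(support_embs, support_labels, query_emb):
    """Assign the query to the class whose prototype (mean of its support
    embeddings) is nearest in Euclidean distance."""
    classes = sorted(set(support_labels))
    prototypes = {
        c: np.mean([e for e, y in zip(support_embs, support_labels) if y == c], axis=0)
        for c in classes
    }
    return min(classes, key=lambda c: np.linalg.norm(query_emb - prototypes[c]))

# 2-way 2-shot toy episode with hand-made "embeddings"
support = [np.array([0.0, 0.0]), np.array([0.0, 2.0]),
           np.array([5.0, 5.0]), np.array([5.0, 7.0])]
labels = ["refund", "refund", "shipping", "shipping"]
print(prototypical_classify(support, labels, np.array([1.0, 1.0])))  # → refund
```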
Why meta-learning matters beyond LLMs
Meta-learning is critical in domains where few-shot ICL is insufficient but traditional fine-tuning is impossible due to data scarcity: rare disease diagnosis (5–10 patient examples per condition), drug discovery (new molecular target with few known binders), personalized recommendation (new user with 3 interactions), robotics (new manipulation task with 5 demonstrations). For these domains, MAML-style approaches and prototypical networks remain state-of-the-art.
Practice questions
- What is the difference between N-way K-shot learning and standard supervised learning? (Answer: N-way K-shot: at test time, classify examples into N new classes (never seen during meta-training), given only K examples per class (often K=1 or K=5). The model must generalize from K examples to classify new instances of those classes. Standard supervised learning: train on thousands of examples of each class, classify into those same classes. Few-shot learning tests the model's ability to rapidly adapt to new classes from minimal examples.)
- What are Prototypical Networks and how do they perform few-shot classification? (Answer: Prototypical Networks: compute a class prototype = mean embedding of the K support examples for each class. Classify query points by nearest prototype in embedding space. Training: episodic training — simulate few-shot episodes, train the embedding space so that intra-class examples are close and inter-class examples are far. At test time: new classes are represented by their prototype embeddings. Simple, effective, and interpretable. Assumption: each class can be represented by a single mean vector (holds well for uni-modal class distributions).)
- What is MAML (Model-Agnostic Meta-Learning) and how does it enable fast adaptation? (Answer: MAML: meta-learn initial model parameters θ such that a few gradient steps on a new task's support set produce a well-performing model. Outer loop: across many tasks, update θ to minimize validation loss after inner-loop adaptation. Inner loop: for each task, take k gradient steps from θ → θ'. The outer loop explicitly optimizes for good initialization, not good average performance. At test time: start from θ, take k gradient steps on the new task's support set. Requires second-order gradients (expensive but model-agnostic).)
- What is the difference between few-shot learning with in-context learning vs meta-learning? (Answer: In-context learning: provide k examples in the prompt; the LLM adapts without any weight updates. Fast but limited by context window. Requires a large pretrained model that has developed ICL capability. Meta-learning: explicitly train a model to learn quickly — requires a training phase with many episodic few-shot tasks. Results in a specialized model architecture optimized for fast adaptation. ICL happens at inference time; meta-learning at training time. Modern practice: use ICL for large LLMs, meta-learning for specialized smaller models.)
- What is data augmentation in the context of few-shot learning? (Answer: With K=1 or K=5 support examples, standard augmentation (random crops, flips) helps but is insufficient. Advanced strategies: (1) Feature hallucination — train a generator to hallucinate additional class embeddings from the few support examples (GAN or VAE-based). (2) Cross-modal augmentation — use LLM descriptions of the class to generate synthetic embeddings. (3) Task augmentation — generate novel few-shot tasks from seen classes by resampling and relabeling. (4) Mixup in embedding space — interpolate between the few support examples to expand the class distribution.)
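The MAML/Reptile mechanics described in the questions above can be illustrated on a deliberately trivial task family: each task is "regress to the constant a" with loss (theta − a)². All hyperparameters here are illustrative:

```python
import numpy as np

def reptile_meta_train(task_targets, meta_steps=200, inner_steps=5,
                       inner_lr=0.3, meta_lr=0.1, seed=0):
    """Reptile on a toy task family. Inner loop: a few SGD steps from theta
    on the sampled task; outer loop: nudge theta toward the adapted weights."""
    rng = np.random.default_rng(seed)
    theta = 0.0
    for _ in range(meta_steps):
        a = rng.choice(task_targets)           # sample a task
        w = theta
        for _ in range(inner_steps):           # inner-loop adaptation
            w -= inner_lr * 2 * (w - a)        # gradient of (w - a)^2
        theta += meta_lr * (w - theta)         # Reptile meta-update
    return theta

theta = reptile_meta_train([1.0, 3.0])
print(theta)  # lands near 2.0, between the two tasks' optima
```

Because each task's optimum is reachable within the inner loop, the meta-learned initialization settles near the mean of the task targets: a starting point from which every task is a few gradient steps away.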
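Strategy (4), mixup in embedding space, can be sketched directly (the Beta(α, α) mixing distribution follows the original mixup recipe; the toy support vectors are illustrative):

```python
import numpy as np

def mixup_augment(support_embs, n_new=10, alpha=0.4, rng=None):
    """Synthesize extra within-class points by interpolating random pairs of
    support embeddings, with mixing weights drawn from Beta(alpha, alpha)."""
    rng = rng or np.random.default_rng(0)
    embs = np.asarray(support_embs, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i, j = rng.integers(0, len(embs), size=2)
        lam = rng.beta(alpha, alpha)
        synthetic.append(lam * embs[i] + (1 - lam) * embs[j])
    return np.stack(synthetic)

# Two support embeddings for one class → 5 synthetic points on the segment between them
print(mixup_augment([[0.0, 0.0], [2.0, 2.0]], n_new=5))
```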