Few-shot learning is the ability of a model to generalize to new tasks given only a few examples — provided either in the prompt (in-context learning) or as a tiny fine-tuning dataset. In-context learning (ICL) is a unique property of large language models: they can perform new tasks described entirely within the input prompt, without updating their weights.
Zero, one, and few-shot defined
The number of examples provided in the prompt defines the "shot" count: zero-shot supplies only a task description, one-shot adds a single worked example, and few-shot provides several.
Zero, one, and few-shot prompting in practice — the same task across all three levels
```python
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    ).choices[0].message.content.strip()

task_description = "Convert English to formal French."
test_input = "Hey, can you send me the report by Friday?"

# ─── Zero-shot: description only, no examples ─────────────────────────────
zero_shot = ask(f"{task_description}\n\n{test_input}")
# Works reasonably for high-resource language pairs; may be casual

# ─── One-shot: one example to establish register and format ───────────────
one_shot = ask(f"""{task_description}
English: Could you please review the attached document?
French: Pourriez-vous examiner le document ci-joint, s'il vous plaît ?
English: {test_input}
French:""")
# Model now knows: formal register, question mark placement, "vous" not "tu"

# ─── Few-shot: 3–5 examples lock in pattern, style, and edge cases ─────────
few_shot = ask(f"""{task_description}
English: The meeting has been rescheduled to 3 PM.
French: La réunion a été reportée à 15h00.
English: Please find the invoice attached to this email.
French: Veuillez trouver la facture en pièce jointe de ce courriel.
English: Could you please review the attached document?
French: Pourriez-vous examiner le document ci-joint, s'il vous plaît ?
English: {test_input}
French:""")
# Best result: consistent formal business register, idiomatically correct

print("Zero-shot:", zero_shot)
print("One-shot: ", one_shot)
print("Few-shot: ", few_shot)
```

GPT-3's breakthrough moment
When GPT-3 launched in 2020, its few-shot performance on tasks it had never been fine-tuned on — translation, arithmetic, question answering — was the defining surprise. A 175B model shown 3 examples of a new task often matched fine-tuned smaller models. This changed how practitioners thought about building NLP systems.
Why in-context learning works
ICL is still not fully understood theoretically. But experiments have revealed surprising facts about what examples actually do — and what they don't do.
- Scale is everything: ICL barely works below ~1B parameters and improves dramatically with scale. It is often described as an emergent capability: performance jumps sharply somewhere in the 10–100B parameter range rather than improving smoothly.
- Wrong labels still help: Surprisingly, providing few-shot examples with incorrect labels (saying "Positive" for a negative review) still improves performance on format-heavy tasks. The model appears to learn output format and structure from examples, not necessarily the semantic label mapping itself (Min et al., 2022).
- Recency bias: Models tend to be influenced more by the last few examples than earlier ones. Example order matters — put your clearest, most representative examples last.
- Example quality >> quantity: 3 high-quality, diverse examples consistently outperform 10 mediocre ones. Duplicates, noisy labels, and redundant examples actively hurt ICL performance.
- Distribution matters: Examples drawn from the same distribution as your actual inputs significantly outperform generic examples. Retrieval-augmented ICL (fetching examples similar to the current query) routinely outperforms random example selection.
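One practical consequence of the recency-bias finding: sort your demonstrations before building the prompt, so the strongest one lands last. A minimal sketch (the quality scores are assumed to come from elsewhere, e.g. held-out validation or similarity to the query):

```python
def order_for_recency(examples: list[str], scores: list[float]) -> list[str]:
    """Sort demonstrations in ascending score order so the highest-scoring
    (clearest, most representative) one sits last in the prompt, where
    recency bias gives it the most influence."""
    ranked = sorted(zip(scores, examples), key=lambda pair: pair[0])
    return [ex for _, ex in ranked]

demos = ["mediocre demo", "best demo", "ok demo"]
quality = [0.2, 0.9, 0.5]
print(order_for_recency(demos, quality))  # → ['mediocre demo', 'ok demo', 'best demo']
```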
ICL vs fine-tuning: when to use which
In-context learning and fine-tuning represent opposite ends of a tradeoff spectrum. Neither is universally better — the right choice depends on your data, latency requirements, and operational constraints.
| Dimension | In-context learning (ICL) | Fine-tuning |
|---|---|---|
| Data required | 0–20 examples in prompt | 100–100,000+ labeled examples |
| Compute needed | None (just inference) | GPU training required ($10–$1000+) |
| Time to deploy | Instant — change prompt, done | Hours to days of training + evaluation |
| Task switching | Instant — swap examples in prompt | Each task requires separate fine-tuned model |
| Context usage | Examples consume token budget | No context cost — knowledge baked into weights |
| Max performance | Lower ceiling — limited by prompt space | Higher ceiling — the model internalizes the task in its weights |
| Interpretability | Examples are visible and editable | Weights are opaque — hard to audit |
| Production cost | Higher per-call cost (longer prompts) | Lower per-call cost once trained |
Practical decision rule
Use ICL for: prototyping and iteration, tasks with <100 training examples, tasks that change frequently, and when interpretability matters. Switch to fine-tuning when: you have 500+ examples and a stable task, production latency is critical, the context window is limiting performance, or you need consistent behavior that ICL can't reliably provide.
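One reasonable way to encode this rule as a sketch (the 100- and 500-example thresholds follow the text but are rules of thumb; treating the operational pressures as a disjunction is an assumption, not a hard rule):

```python
def choose_strategy(n_examples: int, task_stable: bool,
                    latency_critical: bool = False,
                    context_limited: bool = False,
                    icl_inconsistent: bool = False) -> str:
    """Heuristic: fine-tuning needs enough data and a stable task, plus at
    least one operational pressure pushing you off in-context learning."""
    if (n_examples >= 500 and task_stable
            and (latency_critical or context_limited or icl_inconsistent)):
        return "fine-tuning"
    return "in-context learning"

print(choose_strategy(50, task_stable=False))                          # → in-context learning
print(choose_strategy(2000, task_stable=True, latency_critical=True))  # → fine-tuning
```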
Example selection strategies
Not all in-context examples are equally effective. The gap between random and retrieval-based example selection can reach 10–20 accuracy points on hard tasks.
Retrieval-augmented ICL — dynamically select the most relevant examples for each input query
```python
from openai import OpenAI
import numpy as np

client = OpenAI()

def embed(text: str) -> list[float]:
    """Get embedding vector for semantic similarity computation."""
    return client.embeddings.create(
        model="text-embedding-3-small", input=text
    ).data[0].embedding

def cosine_similarity(a: list, b: list) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Your labeled example pool (large)
example_pool = [
    {"input": "Translate 'bonjour' to English", "output": "Hello"},
    {"input": "Translate 'merci beaucoup' to English", "output": "Thank you very much"},
    {"input": "What does 'au revoir' mean?", "output": "Goodbye"},
    {"input": "How do you say 'library' in French?", "output": "bibliothèque"},
    # ... potentially hundreds of examples
]

# Pre-compute embeddings for all pool examples (do this once, cache it)
pool_embeddings = [embed(ex["input"]) for ex in example_pool]

def retrieve_examples(query: str, k: int = 3) -> list[dict]:
    """Retrieve the k most semantically similar examples to the query."""
    query_emb = embed(query)
    scores = [cosine_similarity(query_emb, ex_emb) for ex_emb in pool_embeddings]
    top_k_indices = np.argsort(scores)[-k:][::-1]
    return [example_pool[i] for i in top_k_indices]

def few_shot_with_retrieval(query: str, k: int = 3) -> str:
    """Build a few-shot prompt using retrieved examples."""
    examples = retrieve_examples(query, k=k)
    prompt_parts = []
    for ex in examples:
        prompt_parts.append(f"Input: {ex['input']}\nOutput: {ex['output']}")
    prompt_parts.append(f"Input: {query}\nOutput:")
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "\n\n".join(prompt_parts)}],
        temperature=0,
    ).choices[0].message.content.strip()

result = few_shot_with_retrieval("Translate 'café au lait' to English")
print(result)  # → Coffee with milk (retrieved similar translation examples)
```

Diversity + relevance = best ICL
Research (Zhang et al., 2022) shows that the optimal example set balances relevance (semantically similar to the query) and diversity (covering different sub-cases, edge cases, and output formats). A retrieval system that fetches the top-3 most similar examples can accidentally select 3 nearly identical examples — add a diversity constraint (e.g., maximum marginal relevance) for further improvement.
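A minimal sketch of maximum-marginal-relevance selection over precomputed embeddings (the toy 2-D vectors and the `lam` tradeoff weight are illustrative, not tuned values; `lam=1.0` reduces to plain top-k retrieval):

```python
import numpy as np

def mmr_select(query_emb, pool_embs, k: int = 3, lam: float = 0.7) -> list[int]:
    """Greedily pick examples that are similar to the query (relevance) but
    dissimilar to examples already chosen (diversity)."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    relevance = [cos(query_emb, e) for e in pool_embs]
    selected: list[int] = []
    candidates = list(range(len(pool_embs)))
    while candidates and len(selected) < k:
        best = max(
            candidates,
            key=lambda i: lam * relevance[i]
            - (1 - lam) * max((cos(pool_embs[i], pool_embs[j]) for j in selected),
                              default=0.0),
        )
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy 2-D "embeddings": two near-duplicates and one distinct vector
pool = np.array([[1.0, 0.0], [0.9, 0.3], [0.0, 1.0]])
query = np.array([1.0, 0.1])
print(mmr_select(query, pool, k=2, lam=1.0))  # → [0, 1] (pure relevance)
print(mmr_select(query, pool, k=2, lam=0.3))  # → [0, 2] (diversity wins)
```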
Meta-learning: learning to learn
Meta-learning is the broader ML paradigm that few-shot learning belongs to: training models that can rapidly adapt to new tasks with minimal examples. Unlike ICL (which requires no weight updates), meta-learning approaches explicitly optimize for fast task adaptation.
| Method | Core idea | How it adapts | Typical use case |
|---|---|---|---|
| MAML (Model-Agnostic Meta-Learning) | Optimize weights to be easily fine-tunable in 1–5 gradient steps | K-shot gradient descent at test time | Robotics, few-shot classification |
| Prototypical Networks | Learn embedding space where class = mean of its examples (prototype) | Classify by nearest prototype in embedding space | Image classification, NLP classification |
| Matching Networks | Attention-weighted sum over training examples for classification | Attention over support set at inference | Few-shot image/text classification |
| Reptile (OpenAI) | Simplified MAML: repeatedly fine-tune on tasks, update toward fine-tuned weights | Same as MAML but simpler implementation | On-device personalization |
| In-context learning (GPT style) | Pretraining implicitly learns to use context as a "task description" | No weight updates — uses context window | General NLP, instruction following |
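To make the Prototypical Networks row concrete, here is a minimal sketch of the inference step only; the learned embedding network is assumed and replaced by hand-made 2-D vectors:

```python
import numpy as np

def prototypical_classify(support_embs, support_labels, query_emb):
    """Assign the query to the class whose prototype (mean of its support
    embeddings) is nearest in Euclidean distance."""
    classes = sorted(set(support_labels))
    prototypes = {
        c: np.mean([e for e, y in zip(support_embs, support_labels) if y == c], axis=0)
        for c in classes
    }
    return min(classes, key=lambda c: np.linalg.norm(query_emb - prototypes[c]))

# 2-way 2-shot toy episode with hand-made "embeddings"
support = [np.array([0.0, 0.0]), np.array([0.0, 2.0]),
           np.array([5.0, 5.0]), np.array([5.0, 7.0])]
labels = ["refund", "refund", "shipping", "shipping"]
print(prototypical_classify(support, labels, np.array([1.0, 1.0])))  # → refund
```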
Why meta-learning matters beyond LLMs
Meta-learning is critical in domains where few-shot ICL is insufficient but traditional fine-tuning is impossible due to data scarcity: rare disease diagnosis (5–10 patient examples per condition), drug discovery (new molecular target with few known binders), personalized recommendation (new user with 3 interactions), robotics (new manipulation task with 5 demonstrations). For these domains, MAML-style approaches and prototypical networks remain state-of-the-art.
Practice questions
- What is the difference between N-way K-shot learning and standard supervised learning? (Answer: N-way K-shot: at test time, classify examples into N new classes (never seen during meta-training), given only K examples per class (often K=1 or K=5). The model must generalize from K examples to classify new instances of those classes. Standard supervised learning: train on thousands of examples of each class, classify into those same classes. Few-shot learning tests the model's ability to rapidly adapt to new classes from minimal examples.)
- What are Prototypical Networks and how do they perform few-shot classification? (Answer: Prototypical Networks: compute a class prototype = mean embedding of the K support examples for each class. Classify query points by nearest prototype in embedding space. Training: episodic training — simulate few-shot episodes, train the embedding space so that intra-class examples are close and inter-class examples are far. At test time: new classes are represented by their prototype embeddings. Simple, effective, and interpretable. Assumption: each class can be represented by a single mean vector (holds well for uni-modal class distributions).)
- What is MAML (Model-Agnostic Meta-Learning) and how does it enable fast adaptation? (Answer: MAML: meta-learn initial model parameters θ such that a few gradient steps on a new task's support set produce a well-performing model. Outer loop: across many tasks, update θ to minimize validation loss after inner-loop adaptation. Inner loop: for each task, take k gradient steps from θ → θ'. The outer loop explicitly optimizes for good initialization, not good average performance. At test time: start from θ, take k gradient steps on the new task's support set. Requires second-order gradients (expensive but model-agnostic).)
- What is the difference between few-shot learning with in-context learning vs meta-learning? (Answer: In-context learning: provide k examples in the prompt; the LLM adapts without any weight updates. Fast but limited by context window. Requires a large pretrained model that has developed ICL capability. Meta-learning: explicitly train a model to learn quickly — requires a training phase with many episodic few-shot tasks. Results in a specialized model architecture optimized for fast adaptation. ICL happens at inference time; meta-learning at training time. Modern practice: use ICL for large LLMs, meta-learning for specialized smaller models.)
- What is data augmentation in the context of few-shot learning? (Answer: With K=1 or K=5 support examples, standard augmentation (random crops, flips) helps but is insufficient. Advanced strategies: (1) Feature hallucination — train a generator to hallucinate additional class embeddings from the few support examples (GAN or VAE-based). (2) Cross-modal augmentation — use LLM descriptions of the class to generate synthetic embeddings. (3) Task augmentation — generate novel few-shot tasks from seen classes by resampling and relabeling. (4) Mixup in embedding space — interpolate between the few support examples to expand the class distribution.)
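The MAML/Reptile mechanics described in the questions above can be illustrated on a deliberately trivial task family: each task is "regress to the constant a" with loss (theta − a)². All hyperparameters here are illustrative:

```python
import numpy as np

def reptile_meta_train(task_targets, meta_steps=200, inner_steps=5,
                       inner_lr=0.3, meta_lr=0.1, seed=0):
    """Reptile on a toy task family. Inner loop: a few SGD steps from theta
    on the sampled task; outer loop: nudge theta toward the adapted weights."""
    rng = np.random.default_rng(seed)
    theta = 0.0
    for _ in range(meta_steps):
        a = rng.choice(task_targets)           # sample a task
        w = theta
        for _ in range(inner_steps):           # inner-loop adaptation
            w -= inner_lr * 2 * (w - a)        # gradient of (w - a)^2
        theta += meta_lr * (w - theta)         # Reptile meta-update
    return theta

theta = reptile_meta_train([1.0, 3.0])
print(theta)  # lands near 2.0, between the two tasks' optima
```

Because each task's optimum is reachable within the inner loop, the meta-learned initialization settles near the mean of the task targets: a starting point from which every task is a few gradient steps away.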
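Strategy (4), mixup in embedding space, can be sketched directly (the Beta(α, α) mixing distribution follows the original mixup recipe; the toy support vectors are illustrative):

```python
import numpy as np

def mixup_augment(support_embs, n_new=10, alpha=0.4, rng=None):
    """Synthesize extra within-class points by interpolating random pairs of
    support embeddings, with mixing weights drawn from Beta(alpha, alpha)."""
    rng = rng or np.random.default_rng(0)
    embs = np.asarray(support_embs, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i, j = rng.integers(0, len(embs), size=2)
        lam = rng.beta(alpha, alpha)
        synthetic.append(lam * embs[i] + (1 - lam) * embs[j])
    return np.stack(synthetic)

# Two support embeddings for one class → 5 synthetic points on the segment between them
print(mixup_augment([[0.0, 0.0], [2.0, 2.0]], n_new=5))
```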