Foundations

Few-Shot Learning and In-Context Examples

Lesson 3 of 4 · Estimated Time: 45 min

You’ve learned that examples guide the model’s behavior (in Chain-of-Thought). Now we’re going deeper: the art and science of selecting and formatting examples to teach the model exactly what you want. Few-shot learning is remarkably powerful—often two or three good examples teach the model more than pages of explicit instructions. The key is understanding what makes examples effective and when to use them.

Zero-Shot vs. Few-Shot vs. Many-Shot

To understand few-shot learning, let’s first clarify the taxonomy:

Zero-Shot: No Examples

The model answers based purely on your instruction, using only its training.

Classify this text as positive, negative, or neutral:
"This product is amazing!"

Response: Positive

Pros: No token cost for examples; straightforward instruction
Cons: Model must infer your exact definition of the categories

Few-Shot: 1-5 Examples

You provide a small number of labeled examples before the actual task.

Classify text as positive, negative, or neutral.

Example 1:
Text: "This product is amazing!"
Label: Positive

Example 2:
Text: "It's okay, nothing special"
Label: Neutral

Example 3:
Text: "Broke after one day"
Label: Negative

Now classify:
Text: "Works great but shipping was slow"
Label:

Pros: Model learns your specific criteria from examples; high quality
Cons: Costs tokens for examples; requires creating examples

Many-Shot: 10+ Examples

You provide many examples, often an entire training dataset condensed into a prompt.

[10-20 examples of text and labels]

Now classify:
Text: "Good value for money"
Label:

Pros: Very robust learning; captures edge cases and nuance
Cons: Expensive in tokens; may exceed context window; overkill for simple tasks
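The few-shot pattern above is easy to assemble programmatically from a pool of labeled examples. A minimal sketch—the function and variable names are illustrative, not from any library:

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot classification prompt.

    examples: list of (text, label) pairs shown before the actual query.
    """
    parts = [instruction, ""]
    for i, (text, label) in enumerate(examples, start=1):
        parts.append(f"Example {i}:")
        parts.append(f'Text: "{text}"')
        parts.append(f"Label: {label}")
        parts.append("")
    # End with the real input and an open "Label:" for the model to complete
    parts.append("Now classify:")
    parts.append(f'Text: "{query}"')
    parts.append("Label:")
    return "\n".join(parts)

prompt = build_few_shot_prompt(
    "Classify text as positive, negative, or neutral.",
    [("This product is amazing!", "Positive"),
     ("It's okay, nothing special", "Neutral"),
     ("Broke after one day", "Negative")],
    "Works great but shipping was slow",
)
```

Swapping the example list in and out of a helper like this also makes it cheap to experiment with how many shots you actually need.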

When Zero-Shot Works (and When It Fails)

Zero-shot prompting works well when:

  • The task is simple and standard (“Translate to Spanish”)
  • The model’s training covers the concept extensively (math, common languages, popular frameworks)
  • You don’t need fine-grained customization (“List three benefits of exercise”)

Zero-shot fails when:

  • Your definition differs from the model’s training (you define “professional tone” differently than the model expects)
  • The task is novel or niche (proprietary company jargon, new frameworks)
  • You need consistency with a specific style or format (your company’s unique email tone)

Example: Zero-Shot Failure → Few-Shot Success

Zero-Shot (Fails):

Classify customer emails as needing urgent action, medium priority, or can wait.

Email: "Hi, my account seems locked out. I can't log in."
Classification:

The model might classify this as “urgent” (reasonable) or “medium” (also reasonable). Zero-shot leaves ambiguity.

Few-Shot (Succeeds):

Classify customer emails based on our support priority standards.

Example 1:
Email: "I'd like to change my password"
Priority: Can wait (low priority requests go here)

Example 2:
Email: "My account is locked and I have a deadline in 2 hours"
Priority: Urgent action (time-sensitive, blocking customer)

Example 3:
Email: "The dashboard layout is confusing"
Priority: Medium priority (feature feedback, not blocking)

Email: "Hi, my account seems locked out. I can't log in."
Priority:

Now the model learns your company’s specific priority framework.

Choosing Effective Examples: Diversity, Relevance, Edge Cases

Not all examples are equally valuable. The quality of your examples directly affects output quality.

Principle 1: Diversity

Include examples across the full range of outcomes:

Bad (Homogeneous):

Classify sentiment:

Example 1: "Love this!" → Positive
Example 2: "Fantastic!" → Positive
Example 3: "Excellent!" → Positive

Now classify: "It's okay"

The model learns “these words mean positive” but has no positive/neutral boundary.

Good (Diverse):

Classify sentiment:

Example 1: "Love this!" → Positive
Example 2: "It works fine" → Neutral
Example 3: "Disappointing" → Negative
Example 4: "Pretty good, has some issues" → Mixed/Neutral
Example 5: "Amazing product!" → Positive

Now classify: "It's okay"

Now the model understands the full spectrum.

Principle 2: Relevance

Choose examples similar to what you’ll actually classify:

Bad (Irrelevant):

Classify sentiment for hotel reviews:

Example 1: "I love ice cream" → Positive
Example 2: "Cars are boring" → Negative
Example 3: "Websites are helpful" → Positive

Now classify: "The room was clean but noisy at night"

The examples don’t help because they’re about different domains.

Good (Relevant):

Classify sentiment for hotel reviews:

Example 1: "Clean room, friendly staff, would stay again" → Positive
Example 2: "Room was fine, but WiFi didn't work" → Negative
Example 3: "Nothing special, met expectations" → Neutral

Now classify: "The room was clean but noisy at night"

These examples show the decision boundaries relevant to hotels.

Principle 3: Edge Cases

Include borderline examples that are hard to classify:

Incomplete (Missing Edge Cases):

Classify code reviews as constructive or harsh:

Example 1: "Great implementation, well structured!" → Constructive
Example 2: "This code is garbage, redo it" → Harsh

Now classify: "This works, but consider using a design pattern here"

The model has no example of a critical-but-kind review.

Complete (With Edge Cases):

Classify code reviews as constructive or harsh:

Example 1: "Great implementation, well structured!" → Constructive
Example 2: "This works, but consider using a pattern here for scalability" → Constructive
Example 3: "This is inefficient and hard to read" → Harsh (critical but factual)
Example 4: "What were you thinking with this code?" → Harsh (insulting)
Example 5: "Code works but difficult to understand. Suggest adding comments" → Constructive

Now classify: "This works but performance could be better"

Now the model understands nuance.
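One quick sanity check on an example pool is whether every label is represented and no single label dominates. A small helper along these lines (purely illustrative; the 2x-imbalance threshold is an arbitrary assumption):

```python
from collections import Counter

def check_example_diversity(examples, expected_labels):
    """Report label coverage for a pool of (text, label) examples."""
    counts = Counter(label for _, label in examples)
    missing = set(expected_labels) - set(counts)
    return {
        "counts": dict(counts),
        "missing_labels": sorted(missing),
        # Balanced if every label appears and no label outnumbers another 2:1
        "balanced": not missing and max(counts.values()) <= 2 * min(counts.values()),
    }

report = check_example_diversity(
    [("Love this!", "Positive"),
     ("It works fine", "Neutral"),
     ("Disappointing", "Negative")],
    expected_labels=["Positive", "Neutral", "Negative"],
)
# report["missing_labels"] is empty and report["balanced"] is True
```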

Example Format Matters: Input/Output Pairs and Labeled Examples

How you present examples affects learning. Different formats work better for different tasks.

Format 1: Input/Output Pairs (Most Common)

Simple pairs showing input and expected output:

Task: Translate English to Spanish

Input: "Hello, how are you?"
Output: "Hola, ¿cómo estás?"

Input: "What is your name?"
Output: "¿Cuál es tu nombre?"

Input: "I love this book"
Output: "Amo este libro"

Input: "Can you help me?"
Output:

Best for: Translation, generation, simple transformations

Format 2: Labeled Examples (With Explanation)

Examples with labels and brief explanations:

Task: Classify Python code for performance issues

Code: `items = [x*2 for x in range(1000000)]`
Classification: Potential memory issue
Reason: Creating a large list in memory; consider a generator

Code: `result = db.query().filter(name="John").first()`
Classification: OK for typical use case
Reason: Proper use of indexed query with limit

Code: `for user in users: db.save(user)`
Classification: Performance problem (N+1 queries)
Reason: Should batch database operations

Code: `total = sum(get_price(item) for item in shopping_list)`
Classification:

Best for: Detailed classification, explaining decisions, nuanced judgments

Format 3: Structured JSON Examples

For tasks requiring structured output:

Task: Extract key information from user reviews

Example 1:
Review: "The pillow is comfortable but too firm for side sleepers"
{
  "product_quality": "good",
  "comfort_rating": "mixed",
  "key_issue": "firmness for side sleepers",
  "recommendation": "Ask about firmness options"
}

Example 2:
Review: "Fast shipping, exactly as described!"
{
  "product_quality": "good",
  "comfort_rating": "not mentioned",
  "key_issue": "none",
  "recommendation": "No action needed"
}

Review: "Cheap material, fell apart in a week"
{

Best for: Information extraction, complex structures, API responses
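When you ask for structured JSON like this, it helps to validate the model's reply before using it downstream. A minimal sketch, using the field names from the example above (the function name and sample reply are hypothetical):

```python
import json

REQUIRED_KEYS = {"product_quality", "comfort_rating", "key_issue", "recommendation"}

def parse_review_extraction(raw_reply):
    """Parse the model's JSON reply and confirm all expected keys are present."""
    data = json.loads(raw_reply)  # raises an error on malformed JSON
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"Model reply missing keys: {sorted(missing)}")
    return data

reply = ('{"product_quality": "poor", "comfort_rating": "not mentioned", '
         '"key_issue": "durability", "recommendation": "Offer refund"}')
extracted = parse_review_extraction(reply)
```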

Format 4: Conversational Examples

For dialogue or chat tasks:

Task: Generate customer service responses

Example 1:
Customer: "I ordered 2 weeks ago and still haven't received my package"
Agent: "I apologize for the delay. Let me look up your order. Can you provide
your order number? In the meantime, packages typically arrive within 7-10
business days, so this is unusual. I'll prioritize investigating this."

Example 2:
Customer: "Do you have this in blue?"
Agent: "That's a great question! We currently have it in black and gray. I can
check with our warehouse to see if blue is coming back in stock. What's your
size and I can add you to a waitlist?"

Customer: "Your website is broken, I can't log in"
Agent:

Best for: Chat, dialogue, customer service, interactive tasks

When Few-Shot Beats Zero-Shot (And When It Doesn’t)

| Scenario | Zero-Shot | Few-Shot | Winner |
|---|---|---|---|
| Translate common language | Excellent | Good | Zero-shot (no examples needed) |
| Apply company-specific rule | Poor | Excellent | Few-shot (examples clarify rule) |
| Simple math | Excellent | Good | Zero-shot (examples over-complicate) |
| Classify novel categories | Poor | Excellent | Few-shot (examples teach categories) |
| Generate standard content | Good | Good | Tie (depends on style needs) |
| Custom output format | Poor | Excellent | Few-shot (shows exact format) |
| Generate code for niche framework | Poor | Excellent | Few-shot (teaches patterns) |
| Translate rare language | Poor | Excellent | Few-shot (guides the model) |

Dynamic Few-Shot: Selecting Examples Programmatically

For production systems, you might select examples programmatically based on the input.

Concept: Context-Aware Examples

Instead of static examples, choose examples similar to the current input:

def similarity_score(a, b):
    """Naive lexical similarity: Jaccard overlap of lowercase word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

def select_dynamic_examples(input_text, example_pool, num_examples=3):
    """
    Select the most relevant examples from a pool based on input similarity
    """
    # Calculate similarity between input and each example
    similarities = [
        (example, similarity_score(input_text, example))
        for example in example_pool
    ]

    # Return top N most similar examples
    top_examples = sorted(similarities, key=lambda x: x[1], reverse=True)
    return [ex[0] for ex in top_examples[:num_examples]]

# Usage
input_text = "This shirt runs small and the color fades easily"
relevant_examples = select_dynamic_examples(input_text, all_reviews)

# Join the selected examples so the prompt contains text, not a Python list
examples_block = "\n".join(relevant_examples)

prompt = f"""Classify sentiment:

{examples_block}

Now classify: "{input_text}"
"""

Why Dynamic Selection Works

  • Relevance: Examples match the complexity and domain of the current input
  • Efficiency: Fewer irrelevant examples means tokens saved
  • Accuracy: Similar examples teach better than random examples

Implementation with Embeddings

For best results, use semantic similarity:

from sklearn.metrics.pairwise import cosine_similarity
from openai import OpenAI

client = OpenAI()

def select_similar_examples(input_text, examples, num_examples=3):
    """Select examples most semantically similar to input"""

    # Get embeddings for input and examples
    input_embedding = get_embedding(input_text)
    example_embeddings = [get_embedding(ex['text']) for ex in examples]

    # Calculate similarity
    similarities = cosine_similarity([input_embedding], example_embeddings)[0]

    # Return top examples
    top_indices = similarities.argsort()[-num_examples:][::-1]
    return [examples[i] for i in top_indices]

def get_embedding(text):
    """Get an OpenAI embedding for a piece of text"""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

Few-Shot Prompt Structure and Best Practices

Complete Few-Shot Prompt Pattern

<task>
Classify [items] as [categories]
</task>

<classification_guide>
Key criteria:
- Category A is characterized by [what makes it A]
- Category B is characterized by [what makes it B]
</classification_guide>

<examples>
Example 1:
Input: [example input]
Output: [example output]
Reasoning: [why this classification]

Example 2:
Input: [example input]
Output: [example output]
Reasoning: [why this classification]

Example 3:
Input: [example input]
Output: [example output]
Reasoning: [why this classification]
</examples>

<instructions>
Now classify this using the same approach:
Input: [actual task]
Output:
</instructions>
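The tagged pattern above can also be rendered from data rather than written by hand. A sketch under that assumption—the tag names follow the pattern, but the function and its sample arguments are illustrative:

```python
def build_tagged_prompt(task, criteria, examples, query):
    """Render the <task>/<criteria>/<examples> few-shot pattern from data.

    examples: list of dicts with "input", "output", and "reasoning" keys.
    """
    example_blocks = []
    for i, ex in enumerate(examples, start=1):
        example_blocks.append(
            f"Example {i}:\n"
            f"Input: {ex['input']}\n"
            f"Output: {ex['output']}\n"
            f"Reasoning: {ex['reasoning']}"
        )
    return (
        f"<task>\n{task}\n</task>\n\n"
        f"<criteria>\n{criteria}\n</criteria>\n\n"
        "<examples>\n" + "\n\n".join(example_blocks) + "\n</examples>\n\n"
        "<instructions>\nNow classify this using the same approach:\n"
        f"Input: {query}\nOutput:\n</instructions>"
    )

prompt = build_tagged_prompt(
    task="Classify support emails as Urgent, Medium, or Can wait",
    criteria="Urgent blocks the customer; Medium affects workflow; Can wait is cosmetic",
    examples=[{"input": "Account locked, deadline in 2 hours",
               "output": "Urgent",
               "reasoning": "Time-sensitive and blocking"}],
    query="The dashboard layout is confusing",
)
```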

Real-World Example: Few-Shot Code Classification

<task>
Classify Python functions as having good or poor error handling
</task>

<criteria>
Good: Catches specific exceptions, provides helpful context, fails gracefully
Poor: Bare except, no error context, catches Exception broadly, fails silently
</criteria>

<examples>
Example 1:
Code:
```python
def parse_json(text):
    return json.loads(text)
```
Classification: Poor error handling
Reason: No try-except; will crash on invalid JSON

Example 2:
Code:
```python
def parse_json(text):
    try:
        return json.loads(text)
    except json.JSONDecodeError as e:
        raise ValueError(f"Invalid JSON at line {e.lineno}: {e.msg}")
```
Classification: Good error handling
Reason: Specific exception, provides context, helpful error message

Example 3:
Code:
```python
def parse_json(text):
    try:
        return json.loads(text)
    except:
        return None
```
Classification: Poor error handling
Reason: Bare except catches all exceptions, silent failure
</examples>

Now classify:
Code:
```python
def get_user(user_id):
    try:
        response = requests.get(f"/api/users/{user_id}")
        response.raise_for_status()
        return response.json()
    except requests.RequestException as e:
        logging.error(f"Failed to fetch user {user_id}: {e}")
        return None
```
Classification:

Key Takeaway

Few-shot learning teaches the model through examples rather than explicit instructions. Effective examples are diverse, relevant, and include edge cases. How you format examples (pairs, JSON, labeled, conversational) depends on your task. Few-shot outperforms zero-shot for novel tasks, specific styles, and custom formats. Dynamic selection of examples based on input similarity can improve efficiency and accuracy in production systems.

Exercise: Build a Few-Shot Prompt for Sentiment Classification

Your task is to create a few-shot prompt for classifying sentiment in product reviews.

The Challenge

You work for an e-commerce company. Reviewers rate products 1-5 stars, but the written reviews don’t always match the star rating. You need a sentiment classifier that:

  1. Classifies the review sentiment (positive, negative, mixed)
  2. Flags reviews where text doesn’t match the star rating
  3. Identifies the main reason for the sentiment

Your Task

Create a few-shot prompt with:

  1. Clear task definition (what are you classifying?)

  2. Classification criteria (what makes something positive vs. negative?)

  3. 4-6 diverse examples that:

    • Cover the full range (clearly positive, clearly negative, mixed)
    • Include edge cases (5-star complaint, 1-star praise)
    • Show the flag logic (when rating doesn’t match text)
    • Identify the main reason
  4. Output format (how should answers be structured?)

Example Format to Follow

<task>
Classify product review sentiment
</task>

<criteria>
Positive: Overall satisfied, recommends product, minor issues are acceptable
Negative: Overall unsatisfied, wouldn't recommend, major issues present
Mixed: Has both significant positives and negatives

Flag: Review if the sentiment doesn't match the star rating
</criteria>

<examples>
Example 1:
Stars: 5
Review: "Best purchase ever! Works perfectly, great quality."
Sentiment: Positive
Flags: None
Main reason: Excellent quality and functionality

Example 2:
Stars: 1
Review: "Arrived damaged, not usable. Very disappointed."
Sentiment: Negative
Flags: None
Main reason: Product defective/damaged

... [more examples] ...
</examples>

<instructions>
Now classify:
Stars: [rating]
Review: [review text]
Sentiment:
Flags:
Main reason:
</instructions>

What to Deliver

  1. Your complete few-shot prompt
  2. A 150-word explanation of:
    • Why you chose those specific examples
    • How you ensured diversity and edge cases
    • How the format helps the model classify accurately

Bonus Challenge

Create a second version of your prompt:

  • Version A: For general reviews (any product category)
  • Version B: For technical products specifically (software, electronics)

Note how the examples and criteria change based on domain. This shows why example selection matters.