Model-Specific Optimization Strategies
Introduction
You’ve now learned how to measure prompt quality and test prompts scientifically. But there’s a critical realization: the “best prompt” depends on which model you’re using. A prompt that works beautifully with Claude might confuse GPT-4. A technique that makes Gemini output perfect JSON might break Llama.
This lesson teaches you the most important secret of production prompt engineering: different models have different strengths, quirks, and optimal prompt structures. You’ll learn how to identify these differences, optimize for specific models, and build cross-model solutions.
Key Takeaway: There is no universally perfect prompt. Each model family has unique capabilities and sensitivities. Production systems must either optimize for a specific model or design prompts that work across multiple models, which usually means optimizing for the least capable one.
How Models Differ in Prompt Response
Let’s start with concrete examples:
Example 1: Instruction Clarity
Task: Extract the first name from a full name
Prompt for Claude:
Extract the first name from this name: John Smith
Please return only the first name, nothing else.
Claude: “John” ✓
Same prompt for GPT-3.5: “The first name is John” ✗ (included extra text)
Optimized for GPT-3.5:
Extract the first name from this name: John Smith
Return ONLY the first name in this format:
[FIRST_NAME]
Example:
John Smith -> [John]
GPT-3.5: “[John]” ✓
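The fix generalizes: when you anchor the output format with a bracketed template, pair it with defensive parsing so either style of response is handled. A minimal sketch (`parse_bracketed` is our own helper, not a library function):

```python
import re
from typing import Optional

def parse_bracketed(raw: str) -> Optional[str]:
    """Pull the value out of a [FIRST_NAME]-style bracketed response.

    Falls back to the stripped raw text if the model dropped the brackets;
    returns None for an empty response.
    """
    match = re.search(r"\[([^\]]+)\]", raw)
    if match:
        return match.group(1).strip()
    cleaned = raw.strip()
    return cleaned or None
```

This way the same extraction code works whether the model returns `[John]`, `Sure! [John]`, or just `John`.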
Example 2: XML vs Natural Language
Claude excels with XML tags:
<instruction>
Extract the sentiment from this review
</instruction>
<review>
This product is amazing!
</review>
<format>
Return JSON with keys: sentiment, confidence
</format>
GPT-4 prefers numbered lists:
1. Your task: Extract sentiment from the review
2. Input review: "This product is amazing!"
3. Format: Return JSON with keys: sentiment, confidence
Llama works better with imperative commands:
TASK: Extract sentiment from this review
REVIEW: "This product is amazing!"
OUTPUT: JSON with sentiment and confidence
Your response:
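These three prompts differ only in surface form, so it is easy to render all of them from one task definition. A sketch (the `render_prompt` helper and its style names are illustrative, not a standard API):

```python
def render_prompt(task: str, review: str, style: str) -> str:
    """Render one extraction task in a model-appropriate prompt style."""
    if style == "xml":          # Claude-friendly
        return (
            f"<instruction>\n{task}\n</instruction>\n"
            f"<review>\n{review}\n</review>\n"
            "<format>\nReturn JSON with keys: sentiment, confidence\n</format>"
        )
    if style == "numbered":     # GPT-4-friendly
        return (
            f"1. Your task: {task}\n"
            f'2. Input review: "{review}"\n'
            "3. Format: Return JSON with keys: sentiment, confidence"
        )
    if style == "imperative":   # Llama-friendly
        return (
            f"TASK: {task}\n"
            f'REVIEW: "{review}"\n'
            "OUTPUT: JSON with sentiment and confidence\n"
            "Your response:"
        )
    raise ValueError(f"Unknown style: {style}")
```

Keeping the task definition separate from its rendering means a new model only needs a new style branch, not a rewritten prompt.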
Understanding Model Capabilities and Limitations
Model Size and Capability Tiers
from dataclasses import dataclass

@dataclass
class ModelProfile:
    """Profile of a model's strengths and weaknesses"""
    name: str
    provider: str
    parameter_count: str
    context_window: int
    strengths: list            # What it's good at
    weaknesses: list           # What it struggles with
    optimal_prompt_style: str  # How it responds best
    cost_per_1k_tokens: float
# Example profiles
MODELS = {
    "claude-3-opus": ModelProfile(
        name="Claude 3 Opus",
        provider="Anthropic",
        parameter_count="~100B (estimated)",
        context_window=200000,
        strengths=[
            "Long-form reasoning",
            "XML/structured markup",
            "Constitutional AI compliance",
            "Code generation and analysis"
        ],
        weaknesses=[
            "Can be verbose",
            "Slower than some competitors"
        ],
        optimal_prompt_style="detailed_with_xml_or_markdown",
        cost_per_1k_tokens=0.015
    ),
    "gpt-4": ModelProfile(
        name="GPT-4",
        provider="OpenAI",
        parameter_count="Unknown (estimated >100B)",
        context_window=128000,
        strengths=[
            "Superior reasoning",
            "Multimodal (vision)",
            "Extremely reliable format following",
            "Consistent behavior across domains"
        ],
        weaknesses=[
            "More expensive",
            "Slower inference"
        ],
        optimal_prompt_style="numbered_lists_and_json_schema",
        cost_per_1k_tokens=0.03
    ),
    "llama-2-70b": ModelProfile(
        name="Llama 2 70B",
        provider="Meta (via API providers)",
        parameter_count="70B",
        context_window=4096,
        strengths=[
            "Fast inference",
            "Open source (can self-host)",
            "Low cost",
            "Good at instruction following"
        ],
        weaknesses=[
            "Less reliable for complex tasks",
            "Smaller context window",
            "Struggles with very specific formatting"
        ],
        optimal_prompt_style="clear_imperative_instructions",
        cost_per_1k_tokens=0.001
    ),
    "gemini-pro": ModelProfile(
        name="Google Gemini Pro",
        provider="Google",
        parameter_count="Unknown",
        context_window=32000,
        strengths=[
            "Excellent summarization",
            "Natural conversation flow",
            "Good at following complex logic"
        ],
        weaknesses=[
            "Sometimes over-confident",
            "Can hallucinate facts"
        ],
        optimal_prompt_style="conversational_narrative",
        cost_per_1k_tokens=0.005
    )
}
Model-Specific Prompt Optimization
Optimizing for Claude
Claude responds exceptionally well to:
- XML tags for structured input/output
- Detailed explanations of what you want
- Constitutional AI framing (asking models to be harmless, helpful, honest)
import json

def claude_optimized_extraction(text: str) -> dict:
    """Extract data optimally for Claude"""
    prompt = """<document>
{text}
</document>

Please extract the following information from the document:

<extraction_task>
- Company name (full legal name)
- Industry (primary classification)
- Founded year (numeric, e.g., 2015)
- Key products (comma-separated list)
</extraction_task>

Return your response as JSON with these exact keys:
- company_name
- industry
- founded_year
- key_products

Be precise and extract only information explicitly stated in the document.
If information is not available, use null for that field."""

    # claude_client is an Anthropic client instance, e.g. anthropic.Anthropic()
    response = claude_client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=500,
        messages=[
            {"role": "user", "content": prompt.format(text=text)}
        ]
    )
    return json.loads(response.content[0].text)
Optimizing for GPT-4
GPT-4 responds best to:
- JSON Schema for output format specification
- Step-by-step reasoning prompts (chain-of-thought)
- Explicit format examples
def gpt4_optimized_extraction(text: str) -> dict:
    """Extract data optimally for GPT-4"""
    prompt = f"""Extract information from this text:

TEXT:
{text}

TASK:
1. Identify the company name (must be exact legal name)
2. Classify the industry (choose from: Technology, Healthcare, Finance, Retail, Other)
3. Find the founding year (format: YYYY)
4. List all products mentioned (format: product1, product2, product3)

OUTPUT FORMAT (STRICT JSON):
{{
    "company_name": "string",
    "industry": "string",
    "founded_year": "integer or null",
    "key_products": ["string"],
    "confidence": "high/medium/low"
}}

Return ONLY valid JSON. No other text."""

    # client is an OpenAI client instance, e.g. openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    return json.loads(response.choices[0].message.content)
Optimizing for Llama
Llama responds best to:
- Clear, concise instructions (it doesn’t need verbose explanations)
- Simpler formats (sometimes struggles with complex JSON)
- Direct imperative commands
def llama_optimized_extraction(text: str) -> dict:
    """Extract data optimally for Llama"""
    prompt = f"""Extract information from this text:
{text}

Extract:
COMPANY: [company name]
INDUSTRY: [industry]
YEAR: [founding year]
PRODUCTS: [product list]

Return results in this exact format only."""

    # llama_client is whatever client wraps your Llama deployment
    response = llama_client.generate(
        prompt=prompt,
        temperature=0,
        max_tokens=200
    )

    # Parse the labeled lines back into a dict
    lines = response.strip().split('\n')
    result = {}
    for line in lines:
        if line.startswith('COMPANY:'):
            result['company_name'] = line.replace('COMPANY:', '').strip()
        elif line.startswith('INDUSTRY:'):
            result['industry'] = line.replace('INDUSTRY:', '').strip()
        elif line.startswith('YEAR:'):
            result['founded_year'] = line.replace('YEAR:', '').strip()
        elif line.startswith('PRODUCTS:'):
            result['key_products'] = line.replace('PRODUCTS:', '').strip()
    return result
Cost Optimization Strategies
Different models have vastly different costs. Sometimes using multiple smaller/cheaper calls is better than one expensive call:
Token Efficiency
def estimate_token_cost(prompt: str, output_tokens: int, model: str) -> float:
    """Estimate cost of a single API call"""
    token_costs = {
        'gpt-4': {'input': 0.03, 'output': 0.06},
        'gpt-3.5-turbo': {'input': 0.0005, 'output': 0.0015},
        'claude-3-opus': {'input': 0.015, 'output': 0.075},
        'claude-3-sonnet': {'input': 0.003, 'output': 0.015},
        'llama-2-70b': {'input': 0.001, 'output': 0.001},
    }
    # Rough token count (better to use the actual tokenizer)
    input_tokens = len(prompt.split()) * 1.3  # ~1.3 tokens per word
    rates = token_costs.get(model, {'input': 0.01, 'output': 0.01})
    cost = (input_tokens * rates['input'] + output_tokens * rates['output']) / 1000
    return cost
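To see why splitting work across a cheaper model can pay off, plug the per-1K-token rates into the arithmetic directly (the token counts here are illustrative, not measured):

```python
# Illustrative comparison: one large GPT-4 call vs. three smaller
# GPT-3.5 calls covering the same total work. Rates are per 1K tokens,
# matching the table above.
RATES = {
    "gpt-4":         {"input": 0.03, "output": 0.06},
    "gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015},
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Price one call from input/output token counts."""
    r = RATES[model]
    return (input_tokens * r["input"] + output_tokens * r["output"]) / 1000

# One big GPT-4 call: 1,500 input tokens, 600 output tokens
big = call_cost("gpt-4", 1500, 600)
# Three smaller GPT-3.5 calls: 500 input + 200 output tokens each
small = 3 * call_cost("gpt-3.5-turbo", 500, 200)

print(f"GPT-4: ${big:.4f} vs. 3x GPT-3.5: ${small:.4f}")
```

With these rates the split is roughly 50x cheaper, which is why routing simple subtasks to smaller models matters at scale; whether quality holds up is exactly what the testing below measures.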
def choose_optimal_model(task: str, budget_per_call: float = 0.01):
    """Choose model that meets quality and budget requirements"""
    task_profiles = {
        'complex_reasoning': ['gpt-4', 'claude-3-opus'],
        'classification': ['gpt-3.5-turbo', 'claude-3-sonnet'],
        'simple_extraction': ['llama-2-70b', 'gpt-3.5-turbo'],
        'summarization': ['claude-3-sonnet', 'gpt-3.5-turbo']
    }
    candidates = task_profiles.get(task, ['gpt-3.5-turbo'])

    # Filter by budget
    affordable = []
    for model in candidates:
        estimated_cost = estimate_token_cost("", 200, model)
        if estimated_cost <= budget_per_call:
            affordable.append(model)
    return affordable[0] if affordable else candidates[0]
When to Use Smaller Models
def should_use_smaller_model(task: str) -> bool:
    """Determine if a smaller/cheaper model will suffice"""
    simple_tasks = [
        'sentiment_classification',
        'spam_detection',
        'language_detection',
        'simple_extraction',
        'formatting_conversion'
    ]
    complex_tasks = [
        'reasoning',
        'complex_analysis',
        'code_generation',
        'creative_writing',
        'edge_case_handling'
    ]
    if task in simple_tasks:
        return True
    if task in complex_tasks:
        return False
    # Default to larger model if uncertain
    return False

# Usage
if should_use_smaller_model('sentiment_classification'):
    model = 'gpt-3.5-turbo'  # 10x cheaper than GPT-4
else:
    model = 'gpt-4'  # More capable
Caching and Batching
Reduce API calls and costs with intelligent caching:
import hashlib
import json
from datetime import datetime, timedelta

class PromptCache:
    """Cache prompt responses to avoid redundant API calls"""

    def __init__(self, ttl_hours: int = 24):
        self.cache = {}
        self.ttl = timedelta(hours=ttl_hours)

    def _hash_key(self, prompt: str, model: str) -> str:
        """Create cache key from prompt + model"""
        combined = f"{prompt}:{model}"
        return hashlib.md5(combined.encode()).hexdigest()

    def get(self, prompt: str, model: str):
        """Retrieve cached response if it exists and has not expired"""
        key = self._hash_key(prompt, model)
        if key in self.cache:
            response, timestamp = self.cache[key]
            if datetime.now() - timestamp < self.ttl:
                return response
            del self.cache[key]  # Expired
        return None

    def set(self, prompt: str, model: str, response: str):
        """Cache a response"""
        key = self._hash_key(prompt, model)
        self.cache[key] = (response, datetime.now())

    def save(self, filepath: str):
        """Persist cache to disk"""
        cacheable = {
            k: (v[0], v[1].isoformat())
            for k, v in self.cache.items()
        }
        with open(filepath, 'w') as f:
            json.dump(cacheable, f)

cache = PromptCache()

def call_model_with_cache(prompt: str, model: str, api_client):
    """Call API with caching"""
    # Check cache first
    cached = cache.get(prompt, model)
    if cached is not None:
        print("Cache hit!")
        return cached

    # Call API and store the result
    response = api_client.call(model, prompt)
    cache.set(prompt, model, response)
    return response
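The batching half of the heading can be as simple as packing several inputs into one prompt and mapping the numbered answers back to their items. A sketch, assuming the model honors the numbered-line convention (real outputs still need validation):

```python
def batch_prompt(items: list, task: str) -> str:
    """Pack several inputs into one prompt so a single API call
    handles them all; the model answers one numbered line per item."""
    numbered = "\n".join(f"{i + 1}. {item}" for i, item in enumerate(items))
    return (
        f"{task}\n\n"
        f"Items:\n{numbered}\n\n"
        "Answer with one line per item, in the form '<number>. <answer>'."
    )

def parse_batch_response(response: str, n_items: int) -> list:
    """Map numbered answer lines back to their items (None if missing)."""
    answers = [None] * n_items
    for line in response.strip().splitlines():
        number, _, answer = line.partition(".")
        if number.strip().isdigit():
            idx = int(number.strip()) - 1
            if 0 <= idx < n_items:
                answers[idx] = answer.strip()
    return answers
```

One batched call amortizes the fixed instruction tokens across all items, at the cost of needing the parsing step and a fallback (e.g. retry individually) for items the model skips.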
Cross-Model Prompt Portability
What if you need a prompt that works across multiple models? The answer is to write a base prompt pitched at the least capable model, then layer on lightweight model-specific adaptations:
class CrossModelPrompt:
    """Prompt that works across multiple models"""

    def __init__(self, base_prompt: str, model_adaptations: dict):
        """
        base_prompt: The prompt that works on most models
        model_adaptations: {model_name: adjustment_instructions}
        """
        self.base_prompt = base_prompt
        self.adaptations = model_adaptations

    def get_prompt_for_model(self, model: str) -> str:
        """Get prompt optimized for specific model"""
        if model in self.adaptations:
            return self.base_prompt + "\n" + self.adaptations[model]
        return self.base_prompt

# Example
multi_model_extraction = CrossModelPrompt(
    base_prompt="""Extract the following from the text:
- Company name
- Founded year
- Industry
Return as structured data.""",
    model_adaptations={
        'gpt-4': """
IMPORTANT: Return valid JSON format:
{"company_name": "...", "founded_year": 2024, "industry": "..."}""",
        'llama-2-70b': """
Format:
COMPANY: [name]
YEAR: [year]
INDUSTRY: [industry]""",
        'claude-3': """
<result>
<company_name>...</company_name>
<founded_year>...</founded_year>
<industry>...</industry>
</result>"""
    }
)

# Usage (call_api stands in for your provider dispatch function)
for model in ['gpt-4', 'llama-2-70b', 'claude-3']:
    prompt = multi_model_extraction.get_prompt_for_model(model)
    response = call_api(model, prompt)
Testing Across Models
When optimizing for multiple models, you need comprehensive testing:
import numpy as np

class MultiModelTester:
    """Test prompts across multiple models"""

    def __init__(self, models: list, test_cases: list):
        self.models = models
        self.test_cases = test_cases
        self.results = {}

    def test_all(self, prompt_variants: dict) -> dict:
        """Test each prompt variant on each model"""
        results = {}
        for variant_name, prompt_fn in prompt_variants.items():
            results[variant_name] = {}
            for model in self.models:
                model_results = []
                for test_case in self.test_cases:
                    # Get model-specific prompt
                    prompt = prompt_fn(model)
                    # Call API
                    output = self._call_model(model, prompt, test_case['input'])
                    # Evaluate
                    score = self._evaluate(output, test_case['expected'])
                    model_results.append(score)
                # Summary for this model
                results[variant_name][model] = {
                    'mean': np.mean(model_results),
                    'std': np.std(model_results),
                    'scores': model_results
                }
        return results

    def _call_model(self, model: str, prompt: str, user_input: str) -> str:
        """Dispatch to the appropriate provider API (helpers not shown)"""
        if model.startswith('gpt'):
            return self._call_openai(model, prompt, user_input)
        elif model.startswith('claude'):
            return self._call_anthropic(model, prompt, user_input)
        elif model.startswith('llama'):
            return self._call_llama(model, prompt, user_input)
        raise ValueError(f"Unknown model: {model}")

    def _evaluate(self, output: str, expected: str) -> float:
        """Simple exact-match evaluation metric"""
        return float(output.strip() == expected.strip())

    def summarize_results(self, results: dict) -> dict:
        """Create summary showing best model per variant"""
        summary = {}
        for variant, model_scores in results.items():
            best_model = max(model_scores.items(),
                             key=lambda x: x[1]['mean'])
            summary[variant] = {
                'best_model': best_model[0],
                'best_score': best_model[1]['mean']
            }
        return summary

# Usage
tester = MultiModelTester(
    models=['gpt-4', 'claude-3-opus', 'llama-2-70b'],
    test_cases=[
        {'input': 'text1', 'expected': 'output1'},
        {'input': 'text2', 'expected': 'output2'},
    ]
)

results = tester.test_all({
    'simple_prompt': lambda m: "Extract data from the input text.",
    'detailed_prompt': lambda m: f"Extract data from the input text (optimized for {m}).",
    'structured_prompt': multi_model_extraction.get_prompt_for_model
})

summary = tester.summarize_results(results)
print(summary)
Exercise: Optimize for Multiple Models
Choose a task (e.g., “Extract sentiment from product reviews”) and:
- Write a base prompt that works decently across models
- Create model-specific adaptations for at least 3 models (GPT-4, Claude, Llama)
- Build a test set of 20+ test cases
- Test each model+prompt combination
- Analyze:
- Which model performs best overall?
- Which model is cheapest while maintaining >80% accuracy?
- Which prompt adaptation helps the most?
Deliverables:
- Three prompt variants (one for each model)
- Test results showing accuracy per model
- A cost analysis (cost per correct classification)
- A recommendation for production: which model and prompt would you use and why?
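For the cost-analysis deliverable, a tiny helper like this (hypothetical, not part of any framework) captures the metric:

```python
def cost_per_correct(results: list, cost_per_call: float) -> float:
    """Cost per correct classification: total spend divided by the
    number of correct answers.

    results is a list of booleans (one per test case, True if correct).
    Returns infinity if nothing was correct, so a useless-but-cheap
    model never looks like a bargain.
    """
    correct = sum(results)
    if correct == 0:
        return float("inf")
    return len(results) * cost_per_call / correct

# e.g. 20 test cases at $0.002/call with 17 correct:
# cost_per_correct([True] * 17 + [False] * 3, 0.002)  # ~$0.00235 per correct answer
```

Comparing this number across your three model+prompt combinations usually makes the production recommendation obvious.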
Summary
In this lesson, you’ve learned:
- Different models have different strengths and respond better to different prompt structures
- How to profile and understand model capabilities
- Specific optimization techniques for Claude, GPT-4, Llama, and Gemini
- How to optimize for cost: choosing the right model tier for the task
- Techniques for caching and batching to reduce costs
- How to build prompts that work across multiple models
- Testing frameworks for comparing performance across models
Next, you’ll learn how to manage prompts in production: versioning, tracking, and monitoring.