Intermediate

Model-Specific Optimization Strategies

Lesson 3 of 4 · Estimated Time: 45 min

Introduction

You’ve now learned how to measure prompt quality and test prompts scientifically. But there’s a critical realization: the “best prompt” depends on which model you’re using. A prompt that works beautifully with Claude might confuse GPT-4. A technique that makes Gemini output perfect JSON might break Llama.

This lesson teaches you the most important secret of production prompt engineering: different models have different strengths, quirks, and optimal prompt structures. You’ll learn how to identify these differences, optimize for specific models, and build cross-model solutions.

Key Takeaway: There is no universally perfect prompt. Each model family has unique capabilities and sensitivities. Production systems must either optimize for a specific model or design prompts that work across multiple models, which usually means optimizing for the least capable one.

How Models Differ in Prompt Response

Let’s start with concrete examples:

Example 1: Instruction Clarity

Task: Extract the first name from a full name

Prompt for Claude:

Extract the first name from this name: John Smith
Please return only the first name, nothing else.

Claude: “John” ✓

Same prompt for GPT-3.5:

GPT-3.5: “The first name is John” ✗ (included extra text)

Optimized for GPT-3.5:

Extract the first name from this name: John Smith

Return ONLY the first name in this format:
[FIRST_NAME]

Example:
John Smith -> [John]

GPT-3.5: “[John]” ✓
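
The bracketed sentinel also makes the output trivially machine-parseable. A minimal sketch (the `parse_bracketed` helper is our own, not part of any API):

```python
import re

def parse_bracketed(output: str):
    """Return the value inside the first [ ] marker, or None if absent."""
    match = re.search(r"\[([^\]]+)\]", output)
    return match.group(1) if match else None

parse_bracketed("[John]")                   # -> "John"
parse_bracketed("The first name is John")   # -> None (format not followed)
```

A `None` result doubles as a signal that the model ignored the format, which you can log or retry on.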

Example 2: XML vs Natural Language

Claude excels with XML tags:

<instruction>
Extract the sentiment from this review
</instruction>

<review>
This product is amazing!
</review>

<format>
Return JSON with keys: sentiment, confidence
</format>

GPT-4 prefers numbered lists:

1. Your task: Extract sentiment from the review
2. Input review: "This product is amazing!"
3. Format: Return JSON with keys: sentiment, confidence

Llama works better with imperative commands:

TASK: Extract sentiment from this review
REVIEW: "This product is amazing!"
OUTPUT: JSON with sentiment and confidence

Your response:
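
The three styles above can be generated from one task definition, which keeps them in sync as the task evolves. A sketch (the style names and the `render_prompt` helper are our own):

```python
def render_prompt(style: str, task: str, review: str, fmt: str) -> str:
    """Render the same extraction task in a model-preferred prompt style."""
    if style == "xml":          # Claude-leaning
        return (f"<instruction>\n{task}\n</instruction>\n\n"
                f"<review>\n{review}\n</review>\n\n"
                f"<format>\n{fmt}\n</format>")
    if style == "numbered":     # GPT-4-leaning
        return (f"1. Your task: {task}\n"
                f'2. Input review: "{review}"\n'
                f"3. Format: {fmt}")
    if style == "imperative":   # Llama-leaning
        return (f"TASK: {task}\n"
                f'REVIEW: "{review}"\n'
                f"OUTPUT: {fmt}\n\nYour response:")
    raise ValueError(f"Unknown style: {style}")
```

Changing the task text once now updates all three renderings.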

Understanding Model Capabilities and Limitations

Model Size and Capability Tiers

from enum import Enum
from dataclasses import dataclass

@dataclass
class ModelProfile:
    """Profile of a model's strengths and weaknesses"""
    name: str
    provider: str
    parameter_count: str
    context_window: int
    strengths: list  # What it's good at
    weaknesses: list  # What it struggles with
    optimal_prompt_style: str  # How it responds best
    cost_per_1k_tokens: float

# Example profiles
MODELS = {
    "claude-3-opus": ModelProfile(
        name="Claude 3 Opus",
        provider="Anthropic",
        parameter_count="~100B (estimated)",
        context_window=200000,
        strengths=[
            "Long-form reasoning",
            "XML/structured markup",
            "Constitutional AI compliance",
            "Code generation and analysis"
        ],
        weaknesses=[
            "Can be verbose",
            "Slower than some competitors"
        ],
        optimal_prompt_style="detailed_with_xml_or_markdown",
        cost_per_1k_tokens=0.015
    ),
    "gpt-4": ModelProfile(
        name="GPT-4",
        provider="OpenAI",
        parameter_count="Unknown (estimated >100B)",
        context_window=128000,
        strengths=[
            "Superior reasoning",
            "Multimodal (vision)",
            "Extremely reliable format following",
            "Consistent behavior across domains"
        ],
        weaknesses=[
            "More expensive",
            "Slower inference"
        ],
        optimal_prompt_style="numbered_lists_and_json_schema",
        cost_per_1k_tokens=0.03
    ),
    "llama-2-70b": ModelProfile(
        name="Llama 2 70B",
        provider="Meta (via API providers)",
        parameter_count="70B",
        context_window=4096,
        strengths=[
            "Fast inference",
            "Open source (can self-host)",
            "Low cost",
            "Good at instruction following"
        ],
        weaknesses=[
            "Less reliable for complex tasks",
            "Smaller context window",
            "Struggles with very specific formatting"
        ],
        optimal_prompt_style="clear_imperative_instructions",
        cost_per_1k_tokens=0.001
    ),
    "gemini-pro": ModelProfile(
        name="Google Gemini Pro",
        provider="Google",
        parameter_count="Unknown",
        context_window=32000,
        strengths=[
            "Excellent summarization",
            "Natural conversation flow",
            "Good at following complex logic"
        ],
        weaknesses=[
            "Sometimes over-confident",
            "Can hallucinate facts"
        ],
        optimal_prompt_style="conversational_narrative",
        cost_per_1k_tokens=0.005
    )
}
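
Profiles like these can drive routing decisions programmatically. A sketch that picks the cheapest profile whose context window fits the input (`cheapest_model_for` is our own helper; it only assumes objects with `context_window` and `cost_per_1k_tokens` attributes, like the `ModelProfile` entries above):

```python
def cheapest_model_for(profiles: dict, min_context: int) -> str:
    """Return the key of the cheapest profile with a large enough context window."""
    candidates = [
        (key, p) for key, p in profiles.items()
        if p.context_window >= min_context
    ]
    if not candidates:
        raise ValueError(f"no profile has a {min_context}-token context window")
    return min(candidates, key=lambda kv: kv[1].cost_per_1k_tokens)[0]

# e.g. cheapest_model_for(MODELS, 50_000) picks among the long-context models only
```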

Model-Specific Prompt Optimization

Optimizing for Claude

Claude responds exceptionally well to:

  • XML tags for structured input/output
  • Detailed explanations of what you want
  • Constitutional AI framing (asking the model to be helpful, harmless, and honest)

def claude_optimized_extraction(text: str) -> dict:
    """Extract data optimally for Claude"""

    prompt = """<document>
{text}
</document>

Please extract the following information from the document:

<extraction_task>
- Company name (full legal name)
- Industry (primary classification)
- Founded year (numeric, e.g., 2015)
- Key products (comma-separated list)
</extraction_task>

Return your response as JSON with these exact keys:
- company_name
- industry
- founded_year
- key_products

Be precise and extract only information explicitly stated in the document.
If information is not available, use null for that field."""

    # Assumes `import json` and an Anthropic client instance (claude_client)
    response = claude_client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=500,
        messages=[
            {"role": "user", "content": prompt.format(text=text)}
        ]
    )

    # json.loads raises ValueError if the model wraps the JSON in extra text
    return json.loads(response.content[0].text)

Optimizing for GPT-4

GPT-4 responds best to:

  • JSON Schema for output format specification
  • Step-by-step reasoning prompts (chain-of-thought)
  • Explicit format examples

def gpt4_optimized_extraction(text: str) -> dict:
    """Extract data optimally for GPT-4"""

    prompt = f"""Extract information from this text:

TEXT:
{text}

TASK:
1. Identify the company name (must be exact legal name)
2. Classify the industry (choose from: Technology, Healthcare, Finance, Retail, Other)
3. Find the founding year (format: YYYY)
4. List all products mentioned (format: product1, product2, product3)

OUTPUT FORMAT (STRICT JSON):
{{
  "company_name": "string",
  "industry": "string",
  "founded_year": "integer or null",
  "key_products": ["string"],
  "confidence": "high/medium/low"
}}

Return ONLY valid JSON. No other text."""

    # Assumes `import json` and an OpenAI client instance (client)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )

    return json.loads(response.choices[0].message.content)

Optimizing for Llama

Llama responds best to:

  • Clear, concise instructions (it doesn’t need verbose explanations)
  • Simpler formats (sometimes struggles with complex JSON)
  • Direct imperative commands

def llama_optimized_extraction(text: str) -> dict:
    """Extract data optimally for Llama"""

    prompt = f"""Extract information from this text:

{text}

Extract:
COMPANY: [company name]
INDUSTRY: [industry]
YEAR: [founding year]
PRODUCTS: [product list]

Return results in this exact format only."""

    response = llama_client.generate(
        prompt=prompt,
        temperature=0,
        max_tokens=200
    )

    # Parse the labeled lines into a dict
    field_map = {
        'COMPANY:': 'company_name',
        'INDUSTRY:': 'industry',
        'YEAR:': 'founded_year',
        'PRODUCTS:': 'key_products',
    }
    result = {}
    for line in response.strip().split('\n'):
        for prefix, key in field_map.items():
            if line.startswith(prefix):
                result[key] = line[len(prefix):].strip()

    return result

Cost Optimization Strategies

Different models have vastly different costs. Sometimes using multiple smaller/cheaper calls is better than one expensive call:

Token Efficiency

def estimate_token_cost(prompt: str, output_tokens: int, model: str) -> float:
    """Estimate cost of a single API call"""

    token_costs = {
        'gpt-4': {'input': 0.03, 'output': 0.06},
        'gpt-3.5-turbo': {'input': 0.0005, 'output': 0.0015},
        'claude-3-opus': {'input': 0.015, 'output': 0.075},
        'claude-3-sonnet': {'input': 0.003, 'output': 0.015},
        'llama-2-70b': {'input': 0.001, 'output': 0.001},
    }

    # Rough token count (better to use actual tokenizer)
    input_tokens = len(prompt.split()) * 1.3  # ~1.3 tokens per word

    rates = token_costs.get(model, {'input': 0.01, 'output': 0.01})
    cost = (input_tokens * rates['input'] + output_tokens * rates['output']) / 1000

    return cost
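
Applied to the claim above, the arithmetic shows how lopsided the comparison can be: several small GPT-3.5 calls can cost a tiny fraction of one large GPT-4 call (the token counts below are illustrative):

```python
# Rates mirror the token_costs table above (USD per 1K tokens)
GPT4 = {"input": 0.03, "output": 0.06}
GPT35 = {"input": 0.0005, "output": 0.0015}

def call_cost(rates: dict, input_tokens: int, output_tokens: int) -> float:
    """Cost of a single call at the given per-1K-token rates."""
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1000

one_big = call_cost(GPT4, input_tokens=3000, output_tokens=500)           # ~$0.12
three_small = 3 * call_cost(GPT35, input_tokens=1000, output_tokens=200)  # ~$0.0024
```

Here splitting the work is roughly 50x cheaper, assuming the smaller model handles each sub-task adequately.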


def choose_optimal_model(task: str, budget_per_call: float = 0.01):
    """Choose model that meets quality and budget requirements"""

    task_profiles = {
        'complex_reasoning': ['gpt-4', 'claude-3-opus'],
        'classification': ['gpt-3.5-turbo', 'claude-3-sonnet'],
        'simple_extraction': ['llama-2-70b', 'gpt-3.5-turbo'],
        'summarization': ['claude-3-sonnet', 'gpt-3.5-turbo']
    }

    candidates = task_profiles.get(task, ['gpt-3.5-turbo'])

    # Filter by budget
    affordable = []
    for model in candidates:
        estimated_cost = estimate_token_cost("", 200, model)
        if estimated_cost <= budget_per_call:
            affordable.append(model)

    return affordable[0] if affordable else candidates[0]

When to Use Smaller Models

def should_use_smaller_model(task: str) -> bool:
    """Determine if a smaller/cheaper model will suffice"""

    simple_tasks = [
        'sentiment_classification',
        'spam_detection',
        'language_detection',
        'simple_extraction',
        'formatting_conversion'
    ]

    complex_tasks = [
        'reasoning',
        'complex_analysis',
        'code_generation',
        'creative_writing',
        'edge_case_handling'
    ]

    if task in simple_tasks:
        return True
    if task in complex_tasks:
        return False

    # Default to larger model if uncertain
    return False

# Usage
if should_use_smaller_model('sentiment_classification'):
    model = 'gpt-3.5-turbo'  # 10x cheaper than GPT-4
else:
    model = 'gpt-4'  # More capable
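
The two ideas above combine into a cascade: try the cheap model first and escalate only when its output fails a confidence check. A minimal sketch; `cheap_call`, `expensive_call`, and `is_confident` are placeholders for your real API clients and quality check, not library functions:

```python
def cascade(prompt: str, cheap_call, expensive_call, is_confident) -> str:
    """Try the cheap model first; escalate only when the check fails."""
    output = cheap_call(prompt)
    if is_confident(output):
        return output
    return expensive_call(prompt)  # fall back to the stronger model

# Toy usage with stand-in callables:
result = cascade(
    "Classify: great product!",
    cheap_call=lambda p: "",                     # cheap model came back empty
    expensive_call=lambda p: "positive",
    is_confident=lambda out: bool(out.strip()),  # non-empty counts as confident
)
# result == "positive"
```

In production the confidence check might be format validation or a self-reported confidence field rather than a simple non-empty test.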

Caching and Batching

Reduce API calls and costs with intelligent caching:

import hashlib
from datetime import datetime, timedelta

class PromptCache:
    """Cache prompt responses to avoid redundant API calls"""

    def __init__(self, ttl_hours: int = 24):
        self.cache = {}
        self.ttl = timedelta(hours=ttl_hours)

    def _hash_key(self, prompt: str, model: str) -> str:
        """Create cache key from prompt + model"""
        combined = f"{prompt}:{model}"
        return hashlib.md5(combined.encode()).hexdigest()

    def get(self, prompt: str, model: str):
        """Retrieve cached response if exists and not expired"""
        key = self._hash_key(prompt, model)
        if key in self.cache:
            response, timestamp = self.cache[key]
            if datetime.now() - timestamp < self.ttl:
                return response
            else:
                del self.cache[key]  # Expired
        return None

    def set(self, prompt: str, model: str, response: str):
        """Cache a response"""
        key = self._hash_key(prompt, model)
        self.cache[key] = (response, datetime.now())

    def save(self, filepath: str):
        """Persist cache to disk"""
        import json
        cacheable = {
            k: (v[0], v[1].isoformat())
            for k, v in self.cache.items()
        }
        with open(filepath, 'w') as f:
            json.dump(cacheable, f)


cache = PromptCache()

def call_model_with_cache(prompt: str, model: str, api_client):
    """Call API with caching"""
    # Check cache first
    cached = cache.get(prompt, model)
    if cached:
        print("Cache hit!")
        return cached

    # Call API
    response = api_client.call(model, prompt)

    # Store in cache
    cache.set(prompt, model, response)

    return response

Cross-Model Prompt Portability

What if you need a prompt that works across multiple models? Start from a base prompt simple enough for the least capable model, then layer on small model-specific adaptations:

class CrossModelPrompt:
    """Prompt that works across multiple models"""

    def __init__(self, base_prompt: str, model_adaptations: dict):
        """
        base_prompt: The prompt that works on most models
        model_adaptations: {model_name: adjustment_instructions}
        """
        self.base_prompt = base_prompt
        self.adaptations = model_adaptations

    def get_prompt_for_model(self, model: str) -> str:
        """Get prompt optimized for specific model"""
        if model in self.adaptations:
            return self.base_prompt + "\n" + self.adaptations[model]
        return self.base_prompt

# Example
multi_model_extraction = CrossModelPrompt(
    base_prompt="""Extract the following from the text:
- Company name
- Founded year
- Industry

Return as structured data.""",

    model_adaptations={
        'gpt-4': """

IMPORTANT: Return valid JSON format:
{"company_name": "...", "founded_year": 2024, "industry": "..."}""",

        'llama-2-70b': """

Format:
COMPANY: [name]
YEAR: [year]
INDUSTRY: [industry]""",

        'claude-3': """

<result>
<company_name>...</company_name>
<founded_year>...</founded_year>
<industry>...</industry>
</result>"""
    }
)

# Usage (call_api stands in for your provider-dispatch helper)
for model in ['gpt-4', 'llama-2-70b', 'claude-3']:
    prompt = multi_model_extraction.get_prompt_for_model(model)
    response = call_api(model, prompt)

Testing Across Models

When optimizing for multiple models, you need comprehensive testing:

import numpy as np

class MultiModelTester:
    """Test prompts across multiple models"""

    def __init__(self, models: list, test_cases: list):
        self.models = models
        self.test_cases = test_cases
        self.results = {}

    def test_all(self, prompt_variants: dict) -> dict:
        """Test each prompt variant on each model"""

        results = {}

        for variant_name, prompt_fn in prompt_variants.items():
            results[variant_name] = {}

            for model in self.models:
                model_results = []

                for test_case in self.test_cases:
                    # Get model-specific prompt
                    prompt = prompt_fn(model)

                    # Call API
                    output = self._call_model(model, prompt, test_case['input'])

                    # Evaluate
                    score = self._evaluate(output, test_case['expected'])

                    model_results.append(score)

                # Summary for this model
                results[variant_name][model] = {
                    'mean': np.mean(model_results),
                    'std': np.std(model_results),
                    'scores': model_results
                }

        return results

    def _call_model(self, model: str, prompt: str, user_input: str) -> str:
        """Call the appropriate API"""
        if model.startswith('gpt'):
            return self._call_openai(model, prompt, user_input)
        elif model.startswith('claude'):
            return self._call_anthropic(model, prompt, user_input)
        elif model.startswith('llama'):
            return self._call_llama(model, prompt, user_input)
        raise ValueError(f"No API handler for model: {model}")

    def _evaluate(self, output: str, expected: str) -> float:
        """Simple evaluation metric"""
        return float(output.strip() == expected.strip())

    def summarize_results(self, results: dict) -> dict:
        """Create summary showing best model per variant"""

        summary = {}
        for variant, model_scores in results.items():
            best_model = max(model_scores.items(),
                           key=lambda x: x[1]['mean'])
            summary[variant] = {
                'best_model': best_model[0],
                'best_score': best_model[1]['mean']
            }

        return summary


# Usage
tester = MultiModelTester(
    models=['gpt-4', 'claude-3-opus', 'llama-2-70b'],
    test_cases=[
        {'input': 'text1', 'expected': 'output1'},
        {'input': 'text2', 'expected': 'output2'},
    ]
)

results = tester.test_all({
    'simple_prompt': lambda m: "Extract data: {input}",
    'detailed_prompt': lambda m: f"Extract data (optimized for {m}): {{input}}",
    'structured_prompt': multi_model_extraction.get_prompt_for_model
})

summary = tester.summarize_results(results)
print(summary)

Exercise: Optimize for Multiple Models

Choose a task (e.g., “Extract sentiment from product reviews”) and:

  1. Write a base prompt that works decently across models
  2. Create model-specific adaptations for at least 3 models (GPT-4, Claude, Llama)
  3. Build a test set of 20+ test cases
  4. Test each model+prompt combination
  5. Analyze:
    • Which model performs best overall?
    • Which model is cheapest while maintaining >80% accuracy?
    • Which prompt adaptation helps the most?

Deliverables:

  • Three prompt variants (one for each model)
  • Test results showing accuracy per model
  • A cost analysis (cost per correct classification)
  • A recommendation for production: which model and prompt would you use and why?
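
For the cost analysis deliverable, "cost per correct classification" is just total spend divided by the number of correct outputs. A small helper sketch (our own function, using the 1.0/0.0 scores that _evaluate above produces):

```python
def cost_per_correct(total_cost: float, scores: list) -> float:
    """Total API spend divided by the number of correct outputs."""
    correct = sum(scores)  # scores are 1.0 for correct, 0.0 otherwise
    if correct == 0:
        return float("inf")  # nothing correct: cost per correct is unbounded
    return total_cost / correct

# Example: $0.48 spent across 20 cases, 16 correct -> $0.03 per correct answer
cost_per_correct(0.48, [1.0] * 16 + [0.0] * 4)
```

This metric often reverses naive rankings: a cheap model with mediocre accuracy can still beat an expensive one per correct answer.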

Summary

In this lesson, you’ve learned:

  • Different models have different strengths and respond better to different prompt structures
  • How to profile and understand model capabilities
  • Specific optimization techniques for Claude, GPT-4, Llama, and Gemini
  • How to optimize for cost: choosing the right model tier for the task
  • Techniques for caching and batching to reduce costs
  • How to build prompts that work across multiple models
  • Testing frameworks for comparing performance across models

Next, you’ll learn how to manage prompts in production: versioning, tracking, and monitoring.