Foundations

Cost Optimization Strategies

Lesson 3 of 4 · Estimated time: 45 min

The Optimization Paradox

Better AI models cost more, but the economics don’t always justify the better model. Using GPT-4 instead of GPT-3.5 on a classification task where GPT-3.5 already works might improve accuracy from 85% to 88%. If 85% is good enough for your use case, you’re paying 3x the cost for a 3-point improvement.

The best approach: start with efficient models and upgrade only when cost is justified by value.

Strategy 1: Right-Sizing the Model

The Model Capability Curve

Foundation models have different capabilities:

  • Haiku-class: Fast, cheap, works for straightforward tasks
  • Sonnet-class: Balanced, handles complex tasks
  • Opus-class: Most capable, handles nuanced reasoning

Cost differences are often 4-10x. Quality differences are usually 5-15%.

Matching Model to Task

Classification (emails, documents, items):

  • Haiku is usually fine
  • Use Sonnet if categories are nuanced
  • Use Opus only if you need complex reasoning

Summarization:

  • Haiku for straightforward documents
  • Sonnet for long or complex documents
  • Opus rarely needed

Creative/complex reasoning:

  • Sonnet for most cases
  • Opus for deeply complex reasoning

Real-time chat/interactive:

  • Haiku for quick responses
  • Sonnet for thoughtful responses
  • Opus for user-facing high-stakes conversations

Cost Impact of Right-Sizing

Example: Classify 100,000 support emails/month

Using GPT-4 (wrong-sized):

  • 100,000 emails × 1,500 tokens = 150M tokens
  • Cost: 150M × $0.01/1K = $1,500/month

Using Claude 3 Haiku (right-sized):

  • 100,000 emails × 1,500 tokens = 150M tokens
  • Cost: 150M × $0.00025/1K = $37.50/month

Savings: $1,462.50/month, a 98% reduction, with the same functionality!

Decision rule: Start with cheap models. Upgrade only if accuracy isn’t sufficient and the upgrade is cost-justified.
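
The arithmetic above can be wrapped in a small helper for comparing candidate models before committing. The per-1K-token prices below are the illustrative figures from this example, not current list prices.

```python
# Compare monthly cost of the same workload on different models.
# Prices are illustrative per-1K-token rates from the example above.
PRICE_PER_1K = {
    "gpt-4": 0.01,
    "claude-3-haiku": 0.00025,
}

def monthly_cost(requests: int, tokens_per_request: int, model: str) -> float:
    """Estimated monthly spend for a workload on a given model."""
    total_tokens = requests * tokens_per_request
    return total_tokens / 1000 * PRICE_PER_1K[model]

big = monthly_cost(100_000, 1_500, "gpt-4")             # $1,500
small = monthly_cost(100_000, 1_500, "claude-3-haiku")  # $37.50
print(f"Right-sizing saves ${big - small:,.2f}/month")
```

Running this reproduces the 98% savings from the email-classification example above.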

Strategy 2: Input Optimization

Shrink Token Count Without Losing Quality

Most input optimization focuses on reducing the number of tokens you send.

Technique 1: Extract Before Providing

  • Don’t send the full document; send the key parts
  • Full document: 5,000 words ≈ 6,700 tokens
  • Key extract: 500 words ≈ 670 tokens
  • Savings: 90%, minimal quality loss

Example: Summarizing a legal document

  • Naive: Send full 10,000-word contract
  • Optimized: Extract sections relevant to query, send 2,000 words
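
A minimal "extract before providing" sketch: keep only the paragraphs most relevant to the query. Keyword overlap here is a stand-in for a real retriever or section parser, not a prescribed method.

```python
# Keep only the paragraphs most relevant to the query; keyword overlap
# is a crude stand-in for a real retriever or section parser.
def extract_relevant(document: str, query: str, max_paragraphs: int = 5) -> str:
    query_terms = set(query.lower().split())
    paragraphs = [p for p in document.split("\n\n") if p.strip()]
    scored = sorted(
        paragraphs,
        key=lambda p: len(query_terms & set(p.lower().split())),
        reverse=True,
    )
    return "\n\n".join(scored[:max_paragraphs])
```

The model then sees a few hundred relevant words instead of the whole contract.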

Technique 2: Chunking and Focusing

  • Split large documents into sections
  • Only process relevant sections
  • For classification: Send first 500 words + summary

Document classification:

  • Full document: 10,000 words
  • First 500 words + title + summary: 1,000 words
  • Efficiency gain: 90% cost reduction
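
The "first 500 words + title + summary" input can be assembled in a few lines; the exact field layout below is an assumption to adapt to your own prompt format.

```python
# Build the compact "title + summary + first N words" classification
# input; field layout is an assumption, adjust to your prompt format.
def classification_input(title: str, summary: str, body: str,
                         max_words: int = 500) -> str:
    excerpt = " ".join(body.split()[:max_words])  # first N words only
    return f"Title: {title}\nSummary: {summary}\nExcerpt: {excerpt}"
```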

Technique 3: Prompt Template Caching

  • Same prompt structure many times? Use caching
  • API caches prompts to reduce cost of repeated requests
  • Often 50-90% cost reduction on cached content, depending on the provider

Example with prompt caching:

System prompt (cached): "You are a customer service AI..."
Context (cached): "Company policies, common answers..."
User query (new): "What's your return policy?"

Cost: You pay the full rate only for the new query; the cached parts are billed at a much lower rate
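
As a sketch, an Anthropic-style request marks the reusable prefix for caching with a cache_control block. The field names and model ID below are assumptions; check your provider's and SDK's documentation.

```python
# Sketch of an Anthropic-style request with prompt caching. The
# cache_control marker flags the reusable prefix for caching; field
# names and the model ID are assumptions -- check your SDK's docs.
request = {
    "model": "claude-3-haiku-20240307",
    "max_tokens": 300,
    "system": [
        {
            "type": "text",
            "text": "You are a customer service AI... [company policies, common answers]",
            "cache_control": {"type": "ephemeral"},  # cached across requests
        }
    ],
    "messages": [
        {"role": "user", "content": "What's your return policy?"},  # billed as new input
    ],
}
# client.messages.create(**request)  # then pass to your SDK client
```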

Technique 4: Use Structured Input/Output

  • JSON is more efficient than natural language
  • Less ambiguity = fewer tokens for explanation
  • A free-form “Classify this email …” blob vs. a structured “Subject: ... / Body: ...” layout

Cost savings from structuring: 10-20%

Practical Input Optimization

Monthly processing: 100,000 documents, 2,000 words each

Current approach (unoptimized):

  • Full documents: 100K × 2K words = 200M words ≈ 270M tokens
  • Cost (GPT-4 at $0.01/1K): ≈ $2,700/month

Optimization approach:

  1. Extract key text (50% of tokens): ≈ 135M tokens, ≈ $1,350
  2. Switch to Haiku ($0.00025/1K, about 2.5% of GPT-4’s price): ≈ $34
  3. Use batch API for non-urgent work (50% discount): ≈ $17

Total savings: ≈ $2,680/month (99% reduction)

Strategy 3: Architecture Optimization

Some architectural changes reduce costs significantly.

Caching and Memoization

What it is: Store results for common queries; reuse instead of recomputing.

When it works: You see repeated questions/requests.

Implementation:

  • Cache results of common queries
  • If user asks “What’s your return policy?” for 100th time, return cached answer
  • Only process new/unique queries

Savings: 20-50% if you have repeated queries (common in Q&A systems)

Example: Customer support chatbot

  • 10,000 questions/day
  • 30% are duplicates of previous questions
  • Caching saves: 3,000 × average cost per query
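
A minimal in-memory version of this cache, keyed on a normalized query. Real deployments would typically add TTLs, shared storage such as Redis, and possibly embedding-based matching for near-duplicate questions.

```python
# Minimal response cache keyed on a normalized query: duplicate
# questions skip the model call (and its cost) entirely.
import hashlib

class ResponseCache:
    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, query: str) -> str:
        normalized = " ".join(query.lower().split())  # case/whitespace-insensitive
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get_or_compute(self, query: str, model_call) -> str:
        key = self._key(query)
        if key in self._store:
            self.hits += 1           # duplicate: no model call, no cost
            return self._store[key]
        self.misses += 1
        answer = model_call(query)   # unique: pay for one model call
        self._store[key] = answer
        return answer

cache = ResponseCache()
```

With a 30% duplicate rate, roughly 3 in 10 queries become free cache hits.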

Batch vs. Real-Time APIs

Real-time APIs: Instant response, higher cost

  • Customer service: Need real-time
  • Email classification: Can wait

Batch APIs: Process in bulk, 50% discount

  • Processing 1,000 emails: Use batch API
  • Processing 1 email from a user: Use real-time API

Cost strategy:

  • Real-time for interactive features
  • Batch API for background processing
  • Hybrid: Real-time for immediate response (cheap model), batch for detailed analysis

Example hybrid:

  • User asks question: Haiku provides an immediate response within seconds
  • Background job: Runs detailed analysis via batch API
  • Next time user asks similar question: Use detailed analysis from background job
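
The blended spend of such a split can be estimated with a one-liner, assuming the commonly offered 50% batch discount:

```python
# Estimate blended spend when a workload is split between real-time
# and batch endpoints (assumes a 50% batch discount).
def blended_cost(total_requests: int, realtime_fraction: float,
                 cost_per_request: float, batch_discount: float = 0.5) -> float:
    realtime = total_requests * realtime_fraction * cost_per_request
    batch = (total_requests * (1 - realtime_fraction)
             * cost_per_request * (1 - batch_discount))
    return realtime + batch
```

Routing 80% of a 1,000-request workload at $0.01/request to batch drops spend from $10 to $6.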

Two-Stage Classification

For complex classification, use fast model first, slow model second.

Stage 1: Quick filter (Haiku)

  • “Is this customer question clearly in scope?”
  • Cost: $0.0003 per query
  • Catches 80% of out-of-scope questions

Stage 2: Detailed classification (Claude 3 Opus)

  • Only the remaining 20% reach the expensive model
  • Cost: $0.075 per escalated query
  • Blended cost: (100% × $0.0003) + (20% × $0.075) ≈ $0.0153 per query

Direct Opus approach: $0.075 × 100% = $0.075 per query

Savings: 80% with minimal accuracy loss
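
Note that every query pays for Stage 1, so the blended cost is the cheap price plus the escalation rate times the expensive price. Using the illustrative prices above:

```python
# Expected per-query cost of the two-stage filter: every query pays
# the cheap Stage 1 price; only escalations also pay the expensive
# Stage 2 price. Per-query prices are the illustrative figures above.
def two_stage_cost(cheap: float, expensive: float, escalation_rate: float) -> float:
    return cheap + escalation_rate * expensive

blended = two_stage_cost(cheap=0.0003, expensive=0.075, escalation_rate=0.20)
print(f"${blended:.4f} per query vs $0.0750 all-Opus")
```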

Strategy 4: Self-Hosting and Open Source

When Self-Hosting Makes Sense

At what volume does self-hosting become cheaper than APIs?

Example: Document Classification

  • Daily volume: 10,000 documents
  • API cost (Haiku): $250/month
  • Self-hosted Llama 2 cost:
    • GPU instance: $1,000/month
    • Ops/monitoring: $200/month
    • Total: $1,200/month

Self-hosting only makes sense at:

  • 50,000+ documents/day (API would be $1,250+/month)
  • Strict data residency requirements
  • Custom model training on proprietary data
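
The break-even point follows directly from the figures above (all of which are illustrative assumptions):

```python
# Break-even: daily volume where API spend equals the fixed monthly
# self-hosting cost. Dollar figures come from the example above.
def breakeven_docs_per_day(selfhost_monthly: float, api_cost_per_doc: float,
                           days_per_month: int = 30) -> float:
    return selfhost_monthly / (api_cost_per_doc * days_per_month)

api_cost_per_doc = 250 / (10_000 * 30)  # $250/month at 10,000 docs/day
threshold = breakeven_docs_per_day(1_200, api_cost_per_doc)
print(f"Self-hosting breaks even around {threshold:,.0f} docs/day")
```

With these figures the crossover lands near 48,000 documents/day, consistent with the 50,000+ rule of thumb above.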

Open Source Model Quality

Open source models have improved significantly:

  • Llama 2/3: 70-85% of frontier model capability at 5-10% of cost
  • Mistral 7B: Strong small model, good for classification
  • Phi: Efficient model, good for reasoning
  • Falcon: Strong mid-size model

Quality comparison (classification task):

  • GPT-4: 95% accuracy, $0.10 per classification
  • Claude 3 Haiku: 92% accuracy, $0.001 per classification
  • Llama 2 (self-hosted): 85% accuracy, $0.0001 per classification

Use case: If you only need 85% accuracy, Llama 2 is 1000x cheaper.

True Cost of Self-Hosting

Budget should include:

  • Infrastructure (compute, storage): $1K-5K/month per model
  • Maintenance and ops: $500-2K/month
  • Monitoring and alerting: $200-500/month
  • Model updates and retraining: $5K-10K quarterly
  • Incidents and debugging: 10-40 hours/month
  • Total: $3K-10K/month minimum

Self-hosting makes sense only at significant scale or with unique requirements.

Strategy 5: Prompt Optimization for Cost

Reduce Output Length

Shorter outputs cost less (fewer output tokens).

Technique: Be specific about length requirements.

Poor prompt: “Summarize this article”
Good prompt: “Summarize this article in 50 words”

Cost difference: ~50% (longer responses cost more)

Use Structured Output Requests

JSON requests are more efficient than natural language.

Instead of: "Extract the customer's name, email, and reason for contacting us."

Use:
{
  "name": "string",
  "email": "string",
  "reason": "category: complaint|question|feedback"
}

Efficiency gain: 10-20%

Stop Sequences

Use stop tokens to prevent over-generation.

Classify this email as: {one word}
[STOP]

vs.

Classify this email as one of: positive, neutral, or negative.
This email should be classified as: [continue generating...]

The first version stops immediately; the second might generate paragraphs.

Savings: 5-20% depending on task
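
In a request body this is a single parameter; the parameter name varies by SDK (stop, stop_sequences, ...), so the shape below is an assumption to check against your provider's docs.

```python
# Sketch of a classification request capped by a stop sequence and a
# small max_tokens; the parameter name ("stop_sequences" here) varies
# by provider, so treat this shape as an assumption.
request = {
    "model": "claude-3-haiku-20240307",
    "max_tokens": 5,           # hard cap as a second guard
    "stop_sequences": ["\n"],  # stop at the first newline
    "messages": [
        {
            "role": "user",
            "content": "Classify this email as one word (positive, neutral, or negative):\n...",
        }
    ],
}
```

Pairing a stop sequence with a tight max_tokens bounds output cost even if the model ignores the one-word instruction.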

Cached System Prompts

System prompts are usually repeated. Cache them.

Using prompt caching:

  • System prompt: “You’re a support AI…” (cached)
  • Context: Support guidelines (cached)
  • Query: User question (new)

Only the query is billed at the full input rate; the cached system prompt and context are billed at a steeply discounted cached rate.

Cost Monitoring and Optimization Loop

Monthly Cost Dashboard

Track:

  • Total API spend by model
  • Cost per transaction
  • Tokens processed (input/output breakdown)
  • Infrastructure costs
  • Team costs

Example metrics:

Total monthly spend: $5,000
  GPT-4: $2,000 (40%)
  Haiku: $500 (10%)
  Infrastructure: $1,000 (20%)
  Team: $1,500 (30%)

Cost per classification: $0.005
Cost per customer served: $0.10
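
The spend breakdown above rolls up directly from raw line items; the dollar figures here are the example's, not real billing data.

```python
# Roll up spend by line item with each item's share of the total.
# Figures are the example's, not real billing data.
spend = {"GPT-4": 2000, "Haiku": 500, "Infrastructure": 1000, "Team": 1500}
total = sum(spend.values())
breakdown = {
    name: f"${amount:,} ({amount / total:.0%})" for name, amount in spend.items()
}
print(f"Total monthly spend: ${total:,}")  # Total monthly spend: $5,000
```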

Quarterly Optimization Review

  1. Identify high-cost areas

    • Where’s the money going?
    • Could we right-size models?
    • Could we batch more?
  2. Estimate savings from optimization

    • Switching from GPT-4 to Haiku: could save 75%
    • Using batch API: could save 50%
    • Caching: could save 30%
  3. Prioritize by impact and effort

    • Easy wins: Better prompts, shorter outputs
    • Medium effort: Architecture changes, caching
    • Hard wins: Model switch, retraining
  4. Implement and measure

    • Change one variable at a time
    • Measure impact on cost and quality
    • Document learnings

When NOT to Optimize

Sometimes optimization isn’t worth it.

Skip optimization if:

  • Your AI spend is less than $5K/month (optimization time isn’t worth it)
  • Quality impact would be negative (users care more than you save)
  • You’re still in pilot phase (focus on learning, not cost)
  • You have unlimited budget (some companies do)

Focus optimization on:

  • High-volume, low-stakes tasks
  • Tasks where accuracy doesn’t need to be perfect
  • Batch processes that can wait
  • Repeated operations

Strategic Questions

  1. What model are we using and why? Right-sized or oversized?
  2. What’s our cost per unit of value? Know this number.
  3. Where are we wasting money? Oversized models? Inefficient prompts?
  4. What’s our optimization roadmap? Next 3 improvements?
  5. At what scale does self-hosting make sense? When do the economics flip?

Key Takeaway: AI cost optimization starts with using the right-sized model for the job. Optimize inputs through extraction and caching. Consider architecture changes like two-stage classification. Monitor costs continuously. Self-host only at scale. The best approach is starting cheap and upgrading only when justified by value.

Discussion Prompt

For your AI system: Are you using the right-sized model? Where could you optimize inputs? What’s your cost per transaction and is that acceptable?