Cost Optimization Strategies
The Optimization Paradox
Better AI models cost more, but the economics don’t always justify the better model. Using GPT-4 instead of GPT-3.5 on a classification task where GPT-3.5 already works might improve accuracy from 85% to 88%—but if 85% is good enough for your use case, you’re paying roughly 3x the cost for 3 percentage points of improvement.
The best approach: start with efficient models and upgrade only when cost is justified by value.
Strategy 1: Right-Sizing the Model
The Model Capability Curve
Foundation models have different capabilities:
- Haiku-class: Fast, cheap, works for straightforward tasks
- Sonnet-class: Balanced, handles complex tasks
- Opus-class: Most capable, handles nuanced reasoning
Cost differences are often 4-10x. Quality differences are usually 5-15%.
Matching Model to Task
Classification (emails, documents, items):
- Haiku is usually fine
- Use Sonnet if categories are nuanced
- Use Opus only if you need complex reasoning
Summarization:
- Haiku for straightforward documents
- Sonnet for long or complex documents
- Opus rarely needed
Creative/complex reasoning:
- Sonnet for most cases
- Opus for deeply complex reasoning
Real-time chat/interactive:
- Haiku for quick responses
- Sonnet for thoughtful responses
- Opus for user-facing high-stakes conversations
Cost Impact of Right-Sizing
Example: Classify 100,000 support emails/month
Using GPT-4 (wrong-sized):
- 100,000 emails × 1,500 tokens = 150M tokens
- Cost: 150M × $0.01/1K = $1,500/month
Using Claude 3 Haiku (right-sized):
- 100,000 emails × 1,500 tokens = 150M tokens
- Cost: 150M × $0.00025/1K = $37.50/month
Savings: $1,462.50/month (a ~98% reduction), with the same functionality!
Decision rule: Start with cheap models. Upgrade only if accuracy isn’t sufficient and the upgrade is cost-justified.
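The arithmetic above generalizes to a small helper. A minimal sketch, using the illustrative per-1K-token prices from this example (real prices vary by provider and change over time):

```python
# Compare monthly spend for candidate models on a fixed workload.
# Prices are the illustrative per-1K-token rates used in the example above.

def monthly_cost(requests_per_month: int, tokens_per_request: int,
                 price_per_1k_tokens: float) -> float:
    """Total monthly spend for one model at one price point."""
    total_tokens = requests_per_month * tokens_per_request
    return total_tokens / 1000 * price_per_1k_tokens

PRICES = {  # $/1K input tokens, illustrative
    "gpt-4": 0.01,
    "haiku": 0.00025,
}

gpt4 = monthly_cost(100_000, 1_500, PRICES["gpt-4"])
haiku = monthly_cost(100_000, 1_500, PRICES["haiku"])
print(f"GPT-4: ${gpt4:,.2f}  Haiku: ${haiku:,.2f}  savings: {1 - haiku / gpt4:.1%}")
```

Plugging in your own volumes and current list prices makes the upgrade decision a one-line calculation rather than a guess.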
Strategy 2: Input Optimization
Shrink Token Count Without Losing Quality
Most input optimization focuses on reducing the number of tokens you send.
Technique 1: Extract Before Providing
- Don’t send the full document; send the key parts
- Full document: 5,000 words ≈ 6,700 tokens (at ~1.33 tokens per word)
- Key extract: 500 words ≈ 670 tokens
- Savings: 90%, minimal quality loss
Example: Summarizing a legal document
- Naive: Send full 10,000-word contract
- Optimized: Extract sections relevant to query, send 2,000 words
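Extraction can start as simply as keeping the paragraphs that overlap most with the query. A rough sketch with naive word-overlap scoring—a real system would likely use BM25 or embeddings, and `extract_relevant` and the sample contract text are hypothetical:

```python
# Naive keyword-overlap extraction: keep only the paragraphs that share
# the most words with the query, and send just those to the model.

def extract_relevant(document: str, query: str, max_paragraphs: int = 5) -> str:
    query_words = set(query.lower().split())
    paragraphs = [p for p in document.split("\n\n") if p.strip()]
    scored = sorted(paragraphs,
                    key=lambda p: len(query_words & set(p.lower().split())),
                    reverse=True)
    selected = set(scored[:max_paragraphs])
    # Preserve the original paragraph order for readability.
    return "\n\n".join(p for p in paragraphs if p in selected)

contract = ("Termination: either party may terminate with 30 days notice.\n\n"
            "Payment terms: invoices are due net 30.\n\n"
            "Confidentiality: both parties protect shared information.")
print(extract_relevant(contract, "When can either party terminate?", max_paragraphs=1))
```

Even this crude filter captures the idea: the model only sees the paragraphs that matter, so you pay for a fraction of the tokens.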
Technique 2: Chunking and Focusing
- Split large documents into sections
- Only process relevant sections
- For classification: Send first 500 words + summary
Document classification:
- Full document: 10,000 words
- First 500 words + title + summary: 1,000 words
- Efficiency gain: 90% cost reduction
Technique 3: Prompt Template Caching
- Same prompt structure many times? Use caching
- API caches prompts to reduce cost of repeated requests
- Cached input tokens are billed at a steep discount (up to ~90% off, depending on provider)
Example with prompt caching:
System prompt (cached): "You are a customer service AI..."
Context (cached): "Company policies, common answers..."
User query (new): "What's your return policy?"
Cost: the cached parts are billed at the discounted cached rate on repeat requests; only the new query is charged in full
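The request shape for this pattern, sketched in the style of Anthropic's prompt-caching API (`cache_control` marks reusable blocks; the model name, texts, and discount behavior are illustrative—check current provider documentation before relying on them):

```python
# Build a request where the stable system prompt and context are marked
# cacheable, and only the per-user query is fresh input.

def build_request(system_prompt: str, context: str, user_query: str) -> dict:
    return {
        "model": "claude-3-haiku-20240307",
        "max_tokens": 300,
        "system": [
            # Cached blocks: billed at a discounted rate on repeat requests.
            {"type": "text", "text": system_prompt,
             "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": context,
             "cache_control": {"type": "ephemeral"}},
        ],
        # Only this part changes per request and is billed at the full rate.
        "messages": [{"role": "user", "content": user_query}],
    }

req = build_request("You are a customer service AI...",
                    "Company policies, common answers...",
                    "What's your return policy?")
```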
Technique 4: Use Structured Input/Output
- JSON is more efficient than natural language
- Less ambiguity = fewer tokens for explanation
- “Classify this email” (free-form) vs. “Subject: ...\nBody: ...” (structured)
Cost savings from structuring: 10-20%
Practical Input Optimization
Monthly processing: 100,000 documents, 2,000 words each
Current approach (unoptimized):
- Full documents: 100K × 2,000 words ≈ 270M tokens (at ~1.33 tokens per word)
- Cost (GPT-4, $0.01/1K tokens): ~$2,700/month
Optimization approach:
- Extract key text (50% of tokens): ~135M tokens, ~$1,350 on GPT-4
- Switch to Haiku ($0.00025/1K, 2.5% of GPT-4’s price): ~$34
- Use batch API for non-urgent work (50% discount): ~$17
Total savings: ~$2,680/month (99% reduction)
Strategy 3: Architecture Optimization
Some architectural changes reduce costs significantly.
Caching and Memoization
What it is: Store results for common queries; reuse instead of recomputing.
When it works: You see repeated questions/requests.
Implementation:
- Cache results of common queries
- If user asks “What’s your return policy?” for 100th time, return cached answer
- Only process new/unique queries
Savings: 20-50% if you have repeated queries (common in Q&A systems)
Example: Customer support chatbot
- 10,000 questions/day
- 30% are duplicates of previous questions
- Caching saves: 3,000 × average cost per query
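A minimal memoization layer for this pattern. The normalization and hashing here are deliberately crude—production systems often match near-duplicate questions with embedding similarity—and `answer_fn` stands in for the real model call:

```python
# Cache answers to repeated questions so only new queries hit the model.
import hashlib

def normalize(question: str) -> str:
    """Collapse case and whitespace so trivial variants hash identically."""
    return " ".join(question.lower().split())

class AnswerCache:
    def __init__(self, answer_fn):
        self.answer_fn = answer_fn          # the expensive model call
        self.store = {}
        self.hits = 0
        self.misses = 0

    def ask(self, question: str) -> str:
        key = hashlib.sha256(normalize(question).encode()).hexdigest()
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        self.store[key] = self.answer_fn(question)
        return self.store[key]

cache = AnswerCache(lambda q: f"model answer to: {q}")
cache.ask("What's your return policy?")
cache.ask("what's  YOUR return policy?")    # normalized duplicate: cache hit
```

Tracking `hits` and `misses` also gives you the duplicate rate, which tells you how much caching is actually saving.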
Batch vs. Real-Time APIs
Real-time APIs: Instant response, higher cost
- Customer service: Need real-time
- Email classification: Can wait
Batch APIs: Process in bulk, 50% discount
- Processing 1,000 emails: Use batch API
- Processing 1 email from a user: Use real-time API
Cost strategy:
- Real-time for interactive features
- Batch API for background processing
- Hybrid: Real-time for immediate response (cheap model), batch for detailed analysis
Example hybrid:
- User asks question: Haiku provides a quick first response within seconds
- Background job: Runs detailed analysis via batch API
- Next time user asks similar question: Use detailed analysis from background job
Two-Stage Classification
For complex classification, use fast model first, slow model second.
Stage 1: Quick filter (Haiku)
- “Is this customer question clearly in scope?”
- Cost: $0.0003 per query
- Catches 80% of out-of-scope questions
Stage 2: Detailed classification (Claude 3 Opus)
- Only process remaining 20% with expensive model
- Cost: $0.075 per query
- Total cost: (100% × $0.0003) + (20% × $0.075) ≈ $0.015 per query (the filter runs on every query)
Direct Opus approach: $0.075 × 100% = $0.075 per query
Savings: 80% with minimal accuracy loss
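The blended cost above can be computed directly. Note the cheap filter runs on every query, so its cost applies to 100% of traffic; `escalation_rate` is the share passed on to the expensive model. Prices are the illustrative ones from this example:

```python
# Expected per-query cost of a two-stage cascade.

def cascade_cost(cheap: float, expensive: float, escalation_rate: float) -> float:
    return cheap + escalation_rate * expensive

blended = cascade_cost(cheap=0.0003, expensive=0.075, escalation_rate=0.20)
print(f"${blended:.4f} per query vs ${0.075:.4f} for the single-stage approach")
```

Varying `escalation_rate` shows how sensitive the savings are to how much the cheap stage can actually settle on its own.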
Strategy 4: Self-Hosting and Open Source
When Self-Hosting Makes Sense
At what volume does self-hosting become cheaper than APIs?
Example: Document Classification
- Daily volume: 10,000 documents
- API cost (Haiku): $250/month
- Self-hosted Llama 2 cost:
- GPU instance: $1,000/month
- Ops/monitoring: $200/month
- Total: $1,200/month
Self-hosting only makes sense at:
- 50,000+ documents/day (API would be $1,250+/month)
- Strict data residency requirements
- Custom model training on proprietary data
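A quick way to find that crossover point. The figures are the illustrative ones above ($250/month of API spend at 10,000 docs/day, $1,200/month fixed self-hosting cost); `breakeven_daily_volume` is a hypothetical helper, and real costs are lumpier than this linear model assumes:

```python
# Break-even daily volume: the point where monthly API spend equals the
# fixed monthly cost of self-hosting.

def breakeven_daily_volume(api_cost_per_doc: float,
                           selfhost_fixed_monthly: float,
                           days_per_month: int = 30) -> float:
    """Daily volume at which API spend equals the fixed self-hosting cost."""
    return selfhost_fixed_monthly / api_cost_per_doc / days_per_month

api_cost_per_doc = 250 / (10_000 * 30)   # ≈ $0.00083 per document
print(f"{breakeven_daily_volume(api_cost_per_doc, 1_200):,.0f} docs/day")
```

Below that volume, the API is cheaper even before counting the ops time self-hosting demands.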
Open Source Model Quality
Open source models have improved significantly:
- Llama 2/3: 70-85% of frontier model capability at 5-10% of cost
- Mistral 7B: Strong small model, good for classification
- Phi: Efficient model, good for reasoning
- Falcon: Strong mid-size model
Quality comparison (classification task):
- GPT-4: 95% accuracy, $0.10 per classification
- Claude 3 Haiku: 92% accuracy, $0.001 per classification
- Llama 2 (self-hosted): 85% accuracy, $0.0001 per classification
Use case: If you only need 85% accuracy, Llama 2 is 1000x cheaper.
True Cost of Self-Hosting
Budget should include:
- Infrastructure (compute, storage): $1K-5K/month per model
- Maintenance and ops: $500-2K/month
- Monitoring and alerting: $200-500/month
- Model updates and retraining: $5K-10K quarterly
- Incidents and debugging: 10-40 hours/month
- Total: $3K-10K/month minimum
Self-hosting makes sense only at significant scale or with unique requirements.
Strategy 5: Prompt Optimization for Cost
Reduce Output Length
Shorter outputs cost less (fewer output tokens).
Technique: Be specific about length requirements.
Poor prompt: “Summarize this article”
Good prompt: “Summarize this article in 50 words”
Cost difference: ~50% (longer responses cost more)
Use Structured Output Requests
JSON requests are more efficient than natural language.
Instead of: "Extract the customer's name, email, and reason for contacting us."
Use:
{
  "name": "string",
  "email": "string",
  "reason": "category: complaint|question|feedback"
}
Efficiency gain: 10-20%
Stop Sequences
Use stop tokens to prevent over-generation.
Classify this email as: {one word}
[STOP]
vs.
Classify this email as one of: positive, neutral, or negative.
This email should be classified as: [continue generating...]
The first version stops immediately; the second might generate paragraphs.
Savings: 5-20% depending on task
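The first version, expressed as request parameters: a stop sequence combined with a hard token cap. Field names mirror Anthropic-style chat APIs and should be treated as illustrative:

```python
# Constrain output with a stop sequence plus a hard max_tokens ceiling,
# so a one-word classification can never balloon into paragraphs.

request = {
    "model": "claude-3-haiku-20240307",
    "max_tokens": 5,              # a one-word label never needs more
    "stop_sequences": ["\n"],     # halt at the first newline
    "messages": [{
        "role": "user",
        "content": "Classify this email as one word: positive, neutral, or negative.\n"
                   "Email: 'Thanks, the refund arrived quickly!'\n"
                   "Label:",
    }],
}
```

Belt and braces: even if the model ignores the stop sequence, `max_tokens` caps the bill.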
Cached System Prompts
System prompts are usually repeated. Cache them.
Using prompt caching:
- System prompt: “You’re a support AI…” (cached)
- Context: Support guidelines (cached)
- Query: User question (new)
The system prompt and context are billed at the discounted cached rate; only the query is charged in full.
Cost Monitoring and Optimization Loop
Monthly Cost Dashboard
Track:
- Total API spend by model
- Cost per transaction
- Tokens processed (input/output breakdown)
- Infrastructure costs
- Team costs
Example metrics:
Total monthly spend: $5,000
GPT-4: $2,000 (40%)
Haiku: $500 (10%)
Infrastructure: $1,000 (20%)
Team: $1,500 (30%)
Cost per classification: $0.005
Cost per customer served: $0.10
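The two unit metrics fall out of the totals once you track volumes. The monthly volumes below (1M classifications, 50K customers) are back-calculated assumptions that make the illustrative dashboard figures work out:

```python
# Derive the dashboard's per-unit metrics from raw monthly totals.

def cost_per_unit(total_spend: float, units: int) -> float:
    return total_spend / units

per_classification = cost_per_unit(5_000, 1_000_000)   # target: $0.005
per_customer = cost_per_unit(5_000, 50_000)            # target: $0.10
print(f"per classification: ${per_classification:.3f}, "
      f"per customer: ${per_customer:.2f}")
```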
Quarterly Optimization Review
1. Identify high-cost areas
- Where’s the money going?
- Could we right-size models?
- Could we batch more?
2. Estimate savings from optimization
- Switching from GPT-4 to Haiku: could save 75%
- Using batch API: could save 50%
- Caching: could save 30%
3. Prioritize by impact and effort
- Easy wins: Better prompts, shorter outputs
- Medium effort: Architecture changes, caching
- Hard wins: Model switch, retraining
4. Implement and measure
- Change one variable at a time
- Measure impact on cost and quality
- Document learnings
When NOT to Optimize
Sometimes optimization isn’t worth it.
Skip optimization if:
- Your AI spend is less than $5K/month (optimization time isn’t worth it)
- Quality impact would be negative (users care more than you save)
- You’re still in pilot phase (focus on learning, not cost)
- You have unlimited budget (some companies do)
Focus optimization on:
- High-volume, low-stakes tasks
- Tasks where accuracy doesn’t need to be perfect
- Batch processes that can wait
- Repeated operations
Strategic Questions
- What model are we using and why? Right-sized or oversized?
- What’s our cost per unit of value? Know this number.
- Where are we wasting money? Oversized models? Inefficient prompts?
- What’s our optimization roadmap? Next 3 improvements?
- At what scale does self-hosting make sense? When do the economics flip?
Key Takeaway: AI cost optimization starts with using the right-sized model for the job. Optimize inputs through extraction and caching. Consider architecture changes like two-stage classification. Monitor costs continuously. Self-host only at scale. The best approach is starting cheap and upgrading only when justified by value.
Discussion Prompt
For your AI system: Are you using the right-sized model? Where could you optimize inputs? What’s your cost per transaction and is that acceptable?