Cost Optimization Strategies
The Optimization Paradox
Better AI models cost more, but the economics don’t always justify the better model. Using GPT-4 instead of GPT-3.5 on a classification task where GPT-3.5 already works might improve accuracy from 85% to 88%—but if 85% is good enough for your use case, you’re paying roughly 3x the cost for 3 percentage points of improvement.
The best approach: start with efficient models and upgrade only when cost is justified by value.
Strategy 1: Right-Sizing the Model
The Model Capability Curve
Foundation models have different capabilities:
- Haiku-class: Fast, cheap, works for straightforward tasks
- Sonnet-class: Balanced, handles complex tasks
- Opus-class: Most capable, handles nuanced reasoning
Cost differences are often 4-10x. Quality differences are usually 5-15%.
Matching Model to Task
Classification (emails, documents, items):
- Haiku is usually fine
- Use Sonnet if categories are nuanced
- Use Opus only if you need complex reasoning
Summarization:
- Haiku for straightforward documents
- Sonnet for long or complex documents
- Opus rarely needed
Creative/complex reasoning:
- Sonnet for most cases
- Opus for deeply complex reasoning
Real-time chat/interactive:
- Haiku for quick responses
- Sonnet for thoughtful responses
- Opus for user-facing high-stakes conversations
Cost Impact of Right-Sizing
Example: Classify 100,000 support emails/month
Using GPT-4 (wrong-sized):
- 100,000 emails × 1,500 tokens = 150M tokens
- Cost: 150M × $0.01/1K = $1,500/month
Using Claude 3 Haiku (right-sized):
- 100,000 emails × 1,500 tokens = 150M tokens
- Cost: 150M × $0.00025/1K = $37.50/month
Savings: $1,462.50/month (a ~98% reduction), with the same functionality!
Decision rule: Start with cheap models. Upgrade only if accuracy isn’t sufficient and the upgrade is cost-justified.
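The arithmetic above generalizes to a small helper. A minimal sketch, using the illustrative per-1K-token prices from this example (real prices vary by provider and change over time):

```python
# Compare monthly spend for candidate models on a fixed workload.
# Prices are the illustrative per-1K-token rates used in the example above.

def monthly_cost(requests_per_month: int, tokens_per_request: int,
                 price_per_1k_tokens: float) -> float:
    """Total monthly spend for one model at one price point."""
    total_tokens = requests_per_month * tokens_per_request
    return total_tokens / 1000 * price_per_1k_tokens

PRICES = {  # $/1K input tokens, illustrative
    "gpt-4": 0.01,
    "haiku": 0.00025,
}

gpt4 = monthly_cost(100_000, 1_500, PRICES["gpt-4"])
haiku = monthly_cost(100_000, 1_500, PRICES["haiku"])
print(f"GPT-4: ${gpt4:,.2f}  Haiku: ${haiku:,.2f}  savings: {1 - haiku / gpt4:.1%}")
```

Plugging in your own volumes and current list prices makes the upgrade decision a one-line calculation rather than a guess.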
Strategy 2: Input Optimization
Shrink Token Count Without Losing Quality
Most input optimization focuses on reducing the number of tokens you send.
Technique 1: Extract Before Providing
- Don’t send the full document; send the key parts
- Full document: 5,000 words ≈ 6,700 tokens (at ~1.33 tokens per word)
- Key extract: 500 words ≈ 670 tokens
- Savings: 90%, minimal quality loss
Example: Summarizing a legal document
- Naive: Send full 10,000-word contract
- Optimized: Extract sections relevant to query, send 2,000 words
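Extraction can start as simply as keeping the paragraphs that overlap most with the query. A rough sketch with naive word-overlap scoring—a real system would likely use BM25 or embeddings, and `extract_relevant` and the sample contract text are hypothetical:

```python
# Naive keyword-overlap extraction: keep only the paragraphs that share
# the most words with the query, and send just those to the model.

def extract_relevant(document: str, query: str, max_paragraphs: int = 5) -> str:
    query_words = set(query.lower().split())
    paragraphs = [p for p in document.split("\n\n") if p.strip()]
    scored = sorted(paragraphs,
                    key=lambda p: len(query_words & set(p.lower().split())),
                    reverse=True)
    selected = set(scored[:max_paragraphs])
    # Preserve the original paragraph order for readability.
    return "\n\n".join(p for p in paragraphs if p in selected)

contract = ("Termination: either party may terminate with 30 days notice.\n\n"
            "Payment terms: invoices are due net 30.\n\n"
            "Confidentiality: both parties protect shared information.")
print(extract_relevant(contract, "When can either party terminate?", max_paragraphs=1))
```

Even this crude filter captures the idea: the model only sees the paragraphs that matter, so you pay for a fraction of the tokens.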
Technique 2: Chunking and Focusing
- Split large documents into sections
- Only process relevant sections
- For classification: Send first 500 words + summary
Document classification:
- Full document: 10,000 words
- First 500 words + title + summary: 1,000 words
- Efficiency gain: 90% cost reduction
Technique 3: Prompt Template Caching
- Same prompt structure many times? Use caching
- API caches prompts to reduce cost of repeated requests
- Cached input tokens are billed at a steep discount (up to ~90% off, depending on provider)
Example with prompt caching:
System prompt (cached): "You are a customer service AI..."
Context (cached): "Company policies, common answers..."
User query (new): "What's your return policy?"
Cost: the cached parts are billed at the discounted cached rate on repeat requests; only the new query is charged in full
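The request shape for this pattern, sketched in the style of Anthropic's prompt-caching API (`cache_control` marks reusable blocks; the model name, texts, and discount behavior are illustrative—check current provider documentation before relying on them):

```python
# Build a request where the stable system prompt and context are marked
# cacheable, and only the per-user query is fresh input.

def build_request(system_prompt: str, context: str, user_query: str) -> dict:
    return {
        "model": "claude-3-haiku-20240307",
        "max_tokens": 300,
        "system": [
            # Cached blocks: billed at a discounted rate on repeat requests.
            {"type": "text", "text": system_prompt,
             "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": context,
             "cache_control": {"type": "ephemeral"}},
        ],
        # Only this part changes per request and is billed at the full rate.
        "messages": [{"role": "user", "content": user_query}],
    }

req = build_request("You are a customer service AI...",
                    "Company policies, common answers...",
                    "What's your return policy?")
```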
Technique 4: Use Structured Input/Output
- JSON is more efficient than natural language
- Less ambiguity = fewer tokens for explanation
- “Classify this email” (free-form) vs. “Subject: ...\nBody: ...” (structured)
Cost savings from structuring: 10-20%
Practical Input Optimization
Monthly processing: 100,000 documents, 2,000 words each
Current approach (unoptimized):
- Full documents: 100K × 2,000 words ≈ 270M tokens (at ~1.33 tokens per word)
- Cost (GPT-4, $0.01/1K tokens): ~$2,700/month
Optimization approach:
- Extract key text (50% of tokens): ~135M tokens, ~$1,350 on GPT-4
- Switch to Haiku ($0.00025/1K, 2.5% of GPT-4’s price): ~$34
- Use batch API for non-urgent work (50% discount): ~$17
Total savings: ~$2,680/month (99% reduction)
Strategy 3: Architecture Optimization
Some architectural changes reduce costs significantly.
Caching and Memoization
What it is: Store results for common queries; reuse instead of recomputing.
When it works: You see repeated questions/requests.
Implementation:
- Cache results of common queries
- If user asks “What’s your return policy?” for 100th time, return cached answer
- Only process new/unique queries
Savings: 20-50% if you have repeated queries (common in Q&A systems)
Example: Customer support chatbot
- 10,000 questions/day
- 30% are duplicates of previous questions
- Caching saves: 3,000 × average cost per query
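A minimal memoization layer for this pattern. The normalization and hashing here are deliberately crude—production systems often match near-duplicate questions with embedding similarity—and `answer_fn` stands in for the real model call:

```python
# Cache answers to repeated questions so only new queries hit the model.
import hashlib

def normalize(question: str) -> str:
    """Collapse case and whitespace so trivial variants hash identically."""
    return " ".join(question.lower().split())

class AnswerCache:
    def __init__(self, answer_fn):
        self.answer_fn = answer_fn          # the expensive model call
        self.store = {}
        self.hits = 0
        self.misses = 0

    def ask(self, question: str) -> str:
        key = hashlib.sha256(normalize(question).encode()).hexdigest()
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        self.store[key] = self.answer_fn(question)
        return self.store[key]

cache = AnswerCache(lambda q: f"model answer to: {q}")
cache.ask("What's your return policy?")
cache.ask("what's  YOUR return policy?")    # normalized duplicate: cache hit
```

Tracking `hits` and `misses` also gives you the duplicate rate, which tells you how much caching is actually saving.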
Batch vs. Real-Time APIs
Real-time APIs: Instant response, higher cost
- Customer service: Need real-time
- Email classification: Can wait
Batch APIs: Process in bulk, 50% discount
- Processing 1,000 emails: Use batch API
- Processing 1 email from a user: Use real-time API
Cost strategy:
- Real-time for interactive features
- Batch API for background processing
- Hybrid: Real-time for immediate response (cheap model), batch for detailed analysis
Example hybrid:
- User asks question: Haiku provides a quick first response within seconds
- Background job: Runs detailed analysis via batch API
- Next time user asks similar question: Use detailed analysis from background job
Two-Stage Classification
For complex classification, use fast model first, slow model second.
Stage 1: Quick filter (Haiku)
- “Is this customer question clearly in scope?”
- Cost: $0.0003 per query
- Catches 80% of out-of-scope questions
Stage 2: Detailed classification (Claude 3 Opus)
- Only process remaining 20% with expensive model
- Cost: $0.075 per query
- Total cost: (100% × $0.0003) + (20% × $0.075) ≈ $0.015 per query (the filter runs on every query)
Direct Opus approach: $0.075 × 100% = $0.075 per query
Savings: 80% with minimal accuracy loss
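The blended cost above can be computed directly. Note the cheap filter runs on every query, so its cost applies to 100% of traffic; `escalation_rate` is the share passed on to the expensive model. Prices are the illustrative ones from this example:

```python
# Expected per-query cost of a two-stage cascade.

def cascade_cost(cheap: float, expensive: float, escalation_rate: float) -> float:
    return cheap + escalation_rate * expensive

blended = cascade_cost(cheap=0.0003, expensive=0.075, escalation_rate=0.20)
print(f"${blended:.4f} per query vs ${0.075:.4f} for the single-stage approach")
```

Varying `escalation_rate` shows how sensitive the savings are to how much the cheap stage can actually settle on its own.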
Strategy 4: Self-Hosting and Open Source
When Self-Hosting Makes Sense
At what volume does self-hosting become cheaper than APIs?
Example: Document Classification
- Daily volume: 10,000 documents
- API cost (Haiku): $250/month
- Self-hosted Llama 2 cost:
- GPU instance: $1,000/month
- Ops/monitoring: $200/month
- Total: $1,200/month
Self-hosting only makes sense at:
- 50,000+ documents/day (API would be $1,250+/month)
- Strict data residency requirements
- Custom model training on proprietary data
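A quick way to find that crossover point. The figures are the illustrative ones above ($250/month of API spend at 10,000 docs/day, $1,200/month fixed self-hosting cost); `breakeven_daily_volume` is a hypothetical helper, and real costs are lumpier than this linear model assumes:

```python
# Break-even daily volume: the point where monthly API spend equals the
# fixed monthly cost of self-hosting.

def breakeven_daily_volume(api_cost_per_doc: float,
                           selfhost_fixed_monthly: float,
                           days_per_month: int = 30) -> float:
    """Daily volume at which API spend equals the fixed self-hosting cost."""
    return selfhost_fixed_monthly / api_cost_per_doc / days_per_month

api_cost_per_doc = 250 / (10_000 * 30)   # ≈ $0.00083 per document
print(f"{breakeven_daily_volume(api_cost_per_doc, 1_200):,.0f} docs/day")
```

Below that volume, the API is cheaper even before counting the ops time self-hosting demands.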
Open Source Model Quality
Open source models have improved significantly:
- Llama 2/3: 70-85% of frontier model capability at 5-10% of cost
- Mistral 7B: Strong small model, good for classification
- Phi: Efficient model, good for reasoning
- Falcon: Strong mid-size model
Quality comparison (classification task):
- GPT-4: 95% accuracy, $0.10 per classification
- Claude 3 Haiku: 92% accuracy, $0.001 per classification
- Llama 2 (self-hosted): 85% accuracy, $0.0001 per classification
Use case: If you only need 85% accuracy, Llama 2 is 1000x cheaper.
True Cost of Self-Hosting
Budget should include:
- Infrastructure (compute, storage): $1K-5K/month per model
- Maintenance and ops: $500-2K/month
- Monitoring and alerting: $200-500/month
- Model updates and retraining: $5K-10K quarterly
- Incidents and debugging: 10-40 hours/month
- Total: $3K-10K/month minimum
Self-hosting makes sense only at significant scale or with unique requirements.
Strategy 5: Prompt Optimization for Cost
Reduce Output Length
Shorter outputs cost less (fewer output tokens).
Technique: Be specific about length requirements.
Poor prompt: “Summarize this article”
Good prompt: “Summarize this article in 50 words”
Cost difference: ~50% (longer responses cost more)
Use Structured Output Requests
JSON requests are more efficient than natural language.
Instead of: "Extract the customer's name, email, and reason for contacting us."
Use:
{
  "name": "string",
  "email": "string",
  "reason": "category: complaint|question|feedback"
}
Efficiency gain: 10-20%
Stop Sequences
Use stop tokens to prevent over-generation.
Classify this email as: {one word}
[STOP]
vs.
Classify this email as one of: positive, neutral, or negative.
This email should be classified as: [continue generating...]
The first version stops immediately; the second might generate paragraphs.
Savings: 5-20% depending on task
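The first version, expressed as request parameters: a stop sequence combined with a hard token cap. Field names mirror Anthropic-style chat APIs and should be treated as illustrative:

```python
# Constrain output with a stop sequence plus a hard max_tokens ceiling,
# so a one-word classification can never balloon into paragraphs.

request = {
    "model": "claude-3-haiku-20240307",
    "max_tokens": 5,              # a one-word label never needs more
    "stop_sequences": ["\n"],     # halt at the first newline
    "messages": [{
        "role": "user",
        "content": "Classify this email as one word: positive, neutral, or negative.\n"
                   "Email: 'Thanks, the refund arrived quickly!'\n"
                   "Label:",
    }],
}
```

Belt and braces: even if the model ignores the stop sequence, `max_tokens` caps the bill.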
Cached System Prompts
System prompts are usually repeated. Cache them.
Using prompt caching:
- System prompt: “You’re a support AI…” (cached)
- Context: Support guidelines (cached)
- Query: User question (new)
The system prompt and context are billed at the discounted cached rate; only the query is charged in full.
Cost Monitoring and Optimization Loop
Monthly Cost Dashboard
Track:
- Total API spend by model
- Cost per transaction
- Tokens processed (input/output breakdown)
- Infrastructure costs
- Team costs
Example metrics:
Total monthly spend: $5,000
GPT-4: $2,000 (40%)
Haiku: $500 (10%)
Infrastructure: $1,000 (20%)
Team: $1,500 (30%)
Cost per classification: $0.005
Cost per customer served: $0.10
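The two unit metrics fall out of the totals once you track volumes. The monthly volumes below (1M classifications, 50K customers) are back-calculated assumptions that make the illustrative dashboard figures work out:

```python
# Derive the dashboard's per-unit metrics from raw monthly totals.

def cost_per_unit(total_spend: float, units: int) -> float:
    return total_spend / units

per_classification = cost_per_unit(5_000, 1_000_000)   # target: $0.005
per_customer = cost_per_unit(5_000, 50_000)            # target: $0.10
print(f"per classification: ${per_classification:.3f}, "
      f"per customer: ${per_customer:.2f}")
```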
Quarterly Optimization Review
1. Identify high-cost areas
- Where’s the money going?
- Could we right-size models?
- Could we batch more?
2. Estimate savings from optimization
- Switching from GPT-4 to Haiku: could save 75%
- Using batch API: could save 50%
- Caching: could save 30%
3. Prioritize by impact and effort
- Easy wins: Better prompts, shorter outputs
- Medium effort: Architecture changes, caching
- Hard wins: Model switch, retraining
4. Implement and measure
- Change one variable at a time
- Measure impact on cost and quality
- Document learnings
When NOT to Optimize
Sometimes optimization isn’t worth it.
Skip optimization if:
- Your AI spend is less than $5K/month (optimization time isn’t worth it)
- Quality impact would be negative (users care more than you save)
- You’re still in pilot phase (focus on learning, not cost)
- You have unlimited budget (some companies do)
Focus optimization on:
- High-volume, low-stakes tasks
- Tasks where accuracy doesn’t need to be perfect
- Batch processes that can wait
- Repeated operations
Strategic Questions
- What model are we using and why? Right-sized or oversized?
- What’s our cost per unit of value? Know this number.
- Where are we wasting money? Oversized models? Inefficient prompts?
- What’s our optimization roadmap? Next 3 improvements?
- At what scale does self-hosting make sense? When do the economics flip?
Key Takeaway: AI cost optimization starts with using the right-sized model for the job. Optimize inputs through extraction and caching. Consider architecture changes like two-stage classification. Monitor costs continuously. Self-host only at scale. The best approach is starting cheap and upgrading only when justified by value.
Discussion Prompt
For your AI system: Are you using the right-sized model? Where could you optimize inputs? What’s your cost per transaction and is that acceptable?