Foundations

Cost Management and API Best Practices

Lesson 4 of 4 · Estimated time: 40 min

LLM APIs are not free. Making thousands of API calls adds up quickly. In this lesson, you’ll learn how to monitor costs, estimate expenses before they happen, cache responses intelligently, and choose the right models for your use cases.

Understanding Token Costs

Pricing is per token, not per request. You pay for input tokens (your prompt) and output tokens (the response), typically at different rates, with output tokens costing more.

As of 2024:

  • GPT-4 Turbo: ~$0.01 per 1K input tokens, ~$0.03 per 1K output tokens
  • GPT-3.5 Turbo: ~$0.0005 per 1K input tokens, ~$0.0015 per 1K output tokens
  • Claude 3 Opus: ~$0.015 per 1K input tokens, ~$0.075 per 1K output tokens

(Check current pricing—it changes frequently.)

The key math: 1,000 requests with 500 input tokens each is 500K tokens. At GPT-3.5 input rates ($0.0005 per 1K), that's ~$0.25 for input alone, before counting any output tokens.
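That arithmetic is worth wrapping in a tiny helper so you can plug in different volumes and rates. The function below is an illustrative sketch, not part of any SDK:

```python
def batch_input_cost(num_requests: int, tokens_per_request: int, price_per_1k: float) -> float:
    """Total input cost in dollars for a batch of identically sized requests."""
    total_tokens = num_requests * tokens_per_request
    return (total_tokens / 1000) * price_per_1k

# 1,000 requests x 500 tokens at GPT-3.5 input rates ($0.0005 per 1K)
cost = batch_input_cost(1000, 500, 0.0005)
print(f"${cost:.2f}")  # $0.25
```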

Token Counting in Python

Before you make an API call, estimate tokens to predict cost:

import tiktoken

def count_tokens_openai(text: str, model: str = "gpt-4-turbo") -> int:
    """Count tokens for OpenAI models."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # Fall back to the encoding used by most recent OpenAI models
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

def count_tokens_for_messages(messages: list, model: str = "gpt-4-turbo") -> int:
    """Count tokens in a message thread (more accurate than just text)."""
    encoding = tiktoken.encoding_for_model(model)

    # Approximate per-message overhead for role/content framing
    # (the exact count varies by model; see OpenAI's token-counting guide)
    num_tokens = 0
    for message in messages:
        num_tokens += 4  # overhead per message (approximate)
        for key, value in message.items():
            num_tokens += len(encoding.encode(str(value)))

    num_tokens += 2  # Final overhead
    return num_tokens

# Test it
messages = [
    {"role": "user", "content": "What is photosynthesis?"},
    {"role": "assistant", "content": "Photosynthesis is the process..."}
]

tokens = count_tokens_for_messages(messages)
print(f"Message thread: {tokens} tokens")

# Single text
text = "Machine learning is a subset of artificial intelligence"
tokens = count_tokens_openai(text)
print(f"Text: {tokens} tokens")

For Anthropic, use their count_tokens method:

from anthropic import Anthropic

client = Anthropic()

messages = [
    {"role": "user", "content": "What is machine learning?"}
]

# Anthropic's SDK exposes a server-side token counter
count = client.messages.count_tokens(
    model="claude-3-opus-20240229",
    messages=messages
)
print(f"Token count: {count.input_tokens}")

Estimating Costs Before Making Requests

Build a cost estimator:

class TokenCostEstimator:
    """Estimate costs before making API calls."""

    PRICING = {
        "gpt-4-turbo": {"input": 0.01, "output": 0.03},
        "gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015},
        "claude-3-opus": {"input": 0.015, "output": 0.075},
        "claude-3-sonnet": {"input": 0.003, "output": 0.015},
    }

    def __init__(self, model: str):
        if model not in self.PRICING:
            raise ValueError(f"Unknown model: {model}")
        self.model = model
        self.pricing = self.PRICING[model]

    def estimate_input_cost(self, num_tokens: int) -> float:
        """Estimate cost for input tokens."""
        return (num_tokens / 1000) * self.pricing["input"]

    def estimate_output_cost(self, num_tokens: int) -> float:
        """Estimate cost for output tokens."""
        return (num_tokens / 1000) * self.pricing["output"]

    def estimate_total_cost(
        self,
        input_tokens: int,
        output_tokens: int
    ) -> float:
        """Estimate total cost."""
        return self.estimate_input_cost(input_tokens) + self.estimate_output_cost(output_tokens)

    @classmethod
    def find_cheapest_model(
        cls,
        input_tokens: int,
        output_tokens: int
    ) -> tuple:
        """Find the cheapest known model for the given request size."""
        costs = {
            name: cls(name).estimate_total_cost(input_tokens, output_tokens)
            for name in cls.PRICING
        }
        return min(costs.items(), key=lambda item: item[1])

# Usage
estimator = TokenCostEstimator("gpt-4-turbo")

input_tokens = 100
output_tokens = 50

input_cost = estimator.estimate_input_cost(input_tokens)
output_cost = estimator.estimate_output_cost(output_tokens)
total_cost = estimator.estimate_total_cost(input_tokens, output_tokens)

print(f"Input cost: ${input_cost:.6f}")
print(f"Output cost: ${output_cost:.6f}")
print(f"Total cost: ${total_cost:.6f}")

# Find cheapest option
cheapest_model, cheapest_cost = estimator.find_cheapest_model(1000, 500)
print(f"Cheapest model: {cheapest_model} at ${cheapest_cost:.4f}")
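To see how these per-call numbers scale with volume, here is a small projection sketch. The rates are copied from the PRICING table above so the snippet is self-contained; the 500-input/200-output token sizes are arbitrary assumptions:

```python
# Per-1K-token rates copied from the estimator's PRICING table above
PRICING = {
    "gpt-4-turbo": {"input": 0.01, "output": 0.03},
    "gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015},
}

def project_cost(model: str, calls: int, input_tokens: int, output_tokens: int) -> float:
    """Projected dollar cost for `calls` requests of a fixed size."""
    rates = PRICING[model]
    per_call = (input_tokens / 1000) * rates["input"] + (output_tokens / 1000) * rates["output"]
    return per_call * calls

for calls in (100, 1000, 10000):
    gpt4 = project_cost("gpt-4-turbo", calls, 500, 200)
    gpt35 = project_cost("gpt-3.5-turbo", calls, 500, 200)
    print(f"{calls:>6} calls: gpt-4-turbo ${gpt4:.2f}  gpt-3.5-turbo ${gpt35:.2f}")
```

At 10,000 calls the gap between models is already two orders of magnitude in dollars, which is why model selection (covered below) matters so much.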

Response Caching

Don’t recompute the same responses. Cache them:

import json
import hashlib
from datetime import datetime, timedelta

class ResponseCache:
    """Cache API responses to avoid repeated calls."""

    def __init__(self, ttl_hours: float = 24):
        self.cache = {}
        self.ttl_hours = ttl_hours

    def _key_from_messages(self, messages: list, model: str) -> str:
        """Generate a deterministic cache key from messages and model."""
        content = json.dumps(messages, sort_keys=True) + model
        return hashlib.md5(content.encode()).hexdigest()

    def get(self, messages: list, model: str) -> dict | None:
        """Retrieve cached response if it exists and hasn't expired."""
        key = self._key_from_messages(messages, model)

        if key not in self.cache:
            return None

        cached_response, cached_time = self.cache[key]

        # Check if expired
        if datetime.now() - cached_time > timedelta(hours=self.ttl_hours):
            del self.cache[key]
            return None

        return cached_response

    def set(self, messages: list, model: str, response: dict):
        """Cache a response."""
        key = self._key_from_messages(messages, model)
        self.cache[key] = (response, datetime.now())

    def clear(self):
        """Clear all cached responses."""
        self.cache.clear()

# Usage
cache = ResponseCache(ttl_hours=24)

messages = [{"role": "user", "content": "What is AI?"}]
model = "gpt-4-turbo"

# First call—not in cache
cached = cache.get(messages, model)
if cached:
    print("Cache hit!")
    response = cached
else:
    print("Cache miss—calling API")
    # response = client.chat.completions.create(model=model, messages=messages)
    response = {"cached": False}
    cache.set(messages, model, response)

# Second call—in cache
cached = cache.get(messages, model)
if cached:
    print("Cache hit! No API call needed")
else:
    print("Cache miss")

For production, use Redis or a similar system. For simple cases, this in-memory approach works.
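In between "in-memory dict" and "full Redis deployment," a file-backed cache lets responses survive process restarts. The sketch below is illustrative only: the cache path is an arbitrary choice, and it omits TTL handling for brevity:

```python
import json
import hashlib
import os
import tempfile

class FileCache:
    """Minimal disk-backed cache sketch: persists across restarts,
    unlike the in-memory dict above. No TTL handling for brevity."""

    def __init__(self, path: str):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            with open(path) as f:
                self.data = json.load(f)

    def _key(self, messages: list, model: str) -> str:
        content = json.dumps(messages, sort_keys=True) + model
        return hashlib.md5(content.encode()).hexdigest()

    def get(self, messages: list, model: str):
        return self.data.get(self._key(messages, model))

    def set(self, messages: list, model: str, response: dict):
        self.data[self._key(messages, model)] = response
        with open(self.path, "w") as f:  # write-through on every set
            json.dump(self.data, f)

cache = FileCache(os.path.join(tempfile.gettempdir(), "llm_cache.json"))
msgs = [{"role": "user", "content": "What is AI?"}]
if cache.get(msgs, "gpt-4-turbo") is None:
    cache.set(msgs, "gpt-4-turbo", {"content": "stub response"})
print(cache.get(msgs, "gpt-4-turbo"))
```

Writing the whole file on every `set` is fine for small caches; beyond a few thousand entries you'd want Redis or SQLite instead.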

Model Selection for Cost vs. Quality

Different models have different price/quality tradeoffs. Choose wisely:

def choose_model_for_task(task_type: str, quality_requirement: str) -> str:
    """Choose model based on task and quality needs."""

    models = {
        "summarization": {
            "high": "gpt-4-turbo",
            "medium": "gpt-3.5-turbo",
            "low": "gpt-3.5-turbo"
        },
        "classification": {
            "high": "gpt-4-turbo",
            "medium": "gpt-3.5-turbo",
            "low": "gpt-3.5-turbo"
        },
        "generation": {
            "high": "gpt-4-turbo",
            "medium": "gpt-3.5-turbo",
            "low": "gpt-3.5-turbo"
        },
        "reasoning": {
            "high": "gpt-4-turbo",
            "medium": "gpt-4-turbo",
            "low": "gpt-3.5-turbo"
        }
    }

    return models.get(task_type, {}).get(quality_requirement, "gpt-3.5-turbo")

# Usage
task = "summarization"
quality = "medium"
model = choose_model_for_task(task, quality)
print(f"Recommended model: {model}")

Key insight: Use cheaper models (like GPT-3.5) for simple tasks. Reserve expensive models (like GPT-4) for complex reasoning.
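One way to act on that insight automatically is a routing heuristic: send short, simple prompts to the cheap model and escalate long or reasoning-heavy ones. The thresholds and keyword list below are assumptions for illustration, not a published rule:

```python
# Hypothetical signals that a prompt needs stronger reasoning
REASONING_HINTS = ("why", "prove", "step by step", "explain the logic")

def route_model(prompt: str, token_estimate: int) -> str:
    """Pick a model based on prompt size and crude reasoning signals."""
    if token_estimate > 2000 or any(h in prompt.lower() for h in REASONING_HINTS):
        return "gpt-4-turbo"
    return "gpt-3.5-turbo"

print(route_model("Classify this review as positive or negative.", 40))        # gpt-3.5-turbo
print(route_model("Prove step by step why this algorithm is O(n log n).", 60))  # gpt-4-turbo
```

In production you would tune these signals against real traffic, but even a crude router can shift the bulk of requests onto the cheaper model.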

Monitoring and Alerting

Track your costs over time:

import json
from datetime import datetime

class CostMonitor:
    """Monitor API costs and alert on thresholds."""

    def __init__(self, daily_budget: float):
        self.daily_budget = daily_budget
        self.logs = []

    def log_call(
        self,
        model: str,
        input_tokens: int,
        output_tokens: int,
        cost: float
    ):
        """Log an API call."""
        self.logs.append({
            "timestamp": datetime.now().isoformat(),
            "model": model,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "cost": cost
        })

    def daily_cost(self) -> float:
        """Calculate cost for today."""
        today = datetime.now().date()
        total = 0

        for log in self.logs:
            log_date = datetime.fromisoformat(log["timestamp"]).date()
            if log_date == today:
                total += log["cost"]

        return total

    def check_budget(self) -> bool:
        """Check if we're within budget."""
        cost = self.daily_cost()
        if cost > self.daily_budget:
            print(f"⚠️ Budget exceeded! ${cost:.2f} > ${self.daily_budget:.2f}")
            return False
        return True

    def save_logs(self, filepath: str):
        """Save logs to file for analysis."""
        with open(filepath, 'w') as f:
            json.dump(self.logs, f, indent=2)

# Usage
monitor = CostMonitor(daily_budget=10.0)  # $10 per day

# Log some calls
monitor.log_call("gpt-4-turbo", 100, 50, 0.002)
monitor.log_call("gpt-3.5-turbo", 200, 100, 0.0005)

daily = monitor.daily_cost()
print(f"Daily cost: ${daily:.4f}")

monitor.check_budget()
monitor.save_logs("api_costs.json")

Best Practices Summary

1. Batch requests when possible. Instead of making one API call per item, send them together:

# Bad: N separate API calls
for item in items:
    result = api_call(item)

# Better: one call (api_call here is illustrative pseudocode, not a real SDK function)
results = api_call(batch=items)

2. Use shorter prompts. Every token counts: strip unnecessary context and boilerplate.
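A quick way to sanity-check prompt trimming without calling a tokenizer is the rough rule of thumb that English text averages about four characters per token. This is an approximation, not exact tokenization:

```python
def rough_tokens(text: str) -> float:
    """Crude token estimate: ~4 characters per token for English text."""
    return len(text) / 4

verbose = ("Please could you kindly take the following customer review text "
           "and provide a summary of it for me if at all possible: ...")
terse = "Summarize this review: ..."

print(f"verbose ~{rough_tokens(verbose):.0f} tokens, terse ~{rough_tokens(terse):.0f} tokens")
```

For billing-accurate counts, use tiktoken as shown earlier; this heuristic is only for quick comparisons while editing prompts.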

3. Leverage structured output. Have models emit JSON directly instead of parsing prose.
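The payoff is that parsing becomes one `json.loads` call instead of brittle string scraping. The response string below is a stand-in for what an API call would return, so the sketch runs without network access:

```python
import json

# Instruct the model to emit JSON only (the schema here is an example, not a standard)
prompt = (
    "Extract the product and rating from this review. "
    'Respond with JSON only, e.g. {"product": "...", "rating": 3}.'
)

# Stand-in for the model's reply; a real call would return text like this
simulated_response = '{"product": "wireless mouse", "rating": 4}'

data = json.loads(simulated_response)  # one line of parsing, no regexes
print(data["product"], data["rating"])
```

Several providers also support enforced JSON output (e.g. OpenAI's `response_format` parameter), which makes the parse step even more reliable.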

4. Cache aggressively. Store responses for common queries.

5. Monitor token usage in real time. Track actual versus estimated tokens.

def api_call_with_monitoring(prompt: str, monitor: CostMonitor):
    """Make API call while monitoring costs."""
    from openai import OpenAI

    client = OpenAI()

    # Estimate before the call (assume ~100 output tokens)
    input_tokens = count_tokens_openai(prompt)
    estimator = TokenCostEstimator("gpt-3.5-turbo")
    estimated_cost = estimator.estimate_total_cost(input_tokens, 100)

    print(f"Estimated cost: ${estimated_cost:.6f}")

    # Make call
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}]
    )

    # Log actual
    actual_cost = estimator.estimate_total_cost(
        response.usage.prompt_tokens,
        response.usage.completion_tokens
    )
    monitor.log_call(
        "gpt-3.5-turbo",
        response.usage.prompt_tokens,
        response.usage.completion_tokens,
        actual_cost
    )

    print(f"Actual cost: ${actual_cost:.6f}")
    return response

Key Takeaway

LLM API costs scale with token usage. Always estimate costs before making calls. Use cheaper models for simple tasks and expensive models only when necessary. Cache responses aggressively. Monitor actual spending against budgets. Batch requests and remove unnecessary context to reduce token counts.

Exercises

  1. Token counting: Write a function that counts tokens for various models. Test with different message lengths.

  2. Cost estimation: Build a tool that estimates costs for 100, 1000, and 10,000 API calls. Compare costs across models.

  3. Model selection: For three different tasks (summarization, classification, reasoning), choose the appropriate model considering cost/quality tradeoffs.

  4. Caching system: Implement a response cache. Verify it prevents unnecessary API calls for repeated prompts.

  5. Cost monitoring: Create a cost monitor. Log API calls throughout a day. Calculate daily cost and check against budget.

  6. Cost optimization: Identify ways to reduce tokens in a sample prompt without losing quality. Calculate savings.