Foundations

Error Handling, Rate Limits, and Retries

Lesson 3 of 4 · Estimated time: 45 min

API calls fail. Networks time out, servers crash, rate limits kick in, and authentication tokens expire. In production, you’ll encounter all of these. This lesson teaches you how to build resilient applications that handle failures gracefully and recover automatically when possible.

Understanding HTTP Status Codes

Every API response includes a status code. You need to understand what they mean:

  • 2xx (Success): Request succeeded

    • 200 OK: Success
    • 201 Created: Resource created successfully
  • 4xx (Client Error): Problem with your request (don’t retry automatically)

    • 400 Bad Request: Invalid parameters or malformed JSON
    • 401 Unauthorized: Invalid or missing API key
    • 403 Forbidden: Authenticated but not allowed to access this
    • 404 Not Found: Endpoint doesn’t exist
    • 429 Too Many Requests: Rate limit exceeded
  • 5xx (Server Error): Problem with the service (usually safe to retry)

    • 500 Internal Server Error: Something broke on their end
    • 502 Bad Gateway: Temporary service issue
    • 503 Service Unavailable: Maintenance or overload

The crucial distinction: 4xx errors are client errors (your fault), so don’t retry them automatically. The one exception is 429, which is retryable once you’ve waited. 5xx errors are server errors (their fault) and are generally safe to retry.

def analyze_error_response(status_code: int) -> dict:
    """Determine retry strategy based on status code."""
    return {
        400: {"retryable": False, "reason": "Invalid request"},
        401: {"retryable": False, "reason": "Authentication failed"},
        429: {"retryable": True, "reason": "Rate limited"},
        500: {"retryable": True, "reason": "Server error"},
        503: {"retryable": True, "reason": "Service unavailable"}
    }.get(status_code, {"retryable": False, "reason": "Unknown"})

Exponential Backoff

When a request fails with a retryable error, don’t immediately retry. That’s likely to fail again. Instead, wait—with increasing wait times. This is exponential backoff.

The pattern:

  • 1st retry: wait 1 second
  • 2nd retry: wait 2 seconds
  • 3rd retry: wait 4 seconds
  • 4th retry: wait 8 seconds

This gives the service time to recover while not waiting forever.

import time
import random

def exponential_backoff(attempt: int, base_wait: float = 1.0) -> float:
    """Calculate wait time for exponential backoff."""
    # 2^attempt grows quickly: 1, 2, 4, 8, 16...
    wait_time = base_wait * (2 ** attempt)
    # Add jitter (randomness) to avoid thundering herd
    jitter = random.uniform(0, wait_time * 0.1)
    return wait_time + jitter

# Test it
for attempt in range(5):
    wait = exponential_backoff(attempt)
    print(f"Attempt {attempt}: wait {wait:.2f}s")

Example output (the jitter is random, so exact values vary between runs):

Attempt 0: wait 1.05s
Attempt 1: wait 2.08s
Attempt 2: wait 4.12s
Attempt 3: wait 8.09s
Attempt 4: wait 16.15s
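The wait times above grow without bound. A common refinement, not part of the helper above, is to cap the wait and draw the actual delay from the full range ("full jitter"); the cap value and jitter strategy here are illustrative choices, not a fixed standard:

```python
import random

def capped_full_jitter(attempt: int, base_wait: float = 1.0, max_wait: float = 30.0) -> float:
    """Exponential backoff with a ceiling, drawing the actual wait
    uniformly from [0, cap] to spread simultaneous retries apart."""
    cap = min(max_wait, base_wait * (2 ** attempt))
    return random.uniform(0, cap)

# The cap kicks in once 2**attempt exceeds max_wait
for attempt in range(8):
    cap = min(30.0, 2.0 ** attempt)
    print(f"Attempt {attempt}: wait somewhere in [0, {cap:.0f}]s")
```

The trade-off: full jitter can pick very short waits, but across many clients it spreads retries far more evenly than adding a small fixed-percentage jitter.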

Rate Limiting: Headers Matter

APIs don’t just return a 429 status code—they also include headers telling you when you can retry. Always check these (exact header names vary by provider; many use an X-RateLimit- prefix):

  • RateLimit-Limit: Maximum requests in the window
  • RateLimit-Remaining: Requests left before hitting the limit
  • RateLimit-Reset: Unix timestamp when the limit resets

import requests
from datetime import datetime

def make_request_with_rate_limit_awareness(url: str, headers: dict) -> dict:
    """Make a request and log rate limit info."""
    response = requests.post(url, headers=headers, json={})

    rate_limit_info = {
        "limit": response.headers.get("RateLimit-Limit"),
        "remaining": response.headers.get("RateLimit-Remaining"),
        "reset": response.headers.get("RateLimit-Reset")
    }

    if rate_limit_info["remaining"]:
        remaining = int(rate_limit_info["remaining"])
        if remaining < 10:
            print(f"⚠️ Low remaining requests: {remaining}")

    if rate_limit_info["reset"]:
        reset_time = int(rate_limit_info["reset"])
        reset_dt = datetime.fromtimestamp(reset_time)
        print(f"Rate limit resets at: {reset_dt}")

    return {
        "status_code": response.status_code,
        "rate_limit": rate_limit_info,
        "data": response.json()
    }
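Logging the headers is a start; you can also act on them. The sketch below sleeps until the advertised reset on a 429. The header names are assumptions—check your provider’s documentation, since some send Retry-After (seconds to wait, or an HTTP date, which this sketch does not handle) and others a reset timestamp:

```python
import time

def wait_if_rate_limited(status_code: int, headers: dict) -> bool:
    """Sleep based on rate limit headers; return True if we waited."""
    if status_code != 429:
        return False

    # Retry-After gives a number of seconds to wait
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        time.sleep(float(retry_after))
        return True

    # RateLimit-Reset gives a Unix timestamp to wait until
    reset = headers.get("RateLimit-Reset")
    if reset is not None:
        time.sleep(max(0.0, float(reset) - time.time()))
        return True

    return False
```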

Building a Robust Retry Wrapper

Let’s build a comprehensive retry function that handles all these cases:

from openai import OpenAI, RateLimitError, APIError, APITimeoutError
import time

def call_with_exponential_backoff(
    client: OpenAI,
    max_retries: int = 3,
    base_wait: float = 1.0,
    **kwargs
):
    """Call an LLM with exponential backoff retry logic."""

    last_exception = None

    for attempt in range(max_retries):
        try:
            # Attempt the API call
            response = client.chat.completions.create(**kwargs)
            return response

        except RateLimitError as e:
            # Rate limit—definitely retry
            if attempt < max_retries - 1:
                wait_time = base_wait * (2 ** attempt)
                print(f"Rate limited. Waiting {wait_time}s (attempt {attempt + 1}/{max_retries})")
                time.sleep(wait_time)
                last_exception = e
            else:
                raise

        except APITimeoutError as e:
            # Timeout—safe to retry
            if attempt < max_retries - 1:
                print(f"Timeout. Retrying (attempt {attempt + 1}/{max_retries})")
                time.sleep(base_wait * (2 ** attempt))
                last_exception = e
            else:
                raise

        except APIError as e:
            # Generic API error
            if hasattr(e, 'status_code') and e.status_code >= 500:
                # Server error—retry
                if attempt < max_retries - 1:
                    wait_time = base_wait * (2 ** attempt)
                    print(f"Server error ({e.status_code}). Retrying (attempt {attempt + 1}/{max_retries})")
                    time.sleep(wait_time)
                    last_exception = e
                else:
                    raise
            else:
                # Client error—don't retry
                raise

        except Exception as e:
            # Unexpected error—fail immediately
            print(f"Unexpected error: {type(e).__name__}: {e}")
            raise

    # Should never reach here, but just in case
    if last_exception:
        raise last_exception

# Usage
client = OpenAI()
try:
    response = call_with_exponential_backoff(
        client,
        max_retries=3,
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": "Hello"}]
    )
    print(response.choices[0].message.content)
except Exception as e:
    print(f"Failed after retries: {e}")

Circuit Breaker Pattern

For systems that make many requests, the circuit breaker pattern prevents cascading failures. If your service keeps failing, stop trying temporarily.

from enum import Enum
import time

class CircuitBreakerState(Enum):
    CLOSED = "closed"  # Normal operation
    OPEN = "open"  # Failing, reject requests
    HALF_OPEN = "half_open"  # Testing if service recovered

class CircuitBreaker:
    """Prevent cascading failures by stopping requests to failing services."""

    def __init__(self, failure_threshold: int = 5, timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.failure_count = 0
        self.state = CircuitBreakerState.CLOSED
        self.last_failure_time = None

    def call(self, func, *args, **kwargs):
        """Execute function through circuit breaker."""

        if self.state == CircuitBreakerState.OPEN:
            # Check if timeout has passed
            if time.time() - self.last_failure_time > self.timeout:
                self.state = CircuitBreakerState.HALF_OPEN
                print("Circuit breaker: trying to recover...")
            else:
                raise Exception("Circuit breaker is OPEN. Service is unavailable.")

        try:
            result = func(*args, **kwargs)
            self.on_success()
            return result

        except Exception as e:
            self.on_failure()
            raise

    def on_success(self):
        """Handle successful call."""
        self.failure_count = 0
        self.state = CircuitBreakerState.CLOSED

    def on_failure(self):
        """Handle failed call."""
        self.failure_count += 1
        self.last_failure_time = time.time()

        if self.failure_count >= self.failure_threshold:
            self.state = CircuitBreakerState.OPEN
            print(f"Circuit breaker: OPENING after {self.failure_count} failures")

# Usage
breaker = CircuitBreaker(failure_threshold=3, timeout=30)

def unstable_api_call():
    # Simulated unreliable function
    import random
    if random.random() < 0.8:  # 80% chance of failure
        raise Exception("API call failed")
    return "Success!"

for i in range(10):
    try:
        result = breaker.call(unstable_api_call)
        print(f"Call {i}: {result}")
    except Exception as e:
        print(f"Call {i}: {e}")
    time.sleep(1)

Timeout Handling

Always set timeouts. Infinite hangs are dangerous:

from openai import OpenAI

client = OpenAI(timeout=30.0)  # 30 second timeout

try:
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": "Hello"}]
    )
except Exception as e:
    print(f"Request timed out or failed: {e}")

For raw HTTP requests:

import os
import requests

api_key = os.environ["OPENAI_API_KEY"]  # never hard-code keys

try:
    response = requests.post(
        "https://api.openai.com/v1/chat/completions",
        json={"model": "gpt-4-turbo", "messages": [{"role": "user", "content": "Hello"}]},
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=30  # seconds
    )
except requests.Timeout:
    print("Request timed out after 30 seconds")
except requests.ConnectionError:
    print("Connection error")

Comprehensive Error Handling Example

Here’s a production-ready pattern combining everything:

from openai import OpenAI, APIError, APITimeoutError
import time
from typing import Optional

class RobustLLMClient:
    """LLM client with comprehensive error handling."""

    def __init__(self, api_key: str, max_retries: int = 3):
        self.client = OpenAI(api_key=api_key, timeout=30)
        self.max_retries = max_retries

    def chat_completion(
        self,
        prompt: str,
        model: str = "gpt-4-turbo",
        temperature: float = 0.7
    ) -> Optional[str]:
        """Get chat completion with automatic retries."""

        for attempt in range(self.max_retries):
            try:
                response = self.client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                    temperature=temperature
                )
                return response.choices[0].message.content

            except APITimeoutError as e:
                if attempt < self.max_retries - 1:
                    wait = 2 ** attempt
                    print(f"Timeout. Waiting {wait}s before retry...")
                    time.sleep(wait)
                else:
                    print(f"Timeout after {self.max_retries} retries")
                    return None

            except APIError as e:
                # The base APIError may not carry a status code; default to 0
                status = getattr(e, "status_code", 0)

                if status == 429:  # Rate limit
                    if attempt < self.max_retries - 1:
                        wait = 10 * (2 ** attempt)  # Longer wait for rate limits
                        print(f"Rate limited. Waiting {wait}s...")
                        time.sleep(wait)
                    else:
                        print(f"Still rate limited after {self.max_retries} retries")
                        return None

                elif status >= 500:  # Server error
                    if attempt < self.max_retries - 1:
                        wait = 2 ** attempt
                        print(f"Server error ({status}). Retrying...")
                        time.sleep(wait)
                    else:
                        print(f"Server error persists after {self.max_retries} retries")
                        return None

                else:  # Client error (4xx except 429)
                    print(f"Client error: {e}")
                    return None

        return None

# Usage
client = RobustLLMClient(api_key="sk-proj-...")
result = client.chat_completion("What is machine learning?")
if result:
    print(result)
else:
    print("Failed to get response")

Key Takeaway

Production systems need intelligent error handling. Distinguish between retryable errors (5xx, timeouts, rate limits) and non-retryable errors (4xx except 429). Use exponential backoff to space out retries. Monitor rate limit headers to avoid hitting limits. Consider circuit breakers for systems making many requests. Always set timeouts to prevent infinite hangs.
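The retryability rule in this takeaway condenses into a single predicate (ignoring timeouts, which surface as exceptions rather than status codes):

```python
def is_retryable(status_code: int) -> bool:
    """Retry server errors (5xx) and rate limits (429); nothing else."""
    return status_code == 429 or 500 <= status_code < 600

print(is_retryable(503))  # True: server error
print(is_retryable(404))  # False: client error
```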

Exercises

  1. Test status codes: Create a mock API endpoint that returns different status codes. Write error handling code that properly identifies which are retryable.

  2. Implement exponential backoff: Write a function that calls an API with exponential backoff. Verify wait times increase correctly.

  3. Rate limit simulation: Simulate a rate-limited API. Implement logic that respects the rate limit headers and backs off appropriately.

  4. Circuit breaker pattern: Implement a circuit breaker. Verify it opens after threshold failures and half-opens to test recovery.

  5. Timeout handling: Make API calls with various timeout settings. Observe behavior when requests exceed timeouts.

  6. Production scenario: Write a function that handles rate limits, timeouts, and server errors simultaneously. Test with intentional failures.