Foundations

Understanding LLM APIs and Authentication

Lesson 1 of 4 · Estimated time: 50 min

When you want to build applications powered by large language models, you don’t run the models yourself. Instead, you interact with them through APIs—the gateway between your application and the intelligence living on someone else’s servers. In this lesson, you’ll learn how modern LLM APIs work, how to authenticate securely, and the fundamental patterns that make them tick.

What Is an LLM API?

An API is a contract: you send a request in a specific format, and the service responds with the result. For LLM APIs, this means you send text (and sometimes images), and you get back generated text. But there’s much more happening under the hood.

When you make a request to an LLM API, you’re not just sending text into a void. You’re:

  • Specifying which model you want to use
  • Defining the parameters that control how the model responds
  • Sending authentication credentials to prove you’re authorized
  • Receiving not just the response, but metadata about that response (tokens used, finish reason, etc.)

This abstraction is powerful because it lets you use state-of-the-art models without managing the infrastructure, GPU costs, or model updates yourself.

Understanding REST APIs

Most LLM APIs follow REST (Representational State Transfer) principles, though some providers also offer WebSocket and gRPC interfaces. A REST API uses HTTP methods (GET, POST, PUT, DELETE) and URL paths to represent resources and actions.

For LLMs, the main operation is POST—you’re sending data (your prompt and settings) to a specific endpoint (usually something like https://api.openai.com/v1/chat/completions), and you get back the result.

Here’s what a typical LLM API request looks like conceptually:

POST https://api.provider.com/v1/chat/completions
Headers:
  Authorization: Bearer your-api-key-here
  Content-Type: application/json

Body (JSON):
{
  "model": "gpt-4-turbo",
  "messages": [
    {"role": "user", "content": "What is machine learning?"}
  ],
  "temperature": 0.7
}

The response comes back as JSON, containing:

{
  "id": "chatcmpl-8N8Z...",
  "object": "chat.completion",
  "created": 1699564200,
  "model": "gpt-4-turbo",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Machine learning is a subset of artificial intelligence..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 87,
    "total_tokens": 97
  }
}

Notice the usage field—this tells you exactly how many tokens your request consumed. This is critical for cost management, which we’ll cover later.
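For instance, you can turn the usage field into a rough cost estimate. The per-token prices below are placeholders, not current rates; check your provider's pricing page:

```python
# Placeholder prices in USD per 1K tokens; check your provider's current rates
PRICING = {"gpt-4-turbo": {"prompt": 0.01, "completion": 0.03}}

def estimate_cost(usage, model="gpt-4-turbo"):
    """Estimate the cost of one request from the API's usage field."""
    rates = PRICING[model]
    return (usage["prompt_tokens"] / 1000 * rates["prompt"]
            + usage["completion_tokens"] / 1000 * rates["completion"])

# The usage block from the response above
usage = {"prompt_tokens": 10, "completion_tokens": 87, "total_tokens": 97}
print(f"${estimate_cost(usage):.5f}")  # → $0.00271
```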

The Request/Response Cycle Explained

The lifecycle of an API call follows this pattern:

  1. Prepare your request: Gather the prompt, set parameters (temperature, max tokens, etc.), prepare authentication
  2. Send: Make an HTTP POST request with your data
  3. Wait: The API processes your request—this might take milliseconds to several seconds
  4. Receive: The server responds with the generated content
  5. Parse: Deserialize the JSON response and extract the useful parts
  6. Handle: Check for errors, extract tokens, and use the content in your application

Let’s trace through a real example with Python’s requests library:

import os
import json
import requests

def call_llm_api(prompt):
    api_key = os.getenv("OPENAI_API_KEY")  # read from the environment; never hardcode keys
    url = "https://api.openai.com/v1/chat/completions"

    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }

    payload = {
        "model": "gpt-4-turbo",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        "temperature": 0.7,
        "max_tokens": 500
    }

    # Step 1-3: Send and wait for the response
    response = requests.post(url, headers=headers, json=payload, timeout=30)

    # Step 4-5: Parse the response
    if response.status_code == 200:
        data = response.json()
        completion = data["choices"][0]["message"]["content"]
        tokens_used = data["usage"]["total_tokens"]
        return completion, tokens_used
    else:
        raise RuntimeError(f"API call failed: {response.status_code} - {response.text}")

# Usage
answer, tokens = call_llm_api("Explain quantum computing in one sentence")
print(f"Answer: {answer}")
print(f"Tokens used: {tokens}")
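Step 6 deserves special attention in production: transient failures such as rate limits (429) or server errors (5xx) are usually worth retrying with exponential backoff. A minimal sketch (the set of retryable status codes here is an assumption; check your provider's error documentation):

```python
import time
import requests

RETRYABLE = {429, 500, 502, 503}  # assumed transient statuses; adjust per provider

def post_with_retry(url, headers, payload, max_retries=3, base_delay=1.0):
    """POST with exponential backoff on transient errors."""
    for attempt in range(max_retries + 1):
        response = requests.post(url, headers=headers, json=payload, timeout=30)
        # Return on success, on a non-retryable error, or once retries run out
        if response.status_code not in RETRYABLE or attempt == max_retries:
            return response
        time.sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
```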

Streaming Responses

For user-facing applications, waiting for the entire response can feel slow. That’s where streaming comes in. Instead of waiting for the complete response, the server sends chunks as they’re generated, allowing you to display content in real-time—just like ChatGPT does in the web interface.

Streaming uses Server-Sent Events (SSE), a simple protocol where the server sends data incrementally. Each chunk is a line of data prefixed with data: followed by JSON.
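For example, here is how a single chunk on the wire would be parsed (the JSON values are made up for illustration):

```python
import json

# One raw SSE line as the server might send it (illustrative values)
raw = 'data: {"choices": [{"index": 0, "delta": {"content": "Machine"}}]}'

# Strip the "data: " prefix, then parse the JSON payload
chunk = json.loads(raw[len("data: "):])
print(chunk["choices"][0]["delta"]["content"])  # → Machine
```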

Here’s what streaming looks like:

def call_llm_api_streaming(prompt):
    api_key = os.getenv("OPENAI_API_KEY")  # read from the environment (see Authentication below)
    url = "https://api.openai.com/v1/chat/completions"

    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }

    payload = {
        "model": "gpt-4-turbo",
        "messages": [{"role": "user", "content": prompt}],
        "stream": True  # Enable streaming
    }

    response = requests.post(url, headers=headers, json=payload, stream=True, timeout=30)

    full_response = ""
    for line in response.iter_lines():
        if line:
            line = line.decode("utf-8")
            if line.startswith("data: "):
                chunk = line[len("data: "):]  # Remove the "data: " prefix
                if chunk == "[DONE]":  # Sentinel marking the end of the stream
                    break
                try:
                    chunk_data = json.loads(chunk)
                    if "choices" in chunk_data:
                        delta = chunk_data["choices"][0].get("delta", {})
                        content = delta.get("content", "")
                        if content:
                            print(content, end="", flush=True)  # Print immediately
                            full_response += content
                except json.JSONDecodeError:
                    pass

    return full_response

The benefit is immediate—users see characters appearing as they’re generated, creating a more responsive experience.

API Authentication Patterns

Different providers use different authentication methods, but they all follow similar principles:

API Key Authentication

The simplest and most common pattern: you include your API key in the Authorization header.

headers = {
    "Authorization": f"Bearer {api_key}"
}

Security note: Never hardcode API keys in your code. Use environment variables instead:

import os
api_key = os.getenv("OPENAI_API_KEY")

Or use a .env file loaded with python-dotenv (and make sure .env is listed in your .gitignore):

import os
from dotenv import load_dotenv

load_dotenv()  # reads variables from a local .env file into the environment
api_key = os.getenv("OPENAI_API_KEY")

OAuth 2.0 (Less common for direct API access)

Some providers use OAuth for user-delegated access. You exchange credentials for a token with a limited lifetime. This is more complex but more secure for applications that shouldn’t have permanent access.
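A sketch of the client-credentials flow with requests; token_url, client_id, and client_secret are placeholders for whatever your provider issues:

```python
import time
import requests

def fetch_token(token_url, client_id, client_secret):
    """Exchange client credentials for a short-lived access token."""
    resp = requests.post(token_url, data={
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }, timeout=30)
    resp.raise_for_status()
    data = resp.json()
    # Remember when the token expires so you know when to fetch a new one
    return {"access_token": data["access_token"],
            "expires_at": time.time() + data.get("expires_in", 3600)}

def auth_header(token):
    """Build the Authorization header from a fetched token."""
    return {"Authorization": f"Bearer {token['access_token']}"}
```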

Custom Headers

Some APIs use custom authentication headers beyond the standard Authorization header. Always check the provider’s documentation.
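Anthropic's raw REST API is a concrete example: instead of a Bearer token in the Authorization header, it expects an x-api-key header plus a required anthropic-version header (2023-06-01 at the time of writing). A minimal sketch:

```python
import os

def anthropic_headers():
    """Headers for Anthropic's REST API, which uses x-api-key
    rather than Authorization: Bearer."""
    return {
        "x-api-key": os.getenv("ANTHROPIC_API_KEY", ""),
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    }
```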

Different Provider APIs: OpenAI, Anthropic, Azure

Each provider has slightly different APIs, but the concepts are identical.

OpenAI’s Chat Completions API

# OpenAI format
payload = {
    "model": "gpt-4-turbo",
    "messages": [
        {"role": "system", "content": "You are helpful"},
        {"role": "user", "content": "Hello"}
    ],
    "temperature": 0.7
}

Anthropic’s Messages API

import anthropic

# The SDK reads ANTHROPIC_API_KEY from the environment by default
client = anthropic.Anthropic()
message = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Hello"}
    ]
)
print(message.content[0].text)

Notice that Anthropic provides an official SDK, so you don't construct HTTP requests directly; the SDK handles authentication and request formatting for you. (OpenAI offers an equivalent openai package.)

Azure OpenAI

import os
from openai import AzureOpenAI  # requires openai>=1.0

client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2024-02-15-preview",
    azure_endpoint="https://your-resource.openai.azure.com/"
)

# Then use the chat completions API as normal
response = client.chat.completions.create(
    model="your-deployment-name",  # your Azure deployment name, not a base model name
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)

Azure uses the same interface as OpenAI but requires additional configuration for Azure-specific endpoints and deployments.

Key Takeaway

LLM APIs abstract away the complexity of running models yourself. You send JSON over HTTP with your prompt and settings, and receive structured responses containing not just the text but also metadata like token counts and finish reasons. Authentication typically happens via API keys in headers. Different providers follow similar patterns but with different endpoints, parameter names, and SDKs—always check the documentation for your chosen provider.

Exercises

  1. Set up API credentials: Create accounts with OpenAI and Anthropic. Store API keys in environment variables. Verify you can read them from your Python environment without hardcoding.

  2. Make a synchronous API call: Write a Python function that calls an LLM API (either OpenAI or Anthropic) with a simple prompt. Print the response, token count, and any error messages.

  3. Implement streaming: Modify your function to use streaming and display output as it arrives. Notice the latency difference compared to non-streaming.

  4. Compare APIs: Make the same request to both OpenAI and Anthropic (if you have access). Compare the structure of requests, responses, and token counts for the same prompt.

  5. Inspect the full response: Print the entire JSON response from an API call. Identify all fields—not just the message content. Understand what metadata is available.