Understanding LLM APIs and Authentication
When you want to build applications powered by large language models, you don’t run the models yourself. Instead, you interact with them through APIs—the gateway between your application and the intelligence living on someone else’s servers. In this lesson, you’ll learn how modern LLM APIs work, how to authenticate securely, and the fundamental patterns that make them tick.
What Is an LLM API?
An API is a contract: you send a request in a specific format, and the service responds with the result. For LLM APIs, this means you send text (and sometimes images), and you get back generated text. But there’s much more happening under the hood.
When you make a request to an LLM API, you’re not just sending text into a void. You’re:
- Specifying which model you want to use
- Defining the parameters that control how the model responds
- Sending authentication credentials to prove you’re authorized
- Receiving not just the response, but metadata about that response (tokens used, finish reason, etc.)
This abstraction is powerful because it lets you use state-of-the-art models without managing the infrastructure, GPU costs, or model updates yourself.
Understanding REST APIs
Most LLM APIs follow REST (Representational State Transfer) principles, though some providers also offer WebSocket or gRPC interfaces for specialized use cases. A REST API uses HTTP methods (GET, POST, PUT, DELETE) and URL paths to represent resources and actions.
For LLMs, the main operation is POST—you’re sending data (your prompt and settings) to a specific endpoint (usually something like https://api.openai.com/v1/chat/completions), and you get back the result.
Here’s what a typical LLM API request looks like conceptually:
```
POST https://api.provider.com/v1/chat/completions

Headers:
  Authorization: Bearer your-api-key-here
  Content-Type: application/json

Body (JSON):
{
  "model": "gpt-4-turbo",
  "messages": [
    {"role": "user", "content": "What is machine learning?"}
  ],
  "temperature": 0.7
}
```
The response comes back as JSON, containing:
```json
{
  "id": "chatcmpl-8N8Z...",
  "object": "chat.completion",
  "created": 1699564200,
  "model": "gpt-4-turbo",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Machine learning is a subset of artificial intelligence..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 87,
    "total_tokens": 97
  }
}
```
Notice the usage field—this tells you exactly how many tokens your request consumed. This is critical for cost management, which we’ll cover later.
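Because billing is per token, the usage numbers translate directly into dollars. As a rough sketch (the per-token prices below are illustrative placeholders, not real rates; check your provider's pricing page for the actual numbers):

```python
def estimate_cost(prompt_tokens, completion_tokens,
                  prompt_price_per_1k=0.01, completion_price_per_1k=0.03):
    """Estimate the cost of one API call in dollars.

    The default prices are placeholders for illustration only;
    look up the actual rates for your model and provider.
    """
    return ((prompt_tokens / 1000) * prompt_price_per_1k
            + (completion_tokens / 1000) * completion_price_per_1k)

# Using the usage numbers from the response above:
cost = estimate_cost(10, 87)
print(f"${cost:.5f}")  # → $0.00271
```

Logging this per request gives you a running total of spend, which matters once an application makes thousands of calls a day.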
The Request/Response Cycle Explained
The lifecycle of an API call follows this pattern:
- Prepare your request: Gather the prompt, set parameters (temperature, max tokens, etc.), prepare authentication
- Send: Make an HTTP POST request with your data
- Wait: The API processes your request—this might take milliseconds to several seconds
- Receive: The server responds with the generated content
- Parse: Deserialize the JSON response and extract the useful parts
- Handle: Check for errors, extract tokens, and use the content in your application
Let’s trace through a real example with Python’s requests library:
```python
import requests

def call_llm_api(prompt):
    api_key = "sk-proj-..."  # Never hardcode this! See the authentication section below.
    url = "https://api.openai.com/v1/chat/completions"

    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }

    payload = {
        "model": "gpt-4-turbo",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        "temperature": 0.7,
        "max_tokens": 500
    }

    # Steps 1-3: Send and wait for the response
    response = requests.post(url, headers=headers, json=payload)

    # Steps 4-5: Parse the response
    if response.status_code == 200:
        data = response.json()
        completion = data["choices"][0]["message"]["content"]
        tokens_used = data["usage"]["total_tokens"]
        return completion, tokens_used
    else:
        raise Exception(f"API call failed: {response.status_code} - {response.text}")

# Usage
answer, tokens = call_llm_api("Explain quantum computing in one sentence")
print(f"Answer: {answer}")
print(f"Tokens used: {tokens}")
```
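Step 6 (handling) deserves extra care in production: rate-limit responses (HTTP 429) and transient server errors are common, and the usual remedy is retrying with exponential backoff. A minimal sketch of that pattern (the exception class and retry counts here are illustrative, not part of any provider's SDK):

```python
import time

class RateLimitError(Exception):
    """Raised when the API returns HTTP 429 (illustrative placeholder)."""

def with_retries(fn, max_retries=3, base_delay=1.0):
    """Call fn(), retrying on RateLimitError with exponential backoff.

    Delays grow as base_delay * 2**attempt: 1s, 2s, 4s by default.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries; let the caller handle it
            time.sleep(base_delay * (2 ** attempt))

# Hypothetical wiring with the call_llm_api function defined above:
# answer, tokens = with_retries(lambda: call_llm_api("Hello"))
```

Provider SDKs typically build in retry logic like this, which is one reason to prefer them over raw HTTP in production code.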
Streaming Responses
For user-facing applications, waiting for the entire response can feel slow. That’s where streaming comes in. Instead of waiting for the complete response, the server sends chunks as they’re generated, allowing you to display content in real-time—just like ChatGPT does in the web interface.
Streaming uses Server-Sent Events (SSE), a simple protocol where the server sends data incrementally. Each chunk is a line of data prefixed with data: followed by JSON.
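On the wire, a streamed response looks like the following (values abbreviated for illustration); the stream ends with a literal data: [DONE] sentinel rather than a JSON chunk:

```
data: {"id":"chatcmpl-...","choices":[{"index":0,"delta":{"role":"assistant"}}]}

data: {"id":"chatcmpl-...","choices":[{"index":0,"delta":{"content":"Machine"}}]}

data: {"id":"chatcmpl-...","choices":[{"index":0,"delta":{"content":" learning"}}]}

data: [DONE]
```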
Here’s what streaming looks like:
```python
import json
import requests

def call_llm_api_streaming(prompt):
    api_key = "sk-proj-..."  # Never hardcode this!
    url = "https://api.openai.com/v1/chat/completions"

    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }

    payload = {
        "model": "gpt-4-turbo",
        "messages": [{"role": "user", "content": prompt}],
        "stream": True  # Enable streaming
    }

    response = requests.post(url, headers=headers, json=payload, stream=True)

    full_response = ""
    for line in response.iter_lines():
        if not line:
            continue
        line = line.decode("utf-8")
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]  # Remove the "data: " prefix
        if data == "[DONE]":         # Sentinel marking the end of the stream
            break
        try:
            chunk = json.loads(data)
        except json.JSONDecodeError:
            continue
        delta = chunk.get("choices", [{}])[0].get("delta", {})
        content = delta.get("content", "")
        if content:
            print(content, end="", flush=True)  # Display as it arrives
            full_response += content
    return full_response
```
The benefit is immediate—users see characters appearing as they’re generated, creating a more responsive experience.
API Authentication Patterns
Different providers use different authentication methods, but they all follow similar principles:
API Key Authentication
The simplest and most common pattern: you include your API key in the Authorization header.
```python
headers = {
    "Authorization": f"Bearer {api_key}"
}
```
Security note: Never hardcode API keys in your code. Use environment variables instead:
```python
import os

api_key = os.getenv("OPENAI_API_KEY")
```
Or use configuration files (with proper .gitignore rules):
```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads variables from a local .env file into the environment
api_key = os.getenv("OPENAI_API_KEY")
```
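A .env file is just KEY=value lines. For example (the keys shown are placeholders), with .env itself listed in .gitignore so it never reaches version control:

```
# .env — add this file to .gitignore!
OPENAI_API_KEY=sk-proj-your-key-here
ANTHROPIC_API_KEY=sk-ant-your-key-here
```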
OAuth 2.0 (Less common for direct API access)
Some providers use OAuth for user-delegated access. You exchange credentials for a token with a limited lifetime. This is more complex but more secure for applications that shouldn’t have permanent access.
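A sketch of the OAuth 2.0 client-credentials exchange (the token endpoint URL and field names here are generic placeholders; real providers document their own endpoints and parameters):

```python
import requests

def get_access_token(client_id, client_secret, token_url):
    """Exchange client credentials for a short-lived bearer token.

    token_url is a placeholder; use the endpoint from your
    provider's OAuth documentation.
    """
    resp = requests.post(token_url, data={
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    })
    resp.raise_for_status()
    token = resp.json()
    return token["access_token"], token.get("expires_in")

# The returned token then goes in the usual header until it expires:
# headers = {"Authorization": f"Bearer {access_token}"}
```

Because the token expires, your application must track its lifetime and refresh it, which is the extra complexity (and the extra security) this flow buys.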
Custom Headers
Some APIs use custom authentication headers beyond the standard Authorization header. Always check the provider’s documentation.
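Anthropic's raw HTTP API is a concrete example: instead of Authorization: Bearer, it expects an x-api-key header plus a required anthropic-version header:

```python
import requests

# Anthropic's HTTP API authenticates with custom headers rather than
# the standard Authorization header.
headers = {
    "x-api-key": "sk-ant-...",          # your API key
    "anthropic-version": "2023-06-01",  # required API version header
    "content-type": "application/json",
}
# requests.post("https://api.anthropic.com/v1/messages",
#               headers=headers, json=payload)
```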
Different Provider APIs: OpenAI, Anthropic, Azure
Each provider has slightly different APIs, but the concepts are identical.
OpenAI’s Chat Completions API
```python
# OpenAI format
payload = {
    "model": "gpt-4-turbo",
    "messages": [
        {"role": "system", "content": "You are helpful"},
        {"role": "user", "content": "Hello"}
    ],
    "temperature": 0.7
}
```
Anthropic’s Messages API
```python
import anthropic

client = anthropic.Anthropic(api_key="sk-ant-...")

message = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Hello"}
    ]
)
print(message.content[0].text)
```
Notice that this example uses Anthropic's official SDK rather than raw HTTP. The SDK handles authentication and request formatting for you; OpenAI and Azure provide SDKs as well, and the raw-HTTP examples earlier show what those SDKs do under the hood.
Azure OpenAI
With version 1.0+ of the openai package, Azure access goes through a dedicated AzureOpenAI client:

```python
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="...",
    azure_endpoint="https://your-resource.openai.azure.com/",
    api_version="2024-02-15-preview"
)

# Then use the client like the standard OpenAI SDK,
# passing your deployment name as the model
response = client.chat.completions.create(
    model="your-deployment-name",
    messages=[{"role": "user", "content": "Hello"}]
)
```
Azure uses the same interface as OpenAI but requires additional configuration for Azure-specific endpoints and deployments.
Key Takeaway
LLM APIs abstract away the complexity of running models yourself. You send JSON over HTTP with your prompt and settings, and receive structured responses containing not just the text but also metadata like token counts and finish reasons. Authentication typically happens via API keys in headers. Different providers follow similar patterns but with different endpoints, parameter names, and SDKs—always check the documentation for your chosen provider.
Exercises
1. Set up API credentials: Create accounts with OpenAI and Anthropic. Store API keys in environment variables. Verify you can read them from your Python environment without hardcoding.
2. Make a synchronous API call: Write a Python function that calls an LLM API (either OpenAI or Anthropic) with a simple prompt. Print the response, token count, and any error messages.
3. Implement streaming: Modify your function to use streaming and display output as it arrives. Notice the latency difference compared to non-streaming.
4. Compare APIs: Make the same request to both OpenAI and Anthropic (if you have access). Compare the structure of requests, responses, and token counts for the same prompt.
5. Inspect the full response: Print the entire JSON response from an API call. Identify all fields—not just the message content. Understand what metadata is available.