RAG Architecture and Vector Databases
Retrieval-Augmented Generation (RAG) combines retrieval with generation. Instead of making the model generate purely from its training knowledge, you retrieve relevant documents and provide them as context. This powers accurate answers over custom data. This lesson covers how RAG works, embedding models, and vector databases.
What Is RAG?
RAG solves a critical problem: LLMs have a knowledge cutoff and can’t access your proprietary data. A RAG system:
- Takes a user query
- Searches through documents for relevant passages
- Provides those passages as context
- Asks the model to answer using that context
This is more accurate than pure generation because the model has facts to reference.
Without RAG: “What was our Q3 revenue?” → the model guesses (and is wrong).
With RAG: the same query → retrieves financial documents → the model answers accurately.
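To make that flow concrete, here is a minimal retrieve-then-generate sketch. The retrieve argument is a placeholder for any search function that returns a list of passage strings, and the model name is only an example, not a requirement:

from langchain_openai import ChatOpenAI

def answer_with_rag(query: str, retrieve) -> str:
    """Minimal RAG loop: retrieve passages, then generate from them."""
    # 1. Search documents for relevant passages (retrieve is any
    #    search function that returns a list of passage strings)
    passages = retrieve(query)
    # 2. Provide those passages as context
    context = "\n\n".join(passages)
    # 3. Ask the model to answer using the context
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    llm = ChatOpenAI(model="gpt-4o-mini")  # example model
    return llm.invoke(prompt).content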
Embeddings: Converting Text to Vectors
An embedding is a numerical representation of text. Similar texts have similar embeddings, allowing vector similarity search.
Think of embedding as converting words into coordinates in high-dimensional space:
- “dog” and “puppy” are close
- “dog” and “car” are far apart
from langchain_openai import OpenAIEmbeddings
import numpy as np

# Create the embedding model
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Embed a query
query_embedding = embeddings.embed_query("What is machine learning?")
print(f"Query embedding dimension: {len(query_embedding)}")
print(f"First 5 dimensions: {query_embedding[:5]}")

# Embed documents
texts = [
    "Machine learning is a subset of AI",
    "Deep learning uses neural networks",
    "A cat is an animal",
]
doc_embeddings = embeddings.embed_documents(texts)
print(f"Embedded {len(doc_embeddings)} documents")

# Similarity: closer embeddings = more similar texts
def cosine_similarity(a, b):
    """Calculate cosine similarity between two vectors."""
    a = np.array(a)
    b = np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Compare the query to each document
for i, doc_emb in enumerate(doc_embeddings):
    similarity = cosine_similarity(query_embedding, doc_emb)
    print(f"Doc {i} similarity: {similarity:.3f}")
Vector Databases
Vector databases store embeddings and enable fast similarity search. They’re optimized for finding nearest neighbors in high-dimensional space—much faster than calculating distance to every document.
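To see what that optimization replaces, here is the naive alternative: a brute-force scan that scores every stored vector against the query. A sketch reusing the cosine_similarity helper defined above:

import numpy as np

def brute_force_top_k(query_vec, doc_vecs, k=3):
    """O(n) scan: score every document, then sort.
    Vector databases avoid this with approximate nearest-neighbor indexes.
    """
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return np.argsort(scores)[::-1][:k]  # indices of the k best matches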
Popular options:
- Chroma: Embedded, simple, can persist to disk, good for development
- Pinecone: Cloud-hosted, production-ready
- Weaviate: Open-source, hybrid search
- pgvector: PostgreSQL extension
- Milvus: Distributed vector database
Using Chroma
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Prepare documents
documents = [
    "Machine learning is a subset of artificial intelligence",
    "Deep learning uses neural networks with many layers",
    "Natural language processing handles text data",
    "Computer vision processes images",
]

# Create embeddings and store in Chroma
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_texts(
    texts=documents,
    embedding=embeddings,
    persist_directory="./chroma_db"  # Persist to disk
)

# Search
query = "What is deep learning?"
results = vectorstore.similarity_search(query, k=2)
for i, result in enumerate(results):
    print(f"Result {i}: {result.page_content}")

# Search with scores
results_with_scores = vectorstore.similarity_search_with_relevance_scores(query, k=2)
for result, score in results_with_scores:
    print(f"Score: {score:.3f}, Content: {result.page_content}")
Using Pinecone
from pinecone import Pinecone, ServerlessSpec
from langchain_pinecone import PineconeVectorStore
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# Initialize the Pinecone client
pc = Pinecone(api_key="your-key")

# Create the index if it doesn't exist
index_name = "rag-index"
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,  # OpenAI text-embedding-3-small dimension
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),  # example region
    )

# Store documents
docs = [Document(page_content="RAG combines retrieval with generation")]
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = PineconeVectorStore.from_documents(
    documents=docs,
    embedding=embeddings,
    index_name=index_name,
)

# Search
query = "What is RAG?"
results = vectorstore.similarity_search(query, k=3)
Building a Vector Store from Documents
End-to-end pipeline:
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from pathlib import Path

class RAGPipeline:
    """Build a RAG system from documents."""

    def __init__(self, db_path: str = "./vectorstore"):
        self.db_path = db_path
        self.embeddings = OpenAIEmbeddings()
        self.vectorstore = None

    def ingest_documents(self, documents_dir: str) -> int:
        """Load documents and build the vector store."""
        # Load all text files
        docs = []
        for file_path in Path(documents_dir).glob("*.txt"):
            docs.extend(TextLoader(str(file_path)).load())

        # Split documents into overlapping chunks
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200
        )
        split_docs = splitter.split_documents(docs)

        # Create the vector store
        self.vectorstore = Chroma.from_documents(
            documents=split_docs,
            embedding=self.embeddings,
            persist_directory=self.db_path
        )
        return len(split_docs)

    def search(self, query: str, k: int = 3) -> list:
        """Search for relevant documents."""
        if not self.vectorstore:
            # Reopen the persisted store from disk
            self.vectorstore = Chroma(
                persist_directory=self.db_path,
                embedding_function=self.embeddings
            )
        return self.vectorstore.similarity_search(query, k=k)

    def get_retriever(self):
        """Get a LangChain retriever."""
        if not self.vectorstore:
            self.vectorstore = Chroma(
                persist_directory=self.db_path,
                embedding_function=self.embeddings
            )
        return self.vectorstore.as_retriever(search_kwargs={"k": 3})

# Usage
rag = RAGPipeline()
num_chunks = rag.ingest_documents("./documents")
print(f"Ingested {num_chunks} chunks")

results = rag.search("What is machine learning?")
for result in results:
    print(result.page_content)
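With the retriever in hand, the pipeline can be wired into a full retrieve-then-generate chain. A sketch using LCEL piping; the prompt wording and model name are illustrative:

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

# Fetch the top-3 chunks, format them, and feed them to the model
retriever = rag.get_retriever()
prompt = ChatPromptTemplate.from_template(
    "Answer using only this context:\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    return "\n\n".join(d.page_content for d in docs)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | ChatOpenAI(model="gpt-4o-mini")  # example model
    | StrOutputParser()
)
print(chain.invoke("What is machine learning?"))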
Metadata and Filtering
Store metadata with documents for filtering:
from langchain_core.documents import Document
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Create documents with metadata
docs_with_metadata = [
    Document(
        page_content="Q3 revenue was $500M",
        metadata={"source": "Q3_report.pdf", "date": "2024-09-30"}
    ),
    Document(
        page_content="Q4 revenue was $600M",
        metadata={"source": "Q4_report.pdf", "date": "2024-12-31"}
    ),
]

# Store in the vector database
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(
    documents=docs_with_metadata,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# Search with scores; metadata travels with each result
results = vectorstore.similarity_search_with_relevance_scores(
    query="revenue",
    k=5,
    # Metadata filter syntax depends on the vector store backend
)
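With Chroma as the backend, the filter is a metadata match passed to similarity_search. A minimal sketch, assuming the store built above:

# Restrict the search to one source document (Chroma filter syntax)
q4_results = vectorstore.similarity_search(
    "revenue",
    k=5,
    filter={"source": "Q4_report.pdf"},
)
for doc in q4_results:
    print(doc.metadata["source"], "→", doc.page_content)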
Comparing Embedding Models
Different embedding models have different strengths:
from langchain_openai import OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_cohere import CohereEmbeddings

class EmbeddingComparison:
    """Compare embedding models."""

    @staticmethod
    def compare_models(text: str):
        """Test different embedding models."""
        models = {
            "OpenAI (Small)": OpenAIEmbeddings(model="text-embedding-3-small"),
            "OpenAI (Large)": OpenAIEmbeddings(model="text-embedding-3-large"),
            "HuggingFace": HuggingFaceEmbeddings(),
            "Cohere": CohereEmbeddings(),
        }
        results = {}
        for name, model in models.items():
            embedding = model.embed_query(text)
            results[name] = {
                "dimension": len(embedding),
                "first_value": embedding[0],
                # True if the vector is (approximately) unit-normalized
                "is_normalized": abs(sum(e**2 for e in embedding) - 1.0) < 0.01,
            }
        return results

# Compare
results = EmbeddingComparison.compare_models("Machine learning")
for model_name, stats in results.items():
    print(f"{model_name}: {stats['dimension']}D")
Dimensionality Considerations
Embedding dimension affects storage, search speed, and accuracy:
# Smaller embeddings (faster, less storage)
small_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")  # 1536D by default
# Larger embeddings (slower, more storage, potentially more accurate)
large_embeddings = OpenAIEmbeddings(model="text-embedding-3-large")  # 3072D by default
# Trade-off: use small for speed, large for accuracy
# For most RAG applications, small is sufficient and faster
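The text-embedding-3 models can also return shortened vectors at request time. A sketch using the dimensions parameter exposed by OpenAIEmbeddings:

# Ask text-embedding-3-large for 1024 dimensions instead of 3072
compact_embeddings = OpenAIEmbeddings(
    model="text-embedding-3-large",
    dimensions=1024,  # supported by the text-embedding-3 family
)
print(len(compact_embeddings.embed_query("Machine learning")))  # 1024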
Key Takeaway
RAG combines retrieval and generation: search for relevant documents, then ask the model to answer using those documents. Embeddings convert text to vectors, enabling similarity search. Vector databases store embeddings and find nearest neighbors efficiently. Chroma is good for development, Pinecone for production. Build RAG systems by loading documents, splitting them, embedding, and storing in a vector database. Always persist your vector store to avoid re-ingesting documents.
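To illustrate that last point, a persisted Chroma store can be reopened directly, with no re-ingestion (assuming the ./chroma_db directory created earlier):

from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Reopen the persisted store; nothing is re-embedded
reloaded = Chroma(
    persist_directory="./chroma_db",
    embedding_function=OpenAIEmbeddings(),
)
print(reloaded.similarity_search("revenue", k=1))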
Exercises
- Embeddings: Embed several sentences. Calculate similarity between them. Verify semantically similar sentences have higher scores.
- Vector store: Create a Chroma vector store from 10+ documents. Search for queries and inspect results.
- Metadata: Store documents with rich metadata (source, date, category). Search and verify metadata is preserved.
- Pipeline: Build an end-to-end RAG pipeline that loads documents, splits, embeds, and stores them.
- Comparison: Compare embedding models on quality and speed. Test with various query types.
- Persistence: Create a vector store, persist to disk, close it, then reload and verify documents are still there.