Vision-Language Models
Vision-language models like CLIP and LLaVA enable joint understanding of images and text, powering applications like image retrieval, visual question answering, and image captioning. This lesson covers CLIP architecture, vision-language alignment, LLaVA for instruction-following, and VQA systems.
Core Concepts
CLIP: Contrastive Learning for Images and Text
CLIP learns to align images with text descriptions through contrastive learning:
Image Encoder → Image embedding
Text Encoder → Text embedding
Loss = Contrastive similarity between paired images and texts
Key insight: train at scale on roughly 400M diverse image-text pairs collected from the internet.
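A minimal sketch of the symmetric contrastive loss, assuming a batch of paired image and text embeddings and a fixed temperature (an illustration of the idea, not OpenAI's exact implementation):

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of each tensor is a matched image-text pair."""
    # Normalize so dot products become cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # (batch, batch) similarity matrix scaled by temperature
    logits = image_emb @ text_emb.t() / temperature
    # The correct match for each row/column sits on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy in both directions: image-to-text and text-to-image
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2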
Vision Transformer (ViT)
Uses the transformer architecture for images (a minimal patch-embedding sketch follows the list):
- Patch-based: divide the image into fixed-size patches (e.g., 16×16 pixels)
- Project each patch to a linear embedding
- Apply a standard transformer to the resulting patch sequence
- Scales more effectively than CNNs as data and model size grow
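A minimal sketch of the patch-embedding step, assuming 224×224 inputs and hypothetical ViT-Base-like dimensions:

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and project each patch to an embedding."""
    def __init__(self, image_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch
        # and applying a shared linear projection
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (batch, 3, 224, 224)
        x = self.proj(x)                     # (batch, embed_dim, 14, 14)
        return x.flatten(2).transpose(1, 2)  # (batch, 196, embed_dim) patch sequence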
LLaVA: Large Language and Vision Assistant
Connects a pretrained vision encoder to a large language model:
Image → ViT → Projection → LLM → Text output
Instruction-tuning on multimodal conversation data enables instruction-following for visual tasks.
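A rough sketch of that projection step, with hypothetical dimensions (1024-dim ViT features, 4096-dim LLM embeddings); the original LLaVA used a single linear layer, while LLaVA-1.5 uses a small MLP:

import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Map ViT patch features into the language model's token-embedding space."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features):   # (batch, num_patches, vision_dim)
        # Projected patch features are concatenated with the text token embeddings
        # and fed to the LLM as a single sequence
        return self.mlp(vision_features)  # (batch, num_patches, llm_dim)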
Practical Implementation
CLIP for Image-Text Matching
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Image-text matching: score one image against several candidate captions
image = Image.open("image.jpg")
texts = ["a cat", "a dog", "a bird"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image  # similarity score per caption
probs = logits_per_image.softmax(dim=1)
print(probs)  # probability for each text
Image Retrieval with CLIP Embeddings
import faiss
import numpy as np
import torch
from PIL import Image

class CLIPIndexer:
    def __init__(self, model, processor):
        self.model = model
        self.processor = processor
        self.index = None
        self.image_paths = []

    def index_images(self, image_paths):
        embeddings = []
        for path in image_paths:
            image = Image.open(path)
            inputs = self.processor(images=image, return_tensors="pt")
            with torch.no_grad():
                image_features = self.model.get_image_features(**inputs)
            # L2-normalize so L2 distance in the index matches cosine similarity
            image_features = image_features / image_features.norm(dim=-1, keepdim=True)
            embeddings.append(image_features.cpu().numpy())
        embeddings = np.concatenate(embeddings, axis=0).astype("float32")
        self.index = faiss.IndexFlatL2(embeddings.shape[1])
        self.index.add(embeddings)
        self.image_paths = image_paths

    def search(self, query_text, k=5):
        inputs = self.processor(text=query_text, return_tensors="pt", padding=True)
        with torch.no_grad():
            text_features = self.model.get_text_features(**inputs)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        text_features = text_features.cpu().numpy().astype("float32")
        distances, indices = self.index.search(text_features, k)
        return [self.image_paths[i] for i in indices[0]]
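Example usage of the indexer (the file paths and query are illustrative):

indexer = CLIPIndexer(model, processor)
indexer.index_images(["photos/beach.jpg", "photos/city.jpg", "photos/forest.jpg"])

matches = indexer.search("a sunny beach with palm trees", k=2)
print(matches)  # paths of the two closest images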
LLaVA for Visual Question Answering
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

image = Image.open("image.jpg")
question = "What is in this image?"
# LLaVA-1.5 uses a USER/ASSISTANT chat template with an <image> placeholder
prompt = f"USER: <image>\n{question} ASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=200)

answer = processor.decode(outputs[0], skip_special_tokens=True)
print(answer)
Advanced Techniques
Fine-tuning Vision-Language Models
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./vlm_finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,  # small batches: VLM activations are memory-heavy
    learning_rate=2e-4,
    warmup_steps=500,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
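The Trainer above needs a train_dataset that yields model-ready tensors. Below is a minimal sketch for CLIP-style contrastive fine-tuning; the (image_path, caption) pair format, the dataset class, and the collator are illustrative assumptions (LLaVA-style instruction tuning would need a different dataset and label handling):

import torch
from PIL import Image
from torch.utils.data import Dataset

class ImageTextDataset(Dataset):
    """Yield processor-ready tensors from (image_path, caption) pairs."""
    def __init__(self, pairs, processor, max_length=77):
        self.pairs = pairs
        self.processor = processor
        self.max_length = max_length

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        image_path, caption = self.pairs[idx]
        encoded = self.processor(
            text=caption,
            images=Image.open(image_path).convert("RGB"),
            padding="max_length",
            max_length=self.max_length,
            truncation=True,
            return_tensors="pt",
        )
        # Drop the singleton batch dimension the processor adds
        return {k: v.squeeze(0) for k, v in encoded.items()}

def collate_clip(features):
    batch = {k: torch.stack([f[k] for f in features]) for k in features[0]}
    batch["return_loss"] = True  # tells CLIPModel to compute its contrastive loss
    return batch

Pass data_collator=collate_clip to the Trainer so each batch includes return_loss=True.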
Zero-Shot Image Classification
# Use the CLIP model and processor loaded earlier for zero-shot classification
classes = ["dog", "cat", "bird", "fish"]
prompts = [f"a photo of a {c}" for c in classes]

image = Image.open("image.jpg")
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits_per_image
probs = logits.softmax(dim=1)
predicted_class = classes[probs.argmax().item()]
Production Considerations
Batch Inference Optimization
def batch_embed_images(image_paths, batch_size=32):
    """Embed images in batches to amortize preprocessing and forward-pass cost."""
    embeddings = []
    model.eval()
    with torch.no_grad():
        for i in range(0, len(image_paths), batch_size):
            batch_paths = image_paths[i:i + batch_size]
            images = [Image.open(p) for p in batch_paths]
            inputs = processor(images=images, return_tensors="pt")
            # Keep inputs on the same device as the model (CPU or GPU)
            inputs = {k: v.to(model.device) for k, v in inputs.items()}
            image_features = model.get_image_features(**inputs)
            embeddings.append(image_features.cpu())
    return torch.cat(embeddings, dim=0)
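Example usage; image_paths is a hypothetical list of file paths, and the function above keeps inputs on the model's device, so it works on CPU or GPU:

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

embeddings = batch_embed_images(image_paths, batch_size=64)
print(embeddings.shape)  # (num_images, embedding_dim)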
Key Takeaway
Vision-language models enable joint understanding of images and text, unlocking applications from image search to visual question answering. CLIP provides powerful zero-shot capabilities, while LLaVA enables instruction-following for complex visual reasoning.
Practical Exercise
Task: Build an image search engine with CLIP and deploy it as a web service.
Requirements:
- Index 10,000+ images with CLIP embeddings
- Implement text-to-image search
- Support multi-modal retrieval
- Create FastAPI endpoint
- Deploy with caching
Evaluation:
- Mean Reciprocal Rank > 0.75
- Inference latency < 100ms
- Scalability to 1M images
- User satisfaction metrics