Object Detection and Vision Models

Lesson 2 of 4 · Estimated time: 60 min

This lesson covers modern vision models for object detection (YOLO, DETR), segmentation (SAM), and depth estimation. These models enable understanding of spatial relationships, detailed visual analysis, and 3D scene understanding.

Core Concepts

Object Detection Approaches

Anchor-based (YOLOv2–v5, Faster R-CNN):

  • Predict bounding boxes as offsets from predefined anchor boxes
  • Fast inference, well suited to real-time use
  • Requires careful anchor design (sizes and aspect ratios)

Anchor-free (DETR, YOLOv8):

  • Predict boxes directly, with no predefined anchors
  • DETR uses a transformer decoder with set-based prediction, eliminating NMS
  • Cleaner architecture, but DETR typically needs more data and longer training
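Anchor-based detectors emit many overlapping candidate boxes for the same object, so they rely on non-maximum suppression (NMS) to keep only the best box per object; DETR's set prediction removes that step. A minimal pure-Python sketch of greedy NMS (box format and threshold are illustrative):

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```

Production detectors use vectorized implementations (e.g., `torchvision.ops.nms`), but the logic is the same.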

Key metrics:

  • mAP (mean Average Precision, usually reported at IoU 0.5 or averaged over IoU 0.5–0.95)
  • FPS (frames per second)
  • Latency at different batch sizes
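mAP averages per-class Average Precision, where each prediction is counted as a true positive if it matches a ground-truth box above an IoU threshold. A minimal sketch of per-class AP using the simple "precision at each recall step" approximation (evaluation toolkits like COCO's use interpolated variants):

```python
def average_precision(tp_flags, total_gt):
    """AP for one class.

    tp_flags: per-prediction 1/0 flags (sorted by descending confidence),
              1 if the prediction matched a ground-truth box.
    total_gt: number of ground-truth boxes for this class.
    """
    tps = 0
    precisions = []
    for rank, tp in enumerate(tp_flags, start=1):
        tps += tp
        if tp:
            precisions.append(tps / rank)  # precision at this recall step
    return sum(precisions) / total_gt if total_gt else 0.0
```

mAP is then the mean of this value over all classes (and, for COCO-style mAP, over IoU thresholds as well).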

Segmentation

  • Instance segmentation: per-object pixel-level masks
  • Semantic segmentation: per-class pixel-level labels
  • Panoptic segmentation: combines instance and semantic

SAM (Segment Anything Model): a promptable model that segments arbitrary objects from point, box, or mask prompts, with strong zero-shot generalization.

Practical Implementation

Object Detection with YOLOv8

from ultralytics import YOLO

# Load a pretrained medium-sized model
model = YOLO('yolov8m.pt')

# Run detection on an image
results = model.predict(source='image.jpg', conf=0.25)

# Inspect detections
for result in results:
    for box in result.boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()  # corner coordinates
        conf = float(box.conf[0])              # confidence score
        cls = int(box.cls[0])                  # class index
        print(f"Class: {model.names[cls]}, Confidence: {conf:.2f}, "
              f"Box: ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")

DETR: Detection Transformer

import torch
from PIL import Image
from transformers import DetrImageProcessor, AutoModelForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = AutoModelForObjectDetection.from_pretrained("facebook/detr-resnet-50")

image = Image.open("image.jpg")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs to boxes/labels; PIL .size is (W, H), DETR expects (H, W)
results = processor.post_process_object_detection(
    outputs,
    target_sizes=[image.size[::-1]],
    threshold=0.9
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box = [round(i, 2) for i in box.tolist()]
    print(f"Detected {model.config.id2label[label.item()]} with confidence {round(score.item(), 3)} at {box}")

SAM: Segment Anything

import torch
from PIL import Image
from transformers import SamProcessor, SamModel

processor = SamProcessor.from_pretrained("facebook/sam-vit-base")
model = SamModel.from_pretrained("facebook/sam-vit-base")

image = Image.open("image.jpg")

# Point-based segmentation: one (x, y) point prompt,
# nested as (batch, point_batch, num_points, 2)
input_points = [[[[500, 375]]]]
inputs = processor(image, input_points=input_points, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

masks = processor.image_processor.post_process_masks(
    outputs.pred_masks,
    inputs["original_sizes"],
    inputs["reshaped_input_sizes"]
)

Advanced Techniques

Multi-task Learning

import torch.nn as nn

class MultiTaskVisionModel(nn.Module):
    """Shared backbone with task-specific heads.

    DetectionHead, SegmentationHead, and DepthHead are placeholders
    for whatever head architectures each task requires.
    """
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone
        self.detection_head = DetectionHead()
        self.segmentation_head = SegmentationHead()
        self.depth_head = DepthHead()

    def forward(self, x):
        # One backbone pass feeds all three heads
        features = self.backbone(x)
        detection = self.detection_head(features)
        segmentation = self.segmentation_head(features)
        depth = self.depth_head(features)
        return detection, segmentation, depth

Production Considerations

Real-time Inference

import cv2
import time
from ultralytics import YOLO

def real_time_detection(video_source=0):
    model = YOLO('yolov8n.pt')  # nano model for speed
    cap = cv2.VideoCapture(video_source)

    while True:
        ret, frame = cap.read()
        if not ret:
            break

        start = time.perf_counter()
        results = model.predict(frame, conf=0.5)
        fps = 1.0 / (time.perf_counter() - start)

        annotated_frame = results[0].plot()
        cv2.putText(annotated_frame, f"{fps:.1f} FPS", (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)

        cv2.imshow("Detection", annotated_frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

    cap.release()
    cv2.destroyAllWindows()

Key Takeaway

Modern vision models enable precise understanding of visual content through detection, segmentation, and depth estimation. SAM’s zero-shot capabilities and YOLO’s efficiency make state-of-the-art performance accessible across applications.

Practical Exercise

Task: Build a real-time object detection system for a video stream.

Requirements:

  1. Implement YOLOv8 detector
  2. Add tracking across frames
  3. Count objects by class
  4. Optimize for real-time FPS
  5. Deploy with streaming
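For the tracking requirement, a production tracker (e.g., ByteTrack, or the built-in `model.track()` API in ultralytics) is the practical choice. As a starting point, here is a minimal centroid-matching tracker sketch: IDs persist as long as a detection's center stays within `max_dist` of a previously tracked object. All names and the distance threshold are illustrative.

```python
import math

class CentroidTracker:
    """Assigns persistent IDs by nearest-centroid matching between frames."""

    def __init__(self, max_dist=50.0):
        self.max_dist = max_dist
        self.next_id = 0
        self.objects = {}  # id -> (cx, cy)

    def update(self, boxes):
        """boxes: list of (x1, y1, x2, y2); returns {id: centroid}."""
        centroids = [((x1 + x2) / 2, (y1 + y2) / 2)
                     for x1, y1, x2, y2 in boxes]
        assigned = {}
        unmatched = list(range(len(centroids)))

        # Greedily match each existing ID to its nearest new centroid
        for oid, (ox, oy) in self.objects.items():
            best, best_d = None, self.max_dist
            for i in unmatched:
                cx, cy = centroids[i]
                d = math.hypot(cx - ox, cy - oy)
                if d < best_d:
                    best, best_d = i, d
            if best is not None:
                assigned[oid] = centroids[best]
                unmatched.remove(best)
            # else: the object left the frame and its ID is dropped

        # Unmatched detections become new tracked objects
        for i in unmatched:
            assigned[self.next_id] = centroids[i]
            self.next_id += 1

        self.objects = assigned
        return assigned
```

Feeding it the per-frame YOLO boxes gives stable IDs for counting; note that pure centroid matching loses identity under occlusion or fast motion, which is where ByteTrack-style appearance and motion models come in.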

Evaluation:

  • Latency < 33ms (30 FPS)
  • Accuracy metrics (mAP)
  • Memory efficiency
  • Multi-object tracking precision