Object Detection and Vision Models
This lesson covers modern vision models for object detection (YOLO, DETR), segmentation (SAM), and depth estimation. These models enable understanding of spatial relationships, detailed visual analysis, and 3D scene understanding.
Core Concepts
Object Detection Approaches
Anchor-based (classic YOLO, Faster R-CNN):
- Predict bounding boxes as offsets from predefined anchor shapes
- Fast inference, good for real-time use
- Requires careful anchor design (note: recent YOLO versions, including YOLOv8, have moved to anchor-free heads)
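The anchor idea can be sketched in a few lines: the network predicts offsets relative to a predefined box, and decoding turns those offsets into an absolute box. The parameterization below follows the common Faster R-CNN convention; the numbers are purely illustrative.

```python
import math

def decode_anchor(anchor, offsets):
    """Decode predicted offsets (tx, ty, tw, th) relative to an
    anchor (cx, cy, w, h) into an absolute box, Faster R-CNN style."""
    cx, cy, w, h = anchor
    tx, ty, tw, th = offsets
    px = cx + tx * w        # shift the center by a fraction of anchor size
    py = cy + ty * h
    pw = w * math.exp(tw)   # scale width/height log-linearly
    ph = h * math.exp(th)
    return (px, py, pw, ph)

box = decode_anchor(anchor=(100, 100, 50, 50), offsets=(0.1, -0.2, 0.0, 0.0))
# center shifted by (+5, -10); size unchanged since exp(0) == 1
```

The log-space parameterization for width and height keeps predicted sizes positive and makes the regression targets better behaved across object scales.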
Anchor-free (DETR):
- Direct end-to-end detection with no anchors and no hand-tuned NMS
- Uses a transformer encoder-decoder with learned object queries
- Cleaner architecture, but typically needs more data and longer training
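DETR's end-to-end training hinges on one-to-one bipartite matching between predictions and ground-truth boxes: each ground truth is assigned to exactly one prediction via a cost matrix. The real model solves this optimally with the Hungarian algorithm (`scipy.optimize.linear_sum_assignment`); the sketch below is a simplified greedy stand-in that conveys the idea.

```python
def greedy_match(cost):
    """Greedily assign each ground-truth object (row) to a distinct
    prediction (column) in order of increasing cost. A simplified
    stand-in for the optimal Hungarian matching DETR uses."""
    pairs = sorted(
        (cost[i][j], i, j)
        for i in range(len(cost))
        for j in range(len(cost[0]))
    )
    used_rows, used_cols, match = set(), set(), {}
    for c, i, j in pairs:
        if i not in used_rows and j not in used_cols:
            match[i] = j
            used_rows.add(i)
            used_cols.add(j)
    return match

# rows: 2 ground-truth objects; columns: 3 predictions
cost = [[0.9, 0.1, 0.5],
        [0.2, 0.8, 0.4]]
print(greedy_match(cost))  # {0: 1, 1: 0}
```

In DETR the cost combines classification probability and box distance; unmatched predictions are trained toward a "no object" class, which is what removes the need for NMS.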
Key metrics:
- mAP (mean Average Precision)
- FPS (frames per second)
- Latency at different batch sizes
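mAP is built on Intersection-over-Union (IoU): a detection counts as a true positive only if its IoU with a ground-truth box clears a threshold (0.5 for the classic PASCAL VOC metric; COCO averages over thresholds from 0.5 to 0.95). A minimal IoU for boxes in `(x1, y1, x2, y2)` corner format:

```python
def iou(a, b):
    """Intersection-over-Union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)  # overlap area (0 if disjoint)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (0, 0, 10, 5)))  # 0.5: overlap 50, union 100
```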
Segmentation
Instance Segmentation: per-object pixel-level masks
Semantic Segmentation: per-class pixel-level labels
Panoptic Segmentation: combines instance and semantic labels in one output
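The three outputs can be related with a toy example: a semantic map assigns a class id per pixel, an instance map separates objects of the same class, and a panoptic map fuses both. The `class_id * 1000 + instance_id` encoding below is a common convention for packing both ids into one integer (the offset 1000 is arbitrary, just large enough to avoid collisions).

```python
# 2x3 toy image: class 1 = "car", two separate car instances
semantic = [[1, 1, 0],
            [0, 1, 1]]
instance = [[1, 1, 0],
            [0, 2, 2]]   # same class, different object ids

# Fuse per pixel: panoptic id = class_id * 1000 + instance_id
panoptic = [
    [sem * 1000 + inst for sem, inst in zip(srow, irow)]
    for srow, irow in zip(semantic, instance)
]
print(panoptic)  # [[1001, 1001, 0], [0, 1002, 1002]]
```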
SAM (Segment Anything): Unified model for any segmentation task.
Practical Implementation
Object Detection with YOLOv8
```python
from ultralytics import YOLO

# Load a pretrained medium-size model
model = YOLO('yolov8m.pt')

# Run detection with a 0.25 confidence threshold
results = model.predict(source='image.jpg', conf=0.25)

# Inspect detections
for result in results:
    for box in result.boxes:
        x1, y1, x2, y2 = box.xyxy[0]   # corner coordinates
        conf = box.conf[0]             # confidence score
        cls = int(box.cls[0])          # class index
        print(f"Class: {result.names[cls]}, Confidence: {conf:.2f}")
```
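The `conf=0.25` threshold is only half of the filtering story: YOLO also applies non-maximum suppression (NMS) internally to drop duplicate boxes covering the same object. A minimal, self-contained NMS over `(x1, y1, x2, y2, score)` tuples shows the mechanism:

```python
def nms(boxes, iou_thresh=0.5):
    """Minimal non-maximum suppression.
    boxes: list of (x1, y1, x2, y2, score); returns the kept boxes."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union else 0.0

    kept = []
    # Visit boxes from highest to lowest score; keep a box only if it
    # does not overlap an already-kept box too much.
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) < iou_thresh for k in kept):
            kept.append(box)
    return kept

detections = [(0, 0, 10, 10, 0.9), (1, 1, 11, 11, 0.8), (50, 50, 60, 60, 0.7)]
print(nms(detections))  # keeps the 0.9 box and the distant 0.7 box
```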
DETR: Detection Transformer
```python
import torch
from PIL import Image
from transformers import DetrImageProcessor, AutoModelForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = AutoModelForObjectDetection.from_pretrained("facebook/detr-resnet-50")

image = Image.open("image.jpg")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs to boxes in (x1, y1, x2, y2) pixel coordinates
results = processor.post_process_object_detection(
    outputs,
    target_sizes=[image.size[::-1]],  # (height, width)
    threshold=0.9,
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box = [round(i, 2) for i in box.tolist()]
    print(f"Detected {model.config.id2label[label.item()]} "
          f"with confidence {round(score.item(), 3)} at {box}")
```
SAM: Segment Anything
```python
import torch
from PIL import Image
from transformers import SamProcessor, SamModel

processor = SamProcessor.from_pretrained("facebook/sam-vit-base")
model = SamModel.from_pretrained("facebook/sam-vit-base")

image = Image.open("image.jpg")

# Point-prompted segmentation: one (x, y) point on the target object,
# nested as [image][point set][point]
input_points = [[[500, 375]]]
inputs = processor(image, input_points=input_points, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Upscale the low-resolution predicted masks back to the original image size
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"],
    inputs["reshaped_input_sizes"],
)
```
Advanced Techniques
Multi-task Learning
```python
import torch.nn as nn

class MultiTaskVisionModel(nn.Module):
    """Shared backbone with task-specific heads. DetectionHead,
    SegmentationHead, and DepthHead are placeholders for real modules."""

    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone
        self.detection_head = DetectionHead()
        self.segmentation_head = SegmentationHead()
        self.depth_head = DepthHead()

    def forward(self, x):
        features = self.backbone(x)  # computed once, shared by all heads
        detection = self.detection_head(features)
        segmentation = self.segmentation_head(features)
        depth = self.depth_head(features)
        return detection, segmentation, depth
```
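Training a multi-head model like this needs a single scalar loss; the usual approach is a weighted sum of per-task losses, with weights tuned (or learned, as in uncertainty weighting) so that no task dominates the shared backbone. A sketch with purely illustrative weights:

```python
def combined_loss(losses, weights):
    """Weighted sum of per-task losses. The weights are hyperparameters;
    the values below are illustrative, not recommended settings."""
    return sum(weights[task] * loss for task, loss in losses.items())

losses = {"detection": 1.2, "segmentation": 0.8, "depth": 0.3}
weights = {"detection": 1.0, "segmentation": 0.5, "depth": 0.25}
print(combined_loss(losses, weights))  # 1.2 + 0.4 + 0.075 ≈ 1.675
```

In practice the per-task values come from their respective criteria (e.g. box regression loss, cross-entropy over masks, scale-invariant depth loss) on each batch.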
Production Considerations
Real-time Inference
```python
import cv2
from ultralytics import YOLO

def real_time_detection(video_source=0):
    model = YOLO('yolov8n.pt')  # nano model for speed
    cap = cv2.VideoCapture(video_source)
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        results = model.predict(frame, conf=0.5)
        annotated_frame = results[0].plot()  # draw boxes and labels
        cv2.imshow("Detection", annotated_frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):  # quit on 'q'
            break
    cap.release()
    cv2.destroyAllWindows()
```
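To verify the loop actually holds real-time rates, measure latency around the inference call rather than trusting a model's nominal speed. A self-contained timing sketch, where the `work` callable stands in for a per-frame `model.predict` call:

```python
import time

def measure_fps(work, n_frames=50):
    """Average FPS over n_frames calls to work(); work stands in for
    per-frame inference (e.g. model.predict on a single frame)."""
    start = time.perf_counter()
    for _ in range(n_frames):
        work()
    elapsed = time.perf_counter() - start
    return n_frames / elapsed

fps = measure_fps(lambda: sum(range(1000)))  # dummy workload
print(f"{fps:.1f} FPS")
```

Measuring over many frames smooths out per-frame jitter; for a latency budget like the 33 ms target below, also record the worst-case frame time, not just the average.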
Key Takeaway
Modern vision models enable precise understanding of visual content through detection, segmentation, and depth estimation. SAM’s zero-shot capabilities and YOLO’s efficiency make state-of-the-art performance accessible across applications.
Practical Exercise
Task: Build a real-time object detection system for a video stream.
Requirements:
- Implement YOLOv8 detector
- Add tracking across frames
- Count objects by class
- Optimize for real-time FPS
- Deploy with streaming
Evaluation:
- Latency < 33ms (30 FPS)
- Accuracy metrics (mAP)
- Memory efficiency
- Multi-object tracking precision