Advanced
Multimodal Learning (Vision-Language)
Explore the intersection of computer vision and language, learning to build models that understand both images and text. Understand vision transformers and multimodal architectures.
4 Lessons
30h Est. Time
4 Objectives
1 Assessment
By completing this module you will be able to:
✓ Understand vision transformers and their advantages over CNNs
✓ Explain vision-language models such as CLIP and their applications
✓ Implement multimodal embeddings and cross-modal retrieval
✓ Apply image-text alignment techniques
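To preview the kind of cross-modal retrieval covered in these objectives, here is a minimal sketch of CLIP-style retrieval. It assumes image and text embeddings have already been produced by some encoder; the random arrays below are placeholders standing in for real encoder outputs, and all variable names are illustrative.

```python
import numpy as np

# Placeholder embeddings: in practice these would come from a trained
# image encoder and text encoder (e.g. a CLIP-style model).
rng = np.random.default_rng(0)
image_embeds = rng.normal(size=(4, 8))  # 4 images, 8-dim embeddings
text_embeds = rng.normal(size=(4, 8))   # 4 captions, same embedding space

def normalize(x):
    # Project onto the unit sphere so a dot product equals cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Similarity matrix: entry (i, j) scores caption i against image j.
sim = normalize(text_embeds) @ normalize(image_embeds).T

# Text-to-image retrieval: pick the most similar image for each caption.
best_image_per_text = sim.argmax(axis=1)
print(sim.shape)  # (4, 4)
```

The same matrix, read column-wise with `argmax(axis=0)`, gives image-to-text retrieval; contrastive training (as in CLIP) pushes the diagonal of this matrix toward 1 for matched pairs.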
Lessons
Work through each lesson in order. Each one builds on the concepts from the previous lesson.
1. Vision-Language Models
2. Object Detection and Vision Models
3. Audio Models and Speech Processing
4. Multimodal Applications and Deployment
Recommended Reading
Supplement your learning with these selected chapters from the course library.
Transformers for NLP and Computer Vision 3e
Chapters 11-15
Modern Computer Vision with PyTorch 2e
Chapters 7-10
Module Assessment
Multimodal Learning (Vision-Language)