Audio, Video, and Document Prompting
Introduction
Images are just the beginning. Modern AI systems work with audio, video, and documents. But unlike images, these typically require preprocessing before models can understand them. Most models can't ingest a 30-minute video directly - you need to break it down first.
This lesson teaches you how to work with these modalities. You’ll learn what preprocessing is necessary, how to structure prompts around multimodal content, and how to combine multiple modalities effectively.
Key Takeaway: Audio, video, and documents are fundamentally different from images. They require conversion to text (transcription) or structured extraction before you can prompt about them. The challenge is doing this conversion without losing critical context.
Audio Prompting
Audio models work through transcription. First you convert speech to text, then you prompt about the text:
Audio → Transcription → Text → Model → Analysis
But transcription isn’t perfect. The model might mishear words, especially with accents, background noise, or domain-specific terminology. Your prompts must account for this.
Speech-to-Text and Context
def transcribe_and_analyze_audio(audio_path: str, context: dict = None) -> dict:
    """Transcribe audio and analyze with context"""
    # Step 1: Transcribe
    transcript = transcribe_audio(audio_path)

    # Step 2: Build analysis prompt with context
    context_str = ""
    if context:
        context_str = f"""CONTEXT:
- Speaker: {context.get('speaker', 'Unknown')}
- Topic: {context.get('topic', 'Unknown')}
- Language: {context.get('language', 'English')}
- Expected terms: {', '.join(context.get('expected_terms', []))}
Use this context to correct transcription errors."""

    analysis_prompt = f"""Analyze this transcribed audio:
{context_str}

TRANSCRIPT:
{transcript}

Tasks:
1. Correct any obvious transcription errors
2. Identify key points
3. Summarize in 1-2 sentences
4. List any action items or questions"""

    return language_model(analysis_prompt)

def transcribe_audio(audio_path: str) -> str:
    """Convert audio to text using speech recognition"""
    import openai

    # Using the OpenAI Whisper API (or a similar speech-to-text service).
    # Note: this is the legacy (pre-1.0) SDK call; newer SDKs use
    # client.audio.transcriptions.create(model="whisper-1", file=f)
    with open(audio_path, 'rb') as f:
        transcript = openai.Audio.transcribe(
            model="whisper-1",
            file=f,
            language="en"  # Specify language for better accuracy
        )
    return transcript['text']
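Transcripts of domain-heavy speech often mangle product names and jargon. Before sending the transcript to the model, you can also run a cheap deterministic pass that snaps near-miss words to the expected terms from your context dict. A sketch using Python's standard-library difflib (the 0.8 similarity cutoff is a tunable assumption, not a recommendation):

```python
import difflib

def correct_known_terms(transcript: str, expected_terms: list, cutoff: float = 0.8) -> str:
    """Snap near-miss words in a transcript to known domain terms."""
    corrected = []
    for word in transcript.split():
        stripped = word.strip('.,!?;:')  # match without trailing punctuation
        match = difflib.get_close_matches(stripped, expected_terms, n=1, cutoff=cutoff)
        if stripped and match and stripped != match[0]:
            corrected.append(word.replace(stripped, match[0], 1))
        else:
            corrected.append(word)
    return ' '.join(corrected)
```

This only catches spelling-level errors ("Kuberntes" for "Kubernetes"); phonetic mishearings still need the model-side correction step in the prompt above.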
Specialized Audio Tasks
def extract_meeting_minutes(audio_path: str, attendees: list = None) -> dict:
    """Extract meeting minutes from audio"""
    transcript = transcribe_audio(audio_path)

    attendee_str = ""
    if attendees:
        attendee_str = f"Attendees: {', '.join(attendees)}\n"

    prompt = f"""Extract meeting minutes from this transcript:
{attendee_str}
TRANSCRIPT:
{transcript}

Return JSON:
{{
    "title": "...",
    "date": "YYYY-MM-DD",
    "attendees": [...],
    "decisions": [...],
    "action_items": [
        {{"task": "...", "owner": "...", "due": "..."}}
    ],
    "summary": "..."
}}"""
    return language_model(prompt)
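Prompts like the one above ask for JSON, but models frequently wrap it in prose or code fences, which breaks naive `json.loads` calls. A defensive parser keeps the pipeline from crashing on that. This is a sketch; it assumes the response contains a single top-level JSON object:

```python
import json
import re

def parse_model_json(response: str, required_keys: list) -> dict:
    """Pull the first JSON object out of a model response and check keys."""
    # Models often wrap JSON in ```json fences or surrounding prose,
    # so grab everything from the first '{' to the last '}'
    match = re.search(r'\{.*\}', response, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found in model response")
    data = json.loads(match.group(0))
    missing = [k for k in required_keys if k not in data]
    if missing:
        raise ValueError(f"response JSON missing keys: {missing}")
    return data
```

Calling `parse_model_json(language_model(prompt), ["title", "action_items"])` then either returns validated minutes or fails loudly, which is easier to retry than silently passing malformed output downstream.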
def analyze_podcast_episode(audio_path: str, topic: str = None) -> dict:
    """Analyze podcast content"""
    transcript = transcribe_audio(audio_path)
    topic_context = f"Topic: {topic}\n" if topic else ""

    prompt = f"""Analyze this podcast episode:
{topic_context}
TRANSCRIPT:
{transcript}

Provide:
1. Main topics discussed (ordered by time)
2. Key insights
3. Quotes worth highlighting
4. Guest expertise/background
5. Interesting follow-up topics"""
    return language_model(prompt)
def extract_speaker_segments(audio_path: str) -> dict:
    """Identify and extract individual speaker segments"""
    # Note: the transcript may include speaker labels if diarization is enabled
    transcript = transcribe_audio(audio_path)

    prompt = f"""This transcript may contain multiple speakers.
Identify distinct speakers and their contributions:

{transcript}

Return JSON:
{{
    "speakers": [
        {{
            "speaker_id": "Speaker 1",
            "estimated_role": "moderator/guest/host",
            "speaking_time_percent": 0.0,
            "key_points": ["...", "..."]
        }}
    ],
    "total_speakers": 0
}}

If speaker labels aren't explicit, infer from content."""
    return language_model(prompt)
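When a diarization-capable service does return timed speaker segments, you can compute `speaking_time_percent` exactly instead of asking the model to estimate it. The segment shape below (`speaker`, `start`, `end` in seconds) is an assumption; adapt it to whatever your transcription provider returns:

```python
def speaking_time_percentages(segments: list) -> dict:
    """Compute each speaker's share of total speaking time.

    segments: [{"speaker": "A", "start": 0.0, "end": 12.5}, ...]
    """
    totals = {}
    for seg in segments:
        duration = seg["end"] - seg["start"]
        totals[seg["speaker"]] = totals.get(seg["speaker"], 0.0) + duration
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    # Percentages rounded to one decimal place
    return {spk: round(100 * t / grand_total, 1) for spk, t in totals.items()}
```

You can then inject these exact figures into the prompt and reserve the model for the genuinely fuzzy parts (roles and key points).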
Video Prompting
Video is more complex than audio: there is no single conversion step equivalent to transcription that captures everything. Instead, you have two main approaches:
Approach 1: Extract Key Frames
Convert video to key frames (images) and analyze:
import json

def analyze_video_by_frames(video_path: str, num_frames: int = 5) -> dict:
    """Analyze video by extracting key frames"""
    # Step 1: Extract frames at regular intervals
    frames = extract_video_frames(video_path, num_frames)
    duration = get_video_duration(video_path)

    # Step 2: Analyze each frame
    analyses = []
    for i, frame in enumerate(frames):
        timestamp = (i / num_frames) * duration
        analysis = analyze_image(
            frame,
            f"What's happening at {timestamp:.1f} seconds into the video?"
        )
        analyses.append({
            'timestamp': timestamp,
            'frame_number': i,
            'analysis': analysis
        })

    # Step 3: Synthesize understanding
    synthesis_prompt = f"""Based on these video frames analyzed at regular intervals:

{json.dumps(analyses, indent=2)}

1. Describe the overall video content
2. What's the narrative or sequence of events?
3. What are the key changes or transitions?
4. Summarize in 2-3 sentences"""
    overall_analysis = language_model(synthesis_prompt)

    return {
        'frame_analyses': analyses,
        'overall_analysis': overall_analysis
    }
def extract_video_frames(video_path: str, num_frames: int = 5) -> list:
    """Extract key frames from video"""
    import cv2

    video = cv2.VideoCapture(video_path)
    total_frames = int(video.get(cv2.CAP_PROP_FRAME_COUNT))

    frames = []
    for i in range(num_frames):
        frame_idx = int((i / num_frames) * total_frames)
        video.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)
        ret, frame = video.read()
        if ret:
            frames.append(frame)

    video.release()
    return frames
def get_video_duration(video_path: str) -> float:
    """Get video duration in seconds"""
    import cv2

    video = cv2.VideoCapture(video_path)
    fps = video.get(cv2.CAP_PROP_FPS)
    total_frames = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
    video.release()
    return total_frames / fps
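The sampling rule inside extract_video_frames (frame i taken at fraction i/num_frames of the video) is worth isolating as a pure function, so the logic can be unit-tested without OpenCV or a real video file. A small sketch:

```python
def frame_sample_indices(total_frames: int, num_frames: int) -> list:
    """Evenly spaced frame indices matching the sampling used above:
    frame i is taken at fraction i/num_frames of the video, so the
    first frame is always included but the final frame is not."""
    if total_frames <= 0 or num_frames <= 0:
        return []
    return [int((i / num_frames) * total_frames) for i in range(num_frames)]
```

Note the asymmetry this exposes: a 100-frame video sampled 5 times yields indices 0, 20, 40, 60, 80, so the closing seconds are never seen. If endings matter for your videos (titles, calls to action), sample `i / (num_frames - 1)` instead to include the last frame.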
Approach 2: Transcribe Audio Track
Extract and transcribe the audio from video:
def analyze_video_with_audio(video_path: str) -> dict:
    """Analyze video using its audio track"""
    # Step 1: Extract audio from video
    audio_path = extract_audio_from_video(video_path)

    # Step 2: Transcribe
    transcript = transcribe_audio(audio_path)

    # Step 3: Extract key frames for visual context
    frames = extract_video_frames(video_path, num_frames=3)
    frame_analyses = [analyze_image(f, "What's shown visually?") for f in frames]

    # Step 4: Synthesize
    prompt = f"""Analyze this video using both audio transcript and visual frames:

AUDIO TRANSCRIPT:
{transcript}

VISUAL CONTEXT:
{json.dumps(frame_analyses, indent=2)}

Provide:
1. Main topic and purpose
2. Key information conveyed
3. Visual elements supporting the audio
4. Target audience
5. 1-2 minute summary"""
    analysis = language_model(prompt)

    return {
        'transcript': transcript,
        'visual_analysis': frame_analyses,
        'combined_analysis': analysis
    }
def extract_audio_from_video(video_path: str) -> str:
    """Extract audio track from video file"""
    import subprocess
    from pathlib import Path

    # Works for any container extension, not just .mp4
    audio_output = str(Path(video_path).with_suffix('.wav'))

    # Using ffmpeg: -vn drops the video stream, -n refuses to overwrite
    subprocess.run([
        'ffmpeg', '-i', video_path,
        '-vn', '-n',
        audio_output
    ], check=True)
    return audio_output
Document Prompting
Documents are often multi-page, requiring special handling:
Single-Page Documents
def analyze_single_page_document(image_or_pdf_path: str, task: str) -> dict:
    """Analyze a single document page"""
    prompt_templates = {
        'extract_all': "Extract all text from this document in reading order",
        'extract_table': "Extract any tables as JSON arrays",
        'summarize': "Summarize the main content in 2-3 sentences",
        'extract_form': "Extract form field values and their contents",
        'classify': "What type of document is this? (invoice, resume, contract, etc.)"
    }
    prompt = prompt_templates.get(task, "Analyze this document")
    return vision_model(image_or_pdf_path, prompt)
Multi-Page Documents
def analyze_multi_page_document(pdf_path: str, task: str,
                                section_name: str = None) -> dict:
    """Analyze a multi-page document"""
    # Step 1: Split PDF into pages
    pages = extract_pdf_pages(pdf_path)

    # Step 2: Analyze the first few pages for document type and structure
    sample_pages = pages[:min(3, len(pages))]
    structure_analysis = analyze_document_structure(sample_pages)

    # Step 3: Choose analysis strategy
    if task == 'extract_all':
        return extract_all_text(pages)
    elif task == 'summarize':
        return summarize_multi_page(pages)
    elif task == 'find_section':
        return find_specific_section(pages, structure_analysis, section_name)
    return None
def extract_all_text(pages: list) -> dict:
    """Extract text from all pages with page tracking"""
    extracted = {
        'pages': [],
        'total_pages': len(pages),
        'full_text': ''
    }
    for i, page_image in enumerate(pages):
        page_text = vision_model(
            page_image,
            f"Extract all text from page {i+1}. Preserve structure and formatting."
        )
        extracted['pages'].append({
            'page_number': i + 1,
            'text': page_text
        })
        extracted['full_text'] += f"\n--- Page {i+1} ---\n{page_text}"
    return extracted
def summarize_multi_page(pages: list) -> dict:
    """Create section-by-section summary"""
    summaries = []

    # Summarize each page individually first
    for i, page in enumerate(pages):
        page_summary = vision_model(
            page,
            f"Summarize page {i+1} in 1-2 sentences"
        )
        summaries.append({
            'page': i + 1,
            'summary': page_summary
        })

    # Synthesize overall summary
    synthesis_prompt = f"""Create a comprehensive summary from these page summaries:

{json.dumps(summaries, indent=2)}

Overall summary (3-5 sentences):"""
    overall = language_model(synthesis_prompt)

    return {
        'page_summaries': summaries,
        'overall_summary': overall
    }
def find_specific_section(pages: list, structure: dict,
                          section_name: str) -> dict:
    """Find and extract a specific section from a document"""
    # Step 1: Identify which pages likely contain the section
    likely_pages = []
    for i, page in enumerate(pages):
        check = vision_model(
            page,
            f"Does this page contain a section titled '{section_name}' "
            f"or related content? Answer yes/no only."
        )
        if 'yes' in check.lower():
            likely_pages.append(i)

    # Step 2: Extract content from those pages
    extracted = ""
    for page_idx in likely_pages:
        content = vision_model(
            pages[page_idx],
            f"Extract the '{section_name}' section from this page"
        )
        extracted += f"\n--- From page {page_idx + 1} ---\n{content}"

    return {
        'section_name': section_name,
        'pages_found': likely_pages,
        'content': extracted
    }
Combining Modalities
Often the most effective approach uses multiple modalities together:
def analyze_presentation(pdf_path: str, audio_path: str = None) -> dict:
    """Analyze a presentation combining slides and speaker notes/audio"""
    # Step 1: Extract slide content
    slides = analyze_multi_page_document(pdf_path, 'extract_all')

    # Step 2: Get speaker audio if available
    speaker_notes = None
    if audio_path:
        speaker_notes = transcribe_audio(audio_path)

    # Step 3: Synthesize
    synthesis_prompt = f"""Analyze this presentation:

SLIDE CONTENT:
{slides['full_text']}

{"SPEAKER NOTES:" if speaker_notes else ""}
{speaker_notes or ""}

Provide:
1. Main thesis/topic
2. Key arguments or concepts (in order)
3. Evidence or examples provided
4. Conclusion/call to action
5. Estimated length if presented"""
    analysis = language_model(synthesis_prompt)

    return {
        'slides': slides,
        'speaker_notes': speaker_notes,
        'analysis': analysis
    }
Handling Large or Complex Documents
For very large documents, you need strategic approaches:
def intelligent_document_analysis(pdf_path: str, query: str) -> dict:
    """Analyze a document selectively based on a specific query"""
    pages = extract_pdf_pages(pdf_path)

    # Step 1: Scan all pages to find relevant ones
    relevant_pages = []
    for i, page in enumerate(pages):
        relevance = vision_model(
            page,
            f"How relevant is this page to the query: '{query}'? "
            f"Answer with a score 0-10 only."
        )
        try:
            score = int(relevance.strip())
            if score >= 5:
                relevant_pages.append((i, score))
        except ValueError:
            pass  # Model didn't return a bare number; skip this page

    # Sort by relevance score
    relevant_pages.sort(key=lambda x: x[1], reverse=True)

    # Step 2: Deep analysis of the top relevant pages
    top_pages = [p[0] for p in relevant_pages[:5]]  # Top 5 most relevant

    detailed_analysis = ""
    for page_idx in top_pages:
        content = vision_model(
            pages[page_idx],
            f"Extract information relevant to: '{query}'"
        )
        detailed_analysis += f"\n[Page {page_idx + 1}]\n{content}"

    return {
        'query': query,
        'total_pages': len(pages),
        'relevant_pages': len(relevant_pages),
        'analysis': detailed_analysis
    }
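The scan step above assumes the model answers with a bare number; a response like "Score: 7/10" would fail `int()` and silently drop a relevant page. A slightly more forgiving parser is a cheap insurance policy. This is a sketch; the regex assumes scores stay in the 0-10 range the prompt requests:

```python
import re

def parse_relevance_score(response: str, default: int = 0) -> int:
    """Pull a 0-10 relevance score out of a model response.
    Handles bare numbers ("7") as well as wrapped forms
    ("Score: 7/10", "I'd rate this a 7.")."""
    # Match "10" first so it isn't read as "1" followed by "0"
    match = re.search(r'\b(10|[0-9])\b', response)
    if not match:
        return default
    return int(match.group(1))
```

Swapping `int(relevance.strip())` for `parse_relevance_score(relevance)` in the scan loop removes the need for the try/except entirely.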
Best Practices for Multimodal Documents
AUDIO/VIDEO/DOCUMENT BEST PRACTICES:

1. PREPROCESSING FIRST
   - Audio: Always transcribe before analysis
   - Video: Extract key frames OR audio
   - Documents: Convert pages to images/text

2. PRESERVE CONTEXT
   - For audio: Include speaker labels if available
   - For video: Keep frame timestamps
   - For documents: Track page numbers

3. HANDLE LENGTH INTELLIGENTLY
   - Long audio/video: Sample strategically
   - Long documents: Scan, then deep dive on relevant pages
   - Don't try to process everything at once

4. COMBINE MODALITIES
   - Audio + visual context for videos
   - Text + images for documents
   - This gives models fuller understanding

5. TEST WITH REAL CONTENT
   - Transcription quality varies by speaker/accent
   - Frame quality affects visual analysis
   - OCR errors compound in long documents

6. ACCOUNT FOR DEGRADATION
   - Old documents may be faded
   - Audio may have background noise
   - Video compression may lose detail
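Handling length intelligently often comes down to chunking: a transcript longer than the model's comfortable context size must be split into overlapping windows so sentences straddling a boundary aren't lost. A minimal word-based sketch (the default sizes are illustrative assumptions, not tuned values):

```python
def chunk_transcript(text: str, max_words: int = 1000, overlap: int = 100) -> list:
    """Split a long transcript into overlapping word windows."""
    words = text.split()
    if len(words) <= max_words:
        return [text]
    step = max(1, max_words - overlap)  # guard against overlap >= max_words
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(' '.join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reaches the end
    return chunks
```

Each chunk can then be summarized independently and the partial summaries synthesized, the same map-then-reduce pattern used in summarize_multi_page above.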
Exercise: Build Multimodal Analysis Pipeline
Create a system that analyzes different modalities:
- Audio analysis:
  - Transcribe a sample audio file
  - Extract key points and action items
  - Identify speakers if multiple

- Video analysis:
  - Extract 3-5 key frames
  - Analyze each frame
  - Synthesize overall narrative

- Document analysis:
  - Process a 3+ page PDF
  - Extract specific information
  - Create a summary

- Combined analysis:
  - Take a presentation (slides + audio)
  - Synthesize both modalities
  - Provide comprehensive analysis
Deliverables:
- Code for each modality type
- Sample input files (or descriptions)
- Output samples showing extracted/analyzed content
- Notes on quality and limitations
Summary
In this lesson, you’ve learned:
- How audio is processed (transcription + analysis)
- Strategies for video analysis (frames vs. audio)
- Multi-page document handling
- Combining multiple modalities for richer understanding
- Intelligent approaches for large documents
- Best practices for working with non-image modalities
Next, you’ll learn how to extract structured data reliably from all these modalities.