Intermediate

Audio, Video, and Document Prompting

Lesson 2 of 4 · Estimated time: 45 min

Introduction

Images are just the beginning. Modern AI systems also work with audio, video, and documents. But unlike images, these require preprocessing before models can understand them. You can't pass a 30-minute video directly to a model; you need to break it down first.

This lesson teaches you how to work with these modalities. You’ll learn what preprocessing is necessary, how to structure prompts around multimodal content, and how to combine multiple modalities effectively.

Key Takeaway: Audio, video, and documents are fundamentally different from images. They require conversion to text (transcription) or structured extraction before you can prompt about them. The challenge is doing this conversion without losing critical context.

Audio Prompting

Audio prompting works through transcription: first you convert speech to text, then you prompt about the text.

Audio → Transcription → Text → Model → Analysis

But transcription isn't perfect. The speech model may mishear words, especially with accents, background noise, or domain-specific terminology. Your prompts must account for this.

Speech-to-Text and Context

def transcribe_and_analyze_audio(audio_path: str, context: dict = None) -> dict:
    """Transcribe audio and analyze with context"""

    # Step 1: Transcribe
    transcript = transcribe_audio(audio_path)

    # Step 2: Build analysis prompt with context
    context_str = ""
    if context:
        context_str = f"""CONTEXT:
- Speaker: {context.get('speaker', 'Unknown')}
- Topic: {context.get('topic', 'Unknown')}
- Language: {context.get('language', 'English')}
- Expected terms: {', '.join(context.get('expected_terms', []))}

Use this context to correct transcription errors."""

    analysis_prompt = f"""Analyze this transcribed audio:

{context_str}

TRANSCRIPT:
{transcript}

Tasks:
1. Correct any obvious transcription errors
2. Identify key points
3. Summarize in 1-2 sentences
4. List any action items or questions"""

    return language_model(analysis_prompt)


def transcribe_audio(audio_path: str) -> str:
    """Convert audio to text using speech recognition"""

    # Using the OpenAI Whisper API (legacy pre-1.0 client shown here;
    # openai>=1.0 uses client.audio.transcriptions.create instead)
    with open(audio_path, 'rb') as f:
        transcript = openai.Audio.transcribe(
            model="whisper-1",
            file=f,
            language="en"  # Specify language for better accuracy
        )

    return transcript['text']

Specialized Audio Tasks

def extract_meeting_minutes(audio_path: str, attendees: list = None) -> dict:
    """Extract meeting minutes from audio"""

    transcript = transcribe_audio(audio_path)

    attendee_str = ""
    if attendees:
        attendee_str = f"Attendees: {', '.join(attendees)}\n"

    prompt = f"""Extract meeting minutes from this transcript:

{attendee_str}
TRANSCRIPT:
{transcript}

Return JSON:
{{
  "title": "...",
  "date": "YYYY-MM-DD",
  "attendees": [...],
  "decisions": [...],
  "action_items": [
    {{"task": "...", "owner": "...", "due": "..."}}
  ],
  "summary": "..."
}}"""

    return language_model(prompt)


def analyze_podcast_episode(audio_path: str, topic: str = None) -> dict:
    """Analyze podcast content"""

    transcript = transcribe_audio(audio_path)

    topic_context = f"Topic: {topic}\n" if topic else ""

    prompt = f"""Analyze this podcast episode:

{topic_context}
TRANSCRIPT:
{transcript}

Provide:
1. Main topics discussed (ordered by time)
2. Key insights
3. Quotes worth highlighting
4. Guest expertise/background
5. Interesting follow-up topics"""

    return language_model(prompt)


def extract_speaker_segments(audio_path: str) -> dict:
    """Identify and extract individual speaker segments"""

    # Note: Transcription might include speaker labels (if diarization is enabled)
    transcript = transcribe_audio(audio_path)

    prompt = f"""This transcript may contain multiple speakers.
Identify distinct speakers and their contributions:

{transcript}

Return JSON:
{{
  "speakers": [
    {{
      "speaker_id": "Speaker 1",
      "estimated_role": "moderator/guest/host",
      "speaking_time_percent": 0.0,
      "key_points": ["...", "..."]
    }}
  ],
  "total_speakers": 0
}}

If speaker labels aren't explicit, infer from content."""

    return language_model(prompt)
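
Prompts like the one above ask for strict JSON, but models sometimes wrap the object in prose or a markdown code fence. A tolerant parser helps; this is a sketch (the fallback dict shape is an assumption, not any library's API):

```python
import json
import re

def parse_json_response(response: str) -> dict:
    """Extract the first JSON object from a model response,
    tolerating surrounding prose."""

    # Easy case first: the response is pure JSON
    try:
        return json.loads(response)
    except json.JSONDecodeError:
        pass

    # Fall back to grabbing the outermost {...} span
    match = re.search(r'\{.*\}', response, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass

    # Nothing parseable: return the raw text for inspection
    return {'parse_error': True, 'raw': response}
```

Wrapping every `language_model` call that expects JSON with a parser like this prevents one malformed response from crashing a whole pipeline.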

Video Prompting

Video is more complex than audio: it combines a visual track with an audio track, so there is no single transcription step. You have two main approaches:

Approach 1: Extract Key Frames

Convert video to key frames (images) and analyze:

from pathlib import Path
import json

def analyze_video_by_frames(video_path: str, num_frames: int = 5) -> dict:
    """Analyze video by extracting key frames"""

    # Step 1: Extract frames at regular intervals
    frames = extract_video_frames(video_path, num_frames)
    duration = get_video_duration(video_path)  # Compute once, not per frame

    # Step 2: Analyze each frame
    analyses = []
    for i, frame in enumerate(frames):
        timestamp = (i / num_frames) * duration

        analysis = analyze_image(
            frame,
            f"What's happening at {timestamp:.1f} seconds into the video?"
        )

        analyses.append({
            'timestamp': timestamp,
            'frame_number': i,
            'analysis': analysis
        })

    # Step 3: Synthesize understanding
    synthesis_prompt = f"""Based on these video frames analyzed at regular intervals:

{json.dumps(analyses, indent=2)}

1. Describe the overall video content
2. What's the narrative or sequence of events?
3. What are the key changes or transitions?
4. Summarize in 2-3 sentences"""

    overall_analysis = language_model(synthesis_prompt)

    return {
        'frame_analyses': analyses,
        'overall_analysis': overall_analysis
    }


def extract_video_frames(video_path: str, num_frames: int = 5) -> list:
    """Extract key frames from video"""

    import cv2

    video = cv2.VideoCapture(video_path)
    total_frames = int(video.get(cv2.CAP_PROP_FRAME_COUNT))

    frames = []
    for i in range(num_frames):
        frame_idx = int((i / num_frames) * total_frames)
        video.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)

        ret, frame = video.read()
        if ret:
            frames.append(frame)

    video.release()
    return frames


def get_video_duration(video_path: str) -> float:
    """Get video duration in seconds"""

    import cv2

    video = cv2.VideoCapture(video_path)
    fps = video.get(cv2.CAP_PROP_FPS)
    total_frames = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
    duration = total_frames / fps

    video.release()
    return duration
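
Timestamps expressed as raw seconds (e.g. 83.333333) read poorly in prompts and reports once videos get long. A small formatter keeps frame timestamps human-readable; a minimal sketch:

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as MM:SS (or H:MM:SS for long videos)."""

    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)

    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes:02d}:{secs:02d}"
```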

Approach 2: Transcribe Audio Track

Extract and transcribe the audio from video:

def analyze_video_with_audio(video_path: str) -> dict:
    """Analyze video using its audio track"""

    # Step 1: Extract audio from video
    audio_path = extract_audio_from_video(video_path)

    # Step 2: Transcribe
    transcript = transcribe_audio(audio_path)

    # Step 3: Extract key frames for visual context
    frames = extract_video_frames(video_path, num_frames=3)
    frame_analyses = [analyze_image(f, "What's shown visually?") for f in frames]

    # Step 4: Synthesize
    prompt = f"""Analyze this video using both audio transcript and visual frames:

AUDIO TRANSCRIPT:
{transcript}

VISUAL CONTEXT:
{json.dumps(frame_analyses, indent=2)}

Provide:
1. Main topic and purpose
2. Key information conveyed
3. Visual elements supporting the audio
4. Target audience
5. 1-2 minute summary"""

    analysis = language_model(prompt)

    return {
        'transcript': transcript,
        'visual_analysis': frame_analyses,
        'combined_analysis': analysis
    }


def extract_audio_from_video(video_path: str) -> str:
    """Extract audio track from video file"""

    import os
    import subprocess

    # Works for any container format, not just .mp4
    audio_output = os.path.splitext(video_path)[0] + '.wav'

    # Using ffmpeg: drop the video stream, downmix to mono 16 kHz
    # (a common input format for speech recognition)
    subprocess.run([
        'ffmpeg', '-i', video_path,
        '-vn', '-ac', '1', '-ar', '16000',
        '-n',  # Don't overwrite an existing file
        audio_output
    ], check=True)

    return audio_output

Document Prompting

Documents are often multi-page, requiring special handling:

Single-Page Documents

def analyze_single_page_document(image_or_pdf_path: str, task: str) -> dict:
    """Analyze a single document page"""

    prompt_templates = {
        'extract_all': "Extract all text from this document in reading order",
        'extract_table': "Extract any tables as JSON arrays",
        'summarize': "Summarize the main content in 2-3 sentences",
        'extract_form': "Extract form field values and their contents",
        'classify': "What type of document is this? (invoice, resume, contract, etc.)"
    }

    prompt = prompt_templates.get(task, "Analyze this document")

    return vision_model(image_or_pdf_path, prompt)

Multi-Page Documents

def analyze_multi_page_document(pdf_path: str, task: str,
                                section_name: str = None) -> dict:
    """Analyze a multi-page document"""

    # Step 1: Split PDF into pages
    pages = extract_pdf_pages(pdf_path)

    # Step 2: Analyze first few pages for document type and structure
    sample_pages = pages[:min(3, len(pages))]
    structure_analysis = analyze_document_structure(sample_pages)

    # Step 3: Choose analysis strategy
    if task == 'extract_all':
        return extract_all_text(pages)
    elif task == 'summarize':
        return summarize_multi_page(pages)
    elif task == 'find_section':
        return find_specific_section(pages, structure_analysis, section_name)

    return None
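
The code in this section assumes an extract_pdf_pages helper. One possible implementation renders each page to an image with pdf2image (assumptions: pdf2image and its poppler dependency are installed; PyMuPDF would work equally well). A small batching helper is included for documents too long to process in one pass:

```python
def extract_pdf_pages(pdf_path: str, dpi: int = 150) -> list:
    """Render each PDF page to an image for vision-model analysis."""

    # Imported lazily so the rest of the module works without pdf2image
    from pdf2image import convert_from_path

    return convert_from_path(pdf_path, dpi=dpi)


def batch_pages(pages: list, batch_size: int = 5) -> list:
    """Split pages into fixed-size batches for incremental processing."""

    return [pages[i:i + batch_size] for i in range(0, len(pages), batch_size)]
```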


def extract_all_text(pages: list) -> dict:
    """Extract text from all pages with page tracking"""

    extracted = {
        'pages': [],
        'total_pages': len(pages),
        'full_text': ''
    }

    for i, page_image in enumerate(pages):
        page_text = vision_model(
            page_image,
            f"Extract all text from page {i+1}. Preserve structure and formatting."
        )

        extracted['pages'].append({
            'page_number': i + 1,
            'text': page_text
        })

        extracted['full_text'] += f"\n--- Page {i+1} ---\n{page_text}"

    return extracted


def summarize_multi_page(pages: list) -> dict:
    """Create section-by-section summary"""

    summaries = []

    # Summarize each page individually first
    for i, page in enumerate(pages):
        page_summary = vision_model(
            page,
            f"Summarize page {i+1} in 1-2 sentences"
        )

        summaries.append({
            'page': i + 1,
            'summary': page_summary
        })

    # Synthesize overall summary
    synthesis_prompt = f"""Create a comprehensive summary from these page summaries:

{json.dumps(summaries, indent=2)}

Overall summary (3-5 sentences):"""

    overall = language_model(synthesis_prompt)

    return {
        'page_summaries': summaries,
        'overall_summary': overall
    }
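
Page-by-page summarization works because PDF pages are natural chunks. For a long transcript, the analogous move is splitting the text into overlapping chunks before summarizing each one; a sketch (the chunk sizes are illustrative):

```python
def chunk_text(text: str, chunk_size: int = 4000, overlap: int = 200) -> list:
    """Split text into overlapping chunks so context isn't lost at boundaries."""

    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")

    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```

Each chunk can then be summarized individually and the partial summaries synthesized, exactly as summarize_multi_page does for pages.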


def find_specific_section(pages: list, structure: dict,
                         section_name: str) -> dict:
    """Find and extract a specific section from document"""

    # Step 1: Identify which pages likely contain the section
    likely_pages = []

    for i, page in enumerate(pages):
        check = vision_model(
            page,
            f"Does this page contain a section titled '{section_name}' or related content? Answer yes/no only."
        )

        if 'yes' in check.lower():
            likely_pages.append(i)

    # Step 2: Extract content from those pages
    extracted = ""
    for page_idx in likely_pages:
        content = vision_model(
            pages[page_idx],
            f"Extract the '{section_name}' section from this page"
        )

        extracted += f"\n--- From page {page_idx + 1} ---\n{content}"

    return {
        'section_name': section_name,
        'pages_found': likely_pages,
        'content': extracted
    }

Combining Modalities

Often the most effective approach uses multiple modalities together:

def analyze_presentation(pdf_path: str, audio_path: str = None) -> dict:
    """Analyze a presentation combining slides and speaker notes/audio"""

    # Step 1: Extract slide content
    slides = analyze_multi_page_document(pdf_path, 'extract_all')

    # Step 2: Get speaker audio if available
    speaker_notes = None
    if audio_path:
        speaker_notes = transcribe_audio(audio_path)

    # Step 3: Synthesize
    synthesis_prompt = f"""Analyze this presentation:

SLIDE CONTENT:
{slides['full_text']}

{"SPEAKER NOTES:" if speaker_notes else ""}
{speaker_notes or ""}

Provide:
1. Main thesis/topic
2. Key arguments or concepts (in order)
3. Evidence or examples provided
4. Conclusion/call to action
5. Estimated length if presented"""

    analysis = language_model(synthesis_prompt)

    return {
        'slides': slides,
        'speaker_notes': speaker_notes,
        'analysis': analysis
    }

Handling Large or Complex Documents

For very large documents, you need strategic approaches:

def intelligent_document_analysis(pdf_path: str, query: str) -> dict:
    """Analyze document intelligently based on specific query"""

    pages = extract_pdf_pages(pdf_path)

    # Step 1: Scan all pages to find relevant ones
    relevant_pages = []

    for i, page in enumerate(pages):
        relevance = vision_model(
            page,
            f"How relevant is this page to the query: '{query}'? Answer with a score 0-10 only."
        )

        try:
            score = int(relevance.strip())
            if score >= 5:
                relevant_pages.append((i, score))
        except ValueError:
            continue  # Model didn't return a bare number; skip this page

    # Sort by relevance score
    relevant_pages.sort(key=lambda x: x[1], reverse=True)

    # Step 2: Deep analysis of top relevant pages
    top_pages = [p[0] for p in relevant_pages[:5]]  # Top 5 most relevant

    detailed_analysis = ""
    for page_idx in top_pages:
        content = vision_model(
            pages[page_idx],
            f"Extract information relevant to: '{query}'"
        )

        detailed_analysis += f"\n[Page {page_idx + 1}]\n{content}"

    return {
        'query': query,
        'total_pages': len(pages),
        'relevant_pages': len(relevant_pages),
        'analysis': detailed_analysis
    }
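
The int(relevance.strip()) parse above is brittle: it fails if the model answers "Score: 7" or "7/10" instead of a bare number. A more forgiving parser is sketched below (the clamping to the 0-10 scale is a design choice, not part of any API):

```python
import re

def parse_relevance_score(response: str, default: int = 0) -> int:
    """Pull the first integer out of a model response like
    '7', 'Score: 7', or '7/10'; fall back to a default otherwise."""

    match = re.search(r'\d+', response)
    if not match:
        return default

    # Clamp to the 0-10 scale the prompt asked for
    return max(0, min(10, int(match.group(0))))
```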

Best Practices for Multimodal Documents

AUDIO/VIDEO/DOCUMENT BEST PRACTICES:

1. PREPROCESSING FIRST
   - Audio: Always transcribe before analysis
   - Video: Extract key frames OR audio
   - Documents: Convert pages to images/text

2. PRESERVE CONTEXT
   - For audio: Include speaker labels if available
   - For video: Keep frame timestamps
   - For documents: Track page numbers

3. HANDLE LENGTH INTELLIGENTLY
   - Long audio/video: Sample strategically
   - Long documents: Scan, then deep dive on relevant pages
   - Don't try to process everything at once

4. COMBINE MODALITIES
   - Audio + visual context for videos
   - Text + images for documents
   - This gives models fuller understanding

5. TEST WITH REAL CONTENT
   - Transcription quality varies by speaker/accent
   - Frame quality affects visual analysis
   - OCR errors compound in long documents

6. ACCOUNT FOR DEGRADATION
   - Old documents may be faded
   - Audio may have background noise
   - Video compression may lose detail
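
The "preprocessing first" rule can be encoded directly: route each file to the right preprocessing path by extension before any prompting happens. A minimal sketch (the extension sets are illustrative, not exhaustive):

```python
from pathlib import Path

AUDIO_EXTS = {'.mp3', '.wav', '.m4a', '.flac'}
VIDEO_EXTS = {'.mp4', '.mov', '.avi', '.mkv'}
DOC_EXTS = {'.pdf', '.png', '.jpg', '.jpeg', '.tiff'}

def detect_modality(file_path: str) -> str:
    """Classify a file as audio, video, or document by extension."""

    ext = Path(file_path).suffix.lower()
    if ext in AUDIO_EXTS:
        return 'audio'
    if ext in VIDEO_EXTS:
        return 'video'
    if ext in DOC_EXTS:
        return 'document'
    return 'unknown'
```

A dispatcher built on this can then call the appropriate pipeline: transcription for audio, frame extraction for video, page splitting for documents.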

Exercise: Build Multimodal Analysis Pipeline

Create a system that analyzes different modalities:

  1. Audio analysis:
    • Transcribe a sample audio file
    • Extract key points and action items
    • Identify speakers if multiple
  2. Video analysis:
    • Extract 3-5 key frames
    • Analyze each frame
    • Synthesize overall narrative
  3. Document analysis:
    • Process a 3+ page PDF
    • Extract specific information
    • Create a summary
  4. Combined analysis:
    • Take a presentation (slides + audio)
    • Synthesize both modalities
    • Provide comprehensive analysis

Deliverables:

  • Code for each modality type
  • Sample input files (or descriptions)
  • Output samples showing extracted/analyzed content
  • Notes on quality and limitations

Summary

In this lesson, you’ve learned:

  • How audio is processed (transcription + analysis)
  • Strategies for video analysis (frames vs. audio)
  • Multi-page document handling
  • Combining multiple modalities for richer understanding
  • Intelligent approaches for large documents
  • Best practices for working with non-image modalities

Next, you’ll learn how to extract structured data reliably from all these modalities.