Audio Models and Speech Processing
Audio understanding spans speech-to-text (Whisper), audio embeddings, text-to-speech (TTS), and audio classification. This lesson covers Whisper for transcription, audio representation learning, and building audio systems.
Core Concepts
Whisper: Automatic Speech Recognition
Whisper is pre-trained on 680,000 hours of multilingual, multitask speech data, making it robust to background noise, accents, and varied recording conditions.
Architecture:
- Audio encoder: log-Mel spectrograms → transformer encoder
- Decoder: autoregressively generates text tokens, conditioned on the encoder output
- Multilingual: trained on 96 languages besides English; can also translate speech to English text
Audio Representations
Spectral features:
- Mel spectrogram: frequency representation warped onto the perceptual mel scale
- MFCC: Mel-frequency cepstral coefficients, a compact summary of spectral shape
- Chromagram: energy folded into the 12 pitch classes; useful for music
Learned embeddings:
- AudioMAE: Masked auto-encoding for audio
- CLAP: Contrastive learning audio-text pairs
Text-to-Speech
- Vocoder-based: text → Mel spectrogram → neural vocoder → waveform
- End-to-end: generate the audio waveform directly from text
Practical Implementation
Whisper for Speech Transcription
import torch
import librosa
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_id = "openai/whisper-base"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
).to(device)
# Transcribe
audio_file = "audio.mp3"
audio_input, sample_rate = librosa.load(audio_file, sr=16000)
inputs = processor(audio_input, sampling_rate=sample_rate, return_tensors="pt")
with torch.no_grad():
    # Match the model's weight dtype (float16 on GPU, float32 on CPU)
    predicted_ids = model.generate(inputs["input_features"].to(device, dtype=model.dtype))
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
Audio Classification
import torch
import librosa
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification
feature_extractor = AutoFeatureExtractor.from_pretrained("superb/hubert-base-superb-ks")
model = AutoModelForAudioClassification.from_pretrained("superb/hubert-base-superb-ks")
audio_input, sample_rate = librosa.load("audio.wav", sr=16000)
inputs = feature_extractor(audio_input, sampling_rate=sample_rate, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_class = torch.argmax(logits, dim=-1).item()
print(model.config.id2label[predicted_class])  # keyword-spotting label, e.g. "yes" or "stop"
Text-to-Speech
import torch
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
text = "Hello, this is a test."
inputs = processor(text=text, return_tensors="pt")
# Speaker embeddings: a random vector is only a stand-in; real systems load
# x-vectors for a target speaker (e.g. from the "Matthijs/cmu-arctic-xvectors" dataset)
speaker_embeddings = torch.randn(1, 512)
with torch.no_grad():
    speech = model.generate_speech(
        inputs["input_ids"],
        speaker_embeddings,
        vocoder=vocoder,
    )
# SpeechT5 generates 16 kHz audio
import soundfile as sf
sf.write("speech.wav", speech.numpy(), samplerate=16000)
Advanced Techniques
Audio-Language Alignment
# CLAP: Contrastive Language-Audio Pretraining
import torch
import librosa
from transformers import ClapProcessor, ClapModel
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")
model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
# Audio-text matching: the processor expects 48 kHz waveforms, not file paths
audio_files = ["dog_bark.wav", "car_engine.wav"]
audios = [librosa.load(f, sr=48000)[0] for f in audio_files]
texts = ["dog barking", "car engine sound"]
audio_inputs = processor(audios=audios, return_tensors="pt", sampling_rate=48000)
text_inputs = processor(text=texts, return_tensors="pt", padding=True)
with torch.no_grad():
    audio_embeddings = model.get_audio_features(**audio_inputs)
    text_embeddings = model.get_text_features(**text_inputs)
# Cosine similarity scaled by the usual contrastive temperature (0.07)
audio_embeddings = torch.nn.functional.normalize(audio_embeddings, dim=-1)
text_embeddings = torch.nn.functional.normalize(text_embeddings, dim=-1)
logits = torch.matmul(audio_embeddings, text_embeddings.t()) / 0.07
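Turning the similarity matrix into zero-shot classification scores only takes a softmax over the text axis. A self-contained sketch with stand-in random embeddings (no model download; in practice these come from `get_audio_features` / `get_text_features` as above):

```python
import torch
import torch.nn.functional as F

# Stand-in embeddings: 2 audio clips, 3 candidate text labels, 512-dim space
torch.manual_seed(0)
audio_embeddings = F.normalize(torch.randn(2, 512), dim=-1)
text_embeddings = F.normalize(torch.randn(3, 512), dim=-1)

# Temperature-scaled cosine similarity, as in the CLAP example
logits = audio_embeddings @ text_embeddings.t() / 0.07

# Softmax over labels → per-clip probability of each candidate description
probs = logits.softmax(dim=-1)
print(probs)  # each row sums to 1; argmax picks the best-matching label per clip
```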
Production Considerations
Real-time Speech Recognition
import queue

import numpy as np
import sounddevice as sd
from faster_whisper import WhisperModel  # pip install faster-whisper

def stream_transcriber():
    audio_queue = queue.Queue()
    model = WhisperModel("base", device="cpu", compute_type="int8")

    def audio_callback(indata, frames, time, status):
        audio_queue.put(indata.copy())

    # blocksize=32000 at 16 kHz → 2-second chunks
    with sd.InputStream(callback=audio_callback, channels=1,
                        blocksize=32000, samplerate=16000):
        while True:
            chunk = audio_queue.get()
            # faster-whisper expects a mono float32 waveform
            segments, _ = model.transcribe(chunk[:, 0].astype(np.float32), language="en")
            for segment in segments:
                print(segment.text)
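Transcribing isolated 2-second chunks clips words at chunk boundaries; a common mitigation is to accumulate chunks into a rolling window and re-transcribe the recent context. A framework-free sketch (the class name and window length are illustrative):

```python
import numpy as np
from collections import deque

class RollingAudioBuffer:
    """Accumulate fixed-size chunks into a sliding window for transcription."""

    def __init__(self, sample_rate=16000, window_seconds=10.0):
        self.max_samples = int(sample_rate * window_seconds)
        self.chunks = deque()
        self.total = 0

    def push(self, chunk):
        self.chunks.append(chunk)
        self.total += len(chunk)
        # Drop the oldest chunks once the window is full
        while self.total - len(self.chunks[0]) >= self.max_samples:
            self.total -= len(self.chunks.popleft())

    def window(self):
        return np.concatenate(self.chunks) if self.chunks else np.empty(0, dtype=np.float32)

# Usage: feed 2-second chunks, transcribe the most recent ~10 s each time
buf = RollingAudioBuffer()
for _ in range(8):
    buf.push(np.zeros(32000, dtype=np.float32))
print(buf.window().shape)  # capped at 10 s of audio: (160000,)
```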
Key Takeaway
Audio models enable speech recognition, audio understanding, and synthesis, complementing vision and language for truly multimodal systems. Whisper’s robustness and CLAP’s alignment unlock new possibilities for audio-text applications.
Practical Exercise
Task: Build an automated podcast transcription and summarization system.
Requirements:
- Transcribe audio with Whisper
- Segment by speaker changes
- Summarize each segment
- Create transcript with timestamps
- Support batch processing
Evaluation:
- Transcription WER < 8%
- Correct speaker segmentation
- Summary quality versus the original audio
- Processing speed (audio hours transcribed per wall-clock hour)
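The WER target can be checked against a reference transcript with a word-level edit distance. A minimal pure-Python sketch (libraries such as jiwer implement the same metric):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Classic Levenshtein dynamic program over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the quick brown fox", "the quick brown box"))  # 0.25
```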