ML Methods
Speech and Audio ML
Machine learning for audio: mel spectrograms as 2D representations, CTC loss for sequence alignment, Whisper for speech recognition, text-to-speech synthesis, and why continuous audio signals are harder than discrete text.
Why This Matters
Audio is a continuous, high-dimensional signal. Converting it to text (speech recognition), generating it from text (synthesis), or classifying it (audio event detection) requires bridging the gap between continuous waveforms and discrete outputs. The techniques here extend to music, environmental sounds, and any time-frequency signal.
Mental Model
Raw audio is a 1D waveform sampled at 16-48 kHz. Processing it directly is possible but expensive. The standard approach converts audio to a 2D time-frequency representation (spectrogram), then applies techniques from computer vision or sequence modeling. The pipeline is: waveform to spectrogram to features to output.
Mel Spectrograms
Mel Spectrogram
Compute the Short-Time Fourier Transform (STFT) of the audio waveform to get a spectrogram (time vs. frequency). Apply a bank of triangular filters spaced according to the mel scale, which compresses higher frequencies more than lower ones, matching human auditory perception. Take the log of the filter bank energies.
The mel scale maps frequency $f$ in Hz to mel units: $m(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$.
The mel spectrogram is typically an $M \times T$ matrix, where $M$ is the number of mel bins (often 80-128) and $T$ is the number of time frames. This 2D representation can be processed by CNNs, transformers, or any architecture that handles grid-structured data.
Why log-scale? Human perception of loudness is approximately logarithmic. A 10x increase in energy sounds like a fixed increase in loudness. Log compression also stabilizes the dynamic range for neural network inputs.
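The pipeline above (frame, window, FFT, triangular mel filters, log) can be sketched in plain NumPy. This is an illustrative implementation, not a production feature extractor; the parameter values (`n_fft=400`, `hop=160`, i.e. a 25 ms window with 10 ms hop at 16 kHz) are common defaults, and edge handling is simplified.

```python
import numpy as np

def hz_to_mel(f):
    """Mel scale: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters with centers evenly spaced on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):                 # rising edge of triangle
            fb[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                 # falling edge of triangle
            fb[i, k] = (r - k) / max(r - c, 1)
    return fb

def log_mel_spectrogram(x, sr=16000, n_fft=400, hop=160, n_mels=80):
    """Waveform -> windowed STFT power -> mel filterbank -> log energies."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[t * hop : t * hop + n_fft] * window
                       for t in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2   # (T, n_fft//2 + 1)
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T   # (T, n_mels)
    return np.log(mel + 1e-10).T                        # (n_mels, T)

# One second of a 440 Hz tone at 16 kHz -> an 80 x 98 log-mel matrix.
t = np.arange(16000) / 16000
S = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t))
```

The `1e-10` floor before the log is the standard trick to avoid `log(0)` on silent frames.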
CTC Loss
The alignment problem: given an audio sequence of length $T$ and a text transcript of length $U$ where $U \le T$, you do not know which audio frames correspond to which characters. CTC (Connectionist Temporal Classification) solves this by marginalizing over all valid alignments.
CTC Loss via Marginalization over Alignments
Statement
Let $\pi$ be an alignment (a length-$T$ sequence over the vocabulary plus a blank symbol). Let $\mathcal{B}$ be the function that removes blanks and collapses repeated characters. The CTC loss for target sequence $y$ given input $x$ is:

$$\mathcal{L}_{\text{CTC}} = -\log p(y \mid x) = -\log \sum_{\pi \in \mathcal{B}^{-1}(y)} \prod_{t=1}^{T} p(\pi_t \mid x)$$
This sum over all valid alignments is computed efficiently in $O(TU)$ time using a forward-backward algorithm analogous to the HMM forward algorithm.
Intuition
CTC says: "I do not know which frames produce which characters, so I will sum over all possible ways the frames could map to the transcript." The blank token allows the model to emit nothing at frames that fall between characters. The forward-backward algorithm avoids enumerating exponentially many alignments.
Proof Sketch
Define a modified target sequence $y'$ of length $2U + 1$ by inserting blanks between characters and at the start/end. The forward variable $\alpha_t(s)$ is the probability of emitting the first $s$ tokens of $y'$ in the first $t$ frames. The recurrence allows transitions from $\alpha_{t-1}(s)$ (stay) or $\alpha_{t-1}(s-1)$ (advance), with a skip from $\alpha_{t-1}(s-2)$ allowed when $y'_s$ is not blank and the characters at positions $s$ and $s-2$ differ. The total probability is $p(y \mid x) = \alpha_T(2U+1) + \alpha_T(2U)$.
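The forward recurrence above can be sketched directly in log space. This is a minimal reference implementation for intuition (quadratic loops, no batching), not the optimized kernels used in practice; `blank=0` is an assumed convention.

```python
import numpy as np

def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """CTC loss via the forward algorithm.

    log_probs: (T, V) per-frame log-probabilities over the vocabulary.
    target:    label indices (no blanks), length U.
    Returns -log p(target | input), summed over all valid alignments.
    """
    T, V = log_probs.shape
    # Extended target y': blanks between characters and at both ends.
    ext = [blank]
    for c in target:
        ext += [c, blank]
    S = len(ext)  # 2U + 1

    alpha = np.full((T, S), -np.inf)   # log-domain forward variables
    alpha[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = log_probs[0, ext[1]]

    for t in range(1, T):
        for s in range(S):
            cands = [alpha[t - 1, s]]                    # stay
            if s >= 1:
                cands.append(alpha[t - 1, s - 1])        # advance
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[t - 1, s - 2])        # skip a blank
            alpha[t, s] = np.logaddexp.reduce(cands) + log_probs[t, ext[s]]

    # p(y | x) = alpha_T(2U+1) + alpha_T(2U): end on final blank or final char.
    return -np.logaddexp(alpha[T - 1, S - 1], alpha[T - 1, S - 2])
```

Sanity check: with two frames, a vocabulary of {blank, a}, uniform probability 0.5 per frame, and target "a", the valid alignments are (a,a), (a,blank), (blank,a), so $p(y \mid x) = 3 \times 0.25 = 0.75$ and the loss is $-\log 0.75$.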
Why It Matters
CTC enabled end-to-end speech recognition by removing the need for forced alignment preprocessing. Before CTC, speech systems required frame-level alignment labels from a separate model. CTC lets you train directly from (audio, transcript) pairs.
Failure Mode
CTC assumes conditional independence of frame-level outputs given the input. This means the model cannot learn output-level dependencies (e.g., that "th" is more likely than "tq" in English). CTC also cannot handle cases where the output is longer than the input. Attention-based encoder-decoder models and transducers address both limitations.
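The conditional independence assumption is what makes best-path ("greedy") decoding valid at all: taking the argmax label at each frame independently, then applying the collapse function $\mathcal{B}$. A sketch (the token ids are illustrative):

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Best-path CTC decode: per-frame argmax labels (justified only by
    CTC's frame-independence assumption), then collapse repeated labels
    and remove blanks -- the collapse function B from the loss definition."""
    out = []
    prev = None
    for label in frame_ids:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# Frames: h h _ e _ l l _ l o  ->  h e l l o
# (repeats collapse; the blank between the two l's keeps them distinct)
decoded = ctc_greedy_decode([1, 1, 0, 2, 0, 3, 3, 0, 3, 4])
```

Note that greedy decoding finds the most probable single alignment, not the most probable output sequence; beam search with prefix merging is used when that distinction matters.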
Whisper
Whisper (OpenAI, 2022) is a large-scale speech recognition model trained on 680,000 hours of weakly supervised audio-transcript pairs scraped from the internet. The architecture is a standard encoder-decoder transformer operating on mel spectrogram inputs.
Key design choices:
- Weak supervision at scale instead of carefully labeled data
- Multilingual training (99 languages)
- Multitask: the same model does transcription, translation, language identification, and timestamp prediction via task-specific tokens in the decoder prompt
Whisper demonstrates that scaling data quantity (even noisy data) can compensate for lack of data quality, consistent with the broader trend in foundation models.
Text-to-Speech
Text-to-speech (TTS) inverts the speech recognition pipeline: given text, produce a waveform.
Tacotron 2 (2017): An encoder-decoder model that converts text to mel spectrograms. A separate vocoder (WaveNet or WaveGlow) converts the mel spectrogram to a waveform. The two-stage design separates the linguistic and acoustic modeling problems.
VITS (2021): An end-to-end model that generates waveforms directly from text using a variational autoencoder with normalizing flows and adversarial training. Eliminates the separate vocoder stage.
The core difficulty in TTS: the same text can be spoken in many valid ways (different prosody, emphasis, speaking rate). The model must choose one. This one-to-many mapping makes the problem harder than speech recognition, which is approximately many-to-one.
Audio-Language Models
Recent work extends the language model paradigm to audio. Models like AudioPaLM and Qwen-Audio process interleaved audio and text tokens, enabling tasks like audio captioning, audio question answering, and speech-to-speech translation.
The approach: encode audio as discrete tokens (using a learned audio codec like EnCodec), then train a language model over mixed audio-text sequences. This unifies speech understanding and generation in a single model.
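A toy sketch of the tokenization step, assuming a single learned codebook: each frame vector is replaced by the index of its nearest codebook entry. Real codecs like EnCodec use a trained encoder and stacks of residual codebooks; the shapes and random codebook here are made up for illustration.

```python
import numpy as np

def quantize_frames(frames, codebook):
    """Map each frame vector to its nearest codebook entry's index
    (vector quantization with one codebook; a residual VQ codec
    repeats this on the quantization error with further codebooks)."""
    # frames: (T, D), codebook: (K, D) -> token ids: (T,)
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 8))       # K=1024 discrete tokens, 8-dim frames
frames = rng.normal(size=(50, 8))           # 50 encoded audio frames
tokens = quantize_frames(frames, codebook)  # integer ids a language model can consume
```

Once audio is a sequence of integer ids, the rest of the pipeline is ordinary language modeling over a vocabulary that mixes audio and text tokens.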
Common Confusions
Spectrograms are not images
While mel spectrograms look like 2D images and can be processed by CNNs, the axes have structurally different semantics. The time axis has causal structure (the future depends on the past); the frequency axis represents simultaneous components. Architectures that respect this asymmetry (e.g., frequency-domain convolutions with causal temporal processing) often outperform naive 2D convolutions.
CTC is not attention-based alignment
CTC marginalizes over monotonic alignments (earlier input frames map to earlier output tokens). Attention-based models learn soft, potentially non-monotonic alignments. For speech, where the alignment is strictly monotonic, CTC is a natural fit. For tasks like translation, where word order changes, attention is needed.
Word error rate is not character error rate
WER counts the fraction of words that are wrong (insertions, deletions, substitutions). CER counts character-level errors. A model with 10% WER might have 3% CER because most errors are single-character mistakes within words. Always check which metric a paper reports.
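Both metrics are edit distance normalized by reference length, differing only in the unit (words vs. characters). A minimal sketch:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum insertions + deletions + substitutions,
    computed with a single rolling row of the DP table."""
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev = dp[0]          # dp[i-1][j-1], the diagonal entry
        dp[0] = i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                            # deletion
                        dp[j - 1] + 1,                        # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))    # substitution/match
            prev = cur
    return dp[-1]

def wer(ref, hyp):
    """Word error rate: word-level edits / reference word count."""
    r = ref.split()
    return edit_distance(r, hyp.split()) / len(r)

def cer(ref, hyp):
    """Character error rate: character-level edits / reference character count."""
    r = ref.replace(" ", "")
    return edit_distance(r, hyp.replace(" ", "")) / len(r)

# One wrong word out of three, but only one wrong character out of nine:
# wer("the cat sat", "the cat sad") = 1/3, cer(...) = 1/9
```

This illustrates the confusion above: the same single-character mistake costs a full word in WER but only one character in CER.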
Key Takeaways
- Mel spectrograms convert audio to 2D time-frequency representations suited for neural networks
- CTC marginalizes over all alignments between input frames and output tokens
- Whisper scales weak supervision to achieve strong multilingual speech recognition
- TTS is harder than ASR because text-to-speech is a one-to-many mapping
- Audio-language models unify speech and text by tokenizing audio
Exercises
Problem
An audio clip is 10 seconds long, sampled at 16 kHz. The STFT uses a hop size of 10ms and produces 80-dimensional mel features. What are the dimensions of the resulting mel spectrogram?
Problem
A CTC model has input length $T$ frames and target length $U$ characters. What is the time complexity of computing the CTC loss via the forward-backward algorithm? Why is the naive approach (enumerating all alignments) infeasible?
References
Canonical:
- Graves et al., "Connectionist Temporal Classification" (ICML 2006)
- Shen et al., "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions" (Tacotron 2, ICASSP 2018)
Current:
- Radford et al., "Robust Speech Recognition via Large-Scale Weak Supervision" (Whisper, 2022)
- Kim et al., "Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech" (VITS, ICML 2021)
- Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Signals and Systems for ML (Layer 1)
- Recurrent Neural Networks (Layer 3)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)
Builds on This
- Audio Language Models (Layer 5)