

Speech and Audio ML

Machine learning for audio: mel spectrograms as 2D representations, CTC loss for sequence alignment, Whisper for speech recognition, text-to-speech synthesis, and why continuous audio signals are harder than discrete text.


Why This Matters

Audio is a continuous, high-dimensional signal. Converting it to text (speech recognition), generating it from text (synthesis), or classifying it (audio event detection) requires bridging the gap between continuous waveforms and discrete outputs. The techniques here extend to music, environmental sounds, and any time-frequency signal.

Mental Model

Raw audio is a 1D waveform sampled at 16-48 kHz. Processing it directly is possible but expensive. The standard approach converts audio to a 2D time-frequency representation (spectrogram), then applies techniques from computer vision or sequence modeling. The pipeline is: waveform to spectrogram to features to output.

Mel Spectrograms

Definition

Mel Spectrogram

Compute the Short-Time Fourier Transform (STFT) of the audio waveform to get a spectrogram (time vs. frequency). Apply a bank of triangular filters spaced according to the mel scale, which compresses higher frequencies more than lower ones, matching human auditory perception. Take the log of the filter bank energies.

The mel scale maps frequency $f$ in Hz to mel units: $m = 2595 \log_{10}(1 + f/700)$.
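For example, $f = 1000$ Hz maps to roughly $1000$ mels, while the 1000 Hz band from 7000 to 8000 Hz spans only about $140$ mels, so equal-width frequency bands are squeezed into progressively fewer mel bins as frequency increases.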

The mel spectrogram is typically an $F \times T$ matrix where $F$ is the number of mel bins (often 80-128) and $T$ is the number of time frames. This 2D representation can be processed by CNNs, transformers, or any architecture that handles grid-structured data.

Why log-scale? Human perception of loudness is approximately logarithmic. A 10x increase in energy sounds like a fixed increase in loudness. Log compression also stabilizes the dynamic range for neural network inputs.
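As a concrete sketch, the whole front-end is a few lines with librosa (the file name and the 25 ms window / 10 ms hop settings below are illustrative choices, not requirements):

```python
# pip install librosa
import librosa

y, sr = librosa.load("example_audio.wav", sr=16000)  # load and resample to 16 kHz

# 25 ms windows with a 10 ms hop and 80 mel bins: a common ASR front-end setup
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=400, hop_length=160, n_mels=80
)
log_mel = librosa.power_to_db(mel)  # log compression of the filter-bank energies

print(log_mel.shape)  # (80, T): F x T matrix of mel bins by time frames
```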

CTC Loss

The alignment problem: given an audio sequence of length $T$ and a text transcript of length $U$ where $T \gg U$, you do not know which audio frames correspond to which characters. CTC (Connectionist Temporal Classification) solves this by marginalizing over all valid alignments.

Proposition

CTC Loss via Marginalization over Alignments

Statement

Let $\pi$ be an alignment (a length-$T$ sequence over the vocabulary plus a blank symbol). Let $\mathcal{B}(\pi)$ be the function that removes blanks and collapses repeated characters. The CTC loss for target sequence $y$ is:

$$L_{\text{CTC}} = -\log \sum_{\pi : \mathcal{B}(\pi) = y} \prod_{t=1}^{T} p(\pi_t \mid x)$$

This sum over all valid alignments is computed efficiently in $O(T \cdot U)$ time using a forward-backward algorithm analogous to the HMM forward algorithm.

Intuition

CTC says: "I do not know which frames produce which characters, so I will sum over all possible ways the frames could map to the transcript." The blank token allows the model to emit nothing at frames that fall between characters. The forward-backward algorithm avoids enumerating exponentially many alignments.

Proof Sketch

Define a modified target sequence $y'$ by inserting blanks between characters and at the start/end. The forward variable $\alpha(t, s)$ is the probability of emitting the first $s$ tokens of $y'$ in the first $t$ frames. The recurrence allows transitions from $s$ to $s$ (stay) or $s$ to $s+1$ (advance), with a skip from $s$ to $s+2$ allowed when the characters at positions $s$ and $s+2$ differ. The total probability is $\alpha(T, |y'|) + \alpha(T, |y'|-1)$.
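A minimal NumPy sketch of this forward recurrence, working in the log domain for stability and assuming a non-empty target, might look like:

```python
import numpy as np

def ctc_forward_loss(log_probs, target, blank=0):
    """Minimal CTC forward pass (sketch, not numerically hardened).

    log_probs: (T, V) array of per-frame log-probabilities over the vocabulary.
    target:    list of label indices (length U >= 1), without blanks.
    Returns -log p(target | input).
    """
    T, V = log_probs.shape
    # y': interleave blanks -> [blank, y1, blank, y2, ..., yU, blank]
    ext = [blank]
    for c in target:
        ext += [c, blank]
    S = len(ext)  # 2U + 1

    # alpha[t, s]: log-prob of emitting the first s+1 tokens of y' in frames 0..t
    alpha = np.full((T, S), -np.inf)
    alpha[0, 0] = log_probs[0, ext[0]]
    alpha[0, 1] = log_probs[0, ext[1]]

    for t in range(1, T):
        for s in range(S):
            candidates = [alpha[t - 1, s]]                      # stay
            if s >= 1:
                candidates.append(alpha[t - 1, s - 1])          # advance
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(alpha[t - 1, s - 2])          # skip over a blank
            alpha[t, s] = np.logaddexp.reduce(candidates) + log_probs[t, ext[s]]

    # Valid alignments end on the last label or on the final blank.
    return -np.logaddexp(alpha[T - 1, S - 1], alpha[T - 1, S - 2])
```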

Why It Matters

CTC enabled end-to-end speech recognition by removing the need for forced alignment preprocessing. Before CTC, speech systems required frame-level alignment labels from a separate model. CTC lets you train directly from (audio, transcript) pairs.
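In practice the loss is usually taken off the shelf; a sketch using PyTorch's built-in nn.CTCLoss with random stand-in logits (blank reserved at index 0) could look like:

```python
import torch
import torch.nn as nn

T, N, C = 500, 4, 29          # frames, batch size, vocab size (including blank at index 0)
U = 20                        # target length per utterance

# Per-frame logits from an acoustic model; CTCLoss expects log-probs of shape (T, N, C)
logits = torch.randn(T, N, C, requires_grad=True)
log_probs = logits.log_softmax(dim=-1)

targets = torch.randint(1, C, (N, U))                    # label indices, 0 is the blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), U, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients flow back to the acoustic model through logits
```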

Failure Mode

CTC assumes conditional independence of frame-level outputs given the input. This means the model cannot learn output-level dependencies (e.g., that "th" is more likely than "tq" in English). CTC also cannot handle cases where the output is longer than the input. Attention-based encoder-decoder models and transducers address both limitations.

Whisper

Whisper (OpenAI, 2022) is a large-scale speech recognition model trained on 680,000 hours of weakly supervised audio-transcript pairs scraped from the internet. The architecture is a standard encoder-decoder transformer operating on mel spectrogram inputs.

Key design choices:

  • Weak supervision at scale instead of carefully labeled data
  • Multilingual training (99 languages)
  • Multitask: the same model does transcription, translation, language identification, and timestamp prediction via task-specific tokens in the decoder prompt

Whisper demonstrates that scaling data quantity (even noisy data) can compensate for lack of data quality, consistent with the broader trend in foundation models.
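A minimal transcription sketch using the open-source openai-whisper package (the model size and file path below are illustrative):

```python
# pip install openai-whisper
import whisper

model = whisper.load_model("base")              # downloads the checkpoint on first use
result = model.transcribe("example_audio.wav")  # handles resampling and log-mel features internally
print(result["text"])
```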

Text-to-Speech

Text-to-speech (TTS) inverts the speech recognition pipeline: given text, produce a waveform.

Tacotron 2 (2017): An encoder-decoder model that converts text to mel spectrograms. A separate vocoder (WaveNet or WaveGlow) converts the mel spectrogram to a waveform. The two-stage design separates the linguistic and acoustic modeling problems.
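Schematically, the two-stage interface looks like the sketch below, where acoustic_model and vocoder are hypothetical stand-ins for a trained Tacotron 2 checkpoint and a WaveNet/WaveGlow checkpoint, not a specific library API:

```python
# Two-stage TTS pipeline, schematic only; both callables are hypothetical stand-ins.
def synthesize(text, acoustic_model, vocoder, sample_rate=22050):
    mel = acoustic_model(text)   # stage 1: text -> mel spectrogram, shape (80, T_frames)
    waveform = vocoder(mel)      # stage 2: mel spectrogram -> waveform, shape (T_samples,)
    return waveform, sample_rate
```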

VITS (2021): An end-to-end model that generates waveforms directly from text using a variational autoencoder with normalizing flows and adversarial training. Eliminates the separate vocoder stage.

The core difficulty in TTS: the same text can be spoken in many valid ways (different prosody, emphasis, speaking rate). The model must choose one. This one-to-many mapping makes the problem harder than speech recognition, which is approximately many-to-one.

Audio-Language Models

Recent work extends the language model paradigm to audio. Models like AudioPaLM and Qwen-Audio process interleaved audio and text tokens, enabling tasks like audio captioning, audio question answering, and speech-to-speech translation.

The approach: encode audio as discrete tokens (using a learned audio codec like EnCodec), then train a language model over mixed audio-text sequences. This unifies speech understanding and generation in a single model.
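As a sketch of the tokenization step only, using the EnCodec checkpoints published on Hugging Face (exact API details may vary across transformers versions):

```python
# pip install transformers torch
import numpy as np
import torch
from transformers import AutoProcessor, EncodecModel

processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")
model = EncodecModel.from_pretrained("facebook/encodec_24khz")

audio = np.random.randn(24000).astype(np.float32)  # 1 second of stand-in audio at 24 kHz
inputs = processor(raw_audio=audio, sampling_rate=24000, return_tensors="pt")

with torch.no_grad():
    encoded = model.encode(inputs["input_values"])

# Discrete codebook indices: integer tokens that can be interleaved with text
# tokens and modeled by a standard language model.
print(encoded.audio_codes.shape)
```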

Common Confusions

Watch Out

Spectrograms are not images

While mel spectrograms look like 2D images and can be processed by CNNs, the axes have structurally different semantics. The time axis has causal structure (the future depends on the past); the frequency axis represents simultaneous components. Architectures that respect this asymmetry (e.g., frequency-domain convolutions with causal temporal processing) often outperform naive 2D convolutions.

Watch Out

CTC is not attention-based alignment

CTC marginalizes over monotonic alignments (earlier input frames map to earlier output tokens). Attention-based models learn soft, potentially non-monotonic alignments. For speech, where the alignment is strictly monotonic, CTC is a natural fit. For tasks like translation, where word order changes, attention is needed.

Watch Out

Word error rate is not character error rate

WER counts the fraction of words that are wrong (insertions, deletions, substitutions). CER counts character-level errors. A model with 10% WER might have 3% CER because most errors are single-character mistakes within words. Always check which metric a paper reports.
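A small sketch that makes the distinction concrete, computing both metrics with a plain Levenshtein distance over words versus characters:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance over a sequence of tokens (words or characters)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(dp[j] + 1,        # deletion (hypothesis misses a reference token)
                      dp[j - 1] + 1,    # insertion (hypothesis adds an extra token)
                      prev + (r != h))  # substitution, or free match
            prev, dp[j] = dp[j], cur
    return dp[-1]

def wer(ref, hyp):
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref, hyp):
    return edit_distance(list(ref), list(hyp)) / len(ref)

ref = "the cat sat on the mat"
hyp = "the cat sit on the mat"
print(wer(ref, hyp), cer(ref, hyp))  # 1 of 6 words wrong, but only 1 of 22 characters wrong
```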

Key Takeaways

  • Mel spectrograms convert audio to 2D time-frequency representations suited for neural networks
  • CTC marginalizes over all alignments between input frames and output tokens
  • Whisper scales weak supervision to achieve strong multilingual speech recognition
  • TTS is harder than ASR because text-to-speech is a one-to-many mapping
  • Audio-language models unify speech and text by tokenizing audio

Exercises

ExerciseCore

Problem

An audio clip is 10 seconds long, sampled at 16 kHz. The STFT uses a hop size of 10ms and produces 80-dimensional mel features. What are the dimensions of the resulting mel spectrogram?

ExerciseAdvanced

Problem

A CTC model has input length $T = 500$ frames and target length $U = 20$ characters. What is the time complexity of computing the CTC loss via the forward-backward algorithm? Why is the naive approach (enumerating all alignments) infeasible?

References

Canonical:

  • Graves et al., "Connectionist Temporal Classification" (ICML 2006)
  • Shen et al., "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions" (Tacotron 2, ICASSP 2018)

Current:

  • Radford et al., "Robust Speech Recognition via Large-Scale Weak Supervision" (Whisper, 2022)
  • Kim et al., "Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech" (VITS, ICML 2021)
  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14

Last reviewed: April 2026
