Beyond LLMs
Audio Language Models
Models that process and generate speech alongside text: audio tokenization, Whisper for transcription, end-to-end voice models, music generation, and the audio-language frontier.
Why This Matters
Vision was the first modality to be fused with language models. Audio is the second, and it is arguably harder. Audio is inherently sequential at a much finer time resolution than text (16,000 samples per second vs. a few tokens per second), it carries both semantic content (what was said) and paralinguistic information (how it was said: tone, emotion, speaker identity), and it must be processed in real-time for conversational applications. Audio language models are the technology behind voice assistants that can hold natural conversations.
Mental Model
There are two architectures for combining audio and language:
- Pipeline: speech-to-text (ASR), then LLM, then text-to-speech (TTS). Each module is optimized separately. Latency is the sum of three stages.
- End-to-end: a single model that takes audio tokens as input and produces audio tokens as output, with text as an intermediate or auxiliary modality. Lower latency, but harder to train and debug.
The field is moving from pipeline to end-to-end, following the same trajectory that machine translation followed from phrase-based to neural.
Audio Tokenization
The core challenge: audio is a continuous, high-dimensional signal. Language models operate on discrete tokens. The bridge is audio tokenization.
Audio Tokenization
Audio tokenization converts a continuous waveform into a sequence of discrete tokens from a finite vocabulary. Two main approaches:
- Semantic tokens: encode linguistic content. Trained with a self-supervised model (e.g., HuBERT, w2v-BERT) that learns to cluster audio frames into discrete units. These capture what was said, discarding speaker and acoustic details.
- Acoustic tokens: encode full audio including speaker identity, prosody, and acoustic environment. Neural audio codecs (EnCodec, SoundStream, DAC) learn to compress audio into discrete codes that can reconstruct the waveform.
Most systems use both: semantic tokens for understanding, acoustic tokens for generation.
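The stacked codebooks of acoustic codecs like EnCodec use residual vector quantization (RVQ): each codebook quantizes the residual left over by the previous one. A minimal numpy sketch of the idea, using toy random codebooks (real codecs learn the codebooks jointly with an encoder/decoder network):

```python
import numpy as np

def rvq_encode(frame, codebooks):
    """Residual vector quantization: each codebook quantizes the
    residual left over by the previous one. Returns one discrete
    code index per codebook."""
    residual = frame.astype(float)
    codes = []
    for cb in codebooks:                 # cb: (V, D) array of centroids
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))      # nearest centroid
        codes.append(idx)
        residual = residual - cb[idx]    # pass on what remains
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct the frame by summing the selected centroids."""
    return sum(cb[i] for cb, i in zip(codebooks, codes))

# Toy demo: 4 codebooks of 1024 centroids over 8-dim frames.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(1024, 8)) for _ in range(4)]
frame = rng.normal(size=8)
codes = rvq_encode(frame, codebooks)     # e.g. 4 integers in [0, 1024)
recon = rvq_decode(codes, codebooks)
```

Each additional codebook refines the reconstruction, which is why acoustic codecs trade longer token sequences for higher fidelity.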
Audio Token Rate-Quality Tradeoff
Statement
For a neural audio codec operating at r tokens per second with vocabulary size V, the information rate is R = r · log₂(V) bits per second (multiplied by the number of codebooks when several are stacked). Audio reconstruction quality increases monotonically with information rate, but with diminishing returns. Typical operating points:
- Semantic tokens: 25-50 tokens/sec, V = 500-2000. Low bitrate, captures meaning only
- Acoustic tokens: 50-75 tokens/sec per codebook, V = 1024, with 4-8 codebooks stacked. High bitrate, near-lossless reconstruction
The tradeoff: fewer tokens per second means shorter sequences for the language model (lower computational cost), but lower reconstruction quality.
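The bitrate and sequence-length arithmetic is worth making concrete. A small helper (the function name is illustrative) for the two operating points above:

```python
import math

def codec_stats(tokens_per_sec, vocab_size, n_codebooks=1, clip_seconds=30):
    """Bitrate and LM sequence length implied by a codec operating point."""
    bits_per_sec = tokens_per_sec * n_codebooks * math.log2(vocab_size)
    seq_len = tokens_per_sec * n_codebooks * clip_seconds
    return bits_per_sec, seq_len

# Semantic tokens: 50 tok/s, V = 1000, single codebook
print(codec_stats(50, 1000))                  # ~498 bits/s, 1500 tokens per 30 s
# Acoustic tokens: 50 tok/s per codebook, V = 1024, 8 stacked codebooks
print(codec_stats(50, 1024, n_codebooks=8))   # 4000 bits/s, 12000 tokens per 30 s
```

An 8x difference in bitrate becomes an 8x difference in the sequence length the language model must attend over.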
Intuition
Semantic tokens are like a transcript: they capture the words but not the voice. Acoustic tokens are like a compressed audio file: they capture everything but produce longer sequences. A language model can reason about 50 semantic tokens per second of audio, but 400 acoustic tokens per second (8 codebooks times 50) is a much harder sequence modeling problem.
Why It Matters
The token rate directly determines the sequence length the language model must process. A 30-second audio clip at 50 tokens/sec produces 1,500 tokens. At 400 tokens/sec (multi-codebook acoustic), it produces 12,000 tokens. Context window limits and quadratic attention cost make this tradeoff critical for system design.
Failure Mode
Aggressive compression (very few tokens/sec) loses information that cannot be recovered. Semantic tokens discard speaker identity, making voice cloning impossible from semantic tokens alone. Acoustic tokens at very low bitrates introduce audible artifacts (metallic, robotic quality).
Whisper and Transcription
The Whisper family (OpenAI, 2022-2024) established the current standard for speech recognition. Key design choices:
- Encoder-decoder transformer: audio spectrogram input, text output
- Massive multitask training: trained on 680,000 hours of labeled audio for transcription, translation, language identification, and timestamp prediction
- Weak supervision: training data comes from the internet, not hand-labeled corpora. Quality is lower per-example but quantity is vastly larger
Whisper demonstrated that scaling data (even noisy data) beats careful curation for ASR. Whisper-large-v3 achieves word error rates competitive with human transcribers on many benchmarks.
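Whisper's audio front end has a fixed geometry: per the original paper, 16 kHz audio is padded or trimmed to 30-second chunks and converted to an 80-channel log-mel spectrogram with a 25 ms window and 10 ms hop (large-v3 later moved to 128 mel channels). The arithmetic below shows why every chunk yields the same encoder sequence length:

```python
SAMPLE_RATE = 16_000
CHUNK_SECONDS = 30
HOP_SAMPLES = 160            # 10 ms hop at 16 kHz

frames = CHUNK_SECONDS * SAMPLE_RATE // HOP_SAMPLES
print(frames)                # 3000 spectrogram frames per 30 s chunk
# A stride-2 convolution at the encoder input halves this to the
# sequence length the transformer actually attends over.
print(frames // 2)           # 1500
```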
Pipeline vs End-to-End Voice Models
Pipeline vs End-to-End Latency Decomposition
Statement
A three-stage pipeline (ASR + LLM + TTS) has total latency:

T_pipeline = T_ASR + T_LLM + T_TTS

where each term includes both computation time and any buffering delay. An end-to-end model processes audio tokens directly, so its latency is a single term:

T_end-to-end = T_model

In practice, T_pipeline ≈ 0.6-2 seconds (ASR: 200-500ms, LLM first token: 200-1000ms, TTS first audio: 200-500ms). End-to-end models can achieve latencies of a few hundred milliseconds by eliminating the serialization overhead.
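The serialization cost can be checked in a few lines, using the per-stage first-output latencies quoted in the text:

```python
def pipeline_time_to_first_audio(t_asr, t_llm_first, t_tts_first):
    """Stages serialize: each must emit output before the next can start,
    so time to first audio is the sum of first-output latencies (ms)."""
    return t_asr + t_llm_first + t_tts_first

lo = pipeline_time_to_first_audio(200, 200, 200)    # best case: 600 ms
hi = pipeline_time_to_first_audio(500, 1000, 500)   # worst case: 2000 ms
print(lo, hi)
```

Even the best case is three times the ~200 ms human turn-taking gap, which is the motivation for collapsing the three stages into one model.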
Intuition
In a pipeline, the LLM cannot start until ASR finishes, and TTS cannot start until the LLM produces text. End-to-end models avoid this by processing audio tokens natively, eliminating the ASR-to-text and text-to-TTS bottlenecks. The tradeoff is that end-to-end models are harder to debug (you cannot inspect the intermediate transcript) and harder to train (the model must learn ASR, reasoning, and TTS jointly).
Why It Matters
Conversational AI requires sub-second response times to feel natural. Human turn-taking gaps average about 200ms. Pipeline latencies of 2-3 seconds create an unnatural pause. End-to-end models are the path to natural conversational interaction.
Failure Mode
End-to-end models can hallucinate audio: generating fluent-sounding speech that does not match the intended content. In a pipeline, the text output of the LLM can be inspected and filtered before TTS. End-to-end models lack this safety checkpoint.
Notable End-to-End Systems
- AudioPaLM (Google, 2023): fuses PaLM text capabilities with audio understanding. Uses both semantic and acoustic tokens
- SeamlessM4T (Meta, 2023): multilingual speech-to-speech translation without intermediate text
- GPT-4o audio mode (OpenAI, 2024): native audio input/output with a single multimodal model
Music Generation
Music generation applies the same audio tokenization approach to a different domain. Key systems:
- MusicLM (Google, 2023): generates music from text descriptions. Uses a hierarchy of semantic and acoustic tokens with a cascaded generation approach
- MusicGen (Meta, 2023): single-stage transformer generating audio codec tokens conditioned on text or melody
- Suno (2023-2024): commercial system generating full songs (vocals + instruments) from text prompts
The music domain has unique challenges: long-range structure (a song has verses, choruses, bridges), multiple simultaneous instruments, and subjective quality evaluation (there is no "ground truth" for a creative task).
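MusicGen's single-stage design hinges on interleaving the stacked codebooks with a delay pattern: codebook k is shifted right by k steps, so one transformer step predicts one token per codebook while codebook k still conditions on codebooks 0..k-1 for the same audio frame. A minimal sketch (the pad value is illustrative):

```python
def apply_delay_pattern(codes, pad=-1):
    """Shift codebook k right by k steps, padding the gaps.
    `codes` is a list of K equal-length token lists; returns K
    shifted lists of length T + K - 1."""
    K = len(codes)
    return [[pad] * k + codes[k] + [pad] * (K - 1 - k) for k in range(K)]

# 3 codebooks, 4 audio frames
codes = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12]]
for row in apply_delay_pattern(codes):
    print(row)
# [1, 2, 3, 4, -1, -1]
# [-1, 5, 6, 7, 8, -1]
# [-1, -1, 9, 10, 11, 12]
```

Reading the shifted grid column by column, each generation step emits one token per codebook, avoiding the cascaded multi-stage generation that MusicLM uses.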
Common Confusions
Audio tokens are not phonemes
Phonemes are linguistic units defined by human linguists. Audio tokens are learned representations that may or may not correspond to phonemes. Semantic tokens from HuBERT tend to cluster around phoneme-like units, but acoustic tokens encode much more: pitch, speaker timbre, room acoustics, background noise.
End-to-end does not mean no text at all
Most end-to-end audio language models still use text as an intermediate representation or training signal. "End-to-end" means the model can process audio directly without requiring a separate ASR module at inference time, not that text plays no role in the architecture or training.
Audio generation quality is not just about the language model
The neural audio codec (vocoder) quality determines the ceiling for generated audio quality. A perfect language model generating tokens from a poor codec will produce poor audio. Advances in audio codecs (EnCodec, DAC, Vocos) directly improve the quality of all downstream audio generation systems.
Summary
- Audio tokenization bridges continuous audio and discrete language models
- Semantic tokens capture meaning; acoustic tokens capture full audio fidelity
- Token rate determines sequence length, which determines computational cost
- Pipeline (ASR + LLM + TTS) is modular but slow; end-to-end is fast but harder to train and debug
- Whisper showed that scaling noisy data beats curating clean data for ASR
- Music generation uses the same tokenization framework with longer-range structure requirements
Exercises
Problem
A neural audio codec operates at 50 tokens per second with 4 codebooks, each with vocabulary size 1024. What is the total bitrate? How many tokens does a 1-minute audio clip produce?
Problem
A voice assistant uses a pipeline architecture with ASR (300ms), LLM (500ms to first token, 30ms per subsequent token, average response 50 tokens), and TTS (200ms to first audio chunk). What is the total time to first audio output? How does this compare to the 200ms human conversational turn-taking gap?
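After attempting both exercises, you can check your work with a few lines. Note that exercise 2 has two readings, depending on whether TTS must wait for the full response or can start streaming from the first token:

```python
import math

# Exercise 1: 50 tok/s, 4 codebooks, V = 1024
bits_per_sec = 50 * 4 * math.log2(1024)     # 2000 bits/s
tokens_per_min = 50 * 4 * 60                # 12000 tokens per minute

# Exercise 2: time to first audio (ms)
asr, llm_first, per_tok, n_tok, tts_first = 300, 500, 30, 50, 200
wait_full = asr + llm_first + per_tok * (n_tok - 1) + tts_first  # TTS waits for full text
streaming = asr + llm_first + tts_first                          # TTS streams from first token
print(bits_per_sec, tokens_per_min, wait_full, streaming)
```

Either way, the result is well above the 200 ms human turn-taking gap from the text.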
References
Canonical:
- Radford et al., "Robust Speech Recognition via Large-Scale Weak Supervision" (2022). Whisper
- Borsos et al., "AudioLM: A Language Modeling Approach to Audio Generation" (2023)
Current:
- Rubenstein et al., "AudioPaLM: A Large Language Model That Can Speak and Listen" (2023)
- Copet et al., "Simple and Controllable Music Generation" (2023). MusicGen
- Defossez et al., "High Fidelity Neural Audio Compression" (2022). EnCodec
Next Topics
- Multimodal RAG: retrieval-augmented generation across text, audio, and vision
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Speech and Audio ML (Layer 3)
- Signals and Systems for ML (Layer 1)
- Recurrent Neural Networks (Layer 3)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)
- Transformer Architecture (Layer 4)
- Attention Mechanism Theory (Layer 4)
- Softmax and Numerical Stability (Layer 1)