Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.

Beyond LLMs

Audio Language Models

Models that process and generate speech alongside text: audio tokenization, Whisper for transcription, end-to-end voice models, music generation, and the audio-language frontier.

Advanced · Tier 3 · Frontier · ~50 min

Why This Matters

Vision was the first modality to be fused with language models. Audio is the second, and it is arguably harder. Audio is inherently sequential at a much finer time resolution than text (16,000 samples per second vs. a few tokens per second), it carries both semantic content (what was said) and paralinguistic information (how it was said: tone, emotion, speaker identity), and it must be processed in real time for conversational applications. Audio language models are the technology behind voice assistants that can hold natural conversations.

Mental Model

There are two architectures for combining audio and language:

  1. Pipeline: speech-to-text (ASR), then LLM, then text-to-speech (TTS). Each module is optimized separately. Latency is the sum of three stages.
  2. End-to-end: a single model that takes audio tokens as input and produces audio tokens as output, with text as an intermediate or auxiliary modality. Lower latency, but harder to train and debug.

The field is moving from pipeline to end-to-end, following the same trajectory that machine translation followed from phrase-based to neural.

Audio Tokenization

The core challenge: audio is a continuous, high-dimensional signal. Language models operate on discrete tokens. The bridge is audio tokenization.
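To see why learned tokenizers are needed, consider the crudest possible discretization: quantize every raw sample directly. The sketch below is a toy NumPy illustration of 8-bit mu-law companding (the quantization scheme WaveNet modeled), not a codec used by any system described here:

```python
import numpy as np

def mulaw_tokenize(waveform, mu=255):
    """Map samples in [-1, 1] to integer tokens in [0, 255] via
    8-bit mu-law companding: log-compress, then uniformly quantize."""
    compressed = np.sign(waveform) * np.log1p(mu * np.abs(waveform)) / np.log1p(mu)
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int64)

def mulaw_detokenize(tokens, mu=255):
    """Invert the companding to recover approximate samples."""
    compressed = 2 * tokens.astype(np.float64) / mu - 1
    return np.sign(compressed) * ((1 + mu) ** np.abs(compressed) - 1) / mu
```

This gives a vocabulary of 256, but the token rate equals the sample rate: 16,000 tokens per second, a hopeless sequence length for a language model. Semantic and acoustic tokenizers exist precisely to push that rate down by orders of magnitude.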

Definition

Audio Tokenization

Audio tokenization converts a continuous waveform into a sequence of discrete tokens from a finite vocabulary. Two main approaches:

  1. Semantic tokens: encode linguistic content. Trained with a self-supervised model (e.g., HuBERT, w2v-BERT) that learns to cluster audio frames into discrete units. These capture what was said, discarding speaker and acoustic details.
  2. Acoustic tokens: encode full audio including speaker identity, prosody, and acoustic environment. Neural audio codecs (EnCodec, SoundStream, DAC) learn to compress audio into discrete codes that can reconstruct the waveform.

Most systems use both: semantic tokens for understanding, acoustic tokens for generation.
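Acoustic codecs such as EnCodec and SoundStream discretize with residual vector quantization (RVQ): each codebook quantizes the residual error left by the previous one, which is why their tokens come in stacked codebooks. A minimal NumPy sketch of the idea, assuming the codebooks are already given (real codecs learn them jointly with the encoder):

```python
import numpy as np

def rvq_encode(frames, codebooks):
    """Residual vector quantization.
    frames: (T, D) array of per-frame embeddings.
    codebooks: list of (V, D) arrays. Returns (num_codebooks, T) token ids."""
    residual = frames.astype(np.float64)
    codes = []
    for cb in codebooks:
        # nearest codeword to each frame's current residual
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = dists.argmin(axis=1)
        codes.append(idx)
        residual = residual - cb[idx]  # next codebook sees what's left
    return np.stack(codes)

def rvq_decode(codes, codebooks):
    """Reconstruct frames by summing the selected codewords."""
    return sum(cb[idx] for cb, idx in zip(codebooks, codes))
```

Each extra codebook refines the reconstruction, which is exactly the rate-quality tradeoff formalized in the next proposition: more codebooks mean better audio but more tokens per second.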

Proposition

Audio Token Rate-Quality Tradeoff

Statement

For a neural audio codec operating at $R$ tokens per second with vocabulary size $V$, the information rate is $R \log_2 V$ bits per second. Audio reconstruction quality increases monotonically with information rate, but with diminishing returns. Typical operating points:

  • Semantic tokens: 25-50 tokens/sec, V = 500-2000. Low bitrate, captures meaning only
  • Acoustic tokens: 50-75 tokens/sec per codebook, V = 1024, with 4-8 codebooks stacked. High bitrate, near-lossless reconstruction

The tradeoff: fewer tokens per second means shorter sequences for the language model (lower computational cost), but lower reconstruction quality.
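The arithmetic behind these operating points is easy to check directly (the helper names below are illustrative, not from any library):

```python
import math

def bitrate_bps(tokens_per_sec, vocab_size, num_codebooks=1):
    """Information rate: num_codebooks * R * log2(V) bits per second."""
    return num_codebooks * tokens_per_sec * math.log2(vocab_size)

def clip_tokens(tokens_per_sec, num_codebooks, seconds):
    """Sequence length the language model must process."""
    return tokens_per_sec * num_codebooks * seconds

# Semantic tokens: 50 tok/s, V = 1000 -> roughly 0.5 kbit/s
print(round(bitrate_bps(50, 1000)))     # 498
# Acoustic tokens: 8 codebooks x 50 tok/s, V = 1024 -> 4 kbit/s
print(round(bitrate_bps(50, 1024, 8)))  # 4000
# A 30-second clip with 8 codebooks at 50 tok/s
print(clip_tokens(50, 8, 30))           # 12000
```

The roughly 8x bitrate gap between the two operating points is also an 8x gap in sequence length, which is the cost the language model pays.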

Intuition

Semantic tokens are like a transcript: they capture the words but not the voice. Acoustic tokens are like a compressed audio file: they capture everything but produce longer sequences. A language model can reason about 50 semantic tokens per second of audio, but 400 acoustic tokens per second (8 codebooks times 50) is a much harder sequence modeling problem.

Why It Matters

The token rate directly determines the sequence length the language model must process. A 30-second audio clip at 50 tokens/sec produces 1,500 tokens. At 400 tokens/sec (multi-codebook acoustic), it produces 12,000 tokens. Context window limits and quadratic attention cost make this tradeoff critical for system design.

Failure Mode

Aggressive compression (very few tokens/sec) loses information that cannot be recovered. Semantic tokens discard speaker identity, making voice cloning impossible from semantic tokens alone. Acoustic tokens at very low bitrates introduce audible artifacts (metallic, robotic quality).

Whisper and Transcription

The Whisper family (OpenAI, 2022-2024) established the current standard for speech recognition. Key design choices:

  1. Encoder-decoder transformer: audio spectrogram input, text output
  2. Massive multitask training: trained on 680,000 hours of labeled audio for transcription, translation, language identification, and timestamp prediction
  3. Weak supervision: training data comes from the internet, not hand-labeled corpora. Quality is lower per-example but quantity is vastly larger

Whisper demonstrated that scaling data (even noisy data) beats careful curation for ASR. Whisper-large-v3 achieves word error rates competitive with human transcribers on many benchmarks.

Pipeline vs End-to-End Voice Models

Proposition

Pipeline vs End-to-End Latency Decomposition

Statement

A three-stage pipeline (ASR + LLM + TTS) has total latency:

$$L_{\text{pipeline}} = L_{\text{ASR}} + L_{\text{LLM}} + L_{\text{TTS}}$$

where each $L_i$ includes both computation time and any buffering delay. An end-to-end model processes audio tokens directly:

$$L_{\text{e2e}} = L_{\text{first-token}} + L_{\text{decode}}$$

In practice, $L_{\text{pipeline}} \approx 1\text{–}3$ seconds (ASR: 200-500ms, LLM first token: 200-1000ms, TTS first audio: 200-500ms). End-to-end models can achieve $L_{\text{e2e}} \approx 200\text{–}500$ms by eliminating the serialization overhead.
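The decomposition reduces to simple addition because the stages serialize. A tiny sketch using mid-range figures from the ranges above (the function name is illustrative):

```python
def pipeline_time_to_first_audio_ms(asr_ms, llm_first_token_ms, tts_first_audio_ms):
    """The LLM waits for the full ASR transcript, and TTS waits for
    the LLM's first text, so the latencies add."""
    return asr_ms + llm_first_token_ms + tts_first_audio_ms

# Mid-range figures from the ranges quoted above
print(pipeline_time_to_first_audio_ms(350, 600, 350))  # 1300
```

Even with the best case of each stage (200ms + 200ms + 200ms = 600ms), the pipeline sits well above the ~200ms human turn-taking gap, which is why end-to-end models target the whole sum rather than any one stage.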

Intuition

In a pipeline, the LLM cannot start until ASR finishes, and TTS cannot start until the LLM produces text. End-to-end models avoid this by processing audio tokens natively, eliminating the ASR-to-text and text-to-TTS bottlenecks. The tradeoff is that end-to-end models are harder to debug (you cannot inspect the intermediate transcript) and harder to train (the model must learn ASR, reasoning, and TTS jointly).

Why It Matters

Conversational AI requires sub-second response times to feel natural. Human turn-taking gaps average about 200ms. Pipeline latencies of 2-3 seconds create an unnatural pause. End-to-end models are the path to natural conversational interaction.

Failure Mode

End-to-end models can hallucinate audio: generating fluent-sounding speech that does not match the intended content. In a pipeline, the text output of the LLM can be inspected and filtered before TTS. End-to-end models lack this safety checkpoint.

Notable End-to-End Systems

  • AudioPaLM (Google, 2023): fuses PaLM text capabilities with audio understanding. Uses both semantic and acoustic tokens
  • SeamlessM4T (Meta, 2023): multilingual speech-to-speech translation without intermediate text
  • GPT-4o audio mode (OpenAI, 2024): native audio input/output with a single multimodal model

Music Generation

Music generation applies the same audio tokenization approach to a different domain. Key systems:

  • MusicLM (Google, 2023): generates music from text descriptions. Uses a hierarchy of semantic and acoustic tokens with a cascaded generation approach
  • MusicGen (Meta, 2023): single-stage transformer generating audio codec tokens conditioned on text or melody
  • Suno (2023-2024): commercial system generating full songs (vocals + instruments) from text prompts

The music domain has unique challenges: long-range structure (a song has verses, choruses, bridges), multiple simultaneous instruments, and subjective quality evaluation (there is no "ground truth" for a creative task).

Common Confusions

Watch Out

Audio tokens are not phonemes

Phonemes are linguistic units defined by human linguists. Audio tokens are learned representations that may or may not correspond to phonemes. Semantic tokens from HuBERT tend to cluster around phoneme-like units, but acoustic tokens encode much more: pitch, speaker timbre, room acoustics, background noise.

Watch Out

End-to-end does not mean no text at all

Most end-to-end audio language models still use text as an intermediate representation or training signal. "End-to-end" means the model can process audio directly without requiring a separate ASR module at inference time, not that text plays no role in the architecture or training.

Watch Out

Audio generation quality is not just about the language model

The neural audio codec (vocoder) quality determines the ceiling for generated audio quality. A perfect language model generating tokens from a poor codec will produce poor audio. Advances in audio codecs (EnCodec, DAC, Vocos) directly improve the quality of all downstream audio generation systems.

Summary

  • Audio tokenization bridges continuous audio and discrete language models
  • Semantic tokens capture meaning; acoustic tokens capture full audio fidelity
  • Token rate determines sequence length, which determines computational cost
  • Pipeline (ASR + LLM + TTS) is modular but slow; end-to-end is fast but harder to train and debug
  • Whisper showed that scaling noisy data beats curating clean data for ASR
  • Music generation uses the same tokenization framework with longer-range structure requirements

Exercises

ExerciseCore

Problem

A neural audio codec operates at 50 tokens per second with 4 codebooks, each with vocabulary size 1024. What is the total bitrate? How many tokens does a 1-minute audio clip produce?

ExerciseAdvanced

Problem

A voice assistant uses a pipeline architecture with ASR (300ms), LLM (500ms to first token, 30ms per subsequent token, average response 50 tokens), and TTS (200ms to first audio chunk). What is the total time to first audio output? How does this compare to the 200ms human conversational turn-taking gap?

References

Canonical:

  • Radford et al., "Robust Speech Recognition via Large-Scale Weak Supervision" (2022). Whisper
  • Borsos et al., "AudioLM: A Language Modeling Approach to Audio Generation" (2023)

Current:

  • Rubenstein et al., "AudioPaLM: A Large Language Model That Can Speak and Listen" (2023)
  • Copet et al., "Simple and Controllable Music Generation" (2023). MusicGen
  • Defossez et al., "High Fidelity Neural Audio Compression" (2022). EnCodec

Next Topics

  • Multimodal RAG: retrieval-augmented generation across text, audio, and vision

Last reviewed: April 2026
