Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.

Beyond LLMs

Audio Language Models

Models that process and generate speech alongside text: audio tokenization, Whisper for transcription, end-to-end voice models, music generation, and the audio-language frontier.

Advanced · Tier 3 · Frontier · ~50 min

Why This Matters

Vision was the first modality to be fused with language models. Audio is the second, and it is arguably harder. Audio is inherently sequential at a much finer time resolution than text (16,000 samples per second vs. a few tokens per second), it carries both semantic content (what was said) and paralinguistic information (how it was said: tone, emotion, speaker identity), and it must be processed in real time for conversational applications. Audio language models are the technology behind voice assistants that can hold natural conversations.

Mental Model

There are two architectures for combining audio and language:

  1. Pipeline: speech-to-text (ASR), then LLM, then text-to-speech (TTS). Each module is optimized separately. Latency is the sum of three stages.
  2. End-to-end: a single model that takes audio tokens as input and produces audio tokens as output, with text as an intermediate or auxiliary modality. Lower latency, but harder to train and debug.

The field is moving from pipeline to end-to-end, following the same trajectory that machine translation followed from phrase-based to neural.

Audio Tokenization

The core challenge: audio is a continuous, high-dimensional signal. Language models operate on discrete tokens. The bridge is audio tokenization.
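To see why learned tokenizers are needed, consider the crudest possible discretization: quantize every raw sample directly. The sketch below is a toy NumPy illustration of 8-bit mu-law companding (the quantization scheme WaveNet modeled), not a codec used by any system described here:

```python
import numpy as np

def mulaw_tokenize(waveform, mu=255):
    """Map samples in [-1, 1] to integer tokens in [0, 255] via
    8-bit mu-law companding: log-compress, then uniformly quantize."""
    compressed = np.sign(waveform) * np.log1p(mu * np.abs(waveform)) / np.log1p(mu)
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int64)

def mulaw_detokenize(tokens, mu=255):
    """Invert the companding to recover approximate samples."""
    compressed = 2 * tokens.astype(np.float64) / mu - 1
    return np.sign(compressed) * ((1 + mu) ** np.abs(compressed) - 1) / mu
```

This gives a vocabulary of 256, but the token rate equals the sample rate: 16,000 tokens per second, a hopeless sequence length for a language model. Semantic and acoustic tokenizers exist precisely to push that rate down by orders of magnitude.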

Definition

Audio Tokenization

Audio tokenization converts a continuous waveform into a sequence of discrete tokens from a finite vocabulary. Two main approaches:

  1. Semantic tokens: encode linguistic content. Trained with a self-supervised model (e.g., HuBERT, w2v-BERT) that learns to cluster audio frames into discrete units. These capture what was said, discarding speaker and acoustic details.
  2. Acoustic tokens: encode full audio including speaker identity, prosody, and acoustic environment. Neural audio codecs (EnCodec, SoundStream, DAC) learn to compress audio into discrete codes that can reconstruct the waveform.

Most systems use both: semantic tokens for understanding, acoustic tokens for generation.
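Acoustic codecs such as EnCodec and SoundStream discretize with residual vector quantization (RVQ): each codebook quantizes the residual error left by the previous one, which is why their tokens come in stacked codebooks. A minimal NumPy sketch of the idea, assuming the codebooks are already given (real codecs learn them jointly with the encoder):

```python
import numpy as np

def rvq_encode(frames, codebooks):
    """Residual vector quantization.
    frames: (T, D) array of per-frame embeddings.
    codebooks: list of (V, D) arrays. Returns (num_codebooks, T) token ids."""
    residual = frames.astype(np.float64)
    codes = []
    for cb in codebooks:
        # nearest codeword to each frame's current residual
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = dists.argmin(axis=1)
        codes.append(idx)
        residual = residual - cb[idx]  # next codebook sees what's left
    return np.stack(codes)

def rvq_decode(codes, codebooks):
    """Reconstruct frames by summing the selected codewords."""
    return sum(cb[idx] for cb, idx in zip(codebooks, codes))
```

Each extra codebook refines the reconstruction, which is exactly the rate-quality tradeoff formalized in the next proposition: more codebooks mean better audio but more tokens per second.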

Proposition

Audio Token Rate-Quality Tradeoff

Statement

For a neural audio codec operating at $R$ tokens per second with vocabulary size $V$, the information rate is $R \log_2 V$ bits per second. Audio reconstruction quality increases monotonically with information rate, but with diminishing returns. Typical operating points:

  • Semantic tokens: 25-50 tokens/sec, V = 500-2000. Low bitrate, captures meaning only
  • Acoustic tokens: 50-75 tokens/sec per codebook, V = 1024, with 4-8 codebooks stacked. High bitrate, near-lossless reconstruction

The tradeoff: fewer tokens per second means shorter sequences for the language model (lower computational cost), but lower reconstruction quality.
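The arithmetic behind these operating points is easy to check directly (the helper names below are illustrative, not from any library):

```python
import math

def bitrate_bps(tokens_per_sec, vocab_size, num_codebooks=1):
    """Information rate: num_codebooks * R * log2(V) bits per second."""
    return num_codebooks * tokens_per_sec * math.log2(vocab_size)

def clip_tokens(tokens_per_sec, num_codebooks, seconds):
    """Sequence length the language model must process."""
    return tokens_per_sec * num_codebooks * seconds

# Semantic tokens: 50 tok/s, V = 1000 -> roughly 0.5 kbit/s
print(round(bitrate_bps(50, 1000)))     # 498
# Acoustic tokens: 8 codebooks x 50 tok/s, V = 1024 -> 4 kbit/s
print(round(bitrate_bps(50, 1024, 8)))  # 4000
# A 30-second clip with 8 codebooks at 50 tok/s
print(clip_tokens(50, 8, 30))           # 12000
```

The roughly 8x bitrate gap between the two operating points is also an 8x gap in sequence length, which is the cost the language model pays.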

Intuition

Semantic tokens are like a transcript: they capture the words but not the voice. Acoustic tokens are like a compressed audio file: they capture everything but produce longer sequences. A language model can reason about 50 semantic tokens per second of audio, but 400 acoustic tokens per second (8 codebooks times 50) is a much harder sequence modeling problem.

Why It Matters

The token rate directly determines the sequence length the language model must process. A 30-second audio clip at 50 tokens/sec produces 1,500 tokens. At 400 tokens/sec (multi-codebook acoustic), it produces 12,000 tokens. Context window limits and quadratic attention cost make this tradeoff critical for system design.

Failure Mode

Aggressive compression (very few tokens/sec) loses information that cannot be recovered. Semantic tokens discard speaker identity, making voice cloning impossible from semantic tokens alone. Acoustic tokens at very low bitrates introduce audible artifacts (metallic, robotic quality).

Whisper and Transcription

The Whisper family (OpenAI, 2022-2024) established the current standard for speech recognition. Key design choices:

  1. Encoder-decoder transformer: audio spectrogram input, text output
  2. Massive multitask training: trained on 680,000 hours of labeled audio for transcription, translation, language identification, and timestamp prediction
  3. Weak supervision: training data comes from the internet, not hand-labeled corpora. Quality is lower per-example but quantity is vastly larger

Whisper demonstrated that scaling data (even noisy data) beats careful curation for ASR. Whisper-large-v3 achieves word error rates competitive with human transcribers on many benchmarks.

Pipeline vs End-to-End Voice Models

Proposition

Pipeline vs End-to-End Latency Decomposition

Statement

A three-stage pipeline (ASR + LLM + TTS) has total latency:

$$L_{\text{pipeline}} = L_{\text{ASR}} + L_{\text{LLM}} + L_{\text{TTS}}$$

where each $L_i$ includes both computation time and any buffering delay. An end-to-end model processes audio tokens directly:

$$L_{\text{e2e}} = L_{\text{first-token}} + L_{\text{decode}}$$

In practice, $L_{\text{pipeline}} \approx 1\text{–}3$ seconds (ASR: 200-500ms, LLM first token: 200-1000ms, TTS first audio: 200-500ms). End-to-end models can achieve $L_{\text{e2e}} \approx 200\text{–}500$ms by eliminating the serialization overhead.
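The decomposition reduces to simple addition because the stages serialize. A tiny sketch using mid-range figures from the ranges above (the function name is illustrative):

```python
def pipeline_time_to_first_audio_ms(asr_ms, llm_first_token_ms, tts_first_audio_ms):
    """The LLM waits for the full ASR transcript, and TTS waits for
    the LLM's first text, so the latencies add."""
    return asr_ms + llm_first_token_ms + tts_first_audio_ms

# Mid-range figures from the ranges quoted above
print(pipeline_time_to_first_audio_ms(350, 600, 350))  # 1300
```

Even with the best case of each stage (200ms + 200ms + 200ms = 600ms), the pipeline sits well above the ~200ms human turn-taking gap, which is why end-to-end models target the whole sum rather than any one stage.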

Intuition

In a pipeline, the LLM cannot start until ASR finishes, and TTS cannot start until the LLM produces text. End-to-end models avoid this by processing audio tokens natively, eliminating the ASR-to-text and text-to-TTS bottlenecks. The tradeoff is that end-to-end models are harder to debug (you cannot inspect the intermediate transcript) and harder to train (the model must learn ASR, reasoning, and TTS jointly).

Why It Matters

Conversational AI requires sub-second response times to feel natural. Human turn-taking gaps average about 200ms. Pipeline latencies of 2-3 seconds create an unnatural pause. End-to-end models are the path to natural conversational interaction.

Failure Mode

End-to-end models can hallucinate audio: generating fluent-sounding speech that does not match the intended content. In a pipeline, the text output of the LLM can be inspected and filtered before TTS. End-to-end models lack this safety checkpoint.

Notable End-to-End Systems

  • AudioPaLM (Google, 2023): fuses PaLM text capabilities with audio understanding. Uses both semantic and acoustic tokens
  • SeamlessM4T (Meta, 2023): multilingual speech-to-speech translation without intermediate text
  • GPT-4o audio mode (OpenAI, 2024): native audio input/output with a single multimodal model

Music Generation

Music generation applies the same audio tokenization approach to a different domain. Key systems:

  • MusicLM (Google, 2023): generates music from text descriptions. Uses a hierarchy of semantic and acoustic tokens with a cascaded generation approach
  • MusicGen (Meta, 2023): single-stage transformer generating audio codec tokens conditioned on text or melody
  • Suno (2023-2024): commercial system generating full songs (vocals + instruments) from text prompts

The music domain has unique challenges: long-range structure (a song has verses, choruses, bridges), multiple simultaneous instruments, and subjective quality evaluation (there is no "ground truth" for a creative task).

Common Confusions

Watch Out

Audio tokens are not phonemes

Phonemes are linguistic units defined by human linguists. Audio tokens are learned representations that may or may not correspond to phonemes. Semantic tokens from HuBERT tend to cluster around phoneme-like units, but acoustic tokens encode much more: pitch, speaker timbre, room acoustics, background noise.

Watch Out

End-to-end does not mean no text at all

Most end-to-end audio language models still use text as an intermediate representation or training signal. "End-to-end" means the model can process audio directly without requiring a separate ASR module at inference time, not that text plays no role in the architecture or training.

Watch Out

Audio generation quality is not just about the language model

The neural audio codec (vocoder) quality determines the ceiling for generated audio quality. A perfect language model generating tokens from a poor codec will produce poor audio. Advances in audio codecs (EnCodec, DAC, Vocos) directly improve the quality of all downstream audio generation systems.

Summary

  • Audio tokenization bridges continuous audio and discrete language models
  • Semantic tokens capture meaning; acoustic tokens capture full audio fidelity
  • Token rate determines sequence length, which determines computational cost
  • Pipeline (ASR + LLM + TTS) is modular but slow; end-to-end is fast but harder to train and debug
  • Whisper showed that scaling noisy data beats curating clean data for ASR
  • Music generation uses the same tokenization framework with longer-range structure requirements

Exercises

ExerciseCore

Problem

A neural audio codec operates at 50 tokens per second with 4 codebooks, each with vocabulary size 1024. What is the total bitrate? How many tokens does a 1-minute audio clip produce?

ExerciseAdvanced

Problem

A voice assistant uses a pipeline architecture with ASR (300ms), LLM (500ms to first token, 30ms per subsequent token, average response 50 tokens), and TTS (200ms to first audio chunk). What is the total time to first audio output? How does this compare to the 200ms human conversational turn-taking gap?

References

Canonical:

  • Radford et al., "Robust Speech Recognition via Large-Scale Weak Supervision" (2022). Whisper
  • Borsos et al., "AudioLM: A Language Modeling Approach to Audio Generation" (2023)

Current:

  • Rubenstein et al., "AudioPaLM: A Large Language Model That Can Speak and Listen" (2023)
  • Copet et al., "Simple and Controllable Music Generation" (2023). MusicGen
  • Defossez et al., "High Fidelity Neural Audio Compression" (2022). EnCodec

Next Topics

  • Multimodal RAG: retrieval-augmented generation across text, audio, and vision

Last reviewed: April 2026
