
LLM Construction

Byte-Level Language Models

Skip the tokenizer and feed raw bytes to the model. ByT5, MegaByte, and Byte Latent Transformer: why operating on bytes is attractive, why it is expensive, and how hierarchical patching closes the compute gap.


Why This Matters

Every standard LLM starts with a tokenizer that converts text into subword units drawn from a fixed vocabulary. Tokenization compresses well (short sequences, high bytes-per-token) but carries a long tail of failure modes: arithmetic errors caused by digit-group tokenization, multilingual unfairness (a Burmese character can cost 3 to 5 tokens while an English word costs 1), brittle behavior on typos and rare scripts, and the need to pick a vocabulary before any training.

Byte-level models dispose of the tokenizer entirely. The input sequence is the raw UTF-8 byte stream. The vocabulary is fixed at 256 (plus a few control tokens). Every piece of text has exactly one byte-level encoding, and every Unicode string the model might encounter is representable.

The price is sequence length. A typical subword token is 3-4 bytes of UTF-8 text, so a byte-level sequence is 3-4 times longer than the equivalent tokenized sequence. Under standard transformer scaling, quadratic-in-sequence attention and linear-in-sequence feedforward make byte-level models substantially more expensive per string. Three generations of work (ByT5, MegaByte, Byte Latent Transformer) close that gap through hierarchical processing: small cheap computations at the byte level, expensive transformer computation at a coarser patch level.

The Byte-Level Compute Problem

Definition

Bytes Per Token

For a trained tokenizer and a reference corpus, the average number of UTF-8 bytes represented by one vocabulary token: $\beta = (\text{bytes in corpus}) / (\text{tokens in corpus})$. Typical values: English with a 32K BPE tokenizer, $\beta \approx 4$; Chinese with the same tokenizer, $\beta \approx 1.5$; Burmese, $\beta \approx 0.3$ (one byte costs several tokens).

Proposition

Compute Ratio: Byte-Level vs. Subword

Statement

Let the subword-level transformer have sequence length $N_{\text{tok}}$ and let the byte-level transformer process the same text as $N_{\text{byte}} = \beta \cdot N_{\text{tok}}$ bytes. With identical architecture (same $L$, $d$), the forward-pass compute ratio is:

$$\frac{\text{FLOPs}_{\text{byte}}}{\text{FLOPs}_{\text{tok}}} = \frac{\beta^2 \cdot N_{\text{tok}}^2 d + \beta \cdot N_{\text{tok}} d^2}{N_{\text{tok}}^2 d + N_{\text{tok}} d^2}.$$

In the attention-dominated regime ($N_{\text{tok}} d \gg d^2$), this approaches $\beta^2$. In the feedforward-dominated regime ($N_{\text{tok}} d \ll d^2$), this approaches $\beta$. For typical modern models $\beta \approx 4$, so byte-level inference on the same architecture costs between $4\times$ and $16\times$ more FLOPs per string.

Intuition

Attention is quadratic in sequence length, so quadrupling the sequence quadruples each dimension of the attention matrix, costing $\beta^2$ more. Feedforward is linear in sequence length, so it only costs $\beta$ more. Real models have both; the ratio interpolates depending on which term dominates. Long contexts push the ratio toward $\beta^2$; short contexts and wide models push it toward $\beta$.

Proof Sketch

FLOPs per layer of a vanilla transformer: attention $\Theta(N^2 d)$ (computing $QK^\top$ and multiplying by $V$) plus feedforward $\Theta(N d^2)$. Substitute $N_{\text{byte}} = \beta N_{\text{tok}}$: the ratio of total FLOPs is $(\beta^2 N^2 d + \beta N d^2) / (N^2 d + N d^2)$. Factor $N d$ out of both to get $(\beta^2 N + \beta d) / (N + d)$. For $N \gg d$ the ratio approaches $\beta^2$; for $N \ll d$ it approaches $\beta$.
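The ratio in the proof sketch is easy to evaluate numerically. A small sketch, using the same per-layer approximations as above (attention $\sim N^2 d$, feedforward $\sim N d^2$); the example sizes are illustrative, not drawn from any specific model:

```python
def byte_to_token_flop_ratio(beta: float, n_tok: int, d: int) -> float:
    """FLOP ratio of a byte-level vs. subword transformer with identical
    architecture, using attention ~ N^2*d and feedforward ~ N*d^2 per layer."""
    n_byte = beta * n_tok
    flops = lambda n: n * n * d + n * d * d  # per-layer cost model
    return flops(n_byte) / flops(n_tok)

# Attention-dominated regime (N >> d): ratio approaches beta^2 = 16.
print(byte_to_token_flop_ratio(4, n_tok=1_000_000, d=1024))
# Feedforward-dominated regime (N << d): ratio approaches beta = 4.
print(byte_to_token_flop_ratio(4, n_tok=256, d=16384))
```

The layer count cancels out of the ratio, which is why it does not appear as an argument.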

Why It Matters

This is the core design constraint. If you run bytes through the same model as tokens, you pay 4 to 16 times the compute for English. That steep price is why naive byte-level training at frontier scale does not exist: no one has run $10^{25}$ FLOPs on a byte-level model with identical architecture. The hierarchical approaches (MegaByte, BLT) exist precisely to break this ratio.

Failure Mode

The ratio holds exactly only when the architecture is held fixed. In practice byte-level models often rebalance other dimensions: ByT5 uses a deeper but narrower encoder than T5 to compensate. Hierarchical byte models change the architecture outright, so the compute ratio can be brought near or below 1 at the cost of a more complex training recipe. The proposition describes what happens if you change nothing else; it is not a hard limit on byte-level models.

ByT5: Bytes Through a T5

ByT5 (Xue et al. 2022, TACL, arXiv:2105.13626) was the first large-scale demonstration that feeding raw bytes to a standard transformer is practical. The recipe keeps the T5 encoder-decoder architecture unchanged and replaces the SentencePiece tokenizer with a direct UTF-8 byte encoder, adding only three reserved tokens (PAD, EOS, UNK).

To offset the byte-length blowup, ByT5 adjusts the parameter budget:

  • Encoder depth is increased (more layers) while decoder depth is reduced, because encoder work dominates the cost when sequences are long and most of the representational burden lives in the encoder.
  • Feedforward hidden size is expanded to give each byte more capacity.

The result: on mT5's multilingual benchmark suite, ByT5 matches or beats token-level mT5 on noisy, low-resource, and character-sensitive tasks (transliteration, word-level perturbations, rare-language QA), while being slightly worse on standard English text tasks. The multilingual fairness gain is the main prize: every script is encoded at the same ~1 byte per character, so low-resource languages no longer pay a tokenization tax.

MegaByte: Hierarchical Patches for Long Sequences

MegaByte (Yu et al. 2023, NeurIPS, arXiv:2305.07185) addressed the scaling problem directly: the quadratic cost of attention makes byte-level modeling at context lengths above a few thousand bytes prohibitive. The MegaByte architecture introduces a two-level hierarchy.

Definition

Patch

A fixed-length window of $P$ contiguous bytes. A byte-level sequence of length $N$ is partitioned into $N/P$ patches. Each patch is represented by a single vector (the concatenation or projection of its byte embeddings).

The MegaByte architecture has three components:

  1. Patch embedder. Each patch of $P$ bytes is embedded into a single patch vector by concatenating the $P$ byte embeddings and projecting to dimension $d$.
  2. Global transformer. Operates on the sequence of $N/P$ patch vectors, using standard self-attention. This is the expensive part, but the sequence is $P$ times shorter than the byte sequence.
  3. Local model. A small transformer predicts bytes within a patch, conditioned on the global model's output for that patch. This is cheap because it operates on $P$ bytes at a time with no cross-patch attention.

The compute budget shifts: attention cost becomes $\Theta((N/P)^2 d)$ globally plus $\Theta(N \cdot P \cdot d_{\text{local}})$ locally, which is linear in $N$ when $d_{\text{local}}$ is small and $P$ is constant. For $P = 8$ and typical local-model sizes, MegaByte models sequences of $10^6$ bytes at compute budgets where a vanilla byte-level transformer would handle a few thousand bytes.
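The attention-cost shift can be checked with the cost model above. A sketch under the same $\Theta$-level approximations; the sizes ($d_{\text{local}} = d/8$, $P = 8$) are illustrative choices, not values from the paper:

```python
def vanilla_attn_flops(n: int, d: int) -> float:
    # Vanilla byte-level transformer: one attention over all N bytes.
    return n * n * d

def megabyte_attn_flops(n: int, p: int, d: int, d_local: int) -> float:
    # Global: attention over N/P patch vectors of dimension d.
    # Local: attention within each P-byte patch, N/P patches in total.
    n_patches = n // p
    return n_patches ** 2 * d + n_patches * (p * p * d_local)

n, d = 1_000_000, 4096            # one million bytes of context
p, d_local = 8, 4096 // 8         # illustrative hierarchy sizes
speedup = vanilla_attn_flops(n, d) / megabyte_attn_flops(n, p, d, d_local)
print(f"{speedup:.0f}x")          # global attention shrinks by ~P^2
```

The speedup lands just under $P^2 = 64$ because the cheap local term adds a small overhead on top of the $P^2$-reduced global term.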

Byte Latent Transformer: Patches That Vary With Content

BLT (Pagnoni et al. 2024, arXiv:2412.09871) keeps the hierarchical idea but makes patch boundaries content-dependent. The insight: uniform $P$-byte patches waste compute on predictable stretches (common English words, where the next byte has low entropy) and under-serve unpredictable stretches (rare words, numbers, code).

Definition

Entropy-Based Patching

A small entropy model $h(y \mid y_{<t})$ estimates the conditional entropy of the next byte given the prefix. A new patch starts at position $t$ whenever $h(y_t \mid y_{<t})$ exceeds a threshold $\theta$: high-entropy points become patch boundaries. Patches therefore vary in length: short in unpredictable regions, long in predictable ones.
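The segmentation rule itself is simple; in BLT the entropy estimates come from a small trained byte LM. A sketch of the thresholding logic, with the entropy model replaced by an arbitrary callable (the `toy_entropy` heuristic below is a stand-in, not anything from the paper):

```python
def entropy_patches(byte_seq: bytes, next_byte_entropy, threshold: float):
    """Split a byte sequence into variable-length patches, opening a new
    patch wherever the estimated next-byte entropy exceeds the threshold.
    `next_byte_entropy(prefix)` stands in for BLT's small entropy LM."""
    patches, current = [], bytearray()
    for i, b in enumerate(byte_seq):
        if current and next_byte_entropy(byte_seq[:i]) > threshold:
            patches.append(bytes(current))  # high entropy: close the patch
            current = bytearray()
        current.append(b)
    if current:
        patches.append(bytes(current))
    return patches

# Toy stand-in: pretend the model is surprised at the start of each word.
toy_entropy = lambda prefix: 3.0 if prefix.endswith(b" ") else 0.5
print(entropy_patches(b"the quick fox", toy_entropy, threshold=1.0))
# -> [b'the ', b'quick ', b'fox']
```

With a real entropy model, common words collapse into long patches while rare words, digits, and code fragment into short ones, which is exactly the compute allocation the text describes.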

Like MegaByte, BLT runs the expensive global transformer over patches and a smaller local model within patches. Unlike MegaByte, the patch count and boundaries depend on the input. The empirical finding: at matched compute, BLT matches Llama 3 tokenizer-based performance on English benchmarks while degrading less under typos, performing better at character-level manipulation tasks, and closing the gap across languages.

Pagnoni et al. argue that BLT breaks the subword-tokenizer Pareto frontier: prior byte-level models traded quality for noise tolerance and language fairness, while BLT keeps both properties without the quality loss. This claim rests on a specific compute budget and benchmark set; long-horizon comparisons at frontier scale are still active research as of 2026.

Why Byte-Level Matters Beyond Convenience

Multilingual fairness. Tokenizer-based models charge low-resource languages more tokens per character, which translates to smaller effective context windows and worse learning dynamics. Byte-level models charge every language the same per character.

Stability under perturbation. A single typo can change tokenization dramatically ("hello" → "h3llo" may go from 1 token to 5). Byte-level models see a one-byte change and handle it gracefully.
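The byte-level half of this claim is easy to see directly (how a given subword tokenizer fragments "h3llo" depends on that tokenizer's vocabulary, so only the byte view is shown):

```python
# At the byte level, "hello" -> "h3llo" is a single-position substitution;
# the rest of the input representation is untouched.
a, b = "hello".encode("utf-8"), "h3llo".encode("utf-8")
diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
assert len(a) == len(b) and diff == [1]  # exactly one byte changed
```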

Arithmetic. Digit-group tokenization ("123" as one token, "1234" as a different token) is a known source of LLM arithmetic errors. Byte-level models see digits one at a time and can in principle learn digit-wise algorithms.

Byte-aware tasks. Spelling correction, morphological analysis, cryptography, and any task that depends on sub-word structure is natively accessible to byte-level models; token-based models must reconstruct character-level information from the token representation.

Eliminating a pipeline stage. No tokenizer training, no vocabulary decisions, no pre-tokenization language handling. One less moving part in the training pipeline.

When Byte-Level Is Not the Answer

Short-context English-only deployments. If your model only serves English chat at 4K context, you pay $\beta \approx 4$ times more compute for a noise-tolerance and fairness benefit you never exercise.

Very large context windows on standard hardware. Vanilla byte-level models blow up at long context. You need the hierarchical structure of MegaByte or BLT, which is substantially more complex to implement and less battle-tested than plain decoder-only transformers.

Tight latency budgets with small batch sizes. Byte-level sequences are longer, so even if each decode step costs no more, a string takes roughly $\beta$ times as many steps. For an interactive chat turn of 200 generated tokens, byte-level decode produces about 800 bytes and may be slower end-to-end.

Common Confusions

Watch Out

Byte-level is not the same as character-level

A Unicode character can span 1 to 4 UTF-8 bytes. Byte-level models see the 1-4 byte sequence; character-level models (CANINE, Charformer) group bytes into characters before processing. The distinction matters for scripts with multi-byte characters: a Chinese character is 3 bytes in UTF-8, so a byte-level model needs 3 steps per character while a character-level model needs 1. Most "byte-level" papers in modern ML (ByT5, MegaByte, BLT) are true byte-level, not character-level.
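The 1-to-4-byte spread is easy to verify with the standard library:

```python
# Characters span 1 to 4 bytes in UTF-8, so byte-level and character-level
# sequence lengths diverge by script.
for ch in ["a", "é", "中", "😀"]:
    print(ch, len(ch.encode("utf-8")))
# "a" is 1 byte, "é" 2, "中" 3, "😀" 4 -- a byte-level model spends that
# many steps per character; a character-level model spends one.
```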

Watch Out

Byte-level does not remove the need for special tokens

You still need PAD, EOS, BOS, and any control tokens the task requires. The vocabulary is 256 bytes plus these extras; the savings come from not needing tens of thousands of subword tokens, not from having zero tokens. ByT5 uses 256 bytes plus a handful of special tokens (PAD, EOS, UNK), compared to 32,000+ for a typical BPE tokenizer.
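The arithmetic is worth making concrete. The ID layout below follows ByT5's convention (special tokens first, byte values offset after them); the special-token names are illustrative:

```python
# A byte-level vocabulary: 256 byte values plus a few reserved control IDs.
SPECIALS = ["<pad>", "</s>", "<unk>"]   # ByT5-style; names are illustrative
VOCAB_SIZE = 256 + len(SPECIALS)        # = 259, vs. 32,000+ for typical BPE

def byte_to_id(b: int) -> int:
    # Specials occupy IDs 0..2, so byte value b maps to ID b + 3.
    return b + len(SPECIALS)

assert VOCAB_SIZE == 259
assert byte_to_id(0) == 3 and byte_to_id(255) == 258
```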

Watch Out

Hierarchical patching does not change the byte-level property

MegaByte and BLT still process raw bytes at input and output. The global transformer operates on summaries of byte patches, but the model's input and output interface is bytes. The pipeline is tokenizer-free even though internally there is a patch step. This is different from "BPE with extra steps": the patch boundaries are learned from the data (BLT) or fixed by geometry (MegaByte), not derived from corpus statistics of a separate training phase.

Watch Out

Byte compute blowup is not quadratic-in-beta in practice

The quadratic term $\beta^2$ applies when attention dominates. Most frontier transformers use enough feedforward capacity that the feedforward term is comparable to the attention term, so the realized byte-level compute ratio is between $\beta$ and $\beta^2$, often 3 to 8 times rather than 16 times. Hierarchical byte models break this entirely by making attention subquadratic in byte length.

Summary

  • Byte-level models feed raw UTF-8 bytes to the transformer with a vocabulary of ~259 entries, eliminating the tokenizer
  • The naive compute cost is between $\beta$ and $\beta^2$ times a subword transformer on the same text, where $\beta$ is the bytes-per-token ratio of the subword tokenizer (typically 3-4 for English)
  • ByT5 uses a standard T5 architecture with rebalanced encoder/decoder depth; it wins on multilingual and noisy-text tasks, loses slightly on clean English
  • MegaByte introduces hierarchical patches with fixed patch size $P$, reducing global-attention cost by a factor of $P^2$ and enabling million-byte contexts
  • BLT uses entropy-based dynamic patches: boundaries at high-entropy byte positions, long patches in predictable stretches, short patches in unpredictable ones
  • Byte-level models are fair across languages, resilient to typos and perturbations, and good at character-sensitive tasks; they cost more compute per string on clean English and are less production-tested

Exercises

ExerciseCore

Problem

A transformer trained on English has bytes-per-token $\beta = 4$ with its BPE tokenizer. A user asks it to process the same text at byte level using the same architecture. Sequence length goes from $N = 4096$ tokens to $N_{\text{byte}} = 16{,}384$ bytes. The model dimension is $d = 4096$.

(a) Compute the attention FLOPs for both cases.

(b) Compute the feedforward FLOPs for both cases (use $4 N d^2$ as the approximation, which accounts for up/down projections).

(c) What is the overall byte-to-token compute ratio?

ExerciseAdvanced

Problem

MegaByte uses patches of size $P$. Let the global transformer operate on $N/P$ patch tokens with dimension $d$ and the local model operate on each patch of $P$ bytes with dimension $d_{\text{local}} \ll d$. The global transformer has $L$ layers and the local model has $L_{\text{local}}$ layers.

(a) Write the total forward-pass FLOPs as a function of $N$, $P$, $d$, $d_{\text{local}}$, $L$, $L_{\text{local}}$.

(b) If $d_{\text{local}} = d/8$ and $L_{\text{local}} = L/4$, compare the MegaByte cost to the cost of a vanilla byte-level transformer at the same $N$, $d$, $L$. For what value of $P$ does MegaByte halve the cost?

References

Canonical:

  • Xue et al., ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models (TACL 2022, arXiv:2105.13626). Sections 2-4 (architecture rebalancing, encoder-decoder depth changes) and Table 7 (multilingual benchmark comparisons).
  • Yu et al., MegaByte: Predicting Million-byte Sequences with Multiscale Transformers (NeurIPS 2023, arXiv:2305.07185). Sections 3-4 (patch embedder, global and local models) and Table 1 (scaling to $10^6$-byte context).
  • Pagnoni et al., Byte Latent Transformer: Patches Scale Better Than Tokens (2024, arXiv:2412.09871). Sections 3-4 (entropy-based patching, BLT architecture) and Table 3 (matched-compute comparison with Llama 3 tokenizer baseline).

Current:

  • Clark et al., CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation (TACL 2022, arXiv:2103.06874). Character-level (not byte-level) counterpart with hash-based character embeddings.
  • Tay et al., Charformer: Fast Character Transformers via Gradient-based Subword Tokenization (ICLR 2022, arXiv:2106.12672). Learned soft tokenization at the character level.
  • Graves, Generating Sequences With Recurrent Neural Networks (2013, arXiv:1308.0850). Early byte-level character RNNs; conceptual ancestor of the modern work.
  • Kaplan et al., Scaling Laws for Neural Language Models (2020, arXiv:2001.08361). The scaling-law framework that byte-level and tokenized comparisons implicitly appeal to.

Last reviewed: April 2026
