
LLM Construction

Byte-Level Language Models

Skip the tokenizer and feed raw bytes to the model. ByT5, MegaByte, and Byte Latent Transformer: why operating on bytes is attractive, why it is expensive, and how hierarchical patching closes the compute gap.


Why This Matters

Every standard LLM starts with a tokenizer that converts text into subword units drawn from a fixed vocabulary. Tokenization compresses well (short sequences, high bytes-per-token) but carries a long tail of failure modes: arithmetic errors caused by digit-group tokenization, multilingual unfairness (a Burmese character can cost 3 to 5 tokens while an English word costs 1), brittle behavior on typos and rare scripts, and the need to pick a vocabulary before any training.

Byte-level models dispose of the tokenizer entirely. The input sequence is the raw UTF-8 byte stream. The vocabulary is fixed at 256 (plus a few control tokens). Every piece of text has exactly one byte-level encoding, and every Unicode string the model might encounter is representable.

The price is sequence length. A typical subword token is 3-4 bytes of UTF-8 text, so a byte-level sequence is 3-4 times longer than the equivalent tokenized sequence. Under standard transformer scaling, quadratic-in-sequence attention and linear-in-sequence feedforward make byte-level models substantially more expensive per string. Three generations of work (ByT5, MegaByte, Byte Latent Transformer) close that gap through hierarchical processing: small cheap computations at the byte level, expensive transformer computation at a coarser patch level.

The Byte-Level Compute Problem

Definition

Bytes Per Token

For a trained tokenizer and a reference corpus, the average number of UTF-8 bytes represented by one vocabulary token: $\beta = (\text{bytes in corpus}) / (\text{tokens in corpus})$. Typical values: English with a 32K BPE tokenizer, $\beta \approx 4$; Chinese with the same tokenizer, $\beta \approx 1.5$; Burmese, $\beta \approx 0.3$ (one byte costs several tokens).

Proposition

Compute Ratio: Byte-Level vs. Subword

Statement

Let the subword-level transformer have sequence length $N_{\text{tok}}$ and let the byte-level transformer process the same text as $N_{\text{byte}} = \beta \cdot N_{\text{tok}}$ bytes. With identical architecture (same $L$, $d$), the forward-pass compute ratio is:

$$\frac{\text{FLOPs}_{\text{byte}}}{\text{FLOPs}_{\text{tok}}} = \frac{\beta^2 \cdot N_{\text{tok}}^2 d + \beta \cdot N_{\text{tok}} d^2}{N_{\text{tok}}^2 d + N_{\text{tok}} d^2}.$$

In the attention-dominated regime ($N_{\text{tok}} d \gg d^2$), this approaches $\beta^2$. In the feedforward-dominated regime ($N_{\text{tok}} d \ll d^2$), this approaches $\beta$. For typical modern models $\beta \approx 4$, so byte-level inference on the same architecture costs between $4\times$ and $16\times$ more FLOPs per string.

Intuition

Attention is quadratic in sequence length, so quadrupling the sequence quadruples each dimension of the attention matrix, costing $\beta^2$ more. Feedforward is linear in sequence length, so it only costs $\beta$ more. Real models have both; the ratio interpolates depending on which term dominates. Long contexts push the ratio toward $\beta^2$; short contexts and wide models push it toward $\beta$.

Proof Sketch

FLOPs per layer of a vanilla transformer: attention $\Theta(N^2 d)$ (computing $QK^\top$ and multiplying by $V$) plus feedforward $\Theta(N d^2)$. Substitute $N_{\text{byte}} = \beta N_{\text{tok}}$: the ratio of total FLOPs is $(\beta^2 N^2 d + \beta N d^2) / (N^2 d + N d^2)$. Factor $N d$ out of both to get $(\beta^2 N + \beta d) / (N + d)$. For $N \gg d$ the ratio approaches $\beta^2$; for $N \ll d$ it approaches $\beta$.
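The ratio in the proof sketch is easy to evaluate numerically. A small sketch, using the same per-layer approximations as above (attention $\sim N^2 d$, feedforward $\sim N d^2$); the example sizes are illustrative, not drawn from any specific model:

```python
def byte_to_token_flop_ratio(beta: float, n_tok: int, d: int) -> float:
    """FLOP ratio of a byte-level vs. subword transformer with identical
    architecture, using attention ~ N^2*d and feedforward ~ N*d^2 per layer."""
    n_byte = beta * n_tok
    flops = lambda n: n * n * d + n * d * d  # per-layer cost model
    return flops(n_byte) / flops(n_tok)

# Attention-dominated regime (N >> d): ratio approaches beta^2 = 16.
print(byte_to_token_flop_ratio(4, n_tok=1_000_000, d=1024))
# Feedforward-dominated regime (N << d): ratio approaches beta = 4.
print(byte_to_token_flop_ratio(4, n_tok=256, d=16384))
```

The layer count cancels out of the ratio, which is why it does not appear as an argument.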

Why It Matters

This is the core design constraint. If you run bytes through the same model as tokens, you pay 4 to 16 times the compute for English. That steep price is why naive byte-level training at frontier scale does not exist: no one has run $10^{25}$ FLOPs on a byte-level model with identical architecture. The hierarchical approaches (MegaByte, BLT) exist precisely to break this ratio.

Failure Mode

The ratio holds exactly only when the architecture is held fixed. In practice byte-level models often rebalance other dimensions: ByT5 uses a deeper but narrower encoder than T5 to compensate. Hierarchical byte models change the architecture outright, so the compute ratio can be brought near or below 1 at the cost of a more complex training recipe. The proposition describes what happens if you change nothing else; it is not a hard limit on byte-level models.

ByT5: Bytes Through a T5

ByT5 (Xue et al. 2022, TACL, arXiv:2105.13626) was the first large-scale demonstration that feeding raw bytes to a standard transformer is practical. The recipe keeps the T5 encoder-decoder architecture unchanged and replaces the SentencePiece tokenizer with a direct UTF-8 byte encoder, adding only three reserved tokens (PAD, EOS, UNK).

To offset the byte-length blowup, ByT5 adjusts the parameter budget:

  • Encoder depth is increased (more layers) while decoder depth is reduced, because encoder work dominates the cost when sequences are long and most of the representational burden lives in the encoder.
  • Feedforward hidden size is expanded to give each byte more capacity.

The result: on mT5's multilingual benchmark suite, ByT5 matches or beats token-level mT5 on noisy, low-resource, and character-sensitive tasks (transliteration, word-level perturbations, rare-language QA), while being slightly worse on standard English text tasks. The multilingual fairness gain is the main prize: every script is encoded at the same ~1 byte per character, so low-resource languages no longer pay a tokenization tax.

MegaByte: Hierarchical Patches for Long Sequences

MegaByte (Yu et al. 2023, NeurIPS, arXiv:2305.07185) addressed the scaling problem directly: the quadratic cost of attention makes byte-level modeling at context lengths above a few thousand bytes prohibitive. The MegaByte architecture introduces a two-level hierarchy.

Definition

Patch

A fixed-length window of $P$ contiguous bytes. A byte-level sequence of length $N$ is partitioned into $N/P$ patches. Each patch is represented by a single vector (the concatenation or projection of its byte embeddings).

The MegaByte architecture has three components:

  1. Patch embedder. Each patch of $P$ bytes is embedded into a single patch vector by concatenating the $P$ byte embeddings and projecting to dimension $d$.
  2. Global transformer. Operates on the sequence of $N/P$ patch vectors, using standard self-attention. This is the expensive part, but the sequence is $P$ times shorter than the byte sequence.
  3. Local model. A small transformer predicts bytes within a patch, conditioned on the global model's output for that patch. This is cheap because it operates on $P$ bytes at a time with no cross-patch attention.

The compute budget shifts: attention cost becomes $\Theta((N/P)^2 d)$ globally plus $\Theta(N \cdot P \cdot d_{\text{local}})$ locally, which is linear in $N$ when $d_{\text{local}}$ is small and $P$ is constant. For $P = 8$ and typical local-model sizes, MegaByte models sequences of $10^6$ bytes at compute budgets where a vanilla byte-level transformer would handle a few thousand bytes.
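The attention-cost shift can be checked with the cost model above. A sketch under the same $\Theta$-level approximations; the sizes ($d_{\text{local}} = d/8$, $P = 8$) are illustrative choices, not values from the paper:

```python
def vanilla_attn_flops(n: int, d: int) -> float:
    # Vanilla byte-level transformer: one attention over all N bytes.
    return n * n * d

def megabyte_attn_flops(n: int, p: int, d: int, d_local: int) -> float:
    # Global: attention over N/P patch vectors of dimension d.
    # Local: attention within each P-byte patch, N/P patches in total.
    n_patches = n // p
    return n_patches ** 2 * d + n_patches * (p * p * d_local)

n, d = 1_000_000, 4096            # one million bytes of context
p, d_local = 8, 4096 // 8         # illustrative hierarchy sizes
speedup = vanilla_attn_flops(n, d) / megabyte_attn_flops(n, p, d, d_local)
print(f"{speedup:.0f}x")          # global attention shrinks by ~P^2
```

The speedup lands just under $P^2 = 64$ because the cheap local term adds a small overhead on top of the $P^2$-reduced global term.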

Byte Latent Transformer: Patches That Vary With Content

BLT (Pagnoni et al. 2024, arXiv:2412.09871) keeps the hierarchical idea but makes patch boundaries content-dependent. The insight: uniform $P$-byte patches waste compute on predictable stretches (common English words, where the next byte has low entropy) and under-serve unpredictable stretches (rare words, numbers, code).

Definition

Entropy-Based Patching

A small entropy model $h(y \mid y_{<t})$ estimates the conditional entropy of the next byte given the prefix. A new patch starts at position $t$ whenever $h(y_t \mid y_{<t})$ exceeds a threshold $\theta$: high-entropy points become patch boundaries. Patches therefore vary in length: short in unpredictable regions, long in predictable ones.
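The segmentation rule itself is simple; in BLT the entropy estimates come from a small trained byte LM. A sketch of the thresholding logic, with the entropy model replaced by an arbitrary callable (the `toy_entropy` heuristic below is a stand-in, not anything from the paper):

```python
def entropy_patches(byte_seq: bytes, next_byte_entropy, threshold: float):
    """Split a byte sequence into variable-length patches, opening a new
    patch wherever the estimated next-byte entropy exceeds the threshold.
    `next_byte_entropy(prefix)` stands in for BLT's small entropy LM."""
    patches, current = [], bytearray()
    for i, b in enumerate(byte_seq):
        if current and next_byte_entropy(byte_seq[:i]) > threshold:
            patches.append(bytes(current))  # high entropy: close the patch
            current = bytearray()
        current.append(b)
    if current:
        patches.append(bytes(current))
    return patches

# Toy stand-in: pretend the model is surprised at the start of each word.
toy_entropy = lambda prefix: 3.0 if prefix.endswith(b" ") else 0.5
print(entropy_patches(b"the quick fox", toy_entropy, threshold=1.0))
# -> [b'the ', b'quick ', b'fox']
```

With a real entropy model, common words collapse into long patches while rare words, digits, and code fragment into short ones, which is exactly the compute allocation the text describes.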

Like MegaByte, BLT runs the expensive global transformer over patches and a smaller local model within patches. Unlike MegaByte, the patch count and boundaries depend on the input. The empirical finding: at matched compute, BLT matches Llama 3 tokenizer-based performance on English benchmarks while degrading less under typos, performing better at character-level manipulation tasks, and closing the gap across languages.

Pagnoni et al. argue that BLT breaks the subword-tokenizer Pareto frontier: prior byte-level models traded quality for noise tolerance and language fairness, while BLT keeps both properties without the quality loss. This claim rests on a specific compute budget and benchmark set; long-horizon comparisons at frontier scale are still active research as of 2026.

Why Byte-Level Matters Beyond Convenience

Multilingual fairness. Tokenizer-based models charge low-resource languages more tokens per character, which translates to smaller effective context windows and worse learning dynamics. Byte-level models charge every language the same per character.

Stability under perturbation. A single typo can change tokenization dramatically ("hello" → "h3llo" may go from 1 token to 5). Byte-level models see a one-byte change and handle it gracefully.
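The byte-level half of this claim is easy to see directly (how a given subword tokenizer fragments "h3llo" depends on that tokenizer's vocabulary, so only the byte view is shown):

```python
# At the byte level, "hello" -> "h3llo" is a single-position substitution;
# the rest of the input representation is untouched.
a, b = "hello".encode("utf-8"), "h3llo".encode("utf-8")
diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
assert len(a) == len(b) and diff == [1]  # exactly one byte changed
```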

Arithmetic. Digit-group tokenization ("123" as one token, "1234" as a different token) is a known source of LLM arithmetic errors. Byte-level models see digits one at a time and can in principle learn digit-wise algorithms.

Byte-aware tasks. Spelling correction, morphological analysis, cryptography, and any task that depends on sub-word structure is natively accessible to byte-level models; token-based models must reconstruct character-level information from the token representation.

Eliminating a pipeline stage. No tokenizer training, no vocabulary decisions, no pre-tokenization language handling. One less moving part in the training pipeline.

When Byte-Level Is Not the Answer

Short-context English-only deployments. If your model only serves English chat at 4K context, you pay $\beta \approx 4$ times more compute for a noise-tolerance and fairness benefit you never exercise.

Very large context windows on standard hardware. Vanilla byte-level models blow up at long context. You need the hierarchical structure of MegaByte or BLT, which is substantially more complex to implement and less battle-tested than plain decoder-only transformers.

Tight latency budgets with small batch sizes. Byte-level sequences are longer, so even if each decode step costs no more, a string takes roughly $\beta$ times as many steps. For an interactive chat turn of 200 generated tokens, byte-level decode produces about 800 bytes and may be slower end-to-end.

Common Confusions

Watch Out

Byte-level is not the same as character-level

A Unicode character can span 1 to 4 UTF-8 bytes. Byte-level models see the 1-4 byte sequence; character-level models (CANINE, Charformer) group bytes into characters before processing. The distinction matters for scripts with multi-byte characters: a Chinese character is 3 bytes in UTF-8, so a byte-level model needs 3 steps per character while a character-level model needs 1. Most "byte-level" papers in modern ML (ByT5, MegaByte, BLT) are true byte-level, not character-level.
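The 1-to-4-byte spread is easy to verify with the standard library:

```python
# Characters span 1 to 4 bytes in UTF-8, so byte-level and character-level
# sequence lengths diverge by script.
for ch in ["a", "é", "中", "😀"]:
    print(ch, len(ch.encode("utf-8")))
# "a" is 1 byte, "é" 2, "中" 3, "😀" 4 -- a byte-level model spends that
# many steps per character; a character-level model spends one.
```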

Watch Out

Byte-level does not remove the need for special tokens

You still need PAD, EOS, BOS, and any control tokens the task requires. The vocabulary is 256 bytes plus these extras; the savings come from not needing tens of thousands of subword tokens, not from having zero tokens. ByT5 uses 256 bytes plus a handful of special tokens (PAD, EOS, UNK), compared to 32,000+ for a typical BPE tokenizer.
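The arithmetic is worth making concrete. The ID layout below follows ByT5's convention (special tokens first, byte values offset after them); the special-token names are illustrative:

```python
# A byte-level vocabulary: 256 byte values plus a few reserved control IDs.
SPECIALS = ["<pad>", "</s>", "<unk>"]   # ByT5-style; names are illustrative
VOCAB_SIZE = 256 + len(SPECIALS)        # = 259, vs. 32,000+ for typical BPE

def byte_to_id(b: int) -> int:
    # Specials occupy IDs 0..2, so byte value b maps to ID b + 3.
    return b + len(SPECIALS)

assert VOCAB_SIZE == 259
assert byte_to_id(0) == 3 and byte_to_id(255) == 258
```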

Watch Out

Hierarchical patching does not change the byte-level property

MegaByte and BLT still process raw bytes at input and output. The global transformer operates on summaries of byte patches, but the model's input and output interface is bytes. The pipeline is tokenizer-free even though internally there is a patch step. This is different from "BPE with extra steps": the patch boundaries are learned from the data (BLT) or fixed by geometry (MegaByte), not derived from corpus statistics of a separate training phase.

Watch Out

Byte compute blowup is not quadratic-in-beta in practice

The quadratic term $\beta^2$ applies when attention dominates. Most frontier transformers use enough feedforward capacity that the feedforward term is comparable to the attention term, so the realized byte-level compute ratio is between $\beta$ and $\beta^2$, often 3 to 8 times rather than 16 times. Hierarchical byte models break this entirely by making attention subquadratic in byte length.

Summary

  • Byte-level models feed raw UTF-8 bytes to the transformer with a vocabulary of ~259 entries, eliminating the tokenizer
  • The naive compute cost is between $\beta$ and $\beta^2$ times a subword transformer on the same text, where $\beta$ is the bytes-per-token ratio of the subword tokenizer (typically 3-4 for English)
  • ByT5 uses a standard T5 architecture with rebalanced encoder/decoder depth; it wins on multilingual and noisy-text tasks, loses slightly on clean English
  • MegaByte introduces hierarchical patches with fixed patch size $P$, reducing global-attention cost by a factor of $P^2$ and enabling million-byte contexts
  • BLT uses entropy-based dynamic patches: boundaries at high-entropy byte positions, long patches in predictable stretches, short patches in unpredictable ones
  • Byte-level models are fair across languages, resilient to typos and perturbations, and good at character-sensitive tasks; they cost more compute per string on clean English and are less production-tested

Exercises

ExerciseCore

Problem

A transformer trained on English has bytes-per-token $\beta = 4$ with its BPE tokenizer. A user asks it to process the same text at byte level using the same architecture. Sequence length goes from $N = 4096$ tokens to $N_{\text{byte}} = 16{,}384$ bytes. The model dimension is $d = 4096$.

(a) Compute the attention FLOPs for both cases.

(b) Compute the feedforward FLOPs for both cases (use $4 N d^2$ as the approximation, which accounts for up/down projections).

(c) What is the overall byte-to-token compute ratio?

ExerciseAdvanced

Problem

MegaByte uses patches of size $P$. Let the global transformer operate on $N/P$ patch tokens with dimension $d$ and the local model operate on each patch of $P$ bytes with dimension $d_{\text{local}} \ll d$. The global transformer has $L$ layers and the local model has $L_{\text{local}}$ layers.

(a) Write the total forward-pass FLOPs as a function of $N$, $P$, $d$, $d_{\text{local}}$, $L$, $L_{\text{local}}$.

(b) If $d_{\text{local}} = d/8$ and $L_{\text{local}} = L/4$, compare the MegaByte cost to the cost of a vanilla byte-level transformer at the same $N$, $d$, $L$. For what value of $P$ does MegaByte halve the cost?

References

Canonical:

  • Xue et al., ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models (TACL 2022, arXiv:2105.13626). Sections 2-4 (architecture rebalancing, encoder-decoder depth changes) and Table 7 (multilingual benchmark comparisons).
  • Yu et al., MegaByte: Predicting Million-byte Sequences with Multiscale Transformers (NeurIPS 2023, arXiv:2305.07185). Sections 3-4 (patch embedder, global and local models) and Table 1 (scaling to $10^6$-byte context).
  • Pagnoni et al., Byte Latent Transformer: Patches Scale Better Than Tokens (2024, arXiv:2412.09871). Sections 3-4 (entropy-based patching, BLT architecture) and Table 3 (matched-compute comparison with Llama 3 tokenizer baseline).

Current:

  • Clark et al., CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation (TACL 2022, arXiv:2103.06874). Character-level (not byte-level) counterpart with hash-based character embeddings.
  • Tay et al., Charformer: Fast Character Transformers via Gradient-based Subword Tokenization (ICLR 2022, arXiv:2106.12672). Learned soft tokenization at the character level.
  • Graves, Generating Sequences With Recurrent Neural Networks (2013, arXiv:1308.0850). Early byte-level character RNNs; conceptual ancestor of the modern work.
  • Kaplan et al., Scaling Laws for Neural Language Models (2020, arXiv:2001.08361). The scaling-law framework that byte-level and tokenized comparisons implicitly appeal to.

Last reviewed: April 2026
