Byte-Level Language Models
Skip the tokenizer and feed raw bytes to the model. ByT5, MegaByte, and Byte Latent Transformer: why operating on bytes is attractive, why it is expensive, and how hierarchical patching closes the compute gap.
Why This Matters
Every standard LLM starts with a tokenizer that converts text into subword units drawn from a fixed vocabulary. Tokenization compresses well (short sequences, high bytes-per-token) but carries a long tail of failure modes: arithmetic errors caused by digit-group tokenization, multilingual unfairness (a Burmese character can cost 3 to 5 tokens while an English word costs 1), brittle behavior on typos and rare scripts, and the need to pick a vocabulary before any training.
Byte-level models dispose of the tokenizer entirely. The input sequence is the raw UTF-8 byte stream. The vocabulary is fixed at 256 (plus a few control tokens). Every piece of text has exactly one byte-level encoding, and every Unicode string the model might encounter is representable.
The price is sequence length. A typical subword token is 3-4 bytes of UTF-8 text, so a byte-level sequence is 3-4 times longer than the equivalent tokenized sequence. Under standard transformer scaling, quadratic-in-sequence attention and linear-in-sequence feedforward make byte-level models substantially more expensive per string. Three generations of work (ByT5, MegaByte, Byte Latent Transformer) close that gap through hierarchical processing: small cheap computations at the byte level, expensive transformer computation at a coarser patch level.
The Byte-Level Compute Problem
Bytes Per Token
For a trained tokenizer and a reference corpus, the average number of UTF-8 bytes represented by one vocabulary token: β = (total UTF-8 bytes) / (total tokens). Typical values: English with a 32K BPE tokenizer, β ≈ 4; Chinese with the same tokenizer, β ≈ 2-3; Burmese, β near 1 (a single multi-byte character costs several tokens, so the tokenizer barely compresses at all).
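The byte side of this ratio is easy to measure directly; a short check of UTF-8 cost per character (the sample strings are arbitrary) shows why one tokenizer treats scripts so unevenly:

```python
# UTF-8 cost per character varies by script, which is what makes
# bytes-per-token (beta) so uneven across languages for a single tokenizer.
for text in ["hello", "你好", "မင်္ဂလာပါ"]:
    n_chars = len(text)
    n_bytes = len(text.encode("utf-8"))
    print(f"{text!r}: {n_chars} chars, {n_bytes} bytes "
          f"({n_bytes / n_chars:.1f} bytes/char)")
```

ASCII text costs 1 byte per character, while CJK and Burmese characters cost 3 each, before any tokenizer inefficiency is layered on top.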
Compute Ratio: Byte-Level vs. Subword
Statement
Let the subword-level transformer process a sequence of L tokens and the byte-level transformer process the same text as βL bytes. With identical architecture (same d, same number of layers), the forward-pass compute ratio is:

R = β(βL + 4d) / (L + 4d)

In the attention-dominated regime (L ≫ d), this approaches β². In the feedforward-dominated regime (d ≫ L), this approaches β. For typical modern tokenizers β ≈ 4, so byte-level inference on the same architecture costs between roughly 4 and 16 times more FLOPs per string.
Intuition
Attention is quadratic in sequence length, so quadrupling the sequence (β = 4) quadruples each side of the attention matrix, costing 16 times more. Feedforward is linear in sequence length, so it only costs 4 times more. Real models have both; the ratio interpolates depending on which term dominates. Long contexts push the ratio toward β²; short contexts and wide models push it toward β.
Proof Sketch
FLOPs per layer of a vanilla transformer: attention ≈ 2L²d (computing QKᵀ and AV) plus feedforward ≈ 8Ld². Substitute L → βL: the ratio of total FLOPs is (2β²L²d + 8βLd²) / (2L²d + 8Ld²). Factor 2Ld out of numerator and denominator to get β(βL + 4d) / (L + 4d). For L ≫ d the ratio approaches β²; for d ≫ L it approaches β.
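The sketch can be verified numerically with the same per-layer approximations (2L²d attention, 8Ld² feedforward); the helper names are ours:

```python
def flops_per_layer(L: int, d: int) -> int:
    attn = 2 * L**2 * d   # QK^T and AV, each ~L^2 d multiply-adds
    ffn = 8 * L * d**2    # up/down projections with hidden size 4d
    return attn + ffn

def byte_ratio(L: int, d: int, beta: float = 4.0) -> float:
    """FLOP ratio of byte-level vs subword on the same architecture."""
    return flops_per_layer(int(beta * L), d) / flops_per_layer(L, d)

# Feedforward-dominated regime (d >> L): ratio near beta = 4
print(byte_ratio(L=512, d=8192))
# Attention-dominated regime (L >> d): ratio near beta^2 = 16
print(byte_ratio(L=200_000, d=1024))
```

The two printed values bracket the realized cost of any intermediate configuration.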
Why It Matters
This is the core design constraint. If you run bytes through the same model as tokens, you pay 4 to 16 times the compute for English. That is a steep price, and it is why naive byte-level training at frontier scale does not exist: no one has spent a frontier-scale FLOP budget on a byte-level model with an architecture identical to a tokenized one. The hierarchical approaches (MegaByte, BLT) exist precisely to break this ratio.
Failure Mode
The bound is tight only when the architecture is held fixed. In practice byte-level models often shrink other dimensions: ByT5 uses a deeper encoder and shallower decoder than T5 to compensate. Hierarchical byte models change the architecture outright, so the compute ratio can be brought near or even below 1 at the cost of a more complex training recipe. The theorem describes the naive "change nothing else" case; it is not a hard limit on byte-level models.
ByT5: Bytes Through a T5
ByT5 (Xue et al. 2022, TACL, arXiv:2105.13626) was the first large-scale demonstration that feeding raw bytes to a standard transformer is practical. The recipe keeps the T5 encoder-decoder architecture unchanged and replaces the SentencePiece tokenizer with a direct UTF-8 byte encoder, adding only three reserved tokens (PAD, EOS, UNK).
To offset the byte-length blowup, ByT5 adjusts the parameter budget:
- Encoder depth is increased (more layers) while decoder depth is reduced, because encoder work dominates the cost when sequences are long and most of the representational burden lives in the encoder.
- Feedforward hidden size is expanded to give each byte more capacity.
The result: on mT5's multilingual benchmark suite, ByT5 matches or beats token-level mT5 on noisy, low-resource, and character-sensitive tasks (transliteration, word-level perturbations, rare-language QA), while being slightly worse on standard English text tasks. The multilingual fairness gain is the main prize: every script is charged its raw UTF-8 cost of 1 to 4 bytes per character, so low-resource languages no longer pay a tokenization tax.
MegaByte: Hierarchical Patches for Long Sequences
MegaByte (Yu et al. 2023, NeurIPS, arXiv:2305.07185) addressed the scaling problem directly: the quadratic cost of attention makes byte-level modeling at context lengths above a few thousand bytes prohibitive. The MegaByte architecture introduces a two-level hierarchy.
Patch
A fixed-length window of P contiguous bytes. A byte-level sequence of length L is partitioned into L/P patches. Each patch is represented by a single vector (the concatenation or projection of its byte embeddings).
The MegaByte architecture has three components:
- Patch embedder. Each patch of P bytes is embedded into a single patch vector by concatenating the P byte embeddings and projecting to the global model dimension.
- Global transformer. Operates on the sequence of L/P patch vectors, using standard self-attention. This is the expensive part, but the sequence is P times shorter than the byte sequence.
- Local model. A small transformer predicts the bytes within a patch, conditioned on the global model's output for that patch. This is cheap because it operates on only P bytes at a time with no cross-patch attention.
The compute budget shifts: attention cost becomes O(L²/P²) for the global model plus O(LP) for the local model, far below the O(L²) of a flat byte-level transformer; the quadratic term shrinks by a factor of P² and the local term is linear in L for fixed P. For P = 8 and typical local-model sizes, MegaByte models sequences of around a million bytes at compute budgets where a vanilla byte-level transformer would handle a few thousand bytes.
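A minimal sketch of the fixed-size patch embedder, with illustrative dimensions and a random projection standing in for learned weights (not the paper's exact implementation):

```python
import numpy as np

P, d_byte, d_global = 8, 64, 512
rng = np.random.default_rng(0)

byte_embed = rng.normal(size=(256, d_byte))           # per-byte embedding table
patch_proj = rng.normal(size=(P * d_byte, d_global))  # concat -> patch vector

def embed_patches(byte_seq: bytes) -> np.ndarray:
    """Partition bytes into fixed P-byte patches, embed each as one vector."""
    assert len(byte_seq) % P == 0
    ids = np.frombuffer(byte_seq, dtype=np.uint8).reshape(-1, P)
    per_byte = byte_embed[ids]               # (n_patches, P, d_byte)
    concat = per_byte.reshape(len(ids), -1)  # (n_patches, P * d_byte)
    return concat @ patch_proj               # (n_patches, d_global)

patches = embed_patches(b"hello world, hello bytes!!!!!!!!")  # 32 bytes -> 4 patches
assert patches.shape == (4, d_global)
```

The global transformer would then attend over these 4 patch vectors instead of the 32 raw bytes.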
Byte Latent Transformer: Patches That Vary With Content
BLT (Pagnoni et al. 2024, arXiv:2412.09871) keeps the hierarchical idea but makes patch boundaries content-dependent. The insight: uniform P-byte patches waste compute on predictable stretches (common English words, where the next byte has low entropy) and under-serve unpredictable stretches (rare words, numbers, code).
Entropy-Based Patching
A small entropy model estimates the conditional entropy H(x_t | x_<t) of the next byte given the prefix. A new patch starts at position t whenever H(x_t | x_<t) exceeds a threshold θ: high-entropy points become patch boundaries. Patches therefore vary in length: short in unpredictable regions, long in predictable ones.
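The boundary rule itself is a few lines once per-position entropies exist; the entropy values and threshold below are invented for illustration (BLT's real entropy model is a small learned byte LM):

```python
def patch_boundaries(entropies: list[float], theta: float) -> list[int]:
    """Start a new patch wherever next-byte entropy exceeds theta.
    Returns patch lengths; position 0 always opens the first patch."""
    lengths, current = [], 0
    for t, h in enumerate(entropies):
        if t > 0 and h > theta:  # high-entropy point -> patch boundary
            lengths.append(current)
            current = 0
        current += 1
    lengths.append(current)
    return lengths

# A predictable run (low entropy) stays in one long patch; a surprising
# byte (entropy spike) opens a new, short patch.
H = [3.1, 0.2, 0.1, 0.1, 0.2, 2.9, 0.3, 0.1]
print(patch_boundaries(H, theta=2.0))  # -> [5, 3]
```

Unlike MegaByte's fixed P, the patch count here depends on the input, so compute follows the difficulty of the text.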
Like MegaByte, BLT runs the expensive global transformer over patches and a smaller local model within patches. Unlike MegaByte, the patch count and boundaries depend on the input. The empirical finding: at matched compute, BLT matches Llama 3 tokenizer-based performance on English benchmarks while degrading less under typos, performing better at character-level manipulation tasks, and closing the gap across languages.
Pagnoni et al. argue that BLT breaks the subword-tokenizer Pareto frontier: prior byte-level models traded quality for noise tolerance and language fairness, while BLT keeps both properties without the quality loss. This claim rests on a specific compute budget and benchmark set; long-horizon comparisons at frontier scale are still active research as of 2026.
Why Byte-Level Matters Beyond Convenience
Multilingual fairness. Tokenizer-based models charge low-resource languages more tokens per character, which translates to smaller effective context windows and worse learning dynamics. Byte-level models charge every language the same per character.
Stability under perturbation. A single typo can change tokenization dramatically ("hello" → "h3llo" may go from 1 token to 5). Byte-level models see a one-byte change and handle it gracefully.
Arithmetic. Digit-group tokenization ("123" as one token, "1234" as a different token) is a known source of LLM arithmetic errors. Byte-level models see digits one at a time and can in principle learn digit-wise algorithms.
Byte-aware tasks. Spelling correction, morphological analysis, cryptography, and any task that depends on sub-word structure is natively accessible to byte-level models; token-based models must reconstruct character-level information from the token representation.
Eliminating a pipeline stage. No tokenizer training, no vocabulary decisions, no pre-tokenization language handling. One less moving part in the training pipeline.
When Byte-Level Is Not the Answer
Short-context English-only deployments. If your model only serves English chat at 4K context, you pay roughly 4 to 16 times more compute for a noise-tolerance and fairness benefit you never exercise.
Very large context windows on standard hardware. Vanilla byte-level models blow up at long context. You need the hierarchical structure of MegaByte or BLT, which is substantially more complex to implement and less battle-tested than plain decoder-only transformers.
Tight latency budgets with small batch sizes. Byte-level sequences are longer, so each decode step does less work but there are about β times more of them, and per-string latency is higher. For an interactive chat turn of 200 generated tokens, byte-level decode produces roughly 800 bytes in roughly 800 steps and may be slower end-to-end.
Common Confusions
Byte-level is not the same as character-level
A Unicode character can span 1 to 4 UTF-8 bytes. Byte-level models see the 1-4 byte sequence; character-level models (CANINE, Charformer) group bytes into characters before processing. The distinction matters for scripts with multi-byte characters: a Chinese character is 3 bytes in UTF-8, so a byte-level model needs 3 steps per character while a character-level model needs 1. Most "byte-level" papers in modern ML (ByT5, MegaByte, BLT) are true byte-level, not character-level.
Byte-level does not remove the need for special tokens
You still need PAD, EOS, BOS, and any control tokens the task requires. The vocabulary is 256 bytes plus these extras; the savings come from not needing tens of thousands of subword tokens, not from having zero tokens. ByT5 uses 256 bytes plus a handful of special tokens (PAD, EOS, UNK), compared to 32,000+ for a typical BPE tokenizer.
Hierarchical patching does not change the byte-level property
MegaByte and BLT still process raw bytes at input and output. The global transformer operates on summaries of byte patches, but the model's input and output interface is bytes. The pipeline is tokenizer-free even though internally there is a patch step. This is different from "BPE with extra steps": the patch boundaries are learned from the data (BLT) or fixed by geometry (MegaByte), not derived from corpus statistics of a separate training phase.
Byte compute blowup is not quadratic-in-beta in practice
The quadratic β² term applies only when attention dominates. Most frontier transformers carry enough feedforward capacity that the feedforward term is comparable to the attention term, so the realized byte-level compute ratio sits between β and β², often 3 to 8 times rather than the full 16. Hierarchical byte models break the relation entirely by making attention subquadratic in byte length.
Summary
- Byte-level models feed raw UTF-8 bytes to the transformer with a vocabulary of ~259 entries, eliminating the tokenizer
- The naive compute cost is between β and β² times that of a subword transformer on the same text, where β is the bytes-per-token ratio of the subword tokenizer (typically 3-4 for English)
- ByT5 uses a standard T5 architecture with rebalanced encoder/decoder depth; it wins on multilingual and noisy-text tasks, loses slightly on clean English
- MegaByte introduces hierarchical patches with fixed patch size P, reducing global-attention cost by a factor of P² and enabling million-byte contexts
- BLT uses entropy-based dynamic patches: boundaries at high-entropy byte positions, long patches in predictable stretches, short patches in unpredictable ones
- Byte-level models are fair across languages, resilient to typos and perturbations, and good at character-sensitive tasks; they cost more compute per string on clean English and are less production-tested
Exercises
Problem
A transformer trained on English has bytes-per-token β = 4 with its BPE tokenizer. A user asks it to process the same text at byte level using the same architecture. Sequence length goes from L = 2048 tokens to βL = 8192 bytes. The model dimension is d = 4096.
(a) Compute the attention FLOPs for both cases (use 2L²d per layer).
(b) Compute the feedforward FLOPs for both cases (use 8Ld² as the approximation, which accounts for the up and down projections with hidden size 4d).
(c) What is the overall byte-to-token compute ratio?
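A quick numeric harness for checking your answers, assuming the per-layer approximations 2L²d and 8Ld² and the concrete values β = 4, L = 2048, d = 4096:

```python
beta, L, d = 4, 2048, 4096  # assumed exercise values

def attn_flops(n: int) -> int:
    return 2 * n**2 * d       # QK^T plus AV per layer

def ffn_flops(n: int) -> int:
    return 8 * n * d**2       # up/down projections with hidden size 4d

tok = attn_flops(L) + ffn_flops(L)
byt = attn_flops(beta * L) + ffn_flops(beta * L)
print(f"(a) attention:   {attn_flops(L):.3e} vs {attn_flops(beta * L):.3e}")
print(f"(b) feedforward: {ffn_flops(L):.3e} vs {ffn_flops(beta * L):.3e}")
print(f"(c) overall ratio: {byt / tok:.2f}")  # lands between beta and beta^2
```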
Problem
MegaByte uses patches of size P. Let the global transformer operate on L/P patch tokens with dimension d_g and the local model operate on each patch of P bytes with dimension d_l. The global transformer has N_g layers and the local model has N_l layers.
(a) Write the total forward-pass FLOPs as a function of L, P, d_g, d_l, N_g, N_l.
(b) If d_l = d_g / 2 and N_l = N_g / 2, compare the MegaByte cost to the cost of a vanilla byte-level transformer at the same L, d_g, N_g. For what value of P does MegaByte halve the cost?
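A hedged harness for exploring part (b) numerically; the parameter values are illustrative, and the per-layer FLOP count reuses the 2n²d + 8nd² approximation from earlier:

```python
def layer_flops(n: int, d: int) -> int:
    return 2 * n**2 * d + 8 * n * d**2   # attention + feedforward per layer

def megabyte_flops(L, P, d_g, d_l, N_g, N_l):
    glob = N_g * layer_flops(L // P, d_g)        # global model on L/P patch tokens
    loc = N_l * (L // P) * layer_flops(P, d_l)   # local model run once per patch
    return glob + loc

def vanilla_flops(L, d, N):
    return N * layer_flops(L, d)

# Illustrative setup: d_l = d_g / 2, N_l = N_g / 2, one million bytes of context.
L, d_g, N_g = 1_000_000, 2048, 24
for P in (4, 8, 16, 32):
    ratio = (megabyte_flops(L, P, d_g, d_g // 2, N_g, N_g // 2)
             / vanilla_flops(L, d_g, N_g))
    print(f"P = {P:2d}: MegaByte / vanilla = {ratio:.4f}")
```

At million-byte context the attention term dominates, so even small P cuts the cost far below half.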
References
Canonical:
- Xue et al., ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models (TACL 2022, arXiv:2105.13626). Sections 2-4 (architecture rebalancing, encoder-decoder depth changes) and Table 7 (multilingual benchmark comparisons).
- Yu et al., MegaByte: Predicting Million-byte Sequences with Multiscale Transformers (NeurIPS 2023, arXiv:2305.07185). Sections 3-4 (patch embedder, global and local models) and Table 1 (scaling to byte context).
- Pagnoni et al., Byte Latent Transformer: Patches Scale Better Than Tokens (2024, arXiv:2412.09871). Sections 3-4 (entropy-based patching, BLT architecture) and Table 3 (matched-compute comparison with Llama 3 tokenizer baseline).
Current:
- Clark et al., CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation (TACL 2022, arXiv:2103.06874). Character-level (not byte-level) counterpart with hash-based character embeddings.
- Tay et al., Charformer: Fast Character Transformers via Gradient-based Subword Tokenization (ICLR 2022, arXiv:2106.12672). Learned soft tokenization at the character level.
- Graves, Generating Sequences With Recurrent Neural Networks (2013, arXiv:1308.0850). Early byte-level character RNNs; conceptual ancestor of the modern work.
- Kaplan et al., Scaling Laws for Neural Language Models (2020, arXiv:2001.08361). The scaling-law framework that byte-level and tokenized comparisons implicitly appeal to.
Next Topics
- Tokenization and Information Theory: the standard tokenizer pipeline that byte-level models replace.
- Transformer Architecture: the core block that ByT5 and the local/global models inside MegaByte and BLT reuse with modifications.
- Attention Mechanisms History: efficient-attention variants that byte-level models often combine with hierarchical patching.
Last reviewed: April 2026