LLM Construction
Structured Output and Constrained Generation
Forcing LLM output to conform to schemas via constrained decoding: logit masking for valid tokens, grammar-guided generation, and why constraining the search space can improve both reliability and quality.
Why This Matters
LLMs generate free-form text. Applications need structured data: JSON for APIs, SQL for databases, code that parses, function calls with correct argument types. The gap between "text that looks like JSON" and "valid JSON that matches a schema" is the difference between a demo and a production system.
Structured output techniques guarantee that every generated token sequence is valid according to a specified grammar or schema. This eliminates an entire class of failure modes (malformed output) and, counterintuitively, often improves the content quality of the output as well.
Mental Model
At each generation step, the LLM produces a probability distribution over the full vocabulary. Constrained decoding applies a mask that zeros out tokens that would make the output invalid at that point. The model can only choose from tokens that are legal continuations according to the grammar. This is like playing chess where illegal moves are removed from the board before you choose: you cannot make a mistake, and the reduced option set focuses your decision.
Formal Setup
Let $V$ be the vocabulary and $G$ be a context-free grammar (or schema) defining the set of valid outputs $L(G)$. At generation step $t$, the model has produced tokens $x_1, \dots, x_{t-1}$. The set of valid next tokens is:

$$V_t = \{\, v \in V : x_1 \cdots x_{t-1}\, v \text{ is a prefix of some string in } L(G) \,\}$$
Constrained Decoding
Constrained decoding modifies the generation process by applying a mask at each step:

$$\tilde{P}(v \mid x_{1:t-1}) = \frac{P(v \mid x_{1:t-1}) \cdot \mathbf{1}[v \in V_t]}{\sum_{u \in V_t} P(u \mid x_{1:t-1})}$$

Tokens not in $V_t$ receive zero probability, and the remaining probabilities are renormalized. The result is a valid probability distribution over only the legal continuations.
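The masking-and-renormalization step can be sketched on a toy vocabulary. The probabilities and the legal-token set below are made up for illustration; in a real system the distribution comes from the model and the legal set from the grammar.

```python
import numpy as np

# Toy vocabulary and a hypothetical next-token distribution from the model.
vocab = ['{', '}', '"', ':', ',']
p = np.array([0.05, 0.40, 0.30, 0.15, 0.10])  # sums to 1

# Suppose the grammar says only '"' or '}' are legal continuations here.
valid = {'"', '}'}
mask = np.array([tok in valid for tok in vocab])

# Zero out invalid tokens, then renormalize over the legal set V_t.
p_masked = p * mask
p_constrained = p_masked / p_masked.sum()

print(dict(zip(vocab, p_constrained.round(4))))
# '}' gets 0.40 / 0.70 ≈ 0.5714, '"' gets 0.30 / 0.70 ≈ 0.4286, all else 0.
```

Note that the relative ranking among legal tokens is preserved: masking never promotes a worse legal token above a better one, it only removes the illegal ones.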
Grammar-Guided Generation
Grammar-guided generation uses a formal grammar (regular expression, context-free grammar, or JSON schema compiled to a grammar) to compute $V_t$ at each step. The grammar is maintained as a parser state that advances with each generated token. At each position, the parser determines which tokens would lead to a valid parse continuation.
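A minimal way to see the "parser state that advances" idea is a hand-built DFA for the regular expression -?[0-9]+ (signed integers). The state machine, character-level tokens, and the <EOS> marker are simplifications for illustration; real systems compile the grammar automatically and work over the model's subword vocabulary.

```python
# Hand-built DFA for the regex -?[0-9]+, illustrating how a parser state
# determines the valid next-token set V_t at each position.
# States: 0 = start, 1 = just saw '-', 2 = saw at least one digit (accepting).
DIGITS = set('0123456789')

def valid_next(state):
    """The set V_t of tokens that keep the parse alive from this state."""
    if state == 0:
        return DIGITS | {'-'}
    if state == 1:
        return DIGITS              # a sign must be followed by a digit
    return DIGITS | {'<EOS>'}      # accepting: may stop, or continue with digits

def advance(state, tok):
    """Advance the parser state after emitting one token."""
    if tok in DIGITS:
        return 2
    if tok == '-' and state == 0:
        return 1
    raise ValueError(f"token {tok!r} is invalid in state {state}")

state = 0
for tok in '-42':
    assert tok in valid_next(state)  # constrained decoding guarantees this
    state = advance(state, tok)
print(sorted(valid_next(state)))     # digits plus '<EOS>': "-42" is complete
```

The decoder queries valid_next before every step and masks everything outside it; advance is the parser-state transition that the text describes.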
Main Theorems
Constrained Decoding Guarantees Valid Output
Statement
If $V_t$ is the exact set of tokens that can lead to a valid completion under grammar $G$, then constrained decoding produces a valid output with probability 1. Formally, for any sequence $x_1, \dots, x_T$ generated by constrained decoding with $x_t \in V_t$ for all $t$, the output is in the language $L(G)$.
Intuition
At every step, only tokens leading to valid completions are allowed. Since each step maintains validity, the full output is valid by induction. This is the same principle as a parser that rejects invalid input at the earliest possible point, but applied in reverse: we prevent invalid output at the earliest possible point.
Proof Sketch
By induction on $t$. Base case: $V_1$ contains only tokens that start a valid string in $L(G)$. Inductive step: if $x_1 \cdots x_{t-1}$ is a valid prefix (has at least one valid completion in $L(G)$), then any $x_t \in V_t$ maintains this property by definition of $V_t$. At the final step, the EOS token is in $V_t$ only if the current prefix is a complete valid string.
Why It Matters
Unconstrained generation of JSON from GPT-4 fails to parse approximately 2-5% of the time, depending on schema complexity. Constrained decoding reduces this to 0%. For production systems making millions of API calls, eliminating a 2% failure rate is significant.
Failure Mode
The guarantee is only as good as the grammar. If the grammar is more permissive than the intended schema (e.g., allows any string value when you need a specific enum), the output will be syntactically valid but semantically wrong. Constrained decoding guarantees form, not content.
Grammar Constraints Reduce Effective Search Space
Statement
At each generation step, unconstrained generation chooses from $|V|$ tokens (typically 32K-128K). Grammar-guided generation restricts this to $|V_t|$ tokens. For structured formats:
- JSON with a fixed schema: $|V_t|$ ranges from 1 (only a colon after a key) to $\sim|V|/10$ (inside a string value)
- SQL with a fixed table schema: $|V_t|$ averages $\sim$50 tokens at most positions
- Regular expressions: $|V_t|$ is determined by the automaton state
The reduction in effective search space concentrates the model's probability mass on semantically relevant tokens. Empirically, grammar-constrained generation on structured tasks matches or exceeds unconstrained generation quality even when the unconstrained output happens to be valid.
Intuition
When the model does not need to "waste" probability mass on syntactically impossible tokens (closing a brace before opening one, generating a comma where a colon is required), it effectively has more capacity to distinguish between the semantically meaningful options. The constraint removes noise from the decision problem.
Why It Matters
This explains the counterintuitive finding that constrained generation can improve quality, not just validity. By eliminating irrelevant options, the model's relative ranking of good vs. mediocre (but syntactically valid) completions becomes sharper.
Failure Mode
If the grammar is too restrictive, it can force the model into outputs it would not naturally produce, degrading quality. For example, if the schema requires an integer field but the correct answer requires a float, the grammar will force an incorrect integer.
Implementation Approaches
Token-level masking (Outlines, llama.cpp grammars). Compile the grammar into a state machine. At each step, query the state machine for valid next tokens, build a bitmask over the vocabulary, and apply it to logits before softmax. The cost per step is the state machine transition plus the mask application, typically negligible compared to the model forward pass.
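The mask-application step can be sketched directly on logits. Sending illegal logits to negative infinity before softmax is the standard trick: softmax then assigns them exactly zero probability. The logits and the legality bitmask below are made-up values; in practice the bitmask comes from the compiled state machine.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits over a 6-token vocabulary at one decoding step.
logits = np.array([2.0, 0.5, -1.0, 3.0, 0.0, 1.5])

# Bitmask from the grammar state machine: True = legal continuation.
legal = np.array([True, False, False, True, False, True])

# Apply the mask by sending illegal logits to -inf; exp(-inf) = 0, so
# softmax gives those tokens zero probability and sampling cannot pick them.
masked_logits = np.where(legal, logits, -np.inf)
probs = softmax(masked_logits)

assert probs[~legal].sum() == 0.0
assert abs(probs.sum() - 1.0) < 1e-9
```

The mask construction and the `np.where` are O(|V|) per step, which is negligible next to the model's forward pass, matching the cost claim above.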
Structured output APIs (OpenAI, Anthropic). The API accepts a JSON schema and guarantees the response conforms to it. The constraint enforcement happens server-side. The user sees only valid structured output.
Guided generation with retries. Generate unconstrained output, validate it, and retry on failure. This is simpler to implement but wastes compute on invalid generations and provides no formal guarantee of eventual success. With a 5% failure rate, the expected number of attempts is $1/(1 - 0.05) \approx 1.05$, which is manageable. With a 50% failure rate on complex schemas, retries become expensive.
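The retry approach reduces to a short loop. The fake generator below stands in for a real model call (it just replays canned outputs, two malformed then one valid), so the failure behavior is deterministic for illustration.

```python
import json

def make_flaky_generator(outputs):
    """Stand-in for an unconstrained model: replays canned outputs in order.
    (Hypothetical; a real system would call the LLM here.)"""
    it = iter(outputs)
    return lambda prompt: next(it)

def generate_json_with_retries(generate, prompt, max_retries=5):
    """Generate, validate, retry on parse failure. Simple, but wastes
    compute on invalid generations and offers no guarantee of success."""
    for _ in range(max_retries):
        text = generate(prompt)
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            continue  # wasted forward pass; try again
    raise RuntimeError(f"no valid JSON after {max_retries} attempts")

# Two malformed attempts, then a valid one: succeeds on the third try.
gen = make_flaky_generator(['{"a": 1,', 'not json', '{"a": 1}'])
result = generate_json_with_retries(gen, "return JSON")
assert result == {"a": 1}

# With independent per-attempt failure probability p, the expected number
# of attempts is 1 / (1 - p): ~1.05 for p = 0.05, but 2 for p = 0.5.
```

Note that json.loads only checks syntax; a production retry loop would also validate against the schema before accepting the output.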
Practical Considerations
Grammar compilation cost. Converting a JSON schema to a state machine is a one-time cost. For complex schemas with nested objects, arrays, and recursive types, the state machine can have thousands of states. This is acceptable because the compilation happens once per schema, not per request.
Interaction with sampling. Constrained decoding works with any sampling strategy: greedy, top-k, top-p, temperature scaling. The constraint is applied as a hard mask on logits before the sampling strategy. The sampling strategy only sees the valid tokens.
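The ordering matters: the hard mask is applied to the logits first, and the sampling strategy then operates only on what survives. A sketch with temperature scaling plus top-k (the logits and legality mask are made-up values):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_constrained(logits, legal, temperature=0.8, k=3):
    """Hard-mask illegal tokens first, then apply any sampling strategy
    (here: temperature scaling + top-k) over the surviving logits."""
    z = np.where(legal, logits / temperature, -np.inf)
    # Top-k: keep only the k highest-scoring of the remaining tokens.
    kth = np.sort(z)[-k]
    z = np.where(z >= kth, z, -np.inf)
    e = np.exp(z - z.max())
    p = e / e.sum()
    return rng.choice(len(logits), p=p)

logits = np.array([2.0, 0.1, 3.0, -1.0, 1.0, 0.5])
legal = np.array([True, True, False, True, True, True])  # token 2 is illegal

# Token 2 has the highest raw logit, yet it can never be sampled.
for _ in range(100):
    assert legal[sample_constrained(logits, legal)]
```

Because the mask runs before top-k, the k slots are filled only by legal tokens; applying top-k first could leave an all-illegal candidate set.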
Streaming. Constrained decoding is compatible with token-by-token streaming. Each token is guaranteed valid in context, so partial outputs can be streamed to the client. The client can parse partial JSON as it arrives.
Common Confusions
Constrained decoding is not fine-tuning
Constrained decoding modifies the inference-time sampling procedure without changing the model weights. The model is not retrained to produce structured output. The constraint is applied externally. Fine-tuning on structured data can improve the model's natural tendency to produce valid output, but constrained decoding provides the hard guarantee.
JSON mode is weaker than schema-constrained generation
JSON mode (as offered by some APIs) guarantees the output is valid JSON. Schema-constrained generation guarantees the output matches a specific JSON schema (correct field names, types, required fields). Valid JSON that does not match your schema is useless. Always prefer schema-level constraints when available.
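The gap between "valid JSON" and "matches my schema" is easy to demonstrate. The hand-rolled checker below is a deliberately minimal stand-in for a real JSON Schema validator:

```python
import json

schema_fields = {"name": str, "age": int}  # the intended schema, simplified

def matches_schema(obj):
    """Minimal schema check: exact field names with the right types.
    (A real system would use a full JSON Schema validator.)"""
    return (isinstance(obj, dict)
            and set(obj) == set(schema_fields)
            and all(isinstance(obj[k], t) for k, t in schema_fields.items()))

# Both strings are valid JSON, so JSON mode would accept either...
good = json.loads('{"name": "Alice", "age": 30}')
bad  = json.loads('{"Name": "Alice", "years": "30"}')

# ...but only the first conforms to the schema.
assert matches_schema(good)
assert not matches_schema(bad)
```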
Constrained decoding can still produce wrong content
A model constrained to output JSON matching the schema {"temperature": number} will always produce valid JSON with a numeric temperature field, but the number itself might be wrong. Constrained decoding guarantees syntax, not semantics: factual accuracy still depends on the model's knowledge and reasoning.
Summary
- Constrained decoding masks invalid tokens at each step, guaranteeing valid output
- Grammar-guided generation maintains a parser state to compute valid token sets
- The constraint reduces the effective search space, often improving content quality
- Token-level masking adds negligible overhead to the model forward pass
- Constrained decoding works with any sampling strategy (greedy, top-k, top-p)
- JSON mode guarantees valid JSON; schema mode guarantees conformance to a specific schema
- Syntax is guaranteed; semantics (correctness of content) is not
Exercises
Problem
A JSON schema requires output of the form {"name": string, "age": integer}. At the position immediately after {"name": "Alice", "age": , what tokens are in the valid set $V_t$? Assume the vocabulary includes digits 0-9, letters, quotes, braces, brackets, commas, colons, and the minus sign.
Problem
You are building a constrained decoding system for SQL generation. The grammar allows only SELECT queries on a database with tables users (columns: id, name, email) and orders (columns: id, user_id, amount, date). After the model generates SELECT name, email FROM users WHERE , what is $V_t$? Consider both column names and SQL keywords that can follow WHERE.
References
Canonical:
- Willard & Louf, "Efficient Guided Generation for Large Language Models" (2023). Outlines library
Current:
- OpenAI, "Structured Outputs" documentation (2024)
- GBNF grammar support in llama.cpp (2023-2024)
- Scholak et al., "PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models" (2021). SQL-constrained generation
Next Topics
The natural next steps from structured output:
- Tool-augmented reasoning: structured output enables reliable tool calling
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Transformer Architecture (Layer 4)
- Attention Mechanism Theory (Layer 4)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Softmax and Numerical Stability (Layer 1)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in R^n (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)