
LLM Construction

Structured Output and Constrained Generation

Forcing LLM output to conform to schemas via constrained decoding: logit masking for valid tokens, grammar-guided generation, and why constraining the search space can improve both reliability and quality.

Advanced · Tier 2 · Frontier · ~45 min

Why This Matters

LLMs generate free-form text. Applications need structured data: JSON for APIs, SQL for databases, code that parses, function calls with correct argument types. The gap between "text that looks like JSON" and "valid JSON that matches a schema" is the difference between a demo and a production system.

Structured output techniques guarantee that every generated token sequence is valid according to a specified grammar or schema. This eliminates an entire class of failure modes (malformed output) and, counterintuitively, often improves the content quality of the output as well.

Mental Model

At each generation step, the LLM produces a probability distribution over the full vocabulary. Constrained decoding applies a mask that zeros out tokens that would make the output invalid at that point. The model can only choose from tokens that are legal continuations according to the grammar. This is like playing chess where illegal moves are removed from the board before you choose: you cannot make a mistake, and the reduced option set focuses your decision.

Formal Setup

Let $V$ be the vocabulary and $G$ a context-free grammar (or schema) defining the set of valid outputs. At generation step $t$, the model has produced tokens $y_1, \ldots, y_{t-1}$. The set of valid next tokens is:

$$V_t = \{v \in V : \exists \text{ a valid completion of } y_1 \ldots y_{t-1} v \text{ under } G\}$$

Definition

Constrained Decoding

Constrained decoding modifies the generation process by applying a mask at each step:

$$p_{\text{constrained}}(y_t \mid y_{<t}) = \frac{p(y_t \mid y_{<t}) \cdot \mathbf{1}[y_t \in V_t]}{\sum_{v \in V_t} p(v \mid y_{<t})}$$

Tokens not in $V_t$ receive zero probability. The remaining probabilities are renormalized. The result is a valid probability distribution over only legal continuations.
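The renormalization above can be illustrated with a toy vocabulary. No real model or tokenizer is involved; the distribution and the valid set $V_t$ are made up for the example:

```python
import numpy as np

# Toy illustration of the constrained-decoding equation: mask tokens outside
# V_t, then renormalize so the result is a distribution over legal tokens only.
vocab = ["{", "}", ":", ",", '"', "0"]
p = np.array([0.05, 0.30, 0.20, 0.15, 0.20, 0.10])  # model's p(y_t | y_<t)

valid = {"}", ","}  # hypothetical V_t at this position
mask = np.array([tok in valid for tok in vocab])

p_constrained = p * mask
p_constrained /= p_constrained.sum()  # renormalize over V_t only

# Only "}" and "," retain probability mass; everything else is exactly zero.
print(dict(zip(vocab, p_constrained)))
```

Note that the relative ordering among valid tokens is preserved: masking and renormalizing never reranks the legal options, it only removes the illegal ones.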

Definition

Grammar-Guided Generation

Grammar-guided generation uses a formal grammar (regular expression, context-free grammar, or JSON schema compiled to a grammar) to compute $V_t$ at each step. The grammar is maintained as a parser state that advances with each generated token. At each position, the parser determines which tokens would lead to a valid parse continuation.
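As a concrete sketch, the following hand-rolls a DFA for the regular expression `-?[0-9]+` and reads $V_t$ off the current parser state. Real libraries such as Outlines operate on tokenizer vocabularies with multi-character tokens; this character-level version only illustrates the mechanism:

```python
# Minimal grammar-guided token selection: the "grammar" is -?[0-9]+ compiled
# by hand into a DFA. The parser state advances with each emitted token, and
# V_t is read off the current state's outgoing transitions.
DIGITS = set("0123456789")
EOS = "<eos>"

def valid_next(state: str) -> set[str]:
    """Return V_t: the tokens allowed from the current DFA state."""
    if state == "start":   # nothing emitted yet
        return {"-"} | DIGITS
    if state == "sign":    # just emitted '-'; a digit is now required
        return DIGITS
    if state == "digits":  # at least one digit seen; may stop or continue
        return DIGITS | {EOS}
    raise ValueError(f"unknown state: {state}")

def advance(state: str, token: str) -> str:
    assert token in valid_next(state), f"{token!r} not in V_t"
    return "sign" if token == "-" else "digits"

state = "start"
for tok in ["-", "4", "2"]:
    state = advance(state, tok)
# "-42" is a complete match, so EOS is now in V_t alongside further digits.
print(valid_next(state))
```

The key property: EOS only becomes legal once the prefix is a complete string of the language, which is exactly the final-step condition used in the proof sketch below.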

Main Theorems

Proposition

Constrained Decoding Guarantees Valid Output

Statement

If $V_t$ is the exact set of tokens that can lead to a valid completion under grammar $G$, then constrained decoding produces a valid output with probability 1. Formally, for any sequence $y_1, \ldots, y_T$ generated by constrained decoding with $p_{\text{constrained}}(y_t \mid y_{<t}) > 0$ for all $t$, the output $y_1 \ldots y_T$ is in the language $L(G)$.

Intuition

At every step, only tokens leading to valid completions are allowed. Since each step maintains validity, the full output is valid by induction. This is the same principle as a parser that rejects invalid input at the earliest possible point, but applied in reverse: we prevent invalid output at the earliest possible point.

Proof Sketch

By induction on $t$. Base case: $V_1$ contains only tokens that start a valid string in $L(G)$. Inductive step: if $y_1 \ldots y_{t-1}$ is a valid prefix (has at least one valid completion in $L(G)$), then any $y_t \in V_t$ maintains this property by definition of $V_t$. At the final step, the EOS token is only in $V_T$ if the current prefix is a complete valid string.

Why It Matters

Unconstrained generation of JSON from GPT-4 fails to parse approximately 2-5% of the time, depending on schema complexity. Constrained decoding reduces this to 0%. For production systems making millions of API calls, eliminating a 2% failure rate is significant.

Failure Mode

The guarantee is only as good as the grammar. If the grammar is more permissive than the intended schema (e.g., allows any string value when you need a specific enum), the output will be syntactically valid but semantically wrong. Constrained decoding guarantees form, not content.

Proposition

Grammar Constraints Reduce Effective Search Space

Statement

At each generation step, unconstrained generation chooses from $|V|$ tokens (typically 32K-128K). Grammar-guided generation restricts this to $|V_t|$ tokens. For structured formats:

  • JSON with a fixed schema: $|V_t|$ ranges from 1 (only a colon can follow a key) to roughly $|V|/10$ (inside a string value)
  • SQL with a fixed table schema: $|V_t|$ averages roughly 50-200 tokens at most positions
  • Regular expressions: $|V_t|$ is determined by the current automaton state

The reduction in effective search space concentrates the model's probability mass on semantically relevant tokens. Empirically, grammar-constrained generation on structured tasks matches or exceeds unconstrained generation quality even when the unconstrained output happens to be valid.

Intuition

When the model does not need to "waste" probability mass on syntactically impossible tokens (closing a brace before opening one, generating a comma where a colon is required), it effectively has more capacity to distinguish between the semantically meaningful options. The constraint removes noise from the decision problem.

Why It Matters

This explains the counterintuitive finding that constrained generation can improve quality, not just validity. By eliminating irrelevant options, the model's relative ranking of good vs. mediocre (but syntactically valid) completions becomes sharper.

Failure Mode

If the grammar is too restrictive, it can force the model into outputs it would not naturally produce, degrading quality. For example, if the schema requires an integer field but the correct answer requires a float, the grammar will force an incorrect integer.

Implementation Approaches

Token-level masking (Outlines, llama.cpp grammars). Compile the grammar into a state machine. At each step, query the state machine for valid next tokens, build a bitmask over the vocabulary, and apply it to logits before softmax. The cost per step is the state machine transition plus the mask application, typically negligible compared to the model forward pass.
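The masking step can be sketched as a logits processor in the style of the HuggingFace `transformers` interface; the class name and the `valid_token_ids_fn` callable here are illustrative, not a drop-in for any specific library:

```python
import numpy as np

# Sketch of token-level masking: the grammar state machine is abstracted as a
# callable that maps the generated prefix to the set of valid token ids. The
# processor adds -inf to every illegal logit before softmax, which is
# equivalent to the zero-probability mask in the definition above.
class GrammarLogitsMask:
    def __init__(self, valid_token_ids_fn):
        self.valid_token_ids_fn = valid_token_ids_fn  # state machine query

    def __call__(self, generated_ids, logits):
        allowed = self.valid_token_ids_fn(generated_ids)
        mask = np.full_like(logits, -np.inf)
        mask[list(allowed)] = 0.0
        return logits + mask  # illegal tokens get -inf, i.e. zero probability

# Toy usage: a 5-token vocabulary where only token ids {1, 3} are legal next.
processor = GrammarLogitsMask(lambda ids: {1, 3})
logits = np.array([2.0, 1.0, 0.5, 3.0, -1.0])
masked = processor([], logits)
probs = np.exp(masked) / np.exp(masked).sum()  # softmax over legal tokens only
```

Because the mask is applied to logits before softmax, it composes cleanly with temperature, top-k, and top-p, which all operate on the same (already-masked) distribution.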

Structured output APIs (OpenAI, Anthropic). The API accepts a JSON schema and guarantees the response conforms to it. The constraint enforcement happens server-side. The user sees only valid structured output.
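For illustration, here is what a schema-constrained request body might look like, modeled on OpenAI's Structured Outputs. The field names and the model name follow the public documentation at the time of writing and should be checked against the provider's current API reference; no network call is made here:

```python
import json

# A hedged sketch of a schema-constrained request body ("response_format"
# with a JSON schema). The provider enforces the constraint server-side.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
    "additionalProperties": False,
}

request_body = {
    "model": "gpt-4o-2024-08-06",  # illustrative model name
    "messages": [{"role": "user", "content": "Extract: Alice is 30."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "person", "strict": True, "schema": schema},
    },
}
print(json.dumps(request_body, indent=2))
```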

Guided generation with retries. Generate unconstrained output, validate it, and retry on failure. This is simpler to implement but wastes compute on invalid generations and provides no formal guarantee of eventual success. With a 5% failure rate, the expected number of attempts is $1/0.95 \approx 1.05$, which is manageable. With a 50% failure rate on complex schemas, retries become expensive.
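The retry pattern can be sketched as follows. The `generate` function is a stub standing in for an unconstrained LLM call, wired to fail once and then succeed so the loop is runnable:

```python
import json

# Generate-validate-retry: the simplest (and weakest) form of structured
# output. A real system would also validate against the schema, not just
# check that the string parses as JSON.
_attempts = iter(['{"name": "Alice", "age":', '{"name": "Alice", "age": 30}'])

def generate(prompt: str) -> str:
    """Stub for an unconstrained LLM call: fails once, then succeeds."""
    return next(_attempts)

def generate_with_retries(prompt: str, max_retries: int = 3) -> dict:
    last_error = None
    for _ in range(max_retries):
        raw = generate(prompt)
        try:
            return json.loads(raw)  # validation step
        except json.JSONDecodeError as e:
            last_error = e          # invalid output: burn the attempt, retry
    raise RuntimeError(f"no valid output after {max_retries} tries: {last_error}")

result = generate_with_retries("Extract the person as JSON.")
print(result)  # {'name': 'Alice', 'age': 30}
```

Contrast with constrained decoding: here the invalid first attempt costs a full generation, and the loop can exhaust its budget without ever succeeding.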

Practical Considerations

Grammar compilation cost. Converting a JSON schema to a state machine is a one-time cost. For complex schemas with nested objects, arrays, and recursive types, the state machine can have thousands of states. This is acceptable because the compilation happens once per schema, not per request.

Interaction with sampling. Constrained decoding works with any sampling strategy: greedy, top-k, top-p, temperature scaling. The constraint is applied as a hard mask on logits before the sampling strategy. The sampling strategy only sees the valid tokens.

Streaming. Constrained decoding is compatible with token-by-token streaming. Each token is guaranteed valid in context, so partial outputs can be streamed to the client. The client can parse partial JSON as it arrives.

Common Confusions

Watch Out

Constrained decoding is not fine-tuning

Constrained decoding modifies the inference-time sampling procedure without changing the model weights. The model is not retrained to produce structured output. The constraint is applied externally. Fine-tuning on structured data can improve the model's natural tendency to produce valid output, but constrained decoding provides the hard guarantee.

Watch Out

JSON mode is weaker than schema-constrained generation

JSON mode (as offered by some APIs) guarantees the output is valid JSON. Schema-constrained generation guarantees the output matches a specific JSON schema (correct field names, types, required fields). Valid JSON that does not match your schema is useless. Always prefer schema-level constraints when available.
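The distinction can be made concrete: both strings below are valid JSON, but only one matches the intended schema. The `matches_schema` helper is a deliberately tiny stand-in for a real JSON Schema validator:

```python
import json

# Valid JSON vs. schema conformance. The intended schema requires a string
# "name" field and an integer "age" field.
def matches_schema(obj) -> bool:
    return (
        isinstance(obj, dict)
        and isinstance(obj.get("name"), str)
        and isinstance(obj.get("age"), int)
    )

json_mode_output = '{"person": "Alice", "years": "30"}'  # valid JSON...
schema_output = '{"name": "Alice", "age": 30}'

assert json.loads(json_mode_output)                      # ...parses fine,
assert not matches_schema(json.loads(json_mode_output))  # ...but wrong shape
assert matches_schema(json.loads(schema_output))         # this one conforms
```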

Watch Out

Constrained decoding can still produce wrong content

A model constrained to output JSON matching a schema {"temperature": number} will always produce valid JSON with a numeric temperature field. But the number might be wrong. Constrained decoding guarantees syntax, not semantics. Factual accuracy still depends on the model's knowledge and reasoning.

Summary

  • Constrained decoding masks invalid tokens at each step, guaranteeing valid output
  • Grammar-guided generation maintains a parser state to compute valid token sets
  • The constraint reduces the effective search space, often improving content quality
  • Token-level masking adds negligible overhead to the model forward pass
  • Constrained decoding works with any sampling strategy (greedy, top-k, top-p)
  • JSON mode guarantees valid JSON; schema mode guarantees conformance to a specific schema
  • Syntax is guaranteed; semantics (correctness of content) is not

Exercises

ExerciseCore

Problem

A JSON schema requires output of the form {"name": string, "age": integer}. At the position immediately after {"name": "Alice", "age": , what tokens are in the valid set $V_t$? Assume the vocabulary includes digits 0-9, letters, quotes, braces, brackets, commas, colons, and the minus sign.

ExerciseAdvanced

Problem

You are building a constrained decoding system for SQL generation. The grammar allows only SELECT queries on a database with tables users (columns: id, name, email) and orders (columns: id, user_id, amount, date). After the model generates SELECT name, email FROM users WHERE , what is $V_t$? Consider both column names and SQL keywords that can follow WHERE.

References

Canonical:

  • Willard & Louf, "Efficient Guided Generation for Large Language Models" (2023). Outlines library

Current:

  • OpenAI, "Structured Outputs" documentation (2024)
  • GBNF grammar support in llama.cpp (2023-2024)
  • Scholak et al., "PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models" (2021). SQL-constrained generation


Last reviewed: April 2026
