
Foundations

Formal Languages and Automata

Regular languages, context-free grammars, pushdown automata, the Chomsky hierarchy, pumping lemmas, and connections to parsing, neural sequence models, and computational complexity.


Why This Matters

Formal language theory classifies computational problems by the minimal machine needed to solve them. A finite automaton suffices for pattern matching. A pushdown automaton handles nested structures. A Turing machine handles everything that is computable. This hierarchy is not just a classification exercise: it tells you which problems admit efficient parsing algorithms (regular and context-free) and which require general computation.

[Figure: the Chomsky hierarchy as nested classes. Type 3: Regular (finite automaton, DFA/NFA) ⊂ Type 2: Context-Free (pushdown automaton) ⊂ Type 1: Context-Sensitive (linear-bounded automaton) ⊂ Type 0: Recursively Enumerable (Turing machine). Each level is strictly contained in the one above; more powerful automata recognize strictly larger language classes.]

For ML, the Chomsky hierarchy provides the baseline for understanding what neural sequence models can and cannot represent. Recurrent neural networks with bounded precision are equivalent to finite automata. Transformers have connections to circuit complexity classes. Natural language has context-free structure (nested dependencies) that finite-state models cannot capture. Knowing where the boundaries are prevents you from expecting a model to do something its architecture provably cannot.

Core Definitions

Definition

Alphabet, String, and Language

An alphabet $\Sigma$ is a finite, nonempty set of symbols. A string over $\Sigma$ is a finite sequence of symbols from $\Sigma$. The set of all strings over $\Sigma$ is $\Sigma^*$, which includes the empty string $\varepsilon$. A language $L$ over $\Sigma$ is any subset $L \subseteq \Sigma^*$.

Definition

Deterministic Finite Automaton (DFA)

A DFA is a 5-tuple $(Q, \Sigma, \delta, q_0, F)$ where:

  • $Q$ is a finite set of states
  • $\Sigma$ is a finite input alphabet
  • $\delta: Q \times \Sigma \to Q$ is the transition function (total)
  • $q_0 \in Q$ is the start state
  • $F \subseteq Q$ is the set of accept states

The DFA reads input one symbol at a time, transitions deterministically, and accepts if it ends in a state in $F$.
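The definition translates directly into code. The following is a minimal simulator sketch; the state names and the example machine (binary strings with an even number of 1s) are illustrative choices, not from the text:

```python
# A minimal DFA simulator for the 5-tuple definition above. The example
# machine (state names are illustrative) accepts binary strings with an
# even number of 1s.

def dfa_accepts(delta, q0, accept, w):
    """delta: total transition function as a dict (state, symbol) -> state."""
    q = q0
    for a in w:
        q = delta[(q, a)]        # one deterministic step per input symbol
    return q in accept           # accept iff the run ends in an accept state

delta = {
    ('even', '0'): 'even', ('even', '1'): 'odd',
    ('odd', '0'): 'odd',   ('odd', '1'): 'even',
}

print(dfa_accepts(delta, 'even', {'even'}, '1101'))  # False: three 1s
print(dfa_accepts(delta, 'even', {'even'}, '110'))   # True: two 1s
```

Because $\delta$ is total, the loop never branches or gets stuck: one symbol, one transition.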

Definition

Nondeterministic Finite Automaton (NFA)

An NFA is like a DFA but with $\delta: Q \times (\Sigma \cup \{\varepsilon\}) \to \mathcal{P}(Q)$. The transition function maps to a set of possible next states, and $\varepsilon$-transitions are allowed (the machine may change state without reading input). The NFA accepts if some sequence of choices leads to an accept state.

Definition

Regular Language

A language $L$ is regular if it is recognized by some DFA (equivalently, by some NFA, or described by some regular expression). The regular languages are exactly the Type 3 languages in the Chomsky hierarchy.

Definition

Context-Free Grammar (CFG)

A context-free grammar is a 4-tuple $G = (V, \Sigma, R, S)$ where:

  • $V$ is a finite set of variables (nonterminals)
  • $\Sigma$ is a finite set of terminals (disjoint from $V$)
  • $R$ is a finite set of production rules of the form $A \to w$, where $A \in V$ and $w \in (V \cup \Sigma)^*$
  • $S \in V$ is the start variable

A string $w \in \Sigma^*$ is in the language $L(G)$ if there is a derivation $S \Rightarrow^* w$ using the rules in $R$.
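The derivation relation $S \Rightarrow^* w$ can be sketched by repeatedly rewriting the leftmost nonterminal until only terminals remain. The grammar below (balanced parentheses, an assumption chosen for illustration) uses the empty string to model an $\varepsilon$ right-hand side:

```python
import random

# A sketch of leftmost derivations from a CFG. Rules map each nonterminal to
# a list of right-hand sides; '' models an epsilon rule. The example grammar
# (an assumption) generates balanced parentheses: S -> (S)S | epsilon.

def derive(rules, start='S', max_steps=200, seed=None):
    """Perform one random leftmost derivation; return the terminal string,
    or None if the derivation does not finish within max_steps rewrites."""
    rng = random.Random(seed)
    form = [start]
    for _ in range(max_steps):
        # Find the leftmost nonterminal (a symbol that has rules).
        idx = next((i for i, s in enumerate(form) if s in rules), None)
        if idx is None:
            return ''.join(form)          # all terminals: derivation complete
        rhs = rng.choice(rules[form[idx]])
        form[idx:idx + 1] = list(rhs)     # rewrite A -> rhs in place
    return None

rules = {'S': ['(S)S', '']}
w = derive(rules, seed=1)
print(repr(w))  # a balanced-parenthesis string (possibly empty)
```

Every string this produces is in $L(G)$ by construction, since each step applies one rule in $R$.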

Definition

Pushdown Automaton (PDA)

A pushdown automaton is a 6-tuple $(Q, \Sigma, \Gamma, \delta, q_0, F)$ where $\Gamma$ is the stack alphabet and $\delta: Q \times (\Sigma \cup \{\varepsilon\}) \times (\Gamma \cup \{\varepsilon\}) \to \mathcal{P}(Q \times (\Gamma \cup \{\varepsilon\}))$. A PDA is an NFA augmented with an unbounded stack. A language is context-free if and only if it is accepted by some PDA.
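To make the stack mechanism concrete, here is a recognizer for $\{a^n b^n : n \geq 0\}$ written with an explicit stack (a sketch; this particular language happens to be deterministic context-free, so no nondeterministic branching is needed):

```python
# A stack-based recognizer for {a^n b^n : n >= 0}, sketching how a PDA uses
# its stack: push one marker per 'a', pop one per 'b', accept iff the input
# is consumed with an empty stack.

def accepts_anbn(w):
    stack = []
    i = 0
    # Phase 1: push one marker for each leading 'a'.
    while i < len(w) and w[i] == 'a':
        stack.append('A')
        i += 1
    # Phase 2: pop one marker for each following 'b'.
    while i < len(w) and w[i] == 'b' and stack:
        stack.pop()
        i += 1
    # Accept iff all input is consumed and the stack is empty.
    return i == len(w) and not stack

print(accepts_anbn('aabb'))  # True
print(accepts_anbn('aab'))   # False: unmatched 'a' left on the stack
print(accepts_anbn(''))      # True: n = 0
```

The stack is exactly the extra resource over a DFA: it stores the unbounded count of pending a's that finite memory cannot.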

DFA/NFA Equivalence

Theorem

DFA-NFA Equivalence

Statement

For every NFA $N$ with $n$ states, there exists a DFA $D$ that recognizes the same language. The DFA may have up to $2^n$ states.

Intuition

An NFA can be in multiple states simultaneously (all states reachable by some sequence of nondeterministic choices). The DFA simulates this by tracking the set of states the NFA could be in. Each DFA state corresponds to a subset of the NFA states.

Proof Sketch

Given NFA $N = (Q, \Sigma, \delta, q_0, F)$, construct DFA $D$ with state set $\mathcal{P}(Q)$. The start state is the $\varepsilon$-closure of $\{q_0\}$. The transition function is $\delta_D(S, a) = \bigcup_{q \in S} \varepsilon\text{-closure}(\delta(q, a))$ for each $S \subseteq Q$ and $a \in \Sigma$. The accept states are all $S \subseteq Q$ with $S \cap F \neq \emptyset$. By induction on string length, $D$ accepts $w$ iff $N$ accepts $w$.
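The construction can be coded directly for NFAs without $\varepsilon$-transitions (so every $\varepsilon$-closure is trivial). The example NFA (an assumption) accepts binary strings whose third symbol from the end is 1, a standard witness that the exponential gap between NFA and DFA sizes can occur:

```python
# The subset construction, restricted to epsilon-free NFAs. The example NFA
# (4 states) accepts binary strings whose 3rd symbol from the end is '1';
# its reachable DFA has 2^3 = 8 states (every subset containing state 0).

def subset_construction(nfa_delta, q0, accept, alphabet):
    """nfa_delta: dict (state, symbol) -> set of possible next states."""
    start = frozenset([q0])
    dfa_delta, seen, todo = {}, {start}, [start]
    while todo:
        S = todo.pop()
        for a in alphabet:
            # Union of NFA moves from every state the NFA could be in.
            T = frozenset(t for q in S for t in nfa_delta.get((q, a), ()))
            dfa_delta[(S, a)] = T
            if T not in seen:
                seen.add(T)
                todo.append(T)
    dfa_accept = {S for S in seen if S & accept}
    return dfa_delta, start, dfa_accept, seen

# State 0 loops forever; on '1' it may also guess "3rd from the end" and move
# to state 1; states 1 and 2 count the remaining two symbols; 3 accepts.
nfa = {
    (0, '0'): {0}, (0, '1'): {0, 1},
    (1, '0'): {2}, (1, '1'): {2},
    (2, '0'): {3}, (2, '1'): {3},
}
delta, start, acc, states = subset_construction(nfa, 0, {3}, '01')
print(len(states))  # 8 reachable DFA states from a 4-state NFA
```

Each DFA state is a `frozenset` of NFA states, exactly the subsets $S \subseteq Q$ in the proof; only reachable subsets are materialized.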

Why It Matters

Nondeterminism does not add power to finite automata; it only adds conciseness. An NFA with $n$ states can always be converted to an equivalent DFA, though possibly with exponentially more states. This exponential blowup is tight: there exist languages where the minimal DFA is exponentially larger than the minimal NFA.

Failure Mode

This equivalence holds only for finite automata. Nondeterministic pushdown automata are strictly more powerful than deterministic ones. The language $\{ww^R : w \in \{0,1\}^*\}$ (even-length palindromes) is accepted by a nondeterministic PDA but not by any deterministic PDA. For Turing machines, deterministic and nondeterministic versions recognize the same languages (though the time complexity may differ, which is the essence of the P vs NP question).

The Chomsky Hierarchy

Theorem

The Chomsky Hierarchy

Statement

Languages are classified into a strict hierarchy of four types:

| Type | Grammar restriction | Automaton | Example |
|---|---|---|---|
| 3 (Regular) | $A \to aB$ or $A \to a$ | Finite automaton (DFA/NFA) | $\{a^n : n \geq 0\}$ |
| 2 (Context-free) | $A \to \gamma$ for any $\gamma \in (V \cup \Sigma)^*$ | Pushdown automaton | $\{a^n b^n : n \geq 0\}$ |
| 1 (Context-sensitive) | $\alpha A \beta \to \alpha\gamma\beta$ with $|\gamma| \geq 1$ | Linear-bounded automaton | $\{a^n b^n c^n : n \geq 0\}$ |
| 0 (Recursively enumerable) | No restrictions | Turing machine | HALT |

The inclusions are strict: $\text{Regular} \subsetneq \text{Context-Free} \subsetneq \text{Context-Sensitive} \subsetneq \text{R.E.}$

Intuition

Each level adds a computational resource. Finite automata have finite memory. Pushdown automata add a stack (unbounded LIFO memory). Linear bounded automata have a tape bounded linearly by the input length. Turing machines have unbounded tape. More memory allows recognizing more complex patterns.

Proof Sketch

The inclusions follow from the fact that each machine type can simulate the one below it. The strictness follows from the pumping lemmas (for separating regular from context-free) and specific non-context-free languages like $\{a^n b^n c^n\}$, proved using the pumping lemma for context-free languages. The separation of context-sensitive from r.e. follows from the existence of undecidable languages (no LBA can decide HALT).

Why It Matters

The Chomsky hierarchy provides a framework for classifying the complexity of languages and the computational power required to process them. In NLP, natural language syntax has context-free backbone structure (phrase structure grammars), but some constructions (cross-serial dependencies in Swiss German, copy languages) appear to require mildly context-sensitive power. Knowing the hierarchy helps determine what parsing algorithms and model architectures are appropriate.

Failure Mode

The Chomsky hierarchy has gaps. There is no level between context-free and context-sensitive that captures "mildly context-sensitive" languages, which are important in computational linguistics. Formalisms like tree-adjoining grammars (TAGs) and multiple context-free grammars fill this gap but are not part of the original hierarchy.

The Pumping Lemma for Regular Languages

Theorem

Pumping Lemma for Regular Languages

Statement

If $L$ is a regular language, then there exists a constant $p \geq 1$ (the pumping length) such that every string $w \in L$ with $|w| \geq p$ can be written as $w = xyz$ where:

  1. $|y| \geq 1$ (the pumped part is nonempty)
  2. $|xy| \leq p$ (the pumped part lies within the first $p$ characters)
  3. For all $i \geq 0$, $xy^i z \in L$ (pumping $y$ any number of times stays in $L$)

Intuition

A DFA with $p$ states reading a string of length $\geq p$ must revisit some state (pigeonhole principle). The substring between the two visits to the same state can be repeated (pumped) any number of times, and the DFA will still accept because it is in the same state after each repetition.

Proof Sketch

Let $M$ be a DFA with $p$ states recognizing $L$. Let $w = w_1 w_2 \cdots w_n$ with $n \geq p$. Consider the sequence of states $q_0, q_1, \ldots, q_n$ visited while reading $w$. Since there are $p$ states and $n + 1 \geq p + 1$ state visits, by the pigeonhole principle some state repeats: $q_j = q_k$ for some $0 \leq j < k \leq p$. Set $x = w_1 \cdots w_j$, $y = w_{j+1} \cdots w_k$, $z = w_{k+1} \cdots w_n$. Then $|y| \geq 1$, $|xy| \leq p$, and for any $i$, reading $xy^i z$ starts at $q_0$, reaches $q_j$ after $x$, loops back to $q_j = q_k$ after each copy of $y$, and finishes at $q_n \in F$ after $z$.

Why It Matters

The pumping lemma is the standard tool for proving that a language is not regular. You assume $L$ is regular, apply the pumping lemma to get $p$, choose a specific string $w \in L$ with $|w| \geq p$, and show that no decomposition $w = xyz$ satisfying the three conditions keeps $xy^i z$ in $L$ for all $i$. The resulting contradiction proves $L$ is not regular.
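For small candidate pumping lengths, this argument can be checked mechanically. The brute-force sketch below (the helper names and the range of $p$ are illustrative choices) verifies that for $w = a^p b^p$, every legal decomposition pumps out of $\{a^n b^n\}$:

```python
# Brute-force check of the pumping-lemma argument for L = {a^n b^n}: take
# w = a^p b^p and verify that every decomposition w = xyz with |xy| <= p and
# |y| >= 1 leaves L for some i (checking i = 0 and i = 2 suffices to refute).

def in_L(s):
    n = len(s) // 2
    return s == 'a' * n + 'b' * n

def pumping_fails(p):
    w = 'a' * p + 'b' * p
    for j in range(p + 1):             # j = |x|
        for k in range(j + 1, p + 1):  # k = |xy|, so |y| = k - j >= 1
            x, y, z = w[:j], w[j:k], w[k:]
            if all(in_L(x + y * i + z) for i in (0, 2)):
                return False           # this decomposition survived pumping
    return True                        # every decomposition breaks: L is not pumpable at p

print(all(pumping_fails(p) for p in range(1, 8)))  # True
```

Since $|xy| \leq p$ forces $y$ to consist only of a's here, pumping with $i = 0$ already unbalances the counts, which is exactly the contradiction the proof strategy exploits.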

Failure Mode

The pumping lemma is a necessary condition for regularity, not a sufficient one. There exist non-regular languages that satisfy the pumping lemma. The Myhill-Nerode theorem provides a necessary and sufficient condition: $L$ is regular iff the equivalence relation $x \sim_L y$ (defined by: for all $z$, $xz \in L \iff yz \in L$) has finitely many equivalence classes.

Decidability by Language Class

| Problem | Regular | Context-Free | Context-Sensitive | R.E. |
|---|---|---|---|---|
| Membership ($w \in L$?) | $O(n)$ | $O(n^3)$ (CYK) | Decidable (PSPACE) | Semi-decidable |
| Emptiness ($L = \emptyset$?) | Decidable | Decidable | Undecidable | Undecidable |
| Universality ($L = \Sigma^*$?) | Decidable | Undecidable | Undecidable | Undecidable |
| Equivalence ($L_1 = L_2$?) | Decidable | Undecidable | Undecidable | Undecidable |
| Containment ($L_1 \subseteq L_2$?) | Decidable | Undecidable | Undecidable | Undecidable |

The sharp drop in decidability between regular and context-free languages reflects the increased complexity of CFGs. For regular languages, all standard questions are decidable because the DFA can be fully analyzed. For context-free languages, membership and emptiness are decidable, but comparison problems (universality, equivalence, containment) become undecidable.
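The $O(n^3)$ context-free membership entry corresponds to the CYK algorithm, which requires a grammar in Chomsky normal form. A compact sketch (the example grammar is an assumption, chosen to generate nonempty balanced-parenthesis strings):

```python
# The O(n^3) CYK membership algorithm for a grammar in Chomsky normal form.
# Example grammar (an assumption), generating nonempty balanced parentheses:
#   S -> S S | L R | L T,  T -> S R,  L -> (,  R -> )

def cyk(word, unary, binary, start):
    """unary: terminal -> set of variables A with rule A -> terminal.
    binary: (B, C) -> set of variables A with rule A -> B C.
    T[i][j] holds the variables that derive word[i..j] inclusive."""
    n = len(word)
    if n == 0:
        return False                      # CNF grammars cannot derive epsilon
    T = [[set() for _ in range(n)] for _ in range(n)]
    for i, a in enumerate(word):
        T[i][i] = set(unary.get(a, set()))
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):         # split word[i..j] after position k
                for B in T[i][k]:
                    for C in T[k + 1][j]:
                        T[i][j] |= binary.get((B, C), set())
    return start in T[0][n - 1]

unary = {'(': {'L'}, ')': {'R'}}
binary = {('S', 'S'): {'S'}, ('L', 'R'): {'S'},
          ('L', 'T'): {'S'}, ('S', 'R'): {'T'}}
print(cyk('(())()', unary, binary, 'S'))  # True
print(cyk('(()', unary, binary, 'S'))     # False
```

The three nested loops over length, start position, and split point are the source of the cubic bound; contrast this with the single linear scan a DFA needs for regular membership.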

Connection to Neural Networks and NLP

Natural language has nested, recursive structure (relative clauses embedded in relative clauses), which is the hallmark of context-free languages. Finite-state models cannot capture unbounded nesting: the language $\{a^n b^n : n \geq 0\}$ requires counting, which exceeds finite-state memory.

Recurrent neural networks (RNNs) with finite precision are equivalent to finite automata. With infinite precision, they can simulate Turing machines, but this is not practical. Transformers occupy a different position: with hard attention, they correspond to a restricted class of circuits ($\text{AC}^0$ or $\text{TC}^0$, depending on the activation functions). Soft attention transformers can approximate context-free languages but do not exactly compute them.

This means that when a transformer appears to learn nested structure (matching parentheses, parsing arithmetic), it is using an approximation strategy that works for the training distribution but may fail on strings requiring deeper nesting than seen during training.

Common Confusions

Watch Out

Regular expressions in theory vs. practice

Regular expressions in formal language theory correspond exactly to DFAs and describe only regular languages. Regular expressions in programming (Python, Perl, etc.) include backreferences ($\backslash 1$, $\backslash 2$), which can match non-regular languages like $\{ww : w \in \{a,b\}^*\}$. With backreferences, the matching problem becomes NP-hard. The name "regular expression" is therefore misleading in the programming context; these are strictly more powerful than the theoretical notion.
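This gap is easy to observe directly with Python's `re` module: a single backreference matches the copy language $\{ww\}$, which is not regular (and not even context-free):

```python
import re

# A backreference pattern matching the copy language {ww : w in {a,b}+}.
# The group captures w; \1 then requires an exact second copy of it.
copy = re.compile(r'^([ab]+)\1$')

print(bool(copy.match('abab')))      # True:  w = "ab" repeated
print(bool(copy.match('aba')))       # False: not of the form ww
print(bool(copy.match('aabbaabb')))  # True:  w = "aabb" repeated
```

No DFA, NFA, or theoretical regular expression can express this pattern, so "regex" engines with backreferences sit strictly above the regular languages.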

Watch Out

The pumping lemma cannot prove regularity

The pumping lemma provides a necessary condition for regularity, not a sufficient one. If a language fails the pumping lemma, it is not regular. But if a language satisfies the pumping lemma, it may still be non-regular. For example, the language $\{a^i b^j c^k : \text{if } i = 1 \text{ then } j = k\}$ satisfies the pumping lemma but is not regular. To prove a language is regular, you must construct a DFA, NFA, or regular expression for it.

Watch Out

Nondeterminism adds power for PDAs but not for FAs or TMs

For finite automata, nondeterminism is a convenience: every NFA has an equivalent DFA. For Turing machines, nondeterminism does not change what is computable (only potentially how fast). But for pushdown automata, nondeterminism strictly increases power. Deterministic context-free languages (those recognized by a DPDA) are a proper subset of the context-free languages. This is why LR parsing (deterministic) cannot handle all context-free grammars.

Exercises

ExerciseCore

Problem

Prove that $L = \{a^n b^n : n \geq 0\}$ is not regular using the pumping lemma.

ExerciseCore

Problem

Write a context-free grammar for the language $L = \{a^n b^n : n \geq 0\}$.

ExerciseAdvanced

Problem

Prove that $L = \{a^n b^n c^n : n \geq 0\}$ is not context-free using the pumping lemma for context-free languages.

ExerciseAdvanced

Problem

Construct an NFA with 3 states for the language $L = \{w \in \{0,1\}^* : w \text{ contains the substring } 01\}$. Then apply the subset construction to produce the equivalent DFA. How many reachable states does the DFA have?

ExerciseResearch

Problem

Transformers with hard attention (argmax instead of softmax) and fixed precision can be simulated by constant-depth threshold circuits ($\text{TC}^0$). Explain why this implies transformers cannot recognize all context-free languages and give a specific context-free language that fixed-depth transformers provably cannot recognize.

References

Canonical:

  • Sipser, Introduction to the Theory of Computation (2013), Chapters 1-2 (automata and CFGs)
  • Hopcroft, Motwani, Ullman, Introduction to Automata Theory, Languages, and Computation (2006), Chapters 2-7

Additional:

  • Kozen, Automata and Computability (1997), Chapters 1-27
  • Chomsky, "Three models for the description of language" (IRE Transactions, 1956)

Connection to NLP and neural networks:

  • Merrill et al., "Saturated Transformers are Constant-Depth Threshold Circuits" (TACL, 2022)
  • Yun et al., "Are Transformers Universal Approximators of Sequence-to-Sequence Functions?" (ICLR, 2020)

Next Topics

  • Computability theory: what happens beyond the Chomsky hierarchy, at the boundary of what Turing machines can compute
  • P vs NP: within the decidable languages, the central open question about efficient computation

Last reviewed: April 2026
