Foundations
Formal Languages and Automata
Regular languages, context-free grammars, pushdown automata, the Chomsky hierarchy, pumping lemmas, and connections to parsing, neural sequence models, and computational complexity.
Why This Matters
Formal language theory classifies computational problems by the minimal machine needed to solve them. A finite automaton suffices for pattern matching. A pushdown automaton handles nested structures. A Turing machine handles everything that is computable. This hierarchy is not just a classification exercise: it tells you which problems admit efficient parsing algorithms (regular and context-free) and which require general computation.
For ML, the Chomsky hierarchy provides the baseline for understanding what neural sequence models can and cannot represent. Recurrent neural networks with bounded precision are equivalent to finite automata. Transformers have connections to circuit complexity classes. Natural language has context-free structure (nested dependencies) that finite-state models cannot capture. Knowing where the boundaries are prevents you from expecting a model to do something its architecture provably cannot.
Core Definitions
Alphabet, String, and Language
An alphabet $\Sigma$ is a finite, nonempty set of symbols. A string over $\Sigma$ is a finite sequence of symbols from $\Sigma$. The set of all strings over $\Sigma$ is $\Sigma^*$, which includes the empty string $\varepsilon$. A language over $\Sigma$ is any subset $L \subseteq \Sigma^*$.
Deterministic Finite Automaton (DFA)
A DFA is a 5-tuple $(Q, \Sigma, \delta, q_0, F)$ where:
- $Q$ is a finite set of states
- $\Sigma$ is a finite input alphabet
- $\delta : Q \times \Sigma \to Q$ is the transition function (total)
- $q_0 \in Q$ is the start state
- $F \subseteq Q$ is the set of accept states
The DFA reads input one symbol at a time, transitions deterministically, and accepts if it ends in a state in $F$.
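The 5-tuple maps directly onto code. Below is a minimal sketch in Python; the example machine (accepting binary strings with an even number of 1s) is illustrative, not taken from the text.

```python
# A minimal DFA simulator, following the 5-tuple definition above.
# The example machine accepts binary strings containing an even number of 1s.

def run_dfa(states, alphabet, delta, start, accept, w):
    """Return True iff the DFA accepts string w."""
    q = start
    for symbol in w:
        assert symbol in alphabet, f"symbol {symbol!r} not in alphabet"
        q = delta[(q, symbol)]  # delta is total: defined for every (state, symbol)
    return q in accept

even_ones = dict(
    states={"even", "odd"},
    alphabet={"0", "1"},
    delta={("even", "0"): "even", ("even", "1"): "odd",
           ("odd", "0"): "odd", ("odd", "1"): "even"},
    start="even",
    accept={"even"},
)

print(run_dfa(**even_ones, w="1011"))  # False (three 1s)
print(run_dfa(**even_ones, w="1001"))  # True  (two 1s)
```

Note that the empty string is accepted here: zero 1s is an even count, and the machine never leaves its accepting start state.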
Nondeterministic Finite Automaton (NFA)
An NFA is like a DFA but with $\delta : Q \times (\Sigma \cup \{\varepsilon\}) \to \mathcal{P}(Q)$. The transition function maps to a set of possible next states, and $\varepsilon$-transitions are allowed (the machine may change state without reading input). The NFA accepts if some sequence of choices leads to an accept state.
Regular Language
A language is regular if it is recognized by some DFA (equivalently, by some NFA, or described by some regular expression). The regular languages are exactly the Type 3 languages in the Chomsky hierarchy.
Context-Free Grammar (CFG)
A context-free grammar is a 4-tuple $(V, \Sigma, R, S)$ where:
- $V$ is a finite set of variables (nonterminals)
- $\Sigma$ is a finite set of terminals (disjoint from $V$)
- $R$ is a finite set of production rules of the form $A \to \alpha$, where $A \in V$ and $\alpha \in (V \cup \Sigma)^*$
- $S \in V$ is the start variable
A string $w \in \Sigma^*$ is in the language $L(G)$ if there is a derivation $S \Rightarrow^* w$ using the rules in $R$.
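As a concrete illustration, the toy grammar $S \to aSb \mid \varepsilon$ (an assumed example, generating $\{a^n b^n\}$) can be expanded mechanically by repeatedly rewriting the leftmost nonterminal:

```python
# Leftmost derivations in the toy grammar S -> aSb | epsilon,
# which generates {a^n b^n : n >= 0}. The grammar choice is illustrative.

RULES = {"S": ["aSb", ""]}  # nonterminal -> list of right-hand sides

def derive(n):
    """Apply S -> aSb n times, then S -> epsilon; yields a^n b^n."""
    sentential = "S"
    for _ in range(n):
        sentential = sentential.replace("S", RULES["S"][0], 1)  # expand leftmost S
    return sentential.replace("S", RULES["S"][1], 1)            # erase the final S

print(derive(3))  # 'aaabbb'
```

Each intermediate string ("sentential form") mixes terminals and nonterminals; the derivation ends when no nonterminal remains.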
Pushdown Automaton (PDA)
A pushdown automaton is a 6-tuple $(Q, \Sigma, \Gamma, \delta, q_0, F)$ where $\Gamma$ is the stack alphabet and $\delta : Q \times (\Sigma \cup \{\varepsilon\}) \times (\Gamma \cup \{\varepsilon\}) \to \mathcal{P}(Q \times (\Gamma \cup \{\varepsilon\}))$. A PDA is an NFA augmented with an unbounded stack. A language is context-free if and only if it is accepted by some PDA.
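To see what the stack buys, here is a sketch of a deterministic stack machine (not a general PDA simulator) for $\{a^n b^n : n \geq 1\}$, a language no finite automaton can recognize:

```python
# A deterministic stack machine for {a^n b^n : n >= 1}, illustrating how a
# PDA's stack provides the unbounded LIFO memory a finite automaton lacks.
# (This particular language happens to be deterministic context-free.)

def accepts_anbn(w):
    stack = []
    phase = "push"            # "push" while reading a's, "pop" while reading b's
    for c in w:
        if c == "a" and phase == "push":
            stack.append("A")  # push one marker per a
        elif c == "b" and stack:
            phase = "pop"
            stack.pop()        # match one b against one pushed a
        else:
            return False       # b before any a, a after a b, or too many b's
    return phase == "pop" and not stack  # all a's matched, at least one b seen

print(accepts_anbn("aaabbb"))  # True
print(accepts_anbn("aabbb"))   # False
```

The stack height tracks the unbounded count $n$; a DFA, with only finitely many states, cannot distinguish $a^n$ from $a^m$ for all $n \neq m$.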
DFA/NFA Equivalence
DFA-NFA Equivalence
Statement
For every NFA with $n$ states, there exists a DFA that recognizes the same language. The DFA may have up to $2^n$ states.
Intuition
An NFA can be in multiple states simultaneously (all states reachable by some sequence of nondeterministic choices). The DFA simulates this by tracking the set of states the NFA could be in. Each DFA state corresponds to a subset of the NFA states.
Proof Sketch
Given NFA $N = (Q, \Sigma, \delta, q_0, F)$, construct DFA $D$ with state set $\mathcal{P}(Q)$. The start state is the $\varepsilon$-closure of $\{q_0\}$. The transition function is $\delta'(S, a) = \bigcup_{q \in S} E(\delta(q, a))$ for each $S \subseteq Q$ and $a \in \Sigma$, where $E(\cdot)$ denotes $\varepsilon$-closure. The accept states are all $S$ with $S \cap F \neq \emptyset$. By induction on string length, $D$ accepts $w$ iff $N$ accepts $w$.
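The construction above can be sketched directly; for brevity this version assumes an NFA without $\varepsilon$-transitions (the full construction adds $\varepsilon$-closures as in the proof), and builds only the subset-states reachable from the start. The example NFA, for binary strings ending in 01, is a hypothetical choice.

```python
# Subset construction sketch: convert an NFA (without epsilon-transitions,
# for brevity) to an equivalent DFA, building only reachable subset-states.

from itertools import chain

def nfa_to_dfa(nfa_delta, alphabet, start, accept):
    """nfa_delta: dict (state, symbol) -> set of next states."""
    start_set = frozenset([start])
    dfa_delta, frontier, seen = {}, [start_set], {start_set}
    while frontier:
        S = frontier.pop()
        for a in alphabet:
            # Union of NFA moves from every state in the subset S
            T = frozenset(chain.from_iterable(nfa_delta.get((q, a), ()) for q in S))
            dfa_delta[(S, a)] = T
            if T not in seen:
                seen.add(T)
                frontier.append(T)
    dfa_accept = {S for S in seen if S & set(accept)}
    return dfa_delta, start_set, dfa_accept, seen

# NFA for "binary strings ending in 01" (illustrative): p0 guesses when the
# final 01 begins.
delta = {("p0", "0"): {"p0", "p1"}, ("p0", "1"): {"p0"}, ("p1", "1"): {"p2"}}
_, _, _, states = nfa_to_dfa(delta, "01", "p0", {"p2"})
print(len(states))  # 3 reachable DFA states (far below the 2^3 = 8 worst case)
```

Building only reachable subsets is what practical tools do: the $2^n$ bound is a worst case, often not attained.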
Why It Matters
Nondeterminism does not add power to finite automata; it only adds conciseness. An NFA with $n$ states can always be converted to an equivalent DFA, though possibly with exponentially more states. This exponential blowup is tight: there exist languages where the minimal DFA is exponentially larger than the minimal NFA.
Failure Mode
This equivalence holds only for finite automata. Nondeterministic pushdown automata are strictly more powerful than deterministic ones. The language $\{ww^R : w \in \{0,1\}^*\}$ (even-length palindromes) is accepted by a nondeterministic PDA but not by any deterministic PDA. For Turing machines, deterministic and nondeterministic versions recognize the same languages (though the time complexity may differ, which is the essence of the P vs NP question).
The Chomsky Hierarchy
The Chomsky Hierarchy
Statement
Languages are classified into a strict hierarchy of four types:
| Type | Grammar Restriction | Automaton | Example |
|---|---|---|---|
| 3 (Regular) | $A \to aB$ or $A \to a$ | Finite automaton (DFA/NFA) | $(ab)^*$ |
| 2 (Context-free) | $A \to \gamma$ for any $\gamma \in (V \cup \Sigma)^*$ | Pushdown automaton | $\{a^n b^n\}$ |
| 1 (Context-sensitive) | $\alpha A \beta \to \alpha \gamma \beta$ with $|\gamma| \geq 1$ | Linear bounded automaton | $\{a^n b^n c^n\}$ |
| 0 (Recursively enumerable) | No restrictions | Turing machine | HALT |
The inclusions are strict: Regular $\subsetneq$ Context-free $\subsetneq$ Context-sensitive $\subsetneq$ Recursively enumerable.
Intuition
Each level adds a computational resource. Finite automata have finite memory. Pushdown automata add a stack (unbounded LIFO memory). Linear bounded automata add bounded tape. Turing machines have unbounded tape. More memory allows recognizing more complex patterns.
Proof Sketch
The inclusions follow from the fact that each machine type can simulate the one below it. The strictness follows from the pumping lemmas (for separating regular from context-free) and specific non-context-free languages like $\{a^n b^n c^n : n \geq 0\}$, proved using the pumping lemma for context-free languages. The separation of context-sensitive from r.e. follows from the existence of undecidable languages (no LBA can decide HALT).
Why It Matters
The Chomsky hierarchy provides a framework for classifying the complexity of languages and the computational power required to process them. In NLP, natural language syntax has context-free backbone structure (phrase structure grammars), but some constructions (cross-serial dependencies in Swiss German, copy languages) appear to require mildly context-sensitive power. Knowing the hierarchy helps determine what parsing algorithms and model architectures are appropriate.
Failure Mode
The Chomsky hierarchy has gaps. There is no level between context-free and context-sensitive that captures "mildly context-sensitive" languages, which are important in computational linguistics. Formalisms like tree-adjoining grammars (TAGs) and multiple context-free grammars fill this gap but are not part of the original hierarchy.
The Pumping Lemma for Regular Languages
Pumping Lemma for Regular Languages
Statement
If $L$ is a regular language, then there exists a constant $p$ (the pumping length) such that every string $s \in L$ with $|s| \geq p$ can be written as $s = xyz$ where:
- $|y| \geq 1$ (the pumped part is nonempty)
- $|xy| \leq p$ (the pumped part is within the first $p$ characters)
- For all $i \geq 0$, $xy^iz \in L$ (pumping any number of times stays in $L$)
Intuition
A DFA with $p$ states reading a string of length at least $p$ must revisit some state (pigeonhole principle). The substring between the two visits to the same state can be repeated (pumped) any number of times, and the DFA will still accept because it is in the same state after each repetition.
Proof Sketch
Let $M$ be a DFA with $p$ states recognizing $L$. Let $s = s_1 s_2 \cdots s_n \in L$ with $n \geq p$. Consider the sequence of states $r_0, r_1, \ldots, r_n$ visited while reading $s$. Since there are $p$ states and $n + 1 > p$ state visits, by the pigeonhole principle some state repeats: $r_i = r_j$ for some $0 \leq i < j \leq p$. Set $x = s_1 \cdots s_i$, $y = s_{i+1} \cdots s_j$, $z = s_{j+1} \cdots s_n$. Then $|xy| = j \leq p$, $|y| = j - i \geq 1$, and for any $k \geq 0$, reading $xy^kz$ starts at $r_0$, reaches $r_i$ after $x$, loops back to $r_i$ after each copy of $y$, and finishes at the accept state $r_n$ after $z$.
Why It Matters
The pumping lemma is the standard tool for proving that a language is not regular. You assume $L$ is regular, apply the pumping lemma to get $p$, choose a specific string $s \in L$ with $|s| \geq p$, and show that no decomposition $s = xyz$ satisfying the three conditions keeps $xy^iz$ in $L$ for all $i$. The resulting contradiction proves $L$ is not regular.
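For a fixed hypothetical pumping length, this argument can be checked exhaustively in code for $L = \{a^n b^n\}$: every legal decomposition of $s = a^p b^p$ puts $y$ inside the leading $a$'s, so pumping unbalances the counts.

```python
# Mechanically checking the pumping-lemma argument for L = {a^n b^n}:
# for s = a^p b^p, every decomposition s = xyz with |xy| <= p and |y| >= 1
# fails to stay in L when pumped with i = 2.

def in_L(w):
    """True iff w = a^n b^n for some n >= 0."""
    n = w.count("a")
    return w == "a" * n + "b" * (len(w) - n) and w.count("b") == n

p = 7                                   # a hypothetical pumping length
s = "a" * p + "b" * p
violations = []
for split in range(1, p + 1):           # |xy| = split <= p
    for ylen in range(1, split + 1):    # |y| = ylen >= 1
        x, y, z = s[:split - ylen], s[split - ylen:split], s[split:]
        if in_L(x + y * 2 + z):         # pump once more: i = 2
            violations.append((x, y, z))
print(violations)  # [] -- no decomposition survives pumping
```

The exhaustive check for one value of $p$ is not itself a proof (the proof must handle every $p$), but it makes the structure of the argument concrete: $y$ consists only of $a$'s, so $xy^2z = a^{p+|y|}b^p \notin L$.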
Failure Mode
The pumping lemma is a necessary condition for regularity, not a sufficient one. There exist non-regular languages that satisfy the pumping lemma. The Myhill-Nerode theorem provides a necessary and sufficient condition: $L$ is regular iff the equivalence relation $\equiv_L$ (defined by $x \equiv_L y$ iff for all $z \in \Sigma^*$, $xz \in L \Leftrightarrow yz \in L$) has finitely many equivalence classes.
Decidability by Language Class
| Problem | Regular | Context-Free | Context-Sensitive | R.E. |
|---|---|---|---|---|
| Membership ($w \in L$?) | Decidable (linear time) | Decidable (CYK, $O(n^3)$) | Decidable (PSPACE) | Semi-decidable |
| Emptiness ($L = \emptyset$?) | Decidable | Decidable | Undecidable | Undecidable |
| Universality ($L = \Sigma^*$?) | Decidable | Undecidable | Undecidable | Undecidable |
| Equivalence ($L_1 = L_2$?) | Decidable | Undecidable | Undecidable | Undecidable |
| Containment ($L_1 \subseteq L_2$?) | Decidable | Undecidable | Undecidable | Undecidable |
The sharp drop in decidability between regular and context-free languages reflects the increased complexity of CFGs. For regular languages, all standard questions are decidable because the DFA can be fully analyzed. For context-free languages, membership and emptiness are decidable, but comparison problems (universality, equivalence, containment) become undecidable.
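The decidable membership entry for context-free languages corresponds to the CYK algorithm. Here is a sketch for an assumed Chomsky-normal-form grammar of $\{a^n b^n : n \geq 1\}$ (rules $S \to AB \mid AT$, $T \to SB$, $A \to a$, $B \to b$, an illustrative choice):

```python
# CYK membership test -- the decidable "w in L?" entry for CFLs in the table.
# Runs in O(n^3) time for a grammar in Chomsky normal form.

from itertools import product

# CNF rules for {a^n b^n : n >= 1}: S -> AB | AT, T -> SB, A -> a, B -> b
binary = {("A", "B"): {"S"}, ("A", "T"): {"S"}, ("S", "B"): {"T"}}
unary = {"a": {"A"}, "b": {"B"}}

def cyk(w, start="S"):
    n = len(w)
    if n == 0:
        return False
    # table[i][j] = set of nonterminals deriving the span w[i : i+j+1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, c in enumerate(w):
        table[i][0] = set(unary.get(c, ()))
    for length in range(2, n + 1):          # span length
        for i in range(n - length + 1):     # span start
            for k in range(1, length):      # split point within the span
                for B, C in product(table[i][k - 1], table[i + k][length - k - 1]):
                    table[i][length - 1] |= binary.get((B, C), set())
    return start in table[0][n - 1]

print(cyk("aabb"), cyk("aab"))  # True False
```

By contrast, no algorithm of any complexity exists for CFG equivalence or universality; the table's undecidable entries are absolute, not merely expensive.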
Connection to Neural Networks and NLP
Natural language has nested, recursive structure (relative clauses embedded in relative clauses), which is the hallmark of context-free languages. Finite-state models cannot capture unbounded nesting: the language $\{a^n b^n\}$ requires counting, which exceeds finite-state memory.
Recurrent neural networks (RNNs) with finite precision are equivalent to finite automata. With infinite precision, they can simulate Turing machines, but this is not practical. Transformers occupy a different position: with hard attention, they correspond to a restricted class of circuits ($\mathsf{AC}^0$ or $\mathsf{TC}^0$, depending on the activation functions and precision assumptions). Soft attention transformers can approximate context-free languages but do not exactly compute them.
This means that when a transformer appears to learn nested structure (matching parentheses, parsing arithmetic), it is using an approximation strategy that works for the training distribution but may fail on strings requiring deeper nesting than seen during training.
Common Confusions
Regular expressions in theory vs. practice
Regular expressions in formal language theory correspond exactly to DFAs and describe only regular languages. Regular expressions in programming (Python, Perl, etc.) include backreferences (`\1`, `\2`), which can match non-regular languages like $\{ww : w \in \Sigma^*\}$. With backreferences, the matching problem becomes NP-hard. The name "regular expression" is therefore misleading in the programming context; these are strictly more powerful than the theoretical notion.
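This is easy to verify with Python's `re` module:

```python
# Python's re engine with a backreference matches the copy language {ww},
# which is not regular (and not even context-free).

import re

copy = re.compile(r"(.+)\1")            # \1 must repeat whatever group 1 matched
print(bool(copy.fullmatch("abcabc")))   # True:  w = "abc"
print(bool(copy.fullmatch("abcabd")))   # False: second half differs
```

A theoretical regular expression has no way to say "the same string again"; the backreference is exactly the extra power that pushes matching out of the regular languages.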
The pumping lemma cannot prove regularity
The pumping lemma provides a necessary condition for regularity, not a sufficient one. If a language fails the pumping lemma, it is not regular. But if a language satisfies the pumping lemma, it may still be non-regular. For example, the language $\{a^i b^j c^j : i \geq 1,\ j \geq 0\} \cup \{b^j c^k : j, k \geq 0\}$ satisfies the pumping lemma (pump the leading $a$, or the first symbol when there is no $a$) but is not regular. To prove a language is regular, you must construct a DFA, NFA, or regular expression for it.
Nondeterminism adds power for PDAs but not for FAs or TMs
For finite automata, nondeterminism is a convenience: every NFA has an equivalent DFA. For Turing machines, nondeterminism does not change what is computable (only potentially how fast). But for pushdown automata, nondeterminism strictly increases power. Deterministic context-free languages (those recognized by a DPDA) are a proper subset of the context-free languages. This is why LR parsing (deterministic) cannot handle all context-free grammars.
Exercises
Problem
Prove that $\{a^n b^n : n \geq 0\}$ is not regular using the pumping lemma.
Problem
Write a context-free grammar for the language of balanced parentheses.
Problem
Prove that $\{a^n b^n c^n : n \geq 0\}$ is not context-free using the pumping lemma for context-free languages.
Problem
Construct an NFA with 3 states for the language of binary strings whose second-to-last symbol is 1. Then apply the subset construction to produce the equivalent DFA. How many reachable states does the DFA have?
Problem
Transformers with hard attention (argmax instead of softmax) and fixed precision can be simulated by constant-depth threshold circuits ($\mathsf{TC}^0$). Explain why this implies transformers cannot recognize all context-free languages and give a specific context-free language that fixed-depth transformers provably cannot recognize.
References
Canonical:
- Sipser, Introduction to the Theory of Computation (2013), Chapters 1-2 (automata and CFGs)
- Hopcroft, Motwani, Ullman, Introduction to Automata Theory, Languages, and Computation (2006), Chapters 2-7
Additional:
- Kozen, Automata and Computability (1997), Lectures 1-27
- Chomsky, "Three models for the description of language" (IRE Transactions, 1956)
Connection to NLP and neural networks:
- Merrill et al., "Saturated Transformers are Constant-Depth Threshold Circuits" (TACL, 2022)
- Yun, Bhojanapalli, Rawat, Reddi, Kumar, "Are Transformers universal approximators of sequence-to-sequence functions?" (ICLR, 2020)
Next Topics
- Computability theory: what happens beyond the Chomsky hierarchy, at the boundary of what Turing machines can compute
- P vs NP: within the decidable languages, the central open question about efficient computation
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Basic Logic and Proof Techniques (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)