
Foundations

Formal Languages and Automata

Regular languages, context-free grammars, pushdown automata, the Chomsky hierarchy, pumping lemmas, and connections to parsing, neural sequence models, and computational complexity.


Why This Matters

Formal language theory classifies computational problems by the minimal machine needed to solve them. A finite automaton suffices for pattern matching. A pushdown automaton handles nested structures. A Turing machine handles everything that is computable. This hierarchy is not just a classification exercise: it tells you which problems admit efficient parsing algorithms (regular and context-free) and which require general computation.

[Figure: the Chomsky hierarchy as nested classes. Type 3: Regular (finite automaton, DFA/NFA) ⊂ Type 2: Context-Free (pushdown automaton) ⊂ Type 1: Context-Sensitive (linear-bounded automaton) ⊂ Type 0: Recursively Enumerable (Turing machine). Each level is strictly contained in the one above; more powerful automata recognize strictly larger language classes.]

For ML, the Chomsky hierarchy provides the baseline for understanding what neural sequence models can and cannot represent. Recurrent neural networks with bounded precision are equivalent to finite automata. Transformers have connections to circuit complexity classes. Natural language has context-free structure (nested dependencies) that finite-state models cannot capture. Knowing where the boundaries are prevents you from expecting a model to do something its architecture provably cannot.

Core Definitions

Definition

Alphabet, String, and Language

An alphabet $\Sigma$ is a finite, nonempty set of symbols. A string over $\Sigma$ is a finite sequence of symbols from $\Sigma$. The set of all strings over $\Sigma$ is $\Sigma^*$, which includes the empty string $\varepsilon$. A language $L$ over $\Sigma$ is any subset $L \subseteq \Sigma^*$.

Definition

Deterministic Finite Automaton (DFA)

A DFA is a 5-tuple $(Q, \Sigma, \delta, q_0, F)$ where:

  • $Q$ is a finite set of states
  • $\Sigma$ is a finite input alphabet
  • $\delta: Q \times \Sigma \to Q$ is the transition function (total)
  • $q_0 \in Q$ is the start state
  • $F \subseteq Q$ is the set of accept states

The DFA reads input one symbol at a time, transitions deterministically, and accepts if it ends in a state in $F$.
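The definition translates directly into code. The following is a minimal simulator sketch; the state names and the example machine (binary strings with an even number of 1s) are illustrative choices, not from the text:

```python
# A minimal DFA simulator for the 5-tuple definition above. The example
# machine (state names are illustrative) accepts binary strings with an
# even number of 1s.

def dfa_accepts(delta, q0, accept, w):
    """delta: total transition function as a dict (state, symbol) -> state."""
    q = q0
    for a in w:
        q = delta[(q, a)]        # one deterministic step per input symbol
    return q in accept           # accept iff the run ends in an accept state

delta = {
    ('even', '0'): 'even', ('even', '1'): 'odd',
    ('odd', '0'): 'odd',   ('odd', '1'): 'even',
}

print(dfa_accepts(delta, 'even', {'even'}, '1101'))  # False: three 1s
print(dfa_accepts(delta, 'even', {'even'}, '110'))   # True: two 1s
```

Because $\delta$ is total, the loop never branches or gets stuck: one symbol, one transition.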

Definition

Nondeterministic Finite Automaton (NFA)

An NFA is like a DFA but with $\delta: Q \times (\Sigma \cup \{\varepsilon\}) \to \mathcal{P}(Q)$. The transition function maps to a set of possible next states, and $\varepsilon$-transitions are allowed (the machine may change state without reading input). The NFA accepts if some sequence of choices leads to an accept state.

Definition

Regular Language

A language $L$ is regular if it is recognized by some DFA (equivalently, by some NFA, or described by some regular expression). The regular languages are exactly the Type 3 languages in the Chomsky hierarchy.

Definition

Context-Free Grammar (CFG)

A context-free grammar is a 4-tuple $G = (V, \Sigma, R, S)$ where:

  • $V$ is a finite set of variables (nonterminals)
  • $\Sigma$ is a finite set of terminals (disjoint from $V$)
  • $R$ is a finite set of production rules of the form $A \to w$, where $A \in V$ and $w \in (V \cup \Sigma)^*$
  • $S \in V$ is the start variable

A string $w \in \Sigma^*$ is in the language $L(G)$ if there is a derivation $S \Rightarrow^* w$ using the rules in $R$.
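The derivation relation $S \Rightarrow^* w$ can be sketched by repeatedly rewriting the leftmost nonterminal until only terminals remain. The grammar below (balanced parentheses, an assumption chosen for illustration) uses the empty string to model an $\varepsilon$ right-hand side:

```python
import random

# A sketch of leftmost derivations from a CFG. Rules map each nonterminal to
# a list of right-hand sides; '' models an epsilon rule. The example grammar
# (an assumption) generates balanced parentheses: S -> (S)S | epsilon.

def derive(rules, start='S', max_steps=200, seed=None):
    """Perform one random leftmost derivation; return the terminal string,
    or None if the derivation does not finish within max_steps rewrites."""
    rng = random.Random(seed)
    form = [start]
    for _ in range(max_steps):
        # Find the leftmost nonterminal (a symbol that has rules).
        idx = next((i for i, s in enumerate(form) if s in rules), None)
        if idx is None:
            return ''.join(form)          # all terminals: derivation complete
        rhs = rng.choice(rules[form[idx]])
        form[idx:idx + 1] = list(rhs)     # rewrite A -> rhs in place
    return None

rules = {'S': ['(S)S', '']}
w = derive(rules, seed=1)
print(repr(w))  # a balanced-parenthesis string (possibly empty)
```

Every string this produces is in $L(G)$ by construction, since each step applies one rule in $R$.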

Definition

Pushdown Automaton (PDA)

A pushdown automaton is a 6-tuple $(Q, \Sigma, \Gamma, \delta, q_0, F)$ where $\Gamma$ is the stack alphabet and $\delta: Q \times (\Sigma \cup \{\varepsilon\}) \times (\Gamma \cup \{\varepsilon\}) \to \mathcal{P}(Q \times (\Gamma \cup \{\varepsilon\}))$. A PDA is an NFA augmented with an unbounded stack. A language is context-free if and only if it is accepted by some PDA.
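To make the stack mechanism concrete, here is a recognizer for $\{a^n b^n : n \geq 0\}$ written with an explicit stack (a sketch; this particular language happens to be deterministic context-free, so no nondeterministic branching is needed):

```python
# A stack-based recognizer for {a^n b^n : n >= 0}, sketching how a PDA uses
# its stack: push one marker per 'a', pop one per 'b', accept iff the input
# is consumed with an empty stack.

def accepts_anbn(w):
    stack = []
    i = 0
    # Phase 1: push one marker for each leading 'a'.
    while i < len(w) and w[i] == 'a':
        stack.append('A')
        i += 1
    # Phase 2: pop one marker for each following 'b'.
    while i < len(w) and w[i] == 'b' and stack:
        stack.pop()
        i += 1
    # Accept iff all input is consumed and the stack is empty.
    return i == len(w) and not stack

print(accepts_anbn('aabb'))  # True
print(accepts_anbn('aab'))   # False: unmatched 'a' left on the stack
print(accepts_anbn(''))      # True: n = 0
```

The stack is exactly the extra resource over a DFA: it stores the unbounded count of pending a's that finite memory cannot.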

DFA/NFA Equivalence

Theorem

DFA-NFA Equivalence

Statement

For every NFA $N$ with $n$ states, there exists a DFA $D$ that recognizes the same language. The DFA may have up to $2^n$ states.

Intuition

An NFA can be in multiple states simultaneously (all states reachable by some sequence of nondeterministic choices). The DFA simulates this by tracking the set of states the NFA could be in. Each DFA state corresponds to a subset of the NFA states.

Proof Sketch

Given NFA $N = (Q, \Sigma, \delta, q_0, F)$, construct DFA $D$ with state set $\mathcal{P}(Q)$. The start state is the $\varepsilon$-closure of $\{q_0\}$. The transition function is $\delta_D(S, a) = \bigcup_{q \in S} \varepsilon\text{-closure}(\delta(q, a))$ for each $S \subseteq Q$ and $a \in \Sigma$. The accept states are all $S \subseteq Q$ with $S \cap F \neq \emptyset$. By induction on string length, $D$ accepts $w$ iff $N$ accepts $w$.
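The construction can be coded directly for NFAs without $\varepsilon$-transitions (so every $\varepsilon$-closure is trivial). The example NFA (an assumption) accepts binary strings whose third symbol from the end is 1, a standard witness that the exponential gap between NFA and DFA sizes can occur:

```python
# The subset construction, restricted to epsilon-free NFAs. The example NFA
# (4 states) accepts binary strings whose 3rd symbol from the end is '1';
# its reachable DFA has 2^3 = 8 states (every subset containing state 0).

def subset_construction(nfa_delta, q0, accept, alphabet):
    """nfa_delta: dict (state, symbol) -> set of possible next states."""
    start = frozenset([q0])
    dfa_delta, seen, todo = {}, {start}, [start]
    while todo:
        S = todo.pop()
        for a in alphabet:
            # Union of NFA moves from every state the NFA could be in.
            T = frozenset(t for q in S for t in nfa_delta.get((q, a), ()))
            dfa_delta[(S, a)] = T
            if T not in seen:
                seen.add(T)
                todo.append(T)
    dfa_accept = {S for S in seen if S & accept}
    return dfa_delta, start, dfa_accept, seen

# State 0 loops forever; on '1' it may also guess "3rd from the end" and move
# to state 1; states 1 and 2 count the remaining two symbols; 3 accepts.
nfa = {
    (0, '0'): {0}, (0, '1'): {0, 1},
    (1, '0'): {2}, (1, '1'): {2},
    (2, '0'): {3}, (2, '1'): {3},
}
delta, start, acc, states = subset_construction(nfa, 0, {3}, '01')
print(len(states))  # 8 reachable DFA states from a 4-state NFA
```

Each DFA state is a `frozenset` of NFA states, exactly the subsets $S \subseteq Q$ in the proof; only reachable subsets are materialized.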

Why It Matters

Nondeterminism does not add power to finite automata; it only adds conciseness. An NFA with $n$ states can always be converted to an equivalent DFA, though possibly with exponentially more states. This exponential blowup is tight: there exist languages where the minimal DFA is exponentially larger than the minimal NFA.

Failure Mode

This equivalence holds only for finite automata. Nondeterministic pushdown automata are strictly more powerful than deterministic ones. The language $\{ww^R : w \in \{0,1\}^*\}$ (even-length palindromes) is accepted by a nondeterministic PDA but not by any deterministic PDA. For Turing machines, deterministic and nondeterministic versions recognize the same languages (though the time complexity may differ, which is the essence of the P vs NP question).

The Chomsky Hierarchy

Theorem

The Chomsky Hierarchy

Statement

Languages are classified into a strict hierarchy of four types:

| Type | Grammar restriction | Automaton | Example |
|---|---|---|---|
| 3 (Regular) | $A \to aB$ or $A \to a$ | Finite automaton (DFA/NFA) | $\{a^n : n \geq 0\}$ |
| 2 (Context-free) | $A \to \gamma$ for any $\gamma \in (V \cup \Sigma)^*$ | Pushdown automaton | $\{a^n b^n : n \geq 0\}$ |
| 1 (Context-sensitive) | $\alpha A \beta \to \alpha\gamma\beta$ with $|\gamma| \geq 1$ | Linear-bounded automaton | $\{a^n b^n c^n : n \geq 0\}$ |
| 0 (Recursively enumerable) | No restrictions | Turing machine | HALT |

The inclusions are strict: $\text{Regular} \subsetneq \text{Context-Free} \subsetneq \text{Context-Sensitive} \subsetneq \text{R.E.}$

Intuition

Each level adds a computational resource. Finite automata have finite memory. Pushdown automata add a stack (unbounded LIFO memory). Linear bounded automata have a tape bounded linearly by the input length. Turing machines have unbounded tape. More memory allows recognizing more complex patterns.

Proof Sketch

The inclusions follow from the fact that each machine type can simulate the one below it. The strictness follows from the pumping lemmas (for separating regular from context-free) and specific non-context-free languages like $\{a^n b^n c^n\}$, proved using the pumping lemma for context-free languages. The separation of context-sensitive from r.e. follows from the existence of undecidable languages (no LBA can decide HALT).

Why It Matters

The Chomsky hierarchy provides a framework for classifying the complexity of languages and the computational power required to process them. In NLP, natural language syntax has context-free backbone structure (phrase structure grammars), but some constructions (cross-serial dependencies in Swiss German, copy languages) appear to require mildly context-sensitive power. Knowing the hierarchy helps determine what parsing algorithms and model architectures are appropriate.

Failure Mode

The Chomsky hierarchy has gaps. There is no level between context-free and context-sensitive that captures "mildly context-sensitive" languages, which are important in computational linguistics. Formalisms like tree-adjoining grammars (TAGs) and multiple context-free grammars fill this gap but are not part of the original hierarchy.

The Pumping Lemma for Regular Languages

Theorem

Pumping Lemma for Regular Languages

Statement

If $L$ is a regular language, then there exists a constant $p \geq 1$ (the pumping length) such that every string $w \in L$ with $|w| \geq p$ can be written as $w = xyz$ where:

  1. $|y| \geq 1$ (the pumped part is nonempty)
  2. $|xy| \leq p$ (the pumped part lies within the first $p$ characters)
  3. For all $i \geq 0$, $xy^i z \in L$ (pumping $y$ any number of times stays in $L$)

Intuition

A DFA with $p$ states reading a string of length $\geq p$ must revisit some state (pigeonhole principle). The substring between the two visits to the same state can be repeated (pumped) any number of times, and the DFA will still accept because it is in the same state after each repetition.

Proof Sketch

Let $M$ be a DFA with $p$ states recognizing $L$. Let $w = w_1 w_2 \cdots w_n$ with $n \geq p$. Consider the sequence of states $q_0, q_1, \ldots, q_n$ visited while reading $w$. Since there are $p$ states and $n + 1 \geq p + 1$ state visits, by the pigeonhole principle some state repeats: $q_j = q_k$ for some $0 \leq j < k \leq p$. Set $x = w_1 \cdots w_j$, $y = w_{j+1} \cdots w_k$, $z = w_{k+1} \cdots w_n$. Then $|y| \geq 1$, $|xy| \leq p$, and for any $i$, reading $xy^i z$ starts at $q_0$, reaches $q_j$ after $x$, loops back to $q_j = q_k$ after each copy of $y$, and finishes at $q_n \in F$ after $z$.

Why It Matters

The pumping lemma is the standard tool for proving that a language is not regular. You assume $L$ is regular, apply the pumping lemma to get $p$, choose a specific string $w \in L$ with $|w| \geq p$, and show that no decomposition $w = xyz$ satisfying the three conditions keeps $xy^i z$ in $L$ for all $i$. The resulting contradiction proves $L$ is not regular.
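For small candidate pumping lengths, this argument can be checked mechanically. The brute-force sketch below (the helper names and the range of $p$ are illustrative choices) verifies that for $w = a^p b^p$, every legal decomposition pumps out of $\{a^n b^n\}$:

```python
# Brute-force check of the pumping-lemma argument for L = {a^n b^n}: take
# w = a^p b^p and verify that every decomposition w = xyz with |xy| <= p and
# |y| >= 1 leaves L for some i (checking i = 0 and i = 2 suffices to refute).

def in_L(s):
    n = len(s) // 2
    return s == 'a' * n + 'b' * n

def pumping_fails(p):
    w = 'a' * p + 'b' * p
    for j in range(p + 1):             # j = |x|
        for k in range(j + 1, p + 1):  # k = |xy|, so |y| = k - j >= 1
            x, y, z = w[:j], w[j:k], w[k:]
            if all(in_L(x + y * i + z) for i in (0, 2)):
                return False           # this decomposition survived pumping
    return True                        # every decomposition breaks: L is not pumpable at p

print(all(pumping_fails(p) for p in range(1, 8)))  # True
```

Since $|xy| \leq p$ forces $y$ to consist only of a's here, pumping with $i = 0$ already unbalances the counts, which is exactly the contradiction the proof strategy exploits.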

Failure Mode

The pumping lemma is a necessary condition for regularity, not a sufficient one. There exist non-regular languages that satisfy the pumping lemma. The Myhill-Nerode theorem provides a necessary and sufficient condition: $L$ is regular iff the equivalence relation $x \sim_L y$ (defined by: for all $z$, $xz \in L \iff yz \in L$) has finitely many equivalence classes.

Decidability by Language Class

| Problem | Regular | Context-Free | Context-Sensitive | R.E. |
|---|---|---|---|---|
| Membership ($w \in L$?) | $O(n)$ | $O(n^3)$ (CYK) | Decidable (PSPACE) | Semi-decidable |
| Emptiness ($L = \emptyset$?) | Decidable | Decidable | Undecidable | Undecidable |
| Universality ($L = \Sigma^*$?) | Decidable | Undecidable | Undecidable | Undecidable |
| Equivalence ($L_1 = L_2$?) | Decidable | Undecidable | Undecidable | Undecidable |
| Containment ($L_1 \subseteq L_2$?) | Decidable | Undecidable | Undecidable | Undecidable |

The sharp drop in decidability between regular and context-free languages reflects the increased complexity of CFGs. For regular languages, all standard questions are decidable because the DFA can be fully analyzed. For context-free languages, membership and emptiness are decidable, but comparison problems (universality, equivalence, containment) become undecidable.
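The $O(n^3)$ context-free membership entry corresponds to the CYK algorithm, which requires a grammar in Chomsky normal form. A compact sketch (the example grammar is an assumption, chosen to generate nonempty balanced-parenthesis strings):

```python
# The O(n^3) CYK membership algorithm for a grammar in Chomsky normal form.
# Example grammar (an assumption), generating nonempty balanced parentheses:
#   S -> S S | L R | L T,  T -> S R,  L -> (,  R -> )

def cyk(word, unary, binary, start):
    """unary: terminal -> set of variables A with rule A -> terminal.
    binary: (B, C) -> set of variables A with rule A -> B C.
    T[i][j] holds the variables that derive word[i..j] inclusive."""
    n = len(word)
    if n == 0:
        return False                      # CNF grammars cannot derive epsilon
    T = [[set() for _ in range(n)] for _ in range(n)]
    for i, a in enumerate(word):
        T[i][i] = set(unary.get(a, set()))
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):         # split word[i..j] after position k
                for B in T[i][k]:
                    for C in T[k + 1][j]:
                        T[i][j] |= binary.get((B, C), set())
    return start in T[0][n - 1]

unary = {'(': {'L'}, ')': {'R'}}
binary = {('S', 'S'): {'S'}, ('L', 'R'): {'S'},
          ('L', 'T'): {'S'}, ('S', 'R'): {'T'}}
print(cyk('(())()', unary, binary, 'S'))  # True
print(cyk('(()', unary, binary, 'S'))     # False
```

The three nested loops over length, start position, and split point are the source of the cubic bound; contrast this with the single linear scan a DFA needs for regular membership.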

Connection to Neural Networks and NLP

Natural language has nested, recursive structure (relative clauses embedded in relative clauses), which is the hallmark of context-free languages. Finite-state models cannot capture unbounded nesting: the language $\{a^n b^n : n \geq 0\}$ requires counting, which exceeds finite-state memory.

Recurrent neural networks (RNNs) with finite precision are equivalent to finite automata. With infinite precision, they can simulate Turing machines, but this is not practical. Transformers occupy a different position: with hard attention, they correspond to a restricted class of circuits ($\text{AC}^0$ or $\text{TC}^0$, depending on the activation functions). Soft attention transformers can approximate context-free languages but do not exactly compute them.

This means that when a transformer appears to learn nested structure (matching parentheses, parsing arithmetic), it is using an approximation strategy that works for the training distribution but may fail on strings requiring deeper nesting than seen during training.

Common Confusions

Watch Out

Regular expressions in theory vs. practice

Regular expressions in formal language theory correspond exactly to DFAs and describe only regular languages. Regular expressions in programming (Python, Perl, etc.) include backreferences ($\backslash 1$, $\backslash 2$), which can match non-regular languages like $\{ww : w \in \{a,b\}^*\}$. With backreferences, the matching problem becomes NP-hard. The name "regular expression" is therefore misleading in the programming context; these are strictly more powerful than the theoretical notion.
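This gap is easy to observe directly with Python's `re` module: a single backreference matches the copy language $\{ww\}$, which is not regular (and not even context-free):

```python
import re

# A backreference pattern matching the copy language {ww : w in {a,b}+}.
# The group captures w; \1 then requires an exact second copy of it.
copy = re.compile(r'^([ab]+)\1$')

print(bool(copy.match('abab')))      # True:  w = "ab" repeated
print(bool(copy.match('aba')))       # False: not of the form ww
print(bool(copy.match('aabbaabb')))  # True:  w = "aabb" repeated
```

No DFA, NFA, or theoretical regular expression can express this pattern, so "regex" engines with backreferences sit strictly above the regular languages.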

Watch Out

The pumping lemma cannot prove regularity

The pumping lemma provides a necessary condition for regularity, not a sufficient one. If a language fails the pumping lemma, it is not regular. But if a language satisfies the pumping lemma, it may still be non-regular. For example, the language $\{a^i b^j c^k : \text{if } i = 1 \text{ then } j = k\}$ satisfies the pumping lemma but is not regular. To prove a language is regular, you must construct a DFA, NFA, or regular expression for it.

Watch Out

Nondeterminism adds power for PDAs but not for FAs or TMs

For finite automata, nondeterminism is a convenience: every NFA has an equivalent DFA. For Turing machines, nondeterminism does not change what is computable (only potentially how fast). But for pushdown automata, nondeterminism strictly increases power. Deterministic context-free languages (those recognized by a DPDA) are a proper subset of the context-free languages. This is why LR parsing (deterministic) cannot handle all context-free grammars.

Exercises

ExerciseCore

Problem

Prove that $L = \{a^n b^n : n \geq 0\}$ is not regular using the pumping lemma.

ExerciseCore

Problem

Write a context-free grammar for the language $L = \{a^n b^n : n \geq 0\}$.

ExerciseAdvanced

Problem

Prove that $L = \{a^n b^n c^n : n \geq 0\}$ is not context-free using the pumping lemma for context-free languages.

ExerciseAdvanced

Problem

Construct an NFA with 3 states for the language $L = \{w \in \{0,1\}^* : w \text{ contains the substring } 01\}$. Then apply the subset construction to produce the equivalent DFA. How many reachable states does the DFA have?

ExerciseResearch

Problem

Transformers with hard attention (argmax instead of softmax) and fixed precision can be simulated by constant-depth threshold circuits ($\text{TC}^0$). Explain why this implies transformers cannot recognize all context-free languages and give a specific context-free language that fixed-depth transformers provably cannot recognize.

References

Canonical:

  • Sipser, Introduction to the Theory of Computation (2013), Chapters 1-2 (automata and CFGs)
  • Hopcroft, Motwani, Ullman, Introduction to Automata Theory, Languages, and Computation (2006), Chapters 2-7

Additional:

  • Kozen, Automata and Computability (1997), Chapters 1-27
  • Chomsky, "Three models for the description of language" (IRE Transactions, 1956)

Connection to NLP and neural networks:

  • Merrill et al., "Saturated Transformers are Constant-Depth Threshold Circuits" (TACL, 2022)
  • Yun et al., "Are Transformers Universal Approximators of Sequence-to-Sequence Functions?" (ICLR, 2020)

Next Topics

  • Computability theory: what happens beyond the Chomsky hierarchy, at the boundary of what Turing machines can compute
  • P vs NP: within the decidable languages, the central open question about efficient computation

Last reviewed: April 2026
