
Learning Theory Core

VC Dimension

The Vapnik-Chervonenkis dimension: a combinatorial measure of hypothesis class complexity that characterizes learnability in binary classification.

Core · Tier 1 · Stable · ~75 min

Why This Matters

Uniform convergence tells you that controlling hypothesis class complexity gives generalization. VC dimension tells you how to measure that complexity for infinite classes.

The finite-class bound depends on \log|\mathcal{H}|, which is infinite for continuous hypothesis classes like linear classifiers. VC dimension replaces the crude counting argument with a combinatorial measure: how many points can the class label in all possible ways? This turns out to be the right notion of complexity for binary classification. Finite VC dimension is both necessary and sufficient for PAC learnability.

[Figure: two panels. Left ("3 points: can shatter"): for any labeling of 3 points, a line can separate + from -. Right ("4 points: cannot shatter"): the XOR labeling admits no separating line. Caption: VC dimension of linear classifiers in 2D is 3; they shatter any 3 non-collinear points but cannot shatter any configuration of 4 points.]

Mental Model

Imagine placing points on a table and asking: can my hypothesis class produce every possible labeling of these points? If I put down 3 points and my class can label them in all 2^3 = 8 ways, the class "shatters" those 3 points. The VC dimension is the largest number of points you can shatter.

A class with high VC dimension can fit many patterns, including spurious ones. A class with low VC dimension is constrained. It cannot memorize arbitrary labelings, so good performance on training data is more likely to reflect genuine structure.

Formal Setup and Notation

We work with binary classification: \mathcal{Y} = \{0, 1\} and \mathcal{H} \subseteq \{0, 1\}^{\mathcal{X}}. The loss is the 0-1 loss \ell(h(x), y) = \mathbf{1}[h(x) \neq y].

For a set C = \{x_1, \ldots, x_m\} \subseteq \mathcal{X}, the restriction of \mathcal{H} to C is:

\mathcal{H}_C = \{(h(x_1), \ldots, h(x_m)) : h \in \mathcal{H}\} \subseteq \{0, 1\}^m

This is the set of all labeling patterns that \mathcal{H} can produce on C.

Core Definitions

Definition

Shattering

A hypothesis class \mathcal{H} \subseteq \{0,1\}^{\mathcal{X}} shatters a set C = \{x_1, \ldots, x_m\} \subseteq \mathcal{X} if for every labeling (b_1, \ldots, b_m) \in \{0,1\}^m, there exists h \in \mathcal{H} such that h(x_i) = b_i for all i.

Equivalently, |\mathcal{H}_C| = 2^m. The restriction of \mathcal{H} to C contains all possible binary labelings.

Definition

Growth Function

The growth function (or shattering coefficient) of \mathcal{H} is:

\Pi_{\mathcal{H}}(m) = \max_{C \subseteq \mathcal{X},\, |C| = m} |\mathcal{H}_C|

It counts the maximum number of distinct labelings \mathcal{H} can produce on any set of m points. Always \Pi_{\mathcal{H}}(m) \leq 2^m, with equality if and only if \mathcal{H} can shatter some set of size m.

Definition

VC Dimension

The VC dimension of \mathcal{H}, denoted \text{VCdim}(\mathcal{H}) or d_{\text{VC}}, is the largest m such that \Pi_{\mathcal{H}}(m) = 2^m.

\text{VCdim}(\mathcal{H}) = \max\{m : \exists C \text{ with } |C| = m \text{ shattered by } \mathcal{H}\}

If \mathcal{H} can shatter arbitrarily large sets, \text{VCdim}(\mathcal{H}) = \infty.

Critical asymmetry: To show \text{VCdim}(\mathcal{H}) \geq d, you exhibit one set of size d that is shattered. To show \text{VCdim}(\mathcal{H}) < d, you must show that no set of size d can be shattered. The existential vs. universal quantifiers make lower bounds easier than upper bounds.
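These definitions can be checked by brute force for small finite classes. The sketch below (Python; the helper names are made up for illustration, and everything is restricted to a finite grid of hypotheses and points) computes the restriction \mathcal{H}_C, the growth function, and the VC dimension for the class of thresholds h_t(x) = \mathbf{1}[x \geq t]:

```python
from itertools import combinations

def restriction(hypotheses, points):
    """All labeling patterns the class produces on `points` (the set H_C)."""
    return {tuple(h(x) for x in points) for h in hypotheses}

def growth_function(hypotheses, domain, m):
    """Max |H_C| over all subsets C of `domain` with |C| = m."""
    return max(len(restriction(hypotheses, C)) for C in combinations(domain, m))

def vc_dimension(hypotheses, domain):
    """Largest m such that some m-subset of `domain` is shattered."""
    d = 0
    for m in range(1, len(domain) + 1):
        if growth_function(hypotheses, domain, m) == 2 ** m:
            d = m
        else:
            break  # if no m-set is shattered, no larger set can be shattered
    return d

# Thresholds h_t(x) = 1[x >= t] on a small grid.
domain = [0.0, 1.0, 2.0, 3.0, 4.0]
thresholds = [lambda x, t=t: int(x >= t) for t in [-1.0, 0.5, 1.5, 2.5, 3.5, 4.5]]

print(vc_dimension(thresholds, domain))                               # 1
print([growth_function(thresholds, domain, m) for m in range(1, 5)])  # [2, 3, 4, 5] = m + 1
```

On this grid the thresholds give \Pi(m) = m + 1 and VC dimension 1, matching the threshold example discussed with the Sauer-Shelah lemma below. The early break is justified because every subset of a shattered set is itself shattered.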

Main Theorems

Lemma

Sauer-Shelah Lemma

Statement

If \text{VCdim}(\mathcal{H}) = d, then for all m \in \mathbb{N}:

\Pi_{\mathcal{H}}(m) \leq \sum_{i=0}^{d} \binom{m}{i}

and additionally, for all m \geq d:

\sum_{i=0}^{d} \binom{m}{i} \leq \left(\frac{em}{d}\right)^d

The (em/d)^d form requires m \geq d; for m < d it can fall below the binomial sum and is not a valid upper bound. In particular, for m \geq d, the growth function is polynomial in m (of degree d) rather than exponential.
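A quick numerical check makes the transition visible. This sketch (Python, with d = 3 chosen just for illustration) compares 2^m, the binomial-sum bound, and the (em/d)^d form:

```python
from math import comb, e

d = 3  # assumed VC dimension
for m in [1, 2, 3, 4, 5, 10, 20, 50]:
    sauer = sum(comb(m, i) for i in range(d + 1))  # Sauer-Shelah bound
    poly = (e * m / d) ** d if m >= d else None    # (em/d)^d, only valid for m >= d
    print(f"m={m:3d}  2^m={2**m:>16}  sauer={sauer:>6}  (em/d)^d={poly}")
```

For m \leq d the bound equals 2^m exactly; beyond d it grows like m^3 while 2^m explodes.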

Intuition

Once m exceeds the VC dimension, the hypothesis class cannot produce all 2^m labelings. The Sauer-Shelah lemma quantifies exactly how constrained the class becomes: the number of achievable labelings grows only polynomially, not exponentially. This is the phase transition that makes uniform convergence possible for finite-VC-dimension classes.

Proof Sketch

By induction on m + d. The formal object is the restriction \mathcal{H}_C on the set C = \{x_1, \ldots, x_m\}. The key step: for any restriction on m points with VC dimension d, partition the labelings by what happens when you project out one point x_m. The labelings whose projection onto \{x_1, \ldots, x_{m-1}\} appears with only one value of the x_m-coordinate form a restriction of VC dimension at most d on the m - 1 remaining points. The labelings where both values of the x_m-coordinate appear (both extensions are achieved) form a restriction of VC dimension at most d - 1 on m - 1 points, because if this class shattered a set of size d, adding x_m would give a shattered set of size d + 1 for the original class. Apply the inductive hypothesis to both parts and use the identity \binom{m-1}{i} + \binom{m-1}{i-1} = \binom{m}{i}. Rigorous versions are usually presented via Pajor's trace argument; see Alon, Ben-David, Cesa-Bianchi, and Haussler (1997) or Pajor (1985).
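The inductive bookkeeping can be checked numerically: the recurrence B(m, d) = B(m-1, d) + B(m-1, d-1) produced by the two parts, with B(m, 0) = B(0, d) = 1, reproduces the binomial sum in the lemma. A minimal verification (Python):

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def B(m, d):
    """Recurrence from the inductive step of Sauer-Shelah."""
    if m == 0 or d == 0:
        return 1
    return B(m - 1, d) + B(m - 1, d - 1)

# B(m, d) equals the binomial sum for all small m, d
assert all(B(m, d) == sum(comb(m, i) for i in range(d + 1))
           for m in range(20) for d in range(10))
```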

Why It Matters

Without Sauer-Shelah, you cannot plug VC dimension into uniform convergence bounds. The lemma is the bridge: it converts the combinatorial statement "VC dimension is d" into the quantitative bound \Pi_{\mathcal{H}}(m) = O(m^d), which can then be used in place of |\mathcal{H}| in union-bound-style arguments. The transition from exponential to polynomial growth is what makes learning possible.

Failure Mode

The Sauer-Shelah bound is tight in the worst case (there exist classes achieving equality), but can be very loose for specific hypothesis classes. For example, the class of thresholds on \mathbb{R} has VC dimension 1 and growth function m + 1, which matches the Sauer-Shelah bound \binom{m}{0} + \binom{m}{1} = m + 1. But for more structured classes the bound can significantly overestimate.

Theorem

VC Generalization Bound

Statement

Let \mathcal{H} have VC dimension d < \infty. For any distribution \mathcal{D} and any \delta > 0, with probability at least 1 - \delta over an i.i.d. sample of size n:

\sup_{h \in \mathcal{H}} |R(h) - \hat{R}_n(h)| \leq C\sqrt{\frac{d \log(n/d) + \log(1/\delta)}{n}}

where C is a universal constant independent of \mathcal{H}, \mathcal{D}, n, and \delta. The original VC proof (Vapnik 1998) gives C \approx 4 up to logarithmic factors; Mohri, Rostamizadeh, and Talwalkar (2018), Theorem 3.22, give an explicit form suitable for numerical sample-size calculations. In particular, the sample complexity of uniform convergence (and hence ERM learning) is:

m(\epsilon, \delta) = O\!\left(\frac{d \log(1/\epsilon) + \log(1/\delta)}{\epsilon^2}\right)

Intuition

Replace \log|\mathcal{H}| in the finite-class bound with d \log(n/d). For a continuous class, \log|\mathcal{H}| is infinite, but d = \text{VCdim}(\mathcal{H}) can still be finite, so the bound stays meaningful. This substitution is what makes learning with infinite classes (like all halfspaces) possible.
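To see what the bound implies numerically, one can scan for the sample size at which the deviation term drops below a target \epsilon. The sketch below (Python) is illustrative only: the constant C = 1 and the exact functional form are assumptions, not the tight constants from the literature.

```python
from math import log, sqrt

def vc_deviation_bound(n, d, delta, C=1.0):
    """Right-hand side of the uniform-deviation bound (C is an assumed constant)."""
    return C * sqrt((d * log(n / d) + log(1 / delta)) / n)

def sample_size(d, eps, delta, C=1.0):
    """First n (scanning upward in ~10% steps) where the bound drops below eps."""
    n = d + 1
    while vc_deviation_bound(n, d, delta, C) > eps:
        n = int(n * 1.1) + 1
    return n

print(sample_size(d=3, eps=0.05, delta=0.01))    # halfplanes in R^2
print(sample_size(d=101, eps=0.05, delta=0.01))  # halfspaces in R^100
```

The near-linear dependence on d is visible in the ratio of the two outputs.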

Proof Sketch

The proof uses the symmetrization technique in four steps:

  1. Double sampling. Introduce a "ghost sample" S' of the same size. Show that \Pr[\sup_h |R(h) - \hat{R}_n(h)| > \epsilon] \leq 2\Pr[\sup_h |\hat{R}_n(h) - \hat{R}'_n(h)| > \epsilon/2].
  2. Rademacher symmetrization. Swap the paired examples of S and S' using random signs \sigma_i \in \{-1, +1\}. The distribution of the supremum does not change because the swapped sample has the same distribution as the original.
  3. Conditioning and growth function. Condition on the 2n points and use the growth function to bound the number of effectively distinct hypotheses: at most \Pi_{\mathcal{H}}(2n).
  4. Sauer-Shelah + Hoeffding. Replace \Pi_{\mathcal{H}}(2n) with (2en/d)^d and apply Hoeffding to the resulting finite set of hypotheses. Union bound over at most (2en/d)^d effective hypotheses.

Why It Matters

This is the central theorem of classical learning theory for binary classification. Combined with the fundamental theorem of statistical learning, it shows: a binary hypothesis class is PAC-learnable if and only if its VC dimension is finite. The sample complexity scales linearly in d (up to log factors).

Failure Mode

The VC bound is distribution-free (worst case over all \mathcal{D}). This makes it robust but often extremely loose. For "nice" distributions (e.g., data with large margin), data-dependent bounds like Rademacher complexity can be much tighter. More critically, the VC bound gives vacuous guarantees for modern neural networks, whose VC dimension grows with the number of parameters, often in the billions.

Proof Ideas and Templates Used

The VC bound proof introduces two major techniques:

  1. Symmetrization (ghost sample trick): replace the unknown population risk R(h) with the empirical risk \hat{R}'_n(h) on a second independent sample. This converts a population-vs-empirical gap into an empirical-vs-empirical gap, which is easier to control.

  2. Effective hypothesis counting: on any fixed set of 2n points, the infinite class \mathcal{H} induces at most \Pi_{\mathcal{H}}(2n) distinct labelings. Sauer-Shelah bounds this by (2en/d)^d. Now you have a finite union bound with polynomial (not infinite) cost.

These two ideas recur throughout learning theory, and are also the foundation for Rademacher complexity bounds.

Canonical Examples

Example

Halfplanes in R^2 have VC dimension 3

Let \mathcal{H} be the set of all halfplanes in \mathbb{R}^2: h_{w,b}(x) = \mathbf{1}[w \cdot x + b \geq 0].

Lower bound (\text{VCdim} \geq 3): Take three non-collinear points, e.g., the vertices of an equilateral triangle. Non-collinearity is essential: for three collinear points, the middle point cannot be separated from both endpoints by a halfplane, so (0, 1, 0) is unachievable.

For non-collinear points, check all 2^3 = 8 labelings: (0,0,0) and (1,1,1) are achieved by halfplanes excluding or containing all three; the three singleton-positive labelings (1,0,0), (0,1,0), (0,0,1) each isolate one vertex on one side of a line (possible by the symmetry of the equilateral triangle); the three singleton-negative labelings (0,1,1), (1,0,1), (1,1,0) follow by flipping the halfplane orientation. All 8 labelings are achievable.

Upper bound (\text{VCdim} < 4): Take any 4 points in \mathbb{R}^2. If one point p lies inside the convex hull of the other three, the labeling that assigns p the opposite label from all others cannot be achieved by a halfplane (because p is a convex combination of points with the opposite label). If no point is in the convex hull of the others (the 4 points form a convex quadrilateral), the "alternating" labeling (+, -, +, -) on consecutive vertices cannot be achieved. Either way, no set of 4 points is shattered.

Therefore \text{VCdim}(\mathcal{H}) = 3. More generally, halfspaces in \mathbb{R}^d have VC dimension d + 1.

General d via Radon's theorem. The clean proof of the upper bound for arbitrary d uses Radon's theorem (Radon 1921): any d + 2 points in \mathbb{R}^d can be partitioned into two disjoint sets A, B whose convex hulls intersect. If a halfspace labels A as 1 and B as 0, a point in \text{conv}(A) \cap \text{conv}(B) would lie on both sides, a contradiction. So no d + 2 points are shattered. See Shalev-Shwartz and Ben-David, Understanding Machine Learning, Theorem 9.2.
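The case analysis above can also be automated for any small point set: a labeling is achievable by a halfplane exactly when the two label groups are linearly separable, which can be tested as a linear-programming feasibility problem. A sketch (Python, assuming NumPy and SciPy are available; the margin-1 encoding is one standard separability test, and the helper names are made up for illustration):

```python
from itertools import product
import numpy as np
from scipy.optimize import linprog

def halfplane_achieves(points, labels):
    """Feasibility LP: does (w, b) exist with y_i (w.x_i + b) >= 1, y_i in {-1, +1}?
    Margin-1 separation is equivalent, up to rescaling, to the labeling being
    realizable by h(x) = 1[w.x + b >= 0]."""
    X = np.asarray(points, dtype=float)
    y = np.where(np.asarray(labels) == 1, 1.0, -1.0)
    # Variables (w1, w2, b); constraints -y_i * (w.x_i + b) <= -1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((len(X), 1))])
    b_ub = -np.ones(len(X))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.status == 0  # 0 = feasible solution found

def shattered(points):
    return all(halfplane_achieves(points, labels)
               for labels in product([0, 1], repeat=len(points)))

triangle = [(0, 0), (1, 0), (0.5, 1)]      # non-collinear
square = [(0, 0), (1, 0), (1, 1), (0, 1)]  # convex position
print(shattered(triangle))  # True
print(shattered(square))    # False: the XOR labeling is infeasible
```

Running it confirms the analysis: the triangle is shattered, while the square fails on the alternating labeling.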

Example

Intervals on the real line: VC dimension 2

\mathcal{H} = \{h_{a,b} : h_{a,b}(x) = \mathbf{1}[a \leq x \leq b]\}.

Shatters any 2 points x_1 < x_2: use [x_1, x_2] for (1,1); [x_1, x_1] for (1,0); [x_2, x_2] for (0,1); [x_2 + 1, x_2 + 2] for (0,0) (a bounded interval disjoint from both points).

Cannot shatter any 3 points x_1 < x_2 < x_3: the labeling (1, 0, 1) requires an interval containing x_1 and x_3 but not x_2, which is impossible for a single interval. So \text{VCdim} = 2.
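The same argument is easy to verify exhaustively: a labeling is achievable by a single interval exactly when no 0-labeled point lies strictly between two 1-labeled points. A brute-force check (Python; assumes the points are distinct):

```python
from itertools import product

def interval_achieves(points, labels):
    """Is there an interval [a, b] containing exactly the points labeled 1?"""
    ones = [x for x, y in zip(points, labels) if y == 1]
    if not ones:
        return True  # an interval disjoint from all points works
    lo, hi = min(ones), max(ones)
    # Any interval containing all 1-points contains [lo, hi], so no 0-point may lie inside.
    return all(not (lo < x < hi) for x, y in zip(points, labels) if y == 0)

def shattered(points):
    return all(interval_achieves(points, labels)
               for labels in product([0, 1], repeat=len(points)))

print(shattered([1.0, 2.0]))       # True  -> VCdim >= 2
print(shattered([1.0, 2.0, 3.0]))  # False -> the (1, 0, 1) labeling fails
```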

Example

The sin-classifier: one parameter, infinite VC dimension

Consider \mathcal{H} = \{x \mapsto \mathbf{1}[\sin(\theta x) \geq 0] : \theta > 0\}. Despite having only one parameter \theta, this class has infinite VC dimension.

Infinite VC dimension means arbitrarily large shattered sets exist, not that every point configuration is shatterable. For every m, there exist m-point configurations that are shattered by this class. The classical construction takes x_i = 2^{-i} for i = 1, \ldots, m: for any target labeling (b_1, \ldots, b_m) \in \{0,1\}^m, one can choose \theta large enough so that the sign of \sin(\theta x_i) matches b_i at each sample point. This shows VC dimension can be much larger than the number of parameters. See Vapnik, Statistical Learning Theory (1998), §4.8, or Anthony and Bartlett, Neural Network Learning, Theorem 7.8.
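The construction can be checked directly. The sketch below (Python) uses one explicit choice of \theta (a base-2 variant of the classical formula, offered as an illustration rather than the unique choice) and verifies every labeling of m = 5 points by brute force:

```python
from itertools import product
from math import sin, pi

m = 5
xs = [2.0 ** -i for i in range(1, m + 1)]  # x_i = 2^{-i}

def classify(theta, x):
    return 1 if sin(theta * x) >= 0 else 0

for labels in product([0, 1], repeat=m):
    # Explicit theta realizing this labeling (assumed base-2 variant of the construction):
    theta = pi * (1 + sum((1 - b) * 2 ** i for i, b in enumerate(labels, start=1)))
    assert all(classify(theta, x) == b for x, b in zip(xs, labels))

print("all", 2 ** m, "labelings realized")  # so this configuration is shattered
```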

Extensions to Real-Valued Function Classes

VC dimension is defined for binary function classes. For real-valued function classes (regression, scoring functions, neural network outputs before thresholding), two standard extensions apply:

  • Pseudo-dimension (Pollard 1984; Kearns and Schapire 1994): the VC dimension of the class of thresholded functions \{(x, t) \mapsto \mathbf{1}[f(x) \geq t] : f \in \mathcal{F}\}. Used for uniform convergence bounds on real-valued losses.
  • Fat-shattering dimension (Kearns and Schapire 1994; Bartlett, Long, and Williamson 1996): a scale-sensitive refinement requiring the threshold separation to exceed a margin \gamma. Useful for margin bounds and continuous classes where pseudo-dimension is too coarse.

Bartlett, Harvey, Liaw, and Mehrabian (2019), cited in the references, prove pseudo-dimension bounds for piecewise-linear networks; the same bounds transfer to the VC dimension of the corresponding thresholded classifiers.

PAC Learnability Characterization

The fundamental theorem of statistical learning states that for a binary class \mathcal{H} with 0-1 loss:

  • \mathcal{H} is realizable-PAC-learnable iff \text{VCdim}(\mathcal{H}) < \infty iff ERM on \mathcal{H} is a realizable PAC learner (Shalev-Shwartz and Ben-David, Understanding Machine Learning, Theorem 6.7).
  • \mathcal{H} is agnostic-PAC-learnable iff \text{VCdim}(\mathcal{H}) < \infty iff ERM on \mathcal{H} is an agnostic PAC learner (Shalev-Shwartz and Ben-David, Theorem 6.8).

Finite VC dimension characterizes learnability in both the realizable setting (where some h \in \mathcal{H} has zero risk) and the agnostic setting (no such assumption). In both cases the same combinatorial quantity, d = \text{VCdim}(\mathcal{H}), governs sample complexity.

Common Confusions

Watch Out

VC dimension is not the number of parameters

A common misconception is that VC dimension equals the number of parameters. While this happens to be true for halfspaces (d + 1 parameters, VC dimension d + 1), it is not true in general. The sin-classifier above has 1 parameter but infinite VC dimension. Conversely, there exist high-dimensional parameterizations with low VC dimension due to structural constraints. The relationship between parameters and VC dimension is nuanced and class-specific.

Watch Out

Shattering requires some set, not every set

\text{VCdim}(\mathcal{H}) = d means there exists a set of size d that is shattered, not that every set of size d is shattered. When proving a lower bound, you only need to find one shatterable set. When proving an upper bound, you must show that no set of size d + 1 can be shattered; this requires a universal argument, which is typically harder.

Watch Out

VC dimension is worst-case; Rademacher complexity is data-dependent

VC dimension is a combinatorial worst-case measure: it is a single number that depends only on \mathcal{H}, not on the data distribution. Rademacher complexity is an average-case measure: it measures how well the class correlates with random noise on the specific data at hand. When the data has structure (e.g., large margin), Rademacher complexity captures this while VC dimension does not. The VC generalization bound is always at least as loose as the Rademacher bound.

Summary

  • VC dimension = size of the largest set the class can shatter
  • Sauer-Shelah: once past the VC dimension, growth is polynomial (O(m^d)), not exponential (2^m)
  • VC generalization bound: sample complexity \tilde{O}(d/\epsilon^2) for uniform convergence
  • Finite VC dimension \Leftrightarrow realizable-PAC-learnability \Leftrightarrow agnostic-PAC-learnability (for binary classification with 0-1 loss), and in both cases ERM is a valid learner
  • Halfspaces in \mathbb{R}^d: \text{VCdim} = d + 1
  • VC dimension does not equal the number of parameters in general
  • VC dimension is worst-case and distribution-free; Rademacher complexity is distribution-aware and often tighter

Exercises

ExerciseCore

Problem

Compute the VC dimension of the class of all unions of k intervals on the real line: \mathcal{H}_k = \{x \mapsto \mathbf{1}[x \in I_1 \cup \cdots \cup I_k]\} where each I_j = [a_j, b_j].

ExerciseCore

Problem

Let \mathcal{H}_1 and \mathcal{H}_2 be hypothesis classes with VC dimensions d_1 and d_2. Show that \text{VCdim}(\mathcal{H}_1 \cup \mathcal{H}_2) \leq d_1 + d_2 + 1.

ExerciseAdvanced

Problem

A deep ReLU neural network with W total weights and L layers has VC dimension \Theta(W L \log W) (Bartlett, Harvey, Liaw, and Mehrabian 2019, arXiv:1703.02930). The O(W \log W) form applies to shallow (constant-depth) networks and underestimates the VC dimension of deep networks. If a network has W = 10^8 parameters, L = 50 layers, and is trained on n = 10^6 examples, what does the VC bound say about generalization? Why is this problematic given that such networks often achieve low test error in practice?


References

Canonical:

  • Vapnik & Chervonenkis, "On the Uniform Convergence of Relative Frequencies of Events to their Probabilities" (1971). Proved the growth-function bound before Sauer and Shelah.
  • Sauer, "On the Density of Families of Sets" (1972). Independent proof of the growth-function bound.
  • Shelah, "A Combinatorial Problem; Stability and Order for Models and Theories in Infinitary Languages" (1972). Independent proof from model theory. The result is commonly called the Sauer-Shelah lemma (sometimes Sauer-Shelah-Perles, with Perles credited for an unpublished earlier proof).
  • Radon, "Mengen konvexer Körper, die einen gemeinsamen Punkt enthalten" (1921). Radon's theorem, used for the halfspace VC upper bound.
  • Shalev-Shwartz & Ben-David, Understanding Machine Learning, Chapters 6-7 and Theorem 9.2

Current:

  • Zhang, Bengio, Hardt, Recht, Vinyals, "Understanding deep learning requires rethinking generalization" (ICLR 2017)
  • Bartlett, Harvey, Liaw, Mehrabian, "Nearly-tight VC-dimension and pseudodimension bounds for deep neural networks" (JMLR 2019, arXiv:1703.02930). Gives \Theta(W L \log W) pseudo-dimension bounds for deep piecewise-linear networks.

Supporting:

  • Vapnik, Statistical Learning Theory (1998), §4.8. Sin-classifier example and explicit VC constants.
  • Anthony & Bartlett, Neural Network Learning: Theoretical Foundations (1999), Theorem 7.8.
  • Mohri, Rostamizadeh, Talwalkar, Foundations of Machine Learning (2018), Chapters 2-4 and Theorem 3.22 for explicit VC constants.
  • Pollard, Convergence of Stochastic Processes (1984). Pseudo-dimension.
  • Kearns & Schapire, "Efficient Distribution-Free Learning of Probabilistic Concepts" (1994). Fat-shattering dimension.
  • Alon, Ben-David, Cesa-Bianchi, Haussler, "Scale-sensitive dimensions, uniform convergence, and learnability" (JACM 1997). Rigorous trace-based proofs.
  • Pajor, Sous-espaces \ell_1^n des espaces de Banach (1985). Trace argument for Sauer-Shelah.

Next Topics

From VC dimension, the natural next steps are:

  • Rademacher complexity: data-dependent complexity that can give tighter, distribution-aware bounds
  • Algorithmic stability: an alternative to uniform convergence that can explain generalization when VC bounds are vacuous

Last reviewed: April 2026
