
Learning Theory Core

VC Dimension

The Vapnik-Chervonenkis dimension: a combinatorial measure of hypothesis class complexity that characterizes learnability in binary classification.

Core · Tier 1 · Stable · ~75 min

Why This Matters

Uniform convergence tells you that controlling hypothesis class complexity gives generalization. VC dimension tells you how to measure that complexity for infinite classes.

The finite-class bound depends on \log|\mathcal{H}|, which is infinite for continuous hypothesis classes like linear classifiers. VC dimension replaces the crude counting argument with a combinatorial measure: how many points can the class label in all possible ways? This turns out to be the right notion of complexity for binary classification. Finite VC dimension is both necessary and sufficient for PAC learnability.

[Figure: two panels. Left ("3 points: can shatter"): for any labeling of 3 points, a line can separate + from -. Right ("4 points: cannot shatter"): the XOR labeling admits no separating line. Caption: VC dimension of linear classifiers in 2D is 3; they shatter any 3 non-collinear points but cannot shatter any configuration of 4 points.]

Mental Model

Imagine placing points on a table and asking: can my hypothesis class produce every possible labeling of these points? If I put down 3 points and my class can label them in all 2^3 = 8 ways, the class "shatters" those 3 points. The VC dimension is the largest number of points you can shatter.

A class with high VC dimension can fit many patterns, including spurious ones. A class with low VC dimension is constrained. It cannot memorize arbitrary labelings, so good performance on training data is more likely to reflect genuine structure.

Formal Setup and Notation

We work with binary classification: \mathcal{Y} = \{0, 1\} and \mathcal{H} \subseteq \{0, 1\}^{\mathcal{X}}. The loss is the 0-1 loss \ell(h(x), y) = \mathbf{1}[h(x) \neq y].

For a set C = \{x_1, \ldots, x_m\} \subseteq \mathcal{X}, the restriction of \mathcal{H} to C is:

\mathcal{H}_C = \{(h(x_1), \ldots, h(x_m)) : h \in \mathcal{H}\} \subseteq \{0, 1\}^m

This is the set of all labeling patterns that \mathcal{H} can produce on C.

Core Definitions

Definition

Shattering

A hypothesis class \mathcal{H} \subseteq \{0,1\}^{\mathcal{X}} shatters a set C = \{x_1, \ldots, x_m\} \subseteq \mathcal{X} if for every labeling (b_1, \ldots, b_m) \in \{0,1\}^m, there exists h \in \mathcal{H} such that h(x_i) = b_i for all i.

Equivalently, |\mathcal{H}_C| = 2^m. The restriction of \mathcal{H} to C contains all possible binary labelings.

Definition

Growth Function

The growth function (or shattering coefficient) of \mathcal{H} is:

\Pi_{\mathcal{H}}(m) = \max_{C \subseteq \mathcal{X},\, |C| = m} |\mathcal{H}_C|

It counts the maximum number of distinct labelings \mathcal{H} can produce on any set of m points. Always \Pi_{\mathcal{H}}(m) \leq 2^m, with equality if and only if \mathcal{H} can shatter some set of size m.

Definition

VC Dimension

The VC dimension of \mathcal{H}, denoted \text{VCdim}(\mathcal{H}) or d_{\text{VC}}, is the largest m such that \Pi_{\mathcal{H}}(m) = 2^m.

\text{VCdim}(\mathcal{H}) = \max\{m : \exists C \text{ with } |C| = m \text{ shattered by } \mathcal{H}\}

If \mathcal{H} can shatter arbitrarily large sets, \text{VCdim}(\mathcal{H}) = \infty.

Critical asymmetry: To show \text{VCdim}(\mathcal{H}) \geq d, you exhibit one set of size d that is shattered. To show \text{VCdim}(\mathcal{H}) < d, you must show that no set of size d can be shattered. The existential vs. universal quantifiers make lower bounds easier than upper bounds.
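These definitions can be checked by brute force for small finite classes. The sketch below (Python; the helper names are made up for illustration, and everything is restricted to a finite grid of hypotheses and points) computes the restriction \mathcal{H}_C, the growth function, and the VC dimension for the class of thresholds h_t(x) = \mathbf{1}[x \geq t]:

```python
from itertools import combinations

def restriction(hypotheses, points):
    """All labeling patterns the class produces on `points` (the set H_C)."""
    return {tuple(h(x) for x in points) for h in hypotheses}

def growth_function(hypotheses, domain, m):
    """Max |H_C| over all subsets C of `domain` with |C| = m."""
    return max(len(restriction(hypotheses, C)) for C in combinations(domain, m))

def vc_dimension(hypotheses, domain):
    """Largest m such that some m-subset of `domain` is shattered."""
    d = 0
    for m in range(1, len(domain) + 1):
        if growth_function(hypotheses, domain, m) == 2 ** m:
            d = m
        else:
            break  # if no m-set is shattered, no larger set can be shattered
    return d

# Thresholds h_t(x) = 1[x >= t] on a small grid.
domain = [0.0, 1.0, 2.0, 3.0, 4.0]
thresholds = [lambda x, t=t: int(x >= t) for t in [-1.0, 0.5, 1.5, 2.5, 3.5, 4.5]]

print(vc_dimension(thresholds, domain))                               # 1
print([growth_function(thresholds, domain, m) for m in range(1, 5)])  # [2, 3, 4, 5] = m + 1
```

On this grid the thresholds give \Pi(m) = m + 1 and VC dimension 1, matching the threshold example discussed with the Sauer-Shelah lemma below. The early break is justified because every subset of a shattered set is itself shattered.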

Main Theorems

Lemma

Sauer-Shelah Lemma

Statement

If \text{VCdim}(\mathcal{H}) = d, then for all m \in \mathbb{N}:

\Pi_{\mathcal{H}}(m) \leq \sum_{i=0}^{d} \binom{m}{i}

and additionally, for all m \geq d:

\sum_{i=0}^{d} \binom{m}{i} \leq \left(\frac{em}{d}\right)^d

The (em/d)^d form requires m \geq d; for m < d it can fall below the binomial sum and is not a valid upper bound. In particular, for m \geq d, the growth function is polynomial in m (of degree d) rather than exponential.
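A quick numerical check makes the transition visible. This sketch (Python, with d = 3 chosen just for illustration) compares 2^m, the binomial-sum bound, and the (em/d)^d form:

```python
from math import comb, e

d = 3  # assumed VC dimension
for m in [1, 2, 3, 4, 5, 10, 20, 50]:
    sauer = sum(comb(m, i) for i in range(d + 1))  # Sauer-Shelah bound
    poly = (e * m / d) ** d if m >= d else None    # (em/d)^d, only valid for m >= d
    print(f"m={m:3d}  2^m={2**m:>16}  sauer={sauer:>6}  (em/d)^d={poly}")
```

For m \leq d the bound equals 2^m exactly; beyond d it grows like m^3 while 2^m explodes.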

Intuition

Once m exceeds the VC dimension, the hypothesis class cannot produce all 2^m labelings. The Sauer-Shelah lemma quantifies exactly how constrained the class becomes: the number of achievable labelings grows only polynomially, not exponentially. This is the phase transition that makes uniform convergence possible for finite-VC-dimension classes.

Proof Sketch

By induction on m + d. The formal object is the restriction \mathcal{H}_C on the set C = \{x_1, \ldots, x_m\}. The key step: for any restriction on m points with VC dimension d, partition the labelings by what happens when you project out one point x_m. The labelings whose projection onto \{x_1, \ldots, x_{m-1}\} appears with only one value of the x_m-coordinate form a restriction of VC dimension at most d on the m - 1 remaining points. The labelings where both values of the x_m-coordinate appear (both extensions are achieved) form a restriction of VC dimension at most d - 1 on m - 1 points, because if this class shattered a set of size d, adding x_m would give a shattered set of size d + 1 for the original class. Apply the inductive hypothesis to both parts and use the identity \binom{m-1}{i} + \binom{m-1}{i-1} = \binom{m}{i}. Rigorous versions are usually presented via Pajor's trace argument; see Alon, Ben-David, Cesa-Bianchi, and Haussler (1997) or Pajor (1985).
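The inductive bookkeeping can be checked numerically: the recurrence B(m, d) = B(m-1, d) + B(m-1, d-1) produced by the two parts, with B(m, 0) = B(0, d) = 1, reproduces the binomial sum in the lemma. A minimal verification (Python):

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def B(m, d):
    """Recurrence from the inductive step of Sauer-Shelah."""
    if m == 0 or d == 0:
        return 1
    return B(m - 1, d) + B(m - 1, d - 1)

# B(m, d) equals the binomial sum for all small m, d
assert all(B(m, d) == sum(comb(m, i) for i in range(d + 1))
           for m in range(20) for d in range(10))
```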

Why It Matters

Without Sauer-Shelah, you cannot plug VC dimension into uniform convergence bounds. The lemma is the bridge: it converts the combinatorial statement "VC dimension is d" into the quantitative bound \Pi_{\mathcal{H}}(m) = O(m^d), which can then be used in place of |\mathcal{H}| in union-bound-style arguments. The transition from exponential to polynomial growth is what makes learning possible.

Failure Mode

The Sauer-Shelah bound is tight in the worst case (there exist classes achieving equality), but can be very loose for specific hypothesis classes. For example, the class of thresholds on \mathbb{R} has VC dimension 1 and growth function m + 1, which matches the Sauer-Shelah bound \binom{m}{0} + \binom{m}{1} = m + 1. But for more structured classes the bound can significantly overestimate.

Theorem

VC Generalization Bound

Statement

Let \mathcal{H} have VC dimension d < \infty. For any distribution \mathcal{D} and any \delta > 0, with probability at least 1 - \delta over an i.i.d. sample of size n:

\sup_{h \in \mathcal{H}} |R(h) - \hat{R}_n(h)| \leq C\sqrt{\frac{d \log(n/d) + \log(1/\delta)}{n}}

where C is a universal constant independent of \mathcal{H}, \mathcal{D}, n, and \delta. The original VC proof (Vapnik 1998) gives C \approx 4 up to logarithmic factors; Mohri, Rostamizadeh, and Talwalkar (2018), Theorem 3.22, give an explicit form suitable for numerical sample-size calculations. In particular, the sample complexity of uniform convergence (and hence ERM learning) is:

m(\epsilon, \delta) = O\!\left(\frac{d \log(1/\epsilon) + \log(1/\delta)}{\epsilon^2}\right)

Intuition

Replace \log|\mathcal{H}| in the finite-class bound with d \log(n/d). For a continuous class, \log|\mathcal{H}| is infinite, but d = \text{VCdim}(\mathcal{H}) can still be finite, so the bound stays meaningful. This substitution is what makes learning with infinite classes (like all halfspaces) possible.
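To see what the bound implies numerically, one can scan for the sample size at which the deviation term drops below a target \epsilon. The sketch below (Python) is illustrative only: the constant C = 1 and the exact functional form are assumptions, not the tight constants from the literature.

```python
from math import log, sqrt

def vc_deviation_bound(n, d, delta, C=1.0):
    """Right-hand side of the uniform-deviation bound (C is an assumed constant)."""
    return C * sqrt((d * log(n / d) + log(1 / delta)) / n)

def sample_size(d, eps, delta, C=1.0):
    """First n (scanning upward in ~10% steps) where the bound drops below eps."""
    n = d + 1
    while vc_deviation_bound(n, d, delta, C) > eps:
        n = int(n * 1.1) + 1
    return n

print(sample_size(d=3, eps=0.05, delta=0.01))    # halfplanes in R^2
print(sample_size(d=101, eps=0.05, delta=0.01))  # halfspaces in R^100
```

The near-linear dependence on d is visible in the ratio of the two outputs.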

Proof Sketch

The proof uses the symmetrization technique in four steps:

  1. Double sampling. Introduce a "ghost sample" S' of the same size. Show that \Pr[\sup_h |R(h) - \hat{R}_n(h)| > \epsilon] \leq 2\Pr[\sup_h |\hat{R}_n(h) - \hat{R}'_n(h)| > \epsilon/2].
  2. Rademacher symmetrization. Swap the paired examples of S and S' using random signs \sigma_i \in \{-1, +1\}. The distribution of the supremum does not change because the swapped sample has the same distribution as the original.
  3. Conditioning and growth function. Condition on the 2n points and use the growth function to bound the number of effectively distinct hypotheses: at most \Pi_{\mathcal{H}}(2n).
  4. Sauer-Shelah + Hoeffding. Replace \Pi_{\mathcal{H}}(2n) with (2en/d)^d and apply Hoeffding to the resulting finite set of hypotheses. Union bound over at most (2en/d)^d effective hypotheses.

Why It Matters

This is the central theorem of classical learning theory for binary classification. Combined with the fundamental theorem of statistical learning, it shows: a binary hypothesis class is PAC-learnable if and only if its VC dimension is finite. The sample complexity scales linearly in d (up to log factors).

Failure Mode

The VC bound is distribution-free (worst case over all \mathcal{D}). This makes it robust but often extremely loose. For "nice" distributions (e.g., data with large margin), data-dependent bounds like Rademacher complexity can be much tighter. More critically, the VC bound gives vacuous guarantees for modern neural networks, whose VC dimension grows with the number of parameters, often in the billions.

Proof Ideas and Templates Used

The VC bound proof introduces two major techniques:

  1. Symmetrization (ghost sample trick): replace the unknown population risk R(h) with the empirical risk \hat{R}'_n(h) on a second independent sample. This converts a population-vs-empirical gap into an empirical-vs-empirical gap, which is easier to control.

  2. Effective hypothesis counting: on any fixed set of 2n points, the infinite class \mathcal{H} induces at most \Pi_{\mathcal{H}}(2n) distinct labelings. Sauer-Shelah bounds this by (2en/d)^d. Now you have a finite union bound with polynomial (not infinite) cost.

These two ideas recur throughout learning theory, and are also the foundation for Rademacher complexity bounds.

Canonical Examples

Example

Halfplanes in R^2 have VC dimension 3

Let \mathcal{H} be the set of all halfplanes in \mathbb{R}^2: h_{w,b}(x) = \mathbf{1}[w \cdot x + b \geq 0].

Lower bound (\text{VCdim} \geq 3): Take three non-collinear points, e.g., the vertices of an equilateral triangle. Non-collinearity is essential: for three collinear points, the middle point cannot be separated from both endpoints by a halfplane, so (0, 1, 0) is unachievable.

For non-collinear points, check all 2^3 = 8 labelings: (0,0,0) and (1,1,1) are achieved by halfplanes excluding or containing all three; the three singleton-positive labelings (1,0,0), (0,1,0), (0,0,1) each isolate one vertex on one side of a line (possible by the symmetry of the equilateral triangle); the three singleton-negative labelings (0,1,1), (1,0,1), (1,1,0) follow by flipping the halfplane orientation. All 8 labelings are achievable.

Upper bound (\text{VCdim} < 4): Take any 4 points in \mathbb{R}^2. If one point p lies inside the convex hull of the other three, the labeling that assigns p the opposite label from all others cannot be achieved by a halfplane (because p is a convex combination of points with the opposite label). If no point is in the convex hull of the others (the 4 points form a convex quadrilateral), the "alternating" labeling (+, -, +, -) on consecutive vertices cannot be achieved. Either way, no set of 4 points is shattered.

Therefore \text{VCdim}(\mathcal{H}) = 3. More generally, halfspaces in \mathbb{R}^d have VC dimension d + 1.

General d via Radon's theorem. The clean proof of the upper bound for arbitrary d uses Radon's theorem (Radon 1921): any d + 2 points in \mathbb{R}^d can be partitioned into two disjoint sets A, B whose convex hulls intersect. If a halfspace labels A as 1 and B as 0, a point in \text{conv}(A) \cap \text{conv}(B) would lie on both sides, a contradiction. So no d + 2 points are shattered. See Shalev-Shwartz and Ben-David, Understanding Machine Learning, Theorem 9.2.
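The case analysis above can also be automated for any small point set: a labeling is achievable by a halfplane exactly when the two label groups are linearly separable, which can be tested as a linear-programming feasibility problem. A sketch (Python, assuming NumPy and SciPy are available; the margin-1 encoding is one standard separability test, and the helper names are made up for illustration):

```python
from itertools import product
import numpy as np
from scipy.optimize import linprog

def halfplane_achieves(points, labels):
    """Feasibility LP: does (w, b) exist with y_i (w.x_i + b) >= 1, y_i in {-1, +1}?
    Margin-1 separation is equivalent, up to rescaling, to the labeling being
    realizable by h(x) = 1[w.x + b >= 0]."""
    X = np.asarray(points, dtype=float)
    y = np.where(np.asarray(labels) == 1, 1.0, -1.0)
    # Variables (w1, w2, b); constraints -y_i * (w.x_i + b) <= -1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((len(X), 1))])
    b_ub = -np.ones(len(X))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.status == 0  # 0 = feasible solution found

def shattered(points):
    return all(halfplane_achieves(points, labels)
               for labels in product([0, 1], repeat=len(points)))

triangle = [(0, 0), (1, 0), (0.5, 1)]      # non-collinear
square = [(0, 0), (1, 0), (1, 1), (0, 1)]  # convex position
print(shattered(triangle))  # True
print(shattered(square))    # False: the XOR labeling is infeasible
```

Running it confirms the analysis: the triangle is shattered, while the square fails on the alternating labeling.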

Example

Intervals on the real line: VC dimension 2

\mathcal{H} = \{h_{a,b} : h_{a,b}(x) = \mathbf{1}[a \leq x \leq b]\}.

Shatters any 2 points x_1 < x_2: use [x_1, x_2] for (1,1); [x_1, x_1] for (1,0); [x_2, x_2] for (0,1); [x_2 + 1, x_2 + 2] for (0,0) (a bounded interval disjoint from both points).

Cannot shatter any 3 points x_1 < x_2 < x_3: the labeling (1, 0, 1) requires an interval containing x_1 and x_3 but not x_2, which is impossible for a single interval. So \text{VCdim} = 2.
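The same argument is easy to verify exhaustively: a labeling is achievable by a single interval exactly when no 0-labeled point lies strictly between two 1-labeled points. A brute-force check (Python; assumes the points are distinct):

```python
from itertools import product

def interval_achieves(points, labels):
    """Is there an interval [a, b] containing exactly the points labeled 1?"""
    ones = [x for x, y in zip(points, labels) if y == 1]
    if not ones:
        return True  # an interval disjoint from all points works
    lo, hi = min(ones), max(ones)
    # Any interval containing all 1-points contains [lo, hi], so no 0-point may lie inside.
    return all(not (lo < x < hi) for x, y in zip(points, labels) if y == 0)

def shattered(points):
    return all(interval_achieves(points, labels)
               for labels in product([0, 1], repeat=len(points)))

print(shattered([1.0, 2.0]))       # True  -> VCdim >= 2
print(shattered([1.0, 2.0, 3.0]))  # False -> the (1, 0, 1) labeling fails
```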

Example

The sin-classifier: one parameter, infinite VC dimension

Consider \mathcal{H} = \{x \mapsto \mathbf{1}[\sin(\theta x) \geq 0] : \theta > 0\}. Despite having only one parameter \theta, this class has infinite VC dimension.

Infinite VC dimension means arbitrarily large shattered sets exist, not that every point configuration is shatterable. For every m, there exist m-point configurations that are shattered by this class. The classical construction takes x_i = 2^{-i} for i = 1, \ldots, m: for any target labeling (b_1, \ldots, b_m) \in \{0,1\}^m, one can choose \theta large enough so that the sign of \sin(\theta x_i) matches b_i at each sample point. This shows VC dimension can be much larger than the number of parameters. See Vapnik, Statistical Learning Theory (1998), §4.8, or Anthony and Bartlett, Neural Network Learning, Theorem 7.8.
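The construction can be checked directly. The sketch below (Python) uses one explicit choice of \theta (a base-2 variant of the classical formula, offered as an illustration rather than the unique choice) and verifies every labeling of m = 5 points by brute force:

```python
from itertools import product
from math import sin, pi

m = 5
xs = [2.0 ** -i for i in range(1, m + 1)]  # x_i = 2^{-i}

def classify(theta, x):
    return 1 if sin(theta * x) >= 0 else 0

for labels in product([0, 1], repeat=m):
    # Explicit theta realizing this labeling (assumed base-2 variant of the construction):
    theta = pi * (1 + sum((1 - b) * 2 ** i for i, b in enumerate(labels, start=1)))
    assert all(classify(theta, x) == b for x, b in zip(xs, labels))

print("all", 2 ** m, "labelings realized")  # so this configuration is shattered
```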

Extensions to Real-Valued Function Classes

VC dimension is defined for binary function classes. For real-valued function classes (regression, scoring functions, neural network outputs before thresholding), two standard extensions apply:

  • Pseudo-dimension (Pollard 1984; Kearns and Schapire 1994): the VC dimension of the class of thresholded functions \{(x, t) \mapsto \mathbf{1}[f(x) \geq t] : f \in \mathcal{F}\}. Used for uniform convergence bounds on real-valued losses.
  • Fat-shattering dimension (Kearns and Schapire 1994; Bartlett, Long, and Williamson 1996): a scale-sensitive refinement requiring the threshold separation to exceed a margin \gamma. Useful for margin bounds and continuous classes where pseudo-dimension is too coarse.

Bartlett, Harvey, Liaw, and Mehrabian (2019), cited in the references, prove pseudo-dimension bounds for piecewise-linear networks; the same bounds transfer to the VC dimension of the corresponding thresholded classifiers.

PAC Learnability Characterization

The fundamental theorem of statistical learning states that for a binary class \mathcal{H} with 0-1 loss:

  • \mathcal{H} is realizable-PAC-learnable iff \text{VCdim}(\mathcal{H}) < \infty iff ERM on \mathcal{H} is a realizable PAC learner (Shalev-Shwartz and Ben-David, Understanding Machine Learning, Theorem 6.7).
  • \mathcal{H} is agnostic-PAC-learnable iff \text{VCdim}(\mathcal{H}) < \infty iff ERM on \mathcal{H} is an agnostic PAC learner (Shalev-Shwartz and Ben-David, Theorem 6.8).

Finite VC dimension characterizes learnability in both the realizable setting (where some h \in \mathcal{H} has zero risk) and the agnostic setting (no such assumption). In both cases the same combinatorial quantity, d = \text{VCdim}(\mathcal{H}), governs sample complexity.

Common Confusions

Watch Out

VC dimension is not the number of parameters

A common misconception is that VC dimension equals the number of parameters. While this happens to be true for halfspaces (d + 1 parameters, VC dimension d + 1), it is not true in general. The sin-classifier above has 1 parameter but infinite VC dimension. Conversely, there exist high-dimensional parameterizations with low VC dimension due to structural constraints. The relationship between parameters and VC dimension is nuanced and class-specific.

Watch Out

Shattering requires some set, not every set

\text{VCdim}(\mathcal{H}) = d means there exists a set of size d that is shattered, not that every set of size d is shattered. When proving a lower bound, you only need to find one shatterable set. When proving an upper bound, you must show that no set of size d + 1 can be shattered; this requires a universal argument, which is typically harder.

Watch Out

VC dimension is worst-case; Rademacher complexity is data-dependent

VC dimension is a combinatorial worst-case measure: it is a single number that depends only on \mathcal{H}, not on the data distribution. Rademacher complexity is an average-case measure: it measures how well the class correlates with random noise on the specific data at hand. When the data has structure (e.g., large margin), Rademacher complexity captures this while VC dimension does not. The VC generalization bound is always at least as loose as the Rademacher bound.

Summary

  • VC dimension = size of the largest set the class can shatter
  • Sauer-Shelah: once past the VC dimension, growth is polynomial (O(m^d)), not exponential (2^m)
  • VC generalization bound: sample complexity \tilde{O}(d/\epsilon^2) for uniform convergence
  • Finite VC dimension \Leftrightarrow realizable-PAC-learnability \Leftrightarrow agnostic-PAC-learnability (for binary classification with 0-1 loss), and in both cases ERM is a valid learner
  • Halfspaces in \mathbb{R}^d: \text{VCdim} = d + 1
  • VC dimension does not equal the number of parameters in general
  • VC dimension is worst-case and distribution-free; Rademacher complexity is distribution-aware and often tighter

Exercises

ExerciseCore

Problem

Compute the VC dimension of the class of all unions of k intervals on the real line: \mathcal{H}_k = \{x \mapsto \mathbf{1}[x \in I_1 \cup \cdots \cup I_k]\} where each I_j = [a_j, b_j].

ExerciseCore

Problem

Let \mathcal{H}_1 and \mathcal{H}_2 be hypothesis classes with VC dimensions d_1 and d_2. Show that \text{VCdim}(\mathcal{H}_1 \cup \mathcal{H}_2) \leq d_1 + d_2 + 1.

ExerciseAdvanced

Problem

A deep ReLU neural network with W total weights and L layers has VC dimension \Theta(W L \log W) (Bartlett, Harvey, Liaw, and Mehrabian 2019, arXiv:1703.02930). The O(W \log W) form applies to shallow (constant-depth) networks and underestimates the VC dimension of deep networks. If a network has W = 10^8 parameters, L = 50 layers, and is trained on n = 10^6 examples, what does the VC bound say about generalization? Why is this problematic given that such networks often achieve low test error in practice?


References

Canonical:

  • Vapnik & Chervonenkis, "On the Uniform Convergence of Relative Frequencies of Events to their Probabilities" (1971). Proved the growth-function bound before Sauer and Shelah.
  • Sauer, "On the Density of Families of Sets" (1972). Independent proof of the growth-function bound.
  • Shelah, "A Combinatorial Problem; Stability and Order for Models and Theories in Infinitary Languages" (1972). Independent proof from model theory. The result is commonly called the Sauer-Shelah lemma (sometimes Sauer-Shelah-Perles, with Perles credited for an unpublished earlier proof).
  • Radon, "Mengen konvexer Körper, die einen gemeinsamen Punkt enthalten" (1921). Radon's theorem, used for the halfspace VC upper bound.
  • Shalev-Shwartz & Ben-David, Understanding Machine Learning, Chapters 6-7 and Theorem 9.2

Current:

  • Zhang, Bengio, Hardt, Recht, Vinyals, "Understanding deep learning requires rethinking generalization" (ICLR 2017)
  • Bartlett, Harvey, Liaw, Mehrabian, "Nearly-tight VC-dimension and pseudodimension bounds for deep neural networks" (JMLR 2019, arXiv:1703.02930). Gives \Theta(W L \log W) pseudo-dimension bounds for deep piecewise-linear networks.

Supporting:

  • Vapnik, Statistical Learning Theory (1998), §4.8. Sin-classifier example and explicit VC constants.
  • Anthony & Bartlett, Neural Network Learning: Theoretical Foundations (1999), Theorem 7.8.
  • Mohri, Rostamizadeh, Talwalkar, Foundations of Machine Learning (2018), Chapters 2-4 and Theorem 3.22 for explicit VC constants.
  • Pollard, Convergence of Stochastic Processes (1984). Pseudo-dimension.
  • Kearns & Schapire, "Efficient Distribution-Free Learning of Probabilistic Concepts" (1994). Fat-shattering dimension.
  • Alon, Ben-David, Cesa-Bianchi, Haussler, "Scale-sensitive dimensions, uniform convergence, and learnability" (JACM 1997). Rigorous trace-based proofs.
  • Pajor, Sous-espaces \ell_1^n des espaces de Banach (1985). Trace argument for Sauer-Shelah.

Next Topics

From VC dimension, the natural next steps are:

  • Rademacher complexity: data-dependent complexity that can give tighter, distribution-aware bounds
  • Algorithmic stability: an alternative to uniform convergence that can explain generalization when VC bounds are vacuous

Last reviewed: April 2026
