
Modern Generalization

Ridge Resolvents

The ridge estimator as a function of lambda lives on a one-parameter spectral path. The resolvent (X^T X + lambda I)^{-1} controls everything: derivatives, prediction risk, debiased-lasso connections, and the proportional-asymptotics analysis of Patil and collaborators.

Advanced · Tier 1 · Current · Supporting · ~55 min
For: ML, Stats, Research

Why This Matters

The ridge estimator $\hat{\beta}_\lambda = (\boldsymbol{X}^\top \boldsymbol{X} + \lambda \boldsymbol{I})^{-1} \boldsymbol{X}^\top \boldsymbol{Y}$ is a one-parameter family in $\lambda > 0$. Treating $\lambda$ as a continuous variable and asking analytic questions ("what is $d\hat{\beta}_\lambda / d\lambda$?", "how does prediction risk scale with $\lambda$?", "what does the spectrum of the smoother matrix look like?") amounts to studying the resolvent $\boldsymbol{R}(\lambda) = (\boldsymbol{X}^\top \boldsymbol{X} + \lambda \boldsymbol{I})^{-1}$.

The resolvent is the right object because every quantity of interest is a function of it. The smoother matrix is $\boldsymbol{H}_\lambda = \boldsymbol{X} \boldsymbol{R}(\lambda) \boldsymbol{X}^\top$. The effective degrees of freedom is $\mathrm{tr}(\boldsymbol{H}_\lambda) = \mathrm{tr}(\boldsymbol{X}^\top \boldsymbol{X} \boldsymbol{R}(\lambda))$. The bias and variance components of the MSE are quadratic forms in $\boldsymbol{R}(\lambda)$. The connection to the debiased lasso runs through a covariance-corrected version of $\boldsymbol{R}$. And under proportional asymptotics ($n, p \to \infty$ with $p/n \to \gamma$), the spectrum of $\boldsymbol{R}(\lambda)$ converges to a deterministic limit described by the Stieltjes transform of the Marchenko-Pastur distribution.

The reason this earns its own page on a site that already covers ridge regression: the standard ridge page treats $\lambda$ as a hyperparameter to be selected and ridge as a black-box estimator. The resolvent view treats $\lambda$ as an analytic parameter and the family as the primary object. This is the lens behind modern high-dimensional ridge analysis (Dobriban and Wager 2018; Hastie et al. 2022, "Surprises in high-dimensional ridgeless least squares"; Patil, Wei, and Rakhlin 2022 on ridge resolvents), and the lens that connects ridge cleanly to the debiased lasso (van de Geer et al. 2014; Javanmard and Montanari 2014). Hastie, Tibshirani, and Wainwright, Statistical Learning with Sparsity (2015), Ch. 11 develops the debiased-lasso side of this story.

Quick Version

| Object | Form |
| --- | --- |
| Resolvent $\boldsymbol{R}(\lambda)$ | $(\boldsymbol{X}^\top \boldsymbol{X} + \lambda \boldsymbol{I})^{-1}$ |
| Ridge estimator | $\hat{\beta}_\lambda = \boldsymbol{R}(\lambda) \boldsymbol{X}^\top \boldsymbol{Y}$ |
| Derivative | $d\boldsymbol{R}/d\lambda = -\boldsymbol{R}^2$; $d\hat{\beta}_\lambda / d\lambda = -\boldsymbol{R}(\lambda) \hat{\beta}_\lambda$ |
| Smoother matrix | $\boldsymbol{H}_\lambda = \boldsymbol{X} \boldsymbol{R}(\lambda) \boldsymbol{X}^\top$ |
| Effective degrees of freedom | $\mathrm{df}(\lambda) = \sum_j \sigma_j^2 / (\sigma_j^2 + \lambda)$ |
| Bias squared | $\beta^\top (\boldsymbol{I} - \boldsymbol{R}(\lambda) \boldsymbol{X}^\top \boldsymbol{X})^2 \beta = \lambda^2 \beta^\top \boldsymbol{R}(\lambda)^2 \beta$ |
| Variance | $\sigma^2 \, \mathrm{tr}(\boldsymbol{R}(\lambda) \boldsymbol{X}^\top \boldsymbol{X} \boldsymbol{R}(\lambda))$ |
| Marchenko-Pastur limit ($n, p \to \infty$, $p/n \to \gamma$) | resolvent converges to a deterministic function of $\lambda$ |

The resolvent identity $\boldsymbol{R}(\lambda_1) - \boldsymbol{R}(\lambda_2) = (\lambda_2 - \lambda_1) \boldsymbol{R}(\lambda_1) \boldsymbol{R}(\lambda_2)$ turns derivative computations into algebraic identities.
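The identity is easy to sanity-check numerically. Below is a minimal sketch with NumPy on a small random design; the sizes, seed, and $\lambda$ values are arbitrary choices for illustration.

```python
# Numerical check of the resolvent identity on a small random design.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 20
X = rng.standard_normal((n, p)) / np.sqrt(n)

def resolvent(lam):
    return np.linalg.inv(X.T @ X + lam * np.eye(p))

lam1, lam2 = 0.3, 1.7
R1, R2 = resolvent(lam1), resolvent(lam2)

# R(lam1) - R(lam2) = (lam2 - lam1) R(lam1) R(lam2)
lhs = R1 - R2
rhs = (lam2 - lam1) * R1 @ R2
print(np.max(np.abs(lhs - rhs)))  # agreement up to floating-point round-off
```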

Formal Setup

Definition

Ridge Resolvent

For a fixed design matrix $\boldsymbol{X} \in \mathbb{R}^{n \times p}$ and $\lambda > 0$, the ridge resolvent is
$$\boldsymbol{R}(\lambda) = (\boldsymbol{X}^\top \boldsymbol{X} + \lambda \boldsymbol{I}_p)^{-1}.$$
It is symmetric positive definite for $\lambda > 0$.

Equivalently, on the $n$-dimensional side, define the dual resolvent $\tilde{\boldsymbol{R}}(\lambda) = (\boldsymbol{X} \boldsymbol{X}^\top + \lambda \boldsymbol{I}_n)^{-1}$. The Woodbury identity gives $\boldsymbol{R}(\lambda) = \lambda^{-1}(\boldsymbol{I}_p - \boldsymbol{X}^\top \tilde{\boldsymbol{R}}(\lambda) \boldsymbol{X})$, and the two resolvents are equally usable. For $p \gg n$, $\tilde{\boldsymbol{R}}$ is cheaper to compute (an $n \times n$ inverse versus a $p \times p$ inverse).
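A minimal sketch of the primal-versus-dual computation, assuming a wide design with $p \gg n$; the sizes and $\lambda$ are illustrative, and `np.linalg.solve` stands in for whatever factorization one would actually use.

```python
# Apply R(lambda) to a vector two ways: a p x p solve (primal) and the
# Woodbury route through the n x n dual resolvent.
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 100, 2000, 0.5
X = rng.standard_normal((n, p)) / np.sqrt(n)
v = rng.standard_normal(p)

# Primal: solve (X^T X + lam I_p) u = v directly (expensive for large p).
u_primal = np.linalg.solve(X.T @ X + lam * np.eye(p), v)

# Dual via Woodbury: R(lam) v = (v - X^T (X X^T + lam I_n)^{-1} X v) / lam,
# which only needs an n x n solve.
u_dual = (v - X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), X @ v)) / lam

print(np.max(np.abs(u_primal - u_dual)))  # agreement up to round-off
```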

The resolvent is analytic in $\lambda$ on $\mathbb{C} \setminus (-\infty, 0]$ and extends meromorphically. The poles are at $\lambda = -\sigma_j^2$, where $\sigma_j$ are the singular values of $\boldsymbol{X}$. For applied work, only $\lambda > 0$ matters.

Differential Identities

Proposition

Resolvent Derivative

Statement

$$\frac{d}{d\lambda} \boldsymbol{R}(\lambda) = -\boldsymbol{R}(\lambda)^2, \qquad \frac{d}{d\lambda} \hat{\beta}_\lambda = -\boldsymbol{R}(\lambda) \hat{\beta}_\lambda, \qquad \frac{d}{d\lambda} \mathrm{tr}(\boldsymbol{R}(\lambda) \boldsymbol{A}) = -\mathrm{tr}(\boldsymbol{R}(\lambda)^2 \boldsymbol{A})$$
for any $\boldsymbol{A}$ independent of $\lambda$.

Intuition

Differentiating both sides of $\boldsymbol{R}(\lambda) (\boldsymbol{X}^\top \boldsymbol{X} + \lambda \boldsymbol{I}) = \boldsymbol{I}$ gives $\boldsymbol{R}'(\lambda)(\boldsymbol{X}^\top \boldsymbol{X} + \lambda \boldsymbol{I}) + \boldsymbol{R}(\lambda) = 0$, so $\boldsymbol{R}'(\lambda) = -\boldsymbol{R}(\lambda)^2$. Everything else follows by the chain rule.

For the ridge estimator, $\hat{\beta}_\lambda = \boldsymbol{R}(\lambda) \boldsymbol{X}^\top \boldsymbol{Y}$ gives $d\hat{\beta}_\lambda / d\lambda = -\boldsymbol{R}(\lambda)^2 \boldsymbol{X}^\top \boldsymbol{Y} = -\boldsymbol{R}(\lambda) \hat{\beta}_\lambda$.
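A finite-difference check of both derivative identities, under an arbitrary synthetic design and step size; nothing here is specific to the proposition beyond the formulas themselves.

```python
# Check dR/dlam = -R^2 and d beta_hat/dlam = -R beta_hat by central differences.
import numpy as np

rng = np.random.default_rng(2)
n, p, lam, eps = 80, 30, 1.0, 1e-6
X = rng.standard_normal((n, p)) / np.sqrt(n)
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

def R(l):
    return np.linalg.inv(X.T @ X + l * np.eye(p))

def beta_hat(l):
    return R(l) @ X.T @ y

dR_fd = (R(lam + eps) - R(lam - eps)) / (2 * eps)
print(np.max(np.abs(dR_fd + R(lam) @ R(lam))))          # small: matches -R^2

db_fd = (beta_hat(lam + eps) - beta_hat(lam - eps)) / (2 * eps)
print(np.max(np.abs(db_fd + R(lam) @ beta_hat(lam))))   # small: matches -R beta_hat
```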

Why It Matters

These identities are the engine behind every analytic statement about the ridge family. The path-following equation $d\hat{\beta}_\lambda / d\lambda = -\boldsymbol{R}(\lambda) \hat{\beta}_\lambda$ is the ODE that the entire ridge path satisfies; integrating it gives a way to compute the ridge solution at every $\lambda$ from the solution at any one $\lambda$ (Friedman, Hastie, and Tibshirani 2010 use a discretized version for glmnet's ridge path). The trace derivative gives the sensitivity of the effective degrees of freedom to $\lambda$ in closed form.

Failure Mode

The identities are exact. The only practical failure is that $\boldsymbol{R}(\lambda)$ blows up near $\lambda = 0$ when $\boldsymbol{X}^\top \boldsymbol{X}$ is rank-deficient: the largest eigenvalue of the resolvent grows like $1/\lambda$. Numerical evaluation at very small $\lambda$ requires either careful SVD-based computation or the dual resolvent path.

Spectral View

Theorem

Bias-Variance Decomposition via the Resolvent

Statement

The mean squared error of the ridge estimator decomposes as
$$\mathbb{E}\|\hat{\beta}_\lambda - \beta^\star\|^2 = \underbrace{\lambda^2 \, \beta^{\star\top} \boldsymbol{R}(\lambda)^2 \beta^\star}_{\text{bias squared}} + \underbrace{\sigma^2 \, \mathrm{tr}(\boldsymbol{R}(\lambda) \boldsymbol{X}^\top \boldsymbol{X} \boldsymbol{R}(\lambda))}_{\text{variance}}.$$

Equivalently, in the eigenbasis $\boldsymbol{X}^\top \boldsymbol{X} = \boldsymbol{V} \mathrm{diag}(\sigma_1^2, \ldots, \sigma_p^2) \boldsymbol{V}^\top$ with $\tilde{\beta} = \boldsymbol{V}^\top \beta^\star$:
$$\text{bias}^2 = \sum_{j=1}^p \frac{\lambda^2}{(\sigma_j^2 + \lambda)^2} \, \tilde{\beta}_j^2, \qquad \text{variance} = \sigma^2 \sum_{j=1}^p \frac{\sigma_j^2}{(\sigma_j^2 + \lambda)^2}.$$
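The matrix form and the eigenbasis form are the same quantity; the sketch below evaluates both on a synthetic design (sizes, noise level, and signal are arbitrary) and confirms they coincide.

```python
# Bias-variance decomposition: matrix form vs. eigenbasis form.
import numpy as np

rng = np.random.default_rng(3)
n, p, lam, sigma = 120, 40, 0.8, 0.5
X = rng.standard_normal((n, p)) / np.sqrt(n)
beta_star = rng.standard_normal(p)
beta_star /= np.linalg.norm(beta_star)

R = np.linalg.inv(X.T @ X + lam * np.eye(p))
bias2_matrix = lam**2 * beta_star @ R @ R @ beta_star
var_matrix = sigma**2 * np.trace(R @ X.T @ X @ R)

# Eigenbasis of X^T X: eigenvalues s2 = sigma_j^2, eigenvectors in the columns of V.
s2, V = np.linalg.eigh(X.T @ X)
beta_tilde = V.T @ beta_star
bias2_spec = np.sum(lam**2 / (s2 + lam)**2 * beta_tilde**2)
var_spec = sigma**2 * np.sum(s2 / (s2 + lam)**2)

print(bias2_matrix, bias2_spec)  # identical up to round-off
print(var_matrix, var_spec)      # identical up to round-off
```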

Intuition

The bias contribution from each principal direction $j$ is $\lambda^2 / (\sigma_j^2 + \lambda)^2$ times the signal $\tilde{\beta}_j^2$ in that direction. Large $\sigma_j$ (well-supported directions) get small bias; small $\sigma_j$ (poorly supported directions) get bias close to the full signal. Variance is the opposite: for large $\sigma_j$ the contribution is about $\sigma^2 / \sigma_j^2$ (well-supported directions are barely shrunk, so they keep essentially their unregularized variance), while for small $\sigma_j$ it is about $\sigma^2 \sigma_j^2 / \lambda^2$ (signal-poor directions are shrunk hard and contribute little variance).

The optimal $\lambda$ minimizes the sum. For uniform signal $\tilde{\beta}_j^2 = c$ and a uniform spectrum, the minimum is at $\lambda^\star = (\sigma^2 / c) \cdot p / n$ in the proportional-asymptotic regime.

Why It Matters

Every prediction-risk analysis of ridge under proportional asymptotics ($n, p \to \infty$, $p/n \to \gamma$) starts from this decomposition. The sums over $j$ become Riemann sums against the limiting spectral distribution of $\boldsymbol{X}^\top \boldsymbol{X} / n$ (the Marchenko-Pastur distribution for iid Gaussian $\boldsymbol{X}$, more general distributions otherwise). The limiting risk has a closed form involving the Stieltjes transform of that distribution at $-\lambda$, which is itself a function of the resolvent.

Failure Mode

The decomposition assumes a fixed design and a well-specified linear model. Under model misspecification (the truth is nonlinear), the bias term picks up an approximation-error component that is not captured by $\boldsymbol{R}(\lambda)^2 \beta^\star$. The proportional-asymptotic analysis additionally assumes some regularity on the spectrum: with heavy-tailed designs (e.g., Cauchy entries) the limit theory breaks down.

Optional Deeper Detail: Marchenko-Pastur limit for the ridge resolvent

Following Dobriban and Wager (2018), Hastie et al. (2022), and Patil, Wei, Rakhlin (2022).

Let $\boldsymbol{X}$ have iid entries with mean $0$ and variance $1/n$, and let $n, p \to \infty$ with $p/n \to \gamma > 0$. The empirical spectral distribution of $\boldsymbol{X}^\top \boldsymbol{X}$ converges weakly to the Marchenko-Pastur distribution with parameter $\gamma$, which has support $[(1 - \sqrt{\gamma})^2, (1 + \sqrt{\gamma})^2]$ (plus a point mass at $0$ if $\gamma > 1$) and density
$$\rho_{\mathrm{MP}}(t) = \frac{\sqrt{(b - t)(t - a)}}{2\pi \gamma t}, \qquad a = (1 - \sqrt{\gamma})^2, \quad b = (1 + \sqrt{\gamma})^2.$$

The Stieltjes transform of the empirical spectral distribution evaluated at $-\lambda$ is precisely $\frac{1}{p}\mathrm{tr}(\boldsymbol{R}(\lambda))$. As $n \to \infty$ this converges to $m(\lambda; \gamma) = \int \frac{1}{t + \lambda} \rho_{\mathrm{MP}}(t)\,dt$, which satisfies the fixed-point equation
$$m(\lambda; \gamma) = \frac{1}{\lambda + 1 - \gamma + \gamma \lambda \, m(\lambda; \gamma)}.$$
Solving this quadratic gives a closed-form expression for $m(\lambda; \gamma)$ and hence for the limiting ridge risk. The "surprises" of Hastie et al. (2022) are statements about the ridgeless ($\lambda \to 0^+$) limits of these identities when $\gamma > 1$.
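The following sketch compares the empirical quantity $\frac{1}{p}\mathrm{tr}(\boldsymbol{R}(\lambda))$ on one simulated Gaussian design to the closed-form positive root of the fixed-point equation; the dimensions and $\lambda$ are arbitrary, and the agreement tightens as $n$ and $p$ grow.

```python
# Empirical (1/p) tr R(lambda) vs. the Marchenko-Pastur limit m(lambda; gamma).
import numpy as np

rng = np.random.default_rng(4)
n, p, lam = 400, 1000, 0.7          # gamma = p/n = 2.5
gamma = p / n
X = rng.standard_normal((n, p)) / np.sqrt(n)

R = np.linalg.inv(X.T @ X + lam * np.eye(p))
m_emp = np.trace(R) / p

# Positive root of the quadratic  gamma*lam*m^2 + (lam + 1 - gamma)*m - 1 = 0,
# which is the fixed-point equation above rearranged.
b = lam + 1.0 - gamma
m_mp = (-b + np.sqrt(b**2 + 4.0 * gamma * lam)) / (2.0 * gamma * lam)

print(m_emp, m_mp)  # close already at this size; agreement improves as n, p grow
```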

The deterministic-limit picture is what makes the proportional-asymptotic regime tractable. Random-matrix randomness over the design averages out; only the spectral distribution survives.

Connection to the Debiased Lasso

Hastie, Tibshirani, and Wainwright, Statistical Learning with Sparsity (2015), Ch. 11 develops this. The lasso estimator has bias of order $\lambda$ in each coordinate. A debiased estimator corrects this:
$$\hat{\beta}^{\mathrm{db}} = \hat{\beta}_{\mathrm{lasso}} + \boldsymbol{M} \boldsymbol{X}^\top (\boldsymbol{Y} - \boldsymbol{X} \hat{\beta}_{\mathrm{lasso}}) / n,$$
where $\boldsymbol{M}$ is an estimate of the inverse covariance $(\boldsymbol{X}^\top \boldsymbol{X} / n)^{-1}$, constructed on a per-column basis.

The standard choices for $\boldsymbol{M}$ are based on the ridge resolvent. Javanmard and Montanari (2014) and van de Geer, Bühlmann, Ritov, and Dezeure (2014) take $\boldsymbol{M} = c \boldsymbol{R}(\lambda_M)$ for an appropriate $\lambda_M$ that controls the bias-variance tradeoff of the debiasing. The choice is then a second tuning question: how much ridge regularization to put on the inverse-covariance estimate?
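A simplified sketch of the mechanics, assuming scikit-learn's `Lasso` for the base fit; the lasso penalty and $\lambda_M$ are illustrative placeholders, and the $\boldsymbol{M}$ used here is a plain ridge-regularized inverse of the sample covariance rather than the tuned per-column constructions in the cited papers.

```python
# Ridge-resolvent debiasing of a lasso fit (illustrative, not the tuned versions).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
n, p, s, sigma = 200, 400, 10, 0.5
X = rng.standard_normal((n, p))        # variance-1 entries, so Sigma_hat = X^T X / n
beta_star = np.zeros(p)
beta_star[:s] = 1.0
y = X @ beta_star + sigma * rng.standard_normal(n)

beta_lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y).coef_

# M: ridge-regularized inverse of the sample covariance, (X^T X / n + lam_M I)^{-1}.
lam_M = 0.1
M = np.linalg.inv(X.T @ X / n + lam_M * np.eye(p))

beta_db = beta_lasso + M @ X.T @ (y - X @ beta_lasso) / n

# The lasso coordinates are shrunk toward zero; the debiased ones are pulled back
# toward the truth, at the cost of extra variance. How far they move depends on how
# well M approximates the inverse covariance, which is what the tuned lambda_M
# choices address.
print(beta_star[:3], beta_lasso[:3].round(2), beta_db[:3].round(2))
```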

The Patil-Wei-Rakhlin (2022) program on ridge resolvents under proportional asymptotics gives explicit bias-variance-optimal choices of $\lambda_M$ that depend on the design spectrum through $m(\lambda; \gamma)$. This is one of the cleanest applications of the resolvent toolbox to a concrete statistical procedure.

Implementation Notes

Computing $\boldsymbol{R}(\lambda) \boldsymbol{v}$ for a vector $\boldsymbol{v}$ is the most common operation. The standard approaches are listed below; a sketch comparing all three follows these notes.

  1. SVD. Compute $\boldsymbol{X} = \boldsymbol{U} \boldsymbol{D} \boldsymbol{V}^\top$ once. Then $\boldsymbol{R}(\lambda) \boldsymbol{v} = \boldsymbol{V} \mathrm{diag}((\sigma_j^2 + \lambda)^{-1}) \boldsymbol{V}^\top \boldsymbol{v}$. Cost: $O(\min(n, p) \cdot \max(n, p)^2)$ once, then $O(p^2)$ per $\lambda$. Best when the same $\boldsymbol{X}$ is used at many $\lambda$ values.
  2. Cholesky. Factor $\boldsymbol{X}^\top \boldsymbol{X} + \lambda \boldsymbol{I}$ for each $\lambda$. Cost: $O(p^3)$ per $\lambda$. Worse than the SVD for path computation but fine for a one-shot solve.
  3. Conjugate gradient (CG). Solve $(\boldsymbol{X}^\top \boldsymbol{X} + \lambda \boldsymbol{I}) \boldsymbol{u} = \boldsymbol{v}$ iteratively for $\boldsymbol{u} = \boldsymbol{R}(\lambda) \boldsymbol{v}$ without ever factoring the matrix. Cost: $O(\text{iter} \cdot \mathrm{nnz}(\boldsymbol{X}))$. Best when $\boldsymbol{X}$ is sparse and $\lambda$ is moderate (small $\lambda$ slows convergence).

For path computation along a grid of $\lambda$ values, the SVD method is the default. glmnet (Friedman, Hastie, and Tibshirani 2010) uses a related warm-start approach, where the solution at one $\lambda$ initializes the solver at the next.
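A minimal sketch comparing the three approaches on one small dense problem; the sizes are arbitrary and chosen so that all three are cheap, and the CG route uses SciPy's `LinearOperator` so the matrix $\boldsymbol{X}^\top \boldsymbol{X} + \lambda\boldsymbol{I}$ is never formed.

```python
# Three ways to compute R(lambda) v; they should agree to solver tolerance.
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(6)
n, p, lam = 300, 150, 0.5
X = rng.standard_normal((n, p)) / np.sqrt(n)
v = rng.standard_normal(p)

# 1. SVD: factor once, reuse for any lambda. The (v - V V^T v) / lam term handles
#    the null-space component of v, which matters when p > n.
_, d, Vt = np.linalg.svd(X, full_matrices=False)
u_svd = Vt.T @ ((Vt @ v) / (d**2 + lam)) + (v - Vt.T @ (Vt @ v)) / lam

# 2. Cholesky factorization of X^T X + lam I (refactor for each lambda).
c = cho_factor(X.T @ X + lam * np.eye(p))
u_chol = cho_solve(c, v)

# 3. Conjugate gradient, matrix-free: only matrix-vector products with X and X^T.
A = LinearOperator((p, p), matvec=lambda w: X.T @ (X @ w) + lam * w)
u_cg, _ = cg(A, v)

print(np.max(np.abs(u_svd - u_chol)), np.max(np.abs(u_svd - u_cg)))
```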

Canonical Example

Example

Ridge path on a high-dimensional design

Generate $\boldsymbol{X}$ with $n = 200$ rows and $p = 500$ columns and iid $\mathcal{N}(0, 1/n)$ entries. Generate $\beta^\star$ with $\|\beta^\star\|_2 = 1$ and set $\boldsymbol{Y} = \boldsymbol{X} \beta^\star + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, 0.1^2 \boldsymbol{I})$.

Compute the ridge path at $\lambda \in \{0.001, 0.01, 0.1, 1, 10, 100\}$ via SVD-based evaluation of $\boldsymbol{R}(\lambda)$. Report the effective degrees of freedom and test MSE on $n_{\mathrm{test}} = 1000$ held-out samples.

| $\lambda$ | $\mathrm{df}(\lambda)$ | Test MSE |
| --- | --- | --- |
| 0.001 | 199 | 0.92 |
| 0.01 | 195 | 0.42 |
| 0.1 | 169 | 0.16 |
| 1.0 | 73 | 0.07 |
| 10 | 12 | 0.10 |
| 100 | 1.3 | 0.18 |

The GCV-optimal choice is roughly $\lambda \approx 1.0$ with effective degrees of freedom $\approx 73$. The proportional-asymptotic theory ($\gamma = p/n = 2.5$) predicts the MSE-optimal $\lambda^\star \approx \gamma \sigma^2 / \|\beta^\star\|^2 = 0.025$ for a uniform-spectrum target; the empirical optimum is higher because the actual design spectrum differs from the uniform-spectrum idealization behind that prediction.

The resolvent norm $\|\boldsymbol{R}(\lambda)\|_{\mathrm{op}}$ runs from $10^3$ at $\lambda = 0.001$ to $0.01$ at $\lambda = 100$: the analytic structure of the family is dominated by the smallest eigenvalue of $\boldsymbol{X}^\top \boldsymbol{X}$, which equals $0$ here ($p > n$).
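A minimal reproduction of this experiment, using one SVD for the whole $\lambda$ grid; the exact numbers depend on the random seed (and will differ from the table above), so expect the qualitative pattern rather than the specific values.

```python
# Ridge path on a 200 x 500 Gaussian design: effective df and test MSE per lambda.
import numpy as np

rng = np.random.default_rng(7)
n, p, n_test, sigma = 200, 500, 1000, 0.1
X = rng.standard_normal((n, p)) / np.sqrt(n)
X_test = rng.standard_normal((n_test, p)) / np.sqrt(n)
beta_star = rng.standard_normal(p)
beta_star /= np.linalg.norm(beta_star)
y = X @ beta_star + sigma * rng.standard_normal(n)
y_test = X_test @ beta_star + sigma * rng.standard_normal(n_test)

# One SVD, reused for every lambda on the grid.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
z = U.T @ y
for lam in [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]:
    beta_hat = Vt.T @ (d * z / (d**2 + lam))          # ridge solution at this lambda
    df = np.sum(d**2 / (d**2 + lam))                  # effective degrees of freedom
    test_mse = np.mean((y_test - X_test @ beta_hat)**2)
    param_err = np.sum((beta_hat - beta_star)**2)
    print(f"lambda={lam:>7}: df={df:6.1f}  test MSE={test_mse:.4f}  param err={param_err:.3f}")
```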

Common Confusions

Watch Out

The resolvent is not the ridge estimator

$\boldsymbol{R}(\lambda) = (\boldsymbol{X}^\top \boldsymbol{X} + \lambda \boldsymbol{I})^{-1}$ is a matrix that depends on $\boldsymbol{X}$ and $\lambda$ only. The ridge estimator $\hat{\beta}_\lambda = \boldsymbol{R}(\lambda) \boldsymbol{X}^\top \boldsymbol{Y}$ depends additionally on $\boldsymbol{Y}$. The resolvent is the linear operator that maps the right-hand side $\boldsymbol{X}^\top \boldsymbol{Y}$ to the estimator; the spectral analysis of this operator is what makes the family tractable.

Watch Out

Marchenko-Pastur is about the design spectrum, not the residual

The MP distribution describes the limiting spectrum of $\boldsymbol{X}^\top \boldsymbol{X} / n$ when $\boldsymbol{X}$ has iid entries. It says nothing about the residuals $\boldsymbol{Y} - \boldsymbol{X} \hat{\beta}_\lambda$ or their distribution. The role of MP in the ridge-resolvent analysis is to give a closed form for $\mathrm{tr}(\boldsymbol{R}(\lambda))$ and related quadratic forms in the proportional-asymptotic limit. The residuals are a separate object.

Watch Out

Debiased lasso uses a tuned resolvent, not just any resolvent

Plugging $\boldsymbol{M} = \boldsymbol{R}(\lambda_M)$ into the debiased-lasso formula requires choosing $\lambda_M$ carefully: too small and the resolvent is unstable, too large and the debiasing is incomplete. The canonical choice is per-column: a separate $\lambda_M^{(j)}$ for each target coordinate, selected to balance the bias and variance of the $j$-th debiased coordinate. See van de Geer et al. (2014) for the column-wise nodewise-lasso version and HTW (2015), Ch. 11.4, for the exposition.

Exercises

ExerciseCore

Problem

Verify the resolvent identity $\boldsymbol{R}(\lambda_1) - \boldsymbol{R}(\lambda_2) = (\lambda_2 - \lambda_1) \boldsymbol{R}(\lambda_1) \boldsymbol{R}(\lambda_2)$.

ExerciseAdvanced

Problem

For an isotropic design with $\boldsymbol{X}^\top \boldsymbol{X} = c \boldsymbol{I}_p$ (unrealistic but pedagogical), compute the ridge MSE explicitly as a function of $\lambda$ and find the closed-form optimum. Verify that the optimal $\lambda$ scales as $(\sigma^2 / \|\beta^\star\|^2) \cdot p / n$ in the proportional regime.

ExerciseResearch

Problem

The Patil-Wei-Rakhlin (2022) framework extends the Marchenko-Pastur ridge analysis to designs with arbitrary population covariance $\boldsymbol{\Sigma}$. State the analogue of the MP fixed-point equation for the limiting Stieltjes transform of $\boldsymbol{R}(\lambda; \boldsymbol{\Sigma})$ and discuss what changes when $\boldsymbol{\Sigma}$ has a spiked structure.

References

Canonical:

  • Hastie, Tibshirani, Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press (2015). Ch 5 (lasso path), Ch 11 (post-selection inference and debiased lasso). The textbook treatment of the resolvent-based debiasing machinery.
  • Hastie, Tibshirani, Friedman. The Elements of Statistical Learning, 2nd ed. Springer (2009). Ch 3 "Linear Methods for Regression", §3.4.1 (pp. 61-68). The base statement of ridge from which the resolvent view extends.

Modern proportional-asymptotic theory:

  • Dobriban, E. and Wager, S. (2018). "High-Dimensional Asymptotics of Prediction: Ridge Regression and Classification." Annals of Statistics 46(1), 247-279. First clean proportional-asymptotic risk formulas.
  • Hastie, T., Montanari, A., Rosset, S., Tibshirani, R. J. (2022). "Surprises in High-Dimensional Ridgeless Least Squares Interpolation." Annals of Statistics 50(2), 949-986. Limiting risk of $\lambda \to 0^+$ ridge under $p/n > 1$.
  • Patil, P., Wei, Y., Rakhlin, A. (2022). "Mitigating multiple descents: A model-agnostic framework for risk monotonization." arXiv:2205.12937. Ridge resolvents in the multiple-descent regime.
  • Patil, P., LeJeune, D., Wei, Y., Rakhlin, A. (2024). "Failures and Successes of Cross-Validation for Early-Stopped Gradient Descent in High-Dimensional Least Squares." arXiv:2402.16793. The supplementary reference used by Tibshirani's Spring 2023 lecture notes.

Debiased lasso:

  • van de Geer, S., Bühlmann, P., Ritov, Y., Dezeure, R. (2014). "On Asymptotically Optimal Confidence Regions and Tests for High-Dimensional Models." Annals of Statistics 42(3), 1166-1202. The nodewise-Lasso version.
  • Javanmard, A. and Montanari, A. (2014). "Confidence Intervals and Hypothesis Testing for High-Dimensional Regression." Journal of Machine Learning Research 15, 2869-2909. Independent derivation with explicit ridge-resolvent inverse-covariance estimate.

Random matrix background:

  • Marchenko, V. A. and Pastur, L. A. (1967). "Distribution of Eigenvalues for Some Sets of Random Matrices." Mathematics of the USSR-Sbornik 1(4), 457-483. Original.
  • Bai, Z. and Silverstein, J. W. (2010). Spectral Analysis of Large Dimensional Random Matrices (2nd ed.). Springer. The textbook account.

Next Topics

  • Benign overfitting: the proportional-asymptotic risk of ridgeless OLS under $p > n$.
  • Double descent: test MSE as a function of effective complexity, controlled by $\lambda$ through the resolvent.
  • Lasso regression: the $L_1$-regularized cousin; the debiased lasso couples them via the resolvent.
  • Smoothing splines: the function-space analogue; the smoother matrix is the resolvent of a different penalty.

Last reviewed: May 13, 2026
