
ML Methods

Logistic Regression

The foundational linear classifier: sigmoid link function, maximum likelihood estimation, cross-entropy loss, gradient derivation, and regularized variants.


Why This Matters

[Figure: the sigmoid curve $P(Y=1 \mid x)$ plotted against the linear score $z = w \cdot x + b$, rising from 0 to 1, with the decision boundary at $w \cdot x = 0$ (where $p = 0.5$) separating the "predict class 0" and "predict class 1" regions.]

Logistic regression is the simplest non-trivial classifier and serves as the foundation for understanding neural networks. Every neural network with a sigmoid or softmax output layer is, at its final layer, performing logistic regression on learned features. The cross-entropy loss that dominates modern deep learning is the logistic regression loss. If you understand logistic regression deeply (the MLE derivation, the gradient form, why there is no closed-form solution), you understand the core mechanics of training any classifier.

Mental Model

Linear regression predicts a real number $w^T x$. But for classification, you need a probability $p \in [0,1]$. Logistic regression passes the linear prediction through the sigmoid function to squash it into $(0,1)$, then interprets the result as $P(Y = 1 \mid X = x)$. Training finds the weights $w$ that make the observed labels most likely under this model.

Core Definitions

Definition

Sigmoid Function

The sigmoid (logistic) function is:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Key properties: $\sigma(0) = 0.5$, $\sigma(-z) = 1 - \sigma(z)$, and the derivative factors cleanly as $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. It maps $\mathbb{R} \to (0,1)$.
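These properties are easy to check numerically. The sketch below is illustrative (the function name and the piecewise form are choices, not a library API); splitting on the sign of $z$ avoids computing $e^{-z}$ for large negative $z$, which would overflow:

```python
import numpy as np

def sigmoid(z):
    """Numerically stable sigmoid: sigma(z) = 1 / (1 + exp(-z))."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])               # safe: z < 0 here, so exp cannot overflow
    out[~pos] = ez / (1.0 + ez)
    return out

z = np.array([-500.0, 0.0, 500.0])
p = sigmoid(z)                         # stays inside (0, 1), no overflow warning

# Symmetry property from the definition: sigma(-z) = 1 - sigma(z)
sym_ok = np.allclose(sigmoid(np.array([2.0])), 1.0 - sigmoid(np.array([-2.0])))
```

The naive `1 / (1 + np.exp(-z))` works fine for moderate inputs; the split only matters once $|z|$ is large.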

Definition

Logistic Regression Model

For binary classification with $y \in \{0, 1\}$:

$$P(Y = 1 \mid x) = \sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$$

Equivalently, the log-odds (logit) is linear:

$$\log \frac{P(Y=1 \mid x)}{P(Y=0 \mid x)} = w^T x$$

The decision boundary is the hyperplane $\{x : w^T x = 0\}$.
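To make the boundary concrete, here is a minimal sketch (the helper names are local conventions, not a library API): thresholding the probability at 0.5 is exactly the same as checking which side of the hyperplane $w^T x = 0$ the point lies on.

```python
import numpy as np

def predict_proba(X, w):
    """P(Y = 1 | x) = sigma(w^T x) for each row of X."""
    return 1.0 / (1.0 + np.exp(-(X @ w)))

def predict(X, w):
    # p >= 0.5 exactly when w^T x >= 0, so the probability threshold
    # and the hyperplane test are the same decision rule
    return (predict_proba(X, w) >= 0.5).astype(int)

w = np.array([1.0, -1.0])
X = np.array([[2.0, 1.0],    # w^T x =  1  -> class 1
              [1.0, 2.0],    # w^T x = -1  -> class 0
              [1.0, 1.0]])   # w^T x =  0  -> on the boundary, p = 0.5
probs = predict_proba(X, w)
labels = predict(X, w)
```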

Maximum Likelihood Estimation

Given data $\{(x_i, y_i)\}_{i=1}^n$ with $y_i \in \{0,1\}$, the likelihood is:

$$L(w) = \prod_{i=1}^n \sigma(w^T x_i)^{y_i}\,(1 - \sigma(w^T x_i))^{1 - y_i}$$

The negative log-likelihood (NLL) is:

$$\text{NLL}(w) = -\sum_{i=1}^n \left[ y_i \log \sigma(w^T x_i) + (1-y_i)\log(1 - \sigma(w^T x_i)) \right]$$

Definition

Cross-Entropy Loss

The per-sample cross-entropy loss is:

$$\ell(y, \hat{p}) = -y\log(\hat{p}) - (1-y)\log(1 - \hat{p})$$

where $\hat{p} = \sigma(w^T x)$. This is identical to the negative log-likelihood of the Bernoulli model. Minimizing cross-entropy loss = maximizing likelihood: they are the same optimization problem.
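The equivalence can be verified numerically on synthetic data (everything below is illustrative): taking $-\log$ of the Bernoulli likelihood product gives exactly the summed per-sample cross-entropy.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=20)
w = rng.normal(size=3)

p = 1.0 / (1.0 + np.exp(-(X @ w)))      # predicted P(Y = 1 | x_i)

# Bernoulli likelihood of the observed labels, then its negative log
likelihood = np.prod(p**y * (1 - p)**(1 - y))
nll = -np.log(likelihood)

# Summed per-sample cross-entropy loss
cross_entropy = np.sum(-y * np.log(p) - (1 - y) * np.log(1 - p))
```

(For real datasets you would work in log space from the start; the explicit product underflows quickly as $n$ grows.)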

The Gradient

Theorem

Gradient of Logistic Regression Loss

Statement

The gradient of the NLL with respect to $w$ is:

$$\nabla_w \text{NLL}(w) = X^T(\sigma(Xw) - y)$$

where $X \in \mathbb{R}^{n \times d}$ is the data matrix and $y \in \{0,1\}^n$ is the label vector.

Intuition

The gradient is a sum of residuals $(\hat{p}_i - y_i)$ weighted by the corresponding feature vectors $x_i$. If the model over-predicts ($\hat{p}_i > y_i$), the gradient pushes $w$ in the direction that reduces $w^T x_i$, and vice versa. This simplification arises because $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, and the $\sigma(1-\sigma)$ terms cancel during differentiation of the log-likelihood.

Proof Sketch

For a single sample: $\frac{\partial}{\partial w}\left[-y\log\sigma(w^T x) - (1-y)\log(1-\sigma(w^T x))\right]$.

Using $\sigma'(z) = \sigma(z)(1-\sigma(z))$:

$$= -y \cdot \frac{\sigma(w^T x)(1-\sigma(w^T x))}{\sigma(w^T x)} \cdot x \;-\; (1-y) \cdot \frac{-\sigma(w^T x)(1-\sigma(w^T x))}{1-\sigma(w^T x)} \cdot x$$

$$= -y(1-\sigma(w^T x))\,x + (1-y)\,\sigma(w^T x)\,x$$

$$= (\sigma(w^T x) - y)\,x$$

Summing over all samples in matrix form: $X^T(\sigma(Xw) - y)$.
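A standard sanity check on a derivation like this is to compare the analytic gradient against central finite differences at a random point (synthetic data, illustrative names):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, X, y):
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(w, X, y):
    return X.T @ (sigmoid(X @ w) - y)   # the X^T (sigma(Xw) - y) form

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
y = rng.integers(0, 2, size=30).astype(float)
w = rng.normal(size=4)

# Central finite differences along each coordinate direction
eps = 1e-6
fd = np.array([(nll(w + eps * e, X, y) - nll(w - eps * e, X, y)) / (2 * eps)
               for e in np.eye(4)])
g = grad(w, X, y)
```

If the two disagree beyond finite-difference error, the derivation (or the code) is wrong; this check catches sign errors immediately.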

Why It Matters

This gradient has the same form as the linear regression gradient $X^T(Xw - y)$, except with $\sigma$ applied. This is not a coincidence: both are generalized linear models, and the gradient of the NLL for any GLM with canonical link has this $X^T(\text{residual})$ structure. This is why the same code template works for linear, logistic, and Poisson regression.

Failure Mode

If the data is linearly separable, the MLE does not exist: $\|w\| \to \infty$ along the separating direction, driving the loss to zero but never reaching a finite optimum. Regularization prevents this.
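This failure mode is easy to observe. The toy demo below (illustrative dataset and step size) runs plain gradient descent on a separable 1-D problem: $|w|$ keeps growing while the loss keeps shrinking toward zero without ever reaching it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Linearly separable 1-D data: negative x -> class 0, positive x -> class 1
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

def nll(w):
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

w = np.zeros(1)
eta = 0.5
checkpoints = {}
for t in range(1, 20001):
    w -= eta * (X.T @ (sigmoid(X @ w) - y))   # gradient descent step
    if t in (100, 20000):
        checkpoints[t] = (abs(w[0]), nll(w))

norm_early, loss_early = checkpoints[100]
norm_late, loss_late = checkpoints[20000]
```

The weight norm grows without bound (roughly logarithmically in the iteration count), which is exactly why the MLE is not attained at any finite $w$.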

No Closed-Form Solution

Unlike linear regression, setting $\nabla \text{NLL} = 0$ does not yield a closed-form solution. The equation $X^T(\sigma(Xw) - y) = 0$ is nonlinear in $w$ because of the sigmoid. You must use iterative methods.

Theorem

Convexity of Logistic Regression Loss

Statement

The NLL of logistic regression is convex in $w$. The Hessian is:

$$H = X^T S X$$

where $S = \text{diag}(\sigma(Xw) \odot (1 - \sigma(Xw)))$ is a diagonal matrix of variance terms. Each diagonal entry of $S$ lies in $(0, 0.25]$, so $S$ is positive definite and $H = X^T S X$ is positive semi-definite.

Intuition

Convexity means any local minimum is the global minimum. There is at most one optimal $w$ (unique if $X$ has full column rank), so gradient descent and Newton methods are guaranteed to converge to it.

Proof Sketch

The second derivative of the per-sample NLL is $\sigma(w^T x)(1-\sigma(w^T x)) \cdot x x^T$, which is a PSD matrix scaled by a positive scalar. Summing PSD matrices over samples gives a PSD matrix.

Why It Matters

Convexity is why logistic regression is reliable in practice: no local minima to worry about, and standard optimizers converge. This is in stark contrast to neural networks, whose loss landscapes are highly non-convex.

Failure Mode

If features are collinear, $X^T S X$ is singular, and the Hessian is only PSD, not PD. The solution exists but is not unique. L2 regularization fixes this by adding $\lambda I$ to the Hessian.
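Both the PSD claim and the $\lambda I$ fix can be checked directly by building $S$ and $H$ at a random weight vector (synthetic data, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
w = rng.normal(size=3)

p = 1.0 / (1.0 + np.exp(-(X @ w)))
S = np.diag(p * (1 - p))           # diagonal variance terms, each in (0, 0.25]
H = X.T @ S @ X                    # Hessian of the NLL at w

lam = 0.1
H_reg = H + lam * np.eye(3)        # Hessian of the L2-regularized loss

eig_min = np.linalg.eigvalsh(H).min()
eig_min_reg = np.linalg.eigvalsh(H_reg).min()
```

Adding $\lambda I$ shifts every eigenvalue up by $\lambda$, so the regularized Hessian is positive definite even when $X$ has collinear columns.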

Optimization Methods

Gradient Descent: Update $w \leftarrow w - \eta\, X^T(\sigma(Xw) - y)$. Simple but slow convergence (linear rate).

Newton-Raphson / IRLS: Update $w \leftarrow w - (X^T S X)^{-1} X^T(\sigma(Xw) - y)$. This is equivalent to Iteratively Reweighted Least Squares (IRLS): at each step, solve a weighted least squares problem with weights $S$. Converges quadratically near the optimum but costs $O(d^3)$ per iteration for the Hessian inverse.

In practice, for large $d$: use SGD or L-BFGS. For small $d$: Newton/IRLS is standard.
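A compact sketch of the Newton/IRLS update on synthetic data (all names, the data-generating weights, and the tiny ridge term `lam` for numerical safety are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_fit(X, y, iters=25, lam=1e-6):
    """Newton-Raphson / IRLS; lam adds a small ridge so H is invertible."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(X @ w)
        g = X.T @ (p - y) + lam * w                    # gradient
        s = p * (1 - p)                                # diagonal of S
        H = X.T @ (s[:, None] * X) + lam * np.eye(d)   # Hessian
        w -= np.linalg.solve(H, g)                     # Newton step
    return w

rng = np.random.default_rng(3)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = (rng.random(n) < sigmoid(X @ w_true)).astype(float)

w_hat = newton_fit(X, y)
grad_norm = np.linalg.norm(X.T @ (sigmoid(X @ w_hat) - y) + 1e-6 * w_hat)
```

Note the solve uses `np.linalg.solve(H, g)` rather than forming the inverse explicitly, which is cheaper and more stable; the quadratic convergence means a couple of dozen iterations drive the gradient essentially to zero.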

Regularized Variants

Definition

L2-Regularized Logistic Regression

Add $\frac{\lambda}{2}\|w\|_2^2$ to the NLL:

$$\min_w\; -\sum_i \left[ y_i \log \sigma(w^T x_i) + (1-y_i)\log(1-\sigma(w^T x_i)) \right] + \frac{\lambda}{2}\|w\|^2$$

The gradient becomes $X^T(\sigma(Xw) - y) + \lambda w$. The Hessian becomes $X^T S X + \lambda I$, which is always positive definite. This ensures a unique solution and prevents $\|w\| \to \infty$ on separable data.

Definition

L1-Regularized Logistic Regression (Lasso)

Replace the L2 penalty with $\lambda \|w\|_1$. This produces sparse weights (many $w_j = 0$), performing automatic feature selection. The penalty is not differentiable at zero, so optimization requires proximal gradient methods or coordinate descent.
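One minimal proximal-gradient sketch (ISTA): each step is a gradient step on the smooth NLL followed by soft-thresholding, which is the proximal operator of the L1 penalty and produces exact zeros. The data, hyperparameters, and names below are all illustrative choices, not a prescribed recipe.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink toward 0, exact zeros allowed."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_fit(X, y, lam=30.0, eta=1e-3, iters=5000):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        g = X.T @ (sigmoid(X @ w) - y)          # gradient of the smooth part
        w = soft_threshold(w - eta * g, eta * lam)
    return w

rng = np.random.default_rng(4)
n, d = 300, 10
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[0], w_true[1] = 3.0, -3.0                # only two informative features
y = (rng.random(n) < sigmoid(X @ w_true)).astype(float)

w_hat = ista_fit(X, y)
n_exact_zeros = int(np.sum(w_hat == 0.0))
```

The recovered weights keep the signs of the two informative features while the penalty drives (most of) the noise coordinates to exactly zero, which is the feature-selection behavior described above.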

Multiclass Extension: Softmax

For multi-class classification with $K$ classes, replace the sigmoid with the softmax function:

$$P(Y = k \mid x) = \frac{\exp(w_k^T x)}{\sum_{j=1}^K \exp(w_j^T x)}$$

The loss becomes the categorical cross-entropy:

$$\ell = -\sum_{k=1}^K \mathbf{1}[y=k] \log P(Y=k \mid x)$$

The gradient for class $k$ has the same residual structure: $\nabla_{w_k} = X^T(p_k - \mathbf{1}[y=k])$, where $p_k$ is the vector of predicted probabilities for class $k$. This is the output layer of every classification neural network.
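The softmax probabilities and the residual-form gradient can be sketched in a few lines (synthetic data, illustrative names; stacking the per-class gradients gives one $(d, K)$ matrix):

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax; subtracting the row max avoids exp overflow."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(5)
n, d, K = 8, 4, 3
X = rng.normal(size=(n, d))
W = rng.normal(size=(d, K))
y = rng.integers(0, K, size=n)

P = softmax(X @ W)           # (n, K): P[i, k] = P(Y = k | x_i)
Y = np.eye(K)[y]             # one-hot labels: row i is 1[y_i = k]
grad_W = X.T @ (P - Y)       # same residual structure as the binary gradient
```

With $K = 2$ this reduces to the sigmoid model (only the difference of the two score vectors matters), which is why softmax is the natural generalization.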

Common Confusions

Watch Out

Logistic regression is a classifier, not a regressor

Despite the name, logistic regression is used for classification. The "regression" refers to the fact that it regresses the log-odds onto a linear function of the features. Historically, "regression" meant "fitting a model," not "predicting a continuous value." The name stuck.

Watch Out

Cross-entropy loss and log loss are the same thing

You will see the logistic regression loss called "cross-entropy loss," "log loss," "logistic loss," and "negative log-likelihood." These are all the same quantity. The name depends on the field: information theory (cross-entropy), statistics (NLL), and Kaggle (log loss).

Summary

  • Logistic regression: $P(Y=1 \mid x) = \sigma(w^T x)$
  • Cross-entropy loss = negative log-likelihood of the Bernoulli model
  • Gradient: $X^T(\sigma(Xw) - y)$, the residual form
  • No closed-form solution (unlike linear regression)
  • Convex loss: any local minimum is global
  • MLE does not exist for linearly separable data without regularization
  • Softmax generalizes sigmoid to $K$ classes

Exercises

ExerciseCore

Problem

Derive the gradient $\nabla_w \text{NLL}$ for a single training example $(x, y)$ with $y \in \{0,1\}$. Use the identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.

ExerciseAdvanced

Problem

Why does the MLE for logistic regression not exist when the training data is linearly separable? What happens to $w$ and the loss?


References

Canonical:

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapter 4
  • Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapter 8
  • McCullagh & Nelder, Generalized Linear Models (1989), Chapters 1-4
  • Agresti, Categorical Data Analysis (2013), Chapter 5

Current:

  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapter 4


Last reviewed: April 2026
