
ML Methods

Logistic Regression

The foundational linear classifier: sigmoid link function, maximum likelihood estimation, cross-entropy loss, gradient derivation, and regularized variants.


Why This Matters

[Figure: the sigmoid curve $P(Y=1 \mid x)$ plotted against the linear score $z = w \cdot x + b$, rising from 0 to 1, with the decision boundary at $w \cdot x = 0$ (where $p = 0.5$) separating the "predict class 0" and "predict class 1" regions.]

Logistic regression is the simplest non-trivial classifier and serves as the foundation for understanding neural networks. Every neural network with a sigmoid or softmax output layer is, at its final layer, performing logistic regression on learned features. The cross-entropy loss that dominates modern deep learning is the logistic regression loss. If you understand logistic regression deeply (the MLE derivation, the gradient form, why there is no closed-form solution), you understand the core mechanics of training any classifier.

Mental Model

Linear regression predicts a real number $w^T x$. But for classification, you need a probability $p \in [0,1]$. Logistic regression passes the linear prediction through the sigmoid function to squash it into $(0,1)$, then interprets the result as $P(Y = 1 \mid X = x)$. Training finds the weights $w$ that make the observed labels most likely under this model.

Core Definitions

Definition

Sigmoid Function

The sigmoid (logistic) function is:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Key properties: $\sigma(0) = 0.5$, $\sigma(-z) = 1 - \sigma(z)$, and the derivative factors cleanly as $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. It maps $\mathbb{R} \to (0,1)$.
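These properties are easy to check numerically. The sketch below is illustrative (the function name and the piecewise form are choices, not a library API); splitting on the sign of $z$ avoids computing $e^{-z}$ for large negative $z$, which would overflow:

```python
import numpy as np

def sigmoid(z):
    """Numerically stable sigmoid: sigma(z) = 1 / (1 + exp(-z))."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])               # safe: z < 0 here, so exp cannot overflow
    out[~pos] = ez / (1.0 + ez)
    return out

z = np.array([-500.0, 0.0, 500.0])
p = sigmoid(z)                         # stays inside (0, 1), no overflow warning

# Symmetry property from the definition: sigma(-z) = 1 - sigma(z)
sym_ok = np.allclose(sigmoid(np.array([2.0])), 1.0 - sigmoid(np.array([-2.0])))
```

The naive `1 / (1 + np.exp(-z))` works fine for moderate inputs; the split only matters once $|z|$ is large.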

Definition

Logistic Regression Model

For binary classification with $y \in \{0, 1\}$:

$$P(Y = 1 \mid x) = \sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$$

Equivalently, the log-odds (logit) is linear:

$$\log \frac{P(Y=1 \mid x)}{P(Y=0 \mid x)} = w^T x$$

The decision boundary is the hyperplane $\{x : w^T x = 0\}$.
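To make the boundary concrete, here is a minimal sketch (the helper names are local conventions, not a library API): thresholding the probability at 0.5 is exactly the same as checking which side of the hyperplane $w^T x = 0$ the point lies on.

```python
import numpy as np

def predict_proba(X, w):
    """P(Y = 1 | x) = sigma(w^T x) for each row of X."""
    return 1.0 / (1.0 + np.exp(-(X @ w)))

def predict(X, w):
    # p >= 0.5 exactly when w^T x >= 0, so the probability threshold
    # and the hyperplane test are the same decision rule
    return (predict_proba(X, w) >= 0.5).astype(int)

w = np.array([1.0, -1.0])
X = np.array([[2.0, 1.0],    # w^T x =  1  -> class 1
              [1.0, 2.0],    # w^T x = -1  -> class 0
              [1.0, 1.0]])   # w^T x =  0  -> on the boundary, p = 0.5
probs = predict_proba(X, w)
labels = predict(X, w)
```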

Maximum Likelihood Estimation

Given data $\{(x_i, y_i)\}_{i=1}^n$ with $y_i \in \{0,1\}$, the likelihood is:

$$L(w) = \prod_{i=1}^n \sigma(w^T x_i)^{y_i}\,(1 - \sigma(w^T x_i))^{1 - y_i}$$

The negative log-likelihood (NLL) is:

$$\text{NLL}(w) = -\sum_{i=1}^n \left[ y_i \log \sigma(w^T x_i) + (1-y_i)\log(1 - \sigma(w^T x_i)) \right]$$

Definition

Cross-Entropy Loss

The per-sample cross-entropy loss is:

$$\ell(y, \hat{p}) = -y\log(\hat{p}) - (1-y)\log(1 - \hat{p})$$

where $\hat{p} = \sigma(w^T x)$. This is identical to the negative log-likelihood of the Bernoulli model. Minimizing cross-entropy loss = maximizing likelihood: they are the same optimization problem.
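The equivalence can be verified numerically on synthetic data (everything below is illustrative): taking $-\log$ of the Bernoulli likelihood product gives exactly the summed per-sample cross-entropy.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=20)
w = rng.normal(size=3)

p = 1.0 / (1.0 + np.exp(-(X @ w)))      # predicted P(Y = 1 | x_i)

# Bernoulli likelihood of the observed labels, then its negative log
likelihood = np.prod(p**y * (1 - p)**(1 - y))
nll = -np.log(likelihood)

# Summed per-sample cross-entropy loss
cross_entropy = np.sum(-y * np.log(p) - (1 - y) * np.log(1 - p))
```

(For real datasets you would work in log space from the start; the explicit product underflows quickly as $n$ grows.)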

The Gradient

Theorem

Gradient of Logistic Regression Loss

Statement

The gradient of the NLL with respect to $w$ is:

$$\nabla_w \text{NLL}(w) = X^T(\sigma(Xw) - y)$$

where $X \in \mathbb{R}^{n \times d}$ is the data matrix and $y \in \{0,1\}^n$ is the label vector.

Intuition

The gradient is a sum of residuals $(\hat{p}_i - y_i)$ weighted by the corresponding feature vectors $x_i$. If the model over-predicts ($\hat{p}_i > y_i$), the gradient pushes $w$ in the direction that reduces $w^T x_i$, and vice versa. This simplification arises because $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, and the $\sigma(1-\sigma)$ terms cancel during differentiation of the log-likelihood.

Proof Sketch

For a single sample: $\frac{\partial}{\partial w}\left[-y\log\sigma(w^T x) - (1-y)\log(1-\sigma(w^T x))\right]$.

Using $\sigma'(z) = \sigma(z)(1-\sigma(z))$:

$$= -y \cdot \frac{\sigma(w^T x)(1-\sigma(w^T x))}{\sigma(w^T x)} \cdot x \;-\; (1-y) \cdot \frac{-\sigma(w^T x)(1-\sigma(w^T x))}{1-\sigma(w^T x)} \cdot x$$

$$= -y(1-\sigma(w^T x))\,x + (1-y)\,\sigma(w^T x)\,x$$

$$= (\sigma(w^T x) - y)\,x$$

Summing over all samples in matrix form: $X^T(\sigma(Xw) - y)$.
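A standard sanity check on a derivation like this is to compare the analytic gradient against central finite differences at a random point (synthetic data, illustrative names):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, X, y):
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(w, X, y):
    return X.T @ (sigmoid(X @ w) - y)   # the X^T (sigma(Xw) - y) form

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
y = rng.integers(0, 2, size=30).astype(float)
w = rng.normal(size=4)

# Central finite differences along each coordinate direction
eps = 1e-6
fd = np.array([(nll(w + eps * e, X, y) - nll(w - eps * e, X, y)) / (2 * eps)
               for e in np.eye(4)])
g = grad(w, X, y)
```

If the two disagree beyond finite-difference error, the derivation (or the code) is wrong; this check catches sign errors immediately.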

Why It Matters

This gradient has the same form as the linear regression gradient $X^T(Xw - y)$, except with $\sigma$ applied. This is not a coincidence: both are generalized linear models, and the gradient of the NLL for any GLM with canonical link has this $X^T(\text{residual})$ structure. This is why the same code template works for linear, logistic, and Poisson regression.

Failure Mode

If the data is linearly separable, the MLE does not exist: $\|w\| \to \infty$ along the separating direction, driving the loss to zero but never reaching a finite optimum. Regularization prevents this.
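This failure mode is easy to observe. The toy demo below (illustrative dataset and step size) runs plain gradient descent on a separable 1-D problem: $|w|$ keeps growing while the loss keeps shrinking toward zero without ever reaching it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Linearly separable 1-D data: negative x -> class 0, positive x -> class 1
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

def nll(w):
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

w = np.zeros(1)
eta = 0.5
checkpoints = {}
for t in range(1, 20001):
    w -= eta * (X.T @ (sigmoid(X @ w) - y))   # gradient descent step
    if t in (100, 20000):
        checkpoints[t] = (abs(w[0]), nll(w))

norm_early, loss_early = checkpoints[100]
norm_late, loss_late = checkpoints[20000]
```

The weight norm grows without bound (roughly logarithmically in the iteration count), which is exactly why the MLE is not attained at any finite $w$.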

No Closed-Form Solution

Unlike linear regression, setting $\nabla \text{NLL} = 0$ does not yield a closed-form solution. The equation $X^T(\sigma(Xw) - y) = 0$ is nonlinear in $w$ because of the sigmoid. You must use iterative methods.

Theorem

Convexity of Logistic Regression Loss

Statement

The NLL of logistic regression is convex in $w$. The Hessian is:

$$H = X^T S X$$

where $S = \text{diag}(\sigma(Xw) \odot (1 - \sigma(Xw)))$ is a diagonal matrix of variance terms. Each diagonal entry of $S$ lies in $(0, 0.25]$, so $S$ is positive definite and $H = X^T S X$ is positive semi-definite.

Intuition

Convexity means any local minimum is the global minimum. There is at most one optimal $w$ (unique if $X$ has full column rank), so gradient descent and Newton methods are guaranteed to converge to it.

Proof Sketch

The second derivative of the per-sample NLL is $\sigma(w^T x)(1-\sigma(w^T x)) \cdot x x^T$, which is a PSD matrix scaled by a positive scalar. Summing PSD matrices over samples gives a PSD matrix.

Why It Matters

Convexity is why logistic regression is reliable in practice: no local minima to worry about, and standard optimizers converge. This is in stark contrast to neural networks, whose loss landscapes are highly non-convex.

Failure Mode

If features are collinear, $X^T S X$ is singular, and the Hessian is only PSD, not PD. The solution exists but is not unique. L2 regularization fixes this by adding $\lambda I$ to the Hessian.
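Both the PSD claim and the $\lambda I$ fix can be checked directly by building $S$ and $H$ at a random weight vector (synthetic data, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
w = rng.normal(size=3)

p = 1.0 / (1.0 + np.exp(-(X @ w)))
S = np.diag(p * (1 - p))           # diagonal variance terms, each in (0, 0.25]
H = X.T @ S @ X                    # Hessian of the NLL at w

lam = 0.1
H_reg = H + lam * np.eye(3)        # Hessian of the L2-regularized loss

eig_min = np.linalg.eigvalsh(H).min()
eig_min_reg = np.linalg.eigvalsh(H_reg).min()
```

Adding $\lambda I$ shifts every eigenvalue up by $\lambda$, so the regularized Hessian is positive definite even when $X$ has collinear columns.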

Optimization Methods

Gradient Descent: Update $w \leftarrow w - \eta\, X^T(\sigma(Xw) - y)$. Simple but slow convergence (linear rate).

Newton-Raphson / IRLS: Update $w \leftarrow w - (X^T S X)^{-1} X^T(\sigma(Xw) - y)$. This is equivalent to Iteratively Reweighted Least Squares (IRLS): at each step, solve a weighted least squares problem with weights $S$. Converges quadratically near the optimum but costs $O(d^3)$ per iteration for the Hessian inverse.

In practice, for large $d$: use SGD or L-BFGS. For small $d$: Newton/IRLS is standard.
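A compact sketch of the Newton/IRLS update on synthetic data (all names, the data-generating weights, and the tiny ridge term `lam` for numerical safety are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_fit(X, y, iters=25, lam=1e-6):
    """Newton-Raphson / IRLS; lam adds a small ridge so H is invertible."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(X @ w)
        g = X.T @ (p - y) + lam * w                    # gradient
        s = p * (1 - p)                                # diagonal of S
        H = X.T @ (s[:, None] * X) + lam * np.eye(d)   # Hessian
        w -= np.linalg.solve(H, g)                     # Newton step
    return w

rng = np.random.default_rng(3)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = (rng.random(n) < sigmoid(X @ w_true)).astype(float)

w_hat = newton_fit(X, y)
grad_norm = np.linalg.norm(X.T @ (sigmoid(X @ w_hat) - y) + 1e-6 * w_hat)
```

Note the solve uses `np.linalg.solve(H, g)` rather than forming the inverse explicitly, which is cheaper and more stable; the quadratic convergence means a couple of dozen iterations drive the gradient essentially to zero.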

Regularized Variants

Definition

L2-Regularized Logistic Regression

Add $\frac{\lambda}{2}\|w\|_2^2$ to the NLL:

$$\min_w\; -\sum_i \left[ y_i \log \sigma(w^T x_i) + (1-y_i)\log(1-\sigma(w^T x_i)) \right] + \frac{\lambda}{2}\|w\|^2$$

The gradient becomes $X^T(\sigma(Xw) - y) + \lambda w$. The Hessian becomes $X^T S X + \lambda I$, which is always positive definite. This ensures a unique solution and prevents $\|w\| \to \infty$ on separable data.

Definition

L1-Regularized Logistic Regression (Lasso)

Replace the L2 penalty with $\lambda \|w\|_1$. This produces sparse weights (many $w_j = 0$), performing automatic feature selection. The penalty is not differentiable at zero, so optimization requires proximal gradient methods or coordinate descent.
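One minimal proximal-gradient sketch (ISTA): each step is a gradient step on the smooth NLL followed by soft-thresholding, which is the proximal operator of the L1 penalty and produces exact zeros. The data, hyperparameters, and names below are all illustrative choices, not a prescribed recipe.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink toward 0, exact zeros allowed."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_fit(X, y, lam=30.0, eta=1e-3, iters=5000):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        g = X.T @ (sigmoid(X @ w) - y)          # gradient of the smooth part
        w = soft_threshold(w - eta * g, eta * lam)
    return w

rng = np.random.default_rng(4)
n, d = 300, 10
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[0], w_true[1] = 3.0, -3.0                # only two informative features
y = (rng.random(n) < sigmoid(X @ w_true)).astype(float)

w_hat = ista_fit(X, y)
n_exact_zeros = int(np.sum(w_hat == 0.0))
```

The recovered weights keep the signs of the two informative features while the penalty drives (most of) the noise coordinates to exactly zero, which is the feature-selection behavior described above.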

Multiclass Extension: Softmax

For multi-class classification with $K$ classes, replace the sigmoid with the softmax function:

$$P(Y = k \mid x) = \frac{\exp(w_k^T x)}{\sum_{j=1}^K \exp(w_j^T x)}$$

The loss becomes the categorical cross-entropy:

$$\ell = -\sum_{k=1}^K \mathbf{1}[y=k] \log P(Y=k \mid x)$$

The gradient for class $k$ has the same residual structure: $\nabla_{w_k} = X^T(p_k - \mathbf{1}[y=k])$, where $p_k$ is the vector of predicted probabilities for class $k$. This is the output layer of every classification neural network.
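The softmax probabilities and the residual-form gradient can be sketched in a few lines (synthetic data, illustrative names; stacking the per-class gradients gives one $(d, K)$ matrix):

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax; subtracting the row max avoids exp overflow."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(5)
n, d, K = 8, 4, 3
X = rng.normal(size=(n, d))
W = rng.normal(size=(d, K))
y = rng.integers(0, K, size=n)

P = softmax(X @ W)           # (n, K): P[i, k] = P(Y = k | x_i)
Y = np.eye(K)[y]             # one-hot labels: row i is 1[y_i = k]
grad_W = X.T @ (P - Y)       # same residual structure as the binary gradient
```

With $K = 2$ this reduces to the sigmoid model (only the difference of the two score vectors matters), which is why softmax is the natural generalization.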

Common Confusions

Watch Out

Logistic regression is a classifier, not a regressor

Despite the name, logistic regression is used for classification. The "regression" refers to the fact that it regresses the log-odds onto a linear function of the features. Historically, "regression" meant "fitting a model," not "predicting a continuous value." The name stuck.

Watch Out

Cross-entropy loss and log loss are the same thing

You will see the logistic regression loss called "cross-entropy loss," "log loss," "logistic loss," and "negative log-likelihood." These are all the same quantity. The name depends on the field: information theory (cross-entropy), statistics (NLL), and Kaggle (log loss).

Summary

  • Logistic regression: $P(Y=1 \mid x) = \sigma(w^T x)$
  • Cross-entropy loss = negative log-likelihood of the Bernoulli model
  • Gradient: $X^T(\sigma(Xw) - y)$, the residual form
  • No closed-form solution (unlike linear regression)
  • Convex loss: any local minimum is global
  • MLE does not exist for linearly separable data without regularization
  • Softmax generalizes sigmoid to $K$ classes

Exercises

ExerciseCore

Problem

Derive the gradient $\nabla_w \text{NLL}$ for a single training example $(x, y)$ with $y \in \{0,1\}$. Use the identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.

ExerciseAdvanced

Problem

Why does the MLE for logistic regression not exist when the training data is linearly separable? What happens to $w$ and the loss?


References

Canonical:

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapter 4
  • Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapter 8
  • McCullagh & Nelder, Generalized Linear Models (1989), Chapters 1-4
  • Agresti, Categorical Data Analysis (2013), Chapter 5

Current:

  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapter 4


Last reviewed: April 2026
