Logistic Regression
The foundational linear classifier: sigmoid link function, maximum likelihood estimation, cross-entropy loss, gradient derivation, and regularized variants.
Why This Matters
Logistic regression is the simplest non-trivial classifier and serves as the foundation for understanding neural networks. Every neural network with a sigmoid or softmax output layer is, at its final layer, performing logistic regression on learned features. The cross-entropy loss that dominates modern deep learning is the logistic regression loss. If you understand logistic regression deeply, including the MLE derivation, the form of the gradient, and why there is no closed-form solution, you understand the core mechanics of training any classifier.
Mental Model
Linear regression predicts a real number $\hat{y} = w^\top x \in \mathbb{R}$. But for classification, you need a probability in $[0, 1]$. Logistic regression passes the linear prediction through the sigmoid function to squash it into $(0, 1)$, then interprets the result as $P(y = 1 \mid x)$. Training finds the weights that make the observed labels most likely under this model.
Core Definitions
Sigmoid Function
The sigmoid (logistic) function is:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Key properties: $\sigma(0) = 1/2$, $\sigma(-z) = 1 - \sigma(z)$, and the derivative factors cleanly as $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. It maps $\mathbb{R} \to (0, 1)$.
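These properties are easy to verify numerically. A minimal NumPy sketch (the `logaddexp` formulation is my implementation choice for numerical stability, not from the text):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1/(1 + e^{-z}) = exp(-log(1 + e^{-z}));
    # np.logaddexp(0, -z) computes log(1 + e^{-z}) without overflow.
    return np.exp(-np.logaddexp(0.0, -z))

z = np.linspace(-5.0, 5.0, 11)

# sigma(0) = 1/2
assert abs(float(sigmoid(0.0)) - 0.5) < 1e-12

# Symmetry: sigma(-z) = 1 - sigma(z)
assert np.allclose(sigmoid(-z), 1.0 - sigmoid(z))

# Derivative identity sigma'(z) = sigma(z)(1 - sigma(z)), via central differences
h = 1e-6
num_deriv = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
assert np.allclose(num_deriv, sigmoid(z) * (1 - sigmoid(z)), atol=1e-6)
```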
Logistic Regression Model
For binary classification with $y \in \{0, 1\}$:

$$P(y = 1 \mid x; w) = \sigma(w^\top x) = \frac{1}{1 + e^{-w^\top x}}$$

Equivalently, the log-odds (logit) is linear:

$$\log \frac{P(y = 1 \mid x)}{P(y = 0 \mid x)} = w^\top x$$

The decision boundary is the hyperplane $w^\top x = 0$.
Maximum Likelihood Estimation
Given data $\{(x_i, y_i)\}_{i=1}^n$ with $y_i \in \{0, 1\}$, the likelihood is:

$$L(w) = \prod_{i=1}^n p_i^{y_i} (1 - p_i)^{1 - y_i}, \qquad p_i = \sigma(w^\top x_i)$$

The negative log-likelihood (NLL) is:

$$\mathrm{NLL}(w) = -\sum_{i=1}^n \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$$
Cross-Entropy Loss
The per-sample cross-entropy loss is:

$$\ell(y, p) = -\left[ y \log p + (1 - y) \log(1 - p) \right]$$

where $p = \sigma(w^\top x)$. This is identical to the negative log-likelihood of the Bernoulli model. Minimizing cross-entropy loss = maximizing likelihood. They are the same optimization problem.
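The identity is worth checking by hand once. A small sketch with made-up labels and probabilities (function names are mine):

```python
import numpy as np

def cross_entropy(y, p):
    # Per-sample loss: -[y log p + (1-y) log(1-p)]
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def bernoulli_nll(y, p):
    # Negative log of the Bernoulli pmf p^y (1-p)^(1-y)
    return -np.log(p**y * (1 - p)**(1 - y))

y = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
p = np.array([0.9, 0.2, 0.6, 0.99, 0.4])

# Same quantity, term by term
assert np.allclose(cross_entropy(y, p), bernoulli_nll(y, p))
```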
The Gradient
Gradient of Logistic Regression Loss
Statement
The gradient of the NLL with respect to $w$ is:

$$\nabla_w \mathrm{NLL}(w) = \sum_{i=1}^n (p_i - y_i)\, x_i = X^\top (p - y)$$

where $X \in \mathbb{R}^{n \times d}$ is the data matrix, $y$ is the label vector, and $p = \sigma(Xw)$ is the vector of predicted probabilities.
Intuition
The gradient is a sum of residuals $(p_i - y_i)$ weighted by the corresponding feature vectors $x_i$. If the model over-predicts ($p_i > y_i$), the gradient descent step pushes $w$ in the direction that reduces $w^\top x_i$, and vice versa. This simplification arises because $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, and the $\sigma(1 - \sigma)$ terms cancel during differentiation of the log-likelihood.
Proof Sketch
For a single sample: $\ell(w) = -\left[ y \log \sigma(w^\top x) + (1 - y) \log(1 - \sigma(w^\top x)) \right]$.

Using $\sigma'(z) = \sigma(z)(1 - \sigma(z))$:

$$\nabla_w \ell = -\left[ y (1 - \sigma(w^\top x)) - (1 - y)\, \sigma(w^\top x) \right] x = \left( \sigma(w^\top x) - y \right) x$$

Summing over all samples in matrix form: $\nabla_w \mathrm{NLL} = X^\top (p - y)$.
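The closed-form gradient can be validated against finite differences, which is also a good habit when implementing it yourself. A sketch (random data, function names mine):

```python
import numpy as np

def sigmoid(z):
    return np.exp(-np.logaddexp(0.0, -z))

def nll(w, X, y):
    # Negative log-likelihood of the Bernoulli model
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(w, X, y):
    # Closed-form gradient: X^T (p - y)
    return X.T @ (sigmoid(X @ w) - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (rng.random(50) < 0.5).astype(float)
w = rng.normal(size=3)

# Central-difference estimate of each partial derivative
h = 1e-6
num = np.array([(nll(w + h * e, X, y) - nll(w - h * e, X, y)) / (2 * h)
                for e in np.eye(3)])
assert np.allclose(num, grad(w, X, y), atol=1e-4)
```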
Why It Matters
This gradient has the same form as the linear regression gradient $X^\top (\hat{y} - y)$, except the prediction is $\hat{y} = \sigma(Xw)$ rather than $Xw$. This is not a coincidence: both are generalized linear models, and the gradient of the NLL for any GLM with canonical link has this structure. This is why the same code template works for linear, logistic, and Poisson regression.
Failure Mode
If the data is linearly separable, the MLE does not exist: $\|w\| \to \infty$ along the separating direction, driving the loss toward zero but never attaining a finite optimum. Regularization prevents this.
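This failure mode is easy to exhibit on a toy separable dataset. A sketch (two perfectly separated 1-D points; the step size is an arbitrary choice):

```python
import numpy as np

def sigmoid(z):
    return np.exp(-np.logaddexp(0.0, -z))

# Perfectly separable data: x = -1 labeled 0, x = +1 labeled 1.
X = np.array([[-1.0], [1.0]])
y = np.array([0.0, 1.0])

w = np.zeros(1)
norms = []
for _ in range(2000):
    w -= 0.5 * X.T @ (sigmoid(X @ w) - y)   # gradient descent on the NLL
    norms.append(abs(w[0]))

# ||w|| grows without bound: the loss keeps shrinking toward 0
# but no finite w attains it.
assert norms[-1] > norms[1000] > norms[10]
```

In practice libraries hide this by regularizing; with no penalty, the iterates simply drift off to infinity while the loss flatlines near zero.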
No Closed-Form Solution
Unlike linear regression, setting $\nabla_w \mathrm{NLL}(w) = 0$ does not yield a closed-form solution. The equation $X^\top (\sigma(Xw) - y) = 0$ is nonlinear in $w$ because of the sigmoid. You must use iterative methods.
Convexity of Logistic Regression Loss
Statement
The NLL of logistic regression is convex in $w$. The Hessian is:

$$H = \nabla_w^2 \mathrm{NLL}(w) = X^\top S X, \qquad S = \mathrm{diag}\big(p_1(1 - p_1), \dots, p_n(1 - p_n)\big)$$

where $S$ is a diagonal matrix of variance terms. Since $S$ is positive semi-definite (each diagonal entry is in $(0, 1/4]$), $H = X^\top S X$ is positive semi-definite.
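Both claims, the bound on the variance terms and the positive semi-definiteness of the Hessian, can be checked numerically at a random point. A sketch:

```python
import numpy as np

def sigmoid(z):
    return np.exp(-np.logaddexp(0.0, -z))

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 4))
w = rng.normal(size=4)

p = sigmoid(X @ w)
s = p * (1 - p)                  # diagonal of S; each entry in (0, 1/4]
H = X.T @ (s[:, None] * X)       # Hessian X^T S X without forming diag(s)

eigvals = np.linalg.eigvalsh(H)
assert np.all(eigvals >= -1e-10)             # positive semi-definite
assert np.all((s > 0) & (s <= 0.25 + 1e-12)) # variance terms bounded by 1/4
```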
Intuition
Convexity means any local minimum is the global minimum. There is at most one optimal $w$ (unique if $X$ has full column rank), so gradient descent and Newton methods are guaranteed to converge to it.
Proof Sketch
The second derivative of the per-sample NLL is $p_i(1 - p_i)\, x_i x_i^\top$, which is a PSD matrix scaled by a positive scalar. Summing PSD matrices over samples gives a PSD matrix.
Why It Matters
Convexity is why logistic regression is reliable in practice: no local minima to worry about, and standard optimizers converge. This is in stark contrast to neural networks, whose loss landscapes are highly non-convex.
Failure Mode
If features are collinear, $X^\top X$ is singular, and the Hessian $X^\top S X$ is only PSD, not PD. The solution exists but is not unique. L2 regularization fixes this by adding $\lambda I$ to the Hessian.
Optimization Methods
Gradient Descent: Update $w \leftarrow w - \eta\, X^\top (p - y)$. Simple but slow convergence (linear rate).

Newton-Raphson / IRLS: Update $w \leftarrow w - (X^\top S X)^{-1} X^\top (p - y)$. This is equivalent to Iteratively Reweighted Least Squares (IRLS): at each step, solve a weighted least squares problem with weights $s_i = p_i(1 - p_i)$. Converges quadratically near the optimum but costs $O(d^3)$ per iteration for the Hessian inverse.

In practice, for large $n$: use SGD or L-BFGS. For small $n$: Newton/IRLS is standard.
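Both updates can be sketched in a few lines. The following is a minimal NumPy comparison on synthetic data (the learning rate, iteration counts, and tiny Hessian ridge are my choices, not prescriptions); by convexity both solvers reach the same optimum:

```python
import numpy as np

def sigmoid(z):
    return np.exp(-np.logaddexp(0.0, -z))

def fit_newton(X, y, n_iter=25):
    """Newton-Raphson / IRLS: w <- w - (X^T S X)^{-1} X^T (p - y)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        g = X.T @ (p - y)                       # gradient
        s = p * (1 - p)                         # IRLS weights
        H = X.T @ (s[:, None] * X)              # Hessian X^T S X
        w -= np.linalg.solve(H + 1e-8 * np.eye(len(w)), g)  # tiny ridge for safety
    return w

def fit_gd(X, y, lr=0.5, n_iter=5000):
    """Plain gradient descent on the mean NLL (linear convergence rate)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        w -= lr * X.T @ (sigmoid(X @ w) - y) / len(y)
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = (rng.random(200) < sigmoid(X @ w_true)).astype(float)

w_newton = fit_newton(X, y)
w_gd = fit_gd(X, y)
# Same unique convex optimum; Newton needs far fewer iterations to get there.
assert np.allclose(w_newton, w_gd, atol=1e-2)
```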
Regularized Variants
L2-Regularized Logistic Regression
Add $\frac{\lambda}{2} \|w\|_2^2$ to the NLL:

$$J(w) = \mathrm{NLL}(w) + \frac{\lambda}{2} \|w\|_2^2$$

The gradient becomes $X^\top (p - y) + \lambda w$. The Hessian becomes $X^\top S X + \lambda I$, which is always positive definite. This ensures a unique solution and prevents $\|w\| \to \infty$ on separable data.
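The regularized gradient and Hessian slot straight into the Newton update. A sketch showing that, on the same separable toy data that broke the unregularized MLE, the L2-penalized optimum is finite ($\lambda = 1$ is an arbitrary illustrative value):

```python
import numpy as np

def sigmoid(z):
    return np.exp(-np.logaddexp(0.0, -z))

def fit_l2(X, y, lam=1.0, n_iter=50):
    """Newton steps on the L2-regularized NLL (sketch)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        g = X.T @ (p - y) + lam * w                        # gradient + lambda*w
        s = p * (1 - p)
        H = X.T @ (s[:, None] * X) + lam * np.eye(len(w))  # Hessian + lambda*I, PD
        w -= np.linalg.solve(H, g)
    return w

# Separable data: without regularization ||w|| diverges; with it, a finite optimum.
X = np.array([[-1.0], [1.0]])
y = np.array([0.0, 1.0])
w = fit_l2(X, y, lam=1.0)
assert np.isfinite(w).all() and abs(w[0]) < 10
```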
L1-Regularized Logistic Regression (Lasso)
Replace the L2 penalty with $\lambda \|w\|_1$. This produces sparse weights (many $w_j = 0$ exactly), performing automatic feature selection. The penalty is not differentiable at zero, so it requires proximal gradient methods or coordinate descent.
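A minimal proximal-gradient (ISTA) sketch: take a gradient step on the smooth NLL, then apply the soft-thresholding operator, which is what produces exact zeros. All hyperparameters here ($\lambda$, step size, iteration count) are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return np.exp(-np.logaddexp(0.0, -z))

def soft_threshold(w, t):
    # Proximal operator of t * ||w||_1: shrinks coefficients, zeroing small ones.
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def fit_l1(X, y, lam=10.0, lr=0.01, n_iter=2000):
    """Proximal gradient (ISTA) sketch for L1-regularized logistic regression."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        g = X.T @ (sigmoid(X @ w) - y)       # gradient of the smooth NLL part
        w = soft_threshold(w - lr * g, lr * lam)
    return w

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
# Only the first feature carries signal; L1 should zero out noise features.
y = (rng.random(200) < sigmoid(2.0 * X[:, 0])).astype(float)

w = fit_l1(X, y)
assert abs(w[0]) > 0.1        # the informative feature survives
assert np.any(w == 0.0)       # exact zeros from soft-thresholding
```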
Multiclass Extension: Softmax
For multiclass classification with $K$ classes, replace the sigmoid with the softmax function:

$$P(y = k \mid x) = \frac{e^{w_k^\top x}}{\sum_{j=1}^K e^{w_j^\top x}}$$

The loss becomes the categorical cross-entropy:

$$\ell = -\sum_{k=1}^K \mathbb{1}[y = k] \log P(y = k \mid x)$$

The gradient for class $k$ has the same residual structure: $\nabla_{w_k} = X^\top (p_k - \mathbb{1}[y = k])$, where $p_k$ is the vector of predicted probabilities for class $k$. This is the output layer of every classification neural network.
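The residual structure of the softmax gradient can be verified the same way as in the binary case, via finite differences. A sketch with one-hot labels (random data, names mine):

```python
import numpy as np

def softmax(Z):
    # Subtract the row max for numerical stability before exponentiating.
    Z = Z - Z.max(axis=1, keepdims=True)
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

def ce_loss(W, X, Y):
    # Categorical cross-entropy; Y is one-hot with shape (n, K).
    P = softmax(X @ W)
    return -np.sum(Y * np.log(P))

def ce_grad(W, X, Y):
    # Same residual form as the binary case: X^T (P - Y)
    return X.T @ (softmax(X @ W) - Y)

rng = np.random.default_rng(4)
n, d, K = 30, 4, 3
X = rng.normal(size=(n, d))
labels = rng.integers(0, K, size=n)
Y = np.eye(K)[labels]                 # one-hot encoding
W = rng.normal(size=(d, K))

# Finite-difference check of every entry of the gradient
h = 1e-6
num = np.zeros((d, K))
for i in range(d):
    for j in range(K):
        E = np.zeros((d, K)); E[i, j] = h
        num[i, j] = (ce_loss(W + E, X, Y) - ce_loss(W - E, X, Y)) / (2 * h)
assert np.allclose(num, ce_grad(W, X, Y), atol=1e-4)
```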
Common Confusions
Logistic regression is a classifier, not a regressor
Despite the name, logistic regression is used for classification. The "regression" refers to the fact that it regresses the log-odds onto a linear function of the features. Historically, "regression" meant "fitting a model," not "predicting a continuous value." The name stuck.
Cross-entropy loss and log loss are the same thing
You will see the logistic regression loss called "cross-entropy loss," "log loss," "logistic loss," and "negative log-likelihood." These are all the same quantity. The name depends on the field: information theory (cross-entropy), statistics (NLL), and Kaggle (log loss).
Summary
- Logistic regression: $P(y = 1 \mid x) = \sigma(w^\top x)$
- Cross-entropy loss = negative log-likelihood of the Bernoulli model
- Gradient: $X^\top (p - y)$, the residual form
- No closed-form solution (unlike linear regression)
- Convex loss: any local minimum is global
- MLE does not exist for linearly separable data without regularization
- Softmax generalizes sigmoid to $K$ classes
Exercises
Problem
Derive the gradient $\nabla_w \ell$ for a single training example with $y \in \{0, 1\}$. Use the identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.
Problem
Why does the MLE for logistic regression not exist when the training data is linearly separable? What happens to $\|w\|$ and the loss?
References
Canonical:
- Bishop, Pattern Recognition and Machine Learning (2006), Chapter 4
- Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapter 8
- McCullagh & Nelder, Generalized Linear Models (1989), Chapters 1-4
- Agresti, Categorical Data Analysis (2013), Chapter 5
Current:
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapter 4
Next Topics
The natural next step from logistic regression:
- Support Vector Machines: a different approach to linear classification via margin maximization
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)