What Each Measures
Both SVMs and logistic regression produce linear (or nonlinear) decision boundaries for binary classification. They differ in what they optimize and what they output.
SVM finds the hyperplane that maximizes the margin, the distance between the decision boundary and the nearest training points. It minimizes the hinge loss: $\ell_{\text{hinge}} = \max(0,\, 1 - y f(x))$, where $f(x) = w^\top x + b$.
Logistic Regression finds the hyperplane that maximizes the likelihood of the data under a logistic model. It minimizes the log loss (cross-entropy): $\ell_{\text{log}} = \log(1 + e^{-y f(x)})$ for labels $y \in \{-1, +1\}$.
Side-by-Side Statement
Support Vector Machine (Hard Margin)
For linearly separable data with labels $y_i \in \{-1, +1\}$:

$$\min_{w,\, b} \ \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(w^\top x_i + b) \geq 1 \ \text{for all } i$$
The decision boundary is $w^\top x + b = 0$, and the margin is $\frac{2}{\|w\|}$. Only the points with $y_i(w^\top x_i + b) = 1$ (support vectors) determine the solution.
Soft-Margin SVM
For non-separable data, introduce slack variables $\xi_i \geq 0$:

$$\min_{w,\, b,\, \xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i \quad \text{subject to} \quad y_i(w^\top x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$
Equivalently, the unconstrained form is:

$$\min_{w,\, b} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \max\big(0,\, 1 - y_i(w^\top x_i + b)\big)$$
Logistic Regression
Model the probability: $P(y = 1 \mid x) = \sigma(w^\top x + b)$, where $\sigma(z) = \frac{1}{1 + e^{-z}}$.
Minimize the negative log-likelihood (log loss):

$$\min_{w,\, b} \ \sum_{i=1}^n \log\left(1 + e^{-y_i(w^\top x_i + b)}\right)$$

for labels $y_i \in \{-1, +1\}$; in practice an L2 penalty $\frac{\lambda}{2}\|w\|^2$ is added.
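Both objectives are easy to evaluate directly. A minimal NumPy sketch, assuming toy data and illustrative values for `w`, `b`, `C`, and the LR penalty `lam` (none of these come from the text above):

```python
import numpy as np

# Toy 2D data with labels in {-1, +1}
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

w, b = np.array([0.8, 0.6]), -0.1   # some candidate hyperplane (illustrative)
margins = y * (X @ w + b)           # y_i (w^T x_i + b)

C, lam = 1.0, 1.0                   # illustrative regularization strengths

# Soft-margin SVM objective: (1/2)||w||^2 + C * sum of hinge losses
svm_obj = 0.5 * w @ w + C * np.maximum(0.0, 1.0 - margins).sum()

# Regularized LR objective: sum of log losses + (lam/2)||w||^2
lr_obj = np.log1p(np.exp(-margins)).sum() + 0.5 * lam * w @ w

print(f"SVM objective: {svm_obj:.4f}")
print(f"LR objective:  {lr_obj:.4f}")
```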
The Loss Functions Compared
The core difference is the loss:
| Margin $y f(x)$ | Hinge loss (SVM) | Log loss (LR) |
|---|---|---|
| $y f(x) < 0$ (misclassified) | $1 - y f(x)$ (linear penalty) | $\approx -y f(x)$ (roughly linear) |
| $y f(x) = 1$ (on the margin) | $0$ | $\log(1 + e^{-1}) \approx 0.31$ |
| $y f(x) \gg 1$ (very correct) | $0$ (exactly) | $\approx e^{-y f(x)}$ (exponentially small) |
The crucial difference: hinge loss is exactly zero once a point is on the correct side of the margin ($y f(x) \geq 1$). Log loss is never exactly zero; it always penalizes, however slightly. This means:
- SVM ignores points far from the boundary. Only support vectors matter.
- LR uses all points, with distant points contributing less.
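The zero-versus-never-zero behavior is easy to verify numerically; a short sketch tabulating both losses as a function of the margin $y f(x)$:

```python
import numpy as np

# Hinge and log loss as functions of the margin m = y * f(x)
hinge = lambda m: np.maximum(0.0, 1.0 - m)
logloss = lambda m: np.log1p(np.exp(-m))

for m in [-2.0, 0.0, 1.0, 2.0, 5.0]:
    print(f"margin {m:+.1f}: hinge = {hinge(m):.4f}, log = {logloss(m):.6f}")

# hinge is exactly 0 for m >= 1; log loss only approaches 0 as m grows
```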
Where Each Is Stronger
SVM wins on sparsity of the solution
The SVM solution depends only on support vectors, the small subset of training points near the boundary. This makes SVM memory-efficient at prediction time and robust to outliers far from the boundary. In the dual formulation, most Lagrange multipliers are zero.
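To see the sparsity concretely, here is a minimal sketch with scikit-learn's `SVC` (the blob dataset and `C=1.0` are illustrative assumptions):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters: most points end up far from the boundary
X, y = make_blobs(n_samples=500, centers=2, cluster_std=1.5, random_state=0)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Only the support vectors (nonzero dual coefficients) define the boundary
print(f"{len(clf.support_)} support vectors out of {len(X)} training points")
```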
LR wins on calibrated probabilities
Logistic regression directly outputs $P(y = 1 \mid x)$, a well-calibrated probability. SVM outputs a decision value with no probabilistic interpretation. You can fit a sigmoid to SVM outputs (Platt scaling), but this is a post-hoc fix and can be poorly calibrated.
If you need to rank predictions by confidence, threshold at non-default values, or combine predictions with other probability estimates, LR is the natural choice.
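A sketch contrasting the two outputs, assuming scikit-learn and synthetic data; `CalibratedClassifierCV` with `method='sigmoid'` performs Platt scaling:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)

# LR: probabilities come directly from the model
lr = LogisticRegression().fit(X, y)
print("LR P(y=1|x):", lr.predict_proba(X[:3])[:, 1])

# SVM: the raw output is an uncalibrated decision value...
svm = SVC(kernel="rbf").fit(X, y)
print("SVM decision values:", svm.decision_function(X[:3]))

# ...which Platt scaling converts into probabilities post hoc
platt = CalibratedClassifierCV(SVC(kernel="rbf"), method="sigmoid", cv=3).fit(X, y)
print("Platt-scaled P(y=1|x):", platt.predict_proba(X[:3])[:, 1])
```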
SVM wins with the kernel trick
The SVM dual formulation depends on data only through inner products $x_i^\top x_j$. Replacing these with a kernel $K(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$ implicitly maps data to a high-dimensional (possibly infinite-dimensional) feature space without computing the mapping $\phi$ explicitly.
Common kernels:
- RBF: $K(x, x') = \exp(-\gamma \|x - x'\|^2)$, infinite-dimensional feature space
- Polynomial: $K(x, x') = (x^\top x' + c)^d$, maps to degree-$d$ polynomial features
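A minimal illustration, assuming scikit-learn; `make_moons` is just a convenient nonlinear toy dataset, and `gamma=2.0` is an arbitrary choice:

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Two interleaved half-moons: not linearly separable
X, y = make_moons(n_samples=400, noise=0.15, random_state=0)

print("Linear LR accuracy:   ", LogisticRegression().fit(X, y).score(X, y))
print("RBF-kernel SVM accuracy:", SVC(kernel="rbf", gamma=2.0).fit(X, y).score(X, y))
```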
Logistic regression can also be kernelized, but it is less natural because the LR objective does not have the same sparse dual structure. In practice, kernel LR is rare.
LR wins on scalability
Modern LR is solved with stochastic gradient descent, which scales linearly in the number of training examples $n$. Kernel SVMs require computing and storing the kernel matrix, which is $O(n^2)$ in memory and up to $O(n^3)$ to solve. For large datasets, this makes kernel SVMs impractical without approximations (e.g., Nyström, random features). Linear SVMs are fast, but then LR is equally fast.
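A sketch of the scalable recipe, assuming a recent scikit-learn (where the SGD log-loss option is spelled `loss='log_loss'`); all hyperparameters are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=50_000, n_features=50, random_state=0)

# LR via SGD: each epoch is O(n); no kernel matrix is ever materialized
lr_sgd = SGDClassifier(loss="log_loss", max_iter=5).fit(X, y)

# Approximate kernel SVM: random features, then a linear hinge-loss model
rbf_approx = make_pipeline(
    RBFSampler(gamma=0.1, n_components=200, random_state=0),
    SGDClassifier(loss="hinge", max_iter=5),
).fit(X, y)

print("SGD-LR accuracy:          ", lr_sgd.score(X, y))
print("Approx-kernel SVM accuracy:", rbf_approx.score(X, y))
```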
Where Each Fails
SVM fails at probability estimation
The raw SVM output $f(x) = w^\top x + b$ is a signed distance to the margin, not a probability. Platt scaling (fitting a sigmoid $P(y = 1 \mid x) \approx \sigma(A f(x) + B)$ to the decision values) can rescue this, but it requires a hold-out set and the resulting probabilities are less reliable than those from LR, especially in the tails.
LR fails with nonlinear boundaries (without feature engineering)
In its base form, LR fits a linear boundary. For nonlinear problems, you must manually construct polynomial or interaction features, use basis expansions, or wrap LR in a neural network. SVM with an RBF kernel handles nonlinearity automatically.
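The manual route for LR might look like this in scikit-learn; `degree=3` is an arbitrary illustrative choice:

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons(n_samples=400, noise=0.15, random_state=0)

# LR only sees a nonlinear boundary if we hand-build nonlinear features
poly_lr = make_pipeline(
    PolynomialFeatures(degree=3),
    LogisticRegression(max_iter=1000),
).fit(X, y)
print("Polynomial-feature LR accuracy:", poly_lr.score(X, y))
```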
SVM fails in multi-class settings (without reduction)
SVMs are natively binary classifiers. Multi-class with $K$ classes requires one-vs-one ($K(K-1)/2$ classifiers) or one-vs-rest ($K$ classifiers), each with its own issues. LR extends naturally to multinomial logistic regression (softmax) with a single, coherent optimization problem.
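In scikit-learn the reduction is mostly hidden (`SVC` uses one-vs-one internally), but the contrast shows in what each model exposes; iris is just a convenient 3-class dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # 3 classes

# LR: one softmax model over all 3 classes
lr = LogisticRegression(max_iter=1000).fit(X, y)
print("LR probabilities sum to 1:", lr.predict_proba(X[:1]).sum())

# SVC: K(K-1)/2 = 3 pairwise binary classifiers under the hood
svm = SVC(decision_function_shape="ovo").fit(X, y)
print("OvO decision values per sample:", svm.decision_function(X[:1]).shape)
```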
Both fail in very high dimensions without regularization
Without regularization, both overfit in high dimensions. With L2 regularization (standard for both), they behave similarly in terms of generalization. The regularized soft-margin SVM with hinge loss and LR with log loss are in fact closely related convex optimization problems.
Key Assumptions That Differ
| | SVM | Logistic Regression |
|---|---|---|
| Loss | Hinge: $\max(0,\, 1 - y f(x))$ | Log: $\log(1 + e^{-y f(x)})$ |
| Output | Decision value (no probability) | Calibrated probability |
| Sparsity | Support vectors only | All data points contribute |
| Kernel trick | Natural (dual depends on inner products) | Possible but rare |
| Multi-class | Requires reduction (OvO, OvR) | Native (softmax) |
| Optimization | Quadratic program | Convex, smooth; use SGD |
| Probabilistic model | No (discriminative geometry) | Yes (conditional $P(y \mid x)$) |
What to Memorize
- SVM = hinge loss + margin maximization + sparse support vectors
- LR = log loss + maximum likelihood + calibrated probabilities
- Kernel trick is natural for SVM, awkward for LR
- Hinge loss is zero for $y f(x) \geq 1$; log loss is never zero
- Decision rule: Need probabilities? Use LR. Need nonlinear boundary without feature engineering? Use kernel SVM. Need to scale to millions of examples? Use LR (or linear SVM).
- With L2 regularization, both are convex and produce similar decision boundaries in the linear case.
When a Researcher Would Use Each
Medical diagnosis with calibrated risk scores
A hospital needs not just a yes/no diagnosis but a risk probability: "this patient has a 73% chance of diabetes." Use logistic regression. The calibrated output can be communicated directly to clinicians and combined with other risk factors in a transparent way.
Image classification with complex boundaries (pre-deep learning)
You have image features (SIFT, HOG) and need a nonlinear classifier. Use SVM with RBF kernel. Before deep learning, kernel SVMs were a standard choice for image classification (e.g., among the top-performing methods on MNIST in the 2000s). The kernel handles the nonlinearity without manually designing feature expansions.
Text classification at scale
You have millions of documents with bag-of-words features (high-dimensional, sparse). Use logistic regression with SGD. Linear SVMs also work well here (liblinear), and the two give very similar results. LR is preferred because it gives probabilities and extends naturally to multi-class.
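A sketch of that pipeline, assuming scikit-learn; the four-document corpus is obviously a toy stand-in:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

docs = ["cheap pills online", "meeting at noon", "win money now", "lunch tomorrow?"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham (toy labels)

# Hashing avoids storing a vocabulary; SGD streams through examples
clf = make_pipeline(
    HashingVectorizer(n_features=2**18),
    SGDClassifier(loss="log_loss"),
).fit(docs, labels)

print(clf.predict(["free money pills", "see you at lunch"]))
```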
Common Confusions
SVMs are not inherently better for small datasets
A common claim is that SVMs work better on small datasets because they use only support vectors. This is misleading. SVMs and LR with the same regularization often give very similar performance. The advantage of SVMs on small data comes primarily from the kernel trick providing a richer implicit feature space, not from the hinge loss per se.
Regularization makes them more similar than different
L2-regularized hinge loss (SVM) and L2-regularized log loss (LR) are both convex objectives on the same hypothesis class. Their decision boundaries are often nearly identical. The practical differences lie in outputs (probability vs. decision value), not in boundary location.
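One way to check the "nearly identical boundaries" claim is to compare the fitted weight vectors up to scale (the two losses scale $w$ differently, so direction is what matters); a sketch assuming scikit-learn and synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

w_lr = LogisticRegression(C=1.0).fit(X, y).coef_.ravel()
w_svm = LinearSVC(C=1.0, max_iter=5000).fit(X, y).coef_.ravel()

# The direction of the hyperplane normal is what fixes the boundary
cos = w_lr @ w_svm / (np.linalg.norm(w_lr) * np.linalg.norm(w_svm))
print(f"cosine similarity of weight vectors: {cos:.4f}")  # typically close to 1
```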
The max-margin property does not imply better generalization in general
The margin-based generalization bounds for SVMs are appealing, but they apply to the specific hypothesis class defined by the kernel. LR has its own generalization bounds via the log loss. There is no universal theorem saying max-margin always generalizes better than max-likelihood.