
Comparison

SVM vs. Logistic Regression

SVMs maximize the margin using hinge loss and produce sparse support vectors; logistic regression maximizes likelihood using log loss and produces calibrated probabilities. SVMs handle nonlinearity via the kernel trick; LR needs explicit feature engineering.

What Each Measures

Both SVMs and logistic regression produce linear (or nonlinear) decision boundaries for binary classification. They differ in what they optimize and what they output.

SVM finds the hyperplane that maximizes the margin, the distance between the decision boundary and the nearest training points. It minimizes the hinge loss $\ell(z) = \max(0, 1 - z)$, where $z = y \cdot f(x)$.

Logistic Regression finds the hyperplane that maximizes the likelihood of the data under a logistic model. It minimizes the log loss (cross-entropy) $\ell(z) = \log(1 + e^{-z})$.
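To make the comparison concrete, here is a small NumPy sketch (illustrative only) evaluating both losses at a few margin values:

```python
import numpy as np

def hinge_loss(z):
    # SVM hinge loss: exactly zero once the margin z = y*f(x) reaches 1
    return np.maximum(0.0, 1.0 - z)

def log_loss(z):
    # Logistic (cross-entropy) loss on the margin; never exactly zero
    return np.log1p(np.exp(-z))

z = np.array([-2.0, 0.0, 1.0, 5.0])
print(hinge_loss(z))  # [3. 1. 0. 0.]
print(log_loss(z))    # roughly [2.13 0.69 0.31 0.007]
```

Note how the hinge loss hits exactly zero at $z = 1$ while the log loss only decays toward zero.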

Side-by-Side Statement

Definition

Support Vector Machine (Hard Margin)

For linearly separable data with labels $y_i \in \{-1, +1\}$:

$$\min_{w, b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) \geq 1 \; \forall i$$

The decision boundary is $w^T x + b = 0$, and the margin is $2 / \|w\|$. Only the points with $y_i(w^T x_i + b) = 1$ (support vectors) determine the solution.

Definition

Soft-Margin SVM

For non-separable data, introduce slack variables $\xi_i \geq 0$:

$$\min_{w, b, \xi} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i \quad \text{s.t.} \quad y_i(w^T x_i + b) \geq 1 - \xi_i$$

Equivalently, the unconstrained form is:

$$\min_{w, b} \frac{1}{n}\sum_{i=1}^n \max(0, 1 - y_i(w^T x_i + b)) + \frac{\lambda}{2}\|w\|^2$$
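The unconstrained form above can be minimized directly by subgradient descent. A minimal sketch on hypothetical toy data (labels in $\{-1, +1\}$, tiny Gaussian blobs for illustration):

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    # Subgradient descent on the unconstrained soft-margin objective:
    # (1/n) sum_i max(0, 1 - y_i(w.x_i + b)) + (lam/2)||w||^2
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1  # only margin-violating points have a hinge subgradient
        grad_w = lam * w - (y[active][:, None] * X[active]).sum(axis=0) / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy separable data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)
w, b = train_linear_svm(X, y)
print(np.mean(np.sign(X @ w + b) == y))  # training accuracy
```

Only the `active` points (those inside or violating the margin) contribute to the gradient, which is the loss-level view of why the solution depends only on support vectors.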

Definition

Logistic Regression

Model the probability $P(y = 1 \mid x) = \sigma(w^T x + b)$, where $\sigma(z) = 1/(1 + e^{-z})$.

Minimize negative log-likelihood (log loss):

$$\min_{w, b} \frac{1}{n}\sum_{i=1}^n \log(1 + e^{-y_i(w^T x_i + b)}) + \frac{\lambda}{2}\|w\|^2$$
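For symmetry with the SVM sketch, the same objective can be minimized by gradient descent, again on hypothetical toy data with labels in $\{-1, +1\}$:

```python
import numpy as np

def train_logreg(X, y, lam=0.01, lr=0.5, epochs=500):
    # Gradient descent on (1/n) sum_i log(1 + exp(-y_i(w.x_i + b))) + (lam/2)||w||^2
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        z = np.clip(y * (X @ w + b), -500, 500)
        s = 1.0 / (1.0 + np.exp(z))     # sigmoid(-z): every point gets nonzero weight
        grad_w = lam * w - (s * y) @ X / n
        grad_b = -(s * y).sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict_proba(X, w, b):
    # The payoff: a calibrated probability P(y = +1 | x) = sigmoid(w.x + b)
    return 1.0 / (1.0 + np.exp(-np.clip(X @ w + b, -500, 500)))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)
w, b = train_logreg(X, y)
print(np.mean((predict_proba(X, w, b) > 0.5) == (y == 1)))
```

Unlike the hinge subgradient, every point contributes a nonzero gradient weight `s`, however small: there is no "inactive" set.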

The Loss Functions Compared

The core difference is the loss:

| Margin $z = y \cdot f(x)$ | Hinge loss (SVM) | Log loss (LR) |
|---|---|---|
| $z < 0$ (misclassified) | $1 - z$ (linear penalty) | $\log(1 + e^{-z})$ (roughly $-z$ for very negative $z$) |
| $z = 1$ (on the margin) | $0$ | $\log(1 + e^{-1}) \approx 0.31$ |
| $z \gg 1$ (very correct) | $0$ (exactly) | $e^{-z} \approx 0$ (exponentially small) |

The crucial difference: hinge loss is exactly zero once a point is on the correct side of the margin ($z \geq 1$). Log loss is never exactly zero; it always penalizes, however slightly. This means points well beyond the margin have no influence on the SVM solution, while every point, no matter how confidently classified, contributes to the LR solution.

Where Each Is Stronger

SVM wins on sparsity of the solution

The SVM solution depends only on support vectors, the small subset of training points near the boundary. This makes SVM memory-efficient at prediction time and robust to outliers far from the boundary. In the dual formulation, most Lagrange multipliers are zero.

LR wins on calibrated probabilities

Logistic regression directly outputs $P(y = 1 \mid x) = \sigma(w^T x + b)$, a well-calibrated probability. SVM outputs a decision value $w^T x + b$ with no probabilistic interpretation. You can fit a sigmoid to SVM outputs (Platt scaling), but this is a post-hoc fix and can be poorly calibrated.

If you need to rank predictions by confidence, threshold at non-default values, or combine predictions with other probability estimates, LR is the natural choice.
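The contrast shows up directly in scikit-learn: LR exposes `predict_proba` natively, while a linear SVM must be wrapped in post-hoc sigmoid calibration. A sketch on synthetic data (dataset and hyperparameters are illustrative):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

# LR gives probabilities directly
lr = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
p_lr = lr.predict_proba(Xte)[:, 1]

# A linear SVM needs post-hoc Platt scaling ("sigmoid" calibration) on held-out folds
svm = CalibratedClassifierCV(LinearSVC(max_iter=10000), method="sigmoid", cv=3)
p_svm = svm.fit(Xtr, ytr).predict_proba(Xte)[:, 1]
print(p_lr[:3], p_svm[:3])
```

Note that the calibrated SVM spends part of the training data on fitting the sigmoid, which is exactly the "hold-out set" cost discussed below.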

SVM wins with the kernel trick

The SVM dual formulation depends on the data only through inner products $x_i^T x_j$. Replacing these with a kernel $K(x_i, x_j)$ implicitly maps the data to a high-dimensional (possibly infinite-dimensional) feature space without computing the mapping explicitly.

Common kernels:

  - Linear: $K(x, x') = x^T x'$
  - Polynomial: $K(x, x') = (x^T x' + c)^d$
  - RBF (Gaussian): $K(x, x') = \exp(-\gamma \|x - x'\|^2)$

Logistic regression can also be kernelized, but it is less natural because the LR objective does not have the same sparse dual structure. In practice, kernel LR is rare.
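A quick sketch of the kernel trick in action: on the two-moons dataset (an assumed toy example), an RBF SVM learns the curved boundary that a linear LR cannot represent.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.15, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

rbf_svm = SVC(kernel="rbf", gamma=2.0).fit(Xtr, ytr)   # nonlinear boundary for free
lin_lr = LogisticRegression().fit(Xtr, ytr)            # stuck with a straight line

print(rbf_svm.score(Xte, yte), lin_lr.score(Xte, yte))
print(len(rbf_svm.support_))  # only the support vectors define the solution
```

`support_` also illustrates the sparsity point above: the fitted model is stored as a subset of the training points plus their dual coefficients.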

LR wins on scalability

Modern LR is solved with stochastic gradient descent, which scales linearly in $n$. Kernel SVMs require computing and storing the $n \times n$ kernel matrix, which is $O(n^2)$ in memory and up to $O(n^3)$ to solve. For large datasets, this makes kernel SVMs impractical without approximations (e.g., Nyström, random features). Linear SVMs are fast, but then LR is equally fast.

Where Each Fails

SVM fails at probability estimation

The raw SVM output $f(x) = w^T x + b$ is proportional to the signed distance from the boundary, not a probability. Platt scaling (fitting a sigmoid $P(y=1 \mid x) = 1/(1 + e^{-A f(x) - B})$) can rescue this, but it requires a held-out set, and the resulting probabilities are less reliable than those from LR, especially in the tails.

LR fails with nonlinear boundaries (without feature engineering)

In its base form, LR fits a linear boundary. For nonlinear problems, you must manually construct polynomial or interaction features, use basis expansions, or wrap LR in a neural network. SVM with an RBF kernel handles nonlinearity automatically.
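The feature-engineering workaround can be sketched in one pipeline: an explicit polynomial basis expansion lets a linear LR fit the curved two-moons boundary (degree and dataset are illustrative choices):

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons(n_samples=500, noise=0.15, random_state=0)

# Manually expand features to degree-3 monomials, then fit a *linear* LR on them
poly_lr = make_pipeline(PolynomialFeatures(degree=3),
                        LogisticRegression(max_iter=2000))
print(poly_lr.fit(X, y).score(X, y))
```

This is the explicit-map counterpart of what the RBF kernel does implicitly; the cost is that you must choose and materialize the expansion yourself.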

SVM fails in multi-class settings (without reduction)

SVMs are natively binary classifiers. Multi-class requires one-vs-one ($\binom{k}{2}$ classifiers) or one-vs-rest ($k$ classifiers), each with its own issues. LR extends naturally to multinomial logistic regression (softmax) with a single, coherent optimization problem.
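A sketch of the multinomial case: one softmax LR model over all three iris classes, producing a single coherent probability distribution per example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# One model, one objective: softmax probabilities over all 3 classes at once
clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X[:1])
print(proba, proba.sum())  # one row of 3 probabilities summing to 1
```

An SVM on the same data would internally fit $\binom{3}{2} = 3$ one-vs-one classifiers and vote, with no single joint distribution over classes.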

Both fail in very high dimensions without regularization

Without regularization, both overfit in high dimensions. With L2 regularization (standard for both), they behave similarly in terms of generalization. The regularized soft-margin SVM with hinge loss and LR with log loss are in fact closely related convex optimization problems.

Key Assumptions That Differ

| | SVM | Logistic Regression |
|---|---|---|
| Loss | Hinge: $\max(0, 1-z)$ | Log: $\log(1+e^{-z})$ |
| Output | Decision value (no probability) | Calibrated probability |
| Sparsity | Support vectors only | All data points contribute |
| Kernel trick | Natural (dual depends on inner products) | Possible but rare |
| Multi-class | Requires reduction (OvO, OvR) | Native (softmax) |
| Optimization | Quadratic program | Convex, smooth; use SGD |
| Probabilistic model | No (discriminative geometry) | Yes (conditional $P(y \mid x)$) |

What to Memorize

  1. SVM = hinge loss + margin maximization + sparse support vectors
  2. LR = log loss + maximum likelihood + calibrated probabilities
  3. Kernel trick is natural for SVM, awkward for LR
  4. Hinge loss is zero for $z \geq 1$; log loss is never zero
  5. Decision rule: Need probabilities? Use LR. Need nonlinear boundary without feature engineering? Use kernel SVM. Need to scale to millions of examples? Use LR (or linear SVM).
  6. With L2 regularization, both are convex and produce similar decision boundaries in the linear case.

When a Researcher Would Use Each

Example

Medical diagnosis with calibrated risk scores

A hospital needs not just a yes/no diagnosis but a risk probability: "this patient has a 73% chance of diabetes." Use logistic regression. The calibrated output can be communicated directly to clinicians and combined with other risk factors in a transparent way.

Example

Image classification with complex boundaries (pre-deep learning)

You have image features (SIFT, HOG) and need a nonlinear classifier. Use SVM with an RBF kernel. Before deep learning, kernel SVMs on hand-crafted features were a standard choice for image classification and produced some of the strongest benchmark results of the 2000s. The kernel handles nonlinearity without manually designing a richer feature map.

Example

Text classification at scale

You have millions of documents with bag-of-words features (high-dimensional, sparse). Use logistic regression with SGD. Linear SVMs also work well here (liblinear), and the two give very similar results. LR is preferred because it gives probabilities and extends naturally to multi-class.
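A minimal sketch of that pipeline, on a hypothetical four-document corpus (the documents, labels, and query below are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; a real system would stream millions of documents
docs = ["cheap pills buy now", "meeting agenda attached",
        "win money fast today", "quarterly project status update"]
labels = [1, 0, 1, 0]  # 1 = spam

# Sparse bag-of-words features -> linear LR with probabilistic output
clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(docs, labels)
print(clf.predict_proba(["buy cheap pills"])[0])  # [P(ham), P(spam)]
```

Swapping `LogisticRegression` for `LinearSVC` in the same pipeline would give a very similar decision boundary but lose `predict_proba`.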

Common Confusions

Watch Out

SVMs are not inherently better for small datasets

A common claim is that SVMs work better on small datasets because they use only support vectors. This is misleading. SVMs and LR with the same regularization often give very similar performance. The advantage of SVMs on small data comes primarily from the kernel trick providing a richer implicit feature space, not from the hinge loss per se.

Watch Out

Regularization makes them more similar than different

L2-regularized hinge loss (SVM) and L2-regularized log loss (LR) are both convex objectives on the same hypothesis class. Their decision boundaries are often nearly identical. The practical differences lie in outputs (probability vs. decision value), not in boundary location.
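The near-identity of the boundaries is easy to check: compare the direction of the learned weight vectors on synthetic data (dataset parameters are illustrative). A cosine similarity near 1 means the two hyperplanes point the same way.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)

w_svm = LinearSVC(max_iter=10000).fit(X, y).coef_.ravel()
w_lr = LogisticRegression(max_iter=1000).fit(X, y).coef_.ravel()

# Cosine similarity between the two boundary normals; typically very close to 1
cos = w_svm @ w_lr / (np.linalg.norm(w_svm) * np.linalg.norm(w_lr))
print(cos)
```

The weight magnitudes differ (they depend on the loss and regularization strength), but the boundary direction, which is what determines the classification, is nearly shared.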

Watch Out

The max-margin property does not guarantee better generalization

The margin-based generalization bounds for SVMs are appealing, but they apply to the specific hypothesis class defined by the kernel. LR has its own generalization bounds via the log loss. There is no universal theorem saying max-margin always generalizes better than max-likelihood.