
AI Safety

Adversarial Machine Learning

Attacks across the ML lifecycle: evasion (adversarial examples), poisoning (corrupt training data), model extraction (steal via queries), privacy leakage (membership inference), and LLM jailbreaks. Why defenses are hard.


Why This Matters

ML models are deployed in safety-critical systems: autonomous vehicles, medical diagnosis, content moderation, fraud detection. Adversarial machine learning studies how attackers can manipulate these systems and what defenders can do about it.

The field began with a startling discovery: adding tiny, imperceptible perturbations to images can cause state-of-the-art classifiers to misclassify with high confidence. Since then, the threat model has expanded far beyond image perturbations to encompass the entire ML lifecycle.

NIST published the Adversarial Machine Learning taxonomy (NIST AI 100-2) to standardize how we think about these threats. Understanding this taxonomy is essential for anyone deploying ML in production.

Mental Model

Think of adversarial ML as a game between an attacker and a defender. The attacker has some knowledge of the model (white-box, black-box, or gray-box) and some capability (perturb inputs, corrupt training data, query the API). The defender wants the model to behave correctly despite the attacker's efforts.

The key insight: standard ML optimizes for average-case performance. Security requires worst-case guarantees. This gap is why adversarial ML is hard.

Formal Setup and Notation

Let $f_\theta: \mathcal{X} \to \mathcal{Y}$ be a classifier with parameters $\theta$, let $\ell$ be the loss function, and let $(x, y)$ be a correctly classified input-label pair.

Definition

Adversarial Example

An adversarial example is a perturbed input $x' = x + \delta$ such that:

$$f_\theta(x') \neq y \quad \text{and} \quad \|\delta\|_p \leq \epsilon$$

The perturbation $\delta$ is bounded in some $\ell_p$ norm (typically $\ell_\infty$ or $\ell_2$) so that $x'$ is perceptually similar to $x$.

Definition

Adversarial Risk

The adversarial risk (or robust risk) of a classifier $f_\theta$ is:

$$R_{\text{adv}}(\theta) = \mathbb{E}_{(x,y) \sim \mathcal{D}} \left[ \max_{\|\delta\|_p \leq \epsilon} \ell(f_\theta(x + \delta), y) \right]$$

This replaces the standard expectation with a worst-case inner maximization over the perturbation set.

Definition

Threat Model

A threat model specifies: (1) the attacker's goal (misclassification, targeted misclassification, denial of service), (2) the attacker's knowledge (white-box, black-box, gray-box), (3) the attacker's capability (what they can modify and by how much).

NIST AML Taxonomy

The NIST taxonomy organizes attacks by the ML lifecycle stage they target:

Evasion attacks (inference time): modify inputs to fool a deployed model. This includes adversarial examples for classifiers and jailbreaks for LLMs.

Poisoning attacks (training time): inject malicious data into the training set to degrade model performance or plant backdoors.

Privacy attacks (post-deployment): extract sensitive information about the training data or the model itself.

Abuse attacks (misuse): use the model for purposes it was not designed for, such as generating harmful content.

Evasion Attacks in Detail

Definition

Fast Gradient Sign Method (FGSM)

The simplest evasion attack. Compute the gradient of the loss with respect to the input and step in the sign direction:

$$x' = x + \epsilon \cdot \text{sign}(\nabla_x \ell(f_\theta(x), y))$$

This is a single-step, $\ell_\infty$-bounded attack. It is fast but weak compared to iterative methods.
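To make the update concrete, here is a minimal NumPy sketch for a logistic model, where the input gradient of the cross-entropy loss has the closed form $(p - y)\,w$. The function name and toy values are illustrative, not from any library.

```python
import numpy as np

def fgsm_linear(x, y, w, b, eps):
    """Single-step FGSM for a logistic model p(y=1|x) = sigmoid(w.x + b).

    The input gradient of cross-entropy loss is (p - y) * w, so the
    attack steps eps in the sign of that gradient.
    """
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))  # predicted probability
    grad = (p - y) * w                      # d(loss)/dx, computed analytically
    return x + eps * np.sign(grad)

# Toy model and a correctly classified point (logit 1.5 > 0, class 1)
w = np.array([2.0, -1.0])
b = 0.0
x = np.array([1.0, 0.5])
x_adv = fgsm_linear(x, 1.0, w, b, eps=1.0)
print(w @ x_adv + b)  # logit pushed below zero: misclassified
```

For a linear model the sign of the gradient never changes with the input, which is why a single step already suffices here; for deep networks the loss surface is curved and iterative attacks are stronger.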

Definition

Projected Gradient Descent (PGD) Attack

An iterative refinement of FGSM. Starting from $x_0 = x$, repeat:

$$x_{t+1} = \Pi_{\mathcal{B}_\epsilon(x)} \left( x_t + \alpha \cdot \text{sign}(\nabla_{x_t} \ell(f_\theta(x_t), y)) \right)$$

where $\Pi_{\mathcal{B}_\epsilon(x)}$ projects back onto the $\epsilon$-ball around $x$. PGD with random restarts is considered a strong first-order attack.
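The iteration can be sketched in NumPy for a simple logistic model with an analytic input gradient; `pgd_linear` and the toy parameters are illustrative, not a production implementation.

```python
import numpy as np

def pgd_linear(x, y, w, b, eps, alpha, steps):
    """PGD in the l_inf ball, mirroring
    x_{t+1} = Proj(x_t + alpha * sign(grad))."""
    x_t = x.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w @ x_t + b)))  # model probability
        grad = (p - y) * w                        # analytic input gradient
        x_t = x_t + alpha * np.sign(grad)         # gradient-sign step
        x_t = np.clip(x_t, x - eps, x + eps)      # project onto the eps-ball
    return x_t

w = np.array([2.0, -1.0])
b = 0.0
x = np.array([1.0, 0.5])                          # logit 1.5: class 1
x_adv = pgd_linear(x, 1.0, w, b, eps=0.6, alpha=0.2, steps=5)
print(w @ x_adv + b)                              # logit driven negative
```

Note that the clip implements the projection onto the $\ell_\infty$ ball; a step size $\alpha < \epsilon$ with several steps lets the attack explore the ball rather than jumping to a corner in one move.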

The Carlini-Wagner (C&W) attack reformulates adversarial example generation as an optimization problem with a different objective. It often finds smaller perturbations than PGD but is more computationally expensive.

Transferability: adversarial examples generated for one model often fool other models trained on similar data. This enables black-box attacks: generate adversarial examples on a surrogate model and apply them to the target.

Poisoning Attacks

Data poisoning corrupts the training set. The attacker injects or modifies a fraction of training examples to either degrade overall accuracy or cause specific targeted misclassifications.

Backdoor attacks (trojan attacks) plant a trigger pattern. The model behaves normally on clean inputs but misclassifies any input containing the trigger. For example, a small patch in the corner of an image could cause the model to always predict a target class.

The poisoning rate needed can be surprisingly small. In some settings, poisoning less than 1% of the training data suffices to plant a reliable backdoor.
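The trigger-planting step itself is simple; the following NumPy sketch stamps a corner patch on 1% of a toy image set and relabels those examples. The function name, patch size, and data shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def poison(images, labels, rate, target_class, trigger_value=1.0):
    """Plant a backdoor: stamp a 2x2 trigger patch in the corner of a
    small fraction of training images and relabel them as target_class."""
    images = images.copy()
    labels = labels.copy()
    n_poison = max(1, int(rate * len(images)))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    images[idx, :2, :2] = trigger_value   # corner trigger patch
    labels[idx] = target_class            # flip label to attacker's target
    return images, labels, idx

images = rng.random((1000, 8, 8))
labels = rng.integers(0, 10, size=1000)
pois_x, pois_y, idx = poison(images, labels, rate=0.01, target_class=7)
print(len(idx))  # 10 poisoned examples: 1% of the training set
```

A model trained on the poisoned set learns to associate the patch with the target class; at deployment, any input carrying the patch is misclassified while clean accuracy stays essentially unchanged.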

Model Extraction and Privacy

Model extraction (model stealing): the attacker queries a black-box model API and uses the responses to train a functionally equivalent copy. This threatens intellectual property and also enables white-box attacks on the copy.

Membership inference: given a model and a data point, determine whether that point was in the training set. This is a privacy violation. Overfit models are particularly vulnerable because they memorize training data.
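A common baseline is the loss-threshold attack: guess "member" when the model's loss on a point is suspiciously low. The sketch below simulates this with synthetic loss distributions; the distributions and threshold are illustrative assumptions, not measurements from a real model.

```python
import numpy as np

def membership_score(losses, threshold):
    """Loss-threshold membership inference: points whose loss falls
    below the threshold are guessed to be training members."""
    return losses < threshold

# Simulated losses: members (seen in training) tend to have lower loss
rng = np.random.default_rng(1)
member_losses = rng.exponential(0.1, size=500)     # memorized -> low loss
nonmember_losses = rng.exponential(1.0, size=500)  # unseen -> higher loss

threshold = 0.3
tpr = membership_score(member_losses, threshold).mean()
fpr = membership_score(nonmember_losses, threshold).mean()
print(tpr, fpr)  # attack succeeds when tpr is much larger than fpr
```

The attack works exactly to the extent that the member and non-member loss distributions separate, which is why overfit (memorizing) models are the most vulnerable.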

Training data extraction: for large language models, carefully crafted prompts can cause the model to regurgitate memorized training data, including personally identifiable information.

LLM Jailbreaks

Jailbreaks are evasion attacks adapted for language models with safety training. The attacker crafts prompts that bypass safety filters and alignment training to produce harmful outputs.

Categories include: role-playing attacks (pretending the model is an unrestricted AI), encoding attacks (using unusual character encodings or languages), multi-turn attacks (gradually escalating across a conversation), and optimization-based attacks (using gradient-based search over token sequences).

Main Theorems

Theorem

Robustness-Accuracy Tradeoff

Statement

For certain natural data distributions, any classifier achieving adversarial robustness within an $\ell_2$ ball of radius $\epsilon$ must suffer a drop in standard (clean) accuracy. Specifically, for a $d$-dimensional Gaussian mixture model with means $\pm \mu$ and identity covariance, the optimal robust classifier has standard accuracy:

$$\text{acc}_{\text{std}} = \Phi\left(\|\mu\| - \epsilon\right)$$

where $\Phi$ is the standard normal CDF, compared to $\Phi(\|\mu\|)$ for the optimal non-robust classifier.

Intuition

Robustness requires the classifier to maintain correct predictions even when inputs are pushed toward the decision boundary. This effectively shrinks the margin by $\epsilon$ in every direction, which in high dimensions causes a meaningful accuracy drop. The tradeoff is inherent in the geometry of the problem, not a limitation of any particular defense.
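Plugging illustrative numbers into the formula makes the drop concrete; the value of $\|\mu\|$ below is arbitrary, chosen only to show the trend.

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

mu_norm = 2.0  # ||mu|| for the Gaussian mixture (illustrative value)
for eps in (0.0, 0.5, 1.0):
    # Standard accuracy of the optimal eps-robust classifier: Phi(||mu|| - eps)
    print(eps, round(Phi(mu_norm - eps), 3))
```

With $\|\mu\| = 2$, clean accuracy falls from about 97.7% at $\epsilon = 0$ to about 84.1% at $\epsilon = 1$: each unit of robustness radius costs a full standard deviation of margin.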

Proof Sketch

In the Gaussian mixture model, the optimal robust classifier must correctly classify all points in the $\epsilon$-expanded region. The Bayes-optimal robust classifier is a shifted version of the Bayes-optimal standard classifier, with the decision boundary moved by $\epsilon$ toward the true mean. Computing the probability of correct classification under this shifted boundary yields the stated result.

Why It Matters

This theorem explains why proposed defenses against adversarial examples pay an accuracy cost. It is not that defenses are poorly designed; the tradeoff is fundamental. Practitioners must decide how much clean accuracy they are willing to sacrifice for robustness.

Failure Mode

The theorem is proven for specific distributional assumptions. Real data distributions may have different tradeoff curves. There is active research on whether the tradeoff can be mitigated with more data or better representations.

Defenses

Adversarial training augments the training set with adversarial examples. The objective becomes:

$$\min_\theta \mathbb{E}_{(x,y)} \left[ \max_{\|\delta\|_p \leq \epsilon} \ell(f_\theta(x + \delta), y) \right]$$

This is the most effective empirical defense but is computationally expensive (requires running PGD at every training step) and incurs the accuracy tradeoff described above. In multi-agent settings, robust adversarial policies extend this idea to game-theoretic formulations where agents must be robust to adversarial opponents.

Certified defenses provide provable guarantees that no perturbation within the threat model can change the prediction. Randomized smoothing is the most scalable certified defense: classify $x$ by taking a majority vote over predictions on Gaussian-perturbed copies of $x$.
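The voting step can be sketched in a few lines; `smoothed_predict` and the toy base classifier below are illustrative (a real certified defense would also compute a confidence bound on the vote to derive the certified radius).

```python
import numpy as np

def smoothed_predict(classify, x, sigma, n_samples, rng):
    """Randomized smoothing: majority vote over Gaussian-perturbed
    copies of x. `classify` maps an input to a class label."""
    votes = {}
    for _ in range(n_samples):
        label = classify(x + rng.normal(0.0, sigma, size=x.shape))
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

# Toy base classifier: predict 1 iff the first coordinate is positive
classify = lambda x: int(x[0] > 0)
rng = np.random.default_rng(2)
x = np.array([0.8, 0.0])
pred = smoothed_predict(classify, x, sigma=0.25, n_samples=100, rng=rng)
print(pred)
```

Because the input sits well inside the class-1 region relative to the noise scale, the vote is nearly unanimous; larger $\sigma$ buys a larger certifiable radius at the cost of blurring the base classifier's decisions.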

Input preprocessing (denoising, compression) and detection methods try to filter adversarial inputs before classification. These have repeatedly been broken by adaptive attacks.

Common Confusions

Watch Out

Adversarial robustness is not the same as input validation

Input validation checks that inputs are well-formed (correct format, in expected range). Adversarial robustness handles inputs that look legitimate but are designed to cause misclassification. A valid JPEG image can be an adversarial example.

Watch Out

Defense evaluation requires adaptive attacks

Many published defenses were broken because they were only tested against non-adaptive attacks (attacks not designed for the specific defense). A defense must be evaluated against an attacker who knows the defense mechanism and adapts accordingly. Obfuscated gradients, for example, can make gradient-based attacks fail without providing real robustness.

Summary

  • Adversarial examples exist for all ML models, not just neural networks
  • The NIST taxonomy: evasion, poisoning, privacy, abuse
  • PGD attack: iterative FGSM with projection, the standard strong attack
  • Robustness-accuracy tradeoff: fundamental, not an artifact of bad defenses
  • Adversarial training: effective but expensive and accuracy-reducing
  • Always evaluate defenses against adaptive attackers
  • LLM jailbreaks are the evasion attack analog for language models

Exercises

ExerciseCore

Problem

Compute the FGSM perturbation for a linear classifier $f(x) = w^\top x + b$ with squared loss $\ell(f(x), y) = (f(x) - y)^2$. What direction does the perturbation point in, and why does this make geometric sense?

ExerciseAdvanced

Problem

Explain why black-box adversarial attacks work despite the attacker having no access to the target model's gradients. What property of adversarial examples enables this?

ExerciseResearch

Problem

The robustness-accuracy tradeoff theorem shows a fundamental tension in the Gaussian mixture setting. Could additional unlabeled data help reduce this tradeoff? Discuss the theoretical arguments and empirical evidence.

References

Canonical:

  • Goodfellow, Shlens & Szegedy, "Explaining and Harnessing Adversarial Examples" (2015)
  • Madry et al., "Towards Deep Learning Models Resistant to Adversarial Attacks" (2018)

Current:

  • NIST AI 100-2, "Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations" (2024)
  • Tsipras et al., "Robustness May Be at Odds with Accuracy" (2019)
  • Carlini et al., "On Evaluating Adversarial Robustness" (2019)


Last reviewed: April 2026