Adversarial Machine Learning
Attacks across the ML lifecycle: evasion (adversarial examples), poisoning (corrupt training data), model extraction (steal via queries), privacy leakage (membership inference), and LLM jailbreaks. Why defenses are hard.
Why This Matters
ML models are deployed in safety-critical systems: autonomous vehicles, medical diagnosis, content moderation, fraud detection. Adversarial machine learning studies how attackers can manipulate these systems and what defenders can do about it.
The field began with a startling discovery: adding tiny, imperceptible perturbations to images can cause state-of-the-art classifiers to misclassify with high confidence. Since then, the threat model has expanded far beyond image perturbations to encompass the entire ML lifecycle.
NIST published the Adversarial Machine Learning taxonomy (NIST AI 100-2) to standardize how we think about these threats. Understanding this taxonomy is essential for anyone deploying ML in production.
Mental Model
Think of adversarial ML as a game between an attacker and a defender. The attacker has some knowledge of the model (white-box, black-box, or gray-box) and some capability (perturb inputs, corrupt training data, query the API). The defender wants the model to behave correctly despite the attacker's efforts.
The key insight: standard ML optimizes for average-case performance. Security requires worst-case guarantees. This gap is why adversarial ML is hard.
Formal Setup and Notation
Let $f_\theta$ be a classifier with parameters $\theta$. Let $\ell(f_\theta(x), y)$ be the loss function. Let $(x, y)$ be a correctly classified input-label pair.
Adversarial Example
An adversarial example is a perturbed input $x' = x + \delta$ such that:

$$f_\theta(x') \neq y \quad \text{subject to} \quad \|\delta\| \leq \epsilon$$

The perturbation $\delta$ is bounded in some norm (typically $\ell_\infty$ or $\ell_2$) so that $x'$ is perceptually similar to $x$.
Adversarial Risk
The adversarial risk (or robust risk) of a classifier $f_\theta$ is:

$$R_{\text{adv}}(f_\theta) = \mathbb{E}_{(x,y) \sim \mathcal{D}} \left[ \max_{\|\delta\| \leq \epsilon} \ell(f_\theta(x + \delta), y) \right]$$

This replaces the standard expectation of the loss with a worst-case inner maximization over the perturbation set.
Threat Model
A threat model specifies: (1) the attacker's goal (misclassification, targeted misclassification, denial of service), (2) the attacker's knowledge (white-box, black-box, gray-box), (3) the attacker's capability (what they can modify and by how much).
NIST AML Taxonomy
The NIST taxonomy organizes attacks by the ML lifecycle stage they target:
Evasion attacks (inference time): modify inputs to fool a deployed model. This includes adversarial examples for classifiers and jailbreaks for LLMs.
Poisoning attacks (training time): inject malicious data into the training set to degrade model performance or plant backdoors.
Privacy attacks (post-deployment): extract sensitive information about the training data or the model itself.
Abuse attacks (misuse): use the model for purposes it was not designed for, such as generating harmful content.
Evasion Attacks in Detail
Fast Gradient Sign Method (FGSM)
The simplest evasion attack. Compute the gradient of the loss with respect to the input and step in the sign direction:

$$x' = x + \epsilon \cdot \text{sign}\left(\nabla_x \ell(f_\theta(x), y)\right)$$

This is a single-step, $\ell_\infty$-bounded attack. It is fast but weak compared to iterative methods.
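As a concrete sketch, here is FGSM against a toy logistic-regression classifier (the weights and numbers are invented for illustration, not from any benchmark). For this model the input gradient of the cross-entropy loss has the closed form $(p - y)\,w$, so no autodiff is needed:

```python
import numpy as np

def fgsm(x, y, w, b, eps):
    """Single-step FGSM against a logistic-regression classifier.

    Model: p(y=1|x) = sigmoid(w @ x + b), binary cross-entropy loss.
    The input gradient of that loss is (p - y) * w, so the attack
    takes one step of size eps in its sign direction.
    """
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted probability
    grad_x = (p - y) * w                    # d(BCE)/dx for logistic loss
    return x + eps * np.sign(grad_x)        # L_inf step of size eps

# Toy demo: a confidently-positive point is pushed across the boundary.
w = np.array([1.0, -2.0, 0.5])
b = 0.0
x = np.array([0.3, -0.4, 0.2])             # margin w @ x + b = 1.2 (class 1)
x_adv = fgsm(x, y=1.0, w=w, b=b, eps=0.5)
print(x_adv @ w + b)                       # margin is now negative
```

For a deep network, the only change is that `grad_x` comes from backpropagation instead of a closed form; the step itself is identical.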
Projected Gradient Descent (PGD) Attack
An iterative refinement of FGSM. Starting from $x^{(0)} = x$, repeat:

$$x^{(t+1)} = \Pi_{\mathcal{B}_\epsilon(x)}\left(x^{(t)} + \alpha \cdot \text{sign}\left(\nabla_x \ell(f_\theta(x^{(t)}), y)\right)\right)$$

where $\Pi_{\mathcal{B}_\epsilon(x)}$ projects back onto the $\epsilon$-ball around $x$. PGD with random restarts is considered a strong first-order attack.
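The iterate-and-project loop can be sketched for a toy logistic-regression model (illustrative weights and step sizes; a real attack would use autodiff on the deployed network). For the $\ell_\infty$ ball, the projection is just a coordinate-wise clip of the perturbation:

```python
import numpy as np

def pgd(x, y, w, b, eps, alpha, steps):
    """PGD against a logistic-regression model: iterate gradient-sign
    steps of size alpha, projecting after each step onto the L_inf
    ball of radius eps around the clean input x."""
    x_adv = x.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x_adv @ w + b)))
        grad_x = (p - y) * w                       # input gradient of BCE loss
        x_adv = x_adv + alpha * np.sign(grad_x)    # gradient-sign step
        x_adv = x + np.clip(x_adv - x, -eps, eps)  # projection onto the ball
    return x_adv

w = np.array([1.0, -2.0, 0.5])
b = 0.0
x = np.array([0.3, -0.4, 0.2])
x_adv = pgd(x, y=1.0, w=w, b=b, eps=0.5, alpha=0.2, steps=5)
```

Using $\alpha < \epsilon$ with several steps is what makes PGD stronger than FGSM: it can follow a curved loss surface instead of committing to one linearization.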
The Carlini-Wagner (C&W) attack reformulates adversarial example generation as an optimization problem with a different objective. It often finds smaller perturbations than PGD but is more computationally expensive.
Transferability: adversarial examples generated for one model often fool other models trained on similar data. This enables black-box attacks: generate adversarial examples on a surrogate model and apply them to the target.
Poisoning Attacks
Data poisoning corrupts the training set. The attacker injects or modifies a fraction of training examples to either degrade overall accuracy or cause specific targeted misclassifications.
Backdoor attacks (trojan attacks) plant a trigger pattern. The model behaves normally on clean inputs but misclassifies any input containing the trigger. For example, a small patch in the corner of an image could cause the model to always predict a target class.
The poisoning rate needed can be surprisingly small. In some settings, poisoning less than 1% of the training data suffices to plant a reliable backdoor.
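A minimal sketch of trigger-based poisoning on tabular data (the dataset, trigger columns, and 1% rate are all invented for illustration; an image attack would stamp a pixel patch in a corner instead):

```python
import numpy as np

def plant_backdoor(X, y, target_label, trigger_cols, trigger_val, rate, rng):
    """Backdoor poisoning sketch (tabular analog of a pixel patch):
    stamp a fixed trigger pattern into a small fraction of training
    rows and relabel those rows to the attacker's target class."""
    Xp, yp = X.copy(), y.copy()
    n_poison = int(rate * len(X))
    idx = rng.choice(len(X), size=n_poison, replace=False)
    Xp[np.ix_(idx, trigger_cols)] = trigger_val  # stamp the trigger
    yp[idx] = target_label                       # relabel to target class
    return Xp, yp

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = (X[:, 0] > 0).astype(int)
# Poison just 1% of rows: in many settings that is enough for a model
# to learn the shortcut "trigger present => predict target class".
Xp, yp = plant_backdoor(X, y, target_label=1, trigger_cols=[18, 19],
                        trigger_val=4.0, rate=0.01, rng=rng)
```

A model trained on `(Xp, yp)` would behave normally on clean rows, since 99% of the data is untouched, while any input carrying the trigger values activates the backdoor.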
Model Extraction and Privacy
Model extraction (model stealing): the attacker queries a black-box model API and uses the responses to train a functionally equivalent copy. This threatens intellectual property and also enables white-box attacks on the copy.
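A toy illustration of extraction, assuming a hypothetical linear victim behind a label-only API (`victim_api`, the least-squares surrogate fit, and all numbers are invented for the sketch; real attacks target neural networks with far more queries):

```python
import numpy as np

def extract(query_fn, dim, n_queries, rng):
    """Model-extraction sketch: query a black-box classifier API on
    random inputs, then fit a surrogate to the returned labels.
    Here: a least-squares fit to +/-1 labels, which for Gaussian
    queries recovers a linear victim's weight direction up to scale."""
    X = rng.normal(size=(n_queries, dim))
    labels = query_fn(X)                    # only labels, no gradients
    w_hat, *_ = np.linalg.lstsq(X, 2.0 * labels - 1.0, rcond=None)
    return w_hat

# Hypothetical victim: a linear classifier behind an API.
rng = np.random.default_rng(0)
w_victim = np.array([2.0, -1.0, 0.5, 0.0, 1.0])
victim_api = lambda X: (X @ w_victim > 0).astype(float)
w_hat = extract(victim_api, dim=5, n_queries=2000, rng=rng)

# Agreement between surrogate and victim on fresh inputs:
X_test = rng.normal(size=(1000, 5))
agreement = ((X_test @ w_hat > 0) == (X_test @ w_victim > 0)).mean()
```

The surrogate is then a white-box copy: gradient-based attacks crafted on `w_hat` can be transferred back to the victim API.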
Membership inference: given a model and a data point, determine whether that point was in the training set. This is a privacy violation. Overfit models are particularly vulnerable because they memorize training data.
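A sketch of the simplest loss-threshold variant of membership inference, with synthetic loss distributions standing in for an overfit model's train/test gap (the exponential shapes and the threshold are assumptions made for the illustration):

```python
import numpy as np

def infer_membership(losses, tau):
    """Loss-threshold membership inference: guess 'training member'
    whenever the model's loss on a point falls below tau. This works
    exactly when the model is overfit, i.e. its loss distribution on
    training points sits below its loss distribution on fresh data."""
    return losses < tau

# Synthetic illustration of the overfitting gap.
rng = np.random.default_rng(1)
member_losses = rng.exponential(scale=0.1, size=500)     # low loss on train
nonmember_losses = rng.exponential(scale=1.0, size=500)  # higher on fresh data
tau = 0.3
tpr = infer_membership(member_losses, tau).mean()     # true positive rate
fpr = infer_membership(nonmember_losses, tau).mean()  # false positive rate
```

The gap between `tpr` and `fpr` measures the privacy leak; for a well-generalized model the two loss distributions overlap and the attack degrades toward random guessing.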
Training data extraction: for large language models, carefully crafted prompts can cause the model to regurgitate memorized training data, including personally identifiable information.
LLM Jailbreaks
Jailbreaks are evasion attacks adapted for language models with safety training. The attacker crafts prompts that bypass safety filters and alignment training to produce harmful outputs.
Categories include: role-playing attacks (pretending the model is an unrestricted AI), encoding attacks (using unusual character encodings or languages), multi-turn attacks (gradually escalating across a conversation), and optimization-based attacks (using gradient-based search over token sequences).
Main Theorems
Robustness-Accuracy Tradeoff
Statement
For certain natural data distributions, any classifier achieving adversarial robustness within an $\ell_\infty$ ball of radius $\epsilon$ must suffer a drop in standard (clean) accuracy. Specifically, for a $d$-dimensional Gaussian mixture model with means $\pm\mu$ and identity covariance, the optimal robust classifier has standard accuracy:

$$\Phi\left(\|\mu\|_2 - \epsilon\sqrt{d}\right)$$

where $\Phi$ is the standard normal CDF, compared to $\Phi(\|\mu\|_2)$ for the optimal non-robust classifier.
Intuition
Robustness requires the classifier to maintain correct predictions even when inputs are pushed toward the decision boundary. This effectively shrinks the margin by $\epsilon$ in every coordinate, an $\ell_2$ shift of up to $\epsilon\sqrt{d}$, which in high dimensions causes a meaningful accuracy drop. The tradeoff is inherent in the geometry of the problem, not a limitation of any particular defense.
Proof Sketch
In the Gaussian mixture model, the optimal robust classifier must correctly classify all points in the $\epsilon$-expanded region. The Bayes-optimal robust classifier is a shifted version of the Bayes-optimal standard classifier, with the decision boundary moved by $\epsilon\sqrt{d}$ toward the true mean. Computing the probability of correct classification under this shifted boundary yields the stated result.
Why It Matters
This theorem explains why every proposed defense against adversarial examples pays an accuracy cost. It is not that defenses are poorly designed; the tradeoff is fundamental. Practitioners must decide how much clean accuracy they are willing to sacrifice for robustness.
Failure Mode
The theorem is proven for specific distributional assumptions. Real data distributions may have different tradeoff curves. There is active research on whether the tradeoff can be mitigated with more data or better representations.
Defenses
Adversarial training augments the training set with adversarial examples. The objective becomes:

$$\min_\theta \; \mathbb{E}_{(x,y) \sim \mathcal{D}} \left[ \max_{\|\delta\| \leq \epsilon} \ell(f_\theta(x + \delta), y) \right]$$
This is the most effective empirical defense but is computationally expensive (requires running PGD at every training step) and incurs the accuracy tradeoff described above. In multi-agent settings, robust adversarial policies extend this idea to game-theoretic formulations where agents must be robust to adversarial opponents.
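The min-max objective can be sketched end-to-end for logistic regression, using FGSM as a cheap approximation to the inner maximization (the data, learning rate, and epoch count are illustrative; a real implementation would run multi-step PGD against a network at each training step):

```python
import numpy as np

def adversarial_train(X, y, eps, lr, epochs, rng):
    """Adversarial training for logistic regression: each epoch first
    approximately solves the inner max with FGSM, then takes a
    gradient step on the perturbed examples (the outer min)."""
    w = rng.normal(scale=0.01, size=X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        X_adv = X + eps * np.sign((p - y)[:, None] * w)  # inner max: FGSM
        p_adv = 1.0 / (1.0 + np.exp(-(X_adv @ w + b)))
        w -= lr * X_adv.T @ (p_adv - y) / len(X)         # outer min: GD step
        b -= lr * np.mean(p_adv - y)
    return w, b

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] > 0).astype(float)                          # label = sign of feature 0
w, b = adversarial_train(X, y, eps=0.1, lr=0.5, epochs=200, rng=rng)
clean_acc = (((X @ w + b) > 0).astype(float) == y).mean()
```

For a linear model, FGSM is in fact the exact inner maximizer of the $\ell_\infty$ worst case; for deep networks it is only an approximation, which is why multi-step PGD is used in practice.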
Certified defenses provide provable guarantees that no perturbation within the threat model can change the prediction. Randomized smoothing is the most scalable certified defense: classify by taking a majority vote over predictions on Gaussian-perturbed copies of $x$.
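A minimal sketch of the smoothed classifier, with a linear model standing in for the base network (the weights, noise level, and sample count are illustrative assumptions):

```python
import numpy as np

def smoothed_classify(base_predict, x, sigma, n_samples, rng):
    """Randomized smoothing: the smoothed classifier returns the
    majority vote of the base classifier over Gaussian-noised copies
    of x; the vote margin can be converted into a certified L2 radius."""
    noise = rng.normal(scale=sigma, size=(n_samples, x.shape[0]))
    votes = base_predict(x[None, :] + noise)  # base predictions on noisy copies
    return np.bincount(votes).argmax()        # plurality class wins

# Toy base classifier: a linear model standing in for a neural net.
w = np.array([1.0, -1.0])
base = lambda X: (X @ w > 0).astype(int)
rng = np.random.default_rng(0)
pred = smoothed_classify(base, np.array([0.5, -0.5]), sigma=0.25,
                         n_samples=1000, rng=rng)
```

The certificate comes from the smoothing itself, not from any property of the base classifier, which is why the approach scales to large networks.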
Input preprocessing (denoising, compression) and detection methods try to filter adversarial inputs before classification. These have repeatedly been broken by adaptive attacks.
Common Confusions
Adversarial robustness is not the same as input validation
Input validation checks that inputs are well-formed (correct format, in expected range). Adversarial robustness handles inputs that look legitimate but are designed to cause misclassification. A valid JPEG image can be an adversarial example.
Defense evaluation requires adaptive attacks
Many published defenses were broken because they were only tested against non-adaptive attacks (attacks not designed for the specific defense). A defense must be evaluated against an attacker who knows the defense mechanism and adapts accordingly. Obfuscated gradients, for example, can make gradient-based attacks fail without providing real robustness.
Summary
- Adversarial examples exist for all ML models, not just neural networks
- The NIST taxonomy: evasion, poisoning, privacy, abuse
- PGD attack: iterative FGSM with projection, the standard strong attack
- Robustness-accuracy tradeoff: fundamental, not an artifact of bad defenses
- Adversarial training: effective but expensive and accuracy-reducing
- Always evaluate defenses against adaptive attackers
- LLM jailbreaks are the evasion attack analog for language models
Exercises
Problem
Compute the FGSM perturbation for a linear classifier $f(x) = w^\top x$ with squared loss $\ell(f(x), y) = \frac{1}{2}(w^\top x - y)^2$. What direction does the perturbation point in, and why does this make geometric sense?
Problem
Explain why black-box adversarial attacks work despite the attacker having no access to the target model's gradients. What property of adversarial examples enables this?
Problem
The robustness-accuracy tradeoff theorem shows a fundamental tension in the Gaussian mixture setting. Could additional unlabeled data help reduce this tradeoff? Discuss the theoretical arguments and empirical evidence.
References
Canonical:
- Goodfellow, Shlens & Szegedy, "Explaining and Harnessing Adversarial Examples" (2015)
- Madry et al., "Towards Deep Learning Models Resistant to Adversarial Attacks" (2018)
Current:
- NIST AI 100-2, "Adversarial Machine Learning: A Taxonomy" (2024)
- Tsipras et al., "Robustness May Be at Odds with Accuracy" (2019)
- Carlini et al., "On Evaluating Adversarial Robustness" (2019)
Next Topics
The natural next step from adversarial ML:
- LLM application security: OWASP Top 10 for LLM applications
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)
Builds on This
- LLM Application Security (Layer 5)