
Methodology

Convex Tinkering

Taleb's concept applied to ML research: designing small experiments with bounded downside and unbounded upside, and why this strategy dominates scale-first approaches under uncertainty.

Core · Tier 2 · Stable · ~40 min


Why This Matters

Most ML research budgets are spent on large runs that either confirm a hypothesis or waste compute. A 1000-GPU training run that fails teaches you one bit of information at enormous cost. A set of 100 small experiments on 10 GPUs each teaches you far more per dollar, and any single success can be scaled up.

Convex tinkering is the principle that you should design experiments where the downside is bounded (small cost if it fails) and the upside is unbounded (large payoff if it succeeds). This is not vague advice. It has a precise mathematical formulation rooted in the theory of convex functions and Jensen's inequality.

Mental Model

Consider two research strategies:

Strategy A (scale-first): Spend $1M training one large model. If the hypothesis is right, you get a strong result. If it is wrong, you lose $1M and learn only that this specific configuration failed.

Strategy B (tinker-first): Spend $10K each on 100 small experiments. Most will fail. But you learn something from every experiment, and the few successes can be scaled up. Your total downside is the same ($1M), but your expected information gain is far higher.

Strategy B dominates Strategy A whenever the payoff function is convex in the space of experimental configurations. This is the case when small discoveries compound or when you are uncertain about which direction is correct.
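The comparison can be made concrete with a toy Monte Carlo model. Every number here is an illustrative assumption, not something from the text: in particular, the assumption that spending 100x the budget on one experiment buys only 5x the success probability, and that any single small-experiment success can be scaled up afterwards.

```python
import random

def compare_strategies(budget_units=100, trials=20_000, p_hit=0.02, seed=0):
    """Toy model of the two strategies (all parameters illustrative).

    Strategy A: the whole budget on one experiment; we assume 100x
    budget buys only 5x the per-experiment success probability.
    Strategy B: `budget_units` unit-cost experiments; the portfolio
    succeeds if at least one hits, since a hit can be scaled up.
    Returns estimated success probabilities (p_A, p_B).
    """
    rng = random.Random(seed)
    p_big = min(1.0, 5 * p_hit)
    a_success = b_success = 0
    for _ in range(trials):
        a_success += rng.random() < p_big
        b_success += any(rng.random() < p_hit for _ in range(budget_units))
    return a_success / trials, b_success / trials
```

Under these assumptions the portfolio succeeds roughly $1 - 0.98^{100} \approx 0.87$ of the time, versus about $0.10$ for the single large run, at equal total cost.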

Formal Setup and Notation

Definition

Convex Payoff Function

A payoff function $f: \mathcal{X} \to \mathbb{R}$ is convex if for all $x, y \in \mathcal{X}$ and $\lambda \in [0, 1]$:

$$f(\lambda x + (1 - \lambda) y) \leq \lambda f(x) + (1 - \lambda) f(y)$$

In the context of ML experiments, $\mathcal{X}$ is the space of experimental configurations and $f(x)$ is the information value or research payoff of running experiment $x$. Convexity means that diversified experiments yield higher expected payoff than a single concentrated bet.
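The defining inequality can be checked numerically on a grid of points and mixing weights. This is only a sketch (a finite check is necessary but not sufficient for convexity), and the function names are mine, not from the text:

```python
def is_convex_on_samples(f, xs, lambdas, tol=1e-12):
    """Check f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y)
    on every pair of sample points and every mixing weight."""
    for x in xs:
        for y in xs:
            for lam in lambdas:
                lhs = f(lam * x + (1 - lam) * y)
                rhs = lam * f(x) + (1 - lam) * f(y)
                if lhs > rhs + tol:
                    return False
    return True
```

On $[-2, 2]$, $x^2$ passes this check while $x^3$ fails (it is concave on the negative half-line).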

Definition

Optionality

An experiment has optionality when you can choose to act on its result or ignore it. Formally, the payoff of an experiment with uncertain outcome $X$ is:

$$V = \mathbb{E}[\max(X - K, 0)]$$

where $K$ is the threshold for a result to be useful. This is the payoff of a call option. Since $\max(\cdot, 0)$ is convex, Jensen's inequality implies that diversifying across many experiments with independent outcomes increases $V$.
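A minimal Monte Carlo sketch of this option payoff, assuming (purely for illustration) that outcomes are i.i.d. standard normal and the usefulness threshold $K$ sits one standard deviation above the mean:

```python
import random

def best_of_n_option_value(n, K=1.0, trials=50_000, seed=0):
    """Estimate E[max(max_i X_i - K, 0)] for i.i.d. standard normal
    experiment outcomes X_i: the option value of running n experiments
    and acting only on the best result. Distributional choices are
    illustrative assumptions, not from the text."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        best = max(rng.gauss(0.0, 1.0) for _ in range(n))
        total += max(best - K, 0.0)
    return total / trials
```

Even a single experiment has positive option value although its expected outcome lies below the threshold, and the best of ten is worth several times more: the right tail is what you are paid for.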

Core Definitions

The downside of an experiment is the maximum you can lose: the compute cost, the researcher time, the opportunity cost of not running something else.

The upside is the maximum you can gain: a publishable result, a new capability, an insight that redirects the research program.

An experiment is convex if its upside is much larger than its downside. A small ablation study costs a few GPU-hours and might reveal that a widely-used technique is unnecessary, saving thousands of GPU-hours for everyone. That is a convex bet.

An experiment is concave if its downside is large relative to its upside. Training a model at full scale to match a known benchmark, when the outcome is either "we match" or "we do not match," is a concave bet. The information gain per dollar is low.

Main Theorems

Theorem

Jensen's Inequality for Research Portfolios

Statement

Let $f: \mathbb{R} \to \mathbb{R}$ be convex, and let $X_1, \ldots, X_n$ be independent random variables representing experiment outcomes with $\mathbb{E}[X_i] = \mu$. Then:

$$\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n f(X_i)\right] \geq f\left(\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n X_i\right]\right) = f(\mu)$$

Since the best of $n$ payoffs is at least their average, $\mathbb{E}[\max_i f(X_i)] \geq \mathbb{E}[\frac{1}{n}\sum_{i} f(X_i)] \geq f(\mu)$: running $n$ small experiments and taking the best yields at least as high expected payoff as running one experiment at the average configuration. When $f$ is strictly convex and $\mathrm{Var}(X_i) > 0$, the inequality is strict: diversification is strictly better.
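The inequality is easy to verify by simulation. A sketch with $f(x) = x^2$ and Gaussian outcomes (both choices mine, for illustration): the portfolio average lands near $\mu^2 + \sigma^2$, strictly above $f(\mu)$.

```python
import random

def portfolio_vs_point(f, n=10, trials=20_000, mu=0.0, sigma=1.0, seed=0):
    """Compare E[(1/n) * sum_i f(X_i)] against f(mu) for i.i.d.
    X_i ~ N(mu, sigma^2). For convex f, Jensen's inequality predicts
    the portfolio side is at least f(mu)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += sum(f(rng.gauss(mu, sigma)) for _ in range(n)) / n
    return total / trials, f(mu)
```

With $f(x) = x^2$, $\mu = 0$, $\sigma = 1$, the left side concentrates near $1$ while $f(\mu) = 0$: the variance of the experiments is exactly what the convex payoff harvests.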

Intuition

When the payoff function is convex, variance helps you. High-variance outcomes occasionally produce large successes, and the convex payoff function amplifies these successes more than it penalizes the failures. This is the opposite of risk aversion: under convex payoffs, you want more variance (more diverse experiments), not less.

Proof Sketch

Direct application of Jensen's inequality. Convexity of $f$ implies $\mathbb{E}[f(X)] \geq f(\mathbb{E}[X])$. The average of $n$ evaluations of $f$ at independent points has higher expectation than $f$ evaluated at the average point.

Why It Matters

This is the mathematical reason why a portfolio of small experiments dominates a single large experiment under uncertainty. It applies whenever the researcher can select which results to act on (optionality) and the cost of failure is bounded.

Failure Mode

The inequality reverses when ff is concave. If there are large fixed costs (e.g., building infrastructure that only pays off at scale), then concentrating resources on one large project may dominate. The convexity assumption must be checked for each research context.

Proof Ideas and Templates Used

The proof uses Jensen's inequality, which is the foundational result connecting convexity and expectation. The same inequality appears in information theory (entropy is concave), optimization (convex relaxation gives lower bounds), and finance (option pricing).

Practical Applications

NanoGPT-Style Speedruns

Training a 124M parameter GPT-2 reproduction in minutes rather than days is a convex tinker. The cost is small (one GPU for a few hours). Each experiment tests a specific hypothesis: does this learning rate schedule help? Does this architecture change matter? The community has learned more per compute-dollar from nanoGPT speedruns than from any single large training run.

Ablation Studies as Optionality

Every ablation study is a convex bet. Removing a component costs one training run. If the component turns out to be unnecessary, the savings on all future runs are enormous. If it turns out to be necessary, you learn that cheaply.

Hyperparameter Sweeps on Small Models

Tuning hyperparameters on a 10M parameter model and transferring to a 1B parameter model is a convex strategy. The cost of the sweep is small. If the hyperparameters transfer (as muP predicts), the payoff is large. If they do not transfer, you lose only the sweep cost.

Canonical Examples

Example

Why scale-first is fragile

Consider a lab that spends $10M training a 100B parameter model on a specific dataset mix. If the dataset mix is suboptimal (which they cannot know in advance), the entire run is wasted. A convex alternative: spend $100K on 100 different dataset mixes at 1B scale. The best mix is likely near-optimal for larger scale. Total cost is $10M + $100K (the large run plus the tinkering), but the probability of the large run succeeding is much higher.

Common Confusions

Watch Out

Convex tinkering is not the same as random search

Random search explores uniformly over a space. Convex tinkering is adaptive: each experiment is designed based on what you learned from previous ones. The key property is bounded downside, not randomness. A carefully designed small experiment is a convex tinker. A random large experiment is not.

Watch Out

Bounded downside does not mean low risk

A portfolio of convex tinkers has bounded downside per experiment but can still fail to produce any useful result. The claim is not that tinkering eliminates risk. The claim is that tinkering converts downside risk into upside variance, which is favorable under convex payoffs.

Watch Out

Some research genuinely requires scale

Certain phenomena only emerge at scale (in-context learning, chain-of-thought reasoning). For studying these phenomena, you need large models. The convex tinkering principle does not say "never train large models." It says "do the cheap experiments first to maximize the probability that the expensive experiment succeeds."

Misreadings of Convex Tinkering

Watch Out

It means chaos and randomness are good

Wrong. Taleb's point is not that disorder is inherently valuable (see editorial principles for scope conditions on Taleb and other lenses). The point is that under a convex exposure, volatility can help rather than hurt. Without the convexity (bounded downside, open-ended upside), randomness is just noise. Convex tinkering requires structure: cap the cost per trial, increase the number of trials, and retain the option to scale winners. That is disciplined experimentation, not celebration of chaos.

Watch Out

It means do lots of random things with no plan

Wrong. Taleb explicitly contrasts convex trial-and-error with undirected flailing. The 1/N dispersion logic is: lower the cost per trial, increase the number of trials, and minimize the chance of missing upside. Each trial should test something specific. The absence of a grand forecast is not the absence of local hypotheses.

Watch Out

It means passive waiting for luck

Wrong. Convex tinkering is active. You run experiments, observe results, adapt, and scale what works. The passivity critique applies to lottery-ticket thinking ("I will try one thing and hope it works"). Convex tinkering is the opposite: many cheap active experiments designed so that any single success can be amplified.

Watch Out

It is a synonym for entrepreneurship or content creation

Not necessarily. Convex tinkering applies only when the payoff distribution is meaningfully asymmetric and downside is controlled. Many entrepreneurial activities have concave payoffs (large fixed costs, small marginal gains) or uncontrolled downside (betting the company on one product). The convexity condition must be checked, not assumed.

Connection to Bayesian Optimization

Bayesian optimization formalizes one version of convex tinkering. The acquisition function (e.g., expected improvement, upper confidence bound) balances exploration (trying uncertain configurations) and exploitation (refining known-good configurations). This is precisely the strategy of exploring where uncertainty is high, which is the core of convex tinkering. The key difference: Bayesian optimization assumes a smooth objective with a known kernel. Convex tinkering as a research philosophy does not assume you can model the objective function.
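For concreteness, the expected-improvement acquisition has a standard closed form when the surrogate's posterior at a candidate point is Gaussian. A minimal sketch (the formula is standard; the parameter names are mine):

```python
import math

def expected_improvement(mu, sigma, best, xi=0.01):
    """Expected improvement for maximization, assuming a Gaussian
    posterior N(mu, sigma^2) at the candidate point. `best` is the
    incumbent observed value, `xi` a small exploration bonus.
    Closed form: EI = d * Phi(z) + sigma * phi(z), d = mu - best - xi."""
    if sigma <= 0.0:
        return max(mu - best - xi, 0.0)
    d = mu - best - xi
    z = d / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # Phi(z)
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # phi(z)
    return d * cdf + sigma * pdf
```

A candidate with mean equal to the incumbent but high uncertainty scores higher than a certain candidate slightly above the incumbent: the acquisition pays for variance, which is the same variance-seeking behavior that convex payoffs reward.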

Exercises

ExerciseCore

Problem

You have a budget of 100 GPU-hours. You can either (a) run one experiment for 100 GPU-hours or (b) run 10 experiments for 10 GPU-hours each. Under what conditions on the payoff function does strategy (b) dominate strategy (a) in expectation?

ExerciseAdvanced

Problem

A lab is deciding between training one 70B model or seven 10B models with different architectures. Assume training cost scales linearly with parameters and the chance that any given architecture achieves state-of-the-art on the target benchmark is $p = 0.3$ for 70B and $p = 0.1$ for each 10B model (independently). Which strategy has a higher probability of at least one state-of-the-art result?

References

Canonical:

  • Taleb, Antifragile: Things That Gain from Disorder (2012), Chapter 12
  • Taleb, The Black Swan (2007), Chapter 17

Current:

  • Karpathy, nanoGPT speedrun experiments (2023-2024), GitHub repository

  • Yang et al., Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer (muP, 2022)

  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 7-8

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14

Last reviewed: April 2026
