
Methodology

Convex Tinkering

Taleb's concept applied to ML research: designing small experiments with bounded downside and unbounded upside, and why this strategy dominates scale-first approaches under uncertainty.

Core · Tier 2 · Stable · ~40 min


Why This Matters

Most ML research budgets are spent on large runs that either confirm a hypothesis or waste compute. A 1000-GPU training run that fails teaches you one bit of information at enormous cost. A set of 100 small experiments on 10 GPUs each teaches you far more per dollar, and any single success can be scaled up.

Convex tinkering is the principle that you should design experiments where the downside is bounded (small cost if it fails) and the upside is unbounded (large payoff if it succeeds). This is not vague advice. It has a precise mathematical formulation rooted in the theory of convex functions and Jensen's inequality.

Mental Model

Consider two research strategies:

Strategy A (scale-first): Spend $1M training one large model. If the hypothesis is right, you get a strong result. If it is wrong, you lose $1M and learn only that this specific configuration failed.

Strategy B (tinker-first): Spend $10K each on 100 small experiments. Most will fail. But you learn something from every experiment, and the few successes can be scaled up. Your total downside is the same ($1M), but your expected information gain is far higher.

Strategy B dominates Strategy A whenever the payoff function is convex in the space of experimental configurations. This is the case when small discoveries compound or when you are uncertain about which direction is correct.
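The comparison can be made concrete with a toy Monte Carlo model. Every number here is an illustrative assumption, not something from the text: in particular, the assumption that spending 100x the budget on one experiment buys only 5x the success probability, and that any single small-experiment success can be scaled up afterwards.

```python
import random

def compare_strategies(budget_units=100, trials=20_000, p_hit=0.02, seed=0):
    """Toy model of the two strategies (all parameters illustrative).

    Strategy A: the whole budget on one experiment; we assume 100x
    budget buys only 5x the per-experiment success probability.
    Strategy B: `budget_units` unit-cost experiments; the portfolio
    succeeds if at least one hits, since a hit can be scaled up.
    Returns estimated success probabilities (p_A, p_B).
    """
    rng = random.Random(seed)
    p_big = min(1.0, 5 * p_hit)
    a_success = b_success = 0
    for _ in range(trials):
        a_success += rng.random() < p_big
        b_success += any(rng.random() < p_hit for _ in range(budget_units))
    return a_success / trials, b_success / trials
```

Under these assumptions the portfolio succeeds roughly $1 - 0.98^{100} \approx 0.87$ of the time, versus about $0.10$ for the single large run, at equal total cost.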

Formal Setup and Notation

Definition

Convex Payoff Function

A payoff function $f: \mathcal{X} \to \mathbb{R}$ is convex if for all $x, y \in \mathcal{X}$ and $\lambda \in [0, 1]$:

$$f(\lambda x + (1 - \lambda) y) \leq \lambda f(x) + (1 - \lambda) f(y)$$

In the context of ML experiments, $\mathcal{X}$ is the space of experimental configurations and $f(x)$ is the information value or research payoff of running experiment $x$. Convexity means that diversified experiments yield higher expected payoff than a single concentrated bet.
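The defining inequality can be checked numerically on a grid of points and mixing weights. This is only a sketch (a finite check is necessary but not sufficient for convexity), and the function names are mine, not from the text:

```python
def is_convex_on_samples(f, xs, lambdas, tol=1e-12):
    """Check f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y)
    on every pair of sample points and every mixing weight."""
    for x in xs:
        for y in xs:
            for lam in lambdas:
                lhs = f(lam * x + (1 - lam) * y)
                rhs = lam * f(x) + (1 - lam) * f(y)
                if lhs > rhs + tol:
                    return False
    return True
```

On $[-2, 2]$, $x^2$ passes this check while $x^3$ fails (it is concave on the negative half-line).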

Definition

Optionality

An experiment has optionality when you can choose to act on its result or ignore it. Formally, the payoff of an experiment with uncertain outcome $X$ is:

$$V = \mathbb{E}[\max(X - K, 0)]$$

where $K$ is the threshold for a result to be useful. This is the payoff of a call option. Since $\max(\cdot, 0)$ is convex, Jensen's inequality implies that diversifying across many experiments with independent outcomes increases $V$.
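A minimal Monte Carlo sketch of this option payoff, assuming (purely for illustration) that outcomes are i.i.d. standard normal and the usefulness threshold $K$ sits one standard deviation above the mean:

```python
import random

def best_of_n_option_value(n, K=1.0, trials=50_000, seed=0):
    """Estimate E[max(max_i X_i - K, 0)] for i.i.d. standard normal
    experiment outcomes X_i: the option value of running n experiments
    and acting only on the best result. Distributional choices are
    illustrative assumptions, not from the text."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        best = max(rng.gauss(0.0, 1.0) for _ in range(n))
        total += max(best - K, 0.0)
    return total / trials
```

Even a single experiment has positive option value although its expected outcome lies below the threshold, and the best of ten is worth several times more: the right tail is what you are paid for.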

Core Definitions

The downside of an experiment is the maximum you can lose: the compute cost, the researcher time, the opportunity cost of not running something else.

The upside is the maximum you can gain: a publishable result, a new capability, an insight that redirects the research program.

An experiment is convex if its upside is much larger than its downside. A small ablation study costs a few GPU-hours and might reveal that a widely-used technique is unnecessary, saving thousands of GPU-hours for everyone. That is a convex bet.

An experiment is concave if its downside is large relative to its upside. Training a model at full scale to match a known benchmark, when the outcome is either "we match" or "we do not match," is a concave bet. The information gain per dollar is low.

Main Theorems

Theorem

Jensen's Inequality for Research Portfolios

Statement

Let $f: \mathbb{R} \to \mathbb{R}$ be convex, and let $X_1, \ldots, X_n$ be independent random variables representing experiment outcomes with $\mathbb{E}[X_i] = \mu$. Then:

$$\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n f(X_i)\right] \geq f\left(\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n X_i\right]\right) = f(\mu)$$

Since the best of $n$ payoffs is at least their average, $\mathbb{E}[\max_i f(X_i)] \geq \mathbb{E}[\frac{1}{n}\sum_{i} f(X_i)] \geq f(\mu)$: running $n$ small experiments and taking the best yields at least as high expected payoff as running one experiment at the average configuration. When $f$ is strictly convex and $\mathrm{Var}(X_i) > 0$, the inequality is strict: diversification is strictly better.
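The inequality is easy to verify by simulation. A sketch with $f(x) = x^2$ and Gaussian outcomes (both choices mine, for illustration): the portfolio average lands near $\mu^2 + \sigma^2$, strictly above $f(\mu)$.

```python
import random

def portfolio_vs_point(f, n=10, trials=20_000, mu=0.0, sigma=1.0, seed=0):
    """Compare E[(1/n) * sum_i f(X_i)] against f(mu) for i.i.d.
    X_i ~ N(mu, sigma^2). For convex f, Jensen's inequality predicts
    the portfolio side is at least f(mu)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += sum(f(rng.gauss(mu, sigma)) for _ in range(n)) / n
    return total / trials, f(mu)
```

With $f(x) = x^2$, $\mu = 0$, $\sigma = 1$, the left side concentrates near $1$ while $f(\mu) = 0$: the variance of the experiments is exactly what the convex payoff harvests.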

Intuition

When the payoff function is convex, variance helps you. High-variance outcomes occasionally produce large successes, and the convex payoff function amplifies these successes more than it penalizes the failures. This is the opposite of risk aversion: under convex payoffs, you want more variance (more diverse experiments), not less.

Proof Sketch

Direct application of Jensen's inequality. Convexity of $f$ implies $\mathbb{E}[f(X)] \geq f(\mathbb{E}[X])$. The average of $n$ evaluations of $f$ at independent points has higher expectation than $f$ evaluated at the average point.

Why It Matters

This is the mathematical reason why a portfolio of small experiments dominates a single large experiment under uncertainty. It applies whenever the researcher can select which results to act on (optionality) and the cost of failure is bounded.

Failure Mode

The inequality reverses when ff is concave. If there are large fixed costs (e.g., building infrastructure that only pays off at scale), then concentrating resources on one large project may dominate. The convexity assumption must be checked for each research context.

Proof Ideas and Templates Used

The proof uses Jensen's inequality, which is the foundational result connecting convexity and expectation. The same inequality appears in information theory (entropy is concave), optimization (convex relaxation gives lower bounds), and finance (option pricing).

Practical Applications

NanoGPT-Style Speedruns

Training a 124M parameter GPT-2 reproduction in minutes rather than days is a convex tinker. The cost is small (one GPU for a few hours). Each experiment tests a specific hypothesis: does this learning rate schedule help? Does this architecture change matter? The community has learned more per compute-dollar from nanoGPT speedruns than from any single large training run.

Ablation Studies as Optionality

Every ablation study is a convex bet. Removing a component costs one training run. If the component turns out to be unnecessary, the savings on all future runs are enormous. If it turns out to be necessary, you learn that cheaply.

Hyperparameter Sweeps on Small Models

Tuning hyperparameters on a 10M parameter model and transferring to a 1B parameter model is a convex strategy. The cost of the sweep is small. If the hyperparameters transfer (as muP predicts), the payoff is large. If they do not transfer, you lose only the sweep cost.

Canonical Examples

Example

Why scale-first is fragile

Consider a lab that spends $10M training a 100B parameter model on a specific dataset mix. If the dataset mix is suboptimal (which they cannot know in advance), the entire run is wasted. A convex alternative: spend $100K on 100 different dataset mixes at 1B scale. The best mix is likely near-optimal for larger scale. Total cost is $10M + $100K (the large run plus the tinkering), but the probability of the large run succeeding is much higher.

Common Confusions

Watch Out

Convex tinkering is not the same as random search

Random search explores uniformly over a space. Convex tinkering is adaptive: each experiment is designed based on what you learned from previous ones. The key property is bounded downside, not randomness. A carefully designed small experiment is a convex tinker. A random large experiment is not.

Watch Out

Bounded downside does not mean low risk

A portfolio of convex tinkers has bounded downside per experiment but can still fail to produce any useful result. The claim is not that tinkering eliminates risk. The claim is that tinkering converts downside risk into upside variance, which is favorable under convex payoffs.

Watch Out

Some research genuinely requires scale

Certain phenomena only emerge at scale (in-context learning, chain-of-thought reasoning). For studying these phenomena, you need large models. The convex tinkering principle does not say "never train large models." It says "do the cheap experiments first to maximize the probability that the expensive experiment succeeds."

Misreadings of Convex Tinkering

Watch Out

It means chaos and randomness are good

Wrong. Taleb's point is not that disorder is inherently valuable (see editorial principles for scope conditions on Taleb and other lenses). The point is that under a convex exposure, volatility can help rather than hurt. Without the convexity (bounded downside, open-ended upside), randomness is just noise. Convex tinkering requires structure: cap the cost per trial, increase the number of trials, and retain the option to scale winners. That is disciplined experimentation, not celebration of chaos.

Watch Out

It means do lots of random things with no plan

Wrong. Taleb explicitly contrasts convex trial-and-error with undirected flailing. The 1/N dispersion logic is: lower the cost per trial, increase the number of trials, and minimize the chance of missing upside. Each trial should test something specific. The absence of a grand forecast is not the absence of local hypotheses.

Watch Out

It means passive waiting for luck

Wrong. Convex tinkering is active. You run experiments, observe results, adapt, and scale what works. The passivity critique applies to lottery-ticket thinking ("I will try one thing and hope it works"). Convex tinkering is the opposite: many cheap active experiments designed so that any single success can be amplified.

Watch Out

It is a synonym for entrepreneurship or content creation

Not necessarily. Convex tinkering applies only when the payoff distribution is meaningfully asymmetric and downside is controlled. Many entrepreneurial activities have concave payoffs (large fixed costs, small marginal gains) or uncontrolled downside (betting the company on one product). The convexity condition must be checked, not assumed.

Connection to Bayesian Optimization

Bayesian optimization formalizes one version of convex tinkering. The acquisition function (e.g., expected improvement, upper confidence bound) balances exploration (trying uncertain configurations) and exploitation (refining known-good configurations). This is precisely the strategy of exploring where uncertainty is high, which is the core of convex tinkering. The key difference: Bayesian optimization assumes a smooth objective with a known kernel. Convex tinkering as a research philosophy does not assume you can model the objective function.
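For concreteness, the expected-improvement acquisition has a standard closed form when the surrogate's posterior at a candidate point is Gaussian. A minimal sketch (the formula is standard; the parameter names are mine):

```python
import math

def expected_improvement(mu, sigma, best, xi=0.01):
    """Expected improvement for maximization, assuming a Gaussian
    posterior N(mu, sigma^2) at the candidate point. `best` is the
    incumbent observed value, `xi` a small exploration bonus.
    Closed form: EI = d * Phi(z) + sigma * phi(z), d = mu - best - xi."""
    if sigma <= 0.0:
        return max(mu - best - xi, 0.0)
    d = mu - best - xi
    z = d / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # Phi(z)
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # phi(z)
    return d * cdf + sigma * pdf
```

A candidate with mean equal to the incumbent but high uncertainty scores higher than a certain candidate slightly above the incumbent: the acquisition pays for variance, which is the same variance-seeking behavior that convex payoffs reward.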

Exercises

ExerciseCore

Problem

You have a budget of 100 GPU-hours. You can either (a) run one experiment for 100 GPU-hours or (b) run 10 experiments for 10 GPU-hours each. Under what conditions on the payoff function does strategy (b) dominate strategy (a) in expectation?

ExerciseAdvanced

Problem

A lab is deciding between training one 70B model or seven 10B models with different architectures. Assume training cost scales linearly with parameters and the chance that any given architecture achieves state-of-the-art on the target benchmark is $p = 0.3$ for 70B and $p = 0.1$ for each 10B model (independently). Which strategy has a higher probability of at least one state-of-the-art result?

References

Canonical:

  • Taleb, Antifragile: Things That Gain from Disorder (2012), Chapter 12
  • Taleb, The Black Swan (2007), Chapter 17

Current:

  • Karpathy, nanoGPT speedrun experiments (2023-2024), GitHub repository

  • Yang et al., Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer (muP, 2022)

  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 7-8

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14

Last reviewed: April 2026
