
LLM Construction

Neural Architecture Search

Automating network architecture design: search spaces, search strategies (RL, evolutionary, differentiable), performance estimation via weight sharing, and the gap between NAS hype and practical gains.


Why This Matters

Human-designed architectures (ResNet, Transformer) work well, but there is no guarantee they are optimal for a given task and compute budget. Neural Architecture Search (NAS) attempts to automate architecture design by searching over a structured space of possible networks.

NAS produced EfficientNet, which achieved state-of-the-art ImageNet accuracy at lower compute than hand-designed alternatives. However, NAS is also one of the most over-hyped areas of ML: the search cost can be enormous, the search spaces are heavily constrained by human priors, and many "NAS-found" architectures differ only marginally from hand-designed ones.

Formal Setup

A NAS problem consists of three components.

Definition

Search Space

The search space $\mathcal{A}$ is the set of all architectures the search can consider. It is typically parameterized as a directed acyclic graph where nodes are feature maps and edges are operations (convolution, pooling, skip connection). The space is finite but combinatorially large.
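The DAG parameterization can be sketched with a toy cell-based space. A minimal sketch, assuming hypothetical op names (not the DARTS operation set):

```python
import itertools

# Toy cell-based search space: each edge in a small DAG independently picks
# one operation from a fixed candidate set. Op names are placeholders.
OPS = ["conv3x3", "conv5x5", "maxpool3x3", "skip"]

def enumerate_cells(num_edges):
    """Yield every discrete architecture as a tuple of per-edge op choices."""
    return itertools.product(OPS, repeat=num_edges)

# Even a tiny cell is combinatorially large: 4 ops on 6 edges -> 4^6 = 4096.
cells = list(enumerate_cells(6))
assert len(cells) == 4 ** 6
```

Real search spaces add connectivity choices on top of per-edge operations, which is what makes exhaustive enumeration hopeless at practical sizes.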

Definition

Search Strategy

The search strategy selects which architectures to evaluate from $\mathcal{A}$. Common strategies: reinforcement learning (controller generates architectures, reward is validation accuracy), evolutionary algorithms (population of architectures, mutation and selection), gradient-based optimization (DARTS).

Definition

Performance Estimation Strategy

The performance estimation strategy approximates the true validation performance of a candidate architecture without training it fully from scratch. Methods: training for fewer epochs, weight sharing across architectures (supernets), learning curve extrapolation.

Search Strategies

Reinforcement Learning (Zoph and Le, 2017)

A recurrent neural network (the "controller") generates architecture descriptions token by token. Each generated architecture is trained to convergence, and the validation accuracy serves as the reward signal. The controller is updated with REINFORCE.
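A minimal REINFORCE sketch of this loop, assuming a single categorical choice over four operations (the real controller is an RNN emitting many tokens) and made-up reward values in place of trained validation accuracy:

```python
import math
import random

random.seed(0)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Assumed rewards: pretend architecture 2 trains to the best accuracy.
REWARDS = [0.10, 0.20, 0.90, 0.15]
logits = [0.0] * 4
lr, baseline = 0.3, 0.0

for _ in range(1000):
    probs = softmax(logits)
    a = random.choices(range(4), weights=probs)[0]  # sample an architecture
    r = REWARDS[a]                                  # "train and evaluate" it
    baseline = 0.9 * baseline + 0.1 * r             # moving-average baseline
    for k in range(4):                              # REINFORCE update:
        grad = (1.0 if k == a else 0.0) - probs[k]  # d log pi(a) / d logit_k
        logits[k] += lr * (r - baseline) * grad
```

The policy concentrates on the highest-reward choice; the baseline reduces the variance of the gradient estimate, which is essential when each reward costs a full training run.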

The original NAS paper used 800 GPUs for 28 days. This established NAS as a concept but also demonstrated its impracticality at scale.

Evolutionary Methods

Maintain a population of architectures. At each step: select a parent, mutate it (add/remove a layer, change an operation), train the child, and add it to the population if it improves upon the weakest member. AmoebaNet (Real et al., 2019) showed evolutionary NAS matches RL-based NAS at lower cost.
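The loop above can be sketched with a toy fitness function standing in for validation accuracy. The preference for conv ops is an arbitrary assumption for illustration, not a claim about real architectures:

```python
import random

random.seed(1)
OPS = ["conv3x3", "conv5x5", "maxpool", "skip"]

def fitness(arch):
    # Stand-in for validation accuracy: fraction of conv ops (assumed proxy).
    return sum(op.startswith("conv") for op in arch) / len(arch)

def mutate(arch):
    # Mutation: swap one edge's operation for a random candidate.
    child = list(arch)
    child[random.randrange(len(child))] = random.choice(OPS)
    return tuple(child)

population = [tuple(random.choice(OPS) for _ in range(6)) for _ in range(10)]
for _ in range(300):
    parent = max(random.sample(population, 3), key=fitness)   # tournament
    child = mutate(parent)
    worst = min(range(len(population)), key=lambda i: fitness(population[i]))
    if fitness(child) > fitness(population[worst]):
        population[worst] = child            # replace the weakest member

best = max(population, key=fitness)
```

Regularized evolution (AmoebaNet) replaces the oldest member rather than the weakest, which adds an implicit exploration pressure; the skeleton is otherwise the same.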

Differentiable Architecture Search (DARTS)

Proposition

DARTS Continuous Relaxation

Statement

Let $\mathcal{O} = \{o_1, \ldots, o_K\}$ be the set of candidate operations for each edge. DARTS replaces the discrete choice with a continuous mixture:

$$\bar{o}(x) = \sum_{k=1}^{K} \frac{\exp(\alpha_k)}{\sum_{j=1}^{K} \exp(\alpha_j)} \cdot o_k(x)$$

where $\alpha_k$ are architecture parameters. The bilevel optimization is:

$$\min_\alpha \mathcal{L}_{\text{val}}(w^*(\alpha), \alpha) \quad \text{s.t.} \quad w^*(\alpha) = \arg\min_w \mathcal{L}_{\text{train}}(w, \alpha)$$

After optimization, the final architecture is obtained by selecting $\arg\max_k \alpha_k$ for each edge.

Intuition

Instead of searching over discrete architectures (combinatorial), DARTS relaxes the problem to a continuous optimization over mixing weights. You jointly train the network weights $w$ and the architecture parameters $\alpha$ using gradient descent. This reduces NAS from days to hours.
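A scalar sketch of the mixed operation on a single edge, with toy functions standing in for real network operations:

```python
import math

# Candidate "operations" are just functions of x here (toy stand-ins).
ops = [lambda x: x,         # skip connection
       lambda x: 0.0,       # zero op
       lambda x: 2.0 * x]   # stand-in for a learned transform

def mixed_op(x, alpha):
    # Softmax-weighted sum over candidate operations.
    w = [math.exp(a) for a in alpha]
    z = sum(w)
    return sum((wk / z) * op(x) for wk, op in zip(w, ops))

# Uniform alpha averages the op outputs: (1 + 0 + 2) / 3 = 1.
assert abs(mixed_op(1.0, [0.0, 0.0, 0.0]) - 1.0) < 1e-9

# After search, the edge is discretized with argmax over alpha.
alpha = [5.0, -1.0, 0.5]
chosen = max(range(len(ops)), key=lambda k: alpha[k])
assert chosen == 0   # this edge becomes a skip connection
```

The final `argmax` step is exactly the discretization gap discussed below: the mixture that was optimized is not the network that gets deployed.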

Proof Sketch

The relaxation is valid because the softmax-weighted sum approaches a hard selection as the $\alpha$ values diverge. In practice, the bilevel optimization is approximated: alternate one step of $w$ update (on training data) with one step of $\alpha$ update (on validation data). Liu et al. (2019) showed this approximation works empirically but can be unstable.
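The alternating first-order approximation can be illustrated on made-up quadratic losses; only the update pattern, not the loss functions, mirrors DARTS:

```python
# Alternate one gradient step on w (train loss) with one on alpha (val loss).
def d_train_dw(w, a):    # gradient of L_train(w, a) = (w - a)^2 in w
    return 2.0 * (w - a)

def d_val_da(w, a):      # gradient of L_val(w, a) = (a - 1)^2 + (a - w)^2 in a
    return 2.0 * (a - 1.0) + 2.0 * (a - w)

w, a, lr = 0.0, 0.0, 0.1
for _ in range(500):
    w -= lr * d_train_dw(w, a)   # weight step on "training data"
    a -= lr * d_val_da(w, a)     # architecture step on "validation data"
# Both variables converge to the joint optimum w = a = 1.
```

Each $\alpha$ step uses the current $w$ rather than the true $w^*(\alpha)$, which is the one-step-unrolling bias mentioned above.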

Why It Matters

DARTS reduced NAS cost from thousands of GPU-days to a single GPU-day. This made NAS accessible to researchers without massive compute budgets and established differentiable NAS as the dominant paradigm.

Failure Mode

DARTS suffers from collapse: the search often converges to architectures dominated by skip connections and parameter-free operations because these are easy to optimize. The bilevel approximation (one-step unrolling) introduces bias. Several follow-up works (DARTS+, FairDARTS, SDARTS) address collapse by regularizing the architecture parameters.

Weight Sharing and Supernets

Training every candidate architecture from scratch is prohibitively expensive. Weight sharing trains a single large network (the supernet or one-shot model) that contains all candidate architectures as subgraphs. To evaluate a candidate, extract its subgraph and use the shared weights.
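A minimal sketch of shared-weight evaluation, with random numbers standing in for trained supernet weights and a made-up scalar score standing in for validation accuracy:

```python
import random

random.seed(2)
NUM_EDGES = 3
OPS = ["conv", "pool", "skip"]

# The "supernet" holds one shared weight per (edge, op) pair; every candidate
# subnetwork reuses these instead of training its own weights from scratch.
supernet = {(e, op): random.gauss(0.0, 1.0)
            for e in range(NUM_EDGES) for op in OPS}

def shared_weight_score(arch):
    # Stand-in for validation accuracy under shared weights.
    return sum(supernet[(e, op)] for e, op in enumerate(arch))

# Rank many candidates at essentially zero marginal cost.
candidates = [tuple(random.choice(OPS) for _ in range(NUM_EDGES))
              for _ in range(20)]
best = max(candidates, key=shared_weight_score)
```

The cheapness is the whole point: scoring a candidate is a lookup, not a training run. The cost, as the next paragraph explains, is that this proxy ranking can disagree with the true ranking.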

The assumption: a subnetwork's performance with shared weights correlates with its performance when trained independently. This assumption often fails. The ranking of architectures under shared weights can differ substantially from their ranking after independent training. This is the main weakness of one-shot NAS.

EfficientNet: A NAS Success Story

Tan and Le (2019) used NAS to search over a mobile-sized architecture space, finding EfficientNet-B0. They then applied a compound scaling rule (scale depth, width, and resolution together with fixed ratios) to produce EfficientNet-B1 through B7. EfficientNet-B7 matched the best ImageNet accuracy at the time with 8.4x fewer parameters than the previous state of the art.
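The compound scaling rule can be sketched as follows. The multipliers are the coefficients reported by Tan and Le ($\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$); the base depth, width, and resolution values are illustrative:

```python
# EfficientNet-style compound scaling: scale depth, width, and input
# resolution together with fixed ratios, controlled by one coefficient phi.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # depth, width, resolution multipliers

def compound_scale(base_depth, base_width, base_res, phi):
    return (round(base_depth * ALPHA ** phi),
            round(base_width * BETA ** phi),
            round(base_res * GAMMA ** phi))

# The constraint alpha * beta^2 * gamma^2 ~= 2 means each unit increase in
# phi roughly doubles FLOPs (conv cost scales with width^2 * resolution^2).
assert abs(ALPHA * BETA ** 2 * GAMMA ** 2 - 2.0) < 0.1
```

Larger `phi` values generate the B1 through B7 family from the single searched B0 baseline.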

The success was partly NAS and partly the scaling rule. Disentangling the contribution of the search from the contribution of the scaling methodology is difficult.

Honest Assessment of NAS

What NAS does well:

  • Finds good architectures within a constrained search space
  • Removes some human bias in architecture design
  • Compound scaling (from EfficientNet) is a genuine contribution

What NAS does poorly:

  • The search space itself is designed by humans, baking in strong priors
  • Search cost can exceed the cost of training the final model many times over
  • Weight sharing introduces ranking errors
  • Many NAS papers compare against weak baselines or use different training recipes
  • For LLMs, the Transformer architecture has held up across scales; NAS has not produced a replacement that wins on matched compute

Common Confusions

Watch Out

NAS searches architectures, not hyperparameters

NAS operates over the structure of the network (number of layers, operation types, connectivity). Hyperparameter optimization (learning rate, batch size, weight decay) is a separate problem. Some frameworks combine both, but the distinction matters for understanding what NAS actually automates.

Watch Out

DARTS is not truly differentiable over architectures

DARTS makes the relaxed problem differentiable, but the final architecture is obtained by discretizing (argmax). The discretization gap means the relaxed optimum may not correspond to a good discrete architecture. This is the source of the skip-connection collapse problem.

Canonical Examples

Example

DARTS search space

Consider a cell with 4 intermediate nodes. Each node receives input from all previous nodes and the two cell inputs. For each edge, there are 7 candidate operations: $3 \times 3$ separable conv, $5 \times 5$ separable conv, $3 \times 3$ dilated conv, $5 \times 5$ dilated conv, $3 \times 3$ max pool, $3 \times 3$ average pool, skip connection. With 4 nodes and up to 14 edges, each with 7 choices, the discrete search space has roughly $7^{14} \approx 10^{11.8}$ architectures. DARTS explores this space with just one vector of 7 continuous parameters $\alpha$ per edge, 98 parameters in total.
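A quick sanity check of the count:

```python
import math

# 14 edges, 7 candidate operations each.
num_archs = 7 ** 14
assert num_archs == 678_223_072_849          # about 6.8 * 10^11
assert abs(math.log10(num_archs) - 11.8) < 0.05
```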

Exercises

ExerciseCore

Problem

In DARTS, why is the architecture optimized on validation data while network weights are optimized on training data? What would go wrong if both used training data?

ExerciseAdvanced

Problem

A one-shot NAS method evaluates 1000 candidate architectures using shared weights from a supernet. The Kendall rank correlation between shared-weight accuracy and independently trained accuracy is $\tau = 0.3$. Is this sufficient for NAS to find a good architecture? Justify quantitatively.

References

Canonical:

  • Zoph & Le, "Neural Architecture Search with Reinforcement Learning" (ICLR 2017)
  • Liu, Simonyan, Yang, "DARTS: Differentiable Architecture Search" (ICLR 2019)
  • Tan & Le, "EfficientNet: Rethinking Model Scaling for CNNs" (ICML 2019)

Current:

  • Elsken, Metzen, Hutter, "Neural Architecture Search: A Survey" (JMLR 2019), Sections 2-5
  • Li & Talwalkar, "Random Search and Reproducibility for NAS" (UAI 2020)

Next Topics

The ideas from NAS connect to broader AutoML and efficient model design.

Last reviewed: April 2026

Prerequisites

Foundations this topic depends on.