LLM Construction
Neural Architecture Search
Automating network architecture design: search spaces, search strategies (RL, evolutionary, differentiable), performance estimation via weight sharing, and the gap between NAS hype and practical gains.
Why This Matters
Human-designed architectures (ResNet, Transformer) work well, but there is no guarantee they are optimal for a given task and compute budget. Neural Architecture Search (NAS) attempts to automate architecture design by searching over a structured space of possible networks.
NAS produced EfficientNet, which achieved state-of-the-art ImageNet accuracy at lower compute than hand-designed alternatives. However, NAS is also one of the most over-hyped areas of ML: the search cost can be enormous, the search spaces are heavily constrained by human priors, and many "NAS-found" architectures differ only marginally from hand-designed ones.
Formal Setup
A NAS problem consists of three components.
Search Space
The search space is the set of all architectures the search can consider. Typically parameterized as a directed acyclic graph where nodes are feature maps and edges are operations (convolution, pooling, skip connection). The space is finite but combinatorially large.
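As a toy illustration of how a structured search space can be encoded and enumerated, consider a chain of three edges with four candidate operations each (the operation names are placeholders, not from any specific paper):

```python
import itertools

# Toy chain-structured search space: 3 edges, each choosing one of 4 ops.
# Operation names are illustrative placeholders.
OPS = ["conv3x3", "conv5x5", "maxpool", "skip"]
NUM_EDGES = 3

# Every architecture is a tuple of per-edge operation choices.
archs = list(itertools.product(OPS, repeat=NUM_EDGES))
# 4^3 = 64 discrete architectures in this tiny space.
```

Real cell-based spaces replace the chain with a DAG, and the count grows combinatorially: each added edge multiplies the number of architectures by the number of candidate operations.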
Search Strategy
The search strategy selects which architectures from the search space to evaluate. Common strategies: reinforcement learning (a controller generates architectures, reward is validation accuracy), evolutionary algorithms (a population of architectures undergoes mutation and selection), and gradient-based optimization (DARTS).
Performance Estimation Strategy
The performance estimation strategy approximates the true validation performance of a candidate architecture without training it fully from scratch. Methods: training for fewer epochs, weight sharing across architectures (supernets), learning curve extrapolation.
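One of these estimators, learning-curve extrapolation, can be sketched in a few lines: fit a simple parametric curve to the first few epochs of validation accuracy and predict the converged value. The functional form acc(t) ≈ a − b/t and the numbers below are illustrative assumptions, not a recommended model:

```python
def extrapolate(epochs, accs, target_epoch):
    """Fit acc(t) ~ a - b/t by least squares on x = 1/t, then predict."""
    xs = [1.0 / t for t in epochs]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(accs) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, accs))
    var = sum((x - mx) ** 2 for x in xs)
    b = -cov / var          # slope of acc against 1/t is -b
    a = my + b * mx         # intercept recovers the asymptote a
    return a - b / target_epoch

# Partial learning curve from a hypothetical 5-epoch run.
epochs = [1, 2, 3, 4, 5]
accs = [0.50, 0.65, 0.70, 0.725, 0.74]
est = extrapolate(epochs, accs, 100)   # estimated accuracy at epoch 100
```

Practical extrapolators (e.g. the Bayesian ones surveyed by Elsken et al.) use richer curve families and uncertainty estimates, but the idea is the same: spend a few epochs per candidate instead of a full training run.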
Search Strategies
Reinforcement Learning (Zoph and Le, 2017)
A recurrent neural network (the "controller") generates architecture descriptions token by token. Each generated architecture is trained to convergence, and the validation accuracy serves as the reward signal. The controller is updated with REINFORCE.
The original NAS paper used 800 GPUs for 28 days. This established NAS as a concept but also demonstrated its impracticality at scale.
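A heavily simplified sketch of the RL loop, with a factorized categorical "controller" in place of Zoph & Le's RNN and a synthetic reward in place of validation accuracy (both substitutions are assumptions made to keep the example short):

```python
import math
import random

random.seed(1)

OPS = ["conv", "pool", "skip"]
NUM_EDGES = 4
# The "controller": one logit per (edge, operation). The real controller is
# an RNN that emits choices sequentially; a factorized distribution suffices
# to show the REINFORCE update.
logits = [[0.0] * len(OPS) for _ in range(NUM_EDGES)]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def sample_arch():
    arch = []
    for row in logits:
        p, r, acc = softmax(row), random.random(), 0.0
        for i, pi in enumerate(p):
            acc += pi
            if r < acc:
                arch.append(i)
                break
        else:
            arch.append(len(p) - 1)
    return arch

def reward(arch):
    # Stand-in for validation accuracy: pretend all-"conv" is the best cell.
    return arch.count(0) / NUM_EDGES

def reinforce_step(lr=0.1, baseline=0.5):
    arch = sample_arch()
    advantage = reward(arch) - baseline
    for e, choice in enumerate(arch):
        p = softmax(logits[e])
        for i in range(len(OPS)):
            # d log pi(choice) / d logit_i for a categorical distribution
            grad = (1.0 if i == choice else 0.0) - p[i]
            logits[e][i] += lr * advantage * grad

for _ in range(2000):
    reinforce_step()

# After training, the controller concentrates mass on the high-reward op.
avg_p_conv = sum(softmax(row)[0] for row in logits) / NUM_EDGES
```

The expensive part in the real method is hidden inside `reward`: each evaluation trained a full network to convergence, which is where the 800-GPU cost came from.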
Evolutionary Methods
Maintain a population of architectures. At each step: select a parent, mutate it (add/remove a layer, change an operation), train the child, and add it to the population if it improves upon the weakest member. AmoebaNet (Real et al., 2019) showed evolutionary NAS matches RL-based NAS at lower cost.
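The loop above can be sketched directly; the fitness function here is a synthetic stand-in for "train the child and measure validation accuracy," and the operation names are placeholders:

```python
import random

random.seed(0)

OPS = ["conv", "pool", "skip"]
ARCH_LEN = 6

def fitness(arch):
    # Synthetic stand-in for validation accuracy after training the child.
    return arch.count("conv") + 0.5 * arch.count("skip")

def mutate(arch):
    """Change the operation at one randomly chosen position."""
    child = list(arch)
    child[random.randrange(len(child))] = random.choice(OPS)
    return tuple(child)

def evolve(pop_size=10, steps=50):
    population = [tuple(random.choice(OPS) for _ in range(ARCH_LEN))
                  for _ in range(pop_size)]
    for _ in range(steps):
        parent = max(random.sample(population, 3), key=fitness)  # tournament
        child = mutate(parent)
        weakest = min(population, key=fitness)
        if fitness(child) > fitness(weakest):
            population[population.index(weakest)] = child
    return max(population, key=fitness)

best = evolve()
```

AmoebaNet's "regularized evolution" differs in one detail from this sketch: it removes the oldest population member rather than the weakest, which keeps the population turning over and avoids early lock-in.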
Differentiable Architecture Search (DARTS)
DARTS Continuous Relaxation
Statement
Let $\mathcal{O}$ be the set of candidate operations for each edge. DARTS replaces the discrete choice with a continuous mixture:

$$\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp(\alpha_o^{(i,j)})}{\sum_{o' \in \mathcal{O}} \exp(\alpha_{o'}^{(i,j)})}\, o(x)$$

where the $\alpha^{(i,j)}$ are architecture parameters. The bilevel optimization is:

$$\min_{\alpha}\ \mathcal{L}_{\text{val}}(w^*(\alpha), \alpha) \quad \text{s.t.} \quad w^*(\alpha) = \arg\min_{w}\ \mathcal{L}_{\text{train}}(w, \alpha)$$

After optimization, the final architecture is obtained by selecting $o^{(i,j)} = \arg\max_{o \in \mathcal{O}} \alpha_o^{(i,j)}$ for each edge.
Intuition
Instead of searching over discrete architectures (combinatorial), DARTS relaxes the problem to a continuous optimization over mixing weights. You jointly train the network weights and the architecture parameters using gradient descent. This reduces NAS from days to hours.
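A scalar-valued sketch of the relaxation, where three toy linear maps stand in for conv/pool/skip operations:

```python
import math

def softmax(a):
    m = max(a)
    exps = [math.exp(v - m) for v in a]
    s = sum(exps)
    return [e / s for e in exps]

# Toy candidate operations on a scalar feature.
ops = [lambda x: 2.0 * x,   # stand-in for a conv
       lambda x: 0.5 * x,   # stand-in for a pool
       lambda x: x]         # skip connection

def mixed_op(x, alphas):
    """DARTS edge output: softmax(alpha)-weighted sum of all candidate ops."""
    return sum(w * op(x) for w, op in zip(softmax(alphas), ops))

alphas = [2.0, -1.0, 0.0]      # learnable architecture parameters
mix = mixed_op(1.0, alphas)    # soft output, dominated by the first op
chosen = max(range(len(alphas)), key=lambda i: alphas[i])
hard = ops[chosen](1.0)        # discretized architecture: argmax op only
```

In the full method every edge of the cell carries its own alpha vector, and gradients reach the alphas through the softmax, so architecture search becomes ordinary gradient descent.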
Proof Sketch
The relaxation is valid because the softmax-weighted sum approaches a hard selection as the $\alpha$ values diverge. In practice, the bilevel optimization is approximated: alternate one step of the $w$ update (on training data) with one step of the $\alpha$ update (on validation data). Liu et al. (2019) showed this approximation works empirically but can be unstable.
Why It Matters
DARTS reduced NAS cost from thousands of GPU-days to a single GPU-day. This made NAS accessible to researchers without massive compute budgets and established differentiable NAS as the dominant paradigm.
Failure Mode
DARTS suffers from collapse: the search often converges to architectures dominated by skip connections and parameter-free operations because these are easy to optimize. The bilevel approximation (one-step unrolling) introduces bias. Several follow-up works (DARTS+, FairDARTS, SDARTS) address collapse by regularizing the architecture parameters.
Weight Sharing and Supernets
Training every candidate architecture from scratch is prohibitively expensive. Weight sharing trains a single large network (the supernet or one-shot model) that contains all candidate architectures as subgraphs. To evaluate a candidate, extract its subgraph and use the shared weights.
The assumption: a subnetwork's performance with shared weights correlates with its performance when trained independently. This assumption often fails. The ranking of architectures under shared weights can differ substantially from their ranking after independent training. This is the main weakness of one-shot NAS.
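The failure is easy to visualize with hypothetical numbers: even a respectable rank correlation can leave the supernet's top pick different from the true best architecture. The accuracies below are invented for illustration.

```python
def kendall_tau(a, b):
    """Kendall rank correlation between two score lists (no ties assumed)."""
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical accuracies for 6 candidate architectures.
shared_weight = [0.70, 0.72, 0.71, 0.68, 0.74, 0.69]  # supernet estimate
standalone    = [0.75, 0.71, 0.78, 0.70, 0.76, 0.73]  # trained from scratch

tau = kendall_tau(shared_weight, standalone)   # moderate rank agreement
best_shared = shared_weight.index(max(shared_weight))
best_true = standalone.index(max(standalone))
# The supernet's top pick (index 4) is not the true best (index 2).
```

This is why papers increasingly report the rank correlation of their proxy, not just the final accuracy of the selected architecture.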
EfficientNet: A NAS Success Story
Tan and Le (2019) used NAS to search over a mobile-sized architecture space, finding EfficientNet-B0. They then applied a compound scaling rule (scale depth, width, and resolution together with fixed ratios) to produce EfficientNet-B1 through B7. EfficientNet-B7 matched the best ImageNet accuracy at the time with 8.4x fewer parameters than the previous state of the art.
The success was partly NAS and partly the scaling rule. Disentangling the contribution of the search from the contribution of the scaling methodology is difficult.
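The compound scaling rule itself is mechanical. The base ratios below (depth 1.2, width 1.1, resolution 1.15 per unit of the coefficient phi) are the ones Tan & Le report; the base depth, width, and resolution are illustrative placeholders:

```python
# EfficientNet-style compound scaling: depth, width, and resolution grow
# together under a single coefficient phi. Ratios from Tan & Le (2019);
# the base_* values below are made up for illustration.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi, base_depth=16, base_width=32, base_resolution=224):
    depth = round(base_depth * ALPHA ** phi)        # number of layers
    width = round(base_width * BETA ** phi)         # channels per layer
    resolution = round(base_resolution * GAMMA ** phi)  # input image size
    return depth, width, resolution
```

Each unit of phi multiplies FLOPs by roughly ALPHA * BETA**2 * GAMMA**2 ≈ 2, which is the constraint the paper imposed when grid-searching the ratios.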
Honest Assessment of NAS
What NAS does well:
- Finds good architectures within a constrained search space
- Removes some human bias in architecture design
- Compound scaling (from EfficientNet) is a genuine contribution
What NAS does poorly:
- The search space itself is designed by humans, baking in strong priors
- Search cost can exceed the cost of training the final model many times over
- Weight sharing introduces ranking errors
- Many NAS papers compare against weak baselines or use different training recipes
- For LLMs, the Transformer architecture has held up across scales; NAS has not produced a replacement that wins on matched compute
Common Confusions
NAS searches architectures, not hyperparameters
NAS operates over the structure of the network (number of layers, operation types, connectivity). Hyperparameter optimization (learning rate, batch size, weight decay) is a separate problem. Some frameworks combine both, but the distinction matters for understanding what NAS actually automates.
DARTS is not truly differentiable over architectures
DARTS makes the relaxed problem differentiable, but the final architecture is obtained by discretizing (argmax). The discretization gap means the relaxed optimum may not correspond to a good discrete architecture. This is the source of the skip-connection collapse problem.
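A minimal numeric illustration of the gap: a mixture of two operations can fit a target that neither operation fits alone, so a low relaxed loss says little about the argmax architecture. The operations and target are contrived for the example.

```python
import math

def softmax(a):
    m = max(a)
    exps = [math.exp(v - m) for v in a]
    s = sum(exps)
    return [e / s for e in exps]

# Two toy candidate ops on a scalar input; the target output is 1.25.
ops = [lambda x: 2.0 * x, lambda x: 0.5 * x]
target = 1.25

alphas = [0.0, 0.0]                       # equal mixing weights
w = softmax(alphas)
mix_out = sum(wi * op(1.0) for wi, op in zip(w, ops))
mix_loss = (mix_out - target) ** 2        # 0.0: the mixture fits perfectly
discrete_losses = [(op(1.0) - target) ** 2 for op in ops]
# Both discrete choices incur loss 0.5625: the relaxed optimum does not
# correspond to any good single-operation architecture.
```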
Canonical Examples
DARTS search space
Consider a cell with 4 intermediate nodes. Each node receives input from all previous nodes and the two cell inputs, giving up to 14 edges. For each edge there are 7 candidate operations: 3×3 and 5×5 separable convolutions, 3×3 and 5×5 dilated convolutions, 3×3 max pooling, 3×3 average pooling, and a skip connection. With 14 edges and 7 choices each, the discrete search space has roughly $7^{14} \approx 7 \times 10^{11}$ operation assignments. DARTS explores this space with only $14 \times 7 = 98$ continuous architecture parameters.
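A quick sanity check of the raw combinatorics (ignoring the restriction that each node ultimately keeps only two inputs):

```python
# 14 edges, 7 candidate operations per edge.
num_edges, num_ops = 14, 7
raw = num_ops ** num_edges
# 678,223,072,849: about 7 x 10^11 operation assignments for one cell.
```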
Exercises
Problem
In DARTS, why is the architecture optimized on validation data while network weights are optimized on training data? What would go wrong if both used training data?
Problem
A one-shot NAS method evaluates 1000 candidate architectures using shared weights from a supernet. The Kendall rank correlation between shared-weight accuracy and independently trained accuracy is $\tau$. Is this sufficient for NAS to find a good architecture? Justify quantitatively.
References
Canonical:
- Zoph & Le, "Neural Architecture Search with Reinforcement Learning" (ICLR 2017)
- Liu, Simonyan, Yang, "DARTS: Differentiable Architecture Search" (ICLR 2019)
- Tan & Le, "EfficientNet: Rethinking Model Scaling for CNNs" (ICML 2019)
Current:
- Elsken, Metzen, Hutter, "Neural Architecture Search: A Survey" (JMLR 2019), Sections 2-5
- Li & Talwalkar, "Random Search and Reproducibility for NAS" (UAI 2020)
Next Topics
The ideas from NAS connect to broader AutoML and efficient model design.
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)