LLM Construction
Model Merging and Weight Averaging
Combining trained models by averaging or interpolating their weights: SWA, SLERP, TIES-Merging, DARE. Why it works (loss landscape mode connectivity), when it fails, and applications to combining specialized models.
Why This Matters
Training a large model is expensive. If you have two models trained for different tasks (one for code, one for math), can you combine them into a single model that does both, without retraining from scratch? Model merging says yes, sometimes, by directly combining their weights.
This works because of a surprising property of neural network loss landscapes: models trained from the same initialization (or fine-tuned from the same base model) often lie in the same loss basin, connected by a path of low loss. Averaging their weights produces a model that lands in this basin and retains capabilities from both.
Model merging is cheap, fast, and requires no additional training data. It has become a standard tool for creating capable open-weight models by combining specialized fine-tunes.
Weight Averaging Basics
Weight Averaging
Given two models with weights $\theta_1$ and $\theta_2$, the simplest merge is linear interpolation:

$$\theta_{\text{merged}} = (1 - \lambda)\,\theta_1 + \lambda\,\theta_2,$$

where $\lambda \in [0, 1]$ controls the mixing ratio. When $\lambda = 1/2$, this is a uniform average. This can be extended to $k$ models:

$$\theta_{\text{merged}} = \sum_{i=1}^{k} \lambda_i\,\theta_i, \qquad \sum_{i=1}^{k} \lambda_i = 1.$$
The obvious question: why would averaging weights produce a good model? Averaging predictions (ensembling) has theoretical justification through bias-variance decomposition. But averaging weights is different. Two models can have identical loss but very different weight configurations, and their average might have terrible loss.
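As a concrete sketch, linear interpolation is applied entry-wise across all parameters. The dict-of-arrays below stands in for a framework state dict; the parameter names are illustrative:

```python
import numpy as np

def linear_merge(theta_1, theta_2, lam=0.5):
    """Interpolate two models' weights: (1 - lam) * theta_1 + lam * theta_2.

    theta_1, theta_2: dicts mapping parameter names to arrays,
    a stand-in for framework state dicts.
    """
    assert theta_1.keys() == theta_2.keys(), "models must share an architecture"
    return {name: (1.0 - lam) * theta_1[name] + lam * theta_2[name]
            for name in theta_1}

# Uniform average (lam = 0.5) of two toy "models".
a = {"w": np.array([1.0, 2.0]), "b": np.array([0.0])}
b = {"w": np.array([3.0, 4.0]), "b": np.array([2.0])}
merged = linear_merge(a, b)   # w -> [2.0, 3.0], b -> [1.0]
```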
Why Merging Works: Mode Connectivity
Linear Mode Connectivity for Fine-tuned Models
Statement
Let $\theta_0$ be a pretrained model, and let $\theta_1$ and $\theta_2$ be two models obtained by fine-tuning $\theta_0$ on different data or with different hyperparameters. Under typical fine-tuning conditions (moderate learning rate, limited epochs), the linear path between $\theta_1$ and $\theta_2$:

$$\theta(\lambda) = (1 - \lambda)\,\theta_1 + \lambda\,\theta_2, \qquad \lambda \in [0, 1],$$

has loss that remains close to the loss of the endpoints:

$$\mathcal{L}(\theta(\lambda)) \le \max\big(\mathcal{L}(\theta_1), \mathcal{L}(\theta_2)\big) + \epsilon$$

for small $\epsilon$. That is, the linear interpolation does not pass through a high-loss barrier.
Intuition
Fine-tuning from a shared initialization makes small adjustments to the weights. Both fine-tuned models stay in the same "valley" of the loss landscape. The straight line between them stays in the valley. If the models had been trained from scratch with different random initializations, they would likely be in different valleys separated by high-loss barriers, and averaging would produce a model on the barrier.
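Connectivity can be checked empirically by scanning the loss along the interpolation path. In the sketch below, `loss_fn` is a hypothetical stand-in for evaluating a model on held-out data, demonstrated on a toy convex loss with a single basin:

```python
import numpy as np

def barrier_scan(theta_1, theta_2, loss_fn, num_points=11):
    """Evaluate loss along the linear path between two weight vectors.

    A flat profile (no spike above the endpoint losses) is evidence of
    linear mode connectivity; a hump in the middle is a loss barrier.
    loss_fn: hypothetical callable mapping a weight vector to a scalar.
    """
    lams = np.linspace(0.0, 1.0, num_points)
    losses = [loss_fn((1 - lam) * theta_1 + lam * theta_2) for lam in lams]
    barrier = max(losses) - max(losses[0], losses[-1])
    return lams, np.array(losses), barrier

# Toy quadratic "loss" centered at (1, 1): both endpoints share one basin,
# so the scan finds no barrier along the straight line between them.
loss = lambda th: float(np.sum((th - 1.0) ** 2))
_, losses, barrier = barrier_scan(np.array([0.5, 1.5]), np.array([1.5, 0.5]), loss)
```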
Why It Matters
Mode connectivity is the theoretical justification for model merging. Without it, weight averaging would be unprincipled. The key condition is shared initialization: models must start from the same pretrained weights. This is why merging works well for fine-tuned variants of the same base model (e.g., two LoRA fine-tunes of Llama 3) but fails for independently trained models.
Failure Mode
Mode connectivity breaks when fine-tuning is too aggressive (high learning rate, many epochs), pushing models into different basins. It also fails when the two models are trained from different initializations. In practice, merging two independently pretrained models (e.g., Llama 3 and Mistral 7B) produces poor results because they occupy different regions of weight space with different internal representations, even if they have the same architecture.
Stochastic Weight Averaging (SWA)
Stochastic Weight Averaging
Statement
Stochastic Weight Averaging (Izmailov et al., 2018) averages the weights visited by SGD during training:

$$\theta_{\text{SWA}} = \frac{1}{K} \sum_{k=1}^{K} \theta_k,$$

where $\theta_k$ are the weights at the end of each training epoch (or at regular intervals). With a cyclic or sufficiently high constant learning rate, SGD explores the periphery of a flat loss basin. SWA moves toward the center of this basin, producing a solution with:
- Lower loss on validation data
- Broader minima (flatter curvature)
- Better calibration
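The checkpoint average can be maintained incrementally, so only one running copy of the weights is kept in memory. A minimal sketch using dicts of numpy arrays as stand-in state dicts (PyTorch ships a maintained implementation in `torch.optim.swa_utils`):

```python
import numpy as np

class SWAAverager:
    """Running average of checkpoints saved during a single training run."""

    def __init__(self):
        self.mean, self.count = None, 0

    def update(self, checkpoint):
        """Fold one checkpoint (dict of parameter arrays) into the average."""
        self.count += 1
        if self.mean is None:
            self.mean = {k: v.astype(float).copy() for k, v in checkpoint.items()}
        else:
            for k, v in checkpoint.items():
                # Incremental mean: m += (x - m) / n
                self.mean[k] += (v - self.mean[k]) / self.count

    def average(self):
        return self.mean

swa = SWAAverager()
for w in ([1.0, 2.0], [3.0, 4.0], [5.0, 6.0]):   # three epoch-end checkpoints
    swa.update({"w": np.array(w)})
# swa.average()["w"] is the mean of the three checkpoints: [3.0, 4.0]
```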
Intuition
SGD with moderate learning rate bounces around a flat region of the loss landscape. Each checkpoint is near the edge of the basin. Averaging many checkpoints produces a point near the center. The center of a flat basin generalizes better because small perturbations to the weights (which correspond to changes in data distribution) cause smaller changes in loss.
Why It Matters
SWA is the simplest model merging method and requires no additional training or data. You just save checkpoints during a single training run and average them. It consistently improves generalization over the final SGD iterate and is nearly free in compute cost.
Failure Mode
SWA assumes the checkpoints are in the same basin. If the learning rate is too high and SGD jumps between basins, averaging can land on a high-loss barrier between them. SWA also requires the model to be in a low-loss region already; it does not help with convergence from a bad initialization.
SLERP: Spherical Linear Interpolation
SLERP for Model Weights
Spherical linear interpolation treats weight vectors as points on a hypersphere and interpolates along the great circle:

$$\text{slerp}(\theta_1, \theta_2; \lambda) = \frac{\sin((1 - \lambda)\,\Omega)}{\sin \Omega}\,\theta_1 + \frac{\sin(\lambda\,\Omega)}{\sin \Omega}\,\theta_2,$$

where $\Omega = \arccos\!\left(\frac{\langle \theta_1, \theta_2 \rangle}{\lVert \theta_1 \rVert\,\lVert \theta_2 \rVert}\right)$ is the angle between the weight vectors.

Why SLERP over linear interpolation? Linear interpolation shrinks the norm of the result: $\lVert (1 - \lambda)\,\theta_1 + \lambda\,\theta_2 \rVert < (1 - \lambda)\lVert \theta_1 \rVert + \lambda \lVert \theta_2 \rVert$ when the vectors are not parallel. SLERP preserves the norm by interpolating along the sphere surface. For neural network weights, where the scale of activations matters, this can produce better results than linear interpolation.
In practice, SLERP is applied layer-by-layer or parameter-group-by-group rather than to the full weight vector. It is the default merging method in community tools like mergekit.
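A sketch of SLERP on flattened weight vectors, with a linear fallback for the degenerate near-parallel case (where the great-circle formula divides by sin(Ω) ≈ 0):

```python
import numpy as np

def slerp(theta_1, theta_2, lam, eps=1e-8):
    """Spherical linear interpolation between two flattened weight vectors."""
    n1, n2 = np.linalg.norm(theta_1), np.linalg.norm(theta_2)
    cos_omega = np.clip(np.dot(theta_1, theta_2) / (n1 * n2), -1.0, 1.0)
    omega = np.arccos(cos_omega)
    if np.sin(omega) < eps:                      # (anti)parallel: degenerate
        return (1 - lam) * theta_1 + lam * theta_2
    w1 = np.sin((1 - lam) * omega) / np.sin(omega)
    w2 = np.sin(lam * omega) / np.sin(omega)
    return w1 * theta_1 + w2 * theta_2

# Orthogonal unit vectors: SLERP keeps norm 1, linear interpolation shrinks it.
a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mid_slerp = slerp(a, b, 0.5)                     # norm stays 1.0
mid_linear = 0.5 * a + 0.5 * b                   # norm shrinks to ~0.707
```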
TIES-Merging
TIES-Merging
TIES-Merging (Yadav et al., 2023) addresses a problem with naive averaging: when merging models, task-specific weight changes from different models can cancel each other out if they point in opposite directions.
The TIES algorithm has three steps:

- Trim: For each model $i$, compute the task vector $\tau_i = \theta_i - \theta_{\text{base}}$ (the change from the base model). Zero out entries with small magnitude (below a threshold), keeping only the most important changes.
- Elect sign: For each weight position, take a majority vote across models on whether the change should be positive or negative. This resolves sign conflicts.
- Merge: Average only the entries that agree with the elected sign. Discard entries that disagree. The merged model is

$$\theta_{\text{merged}} = \theta_{\text{base}} + \lambda\,\tau_m,$$

where $\tau_m$ is the elected-sign average of the trimmed task vectors and $\lambda$ is a scaling factor.
The key insight: naive averaging treats all weight changes equally. TIES recognizes that some changes are noise (small magnitude) and some conflict (opposite signs across models). By trimming noise and resolving conflicts before merging, TIES produces cleaner combinations.
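The three steps can be sketched on flattened task vectors. The `density` and `lam` defaults below are illustrative, not the paper's tuned values:

```python
import numpy as np

def ties_merge(base, task_vectors, density=0.2, lam=1.0):
    """Sketch of TIES-Merging: trim, elect sign, merge sign-agreeing entries.

    base: flattened base-model weights.
    task_vectors: list of flattened task vectors (theta_i - base).
    density: fraction of largest-magnitude entries kept per task vector.
    """
    trimmed = []
    for tv in task_vectors:
        k = max(1, int(density * tv.size))
        thresh = np.sort(np.abs(tv))[-k]          # keep top-k by magnitude
        trimmed.append(np.where(np.abs(tv) >= thresh, tv, 0.0))
    trimmed = np.stack(trimmed)

    # Elect sign per position: the sign with larger total magnitude wins,
    # which is the sign of the sum of the trimmed changes.
    elected = np.sign(trimmed.sum(axis=0))

    # Merge: average only entries whose sign agrees with the elected sign.
    agree = (np.sign(trimmed) == elected) & (trimmed != 0)
    counts = np.maximum(agree.sum(axis=0), 1)
    merged_tv = (trimmed * agree).sum(axis=0) / counts
    return base + lam * merged_tv

# Position 1 has a sign conflict (-2.0 vs +3.0); the larger-magnitude
# positive change wins the vote and the -2.0 entry is discarded.
tv1 = np.array([1.0, -2.0, 0.1, 0.0])
tv2 = np.array([1.0, 3.0, 0.0, 0.1])
merged = ties_merge(np.zeros(4), [tv1, tv2], density=0.5)   # [1.0, 3.0, 0.0, 0.0]
```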
DARE: Drop and Rescale
DARE
DARE (Yu et al., 2024) takes a more aggressive approach to sparsification before merging:

- Compute the task vector $\tau_i = \theta_i - \theta_{\text{base}}$ for each model.
- For each task vector, randomly drop a fraction $p$ of the entries (set them to zero).
- Rescale the remaining entries by $\frac{1}{1 - p}$ to preserve the expected magnitude.
- Merge the sparsified task vectors by averaging.

Formally, each sparsified task vector is

$$\hat{\tau}_i = \frac{m_i \odot \tau_i}{1 - p},$$

where $m_i$ is a binary mask with entries drawn i.i.d. Bernoulli($1 - p$).
Why this works: DARE's premise is that task vectors are highly redundant. Most entries contribute little to the task-specific capability. By randomly dropping entries and rescaling, you preserve the expected contribution while reducing interference between models during merging.
DARE and TIES can be combined: apply DARE's random dropping, then TIES's sign election, then merge.
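A sketch of drop-and-rescale, with a sanity check that the rescaling preserves the task vector in expectation:

```python
import numpy as np

def dare(task_vector, p=0.9, rng=None):
    """Zero a random fraction p of entries, scale survivors by 1/(1-p)
    so the sparsified vector equals the original in expectation."""
    if rng is None:
        rng = np.random.default_rng(0)
    mask = rng.random(task_vector.shape) >= p     # keep with prob 1 - p
    return (mask * task_vector) / (1.0 - p)

def dare_merge(base, task_vectors, p=0.9, lam=1.0, seed=0):
    """Merge models by averaging DARE-sparsified task vectors."""
    rng = np.random.default_rng(seed)
    sparsified = [dare(tv, p=p, rng=rng) for tv in task_vectors]
    return base + lam * np.mean(sparsified, axis=0)

merged = dare_merge(np.zeros(3), [np.array([1.0, 2.0, 3.0])], p=0.5)

# Expectation check: on a long all-ones vector, the mean of the
# sparsified entries stays close to 1.0 despite dropping 90% of them.
est = dare(np.ones(100_000), p=0.9).mean()
```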
Model Soups
Model Soups
"Model soups" (Wortsman et al., 2022) refers to averaging the weights of models trained with different hyperparameters (learning rate, weight decay, augmentation) on the same task. Instead of selecting the best model by validation performance, you average all models that exceed a quality threshold.
The result consistently outperforms the best individual model because:
- Each hyperparameter setting explores a slightly different part of the basin
- Averaging moves toward the center of the basin (similar to SWA)
- The averaged model is more robust to distribution shift than any individual model
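The "greedy soup" variant adds candidates to the average one at a time, keeping each only if held-out performance does not drop. A sketch on flattened weight vectors, where `val_score` is a hypothetical callable returning a validation score (higher is better):

```python
import numpy as np

def greedy_soup(candidates, val_score):
    """Greedy model soup: candidates are weight vectors sorted by their
    individual validation score, best first."""
    soup, n = candidates[0].copy(), 1
    best = val_score(soup)
    for theta in candidates[1:]:
        trial = (soup * n + theta) / (n + 1)   # tentatively add to the average
        score = val_score(trial)
        if score >= best:                      # keep only if validation holds up
            soup, n, best = trial, n + 1, score
    return soup

# Toy "validation": negative squared distance to an ideal point at (1, 1).
score = lambda th: -float(np.sum((th - 1.0) ** 2))
models = [np.array([1.2, 0.8]), np.array([0.8, 1.2]), np.array([5.0, 5.0])]
soup = greedy_soup(models, score)   # averages the first two, rejects the outlier
```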
Applications to LLM Merging
Model merging has become especially popular in the open-weight LLM community:
Combining specialized fine-tunes. Start with a base model (e.g., Llama 3 8B). Fine-tune separately for code, math, and conversation. Merge the three fine-tunes to get a model that does all three. This is much cheaper than training a single model on all three datasets jointly.
Community model development. Open-weight model communities on Hugging Face routinely merge models to combine capabilities. Tools like mergekit provide SLERP, TIES, DARE, and linear interpolation as merge strategies.
Avoiding catastrophic forgetting. Fine-tuning on a specialized dataset often degrades performance on general tasks (catastrophic forgetting). Merging the fine-tuned model with the base model (with the base model weighted more heavily) preserves general capability while adding specialized skill.
Common Confusions
Merging weights is not the same as ensembling predictions
Ensembling runs multiple models and combines their outputs (by averaging predictions or by voting). It works regardless of how the models were trained, but requires running every model at inference time. Weight merging produces a single model. It is much cheaper at inference but only works when the models share a loss basin. These are structurally different operations.
Merging independently trained models usually fails
Two models trained from scratch with different random seeds will have different internal representations. The same concept might be encoded in different neurons. Averaging their weights mixes incompatible representations. Merging requires a shared starting point (same pretrained base model) so that the weight space is aligned.
SLERP is not always better than linear interpolation
SLERP preserves norm and interpolates along the sphere. For some models and tasks, linear interpolation works just as well or better. The theoretical advantage of SLERP (norm preservation) matters most when the models have similar norms and the interpolation path matters. In practice, try both.
Summary
- Weight averaging works because fine-tuned models share a loss basin (mode connectivity)
- Shared initialization is the key requirement; independently trained models cannot be merged
- SWA: average checkpoints from a single run. Cheap, effective, moves toward basin center
- SLERP: spherical interpolation preserving weight norm. Default in community tools
- TIES-Merging: trim small changes, elect sign by majority, merge agreements
- DARE: randomly drop task vector entries, rescale, then merge
- Model soups: average models with different hyperparameters. Beats the best individual model
- Applications: combining specialized LLM fine-tunes, reducing catastrophic forgetting
Exercises
Problem
Explain why averaging the weights of two models trained from the same base model is more likely to succeed than averaging the weights of two models trained from different random initializations. Use the concept of mode connectivity.
Problem
In TIES-Merging, the "trim" step removes small-magnitude entries from each task vector before merging. Explain: (a) why small-magnitude entries should be removed, (b) what happens if the trim threshold is too high, and (c) what happens if the trim threshold is too low.
Problem
You have a base Llama 3 8B model and three specialized fine-tunes: one for Python code generation, one for mathematical reasoning, and one for biomedical text. Design a merging strategy that produces a single model performing well on all three domains. Specify the merging method, any hyperparameters, and how you would evaluate the result.
References
Canonical:
- Izmailov et al., "Averaging Weights Leads to Wider Optima and Better Generalization" (SWA, 2018)
- Wortsman et al., "Model Soups: Averaging Weights of Multiple Fine-tuned Models Improves Accuracy without Increasing Inference Time" (2022)
Current:
- Yadav et al., "TIES-Merging: Resolving Interference When Merging Models" (NeurIPS, 2023)
- Yu et al., "Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch" (DARE, 2024)
- Frankle et al., "Linear Mode Connectivity and the Lottery Ticket Hypothesis" (2020). Mode connectivity theory.
Next Topics
- Transformer architecture: the base architecture for models being merged
- Parameter-efficient fine-tuning: LoRA and other methods that produce mergeable adapters
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Transformer Architecture (Layer 4)
- Attention Mechanism Theory (Layer 4)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Softmax and Numerical Stability (Layer 1)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)