Methodology
Federated Learning
Train a global model without centralizing data. FedAvg, communication efficiency, non-IID convergence challenges, differential privacy integration, and applications in healthcare and mobile computing.
Why This Matters
In many real applications, data cannot be centralized. Hospitals cannot share patient records. Phone keyboards should not upload everything you type to a server. Companies in competing jurisdictions cannot pool their datasets. Federated learning trains a shared model while the data stays on each client device.
The core idea is simple: send the model to the data, not the data to the model. Each client trains locally and sends back model updates (gradients or weight deltas). A central server aggregates these updates. No raw data leaves the client.
Problem Setup
Federated Learning Objective
There are $K$ clients, each with a local dataset $\mathcal{D}_k$ of size $n_k$. The global objective is:

$$\min_w F(w) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(w)$$

where $F_k(w) = \frac{1}{n_k} \sum_{(x_i, y_i) \in \mathcal{D}_k} \ell(w; x_i, y_i)$ and $n = \sum_{k=1}^{K} n_k$. This is a weighted average of local losses, where each client's weight is proportional to its data size.
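The weighted objective can be checked numerically. A minimal sketch, assuming a squared-error loss and two clients of different sizes (all names and data here are illustrative):

```python
import numpy as np

# Sketch of the global federated objective: each client k holds n_k
# examples, and the global loss weights each local loss F_k by n_k / n.

def local_loss(w, X, y):
    """Mean squared error on one client's local dataset."""
    return np.mean((X @ w - y) ** 2)

def global_loss(w, client_data):
    """Weighted average of local losses, weights proportional to n_k."""
    n = sum(len(y) for _, y in client_data)
    return sum(len(y) / n * local_loss(w, X, y) for X, y in client_data)

rng = np.random.default_rng(0)
clients = [(rng.normal(size=(50, 3)), rng.normal(size=50)),    # n_1 = 50
           (rng.normal(size=(200, 3)), rng.normal(size=200))]  # n_2 = 200
w = np.zeros(3)
print(global_loss(w, clients))  # the larger client carries 4x the weight
```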
Communication Round
One communication round consists of: (1) the server sends the current global model $w_t$ to a subset of clients, (2) each selected client trains locally for $E$ epochs starting from $w_t$, (3) clients send updated parameters back to the server, (4) the server aggregates updates to form $w_{t+1}$.
FedAvg: Federated Averaging
The FedAvg algorithm (McMahan et al., 2017):
- Server initializes $w_0$
- For each round $t = 0, 1, \dots, T-1$:
- Server selects a fraction $C$ of clients (typically $C = 0.1$), forming the set $S_t$
- Each selected client $k$ runs SGD for $E$ local epochs on $F_k$, starting from $w_t$
- Client $k$ sends its updated parameters $w_{t+1}^k$ to the server
- Server computes $w_{t+1} = \sum_{k \in S_t} \frac{n_k}{\sum_{j \in S_t} n_j} w_{t+1}^k$
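The loop above can be sketched in a few lines of NumPy. This is an illustrative toy (linear regression, assumed hyperparameters), not McMahan et al.'s experimental setup:

```python
import numpy as np

# Minimal FedAvg sketch: each round, sample a fraction C of clients,
# run E local epochs of SGD on each, then average parameters weighted
# by local dataset size n_k.

def local_sgd(w, X, y, epochs, lr=0.01, batch=10, rng=None):
    w = w.copy()
    for _ in range(epochs):
        idx = rng.permutation(len(y))
        for start in range(0, len(y), batch):
            b = idx[start:start + batch]
            grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= lr * grad
    return w

def fedavg_round(w, clients, C, E, rng):
    m = max(1, int(C * len(clients)))
    chosen = rng.choice(len(clients), size=m, replace=False)
    updates = [(len(clients[k][1]), local_sgd(w, *clients[k], E, rng=rng))
               for k in chosen]
    n_total = sum(n for n, _ in updates)
    return sum(n / n_total * wk for n, wk in updates)  # weighted average

# Toy IID problem: 10 clients sharing the same linear ground truth.
rng = np.random.default_rng(1)
w_true = np.array([1.0, -2.0, 0.5])
clients = []
for _ in range(10):
    X = rng.normal(size=(100, 3))
    clients.append((X, X @ w_true + 0.1 * rng.normal(size=100)))

w = np.zeros(3)
for _ in range(30):
    w = fedavg_round(w, clients, C=0.5, E=2, rng=rng)
print(w)  # approaches w_true under this IID setup
```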
FedAvg Convergence for IID Data
Statement
Under IID data distribution across clients, with $L$-smooth loss and SGD with bounded gradient variance $\sigma^2$, FedAvg with learning rate $\eta$, $E$ local epochs, and $T$ communication rounds satisfies:

$$\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\,\|\nabla F(w_t)\|^2 \le \frac{2\,(F(w_0) - F^*)}{\eta E T} + \eta L \sigma^2 + O(\eta^2 E^2 L^2 \sigma^2)$$

For appropriately chosen $\eta \propto 1/\sqrt{T}$, this gives convergence rate $O(1/\sqrt{T})$ to a stationary point.
Intuition
FedAvg behaves like SGD with large mini-batches (aggregated across clients) but with an extra error term from the local steps. More local epochs (larger $E$) reduce communication but introduce "client drift": each client moves further from the global optimum before averaging. The convergence rate balances communication rounds $T$ with local computation $E$.
Proof Sketch
Decompose the one-round update into a true gradient step plus noise and client drift. The true gradient step makes progress proportional to $\eta E \|\nabla F(w_t)\|^2$. The noise term scales as $\eta^2 E L \sigma^2$ from stochastic gradients. Client drift adds $O(\eta^3 E^3 L^2 \sigma^2)$ because local SGD steps can diverge from the global direction. Sum over $T$ rounds, divide by $\eta E T$, and rearrange.
Why It Matters
This shows FedAvg converges under reasonable assumptions and quantifies the communication-computation tradeoff. More local epochs mean fewer communication rounds are needed, but each round's update is noisier. In practice, $E$ between 1 and 5 works well for IID data.
Failure Mode
The IID assumption is critical and rarely holds. When clients have non-IID data (different label distributions, different feature distributions), client drift becomes much worse and can prevent convergence entirely. The next theorem addresses this.
The Non-IID Problem
In practice, client data is almost never IID. A hospital in a rural area sees different conditions than one in a city. Your phone keyboard reflects your vocabulary, not the population's.
Client Drift Under Non-IID Data
Statement
Define the gradient dissimilarity as $\zeta = \max_k \sup_w \|\nabla F_k(w) - \nabla F(w)\|$. After $\tau$ local SGD steps starting from the same $w_t$, the divergence between client $k$'s local model $w_t^{k,\tau}$ and the ideal global update $\tilde{w}_t^{\tau}$ ($\tau$ steps on $F$) satisfies:

$$\mathbb{E}\,\|w_t^{k,\tau} - \tilde{w}_t^{\tau}\| \le O(\eta \tau \zeta + \eta \sqrt{\tau}\, \sigma)$$

The first term grows linearly with $\tau$ and $\zeta$. Higher data heterogeneity (larger $\zeta$) and more local steps (larger $\tau$) cause greater divergence.
Intuition
Each local SGD step pushes client $k$'s model toward minimizing $F_k$, not $F$. After $\tau$ steps, the client has wandered away from where it would be if it were minimizing the global objective. This drift is the source of FedAvg's difficulty with non-IID data.
Proof Sketch
Expand $\tau$ steps of SGD on $F_k$ starting from $w_t$. Compare with $\tau$ steps on $F$. At each step, the gradient difference is bounded by $\zeta$. Over $\tau$ steps, these differences accumulate linearly (for the bias term) and as $\sqrt{\tau}$ (for the variance term, by martingale concentration).
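The drift's growth with the number of local steps and with heterogeneity shows up even in a toy simulation. The setup below (one-dimensional quadratic local objectives $F_k(w) = (w - c_k)^2$) is an assumed illustration, not tied to any particular paper:

```python
import numpy as np

# Client drift illustration: run tau gradient steps on a client's local
# objective F_k(w) = (w - c_k)^2, and separately tau steps on the global
# F(w) = (w - c_global)^2. The gap between the two trajectories is drift.

def drift(c_k, c_global, tau, lr=0.1):
    w_local = w_global = 0.0
    for _ in range(tau):
        w_local -= lr * 2 * (w_local - c_k)        # step toward local min
        w_global -= lr * 2 * (w_global - c_global)  # step toward global min
    return abs(w_local - w_global)

c_global = 0.0
for zeta in (0.5, 2.0):  # heterogeneity ~ distance |c_k - c_global|
    print([round(drift(c_global + zeta, c_global, tau), 3)
           for tau in (1, 5, 20)])
# drift grows with both the step count tau and the heterogeneity zeta
```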
Why It Matters
This explains why FedAvg struggles with heterogeneous data. Solutions include: reducing $E$ (more communication), using gradient correction methods (SCAFFOLD), or adding a proximal term to the local objective (FedProx).
Failure Mode
The bound assumes bounded gradient dissimilarity, which may be violated when clients have completely disjoint label sets (e.g., one client has only cats, another only dogs). In extreme non-IID settings, even single-step local updates can diverge.
Communication Efficiency
Communication is the bottleneck in federated learning. Sending full model updates over mobile networks is slow and expensive.
Gradient compression: quantize gradients to lower precision (e.g., 1-bit SGD) or sparsify them (send only the top-$k$ gradient entries). SignSGD sends only the sign of each gradient entry.
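Both schemes are a few lines each. These are illustrative sketches of the ideas, not any particular library's implementation:

```python
import numpy as np

# Two gradient-compression sketches: top-k sparsification and
# SignSGD-style 1-bit quantization.

def top_k_sparsify(g, k):
    """Keep only the k largest-magnitude entries; zero out the rest."""
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]
    out[idx] = g[idx]
    return out

def sign_compress(g):
    """1-bit compression: transmit only the sign of each entry."""
    return np.sign(g)

g = np.array([0.9, -0.05, 0.02, -1.3, 0.4])
print(top_k_sparsify(g, 2))  # only the two largest-magnitude entries survive
print(sign_compress(g))      # one bit per coordinate
```

In practice both schemes are usually paired with error feedback (carrying the un-sent residual into the next round) to preserve convergence.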
FedAvg itself is a communication reduction: by doing $E$ local epochs per round, you reduce the number of communication rounds by a factor of roughly $E$ compared to distributed SGD.
Model distillation: instead of sending full model updates, clients send logits or soft labels on a shared public dataset. This compresses the communication from $O(d)$ (the $d$ model parameters) to $O(mc)$ (logits for $c$ classes on $m$ public examples).
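A minimal sketch of the distillation idea, using a toy linear "model" (the setup and dimensions are assumptions for illustration; for real networks the parameter count $d$ is in the millions, far exceeding $m \cdot c$):

```python
import numpy as np

# Federated distillation sketch: each client transmits logits on m
# shared public examples (m * c floats) instead of its d parameters,
# and the server averages the logits.

def client_logits(model_w, X_public):
    return X_public @ model_w  # toy linear model producing class logits

rng = np.random.default_rng(0)
X_public = rng.normal(size=(20, 5))                           # m=20 public examples
client_models = [rng.normal(size=(5, 3)) for _ in range(4)]   # c=3 classes

avg_logits = np.mean([client_logits(w, X_public) for w in client_models],
                     axis=0)
print(avg_logits.shape)  # (20, 3): m * c floats communicated per client
```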
Federated Learning + Differential Privacy
Federated learning does not guarantee privacy by itself. Model updates can leak information about the training data (via gradient inversion attacks). Adding differential privacy provides formal guarantees.
Each client clips their gradient update to norm $S$ and adds Gaussian noise:

$$\tilde{g}_k = \frac{g_k}{\max\left(1, \|g_k\|_2 / S\right)} + \mathcal{N}(0, \sigma_{\text{DP}}^2 S^2 I)$$

The server aggregates the noisy updates. The noise provides $(\epsilon, \delta)$-differential privacy: an adversary observing the aggregated update cannot determine whether any specific data point was in any client's dataset, up to a privacy budget $\epsilon$.

The cost: added noise reduces convergence speed. More privacy (smaller $\epsilon$) means more noise and slower convergence.
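The clip-then-noise step (the Gaussian mechanism) is short. The clipping norm and noise multiplier below are illustrative values, not a calibrated privacy budget:

```python
import numpy as np

# Sketch of per-client update clipping plus Gaussian noise.
# clip_norm plays the role of S; noise_mult * clip_norm is the noise std.

def privatize(update, clip_norm, noise_mult, rng):
    norm = np.linalg.norm(update)
    clipped = update / max(1.0, norm / clip_norm)  # now ||clipped|| <= S
    noise = rng.normal(0.0, noise_mult * clip_norm, size=update.shape)
    return clipped + noise

rng = np.random.default_rng(0)
update = np.array([3.0, 4.0])  # norm 5, exceeds the clip norm
private = privatize(update, clip_norm=1.0, noise_mult=0.5, rng=rng)
print(private)  # rescaled to norm 1, then perturbed
```

Note that clipping bounds any single client's influence on the aggregate (its sensitivity), which is what makes the added Gaussian noise yield a formal $(\epsilon, \delta)$ guarantee.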
Common Confusions
Federated learning does not guarantee privacy
Raw model updates can reveal training data. Gradient inversion attacks can reconstruct input images from gradients with high fidelity. Federated learning must be combined with differential privacy, secure aggregation, or both to provide meaningful privacy guarantees.
FedAvg is not the same as distributed SGD
In distributed SGD, workers compute gradients on different data shards and average them every step. In FedAvg, clients run multiple local SGD steps before averaging. This distinction matters: FedAvg introduces client drift, which distributed SGD does not have (because it averages every step).
Canonical Examples
Mobile keyboard next-word prediction
Google's Gboard uses federated learning for next-word prediction. Each phone trains locally on the user's typing data. Model updates are sent to a server during idle charging time. The server averages updates across millions of devices. No individual keystrokes leave the phone. The model improves globally while each user's data remains local.
Summary
- Federated learning sends the model to the data, not data to the model
- FedAvg: local SGD for $E$ epochs, then average parameters across clients
- IID data gives convergence rate $O(1/\sqrt{T})$; non-IID data causes client drift
- Communication efficiency comes from local computation, gradient compression, or distillation
- Privacy requires additional mechanisms: differential privacy or secure aggregation
- Non-IID data is the primary open challenge
Exercises
Problem
You have 100 clients, each with 1000 examples. You run FedAvg with client fraction $C = 0.1$ (selecting 10% of clients per round) and $E$ local epochs. How many total gradient steps are performed across all clients in one communication round? Express your answer in terms of $E$ and the local batch size $B$.
Problem
Client A has data that is entirely class 0. Client B has data that is entirely class 1. Both start from the same initial model. After one round of FedAvg with $E$ local epochs, the averaged model performs poorly on both classes. Explain why, and propose a modification to FedAvg that would help.
References
Canonical:
- McMahan et al., "Communication-Efficient Learning of Deep Networks from Decentralized Data" (FedAvg, 2017)
- Kairouz et al., "Advances and Open Problems in Federated Learning" (2021), Sections 1-4
Current:
- Li et al., "Federated Optimization in Heterogeneous Networks" (FedProx, 2020)
- Karimireddy et al., "SCAFFOLD: Stochastic Controlled Averaging for Federated Learning" (2020)
- Abadi et al., "Deep Learning with Differential Privacy" (2016), Section 3
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 7-8
Next Topics
- This topic builds on distributed training theory for the foundations of parallel optimization
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Distributed Training Theory (Layer 5)
- Optimizer Theory: SGD, Adam, and Muon (Layer 3)
- Convex Optimization Basics (Layer 1)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Adam Optimizer (Layer 2)
- Gradient Descent Variants (Layer 1)
- Stochastic Gradient Descent Convergence (Layer 2)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)