
Methodology

Federated Learning

Train a global model without centralizing data. FedAvg, communication efficiency, non-IID convergence challenges, differential privacy integration, and applications in healthcare and mobile computing.


Why This Matters

In many real applications, data cannot be centralized. Hospitals cannot share patient records. Phone keyboards should not upload everything you type to a server. Companies in competing jurisdictions cannot pool their datasets. Federated learning trains a shared model while the data stays on each client device.

The core idea is simple: send the model to the data, not the data to the model. Each client trains locally and sends back model updates (gradients or weight deltas). A central server aggregates these updates. No raw data leaves the client.

Problem Setup

Definition

Federated Learning Objective

There are $K$ clients, each with local dataset $\mathcal{D}_k$. The global objective is:

$$\min_\theta F(\theta) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(\theta), \quad F_k(\theta) = \frac{1}{n_k} \sum_{(x,y) \in \mathcal{D}_k} \ell(f_\theta(x), y)$$

where $n_k = |\mathcal{D}_k|$ and $n = \sum_k n_k$. This is a weighted average of local losses, where each client's weight is proportional to its data size.
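
As a quick numeric check, the objective above is just a data-size-weighted average of the per-client losses. A minimal sketch (the loss values and client sizes are made-up numbers):

```python
# Global objective F(theta) = sum_k (n_k / n) * F_k(theta):
# a weighted average of local losses, weights proportional to data size.

def global_objective(local_losses, client_sizes):
    """Combine per-client losses F_k into the federated objective F."""
    n = sum(client_sizes)
    return sum(n_k / n * F_k for F_k, n_k in zip(local_losses, client_sizes))

# Hypothetical values: three clients with different amounts of data.
losses = [0.9, 0.5, 0.2]
sizes = [100, 300, 600]

F = global_objective(losses, sizes)
print(F)  # 0.9*0.1 + 0.5*0.3 + 0.2*0.6 = 0.36
```

Note that the client with the most data (600 examples) dominates the objective, which is exactly the weighting FedAvg uses when it averages parameters.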

Definition

Communication Round

One communication round consists of: (1) the server sends the current global model $\theta^t$ to a subset of clients, (2) each selected client trains locally for $E$ epochs starting from $\theta^t$, (3) clients send updated parameters $\theta_k^{t+1}$ back to the server, (4) the server aggregates updates to form $\theta^{t+1}$.

FedAvg: Federated Averaging

The FedAvg algorithm (McMahan et al., 2017):

  1. Server initializes $\theta^0$
  2. For each round $t = 0, 1, 2, \ldots$:
    • Server selects a fraction $C$ of clients (typically $C = 0.1$)
    • Each selected client $k$ runs SGD for $E$ local epochs on $\mathcal{D}_k$, starting from $\theta^t$
    • Client sends $\theta_k^{t+1}$ to the server
    • Server computes $\theta^{t+1} = \sum_k \frac{n_k}{n} \theta_k^{t+1}$

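
The loop above can be sketched in a few lines of NumPy. This is a toy version under simplifying assumptions — full-batch gradient descent instead of SGD, synthetic least-squares clients, and illustrative helper names (`client_update`, `fedavg`) that are not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def client_update(theta, X, y, epochs, lr=0.1):
    """E local epochs of gradient descent on a least-squares loss."""
    theta = theta.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ theta - y) / len(y)
        theta -= lr * grad
    return theta

def fedavg(clients, rounds=50, frac=0.5, epochs=5):
    """clients: list of (X, y) pairs. Returns the final global model."""
    d = clients[0][0].shape[1]
    theta = np.zeros(d)
    for _ in range(rounds):
        m = max(1, int(frac * len(clients)))
        chosen = rng.choice(len(clients), size=m, replace=False)
        sizes = [len(clients[k][1]) for k in chosen]
        local_models = [client_update(theta, *clients[k], epochs) for k in chosen]
        # Weighted average: theta^{t+1} = sum_k (n_k / n) theta_k^{t+1}
        n = sum(sizes)
        theta = sum(n_k / n * th for n_k, th in zip(sizes, local_models))
    return theta

# Synthetic IID clients sharing one true parameter vector.
true_theta = np.array([1.0, -2.0])
clients = []
for _ in range(10):
    X = rng.normal(size=(50, 2))
    y = X @ true_theta + 0.01 * rng.normal(size=50)
    clients.append((X, y))

theta = fedavg(clients)
print(theta)  # recovers approximately [1.0, -2.0]
```

Because the clients here are IID, the averaged model converges to the shared optimum; the non-IID section below shows where this breaks down.
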
Theorem

FedAvg Convergence for IID Data

Statement

Under IID data distribution across $K$ clients, with $L$-smooth loss and SGD with bounded gradient variance $\sigma^2$, FedAvg with learning rate $\eta$, $E$ local epochs, and $T$ communication rounds satisfies:

$$\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\|\nabla F(\theta^t)\|^2 \leq O\!\left(\frac{F(\theta^0) - F^*}{\eta T} + \eta L \sigma^2 + \eta^2 L^2 E^2 \sigma^2\right)$$

For appropriately chosen $\eta = O(1/(L\sqrt{ET}))$, this gives convergence rate $O(1/\sqrt{ET})$ to a stationary point.

Intuition

FedAvg behaves like SGD with large mini-batches (aggregated across clients) but with an extra error term from the $E$ local steps. More local epochs ($E$) reduce communication but introduce "client drift": each client moves further from the global optimum before averaging. The convergence rate balances communication rounds $T$ with local computation $E$.

Proof Sketch

Decompose the one-round update into a true gradient step plus noise and client drift. The true gradient step makes progress proportional to $\eta \|\nabla F\|^2$. The noise term scales as $\eta^2 L \sigma^2$ from stochastic gradients. Client drift adds $\eta^2 L^2 E^2 \sigma^2$ because $E$ local SGD steps can diverge from the global direction. Sum over $T$ rounds and rearrange.

Why It Matters

This shows FedAvg converges under reasonable assumptions and quantifies the communication-computation tradeoff. A larger $E$ means fewer communication rounds are needed, but each round is noisier. In practice, $E$ between 1 and 5 epochs works well for IID data.

Failure Mode

The IID assumption is critical and rarely holds. When clients have non-IID data (different label distributions, different feature distributions), client drift becomes much worse and can prevent convergence entirely. The next theorem addresses this.

The Non-IID Problem

In practice, client data is almost never IID. A hospital in a rural area sees different conditions than one in a city. Your phone's keyboard reflects your vocabulary, not the population's.

Proposition

Client Drift Under Non-IID Data

Statement

Define the gradient dissimilarity as $\delta = \max_k \|\nabla F_k(\theta) - \nabla F(\theta)\|$. After $E$ local SGD steps starting from the same $\theta$, the divergence between client $k$'s local model and the ideal global update satisfies:

$$\|\theta_k - (\theta - \eta E \nabla F(\theta))\| \leq O(\eta E \delta + \eta \sqrt{E}\,\sigma)$$

The first term grows linearly with $E$ and $\delta$. Higher data heterogeneity ($\delta$) and more local steps ($E$) cause greater divergence.

Intuition

Each local SGD step pushes client $k$'s model toward minimizing $F_k$, not $F$. After $E$ steps, the client has wandered $O(\eta E \delta)$ away from where it would be if it were minimizing the global objective. This drift is the source of FedAvg's difficulty with non-IID data.

Proof Sketch

Expand $E$ steps of SGD on $F_k$ starting from $\theta$. Compare with $E$ steps on $F$. At each step, the gradient difference is bounded by $\delta$. Over $E$ steps, these differences accumulate linearly (for the bias term) and as $\sqrt{E}$ (for the variance term, by martingale concentration).

Why It Matters

This explains why FedAvg struggles with heterogeneous data. Solutions include: reducing $E$ (more communication), using gradient correction methods (SCAFFOLD), or adding a proximal term to the local objective (FedProx).

Failure Mode

The bound assumes bounded gradient dissimilarity, which may be violated when clients have completely disjoint label sets (e.g., one client has only cats, another only dogs). In extreme non-IID settings, even single-step local updates can diverge.
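
The drift in the proposition above is easy to reproduce numerically. A toy sketch (all values illustrative): one client runs $E$ gradient steps on a quadratic $F_k$ whose minimum sits at $+1$, while the global objective — averaging it with a symmetric client whose minimum is $-1$ — is minimized at $0$, so the ideal global update stays at $0$ and the client's displacement from $0$ is exactly the drift:

```python
def local_steps(theta, target, E, lr=0.1):
    """E gradient steps on F_k(theta) = 0.5 * (theta - target)^2."""
    for _ in range(E):
        theta -= lr * (theta - target)
    return theta

# Client A's optimum is +1, client B's is -1; the global optimum is 0.
# Starting from theta = 0, the global gradient is 0, so any movement
# by client A is pure drift toward its own minimum.
theta0 = 0.0
for E in (1, 5, 20):
    drift = abs(local_steps(theta0, +1.0, E) - theta0)
    print(E, round(drift, 3))  # drift = 1 - 0.9^E, growing with E
```

With learning rate 0.1 the drift is $1 - 0.9^E$: about 0.1 after one step, 0.41 after five, and 0.88 after twenty — the linear-in-$E$ growth the bound predicts (until it saturates at the local optimum).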

Communication Efficiency

Communication is the bottleneck in federated learning. Sending full model updates over mobile networks is slow and expensive.

Gradient compression: quantize gradients to lower precision (e.g., 1-bit SGD) or sparsify them (send only the top-$k$ gradient entries). SignSGD sends only the sign of each gradient entry.
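
Both schemes are a few lines on a plain vector. A sketch (error feedback, encoding of indices, and other practical details omitted):

```python
import numpy as np

def top_k_sparsify(grad, k):
    """Keep only the k largest-magnitude entries; zero the rest."""
    out = np.zeros_like(grad)
    idx = np.argsort(np.abs(grad))[-k:]
    out[idx] = grad[idx]
    return out

def sign_compress(grad):
    """SignSGD-style: transmit only the sign of each entry (1 bit each)."""
    return np.sign(grad)

g = np.array([0.5, -0.02, 0.0, -1.3, 0.1])
print(top_k_sparsify(g, 2))  # only -1.3 and 0.5 survive
print(sign_compress(g))      # [ 1. -1.  0. -1.  1.]
```

In practice top-$k$ is usually paired with error feedback (accumulating the dropped entries locally), otherwise the discarded mass is lost forever.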

FedAvg itself is a communication reduction: by doing $E$ local epochs per round, you reduce the number of communication rounds by a factor of $E$ compared to distributed SGD.

Model distillation: instead of sending full model updates, clients send logits or soft labels on a shared public dataset. This compresses the communication to $O(\text{dataset size} \times \text{classes})$ instead of $O(\text{parameters})$.
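
For a sense of scale, compare the per-round payloads under the two schemes. All sizes here are hypothetical:

```python
# Hypothetical sizes: a 10M-parameter model vs. distillation logits
# on a 10,000-example public set with 100 classes, both at 4 bytes/float.
params = 10_000_000
public_examples, classes = 10_000, 100

full_update_bytes = params * 4
distill_bytes = public_examples * classes * 4

print(full_update_bytes // distill_bytes)  # full updates are 10x larger here
```

The ratio depends entirely on the model size versus the public set size: distillation wins for large models and small public sets, and loses when the public set is large.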

Federated Learning + Differential Privacy

Federated learning does not guarantee privacy by itself. Model updates can leak information about the training data (via gradient inversion attacks). Adding differential privacy provides formal guarantees.

Each client clips their gradient update to norm $C$ and adds Gaussian noise:

$$\tilde{g}_k = \mathrm{clip}(g_k, C) + \mathcal{N}(0, \sigma^2 C^2 I)$$

The server aggregates the noisy updates. The noise provides $(\epsilon, \delta)$-differential privacy: an adversary observing the aggregated update cannot determine whether any specific data point was in any client's dataset, up to a privacy budget $(\epsilon, \delta)$.

The cost: added noise reduces convergence speed. More privacy (smaller $\epsilon$) means more noise and slower convergence.
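
The clip-and-noise step can be sketched directly from the formula. The clip norm and noise multiplier below are hypothetical; real deployments derive them from a privacy accountant for a target $(\epsilon, \delta)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_sanitize(update, clip_norm, noise_multiplier):
    """Clip the update to L2 norm C, then add N(0, (sigma*C)^2 I) noise."""
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / max(norm, 1e-12))
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=update.shape)
    return clipped + noise

g = np.array([3.0, 4.0])  # L2 norm 5: will be scaled down to norm 1
private_g = dp_sanitize(g, clip_norm=1.0, noise_multiplier=0.5)
print(private_g)  # clipped direction plus Gaussian noise
```

Clipping bounds each client's influence (the sensitivity), which is what makes the Gaussian noise scale $\sigma C$ sufficient for the privacy guarantee.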

Common Confusions

Watch Out

Federated learning does not guarantee privacy

Raw model updates can reveal training data. Gradient inversion attacks can reconstruct input images from gradients with high fidelity. Federated learning must be combined with differential privacy, secure aggregation, or both to provide meaningful privacy guarantees.

Watch Out

FedAvg is not the same as distributed SGD

In distributed SGD, workers compute gradients on different data shards and average them every step. In FedAvg, clients run multiple local SGD steps before averaging. This distinction matters: FedAvg introduces client drift, which distributed SGD does not have (because it averages every step).

Canonical Examples

Example

Mobile keyboard next-word prediction

Google's Gboard uses federated learning for next-word prediction. Each phone trains locally on the user's typing data. Model updates are sent to a server during idle charging time. The server averages updates across millions of devices. No individual keystrokes leave the phone. The model improves globally while each user's data remains local.

Summary

  • Federated learning sends the model to the data, not data to the model
  • FedAvg: local SGD for $E$ epochs, then average parameters across clients
  • IID data gives convergence rate $O(1/\sqrt{ET})$; non-IID data causes client drift
  • Communication efficiency comes from local computation, gradient compression, or distillation
  • Privacy requires additional mechanisms: differential privacy or secure aggregation
  • Non-IID data is the primary open challenge

Exercises

ExerciseCore

Problem

You have 100 clients, each with 1000 examples. You run FedAvg with $C = 0.1$ (selecting 10% of clients per round) and $E = 5$ local epochs. How many total gradient steps are performed across all clients in one communication round?

ExerciseAdvanced

Problem

Client A has data that is entirely class 0. Client B has data that is entirely class 1. Both start from the same initial model. After one round of FedAvg with $E = 10$ local epochs, the averaged model performs poorly on both classes. Explain why, and propose a modification to FedAvg that would help.

References

Canonical:

  • McMahan et al., "Communication-Efficient Learning of Deep Networks from Decentralized Data" (FedAvg, 2017)
  • Kairouz et al., "Advances and Open Problems in Federated Learning" (2021), Sections 1-4

Current:

  • Li et al., "Federated Optimization in Heterogeneous Networks" (FedProx, 2020)
  • Karimireddy et al., "SCAFFOLD: Stochastic Controlled Averaging for Federated Learning" (2020)
  • Abadi et al., "Deep Learning with Differential Privacy" (2016), Section 3
  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 7-8


Last reviewed: April 2026
