CNNs for Medical Imaging
U-Net and its 3D successors dominate medical image segmentation; ResNet/DenseNet backbones drive radiology classification. Distribution shift across hospitals is the recurring failure mode.
Why This Matters
Medical images differ from natural images in three ways: pixel intensities carry physical meaning (Hounsfield units, T1/T2 contrast), spatial resolution is anisotropic and modality-dependent, and labeled data is scarce. Architectures that succeed here are the ones that respect these constraints. The encoder-decoder U-Net with skip connections (Ronneberger, Fischer, Brox 2015) became the workhorse for segmentation precisely because it preserves spatial detail through the bottleneck while still aggregating context.
Two clinical use classes dominate. Segmentation: outline an organ, lesion, or cell, then compute volume or shape statistics. Classification: from a chest X-ray or histopathology slide, predict the presence of a finding. Both ride on convolutional neural networks; the failure modes are usually about data, not architecture.
The recurring evaluation trap is confounding by acquisition site. CheXNet-era models often learned scanner artifacts, view markers, or hospital-specific framing rather than the radiological feature itself. Zech et al. (2018, PLOS Medicine) showed pneumonia classifiers transfer poorly across hospitals because they had picked up site-specific calibration patches as features.
Core Ideas
U-Net. Symmetric encoder and decoder with skip connections at each resolution. The encoder downsamples through pooling, the decoder upsamples through transposed convolutions or interpolation, and skips concatenate matched-resolution features so the decoder sees both context (deep layers) and locality (shallow layers). For a ground-truth mask y and predicted probabilities ŷ, the loss is typically a sum of cross entropy and Dice: L = L_CE(y, ŷ) + L_Dice(y, ŷ), where the soft Dice loss is L_Dice = 1 − (2 Σᵢ yᵢŷᵢ) / (Σᵢ yᵢ + Σᵢ ŷᵢ).
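The combined loss above can be sketched in a few lines of NumPy. This is a minimal illustration of the cross entropy + soft Dice sum, not any particular library's implementation; the function names and the epsilon smoothing constants are choices made here for the example.

```python
import numpy as np

def soft_dice_loss(y_true, y_prob, eps=1e-6):
    """Soft Dice loss: 1 - 2|A.B| / (|A|+|B|), on probabilities (eps avoids 0/0)."""
    inter = np.sum(y_true * y_prob)
    return 1.0 - (2.0 * inter + eps) / (np.sum(y_true) + np.sum(y_prob) + eps)

def cross_entropy_loss(y_true, y_prob, eps=1e-12):
    """Binary cross entropy averaged over voxels."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

def combined_loss(y_true, y_prob, dice_weight=1.0):
    """The typical segmentation objective: L_CE + lambda * L_Dice."""
    return cross_entropy_loss(y_true, y_prob) + dice_weight * soft_dice_loss(y_true, y_prob)

# Toy 2D mask and a confident-but-imperfect prediction
y = np.zeros((8, 8)); y[2:5, 2:5] = 1.0
p = np.clip(y * 0.9 + 0.05, 0.0, 1.0)
loss = combined_loss(y, p)
```

In a real pipeline both terms are computed per batch on the network's softmax/sigmoid outputs, and the Dice weight is a tuning knob.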
3D U-Net and volumetric variants. Replace 2D convolutions with 3D for CT and MRI volumes. Memory blows up cubically, so most 3D pipelines work on patches and stitch predictions. V-Net (Milletari et al. 2016) added residual blocks; nnU-Net (Isensee et al. 2021, Nature Methods 18) automated patch size, spacing, and normalization choices and won most public segmentation benchmarks without architectural innovation. The lesson: with medical data, configuration discipline beats novel architectures.
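The patch-and-stitch workflow mentioned above can be sketched as a sliding window that averages overlapping predictions. This is an illustrative NumPy skeleton, not nnU-Net's actual inference code; the cubic patch size, stride, and the thresholding stand-in for a trained 3D network are assumptions of the example.

```python
import numpy as np

def sliding_starts(size, patch, stride):
    """Start indices so patches tile [0, size), including a final end-aligned patch."""
    starts = list(range(0, size - patch + 1, stride))
    if starts[-1] != size - patch:
        starts.append(size - patch)
    return starts

def predict_by_patches(volume, model, patch=32, stride=16):
    """Run `model` on overlapping cubic patches; average predictions where they overlap."""
    acc = np.zeros(volume.shape, dtype=float)
    cnt = np.zeros(volume.shape, dtype=float)
    for z in sliding_starts(volume.shape[0], patch, stride):
        for y in sliding_starts(volume.shape[1], patch, stride):
            for x in sliding_starts(volume.shape[2], patch, stride):
                block = volume[z:z+patch, y:y+patch, x:x+patch]
                acc[z:z+patch, y:y+patch, x:x+patch] += model(block)
                cnt[z:z+patch, y:y+patch, x:x+patch] += 1.0
    return acc / cnt

# Stand-in "model": a per-voxel threshold; a trained 3D U-Net would go here.
vol = np.random.default_rng(0).random((48, 48, 48))
pred = predict_by_patches(vol, lambda b: (b > 0.5).astype(float))
```

Real pipelines usually weight each patch with a Gaussian so voxels near patch borders, where context is truncated, contribute less to the average.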
Classification backbones. ResNet-50 and DenseNet-121 remain the default for chest X-ray and histology classification. CheXNet (Rajpurkar et al. 2017, arXiv:1711.05225) is a DenseNet-121 trained on ChestX-ray14 to predict pneumonia and 13 other findings. ImageNet pretraining still helps despite the domain gap; the early conv filters transfer.
Vision transformers in pathology. Whole-slide histopathology images are gigapixel; attention over patch tokens fits naturally. Models like CTransPath, HIPT, and UNI use self-supervised vision pretraining on millions of slide patches before fine-tuning. ViT-based foundation models now match or exceed CNNs on many pathology tasks, though convolutional baselines remain competitive when labeled data is small.
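The "attention over patch tokens" step starts with turning an image region into a sequence of flattened patches. A minimal sketch of that tokenization (the 16-pixel patch size and the all-zeros stand-in slide region are assumptions; real pathology pipelines tile the gigapixel slide first):

```python
import numpy as np

def patchify(region, patch=16):
    """Split an HxWx3 image into non-overlapping patch tokens, shape (N, patch*patch*3)."""
    H, W, C = region.shape
    assert H % patch == 0 and W % patch == 0
    t = region.reshape(H // patch, patch, W // patch, patch, C)
    t = t.transpose(0, 2, 1, 3, 4)           # (nH, nW, patch, patch, C)
    return t.reshape(-1, patch * patch * C)  # one flattened token per patch

region = np.zeros((64, 64, 3), dtype=np.uint8)  # stand-in for a tile from a slide
tokens = patchify(region)                        # 4x4 grid -> 16 tokens of length 768
```

In a ViT each token is then linearly projected to the model dimension and given a positional embedding before the attention layers.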
Distribution shift and external validation. The single most important methodology rule: report performance on at least one external hospital. Internal AUC of 0.95 and external AUC of 0.75 is the typical gap. Causes include scanner manufacturer differences, demographic shift, and label noise from differing radiologist conventions. Stain normalization (Macenko, Reinhard) helps for histopathology; intensity standardization and bias-field correction help for MRI. Domain randomization at training time and test-time adaptation are the active research directions.
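The intensity standardization mentioned above is, in its simplest form, a per-volume z-score, optionally restricted to a body mask so background air does not dominate the statistics. A minimal sketch (the synthetic "two scanners" below are illustrative, simulating a calibration offset and scale difference between sites):

```python
import numpy as np

def zscore_standardize(volume, mask=None):
    """Map intensities to zero mean / unit std, computed over `mask` if given."""
    vals = volume[mask] if mask is not None else volume
    return (volume - vals.mean()) / (vals.std() + 1e-8)

# Simulate two sites whose scanners differ in calibration offset and gain
rng = np.random.default_rng(1)
site_a = rng.normal(0.0, 1.0, (16, 16, 16))
site_b = 100.0 + 25.0 * rng.normal(0.0, 1.0, (16, 16, 16))

a = zscore_standardize(site_a)
b = zscore_standardize(site_b)   # after standardization the sites share a scale
```

This removes gross per-site offset and gain but not texture- or artifact-level differences, which is why stain normalization, bias-field correction, and external validation are still needed.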
Common Confusions
Dice loss is not symmetric to cross entropy
Dice optimizes the F1-like overlap and is robust to class imbalance, which is severe in lesion segmentation (a tumor may occupy well under 1% of voxels). Cross entropy is well calibrated but gets dominated by background gradients. Most pipelines use a weighted sum so the network learns sharp boundaries (cross entropy) and high overlap (Dice).
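A toy calculation makes the asymmetry concrete. Assuming 1000 voxels with a single foreground voxel and a model that predicts near-zero probability everywhere (numbers chosen for illustration):

```python
import numpy as np

# 1000 voxels, one foreground voxel: severe class imbalance
y = np.zeros(1000); y[0] = 1.0
p = np.full(1000, 0.01)   # model effectively predicts "background" everywhere

# Cross entropy: averaged over voxels, dominated by the 999 easy background ones
ce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Soft Dice loss: near its maximum, since overlap with the lesion is ~zero
inter = np.sum(y * p)
dice = 1.0 - 2.0 * inter / (np.sum(y) + np.sum(p))
```

The degenerate all-background prediction looks nearly optimal to cross entropy but is heavily penalized by Dice, which is why the Dice term keeps the network from ignoring small lesions.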
ImageNet pretraining helps even though the images look nothing alike
Early conv filters learn edges, textures, and gradients that are universal. The benefit shrinks for late layers but the cost of training from scratch on small medical datasets is usually worse. Self-supervised pretraining on in-domain images (RadImageNet, MICLe) outperforms ImageNet when available.
Last reviewed: April 18, 2026
Prerequisites
Foundations this topic depends on.
- Convolutional Neural Networks (Layer 3)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Eigenvalues and Eigenvectors (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Vectors, Matrices, and Linear Maps (Layer 0A)
- Object Detection and Segmentation (Layer 3)