Representation Learning in Cosmology
Self-supervised pretraining on simulations and survey data: contrastive embeddings for galaxy morphology, AstroCLIP-style multimodal models, and cross-survey transfer for photometric redshift and galaxy property estimation.
Why This Matters
Astronomy has a labels-are-expensive problem and a data-are-cheap problem at the same time. Spectroscopic redshifts, morphological classifications, and physical parameters (stellar mass, star-formation rate, metallicity) require expensive follow-up or careful modeling. Imaging, by contrast, is plentiful: surveys produce galaxy cutouts where the only "label" is the pixels themselves.
Self-supervised pretraining converts the imaging glut into a backbone that downstream tasks fine-tune with small labeled sets. The pretrained representation captures morphology, color, and structural features that are useful across photo-z estimation, classification, and anomaly hunting. The same playbook that took NLP from task-specific RNNs to GPT-class foundation models is now standard in survey astronomy.
The cross-survey transfer angle matters because surveys differ in depth, filter set, PSF, and noise. A representation that generalizes from SDSS to DES to LSST avoids retraining from scratch each time a new instrument comes online. Robust embeddings are also what enables similarity-based anomaly search across entire galaxy catalogs.
Core Ideas
Contrastive learning for galaxy morphology. Hayat et al. (2021, ApJL 911; arXiv 2012.13083) applied SimCLR-style contrastive pretraining to SDSS galaxy images. With augmentations covering rotations, flips, and photometric reddening, the learned representation produced photo-z estimates matching supervised baselines while using only a fraction of the labeled data, and morphology classifications competitive with the Galaxy Zoo CNN baseline. The dominant modes in the embedding space corresponded to physical axes (color, size, ellipticity) without supervision.
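The SimCLR-style recipe can be sketched in a few lines: symmetry augmentations that are safe for galaxy cutouts (rotations, flips), plus the NT-Xent loss that pulls two augmented views of the same cutout together against in-batch negatives. This is a minimal numpy illustration under assumed toy shapes, not Hayat et al.'s actual pipeline.

```python
import numpy as np

def augment(img, rng):
    """Random 90-degree rotation and horizontal flip -- symmetry
    augmentations appropriate for galaxy cutouts (orientation on
    the sky carries no physical meaning)."""
    img = np.rot90(img, k=rng.integers(4))
    if rng.random() < 0.5:
        img = np.fliplr(img)
    return img.copy()

def nt_xent_loss(z1, z2, temperature=0.1):
    """NT-Xent (SimCLR) loss: each of the 2N embeddings should be
    most similar to its augmented partner, against 2N-2 negatives."""
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    n = len(z1)
    targets = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(2 * n), targets].mean()
```

Embeddings of two views of the same galaxy (near-duplicates) should yield a much lower loss than embeddings of unrelated galaxies, which is the training signal the backbone optimizes.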
Multimodal foundation models: AstroCLIP and friends. Parker et al. (2024, MNRAS 531; arXiv 2310.03024) trained AstroCLIP on paired galaxy images and spectra from DESI, using a CLIP-style contrastive objective across modalities. The shared embedding supports image-to-spectrum retrieval, spectrum-conditioned image generation, and zero-shot regression on stellar mass and redshift. SpectraGPT and SpectraFM extend the same idea with transformer backbones over tokenized spectra.
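The CLIP-style objective underlying AstroCLIP is a symmetric InfoNCE loss over a batch of paired image and spectrum embeddings: row i of each modality should match row i of the other, and retrieval is just nearest-neighbor search in the shared space. The sketch below is a generic numpy rendering of that objective, not the AstroCLIP codebase; all names and shapes are illustrative.

```python
import numpy as np

def clip_loss(img_emb, spec_emb, temperature=0.07):
    """Symmetric cross-modal InfoNCE: paired image/spectrum rows are
    positives, all other in-batch rows are negatives."""
    a = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    b = spec_emb / np.linalg.norm(spec_emb, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    labels = np.arange(len(a))

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)           # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()            # diagonal = positives

    return 0.5 * (xent(logits) + xent(logits.T))

def retrieve(img_emb, spec_emb):
    """Image-to-spectrum retrieval: index of nearest spectrum by
    cosine similarity in the shared embedding space."""
    a = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    b = spec_emb / np.linalg.norm(spec_emb, axis=1, keepdims=True)
    return np.argmax(a @ b.T, axis=1)
```

Once trained, the same shared space supports the zero-shot uses mentioned above: a regression head or k-NN lookup on the embedding replaces a task-specific model.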
Self-supervised pretraining on simulations. Pretraining on outputs of hydrodynamic simulations (IllustrisTNG, EAGLE) gives the network priors that match physical scaling relations before it ever sees real data. Sarmiento et al. (2021, ApJ 921) and follow-up work used this for stellar-population inference. The risk is the simulation-to-observation gap: representations that overfit to simulator artifacts transfer poorly.
Cross-survey transfer. A representation trained on SDSS imaging and fine-tuned with a few hundred labeled DES galaxies can match models trained from scratch on DES labels (Walmsley et al. 2022, MNRAS 509). The practical impact is on rare-class detection (mergers, ring galaxies, strong lenses) where labeled examples are scarce in any single survey. Foundation models trained jointly on multiple surveys are the natural next step and are under active development for Rubin LSST commissioning.
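The transfer step described above amounts to freezing the pretrained backbone and fitting a small head on a few hundred labeled target-survey examples. A minimal sketch, assuming the backbone's embeddings are already computed: multinomial logistic regression in numpy on synthetic features standing in for those embeddings (the function name and hyperparameters are illustrative, not from Walmsley et al.).

```python
import numpy as np

def fit_linear_head(features, labels, n_classes, lr=0.5, steps=500):
    """Fit a softmax classification head on frozen backbone features
    by full-batch gradient descent -- the cheap fine-tuning step for
    a new survey where labels are scarce."""
    n, d = features.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)    # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        grad = (p - onehot) / n                        # softmax cross-entropy grad
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b
```

For rare classes (mergers, lenses), the same head can be fit with class weighting or replaced by k-NN search in embedding space; the point is that only this small head, not the backbone, needs target-survey labels.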
Common Confusions
Linear probe accuracy is not downstream task accuracy
A high linear-probe score on a benchmark like Galaxy Zoo does not guarantee that the representation transfers to a different survey, a different morphology task, or a regression target like redshift. Evaluation should match the deployment setting: same survey, same noise level, same class distribution. Generic linear-probe leaderboards inflate apparent transfer.
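The failure mode is easy to reproduce on synthetic data: a linear probe fit on clean "source survey" embeddings degrades when the same objects are re-embedded under heavier noise, mimicking a shallower target survey. Everything below is a toy construction to illustrate the point, not real survey embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for pretrained embeddings: two classes separated
# along one embedding axis, plus nuisance dimensions.
n, d = 400, 32
y = rng.integers(2, size=n)
X_source = rng.normal(size=(n, d))
X_source[:, 0] += 4.0 * y

# Linear probe: closed-form ridge regression on centered features,
# labels mapped to +/-1, fit only on the source survey.
mu = X_source.mean(axis=0)
Xc = X_source - mu
w = np.linalg.solve(Xc.T @ Xc + np.eye(d), Xc.T @ (2 * y - 1))

def probe_accuracy(X, y):
    """Accuracy of the frozen source-survey probe on embeddings X."""
    return (((X - mu) @ w > 0).astype(int) == y).mean()

# "Target survey": same objects observed with heavier noise.
X_target = X_source + rng.normal(scale=2.0, size=(n, d))
```

The probe's source-survey score stays high while its target-survey score drops, which is exactly why evaluation should be run under the deployment survey's noise and class distribution rather than read off a generic leaderboard.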
Last reviewed: April 18, 2026
Prerequisites
Foundations this topic depends on.
- Contrastive Learning (Layer 3)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Eigenvalues and Eigenvectors (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Self-Supervised Vision (Layer 4)
- Vision Transformer Lineage (Layer 4)
- Transformer Architecture (Layer 4)
- Attention Mechanism Theory (Layer 4)
- Softmax and Numerical Stability (Layer 1)
- Convolutional Neural Networks (Layer 3)
- Vectors, Matrices, and Linear Maps (Layer 0A)