What Each Loss Does
Both losses learn embedding spaces where similar items are close and dissimilar items are far apart. They differ in how many negatives are used per update and how the separation is enforced.
Triplet loss operates on triplets: an anchor $a$, a positive $p$ (same class), and a negative $n$ (different class):

$$\mathcal{L}_{\text{triplet}} = \max\bigl(0,\; d(a, p) - d(a, n) + \alpha\bigr)$$

where $d$ is the embedding distance (typically squared Euclidean) and $\alpha$ is a fixed margin. The loss is zero when the negative is at least $\alpha$ farther from the anchor than the positive. Only one negative is contrasted per triplet.
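The hinge above can be sketched in a few lines of NumPy (a minimal illustration; the function name and the margin value are illustrative, not from a particular library):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet hinge with squared Euclidean distances."""
    d_pos = np.sum((anchor - positive) ** 2)  # anchor-positive distance
    d_neg = np.sum((anchor - negative) ** 2)  # anchor-negative distance
    return max(0.0, d_pos - d_neg + margin)   # zero once the margin is satisfied

# An easy triplet (negative already far away) gives exactly zero loss:
a, p = np.array([1.0, 0.0]), np.array([0.9, 0.1])
print(triplet_loss(a, p, np.array([-1.0, 0.0])))  # 0.0
```

The zero output for the easy triplet is the point: such triplets contribute no gradient, which is why mining matters (discussed below).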
InfoNCE (contrastive loss, NT-Xent) operates on a query $q$, one positive $k^+$, and $K$ negatives $k_1^-, \ldots, k_K^-$ simultaneously:

$$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(\operatorname{sim}(q, k^+)/\tau)}{\exp(\operatorname{sim}(q, k^+)/\tau) + \sum_{i=1}^{K} \exp(\operatorname{sim}(q, k_i^-)/\tau)}$$

where $\operatorname{sim}(\cdot,\cdot)$ is cosine similarity and $\tau$ is a temperature parameter. This is a softmax cross-entropy over $K + 1$ candidates, treating the positive as the correct class.
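The loss can be sketched directly (a minimal NumPy version; it assumes all vectors are L2-normalized so that dot products equal cosine similarities):

```python
import numpy as np

def info_nce(query, positive, negatives, tau=0.1):
    """InfoNCE as a (K+1)-way softmax cross-entropy; index 0 is the positive."""
    logits = np.concatenate(([query @ positive], negatives @ query)) / tau
    m = logits.max()                                   # stabilize the log-sum-exp
    return (m + np.log(np.exp(logits - m).sum())) - logits[0]

q = np.array([1.0, 0.0])
negs = np.array([[0.0, 1.0], [-1.0, 0.0]])     # orthogonal and opposite negatives
print(info_nce(q, q, negs))                    # near 0: positive clearly wins
print(info_nce(q, negs[0], negs))              # near log 2: positive ties a negative
```

When the "positive" is no more similar than a negative, the loss cannot drop below $\log 2$, since the tied negative contributes an equal term to the denominator.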
Why InfoNCE Dominates Modern Self-Supervised Learning
InfoNCE has three structural advantages over triplet loss:
1. All negatives per update. InfoNCE uses all negatives in each batch to compute one loss term. Triplet loss uses one negative per triplet. For a batch of 4096 samples, InfoNCE produces 4096 loss terms each referencing 4095 negatives. Triplet loss produces at most 4096 triplets, each referencing 1 negative. The information efficiency per batch is dramatically higher.
2. Implicit hard negative weighting. The softmax in InfoNCE automatically upweights hard negatives. If $\operatorname{sim}(q, k_i^-)$ is large (hard negative), the term $\exp(\operatorname{sim}(q, k_i^-)/\tau)$ dominates the denominator. Easy negatives with small similarity contribute negligibly. This achieves the effect of hard negative mining without explicit selection.
3. Connection to mutual information. InfoNCE provides a lower bound on the mutual information $I(q; k^+)$:

$$I(q; k^+) \geq \log(K + 1) - \mathcal{L}_{\text{InfoNCE}}$$

This bound tightens as the number of negatives $K$ grows, giving a principled reason to use large batch sizes. SimCLR, MoCo, and CLIP all exploit this property.
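A toy numerical check of the standard bound estimate $\log(K+1) - \mathcal{L}_{\text{InfoNCE}}$ (a sketch: the encoder is replaced by a fixed query, a slightly perturbed positive, and random unit-norm negatives; all values here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)

def info_nce_loss(q, pos, negs, tau=0.1):
    logits = np.concatenate(([q @ pos], negs @ q)) / tau
    m = logits.max()
    return (m + np.log(np.exp(logits - m).sum())) - logits[0]

def unit(v):
    return v / np.linalg.norm(v)

q = unit(rng.normal(size=64))
pos = unit(q + 0.05 * rng.normal(size=64))   # positive: small perturbation of q
bounds = []
for K in (8, 128, 2048):
    negs = rng.normal(size=(K, 64))
    negs /= np.linalg.norm(negs, axis=1, keepdims=True)
    bounds.append(np.log(K + 1) - info_nce_loss(q, pos, negs))
    print(K, round(float(bounds[-1]), 2))    # the bound estimate grows with K
```

Because the encoder here separates positive from negatives well, the loss grows much more slowly than $\log(K+1)$, so the MI estimate keeps rising as negatives are added.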
The Role of Temperature and Margin
Triplet margin $\alpha$: Controls the minimum separation between positive and negative distances. Large $\alpha$ forces wider separation but makes training harder (more triplets have nonzero loss). Small $\alpha$ is easier to satisfy but produces less discriminative embeddings. Typical values: $\alpha \approx 0.2$ (the FaceNet setting) for L2-normalized embeddings.
InfoNCE temperature $\tau$: Controls the sharpness of the similarity distribution. Small $\tau$ makes the softmax peaky, concentrating gradient on the hardest negatives. Large $\tau$ smooths the distribution, treating all negatives more equally. In the limit $\tau \to 0$, only the single most similar negative matters:

$$\lim_{\tau \to 0} \tau \, \mathcal{L}_{\text{InfoNCE}} = \max\Bigl(0,\; \max_i \operatorname{sim}(q, k_i^-) - \operatorname{sim}(q, k^+)\Bigr)$$

This limit resembles a hard-triplet loss with zero margin. Temperature provides a continuous knob between uniform and hard-negative weighting, while the triplet margin is a binary threshold (loss is either zero or positive).
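The effect of temperature on negative weighting can be seen directly (a toy sketch; the similarity values are made up):

```python
import numpy as np

def negative_weights(neg_sims, tau):
    """Fraction of softmax mass (hence gradient) each negative receives."""
    w = np.exp(np.asarray(neg_sims) / tau)
    return w / w.sum()

sims = [0.9, 0.5, 0.1]                   # one hard, one medium, one easy negative
print(negative_weights(sims, tau=1.0))   # roughly uniform weighting
print(negative_weights(sims, tau=0.05))  # nearly all mass on the hard negative
```

At $\tau = 1$ the three negatives share the gradient almost evenly; at $\tau = 0.05$ the hard negative absorbs essentially all of it, approximating hard mining.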
Side-by-Side Comparison
| Property | Triplet Loss | InfoNCE (Contrastive) |
|---|---|---|
| Negatives per update | 1 per triplet | $K$ per query (full batch) |
| Separation mechanism | Fixed margin | Temperature-scaled softmax |
| Hard negative handling | Requires explicit mining | Implicit via softmax weighting |
| Gradient signal | Zero for easy triplets ($d(a,n) \geq d(a,p) + \alpha$) | Always nonzero (softmax never saturates) |
| Batch construction | Mine triplets (expensive) | All pairs in batch (simple) |
| Theory | Margin-based metric learning | Mutual information lower bound |
| Training stability | Sensitive to mining strategy | Stable with large batches |
| Scaling with batch size | Diminishing returns | Log-linear improvement (tighter MI bound) |
| Used in | FaceNet, classic metric learning | SimCLR, MoCo, CLIP, DINO |
Hard Negative Mining for Triplet Loss
Triplet loss performance depends critically on which triplets are selected. Random triplets are mostly easy (the negative already satisfies $d(a, n) > d(a, p) + \alpha$), producing zero loss and zero gradient.
Hard mining: For each anchor $a$, select the negative $n$ closest to $a$ in the current embedding space. This maximizes the loss signal but can lead to collapsed embeddings early in training when the model is poorly calibrated.
Semi-hard mining: Select negatives that are farther than the positive but within the margin: $d(a, p) < d(a, n) < d(a, p) + \alpha$. This provides a stable gradient without the collapse risk of pure hard mining.
Offline vs. online mining: Offline mining precomputes embeddings for the full dataset and selects triplets. Online mining selects triplets within the current batch. Online is more practical but limits the negative pool to batch size.
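Online semi-hard selection within a batch can be sketched as follows (a minimal NumPy version with squared Euclidean distances; the function name and the fallback-to-hardest policy are illustrative choices, not a standard API):

```python
import numpy as np

def semi_hard_negative(anchor, positive, candidates, margin=0.2):
    """Return the index of a semi-hard negative: farther than the positive
    but still inside the margin. Falls back to the hardest candidate."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_negs = np.sum((candidates - anchor) ** 2, axis=1)
    window = np.flatnonzero((d_negs > d_pos) & (d_negs < d_pos + margin))
    if window.size:                            # hardest semi-hard negative
        return int(window[np.argmin(d_negs[window])])
    return int(np.argmin(d_negs))              # fallback: hardest overall

a, p = np.array([0.0, 0.0]), np.array([0.3, 0.0])        # d_pos = 0.09
cands = np.array([[0.4, 0.0], [1.0, 0.0], [0.05, 0.0]])  # d = 0.16, 1.0, 0.0025
print(semi_hard_negative(a, p, cands))  # 0: only 0.16 falls in (0.09, 0.29)
```

Note that the candidate at distance 0.0025 is skipped even though it is the hardest: it is closer than the positive, exactly the kind of negative that destabilizes pure hard mining.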
InfoNCE bypasses this entire problem. Every non-matching pair in the batch is a negative, and the softmax automatically weights them by difficulty.
When Each Wins
InfoNCE: self-supervised pretraining and large-batch training
SimCLR, MoCo, CLIP, and virtually all self-supervised vision methods use InfoNCE or close variants. The ability to exploit thousands of negatives per batch and the theoretical connection to mutual information make it the default for representation learning at scale.
Triplet loss: fine-grained retrieval with curated negatives
When you have domain-specific hard negatives (e.g., similar product images in visual search, near-duplicate documents in retrieval), triplet loss with carefully mined negatives can outperform InfoNCE. The explicit margin provides direct control over the embedding geometry. Face verification (FaceNet) and image retrieval systems often use triplet loss with specialized mining.
Triplet loss: small-batch settings
When batch sizes are small (e.g., few-shot learning, limited GPU memory), InfoNCE's advantage diminishes because the MI bound is loose with few negatives. Triplet loss with semi-hard mining can be more effective in the small-batch regime.
Common Confusions
"Contrastive loss" as used today is not the original contrastive loss
The original contrastive loss (Hadsell et al. 2006) operates on pairs, not batches:

$$\mathcal{L} = y \, \tfrac{1}{2} d^2 + (1 - y) \, \tfrac{1}{2} \max(0, m - d)^2$$

where $y \in \{0, 1\}$ labels the pair as similar or dissimilar, $d$ is the embedding distance, and $m$ is a margin. InfoNCE (Oord et al. 2018) is a multi-class softmax over the full batch. Modern usage of "contrastive loss" usually means InfoNCE or NT-Xent, not the 2006 pairwise formulation.
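For concreteness, the 2006 pairwise form (a minimal NumPy sketch with the common $\tfrac{1}{2}$ factors; the function name is illustrative):

```python
import numpy as np

def pairwise_contrastive(x1, x2, y, m=1.0):
    """Hadsell et al. (2006): pull similar pairs (y=1) together, push
    dissimilar pairs (y=0) apart until they are at least m away."""
    d = np.linalg.norm(x1 - x2)
    return y * 0.5 * d**2 + (1 - y) * 0.5 * max(0.0, m - d)**2

x1, x2 = np.array([0.0, 0.0]), np.array([2.0, 0.0])
print(pairwise_contrastive(x1, x2, y=1))  # 2.0: similar pair far apart, penalized
print(pairwise_contrastive(x1, x2, y=0))  # 0.0: dissimilar pair beyond the margin
```

Unlike InfoNCE, each pair is scored in isolation: there is no denominator comparing the positive against other candidates.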
Large batch size helps InfoNCE but does not help triplet loss equally
Doubling the batch for InfoNCE doubles the number of negatives per query, tightening the MI bound. Doubling the batch for triplet loss enlarges the mining pool, but each triplet still uses one negative. The scaling behavior differs: InfoNCE benefits directly from more negatives, while triplet loss benefits only from better mining over a larger pool, which saturates faster.
Temperature in InfoNCE is not just a hyperparameter to tune
Temperature $\tau$ controls the uniformity-tolerance tradeoff of the learned representation. Very low $\tau$ produces uniform embeddings on the hypersphere but can collapse nearby classes. Very high $\tau$ produces less discriminative but more tolerant embeddings. Wang and Isola (2020) showed this tradeoff formally. It is not a nuisance parameter.
References
- Schroff, F., Kalenichenko, D., and Philbin, J. (2015). "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015. (Triplet loss with semi-hard mining for face verification.)
- Oord, A. v. d., Li, Y., and Vinyals, O. (2018). "Representation Learning with Contrastive Predictive Coding." arXiv:1807.03748. (InfoNCE loss and mutual information interpretation.)
- Chen, T. et al. (2020). "A Simple Framework for Contrastive Learning of Visual Representations." ICML 2020. (SimCLR, NT-Xent loss, large-batch contrastive training.)
- He, K. et al. (2020). "Momentum Contrast for Unsupervised Visual Representation Learning." CVPR 2020. (MoCo, momentum encoder for large negative queues.)
- Radford, A. et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision." ICML 2021. (CLIP, contrastive loss between image and text embeddings.)
- Wang, T. and Isola, P. (2020). "Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere." ICML 2020. (Theory of temperature and representation quality in contrastive learning.)
- Hadsell, R., Chopra, S., and LeCun, Y. (2006). "Dimensionality Reduction by Learning an Invariant Mapping." CVPR 2006. (Original pairwise contrastive loss.)