
ML Methods

Recommender Systems

User-item interaction modeling via matrix factorization, collaborative filtering, and content-based methods: the math of SVD-based recommendations, cold start, implicit feedback, and why evaluation is harder than the model.


Why This Matters

Recommender systems drive engagement on most consumer internet products: Netflix, YouTube, Amazon, Spotify. The underlying problem is matrix completion: given a sparse user-item interaction matrix, predict the missing entries.

The mathematical core is low-rank matrix approximation via singular value decomposition. The practical difficulty is that the data is sparse, biased, and implicit. Understanding both the clean theory and the messy reality is necessary.

Mental Model

You have a matrix $R$ of size $m \times n$ where rows are users and columns are items. Most entries are missing. The assumption is that $R$ is approximately low-rank: user preferences are determined by a small number of latent factors. Factoring $R \approx UV^T$, where $U \in \mathbb{R}^{m \times k}$ and $V \in \mathbb{R}^{n \times k}$, gives you $k$-dimensional representations of users and items.

Collaborative Filtering

Definition

Collaborative Filtering

Predict a user's preference for an item based on the preferences of similar users (user-based CF) or similar items (item-based CF). No content features are needed: the signal comes entirely from the interaction pattern.

User-based CF: to predict user $u$'s rating for item $i$, find the users most similar to $u$ (by cosine similarity or Pearson correlation of their rating vectors) and take a similarity-weighted average of their mean-centered ratings for item $i$:

$$\hat{r}_{ui} = \bar{r}_u + \frac{\sum_{v \in N(u)} \text{sim}(u,v)\,(r_{vi} - \bar{r}_v)}{\sum_{v \in N(u)} |\text{sim}(u,v)|}$$

Item-based CF replaces user similarity with item similarity. In practice, item-based CF is more stable because item similarity changes less over time than user similarity.
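The prediction formula above can be sketched on a toy ratings matrix. Everything here is illustrative (the matrix, the neighborhood size $k=2$); similarities are computed only over co-rated items, which is what real implementations do with sparse data:

```python
import numpy as np

# Toy ratings: 4 users x 3 items, 0 marks a missing rating.
R = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 2.0, 4.0],
    [1.0, 5.0, 2.0],
    [5.0, 4.0, 5.0],
])
mask = R > 0  # observed entries

def user_mean(u):
    # Mean over the ratings user u actually gave.
    return R[u, mask[u]].mean()

def cosine_sim(u, v):
    # Similarity computed only on items both users rated.
    common = mask[u] & mask[v]
    if not common.any():
        return 0.0
    a, b = R[u, common], R[v, common]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict(u, i, k=2):
    # Neighbours: users who rated item i, ranked by similarity to u.
    neighbours = [v for v in range(R.shape[0]) if v != u and mask[v, i]]
    neighbours.sort(key=lambda v: cosine_sim(u, v), reverse=True)
    neighbours = neighbours[:k]
    num = sum(cosine_sim(u, v) * (R[v, i] - user_mean(v)) for v in neighbours)
    den = sum(abs(cosine_sim(u, v)) for v in neighbours)
    return user_mean(u) if den == 0 else user_mean(u) + num / den

print(predict(0, 2))  # prediction for user 0's missing rating of item 2
```

Note the mean-centering: each neighbour contributes its deviation from its own average, which corrects for users who rate systematically high or low.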

Matrix Factorization

Definition

Low-Rank Matrix Factorization

Decompose the rating matrix $R$ into the product of two low-rank matrices:

$$\min_{U, V} \sum_{(u,i) \in \Omega} (r_{ui} - u_u^T v_i)^2 + \lambda\left(\|U\|_F^2 + \|V\|_F^2\right)$$

where $\Omega$ is the set of observed ratings, $u_u \in \mathbb{R}^k$ is user $u$'s latent vector, $v_i \in \mathbb{R}^k$ is item $i$'s latent vector, and $\lambda$ controls regularization.

Theorem

Eckart-Young-Mirsky Theorem

Statement

Let $R = U\Sigma V^T$ be the SVD of $R$ with singular values $\sigma_1 \geq \sigma_2 \geq \cdots$. The best rank-$k$ approximation in Frobenius norm is $R_k = U_k \Sigma_k V_k^T$ (the top $k$ singular vectors), and the error is:

$$\|R - R_k\|_F^2 = \sum_{j=k+1}^{\min(m,n)} \sigma_j^2$$

Intuition

The SVD finds orthogonal directions of maximum variance. Keeping only the top $k$ directions gives the closest low-rank approximation. Each discarded singular value contributes its square to the error.

Proof Sketch

By the unitary invariance of the Frobenius norm, the problem reduces to approximating the diagonal matrix $\Sigma$. The best rank-$k$ diagonal approximation keeps the $k$ largest entries and zeros the rest.

Why It Matters

This theorem justifies why truncated SVD is the starting point for matrix factorization in recommender systems. If user preferences truly depend on $k$ latent factors, the SVD recovers them optimally when the matrix is fully observed.
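The theorem is easy to verify numerically on a fully observed matrix (a small random matrix here, purely for illustration): the squared Frobenius error of the truncated SVD equals the sum of the squared discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
# Fully observed random matrix; the theorem only applies in this setting.
R = rng.normal(size=(8, 6))

U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 3
Rk = (U[:, :k] * s[:k]) @ Vt[:k, :]  # best rank-k approximation

err = np.linalg.norm(R - Rk, "fro") ** 2
tail = np.sum(s[k:] ** 2)  # squared singular values beyond rank k
assert np.isclose(err, tail)  # Eckart-Young-Mirsky equality
```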

Failure Mode

The Eckart-Young theorem applies to fully observed matrices. Rating matrices are extremely sparse (often less than 1% observed). Applying SVD to a matrix filled with zeros for missing entries is wrong: it treats missing as zero. This is why alternating least squares (ALS) and SGD-based methods that only optimize over observed entries are preferred in practice.

Alternating Least Squares (ALS): Fix $V$, solve for $U$ in closed form (a ridge regression per user). Fix $U$, solve for $V$. Alternate until convergence. Each step is a convex problem, but the joint problem is non-convex.

SGD-based: Sample an observed rating $(u, i, r_{ui})$, compute the gradient, and update $u_u$ and $v_i$ via stochastic gradient descent. This scales to very large datasets.
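A minimal sketch of the SGD variant, assuming synthetic low-rank data with a random 30% of entries observed (all sizes and hyperparameters here are illustrative, not tuned). The key property: the loop touches only observed entries, never the missing ones:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 30, 20, 4

# Synthetic low-rank ratings; keep ~30% of entries as "observed".
U_true = rng.normal(size=(m, k))
V_true = rng.normal(size=(n, k))
R = U_true @ V_true.T
obs = [(u, i) for u in range(m) for i in range(n) if rng.random() < 0.3]

U = 0.1 * rng.normal(size=(m, k))
V = 0.1 * rng.normal(size=(n, k))
lr, lam = 0.02, 0.01

for epoch in range(200):
    for u, i in obs:
        err = R[u, i] - U[u] @ V[i]
        # Gradient step on the regularized squared error for this entry only.
        U[u] += lr * (err * V[i] - lam * U[u])
        V[i] += lr * (err * U[u] - lam * V[i])

rmse = np.sqrt(np.mean([(R[u, i] - U[u] @ V[i]) ** 2 for u, i in obs]))
print(rmse)  # training RMSE on observed entries drops well below the data scale
```

An ALS implementation would replace the inner loop with one ridge-regression solve per user row and per item column; the SGD form above trades per-step cost for many cheap updates.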

Content-Based Methods

Use item features (genre, text description, image) instead of collaborative signals. Build a user profile from features of items they liked, then score new items by similarity to that profile.

Content-based methods solve the item cold start problem (new items have features even without ratings) but cannot capture collaborative signals (users who liked X also liked Y).
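A sketch of the profile-building step (feature matrix and genre labels are invented for illustration): the user profile is the mean feature vector of liked items, and any item with features — including a brand-new one — can be scored against it.

```python
import numpy as np

# Hypothetical item features (rows: items, cols: genre one-hots).
items = np.array([
    [1, 0, 0],  # action
    [1, 1, 0],  # action + comedy
    [0, 1, 0],  # comedy
    [0, 0, 1],  # drama
], dtype=float)

liked = [0, 1]  # indices of items the user liked

# User profile: mean feature vector over liked items.
profile = items[liked].mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Score unseen items by similarity to the profile; a new item is
# scorable immediately because it arrives with features.
scores = {i: cosine(profile, items[i]) for i in range(len(items)) if i not in liked}
best = max(scores, key=scores.get)
print(best, scores[best])  # the comedy scores highest for this profile
```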

Hybrid Approaches

Combine collaborative and content signals. Common strategies: feature augmentation (add content features to the factorization model), ensemble (average predictions from separate models), or a unified model that jointly learns from both.

Neural Collaborative Filtering

Replace the dot product $u_u^T v_i$ with a neural network $f(u_u, v_i)$ that can learn non-linear interactions between user and item embeddings. The simplest version concatenates the embeddings and passes them through an MLP.

Whether the added expressiveness of neural models justifies the training cost over simple matrix factorization remains context-dependent. On many benchmarks, well-tuned matrix factorization is competitive with neural methods.

The Cold Start Problem

Watch Out

Cold start has two distinct variants

User cold start: a new user has no interaction history. Item cold start: a new item has no ratings. These require different solutions. User cold start can be addressed by asking for initial preferences or using demographic features. Item cold start can be addressed by content-based features. Pure collaborative filtering cannot solve either.

Evaluation Metrics

Definition

Precision at k

Of the top $k$ items recommended, the fraction that the user actually interacted with: $P@k = |\text{relevant} \cap \text{top-}k| / k$.

Definition

Normalized Discounted Cumulative Gain

A rank-weighted metric that gives more credit to relevant items appearing higher in the list:

$$\text{DCG}@k = \sum_{i=1}^{k} \frac{2^{r_i} - 1}{\log_2(i+1)}$$

where $r_i$ is the relevance of the item at position $i$. NDCG normalizes by the ideal DCG (the perfectly ranked list).
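Both metrics are a few lines of code. The toy recommendation list and relevance grades below are illustrative; note how NDCG rewards placing high-relevance items early, which precision at $k$ ignores:

```python
import numpy as np

def precision_at_k(recommended, relevant, k):
    # Fraction of the top-k recommendations the user interacted with.
    return len(set(recommended[:k]) & set(relevant)) / k

def dcg_at_k(rels, k):
    rels = np.asarray(rels[:k], dtype=float)
    discounts = np.log2(np.arange(2, len(rels) + 2))  # log2(i+1) for i=1..k
    return float(np.sum((2 ** rels - 1) / discounts))

def ndcg_at_k(rels, k):
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

recommended = ["a", "b", "c", "d"]
relevant = {"a", "c"}
print(precision_at_k(recommended, relevant, 3))  # 2 of the top 3 are relevant

rels = [3, 2, 0, 1]  # graded relevance at each rank position
print(ndcg_at_k(rels, 4))  # close to 1: the ranking is nearly ideal
```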

Common Confusions

Watch Out

Implicit feedback is not the same as explicit ratings

Most real systems observe clicks, views, and purchases, not star ratings. A user not clicking an item does not mean they dislike it; they may not have seen it. This means the "negative" class is contaminated with unseen positives. Methods that treat all non-interactions as negatives (e.g., naive SGD on implicit data) learn a biased model. Techniques like BPR (Bayesian Personalized Ranking) or weighted matrix factorization address this.
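The BPR idea can be sketched in a few lines (embeddings and hyperparameters invented for illustration, regularization omitted): instead of labeling a non-interaction as a negative rating, BPR only asks that an interacted item score higher than a sampled non-interacted one, minimizing $-\log \sigma(x_{ui} - x_{uj})$.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 8
u = rng.normal(size=k)        # user embedding
v_pos = rng.normal(size=k)    # an item the user interacted with
v_neg = rng.normal(size=k)    # a sampled non-interacted item

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr = 0.1
for _ in range(100):
    # Pairwise score difference: interacted item vs sampled item.
    x = u @ v_pos - u @ v_neg
    g = sigmoid(-x)  # gradient weight of the loss -log sigmoid(x)
    du = g * (v_pos - v_neg)
    dv_pos = g * u
    dv_neg = -g * u
    u += lr * du
    v_pos += lr * dv_pos
    v_neg += lr * dv_neg

# The model learns a ranking, not an absolute "dislike" for v_neg.
print(u @ v_pos > u @ v_neg)
```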

Watch Out

Offline metrics do not predict online performance

A model with higher NDCG on a held-out test set can perform worse in production due to position bias (users click higher-ranked items regardless of relevance), popularity bias (test sets over-represent popular items), and feedback loops (the model trained on data generated by a previous model). A/B testing is the only reliable evaluation.

Key Takeaways

  • The core problem is matrix completion under a low-rank assumption
  • SVD gives the optimal low-rank approximation for fully observed matrices, but rating matrices are sparse
  • ALS and SGD optimize only over observed entries, avoiding the missing-as-zero error
  • Content-based methods handle cold start but miss collaborative patterns
  • Implicit feedback (clicks, views) is structurally different from explicit ratings
  • Offline metrics like NDCG do not reliably predict online performance

Exercises

Exercise (Core)

Problem

A rating matrix has 1,000 users and 5,000 items. Only 0.5% of entries are observed. You use rank-20 matrix factorization. How many parameters does the model have? Compare this to the number of observed entries.

Exercise (Advanced)

Problem

Why does applying standard SVD to a sparse rating matrix (filling missing entries with zeros) produce poor recommendations? Describe what the singular vectors capture in this case.

References

Canonical:

  • Koren, Bell, Volinsky, "Matrix Factorization Techniques for Recommender Systems" (IEEE Computer, 2009)
  • Eckart and Young, "The Approximation of One Matrix by Another of Lower Rank" (Psychometrika, 1936)

Current:

  • Rendle et al., "BPR: Bayesian Personalized Ranking from Implicit Feedback" (UAI 2009)

  • He et al., "Neural Collaborative Filtering" (WWW 2017)

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14

Last reviewed: April 2026
