ML Methods
Recommender Systems
User-item interaction modeling via matrix factorization, collaborative filtering, and content-based methods: the math of SVD-based recommendations, cold start, implicit feedback, and why evaluation is harder than the model.
Why This Matters
Recommender systems drive engagement on most consumer internet products: Netflix, YouTube, Amazon, Spotify. The underlying problem is matrix completion: given a sparse user-item interaction matrix, predict the missing entries.
The mathematical core is low-rank matrix approximation via singular value decomposition. The practical difficulty is that the data is sparse, biased, and implicit. Understanding both the clean theory and the messy reality is necessary.
Mental Model
You have a matrix $R$ of size $m \times n$ where rows are users and columns are items. Most entries are missing. The assumption is that $R$ is approximately low-rank: user preferences are determined by a small number of latent factors. Factoring $R \approx U V^\top$ where $U \in \mathbb{R}^{m \times k}$ and $V \in \mathbb{R}^{n \times k}$ gives you $k$-dimensional representations of users and items.
Collaborative Filtering
Predict a user's preference for an item based on the preferences of similar users (user-based CF) or similar items (item-based CF). No content features are needed: the signal comes entirely from the interaction pattern.
User-based CF: to predict user $u$'s rating for item $i$, find the $k$ users most similar to $u$ (by cosine similarity or Pearson correlation of their rating vectors) and average their ratings for item $i$.
Item-based CF replaces user similarity with item similarity. In practice, item-based CF is more stable because item similarity changes less over time than user similarity.
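A minimal sketch of user-based CF on a toy ratings matrix (the matrix and the similarity-weighted average are illustrative assumptions, not from the text; zeros denote unobserved entries):

```python
import numpy as np

# Toy ratings matrix: rows are users, columns are items, 0 = unobserved.
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine_sim(a, b):
    """Cosine similarity computed over co-rated entries only."""
    mask = (a > 0) & (b > 0)
    if not mask.any():
        return 0.0
    a, b = a[mask], b[mask]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_user_based(R, u, i, k=2):
    """Predict user u's rating for item i from the k most similar users who rated i."""
    sims = np.array([cosine_sim(R[u], R[v]) if v != u else -np.inf
                     for v in range(R.shape[0])])
    # Keep only neighbors who actually rated item i, most similar first.
    neighbors = [v for v in np.argsort(sims)[::-1] if R[v, i] > 0][:k]
    if not neighbors:
        return 0.0
    w = sims[neighbors]
    return float(w @ R[neighbors, i] / w.sum())

print(predict_user_based(R, u=0, i=2))
```

User 0's nearest neighbor (user 1) rated item 2 low, so the prediction lands well below the maximum rating.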
Matrix Factorization
Low-Rank Matrix Factorization
Decompose the rating matrix into the product of two low-rank matrices by minimizing squared error over the observed entries:

$$\min_{U, V} \sum_{(u,i) \in \Omega} \left( r_{ui} - u_u^\top v_i \right)^2 + \lambda \left( \|u_u\|^2 + \|v_i\|^2 \right)$$

where $\Omega$ is the set of observed ratings, $u_u$ is user $u$'s latent vector, $v_i$ is item $i$'s latent vector, and $\lambda$ controls regularization.
Eckart-Young-Mirsky Theorem
Statement
Let $A = U \Sigma V^\top$ be the SVD of $A$ with singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$. The best rank-$k$ approximation in Frobenius norm is $A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^\top$ (the top $k$ singular vectors), and the error is:

$$\|A - A_k\|_F^2 = \sum_{i=k+1}^{r} \sigma_i^2$$
Intuition
The SVD finds orthogonal directions of maximum variance. Keeping only the top $k$ directions gives the closest rank-$k$ approximation. Each discarded singular value contributes its square to the error.
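The theorem's error formula can be checked numerically on a random dense matrix (a sketch; the matrix here is arbitrary test data):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 6))

# Thin SVD: singular values are returned in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]          # best rank-k approximation
err = np.linalg.norm(A - A_k, "fro") ** 2     # squared Frobenius error

# Eckart-Young-Mirsky: the error equals the sum of discarded squared singular values.
print(err, np.sum(s[k:] ** 2))
```

The two printed numbers agree to floating-point precision.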
Proof Sketch
By the unitary invariance of the Frobenius norm, the problem reduces to approximating the diagonal matrix $\Sigma$. The best rank-$k$ diagonal approximation keeps the $k$ largest entries and zeros the rest.
Why It Matters
This theorem justifies why truncated SVD is the starting point for matrix factorization in recommender systems. If user preferences truly depend on latent factors, the SVD recovers them optimally when the matrix is fully observed.
Failure Mode
The Eckart-Young theorem applies to fully observed matrices. Rating matrices are extremely sparse (often less than 1% observed). Applying SVD to a matrix filled with zeros for missing entries is wrong: it treats missing as zero. This is why alternating least squares (ALS) and SGD-based methods that only optimize over observed entries are preferred in practice.
Alternating Least Squares (ALS): Fix $V$, solve for $U$ in closed form (a ridge regression per user). Fix $U$, solve for $V$. Alternate until convergence. Each step is a convex problem, but the joint problem is non-convex.
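A compact ALS sketch that optimizes only over observed entries (the toy matrix, rank, and regularization strength are illustrative choices):

```python
import numpy as np

def als(R, mask, k=2, lam=0.1, iters=20, seed=0):
    """ALS matrix factorization over observed entries only.
    R: ratings matrix; mask: boolean matrix marking observed entries."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = rng.normal(scale=0.1, size=(m, k))
    V = rng.normal(scale=0.1, size=(n, k))
    reg = lam * np.eye(k)
    for _ in range(iters):
        for u in range(m):                     # fix V: ridge regression per user
            obs = mask[u]
            U[u] = np.linalg.solve(V[obs].T @ V[obs] + reg, V[obs].T @ R[u, obs])
        for i in range(n):                     # fix U: ridge regression per item
            obs = mask[:, i]
            V[i] = np.linalg.solve(U[obs].T @ U[obs] + reg, U[obs].T @ R[obs, i])
    return U, V

R = np.array([[5, 4, 0, 1], [4, 5, 1, 0], [1, 0, 5, 4], [0, 1, 4, 5]], float)
mask = R > 0                  # treat zeros as missing, not as ratings of 0
U, V = als(R, mask)
print(np.round(U @ V.T, 1))   # observed entries are approximately reproduced
```

Note that the missing entries of `R` never enter the objective: the closed-form solves use only the rows and columns selected by `mask`.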
SGD-based: Sample an observed rating $r_{ui}$, compute the gradient, and update $u_u$ and $v_i$ using stochastic gradient descent. This scales to very large datasets.
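The SGD update can be sketched as follows, looping over a list of observed (user, item, rating) triples (hyperparameters and data are illustrative):

```python
import numpy as np

def sgd_mf(ratings, m, n, k=2, lr=0.05, lam=0.02, epochs=200, seed=0):
    """SGD matrix factorization over observed (user, item, rating) triples."""
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(m, k))
    V = rng.normal(scale=0.1, size=(n, k))
    for _ in range(epochs):
        for u, i, r in ratings:
            e = r - U[u] @ V[i]                # prediction error on this rating
            du = e * V[i] - lam * U[u]         # gradients from the current values
            dv = e * U[u] - lam * V[i]
            U[u] += lr * du
            V[i] += lr * dv
    return U, V

ratings = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 1, 5), (2, 2, 5), (2, 3, 4)]
U, V = sgd_mf(ratings, m=3, n=4)
print(round(float(U[0] @ V[0]), 2))  # close to the observed rating of 5
```

As with ALS, only observed triples contribute gradients, so missing entries are never treated as zeros.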
Content-Based Methods
Use item features (genre, text description, image) instead of collaborative signals. Build a user profile from features of items they liked, then score new items by similarity to that profile.
Content-based methods solve the item cold start problem (new items have features even without ratings) but cannot capture collaborative signals (users who liked X also liked Y).
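A content-based sketch: build the user profile as the mean feature vector of liked items and rank all items by cosine similarity to it (the genre features and liked set are hypothetical):

```python
import numpy as np

# Hypothetical one-hot genre features for five items
# (columns: action, comedy, drama).
item_features = np.array([
    [1, 0, 0],
    [1, 0, 1],
    [0, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
], dtype=float)

liked = [0, 1]  # items the user rated highly

# User profile = mean feature vector of liked items.
profile = item_features[liked].mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

scores = np.array([cosine(profile, f) for f in item_features])
ranking = np.argsort(scores)[::-1]
print(ranking)  # action-heavy items rank first
```

A brand-new item with genre features can be scored immediately, which is exactly why content-based methods sidestep item cold start.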
Hybrid Approaches
Combine collaborative and content signals. Common strategies: feature augmentation (add content features to the factorization model), ensemble (average predictions from separate models), or a unified model that jointly learns from both.
Neural Collaborative Filtering
Replace the dot product with a neural network that can learn non-linear interactions between user and item embeddings. The simplest version concatenates the embeddings and passes them through an MLP.
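A minimal forward pass for this concatenate-then-MLP variant, written in plain numpy with randomly initialized weights (training is omitted; all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d, hidden = 10, 20, 8, 16

# Embedding tables and a one-hidden-layer MLP (random init; no training here).
user_emb = rng.normal(scale=0.1, size=(n_users, d))
item_emb = rng.normal(scale=0.1, size=(n_items, d))
W1 = rng.normal(scale=0.1, size=(2 * d, hidden))
b1 = np.zeros(hidden)
w2 = rng.normal(scale=0.1, size=hidden)

def ncf_score(u, i):
    """Concatenate user and item embeddings, pass through ReLU MLP, sigmoid output."""
    x = np.concatenate([user_emb[u], item_emb[i]])
    h = np.maximum(0.0, x @ W1 + b1)           # non-linear interaction layer
    return float(1 / (1 + np.exp(-(h @ w2))))  # interaction probability

print(ncf_score(0, 5))
```

The dot product of matrix factorization is the special case where the "network" is a fixed bilinear form; the MLP can in principle learn interactions the dot product cannot.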
Whether the added expressiveness of neural models justifies the training cost over simple matrix factorization remains context-dependent. On many benchmarks, well-tuned matrix factorization is competitive with neural methods.
The Cold Start Problem
Cold start has two distinct variants
User cold start: a new user has no interaction history. Item cold start: a new item has no ratings. These require different solutions. User cold start can be addressed by asking for initial preferences or using demographic features. Item cold start can be addressed by content-based features. Pure collaborative filtering cannot solve either.
Evaluation Metrics
Precision at k
Of the top $k$ items recommended, the fraction that the user actually interacted with:

$$\text{Precision@}k = \frac{|\{\text{top-}k \text{ recommended}\} \cap \{\text{relevant}\}|}{k}$$
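The metric is a one-liner in practice (the recommended list and relevant set below are made-up examples):

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    top_k = recommended[:k]
    return sum(1 for item in top_k if item in relevant) / k

recommended = [3, 1, 7, 5, 9]   # ranked recommendation list
relevant = {1, 5, 2}            # items the user actually interacted with
print(precision_at_k(recommended, relevant, k=5))  # 2 of 5 are relevant -> 0.4
```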
Normalized Discounted Cumulative Gain
A rank-weighted metric that gives more credit to relevant items appearing higher in the list:

$$\text{DCG@}k = \sum_{p=1}^{k} \frac{rel_p}{\log_2(p + 1)}$$

where $rel_p$ is the relevance of the item at position $p$. NDCG normalizes by the ideal DCG (the DCG of a perfectly ranked list).
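A direct implementation of DCG and its normalized form (the relevance lists are toy examples):

```python
import numpy as np

def dcg(relevances):
    """Discounted cumulative gain: sum of rel_p / log2(p + 1) over positions p = 1, 2, ..."""
    positions = np.arange(1, len(relevances) + 1)
    return float(np.sum(np.asarray(relevances, dtype=float) / np.log2(positions + 1)))

def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending-relevance) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 0, 1]))  # near-ideal ranking -> close to 1
print(ndcg([0, 1, 2, 3]))  # worst-case ranking -> well below 1
```

Swapping two items near the top of the list moves NDCG more than the same swap near the bottom, which is exactly the rank-weighting the formula encodes.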
Common Confusions
Implicit feedback is not the same as explicit ratings
Most real systems observe clicks, views, and purchases, not star ratings. A user not clicking an item does not mean they dislike it; they may not have seen it. This means the "negative" class is contaminated with unseen positives. Methods that treat all non-interactions as negatives (e.g., naive SGD on implicit data) learn a biased model. Techniques like BPR (Bayesian Personalized Ranking) or weighted matrix factorization address this.
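A sketch of a single BPR-style update, which optimizes a pairwise ranking objective (push an interacted item $i$ above a sampled non-interacted item $j$) rather than labeling $j$ a hard negative; sizes, learning rate, and the repeated triple are illustrative:

```python
import numpy as np

def bpr_step(U, V, u, i, j, lr=0.05, lam=0.01):
    """One SGD step on the BPR objective -log sigmoid(x_uij) for user u,
    positive item i, and sampled (not-necessarily-negative) item j."""
    x_uij = U[u] @ (V[i] - V[j])          # score margin between i and j
    g = 1.0 / (1.0 + np.exp(x_uij))       # gradient coefficient of the loss
    du = g * (V[i] - V[j]) - lam * U[u]   # gradients from current values
    di = g * U[u] - lam * V[i]
    dj = -g * U[u] - lam * V[j]
    U[u] += lr * du
    V[i] += lr * di
    V[j] += lr * dj

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(3, 4))
V = rng.normal(scale=0.1, size=(5, 4))
for _ in range(200):
    bpr_step(U, V, u=0, i=1, j=2)   # repeatedly sample the same (u, i, j) triple
print(bool(U[0] @ V[1] > U[0] @ V[2]))   # True: the positive item now outranks j
```

Only the relative ordering of $i$ and $j$ is trained, so an unseen item is never forced toward a rating of zero.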
Offline metrics do not predict online performance
A model with higher NDCG on a held-out test set can perform worse in production due to position bias (users click higher-ranked items regardless of relevance), popularity bias (test sets over-represent popular items), and feedback loops (the model trained on data generated by a previous model). A/B testing is the only reliable evaluation.
Key Takeaways
- The core problem is matrix completion under a low-rank assumption
- SVD gives the optimal low-rank approximation for fully observed matrices, but rating matrices are sparse
- ALS and SGD optimize only over observed entries, avoiding the missing-as-zero error
- Content-based methods handle cold start but miss collaborative patterns
- Implicit feedback (clicks, views) is structurally different from explicit ratings
- Offline metrics like NDCG do not reliably predict online performance
Exercises
Problem
A rating matrix has 1,000 users and 5,000 items. Only 0.5% of entries are observed. You use rank-20 matrix factorization. How many parameters does the model have? Compare this to the number of observed entries.
Problem
Why does applying standard SVD to a sparse rating matrix (filling missing entries with zeros) produce poor recommendations? Describe what the singular vectors capture in this case.
References
Canonical:
- Koren, Bell, Volinsky, "Matrix Factorization Techniques for Recommender Systems" (IEEE Computer, 2009)
- Eckart and Young, "The Approximation of One Matrix by Another of Lower Rank" (Psychometrika, 1936)
Current:
- Rendle et al., "BPR: Bayesian Personalized Ranking from Implicit Feedback" (UAI 2009)
- He et al., "Neural Collaborative Filtering" (WWW 2017)
- Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Eigenvalues and Eigenvectors (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)