
ML Methods

Recommender Systems

User-item interaction modeling via matrix factorization, collaborative filtering, and content-based methods: the math of SVD-based recommendations, cold start, implicit feedback, and why evaluation is harder than the model.


Why This Matters

Recommender systems drive engagement on most consumer internet products: Netflix, YouTube, Amazon, Spotify. The underlying problem is matrix completion: given a sparse user-item interaction matrix, predict the missing entries.

The mathematical core is low-rank matrix approximation via singular value decomposition. The practical difficulty is that the data is sparse, biased, and implicit. Understanding both the clean theory and the messy reality is necessary.

Mental Model

You have a matrix $R$ of size $m \times n$ where rows are users and columns are items. Most entries are missing. The assumption is that $R$ is approximately low-rank: user preferences are determined by a small number of latent factors. Factoring $R \approx UV^T$, where $U \in \mathbb{R}^{m \times k}$ and $V \in \mathbb{R}^{n \times k}$, gives you $k$-dimensional representations of users and items.

Collaborative Filtering

Definition

Collaborative Filtering

Predict a user's preference for an item based on the preferences of similar users (user-based CF) or similar items (item-based CF). No content features are needed: the signal comes entirely from the interaction pattern.

User-based CF: to predict user $u$'s rating for item $i$, find the users most similar to $u$ (by cosine similarity or Pearson correlation of their rating vectors) and take a similarity-weighted average of their mean-centered ratings for item $i$:

$$\hat{r}_{ui} = \bar{r}_u + \frac{\sum_{v \in N(u)} \text{sim}(u,v)\,(r_{vi} - \bar{r}_v)}{\sum_{v \in N(u)} |\text{sim}(u,v)|}$$

Item-based CF replaces user similarity with item similarity. In practice, item-based CF is more stable because item similarity changes less over time than user similarity.
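The prediction formula above can be sketched on a toy ratings matrix. Everything here is illustrative (the matrix, the neighborhood size $k=2$); similarities are computed only over co-rated items, which is what real implementations do with sparse data:

```python
import numpy as np

# Toy ratings: 4 users x 3 items, 0 marks a missing rating.
R = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 2.0, 4.0],
    [1.0, 5.0, 2.0],
    [5.0, 4.0, 5.0],
])
mask = R > 0  # observed entries

def user_mean(u):
    # Mean over the ratings user u actually gave.
    return R[u, mask[u]].mean()

def cosine_sim(u, v):
    # Similarity computed only on items both users rated.
    common = mask[u] & mask[v]
    if not common.any():
        return 0.0
    a, b = R[u, common], R[v, common]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict(u, i, k=2):
    # Neighbours: users who rated item i, ranked by similarity to u.
    neighbours = [v for v in range(R.shape[0]) if v != u and mask[v, i]]
    neighbours.sort(key=lambda v: cosine_sim(u, v), reverse=True)
    neighbours = neighbours[:k]
    num = sum(cosine_sim(u, v) * (R[v, i] - user_mean(v)) for v in neighbours)
    den = sum(abs(cosine_sim(u, v)) for v in neighbours)
    return user_mean(u) if den == 0 else user_mean(u) + num / den

print(predict(0, 2))  # prediction for user 0's missing rating of item 2
```

Note the mean-centering: each neighbour contributes its deviation from its own average, which corrects for users who rate systematically high or low.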

Matrix Factorization

Definition

Low-Rank Matrix Factorization

Decompose the rating matrix $R$ into the product of two low-rank matrices:

$$\min_{U, V} \sum_{(u,i) \in \Omega} (r_{ui} - u_u^T v_i)^2 + \lambda\left(\|U\|_F^2 + \|V\|_F^2\right)$$

where $\Omega$ is the set of observed ratings, $u_u \in \mathbb{R}^k$ is user $u$'s latent vector, $v_i \in \mathbb{R}^k$ is item $i$'s latent vector, and $\lambda$ controls regularization.

Theorem

Eckart-Young-Mirsky Theorem

Statement

Let $R = U\Sigma V^T$ be the SVD of $R$ with singular values $\sigma_1 \geq \sigma_2 \geq \cdots$. The best rank-$k$ approximation in Frobenius norm is $R_k = U_k \Sigma_k V_k^T$ (the top $k$ singular vectors), and the error is:

$$\|R - R_k\|_F^2 = \sum_{j=k+1}^{\min(m,n)} \sigma_j^2$$

Intuition

The SVD finds orthogonal directions of maximum variance. Keeping only the top $k$ directions gives the closest low-rank approximation. Each discarded singular value contributes its square to the error.

Proof Sketch

By the unitary invariance of the Frobenius norm, the problem reduces to approximating the diagonal matrix $\Sigma$. The best rank-$k$ diagonal approximation keeps the $k$ largest entries and zeros the rest.

Why It Matters

This theorem justifies why truncated SVD is the starting point for matrix factorization in recommender systems. If user preferences truly depend on $k$ latent factors, the SVD recovers them optimally when the matrix is fully observed.
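The theorem is easy to verify numerically on a fully observed matrix (a small random matrix here, purely for illustration): the squared Frobenius error of the truncated SVD equals the sum of the squared discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
# Fully observed random matrix; the theorem only applies in this setting.
R = rng.normal(size=(8, 6))

U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 3
Rk = (U[:, :k] * s[:k]) @ Vt[:k, :]  # best rank-k approximation

err = np.linalg.norm(R - Rk, "fro") ** 2
tail = np.sum(s[k:] ** 2)  # squared singular values beyond rank k
assert np.isclose(err, tail)  # Eckart-Young-Mirsky equality
```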

Failure Mode

The Eckart-Young theorem applies to fully observed matrices. Rating matrices are extremely sparse (often less than 1% observed). Applying SVD to a matrix filled with zeros for missing entries is wrong: it treats missing as zero. This is why alternating least squares (ALS) and SGD-based methods that only optimize over observed entries are preferred in practice.

Alternating Least Squares (ALS): Fix $V$, solve for $U$ in closed form (a ridge regression per user). Fix $U$, solve for $V$. Alternate until convergence. Each step is a convex problem, but the joint problem is non-convex.

SGD-based: Sample an observed rating $(u, i, r_{ui})$, compute the gradient, and update $u_u$ and $v_i$ via stochastic gradient descent. This scales to very large datasets.
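A minimal sketch of the SGD variant, assuming synthetic low-rank data with a random 30% of entries observed (all sizes and hyperparameters here are illustrative, not tuned). The key property: the loop touches only observed entries, never the missing ones:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 30, 20, 4

# Synthetic low-rank ratings; keep ~30% of entries as "observed".
U_true = rng.normal(size=(m, k))
V_true = rng.normal(size=(n, k))
R = U_true @ V_true.T
obs = [(u, i) for u in range(m) for i in range(n) if rng.random() < 0.3]

U = 0.1 * rng.normal(size=(m, k))
V = 0.1 * rng.normal(size=(n, k))
lr, lam = 0.02, 0.01

for epoch in range(200):
    for u, i in obs:
        err = R[u, i] - U[u] @ V[i]
        # Gradient step on the regularized squared error for this entry only.
        U[u] += lr * (err * V[i] - lam * U[u])
        V[i] += lr * (err * U[u] - lam * V[i])

rmse = np.sqrt(np.mean([(R[u, i] - U[u] @ V[i]) ** 2 for u, i in obs]))
print(rmse)  # training RMSE on observed entries drops well below the data scale
```

An ALS implementation would replace the inner loop with one ridge-regression solve per user row and per item column; the SGD form above trades per-step cost for many cheap updates.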

Content-Based Methods

Use item features (genre, text description, image) instead of collaborative signals. Build a user profile from features of items they liked, then score new items by similarity to that profile.

Content-based methods solve the item cold start problem (new items have features even without ratings) but cannot capture collaborative signals (users who liked X also liked Y).
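A sketch of the profile-building step (feature matrix and genre labels are invented for illustration): the user profile is the mean feature vector of liked items, and any item with features — including a brand-new one — can be scored against it.

```python
import numpy as np

# Hypothetical item features (rows: items, cols: genre one-hots).
items = np.array([
    [1, 0, 0],  # action
    [1, 1, 0],  # action + comedy
    [0, 1, 0],  # comedy
    [0, 0, 1],  # drama
], dtype=float)

liked = [0, 1]  # indices of items the user liked

# User profile: mean feature vector over liked items.
profile = items[liked].mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Score unseen items by similarity to the profile; a new item is
# scorable immediately because it arrives with features.
scores = {i: cosine(profile, items[i]) for i in range(len(items)) if i not in liked}
best = max(scores, key=scores.get)
print(best, scores[best])  # the comedy scores highest for this profile
```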

Hybrid Approaches

Combine collaborative and content signals. Common strategies: feature augmentation (add content features to the factorization model), ensemble (average predictions from separate models), or a unified model that jointly learns from both.

Neural Collaborative Filtering

Replace the dot product $u_u^T v_i$ with a neural network $f(u_u, v_i)$ that can learn non-linear interactions between user and item embeddings. The simplest version concatenates the embeddings and passes them through an MLP.

Whether the added expressiveness of neural models justifies the training cost over simple matrix factorization remains context-dependent. On many benchmarks, well-tuned matrix factorization is competitive with neural methods.

The Cold Start Problem

Watch Out

Cold start has two distinct variants

User cold start: a new user has no interaction history. Item cold start: a new item has no ratings. These require different solutions. User cold start can be addressed by asking for initial preferences or using demographic features. Item cold start can be addressed by content-based features. Pure collaborative filtering cannot solve either.

Evaluation Metrics

Definition

Precision at k

Of the top $k$ items recommended, the fraction that the user actually interacted with: $P@k = |\text{relevant} \cap \text{top-}k| / k$.

Definition

Normalized Discounted Cumulative Gain

A rank-weighted metric that gives more credit to relevant items appearing higher in the list:

$$\text{DCG}@k = \sum_{i=1}^{k} \frac{2^{r_i} - 1}{\log_2(i+1)}$$

where $r_i$ is the relevance of the item at position $i$. NDCG normalizes by the ideal DCG (the perfectly ranked list).
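Both metrics are a few lines of code. The toy recommendation list and relevance grades below are illustrative; note how NDCG rewards placing high-relevance items early, which precision at $k$ ignores:

```python
import numpy as np

def precision_at_k(recommended, relevant, k):
    # Fraction of the top-k recommendations the user interacted with.
    return len(set(recommended[:k]) & set(relevant)) / k

def dcg_at_k(rels, k):
    rels = np.asarray(rels[:k], dtype=float)
    discounts = np.log2(np.arange(2, len(rels) + 2))  # log2(i+1) for i=1..k
    return float(np.sum((2 ** rels - 1) / discounts))

def ndcg_at_k(rels, k):
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

recommended = ["a", "b", "c", "d"]
relevant = {"a", "c"}
print(precision_at_k(recommended, relevant, 3))  # 2 of the top 3 are relevant

rels = [3, 2, 0, 1]  # graded relevance at each rank position
print(ndcg_at_k(rels, 4))  # close to 1: the ranking is nearly ideal
```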

Common Confusions

Watch Out

Implicit feedback is not the same as explicit ratings

Most real systems observe clicks, views, and purchases, not star ratings. A user not clicking an item does not mean they dislike it; they may not have seen it. This means the "negative" class is contaminated with unseen positives. Methods that treat all non-interactions as negatives (e.g., naive SGD on implicit data) learn a biased model. Techniques like BPR (Bayesian Personalized Ranking) or weighted matrix factorization address this.
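The BPR idea can be sketched in a few lines (embeddings and hyperparameters invented for illustration, regularization omitted): instead of labeling a non-interaction as a negative rating, BPR only asks that an interacted item score higher than a sampled non-interacted one, minimizing $-\log \sigma(x_{ui} - x_{uj})$.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 8
u = rng.normal(size=k)        # user embedding
v_pos = rng.normal(size=k)    # an item the user interacted with
v_neg = rng.normal(size=k)    # a sampled non-interacted item

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr = 0.1
for _ in range(100):
    # Pairwise score difference: interacted item vs sampled item.
    x = u @ v_pos - u @ v_neg
    g = sigmoid(-x)  # gradient weight of the loss -log sigmoid(x)
    du = g * (v_pos - v_neg)
    dv_pos = g * u
    dv_neg = -g * u
    u += lr * du
    v_pos += lr * dv_pos
    v_neg += lr * dv_neg

# The model learns a ranking, not an absolute "dislike" for v_neg.
print(u @ v_pos > u @ v_neg)
```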

Watch Out

Offline metrics do not predict online performance

A model with higher NDCG on a held-out test set can perform worse in production due to position bias (users click higher-ranked items regardless of relevance), popularity bias (test sets over-represent popular items), and feedback loops (the model trained on data generated by a previous model). A/B testing is the only reliable evaluation.

Key Takeaways

  • The core problem is matrix completion under a low-rank assumption
  • SVD gives the optimal low-rank approximation for fully observed matrices, but rating matrices are sparse
  • ALS and SGD optimize only over observed entries, avoiding the missing-as-zero error
  • Content-based methods handle cold start but miss collaborative patterns
  • Implicit feedback (clicks, views) is structurally different from explicit ratings
  • Offline metrics like NDCG do not reliably predict online performance

Exercises

Exercise (Core)

Problem

A rating matrix has 1,000 users and 5,000 items. Only 0.5% of entries are observed. You use rank-20 matrix factorization. How many parameters does the model have? Compare this to the number of observed entries.

Exercise (Advanced)

Problem

Why does applying standard SVD to a sparse rating matrix (filling missing entries with zeros) produce poor recommendations? Describe what the singular vectors capture in this case.

References

Canonical:

  • Koren, Bell, Volinsky, "Matrix Factorization Techniques for Recommender Systems" (IEEE Computer, 2009)
  • Eckart and Young, "The Approximation of One Matrix by Another of Lower Rank" (Psychometrika, 1936)

Current:

  • Rendle et al., "BPR: Bayesian Personalized Ranking from Implicit Feedback" (UAI 2009)

  • He et al., "Neural Collaborative Filtering" (WWW 2017)

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14

Last reviewed: April 2026
