Recommendation Systems: From Matrix Factorization to Two-Tower Models

Published on Sunday, Feb 8, 2026



Recommendation systems are the engines powering personalization across the web, from YouTube videos and Spotify playlists to Amazon products and TikTok feeds. Their goal is simple: connect users with items they’ll love. But behind this simple goal lies a massive challenge: scale. Modern platforms deal with millions, even billions, of users and items. How can we efficiently find the few truly relevant items for a specific user from such a vast ocean of possibilities?

In this post, we’ll explore the evolution of recommendation techniques—starting with collaborative filtering, moving through matrix factorization (one of the most popular techniques for collaborative filtering), and finally diving into the two-tower model, a deep learning architecture designed for scale.


🔍 Understanding Collaborative Filtering

Collaborative filtering is one of the most widely used methods in recommendation systems. The core idea is intuitive: if users shared common interests in the past, they are likely to share them again in the future.

User-Based Collaborative Filtering

This method recommends items to a user based on the preferences of other, similar users: it finds users whose past interactions overlap with the target user's, and suggests items those neighbors liked that the target user has not yet seen.

Item-Based Collaborative Filtering

Instead of comparing users, item-based collaborative filtering compares items to identify which ones are similar. Recommendations are then made based on the similarity between items the user has interacted with and others that are alike.

Neighborhoods

Collaborative filtering works by creating “neighborhoods” of similar users or items. By comparing users’ interaction patterns (or, for item-based filtering, the patterns of users who interact with each item), it identifies the neighbors most relevant to a given user and bases its recommendations on them.
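
To make the neighborhood idea concrete, here is a tiny user-based sketch in NumPy: it scores users by cosine similarity over a toy interaction matrix and recommends items the nearest neighbor liked. The matrix values and variable names are made up purely for illustration.

```python
import numpy as np

# Toy user-item interaction matrix (rows: users, columns: items).
# 1 = interacted, 0 = no interaction; the values are made up for illustration.
R = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 1, 0, 1],
    [0, 0, 1, 1, 0],
    [1, 0, 0, 1, 0],
], dtype=float)

def cosine_sim(a, b):
    # Cosine similarity between two interaction vectors.
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return a @ b / denom if denom else 0.0

target_user = 0
sims = np.array([cosine_sim(R[target_user], R[u]) for u in range(len(R))])
sims[target_user] = -np.inf  # exclude the user themselves

# Recommend items the most similar neighbor liked but the target user has not seen.
neighbor = int(np.argmax(sims))
candidates = np.where((R[target_user] == 0) & (R[neighbor] == 1))[0]
print(f"Nearest neighbor: user {neighbor}, recommended items: {candidates}")
```

Item-based collaborative filtering follows the same recipe, just computed over the columns (items’ interaction vectors) instead of the rows.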

Challenges of Collaborative Filtering

While powerful, collaborative filtering faces significant challenges:

  • Cold-Start Problem: New users or items with no interaction history are difficult to recommend.
  • Data Sparsity: Most users interact with only a tiny fraction of available items, leaving the interaction matrix extremely sparse.

🧮 Matrix Factorization: Tackling Sparsity

Matrix Factorization (MF) models tackle the sparsity problem by learning latent factors (hidden features) for both users and items from the interaction data. These factors represent underlying characteristics that explain user preferences and item attributes.

The Interaction Matrix

Imagine a large matrix R where rows represent users and columns represent items. The entry R_ui contains the interaction value between user u and item i.

  • Explicit Feedback: R_ui could be a user’s rating (e.g., 1-5 stars).
  • Implicit Feedback: R_ui could be binary (1 if interacted, 0 otherwise) or represent interaction frequency (e.g., number of views, purchase count).

This matrix R is typically very sparse, meaning most entries are unknown or zero.

The Factorization Approach

The key idea is to approximate the large, sparse matrix R by the product of two smaller, dense matrices: R ≈ P Qᵀ.

Matrix Factorization aims to find two lower-dimensional matrices such that their product approximates the original interaction matrix:

  • P is a matrix where each row p_u is a vector of latent factors for user u (size: num_users × k).
  • Q is a matrix where each row q_i is a vector of latent factors for item i (size: num_items × k).
  • k is the number of latent factors (dimensionality), a hyperparameter typically much smaller than the number of users or items.
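
Concretely, the predicted interaction for user u and item i is just the dot product of their latent vectors, r̂_ui = p_u · q_i, and the full reconstruction is R ≈ P Qᵀ. A minimal NumPy sketch (the sizes and random initialization are illustrative; in practice the factors are learned as described below):

```python
import numpy as np

num_users, num_items, k = 1000, 5000, 32   # k latent factors; sizes are illustrative

rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(num_users, k))   # user factors, one row p_u per user
Q = rng.normal(scale=0.1, size=(num_items, k))   # item factors, one row q_i per item

u, i = 42, 1337
pred_ui = P[u] @ Q[i]       # predicted interaction for one (user, item) pair
R_hat = P @ Q.T             # full reconstruction, shape (num_users, num_items)
print(pred_ui, R_hat.shape)
```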

What are Latent Factors?

These factors are not predefined; they are learned from the data. They might capture underlying dimensions like genres, user age appeal, item complexity, or user adventurousness—but often in a way that’s not directly interpretable by humans. The key is that users with similar latent factors tend to like items with similar latent factors.


⚙️ Training Matrix Factorization Models

Two common techniques are used to learn the factorization:

1. Stochastic Gradient Descent (SGD)

  • Initialize random values for P and Q matrices.
  • Calculate the prediction error for known interactions.
  • Update the matrices by moving slightly in the direction that reduces the error, considering regularization.
  • Repeat until convergence.
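
Here is a minimal sketch of those SGD updates, assuming explicit ratings stored as (user, item, rating) triples; the learning rate, regularization strength, and toy data are placeholders:

```python
import numpy as np

def train_mf_sgd(ratings, num_users, num_items, k=32, lr=0.01, reg=0.02, epochs=20):
    """ratings: iterable of (user, item, rating) triples for observed entries only."""
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(num_users, k))
    Q = rng.normal(scale=0.1, size=(num_items, k))
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]                    # error on a known interaction
            p_u = P[u].copy()                        # keep the old user vector for the item update
            P[u] += lr * (err * Q[i] - reg * P[u])   # gradient step with L2 regularization
            Q[i] += lr * (err * p_u - reg * Q[i])
    return P, Q

# Toy usage: three observed ratings in a 3-user, 4-item catalog.
P, Q = train_mf_sgd([(0, 1, 5.0), (1, 2, 3.0), (2, 0, 4.0)], num_users=3, num_items=4)
print(P @ Q.T)   # reconstructed rating matrix
```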

2. Alternating Least Squares (ALS)

ALS takes advantage of the fact that if you fix one of the matrices (P or Q), the objective becomes quadratic in the other, so the free matrix can be solved for in closed form with (regularized) least squares.

  • Initialize P and Q randomly.
  • Fix Q, solve for P: For each user u, find the p_u that minimizes the error for all items rated by that user, holding all q_i constant.
  • Fix P, solve for Q: Vice versa—for each item i, find the optimal q_i.
  • Repeat until convergence.
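
And a sketch of the user half of one ALS iteration, solving the regularized normal equations for each user with Q held fixed (the item step is the mirror image; names and the data layout are assumptions for illustration):

```python
import numpy as np

def als_user_step(ratings_by_user, Q, k, reg=0.1):
    """Update every user's factors with the item factors Q held fixed.

    ratings_by_user: dict mapping user index -> list of (item, rating) pairs.
    """
    P = np.zeros((len(ratings_by_user), k))
    for u, interactions in ratings_by_user.items():
        items = np.array([i for i, _ in interactions])
        r_u = np.array([r for _, r in interactions])
        Q_u = Q[items]                            # factors of the items this user rated
        A = Q_u.T @ Q_u + reg * np.eye(k)         # regularized normal equations
        b = Q_u.T @ r_u
        P[u] = np.linalg.solve(A, b)              # closed-form least-squares solution for p_u
    return P

# Toy usage: 4 items, 8 latent factors, two users with a few ratings each.
Q0 = np.random.default_rng(0).normal(scale=0.1, size=(4, 8))
P0 = als_user_step({0: [(1, 5.0)], 1: [(2, 3.0), (3, 4.0)]}, Q0, k=8)

# A full ALS iteration alternates: update P with Q fixed, then update Q with P fixed,
# using the symmetric item-side step; repeat until the loss stops improving.
```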

📊 MF vs. Other Recommendation Techniques

How does Matrix Factorization compare to other approaches?

  • Neighborhood Methods (User/Item KNN): MF often scales better and can provide better accuracy by capturing more global patterns than local neighborhoods.
  • Deep Learning Models (Two-Tower, NCF, etc.): Deep learning models offer more flexibility in architecture, can model non-linear interactions, and easily incorporate diverse side features end-to-end. However, they are generally more complex to implement and train. MF is often used as a baseline to compare against.
  • Content-Based Filtering: MF is purely collaborative (uses only interaction data), while content-based uses item features. Hybrid models often combine both techniques.

Matrix Factorization provides an elegant and powerful way to perform collaborative filtering by uncovering latent dimensions of user preferences and item characteristics from sparse interaction data. Algorithms like SGD and especially ALS (with its weighted variant for implicit feedback) offer practical ways to learn these factors.


🏗️ The Two-Tower Model: Deep Learning for Scale

While Matrix Factorization is effective, the Two-Tower Model represents the next evolution—a deep learning architecture designed for massive-scale recommendation systems.

Core Architecture

At its heart, the two-tower model separates the computation for users and items into two distinct neural networks—the “towers.”

1. The User Tower: This network takes various user-related inputs (like user ID, demographics, historical interactions, device, context) and processes them through layers (embedding layers, MLPs, RNNs, etc.) to output a single vector: the user embedding u. This vector represents the user’s preferences and characteristics in a dense, low-dimensional space.

u = Tower_User(User_Features)

2. The Item Tower: Similarly, this network takes item-related inputs (item ID, category, description, image features, etc.) and processes them through its own set of layers to output an item embedding v. This vector represents the item’s properties in the same embedding space.

v = Tower_Item(Item_Features)

Scoring in the Embedding Space

The magic happens when we need to predict the affinity between a user and an item. Instead of feeding all features into one giant network, the two-tower model calculates the score directly from the pre-computed embeddings, typically using a simple similarity function:

  • Dot Product: Score(u, v) = u · v (Most common)
  • Cosine Similarity: Score(u, v) = (u · v) / (||u|| ||v||)
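
A minimal PyTorch sketch of the two towers and the dot-product score. The feature set, layer sizes, and class names here are illustrative assumptions, not a prescribed architecture:

```python
import torch
import torch.nn as nn

class Tower(nn.Module):
    """Embeds an ID, concatenates dense features, and maps both into a shared embedding space."""
    def __init__(self, num_ids, num_dense, dim=64):
        super().__init__()
        self.id_emb = nn.Embedding(num_ids, dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim + num_dense, 128), nn.ReLU(),
            nn.Linear(128, dim),
        )

    def forward(self, ids, dense):
        return self.mlp(torch.cat([self.id_emb(ids), dense], dim=-1))

user_tower = Tower(num_ids=10_000, num_dense=4)   # e.g. user ID plus 4 numeric user features
item_tower = Tower(num_ids=50_000, num_dense=8)   # e.g. item ID plus 8 numeric item features

# A batch of 32 matched (user, item) pairs; random tensors stand in for real features.
u = user_tower(torch.randint(0, 10_000, (32,)), torch.randn(32, 4))
v = item_tower(torch.randint(0, 50_000, (32,)), torch.randn(32, 8))

scores = (u * v).sum(dim=-1)   # dot-product affinity, one score per pair
```

The important constraint is that the two towers never see each other’s inputs; a user and an item only meet through the final similarity score, which is exactly what makes the decoupled serving described next possible.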

🚀 The Power of Decoupled Serving

The true elegance of the two-tower model shines during serving (inference):

Training Phase

The two towers are trained jointly end-to-end. The goal is to learn embedding spaces where the similarity score (e.g., dot product) between a user and relevant items is high, and low for irrelevant items. This typically involves optimizing loss functions like log loss (pointwise) or contrastive losses, often using negative sampling (including efficient in-batch negative sampling).
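
For illustration, here is a sketch of an InfoNCE-style loss with in-batch negatives, where every other item in the batch serves as a negative for a given user; the temperature value and function name are assumptions:

```python
import torch
import torch.nn.functional as F

def in_batch_softmax_loss(u, v, temperature=0.05):
    """u, v: [batch, dim] embeddings for matched (user, item) pairs."""
    logits = (u @ v.T) / temperature        # [batch, batch]: row i scores user i against every item
    labels = torch.arange(u.size(0))        # the true item for user i sits on the diagonal
    return F.cross_entropy(logits, labels)  # all off-diagonal items act as in-batch negatives

loss = in_batch_softmax_loss(torch.randn(32, 64), torch.randn(32, 64))
```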

Serving Phase

Offline Item Computation: Because the item tower only depends on item features, you can pre-compute the embeddings v for all items in your corpus (potentially billions!) offline and store them.

Online Retrieval: When a user request comes in:

  1. Compute the user embedding u in real-time using the user tower (fast, as it’s one forward pass).
  2. Use this user embedding to query the pre-computed item embeddings.
  3. Since calculating the dot product for billions of items is still too slow, we use Approximate Nearest Neighbor (ANN) search techniques (e.g., Faiss, ScaNN, HNSW).
  4. ANN allows us to efficiently find the items whose embeddings have the highest similarity with the user embedding, retrieving the top-K candidates in milliseconds.

This decoupling makes retrieving relevant candidates from massive catalogs feasible.
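
A sketch of that retrieval flow using the faiss library. For clarity it uses a brute-force inner-product index; a production system would typically swap in an approximate index such as HNSW or IVF, and the sizes below are placeholders:

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim, num_items, top_k = 64, 1_000_000, 100

# Offline: index the pre-computed item embeddings.
item_embs = np.random.rand(num_items, dim).astype("float32")
index = faiss.IndexFlatIP(dim)   # exact inner-product search; use an HNSW/IVF index for true ANN at scale
index.add(item_embs)

# Online: embed the incoming user, then retrieve the top-K closest items.
user_emb = np.random.rand(1, dim).astype("float32")
scores, item_ids = index.search(user_emb, top_k)   # both have shape (1, top_k)
```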


🧩 Building the Towers: Features and Architectures

The performance of a two-tower model heavily depends on how you build the towers and train the system.

Input Features

Effectively representing users and items is crucial:

  • IDs (User/Item): Learned via embedding layers.
  • Categorical Features: Also use embedding layers (e.g., item category, user location).
  • Numerical Features: Often normalized and fed into MLPs (e.g., user age, item price).
  • Text Features: Processed using anything from simple TF-IDF to sophisticated Transformer embeddings (like BERT).
  • Image Features: Often derived from pre-trained CNNs.
  • Sequential Features (User History): Modeled using RNNs (LSTM/GRU), CNNs, or Attention/Transformers (e.g., BERT4Rec, SASRec) within the user tower (see the sketch after this list).
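
For instance, a user’s recent interaction history can be encoded into a single vector with a small recurrent layer before it joins the rest of the user tower. A sketch under assumed shapes, not a specific published architecture:

```python
import torch
import torch.nn as nn

class HistoryEncoder(nn.Module):
    """Encodes a sequence of previously interacted item IDs into a single history vector."""
    def __init__(self, num_items, dim=64):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, dim)
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def forward(self, history_ids):            # history_ids: [batch, seq_len]
        seq = self.item_emb(history_ids)        # [batch, seq_len, dim]
        _, last_hidden = self.gru(seq)          # last_hidden: [1, batch, dim]
        return last_hidden.squeeze(0)           # [batch, dim], fed into the rest of the user tower

history_vec = HistoryEncoder(num_items=50_000)(torch.randint(0, 50_000, (32, 20)))
```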

Tower Network Architectures

  • MLPs: An easy starting point for combining various embedded and numerical features within each tower.
  • Specialized Networks: Towers can incorporate CNNs, RNNs, or Transformers to handle specific modalities (text, image, sequence) before feeding into final MLP layers.

Training Objectives and Strategies

  • Loss Functions: Pointwise (Log Loss, MSE), Pairwise (BPR, sketched below), Listwise, or increasingly popular Contrastive Losses (like InfoNCE) using in-batch negatives.
  • Negative Sampling: Essential for implicit feedback. Strategies range from random sampling to popularity-based (beware bias!) and hard negative mining (selecting challenging negatives). In-batch negatives are often highly effective and efficient.
  • Regularization & Optimization: Standard deep learning techniques (dropout, batch norm, Adam optimizer) apply. Temperature scaling in contrastive loss is an important hyperparameter.
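
As one concrete example from the list above, here is a sketch of the pairwise BPR loss, which pushes the score of an observed (positive) item above that of a sampled negative for the same user; the tensor shapes and names are illustrative:

```python
import torch
import torch.nn.functional as F

def bpr_loss(u, v_pos, v_neg):
    """u, v_pos, v_neg: [batch, dim] user, positive-item, and sampled negative-item embeddings."""
    pos_scores = (u * v_pos).sum(dim=-1)
    neg_scores = (u * v_neg).sum(dim=-1)
    # Encourage each positive item to score higher than its sampled negative.
    return -F.logsigmoid(pos_scores - neg_scores).mean()

loss = bpr_loss(torch.randn(32, 64), torch.randn(32, 64), torch.randn(32, 64))
```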

🎯 Two-Tower Models and the Cold Start Problem

While candidate generation is the dominant use case, two-tower models are also particularly effective at mitigating the cold-start problem in high-velocity marketplaces where new items appear constantly.

Why do two-tower models handle cold start better than matrix factorization?

Unlike Matrix Factorization, which learns only user_id and item_id latent vectors, two-tower models also learn from features. The embeddings come from rich feature representations, not just IDs. This means a new item with no interaction history can still get a meaningful embedding based on its category, description, images, and other attributes.


✅ Advantages and Disadvantages

Advantages

  • Highly Scalable: Handles enormous item catalogs efficiently via ANN search at serving time.
  • Efficient Inference: Pre-computation of item embeddings drastically reduces online latency.
  • Flexible Feature Integration: Easily incorporates diverse feature types within each tower.
  • Effective Representation Learning: Learns meaningful user and item embeddings capturing complex patterns.
  • Modular Design: User and item towers can often be iterated upon somewhat independently.

Disadvantages

  • Limited Feature Interactions: By design, it doesn’t explicitly model interactions between user and item features until the final dot product. This limits its ability to capture fine-grained conditional preferences (e.g., “user likes this brand but only in that category”). This is why a separate ranker is usually needed.
  • Cold-Start Challenges: While better than MF, performance can still suffer for new users/items with very few features or interactions.
  • Potential for Bias: Like any model learning from historical data, it can capture and even amplify biases (popularity, exposure, etc.). Negative sampling strategy heavily influences this.
  • Simple Scoring Function: Dot product/cosine similarity might be too simplistic for complex user-item affinity.

🔁 Putting It All Together

The two-tower model combined with ANN search rapidly narrows down the millions or billions of items to a manageable set of hundreds or thousands (the “candidates”). Recall and efficiency are paramount at this retrieval stage.

In practice, recommendation systems often use a multi-stage architecture:

  1. Candidate Generation (Two-Tower + ANN): Retrieve top-K candidates quickly.
  2. Ranking (Complex Models): Apply more sophisticated models that can capture fine-grained feature interactions.
  3. Re-ranking (Business Logic): Apply diversity, freshness, or business rules.

This combination gives you both the scale of two-tower models and the precision of more complex ranking models.


This exploration has taken us from the foundational concepts of collaborative filtering through the elegance of matrix factorization to the scalable power of two-tower deep learning architectures. Each technique builds on the limitations of the previous one, and understanding this progression helps in choosing the right approach for your recommendation use case.

If you found this helpful, feel free to connect! I’m always excited to discuss recommendation systems, machine learning, and the fascinating challenges of building at scale.


Happy Learning!