May 10, 2017

One of my clients, Youth1 Media, recently came to me for help with curating content for an infinite scroll feature.

If you’re reading this, you probably know the basics of infinite scroll: scroll down a webpage to the bottom of a section of content, and more content magically appears.  Scroll down to the bottom of that content and… even more content! ESPN articles are a great example of this.

The technical mechanics of serving content like this are pretty basic. The tough part is curating what to serve.

Collaborative filtering for content recommendation

Content recommendation is nothing new. It has been a common feature of websites during the CMS era. We’ve all seen “Similar Articles” blocks on news and blogging sites, as well as “Other Customers Who Viewed This Item Bought” recommendations on e-commerce sites like Amazon.

These blocks all require some kind of algorithm for judging “similarity.” There are at least two very general kinds of measures of similarity:

  • Content similarity: based on shared characteristics of items; keywords or tags, textual similarity of product descriptions or article body, product price, brands, etc.
  • User similarity: based on shared characteristics of users; demographic information like age / sex / location / income, purchasing history, browsing history, etc.

It is this second category – often called “Collaborative Filtering” – that is all the rage these days. The basic idea is that you are likely to consume stuff that people who are like you consume.

What makes users “similar”?

My client and I agreed with this general approach of prioritizing similar users over similar content, so we wanted to incorporate a collaborative filter into our article infinite scroll feature.

But what does it mean for users to be alike?

Again, there are two basic schools of thought here:

  • similarity of users themselves (e.g., demographic information)
  • similarity of actions (e.g., purchasing history, browsing habits)

Actions make way more sense for us, since we know very little about our users. Unless you’re basically Facebook, or you have a strong profile-driven community site, this is going to be the case for you, too. I won’t use real numbers, but let’s say about 75% of our traffic comes from first-time users, and of the remaining 25% who are returning users, only about 2% have registered accounts. These kinds of numbers are very common for content-driven sites like blogs and news outlets. Rather than focusing on user characteristics, it’s much easier for us to monitor browsing habits via sessions or cookies, and to make predictions based on this history.

On a much more basic level, which I know runs the risk of sounding incredibly trite, actions do speak louder than words, or louder than what we think we know about the user in this case. If a user buys a product or visits a page, that is more powerful than what their profile might tell us.

The major problem of collaborative filtering: the “cold start”

Collaborative filtering is awesome when we have plenty of user data to work with. Where it falls short is trying to understand new users, new products, or new content.

This is the “cold start” problem, a well-known limitation of collaborative filtering, and it can be a strong case for using a content-oriented approach to recommendation.

The “cold start” problem is especially pronounced for content-driven sites, where new content is introduced frequently.

Take Youth1 for example. Youth1 publishes new youth sports articles every day. At the time a new article is published, the number of users who have browsed this article is 0. It can take several hours, or several days, until enough users have viewed this article so that we can make well-informed predictions about browsing similarity. How should we make content recommendations in the meantime?

A combined approach: the Simple Collaborative Filter module

The solution I settled on for reconciling a collaborative filter with the cold start problem is simple: use a collaborative filter when we have enough data, or else fall back to content similarity and popularity as a stopgap.

The final product of this endeavor is the Simple Collaborative Filter module for Drupal 7.

Here’s a high level explanation of how it works:

Browsing history is recorded via the Focused History module. This module tracks browsing history by sessions instead of just a user’s id, so we know about anonymous user behavior.
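
To make the idea of session-keyed history concrete, here is a minimal sketch in Python. It is not the Focused History module's code (that is a Drupal module); the `SessionHistory` class, its method names, and the session IDs are all hypothetical, and the storage is an in-memory dictionary rather than a database table.

```python
from collections import defaultdict


class SessionHistory:
    """Hypothetical in-memory stand-in for a session-keyed browsing log."""

    def __init__(self):
        # Maps session id -> set of node ids viewed during that session.
        self._views = defaultdict(set)

    def record_view(self, session_id, node_id):
        """Remember that this session viewed this node."""
        self._views[session_id].add(node_id)

    def nodes_for(self, session_id):
        """Return a copy of the node ids viewed in the given session."""
        return set(self._views.get(session_id, set()))


history = SessionHistory()
history.record_view("sess-a", 101)  # anonymous visitor: no user id required
history.record_view("sess-a", 102)
history.record_view("sess-b", 101)
```

Keying on the session rather than the user ID is what lets anonymous traffic, which is most of the traffic on a content site, contribute to the recommendations.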

The Simple Collaborative Filter module uses the Views module to output a list of similar content based on three fields calculated on the backend:

  • viewed_together_count -- The number of times nodes have been viewed together recently.
  • same_terms_count -- The number of taxonomy terms that nodes share.
  • total_views_count -- The total number of times a node has been viewed recently.

Sample results table for the Simple Collaborative Filter with computed fields.

Most of the magic happens behind the scenes in a custom contextual filter, which takes a node id as an argument, calculates these computed fields, and appends them to each result.
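
As a rough illustration of what those three computed fields mean, here is a small Python sketch. It is an assumption-laden model, not the module's actual query: `computed_fields` is a hypothetical function, "viewed together" is taken to mean "viewed in the same session," `total_views_count` counts sessions rather than raw page views, and the "recently" time window is left out entirely.

```python
def computed_fields(current_nid, sessions, node_terms):
    """Compute the three fields for every node other than current_nid.

    sessions:   dict of session id -> set of node ids viewed in that session
    node_terms: dict of node id -> set of taxonomy term names
    """
    current_terms = node_terms.get(current_nid, set())
    rows = {}
    for nid, terms in node_terms.items():
        if nid == current_nid:
            continue
        rows[nid] = {
            "viewed_together_count": 0,
            "same_terms_count": len(terms & current_terms),
            "total_views_count": 0,
        }

    for viewed in sessions.values():
        for nid in viewed:
            if nid not in rows:
                continue
            rows[nid]["total_views_count"] += 1
            # Assumption: two nodes were "viewed together" if one session saw both.
            if current_nid in viewed:
                rows[nid]["viewed_together_count"] += 1
    return rows


# Made-up sample data: three sessions and four articles.
sessions = {"s1": {1, 2, 3}, "s2": {1, 2}, "s3": {3, 4}}
node_terms = {
    1: {"soccer", "u14"},
    2: {"soccer"},
    3: {"soccer", "u14"},
    4: {"lacrosse"},
}
fields = computed_fields(1, sessions, node_terms)
```

With this sample data, node 2 was viewed alongside node 1 in two sessions, node 3 in one, and node 4 in none, so node 4 can only be ranked by its terms and overall popularity.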

I've also exposed a couple of configuration options for site administrators to adjust:

  • Viewed Together Threshold is the minimum number of times that two nodes must have been viewed together to count toward viewed_together_count. Because of sample size, site owners almost certainly won't care if two items were only viewed together once or twice. In that case, we can set the "viewed together threshold" at 3, and nodes viewed together fewer than 3 times get a viewed_together_count of 0 for sorting purposes.
  • Same Term Vocabularies are the taxonomy term vocabularies that are included in same_terms_count.

The contextual filter allows site admins to specify the level at which the 'collaborative' part of the filter turns on, and which vocabularies will be used to calculate term similarity.
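
The effect of the two settings can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not the module's code: both function names are hypothetical, terms are modeled as (vocabulary, term) pairs, and the vocabulary names are invented for the example.

```python
def thresholded_viewed_together(raw_count, threshold=3):
    """Return raw_count, or 0 when it falls below the configured threshold."""
    return raw_count if raw_count >= threshold else 0


def same_terms_count(terms_a, terms_b, enabled_vocabularies):
    """Count shared terms, looking only at the enabled vocabularies.

    Terms are (vocabulary, term) pairs, e.g. ("sport", "soccer").
    """
    shared = terms_a & terms_b
    return sum(1 for vocabulary, _term in shared if vocabulary in enabled_vocabularies)


# Made-up articles that share three terms across three vocabularies.
article_a = {("sport", "soccer"), ("region", "northeast"), ("author", "staff")}
article_b = {("sport", "soccer"), ("region", "northeast"), ("author", "staff")}

print(thresholded_viewed_together(2))  # below the threshold of 3, so 0
print(thresholded_viewed_together(3))  # meets the threshold, so 3
print(same_terms_count(article_a, article_b, {"sport", "region"}))  # 2
```

Leaving a vocabulary such as the hypothetical "author" one out of the enabled set keeps incidental overlaps from inflating the similarity score.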

The results are then sorted by these computed fields, following the order in which the sort criteria are arranged in the view.
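
Here is a small Python sketch of how that ordering produces the cold-start fallback. It assumes the sorts are arranged as viewed_together_count, then same_terms_count, then total_views_count, all descending; that arrangement, the `rank` function, and the sample rows are illustrative assumptions rather than the module's fixed behavior.

```python
def rank(rows):
    """Sort rows by the three computed fields, highest first."""
    return sorted(
        rows,
        key=lambda row: (
            row["viewed_together_count"],  # collaborative signal first
            row["same_terms_count"],       # then content similarity
            row["total_views_count"],      # then overall popularity
        ),
        reverse=True,
    )


# Made-up rows: only node 11 has cleared the viewed-together threshold.
rows = [
    {"nid": 12, "viewed_together_count": 0, "same_terms_count": 2, "total_views_count": 40},
    {"nid": 14, "viewed_together_count": 0, "same_terms_count": 0, "total_views_count": 300},
    {"nid": 11, "viewed_together_count": 5, "same_terms_count": 1, "total_views_count": 90},
    {"nid": 13, "viewed_together_count": 0, "same_terms_count": 2, "total_views_count": 75},
]

ranked_nids = [row["nid"] for row in rank(rows)]
print(ranked_nids)  # [11, 13, 12, 14]
```

When every candidate has a viewed_together_count of 0, as with a freshly published article, the first sort key ties across the board and the ranking is decided by shared terms and popularity instead.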

The content-based recommendations of this module could definitely be more sophisticated. There are lots of cool ways to improve these results through machine learning, or by using database-driven techniques like MySQL `FULLTEXT` (see the Similar Entries module). But since we’re taking a collaborative-filter-first approach, it seemed like overkill to require site administrators to mess with separate computational servers or change database engines. We wanted to provide this module to the Drupal community for quick plug-and-play. This is why we wanted a simple collaborative filter that doesn’t have any crazy dependencies.

So there you have it: a simple approach to a collaborative filter that addresses the cold start problem.