News recommendations are often seen as a special case in the recommender systems field. Methods and architectures are created specifically for news recommendation. One important reason is that the domain has to deal with a rapidly changing set of relevant items. New items are constantly published, while older ones become obsolete. Algorithms need to take these changes into account.
The state of the art for news recommendations typically uses a hybrid approach, combining content-based recommendations with collaborative filtering and/or item features such as popularity and recency.
For media organisations planning to use content-based approaches for news recommendations, it’s important to understand how they work and why they are frequently used for news.
What are content-based approaches for news?
Items a user has read are used to construct a user profile. This user profile is made up of what we call “features.” These features can be categories (like domestic news and sports), topics, entities, or abstract features based on natural language processing techniques used on title or article text.
How are these profiles computed? The first step is nearly always to compute a profile for each of the items. These can be binary vectors (is something related to a topic or not?), float vectors (embeddings computed through neural networks), or a combination of the two.
The news recommendation framework SCENE uses related topics and entities to represent items (binary profile), while CHAMELEON computes an embedding for the items using a convolutional neural network. The user profiles are constructed based on these item profiles. Typical approaches include averaging the profiles of each item a user has seen, or incorporating the combination into a neural network.
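As a rough illustration of the averaging approach described above, the sketch below builds binary item profiles from tags and averages them into a user profile. The feature list, the article tags, and the function names are hypothetical and are not taken from SCENE or CHAMELEON.

```python
# Hypothetical feature vocabulary; a real system would derive this
# from its own categories, topics, or entities.
FEATURES = ["sports", "politics", "tennis", "economy"]

def item_profile(tags):
    """Binary profile: 1.0 if the item is related to the feature, else 0.0."""
    return [1.0 if feature in tags else 0.0 for feature in FEATURES]

def user_profile(read_item_profiles):
    """Average the profiles of the items a user has read."""
    if not read_item_profiles:
        return [0.0] * len(FEATURES)
    n = len(read_item_profiles)
    # zip(*...) iterates over the profiles one feature (column) at a time.
    return [sum(column) / n for column in zip(*read_item_profiles)]

articles = {
    "a1": item_profile({"sports", "tennis"}),
    "a2": item_profile({"sports"}),
    "a3": item_profile({"politics", "economy"}),
}

# A user who read a1 and a2 ends up with a strong "sports" weight
# and a weaker "tennis" weight.
profile = user_profile([articles["a1"], articles["a2"]])
```

The same averaging would work for float embeddings, since it only assumes that every item profile has the same length.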
The system used in Google News takes a different approach. Instead of the typical similarity approaches, it uses a probabilistic approach. For each category of items, it computes how likely a user is to click on or interact with items in that category, based on historical data. It is this probability that is stored in the user profile. The computed user profile can be used for both analysis and recommendations.
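A probabilistic profile of this kind could be sketched as follows. This is not the Google News model itself: it simply estimates a smoothed click rate per category from one user's history, and the smoothing parameters `alpha` and `beta`, the event format, and the function name are all assumptions made for illustration.

```python
from collections import defaultdict

def category_click_probabilities(events, alpha=1.0, beta=1.0):
    """Estimate, per category, how likely the user is to click.

    events: iterable of (category, clicked) pairs for one user.
    The estimate is (clicks + alpha) / (impressions + alpha + beta),
    a simple additive smoothing so that categories with very little
    data do not get extreme probabilities.
    """
    clicks = defaultdict(int)
    impressions = defaultdict(int)
    for category, clicked in events:
        impressions[category] += 1
        if clicked:
            clicks[category] += 1
    return {
        category: (clicks[category] + alpha) / (impressions[category] + alpha + beta)
        for category in impressions
    }

# Hypothetical history: three sports impressions (two clicked),
# one politics impression (not clicked).
history = [
    ("sports", True),
    ("sports", True),
    ("sports", False),
    ("politics", False),
]
profile = category_click_probabilities(history)
```

In this sketch the user profile is simply the resulting dictionary, mapping each category to an estimated click probability.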
In analysis, the profiles can help answer questions like “How many users are interested in topic X?” or “How big is the overlap in readers between entity A and entity B?”
To use the profiles for recommendation, the typical approach is to compute a similarity between the profile of an item and that of a user, either through direct measures such as cosine similarity and Jaccard similarity, or through a neural network.
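The two direct measures mentioned above can be written in a few lines. The sketch below uses plain Python lists for float profiles and sets for binary profiles. The example vectors and article names are made up for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length float vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty profile is treated as similar to nothing
    return dot / (norm_u * norm_v)

def jaccard_similarity(a, b):
    """Size of the intersection over size of the union of two feature sets."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

# Hypothetical user profile and candidate item profiles over the
# features (sports, politics, tennis, economy).
user_vector = [1.0, 0.0, 0.5, 0.0]
candidates = {
    "tennis_final": [1.0, 0.0, 1.0, 0.0],
    "budget_vote": [0.0, 1.0, 0.0, 1.0],
}

# Rank candidates by similarity to the user profile, most similar first.
ranked = sorted(
    candidates,
    key=lambda name: cosine_similarity(user_vector, candidates[name]),
    reverse=True,
)
```

Cosine similarity suits float profiles such as embeddings or averaged vectors, while Jaccard similarity is a natural fit when both profiles are sets of tags.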
Why use content-based recommendations for news?
The defining characteristic of constantly changing relevant items creates two issues that we can solve with content-based recommendation.
The first is the well-known issue of cold start, specifically item cold start. New items have not been read before, and so models relying on (co-)visitation are unable to recommend them, even if they are relevant.
Usually, the content and tags of an item are known when it is being published. Content-based models, therefore, do not suffer at all from the item cold start problem. Once an item becomes available it can immediately be recommended to users that read similar content.
The second issue is one of sparsity. If items are only relevant for a short period of time, the knowledge about an item needs to be collected over a short period of time. As two items are published further and further apart, the chance of a user having visited both decreases, even if the two are (strongly) related. Models like collaborative filtering that rely on co-visitation obviously struggle with this, because sparsity causes historic data to rapidly become irrelevant to these models.
By basing a user’s profile not on the ID of an item, but instead on the metadata of the item, content-based models can somewhat alleviate the second issue, for two reasons. First, there are usually fewer features used in the profiles than there are items, which automatically reduces the sparsity.
Second, the features are usually long-lived. For example, sports topics will always remain relevant, and so we can recommend new sports-related articles to users who read old sports articles, even if the two articles were never before read by the same user.
How are content-based recommendations used?
While the content-based approach is often a fundamental part of the recommender system, most state-of-the-art approaches combine it with other approaches to improve results. Some companies (like Google) combine the user’s content-based score with a global current interest score and a collaborative filtering score to get the final result. The global interest score helps to account for spikes in global interest caused by special events, such as the COVID-19 pandemic, the World Cup, and natural disasters.
The collaborative filtering score helps find the right articles within the broad topics used. SCENE also combines the content-based score with additional features like recency, popularity, and list diversity to rank the candidate list.
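One simple way to combine such scores is a weighted sum, sketched below. This is an assumption made for illustration: the way Google News or SCENE actually combine their scores is not described here, and the weights, score names, and candidate values are hypothetical. List diversity is left out because it depends on the whole list rather than on a single item.

```python
from dataclasses import dataclass

@dataclass
class Scores:
    """Per-article scores, each assumed to be scaled to the range 0..1."""
    content: float        # similarity between user profile and item profile
    collaborative: float  # collaborative filtering score
    popularity: float     # current global interest in the article
    recency: float        # 1.0 for a just-published article, lower when older

# Hypothetical weights; in practice these would be tuned on real data.
WEIGHTS = {
    "content": 0.5,
    "collaborative": 0.3,
    "popularity": 0.1,
    "recency": 0.1,
}

def hybrid_score(scores, weights=WEIGHTS):
    """Weighted sum of the individual scores."""
    return sum(weight * getattr(scores, name) for name, weight in weights.items())

candidates = {
    "tennis_final": Scores(content=0.9, collaborative=0.2, popularity=0.1, recency=0.5),
    "breaking_news": Scores(content=0.3, collaborative=0.4, popularity=0.9, recency=0.9),
}

# Rank the candidate list by the combined score, best first.
ranked = sorted(candidates, key=lambda name: hybrid_score(candidates[name]), reverse=True)
```

With these weights the content-based score dominates, so the tennis article still ranks first even though the breaking news article is more popular and more recent. Raising the popularity and recency weights would shift that balance.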
What can we learn from the literature and apply to our own use cases?
We use a content-based approach in our TF-IDF (term frequency-inverse document frequency) algorithm. The difference from the approaches described here is that we compute the user profiles in real time to use the most recent interactions of the user. Our TF-IDF algorithm precomputes item profiles based on how frequently tokens (words) occur in the title and category strings.
These raw frequencies are weighted by the inverse of the frequency of a token in the data of all items. This means that a token like “sport” occurring in a lot of articles gets less weight than a token like “tennis,” which will occur in fewer items and so be more representative of a user’s interest. The user profiles are computed in real time as the average of the profiles of items they recently consumed.
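The weighting described above can be sketched in plain Python. This is a minimal illustration, not our production code: the whitespace tokenizer, the example items, and the smoothed logarithmic IDF formula are assumptions, and other IDF variants would work just as well.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def inverse_document_frequencies(item_tokens):
    """Smoothed IDF per token: tokens that occur in many items get less weight."""
    n_items = len(item_tokens)
    doc_freq = Counter()
    for tokens in item_tokens.values():
        doc_freq.update(set(tokens))  # count each token once per item
    return {
        token: math.log((1 + n_items) / (1 + df)) + 1.0
        for token, df in doc_freq.items()
    }

def item_profile(tokens, idf):
    """Raw token frequency in the item, weighted by the token's IDF."""
    counts = Counter(tokens)
    return {token: count * idf[token] for token, count in counts.items() if token in idf}

def user_profile(recent_item_profiles):
    """Average of the profiles of the items a user recently consumed."""
    if not recent_item_profiles:
        return {}
    totals = defaultdict(float)
    for profile in recent_item_profiles:
        for token, weight in profile.items():
            totals[token] += weight
    return {token: weight / len(recent_item_profiles) for token, weight in totals.items()}

# Hypothetical items: category string followed by title.
items = {
    "a1": "sport tennis open final",
    "a2": "sport football cup draw",
    "a3": "sport tennis ranking update",
    "a4": "politics budget vote",
}
item_tokens = {item_id: tokenize(text) for item_id, text in items.items()}
idf = inverse_document_frequencies(item_tokens)
item_profiles = {item_id: item_profile(tokens, idf) for item_id, tokens in item_tokens.items()}

# Real-time user profile from the two most recently read items.
user = user_profile([item_profiles["a1"], item_profiles["a2"]])
```

In this toy data "sport" occurs in three of the four items and "tennis" in two, so "tennis" receives the higher IDF weight, in line with the reasoning above.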
We could pre-compute these user profiles as well, storing them to be used at prediction time, or for analysis purposes. The advantage of pre-computation is that we can take longer histories into account, since we no longer need to compute the profiles in real time.
The downside to pre-computation is that a user’s profile is not updated with the most recent user events. It takes a rerun of the computation to consolidate them into the profiles. Another disadvantage is that you risk computing a large number of profiles that will never be used, since you don’t know which users will come online.
When the history you want to use is short, then it is better to not pre-compute the profiles, since that would introduce complexity where it is not necessary. You should use pre-computed profiles when you want to use a long history such that loading all the events in real time and computing the profiles would take too long.
Content-based approaches help solve the cold start problem, since they rely on information that is readily available, whereas collaborative filtering needs user interactions that are amassed over time. By representing the user’s interest as their preferences for certain features, they also reduce the sparsity of the data. (Typically, there are fewer categories than there are items, and they are relevant for longer periods of time.) Stored user profiles can be used to handle a large user history without having to recompute the profiles on the fly.
To get the best final result, content-based approaches are combined with other models. Simple features like popularity, recency, and diversity all play a role in getting the best result for the user. In addition, collaborative filtering can be used to incorporate the user’s recent interests and browsing behaviour.