Froomle and University of Antwerp research how likely users are to read regional news
Satisfying Audiences Blog | 19 December 2021
When newspapers want to recommend an article to a reader, they quickly run into some limitations. How do you know if that person will find the article interesting, for example? By looking at that person’s reading history? Should you recommend a popular or recent article?
These are just some of the questions news managers struggle with daily.
Froomle, the leading Belgian Artificial Intelligence (AI) company with a strong focus on journalism, recognises that different recommendation use cases each require a specific approach. Therefore, Froomle and the University of Antwerp joined forces to research this topic.
The subject being studied? We wanted to see if we could predict which regional articles people want to read by jointly considering the user’s article and geographical preferences, social influence, and time.
On newspaper Web sites, such articles are often suggested based on the municipality where you live. That data is extracted from your subscription information or by looking at your IP address. Afterward, the media rank recent and/or popular news articles from that region.
This way of working creates three sets of biases. The first is related to imbalance, such as item popularity and city dominance, where readers living in smaller towns get more recommendations from big cities. The second is related to the lack of data, such as cold-start users, cold-start items, and cold-start regions (low-population regions without recent publications). The third is temporal, such as the short lifetime of news articles and concept drift, where user preferences and local news topics evolve.
I researched whether this method could be improved and whether machine learning could come into play.
200 GB of articles
I used a dataset of 200 GB of articles and Web analytics sourced from Het Nieuwsblad through Froomle’s Big Data platform. I loaded all interactions and article metadata during 40 days (June 1, 2021-August 11, 2021) and excluded all articles containing general news and sports. I also fetched each location’s corresponding longitude and latitude coordinates using a public API that supports forward geocoding. This allowed me to compute geographically nearby regions.
The study calculated, offline, the probability that someone would read an article, then tested that prediction against that person’s actual online reading behaviour, using a sliding-window evaluation. The Python algorithm produced a ranking of articles, sorted by relevance. I used four metrics to measure the success of my work:
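As a rough illustration of how nearby regions can be derived from coordinates, the sketch below uses the haversine (great-circle) distance. This is an assumption for illustration: the post does not say which distance measure the study used. The function names and the approximate coordinates are hypothetical and are not taken from the study's data.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def nearest_regions(region, coords, k=3):
    """Return the k regions closest to `region`, nearest first."""
    lat, lon = coords[region]
    others = [(haversine_km(lat, lon, *coords[r]), r) for r in coords if r != region]
    return [r for _, r in sorted(others)[:k]]

# Approximate coordinates, for illustration only (not from the study's data).
coords = {
    "Antwerpen": (51.22, 4.40),
    "Mechelen": (51.03, 4.48),
    "Gent": (51.05, 3.72),
    "Brugge": (51.21, 3.22),
}

neighbours_of_antwerpen = nearest_regions("Antwerpen", coords, k=2)
```

With these toy coordinates, Mechelen comes out as the closest region to Antwerpen, followed by Gent.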
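- Recall: share of the articles a user actually read that appeared among the recommendations; the article was suggested at the right time.
- Hit rate: percentage of users who read at least one of the recommended articles; in the real world people have viewed this article.
- Kendall Tau: rank correlation between the predicted order of articles and the order observed in reality.
- NDCG (normalised discounted cumulative gain): ranking quality, where a relevant article counts for more the higher it appears in the list, normalised against an ideal ranking.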
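For readers who want to see the four metrics concretely, here is a minimal sketch using their standard textbook definitions with binary relevance (an article was either read or not). It is not the study's evaluation code, and the function names are illustrative. The hit rate over a group of users is the mean of `hit_at_k` across those users.

```python
import math

def recall_at_k(recommended, relevant, k=10):
    """Fraction of the relevant (actually read) articles found in the top-k list."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(recommended[:k]) & relevant) / len(relevant)

def hit_at_k(recommended, relevant, k=10):
    """1 if at least one top-k recommendation was read, else 0."""
    return int(bool(set(recommended[:k]) & set(relevant)))

def ndcg_at_k(recommended, relevant, k=10):
    """Binary-relevance NDCG: DCG of the list divided by the DCG of an ideal list."""
    relevant = set(relevant)
    dcg = sum(1 / math.log2(i + 2) for i, a in enumerate(recommended[:k]) if a in relevant)
    ideal = sum(1 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0

def kendall_tau(predicted, actual):
    """Kendall tau-a between two rankings of the same items (no ties)."""
    pos = {item: i for i, item in enumerate(actual)}
    ranks = [pos[item] for item in predicted]
    n = len(ranks)
    pairs = n * (n - 1) // 2
    if pairs == 0:
        return 0.0
    # A pair is concordant when both rankings order it the same way.
    concordant = sum(ranks[i] < ranks[j] for i in range(n) for j in range(i + 1, n))
    return (2 * concordant - pairs) / pairs

recommended = ["a", "b", "c", "d"]
read = {"b", "d", "x"}
recall = recall_at_k(recommended, read, k=4)  # 2 of the 3 read articles were recommended
hit = hit_at_k(recommended, read, k=4)
ndcg = ndcg_at_k(recommended, read, k=4)
```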
The results always included the distinction between the data with and without popular articles because article popularity is not always related to a specific region.
The results
When recommending 10 articles, the recall (“recall@10”) of the experiment was 33%. When it comes to the hit rate at 10 recommendations, the performance of the research group was 36% higher than that of the control group. This means that 36% more users would have read an article from the list of recommended articles if this algorithm had been running in a live environment.
To achieve such success, I conducted several experiments to see what works and what does not. For example, there was a noticeable difference between suggesting articles from the past two weeks and going back in time up to three months. The most decisive parameter, of course, was the location. However, there are different ways to approach that.
I analysed which regional news users had read in the past three months, excluded the 1% most popular articles and then ranked the most popular regions to create user profiles. I used only the top regions because the NDCG at 10 recommended articles increases to 13.2% from 8.0% (+65%) if I use the top two regions instead of the complete user profile.
I then used the OpenCage API to determine the closest regions based on longitude and latitude coordinates. The recall for 10 recommended articles was 33%, and the hit rate was 54.4%. This is a lot more than when working with the existing list of Het Nieuwsblad, which even decreased the hit rate by 1.5%.
To optimise the ranking, I experimented with different functions that assign a score to articles based on a combination of recency, popularity, and relevance of the location of an article. In the end, a combination of jointly filtering on recency and popularity and ranking on the popularity in the past 24 hours divided by the age of an article in hours improved results for Kendall Tau by 38.4% for anonymous users.
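The profile-building step could look roughly like the sketch below. The data layout (tuples of user, article, region), the function names, and the rule of always excluding at least one article on tiny datasets are assumptions made for illustration; they are not taken from the study's code.

```python
from collections import Counter

def popular_articles(reads, top_fraction=0.01):
    """Return the ids of the most-read articles (top `top_fraction` by read count)."""
    counts = Counter(article for _, article, _ in reads)
    # Assumption: always exclude at least one article so the toy example shows an effect.
    n_top = max(1, round(len(counts) * top_fraction))
    return {article for article, _ in counts.most_common(n_top)}

def build_profiles(reads, max_regions=2, top_fraction=0.01):
    """Map each user to their most-read regions, ignoring the most popular articles."""
    excluded = popular_articles(reads, top_fraction)
    per_user = {}
    for user, article, region in reads:
        if article not in excluded:
            per_user.setdefault(user, Counter())[region] += 1
    return {u: [r for r, _ in c.most_common(max_regions)] for u, c in per_user.items()}

# Toy reading history: (user, article id, region of the article).
reads = [
    ("u1", "viral", "Gent"), ("u2", "viral", "Gent"), ("u3", "viral", "Gent"),
    ("u1", "a1", "Mechelen"), ("u1", "a2", "Mechelen"), ("u1", "a4", "Mechelen"),
    ("u1", "a3", "Antwerpen"), ("u1", "a5", "Antwerpen"),
    ("u1", "a6", "Brugge"),
    ("u2", "a1", "Mechelen"),
]

profiles = build_profiles(reads, max_regions=2)
```

In this toy data, the widely read "viral" article is dropped, so user u1's profile keeps only Mechelen and Antwerpen, and user u3, who read nothing else, gets no profile at all (a cold-start user).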
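A minimal sketch of such a ranking function, assuming each article carries a publication time and a count of views in the past 24 hours. The field names, the filter thresholds (`max_age_hours`, `min_views_24h`), and the one-hour floor on age are hypothetical choices for illustration; the post does not give the values used in the study.

```python
from datetime import datetime

def rank_articles(articles, now, max_age_hours=48, min_views_24h=1):
    """Filter on recency and popularity, then rank by views in the past 24 h per hour of age."""
    scored = []
    for art in articles:
        age_hours = (now - art["published"]).total_seconds() / 3600
        # Joint filter: drop articles that are too old or not popular enough.
        if age_hours > max_age_hours or art["views_24h"] < min_views_24h:
            continue
        # Floor the age at one hour so brand-new articles do not get an unbounded score.
        score = art["views_24h"] / max(age_hours, 1.0)
        scored.append((score, art["id"]))
    return [article_id for _, article_id in sorted(scored, reverse=True)]

now = datetime(2021, 8, 11, 12, 0)
articles = [
    {"id": "A", "published": datetime(2021, 8, 11, 10, 0), "views_24h": 100},  # score 50
    {"id": "B", "published": datetime(2021, 8, 11, 2, 0), "views_24h": 300},   # score 30
    {"id": "C", "published": datetime(2021, 8, 8, 12, 0), "views_24h": 1000},  # too old
    {"id": "D", "published": datetime(2021, 8, 11, 11, 0), "views_24h": 0},    # no views
]

ranking = rank_articles(articles, now)
```

Dividing by age means a fresh article with moderate traffic can outrank an older article with more total views, which fits the short lifetime of news articles described earlier.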
After evaluating thousands of algorithms and methodologies offline, the following content-based recommendation algorithm came out with the best results for this use case:
- Create user profiles based on regional reading history, excluding national and popular articles.
- Limit the number of regions in each user profile.
- Take neighbouring regions into account.
- From the articles matching a user’s profile, recent articles perform the best.
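The steps above can be put together in a simplified sketch. It assumes the user profile and the neighbouring regions have already been computed, and it ranks purely on recency; the function name, the data layout, and the example values are hypothetical and do not reproduce the production algorithm.

```python
from datetime import datetime

def recommend(profile_regions, neighbours, articles, k=10):
    """Recommend the k most recent articles from the user's regions and their neighbours."""
    regions = set(profile_regions)
    for region in profile_regions:
        regions.update(neighbours.get(region, []))
    candidates = [a for a in articles if a["region"] in regions]
    candidates.sort(key=lambda a: a["published"], reverse=True)  # most recent first
    return [a["id"] for a in candidates[:k]]

# Hypothetical inputs for illustration.
neighbours = {"Mechelen": ["Antwerpen"]}
articles = [
    {"id": "m1", "region": "Mechelen", "published": datetime(2021, 8, 10, 9, 0)},
    {"id": "a1", "region": "Antwerpen", "published": datetime(2021, 8, 11, 8, 0)},
    {"id": "g1", "region": "Gent", "published": datetime(2021, 8, 11, 11, 0)},
    {"id": "m2", "region": "Mechelen", "published": datetime(2021, 8, 11, 10, 0)},
]

recommendations = recommend(["Mechelen"], neighbours, articles, k=2)
```

Here a reader whose profile contains only Mechelen also receives Antwerpen news because it is a neighbouring region, while the Gent article is left out even though it is the most recent one.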
Many experiments had little impact, such as ensembles with collaborative filtering. Experiments using content-based recommendations did not give a substantial improvement either. For example, I investigated whether articles could be linked via word2vec based on content or title, but this did not significantly increase the accuracy.
What is next?
Further research will focus on increasing the complexity of the integrated model. I want to keep looking for the best ranking function, including a personalisation component (either content-wise or using collaborative filtering) together with geographic preferences. In early 2022, the second part of the research will be published.
In the meantime, Froomle’s data engineering team has already progressed on implementing this offline research into a live production environment. In November 2021, the first online tests using the regional profile started for “push audience selection,” soon to be followed by Web site recommendations.