Aggregating data flags notable reader behaviour at Dagens Nyheter
Big Data For News Publishers | 29 March 2021
Reader behaviour on digital news media platforms changes over time. This could be due to turnover in the reader base, external factors like pandemic restrictions that force people to work from home, or seasonal fluctuations.
By aggregating metrics from a longer period and splitting them into weekdays and hour slots, you can find interesting patterns. These patterns can help you learn a lot about your readers’ preferences and habits.
This is a simplified description of how you can perform these analyses and what you can learn.
At Dagens Nyheter, we perform behavioural analysis on a regular basis to better understand the habits of the users on our platforms. It’s also a simple way to shift to a more outside-in perspective on your publication schedule and trigger internal discussions on reader behaviour.
The goal is to publish our articles when they have the best possible chance of being read (by many readers and to the end of the article), converting new subscribers, and being shared. By digging into large amounts of data, we find clues on the readers’ preferences.
For example: On which days and at what times do our readers have enough time to finish long articles? When do they just want quick updates from the headlines on the home page and have no time to read entire articles? When do they usually come to us from organic search, and what kinds of articles do they tend to look for then?
Agree on the question you want to answer. Do you want to investigate how working from home affects the preferences and habits of your readers? For example, do they wake up later in the morning? Do they have more time to read longer articles? Do they still behave as they did when they were commuting to work? Or do you want to find out what changes in reader behaviour you can expect from an upcoming vacation or holiday? Or maybe you simply want to map an average week with a couple of metrics.
Choose metrics according to the questions you want to answer. Maybe it’s published articles, pageviews, unique users, total time spent per visit, pageviews per visit, pageviews on the home page, conversion rates, or visits from social, search, e-mails, or push notifications. Try to experiment and see what findings are valuable.
Then choose the timeframe you want to investigate. Make sure it is long enough to contain enough data to identify patterns, but don’t pick too long a period. Behaviours change, and you don’t want to draw conclusions from obsolete patterns and outdated data. Choose whole weeks so that you get as many Mondays as Tuesdays and so on. If you want to study the percentage change for a specific metric, pick a reference period that is about the same length as the primary timeframe you want to study.
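One way to make sure a timeframe contains only whole weeks is to snap it to Monday boundaries. The sketch below is an illustration, not our production code: the function name `whole_week_window` and the choice of eight weeks are assumptions made for the example.

```python
from datetime import date, timedelta

def whole_week_window(end: date, weeks: int) -> tuple[date, date]:
    """Return (start, end_exclusive) spanning `weeks` complete Monday-to-Sunday weeks.

    The window stops at the Monday of the week containing `end`, so a
    partially elapsed week is never included.
    """
    end_exclusive = end - timedelta(days=end.weekday())  # weekday() is 0 on Mondays
    start = end_exclusive - timedelta(weeks=weeks)
    return start, end_exclusive

# Hypothetical example: eight whole weeks, plus a reference period of the
# same length that ends where the primary timeframe begins.
primary = whole_week_window(date(2021, 3, 29), weeks=8)
reference = whole_week_window(primary[0], weeks=8)
print(primary)
print(reference)
```

Because both windows are a multiple of seven days long, every weekday appears equally often in each of them.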
Collect your data and then aggregate it to calculate the average value for each hour of the week. For example, if you want to study the average number of users per day and hour, sum the number of users for each combination of weekday and hour of the day, then divide by the number of weeks in your timeframe. You’ll now have 168 slots (seven weekdays times 24 hours) and corresponding values. If you want to study how a metric has changed over time, calculate the percentage change for each slot relative to the reference period.
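The aggregation can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names, the `(timestamp, value)` record layout, and the sample numbers are assumptions for the example, not a description of our actual pipeline.

```python
from collections import defaultdict
from datetime import datetime

def weekly_slot_averages(records, n_weeks):
    """Average a metric per (weekday, hour) slot.

    records: iterable of (timestamp, value) pairs (hypothetical layout).
    n_weeks: number of whole weeks the records cover.
    Returns {(weekday, hour): average} with weekday 0 = Monday.
    """
    totals = defaultdict(float)
    for ts, value in records:
        totals[(ts.weekday(), ts.hour)] += value
    # Always return all 168 slots, including those without any data.
    return {(d, h): totals[(d, h)] / n_weeks for d in range(7) for h in range(24)}

def percentage_change(primary, reference):
    """Percentage change per slot; None where the reference value is zero."""
    return {
        slot: 100.0 * (primary[slot] - ref) / ref if ref else None
        for slot, ref in reference.items()
    }

# Made-up numbers: users on Mondays at 08:00 over two weeks in each period.
primary = weekly_slot_averages(
    [(datetime(2021, 3, 15, 8), 100), (datetime(2021, 3, 22, 8), 140)], n_weeks=2)
reference = weekly_slot_averages(
    [(datetime(2021, 3, 1, 8), 90), (datetime(2021, 3, 8, 8), 110)], n_weeks=2)
change = percentage_change(primary, reference)
print(len(primary), primary[(0, 8)], change[(0, 8)])
```

Slots with a zero reference value are returned as `None` rather than as an infinite change, which keeps them easy to grey out in a later visualisation.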
To keep specific news events from skewing your numbers, you can exclude outliers for each slot. Removing outliers can be done in various ways. One way is to treat any value more than three standard deviations from the slot’s mean as an outlier.
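As a rough sketch of the three-standard-deviation rule, the function below averages the weekly values of a single slot after dropping anything beyond the limit. The function name, the use of the sample standard deviation, and the sample numbers are assumptions for illustration.

```python
from statistics import mean, stdev

def slot_average_without_outliers(values, limit=3.0):
    """Average the values that lie within `limit` sample standard deviations of the mean.

    values: one observation per week for a single (weekday, hour) slot.
    Returns None for an empty list.
    """
    if not values:
        return None
    if len(values) < 2:
        return mean(values)
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return mu
    kept = [v for v in values if abs(v - mu) <= limit * sigma]
    return mean(kept)

# Made-up slot: eleven ordinary weeks and one week with a major news event.
weekly_users = [100] * 11 + [1000]
print(slot_average_without_outliers(weekly_users))
```

Keep in mind that with about ten or fewer weeks per slot, no single value can be more than three standard deviations from the mean, so this rule never removes anything. For short timeframes a tighter limit or a median-based rule may be a better fit.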
Now comes the fun part: visualisation and conclusions. Plot each slot in chronological order. We usually do heat maps, with weekdays on the x-axis and hours on the y-axis. Plot both the average and the percentage change.
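Before plotting, the 168 slots need to be arranged into a grid with one column per weekday and one row per hour. The sketch below does that with the standard library and prints a plain-text preview; the function names and the `{(weekday, hour): value}` input layout are assumptions for the example. The resulting grid can then be handed to a plotting library of your choice, for instance matplotlib’s `imshow`.

```python
WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def slots_to_grid(slot_values):
    """Arrange {(weekday, hour): value} into 24 rows (hours) by 7 columns (weekdays)."""
    return [[slot_values.get((d, h)) for d in range(7)] for h in range(24)]

def format_grid(grid):
    """Render the grid as text, with '-' marking slots that have no value."""
    lines = ["hour  " + " ".join(name.rjust(7) for name in WEEKDAYS)]
    for hour, row in enumerate(grid):
        cells = " ".join(("-" if v is None else f"{v:.0f}").rjust(7) for v in row)
        lines.append(f"{hour:02d}:00 " + cells)
    return "\n".join(lines)

# Made-up values for two slots: Monday 08:00 and Saturday 10:00.
grid = slots_to_grid({(0, 8): 120.0, (5, 10): 85.0})
print(format_grid(grid))
```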
Experiment with different selections. For example, limit the data to look at specific sections or types of articles (such as articles longer than X characters). Combine heat maps of different metrics and put them side by side. What does that tell you? (For example: Try combining average time spent per article by each user and the number of long reads to see whether your publication schedule matches when readers have time to read.)
The best way to do this is probably by writing a notebook; Jupyter Notebook and Google Colab are good tools. Then you can share it within your organisation and make sure that the analysis is consistent, regardless of who runs the code. Of course, the same thing can be done in Excel or Google Sheets. The initial effort is probably smaller in a spreadsheet, but the notebook version will be very time-efficient to run and configure once it’s in place.
Keep in mind that this analysis visualises what the average patterns look like. Ask yourself whether the average is representative enough to draw conclusions from. Also, think about correlation and causality when doing this. For example: Do readers spend much time on our platform at specific times because we’ve published long articles, or do we publish long articles at specific times because readers spend so much time on our page then?
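One simple way to check whether an average is representative is to compare it with the median and the relative spread of the weekly values in a slot. The sketch below is an illustration under assumptions: the function name, the choice of the coefficient of variation as a spread measure, and the numbers are all made up for the example.

```python
from statistics import mean, median, stdev

def slot_summary(values):
    """Summarise one slot's weekly values: mean, median and coefficient of variation."""
    mu = mean(values)
    cv = stdev(values) / mu if len(values) > 1 and mu else None
    return {"mean": mu, "median": median(values), "cv": cv}

# Two made-up slots with the same average but very different behaviour.
steady = slot_summary([100, 105, 95, 100])
spiky = slot_summary([20, 25, 15, 340])
print(steady)
print(spiky)
```

Both slots average 100, but the second one is driven by a single week. A large gap between mean and median, or a high coefficient of variation, suggests the average alone says little about a typical week.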