“What required a lot of effort, time, and money spent on annotating data to train our own algorithms now can be achieved with surprisingly little data, short time, and low cost,” said Kasper Lindskow, head of research and innovation at Ekstra Bladet, part of JP/Politikens Hus in Denmark.
In an interview with INMA, Lindskow discussed practical applications of natural language processing, a method of extracting information from text.
Publishers use it, for example, to automatically analyse the contents of thousands of articles and identify their topics. This classification is then used to match ads to relevant articles and to infer readers’ interests from the topics they have read about.
Text mining is a new key capability of news publishing, as it enables targeting ads contextually rather than based on individual user data. Other typical uses include assistance in editorial planning, suggesting headlines or summaries, automated content curation, and engaging audiences with recommendations.
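The topic-based matching described above can be illustrated with a minimal Python sketch. It assumes articles have already been labelled with topics by a classifier. All article IDs, topic labels, and ad campaigns are hypothetical and do not come from Ekstra Bladet's systems.

```python
from collections import Counter

# Hypothetical topic labels, as a classifier might assign them to articles.
ARTICLE_TOPICS = {
    "a1": {"football", "sport"},
    "a2": {"cars", "economy"},
    "a3": {"football", "transfer"},
}

# Hypothetical ad campaigns and the topics each advertiser wants to appear next to.
AD_TOPICS = {
    "sports_shoes": {"football", "running"},
    "car_loans": {"cars", "economy"},
}


def match_ads(article_id):
    """Return ads sharing at least one topic with the article, best overlap first."""
    topics = ARTICLE_TOPICS[article_id]
    scored = [(len(topics & wanted), ad) for ad, wanted in AD_TOPICS.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [ad for score, ad in ranked if score > 0]


def reader_interests(read_article_ids, top_n=2):
    """Count topics across the articles a reader has read; return the most frequent."""
    counts = Counter(t for a in read_article_ids for t in ARTICLE_TOPICS[a])
    return [topic for topic, _ in counts.most_common(top_n)]
```

Here the ad is chosen from the article's topics alone, which is what makes the targeting contextual. The reader's interests are inferred only from which articles were read.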
Breakthroughs in text mining
Invented in the 1950s, natural language processing was revolutionised with the introduction of machine learning in the 1980s and further advances in its techniques, such as artificial neural networks, in later decades.
A big challenge with this type of analysis was how much it required:
Large quantities of articles annotated by people to train the machines.
A lot of storage.
And a lot of computing power.
This all changed in 2018 when Google, Stanford University, OpenAI, and others started publishing a new type of algorithm pre-trained on large datasets, such as the whole English Wikipedia.
“Publishers can today apply such algorithms to their articles, and after relatively easy and quick fine-tuning classify articles with high accuracy,” Lindskow explained.
Fine-tuning requires feeding the algorithm with examples of texts classified by people. According to Lindskow, one gets a decent result on topic classification with only 5,000 annotated articles. Ekstra Bladet decided to annotate twice as many to improve their model.
For this task, it hired linguistics students at US$25 per hour. Developing a basic model took two data scientists six weeks. This could have been shorter had they used open-source resources instead of developing the architecture themselves. The time-consuming and costly part has been integrating the new database with advertising systems to allow planning campaigns and targeting ads.
Road to data independence
Ekstra Bladet has been upgrading its data and advertising infrastructure for years, aiming to reduce its dependence on tech giants’ systems, such as Google’s:
In November 2019, it launched its own data platform, Relevance, that segmented users based on first-party reader data and context derived from content.
In October 2020, it launched a contextual advertising network, named the Publisher Platform, in collaboration with six peer publishers, such as TV2 and Berlingske Media, offering advertisers the ability to reach 90% of Danes across all devices and browsers.
In January 2021, Ekstra Bladet ditched Google Analytics for its own Web analytics software, Longboat, becoming independent of third-party technologies in the entire data value chain.
Kasper Lindskow is now heading an ambitious project to develop systems for news personalisation in collaboration with Denmark’s leading universities.
Treasures hidden in articles
Ekstra Bladet wants to extract more information from articles, such as people, organisations, places, things, and sentiments to allow more granular segmentation. “We have a proof of concept and now we are waiting for integration into our data products,” Lindskow said.
One of his ambitions is to link the content metadata with information from other databases or Web sites, for example, connecting an article that mentions a person to her biography elsewhere. Scientists call such databases knowledge graphs: They collect pieces of information from different resources and organise them by linking through keywords. Google and Facebook use their graphs to improve search results, news feeds, and more.
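As a rough illustration of the idea, the Python sketch below stores a tiny knowledge graph as a dictionary and links an article to the entities it mentions. The entities, biography text, and relation names are invented for the example. The lookup is plain substring matching, a stand-in for the entity extraction a production system would need.

```python
# Hypothetical knowledge graph: each entity has a type and links to other entities.
KNOWLEDGE_GRAPH = {
    "Jane Doe": {
        "type": "person",
        "biography": "Fictional politician used for illustration.",
        "links": {"member_of": "Example Party"},
    },
    "Example Party": {
        "type": "organisation",
        "links": {"based_in": "Copenhagen"},
    },
    "Copenhagen": {"type": "place", "links": {}},
}


def link_entities(article_text):
    """Return graph entries for every known entity name that appears in the text."""
    return {
        name: entry
        for name, entry in KNOWLEDGE_GRAPH.items()
        if name in article_text
    }


def neighbours(name):
    """Follow an entity's outgoing links one step and return the linked names."""
    return sorted(KNOWLEDGE_GRAPH[name]["links"].values())
```

Calling `link_entities` on an article's text returns the matching entries, including the person's biography. `neighbours` shows how one linked entity leads to others.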
Ekstra Bladet’s Lindskow believes publishers need to develop machine learning and artificial intelligence systems themselves to offer competitive reader experiences and to ensure these systems reflect journalistic values and ethics rather than those of the tech giants.