Gazeta Wyborcza mines its articles to better target ads by context

By Greg Piechota


Oxford, United Kingdom


At the beginning of the COVID pandemic, Poland’s Gazeta Wyborcza built and launched a new offer of contextually targeted ads. After a year, the new offer is generating 6% of its online display ad revenue.

“We’ve got evidence that contextual targeting is effective,” said Joanna Balowska, director of online ad sales and development at Gazeta Wyborcza, in an interview with INMA. 

Many news publishers see contextual targeting as one of the alternatives to targeting users based on third-party cookies. The search for alternatives accelerated after the tech giants, such as Google, abandoned tracking online users across websites, citing privacy concerns and regulatory risks.

Gazeta Wyborcza and others have invested in text mining technologies, parsing articles to better identify their topics and sentiments. So far, INMA members across Europe have reported that advertisers and their agencies have been slow to shift budgets, fearing lower ad effectiveness.

How effective is contextual targeting: Gazeta Wyborcza gave INMA exclusive access to the results of 70 online ad campaigns it sold between April 2020 and March 2021. Although INMA cannot report the exact figures or reveal the advertisers, our review of the data set confirmed that the new targeting method is effective:

  • The contextually targeted campaigns had, on average, click-through rates (CTR) 1.7 times higher than the newspaper's Run-of-Network campaigns, which in turn matched an industry average for display ads.
  • The contextually targeted pre-roll video ad campaigns enjoyed, on average, a similar lift of 1.7 times higher CTR.
  • The highest lift, nine-fold or 9.2x, was observed for two campaigns: one for an automotive brand and another for an entertainment brand. 
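
The lift figures above are ratios of click-through rates. A minimal sketch of that arithmetic follows, assuming "1.7 times higher" means the targeted CTR divided by the baseline CTR. The function names and the click and impression counts are hypothetical, chosen only to illustrate the calculation. They are not Gazeta Wyborcza's data.

```python
def ctr(clicks: int, impressions: int) -> float:
    """Click-through rate: clicks divided by impressions."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return clicks / impressions


def lift(targeted_ctr: float, baseline_ctr: float) -> float:
    """How many times the targeted CTR is of the baseline CTR."""
    if baseline_ctr <= 0:
        raise ValueError("baseline CTR must be positive")
    return targeted_ctr / baseline_ctr


# Hypothetical campaign figures (not from the reviewed data set).
baseline = ctr(clicks=100, impressions=100_000)  # Run-of-Network campaign
targeted = ctr(clicks=170, impressions=100_000)  # contextually targeted campaign
print(f"lift: {lift(targeted, baseline):.1f}x")
```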

For the use of its proprietary data and algorithms, Balowska said, the newspaper charged 30% extra on top of the CPM or a negotiated flat fee, depending on the contract.

By December 2020, the new offer, called Content Categories, generated 2% of the newspaper's online display ad revenue. In 2021, through mid-March, the share of revenue from contextually targeted campaigns increased to 6%.

What is the business problem: Gazeta Wyborcza has 9 million users monthly and 241,000 digital-only subscribers. Although readers bring in most of the revenue, 54%, across print and online, the newspaper sees an opportunity to grow advertising revenue with new offers built on first-party user and content data.

“We were looking at how best to monetise our editorial content without creating multiple verticals focused on single topics like horizontal portals do in Poland. For example, our weekend edition has articles on a variety of subjects, making targeting relatively difficult,” explained Joanna Balowska. 

The newspaper started working on a new offer in late 2019. The COVID outbreak accelerated the project. “Many advertisers wanted to avoid articles on the pandemic. Soon though, most articles mentioned the coronavirus-related keywords. We needed a smarter way to classify content to differentiate contexts and sentiments,” said Balowska.

By April 2020, less than two months after the beginning of the pandemic in Poland, the newspaper was able to run the first test campaigns. By June, it offered 25 basic content categories matching a popular taxonomy by Google, familiar to advertisers.

By February 2021, clients got tools to create custom categories. For example, one automotive brand set hundreds of conditions: topics, keywords, or sentiments to be included or excluded from their campaign.
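
A custom category of this kind can be pictured as a set of include and exclude conditions checked against each article's labels. The sketch below is an illustration only: the `Article` and `CustomCategory` classes, their fields, and the sample values are hypothetical and do not describe the newspaper's actual tool.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Article:
    topics: Set[str]
    keywords: Set[str]
    sentiment: str  # e.g. "positive", "neutral", "negative"


@dataclass
class CustomCategory:
    include_topics: Set[str] = field(default_factory=set)
    exclude_topics: Set[str] = field(default_factory=set)
    exclude_keywords: Set[str] = field(default_factory=set)
    allowed_sentiments: Set[str] = field(default_factory=set)

    def matches(self, article: Article) -> bool:
        # Any excluded topic or keyword disqualifies the article.
        if article.topics & self.exclude_topics:
            return False
        if article.keywords & self.exclude_keywords:
            return False
        # An empty sentiment set means any sentiment is acceptable.
        if self.allowed_sentiments and article.sentiment not in self.allowed_sentiments:
            return False
        # An empty include set means any topic; otherwise one must match.
        return not self.include_topics or bool(article.topics & self.include_topics)


# Hypothetical rule set for a car advertiser.
car_brand = CustomCategory(
    include_topics={"automotive", "travel"},
    exclude_keywords={"accident"},
    allowed_sentiments={"positive", "neutral"},
)
road_trip = Article({"travel"}, {"road trip"}, "positive")
crash_report = Article({"automotive"}, {"accident"}, "negative")
print(car_brand.matches(road_trip), car_brand.matches(crash_report))
```

Exclusions are checked first here, so a single excluded keyword overrides any number of matching topics. That ordering is a design assumption that suits brand-safety use cases.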

How did data science solve it: Gazeta Wyborcza is part of Agora, a media and entertainment conglomerate, which established a group-wide Big Data department in 2014. The team has experience in natural language processing, a method of extracting information from text using machine learning, having built an automated tagging system for editorial purposes.

Two scientists supervised by Luiza Pawela, the department’s head, worked on a new classification tool, optimised for contextual targeting of ads. 

Initially, the team considered using commercial tools, such as Google Cloud Natural Language API, but it faced challenges: 

  • Wyborcza publishes articles mostly in Polish, so they needed to be automatically translated first, using Google Cloud Translation API, before they could be classified. Sometimes, nuances were lost in translation.
  • As Google classification algorithms were trained on English texts, the results were not perfect either. For example, the signature series on abuse of animal rights was classified under Hobbies and Leisure. While Google knew all about American football, which isn’t popular in Poland, it missed ski jumping, which is a national favourite.
  • The team also feared that querying two APIs for each article would be unsustainable if scaled to thousands of new articles published daily or millions of articles in the archive. For example, Google charges a dollar per 1,000 articles of 1,000 characters, or the equivalent, analysed by its NLP API.
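
The cost concern in the last point can be made concrete with a rough estimate. The sketch below uses only the price quoted above (one dollar per 1,000 units of 1,000 characters) and covers the NLP API alone, since no translation price is given here. Rounding each article up to whole units is an assumption, and the article lengths are hypothetical.

```python
import math
from typing import Iterable

PRICE_PER_1000_UNITS_USD = 1.0  # figure quoted in the article
CHARS_PER_UNIT = 1_000          # figure quoted in the article


def billable_units(char_count: int) -> int:
    """Units billed for one article (rounding up per article is an assumption)."""
    return max(1, math.ceil(char_count / CHARS_PER_UNIT))


def estimated_cost_usd(article_lengths: Iterable[int]) -> float:
    """Estimated NLP API cost for a batch of articles, by character length."""
    units = sum(billable_units(n) for n in article_lengths)
    return units / 1_000 * PRICE_PER_1000_UNITS_USD


# Hypothetical archive: one million articles of 4,500 characters each.
archive = [4_500] * 1_000_000
print(f"estimated cost: ${estimated_cost_usd(archive):,.0f}")
```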

The team ended up using Google APIs only for classifying samples of texts. After a human review, they used the samples to train their own algorithms for each category. They used libraries developed by Polish scientists to process the language properly.
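
The workflow described above (label a sample, review it, then train one model per category) can be sketched with a simple one-vs-rest text classifier. The article does not say which algorithms the team used, so the Naive Bayes model below is a stand-in, and the `BinaryNaiveBayes` class, the `tokenize` helper, and the English sample texts are all hypothetical. A real Polish pipeline would also rely on language-specific processing such as lemmatisation.

```python
import math
import re
from collections import Counter
from typing import Iterable, List, Tuple


def tokenize(text: str) -> List[str]:
    # Naive lowercase word split; no stemming or lemmatisation.
    return re.findall(r"\w+", text.lower())


class BinaryNaiveBayes:
    """Multinomial Naive Bayes for one category, with add-one smoothing."""

    def fit(self, samples: Iterable[Tuple[str, bool]]) -> "BinaryNaiveBayes":
        # Expects at least one positive and one negative sample.
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}
        for text, label in samples:
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokenize(text))
        self.vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        return self

    def _log_score(self, tokens: List[str], label: bool) -> float:
        total_docs = sum(self.doc_counts.values())
        total_words = sum(self.word_counts[label].values())
        score = math.log(self.doc_counts[label] / total_docs)
        for tok in tokens:
            if tok in self.vocab:  # ignore words never seen in training
                count = self.word_counts[label][tok] + 1
                score += math.log(count / (total_words + len(self.vocab)))
        return score

    def predict(self, text: str) -> bool:
        tokens = tokenize(text)
        return self._log_score(tokens, True) > self._log_score(tokens, False)


# Hypothetical human-reviewed samples for a "Sports" category.
reviewed_samples = [
    ("ski jumping world cup in Zakopane", True),
    ("the football team won the league match", True),
    ("parliament votes on the new budget", False),
    ("central bank keeps interest rates unchanged", False),
]
sports_model = BinaryNaiveBayes().fit(reviewed_samples)
print(sports_model.predict("ski jumping match tonight"))
```

Training one such model per category keeps each classifier small and lets a category be retrained on its own when the reviewed sample changes.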

“This approach helped the team to produce the proof of concept quickly, achieve a better quality of classification than using Google tools only, and we saved money,” explained Luiza Pawela. 

Every two or three months, the models for Content Categories are updated, and they feed other algorithms, such as article recommendations and reader segmentation by interests derived from topics read.

How are you tying data analytics to business objectives? Share the story. E-mail me at: INMA members can subscribe to the Smart Data Initiative bi-weekly newsletter here.

