Structured data, not Internet scraping, results in trustworthy robot-produced journalism

By Sören Karlsson

United Robots

Malmö, Sweden


There’s an important discussion going on in our industry around the use of robots to produce journalism. Having worked in this field for several years now, I’ve heard sceptics question how it’s possible to build algorithms that create consistently reliable and correct articles. This concern is valid when the underlying data is collected by scraping the Internet. But when you build your automated content on structured data sets, the risk of error is minimal.

Building automated stories off of specified data results in meaningful machine-created content.

It’s easy to understand the disquiet in the debate about automated journalism and the confusion at the root of it. As with all technology, it can be used for nefarious ends. It’s possible to write an algorithm that scrapes the Internet for just the kind of data or content you’re looking for, and then to generate stories from that selectively gathered material to suit someone’s political purposes, for example.

But this technology can also be used for good. Many serious news publishers, with journalistic principles at their core, use automated journalism to strengthen their businesses, and some of them work with United Robots. The process they use is very different from what I described above.

First of all, there’s the data. To produce reliable, factual texts, you need to work from structured sets of quality data, such as land registry data or sports results, which are not only correct, but will be consistently available over time. The subsequent automation workflow then includes careful analysis of the data, a verified language process, and, finally, distribution on the right platform, to the right audience.

With United Robots’ technology, the algorithms are managed by man and machine in tandem. The journalists and editors we work with determine how an article should be constructed: whether it should include, for example, a headline of a certain type and/or length, a standfirst to some specification, a certain number of facts, a summary at the end, and so on.

The structure of the texts and the rules around what angles to look for can be defined in considerable detail by the newsroom. So, for example, when we build articles about property sales, a property may be defined as a “mansion” if the house is at least X in size and the land is at least Y square metres or acres.
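A rule of this kind can be sketched in a few lines of Python. This is only an illustration, not United Robots’ actual implementation: the class name, field names and units are hypothetical, and the threshold values in the usage lines are arbitrary placeholders standing in for the X and Y that each newsroom would set for itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MansionRule:
    """Hypothetical newsroom rule: thresholds are chosen by editors, not by the code."""
    min_living_area_sqm: float  # the "X" in the text
    min_land_area_sqm: float    # the "Y" in the text

    def matches(self, living_area_sqm: float, land_area_sqm: float) -> bool:
        # Both conditions must hold for the property to be called a "mansion".
        return (
            living_area_sqm >= self.min_living_area_sqm
            and land_area_sqm >= self.min_land_area_sqm
        )


# Placeholder thresholds for illustration only; a real newsroom would pick its own.
rule = MansionRule(min_living_area_sqm=300, min_land_area_sqm=2000)

large_house_large_plot = rule.matches(living_area_sqm=350, land_area_sqm=2500)
large_house_small_plot = rule.matches(living_area_sqm=350, land_area_sqm=500)
```

Keeping the thresholds as data rather than hard-coding them means editors can adjust the definition without anyone touching the text-generation logic.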

Or if we generate sports texts about football, for example, a “sensational” turn of events may be if XYZ happens. And a “crushing defeat” may only require being beaten by three goals in the top league but by six in Division IV. The newsroom’s work to pin down these definitions is an interesting process in itself, forcing editors and reporters to really think through the language and values they use.
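The “crushing defeat” example translates naturally into a small lookup table. The sketch below uses the two thresholds mentioned above (three goals in the top league, six in Division IV); the function name, the league labels as dictionary keys, and the choice to reject unknown leagues are assumptions made for illustration.

```python
# Minimum losing margin (in goals) for a defeat to be described as "crushing".
# The two values come from the example in the text; the keys are illustrative labels.
CRUSHING_DEFEAT_MARGIN = {
    "top league": 3,
    "Division IV": 6,
}


def is_crushing_defeat(league: str, goals_for: int, goals_against: int) -> bool:
    """Return True if the losing margin meets the newsroom's threshold for this league."""
    if league not in CRUSHING_DEFEAT_MARGIN:
        # No rule defined by the newsroom: refuse to guess rather than pick a label.
        raise ValueError(f"No 'crushing defeat' rule defined for league: {league}")
    margin = goals_against - goals_for
    return margin >= CRUSHING_DEFEAT_MARGIN[league]


top_league_0_3 = is_crushing_defeat("top league", goals_for=0, goals_against=3)
division_iv_0_3 = is_crushing_defeat("Division IV", goals_for=0, goals_against=3)
division_iv_1_7 = is_crushing_defeat("Division IV", goals_for=1, goals_against=7)
```

The same 0-3 result is a crushing defeat in the top league but not in Division IV, which is exactly the kind of editorial judgement the rule table makes explicit.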

Once the rules for the text structure and angles are established by man, machine takes over. What the machines, the robots, contribute is that they apply those rules without ever making factual or logical errors of their own. If a fact is in the data, it’s correct and may be included (and consequently, if it’s not in the data, it will not feature). In other words, we build text on insights gleaned from the data analysis alone, with the rule system set in line with the journalistic principles and style sheets of the newsroom in question.
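The principle that a fact only appears in the text if it is in the data can be illustrated with a minimal template-based sketch. Everything here is hypothetical: the record fields, the sentence templates and the function name are invented for the example and do not describe any particular production system.

```python
def build_sale_text(record: dict) -> str:
    """Build a short property-sale text using only the fields present in the record."""
    sentences = []
    # The core sentence needs both an address and a price; without them, say nothing.
    if "address" in record and "price" in record:
        sentences.append(
            f"The property at {record['address']} has been sold for {record['price']:,}."
        )
    # Optional facts are included only when the data actually contains them.
    if "year_built" in record:
        sentences.append(f"The house was built in {record['year_built']}.")
    if "living_area_sqm" in record:
        sentences.append(
            f"It has {record['living_area_sqm']} square metres of living space."
        )
    return " ".join(sentences)


full = build_sale_text(
    {"address": "Example Street 1", "price": 2500000, "year_built": 1975}
)
partial = build_sale_text({"address": "Example Street 1", "price": 2500000})
```

When the construction year is missing from the record, the corresponding sentence is simply left out; the code never estimates or fills in a value.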

From a business perspective, working from structured data sets consistently published over time — as opposed to scraping for data — means a guaranteed volume of articles will be regularly generated, without which you can’t build sustainable news products and services. And only with structured data can you ensure the quality and reliability necessary to maintain trust in your journalism.

