One of the advantages of having a data science team that’s in touch with the newsroom’s needs is collaboration in developing cutting-edge tools that work for both journalists and readers.

Cicero, named for the Roman orator, was created by our data scientists through conversations with the editor-in-chief and editorial staff. It’s an artificial intelligence (AI) platform that reduces reporters’ manual work, helps them find connections, and gives readers more transparency to increase engagement.

The new AI platform is helping investigative reporters streamline their work.

Here, Shengqing Wu, The Globe’s senior manager of data science, describes Cicero and its advantages.

Where did the idea come from?

The Globe is known for our in-depth investigative journalism. I was lucky to get first-hand experience working with an investigative reporter on a big award-winning story about sexual assault called Unfounded. I didn’t realise how intense this kind of journalism is. Unfounded took 20 months to produce and required a huge amount of manual effort to sort through documents to find the threads of the story.

This experience highlighted the effort and limitations involved in how journalists research, iterate, and do their work. It was clear there was an unsolved problem on the table and that, if we could crack it, the solution would pay off many times over for the journalists. We also wanted to open a door for our readers so they can see the paths the stories took and perhaps even explore further, surfacing new reporting avenues for journalists to pursue.

How does Cicero work?

Cicero turns unstructured data into usable information.

Journalists upload material in file types ranging from images to audio to PDFs, such as freedom of information (FOI) and access to information and privacy (ATIP) documents. These large and disparate data sets are combined with all the stories The Globe has ever published to create an institutional database available to all users.

Journalists can then conduct deep research more efficiently by searching on their subject rather than poring over mountains of documents. This reduces the risk of spending too much time on research only to find there’s no important story. It can also help uncover connections that might otherwise have been overlooked.

What does a journalist see when running a search?

There are three output options.

The first is a straightforward list of documents linked to the search term. The second is a knowledge graph, which shows the connections between your search topic and associated people or organisations.

This knowledge graph is a sample graph using real search terms from Cicero.
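
To make the idea of a knowledge graph concrete, here is a minimal sketch in Python. It is not Cicero's implementation, which the interview does not describe. It assumes each document has already been reduced to a list of named entities, and it links two entities whenever they appear in the same document. The names `build_knowledge_graph`, `neighbours`, and `SAMPLE_DOCS` are hypothetical, as are the placeholder entities.

```python
from collections import defaultdict
from itertools import combinations


def build_knowledge_graph(documents):
    """Count, for each unordered entity pair, how many documents mention both.

    `documents` maps a document id to the list of entities found in it.
    Assumption: entity extraction has already happened elsewhere.
    """
    edges = defaultdict(int)
    for entities in documents.values():
        # sorted(set(...)) removes repeats and gives each pair a stable order
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return dict(edges)


def neighbours(graph, term):
    """Return entities linked to `term`, strongest connection first."""
    linked = {}
    for (a, b), weight in graph.items():
        if a == term:
            linked[b] = weight
        elif b == term:
            linked[a] = weight
    return sorted(linked.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample data with placeholder entity names
SAMPLE_DOCS = {
    "doc-1": ["Person A", "Agency X"],
    "doc-2": ["Person A", "Agency X", "Company Y"],
    "doc-3": ["Company Y", "Person B"],
}

graph = build_knowledge_graph(SAMPLE_DOCS)
print(neighbours(graph, "Person A"))
```

Searching for "Person A" in this toy data returns the two entities that share a document with that person, ranked by how many documents they share.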

The third is an entity chain. This is where the AI really shines. With the entity chain, Cicero uses reinforcement learning (the AI learns by interacting with an environment; it’s used in self-driving cars) to analyse the data in the institutional database. It automatically selects entities, names, and organisations to suggest an investigative path that can potentially show previously hidden connections.

This entity chain is a sample graph using real search terms from Cicero.
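
As a rough illustration of what a chain of entities looks like, the sketch below finds the shortest path between two entities in a small graph. It deliberately does not use reinforcement learning: it substitutes a plain breadth-first search, which is much simpler and only conveys the shape of the output, not how Cicero chooses its path. The function `entity_chain`, the `SAMPLE_LINKS` data, and the entity names are all hypothetical.

```python
from collections import deque


def entity_chain(links, start, goal):
    """Return the shortest chain of linked entities from start to goal.

    `links` maps each entity to the set of entities it is connected to.
    Returns None when no chain exists. This is breadth-first search,
    a stand-in for the reinforcement learning approach described above.
    """
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        # sorted() keeps the traversal order deterministic
        for nxt in sorted(links.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Hypothetical sample data with placeholder entity names
SAMPLE_LINKS = {
    "Person A": {"Agency X", "Company Y"},
    "Agency X": {"Person A", "Company Y"},
    "Company Y": {"Person A", "Agency X", "Person B"},
    "Person B": {"Company Y"},
}

chain = entity_chain(SAMPLE_LINKS, "Person A", "Person B")
print(" -> ".join(chain))
```

In this toy data "Person A" and "Person B" never appear together, yet the chain connects them through "Company Y", which is the kind of indirect link a reporter might otherwise miss.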

What’s been the feedback so far?

As far as we know, Cicero is unique: we aren’t aware of any other media organisation with an AI tool with these capabilities, so it’s been interesting to get the feedback.

We’ve been working with a focus group of 10 journalists, and their response has been very positive. So far, it has been used mostly for larger investigations rather than day-to-day reporting. For instance, one reporter uploaded 500 pictures of court documents taken on his phone. Then, instead of sitting down to read through 500 pages, he used Cicero to find the information he needed.

What are the next steps?

We anticipate a broader impact for journalists in how they’ll do their reporting and research.

We’ll be tweaking the system based on feedback and rolling it out to larger groups. We’re also developing the user interface, which will let readers access Cicero and help them understand more of the investigative journey. We think this will build trust and show transparency.

Down the road, we’re going to widen Cicero’s scope and knowledge database by integrating feeds from financial and other databases, instead of only having reporters upload their documents.