The Guardian’s “failed” AI tool was a resounding success
Generative AI Initiative Blog | 12 May 2025
When generative AI first exploded onto the scene in late 2022, news companies around the world scrambled to respond. Many raced to build and release AI-powered tools, but The Guardian resisted the urge to release “something very shiny” right away.
Rather than immediately launching a public-facing product, the company took a step back, opting instead for deep experimentation and reflection.
“We made a very conscious decision that wasn’t what we were going to do,” explained Chris Moran, head of editorial innovation, during a recent INMA Webinar. Instead, the team committed to understanding the technology thoroughly before trying to apply it.
That sparked a months-long journey of exploring the technology’s possibilities and use cases. For The Guardian, a big focus was on fine-tuning:
“We wanted to draw some fundamental conclusions about the technology. We wanted to understand whether or not it really could capture editorial guidelines, styles, and values of The Guardian, and whether that made a difference when you were applying it to quite broad use cases like summarisation, tasks in production, transforming journalism, and so on.”

Focusing on fine-tuning
Summaries became an obvious use case for them to explore. Moran noted that most models can churn out decent summaries of traditional news articles, thanks to their structured nature. But The Guardian wasn’t interested in solving a problem that had already been cracked.
Instead, the team focused on live blogs, which are used for rolling coverage of breaking news, political events, or cultural moments. These blogs are more challenging to summarise: they are written in reverse chronological order and updated frequently, making it harder for a large language model to understand what matters most.
It quickly became apparent that fine-tuning an LLM is a partnership between humans and technology.
They started by assembling a dataset of 3,700 live blogs from the past decade, drawn from a wide range of styles: breaking news events, daily political briefings, and even community-driven cultural coverage like TV shows.
“The style of writing starts to look very, very different,” depending on the subject, Moran noted.

Each live blog included summaries written by journalists, which had to be paired precisely with the content they summarised. In some cases, the team had to reshape posts for machine readability.
“None of that was easy, and when you’re trying to work out whether or not the effort versus impact is worthwhile, the effort was already starting to stack up,” Moran admitted.
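The Guardian has not published its pipeline, so the sketch below is only an illustration of the kind of pairing work described above. The post structure, field names, and JSONL layout are assumptions rather than the newspaper’s actual format; the key ideas are flipping the newest-first posts back into chronological order and attaching the journalist-written summary as the training target.

```python
# Illustrative sketch only: pairing a live blog's posts with the
# journalist-written summary to build a fine-tuning example. The field
# names and JSONL layout are assumptions, not The Guardian's pipeline.
import json
from dataclasses import dataclass


@dataclass
class LiveBlogPost:
    timestamp: str  # e.g. "2023-04-22T14:05:00Z"
    text: str       # post body, with embeds such as tweets already flattened to text


def build_example(posts: list[LiveBlogPost], human_summary: str) -> dict:
    """Turn one live blog plus its journalist-written summary into a training pair."""
    # Live blogs are published newest-first; reorder chronologically so the
    # model sees events in the order they happened.
    ordered = sorted(posts, key=lambda p: p.timestamp)
    body = "\n\n".join(f"[{p.timestamp}] {p.text}" for p in ordered)
    return {
        "prompt": "Summarise the following live blog as short bullet points:\n\n" + body,
        "completion": human_summary,
    }


def write_dataset(examples: list[dict], path: str) -> None:
    """Write one JSON object per line (JSONL), a common fine-tuning format."""
    with open(path, "w", encoding="utf-8") as f:
        for example in examples:
            f.write(json.dumps(example, ensure_ascii=False) + "\n")
```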
As they worked with the data, the team learned how prompts and settings impacted output. They tested temperature settings — controls for how “creative” or predictable a model is — and experimented with different phrasing in prompts. Even small changes in a prompt, Moran said, “can make a radical difference to the output.”
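The article does not say which model or tooling the team used, so the snippet below is purely illustrative: a small sweep over prompt wordings and temperature values using the OpenAI Python client, with placeholder prompts and model name, to show how this kind of comparison is typically run.

```python
# Illustrative only: comparing prompt wording and temperature settings.
# The model name, prompts, and system message are placeholders, not The
# Guardian's setup. Requires the `openai` package and an OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()

prompts = [
    "Summarise this live blog in five bullet points.",
    "List the five most important, verified facts from this live blog.",
]
temperatures = [0.0, 0.3, 0.7]  # lower = more predictable, higher = more "creative"

live_blog_text = "..."  # flattened live blog content goes here

for prompt in prompts:
    for temperature in temperatures:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            temperature=temperature,
            messages=[
                {"role": "system", "content": "You are a news sub-editor."},
                {"role": "user", "content": f"{prompt}\n\n{live_blog_text}"},
            ],
        )
        print(f"--- prompt={prompt!r}, temperature={temperature} ---")
        print(response.choices[0].message.content)
```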
Evaluating the AI’s efforts
Once the model was trained, there was one more challenge to overcome.
“The thing we had to think about probably more than anything else was this question of evaluation,” Moran said. “Once we trained a model, how did we judge whether or not what it was doing was useful?”
Ultimately, The Guardian created a process for human reviewers to assess the AI’s summaries.
Each AI-generated summary was shown next to the original human-written one. Reviewers scored each bullet point in the AI output on three criteria:
- Accurate and important enough to be included.
- Accurate but unimportant.
- Inaccurate or misleading.
They also answered a crucial question: Does this summary match Guardian style?
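The review itself was done by people rather than code, but the rubric is simple to record and tally programmatically. The sketch below is an assumption about how that might look in Python, with enum labels mirroring the three criteria above plus the style question; it is not The Guardian’s actual evaluation tooling.

```python
# Illustrative sketch: recording and tallying reviewer judgements against
# the rubric above. Not The Guardian's actual evaluation tooling.
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class BulletScore(Enum):
    ACCURATE_IMPORTANT = "accurate and important enough to be included"
    ACCURATE_UNIMPORTANT = "accurate but unimportant"
    INACCURATE = "inaccurate or misleading"


@dataclass
class SummaryReview:
    blog_id: str
    bullet_scores: list[BulletScore]  # one score per bullet point in the AI summary
    matches_guardian_style: bool      # the reviewer's answer to the style question


def tally(reviews: list[SummaryReview]) -> None:
    """Aggregate bullet-level scores and the style judgement across all reviews."""
    counts = Counter(score for review in reviews for score in review.bullet_scores)
    total = sum(counts.values()) or 1  # avoid division by zero on an empty run
    for score in BulletScore:
        print(f"{score.value}: {counts[score]} ({counts[score] / total:.0%})")
    on_style = sum(review.matches_guardian_style for review in reviews)
    print(f"Matches Guardian style: {on_style}/{len(reviews)}")
```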
The evaluation required a significant amount of involvement from the newsroom: “We had to pull together a group of people,” Moran explained. “We couldn’t just stop all live bloggers from live blogging.”
Instead, they pulled together volunteers from across the organisation — people with enough live blogging experience to judge the outputs.

Early results were rough. In one summary about the death of comedian Barry Humphries, best known for his comic persona Dame Edna Everage, the model generated a passage with three quotes.
“John Cleese said he was a huge inspiration. Broadcaster Andrew Neil said he was the greatest comedian I’ve ever seen work live. And the comedian and writer Frankie Boyle said he was the greatest ever writer and deliverer of insults,” Moran said.
The problem was, only one of those quotes was accurate: Neil’s words had been misrepresented, and Boyle had said nothing at all.
The mistake originated from a format quirk in which a quote was embedded in a tweet — something the model couldn’t read properly. “We simply hadn’t thought about the fact that sometimes the text would be included in an embedded tweet,” Moran said.

In another case, the AI summarised a tragic bus crash by stating, “There were 35 people on the bus,” having combined the reported 10 deaths and 25 injuries. It seemed logical, but it was unverified — and potentially dangerous.
“That wasn’t a fact,” Moran emphasised. “It may have been true, but there may have been more people on the bus.”
Through fine-tuning, the model improved. A later summary about Prince Harry’s legal case against Mirror Group Newspapers was far more accurate.
“The machine has really got better, not only being accurate but also identifying the right kind of thing to appear in these summaries,” Moran said. However, the summary still contained one error, which “wasn’t particularly enormous” but nonetheless introduced legal risk.
A valuable experiment
Despite making progress, The Guardian ultimately chose not to launch the tool.
“Even though we improved the model enormously, the conclusion we came to was … it was even harder to spot one single mistake in 400 words for an editor than it could be to spot a whole flurry of them,” he explained.
They agreed this was too high a risk. And it drove home the point that simpler outputs, such as headlines or short bullet points, are more practical.
“Everybody knows that they really matter to people and they’re subject to an enormous amount of scrutiny,” Moran said. “And it’s really obvious when there’s a very big error in them.”
From a product perspective, the project didn’t yield a tool, but as a research initiative, it was a resounding success. The team learned about data preparation, prompt design, evaluation frameworks, and AI limitations in real newsroom contexts.
“If you look at it through the pure lens of whether or not we created a product, one might deem this a failure,” Moran said. “But ultimately, the number of things that we drew from it … was absolutely worth the four months we spent working on it.”