Two machine approaches to quotations: generative AI and tagged data

By Ariane Bernard


New York City, Paris


Hi everyone.

I am putting the finishing touches to our speaker lineup for our data master class series in October. Yes, it seems far down the road, but summer is a black hole. It is an A++ lineup, if I do say so myself. I couldn’t be more excited.

However, there are a couple more open slots, so if you’ve got an absolutely awesome case study you’d like to present in the company of some excellent folks, please do raise your hand. Extra special consideration to folks who don’t usually speak, whether because you’re wary of the exercise (come in, the water is fine) or because you hail from an underrepresented group! Your courage will always be inspiring to others.

I don’t know why I don’t use this space more to put in my own sponsored content. This is great. And hey, you don’t even have to click a cookie banner to see my ads.

On this … the main event.

See you on the next one!

Ariane

At Ippen, using generative AI to extract quotes and use them in generated summaries

By now, readers of this newsletter probably know that one of the uses you can make of generative AI at this early stage is summarisation. It’s a finite job, it doesn’t require outside knowledge (so there’s no issue around algorithm training cut-off dates), and — for the purposes of humans who would be tasked with checking the quality of the work of a generative AI algorithm (large language model) — it’s also a reasonable task to check on.

One thing generative AI doesn’t do too well, however, is citing — using quotations. LLMs given a job to summarise will generally paraphrase (rewrite) all of the material. Part of the reason is that, if you think about it, the job of summarising IS paraphrasing. But it’s also because a quote is rarely the most efficient way of saying something.

We value a quote rarely for its effectiveness but rather because it is the most authentic representation of a person’s thought. I’m going to wager this is a dimension that’s cleanly lost on an LLM.

The team at Ippen Media in Germany took on OpenAI’s LLMs (the GPT family) to see if they could get summaries that also preserved quotes. In a recent paper shared on Medium, Alessandro Alviani, their product lead for NLP, shared his team’s iterations in trying to improve outcomes with different methods. Side note: We were very lucky to have Alessandro recently present at our data workshop at INMA’s World Congress of News Media in New York. You can catch up on his presentation if you were an attendee.

Alessandro Alviani, Ippen Media's product lead for NLP, speaking at the Smart Data Workshop during the INMA World Congress of News Media in May.

Alessandro and his team went for an approach that many a software engineer will recognise as a tried-and-true method for any large, complex problem: breaking it down into far smaller chunks, so that the chances of success improve because each step is a fundamentally simpler problem.

So the team at Ippen first created an entirely separate record presenting all the quotes in a given article. And, Alessandro notes, this also provided a record against which to check the final product, a summary that includes quotes. Were quotes being fabricated? A quote needed to appear in that first output of extracted quotes in order to count as “a quote.”
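That verification step can be sketched in a few lines of Python. To be clear, this is a hypothetical illustration in the spirit of the approach, not Ippen’s actual code: it assumes step one has already produced a list of extracted quotes, and simply checks that every quote-mark-delimited passage in the generated summary appears in that list.

```python
import re

# Hypothetical sketch of the two-step check: a quote in the generated
# summary only counts as genuine if it appeared in the record of quotes
# extracted from the source article in step one.

# Matches passages in straight ("..."), curly (“...”), or German („...“) quote marks.
QUOTE_PATTERN = re.compile(r'"([^"]+)"|“([^”]+)”|„([^“”]+)[“”]')

def quoted_passages(text):
    """Return all passages enclosed in quote marks, in order of appearance."""
    return [next(g for g in m.groups() if g) for m in QUOTE_PATTERN.finditer(text)]

def fabricated_quotes(summary, extracted_quotes):
    """Quotes present in the summary but absent from the step-one record."""
    known = {q.strip() for q in extracted_quotes}
    return [q for q in quoted_passages(summary) if q.strip() not in known]
```

In use, a non-empty result flags a summary for rejection or human review:

```python
extracted = ["We expect growth next quarter"]
summary = 'The CEO said "We expect growth next quarter" and "profits doubled".'
fabricated_quotes(summary, extracted)  # flags "profits doubled" as fabricated
```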

Ippen also looked at how GPT-4 performed relative to previous versions and found much better results. Additionally, providing a sufficiently detailed system prompt in the Playground interface also improved results.

“Using our two-step approach, almost all quotes were correctly recognised. In 11 out of 12 sample articles, all quotations were treated accurately — compared to seven out of 12 with GPT-3.5. In total, 44 out of 45 single quotes were correctly inserted in the new output text — both for summaries and article variations — compared to up to 32 quotes using our original prompts (without the two-step approach),” Alessandro wrote.

Furthermore, in terms of summary writing itself, Ippen found GPT-4 strongly outperforming version 3.5: “When it comes to summaries, GPT-4 is far superior to GPT-3.5. In 11 out of 12 articles, all quotes were correctly included in the AI-generated summary. With GPT-3.5, this rate dropped to two out of 12.” 

That said, the team also found there was variance depending on the test set of articles. Whenever you find variance in results, it generally indicates you haven’t “locked down” a reliable solution. This would surprise absolutely no one working with these young LLMs. Much is unknown in their behaviour (or misbehaviour, as it were), and there’s a measure of trial-and-error in coming up with prompts to somehow land on reliable outcomes.

At The Guardian and the AFP, treating quotes as a specialty language and using tagged data for machine learning

Ippen’s work with GPT-4 trying to extract quotes reminded me of work I had read about from The Guardian and the AFP a few months ago. The organisations teamed up to teach an algorithm to recognise quotes and extract them as structured data.

These two organisations didn’t use generative AI for this work but rather created a training set of quotes, breaking them down into discrete tagged components that allow deep learning to take place on this set. You can think of it as teaching the anatomy of a quote, which is really not just the quote itself but also what surrounds it: the attribution (called “source” in their model) and the word that introduces the quote (called “cue” in their model).

The Guardian and the AFP avoided generative AI when it came to a project on quotations, instead creating a set of training quotes.

What’s interesting about this approach is that by focusing on a discrete, if complex, issue, there is a linear relationship between the size and diversity of the training set and the correctness of the outcome. In their experiment, the AFP and The Guardian tagged 800 sentences and found a success rate of 89% in correctly identifying quotes. 
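To make that anatomy concrete, here is a hypothetical, hand-made training example in the cue/source/content spirit described above. The field names and character-span format are my own illustration, not The Guardian and the AFP’s actual annotation schema:

```python
# A hypothetical annotated sentence in the spirit of the Guardian/AFP
# model: the quote's content, plus the "source" (who is speaking) and
# the "cue" (the verb introducing the quote), marked as character spans.

sentence = 'The minister said: "We will not raise taxes this year."'

annotation = {
    "cue":     {"start": 13, "end": 17},  # 'said'
    "source":  {"start": 0,  "end": 12},  # 'The minister'
    "content": {"start": 20, "end": 54},  # 'We will not raise taxes this year.'
}

def span_text(sentence, span):
    """Recover the surface text of an annotated span."""
    return sentence[span["start"]:span["end"]]
```

A few hundred sentences annotated this way (the two organisations tagged 800) become supervision for a sequence-labelling model, which learns to predict these spans on unseen text.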

Some of the failures of the model were already well understood: The team mentioned, for example, the fuzziness of what “we” may be referring to when it is used as a collective, such as somebody speaking on behalf of a company.

Still, fundamentally, this is an approach that takes the tack that quotes are a kind of “speciality” within the language, if you will. And, from a journalistic perspective, this is true, of course: The presence of a quote, even if it is sometimes a bit more awkward than a paraphrase, highlights a degree of authenticity that we value in our reports. So the extra labour is justified by the particular value of this type of sentence.


Thinking about how these summarisation issues grow more complex in the presence of language we wish to keep undisturbed led me into a bit of a spiral (what else is new). Because misery loves company, please join me …

What other types of language out there also need to be exactly preserved? Where else are we using language in which extra-special care goes into the individual choice of words or turn of phrase?

Quotes aren’t the only case. For example, think of instructions to operate technical equipment or software. Applying summarisation or any kind of paraphrase is likely to make this type of content unusable, since instructions often make very precise references to, say, a label on a button or the message a display may show. If you end up making any changes to this language, you could easily render an instruction manual far less usable or even misleading.

This is an issue for all technical writing, and it’s not like technical writing is found only in user manuals. There are plenty of such examples in our news reports — for example, when a reporter describes topics around health, science, or how technology works. There is often a considerable amount of deliberateness in picking one specific word over another — simplifying complex things in a manner that is faithful to the complex problem is often a bit of a tightrope exercise. And changing just one word may be the difference between correct and incorrect.

The problem with technical writing is that it may not be quite so easy to detect as quotes. Quotes, of course, contain quote marks. These are nice clean indicators, even if quote marks are sometimes used for other purposes. Still, as a pretty reliable indicator that a quote is present, quote marks are certainly helpful.
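As a toy illustration of how far quote marks alone can take you, the heuristic below flags sentences containing a balanced pair of quote marks. This is my own deliberately naive sketch, not anything from the projects above: scare quotes and quoted titles will trigger it just as readily as real quotations, which is exactly the “other purposes” caveat.

```python
def likely_contains_quote(sentence):
    """Naive heuristic: a sentence with a balanced pair of double quote
    marks probably contains a quotation -- or a title, or a scare quote,
    which is why this can only ever be a first-pass filter."""
    straight_pairs = sentence.count('"') // 2       # "..." straight quotes
    curly_pairs = min(sentence.count('“'),          # “...” curly quotes
                      sentence.count('”'))
    return straight_pairs >= 1 or curly_pairs >= 1
```

Technical writing offers no such cheap signal: nothing in the surface text marks a button label or a dosage instruction as untouchable.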

For an LLM given the task of paraphrasing or summarising, proper summarisation may lie in recognising when it should not, in fact, summarise but rather needs to quote the original material and attribute it.

And this, really, would make our attempts at recognising quotation — whether going the tagged route like the AFP and The Guardian, or the LLM route like Ippen — feel like child’s play in comparison. And it is something we’d have to be wary of as we ask LLMs to step in to summarise our content at scale.

Further afield on the wide, wide Web

For this week’s installment of FAWWW, I’d like to warn you: She’s very newsy.

• First, we’ve got movement on the AI legislation front: The EU Parliament approved a draft of the AI Act (The Guardian), which aims to codify where (and how) AI-powered systems can support humans, act on their own, or may not be allowed at all. This touches on facial recognition, including in the context of policing, and deep-fake videos. This text still has various steps to pass and opportunities for further changes, but it has already changed quite a bit.

Meanwhile, Time magazine published an exclusive look at the papers that OpenAI, the company that owns the GPT language family and ChatGPT, wrote in an effort to influence the watering down of the AI Act. And in a somewhat adjacent corner of the world, Andreessen-Horowitz, the venture capital firm, wrote a paper with a strong bent for the eventual goodness of AI, as well as a light call for regulation. Because the audience of this newsletter comes from media, I am sure that all will keep in mind the specific perspective of the author as they read this viewpoint from Silicon Valley.

• Closer to our corner of the world, NY Magazine asked, “Will Google’s AI plans destroy the news media?” In it, the author notes the anxiety of our industry over the way generative AI summaries in Google may further commoditise us, but also notes it’s not like the end-user experience or even the business viability of this enterprise for Google is guaranteed to work either:

“Will Google users be happy with a machine-improvised Wikipedia article at the top of their search results? Will it change their relationship to the sponsored links at the heart of Google’s business? Will they take product recommendations seriously from a Google bot? Will Google’s AI testing phase result in doubling down on content automation or quietly rolling it back? Will that be because users don’t care for it or because they do, but it’s in a way that threatens Google’s business? Their predicament is the AI dilemma in not-so-miniature: a confrontation with the essential weirdness of generating synthetic information.”

• Let’s now move to the Labor Doomsday beat: Over in The New York Times, a battle of estimations of how many jobs may be disrupted (replaced entirely or greatly reduced) by AI. As the article notes, the estimations you may hear are made trickier by the fact that while AI may affect many job tasks and required skills, it may not entirely replace them either (gift link). Meanwhile, Axel Springer, the owner of the tabloid Bild in Germany, has announced a cost-cutting programme (The Guardian) which, while not caused by the arrival of AI-powered processes, notes that further reorganisation will likely occur as new automation becomes available.

• For two non-newsy items on the “big reports” beat: UNESCO published “Reporting on Artificial Intelligence: a handbook for journalism educators,” from a group of industry authors. The report is also useful to journalists covering AI, equipping readers with a broad theoretical education on the topic in a manner that’s approachable and helpful.

Finally, I’ll have to settle for a deceptively short paragraph for a very significant piece of work: Nieman published a meaty report with great examples of generative AI tools and processes being applied in newsrooms, from translation to story writing. And if you want INMA’s own version of this, please do avail yourself of my own report on this from April. (We end this newsletter the way we started it, with my own sponsored content. What a week …)

About this newsletter

Today’s newsletter is written by Ariane Bernard, a Paris- and New York-based consultant who focuses on publishing utilities and data products, and is the CEO of a young incubated company.

This newsletter is part of the INMA Smart Data Initiative. You can e-mail me at with thoughts, suggestions, and questions. Also, sign up to our Slack channel.

