By now, readers of this newsletter probably know that one of the early uses of generative AI is summarisation. It’s a finite job, it doesn’t require outside knowledge (so there’s no issue around training cut-off dates), and, for the humans tasked with checking the quality of a large language model’s output, it’s also a reasonable task to verify.
One thing generative AI doesn’t do too well, however, is citing: using quotations. LLMs given a summarisation job will generally paraphrase (rewrite) all material. Partly that’s because, if you think about it, the job of summarising is paraphrasing. But it’s also because a quote is rarely the most polished way of saying something.
We value a quote not for its effectiveness but because it is the most authentic representation of a person’s thought. I’m going to wager this is a dimension that’s cleanly lost on an LLM.
At Ippen, using generative AI to extract quotes and use them in generated summaries
The team at Ippen Media in Germany set out to see whether OpenAI’s LLMs (the GPT family) could produce summaries that also preserved quotes. In a recent article shared on Medium, Alessandro Alviani, their product lead for NLP, described his team’s iterations in trying to improve outcomes with different methods. Side note: We were very lucky to have Alessandro recently present at our data workshop at INMA’s World Congress of News Media in New York. You can catch up on this presentation if you were an attendee.
Alessandro and his team went for an approach many a software engineer will recognise as a tried-and-true method for any large, complex problem: breaking the problem down into much smaller chunks, which improves the chances of success because each step poses a far simpler problem.
So the team at Ippen first created an entirely separate record listing all the quotes in a given article. This, Alessandro notes, also provided a record against which to check the final product, a summary that includes quotes. Were quotes being fabricated? A quote needed to be in that first output of extracted quotes in order to count as a quote.
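That two-step idea can be sketched in a few lines. In Ippen’s pipeline the extraction itself is done by an LLM; here a simple regex over quote marks stands in for that call, and `verify_summary` is a hypothetical name for the check Alessandro describes: any quote in the generated summary must appear verbatim in the extracted record, or it is flagged as possibly fabricated.

```python
import re

# Matches text between straight or curly double quote marks
QUOTED = re.compile(r'[“"]([^”"]+)[”"]')

def extract_quotes(article: str) -> list[str]:
    """Step 1: build a separate record of every quote in the article.

    (Ippen used an LLM for this step; a regex stands in for it here.)
    """
    return QUOTED.findall(article)

def verify_summary(summary: str, extracted: list[str]) -> list[str]:
    """Step 2: return any quote in the summary that was never in the
    extracted record. A non-empty result signals a fabricated quote."""
    return [q for q in QUOTED.findall(summary) if q not in extracted]
```

The point of the separate record is exactly what the article describes: it gives you something mechanical to check the generative step against, independent of the model that wrote the summary.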
Ippen also looked at how GPT-4 performed relative to previous versions and found much better results. Additionally, supplying a sufficiently detailed system prompt via OpenAI’s Playground interface also improved results.
“Using our two-step approach, almost all quotes were correctly recognised. In 11 out of 12 sample articles, all quotations were treated accurately — compared to seven out of 12 with GPT-3.5. In total, 44 out of 45 single quotes were correctly inserted in the new output text — both for summaries and article variations — compared to up to 32 quotes using our original prompts (without the two-step approach),” Alessandro wrote.
Furthermore, in terms of summary writing itself, Ippen found GPT-4 strongly outperforming version 3.5: “When it comes to summaries, GPT-4 is far superior to GPT-3.5. In 11 out of 12 articles, all quotes were correctly included in the AI-generated summary. With GPT-3.5, this rate dropped to two out of 12.”
That said, the team also found variance depending on the test set of articles. Whenever you find variance in results, it generally indicates you haven’t “locked down” a reliable solution. This would surprise absolutely no one working with these young LLMs. Much is unknown about their behaviour (or misbehaviour, as it were), and there’s a measure of trial and error in coming up with prompts that land on reliable outcomes.
At The Guardian and the AFP, treating quotes as a specialty language and using tagged data for machine learning
Ippen’s work using GPT-4 to extract quotes reminded me of work from The Guardian and AFP that I had read about a few months ago. The two organisations teamed up to teach an algorithm to recognise quotes and extract them as structured data.
These two organisations didn’t use generative AI for this work but instead created a training set of quotes, broken down into discrete tagged components on which deep learning could take place. You can think of it as teaching the anatomy of a quote, which is really not just the quote itself but also what surrounds it: the attribution (called “source” in their model) and the word that introduces the quote (called “cue” in their model).
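As a sketch of what one tagged training example might look like: the field names “source” and “cue” come from their model as described above, while the class name and the sample sentence are my own illustration, not their actual annotation format.

```python
from dataclasses import dataclass

@dataclass
class TaggedQuote:
    """One annotated sentence: the quote span plus what surrounds it."""
    sentence: str  # the full sentence as it appeared in the article
    quote: str     # the quoted words themselves
    source: str    # who is being quoted ("source" in their model)
    cue: str       # the word introducing the quote ("cue" in their model)

# A hypothetical annotation in the spirit of their 800-sentence set
example = TaggedQuote(
    sentence='"Prices will fall," said the minister.',
    quote='Prices will fall,',
    source='the minister',
    cue='said',
)
```

Tagging all three components, rather than just the quoted words, is what lets the model learn the full anatomy: a quote is recognisable partly by the cue and source that frame it.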
What’s interesting about this approach is that by focusing on a discrete, if complex, issue, there is a linear relationship between the size and diversity of the training set and the correctness of the outcome. In their experiment, the AFP and The Guardian tagged 800 sentences and found a success rate of 89% in correctly identifying quotes.
Some of the failures of the model were already well understood: The team mentioned, for example, the fuzziness of what “we” may refer to when used as a collective, such as somebody speaking on behalf of a company.
Still, fundamentally, this is an approach that takes the tack that quotes are a kind of “speciality” within the language, if you will. And, from a journalistic perspective, this is true, of course: The presence of a quote, even if it is sometimes a bit more awkward than a paraphrase, carries a degree of authenticity that we value in our reports. So the extra labour is justified by the particular value of this type of sentence.
Thinking about these summarisation issues when complexity increases in the presence of languages that we wish to keep undisturbed led me into a bit of a spiral (what else is new). Because misery loves company, please join me …
What other types of language out there also need to be preserved exactly? Where else are we using language where extra-special care is placed in the individual choice of words or turn of phrase?
Quotes aren’t the only case. For example, think of instructions for operating technical equipment or software. Applying summarisation or any kind of paraphrase is likely to make this type of content unusable, since instructions often make very precise references to, say, the label on a button or the message a display may show. Change any of this language and you could easily render an instruction manual far less usable or even misleading.
This is an issue for all technical writing, and it’s not like technical writing is found only in user manuals. There are plenty of such examples in our news reports: for example, when a reporter covers health, science, or how technology works. There is often considerable deliberateness in picking one specific word over another; simplifying complex things in a manner faithful to the complex problem is often a bit of a tightrope exercise. And changing just one word may be the difference between correct and incorrect.
The problem with technical writing is that it may not be quite so easy to detect as quotes. Quotes, of course, contain quote marks. These are nice clean indicators, even if quote marks are sometimes used for other purposes. Still, as a pretty reliable indicator that a quote is present, quote marks are certainly helpful.
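That surface signal suggests one possible safeguard. This is not something either team describes, just a sketch: mask each quoted span with a placeholder before summarisation, instruct the model to copy placeholders verbatim, and restore the original wording afterwards, so the quoted language passes through the LLM undisturbed.

```python
import re

# Any span between straight or curly double quote marks
QUOTED = re.compile(r'[“"][^”"]+[”"]')

def protect_quotes(text: str) -> tuple[str, dict[str, str]]:
    """Replace each quoted span with a token the LLM is told to keep."""
    placeholders: dict[str, str] = {}

    def mask(match: re.Match) -> str:
        token = f"[QUOTE_{len(placeholders)}]"
        placeholders[token] = match.group(0)
        return token

    return QUOTED.sub(mask, text), placeholders

def restore_quotes(summary: str, placeholders: dict[str, str]) -> str:
    """Swap the tokens back for the original, untouched quotes."""
    for token, quote in placeholders.items():
        summary = summary.replace(token, quote)
    return summary
```

The same masking trick would only extend to technical writing if you could detect the protected spans in the first place, which is precisely the harder problem the next paragraph raises.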
For an LLM tasked with paraphrasing or summarising, proper summarisation may lie in recognising when it should not, in fact, summarise but rather needs to quote the original material and attribute it.
And this, really, would make our attempts at recognising quotation, whether via the tagged route of the AFP and The Guardian or the LLM route of Ippen, feel like child’s play in comparison. It is something we’d have to be wary of as we ask LLMs to step in and summarise our content at scale.
If you’d like to subscribe to my bi-weekly newsletter, INMA members can do so here.