The idea of an “LLM for journalism” has interesting benefits
Smart Data Initiative Newsletter Blog | 19 October 2023
Hi everyone.
We’re looking at a “what if” this week: What if the journalism industry built an LLM for its purposes? This one is offered in the spirit of the interview you’ll hear about, with Aimee Rinehart, the senior product manager for AI strategy at the Associated Press, imagining the upsides of a certain scenario and how we may go at it.
And so, in that same spirit, do write back to me with the thoughts this post inspired!
Until next time,
Ariane
LLM for journalism
I recently attended the American edition of Newsgeist, the unconference Google convenes for journalists and technologists. Several of the discussions I attended touched on AI, of course, and in one of them, Aimee Rinehart, the senior product manager for AI strategy at the Associated Press, mentioned the idea of an “LLM for journalism.”
Now, colour me intrigued, of course!
If you’re even an occasional reader of this newsletter, you know there are some recurring challenges for the news industry (and, in fact, not just our industry) in using generative AI to create new content, even if much of it seems very promising.
By now you have probably heard about the issue of so-called hallucinations from large language models, the content engines that underpin tools like ChatGPT. Hallucinations is the name given to content output that, in human terms, seems to be “made-up” knowledge from the LLM.
The other issue is how and where the LLM was trained. What did it read for breakfast, lunch, and dinner? This often veers into debates about “truth,” “bias,” or, on the copyright and ethical end of things, whether every party who contributed to the training was fairly compensated for this work.
So after a very short night (because Werewolf, iykyk), I caught up early the next day with Aimee to chat about her intriguing idea. And in the interest of making this as clear as I can, this is an idea that Aimee floated, not a project she is working on or one that has received the green light from her employer. This is shared here in the spirit of the Newsgeist conference: a chat among folks with similar goals but often different expertise and perspectives.
“One of the biggest frustrations among the journalism community about large language models is we don’t know the training data that’s gone into it. So my thinking is, let’s build on trusted, licensed, more representative content than what we have currently,” Aimee said, pointing out that she saw this more as a foundational tool than an area of competition for news organisations.
The approach, in Aimee’s mind, would lead to building a licensable LLM to which organisations could then add their own archives as an element of reinforcement for the LLM.
It’s worth noting what reinforcement means in the context of deep learning: basically, various highly focused methods to train in a desired outcome or train out an undesirable one. In a regular system that isn’t built on deep learning, you remove bad outcomes by looking for the lines of code that cause them. While the bad outcome may be caused by several disparate parts of your code, perhaps only with specific triggers and conditions, all the code is legible. So fixing it is only a matter of understanding the issue well enough and making the targeted, direct adjustment necessary.
Deep learning, on the other hand, is a black box by nature. So if you think of the way you get a child to behave differently on something you don’t agree with, you can’t really reach into their brain and take out what causes the problem. What you do instead is present the child with competing alternatives to their behaviour, you provide additional reasons, you repeat yourself, etc. That’s reinforcement training.
One basic way to do reinforcement training is copying the same data a large number of times. This is really like repeating to your kid “Say, thank you” over and over until they do it on their own. If you copy the content of Wikipedia into your LLM 1,000 times, this LLM will be reinforced for Wikipedia relative to another source that’s just copied over one time. The LLM doesn’t “know” more in the sense of diversity, but because of the probabilistic nature of LLMs, it is now slightly tipped toward the content of Wikipedia.
It will not have escaped you that the amount of reinforcement necessary for anything is largely commensurate with the underlying size of the model. Adding one pound of something to a scale weighted with 98,320 elephants will not have the same effect as that same extra pound added to a scale weighted with a walnut.
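For the technically curious, here is a toy sketch in Python of the two ideas above: copying a source many times tips the odds toward it, and the same number of copies matters far less against a bigger base. It is only an illustration under loud assumptions. A real LLM is a neural network, not a word-pair counter, and the two one-sentence “sources” and the function name next_word_probs are invented for this example. (Practitioners often call this kind of repetition oversampling or upweighting a source.)

```python
from collections import Counter


def next_word_probs(corpus, prev_word):
    """Estimate P(next word | prev_word) by counting word pairs in the corpus."""
    counts = Counter()
    for doc in corpus:
        words = doc.lower().split()
        for first, second in zip(words, words[1:]):
            if first == prev_word:
                counts[second] += 1
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


# Hypothetical one-sentence "sources" (invented for illustration).
source_a = "the mayor promised repairs"
source_b = "the mayor denied delays"

# Each source copied once: the two continuations are equally likely.
balanced = next_word_probs([source_a, source_b], "mayor")

# Source A copied 9 times against a tiny base: the odds tip strongly.
small_base = next_word_probs([source_a] * 9 + [source_b], "mayor")

# The same 9 copies against 1,000 copies of the other source: barely a nudge.
large_base = next_word_probs([source_a] * 9 + [source_b] * 1000, "mayor")

print(balanced)
print(small_base)
print(large_base)
```

In the small case, “promised” goes from a coin flip to 9 chances in 10. In the large case, the very same nine copies leave it at under 1 chance in 100: the pound on the elephant scale.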
So Aimee’s idea, to use a well-licensed “LLM for journalism” as a base and customise/reinforce it with your own licensed archive for your personal version of it, has a few interesting consequences.
One of them is the ability to leverage an archive.
“There are a number of archival opportunities for an AI co-pilot, especially as newsrooms have such a high turnover, there’s no more institutional knowledge,” Aimee said. She identified things like being able to query an LLM for a summary of how a story may have evolved over time as one such example. But there is also a possible end-user application.
“If I’m an audience member and I’m interested in this news story, and I go to your Web site and ask, ‘What is the mayor doing about power?’ and then I get the summary.”
Aimee also sees the ability to reinforce the base LLM with the archive as a way to have an output with a stronger tone fidelity to your own writing style. (As she reads this, INMA’s editor is dreaming of a day when she can run my copy through a robot, press the “fix this mess” button, and get something that removes all my valuable hot takes, Taylor Swift references nobody asked for, and generally agrees with INMA’s house style.)
Something that’s particularly attractive in this approach is that it also potentially gives an interesting edge to industry participants that aren’t in the scale game, namely regional publications. The more niche the content, the trickier it is to imagine that an LLM will be able to reflect your specific content back at you. And local content, in many ways, is niche, drowned out by the Big Internet of general-interest news and content. So fine-tuning an LLM for the needs of specific local actors feels like a necessary step if there’s any hope of having an LLM generate content with this vantage point.
Local organisations, particularly those that have a long history and therefore deep archives, are uniquely valuable.
Any “LLM for journalism” would still have to contend with a necessary condition for any LLM to work, which is the enormous amount of content necessary to understand language. And LLMs need this for both sides of what happens when they get used: On the input side, when an end user passes a prompt to the LLM, and on the generation side. Just grabbing the contents of the AP and your entire archive is not even a fraction of a fraction of what is needed for a basic LLM. So this raises the question of what the base model is built on.
“Before anyone gets into the millions of dollars of investment necessary for the foundational model, I think a logical first step is to look around the open-source models and see which one is the best and why,” Aimee said. “I think the best version would be the one that has the most transparency on what it has ingested and has made an effort to either use Creative Commons information, or licensed, though that one is not likely with open source models.”
Now, I found Aimee’s thought experiment super interesting because I do think that fine-tuning, in general, is one of the logical approaches to looking for improvements when a use case is well defined and specific.
Whether “journalism” as a concern is well defined enough that fine-tuning for our use cases is possible, or would show improvements, is a big question. Bloomberg, famously, trained its own LLM, BloombergGPT, but Bloomberg’s use case is specialised journalism. General news is perhaps more niche than “all the content on the Internet,” but it’s definitely closer to the middle of the bell curve of general than financial news is on that same curve. (What could be at the top of that curve, one could wonder? I think it’s cats. Cat videos are really the regression to the mean of the Internet. But this one is a question for another day.)
There is also another dimension — and this one is a drum I will beat given any opportunity to do so — which is that purely LLM-based approaches to knowledge extraction are perhaps too antithetical to how LLMs work in the first place.
Fine-tuning tips odds in a probabilistic system, but there is still not a notion of “knowledge” in the way that human brains apply certain attributes to information that effectively creates different classes for the information in our brains. LLM hallucinations from a system design perspective aren’t really an error from the LLM. They are not desirable from our perspective, but an LLM doesn’t have a notion of “fact” to distinguish it versus everything else. It makes a statistical call on what is likely the most correct language (i.e., “string of words”) to generate for a given query.
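A toy sketch in Python may make that “statistical call” more concrete. Everything in it is an assumption made for illustration: the three continuations, their probabilities, and the function name generate are invented, and a real LLM computes its probabilities over tokens with a neural network rather than reading them from a table. The point to notice is that nothing in the data marks which continuation is true. There are only weights.

```python
import random
from collections import Counter

# Hypothetical next-phrase probabilities for one prompt (invented numbers).
# Nothing here marks which continuation is factually true.
continuations = {
    "in 2019": 0.80,
    "in 2017": 0.15,
    "in 1987": 0.05,
}


def generate(rng):
    """Pick one continuation at random, weighted by its probability."""
    options = list(continuations)
    weights = list(continuations.values())
    return rng.choices(options, weights=weights, k=1)[0]


rng = random.Random(42)  # fixed seed so the run is repeatable
tally = Counter(generate(rng) for _ in range(1000))
print(tally.most_common())
```

Over 1,000 runs, the most probable phrase comes out most of the time, but the less probable ones still come out now and then. If one of those happens to be false, that is what a reader experiences as a hallucination, even though the system did exactly what it was built to do.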
Where we humans differ in how we treat Knowledge (capital K) in a deterministic way is this: We separately treat facts as being impervious to other manipulations. So even if I am told once a piece of information from a party I trust (or my own observation), no amount of random other information in my model will make me change that fact. But LLMs, like GPT-4, are non-deterministic (though it is a puzzling bit for generative AI specialists) so the notion of a protected “class” for facts would actually be defeated by that non-determinism if it could somehow be modeled.
I am sharing all of this because this is a dimension of LLMs that may make them ultimately unsuitable for tasks where high factual accuracy on complex information is an important part of the requirements. The idea of an LLM for journalism is one that treats our specific context and use cases as exceptional and proposes that we may explore an exceptionalist approach and see how far this gets us.
In the spirit in which Aimee shared this, it is the definition of how technical systems are often improved: identifying unique use cases and trying to build a system that is specifically interested in solving for them. You can think of it as an effort to subtract in parts and add in others. But unlike other types of simpler systems where you can make some informed projection for how you think the system will likely improve the outcome, the very murky nature of generative AI introduces a lot of uncertainty in these calculations in the first place.
Which means if this were an idea that moved forward (and folks should reach out to Aimee!), accepting a certain amount of uncertainty, the chances of failure, and the fact that, well, it’s just hard, are basically part of the requirements. But then again, there are the words of JFK about choosing to go to the moon: “We choose to go to the moon in this decade and do the other things — not because they are easy, but because they are hard.”
Further afield on the wide, wide Web
Some good reads from the wider world of data:
- The BBC posted its principles for generative AI, a short document meant to underpin experiments the public broadcaster is planning to embark on. Of note, a commitment to human creation and storytelling, and taking the public stance that the scraping of BBC content data shouldn’t be a free-for-all. (BBC)
- A fascinating look at what the visual generative AI tools return when you ask them for things like “a person in India” or “a plate of food in China”: the publication Rest of World noted some reductionist, highly stereotypical outputs with a regularity and sameness that raises questions about how reductive working with AI may become. (Rest of World)
- Digiday is reporting on the business side of publishers making use of generative AI to help their ad sales efforts. Among concerns is the privacy of the client data being shared by the publishers with the AI platforms — BuzzFeed, BDG among others. There is some mitigation, like BuzzFeed paying for a license to make sure its data isn’t used for training, but the reality of a new technical frontier is that there are more unanswered questions than anything else. (Digiday)
- While this is Instagram comedy, I’d like to reassure your boss that you are doing work by watching this while on company clock. This is absolutely an explainer of engagement-based personalisation! (Elle Cordova via Instagram)
- This one actually came out a few months ago, but it didn’t cross my desk. If it’s a repeat for you, I’m very jealous — it’s a gem, and you got it before I did. But for everyone else, a long (long) read for the nerds, but top 10: “What are embeddings” from Vicki Boykis at Normcore Tech. The full paper (PDF) is the money, but I’ll let you find it from the Substack because it deserves exposure. As far as the deep dive into embeddings, the audience is, well, someone like me — technical, but not a data scientist. So yeah, it’s for y’all, technical product managers, pretty much.
- The NYT’s interactive team presents the outcome of the visual artist David LaSalle working with two technologists on a text-to-image AI trained to imitate his visual style. The feature helps the reader (you) to see how the output evolved with the prompts and where it seemed to gain what may be called “simili-creativity” (it’s hard to know whether it’s real creativity of course). (NYT, gift link)
I’ll wrap up this week’s FAWWW with something that rankled me a bit, an editorial “written by AI” about why AI is bad for journalism. To be clear, it’s not the content of the editorial that annoyed me (it’s incredibly mid, as these things go). It’s the trope of having an AI chatbot write about something about itself and printing this like AI has the personality to be an author. “We found that Bing Chat made lucid and persuasive arguments for keeping AI out of journalism,” per the editorial’s authors.
Except Bing Chat or any other such chatbot is not wise. It doesn’t have arguments for or against anything. It predicted the string of words that would best meet the request of the prompt asker. Having an argument involves being able to weigh competing claims, and judgment is a most human trait. If an LLM prints “killing is bad,” it is not arguing this. It doesn’t hold these views. It doesn’t defend them. It just predicts this is the correct output for the prompt.
I don’t know whether AI is bad for journalism (clearly many of these newsletters are about the complications of this brave new world), but I do know that it doesn’t do the media any service to lend human traits of intentionality to the output of an AI chatbot. It doesn’t elevate AI so much as it diminishes what makes human intelligence, and specifically journalism, so special.
About this newsletter
Today’s newsletter is written by Ariane Bernard, a Paris- and New York-based consultant who focuses on publishing utilities and data products, and is the CEO of a young incubated company, Helio.cloud.
This newsletter is part of the INMA Smart Data Initiative. You can e-mail me at Ariane.Bernard@inma.org with thoughts, suggestions, and questions. Also, sign up to our Slack channel.