ChatGPT breaking news: Your content is driving it
Newsroom Transformation Initiative Newsletter Blog | 01 March 2023
Welcome to the latest Newsroom Initiative newsletter.
As promised last time, we’re continuing to explore the questions that the arrival of generative Artificial Intelligence search engines, especially ChatGPT, raises for publishers and newsrooms.
The immediate threat to journalism and journalists’ jobs may be overstated — as well as being the wrong lens through which to view the amazing possibilities of the technology. But there’s no doubt it is a critical moment that poses challenges and opportunities for us all.
It’s your content that drives ChatGPT
It turns out that how ChatGPT attributes the sources of its content is a very big question, but maybe not as big as the fact that OpenAI has probably already ingested everything you have ever published to the Internet.
In the last Newsroom Initiative newsletter and blog post, we talked about attribution — or the lack of it — in various iterations of generative Artificial Intelligence bots powered by ChatGPT. The issue raises important questions about sourcing, trust, payment to content creators (or even recognition that they exist), and future sources of revenue for this new type of search engine.
Just as big an issue may be the fact that OpenAI — a non-profit organisation originally dedicated to the ethical development of Artificial Intelligence but with a for-profit arm that intends to commercialise tools like ChatGPT — already has your content and has used it to create the corpus from which it draws its weirdly uncanny answers to almost any question.
I wish I had known it was there before I wrote last week’s note, and I am grateful to computational journalist Francesco Marconi for sharing on Twitter the GitHub link to what appears to be the entire training dataset OpenAI used to build ChatGPT. It is a remarkable list, and it would almost be easier to find publishers who aren’t there than to tell you whose content was scraped to create the ChatGPT base of information.
The Guardian, The New York Times, the BBC, and Reuters are all there. FAZ, Sueddeutsche Zeitung, and McClatchy are on the list. Netflix is in there as well. The list shows an initial 1,000 sites, and one can perhaps assume there are many more below the top 1,000.
Quite how publishers may react to this is so far unclear. But the fact that ChatGPT is on the verge of being commercialised inside the new Microsoft Bing search engine suggests critical questions need to be asked. As a non-profit, OpenAI may argue — with some justification — that it is doing research and not running a commercial scraping operation to create its AI model. But commercial exploitation surely breaches publisher terms and conditions.
In a submission to U.S. authorities, the company claimed the scraping constituted “fair use.” Publishers are already raising the alarm about that claim as ChatGPT goes commercial.
“Anyone who wants to use the work of Wall Street Journal journalists to train Artificial Intelligence should be properly licensing the rights to do so from Dow Jones,” Jason Conti, general counsel for News Corp’s Dow Jones unit, was quoted as telling Bloomberg News in a statement. “Dow Jones does not have such a deal with OpenAI.”
And here is where it gets interesting for newsrooms: an organisation like Dow Jones, publisher of the Journal and owned by News Corporation, is saying things like this, as Conti added: “We take the misuse of our journalists’ work seriously and are reviewing this situation.”
Marconi, who shared that GitHub list, is the author of Newsmakers: Artificial Intelligence and the Future of Journalism. He is also listed as the chief executive officer of Applied XL, a news early-warning technology company.
“This debate is both fascinating and complex: Fair Use can boost AI innovation, but at the same time raises concerns about the lack of compensation (or even attribution) for publishers who produced training data,” Marconi wrote in a Twitter post related to the OpenAI fair use claim.
Further reading and listening on this subject
- Generative AI is a legal minefield, by Ina Fried of Axios Login, looks at a range of questions that will be important to publishers, from being paid for having their content scraped to attribution and whether AI could increase libel risk.
- ChatGPT is a data privacy nightmare. If you’ve ever posted online, you ought to be concerned, in The Conversation, written by Uri Gal, professor of business information systems at the University of Sydney, looks beyond publisher content to the much more personal data ingested by ChatGPT.
- The youth, pitfalls of generative AI are key context to this moment, by my INMA colleague Ariane Bernard, urges a note of caution about the potential for the sky to fall in.
- Another Podcast, from technology commentator Benedict Evans and co-host Toni Cowan-Brown, is always thought-provoking, and they devoted three episodes to the range of questions posed by generative AI, including answering what it is and is not.
A couple of media must-reads
The phenomenon of newspapers moving to a digital-only presence is gathering pace, especially in the United States, but we can expect it to accelerate in Scandinavia, New Zealand, and many other markets. News Corp Australia moved resolutely in that direction three years ago, closing 100 print titles even before the dramatic rises in newsprint prices since then.
Advance Publications last week said it would close the print editions of three Alabama newspapers and move them entirely online under its primary Alabama site, AL.com.
“The Alabama Media Group says that after Feb. 26, 2023, one last Sunday, it will permanently stop the presses for The Birmingham News, The Huntsville Times, and Mobile's Press-Register. The company had already curtailed publishing from daily to three times a week in 2012 — part of a restructuring by parent company Advance Publications that also affected New Orleans’ The Times-Picayune,” NPR reported on the decision.
Expect many more publishers around the world to do the same. Watch this space.
Where did Facebook’s funding for journalism really go? is a revealing investigation by the Columbia Journalism Review into how Facebook’s pledge to invest something like US$300 million in local news organisations may have delivered support at only about 10% of that level.
Reporting on data gathered by Columbia’s Tow Center for Digital Journalism, the CJR says the evident shortfall in promised investment could be relevant to attempts in Canada and elsewhere to follow the Australian government in legislating for platforms to pay for news.
“At a time when policymakers around the world are legislating in ways that both restrict technology platforms and occasionally benefit local news industries, we intend our research to be useful in assessing the scale of recent programs,” the CJR said.
Recommended follow
John Sweeney @johnsweeneyroar is a great reporter with a history of investigating hard-to-reach stories, from North Korea to Scientology. For more than a year, he’s dedicated himself to reporting on and from Ukraine, taking significant risks and giving a real sense of presence. He has published Killer in the Kremlin, billed as an explosive account of Vladimir Putin’s “reign of terror.”
Talk back
Please tell me what you want to read and what you like or don’t like in this newsletter. E-mail: peter.bale@inma.org. There’s also an INMA Newsroom Initiative Slack channel.
About this newsletter
Today’s newsletter is written by Peter Bale, based in New Zealand and the U.K. and lead for the INMA Newsroom Initiative. Peter will share research, case studies, and thought leadership on the topic of global newsrooms.
This newsletter is a public face of the Newsroom Initiative by INMA, outlined here. E-mail Peter at peter.bale@inma.org or newsroom@inma.org with thoughts, suggestions, and questions.