Your content is driving ChatGPT

By Peter Bale


New Zealand and the U.K.


It turns out that the question of how ChatGPT attributes the source of content is a big one, but perhaps not as big as the fact that OpenAI has probably already ingested everything you have ever published to the Internet.

We talked about attribution — or lack of it — in various iterations of generative Artificial Intelligence bots powered by ChatGPT in the last Newsroom Initiative newsletter and blog post. It raises important issues about sourcing, trust, payment to content creators, or even recognition that they exist, as well as future sources of revenue for this new type of search engine.

Just as big an issue may be the fact that OpenAI — a non-profit organisation originally dedicated to the ethical development of Artificial Intelligence, but with a for-profit arm that intends to commercialise tools like ChatGPT — has already used your content to create the corpus from which it draws its weirdly uncanny answers to almost any question.

I wish I had known this before writing last week’s note, and I am grateful to computational journalist Francisco Marconi for sharing on Twitter the GitHub link to what appears to be the entire training dataset OpenAI used to build ChatGPT. It is a remarkable list, and it would almost be easier to find publishers who aren’t on it than to tell you whose content was scraped to create the ChatGPT base of information.

Journalist Francisco Marconi shared a list of media companies OpenAI scraped to build ChatGPT.

The Guardian, New York Times, BBC, and Reuters are all there. FAZ, Süddeutsche Zeitung, and McClatchy are on the list. Netflix is in there as well. The list shows an initial 1,000 sites, and one can perhaps assume there are many more below the top 1,000.

Quite how publishers may react to this is so far unclear. But the fact that ChatGPT is on the verge of being commercialised inside the new Microsoft Bing search engine suggests critical questions need to be asked. As a non-profit, OpenAI may argue — with some justification — that it is doing research and not running a commercial scraping operation to create its AI model. But commercial exploitation surely breaches publisher terms and conditions.

In a submission to U.S. authorities, the company claimed the scraping constituted “fair use.” Publishers are already raising the alarm about that provision as ChatGPT goes commercial.

“Anyone who wants to use the work of Wall Street Journal journalists to train Artificial Intelligence should be properly licensing the rights to do so from Dow Jones,” Jason Conti, general counsel for News Corp’s Dow Jones unit, was quoted as telling Bloomberg News in a statement. “Dow Jones does not have such a deal with OpenAI.”

And here is where it gets interesting for newsrooms: when an organisation like Dow Jones, publisher of the Journal and owned by News Corporation, starts talking this way, others may follow. Conti added: “We take the misuse of our journalists’ work seriously and are reviewing this situation.”

Marconi, who shared that GitHub list, is the author of Newsmakers: Artificial Intelligence and the Future of Journalism. He is also listed as chief executive officer of Applied XL, a news early-warning technology company.

“This debate is both fascinating and complex: Fair Use can boost AI innovation, but at the same time raises concerns about the lack of compensation (or even attribution) for publishers who produced training data,” Marconi wrote in a Twitter post related to the OpenAI fair use claim.

If you’d like to subscribe to my bi-weekly newsletter, INMA members can do so here.

