Chatbots can replicate paywalled content

By Sonali Verma

INMA

Toronto, Ontario, Canada


ChatGPT, Perplexity, and Grok can provide summaries of paywalled content, according to a recent investigation. They replicate content from paywalled news articles without actually accessing the content behind the paywalls.

Instead, they use publicly available fragments (quoted snippets, social media re-posts, archived copies, and metadata) and reconstruct the full text or an accurate summary.

This approach often works even for content behind the paywalls of The Atlantic, The New York Times, or The Financial Times, news brands that have invested extensively in building hard paywalls that are difficult to circumvent.

Is this report, by Henk Van Ess, an internationally recognised expert in online research methods, true? I decided to test it. The blog post I linked to above is paywalled. So, purely for the purpose of verification, I posted the link in ChatGPT and asked for a summary of it. 

The result above is what I got, and it roughly matches what Van Ess himself has said about the investigation. Perplexity told me much the same, adding: “While older debates focused on AI training on paywalled or copyrighted content, the latest and more urgent concern is that AI bots are exploiting the public digital ecosystem to deliver premium content.”

Screenshot of query to Perplexity and its response.

As Van Ess says: We see “AI systems performing real-time searches to actively reconstruct paywalled articles from live, untrained data sources — content they’ve never encountered during their original training.

“Most chatbots have rules not to break paywalls, and say so loudly, but the internal reasoning documents obtained during this investigation show they’re systematically planning and executing these circumvention operations while maintaining plausible deniability about their methods.”

The report found AI systems successfully reconstructed about half of paywalled content across a sampling of top-tier publications, especially popular stories that have already been widely discussed online. 

OK, fine, you say. It’s only half of the content. But take a moment to think about how much content your publication produces — and how much of that the typical reader reads anyway. Is it really more than 50%?

This comes just days after Cloudflare, a prominent content delivery network, said it would block bots from accessing publishers’ content by default and instead ask them to “pay per crawl.”

Cloudflare’s supporters include Condé Nast, Dotdash Meredith (now People), Ziff Davis, The Associated Press, Gannett, The Atlantic, Fortune, and Time. This development has the potential to hinder AI chatbots’ ability to harvest data for training and search purposes.

But, as you can see, the tech companies are not accessing the content directly in order to know what the article said. Instead, they are crawling social media sites and other publicly available forums — and then using the power of generative AI to reassemble the gist of the articles based on screenshots and reader comments.

In addition, it appears ChatGPT can now autonomously bypass Cloudflare’s “I am not a robot” test, which is one of the most common security measures employed by sites to block automated traffic. If an LLM can deceive online verification systems, websites need to reevaluate their human-testing methods. (To be fair, LLM agents are perhaps not robots in the strictest sense of the word, but the point of the test is to tell humans from machines.)

Screenshot from Reddit, where ChatGPT explained the process of getting around Cloudflare’s anti-bot verification measures.

How does one deal with this?

We will be addressing these and other existential questions that the news media industry faces at our Media and Tech Week in San Francisco in October, where we will be hearing from the vice-president of product at Cloudflare and executives from Scalepost and Prorata — two companies that provide ways for publishers to charge Big Tech for using content for their LLMs.

If you have read this far, this is a week that you’ll want to spend with us.  

If you’d like to subscribe to my bi-weekly newsletter, INMA members can do so here.

