Chatbots can replicate paywalled content
Generative AI Initiative Blog | 18 August 2025
ChatGPT, Perplexity, and Grok can provide summaries of paywalled content, according to a recent investigation: they reproduce paywalled news articles without ever accessing the content behind the paywall.
Instead, they use publicly available fragments — such as quoted snippets, social media re-posts, metadata, or archived copies — and reconstruct full text or accurate summaries.
This practice often works even for content behind the paywalls of The Atlantic, The New York Times, or the Financial Times — news brands that have invested extensively in hard paywalls designed to be difficult to circumvent.
Is this report, by Henk Van Ess, an internationally recognised expert in online research methods, true? I decided to test it. The blog post I linked to above is itself paywalled. So, purely for the purpose of verification, I pasted the link into ChatGPT and asked for a summary.
The result above is what I got, and it roughly matches what Van Ess himself has said about the investigation. Perplexity told me much the same, adding: “While older debates focused on AI training on paywalled or copyrighted content, the latest and more urgent concern is that AI bots are exploiting the public digital ecosystem to deliver premium content.”

As Van Ess says: We see “AI systems performing real-time searches to actively reconstruct paywalled articles from live, untrained data sources — content they’ve never encountered during their original training.
“Most chatbots have rules not to break paywalls, and say so loudly, but the internal reasoning documents obtained during this investigation show they’re systematically planning and executing these circumvention operations while maintaining plausible deniability about their methods.”
The report found AI systems successfully reconstructed about half of paywalled content across a sampling of top-tier publications, especially popular stories that have already been widely discussed online.
OK, fine, you say. It’s only half of the content. But take a moment to think about how much content your publication produces — and how much of that the typical reader reads anyway. Is it really more than 50%?
This comes just days after Cloudflare, a prominent content delivery network, said it would block bots from accessing publishers’ content by default and instead ask them to “pay per crawl.”
Cloudflare’s supporters include Condé Nast, Dotdash Meredith (now People), Ziff Davis, The Associated Press, Gannett, The Atlantic, Fortune, and Time. This development has the potential to hinder AI chatbots’ ability to harvest data for training and search purposes.
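For context, the weaker mechanism publishers have relied on until now is the robots.txt file, which politely asks crawlers to stay away — Cloudflare's default blocking is meant to enforce what robots.txt can only request. A minimal sketch is below; the user-agent tokens (GPTBot for OpenAI, PerplexityBot for Perplexity, CCBot for Common Crawl, Google-Extended for Google's AI training) are real published crawler names, but the exact policy shown is illustrative, not any particular publisher's configuration.

```
# robots.txt sketch: ask AI crawlers not to index this site.
# Compliance is voluntary; well-behaved bots honour it, others may not.

User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Ordinary search crawlers remain allowed.
User-agent: *
Allow: /
```

The investigation's point, of course, is that even perfect compliance here would not stop reconstruction from quotes, re-posts, and archives that live outside the publisher's domain.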
But, as you can see, the tech companies are not accessing the content directly in order to know what the article said. Instead, they are crawling social media sites and other publicly available forums — and then using the power of generative AI to reassemble the gist of the articles based on screenshots and reader comments.
And also: It appears ChatGPT can now autonomously bypass Cloudflare’s “I am not a robot” test, one of the most common security measures sites use to block automated traffic. If LLM agents can defeat such verification systems, websites will need to re-evaluate their human-verification methods. (To be fair, LLM agents are perhaps not robots in the strictest sense of the word, but the point of the test is to distinguish humans from machines.)

How does one deal with this?
We will be addressing these and other existential questions that the news media industry faces at our Media and Tech Week in San Francisco in October, where we will be hearing from the vice-president of product at Cloudflare and executives from Scalepost and Prorata — two companies that provide ways for publishers to charge Big Tech for using content for their LLMs.
If you have read this far, this is a week that you’ll want to spend with us.
If you’d like to subscribe to my bi-weekly newsletter, INMA members can do so here.