Generative AI search will challenge all news publishers
Newsroom Transformation Initiative Blog | 19 February 2023
If publishers thought it was hard to extract value from, or even understand, Google Search, they face an immense struggle with the new breed of search chatbots built on generative Artificial Intelligence, led by those based on the ChatGPT model.
Attribution of content and sources — already seen as a big problem by publishers and content creators — is set to become a battleground between publishers, the companies promoting AI-driven search, and the regulators and politicians who generally lag in responding to technological upheaval.
“Search seems like a sort of Utopia when compared to what’s happening with AI and how we’re going to get attribution in AI. It already feels like that’s the real battleground over the next few months and years,” said a senior executive who handles search policy at an international publisher but preferred not to be named this early in the discussion about the implications.
The existing ChatGPT model created by OpenAI offers almost no attribution or clarity as to where the material in its uncannily correct answers comes from.
The new OpenAI-powered Bing showcased by Microsoft does appear to offer some attribution and links back to content creators, but nothing like the funnels Google Search has historically provided, which drive billions of pageviews back to publisher Web sites — however much publishers may complain about the imbalance of power between Google and journalism.
In the Bing demonstration, the results use numbered citations that allow you to drill down to the site where the actual answers reside — very much like Wikipedia but, again, far less prominently.
When did you last drill down to the actual source from a Wikipedia page?
Perplexity.ai, which is evidently also built on a version of ChatGPT, gives a hint of what a generative search engine that understands the value of attributing the origin of a fact or piece of news might look like, with clear branding and paths to major sources such as Reuters, NPR, or a range of relatively authoritative publishers, as well as government and official sources.
But the vast range of material available to, and already digested by, the new generative AI models opens up all sorts of issues around copyright, terms of use, and accuracy, as well as the big question of payment.
Where does this stuff come from?
“How do we have any level of grip over what’s going on with the use of AI with news and information? There’s been very, very little discussion between the key platforms and publishers about how this is going to work,” the source told me, having been heavily involved in negotiations and unofficial product discussions with Google especially.
That lack of consultation reflects how fast the field is moving, with some products perhaps rushed to market without adequate thought.
In its frequently asked questions, Microsoft’s Bing search engine acknowledges that Artificial Intelligence engines can give what you might call flaky answers, saying: “Bing aims to base all its responses on reliable sources — but AI can make mistakes, and third-party content on the Internet may not always be accurate or reliable. Bing will sometimes misrepresent the information it finds, and you may see responses that sound convincing but are incomplete, inaccurate, or inappropriate. Use your own judgment and double check (sic) the facts before making decisions or taking action based on Bing's responses.”
Does that sound like a search service based on reliable information and attributed sources?
Publishers worldwide have complained for years about what they say is the opacity of how the Google search algorithm works — what they see as an imbalance of power. They also increasingly object to Google search responses that may contain the entire answer a user seeks, making it less likely, they believe, that traffic flows from the search engine to publisher sites — what critics in the publishing industry now call “zero-click” answers.
(There’s much more on these gripes with the current Google Search — and Google’s answers to them — in the recent INMA report How Newsrooms Succeed in Google Search.)
Those questions over the current dominant search engine pale in comparison with the issues raised by the first generation of generative AI search products coming to market right now.
For example, it is widely understood that OpenAI has scraped an immense corpus from the Internet — much of it copyrighted material from publishers globally — which may be fine when it is used for experimental or non-commercial purposes. But what happens when Bing presents answers derived from that content and profits from it commercially?
“That will be where the next wave of concern probably comes in for publishers,” my source said. “We need a sustainable revenue stream in order to be part of the service.”
Conventional search, they said, will still be important for publishers for a long time to come. We also know Google’s standards can help publishers who demonstrate the expertise, authoritativeness, and trustworthiness Google seeks.
But there will now be a big focus on the implications of years of historic content as well as breaking news and fresh information being sucked into the maw of generative AI without attribution, linkage, or some form of adequate compensation for publishers — and much of it may also be plain wrong.
Given that ChatGPT-driven responses usually contain the entire answer, it can be assumed that very little traffic, and therefore little revenue, will flow back to publisher sites.
However, it is clear Microsoft has the margins Google makes from search in its sights, not the margins publishers make. Microsoft Chairman and Chief Executive Satya Nadella told the Financial Times the new paradigm of generative AI search would permanently trim margins from the search business, fundamentally changing the economics of the entire industry.
“From now on, the [gross margin] of search is going to drop forever,” Nadella said in an interview with the Financial Times, making clear he believed Google or Alphabet was more vulnerable given its narrower spread of revenue-earning products: “There is such margin in search, which for us is incremental. For Google it’s not, they have to defend it all.”
In some markets, such as Australia, legislation had dragged the platforms to negotiate on payments; in others, they had chosen to work with publishers. Either way, finding some way to create a sustainable base of reliable content was in the interests of all parties.
“Without news sources, it becomes anarchy,” the source said.
Wikipedia is one way to think about generative AI
Journalists have reacted with horror tinged with disdain at the ability of even early generative AI models to produce more or less adequate, and sometimes very good, articles of various kinds. The technology seems particularly suited to data-led or repetitive journalistic formats, such as stock market reports, sports results, or weather updates.
But it is easy to forget that this is not original material.
Wikipedia, the crowd-sourced online encyclopedia, is perhaps a good proxy to understand what generative AI is doing with journalism created and theoretically owned by publishers.
Wikipedia is not a source in its own right, though it is often used that way by students and others. It is, in fact, a distillation curated by human volunteers rather than Artificial Intelligence, with a supporting collection of sources — all identified, attributed, and linked back, where possible, to the original source or owner of the material.
I asked Wikipedia co-founder Jimmy Wales, for whom I once worked on a journalism start-up, how he was thinking about ChatGPT and how it compared with the encyclopedia’s approach to sources.
“Since ChatGPT doesn’t really ‘understand’ anything, it might not really be able to ‘know’ where it learned something — or maybe it can,” Jimmy said in an e-mail. “I continue to be absolutely astonished by it and absolutely frustrated by how bad it is alongside how good it is.”
If you think about the last time you traced a link back from Wikipedia to its original source, you have some idea why publishers may have a coronary at the thought of losing traffic from search if generative AI search of the sort demonstrated in ChatGPT takes off.
“It’s actually in the interests of the people who are making those products to ensure that a sustainable ecosystem still exists beneath them,” my source said. “The fact is it seems like they have been scraping all publisher Web sites on the open Web, probably breaching the terms of service of those publishers for months if not years. Nobody I’m aware of has ever been approached by these companies asking for a commercial license to do that.”
The voice of God problem, again
Politicians and regulators will inevitably take time to get up to speed, even as Microsoft launches its OpenAI-powered pilot, Google evidently rushes its Bard generative AI tool to market, and millions of people try, and clearly enjoy, the answers provided by GPT and other tools.
The battle lines are being drawn and moving fast, and publishers need to get up to speed. After all, it’s not as though we haven’t been here before.
“If you then have a tool, which claims to be God, but that doesn't have any attribution in it, producing content at zero cost to the consumer and then free to distribute across the Web. It’s, it’s insane. It’s completely insane,” my source said. “It shouldn’t satisfy lawmakers. After all, we’ve only just come through a process where social media platforms were just spewing stuff.”
INMA members can subscribe to my bi-weekly newsletter here.