Generative-AI content needs clear disclaimers for readers
Smart Data Initiative Newsletter Blog | 09 March 2023
Hi everyone.
In my last e-mail, we looked at some of the foundational gotchas of generative AI and where our earliest opportunity may lie: building tools that humans can use rather than letting AI publish its work directly and unchecked.
This week, the generative AI waterboarding continues: the need for disclaimers when we feature synthetic content (and understanding our specific responsibility when it comes to publishing synthetic content) and the question of governance for these AIs.
Hanging up my AI hat for now because I’d like to (sees there are big chains and locks on the door) ... oh, wait, the editor says I have to get back to it because I still have a report on AI to finish.
No rest for the wicked.
See you soon for the next one. As always, all my best.
Ariane
When we use AI, we should disclaim it
GPT-3, the large language model that powers ChatGPT, was trained using a technology called neural networks. They are the big win of recent years in machine learning, letting large numbers of powerful computers join up to digest, process, and rearrange huge amounts of data and, in so doing, derive new learning and rules.
The problem with this method is that it makes explainability (which we will talk about in a second) difficult, and therefore it can be hard or impossible to ascertain how something was actually generated.
Wherever machine learning occurs, a key question, one that will be familiar to any seasoned journalist, is where the information was acquired and, just as importantly, where it was not.
You’d ask a politician whether they considered all sides in building a policy. You’d ask a music critic whether they are informed about all genres of music and have taken stock of a deep bench within each genre before deciding to trust their information and opinions.
With deep learning, the kind of Artificial Intelligence that builds itself without human-written rules, the question of the underlying data set is crucial to appreciating where and how the resulting intelligence may have gaps.
The large language model GPT-2 (the foreparent of GPT-3) has this warning on its GitHub page:
“The dataset our GPT-2 models were trained on contains many texts with biases and factual inaccuracies, and thus GPT-2 models are likely to be biased and inaccurate as well.
“To avoid having samples mistaken as human-written, we recommend clearly labeling samples as synthetic before wide dissemination. Our models are often incoherent or inaccurate in subtle ways, which takes more than a quick read for a human to notice.”
This suggestion is a good one — and one we are familiar with in our industry: We have codified the way we edit certain changes (corrections). We have codified how certain types of content appear (advertorials, native advertising, whether free products or services were used in a review or whether affiliate links may be generating income for the organisation).
The Partnership on AI, a coalition of organisations whose news media industry partners include the BBC, the CBC, and The New York Times, is working on broad recommendations for the adoption of AI-generated content across a range of industries and use cases. It has already recommended disclaimers as well, with suggestions ranging from watermarking for visual media to audio disclaimers for audio content and text indicators for text-based media.
Regardless of the specific lines your organisation has drawn for its own practices, rare are the organisations that don’t have a code of conduct for disclaiming the specific circumstances under which a piece of content was produced. And this is especially true wherever we realise our readers or users wouldn’t be able to recognise these distinctions.
News organisations have also long codified attribution: what is in quotes is treated differently from what we paraphrase and attribute, and both are different from handout material, which is used as is but attributed with a copyright label.
Anyone who has sat in a newsroom legal seminar (like yours truly) will remember that, beyond the usefulness of disclaimers and attribution for the sake of our readers, there is also a liability angle to consider. For example, content in quotation doesn’t carry the same libel weight for the publication reporting it as content out of quotes. (“‘So-and-so is a thief and a liar,’ said Joe Schmoe” is not the same as your organisation calling So-and-so the same thing without attribution.)
This is not legal advice (because I’m not an attorney), but wherever we are using the work of an AI unedited, we would do well to remind our readers of this particular origin. Whether from the perspective of intellectual honesty about authorship (a kind of byline or credit line) or from the perspective of disclaiming, it seems we’d never be wrong to err on the side of informing our users about the origin of AI-generated content.
Doing this means using plain language. Attributing something to Dall-E will only be understood as automated content generation by a minuscule fraction of your audience. You have to know what Dall-E is in the first place, and I can tell you that my mother has never heard of it (and she reads the newspaper). So this work of disclaimers really means educating our users as a whole so they can properly recognise AI-generated content for what it is.
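To make this concrete, here is a minimal sketch of what a plain-language disclosure could look like if it were driven by provenance fields in a CMS. This is my own illustration, not an industry standard: the field names, the wording of the disclaimer, and the Dall-E example are all hypothetical.

```python
# A sketch of provenance-driven disclaimers. Everything here (field names,
# wording) is hypothetical and meant only to illustrate the idea.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContentItem:
    headline: str
    body: str
    ai_generated: bool = False
    ai_tool: Optional[str] = None      # e.g. "Dall-E" or "GPT-3"
    human_reviewed: bool = False


def disclaimer(item: ContentItem) -> str:
    """Return a reader-facing disclosure line in plain language."""
    if not item.ai_generated:
        return ""
    text = "This content was produced with the help of an automated system"
    if item.ai_tool:
        text += f" ({item.ai_tool})"
    if item.human_reviewed:
        text += " and was reviewed by our editors."
    else:
        text += " and was not reviewed by a human editor."
    return text


# Example: an AI-generated illustration caption, not yet reviewed.
caption = ContentItem("Sunset over Paris", "...", ai_generated=True, ai_tool="Dall-E")
print(disclaimer(caption))
# -> This content was produced with the help of an automated system (Dall-E)
#    and was not reviewed by a human editor.
```

The point is not the code itself but the separation of concerns: the machine-readable provenance lives with the content, and the reader-facing wording stays in plain language your audience can actually understand.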
What is algorithmic governance and how does this affect us?
Algorithmic governance is an attempt to provide guidelines and controls for understanding how an AI behaves and learns, and what blind spots or negative outcomes it may be generating.
This is far from easy because neural networks, the architecture that enables deep learning for technologies like large language models, are by their very nature black boxes to a large extent.
Explainability, the ability to understand why an AI gives the answer it gives, is still very much only a goal for many AI systems. Explainability brings both an ability to control the outcome and a way to potentially influence the system design to fix or improve the AI system.
Various corners have called for increased oversight and control of algorithms, including U.S. President Joe Biden, who, in an executive order in February 2023, said the federal government should “promote equity in science and root out bias in the design and use of new technologies, such as Artificial Intelligence.”
“The executive order makes a direct connection between racial inequality, civil rights, and automated decision-making systems and AI, including newer threats like algorithmic discrimination,” said Janet Haven, the executive director of the Data & Society Research Institute. “Understanding and acting on that connection is vital to advancing racial equity in America.”
The impact is real for the news industry too: If we use algorithmic libraries unchecked, their mechanisms unexamined, and their outcomes unquestioned, we compound the biases already present in these AIs, because we are a channel for distributing content.
Our responsibility comes from the very role that we have in society: As mass media, we are, in fact, where citizens come to inform themselves about the world.
The warning is present in GPT-3’s so-called Model Card (the structured documentation of what went into its machine-learning sources and training):
“GPT-3, like all large language models trained on Internet corpora, will generate stereotyped or prejudiced content. The model has the propensity to retain and magnify biases it inherited from any part of its training, from the datasets we selected to the training techniques we chose. This is concerning, since model bias could harm people in the relevant groups in different ways by entrenching existing stereotypes and producing demeaning portrayals amongst other potential harms.”
GPT-3 was trained on a large chunk of the Internet, specifically Common Crawl, a pre-cleaned crawl of the web that includes common sources like the world’s largest news publishers, Wikipedia, the UN, and so on. In other words, GPT-3 is likely to have ingested your own news organisation to build its artificial brain muscle.
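If you are curious whether your own property shows up there, Common Crawl’s public index can tell you. The sketch below is my own illustration of how you might query it with Python; the collection name (“CC-MAIN-2023-06”) and the “acme.com” domain are just placeholders.

```python
# A sketch of checking whether pages from a domain appear in a Common Crawl
# collection, via the public index API at index.commoncrawl.org.
# The collection name below is an example; a new one is added with each crawl.
import json
import requests

INDEX = "https://index.commoncrawl.org/CC-MAIN-2023-06-index"


def pages_in_common_crawl(domain: str, limit: int = 5) -> list[str]:
    """Return up to `limit` captured URLs from `domain` found in this collection."""
    params = {"url": f"{domain}/*", "output": "json", "limit": str(limit)}
    resp = requests.get(INDEX, params=params, timeout=30)
    if resp.status_code == 404:
        return []  # no captures for this domain in this collection
    resp.raise_for_status()
    # The API returns one JSON record per line.
    return [json.loads(line)["url"] for line in resp.text.splitlines() if line]


if __name__ == "__main__":
    for url in pages_in_common_crawl("acme.com"):  # hypothetical domain
        print(url)
```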
But, and this is what gives us, the news media, double the responsibility: We’re crawled to build the AI’s brain, and we’re also feeding the next generation of this AI.
So if we are creating content that is synthetic in nature, we have to worry about its quality in the immediate term (what our users will encounter on our owned distribution channels), but also about how this content’s quality will affect the AI’s later evolution.
To give you an example of what this may look like, imagine you have an AI printing unchecked “bad content” on your property, acme.com. Acme.com is part of Common Crawl, so whatever you publish will eventually be part of the AI’s model.
Because you’re not really checking what the AI is doing (pumping out articles about how to tie your shoelaces and other useful content for the Internet), this content is not just bad for your users today; it also becomes part of the data the AI will eventually retrain on!
In this respect, our own good governance is pretty key. Unlike organisations that are not part of Common Crawl, we can be both consumers and inputs for AI.
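What could that governance look like in practice? One small piece of it is a publication gate: nothing flagged as AI-generated goes live without a human sign-off, which also keeps unchecked synthetic text off the pages that Common Crawl will later pick up. The sketch below is my own illustration, with hypothetical CMS fields, not a prescription.

```python
# A sketch of a pre-publication gate for AI-generated content.
# The dictionary keys ("ai_generated", "human_reviewed") are hypothetical CMS fields.

def ready_to_publish(item: dict) -> bool:
    """Block anything flagged as AI-generated that no human has reviewed."""
    return not (item.get("ai_generated") and not item.get("human_reviewed"))


drafts = [
    {"headline": "How to tie your shoelaces", "ai_generated": True, "human_reviewed": False},
    {"headline": "Council meeting report", "ai_generated": False},
]

for draft in drafts:
    status = "publish" if ready_to_publish(draft) else "hold for human review"
    print(f'{draft["headline"]} -> {status}')
```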
Further afield on the wide, wide Web
Some good reads from the wider world of data. This week:
- What a gift I have for you: My most favoritest late night show dedicated a whole episode to AI. I give you, John Oliver’s Last Week Tonight. Apropos of nothing, I will casually mention here that VPNs can be very useful.
- Craig Smith of Eye on AI (my former boss!) interviews Yann LeCun, the chief AI scientist at Meta, here. Level: Advanced, but no data scientist background required because Craig is a great interviewer.
About this newsletter
Today’s newsletter is written by Ariane Bernard, a Paris- and New York-based consultant who focuses on publishing utilities and data products, and is the CEO of a young incubated company, Helio.cloud.
This newsletter is part of the INMA Smart Data Initiative. You can e-mail me at Ariane.Bernard@inma.org with thoughts, suggestions, and questions. Also, sign up to our Slack channel.