A few weeks ago, I was lucky to attend the Nordic AI in Media Summit in Copenhagen, full of interesting case studies from publishers or academics taking on work that touched various parts of machine learning. One of the cases there addressed the upsides, but also complexities, of using synthetic voices to read the news.
There is, indeed, a lot of excitement around the growing capabilities of text-to-speech, and much of this is for good reason. This is a technology that has advanced enormously in just a few years — and not just for the biggest languages.
But a lot of the commentary on the technology focuses on pure capabilities. You’re reading a newsletter that often focuses on capabilities, too, so that’s in no way an indictment of such a focus. But many of us care about the final outcome as it applies to our industry: can we use this, specifically, to convey the news? To answer that, we have to look beyond whether text-to-speech AI is serviceable and ask whether it does a good, credible job as a vector for news distribution. There is a bit of a difference between these two concerns.
Lene Heiselberg, associate professor at the Centre for Journalism of Syddansk Universitet in Denmark, shared her research on the audience response to this technology. She carried out semi-structured interviews with frequent radio listeners of different ages and from different regions of Denmark.
Some groups did identify or suspect that the news was being read by a synthetic voice, and that recognition affected them, sometimes positively and sometimes negatively. “They thought it sounded like their GPS or like a robot,” said Professor Heiselberg. On the other hand, Professor Heiselberg also found that users could get equally annoyed (though not for the same reason) with the same voice when they thought it was human.
Things were clearer on the factors that affected the credibility of the synthetic voice. Professor Heiselberg listed some of the same factors as both increasing and decreasing credibility; basically, the eye of the beholder was more of a factor than the inherent characteristics themselves.
For example, the synthetic voice lacking emotionality could be perceived as a source of increased objectivity for the reporting because users felt they couldn’t be emotionally manipulated. “That came as a surprise to me,” Professor Heiselberg said.
But lack of emotion could also be perceived as jarring against the user’s expectation of what they felt the tone of a news-reading voice ought to be for certain types of stories, like the weather, deaths, or sports. “When you ask the listeners, they want to feel the enthusiasm in the reporter’s voice when their team won,” said Professor Heiselberg.
Similarly, the source of the voice, meaning the persona lent to the synthetic voice, could also play in its favour or disfavour.
Finally, the context in which the synthetic voice was introduced (was it disclosed as such?) also affected perceptions of credibility. Here the results were less ambiguous: disclosure helped make the technology credible, but it could have a knock-on effect of a more social nature, raising questions about the place and nature of AI-driven content creation.
The anthropomorphization of AI, or of robots, is a divisive topic. On the one hand, it helps make these technologies more approachable to us humans. On the other hand, anthropomorphization is a fallacy which exploits the way we’ve evolved as a species to relate to a category of “things” (other living creatures) that a piece of software, not being a living creature, does not belong to. That’s another way of saying it hijacks the regular way you’d assess a non-living thing and takes a more social path instead. A misbehaving robot can go in the trash, but a misbehaving puppy cannot.
Professor Heiselberg noted four ways that users would humanize their synthetic news reader:
- Giving them physical attributes.
- Associating them with real or famous humans.
- Projecting that the voice suggested it belonged to someone with certain stereotypical characteristics.
- Giving the voices a human backstory.
I’d like for a second to point at some of the text-to-voice AIs many of us have encountered, such as Alexa and Siri. Both have human-sounding names, which is probably not by accident. Furthermore, they have female names, probably because these are seen as more friendly, collaborative, and benign (qualities that we ascribe to women), even if these assistants also have alternate male voices available.
And some of these synthetically voiced assistants very much lean into the notion of being friendly and having emotions. I burn with the fury of a thousand suns when Alexa tells me, “I hope you’re having a great day,” a disingenuous attempt at trying to trigger in me some reciprocal empathy for a hunk of plastic and a computer processor. Now, to be fair, you apparently don’t even need to have a voice to get this outcome: The Internet tells me that there’s a whole trend of people who think that their Roomba robot vacuum cleaners have personalities because, apparently, they are programmed to mimic having one.
This is an example where having personality, or an attempt by the makers of the software to give their product a personality, is seen as desirable, presumably for further entrenching the robot into our lives.
In the context of news, however, credibility, rather than friendliness, is the measure by which we live or die. In this respect, Professor Heiselberg shared some takeaways from her study participants:
- Communicating the presence of an AI-generated voice.
- Making considerations for the type of news content where text-to-speech should be used.
- Being mindful of how credibility is affected by having a voice deliver without emotion.
- Being thoughtful about the tone of voice itself, remembering our tendency toward anthropomorphization.
- Considering how this voice will become part of your brand identity.
I would not be surprised if there were some regional variance in how users perceive these synthetic voices. Just as some societies treat pets as members of their family while other societies don’t, how we feel toward robots in our lives probably varies, too.
If you are a news media company making a push for text-to-voice at scale, this type of user testing may surface qualitative insights you would not readily see in your quantitative user data. Yes, they are just robots, but they are providing information that goes well beyond confirming that you’ve locked the garage door.
And also, maybe they are not “just robots.”