A massed army of audio bots is coming over the hill. What will it do to us?

Recent eyesight problems have had me relying more than I would usually on getting my information via sound rather than vision. The experience of listening to more audiobooks, audio articles and podcasts has been mixed, but it has opened my eyes (metaphorically) to the rapid developments taking place in text-to-audio services and in audio-to-text: virtually all sound-editing software now comes with sophisticated and remarkably accurate transcription built in.
For several years now, newspapers, including this one, have been generating automated audio versions of their articles. The initial impetus came from the belief of publishers that media consumption would be dominated by devices such as Alexa and Google Assistant. It turns out that most of us are not as keen on having an always-on spy in our homes as the tech companies hoped.
But, regardless of the hardware, the quality of automated audio is bounding forward, propelled by generative artificial intelligence. Until very recently the ghost of Hal from 2001: A Space Odyssey hung over these products. Not any more. The previous uncanny valleys, weird intonations and mispronunciations are a thing of the past. The new generation of automated voices is so close to pitch-perfect that you’ll be hard pressed to tell them apart from an actual human being.
All of which, of course, will give rise to further disquiet among actual human beings like you and me who fear being left on the scrapheap. Slowly but steadily, the massed army of audio bots is coming over the hill to take the jobs of call-centre workers, DJs, weather forecasters, news-bulletin readers, university tutors and who knows who else.
As a podcast presenter I was not delighted when a colleague shared a trial sample of Google’s experimental Audio Overview feature. Using the company’s artificial-intelligence tool Gemini, “two AI hosts” turn your “documents, slides, charts and more into engaging discussions” in which they “summarise your material, make connections between topics, and banter back and forth”.
Applied to a recent Irish Times article about the Catholic Church’s domination of Irish primary schools, the results were simultaneously under- and overwhelming. Like most AI, it was impressively mediocre. Two American voices (no local accents available yet) mulled over the issue, offering a half-decent summary of the article in a conversational mode that wouldn’t be out of place on a mid-morning magazine radio show. You can see how some people might prefer it, once the tech gets better, to a simple read-out of the original text. But it was no Socratic dialogue.
Nevertheless, Socrates might have approved. In the Phaedrus he worried that written texts were replacing oral culture. Writing, he predicted, “would create forgetfulness in the learners’ souls, because they will not use their memories; they will trust to the external written characters and not remember of themselves”. Similar dire predictions would be made two and a half millenniums later about the internet.
But despite Socrates’s fears, a sense is deeply embedded in contemporary culture (and shared by me) that reading a text involves a more profound, more immersed engagement than hearing the same words spoken aloud. Reading requires full attention, after all; listening can be done while making the dinner or walking the dog. And the voice itself, whether bot or human, acts as an intermediary and, inevitably, sometimes as a distraction from the words themselves. For our current age of distraction, listening seems the more appropriate medium.
We hear a lot about how our culture is becoming more visual and less literate. We hear less about how it is becoming more aural and less textual. Or, perhaps more accurately, how the written and spoken word are merging, presented side by side. You can read a TikTok or a Netflix release or an Apple podcast. You can listen to this article.
This coming together of speech and text applies to production as well as to consumption. I’m tapping this column out as usual on a keyboard. But it would be just as easy to dictate it, using one of the many free programs now available. Would that make the process or the result any different? Writers from John Milton to Barbara Cartland have dictated their work, so there’s nothing new there. But some suggest that the progression from quill to typewriter to word processor has changed the way writers write, just as other technologies have changed the way readers read. The end of the keyboard may seem hard to imagine, but it’s not so long ago that many would have said the same about the end of handwriting.
One might as well bow to the inevitable. Clears throat. Taps mic. Is this thing on?
