Saturday, March 2, 2024

Status Update • Audio Books: Past, Present, and Future


We recently were given the opportunity to participate in a product beta. 

The product is an AI-assisted text-to-speech conversion utility. The objective is to develop a software tool that can convert a complete e-book into a serviceable audio book in a matter of hours. We gave it a good hard tryout, running five of our novels through it, and the results are… 

Promising. Interesting. Thought-provoking. It’s not quite there yet, but it will be, and sooner than you think.
 

—The Past—


As many of you already know, audio/video production fascinates me. I didn’t start out to be a science fiction writer. I started out to be a musician and composer, and spent a lot of time doing radio, TV, and above all, recording studio work, before I transitioned to writing for print publication. I’d always intended to get back to doing that sort of work, one of these days, once everything else settled down. I named this company Rampant Loon Media, after all, with the idea that eventually we would branch out into doing audio and video production.

When you bring up the topic of the AI-assisted generation of audio books, most people’s first response is the same as it is to every other intrusion of automation into their world: it’s immediate, visceral, and basically reactionary. “They took our jobs!”


In my mind’s ear I hear armies of farm workers shouting the same complaint 190 years ago, when Cyrus McCormick patented the mechanical reaper and they could no longer make a living harvesting grain by hand with a scythe.

After that, the next response is usually, “I don’t like it. It sounds mechanical. It’ll never sound as good as a good reading by a human actor.”

My answer is, that’s right. For now. But the technology is improving rapidly, and in a very short time text-to-speech audio has gone from sounding like something being read by a Dalek to something being read by a voicemail chatbot, or perhaps a bored primary school teacher reading to her class. It won’t be too much longer before it does sound every bit as good as a text being read by your average run-of-the-mill voice actor, with the added advantage of being a lot more reliable. AIs won’t fully replace actors until they can skip rehearsals, show up drunk on opening night, or miss their entrances because they’re making out in a backstage broom closet with a member of the costume crew. But aside from those shortcomings…

§

Perhaps I’m more relaxed about this than you are because I’ve already watched this happen in the music industry. Back in my musician days I hated to use even dynamic range compression in the recording studio, because I thought it warped the natural sound of the human voice. But now, try finding contemporary pop music that does not consist almost entirely of loops, samples, and voices strained through multiple layers of compression, equalization, audio enhancement, and auto-tuning. Sometimes I catch myself thinking, this isn’t music, it’s audio découpage.



But then I realize that I’m just being a grumpy old Luddite. No one is forcing me to listen to contemporary pop music. Whenever I want to, I can put some of my music on whatever audio system is handy. I just got the remastered 3-disc set of Arturo Benedetti Michelangeli’s complete Debussy recordings for Deutsche Grammophon, and am fighting the urge to binge-listen to the entire thing in one sitting. More than two hours of Debussy’s solo piano music, played by a master pianist. Absolutely glorious.

Meanwhile, those who grew up with loops, samples, EDM, and all the other things that grew out of scratch, dub, electronica, post-punk and industrial dance music—those who enjoy music that sounds like it’s been written by and is being performed by robots—

They like it and want more of it.

There. That’s a thought to keep in mind as we continue traveling into the future. 

—The Present—

Here at Muppet Labs—excuse me, Rampant Loon Media—we’ve been experimenting with audio books for years. When Henry Vogel’s novel, The Fugitive Heir, was riding high on the Kindle best-seller list, we went the full (and expensive!) ACX route and hired a professional voice actor to produce a complete professional-grade audio book.

 

 

Available now on Audible.

Available now on Kindle Audiobook.

The initial results were promising enough to warrant re-hiring the same voice actor to do the sequel, The Fugitive Pair.

Available now on Audible.

Available now on Kindle Audiobook.

Thus setting ourselves up for the insanely expensive learning experience that was The Counterfeit Captain. For this one we went all-out and hired a between-roles Hollywood actress to read the book, with the recording and production to be done by some of her Hollywood movie industry friends.

Available now on Audible.

Available now on Kindle Audiobook.

It turned into the Project from Hell. We spent a small fortune on this one, and in the end the audio files they delivered were unusable. We had to engage someone else to fix the whole mess in post-production. (Apparently, to people in Hollywood this is completely normal and nothing for them to get excited about. “You can fix it in post.” They were already off and working on their next project.)

The resulting audio book sounds great—you should pop out to Audible or Amazon and give the sample a listen—but boy, was it an expensive pain-in-the-@$$ to get there. So expensive, in fact, that it devoured the budget I’d allocated for doing the audiobook of The Fugitive Snare, so we decided to put that one off until we had a better handle on what we were trying to accomplish. 

§

We haven’t been idle since then. We’ve completed production on the 10-episode streaming audio adaptation of Dawn of Time and the 30-episode streaming audio adaptation of The Odin Chronicles and will be rolling those out shortly; watch for more details coming soon.

[If you’re curious, here is a work-in-progress sample of Episode 1 of The Odin Chronicles. It’s not quite the finished version, but it’s really close, and definitely worth a listen. I am pretty pleased with it, anyway, and when it comes to audio production quality I am downright OCD.]    

We have an audio book adaptation of The Fugitive Snare in development. I should check up on that one and see how it’s coming along.

People keep asking why we don’t release an audiobook adaptation of Headcrash. The problem, essentially, is me. I grew up on The Firesign Theatre and logged all those years of working in theater and recording studios, and now that I finally own all rights to Headcrash again, I can’t seem to let go and let someone just read it. I begin with that idea, but then pretty soon I’m wanting to add sound effects, and incidental music, and to get different voice actors to read the different characters…

And, well, then it mushrooms into being the full-blown multimedia production I always saw it being in my mind, and I start to wonder if maybe we can get Rod Lord on board for the project…

Never mind that now. Among other projects we’ve attempted in recent years has been an audiobook adaptation of Eric Dontigney’s paranormal thriller, The Midnight Ground. We’ve actually committed to this project several times—

And each time, the voice actor who committed to the project failed to deliver a finished book. (See foregoing comments about actors, reliable.) Which brings us up to now, or perhaps more appropriately, fifteen minutes into the future.

 

§

When the invitation to participate in this beta first showed up in my inbox, I wasn’t too excited. We see these sorts of offers from time to time and they rarely pan out. For example, Henry Vogel used Apple’s AI to convert his novel, Trouble in Twi-Town, to an audio book, and was not happy with the result. He said it took him about two weeks and a lot of manual fiddling to produce something he deemed almost good enough, but there were still problems with pronunciation and diction. (I believe the audio book is out on iTunes now but can’t confirm that.)

This latest offer came from KDP, though, and they’ve been working on developing text-to-speech conversion since at least 2010. Among other things they promised fast e-book to audiobook conversion, and helpfully provided us with a list of our own Kindle titles that were already deemed suitable for conversion.* After Henry and I talked it over for a bit and played around with various options, we selected his novel, The Recognition Run, to be our first victim, er, test subject.

KDP really delivered on the fast part. Once Henry and I agreed on a virtual voice for the book and I clicked the button to commit, the finished audiobook was live on Amazon in less than two hours; live on Audible a little later. It worked so well that we decided to convert the rest of the Recognition trilogy right away, too, and then, what the heck, Hart for Adventure.

In a matter of hours we had four new audiobooks live on Amazon and Audible.

The Recognition Run: Audible | Kindle Audiobook

The Recognition Rejection: Audible | Kindle Audiobook

The Recognition Revelation: Audible | Kindle Audiobook

Hart for Adventure: Audible | Kindle Audiobook


These audiobooks aren’t perfect. You’ll never mistake them for an audiobook read by a really good human narrator. I’m not entirely satisfied with the range of virtual voices currently available, and can think of plenty of improvements I’d like to see both in the publisher’s user interface on the front end and in the final output that goes to customers at the back end of the process.

But that’s the point of a beta, isn’t it? KDP does seem to be receptive to our feedback, so I’m happy to participate in this program, and looking forward to seeing—and more importantly, hearing—the final product. In the meantime, I am really happy with the seamless integration between the e-book and audiobook editions, and especially happy with the way customers can get the audio book for free (or at least, steeply discounted) if they already own the e-book.   

So happy, in fact, that I contacted Eric Dontigney, and we agreed to produce a Virtual Voice version of The Midnight Ground, which we’ll keep available until such time as a living human actually delivers a finished version of the audiobook. You can get it here:

The Midnight Ground: Audible | Kindle Audiobook

If you listen to it, let us know what you think of it.

§

* About that “list of our own Kindle titles that were already deemed suitable for conversion”

First off, it’s an “opt in” program, so nothing will be converted unless we request that it be converted. Secondly, Stupefying Stories and SHOWCASE contributors can relax. We did not buy the audio rights to your story. Therefore, we will NOT convert your story to audio without negotiating a new contract with you. Right now YOU still own all the audio rights to your story, unless you yourself have sold them elsewhere. We will not be converting any issues of Stupefying Stories or any SHOWCASE stories to audiobook.

Capisce? 

Very good. Moving right along then, to…

—The Future—


There is a wonderful moment in the 1998 Godzilla movie. Dr. Niko Tatopoulos (Matthew Broderick) is traveling with a team of French special forces soldiers disguised as Americans. As they’re being grilled at a U.S. Army checkpoint, the guard turns to the leader of the French team, Philippe Roaché (Jean Reno), and demands—

Guard: You got a problem talkin’?

For an instant you see the panic on Tatopoulos’s face. Oh no, our cover is blown. We’re caught. Throughout the entire movie up to this point, Roaché has spoken English with a French accent as thick as Brie cheese. Then Roaché smiles at the guard, and answers—

Roaché: Why, no suh, ah’m fine.

Guard: All right, keep it movin’.

Roaché: Well, thank you very much.

They drive through the checkpoint. Tatopoulos gives Roaché a puzzled look. Roaché answers, in his normal thick French accent—

Roaché: Elvis Presley movies. He was The King.

§

A spoken language is a living, fluid, dynamic, ever-changing and constantly evolving thing. There are words and concepts in my vocabulary now that didn’t exist when I was in college; ways of speaking, levels of meaning, and idiomatic expressions that would have made absolutely no sense to my father. At the same time, thanks to technology, there is an accelerating trend towards widespread homogeneity in how a language sounds. Regional accents began to disappear with the advent of the gramophone; they did so at an accelerating pace with the development of radio and “talkie” movies; and the trend went worldwide with television. It’s become a standing rhetorical joke: why do so many British people sing with American Southern accents? 

I dunno. Probably for much the same reason as why so many Americans can rattle off old Monty Python routines in fluent Cockney. We listen. We learn. We parrot, like a Norwegian Blue, pining for the fjords. We humans are made to learn speech from listening and imitating.

When I was in Iceland, I noticed a funny thing. The market for media in the Icelandic language is so small that almost no one bothers to overdub British or American movies or television programs into Icelandic. Instead, everything is subtitled, and as a result, Icelandic children develop fluency in English at a very early age. But the funny part is, when they open their mouths to speak, you can almost tell which TV programs they watched as children, by the accents they use when they speak English as adults.

This is a feature, not a flaw. We learn how to speak by listening. We develop our sense of what is “normal” speech by who we listen to most.

Or what we listen to most.

AI-assisted speech generation is here. It’s not going away. Yes, it may sound odd to your ears, perhaps flat and lifeless, with what may seem to you to be peculiar inflections and odd pronunciations. AI-generated speech will improve and become more “lifelike,” as more people continue to work on it.

But at the same time, AI-generated speech will also change how we speak.

We’re already starting to see that in how we communicate. We enunciate more clearly when we’re negotiating with a voicemail chatbot system. We change the way we write text on the computer in order to facilitate better text-to-speech conversion. This change in how we speak is only going to accelerate. Your children or grandchildren will grow up thinking the way their tablet reads a book to them is “normal,” and the way you speak sounds funny.

Our technology will continue to evolve, to better communicate with us. At the same time we will adapt to better accommodate our technology, as we have ever since our first hairy ancestor chipped a flint.

And somewhere in the middle, we’ll someday meet, and have a conversation. 

1 comment:

~brb said...

Something I really want to stress here is the production time factor. The audio book of THE FUGITIVE HEIR, narrated by a really good human voice actor, runs about 5-1/2 hours long. To get an audio track with a real-time running length like that usually requires at least 20 hours of recording studio time, and that recording time has to be spread out over several weeks. No human can read out loud without stopping for five and a half hours straight. Their voice just plain wears out.

By way of comparison, the audio book of THE RECOGNITION RUN runs about 7-1/2 hours long, but being generated by robots, it took less than two hours to create it from initiating the process to seeing the audio book go live on Amazon.

¿Tu comprende? AI-assisted audio book generation is decoupled from real time. This alone is a powerful incentive for publishers to continue to develop and improve the technology.