
Recording an audiobook is no easy task, even for experienced voice actors. But demand for audiobooks is on the rise, and major streaming platforms like Spotify are making dedicated spaces for them to grow into. To fuse innovation with frenzy, MIT and Microsoft researchers are using AI to create audiobooks from online texts. In an ambitious new project, they are collaborating with Project Gutenberg, the world’s oldest and probably largest online repository of open-license ebooks, to make 5,000 AI-narrated audiobooks. This collection includes classic titles in literature like Pride and Prejudice, Madame Bovary, Call of the Wild, and Alice’s Adventures in Wonderland. The trio published an arXiv preprint on their efforts in September. 

“What we wanted to do was create a massive amount of free audiobooks and give them back to the community,” Mark Hamilton, a PhD student at the MIT Computer Science & Artificial Intelligence Laboratory and a lead researcher on the project, tells PopSci. “Lately, there’s been a lot of advances in neural text to speech, which are these algorithms that can read text, and they sound quite human-like.”

The magic ingredient that makes this possible is a neural text-to-speech algorithm, which is trained on millions of examples of human speech and then tasked with mimicking it. It can generate different voices with different accents in different languages, and can create custom voices from only five seconds of audio. “They can read any text you give them and they can read them incredibly fast,” Hamilton says. “You can give it eight hours of text and it will be done in a few minutes.”

Importantly, this algorithm can pick up on subtleties like tone and the modifications humans make when reading aloud, such as how a phone number or a website is read, what gets grouped together, and where the pauses fall. The algorithm is based on previous work from some of the paper’s co-authors at Microsoft.

Like large language models, this algorithm relies heavily on machine learning and neural networks. “It’s the same core guts, but different inputs and outputs,” Hamilton explains. Large language models take in text and fill in gaps, and that basic functionality is used to build chat applications. Neural text-to-speech algorithms, on the other hand, take in text and pump it through the same kinds of algorithms, but instead of spitting out text, they spit out sound, Hamilton says.


“They’re trying to generate sounds that are faithful to the text that you put in. That also gives them a little bit of leeway,” he adds. “They can spit out the kind of sound they feel is necessary to solve the task well. They can change, group, or alter the pronunciation to make it sound more humanlike.” 

A tool called a loss function can then be used to evaluate whether a model did a good job or a bad job. Implementing AI in this way can speed up the efforts of projects like LibriVox, which currently relies on human volunteers to make audiobooks of public domain works.
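To make the idea of a loss function concrete, here is a minimal Python sketch. It assumes mean squared error, a common textbook choice; the article does not say which loss the researchers actually use, and the function name and the audio feature values below are invented for illustration.

```python
def mean_squared_error(predicted, target):
    """Average squared difference between two equal-length sequences."""
    if len(predicted) != len(target):
        raise ValueError("sequences must have the same length")
    if not predicted:
        raise ValueError("sequences must not be empty")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


# Hypothetical audio feature values for one short clip (not real data).
target = [0.0, 0.5, 1.0, 0.5]
good_guess = [0.0, 0.5, 1.0, 0.5]  # matches the human recording exactly
bad_guess = [1.0, 1.0, 0.0, 0.0]   # far from the human recording

# A lower loss means the model's output is closer to the target.
good_loss = mean_squared_error(good_guess, target)
bad_loss = mean_squared_error(bad_guess, target)
print(good_loss, bad_loss)
```

During training, a model's settings are adjusted to push this number down, which is how "a good job" gets turned into something a computer can measure.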

The work is far from done. The next step is to improve quality. Since Project Gutenberg ebooks are created by human volunteers, every person who makes an ebook does it slightly differently. They may include random text in unexpected places, and where ebook makers place page numbers, the table of contents, or illustrations might change from book to book.

“All these different things just result in strange artifacts for an audiobook and stuff that you wouldn’t want to listen to at all,” Hamilton says. “The north star is to develop more and more flexible solutions that can use good human intuition to figure out what to read and what not to read in these books.” Once they get that down, their hope is to use it, along with the most recent advances in AI language technology, to scale the audiobook collection to all 60,000 books on Project Gutenberg, and maybe even translate them.
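The cleanup problem described above can be sketched with a few hand-written rules in Python. This is only an illustration under stated assumptions: the patterns, the function name, and the sample text are hypothetical, and fixed rules like these are exactly the kind of brittle approach the researchers hope to move beyond.

```python
import re

# Hypothetical patterns for lines a narrator should skip.
SKIP_PATTERNS = [
    re.compile(r"^\s*\d+\s*$"),                                     # bare page number
    re.compile(r"^\s*\[illustration[^\]]*\]\s*$", re.IGNORECASE),   # illustration marker
    re.compile(r"^\s*(table of )?contents\s*$", re.IGNORECASE),     # contents heading
]


def lines_to_read(text):
    """Return the lines of text that match none of the skip patterns."""
    kept = []
    for line in text.splitlines():
        if any(pattern.match(line) for pattern in SKIP_PATTERNS):
            continue
        kept.append(line)
    return kept


# Made-up sample in the style of a volunteer-produced ebook.
sample = (
    "CONTENTS\n"
    "Chapter I\n"
    "[Illustration: A dog in snow]\n"
    "12\n"
    "Buck did not read the newspapers."
)
cleaned = lines_to_read(sample)
print(cleaned)
```

Because every volunteer formats a book differently, a rule list like this would miss many cases, which is why a more flexible approach is the stated goal.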

For now, all the AI-voiced audiobooks can be streamed for free on platforms such as Spotify, Google Podcasts, Apple Podcasts, and the Internet Archive.

There are a variety of applications for this type of algorithm. It can read plays and assign distinct voices to each character. It can mock up a whole audiobook in your voice, which could make for a nifty gift. However, even though there are many fairly innocuous ways to use this tech, experts have previously voiced concerns about the drawbacks of artificially generated audio and its potential for abuse.
