How to Create AI Voices that Are Better than Text-to-Speech

Nov 10, 2022

The next movie or TV show you watch may be partly the work of artificial intelligence. Imagine that actors from Hollywood, Bollywood, or any other film industry could speak any language fluently in upcoming films or TV shows. In fact, this is already a common scenario. The voices may not belong to the actors themselves, though, because they are deepfakes. That does not mean fraud here: it simply means the voices were created using AI.

However, this level of voice quality is only possible when working with the right AI voice generator. Otherwise, the result would be a robotic voice that is full of inaccuracies that distract the audience. This article will cover some of the biggest shortcomings of some AI-generated voices and reveal the best alternatives.

What is an AI voice and what are the different ways to create one?

The 2019 film “Every Time I Die” was dubbed into different languages with the help of AI. It was one of the first attempts to replace the voice of a dubbing actor with a digital agent. And everything worked out — even the developer of this technology could not always distinguish the synthesized voice from the real one.

An AI voice was also used in The Mandalorian for Luke Skywalker’s reveal. The AI voice replaced the voice of the real Mark Hamill, who is now 70 years old.

Hamill himself gave permission for the use of his voice. Respeecher then trained a neural network on fragments of the actor’s voice that were recorded over 40 years ago. The sources were past films, an old radio show, and Hamill’s taped voice.

AI voices are used not only for film dubbing but also for other content, including video games. They can also be used to re-dub old movies whose audio track is damaged or sounds unnatural, a task that AI handles well.

Most AI voices are created using one of two approaches:

Text-to-speech (TTS) technology
Speech-to-speech voice conversion

What is text-to-speech voice synthesis?

Text-to-speech voice synthesis is the computer simulation of human speech from text with the help of machine learning techniques. Developers use TTS to create voice robots such as IVR (interactive voice response) systems.

The technology allows businesses to save time and reduce costs by automatically generating a voice, eliminating the need for studio recording (and re-recording) every time a script is adjusted.

A TTS application can read any piece of text aloud in a voice that approximates a human one. Although the best TTS systems achieve impressive quality, you will most likely still be able to tell that the speech is synthetic. Even so, TTS suits a range of use cases.

How text-to-speech tools work

To convert text to speech, the ML algorithm needs to perform the following:

Convert text to words
Complete phonetic transcription
Convert transcription to speech
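
The three steps above can be sketched as a toy pipeline. This is only an illustration under stated assumptions, not how any production TTS engine (or Respeecher's technology) is implemented. The tiny dictionaries, the function names, the ARPAbet-style phoneme symbols (stress markers omitted), and the sine-tone "speech" are all hypothetical stand-ins. A real system would use a large pronunciation lexicon or a learned grapheme-to-phoneme model for step two, and an acoustic model plus a vocoder for step three.

```python
import math
import re

SAMPLE_RATE = 16_000          # samples per second (assumed)
SAMPLES_PER_PHONEME = 1_280   # 80 ms per phoneme at 16 kHz (assumed, fixed)

# Hypothetical toy data; real systems use far larger resources.
DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
               "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}
ABBREVIATIONS = {"dr.": "doctor", "mr.": "mister"}
LEXICON = {
    "doctor": ["D", "AA", "K", "T", "ER"],
    "smith": ["S", "M", "IH", "TH"],
    "has": ["HH", "AE", "Z"],
    "two": ["T", "UW"],
    "cats": ["K", "AE", "T", "S"],
}


def text_to_words(text):
    """Step 1: normalize raw text into plain spoken words."""
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
            continue
        token = re.sub(r"[^a-z0-9]", "", token)  # strip punctuation
        if token.isdigit():
            # Read digits one by one (a simplification of number expansion).
            words.extend(DIGIT_WORDS[d] for d in token)
        elif token:
            words.append(token)
    return words


def words_to_phonemes(words):
    """Step 2: look up a phonetic transcription for each word."""
    phonemes = []
    for word in words:
        # Unknown words are spelled out letter by letter, a crude
        # placeholder for a learned grapheme-to-phoneme model.
        phonemes.extend(LEXICON.get(word, list(word.upper())))
    return phonemes


def phonemes_to_audio(phonemes):
    """Step 3 (placeholder): emit one short sine tone per phoneme.

    This is NOT speech; it only marks where an acoustic model and
    vocoder would turn the transcription into a waveform.
    """
    samples = []
    for phoneme in phonemes:
        freq = 200 + 20 * (sum(map(ord, phoneme)) % 20)  # arbitrary pitch
        samples.extend(
            math.sin(2 * math.pi * freq * i / SAMPLE_RATE)
            for i in range(SAMPLES_PER_PHONEME)
        )
    return samples


if __name__ == "__main__":
    words = text_to_words("Dr. Smith has 2 cats.")
    phonemes = words_to_phonemes(words)
    audio = phonemes_to_audio(phonemes)
    print(words)
    print(phonemes)
    print(len(audio) / SAMPLE_RATE, "seconds of placeholder audio")
```

Keeping the stages separate mirrors the list above: each one can be improved or swapped out independently. The fixed duration and flat tone per phoneme also hint at why naive synthesis sounds robotic, since nothing in this sketch models stress, pauses, or intonation.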

In general, there are three areas where TTS can be used in your business or content production:

Voice notifications and reminders. These let you deliver information to customers all over the world by phone call, and the messages can be spoken in each customer’s native language.
Listening to written content. You can hear a synthesized voice read your favorite book, email, or website content. This is important for people with limited reading and writing abilities, or for those who prefer listening over reading.
Localization. If you operate internationally, hiring employees who speak all of your customers’ languages can be costly. TTS can vocalize content in a foreign language almost instantly, provided the text has first been translated from English (or another source language) by a proper translation service.

The disadvantages of TTS applications

Despite the benefits TTS brings to the above-mentioned areas, it still has a number of shortcomings.

To achieve a natural-sounding voice with text-to-speech synthesis, the software should be capable of producing critical nuances such as tone of voice, stresses, pauses, cadences, and so on. Almost all text-to-speech applications fail to perform this complicated task, leading to low-quality results.

The most widespread TTS software shortcomings are:

Inability to properly convey emotions
Limited vocabulary and languages
Slow synthesis process
Inaccuracies
Robotic sounding voices

These shortcomings can significantly affect the result and hurt business outcomes, since nobody wants to listen to low-quality, robotic voices. This is where speech-to-speech voice cloning steps in.

Speech-to-Speech (STS) voice cloning

So what is speech-to-speech voice synthesis? It is an AI-powered technology that uses one person’s speech (not text) to generate speech in another person’s voice.

With speech-to-speech voice cloning technology, you can make your own voice sound like anyone you want. So how exactly can STS help to make TTS voices sound natural?

In short, STS uses one person’s spoken performance, rather than text, as the source for generating speech in another voice. The delivery comes from a human speaker, and only the voice itself changes.

So why choose STS over the TTS tech? Here are some of the most critical reasons:

STS allows you to do what is impossible with TTS, like synthesizing iconic voices of the past or saving time and money on ADR (automated dialogue replacement) for movie production.
STS voice cloning produces speech with a richer emotional palette, and the generated voice can be very hard to distinguish from the target voice.
STS technology lets celebrities scale their content production when they want to work on several projects at once but do not have the time.

Check the video to see how Respeecher’s speech-to-speech technology allows for producing different voice nuances.

Respeecher’s Voice Marketplace allows you to license a human voice from our voice library. You can synthesize an unlimited amount of speech using this voice for your project without leaving the platform, all at an amazing price point and a level of quality that no one else can match.

With Respeecher’s speech-to-speech voice synthesis, you can convert your voice into 60+ natural human (and animal) voices without sacrificing the full range of emotion.

Try it and experience the difference for yourself!

This article was initially published on the Respeecher blog.