Debunking the 4 Most Common Voice Synthesis Myths

Respeecher
Jun 3, 2021

In this article, we continue discussing voice synthesis technology and the positive consequences of its applications. We will look at the most common myths surrounding this technology and figure out if they carry any weight.

Before we get started, let’s recall what voice cloning is. At Respeecher, we use artificial intelligence (AI) to synthesize speech. You might be familiar with text-to-speech services, such as Google’s, that generate speech from the text you type. Respeecher is different. Our software does speech-to-speech conversion: instead of replacing a human being, it allows a person to speak in a different voice.

In short, it works like this. The voice cloning system analyzes recordings of the target person’s voice. Any other person (the source speaker) can then record the speech that is needed.

Respeecher then synthesizes the dialogue, combining the voice of the target person and the speech spoken by someone else. As a result, we get full-fledged speech in the target voice, except that the target person themselves did not say a single word of it. With their consent, of course.

All the intonations, emotions, and specific characteristics of the performance are conveyed as precisely as if the target person had spoken the words themselves.

1. Does that involve deepfake technology? I heard it is often used with malicious intent

First, we encourage you to read or listen to the Code[ish] Podcast episode The Ethical and Technical Side of Deep Fakes. In it, we explain in detail how the technology works for both video and audio deepfakes.

It makes no sense to deny that cybercriminals can use the technology to commit crimes and create negative news headlines. As with any technology, the problem is not the approach itself but how it is used by specific people.

To prove our point, here are some examples of how this same technology makes people’s lives better:

  • Synthesized speech helps people with various disabilities speak in their own voice, which they otherwise wouldn’t be able to do.
  • Video and audio deepfakes are widely used in the movie and game industries. The technology helps with dubbing in foreign languages as well as easing the post-production process.
  • Deepfakes have several uses in museums and universities. They help re-create authentic historical figures for educational purposes.

As a company actively working with technologies close to deepfake technologies, we take the possible moral and political implications seriously. Respeecher has developed a strict ethical code and has implemented tools such as an audio watermark to identify content synthesized using our technology.

2. The cloned voice still differs from the original, and not for the better

On the Internet, you may come across opinions stating that a voice synthesized using AI and machine learning can never be 100% similar to the original. This is perhaps one of the most easily debunked myths in voice synthesis.

Watch our Chief Research Officer Grant Reaber speak in Danielle Cohn’s voice. Pretty neat, right?

Speech-to-speech conversion software like Respeecher preserves natural prosody (the rhythm, stress, and intonation of speech) because the system excels at carrying the source speaker’s prosody over into the target voice.

Because the prosody comes from a real performance, content creators get an effectively unlimited prosodic palette, and to most listeners the synthesized voice sounds indistinguishable from the original.

Moreover, because the voice produced is a clone that follows the original performance, it avoids the lip-sync problems and other inconsistencies that traditional dubbing introduces.

Just watch this quick demo showcasing how our team plays around with the features that Respeecher has to offer. To the layman, the voice quality is indistinguishable from the original; you would not suspect that it’s voice synthesis.

3. A cloned voice is indistinguishable from the original

This myth is the opposite of the previous one: that a synthesized voice is so good that nobody can tell it apart from the original. As we said above, that is true for lay listeners, but not for sound professionals.

There are already several solutions on the market that specialize in voice fraud detection. In general, they rely on so-called voice biometric engines. In particular, the software is used to detect deceitful voice samples and to protect user data by preventing access to a device or application from being granted incorrectly.
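
As a rough illustration of one building block behind voice biometrics, the sketch below compares two "voiceprints" (fixed-length feature vectors extracted from audio) by cosine similarity. Everything here is an assumption made for illustration: the function names, the three-number voiceprints, and the 0.8 threshold are hypothetical, and real fraud-detection products combine speaker matching with dedicated checks for synthetic or replayed audio.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("voiceprints must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("voiceprints must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(enrolled: Sequence[float], sample: Sequence[float],
                 threshold: float = 0.8) -> bool:
    """Accept the sample only if it is close enough to the enrolled voiceprint.

    The threshold is a hypothetical value; a real system would tune it
    on labeled data to balance false accepts against false rejects.
    """
    return cosine_similarity(enrolled, sample) >= threshold


# Hypothetical voiceprints; a real engine would derive them from audio.
enrolled_voice = [0.9, 0.1, 0.4]
incoming_call = [0.88, 0.12, 0.41]
print(same_speaker(enrolled_voice, incoming_call))
```

A higher threshold rejects more impostors but also more genuine speakers, which is why such a value is normally tuned rather than fixed in advance.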

Also, services like Respeecher develop unique watermarks that are embedded in the synthesized audio recording. They are imperceptible to the average listener but easily detectable by sound engineers. The purpose is to make it easier to identify inappropriate content created using deepfake technologies.
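
To give a feel for how an audio watermark can be hidden and later found, here is a minimal sketch of a textbook spread-spectrum approach: a faint pseudo-random pattern derived from a secret key is added to the samples, and a detector that knows the key correlates the audio against the same pattern. This is not Respeecher's actual watermark, whose design is not described here; the function names, the key, and the strength value are all assumptions chosen for illustration.

```python
import math
import random
from typing import List, Sequence


def keyed_sequence(key: int, length: int) -> List[float]:
    """Pseudo-random sequence of +1/-1 values, reproducible from the key."""
    rng = random.Random(key)
    return [rng.choice((-1.0, 1.0)) for _ in range(length)]


def embed_watermark(samples: Sequence[float], key: int,
                    strength: float = 0.02) -> List[float]:
    """Add a low-amplitude keyed pattern to the audio samples.

    The strength is a toy value that makes detection easy; a real
    watermark would be shaped so that listeners cannot hear it.
    """
    mark = keyed_sequence(key, len(samples))
    return [s + strength * m for s, m in zip(samples, mark)]


def watermark_score(samples: Sequence[float], key: int) -> float:
    """Average correlation between the audio and the keyed pattern."""
    mark = keyed_sequence(key, len(samples))
    return sum(s * m for s, m in zip(samples, mark)) / len(samples)


def is_watermarked(samples: Sequence[float], key: int,
                   strength: float = 0.02) -> bool:
    """Marked audio scores near `strength`, unmarked audio near zero."""
    return watermark_score(samples, key) > strength / 2


# Ten seconds of a 220 Hz tone at 16 kHz stands in for synthesized speech.
sample_rate = 16_000
tone = [0.5 * math.sin(2 * math.pi * 220 * t / sample_rate)
        for t in range(10 * sample_rate)]
marked = embed_watermark(tone, key=1234)

print(is_watermarked(marked, key=1234))
print(is_watermarked(tone, key=1234))
```

Detection relies on averaging over many samples: the longer the clip, the more reliably the keyed pattern stands out from the audio itself, while someone without the key sees only faint noise.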

4. Voice cloning will never be affordable for anyone other than big Hollywood studios

Let’s be honest: speech synthesis is unlikely to become available to video bloggers with a small following or to private individuals any time soon. However, access to this technology isn’t restricted to huge companies and media giants. We’ve worked with small businesses, educational organizations, and prominent YouTubers.

In addition, we are constantly working to lower the entry threshold and democratize the synthetic media market. Not so long ago, we launched a Voice Marketplace, where small content creators can access voice cloning technology for a fraction of the cost.

In any case, whether you are a VTuber, a film company, or just curious about how Respeecher works, our technology helps you avoid heavy investment in costly production items such as:

  • Additional dialogue replacement
  • Virtual character creation
  • Voice dubbing
  • Localization

If you have questions about how you can use speech-to-speech conversion technologies in your project, contact us today. We will gladly advise you on where to start and provide you with a demo and a potential roadmap.

This article was initially published on the Respeecher blog.
