What is Speech Synthesis: Deepfake Voice Explained
In recent years, technology has continued to amaze us by disrupting many industries, including entertainment. Synthetic media can be defined as media content generated or modified through Artificial Intelligence (using machine learning and deep learning). It includes technologies such as voice cloning or voice synthesis, video synthesis, music synthesis, and so on.
In this article, we are going to dive deeper into speech synthesis technology, which is the computer-generated simulation of human speech. The roots of speech synthesis date back to 1779, when the German-Danish scientist Christian Kratzenstein built acoustic resonators that mimicked the human vocal tract when activated by vibrating reeds. As of 2021, after several stages of evolution, speech synthesis has advanced significantly thanks to the use of Artificial Intelligence.
Through speech synthesis, a person’s voice can be replaced with someone else’s (voice cloning). This technology is already used successfully in entertainment (TV, gaming) and in other sectors such as call centers. Its main goal is to bring more flexibility to the way content is produced, making the work of content creators easier.
A text-to-speech (TTS) system converts written text into spoken words. It can also help blind people or people who struggle with reading.
At Respeecher we use speech-to-speech (STS) technology: using software, we create speech that is indistinguishable from the voice of the speaker being cloned. Through STS voice conversion, the voice is replaced, but the speech patterns of the person who actually performs the lines keep their unique characteristics. This makes the resulting voice sound more natural. STS also allows greater control over the emotions expressed while speaking.
Generally speaking, speech-to-speech technology is preferred over text-to-speech because it provides a higher level of authenticity, while text-to-speech output tends to sound flat by comparison.
Applications of speech synthesis
This technology can be used for several purposes, such as:
1. For people who are unable to speak or read
Voice cloning can give a voice back to people who have lost their ability to speak due to serious conditions such as a stroke or a brain injury. At the same time, speech synthesis software can help blind people read and communicate.
2. Voice assistants
Think of Siri (Apple) or Alexa (Amazon). Now more than ever, retailers and other companies aim to personalize their interactions with customers. Voice assistants let them deliver a tailored experience and services, and are expected to make that interaction feel as human as possible.
3. In the entertainment industry
The filmmaking, gaming, and content creation industries can benefit the most from voice cloning technology. It is an ideal solution for changing dialogue in films and other projects that require a specific voice when the speaker is no longer available, or when bringing the speaker back to re-record would be too time-consuming.
For example, speech-to-speech technology could be used during the post-production of a film. With the permission of the actor, another person could record the lines and, through speech-to-speech technology, that recording could be converted into the actor’s voice.
At the same time, this technology could be used to resurrect voices from the past, for example the voice of a famous deceased personality, or to provide a younger-sounding voice when producers need one.
4. In business
Voice conversion could also be used in many business sectors, such as call centers (BPO companies), where it can support agents for whom English is not their primary language.
This technology could also add value to marketing campaigns, as it helps personalize the interaction between a brand and its consumers.
Unfortunately, synthetic speech can also be used for unethical purposes. Deepfake voice (a blend of “deep learning” and “fake”) refers to synthetic audio that makes it appear that someone said something they never said, and it is known for being used in fake news, fraud, and other negative contexts.
We believe that people should be educated to detect deepfakes. For our part, we collaborate with Hollywood studios and other companies in the entertainment industry, and when duplicating a voice we require the consent of the person or, in the case of a historical personality, we work with the estate that owns the rights to the person’s identity. Moreover, our primary goal is to make sure that synthetic speech technology is used properly and in line with ethical principles. You can read more about ethical voice cloning on our website.
Although voice cloning is used mostly in the entertainment industry and by other content creators, we will soon see more companies investing in this kind of technology, as it can serve many business purposes.
Speech synthesis software can truly revolutionize several industries, bringing flexibility and new alternatives. Media content will be produced more easily and in innovative ways.
This article was initially published by Respeecher as a guest post on The Tech Headlines.