If you thought ‘voice cloning’ and ‘deepfakes’ were recent buzzwords, think again. The earliest recorded attempt to mimic the human voice dates back to 1779, in Russia. Professor Christian Kratzenstein, working in his lab in St. Petersburg, built acoustic resonators that mimicked the human vocal tract when activated by vibrating reeds (much like wind instruments).
The next groundbreaking discovery followed in 1838, when Willis discovered the connection between the organization of the vocal tract and specific vowel sounds. This discovery inspired the construction of speaking machines: Alexander Graham Bell and his father built one towards the end of the 19th century.
At the 1939 New York World’s Fair, Homer Dudley presented the first electrical speech synthesizer, VODER (Voice Operating Demonstrator). VODER shared the same architecture as present-day devices based on the source-filter model of speech, but its output was of lower quality and intelligibility.
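The source-filter model mentioned above can be sketched in a few lines of code: a periodic glottal ‘source’ is passed through resonant ‘filters’ (formants) that shape it into a vowel. This is a hypothetical toy illustration, not VODER’s actual circuitry; the pulse rate and formant frequencies below are rough, illustrative values.

```python
import numpy as np

fs = 8000                      # sample rate (Hz)
t = np.arange(fs) / fs         # one second of time points
# The "source": a 120 Hz glottal pulse train (short pulses, mostly silence).
source = (np.mod(t * 120, 1.0) < 0.1).astype(float)

def resonator(x, freq, bw, fs):
    """Two-pole IIR resonator centered on `freq` (Hz) with bandwidth `bw` (Hz).

    This is the "filter" half of the source-filter model: each resonator
    emphasizes energy near one formant frequency of the vocal tract.
    """
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * freq / fs
    a1, a2 = -2 * r * np.cos(theta), r * r
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = x[n] - a1 * (y[n - 1] if n > 0 else 0.0) - a2 * (y[n - 2] if n > 1 else 0.0)
    return y

# Cascade two formants at roughly /a/-like frequencies to "color" the source.
vowel = resonator(resonator(source, 700, 130, fs), 1220, 70, fs)
```

Changing only the formant frequencies while keeping the same source produces different vowels, which is exactly the separation of excitation and vocal-tract shape that Willis’s observation and Dudley’s VODER exploited.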
Then, after a number of formant synthesizers, e.g., PAT (Parametric Artificial Talker) and OVE (Orator Verbis Electris), and rudimentary articulatory synthesizers, e.g., DAVO (Dynamic Analog of the VOcal tract), Noriko Umeda and her colleagues built the first full text-to-speech system for English in Japan in 1968.
From this moment onwards, the devices were able to produce intelligible speech. Throughout the ’80s and ’90s, neural networks and hidden Markov models were used extensively for speech synthesis, aiming to produce ever more intricate sounds, approaching the complexities of the human voice.
So where are we today?
Approaches based on deep learning, such as Generative Adversarial Networks (GANs), currently lead the development of voice synthesis. Although voice synthesis is undeniably a cost-saving mechanism (in gaming or audiobook production), the focus is on facilitating the use of many different-sounding voices at scale. Consider virtual assistants like Alexa or Siri, which require huge training datasets to produce even a single voice.
How can companies use voice cloning?
Let’s dive into that subject by taking a closer look at what Respeecher does and how it works. Respeecher combines deep neural networks with classical digital signal processing techniques for voice conversion, i.e., allowing a speaker to ‘borrow’ somebody else’s voice. The famous voices of talent can be scaled and modified, while accents can be softened and made easier to understand via a neutralization engine.
Some current application areas of voice cloning are the production and localization of movies and TV, video games, the production of podcasts and audiobooks, etc. Many content creators and prominent Hollywood film studios are already using synthetic voices.
Through the development of a real-time speech-to-speech conversion system (which we are currently working on), the range of application areas will expand. Voice conversion will be useful in call centers, and within gaming, it will allow players to use different voices in in-game chats. Besides these commercial directions, voice cloning technology will be potentially convenient to use for encryption and therapy.
Respeecher technology is twofold: a spectral mapper, and a neural vocoder network. The neural networks that implement both parts are trained with parallel data from the source and the target speakers.
The mapper first converts the ‘raw’ spectral representation of a source speech segment, comprising both its content and the speaker’s personal characteristics, into the corresponding representation of the target speech. The neural vocoder then polishes, so to speak, this ‘raw’ representation of the converted speech, adding more nuanced, fine-grained features. Finally, the vocoder outputs the resulting waveform, which lets the source speaker talk with the voice of the target person.
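The two-stage pipeline described above can be sketched as follows. This is a hypothetical illustration of the data flow only (source spectra → mapped spectra → waveform): in the real system both stages are deep networks trained on parallel source/target recordings, whereas here the mapper is stubbed with a single linear transform and the vocoder with a simple overlap-add synthesis. The function names, frame sizes, and hop length are assumptions for the sketch, not Respeecher’s actual API.

```python
import numpy as np

def spectral_mapper(source_frames: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Stage 1: map 'raw' source spectral frames toward the target speaker.

    A learned neural network would stand where this single linear map W
    stands; the point is the shape-preserving frame-to-frame conversion.
    """
    return source_frames @ W  # (n_frames, n_bins) @ (n_bins, n_bins)

def neural_vocoder(frames: np.ndarray, hop: int = 128) -> np.ndarray:
    """Stage 2: turn converted spectral frames into a waveform.

    A real neural vocoder synthesizes fine-grained detail; this stub
    just overlap-adds a noise burst shaped by each frame's spectral
    energy, to illustrate the frames-to-samples step.
    """
    n_frames, n_bins = frames.shape
    out = np.zeros(n_frames * hop + n_bins)
    rng = np.random.default_rng(0)
    for i, frame in enumerate(frames):
        out[i * hop : i * hop + n_bins] += rng.standard_normal(n_bins) * frame
    return out

# Usage: 50 frames of 64 spectral bins; an identity matrix stands in
# for the trained mapping between source and target speakers.
source = np.abs(np.random.default_rng(1).standard_normal((50, 64)))
mapped = spectral_mapper(source, np.eye(64))
waveform = neural_vocoder(mapped)
```

Splitting conversion (mapper) from waveform generation (vocoder) is a common design in voice-conversion systems: it lets each stage be trained and improved independently.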
Respeecher applications: The Nixon project
Respeecher worked with a team of researchers, journalists, and artists at MIT to create an alternate history of the Apollo 11 mission, in which astronauts Neil Armstrong and Buzz Aldrin are stranded on the moon. An alternate video of President Nixon was created, in which he informs the public about the tragic outcome of the journey to the moon.
The video was included in the award-winning art installation “In Event of Moon Disaster,” directed by Francesca Panetta and Halsey Burgund.
“We worked with Respeecher on a film called ‘In Event of Moon Disaster’ first shown at the Amsterdam Documentary Film Festival 2019. They helped us create a synthetic voice of Richard Nixon to bring to life a never-read contingency speech in case the Apollo 11 mission went badly. We created a highly realistic film, in a large part due to their work.
‘In Event of Moon Disaster’ shows the creative possibility of voice replacement technology as well as highlighting just how realistic deepfake technologies can be, acting as a civic engagement project for the public along with contextual and educational resources.”
Francesca Panetta, Creative Director of MIT’s Center for Advanced Virtuality
Ethical voice cloning
The ethical use of our voice cloning technology, which means anticipating and preventing potentially harmful applications, is foundational to Respeecher’s mission. Fake news, i.e., making people believe someone said something they did not, is the prime example of an application to be avoided.
We only use Respeecher technology for non-deceptive content-creation applications. If you’re curious to find out more about our mission for ethical voice cloning, you can read about it on our ethics page.
Voice Cloning B2B Use Cases
1. Voice replication
Speech that cannot be distinguished from the original speaker is created. Even subtle nuances and emotions are captured in the digitally replicated voices. It can be leveraged in the film industry, gaming, content creation, etc.
2. Dubbing and ADR
Voice cloning is a more effective alternative to traditional dubbing, because it allows the job to be done with fewer actors. Additionally, voiceovers may be added in an actor’s voice when their physical presence can’t be secured.
3. Audiobooks

New content can be created in the voices of people who are impossible, or extremely difficult, to record. Imagine listening to Mrs Dalloway read by Virginia Woolf herself. In fact, you no longer need to imagine: you can actually listen to the audiobook read in Woolf’s voice.
4. Call centers
Voice cloning can be used in call centers to make the voices that answer phone calls more uniform. You can actually make the whole call center speak with just one voice. Alternatively, you can address customers individually with a particular voice, wisely chosen to match their current disposition.
5. Entertainment

You can make sure to win the next karaoke competition you enter by singing in the voice of the famous person who released the song. In video games, voice cloning lets you speak with your character’s voice in in-game chats. These examples show that Respeecher offers what might be dubbed “auditively enhanced entertainment”.
6. Therapy for speech problems
In the case of an accident or stroke, or if you are hearing-impaired, Respeecher technology gives you the chance to (re)gain your voice, thereby fostering real communication.
We are committed to the mission of offering content creators of the future resources that, until not long ago, were simply not available. We aim to deliver our clients quality synthetic speech, enabling them to overcome all sorts of speech-related problems.
Speech-to-speech conversion, the milestone that follows text-to-speech, applied to human-generated content makes users feel included in the dynamic and emotional environment of the 21st century.
What Respeecher does can be rightly labeled “ethical voice cloning” because we never dismiss our commitment to a morally and legally responsible use of breakthrough technology.
This article has initially been published by Respeecher as a guest post on Techtelegraph.