Deepfake Voice Technology: The Good. The Bad. The Future

Respeecher
Feb 8, 2021

Deepfake voice technology based on voice cloning, or quasi-perfect reproductions of a person’s voice, can be used both for good and for ill. It can power voice synthesis that gives a voice back to people who would otherwise lose it to acute or chronic conditions such as ALS, apraxia, traumatic brain injury, or stroke.

It is already being used in film and TV, gaming, and call centers, and it also holds promise for encryption and therapy. However, there is no denying that it can pose a significant threat to democratic processes, particularly those that depend on privacy. Used inappropriately, deepfake voice technology can enable deception and harassment.

Precisely because we are fully aware of this, we at Respeecher place great emphasis on using voice technology in ways that minimize the risk of fooling people into thinking someone said something they didn’t.

We are committed to ensuring that our groundbreaking technology is used only for ethical projects and does not fall into the wrong hands. We never use the voice of a private person or an actor without permission, and we always obtain the voice owner’s written consent. We do, however, allow non-deceptive use of the voices of historical figures and politicians, such as Richard Nixon or Barack Obama, but only for projects that meet ethical standards.

What about use cases for businesses? The numbers show a huge opportunity for organizations to leverage voice to acquire and retain business. According to a 2018 AppDynamics report, half of all web searches were forecast to be voice-driven by 2020.

61% of the surveyed IT decision-makers take this even further, expecting voice commands to completely replace manually typed commands for finding information on the Internet. The young generation leads the way: 84% of millennials already use voice assistants to help them keep track of their daily schedules and responsibilities.

The Good

  • Giving the ability to speak naturally back to people with a wide range of medical conditions

The ability to communicate, to share thoughts and feelings through spoken words, is extremely important; indeed, it is among the things that make us humans special. But groundbreaking voice conversion technology can do even more for people with vocal impairments.

Consider the expansion of voice-controlled home automation. Voice cloning could make people who can’t speak naturally more independent and better able to use devices that respond to spoken commands.

  • Voice assistants

According to Ovum’s Digital Assistant and Voice AI–Capable Device Forecast: 2016–21, voice assistants will outnumber the human beings living on Earth by 2021. Take Google Assistant as an example. Its voice is generated by Tacotron 2, a text-to-speech system built from two deep neural networks.

The first network transforms the text into a spectrogram, a visual representation of audio frequencies over time; a WaveNet vocoder then reads the spectrogram and generates the corresponding audio. The outcome is speech that is nearly indistinguishable from human speech, even in the pronunciation of challenging words.
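
To make that two-stage pipeline concrete, here is a minimal sketch using the pretrained Tacotron 2 and WaveGlow models that NVIDIA publishes on PyTorch Hub. Note the assumptions: WaveGlow stands in for the WaveNet vocoder described above, the hub entry names follow NVIDIA’s published example, and a CUDA-capable GPU is assumed.

```python
import torch

# Stage 1: sequence-to-sequence model mapping characters to a mel spectrogram.
tacotron2 = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub",
                           "nvidia_tacotron2", model_math="fp16")
tacotron2 = tacotron2.to("cuda").eval()

# Stage 2: neural vocoder turning the spectrogram into raw audio samples
# (WaveGlow here, standing in for the WaveNet vocoder used by Google).
waveglow = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub",
                          "nvidia_waveglow", model_math="fp16")
waveglow = waveglow.remove_weightnorm(waveglow).to("cuda").eval()

# Text preprocessing helpers published alongside the models.
utils = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub", "nvidia_tts_utils")
sequences, lengths = utils.prepare_input_sequence(
    ["Deepfake voice technology can be used for good."])

with torch.no_grad():
    mel, _, _ = tacotron2.infer(sequences, lengths)  # text -> spectrogram
    audio = waveglow.infer(mel)                      # spectrogram -> waveform

# `audio` holds 22,050 Hz samples, ready to save as a WAV file or play back.
```

The split is the design insight here: the first network only has to learn what the speech should sound like in the frequency domain, while the vocoder specializes in rendering realistic waveforms from that intermediate representation.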

  • Interactive content for online learning courses

Voice cloning with artificial intelligence removes the need to record narration for every new session, or to re-record in order to correct mistakes. This reduces both the financial and the time costs of professionally recorded lectures, and so fosters the proliferation of online courses. That is no small thing, particularly during the tough times we are traversing under COVID-19 restrictions.

The Bad

  • Blackmail

By combining deepfake voice technology with deepfake video, someone can create fake yet extremely realistic videos of explicit sexual or violent scenes and use them for extortion.

  • Spam emails

If you’ve received an email asking you to “contact bank X via below email to guide you further on the wire transfer procedure”, you quite likely regard it as spam and do nothing about it. However, a follow-up phone call from somebody who sounds exactly like a trusted contact, advising you to respond to the email, might change your mind and push you into doing something you’ll later wish you hadn’t.

  • Enabling unlawful competition

Someone may impersonate the CEO of company X and, from that position, present sham figures during fake earnings calls, fooling stakeholders and investors into valuing the stock at something other than what it is really worth. The same illicit technique can be used to sabotage industry rivals.

We are painfully aware that synthetic media technology can be used in harmful ways. This is one of the reasons our tech is not accessible to the public. We restrict usage of our voice conversion systems to non-deceptive content creation applications by limiting who we work with and what we allow them to do with our technology.

We hope that, by being early to market, we can help educate the public about what is technically possible and make people less likely to fall for deceptive synthetic speech. We also think gatekeepers such as YouTube and Facebook have an important role to play in limiting this harm, and we are ready to work with such platforms to detect and prominently label synthetic speech.

The Future

  • B2B use cases

Given the findings of the AppDynamics report cited above, the corporate use of voice conversion technology should be framed not in terms of if, but of when: 69% of IT decision-makers work for organizations that already invest in voice technology or plan to invest in it within the next three years.

Voice replication, therapy for speech problems, dubbing and ADR, encryption, gaming: all of these are likely to benefit from voice cloning. In gaming, for instance, the real-time system currently under development will allow players to use different voices in in-game chat. Things are evolving fast when it comes to deepfake voice technology.

  • Voice cloning for call centers

Respeecher is now working on breakthrough technology that will let overseas operators sound like locals. At long last, we are getting closer to a solution that makes operators sound more like the people they are speaking with over the phone. Along the same lines, our robotic operators will soon sound more human once we get this voice makeover up and running.

Conclusion

Deepfake voice technology does indeed pose security risks, but acknowledging them and working to minimize them are the first steps towards focusing mainly on the good. Cloned voices that sound indistinguishable from the original speakers are a match made in heaven for filmmakers, game developers, and other media content creators, and soon for call centers as well.

Since digitally replicated voices can already capture nuance and emotion, the range of applications keeps widening, as will be obvious to anyone who has recently tried to talk to (yes, the phrase is correct!) a virtual assistant such as Samsung’s Bixby, Apple’s Siri, or Amazon’s Alexa.

Counseling and companionship are the new functions expected to emerge from voices that express emotion. The creative process will be streamlined by the newly acquired ability to change content without re-recording the original voices.

This article was originally published by Respeecher as a guest post on HackerNoon.

Written by Respeecher

Respeecher is your reliable AI voice partner that delivers ethical & authentic voices across creative, education, healthcare, tech & cybersecurity industries
