Why Voice Cloning Matters More Than You Think


I remember growing up as a kid watching Rich Little impersonate a wide range of celebrities, politicians, and sports figures. I was always fascinated by how realistic the impersonations were and amazed that anyone could mimic another voice with such accuracy. Today, a handful of comedians such as Frank Caliendo can deliver similarly realistic impersonations of famous people. The fact that we can name only a handful of these performers shows how rare this talent is, even more so when they can impersonate more than a few voices.

What if I told you that one day you will be able to do this? 

What if I told you it would be as easy as typing a message into your computer or speaking into a microphone? 

What if I told you that the technology is not only available today but it is also rapidly improving and becoming significantly more scalable?

Last week I attended “You Don’t Say: An FTC Workshop on Voice Cloning Technologies” in Washington, D.C., and was blown away by what I heard. The workshop’s goal was to discuss the state of voice cloning technology and to examine particular areas of interest. It was divided into three discussion panels and started with an introduction to the state of voice cloning technologies.

The state of voice cloning technologies

Voice synthesis technologies began to take root in the 1960s with simple electronic models that approximated new sounds by matching frequency and amplitude. The result was voices that were robotic and distinctly non-human, yet close enough to human speech to be usable. The example given by the speaker was the old Speak and Spell toy.


If you are old enough to remember, the Speak and Spell used a computer-generated voice to interact with the user. Interestingly enough, Stephen Hawking used a similar technology for his now-famous voice. Even as the industry advanced, he declined system upgrades and kept the robotic voice because it had become his own.
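
To make the idea concrete, here is a minimal sketch of that era’s approach: summing a few fixed-frequency sine waves at chosen amplitudes to approximate a vowel. The frequencies and amplitudes below are rough illustrative values loosely suggesting an “ah”-like timbre, not measurements from the Speak and Spell or any other device.

```python
# Minimal 1960s-style synthesis sketch: a fixed set of sine partials.
import numpy as np
import wave

SAMPLE_RATE = 16000
DURATION = 1.0  # seconds

t = np.arange(int(SAMPLE_RATE * DURATION)) / SAMPLE_RATE

# (frequency in Hz, relative amplitude) pairs; rough values only
partials = [(120, 1.0), (240, 0.6), (700, 0.8), (1220, 0.5), (2600, 0.2)]

signal = sum(amp * np.sin(2 * np.pi * freq * t) for freq, amp in partials)
signal /= np.max(np.abs(signal))  # normalize to [-1, 1]

# Write a 16-bit mono WAV file so the "robotic vowel" can be played back
pcm = (signal * 32767).astype(np.int16)
with wave.open("robotic_vowel.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(SAMPLE_RATE)
    f.writeframes(pcm.tobytes())
```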

The speaker went on to explain that there are roughly 20 different structures within the human body that can alter sound in the vocal tract (e.g., lips, tongue, sinuses). This understanding has allowed us to move beyond simple synthesis and to create more complex sounds. Elements such as tone and harmonics come into play, and the synthesized voice can now reflect the true instrument it represents.
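
A simple way to picture this is the classic source-filter model: a buzzy glottal pulse train (the source) shaped by resonant filters that stand in for the vocal tract’s formants (the filter). The sketch below is illustrative only; the formant frequencies and bandwidths are rough textbook values for an “ee” vowel, not anything presented at the workshop.

```python
# Source-filter synthesis sketch: pulse train through formant resonators.
import numpy as np
from scipy.signal import lfilter

fs = 16000
duration = 1.0
pitch_hz = 110  # glottal pulse rate

# Source: an impulse train at the pitch period
n = int(fs * duration)
source = np.zeros(n)
source[::fs // pitch_hz] = 1.0

def resonator(x, freq_hz, bandwidth_hz, fs):
    """Second-order resonant filter modeling one formant."""
    r = np.exp(-np.pi * bandwidth_hz / fs)
    theta = 2 * np.pi * freq_hz / fs
    b = [1 - r]  # small gain keeps the output bounded
    a = [1, -2 * r * np.cos(theta), r * r]
    return lfilter(b, a, x)

# Filter: cascade of resonators, one per formant ("ee"-like values)
speech = source
for freq, bw in [(270, 60), (2290, 90), (3010, 120)]:
    speech = resonator(speech, freq, bw, fs)

speech /= np.max(np.abs(speech))
```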

The introduction and evolution of deep learning has also improved voice synthesis and, subsequently, voice cloning capabilities. Deep learning builds very large models out of complex collections of statistical functions. Starting in 2012, capabilities expanded even further with the increased availability of high-compute graphics processing units (GPUs).

Just two or three years ago, Generative Adversarial Network (GAN) approaches required significant amounts of audio to build usable models. Now, combining vocal tract and language models enables the creation of synthesized voices from as few as five 5-second audio clips.
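
Real few-shot cloning systems rely on learned speaker embeddings trained on thousands of voices, which obviously can’t be reproduced here. As a loose, toy stand-in, the sketch below builds a crude “voiceprint” from a handful of short clips by averaging their log power spectra, then compares two voiceprints with cosine similarity. The file names are placeholders of my own, not anything from the workshop.

```python
# Toy speaker characterization from a few short clips (not a real embedding).
import numpy as np
import wave

def load_wav(path):
    """Load a WAV file; assumes 16-bit mono PCM."""
    with wave.open(path, "rb") as f:
        pcm = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)
    return pcm.astype(np.float64) / 32768.0

def voiceprint(clips, n_fft=1024):
    """Average log power spectrum over all frames of all clips."""
    frames = []
    for clip in clips:
        for start in range(0, len(clip) - n_fft, n_fft // 2):
            frame = clip[start:start + n_fft] * np.hanning(n_fft)
            frames.append(np.log(np.abs(np.fft.rfft(frame)) ** 2 + 1e-10))
    return np.mean(frames, axis=0)

def similarity(vp_a, vp_b):
    """Cosine similarity between two voiceprints."""
    return np.dot(vp_a, vp_b) / (np.linalg.norm(vp_a) * np.linalg.norm(vp_b))

# Five ~5-second clips per speaker (hypothetical file paths)
alice = voiceprint([load_wav(f"alice_{i}.wav") for i in range(5)])
bob = voiceprint([load_wav(f"bob_{i}.wav") for i in range(5)])
print("speaker similarity:", similarity(alice, bob))
```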

Models still have some limitations. For instance, synthetic speech still needs to be more dynamic and more natural to the human ear. However, machines are starting to reproduce the dynamic elements of speech (e.g., emphasis on certain words), so as we move forward it will become increasingly difficult to distinguish what is real from what is fake.

Panel 1 – The positives and negatives of voice cloning

This was a very interesting and thought-provoking panel discussion on how voice cloning technology can be used for good, and for bad.

Healthcare 

The panel started with fascinating use cases from a representative of Boston Children’s Hospital. He explained that a number of medical conditions can cause people to lose their voices. For instance, approximately 80% of people diagnosed with Amyotrophic Lateral Sclerosis (ALS) will lose their voice. Other conditions and surgeries can also leave patients without a voice.

To improve these patients’ quality of life, hospitals will often bank a patient’s voice samples in software systems so the voice can be used in the future. This includes not only the stored phrases but also the software’s ability to use machine learning to generate dynamic content that sounds like the patient’s voice, all so the patient can continue to authentically express their emotions in their true voice. This was an inspiring use of technology that is making a real difference in people’s lives.

Professional protections

One of the interesting aspects of this workshop was the diverse points of view offered by the participating panelists. For instance, a representative from the Screen Actors Guild - American Federation of Television and Radio Artists (SAG-AFTRA) presented that union’s unique perspective on voice cloning.

SAG-AFTRA represents approximately 160,000 actors, announcers, broadcast journalists, recording artists, voice-over artists, and other entertainers. Some actors are known for their voices, and many times a voice is their most valuable asset. SAG-AFTRA is interested in protecting that asset and ensuring there is consent and compensation whenever a performer’s voice is used. There have already been situations where an actor’s voice was used in a video game without consent.

The voice of doom

Beyond entertainment, there are serious ramifications to the misuse of voice cloning technology. Now more than ever, viewers and listeners need to be able to trust broadcast journalism; this can be a matter of national security. Imagine what could happen if a “recording” of prominent politicians engaged in criminal activity were posted on social media, or if a radio broadcast were hijacked using a voice clone of a famous newscaster to incite panic or spread misinformation.

A representative from the Department of Justice was present to weigh in on the ugly side of the technology, serving as the self-professed “voice of doom”. In particular, she spoke to two areas ripe for voice cloning exploitation: cyberstalking and fraud. With deepfake video and audio, criminals now have the ability to communicate anonymously, and law enforcement is seeing an uptick in this activity.

She provided a number of terrifying examples that demonstrate the potential for serious misuse. A cyberstalker could create a clone of a target’s voice and then use it, along with personal information such as the target’s home address, to invite individuals with bad intentions to her house without her knowledge. In sextortion cases, criminals could use a voice clone to impersonate someone’s spouse, solicit explicit images, and then threaten the victim with public distribution of the compromising material.

There are also numerous opportunities for fraud. Using voice cloning technologies, criminals could more easily conduct social engineering and extract personal information and passwords from unsuspecting victims. Grandparent scams could be enhanced when criminals use the voice of a grandchild to call the grandparent, solicit money, and have it transferred to the criminal’s account.


We’ve already seen a case where a CEO’s voice was cloned and used to direct a wire transfer of nearly $250,000 to a criminal’s bank account.

Given the algorithms’ ability to work with very small fragments of audio, criminals could eavesdrop on individuals with high-quality microphones to collect enough samples to generate a usable clone. A clone becomes even more viable when played over a degraded channel such as a telephone line, making it harder for the human on the other end to identify a fraud.
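
A small illustration of why a degraded channel helps: telephone audio is band-limited to roughly 300-3400 Hz, which discards the high frequencies where synthesis artifacts often live. The sketch below simulates that channel with a standard Butterworth band-pass filter; the cutoff values are the conventional narrowband telephone range, not anything cited by the panel.

```python
# Simulating a narrowband telephone channel with a band-pass filter.
import numpy as np
from scipy.signal import butter, lfilter

def telephone_channel(audio, fs):
    """Band-limit audio to the classic 300-3400 Hz telephone passband."""
    b, a = butter(4, [300 / (fs / 2), 3400 / (fs / 2)], btype="band")
    return lfilter(b, a, audio)

fs = 16000
t = np.arange(fs) / fs
# Toy wideband signal: a speech-band tone plus a high-frequency "artifact"
clean = np.sin(2 * np.pi * 500 * t) + 0.3 * np.sin(2 * np.pi * 6000 * t)
degraded = telephone_channel(clean, fs)  # the 6 kHz component is removed
```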

Even with these harrowing examples, her position was not to outlaw the technology outright, but rather to increase public awareness so people develop the skills to recognize a threat when they encounter one. These crimes are not yet rampant, but the damage from such attacks could be devastating and long-lasting for the victim.

Panel 2 – The ethics of voice cloning

This panel discussed the ethical dilemmas raised by the use of voice cloning technologies. Companies providing these capabilities will need to understand the ramifications of the technology and how it might be misused.

Microsoft spoke to the controls surrounding its new Custom Neural Voice solution, part of the cloud-based Azure Cognitive Services platform. This is a gated technology that requires applying to Microsoft for access. Upon application, Microsoft assesses the benefits of the proposed implementation, analyzes the risk, and then layers a governance model on top. This is still an evolving capability, and work remains to ensure the technology is used appropriately.

Panel 3 – Authentication, detection, and mitigation

Given the emerging nature of this technology, there is a great deal of ongoing research into authentication, detection, and mitigation. This session discussed promising findings from research into bispectral analysis of voice samples and its ability to detect a synthesized voice. One of the websites offered as an example of the technology was fakejoerogan.com, where you can see the capability for yourself.

I’m not going to pretend to be an expert in bispectral analysis, but suffice it to say researchers are using it to determine whether voice samples are real or fake. Bispectral analysis examines higher-order correlations among the frequency components of a voice signal. Lyrebird and WaveNet are two voice synthesizers that have been analyzed; both may sound real to the human ear but look different to machines performing the analysis. The approach is still being tested, but initial results are encouraging. With AI algorithms facilitating fake voices, it is important to develop digital media forensics and foster a community to fight fake media. As with other security vulnerabilities, it will remain a game of cat and mouse.
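
For the curious, here is a rough sketch of the kind of statistic involved: a direct, frame-averaged bispectrum estimate that correlates pairs of frequency components against their sum frequency. Real and synthesized speech can exhibit different higher-order structure in this quantity. This is a textbook estimator of my own choosing, not the panelists’ actual pipeline.

```python
# Direct (frame-averaged) bispectrum estimate B(f1, f2).
import numpy as np

def bispectrum(audio, n_fft=256, hop=128):
    """Estimate B(f1, f2) = E[X(f1) * X(f2) * conj(X(f1 + f2))]."""
    half = n_fft // 2
    idx = np.add.outer(np.arange(half), np.arange(half))  # f1 + f2 indices
    acc = np.zeros((half, half), dtype=complex)
    frames = 0
    for start in range(0, len(audio) - n_fft, hop):
        X = np.fft.fft(audio[start:start + n_fft] * np.hanning(n_fft))
        # Correlate each (f1, f2) pair against its sum frequency
        acc += np.outer(X[:half], X[:half]) * np.conj(X[idx])
        frames += 1
    return acc / max(frames, 1)
```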

A representative from the Defense Advanced Research Projects Agency (DARPA) discussed its Semantic Forensics (SemaFor) research program, which is investigating rich semantic algorithms to detect, attribute, and characterize fake multi-modal media in order to defend against large-scale automated misinformation attacks.

The vendor community was also represented in the discussion and provided insight into how it is developing software to detect spoofs and voice clones. One approach is to determine whether the sound emanates from a recording device or a human vocal tract; proprietary algorithms can distinguish between the two because the synthesized voice is typically flatter and less dynamic. It is also possible to determine whether a voice sample contains artifacts of synthetic speech. Under analysis, synthetic speech is often too perfect: the natural artifacts of human speech are absent, and this very absence serves as a flag for clone detection.
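
As a hedged illustration of that “too perfect” cue, the toy detector below measures per-frame spectral flatness and flags clips whose flatness barely varies from frame to frame, on the assumption that natural speech fluctuates more. The threshold is arbitrary and for illustration only; the vendors’ proprietary methods are certainly more sophisticated.

```python
# Toy "too uniform" detector based on per-frame spectral flatness.
import numpy as np

def spectral_flatness(frame):
    """Ratio of geometric to arithmetic mean of the power spectrum."""
    power = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2 + 1e-12
    return np.exp(np.mean(np.log(power))) / np.mean(power)

def too_uniform(audio, n_fft=512, hop=256, threshold=1e-3):
    """Flag audio whose spectral flatness varies suspiciously little."""
    flatness = [spectral_flatness(audio[i:i + n_fft])
                for i in range(0, len(audio) - n_fft, hop)]
    return np.var(flatness) < threshold  # arbitrary illustrative threshold
```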

Conclusion

Voice cloning is a fascinating technology that will continue to evolve and improve rapidly. While it can be used for significant human benefit, the risks are real and growing. It will remain important to stay abreast of changes within the industry and to follow the state of scientific research in order to monitor the technology’s maturity over time.

Of critical importance will be the need to build public awareness. Just as with Photoshopped images in the past, a new hyper-awareness needs to be established to combat the complex and dynamic threat posed by deepfake audio and video. As with many technologies today, I’m excited about the potential yet somewhat concerned about the pitfalls.
