Even though there are fears that the current generation of AI technology is due to hit a ceiling around 2026, the industry is still evolving with every new release. The latest is Microsoft's VALL-E, a model that can replicate and mimic human speech in a matter of seconds.
This isn't the first time a company has attempted to create an AI that can mimic human speech, but previous attempts have routinely shown how difficult and time-consuming such an undertaking is. The core issue is that these systems take far too long to learn individual voices, not to mention the vocal intricacies that each person instinctively employs.
Microsoft has done something truly remarkable here. VALL-E has astonished much of the tech community with how quickly it can replicate human speech: on average, it needs only around three seconds of recorded speech to reproduce someone's voice, intonation, and general vocal idiosyncrasies. That is the smallest training sample for voice cloning the industry has yet seen.
If you're interested, Microsoft's researchers recently released a paper on arXiv (the preprint server run by Cornell University) explaining how VALL-E works. The paper also breaks down the differences between VALL-E and other text-to-speech synthesizers.
Here is an excerpt from the paper that'll impress science and technology buffs: 'Because the training data is relatively small, current TTS systems still suffer from poor generalization. Speaker similarity and speech naturalness decline dramatically for unseen speakers in the zero-shot scenario. Large-scale data crawled from the Internet cannot meet the requirement, and always leads to performance degradation.'
'VALL-E significantly outperforms the state-of-the-art zero-shot TTS system [Casanova et al., 2022b] in terms of speech naturalness and speaker similarity, with +0.12 comparative mean opinion score (CMOS) and +0.93 similarity mean opinion score (SMOS) improvement on LibriSpeech. VALL-E also beats the baseline on VCTK with +0.11 SMOS and +0.23 CMOS improvements.'
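In rough terms, the paper frames zero-shot TTS as language modeling over discrete audio-codec tokens: the three-second enrollment clip is encoded into codec tokens, and the model continues that token stream conditioned on the phonemes of the target text, after which a codec decoder turns the tokens back into audio. The toy sketch below only illustrates that data flow; every function here is a simplified stand-in of our own invention, not Microsoft's code (the real system uses a neural codec and a large Transformer).

```python
# Toy sketch of the zero-shot TTS pipeline described in the paper.
# All three functions are deterministic stand-ins for illustration only.

def encode_audio(samples):
    """Stand-in for a neural audio codec: raw samples -> discrete tokens."""
    return [abs(int(s * 100)) % 1024 for s in samples]

def phonemize(text):
    """Stand-in for a phonemizer: text -> symbol IDs."""
    return [ord(c) % 64 for c in text.lower() if c.isalpha()]

def generate_codec_tokens(phonemes, prompt_tokens, length=20):
    """Stand-in 'language model': continue the enrollment prompt's token
    stream, conditioned on the phoneme sequence (a fake next-token rule)."""
    out = []
    for i in range(length):
        p = phonemes[i % len(phonemes)]
        c = prompt_tokens[i % len(prompt_tokens)]
        out.append((p * 7 + c) % 1024)  # toy stand-in for sampling a token
    return out

# A ~3-second enrollment clip (toy samples) plus the target text go in;
# codec tokens come out, which a codec decoder would render as speech.
prompt = encode_audio([0.1, -0.2, 0.33, 0.5])
tokens = generate_codec_tokens(phonemize("hello world"), prompt)
print(len(tokens))  # 20 generated codec tokens
```

The point of the sketch is the shape of the computation, not the arithmetic: the speaker's identity enters only through the short prompt token sequence, which is why a few seconds of audio suffice.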
In simple terms, Microsoft's researchers have found a way to do something that was long thought to be all but impossible. As GHacks recently reported, Apple Books has released an AI tool that can turn any book into an audiobook.
However, that tool has faced harsh criticism over the way it sounds. I listened to it in action and found it pleasing, but others have not been as kind. The release of a tool like VALL-E could well revolutionize the audiobook industry and build on the work Apple has started.