When training an artificial intelligence system to transcribe speech to text, it is necessary to use many audio-text pairs. That is, the AI is given, for example, the audio of someone saying "this is a cat" together with its transcription, so it can associate each word with a sound. This works well for widespread languages, such as English or Spanish, but not for minority languages. Facebook, however, claims to have found a solution: wav2vec-U, with the "U" standing for "Unsupervised".
What is wav2vec-U? It's a way to build a speech recognition system that requires no transcribed pairs at all. It learns from audio and unpaired text alone, which eliminates the need for transcribed audio. To do this, the system uses a GAN (generative adversarial network) that, according to Facebook, competes head-to-head with the best supervised systems of a few years ago.
As Alexei Baevski, Wei-Ning Hsu, Alexis Conneau, and Michael Auli detail on the Facebook AI blog, their method starts by learning the structure of speech from unlabeled audio. Using their previous model, wav2vec 2.0, they segment the speech recording into speech units that correspond to individual sounds. For example, "cat" has three sounds: /K/, /AE/, and /T/.
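To make that first step concrete, here is a minimal sketch of how one might extract wav2vec 2.0 representations and group frames into discrete units. It uses torchaudio's pretrained bundle and k-means as stand-ins for Facebook's own fairseq checkpoints and segmentation; the file name and the number of clusters are illustrative assumptions:

```python
import torch
import torchaudio
from sklearn.cluster import KMeans

# Pretrained wav2vec 2.0 from torchaudio (a stand-in for Facebook's
# fairseq checkpoints); "recording.wav" is a placeholder file.
bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

waveform, sample_rate = torchaudio.load("recording.wav")
waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

with torch.inference_mode():
    # extract_features returns one tensor per transformer layer,
    # each of shape (batch, frames, dim): roughly one vector per 20 ms.
    features, _ = model.extract_features(waveform)
    frames = features[-1].squeeze(0)  # (frames, dim)

# Cluster the frame vectors into discrete "speech units"; runs of the
# same cluster id approximate one sound. k=128 is an arbitrary choice.
units = KMeans(n_clusters=128, n_init=10).fit_predict(frames.numpy())
print(units[:20])
```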
To teach the system to recognize the words in audio, they used a GAN, which, like all GANs, consists of a generator and a discriminator. The generator takes each audio segment, predicts the phoneme corresponding to its sound in the language, and tries to fool the discriminator. The discriminator is itself another neural network, trained on both the generator's text output and real text from different sources broken into phonemes. This is important: real text from different sources, not transcriptions of the audio we are trying to transcribe.
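The following is a minimal PyTorch sketch of that generator/discriminator pairing, not Facebook's actual fairseq implementation; the phoneme inventory size, feature dimensionality, and layer shapes are all illustrative assumptions:

```python
import torch
import torch.nn as nn

NUM_PHONEMES = 40   # hypothetical phoneme inventory size
FEATURE_DIM = 512   # hypothetical audio-segment feature dimensionality

class Generator(nn.Module):
    """Maps audio-segment features to a phoneme distribution per segment."""
    def __init__(self):
        super().__init__()
        # A single convolution, so each prediction also sees its neighbors.
        self.proj = nn.Conv1d(FEATURE_DIM, NUM_PHONEMES, kernel_size=5, padding=2)

    def forward(self, segments):                       # (batch, time, FEATURE_DIM)
        logits = self.proj(segments.transpose(1, 2))   # (batch, NUM_PHONEMES, time)
        return logits.transpose(1, 2).softmax(dim=-1)  # (batch, time, NUM_PHONEMES)

class Discriminator(nn.Module):
    """Scores how much a phoneme sequence looks like real phonemized text."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(NUM_PHONEMES, 256, kernel_size=5, padding=2),
            nn.GELU(),
            nn.Conv1d(256, 1, kernel_size=5, padding=2),
        )

    def forward(self, phonemes):                       # (batch, time, NUM_PHONEMES)
        scores = self.net(phonemes.transpose(1, 2))    # (batch, 1, time)
        return scores.mean(dim=(1, 2))                 # one realism score per sequence
```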
The discriminator's job is to evaluate whether the predicted phoneme sequences (/K/, /AE/, and /T/ if we are talking about "cat") look realistic. The generator's first transcriptions are lousy, but with time and the discriminator's feedback they become more and more accurate. And this is quite an achievement: the system is never told how "cat" is spelled; it works out, from the sounds that make up the word, that it must be written that way.
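Reusing the Generator and Discriminator sketched above, a toy adversarial training step might look like the following; the dummy tensors stand in for real unpaired batches, and the plain BCE losses are a simplification of the actual GAN objective used in the paper:

```python
import torch
import torch.nn as nn

gen, disc = Generator(), Discriminator()
opt_g = torch.optim.Adam(gen.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

# Dummy unpaired batches: audio features and phonemized text that do
# NOT correspond to each other (that is the whole point of wav2vec-U).
audio_segments = torch.randn(8, 50, FEATURE_DIM)
real_ids = torch.randint(0, NUM_PHONEMES, (8, 50))
real_text = nn.functional.one_hot(real_ids, NUM_PHONEMES).float()

ones, zeros = torch.ones(8), torch.zeros(8)

# Discriminator step: real phonemized text should score 1, generator output 0.
fake = gen(audio_segments).detach()
loss_d = bce(disc(real_text), ones) + bce(disc(fake), zeros)
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: try to make the discriminator label its output as real.
loss_g = bce(disc(gen(audio_segments)), ones)
opt_g.zero_grad(); loss_g.backward(); opt_g.step()

print(f"loss_d={loss_d.item():.3f}  loss_g={loss_g.item():.3f}")
```

Repeated over many batches, this feedback loop is what gradually pushes the generator's phoneme predictions toward sequences that read like real text.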
To test the system, Facebook used the TIMIT and LibriSpeech benchmarks and claims that "wav2vec-U is as accurate as the state of the art from just a few years ago, without using any labeled training data." That said, these two benchmarks measure performance on English speech, a language with a large corpus of spoken and transcribed text. Facebook's system, however, is more interesting for minority languages, such as Swahili, Tatar, or Kyrgyz, whose data corpora are much smaller.
It is undoubtedly a big step forward for speech transcription. It remains to be seen how Facebook will implement it, if at all. In the meantime, Zuckerberg's company has published the code needed to build this speech recognition system. It can be found on GitHub, and anyone can access it to try it out.