Today, we are one step closer to the immortal celebrity future we have long been promised (since April). Meta has announced Voicebox, a generative text-to-speech model that promises to do for the spoken word what ChatGPT and Dall-E did for text and image generation, respectively.
Essentially, it is a text-to-output generator much like GPT or Dall-E, except that instead of creating prose or pretty pictures, it spits out audio clips. Meta defines the system as “a non-autoregressive flow-matching model trained to infill speech, given audio context and text.” It was trained on more than 50,000 hours of unfiltered audio. Specifically, Meta used recorded speech and transcripts from numerous public domain audiobooks written in English, French, Spanish, German, Polish and Portuguese.
The researchers say this diverse data set allows the system to produce more conversational-sounding speech, regardless of which language each party speaks. “Our results show that speech recognition models trained on Voicebox-generated synthetic speech perform almost as well as models trained on real speech,” they wrote. The computer-generated speech led to only a 1 percent error rate degradation, compared with the 45 to 70 percent degradation seen with synthetic speech from earlier text-to-speech models.
The system was first taught to predict speech segments based on the segments surrounding them and the transcript of the passage. “Having learned to infill speech from context, the model can then apply this across speech generation tasks, including generating portions in the middle of an audio recording without having to recreate the entire input,” the Meta researchers explained.
Voicebox can also reportedly edit audio clips, removing noise from the speech and even replacing misspoken words. “A person could identify which segment of the speech is corrupted by noise (like a dog barking), crop it, and instruct the model to regenerate that segment,” the researchers said. It is much like using image editing software to clean up your photos.
Text-to-speech generators have been around for a while. They are how your parents' TomToms were able to give dodgy driving directions in Morgan Freeman's voice. Modern iterations such as Speechify or ElevenLabs' Prime Voice AI are far more capable, but they still require mountains of source material to properly mimic their subject, plus another pile of data for every. single. other. subject you want them trained on.
Voicebox does not, thanks to a new zero-shot text-to-speech training method that Meta calls Flow Matching. The benchmark results are not even close: Meta's AI reportedly outperformed the current state of the art in both intelligibility (a word error rate of 1.9 percent vs. 5.9 percent) and “audio similarity” (a composite score of 0.681 compared with the SOA's 0.580), all while running as much as 20 times faster than today's best TTS systems.
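For a sense of what the intelligibility figure measures: word error rate is commonly computed as the number of word-level substitutions, insertions and deletions needed to turn a transcript of the generated speech into the reference text, divided by the number of words in the reference. The Python sketch below illustrates that general definition only. The function name and the example sentences are made up for illustration, and this is not Meta's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative helper (hypothetical name), not Meta's evaluation code.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into nothing takes i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up example: one wrong word out of six.
    reference = "turn left at the next junction"
    transcript = "turn left at the next function"
    print(f"WER: {word_error_rate(reference, transcript):.1%}")
```

In this made-up example, one wrong word in a six-word reference gives a word error rate of about 16.7 percent. By the same measure, a rate of 1.9 percent works out to roughly two word errors per hundred reference words.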
But don’t line up your celebrity navigators just yet. Neither the Voicebox app nor its source code is being released to the public at this time, Meta confirmed on Friday, citing “the potential risks of misuse” despite the many exciting use cases for generative speech models. Instead, the company released a series of audio samples and the program’s initial research paper. In the future, the research team hopes the technology will find its way into prosthetics for patients with vocal cord damage, in-game NPCs and digital assistants.