While U.S. carriers are busy marketing their new 5G networks, the reality is that the vast majority of people won’t experience the advertised speeds. There are still many parts of the U.S. — and around the world — where data speeds are slow, so to compensate, services like Google Duo use compression techniques to efficiently deliver the best possible video and audio experience. Google is now testing a new audio codec that aims to substantially improve audio quality on poor network connections.
In a blog post, the Google AI team details its new high-quality, very low-bitrate speech codec, named “Lyra.” Like traditional parametric codecs, Lyra’s basic architecture involves extracting distinctive speech attributes (also known as “features”) in the form of log mel spectrograms that are then compressed, transmitted over the network, and recreated on the other end using a generative model. Unlike more traditional parametric codecs, however, Lyra uses a new high-quality audio generative model that can not only extract critical parameters from speech but also reconstruct speech using minimal amounts of data. The new generative model used in Lyra builds upon Google’s previous work on WaveNetEQ, the generative model-based packet-loss-concealment system currently used in Google Duo.
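To make the feature-extraction step concrete, here is a minimal sketch of computing log mel spectrogram features from raw audio using only NumPy. This is a generic illustration of the technique, not Google's actual Lyra front end; the frame size, hop length, and number of mel bands are arbitrary example values.

```python
import numpy as np

def hz_to_mel(f):
    # Convert frequency in Hz to the mel scale.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):      # rising slope
            if center > left:
                fb[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):     # falling slope
            if right > center:
                fb[i - 1, k] = (right - k) / (right - center)
    return fb

def log_mel_spectrogram(signal, sr=16000, n_fft=512, hop=160, n_mels=40):
    # Frame the signal, window each frame, take the magnitude spectrum.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop: i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, n=n_fft))
    # Pool spectral bins into mel bands, then compress with a log.
    mel = mag @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log(mel + 1e-6)

# Example: 100 ms of a 440 Hz tone sampled at 16 kHz.
t = np.arange(1600) / 16000.0
features = log_mel_spectrogram(np.sin(2 * np.pi * 440.0 * t))
print(features.shape)  # (n_frames, n_mels)
```

In a parametric codec these compact per-frame feature vectors, rather than the raw waveform, are what gets quantized and sent over the network.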
Google says its approach puts Lyra on par with the state-of-the-art waveform codecs used in many streaming and communication platforms today. The benefit of Lyra over these waveform codecs, according to Google, is that Lyra doesn’t transmit the signal sample by sample, which requires a higher bitrate (and thus more data). To overcome the computational complexity concerns of running a generative model on-device, Google says Lyra uses a “cheaper recurrent generative model” that works “at a lower rate” but generates multiple signals at different frequency ranges in parallel that are later combined “into a single output signal at the desired sample rate.” Running this generative model on a mid-range device in real-time yields a processing latency of 90ms, which Google says is “in line with other traditional speech codecs.”
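The multi-band idea can be sketched as follows: several signals are produced at a reduced sample rate, each is shifted into its own frequency range, and the results are summed into one full-rate output. This is a rough illustration of the concept only; `generate_band` is a hypothetical stand-in (random noise here) for Lyra's recurrent generative model, and the band-combining scheme is a simplified example, not Google's actual synthesis filterbank.

```python
import numpy as np

def generate_band(n, rng):
    # Hypothetical stand-in for the per-band generative model:
    # just random samples at the reduced rate.
    return rng.standard_normal(n)

def combine_bands(bands, sr):
    # Upsample each low-rate band by zero-stuffing, shift it into its
    # frequency range by modulating with a cosine, then sum the bands.
    n_bands = len(bands)
    n = len(bands[0]) * n_bands
    out = np.zeros(n)
    t = np.arange(n) / sr
    for b, band in enumerate(bands):
        up = np.zeros(n)
        up[::n_bands] = band  # zero-stuff up to the full sample rate
        center = (b + 0.5) * sr / (2 * n_bands)  # band center frequency
        out += up * np.cos(2 * np.pi * center * t)
    return out

rng = np.random.default_rng(0)
sr, n_bands, n_low = 16000, 4, 400  # four bands, each generated at 4 kHz
bands = [generate_band(n_low, rng) for _ in range(n_bands)]
signal = combine_bands(bands, sr)
print(len(signal))  # 1600 samples at the full 16 kHz rate
```

The point of the design is that each generator only runs at a fraction of the output sample rate, and the bands can be computed in parallel, which is what makes real-time on-device synthesis feasible.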
Paired with the AV1 codec for video, Google says that video chats can take place even for users on an ancient 56kbps dial-up modem. That’s because Lyra is designed to operate in heavily bandwidth-constrained environments, at bitrates as low as 3kbps. According to Google, Lyra easily outperforms the royalty-free open-source Opus codec as well as other codecs like Speex, MELP, and AMR at very low bitrates. In the speech samples Google provided, every sample except the audio encoded in Lyra suffers from degraded audio quality at these very low bitrates.
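The bandwidth math here is simple: with speech taking only 3kbps, nearly the entire dial-up link is left for video. A quick back-of-the-envelope calculation, using the figures from the article and ignoring protocol overhead:

```python
# Rough bandwidth budget for a call over a 56 kbps dial-up link,
# assuming Lyra speech at 3 kbps (figures from the article; real
# links lose some capacity to protocol overhead).
modem_kbps = 56
lyra_kbps = 3
video_budget_kbps = modem_kbps - lyra_kbps
print(video_budget_kbps)  # 53 kbps left over for AV1 video
```

For comparison, Opus is typically run at far higher bitrates for speech, which is why it degrades badly when squeezed down to the 3kbps range where Lyra is designed to operate.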
Google says Lyra was trained on “thousands of hours of audio with speakers in over 70 languages using open-source audio libraries,” with the audio quality then verified by expert and crowdsourced listeners. The new codec is already rolling out in Google Duo to improve call quality on very low bandwidth connections. While Lyra is currently aimed at speech use cases, Google is exploring how to make it into a general-purpose audio codec.
Update 1: Google open-sources Lyra codec used in Google Duo
Earlier this week, Google announced that it had open-sourced Lyra, the new audio codec used by Google Duo, so other developers can use it in their own communication apps. The release comes with the tools needed to encode and decode audio with Lyra, and it is optimized for 64-bit ARM Android, with development on Linux. The open-source library focuses on the use of Lyra for real-time voice communication, but Google expects developers to apply the codec to other applications that involve encoding and decoding speech. The code is written in C++, and the core API, signal processing toolchain, and a demo Android app are now available on GitHub as a beta release under the Apache license.