WaveNetEQ

Improving Audio Quality in Duo with WaveNetEQ

WaveNetEQ is a generative model based on DeepMind’s WaveRNN technology. It is trained on a large corpus of speech data to realistically continue short speech segments, which enables it to fully synthesize the raw waveform of missing speech. Because Duo calls are end-to-end encrypted, all processing needs to be done on-device. The WaveNetEQ model is fast enough to run on a phone while still providing state-of-the-art audio quality and more natural-sounding packet loss concealment (PLC) than other systems currently in use.

WaveNetEQ uses an autoregressive network to provide the audio continuation during a packet loss event and a conditioning network to model long-term features, like voice characteristics. The spectrogram of the past audio signal is used as input for the conditioning network, which extracts limited information about the prosody and textual content. This condensed information is fed to the autoregressive network, which combines it with the audio of the recent past to predict the next sample in the waveform domain. This differs slightly from the procedure followed during training of the WaveNetEQ model, where the autoregressive network receives the actual sample present in the training data as input for the next step, rather than the last sample it produced. This process, called teacher forcing, ensures that the model learns valuable information even at an early stage of training, when its predictions are still of low quality. Once the model is fully trained and put to use in an audio or video call, teacher forcing is only used to “warm up” the model for the first sample; after that, its own output is passed back as input for the next step.
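The difference between teacher forcing and free-running generation can be illustrated with a toy example. This is a minimal sketch, not the WaveNetEQ implementation: the `Predictor` type, the function names, and the scalar `conditioning` value are hypothetical stand-ins for the autoregressive network and the output of the conditioning network, and the real model is a recurrent network that keeps internal state rather than seeing only one previous sample.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for the autoregressive network:
# (previous sample, conditioning value) -> predicted next sample.
Predictor = Callable[[float, float], float]


def teacher_forced_predictions(
    predict: Predictor, target: Sequence[float], conditioning: float
) -> List[float]:
    """Training-style pass: every step is fed the TRUE previous sample."""
    return [predict(target[t - 1], conditioning) for t in range(1, len(target))]


def generate(
    predict: Predictor, context: Sequence[float], conditioning: float, num_samples: int
) -> List[float]:
    """Inference-style pass: start from real audio, then feed back own output."""
    prev = context[-1]  # the only real (teacher-forced) input: the warm-up sample
    output: List[float] = []
    for _ in range(num_samples):
        prev = predict(prev, conditioning)  # the prediction becomes the next input
        output.append(prev)
    return output


def toy_predictor(prev: float, conditioning: float) -> float:
    """Toy model (assumption for illustration): decay the last sample, add a bias."""
    return 0.5 * prev + conditioning


if __name__ == "__main__":
    real_audio = [1.0, 2.0, 4.0]
    print(teacher_forced_predictions(toy_predictor, real_audio, 0.0))
    print(generate(toy_predictor, [8.0], 0.0, 3))
```

In the teacher-forced pass a poor prediction never affects the next input, which is why it is useful early in training. In the generation pass every prediction is fed back in, so errors can accumulate over the length of the concealed gap.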

The model is applied to the audio data in Duo’s jitter buffer. Once the real audio continues after a packet loss event, we seamlessly merge the synthetic and real audio streams. To find the best alignment between the two signals, the model generates slightly more output than is required and then cross-fades from one to the other. This makes the transition smooth and avoids noticeable noise.
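The align-then-cross-fade step described above can be sketched roughly as follows. The post does not say how the alignment is computed, so the least-squared-error search, the linear fade, and the names `best_offset`, `merge`, and `overlap` are all assumptions made for illustration.

```python
from typing import List, Sequence


def best_offset(synthetic: Sequence[float], real: Sequence[float], overlap: int) -> int:
    """Index into `synthetic` whose next `overlap` samples best match real[:overlap].

    Assumption: "best" means smallest sum of squared differences.
    """
    if overlap < 1 or overlap > len(real) or overlap > len(synthetic):
        raise ValueError("overlap must fit inside both signals")

    def squared_error(k: int) -> float:
        return sum((synthetic[k + i] - real[i]) ** 2 for i in range(overlap))

    return min(range(len(synthetic) - overlap + 1), key=squared_error)


def merge(synthetic: Sequence[float], real: Sequence[float], overlap: int) -> List[float]:
    """Cross-fade from the synthetic signal into the real one at the best offset."""
    k = best_offset(synthetic, real, overlap)
    faded: List[float] = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of the real signal rises linearly
        faded.append((1.0 - w) * synthetic[k + i] + w * real[i])
    # Synthetic samples past the overlap are the "extra" output and are discarded.
    return list(synthetic[:k]) + faded + list(real[overlap:])


if __name__ == "__main__":
    synthetic = [0.0, 1.0, 2.0, 3.0, 9.0]  # generated slightly longer than needed
    real = [2.0, 3.0, 4.0, 5.0]            # real audio resuming after the loss
    print(merge(synthetic, real, overlap=2))
```

Generating a little more synthetic audio than the gap requires gives the search several candidate positions, so the fade can start where the two signals already agree most closely.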
