Low Bit-rate Speech Coding With VQ-VAE and a WaveNet Decoder

“Low Bit-rate Speech Coding With VQ-VAE and a WaveNet Decoder” by Cristina Garbacea, Aäron van den Oord, Yazhe Li, Felicia S. C. Lim, Alejandro Luebs, Oriol Vinyals, and Thomas C. Walters has been accepted at ICASSP 2019 and will be presented this week at the conference in Brighton, UK. The work was carried out during my internship with Google DeepMind. I am posting the abstract of the paper below:

In order to efficiently transmit and store speech signals, speech codecs create a minimally redundant representation of the input signal which is then decoded at the receiver with the best possible perceptual quality. In this work we demonstrate that a neural network architecture based on VQ-VAE with a WaveNet decoder can be used to perform very low bit-rate speech coding with high reconstruction quality. A prosody-transparent and speaker-independent model trained on the LibriSpeech corpus coding audio at 1.6 kbps exhibits perceptual quality which is around halfway between the MELP codec at 2.4 kbps and the AMR-WB codec at 23.05 kbps. In addition, when training on high-quality recorded speech with the test speaker included in the training set, a model coding speech at 1.6 kbps produces output of similar perceptual quality to that generated by AMR-WB at 23.05 kbps.
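
To make the bit-rate arithmetic concrete, here is a minimal sketch of the vector-quantization bottleneck at the heart of such a codec: the encoder emits one latent vector per block of samples, only the index of the nearest codebook entry is transmitted, and the receiver looks the embedding back up before a WaveNet decoder synthesizes audio. The sample rate, stride, codebook size, latent dimensionality, and the `quantize`/`dequantize` helpers below are illustrative assumptions chosen so the numbers work out to 1.6 kbps, not the paper's exact configuration.

```python
import numpy as np

# Illustrative settings (assumptions for this sketch, not the paper's values):
SAMPLE_RATE = 16_000   # Hz, input speech sample rate
STRIDE = 80            # encoder downsampling: one latent vector per 80 samples
CODEBOOK_SIZE = 256    # 256 entries -> 8 bits per transmitted code
LATENT_DIM = 64        # dimensionality of each latent vector

# Bit-rate arithmetic: codes per second x bits per code.
codes_per_second = SAMPLE_RATE / STRIDE    # 200 codes/s
bits_per_code = np.log2(CODEBOOK_SIZE)     # 8 bits
print(codes_per_second * bits_per_code)    # 1600.0 bps = 1.6 kbps

def quantize(latents: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each latent vector to the index of its nearest codebook entry.

    latents:  (T, LATENT_DIM) encoder outputs
    codebook: (CODEBOOK_SIZE, LATENT_DIM) learned embeddings
    returns:  (T,) integer indices -- the only data sent over the channel
    """
    # Squared Euclidean distance from every latent to every codebook entry.
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def dequantize(indices: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Look up embeddings at the receiver; a WaveNet decoder conditioned
    on these vectors would then synthesize the output waveform."""
    return codebook[indices]

# Toy usage: one second of latents, i.e. 200 one-byte codes -> 1.6 kbps.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(CODEBOOK_SIZE, LATENT_DIM))
latents = rng.normal(size=(int(codes_per_second), LATENT_DIM))
codes = quantize(latents, codebook)
recovered = dequantize(codes, codebook)
```

In a trained VQ-VAE the codebook is learned jointly with the encoder and decoder; the sketch only shows the transmission-time path, which is what determines the bit rate.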

For more details, please see the paper.

Author: Cristina Garbacea

PhD student in Computer Science and Engineering, University of Michigan, Ann Arbor
