Autoencoders and variational autoencoders [1] represent an important family of models commonly used for representation learning, whose main goal is to derive a compact encoding of the data that explains its underlying structure in an unsupervised way. This compact encoding lends itself to a wide variety of use cases including forming a better representation of the data, to be used as features for downstream supervised tasks, or lossy data compression after discretization. In this work, we focus on the data compression use case.
While it is natural to use a forward-only encoder and decoder architecture for non-sequential data like images [2], there is no standard autoencoder architecture for temporally correlated data with variable length and long-range dependencies, such as video, speech, and text. The main challenge lies in capturing correlation information at different time-scales in an online/sequential fashion.
Among many existing works that apply auto-encoding to sequential data, [3, 4] use an RNN for both encoder and decoder, where the encoder summarizes the entire input sequence into the last recurrent state, which is then used as the recurrent state initialization for the decoder. This approach does not work well with variable input length, as a fixed-length code is used irrespective of the sequence length. [5] proposes a two-scale autoencoder design, whereby a local-scale encoder extracts dynamic information of the data in each time step, and a global-scale one encodes long-term information common to the entire sequence. This two-scale design, although providing a nice disentanglement of global and local information, is not suitable for the online compression setting since the global encoding is not available until the entire sequence is observed. Another approach is to divide the long sequence into blocks and conduct auto-encoding on each block, e.g., as is done in [6], where the video data is divided into blocks of 8 frames. The drawback is that chunking creates discontinuities between blocks and hinders the learning of temporal dependencies with range longer than the block size. In VQ-VAE [7, 8], the authors apply a sequential encoder and a WaveNet-type decoder to audio data, whereas we show that a decoupled encoder and decoder without feedback may lead to suboptimal performance.
In this paper, we propose a new autoencoder architecture tailored for learning a compact discrete representation of temporally correlated data in a sequential fashion. The discrete bottleneck codes can be used as lossy source codes of the sequential data.
We demonstrate the power of the proposed architecture for speech spectrogram compression, where the data at each time step is a frame of a spectrogram. Since spectrograms (and their variants such as the cepstrum) are often used as speech features in automatic speech recognition (ASR) and neural waveform synthesis [9, 10, 11], an efficient compression could help reduce bandwidth in cloud-based ASR as well as bitrate in neural-vocoder based speech coding.
The contributions of this work are as follows. (i) We propose a new recurrent autoencoder architecture, termed feedback recurrent autoencoder (FRAE), with the salient feature of decoder-to-encoder recurrent state feedback, which is shown to be superior to other recurrent schemes. (ii) The recurrent structure of FRAE facilitates an easy extension to its variational counterpart, which allows variable-rate encoding. (iii) We show that the system can produce high-quality speech waveforms at a low bitrate when paired with a powerful neural vocoder.
In this section, we discuss and compare different autoencoding schemes for the compression of a correlated sequence.
The most naive approach is to encode and decode data at each time-step independently, as illustrated in Fig. 1(a). Since the network operations at different time-steps are decoupled, however, any temporal correlation is completely ignored and thus the latent codes necessarily encode redundant information over time, leading to inefficient compression/representation of the correlated data sequence.
Hence, to capture any temporal correlation, we require time-domain coupling in the autoencoding process, which can be achieved by either (1) applying a feed-forward network that can access data from multiple time-steps, e.g. using temporal convolution [8] or self-attention [12], or (2) using recurrent network architectures. In this work, we focus on the latter approach. We list several designs of recurrent structures in Fig. 1(b)-(f) and identify defects in each design to motivate the proposed architecture.
Fig. 1: Different recurrent autoencoder schemes (recurrent connections are displayed in red) (a) No recurrency (b) Encoder only (c) Decoder only (d) Separate (e) Latent feedback (f) Output feedback
First, adding a recurrent connection to only the encoder or the decoder, as shown in Fig. 1(b) and (c), does not allow the latent codes to utilize temporal correlation and thus is inherently flawed. In case (b), the encoder can access history information of the input data through the recurrent connection, while the decoder only has access to the code $z_t$ of the current time-step $t$. As a result, each $z_t$ is supposed to fully describe $x_t$, and thus temporal redundancy is not being utilized. Similarly for case (c), adding the recurrent connection on the decoder alone has little benefit given that each latent code $z_t$ is formed only from the current data input $x_t$.
It is also a bad design to simply add separate recurrent connections to the encoder and decoder, as illustrated in Fig. 1(d). In this case, even though both encoder and decoder have access to history information, due to the limited dimension of the bottleneck, the information that the decoder has access to is a lossy version of that exposed to the encoder. This is even more pronounced when the bottleneck latents are quantized into a finite set of discrete values to perform lossless compression of latent codes. Due to this mismatch, the encoder is unable to construct latent codes that are tailored to the decoder's context. This motivates a feedback connection from decoder to encoder.
For the two schemes illustrated in Fig. 1(e) and (f), either the code $z_{t-1}$ or the reconstructed output $\hat{x}_{t-1}$ is fed back to the encoder at time-step $t$. These feedback connections inform the encoder only of the decoder status at the previous time-step, ignoring longer-range dependencies. For case (f), we empirically observe instability in training.
It is worth noting that the video compression framework of DVC [13] can be viewed as an instantiation of Fig. 1(f) where decoded data in the previous time steps are fed back to the encoder for explicit motion and residual information compression, and the one proposed in VQ-VAE [7, 8] can be viewed as a convolutional variant of Fig. 1(d) where both encoder and decoder use convolution to cover a large temporal receptive field without any decoder-to-encoder feedback.
2.1. Feedback Recurrent AutoEncoder
Based on the above discussion, an ideal sequential autoencoding design should have the following properties: (1) both encoder and decoder have recurrent connections; (2) there is feedback from decoder to encoder; (3) the design is capable of utilizing long-term temporal correlation.
We introduce a simple autoencoder design that satisfies all three requirements. As illustrated in Fig. 2, the salient feature is that the decoder feeds back its recurrent state to the encoder, hence the name Feedback Recurrent AutoEncoder (FRAE). This structure can be interpreted as a non-linear predictive coding scheme: the recurrent state $h_t$ contains a summary of previously decoded frames; the encoder may take advantage of the existing information in $h_t$ to form a code $z_{t+1}$ which encodes only the residual information missing to reconstruct $x_{t+1}$. We can interpret $h_t$ as containing information regarding the prediction/extrapolation of the next frame $x_{t+1}$, corrected by additional/residual information from the encoder captured in $z_{t+1}$.
The proposed FRAE architecture can be used for online lossy compression of sequential data at a fixed rate: at each time step $t$, the encoder converts a data sample $x_t$ to the quantized latent code $z_t$ and then losslessly transmits $z_t$ to the decoder with a fixed-length code. The bitrate of this scheme is determined by the dimension of the bottleneck and the size of the quantization alphabet in each dimension. Note that the encoding process requires running the decoder network up to the generation of the decoder recurrent state, as illustrated in Fig. 2. This resembles the analysis-by-synthesis principle, as the encoding process carries out decoding operations to come up with a better code.
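The encode/decode loop described above can be sketched in a few lines of NumPy. This is only a toy illustration under stated assumptions: the layer sizes, the random untrained weights, the plain tanh recurrence (standing in for the GRU layers mentioned later in the paper), the hand-picked codebook, and the function names `decoder_step`, `encode`, and `decode` are all hypothetical and not taken from the paper. The point it illustrates is that the encoder runs the decoder's state update itself, so both sides hold the same recurrent state after every frame.

```python
import numpy as np

rng = np.random.default_rng(0)
FRAME, DIM, STATE = 6, 3, 5                   # hypothetical toy sizes
CODEBOOK = np.array([-1.5, -0.5, 0.5, 1.5])   # hypothetical 4-level scalar codebook

# Random, untrained weights; a real model would learn these.
W_enc = rng.normal(0, 0.5, (FRAME + STATE, DIM))   # encoder: [x_t, h] -> latent
W_h = rng.normal(0, 0.5, (DIM + STATE, STATE))     # decoder state update: [z, h] -> h
W_out = rng.normal(0, 0.5, (STATE, FRAME))         # decoder output: h -> x_hat


def decoder_step(indices, h):
    """Update the decoder state from the received symbols and emit a frame."""
    z = CODEBOOK[indices]
    h_new = np.tanh(np.concatenate([z, h]) @ W_h)
    return h_new @ W_out, h_new


def encode(frames):
    """Encode frames one by one, conditioning on the fed-back decoder state."""
    h = np.zeros(STATE)
    codes = []
    for x in frames:
        latent = np.concatenate([x, h]) @ W_enc
        indices = np.abs(latent[:, None] - CODEBOOK).argmin(axis=1)
        codes.append(indices)
        # Analysis-by-synthesis: the encoder runs the decoder state update
        # so that its copy of the state matches what the decoder will have.
        _, h = decoder_step(indices, h)
    return np.array(codes), h


def decode(codes):
    """Reconstruct frames from the transmitted symbols alone."""
    h = np.zeros(STATE)
    out = []
    for indices in codes:
        x_hat, h = decoder_step(indices, h)
        out.append(x_hat)
    return np.array(out), h


frames = rng.normal(size=(10, FRAME))
codes, h_enc = encode(frames)
recon, h_dec = decode(codes)
print(codes.shape, recon.shape, np.allclose(h_enc, h_dec))
```

Each frame is transmitted as `DIM` symbols from a 4-letter alphabet, i.e. 2 bits per dimension with a fixed-length code. Because the weights are untrained, the reconstruction quality here is meaningless; only the data flow is of interest.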
In Table 1, we compare the performance of different recurrent schemes when applied to speech spectrogram compression. The results show that FRAE leads to lower reconstruction distortion compared with any other scheme in Fig. 1 with the same bottleneck dimension and quantization scheme. Details of the experiment settings are deferred to Section 3.1.
Like FRAE, the DRAW architecture [15, 16] also features a feedback connection from the decoder to the encoder. Its motivation and application, however, are different from those of FRAE: DRAW focuses on the progressive encoding of a single image, using feedback to allow the network to correct its previous mistakes in an iterative fashion, while FRAE focuses on online compression of sequential data and uses feedback as a way to extract long-term temporal redundancy and provide complementary information to the decoder.
Fig. 2: Feedback Recurrent Autoencoder (FRAE)
Table 1: Test set performance of different recurrent schemes for wide-band (16KHz) spectrogram compression at 1.6Kbps.
2.2. Feedback Recurrent Variational AutoEncoder
The average bitrate of the previously described lossy compression scheme may be further reduced without sacrificing distortion, using a variable-length code with a trainable probability model over the latent code $z_t$. We refer to this combination of FRAE and the latent probability model as the feedback recurrent variational autoencoder (FR-VAE).
To train an FR-VAE model, we train the FRAE architecture and the probability model jointly based on the following objective, which quantifies the rate-distortion trade-off of the scheme:
$$\min \; \mathbb{E}_{x} \sum_{t} \left[ d(x_t, \hat{x}_t) + \beta \log \frac{1}{p(z_t)} \right] \qquad (1)$$
where $d$ denotes a distortion function and $p(z_t)$ is the probability that the latent probability model assigns to $z_t$. The first term captures the amount of distortion incurred, while the second term captures the rate, since $\log \frac{1}{p(z_t)}$ is the ideal codeword length of $z_t$ when $p$ is used for entropy coding. We can trade off between the rate and the distortion by sweeping the hyper-parameter $\beta$. The regular FRAE training is a special case with $\beta = 0$, or it can be viewed as having a fixed uniform prior on the latent codes. We remark that the rate-distortion objective in Eq. (1) is equivalent to the $\beta$-VAE objective if the encoder is deterministic, as in our case; we refer interested readers to [17, 6] for a detailed discussion on lossy compression and variational inference.
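As a rough numerical illustration of this rate-distortion objective, the sketch below adds a distortion term to a weighted ideal codeword length. It assumes plain MSE as the distortion (the paper trains with a Mel-scale MSE), measures the rate in bits, and writes the trade-off weight as `beta`; the function name and array shapes are hypothetical.

```python
import numpy as np


def rate_distortion_loss(x, x_hat, code_probs, beta):
    """Distortion plus beta times the ideal codeword length, averaged over frames.

    x, x_hat:   (T, F) original and reconstructed frames
    code_probs: (T, D) probability the prior assigns to each transmitted symbol
    Returns (loss, average rate in bits per frame).
    """
    distortion = np.mean((x - x_hat) ** 2, axis=1)   # plain MSE per frame, shape (T,)
    rate_bits = -np.log2(code_probs).sum(axis=1)     # ideal code length per frame, shape (T,)
    loss = np.mean(distortion + beta * rate_bits)
    return float(loss), float(rate_bits.mean())


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 161))
x_hat = x + 0.1                              # pretend reconstruction with a constant error
uniform = np.full((5, 8), 0.25)              # uniform prior over 4 levels, 8 dimensions

loss_no_rate, rate = rate_distortion_loss(x, x_hat, uniform, beta=0.0)
loss_with_rate, _ = rate_distortion_loss(x, x_hat, uniform, beta=0.01)
print(rate, loss_no_rate, loss_with_rate)
```

With a uniform prior over four levels every symbol costs 2 bits, so the rate term is a constant 16 bits per frame and cannot influence training, which matches the remark that fixed-rate FRAE corresponds to a fixed uniform prior.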
We propose to condition the prior model on the decoder recurrent state, yielding an autoregressive prior $p(z_{t+1} \mid h_t)$, as the decoder recurrent state $h_t$ already summarizes the history of the latent codes $z_{\le t}$; see Fig. 3. Not only does this require only a small add-on to the existing FRAE architecture, but we also empirically demonstrate that this specific design choice leads to a better rate-distortion trade-off than a time-invariant prior model $p(z_{t+1})$ and a prior model $p(z_{t+1} \mid z_t)$ that is conditioned only on the latent codes from the previous time-step; see Section 3.2 for the detailed experiment.
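To make the idea of a prior conditioned on the decoder state concrete, here is a minimal sketch of a small MLP that maps a state vector to one categorical distribution per latent dimension and reports the resulting ideal code length. The sizes, the random weights, the absence of bias terms, the factorization across latent dimensions, and the names `prior_probs` and `code_length_bits` are all assumptions made for illustration; the paper only states that a simple MLP is used.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE, HIDDEN, DIM, LEVELS = 16, 32, 8, 4   # hypothetical sizes

# Random, untrained weights (no biases, to keep the sketch short).
W1 = rng.normal(0, 0.1, (STATE, HIDDEN))
W2 = rng.normal(0, 0.1, (HIDDEN, DIM * LEVELS))


def prior_probs(h):
    """Return a (DIM, LEVELS) table of symbol probabilities given state h."""
    hidden = np.tanh(h @ W1)
    logits = (hidden @ W2).reshape(DIM, LEVELS)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerically stable softmax
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)


def code_length_bits(h, indices):
    """Ideal number of bits needed to entropy-code `indices` under the prior."""
    probs = prior_probs(h)
    return float(-np.log2(probs[np.arange(DIM), indices]).sum())


h = rng.normal(size=STATE)                   # stand-in for a decoder recurrent state
indices = rng.integers(0, LEVELS, size=DIM)  # stand-in for the next quantized code
print(code_length_bits(h, indices))
```

A prior conditioned on the previous code instead would take the previous symbols as input in place of `h`, and a time-invariant prior would drop the input entirely and learn a single probability table.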
Fig. 3: Feedback Recurrent Variational AutoEncoder (FR-VAE)
In this section, we focus on the problem of speech spectrogram compression and conduct three experiments. In Section 3.1 we demonstrate the effectiveness of FRAE by comparing its performance with other recurrent autoencoder designs in Fig. 1. In Section 3.2, we focus on FR-VAE and compare the rate-distortion trade-off of different prior models. In Section 3.3, we use FRAE transcoded speech spectrograms to condition a WaveNet [18] and generate high-quality speech waveforms at a low bitrate.
Across all experiments, recurrent autoencoder networks are constructed using a combination of convolutional layers, fully-connected layers, and GRU layers, with a total of around 1.5 million parameters. For bottleneck quantization, we apply the technique used in [6] to quantize each dimension independently with a jointly learned codebook of size four.
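A simple way to picture per-dimension quantization with a four-entry codebook is nearest-centre scalar quantization, sketched below. This is one plausible reading rather than the exact procedure of [6]: the centre values are made up, the codebook is assumed to be shared by all dimensions, and the gradient handling needed to learn the centres during training is omitted.

```python
import numpy as np


def quantize(latent, codebook):
    """Map each latent dimension to its nearest codebook centre.

    latent:   shape (dim,), continuous encoder output
    codebook: shape (levels,), scalar centres shared by all dimensions
    Returns (symbol indices, quantized values).
    """
    distances = np.abs(latent[:, None] - codebook[None, :])   # (dim, levels)
    indices = distances.argmin(axis=1)
    return indices, codebook[indices]


codebook = np.array([-1.5, -0.5, 0.5, 1.5])   # hypothetical centres; learned in practice
latent = np.array([-1.2, 0.1, 0.6, 2.0, -0.4, 0.2, 1.1, -2.0])
indices, quantized = quantize(latent, codebook)
print(indices)     # [0 2 2 3 1 2 3 0]
print(quantized)   # [-1.5  0.5  0.5  1.5 -0.5  0.5  1.5 -1.5]
```

Only the indices need to be transmitted; with four levels each index costs 2 bits under a fixed-length code.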
Regarding datasets, for the first two experiments, we use the LibriVox audiobook recording of Agnes Grey, a studio-quality single-speaker dataset, with 2.3 hours for training and 13 minutes for testing. For the last experiment with WaveNet, we use the multi-speaker WSJ1 dataset, which has 66 hours from 200 speakers for training and 2.2 hours from 10 speakers for testing. The train/test split has disjoint sets of speakers and utterances with an even gender distribution. Wide-band (16KHz sampling rate) audio is used for both datasets. Each data sample x is the spectrogram of a speech clip, with $x_t$ representing a single frame of the spectrogram on a dB scale. The spectrograms are computed from a square-root Hanning windowed STFT with a window shift of 160 samples (10ms) and a window size (same as the FFT size) of 320 samples (20ms), corresponding to a frame rate of 100Hz.
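The spectrogram parameters given above can be tried out with a short NumPy sketch. Several details are assumptions rather than facts from the paper: the symmetric window returned by `np.hanning`, the absence of padding at the clip edges, the `20 * log10` magnitude convention with a small `eps` floor, and the function name `db_spectrogram`.

```python
import numpy as np


def db_spectrogram(wave, win=320, hop=160, eps=1e-8):
    """Magnitude spectrogram in dB from a square-root Hann windowed STFT."""
    window = np.sqrt(np.hanning(win))
    n_frames = 1 + (len(wave) - win) // hop              # no padding at the edges
    frames = np.stack(
        [wave[i * hop : i * hop + win] * window for i in range(n_frames)]
    )
    spectrum = np.fft.rfft(frames, n=win, axis=1)        # (n_frames, win // 2 + 1)
    return 20.0 * np.log10(np.abs(spectrum) + eps)


sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
wave = np.sin(2 * np.pi * 500.0 * t)   # one second of a 500 Hz test tone
spec = db_spectrogram(wave)
print(spec.shape)                      # (99, 161)
```

With a 160-sample shift at 16 kHz the frames arrive every 10 ms, i.e. at 100 Hz; the one-second test clip yields 99 rather than 100 frames only because no edge padding is applied. Each frame has 161 frequency bins spaced 50 Hz apart.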
A Mel-scale mean squared error (Mel-scale MSE) is used as the reconstruction loss for training, where the MSE of each frequency bin is scaled according to its weight at Mel-frequency [19] to capture human perceptual sensitivity with respect to frequencies. Specifically, the weight on frequency f is defined as
3.1. Comparison of different recurrency schemes
In this experiment, we compare the performance of the recurrent autoencoding schemes listed in Fig. 1 with FRAE. We fix the bottleneck dimension to be 8 for all these autoencoder schemes. Given that each bottleneck dimension has 4 quantization levels (2 bits) and the frame rate of the spectrogram is 100Hz, the spectrogram is compressed at a fixed bitrate of 8 × 2 × 100 = 1600 bits per second, i.e. 1.6Kbps.
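The bitrate arithmetic can be checked with a small helper; the function name is hypothetical, and the bottleneck sizes in the loop are the ones used for FRAE in this paper's experiments.

```python
import math


def fixed_bitrate_bps(bottleneck_dim, levels_per_dim, frame_rate_hz):
    """Bits per second of a fixed-length code: dims x bits per dim x frames per second."""
    bits_per_frame = bottleneck_dim * math.log2(levels_per_dim)
    return bits_per_frame * frame_rate_hz


for dim in (8, 16, 32, 36):
    print(dim, fixed_bitrate_bps(dim, 4, 100) / 1000, "Kbps")
# 8 1.6 Kbps
# 16 3.2 Kbps
# 32 6.4 Kbps
# 36 7.2 Kbps
```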
Aside from the Mel-scale MSE, we also evaluate the models by converting the transcoded spectrogram back to the time domain with an inverse STFT and then computing the POLQA score [20], an objective perceptual metric of audio quality, with respect to the ground-truth waveform. The phase of the spectrogram is either taken from the original (genie phase) or computed by running the Griffin-Lim algorithm [14] for 100 iterations. The performance comparison is detailed in Table 1. For all three metrics, FRAE outperforms the rest by a significant margin, which demonstrates the effectiveness of the feedback recurrent design.
It is worth mentioning that even though the output feedback scheme in Fig. 1(f) achieves the second-best performance, in practice we find it prone to divergence during training and thus hard to optimize. This may be attributed to the fact that it increases the depth of the computation graph after RNN unrolling without a proper gating mechanism (such as that in GRU or LSTM) to alleviate the gradient explosion problem. In contrast, the FRAE scheme always leads to stable training.
3.2. Comparison of different prior models of FR-VAE
Next, we train FR-VAE with three variants of prior models: one that is conditioned on $h_t$, as illustrated in Fig. 3; one that is conditioned on $z_t$; and a time-invariant model without any conditioning. A simple MLP is used for the first two. The rate-distortion trade-offs are shown in Fig. 4, with the distortion of each model represented by the average POLQA score of the waveforms generated using the autoencoded spectrogram together with the original phase.
Fig. 4: Rate-distortion of different prior models. Distortion is represented by the average POLQA score of the test set after converting the autoencoded spectrogram to a waveform using the original phase.
The three prior models are trained with a bottleneck size of 48 while sweeping $\beta$ with a step size of 0.001, optimized for Eq. (1). The results are compared with FRAE (fixed, uniform prior) with bottleneck sizes of 8, 16, 32, and 36, corresponding to fixed bitrates of 1.6, 3.2, 6.4, and 7.2 Kbps, respectively. From the results we can see that the prior $p(z_{t+1} \mid h_t)$ proposed in Fig. 3 achieves the best rate-distortion trade-off among the four, with a maximum rate reduction of around 1.5 Kbps compared with FRAE.
3.3. Waveform generation using WaveNet as phase model
In this experiment, we pair FRAE with a WaveNet model to generate speech waveform. Four FRAE models are trained with the same configurations as the previous experiment (1.6, 3.2, 6.4, 7.2 Kbps). We then freeze the FRAE models and train four separate WaveNets, each conditioned on the autoencoded spectrogram from one of the FRAE models. Speaker identity is not used.
Fig. 5: POLQA score vs bitrate for FRAE+WaveNet, trained on WSJ1 and evaluated on WSJ1 test set, against Opus.
Fig. 6: POLQA score vs bitrate for FRAE+WaveNet, trained on LibriVox-Agnes-Grey and evaluated on LibriVox-Agnes-Grey test set, against Opus.
The experiment is conducted separately on the multi-speaker WSJ1 dataset and the LibriVox-Agnes-Grey audiobook dataset, with the average POLQA scores shown in Fig. 5 and Fig. 6, respectively, and compared against Opus [21], a well-known open-source speech codec. It can be seen that FRAE+WaveNet achieves a significant bitrate reduction compared with Opus, and the gap widens as the bitrate decreases.
In this work we presented a new recurrent autoencoder scheme in the context of lossy online compression of temporally correlated data, and demonstrated its effectiveness on the speech spectrogram compression task. We showed that high-quality waveforms can be generated at a low bitrate when it is used together with WaveNet, and that the bitrate can be reduced further by adding a prior model on the latent codes. An interesting future direction is to modify the FRAE architecture by allowing multiple update rates for different parts of the latent codes, capturing correlation information at different time-scales, to achieve a further reduction in bitrate.
[1] Diederik P. Kingma and Max Welling, “Auto-Encoding Variational Bayes,” arXiv:1312.6114, 2013.
[2] Fabian Mentzer, Eirikur Agustsson, Michael Tschannen, Radu Timofte, and Luc Van Gool, “Conditional Probability Models for Deep Image Compression,” The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[3] Otto Fabius and Joost R. van Amersfoort, “Variational Recurrent Auto-Encoders,” The International Conference on Learning Representations (ICLR) workshop, 2015.
[4] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, “Sequence to Sequence Learning with Neural Networks,” The Conference and Workshop on Neural Information Processing Systems (NeurIPS), 2014.
[5] Yingzhen Li and Stephan Mandt, “Disentangled Sequential Autoencoder,” International Conference on Machine Learning (ICML), 2018.
[6] Amirhossein Habibian, Ties van Rozendaal, Jakub M. Tomczak, and Taco S. Cohen, “Video Compression With Rate-Distortion Autoencoders,” International Conference on Computer Vision (ICCV), 2019.
[7] Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu, “Neural Discrete Representation Learning,” The Conference and Workshop on Neural Information Processing Systems (NeurIPS), 2017.
[8] Cristina Gârbacea, Aäron van den Oord, Yazhe Li, Felicia S. C. Lim, Alejandro Luebs, Oriol Vinyals, and Thomas C. Walters, “Low Bit-rate Speech Coding with VQ-VAE and a WaveNet Decoder,” ICASSP, 2019.
[9] Jean-Marc Valin and Jan Skoglund, “A Real-Time Wideband Neural Vocoder at 1.6 kb/s Using LPCNet,” INTERSPEECH, 2019.
[10] Xin Wang, Shinji Takaki, and Junichi Yamagishi, “Neural Source-Filter-based Waveform Model for Statistical Parametric Speech Synthesis,” ICASSP, 2019.
[11] Ryan Prenger, Rafael Valle, and Bryan Catanzaro, “WaveGlow: A Flow-based Generative Network for Speech Synthesis,” ICASSP, 2019.
[12] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention Is All You Need,” The Conference and Workshop on Neural Information Processing Systems (NeurIPS), 2017.
[13] Guo Lu, Wanli Ouyang, Dong Xu, Xiaoyun Zhang, Chunlei Cai, and Zhiyong Gao, “DVC: An End-to-end Deep Video Compression Framework,” The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[14] Daniel W. Griffin and Jae S. Lim, “Signal Estimation from Modified Short-Time Fourier Transform,” IEEE Transactions on Acoustics, Speech, and Signal Processing, 1984.
[15] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra, “DRAW: A Recurrent Neural Network For Image Generation,” International Conference on Machine Learning (ICML), 2015.
[16] Karol Gregor, Frederic Besse, Danilo Jimenez Rezende, Ivo Danihelka, and Daan Wierstra, “Towards Conceptual Compression,” The Conference and Workshop on Neural Information Processing Systems (NeurIPS), 2016.
[17] Partha Ghosh, Mehdi S. M. Sajjadi, Antonio Vergari, Michael Black, and Bernhard Schölkopf, “From Variational to Deterministic Autoencoders,” arXiv:1903.12436, 2019.
[18] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu, “WaveNet: A Generative Model for Raw Audio,” arXiv:1609.03499, 2016.
[19] Malcolm Slaney, “Auditory Toolbox: A MATLAB Toolbox for Auditory Modeling Work,” Technical Report, version 2, Interval Research Corporation, 1998.
[20] ITU, “Perceptual objective listening quality assessment,” ITU-T P.863, 2011.
[21] “Opus interactive audio codec,” https://opus-codec.org/.