On the Encoder-Decoder Incompatibility in Variational Text Modeling and Beyond

2020·Arxiv

Abstract

Variational autoencoders (VAEs) combine latent variables with amortized variational inference, whose optimization usually converges into a trivial local optimum termed posterior collapse, especially in text modeling. By tracking the optimization dynamics, we observe the encoder-decoder incompatibility that leads to poor parameterizations of the data manifold. We argue that the trivial local optimum may be avoided by improving the encoder and decoder parameterizations since the posterior network is part of a transition map between them. To this end, we propose Coupled-VAE, which couples a VAE model with a deterministic autoencoder with the same structure and improves the encoder and decoder parameterizations via encoder weight sharing and decoder signal matching. We apply the proposed Coupled-VAE approach to various VAE models with different regularizations, posterior families, decoder structures, and optimization strategies. Experiments on benchmark datasets (i.e., PTB, Yelp, and Yahoo) show consistently improved results in terms of probability estimation and richness of the latent space. We also generalize our method to conditional language modeling and propose Coupled-CVAE, which largely improves the diversity of dialogue generation on the Switchboard dataset.

1 Introduction

The variational autoencoder (VAE) (Kingma and Welling, 2014) is a generative model that combines neural latent variables and amortized variational inference, which is efficient in estimating and sampling from the data distribution. It infers a posterior distribution for each instance with a shared inference network and optimizes the evidence lower bound (ELBO) instead of the intractable marginal log-likelihood. Given its potential to learn representations from massive text data, there has been much interest in using VAE for text modeling (Zhao et al., 2017; Xu and Durrett, 2018; He et al., 2019).

Prior work has observed that the optimization of VAE suffers from the posterior collapse problem, i.e., the posterior becomes nearly identical to the prior and the decoder degenerates into a standard language model (Bowman et al., 2016; Zhao et al., 2017). A widely mentioned explanation is that a strong decoder makes the collapsed posterior a good local optimum of ELBO, and existing solutions include weakened decoders (Yang et al., 2017; Semeniuta et al., 2017), modified regularization terms (Higgins et al., 2017; Wang and Wang, 2019), alternative posterior families (Rezende and Mohamed, 2015; Davidson et al., 2018), richer prior distributions (Tomczak and Welling, 2018), improved optimization strategies (He et al., 2019), and narrowed amortization gaps (Kim et al., 2018).

In this paper, we provide a novel perspective for the posterior collapse problem. By comparing the optimization dynamics of VAE with deterministic autoencoders (DAE), we observe the incompatibility between a poorly optimized encoder and a decoder with too strong expressiveness. From the perspective of differential geometry, we show that this issue indicates poor chart maps from the data manifold to the parameterizations, which makes it difficult to learn a transition map between them. Since the posterior network is a part of the transition map, we argue that the posterior collapse would be mitigated with better parameterizations.

To this end, we propose the Coupled-VAE approach, which couples the VAE model with a deterministic network with the same structure. For better encoder parameterization, we share the encoder weights between the coupled networks. For better decoder parameterization, we propose a signal matching loss that pushes the stochastic decoding signals to the deterministic ones. Notably, our approach is model-agnostic since it does not make any assumption on the regularization term, the posterior family, the decoder architecture, or the optimization strategy. Experiments on PTB, Yelp, and Yahoo show that our method consistently improves the performance of various VAE models in terms of probability estimation and the richness of the latent space. The generalization to conditional modeling, i.e., Coupled-CVAE, largely improves the diversity of dialogue generation on the Switchboard dataset. Our contributions are as follows:

• We observe the encoder-decoder incompatibility in VAE and connect it to the posterior collapse problem.

• We propose the Coupled-VAE, which helps the encoder and the decoder to learn better parameterizations of the data manifold with a coupled deterministic network, via encoder weight sharing and decoder signal matching.

• Experiments on PTB, Yelp, and Yahoo show that our approach improves the performance of various VAE models in terms of probability estimation and richness of the latent space. We also generalize Coupled-VAE to conditional modeling and propose Coupled-CVAE, which largely improves the diversity of dialogue generation on the Switchboard dataset.

2 Background

2.1 Variational Inference for Text Modeling

The generative process of VAE is first to sample a latent code z from the prior distribution P(z) and then to sample the data x from P(x|z) (Kingma and Ba, 2015). Since the exact marginalization of the log-likelihood is intractable, a variational family of posterior distributions Q(z|x) is adopted to derive the evidence lower bound (ELBO), i.e.,

$$\log P(x) \geq \mathbb{E}_{z \sim Q(z|x)}\left[\log P(x|z)\right] - \mathrm{KL}\left(Q(z|x) \,\|\, P(z)\right). \quad (1)$$

For training, as shown in Figure 1(a), the encoded text e is transformed into its posterior via a posterior network. A latent code is sampled and mapped to the decoding signal h. Finally, the decoder infers the input with the decoding signal. The objective can be viewed as a reconstruction loss $\mathcal{L}_{rec}$ plus a regularization loss $\mathcal{L}_{reg}$ (whose form varies), i.e.,

$$\mathcal{L} = \mathcal{L}_{rec} + \mathcal{L}_{reg}. \quad (2)$$
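As a concrete illustration of the two terms in this objective, the following minimal sketch assumes a diagonal Gaussian posterior and a standard normal prior, for which the regularization term has a closed form. The function names and the `kl_weight` argument are illustrative and not taken from the paper; the reconstruction term is passed in as a precomputed number because it depends on the decoder.

```python
import math
import random


def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def vae_loss(rec_nll, mu, log_var, kl_weight=1.0):
    """Reconstruction loss plus (optionally weighted) regularization loss."""
    return rec_nll + kl_weight * gaussian_kl(mu, log_var)


if __name__ == "__main__":
    rng = random.Random(0)
    mu, log_var = [0.3, -0.1], [-0.5, 0.2]
    z = reparameterize(mu, log_var, rng)
    print("sampled z:", z)
    print("KL term:", gaussian_kl(mu, log_var))
    # A fully collapsed posterior (mu = 0, sigma = 1) has zero KL.
    print("collapsed KL:", gaussian_kl([0.0, 0.0], [0.0, 0.0]))
```

In this toy setting, posterior collapse corresponds to the KL term approaching zero for every input.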

Figure 1: VAE and DAE for text modeling.

However, the optimization of the VAE objective is challenging. We usually observe a very small regularization loss $\mathcal{L}_{reg}$ and a reconstruction loss $\mathcal{L}_{rec}$ similar to that of a standard language model, i.e., the well-known posterior collapse problem.

2.2 Deterministic Autoencoders

An older family of autoencoders is the deterministic autoencoder (DAE) (Rumelhart et al., 1986; Ballard, 1987). Figure 1(b) shows an overview of DAE for text modeling, which is composed of a text encoder, an optional MLP, and a text decoder. The reconstruction loss of DAE is usually much lower than that of VAE after convergence.

3 Encoder-Decoder Incompatibility in VAE for Text Modeling

To understand the posterior collapse problem, we take a deeper look into the training dynamics of VAE. We investigate the following questions. How much backpropagated gradient does the encoder receive from reconstruction? How much does it receive from regularization? How much information does the decoder receive from the encoded text?

3.1 Tracking Training Dynamics

To answer the first question, we study the gradient norm of the reconstruction loss w.r.t. the encoded text, i.e., $\lVert \partial \mathcal{L}_{rec} / \partial e \rVert$, which shows the magnitude of gradients received by the encoder parameters. From Figure 2(a), we observe that it constantly increases in DAE, while in VAE it increases marginally in the early stage and then decreases continuously. It shows that the reconstruction loss actively optimizes the DAE encoder, while the VAE encoder lacks backpropagated gradients after the early stage of training.

Figure 2: Training dynamics of DAE, VAE, and the proposed Coupled-VAE on the Yelp test set. Please find the analysis in Section 3 and Section 5.7. Best viewed in color (yet the models are distinguished by line markers).

We seek the answer to the second question by studying the gradient norm of the regularization loss w.r.t. the encoded text, i.e., $\lVert \partial \mathcal{L}_{reg} / \partial e \rVert$. For a totally collapsed posterior, i.e., Q(z|x) = P(z) for each x, this gradient norm would be zero. Thus, it can show how far the posterior of each instance is from the aggregate posterior or the prior. Figure 2(b) shows a constant decrease of the gradient norm in VAE from the 2.5K step until convergence, which shows that the posterior collapse is aggravated as the KL weight increases.

For the third question, we compute the normalized gradient norm of the decoding signal w.r.t. the encoded text, i.e., $\lVert \partial h / \partial e \rVert / \lVert h \rVert$. As this term shows how much the decoding signal changes, in relative terms, under a perturbation of the encoded text, it reflects the amount of information passed from the encoder to the decoder. Figure 2(c) shows that for DAE, it constantly increases. For VAE, it at first increases even faster than DAE, slows down, and finally decreases until convergence, indicating that the VAE decoder, to some extent, ignores the encoder in the late stage of training.
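To make the third quantity concrete, the sketch below estimates the Jacobian of a decoding signal h with respect to an encoded text e by central finite differences and divides its norm by the norm of h. It is only an illustration under stated assumptions: the toy map `toy_signal` stands in for the posterior network and MLP, the Frobenius norm and the Euclidean norm are assumed choices, and a real model would use automatic differentiation instead of finite differences.

```python
import math


def toy_signal(e, weights):
    """Toy decoding signal h = tanh(W e); a stand-in for the real network."""
    return [math.tanh(sum(w * x for w, x in zip(row, e))) for row in weights]


def jacobian_fd(fn, e, eps=1e-6):
    """Central finite-difference Jacobian dh/de (one row per dimension of h)."""
    h0 = fn(e)
    jac = [[0.0] * len(e) for _ in h0]
    for j in range(len(e)):
        plus = list(e)
        minus = list(e)
        plus[j] += eps
        minus[j] -= eps
        h_plus, h_minus = fn(plus), fn(minus)
        for i in range(len(h0)):
            jac[i][j] = (h_plus[i] - h_minus[i]) / (2.0 * eps)
    return jac


def normalized_grad_norm(fn, e):
    """Frobenius norm of dh/de divided by the Euclidean norm of h."""
    h = fn(e)
    h_norm = math.sqrt(sum(v * v for v in h))
    if h_norm == 0.0:
        raise ValueError("decoding signal has zero norm")
    jac = jacobian_fd(fn, e)
    frob = math.sqrt(sum(v * v for row in jac for v in row))
    return frob / h_norm


if __name__ == "__main__":
    weights = [[1.0, 0.0], [0.5, -0.5]]
    print(normalized_grad_norm(lambda e: toy_signal(e, weights), [0.5, 0.2]))
    # A signal that ignores e has a zero Jacobian, hence a zero normalized norm.
    print(normalized_grad_norm(lambda e: [1.0, 2.0], [0.5, 0.2]))
```

A value near zero means the decoding signal barely reacts to the encoded text, which is the situation the paragraph above describes for VAE late in training.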

3.2 Encoder-Decoder Incompatibility

Based on the training dynamics in Section 3.1 and the observations in previous work (Bowman et al., 2016; Zhao et al., 2017), text VAE has three features, listed as follows. First, the encoder is poorly optimized, as shown by the low gradient norm of the reconstruction loss w.r.t. the encoded text. Second, the decoder degenerates into a powerful language model. Third, h contains less information from e in VAE than in DAE, which is indicated by the lower normalized gradient norm of h w.r.t. e. We call these features the encoder-decoder incompatibility.

To bridge the incompatibility and posterior collapse, we start with the manifold hypothesis, which states that real-world data concentrates near a manifold with a lower dimensionality than the ambient space (Narayanan and Mitter, 2010; Bengio et al., 2013). In our case, we denote the manifold of text data as $X \subset \bigcup_{i \in \mathbb{N}} V^i$, where $V$ is the vocabulary. In the language of differential geometry, the encoded text $e \in E$ and the decoding signal $h \in H$ can be viewed as the parameterizations (or coordinates) of $X$ under two different charts (or coordinate systems). Formally, we denote the chart maps as $\varphi_e : X \to E$ and $\varphi_h : X \to H$, which satisfy $e = \varphi_e(x)$ and $h = \varphi_h(x)$ for any $x \in X$. Given the two charts, the map $\varphi_h \circ \varphi_e^{-1}$ from $E$ to $H$ is called the transition map between the two charts.

In DAE, the two chart maps and the transition map between them are learned simultaneously via the single reconstruction loss, which we rewrite as

$$\mathcal{L}_{rec} = \mathbb{E}_{x \in X}\left[\ell\left(x,\ \varphi_h^{-1}\left(\varphi_h \circ \varphi_e^{-1}\left(\varphi_e(x)\right)\right)\right)\right], \quad (3)$$

where $\ell$ denotes the per-instance reconstruction loss, and $\varphi_e$, $\varphi_h \circ \varphi_e^{-1}$, and $\varphi_h^{-1}$ are modeled as the encoder, the MLP, and the decoder (strictly speaking, in text modeling, the range of $\varphi_h^{-1}$ is not $X$ but distributions on $X$), as illustrated in Figure 3.

In VAE, as discussed before, both chart maps inadequately parameterize the data manifold. We argue that the inadequate parameterizations make it harder to find a smooth transition map in VAE than in DAE, as shown by the lower normalized gradient norm of h w.r.t. e. Since the posterior network is a part of the transition map, it consequently seeks to map each instance to the prior (discussed in Section 3.1) rather than learning the transition map.

Figure 3: Left: DAE and VAE interpreted as manifold parameterizations and a transition map. Right: A graphical overview of the proposed Coupled-VAE. The upper path is deterministic, and the lower path is stochastic.

4 Coupling Variational and Deterministic Networks

Based on the above analysis, we argue that posterior collapse could be alleviated by learning chart maps (i.e., the encoder and decoder parameterizations) that better parameterize the data manifold. Inspired by the chart maps in DAE, we propose to couple the VAE model with a deterministic network, outlined in Figure 3. Modules with a subscript c are deterministic networks that share the structure with those in the stochastic network. Sampling is disabled in the deterministic network, e.g., in the case of Gaussian posterior, we use the predicted mean vector for later computation. Please find details for other posterior families in Appendix B. Similar to DAE, the coupled deterministic network is optimized solely by the coupled reconstruction loss $\mathcal{L}^c_{rec}$, which is the same autoregressive cross-entropy loss as the VAE reconstruction loss $\mathcal{L}_{rec}$.

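To learn a well-optimized encoder chart map, we share the encoder between the stochastic and the deterministic networks, which leverages the rich gradients backpropagated from the coupled reconstruction loss $\mathcal{L}^c_{rec}$. To learn a better decoder chart map, we propose to guide it with a well-learned chart map, i.e., the one characterized by the coupled decoder. Thus, we introduce a signal matching loss $\mathcal{L}_{match}$ that pushes the stochastic decoding signal $h$ to the deterministic one $h_c$. The objective of our approach is

$$\mathcal{L} = \mathcal{L}_{rec} + \mathcal{L}_{reg} + \lambda_r \mathcal{L}^c_{rec} + \lambda_m \mathcal{L}_{match}, \quad (4)$$

where $\mathcal{L}_{rec}$ and $\mathcal{L}_{reg}$ are the reconstruction and regularization losses of VAE, $\lambda_r$ and $\lambda_m$ are hyperparameters, $\mathcal{L}^c_{rec}$ is the coupled reconstruction loss, and the signal matching loss $\mathcal{L}_{match}$ is essentially a distance function between $h$ and $h_c$. We evaluate both the Euclidean distance and the Rational Quadratic kernel as this distance function. In both cases, the distance is computed between $h$ and $\mathrm{Detach}(h_c)$, where Detach prevents gradients from being propagated into $h_c$, since we would like $h_c$ to guide $h$ but not the opposite. The Rational Quadratic kernel has an additional hyperparameter.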
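The sketch below illustrates how a signal matching term and the combined objective could be computed for a single instance. It rests on several assumptions that are not confirmed by the text: the Euclidean variant is written as a plain (unsquared) distance, the Rational Quadratic variant uses a commonly seen multi-scale form with a hypothetical constant `c` and a hypothetical set of `scales`, and the names `lambda_r` and `lambda_m` are the weights of the coupled reconstruction loss and the matching loss. Plain Python has no automatic differentiation, so Detach is only indicated in comments: `h_c` is treated as a constant target.

```python
import math


def sq_dist(h, h_c):
    """Squared Euclidean distance between two decoding signals."""
    return sum((a - b) ** 2 for a, b in zip(h, h_c))


def euclidean_match(h, h_c):
    """Euclidean distance; h_c plays the role of Detach(h_c), i.e. a fixed target."""
    return math.sqrt(sq_dist(h, h_c))


def rq_match(h, h_c, c=1.0, scales=(0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0)):
    """Negative rational quadratic similarity summed over scales (assumed form).

    Each term equals -1 when h == h_c and tends to 0 as the signals move apart,
    so minimizing it pulls h towards h_c.
    """
    d2 = sq_dist(h, h_c)
    return sum(-(s * c) / (s * c + d2) for s in scales)


def coupled_objective(l_rec, l_reg, l_rec_c, l_match, lambda_r=1.0, lambda_m=0.1):
    """VAE losses plus weighted coupled reconstruction and signal matching losses."""
    return l_rec + l_reg + lambda_r * l_rec_c + lambda_m * l_match


if __name__ == "__main__":
    h, h_c = [0.2, -0.4, 1.0], [0.0, -0.5, 0.8]
    print("Euclidean:", euclidean_match(h, h_c))
    print("RQ:", rq_match(h, h_c))
    print("objective:", coupled_objective(80.0, 2.5, 60.0, euclidean_match(h, h_c)))
```

The Rational Quadratic form is bounded, whereas the Euclidean distance grows without limit as the signals diverge, which is one plausible reason the two may react differently to the matching weight.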

One would question the necessity of sharing the structure of the posterior network by resorting to universal approximation (Hornik et al., 1989). Specifically, a common question is: why not use an MLP as the coupled posterior network? We argue that each structure has a favored distribution of H in its ambient space, so structure sharing facilitates the optimization when we are learning by gradient descent. For example, the latent space learned by planar flows (Rezende and Mohamed, 2015) has compression and expansion, and vMF-VAE (Xu and Durrett, 2018), which is supported on a sphere, may significantly influence the distribution of H in its ambient space.

5 Experiments

5.1 Datasets

We conduct the experiments on three commonly used datasets for text modeling, i.e., the Penn Treebank (PTB) (Marcus et al., 1993), Yelp (Xu et al., 2016), and Yahoo. The training/validation/test splits are 42K/3370/3761 for PTB, 63K/7773/8671 for Yelp, and 100K/10K/10K for Yahoo. The vocabulary size for PTB/Yelp/Yahoo is 10K/15K/20K. We discard the sentiment labels in Yelp.

5.2 Baselines

We evaluate the proposed Coupled-VAE approach by applying it to various VAE models, which include VAE (Kingma and Welling, 2014), β-VAE (Higgins et al., 2017), vMF-VAE (Xu and Durrett, 2018; Davidson et al., 2018) with learnable κ, CNN-VAE (Yang et al., 2017), WAE (Tolstikhin et al., 2018), VAE with normalizing flows (VAE-NF) (Rezende and Mohamed, 2015), WAE with normalizing flows (WAE-NF), VAE with cyclic annealing schedule (CycAnn-VAE) (Fu et al., 2019), VAE with encoder pretraining and the free bits objective (PreFB-VAE) (Li et al., 2019), and LaggingVAE (He et al., 2019). We also show the result of GRU-LM (Cho et al., 2014) and SA-VAE (Kim et al., 2018). We do not apply our method to SA-VAE since it does not follow amortized variational inference. Please find more details in Appendix C and previous footnotes.

Table 1: Language modeling results. NLL is estimated with importance sampling. PPL is based on the estimated NLL. KL and MI are approximated by their Monte Carlo estimates. Coupled- stands for “with the coupled deterministic network”. The better results in each block are shown in bold. *The exact NLL is reported. Modifying the open-source implementation, which does not follow our setup and evaluation. Previously reported.

5.3 Language Modeling Results

We report negative log-likelihood (NLL), KL divergence, and perplexity as the metrics for language modeling. NLL is estimated with importance sampling, KL is approximated by its Monte Carlo estimate, and perplexity is computed based on NLL. Please find the metric details in Appendix D.
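As a rough illustration of these metrics, the sketch below computes an importance-sampled NLL estimate for one text from K latent samples drawn from the posterior, and a perplexity from the total NLL and the token count. The exact procedure is in Appendix D, which is not reproduced here, so the estimator shown is the standard importance-weighted one and should be read as an assumption; the function and argument names are illustrative.

```python
import math


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def importance_sampled_nll(log_joint, log_posterior):
    """Estimate -log P(x) as -log (1/K) sum_k P(x, z_k) / Q(z_k | x), z_k ~ Q(z | x).

    log_joint[k]     = log P(x | z_k) + log P(z_k)
    log_posterior[k] = log Q(z_k | x)
    """
    log_weights = [lj - lq for lj, lq in zip(log_joint, log_posterior)]
    return -log_mean_exp(log_weights)


def perplexity(total_nll, total_tokens):
    """Perplexity from the summed NLL (in nats) and the number of tokens."""
    return math.exp(total_nll / total_tokens)


if __name__ == "__main__":
    log_joint = [-42.0, -40.5, -41.2]
    log_posterior = [-1.3, -0.9, -1.1]
    nll = importance_sampled_nll(log_joint, log_posterior)
    print("estimated NLL:", nll)
    print("perplexity for a 12-token text:", perplexity(nll, 12))
```

By Jensen's inequality, the importance-sampled estimate is never larger than the average of the per-sample negative log weights, so it is at least as tight as a single-sample ELBO-based estimate computed from the same samples.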

Table 1 displays the language modeling results. For all models, our proposed approach achieves smaller negative log-likelihood and lower perplexity, which shows the effectiveness of our method to improve the probability estimation capability of various VAE models. Larger KL divergence is also observed, showing that our approach helps address the posterior collapse problem.

5.4 Mutual Information and Reconstruction

Table 2: Mutual information (MI) and reconstruction. Modifying the open-source implementation.

Language modeling results only evaluate the probability estimation ability of VAE. We are also interested in how rich the latent space is. We report the mutual information (MI) between the text x and the latent code z under Q(z|x), which is approximated with Monte Carlo estimation. Better reconstruction from the encoded text is another way to show the richness of the latent space. For each text x, we sample ten latent codes from Q(z|x) and decode them with greedy search. We report the BLEU-1 and BLEU-2 scores between the reconstruction and the input. Please find the metric details in Appendix E. In Table 2, we observe that our approach improves MI on all datasets, showing that our approach helps learn a richer latent space. BLEU-1 and BLEU-2 are consistently improved on Yelp and Yahoo, but not on PTB. Given that text samples in PTB are significantly shorter than those in Yelp and Yahoo, we conjecture that it is easier for the decoder to reconstruct on PTB by exploiting its autoregressive expressiveness, even without a rich latent space.
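A Monte Carlo estimate of the mutual information under Q(z|x) can be sketched as follows. The exact estimator is described in Appendix E, which is not reproduced here, so this is an assumed, simplified version: the latent code is one-dimensional, each posterior is Gaussian, and the aggregate posterior is approximated by the uniform mixture of the given posteriors. All names are illustrative.

```python
import math
import random


def log_normal(z, mu, sigma):
    """Log density of N(mu, sigma^2) at z."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * ((z - mu) / sigma) ** 2


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def mutual_information(posteriors, samples_per_x=50, seed=0):
    """Estimate E_x E_{z ~ Q(z|x)} [ log Q(z|x) - log Q(z) ].

    posteriors: list of (mu, sigma) pairs, one Gaussian posterior per text.
    Q(z) is approximated by the uniform mixture of these posteriors.
    """
    rng = random.Random(seed)
    total, count = 0.0, 0
    for mu, sigma in posteriors:
        for _ in range(samples_per_x):
            z = rng.gauss(mu, sigma)
            log_q_cond = log_normal(z, mu, sigma)
            log_q_agg = log_mean_exp([log_normal(z, m, s) for m, s in posteriors])
            total += log_q_cond - log_q_agg
            count += 1
    return total / count


if __name__ == "__main__":
    # Identical posteriors (collapse) carry no information about x.
    print(mutual_information([(0.0, 1.0)] * 4))
    # Two well-separated posteriors carry about log(2) nats.
    print(mutual_information([(-10.0, 1.0), (10.0, 1.0)]), math.log(2.0))
```

In this toy setting the estimate is bounded by the logarithm of the number of posteriors, so a collapsed model sits at the bottom of the range and fully separated posteriors sit at the top.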

5.5 Hyperparameter Analysis: Distance Function, λr, and λm

We investigate the effect of key hyperparameters. Results are shown in Table 3. Note that the lowest NLL does not guarantee the best other metrics, which shows the necessity to use multiple metrics for a more comprehensive evaluation.

For the distance function, we observe that the Euclidean distance (denoted as Eucl in Table 3) is more sensitive to $\lambda_m$ than the Rational Quadratic kernel (denoted as RQ in Table 3).

The first and the third block in Table 3 show that, with larger $\lambda_m$, the model achieves higher KL divergence, MI, and reconstruction metrics. Our interpretation is that by pushing the stochastic decoding signals closer to the deterministic ones, we get latent codes with richer text information. We leave the analysis of its effect on probability estimation to Section 5.6.

The second block in Table 3 shows the role of $\lambda_r$, which we interpret as follows. When $\lambda_r$ is too small (e.g., 0.5), the learned parameterizations are still inadequate for a smooth transition map; when $\lambda_r$ is too large (e.g., 5.0), it distracts the optimization too far away from the original objective (i.e., the VAE objective $\mathcal{L}_{rec} + \mathcal{L}_{reg}$). Note that $\lambda_r = 0$ is equivalent to removing the coupled reconstruction loss.

5.6 The Heterogeneous Effect of Signal Matching on Probability Estimation

In Section 5.5 we observe a richer latent space (i.e., larger MI and BLEU scores) with larger $\lambda_m$. However, a richer latent space does not guarantee a better probability estimation result. Thus, in this part, we delve deeper into whether the decoder signal matching mechanism helps improve probability estimation. We study three models of different posterior families (i.e., Coupled-VAE, Coupled-VAE-NF, and Coupled-vMF-VAE). Results are shown in Table 4, where we do not report the KL, MI, and BLEU scores because they have been shown to be improved with larger $\lambda_m$ in Table 3. We observe that the effects of signal matching on probability estimation vary in different posterior families.

Table 3: Hyperparameter analysis. The best results in each block are shown in bold. *Reported in Table 1 and 2.

Table 4: The effect of signal matching on probability estimation. * Reported in Table 1.

5.7 Is the Incompatibility Mitigated?

We study the three gradient norms defined in Section 3 on the test sets, displayed in Table 5 (for Coupled-VAE, under a fixed hyperparameter setting). Notably, the gradient norm of the reconstruction loss w.r.t. the encoded text in Coupled-VAE is even larger than that in DAE. It has two indications. First, the encoder indeed encodes rich information of the text. Second, compared with DAE, Coupled-VAE better generalizes to the test sets, which we conjecture is due to the regularization on the posterior. Coupled-VAE also has a larger gradient norm of the regularization loss compared with VAE, which, based on the argument in Section 3.1, indicates that, in Coupled-VAE, the posterior of each instance is not similar to the prior. We also observe a larger normalized gradient norm of the decoding signal in Coupled-VAE, which indicates a better transition map between the two parameterizations in Coupled-VAE than in VAE.

To show how Coupled-VAE ameliorates the training dynamics, we also track the gradient norms of Coupled-VAE (for a clearer comparison), plotted along with VAE and DAE in Figure 2. The curve for Coupled-VAE in Figure 2(a) stands for the reconstruction gradient norm w.r.t. the encoded text. We observe that Coupled-VAE receives constantly increasing backpropagated gradients from the reconstruction. In contrast to VAE, the gradient norm of the regularization loss in Coupled-VAE does not decrease significantly as the KL weight increases. The decrease of the normalized gradient norm of the decoding signal, which VAE suffers from, is not observed in Coupled-VAE. Plots on more datasets are in Appendix F.

5.8 Sample Diversity

We evaluate the diversity of the samples from the prior distribution. We sample 3200 texts from the prior distribution and report the Distinct-1 and Distinct-2 metrics (Li et al., 2016), which are the ratios of distinct unigrams and bigrams over all generated unigrams and bigrams. Distinct-1 and Distinct-2 in Table 6 show that texts sampled from Coupled-VAE are more diverse than those from VAE. Given limited space, we put several samples in Appendix G for qualitative analysis.
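The Distinct-n ratio described here is simple to compute. The following sketch assumes that each sample is already tokenized into a list of strings and that the ratio is pooled over all samples; the function name is illustrative.

```python
def distinct_n(texts, n):
    """Ratio of distinct n-grams to all n-grams, pooled over all tokenized texts."""
    total = 0
    unique = set()
    for tokens in texts:
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


if __name__ == "__main__":
    samples = [
        "the food was great".split(),
        "the food was bad".split(),
        "service was slow".split(),
    ]
    print("Distinct-1:", distinct_n(samples, 1))
    print("Distinct-2:", distinct_n(samples, 2))
```

Multiplying the result by 100 gives values on a [0, 100] scale.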

5.9 Interpolation

Table 5: Gradient norms defined in Section 3.1 on each test set.

Table 6: Diversity of samples from the prior distribution. D- stands for Distinct-, normalized to [0, 100].

Figure 4: A graphical overview of the generalization to Coupled-CVAE. u is the condition, which is also encoded. The difference from Coupled-VAE is shown in red.

A property of VAE is to match the interpolation in the latent space with the smooth transition in the data space (Bowman et al., 2016). In Table 7, we show the interpolation of VAE and Coupled-VAE on PTB. It shows that compared with VAE, Coupled-VAE has smoother transitions of subjects (both sides → it) and verbs (are expected → have been → has been → has), indicating that the linguistic information is more smoothly encoded in the latent space of Coupled-VAE.
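Latent space interpolation of this kind is typically produced by decoding a sequence of latent codes lying between two endpoints. The sketch below assumes straight-line (linear) interpolation between two latent codes, which is a common choice but is not stated in the text; decoding each code is left to the model and is not shown.

```python
def interpolate(z_start, z_end, steps):
    """Return `steps` latent codes evenly spaced on the line from z_start to z_end."""
    if steps < 2:
        raise ValueError("steps must be at least 2 to include both endpoints")
    codes = []
    for i in range(steps):
        t = i / (steps - 1)
        codes.append([(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return codes


if __name__ == "__main__":
    for z in interpolate([0.0, 0.0], [1.0, 2.0], 5):
        print(z)
```

Each intermediate code would then be passed through the decoder (for example with greedy search) to obtain the sentences shown in an interpolation table.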

5.10 Generalization to Conditional Language Modeling: Coupled-CVAE

To generalize our approach to conditional language modeling, we propose Coupled-CVAE. A graphical overview is displayed in Figure 4. Specifically, the (coupled) posterior network and the (coupled) decoder are additionally conditioned. The objective of Coupled-CVAE is identical to Eq. (4).

We compare Coupled-CVAE with GRU encoder-decoder (Cho et al., 2014) and CVAE (Zhao et al., 2017) for dialogue generation. We use the Switchboard dataset (John and Holliman, 1993), whose training/validation/test splits are 203K/5K/5K, and the vocabulary size is 13K. For probability estimation, we report the NLL, KL, and PPL based on the gold responses. Since the key motivation of using CVAE in Zhao et al. (2017) is the diversity of responses, we sample one response for each post and report the Distinct-1 and Distinct-2 metrics over all test samples. Please find more details of this part in Appendix I.

Table 8 shows that Coupled-CVAE greatly increases the diversity of dialogue modeling, while it only slightly harms the probability estimation capability. It indicates that Coupled-CVAE better captures the one-to-many nature of conversations than CVAE and GRU encoder-decoder. We also observe that the diversity is improved with increasing $\lambda_m$, which shows that $\lambda_m$ can control diversity via specifying the richness of the latent space.

6 Relation to Related Work

Bowman et al. (2016) identify the posterior collapse problem of text VAE and propose KL annealing and word drop to handle the problem. Zhao et al. (2017) propose the bag-of-words loss to mitigate this issue. Later work on this problem focuses on less powerful decoders (Yang et al., 2017; Semeniuta et al., 2017), modified regularization objective (Higgins et al., 2017; Bahuleyan et al., 2019; Wang and Wang, 2019), alternative posterior families (Rezende and Mohamed, 2015; Xu and Durrett, 2018; Davidson et al., 2018; Xiao et al., 2018), richer prior distributions (Tomczak and Welling, 2018), improved optimization (He et al., 2019) or KL annealing strategy (Fu et al., 2019), the use of skip connections (Dieng et al., 2019), hierarchical or autoregressive posterior distributions (Park et al., 2018; Du et al., 2018), and narrowing the amortization gap (Hjelm et al., 2016; Kim et al., 2018; Marino et al., 2018). We provide the encoder-decoder incompatibility as a new perspective on the posterior collapse problem. Empirically, our approach can be combined with the above ones to alleviate the problem further.

Table 7: Latent space interpolation.

Table 8: Dialogue generation. D-1 and D-2 are normalized to [0, 100]. *The exact NLL is reported.

A model to be noted is β-VAE (Higgins et al., 2017), in which the reconstruction and regularization are modeled as a hyperparameterized trade-off, i.e., the improvement of one term compromises the other. Different from β-VAE, we adopt the idea of multi-task learning, i.e., the coupled reconstruction task helps improve the encoder chart map and the signal matching task helps improve the decoder chart map. Both our analysis in Section 3.2 and the empirical results show that the modeling of posterior distribution can be improved (but not necessarily compromised) with the additional tasks.

Ghosh et al. (2020) propose to substitute stochasticity with explicit and implicit regularizations, which is easier to train and empirically improves the quality of generated outputs. Different from their work, we still strictly follow the generative nature (i.e., data density estimation) of VAE, and the deterministic network in our approach serves as an auxiliary to aid the optimization.

Encoder pretraining (Li et al., 2019) initializes the text encoder and the posterior network with an autoencoding objective. Li et al. (2019) show that encoder pretraining itself does not improve the performance of VAE, which indicates that initialization is not strong enough as an inductive bias to learn a meaningful latent space.

Given the discrete nature of text data, we highlight the two-level representation learning for text modeling: 1) the encoder and decoder parameterizations via autoencoding and 2) a transition map between the parameterizations. Notably, the transition map has large freedom. In our case, the transition map decides the amount and type of information encoded in the variational posterior, and there are other possible instances of the transition map, e.g., flow-based models (Dinh et al., 2015).

7 Conclusions

In this paper, we observe the encoder-decoder incompatibility of VAE for text modeling. We bridge the incompatibility and the posterior collapse problem by viewing the encoder and the decoder as two inadequately learned chart maps from the data manifold to the parameterizations, and the posterior network as a part of the transition map between them. We couple the VAE model with a deterministic network and improve the parameterizations via encoder weight sharing and decoder signal matching. Our approach is model-agnostic and can be applied to a wide range of models in the VAE family. Experiments on benchmark datasets, i.e., PTB, Yelp, and Yahoo, show that our approach improves various VAE models in terms of probability estimation and the richness of the latent space. We also generalize Coupled-VAE to conditional language modeling and propose Coupled-CVAE. Results on Switchboard show that Coupled-CVAE largely improves diversity in dialogue generation.

Acknowledgments

We would like to thank the anonymous reviewers for their thorough and helpful comments.

References

Hareesh Bahuleyan, Lili Mou, Hao Zhou, and Olga Vechtomova. 2019. Stochastic wasserstein autoencoder for probabilistic sentence generation. In Proceedings of NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4068–4076.

Dana H. Ballard. 1987. Modular learning in neural networks. In Proceedings of the 6th National Conference on Artificial Intelligence. Seattle, WA, USA, July 1987., pages 279–284.

Yoshua Bengio, Aaron C. Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1798–1828.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, CoNLL 2016, Berlin, Germany, August 11-12, 2016, pages 10–21.

Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1724–1734.

Tim R. Davidson, Luca Falorsi, Nicola De Cao, Thomas Kipf, and Jakub M. Tomczak. 2018. Hyperspherical variational auto-encoders. In Proceedings of UAI 2018, Monterey, California, USA, August 6-10, 2018, pages 856–865.

Adji B. Dieng, Yoon Kim, Alexander M. Rush, and David M. Blei. 2019. Avoiding latent variable collapse with generative skip models. In AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, volume 89 of Proceedings of Machine Learning Research, pages 2397–2405.

Laurent Dinh, David Krueger, and Yoshua Bengio. 2015. NICE: non-linear independent components estimation. In ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Workshop Track Proceedings.

Jiachen Du, Wenjie Li, Yulan He, Ruifeng Xu, Lidong Bing, and Xuan Wang. 2018. Variational autoregressive decoder for neural response generation. In Proceedings of EMNLP 2018, Brussels, Belgium, October 31 - November 4, 2018, pages 3154–3163.

Hao Fu, Chunyuan Li, Xiaodong Liu, Jianfeng Gao, Asli Çelikyilmaz, and Lawrence Carin. 2019. Cyclical annealing schedule: A simple approach to mitigating KL vanishing. In Proceedings of NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 240–250.

Partha Ghosh, Mehdi S. M. Sajjadi, Antonio Vergari, Michael Black, and Bernhard Schölkopf. 2020. From variational to deterministic autoencoders. In ICLR 2020, Addis Ababa, Ethiopia, April 30, 2020.

Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander J. Smola. 2012. A kernel two-sample test. J. Mach. Learn. Res., 13:723–773.

Junxian He, Daniel Spokoyny, Graham Neubig, and Taylor Berg-Kirkpatrick. 2019. Lagging inference networks and posterior collapse in variational autoencoders. In ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.

Irina Higgins, Loïc Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. 2017. Learning basic visual concepts with a constrained variational framework. In ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.

R. Devon Hjelm, Ruslan Salakhutdinov, Kyunghyun Cho, Nebojsa Jojic, Vince D. Calhoun, and Junyoung Chung. 2016. Iterative refinement of the approximate posterior for directed belief networks. In NIPS 2016, December 5-10, 2016, Barcelona, Spain, pages 4691–4699.

Kurt Hornik, Maxwell B. Stinchcombe, and Halbert White. 1989. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366.

Godfrey John and Edward Holliman. 1993. Switchboard-1 release 2 ldc97s62. Web Download. Philadelphia: Linguistic Data Consortium.

Yoon Kim, Sam Wiseman, Andrew C. Miller, David Sontag, and Alexander M. Rush. 2018. Semi-amortized variational autoencoders. In ICML, volume 80 of Proceedings of Machine Learning Research, pages 2683–2692.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational bayes. In ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings.

Bohan Li, Junxian He, Graham Neubig, Taylor Berg-Kirkpatrick, and Yiming Yang. 2019. A surprisingly effective fix for deep latent variable modeling of text. In Proceedings of EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 3601–3612. Association for Computational Linguistics.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proceedings of NAACL-HLT 2016, San Diego California, USA, June 12-17, 2016, pages 110–119.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of english: The penn treebank. Computational Linguistics, 19(2):313–330.

Joseph Marino, Yisong Yue, and Stephan Mandt. 2018. Iterative amortized inference. In Proceedings of ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 3400–3409.

Hariharan Narayanan and Sanjoy K. Mitter. 2010. Sample complexity of testing the manifold hypothesis. In NIPS 2010, December 6-9, 2010, Vancouver, British Columbia, Canada, pages 1786–1794.

Yookoon Park, Jaemin Cho, and Gunhee Kim. 2018. A hierarchical latent structure for variational conversation modeling. In Proceedings of NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 1792–1801.

Danilo Jimenez Rezende and Shakir Mohamed. 2015. Variational inference with normalizing flows. In Proceedings of ICML 2015, Lille, France, 6-11 July 2015, pages 1530–1538.

D. E. Rumelhart, G. E. Hinton, and R. J. Williams. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, pages 318–362. MIT Press.

Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. 2017. A hybrid convolutional variational autoencoder for text generation. In Proceedings of EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pages 627–637.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958.

Ilya O. Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schölkopf. 2018. Wasserstein auto-encoders. In ICLR 2018.

Jakub M. Tomczak and Max Welling. 2018. VAE with a VampPrior. In AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, Canary Islands, Spain, volume 84 of Proceedings of Machine Learning Research, pages 1214–1223.

Prince Zizhuang Wang and William Yang Wang. 2019. Riemannian normalizing flow on variational Wasserstein autoencoder for text modeling. In Proceedings of NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 284–294.

Yijun Xiao, Tiancheng Zhao, and William Yang Wang. 2018. Dirichlet variational autoencoder for text modeling. CoRR, abs/1811.00135.

Jiacheng Xu, Danlu Chen, Xipeng Qiu, and Xuanjing Huang. 2016. Cached long short-term memory neural networks for document-level sentiment classification. In Proceedings of EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 1660–1669.

Jiacheng Xu and Greg Durrett. 2018. Spherical latent spaces for stable variational autoencoders. In Proceedings of EMNLP 2018, Brussels, Belgium, October 31 - November 4, 2018, pages 4503–4513.

Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor Berg-Kirkpatrick. 2017. Improved variational autoencoders for text modeling using dilated convolutions. In ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 3881–3890.

Tiancheng Zhao, Ran Zhao, and Maxine Eskénazi. 2017. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In Proceedings of ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 654–664.

Appendix A Notations

We first introduce the notations used in the following parts. Calligraphic letters (e.g., $\mathcal{Q}$) denote continuous distributions, and the corresponding lowercase letters (e.g., $q$) stand for their probability density functions. The probability of the text is represented as P.

B Deterministic Networks for Different Posterior Families

In this part, we detail the forward computation of the deterministic networks for different posterior families, including multivariate Gaussian, Gaussian with normalizing flows, and von Mises-Fisher.

B.1 Multivariate Gaussian

For the multivariate Gaussian, the coupled latent code is computed from the posterior distribution learned by the coupled deterministic network. In effect, the coupled latent code is the mean vector predicted by the coupled posterior network.
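The contrast between the stochastic and the coupled latent code can be illustrated with a minimal sketch. It assumes a diagonal Gaussian posterior parameterized by a mean and a log-variance; the function names `stochastic_code` and `coupled_code` and the toy values are hypothetical, not taken from the paper's implementation.

```python
import math
import random


def stochastic_code(mean, log_var, rng):
    """Reparameterized sample z = mean + std * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def coupled_code(mean):
    """Deterministic code: the predicted mean vector, with no sampling."""
    return list(mean)


rng = random.Random(0)
mean = [0.5, -1.0, 2.0]      # toy posterior mean (assumed values)
log_var = [0.0, -2.0, 1.0]   # toy posterior log-variance (assumed values)

z = stochastic_code(mean, log_var, rng)   # changes with every draw
z_c = coupled_code(mean)                  # identical on every call
print(z, z_c)
```

The deterministic path simply skips the noise term, so it needs no variance prediction at all.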

B.2 Gaussian with Normalizing Flows

We first review the background and notations of normalizing flows. An initial latent code is first sampled from an initial distribution, i.e., $z_0 \sim \mathcal{Q}_0(z_0|x)$. The normalizing flow is defined as a series of reversible transformations $f_1, \dots, f_K$, i.e.,

$$z_k = f_k(z_{k-1}),$$

where $k = 1, \dots, K$. The evidence lower bound (ELBO) for normalizing flows is derived as

$$\log P(x) \ge \mathbb{E}_{z_0 \sim \mathcal{Q}_0(z_0|x)}\Big[\log P(x|z_K) + \log p(z_K) - \log q_0(z_0|x) + \sum_{k=1}^{K} \log\Big|\det \frac{\partial f_k}{\partial z_{k-1}}\Big|\Big],$$

where $\mathcal{P}(z_K)$ is the prior distribution of the transformed latent variable and the reversibility of the transformations guarantees non-zero determinants. Optimizing this ELBO requires sampling from the initial distribution; thus, we compute the coupled latent code by applying the coupled transformations, in order, to the mean vector predicted by the coupled initial distribution.

Note that all modules in the deterministic network share their structure with those in the stochastic network. We do not use the posterior mean as the coupled latent code for two reasons. First, our interest is to acquire a deterministic representation that guides the stochastic network, but not necessarily the mean vector. Second, the computation of the posterior mean after the transformations is intractable.
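As an illustration of pushing a predicted mean vector through a flow, the sketch below uses planar flows (Rezende and Mohamed, 2015), the flow family named in the experimental setup. The parameter values and the names `planar_flow` and `coupled_code` are assumptions made for this example; in a trained model the flow parameters would be learned.

```python
import math


def planar_flow(z, u, w, b):
    """One planar transformation f(z) = z + u * tanh(w.z + b).

    Returns the transformed code and log|det(df/dz)|.
    """
    a = sum(wi * zi for wi, zi in zip(w, z)) + b
    t = math.tanh(a)
    z_new = [zi + ui * t for zi, ui in zip(z, u)]
    # Matrix determinant lemma: det = 1 + tanh'(a) * (u . w)
    det = 1.0 + (1.0 - t * t) * sum(ui * wi for ui, wi in zip(u, w))
    return z_new, math.log(abs(det))


def coupled_code(mean0, flows):
    """Apply the transformations in order to the initial mean (no sampling)."""
    z = list(mean0)
    for u, w, b in flows:
        z, _ = planar_flow(z, u, w, b)
    return z


# Three transformations with toy parameters (assumed values, with u.w > -1
# so that each planar transformation stays invertible).
flows = [
    ([0.5, -0.2], [0.3, 0.8], 0.1),
    ([-0.1, 0.4], [1.0, 0.2], -0.3),
    ([0.2, 0.2], [-0.5, 0.6], 0.0),
]
mean0 = [0.4, -0.7]
z_c = coupled_code(mean0, flows)
print(z_c)
```

The log-determinant is discarded on the deterministic path because it only enters the ELBO of the stochastic network.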

B.3 Von Mises-Fisher

The von Mises-Fisher distribution is supported on a $(d-1)$-dimensional sphere in $\mathbb{R}^d$ and parameterized by a direction parameter $\mu$ ($\|\mu\| = 1$) and a concentration parameter $\kappa$, both of which are mapped from the encoded text by the posterior network. The probability density function is

$$q(z|x) = \frac{\kappa^{d/2-1}}{(2\pi)^{d/2}\, I_{d/2-1}(\kappa)} \exp\big(\kappa\, \mu^\top z\big),$$

where $I_v$ is the modified Bessel function of the first kind at order $v$. We use the direction parameter $\mu$ as the coupled latent code. Note that we do not use the posterior mean as the coupled latent code for two reasons. First, similar to normalizing flows, our interest is a deterministic representation rather than the mean vector. Second, the posterior mean of von Mises-Fisher never lies on the support of the distribution, which makes it a poor guide for the stochastic network.
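The claim that the posterior mean leaves the support can be checked numerically in a special case. The sketch below assumes the unit sphere in three dimensions, where the mean of a von Mises-Fisher distribution is the direction parameter scaled by coth(kappa) - 1/kappa; the paper itself uses a higher-dimensional latent space, for which the scaling factor is a ratio of Bessel functions. The function names are hypothetical.

```python
import math


def mean_resultant_length_3d(kappa):
    """Scaling factor coth(kappa) - 1/kappa for vMF on the sphere in R^3."""
    return 1.0 / math.tanh(kappa) - 1.0 / kappa


def vmf_mean_3d(mu, kappa):
    """Posterior mean: the unit direction mu shrunk toward the origin."""
    r = mean_resultant_length_3d(kappa)
    return [r * m for m in mu]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


mu = [0.0, 0.6, 0.8]  # unit-norm direction parameter (toy value)
for kappa in (1.0, 10.0, 100.0):
    mean = vmf_mean_3d(mu, kappa)
    # The norm stays strictly below 1, so the mean is off the sphere,
    # whereas mu itself always has norm 1.
    print(kappa, norm(mean), norm(mu))
```

Larger concentration values pull the mean closer to the sphere, but for any finite concentration it remains strictly inside.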

C Details of the Experimental Setup

The dimension of latent vectors is 32. The dimension of word embeddings is 200. The encoder and the decoder are one-layer GRUs with a hidden state size of 128 for PTB and 256 for Yelp and Yahoo. For optimization, we use Adam (Kingma and Ba, 2015) with a learning rate of and . The decoding signal is used as the first word embedding and is also concatenated to the word embedding at each decoding step. After 30K steps, the learning rate is halved every 2K steps. The dropout (Srivastava et al., 2014) rate is 0.2. KL-annealing (Bowman et al., 2016) is applied from step 2K to 42K (on Yelp, it is applied from step 1K to 41K for VAE, Coupled-VAE, β-VAE, and Coupled-β-VAE; otherwise, the KL divergence becomes very large in the early stage of training). Every 1K steps, we estimate the NLL for validation.
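The two schedules described above can be sketched as follows. Several details are assumptions rather than statements from the text: the annealing curve is taken to be linear from 0 to 1, the first halving is taken to happen 2K steps after step 30K, and the learning-rate function returns a multiplier because the base learning rate is not available here. The function names are hypothetical.

```python
def kl_weight(step, start=2_000, end=42_000):
    """KL-annealing weight; linear ramp from 0 to 1 (assumed shape)."""
    if step <= start:
        return 0.0
    if step >= end:
        return 1.0
    return (step - start) / (end - start)


def lr_multiplier(step, decay_start=30_000, decay_every=2_000):
    """Factor applied to the base learning rate.

    Stays at 1 up to decay_start, then halves once per decay_every steps
    (assumption: the first halving takes effect at decay_start + decay_every).
    """
    if step < decay_start:
        return 1.0
    return 0.5 ** ((step - decay_start) // decay_every)


for step in (0, 2_000, 22_000, 42_000, 30_000, 32_000, 36_000):
    print(step, kl_weight(step), lr_multiplier(step))
```

For the Yelp setting mentioned above, the same function would be called with `start=1_000` and `end=41_000`.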

For normalizing flows (NF), we use planar flows (Rezende and Mohamed, 2015) with three contiguous transformations. For WAE and WAE-NF, we use Maximum Mean Discrepancy (MMD) (Gretton et al., 2012) as the regularization term. An additional KL regularization term with the weight (also with KL-annealing) is added to WAE and WAE-NF since MMD does not guarantee the convergence of the KL divergence.

D Estimation of Language Modeling Metrics

For language modeling, we report negative log-likelihood (NLL), KL divergence, and perplexity. To make the results more reliable, we state the estimation of each metric explicitly. For each test sample x, NLL is estimated by importance sampling, and KL is approximated by its Monte Carlo estimate:

$$\mathrm{NLL}_x \approx -\log \frac{1}{N}\sum_{i=1}^{N} \frac{P(x|z^{(i)})\, p(z^{(i)})}{q(z^{(i)}|x)}, \qquad \mathrm{KL}_x \approx \frac{1}{N}\sum_{i=1}^{N} \log \frac{q(z^{(i)}|x)}{p(z^{(i)})},$$

where $z^{(i)} \sim \mathcal{Q}(z|x)$ are sampled latent codes and all notations follow Eq. (1) in the main text. We report the averaged NLL and KL on all test samples. Perplexity is computed based on the estimated NLL. For validation, the number of samples is N = 10; for evaluation, the number of samples is N = 100.
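A minimal sketch of these estimators is given below. It assumes that the model supplies, for each latent code sampled from the posterior, the log-likelihood plus log-prior and the log posterior density; the sums are done in log space for numerical stability. Normalizing perplexity by the total token count is a common convention and an assumption here, and all function names are hypothetical.

```python
import math


def log_mean_exp(values):
    """Numerically stable log of the arithmetic mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def estimate_nll(log_joint, log_q):
    """Importance-sampling NLL estimate for one text.

    log_joint[i] = log P(x|z_i) + log p(z_i), log_q[i] = log q(z_i|x),
    with every z_i drawn from the posterior of x.
    """
    log_weights = [lj - lq for lj, lq in zip(log_joint, log_q)]
    return -log_mean_exp(log_weights)


def estimate_kl(log_q, log_prior):
    """Monte Carlo KL estimate: average of log q(z_i|x) - log p(z_i)."""
    return sum(lq - lp for lq, lp in zip(log_q, log_prior)) / len(log_q)


def perplexity(nlls, lengths):
    """exp(total NLL / total tokens); per-token normalization is assumed."""
    return math.exp(sum(nlls) / sum(lengths))


# Toy numbers standing in for model outputs (assumed values).
print(estimate_nll([-12.0, -11.5, -13.0], [-1.0, -0.8, -1.4]))
print(estimate_kl([-1.0, -0.8, -1.4], [-3.0, -2.5, -4.1]))
```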

E Estimation of Mutual Information and Reconstruction Metrics

We report the mutual information (MI) between the text x and the latent code z under Q(z|x) to investigate how much useful information is encoded. The MI component of each test sample $x_i$ is approximated by Monte Carlo estimation:

$$\mathrm{MI}_{x_i} \approx \frac{1}{N}\sum_{n=1}^{N} \log \frac{q(z^{(n)}|x_i)}{q(z^{(n)})}, \qquad z^{(n)} \sim \mathcal{Q}(z|x_i),$$

where the aggregated posterior density is approximated with its Monte Carlo estimate:

$$q(z^{(n)}) \approx \frac{1}{M}\sum_{j=1}^{M} q(z^{(n)}|x_j),$$

where the $x_j$ are sampled from the test set. For convenience, most previous work uses the texts within each batch as the sampled $x_j$'s (which are supposed to be sampled from the entire test set). However, this convention results in a biased estimate, since the term $q(z^{(n)}|x_i)$ is computed when j = i, i.e., the text itself is always sampled when computing its MI component. We remedy this by skipping the term when j = i. The overall MI is then estimated by averaging the MI components over all test samples. We set the numbers of samples to N = 100 and M = 512.
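The leave-one-out correction can be sketched as follows. The example assumes one-dimensional Gaussian posteriors for brevity and averages the aggregated posterior over the remaining texts after dropping the text itself; whether the normalizer is the number of remaining texts or the original count is an assumption. The names `mi_component` and `log_normal` are hypothetical.

```python
import math
import random


def log_normal(z, mean, std):
    """Log density of a one-dimensional Gaussian."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(std)
            - 0.5 * ((z - mean) / std) ** 2)


def log_mean_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def mi_component(i, posteriors, z_samples):
    """MI component of text i.

    posteriors: list of (mean, std) pairs, one posterior per sampled text.
    z_samples:  latent codes drawn from the posterior of text i.
    The term j == i is skipped when estimating the aggregated posterior.
    """
    mean_i, std_i = posteriors[i]
    others = [p for j, p in enumerate(posteriors) if j != i]
    total = 0.0
    for z in z_samples:
        log_q = log_normal(z, mean_i, std_i)
        log_agg = log_mean_exp([log_normal(z, m, s) for m, s in others])
        total += log_q - log_agg
    return total / len(z_samples)


rng = random.Random(0)
posteriors = [(-2.0, 0.5), (0.0, 0.5), (2.0, 0.5)]  # toy posteriors
z_samples = [rng.gauss(*posteriors[0]) for _ in range(100)]
print(mi_component(0, posteriors, z_samples))
```

When every text has the same posterior the estimate is zero, which matches the intuition that a collapsed posterior carries no information about the text.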

For reconstruction, we sample ten latent codes from the posterior of each text input and decode them with greedy search. We compute BLEU-1 and BLEU-2 between the reconstruction and the input with the Moses script.

F Training Dynamics of Gradient Norms

G Diversity and Samples from the Prior Distribution

Given the limited space in the main text, we place the comprehensive evaluation of samples from the prior distribution in this part. Table 9 shows the diversity metrics and the first three (thus entirely random) samples from each model. Qualitatively, samples from Coupled-VAE are more diverse than those from VAE. The long texts generated by VAE contain more redundancies than those generated by Coupled-VAE. Given that both models have the same latent dimension, this indicates that Coupled-VAE uses the latent codes more efficiently.

Figure 5: Training dynamics of DAE, VAE, and Coupled-VAE. Panels are grouped by the quantity plotted: (a), (d), and (g); (b), (e), and (h); (c), (f), and (i). Best viewed in color (the models are also distinguished by line markers).

H Interpolation

A property of VAE is that interpolation in the latent space corresponds to a smooth transition in the text space (Bowman et al., 2016). In Table 7, we show the interpolation of VAE and Coupled-VAE on PTB. Compared with VAE, Coupled-VAE has smoother transitions of subjects (from "both sides" to "it") and verbs (from "are expected" to "have been"), indicating that the information about subjects and verbs is more smoothly encoded in the latent space of Coupled-VAE.
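The latent-space side of such an interpolation can be sketched as below. Linear interpolation between two latent codes is assumed, since the interpolation scheme is not stated here, and decoding each intermediate code into text is left to the trained decoder. The function name `interpolate` is hypothetical.

```python
def interpolate(z_start, z_end, num_steps):
    """Return num_steps codes evenly spaced on the segment between two codes.

    Both endpoints are included. Linear interpolation is an assumption.
    """
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    codes = []
    for k in range(num_steps):
        t = k / (num_steps - 1)
        codes.append([(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return codes


# Toy two-dimensional codes (assumed values); each row would be decoded
# into one sentence of the interpolation table.
for code in interpolate([0.0, 2.0], [1.0, 0.0], 5):
    print(code)
```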

I Generalization to Conditional Generation: Coupled-CVAE

To generalize our approach to conditional generation, we focus on whether it can improve the CVAE model (Zhao et al., 2017) for dialogue generation. To this end, we propose the Coupled-CVAE model.

I.1 CVAE

CVAE adopts a two-step view of diverse dialogue generation. Let x be the response and y be the post (or the context). CVAE first samples the latent code z from the prior distribution $\mathcal{P}(z|y)$ and then samples the response from the decoder $P(x|z, y)$. Given the post y, the marginal distribution of the response x is

$$P(x|y) = \int P(x|z, y)\, p(z|y)\, dz.$$

Similar to VAE, the exact marginalization is intractable, and we derive the evidence lower bound (ELBO) of CVAE as

$$\log P(x|y) \ge \mathbb{E}_{z \sim \mathcal{Q}(z|x,y)}\big[\log P(x|z, y)\big] - \mathrm{KL}\big[\mathcal{Q}(z|x,y)\,\|\,\mathcal{P}(z|y)\big]. \tag{15}$$

During training, the response and the post are each encoded into a vector. The two vectors are concatenated and transformed into the posterior via the posterior network. A latent code is then sampled and mapped to a higher-dimensional h. The decoding signal in CVAE is computed from h and the encoded post and is used to infer the response. Similar to VAE, the objective of CVAE can also be viewed as a reconstruction loss and a regularization term in Eq. (15).
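The split of the CVAE objective into a reconstruction loss and a regularization term can be sketched as follows, assuming that both the posterior and the conditional prior are diagonal Gaussians so that the KL term has a closed form. The reconstruction term is passed in as a number because it depends on the decoder; the names `gaussian_kl` and `cvae_loss` and the `kl_weight` argument are hypothetical.

```python
import math


def gaussian_kl(mu_q, std_q, mu_p, std_p):
    """KL(N(mu_q, std_q^2) || N(mu_p, std_p^2)), summed over dimensions."""
    kl = 0.0
    for mq, sq, mp, sp in zip(mu_q, std_q, mu_p, std_p):
        kl += (math.log(sp / sq)
               + (sq * sq + (mq - mp) ** 2) / (2.0 * sp * sp)
               - 0.5)
    return kl


def cvae_loss(recon_nll, mu_q, std_q, mu_p, std_p, kl_weight=1.0):
    """Negative ELBO: reconstruction loss plus weighted KL regularizer.

    The posterior parameters come from the response and the post,
    the prior parameters from the post alone.
    """
    return recon_nll + kl_weight * gaussian_kl(mu_q, std_q, mu_p, std_p)


# Toy values standing in for network outputs (assumed).
print(cvae_loss(35.2, [0.3, -0.1], [0.8, 0.9], [0.0, 0.0], [1.0, 1.0]))
```

Unlike the unconditional VAE, the prior here is not fixed, so the KL term compares two learned distributions.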

I.2 Coupled-CVAE

As observed in Zhao et al. (2017), the CVAE model also suffers from the posterior collapse problem. We generalize our approach to the conditional setting and arrive at Coupled-CVAE. A graphical overview is displayed in Figure 4. The difference from Coupled-VAE is shown in red. Specifically, the (coupled) posterior network and the (coupled) decoder are additionally conditioned on the post representation. The objective of Coupled-CVAE is identical to Eq. (4) in the main text.

The coupled reconstruction loss in Coupled-CVAE has two functions. First, it improves the encoded response, similar to Coupled-VAE. Second, it encourages the latent code to encode more response information rather than post information, which works together with the post representation to improve the parameterization h.

I.3 Dataset

We use the Switchboard dataset (John and Holliman, 1993). We split the dialogues into single-turn post-response pairs, and the number of pairs in the training/validation/test split is 203K/5K/5K. The vocabulary size is 13K.

I.4 Evaluation

For probability estimation, we report the NLL, KL, and PPL based on the gold responses. NLL, KL, and PPL are computed as in Appendix D, except for the additional conditioning on the post. Since the key motivation for using CVAE in Zhao et al. (2017) is response diversity, we sample one response for each post and report the Distinct-1 and Distinct-2 metrics over all test samples.
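A minimal sketch of a Distinct-n computation over a set of sampled responses is shown below. It pools the n-grams of all responses and divides the number of distinct n-grams by the total n-gram count; this normalization is one common choice and an assumption here, since other variants (for example dividing by the token count) exist. The function name `distinct_n` and the toy responses are hypothetical.

```python
def distinct_n(responses, n):
    """Ratio of distinct n-grams to all n-grams, pooled over responses.

    responses: list of token lists, one sampled response per post.
    """
    unique = set()
    total = 0
    for tokens in responses:
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


responses = [
    ["i", "do", "n't", "know"],
    ["i", "do"],
]
print(distinct_n(responses, 1), distinct_n(responses, 2))
```

Pooling over the whole test set, rather than averaging per response, penalizes a model that repeats the same generic reply across different posts.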

I.5 Experimental Setup

We compare our Coupled-CVAE model with two baselines: GRU encoder-decoder (Cho et al., 2014) and CVAE (Zhao et al., 2017). The detailed setup follows that of the PTB dataset in Appendix C. Every 1K steps, we estimate the NLL for validation.

I.6 Results

Experimental results of Coupled-CVAE are shown in the main text.

1. but the market is a bit of the market ’s recent slide and the fed is trying to sell investors to buy back and forth between the s&p N and N

2. the company said it will be developed by a joint venture with the u.s.

3. the new york stock exchange composite index rose N to N

1. the food is good , but the food is good . i had the chicken fried steak with a side of mashed potatoes , and it was a good choice . the fries were good , but the fries were good . i had the chicken breast with a side

2. ok , so i was excited to check out this place for a while . i was in the area , and i was n’t sure what to expect . i was a little disappointed with the food , but i was n’t sure what to expect . i was

3. we went to the biltmore fashion park . we were seated right away , but we were seated right away . we were seated right away , but we were seated right away . we were seated right away and we were seated right away . the staff was very

1. i ’m a fan of the “ asian ” restaurants in the valley , and i ’m not sure what to expect , but i ’m not sure what the fuss is about . the meat is fresh and delicious . i ’m not a fan of the “ skinny

2. i ’m not a fan of the fox restaurants in phoenix , but i have to say that the service is always a great experience . the atmosphere is a little dated and there is a great view of the mountains .

3. i have been here twice , and the food was good , but the service was good , but the food was good . i had a great time , but the service was great . the food was a bit pricey , but the service was a bit slow

1. if you are looking for a good wrestler , what do you think about the future ? i am not sure what i mean . i have been watching the ufc for 3 months . i have been watching the ufc and i have to be able to see what happens .

2. is it true that the war is not a hoax ? it is a myth that the UNK of the war is not a war , but it is not possible to be able to see the war . the UNK is not a war , but it ’s not a crime .

Table 9: Diversity metrics and the first three samples from each model. Redundancies (pieces of text that appeared before) are shown in red.
