b

DiscoverSearch
About
My stuff
Invertible Gaussian Reparameterization: Revisiting the Gumbel-Softmax
2019·arXiv
Abstract
Abstract

The Gumbel-Softmax is a continuous distribution over the simplex that is often used as a relaxation of discrete distributions. Because it can be readily interpreted and easily reparameterized, it enjoys widespread use. We propose a modular and more flexible family of reparameterizable distributions where Gaussian noise is transformed into a one-hot approximation through an invertible function. This invertible function is composed of a modified softmax and can incorporate diverse transformations that serve different specific purposes. For example, the stick-breaking procedure allows us to extend the reparameterization trick to distributions with countably infinite support, thus enabling the use of our distribution along nonparametric models, or normalizing flows let us increase the flexibility of the distribution. Our construction enjoys theoretical advantages over the GumbelSoftmax, such as closed form KL, and significantly outperforms it in a variety of experiments. Our code is available at https://github.com/cunningham-lab/ igr.

Numerous machine learning tasks involve optimization problems over discrete stochastic components whose parameters we wish to learn. Mixture and mixed-membership models, variational autoencoders, language models and reinforcement learning fall into this category [13, 14, 27, 17, 8]. Ideally, as with fully continuous models, we would use stochastic optimization via backpropagation. One strategy to compute the necessary gradients is using score estimators [8, 33], however these estimates suffer from high variance which leads to slow convergence. Another strategy is to find a reparameterizable continuous relaxation of the discrete distribution. Reparameterization gradients exhibit lower variance but are contingent on finding such a relaxation. Jang et al. [12] and Maddison et al. [19] independently found such a continuous relaxation via the Gumbel-Softmax (GS) or Concrete distribution.

The GS has experienced wide use and has been extended to other settings, such as permutations [18], subsets [34] and more [1]. Its success relies on several qualities that make it appealing: (i) it is reparameterizable, that is, it can be sampled by transforming parameter-independent noise through a smooth function, (ii) it can approximate any discrete distribution, (i.e. converge in distribution) (iii) it has a closed form density, and (iv) its parameters can be interpreted as the discrete distribution that it is relaxing. While the last quality is mathematically pleasing, it is not a necessary condition for a valid relaxation. Here we ask: how important is this parameter interpretability? In the context of deep learning models, interpreting the parameters is not a first concern, and we show that the GS can be significantly improved upon by giving up this quality.

In this paper we propose an alternative family of distributions over the simplex to achieve this relaxation, which we call Invertible Gaussian Reparameterization (IGR). Our reparameterization works by transforming Gaussian noise through an invertible transformation onto the simplex, and a temperature hyperparameter allows the distribution to concentrate its mass around the vertices. IGR is more natural, more flexible, and more easily extended than the GS. Furthermore, IGR enables using the reparameterization trick on distributions with countably infinite support, which enables nonparametric uses, and also admits closed form KL divergence evaluation. Finally, we show that our distribution outperforms the GS in a wide variety of experimental settings.

2.1 The reparameterization trick

Many problems in machine learning can be formulated as optimizing parameters of a distribution over an expectation:

image

where  qφis a distribution over S parameterized by  φand  f : S → R. In order to use stochastic gradient methods [28, 3], the gradient of L has to be estimated. A first option is to use score estimators [8, 33]. However, in practice score estimators usually exhibit high variance [20]. The reparameterization trick [14] provides an alternative estimate of this gradient which empirically has less variance, resulting in more efficient optimization. The reparameterization trick consists of finding a function  g(ϵ, φ) such that gis differentiable with respect to  φ and if z ∼ qφ, then:

image

where  ϵis a continuous random variable whose distribution does not depend on  φand is easy to sample from. The gradient is then estimated by:

image

where  ϵ1, . . . , ϵBare iid samples from the distribution of  ϵ. For example, if  φ = (µ, σ)and qφ = N(µ, σ2)then the reparameterization trick is given by  g(ϵ, φ) = µ + σϵ with ϵ ∼ N(0, 1).

2.2 Continuous relaxations

While we can use score estimators whether  qφhas continuous or discrete support, the reparameterization gradient of equation 3 is only valid when  qφhas continuous support. To extend the reparameterization trick to the discrete setting, thus avoiding the high variance issues of score estimators, suppose  qφis a distribution over the set S = {1, 2, . . . , K}. We use one-hot representations of length K for the elements of S, so that S can be interpreted as the vertices of the  (K − 1)-simplex, ∆(K−1) = {z ∈ RK : zk ≥ 0 and �Kk=1 zk = 1}. The idea is to now place a continuous distribu- tion over  ∆(K−1)that approximates  qφ. Note that placing a distribution over  ∆(K−1)is equivalent to placing a distribution over  S(K−1) = {z ∈ RK−1 : zk > 0 and �K−1k=1 zk < 1}, as the last coordinate can be obtained from the previous ones:  zK = 1 − �K−1k=1 zk. Placing a distribution over S(K−1)is mathematically convenient as  S(K−1) ⊂ RK−1has positive Lebesgue measure, while ∆(K−1) ⊂ RK does not. Although this distinction might appear as an irrelevant technicality, it allows us to correctly compute our Jacobians in section 3. We will thus interchangeably think of distributions over S as points in either  S(K−1) or ∆(K−1). The optimization problem of equation 1 is then relaxed to:

image

where  ˜q ˜φis a distribution over  S(K−1)and the function ˜f : S(K−1) → Ris a relaxation of f to S(K−1). As long as  ˜q ˜φconcentrates most of its mass around S and ˜fis smooth, this relaxation is sensible. If  ˜q ˜φcan be reparameterized as in equation 2, then we can use the reparameterization trick. We make two important notes: first, not only the distribution is relaxed, the function f also has to be relaxed to ˜fbecause it now needs to take inputs in  S(K−1)and not just S. In other words, the objective must also be relaxed, not just the distribution. Second, the parameters ˜φof the relaxed distribution need not match the parameters  φof the original distribution.

Maddison et al. [19] and Jang et al. [12] proposed the Gumbel-Softmax distribution, which is parameterized by  α ∈ (0, ∞)K and a temperature hyperparameter  τ > 0, and is reparameterized as:

image

where  ϵ ∈ RKis a vector with independent Gumbel(0, 1) entries and log refers to elementwise logarithm. Note that when the temperature approaches 0, not only does the GS concentrate its mass around S, but it converges to a distribution proportional to  α. The GS distribution implied by equation 5 can be shown to be:

image

We highlight the difference between  α and ˜φ: the former is the parameter of the GS distribution and might depend on the latter, which is the parameter of the loss with respect to which we optimize in equation 4. For example,  αmight be the output of a neural network parameterized by ˜φ. A commonuse of the GS is to relax objectives of the form:

image

where  p0is a distribution over S. Relaxing this KL requires additional care: it cannot be relaxed to  KL(˜q ˜φ||p0)because the KL divergence is not well defined between a continuous and a discrete distribution. In other words, relaxing  f to ˜fis not straightforward when a KL divergence is involved in the objective. When using a GS relaxation, researchers commonly replace this KL with [12, 6, 32]:

image

the idea being that, for low temperatures, the GS approximates a distribution proportional to its parameter, i.e.  ¯α ∈ ∆(K−1). The goal of this substitution is to still compute a KL between two discrete variables, even after relaxing the distribution. This substitution is problematic, as pointed out by Maddison et al. [19], as it does not take into account how close the GS actually is to  ¯α. A more sensible way to relax the discrete KL is to relax it to an actual continuous KL as done by Maddison et al. [19]:

image

where  ˜q0is fixed in such a way that it is close to  p0. For the GS, finding such a distribution is straightforward as a consequence of its parameter interpretability:  ˜q0can be chosen as a GS with parameter  α0 = p0. Note that the KL in equation 9 cannot be directly evaluated, but a Monte Carlo estimate can be formed thanks to the closed form density of equation 6 and thus stochastic gradient descent can be performed.

Finally, it is worth remarking that while Stirn et al. [31] and Gordon-Rodriguez et al. [9] proposed distributions over the simplex which admit reparameterization gradients, their goals are not to obtain discrete relaxations. Thus they do not have a temperature hyperparameter allowing to concentrate mass on the vertices to approximate discrete distributions.

If the only requirements for a continuous relaxation are a reparameterizable distribution on the simplex and a temperature hyperparameter allowing to concentrate mass around the vertices, one might logically ask: why use the specific choices of the GS? Namely, why use the unusual Gumbel noise and be forced to use the softmax as a mapping onto the simplex? If tasked with constructing a reparametrizable distribution on the simplex, we argue that the most natural choice is to sample Gaussian noise and map it to the simplex; trying different mappings and keeping the best performing one. The cost of this choice is losing the parameter interpretability of the GS, but we will show the advantages are numerous and well worth this cost.

We now present the IGR distribution on  S(K−1), which is parameterized by  (µ, σ), where µ ∈ RK−1and  σ ∈ (0, ∞)K−1. Gaussian noise  ϵ = (ϵ1, . . . , ϵK−1) ∼ N(0, IK−1)is transformed in the following way:

image

where diag(σ)is a diagonal matrix whose nonzero elements are given by  σ, g(·, τ)is an invertible smooth function and  τ > 0is a temperature hyperparameter. Note that IGR is not only more natural than the GS, but is is also more flexible, having  2K − 2parameters instead of K. The first advantage of choosing g to be an invertible function is that the density of  ˜zcan be computed in closed form with the change of variable formula:

image

where  Jg(·, τ)is the Jacobian of  g(·, τ). The second advantage of this choice is that it allows us to compute the KL in closed form (as the Jacobian terms cancel out in the ratio):

image

and thus Monte Carlo estimation of equation 9 is no longer needed.

The components of the IGR can be easily mixed-and-matched. For example, while we use Gaussian noise as the most natural first choice because it is reparameterizable and because the KL divergence between two Gaussians has closed form, any other choice with these two properties can also be used. Similarly, any choice of g, as long as it obeys some requirements which we explain in the section 3.1, can also be used. In contrast, changing the Gumbel distribution or the softmax used in the GS cannot be done. These properties of the IGR make it more easily extensible than the GS.

Since the parameter interpretability of the GS is lost in IGR, we cannot directly read  µ0and  σ0from  p0 ∈ S(K−1). Thus when a KL term is involved, while IGR gains the ability to evaluate it analytically, we solve the following optimization problem to obtain these parameters:

image

Note that having to solve this problem is a very small price to pay for losing parameter interpretability: the optimization is a very simple moment matching problem and has to be be computed only once for any given  p0.

3.1 Choosing g(·, τ)

In this section we design some invertible functions that could be used and argue the rationale behind their construction. There are two important desiderata for  g(·, τ): the first one is that we should be able to compute the determinant of its Jacobian efficiently, which enables tractable density evaluation. This tractability can be achieved, for example, by ensuring the Jacobian is triangular. Note that although in many instances we do not actually require evaluating the density of the relaxation (e.g. variational autoencoders [14]), this is a problem-specific property and density evaluation remains desirable in general. The second is that the limit as  τ → 0of  g(y, τ)is in S for almost all y, meaning that as the temperature gets smaller, the distribution places most of its mass around the vertices. The two most natural choices for mapping onto the simplex are the softmax function and the stick-breaking procedure. As we explain below, these alone are not enough, and we thus modify them to make them appropriate for our purposes. The softmax has two issues: first, it maps to  ∆(K−1) andnot  S(K−1)and second, it is not invertible. Both of these problems can be addressed with a small modification of the softmax function:

image

where  δ > 0ensures that the function is invertible and maps to  S(K−1). Furthermore, the Jacobian of this transformation can be efficiently computed with the matrix determinant lemma (see appendix for details). We will refer to this transformation as the softmax++.

The other natural alternative to map from  (0, 1)K−1onto  S(K−1)is through the stick-breaking procedure [7], which we briefly review here. Given  u ∈ (0, 1)K−1, the result v = SB(u) of performing stick-breaking on u is given by:

image

In addition to producing outputs in  S(K−1), this procedure has some useful properties, namely: it is invertible, its Jacobian is triangular, and it can easily be extended to the case where  K = ∞ (whichwill be useful to extend IGR to relax discrete distributions with countably infinite support). While the invertibility property might suggest that the stick-breaking procedure alone is enough to use with IGR, a temperature hyperparameter  τstill needs to be introduced in such a way that as  τ → 0, the resulting distribution concentrates its mass on the vertices. Unlike with the softmax++, simply dividing the input by  τdoes not achieve this limiting behavior. The most natural way of introducing a temperature that achieves the desired limiting behavior is by linearly interpolating to the nearest vertex, resulting in a g function given by:

image

where  PSis the projection onto the vertices of  S(K−1). Note that the Jacobian of this transformation is triangular. However, we found better empirical performance with the following function, which introduces the temperature using the softmax++ function:

image

While it might seem redundant to apply both a stick-breaking procedure and a softmax++as they both map to  S(K−1), the softmax++function allows to introduce  τin such a way that the distribution concentrates its mass around the vertices as  τ → 0. Also, as seen in section 3.2, the stick-breaking procedure proves useful as it enables using the reparameterization trick in the countably infinite support setting.

Finally, another choice of  g(·, τ)could be a normalizing flow [26, 15, 5] followed by softmax++. Normalizing flows are flexible neural networks constructed in such a way that they are invertible while still allowing tractable Jacobian determinant evaluations, so that they enable us to learn g. We note that normalizing flows require additional parameters, so that when using them, IGR is not only parameterized by  µand  σ, but by the parameters of the normalizing flow as well. Thus, if a KL is involved, the optimization problem of equation 14 needs to be solved over the parameters of the normalizing flow too, and as a result the KL in equation 9 cannot be evaluated in closed form anymore, as the parameters of the two involved normalizing flows need not match. However, Monte Carlo estimates of the KL are still readily available.

3.2 Reparameterization trick for countably infinite distributions

Since the stick-breaking procedure can map to  S∞ = {z ∈ R∞ : zk ≥ 0 and �∞k=1 zk = 1}, we can extend equation 18 to the setting where the discrete distribution has countably infinite support (e.g. Poisson, geometric or negative binomial distributions). In this setting, the IGR is parameterized by  µ ∈ R∞and  σ ∈ (0, ∞)∞. Clearly backpropagating through infinitely many parameters cannot be done in a computer, but we do not have to do so as most of the parameters contribute very little to the loss. For a sample  ϵ1, ϵ2, . . .we only update the first K coordinates of  µand  σ, where K is the number such that:

image

where  ρ ∈ (0, 1)is a pre-specified precision hyperparameter and g is as in equation 18. Note that here K is now a random variable that depends on  ϵinstead of being fixed as before, so that in a way the number of (effective) categories gets learned by the data. Note as well that the stick-breaking procedure is necessary to know where to cut K as it guarantees that later terms in the sequence are small, which would not happen if we only applied a softmax++ function.

3.3 Recovering the discrete distribution

Recall that the original objective of continuous relaxations is to solve the discrete problem of equation 1, so that once we have solved the continuous problem of equation 4, it is desirable to have the ability to recover a solution to the former problem. In other words, given the parameters of a continuous relaxation, we should be able to recover the discrete distribution that it is relaxing. The parameter intepretability of the GS allows to directly do so. In this section we derive a method for doing so with the IGR, which is enabled by the following proposition:

Proposition 1: For any  δ > 0, the following holds:

image

Thus, the vector of discrete probabilities associated with IGR is  E[h(˜z)], which can be easily approximated through a Monte Carlo estimate by sampling from the IGR and averaging the results after transforming them with h. This is the last cost to pay for losing parameter interpretability, but once again it is very small: the complexity of this approximation is negligible when compared to the one of solving the problem of equation 4. Note also that this proposition enables the use of straight-through estimators [2], where the sample is discretized during the forward pass, but not for backpropagation. The next proposition shows that when just using the softmax++ as g, the recovered discrete distribution can be written in an even more explicit form:

Proposition 2: If  yk ∼ N(µk, σk) for k = 1, . . . , K − 1, and we define the discrete random variable

image

where  φ and Φare the standard Gaussian density and cumulative distribution function, respectively. Proof: See appendix.

We finish this section by noting that there is literature proposing gradient estimators and attempting to reduce their variance [21, 20, 32, 10, 16]. In particular, Grathwohl et al. [10] and Tucker et al. [32] proposed techniques involving the GS. Their techniques, however, require computing the gradient of the discrete objective with respect to the parameters of the continuous relaxation, which can be done with the GS thanks to its parameter interpretability. Proposition 2 thus enables the use of their methods with IGR, as the integral in equation 29 can be easily approximated numerically. Due to space constraints we include details, along a discussion about bias, in the appendix.

In this section, we contrast the performance of the IGR (with different choices of g) alongside that of the GS. First, in relation to section 3.2, we compare the ability of the IGR and the GS (with varying number of categories) to approximate a countably infinite distribution. We then focus on tasks that involve a KL term in their objective function. Finally, we also consider a Structured Output Prediction task which does not involve a KL term. For the experiments involving a KL term, we use variational autoencoders (VAEs) [14]. We follow the setup of Maddison et al. [19] and Jang et al. [12] (although note that we use a slightly different objective than Jang et al. [12], see appendix for details). The datasets we use are handwritten digits from MNIST, fashion items from FMNIST and alphabet symbols from Omniglot. We ran each experiment 5 times and report averages plus/minus one standard deviation. Additionally, for all the experiments, we used the log scale implementation of the GS (ExpConcrete) as in Maddison et al. [19] since it avoids numerical issues and allows us to run the models involving the GS at lower temperatures. Throughout this section, the label IGR-I denotes the implementation with the softmax++(equation 15), the label IGR-SB the implementation with the stick-breaking transformation followed by a softmax++(equation 18), and finally the label IGR-Planar the implementation using two nested Planar Flows [26] followed by a softmax++ .

Comparing any IGR variant against the GS requires selecting temperature hyperparameters for each model. To make a fair comparison, temperatures  τshould be chosen carefully as they affect models differently, so they cannot just be set to the same value. We thus choose the temperature hyperparameter through cross validation, considering the range of possible temperatures {0.01, 0.03, 0.07, 0.1, 0.25, 0.4, 0.5, 0.67, 0.85, 1.0} and compare best-performing models. However, and very importantly, we use the loss on the recovered discrete model — not the trained continuous one — to select the best performing model. This avoids the potential issue of having one model produce better discrete relaxations which are closer to the vertices of the simplex, while resulting in a larger continuous loss as the other model is allowed to use the simplex more freely. All implementation details are in the appendix.

4.1 Approximating a Poisson Distribution

Here we compare the ability of the IGR-SB and the GS to approximate distributions with countably infinite support. The top panels of Figure 1 show an approximation with the IGR-SB to a Poisson distribution with  λ = 50, while the bottom panels show the same approximations when using a GS with different number K of discrete components. These approximations are computed by optimizing the objective in equation 14. We can see how the IGR-SB outperforms the GS without having to specify K. We show further comparisons when approximating other distributions in the appendix.

4.2 Variational Autoencoders

We trained VAEs composed of 20 discrete variables with 10 categories each. VAEs are latent variable models which maximize the ELBO, a lower bound on the log likelihood involving a KL term (see appendix for details). For MNIST and Omniglot we used a fixed binarization and a Bernoulli decoder, whereas for FMNIST we use a Gaussian decoder. Table 1 shows test log-likelihoods (not ELBOs, these are obtained as in Burda et al. [4] with m = 1000, and are computed on the recovered discrete model) plus/minus one standard deviation for two different architectures. We highlight best results and those within error. The IGR performs best or is within error, except in a single scenario. We report the test log-likelihood as it is the most relevant metric from a machine learning perspective, but from an optimization point of view, the discretized training ELBO is of more interest, as it more accurately measures how well the original objective is being maximized. We include this evaluation, which is also favorable to IGR, in the appendix. It is also worth mentioning that the execution times between the IGR and the GS were almost identical for the I and SB variants. Nonetheless, the IGR Planar is about 30% slower than all the other alternatives.

To verify how much of our performance improvement is due to our closed form KL, we also trained the VAE using the sticking the landing gradient estimator proposed by Roeder et al. [29], which does not involve a closed form KL divergence. Results are also shown in Table 1 (with the label SL). Note that all SL models outperform their non-SL counterparts, suggesting that the closed form KL of the IGR is not a key component of its superior empirical performance. We note that closed form KL remains an attractive theoretical property which could prove more useful in other applications.

In Figure 2 we show that the IGR also outperforms the GS on the continuous model (not only the discrete one). The plot contains error bars, but these are almost imperceptible due to their size and the scale of the plot. Note that while we include this comparison for completeness, as we believe that

image

Figure 1: Approximations to a Poisson distribution for IGR-SB (top right panel) and GS (bottom right panel) after 1,000 training steps. Initial values of the approximations are displayed on the respective left panels. Due to the stick-breaking procedure, a random initialization concentrates to the left.

image

Table 1: Test log-likelihood on MNIST, FMNIST and Omniglot for IGR and GS. Higher is better.

the most relevant comparisons are on the recovered discrete model, it is interesting to see that the performance gains of the IGR over the GS on discretized models do not come at the cost of poorer continuous ones.

Finally, we compared IGR and the GS using the variance reduction technique of Grathwohl et al. [10], whose use is enabled thanks to proposition 2. We include this comparison — which was yet again favorable to IGR — and the corresponding discussion, along with a comparison against the estimator proposed by Kool et al. [16], in the appendix.

4.3 Structured Output Prediction

We consider a Structured Output Prediction task, where we reconstruct the lower part of an image given the upper part by using a binary stochastic feedforward neural network. In contrast to our

image

Figure 2: Test ELBO (of continuous model) on MNIST. Higher is better. This result is expected to defer from the discrete log-likelihood in Table 1 (see appendix for details).

image

Table 3: Test log-likelihood on MNIST, FMNIST and Omniglot for IGR and GS. Higher is better.

previous experiments, this is a task that does not require the computation of a KL divergence. It was first proposed in Raiko et al. [25] and replicated by Gu et al. [11], Jang et al. [12] and Maddison et al. [19], and does not involve a KL. The results of this experiment are in Table 2, where we can see that once again, IGR outperforms the GS.

4.4 Nonparametric Modeling

In our last experiment, we show that the truncation strategy that we use in equation 19 not only enables the use of continuous relaxations in the countably infinite setting, but that it also allows us to have reparameterizable distributions on  S∞(the difference being that one should concentrate most of its mass around the vertices while the other one not necessarily). In order to show this, we follow Nalisnick et al. [24], who use a VAE with a Dirichlet Process mixture of Gaussians as the prior. For the approximate posterior, they apply the stick-breaking procedure to K Kumaraswamy random variables, where K has to be specified in advance and is thus treated as a hyperparameter [23]. We note that this prespecification of K is problematic, as while the prior remains nonparametric, the resulting optimization objective matches the ELBO that would be obtained using a Dirichlet mixture of Gaussians with K components as the prior, effectively losing the nonparametric aspect of the model. In contrast, we use an IGR distribution as the approximate posterior, and the truncation strategy from equation 19 allows us to retain the true nonparametric nature of the model.

Since this task does not require a continuous relaxation but just a reparameterizable distribution on the simplex, we use equation 17 but replace g with the identity function, thus dropping the temperature hyperparameter. We use the label IGR-SB-MM to make this explicit, and use the label DLDPMM for the model of Nalisnick et al. [24]. Table 3 compares the two methods, where we trained DLDPMM with K = 7, 9, 11, 13, 15, 17 and report the best result. We can see that not only does IGR enable truly nonparametric inference thus not requiring an expensive hyperparameter search over K, but also that this does not come at the cost of decreased performance. We also note that a single IGR-SB-MM run takes the same amount of time as a single DLDPMM run.

In this paper we propose IGR, a flexible discrete reparameterization as an alternative to the GS in which Gaussian noise is transformed through an invertible function onto the simplex. At the cost of losing the parameter interpretability of the GS, our method results in a more natural and more flexible distribution, which has the further advantage of admitting closed form KL evaluation. We show that IGR significantly outperforms the GS and that, perhaps surprisingly, this improvement is not due to this nice theoretical property. Finally, IGR also extends the reparameterization trick to discrete distributions with countably infinite support and can be incorporated in nonparametric settings.

We do not foresee our work having any negative ethical implications or societal consequences.

We thank Harry Braviner for useful comments. We thank Edmond Cunningham for noticing an error in our Jacobian determinant derivation, which we have now corrected, and also for providing a succinct procedure for deriving it, which we have used. We also thank the Simons Foundation, Sloan Foundation, McKnight Endowment Fund, NIH NINDS 5R01NS100066, NSF 1707398, and the Gatsby Charitable Foundation for support.

[1] M. Balog, N. Tripuraneni, Z. Ghahramani, and A. Weller. Lost relatives of the gumbel trick. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 371–379. JMLR. org, 2017.

[2] Y. Bengio, N. Léonard, and A. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.

[3] L. Bottou. Stochastic gradient descent tricks. In Neural networks: Tricks of the trade, pages 421–436. Springer, 2012.

[4] Y. Burda, R. Grosse, and R. Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.

[5] L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using real nvp. ICLR, 2017.

[6] E. Dupont. Learning disentangled joint continuous and discrete representations. In Advances in Neural Information Processing Systems, pages 710–720, 2018.

[7] T. S. Ferguson. A bayesian analysis of some nonparametric problems. The annals of statistics, pages 209–230, 1973.

[8] P. W. Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75–84, 1990.

[9] E. Gordon-Rodriguez, G. Loaiza-Ganem, and J. P. Cunningham. The continuous categorical: a novel simplex-valued exponential family. In International Conference on Machine Learning, 2020.

[10] W. Grathwohl, D. Choi, Y. Wu, G. Roeder, and D. Duvenaud. Backpropagation through the void: Optimizing control variates for black-box gradient estimation. ICLR, 2018.

[11] S. Gu, S. Levine, I. Sutskever, and A. Mnih. Muprop: Unbiased backpropagation for stochastic neural networks. arXiv preprint arXiv:1511.05176, 2015.

[12] E. Jang, S. Gu, and B. Poole. Categorical reparameterization with gumbel-softmax. ICLR, 2017.

[13] M. Johnson, D. K. Duvenaud, A. Wiltschko, R. P. Adams, and S. R. Datta. Composing graphical models with neural networks for structured representations and fast inference. In Advances in neural information processing systems, pages 2946–2954, 2016.

[14] D. P. Kingma and M. Welling. Auto-encoding variational bayes. ICLR, 2014.

[15] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. In Advances in neural information processing systems, pages 4743–4751, 2016.

[16] W. Kool, H. van Hoof, and M. Welling. Estimating gradients for discrete random variables by sampling without replacement. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rklEj2EFvB.

[17] M. J. Kusner and J. M. Hernández-Lobato. Gans for sequences of discrete elements with the gumbel-softmax distribution. arXiv preprint arXiv:1611.04051, 2016.

[18] S. Linderman, G. Mena, H. Cooper, L. Paninski, and J. Cunningham. Reparameterizing the birkhoff polytope for variational permutation inference. In International Conference on Artificial Intelligence and Statistics, pages 1618–1627, 2018.

[19] C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. ICLR, 2017.

[20] A. Miller, N. Foti, A. D’Amour, and R. P. Adams. Reducing reparameterization gradient variance. In Advances in Neural Information Processing Systems, pages 3708–3718, 2017.

[21] A. Mnih and D. Rezende. Variational inference for monte carlo objectives. In International Conference on Machine Learning, pages 2188–2196, 2016.

[22] A. Mnih and D. J. Rezende. Variational inference for monte carlo objectives. CoRR, 1602.06725, 2016. URL http://arxiv.org/abs/1602.06725.

[23] E. Nalisnick and P. Smyth. Stick-breaking variational autoencoders. In ICLR, 2017.

[24] E. Nalisnick, L. Hertel, and P. Smyth. Approximate inference for deep latent gaussian mixtures. In NIPS Workshop on Bayesian Deep Learning, volume 2, 2016.

[25] T. Raiko, M. Berglund, G. Alain, and L. Dinh. Techniques for learning binary stochastic feedforward neural networks, 2014.

[26] D. Rezende and S. Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530–1538, 2015.

[27] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286, 2014.

[28] H. Robbins and S. Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.

[29] G. Roeder, Y. Wu, and D. K. Duvenaud. Sticking the landing: Simple, lower-variance gradient estimators for variational inference. In Advances in Neural Information Processing Systems, pages 6925–6934, 2017.

[30] N. Steen, G. Byrne, and E. Gelbard. Gaussian quadratures for the integrals

image

[31] A. Stirn, T. Jebara, and D. Knowles. A new distribution on the simplex with auto-encoding applications. In Advances in Neural Information Processing Systems, pages 13670–13680, 2019.

[32] G. Tucker, A. Mnih, C. J. Maddison, J. Lawson, and J. Sohl-Dickstein. Rebar: Low-variance, unbiased gradient estimates for discrete latent variable models. In Advances in Neural Information Processing Systems, pages 2627–2636, 2017.

[33] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.

[34] S. M. Xie and S. Ermon. Reparameterizable subset sampling via continuous relaxations. In International Joint Conference on Artificial Intelligence, 2019.

image

As mentioned in section 3.1, we can use the matrix determinant lemma to efficiently compute the determinant of the Jacobian of the softmax++. If, for i = 1, . . . , K − 1, we let:

image

then it is not hard to see that when  i ̸= j then:

image

and that:

image

Combining the previous equations we have that:

image

Given the previous expression, we can compute the determinant as follows:

image

From the previous expression the log determinant becomes apparent:

image

Proposition 1: For any  δ > 0, the following holds:

image

where  ek ∈ RK−1 is the one-hot vector with a 1 in its k-th coordinate.

Proof: We will assume that  arg maxk=1,...,K−1(yk)is unique, and denote  y∗ = maxk=1,...,K−1(yk).We then have:

image

If  yk < y∗, the numerator goes to 0 as  τ → 0, while it is equal to 1 if  yk = y∗. The denominator goes to  ∞ if y∗ < 0, while it goes to  1 if y∗ > 0. Combining these observations finishes the proof.

Proposition 2: If  yk ∼ N(µk, σk) for k = 1, . . . , K − 1, and we define the discrete random variable

image

where  φ and Φare the standard Gaussian pdf and cdf, respectively.

Proof: For  k = 1, . . . , K − 1, we have:

image

which finishes the first part of the proof. The remaining probability, P(H = K) can obviously be recovered as one minus the sum of the above probabilities, but we can also obtain the following expression:

image

which finishes the proof.

image

Table 4: Discretized Train ELBO (not log-likelihood) on MNIST, FMNIST and Omniglot for IGR and GS. Higher is better.

In our experiments with RELAX [10] in section D of the appendix we approximate the required integrals using a Gaussian quadrature as in Steen et al. [30], and backpropagate through this procedure. Note that the involved integrals are one-dimensional and thus can be accurately approximated with quadrature methods. Although we found better performance with these approximations than with a Monte Carlo approximation, we found the method prone to numerical instabilities, which we solved by limiting the range of values that  µ and σare allowed take as follows:

image

where  µ′ and σ′ are the parameters that we optimize over.

As mentioned in the main manuscript, our VAE experiments closely follow Maddison et al. [19]: we use the same continuous objective and the same evaluation metrics. The experiments differ to Jang et al. [12] since they use a KL term as in equation 8 of the main manuscript, whereas Maddison et al. [19] use a continuous KL as in equation 9 of the main manuscript. Using the former KL results in optimizing a continuous objective which is not a log-likelihood lower bound anymore, which is mainly why we followed Maddison et al. [19].

In addition to the reported comparisons in the main manuscript, we include further comparisons in Table 4 reporting the discretized training ELBO instead.

Tucker et al. [32] and Grathwohl et al. [10] proposed REBAR and RELAX, respectively. These are variance reduction techniques which heavily lean on the GS to improve the variance of the obtained gradients. We make several important notes: First, REBAR is a special case of RELAX, so that we will only compare against RELAX. Second, RELAX takes advantage of the parameter interpretability of the GS, as it considers the gradients of the relaxed objective as approximations to the gradients of the objective of interest:

image

where  αis a discrete distribution, which we think of as a vector of length K and  ˜qα,τis a GS distribution. RELAX builds upon equation 38 to develop an estimator with reduced variance. Extending this observation to IGR is not immediately straightforward, as  ∇µ,σEqµ,σ,τ (˜z)[ ˜f(˜z)]is not an approximation to the gradient on the left hand size of the above equation: it is not even the same shape. However, thanks to proposition 2 we can parameterize a discrete distribution using µand  σ, so that  α(µ, σ)is the discrete distribution given by proposition 2. This way, instead of directly optimizing over the discrete distribution, we optimize over its parameters,  µand  σ, so that the gradient of interest becomes  ∇µ,σEz∼α(µ,σ)[f(z)], and its corresponding approximation:

image

where  ˜qµ,σ,τis an IGR distribution, thus enabling the use of RELAX along IGR. Third, it should also be noted that the bias and variance of the gradient estimator of RELAX are central points of discussion by Grathwohl et al. [10]. However, comparing bias and variance between the GS and IGR is a difficult task, as they are intrinsically approximating different gradients (equations 38 and 39, respectively). To make the fairest possible comparison, we compare between IGR and the GS not by trying to estimate biases and variances, but by empirically comparing the recovered discrete objectives. Ultimately, bias and variance of a stochastic gradient estimator are used as proxies for how adequately optimized the corresponding objective will be, so that directly comparing on this metric is sensible. We show results of running IGR and GS with and without RELAX in Table 5.

image

Table 5: Test log-likelihood on MNIST for nonlinear architecture. Higher is better.

Finally, Kool et al. [16] proposed USPGBL, an unbiased estimator (unlike the GS or IGR, which are biased), which is based on sampling without replacement. Their method requires using several approximate posterior samples to estimate the ELBO. We used S = 4 samples, and for a fair comparison against GS and IGR, we also estimated the ELBO using 4 samples (instead of 1, which we used in every other experiment). Results are in Table 6 and we can see that again, IGR performs best.

image

USPGBL -106.89  ± 1.37Table 6: Test ELBO on MNIST for nonlinear architecture with 4 samples. Higher is better.

In this section we describe the hyperparameters and architecture for the VAEs we use. The choice of hyperparameters and architecture are aligned with [19, 12, 32, 10, 16].

image

Encoder: One fully connected dense layers of 200 units with linear activation. Decoder: Symmetrical to the Encoder. One fully connected dense layer of 200 units with linear activation.

image

Encoder: Two fully connected dense layers of 512 units and 256 units respectively. The nonlinear activations are ReLu.

Decoder: Symmetrical to the Encoder. Two fully connected dense layers of 256 units and 512 units respectively. The nonlinear activations are ReLu.

The hyperparameters are shared across the models. The only thing that changes is the temperature, which is selected through cross validation as specified in the main manuscript. We use the following configuration:

image

image

For the structure output prediction task the architecture used is:

Four-layered non-linear architecture 240  ∼240 - (first sample) & 240  ∼240 - (second sample)

First sample is taken from double-layer 240 with tahn activation followed by a layer with 240 units with a linear activation

image

The hyperparameters used are

image

For the nonparameteric mixture model the architecture used is:

image

Encoder: 3 fully connected dense layers with 200 units and a ReLu activation. Decoder: 3 fully connected dense layers with 200 units and a ReLu activation.

The hyperparameters used are

image

Next we compare the GS and the IGR in approximating discrete distributions. We took 1,000 samples of the learned parameters of the IGR from solving equation 14 from the main manuscript.

image

Figure 3: IGR approximation to a Binomial(N = 12, p = 0.3)

image

Figure 4: GS approximation to a Binomial(N = 12, p = 0.3)

We observe how both methods approximate the Binomial adequately, although it seems that the advantage of the IGR-SB to better approximate countably infinite distributions was not translated to this simple example.

image

Figure 5: IGR approximation to a Discrete defined as  p = ( 1046, 146, 546, 146, 1046, 1046, 146, 646, 146, 146).

image

Figure 6: GS approximation to a Discrete defined as  p = ( 1046, 146, 546, 146, 1046, 1046, 146, 646, 146, 146).

Results for this discrete distribution are similar to those observed on the Binomial.

image

Figure 7: IGR-SB approximation to a Negative Binomial(r = 50, p = 0.6).

image

Figure 8: IGR-SB approximation to a Negative Binomial(r = 50, p = 0.6).

Here again we see how the GS has difficulty approximating another distribution with a countably infinite support. The GS with K = 40 (middle-purple) doest not assign mass to the right tail where as the GS with K = 100 has difficulty taking out sufficient weight from the right tail of the distribution.


Designed for Accessibility and to further Open Science