Generalized Gumbel-Softmax Gradient Estimator for Generic Discrete Random Variables

2020·Arxiv

Abstract

Abstract

Estimating the gradients of stochastic nodes in stochastic computational graphs is one of the crucial research questions in the deep generative modeling community, which enables the gradient descent optimization on neural network parameters. Stochastic gradient estimators of discrete random variables are widely explored, for example, Gumbel-Softmax reparameterization trick for Bernoulli and categorical distributions. Meanwhile, other discrete distribution cases such as the Poisson, geometric, binomial, multinomial, negative binomial, etc. have not been explored. This paper proposes a generalized version of the Gumbel-Softmax estimator, which is able to reparameterize generic discrete distributions, not restricted to the Bernoulli and the categorical. The proposed estimator utilizes the truncation of discrete random variables, the Gumbel-Softmax trick, and a special form of linear transformation. Our experiments consist of (1) synthetic examples and applications on VAE, which show the efficacy of our methods; and (2) topic models, which demonstrate the value of the proposed estimation in practice.

Keywords: Reparameterization Trick, Discrete Random Variable, Gumbel-Softmax Trick, Variational Autoencoder, Deep Generative Model

1. Introduction

Stochastic computational graphs, including variational autoencoders (VAEs) [1], are widely used for probabilistic modeling, representation learning, and generating data in the deep generative model (DGM) society. Optimizing the network parameters through back-propagating gradients requires an estimation of the gradient values. However, the stochasticity requires the computation of expectation, which differentiates this problem from the deterministic gradient of ordinary neural networks. Regarding such a perspective, there are two common ways of estimating the gradients from stochastic nodes: score function (SF) methods and reparameterization methods. The SF-based estimators tend to result in unbiased gradients with high variances, hence, the SF-based estimators aim to reduce the variances of gradients for stable and fast optimizations. Meanwhile, the reparameterization estimators result in biased gradients with low variances [2], but they require the differentiable non-centered parameterization [3] of random variables.

For continuous random variables such as the Gaussian, the reparameterization estimators are widely utilized due to the nature of the differentiability of the typical continuous distributions [1, 4, 5]. Nowadays, it is feasible to estimate gradients for continuous cases with automatic differentiation [6, 7] in TensorFlow [8] or PyTorch [9]. Meanwhile, for the discrete random variables which follow Bernoulli or categorical distributions, the SF-based methods are widely explored since the reparameterization method can not directly work due to the non-differentiability.

Alternatively, the Gumbel-Softmax trick [10, 11] overcomes this difficulty through the reparameterization with continuous relaxation of one-hot selection values. Also, a line of works [12, 13] utilizes both the SF-based method and the reparameterization method.

While the Bernoulli and the categorical cases have been studied deeply, the gradient estimators for other discrete distributions have hardly been explored. There are several gradient estimators which are specialized in discrete distributions on combinatorial spaces, such as k-hot vector, permutation, and spanning trees on graphs [14, 15]. However, other renowned discrete distribution cases, such as Poisson, binomial, multinomial, geometric, negative binomial distributions, etc., have not been explored, which we mainly focus on in this paper. Prior works on probabilistic graphical models, such as Ranganath et al. [16, 17], adopted Poisson latent variables for latent counting. Another line of work [18] utilized the Gaussian approximation on the Poisson to count the number of words in deep generative modeling, which can be a poor approximation when the rate parameter is small. Regarding those perspectives, the stochastic gradient estimator for the discrete distribution needs to be studied further to extend the choice of prior assumptions.

This paper proposes a generalized version of the GumbelSoftmax trick, which can reparameterize the broader ranges of discrete distributions, not limited to the Bernoulli and the categorical. The proposed Generalized Gumbel-Softmax (GenGS) is probabilistically grounded, and it utilizes (1) a truncation to finitize the infinite supports of the discrete distributions; and (2) a transformation that enables the generalization of the GumbelSoftmax trick. Moreover, through the truncation, GenGS provides a posterior inference procedure that can be either explicit or implicit as a practical implementation. Our experiments show the efficacy with synthetic examples and VAEs, as well as the usability in topic model applications.

2. Preliminary

2.1. Back-propagation through Stochastic Nodes

In the stochastic computational graph, assume that there is an intermediate stochastic node or a latent variable z (z), where the distribution depends on the parent node as a distribution parameter. The goal is optimizing the objective function [f(z)], where fis a differentiable function with respect to , i.e., the neural networks. To optimize the objective function with respect to the parameter through the gradient methods, we need to compute which is intractable with its original form.

SF-based estimator utilizes a score function (z) by utilizing log-derivative trick to compute the gradient [19]. Then, the gradient ) can be derived as the below:

The SF methods compute exact gradients due to the unbiasedness with the Monte Carlo property. However, those methods suffer from the high variance of gradients, which results in the slow and unstable convergence of the objective function. To reduce the variance of gradients, the control variate methods [20, 21] or the RaoBlackwellization [22, 23] are widely used.

Reparameterization trick, illustrated in Figure 1(a), introduces an auxiliary variable ), which takes over all randomness of the latent variable z, to compute . Then, the sampled value z can be re-written as z ), with a deterministic and differentiable function g in terms of . Here, the gradient ) is derived as the below:

Equation (2) is computable, however, the differentiability requires the continuity of the random variable z, so the distribution of z is limited to the continuous distributions.

2.2. Gumbel Tricks

To utilize the differentiable reparameterization trick on discrete random variables, continuous relaxation can be applied. A Gumbel-Softmax (GS) trick [10, 11] is an approximation of a Gumbel-Max (GM) trick, which are alternatives of a one-hot categorical sampling. The categorical random variable Z where 1)-simplex , can be reparameterized by the GM trick: (1) draw uj 1) to generate a Gumbel sample gj ) for j ; and (2) compute k . This procedure generates a one-hot sample z such that zk , and zeros in other entries. Instead of the argmax in the GM, the GS utilizes the softmax with a temperature 0, i.e., z . This substitution relaxes the discreteness of the categorical random variable to the one-hot-like form in the continuous domain. In other words, if we denote GM() and GS() as the distribution generated by the GM trick and the GS trick, respectively, then GS(

3. Methodology

3.1. Problem Setting

We begin an explanation on GenGS with discrete distribution having finite support. Assume that a random variable Z follows a discrete distribution Q() with explicit probability mass function (PMF) and a finite support C . Next, define an outcome vector c = (c1) and the corresponding PMF value vector ) in the same index order. Here, each sample ck can be either scalar or vector (or, even matrix) by the choice of distribution Q. Finally, for w = (w1, introduce a transformation T as follows:

3.2. Sampling through Generalized Gumbel Tricks

We claim that the sampling process of Z can be derived by the one-hot categorical selection process, i.e., the GM process. Suppose that we draw a sample Z , then the sampled value ck can be alternatively selected as follows. Since we fix the index order of the outcome vector c and its corresponding PMF vector , the one-hot indicating vector of the sample ck can be regarded as w = one_hot(ck|c), which can be alternatively sampled from GM(). Then, the remaining step is a conversion of w = one_hot(ck|c) to ck, and this step can be done by the linear transformation T in Equation (3). Hence, we can reparameterize the original sample value ck as T (GM()), as stated in Proposition 1.

Proposition 1. For the transformation T , if a discrete random variable Z has a finite support with known PMF values, Z can be reparameterized by

Proof. With the transformation T and the constant outcome vector c ), the following holds:

Since the transformation T is a deterministic function, by introducing the auxiliary Gumbel random variable, we can remove the randomness of Z from . Hence, the discrete random variable Z with finite support can be reparameterized by the GM trick and the transformation T .

3.3. Relaxation of One-Hot Sample

Replacing the argmax in the GM trick to the softmax enables differentiation, which is a key point of the GS reparameterization. In Proposition 2, we claim that the same argument still holds under the transformation T : if W ) where is a PMF vector of discrete random variable Z ) with finite support, then Z can be continuously relaxed and reparameterized as )), which we define as G

Proposition 2. For the transformation T and a categorical parameter , the convergence property of GS to GM still holds under the linear transformation as

Figure 1: Visualization of reparameterization trick. The shaded nodes indicate random nodes, and the dotted lines denote sampling processes. The auxiliary random variables enable the back-propagation flow of the gradients.

Proof. Define fM and f be GM reparameterization and GS reparameterization with a temperature 0, respectively, where the both take the categorical parameter and a Gumbel sample as inputs. For the Gumbel sample g, if we assume gm , then the following holds:

where ej is a n-dimensional one-hot vector, which has 1 in the jth entry and 0 in all other entries. As 0, the statement GS(

Hence, f (holds for some relaxed one-hot vector of em by introducing the softmax relaxation. As a consequence,

Hence, by taking the summation, the following holds:

3.4. Finitizing Support by Truncation

In Section 3.1, we assumed that the support of the distribution to be finite. Without the finite assumption, the Gumbel tricks cannot be applied since it requires a limited number of categories. Hence, for the discrete distributions with infinite support, we truncate tails of the distributions to approximate the original distribution. In that manner, we can extend GenGS to the discrete distribution with infinite support. Definition 3 utilizes truncation range to truncate the non-negative discrete random variable.

Definition 3. For a non-negative discrete random variable X D(), define Zn = X if X 1, and Zn 1 if X > n. The random variable Zn is said to follow a truncated discrete distribution TD() with a parameter and a truncation range R = [0, n). Alternatively, we write as truncation level R = n if the left truncation is at zero in the non-negative case.

Definition 3 not only finitizes the support but also provides the modification of the PMF: (Zn (X ) for k 1, and (Zn . For the truncated random variable Zn ) having an outcome vector c = (c1), the ordering of ck is not significant, as long as the index of the outcome vector c and the PMF vector are aligned. Proposition 4 provides a theoretical basis that Zn approximates X as the truncation range widens enough. Hence, the truncation enables GenGS to be extendedly applied to the discrete distributions with infinite support such as the Poisson, geometric, negative binomial, etc. Similarly, Definition 3 and Proposition 4 can be generalized to general discrete distribution as Proposition 5 which we omit the proof.

Proposition 4. Zn converges to X almost surely as n

Proof. Note that . Then, we have the following:

since X is a non-negative discrete random variable.

Proposition 5. For a discrete random variable X (i) Zmif m < X < n; (ii) Zm1 if X ; and (iii) Zmconverges to X almost surely as m

4. Hyper-Parameters & Variations of GenGS

Figure 1(b) illustrates the full steps of the GenGS trick, and Algorithm 1 shows the alternative sampling process of GenGS. Figure 2 describes how the GenGS trick approximates the original distribution by adjusting the temperature and the truncation range. The following are the hyper-parameter and the variations of the proposed GenGS.

Temperature. Since GenGS utilizes the GS, GenGS inherits the temperature as a hyper-parameter. The decrement of temperature from results in the closer distribution to the original distribution. However, the initially small leads to high bias and variance of gradients, which becomes problematic at the learning stage on . Hence, the annealing of from a large value to a small value is necessary to provide a learning chance of

Figure 2: (a) Approximation of GenGS in terms of choices of the truncation level n and the temperature (7). As sub-figures go from left to right, the truncation level grows. Hence, the popped-out sticks, implying remaining probability of the right side, disappears if the truncation level is large enough. As sub-figures go from top to bottom, the temperature decreases, and the PMF of truncated distributions and the original distributions becomes similar. (b) On the y-axis, as temperature is a computed PMF value of TD(), according to Proposition 2. TD() can be reparameterized by the GM trick with a linear transformation T as in Proposition 1. On the x-axis, TD() as truncation level n , according to Proposition 4.

Truncation Range. The truncated distribution becomes closer to the original distribution as we widen the truncation range R, and the choice of truncation range is crucial in terms of covering many probable samples or the modes of the distribution. Hence, we can set the truncation range to cover most of the support through a certain threshold. Also, for the popular discrete distributions such as Poisson, geometric, negative binomial, etc. which have a single mode in the PMF, we can use the mode as a criterion. Alternatively, we can set non-parameteric truncation range by thresholding, as in Algorithm 2.

Implicit Inference. The stochastic gradient estimators or the alternative sampling with the reparameterization tricks are widely utilized for the sampling from some distribution after infering the distribution parameter. In such cases, there are distribution regularizer in the objective function, for example in VAEs, the KL divergence between the prior distribution and the approximate posterior distribution. To do so, we first infer the distribution parameter , and then the proposed GenGS can be

utilized by computing the PMF values explicitly. However, we can instead directly infer the PMF values with softmax function, not through the distribution parameter . We name the two cases as explicit inference and implicit inference (GenGS Imp), respectively, and GenGS Imp becomes possible by truncating distribution due to the finiteness of the dimension in the softmax function with the neural networks. Although the implicit inference cannot directly infer the distribution parameter we found that loosening the posterior shape leads to a significant performance gain in our VAE examples. Algorithm 3 presents GenGS Imp process.

Discretization. Straight-Through (ST) method [24, 10] can be applied to GenGS for drawing discrete samples. This variation samples discrete values in the feed-forward step, but utilizes the estimated gradient from the continuously relaxed and reparameterized GenGS in the back-propagation step. However, as in the original GS with the ST technique, GenGS with ST results in significant performance degradation in our synthetic examples.

5. Related Work

RF denotes the basic REINFORCE [19]. NVIL [20] utilizes a neural network to introduce the optimal control variate. MuProp [21] utilizes the first-order Taylor expansion on the objective function as a control variate. VIMCO [25] is designed as multi-sample gradient estimator. REBAR [12] and RELAX [13] utilize reparameterization trick for constructing the control variate. DetRB [22] uses the weighted value of the fixed gradients from m-selected categories and the estimated gradients from the remaining categories with respect to their odds to reduce the variance. The idea of StoRB [23] is essentially same as that

Figure 3: Synthetic example performance curves in log scale: (Top Row) losses, variances, and biases of gradients for Poisson; (Middle Row) losses for Binomial, Multinomial, and NegativeBinomial; (Bottom Row) variances of gradients for Binomial, Multinomial, and NegativeBinomial. We utilize the cumulative average for smoothing the curves following from Kool et al. [23].

of DetRB, but StoRB randomly chooses the categories at each step. Kool et al. [23] also suggested UnOrd, utilizing samples without replacements. Hence, DetRB, StoRB, and UnOrd can be considered as multi-sample gradient estimators. The denotes a built-in control variate introduced in the work of Kool et al. [23]. IRG [26] is an alternative of the GS and its variations can be applied to approximate the PMF including the Poisson, etc. However, IRG does not infer the distribution parameter as GenGS Imp, and the paper mainly focuses on replacing the GS rather than reparameterizing the general class of the discrete random variables.

6. Experiment

6.1. Synthetic Example

Experimental Setting. This experiment expands the toy experiments from Tucker et al. [12], Grathwohl et al. [13] to diverse discrete distributions. We optimize the objective function (zi )2with respect to parame- ter for fixed constants t1. Here, we set distribution p(z) as Poisson(), Binomial(20), Multinomial(3), and NegativeBinomial(3). For the Poisson and the negative binomial, which have infinite supports, we also applied Algorithm 2 to search the full range of support without pre-defined truncation range. Since the GenGSes are basically a single-sample gradient estimator, we utilize a single sample of z for the fundamental comparison among the gradient estimators in the toy example.

Experimental Result. Figure 3 compares the log-loss and the log-variance of estimated gradients from various estimators. The log-loss needs to be minimized with the estimated gradient value in the learning process by back-propagation. Additionally, the log-variance requires being minimized to maintain the consistency of the gradients, so the gradient descent can be efficient. The GenGSes show the best log-loss and the best log-variance if the GenGSes keep the continuous relaxation of the modeled discrete random variable. For the Poisson, the exact gradient can be computed in closed-form, and the GenGSes show the lowest bias among all gradient estimators.

6.2. VAE: Synthetic Experiment on DGMs

Experimental Setting. To test the performance of the gradient estimators in DGMs, we adopt VAE which is one of the simplest DGMs. We follow the VAE experiment scheme of Figurnov et al. [6] which utilizes various prior distributions. In our discrete case, we utilize the Poisson (Pois), the geometric (Geom), and the negative binomial (NB) distributions, as the latent factor count. Note that the purpose of the VAE experiments is not to compare the performance across various prior distributions such as the categorical or the Gaussian, but to compare the performance across gradient estimators within the same prior distributions.

Table 1: Training negative ELBO on MNIST and OMNIGLOT datasets. The lower is better for the negative ELBO, and the best and the second best results are marked in bold and underline, repecitvely. The symbol “—” indicates no convergence.

The VAE is considered as a more challenging task than the synthetic example, since (1) this task requires computing the gradients of the encoder network parameters through the latent distribution parameter ; and (2) each stochastic gradient of the latent dimension affects every encoder parameter since we are utilizing the fully-connected layers. Hence, a single poorly estimated gradient of the latent distribution parameter could harm the learning of encoder parameters, so the VAE experiment can dynamically show the performance of the gradient estimators. Objective function of the VAE is the evidence lower bound (ELBO) [log p(x(q(z(z)). In the GenGSes, by truncating the original distribution, the KL divergence between the approximate posterior and the prior distributions becomes the derivation with categorical distributions.

Proposition 6. Assume two truncated distributions X TD(Then, the KL divergence between X and Y can be represented in the KL divergence between the categorical distributions where KL(Y||X) = KL(Categorical(ˆCategorical(

We separately compare the performance of the single-sample and multi-sample estimators with 10 samples. For the GenGSes, we anneal the temperature from 1. to .1, and set truncation levels 12 for Pois(2); 15 for Pois(3); 25 for Geom(.25); 15 for Geom(.5); 30 for NB(3, .5); and 30 for NB(5, .3).

Experimental Result. To compare the estimators in the optimization perspective, Table 1 provides the training negative ELBO curves on MNIST and OMNIGLOT. DetRB, D, and StoRB are not listed since they hardly lead to the optimal point. The variants of GenGS showed the lowest negative ELBO in general for both the single and the multi cases. Also, loosening the PMF condition (i.e., the implicit inference) reached the better optimal points. The empirical reason why the implicit version is better than the explicit version is that the posterior PMF shape in the implicit case is thinner. Hence, the implicit

Figure 4: (Left) A graphical notation of NVPDEF with the generative (inference () processes. (Right) A neural network view of NVPDEF: diamond nodes indicate the auxiliary random variable for the reparameterization trick.

distribution has a lower variance than the explicit one, and samples consistent values that lead to better trained neural network parameters, although it loses the original PMF shape.

6.3. Application on Topic Model

Experimental Setting. This experiment shows practical usage of discrete distribution with GenGS in the deep generative topic modeling. The authors of Deep Exponential Families (DEFs) [16] utilized the exponential family on the stacked latent layers. We focus on the Poisson DEF, which assumes the Poisson latent layers to capture the count of latent super-topics and sub-topics. We convert the Poisson DEF into a neural variational form, which resembles NVDM [27], namely NVPDEF.

The generative and the inference process are

Gen: zk Inf: ˆ

where MLR stands for multinomial logistic regression adopted from Miao et al. [27]. Each zk ) represents the count distribution of sub-topics from the super-topic, which is approximated by posterior q(zk(ˆ). Each component of Wk, wkis positive, and wkcaptures the positive weight of relationship between super-topic i of the kth layer and sub-topic j of the (k layer. Finally, NVPDEF optimizes

where z0 = x for simplifying the equation. Figure 4 shows the graphical notation and the neural network structure.

For the single-stacked version of NVPDEF, we set 75 with truncation level 15 and fix the temperature as multi-stacked version of NVPDEF, we set truncation level 15. To have better chances of learning during

Table 2: NVPDEF test perplexity on 20Newsgroups and RCV1. (m) in the model name indicates that m samples are used for estimating the gradients. The lower is better for the perplexity, and the best results are marked in bold. The symbol “—” indicates no convergence.

the training period in the multi case, we utilize temperature annealing from to 5, and multi-sample on the latent layers for the stable optimization of consecutive sampling. We compare the estimators by perplexity exp(Nd is the number of words in document d, and D is the total number of documents.

Experimental Result. We enumerate the results of NVPDEF with various gradient estimators in Table 2, and we confirmed that GenGS gives the lowest perplexity with 20Newsgroups and RCV1, which is aligned with the VAE results. Especially, the multi-stacked RCV1 case extremely shows the performance of the gradient estimators that many of the gradient estimators fail to reach the optimal point and GenGS gives the lowest perplexity.

7. Conclusion

This paper suggests variants of GenGS, a generalized version of the Gumbel-Softmax estimator, with the theoretical background. The synthetic analysis and the VAE experiment demonstrate the efficacy of GenGS, and the topic model application shows the usage of GenGS. The proposed GenGSes can be simply implemented, and the experimental result shows that the variants of GenGS lead to the better optimal point compared to existing stochastic gradient estimators. With the generalization, we expect that GenGS can diversify the options of distributions in the deep generative model community.

Declaration of Competing Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2022-00166289).

References

[1] D. P. Kingma, M. Welling, Auto-encoding variational bayes, International Conference on Learning Representations (2014).

[2] M. Xu, M. Quiroz, R. Kohn, S. A. Sisson, Variance reduction properties of the reparameterization trick, International Conference on Artificial Intelligence and Statistics (2019).

[3] D. P. Kingma, M. Welling, Efficient gradient-based inference through transformations between bayes nets and neural nets, International Conference on Machine Learning (2014).

[4] E. Nalisnick, P. Smyth, Stick-breaking variational autoencoders, International Conference on Learning Representations (2017).

[5] W. Joo, W. Lee, S. Park, I. C. Moon, Dirichlet variational autoencoder, Pattern Recognition, 107, 107514 (2020).

[6] M. Figurnov, S. Mohamed, A. Mnih, Implicit reparameterization gradients, Advances in Neural Information Processing Systems (2018).

[7] M. Jankowiak, F. Obermeyer, Pathwise derivatives beyond the reparameterization trick, International Conference on Machine Learning (2018).

[8] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., Tensorflow: A system for largescale machine learning, USENIX Symposium on Operating Systems Design and Implementation (2016).

[9] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, Pytorch: An imperative style, high-performance deep learning library, Advances in Neural Information Processing Systems (2019).

[10] E. Jang, S. Gu, B. Poole, Categorical reparameterization with gumbel-softmax, International Conference on Learning Representations (2017).

[11] C. J. Maddison, A. Mnih, Y. W. Teh, The concrete distribution: A continuous relaxation of discrete random variables, International Conference on Learning Representations (2017).

[12] G. Tucker, A. Mnih, C. J. Maddison, J. Lawson, J. Sohl-Dickstein, Rebar: Low-variance, unbiased gradient estimates for discrete latent variable models, Advances in Neural Information Processing Systems (2017).

[13] W. Grathwohl, D. Choi, Y. Wu, G. Roeder, D. Duvenaud, Backpropagation through the void: Optimizing control variates for black-box gradient estimation, International Conference on Learning Representations (2017).

[14] A. Gadetsky, K. Struminsky, C. Robinson, N. Quadrianto, D. P. Vetrov, Low-variance black-box gradient estimates for the plackett-luce distribution, The Association for the Advancement of Artificial Intelligence (2020).

[15] M. B. Paulus, D. Choi, D. Tarlow, A. Krause, C. J. Maddison, Gradient estimation with stochastic softmax tricks, Advances in Neural Information Processing Systems (2020).

[16] R. Ranganath, L. Tang, L. Charlin, D. Blei, Deep exponential families, Artificial Intelligence and Statistics (2015).

[17] R. Ranganath, A. Perotte, N. Elhadad, D. Blei, Deep survival analysis, Machine Learning for Healthcare Conference, PMLR (2016).

[18] J. Wu, Y. Rao, Z. Zhang, H. Xie, Q. Li, F. L. Wang, Z. Chen, Neural mixed counting models for dispersed topic discovery, Annual Meeting of the Association for Computational Linguistics (2020).

[19] R. J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine Learning, 8(3-4), 229-256 (1992).

[20] A. Mnih, K. Gregor, Neural variational inference and learning in belief networks, International Conference on Machine Learning (2014).

[21] S. Gu, S. Levine, I. Sutskever, A. Mnih, Muprop: Unbiased backpropagation for stochastic neural networks, International Conference on Learning Representations (2016).

[22] R. Liu, J. Regier, N. Tripuraneni, M. I. Jordan, J. McAuliffe, Raoblackwellized stochastic gradients for discrete distributions, International Conference on Machine Learning (2019).

[23] W. Kool, H. van Hoof, M. Welling, Estimating gradients for discrete random variables by sampling without replacement, International Conference on Learning Representations (2020).

[24] Y. Bengio, N. Leonard, A. Courville, Estimating or propagating gradients through stochastic neurons for conditional computation, arXiv preprint arXiv:1308.3432, (2013).

[25] A. Mnih, D. J. Rezende, Variational inference for monte carlo objectives, International Conference on Machine Learning (2016).

[26] A. Potapczynski, G. Loaiza-Ganem, J. P. Cunningham, Invertible gaussian reparameterization: Revisiting the gumbel-softmax, Advances in Neural Information Processing Systems (2020).

[27] Y. Miao, L. Yu, P. Blunsom, Neural variational inference for text processing, International Conference on Machine Learning (2016).