MetFlow: A New Efficient Method for Bridging the Gap between Markov Chain Monte Carlo and Variational Inference

2020·Arxiv

Abstract

Abstract

In this contribution, we propose a new computationally efficient method to combine Variational Inference (VI) with Markov Chain Monte Carlo (MCMC). This approach can be used with generic MCMC kernels, but is especially well suited to MetFlow, a novel family of MCMC algorithms we introduce, in which proposals are obtained using Normalizing Flows. The marginal distribution produced by such MCMC algorithms is a mixture of flow-based distributions, thus drastically increasing the expressivity of the variational family. Unlike previous methods following this direction, our approach is amenable to the reparametrization trick and does not rely on computationally expensive reverse kernels. Extensive numerical experiments show clear computational and performance improvements over state-of-the art methods.

1. Introduction

One of the biggest computational challenge these days in machine learning and computational statistics is to sample from a complex distribution known up to a multiplicative constant. Indeed, this problem naturally appears in Bayesian inference (Robert, 2007) or generative modeling (Kingma & Welling, 2013). Very popular methods to address this problem are Markov Chain Monte Carlo (MCMC) algorithms (Brooks et al., 2011) and Variational Inference (VI) (Wainwright et al., 2008; Blei et al., 2017). The main contribution of this paper is to present a new methodology to successfully combine these two approaches mitigating their drawbacks and providing the state-of-the-art sampling quality from high dimensional unnormalized distributions. Starting from a parameterized family of distributions Q = , VI approximates the intractable distribution with density on by maximizing the evidence lower bound (ELBO) defined by

using an unnormalized version . Indeed, this approach consists in mini- mizing The design of the family Q of variational distributions has a huge influence on the overall performance – more flexible families provide better approximations of the target.

Recently, it has been suggested to enrich the traditional mean field variational approximation by combining them with invertible mappings with additional trainable parameters. A popular implementation of this principle is the Normalizing Flows (NFs) approach (Dinh et al., 2016; Rezende & Mohamed, 2015; Kingma et al., 2016) in which a mean-field variational distribution is deterministically transformed through a fixed-length sequence of parameterized invertible mappings. NFs have received a lot of attention recently and have proven to be very successful for VI and in particular for Variational Auto Encoder; see (Kobyzev et al., 2019; Papamakarios et al., 2019) and the references therein.

The drawback of variational methods is that they only allow the target distribution to be approximated by a parametric family of distributions. On the contrary, MCMC are generic methods which have theoretical guarantees (Robert & Casella, 2013). The basic idea behind MCMC is to design a Markov chain whose stationary distribution is Under mild assumptions, the distribution of converges to the target goes to infinity. Yet, this convergence is in most cases very slow and therefore this class of methods can be prohibitively computationally expensive.

The idea to “bridge the gap” between MCMC and VI was first considered in (Salimans et al., 2015) and has later been pursued in several works; see (Wolf et al., 2016), (Hoffman et al., 2019) and (Caterini et al., 2018) and the references therein. In these papers, based on a family of Markov kernels with target distribution and depending on trainable parameters , the family Q consists in the marginal distribution obtained after K iterations of these Markov kernels.

In this paper, we develop a new approach to combine the VI and MCMC approaches. Compared to (Salimans et al., 2015) and (Wolf et al., 2016), we do not need to extend the variational approximation on the joint distribution of the K samples of the Markov chain and therefore avoid to introduce and learn reverse Markov kernels.

Our main contributions can be summarized as follows:

1) We propose a new computationally tractable ELBO which can be applied to most MCMC algorithms, including Metropolis-Adjusted Langevin Algorithm - MALA - and Hamiltonian Monte Carlo -HMC-. Compared to (Hoffman, 2017), the Markov kernels can be jointly optimized with the initial distribution

2) We propose an implementation of our approach MetFlow using a new family of ergodic MCMC kernels in which the proposal distributions are constructed using Normalizing Flows. Then, combining these Markov kernels with classical mean-field variational initial distributions, we obtain variational distributions with more expressive power than NF models, at the reasonable increase of the computational cost. Moreover, unlike plain NFs, we guarantee that each Markov kernel leaves the target invariant and that each iteration improves the distribution.

3) We present several numerical illustrations to show that our approach allows us to meaningfully trade-off between the approximation of the target distribution and computation, improving over state-of-the-art methods. The following link provides access to the implementation of the proposed method and all the experiments: https:// github.com/stat-ml/metflow.

Our paper is organized as follows. In Section 2, we start by describing our new methodology. Then, in Section 3, we introduce MetFlow, a class of “deterministic” MCMC algorithm, taking advantage of the flexibility of Normalizing Flows as proposals. Section 4 discusses the related work in more detail. The benefits of our method are illustrated through several numerical experiments in Section 5. Finally, we discuss the outcomes of the study and some future research directions in Section 6.

2. A New Combination Between VI and

2.1. Basics of Metropolis-Hastings

The Metropolis Hastings (MH) algorithm to sample a density w.r.t. the Lebesgue measure on defines a Markov chain with stationary distribution as follows. Conditionally to the current state , a proposal is sampled where a sequence of i.i.d. random variables valued in a measurable space (U, U), with density h w.r.t. to a -finite measure and is a measurable function, referred to as the proposal mapping, and parameterized by . In this work, collectively denotes the parameters used in the proposal distribution, and is referred to as the innovation noise. Then, is accepted with probability designed so that the resulting Markov kernel, denoted by , is reversible w.r.t. , i.e. satisfies the detailed balance . With this notation, can be written, for

where is the family of Markov kernels given for any

In this definition, stands for the Dirac measure at z and is a family of acceptance functions related to the MH acceptance probabilities by

To illustrate this definition, consider first the symmetric Random Walk Metropolis Algorithm (RWM). In such case, is the Lebesgue measure, and g is the D-dimensional standard normal density. The proposal mapping is

where is a parametric family of positive definite matrices, and the acceptance function is given by

Consider now the Metropolis Adjusted Langevin Algorithm (MALA); see (Besag, 1994). Assume that is differentiable and denote by its gradient. The proposal mapping and the associated acceptance function for MALA algorithm are given by

where is the proposal kernel density.

2.2. Variational Inference Meets Metropolis-Hastings

Let on a parametric family of distributions and density functions w.r.t . Con- sider now the following variational family

obtained by iteratively applying to the initial distribution the Markov kernels

The objective of VI approach (Wainwright et al., 2008; Blei et al., 2017) is to minimize the Kullback-Leibler (KL) divergence w.r.t. the parameter to find the distribution which best fits the target . For any , and , denote by , and similarly for any ,

The key assumption in this section is that for any and is a diffeomorphism. This property is satisfied under mild condition on the proposal. In particular, this holds for RWM and MALA associated with the proposal mappings . It holds also for Hamiltonian Monte Carlo at the expense of extending the state space to include a momentum variable, see the supplementary paper Appendix A.4 and Appendix B.3. However, it is in general not needed to specify a valid MCMC procedure. For a diffeomorphism absolute value of the Jacobian determinant at

Lemma 1. Let . Assume that admits a density w.r.t. the Lebesgue measure. Assume in addi- tion is a diffeomorphism. Then, the distribution has a density w.r.t. the Lebesgue measure given by

and has a density given by

Proof. Let f be a nonnegative measurable function on (6) follows from the change of variable

An induction argument extends this property to the K-th marginal . Let us define . For a family of mappings on and for a sequence of innovation noise densities w.r.t. Finally, set

Proposition 1. Assume that for any a diffeomorphism and admits a density Lebesgue measure. For any , the distribution has a density by

where

In particular, for a sequence of innovation noise densities, has a density w.r.t. the Lebesgue measure, explicitly given, for any

We can now apply the VI approach the family Q defined in (5). Consider a family of inference function

This family may depend upon some parameters, implicit in this notation. As shown below, our objective is to take very simple expressions for those functions. We define our ELBO , using now the extended space , by

Note that is a lower bound of L expressed in (1) since defining and , we obtain

where denotes the KL diver-

gence between We specify the inference functions . In particular, a simple choice is , where r is a similar family of inference function on {0, 1}. This architecture is based on the representation of the Markov kernel (3) we built and simplifies our ELBO. In the following, we always assume the form of such inference function. Note here that the key step of our approach for defining is to rely on the representation (2), allowing us to write explicitly our marginals compared to (Salimans et al., 2015).

The ELBO can be optimized w.r.t. , typically by stochastic gradient methods, which requires an unbiased estimator of the gradient . Such estimator can be obtained using the reparameterization trick (Rezende et al., 2014). The implementation of this procedure is a bit more involved in our case, as the integration is done on a mixture of the components . To develop an understanding of the methodology, we consider first the case K = 1. Denote by g the density of the D-dimensional standard Gaussian distribution. Suppose for example that we can write with . Other parameterization could be handled as well. With two changes of variables, we can integrate w.r.t. g, which implies that

Justification and extension to the general case is given in the supplementary paper, see Appendix B.1. From (11), using the general identity gradient of is given by

Therefore, an unbiased estimator of can be obtained by sampling independently the proposal innovation and the starting point , and then the acceptation . A complete derivation for the case is given in the supplementary paper, Appendix B.2.

3. MetFlow: MCMC and Normalizing Flows

In this section, we extend the construction above to a new class of MCMC methods, for which the proposal mappings are Normalizing Flows (NF). Our objective is to capitalize on the flexibility of NF to represent distributions, while keeping the exactness of MCMC. This new class of MCMC are referred to as MetFlow, standing for Metropolized Flows.

Consider a flow parametrized by It is assumed that for any is a diffeomorphism. Set . For any , consider the involution , defined for

The variable v is called the direction. If v = 1 (respectively ), the “forward”(resp. “backward”) flow ) is used. For any , , we define the kernel

where is the acceptance function.

Proposition 2. Let be a distribution on V, and . Assume that satisfies for any

Then for any defined by (13) is reversible with respect to . In particular, if for any

for is satisfied if for any

Remark 1. The condition (14) on the acceptance ratio has been reported in (Tierney, 1998, Section 2) (see also (Andrieu & Livingstone, 2019, Proposition 3.5) for extensions to the non reversible case). Standard choices for the acceptance function are the Metropolis-Hastings and Barker ratios which correspond to and respectively.

If we define for

we retrieve the framework defined in Section 2. In turn, from a distribution for the direction, the family defines a MH kernel, given for

M

The key result of this section is

Corollary 1. For any and any distribution , the kernel is reversible w.r.t.

Consider for example the MALA proposal mapping If we set

with , then for any and any distribution is reversible w.r.t. , which is not the case for defined in (3) with acceptance function given by (4). Recall indeed that the reversibility is only satisfied for the kernel

As the reversibility is satisfied for any , we typically get rid of the integration w.r.t. the innovation noise and rather consider a fixed sequence of proposal noise. For RWM or MALA, this sequence could be a completely uniformly distributed sequence as in Quasi Monte Carlo method for MCMC, see (Schwedes & Calderhead, 2018; Chen et al., 2011). Using definition (15) and Proposition 1, we can write the density of the dis- tribution . Setting as in the previous section, we can write, for any ,

Moreover, as reversibility is satisfied for any distribution we could let it depend upon some parameters also denoted and write . Defining an inference function , we can thus obtain the lower bound parametrized by the fixed sequence

for which stochastic optimization can be performed using the same reparametrization trick (11).

The choice of the transformation is really flexible. Let be a family of K diffeomorphisms on . A flow model based on is defined as a compo- sition that pushes an initial distribution with density to a more complex target distri- bution with density , given for any by , see (Tabak & Turner, 2013; Rezende & Mohamed, 2015; Kobyzev et al., 2019; Papamakarios et al., 2019). We now proceed to the construction of MetFlow, based on the same deterministic sequence of diffeomorphisms. A MetFlow model is obtained by applying successively the Markov kernels , written as, for ,

Each of those is reversible w.r.t. the stationary distribution and thus leaves invariant. In such a case, the resulting distribution is a mixture of the pushforward of flows The parameters of the flows are optimized by maximizing an ELBO similar to (17) in which is substituted by with . The kernel shares some similarity with Transformation-based MCMC (T-MCMC) introduced in (Dutta & Bhattacharya, 2014). However, the transformations considered in (Dutta & Bhattacharya, 2014) and later by (Dey et al., 2016) are elementary additive or multiplicative transforms acting coordinate-wise. In our contribution, we considered much more sophisticated transformations, inspired by the recent advances on normalizing flows.

Among the different flow models which have been considered recently in the literature (Papamakarios et al., 2019), we chose Real-Valued Non-Volume Preserving (RNVP) flows (Dinh et al., 2016) because they are easy to compute and invert. An extension of our work would be to consider other flows, such as Flow++ (Ho et al., 2019), which can also be computed and inverted efficiently. We could also use autoregressive models, such as Inverse Autoregressive Flows (Kingma et al., 2016), which have a tractable - albeit non parallelizable - inverse. Even more expressive flows like NAF (Huang et al., 2018), BNAF (Cao et al., 2019) or UMNN (Wehenkel & Louppe, 2019) could be experimented with. Although they are not invertible analytically, this problem could be solved either by the Distribution Distillation method (van den Oord et al., 2017), or simply by a classic root-finding algorithm: this is theoretically tractable because of the monotonous nature of these flows, and empirically satisfactory (Wehenkel & Louppe, 2019).

4. Related Work

In this section, we compare our method with the state-of-the-art for combining MCMC and VI. The first attempt to bridge the gap between MCMC and VI is due to (Salimans et al., 2015). The method proposed in (Salimans et al., 2015) uses a different ELBO, based on the joint distribution of the K steps of the Markov chain MetFlows are based on the marginal distribution of the K-th component. (Salimans et al., 2015) introduce an auxiliary inference function r and consider the ELBO:

where is the joint distribution of the path w.r.t. the Lebesgue measure. An optimal choice of the auxiliary inference distribution is , the conditional distribution of the Markov chain path given its terminal value , but this distribution is in most cases intractable. (Salimans et al., 2015; Wolf et al., 2016) discuss several way to construct sensible approximations of by introducing learnable time inhomogeneous backward Markov kernels. This introduces additional parameters to learn and degrees of freedom in the choice on the reverse kernels which are not easy to handle. On the top of that, this increases significantly the computational budget.

(Hoffman, 2017) also suggests to build a Markov Chain to enrich the approximation of . More precisely, (Hoffman, 2017) optimizes the ELBO with respect to the initial distribution , and only uses the MCMC steps to produce “better” samples to the target distribution. However, there is no feedback from the resulting marginal distribution there to optimize the parameters of the variational distribution . This method does not thus directly and completely uni-fies VI and MCMC, even though it simplifies the process by avoiding the use of the extended space and the reverse kernels. (Ruiz & Titsias, 2019) refines (Hoffman, 2017) by using a contrastive divergence approach; compared to the methodology presented in this paper, (Ruiz & Titsias, 2019) do not capitalize on the expression of the marginal density.

Another solution to avoid reverse kernels is considered in (Caterini et al., 2018) which amounts to remove randomness from an Hamiltonian MC algorithm. However, by getting rid of the accept-reject step and the resampling of the momenta in the HMC algorithm, this approach forgoes the guarantees that come with exact MCMC algorithms.

5. Experiments

In this section, we illustrate our findings. We present examples of sampling from complex synthetic distributions which are often used to benchmark generative models, such as a mixture of highly separated Gaussians and other nonGaussian 2D distributions. We also present posterior inference approximations and inpainting experiments on MNIST dataset, in the setting outlined by (Levy et al., 2017). Many more examples are given in the supplementary paper.

We implement the MetFlow algorithm described in Section 3 to highlight the efficiency of our method. For our learnable transitions , we use RNVP flows.

We consider two settings. In the deterministic setting, we use K different RNVP transforms , and the pa- rameters for each individual transform are different. In the pseudo-randomized setting, we define global transformation are K independent draws from a standard normal distribution. In such case, the parameters are the same for the flows , only the innovation noise differs. Typically, RNVP are encoded by neural networks. In the second setting, the network will thus take as input z and u stacked together.

In the second setting, once training has been completed and a fit of the parameters has been obtained, we can sample additional noise innovations . We then consider the distribution given by is typically the uni- form on , as defined as in Section 3. mK corresponds to the length of the final Markov chain we consider. In practice, we have found that sampling additional noise innovations this way yields a more accurate approximation of the target, thanks to the asymptotic guarantees of MCMC.

Barker ratios (see Remark 1) have the advantage of being differentiable everywhere. Metropolis-Hastings (MH) ratios are known to more efficient than Barker ratios in the Peskun sense, see (Peskun, 1973). Moreover, although they are not differentiable at every point , differentiating MH ratios is no harder than differentiating a ReLu function. We thus use MH ratios in the following.

All code is written with the Pytorch (Paszke et al., 2017) and Pyro (Bingham et al., 2018) libraries, and all our experiments are run on a GeForce GTX 1080 Ti.

5.1. Synthetic data. Examples of sampling.

5.1.1. MIXTURE OF GAUSSIANS

The objective is to sample from a mixture of 8 Gaussians in dimension 2, starting from a standard normal prior distribution , and compare MetFlow to RNVP. We are using an architecture of five RNVP flows (K = 5), each of which is parametrized by two three-layer fully-connected neural networks with LeakyRelu (0.01) activations. In this example, we consider the pseudo-randomized setting. The results for MetFlow and for RNVPs alone are shown on Figure 1. First, we observe that while our method successfully finds

Figure 1. Sampling a mixture of 8 Gaussian distributions. Top row from left to right: Target distribution, MetFlow, MetFlow with 145 resampled innovation noise. Bottom row from left to right: Prior distribution, First run of RNVP, Second run of RNVP. MetFlow finds all the modes and improves with more iterations, while RNVP depend on a good initialization to find all the modes and fails to separate them correctly.

all modes of the target distribution, RNVP alone struggles to do the same. Our method is therefore able to approximate multimodal distributions with well separated modes. Here, the mixture structure of the distribution (with potentially modes) produced by MetFlow is very appropriate to such a problem. On the contrary, classical flows are unable to approximate well separated modes starting from a simple unimodal prior, without much surprise. In particular, mode dropping is a serious issue even in small dimension. Moreover, an other advantage of MetFlow in the pseudo randomized setting is to be able to iterate the learnt kernels which still preserve the target distribution. Iterating MetFlow kernels widens the gap between both approaches, significantly improving the accuracy of our approximation.

5.1.2. NON-GAUSSIAN 2D DISTRIBUTIONS

In a second experiment, we sample the non-Gaussian 2D distributions proposed in (Rezende & Mohamed, 2015). Figure 2 illustrates the performance of MetFlow compared to RNVP. We are again using 5 RNVPs (K = 5) with the architecture described above, and use the pseudo-randomized setting for MetFlow. After only five steps, MetFlow already finds the correct form of the target distribution, while the simple RNVP fails on the more complex distributions. Moreover, iterating again MetFlow kernels allows us to approximate the target distribution with striking precision, after only 50 MCMC steps.

Figure 2. Density matching example (Rezende & Mohamed, 2015) and comparison between RNVP and MetFlow.

Figure 3. Mixture of ’3’ digits. Top: Fixed digits, Middle: NAF samples, Bottom: MetFlow samples. Compared to NAF, MetFlow is capable to mix better between these modes, while NAF seems to collapse.

5.2. Deep Generative Models

Deep Generative Models (DGM), such as Deep Latent Gaussian Models (see Kingma & Welling (2013); Rezende et al. (2014)) have recently become very popular. The basic assumption in a DGM is that the observed data x is generated by sampling a latent vector z which is used as the input of a deep neural network. This network then outputs the parameters of a family of distributions (e.g., the canonical parameters of exponential family like Bernoulli or Gaussian

Figure 4. Top line: Mean-Field approximation and MetFlow, Middle line: Mean-Field approximation, Bottom line: Mean-Field Approximation and NAF. Orange samples on the left represent the initialization image. We observe that MetFlow easily mixes between the modes while other methods are stuck in one mode.

distributions) from which the data are sampled. Given data generated by a DGM, a classical problem is to approach the conditional distribution p(z | x) of the latent variables z given the observation x, using variational inference to construct an amortized approximation.

We consider the binarized MNIST handwritten digit dataset. The generative model is as follows. The latent variable z is a l = 64 dimensional standard normal Gaussian. The observation is a vector of D = 784 bits. The bits are, given the latent variable z, conditionally inde- pendent Bernoulli distributed random variables with success probability where is the output of a con- volutional neural network. In this framework, is called the decoder. In the following, we show that our method provides a flexible and accurate variational approximation of the conditional distribution of the latent variable given the observation , outperforming mean-field and Normalizing Flows based approaches.

As we are focusing in this paper on the comparison of VI methods to approximate complex distributions and not on learning the Variational Auto Encoder itself, we have chosen to use a fixed decoder for both Normalizing Flows (here, Neural Autoregessive Flows) and MetFlow (with RNVP transforms). The decoder is obtained using state-of-the-art method described in the supplementary paper. We can illustrate the expressivity of MetFlow in two different ways. We first fix L different samples. In this example, we take L = 3 images representing the digit “3”. We are willing to approximate, for a given decoder , the posterior distribution . We show in Figure 3 the de- coded samples corresponding to the following variational approximations of : (i) a NAF trained from the decoder to approximate and (ii) MetFlow in the deterministic setting with K = 5 RNVP flows.

Figure 3 shows that the samples generated from (i) collapse essentially to one mode corresponding to the first digit. On the contrary, MetFlow is able to capture the three different modes of the posterior and generates much more variability in the decoded samples. The same phenomenon is observed in different settings by varying L and the digits chosen, as illustrated in the supplementary paper.

We now consider the in-painting set-up introduced in (Levy et al., 2017, Section 5.2.2). Formally, we in-paint the top of an image using Block Gibbs sampling. Given an image x, we denote the top and the bottom half pixels. Starting from an image , we sample at each step and then . We the set . We give the output of this process when sampling from the mean-field approximation of the posterior only, the mean-field pushed by a NAF, or using our method. The result for the experiment can be seen on Figure 4.

We can see that MetFlow mixes easily between different modes, and produces sharp images. We recognize furthermore different digits (3,5,9). It is clear from the middle plot that the mean-field approximation is not able to capture the complexity of the distribution . Finally, the NAF improves the quality of the samples but does not compare to MetFlow in terms of mixing.

6. Conclusions

In this paper, we propose a novel approach to combine MCMC and VI which alleviates the computational bottleneck of previously reported methods. In addition, we design MetFlow, a particular class of MH algorithms which fully takes advantage of our new methodology while capitalizing on the great expressivity of NF. Finally, numerical experiments highlight the benefits of our method compared to state-of-the-arts VI.

This work leads to several natural extensions. All NF applications can be adapted with MetFlow, which can be seen as a natural extension of a NF framework. MetFlow are very appropriate for VAE by amortizing. Due to lack of space, we did not present applications with Forward KL divergence. The mixture structure of the distribution obtained by MetFlow suggests the Variational Expectation Maximization (VEM) is a sensible strategy and in particular a chunky version of VEM in the case where the number of steps K is large (Verbeek et al., 2003).

References

Andrieu, C. and Livingstone, S. Peskun-Tierney ordering for Markov chain and process Monte Carlo: beyond the reversible scenario. arXiv preprint arXiv:1906.06197, 2019.

Besag, J. Comments on representations of knowledge in complex systems by u. Grenander and m. Miller. J. Roy. Statist. Soc. Ser. B, 56:591–592, 1994.

Bingham, E., Chen, J. P., Jankowiak, M., Obermeyer, F., Pradhan, N., Karaletsos, T., Singh, R., Szerlip, P., Horsfall, P., and Goodman, N. D. Pyro: Deep universal probabilistic programming, 2018.

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859877, Feb 2017. ISSN 1537-274X. doi: 10.1080/01621459.2017. 1285773.

Bou-Rabee, N. and Jes´us Mar´ıa, S.-S. Geometric integrators and the Hamiltonian Monte Carlo method. Acta Numerica, pp. 1–92, 2018.

Brooks, S., Gelman, A., Jones, G., and Meng, X.-L. Handbook of Markov chain Monte Carlo. CRC press, 2011.

Cao, N. D., Titov, I., and Aziz, W. Block neural autoregressive flow. In UAI, 2019.

Caterini, A. L., Doucet, A., and Sejdinovic, D. Hamiltonian variational auto-encoder. In Advances in Neural Information Processing Systems, pp. 8167–8177, 2018.

Chen, S., Dick, J., Owen, A. B., et al. Consistency of Markov chain quasi-Monte Carlo on continuous state spaces. The Annals of Statistics, 39(2):673–701, 2011.

Dey, K. K., Bhattacharya, S., et al. On geometric ergodicity of additive and multiplicative transformation-based Markov chain Monte Carlo in high dimensions. Brazilian Journal of Probability and Statistics, 30(4):570–613, 2016.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using real NVP. arXiv preprint arXiv:1605.08803, 2016.

Duane, S., Kennedy, A., Pendleton, B. J., and Roweth, D. Hybrid Monte Carlo. Physics Letters B, 195(2):216 – 222, 1987. ISSN 0370-2693.

Dutta, S. and Bhattacharya, S. Markov chain Monte Carlo based on deterministic transformations. Statistical Methodology, 16:100116, Jan 2014. ISSN 1572-3127. doi: 10.1016/j.stamet.2013.08.006.

Ho, J., Chen, X., Srinivas, A., Duan, Y., and Abbeel, P. Flow++: Improving flow-based generative models with variational dequantization and architecture design. arXiv preprint arXiv:1902.00275, 2019.

Hoffman, M., Sountsov, P., Dillon, J. V., Langmore, I., Tran, D., and Vasudevan, S. Neutra-lizing bad geometry in Hamiltonian Monte Carlo using neural transport. arXiv preprint arXiv:1903.03704, 2019.

Hoffman, M. D. Learning deep latent Gaussian models with Markov chain Monte Carlo. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 1510–1519, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

Huang, C.-W., Krueger, D., Lacoste, A., and Courville, A. Neural autoregressive flows. arXiv preprint arXiv:1804.00779, 2018.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improved variational inference with inverse autoregressive flow. In Advances in neural information processing systems, pp. 4743–4751, 2016.

Kobyzev, I., Prince, S., and Brubaker, M. A. Normalizing flows: Introduction and ideas. arXiv preprint arXiv:1908.09257, 2019.

Levy, D., Hoffman, M. D., and Sohl-Dickstein, J. Generalizing Hamiltonian Monte Carlo with neural networks. arXiv preprint arXiv:1711.09268, 2017.

Neal, R. M. Slice sampling. Ann. Statist., 31(3):705–767, 06 2003. doi: 10.1214/aos/1056562461.

Neal, R. M. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, pp. 113–162, 2011.

Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., and Lakshminarayanan, B. Normalizing flows for probabilistic modeling and inference. arXiv preprint arXiv:1912.02762, 2019.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. 2017.

Peskun, P. H. Optimum Monte Carlo sampling using Markov chains. Biometrika, 60(3):607–612, 1973.

Rezende, D. and Mohamed, S. Variational inference with normalizing flows. In Bach, F. and Blei, D. (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 1530–1538, Lille, France, 07–09 Jul 2015. PMLR.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

Robert, C. The Bayesian choice: from decision-theoretic foundations to computational implementation. Springer Science & Business Media, 2007.

Robert, C. and Casella, G. Monte Carlo statistical methods. Springer Science & Business Media, 2013.

Ruiz, F. and Titsias, M. A contrastive divergence for combining variational inference and MCMC. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 5537– 5545, Long Beach, California, USA, 09–15 Jun 2019. PMLR.

Salakhutdinov, R. and Murray, I. On the quantitative analysis of deep belief networks. In Proceedings of the 25th international conference on Machine Learning, pp. 872– 879, 2008.

Salimans, T., Kingma, D., and Welling, M. Markov chain Monte Carlo and variational inference: Bridging the gap. In International Conference on Machine Learning, pp. 1218–1226, 2015.

Schwedes, T. and Calderhead, B. Quasi Markov chain Monte Carlo methods. arXiv preprint arXiv:1807.00070, 2018.

Tabak, E. G. and Turner, C. V. A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics, 66(2):145–164, 2013.

Tierney, L. A note on Metropolis-Hastings kernels for general state spaces. Ann. Appl. Probab., 8(1):1–9, 02 1998.

van den Oord, A., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K., van den Driessche, G., Lockhart, E., Cobo, L. C., Stimberg, F., Casagrande, N., Grewe, D., Noury, S., Dieleman, S., Elsen, E., Kalchbrenner, N., Zen, H., Graves, A., King, H., Walters, T., Belov, D., and Hassabis, D. Parallel wavenet: Fast high-fidelity speech synthesis, 2017.

Verbeek, J., Vlassis, N., and Nunnink, J. A variational EM algorithm for large-scale mixture modeling. In Vassiliades, S., Florack, L., Heijnsdijk, J., and van der Steen, A.

(eds.), 9th Annual Conference of the Advanced School for Computing and Imaging (ASCI ’03), pp. 136–143, Heijen, Netherlands, June 2003.

Wainwright, M. J., Jordan, M. I., et al. Graphical models, exponential families, and variational inference. Foundations and Trends Rin Machine Learning, 1(1–2):1–305, 2008.

Wehenkel, A. and Louppe, G. Unconstrained monotonic neural networks. In Advances in Neural Information Processing Systems, pp. 1543–1553, 2019.

Wolf, C., Karl, M., and van der Smagt, P. Variational inference with Hamiltonian Monte Carlo. arXiv preprint arXiv:1609.08203, 2016.

Supplementary Material A. Proofs

A.1. Proof of Proposition 1

The proof is by induction on . The base case K = 1 is given by Lemma 1. Assume now that the statement holds for . Then noticing that , using again Lemma 1 and the induction hypothesis, we get that admits a density w.r.t. the Lebesgue measure given for any

Using the induction hypothesis, the density w.r.t. the Lebesgue measure of of the form (7). Therefore, we obtain that

where in the last step, we have used that for any differentiable functions for any

A.2. Proof of Proposition 2

Let . We want to find a condition such that the kernel is reversible w.r.t. is a distribution on V. This means that for any

(19) By definition of (13), the left-hand side simplifies to

where using that is an involution, and for , the change of variable

where in the last step, we have the change of variable . As for the right-hand side of (19), we have by definition (13),

Therefore combining this result with (20)-(21), we get that if (14) holds then is reversible w.r.t.

Moreover, let us suppose that is a function satisfying and for any . If for any

π

then

which concludes the proof for Proposition 2

A.3. Proof of Corollary 1

Let us suppose that for any are chosen such that defined by (13) is invariant, where is any distribution on V. Then, by definition, for any

Thus concluding that is reversible w.r.t. , for any distribution

A.4. Checking the Assumption of Lemma 1 for RWM and MALA algorithms

which clearly is a diffeomorphism with inverse

In the simple case where , using Proposition 1, we get

MALA: We prove here that under appropriate conditions, the transformations defined by the Metropolis Adjusted Langevin Algorithm (MALA) are diffeomorphisms. We consider only the case where . The general case can be easily deduced by a simple adaptation. Remember, for MALA, , where U is defined as

Proposition 3. Assume the potential U is gradient Lipschitz, that is there exists L in such that for any , . Then for any diffeomorphism.

Proof. Let and . First we show that is invertible. Consider, for each y in , the mapping . We have, for

and is a contraction mapping and thus has a unique fixed point and we have:

and existence and uniqueness of the fixed point thus complete the proof for invertibility of . The fact that the inverse of follows from a simple application of the local inverse function theorem.

Therefore, Proposition 1 can be applied again. Although there is no explicit expression available for because of the intractability of the inverse of , numerical approximations can be used.

B. Reparameterization trick and estimator of the gradient

B.1. Expression for the reparameterization trick

The goal of the reparameterization trick is to rewrite a distribution depending on some parameters as a simple transformation of a fixed one. The implementation of this procedure is a bit more involved in our case, as the integration is now done on a mixture of the components , for . To develop an understanding of the methodology we suggest, we consider first the case K = 1. Recall that g stands for the density of the standard Gaussian distribution over , and suppose here that there exists a diffeomorphism such that for any , which is the basic assumption of the reparameterization trick. With the two changes of variables,

This result implies that we can integrate out everything with respect to g.

The intuition is the same after K steps, and we can write:

B.2. Unbiased estimator for the gradient of the objective

We now show in the following that we can also estimate without bias the gradient of the objective starting from the expression (24) of the ELBO assuming that we can perform the associated reparameterization trick.

Again, we start by the simple case K = 1 to give the reader the main idea of the proof. Using for any function , the gradient of our ELBO is given for any

Now, this form is particularly interesting, because we can access an unbiased estimator of this sum by sampling and and computing the expression between brackets.

The method goes the same way for the K-th step. Indeed for any , we have using for any function ,

And again, this sum can be estimated by sampling , then recursively and for i > 1,

Those variables sampled, the expression between brackets provides an unbiased estimator of the gradient of our objective.

B.3. Extension to Hamiltonian Monte-Carlo

The following proposition show a key result for applicability of HMC to our method.

Proposition 4. Assume that for any defines an involution, i.e. . If for any

then the kernel defined by (3) and acceptance functions and transformation is reversible w.r.t.

This result is direct consequence of Proposition 2. In particular, the functions identified in Proposition 2 are still applicable here.

An important special example which falls into the setting of Proposition 4 is the Hamiltonian Monte-Carlo algorithm (HMC) (Duane et al., 1987; Neal, 2011). In such a case, the state variable is , where q stands for the position and p the momentum. The unnormalized target distribution is defined as is the density of the D-dimensional standard Gaussian distribution. Define the potential . Hamiltonian dynamics propagates a particle in this extended space according to Hamilton’s equation, for any

This dynamics preserves the extended target distribution as the flow described above is reversible, symplectic and preserves the Hamiltonian H(q, p), defined as the sum of the potential energy and the kinetic energy (note that we can write ), see (Bou-Rabee & Jes´us Mar´ıa, 2018). It is not however usually possible to compute exactly the solutions of the continuous dynamics described above. However, reversibility and symplectiness can be preserved exactly by discretization using a particular scheme called the leap-frog integrator. Given a stepsize elementary leap-frog is given by

The N-steps leap-frog integrator with stepsize is defined by -times composition of For some parameters , consider now the two following transformations:

Here, the parameter stands for the stepsize and the auto-regressive coefficient a in the momentum refreshment transform. Other parameters could be included as well; see for example (Levy et al., 2017). For any is a continuously differentiable diffeomorphism. Then, taking h = g and setting the acceptance ratios to , it is easily showed that defined by (2) – with – is reversible w.r.t. . On the other hand, by composition and a straightforward induction, for any is continuously differentiable if is twice continuously differentiable on and the determinant of its Jacobian is equal to 1 on . In addition, is an involution by reversibility to the momentum flip operator of the Hamiltonian dynamics (Bou-Rabee & Jes´us Mar´ıa, 2018, Section 6). Indeed, is here written as the composition of and the momentum flip operator . (Bou-Rabee & Jes´us Mar´ıa, 2018) show indeed that satisfies the following property . As S is also an involution, we can write is an involution. Thus, Proposition 4 applies and the expression given for is the classical acceptance ratio for HMC when . HMC algorithm is obtained by alternating repeatedly these two kernels and . To fall exactly into the framework outlined above, one might consider the extension , the transformation

and the two densities is an arbitrary density since depend on u. The kernel defined by (2) associated with this transformation and density , and similarly . Then, the density after K HMC steps with parameters can be written as

C. Optimization Procedure

C.1. Optimization in the general case

We saw in the previous section a way to compute an unbiased estimator of our objective. We will perform gradient ascent in the following, using the estimator provided before.

At each step of optimization, we will sample and then sequentially and for . With those variables, we can compute the expression between brackets in the formula above (25). We then perform a stochastic gradient scheme, repeating the process until convergence of our parameters. A detailed algorithm highlighting the simplicity of our method is presented in C.1. Note that our method “follows” a trajectory, conditioned on noise and accept/reject Booleans , which is conceptually equivalent to a flow pushforward.

C.2. Optimization for MetFlow

We give in the following a detailed algorithm for the optimization procedure for MetFlow, highlighting the low computational complexity of our method compared to the previous attempts. Note that the increased complexity is only linear in K the number of steps in our Markov Chain. The functions denoted as acceptance function are for :

Define the Rademacher distribution Rad(p) on with parameter by . The noise v for MetFlow will be sampled using a Rademacher distribution, whose parameter p is let to depend on some parameters which will be optimized.

D. Experiments

In all the sampling experiments presented in the main paper (mixture of Gaussians, (Rezende & Mohamed, 2015) distributions) as well as for the additional experiments presented here (Neal’s funnel distribution, mixture of Gaussians in higher dimensions), the flows used are real non volume preserving (R-NVP) (Dinh et al., 2016) flows. For , one R-NVP transform is defined as follows, for any is the element-wise product, with cardinal |I|, called an auto regressive mask, and , . Typically, the two functions are parametrized by a neural network and that case represents the corresponding weights. The use of this kind of functions is justified by the fact that inverses for these flows can be easily computed and have a tractable Jacobian. Indeed, a straightforward calculation leads to for any with

In our experiments, we consider a generalization of this setting which we refer to as latent noisy NVP (LN-NVP) flows defined for , of the form for any with , where I is an auto-regressive mask and . All the results on R-NVP apply to our LN-NVP, in particular for any and

D.1. Mixture of Gaussians: Additional Results

We use MetFlow with the pseudo random setting. Recall that by this, we mean that we define a unique function and before training, we sample innovation noise that will be “fixed” during all the training process. We then perform optimization on the “fixed” flows . The specific setting we consider first is as follows. We compare the sampling based on variational inference with R-NVP Flow and our methodology MetFlow with LN-NVP. In both case, each elementary transformation we consider is the composition of 6 NVP transforms. Here

with D = 2 and the target distribution is the one described in Section 5.1. Each function t and s in the NVP flows is a neural network with one fully connected hidden layers of size 4 for R-NVP and LN-NVP, and LeakyRelu (0.01) activation functions, and final layers with activation functions tanh for s and identity for t. Each method is trained for 25000 iterations, with ADAM (Kingma & Ba, 2014), with learning rate of and betas = (0.9, 0.999), and an early stopping criterion of 250 iterations (If within 250 iterations, the objective did not improve, we stop training). The prior distribution is a standard normal Gaussian. Figure 5 and Figure 6 show the results and the effect of each trained flow. The first pictures are gradient-coloured along the x-axis, with the colour of each point corresponding to its position in the previous image. This helps us to understand how well the processes mixes and transforms an original distribution. It is interesting to note that our method produces flows that are all relevant and all help the distribution to improve at each step. On the contrary, classical R-NVP and flows in general only take interest in what happens at the very last (K-th) step: the pushforward distribution can take any form in between. However, this representation is only interesting when each of the transformations have different sets of parameters, increasing a lot the total numbers of parameters to tune. The evolution of our method can be interpreted as well with the presence of acceptance ratios, that at each step “filter” or “tutor” the distribution, helping it not to get too far from the target. This allows us as well to introduce the setting where we can iterate our flows to refine more and more the approximation we produce in the end. We also show that our method is robust to a change in the prior (even a violent one). In the following figure, we change the standard normal prior to a mixture of two well-separated Gaussian distributions, with a standard deviation of 1 and means . Figure 7 shows that MetFlow still remains efficient after iterating only a few times the learnt kernel, without retraining it at any point. It obviously does not work for R-NVP. Recall that by ”iterating a kernel”, we mean (as is explained in the main paper) that additional noise is sampled to define other transformations and thus additional MetFlow kernels. This particular property could find applications with

Figure 5. Consecutive outputs of each MetFlow kernel. Left: prior normal distribution, then successive effect of the 5 trained MetFlow kernels.

Figure 6. Consecutive outputs of each block of R-NVP. Left: prior normal distribution, then successive effect of the 5 trained R-NVP blocks - 6 transforms each.

time-changing data domains. Indeed, it does not require retraining existing models, while remaining an efficient way to sample from a given distribution at low computational cost.

Figure 7. Changing the prior to a mixture of two separated Gaussians, having trained the method on a standard normal prior. Top row, from left to right: Subsituted prior, 5 trained MetFlow kernels, re-iteration of 100 MetFlow kernels, 200 MetFlow kernels. Bottom row: Substituted prior, R-NVP flow.

In addition to the 2-dimensional results presented in the paper, we show here the results of our method for mixture of Gaussians of higher dimensions. Figure 8 shows the number of Gaussians retrieved by our model when the target distribution is a mixture of 8 Gaussians with variance 1, located at the corners of the d-dimensional hypercube, for different values of d. We see that our method significantly outperforms others, be they state-of-the-art Normalizing Flows (NAF) or MCMC methods (NUTS). Furthermore, our method scales more efficiently with dimension, as it is still able to retrieve two of the modes in dim 100 (for some runs), when other methods collapse to one mode for . MetFlow here is used in the pseudo random setting, with 7 LN-NVP blocks of two elementary transforms, t and s being as previously two one-layer fully connected neural networks, input dimension of twice the dimension considered 2D (again, we take ) and hidden dimension of 2D. The activations functions stay the same. NUTS is ran with 1500 warm-up steps and a length of 10 000 samples.

Figure 8. Number of modes retrieved by different methods. The target distribution is a mixture of 8 isotropic Gaussian distributions of variance 1 located at the corners of a d-dimensional hypercube. The methods are trained to convergence or retrieval of all modes, and mode retrieval is computed by counting the number of samples in a ball of radius 4d around the center of a mode. Error bars represent the standard deviation of the mean number of modes retrieved for different runs of the method (different initialization and random seed).

Figure 9. Density matching for funnel. Top row: Target distribution, MetFlow with 5 trained kernels, MetFlow with 5 trained kernel iterated 100 times. Bottom row: Prior distribution, First run of 5 R-NVP, second run of 5 R-NVP

D.2. Funnel distribution

We now test our approach on an other hard target distribution for MCMC proposed in (Neal, 2003). The target distribution has density Lebesgue density w.r.t. the Lebesgue measure given for any

We are using exactly the same setting for MetFlow with LN-NVP and R-NVP as for the mixture of Gaussian. We train the flows with the same optimizer, for 25000 iterations again. Figure 9 show the results of MetFlow and R-NVP flows. Again, as we are in the pseudo random setting, we sample 95 additional innovation noise and show the distribution produced by 100 MetFlow kernels, having only trained the first five transformations , as described in Section 5 of the main document. We can observe that after only five steps, the distribution has been pushed toward the end of the funnel. However, the amplitude is not recovered fully. It can be interpreted in light of the “tutoring” analogy we used previously. As the Accept/Reject control at each step the evolution of the points, if the number of steps is too small, the proposals do not got to the far end of the distribution. However, the plots show that the proposal given by MetFlow are still learnt relevantly, as Figure 9 illustrates that iterating only a few more MetFlow kernels matches the target distribution in all its amplitude.

D.3. Real-world inference - MNIST

In the experiments ran on MNIST described in the main paper, we fix a decoder . To obtain our model, we use the method described in (Kingma et al., 2016). The mean-field approximation output by the encoder is here pushforward by a flow. In (Kingma et al., 2016), the flows used were Inverse Autoregressive Flows, introduced in the same work. We choose here to use a more flexible class of flows neural auto-regressive flows (NAF), introduced in (Huang et al., 2018), which show better results in terms of log-likelihood. In practice, we use convolutional networks for our encoder and decoder, matching the architecture described in (Salimans et al., 2015). The inference network (encoder) consists of three convolutional layers, each with filters of size and a stride of 2, and output 16, 32, and 32 feature maps, respectively. The output of the third layer feeds a fully connected layer with hidden dimension 450, which then is fully connected to the outputs, means and standard deviations, of the size of our latent dimension, here 64. Softplus activation functions are used everywhere except before outputting the means. For the decoder, a similar but reversed architecture is used, using upsampling instead of stride, again as described in (Salimans et al., 2015). The Neural Autoregressive Flows are given by pyro library. NAF have a hidden layer of 64 units, with an AutoRegressiveNN which is a deep sigmoidal flow (Huang et al., 2018), with input dimension 64 and hidden dimension 128. Our data is the classical stochastic binarization of MNIST (Salakhutdinov & Murray, 2008). We train our model using Adam optimizer for 2000 epoches, using early stopping if there is no improvement after 100 epoches. The learning rate used is , and betas (0.9, 0.999). This produces a complex and expressive model for both our decoder and variational approximation.

We show first using mixture experiments that MetFlow can overcome state-of-the-art sampling methods. The mixture experiment described in the main paper goes as follows. We fix L different samples, and wish to approximate the complex posterior . We give two approximations of this distribution, given by a state-of-the-art method, and MetFlow. The state-of-the-art method is a NAF a hidden layer of 16 units, with an AutoRegressiveNN which is a deep sigmoidal flow (Huang et al., 2018), with input dimension 64 and hidden dimension 128. MetFlow is trained here in the deterministic setting, with 5 blocks of 2 R-NVPs, where again each function t and s is a neural network with one fully connected hidden layers of size 128 and LeakyRelu (0.01) activation functions, and final layers with activation functions tanh for s and identity for t. MetFlow and NAF are optimized using 10000 batches of size 250 and early stopping tolerance of 250, with ADAM, with learning rate of and betas = (0.9, 0.999). We use Barker ratios as well here. The prior in both cases is a standard 64-dimensional Gaussian.

Figure 10. Fixed digits for mixture experiment.

Figure 11. Mixture of 3, MetFlow approximation.

Figure 12. Mixture of 3, NAF approximation.

We see on Figure 12 that if NAF “collapses” to one fixed digit (thus one specific mode of the posterior), MetFlow Figure 11 is able to find diversity and multimodality in the posterior, leading even to “wrong” digits sometimes, showing that it truly explores the complicated latent space.

Moreover, we can compare the variational approximations computed for our VAE (encoder - mean field approximation

Figure 13. Gibbs inpainting experiments starting from digit 0.

Figure 14. Gibbs inpainting experiments starting from digit 3.

- and encoder and NAF) to MetFlow approximation. MetFlow used here are the composition of 5 blocks of 2 R-NVPs (deterministic setting), where each function t and s is a neural network with one fully connected hidden layers of size 128 and LeakyRelu (0.01) activation functions, and final layers with activation functions tanh for s and identity for t. MetFlow is optimized using 150 epoches of 192 batches of size 250 over the dataset MNIST, with ADAM, with learning rate of and betas = (0.9, 0.999) and early stopping with tolerance of 25 (if within 25 epoches the objective did not improve, we stop training). The optimization goes as follows. We fix encoder (mean-field approximation) and decoder (target distribution). For each sample x in a minibatch of the dataset, we optimize MetFlow starting from the prior given by the encoder (mean-field approximation) and targetting posterior . Note that during all this optimization procedure, the initial distribution of MetFlow is fixed to be the encoder, which corresponds to freeze the corresponding parameters. This is a simple generalization of our method to amortized inference. In the following, we use Barker ratios. The additional results are given by Figures 13 to 17.

The encoder represents just the mean-field approximation here, while encoder and NAF represent the total variational distribution learnt by the VAE described above.

D.4. Additional setting of experiments

So far, we have described two settings. The first one, that we have called deterministic, in which the transformations take no input “innovation noise”, but all have different sets of parameters. The second one, pseudo random, defines one global transformation with a unique set of parameters , considers K initially sampled at random “innovation noise” , and considers densities such that the noise resampled at every step of the optimization is “fixed”. It then learns the parameters for the global transformation using the considered transformations We can consider here a last setting, fully random, on which a global transformation with a unique set of parameters is considered. However, now, we consider a unique density h (typically Gaussian) to sample the “innovation noise” at every step of the optimization - note that this is still covered by Algorithm C.2. This allow us to consider properly random

Figure 15. Gibbs inpainting experiments starting from digit 9.

Figure 16. Gibbs inpainting experiments starting from digit 6.

Figure 17. Gibbs inpainting experiments starting from digit 4.

transformations, and looks more like the classical framework of MCMC. Even if this introduces more noise in our stochastic gradient descent, it encourages MetFlow kernels trained to incorporate a lot more diversity. This can be seen with the two experiments described before, for mixture of different digits in MNIST, or the inpainting experiments. Even though experiments can show more diversity, it is important to note that they typically require a longer number of epoches to reach convergence. We give in Appendix D.4 a comparison of the three settings for a mixture of digits problem. As we can see, the diversity introduced by our method is highest in the fully random setting. We can see a wider variety of 3, even though other digits tend to appear more when we decode them as well.

Figure 18. Comparison of the different settings described for a mixture of digits experiment. From left to right, deterministic setting, pseudo-random setting, fully random setting.