b

DiscoverSearch
About
My stuff
MetFlow: A New Efficient Method for Bridging the Gap between Markov Chain Monte Carlo and Variational Inference
2020·arXiv
Abstract
Abstract

In this contribution, we propose a new computationally efficient method to combine Variational Inference (VI) with Markov Chain Monte Carlo (MCMC). This approach can be used with generic MCMC kernels, but is especially well suited to MetFlow, a novel family of MCMC algorithms we introduce, in which proposals are obtained using Normalizing Flows. The marginal distribution produced by such MCMC algorithms is a mixture of flow-based distributions, thus drastically increasing the expressivity of the variational family. Unlike previous methods following this direction, our approach is amenable to the reparametrization trick and does not rely on computationally expensive reverse kernels. Extensive numerical experiments show clear computational and performance improvements over state-of-the art methods.

One of the biggest computational challenge these days in machine learning and computational statistics is to sample from a complex distribution known up to a multiplicative constant. Indeed, this problem naturally appears in Bayesian inference (Robert, 2007) or generative modeling (Kingma & Welling, 2013). Very popular methods to address this problem are Markov Chain Monte Carlo (MCMC) algorithms (Brooks et al., 2011) and Variational Inference (VI) (Wainwright et al., 2008; Blei et al., 2017). The main contribution of this paper is to present a new methodology to successfully combine these two approaches mitigating their drawbacks and providing the state-of-the-art sampling quality from high dimensional unnormalized distributions. Starting from a parameterized family of distributions Q = {qφ : φ ∈ Φ ⊂ Rq}, VI approximates the intractable distribution with density  πon  RDby maximizing the evidence lower bound (ELBO) defined by

image

using an unnormalized version  ˜π of π, i.e. π = ˜π/Cπ settingCπ =�RD ˜π(z)dz. Indeed, this approach consists in mini- mizing  φ �→ KL(qφ|π) since L(φ) = log(Cπ)−KL(qφ|π).The design of the family Q of variational distributions has a huge influence on the overall performance – more flexible families provide better approximations of the target.

Recently, it has been suggested to enrich the traditional mean field variational approximation by combining them with invertible mappings with additional trainable parameters. A popular implementation of this principle is the Normalizing Flows (NFs) approach (Dinh et al., 2016; Rezende & Mohamed, 2015; Kingma et al., 2016) in which a mean-field variational distribution is deterministically transformed through a fixed-length sequence of parameterized invertible mappings. NFs have received a lot of attention recently and have proven to be very successful for VI and in particular for Variational Auto Encoder; see (Kobyzev et al., 2019; Papamakarios et al., 2019) and the references therein.

The drawback of variational methods is that they only allow the target distribution to be approximated by a parametric family of distributions. On the contrary, MCMC are generic methods which have theoretical guarantees (Robert & Casella, 2013). The basic idea behind MCMC is to design a Markov chain  (zk)k∈Nwhose stationary distribution is  π.Under mild assumptions, the distribution of  zKconverges to the target  π as Kgoes to infinity. Yet, this convergence is in most cases very slow and therefore this class of methods can be prohibitively computationally expensive.

The idea to “bridge the gap” between MCMC and VI was first considered in (Salimans et al., 2015) and has later been pursued in several works; see (Wolf et al., 2016), (Hoffman et al., 2019) and (Caterini et al., 2018) and the references therein. In these papers, based on a family of Markov kernels with target distribution  πand depending on trainable parameters  φ, the family Q consists in the marginal distribution obtained after K iterations of these Markov kernels.

In this paper, we develop a new approach to combine the VI and MCMC approaches. Compared to (Salimans et al., 2015) and (Wolf et al., 2016), we do not need to extend the variational approximation on the joint distribution of the K samples of the Markov chain and therefore avoid to introduce and learn reverse Markov kernels.

Our main contributions can be summarized as follows:

1) We propose a new computationally tractable ELBO which can be applied to most MCMC algorithms, including Metropolis-Adjusted Langevin Algorithm - MALA - and Hamiltonian Monte Carlo -HMC-. Compared to (Hoffman, 2017), the Markov kernels can be jointly optimized with the initial distribution  q0φ.

2) We propose an implementation of our approach MetFlow using a new family of ergodic MCMC kernels in which the proposal distributions are constructed using Normalizing Flows. Then, combining these Markov kernels with classical mean-field variational initial distributions, we obtain variational distributions with more expressive power than NF models, at the reasonable increase of the computational cost. Moreover, unlike plain NFs, we guarantee that each Markov kernel leaves the target invariant and that each iteration improves the distribution.

3) We present several numerical illustrations to show that our approach allows us to meaningfully trade-off between the approximation of the target distribution and computation, improving over state-of-the-art methods. The following link provides access to the implementation of the proposed method and all the experiments: https:// github.com/stat-ml/metflow.

Our paper is organized as follows. In Section 2, we start by describing our new methodology. Then, in Section 3, we introduce MetFlow, a class of “deterministic” MCMC algorithm, taking advantage of the flexibility of Normalizing Flows as proposals. Section 4 discusses the related work in more detail. The benefits of our method are illustrated through several numerical experiments in Section 5. Finally, we discuss the outcomes of the study and some future research directions in Section 6.

image

2.1. Basics of Metropolis-Hastings

The Metropolis Hastings (MH) algorithm to sample a density  πw.r.t. the Lebesgue measure on  RD defines a Markov chain  (Zk)k∈Nwith stationary distribution  πas follows. Conditionally to the current state  Zk ∈ RD, k ∈ N, a proposal  Yk+1 = Tφ(Zk, Uk+1)is sampled where  (Uk)k∈N∗ isa sequence of i.i.d. random variables valued in a measurable space (U, U), with density h w.r.t. to a  σ-finite measure  µU,and  Tφ : RD × U → RD is a measurable function, referred to as the proposal mapping, and parameterized by  φ ∈ Φ. In this work,  φcollectively denotes the parameters used in the proposal distribution, and  (Uk)is referred to as the innovation noise. Then,  Yk+1is accepted with probability  αMHφ �Zk, Tφ(Zk, Uk+1)�, where αMHφ : R2D → [0, 1] isdesigned so that the resulting Markov kernel, denoted by Mφ,h, is reversible w.r.t.  π, i.e. satisfies the detailed balance  π(dz)Mφ,h(z, dz′) = π(dz′)Mφ,h(z′, dz). With this notation,  Mφ,hcan be written, for  z ∈ RD, A ∈ B(RD), as

image

where  {Qφ : φ ∈ Φ}is the family of Markov kernels given for any  z ∈ RD, u ∈ U, A ∈ B(RD), by:

image

In this definition,  δzstands for the Dirac measure at z and {αφ : RD × U → [0, 1] , φ ∈ Φ}is a family of acceptance functions related to the MH acceptance probabilities by αφ(z, u) = αMHφ �z, Tφ(z, u)�.

To illustrate this definition, consider first the symmetric Random Walk Metropolis Algorithm (RWM). In such case, U = RD, µUis the Lebesgue measure, and g is the D-dimensional standard normal density. The proposal mapping is

image

where  {Σφ, φ ∈ Rq}is a parametric family of positive definite matrices, and the acceptance function is given by αRWMφ (z, u) = 1 ∧�π(T RWMφ (z, u))/π(z)�.

Consider now the Metropolis Adjusted Langevin Algorithm (MALA); see (Besag, 1994). Assume that  z �→ log π(z)is differentiable and denote by  ∇ log π(z)its gradient. The proposal mapping and the associated acceptance function for MALA algorithm are given by

image

where  gφ(z1, z2) = |Σφ|−1/2 g�Σ−1/2φ (z2 −T MALAφ (z1, 0))�is the proposal kernel density.

2.2. Variational Inference Meets Metropolis-Hastings

Let  K ∈ N∗, {ξ0φ : φ ∈ Φ}on  RDa parametric family of distributions and  {hi}Ki=1density functions w.r.t  µU. Con- sider now the following variational family

image

obtained by iteratively applying to the initial distribution  ξ0φthe Markov kernels  (Mφ,hi)Ki=1.

The objective of VI approach (Wainwright et al., 2008; Blei et al., 2017) is to minimize the Kullback-Leibler (KL) divergence  KL(ξKφ |π)w.r.t. the parameter  φ ∈ Φto find the distribution which best fits the target  π. For any  φ ∈ Φ, u ∈ Uand  z ∈ RD, denote by  Tφ,u(z) = Tφ(z, u), αφ,u(z) = αφ(z, u)and similarly for any  A ∈ B(RD), Qφ,u(z, A) = Qφ�(z, u), A�.

The key assumption in this section is that for any  φ ∈ Φand  u ∈ U, Tφ,uis a  C1diffeomorphism. This property is satisfied under mild condition on the proposal. In particular, this holds for RWM and MALA associated with the proposal mappings  T RWMφ,u and T MALAφ,u. It holds also for Hamiltonian Monte Carlo at the expense of extending the state space to include a momentum variable, see the supplementary paper Appendix A.4 and Appendix B.3. However, it is in general not needed to specify a valid MCMC procedure. For a  C1(RD, RD)diffeomorphism  ψ, define by Jψ(z) theabsolute value of the Jacobian determinant at  z ∈ RD.

Lemma 1. Let  (u, φ) ∈ U × Φ. Assume that  ξ0φadmits a density  m0φw.r.t. the Lebesgue measure. Assume in addi- tion  Tφ,uis a  C1diffeomorphism. Then, the distribution ξ1φ(·|u) =�Rd m0φ(z0)Qφ,u(z0, ·)dz0has a density w.r.t. the Lebesgue measure given by

image

and ξ1φhas a density given by m1φ(z) =�m1φ(z|u)h(u)µU(du).

Proof. Let f be a nonnegative measurable function on  RD.(6) follows from the change of variable  z1 = Tφ,u(z0):

image

An induction argument extends this property to the K-th marginal  ξKφ . Let us define  T 0 = Id. For a family  {Ti}Ki=1of mappings on  RD and 1 ≤ i ≤ k < K, define ⃝kj=iTj =Ti ◦ · · · ◦ Tkand for a sequence  {hi}Ki=1of innovation noise densities w.r.t.  µU, define h1:K(u1:K) = �Ki=1 hi(ui).Finally, set  α1φ,u(z) = αφ,u(z) and α0φ,u(z) = 1 − αφ,u(z).

Proposition 1. Assume that for any  (u, φ) ∈ U×Φ, Tφ,u isa  C1 diffeomorphism and  ξ0φ admits a density  m0φ w.r.t. theLebesgue measure. For any  {ui}Ki=1 ∈ UK, the distribution ξKφ (· | u1:K) = ξ0φQφ,u1 · · · Qφ,uK has a density  mKφ givenby

image

where

image

In particular, for a sequence  {hi}Ki=1of innovation noise densities,  ξKφ (5)has a density w.r.t. the Lebesgue measure, explicitly given, for any  z ∈ RD, by

image

We can now apply the VI approach the family Q defined in (5). Consider a family of inference function

{ρ(a1:K, u1:K | z): z ∈ RD, a1:K ∈ {0, 1}K, u1:K ∈ UK} .

This family may depend upon some parameters, implicit in this notation. As shown below, our objective is to take very simple expressions for those functions. We define our ELBO Laux(φ), using now the extended space  (zK, a1:K, u1:K), by

image

Note that Lauxis a lower bound of L expressed in (1) since defining  mKφ (z, a1:K, u1:K) =mKφ (z, a1:K|u1:K)h1:K(u1:K)and  mKφ (a1:K, u1:K|z) =mKφ (z, a1:K, u1:K)/mKφ (z), we obtain

image

where  KL�mKφ (•|zK)∥ρ(•|zK)�denotes the KL diver-

gence between  mKφ (a1:K, u1:K|zK) and ρ(a1:K, u1:K|zK).We specify the inference functions  ρ. In particular, a simple choice is  ρ(a1:K, u1:K|z) = r(a1:K|z, u1:K)h1:K(u1:K), where r is a similar family of inference function on {0, 1}. This architecture is based on the representation of the Markov kernel (3) we built and simplifies our ELBO. In the following, we always assume the form of such inference function. Note here that the key step of our approach for defining  Lauxis to rely on the representation (2), allowing us to write explicitly our marginals  mKφcompared to (Salimans et al., 2015).

The ELBO  Lauxcan be optimized w.r.t.  φ, typically by stochastic gradient methods, which requires an unbiased estimator of the gradient  ∇Laux(φ). Such estimator can be obtained using the reparameterization trick (Rezende et al., 2014). The implementation of this procedure is a bit more involved in our case, as the integration is done on a mixture of the components  mKφ (z, a1:K|u1:K). To develop an understanding of the methodology, we consider first the case K = 1. Denote by g the density of the D-dimensional standard Gaussian distribution. Suppose for example that we can write  m0φ(z) = g(V −1φ (z))JV −1φ (z)with  Vφ(y) = µφ + Σ−1/2φ y. Other parameterization could be handled as well. With two changes of variables, we can integrate w.r.t. g, which implies that

image

Justification and extension to the general case  K ∈ N∗is given in the supplementary paper, see Appendix B.1. From (11), using the general identity  ∇α = α∇ log(α), thegradient of  Lauxis given by

image

Therefore, an unbiased estimator of  ∇Laux(φ)can be obtained by sampling independently the proposal innovation u1 ∼ h1and the starting point  y ∼ N(0, I), and then the acceptation  a1 ∼ Ber�αa1u1�Vφ(y)��. A complete derivation for the case  K ∈ N∗is given in the supplementary paper, Appendix B.2.

In this section, we extend the construction above to a new class of MCMC methods, for which the proposal mappings are Normalizing Flows (NF). Our objective is to capitalize on the flexibility of NF to represent distributions, while keeping the exactness of MCMC. This new class of MCMC are referred to as MetFlow, standing for Metropolized Flows.

Consider a flow  Tφ : RD×U → RD parametrized by  φ ∈ Φ.It is assumed that for any  u ∈ U, Tφ,u : z �→ Tφ(z, u)is a  C1diffeomorphism. Set  V = {−1, 1}. For any  u ∈ U, consider the involution ˚Tφ,u on RD×V, i.e. ˚Tφ,u◦˚Tφ,u = Id, defined for  z ∈ RD, v ∈ {−1, 1} by

image

The variable v is called the direction. If v = 1 (respectively v = −1), the “forward”(resp. “backward”) flow  Tφ,u (resp.T −1φ,u) is used. For any  z ∈ RD, v ∈ {−1, 1}, A ∈ B(RD), B ⊂ V, we define the kernel

image

where  ˚αφ,u : RD × V → [0, 1]is the acceptance function.

Proposition 2. Let  νbe a distribution on V, and  (u, φ) ∈U × Φ. Assume that  ˚αφ,u : RD × V → [0, 1]satisfies for any  (z, v) ∈ RD × V,

image

Then for any  (u, φ) ∈ U × Φ, Rφ,udefined by (13) is reversible with respect to  π ⊗ ν. In particular, if for any

image

for  ϕ: R+ → R+, then (14)is satisfied if  ϕ(+∞) = 1 andfor any  t ∈ R+, tϕ(1/t) = ϕ(t).

Remark 1. The condition (14) on the acceptance ratio ˚αφ,uhas been reported in (Tierney, 1998, Section 2) (see also (Andrieu & Livingstone, 2019, Proposition 3.5) for extensions to the non reversible case). Standard choices for the acceptance function  ˚αφ,uare the Metropolis-Hastings and Barker ratios which correspond to  ϕ: t �→ min(1, t)and  t �→ t/(1 + t)respectively.

If we define for  u ∈ U, v ∈ V, z ∈ RD, A ∈ B(RD),

image

we retrieve the framework defined in Section 2. In turn, from a distribution  νfor the direction, the family {Qφ,(u,v) : (u, v) ∈ U × V}defines a MH kernel, given for  u ∈ U, z ∈ RD, A ∈ B(RD) by

Mφ,u,ν(z, A)= ν(1)Qφ,(u,1)(z, A)+ν(−1)Qφ,(u,−1)(z, A) .

The key result of this section is

Corollary 1. For any  u ∈ Uand any distribution  ν, the kernel  Mφ,u,νis reversible w.r.t.  π.

Consider for example the MALA proposal mapping  T MALAφ .If we set

image

with  Tφ ← T MALAφ, then for any  u ∈ RDand any distribution  ν, M MALAφ,u,νis reversible w.r.t.  π, which is not the case for  QMALAφ,udefined in (3) with acceptance function  αMALAφgiven by (4). Recall indeed that the reversibility is only satisfied for the kernel�QMALAφ,u (z, A) g(u)du.

As the reversibility is satisfied for any  u1:K ∈ UK, we typically get rid of the integration w.r.t. the innovation noise h1:Kand rather consider a fixed sequence  u1:Kof proposal noise. For RWM or MALA, this sequence could be a completely uniformly distributed sequence as in Quasi Monte Carlo method for MCMC, see (Schwedes & Calderhead, 2018; Chen et al., 2011). Using definition (15) and Proposition 1, we can write the density  mKφ,u1:K(·|v1:K)of the dis- tribution  ξKφ,u1:K(· | v1:K) = ξ0φQφ,(u1,v1) · · · Qφ,(uK,vK). Setting  αφ,u,v(z) = ˚αφ,u(z, v)as in the previous section, we can write, for any  z ∈ RD, a1:K ∈ {0, 1}K,

image

Moreover, as reversibility is satisfied for any distribution  ν,we could let it depend upon some parameters also denoted φand write  νφ,1:K = �Ki=1 νφ,i. Defining an inference function  ru1:K(a1:K|z, v1:K), we can thus obtain the lower bound parametrized by the fixed sequence  u1:K and φ:

image

for which stochastic optimization can be performed using the same reparametrization trick (11).

The choice of the transformation  Tφis really flexible. Let {Tφ,i}Ki=1be a family of K diffeomorphisms on  RD. A flow model based on  {Tφ,i}Ki=1is defined as a compo- sition  Tφ,K ◦ · · · ◦ Tφ,1that pushes an initial distribution  ξ0φwith density  m0φto a more complex target distri- bution  ξKφwith density  mKφ, given for any  z ∈ RDby mKφ (z) = m0�⃝Ki=1T−1φ,i(z)�J⃝Ki=1T−1φ,i(z), see (Tabak & Turner, 2013; Rezende & Mohamed, 2015; Kobyzev et al., 2019; Papamakarios et al., 2019). We now proceed to the construction of MetFlow, based on the same deterministic sequence of diffeomorphisms. A MetFlow model is obtained by applying successively the Markov kernels Mφ,1,ν, . . . , Mφ,K,ν, written as, for  z ∈ RD, A ∈ B(RD),

image

Each of those is reversible w.r.t. the stationary distribution πand thus leaves  πinvariant. In such a case, the resulting distribution  ξKφ is a mixture of the pushforward of  ξ0φ by theflows  {TvKaKφ,K ◦· · ·◦Tv1a1φ,1 , v1:K ∈ VK, a1:K ∈ {0, 1}K}.The parameters  φof the flows  {Tφ,i}Ki=1are optimized by maximizing an ELBO similar to (17) in which  mKφ,u1:Kis substituted by  mKφ,1:Kwith  Tφ,ui ← Tφ,i. The kernel Mφ,i,νshares some similarity with Transformation-based MCMC (T-MCMC) introduced in (Dutta & Bhattacharya, 2014). However, the transformations considered in (Dutta & Bhattacharya, 2014) and later by (Dey et al., 2016) are elementary additive or multiplicative transforms acting coordinate-wise. In our contribution, we considered much more sophisticated transformations, inspired by the recent advances on normalizing flows.

Among the different flow models which have been considered recently in the literature (Papamakarios et al., 2019), we chose Real-Valued Non-Volume Preserving (RNVP) flows (Dinh et al., 2016) because they are easy to compute and invert. An extension of our work would be to consider other flows, such as Flow++ (Ho et al., 2019), which can also be computed and inverted efficiently. We could also use autoregressive models, such as Inverse Autoregressive Flows (Kingma et al., 2016), which have a tractable - albeit non parallelizable - inverse. Even more expressive flows like NAF (Huang et al., 2018), BNAF (Cao et al., 2019) or UMNN (Wehenkel & Louppe, 2019) could be experimented with. Although they are not invertible analytically, this problem could be solved either by the Distribution Distillation method (van den Oord et al., 2017), or simply by a classic root-finding algorithm: this is theoretically tractable because of the monotonous nature of these flows, and empirically satisfactory (Wehenkel & Louppe, 2019).

In this section, we compare our method with the state-of-the-art for combining MCMC and VI. The first attempt to bridge the gap between MCMC and VI is due to (Salimans et al., 2015). The method proposed in (Salimans et al., 2015) uses a different ELBO, based on the joint distribution of the K steps of the Markov chain  z0:K = (z0, . . . , zK), whereasMetFlows are based on the marginal distribution of the K-th component. (Salimans et al., 2015) introduce an auxiliary inference function r and consider the ELBO:

image

where  qφ(z0:K)is the joint distribution of the path  z0:Kw.r.t. the Lebesgue measure. An optimal choice of the auxiliary inference distribution is  r(z0:K−1 | zK) =qφ(z0:K−1|zK), the conditional distribution of the Markov chain path  z0:K−1given its terminal value  zK, but this distribution is in most cases intractable. (Salimans et al., 2015; Wolf et al., 2016) discuss several way to construct sensible approximations of  qφ(z0:K−1 | zK)by introducing learnable time inhomogeneous backward Markov kernels. This introduces additional parameters to learn and degrees of freedom in the choice on the reverse kernels which are not easy to handle. On the top of that, this increases significantly the computational budget.

(Hoffman, 2017) also suggests to build a Markov Chain to enrich the approximation of  π. More precisely, (Hoffman, 2017) optimizes the ELBO with respect to the initial distribution  q0φ, and only uses the MCMC steps to produce “better” samples to the target distribution. However, there is no feedback from the resulting marginal distribution there to optimize the parameters of the variational distribution φ. This method does not thus directly and completely uni-fies VI and MCMC, even though it simplifies the process by avoiding the use of the extended space and the reverse kernels. (Ruiz & Titsias, 2019) refines (Hoffman, 2017) by using a contrastive divergence approach; compared to the methodology presented in this paper, (Ruiz & Titsias, 2019) do not capitalize on the expression of the marginal density.

Another solution to avoid reverse kernels is considered in (Caterini et al., 2018) which amounts to remove randomness from an Hamiltonian MC algorithm. However, by getting rid of the accept-reject step and the resampling of the momenta in the HMC algorithm, this approach forgoes the guarantees that come with exact MCMC algorithms.

In this section, we illustrate our findings. We present examples of sampling from complex synthetic distributions which are often used to benchmark generative models, such as a mixture of highly separated Gaussians and other nonGaussian 2D distributions. We also present posterior inference approximations and inpainting experiments on MNIST dataset, in the setting outlined by (Levy et al., 2017). Many more examples are given in the supplementary paper.

We implement the MetFlow algorithm described in Section 3 to highlight the efficiency of our method. For our learnable transitions  Tφ, we use RNVP flows.

We consider two settings. In the deterministic setting, we use K different RNVP transforms  {Tφ,i}Ki=1, and the pa- rameters for each individual transform  Tφ,iare different. In the pseudo-randomized setting, we define global transformation  Tφ on RD ×U and set Tφ,i = Tφ(·, ui), where u1:Kare K independent draws from a standard normal distribution. In such case, the parameters are the same for the flows Tφ,i, only the innovation noise  uidiffers. Typically, RNVP are encoded by neural networks. In the second setting, the network will thus take as input z and u stacked together.

In the second setting, once training has been completed and a fit ˆφof the parameters has been obtained, we can sample additional noise innovations  u(K+1):mK. We then consider the distribution given by  ξmKˆφ =ξKˆφ M ˆφ,uK+1,ν, . . . , M ˆφ,umK,ν where νis typically the uni- form on  {−1, 1}, as defined as in Section 3. mK corresponds to the length of the final Markov chain we consider. In practice, we have found that sampling additional noise innovations this way yields a more accurate approximation of the target, thanks to the asymptotic guarantees of MCMC.

Barker ratios (see Remark 1) have the advantage of being differentiable everywhere. Metropolis-Hastings (MH) ratios are known to more efficient than Barker ratios in the Peskun sense, see (Peskun, 1973). Moreover, although they are not differentiable at every point  z ∈ RD, differentiating MH ratios is no harder than differentiating a ReLu function. We thus use MH ratios in the following.

All code is written with the Pytorch (Paszke et al., 2017) and Pyro (Bingham et al., 2018) libraries, and all our experiments are run on a GeForce GTX 1080 Ti.

5.1. Synthetic data. Examples of sampling.

5.1.1. MIXTURE OF GAUSSIANS

The objective is to sample from a mixture of 8 Gaussians in dimension 2, starting from a standard normal prior distribution  q0, and compare MetFlow to RNVP. We are using an architecture of five RNVP flows (K = 5), each of which is parametrized by two three-layer fully-connected neural networks with LeakyRelu (0.01) activations. In this example, we consider the pseudo-randomized setting. The results for MetFlow and for RNVPs alone are shown on Figure 1. First, we observe that while our method successfully finds

image

Figure 1. Sampling a mixture of 8 Gaussian distributions. Top row from left to right: Target distribution, MetFlow, MetFlow with 145 resampled innovation noise. Bottom row from left to right: Prior distribution, First run of RNVP, Second run of RNVP. MetFlow finds all the modes and improves with more iterations, while RNVP depend on a good initialization to find all the modes and fails to separate them correctly.

all modes of the target distribution, RNVP alone struggles to do the same. Our method is therefore able to approximate multimodal distributions with well separated modes. Here, the mixture structure of the distribution (with potentially 35 = 243modes) produced by MetFlow is very appropriate to such a problem. On the contrary, classical flows are unable to approximate well separated modes starting from a simple unimodal prior, without much surprise. In particular, mode dropping is a serious issue even in small dimension. Moreover, an other advantage of MetFlow in the pseudo randomized setting is to be able to iterate the learnt kernels which still preserve the target distribution. Iterating MetFlow kernels widens the gap between both approaches, significantly improving the accuracy of our approximation.

5.1.2. NON-GAUSSIAN 2D DISTRIBUTIONS

In a second experiment, we sample the non-Gaussian 2D distributions proposed in (Rezende & Mohamed, 2015). Figure 2 illustrates the performance of MetFlow compared to RNVP. We are again using 5 RNVPs (K = 5) with the architecture described above, and use the pseudo-randomized setting for MetFlow. After only five steps, MetFlow already finds the correct form of the target distribution, while the simple RNVP fails on the more complex distributions. Moreover, iterating again MetFlow kernels allows us to approximate the target distribution with striking precision, after only 50 MCMC steps.

image

Figure 2. Density matching example (Rezende & Mohamed, 2015) and comparison between RNVP and MetFlow.

image

Figure 3. Mixture of ’3’ digits. Top: Fixed digits, Middle: NAF samples, Bottom: MetFlow samples. Compared to NAF, MetFlow is capable to mix better between these modes, while NAF seems to collapse.

5.2. Deep Generative Models

Deep Generative Models (DGM), such as Deep Latent Gaussian Models (see Kingma & Welling (2013); Rezende et al. (2014)) have recently become very popular. The basic assumption in a DGM is that the observed data x is generated by sampling a latent vector z which is used as the input of a deep neural network. This network then outputs the parameters of a family of distributions (e.g., the canonical parameters of exponential family like Bernoulli or Gaussian

image

Figure 4. Top line: Mean-Field approximation and MetFlow, Middle line: Mean-Field approximation, Bottom line: Mean-Field Approximation and NAF. Orange samples on the left represent the initialization image. We observe that MetFlow easily mixes between the modes while other methods are stuck in one mode.

distributions) from which the data are sampled. Given data generated by a DGM, a classical problem is to approach the conditional distribution p(z | x) of the latent variables z given the observation x, using variational inference to construct an amortized approximation.

We consider the binarized MNIST handwritten digit dataset. The generative model is as follows. The latent variable z is a l = 64 dimensional standard normal Gaussian. The observation  x = (xj)Dj=1is a vector of D = 784 bits. The bits (xj)Dj=1are, given the latent variable z, conditionally inde- pendent Bernoulli distributed random variables with success probability  pθ(z)jwhere  (pjθ)Dj=1is the output of a con- volutional neural network. In this framework,  pθis called the decoder. In the following, we show that our method provides a flexible and accurate variational approximation of the conditional distribution of the latent variable given the observation  pθ(z | x), outperforming mean-field and Normalizing Flows based approaches.

As we are focusing in this paper on the comparison of VI methods to approximate complex distributions and not on learning the Variational Auto Encoder itself, we have chosen to use a fixed decoder for both Normalizing Flows (here, Neural Autoregessive Flows) and MetFlow (with RNVP transforms). The decoder is obtained using state-of-the-art method described in the supplementary paper. We can illustrate the expressivity of MetFlow in two different ways. We first fix L different samples. In this example, we take L = 3 images representing the digit “3”. We are willing to approximate, for a given decoder  pθ, the posterior distribution  pθ(z|(xi)Li=1). We show in Figure 3 the de- coded samples corresponding to the following variational approximations of  pθ(·|(xi)Li=1): (i) a NAF trained from the decoder to approximate  pθ(·|(xi)Li=1)and (ii) MetFlow in the deterministic setting with K = 5 RNVP flows.

Figure 3 shows that the samples generated from (i) collapse essentially to one mode corresponding to the first digit. On the contrary, MetFlow is able to capture the three different modes of the posterior and generates much more variability in the decoded samples. The same phenomenon is observed in different settings by varying L and the digits chosen, as illustrated in the supplementary paper.

We now consider the in-painting set-up introduced in (Levy et al., 2017, Section 5.2.2). Formally, we in-paint the top of an image using Block Gibbs sampling. Given an image x, we denote  xt, xb the top and the bottom half pixels. Starting from an image  x0, we sample at each step  zt ∼ pθ(z | xt)and then  ˜x ∼ pθ(x | zt). We the set  xt+1 = (˜xt, xb0). We give the output of this process when sampling from the mean-field approximation of the posterior only, the mean-field pushed by a NAF, or using our method. The result for the experiment can be seen on Figure 4.

We can see that MetFlow mixes easily between different modes, and produces sharp images. We recognize furthermore different digits (3,5,9). It is clear from the middle plot that the mean-field approximation is not able to capture the complexity of the distribution  pθ(z | x). Finally, the NAF improves the quality of the samples but does not compare to MetFlow in terms of mixing.

In this paper, we propose a novel approach to combine MCMC and VI which alleviates the computational bottleneck of previously reported methods. In addition, we design MetFlow, a particular class of MH algorithms which fully takes advantage of our new methodology while capitalizing on the great expressivity of NF. Finally, numerical experiments highlight the benefits of our method compared to state-of-the-arts VI.

This work leads to several natural extensions. All NF applications can be adapted with MetFlow, which can be seen as a natural extension of a NF framework. MetFlow are very appropriate for VAE by amortizing. Due to lack of space, we did not present applications with Forward KL divergence. The mixture structure of the distribution obtained by MetFlow suggests the Variational Expectation Maximization (VEM) is a sensible strategy and in particular a chunky version of VEM in the case where the number of steps K is large (Verbeek et al., 2003).

Andrieu, C. and Livingstone, S. Peskun-Tierney ordering for Markov chain and process Monte Carlo: beyond the reversible scenario. arXiv preprint arXiv:1906.06197, 2019.

Besag, J. Comments on representations of knowledge in complex systems by u. Grenander and m. Miller. J. Roy. Statist. Soc. Ser. B, 56:591–592, 1994.

Bingham, E., Chen, J. P., Jankowiak, M., Obermeyer, F., Pradhan, N., Karaletsos, T., Singh, R., Szerlip, P., Horsfall, P., and Goodman, N. D. Pyro: Deep universal probabilistic programming, 2018.

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859877, Feb 2017. ISSN 1537-274X. doi: 10.1080/01621459.2017. 1285773.

Bou-Rabee, N. and Jes´us Mar´ıa, S.-S. Geometric integrators and the Hamiltonian Monte Carlo method. Acta Numerica, pp. 1–92, 2018.

Brooks, S., Gelman, A., Jones, G., and Meng, X.-L. Handbook of Markov chain Monte Carlo. CRC press, 2011.

Cao, N. D., Titov, I., and Aziz, W. Block neural autoregressive flow. In UAI, 2019.

Caterini, A. L., Doucet, A., and Sejdinovic, D. Hamiltonian variational auto-encoder. In Advances in Neural Information Processing Systems, pp. 8167–8177, 2018.

Chen, S., Dick, J., Owen, A. B., et al. Consistency of Markov chain quasi-Monte Carlo on continuous state spaces. The Annals of Statistics, 39(2):673–701, 2011.

Dey, K. K., Bhattacharya, S., et al. On geometric ergodicity of additive and multiplicative transformation-based Markov chain Monte Carlo in high dimensions. Brazilian Journal of Probability and Statistics, 30(4):570–613, 2016.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using real NVP. arXiv preprint arXiv:1605.08803, 2016.

Duane, S., Kennedy, A., Pendleton, B. J., and Roweth, D. Hybrid Monte Carlo. Physics Letters B, 195(2):216 – 222, 1987. ISSN 0370-2693.

Dutta, S. and Bhattacharya, S. Markov chain Monte Carlo based on deterministic transformations. Statistical Methodology, 16:100116, Jan 2014. ISSN 1572-3127. doi: 10.1016/j.stamet.2013.08.006.

Ho, J., Chen, X., Srinivas, A., Duan, Y., and Abbeel, P. Flow++: Improving flow-based generative models with variational dequantization and architecture design. arXiv preprint arXiv:1902.00275, 2019.

Hoffman, M., Sountsov, P., Dillon, J. V., Langmore, I., Tran, D., and Vasudevan, S. Neutra-lizing bad geometry in Hamiltonian Monte Carlo using neural transport. arXiv preprint arXiv:1903.03704, 2019.

Hoffman, M. D. Learning deep latent Gaussian models with Markov chain Monte Carlo. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 1510–1519, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

Huang, C.-W., Krueger, D., Lacoste, A., and Courville, A. Neural autoregressive flows. arXiv preprint arXiv:1804.00779, 2018.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improved variational inference with inverse autoregressive flow. In Advances in neural information processing systems, pp. 4743–4751, 2016.

Kobyzev, I., Prince, S., and Brubaker, M. A. Normalizing flows: Introduction and ideas. arXiv preprint arXiv:1908.09257, 2019.

Levy, D., Hoffman, M. D., and Sohl-Dickstein, J. Generalizing Hamiltonian Monte Carlo with neural networks. arXiv preprint arXiv:1711.09268, 2017.

Neal, R. M. Slice sampling. Ann. Statist., 31(3):705–767, 06 2003. doi: 10.1214/aos/1056562461.

Neal, R. M. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, pp. 113–162, 2011.

Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., and Lakshminarayanan, B. Normalizing flows for probabilistic modeling and inference. arXiv preprint arXiv:1912.02762, 2019.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. 2017.

Peskun, P. H. Optimum Monte Carlo sampling using Markov chains. Biometrika, 60(3):607–612, 1973.

Rezende, D. and Mohamed, S. Variational inference with normalizing flows. In Bach, F. and Blei, D. (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 1530–1538, Lille, France, 07–09 Jul 2015. PMLR.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

Robert, C. The Bayesian choice: from decision-theoretic foundations to computational implementation. Springer Science & Business Media, 2007.

Robert, C. and Casella, G. Monte Carlo statistical methods. Springer Science & Business Media, 2013.

Ruiz, F. and Titsias, M. A contrastive divergence for combining variational inference and MCMC. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 5537– 5545, Long Beach, California, USA, 09–15 Jun 2019. PMLR.

Salakhutdinov, R. and Murray, I. On the quantitative analysis of deep belief networks. In Proceedings of the 25th international conference on Machine Learning, pp. 872– 879, 2008.

Salimans, T., Kingma, D., and Welling, M. Markov chain Monte Carlo and variational inference: Bridging the gap. In International Conference on Machine Learning, pp. 1218–1226, 2015.

Schwedes, T. and Calderhead, B. Quasi Markov chain Monte Carlo methods. arXiv preprint arXiv:1807.00070, 2018.

Tabak, E. G. and Turner, C. V. A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics, 66(2):145–164, 2013.

Tierney, L. A note on Metropolis-Hastings kernels for general state spaces. Ann. Appl. Probab., 8(1):1–9, 02 1998.

van den Oord, A., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K., van den Driessche, G., Lockhart, E., Cobo, L. C., Stimberg, F., Casagrande, N., Grewe, D., Noury, S., Dieleman, S., Elsen, E., Kalchbrenner, N., Zen, H., Graves, A., King, H., Walters, T., Belov, D., and Hassabis, D. Parallel wavenet: Fast high-fidelity speech synthesis, 2017.

Verbeek, J., Vlassis, N., and Nunnink, J. A variational EM algorithm for large-scale mixture modeling. In Vassiliades, S., Florack, L., Heijnsdijk, J., and van der Steen, A.

(eds.), 9th Annual Conference of the Advanced School for Computing and Imaging (ASCI ’03), pp. 136–143, Heijen, Netherlands, June 2003.

Wainwright, M. J., Jordan, M. I., et al. Graphical models, exponential families, and variational inference. Foundations and Trends R⃝in Machine Learning, 1(1–2):1–305, 2008.

Wehenkel, A. and Louppe, G. Unconstrained monotonic neural networks. In Advances in Neural Information Processing Systems, pp. 1543–1553, 2019.

Wolf, C., Karl, M., and van der Smagt, P. Variational inference with Hamiltonian Monte Carlo. arXiv preprint arXiv:1609.08203, 2016.

A.1. Proof of Proposition 1

The proof is by induction on  K ∈ N∗. The base case K = 1 is given by Lemma 1. Assume now that the statement holds for K − 1 ∈ N∗. Then noticing that  ξKφ (·|u1:K) = ξK−1φ (·|u1:K−1)Qφ,uK, using again Lemma 1 and the induction hypothesis, we get that  ξKφ (·|u1:K)admits a density  mKφ (·|u1:K)w.r.t. the Lebesgue measure given for any  z ∈ RD by

image

Using the induction hypothesis, the density  mK−1φ (·|u1:K−1)w.r.t. the Lebesgue measure of  ξK−1φ (·|u1:K−1)of the form (7). Therefore, we obtain that

image

where in the last step, we have used that for any differentiable functions  ψ1, ψ2 : RD → RD, Jψ1◦ψ2(z) = Jψ1(ψ2(z))Jψ2(z)for any  z ∈ RD.

A.2. Proof of Proposition 2

Let  (u, φ) ∈ U × Φ. We want to find a condition such that the kernel  Rφ,u defined by (13)is reversible w.r.t.  π ⊗ ν where νis a distribution on V. This means that for any  A1, A2 ∈ B(RD), B1, B2 ⊂ V,

image

(19) By definition of  Rφ,u(13), the left-hand side simplifies to

image

where using that ˚Tφ,uis an involution, and for  v ∈ V, the change of variable  ˜z = T −vφ,u(z),

image

where in the last step, we have the change of variable  ˜v = −v. As for the right-hand side of (19), we have by definition (13), �

image

Therefore combining this result with (20)-(21), we get that if (14) holds then  Rφ,uis reversible w.r.t.  π ⊗ ν.

Moreover, let us suppose that  ϕ: R+ → R+is a function satisfying  ϕ(+∞) = 1and for any  t ∈ R+, tϕ(1/t) = ϕ(t). If for any  z ∈ RD, u ∈ U, v ∈ V,

π(z)ν(v)

then

image

which concludes the proof for Proposition 2

A.3. Proof of Corollary 1

Let us suppose that for any  u ∈ U, ˚αφ,uare chosen such that  Rφ,udefined by (13) is  π ⊗ νinvariant, where  νis any distribution on V. Then, by definition, for any  A1, A2 ∈ B(RD), B1, B2 ⊂ V, we have:�

image

image

Thus concluding that  Mφ,u,νis reversible w.r.t.  π, for any distribution  ν and any u ∈ U.

A.4. Checking the Assumption of Lemma 1 for RWM and MALA algorithms

image

which clearly is a  C1(RD, RD)diffeomorphism with inverse

In the simple case where  JT RWMφ,u (z) = 1, using Proposition 1, we get

image

MALA: We prove here that under appropriate conditions, the transformations defined by the Metropolis Adjusted Langevin Algorithm (MALA) are  C1 diffeomorphisms. We consider only the case where  Σφ = γ Id. The general case can be easily deduced by a simple adaptation. Remember, for MALA,  T = {Tγ,u : z �→ z + γ∇U(z) + √2γu : u ∈ RD, γ > 0}, where U is defined as  U(z) = log(˜π(z)),

Proposition 3. Assume the potential U is gradient Lipschitz, that is there exists L in  R+such that for any  z1, z2 ∈ RD, ∥∇U(z1) − ∇U(z2)∥ ≤ L∥z1 − z2∥, and that γ ≤ 1/(2L). Then for any  u in RD, Tγ,u : z �→ z + γ∇U(z) + √2γu is aC1 diffeomorphism.

Proof. Let  γ ≤ 1/(2L)and  u ∈ RD. First we show that  Tγ,uis invertible. Consider, for each y in  RD, the mapping Hy,u(z) = y − √2γ − γ∇U(z). We have, for  z1, z2 ∈ RD,

image

and  γL ≤ 1/2. Hence Hy,uis a contraction mapping and thus has a unique fixed point  zy,uand we have:

image

and existence and uniqueness of the fixed point  zy,uthus complete the proof for invertibility of  Tγ,u. The fact that the inverse of  Tγ,u is C1 follows from a simple application of the local inverse function theorem.

Therefore, Proposition 1 can be applied again. Although there is no explicit expression available for  mKγbecause of the intractability of the inverse of  Tγ,u, numerical approximations can be used.

B.1. Expression for the reparameterization trick

The goal of the reparameterization trick is to rewrite a distribution depending on some parameters as a simple transformation of a fixed one. The implementation of this procedure is a bit more involved in our case, as the integration is now done on a mixture of the components  mKφ (z, a1:K|u1:K), for  a1:K ∈ {0, 1}K. To develop an understanding of the methodology we suggest, we consider first the case K = 1. Recall that g stands for the density of the standard Gaussian distribution over  RD, and suppose here that there exists  Vφ : RD → RDa  C1diffeomorphism such that for any  z ∈ RD, m0φ(z) = g(V −1φ (z))JVφ(V −1φ (z))which is the basic assumption of the reparameterization trick. With the two changes of variables,  ˜z = T −a1φ,u1(z) and y = V −1φ (˜z), we get

image

This result implies that we can integrate out everything with respect to g.

The intuition is the same after K steps, and we can write:

image

B.2. Unbiased estimator for the gradient of the objective

We now show in the following that we can also estimate without bias the gradient of the objective starting from the expression (24) of the ELBO assuming that we can perform the associated reparameterization trick.

Again, we start by the simple case K = 1 to give the reader the main idea of the proof. Using for any function  f : RD → R∗+,∇f = f∇ log(f), the gradient of our ELBO is given for any  φ ∈ Φ by

image

Now, this form is particularly interesting, because we can access an unbiased estimator of this sum by sampling  u1 ∼ h1and  y ∼ g, then a1 ∼ Ber{α1φ,u1(Vφ(y))}and computing the expression between brackets.

The method goes the same way for the K-th step. Indeed for any  φ ∈ Φ, we have using for any function  f : RD → R∗+, ∇f = f∇ log(f),

image

And again, this sum can be estimated by sampling  u1:K ∼ h1:K, y ∼ g, then recursively  a1 ∼ Ber(α1φ,u1(Vφ(y))and for i > 1, ai ∼ Ber�αaiφ,ui�⃝1j=i−1T ajφ,uj(Vφ(y))��.

Those variables sampled, the expression between brackets provides an unbiased estimator of the gradient of our objective.

B.3. Extension to Hamiltonian Monte-Carlo

The following proposition show a key result for applicability of HMC to our method.

Proposition 4. Assume that for any  (u, φ) ∈ U × Φ, Tφ,udefines an involution, i.e.  Tφ,u ◦ Tφ,u = Id. If for any  z ∈ RD,

image

then the kernel defined by (3) and acceptance functions  αand transformation  Tφis reversible w.r.t.  π.

This result is direct consequence of Proposition 2. In particular, the functions  ϕidentified in Proposition 2 are still applicable here.

image

An important special example which falls into the setting of Proposition 4 is the Hamiltonian Monte-Carlo algorithm (HMC) (Duane et al., 1987; Neal, 2011). In such a case, the state variable is  z = (q, p) ∈ R2D, where q stands for the position and p the momentum. The unnormalized target distribution is defined as  ˜π(q, p) = ˜π(q) g(p) where gis the density of the D-dimensional standard Gaussian distribution. Define the potential  U(q) = − log(˜π(q)). Hamiltonian dynamics propagates a particle in this extended space according to Hamilton’s equation, for any  t ≥ 0,

image

This dynamics preserves the extended target distribution as the flow described above is reversible, symplectic and preserves the Hamiltonian H(q, p), defined as the sum of the potential energy  U(q) = − log(˜π(q))and the kinetic energy  (1/2)pT p(note that we can write  ˜π(q, p) ∝ exp(−H(q, p))), see (Bou-Rabee & Jes´us Mar´ıa, 2018). It is not however usually possible to compute exactly the solutions of the continuous dynamics described above. However, reversibility and symplectiness can be preserved exactly by discretization using a particular scheme called the leap-frog integrator. Given a stepsize  γ, theelementary leap-frog  LFγ(q0, p0) for any (q0, p0) ∈ R2D is given by  LFγ(q0, p0) = (q1, p1) where

image

The N-steps leap-frog integrator with stepsize  γ > 0is defined by  LFγ,N = ⃝Ni=1LFγ is the N-times composition of  LFγ.For some parameters  a ∈ (0, 1) and γ > 0, consider now the two following transformations:

image

Here, the parameter  φstands for the stepsize  γand the auto-regressive coefficient a in the momentum refreshment transform. Other parameters could be included as well; see for example (Levy et al., 2017). For any  a ∈ (0, 1), T refφ,u is a continuously differentiable diffeomorphism. Then, taking h = g and setting the acceptance ratios to  αrefφ,u ≡ 1, it is easily showed that M refφ,hdefined by (2) – with  Tφ,u ← T refφ,u– is reversible w.r.t.  ˜π. On the other hand, by composition and a straightforward induction, for any  φ ∈ Φ, T LFφis continuously differentiable if  log(π)is twice continuously differentiable on  RDand the determinant of its Jacobian is equal to 1 on  RD. In addition,  T LFφis an involution by reversibility to the momentum flip operator of the Hamiltonian dynamics (Bou-Rabee & Jes´us Mar´ıa, 2018, Section 6). Indeed,  T LFφis here written as the composition of  LFγ,Nand the momentum flip operator  S(q, p) = (q, −p), T LFφ = LFγ,N ◦ S. (Bou-Rabee & Jes´us Mar´ıa, 2018) show indeed that  LFγ,Nsatisfies the following property  LFγ,N ◦ S = S ◦ LF−1γ,N. As S is also an involution, we can write  S ◦ LFγ,N ◦ S = LF−1γ,N and thus LFγ,N ◦ S ◦ LFγ,N ◦ S = Id, hence T LFφ is an involution. Thus, Proposition 4 applies and the expression given for  αφ,uis the classical acceptance ratio for HMC when  ϕ(t) = min(1, t). HMC algorithm is obtained by alternating repeatedly these two kernels  M refφ,hand  M LFφ. To fall exactly into the framework outlined above, one might consider the extension  U × {0, 1}, for (q, p) ∈ (RD)2, (u, v) ∈ U × {0, 1}, the transformation

image

and the two densities  href(u, v) = h(u)I{1}(v), hLF(u, v) = f(u)I{0}(v) where fis an arbitrary density since  T LFφ does notdepend on u. The kernel defined by (2) associated with this transformation and density  href is Mφ,href = M refφ,h, and similarly Mφ,hLF = M LFφ,g. Then, the density after K HMC steps with parameters  φcan be written as  ξKφ = ξ0φ(Mφ,hLFMφ,href)K.

C.1. Optimization in the general case

We saw in the previous section a way to compute an unbiased estimator of our objective. We will perform gradient ascent in the following, using the estimator provided before.

At each step of optimization, we will sample  u1:K ∼ h1:K, y ∼ gand then sequentially  a1 ∼ Ber(α1φ,u1(Vφ(y))and for i > 1, ai ∼ Ber(αaiφ,ui(⃝1j=i−1T ajφ,uj(Vφ(y))). With those variables, we can compute the expression between brackets in the formula above (25). We then perform a stochastic gradient scheme, repeating the process until convergence of our parameters. A detailed algorithm highlighting the simplicity of our method is presented in C.1. Note that our method “follows” a trajectory, conditioned on noise  u1:Kand accept/reject Booleans  a1:K, which is conceptually equivalent to a flow pushforward.

C.2. Optimization for MetFlow

We give in the following a detailed algorithm for the optimization procedure for MetFlow, highlighting the low computational complexity of our method compared to the previous attempts. Note that the increased complexity is only linear in K the number of steps in our Markov Chain. The functions denoted as acceptance function are for  u ∈ U, v ∈ {−1, 1}: α1u,v(z) = ˚αφ,u(z, v) and α0u,v(z) = 1 − ˚αφ,u(z, v).

Define the Rademacher distribution Rad(p) on  {−1, 1}with parameter  p ∈ [−1, 1]by  Rad(p) = pδ1 + (1 − p)δ−1. The noise v for MetFlow will be sampled using a Rademacher distribution, whose parameter p is let to depend on some parameters  φwhich will be optimized.

image

In all the sampling experiments presented in the main paper (mixture of Gaussians, (Rezende & Mohamed, 2015) distributions) as well as for the additional experiments presented here (Neal’s funnel distribution, mixture of Gaussians in higher dimensions), the flows used are real non volume preserving (R-NVP) (Dinh et al., 2016) flows. For  φ ∈ Φ, one R-NVP transform  fφ on RD is defined as follows, for any  z ∈ RD, ˜z = fφ(z), with ˜zI = zI and ˜zIc = zIc ⊙ exp(sφ(zI)) + tφ(zI), where⊙is the element-wise product,  I ⊂ {1, . . . , D}with cardinal |I|, called an auto regressive mask, and  Ic = {1, . . . , D} \ I, sφ, tφ : R|I| → RD−|I|. Typically, the two functions  sφ, tφare parametrized by a neural network and that case  φrepresents the corresponding weights. The use of this kind of functions is justified by the fact that inverses for these flows can be easily computed and have a tractable Jacobian. Indeed, a straightforward calculation leads to for any  z ∈ RD, ˜z = f −1φ (z)with ˜zI = zI and ˜zIc = (zIc − tφ(zI)) ⊙ exp(−sφ(zI)) and Jfφ(z) = exp {�D−|I|j=1 (sφ(zI))j}.

In our experiments, we consider a generalization of this setting which we refer to as latent noisy NVP (LN-NVP) flows defined for  (φ, u) ∈ Φ × U, fφ,u : RD → RD, of the form for any  z ∈ RD, ˜z = fφ,u(z)with  ˜zI = zI, ˜zIc = zIc ⊙exp(sφ(zI, u)) + tφ(zI, u), where I is an auto-regressive mask and  sφ, tφ : R|I| × U → RD−|I|. All the results on R-NVP apply to our LN-NVP, in particular for any  z ∈ RD, ˜z = f −1φ,u(z) with ˜zI = zI and ˜zIc = (zIc − tφ(zI, u)) ⊙ exp(−sφ(zI, u))and  Jfφ,u(z) = exp {�D−|I|j=1 (sφ(zI, u))j}.

D.1. Mixture of Gaussians: Additional Results

We use MetFlow with the pseudo random setting. Recall that by this, we mean that we define a unique function  (z, u) →Tφ(z, u)and before training, we sample innovation noise  u1, . . . , uKthat will be “fixed” during all the training process. We then perform optimization on the “fixed” flows  Tφ,i ← Tφ(·, ui) , i ∈ {1, . . . , K}. The specific setting we consider first is as follows. We compare the sampling based on variational inference with R-NVP Flow and our methodology MetFlow with LN-NVP. In both case, each elementary transformation we consider is the composition of 6 NVP transforms. Here

image

U = RDwith D = 2 and the target distribution is the one described in Section 5.1. Each function t and s in the NVP flows is a neural network with one fully connected hidden layers of size 4 for R-NVP and LN-NVP, and LeakyRelu (0.01) activation functions, and final layers with activation functions tanh for s and identity for t. Each method is trained for 25000 iterations, with ADAM (Kingma & Ba, 2014), with learning rate of  10−3and betas = (0.9, 0.999), and an early stopping criterion of 250 iterations (If within 250 iterations, the objective did not improve, we stop training). The prior distribution is a standard normal Gaussian. Figure 5 and Figure 6 show the results and the effect of each trained flow. The first pictures are gradient-coloured along the x-axis, with the colour of each point corresponding to its position in the previous image. This helps us to understand how well the processes mixes and transforms an original distribution. It is interesting to note that our method produces flows that are all relevant and all help the distribution to improve at each step. On the contrary, classical R-NVP and flows in general only take interest in what happens at the very last (K-th) step: the pushforward distribution can take any form in between. However, this representation is only interesting when each of the transformations have different sets of parameters, increasing a lot the total numbers of parameters to tune. The evolution of our method can be interpreted as well with the presence of acceptance ratios, that at each step “filter” or “tutor” the distribution, helping it not to get too far from the target. This allows us as well to introduce the setting where we can iterate our flows to refine more and more the approximation we produce in the end. We also show that our method is robust to a change in the prior (even a violent one). In the following figure, we change the standard normal prior to a mixture of two well-separated Gaussian distributions, with a standard deviation of 1 and means  (−50, 0) and (50, 0). Figure 7 shows that MetFlow still remains efficient after iterating only a few times the learnt kernel, without retraining it at any point. It obviously does not work for R-NVP. Recall that by ”iterating a kernel”, we mean (as is explained in the main paper) that additional noise  {ui}i≥K+1is sampled to define other transformations  Tφ,i ← Tφ(·, ui)and thus additional MetFlow kernels. This particular property could find applications with

image

Figure 5. Consecutive outputs of each MetFlow kernel. Left: prior normal distribution, then successive effect of the 5 trained MetFlow kernels.

image

Figure 6. Consecutive outputs of each block of R-NVP. Left: prior normal distribution, then successive effect of the 5 trained R-NVP blocks - 6 transforms each.

time-changing data domains. Indeed, it does not require retraining existing models, while remaining an efficient way to sample from a given distribution at low computational cost.

image

Figure 7. Changing the prior to a mixture of two separated Gaussians, having trained the method on a standard normal prior. Top row, from left to right: Subsituted prior, 5 trained MetFlow kernels, re-iteration of 100 MetFlow kernels, 200 MetFlow kernels. Bottom row: Substituted prior, R-NVP flow.

In addition to the 2-dimensional results presented in the paper, we show here the results of our method for mixture of Gaussians of higher dimensions. Figure 8 shows the number of Gaussians retrieved by our model when the target distribution is a mixture of 8 Gaussians with variance 1, located at the corners of the d-dimensional hypercube, for different values of d. We see that our method significantly outperforms others, be they state-of-the-art Normalizing Flows (NAF) or MCMC methods (NUTS). Furthermore, our method scales more efficiently with dimension, as it is still able to retrieve two of the modes in dim 100 (for some runs), when other methods collapse to one mode for  d ≥ 20. MetFlow here is used in the pseudo random setting, with 7 LN-NVP blocks of two elementary transforms, t and s being as previously two one-layer fully connected neural networks, input dimension of twice the dimension considered 2D (again, we take  U = RD) and hidden dimension of 2D. The activations functions stay the same. NUTS is ran with 1500 warm-up steps and a length of 10 000 samples.

image

Figure 8. Number of modes retrieved by different methods. The target distribution is a mixture of 8 isotropic Gaussian distributions of variance 1 located at the corners of a d-dimensional hypercube. The methods are trained to convergence or retrieval of all modes, and mode retrieval is computed by counting the number of samples in a ball of radius 4d around the center of a mode. Error bars represent the standard deviation of the mean number of modes retrieved for different runs of the method (different initialization and random seed).

image

Figure 9. Density matching for funnel. Top row: Target distribution, MetFlow with 5 trained kernels, MetFlow with 5 trained kernel iterated 100 times. Bottom row: Prior distribution, First run of 5 R-NVP, second run of 5 R-NVP

D.2. Funnel distribution

We now test our approach on an other hard target distribution for MCMC proposed in (Neal, 2003). The target distribution has density Lebesgue density w.r.t. the Lebesgue measure given for any  z ∈ R2 by

image

We are using exactly the same setting for MetFlow with LN-NVP and R-NVP as for the mixture of Gaussian. We train the flows with the same optimizer, for 25000 iterations again. Figure 9 show the results of MetFlow and R-NVP flows. Again, as we are in the pseudo random setting, we sample 95 additional innovation noise  (ui)i∈{6,...,100}and show the distribution produced by 100 MetFlow kernels, having only trained the first five transformations  {Tφ,i , i ∈ {1, . . . , 5}}, as described in Section 5 of the main document. We can observe that after only five steps, the distribution has been pushed toward the end of the funnel. However, the amplitude is not recovered fully. It can be interpreted in light of the “tutoring” analogy we used previously. As the Accept/Reject control at each step the evolution of the points, if the number of steps is too small, the proposals do not got to the far end of the distribution. However, the plots show that the proposal given by MetFlow are still learnt relevantly, as Figure 9 illustrates that iterating only a few more MetFlow kernels matches the target distribution in all its amplitude.

D.3. Real-world inference - MNIST

In the experiments ran on MNIST described in the main paper, we fix a decoder  pθ. To obtain our model, we use the method described in (Kingma et al., 2016). The mean-field approximation output by the encoder is here pushforward by a flow. In (Kingma et al., 2016), the flows used were Inverse Autoregressive Flows, introduced in the same work. We choose here to use a more flexible class of flows neural auto-regressive flows (NAF), introduced in (Huang et al., 2018), which show better results in terms of log-likelihood. In practice, we use convolutional networks for our encoder and decoder, matching the architecture described in (Salimans et al., 2015). The inference network (encoder) consists of three convolutional layers, each with filters of size  5 × 5and a stride of 2, and output 16, 32, and 32 feature maps, respectively. The output of the third layer feeds a fully connected layer with hidden dimension 450, which then is fully connected to the outputs, means and standard deviations, of the size of our latent dimension, here 64. Softplus activation functions are used everywhere except before outputting the means. For the decoder, a similar but reversed architecture is used, using upsampling instead of stride, again as described in (Salimans et al., 2015). The Neural Autoregressive Flows are given by pyro library. NAF have a hidden layer of 64 units, with an AutoRegressiveNN which is a deep sigmoidal flow (Huang et al., 2018), with input dimension 64 and hidden dimension 128. Our data is the classical stochastic binarization of MNIST (Salakhutdinov & Murray, 2008). We train our model using Adam optimizer for 2000 epoches, using early stopping if there is no improvement after 100 epoches. The learning rate used is  10−3, and betas (0.9, 0.999). This produces a complex and expressive model for both our decoder and variational approximation.

We show first using mixture experiments that MetFlow can overcome state-of-the-art sampling methods. The mixture experiment described in the main paper goes as follows. We fix L different samples, and wish to approximate the complex posterior  pθ(·|(xi)Li=1)) ∝ p(z) �Li=1 pθ(xi|z). We give two approximations of this distribution, given by a state-of-the-art method, and MetFlow. The state-of-the-art method is a NAF a hidden layer of 16 units, with an AutoRegressiveNN which is a deep sigmoidal flow (Huang et al., 2018), with input dimension 64 and hidden dimension 128. MetFlow is trained here in the deterministic setting, with 5 blocks of 2 R-NVPs, where again each function t and s is a neural network with one fully connected hidden layers of size 128 and LeakyRelu (0.01) activation functions, and final layers with activation functions tanh for s and identity for t. MetFlow and NAF are optimized using 10000 batches of size 250 and early stopping tolerance of 250, with ADAM, with learning rate of  10−3and betas = (0.9, 0.999). We use Barker ratios as well here. The prior in both cases is a standard 64-dimensional Gaussian.

image

Figure 10. Fixed digits for mixture experiment.

image

Figure 11. Mixture of 3, MetFlow approximation.

image

image

Figure 12. Mixture of 3, NAF approximation.

We see on Figure 12 that if NAF “collapses” to one fixed digit (thus one specific mode of the posterior), MetFlow Figure 11 is able to find diversity and multimodality in the posterior, leading even to “wrong” digits sometimes, showing that it truly explores the complicated latent space.

Moreover, we can compare the variational approximations computed for our VAE (encoder - mean field approximation

image

Figure 13. Gibbs inpainting experiments starting from digit 0.

image

Figure 14. Gibbs inpainting experiments starting from digit 3.

- and encoder and NAF) to MetFlow approximation. MetFlow used here are the composition of 5 blocks of 2 R-NVPs (deterministic setting), where each function t and s is a neural network with one fully connected hidden layers of size 128 and LeakyRelu (0.01) activation functions, and final layers with activation functions tanh for s and identity for t. MetFlow is optimized using 150 epoches of 192 batches of size 250 over the dataset MNIST, with ADAM, with learning rate of  10−4and betas = (0.9, 0.999) and early stopping with tolerance of 25 (if within 25 epoches the objective did not improve, we stop training). The optimization goes as follows. We fix encoder (mean-field approximation) and decoder (target distribution). For each sample x in a minibatch of the dataset, we optimize MetFlow starting from the prior given by the encoder (mean-field approximation) and targetting posterior  pθ(·|x). Note that during all this optimization procedure, the initial distribution of MetFlow is fixed to be the encoder, which corresponds to freeze the corresponding parameters. This is a simple generalization of our method to amortized inference. In the following, we use Barker ratios. The additional results are given by Figures 13 to 17.

The encoder represents just the mean-field approximation here, while encoder and NAF represent the total variational distribution learnt by the VAE described above.

D.4. Additional setting of experiments

So far, we have described two settings. The first one, that we have called deterministic, in which the transformations take no input “innovation noise”, but all have different sets of parameters. The second one, pseudo random, defines one global transformation  Tφwith a unique set of parameters  φ, considers K initially sampled at random “innovation noise” u1, . . . , uK, and considers densities  h1:Ksuch that the noise resampled at every step of the optimization is “fixed”. It then learns the parameters for the global transformation using the considered transformations  Tφ,i ← Tφ(·, ui) , i ∈ {1, . . . , K}.We can consider here a last setting, fully random, on which a global transformation  Tφwith a unique set of parameters φis considered. However, now, we consider a unique density h (typically Gaussian) to sample the “innovation noise” at every step of the optimization - note that this is still covered by Algorithm C.2. This allow us to consider properly random

image

Figure 15. Gibbs inpainting experiments starting from digit 9.

image

Figure 16. Gibbs inpainting experiments starting from digit 6.

image

Figure 17. Gibbs inpainting experiments starting from digit 4.

transformations, and looks more like the classical framework of MCMC. Even if this introduces more noise in our stochastic gradient descent, it encourages MetFlow kernels trained to incorporate a lot more diversity. This can be seen with the two experiments described before, for mixture of different digits in MNIST, or the inpainting experiments. Even though experiments can show more diversity, it is important to note that they typically require a longer number of epoches to reach convergence. We give in Appendix D.4 a comparison of the three settings for a mixture of digits problem. As we can see, the diversity introduced by our method is highest in the fully random setting. We can see a wider variety of 3, even though other digits tend to appear more when we decode them as well.

image

Figure 18. Comparison of the different settings described for a mixture of digits experiment. From left to right, deterministic setting, pseudo-random setting, fully random setting.


Designed for Accessibility and to further Open Science