A key strategy in machine learning is to break down a problem into smaller and more manageable parts, then process data or unknown variables recursively. Well known examples of this are message passing algorithms for graphical models and annealing for optimization or sampling. Sequential Monte Carlo (SMC) is a class of methods that are tailored to solved statistical inference problems recursively. These methods have mostly received attention in the signal processing and statistics communities. With well over two decades of research in SMC, they have enabled inference in increasingly complex and challenging models. Recently, there has been an emergent interest in this class of algorithms from the machine learning community. We have seen applications to probabilistic programming [Wood et al., 2014], variational inference (VI) [Maddison et al., 2017, Naesseth et al., 2018, Le et al., 2018], inference evaluation [Grosse et al., 2015, Cusumano-Towner and Mansinghka, 2017], probabilistic graphical models (PGMs) [Ihler and McAllester, 2009, Naesseth et al., 2014, Paige and Wood, 2016], Bayesian nonparametrics [Fearnhead, 2004] and many other areas.
We provide a unifying view of the SMC methods that have been developed since their conception in the early 1990s [Gordon et al., 1993, Stewart and McCarty, 1992, Kitagawa, 1993]. In this introduction we provide relevant background material, introduce a running example, and discuss the use of code snippets throughout the tutorial.
1.1 Historical Background
SMC methods are generic tools for performing approximate (statistical) inference, predominantly Bayesian inference. They use a weighted sample set to iteratively approximate the posterior distribution of a probabilistic model. Ever since the dawn of Monte Carlo methods (see e.g. Metropolis and Ulam [1949] for an early discussion), random sample-based approximations have been recognized as powerful tools for inference in complex probabilistic models. Parallel to the development of Markov chain Monte Carlo (MCMC) methods [Metropolis et al., 1953, Hastings, 1970], sequential importance sampling (SIS) [Handschin and Mayne, 1969] and sampling/importance resampling [Rubin, 1987] laid the foundations for what would one day become SMC.
SMC methods where initially known as particle filters [Gordon et al., 1993, Stewart and McCarty, 1992, Kitagawa, 1993]. Particle filters where conceived as algorithms for online inference in nonlinear state space models (SSMs) et al., 2005]. Since then there has been a flurry of work applying SMC and particle filters to perform approximate inference in ever more complex models. While research in SMC initially focused on SSMs, we will see that SMC can be a powerful tool in a much broader setting.
1.2 Probabilistic Models and Target Distributions
As mentioned above, SMC methods were originally developed as an approximate solution to the so called filtering problem, which amounts to online inference in dynamical models. Several overview and tutorial articles focus on particle filters, i.e. the SMC algorithms specifically tailored to solve the online filtering problem [Arulampalam et al., 2002, Doucet and Johansen, 2009, Fearnhead and Künsch, 2018]. In this tutorial we will take a different view and explain how SMC can be used to solve more general “offline” problems. We shall see how this viewpoint opens up for many interesting applications of SMC in machine learning that do not fall in the traditional filtering setup, and furthermore how it gives rise to new and interesting design choices. For complementary review and tutorial articles that treat a similar topic see also Del Moral et al. [2006b], Doucet and Lee [2018].
We consider a generic probabilistic model given by a joint probability dis-
tribution function (PDF) of latent variables x and observed data y,
We focus on Bayesian inference, where the key object is the posterior distribution
where p(y) is known as the marginal likelihood.
The target distributions are a sequence of probability distributions that we recursively approximate using SMC. We define each target distribution in the sequence as a joint PDF of latent variables x1:t
t
is denoted by
where is a positive integrable function and Zt is the normalization constant, ensuring that
is indeed a PDF.
We connect the target distributions to the probabilistic model through a requirement on the final target distribution . We enforce the condition that
is either equivalent to the posterior distribution, or that it contains the posterior distribution as a marginal distribution. The intermediate target distributions, i.e.
, are useful only insofar they help us approximate the final target
. This approach is distinct from most previous tutorials on particle filters and SMC that traditionally focus on the intermediate targets, i.e. the filtering distributions. We stress that it is not necessarily the case that x1:T (the latent variables of the target distribution) is equal to x (the latent variables of the probabilistic model of interest), as the former can include additional auxiliary variables.
Below we introduce a few examples of probabilistic models and some straightforward choices of target distributions. We introduce and illustrate our running example which will be used throughout. We will return to the issue of choosing the sequence of intermediate targets in Section 3.2.
State Space Models The state space model (or hidden Markov model) is a type of probabilistic models where the latent variables and data satisfy a Markov property. For this model we typically have x . Often the measured data can also be split into a sequence of the same length (T) as the latent variables, i.e. y
. The model is defined by a transition PDF f and an observation PDF g,
The joint PDF is
where pis the prior on the initial state x1. This class of models is especially common for data that has an inherent temporal structure such as in the field of signal processing. A common choice is to let the target distributions follow the same sequential structure as in Eq. (1.5):
which means that the final normalized target distribution satisfies p(x|y) as required. This is the model class and target distributions which are studied in the classical filtering setup.
Non-Markovian Latent Variable Models The non-Markovian latent variable models (LVMs) are characterized by either no, or higher order, Markov structure between the latent variables x and/or data y. This can be seen as a non-trivial extension of the SSM, see Eq. (1.4), which has a Markov structure. Also for this class of models it is common to have x
Unlike the SSM, the non-Markovian LVM in its most general setting requires access to all previous latent variables x1:tto generate xt, yt
where we again refer to ft and gt as the transition PDF and observation PDF, respectively. The joint PDF is given by
where pis again the prior on x1. A typical target distribution is given by
with . Another option is
For both these sequences of target distributions the final iteration T is the posterior distribution, i.e. . However, the former one will often lead to more accurate inferences. This is because we introduce information from the data at an earlier stage in the SMC algorithm.
Throughout the monograph we will exemplify the different methods using a Gaussian special case of Eq. (1.7), see Example 1.2.1. We let the prior on x1:t, defined by the transition PDFs f1,... ft, be Markovian and introduce the non-Markov property instead through the observation PDFs g1,..., gt.
Example 1.2.1 (Non-Markovian Gaussian Sequence Model). As our running example for illustration purposes we use a non-Markovian Gaussian sequence model. It is
with observed variables yt (data), and where
where denotes a Gaussian distribution on x with mean
and (co)variance
. We let the prior at t
. Artificial data was generated using
0.9,1,0.5,1). The distribution of interest is the posterior distribution p
. We illustrate a few sample paths of y1:T in Fig. 1.1 for T = 100.
We can adjust the strength of the dependence on previous latent variables in the observations, yt, through the parameter . If we set
Figure 1.1: Five sample paths of y1:T from our running example for T = 100.
a linear Gaussian SSM, since the data depends only on the most recent latent xt. On the other hand if we let , this signifies that xk for k < t has equally strong effect on yt as does xt.
Another example of a non-Markovian LVM often encountered in machine learning is the stochastic recurrent neural network (RNN) [Chung et al., 2015, Bayer and Osendorfer, 2014, Fraccaro et al., 2016, Maddison et al., 2017, Naesseth et al., 2018]. We define the stochastic RNN below and will return to it again in Chapter 3.
Example 1.2.2 (Stochastic Recurrent Neural Network). A stochastic RNN is a non-Markovian LVM where the parameters of the transition and observation models are defined by RNNs. A common example is using the conditional Gaussian distribution to define the transition PDF
where the functions are defined by RNNs.
Conditionally independent models A common model in probabilistic machine learning is to assume that the datapoints yk in the dataset y are conditionally independent given the latent x. This means that the joint PDF
is given by
where p(y|x) is the likelihood. For this class of models it might not be immediately apparent that we can define a useful sequence of target distributions on latent variables x1:t. There is no obvious sequence structure as in the SSM or the non-Markovian LVM. However, as we shall see, we can make use of auxiliary variables to design target distributions that can help with inference.
We will discuss two approaches to design the sequence of target distributions: using data tempering and likelihood tempering, respectively. Both of these will make use of an auxiliary variable technique, where each xt is a random variable on the same space as x.
Data tempering: Using data tempering we add the data yk to the target distribution one by one. In this case the data index k coincides with the target index t. We define the target distribution
where the distributions skare a design choice, known as backward kernels [Chopin, 2002, Del Moral et al., 2006a]. With this choice, we have that the marginal distribution of xT at the final iteration is exactly the posterior distribution, i.e.
. In fact, at each step we have that the marginal target distribution is a partial posterior
Likelihood tempering: With likelihood tempering, instead of adding the data one by one, we change the likelihood p(y|x) through a sequence of positive variables. We define the target distribution
where 0 1, and again make use of user chosen backward kernels sk
Chopin, 2002, Del Moral et al., 2006a]. In this setting all data is considered at each iteration. Since
1, we have that the final marginal target distribution is again equal to the posterior
Applying SMC methods to tempered (and similar) target distributions has been studied by e.g. Chopin [2002], Del Moral et al. [2006a]. If the proposal qkis a Markov kernel with stationary distribution
commonly used backward kernel is sk
. We refer to the works of Chopin [2002], Del Moral et al. [2006a] for a thorough discussion on the choice of backward kernels sk
. Another well known example is annealed importance sampling by Neal [2001], which uses the backward kernel example above to define the target distributions, but relies on SIS instead of SMC for inference.
Models and Targets We have seen several probabilistic models with examples of corresponding target distributions. While not limited to these, this illustrates the wide range of the applicability of SMC. In fact, as long as we can design a sequence of target distributions such that coincides with the distribution of interest, we can leverage SMC for inference.
1.3 Applications
Sequential Monte Carlo and importance sampling methods have already seen a plethora of applications to machine learning and statistical inference problems. Before we turn to the fundamentals of the various algorithms it can be helpful to understand some of these applications. We present and discuss a few select examples of applications of SMC and importance sampling (IS) to probabilistic graphical models, Bayesian nonparametric models, probabilistic programming, and inference evaluation.
Probabilistic Graphical Models Probabilistic graphical models (PGM; see e.g. Koller et al. [2009], Wainwright and Jordan [2008]) are probabilistic models where the conditional independencies in the joint PDF are described by edges in a graph. The graph structure allows for easy and strong control on the type of prior information that the user can express. The main limitation of the PGM is that exact inference is often intractable and approximate inference is challenging.
The PGM is a probabilistic model where the PDF factorizes according to an underlying graph described by a set of cliques C , i.e. fully connected subsets of the vertices V
contains all individual components of x
and y. The undirected graphical model can be denoted by
where xC includes all elements of x and y in the clique C, and Z is a normalization constant ensuring that the right hand side is a proper PDF.
SMC methods have recently been successfully applied to the task of inference in general PGMs, see e.g. MacEachern et al. [1999], Chopin [2002], Del Moral et al. [2006a], Ihler and McAllester [2009], Naesseth et al. [2014], Paige and Wood [2016], Lindsten et al. [2017, 2018] for representative examples.
Probabilistic Programming Probabilistic programming languages are programming languages designed to describe probabilistic models and to automate the process of doing inference in those models. We can think of probabilistic programming as that of automating statistical inference, particularly Bayesian inference, using tools from computer science and computational statistics. Developing a syntax and semantics to describe and denote probabilistic models and the inference problems we are interested in solving is key to the probabilistic programming language. To define what separates a probabilistic programming language from a standard programming language we quote Gordon et al. [2014]: “Probabilistic programs are usual functional or imperative programs with two added constructs: (1) the ability to draw values at random from distributions, and (2) the ability to condition values of variables in a program via observations.” This aligns very well with the notion of Bayesian inference through the posterior distribution Eq. (1.2); through the syntax of the language we define x to be the random values we sample and y our observations that we condition on through the use of Bayes rule. The probabilistic program then essentially defines our joint probabilistic model p(x,y).
One of the main challenges of probabilistic programming is to develop algorithms that are general enough to enable inference for any model (probabilistic program) that we could conceivably write down using the language. Recently Wood et al. [2014] have shown that SMC-based approaches can be used as inference back-ends in probabilistic programs.
For a more thorough treatment of probabilistic programming we refer the interested reader to the recent tutorial by van de Meent et al. [2018] and the survey by Gordon et al. [2014].
Bayesian nonparametric models Nonparametric models are characterized by having a complexity which grows with the amount of available data. In a Bayesian context this implies that the usual latent random variables (i.e., parameters) of the model are replaced by latent stochastic processes. Examples include Gaussian processes, Dirichlet processes, and Beta processes; see e.g. Hjort et al. [2010] for a general introduction.
Sampling from these latent stochastic processes, conditionally on observed data, can be done using SMC. To give a concrete example, consider the Dirichlet process mixture model, which is a clustering model that can handle an unknown and conceptually infinite number of mixture components. Let yt, t = 1,2,... be a stream of data. Let xt, t = 1,2,... be a sequence of latent integer-valued random variables, such that xt is the index of the mixture component to which datapoint yt belongs. A generative representation of the mixture assignment variables is given by
where Jt :is the number of distinct mixture components represented by the first t datapoints, and nt,j :
is the number of datapoints among y1:t that belong to the jth component.
The model is completed by specifying the distribution of the data, conditionally on the mixture assignment variables:
where Gis an emission probability distribution parameterized by
F
is a prior distribution over the mixture parameters
Note that the mixture assignment variables xt, t = 1,2,... evolve according to a latent stochastic process. Solving the clustering problem amounts to computing the posterior distribution of this stochastic process, conditionally on the observed data. One way to address this problem is to use SMC; see Fearnhead [2004] for an efficient implementation tailored to the discrete nature of the problem.
Inference Evaluation An important problem when performing approximate Bayesian inference is to figure out when our approximation is “good enough”? Is it possible to give practical guarantees on the approximation we obtain? We need ways to evaluate how accurate our approximate inference algorithms are when compared to the true target distribution that we are trying to approximate. We will refer to the process of evaluating and validating approximate inference methods as inference evaluation.
Inference evaluation is mainly concerned with measuring how close our approximation is to the true object we are trying to estimate, often a posterior distribution. For simulated data, Grosse et al. [2015, 2016] have shown that we can make use of SMC and IS to bound the symmetrized Kullback-Leibler (KL) divergence between our approximate posterior and the true posterior. In another related work Cusumano-Towner and Mansinghka [2017] have shown that SMC-based methods show promise in estimating the symmetric KL divergence between the approximate posterior and a gold standard algorithm.
1.4 Example Code
We will be making use of inline Python code snippets throughout the manuscript to illustrate the algorithms and methods. Below we summarize the modules that are necessary to import to run the code snippets:
1.5 Outline
The remainder of this tutorial is organized as follows. In Chapter 2, we first introduce IS, a foundational building block for SMC. Then, we discuss the limitations of IS and how SMC resolves these. Finally, the section concludes with discussing some practical issues and theoretical results relevant to SMC methods.
Chapter 3 is focused on the two key design choices of SMC: the proposal and target distributions. Initially we focus on the proposal, discussing various ways of adapting and learning good proposals that will make the approximation more accurate. Then we discuss the sequence of target distributions; how we can learn intermediate distributions that help us when we try to approximate the posterior.
Chapter 4 focuses on pseudo marginal (PM) methods and other SMC methods that rely on a concept known as proper weights. First, we provide a simple and straightforward proof of the unbiasedness property of the SMC normalization constant estimate. Then, we describe and illustrate the combination of MCMC and SMC methods through PM algorithms. We move on to detail properly weighted SMC, a concept that unites and extends the random weights and nested SMC algorithms. Finally, we conclude the chapter by considering a few approaches for distributed and parallel SMC.
In Chapter 5 we introduce conditional sequential Monte Carlo (CSMC), and related methods for simulating from and computing expectations with respect to a target distribution. First, we introduce the basic CSMC algorithm and provide a straightforward proof of the unbiasedness of the inverse normalization constant estimate. Then, we show how SMC and CSMC can be combined to leverage multi-core and parallel computing architectures in the interacting particle Markov chain Monte Carlo (IPMCMC) algorithm. Finally, we discuss recent applications of CSMC for evaluating approximate inference methods.
The tutorial concludes with a discussion and outlook in Chapter 6.
Typical applications require us to be able to evaluate or sample from the target distributions , as well as compute their normalization constants Zt. For most models and targets this will be intractable, and we need approximations based on e.g. Monte Carlo methods.
In this section, we first review IS and some of its shortcomings. Then, we introduce the SMC method, the key algorithm underpinning this monograph. Finally, we discuss some key theoretical properties of the SMC algorithm.
2.1 Importance Sampling
Importance sampling [Kahn, 1950a,b, Robert and Casella, 2004] is a Monte Carlo method that constructs an approximation using samples from a proposal distribution, and corrects for the discrepancy between the target and proposal using (importance) weights.
Most applications of Bayesian inference can be formulated as computing the expectation of some generic function ht, referred to as a test function, with respect to the target distribution
Examples include posterior predictive distributions, Bayesian p-values, and point estimates such as the posterior mean. Computing Eq. (2.1) is often in-
tractable, but by a clever trick we can rewrite it as
The PDF qt is a user chosen proposal distribution, we assume it is simple to sample from and evaluate. We can now estimate the right hand side of Eq. (2.2) using the Monte Carlo method,
where are simulated iid from qt. We will usu- ally write Eq. (2.3) more compactly as
where the normalized weights wit are defined by
with a shorthand for
. The estimate in Eq. (2.4) is strongly con- sistent, converging (almost surely) to the true expectation as the number of samples N tend to infinity. An alternate view of IS is to consider it an (empirical) approximation1 of
where denotes the Dirac measure at X. Furthermore, IS provides an approximation of the normalization constant,
Because the weights depend on the random samples, x i1:t, they are themselves random variables. One of the key properties is that it is unbiased, This can be easily seen by noting that x i1:t are iid draws from qt and therefore
since Zt is nothing by the normalization constant for
property will be important for several of the more powerful IS and SMC-based methods considered in this monograph.
We summarize the importance sampling method in Algorithm 1. This algorithm is sometimes referred to as self-normalized IS, because we are normalizing each individual weight using all samples.
The straightforward implementation of the IS method we have described thus far is impractical for many of the example models and targets in Section 1.2. It is challenging to design good proposals for high-dimensional models. A good proposal is typically more heavy-tailed than the target; if it is not, the weights can have infinite variance. Another favorable property of a proposal is that it should cover the bulk of the target probability mass, putting high probability on regions of high probability under the target distribution. Even Markovian models, such as the SSM, can have a prohibitive computational complexity without careful design of the proposal. In the next section we will describe how we can alleviate these concerns using SIS, a special case of IS, with a kind of divide-and-conquer approach to tackle the high-dimensionality in T.
2.1.1 Sequential Importance Sampling
Sequential importance sampling [Handschin and Mayne, 1969, Robert and Casella, 2004] is a variant of IS were we select a proposal distribution that has an autoregressive structure, and compute importance weights recursively. By choosing a proposal defined by
we can decompose the proposal design problem into T conditional distributions. This means we obtain samples x i1:t by reusing x i1:tfrom the previous iteration, and append a new sample, x it, simulated from qt
unnormalized weights can be computed recursively by noting that
We summarize the SIS method in Algorithm 2, where q1
If we need to evaluate the normalization constant estimate , analogously to IS we make use of Eq. (2.6). However, we may also obtain a (strongly)
consistent estimate of the ratio of normalization constants Zt
While the estimate of the ratio is consistent, it is in general not unbiased. However, SIS is a special case of IS. This means that the SIS estimate of the normalization constant for in Eq. (2.6), is still unbiased and consistent.
In Example 2.1.1 we detail a first example proposal qt for the running example, and derive the corresponding weights . Furthermore, we include a code snippet that illustrates how to implement the sampler in Python.
Example 2.1.1 (Sequential importance sampling for Example 1.2.1). We revisit our running non-Markovian Gaussian example. The target distribution is
with p
A common approach is to set the proposal to be the prior (or transition) distribution f . A sample from the proposal qtis generated as follows
We refer to this proposal simply as the prior proposal. The corresponding weight update is
where . We provide Example Code 2.1 to illustrate how to implement SIS with the prior proposal for this model in Python.
For improved numerical stability we update the log-weights log and subtract the logarithm of the sum of weights (the weight normalization) before exponentiating.
SIS can be implemented efficiently for a large class of problems, the computational complexity is usually linear in N and T. Even so, the IS methods suffer from severe drawbacks limiting their practical use for many high-dimensional problems.
2.1.2 Shortcomings of Importance Sampling
The main drawback of IS is that the variance of the estimator scales unfavorably with the dimension of the problem; the variance generally increases exponentially in T [Doucet and Johansen, 2009is a special case of IS it inherits this unfavorable property. To see that this is true we can illustrate using a toy example where the target factorizes completely over time and with identical distributions for each factor
. We illustrate in Example 2.1.2.
Example 2.1.2 (Sequential importance sampling variance of Assume that the target distribution of interest is given by
, and we pick a proposal distribution that follows the same factorization qt
The normalization constant is then given by Zt
normalization constant estimate is
and its corresponding variance, normalized by Zt, is
where for all k. This means that if the proposal is not exactly equal to the target the variance of the normalization constant estimate will in general increase exponentially with t.
Another way in which the unfavorable scaling in t manifests itself in practice is through the normalized weights wit. The maximum of the weights, maxi wit, will quickly approach one as t increases, and consequently all other weights approach zero; a phenomena known as weight degeneracy. This means that, effectively, we approximate the target distribution using a single sample.
We illustrate weight degeneracy in Example 2.1.3 using the running example.
Example 2.1.3 (SIS weight degeneracy). We return to our running example, set the length T = 6, number of particles N 0.9,1.0,0.5,1.0). Fig. 2.1 shows the particles and the normalized weights wit, where the area of the discs correspond to the size of the weights. We can see that as t increases nearly all mass concentrates on the fourth particle x4t . This means that the normalized weight of the particle is almost one, w4t
. The remaining particles have nor- malized weights that are all close to zero, and thus have a negligible contribution to the approximation. This concentration of mass for SIS to a single particle happens very quickly. Even for very simple Markovian models the variance of e.g. our normalization constant estimator can increase exponentially fast as a function of T.
Sequential Monte Carlo methods tackle the weight degeneracy issue by choosing a proposal that leverages information contained in , the previous iteration’s target distribution approximation.
2.2 Sequential Monte Carlo
Sequential Monte Carlo methods [Gordon et al., 1993, Stewart and McCarty, 1992, Kitagawa, 1993] improve upon IS by mitigating the weight degeneracy
Figure 2.1: Weight degeneracy of the SIS method. The size of the disks represent the size of the corresponding weights wit.
issue through a clever choice of proposal distribution. For certain sequence models the weight degeneracy issue can be resolved altogether, providing estimators to the final marginal distribution that do not deteriorate for increasing T. For other sequence models, SMC still tends to provide more accurate estimates in practice compared to IS.
Just like in SIS we need a sequence of proposal distributions qtfor t = 1,..., T. This is a user choice that can significantly impact the accuracy of the SMC approximation. For now we assume that the proposal is given and return to this choice in Chapter 3. Below, we detail the iterations (or steps) of a basic SMC algorithm.
Step 1: The first iteration of SMC boils down to approximating the target distribution using standard IS. Simulating N times independently from the first proposal
and assigning corresponding weights
lets us approximate (cf. Eq. (2.5)) by
The key innovation of the SMC algorithm is that it takes advantage of the information provided in , Eq. (2.12), when constructing a proposal for the next target distribution
Step 2: In the second iteration of SMC we sample x1:2 from the proposal , rather than from q1
times independently from
and assign weights
Simulating x i1:2, Eq. (2.13), can be broken down into parts: resampling x i1 propagation x i2
, and concatenation x i1:2
the overloaded notation for x i1. We replace the initial sample set
Step 1, with the resampled set x i1
Resampling can refer to a variety of methods in statistics, for our purpose it is simple (weighted) random sampling with replacement from x1:N1
with weights w1:N1
. Resampling N times independently means that the number of times each particle is selected is multinomially distributed. This resampling algorithm is known as multinomial resampling, see Example Code 2.2. In Sections 2.2.1 and 2.2.2 we revisit resampling and present alternative resampling algorithms, increasing efficiency by correlation and adaptation.
Propagation generates new samples independently from the proposal, i.e. x i2 . By concatenating x i1:2
obtain a complete sample from the proposal
Finally, new importance weights are computed according to (2.14). This expression can be understood as a standard importance sampling weight—the target divided by the proposal. Note, however, that we use
in the denominator: when computing the weights we “pretend” that the proposal is given by
rather than by its approximation
. This is of course just an interpretation and the motivation for using this particular weight expression is given by the convergence analysis for SMC; see Section 2.3.
The resulting approximation of
SMC can in essence be described as a synthesis of SIS and resampling, which explains its alternate names sequential importance resampling (SIR) or sequential importance sampling and resampling (SISR).
Step t: The remaining iterations follow the recipe outlined in step 2. First, the proposal is the product of the previous empirical distribution approximation and a conditional distribution
Samples for i = 1,..., N are generated as follows
Finally, we assign the weights
and approximate
The normalization constant Zt can be estimated by
We summarize the full sequential Monte Carlo sampler in Algorithm 3, where method typically achieves drastic improvements compared to SIS. In Example 2.2.1 we return to our running example, using the same settings as in Example 2.1.3, to study the sample diversity and quality of the basic SMC method compared to SIS.
Example 2.2.1 (SMC sample diversity). We illustrate the weights and resampling dependencies in Fig. 2.2. The grey arrows represent what sample from iteration t that generated the current sample at iteration t, referred to as its ancestor. We can see that the weights tend to be more evenly distributed for SMC. The algorithm dynamically chooses to focus computational effort on more promising samples through the resampling step. Particles with low weights tend to not be resampled, and particles with high weights are resampled more frequently.
Not only do we get more diversity in our sample set, SMC also tends to find areas of higher probability. We illustrate this phenomenon in Table 2.1. We fix the number of particles N = 10, then study the average log-probability of our target distribution log , under the sampling distributions
, normalized by the number of iterations T.
Figure 2.2: Diversity of samples in the SMC method. The size of the disks represent the size of the weights wit, and the grey arrows represent resampling.
Table 2.1: Average log-probability values of the unnormalized target distribution with respect to the sampling distributions of SIS and SMC. The number of particles N = 10 is fixed for both methods.
While SMC methods do not suffer from weight degeneracy as t increases, the fundamental difficulty of simulating from a high-dimensional target still remains. For SMC this instead manifests itself in another form of weight degeneracy and something known as path degeneracy. We discuss weight degeneracy below and illustrate path degeneracy in Example 2.2.2.
Weight degeneracy in SMC: While introducing resampling can alleviate the weight degeneracy stemming from SIS’s unfavorable scaling in t, SMC can still suffer from weight degeneracy due to a high-dimensional xt. If at each iteration of Algorithm 3 the maximum of the normalized importance weight maxi wi is close to one, and the remaining close to zero, we say that the SMC sampler suffers from weight degeneracy. This occurs when the variance of the unnormalized importance weights is high. Snyder et al. [2008], Bickel et al. [2008], Bengtsson et al. [2008], Snyder et al. [2015] showed that to avoid weight degeneracy the number of particles N must scale at least exponentially with the variance of the log-weights, which in simple examples is proportional to the dimension nx of the latent variable xt. These results rely on a Gaussian target, are only valid as both N and nx tend to infinity, and focus on the variability of the log-weights. Naesseth et al. [2019, Proposition 3] showed that asymptotic variance of the resulting estimator of simple test functions can also scale exponentially in the dimension nx, relying on asymptotics only in N and not nx. We return to this example as we discuss the central limit theorem (CLT) in Section 2.3.
The weight degeneracy in SMC caused by high-dimensional latent variables is still an open research problem, see e.g. Djuri´c and Bugallo [2013], Briggs et al. [2013], Beskos et al. [2014], Rebeschini and Van Handel [2015], Naesseth et al. [2019] and the references therein. In this tutorial we focus on how choosing good proposals and target distributions can ameliorate, but not completely solve, this important issue.
Example 2.2.2 (SMC path degeneracy). In Fig. 2.3 we reiterate the result from our previous example, Example 2.2.1. However, this time we only include the arrows corresponding to samples that have been consistently resampled and form our final approximation . We can see that our approximation for early iterations collapses back to a single ancestor, e.g. we have N = 5 identical copies of x11 in Fig. 2.3. This phenomena is known as path degeneracy and occurs be- cause of the resampling mechanism. In Jacob et al. [2015] the authors study the impact this has on the memory requirements. They show that for state space models, under suitable conditions on the observation PDF g, the expected distance from the current iteration to a coalescence of the paths is bounded from above by
In Fig. 2.4 we study the impact that increasing dependence on earlier iterations has on our SMC estimate of the log-normalization constant. We let N = 20, T = 100 be fixed and vary the value of , where increasing values of
correspond to more long-range dependencies in our non-Markovian LVM. We can see that for modest values of
method achieves accurate estimates, whereas for higher values the drop-off in efficiency is significant. This manifests itself as an increase in the Monte Carlo (MC) variance in the estimate (width of bars), as well as a negative bias. As we will discuss in Section 2.3, this negative bias in the estimate of log ZT is typical for SMC and IS.
Figure 2.3: Path degeneracy of the SMC method for smoothing approximation. The size of the disks represent the size of the weights wit, and the grey arrows represent resampling.
In Sections 2.2.1 and 2.2.2 we will discuss two standard practical approaches that help alleviate (but not solve) the issue of path degeneracy and improve overall estimation accuracy. The first is low-variance resampling — we lower the variance in the resampling step by correlating random variables. The second is adaptive resampling, which essentially means that we do not resample at each iteration but only on an as-needed basis. In Chapter 3 we go a step further and show how to choose, or learn, proposal and target distributions that lead to even further improvements.
2.2.1 Low-variance Resampling
The resampling step introduces extra variance in the short-term, using estimate
is better than using the resampled particle set. However, discarding improbable particles, and putting more emphasis on promising ones is crucial to the long-term performance of SMC. To keep long-term performance but minimize the added variance, we can employ standard variance reduction techniques based on correlating samples drawn from
. First, we explain a common technique for implementing multinomial resampling based on inverse transform sampling. Then, we explain two low-variance resampling alterna-
Figure 2.4: Violinplots for the log of the SMC estimate of the normalization constant divided by the true value, i.e. log . The number of particles are N = 20, length of the data is T = 100, and we study five different settings of
0.001,0.1,0.5,0.7,0.99).
tives, stratified and systematic resampling. To sample from we use inverse transform sampling based on the weightswit
. We have that
where a is an integer random variable on {1,..., N}, such that the following is true
for u . Repeating the above process independently N times gives multinomial resampling. That is, we draw ui
and find the corresponding ai for each i = 1,..., N. In the context of SMC the index variables
determine the ancestry of the particles and they are therefore often referred to as ancestor indices.
Stratified Resampling One way to improve resampling is by stratification on the uniform random numbers ui [Kitagawa, 1996, Fearnhead, 1998]. This
means we divide the unit interval into N strata, Then, we generate ui
each strata i = 1,..., N. Finally, the corresponding ancestor indices ai are given by studying Eq. (2.19).
Example Code 2.3 shows how this can be implemented in Python. The main change compared to multinomial resampling, Example Code 2.2, is the way the ui’s are generated.
Systematic Resampling We can take correlation to the extreme by only generating a single uniform number u to set all the ui’s. Systematic resampling [Carpenter et al., 1999, Whitley, 1994] means that we let
where the random number u is identical for all i = 1,..., N. Then, we find corresponding indices ai by again studying Eq. (2.19). Note that just like in stratified resampling, systematic resampling also generates one ui in each strata
. However, in this case the ui’s are based on a single random value u. This means that systematic resampling will be more computationally efficient than stratifies and multinomial resampling.
The code change to implement systematic resampling is simple. We only need to change a single line in Example Code 2.3. We replace line 3 by: u = (np.arange(N) + npr.rand())/N.
Both systematic and stratified resampling are heavily used in practice. In many cases systematic resampling achieves slightly better results [Hol et al., 2006]. However, systematic resampling can sometimes lead to non-convergence [Gerber et al., 2017], depending on the ordering of the samples.
For a more thorough discussion on these, and other, resampling methods see e.g. Douc and Cappé [2005], Hol et al. [2006].
2.2.2 Effective Sample Size and Adaptive Resampling
The resampling step introduces extra variance by eliminating particles with low weights, and replicating particles with high weights. If the variance of the normalized weights is low, this step might be unnecessary. By tracking the variability of the normalized weights, and trigger a resampling step only when the variability crosses a pre-specified threshold, we can alleviate this issue. This is known as adaptive resampling in the SMC literature. Often we study the effective sample size (ESS) [Liu, 2004] to gauge the variability of the weights, which for iteration t is
The ESS is a positive variable taking values in the continuous range between 1 and N. For IS the ESS is an approximation of the number of exact samples from that we would need to achieve a comparable estimation accuracy [Doucet and Johansen, 2009]. A common approach is to resample only at iterations when the ESS falls below N
Adaptive resampling can be implemented by slight alterations to the proposal and weight updates in Eqs. (2.15) and (2.16). If ESStis above the prespecified threshold we simply omit the resampling step in Eq. (2.15) and obtain
The fact that resampling is omitted at some iterations needs to be accounted for when computing the weights. For iterations where we do not resample the weight update is,
This can be thought of as adding an extra importance correction, where the previous weights define a “target distribution” for the ancestor indices and the factor 1
is the “proposal” corresponding to not resampling. When the
ESS falls below our pre-set threshold, we use the standard update Eqs. (2.15) and (2.16) with resampling instead.
Adaptive resampling is usually combined with the low-variance resampling techniques explained above for further variance reduction.
Remark 2.2.1. The constant factors in the weight expression will cancel when normalizing the weights. However, these constants still need to be included for the expression for the normalizing constant estimate Eq. (2.18) to be valid. That is, the extra importance correction added at those iterations when we do not resample should be the ratio of the normalized weights from the previous iteration, divided by the constant 1. An alternative approach appearing in the literature is to neglect the constants when computing the weights, but then modify the expression for the normalizing constant estimate Eq. (2.18) instead.
2.3 Analysis and Convergence
Since its conception in the 1990s, significant effort has been spent on studying the theoretical properties of SMC methods. We will review and discuss a few select results in this section. For an early review of the area, see e.g. Del Moral [2004]. The theorems we discuss below all hold for a number of conditions on the proposal, probabilistic model, and test functions. For brevity we have omitted these exact conditions and refer to the cited proofs for details.
Unbiasedness One of the key properties of SMC approximations is that they provide unbiased approximations of integrals of functions ht with respect to the unnormalized target distribution . We formalize this in Theorem 2.3.1.
Theorem 2.3.1 (Unbiasedness).
Proof. See Del Moral [2004, Theorem 7.4.2]. For the special case ht also Section 4.1 and Appendix 4.A.
A particularly important special case is when ht 1 and we approximate the normalization constant of
. We have that
, where the expectation is taken with respect to all the random variables generated by the SMC algorithm. If we instead consider the more numerically stable log
by Jensen’s inequality that
. This means that the estimator of the log-normalization constant is negatively biased. This is illustrated by the violin plot discussed in Example 2.2.2.
We will delve deeper into the applications of the unbiasedness property and its consequences in Chapter 4.
Laws of Large Numbers While integration with respect to unnormalized distributions can be estimated unbiasedly, this is unfortunately not true when estimating expectations with respect to the normalized target distribution However, SMC methods are still strongly consistent, leading to exact solutions when the number of particles N tend to infinity. We formalize this law of large numbers in Theorem 2.3.2.
Theorem 2.3.2 (Law of Large Numbers).
Proof. See Del Moral [2004, Theorem 7.4.3].
Central Limit Theorem While the law of large numbers from the previous section shows that SMC approximations are exact in the limit of infinite computation, it tells us nothing about the quality of our estimate. The CLT in Theorem 2.3.3 tells us about the limiting distribution of our SMC estimate and its asymptotic variance. This gives us a first approach to get an understanding for the precision of our SMC approximation to
Theorem 2.3.3 (Central Limit Theorem).
where Vtis defined recursively for a measurable function h,
initialized with
Proof. See Chopin [2004].
Example 2.3.1 (State Space Model). We study the following example of a state space model
where nx denotes the dimension of xl and yl. For simplicity assume that yl,k = yl,m, m and that
. Let the function of interest be ht
, i.e. the sum of the components of the state at the most recent iteration. Studying the asymptotic variance in Theorem 2.3.3 for the so called fully adapted
SMC, explained in Section 3.2.2, we get
where the constants Al, Bl, and Cl are defined by
For the proof see Naesseth et al. [2019, Proposition 3].
This result tells us that the asymptotic variance is exponential in nx, the dimension of xt. Furthermore, it tells us that for the asymptotic variance to be stable in t, i.e. to not be increasing with increasing t, we need that our model forgets its initial conditions. Concretely, we need the inner expectations in Al and Cl to tend to 0 (the posterior mean) quickly enough to be able to ensure that we can bound the sum by a (finite) constant that does not depend on t. This holds more generally (under various assumptions on the model and proposal) for approximations of the filtering distribution, . We formalize the result in Proposition 2.3.1.
Proposition 2.3.1 (Asymptotic Variance for Filtering Approximation in SSM). Under appropriate forgetting conditions on the model
for some finite constant c.
Proof. See Whiteley [2013].
Proposition 2.3.1 shows that the asymptotic variance can be stable as a function of t, this is in general only true for test functions that depend only on the most recent latent variable xt and under various assumptions on the model and proposal [Del Moral and Guionnet, 2001, Del Moral, 2004, Whiteley, 2013]. When considering the variance of s, it is instead possible to show that the variance increases only linearly with t [Whiteley, 2013, Del Moral, 2004]. This implies that to obtain a good estimate of the normalization constant, we must choose the number of particles N to scale at least linearly in t.
Sample Bounds Another way of looking at the SMC method, disregarding test functions, is as a direct approximation to the target distribution itself. The weights and samples generated by the SMC algorithm provide an approximation , Eq. (2.17), to the true target distribution
With this point of view we can study bounds on the difference between the distribution of a sample drawn from the SMC approximation compared to that of a sample drawn from the target distribution. Specifically, assume that we generate a sample xby first running an SMC sampler to generate an approximation
, and then simulate x
. Then, the marginal distribution of x
, where the expectation is taken with respect to all random variables generated by the SMC algorithm. It is worth noting that this distribution,
, may be continuous despite the fact that
is a point-mass distribution by construction. In Theorem 2.3.4 we restate a generic sample bound on the KL divergence from the expected SMC approximation
the target distribution
divergence from a distribution q to another distribution p is defined by KL
Theorem 2.3.4 (Sample Bound).
for a finite constant
Proof. See Del Moral [2004, Theorem 8.3.2].
From this result we can conclude that in fact the SMC approximation tends to the true target in distribution as the number of particles increase.
Recently there has been an increased interest for using SMC as an approximation to the target distribution, rather than just as a method for estimating expectations with respect to test functions. This point of view has found applications in e.g. probabilistic programming and variational inference [Wood et al., 2014, Naesseth et al., 2018, Huggins and Roy, 2015].
The two main design choices of sequential Monte Carlo methods are the proposal distributions qt, t and the intermediate (unnormalized) target distributions
. Carefully choosing these can drastically improve the efficiency of the algorithm and the accuracy of our estimation. Adapting both the proposal and the target distribution is especially important if the latent variables are high-dimensional.
In the first section below we discuss how to choose, or learn, the proposal distribution for a fixed target distribution. Then, in the final section we discuss how we can design a good sequence of intermediate target distributions for a given probabilistic model.
3.1 Designing the Proposal Distribution
The choice of the proposal distribution is perhaps the most important design choice for an efficient sequential Monte Carlo algorithm. A common choice is to propose samples from the model prior; this is simply known as the prior proposal. When using the prior as proposal the algorithm is commonly known as the bootstrap particle filter or bootstrap SMC [Gordon et al., 1993]. However, using the prior can lead to poor approximations for a small number of particles, especially if the latent space is high-dimensional.
We will in this section derive the locally (one-step) optimal proposal distribution, or optimal proposal for short. Because it is typically intractable, we further discuss various ways to either emulate it or approximate it directly. Finally, we discuss alternatives to learn efficient proposal distributions using an end-to-end variational perspective.
3.1.1 Locally Optimal Proposal Distribution
The locally optimal proposal distribution [Doucet et al., 2000] is the distribution we obtain if we assume that we have a perfect approximation at iteration t 1. Then, choose the proposal qt that minimizes the KL divergence from the joint distribution
. We formalize this result in Proposition 3.1.1. Equivalently, we can view this as the proposal minimizing the variance of the incremental weights, i.e.
, with respect to the newly generated samples x it.
Proposition 3.1.1 (Locally optimal proposal distribution). The optimal proposal qminimizing KL
by
Proof. If we let “const” denote terms constant with respect to the proposal distribution qt
where the inner (conditional) Kullback-Leibler divergence is zero if and only if qt
In Example 3.1.1 we show that the optimal proposal is analytically tractable for our running non-Markovian Gaussian example.
Example 3.1.1 (Optimal proposal for Example 1.2.1). If we let , then the (locally) optimal proposal for our
running example is analytically tractable. It is
In most practical cases the optimal proposal distribution is not a feasible alternative. The resulting importance weights are intractable, or simulating random variables from the optimal proposal is too computationally costly. Below we discuss various common approaches for approximating it.
3.1.2 Approximations to the Optimal Proposal Distribution
There have been a number of suggestions over the years on how to approximate the optimal distribution. We will review three analytic approximations based on a Gaussian assumption, as well as briefly describe an exact approximation.
Laplace Approximation We obtain the Laplace approximation to the optimal proposal by a second-order Taylor approximation of the log-PDF around a point ¯x [Doucet et al., 2000]. If we let lt, supressing the dependence on x1:t
where are the gradient and the Hessian of the log-PDF with respect to xt, respectively. A natural approximation to the optimal proposal is then
The mode of the distribution can be a good choice for the linearization point ¯x if the distribution is unimodal. With this choice the mean simplifies to just ¯x. However, the mode is usually unknown and will depend on the value of x1:t. This means that we are required to run a separate optimization for each particle x i1:t
and iteration t to find the mode and Hessian. This can outweigh the benefits of the improved proposal distribution.
Extended and Unscented Kalman Filter Approximations The extended Kalman filter (EKF) and the unscented Kalman filter (UKF) [Anderson and Moore, 1979, Julier and Uhlmann, 1997] are by now standard solutions to non-linear and non-Gaussian filtering problems. We can leverage these ideas to derive another class of Gaussian approximations to the optimal proposal distribution [Doucet et al., 2000, Van Der Merwe et al., 2001].
The two methods depend on a structural equation representation of our probabilistic model,
where vt and et are random variables. The representation implies a joint distribution on xt and yt conditional on x1:t. The locally optimal proposal distribution corresponding to this representation is given by q
approximations rely on a Gaussian approximation to the joint conditional distribution of xt and yt,
where we have suppressed the dependence on x1:tfor clarity. We block the mean
and covariance
as follows
Under the assumptions of the approximation in Eq. (3.4), the distribution of xt is tractable:
with
The key difference between the EKF- and UKF-based approximations is how to compute the estimates . We leave the details of these procedures to Appendix 3.A, and focus here on the intuition behind them.
The EKF uses a first-order Taylor approximation to the non-linear functions a, c to derive an approximation to the full posterior distribution based on exact (analytical) updates of the approximate model. Following this line of thinking we can similarly linearize a, c locally. Then under a Gaussian assumption on the noise vt, et, we compute the distribution exactly for the linearized model. The EKF-based approximations
can be found in Eq. (3.47) in the appendix.
The UKF on the other hand uses so-called sigma points to reach the Gaussian approximation . The key idea is to choose a set of points, pass them through the nonlinear functions a and c, and then estimate the mean and variance of the transformed point-set. Unlike the EKF-based approximation, the UKF-based approximation does not require that the noises vt and et are Gaussian distributed. However, we do require that the mean and variance for these random variables are available. The UKF-based approximations
can be found in Eq. (3.49) in the appendix.
Analytic Gaussian Approximations We refer to the Laplace-, EKF- and UKF-based approximations as analytic Gaussian approximations to the locally optimal proposal distribution. We summarize these three approaches in Algorithm 4. Which one works best depends heavily on the model being studied. In Example 3.1.2 we study a simple example when x and y are one-dimensional random variables.
Example 3.1.2 (Gaussian Proposal Approximations). We illustrate the analytic Gaussian approximations from Algorithm 4 based on a simple scalar example
The prior on the latent variable x is a standard normal, but the measurement model c is a polynomial in x with additive Gaussian noise. In Fig. 3.1 we show the true (bimodal) posterior p(x | y), and the corresponding approximations based on the Laplace, EKF, and UKF methods. Neither of these methods can capture the bimodality of the true normalized distribution, since they are all based on the basic simplifying assumption that the posterior is Gaussian. However, as previously discussed a good proposal will cover the bulk of probability mass of the target. This means it is likely that the EKF and UKF proposals will outperform the Laplace proposal in this case.
Exact Approximations In exact approximations we use another level of MC. Instead of drawing exact samples from q, we approximate draws from it using a nested MC algorithm [Naesseth et al., 2015, 2019]. This nested algorithm is chosen such that it does not alter the asymptotic exactness of the outer SMC method. We have discussed how SMC and IS can be used as distribution approximators. Nested MC methods takes this one step further, using a separate SMC (or IS) approximation
for each sample from q
we would like. This can lead to substantial benefits for e.g. high-dimensional latent variables xt. One of the key differences between the analytic Gaussian approximations and the exact approximations is the fixed form distribution assumption on the analytic approximation. The exact approximation is not limited to a standard parametric distribution, and we can obtain arbitrarily accurate approximations with enough compute. The trade-off is the additional effort needed to generate each sample x it, which requires us to run independent SMC algorithms for each particle i at each iteration t. While this might seem wasteful, it can in fact improve accuracy compared to SMC methods with standard proposals [Naesseth et al., 2015], even for equal compute.
We are interested in approximating the locally optimal proposal distribu-
Figure 3.1: Analytic Gaussian approximations of p(x | y) for the model in Eq. (3.8).
tion, i.e. q, separately for each particle i using an MC method. A first approach is to use a nested IS sampler with a proposal rt
targeting q
. We construct an approximation of the optimal proposal independently for each particle x i1:t
where the weights are given by
We replace qtin Eq. (2.15) with samples from the distribution de-fined in Eq. (3.9). The corresponding weight update, under the assumption that we have resampled x i1:t
, is given by
We summarize the nested IS approach to approximate the optimal proposal in Algorithm 5. The algorithm generates a single sample, and must be repeated for each i = 1,..., N.
By what we know from the theory of SMC, see Section 2.3, we would expect that . This is true, and as shown by Naesseth et al. [2019] the asymptotic variance in the limit of M
is equal to that of using the locally optimal proposal distribution. The main implication is that we can obtain arbitrarily accurate optimal proposal approximations at the cost of increasing the number of samples M for the nested MC method. We can also combine the analytic Gaussian approximations above with exact approximations by choosing rt as either the Laplace, EKF, UKF, or any other analytic approximation.
The computational complexity of the nested IS method is increased by a factor of M compared to standard SMC. A natural question is therefore whether or not we are better of running standard SMC with rt as a proposal and using N M particles. However, the memory requirement is lower for nested SMC and there are opportunities for speeding up through parallelization. There are even variants of nested MC that can outperform on the same computational budget [Naesseth et al., 2015], i.e. when compared to standard SMC with N M particles without taking parallelization into account. We return to this class of methods in Section 4.3.2.
3.1.3 Learning a Proposal Distribution
Rather than relying on approximating the locally optimal proposal distribution, we can instead directly learn a good proposal distribution that can take into account global information of the target distribution. By parameterizing a suitable class of distributions and choosing a cost function to optimize, we can use standard optimization tools (such as stochastic gradient descent) to adapt our proposal to the problem we are trying to solve. When combined with the SMC approximation, we refer to these types of approaches as adaptive or variational SMC methods.
We will in this section focus on proposal distributions, qt, parameterized by , we denote this as qt
. A common choice is to use a conditional Gaussian distribution
where are neural networks parameterized by
. The proposal is also a function of the data, either explicitly as a part of the input to the functions or implicitly through the cost function for the parameters. These types of conditional distributions have recently been used with success in everything from image and speech generation [Gulrajani et al., 2017, Maddison et al., 2017] to causal inference [Louizos et al., 2017, Krishnan et al., 2017].
Below we discuss two classes of methods to learn the parameters tive and variational methods, respectively.
Adaptive Sequential Monte Carlo Adaptive methods are characterized by choosing a cost function that tries to fit directly the proposal distribution qtto the target distribution
. This can either be done locally at each iteration for qt
as in e.g. Cornebise et al. [2008], or globally for qT
as in e.g. Gu et al. [2015], Paige and Wood [2016]. Because we are mainly interested in the approximation of the final target distribution
we will here focus our attention to methods that take a global approach to learning proposals. Below we focus on recent methods developed in the machine learning literature. However, there is also an important body of work on adaptive SMC samplers [Del Moral et al., 2006a] stemming from the statistics community. For a comprehensive description of those methods we refer to Jasra et al. [2011], Fearnhead and Taylor [2013], Schäfer and Chopin [2013], Buchholz et al. [2018].
Since we generally need the proposal distribution to cover areas of high probability under the target distribution, we need the proposal to be more diffuse than the target. This will ensure that the weights we assign will have finite variance and that our approximation gets more accurate. This means that using the cost function KL, like in standard variational inference [e.g. Blei et al., 2017], would not result in a good proposal distribution. This cost function will encourage a proposal that is more concentrated than the target, and not less. Instead we focus our attention on the inclusive KL divergence KL
like in e.g. Gu et al. [2015] or Paige and Wood [2016]. The inclusive KL divergence encourages a qT that covers the high-probability regions of
We focus our exposition on adaptive methods to the case when the posterior distribution p
. The cost function to minimize is the inclusive KL divergence
where “const” includes all terms constant with respect to the proposal distribution. If we could compute the gradients of Eq. (3.11) with respect to employ a standard gradient descent method to optimize it. Unfortunately the gradients, given by
requires us to compute expectations with respect to the posterior distribution. Computing these types of expectations is the problem we try to solve with SMC in the first place! Below we will detail two approaches to solve this problem. First, we consider a stochastic gradient method that uses SMC to estimate the gradients. Then, we consider an alternative solution that uses inference amortization combined with stochastic gradient methods.
In the first approach, we use our current best guess for the parameters and then the SMC procedure itself to approximate the gradient in Eq. (3.12). Given this estimate of the gradient we can do an update step based on standard stochastic gradient descent methods. Suppose we have the iterate
timate the gradient in Eq. (3.12) using Algorithm 3 with proposals qt
and get
With stochastic gradient descent we update our iterate by
where are a set of positive step-sizes, typically chosen such that
and
. Note that unlike standard stochastic optimization methods, the gradient estimate is biased and convergence to a local minima of the cost function is not guaranteed. However, this approach has still shown to deliver useful proposal adaptation in practice [Gu et al., 2015].
The other approach relies on amortizing inference [Gershman and Goodman, 2014, Paige and Wood, 2016]. One view of inference amortization is as a procedure that does not just learn a single optimal for the observed dataset y1:T, but rather learns a mapping
from the data space to the parameter space. This mapping is optimized to make sure qT
approximation to p
that is likely under the probabilistic model p
. Note that the way we parameterize qT may differ from the above approach even though we use the same notation for the parameters
In this setting instead of the KL divergence in Eq. (3.11), we consider minimizing
If we let be a parametric function with parameters
, we can compute stochastic gradients of Eq. (3.15) w.r.t.
with no need to resort to SMC. A common choice is to let
be defined by a neural network where
weights and biases. We define the gradient
and estimate it, for the current iterate
Previous methods we have considered focus on proposals that try to emulate the posterior or the locally optimal proposal, both conditionally on the observed data y1:T. However, the amortized inference approach, in this setting, learns proposals based on simulated data from the model pis performed offline before using the learned proposal and the real dataset for inference.
The amortized inference method follows the same procedure as above in Eq. (3.14) when updating , replacing the gradient with the expression from Eq. (3.17)
Unlike the above approach that uses SMC to estimate the gradients, the amortized inference approach results in an unbiased approximation of the gradient. This means that using the update Eq. (3.18) will ensure convergence to a local minima of its cost function Eq. (3.15) by standard stochastic approximation results [Robbins and Monro, 1951]. On the other hand, this approach requires us to learn a proposal that works well for any dataset that could be generated by our model. This puts extra stress on the model to be accurate for the actual observed dataset to be able to learn a good proposal. For more thorough discussion of these topics, see Paige and Wood [2016].
We summarize these two approaches to optimize the proposal in Algorithm 6.
Variational Sequential Monte Carlo The key idea in variational sequential Monte Carlo (VSMC) [Naesseth et al., 2018, Le et al., 2018, Maddison et al., 2017] is to use the parametric distribution for our proposals, qtand optimize the fit in KL divergence from the expected SMC approximation
to the target distribution
. The dependence on
enters implicitly in the expected SMC approximation
through the proposed samples, weights, and resampling step. In contrast to mimicking the locally optimal proposal distribution, this cost function takes into account the complete SMC approximation and defines a coherent global objective function for it. Compared to the previously described adaptive SMC methods, VSMC optimizes the fit of the final SMC distribution approximation to the true target, rather than the fit between the proposal to the target distribution. This means that we explicitly take into account the resampling steps that are a key part of the SMC algorithm.
Studying the marginal distribution of a single sample from it is possible to show that [Naesseth et al., 2018
divergence from this distribution to the target distribution
can be bounded from above
where the log-normalization constant estimate, cf. Eq. (2.18), is
The expectation is taken with respect to all random variables generated by the
SMC algorithm. Because log ZT does not depend on the parameters mizing Eq. (3.19) is equivalent to maximizing
The KL divergence KLis non-negative, which means that Eq. (3.21) is a lower bound to the log-normalization constant log ZT. This is why Eq. (3.21) is typically known as an evidence lower bound in the variational inference literature. By maximizing the cost function in Eq. (3.21), with respect to
can find a proposal that fits the complete SMC distribution to the target distribution. The main issue is that evaluating and computing the gradient of this cost function is intractable, so we need to resort to approximations.
We assume that the proposal distributions qt are reparameterizable [Kingma and Welling, 2014, Rezende et al., 2014, Ruiz et al., 2016, Naesseth et al., 2017]. This means we can replace simulating xt
where the distribution of the random variable is independent of the parameters
. If we further assume that ht is differentiable we can use
where can be computed using e.g. automatic differentiation, replacing x i1:t with its defintion through Eq. (3.22). The approximation follows from ignoring the gradient part that results from the resampling step, which has been shown to work well in practice [Naesseth et al., 2018, Le et al., 2018, Maddison et al., 2017].
Just like for the adaptive SMC methods, Eq. (3.23) suggests a stochastic gradient method to optimize . Given an iterate
, we compute the gradient estimate by running SMC with proposals qt
and evaluate
With stochastic gradient ascent we update our iterate by
3.2 Adapting the Target Distribution
A less well-known design aspect of SMC algorithms is the target distribution itself. If we are mainly interested in the final distribution , we are free to choose the intermediate target distributions
to maximize the accuracy of our estimate
. By making use of information such as future observations, we can change the target distribution to take this into account. This leads to so-called auxiliary or twisted SMC methods [Guarniero et al., 2017, Heng et al., 2017].
3.2.1 Twisting the Target
Even if we could simulate exactly from the locally optimal proposal distribution Eq. (3.1), we still would not be getting exact samples from our target distribution. The reason for this is that when sampling and weighting our particles at iteration t we do not take into account potential future iterations. These future iterations can add new information regarding earlier latent variables. For instance, in an LVM (cf. Eq. (1.9)) a natural choice of target distributions is
However, with this choice we do not take future observations ytaccount at iteration t. The SMC approximation for earlier iterations has finite support (represented by the particles x i1:t
and so the only way we can in- corporate this is by reweighting and resampling. These two operations will typically impoverish our approximation for the full states x1:t, a phenomena related to the path degeneracy discussed in Chapter 2. The key idea in twisting (or tilting) the target distribution is to change our intermediate target distributions to take into account information from future iterations already at the current step. These types of approaches also have a connection to so-called lookahead strategies [Lin et al., 2013] and block sampling [Doucet et al., 2006] for SMC.
The optimal target distribution, under the assumption that we are using the locally optimal proposal, is to let the target at each iteration be the marginal distribution of the final iteration’s target, i.e.
With this choice all samples from x1:T are perfect samples from
can easily see this if we write out the optimal proposal for this sequence of targets
where the final equality corresponds to the case when is the posterior distribution p
. This means that the locally optimal proposal distribution is no longer only locally optimal, it is in fact optimal in a global sense.
In the following discussion we will assume that the final unnormalized target distribution can be split into a product of factors:
With this form of the joint distribution of data and latent variables, we choose the following structure for our unnormalized target distributions:
where 0 are our so-called twisting potentials, with
is easy to check that this recursive definition is consistent with our assumed factorization of the final target
in Eq. (3.27). The target distribution structure postulated in Eq. (3.28) can be directly applied in our basic SMC method in Algorithm 3 without any changes.
To deduce the optimal twisting potentials we can make use of the prop- erty that the optimal targets must fulfill
This follows from Eq. (3.26), each target should be the marginal distribution of x1:t under the final target distribution
. By replacing
in Eq. (3.29) with the definition from Eq. (3.28) we get
While Eq. (3.30) is typically not available analytically for practical applications, we will see in Section 3.2.2 that it can serve as a guideline for designing tractable twisting potentials.
Below we give a few concrete examples on what the optimal potentials might look like for both non-Markovian LVMs and conditionally independent models with tempering.
Non-Markovian Latent Variable Model The non-Markovian LVM can be described by a transition PDF ft and an observation PDF gt and follows the structure in Eq. (3.27) by defining
For this model we can rewrite Eq. (3.30) as follows
where the final expression is the predictive likelihood of ytgiven the latent variables x1:t. In Example 3.2.1 we derive the optimal twisting potentials for our Gaussian non-Markovian LVM.
Example 3.2.1 (Analytical Twisting Potential). Our running example is a nonMarkovian LVM with an analytical optimal twisting potential . This is because the joint distribution is Gaussian, and thus the integrals we need to compute to construct it (cf. Eq. (3.31)) are tractable. We can rewrite the equation for the observations yt for each t only as a function of x1:l, et and vl
, denoted by yt
as follows
where is given by
Because we know that we get the optimal value by considering the distribution of
are independent Gaussian, by Eq. (3.33) the predictive likelihood is a correlated multivariate Gaussian.
Conditionally Independent Models Another class of probabilistic models we discussed in Section 1.2 was conditionally independent models. In the notation of Eq. (3.28) we can express this class of models as
where p(x) is the prior distribution on the (static) latent variable, st is a backward kernel, and gt is a potential depending on the observed data y. We initialize by . The potential gt usually takes either of the two following forms:
where 0 1. The backward kernel st is an artificially introduced distribution to enable the use of SMC methods, requiring approximating a space of increasing dimension. See Section 1.2 for more examples of useful sequence models in annealing methods, or Del Moral et al. [2006a] for a more thorough treatment of the subject.
When we replace in Eq. (3.30) with its definition from Eq. (3.34) we get
where we can see that because of the (reverse) Markov structure in the annealing model from Eq. (3.34), the optimal twisting potential only depends on the current value of xt, i.e. . This will hold true also for the SSM, the Markovian special case of Eq. (3.31) where ft
gt
3.2.2 Designing the Twisting Potentials
The optimal twisting potentials are typically not tractable, requiring us to solve integration problems that might be as difficult to compute as finding the posterior itself. However, they can give insight into how to design and parameterize approximate twisting functions. First, we study a certain class of models that admit a locally optimal twisting potential, which gives rise to the so-called fully adapted SMC [Pitt and Shephard, 1999, Johansen and Doucet, 2008]. After discussing the locally optimal choice, we then move on to discuss a general approach to learning twisting potentials for SMC algorithms.
Locally Optimal Twisting Potential If we assume that the twisting potentials only satisfy the optimal twisting equation, Eq. (3.30), for one iteration we get a locally optimal twisting potential
The target distribution based on these potentials will allow us to adapt for information from one step into the future.
Furthermore, we combine the locally optimal twisting potentials with the proposals based only on factors up until iteration t, i.e.
With these choices of twisting functions and proposals, based on the twisted target structure Eq. (3.28), we obtain the fully adapted SMC [Pitt and Shephard, 1999, Johansen and Doucet, 2008] with corresponding weights
Note that this perspective on fully adaptive SMC approximates the “predictive target”
including information from iteration t + 1 already at iteration t. In some situations we might be interested in the “filtering target”. If so, we can make use of the approach described above to generate the particles x i1:t. Then to estimate the filtering target we simply average the particles x i1:t, ig- noring the weights in Eq. (3.38), accounting for the discrepancy between the filtering and predictive targets.
The proposal in Eq. (3.37) becomes
with weights according to Eq. (3.38) given by be recognized as a one-step look-ahead strategy, incorporating the observation yt
already at iteration t. This is indeed how the fully adaptive SMC is often presented for SSMs and non-Markovian LVMs.
Remark 3.2.1. The proposal used above is not the locally optimal proposal distribution for our (predictive) target distribution. The locally optimal proposal (see Eq. (3.1)) for the target Eq. (3.28) is
which would result in weights
Just like in the non-twisted case the weights for the locally optimal proposal are independent of the actual samples we generate at iteration t of the SMC algorithm. However, even if the potentials in Eq. (3.36) are available, the integration in Eq. (3.39) is not necessarily tractable.
Except for a few special cases the locally optimal twisting potential is not available, therefore we resort to approximating it. We can extend the ideas from Section 3.1 and either make an approximation to the locally optimal potential, or we can learn a full twisting potential.
Learning Twisting Potentials By parameterizing a class of positive functions and a cost function, we can effectively learn useful twisting potentials for the application at hand. From Eq. (3.30) we know that the optimal twisting potentials must satisfy a recursive relationship. We can make use of this property as we define the cost function and the twisting potentials.
We explain the approach by Guarniero et al. [2017], but generalize slightly to allow for non-Markovian LVMs. This means that our model is defined by ft
, cf. Eq. (3.31). Furthermore, we define the twisting potentials
implicitly as expectations with respect to ft
of positive functions ¯
with parameters
We further assume that this expectation can be computed analytically. With this setup we focus on learning the ¯’s instead. Using the optimality condition from Eq. (3.30) with twisting potentials defined by Eq. (3.40), we get the recursive approximation criteria for the ¯
With this choice of proposal we tie together the parameters for the proposal and the potentials. The corresponding weights in the SMC algorithm are
What remains is how to use Eq. (3.41) to actually learn a useful set of potentials. To do this we follow the approach by Guarniero et al. [2017] which uses an iterative refinement process: initialize parameters, run SMC with the current parameters, update parameters based on the current SMC approximation
, repeat. The update step solves for
by recursively minimizing
where the last step is for t 1,...,1. With the updated set of parameters, we re-run SMC and repeat the updates in Eq. (3.44) if a stopping criteria has not been met. Guarniero et al. [2017] proposes a stopping criteria based on the normalization constant estimate
We summarize the iterated auxiliary SMC method explained above in Algorithm 8, and give an example setup for a stochastic RNN in Example 3.2.3.
Example 3.2.3 (Stochastic Recurrent Neural Network). A stochastic RNN is a non-Markovian LVM where the parameters of the transition and observation models are defined by RNNs. A common example is using the conditional Gaussian distribution to define the transition PDF
where the functions are defined by RNNs. This model together with a Gaussian-like definition for the potentials satisfies the criteria for using Algorithm 8. We define ¯
as follows
where the functions depend on the parameters
—for notational brevity we have not made this dependence explicit. The proposal (cf. Eq. (3.42)) is given by
and the twisting potential
Twisting the target distribution is an area of active research [Guarniero et al., 2017, Heng et al., 2017, Lawson et al., 2018, Lindsten et al., 2018]. The iterated auxiliary SMC [Guarniero et al., 2017] makes strong assumptions on the model for easier optimization. End-to-end optimization of a priori independent twisting potentials and proposals qt might lead to even more accurate algorithms.
3.A Taylor and Unscented Transforms
In this appendix we detail the EKF- and UKF-based approximations to the distribution p. We refer to the two approaches as Taylor and Unscented transforms, respectively.
Taylor Transform If we assume that vt and et are zero-mean and linearize a around vt
where ¯a denotes the Jacobian of the function a with respect to vt evaluated at the point ¯a. We rewrite Eq. (3.45):
For the approximation in Eq. (3.46) xt, yt is indeed Gaussian, and with vt
, we can identify the blocks of
Using e.g. automatic differentiation we can easily compute the Jacobians needed for the EKF-based proposal. However, we still have to be able to evaluate the densities defined by afor the weight updates
notational convenience we have omitted the dependence on previous latent states in the expressions above. However, in general because of this dependence we will have to compute the proposals separately for each particle, which includes inverse matrix computations that can be costly if xt is high-dimensional.
Unscented Transform We rewrite Eq. (3.3) as a single joint random variable, and such that the functions a, c only depend on x1:t
where zt . By approximating the conditional distribution of
given x1:t
as Gaussian, we can again compute the distribution of xt
for the approximation. We choose sigma points based on the two first moments of zt. The mean and variance of zt is
where the scalars and vectors ul correspond to the singular value decomposition of
. We then choose the 2nx
1 sigma points for z as follows,
where Julier and Uhlmann, 1997]. We will discuss the choice of design parameters
below. The mean and variance of
is estimated by
Var
where the coefficients
and is another design parameter.
Like in the EKF-based approximation we still need access to the densities corresponding to Eq. (3.3). Unlike in the EKF case the, UKF-based approximation does not need access to derivatives. However, we need to choose the three design parameters: control the spread of the sigma points, and
is related to the distribution of zt [Julier, 2002]. Typical values of these parameters are
In this chapter we will discuss nesting Monte Carlo (MC) algorithms, using several layers of MC to construct composite algorithms. Usually, this is done to approximate some intractable idealized algorithm. The intractability can stem from e.g. a likelihood or weight that we cannot evaluate in closed form, or a proposal distribution that we cannot sample exactly from. Either way, the key idea of these methods is to introduce additional auxiliary variables to tackle the intractability in the idealized method. The approximate method is constructed in such a way that it retains some of the favorable properties of the idealized MC method that it tries to emulate. We will in this section study particle-based pseudo marginal (PM) methods for several idealized MC methods, ranging from marginal Metropolis-Hastings (MH) [Andrieu and Roberts, 2009, Andrieu et al., 2010] to distributed SMC [Vergé et al., 2015] and various nested SMC methods [Fearnhead et al., 2010, Chopin et al., 2013, Naesseth et al., 2015].
In the first section below we discuss the unbiasedness property of the SMC normalization constant estimate, a key property underpinning several nested MC methods. Then, in the next section we discuss particle Metropolis-Hastings (PMH). Next, we introduce the concept of proper weights and discuss various SMC methods relying on random (but proper) weights. Finally, we introduce a few perspectives on distributed SMC.
4.1 The Unbiasedness of the Normalization Constant Estimate
We briefly touched upon one of the key theoretical properties of SMC in Section 2.3, the unbiasedness of the normalization constant estimate . This will be important for several of the PM methods that we discuss in this chapter. We repeat the definition of
again for clarity:
In Proposition 4.1.1 we provide a result that formalizes the unbiasedness property. Furthermore, we provide a straightforward self-contained proof of this important property of the SMC algorithm.
Proposition 4.1.1. The SMC normalization constant estimate is non-negative and unbiased
Proof. See Appendix 4.A.
Often it will prove useful to think of the simply as a function of the random seed used to generate all the random variables in the SMC algorithm. Below we formalize this point-of-view, which will be useful for e.g. PMH and nested sequential Monte Carlo (NSMC) in Section 4.2 and Section 4.3, respectively.
Auxiliary Variables and Random Seeds The complete set of random variables generated in the SMC algorithm consists of each of the particles x it, for i , and ancestor variables ait for i
and t
are often generated by sampling base auxiliary random numbers such as Gaussian or uniforms, then the particles and ancestor variables are simply a function of these auxiliary variables. We will denote the collection of all base auxiliary random variables, also referred to as the random seed, by u. Each individual particle x it and ancestor ait are then a transformation from the random seed. For example we might have u
, where each uix,t,uia,t
. The function that con- nects uix,t to x it and uia,t to ait are then the inverse cumulative distribution func- tions (CDFs) of the proposal qt and resampling mechanism from Section 2.2.1, respectively. A common tool in PM is to think of the normalization constant estimate in Eq. (4.1) as a function of all the random variables u, effectively the random seed of the SMC algorithm. Then we can write Eq. (4.1) as
where the weights are functions of the random seed u. We denote the distribution of the random seed u by
. The main result in Proposition 4.1.1 can then be written in terms of the random seed u:
Example 4.1.1 (for Example 1.2.1). We revisit our running non-Markovian Gaussian example and consider a standard SMC estimator with the prior proposal and no twisting of the target. This means that for u
where x i1:k , the ancestor variable aik
is defined by the resampling mechanism Section 2.2.1 using uik
In Example Code 4.1 we provide a code snippet, definitions left out for clarity, that illustrates how can be computed. For numerical stability purposes we compute the log-normalization constant estimate, log
Example Code 4.1: Computing log for Example 1.2.1 with u
u1:N1:T
4.2 Particle Metropolis-Hastings
PMH melds particle-based methods, such as IS and SMC, with Metropolis-Hastings, providing the practitioners with powerful approximate Bayesian inference for a large class of intractable likelihood problems. The particle MCMC methods discussed in this section were mainly developed in Andrieu et al. [2010].
MH is a MCMC method [Robert and Casella, 2004] for approximate simulation from a PDF. The typical example is the posterior distribution of unknown model parameters given the observed data y,
where here x are latent (unknown) variables that we wish to integrate out. The likelihood pis often intractable. A common solution is to use data augmentation, sampling instead (approximately) from the posterior distribution of both
given the data y, i.e. p
see Algorithm 9, on the other hand generates approximate samples directly from the marginal distribution p
, provided that the likelihood can be evaluated exactly. The key idea is to generate a Markov chain
a stationary distribution equal to the target distribution of interest Eq. (4.4). The accept-reject step in Algorithm 9, accepting the proposed value
new iterate with probability A, ensures that the stationary distribution is equal to Eq. (4.4). The acceptance probability
is the minimum of one and the ratio of the posterior and the proposal evaluated at the proposed value, divided by the same ratio evaluated at the previous iterate. For a more thorough treatment of the MH algorithm see e.g. Robert and Casella [2004].
4.2.1 Particle Marginal Metropolis-Hastings
The pseudo marginal method replaces the exact likelihood with an estimate of it, usually constructed by IS or SMC like in Eq. (4.1). For a fixed value of the parameter we can use the tools from the previous chapters to construct a non-negative and unbiased approximation to p
, we will use the notation from (4.3) and label this estimate
. From Proposition 4.1.1 we have that
Simply replacing pwith this unbiased and non-negative approximation,
, results in the particle marginal Metropolis-Hastings (PMMH) algorithm in Algorithm 10 [Andrieu et al., 2010]. In a practical implementation we do not need to save the entire seed u for each iteration, it suffices to save the scalar value for our normalization constant estimate
. We illustrate the MH and PMMH algorithms applied to our running example in Example 4.2.1.
Example 4.2.1 (Particle marginal Metropolis-Hastings for Example 1.2.1). We revisit our running non-Markovian Gaussian example. To keep it simple we focus on the IS special case, i.e. with T = 1, which means we are interested in inference for . We assign a log-normal prior to the variance parameters q and r, i.e. logq
. The corresponding marginal likelihood is then given by p
. We use a normal distribution random walk proposal with variance 0.5 for the logarithm of the parameters. In Fig. 4.1 we can see the kernel density estimate based on the samples of a 20000 iterations long run of the two respective algorithms. The two algorithms obtain the same marginal distribution as expected.
The motivation for why this works is fairly straightforward. We can think of the PMMH algorithm as a standard MH on the extended space of adding the random seed as an auxiliary variable. The proposal on this extended space is
and the distribution we are trying to draw samples from, i.e. the target distribution, is
Figure 4.1: Kernel density estimates of the samples for the parameters logq and log r using 20000 MCMC iterations.
which by the unbiasedness of has the posterior distribution as a marginal
This means that if we can generate samples approximately distributed as the distribution in Eq. (4.7), we get samples approximately distributed as pas a by-product by considering only the samples for
To show that PMMH is a standard MH on an extended space we first note that the proposal qis the distribution of the proposed value
Then studying the corresponding acceptance probability A for the above choice of proposal and target distributions,
we get exactly the expression from Algorithm 10. These two properties ensure that PMMH is an MH algorithm on the extended space with Eq. (4.7) as its stationary distribution.
4.2.2 Sampling the Latent Variables
If we are interested not just in the parameters but in the latent variables also, we can adapt the particle MH framework with a simple extension of the PMMH algorithm [Andrieu et al., 2010]. We can use particle methods to generate approximate samples from the complete target with both parameters and latent variables, p. We simply use the final SMC approximation for p
that we obtain as a by-product of computing
as a proposal in the MH algorithm. The SMC approximation to the conditional distribution x
where we suppress the dependence on the parameters for clarity. Just like
the conditional distribution estimator is also completely de-fined by the random seed u. Using the above distribution as our proposal for x lets us derive the particle Metropolis-Hastings algorithm, see Algorithm 11.
Clearly, this algorithm can be applied for sampling the latent variables also when the model parameters are all known. In this case we simply drop all
dependent terms and variables in Algorithm 11. The proposal on the extended space is then
which corresponds to first running the SMC algorithm (sample uand then sample one particle trajectory from the resulting SMC approximation (sample x
. Since this is done independently from the current state of the Markov chain, it is referred to as an independent proposal and the resulting algorithm is know as particle independent Metropolis-Hastings (PIMH) [Andrieu et al., 2010]. The acceptance probability for PIMH is simply the minimum of one and the ratio of normalization constant estimators, i.e.
Example 4.2.2 (Particle independent Metropolis-Hastings for Example 1.2.1). We revisit our running non-Markovian Gaussian example. We set T = 20 and run PIMH for N = 5, 10 and 20. The kernel density estimates of the marginal
posterior distributions of x1 and xT based on the 10,000 samples generated can be seen in Fig. 4.2. We can see that, while in theory exact in the limit of infinite number of samples, the approximations for the lower number of particles N = 5, 10, are distinctly non-Gaussian. It is only when we use a large enough number of particles, e.g. N = 20, that we obtain more satisfactory results for a finite number of MCMC iterations.
Analogously to PMMH, we can motivate PMH as a standard MH algorithm on an extended space. We focus on the the setting where the model parameters are known and the extended space is given by (u,x). The proposal on this extended space is
and the target distribution is
By the unbiasedness property of SMC, i.e. Theorem 2.3.1, we have that the posterior distribution p(x|y) is a marginal of Eq. (4.11)
Figure 4.2: Illustration of the kernel density estimates of the PIMH marginal posterior approximation of x1 and xT forT = 20 for N = 5, 10 and 20.
where d= denotes equal to in distribution. Studying the corresponding acceptance probability A for the above proposal and target distribution we obtain
which is identical to the expression from Algorithm 11 (when the model parameters are known).
4.2.3 Correlated Pseudo-Marginal Methods
One of the main limitations of the PM methods that we have covered so far is that they require that the normalization constant estimate is sufficiently accurate to achieve good performance. This requirement means we have to focus much of the computational effort on the SMC algorithm by using a high number of particles N, rather than generating more samples from the posterior p
By studying the PMMH algorithm we can see that the proposed parameter
at each iteration is correlated with the previous parameter
, encouraging local exploration and higher acceptance rates. However, this is not true for the random seed u that is generated independently at each iteration. Introducing correlation also for this random seed, i.e. replacing the independent proposal
with a correlated proposal
, can allow us to effectively lower the computational cost of the SMC estimate through choosing a lower number of particles N. Since the dimension of u typically increases linearly with N, we need a proposal distribution which scales favorably to high dimensions. Examples include preconditioned Crank-Nicolson [Deligiannidis et al., 2018, Dahlin et al., 2015], elliptical slice sampling [Murray and Graham, 2016] and Hamiltonian Monte Carlo [Lindsten and Doucet, 2016].
4.3 Proper Weights and Sequential Monte Carlo
So far in this manuscript we have focused on SMC algorithms where the particles are sampled exactly from the proposal qt and where the weights deterministic functions of the particles. We can in fact relax both of these restrictions significantly by considering the weights and particles as joint random variables that satisfy a proper weighting condition. This will lead us to powerful algorithms such as the random-weights particle filter [Fearnhead et al., 2010], SMC2 [Chopin et al., 2013
Naesseth et al., 2015, 2019].
To formally justify these algorithms we can make use of the concept of proper weighting. This is a property on the expected value of the weights and samples when applied to functions.
Definition 4.3.1 (Proper Weighting). We say the (random) pair erly weighted for the (unnormalized) distribution
a.s. and for all measurable functions h
for some positive constant that is independent of x and
This is highly related to the unbiasedness property of the SMC method that we saw already in Theorem 2.3.1. However, while the basic SMC has deterministic weights when conditioned on the samples, here we allow also for random weights within the confines of Definition 4.3.1. See also Liu [2004] for an early discussion on this topic in the context of random weights and IS.
We will denote the distribution of the random pair simulating from Q independently
we can construct an estimate of that satisfies the law of large numbers,
Proper weighting ensures that the numerator converges to Zthe denominator to Z
, and thus the right hand side converges to the true expectation
. Recall that the constant
is as defined in Definition 4.3.1.
We have already seen one example of a properly weighted IS method in Section 3.1, namely nested IS (Algorithm 5). Because standard IS is a special case of nested IS it follows that also standard IS leads to weight-sample pairs that are properly weighted. Below we will see how this can be used to motivate more powerful algorithms for approximate inference.
4.3.1 Random Weights Sequential Monte Carlo
Random weights SMC and IS focuses on solving the problem with intractable weights. Because we pick the proposal ourselves it is often easy to sample from, whereas the target distribution depends largely on the model we want to study and can lead to intractable importance weights. A concrete example of this is partially observed stochastic differential equations (SDEs) as studied by e.g. Beskos et al. [2006], Fearnhead et al. [2010]. In this section we focus exclusively on random weights SMC algorithms for SDEs. The corresponding target distribution, and thus the weights, contains an exponentiated integral that is intractable. If we replace the intractable weights in the standard SMC framework, , with (non-negative) unbiased approximations of them,
obtain random weights SMC. While we focus the exposition on the basic IS case, the result and method follows for SMC when we have unbiased estimates of the weights conditional on previous samples. It is straightforward to prove that the random pair
are properly weighted for
A simplified version of the target distribution in an SDE has the following form of its intractable part
where is an intractable functional, often an integral. We assume that we can generate unbiased approximations
of the functional for a given value of x. Using the Poisson estimator [Wagner, 1987] we can obtain an unbiased estimate of the exponentiated functional in Eq. (4.15).
where the expectation is with respect to estimate of
Eq. (4.16) leads to the Poisson estimator
where are iid unbiased estimators of
. While Eq. (4.17) is unbiased, it is not necessarily non-negative and might be unsuitable for weights estimation in an SMC algorithm. With proper modification to the estimate we can sometimes ensure that the estimate is non-negative [Fearnhead et al., 2010, Jacob and Thiery, 2015]. Replacing the intractable part in the weights with its unbiased estimate, like Eq. (4.17) in the SDE case, results in the random weights SMC algorithm Algorithm 12. The samples x it are generated by a stan- dard user chosen proposal qt
just like in standard SMC.Rather than interpreting random weights SMC as a properly weighted SMC algorithm, we can instead motivate it as a standard SMC on an extended space.
4.3.2 Nested Sequential Monte Carlo
We already briefly touched on NSMC in Section 3.1, illustrating how a nested IS sampler can be used to approximate the locally optimal proposal distribution. Here we will instead consider the case when the nested sampler is itself SMC rather than IS. This usually allows the user to derive more efficient proposal approximations, pushing the boundaries of the type of problems that SMC methods can be effectively applied to.
NSMC [Naesseth et al., 2015, 2019] is specifically targeting models where the latent state xt is of dimension nx > 1 (typically in the order of 10’s to 100’s). The goal of NSMC is to approximate the locally optimal proposal distribution by running a separate internal (or nested) SMC algorithm over the components of xt for each sample we need. At first this might seem wasteful of our computational resources, running a nested SMC sampler with M internal particles for each particle in the outer algorithm results in
operations. However, for many models it results in much more accurate estimate than, for instance, a corresponding standard SMC algorithm with N M particles and the prior proposal.
Unlike in the random weights SMC, NSMC changes not only the way we compute weights, but also the way we generate the samples x it. The key idea is to construct a SMC approximation to the locally optimal proposal qFor each particle x it we want to simulate we construct an SMC approximation to q
where are properly weighted for q
. We assume that Eq. (4.18) is constructed using a nested SMC sampler on the components of xt; first targeting xt,1, then xt,1:2, etc, where the final target is q
With this approximation NSMC replaces the sampling step Eq. (2.15) by simulating from Eq. (4.18), and the weighting step Eq. (2.16) by
where are the corresponding weights in the nested SMC sampler. We sum- marize NSMC in Algorithm 13 and give examples below of ways to construct
Example 4.3.1 (Nested sequential Monte Carlo for Example 1.2.1). We revisit our running non-Markovian Gaussian example. Because xt is scalar for this example, we make a straightforward extension to the model to illustrate NSMC. We replicate the model from Example 1.2.1 nx times, i.e.
To construct we run an internal SMC sampler with M samples tar- geting
for l . We denote the particles and weights for the nested SMC by ¯x j,i1:l and
, respectively. A straightforward proposal for the above target distribution is ¯x j,i1:l
, i.e. using the prior proposal. We have that
where the weights are given by
We obtain the approximation to the locally optimal proposal . This defines the NSMC method in Algorithm 13 for the multi- variate extension to our running example. We illustrate the nested target distributions in Fig. 4.3.
Figure 4.3: Illustration of NSMC for the running example with nx = 4. Note that the each variable xt,l depends on the complete sequence of variables from the previous steps x1:t
Just like in the standard SMC sampler, there is room for improvement in the nested SMC by e.g. choosing a better internal proposal distributions and target distributions under the proper weighting constraint. It is possible to further improve performance by correlating samples through a combination of fully adapted SMC and a forward-filtering backward-sampling procedure as discussed by Naesseth et al. [2015, 2019].
4.4 Distributed Sequential Monte Carlo
Developing computational methods that can take advantage of parallel and multi-core computing architectures is important for efficient approximate inference algorithms. In this section we discuss how the idea of nesting MC algorithms can open up for distributed implementation of SMC. First we discuss independent importance weighted SMC samplers, a way to combine the outputs from multiple SMC samplers using an outer level importance sampler. Then we discuss the island particle filter [Vergé et al., 2015], where the outer level sampling is done using SMC as well. It should be noted that the purpose of this section is to illustrate how the idea of nested MC opens up for parallelization of SMC, and not to provide a comprehensive overview of distributed SMC algorithms. Indeed, many useful methods have been proposed for parallelizing the SMC algorithm itself. Often this is done at the granularity of individual particles. The key challenge is then to parallelize the resampling step, since this step requires interaction between particles. One way to address this challenge is to make use of MH or rejection sampling to implement the resampling [Murray et al., 2016]. Another approach is to modify the resampling step in order to limit the interaction between particles, thereby making a distributed implementation easier [Whiteley et al., 2016].
4.4.1 Importance Weighted Independent SMC Samplers
Assume that we have access to M computing nodes. These could correspond to different machines, different cores, or even M consecutive runs on the same machine. A (very) simple way of distributing computation across these nodes is to run separate independent SMC samplers on each node. Using the notation from above, let um denote all the random variables generated during the run of an SMC sampler with N particles in node m
variable um can also be thought of as the random seed used to initialize the m:th sampler. We then get m independent approximations of the target
given by
where we have assumed that each node uses the same number of particles N and the same SMC algorithm. Assume that we want to compute the expected
value of some test function h. Each node provides an estimate,
and since these estimates are independent and identically distributed a natural approach is to take the average 1Mas the final estimate. However, since each SMC estimate is biased for finite N, the resulting average will also be biased regardless of the number of nodes M used in the computation. A better approach can therefore be to weight the independent SMC samplers according to their respective normalizing constant estimates:
This aggregated estimator is obtained by viewing the SMC sampler in each node as a proposal for an outer-level importance sampler. Specifically, the target distribution for this importance sampler is given by
and the proposal is given by , resulting in weights
By computations analogous to those in Section 4.2 it holds that
. Consequently, by a standard importance sampling argument it holds that Eq. (4.21) is a consistent estimate of
large, regardless of the number N of particles used in each SMC sampler.
It is not expected that Eq. (4.21) will be a better estimate than what we would obtain from a single SMC sampler with a total of N M particles. However, as pointed out above, the computation of the estimators Eq. (4.21) can be distributed, or even performed sequentially on the same machine in situations when memory constraints limit the number of particles than can be used.
4.4.2 The Island Particle Filter
In the preceding section we considered importance sampling of SMC estimators. It is possible to take this one step further and develop SMC sampling of SMC estimators. This is the idea behind the island particle filter by Vergé et al. [2015].
Similarly to above, the island particle filter consists of M nodes, each running an SMC sampler with N particles. However, these M samplers are allowed to interact at each iteration according to the standard resampling, propagation, and weighting steps of SMC. We provide pseudo-code in Algorithm 14, where we assume that resampling at the outer level only occurs when the effective sample size (ESS) drops below some threshold (see Section 2.2.2). This type of adaptive resampling is recommended for a distributed implementation of the island particle filter. Indeed, resampling at the outer level at each iteration would result in very high communication costs, since it involves copying complete SMC states (particles and weights) from node to node. Furthermore, we can control the outer level ESS by increasing N, thereby ensuring that outer level resampling happens rarely when done adaptively.
The island particle filter fits into the standard SMC framework. To make this explicit we start by writing u corresponds to all the basic random variables generated at iteration t of an (internal) SMC sampler with N particles. For example, ut could correspond to a collection of independent uniform random variables needed to implement the resampling and propagation step at iteration t. Equivalently, we can view ut as the random seed used for generating these variables. By making the (non restrictive) assumption that these auxiliary variables are drawn independently at each iteration we can write
The island particle filter is then a standard SMC sampler with M particles, targeting the sequence of distributions
and using as a proposal distribution at iteration t. Similarly to the preceding section, targeting these “extended distributions” is motivated by the fact that
is defined analogously to (4.20).
If we denote the outer level (island level) unnormalized SMC weights by
at iterations when the islands are resampled, and
at iterations when the outer level resampling is omitted; see Eq. (2.21). For implementation purposes we can use the latter expression at all iterations if we overwrite whenever the islands are resampled. In the above expressions,
denote the unnormalized (inner level) SMC weights at node M.
The derivation above is based on viewing the complete random seed as an auxiliary variable. In practice it is usually more convenient to simply keep track of the resulting particle trajectories and weights, which can be computed deterministically given u1:t. Thus, in Algorithm 14 we have identified u1:t with vt :and express the steps of the algorithm using the latter variable.
4.A Proof of Unbiasedness
For clarity we prove the unbiasedness of the normalizing constant estimate for the version of SMC presented in Algorithm 3, which involves multinomial resampling at each iteration of the algorithm. We note, however, that the property holds more generally, for instance when using low-variance and adaptive resampling. The proof presented here can be straightforwardly extended to such cases.
The distribution of all random variables generated by the SMC method in Algorithm 3 is given by
where the particles are x1:N1:T , and the ancestor variables from the resampling step are a1:N1:T
. Assuming that
and t, we have that
0. We are now going to prove that
from which the result follows. Note that the results holds for any value of T. To prove the result we introduce another set of variables biT t
1. The notation describes the ancestor index at iteration t for the final particle x i1:T. This means that we can write x i1:T
Using this notation we rewrite the integrand in Eq. (4.23),
where the first equality is the definition, the second equality separates the de-
pendence of the particle x i1:T from the rest, and the third equality sim- plifies the product between weights and proposals for particle x i1:T
Now, using Eq. (4.23) and the above rewrite of the integrand we obtain
where in the first equality we note that the sum over i generates N equal values, and thus we can arbitrarily choose one of these (in this case i = 1) to evaluate and multiply the result by N. In the second equality we marginalize the variables not involved in the particle xb11:T1:T . The two final equalities follow because we are averaging N Tvalues that are all equal to 1. This concludes the proof.
Conditional sequential Monte Carlo (CSMC) is a recent algorithm combining SMC and MCMC [Andrieu et al., 2010], originally developed as a means to approximate and mimic idealized Gibbs samplers for local latent variable models. The intractability of the exact Gibbs sampler stems from the challenge in simulating from the exact conditional of latent variables given the data and the parameters. The key idea in CSMC is to simulate from this conditional distribution approximately using a slightly modified SMC approximation, retaining a single particle at each iteration. By iterating this procedure we obtain a MCMC method that has the desired target distribution as its stationary distribution. We will see how this can be used not only for simulating from a target distribution, but also how we can use it to generate unbiased inverse normalization constant estimates, speed up inference for e.g. probabilistic programming, and evaluating other approximate inference methods.
In the first section below we introduce the CSMC algorithm and illustrate it with an example. Then, we discuss the unbiasedness property of the CSMC inverse normalization constant estimate, a key property underpinning the rest of the section. Using CSMC as a tool we move on by deriving the distribution of the expected SMC empirical distribution approximation. Next, we introduce the IPMCMC algorithm, leveraging CSMC and SMC to derive a method that can take advantage of distributed and multi-core computational architectures. Finally, we discuss ways in which SMC and CSMC can be used for evaluating approximate inference.
5.1 Conditional Sequential Monte Carlo
One way to look at CSMC is to view it as Markov kernel with the target distribution as its stationary distribution. Each iteration takes the previous sample as an input, a reference trajectory, and outputs an updated sample with no need for an accept–reject step. If we keep iterating, applying CSMC over and over, we obtain the iterated CSMC algorithm. The iterated CSMC sampler is a type of Gibbs sampler on an extended space of all random variables generated at each step [Andrieu et al., 2010]. We will sometimes refer to the previous sample x
, i.e. the reference trajectory, as the retained particle.
The CSMC algorithm differs from standard SMC described in Section 2.2 only in that a single particle is set deterministically to the retained particle from the previous iteration. This means that we condition on the event that a specific particle survives and is equal to the retained particle x, hence the name conditional SMC. We obtain the CSMC algorithm by a simple modification of the SMC proposal and weight update equations. At each iteration, for the first N
1 particles, i
1, we follow the standard updates according to Eqs. (2.15) and (2.16). The proposal is
and the weights
The last particle i = N we set deterministically equal to the retained particle x N1:t . Its weight is computed according to Eq. (5.2), that is we use the same weight expression for all N particles. The empirical distribution approximation of
is, analogously to SMC, equal to
The updated retained particle is obtained by drawing a single sample from the empirical distribution approximation in Eq. (5.3), i.e. xWe summarize the CSMC algorithm in Algorithm 15. Iterated CSMC is obtained by letting the retained particle at the next iteration be the output of the previous step. This algorithm constructs a Markov chain, consisting of the
retained particles at each iteration, with the target distribution as its stationary distribution. For a thorough treatment of the theory of iterated CSMC, particle Gibbs (PG) and particle Markov chain Monte Carlo (PMCMC), i.e. the extension to sampling also model parameters, see Andrieu et al. [2010].
The path degeneracy issue discussed in Section 2.2 is particularly prominent in iterated CSMC. Because we enforce that the retained particle survives to the very end, we simultaneously force that all the particles collapse for early steps t to x. This means that the output from Algorithm 15, with high prob- ability, is equal to the input for early values of x
. The samples we get are highly correlated and any estimator will require a significant number of samples to reasonably approximate expectation with respect to the target distribution
. By modifying the way we update particle N, the retained particle, we can alleviate this problem. One way to do this is using ancestor sampling [Lindsten et al., 2014]. With ancestor sampling, instead of setting x N1:t
deterministically as in Algorithm 15, we sample at each time step
where the auxiliary weights are given by
Using ancestor sampling does not avoid path degeneracy but it can allow for the particle trajectory we degenerate to to be different than the retained particle from a previous iteration. For an in-depth treatment of the path degeneracy issue and its other potential solutions we refer to Lindsten and Schön [2013].
Example 5.1.1 (Conditional sequential Monte Carlo for Example 1.2.1). We revisit our running non-Markovian Gaussian example. We let T = 20 and study the performance of iterated CSMC without (blue) and with ancestor sampling (red) in Fig. 5.1. We set N = 5, 10 and study the kernel density estimates based on 10,000
MCMC samples of the marginal distributions , respectively.
We can see the striking effect that path degeneracy has the iterated CSMC approximation in Fig. 5.1a; effectively we approximate using a single sample when N = 5, whereas the approximation for
is more diverse. When using ancestor sampling on the other hand, the approximation is already accurate for N = 5. This accuracy comes at a cost however, the ancestor sampling step requires that we evaluate Eq. (5.5) at each iteration. This extra step has a computational cost that in general may be proportional to T 2, whereas the cost of standard CSMC is often only on the order of T. For some target distributions, such as the SSM or our running example, there is only a small constant computational overhead associated with evaluating Eq. (5.5). However, whether the benefit of ancestor sampling will outweigh its computational cost will have to be evaluated on a case by case basis.
5.2 The Unbiasedness of the Inverse Normalization Constant Estimate
Unlike in the standard SMC algorithm, the normalization constant estimate for CSMC based on the product of the average weights is biased. We restate the normalization constant estimate below:
Figure 5.1: Illustration of the kernel density estimates based on 10,000 MCMC samples of the marginal distributions from iterated CSMC (top two rows, blue) and CSMC with ancestor sampling (bottow two rows, red).
Because we condition on a reference trajectory, this normalization constant estimate has different properties than does the SMC estimate. However, it is still consistent, converging almost surely to ZT as the number of particles tends to infinity.
While the above estimator is biased for a finite number of particles, interestingly the CSMC inverse normalization constant estimate 1is unbiased. We will see that this is particularly relevant for the inference evaluation methods that we will study in Section 5.5. The unbiasedness holds under the assumption that the reference trajectory is distributed according to the target distribution x
. This assumption relies on the Markov chain having already converged to its stationary distribution. However, in some applications it is possible to make use of CSMC with the problem set up in such a way that this is always true.
We formalize and prove the unbiasedness result in Proposition 5.2.1.
Proposition 5.2.1. Assuming that the retained particle xCSMC inverse normalization constant estimate 1
is non-negative and unbiased
Proof. See Appendix 5.A.
5.3 The Expected SMC Empirical Distribution
Another method for representing the expected SMC empirical distribution is using CSMC. This can be useful for some applications of SMC, such as in variational SMC or in inference evaluation which we will see later in this chapter. The marginal distribution of a single sample from the SMC empirical distribution
where QSMC (see Eq. (4.22) for its definition) is the distribution of all particles and ancestor variables generated by a single run of the SMC algorithm.
We formalize the alternate representation of the expected SMC empirical distribution in Proposition 5.3.1.
Proposition 5.3.1. The expected SMC empirical distribution satisfies
where QCSMC is the distribution of all random variables generated by a single run of the CSMC algorithm with x1:T as its retained particle.
Proof. See Appendix 5.B.
5.4 Parallelization
In Section 4.4 we briefly discussed a few approaches for leveraging multi-core and distributed computational architectures. Here we will see how we can combine SMC and CSMC to derive IPMCMC [Rainforth et al., 2016], an algorithm that takes advantage of parallelization to speed up inference for amongst other things probabilistic programming.
The IPMCMC method, like the iterated CSMC, can be thought of as a type of Gibbs sampler on an extended space that draws approximate samples from the target distribution . However unlike iterated CSMC, the IPMCMC algorithm generates not only one sample at each iteration but several. The method is based on a pool of several interacting standard and conditional sequential Monte Carlo samplers. These pools of samplers are run as parallel processes and will be referred to as nodes below. Each node either runs an iteration of CSMC like in Algorithm 15, or an iteration of SMC like in Algorithm 3. After each run of this pool, we update the node indices running CSMC by applying successive Gibbs updates. This means that the nodes running CSMC can change from iteration to iteration. This approach can let the user trade off exploration (SMC) and exploitation (CSMC) to achieve a faster and more accurate approximate inference method.
To simplify notation we will let x denote the full trajectory x1:T. We assume that we have a total of M nodes, indexed by j = 1,..., M, of which P run CSMC and the remaining run standard SMC. The conditional nodes, i.e. the nodes running CSMC at any given time, are denoted by a set of indices c1:P is distinct. The conditional node cj’s retained particle is referred to by x
. After each run all nodes produce a normalization constant estimate
, i.e. the product of the average unnormalized weights in that node. IPMCMC proceeds by successively updating the conditional node indices and
retained particles by Gibbs steps. For j = 1,..., P we sample
where is the identity function evaluating to one if its argument is true and zero otherwise. Furthermore, we have used the node index shorthand c
. After selecting new nodes to become conditional nodes through Eq. (5.10) we select their new corresponding retained particle by simulating
We summarize one iteration of IPMCMC in Algorithm 16. Iterating the update steps, letting the retained particles at the previous step be the input for the subsequent step, generates a Markov chain on an extended space where each retained particle is approximately distributed according to the target distribution.
IPMCMC is suited to distributed and multi-core architectures. The main computational cost is running the SMC and CSMC iterations, which can be done in parallel by the nodes. At each iteration, only a single normalization constant estimate needs be communicated between nodes and the calculation of the updates of c1:P is negligible. Retained particles do not need to be communicated and may be stored locally in each node until needed. Furthermore, IPMCMC may be amenable to asynchronous adaptation under certain assumptions [Rainforth et al., 2016]. The algorithm is also suited for use in probabilistic programming where we are often only able to simulate from the prior and evaluate the likelihood. Interacting particle Markov chain Monte Carlo can allow for tackling more challenging inference problems, i.e. more complex probabilistic programs, compared to standard PMCMC methods due to leveraging distributed and multi-core computational architectures [Rainforth et al., 2016].
5.5 Evaluating Inference
Approximate Bayesian inference is central to a large part of probabilistic machine learning and its application. It is a coherent approach to treat uncertainty in both the data and the model with applications to everything from robotics to health care. A notoriously difficult problem is to measure the accuracy of our posterior approximation compared to the true posterior distribution. Inference checking, or inference evaluation, refers to the process of comparing our approximate inference result to what we would have obtained using the true posterior. In practice whether an approximation is accurate or not may differ between different applications, we might be interested in different aspects of the posterior. Some examples are tail probabilities in rare event estimation or posterior mean for minimum mean square error estimation. In this section we will focus on comparing directly the posterior approximation to the exact posterior distribution in terms of symmetrized KL divergence, also known as the Jeffreys divergence, following Grosse et al. [2016], Cusumano-Towner and Mansinghka [2017].
The posterior distribution that we want to approximate is denoted by p(x|y), and the approximate posterior will be denoted by q(x). The approximation q could for example be the expected SMC empirical distribution approximation, similar to VSMC, or it could be an optimized variational approximation, or any of the other examples detailed in Grosse et al. [2016], Cusumano-Towner and Mansinghka [2017]. In either case, the symmetrized KL divergence from q to p is given by
where the standard Kullback-Leibler divergence from q(x) to p(x) is defined by KL. We can rewrite Eq. (5.12) to
eliminate the need for computing the marginal likelihood p(y) as follows
To be able to estimate Eq. (5.13) using a standard MC estimator we require that we can sample q(x) and p(x|y), respectively. Furthermore, we require that we are able to evaluate the PDFs q(x) and p(x,y) pointwise. These requirements can be restrictive for the types of problems we encounter in practice, and Grosse et al. [2016], Cusumano-Towner and Mansinghka [2017] make various relaxations and assumptions to make the problem tractable. Below we will detail the two approaches to inference checking known as bounding divergences with reverse annealing (BREAD) [Grosse et al., 2016] and auxiliary inference divergence estimator (AIDE) [Cusumano-Towner and Mansinghka, 2017].
Bounding Divergences with Reverse Annealing The BREAD approach focuses on bounding the symmetrized KL for a posterior approximation based on annealed importance sampling (AIS) [Neal, 2001] for simulated data. By using simulated data we ensure that we can generate an exact sample from the posterior distribution x . We can think of AIS as a single sample, N
method that uses as its proposal a Markov kernel with stationary distribution
. The target distribution is typically chosen according to a data or likelihood tempering scheme as discussed in Section 1.2. Per definition the posterior distribution is obtained as the marginal at the final iteration, i.e. p
. The posterior distribution approximation is then defined by q
and the expectation is with respect to all the proposal distributions.
Because of this set up the intractability in estimating, or bounding, the Jeffreys divergence in Eq. (5.13) comes from having to evaluate the PDF q(x) pointwise. We can bound Eq. (5.13) from above by noting that marginalizing out variables cannot increase the KL divergence
where the equality follows from Proposition 5.3.1. We can further bound
Eq. (5.14) using Jensen’s inequality and simplifying, leading to
The inequality in Eq. (5.15) holds generally even if we would use a standard
SMC algorithm rather than AIS as our posterior approximation. The details are provided in Section 5.C. However, when AIS is used we can further simplify to obtain the same bound as Grosse et al. [2016]
where qare the proposals (forward kernels), s
are the corresponding backward kernels (cf. Section 1.2, Conditionally independent models), and the weights are in the likelihood tempering variant given by
with 1. This means it is possible to obtain a stochastic upper bound, i.e. a MC estimate of the upper bound in Eq. (5.16), by running a standard (forward) AIS evaluating the negative log-weights, and running a reverse AIS evaluating the log-weights; the left and right expressions in Eq. (5.16), respectively.
Auxiliary Inference Divergence Estimator The AIDE approach, on the other hand, treats a more general class of posterior approximations such as variational inference, SMC, or MCMC. However, rather than assuming simulated data, Cusumano-Towner and Mansinghka [2017] propose to study the Jeffreys divergence from the approximation to that of a gold standard inference approximation. Typically this gold standard approximation would consist of a long run of an MCMC method or SMC with a large number of particles that we know approximates the posterior well.
In this setting not only is the PDF for the approximation q(x) difficult to evaluate, so is the PDF for our gold standard algorithm. Rather than introducing new notation for the gold standard algorithm we will keep the posterior distribution notation p(x|y). The approach that the AIDE takes is to use meta-inference methods, based on SMC or CSMC, to obtain approximations of the PDFs and their reciprocals. A meta-inference algorithm takes as input a sample from the distribution of interest and outputs an unbiased estimate of either its PDF or the reciprocal. In fact it suffices that the meta-inference algorithms provide estimates that are unbiased up to a constant of proportionality because any constant is canceled by the symmetry in the Jeffrey’s divergence.
We note that the approach suggested by Cusumano-Towner and Mansinghka [2017] is essentially equivalent to the bound derived in Eqs. (5.14) to (5.16). However, instead of the exact target distribution another SMC approximation to represent the gold standard algorithm. To differentiate between the two SMC approximations we will use superscript q for the approximation and p for the gold standard. With this the corresponding Jeffrey’s divergence is
where QqCSMC, distributions and normalization constant estimates for the approximation and gold standard algorithm, respectively. The first equality follows from the definition of q(x) and p(x|y). The second equality follows from Proposition 5.3.1 for the two distributions, respectively. The divergence in Eq. (5.17) can be bounded, analogously to the
BREAD case, as follows
where q(x) and p(x|y) are the two corresponding expected SMC empirical distributions.
The bound above differs slightly from that derived by Cusumano-Towner and Mansinghka procedure uses a form of importance weighted version of Eq. (5.18), in their notation it is equivalent to the bound based on a single run of the meta-inference for the approximation (Mt = 1) and gold standard (Mg = 1). For more details we refer the interested reader to that paper.
5.A Proof of Unbiasedness
When introducing the CSMC method we consistently set the N:th particle to the reference trajectory. In practice it does not matter which particle is deterministically set to the reference trajectory. It will prove simplifying to let the particle that is conditioned on, i.e. the reference trajectory, have its own index b1:T . For clarity we focus on using multinomial resampling at each iteration of the algorithm. The distribution of all random variables generated by the CSMC method in Algorithm 15 is then given by
where the particles are x, and the ancestor vari- ables from the resampling step are a
that
, we have that
0 and thus 1
0. We are now going to prove that when
from which the result follows. We rewrite the integrand in Eq. (5.20),
where the first line is the integrand. In the first equality we have divided and multiplied by q1to be able to identify the difference between QSMC and QCSMC. Note that we have replaced bt
in the notation for QSMC
to identify the SMC distribution. For the second equality we have rewritten the normalized weights in the denominator, together with N
and dividing and multiplying by
, to identify the normalization constant estimator. For the final equality we see that the normalization constant estimators cancel, and the product in the denominator simplifies together with
to become
Using this simplified expression for the integrand in Eq. (5.20) we obtain
where again we first replace the original sum over b1:Twith the ancestor variable a notation. The first equality marginalizes bT and the final equality follows because QSMC is a distribution on x1:N1:T , a1:N1:T
. This concludes the proof.
5.B Proof of SMC Distribution Representation
We will prove the following alternate representation of the expected SMC empirical distribution
were we for clarity again focus on using multinomial resampling at each iteration of the algorithm. Note that the left hand side is a probability measure and the right hand side a probability distribution function. To be precise the equality means that the corresponding probability measure of the right hand side (usually adding either the Lebesgue our counting measure) is equal almost everywhere to the left hand side.
To prove the result we need to show that the expectation of any bounded and continuous function hcoincides for the left and right hand sides of Eq. (5.21). We have that the expectation of h
with respect to the right
where the first equality is the definition. In the second equality we have expanded the terms and can identify the expression derived in Section 5.A above. In the third equality we have replaced it with the corresponding QSMC distribution, again using bt in the notation for QSMC. In the penultimate equality we re-arrange a few terms and replace the sum over bT with an integration with respect to a point mass distribution. In the final equality we identify the expectation with respect to
. This concludes the proof.
5.C Jeffreys Divergence Bound
We detail the derivation of the bound in Eq. (5.15) below under the assumption of multinomial resampling at each iteration of the algorithm,
where the first equality is the definition. The second equality is a simplification of the previous expression where the terms cancelling, , have been removed. Next, the inequality follows from the (conditional) Jensen’s inequality for the convex functions x log x and
, respectively. The final equality follows from Proposition 5.3.1, noting that the first term is equivalent to an expectation with respect to the full SMC distribution QSMC.
The Jeffrey’s divergence bound between two generic expected SMC empirical distributions in Eq. (5.17) is a straightforward extension to the above derivation.
We have described sequential Monte Carlo, methods that use random samples to approximate the posterior. The goal is to approximate the distribution of the latent (unobserved) variables given the data. The idea is to draw samples from a (simpler) proposal distribution, then correct for the discrepancy between the proposal and the posterior using weights. SMC is generally applicable and can be considered to be an alternative, or complement, to MCMC methods.
First, we discussed the roots of SMC in (sequential) importance sampling and sampling/importance resampling. Next, we introduced the SMC algorithm in its full generality. The algorithm was illustrated with synthetic data from a non-Markovian latent variable model. Then, we discussed practical issues and theoretical results. The SMC estimate satisfies favorable theoretical properties; under appropriate assumptions it is both consistent and has a central limit theorem. Then, we discussed and explained how to learn a good proposal and target distribution. As a motivating example, we described how to go about learning a proposal- and target distribution for a stochastic recurrent neural network. Next, we discussed proper weighting and how to use SMC to construct nested estimators based on several layers of MC approximations to approximate idealized MC algorithms. We provided a straightforward proof to the unbiasedness property of the SMC normalization constant estimate. We studied PMH algorithms, methods that construct high-dimensional proposals for MH based on SMC samplers. We studied random and nested SMC methods that deal with intractability in the weights and proposal distributions with applications to stochastic differential equations and optimal proposal approximation. We illustrated how to leverage proper weighting and SMC to design algorithms that makes use of distributed and multi-core computing architectures. Finally, we introduced conditional sequential Monte Carlo algorithms and applications of these. We provided proof that the CSMC sampler obtains unbiased estimates of the inverse normalization constant. We showed how the IPMCMC method combines SMC and CSMC to design a powerful MCMC method that can take advantage of parallel computing. As a motivating example we show how SMC and CSMC can be used to evaluate inference in the context of approximate Bayesian inference.
There are several avenues open for further work and research in SMC methods and their application. We provide a few examples below. The proposal distribution is crucial for performance, and finding efficient ways to learn useful proposals is an area of continued research. The twisting potential can drastically improve accuracy. However, jointly learning twisting potentials and proposal distributions for complex models is a challenging and largely unsolved problem.
A rather obvious—but still very interesting and important—direction consists in adapting the SMC algorithm to parallel and distributed computing environments. This is indeed an active area of research, see e.g. Lee and Whiteley [2016], Lee et al. [2010]. However, we have reason to believe that there is still a lot that can be done in this direction. Just to give a few concrete examples we mention the fact that we can assemble SMC algorithm in new ways that we directly designed for parallel and distributed computing. As a concrete example we mention the smoother developed by Jacob et al. [2019] which is based on the clever debiasing scheme of Glynn and Rhee [2014]. Perhaps the divide and conquer SMC algorithm [Lindsten et al., 2017] can be useful in this effort. Its successful implementation within a probabilistic programming environment with support for parallel computation might give rise to an interesting scalable use of SMC.
An important observation is that SMC is increasingly being used as a component in various larger (often iterative) algorithms. Just to give a few examples of this we have Bayesian learning, where the particle filter is used within a Markov chain Monte Carlo algorithm [Andrieu et al., 2010, Lindsten et al., 2014], iterated filtering [Ionides et al., 2015] and maximum likelihood using expectation maximization [Schön et al., 2011]. The idea of using SMC within itself has occurred in several forms, including SMC2 by Chopin et al. [2013] and nested SMC by Naesseth et al. [2015]. There are still most likely many interesting algorithms to uncover by following this idea in new directions.
There has recently been a very interesting line of research when it comes to merging deep learning architectures—such as recurrent neural networks, temporal convolutional networks and gated recurrent units—with stochastic latent models—such as the state space model—for sequence modelling, see e.g. [Bai et al., 2018, Aksan and Hilliges, 2019, Osendorfer and Bayer, 2014, Fraccaro et al., 2016]. An important idea here is to combine the expressiveness of the deep neural networks with the stochastic latent variable model’s ability to represent uncertainty. These models are often trained by optimizing the ELBO objective. It has recently been shown that we can exploit the fact that the SMC estimator of the likelihood is unbiased to construct tighter lower bounds of this objective [Le et al., 2018, Maddison et al., 2017, Naesseth et al., 2018]. This can most likely open up for better training of combinations of deep learning architectures and stochastic latent models.
Given that we have now understood that the SMC methods are much more generally applicable than most of us first thought we foresee an interesting and creative future for these algorithms. We hope that this paper can serve as a catalyst for further research on these methods and in particular on their application in machine learning.
This research was financially supported by the Swedish Foundation for Strategic Research (SSF) via the projects ASSEMBLE (contract number: RIT15-0012) and Probabilistic Modeling and Inference for Machine Learning (contract number: ICA16-0015), and by the Swedish Research Council via the projects NewLEADS - New Directions in Learning Dynamical Systems (contract number: 621-2016-06079) and Learning of Large-Scale Probabilistic Dynamical Models (contract number: 2016-04278).
E. Aksan and O. Hilliges. STCN: stochastic temporal convolutional networks. In International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 2019.
B. Anderson and J. Moore. Optimal Filtering. Prentice-Hall, Englewood Cliffs, NJ, 1979.
C. Andrieu and G. O. Roberts. The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, 37(2):697–725, 2009.
C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.
M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on signal processing, 50(2):174–188, 2002.
S. Bai, J. Z. Kolter, and V. Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv:1803.01271, 2018.
J. Bayer and C. Osendorfer. Learning stochastic recurrent networks. arXiv:1411.7610, 2014.
T. Bengtsson, P. Bickel, and B. Li. Curse-of-dimensionality revisited: Collapse of the particle filter in very large scale systems, volume Volume 2 of Collections, pages 316–334. Institute of Mathematical Statistics, 2008.
A. Beskos, O. Papaspiliopoulos, G. O. Roberts, and P. Fearnhead. Exact and computationally efficient likelihood-based estimation for discretely observed
diffusion processes (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):333–382, 2006.
A. Beskos, D. Crisan, and A. Jasra. On the stability of sequential Monte Carlo methods in high dimensions. The Annals of Applied Probability, 24(4):1396– 1445, 2014.
P. Bickel, B. Li, and T. Bengtsson. Sharp failure rates for the bootstrap particle filter in high dimensions, volume Volume 3 of Collections, pages 318–329. Institute of Mathematical Statistics, 2008.
D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518): 859–877, 2017.
J. Briggs, M. Dowd, and R. Meyer. Data assimilation for large-scale spatiotemporal systems using a location particle smoother. Environmetrics, 24(2): 81–97, 2013.
A. Buchholz, N. Chopin, and P. E. Jacob. Adaptive tuning of Hamiltonian Monte Carlo within sequential Monte Carlo. arXiv:1808.07730, 2018.
O. Cappé, E. Moulines, and T. Rydén. Inference in Hidden Markov Models. Springer-Verlag New York, 2005.
J. Carpenter, P. Clifford, and P. Fearnhead. Improved particle filter for nonlinear problems. IEE Proceedings-Radar, Sonar and Navigation, 146(1):2–7, 1999.
N. Chopin. A sequential particle filter method for static models. Biometrika, 89(3):539–552, 2002.
N. Chopin. Central limit theorem for sequential Monte Carlo methods and its application to bayesian inference. The Annals of Statistics, 32:2385–2411, 2004.
N. Chopin, P. E. Jacob, and O. Papaspiliopoulos. SMC2: an efficient algorithm for sequential analysis of state-space models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3):397–426, 2013.
J. Chung, K. Kastner, L. Dinh, K. Goel, A. Courville, and Y. Bengio. A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems, 2015.
J. Cornebise, É. Moulines, and J. Olsson. Adaptive methods for sequential importance sampling with application to state space models. Statistics and Computing, 18(4):461–480, 2008.
M. F. Cusumano-Towner and V. K. Mansinghka. AIDE: An algorithm for measuring the accuracy of probabilistic inference algorithms. In Advances in Neural Information Processing Systems, pages 3004–3014. 2017.
J. Dahlin, F. Lindsten, J. Kronander, and T. B. Schön. Accelerating pseudo-marginal Metropolis-Hastings by correlating auxiliary variables. arXiv:1511.05483, 2015.
P. Del Moral. Feynman-Kac formulae: genealogical and interacting particle systems with applications. Springer Verlag, 2004.
P. Del Moral and A. Guionnet. On the stability of interacting processes with applications to filtering and genetic algorithms. In Annales de l’IHP Probabilités et statistiques, volume 37, pages 155–194, 2001.
P. Del Moral, A. Doucet, and A. Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411– 436, 2006a.
P. Del Moral, A. Doucet, and A. Jasra. Sequential Monte Carlo methods for Bayesian Computation. Bayesian Statistics 8, 2006b.
G. Deligiannidis, A. Doucet, and M. K. Pitt. The correlated pseudo-marginal method. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(5):839–870, 2018.
P. M. Djuri´c and M. F. Bugallo. Particle filtering for high-dimensional systems. In 2013 5th IEEE International Workshop on Computational Advances in MultiSensor Adaptive Processing (CAMSAP), pages 352–355. IEEE, 2013.
R. Douc and O. Cappé. Comparison of resampling schemes for particle filter-ing. In Proceedings of the 4th International Symposium on Image and Signal Processing and Analysis, pages 64–69. IEEE, 2005.
A. Doucet and A. M. Johansen. A tutorial on particle filtering and smoothing: Fifteen years later. Handbook of nonlinear filtering, 12(656-704):3, 2009.
A. Doucet and A. Lee. Sequential Monte Carlo methods. Handbook of Graphical Models, pages 165–189, 2018.
A. Doucet, S. Godsill, and C. Andrieu. On sequential Monte Carlo sampling methods for bayesian filtering. Statistics and computing, 10(3):197–208, 2000.
A. Doucet, M. Briers, and S. Sénécal. Efficient block sampling strategies for sequential Monte Carlo methods. Journal of Computational and Graphical Statistics, 15(3):693–711, 2006.
P. Fearnhead. Sequential Monte Carlo methods in filter theory. PhD thesis, University of Oxford Oxford, 1998.
P. Fearnhead. Particle filters for mixture models with an unknown number of components. Statistics and Computing, 14:11–21, 2004.
P. Fearnhead and H. R. Künsch. Particle filters and data assimilation. Annual Review of Statistics and Its Application, 5:421–449, 2018.
P. Fearnhead and B. M. Taylor. An adaptive sequential Monte Carlo sampler. Bayesian analysis, 8(2):411–438, 2013.
P. Fearnhead, O. Papaspiliopoulos, G. O. Roberts, and A. Stuart. Random-weight particle filtering of continuous time processes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):497–512, 2010.
M. Fraccaro, S. K. Sø nderby, U. Paquet, and O. Winther. Sequential neural models with stochastic layers. In Advances in Neural Information Processing Systems (NeurIPS), 2016.
M. Gerber, N. Chopin, and N. Whiteley. Negative association, ordering and convergence of resampling methods. arXiv:1707.01845, 2017.
S. Gershman and N. Goodman. Amortized inference in probabilistic reasoning. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 36, 2014.
P. W. Glynn and C.-H. Rhee. Exact estimation for markov chain equilibrium expectations. Journal of Applied Probability, 51A:377–389, 2014.
A. D. Gordon, T. A. Henzinger, A. V. Nori, and S. K. Rajamani. Probabilistic programming. In Proceedings of the on Future of Software Engineering, pages 167–181. ACM, 2014.
N. J. Gordon, D. J. Salmond, and A. F. M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. Radar and Signal Processing, IEE Proceedings F, 140(2):107 –113, Apr. 1993.
R. B. Grosse, Z. Ghahramani, and R. P. Adams. Sandwiching the marginal likelihood using bidirectional Monte Carlo. arXiv:1511.02543, 2015.
R. B. Grosse, S. Ancha, and D. M. Roy. Measuring the reliability of MCMC inference with bidirectional Monte Carlo. In Advances in Neural Information Processing Systems 29, pages 2451–2459. Curran Associates, Inc., 2016.
S. Gu, Z. Ghahramani, and R. E. Turner. Neural adaptive sequential Monte Carlo. In Advances in Neural Information Processing Systems, pages 2629– 2637, 2015.
P. Guarniero, A. M. Johansen, and A. Lee. The iterated auxiliary particle filter. Journal of the American Statistical Association, 112(520):1636–1647, 2017.
I. Gulrajani, K. Kumar, F. Ahmed, A. A. Taiga, F. Visin, D. Vazquez, and A. Courville. PixelVAE: A latent variable model for natural images. In International Conference on Representation Learning, 2017.
J. E. Handschin and D. Q. Mayne. Monte Carlo techniques to estimate the conditional expectation in multi-stage non-linear filtering. International Journal of Control, 9(5):547–559, 1969.
W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.
J. Heng, A. N. Bishop, G. Deligiannidis, and A. Doucet. Controlled sequential Monte Carlo. arXiv:1708.08396, 2017.
N. L. Hjort, C. Holmes, P. Müller, and S. G. Walker, editors. Bayesian Nonparametrics. Cambridge University Press, 2010.
J. D. Hol, T. B. Schon, and F. Gustafsson. On resampling algorithms for particle filters. In IEEE Nonlinear Statistical Signal Processing Workshop, pages 79–82. IEEE, 2006.
J. H. Huggins and D. M. Roy. Sequential Monte Carlo as approximate sampling: bounds, adaptive resampling via -ESS, and an application to particle Gibbs. arXiv:1503.00966, 2015.
A. Ihler and D. McAllester. Particle belief propagation. In Artificial Intelligence and Statistics, pages 256–263, 2009.
E. L. Ionides, D. Nguyen, Y. Atchadé, S. Stoev, and A. A. King. Inference for dynamic and latent variable models via iterated, perturbed Bayes maps. Proceedings of the National Academy of Sciences, 112(3):719–724, 2015.
P. E. Jacob and A. H. Thiery. On nonnegative unbiased estimators. The Annals of Statistics, 43(2):769–784, 04 2015.
P. E. Jacob, L. M. Murray, and S. Rubenthaler. Path storage in the particle filter. Statistics and Computing, 25(2):487–496, 2015.
P. E. Jacob, F. Lindsten, and T. B. Schön. Smoothing with couplings of conditional particle filters. Journal of the American Statistical Association (JASA), 2019.
A. Jasra, D. A. Stephens, A. Doucet, and T. Tsagaris. Inference for Lévy-driven stochastic volatility models via adaptive sequential Monte Carlo. Scandinavian Journal of Statistics, 38(1):1–22, 2011.
A. M. Johansen and A. Doucet. A note on auxiliary particle filters. Statistics & Probability Letters, 78(12):1498–1504, 2008.
S. J. Julier. The scaled unscented transformation. In Proceedings of the 2002 American Control Conference, volume 6, pages 4555–4559. IEEE, 2002.
S. J. Julier and J. K. Uhlmann. New extension of the kalman filter to nonlinear systems. In Signal processing, sensor fusion, and target recognition VI, volume 3068, pages 182–194. International Society for Optics and Photonics, 1997.
H. Kahn. Random sampling (Monte Carlo) techniques in neutron attenuation problems–i. Nucleonics, 6(5):27–37, 1950a.
H. Kahn. Random sampling (Monte Carlo) techniques in neutron attenuation problems–ii. Nucleonics, 6(6):60–65, 1950b.
D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
G. Kitagawa. A Monte Carlo filtering and smoothing method for non-Gaussian nonlinear state space models. In Proceedings of the 2nd US-Japan joint Seminar on Statistical Time Series Analysis, pages 110–131, 1993.
G. Kitagawa. Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. Journal of computational and graphical statistics, 5(1):1–25, 1996.
D. Koller, N. Friedman, and F. Bach. Probabilistic graphical models: principles and techniques. MIT press, 2009.
R. G. Krishnan, U. Shalit, and D. Sontag. Structured inference networks for nonlinear state space models. In AAAI, pages 2101–2109, 2017.
D. Lawson, G. Tucker, C. A. Naesseth, C. Maddison, and Y. Whye Teh. Twisted variational sequential Monte Carlo. Third workshop on Bayesian Deep Learning (NeurIPS), 2018.
T. A. Le, M. Igl, T. Rainforth, T. Jin, and F. Wood. Auto-encoding sequential Monte Carlo. In International Conference on Learning Representations, 2018.
A. Lee and N. Whiteley. Forest resampling for distributed sequential Monte Carlo. Journal of statistical analysis and data mining, 9(4):230–248, 2016.
A. Lee, C. Yau, M. B. Giles, A. Doucet, and C. C. Holmes. On the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods. Journal of Computational and Graphical Statistics, 10(4): 769–789, 2010.
M. Lin, R. Chen, and J. S. Liu. Lookahead strategies for sequential Monte Carlo. Statistical Science, 28(1):69–94, 2013.
F. Lindsten and A. Doucet. Pseudo-marginal Hamiltonian Monte Carlo. arXiv.org, arXiv:1607.02516, 2016.
F. Lindsten and T. B. Schön. Backward simulation methods for Monte Carlo statistical inference. Foundations and Trends® in Machine Learning, 6(1): 1–143, 2013.
F. Lindsten, M. I. Jordan, and T. B. Schön. Particle Gibbs with ancestor sampling. The Journal of Machine Learning Research, 15(1):2145–2184, 2014.
F. Lindsten, A. Johansen, C. A. Naesseth, B. Kirkpatrick, T. B. Schön, J. Aston, and A. Bouchard-Côté. Divide-and-conquer with sequential Monte Carlo. Journal of Computational and Graphical Statistics, 26(2):445–458, 2017.
F. Lindsten, J. Helske, and M. Vihola. Graphical model inference: Sequential Monte Carlo meets deterministic approximations. In Advances in Neural Information Processing Systems 31, pages 8201–8211. Curran Associates, Inc., 2018.
J. S. Liu. Monte Carlo strategies in scientific computing. Springer Science+Business Media, 2004.
C. Louizos, U. Shalit, J. M. Mooij, D. Sontag, R. Zemel, and M. Welling. Causal effect inference with deep latent-variable models. In Advances in Neural Information Processing Systems, pages 6446–6456. Curran Associates, Inc., 2017.
S. N. MacEachern, M. Clyde, and J. S. Liu. Sequential importance sampling for nonparametric Bayes models: The next generation. Canadian Journal of Statistics, 27(2):251–267, 1999.
C. J. Maddison, D. Lawson, G. Tucker, N. Heess, M. Norouzi, A. Mnih, A. Doucet, and Y. Whye Teh. Filtering variational objectives. In Advances in Neural Information Processing Systems, 2017.
N. Metropolis and S. Ulam. The Monte Carlo method. Journal of the American statistical association, 44(247):335–341, 1949.
N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.
I. Murray and M. M. Graham. Pseudo-marginal slice sampling. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), Cadiz, Spain, 2016.
L. M. Murray, A. Lee, and P. E. Jacob. Parallel resampling in the particle filter. Journal of Computational and Graphical Statistics, 25(3):789–805, 2016.
C. A. Naesseth, F. Lindsten, and T. B. Schön. Sequential Monte Carlo for Graphical Models. In Advances in Neural Information Processing Systems 27, pages 1862–1870. Curran Associates, Inc., Montreal, Canada, 2014.
C. A. Naesseth, F. Lindsten, and T. B. Schön. Nested sequential Monte Carlo methods. In International Conference on Machine Learning (ICML), pages 1292–1301, 2015.
C. A. Naesseth, F. Ruiz, S. W. Linderman, and D. M. Blei. Reparameterization gradients through acceptance-rejection sampling algorithms. In Artificial Intelligence and Statistics (AISTATS), pages 489–498, 2017.
C. A. Naesseth, S. Linderman, R. Ranganath, and D. M. Blei. Variational sequential Monte Carlo. In International Conference on Artificial Intelligence and Statistics, volume 84, pages 968–977. PMLR, 2018.
C. A. Naesseth, F. Lindsten, and T. B. Schön. High-dimensional filtering using nested sequential Monte Carlo. IEEE Transactions on Signal Processing, 67 (16):4177–4188, 2019.
R. M. Neal. Annealed importance sampling. Statistics and computing, 11(2): 125–139, 2001.
C. Osendorfer and J. Bayer. Learning stochastic recurrent networks. Technical report, arXiv:1411.7610, 2014.
B. Paige and F. Wood. Inference networks for sequential Monte Carlo in graphical models. In International Conference on Machine Learning, pages 3040– 3049, 2016.
M. K. Pitt and N. Shephard. Filtering via simulation: Auxiliary particle filters. Journal of the American statistical association, 94(446):590–599, 1999.
T. Rainforth, C. Naesseth, F. Lindsten, B. Paige, J.-W. Van de Meent, A. Doucet, and F. Wood. Interacting particle Markov chain Monte Carlo. In International Conference on Machine Learning, pages 2616–2625, 2016.
P. Rebeschini and R. Van Handel. Can local particle filters beat the curse of dimensionality? The Annals of Applied Probability, 25(5):2809–2866, 2015.
D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286. PMLR, 2014.
H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 09 1951.
C. Robert and G. Casella. Monte Carlo statistical methods. Springer Science & Business Media, 2004.
D. B. Rubin. Comment: A noniterative sampling/importance resampling alternative to the data augmentation algorithm for creating a few imputations when fractions of missing information are modest: The SIR algorithm. Journal of the American Statistical Association, 82(398):543–546, 1987.
F. Ruiz, M. Titsias, and D. Blei. The generalized reparameterization gradient. In Advances in neural information processing systems, pages 460–468, 2016.
C. Schäfer and N. Chopin. Sequential Monte Carlo on large binary sampling spaces. Statistics and Computing, 23(2):163–184, 2013.
T. B. Schön, A. Wills, and B. Ninness. System identification of nonlinear state-space models. Automatica, 47(1):39–49, Jan. 2011.
C. Snyder, T. Bengtsson, P. Bickel, and J. Anderson. Obstacles to high-dimensional particle filtering. Monthly Weather Review, 136(12):4629–4640, 2008.
C. Snyder, T. Bengtsson, and M. Morzfeld. Performance bounds for particle filters using the optimal proposal. Monthly Weather Review, 143(11):4750– 4761, 2015.
L. Stewart and P. McCarty, Jr. Use of Bayesian belief networks to fuse continuous and discrete information for target recognition, tracking, and situation assessment. In Proc. SPIE, volume 1699, pages 177–185, 1992.
J.-W. van de Meent, B. Paige, H. Yang, and F. Wood. An introduction to probabilistic programming. arXiv preprint arXiv:1809.10756, 2018.
R. Van Der Merwe, A. Doucet, N. De Freitas, and E. A. Wan. The unscented particle filter. In Advances in neural information processing systems (NIPS), pages 584–590, 2001.
C. Vergé, C. Dubarry, P. Del Moral, and E. Moulines. On parallel implementation of sequential Monte Carlo methods: the island particle model. Statistics and Computing, 25(2):243–260, 2015.
W. Wagner. Unbiased Monte Carlo evaluation of certain functional integrals. Journal of Computational Physics, 71(1):21–33, 1987.
M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1 (1–2):1–305, 2008.
N. Whiteley. Stability properties of some particle filters. The Annals of Applied Probability, 23(6):2500–2537, 2013.
N. Whiteley, A. Lee, and K. Heine. On the role of interaction in sequential Monte Carlo algorithms. Bernoulli, 22(1):494–529, 2016.
D. Whitley. A genetic algorithm tutorial. Statistics and computing, 4(2):65–85, 1994.
F. Wood, J. W. Meent, and V. Mansinghka. A new approach to probabilistic programming inference. In Artificial Intelligence and Statistics, pages 1024– 1032, 2014.