The reproducing Stein kernel approach for post-hoc corrected sampling

2020·Arxiv

Abstract

Abstract

: Stein importance sampling [39] is a widely applicable technique based on kernelized Stein discrepancy [40], which corrects the output of approximate sampling algorithms by reweighting the empirical distribution of the samples. A general analysis of this technique is conducted for the previously unconsidered setting where samples are obtained via the simulation of a Markov chain, and applies to an arbitrary underlying Polish space. We prove that Stein importance sampling yields consistent estimators for quantities related to a target distribution of interest by using samples obtained from a geometrically ergodic Markov chain with a possibly unknown invariant measure that differs from the desired target. The approach is shown to be valid under conditions that are satisfied for a large number of unadjusted samplers, and is capable of retaining consistency when data subsampling is used. Along the way, a universal theory of reproducing Stein kernels is established, which enables the construction of kernelized Stein discrepancy on general Polish spaces, and provides sufficient conditions for kernels to be convergence determining on such spaces. These results are of independent interest for the development of future methodology based on kernelized Stein discrepancies.

Keywords and phrases: reproducing kernel, Stein’s method, Markov chain Monte Carlo, importance sampling. MSC2020 subject classifications: Primary 65C05; secondary 60J22, 60B10.

1. Introduction

Our problem of interest is the efficient computation of integrals with respect to some target probability measure . Adopting the Monte Carlo approach, an empirical distribution is used to approximate the target measure, formed from samples drawn according to it. However, in many problems of interest, it is not possible to simulate according to exactly, and so further approximate methods must be used. Arguably the most widely employed and general approach is Markov Chain Monte Carlo (MCMC): successively drawing samples as a realization of a Markov chain. The dominant approach to MCMC involves the simulation of a process that is -ergodic, often constructed by the Metropolis-Hastings algorithm from an underlying irreducible and aperiodic Markov chain [56].

However, there has been significant recent interest in so-called unadjusted MCMC approaches [14, 18, 29, 42]. A common strategy with these methods is the approximate numerical simulation of Langevin diffusions, which are ]. For the same computational effort, one can achieve substantially lower variance of estimates at the cost of incurring additional (asymptotic) bias. Despite poorer asymptotic guarantees [20], the ensuing Markov chains are often rapidly mixing, and perform particularly well in high dimensional settings [19]. However, despite significant recent empirical success [29] and the presence of explicit error estimates [14, 18], such algorithms have not been widely adopted. Statisticians and practitioners alike remain skeptical, as consistency is often considered a basic theoretical necessity for any sampling algorithm.

The aim of this work is to construct a universal theoretical framework for Stein importance sampling. Consequently, one can achieve the best of both worlds by generating approximate samples from a rapidly mixing Markov chain, and then subsequently ‘correcting’ them, without pre-existing knowledge of the chain, to obtain consistent estimators of quantities involving an unnormalized probability distribution. This allows us to obtain both a low-variance estimator with good practical performance, and a theoretical guarantee that increasing the number of samples will provide an estimator with diminishing error. The approach is viable for all algorithms, most notably, including biased samplers. Stein importance sampling is not new, having been introduced as a form of black-box importance sampling in [39], but lacks sufficient theoretical justification to support its use as a universal post-hoc correction tool, particularly as it applies to Markov chains. It belongs to a large class of algorithms derived from the kernelized Stein discrepancy (KSD), first introduced in [40], and motivated by ideas in [50]. Interpreted as an integral probability metric with respect to a fixed target distribution , the KSD combines ideas from Stein’s method [58, 61] and reproducing kernel Hilbert spaces [2, 62], and has the advantage of being explicitly computable. KSD has been applied to a myriad of problems, including measurement of sample quality [27], goodness-of-fit tests [11, 32, 40, 68], variance reduction [39, 49, 50], variational inference [38, 54], and the construction of other sampling algorithms [9, 10, 15, 41, 64].

Stein importance sampling stands out as one of the most straightforward techniques involving KSD: for elements are selected such that the resulting weighted empirical distribution ˆwith integrals

is as close as possible to a target distribution under a corresponding KSD ). As all weights must be non-negative and sum to one for ˆto be a valid probability distribution, the resulting optimization problem reduces to the constrained quadratic program

where is the Gram matrix corresponding to a , where the inequality is to be interpreted element-wise. Stein importance sampling requires no modification to existing methods; it is a post-hoc correction applicable for any arbitrary collection program (2) is a well-studied problem for which many solution techniques exist, for example, interior point methods, or alternating direction method of multipliers (see [48] for a comprehensive survey).

1.1. Contribution

In this work, we make several important theoretical contributions to the theory of Stein importance sampling and reproducing Stein kernels:

(I) Our main result is presented in as Theorem 1, where we establish sufficient conditions for Stein importance sampling — when applied to samples generated by a Markov chain — to yield consistent estimators for integrals of test functions under

The result is particularly interesting for two reasons. Firstly, somewhat surprisingly, we do not require that the Markov chain is -ergodic. In fact, technical conditions for the result that pertain to the ergodicity and convergence of the chain are easily verified for many unadjusted samplers, which are of increasing interest to practitioners due to their rapid mixing in high dimension. Secondly, the result is established for a generalized form of Stein importance sampling where certain terms are estimated unbiasedly, for example by data subsampling, which provides justification for the use of the method in the large data settings common in twenty-first century statistical computing.

To establish the above result for general underlying spaces, we first in — following some background exposition — establish new theory for Stein kernels and KSD that is of independent interest:

(II) Construction of reproducing Stein kernels, and therefore KSD, with respect to probability measures on arbitrary Polish spaces is addressed.

The construction subsumes existing approaches pertaining to Euclidean space [40, 50], discrete spaces [68], and Riemannian manifolds [4, 37], and further enables construction of Stein kernels and KSD on any Hilbert space, and other separable function spaces, such as C([0, 1]). It builds off the generator approach for Stein’s method [3]; Proposition 1 shows that reproducing Stein kernels can always be constructed from the generators of ergodic processes with strongly continuous semigroups, and Proposition 2 deals with the construction of reproducing Stein kernels on product spaces. Finally,

(III) We establish general sufficient conditions for reproducing Stein kernels to yield a convergence determining KSD.

Previously, specialized direct proofs have been used to establish the convergence determining property for specific KSDs; see [10, 27]. In Propositions 3 and 4, we show that in the locally compact setting, separability and divergence of some function in the image of the RKHS under the Stein operator yields the convergence determining property.

1.2. Notation

In the sequel, S denotes an arbitrary Polish space, with )) the set of arbitrary (resp. bounded) continuous functions from S to R, and F a Banach subspace of C(S). For locally compact ) denotes the closure of the space of compactly supported continuous functions from S to R under the uniform norm ) denotes the set of k-times continuously-differentiable functions from . The space of probability measures on S is denoted by P(S). The law of a random element X is denoted L(X). Let M(S) denote the space of -finite measures on S, equipped with the vague topology. For any measure the space of p-integrable functions with respect to is denoted ). For each let ), we write is absolutely continuous with respect to , that is, if — the corresponding Radon-Nikodym derivative in such cases is denoted when the measures are equivalent, that is, if ) for any Borel set B. We also adopt big O notation in probability: ) if for any 0, there exists a finite M > 0 such that sup. For any operator A and function f in the domain of A (denoted dom(A)), Af(x) denotes the evaluation of the function Af at x. For operators elements of a function space F of scalar-valued functions on denotes the tensor product of , acting on functions

For function spaces denotes the tensor product space of collections of elements are denoted in bold, and for any , we denote

2. Reproducing Stein Kernels

Discrepancy measures are a fundamental concept in many areas of probability and statistics. However, the vast majority of discrepancies can not be estimated unbiasedly, let alone computed exactly. The majority of useful discrepancies valid for arbitrary probability measures lie in the class of integral probability metrics [47], which compare two probability measures in terms of their integrals with respect to some class of test functions Φ:

For example, by taking Φ to be the unit ball in becomes the total variation metric. Choosing Φ to be the set of Fr´echet-differentiable functions with derivative bounded uniformly by one yields the Wasserstein-1 metric [63, Remark 6.5]. Conditions that guarantee is a metric on the space of probability measures, and that any bounded continuous function property), can be found in [21, Theorem 3.4.5].

There are now two obstacles encountered when computing a metric between an empirical measure and a target probability measure . Firstly, while the expectation ) is easy enough to evaluate, ) is not. In fact, the empirical measure is usually designed to approximate computing ) is often the objective! One of the first universal methods for solving this problem is , an ingenious technique which encodes details about a target probability measure into a so-called . Nearly twenty years after the method’s introduction in [61], Barbour [3] developed a popular framework for constructing Stein operators known as the generator approach, which also serves as intuition behind the method.

We briefly outline the approach: let be an arbitrary probability measure on a -ergodic Markov process that induces a strongly continuous semigroup by . Arguably the most important case is when S is locally compact and is Feller, in which case is strongly continuous on 19.6]. Let denote the Markov generator of , that is, the linear operator satisfying lim) for functions where the limit exists with respect to the topology of F. Let Φ be a dense subset of the domain of with respect to the topology of F. By [21, Proposition 1.1.5(c)], if is distributed according to , then for any

and by taking

For example, let be Dirac measures at points (respectively) and a deterministic process satisfying . The infinitesimal generator of is which is precisely the fundamental theorem of calculus for line integrals, along the path parameterized by . Therefore, (3) is a stochastic generalization of the fundamental theorem of calculus for line integrals, where the infinitesimal generator plays the role of the gradient, and the stochastic process the curve joining two probability measures.

Keeping with this analogy, the Stein equation is a stochastic analogue of the mean value theorem. Let denote the expectation operator subject to the event . Under the same assump- tions, by [21, Proposition 1.1.5], the function lies in the domain of 0, with image (in other words, the generator and the integral may be exchanged). Because is closed [21, Proposition 1.1.6], assuming strong continuity of one may derive the Stein equation

where (see the proof of [26, Theorem 5], for example). Denoting by the image of Φ under the operation

no longer involves the expectation ), and therefore sidesteps the challenge in its computation. An important observation of Stein is that for any operator with the property

we may proceed to formulate (4), solve for accordingly, and arrive at (5), even if Markov generator. This procedure describes Stein’s method at its highest level of generality. Hence, operators satisfying (6) are often called Stein operators.

The second issue with computing is the non-trivial optimization problem of obtaining the supremum over Φ. In the context of (5), this is even more difficult due to the complex relationship between Φ and . The trick, as outlined in [40], is to choose Φ such that is the unit ball of a reproducing kernel Hilbert space. Recall that a Hilbert space H of functions a reproducing kernel Hilbert space (RKHS) if for all , the evaluation functional is bounded [2]. By the Riesz representation theorem, for any RKHS H, there exists a unique symmetric, positive-definite function , called the reproducing kernel of H, such that

The relation (7) is often called the reproducing property of the kernel. Conversely, the MooreAronszajn theorem [62, Theorem 4.21] states that any positive-definite kernel induces a unique corresponding RKHS. There exists a significant theory of RKHS; see, for example, [2] and [62, Reproducing kernel Hilbert spaces arise in machine learning as they can be used to reduce infinite-dimensional optimization problems over function spaces, to problems over finite-dimensional spaces. Such a simplification occurs in (5) when is the unit ball of a RKHS H, due to the existence of a reproducing Stein kernel, which is guaranteed when the Stein operator is the generator of a Markov process (Proposition 1). For brevity, we will often drop the term ‘reproducing’ as it is clear in our context, but stress that these Stein kernels should not be confused with those seen in [13], for example.

Proposition 1 (Stein Kernels from Generators)be a probability measure on -ergodic process with strongly continuous semigroup and Markov generator Further, let be a RKHS with reproducing kernel k. Assuming that any is a RKHS with reproducing kernel

The proof relies on the interchangeability of inner products with Bochner integrals and differential operators on , Lemma 4.34] for example. This property extends to Markov generators through their time derivative formulation.

Proof. Since the Markov generator is the time derivative of its corresponding semigroup, we can proceed in a similar fashion to [62, Lemma 4.34] concerning interchangability of derivatives and inner products. Let denote the semigroup of , recalling that for any

where is the family of transition probability kernels of . Interpreting (9) as a Bochner integral, for any is a continuous linear operator, the reproducing property together with properties of the Bochner integral imply

Therefore, by letting ∆is the identity operator,

To show that (exists, it suffices to show that for any sequence of positive real numbers converging to zero and any is a Cauchy sequence in H. Since H is complete, by definition of the generator

For any . Choosing any

Now, for any , by multiple applications of (10),

The Kolmogorov equations [21, Proposition 1.5(b)] state that for any in the domain of , which, by the mean value theorem, implies ∆for some ) are both assumed to be in the domain of the symmetry of k), for some

The strong continuity of the semigroup implies that for any 0, there is an for

The above inequality combined with (12) implies that is a Cauchy sequence, and so (11) holds. Together with (10),

The result now follows from [62, Theorem 4.21].

Example 1 (Riemannian Stein Kernels). We rederive the construction of Stein kernels on Riemannian manifolds outlined in [4, 37]. Let (M, g) be a complete connected d-dimensional Riemannian manifold with basis coordinates (. The gradient of a smooth function on ()-th element of the inverse of the matrix ) with elements (M, g) is the divergence of the gradient of f, given by

For any -smooth vector field Z on (M, g), there exists a unique Feller process with generator ∆ + is absolutely continuous with respect to the Riemannian volume measure on (M, g) with density p, the choice yields a process with invariant measure , pg. 72]. Furthermore, under the Bakry-Emery curvature condition, the diffusion is ergodic, indeed, geometrically ergodic in the Wasserstein metric [66, Theorem 2.3.3]. Therefore, letting be the generator of the diffusion , by Proposition 1, Stein kernel for any base reproducing kernel k on (M, g), with target measure . To complete the construction of Stein kernels on manifolds requires an underlying reproducing kernel. Any reproducing kernel admits an (extrinsic) reproducing kernel on (M, g) by the strong Whitney embedding theorem [59, Theorem 2.2.a] ((is an embedding). Intrinsic kernels are known for special manifolds, such as the d-dimensional sphere

Example 2 (Zanella-Stein Kernels). Stein kernels can be constructed on arbitrary discrete structures using the Zanella processes introduced in [69]. Here, we follow the presentation of [53]. Suppose that is a probability function on a countable set , specify a set as the neighbourhood of x. Doing so endows X with a directed graph structure G with edge set . The choice of each neighbourhood is arbitrary, with the exception that the resulting graph G must be strongly connected and aperiodic. Let g(x) = xg(1/x) for all x > 0. Such functions are referred to as balancing functions, and include . Then, the Markov jump process with infinitesimal generator

satisfies the detailed balance equations on X with respect to . Assuming the process is also ergodic, for any positive-definite graph kernel k on G (see [24] for a review on the subject), applying Proposition 1 to yields a reproducing Stein kernel for

As a consequence of the proof of Proposition 1, we require only that , to infer that is a RKHS with reproducing kernel (8). Therefore, for more general Stein operators, we rely on the following definition for constructing reproducing Stein kernels.

Definition 1. A Stein operator is said to induce a Stein kernel if for any RKHS of scalar-valued functions on S with reproducing kernel k,

For any RKHS H of functions on S, target measure , and operator inducing a Stein kernel (KSD) between ), satisfies

where, as in Proposition 1, and the second equality follows by the property . This construction (13) resembles maximum mean discrepancy (MMD); see [46, 3.5] for further details. Now, if is the empirical measure of samples

Both of the issues concerning computation of discrepancies for empirical distributions are now addressed, with an explicit and often easily computable formula for the evaluation of a particular integral probability metric. However, the kernelized Stein discrepancy is not limited to applications with empirical distributions, since, for any distribution , we can readily estimate are independently distributed according to , using crude Monte Carlo.

2.1. Efficient construction of Stein kernels on product spaces

Building upon Barbour’s generator approach to Stein’s method [3], using Proposition 1 one can, in principle, construct Stein kernels over any desired space. Unfortunately, Stein kernels obtained in this way are not always efficient to compute, and do not always coincide with Stein kernels currently in the literature. Consider the two following univariate cases:

• is an absolutely continuous probability measure on R with smooth density p supported on a connected subset of R, the overdamped Langevin stochastic differential equation

is an ergodic Feller process with stationary measure . By Itˆo’s lemma, its Markov generator acts on functions

Because ) is arbitrary, the extra derivatives on are redundant for Stein operator, and one may instead consider

which coincides with the canonical Stein operator on R [36, 50, 61]. The operator no longer the generator of a Markov process, however it still induces a Stein kernel nonetheless,

as a consequence of [62, Lemma 4.34] (see also the remark after Proposition 1). Alternatively, in place of the Langevin diffusion (15), one could start with any other -ergodic diffusion equation. See [26] for other alternatives, all of which induce Stein kernels under the reduction of derivatives, as in (17).

• ) is a probability mass function on Z (or some other countable set, in which case, one need only apply a injection into Z), any birth-death chain on the integers with birth and death rates ) respectively, will have stationary measure provided it satisfies the detailed balance relations

An immediate choice is

Notice that the proposed solution (18) does not require that be normalized. The generator of this chain acts on all functions

Analogously to how the generator in (16) acted on the derivatives of is acting on the (backward) difference of h. A similar reduction yields

It can be readily shown that (19) induces a Stein kernel. The Stein operator (19) is most common in discrete scenarios, as seen in [7], and coincides with the Stein-Chen operator for the Poisson distribution when + 1, for example. This was also the approach taken in [68] for the construction of reproducing Stein kernels on discrete spaces.

With (17) as (19) as building blocks, one can easily construct Stein kernels on higher-dimensional spaces that are products of R and Z. The method for doing so is a decomposition of the state space into a product of subspaces, and extends to products of arbitrary Polish spaces as well. Let be a distribution on ) denote the conditional distribution of -th subspace is a Stein operator for , then we may consider the operator

where is the projection operator mapping function . By linearity, if each Stein kernel, then also induces a Stein kernel. This is formalized in Proposition 2, the proof of which builds upon the arguments of [50, Theorem 1], and generalizes [65, Theorem 1].

Proposition 2 (Stein Kernels on Product Spaces). Suppose that probability measure on denote the conditional distribution of -th subspace . Furthermore, for each be a Stein operator for acting on is a RKHS of scalar-valued functions on S with reproducing kernel , that also induces a Stein kernel. Then, the operator acting on functions

is a Stein operator for is a RKHS of scalar-valued functions on S with reproducing kernel

Proof. Since the zero function is in is a Stein operator for if and only if the conditional distributions of X are equivalent to those of . Therefore, is a Stein operator for . By the hypothesis, for any

Defining to be the space of functions such that the norm

is finite, [62, Theorem 4.21] implies that is the reproducing kernel Hilbert space with reproducing kernel

where the last line follows since each commutes with the inner product. Since be seen to be the image of , the result follows.

Example 3 (Canonical Stein Kernel on As an especially important example, if smooth density , then applying (17), we can take and so

which is the well-known Stein operator for densities on . To build the corresponding Stein kernel, expanding out

(

Therefore, the Stein kernel as given by (20) becomes

which is precisely the canonical Stein kernel of [27, 36, 40, 50]. There is an extensive literature available for choosing reproducing kernels on . The most common choices are radial basis function kernels of the form

By the Schoenberg construction, such a kernel has the required positive-definiteness property (and therefore induces a RKHS) over any dimension d, if and only if ) is completely monotone, that is, for every non-negative integer m, the sign of the m-th derivative (, Theorem 7.13]. Examples of completely monotonic functions include ((inducing the inverse multiquadric kernel) and (inducing the Gaussian kernel) for any 0. Furthermore, for smooth functions are completely monotonic and is completely monotonic [45, Theorem 2]. Therefore, (inducing the inverse-log kernel) is a completely monotone function.

Example 4 (Marginal Stein Kernel on denote a reproducing kernel on R, for a RKHS H of functions , we can extend H to an RKHS of functions by are isomorphic, is a RKHS on reproducing kernel satisfying . The Stein operator (21) applied to (20) with these choices of underlying kernels yields the marginal Stein kernel first considered in [65]:

The kernel (24) enjoys an increased sensitivity to single-dimensional perturbations over (22), which becomes especially apparent with improved performance when the dimension d is large. On the other hand, it is unable to distinguish between distributions with equivalent marginals.

Example 5 (Stein Kernel on ) be a probability mass function on . Any reproducing kernel on is also a reproducing kernel on . Applying (19) with the choice of transition rates (18) to (20) immediately yields

where denotes the i-th unit coordinate vector. In cases where decreases rapidly in x, this choice of Stein kernel can succumb to significant numerical error. Instead, one would prefer to involve ratios of ). Observe that, if k is a positive-definite kernel on , then letting

is also a positive-definite (and therefore, reproducing) kernel on . Applying (19) with the choice (18), letting , the Stein kern