b

DiscoverSearch
About
My stuff
The reproducing Stein kernel approach for post-hoc corrected sampling
2020·arXiv
Abstract
Abstract

: Stein importance sampling [39] is a widely applicable technique based on kernelized Stein discrepancy [40], which corrects the output of approximate sampling algorithms by reweighting the empirical distribution of the samples. A general analysis of this technique is conducted for the previously unconsidered setting where samples are obtained via the simulation of a Markov chain, and applies to an arbitrary underlying Polish space. We prove that Stein importance sampling yields consistent estimators for quantities related to a target distribution of interest by using samples obtained from a geometrically ergodic Markov chain with a possibly unknown invariant measure that differs from the desired target. The approach is shown to be valid under conditions that are satisfied for a large number of unadjusted samplers, and is capable of retaining consistency when data subsampling is used. Along the way, a universal theory of reproducing Stein kernels is established, which enables the construction of kernelized Stein discrepancy on general Polish spaces, and provides sufficient conditions for kernels to be convergence determining on such spaces. These results are of independent interest for the development of future methodology based on kernelized Stein discrepancies.

Keywords and phrases: reproducing kernel, Stein’s method, Markov chain Monte Carlo, importance sampling. MSC2020 subject classifications: Primary 65C05; secondary 60J22, 60B10.

Our problem of interest is the efficient computation of integrals with respect to some target probability measure  π. Adopting the Monte Carlo approach, an empirical distribution is used to approximate the target measure, formed from samples drawn according to it. However, in many problems of interest, it is not possible to simulate according to  πexactly, and so further approximate methods must be used. Arguably the most widely employed and general approach is Markov Chain Monte Carlo (MCMC): successively drawing samples as a realization of a Markov chain. The dominant approach to MCMC involves the simulation of a process that is  π-ergodic, often constructed by the Metropolis-Hastings algorithm from an underlying irreducible and aperiodic Markov chain [56].

However, there has been significant recent interest in so-called unadjusted MCMC approaches [14, 18, 29, 42]. A common strategy with these methods is the approximate numerical simulation of Langevin diffusions, which are  π-ergodic [42]. For the same computational effort, one can achieve substantially lower variance of estimates at the cost of incurring additional (asymptotic) bias. Despite poorer asymptotic guarantees [20], the ensuing Markov chains are often rapidly mixing, and perform particularly well in high dimensional settings [19]. However, despite significant recent empirical success [29] and the presence of explicit error estimates [14, 18], such algorithms have not been widely adopted. Statisticians and practitioners alike remain skeptical, as consistency is often considered a basic theoretical necessity for any sampling algorithm.

The aim of this work is to construct a universal theoretical framework for Stein importance sampling. Consequently, one can achieve the best of both worlds by generating approximate samples from a rapidly mixing Markov chain, and then subsequently ‘correcting’ them, without pre-existing knowledge of the chain, to obtain consistent estimators of quantities involving an unnormalized probability distribution. This allows us to obtain both a low-variance estimator with good practical performance, and a theoretical guarantee that increasing the number of samples will provide an estimator with diminishing error. The approach is viable for all algorithms, most notably, including biased samplers. Stein importance sampling is not new, having been introduced as a form of black-box importance sampling in [39], but lacks sufficient theoretical justification to support its use as a universal post-hoc correction tool, particularly as it applies to Markov chains. It belongs to a large class of algorithms derived from the kernelized Stein discrepancy (KSD), first introduced in [40], and motivated by ideas in [50]. Interpreted as an integral probability metric with respect to a fixed target distribution  π, the KSD combines ideas from Stein’s method [58, 61] and reproducing kernel Hilbert spaces [2, 62], and has the advantage of being explicitly computable. KSD has been applied to a myriad of problems, including measurement of sample quality [27], goodness-of-fit tests [11, 32, 40, 68], variance reduction [39, 49, 50], variational inference [38, 54], and the construction of other sampling algorithms [9, 10, 15, 41, 64].

Stein importance sampling stands out as one of the most straightforward techniques involving KSD: for elements  X1, . . . , Xn ∈ S, weights w1, . . . , wnare selected such that the resulting weighted empirical distribution ˆπn with integrals

image

is as close as possible to a target distribution  πunder a corresponding KSD  S( · ∥π). As all weights must be non-negative and sum to one for ˆπn to be a valid probability distribution, the resulting optimization problem reduces to the constrained quadratic program

image

where  Kπis the Gram matrix corresponding to a  reproducing Stein kernel for π, where the inequality w ≥ 0is to be interpreted element-wise. Stein importance sampling requires no modification to existing methods; it is a post-hoc correction applicable for any arbitrary collection  {Xk}nk=1. Theprogram (2) is a well-studied problem for which many solution techniques exist, for example, interior point methods, or alternating direction method of multipliers (see [48] for a comprehensive survey).

1.1. Contribution

In this work, we make several important theoretical contributions to the theory of Stein importance sampling and reproducing Stein kernels:

(I) Our main result is presented in  §3as Theorem 1, where we establish sufficient conditions for Stein importance sampling — when applied to samples generated by a Markov chain — to yield consistent estimators for integrals of test functions under  π.

The result is particularly interesting for two reasons. Firstly, somewhat surprisingly, we do not require that the Markov chain is  π-ergodic. In fact, technical conditions for the result that pertain to the ergodicity and convergence of the chain are easily verified for many unadjusted samplers, which are of increasing interest to practitioners due to their rapid mixing in high dimension. Secondly, the result is established for a generalized form of Stein importance sampling where certain terms are estimated unbiasedly, for example by data subsampling, which provides justification for the use of the method in the large data settings common in twenty-first century statistical computing.

To establish the above result for general underlying spaces, we first in  §2— following some background exposition — establish new theory for Stein kernels and KSD that is of independent interest:

(II) Construction of reproducing Stein kernels, and therefore KSD, with respect to probability measures on arbitrary Polish spaces is addressed.

The construction subsumes existing approaches pertaining to Euclidean space [40, 50], discrete spaces [68], and Riemannian manifolds [4, 37], and further enables construction of Stein kernels and KSD on any Hilbert space, and other separable function spaces, such as C([0, 1]). It builds off the generator approach for Stein’s method [3]; Proposition 1 shows that reproducing Stein kernels can always be constructed from the generators of ergodic processes with strongly continuous semigroups, and Proposition 2 deals with the construction of reproducing Stein kernels on product spaces. Finally,

(III) We establish general sufficient conditions for reproducing Stein kernels to yield a convergence determining KSD.

Previously, specialized direct proofs have been used to establish the convergence determining property for specific KSDs; see [10, 27]. In Propositions 3 and 4, we show that in the locally compact setting, separability and divergence of some function in the image of the RKHS under the Stein operator yields the convergence determining property.

1.2. Notation

In the sequel, S denotes an arbitrary Polish space, with  C(S) (resp. Cb(S)) the set of arbitrary (resp. bounded) continuous functions from S to R, and F a Banach subspace of C(S). For locally compact  S, C0(S) denotes the closure of the space of compactly supported continuous functions from S to R under the uniform norm  ∥φ∥∞ = supx∈S |φ(x)|. For S ⊂ Rd, Ck(Rd) denotes the set of k-times continuously-differentiable functions from  Rd to R. The space of probability measures on S is denoted by P(S). The law of a random element X is denoted L(X). Let M(S) denote the space of  σ-finite measures on S, equipped with the vague topology. For any measure  µ ∈ M(S),the space of p-integrable functions with respect to  µis denoted  Lp(µ). For each  φ ∈ L1(µ), welet  µ(φ) := �S φ(x)µ(dx). For ν ∈ M(S), we write  µ ≪ ν if νis absolutely continuous with respect to  µ, that is, if  µ(B) = 0 implies ν(B) = 0 for any Borel set B— the corresponding Radon-Nikodym derivative in such cases is denoted ∂ν∂µ. We write µ ≡ νwhen the measures  µ andνare equivalent, that is, if  µ(B) = ν(B) for any Borel set B. We also adopt big O notation in probability:  {Xn}∞n=1, {Yn}∞n=1 satisfy Xn = OP(Yn) if for any  ϵ >0, there exists a finite M > 0 such that supn P(|Xn/Yn| > M) < ϵ. For any operator A and function f in the domain of A (denoted dom(A)), Af(x) denotes the evaluation of the function Af at x. For operators  A1, A2 acting onelements of a function space F of scalar-valued functions on  S, A1 ⊗A2denotes the tensor product of  A1 and A2, acting on functions  f : S × S → R by

image

For function spaces  H1 and H2, H1 ⊗ H2denotes the tensor product space of  H1 and H2. Finally,collections of elements are denoted in bold, and for any  x = (x1, . . . , xd) and 1 ≤ i ≤ d, we denote x−i = (x1, . . . , xi−1, xi+1, . . . , xd).

Discrepancy measures are a fundamental concept in many areas of probability and statistics. However, the vast majority of discrepancies can not be estimated unbiasedly, let alone computed exactly. The majority of useful discrepancies valid for arbitrary probability measures lie in the class of integral probability metrics [47], which compare two probability measures  µ and πin terms of their integrals with respect to some class of test functions Φ:

image

For example, by taking Φ to be the unit ball in  C(S), dΦbecomes the total variation metric. Choosing Φ to be the set of Fr´echet-differentiable functions with derivative bounded uniformly by one yields the Wasserstein-1 metric [63, Remark 6.5]. Conditions that guarantee  dΦis a metric on the space of probability measures, and that  dΦ(µn, π) → 0 as n → ∞ implies µn(φ) → π(φ) forany bounded continuous function  φ (the convergence determiningproperty), can be found in [21, Theorem 3.4.5].

There are now two obstacles encountered when computing a metric  dΦbetween an empirical measure  µand a target probability measure  π. Firstly, while the expectation  µ(φ) is easy enough to evaluate,  π(φ) is not. In fact, the empirical measure  µis usually designed to approximate  π, andcomputing  π(φ) is often the objective! One of the first universal methods for solving this problem is  Stein’s method, an ingenious technique which encodes details about a target probability measure πinto a so-called  Stein operator Aπ. Nearly twenty years after the method’s introduction in [61], Barbour [3] developed a popular framework for constructing Stein operators known as the generator approach, which also serves as intuition behind the method.

We briefly outline the approach: let  πbe an arbitrary probability measure on  S, and let Xπt bea  π-ergodic Markov process that induces a strongly continuous semigroup  {Pt}t≥0 on F definedby  Ptf(x) = E[f(Xπt )|Xπ0 = x] for f ∈ F. Arguably the most important case is when S is locally compact and  Xπt is Feller, in which case  {Pt}t≥0 is strongly continuous on  C0(S) [31, Theorem19.6]. Let  Aπdenote the Markov generator of  Xπt , that is, the linear operator satisfying  Aπf =limt→0+ 1t (Ptf − f) for functions  f ∈ Fwhere the limit exists with respect to the topology of F. Let Φ be a dense subset of the domain of  Aπwith respect to the topology of F. By [21, Proposition 1.1.5(c)], if  Xπ0 is distributed according to  µ, then for any  φ ∈ Φ and t ≥ 0,

image

and by taking  t → ∞, since Xπt D→ π,

image

For example, let  µ and πbe Dirac measures at points  a, b ∈ Rd (respectively) and  Xπt = b+(a−b)e−ta deterministic process satisfying  Xπ0 = a and limt→∞ Xπt = b. The infinitesimal generator of  Xπtis  Aπφ(x) = (b − x) · ∇φ(x), and so δa(φ) − δb(φ) = −� ∞0 ∇φ(Xπs ) · dXπs ,which is precisely the fundamental theorem of calculus for line integrals, along the path parameterized by  Xπt . Therefore, (3) is a stochastic generalization of the fundamental theorem of calculus for line integrals, where the infinitesimal generator plays the role of the gradient, and the stochastic process the curve joining two probability measures.

Keeping with this analogy, the Stein equation is a stochastic analogue of the mean value theorem. Let  Exdenote the expectation operator subject to the event  {Xπ0 = x}. Under the same assump- tions, by [21, Proposition 1.1.5], the function  x �→� t0 Exφ(Xπs )dslies in the domain of  Aπ for anyφ ∈ Φ and t ≥0, with image  x �→� t0 ExAπφ(Xπs )ds(in other words, the generator and the integral may be exchanged). Because  Aπis closed [21, Proposition 1.1.6], assuming strong continuity of  Aπ,one may derive the Stein equation

image

where  fφ(x) = −Ex� ∞0 [φ(Xπs ) − π(φ)]ds(see the proof of [26, Theorem 5], for example). Denoting by  FΦthe image of Φ under the operation  φ �→ fφ, note

image

no longer involves the expectation  π(φ), and therefore sidesteps the challenge in its computation. An important observation of Stein is that for any operator  Aπwith the property

image

we may proceed to formulate (4), solve for  fφaccordingly, and arrive at (5), even if  Aπ is not aMarkov generator. This procedure describes Stein’s method at its highest level of generality. Hence, operators satisfying (6) are often called Stein operators.

The second issue with computing  dΦis the non-trivial optimization problem of obtaining the supremum over Φ. In the context of (5), this is even more difficult due to the complex relationship between Φ and  FΦ. The trick, as outlined in [40], is to choose Φ such that  FΦis the unit ball of a reproducing kernel Hilbert space. Recall that a Hilbert space H of functions  h : S → R isa reproducing kernel Hilbert space (RKHS) if for all  x ∈ S, the evaluation functional  h �→ h(x)is bounded [2]. By the Riesz representation theorem, for any RKHS H, there exists a unique symmetric, positive-definite function  k : S × S → R, called the reproducing kernel of H, such that

image

The relation (7) is often called the reproducing property of the kernel. Conversely, the MooreAronszajn theorem [62, Theorem 4.21] states that any positive-definite kernel induces a unique corresponding RKHS. There exists a significant theory of RKHS; see, for example, [2] and [62,  §4].Reproducing kernel Hilbert spaces arise in machine learning as they can be used to reduce infinite-dimensional optimization problems over function spaces, to problems over finite-dimensional spaces. Such a simplification occurs in (5) when  FΦis the unit ball of a RKHS H, due to the existence of a reproducing Stein kernel, which is guaranteed when the Stein operator  Aπis the generator of a Markov process (Proposition 1). For brevity, we will often drop the term ‘reproducing’ as it is clear in our context, but stress that these Stein kernels should not be confused with those seen in [13], for example.

Proposition 1 (Stein Kernels from Generators). Let πbe a probability measure on  S, and Xπt aπ-ergodic process with strongly continuous semigroup  {Pt}t≥0 over Fand Markov generator  Aπ.Further, let  H ⊆ Fbe a RKHS with reproducing kernel k. Assuming that  k(·, x) ∈ dom(Aπ) forany  x ∈ S, then Hπ = AπHis a RKHS with reproducing kernel

image

The proof relies on the interchangeability of inner products with Bochner integrals and differential operators on  Rd; see [62, Lemma 4.34] for example. This property extends to Markov generators through their time derivative formulation.

Proof. Since the Markov generator is the time derivative of its corresponding semigroup, we can proceed in a similar fashion to [62, Lemma 4.34] concerning interchangability of derivatives and inner products. Let  Ptdenote the semigroup of  Xπt , recalling that for any  f ∈ F,

image

where  {κt}t≥0is the family of transition probability kernels of  Xπt . Interpreting (9) as a Bochner integral, for any  h ∈ H, since ⟨h, ·⟩His a continuous linear operator, the reproducing property together with properties of the Bochner integral imply

image

Therefore, by letting ∆t = t−1(Pt − I) for each t > 0, where Iis the identity operator,

image

To show that (Aπ ⊗ I)kexists, it suffices to show that for any sequence of positive real numbers {tn}∞n=1 converging to zero and any  x ∈ S, {(∆tn ⊗ I)k(x, ·)}∞n=1 is a Cauchy sequence in H. Since H is complete, by definition of the generator  Aπ,

image

For any  x ∈ S, let Dt(x) = (∆t ⊗ I)k(x, ·) ∈ H. Choosing any  x ∈ S and m, n ∈ N,

image

Now, for any  y ∈ S, by multiple applications of (10),

image

The Kolmogorov equations [21, Proposition 1.5(b)] state that for any  t ≥ 0 and hin the domain of Aπ, ∆th(x) = t−1 � t0 PsAπh(x)ds, which, by the mean value theorem, implies ∆th(x) = PξAπh(x)for some  ξ ∈ (0, t). Since k(·, y) and k(x, ·) are both assumed to be in the domain of  Aπ (due tothe symmetry of k), for some  ξm,n ∈ (0, tn) and ηm,n ∈ (0, tm),

image

The strong continuity of the semigroup  Ptimplies that for any  ϵ >0, there is an  N ∈ N such thatfor  m, n ≥ N,

image

The above inequality combined with (12) implies that  {Dtn(x)}∞n=1 is a Cauchy sequence, and so (11) holds. Together with (10),

image

The result now follows from [62, Theorem 4.21].

Example 1 (Riemannian Stein Kernels). We rederive the construction of Stein kernels on Riemannian manifolds outlined in [4, 37]. Let (M, g) be a complete connected d-dimensional Riemannian manifold with basis coordinates (∂j)dj=1. The gradient of a smooth function  f : M → R is definedon (M, g) by ∇f = �di,j=1 gij ∂f∂xi ∂j, where gij is the (i, j)-th element of the inverse of the matrix G = (gij) with elements  gij = g(∂i, ∂j) for i, j = 1, . . . , d. The Laplace-Beltrami operator ∆ on(M, g) is the divergence of the gradient of f, given by

image

For any  C1-smooth vector field Z on (M, g), there exists a unique Feller process  Xt on (M, g)with generator ∆ +  Z [66, §2.1]. If πis absolutely continuous with respect to the Riemannian volume measure on (M, g) with density p, the choice  Z = ∇ log pyields a process  Xtwith invariant measure  π [66, pg. 72]. Furthermore, under the Bakry-Emery curvature condition, the diffusion  Xtis ergodic, indeed, geometrically ergodic in the Wasserstein metric [66, Theorem 2.3.3]. Therefore, letting  Aπ = ∆ + ∇ log pbe the generator of the diffusion  Xt, by Proposition 1,  Aπ induces aStein kernel for any base reproducing kernel k on (M, g), with target measure  π. To complete the construction of Stein kernels on manifolds requires an underlying reproducing kernel. Any reproducing kernel  k on R2d admits an (extrinsic) reproducing kernel on (M, g) by the strong Whitney embedding theorem [59, Theorem 2.2.a] ((x, y) �→ k(φ(x), φ(y)) where x, y ∈ M andφ : M → R2d is an embedding). Intrinsic kernels are known for special manifolds, such as the d-dimensional sphere  Sd = {x ∈ Rd+1 : ∥x∥ = 1} [1].

Example 2 (Zanella-Stein Kernels). Stein kernels can be constructed on arbitrary discrete structures using the Zanella processes introduced in [69]. Here, we follow the presentation of [53]. Suppose that  πis a probability function on a countable set  X. For each x ∈ X, specify a set  ∂x ⊂ Xas the neighbourhood of x. Doing so endows X with a directed graph structure G with edge set E = {(x, y) : x ∈ X, y ∈ ∂x}. The choice of each neighbourhood is arbitrary, with the exception that the resulting graph G must be strongly connected and aperiodic. Let  g : R+ → R+ satisfyg(x) = xg(1/x) for all x > 0. Such functions are referred to as balancing functions, and include g(x) = min{1, x} and g(x) = x1+x. Then, the Markov jump process with infinitesimal generator

image

satisfies the detailed balance equations on X with respect to  π. Assuming the process is also ergodic, for any positive-definite graph kernel k on G (see [24] for a review on the subject), applying Proposition 1 to  Aπ and kyields a reproducing Stein kernel for  π on X:

image

As a consequence of the proof of Proposition 1, we require only that  Aπh(x) = ⟨h, (Aπ ⊗I)k(x, ·)⟩H for any h ∈ H and x ∈ S, to infer that  Hπ ⊂ His a RKHS with reproducing kernel (8). Therefore, for more general Stein operators, we rely on the following definition for constructing reproducing Stein kernels.

Definition 1. A Stein operator  Aπis said to induce a Stein kernel if for any RKHS  H ⊂ dom(Aπ)of scalar-valued functions on S with reproducing kernel k,

image

For any RKHS H of functions on S, target measure  π, and operator  Aπinducing a Stein kernel kπ, the kernelized Stein discrepancy(KSD) between  µ and π, denoted S(µ∥π), satisfies

image

where, as in Proposition 1,  Hπ = AπHand the second equality follows by the property  ∥Aπh∥Hπ =∥h∥H for any h ∈ H. This construction (13) resembles maximum mean discrepancy (MMD); see [46,  §3.5] for further details. Now, if  µis the empirical measure of samples  X1, . . . , Xn, then

image

Both of the issues concerning computation of discrepancies for empirical distributions are now addressed, with an explicit and often easily computable formula for the evaluation of a particular integral probability metric. However, the kernelized Stein discrepancy is not limited to applications with empirical distributions, since, for any distribution  µ on S, we can readily estimate  S(µ∥π)2 =E kπ(X, X′) where X, X′ are independently distributed according to  µ, using crude Monte Carlo.

2.1. Efficient construction of Stein kernels on product spaces

Building upon Barbour’s generator approach to Stein’s method [3], using Proposition 1 one can, in principle, construct Stein kernels over any desired space. Unfortunately, Stein kernels obtained in this way are not always efficient to compute, and do not always coincide with Stein kernels currently in the literature. Consider the two following univariate cases:

 S = R: If πis an absolutely continuous probability measure on R with smooth density p supported on a connected subset of R, the overdamped Langevin stochastic differential equation

image

is an ergodic Feller process with stationary measure  π. By Itˆo’s lemma, its Markov generator  Aπacts on functions  h ∈ C2(R) by

image

Because  h ∈ C2(R) is arbitrary, the extra derivatives on  h′′ and h′ are redundant for  Aπ to be aStein operator, and one may instead consider

image

which coincides with the canonical Stein operator on R [36, 50, 61]. The operator  Aπ as in (17) isno longer the generator of a Markov process, however it still induces a Stein kernel nonetheless,

as a consequence of [62, Lemma 4.34] (see also the remark after Proposition 1). Alternatively, in place of the Langevin diffusion (15), one could start with any other  π-ergodic diffusion equation. See [26] for other alternatives, all of which induce Stein kernels under the reduction of derivatives, as in (17).

 S ⊆ Z: If π(·) is a probability mass function on Z (or some other countable set, in which case, one need only apply a injection into Z), any birth-death chain on the integers with birth and death rates  α(·) and β(·) respectively, will have stationary measure  πprovided it satisfies the detailed balance relations

image

An immediate choice is

image

Notice that the proposed solution (18) does not require that  πbe normalized. The generator of this chain acts on all functions  h : Z → R by

image

Analogously to how the generator in (16) acted on the derivatives of  h, Aπhis acting on the (backward) difference of h. A similar reduction yields

image

It can be readily shown that (19) induces a Stein kernel. The Stein operator (19) is most common in discrete scenarios, as seen in [7], and coincides with the Stein-Chen operator for the Poisson distribution when  α(x) = 1 and β(x) = x+ 1, for example. This was also the approach taken in [68] for the construction of reproducing Stein kernels on discrete spaces.

With (17) as (19) as building blocks, one can easily construct Stein kernels on higher-dimensional spaces that are products of R and Z. The method for doing so is a decomposition of the state space into a product of subspaces, and extends to products of arbitrary Polish spaces as well. Let S ⊆ S1 × · · · × Sd and πbe a distribution on  S. For each i = 1, . . . , d, let πi(·|x−i) denote the conditional distribution of  π over its i-th subspace  Si, for x ∈ S. If Aπi(·|x−i)is a Stein operator for  πi(·|x−i) for each i = 1, . . . , d and x ∈ S, then we may consider the operator

image

where  f = (f1, . . . , fd) : S → Rd, and Pxi is the projection operator mapping  f : S → R to thefunction  y �→ f(x1, . . . , xi−1, y, xi+1, . . . , xd) over y ∈ Si. By linearity, if each  Aπi(·|x) induces aStein kernel, then  Aπalso induces a Stein kernel. This is formalized in Proposition 2, the proof of which builds upon the arguments of [50, Theorem 1], and generalizes [65, Theorem 1].

Proposition 2 (Stein Kernels on Product Spaces). Suppose that  S ⊆ S1 × · · · × Sd and let π be aprobability measure on  S. For each i = 1, . . . , d, let πi(·|x−i)denote the conditional distribution of π over its i-th subspace  Si. Furthermore, for each  i, let Aπi(·|x−i)be a Stein operator for  πi(·|x−i)acting on �x∈S Pxi Hi, where Hiis a RKHS of scalar-valued functions on S with reproducing kernel ki, that also induces a Stein kernel. Then, the operator  Aπacting on functions  h = (h1, . . . , hd) ∈

image

is a Stein operator for  π on H = H1 ⊗ · · · ⊗ Hd, and AπHis a RKHS of scalar-valued functions on S with reproducing kernel

image

Proof. Since the zero function is in  Hi and Aπi(·|x)is a Stein operator for  πi(·|x) for each i = 1, . . . , d,EAπh(X) = 0 for all h ∈ Hif and only if the conditional distributions of X are equivalent to those of  π. Therefore,  Aπis a Stein operator for  π on H. For each i = 1, . . . , d and x ∈ S, letKi(x) = (Aπi(·|x)Pxi ⊗ I)ki(x, ·) ∈ Hi, and let K(x) = (Ki(x))di=1. By the hypothesis, for any h ∈ H,

image

Defining  Hπto be the space of functions  h : S → Rd such that the norm

image

is finite, [62, Theorem 4.21] implies that  Hπis the reproducing kernel Hilbert space with reproducing kernel

image

where the last line follows since each  Aπi(·|x)Pxi commutes with the inner product. Since  Hπ canbe seen to be the image of  H under Aπ, the result follows.

Example 3 (Canonical Stein Kernel on  Rd).As an especially important example, if  π admits asmooth density  p on Rd, then applying (17), we can take  Aπi(·|x)h(x) = ∂xihi(x)+∂xi(log p(x))hi(x),and so

image

which is the well-known Stein operator for densities on  Rd. To build the corresponding Stein kernel, expanding out

(Aπi(·|x) ⊗ Aπi(·|y))k(x, y) = ∂xi∂yik(x, y) + ∂xi log p(x)∂yik(x, y)

image

Therefore, the Stein kernel  kπas given by (20) becomes

kπ(x, y) = ∇x · ∇yk(x, y) + ∇ log p(x) · ∇yk(x, y)

image

which is precisely the canonical Stein kernel of [27, 36, 40, 50]. There is an extensive literature available for choosing reproducing kernels on  Rd. The most common choices are radial basis function kernels of the form

image

By the Schoenberg construction, such a kernel has the required positive-definiteness property (and therefore induces a RKHS) over any dimension d, if and only if  κ ∈ C[0, ∞)∩C∞(0, ∞) is completely monotone, that is, for every non-negative integer m, the sign of the m-th derivative  κ(m)(x) of κ is(−1)m for any x > 0 [67, Theorem 7.13]. Examples of completely monotonic functions include  x �→(α + x)−β (inducing the inverse multiquadric kernel) and  x �→ e−αx (inducing the Gaussian kernel) for any  α, β >0. Furthermore, for smooth functions  f, g, if f and g′ are completely monotonic and g : R+ → R+, then f ◦gis completely monotonic [45, Theorem 2]. Therefore,  x �→ (α+log(1+x))−1(inducing the inverse-log kernel) is a completely monotone function.

Example 4 (Marginal Stein Kernel on  Rd). Letting kdenote a reproducing kernel on R, for a RKHS H of functions  h : R → R, we can extend H to an RKHS  Hiof functions  hi : Rd → Rby  hi(x1, . . . , xd) = h(xi) for x ∈ Rd. Since H and Hiare isomorphic,  Hiis a RKHS on  Rd withreproducing kernel  kisatisfying  ki(x, y) = k(xi, yi) for x, y ∈ Rd. The Stein operator (21) applied to (20) with these choices of underlying kernels yields the marginal Stein kernel first considered in [65]:

image

The kernel (24) enjoys an increased sensitivity to single-dimensional perturbations over (22), which becomes especially apparent with improved performance when the dimension d is large. On the other hand, it is unable to distinguish between distributions with equivalent marginals.

Example 5 (Stein Kernel on  Zd). Let π(·) be a probability mass function on  Zd. Any reproducing kernel on  Rd is also a reproducing kernel on  Zd. Applying (19) with the choice of transition rates (18) to (20) immediately yields

image

where  π±i(x) = π(x ± ei) and eidenotes the i-th unit coordinate vector. In cases where  π(x)decreases rapidly in x, this choice of Stein kernel can succumb to significant numerical error. Instead, one would prefer to involve ratios of  π, such as ri(x) = π(x − ei)/π(x). Observe that, if k is a positive-definite kernel on  Zd, then letting  π+(x) = π(x) + 1{π(x) = 0},

image

is also a positive-definite (and therefore, reproducing) kernel on  Zd. Applying (19) with the choice (18), letting  ιxi = 1{x + ei ∈ supp(π)}, the Stein kernel becomes

image

2.2. Convergence determining properties

One of the most important properties for a discrepancy measure is its ability to distinguish (or separate) a target measure from other distributions. We say that the KSD  S(·∥π) is separatingwhen

image

A stronger condition for the KSD  S(·∥π) is to be convergence determining:

image

To our knowledge, [27, Theorem 8], [9, Theorem 4], and [10, Theorems 3 & 4] comprise the KSDs that are currently known to be convergence determining. It is known that separation alone does not necessarily imply that a KSD is convergence determining for arbitrary sequences of probability measures [27, Theorem 6]. However, separation is sufficient to guarantee that the KSD is convergence determining for tight sequences of probability measures: by [21, Lemma 3.4.3], if  {µn}∞n=1 ⊂ P(S)is a tight sequence of probability measures and  S(·∥π) is separating, then  S(µn∥π) →0 if and only if  µnconverges weakly to  π. In the case where S is locally compact, an extension involving random probability measures is presented in Proposition 3(a). For a separating KSD to be convergence determining for arbitrary sequences of probability measures, it suffices to show that the KSD can detect tightness, that is, if  S(µn∥π) → 0, then {µn}∞n=1 is tight. For this, the following assumption is sufficient.

Assumption 1. For a continuous reproducing Stein kernel  kπon a locally compact space S with corresponding RKHS  Hπ, there exists  h ∈ Hπsuch that, for any M > 0, there is a compact set K ⊆ Ssatisfying infx/∈K h(x) > M.

Here, we adopt the convention that the infimum over the empty set is infinite, thus Assumption 1 is satisfied if S is compact.

Proposition 3 (Convergence Determination). Suppose that  S(·∥π)is separating and S is locally compact. Let (Ω, E, P) be an underlying probability space and suppose that  {µn}∞n=1 is a sequence of random probability measures, that is,  µn : Ω → P(S) for n = 1, 2, . . . . Then S(µn∥π) P→ 0 implies

µn D→ πif either (a)  {µn(ω)}∞n=1 is tight for almost every  ω ∈ Ω, or (b) Assumption 1 holds.

Proof. Assume that (a) holds. For any Borel set  B ⊆ S and ω ∈ Ω, {µn(ω, B)}∞n=1 is bounded in R+, and so {µn}∞n=1 is a tight sequence of random elements in M(S) [31, Lemma 16.15]. Consider a weakly convergent subsequence  µnk D→ µ. Passing through to a further subsequence if necessary, S(µnk∥π) →0 with probability one [31, Lemma 4.2]. From [21, Lemma 3.4.3] and Prohorov’s Theorem [31, Theorem 16.3], if  S(·∥π) is separating and  S(νn∥π) →0 for a sequence of tight probability measures  {νn}∞n=1, then νn → πweakly. Therefore, since there exists a set of probability one on which  S(µnk∥π) → 0 and {µnk(ω)}∞k=1 is a tight sequence of probability measures,  µnk(ω) →πweakly with probability one, implying that  µ ≡ π. Since {µn}∞n=1 was an arbitrary weakly convergent subsequence,  µn D→ π.

Now suppose that (b) holds. Let  {µnj}∞j=1 be an arbitrary subsequence of  {µn}∞n=1. Since M(S)is Polish [31, Theorem A2.3], if we can show that there exists a further subsequence of  {µnj}∞j=1that converges in distribution to  π, then µn D→ π. Since S(µnj∥π) P→ 0 as j → ∞, µnj(h) P→ 0,where  h ∈ Hπhas the property described in Assumption 1. Passing through to a subsequence if necessary, we may assume that  µnj(h) a.s.→0, and consequently, supj µnj(h) < +∞with probability one. Since  kπis continuous, h is also continuous (see [2, pg. 345]), and since there exists a compact set N such that infx/∈N h(x) > 0, his bounded below by  −C where C ≥0. Therefore, the function ˜h = h + C is non-negative, and supj µnj(˜h) < +∞with probability one. For  ϵ >0, there exists a compact set K such that infx/∈K ˜h(x) > ϵ−1 supj µnj(˜h) and so with probability one,

image

Since  ϵwas arbitrary,  {µnj(ω)}∞j=1 is tight for almost every  ω ∈ Ω, and µnj D→ π by (a). Therefore,

µn D→ π.

image

Proposition 3 suggests a general strategy for identifying Stein kernels that induce convergence determining KSD via Assumption 1. In particular, for  S = Rd, to satisfy Assumption 1, one should first look for Stein kernels such that  |kπ(x, y)| → +∞ as ∥x∥ → ∞ for fixed y ∈ Rd. As notedin [27], for the canonical Stein kernel, this requires that the base kernel k diminishes sufficiently slowly relative to the growth of  ∇ log p. For this reason, the inverse multiquadric (IMQ) kernel k(x, y) = (α + ∥x − y∥2)−β for α > 0, β < 12, is a good choice for subgaussian distributions, inducing a convergence-determining KSD [27, Theorem 8]. In Example 6, we show directly that the IMQ score kernel satisfies Assumption 1.

Example 6 (IMQ Score Kernel). Consider kπthe canonical Stein kernel (22) on  Rd constructed from the IMQ score kernel [10]

image

where  α, β > 0, and f = − log p. Assume that  ∇fis surjective and M-Lipschitz continuous, that is, there exists M > 0 such that  ∥∇f(x) − ∇f(y)∥ ≤ M∥x − y∥ for all x, y ∈ Rd. Furthermore, assume that  |kπ(x, y)| → ∞ as ∥x∥ → ∞ for any y ∈ Rd. For each i = 1, . . . , d, let y+i , y−i satisfy∇f(y+i ) = ei and ∇f(y−i ) = −ei, where ei is the i-th standard basis vector in  Rd. Letting

image

we observe that  kπ(x, y)/K(x, y) → 1 as ∥x∥, ∥y∥ → ∞. Now, letting  κ(x) = (α2 + x)−β, since κis completely monotone,

image

as  ∥x∥ → ∞. The same is also true for the function  x �→ �di=1[kπ(x, y+i ) + kπ(x, y−i )], which lies in  Hπ. Therefore,  kπsatisfies Assumption 1.

Our remaining challenge now is finding sufficient conditions for S to be separating. Assuming that S is locally compact, a reproducing kernel  k is called C0-universalif its corresponding RKHS H of functions  h : S → Ris dense in  C0(S) and k(x, ·) ∈ C0(S) for all x ∈ S [60]. Any RKHS over  Rd witha non-constant reproducing kernel of the radial basis function form (23) is  C0-universal if  κ ∈ C0(S)[44, Theorem 17]. Moreover, any restriction of a  C0-universal kernel on  Rd to a subset (for example, Zd), is also C0-universal on that restricted space. Every KSD arising from a  C0(S)-universal kernel on  Rd is separating for smooth densities from a certain restricted class of probability distributions [11, Theorem 2.2]. It is important that we can show it can separate  π from everyother probability measure.

A convenient condition, inspired by [25, 26], is presented in Proposition 4. Recall that a Markov process  Xt is exponentially ergodicif there exists some  C, ρ >0 such that for any bounded measurable function f,

image

Necessary and sufficient conditions for exponential ergodicity of Markov processes are available in [17]. For overdamped Langevin diffusion targeting a measure  πthat admits a smooth density p, that is,  Xπt satisfying

image

which is the multivariate generalization of (15), if  π ∈ C20(S) and there exists some  γ ∈ (0, 1) such

image

then  Xπt is exponentially ergodic [57, Theorem 2.3]. On the other hand, for finite case  |S| < ∞,by combining [17, Theorem 5.3] and [35, Theorem 4.9], we can see that any irreducible finite-state Markov jump process is exponentially ergodic.

Proposition 4 (Separation). Assume that S is locally compact. The kernelized Stein discrepancy S(·∥π)corresponding to the reproducing Stein kernel  kπin Proposition 1 is separating if  Aπ is thegenerator of an exponentially ergodic Markov process and  k is C0-universal.

Proof. By construction,  S(µ∥π) = 0 if µ ≡ π, so assume that  S(µ∥π) = 0, or in other words,µ(Aπh) = 0 for any h ∈ H(the RKHS corresponding to  k). Because k is C0-universal, to show that µ ≡ π, it will suffice to show that  µ(h) = π(h) for any function  h ∈ H(see, for example, [46, pg. 51]). It further suffices to show that  µ(h) = 0 for any h ∈ Hsatisfying  π(h) = 0. Let Pt = etAπbe the semigroup corresponding to  Aπ, which, by assumption, satisfies (27). Consider the resolvent operators  {Rλ}λ>0defined for  f ∈ C0(S) by

image

which satisfy for any  λ > 0, (λ − Aπ)Rλ = I, where Iis the identity operator (see [31, Theorem 19.4]). Since  k is C0-universal,  H ⊂ C0(S) and by the properties of the Bochner integral,  ∥Rλh∥H ≤λ−1∥Pth∥H ≤ λ−1∥h∥H, implying Rλ : H → H. Choosing some fixed  h ∈ H with π(h) = 0,together with the hypotheses, this implies that  µ(AπRλh) = 0 and so

image

Now, since  πis the stationary distribution of  {Pt}t≥0 and π(h) = 0, π(Pth) = 0 for any t ≥ 0.Therefore, since  Ptis geometrically ergodic and h is bounded,

image

and, in particular,  λ∥Rλh∥∞ → 0 as λ → 0+. Together with (28), this implies that  µ(h) = 0, andsince  h ∈ Hwas arbitrary, the result follows.

Proposition 4 can be extended to kernels constructed from Proposition 2, simply by applying the result to conditional probability measures. However, extending Proposition 4 to the canonical reproducing Stein kernel on  Rd is more challenging, since the corresponding Stein operator differs from a Markov generator by an extra derivative. Fortunately, this special case has been treated in [27] using results from [26]. In short, exponential convergence in the semigroup is still relevant, but required for Lipschitz-continuous, rather than bounded, test functions. Therefore, if the overdamped Langevin diffusion corresponding to  πmixes exponentially fast in Wasserstein distance, then the kernelized Stein discrepancy is separating. While exponential ergodicity of overdamped Langevin diffusion is known to hold for light-tailed target densities, sufficient conditions on exponential convergence in the Wasserstein metric often require log-concavity of the target density  π, or someother distant dissipativity condition [8, 26]. In particular, the KSD for the canonical Stein kernel on  Rd is separating if  πis distantly dissipative [27]. From [28, eq. 12], distant dissipativity also implies exponential ergodicity of overdamped Langevin diffusion.

Importance sampling is a classic Monte Carlo technique [33]: for any (possibly unnormalized) target probability measure  π, if X1, . . . , Xnare independent and identically distributed samples from a probability measure  ν with π ≪ ν, then for ˆπn defined by (1),

image

By the ergodic theorem, the same is true if  X1, . . . , Xncome from a  ν-ergodic Markov chain. Furthermore, in both cases, it can be shown that the rate of convergence as  n → ∞ is OP(n−1/2),for any  φ ∈ C0(S). Unfortunately, for this choice of weights, ˆπn(φ) will often have prohibitively large variance unless  π and νare close in total variation, and this is further exacerbated in high dimensions. On the other hand, in cases where  νis constructed to be a close approximation for π, such as when  X1, . . . , Xnare derived from some approximate sampling algorithm, explicit or computable forms for  νare rarely available.

These issues can be avoided entirely using Stein importance sampling [39]. Recall that, for a reproducing Stein kernel constructed with respect to a target measure  π, the kernelized Stein discrepancy between a weighted empirical measure (1) and  π is

image

where  Kπ = (kπ(xi, xj))ni,j=1 is the Gram matrix associated to  kπand the points  X1, . . . , Xn. Toobtain estimators of the form (1) that most closely match a given target measure, Stein importance sampling involves choosing weights  w = (w1, . . . , wn) that minimize the KSD (30), that is, the solution to the constrained quadratic program

image

There are two major advantages to this: (i) the underlying density of the samples  X1, . . . , Xn is nolonger needed, and (ii) the mean squared error of the corresponding estimator is often smaller than that obtained using classical importance sampling. One drawback of Stein importance sampling is the cubic computational complexity in the number of samples, however, many practitioners may not seek any more than a few thousand effective samples, for which solving (31) is typically inexpensive.

3.1. Stein kernel estimators

If computing a Stein kernel is either expensive or intractable, one might instead seek to conduct Stein importance sampling with only a collection of estimators of a Stein kernel. To be valid, it is necessary that the Gram matrix formed from these kernels remains positive-semidefinite with high probability. To cover this condition and facilitate further theoretical analysis, we extend the definition of Gram-de Finetti matrices, introduced in [16], to the kernel setting (Definition 2).

Definition 2 (Gram-de Finetti Kernel Arrays). An infinite array of random kernels (kij)∞i,j=1 ona probability space (Ω, E, P) with kij(ω) : S × S → R for each i, j = 1, 2, . . . , is Gram-de Finetti iffor every sequence  {xi}∞i=1 ⊂ S, (kij(xi, xj))∞i,j=1 is a Gram-de Finetti matrix, that is, a symmetric jointly exchangeable array (kij(xi, xj) is equivalent in distribution to  kσ(i)σ(j)(xσ(i), xσ(j)) for anyfinite permutation  σ of N) that is almost surely positive-semidefinite.

Observe that, for any positive-definite kernel  k, (k)∞i,j=1 is a Gram-de Finetti array. Like Gram- de Finetti matrices, such arrays possess convenient representations as functions over uniformly distributed random variables. The following Lemma 1 is established without difficulty by following the same arguments seen in proofs of the Dovbysh-Sudakov representation theorem (see [52, Lemma 1]).

Lemma 1. Let (kijπ )∞i,j=1 be a Gram-de Finetti array of kernels. There exists a function  Kπ andindependent random variables  ξ, (ξi)∞i=1 uniformly distributed in [0, 1] such that

image

Example 7 (Subsampled Stein Kernels). Consider the setting where a density p decomposes as

image

Here, each  pimay correspond to a single data point, in which case, n denotes the total number of data points used to construct the density p. Rather than computing p directly, it is popular to utilise subsampling, where log p is estimated by

image

where  S ⊆ {1, . . . , n}. Since log  pS often involves far fewer terms than log p itself, it can be significantly faster to compute. Such estimates are common in subsampled MCMC, based upon the pseudomarginal approach [55]. One of the biggest challenges with these methods is that  pS is notan unbiased estimator of p, which the pseudomarginal approach requires. However,  ∇ log pS is anunbiased estimator of  ∇ log p. Consequently, the subsampled canonical Stein kernel:

image

is an unbiased estimator of the canonical Stein kernel (22) when  i ̸= j, provided that each  Si ⊂{1, . . . , n} contains elements that are drawn independently with replacement, uniformly at random. Furthermore, for a sequence of such subsets  S1, S2, . . . ⊂ {1, . . . , n}, the array of subsampled Stein kernels (kij)∞i,j=1 is Gram-de Finetti.

Our proposed procedure for utilizing Stein kernel estimators in Stein importance sampling is presented in Algorithm 1.

Algorithm 1 (Stein Importance Sampling). Given a collection of samples  {Xi}ni=1, test function φ, and Gram-de Finetti array of kernels (kijπ )∞i,j=1 that estimate (unbiasedly) a reproducing Stein kernel  kπ when i ̸= j, execute the following steps:

1. Compute the Gram matrix  Kπ = (kijπ (Xi, Xj))ni,j=1.

2. Solve (31) to obtain the importance sampling weights w. 3. Output ˆπn(φ) = �ni=1 wiφ(Xi) as an estimator of  π(φ).

The use of Stein kernel estimators, or random Stein kernels, was recently considered for developing goodness-of-fit tests for latent variable models in [32]. These choices of Stein kernel estimators for latent variable models can be applied with Stein importance sampling via Algorithm 1.

3.2. The Stein correction

We now study the theoretical properties of Stein importance sampling as a post-hoc correction for MCMC output. In constrast to the Metropolis correction, we term this procedure the Stein correction. The primary evidence for the use of the Stein correction is the consistency guarantee provided in Theorem 1, a Markov chain analogue of [39, Theorem 3.2] with the further extension to arbitrary Gram-de Finetti array of kernels. Before stating the theorem, we recall the definition of V -uniform ergodicity [43,  §16]. A ν-ergodic Markov chain with Markov transition operator P on S is V -uniformly ergodic for a measurable function  V : Rd → [1, ∞) if

image

For a V -uniformly ergodic Markov chain, V is often referred to as a Lyapunov drift function, which provides an upper bound on the rate of decay of test functions  φfor which the convergence of  Pnφ isuniform. In the sequel, ˆπn denotes a weighted empirical distribution derived from Stein importance sampling (Algorithm 1) applied with the choice of  {Xi}ni=1, π, and kπclear from context.

Theorem 1 (Consistency of Stein Correction). Let {Xt}∞t=1 be a ν-ergodic Markov chain on S that is V -uniformly ergodic for some Lyapunov drift function  V : S → [1, ∞). Let kπbe a reproducing Stein kernel with respect to some target distribution  π on Swhich is absolutely continuous with respect to  ν, and inducing a kernelized Stein discrepancy S. Furthermore, let (kijπ )∞i,j=1 be a Gram-de Finetti array of kernels independent of  {Xt}∞t=1 such that Ekijπ = kπ for i ̸= j.

image

In either case, if S is locally compact and  kπis separating and satisfies Assumption 1, then ˆπn D→ πas  n → ∞.

Proof. By Lemma 1, there exists a function  Kπand independent uniformly distributed random variables  ξ, (ξi)∞i=1 such that kijπ (x, y) = Kπ(x, y, ξ, ξi, ξj) for all x, y ∈ S. To prove the desired convergence, our primary tool is the variance bound of [23] for U-statistics of non-stationary Markov chains. Since for each  n, w(n) minimizes Sn(w, x) := �ni,j=1 wiwjkπ(xi, xj) over w, it suffices to find a sequence of normalized reference weights  v(n) such that Sn(v(n), x) P→ 0 in case (a) andSn(v(n), x) = OP(n−1) in case (b). The final statement follows as a consequence of Proposition 3. These weights are to be constructed in reference to the usual importance sampling weights (29) obtained via the weight function  w = ∂π/∂ν. Furthermore, we let  EXdenote the conditional expectation operator conditioned on the entire Markov chain  {Xt}∞t=1, so that EXkijπ (Xi, Xj) =kπ(Xi, Xj) for any i, j.

Beginning with case (b) first, by (33), there exists a square integrable function F and a constant

image

for each  i, j and x ∈ S. Positive-definiteness ensures that

image

and so we may take F to satisfy

image

Consider the augmented Markov chain ˜Xt = (Xt, ξt) for t = 0, 1, 2, . . .Under the hypotheses, since ξiare independent, ˜Xt is ˜V-uniformly ergodic where

image

Letting  ϕ((x, u), (y, v)) = w(x)w(y)Kπ(x, y, ξ, u, v), since each  kijπ is an unbiased estimator of a Stein kernel, we see that

image

that is,  ϕis a degenerate kernel. Furthermore, from (35),

image

almost surely. Let  v(n) denote the normalized reference weights  v(n)i ∝ w(Xi). Observe that

image

where  Unis the degenerate U-statistic

image

and

image

Since, by assumption,  |ϕ((x, u), (x, u))|2 ≤ C ˜V (x, u) for some constant C > 0, it follows from [43, Theorem 17.01] that  ∥Vn∥2 = O(n−1/2). Furthermore, by applying [23, Corollary 2.3],  ∥Un∥2 =O(n−1). Finally, since  w ∈ L1(ν), [43, Theorem 17.01] implies that  Wna.s.→ 1 as n → ∞. Therefore, Sn(v(n), X) = OP(n−1) as required.

For case (a), the results of [23] can no longer be applied directly. Instead, consider the truncated reference weights  v(n)i ∝ w(Xi) ∧ τn, where τn = o(n1/2) and τn → ∞ as n → ∞. As before, by hypothesis, there exists a square integrable function F and a constant C such that

image

for each  i, j and x ∈ S, where, once again, F can be taken to satisfy (35). The augmented chain ˜Xtis ˜V -uniformly ergodic with respect to ˜V defined by (36) with the new choice of F, and now

image

almost surely. Defining

image

proceeding as before, there is the same decomposition (37) in terms of  Un, Vn, and Wn, but now with ϕreplaced by  ϕn and Wn = n−1 �ni=1 w(Xi)∧τn. Since ϕn ≤ τnϕ1 and (x, u) �→ ϕ1((x, u), (x, u)) isintegrable with respect to the product of  νand the Lebesgue measure on [0, 1], the ergodic theorem implies  τ −1n ∥Vn∥2 is bounded, and  n−1∥Vn∥2 →0. Unlike part (b),  ϕn is no longer degenerate, however, by applying [23, Corollary 2.3] to the Hoeffding decomposition of  Un, and recognising that

image

it follows that there is a constant K > 0 depending only on V , the geometric ergodicity of the chain Xtand the initial starting distribution (of  X0), such that, for ˜ϕn(x, y) =� 10� 10 ϕn((x, u), (y, v))dudvand any n = 2, 3, . . . ,

image

which implies that�ϕn(x, y)ν(dx)ν(dy) →0 by the dominated convergence theorem, which, together with (39), implies  ∥Un∥2 →0. Finally, concerning  Wn, by the ergodic theorem, for any fixed m ∈ N, with probability one,

image

and so  Wna.s.→1. It follows immediately that  Sn(v(n), x) P→ 0.

As a simple yet useful corollary, one can also look at the case where  X1, X2, . . .are independent and identically distributed random elements on S, in which case, V can be taken to be any arbitrary integrable function, and so Theorem 1(a) follows if  x �→ kπ(x, x) ∈ L2(ν). It is possible to weaken this assumption by using tighter estimates for the iid case, resulting in a generalization of [39, Theorem 3.2] presented in Corollary 1. Analogous results can also be obtained for other types of stationary sequences using similar arguments; we refer to [34] for the enabling estimates in these cases.

Corollary 1. In the setting of Theorem 1, let  X1, X2, . . .be a sequence of independent random variables on S with common distribution  ν, such that  πis absolutely continuous with respect to  ν.

image

Proof. As in the proof of Theorem 1(a), define  ϕnaccording to (38) and form the same decomposition  Un, Vn, Wn. Since τ −1n ∥Vn∥1 is bounded,  n−1∥Vn∥1 →0, and by the law of large numbers, Wna.s.→ 1. By [34, Theorem 1.3.3], there exists C > 0 such that  E[Un −� ˜ϕn(x, y)ν(dx)ν(dy)]2 ≤Cτ 2nn−1 → 0 as n → ∞. The remaining arguments in the proof of Theorem 1(a) arrive at (a). For (b), following the same arguments as Theorem 1(b), this time, computing the variance of  Vn gives∥Vn∥2 = O(n−1/2), while [34, Theorem 1.3.3] implies  ∥Un∥2 = O(n−1).

For the canonical Stein kernel on  Rd, if ∇ log p(x) is polynomial, then to satisfy the conditions of Theorem 1(a), it suffices to establish V -uniform ergodicity for a drift function V that grows as V (x) ∝ es∥x∥ for some s >0. In particular, inspired by [28, Corollary 3.3], the following Proposition 5 provides a simple sufficient condition for Theorem 1(a) when the Markov chain  {Xk}∞k=1 hasGaussian transition probabilities.

image

Proof. Our proof resembles that of [30, Proposition 1]. Let P denote the transition operator of {Xk}∞k=1, and let Vs(x) = es∥x∥. It will suffice to show that  {Xk}∞k=1 is Vs-uniformly ergodic for any s > 0. By [28, Theorem 3.1], the Markov kernel P is irreducible with respect to Lebesgue measure and aperiodic, and has small compact sets. Fixing s > 0, by [43, Theorem 15.0.1] and [43, Lemma 15.2.8], it suffices to show that  PVs(x)/Vs(x) → 0 as ∥x∥ → ∞. For Za standard normal random vector, let  α = lim sup∥x∥→∞ E∥µ(x) + σ(x)Z∥2/∥x∥2. By assumption,  α <1, and note that by Jensen’s inequality,

image

Finally, [5, Theorem 5.5] implies

image

and so  PVs(x)/Vs(x) → 0 as ∥x∥ → ∞ by (41).

Fortunately, (40) is satisfied for many algorithms generating approximate samples from distributions on  Rd, including the unadjusted Langevin algorithm [57], the (exact) implicit Langevin algorithm [30], and the tamed unadjusted Langevin algorithm [6]. In situations where a particular Markov chain of interest can itself only be approximately simulated (for example, the inexact implicit Langevin algorithm [30]), in Corollary 2, we appeal to the robustness of V -uniform ergodicity to perturbations in the total variation metric. Corollary 2 follows immediately from Theorem 1 and [22, Theorem 1].

Corollary 2. Let X1, X2, . . . be a V-uniformly ergodic Markov chain on S for some Lyapunov drift function  V : S → [1, ∞), with corresponding transition operator P. Consider a family of Markov chains  {Xϵt }ϵ>0 with a corresponding family of transition operators  {Pϵ}ϵ>0 such that Pϵ → P intotal variation as  ϵ → 0+. If Vsatisfies the hypotheses of Theorem 1, then there exists some  ϵ0 > 0such that the results of Theorem 1 hold for the Markov chain  Xϵ1, Xϵ2, . . . for any 0 < ϵ < ϵ0.

The results of two brief numerical experiments are shown. The first experiment investigates how the OP(n−1/2) convergence rate in KSD compares to that of MMD on a simple example where the latter can be computed explicitly. The second experiment considers a challenging target measure from a Bayesian deep learning problem, and shows that the Stein correction can still lead to observable improvements in the convergence rate under an estimated KSD.

4.1. Empirical study of the rate of convergence

Despite the  OP(n−1/2) rate of convergence in Theorem 1(b), convergence rates in KSD do not necessarily carry across to other probability metrics. To explore the extent to which this rate of convergence might hold with other discrepancy measures, we conduct an empirical study with a simple sampling problem for which the MMD, an alternative convergence determining probability metric popular in machine learning, can be computed exactly. Our target  π is the d-dimensional standard normal distribution  N(0, Id) with d = 20, which gives ∇ log p(x) = −x. We employ the canonical Stein kernel (24) with IMQ and Gaussian base kernels, given by  kIMQ(x, y) = (1 +∥x − y∥2)− 12 and kGauss(x, y) = e−∥x−y∥2 respectively. We remark that the former kernel satisfies Assumption 1, while the latter does not. To sample approximately from  N(0, Id), we use the tamed unadjusted Langevin algorithm (TULA), simulating the Markov chain

image

with  X0 = 0, and where each  Zkis an independent standard normal random vector in d dimensions, h > 0 is the step size, and  γ >0 is a taming parameter, ensuring the chain can drift by no more than  h/(2γ) in Euclidean distance at each step. By [6, Proposition 3], for any  h, γ > 0, theconditions of Theorem 1(b) are satisfied. In Figure 1, we plot MMD to the standard normal distribution for the empirical distributions of unadjusted and Stein-adjusted samples generated using TULA, and exact iid samples. As expected, the unadjusted empirical distribution fails to converge in either MMD or KSD. On the other hand, the exact and Stein-adjusted samples yield empirical distributions that converge not only at the expected  O(n−1/2) rate in KSD, but also in MMD. In fact, once sufficiently many samples have been obtained, the Stein-adjusted empirical distribution using the IMQ base kernel exhibits improved performance over exact iid samples under both discrepancy measures. The Gaussian base kernel, which fails to satisfy Assumption 1, exhibits poorer performance in MMD, although does better than unadjusted samples. Altogether, this experiment supports the possibility of transferal for convergence rates from convergence determining KSD to other discrepancy measures.

image

Fig 1. Rate of convergence in MMD to the standard multivariate normal distribution in 20 dimensions of unadjusted vs Stein-adjusted empirical distributions. Samples are obtained using TULA with step size h = 1 and taming parameter γ = 0.05. The corresponding Stein-adjusted weighted empirical distributions are computed with respect to the canonical Stein kernel with Gaussian/IMQ kernels and exact samples.

4.2. Comparison with subsampled and full-data KSD

The second example considers Bayesian logistic regression with a Gaussian prior, yielding a target measure with log-density

image

Again, the TULA algorithm is used for sampling, but with subsampled gradient (that is, log p terms are replaced by their subsampled counterpart, log  pSt). Note that the application of Stein importance sampling to subsampled gradients can be justified by viewing the problem through the lens of random switching Markov chains [12]. Indeed, TULA with subsampled gradients is a random switching Markov chain indexed over the subsample  St. For any fixed subsample S , the Markov chain  XS induced by the TULA algorithm with target density  pS is VS-uniformly ergodic, where VS (x) = eaS (1+∥x∥2)1/2 [6, Proposition 3]. Taking  a = minS aS, a modification of [43, Lemma 15.2.9] implies that  XS is V-uniformly ergodic, where  V (x) = ea(1+∥x∥2)1/2. Since Sis arbitrary, [12, Lemma 2.8] implies that TULA with subsampled gradients is  V α-uniformly ergodic for some α >0, and hence satisfies the conditions of Theorem 1(b).

Two of the larger data sets from the binary regression examples in [51] are considered, namely krkp (n = 3196, d = 38) and spam (n = 4601, d = 105). In the former example, the step size and taming parameter are taken to be  h = 0.1 and γ = 0.05, respectively, with  h = 0.05 and γ = 0.01 inthe latter example. To coincide with the use of subsampled gradients for sampling, we also consider performance of Stein importance sampling under a subsampled Stein kernel. For the conditions of Theorem 1 to hold, it is essential that the subsampled gradients in the Stein kernel are independent of those used in the sampling proceess. Therefore, there are two subsampling-related parameters in this experiment: the number  nsof subsampled data points at each iteration of TULA, and the number  nkof subsampled data points in computing the subsampled Stein kernel. For simplicity, we take  ns = nk, and set ns = 500 and ns = 1000 for krkp and spam, respectively.

Results are shown in Figure 2, where we plot the (full-data) KSD for the unadjusted samples, the samples obtained with the full-data Stein correction, and the samples obtained with the subsampled Stein correction. We observe in both examples the full-data Stein correction achieving an approximate  O(n−1/2) rate of convergence toward the end of the simulation. We observe a similar phenomenon for the subsampled Stein correction in the krkp example, but not necessarily for the spam example, which is still exhibiting prelimiting behaviour.

image

Fig 2. Results for the two binary regression problems krkp and spam. All KSD values reported are computed with respect to the full data set.

[1] Alegria, A., Cuevas, F., Diggle, P. and Porcu, E. (2018). A family of covariance functions for random fields on spheres. CSGB Research Reports, Department of Mathematics, Aarhus University.

[2] Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society 68 337–404.

[3] Barbour, A. D. (1988). Stein’s method and Poisson process convergence. Journal of Applied Probability 25 175–184.

[4] Barp, A., Oates, C., Porcu, E. and Girolami, M. (2018). A Riemannian-Stein Kernel Method. arXiv preprint arXiv:1810.04946.

[5] Boucheron, S., Lugosi, G. and Massart, P. (2013). Concentration inequalities: A nonasymptotic theory of independence. Oxford university press.

[6]  Brosse, N., Durmus, A., Moulines, ´E. and Sabanis, S.(2018). The tamed unadjusted Langevin algorithm. Stochastic Processes and their Applications.

[7] Brown, T. C. and Xia, A. (2001). Stein’s method and birth-death processes. Annals of Probability 1373–1403.

[8] Cattiaux, P. and Guillin, A. (2014). Semi log-concave Markov diffusions. In  S´eminaire deprobabilit´es XLVI231–292. Springer.

[9] Chen, W. Y., Barp, A., Briol, F.-X., Gorham, J., Girolami, M., Mackey, L. and Oates, C. (2019). Stein Point Markov Chain Monte Carlo. In Proceedings of the 36th International Conference on Machine Learning 1011–1021.

[10] Chen, W. Y., Mackey, L., Gorham, J., Briol, F.-X. and Oates, C. (2018). Stein Points. In Proceedings of the 35th International Conference on Machine Learning 844–853.

[11] Chwialkowski, K., Strathmann, H. and Gretton, A. (2016). A kernel test of goodness of fit. JMLR: Workshop and Conference Proceedings.

[12] Cloez, B. and Hairer, M. (2015). Exponential ergodicity for Markov processes with random switching. Bernoulli 21 505–536.

[13] Courtade, T. A., Fathi, M. and Pananjady, A. (2019). Existence of Stein kernels under a spectral gap, and discrepancy bounds. In  Annales de l’Institut Henri Poincar´e, Probabilit´eset Statistiques 55 777–790. Institut Henri Poincar´e.

[14] Dalalyan, A. S. and Karagulyan, A. (2019). User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Processes and their Applications 129 5278– 5311.

[15] Detommaso, G., Cui, T., Marzouk, Y., Spantini, A. and Scheichl, R. (2018). A Stein variational Newton method. In Advances in Neural Information Processing Systems 9169– 9179.

[16] Dovbysh, L. and Sudakov, V. (1984). Gram-de Finetti matrices. Journal of Soviet Mathematics 27 3047–3054.

[17] Down, D., Meyn, S. P. and Tweedie, R. L. (1995). Exponential and uniform ergodicity of Markov processes. The Annals of Probability 1671–1691.

[18] Durmus, A. and Moulines, A. (2017). Non-asymptotic convergence analysis for the unadjusted Langevin algorithm. Ann. Appl. Probab. 27 1551–1587.

[19] Durmus, A. and Moulines, E. (2019). High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli 25 2854–2882.

[20] Dwivedi, R., Chen, Y., Wainwright, M. J. and Yu, B. (2019). Log-concave sampling: Metropolis-Hastings algorithms are fast. Journal of Machine Learning Research 20 1–42.

[21] Ethier, S. N. and Kurtz, T. G. (2009). Markov processes: characterization and convergence 282. John Wiley & Sons.

[22]  Ferr´e, D., Herv´e, L. and Ledoux, J.(2013). Regular perturbation of V-geometrically ergodic Markov chains. Journal of Applied Probability 50 184–194.

[23] Fort, G., Moulines, E., Priouret, P. and Vandekerkhove, P. (2012). A simple variance inequality for U-statistics of a Markov chain with applications. Statistics & Probability Letters 82 1193–1201.

[24]  Ghosh, S., Das, N., Gonc¸alves, T., Quaresma, P. and Kundu, M.(2018). The journey of graph kernels through two decades. Computer Science Review 27 88–111.

[25] Glynn, P. W. and Meyn, S. P. (1996). A Liapounov bound for solutions of the Poisson equation. The Annals of Probability 24 916–931.

[26] Gorham, J., Duncan, A. B., Vollmer, S. J. and Mackey, L. (2019). Measuring sample quality with diffusions. Ann. Appl. Probab. 29 2884–2928.

[27] Gorham, J. and Mackey, L. (2017). Measuring Sample Quality with Kernels. In International Conference on Machine Learning 1292–1301.

[28] Hansen, N. R. (2003). Geometric ergodicity of discrete-time approximations to multivariate diffusions. Bernoulli 9 725–743.

[29] Heek, J. and Kalchbrenner, N. (2019). Bayesian Inference for Large Scale Image Classifi-cation. arXiv preprint arXiv:1908.03491.

[30] Hodgkinson, L., Salomone, R. and Roosta, F. (2021). Implicit Langevin algorithms for sampling from log-concave densities. Journal of Machine Learning Research 22 1–30.

[31] Kallenberg, O. (2006). Foundations of Modern Probability. Springer Science & Business Media.

[32] Kanagawa, H., Jitkrittum, W., Mackey, L., Fukumizu, K. and Gretton, A. (2019). A Kernel Stein Test for Comparing Latent Variable Models. arXiv preprint arXiv:1907.00586.

[33] Kroese, D. P., Taimre, T. and Botev, Z. I. (2013). Handbook of Monte Carlo methods 706. John Wiley & Sons.

[34] Lee, A. J. (1990). U-statistics: Theory and Practice. Routledge.

[35] Levin, D. A. and Peres, Y. (2017). Markov chains and mixing times 107. American Mathematical Soc.

[36] Ley, C., Reinert, G. and Swan, Y. (2017). Stein’s method for comparison of univariate distributions. Probability Surveys 14 1–52.

[37] Liu, C. and Zhu, J. (2018). Riemannian Stein variational gradient descent for Bayesian inference. In Thirty-Second AAAI Conference on Artificial Intelligence.

[38] Liu, Q. and Feng, Y. (2016). Two methods for wild variational inference. arXiv preprint arXiv:1612.00081.

[39] Liu, Q. and Lee, J. (2017). Black-box Importance Sampling. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics 952–961.

[40] Liu, Q., Lee, J. and Jordan, M. (2016). A kernelized Stein discrepancy for goodness-of-fit tests. In International Conference on Machine Learning 276–284.

[41] Liu, Q. and Wang, D. (2016). Stein variational gradient descent: A general purpose bayesian inference algorithm. In Advances in Neural Information Processing Systems 2378–2386.

[42] Ma, Y.-A., Chen, T. and Fox, E. (2015). A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems 2917–2925.

[43] Meyn, S. P. and Tweedie, R. L. (2012). Markov chains and stochastic stability. Springer Science & Business Media.

[44] Micchelli, C. A., Xu, Y. and Zhang, H. (2006). Universal kernels. Journal of Machine Learning Research 7 2651–2667.

[45] Miller, K. S. and Samko, S. G. (2001). Completely monotonic functions. Integral Transforms and Special Functions 12 389–402.

[46]  Muandet, K., Fukumizu, K., Sriperumbudur, B. and Sch¨olkopf, B.(2017). Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning 10 1–141.

[47]  M¨uller, A.(1997). Integral probability metrics and their generating classes of functions. Advances in Applied Probability 29 429–443.

[48] Nocedal, J. and Wright, S. (2006). Numerical optimization. Springer Science & Business Media.

[49] Oates, C. J., Cockayne, J., Briol, F.-X. and Girolami, M. (2019). Convergence rates for a class of estimators based on Stein’s method. Bernoulli 25 1141–1159.

[50] Oates, C. J., Girolami, M. and Chopin, N. (2017). Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79 695–718.

[51] Ong, V. M.-H., Nott, D. J. and Smith, M. S. (2018). Gaussian variational approximation with a factor covariance structure. Journal of Computational and Graphical Statistics 27 465– 478.

[52] Panchenko, D. (2010). On the Dovbysh-Sudakov representation result. Electronic Communications in Probability 15 330–338.

[53] Power, S. and Goldman, J. V. (2019). Accelerated Sampling on Discrete Spaces with NonReversible Markov Processes. arXiv preprint arXiv:1912.04681.

[54] Pu, Y., Gan, Z., Henao, R., Li, C., Han, S. and Carin, L. (2017). VAE learning via Stein variational gradient descent. In Advances in Neural Information Processing Systems 4236– 4245.

[55] Quiroz, M., Kohn, R., Villani, M. and Tran, M.-N. (2019). Speeding up MCMC by efficient data subsampling. Journal of the American Statistical Association 114 831–843.

[56] Roberts, G. O. and Rosenthal, J. S. (2004). General state space Markov chains and

MCMC algorithms. Probability Surveys 1 20–71.

[57] Roberts, G. O. and Tweedie, R. L. (1996). Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 2 341–363.

[58] Ross, N. (2011). Fundamentals of Stein’s method. Probability Surveys 8 210–293.

[59] Skopenkov, A. B. (2008). Embedding and knotting of manifolds in Euclidean spaces. London Mathematical Society Lecture Note Series 347 248.

[60] Sriperumbudur, B. K., Fukumizu, K. and Lanckriet, G. R. G. (2011). Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research 12 2389–2410.

[61] Stein, C. (1986). Approximate computation of expectations. IMS.

[62] Steinwart, I. and Christmann, A. (2008). Support vector machines. Springer Science & Business Media.

[63] Villani, C. (2008). Optimal transport: old and new 338. Springer Science & Business Media.

[64] Wang, D. and Liu, Q. (2016). Learning to draw samples: with application to amortized MLE for generative adversarial learning. arXiv preprint arXiv:1611.01722.

[65] Wang, D., Zeng, Z. and Liu, Q. (2017). Stein variational message passing for continuous graphical models. In Proceedings of the 35th International Conference on Machine Learning.

[66] Wang, F.-Y. (2014). Analysis for diffusion processes on Riemannian manifolds 18. World Scientific.

[67] Wendland, H. (2004). Scattered data approximation 17. Cambridge University Press.

[68] Yang, J., Liu, Q., Rao, V. and Neville, J. (2018). Goodness-of-fit testing for discrete distributions via Stein discrepancy. In International Conference on Machine Learning 5557– 5566.

[69] Zanella, G. (2020). Informed proposals for local MCMC in discrete spaces. Journal of the American Statistical Association 115 852–865.

Liam Hodgkinson and Fred Roosta are affiliated with the International Computer Science Institute, Berkeley, CA, 94704, USA. All authors are supported in part by the Australian Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS), under Australian Research Council grant CE140100049. Fred Roosta is supported by DARPA D3M (FA8750-17-2-0122), Cray/77000, and ARC DECRA (DE180100923). We are grateful to Samuel Power for drawing our attention to Zanella processes.


Designed for Accessibility and to further Open Science