A Measure-Theoretic Approach to Kernel Conditional Mean Embeddings

2020·Arxiv

ABSTRACT

ABSTRACT

We present a new operator-free, measure-theoretic definition of the conditional mean embedding as a random variable taking values in a reproducing kernel Hilbert space. While the kernel mean embedding of marginal distributions has been defined rigorously, the existing operator-based approach of the conditional version lacks a rigorous definition, and depends on strong assumptions that hinder its analysis. Our definition does not impose any of the assumptions that the operator-based counterpart requires. We derive a natural regression interpretation to obtain empirical estimates, and provide a thorough analysis of its properties, including universal consistency. As natural by-products, we obtain the conditional analogues of the Maximum Mean Discrepancy and Hilbert-Schmidt Independence Criterion, and demonstrate their behaviour via simulations.

1 Introduction

The idea of embedding probability distributions into a reproducing kernel Hilbert space (RKHS), a space associated to a positive definite kernel, has received a lot of attention in the past decades [Berlinet and Thomas-Agnan, 2004, Smola et al., 2007], and has found a wealth of successful applications, such as independence testing [Gretton et al., 2008], two-sample testing [Gretton et al., 2012], learning on distributions [Muandet et al., 2012, Lopez-Paz et al., 2015, Szabó et al., 2016], goodness-of-fit testing [Chwialkowski et al., 2016, Liu et al., 2016] and probabilistic programming [Schölkopf et al., 2015, Simon-Gabriel et al., 2016], among others – see review by Muandet et al. [2017]. It extends the idea of kernelising linear methods by embedding data points into high- (and often infinite-)dimensional RKHSs, which has been applied, for example, in ridge regression, spectral clustering, support vector machines and principal component analysis among others [Scholkopf and Smola, 2001, Hofmann et al., 2008, Christmann and Steinwart, 2008].

Conditional distributions can also be embedded into RKHSs in a similar manner [Song et al., 2013],[Muandet et al., 2017, Chapter 4]. Compared to marginal distributions, conditional distributions can represent more complicated relations between several random variables, and therefore conditional mean embeddings (CMEs) have the potential to unlock the whole arsenal of kernel mean embeddings to a much wider setting. Indeed, conditional mean embeddings have been applied successfully to dynamical systems [Song et al., 2009], inference on graphical models via belief propagation [Song et al., 2010], probabilistic inference via kernel sum and product rules [Song et al., 2013], reinforcement learning [Grünewälder et al., 2012b, Nishiyama et al., 2012], kernelising the Bayes rule and applying it to nonparametric state-space models [Fukumizu et al., 2013] and causal inference [Mitrovic et al., 2018] to name a few.

Despite such progress, the current prevalent definition of the conditional mean embedding based on composing cross-covariance operators [Song et al., 2009] relies on some stringent assumptions, which are often violated and hinder its analysis. Klebanov et al. [2019] recently attempted to clarify and weaken some of these assumptions, but strong and hard-to-verify conditions still persist. Grünewälder et al. [2012a] provided a regression interpretation, but here, only the existence of the CME is shown, without an explicit expression. The main contribution of this paper is to provide a theoretically rigorous, operator-free definition of the CME that requires drastically weaker assumptions, and comes in an explicit expression. We believe this will enable a more principled analysis of its theoretical properties, and open doors to new application areas. We derive the empirical estimate based on vector-valued RKHS regression, and provide an in-depth analysis of its properties, including a universal consistency result of rate . In particular, we relax the assumption of Grünewälder et al. [2012a] to allow for infinite-dimensional RKHSs.

As natural by-products, we obtain quantities that are extensions of the Maximum Mean Discrepancy (MMD) and the Hilbert-Schmidt Independence Criterion (HSIC) to the conditional setting, which we call the Maximum Conditional Mean Discrepancy (MCMD) and the Hilbert-Schmidt Conditional Independence Criterion (HSCIC). We demonstrate their properties through simulation experiments.

All proofs can be found in Appendix C.

2 Preliminaries

We take as the underlying probability space. Let (X, X), (Y, Y) and (Z, Z) be separable measurable spaces, and let be random variables with distributions . We will use Z as the conditioning variable throughout.

2.1 Positive Definite Kernels and RKHS Embeddings

Let be a vector space of real-valued functions on X, endowed with the structure of a Hilbert space via an inner product . A symmetric function is a reproducing kernel of if and only if: 1. , ; 2. and . A Hilbert space of real-valued functions which possesses a reproducing kernel is called a reproducing kernel Hilbert space (RKHS) [Berlinet and Thomas-Agnan, 2004]. A symmetric function positive-definite function if its Gram matrix is positive definite. The Moore-Aronszajn Theorem [Aronszajn, 1950] shows that the set of positive-definite functions and the set of reproducing kernels on are in fact identical.

Assuming

. Note that the integrand is an element in a Hilbert space (and therefore a Banach space), so this integral is not a Lebesgue integral, but a Bochner integral [Dinculeanu, 2000, p.15, Definition 35]. The square-root integrability assumption ensures that is indeed Bochner-integrable. We will generalise the following lemma to

Next, suppose is an RKHS of functions on Y with kernel , and consider the tensor product RKHS (see Weidmann [1980, pp.47-48] for a definition of tensor product Hilbert spaces).

Theorem 2.2 (Berlinet and Thomas-Agnan [2004, p.31, Theorem 13]). The tensor product is generated by the functions , with and defined by . Moreover, is an RKHS of functions on with kernel

Now let us impose a slightly stronger integrability condition:

As a consequence, for any pair

There exists an isometric isomorphism , where HSis the space of HilbertSchmidt operators from to . The cross-covariance operator is defined as [Fukumizu et al., 2004]. It is straightforward to show that

The notion of characteristic kernels is essential, since it tells us that the associated RKHSs are rich enough to enable us to distinguish different distributions from their embeddings.

Definition 2.4 (Fukumizu et al. [2008]). A positive definite kernel is characteristic to a set P of probability measures defined on X if the map is injective.

Sriperumbudur et al. [2010] discuss various characterisations of characteristic kernels and show that the well-known Gaussian and Laplacian kernels are characteristic. We then have a metric on which goes by the name maximum mean discrepancy (MMD) [Gretton et al., 2007].

The HSIC is defined as the Hilbert-Schmidt norm of , or equivalently, [Gretton et al., 2005], i.e. the MMD between and . If is characteristic, then HSIC = 0 if and only if

2.2 Conditioning

In this subsection, we briefly review the concept of conditioning in the formal measure-theoretic probability theory, in the context of Banach space-valued random variables. We consider a sub-and a Banach space H.

Definition 2.5 (Conditional Expectation, Dinculeanu [2000, p.45, Definition 38]). Suppose H is a Bochner P-integrable, H-valued random variable. Then the conditional expectation of H given E is any E-measurable, Bochner P-integrable, H-valued random variable satisfying this condition is said to be a version of E[H | E]. We write E[H | Z] to mean , where is the sub--algebra of F generated by the random variable Z.

The (almost sure) uniqueness of the conditional expectation is shown in Dinculeanu [2000, p.44, Proposition 37], and the existence in Dinculeanu [2000, pp.45-46, Theorems 39 and 50]. Definition 2.6 (Çınlar [2011, p.149]). For each conditional probability of

Note that, in the unconditional case, the expectation is defined as the integral with respect to the measure, but in the conditional case, the expectation is defined first, and the measure is defined as the expectation of the indicator function.

For this definition of conditional probability to be useful, we require an additional property, called a “regular version”. We first define the transition probability kernel2.

Definition 2.7 (Çınlar [2011, p.37,40]). Let be measurable spaces. A mapping is a transition kernel from to if (i) is -measurable; (ii) , is a measure on is said to be a transition probability kernel.

Definition 2.8 (Çınlar [2011, p.150, Definition 2.4]). For each , let Q(A) be a version of . Then is said to be a regular version of the conditional probability measure if Q is a transition probability kernel from

The following theorem, proved in Appendix C, is the reason why a regular version is important. It means that, roughly speaking, the conditional expectation is indeed obtained by integration with respect to the conditional measure.

Theorem 2.9 (Adapted from Çınlar [2011, p.150, Proposition 2.5]). Suppose that admits a regular version Q. Then is a version of E[H | E] for every Bochner P-integrable H.

Next, we define the conditional distribution.

Definition 2.10 (Çınlar [2011, p.151]). Let X be a random variable taking values in a measurable space (X, X). Then the conditional distribution of X given E is any transition probability kernel from to (X, X) such that, for all

If has a regular version Q, then letting

for defines a version of the conditional distribution of X given . Unfortunately, a regular version of a conditional probability measure does not always exist. Also, it is not guaranteed that any version of the conditional distribution exists. The following definition and theorem tell us that, fortunately, these versions exist more often than not.

Definition 2.11 (Çınlar [2011, p.11]). A measurable space (E, E) is standard if it is isomorphic to is some Borel subset of is the Borel -algebra of F.

Theorem 2.12 (Çınlar [2011, p.151, Theorem 2.10]). If (X, X) is a standard measurable space, then there exists a version of the conditional distribution of X given E. In particular, if is a standard measurable space, then the conditional measure has a regular version.

Examples of standard measurable spaces include and with their respective Borel -algebras, complete separable metric spaces with their Borel -algebra, Polish spaces with their Borel -algebras, separable Banach spaces with their Borel -algebras and separable Hilbert spaces with their Borel

2.3 Vector-Valued RKHS Regression

In this subsection, we introduce the theory of vector-valued RKHS regression, based on operator-valued kernels. Let H be a Hilbert space, which will be the output space of regression.

Definition 2.13 (Carmeli et al. [2006, Definition 1]). An H-valued RKHS on Z is a Hilbert space G such that 1. the elements of G are functions there exists

Analogously to the scalar case, it can be shown that any H-valued RKHS G possesses a reproducing kernel, which is an H-kernel of positive type satisfying, for any and and . There is also an analogy of the Moore-Aronszajn Theorem:

Theorem 2.15 (Carmeli et al. [2006, Proposition 1]). Given an H-kernel of positive type a unique H-valued RKHS G on Z with reproducing kernel

Now suppose we want to perform regression with input space Z and output space H, by minimising the following regularised loss functional:

where is a regularisation parameter and . There is a corresponding representer theorem:

Theorem 2.16 (Micchelli and Pontil [2005, Theorem 4.1]). If minimises (4) in G, it is unique and has the form where the coefficients are the unique solution of the linear equations

3 Conditional Mean Embedding

We are now ready to introduce a formal definition of the conditional mean embedding of X given Z. Definition 3.1 (Conditional Mean Embedding (CME)). We define the conditional mean embedding of X given Z as

This is a direct extension of the marginal kernel mean embedding, , but instead of being a fixed element in -measurable random variable taking values in (see Definition 2.5). Also, for is a real-valued Z-measurable random variable. The following lemma is analogous to Lemma 2.1. Lemma 3.2. Suppose admits a regular version. Then for any

By Lemmas 3.2 and 3.3, for any pair

almost surely. Hence, we define the conditional cross-covariance operator as (see Section 2.1 for the definition of

3.1 Comparison with Existing Definitions

As previously mentioned, the idea of CMEs and conditional cross-covariance operators is not a novel one, yet our development of the theory and definitions above differ significantly from the existing works. In this subsection, we review the previous approaches and compare them to ours.

The prevalent definition of CMEs in the literature is the one given in the following definition. We first need to endow the conditioning space Z with a scalar kernel, say , with corresponding RKHS

Definition 3.4 (Song et al. [2009, Definition 3]). The conditional mean embedding of the conditional distribution P(X | Z) is the operator defined by , where and are unconditional (cross-)covariance operators as defined in Section 2.1.

As noted by Song et al. [2009], the motivation for this comes from Theorem 2 in the appendix of Fukumizu et al. [2004], which states that if , then for any . This relation can be used to prove the following theorem.

Theorem 3.5 (Song et al. [2009, Theorem 4]). Take any . Then assuming satisfies: 1.

Now we highlight the key differences between this approach and ours. Firstly, this approach requires the endowment of a kernel on the conditioning space Z, and subsequently defines the CME as an operator from to . By contrast, our definition did not consider any kernel or function on Z, and defined the CME as a Bochner conditional expectation given -measurable, -valued random variable. It seems more logical not to have to endow the conditioning space with a kernel, at least before the estimation stage. Secondly, the operator-based approach assumes that , as a function in z, lives in . This is a severe restriction; it is stated in Song et al. [2009] that this assumption, while true for finite domains with characteristic kernels, is not necessarily true for continuous domains, and Fukumizu et al. [2013] gives a simple counterexample using the Gaussian kernel. Lastly, it also assumes that exists, which is another severe restriction. Fukumizu et al. [2013] mentions that this assumption is too strong in many situations involving popular kernels, and gives a counterexample using the Gaussian kernel. The most common remedy is to resort to the regularised version and treat it as an approximation of These assumptions have been clarified and slightly weakened in Klebanov et al. [2019], but strong and hard-to-verify conditions persist.

In contrast, our definitions extend the notions of kernel mean embedding, expectation operator and cross-covariance operator to the conditional setting simply by using the formal definition of conditional expectations (Definition 2.5), and only rely on the mild assumption that the conditional probability measure admits a regular version.

Grünewälder et al. [2012a] gave a regression interpretation, by showing the existence, for each satisfies . However, the main drawback here is that there is no explicit expression for , limiting its analysis and use. In contrast, our definition has an explicit expression with which it is easy to explore potential applications.

In Fukumizu et al. [2004], the conditional cross-covariance operator is defined, but in a significantly different way. It is defined as is the right inverse of . This has the property that, for all and . Note that this is different to our relation stated after Lemma 3.3; the conditional covariance is integrated out over Z. In fact, this difference is explicitly noted by Song et al. [2009].

3.2 A Discrepancy Measure between Conditional Distributions

In this subsection, we propose a conditional analogue of the maximum mean discrepancy (MMD), and explore the role of characteristic kernels in the conditional case. Let be an additional random variable, satisfying

Definition 3.6. We define the maximum conditional mean discrepancy (MCMD) between

We note that MCMD is not a fixed value, but a real-valued, Z-measurable random variable.

The term MMD stems from the equality [Gretton et al., 2007, Sriperumbudur et al., 2010], where . The supremum is attained by the witness

function,

where we used Lemma 3.2. We define the conditional witness function as the -valued random variable

Casting aside measure-theoretic issues arising from conditioning on an event of measure 0, we can informally think of the realisation of the MCMD at each with as “the MMD between and ”, and as “the witness function between and ”. The following theorem says that, with characteristic kernels, the MCMD can indeed act as a discrepancy measure between conditional distributions.

Theorem 3.7. Suppose is a characteristic kernel, and assume that admits a regular version. Then MCMDalmost surely if and only if, almost surely,

The MCMD is reminiscent of the conditional maximum mean discrepancy of Ren et al. [2016], defined as the HilbertSchmidt norm of the operator (see Definition 3.4). However, due to strong assumptions previously discussed, often do not even exist, and/or do not have the desired properties of Theorem 3.5, so even at population level, is often not an exact measure of discrepancy between conditional distributions. On the other hand, Theorem 3.7 with the MCMD is an exact mathematical statement at population level that is valid between any pair of conditional distributions.

The discussion on characteristic kernels in the conditional setting, and the precise meaning of an “injective embedding” of conditional distributions, has largely been absent in the existing literature. We suspect that this is because the operator-based definition is somewhat cumbersome to work with, and it is not immediately clear how to express such statements. The new, mathematically elegant definition of the CME can remedy that through Theorem 3.7. We conjecture that characteristic kernels will play a crucial role in many future applications of the CME.

3.3 A Criterion of Conditional Independence

In this subsection, we introduce a novel criterion of conditional independence, via a direct analogy with the HSIC. Definition 3.8. We define the Hilbert-Schmidt Conditional Independence Criterion between X and Y given Z to be

Note that HSCIC(X, Y | Z) is an instance of the MCMD in the tensor product space , and is a (real-valued) random variable. Again, casting aside measure-theoretic issues arising from conditioning on an event of probability 0, we can conceptually think of the realisation of the HSCIC at each as “the HSIC between and ”. Since HSCIC is an instance of MCMD, the following theorem follows immediately from Theorem 3.7.

Theorem 3.9. Suppose is a characteristic kernel3 on , and that admits a regular version. Then HSCIC(X, Y | Z) = 0 a.s. if and only if

Concurrent and independent work by Sheng and Sriperumbudur [2019] proposes a similar criterion with the same nomenclature (HSCIC). However, they omit the discussion of CMEs entirely, and define the HSCIC as the usual HSIC between , without considerations for conditioning on an event of measure 0. Their focus is more on investigating connections to distance-based measures [Wang et al., 2015, Sejdinovic et al., 2013]. Fukumizu et al. [2008] propose , defined as the squared Hilbert-Schmidt norm of the normalised conditional cross-covariance operator , where and . As discussed, these operator-based definitions rely on a number of strong assumptions that will often mean that does not exist, or it does not satisfy the conditions for it to be used as an exact criterion even at population level. On the other hand, the HSCIC defined as in Definition 3.8 is an exact mathematical criterion of conditional independence at population level. Note that is a single-value criterion, whereas the HSCIC is a random criterion.

Figure 1: Diagram representing equation (5). The CME can be viewed in two ways: (i) as an -valued random variable, and (ii) as a deterministic function of the random variable Z.

4 Empirical Estimates

In this section, we discuss how we can obtain empirical estimates of

Theorem 4.1. Assume that is separable, and denote its Borel -algebra by . Then we can write

where is some deterministic function, measurable with respect to

Figure 1 depicts this relation. Note that the separability assumption on is not a restrictive one, since it is satisfied if, for example, is a continuous kernel on the separable space X [Christmann and Steinwart, 2008, p.130, Lemma 4.33]. Hence, the problem of estimating boils down to estimating the function , and this is exactly the setting for vector-valued regression discussed in Section 2.3, with input space Z and output space . In contrast to Grünewälder et al. [2012a], where regression is motivated by applying the Riesz representation theorem conditioned on each value of , we derive the CME as an explicit function of Z, which we argue is a more principled way to motivate regression. Moreover, for continuous Z, the event Z = z has measure 0 for each , so it is not measure-theoretically rigorous to apply the Riesz representation theorem conditioned on Z = z.

The natural optimisation problem is to minimise the loss

among all is a vector-valued RKHS of functions endowed with a kernel is a scalar kernel on Z.

We cannot minimise directly, since we do not observe samples from , but only the pairs . We bound this with a surrogate loss that has a sample-based version:

where we used generalised conditional Jensen’s inequality (see Appendix A, or Perlman [1974]). Section 4.1 discusses the meaning of this surrogate loss. We replace the surrogate population loss with a regularised empirical loss based on samples from the joint distribution

where is a regularisation parameter. We see that this loss functional has exactly the same form as in (4). Therefore, by Theorem 2.16, the minimiser has the form , where we wrote . By Theorem 2.16, the coefficients are the unique solutions of the linear equations and I is the identity matrix. Hence, the coefficients are , where we wrote . Finally, substituting this into the expression for

Note that this expression is identical to those in Song et al. [2009] and Grünewälder et al. [2012a].

4.1 Surrogate Loss, Universality and Consistency

There is no doubt that in (6) is a more natural loss functional than the surrogate loss . In this subsection, we investigate the meaning and consequences of using this surrogate loss, as well as the implications of using a universal kernel and the consistency properties of our algorithm.

Denote by the Banach space of (equivalence classes of) measurable functions is -integrable, with norm . We can note that the true function belongs to , because Theorem 4.1 tells us that is indeed measurable, and

by (1), where we used generalised conditional Jensen’s inequality again on the second line (Appendix A). The following theorem shows that the true function is the unique minimiser in

Theorem 4.2. minimises both and in . Moreover, it is almost surely equal to any other minimiser of the loss functionals.

Note the difference in the statement of Theorem 4.2 from Grünewälder et al. [2012a, Theorem 3.1], who only consider the minimisation of the loss functionals in , whereas we consider the larger space in which to minimise our loss functionals.

Next, we discuss the concepts of universal kernels and universal consistency.

Definition 4.3 (Carmeli et al. [2008, Definition 2]). An operator-valued reproducing kernel with associated RKHS is if is a subspace of , the space of continuous functions vanishing at infinity. The kernel is dense in for any measure

Recall that we are using the kernel . Carmeli et al. [2008, Example 14] show that is -universal if is a universal scalar kernel, which in turn is guaranteed if is Gaussian or Laplacian, for example [Steinwart, 2001].

The consistency result with optimal rate of in Grünewälder et al. [2012a] based on Caponnetto and De Vito [2006] imposes strong assumptions about the kernel , and finite-dimensional assumption on the output space . These are violated for many commonly used kernels such as the Gaussian kernel, and so we do not use this result in our paper (see Appendix B for more details). Fukumizu [2015] also shows consistency with rate slightly worse than with weaker assumptions. We prove the following universal consistency result, which relies on even weaker assumptions and achieves a better rate of

Theorem 4.4. Suppose that and are bounded kernels, i.e. for all for some and for all for some , and that the operator-valued kernel is -universal. Also, suppose that the regularisation parameter depends on the sample size n, and converges to zero at the rate of . Then our learning algorithm that produces is universally consistent (in the surrogate loss ), i.e. for any joint distribution and constants

for large enough n. The rate of convergence is

The boundedness assumption is satisfied with many commonly used kernels, such as the Gaussian and Laplacian kernels, and hence is not a restrictive condition. The key observation is that the target values are all of the form for , so the target space is bounded if is bounded (see Appendix B and the proof in Appendix C for details).

Theorem 4.4 is stated with respect to the surrogate loss, not the natural original loss . Let us now investigate is implications with respect to the original loss. Write

Figure 2: Behaviour of MCMD. We see that MCMDfor all Z. Near Z = 0, where the dependence on are similar, MCMD, whereas in regions away from 0, the dependence on Z of X and are different, and so MCMD. We also see that the conditional witness function between X and gives 0 at all values of X given any value of Z, whereas we have a saddle-like function between with non-zero functions in X in the regions of Z away from 0.

consequence of Theorem 4.4 is that

by (9), for any . This shows that, in the limit as is at most an arbitrarily small amount larger than with high probability. It remains to investigate what represents, and how large it is. The law of total expectation gives

Here, the inner conditional expectation

is the variance of (see Bharucha-Reid [1972, p.24] for the definition of the variance of Banach-space valued random variables), and by integrating over Z in the outer integral, represents the “mean variance” of over Z.

Suppose X is measurable with respect to has zero noise. Then we have and consequently, . In this case, we have universal consistency in both the surrogate loss and the original loss . On the other hand, will be large if information about Z tells us little about X, and subsequently In the extreme case where X and Z are independent, we have

which is precisely the variance of in . Hence, represents the irreducible loss of the true function due to noise in X, and the surrogate loss represents the loss functional taking noise into account, while the original loss measures the deviance from the true conditional expectation.

4.2 Empirical Estimates of MCMD and HSCIC

Recall that we defined the MCMD as (Definition 3.6). By Çınlar [2011, Theorem I.4.4], we can write MCMDfor some function . Using samples the joint distribution , we can obtain a plug-in estimate of using (8): . To evaluate this norm, we take the square of it:

Figure 3: Behaviour of the HSCIC. We see that HSCIC(left) and HSCIC(right) for all In particular, the dependence of and on X is greater than that of and , and is rep- resented by larger values of HSCICand HSCICcompared to HSCICand HSCIC

Similarly, we can write HSCICwe obtain a closed-form estimate of as follows:

where denotes elementwise multiplication of matrices.

4.3 Experiments

In this section, we carry out simulations to demonstrate the behaviour of the MCMD and HSCIC. In all simulations, we use the Gaussian kernel with hyperparameter regularisation parameter

In Figure 2, we simulate 500 samples from where is the (additive) noise variable. The first plot shows simulated data, the second MCMD values against Z, and the heatmaps show the (unnormalised) conditional witness function, whose norm gives the MCMD.

In Figure 3, on the left, we simulate 500 samples from the additive noise model, and , where is the (additive) noise variable. On the right, we simulate 500 samples from the multiplicative noise model, and are the (multiplicative) noise variables.

5 Conclusion

In this paper, we introduced a new definition of kernel conditional mean embedding, based on Bochner conditional expectation. It is mathematically elegant and measure-theoretically rigorous, and unlike the previous operator-based definition, does not rely on stringent assumptions that are often violated in common situations. Using this new approach for CMEs, we extended the notions of the MMD, witness function and HSIC to the conditional case. Finally, we discussed how to obtain empirical estimates via natural vector-valued regression, and established universal consistency with convergence rates of under mild and intuitive assumptions. We believe that our new definition has the potential to unlock the powerful arsenal of kernel mean embeddings to the conditional setting, in a more convenient and rigorous manner than the previous approaches.

Acknowledgements

We thank Mattes Mollenhauer at Freie Universität Berlin for pointing out the missing conditions on the regularization parameter of our initial universal consistency result, and for other fruitful discussions. We also thank an anonymous reviewer for pointing out typos and suggesting an improvement to the proof of Theorem 4.1. Finally, we thank Simon Buchholz, Alessandro Ialongo, Heiner Kremer and Jonas Kübler at MPI Tübingen for helpful feedback on initial drafts.

The idea behind this paper was conceived, and part of the work done, while JP was a Master’s student at the Seminar for Statistics, Department of Mathematics, ETH Zürich. JP is extremely grateful to his Master’s thesis supervisor, Professor Sara van de Geer, for readily accepting the proposed topic, and her expert guidance throughout the thesis.

References

N. Aronszajn. Theory of Reproducing Kernels. Transactions of the American mathematical society, 68(3):337–404, 1950.

A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, 2004.

A. T. Bharucha-Reid. Random Integral Equations. Academic Press, 1972.

A. Caponnetto and E. De Vito. Optimal Rates for the Regularized Least-Squares Algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2006.

C. Carmeli, E. De Vito, and A. Toigo. Vector Valued Reproducing Kernel Hilbert Spaces of Integrable Functions and Mercer Theorem. Analysis and Applications, 4(04):377–408, 2006.

C. Carmeli, E. De Vito, A. Toigo, and V. Umanitá. Vector Valued Reproducing Kernel Hilbert Spaces and Universality. Analysis and Applications, 8(01):19–61, 2008.

A. Christmann and I. Steinwart. Support Vector Machines. Springer, 2008.

K. Chwialkowski, H. Strathmann, and A. Gretton. A Kernel Test of Goodness of Fit. In Proceedings of the 33rd International Conference on International Conference on Machine Learning-Volume 48, pages 2606–2615, 2016.

E. Çınlar. Probability and Stochastics, volume 261. Springer Science & Business Media, 2011.

D. L. Cohn. Measure Theory. Birkhäuser, 2013.

N. Dinculeanu. Vector Integration and Stochastic Integration in Banach Spaces, volume 48. John Wiley & Sons, 2000.

R. M. Dudley. Real Analysis and Probability. CRC Press, 2018.

K. Fukumizu. Nonparametric Bayesian Inference with Kernel Mean Embedding. In Modern Methodology and Applications in Spatial-Temporal Modeling, pages 1–24. Springer, 2015.

K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel hilbert spaces. Journal of Machine Learning Research, 5(Jan):73–99, 2004.

K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel Measures of Conditional Dependence. In Advances in neural information processing systems, pages 489–496, 2008.

K. Fukumizu, L. Song, and A. Gretton. Kernel Bayes’ Rule: Bayesian Inference with Positive Definite Kernels. The Journal of Machine Learning Research, 14(1):3753–3783, 2013.

A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring Statistical Dependence with Hilbert-Schmidt Norms. In International conference on algorithmic learning theory, pages 63–77. Springer, 2005.

A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola. A Kernel Method for the Two-Sample-Problem. In Advances in neural information processing systems, pages 513–520, 2007.

A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. J. Smola. A Kernel Statistical Test of Independence. In Advances in neural information processing systems, pages 585–592, 2008.

A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A Kernel Two-Sample Test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.

S. Grünewälder, G. Lever, L. Baldassarre, S. Patterson, A. Gretton, and M. Pontil. Conditional Mean Embeddings as Regressors. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pages 1803–1810, 2012a.

S. Grünewälder, G. Lever, L. Baldassarre, M. Pontil, and A. Gretton. Modelling Transition Dynamics in MDPs with RKHS Embeddings. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pages 1603–1610. Omnipress, 2012b.

T. Hofmann, B. Schölkopf, and A. J. Smola. Kernel Methods in Machine Learning. The annals of statistics, pages 1171–1220, 2008.

H. Kadri, E. Duflos, P. Preux, S. Canu, A. Rakotomamonjy, and J. Audiffren. Operator-Valued Kernels for Learning from Functional Response Data. The Journal of Machine Learning Research, 17(1):613–666, 2016.

I. Klebanov, I. Schuster, and T. Sullivan. A Rigorous Theory of Conditional Mean Embeddings. arXiv preprint arXiv:1912.00671, 2019.

Q. Liu, J. Lee, and M. Jordan. A Kernelized Stein Discrepancy for Goodness-of-Fit Tests. In International conference on machine learning, pages 276–284, 2016.

D. Lopez-Paz, K. Muandet, B. Schölkopf, and I. Tolstikhin. Towards a Learning Theory of Cause-Effect Inference. In International Conference on Machine Learning, pages 1452–1461, 2015.

C. A. Micchelli and M. Pontil. On Learning Vector-Valued Functions. Neural computation, 17(1):177–204, 2005.

J. Mitrovic, D. Sejdinovic, and Y. W. Teh. Causal Inference via Kernel Deviance Measures. In Advances in Neural Information Processing Systems, pages 6986–6994, 2018.

K. Muandet, K. Fukumizu, F. Dinuzzo, and B. Schölkopf. Learning from Distributions via Support Measure Machines. In Advances in neural information processing systems, pages 10–18, 2012.

K. Muandet, K. Fukumizu, B. Sriperumbudur, B. Schölkopf, et al. Kernel Mean Embedding of Distributions: A Review and Beyond. Foundations and Trends Rin Machine Learning, 10(1-2):1–141, 2017.

Y. Nishiyama, A. Boularias, A. Gretton, and K. Fukumizu. Hilbert Space Embeddings of POMDPs. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, pages 644–653. AUAI Press, 2012.

M. D. Perlman. Jensen’s Inequality for a Convex Vector-Valued Function on an Infinite-Dimensional Space. Journal of Multivariate Analysis, 4(1):52–65, 1974.

Y. Ren, J. Zhu, J. Li, and Y. Luo. Conditional Generative Moment-Matching Networks. In Advances in Neural Information Processing Systems, pages 2928–2936, 2016.

B. Scholkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT press, 2001.

B. Schölkopf, K. Muandet, K. Fukumizu, S. Harmeling, and J. Peters. Computing Functions of Random Variables via Reproducing Kernel Hilbert Space Representations. Statistics and Computing, 25(4):755–766, 2015.

D. Sejdinovic, B. Sriperumbudur, A. Gretton, K. Fukumizu, et al. Equivalence of Distance-based and RKHS-based Statistics in Hypothesis Testing. The Annals of Statistics, 41(5):2263–2291, 2013.

T. Sheng and B. K. Sriperumbudur. On Distance and Kernel Measures of Conditional Independence. arXiv preprint arXiv:1912.01103, 2019.

C.-J. Simon-Gabriel, A. Scibior, I. O. Tolstikhin, and B. Schölkopf. Consistent Kernel Mean Estimation for Functions of Random Variables. In Advances in Neural Information Processing Systems, pages 1732–1740, 2016.

A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert Space Embedding for Distributions. In International Conference on Algorithmic Learning Theory, pages 13–31. Springer, 2007.

L. Song, J. Huang, A. Smola, and K. Fukumizu. Hilbert Space Embeddings of Conditional Distributions with Applications to Dynamical Systems. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 961–968, 2009.

L. Song, A. Gretton, and C. Guestrin. Nonparametric Tree Graphical Models via Kernel Embeddings. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 765–772, 2010.

L. Song, K. Fukumizu, and A. Gretton. Kernel Embeddings of Conditional Distributions: A Unified Kernel Framework for Nonparametric Inference in Graphical Models. IEEE Signal Processing Magazine, 30(4):98–111, 2013.

B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. Lanckriet. Hilbert Space Embeddings and Metrics on Probability Measures. Journal of Machine Learning Research, 11(Apr):1517–1561, 2010.

I. Steinwart. On the Influence of the Kernel on the Consistency of Support Vector Machines. Journal of machine learning research, 2(Nov):67–93, 2001.

Z. Szabó and B. K. Sriperumbudur. Characteristic and Universal Tensor Product Kernels. The Journal of Machine Learning Research, 18(1):8724–8752, 2018.

Z. Szabó, B. K. Sriperumbudur, B. Póczos, and A. Gretton. Learning Theory for Distribution Regression. The Journal of Machine Learning Research, 17(1):5272–5311, 2016.

X. Wang, W. Pan, W. Hu, Y. Tian, and H. Zhang. Conditional Distance Correlation. Journal of the American Statistical Association, 110(512):1726–1734, 2015.

J. Weidmann. Linear Operators in Hilbert Spaces. Springer Science & Business Media, 1980.

A Generalised Jensen’s Inequality

In Section 4, we require a version of Jensen’s inequality generalised to (possibly) infinite-dimensional vector spaces, because our random variable takes values in , and our convex function is . Note that this square norm function is indeed convex, since, for any and any pair

The following theorem generalises Jensen’s inequality to infinite-dimensional vector spaces. Theorem A.1 (Generalised Jensen’s Inequality, [Perlman, 1974], Theorem 3.10). Suppose T is a real Hausdorff locally convex (possibly infinite-dimensional) linear topological space, and let C be a closed convex subset of T . Suppose is a probability space, and a Pettis-integrable random variable such that . Let be a convex, lower semi-continuous extended-real-valued function such that exists. Then

We will actually apply generalised Jensen’s inequality with conditional expectations, so we need the following theorem. Theorem A.2 (Generalised Conditional Jensen’s Inequality)