The idea of embedding probability distributions into a reproducing kernel Hilbert space (RKHS), a space associated to a positive definite kernel, has received a lot of attention in the past decades [Berlinet and Thomas-Agnan, 2004, Smola et al., 2007], and has found a wealth of successful applications, such as independence testing [Gretton et al., 2008], two-sample testing [Gretton et al., 2012], learning on distributions [Muandet et al., 2012, Lopez-Paz et al., 2015, Szabó et al., 2016], goodness-of-fit testing [Chwialkowski et al., 2016, Liu et al., 2016] and probabilistic programming [Schölkopf et al., 2015, Simon-Gabriel et al., 2016], among others – see review by Muandet et al. [2017]. It extends the idea of kernelising linear methods by embedding data points into high- (and often infinite-)dimensional RKHSs, which has been applied, for example, in ridge regression, spectral clustering, support vector machines and principal component analysis among others [Scholkopf and Smola, 2001, Hofmann et al., 2008, Christmann and Steinwart, 2008].
Conditional distributions can also be embedded into RKHSs in a similar manner [Song et al., 2013],[Muandet et al., 2017, Chapter 4]. Compared to marginal distributions, conditional distributions can represent more complicated relations between several random variables, and therefore conditional mean embeddings (CMEs) have the potential to unlock the whole arsenal of kernel mean embeddings to a much wider setting. Indeed, conditional mean embeddings have been applied successfully to dynamical systems [Song et al., 2009], inference on graphical models via belief propagation [Song et al., 2010], probabilistic inference via kernel sum and product rules [Song et al., 2013], reinforcement learning [Grünewälder et al., 2012b, Nishiyama et al., 2012], kernelising the Bayes rule and applying it to nonparametric state-space models [Fukumizu et al., 2013] and causal inference [Mitrovic et al., 2018] to name a few.
Despite such progress, the current prevalent definition of the conditional mean embedding based on composing cross-covariance operators [Song et al., 2009] relies on some stringent assumptions, which are often violated and hinder its analysis. Klebanov et al. [2019] recently attempted to clarify and weaken some of these assumptions, but strong and hard-to-verify conditions still persist. Grünewälder et al. [2012a] provided a regression interpretation, but here, only the existence of the CME is shown, without an explicit expression. The main contribution of this paper is to provide a theoretically rigorous, operator-free definition of the CME that requires drastically weaker assumptions, and comes in an explicit expression. We believe this will enable a more principled analysis of its theoretical properties, and open doors to new application areas. We derive the empirical estimate based on vector-valued RKHS regression, and provide an in-depth analysis of its properties, including a universal consistency result of rate . In particular, we relax the assumption of Grünewälder et al. [2012a] to allow for infinite-dimensional RKHSs.
As natural by-products, we obtain quantities that are extensions of the Maximum Mean Discrepancy (MMD) and the Hilbert-Schmidt Independence Criterion (HSIC) to the conditional setting, which we call the Maximum Conditional Mean Discrepancy (MCMD) and the Hilbert-Schmidt Conditional Independence Criterion (HSCIC). We demonstrate their properties through simulation experiments.
All proofs can be found in Appendix C.
We take as the underlying probability space. Let (X, X), (Y, Y) and (Z, Z) be separable measurable spaces, and let
be random variables with distributions
. We will use Z as the conditioning variable throughout.
2.1 Positive Definite Kernels and RKHS Embeddings
Let be a vector space of real-valued functions on X, endowed with the structure of a Hilbert space via an inner product
. A symmetric function
is a reproducing kernel of
if and only if: 1.
,
; 2.
and
. A Hilbert space of real-valued functions which possesses a reproducing kernel is called a reproducing kernel Hilbert space (RKHS) [Berlinet and Thomas-Agnan, 2004]. A symmetric function
positive-definite function if its Gram matrix is positive definite. The Moore-Aronszajn Theorem [Aronszajn, 1950] shows that the set of positive-definite functions and the set of reproducing kernels on
are in fact identical.
Assuming
. Note that the integrand
is an element in a Hilbert space (and therefore a Banach space), so this integral is not a Lebesgue integral, but a Bochner integral [Dinculeanu, 2000, p.15, Definition 35]. The square-root integrability assumption ensures that
is indeed Bochner-integrable. We will generalise the following lemma to
Next, suppose is an RKHS of functions on Y with kernel
, and consider the tensor product RKHS
(see Weidmann [1980, pp.47-48] for a definition of tensor product Hilbert spaces).
Theorem 2.2 (Berlinet and Thomas-Agnan [2004, p.31, Theorem 13]). The tensor product is generated by the functions
, with
and
defined by
. Moreover,
is an RKHS of functions on
with kernel
Now let us impose a slightly stronger integrability condition:
As a consequence, for any pair
There exists an isometric isomorphism , where HS
is the space of HilbertSchmidt operators from
to
. The cross-covariance operator is defined as
[Fukumizu et al., 2004]. It is straightforward to show that
The notion of characteristic kernels is essential, since it tells us that the associated RKHSs are rich enough to enable us to distinguish different distributions from their embeddings.
Definition 2.4 (Fukumizu et al. [2008]). A positive definite kernel is characteristic to a set P of probability measures defined on X if the map
is injective.
Sriperumbudur et al. [2010] discuss various characterisations of characteristic kernels and show that the well-known Gaussian and Laplacian kernels are characteristic. We then have a metric on which goes by the name maximum mean discrepancy (MMD) [Gretton et al., 2007].
The HSIC is defined as the Hilbert-Schmidt norm of , or equivalently,
[Gretton et al., 2005], i.e. the MMD between
and
. If
is characteristic, then HSIC = 0 if and only if
2.2 Conditioning
In this subsection, we briefly review the concept of conditioning in the formal measure-theoretic probability theory, in the context of Banach space-valued random variables. We consider a sub-and a Banach space H.
Definition 2.5 (Conditional Expectation, Dinculeanu [2000, p.45, Definition 38]). Suppose H is a Bochner P-integrable, H-valued random variable. Then the conditional expectation of H given E is any E-measurable, Bochner P-integrable, H-valued random variable satisfying this condition is said to be a version of E[H | E]. We write E[H | Z] to mean
, where
is the sub-
-algebra of F generated by the random variable Z.
The (almost sure) uniqueness of the conditional expectation is shown in Dinculeanu [2000, p.44, Proposition 37], and the existence in Dinculeanu [2000, pp.45-46, Theorems 39 and 50]. Definition 2.6 (Çınlar [2011, p.149]). For each conditional probability of
Note that, in the unconditional case, the expectation is defined as the integral with respect to the measure, but in the conditional case, the expectation is defined first, and the measure is defined as the expectation of the indicator function.
For this definition of conditional probability to be useful, we require an additional property, called a “regular version”. We first define the transition probability kernel2.
Definition 2.7 (Çınlar [2011, p.37,40]). Let be measurable spaces. A mapping
is a transition kernel from
to
if (i)
is
-measurable; (ii)
,
is a measure on
is said to be a transition probability kernel.
Definition 2.8 (Çınlar [2011, p.150, Definition 2.4]). For each , let Q(A) be a version of
. Then
is said to be a regular version of the conditional probability measure
if Q is a transition probability kernel from
The following theorem, proved in Appendix C, is the reason why a regular version is important. It means that, roughly speaking, the conditional expectation is indeed obtained by integration with respect to the conditional measure.
Theorem 2.9 (Adapted from Çınlar [2011, p.150, Proposition 2.5]). Suppose that admits a regular version Q. Then
is a version of E[H | E] for every Bochner P-integrable H.
Next, we define the conditional distribution.
Definition 2.10 (Çınlar [2011, p.151]). Let X be a random variable taking values in a measurable space (X, X). Then the conditional distribution of X given E is any transition probability kernel from
to (X, X) such that, for all
If has a regular version Q, then letting
for defines a version
of the conditional distribution of X given
. Unfortunately, a regular version of a conditional probability measure does not always exist. Also, it is not guaranteed that any version of the conditional distribution exists. The following definition and theorem tell us that, fortunately, these versions exist more often than not.
Definition 2.11 (Çınlar [2011, p.11]). A measurable space (E, E) is standard if it is isomorphic to is some Borel subset of
is the Borel
-algebra of F.
Theorem 2.12 (Çınlar [2011, p.151, Theorem 2.10]). If (X, X) is a standard measurable space, then there exists a version of the conditional distribution of X given E. In particular, if is a standard measurable space, then the conditional measure
has a regular version.
Examples of standard measurable spaces include and
with their respective Borel
-algebras, complete separable metric spaces with their Borel
-algebra, Polish spaces with their Borel
-algebras, separable Banach spaces with their Borel
-algebras and separable Hilbert spaces with their Borel
2.3 Vector-Valued RKHS Regression
In this subsection, we introduce the theory of vector-valued RKHS regression, based on operator-valued kernels. Let H be a Hilbert space, which will be the output space of regression.
Definition 2.13 (Carmeli et al. [2006, Definition 1]). An H-valued RKHS on Z is a Hilbert space G such that 1. the elements of G are functions there exists
Analogously to the scalar case, it can be shown that any H-valued RKHS G possesses a reproducing kernel, which is an H-kernel of positive type satisfying, for any
and
and
. There is also an analogy of the Moore-Aronszajn Theorem:
Theorem 2.15 (Carmeli et al. [2006, Proposition 1]). Given an H-kernel of positive type a unique H-valued RKHS G on Z with reproducing kernel
Now suppose we want to perform regression with input space Z and output space H, by minimising the following regularised loss functional:
where is a regularisation parameter and
. There is a corresponding representer theorem:
Theorem 2.16 (Micchelli and Pontil [2005, Theorem 4.1]). If minimises (4) in G, it is unique and has the form
where the coefficients
are the unique solution of the linear equations
We are now ready to introduce a formal definition of the conditional mean embedding of X given Z. Definition 3.1 (Conditional Mean Embedding (CME)). We define the conditional mean embedding of X given Z as
This is a direct extension of the marginal kernel mean embedding, , but instead of being a fixed element in
-measurable random variable taking values in
(see Definition 2.5). Also, for
is a real-valued Z-measurable random variable. The following lemma is analogous to Lemma 2.1. Lemma 3.2. Suppose
admits a regular version. Then for any
By Lemmas 3.2 and 3.3, for any pair
almost surely. Hence, we define the conditional cross-covariance operator as (see Section 2.1 for the definition of
3.1 Comparison with Existing Definitions
As previously mentioned, the idea of CMEs and conditional cross-covariance operators is not a novel one, yet our development of the theory and definitions above differ significantly from the existing works. In this subsection, we review the previous approaches and compare them to ours.
The prevalent definition of CMEs in the literature is the one given in the following definition. We first need to endow the conditioning space Z with a scalar kernel, say , with corresponding RKHS
Definition 3.4 (Song et al. [2009, Definition 3]). The conditional mean embedding of the conditional distribution P(X | Z) is the operator defined by
, where
and
are unconditional (cross-)covariance operators as defined in Section 2.1.
As noted by Song et al. [2009], the motivation for this comes from Theorem 2 in the appendix of Fukumizu et al. [2004], which states that if , then for any
. This relation can be used to prove the following theorem.
Theorem 3.5 (Song et al. [2009, Theorem 4]). Take any . Then assuming
satisfies: 1.
Now we highlight the key differences between this approach and ours. Firstly, this approach requires the endowment of a kernel on the conditioning space Z, and subsequently defines the CME as an operator from
to
. By contrast, our definition did not consider any kernel or function on Z, and defined the CME as a Bochner conditional expectation given
-measurable,
-valued random variable. It seems more logical not to have to endow the conditioning space with a kernel, at least before the estimation stage. Secondly, the operator-based approach assumes that
, as a function in z, lives in
. This is a severe restriction; it is stated in Song et al. [2009] that this assumption, while true for finite domains with characteristic kernels, is not necessarily true for continuous domains, and Fukumizu et al. [2013] gives a simple counterexample using the Gaussian kernel. Lastly, it also assumes that
exists, which is another severe restriction. Fukumizu et al. [2013] mentions that this assumption is too strong in many situations involving popular kernels, and gives a counterexample using the Gaussian kernel. The most common remedy is to resort to the regularised version
and treat it as an approximation of
These assumptions have been clarified and slightly weakened in Klebanov et al. [2019], but strong and hard-to-verify conditions persist.
In contrast, our definitions extend the notions of kernel mean embedding, expectation operator and cross-covariance operator to the conditional setting simply by using the formal definition of conditional expectations (Definition 2.5), and only rely on the mild assumption that the conditional probability measure admits a regular version.
Grünewälder et al. [2012a] gave a regression interpretation, by showing the existence, for each satisfies
. However, the main drawback here is that there is no explicit expression for
, limiting its analysis and use. In contrast, our definition has an explicit expression
with which it is easy to explore potential applications.
In Fukumizu et al. [2004], the conditional cross-covariance operator is defined, but in a significantly different way. It is defined as is the right inverse of
. This has the property that, for all
and
. Note that this is different to our relation stated after Lemma 3.3; the conditional covariance is integrated out over Z. In fact, this difference is explicitly noted by Song et al. [2009].
3.2 A Discrepancy Measure between Conditional Distributions
In this subsection, we propose a conditional analogue of the maximum mean discrepancy (MMD), and explore the role of characteristic kernels in the conditional case. Let be an additional random variable, satisfying
Definition 3.6. We define the maximum conditional mean discrepancy (MCMD) between
We note that MCMD is not a fixed value, but a real-valued, Z-measurable random variable.
The term MMD stems from the equality [Gretton et al., 2007, Sriperumbudur et al., 2010], where
. The supremum is attained by the witness
function,
where we used Lemma 3.2. We define the conditional witness function as the -valued random variable
Casting aside measure-theoretic issues arising from conditioning on an event of measure 0, we can informally think of the realisation of the MCMD at each with
as “the MMD between
and
”, and
as “the witness function between
and
”. The following theorem says that, with characteristic kernels, the MCMD can indeed act as a discrepancy measure between conditional distributions.
Theorem 3.7. Suppose is a characteristic kernel, and assume that
admits a regular version. Then MCMD
almost surely if and only if, almost surely,
The MCMD is reminiscent of the conditional maximum mean discrepancy of Ren et al. [2016], defined as the HilbertSchmidt norm of the operator (see Definition 3.4). However, due to strong assumptions previously discussed,
often do not even exist, and/or do not have the desired properties of Theorem 3.5, so even at population level,
is often not an exact measure of discrepancy between conditional distributions. On the other hand, Theorem 3.7 with the MCMD is an exact mathematical statement at population level that is valid between any pair of conditional distributions.
The discussion on characteristic kernels in the conditional setting, and the precise meaning of an “injective embedding” of conditional distributions, has largely been absent in the existing literature. We suspect that this is because the operator-based definition is somewhat cumbersome to work with, and it is not immediately clear how to express such statements. The new, mathematically elegant definition of the CME can remedy that through Theorem 3.7. We conjecture that characteristic kernels will play a crucial role in many future applications of the CME.
3.3 A Criterion of Conditional Independence
In this subsection, we introduce a novel criterion of conditional independence, via a direct analogy with the HSIC. Definition 3.8. We define the Hilbert-Schmidt Conditional Independence Criterion between X and Y given Z to be
Note that HSCIC(X, Y | Z) is an instance of the MCMD in the tensor product space , and is a (real-valued) random variable. Again, casting aside measure-theoretic issues arising from conditioning on an event of probability 0, we can conceptually think of the realisation of the HSCIC at each
as “the HSIC between
and
”. Since HSCIC is an instance of MCMD, the following theorem follows immediately from Theorem 3.7.
Theorem 3.9. Suppose is a characteristic kernel3 on
, and that
admits a regular version. Then HSCIC(X, Y | Z) = 0 a.s. if and only if
Concurrent and independent work by Sheng and Sriperumbudur [2019] proposes a similar criterion with the same nomenclature (HSCIC). However, they omit the discussion of CMEs entirely, and define the HSCIC as the usual HSIC between , without considerations for conditioning on an event of measure 0. Their focus is more on investigating connections to distance-based measures [Wang et al., 2015, Sejdinovic et al., 2013]. Fukumizu et al. [2008] propose
, defined as the squared Hilbert-Schmidt norm of the normalised conditional cross-covariance operator
, where
and
. As discussed, these operator-based definitions rely on a number of strong assumptions that will often mean that
does not exist, or it does not satisfy the conditions for it to be used as an exact criterion even at population level. On the other hand, the HSCIC defined as in Definition 3.8 is an exact mathematical criterion of conditional independence at population level. Note that
is a single-value criterion, whereas the HSCIC is a random criterion.
Figure 1: Diagram representing equation (5). The CME can be viewed in two ways: (i) as an -valued random variable, and (ii) as a deterministic function of the random variable Z.
In this section, we discuss how we can obtain empirical estimates of
Theorem 4.1. Assume that is separable, and denote its Borel
-algebra by
. Then we can write
where is some deterministic function, measurable with respect to
Figure 1 depicts this relation. Note that the separability assumption on is not a restrictive one, since it is satisfied if, for example,
is a continuous kernel on the separable space X [Christmann and Steinwart, 2008, p.130, Lemma 4.33]. Hence, the problem of estimating
boils down to estimating the function
, and this is exactly the setting for vector-valued regression discussed in Section 2.3, with input space Z and output space
. In contrast to Grünewälder et al. [2012a], where regression is motivated by applying the Riesz representation theorem conditioned on each value of
, we derive the CME as an explicit function of Z, which we argue is a more principled way to motivate regression. Moreover, for continuous Z, the event Z = z has measure 0 for each
, so it is not measure-theoretically rigorous to apply the Riesz representation theorem conditioned on Z = z.
The natural optimisation problem is to minimise the loss
among all is a vector-valued RKHS of functions
endowed with a kernel
is a scalar kernel on Z.
We cannot minimise directly, since we do not observe samples from
, but only the pairs
. We bound this with a surrogate loss
that has a sample-based version:
where we used generalised conditional Jensen’s inequality (see Appendix A, or Perlman [1974]). Section 4.1 discusses the meaning of this surrogate loss. We replace the surrogate population loss with a regularised empirical loss based on samples from the joint distribution
where is a regularisation parameter. We see that this loss functional has exactly the same form as in (4). Therefore, by Theorem 2.16, the minimiser
has the form
, where we wrote
. By Theorem 2.16, the coefficients
are the unique solutions of the linear equations
and I is the
identity matrix. Hence, the coefficients are
, where we wrote
. Finally, substituting this into the expression for
Note that this expression is identical to those in Song et al. [2009] and Grünewälder et al. [2012a].
4.1 Surrogate Loss, Universality and Consistency
There is no doubt that in (6) is a more natural loss functional than the surrogate loss
. In this subsection, we investigate the meaning and consequences of using this surrogate loss, as well as the implications of using a universal kernel and the consistency properties of our algorithm.
Denote by the Banach space of (equivalence classes of) measurable functions
is
-integrable, with norm
. We can note that the true function
belongs to
, because Theorem 4.1 tells us that
is indeed measurable, and
by (1), where we used generalised conditional Jensen’s inequality again on the second line (Appendix A). The following theorem shows that the true function is the unique minimiser in
Theorem 4.2. minimises both
and
in
. Moreover, it is almost surely equal to any other minimiser of the loss functionals.
Note the difference in the statement of Theorem 4.2 from Grünewälder et al. [2012a, Theorem 3.1], who only consider the minimisation of the loss functionals in , whereas we consider the larger space
in which to minimise our loss functionals.
Next, we discuss the concepts of universal kernels and universal consistency.
Definition 4.3 (Carmeli et al. [2008, Definition 2]). An operator-valued reproducing kernel with associated RKHS
is
if
is a subspace of
, the space of continuous functions
vanishing at infinity. The kernel
is dense in
for any measure
Recall that we are using the kernel . Carmeli et al. [2008, Example 14] show that
is
-universal if
is a universal scalar kernel, which in turn is guaranteed if
is Gaussian or Laplacian, for example [Steinwart, 2001].
The consistency result with optimal rate of in Grünewälder et al. [2012a] based on Caponnetto and De Vito [2006] imposes strong assumptions about the kernel
, and finite-dimensional assumption on the output space
. These are violated for many commonly used kernels such as the Gaussian kernel, and so we do not use this result in our paper (see Appendix B for more details). Fukumizu [2015] also shows consistency with rate slightly worse than
with weaker assumptions. We prove the following universal consistency result, which relies on even weaker assumptions and achieves a better rate of
Theorem 4.4. Suppose that and
are bounded kernels, i.e. for all
for some
and for all
for some
, and that the operator-valued kernel
is
-universal. Also, suppose that the regularisation parameter
depends on the sample size n, and converges to zero at the rate of
. Then our learning algorithm that produces
is universally consistent (in the surrogate loss
), i.e. for any joint distribution
and constants
for large enough n. The rate of convergence is
The boundedness assumption is satisfied with many commonly used kernels, such as the Gaussian and Laplacian kernels, and hence is not a restrictive condition. The key observation is that the target values are all of the form for
, so the target space is bounded if
is bounded (see Appendix B and the proof in Appendix C for details).
Theorem 4.4 is stated with respect to the surrogate loss, not the natural original loss . Let us now investigate is implications with respect to the original loss. Write
Figure 2: Behaviour of MCMD. We see that MCMDfor all Z. Near Z = 0, where the dependence on
are similar, MCMD
, whereas in regions away from 0, the dependence on Z of X and
are different, and so MCMD
. We also see that the conditional witness function between X and
gives 0 at all values of X given any value of Z, whereas we have a saddle-like function between
with non-zero functions in X in the regions of Z away from 0.
consequence of Theorem 4.4 is that
by (9), for any . This shows that, in the limit as
is at most an arbitrarily small amount larger than
with high probability. It remains to investigate what
represents, and how large it is. The law of total expectation gives
Here, the inner conditional expectation
is the variance of (see Bharucha-Reid [1972, p.24] for the definition of the variance of Banach-space valued random variables), and by integrating over Z in the outer integral,
represents the “mean variance” of
over Z.
Suppose X is measurable with respect to has zero noise. Then we have
and consequently,
. In this case, we have universal consistency in both the surrogate loss
and the original loss
. On the other hand,
will be large if information about Z tells us little about X, and subsequently
In the extreme case where X and Z are independent, we have
which is precisely the variance of in
. Hence,
represents the irreducible loss of the true function due to noise in X, and the surrogate loss represents the loss functional taking noise into account, while the original loss measures the deviance from the true conditional expectation.
4.2 Empirical Estimates of MCMD and HSCIC
Recall that we defined the MCMD as (Definition 3.6). By Çınlar [2011, Theorem I.4.4], we can write MCMD
for some function
. Using samples
the joint distribution
, we can obtain a plug-in estimate of
using (8):
. To evaluate this norm, we take the square of it:
Figure 3: Behaviour of the HSCIC. We see that HSCIC(left) and HSCIC
(right) for all
In particular, the dependence of
and
on X is greater than that of
and
, and is rep- resented by larger values of HSCIC
and HSCIC
compared to HSCIC
and HSCIC
Similarly, we can write HSCICwe obtain a closed-form estimate of
as follows:
where denotes elementwise multiplication of matrices.
4.3 Experiments
In this section, we carry out simulations to demonstrate the behaviour of the MCMD and HSCIC. In all simulations, we use the Gaussian kernel with hyperparameter
regularisation parameter
In Figure 2, we simulate 500 samples from where
is the (additive) noise variable. The first plot shows simulated data, the second MCMD values against Z, and the heatmaps show the (unnormalised) conditional witness function, whose norm gives the MCMD.
In Figure 3, on the left, we simulate 500 samples from the additive noise model, and
, where
is the (additive) noise variable. On the right, we simulate 500 samples from the multiplicative noise model,
and
are the (multiplicative) noise variables.
In this paper, we introduced a new definition of kernel conditional mean embedding, based on Bochner conditional expectation. It is mathematically elegant and measure-theoretically rigorous, and unlike the previous operator-based definition, does not rely on stringent assumptions that are often violated in common situations. Using this new approach for CMEs, we extended the notions of the MMD, witness function and HSIC to the conditional case. Finally, we discussed how to obtain empirical estimates via natural vector-valued regression, and established universal consistency with convergence rates of under mild and intuitive assumptions. We believe that our new definition has the potential to unlock the powerful arsenal of kernel mean embeddings to the conditional setting, in a more convenient and rigorous manner than the previous approaches.
We thank Mattes Mollenhauer at Freie Universität Berlin for pointing out the missing conditions on the regularization parameter of our initial universal consistency result, and for other fruitful discussions. We also thank an anonymous reviewer for pointing out typos and suggesting an improvement to the proof of Theorem 4.1. Finally, we thank Simon Buchholz, Alessandro Ialongo, Heiner Kremer and Jonas Kübler at MPI Tübingen for helpful feedback on initial drafts.
The idea behind this paper was conceived, and part of the work done, while JP was a Master’s student at the Seminar for Statistics, Department of Mathematics, ETH Zürich. JP is extremely grateful to his Master’s thesis supervisor, Professor Sara van de Geer, for readily accepting the proposed topic, and her expert guidance throughout the thesis.
N. Aronszajn. Theory of Reproducing Kernels. Transactions of the American mathematical society, 68(3):337–404, 1950.
A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, 2004.
A. T. Bharucha-Reid. Random Integral Equations. Academic Press, 1972.
A. Caponnetto and E. De Vito. Optimal Rates for the Regularized Least-Squares Algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2006.
C. Carmeli, E. De Vito, and A. Toigo. Vector Valued Reproducing Kernel Hilbert Spaces of Integrable Functions and Mercer Theorem. Analysis and Applications, 4(04):377–408, 2006.
C. Carmeli, E. De Vito, A. Toigo, and V. Umanitá. Vector Valued Reproducing Kernel Hilbert Spaces and Universality. Analysis and Applications, 8(01):19–61, 2008.
A. Christmann and I. Steinwart. Support Vector Machines. Springer, 2008.
K. Chwialkowski, H. Strathmann, and A. Gretton. A Kernel Test of Goodness of Fit. In Proceedings of the 33rd International Conference on International Conference on Machine Learning-Volume 48, pages 2606–2615, 2016.
E. Çınlar. Probability and Stochastics, volume 261. Springer Science & Business Media, 2011.
D. L. Cohn. Measure Theory. Birkhäuser, 2013.
N. Dinculeanu. Vector Integration and Stochastic Integration in Banach Spaces, volume 48. John Wiley & Sons, 2000.
R. M. Dudley. Real Analysis and Probability. CRC Press, 2018.
K. Fukumizu. Nonparametric Bayesian Inference with Kernel Mean Embedding. In Modern Methodology and Applications in Spatial-Temporal Modeling, pages 1–24. Springer, 2015.
K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel hilbert spaces. Journal of Machine Learning Research, 5(Jan):73–99, 2004.
K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel Measures of Conditional Dependence. In Advances in neural information processing systems, pages 489–496, 2008.
K. Fukumizu, L. Song, and A. Gretton. Kernel Bayes’ Rule: Bayesian Inference with Positive Definite Kernels. The Journal of Machine Learning Research, 14(1):3753–3783, 2013.
A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring Statistical Dependence with Hilbert-Schmidt Norms. In International conference on algorithmic learning theory, pages 63–77. Springer, 2005.
A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola. A Kernel Method for the Two-Sample-Problem. In Advances in neural information processing systems, pages 513–520, 2007.
A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. J. Smola. A Kernel Statistical Test of Independence. In Advances in neural information processing systems, pages 585–592, 2008.
A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A Kernel Two-Sample Test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
S. Grünewälder, G. Lever, L. Baldassarre, S. Patterson, A. Gretton, and M. Pontil. Conditional Mean Embeddings as Regressors. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pages 1803–1810, 2012a.
S. Grünewälder, G. Lever, L. Baldassarre, M. Pontil, and A. Gretton. Modelling Transition Dynamics in MDPs with RKHS Embeddings. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pages 1603–1610. Omnipress, 2012b.
T. Hofmann, B. Schölkopf, and A. J. Smola. Kernel Methods in Machine Learning. The annals of statistics, pages 1171–1220, 2008.
H. Kadri, E. Duflos, P. Preux, S. Canu, A. Rakotomamonjy, and J. Audiffren. Operator-Valued Kernels for Learning from Functional Response Data. The Journal of Machine Learning Research, 17(1):613–666, 2016.
I. Klebanov, I. Schuster, and T. Sullivan. A Rigorous Theory of Conditional Mean Embeddings. arXiv preprint arXiv:1912.00671, 2019.
Q. Liu, J. Lee, and M. Jordan. A Kernelized Stein Discrepancy for Goodness-of-Fit Tests. In International conference on machine learning, pages 276–284, 2016.
D. Lopez-Paz, K. Muandet, B. Schölkopf, and I. Tolstikhin. Towards a Learning Theory of Cause-Effect Inference. In International Conference on Machine Learning, pages 1452–1461, 2015.
C. A. Micchelli and M. Pontil. On Learning Vector-Valued Functions. Neural computation, 17(1):177–204, 2005.
J. Mitrovic, D. Sejdinovic, and Y. W. Teh. Causal Inference via Kernel Deviance Measures. In Advances in Neural Information Processing Systems, pages 6986–6994, 2018.
K. Muandet, K. Fukumizu, F. Dinuzzo, and B. Schölkopf. Learning from Distributions via Support Measure Machines. In Advances in neural information processing systems, pages 10–18, 2012.
K. Muandet, K. Fukumizu, B. Sriperumbudur, B. Schölkopf, et al. Kernel Mean Embedding of Distributions: A Review and Beyond. Foundations and Trends Rin Machine Learning, 10(1-2):1–141, 2017.
Y. Nishiyama, A. Boularias, A. Gretton, and K. Fukumizu. Hilbert Space Embeddings of POMDPs. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, pages 644–653. AUAI Press, 2012.
M. D. Perlman. Jensen’s Inequality for a Convex Vector-Valued Function on an Infinite-Dimensional Space. Journal of Multivariate Analysis, 4(1):52–65, 1974.
Y. Ren, J. Zhu, J. Li, and Y. Luo. Conditional Generative Moment-Matching Networks. In Advances in Neural Information Processing Systems, pages 2928–2936, 2016.
B. Scholkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT press, 2001.
B. Schölkopf, K. Muandet, K. Fukumizu, S. Harmeling, and J. Peters. Computing Functions of Random Variables via Reproducing Kernel Hilbert Space Representations. Statistics and Computing, 25(4):755–766, 2015.
D. Sejdinovic, B. Sriperumbudur, A. Gretton, K. Fukumizu, et al. Equivalence of Distance-based and RKHS-based Statistics in Hypothesis Testing. The Annals of Statistics, 41(5):2263–2291, 2013.
T. Sheng and B. K. Sriperumbudur. On Distance and Kernel Measures of Conditional Independence. arXiv preprint arXiv:1912.01103, 2019.
C.-J. Simon-Gabriel, A. Scibior, I. O. Tolstikhin, and B. Schölkopf. Consistent Kernel Mean Estimation for Functions of Random Variables. In Advances in Neural Information Processing Systems, pages 1732–1740, 2016.
A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert Space Embedding for Distributions. In International Conference on Algorithmic Learning Theory, pages 13–31. Springer, 2007.
L. Song, J. Huang, A. Smola, and K. Fukumizu. Hilbert Space Embeddings of Conditional Distributions with Applications to Dynamical Systems. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 961–968, 2009.
L. Song, A. Gretton, and C. Guestrin. Nonparametric Tree Graphical Models via Kernel Embeddings. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 765–772, 2010.
L. Song, K. Fukumizu, and A. Gretton. Kernel Embeddings of Conditional Distributions: A Unified Kernel Framework for Nonparametric Inference in Graphical Models. IEEE Signal Processing Magazine, 30(4):98–111, 2013.
B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. Lanckriet. Hilbert Space Embeddings and Metrics on Probability Measures. Journal of Machine Learning Research, 11(Apr):1517–1561, 2010.
I. Steinwart. On the Influence of the Kernel on the Consistency of Support Vector Machines. Journal of machine learning research, 2(Nov):67–93, 2001.
Z. Szabó and B. K. Sriperumbudur. Characteristic and Universal Tensor Product Kernels. The Journal of Machine Learning Research, 18(1):8724–8752, 2018.
Z. Szabó, B. K. Sriperumbudur, B. Póczos, and A. Gretton. Learning Theory for Distribution Regression. The Journal of Machine Learning Research, 17(1):5272–5311, 2016.
X. Wang, W. Pan, W. Hu, Y. Tian, and H. Zhang. Conditional Distance Correlation. Journal of the American Statistical Association, 110(512):1726–1734, 2015.
J. Weidmann. Linear Operators in Hilbert Spaces. Springer Science & Business Media, 1980.
In Section 4, we require a version of Jensen’s inequality generalised to (possibly) infinite-dimensional vector spaces, because our random variable takes values in , and our convex function is
. Note that this square norm function is indeed convex, since, for any
and any pair
The following theorem generalises Jensen’s inequality to infinite-dimensional vector spaces. Theorem A.1 (Generalised Jensen’s Inequality, [Perlman, 1974], Theorem 3.10). Suppose T is a real Hausdorff locally convex (possibly infinite-dimensional) linear topological space, and let C be a closed convex subset of T . Suppose is a probability space, and
a Pettis-integrable random variable such that
. Let
be a convex, lower semi-continuous extended-real-valued function such that
exists. Then
We will actually apply generalised Jensen’s inequality with conditional expectations, so we need the following theorem. Theorem A.2 (Generalised Conditional Jensen’s Inequality). Suppose T is a real Hausdorff locally convex (possibly infinite-dimensional) linear topological space, and let C be a closed convex subset of T . Suppose is a probability space, and
a Pettis-integrable random variable such that
a convex, lower semi-continuous extended-real-valued function such that
exists. Suppose
of F. Then
Proof. Let be the dual space of all real-valued continuous linear functionals on T . The first part of the proof of Perlman [1974, Theorem 3.6] tells us that, for all
, we can write
where an affine function m on T is of the form for some
and
. If we define the subset
then we can rewrite f as
See that, for any
Here, (*) and (**) use the properties of conditional expectation of vector-valued random variables given in Dinculeanu [2000, pp.45-46, Properties 43 and 40 respectively].
We want to show that almost surely, and in order to so, we show that the right-hand side is a version of the left-hand side. The right-hand side is clearly E-measurable, since we have a linear operator on an E-measurable random variable. Moreover, for any
(here, all the equalities are almost-sure equalities). Hence, by the definition of the conditional expectation, we have that almost surely. Going back to our above work, this means that
Now take the supremum of the right-hand side over Q. Then (10) tells us that
as required.
In the context of Section 4, is real and Hausdorff, and locally convex (because it is a normed space). We take the closed convex subset to be the whole space
itself. The function
is convex (as shown above) and continuous, and finally, since Bochner-integrability implies Pettis integrability, all the conditions of Theorem A.2 are satisfied.
Caponnetto and De Vito [2006] give an optimal rate of convergence of vector-valued RKHS regression estimators, and its results are quoted by Grünewälder et al. [2012a] as the state of the art convergence rates, . In particular, this implies that the learning algorithm is consistent. However, the lower rate uses an assumption that the output space is a finite-dimensional Hilbert space [Caponnetto and De Vito, 2006, Theorem 2]; and in our case, this will mean that
is finite-dimensional. This is not true if, for example, we take
to be the Gaussian kernel; indeed, this is noted as a limitation by Grünewälder et al. [2012a], stating that “It is likely that this (finite-dimension) assumption can be weakened, but this requires a deeper analysis”. In this paper, we do not want to restrict our attention to finite-dimensional
. The upper bound would have been sufficient to guarantee consistency, but an assumption used in the upper bound requires the operator
to be Hilbert-Schmidt for all . However, for each
, taking any orthonormal basis
of
, we see that
meaning this assumption is not fulfilled with our choice of kernel either. Hence, results in Caponnetto and De Vito [2006], used by Grünewälder et al. [2012a], are not applicable to guarantee consistency in our context.
Kadri et al. [2016] address the problem of generalisability of function-valued learning algorithms. Let us write
for our training set of size n drawn i.i.d. from the distribution , and we denote by
which the data point
is removed. Further, we denote by
the estimate produced by our learning algorithm from the dataset D by minimising the loss
The assumptions used in this paper, with notations translated to our context, are
3. The map -admissible, i.e. convex with respect to F and Lipschitz continuous with respect to
as its Lipschitz constant.
4. There exists such that for all
and any training set D,
The concept of uniform stability, with notations translated to our context, is defined as follows. Definition B.1 (Uniform algorithmic stability, Kadri et al. [2016, Definition 6]). For each , define the function
A learning algorithm that calculates the estimate from a training set has uniform stability
with respect to the squared loss if the following holds: for all
and any training set D of size n,
The next two theorems are quoted from Kadri et al. [2016]. Theorem B.2 (Kadri et al. [2016, Theorem 7]). Under assumptions 1, 2 and 3, a learning algorithm that maps a training set D to the function -stable with
Theorem B.3 (Kadri et al. [2016, Theorem 8]). Let be a learning algorithm with uniform stability
, and assume Assumption 4 is satisfied. Then, for all
and any
, the following bound holds with probability at least
over the random draw of training samples:
Theorems B.2 and B.3 give us results about the generalisability of our learning algorithm. It remains to check whether the assumptions are satisfied.
Assumption 2 is satisfied thanks to our assumption that point embeddings are measurable functions, and Assumption 1 is satisfied if we assume that is a bounded kernel (i.e. there exists
such that
for all
In Kadri et al. [2016], a general loss function is used rather than the squared loss, and it is noted that Assumption 3 is in general not satisfied with the squared loss, which is what we use in our context. However, this issue can be addressed if we restrict the output space to a bounded subset. In fact, the only elements in that appear as the output samples in our case are
for
, so if we place the assumption that
is a bounded kernel (i.e. there exists
such that
So it is no problem, in our case, to place this boundedness assumption. Kadri et al. [2016, Appendix D] tells us that Assumption 1 with this boundedness assumption implies Assumption 4 with
while Kadri et al. [2016, Lemma 2] provides us with a condition which can replace Assumption 3 in Theorem B.2, giving us the uniform stability of our algorithm with
Then the result of Theorem B.3 holds with this new
Lemma 2.1. For each
Since the map is Bochner P-integrable,
is bounded, i.e.
. So by the Riesz Representation Theorem, there exists a unique
Choose
Lemma 2.3. For
Proof. For Bochner integrability, we see that
Hence, by Bochner integrability shown above, . So by the Riesz Representation Theorem, there exists
Choose
as required.
Before we prove Theorem 2.9, we state the following definition and theorems related to measurable functions for Banach-space valued functions.
Definition C.1 (Dinculeanu [2000, p.4, Definition 5]). A function is called an F-simple function if it has the form
A function is said to be F-measurable if there is a sequence
of H-valued, F-simple functions such that
pointwise.
Theorem C.2 (Dinculeanu [2000, p.4, Theorem 6]). If -measurable, then there is a sequence
H-valued, F-simple functions such that
pointwise and
Theorem C.3 (Dinculeanu [2000, p.19, Theorem 48], Lebesgue Convergence Theorem). Let be a sequence in
-measurable function, and
such that
-almost everywhere and
-almost everywhere, for each
Theorem 2.9. Suppose that admits a regular version Q. Then
with
is a version of E[H | E] for every Bochner P-integrable H.
Proof. Suppose H is Bochner P-integrable. Since Q is a regular version of , it is a probability transition kernel from
We first show that QH is measurable with respect to E. The map is well-defined, since, for each
,
is the Bochner-integral of H with respect to the measure
. Since H is F-measurable, by Theorem C.2, there is a sequence
of H-valued, F-simple functions such that
pointwise. Then for each
,
by Theorem C.3. But for each n, we can write
for some
and
, and so
. For each
the map
is E-measurable (by the definition of transition probability kernel, Definition 2.7), and so as a linear combination of E-measurable functions,
is E-measurable. Hence, as a pointwise limit of E-measurable functions, QH is also E-measurable, by [Dinculeanu, 2000, p.6, Theorem 10].
Next, we show that, for all . Fix
. By Theorem C.2, there is a sequence
of H-valued, F-simple functions such that
pointwise. For each n, we can write
for some
Hence, by the definition of the conditional expectation, QH is a version of E[H | E].
Lemma 3.2. Suppose that admits a regular version. Then for each
almost surely.
Proof. Write Q for a regular version of . Then in particular,
defined by
for any measurable set
is a version of the conditional distribution of X given Z. Then by Theorem 2.9, the event
holds has probability 1. Further, if we fix , then by Theorem 2.9 (or by directly applying Çınlar [2011, p.150, Theorem 2.5]), the event
Hence almost surely.
Lemma 3.3. Suppose that the conditional probability admits a regular version. Then for each pair
and
almost surely.
Proof. Write Q for a regular version of . Then in particular,
B) for any measurable set
is a version of the conditional distribution of (X, Y ) given Z. Then by Theorem 2.9, the event
holds has probability 1. Further, if we fix and
, then by Theorem 2.9 (or by directly applying Çınlar [2011, p.150, Theorem 2.5]), the event
holds has probability 1. Then , and fixing
Hence almost surely.
Theorem 3.7. Suppose is a characteristic kernel, and assume that
admits a regular version. Then MCMD
almost surely if and only if, almost surely,
Proof. Write Q for a regular version of , and assume without loss of generality that the conditional distributions
and
are given by
and
for
. By Theorem 2.9, there exists an event
such that for all
and an event such that for all
This means that, for each is the mean embedding of
, and for each
is the mean embedding of
( Suppose first that MCMD
almost surely, i.e. there exists
( Now assume there exists
with P(A) = 1 such that for each
for
Theorem 4.1. Assume that is separable, and denote its Borel
-algebra by
. Then we can write
where is some deterministic function, measurable with respect to
Proof. Let be the image of
, and let
denote the
-algebra on Im(Z) defined by
(see [Çınlar, 2011, page 5, 1.15]). We will first construct a function
, measurable with respect to
, such that
For a given , we have
. Suppose for contradiction that there are two distinct elements
is Hausdorff, there are disjoint open neighbourhoods
respectively. By definition of a Borel
-algebra, we have
and since
-measurable,
(ii) For any with
, we have
and
. So
and
Since (see [Çınlar, 2011], page 11, Exercise 2.20), we can’t have
. This is a contradiction to (11). We therefore conclude that, for any
for distinct
We define to be the unique value of
for all
. Then for any
by construction. It remains to check that
is measurable with respect to
Take any . Since
is
-measurable,
. Since
, we have
for some
. Since the mapping
is surjective,
is measurable with respect to
Finally, we can extend by Dudley [2018, page 128, Corollary 4.2.7] (note that
is a complete metric space, and assumed to be separable in this theorem).
Theorem 4.2. minimises both
Moreover, it is almost surely unique, i.e. it is almost surely equal to any other minimiser of the objective functionals.
So clearly, , meaning
minimises
in
. So it only remains to show that
is minimised in
Let F be any element in . Then we have
Here,
and similarly,
Substituting these expressions back into (12), we have
Hence, . The minimiser is further more
-almost surely unique; indeed, if
is another minimiser of
, then the calculation in (12) shows that
which immediately implies that -almost surely, which in turn implies that
-almost surely.
Theorem 4.4. Suppose that and
are bounded kernels, i.e. for all
for some
and for all
for some
, and that the operator-valued kernel
is
-universal. Also, suppose that the regularisation parameter
depends on the sample size n, and converges to zero at the rate of
. Then our learning algorithm that produces
is universally consistent (in the surrogate loss
), i.e. for any joint distribution
and constants
for large enough n. The rate of convergence is
Proof. Fix and
. Define
for
. Recall that
. By
-universality,
is dense in
, so there exists some
. Then note that, by the reverse triangle inequality,
This is not a probabilistic statement, so trivially,
Now we define the real-valued random variable
Then we have
where we recalled the definitions of and
from Section 4. Define
(Var
is bounded because
from Section 4.1 and
is bounded by assumption). Then by Chebyshev’s inequality, we have the following inequality for the unregularised loss:
Moreover, from Appendix B, we have
where
Hence, recalling that at the rate of
, we can see that
at the rate of
. By making n large enough, we can ensure that
and
Then
as required.