Generalized Sliced Distances for Probability Distributions

2020·Arxiv

Abstract

Abstract

Probability metrics have become an indispensable part of modern statistics and machine learning, and they play a quintessential role in various applications, including statistical hypothesis testing and generative modeling. However, in a practical setting, the convergence behavior of the algorithms built upon these distances have not been well established, except for a few specific cases. In this paper, we introduce a broad family of probability metrics, coined as Generalized Sliced Probability Metrics (GSPMs), that are deeply rooted in the generalized Radon transform. We first verify that GSPMs are metrics. Then, we identify a subset of GSPMs that are equivalent to maximum mean discrepancy (MMD) with novel positive definite kernels, which come with a unique geometric interpretation. Finally, by exploiting this connection, we consider GSPM-based gradient flows for generative modeling applications and show that under mild assumptions the gradient flow converges to the global optimum. We illustrate the utility of our approach on both real and synthetic problems.

1 Introduction

Measuring the discrepancy between probability distributions is at the heart of statistics and machine learning problems. A classic example in statistics is the hypothesis testing in higher dimensions, which has attracted a plethora of interest in recent years Gret- ton et al. (2012); Ramdas et al. (2017); Chwialkowski et al. (2015). Similarly, in generative modeling, leveraging probability metrics and discrepancy measures as an alternative to the adversarial networks, used in Generative Adversarial Networks (GANs), has become an exciting topic Dziugaite et al. (2015); Mo- hamed & Lakshminarayanan (2016); Li et al. (2017); Arjovsky et al. (2017). Notably, variations of the Wasserstein distances and the Maximum Mean Discrepancy (MMD) have enjoyed ample attention from the community and have incited many enthralling works in the literature.

There are specific challenges with measuring the discrepancy between two high-dimensional probability distributions, including the high computational cost (e.g., for p-Wasserstein distances), and growing sample complexity, i.e., in the sense of the dependence of convergence rate of a given metric between a measure and its empirical counterpart on the number of samples Genevay et al. (2019). The community has tackled these challenges from different angles in recent years. One of the thought-provoking approaches is via slicing high-dimensional distributions over their one-dimensional marginals and comparing their marginal distributions Kolouri et al. (2019b); Nadjahi et al. (2019b). The idea of slicing distributions is related to the Radon transform and has been successfully used in, for instance, sliced-Wasserstein distances in various applications Rabin et al. (2011); Kolouri et al. (2016); Carriere et al. (2017); Deshpande et al. (2018); Kolouri et al. (2018); Nadjahi et al. (2019a). More recently, Kolouri et al. (2019a) extended the idea of linear slices of distributions, used in sliced-Wasserstein distances, to non-linear slicing of high-dimensional distributions, which is rooted in the generalized Radon transform.

In this paper, we leverage the idea of slicing high-dimensional distributions and introduce a broad family of probability metrics named Generalized Sliced Probability Metrics (GSPMs). We provide a geometric interpretation of these metrics, and show their connection to the well-celebrated MMDs. GSPMs are built based on the idea of ’slicing’ high-dimensional distributions, or the pushforward measure of the high-dimensional input distributions for a real function. We emphasize that GSPMs are not a subclass of Integral Probability Measures (IPMs) M¨uller (1997); Dziugaite et al. (2015), however, they share many commonalities. We show that a subset of GSPMs is equivalent to MMDs, and leverage this connection to define geometrically interpretable kernels for MMDs that were not explored in the literature prior to this work.

Finally, following the work of Arbel et al. (2019), we identify some regularity conditions under which we show that the introduced kernels, which are rooted in GSPMs, satisfy the conditions for the global convergence of gradient flows. Hence, the proposed kernels are suitable for applications dealing with probability flows and implicit generative modeling.

2 Preliminaries

Let and be probability measures defined on a measurable space, X, with corresponding densities p and q. In addition, let (X, d) denote a metric space. Let F be a class of real-valued bounded measurable functions on X. Then the slice of a probability measure , with respect to , is the pushforward measure . We use the equivalent terminology that the slice of a d-dimensional probability density function 2), with respect to a function one-dimensional probability density function that is

defined as:

where is a one-dimensional Dirac function. Intuitively, is the distribution of f(x) (which is scalar) when xs are i.i.d samples from

2.1 Radon Transform

In Radon transform, we are interested in the question of whether one can recover the distribution p from its slices . In other words, when does the set preserve the information contained in p?

Classical Radon Transform: Denote by := = 1} the unit sphere in a d-dimensional Euclidean space. The classical Radon transform shows that when the function class is “linear”, i.e., the corresponding slices, Rp := contain all the required information to recover the distribution p. The previous statement implies that the classical Radon transform map is invertible, i.e. we have

for and , where ) is a one-dimensional high-pass filter with a Fourier transform ˆ) = , appearing as a result of the Fourier slice theorem. The geometric interpretation of this process is that ) integrates p along the hyperplane

Generalized Radon Transform: Classical Radon transform can be extended to the generalized Radon transform (GRT) to integrate p over hypersurfaces 1)-dimensional manifolds, ) = t}. The literature on GRT focuses on parametric functions with and Ω). These functions are so-called “defining functions”. To ensure that GRT is invertible, the following necessary conditions are identified Homan & Zhou (2017):

1. must be a real-valued function on to guarantee the smoothness of hyper-surfaces,

2. must be homogeneous of degree one in . The condition is required to guarantee a unique parametrization of hypersurfaces,

3. must be non-degenerate in the sense that = 0. The non-degenerate assumption ensures that the (1)-dimensional hypersurfaces do not collapse to points, and the integrals are well defined,

4. The mixed Hessian of must be strictly positive, i.e., 0 for , and . This condition is a local form of the Bolker’s condition (See Homan & Zhou (2017)), which allows one to locally identify () with the covector

The linear function class is one example of such family of “defining functions”. Invertibility of GRTs is a long standing research problem. We provide below a number of well-studied classes of “defining functions”, that ensure invertibility of GRTs.

In Kuchment (2006), it is shown that the circular defining function, ) = with and provides an injective GRT. Homogeneous polynomials with an odd degree also define an injective GRT Ehrenpreis (2003), ) = , where we use the multi-index nota- tion = (= , and . The summation here iterates over all possible multi-indices , such that represents the polynomial degree and rameter set for homogeneous polynomials is then set to be . One can see that the choice of m = 1 recovers the linear case , in that the set of the multi-indices with = 1 becomes = 1 for a single = 0and includes d elements. We note that GRT was also the basis for the recently proposed generalized slicedWasserstein distances Kolouri et al. (2019a).

3 Generalized Sliced Probabil-

In this section, we show that any probability metric between one-dimensional probability measures can be extended to higher-dimensions via the concept of generalized slicing. Let ) be a metric for one-dimensional probability measures. Then, for probability measures and defined on with respective densities p and q, the proposed GSPM is defined as follows:

where 1. Let us first show that GSPM is a metric. Non-negativity and symmetry immediately follow from non-negativity and symmetry of ), while triangle inequality follows from the Minkowski inequality:

Finally, the identity of indiscernibles states that, ) = 0 if and only if (iff) p = q. The forward proof is straightforward: p = q results in and since is a metric ) = 0 for . If for , we can conclude that p = q iff the GRT is injective. Hence, if GRT is injective then GSPMs provide a metric. Otherwise, GSPMs are pseudo-metrics.

3.1 Max-GSPM

Equation (4) is based on the expected value of ), when where is the uniform distribution on Ω. Here we show that the max version of GSPMs are also metrics. Substituting the expected value with supremum, leads to a metric defined as:

Verifying the metric properties for Eq. (5) is trivial, given the properties of (see the supplementary material). Note that the recently proposed distances like Sliced Wasserstein (SW) distances and max-SW distances are a special case of GSPMs and Max-GSPMs.

4 GSPMs and MMDs

The seminal work by Gretton et al. (2007, 2012) on maximum mean discrepancy (MMD) provides a framework for efficient comparison of probability distributions. MMD is an integral probability metric Sejdi- novic et al. (2013), and has become a popular choice of comparison between distributions in a wide variety of applications, e.g., generative modeling Li et al. (2017); Tolstikhin et al. (2018), and gradient flows Arbel et al. (2019). In practice, MMD is defined with respect to a Reproducing Kernel Hilbert Space (RKHS), with a unique kernel. Like other kernel methods, the choice of kernel is often an application-dependent choice. In what follows, we show that an interesting family of GSPMs could be related to MMDs. Notably, we combine generalized slices together with a specific family of distances, which both have clear geometric interpretations, and obtain MMDs with well-defined kernels.

Consider Equation (4) for the special case of ) = and r = 2, where A is a positive(-definite) linear operator. The positive assumption enforces to be a norm (i.e., the weighted Euclidean norm). If A is positive semi-definite, then would become a pseudo-metric, and as a consequence also becomes a pseudo-metric. Given a linear operator, A, we can write:

We focus on practical settings where we only observe samples and from these distributions. Substituting the empirical distribution in Equation (1) give us the empirical slices as ˆ) = )) and ˆ)). Using a common trick- of-trade in statistics, and without the loss of generality, we consider a smoothened version of the empirical slices via a identifies the radius of the RBF (that using is equivalent to assuming smoothness priors on the slices.

By plugging in the (smoothened) empirical sliced distributions into (6), we obtain:

Equation (7) is also the squared MMD with the particular kernel shown there-in. Note that one can use the Monte-Carlo integral approximation to obtain an algorithmic way of calculating the kernel for any feasible We now argue that these family of kernels are positive definite (PD). Indeed,

is a dot-product kernel, which is by definition PD, and summation/integration of PD kernels results in a PD kernel. Therefore,

is a PD kernel. Below, we study some special interesting cases of the GSPMs based on ) = , and their equivalent MMD form based on kernels.

4.1 First example: A = id(·)

When ), the GSPM is a generalized-sliced distance between the two distributions. This subsection shows that the work of Knop et al. (2018) follows this setting. In addition, we demonstrate that while such generalized sliced distance might not be as interesting for ) = , from a geometric point of view, it becomes appealing for more complex family of slices (e.g., homogeneous polynomials).

Assuming the RBF is a Gaussian, ) = ) and using the inner product between two Gaussians, one can show that the dot-product kernel in Eq. (8) boils down to:

The geometric interpretation of Equation (10) is quite interesting. First note that therefore, the pre-image of a scalar in the range of is a hyper-surface in X. This means that all points living on a hyper-surface would be projected to the same scalar in the range of (i.e., iso-hyper-surface). Therefore, while and could be far away from one another (in a Euclidean sense), as long as they live on the same or nearby iso-hyper-surfaces they will considered to be similar (with respect to ). Figure 1 demonstrates this effect and shows different s, from family of linear functions parameterized by on a unit sphere (a), and family of polynomials of degree 5 (b), for which samples are considered near-by/far-away.

As a special case, Knop et al. (2018) used linear slices (i.e., ) and showed that when the Gaussian function, then Equation (9) has a closed form:

where ) is a Kummers confluent hypergeometric function Barnard et al. (1998) and can be approximated as:

The above kernel also holds when A is the Fourier transform, which is due to the fact that the Fourier

Figure 1: Visualization of two points and (red points), and the iso-hyper-surfaces (in 2D iso-curves) for sample , for F = {f|f(x) =

(b). The green color indicates when and the red color indicates when it is low.

transform is a unitary linear operator, i.e. satis-fies . However, note that the Fourier transform of a PDF is the characteristic function. Therefore, one would be considering L2-norm squared of the characteristics functions of the slices.

Recall that in these derivations, we started by fixing a slicing operation (linear slices), and used a specific distance, i.e. distance, and that we know the geometric meaning of both of these steps and their implications. Then, we ended up with a novel PD kernel that defines a MMD, which inherits these geometric properties. Here we emphasize that the distance used here (and in Knop et al. (2018)) is a Sliced-. In the next section, we study the specific case of the Generalized-Sliced-Cram´er distance.

4.2 Second example: A is the cumulative integral operator

Now, we choose A as the cumulative integral operator:

Note that such A is a positive definite operator. In this setting, the distance is the 2-Cram´er distance Cram´er (1928) between the two one-dimensional probability distributions, , which is recently used in various publications Bellemare et al. (2017); Kolouri et al. (2020). The Cram´er distance shares some common characteristics to those of the Wasserstein distances. In fact, the 1-Cram´er distance and the 1-Wasserstein distance are equivalent. It is straightforward to show that

is unbounded. Note that is the CDF of an RBF, and therefore its integral is unbounded. However, assuming that the integral domain is [find closed form solutions for ). For instance, for ) = ) we have that is a step function and,

The boundedness assumption enforces us to use kernels with a bounded range (hence, Gaussian kernels won’t be allowed in this setting). Our experiments indicate that smoothstep functions, often used in computer graphics, are well-suited candidates for . The n’th order smoothstep function is defined as:

0 1

We include the derivations of ), with the smoothstep functions, in the supplementary material.

4.3 Third example: A is a generic integral transform

Integral transforms provide a broad family of linear operators, which could be used in Equation (6) to define novel distances/pseudo-distances (depending on the invertibility of the transform). The integral transform of a function, , is a generic linear

transform defined as:

where ) is the integral kernel or the nucleus of the transform. In this work, we suffice to mention this family of linear operators as an interesting class of operators for further studies.

5 GSPM Gradient Flows

Gradient flows have become increasingly popular in implicit generative modeling Liutkus et al. (2019); Ar- bel et al. (2019); Kolouri et al. (2019a), where the aim is to minimize a functional in the Wasserstein space (i.e., the space of probability measures with bounded second-order moments, metrized by the Wasserstein-2 metric), given as follows:

In this section, we will exploit the connections that we developed between GSPMs and MMD (as detailed in Section 4) and develop a globally convergent algorithm for solving problems of the form of (13) by building up on the recent theoretical results given in Arbel et al. (2019).

We now present the GSPM-flows, that aim at generating a path of measures (which minimizes the squared GSPM between an initial measure and a target measure q as t goes to infinity. In particular, we will consider the gradient flow, informally expressed as follows:

where denotes a notion of a gradient in the Wasserstein space Ambrosio et al. (2008). Such gradient flows are of particular interest for generative modeling, since if the solution paths of the flow can be shown to converge to the global optimum one can approximately simulate the gradient flow in order to solve the minimization problem and estimate

Under appropriate conditions Ambrosio et al. (2008), a path (is a solution of (14) if and only if it solves a continuity equation of the form:

v(x, p) =

where k is defined in (9).

The partial differential equation representation (15) has important practical implications, since such PDEs are often associated with a McKean-Vlasov (MV) process Bogachev et al. (2015), which can be used for developing practical algorithms. In particular, associated to the continuity equation, we can define a MV process (as a solution to the following differential equation:

where denotes the state of the process at time t. Here, evolves through the drift function v, which requires the knowledge of , i.e., the density function of . The interest in this process is that the probability density functions of (solve the continuity equation, hence, solving the optimization problem (13) reduces to simulating (16).

Unfortunately, exact simulation of (16) is often intractable due to (i) the process is continuous-time, it needs to be discretized, (ii) the drift depends on the density , which is not available in general. We will focus on the discretization of the process first, then we will develop a particle-based approach to alleviate the second problem.

In order to discretize (16), we consider the noisy Euler-Maruyama scheme, proposed in Arbel et al. (2019), given as follows:

where 0 is a step-size, n = 0, 1, 2, . . . denotes the iterations, denotes the density of 0 denotes an inverse temperature variable, and is a standard Gaussian variable. If = 0 for all n, this scheme reduces to the standard Euler-Maruyama discretization, whereas a positive would drive the scheme to explore the space in a more efficient way.

As one of our main contributions, we will now iden- tify sufficient regularity conditions on the defining function and the smoothing function , which will be required for the convergence analysis of the gradient flow and its discretization (17).

Condition 1. A is a linear, bounded, positive semi-definite operator with the corresponding operator norm

Condition 2. There exists a constant , such that (for any

for all

Condition 3. There exists a constant , such that the following inequalities hold:

We now present our main result.

Theorem 1. Let be a distribution with finite second-order moment. Then, under Conditions 1,2,3, there exists a unique (solving (16) such that the density functions of (constitute the unique solution of (15).

Furthermore, let (be the iterates obtained by (17). If , then the follow- ing bound holds:

where denotes the density of

The proof is given in the supplement. This result shows that, with sufficiently regular and , the noisy Euler scheme (17) can achieve the global optimum, where the convergence rate depends on the structure of

Even though Theorem 1 hints the potential of the proposed gradient flow, the discretization scheme (17) is unfortunately still intractable due to the dependency of v on . In order to obtain a practical algorithm, we finally consider a particle system that serves as an approximation to the original system (17), and given as follows:

where i = 1, . . . , N denotes the particle index and ˆ= denotes the empirical distribution of . Here, the idea is to approximate by ˆby evolving N different particles at the same time. Similar schemes have proved successful in generative modeling Liutkus et al. (2019) and Bayesian machine learning Liu & Wang (2016). Moreover, one can further show that the particle system converges to the true system (17) with a rate of Durmus et al. (2018); Arbel et al. (2019).

6 Numerical Experiments

In this section, we present our experimental results that illustrate our framework.

6.1 Gradient flow – Synthetic

We first perform a numerical experiment with synthetic datasets to demonstrate the performance of the proposed GSPM-MMD kernels. To simplify the presentation, in our first experiment, we assumed the noise = 0 for all n (i.e., the standard EulerMaruyama discretization). We consider three twodimensional target distributions, namely the Swiss Roll, the 8-Gaussians, and the 25-Gaussians distributions. The source distribution is initialized with N samples from a Gaussian distribution. Figure 2 shows the datasets and the flow (calculated using GSPMMMD). We calculate the gradient flow updates (See Equation (20)) to match the source and the target distributions.

For our method, we used the GSPM-MMD kernel with is the cumulative integral operator (i.e., the 2-Cram´er distance). For simplicity, we used linear slices . Also, as a standard

Figure 2: Gradient flow on the Swiss Roll distribu- tion (a), the 8-Gaussians distribution (b), and the 25-Gaussians distribution (c), using the proposed GSPMMMD kernels with ). The source distribution consists of N = 50 particles.

baseline for comparison, we apply the Gaussian kernel and minimize the MMD flow. In each iteration of the gradient flow, we measure the 2-Wasserstein distance between the updated source and the target distribution. For each method we vary and repeat the experiments 10 times. Figure 3 compares the algorithms on the three datasets and for various . For the cumulative integral operator we used L = 10 slices.

Effect of noise: The addition of noise lessens the effect of a poor choice of by allowing the particles to explore the space in a more efficient manner. To demonstrate the effect of the addition of noise to the updates (See Equation (20)) we repeated the experiment in Figure 2 for the Swiss Roll dataset, but with a poor choice of . From Figure 2, one can see that 001 is too small for calculating an effective flow. Hence, we chose 001 and solved a noisy gradient flow problem with the MMD-RBF kernel (baseline) and the GSPM-MMD with ) kernel. We selected the initial [1, 0.1, 0.01, 0] and decayed the noise in each gradient iteration with a 1/k rate (k being the iteration). The log 2-Wasserstein between the

Figure 3: The convergence curves of MMD flows for MMD-RBF, GSPM-MMD with identity operator, and GSPM-MMD with the cumulative integral operator on the three synthetic datasets shown in Figure 2.

source and target distributions is depicted in Figure 4. As expected, addition of noise improves the overall performance of gradient flows.

Linear or non-linear slices: So far, in our experiments, we have only used linear slices, i.e., . Here, we compare GSPM-MMD flows solely based on the choice of linear and non-linear slices. For the non-linear slices, in this experiment, we use the homogeneous polynomials of degree 5 (see Figure 1). To ensure a fair comparison, we chose the number of random slices for both GSPM-MMD kernels to L = 1. Figure 5 shows the comparison between linear and polynomial slices.

6.2 Gradient flow – MNIST

To show the effectiveness of the proposed distances in higher dimensions, we designed the following experiment. We first learn a simple convolutional autoencoder, with an added classifier on its bottleneck to ensure a discriminative space embedding, to embed the MNIST dataset into a (d = 16)-dimensional space. Then we solve the gradient flow problem in the embedded space with N = 100 particles initialized from a Gaussian distribution.

Similar to the previous experiments, we use MMDRBF, GSPM-MMD with ) (denoted as GSPM-MMD 1), and GSPM-MMD with A being the cumulative integral (denoted as GSPM-MMD 2) and calculate the flow between the source and target distributions. We measure the 2-Wasserstein distance between the distributions at each iteration. The experiments were repeated 10 times and the average performance for each method is reported in Figure 6 (top row). After the convergence, we sort the particles according to the output of the classifier and feed them to the decoder network to visualize the corresponding digits for each method (See the bottom row in Figure 6). We note that same was used for all three methods, and linear slicing was used in this experiment. We conclude that the GSPM-MMD with the cumulative integral operator, which corresponds to the sliced-Cram´er distance, seems to achieve a su-

Figure 4: The effect of the addition of Gaussian noise (See Equation (20)) in calculating the MMD flows using the GSPM-MMD kernel with the RBF kernel (b). The results are averaged over 10 runs and are calculated on the Swiss Roll dataset.

perior performance in comparison with the other two kernels.

7 Conclusion

We introduced a new family of distances, denoted as Generalized Sliced Probability Metrics (GSPMs), which calculate the expected distances between slices (i.e., one-dimensional marginals) of two input distributions. We then showed that a subset of the proposed distances is equivalent to the squared maximum mean discrepancy (MMD) with new kernels introduced in this work, denoted as GSPM-MMD kernels. Furthermore, we applied the GSPM-MMD kernels in the domain of gradient flows for implicit generative modeling, which has recently attracted ample attention from the research community. More importantly,

Figure 5: Gradient flows using GSPM-MMDs with linear and Polynomial slices. The experiment was calculated based on one random slice (linear or nonlinear) per iteration.

we identified sufficient regularity conditions on the building elements of our proposed distance (and consequently the proposed kernels) for guaranteeing global convergence of the gradient flow. Finally, we provide extensive ablation experiments to test our proposed distance on synthetic and real datasets.

References

Ambrosio, L., Gigli, N., and Savar´e, G. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008.

Arbel, M., Korba, A., Salim, A., and Gretton, A. Maximum mean discrepancy gradient flow. In Advances in Neural Information Processing Systems, 2019.

Arjovsky, M., Chintala, S., and Bottou, L. Wasser- stein generative adversarial networks. In International Conference on Machine Learning, pp. 214– 223, 2017.

Barnard, R., Dahlquist, G., Pearce, K., Reichel, L., and Richards, K. Gram polynomials and the kummer function. Journal of approximation theory, 94 (1):128–143, 1998.

Bellemare, M. G., Danihelka, I., Dabney, W., Mo- hamed, S., Lakshminarayanan, B., Hoyer, S., and

Figure 6: Comparison of the proposed kernels on the MNIST dataset. The top row shows the 2-Wasserstein distance between the source and target distributions, while the bottom row visualizes the actual target particles, N = 100.

Munos, R. The cramer distance as a solution to biased wasserstein gradients. arXiv preprint arXiv:1705.10743, 2017.

Bogachev, V. I., Krylov, N. V., R¨ockner, M., and Shaposhnikov, S. V. Fokker-Planck-Kolmogorov Equations, volume 207. American Mathematical Soc., 2015.

Carriere, M., Cuturi, M., and Oudot, S. Sliced wasser- stein kernel for persistence diagrams. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 664–673. JMLR. org, 2017.

Chwialkowski, K. P., Ramdas, A., Sejdinovic, D., and Gretton, A. Fast two-sample testing with analytic representations of probability measures. In Advances in Neural Information Processing Systems, pp. 1981–1989, 2015.

Cram´er, H. On the composition of elementary errors:

First paper: Mathematical deductions. Scandinavian Actuarial Journal, 1928(1):13–74, 1928.

Deshpande, I., Zhang, Z., and Schwing, A. G. Generative modeling using the sliced wasserstein distance. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3483–3491, 2018.

Durmus, A., Eberle, A., Guillin, A., and Zimmer, R. An elementary approach to uniform in time propagation of chaos. arXiv preprint arXiv:1805.11387, 2018.

Dziugaite, G. K., Roy, D. M., and Ghahramani, Z. Training generative neural networks via maximum mean discrepancy optimization. In Proceedings of the Thirty-First Conference on Uncertainty in Ar-tificial Intelligence, pp. 258–267, 2015.

Ehrenpreis, L. The universality of the Radon transform. Oxford University Press on Demand, 2003.

Genevay, A., Chizat, L., Bach, F., Cuturi, M., and Peyr´e, G. Sample complexity of sinkhorn divergences. In

Gretton, A., Borgwardt, K., Rasch, M., Sch¨olkopf, B., and Smola, A. J. A kernel method for the two-sample-problem. In Advances in neural information processing systems, pp. 513–520, 2007.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Sch¨olkopf, B., and Smola, A. A kernel two-sample test. Journal of Machine Learning Research, 13 (Mar):723–773, 2012.

Homan, A. and Zhou, H. Injectivity and stability for a generic class of generalized Radon transforms. The Journal of Geometric Analysis, 27(2):1515–1529, 2017.

Knop, S., Tabor, J., Spurek, P., Podolak, I., Mazur, M., and Jastrzebski, S. Cramer-wold autoencoder, 2018.

Kolouri, S., Zou, Y., and Rohde, G. K. Sliced wasser- stein kernels for probability distributions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5258–5267, 2016.

Kolouri, S., Rohde, G. K., and Hoffmann, H. Sliced wasserstein distance for learning gaussian mixture models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3427–3436, 2018.

Kolouri, S., Nadjahi, K., Simsekli, U., Badeau, R., and Rohde, G. K. Generalized sliced wasserstein distances. In Advances in Neural Information Processing Systems, 2019a.

Kolouri, S., Pope, P. E., Martin, C. E., and Ro- hde, G. K. Sliced wasserstein auto-encoders. In International Conference on Learning Representations, 2019b. URL https://openreview.net/ forum?id=H1xaJn05FQ.

Kolouri, S., Ketz, N. A., Soltoggio, A., and Pilly, P. K. Sliced cramer synaptic consolidation for preserving deeply learned representations. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id= BJge3TNKwH.

Kuchment, P. Generalized transforms of radon type and their applications. In Proceedings of Symposia in Applied Mathematics, volume 63, pp. 67, 2006.

Li, C.-L., Chang, W.-C., Cheng, Y., Yang, Y., and P´oczos, B. Mmd gan: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, pp. 2203– 2213, 2017.

Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose bayesian inference algorithm. In Advances in neural information processing systems, pp. 2378–2386, 2016.

Liutkus, A., S¸im¸sekli, U., Majewski, S., Durmus, A., and Stoter, F.-R. Sliced-Wasserstein flows: Nonparametric generative modeling via optimal transport and diffusions. In International Conference on Machine Learning (ICML), pp. 4104–4113, 2019.

Mohamed, S. and Lakshminarayanan, B. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.

M¨uller, A. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.

Nadjahi, K., De Bortoli, V., Durmus, A., Badeau, R., and S¸im¸sekli, U. Approximate bayesian computation with the sliced-wasserstein distance. arXiv preprint arXiv:1910.12815, 2019a.

Nadjahi, K., Durmus, A., Simsekli, U., and Badeau, R. Asymptotic guarantees for learning generative models with the sliced-wasserstein distance. In Advances in Neural Information Processing Systems, pp. 250–260, 2019b.

Rabin, J., Peyr´e, G., Delon, J., and Bernot, M. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pp. 435–446. Springer, 2011.

Ramdas, A., Trillos, N. G., and Cuturi, M. On wasser- stein two-sample testing and related families of nonparametric tests. Entropy, 19(2):47, 2017.

Sejdinovic, D., Sriperumbudur, B., Gretton, A., Fuku- mizu, K., et al. Equivalence of distance-based and rkhs-based statistics in hypothesis testing. The Annals of Statistics, 41(5):2263–2291, 2013.

Steinwart, I. and Christmann, A. Support vector machines. Springer Science & Business Media, 2008.

Tolstikhin, I., Bousquet, O., Gelly, S., and Schoelkopf, B. Wasserstein auto-encoders. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HkL7n1-0b.

Supplementary Material

We invoke the following lemma from Steinwart & Christmann (2008) to prove our result.

Lemma 1. (Lemma 4.34 in Steinwart & Christmann (2008)) Let be an open subset, k be a kernel on be a feature space of k, and Φ : be a feature map of k. Let be an index such that the mixed partial derivative of k with respect to the coordinates i and i + d exists and is continuous. Then the partial derivative Φ with respect to the i-th coordinate exists, is continuous, and for all

Lemma 2. Let us define that can be represented as follows

for any . Then, under Conditions 1-3, we have that

• ∥

• ∥

• ∥

• ∥

Proof. The proof of statements above follows immediately from Conditions 1-3.

Proof of Theorem 1

We prove that the kernel k in (9) has L-Lipschitz gradients:

and satisfies the following inequality:

Then the rest of the proof follows from Arbel et al. (2019), Proposition 1 (existence and uniqueness) and Proposition 8 (convergence of the Euler scheme).

Recalling the definition of from Lemma 2, we have that

Applying Lemma 2, we can simplify above to get

On the other hand,

Due to Lipschitz continuity of as well as Lemma 2, the above entails that

Combining above with (23), we have by triangle inequality that

Integrating above over Ωand interchanging the integral with the norm on the left-hand-side proves Condition 21 with

To prove that (22) holds, we can use Lemma 1 to

Integrating above uniformly over Ω, substituting it in (24), and summing over i proves that (22) holds with

7.1 kθ(xi, xj) for the smooth step function.

Here we derive the analytical form of for the zero’th order smooth step function, which is essentially the clamping function. Note that similar analytical formulas could be derived for higher order smooth step functions. For simplicity we let := )), and without loss of generality, we assume that ). For the zero’th order smooth step function, assuming bounded range [], we have:

) + For simplicity let := ), then we have:

where,

and,

and,

remember that . Then, we can write:

Putting it all together: