b

DiscoverSearch
About
My stuff
Wasserstein Exponential Kernels
2020·arXiv
ABSTRACT
ABSTRACT

In the context of kernel methods, the similarity between data points is encoded by the kernel function which is often defined thanks to the Euclidean distance, a common example being the squared exponential kernel. Recently, other distances relying on optimal transport theory – such as the Wasserstein distance between probability distributions – have shown their practical relevance for different machine learning techniques. In this paper, we study the use of exponential kernels defined thanks to the regularized Wasserstein distance and discuss their positive definiteness. More specifically, we define Wasserstein feature maps and illustrate their interest for supervised learning problems involving shapes and images. Empirically, Wasserstein squared exponential kernels are shown to yield smaller classification errors on small training sets of shapes, compared to analogous classifiers using Euclidean distances.

Contemporary machine learning methods frequently rely on neural networks, and shape recognition relies more specifically on convolutional neural networks. The big advantage of the latter is its ability to take the underlying structure of the data into account by treating neighboring pixels together. If these methods are very often impressive by their performance, they are also known for their drawbacks such as a weak robustness and a difficult explainability. On the other side, though not always being as accurate as neural networks, kernel methods are praised for their easy explainability and robustness. Another advantage of kernel methods is their versatility as they easily be used in supervised and unsupervised methods, as well as for generation [1]. We emphasize here the interest of choosing a particular kernel based on Wasserstein distance for classifying small datasets consisting of shapes.

In the context of kernel methods, squared exponential kernel functions are widely used, mainly because of their universal approximation properties and their empirical success. These Gaussians consist of the exponential of the negative Euclidean distance squared. However, the Euclidean distance might not always be appropriate to compare data points when data has some specific structure. Indeed, it measures the correspondence of each feature independently of the other features. For example, let’s consider the case of two identical 2D-shapes. When the two shapes overlap, their Euclidean distance is zero. However, if they do not overlap, their relative Euclidean distance becomes large although the shapes are identical. In other words, the Euclidean distance only compares each pixel at the same place on the grid, not taking the neighbouring pixels into account. The general structure of the features is not taken into account, only their strict correspondence. However, another distance – the Wasserstein distance – gained popularity in recent years since it can incorporate the structure of the data if the dataset can be processed so that the datapoints can be considered as probability distributions.

Contributions

The contributions of this paper are the following. Empirically, we demonstrate that squared exponential kernels (1) based on a regularized Wassertein distance are performant on small scale classification problems involving shape datasets, compared for instance to the popular Gaussian RBF kernel [2]. Also, an approximation technique is proposed, with the so-called Wasserstein feature map, so that a positive semi-definite (psd) kernel can be defined from the Wasserstein squared exponential kernel which is not necessarily psd.

Notations and conventions

In the sequel, we denote vectors by bold lower case letters. Let 1 be the all ones column vector. Also, we define  δyto be the Dirac measure at point y. A kernel  k : Rd × Rd → Ris called positive semi-definite if all kernel matrices K = [k(xi, xj)]ni,j=1 are positive semi-definite.

Wasserstein distances

The Wasserstein distance is a central notion in optimal transport theory. Also known as the earth mover’s distance, it corresponds to the optimal transportation cost between two measures [3, 4]. Let p > 0. We then define two normalized empirical measures  α = �mi=1 aiδyiand  β = �nj=1 bjδzjsuch that  a⊤1 = 1and  b⊤1 = 1, and where {yi ∈ Rd}ni=1, {zj ∈ Rd}mj=1are support points. Also, we define an Euclidean distance matrix  dij = ∥yi − zj∥2. Then, the p-Wasserstein distance is given by

image

with  Π(α, β) =�Π ∈ Rm×n|Π1 = a and Π⊤1 = b�, the set of joint distributions  πwith specified marginals given by  α and β. Intuitively, the optimal probability distribution  π⋆ represents the optimal mass transportation scheme from α to β. A particular result occurs in the one-dimensional (d = 1) case assuming the support points are ordered, i.e.,  y1 ≤. . . ≤ ym and z1 ≤ . . . ≤ zn, where the Wasserstein distance reduces to an  ℓp-norm: Wpp�1n�ni=1 δyi, 1n�nj=1 δzj�=

n||y−z||pp [4]. This connection between  ℓp-norms and Wasserstein distances is only clear in one dimension, illustrating here again the fact that  ℓp-norms don’t take the underlying structure into account. To take it into account, we need to consider the case d > 1. In this way we can define a new kernel function

image

However, this has some undesirable consequences concerning positive definiteness. A kernel k(x, y) = exp (−tf(x, y))is positive semi-definite for all t > 0 if and only if f(x, y) is Hermitian and conditionally negative semi-definite [5]. Recall that a kernel is conditionally negative semi-definite if any Gram matrix  F = [f(xi, xj)]ni,j=1(with  n ≥ 2)built from a discrete sample satisfies  c⊤Fc ≤ 0for all c such that  1⊤c = 0. However, the Wasserstein distance for d > 1 is not necessarily conditionally negative definite [4]. The consequence is that we cannot guarantee that any resulting squared exponential kernel matrix built with the 2-Wasserstein distance is positive definite. This property is fundamental in kernel theory and more specifically for defining reproducing kernel Hilbert spaces (RKHS; see [6] for more details).

This restriction has lead authors to consider only some specific cases of Wasserstein distances which are known to be positive definite. The one-dimensional generic case is proven to be positive definite and has lead to the introduction of sliced Wasserstein distances [7,8]. Another notable case is the Wasserstein distance between two Gaussians in more than one dimension, which can even be written in closed form [4].

Some kernel methods are still usable with non positive definite kernels, such as LS-SVMs [9,10]. However, this leads to a slightly different interpretation of the global problem, using Kre˘ın spaces for which a weaker version of the representer theorem holds [11]. In this paper, we propose an alternative which allows us to still work with a positive definite kernel approximating the squared exponential kernel. If the Wasserstein exponential kernel can not be used, we can always find a parameter  σ > 0and a finite dimensional feature map resulting in a positive definite kernel.

2.1 Positive definite squared exponential kernels and bandwidth choice

In this section, we show that for a given dataset, the corresponding Gram matrix of  kWis positive definite if the bandwidth parameter  σ > 0is small enough.

Definition 2.1. Let  d : D × D → R≥0be a symmetric function such that d(x, x) = 0 and let  {xi ∈ D}Ni=1be a dataset. A squared exponential kernel matrix is defined as

image

By construction, this exponential kernel matrix will be symmetric and have a diagonal consisting only of ones. Its eigenvalues are real. To investigate its (semi)-definiteness, we have to investigate the sign of the minimum eigenvalue. The minimum eigenvalue  λmin (σ)of  Kd,σis the function  λmin : R>0 → R, σ �→ min {λ1, . . . , λN}where  λ1, . . . , λNare the eigenvalues of  Kd,σ. We can now prove the following result:

Lemma 2.1. The eigenvalues of the exponential kernel matrix  Kd,σare continuous functions of  σ. In particular, λmin (σ)is continuous.

Proof. This is a direct consequence of the continuity of the roots of a polynomial under continuously varying coefficients. Therefore, we have to prove that the coefficients of the characteristic polynomial of the exponential kernel matrix Kd,σis continuous in function of  σ. The characteristic polynomial is given by  det (Kd,σ − λI)and by the formula of Leibniz, we ultimately have that the characteristic polynomial is a sum of products of elements of  Kd,σ − λI, whichare continuous in function of  σ. Hence, the coefficients are continuous and so are the eigenvalues.

Lemma 2.2.  limσ→0 Kd,σ = id and thus λmin (0) = 1.

Proof. From Definition 2.1, we know that  [Kd,σ]i,j = exp�−d2(xi,xj)2σ2 �with d2(xi, xi) = 0 and d2(xi, xj) > 0 for

i ̸= j. Denote  Ci,j = d2(xi, xj)for simplicity. We have thus  limσ→0 exp� 02σ2�= 1and  limσ→0 exp�− Ci,j2σ2�= 0with  Ci,j > 0 for i ̸= j, hence the identity matrix. By consequence, all the eigenvalues are equal to 1.

Lemma 2.3. We have  limσ→∞ Kd,σ = 11T and thus limσ→∞ λmin (σ) = 0.

Proof. Similarly as before, we have  limσ→+∞ [Kd,σ]i,j = 1everywhere. By consequence, we have  λmax = Nand all others equal to zero, hence  λmin = 0.

Proposition 2.4. There exists a  σPSD ∈ R+ such that Kd,σis positive semi-definite for all  σ ≤ σPSD.

Proof. Let’s proceed ad absurdum and suppose this is not the case. We consider the sequence  (σn)nconverging to 0 with  σ0 = σPSD. There must exist some subsequence�σnj�jsuch that�λmin�σnj��j < 0. If this sequence if finite, then is suffices to consider a new sequence with  σPSD = σnjmax+1. If this subsequence is infinite, then  (λmin (σn))ncannot converge to 1. This is impossible because of the continuity of  λmin (σ)(lemma 2.1) and its convergence to 1 (lemma 2.2). Hence, there exist some  σPSD > 0such that  λmin (σ) ≥ 0for all  σ ≤ σPSD. This proves our proposition.

We can empirically see the result of Proposition 2.4 in Fig. 1, where all eigenvalues are positive. To give some intuition, decreasing the  σtends to make the smallest distances more predominant, pushing the smallest eigenvalue progressively to the positive side. In this sense, an indefinite kernel matrix with  σ close to σPSDwill lead to very proportionally very small negative eigenvalues in magnitude. In this case, a finite positive definite approximation can be justified.

image

Figure 1: Comparison of the classical squared exponential kernel matrix (based on a  ℓ2-distance) and the introduced Wasserstein exponential kernel matrix on 250 normalized digits of the MNIST dataset [12]. The digits are ordered by class in ascending order.

2.2 Wasserstein features

We can consider a finite dimensional feature map  φ(x)such that the positive semi-definite kernel  φ(x)⊤φ(y)approximates  kW (x, y)given in (1). This finite approximation is based on a training dataset  {xi}Ni=1for constructing an original kernel matrix  K = [kW (xi, xj)]Ni,j=1 ∈ RN×N. It suffices to truncate the spectral decomposition of the kernel matrix  K = �Nl=1 λlvlv⊤l to the ℓlargest strictly positive eigenvalues. This will result in a new positive definite kernel matrix  K(ℓ) def= �ℓl=1 λlvlv⊤l ≻ 0 with λ1 ≥ · · · ≥ λN. We can now reconstruct the different components of an approximate feature map

image

with  kxdef= [kW (x, x1) · · · kW (x, xN)]⊤. We refer to these different components as the Wasserstein features as they compose the approximate feature map  φ (x)def= [φ1(x) · · · φℓ(x)]⊤of the Wasserstein exponential kernel. This approximate feature map is constructed by using a training dataset, but can afterwards be evaluated at any out-of-sample point. By construction, we can verify that the Wassertein features evaluated on the training dataset result in the truncated kernel matrix:

Proposition 2.5. We have�φ(xi)⊤φ(xj)�Ni,j=1 = K(ℓ).

Proof. It suffices to observe that  kxi = �Nl=1 λlvl [vl]i. By consequence, we have  φl(xi) = √λl [vl]i.

Proposition 2.4 suggests that even if no suitable  σcan be found such that the kernel matrix is positive, the negative eigenvalues will remain very small in magnitude. By consequence, we can suppress them without loosing much information. A truncated kernel is thus very close to the original one in spectral norm. This justifies the Wasserstein features in this sense that they are very close to the Wasserstein exponential kernel as well as being positive definite by construction. This fact can be visualized on Fig. 2.

Clearly, the Wasserstein features yield a positive semi-definite kernel. Moreover, it is also advantageous to work with finite dimensional feature maps to reduce the training time. Indeed, the computation of the Wasserstein distance (or an approximation with e.g. Sinkhorn’s algorithm [13]) is still relatively expensive compared to  ℓ2 distance.

image

Figure 2: Kernel matrices constructed as the inner products of a different number Wasserstein features of a test set. These matrices are compared with the exact Wasserstein squared exponential kernel matrix of the test set. Both the training set and the test are of size N = 500.

Table 1: Percentage of classification error on the test set of three datasets. The standard deviation is given in parenthesis. The number of repeated simulations is 7 for MNIST, 8 for Quickdraw and 6 for USPS.

image

3.1 Setup for 2D shape classification

Let u be a greyscale image that we unfold as a vector of length m and so that  ui > 0is the “grey” value at the pixel yiof a pixel grid. It is mapped to a probability  α = �mi=1 aiδyiby defining  ai = ui/∥u∥1, so that the mass of  αis one. In practice, the p = 2 Wassertein distance is calculated in this paper with the help of the well-known entropic regularization, namely

image

where  ϵ > 0is a small regularization term and  dijis the Euclidean distance between pixels located at  yiand  yjin a pixel grid. The advantage of this regularized problem is that its solution can be efficiently obtained thanks to the Sinkhorn algorithm, which can be parallelized. For more details, we refer to [4]. All the simulations used  ϵ = 0.4 andthe diagonal of the distance matrix set to zero.

image

Figure 3: Mean misclassification rates for various subset sizes of the MNIST dataset, computed on 7 simulations. The standard deviation is given by the errors bars. For the specific case of “Core + OOS”, the out-of-sample subset represents 300 datapoints on 500, 750 on 1250, 1500 on 2500 and 2500 on 4000. The size of validation set is always 5000 and of the test set always 10 000.

3.2 Shape recognition

We illustrate the use of the Wasserstein based kernels in the context of shape classifications. Namely, we train a Least Squares Support Vector Machine [14] classifier on subsets of the MNIST [12], Quickdraw1 and USPS [15] datasets, which are sampled uniformly at random. These three datasets contain handwritten digits and shapes. The multiclass problem is solved by a one-versus-one encoding. One instance of these binary classifiers  f(x) = sign(w⋆⊤φ(x) + b⋆)is obtained by solving

image

where  yi ∈ {−1, 1}and  φ(x) ∈ Rℓis a feature map obtained for instance thanks to (2). The solution is obtained by solving

image

which is a  (ℓ + 1) × (ℓ + 1)linear system. A classifier can also be obtained by solving the dual problem of (3). The optimality conditions of this dual problem yield the following  (N + 1) × (N + 1)linear system

image

The resulting classifier has then the expression  f(x) = sign(�Ni=1 α⋆i k(x, xi) + b⋆). The hyperparameters  σ > 0and  γ > 0are chosen by validation. The final classification is done by minimizing the hammming distance on the one-versus-one outputs [16]. In order to account for the amount of ink in the grey images u and v, we also introduce a reweighted kernel that is defined as

image

Notice that a similar kernel has been defined with the Euclidean distance in [17,18].

In our experiments, we compare several methods based on  kW and kRW, Wasserstein and Euclidean distances.

3.2.1 Core Wasserstein kernel

The “Core” method consists in solving (3) thanks to the feature map (2) associated to  K(ℓ). The parameter  ℓ is chosensuch that all the selected eigenvalues are larger than  10−6 to avoid numerical instabilities. The optimal  w⋆ and b⋆ arethen obtained by solving a linear system.

3.2.2 Core Wasserstein kernel with out-of-sample

Our second method named “Core + OOS” uses almost the same methodology as “Core”. However, a subset of the training set is used to construct the truncated Wasserstein kernel of Proposition 2.5. Then the out-of-sample (OOS) formula (2) is used to construct an approximation of the kernel matrix on the full training dataset. The advantage of this approximation is that it can avoid the full eigendecomposition of the kernel matrix which is necessary for the “Core” method.

3.2.3 Indefinite Wasserstein kernel

For this second method, we simply use for the kernel matrix the indefinite Gram matrix associated to (5) and solve the system (4) associated to the dual formulation of LS-SVM. While the associated optimization problem is not necessarily bounded in that case, the linear system (4) still has a solution in practice (almost surely if  γselected uniformly at random.). We name this method “Indefinite Wasserstein” in Figure 3.

3.2.4 Gaussian RBF

The previous methods are compared with a classical LS-SVM classifier with kernel

image

The parameters  σ and γare obtained by validation in the same spirit as above.

3.2.5 KNN

The same task is also performed for a kNN classifiers defined both with Euclidean and Wasserstein distances [19]. Those two methods are considered as benchmarks to assess the accuracy of the kernel methods hereabove. Notice that the number of nearest neighbours k is selected by validation.

3.3 Description of the simulations

The simulations are repeated several times and the mean classification error rate is given as well as the standard deviation. We emphasize that the classes are balanced in each of the datasets. The coded is provided on GitHub2.

3.4 Discussion

The results obtained by classifiers defined with Wasserstein exponential kernel  kWoutperform the Euclidean and Wasserstein kNN classifiers, as well as LS-SVM with a Gaussian RBF kernel (see Fig. 3 and Table 1). The latter is especially outperformed when the number of training data points is limited to a few thousands. We observed empirically that the advantage of  kWis indeed reduced as the size of the training set further increases. Surprisingly, the classifier obtained for the indefinite  kWkernel yields the best performance when the training set is larger. For moderate size training sets, LS-SVM classifiers can be competitive with respect to other methods that do not rely on convolutional neural networks. The latter are known to be performant for relatively large training datasets. While an advantage of Wasserstein based methods is an increased accuracy in the classification tasks of this paper, a main disadvantage is the increased training time.

In this paper, we proposed the use of Wasserstein squared exponential kernels for classifying shapes given relatively small training datasets. Although the computation of Wasserstein distances is expensive, it can be made possible thanks to the entropic regularization and the Sinkhorn algorithm, as it is well known. The so-called Wasserstein features are also proposed to serve as an approximation of the Wasserstein squared exponential kernel which is not necessarily positive semidefinite. In particular, this construction is possible if the bandwidth parameter is small enough as it is explained by elementary theoretical results. These theoretical results also open a door to more general exponential kernels based on any measure of similarity.

EU: The research leading to these results has received funding from the European Research Council under the European Union’s Horizon 2020 research and innovation program / ERC Advanced Grant E-DUALITY (787960). This paper reflects only the authors’ views and the Union is not liable for any use that may be made of the contained information. Research Council KUL: Optimization frameworks for deep kernel machines C14/18/068. Flemish Government: FWO: projects: GOA4917N (Deep Restricted Kernel Machines: Methods and Foundations), PhD/Postdoc grant. This research received funding from the Flemish Government under the “Onderzoeksprogramma Artifici¨ele Intelligentie (AI) Vlaanderen” programme. Ford KU Leuven Research Alliance Project KUL0076 (Stability analysis and performance improvement of deep reinforcement learning algorithms). The computational resources and services used in this work were provided by the VSC (Flemish Supercomputer Center), funded by the Research Foundation - Flanders (FWO) and the Flemish Government – department EWI.

[1] Arun Pandey, Joachim Schreurs, and Johan A. K. Suykens. Generative restricted kernel machines , arxiv:1906.08144.

[2] Carl E. Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

[3] C´edric Villani. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg, 2008.

[4] Gabriel Peyr´e and Marco Cuturi. Computational Optimal Transport: With Applications to Data Science. Now Foundations and Trends, 2019.

[5] Christian Berg, Jens Peter Reus Christensen, and Paul Ressel. Harmonic Analysis on Semigroups, volume 100 of Graduate Texts in Mathematics. Springer New York, New York, NY, 1984.

[6] Bernhard Scholkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001.

[7] Mathieu Carri`ere, Marco Cuturi, and Steve Oudot. Sliced wasserstein kernel for persistence diagrams. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, page 664–673. JMLR.org, 2017.

[8] Soheil Kolouri, Kimia Nadjahi, Umut Simsekli, Roland Badeau, and Gustavo Rohde. Generalized sliced wasserstein distances. In Advances in Neural Information Processing Systems 32, pages 261–272. Curran Associates, Inc., 2019.

[9] Johan A. K. Suykens, Tony Van Gestel, Jos De Brabanter, Bart De Moor, and Joos Vandewalle. Least Squares Support Vector Machines. World Scientific, Singapore, 2002.

[10] Xiaolin Huang, Andreas Maier, Joachim Hornegger, and Johan A. K. Suykens. Indefinite kernels in least squares support vector machines and principal component analysis. Applied and Computational Harmonic Analysis, 43(1):162–172, 2017.

[11] Cheng Soon Ong, Xavier Mary, St´ephane Canu, and Alexander J Smola. Learning with non-positive kernels. In Twenty-first international conference on Machine learning - ICML ’04, page 81, New York, New York, USA, 2004. ACM Press.

[12] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010.

[13] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems 26, pages 2292–2300. Curran Associates, Inc., 2013.

[14] Johan A. K. Suykens and Joos Vandewalle. Least squares support vector machine classifiers. Neural Process. Lett., 9(3):293–300, June 1999.

[15] Jonathan J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550–554, May 1994.

[16] Johan A. K. Suykens and Joos Vandewalle. Multiclass least squares support vector machines. In IJCNN’99. International Joint Conference on Neural Networks. Proceedings (Cat. No.99CH36339), volume 2, pages 900–903 vol.2, July 1999.

[17] Julien Mairal. End-to-end kernel learning with supervised convolutional kernel networks. In Advances in Neural Information Processing Systems 29, pages 1399–1407. Curran Associates, Inc., 2016.

[18] Dexiong Chen, Laurent Jacob, and Julien Mairal. Biological sequence modeling with convolutional kernel networks. Bioinformatics, 35(18):3294–3302, 02 2019.

[19] Michael Snow and Jan Van Lent. Monge’s optimal transport distance for image classification, arxiv:1612.00181.


Designed for Accessibility and to further Open Science