Theoretical Guarantees for Bridging Metric Measure Embedding and Optimal Transport

2020·Arxiv

Abstract

Abstract

We propose a novel approach for comparing distributions whose supports do not necessarily lie on the same metric space. Unlike Gromov-Wasserstein (GW) distance which compares pairwise distances of elements from each distribution, we consider a method allowing to embed the metric measure spaces in a common Euclidean space and compute an optimal transport (OT) on the embedded distributions. This leads to what we call a sub-embedding robust Wasserstein (SERW) distance. Under some conditions, SERW is a distance that considers an OT distance of the (low-distorted) embedded distributions using a common metric. In addition to this novel proposal that generalizes several recent OT works, our contributions stand on several theoretical analyses: (i) we characterize the embedding spaces to define SERW distance for distribution alignment; (ii) we prove that SERW mimics almost the same properties of GW distance, and we give a cost relation between GW and SERW. The paper also provides some numerical illustrations of how SERW behaves on matching problems.

1 Introduction

Many central tasks in machine learning often attempt to align or match real-world entities, based on computing distance (or dissimilarity) between pairs of corresponding probability distributions. Recently, optimal transport (OT) based data analysis has proven a significant usefulness to achieve such tasks, arising from designing loss functions (Frogner et al., 2015), unsupervised learning (Arjovsky et al., 2017), clustering (Ho et al., 2017), text classification (Kusner et al., 2015), domain adaptation (Courty et al., 2017), computer vision (Bonneel et al., 2011; Solomon et al., 2015), among many more applications (Kolouri et al., 2017; Peyr´e and Cuturi, 2019). Distances based on OT are referred to as the Monge-Kantorovich or Wasserstein distance (Monge, 1781; Kantorovich, 1942; Villani, 2009). OT tools allow for a natural geometric comparison of distributions, that takes into account the metric of the underlying space to find the most cost-efficiency way to transport mass from a set of sources to a set of targets. The success of machine learning algorithms based on Wasserstein distance is due to its nice properties (Villani, 2009) and to recent development of efficient computations using entropic regularization (Cuturi, 2013; Genevay et al., 2016; Altschuler et al., 2017; Alaya et al., 2019).

Distribution alignment using Wasserstein distance relies on the assumption that the two sets of entities in question belong to the same ground space, or at least pairwise distance between them can be computed. To overcome such limitations, one seeks to compute Gromov-Wasserstein (GW) distance (Sturm, 2006; M´emoli, 2011), which is a relaxation of Gromov-Hausdorff distance (M´emoli, 2008; Bronstein et al., 2010). GW distance allows to learn an optimal transport-like plan by measuring how the distances between pairs of samples within each ground space are similar. The GW framework has been used for solving alignment problems in several applications, for instance shape matching (M´emoli, 2011), graph partitioning and matching (Xu et al., 2019), matching of vocabulary sets between different languages (Alvarez-Melis and Jaakkola, 2018), generative models (Bunne et al., 2019), or matching weighted networks (Chowdhury and M´emoli, 2018), to name a few. However, computing GW distance induces a heavy computation burden as the underlying problem is a non-convex quadratic program and NP-hard (Peyr´e and Cuturi, 2019). Peyr´e et al. (2016) propose an entropic version called entropic GW discrepancy, that leads to approximate GW distance.

In this paper, we develop a distance, that is similarly to GW distance applied on sets of entities from different spaces. Our proposal builds upon metric embedding that allows for an approximation of some “hard” problem with complex metric spaces into another one involving “simpler” metric space (Matouˇsek, 2002) and upon Wasserstein OT cost on the embedding space. Hence, unlike GW distance that compares pairwise distances of elements from each distribution, we consider a method that embeds the metric measure spaces into a common Euclidean space and computes a Wasserstein OT distance between the embedded distributions. In this context, we introduce a distance, robust to isometry in the embedding space, that generalizes the “min-max” robust OT problem recently introduced in Paty and Cuturi (2019), where the authors consider orthogonal projections as embedding functions. Main contributions of this work are summarized in the following three points:

• We propose a framework for distribution alignment from different spaces using a sub-embedding robust Wasserstein (SERW) distance. As central contribution, we develop the theoretical analysis characterizing the embedding spaces so that SERW be a distance.

• We provide mathematical evidence on the relation between GW and our SERW distances. We show for instance, that one key point for approximating GW is that the embeddings be distance-preserving.

• We sketch a potential algorithm describing how our distance can be computed in practice and present numerical illustrations on simulated and real datasets that support our theoretical results.

The remainder of the paper is organized as follows. In Section 2 we introduce the definitions of Wasserstein and GW distances, and set up the embedding spaces. In Section 3 we investigate metric measure embedding for non-aligned distributions through an OT via SERW distance. Section 4 is dedicated to numerical experiments on matching tasks based on simulated and real data. The proofs of the main results are postponed to the appendices.

2 Preliminaries

We start here by reviewing basic definitions of the materials needed to introduce the main results. We consider two metric measure spaces (mm-space for short) (Gromov et al., 1999) () and (), where () is a compact metric space and is a probability measure with full support, i.e. . We recall that the support of a measure supp[] is the minimal closed subset such that ) = 0. Similarly, we define the mm-space (). Let P(X) be the set of probability measures in X and . We define ) as its subset consisting of measures with finite p-moment, i.e.,

where ) :=). For ) and ), we write Π() for the collection of probability measures (couplings) on

Wasserstein distance. The Monge-Kantorovich or the 2-Wasserstein distance aims at finding an optimal mass transportation plan ) such that the marginals of are respectively , and these two distributions are supposed to be defined over the same ground space, i.e., X = Y . It reads as

The infimum in (1) is attained, and any probability which realizes the minimum is called an optimal transport plan.

Gromov-Wasserstein distance. In contrast to Wasserstein distance, GW one deals with measures that do not necessarily belong to the same ground space. It learns an optimal transport-like plan which transports samples from a source metric space X into a target metric space Y by measuring how the distances between pairs of samples within each space are similar. Following the pioneering work of M´emoli (2011), GW distance is defined as

where

with a quadratic loss function Peyr´e et al. (2016) propose an entropic version called entropic GW discrepancy, allowing to tackle more flexible losses , such as mean-square-error or Kullback-Leibler divergence. The latter version includes an entropic regularization of in the GW distance computation problem.

Metric embedding. Metric embedding consists in characterizing a new representation of the samples based on the concept of distance preserving.

Definition 1. A mapping : () is an embedding with distortion , denoted as -embedding, if the following holds: there exists a constant 0 (“scaling factor”) such that for all

The approximation factor in metric embedding depends on a distortion parameter of the -embedding. This distortion is defined as the infimum of all 1 such that the above condition (2) holds. If no such exists, then the distortion of is infinity.

In this work we will focus on target spaces Z that are normed spaces endowed with Euclidean distance. Specifically, for some integer d to be specified later, we will consider the metric space (). Hence, one can always take the scaling factor to be equal to 1 (by replacing by ). Note that an embedding with distortion at most is necessarily one-to-one (injective). Isometric embeddings are for instance embeddings with distortion 1. For more details about embeddings, we invite the reader to look at the technical report of Matouˇsek (2013).

We suppose hereafter = 1 in (2) and we denote by ) and ) the set of -embedding , respectively. We further assume that (0) = 0. It is worth noting that when X and Y are finite spaces, then the embedding spaces non empty. Indeed, suppose we are given a set of n data points embedding theorem (Bourgain, 1985) guarantees the existence of an embedding tight distortion at most O(log n), i.e., (log , and the target dimension d = O(log). We stress that d is independent of the original dimensions of X and Y and depends only on the number of the given data points n and m and the accuracy-embedding parameters and . Hence for data points underlying the distributions of interest, one has

Remark 1. For our setting, we cannot use the Johnson-Lindenstrauss flattening lemma (Johnson and Lindenstrauss, 1984) (JL-lemma in short) to guarantee the existence of the embeddings. To see that, let us first state this lemma: the JL-lemma asserts that if X is a Hilbert space, 0, , and then there exists a linear mapping with d = O(log n) such that for all

Note that the origin space X is supposed to be a Hilbert space in the JL-lemma and the embedding is a linear mapping with a target dimension is only supposed to be a metric space and the target dimension in Bourgain’s theorem. Furthermore, the JL-lemma fails to hold true in certain non-Hilbertian settings (Matouˇsek, 1996; Brinkman and Charikar, 2005). In our approach, we consider a general setup that relies on non-linear embeddings that act on different data structures like text, graphs, images, etc. Therefore, we used Bourgain’s theorem to guarantee the non-emptiness of the embedding spaces

Let’s now highlight all the above criteria characterizing the metric embeddings we consider to define our novel distance and that help us shape some of its properties.

Assumption 1. Assume that () and () are finite mm-spaces containing the origin 0, and endowed with measures and . Assume also that X and Y are of cardinalities n and m, the target dimension d satisfies (3), ) = (0) = 0} and ) = (0) = 0}. The distortions parameters ),

3 Metric measure embedding and OT for distribution alignment

Let us give first the overall structure of our approach of non-aligned distributions, which generalizes recent works (Alvarez-Melis and Jaakkola, 2018; Paty and Cuturi, 2019). The generalization stands on the fact that the two distributions lie on two different metric spaces and this fact raises several challenging questions about the characterization of the embeddings for yielding a distance. In our approach, we consider a general setup that relies on non-linear embeddings before aligning the measures. Note, if the metric spaces coincide for both distributions and the embeddings are restricted to be linear (subspace projection) then, our distance reduces to the one proposed by Paty and Cuturi (2019).

In this work, we aim at proposing a novel distance between two measures defined on different mm-spaces. This distance will be defined as the optimal objective of some optimization problem, we provide technical details and conditions ensuring its existence in the first part of this section. We then present formally our novel distance and its properties including its cost relation with GW distance.

In a nutshell, our distribution alignment distance between and is obtained as a Wasserstein distance between pushforwards (see Definition 2) of and w.r.t. some appropriate couple of embeddings (). Towards this end, we need to exhibit some topological properties of the embeddings spaces, leading at first to the existence of the constructed OT approximate distances.

3.1 Topological properties of the embedding spaces

We may consider the function Γsuch that Γfor each pair of embeddings ). This function defines a proper metric on the space of embeddings ) and it is referred to as the ). Indeed, Γsatisfies all the conditions that define a general metric. We analogously define the metric Γon ). With the aforementioned preparations, the embeddings spaces satisfy the following topological property.

are both compact metric spaces.

Endowing the embedding spaces with the supremum metrics is fruitful, since we get benefits from some existing topological results, based on this functional space metric, to prove the statement in Proposition 1.

To let it more readable, the proof of Proposition 1 is divided into 5 steps summarized as follows: first step is for metric property of ); second one shows completeness of ); third establishes the totally boundedness of ), namely that one can recover this space using balls centered on a finite number of embedding points; the last is a conclusion using Arzela-Ascoli’s Theorem for characterizing compactness of subsets of functional continuous space, see Appendix A.1 for all theses details and their proofs.

We now give a definition of pushforward measures.

Definition 2. (Pushforward measure). Let (S, S ) and (T, T ) be two measurable spaces, be a mapping, and be a measure on S. The pushforward of , is the measure on T defined by ) = )) for . If is a measure and f is a measurable function, then is a measure.

3.2 Sub-Embedding OT

Assume Assumption 1 holds. Following Paty and Cuturi (2019), we define an embedding robust version of Wasserstein distance between pushforwards ) for some appropriate couple of embeddings (). We then consider the worst possible OT cost over all possible low-distortion embeddings. The notion of “robustness” in our distance stands from the fact that we look for this worst embedding.

Definition 3. The d-dimensional embedding robust 2-Wasserstein distance (ERW) between and reads as

where stands for the set of orthogonal mappings on and denotes the composition operator between functions.

The infimum over the orthogonal mappings on corresponds to a classical orthogonal procrustes problem (Gower, 1975; Grave et al., 2019). It learns the best rotation between the embedded points, allowing to an accurate alignment. The orthogonality constraint ensures that the distances between points are preserved by the transformation.

Note that ) is finite since the considered embeddings are Lipschitz and both of the distribu- tions have finite 2-moment due to Assumption 1. Next, using results of pushforward measures, for instance see Lemmas 7 and 8 in the Appendices, we explicit ERW in Lemma 1, whereas Lemmas 2 and 4 establish the existence of embeddings that achieve the suprema defining both ERW and SERW.

Lemma 1. For any () and , let ) =

By the compactness property of the embedding spaces (see Proposition 1), the set of optima defining ) is not empty.

Lemma 2. There exist a couple of embeddings (

Clearly, the quantity ) is difficult to compute, since an OT is a linear programming problem that requires generally super cubic arithmetic operations (Peyr´e and Cuturi, 2019). Based on this observation, we focus on the corresponding “min-max” problem to define the d-dimensional sub-embedding robust 2-Wasserstein distance (SERW). For the sake, we make the next definition.

Definition 4. The d-dimensional sub-embedding robust 2-Wasserstein distance (SERW) between and is defined as

Thanks to the minimax inequality, the following holds.

We emphasize that ERW and SERW quantities play a crucial role in our approach to match distributions in the common space regarding pushforwards of the measures and realized by a couple of optimal embeddings and a rotation. Optimal solutions for ) exist. Namely:

Lemma 4. There exist a couple of embeddings (

The proofs of Lemmas 2 and 4 rely on the continuity under integral sign Theorem (Schilling, 2005), and the compactness property of the embedding spaces, the orthogonal mappings on space and the coupling transport plan Π(), see Appendices A.5 and A.3 for more details.

Recall that we are interested in distribution alignment for measures coming from different mm-spaces. One hence expects that SERW mimics some metric properties of GW distance. To proceed in this direction, we first prove that SERW defines a proper metric on the set of all weakly isomorphism classes of mm-spaces. In our setting the terminology of weakly isomorphism means that there exists a pushforward embedding between mm-spaces. If such a pushforward is 1-embedding the class is called strongly isomorphism.

Proposition 2. Let Assumption 1 holds and assume and with . Then, ) = 0 happens if and only if the couple of embeddings () and optima of ) verify

Figure 1: Illustration of the preserving measure mappings between the mm-spaces () and () given in Proposition 2. The embedding . Our distance ) = 0 vanishes if and only if and are mapped through the embeddings

Figure 1 illustrates the mappings between the embedding spaces and how they are assumed to interact in order to satisfy condition in Proposition 2. In M´emoli (2011) (Theorem 5, property (a)), it is shown that ) = 0 if and only if () and () are strongly isomorphic. This means that there exists a Borel measurable bijection (with Borel measurable inverse such that is 1-embedding and The statement in Proposition 2 is a weak version of the aforementioned result, because neither nor (are isometric embeddings. However, we succeed to find a measure-preserving mapping relating to each other via the given pushforwards in Proposition 2. Note that maps from the Y space to while maps from X to . Our distance ) vanishes if and only if and are mapped through the embeddings . With these elements, we can now prove that both ERW and SERW are further distances.

Proposition 3. Assume that statement of Proposition 2 holds. Then, ERW and SERW define a proper distances between weakly isomorphism mm-spaces.

3.3 Cost relation between GW and SERW

In addition to the afore-mentioned theoretical properties of SERW, we establish a cost relation metric between GW and SERW distances. The obtained upper and lower bounds depend on approximation constants that are linked to the distortions of the embeddings.

Proposition 4. Let Assumption 1 holds. Then,

Proposition 5. Let Assumption 1 holds. Then,

Proofs of Propositions 4 and 5 are presented in Appendices A.9 and A.8. We use upper and lower bounds of GW distance as provided in M´emoli (2008). The cost relation between SERW and GW distances obtained in Propositions 4 and 5 are up to the constants which are depending on the distortion parameters of the embeddings, and up to an additive constant through the of the measures In the following we highlight some particular cases leading to closed form of the upper and lower bounds for the cost relation between GW and SERW distances.

3.4 Fixed sub-embedding for distribution alignment

From the computational point of view, computing SERW distance seems a daunting task, since one would have to optimize over the product of two huge embedding spaces ). However in some applications we may not require solving over ) and rather have at disposal known embeddings in advance. For instance, for image-text alignment we may leverage on features extracted from pre-trained deep architectures (VGG (Simonyan and Zisserman, 2015) for image embedding, and Word2vec Mikolov et al. (2013) for projection the text). Roughly speaking, our SERW procedure with respect to these fixed embeddings can be viewed as an embedding-dependent distribution alignment for matching. More precisely, the alignment quality is strongly dependent on the given embeddings; the lower distorted embeddings, the more accurate alignment.

Definition 5. For a fixed couple of embeddings (), we define the fixed sub-embedding robust Wasserstein (FSERW) as

defines a proper distance if and only if

The cost relation guarantees given in Propositions 3 and 4 are dependent on the distortions of the fixed embeddings, i.e., the constants the following holds

In a particular case of isometric embeddings, our procedure gives the following cost relation

The additive constants and can be upper bounded in a setting of data preprocessing, for instance in the case of a -normalization preprocessing we have

4 Numerical experiments

Here we illustrate how SERW distance behaves on numerical problems. We apply it on some toy problems as well as on some problems usually addressed using GW distance.

Sketch of practical implementation. Based on the above presented theory, we have several options for computing the distance between non-aligned measures and they all come with some guarantees compared to a GW distance. In the simpler case of fixed embedding, if the original spaces are subspaces of , any distance preserving embedding can be a good option for having an embedding with low distortion. Typically, methods like multidimensional scaling (MDS) (Kruskal and Wish, 1978), Isomap (Tenenbaum et al., 2000) or Local linear embedding (LLE) (Roweis and Saul, 2000) can be good candidates. One of the key advantages of SERW is that it considers non-linear embedding before measure alignment. Hence, it has the ability of leveraging over the large zoo of recent embedding methods that act on different data structures like text (Grave et al., 2018), graphs (Grover and Leskovec, 2016; Narayanan et al., 2017), images (Simonyan and Zisserman, 2015), or even histograms (Courty et al., 2018).

In the general setting our theoretical results require computing ) to solve the problem given in the Definition (4). We sketch in Section 4.1 a practical procedure to learn from samples and , non-linear neural-network based embedding functions and that maximize the Wasserstein distance between the embedded samples while minimizing the embedding distortion.

4.1 Implementation details on learning the embeddings

In practice, for computing ), we need to solve the problem given in Equation (4). As stated above in some practical situations, we leverage on existing embeddings and consider the problem without the maximization over the embedings as the space is restricted to an unique singleton. In some other cases, it is possible to learn the embedding that maximizes the Wasserstein distance between embedded examples and that minimizes the distance distortion of the embedding. In what follows, we detail how we have numerically implemented the computation of ) from samples respectively sampled from X and Y according to

For all , we denote the pairwise distances ) in the origin space in the embbed space The problem we want to solve is

with Π() = . In this equation, the first sum corresponds to the optimal transport cost function and the other two sums compute the distortion between pairwise distances in the input space and embedded space respectively for the x and the y samples. In the notation, ) is a loss function that penalizes the discrepancy between the input and embedded distances. This distance loss has been designed so as to encourage the embedding to preserve pairwise distance up to a ˜factor. Hence

with

and denotes the indicator function. In the experiments ˜is fixed as max(0(similarly for ˜). It penalizes the embbeded couples of inputs whose embbeded pairwise distances are the most dissimilar to the input pairwise distances. As these specific discrepancies impact the estimation of the distorsion rate of the embedding, the designed loss has been tailored to reduce the distorsion rate comparatively to those of the initial embeddings. In practice, the embedding functions have been implemented in the following way

where I is the identity matrix, and are trainable neural networks based embeddings and and are data-dependent low-dimensional projections that preserves (local) distances. Typically, for the h functions, we have considered in our experiments algorithms like MDS or LLE. Our choice of the embedding dimension can be then principally giving by the d-largest number of eigenvalues of the spectral decomposition arising within these algorithms.

So the learning problem described in Equation (4) involves a max-min problem over the Wasserstein distance of the mapped samples. For solving the problem, we have adopted an alternate optimization strategy where for each mini-batch of samples from , we first optimize and and then optimize the embeddings for fixed optimal . In practice, the sub-problem with respects to r and is an invariant OT problem and can be solved using the algorithm proposed by Alvarez-Melis et al. (2019). and is implemented as two fully connected neural networks with leaky ReLU activation functions and no bias. They are optimized using stochastic gradient descent using Adam as optimizer. Some details of the algorithms is provided in Algorithm 1.

4.2 Toy example

In this example, we extracted randomly n = m = 1000 samples from MNIST, USPS and Fashion MNIST data sets, denoted by X, Y and Z. We compare GW distances between three possible matchings with the assorted SERW distances. We pre-process the data in order to fix the parameter and as discussed previously. We then vary the dimension of the embedded points from logup to the smallest dimension of the original samples. We perform the embeddings by using LLE followed by a non-linear embedding scheme aiming at minimizing the distance distortion as described in Section 4.1.

In Figure 2 we report plots of the distortion rate, the additive constant in the upper bound in Proposition 5, and the distance ratio of SERW for the three data sets X, Y and Z. As can be seen the rates decrease as the embedding dimension increase. Note that to determine the distortion coefficient for each given embedded dimension, we compute the quotient of the pairwise distances both in the original and embedding spaces. Thus, this high magnitudes of the upper bounds are due to a “crude” estimation of the distortion rate. One may explore a better estimation to reach a tighter upper bound. For this toy set, we investigate a useful property in our approach called proximity preservation, a property stating that:

In order to confirm this property, we compute the ratio between for various embeddings and compare the resulting order with the quantities )

Figure 2: Plots of the distortion rates (left); various bounds in Proposition 5 (middle); GW cost and the distance ratio of SERW between the data points (right); as a function of the target dimension of embedded data d. The bold points in the right panel correspond to ) (red) and ) (blue). Note that the distance ratio (red curve) is upper than the blue one which highlights that the proximity preservation for SERW distance is fulfilled as the embbed dimension d goes up.

and ). As seen in Figure 2, while the ratios vary their order is often preserved for large embedding dimensions.

4.3 Meshes comparison

GW distance is frequently used in computer graphics for computing correspondence between meshes. Those distances are then exploited for organizing shape collection, for instance for shape retrieval or search. One of the useful key property of GW distance for those applications is that it is isometryinvariant. In order to show that our proposed approach approximately satisfies this property, we reproduce an experiment already considered by Solomon et al. (2016) and Vayer et al. (2019).

We have at our disposal a time-series of 45 meshes of galloping horses. When comparing all meshes with the first one, the goal is to show that the distance presents a cyclic nature related to galop cycle. Each mesh is composed of 8400 samples and in our case, we have embedded them into a 2-dimensional space using a multi-dimensional scaling algorithm followed by a non-linear embedding aiming at minimizing distortion as described in Section 4.1.

Figure 3 shows the (centered and max-normalized) distances between meshes we obtain with SERW and with a Sliced Gromov-Wasserstein (SGW) (Vayer et al., 2019) and a genuine GW distance. In both cases, due to the random aspect of the first two algorithms, distances are averaged over 10 runs. We note that our approach is able to recover the cyclic nature of the galloping motion as described by GW distance.

To quantify the cyclic nature of the galloping motion as described by GW, SERW and SGW, we compute the dynamic time warping measures (DTW) (Berndt and Clifford, 1994). DTW aims to find an optimal alignment between two given signals. Figure 4 illustrates that the couple distances (GW, SERW) and (GW, SGW) behave well according to the recovery of galloping motion. Indeed, the alignment suggested by DTW in both cases is almost diagonal showing that the motion curves provided by SERW and SGW are in accordance with the obtained motion curve via the plain GW. We further compute the mean and the standard deviation of DTW measures between GW and SERW one

Figure 3: GW, SGW, and SERW distances between 3D meshes of galloping horses. We can note that both SGW and SERW distances are able to retrieve the cyclic nature of the movement (see also Figure 4).

the one hand, and between GW and SGW distances on the other hand, over the 10 runs. We get the following: DTW(GW, SERW) = 10DTW(GW, SGW) = 1089. This emphasizes the fact that SERW retrieves the cyclic nature of the movement as does by GW.

Figure 4: Alignment of the couple distances (GW, SERW) and (GW, SGW) between 3D meshes images as yielded by DTW measure. Optimal alignment is diagonal. On the right and bottom of each panel are the compared curves.

4.4 Text-Image alignment

To show that our proposed also provides relevant coupling when considering out-of-the-shelves embeddings, we present here results on aligning text and images distributions. The problem we address is related to identifying different states of objects, scene and materials (Isola et al., 2015). We have images labeled by some nouns modified by some adjectives describing state of the objects. In our experiment, we want to show that our approach provides coupling between labels and images semantically meaningful as those obtained by a Gromov-Wasserstein approach. As for proof of concept, from the 115 available adjectives, we have considered only three of them ruffled, weathered, engraved and extracted all the classes associated with those adjectives. In total, we obtain 109 different classes of objects and about 525 images in total (as each class contains at most 5 objects).

The composed name (adjective + noun) of each label is embedded into using a word vector representation issued by fasttext model (Grave et al., 2018) trained on the first 1 billion bytes of English Wikipedia according to Mikolov et al. (2018). The 256 256 images have been embedded into a vector of dimension 4096 using a pre-trained VGG-16 model (Simonyan and Zisserman, 2015). These embeddings are extracted from the first dense layer of a VGG-16. The Gromov-Wasserstein distance of

Figure 5: Coupling matrices between text and image embeddings. (top) Gromov-Wasserstein coupling matrix obtained in the original embedding spaces (bottom) our SERW coupling matrix after projecting embeddings into same dimension space.

those embeddings has been computed for coupling labels and images in the two different embedding spaces. For our SERW approach, we have further reduce the dimension of the image embeddings using Isomap with 100 dimensions. When computing the distance matrix, objects have been organized by class of adjectives for an easy visual inspection.

Figure 5 presents coupling matrices obtained using GW and our SERW. Since in both cases, the distance is not approximated by the Sinkhorn algorithm, the obtained matching is not smooth. Our results show that both GW and SERW distances are able to retrieve the 3 classes of adjectives and matches appropriate images with the relevant labels. Figure 6 illustrates the best matched images by GW and SERW (according to the transportation map) to the texts Engraved Copper and Engraved Metal. We can remark that in both cases GW and SERW do not suggest the same images. However, the retrieved images are meaningful according to the text queries. We shall notice that the embeddings used by SERW do not distort the discriminative information, leading to interesting matched images as shown by the last row of Figure 6.

5 Conclusion

In this paper we introduced the SERW distance for distribution alignment lying in different mm-spaces. It is based on metric measure embedding of the original mm-spaces into a common Euclidean space and computes an optimal transport on the (low-distorted) embedded distributions. We prove that SERW defines a proper distance behaving like GW distance and we further show a cost relation between SERW and GW. Some of numerical experiments are tailored using a fully connected neural network to learn the maximization problem defining SERW, while other ones are conducted with fixed embeddings. In particular, SERW can be viewed as an embedding-dependent alignment for distributions coming from different mm-spaces, that its quality is strongly dependent on the given embeddings.

Acknowledgments

The works of Maxime B´erar, Gilles Gasso and Alain Rakotomamonjy have been supported by the OATMIL ANR-17-CE23-0012 Project of the French National Research Agency (ANR).

Figure 6: Best matched images obtained through GW transportation plan, and our SERW distance. The first block of images correspond to the class Engraved Copper and the second one to Engraved Metal. Within each block, the top row shows the results of GW and the bottom row illustrates the matching proposed by SERW.

A Proofs

In the proofs, we frequently use the two following lemmas. Lemma 7 writes an integration result using push-forward measures; it relates integrals with respect to a measure and its push-forward under a measurable map Lemma 8 proves that the admissible set of couplings between the embedded measures are exactly the embedded of the admissible couplings between the original measures.

be a measurable mapping, let be a measurable measure on S, and let g be a measurable function on

Proof of Lemma 8. See Paty and Cuturi (2019).

A.1 Proof of Proposition 1

To let it more readable, the proof is divided into 5 steps summarized as follows: first step is for metric property of ); second one shows completeness of ); third establishes the totally boundedness of ), namely that one can recover this space using balls centred on a finite number of embedding points; the last is a conclusion using Arzela-Ascoli’s Theorem for characterizing compactness of subsets of functional continuous space.

Since the arguments of the proof are similar for the two spaces, we only focus on proving the topological property of ). Let us refresh the memories by some results in topology: we denote ) the set of all continuous mappings of X into () and recall the notions of totally boundedness in order to characterize the compactness of (). The material here is taken from Kubrusly (2011) and O’Searcoid (2006).

Definition 6. i) (Totally bounded) Let A be a subset of a metric space (of A is an if for every point s of A there exists a point totally bounded (precompact) in (if for every real number there exists a finite ii) (Pointwise totally bounded) A subset is pointwise totally bounded if for each is totally bounded in T.

iii) (Equicontinuous) A subset S of C(S, T) is equicontinuous at a point if for each exists a

Proposition 6. If S is a metric space, then S is compact if and only if S is complete and totally bounded.

Let )) consisting of all continuous bounded mappings of S into (endowed with the supremum metric )). Proving the totally boundedness of some topological spaces may need more technical tricks. Fortunately, in our case we use Arzel`a–Ascoli Theorem that gives compactness criteria for subspaces of )) in terms of pointwise totally bounded and equicontinuous, namely they are a necessary and sufficient condition to guarantee that the totally boundedness of a subset

Theorem 1. (Arzel`a–Ascoli Theorem) If S is compact, then a subset of the metric space is totally bounded if and only if it is pointwise totally bounded and equicontinuous.

The proof is devided on 5 Steps:

• It is clear that for all ), Γ0 (nonegativeness) and Γ) = 0 if and only if . To verify the triangle inequality, we proceed as follows. Take and arbitrary and note that, if are embeddings in triangle inequality in the Euclidean space

hence Γ), and therefore () is a metric space.

• First recall that for each ) is a -embedding then it is Lipshitizian mapping. It is readily verified that every Lipshitizian mapping is uniformly continuous, that is for each real number 0 there exists a real number 0 such that implies . So it is sufficient to take

• The proof of this step is classic in the topology literature of the continuous space endowed with the supremum metric. For the sake of completeness, we adapt it in our case. Let be a Cauchy sequence in (Thus is a Cauchy sequence in () for every This can be as follows: Γ) for each pair of integers k, k and every , and hence converges in for every (since is complete). Let ) = lim) for each (i.e., )) in , which defines a a mapping . We shall show that to ), thus proving that () is complete. Note that for any integer n and every pair of points by the triangle inequality. Now take an arbitrary real number 0. Since is a Cauchy sequence in (), it follows that there exists a positive integer such that Γ(and hence for all , whenever . Moreover, since ) in , and the Euclidean distance is a continuous function from the metric space the metric space R for each , it also follows that for each positive integer k and every Thus for all whenever . Furthermore, since each lies in (), it follows that there exists a real number ) such that sup Therefore, for any 0 there exists a positive integer such that ) for all so that ), and Γ) = supwhenever so that converges to in (

• ) in Definition 6 and the details in ) is readily equicontinous. Next we shall prove that the subset : is totally bounded in . To proceed we use another result characterizing totally boundness that reads as:

(is totally bounded metric space if and only if every sequence in S has a Cauchy subsequence.

Since for any ) is Lipshitizian then it is uniformly continuous as explained above. Furthermore uniformly continuous functions have some very nice conserving properties. They map totally bounded sets onto totally bounded sets and Cauchy sequences onto Cauchy sequences.

Now suppose that Suppose is any sequence in ) . For each , the subset is non empty for each (Axiom of Countable Choice see O’Searcoid (2006)). Then . By the Cauchy criterion for total boundedness of X, the sequence has a Cauchy subsequence . Then, by what we have just proved, is a Cauchy subsequence of Since is an arbitrary sequence in satisfies the Cauchy criterion for total boundedness and so is totally bounded.

• Using Arzela-Ascoli Thereom 1 and Step 2 we conclude that ) is totally bounded. Together with ) is compact.

A.2 Proof of Lemma 1

Notice that for

It can be seen easily using the facts that

where by Assumption 1. Now, thanks to Lemmas 7 and 8, we have

A.3 Proof of Lemma 2

In one hand, for any fixed ) the application is continuous. To show that, we use the continuity under integral sign Theorem. Indeed,

• for the mapping (is continuous. To show that fix 0, and We endow the product sapce by the metric Γdefined as follows:

Letting , then if Γ((, one hasThis yields that lim=

• for a fixed () and (we have +

Therefore, the family (is continuous then it is upper semicontinuous. We know that the pointwise infimum of a family of upper semicontinuous functions is upper semicontinuous (see Lemma 2.41 in Aliprantis and Border (2006)). This entails infis upper semicontinuous. Since the product of compact sets is a compact set (Tychonoff Theorem), then ) is compact, hence sup) attains a maximum value (see Theorem 2.44 in Aliprantis and Border (2006)). So, there exits a couple of embeddings (such that sup) = ) for all Finally, it is easy to show that ) is continuous, hence the infimum over the orthogonal mappings (compact) exits.

A.4 Proof of Lemma 3

Let us recall first the minimax inequality:

Lemma 9. (Minimax inequality) Let Ξ : be a function. Then

Using minimax inequality, one has

Note that for a fixed ) exits (continuity of ) + compact set as shown in Proof of Lemma 2). Then

Thus

A.5 Proof of Lemma 4

As we proved in Lemma 2 that for any fixed is continuous, then it is lower semicontinous. The pointwise supremum of a family of lower semicontinuous functions is lower semicontinuous (Lemma 2.41 in Aliprantis and Border (2006)) Moreover, the pointwise infimum of a compact family of lower semicontinuous functions is lower semicontinuous (here is compact) then ) is lower semicontinuous Furthermore Π() is compact set with respect to the topology of narrow convergence (Villani, 2003), then inf) exists (see Theorem 2.44 in Aliprantis and Border (2006)).

A.6 Proof of Proposition 2

Theorem 2. (Cram´er and Wold, 1936) Let be Borel probability measures on and agree at every open half-space of X. Then . In other words if, for = : = 1} and we write : and if ) = for all and then one has

The fundamental Cram´er-Wold theorem states that a Borel probability measure is uniquely determined by the values it gives to halfspaces : for and Equivalently, is uniquely determined by its one-dimensional projections (∆, where ∆is the projection

Straightforwardly, we have

Analogously, we prove that (()) = Therefore, for all and Borels, we have

• “” Thanks to Lemma 4 in the core of the paper, there exists a couple of embeddings () and optimum for We assume now that

On the other hand, it is clear that () =Using the fact that -embedding then we get

A.7 Proof of Proposition 3

Symmetry is clear for both objects. In order to prove the triangle inequality, we use a classic lemma known as “gluing lemma” that allows to produce a sort of composition of two transport plans, as if they are maps.

Lemma 10. (Villani, 2003) Let X, Y, Z be three Polish spaces and let be such that ∆is the natural projection from there exists a measure

Let ). By the gluing lemma we know that there exists ) such that ∆). On the other hand

where ˜) (). Hence, we end up with the desired result,

A.8 Proof of Proposition 4

As the embedding is Lipschitizian then it is continuous. Since X is compact hence ) is also compact. Consequently supp[) is compact (closed subset of a compact). The same observation is fulfilled by supp[). Letting () is compact metric space and are Borel probability measures on Z. Thanks to Theorem 5 (property (c)) in M´emoli (2011), we have that

Together with the minimax inequality we arrive at

Using the fact that

where ) := inf). Using Bourgain’s embedding theorem Bourgain (1985),

In another hand, we have

A.9 Proof of Proposition 5

The proof of this proposition is based on a lower bound for the Gromov-Wasserstein distance (Proposition 6.1 in M´emoli (2011)):

where ) =defines an eccentricity function. Note that

where

We observe that

Note that

Finally,

where

This finishes the proof.

A.10 Proof of Lemma 5

Since is compact set and the mapping ) is continuous, then there exists such that inf). Using Lemma 8, we then get

between and . Hence ) = 0 if and only if that is = (. On the other hand, one has

The triangle inequality follows the same lines as proof of Proposition 3.

References

Alaya, M. Z., M. B´erar, G. Gasso, and A. Rakotomamonjy (2019). Screening Sinkhorn algorithm for regularized optimal transport. In H. Wallach, H. Larochelle, A. Beygelzimer, F. Alch´e-Buc, E. Fox, and R. Garnett (Eds.), Advances in Neural Information Processing Systems 32, pp. 12169–12179. Curran Associates, Inc.

Aliprantis, C. and K. Border (2006). . Springer Berlin Heidelberg.

Altschuler, J., J. Weed, and P. Rigollet (2017). Near-linear time approximation algorithms for optimal transport via sinkhorn iteration. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Advances in Neural Information Processing Systems 30, pp. 1964–1974. Curran Associates, Inc.

Alvarez-Melis, D. and T. Jaakkola (2018). Gromov–Wasserstein alignment of word embedding spaces. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1881–1890. Association for Computational Linguistics.

Alvarez-Melis, D., S. Jegelka, and T. S. Jaakkola (2019, 16–18 Apr). Towards optimal transport with global invariances. In K. Chaudhuri and M. Sugiyama (Eds.), Proceedings of Machine Learning Research, Volume 89 of Proceedings of Machine Learning Research, pp. 1870–1879. PMLR.

Arjovsky, M., S. Chintala, and L. Bottou (2017). Wasserstein generative adversarial networks. In D. Precup and Y. W. Teh (Eds.), Proceedings of the 34th International Conference on Machine Learning, Volume 70 of Proceedings of Machine Learning Research, International Convention Centre, Sydney, Australia, pp. 214–223. PMLR.

Berndt, D. J. and J. Clifford (1994). Using dynamic time warping to find patterns in time series. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, AAAIWS’94, pp. 359–370. AAAI Press.

Bonneel, N., M. van de Panne, S. Paris, and W. Heidrich (2011). Displacement interpolation using lagrangian mass transport. ACM Trans. Graph. 30(6), 158:1–158:12.

Bourgain, J. (1985). On lipschitz embedding of finite metric spaces in hilbert space. Israel Journal of Mathematics 52.

Brinkman, B. and M. Charikar (2005, September). On the impossibility of dimension reduction in l1. J. ACM 52(5), 766–788.

Bronstein, A. M., M. M. Bronstein, R. Kimmel, M. Mahmoudi, and G. Sapiro (2010). A Gromov– Hausdorff framework with diffusion geometry for topologically-robust non-rigid shape matching. International Journal of Computer Vision 89(2-3), 266–286.

Bunne, C., D. Alvarez-Melis, A. Krause, and S. Jegelka (2019, 09–15 Jun). Learning generative models across incomparable spaces. In K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of the 36th International Conference on Machine Learning, Volume 97 of Proceedings of Machine Learning Research, Long Beach, California, USA, pp. 851–861. PMLR.

Chowdhury, S. and F. M´emoli (2018). The Gromov–Wasserstein distance between networks and stable network invariants. CoRR abs/1808.04337.

Courty, N., R. Flamary, and M. Ducoffe (2018). Learning Wasserstein embeddings. In International Conference on Learning Representations (ICLR).

Courty, N., R. Flamary, D. Tuia, and A. Rakotomamonjy (2017). Optimal transport for domain adaptation. IEEE transactions on pattern analysis and machine intelligence 39(9), 1853–1865.

Cram´er, H. and H. Wold (1936). Some theorems on distribution functions. Journal of the London Mathematical Society s1-11(4), 290–294.

Cuturi, M. (2013). Sinkhorn distances: Lightspeed computation of optimal transport. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 26, pp. 2292–2300. Curran Associates, Inc.

Frogner, C., C. Zhang, H. Mobahi, M. Araya, and T. A. Poggio (2015). Learning with a Wasserstein loss. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), Advances in Neural Information Processing Systems 28, pp. 2053–2061. Curran Associates, Inc.

Genevay, A., M. Cuturi, G. Peyr´e, and F. Bach (2016). Stochastic optimization for large-scale optimal transport. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), Advances in Neural Information Processing Systems 29, pp. 3440–3448. Curran Associates, Inc.

Gower, J. C. (1975). Generalized procrustes analysis. Psychometrika 40(1), 33–51.

Grave, E., P. Bojanowski, P. Gupta, A. Joulin, and T. Mikolov (2018). Learning word vectors for 157 languages. arXiv preprint arXiv:1802.06893.

Grave, E., A. Joulin, and Q. Berthet (2019, 16–18 Apr). Unsupervised alignment of embeddings with wasserstein procrustes. In K. Chaudhuri and M. Sugiyama (Eds.), Proceedings of Machine Learning Research, Volume 89 of Proceedings of Machine Learning Research, pp. 1880–1890. PMLR.

Gromov, M., J. Lafontaine, and P. Pansu (1999). Metric Structures for Riemannian and NonRiemannian Spaces:. Progress in Mathematics - Birkh¨auser. Birkh¨auser.

Grover, A. and J. Leskovec (2016). Node2Vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864.

Ho, N., X. L. Nguyen, M. Yurochkin, H. H. Bui, V. Huynh, and D. Phung (2017). Multilevel clustering via Wasserstein means. In Proceedings of the 34th International Conference on Machine Learning -Volume 70, ICML’17, pp. 1501–1509. JMLR.org.

Isola, P., J. J. Lim, and E. H. Adelson (2015). Discovering states and transformations in image collections. In CVPR.

Johnson, W. B. and J. Lindenstrauss (1984). Extensions of lipschitz mappings into hilbert space. Contemporary mathematics 26, 189–206.

Kantorovich, L. (1942). On the transfer of masses (in russian). Doklady Akademii Nauk 2, 227–229.

Kolouri, S., S. R. Park, M. Thorpe, D. Slepcev, and G. K. Rohde (2017, July). Optimal mass transport: Signal processing and machine-learning applications. IEEE Signal Processing Magazine 34(4), 43–59.

Kruskal, J. B. and M. Wish (1978). Multidimensional scaling. Number 11. Sage.

Kubrusly, C. (2011). Elements of Operator Theory. Birkh¨auser Boston.

Kusner, M., Y. Sun, N. Kolkin, and K. Weinberger (2015, 07–09 Jul). From word embeddings to document distances. In F. Bach and D. Blei (Eds.), Proceedings of the 32nd International Conference on Machine Learning, Volume 37 of Proceedings of Machine Learning Research, Lille, France, pp. 957–966. PMLR.

Matouˇsek, J. (1996). On the distortion required for embedding finite metric spaces into normed spaces. Israel Journal of Mathematics 93(1), 333–344.

Matouˇsek, J. (2002). Embedding Finite Metric Spaces into Normed Spaces, pp. 355–400. New York, NY: Springer New York.

Matouˇsek, J. (2013). Lecture notes on metric embeddings. Technical Report.

M´emoli, F. (2008). Gromov–Hausdorff distances in Euclidean spaces. In 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 1–8.

M´emoli, F. (2011). Gromov–Wasserstein distances and the metric approach to object matching. Foundations of Computational Mathematics 11(4), 417–487.

Mikolov, T., E. Grave, P. Bojanowski, C. Puhrsch, and A. Joulin (2018). Advances in pre-training distributed word representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

Mikolov, T., I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119.

Monge, G. (1781). M´emoire sur la th´eotie des d´eblais et des remblais. des Sciences, 666–704.

Narayanan, A., M. Chandramohan, R. Venkatesan, L. Chen, Y. Liu, and S. Jaiswal (2017). Graph2Vec: Learning distributed representations of graphs. arXiv preprint arXiv:1707.05005.

O’Searcoid, M. (2006). Metric Spaces. Springer Undergraduate Mathematics Series. Springer London.

Paty, F.-P. and M. Cuturi (2019). Subspace robust Wasserstein distances. In K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of the 36th International Conference on Machine Learning, Volume 97 of Proceedings of Machine Learning Research, Long Beach, California, USA, pp. 5072–5081. PMLR.

Peyr´e, G., M. Cuturi, and J. Solomon (2016). Gromov–Wasserstein averaging of kernel and distance matrices. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pp. 2664–2672. JMLR.org.

Peyr´e, G. and M. Cuturi (2019). Computational optimal transport. Machine Learning 11(5-6), 355–607.

Roweis, S. T. and L. K. Saul (2000). Nonlinear dimensionality reduction by locally linear embedding. science 290(5500), 2323–2326.

Schilling, R. L. (2005). Measures, Integrals and Martingales. Cambridge University Press.

Simonyan, K. and A. Zisserman (2015). Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations.

Solomon, J., F. de Goes, G. Peyr´e, M. Cuturi, A. Butscher, A. Nguyen, T. Du, and L. Guibas (2015). Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Trans. Graph. 34(4), 66:1–66:11.

Solomon, J., G. Peyr´e, V. G. Kim, and S. Sra (2016). Entropic metric alignment for correspondence problems. ACM Transactions on Graphics (TOG) 35(4), 1–13.

Sturm, K. T. (2006). On the geometry of metric measure spaces. ii. Acta Math. 196(1), 133–177.

Tenenbaum, J. B., V. De Silva, and J. C. Langford (2000). A global geometric framework for nonlinear dimensionality reduction. science 290(5500), 2319–2323.

Vayer, T., R. Flamary, N. Courty, R. Tavenard, and L. Chapel (2019). Sliced Gromov–Wasserstein. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alch´e-Buc, E. Fox, and R. Garnett (Eds.), Advances in Neural Information Processing Systems 32, pp. 14726–14736. Curran Associates, Inc.

Villani, C. (2003). Topics in Optimal Transportation. Graduate studies in mathematics. American Mathematical Society.

Villani, C. (2009). Optimal Transport: Old and New, Volume 338 of Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg.

Xu, H., D. Luo, and L. Carin (2019). Scalable Gromov–Wasserstein learning for graph partitioning and matching. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pp. 3046–3056.