Optimal transport (OT) has been gaining in recent years an increasing attention in the machine learning community. This success is due to its capacity to exploit the geometric property of the samples at hand. Generally speaking, OT is a mathematical tool to compare distributions by computing a transportation mass plan from a source to a target distribution. Distances based on OT are referred to as the Monge-Kantorovich or Wasserstein distances (Villani, 2009) and have been successfully employed in a wide variety of machine learning applications including clustering (Ho et al., 2017), computer vision (Bonneel et al., 2011; Solomon et al., 2015), generative adversarial networks (Arjovsky et al., 2017) or domain adaptation (Courty et al., 2017).
A key limitation of Wasserstein distance is that it relies on the assumption of aligned distributions, namely they must belong to the same ground space or that at least a meaningful distance across domains can be computed. Nevertheless, source and target distributions can be collected under distinct environments, representing different times of collection, contexts or measurements (see Fig. 1, left and right). To get benefit from OT on such heterogeneous distribution settings, one can compute the Gromov-Wasserstein (GW) distance (Sturm, 2006; Mémoli, 2011) to overcome the lack of intrinsic correspondence between the distribution spaces. GW extends Wasserstein by computing a distance between metrics defined within each of the source and target spaces. From a computational point view, it involves a non convex quadratic problem, hard to lift to large scale settings (Peyré and Cuturi, 2019). A remedy to a such heavy computation burden lies in a prevalent approach referred to as regularized OT (Cuturi, 2013), allowing one to add an entropic regularization penalty to the original problem. Peyré et al. (2016) proposed the entropic GW discrepancy, that can be solved by Sinkhorn iterations (Cuturi, 2013; Benamou et al., 2015).
A major bottleneck of OT in its traditional formulation is that it requires the two input measures to have the same total probability mass and/or that all the mass has to be transported. This is too restrictive for many applications since mass changes may occur due to a creation or an annihilation while computing an OT plan. To tackle this limitation, one may employ strategies such as partial or unbalanced transport (Guittet, 2002; Figalli, 2010; Caffarelli and McCann, 2010). Chizat et al. (2018) propose to relax the marginal constraints of unbalanced total masses using divergences such as Kullback-Leibler or Total Variation, allowing the use of generalized Sinkhorn iterations. Yang and Uhler (2019) generalize this approach to GANs and Lee et al. (2019) present an ADMM algorithm for the relaxed partial OT. Most of all these approaches concentrate on partial-Wasserstein.
This paper deals with exact partial Wassertein (partial-W) and Gromov-Wasserstein (partial-GW). Standard strategies for computing such partial-W require relaxations of the marginals constraints. We rather build our approach upon adding virtual or dummy points onto the marginals, in the same spirit as Caffarelli and McCann (2010). These points are used as a buffer when comparing distributions with different probability masses, allowing partial-W to boil down to solving an extended but standard Wasserstein problem. The main advantage of this approach is that it leads to computing sparse transport plans and hence exact partial-W or GW distances instead of regularized discrepancies obtained by running Sinkhorn algorithms. Regarding partial-GW, our approach relies on a FrankWolfe optimization algorithm (Frank and Wolfe, 1956) that builds on computations of partial-W.
Tackling partial-OT problems that preserve sparsity is motivated by the fact that they are more suitable to some applications such as the Positive-Unlabeled (PU) learning, see Bekker and Davis (2020) for a review, we target in this paper. We shall notice that this is the first application of OT for solving PU learning tasks. In a nutshell, PU classification is a variant of binary classification problem, in which we have only access to labeled samples from positive (Pos) class in the training stage. The aim is to assign classes to the points of an unlabeled (Unl) set which mixes data from both positive and negative classes. Using OT allows identifying the positive points within Unl, even when Pos and Unl samples do not lie in the same space (see Fig. 1).
The paper is organized as follows: we first recall some background on OT. In Section 3, we propose an algorithm to solve an exact partial-W problem, together with a Frank-Wolfe based algorithm to compute the partial-GW solution. After describing in more details the PU learning task and the use of partial-OT to solve it, we illustrate the advantage of partial-GW when the source and the target distributions are collected onto distinct environments. We finally give some perspectives.
Notations is an histogram of N bins withand is the Dirac function. Let -dimensional vector of ones. stands for the Frobenius dot product. |p| indicates the length of vector p.
Let be two point clouds representing the source and target samples, respectively. We assume two empirical distributions
where are histograms of |p| = n and |q| = m bins respectively. The set of all admissible couplings between histograms is given by
Figure 1: (Left) Source (in black) and target (in blue) distributions that have been collected under distinct environments. The source domain contains only positive points (o) whereas the target domain contains both positives and negatives (+) (Middle left) Partial-W fails to assign correctly all the labels in such context, red symbols indicating wrong assignments (Middle right) Partial-GW recovers the correct labels of the unlabeled samples, with a consistent transportation plan (gray lines), even when the dataset do not live in the same state space (Right).
where is a coupling matrix with an entry that describes the amount of mass at flowing toward the mass
OT addresses the problem of optimally transporting p toward q, given a cost measured as a geometric distance between . More precisely, when the ground cost a distance matrix, the p-Wassertein distance on at the power of p is defined as:
In some applications, the two distributions are not registered (i.e. we can not compute a ground cost between and ) or do not lie in the same underlying space. The Gromov-Wasserstein distance addresses this bottleneck by extending the Wasserstein distance to such settings, also allowing invariance to translation, rotation or scaling. It relies on intra-domain distance matrices of source and target , and is defined as in Mémoli (2011):
3.1 Partial Wasserstein distance
The previous OT distances require the two distributions to have the same total probability mass and that all the mass has to be transported. This may be a problematic assumption where some mass variation or partial mass displacement should be handled. The partial OT problem focuses on transporting only a fraction of the mass as cheaply as possible. In that case, the set of admissible couplings becomes
and the partial-W distance reads as This prob- lem has been studied by Caffarelli and McCann (2010); Figalli (2010); numerical solutions has notably been provided by Benamou et al. (2015); Chizat et al. (2018) in the entropic-regularized Wasserstein case. We propose here to solve the exact partial-W problem by adding dummy or virtual points (with any features) and extending the cost matrix as follows:
in which and is a bounded scalar. When the mass of these dummy points is set such that and , computing partial-W distance boils down to solving a unconstrained problem , where and . The intuitive derivation of this equivalent formulation is exposed in Appendix A.1.
Proposition 1 Assume that is bounded, one has
and the optimum transport plan of the partial Wasserstein problem is the optimum transport plan deprived from its last row and column.
The proof is postponed to Appendix A.2.
3.2 Partial Gromov-Wasserstein
We are now interested in the partial extension of Gromov-Wasserstein. In the case of a quadratic cost, p = 2, and the partial-GW problem writes as
where
This loss function is non-convex and the couplings feasibility domain is convex and compact. One may expect to introduce virtual points in the GW formulation in order to solve the partial-GW problem. Nevertheless, this strategy is no longer valid as it involves pairwise distances that do not allow the computations related to the dummy points to be isolated (see Appendix A.3).
In the following, we build upon a Frank-Wolfe optimization scheme (Frank and Wolfe, 1956) a.k.a. the conditional gradient method (Demyanov and Rubinov, 1970). It has received significant renewed interest in machine learning (Jaggi, 2013; Lacoste-Julien and Jaggi, 2015) and in OT community, since it serves as a basis to approximate penalized OT problems (Ferradans et al., 2013; Courty et al., 2017) or GW distances (Peyré et al., 2016; Vayer et al., 2018). Our proposed Frank-Wolfe iterations strongly rely on computing partial-W distances and as such, achieve a sparse transport plan (Ferradans et al., 2013).
Let us first introduce some additional notations. For any tensor we denote by the matrix in such that its (i, j)-th element is defined as
for all i = 1, . . . , n, j = 1, . . . , m. Introducing the 4-th order tensor we notice that , following Peyré et al. (2016), can be written as
The Frank-Wolfe algorithm for partial-GW is shown in Algorithm 1. Like classical Frank-Wolfe procedure, it is summarized in three steps for each iteration k, as detailed below. A theoretical study of the convergence of the Frank-Wolfe based algorithm for partial-GW is given in Appendix B.2. Also detailed derivation of the line search step is provided in the supplementary material.
Step1 Compute a linear minimization oracle over the set
To do so, we solve an extended Wasserstein problem with the ground metric as in eq. (1):
and get by removing its last row and column.
Step2 Determine optimal step-size
It can be shown that can take the following values, with
• if
• if
We hereafter investigate the application of partial optimal transport for learning from Positive and Unlabeled (PU) data. After introducing the PU learning, we present how to adapt the formulation of partial-OT to this setting.
4.1 Overview of PU learning
Learning from PU data is a variant of classical binary classification problem, in which the training data consist of only positive points, and the test data is composed of unlabeled positives and negatives. Let be the positive samples drawn according to the conditional distribution p(x|y = 1) and the unlabeled set sampled according to the marginal . The true proportion of positives, called class prior, is is the distribution of negative samples which are all unlabeled.The goal is to learn a binary classifier solely using Pos and Unl. A broad overview of existing PU learning approaches can be seen in (Bekker and Davis, 2020).
Most PU learning methods commonly rely on the selected completely at random (SCAR) assumption (Elkan and Noto, 2008), which assumes that the labeled samples are drawn at random among the positive distribution, independently of their attributes. Nevertheless, this assumption is often violated in real-case scenarios and PU data are often subject to selection biases, e.g. when part of the data may be easier to collect. Recently, a less restrictive assumption has been studied: the selected at random (SAR) setting (Bekker and Davis, 2018) which assumes that the positives are labeled according to a subset of features of the samples. Kato et al. (2019) move a step further and consider that the sampling scheme of the positives is such that p(o = 1|x, y = 1) (o = 1 means observed label) preserves the ordering over the samples induced by the posterior distribution p(y = 1|x, y = 1). Other approaches, as in (Hsieh et al., 2019), consider a classical PU learning problem adjuncted with a small proportion of observed negative samples. Those negatives are selected with bias following the distribution
4.2 PU learning formulation using partial optimal transport
We propose in this paper to use partial optimal transport to perform PU learning. In that context, the unlabeled points Unl represent the source distribution X and the positive points Pos are the target dataset Y. We set the total probability mass to be transported as the proportion of positives in the unlabeled set. As such, the transport matrix T should be such that the unlabeled positive points are mapped to the positive samples (as they have similar features or intra-domain distance matrices) while the negatives are discarded (in our context, they are not tranported at all). We look for an optimal transport plan that belongs to the following set of couplings, assuming ,
in which means that exactly or 0, to avoid matching a unlabeled negative with a positive. This set is not empty as long as . Though the positive samples Pos are assumed easy to label, their features may be corrupted with noise or they may be mislabeled. Let assume , the noise level.
To solve the Wasserstein problem with the admissible constraint set (6) related to this PU learning, we adopt a regularized point of view of the partial-OT problem as in Courty et al. (2017):
where , , and are defined as in Section 3.1, is a regularization parameter and is the percentage of Pos that we assume to be noisy (that is to say we do not want to match them with Unl). We choose contains the indices of the columns of that correspond to either the positives (g = [1, m]) or the dummy point (g = m + 1). This group-lasso regularization helps enforcing the constraint . Also notice that the definition of and allows ensuring that only a probability mass of is moved towards Pos. When partial-GW is involved, we use this regularized-OT in the step (i) of the Frank-Wolfe algorithm.
We can establish that solving problem (7) provides the solution to PU learning using partial-OT.
Proposition 2 Assume that is bounded, there exists a large such that:
where and min
The proof is postponed to Appendix C.
5.1 Experimental design
We illustrate the behavior of partial-W and -GW on real datasets in a PU learning context. First, we consider a SCAR assumption, then a SAR one and finally a more general setting, in which the underlying distributions of the point clouds come from different domains, or do not belong to the same metric space. Algorithm 1 has been implemented using the Python Optimal Transport (POT) toolbox (Flamary and Courty, 2017).
Following previous works (Kato et al., 2019; Hsieh et al., 2019), we assume that the class prior known throughout the experiments; otherwise, it can be estimated from off-the-shelf methods, e.g. (Ramaswamy et al., 2016; Plessis et al., 2017). For both partial-W and partial-GW, we choose p = 2 and the cost matrices C are computed using Euclidean distance.
We carry experiments on real-world datasets under the aforementioned scenarios. We rely on six datasets Mushrooms, Shuttle, Pageblocks, USPS, Connect-4, Spambase from the UCI repository1 (following Kato et al. (2019)’s setting) and colored MNIST (Arjovsky et al., 2019) to illustrate our methods in SCAR and SAR settings respectively. We also consider the Caltech office dataset, which is a common application of domain adaptation (Courty et al., 2017) to explore the effectiveness of our method on heterogeneous distribution settings. Additional illustration of the regularization term of equation (7) is given in Appendix E.
Whenever they contain several classes, these datasets are converted into binary classification problems following Kato et al. (2019), and the positives are the samples that belong to the y = 1 class. For UCI and colored MNIST datasets, we randomly draw positive and points among the remaining data. As the datasets are smaller, we choose and for Caltech office dataset. To ease the presentation, we only report results with class prior as the true proportion of positive class in the dataset. We ran the experiments 10 times and report the mean accuracy rate (standard deviations are shown in Appendix F). We test 2 levels of noise in Pos: and choose a large
For the experiments, we consider unbiased PU learning method (denoted by PU in the sequel) (Du Plessis et al., 2014) and the most recent and effective method to address PU learning with a selection bias (called PUSB below) that tries to weaken the SCAR assumption (Kato et al., 2019). Whenever possible (that is to say when source and target samples share the same features), we compare our approaches P-W and P-GW with PU and PUSB; if not, we are not aware of any competitive PU learning method able to handle different features in Pos and Unl. The GW formulation is a non convex problem and the quality of the solution is highly dependent on the initialization. We test several starting points for P-GW and report the result that gets the lowest loss (see Appendix D for the details).
5.2 Partial-W and partial-GW in a PU learning under a SCAR assumption
Under SCAR, the Pos dataset and the positives in Unl are assumed independently and identically drawn according to the distribution p(x|y = 1) from a set of positive points. We experiment on the UCI datasets and Table 1 (top) summarizes our findings. Except for Connect-4 and spambase, partial-W has similar results or consistently outperforms PU and PUSB under a SCAR assumption. Including some noise has little impact on the results, exept for the connect-4 dataset. Partial-GW has competitive results, showing that relying on intra-domain matrices may allow discriminating the classes. It nevertheless under-performs relatively to partial-W, as the distance matrix C between Pos and Unl is more informative than only relying on intra-domain matrices.
Table 1: Average accuracy rates on various datasets. (G)-PW 0 indicates no noise and (G)P-W 0.025 a noise of 0.025. Best values are indicated boldface.
5.3 Experiments under a SAR assumption
The SAR assumption supposes that Pos is drawn according to some features of the samples. To implement such a setting, we inspire from (Arjovsky et al., 2019) and we construct a colored version of MNIST: each digit is colored, either in green or red, with a probability of 90% to be colored in red. The probability to label a digit y = 1 as positive depends on its color, with only green y = 1 composing the positive set. The Unl dataset is then mostly composed of red digits. Results under this setting are provided in Table 1 (middle). When we consider a SCAR scenario, partial-W exhibits the best performance. However, its effectiveness highly drops when a covariate shift appears in the distributions p(x|y = 1) of the Pos and Unl datasets as in this SAR scenario. On the opposite, partial-GW allows maintaining a comparable level of accuracy as the discriminative information are preserved in intra-domain distance matrices.
5.4 Partial-W and -GW in a PU learning with different domains and/or feature spaces
To further validate the proposed method in a different context, we apply partial-W and partial-GW to a domain adaptation task. We consider the Caltech Office dataset, that consists of four domains: ) (Griffin et al., 2007), ) (Saenko et al., 2010). There exists a high inter-domain variability as the objects may face different illumination, orientation etc. Following a standard protocol, each image of each domain is described by a set of SURF features (Saenko et al., 2010) consisting of a normalized 800-bins histogram, and by a set of DECAF features (Donahue et al., 2014), that are 4096-dimensional features extracted from a neural network. The Pos dataset consists of images from Caltech 256. The unlabeled samples are formed by the Amazon, Webcam, DSLR images together with the Caltech 256 images that are not included in Pos. We perform a PCA to project the data onto d = 10 dimensions for the SURF features and d = 40 for the DECAF ones.
We first investigate the case where the objects are represented by the same features but belong to the same or different domains. Results are given in Table 1 (bottom). For both features, we first notice that PU and PUSB have similar performances than partial-W when the domains are the same. As soon as the two domains differ, partial-GW exhibits the best performances, suggesting that it is able to capture some domain shift. We then consider a scenario where the source and target objects are described by different features (Table 2). In that case, only partial-GW is applicable and its performances suggest that it is able to efficiently leverage on the discriminative information conveyed in the intra-domain similarity matrices, especially when using SURF features to make predictions based on DECAF ones.
Table 2: Average accuracy rates on domain adaptation scenarios described by different features. As there is little difference between the results, we report performances only for
The contributions of this work are twofold. We first show that considering an extended Wasserstein problem allows solving an exact partial-W. We then propose an algorithm relying on iterations of Frank-Wolfe method to solve a partial-GW problem, in which each iteration requires solving a partial-W problem. The second contribution relates to using partial-W and -GW distances to solve PU learning problems. We show that those distances compete and sometimes outperform the state-of-the-art PU learning methods, and that partial-GW allows remarkable improvements when the underlying spaces of the positive and unlabeled datasets are distinct or even unregistered.
While considering only features (with partial-W) or intra-domain distances (with partial-GW), this work can be extended to define a partial-Fused Gromov-Wasserstein distance (Vayer et al., 2018) that can combines both aspects. Another line of work will also focus on lowering the computational complexity by using sliced partial-GW, building on existing works on sliced partial-W (Bonneel and Coeurjolly, 2019) and sliced GW (Vayer et al., 2019). Regarding the application view point, we envision a potential use of the approach to subgraph matching (Kriege and Mutzel, 2012) or PU learning on graphs (Zhao et al., 2011) as GW has been proved to be effective to compare structured data such as graphs. Finally, we plan to derive an extension of this work to PU learning in which the proportion of positives in the dataset will be estimated in a unified optimal transport formulation, building on results of GW-based test of isomorphism between distributions Brécheteau (2019).
This work does not present any significant societal, environnemental or ethical consequence.
We have first defined a novel algorithm that gives the exact solution of the partial-(G)W problem. Several algorithms relying on a entropic regularization of the problem already exist and are known to be less computationally expensive than their exact counterpart. Nevertheless, in order to come close to a sparse solution, the regularization parameter should be very small, leading to a potential ill-conditioned optimal transport problem and a slow convergence of the algorithm. As such, the impact regarding the computational aspect is rather limited.
The second contribution is the use of OT to solve PU learning problems, even when the domains or features are not registered. This is of particular interest when the negative samples are difficult or costly to label and PU learning allows in general the decrease of the labeling burden. Domain adaptation task may be considered as an interesting task to reduce the storage and labeling cost as data coming from different domains can be reused instead of collecting new and additional data.
This work was supported by grants from the MULTISCALE ANR-18-CE23-0022 Project and the OATMIL ANR-17-CE23-0012 Project of the French National Research Agency (ANR).
Arjovsky, M., L. Bottou, I. Gulrajani, and D. Lopez-Paz (2019). Invariant risk minimization. arXiv preprint arXiv:1907.02893.
Arjovsky, M., S. Chintala, and L. Bottou (2017). Wasserstein Generative Adversarial Networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 214–223.
Bekker, J. and J. Davis (2018). Learning from positive and unlabeled data under the selected at random assumption. In Proceedings of Machine Learning Research, Volume 94, pp. 8–22.
Bekker, J. and J. Davis (2020). Learning from positive and unlabeled data: A survey. Machine Learning 109(719-760).
Benamou, J. D., G. Carlier, M. Cuturi, L. Nenna, and G. Peyré (2015). Iterative Bregman projections for regularized transportation problems. SIAM J. Scientific Computing 37.
Bonneel, N. and D. Coeurjolly (2019). Spot: sliced partial optimal transport. ACM Transactions on Graphics (TOG) 38(4), 1–13.
Bonneel, N., M. Van de Panne, S. Paris, and W. Heidrich (2011). Displacement interpolation using lagrangian mass transport. ACM Trans. Graph. 30(6), 158:1–12.
Brécheteau, C. (2019). A statistical test of isomorphism between metric-measure spaces using the distance-to-a-measure signature. Electronic Journal of Statistics 13(1), 795–849.
Caffarelli, L. A. and R. J. McCann (2010). Free boundaries in optimal transport and monge-ampère obstacle problems. Annals of Mathematics 171(2), 673–730.
Chizat, L., G. Peyré, B. Schmitzer, and F.-X. Vialard (2018). Scaling algorithms for unbalanced optimal transport problems. Math. Comput. 87, 2563–2609.
Courty, N., R. Flamary, D. Tuia, and A. Rakotomamonjy (2017). Optimal transport for domain adaptation. IEEE transactions on pattern analysis and machine intelligence 39(9), 1853–1865.
Cuturi, M. (2013). Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems 26, pp. 2292–2300.
Demyanov, V. F. and A. M. Rubinov (1970). Approximate methods in optimization problems. Elsevier Publishing Company 53.
Donahue, J., Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell (2014). Decaf: A deep convolutional activation feature for generic visual recognition. In International conference on machine learning, pp. 647–655.
Du Plessis, M., G. Niu, and M. Sugiyama (2014). Analysis of learning from positive and unlabeled data. In Advances in neural information processing systems, pp. 703–711.
Elkan, C. and K. Noto (2008). Learning classifiers from only positive and unlabeled data. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 213–220.
Ferradans, S., N. Papadakis, J. Rabin, G. Peyré, and J. F. Aujol (2013). Regularized discrete optimal transport. In A. Kuijper, K. Bredies, T. Pock, and H. Bischof (Eds.), Scale Space and Variational Methods in Computer Vision, Berlin, Heidelberg, pp. 428–439. Springer Berlin Heidelberg.
Figalli, A. (2010). The optimal partial transport problem. Archive for Rational Mechanics and Analysis 195(2), 533–560.
Flamary, R. and N. Courty (2017). Pot python optimal transport library. https://github.com/rflamary/POT.
Frank, M. and P. Wolfe (1956). An algorithm for quadratic programming. Naval Research Logistics Quarterly 3(1-2), 95–110.
Griffin, G., A. Holub, and P. Perona (2007). Caltech-256 object category dataset. Technical report, California Institute of Technology.
Guittet, K. (2002). Extended kantorovich norms: a tool for optimization. Tech. Rep. 4402.
Ho, N., X. L. Nguyen, M. Yurochkin, H. H. Bui, V. Huynh, and D. Phung (2017). Multilevel clustering via wasserstein means. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1501–1509.
Hsieh, Y., G. Niu, and M. Sugiyama (2019). Classification from positive, unlabeled and biased nega- tive data. In Proceedings of the 36th International Conference on Machine Learning, Volume 97.
Jaggi, M. (2013). Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, Number 1, pp. 427–435.
Kato, M., T. Teshima, and J. Honda (2019). Learning from positive and unlabeled data with a selection bias. In ICLR.
Kriege, N. and P. Mutzel (2012). Subgraph matching kernels for attributed graphs. In Proceedings of the 29th International Conference on Machine Learning.
Lacoste-Julien, S. (2016). Convergence rate of Frank-Wolfe for non-convex objectives. CoRR abs/1607.00345.
Lacoste-Julien, S. and M. Jaggi (2015). On the global linear convergence of frank-wolfe optimization variants. Neural Information Processing Systems 2015, 496–504.
Lee, J., N. Bertrand, and C. Rozell (2019). Parallel unbalanced optimal transport regularization for large scale imaging problems. Technical report, arXiv preprint arXiv:1909.00149. Under review.
Mémoli, F. (2011). Gromov–Wasserstein distances and the metric approach to object matching. Foundations of Computational Mathematics 11(4), 417–487.
Peyré, G. and M. Cuturi (2019). Computational optimal transport. Foundations and Trends Rin Machine Learning 11(5-6), 355–607.
Peyré, G., M. Cuturi, and J. Solomon (2016). Gromov-Wasserstein averaging of kernel and distance matrices. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, pp. 2664–2672.
Plessis, M. C., G. Niu, and M. Sugiyama (2017, April). Class-prior estimation for learning from positive and unlabeled data. Mach. Learn. 106(4), 463–492.
Ramaswamy, H., C. Scott, and A. Tewari (2016). Mixture proportion estimation via kernel embed- dings of distributions. In International Conference on Machine Learning, pp. 2052–2060.
Saenko, K., B. Kulis, M. Fritz, and T. Darrell (2010). Adapting visual category models to new domains. In European conference on computer vision, pp. 213–226.
Solomon, J., F. de Goes, G. Peyré, M. Cuturi, A. Butscher, A. Nguyen, T. Du, and L. Guibas (2015). Convolutional wasserstein distances: Efficient optimal transportation on geometric domains. ACM Trans. Graph. 34(4), 66:1–66:11.
Sturm, K. T. (2006). On the geometry of metric measure spaces. Acta Math. 196(1), 133–177.
Vayer, T., L. Chapel, R. Flamary, R. Tavenard, and N. Courty (2018). Fused gromov-wasserstein distance for structured objects: theoretical foundations and mathematical properties. CoRR abs/1811.02834.
Vayer, T., R. Flamary, R. Tavenard, L. Chapel, and N. Courty (2019). Sliced gromov-wasserstein. In Advances in Neural Information Processing Systems 32, pp. 14726–14736. Curran Associates, Inc.
Villani, C. (2009). Optimal Transport: Old and New, Volume 338. Springer Berlin Heidelberg.
Yang, K. D. and C. Uhler (2019). Scalable unbalanced optimal transport using generative adversarial networks. In International Conference on Learning Representations.
Zhao, Y., X. Kong, and S. Philip (2011). Positive and unlabeled learning for graph classification. In 2011 IEEE 11th International Conference on Data Mining, pp. 962–971. IEEE.
The proof involves 3 steps:
1. we first justify the definition of in the extended problem formulation, and show that should be equal to zero in order to have an equivalence between the original and the extended constraint set;
2. we then show that, for an optimal , we have if
3. we finally show that the solution of the extended Wasserstein problem (deprived from its last row and column) and the one of the partial-Wasserstein one are the same, and we show that
A.1 Equivalence between the constraint set of the extended Wasserstein problem and the partial-W problem
Let recall the formulation of partial OT problem that aims to transport only a fraction of the mass as cheaply as possible. In that case, the problem to solve is:
with the constraint set:
To express this as a standard discrete Kantorovitch optimal transport problem, we get rid of the inequality constraints by re-writting (using standard tricks of linear programming):
with two unknown vectors. Using the fact that , we get then the following equality constraint for marginal q:
The relations take into account the constraint related to the transported mass s and we establish in subsection A.2.1 of the supplementary material that this mass is preserved whenever we solve the extended problem without the explicit constraint Gathering these elements leads to this equivalent formulation
with re-expressed as
we get this compact formulation
In the following, we show how solving a Wasserstein problem under the constraint set (9) and how recovering the solution of the original partial problem (with the constraint set 8) from it.
A.2 Proof of Proposition 1
Let us denote the optimal coupling of the extended problem
and recall that we set
A.2.1 We first check that
Assume is the matrix with the last row and column removed. Let us first suppose that . As a consequence and because of the constraints on the marginals, we have
and
We can also easily see that
as . This implies that
Hence, we have established that
A.2.2 Let show that
Let us now suppose that . This implies that
A.2.3 We now show that
We have
Let us now suppose that . It means that, that belongs to the admissible constraint set
When we have
We have previously shown that . It implies that as and by assumption. We then have
A.2.4 We then prove Proposition 1
Let us denote deprived from its last row and column. We then show that where is the solution of the original partial-W problem:
We have
We also have . Finally, we can write that belongs to the following constraint set:
which is the same like . We then reach the result
A.2.5 We finally show how to construct
We have
Setting , we recover the constraint set and finally
A.3 Adding dummy points to the GW problem does not solve a partial-GW problem
While solving the partial-W problem can be achieved by adding dummy points and extending the cost matrix C in an appropriate way, the same strategy can not be set up for solving GW. Indeed, the equivalence between the partial and the extended problem is permitted because we can ensure that (as long as and is a bounded scalar) , which implies that . If we extend the intra-domain cost matrices on the same pattern as follows
(with a large constant A > 0) the GW formulation involves pairs of points. We have
where
does not allows having , and hence
B.1 Line-search
The step size in the line-search of Frank-Wolfe algorithm for partial-GW is given by
Straightforwardly, one has
Since we choose a quadratic cost, p = 2, then for any T , R one has and we can then rewrite
We then have to find that minimises
with its derivative . This yields the following cases:
Case 1: a < 0. In that case, is a concave function, whose minimum is reached either for . We then have:
Case 2: a > 0. In that case, is a convex function, whose minimum on [0, 1] is reached for
B.2 Convergence guarantee.
Intuitively a stationary point for partial-GW problem verifies that every direction in the polytope with origin is correlated with the gradient of the loss , namely A good criterion to measure distance to a stationary point at iteration k is the often used Frank-Wolfe gap, which is defined by
Note that is always non-negative, and zero if and only if at a stationary point. Thanks to Theorem 1 in Lacoste-Julien (2016) we have
where defines the initial suboptimality, L is a Lipschitz constant of and diamis the -diameter of (see (11)). Therefore we can state the following lemma characterizing the convergence guarantee:
Lemma 1 The Frank-Wolfe gap for the partial-GW loss after K iterations satisfies
where s is the total mass to be transported and is the maximum value of cost matrix , similarly for
Note that for the implementations, one can set , hence the upper bound in (10) becomes more tight regarding a good initialization of Algorithm 1. This can be used to reduce significantly the initial suboptimality . Furthermore, according to Theorem 1 in Lacoste-Julien (2016), Algorithm 1 takes at most iterations to find an approximate stationary point with a gap smaller than
Proof of Lemma 1. Let us first calculate the diameter of the couplings set with respect to the Frobenieus norm
Using triangle inequality and the fact that are probability masses that is get
where s in the total mass to be transported. Thus diam
For the Lipschitz constant of the gradient of we proceed as follows: for any we have
We have also
Hence the Lipschitz constant of verifies . This gives the desired result.
Let’s denote the optimal coupling of the extended problem
in which , where contains the indices of the columns of that correspond to either the positives (g = [1, . . . , m]) or dummy points (g = m + 1).
C.1 We first show that
Let recall that by construction, we have and . Also due to the definition of the marginals
Therefore we arrive at the results by construction. Using the development in Section A.2.2 of the supplemental, we can establish that , and . Thereon we attain
C.2 Proof of Proposition 2
Given a solution of the extended PU learning problem stated in Equation (12), we can write the objective function of the extended problem as
We now turn this problem into its equivalent constrained form, namely there exists some , related to the regularization parameter , such that
Let such that for all Since we have the polytope constraint , this means that and analagously . Using the established results from section C.1 of the supplementary, we can derive that the matrix belongs to the following constraint set:
Let define for all and . Then the problem can be re-formulated as
On the other hand, using the polytope constraints we have
Also we can derive an upper bound on the norm of v
and finally
Gathering those inequalities, we get
Therefore, for any value of , the group-lasso constraint is always satisfied for any choice of and the marginal constraints
One can choose a particular sparse solution for as follows:
• if , condition (13) implies necessary that
• If not, one can choose , and hence
So it remains that the solution of the constrained problem
that is either or is exactly 0. This means that there exists some value , related to , such that for , solving the extended problem (12) amounts to solving our PU learning formulation, which concludes the proof.
The partial-OT computation is based on a augmented problem with a dummy point and, as such, is convex. On the contrary, the GW problem is non-convex and, although the algorithm is proved to converge, there is no guarantee that the global optimum is reached. The quality of the solution is therefore highly dependent on the initialization. We propose to rely on an initial Wasserstein barycenter problem to build a first guess of the transport matrix.
For partial-GW, as the matrices do not lie in the same ground space, we can not define a distance function between their members. Nevertheless, within a domain, we can build two sets of “homogeneous” points. Instead of relying on a classical partitioning algorithm such as k-means, we propose to look for a barycenter with atoms and weights b of the set U that minimizes the following function:
over the feasible sets for . In other word, we look for the set that allows having the most similar (in the Wasserstein sense) distribution q than U. The induced transport matrix gives then two clusters, the one with mass serving as an initialization matrix for the GW problem.
Whenever possible (that is to say when Pos and Unl belong to the same space, we also initialize the GW algorithm with the solution of Partial-W, and the outer product of
We first draw and samples, 6 of them being positives and we set . Fig. E shows the data and the obtained optimal couplings: enforcing a group constraint assigns some unlabeled points to the dummy point, allowing a clear identification of the negatives among Unl, whereas the solution with no such constraints may split the probability mass of the unlabeled positives between Pos and the dummy point, preventing to consistently identify the negatives.
Figure 2: (Left) Positives (black) and Unlabeled (blue) samples (Middle) Transport matrix with no group constraints, where darker color indicates stronger matching (Right) Transport matrix with group constraints.“D” is the dummy point.
Table 3: Standard deviation of accuracy rates on different datasets and scenarios.