The mathematical problem of optimal mass transport has a long history dating back to its introduction in Monge [10], with key contributions by Kantorovivc [6] and Kantorovivc & Rubinvsteuin [7]. It has recently received increased interest due to numerous applications in machine learning; see, e.g., the recent overview of Kolouri, Park, Thorpe, Slepcev & Rohde [9] and the references therein. In a nutshell, the (discrete) problem of optimal transport in its Kantorovich form is to compute for given mass distributions a and b with equal mass a transport plan, i.e., an assignment of how much mass of a at some point should be moved to another point to match the mass in b. This should be done in a way such that some transport cost (usually proportional to the amount of mass and dependent on the distance) is minimized. This leads to a linear optimization problem which has been well studied, but its application in machine learning has been problematic due to large memory requirement and long run time. Recently, Cuturi [2] proposed a method that overcomes the memory requirement by so-called entropic regularization that has found broad applications; see, e.g., [1], Cuturi & Doucet [3], and Frogner, Zhang, Mobahi, Araya & Poggio [5]. The resulting iteration resembles the so-called Sinkhorn–Knopp method from Sinkhorn & Knopp [11] for matrix balancing and allows for a simple and efcient implementation.
1.1 our contribution
In this work, we show that the Sinkhorn–Knopp method can be viewed as an approximate Newton method and derive a full Newton method for entropically regularized optimal transport problems that is demonstrated to perform signifcantly better for small entropic regularization parameters. Here, compared to Cuturi [2], the key idea is to apply a logarithmic transform to the variables.
This paper is organized as follows. In Section 2, we state the Kantorovich formulation of optimal transport together with its dual which serves as the basis of the derived algorithm. Afterwards, we establish local quadratic convergence and discuss the relation of the proposed Newton method to the Sinkhorn–Knopp iteration. The performance and parameter dependence of the proposed method are illustrated with numerical examples in Section 3. Section 4 contains the proof of the key estimate for quadratic convergence, and Section 5 concludes the paper.
1.2 notation
In the following, represents the n-dimensional vector with all ones and refers to the matrix with all ones. Moreover, denotes the probability simplex in whose elements are called probability vectors, or equivalently, histograms. For two histograms
is the set of admissible coupling matrices. In the context of optimal transport, the elements of U (a,b) are also referred to as transport plans. Histograms a and b can be viewed as mass distributions, and an entry of a transport plan can be interpreted as the amount of mass moved from
At the same time, denotes the standard dot product of two vectors Finally, Diagis defned as the diagonal matrix with Diagfor is the Hadamard product (i.e., the component-wise product) of
In this section we derive our Sinkhorn–Newton method. We start by introducing the problem of entropically regularized optimal transport in Section 2.1. Afterwards, in Section 2.2, we present our approach,which is essentially applying Newton’s method to the optimality system associated with the transport problem and its dual, before we discuss its local quadratic convergence in Section 2.3. In Section 2.4, we fnally establish a connection between our Newton iteration and the Sinkhorn–Knopp type iteration introduced by Cuturi [2].
2.1 problem setting
Let be given histograms together with a non-negative cost matrix The entropically regularized Kantorovich problem of optimal mass transport between a and b is
where the logarithm is applied componentwise to P and is the regularization strength. The variables indicate how much of ends up in is the corresponding transport cost per unit mass. Abbreviating , standard convex duality theory leads us to the dual problem
where f and are the dual variables and the exponential function is applied componentwise. The problems (P) are linked via the optimality conditions
The frst condition (2.1a) connects the optimal transport plan with the dual variables. The conditions (2.1b) and (2.1c) simply refect the feasibility of ,i.e.,for the mass conservation constraints in (1.1).
2.2 algorithm
Finding dual vectors f and that satisfy (2.1b) and (2.1c) is equivalent to fnding a root of the function
The Jacobian matrix of F is
where we used (2.1a) to simplify the notation. Performing the Newton step (2.3) requires fnding a solution of the linear equation system
The new iterates are then given by
If one is only interested in the optimal transport plan, then it is actually not necessary to keep track of the dual iterates after initialization (in our subsequent experiments, we use and hence, ). This is true because (2.5) can be expressed entirely in terms of
and thus, using (2.6) and (2.7), we obtain the multiplicative update rule
In this way, we obtain an algorithm which only operates with primal variables, see Algorithm 1. In applications where the storage demand for the plans is too high and one is only interested in the optimal value, there is another form which does not form the plans , but only the dual variables and and which can basically operate matrix-free. We sketch it as Algorithm 2 below.
2.3 convergence and numerical aspects
In the following, we frst argue that (2.5) is solvable. Then we show that the sequence of Newton iterates converges locally at a quadratic rate as long as the optimal transport plan satisfes for some constant c > 0.
Lemma 2.1. For and , the Jacobian matrix is symmetric positive semidefnite, and its kernel is given by
Proof. The matrix is obviously symmetric. For arbitrary and , we obtain from (2.4) that
which holds with equality if and only if we have
Hence, the system (2.5) can be solved by a conjugate gradient (CG) method. To see that, recall that the CG method iterates on the orthogonal complement of the kernel as long as the initial iterate is chosen from this subspace, in this case with . Furthermore, the Newton matrix can be applied matrix-free in an efcient manner as soon as the multiplication with and its transpose can be done efciently, see Algorithm 2. This is the case, for example if only depends on and thus, multiplication with K amounts to a convolution. A cheap diagonal preconditioner is provided by the matrix
According to Deuflhard [4, Thm. 2.3], we expect local quadratic convergence as long as
holds for all and , with an arbitrary norm and some constant in a neighborhood of the solution. Here, we abbreviated
holds in the
We postpone the proof of Theorem 2.2 to Section 4.
Remark 2.3. In fact, one can show that necessarily . Indeed, ifthen one can explicitly compute
where the exponential and the multiplication are pointwise (the calculation is detailed in the proof of Theorem 2.2).
Hence, if is chosen sufciently close to a solution of , then the contraction property of Newton’s method shows that the sequence of Newton iterates , and hence , remain bounded. If the optimal plan satisfes , we can therefore expect local quadractic convergence of Newton’s method.
2.4 relation to sinkhorn–knopp
Substituting shows that the optimality system can be written equivalently as
In order to fnd a solution of (2.14b)–(2.14c), one can apply the Sinkhorn–Knopp algorithm [11] as recently proposed in Cuturi [2]. This amounts to alternating updates in the form of
In (2.15a), is updated such that , and in the subsequent (2.15b), is updated such that form a solution of (2.14c). If we proceed analogously to Section 2.2 and derive a Newton iteration to fnd a root of the function
then the associated Jacobian matrix is
Neglecting the of-diagonal blocks in (2.17) and using the approximation
leads us to the parallel updates
Hence, we see that a Sinkhorn–Knopp step (2.15) simply approximates one Newton step (2.20) by neglecting the of-diagonal blocks and replacing by in (2.20b). In our experience, neither the Newton iteration for G(u,v) = 0 (which seems to work for the less general problem of matrix balancing; see Knight & Ruiz [8]) nor the version of Sinkhorn–Knopp in which is updated using
We illustrate the performance of the Sinkhorn–Newton method and its behavior by several examples. We note that a numerical comparison is not straightforward as there a several possibilities to tune the method, depending on the structure at hand and on the specifc goals. As illustrated in Section 2.2, one could take advantage of fast applications of the matrix K or use less memory if one does not want to store P during the iteration. Here we focus on the
Figure 1: Performance of Sinkhorn (S) and Newton (N) iterations measured by constraint violation (viol.), distance to optimal transport cost (cost) and distance to optimal transport plan (plan).
comparison with the usual (linearly convergent) Sinkhorn iteration. Thus, we do not aim for greatest overall speed but for a fair comparison between the Sinkhorn-Newton method and the Sinkhorn iteration. To that end, we observe that one application of the Newton matrix (2.4) amounts to one multiplication with each, two coordinate-wise products and sums of vectors. For one Sinkhorn iteration we need one multiplication with and two additional coordinate-wise operations. Although Algorithm 2 looks a little closer to the Sinkhorn iteration, we still compare Algorithm 1, as we did not exploit any of the special structure in K or P.
All timings are reported using MATLAB (R2017b) implementations of the methods on an Intel Xeon E3-1270v3 (four cores at 3.5 GHz) with 16 GB RAM. The code used to generate the results below can be downloaded from htps://github.com/dirloren/sinkhornnewton.
In all our experiments, we address the case m = n and the considered histograms are defned on equidistant grids Sections 3.1 and 3.2) andd = 1 (in Section 3.3), respectively. Throughout, the cost is chosen as quadratic, i.e.,
Our Sinkhorn–Newton method is implemented according to Algorithm 1 and using a preconditioned CG method. The iteration is terminated as soon as the maximal violation of the constraints and drops below some threshold. If applicable, the same termination criterion is chosen for the Sinkhorn method, which is initialized with
3.1 comparison with sinkhorn–knopp
We frst address the comparison of Sinkhorn–Newton with the classical Sinkhorn iteration. For this purpose, we discretize the unit square using a 20 equidistant grid
which are then normalized to unit mass by setting
The entropic regularization parameter is set to and in case of Sinkhorn-Newton, the CG method is implemented with a tolerance of 10and a maximum number of 34 iterations. Moreover, the threshold for the termination criterion is chosen as 10
Figure 1 shows the convergence history of the constraint violation for both iterations together with the error in the unregularized transport cost , where denotes the fnal transport plan, and the error in the transport plan . In Figure 1a, we compare the error as a function of the iterations, where we take the total number of CG iterations for Sinkhorn–Newton to allow for a fair comparison (since both a Sinkhorn and a CG step have comparable costs, dominated by the two dense matrix–vector products respectively). It can be seen clearly that with respect to all error measures, Sinkhorn converges linearly while Sinkhorn–Newton converges roughly quadratically, as expected, with Sinkhorn– Newton signifcantly outperforming classical Sinkhorn for this choice of parameters. The same behavior holds if the error is measured as a function of runtime; see Figure 1b.
3.2 dependence on the regularization strength
The second example addresses the dependence of the Sinkhorn–Newton method on the problem parameters. In particular, we consider the dependence on and on the minimal value of a and b (via the corresponding transport plans P), since these enter into the convergence rate estimate (2.13). Here, we take an example which is also used in Cuturi [2]: computing the transport distances between diferent images from the MNIST database, which contains 28 images of handwritten digits. We consider these as discrete distributions of dimension 282 = 784 on , to which we add a small ofset before normalizing to unit mass as before. Here, the tolerance for both the Newton and the CG iteration is set to 10, and the maximum number of CG iterations is fxed at 66. The entropic regularization parameter is chosen as multiples of the median of the cost (which is 2821 in this case).
Figure 2 shows the convergence history for diferent ofsets and , where we again report the constraint violation both as a function of CG iterations and of the run time in seconds. Comparing Figures 2a to 2f, we see that as decreases, an increasing number of CG iterations is required to achieve the prescribed tolerance. However, the convergence seems to be robust in at least for larger values of and only moderately deteriorate for
Figure 2: Constraint violation for diferent ofsets and diferent regularization parameters
Figure 3: Constraint violation for diferent mesh sizes n
3.3 dependence on the problem dimension
We fnally address the dependence on the dimension of the problem. For this purpose, we discretize the unit interval [0, 1] using n equidistant points
which are again normalized to unit mass to obtain a and b. The regularization parameter is fxed at . Moreover, the inner and outer tolerances are here set to 10, and the maximum number of CG iterations is coupled to the mesh size via
Figure 3 shows the convergence behaviorof Sinkhorn–Newton forAs can be seen from Figure 3a, the behavior is nearly independent of n; in particular, the number of CG iterations required to reach the prescribed tolerance stays almost the same. (This is also true for the Newton method itself with 21, 22, 23 and 23 iterations.) Since each CG iteration involves two dense matrix–vector products with complexity , the total run time scales quadratically; see Figure 3b.
For the sake of presentation, we restrict ourselves to the case m = n here. However, in the end, we suggest how the proof can be generalized to the case
To estimate (2.12) and in particular we observe that
To keep the notation concise, we abbreviate and also write and , respectively. Then, we compute
where all exponents are applied componentwise. Now we frst treat only the summands for l = 0 and l = k. For those terms (2.4) immediately implies
which has supremum norm bounded by for all . For all other summands (i.e. 1 1), we write
Using (2.4) again, it follows that
and we aim to estimate and by and . By Lemma 2.1, the matrix A has a one-dimensional kernel spanned by , and a solution in the orthogonal complement is also a solution to
for any . From Varah [12], we know that the -norm of the inverse matrix of B is estimated by
In this case, we calculate for i = 1, . . . ,n that
For any , this leads to
Choosing , we thus obtain that
and similarly for Diag, fnally gives
with
Hence we obtain
for all 1. In summary we obtain
We have proposed a Newton iteration to solve the entropically regularized discrete optimal transport problem. Diferent from related Newton type approaches for matrix balancing, our method iterates on the logarithm of the scalings, which seems to be necessary for robust convergence in the optimal transport setting. Numerical examples show that our algorithm is a robust and efcient alternative to the more commonly used Sinkhorn–Knopp algorithm, at least for small regularization strength.
[1] ,Convergence of entropic schemes foroptimal transport and gradient fows, SIAM Journal on Mathematical Analysis 49 (2017), 1385–1418, doi: 10. 1137/15M1050264.
[2] Cuturi, Sinkhorn distances: Lightspeed computation of optimal transport, in: Advances in Neural Information Processing Systems (NIPS) 26, 2013, 2292–2300, url: htp://papers.nips. cc/paper/4927-sinkhorn-distances-lightspeed-computation-of-optimal-transport.
[3] Cuturi & Doucet, Fast computation of Wasserstein barycenters, in: Proceedings of the 31st International Conference on Machine Learning, 2014, 685–693, url: htp://proceedings.mlr. press/v32/cuturi14.pdf.
[4] Deuflhard, Newton Methods for Nonlinear Problems, Afne Invariance and Adaptive Algorithms, Springer, 2011, doi: 10.1007/978-3-642-23899-4.
[5] Frogner, Zhang, Mobahi, Araya & Poggio, Learning with a Wasserstein loss, in: Advances in Neural Information Processing Systems (NIPS) 28, 2015, 2053–2061, url: htps://papers. nips.cc/paper/5679-learning-with-a-wasserstein-loss.
[6] Kantorovivc, On the translocation of masses, C. R. (Doklady) Acad. Sci. URSS (N.S.) 37 (1942), 199–201.
[7] Kantorovivc & Rubinvsteuin, On a functional space and certain extremum problems, Doklady Akademii Nauk SSSR 115 (1957), 1058–1061.
[8] Knight & Ruiz, A fast algorithm for matrix balancing, IMA J. Numer. Anal. 33 (2013), 1029– 1047, doi: 10.1093/imanum/drs019.
[9] Kolouri, Park, Thorpe, Slepcev & Rohde, Optimal mass transport: signal processing and machine-learning applications, IEEE Signal Processing Magazine 34 (2017), 43–59, doi: 10.1109/MSP.2017.2695801.
[10] Monge, Mémoire sur la théorie des déblais et des remblais, des Sciences de Paris (1781).
[11] Sinkhorn & Knopp, Concerning nonnegative matrices and doubly stochastic matrices, Pacifc Journal of Mathematics 21 (1967), 343–348, doi: 10.2140/pjm.1967.21.343.
[12] Varah, A lower bound for the smallest singular value of a matrix, Linear Algebra and Appl. 11 (1975), 3–5, doi: 10.1016/0024-3795(75)90112-3.