b

DiscoverSearch
About
My stuff
Riemannian coordinate descent algorithms on matrix manifolds
2 months ago·arXiv
Abstract
Abstract

Many machine learning applications are naturally formulated as optimization problems on Riemannian manifolds. The main idea behind Riemannian optimization is to maintain the feasibility of the variables while moving along a descent direction on the manifold. This results in updating all the variables at every iteration. In this work, we provide a general framework for developing computationally efficient coordinate descent (CD) algorithms on matrix manifolds that allows updating only a few variables at every iteration while adhering to the manifold constraint. In particular, we propose CD algorithms for various manifolds such as Stiefel, Grassmann, (generalized) hyperbolic, symplectic, and symmetric positive (semi)definite. While the cost per iteration of the proposed CD algorithms is low, we further develop a more efficient variant via a first-order approximation of the objective function. We analyze their convergence and complexity, and empirically illustrate their efficacy in several applications.

In this work, we consider the optimization problem

image

where M is a smooth, and often nonlinear constraint. Examples of M include orthogonality constraint (Edelman et al., 1998), positive (semi)definite constraint (Bhatia, 2009; Han et al., 2021), fixed-rank constraint (Vandereycken, 2013), hyperbolic constraint (Nickel & Kiela, 2018), doubly stochastic constraint (Douik & Hassibi, 2019), etc. Problem (1) has been explored in applications such as PCA (Zhang et al., 2016; Kasai et al., 2019), low-rank matrix/tensor completion (Jawanpuria & Mishra, 2018; Nimishakavi et al.,

2018; Kressner et al., 2014), computer vision (Pennec et al., 2006), natural language processing (Jawanpuria et al., 2019a; 2020), optimal transport (Mishra et al., 2021; Han et al., 2022; Shi et al., 2021), and deep learning (Arjovsky et al., 2016; Wang et al., 2020). Problem (1) has also been studied in various settings such as stochastic optimization (Bonnabel, 2013; Zhang et al., 2016; Tripuraneni et al., 2018; Sato et al., 2019; Kasai et al., 2019; Han & Gao, 2021), differential privacy (Reimherr et al., 2021; Han et al., 2024a; Utpala et al., 2022), federated learning (Li & Ma, 2022; Huang et al., 2024), decentralized learning (Mishra et al., 2019), and saddle point and bilevel optimization (Han et al., 2023b;a; Zhang et al., 2023; Han et al., 2024b).

The smooth constraint set can be turned into a Riemannian manifold by endowing a properly chosen metric structure. The Riemannian optimization approach (Absil et al., 2008; Boumal, 2023) then provides a principled approach to solve (1) intrinsically on the manifold space. The main idea is to iteratively update the variable along a descent direction without leaving the manifold. The descent direction is often computed using the Riemannian gradient, which is then followed by a retraction update to ensure feasibility of the manifold constraint. As the dimensionality of the constraint set increases, ensuring feasibility via retraction becomes a key computational bottleneck, e.g., the complexity of ensuring orthogonality and positive definiteness scales as  O(n3)with the input dimension n. This has led to many recent works (Gao et al., 2019; Xiao & Liu, 2021; Ablin et al., 2023) that develop infeasible methods for solving (1). However, such methods are largely limited to the orthogonality constraint and cannot be easily adapted to other manifolds.

In the Euclidean space, the coordinate descent (CD) method (Luo & Tseng, 1992; Nesterov, 2012; Wright, 2015) is a classic algorithm that successively solves a smalldimensional subproblem along a component of the vector variable while holding others fixed. Since each subproblem can be more easily solved than the original problem, this strategy leads to efficient variable update.

On manifolds, designing CD updates is inherently difficult (Gutman & Ho-Nguyen, 2023). A few works have proposed manifold specific CD updates, mainly for the orthogonal (Shalit & Chechik, 2014; Jiang et al., 2022; Massart & Abrol, 2022) and Stiefel (Gutman & Ho-Nguyen, 2023) manifolds. Although Gutman & Ho-Nguyen (2023) discuss a general framework for developing CD methods on manifolds, concrete developments have been shown only for the Stiefel manifold. Recently, for a class of optimization objectives, Darmwal & Rajawat (2023) have proposed CD updates on the symmetric positive definite manifold with the affine-invariant metric.

In this work, we provide a general approach for developing CD algorithms on matrix manifolds. We summarize our contributions below.

• We introduce a framework for designing CD algorithms on manifolds. In particular, we find a basis spanning the tangent space such that a chosen retraction along the direction of such a basis admits an efficient computation. We discuss a simple expression for the coordinate derivative. Finally, we provide optimization ingredients for various matrix manifolds of interest.

• A nonlinear objective f in (1) requires computation of gradient for every CD update. Using a first-order approximation of f, we develop a more efficient CD algorithm which requires gradient computations one in every fixed number of CD updates. We analyze the convergence and complexity of the two algorithms with randomized and cyclic selection of coordinates.

• We show the benefits of the proposed CD algorithms on the orthogonal Procrustes, PCA, orthogonal deep network distillation, nearest matrix, and learning hyperbolic embeddings problems.

Riemannian manifolds and optimization. For a Riemannian manifold M, denote its tangent space at  X ∈ Mas  TXM. A Riemannian metric is an inner product structure  gX(·, ·) = ⟨·, ·⟩X : TXM × TXM → Rthat varies smoothly with the base point X. In this work, we particularly focus on matrix manifolds, i.e., where X can be represented in the ambient vector space  Rm×n. The orthogonal projection  ProjX : Rm×n → TXMprojects arbitrary ambient vectors to the tangent space  TXMwith respect to the Riemannian metric. For a differentiable function  f : M → R, the Riemannian gradient at X is defined as the tangent vector  gradf(X) ∈ TXMsuch that  ⟨U, gradf(X)⟩X = Df(X)[U], ∀U ∈ TXMwhere  Df(X)[U] = ⟨∇f(X), U⟩. A retraction  RetrX :TXM −→ Mallows points to move along the manifold, which satisfies the conditions:  RetrX(0) = Xand DRetrX(0)[U] = U.

Related works. We provide a detailed review of the existing coordinate descent (CD) algorithms on specific manifolds, along with other related works in Appendix B.

Notations. We use  ⟨·, ·⟩without the subscript to represent the Euclidean inner product while we use  ⟨·, ·⟩Xto denote the Riemannian inner product on  TXM. The specific expression for  ⟨·, ·⟩Xdepends on both M and X. Sym(n) and Skew(n) denote the sets of  n × nsymmetric and skew-symmetric matrices, respectively. Let sym(A) := (A + A⊤)/2, skew(A) := (A − A⊤)/2, exp(·)be the elementwise exponential, and  expm(·)be the matrix exponential. We also use  eito represent the i-th basis vector with the dimension to be determined from the context.  [A]ijdenotes the i, j-th entry of a matrix  A while Aijrepresents a matrix with index i, j. We use  Into denote the  n × nidentity matrix,  1nto denote the size-n vector of all 1s, and define [n] := {1, 2, ..., n}.

As shown in (Shalit & Chechik, 2014; Gutman & Ho- Nguyen, 2023; Massart & Abrol, 2022; Jiang et al., 2022; Darmwal & Rajawat, 2023), for specific manifolds, the key in developing CD algorithms is the choice of the basis vectors  Bℓ (ℓ ∈ Iand I denotes the index set) spanning the tangent space that allow efficient retraction. In general, our chosen basis need not be orthonormal with respect to the Riemannian metric. Once the basis and retraction are chosen, the CD update is given by  RetrX(−ηθBℓ), where η > 0is the stepsize and  θis the coordinate derivative, i.e.,

image

It can be verified that  −θBℓis indeed a descent direction, i.e.,  ⟨gradf(X), −θBℓ⟩X = −θ ddθf(RetrX(θBℓ))|θ=0 =−θ2 ≤ 0. The CD algorithm then involves iteratively selecting coordinate index  ℓ, computing  θ, and updating in the coordinate descent direction  RetrX(−ηθBℓ).

The main challenges in developing CD algorithms on matrix manifolds are: 1) characterization of  Bℓwhich facilitates efficient computation, 2) efficient computation of  θ, and 3)easy generalization to different manifolds. We propose to leverage the following connection between the Riemannian and Euclidean gradients:

image

where  ∇f(X)is the Euclidean gradient and the last equality follows from the definition of the Riemannian gradient. We exploit (3) to efficiently compute  θfor several manifolds as it is independent of the Riemannian gradient and metric. In the subsequent sections, we develop concrete CD optimization ingredients for the manifolds of interest under the proposed approach. These are summarized in Table 1.

Table 1: Summary of CD ingredients over various manifolds.  Hij = eie⊤j − eje⊤iand  Eij = eiej + eje⊤iare the basis for skew-symmetric and symmetric matrices, respectively.  Gij(θ)and  Rij(θ)corresponds to the Givens and hyperbolic rotations, respectively.  PX := In − XX⊤, and  ∇f(X)kis the k-th column of  ∇f(X). We use  ExpSx(v)to denote the exponential retraction over sphere. The complexity only considers the computation of coordinate derivative and coordinate update, while excluding the complexity of first-order oracle  ∇f(X).

image

3.1. CD on Stiefel manifold

The Stiefel manifold St(n, p) is the set of column orthonormal matrices of size  Rn×p, i.e.,  St(n, p) := {X ∈ Rn×p :X⊤X = Ip}. When p = n, St(n, n) ≡ O(n), the orthogonal manifold. The tangent space of Stiefel manifold is identified as  TXSt(n, p) = {U ∈ Rn×p : X⊤U + U ⊤X = 0}. The Riemannian metric is defined as  ⟨U, V ⟩X := ⟨U, V ⟩for any  U, V ∈ TXM. The Riemannian gradient is derived as  gradf(X) = ∇f(X) − Xsym(X⊤∇f(X)).

Choice of basis. Taking inspiration from O(n), we adopt the  ΩXparameterization of the tangent vectors (where Ω ∈ Skew(n)) and choose the basis as  Bℓ = HijXfor ℓ ∈ I = {(i, j) : 1 ≤ i < j ≤ n} and Hij := eie⊤j − eje⊤i .In contrast to O(n), the chosen basis is not orthonormal for St(n, p). This is expected as the manifold St(n, p) has a dimension  np − p(p−1)2while we adopt an overparameterization of the tangent space using  n(n − 1)/2basis vectors.

Retraction. For the purpose of CD update, we first note that RetrX(tU) = expm(tΩ)Xis a valid retraction on St(n, p) because: 1)  RetrX(0) = Xand 2)  DRetrX(0)[U] =ΩX = Uare satisfied (Siegel, 2020).

CD update. Based on the above choices of the basis vectors and retraction, the proposed CD update is  RetrX(−ηθBℓ) = Gij(−ηθ)X, where  θ =⟨∇f(X), Bℓ⟩ = [∇f(X)X⊤ − X∇f(X)⊤]ij. Here, Gij(θ) = In +(cos(θ)−1)(eie⊤i +eje⊤j )+sin(θ)(eie⊤j −eje⊤i )is known as the Givens rotation around axes i, j with angle  −θ. Overall, each CD update only requires O(p) as we modify only two rows of X.

Remark 3.1. Gutman & Ho-Nguyen (2023) propose a column-wise CD update on the Stiefel manifold which costs O(np) per iteration. On the other hand, our proposed CD update is row-wise and costs O(p), which is cheaper. Furthermore, the CD update strategy of (Gutman & Ho-Nguyen, 2023) cannot be applied to the sphere manifold, i.e., when p = 1, it reduces to the full gradient update on the sphere. This, however, is not an issue for our update. Finally, the update of Gutman & Ho-Nguyen (2023) is not invariant to the right action of orthogonal group and hence does not yield a valid CD strategy for the Grassmann manifold. In contrast, as shown in the next section, our strategy can be readily generalized to the Grassmann manifold.

3.2. CD on Grassmann manifold

The Grassmann manifold Gr(n, p) represents the p-dimensional subspaces in  Rn, which can be represented by an  n × porthonormal matrix X, i.e.,  X ∈ St(n, p), where the columns span the subspace. The representation is not unique, with XQ representing the same subspace for arbitrary  Q ∈ O(p). Thus, the Grassmann manifold can be identified as  Gr(n, p) = {[X] : X ∈ Rn×p, X⊤X = Ip}, where  [X] := {XQ : Q ∈ O(p)}. The tangent space can be uniquely characterized by the horizontal space at TXSt(n, p), i.e.,  T[X]Gr(n, p) = {[U] : X⊤U = 0}. For a given  ξ ∈ T[X]Gr(n, p), its unique horizontal lift is  U = liftX(ξ), where [X] is represented as X. The lift operator satisfies  liftXQ(ξ) = liftX(ξ)Q. On Gr(n, p), the Riemannian metric is pushed forward by the Euclidean metric on St(n, p) as  ⟨ξ, ζ⟩[X] = ⟨liftX(ξ), liftX(ζ)⟩and the corresponding Riemannian gradient gradf([X]) can be represented by  liftX(gradf([X]) = (In − XX⊤)∇f(X). Retractions such as QR retraction for St(n, p) also work for Gr(n, p) as long as it preserves the equivalence class, i.e.,  [RetrXQ(t liftX(ξ)Q)] = [RetrX(t liftX(ξ))]for any Q ∈ O(p). Below, we show that the proposed CD update for St(n, p) is also well-defined for Gr(n, p).

Proposition 3.2. Consider a function  f : Gr(n, p) →R. Let the coordinate descent update at [X] be given by  RetrX(−η θHijX) := Gij(−ηθ)Xfor  1 ≤ i <j ≤ n, where  θ = ⟨∇f(X), HijX⟩and for some fixed stepsize  η > 0. Then,  RetrXQ(−η θXQHijXQ) =RetrX(−ηθHijX)Q.

3.3. CD on hyperbolic manifold

We now consider the generalized hyperbolic space (Bai & Li, 2014; Xiao et al., 2023)  H(n, p) := {X ∈ Rn×p :−X⊤JX = Ip}, where  J = diag(−1, 1, ..., 1) ∈ Rn×nis the metric tensor. When p = 1, this reduces to the wellknown hyperbolic space (the hyperboloid model). The tangent space at  X ∈ H(n, p)is identified as

TXH(n, p) = {U ∈ Rn×p : U ⊤JX + X⊤JU = 0}= {WJX : W ∈ Skew(n)}.

The Riemannian metric on TH(n, p) is the generalized Lorentz inner product as  ⟨U, V ⟩L := tr(U ⊤JV ). The normal space is given by  NXH(d, r) = {XS : S ∈ Sym(p)}. The orthogonal projection to  TXH(n, p)and the Riemannian gradient are derived below.

Proposition 3.3. The orthogonal projection of  A ∈ Rn×p toTXH(n, p)is given by  ProjX(A) = A + Xsym(X⊤JA). The Riemannian gradient is  gradf(X) = J∇f(X) +Xsym(X⊤∇f(X)).

Choice of basis. For generalized hyperbolic manifold, we consider the basis  Bℓ = HijJX = (eie⊤j − eje⊤i )JX, for1 ≤ i < j ≤ n.

Retraction. Taking inspiration from our Stiefel analysis in Section 3.1, we define the map  RetrX(tU) :=expm(tWJ)X for  U = WJX ∈ TXH(n, p). We next show such a map defines a valid retraction. As shown below, the retraction expression considerably simplifies along the chosen basis  HijJX.

Proposition 3.4. Given a tangent vector  U = WJX ∈TXH(n, p)for some skew-symmetric matrix W, then RetrX(tU) := expm(tWJ)Xis a retraction.

In fact, expm(tWJ) is a Lorentz transform that satisfies expm(tWJ)⊤Jexpm(tWJ) = J, which preserves the Lorentz inner product as  (LX)⊤JLX = X⊤JX = −Ip. Hence by following the direction  U = θHijJX, we define a coordinate type of updates on (generalized) hyperbolic manifold as  expm(θHijJ)X, which can be computed efficiently similar to the Givens rotation. Particularly, when i, j ̸= 1, HijJ = Hijand  expm(θHijJ) = Gij(θ)exactly recovers the Givens rotation. When i = 1, we have HijJ = Eij := eie⊤j + eje⊤i. We show in the follow- ing lemma that this also leads to a rotation known as the hyperbolic rotation.

Lemma 3.5. For  U = θHijJXwith  1 ≤ i < j ≤ n,

image

When  d = 4, Rij(θ)is known as the Lorentz boost with rapidity  −θand can be thought of as rotation in the time domain. Hence, while the Givens rotation based CD updates have been explored for the orthogonal and Stiefel manifolds (Shalit & Chechik, 2014; Gutman & Ho-Nguyen, 2023), our approach generalizes the Givens rotation based CD updates to hyperbolic spaces.

CD update. Similar to the Stiefel case, the proposed CD update is  RetrX(−ηθHijJX), where  θ =⟨∇f(X), HijJX⟩ = [∇f(X)X⊤J − JX∇f(X)⊤]ij.In Appendix E.3, we additionally derive a canonical-type metric and a Cayley retraction.

3.4. CD on symplectic manifold

The symplectic manifold (Gao et al., 2021a;b; 2022) is defined as  Sp(n, p) := {X ∈ R2n×2p : X⊤ΩnX = Ωp},where  Ωk :=

block matrix. The tangent space is given as

TXSp(n, p) = {U ∈ R2n×2p : U ⊤ΩnX + X⊤ΩnU = 0}= {SΩnX : S ∈ Sym(2n)}.

Here we consider the Euclidean metric (Gao et al., 2021a) as  ⟨U, V ⟩X = tr(U ⊤V )for any  X ∈ Sp(n, p), U, V ∈TXSp(n, p). The Riemannian gradient (Gao et al., 2021a, Proposition 3) is given by  gradf(X) = ∇f(X) −ΩnXWX, where  WX ∈ Skew(2p)is the unique solution to the Lyapunov equation  X⊤XW + WX⊤X =2 skew(X⊤Ω⊤n ∇f(X)).

Choice of basis. Similar to the Stiefel and hyperbolic manifolds, we consider the basis  Bℓ = EijΩnXfor the tangent space  TXSp(n, p), where  Eij = eie⊤j + eje⊤i, for 1 ≤ i ≤ j ≤ 2n. Here, ei is the i-th basis in  R2n.

Retraction. We propose the following retraction for efficient CD updates.

Proposition 3.6. For any  X ∈ Sp(n, p) and U = SΩnX ∈

TXSp(n, p), the map  RetrX(tU) = expm(tSΩn)Xis a retraction.

The above retraction further simplifies when moving along the chosen basis direction.

Proposition 3.7. Let  Uij = EijΩnX ∈ TXSp(n, p)for 1 ≤ i ≤ j ≤ 2n. Then, we have  RetrX(θUij) = X +(exp (−θ) − 1)eie⊤i X + (exp(θ) − 1)en+ie⊤n+iX,if i = j−n, j > n and RetrX(θUij) = X +θEijΩnXotherwise.

Remark 3.8 (Block coordinate updates). We may also consider block coordinate updates. Let E =

A, B ∈ Sym(n)and  C ∈ Rn×n, and we wish to update X in the direction of  U = EΩnX ∈ TXSp(n, p). First, we consider the upper-left and bottom-right blocks, i.e., where E =

Sym(n). Here, RetrX(θEΩnX) = X +θEΩnX. Second,if E =

image

CD update. Finally, based on the above discussion our proposed CD update is  RetrX(−ηθEijΩnX), where,  θ = ⟨∇f(X), EijΩnX⟩ = [∇f(X)X⊤Ω⊤n +ΩnX∇f(X)⊤]i,j.

3.5. CD on doubly stochastic and multinomial manifolds

Given two marginals  µ ∈ ∆m, ν ∈ ∆nwhere  ∆k := {z ∈Rk : z ≥ 0, z⊤1k = 1}denotes the k-simplex, the doubly stochastic manifold (Douik & Hassibi, 2019; Shi et al., 2021; Mishra et al., 2021) is defined as  Π(µ, ν) := {X ∈ Rm×n :X > 0, X1n = µ, X⊤1m = ν}. The tangent space is TXΠ(µ, ν) = {U ∈ Rm×n : U1n = 0, U ⊤1m = 0}, which can be endowed with the Fisher metric as  ⟨U, V ⟩X =tr(U ⊤(V ⊘X)), where ⊙ and ⊘represent the elementwise product and division operations, respectively. The orthogonal projection is  ProjX(A) = A − (α1⊤n + 1mβ⊤) ⊙ X, where  α ∈ Rm, β ∈ Rnare solutions to the linear system: α ⊙ µ + Xβ = A1n, β ⊙ ν + X⊤α = A⊤1m. The Riemannian gradient is given by  gradf(X) = ProjX(X ⊙∇f(X)) = X ⊙�∇f(X) − (α1⊤n + 1mβ⊤)�.

Choice of basis. We consider the parameterization of the tangent space as  TXΠ(µ, ν) = {ACB⊤ : A ∈Rm×(m−1), B ∈ Rn×(n−1), A⊤1m = 0, B⊤1n = 0, C ∈R(m−1)×(n−1)}. We notice that the tangent space has a dimension  (m−1)×(n−1), and hence, we can let  A = [e1−e2, ..., em−1 −em] ∈ Rm×(m−1), B = [e1 −e2, ..., en−1 −en] ∈ Rn×(n−1), where we denote  eias the i-th canonical basis for the corresponding vector space. Hence the tangent space is parameterized by  C ∈ R(m−1)×(n−1). The basis we consider  Bℓ = Aeie⊤j B⊤ = (ei − ei+1)(ej − ej+1)⊤for  i ∈ [m − 1], j ∈ [n − 1].

Retraction. We consider the Sinkhorn retraction applied in the direction of the basis as  RetrX(−ηθBℓ) = SK(X ⊙exp(−ηθBℓ ⊘ X)). Here, the Sinkhorn algorithm SK(U) iteratively normalize rows and columns of U according to the given marginals (Knight, 2008). We notice the input to the Sinkhorn algorithm only modifies a  2 × 2sub-matrix of X. It, thus, suffices to apply the Sinkhorn algorithm to the  2 × 2sub-matrix with the modified marginals, which largely simplifies the computation compared to running the Sinkhorn algorithm for the entire input. To this end, we define the coordinate Sinkhorn, denoted as  SKij(U)or simply cSK(U) if the coordinates are clear from context, as performing the Sinkhorn algorithm for the  2 × 2sub-matrix formed by indices i, i+1 and j, j +1 with marginals ([µ]i − �k̸=j,j+1[U]ik, [µ]i+1 − �k̸=j,j+1[U](i+1)k)and ([ν]j − �l̸=i,i+1[U]tj, [ν]j+1 − �l̸=i,i+1[U]l(j+1)). The other entries of U remains unchanged. We show in the next proposition that applying coordinate Sinkhorn to the basis results in a valid retraction.

Proposition 3.9. The coordinate Sinkhorn applied to the basis  Bℓ, i.e., cSK(X ⊙exp(tBℓ ⊘X))is a valid retraction in the direction of  Bℓ.

We can further simplify the computation of  cSK(X ⊙exp(tBℓ ⊘X)), which is equivalent to performing Sinkhorn on a  2 × 2matrix. Furthermore, in this case, we show in Lemma E.6 that the Sinkhorn admits a closed-form solution.

CD update. The CD update follows as  cSK(X ⊙exp(−ηθBℓ ⊘ X)), where the coordinate derivative  θis computed as  θ = ⟨∇f(X), Bℓ⟩ = [∇f(X)]ij −[∇f(X)]i(j+1) − [∇f(X)](i+1)j + [∇f(X)](i+1)(j+1).

Remark 3.10 (CD on multinomial manifold). The developments in this section readily applies to the multinomial manifold (Douik & Hassibi, 2019), i.e.,  Mn,p := {X ∈Rn×p : X > 0, X1p = v}where we assume  v = 1nwithout loss of generality. The multinomial constraint corresponds to n independent simplex constraints restricted to positive entries. The tangent space is  TXMn,p = {U ∈Rn×p : U1p = 0} = {V B⊤ : V ∈ Rn×(p−1), B ∈Rp×(p−1), B⊤1p = 0}. Thus, the basis is similarly given by  Bℓ = ei(ej − ej+1)⊤. The Riemannian metric is the same Fisher metric. A retraction is given in  RetrX(tU) =P(X ⊙ (exp(tU ⊘ X))), where  P(V ) = V ⊘ (V 1p1⊤p )denotes the row normalization. It should be noted that P(V ) is a special case of the Sinkhorn algorithm without column normalization. Thus, in the basis direction, we can define the coordinate projection by modifying only two entries per row. The coordinate derivative can be computed as θ = ⟨∇f(X), Bℓ⟩ = [∇f(X)]ij − [∇f(X)]i(j+1).

3.6. CD on positive (semi)definite manifold

The set of fixed-rank symmetric positive semi-definite manifold (SPSD) matrices (Vandereycken et al., 2009; 2013; Massart & Absil, 2020) is defined as  Sn,p+ := {X ∈ Rn×n :X ⪰ 0, rank(X) = p}. When p = n, we recover the set of symmetric positive definite (SPD) matrices as  Sn,n+ ≡ Sn++.For the purpose of developing efficient CD updates on SPSD, we follow the parameterization purposed in (Massart & Ab- sil, 2020), i.e.,  X ∈ Sn×p+is factorized as  X = Y Y ⊤, Y ∈ Rn×p∗, which is unique up to the right-action of the orthogonal group O(p). The Riemannian gradient can be computed as  gradf(Y ) = ∇f(Y Y ⊤) = 2 sym(∇f(Y Y ⊤))Ybecause the  TY Rn×p∗ ≃ Rn×p. The main advantage of this parameterization is its simple expression of retraction, i.e., RetrY (tξ) = Y + tξ (Massart & Absil, 2020).

Choice of basis, retraction, and CD update. Using X = Y Y ⊤, the optimization problem is on  Rn×p∗with a simple retraction. For the objective  f : Sn,p+ → R, we ini-tialize  Y ∈ Rn×p∗and update as  RetrY (−η∇f(Y Y ⊤)) =Y − η∇f(Y Y ⊤)for some stepsize  η. We choose the basis to be  eie⊤jfor  i ∈ [n], j ∈ [p], which is orthonormal for the tangent space  TY Rn×p∗. The CD update is given by  RetrY (−ηθeie⊤j ), where  θ = ⟨∇f(Y Y ⊤), eie⊤j ⟩ =[∇f(Y Y ⊤)]ij. The simplicity of the geometry allows CD to be developed efficiently on  Sn,p+, which coincides with the Euclidean CD update in the Euclidean space. When p = n, we obtain a CD update for the SPD manifold.

CD update with the BW metric. The Bures-Wasserstein (BW) metric for the SPD manifold (p = n) has been recently studied for various machine learning applications (Bhatia et al., 2019; Han et al., 2021). For the BW metric, the gradient descent update is  ExpX(−ηgradf(X)) =X − 2η(∇f(X)X + X∇f(X)) + 4η2∇f(X)X∇f(X).Consider a basis  EijX + XEijwhere  Eij = eie⊤j +eje⊤i. The coordinate derivative is computed as  θij =⟨EijX + XEij, ∇f(X)⟩. Finally, the CD update is given by  X −2ηθij(EijX +XEij)+4η2θ2ijEijXEij. Each CDupdate modifies two rows and two columns of X.

Remark 3.11. For the SPD manifold (p = n), Darmwal & Rajawat (2023) propose CD updates based on the affine-invariant metric and Cholesky factorization. They specifically focus on a class of objective functions and show that the exponential map computations are efficient. In contrast, our choices of parameterization/metric directly leads to a faster retraction.

RCD. We present the proposed Riemannian coordinate descent (RCD) algorithm in Algorithm 1. The complexity of RCD per iteration is the complexity of one first-order oracle and the update complexity in Table 1. Although

image

for some problem settings, we may explore the structure of the objective to efficiently compute  θ = ⟨∇f(X), Bℓ⟩, for general problem instances,  ∇f(X)becomes the main computational bottleneck.

RCDLin. To reduce gradient computations in the RCD setup (especially for non-linear objectives), we also propose the Riemannian linearized coordinate descent (RCDlin) method in Algorithm 1. The main difference with RCD is that the variables are updated using an anchored gradient at  Xk(which does not change for inner iterations). This scheme is equivalent to taking a linearization of the original cost function at  Xk, and in the inner iterations, we solve:  minX∈M�g(X) := f(Xk) + ⟨∇f(Xk), X − Xk⟩},where the inner product and subtraction are defined in the ambient Euclidean space. Subsequently, the Euclidean gradient at  Xskis  ∇g(Xsk) = ∇f(Xk), and thus,  θsk =⟨∇f(Xk), Bℓsk⟩. For the randomized setting with S = 1, RCDlin is equivalent to RCD. Additionally, for linear problems where  ∇f(Xk) = C (Cis some constant matrix), RCDlin also reduces to RCD.

Convergence and complexity of Algorithm 1

We next discuss the convergence analysis of RCD and RCDlin. It follows the standard analysis for CD algorithms (Wright, 2015). Note that Gutman & Ho-Nguyen (2023) mostly consider the analysis of CD algorithms under exponential map and parallel transport operations. In contrast, we consider the more general retraction and vector transport operations. We also adapt our analysis for RCDlin. For brevity, our analysis is informally discussed here. The analysis is in a compact neighbourhood around a critical point, which is required for validating certain regularity assumptions, boundedness of basis and projection onto the basis, and smoothness of the objective (details in Appendix F).

On RCD. We start by showing the convergence of RCD under randomized selection of basis and certain regularity assumptions. L is the smoothness constant of the objective.

Theorem 4.1 (Randomized RCD). Under mild assumptions, consider RCD with  S = 1 and ℓsk selected uniformly at random from I. Then, choosing  η = Θ� 1L�leads to min0≤k≤K−1 E∥gradf(Xk)∥2Xk ≤ O� |I|LK �.

The convergence of RCD with cyclic selection of basis requires further assumptions that bound the difference of the constructed bases between tangent spaces. These are reasonable given the compactness of the domain.

Theorem 4.2 (Cyclic RCD). Under mild assumptions in addition to the ones required by randomized RCD, consider RCD with S = |I| and  ℓsk = s + 1for s = 0, ..., |I| − 1. Then, for  η = Θ� 1L�, we have

image

On RCDlin. The key idea here is to relate the coordinate derivative  θsk = ⟨∇f(Xk), Bℓsk⟩to the correct descent derivative  ⟨∇f(Xsk), Bℓsk⟩. In randomized settings, we can show the same convergence rate as RCD up to some additional constants regulated by the difference between  θsk andthe descent direction. For the cyclic settings, however, we require S = |I| in order to cycle through all the basis.

Theorem 4.3 (Randomized and cyclic RCDlin). Under assumptions required in Theorems 4.1 & 4.2, suppose θskand  ⟨∇f(Xsk), Bℓsk⟩are positively related. Then, consider randomized RCDlin with  ℓskselected uniformly at random from I. Choosing  η = Θ( 1L)leads to min0≤k≤K−1,0≤s≤S−1 E∥gradf(Xsk)∥2Xsk ≤ � |I|LKS�. In addition, consider cyclic RCDlin with S = |I| and  ℓsk =s + 1 for  s = 0, ..., |I| − 1. Also, if  η = Θ� 1L�, then

image

Complexity analysis. Let the cost of computing the coordinate derivative  θand CD update be  δ(last column of Table 1). Then, the total computational cost of RCD and RCDlin is  O(KS(F + δ)) and O(K(F + Sδ)), respectively, where F denotes the cost of computing  ∇f(X). We note that the proposed algorithms can parallely update in disjoint basis directions. For example, in the Stiefel/Grassmann case, we can select n/2 non-overlapping index pairs, which results in n/2 independent Givens rotation, and can be parallelized.

We now benchmark the performance of the proposed RCD and RCDlin algorithms in terms of computational efficiency (flop counts and/or runtime) and convergence quality (distance to optimality). One of the considered baselines is the Riemannian gradient descent method (RGD), a full gradient method. As RGD exploits the entire gradient direction, it has advantage over CD algorithms. However, RGD is significantly more costly than CD in every up-

image

Figure 1: The Procrustes problem with varying p: (a) p = 150 and (b) p = 50. (Top row) Comparing various algorithms in terms of flop counts. (Bottom row) Comparing various algorithms in terms of runtime. We observe that our RCD algorithm obtains better flop counts than the baselines in flop counts and is competitive in terms of runtime.

date. Our codes are implemented using the Manopt toolbox (Boumal et al., 2014) and run on a laptop with an i5-10500 3.1GHz CPU processor. The codes are available at https://github.com/andyjm3.

5.1. Orthogonal Procrustes and PCA

Orthogonal Procrustes problem. We aim to solve minX∈St(n,p) ∥XA − B∥2(≡ −⟨XA, B⟩)for given matrices  A ∈ Rp×p, B ∈ Rn×p. There exists a closed-form solution provided by the (thin) SVD of  BA⊤. For this, RCD and RCDlin have same updates as  ∇f(X) = −BA⊤for X ∈ St(n, p). In experiments, we generate random matrices A, B and evaluate the performance against optimality gap computed as  |f(Xk) − f ∗|/|f ∗|.

Baselines. The closest baseline to RCD is TSD (Gutman & Ho-Nguyen, 2023), which is a CD method under an alternative construction of bases. As discussed, while TSD updates the columns, the proposed RCD updates the rows. Since RCD is equivalent to TSD for p = n, we focus only on the p < n setting. For both RCD and TSD, we use the cyclic selection of basis. We also compare against RGD methods with QR, Cayley (CL), and exponential (EXP) retractions. For all the methods, we tune the stepsize.

Results. In Figure 1, we show results with varying dimension p. While the proposed RCD obtains better flop counts than the baselines in flop counts, it is competitive in terms of runtime. We highlight that the runtime of RCD can be further improved with parallel implementation. In Figure 2c, we compare a variety of basis selection rules for both

image

Figure 2: (a) & (b): Experiments on the PCA problem with n = 200, p = 50. In (a), we observe that our algorithm RCDlin achieves the fastest convergence due to low per-iteration cost. In (b), we compare various strategies for basis selection: cyclic selection (-c) and uniformly random selection (-r) of basis for TSD, RCD, and RCDlin, and selection without replacement (-nr) for RCDlin. We observe that cyclic and selection without replacement strategies are better than random selection. (c) & (d): Experiments on the Procrustes problem with n = 200, p = 150. In (c), we again observe that cyclic selection performs better than random selection. In (d), RCD performs competitively against the infeasible methods.

TSD and RCD: cyclic selection (‘c’) and uniformly random selection (‘r’). We observe that cyclic selection is more favourable than the random selection rule for both the methods. We compare against full gradient infeasible methods in Figure 2d, including PLAM (Gao et al., 2019), PCAL (Gao et al., 2019), PenCF (Xiao et al., 2022), ExPen (Xiao & Liu, 2021; Xiao et al., 2023), and Landing (Ablin & Peyr´e, 2022; Ablin et al., 2023). RCD is performs competitively against infeasible methods for orthogonality constraints.

PCA problem. The PCA problem solves a quadratic maximization problem as  maxX⊤X=Ip tr(X⊤AX)for some positive definite matrix A, i.e.,  A ∈ Sn++. This problem is in fact an optimization problem over the Grassmann manifold because the objective is invariant to basis change of X. Hence, we use the Riemannian distance to the optimal solution on the Grassmann manifold to measure the performance. As discussed in Section 3.2, our proposed RCD has well-defined updates on the Grassmann manifold. In contrast, TSD is not invariant to the basis change. For experiments, we generate A with a condition number  103 andwith exponential decay of eigenvalues. For TSD, RCD, and RCDlin, we implement the cyclic selection of basis.

Results. In Figure 2a, we observe that RCDlin achieves the best performance due to its low per-iteration cost. We note that TSD converges slowly due to non-invariance of the CD updates. In Figure 2b, we compare the cyclic and uniformly random selection of the basis of RCD, RCDlin, and TSD. For RCDlin, we also implement the selection without replacement (‘nr’) strategy. We observe that cyclic and ‘nr’ strategies are better than random selection.

5.2. Orthogonal deep networks for distillation

We next evaluate RCD on a deep learning based distillation problem (Hinton et al., 2015). Let  Θdenote the parameters of the student network (S) while  ΘTbe the optimal parameters of the teacher network (T). Then, the aim

image

Figure 3: Experiments on the distillation problem. We observe that the proposed RCD algorithm performs better than the baselines both in terms of flop counts and runtime.

is to learn S that approximates T, i.e., minimize  L(Θ) =∥ΨΘ(X) − ΨΘT(X)∥2, where  ΨΘ(X) ∈ RN×doutrepresent the output of the network for some input  X ∈ RN×din.The network architecture is detailed in Appendix A.1. Here, we constrain all the weights to be orthonormal, thus posing the problem as optimization over the joint space of Stiefel and Euclidean manifolds. For experiments, we consider a six-layer network and set  din = 500, dout = 200. We use stochastic versions of RGD and RCD where the input samples are randomly generated. In Figure 3, RCD outperforms the baselines in terms of flop counts and runtime. This is because RCD has the most cost-efficient update per iteration, while maintaining a competitive convergence rate.

5.3. Nearest matrix problem

We consider the problem:  minX∈Sp(n,p) ∥X − A∥2on the symplectic manifold (Gao et al., 2021b). We follow the setting in (Gao et al., 2021b) by generating A as a random matrix with p = n = 200. The algorithms are evaluated on optimality gap  |f(Xk) − f ∗|/|f ∗|, where f ∗ is obtained by running the conjugate gradient algorithm with the Cayley retraction. We implement RCD and RCDlin with both CD and block CD updates (discussed in Remark 3.8). As there

image

Figure 4: Experiments on the nearest matrix problem. We notice the utility of the block-update variants of our RCD and RCDlin algorithms in obtaining faster convergence.

image

Figure 5: Experiments on learning Lorentz (hyperbolic) embeddings. The performance of our RCDlin algorithms (with cyclic and time-cyclic basis selection) is competitive to RGD.

is no CD baseline on the symplectic manifold, we compare against the full gradient RGD algorithms with three retractions: Cayley (‘CL’), quasi-geodesic (‘QG’), and SR (‘SR’). In Figure 4, RCD with block update shows clear advantage in both flop counts and runtime.

5.4. Learning Lorentz embeddings

We consider the task of learning embeddings for word hierarchies, which is formulated on the hyperbolic manifold using the hyperboloid model (Nickel & Kiela, 2017; 2018; Jawanpuria et al., 2019b). The goal is to map word pairs with hypernymy relations closer while separate those without. We follow the formulation in (Nickel & Kiela, 2018), and the details are in Appendix A.2.

For experiment settings, we train 5-dimensional embeddings (n = 5) for WordNet mammals subtree (Miller, 1998). We adopt the RCDlin algorithm with two selection rules for the basis  HijJX: cyclic (‘RCDlin-c’) and time cyclic (‘RCDlin-tc’). The cyclic selection loops through all  n(n − 1)/2pairs per iteration. The time cyclic selection only loops through all the space-time coordinate pairs, namely (1, 2), (1, 3), ..., (1, n), which reduces the computational cost to scale linearly with dimension n. For RCDlin-c and RCDlin-tc, we use a linearly-decaying stepsize, i.e., η/(1 + 0.1 × epoch). For RGD we use a fixed stepsize  ηwhich generally leads to better convergence. We tune and set  η = 1.0for RCDlin and 0.5 for RGD. We use the met-

image

Figure 6: Experiments on the weighted least squares problem in two settings: (a) n = p = 500 and  A = 1n1⊤nis a dense matrix and (b) n = 500, p = 100 and A is a random symmetric matrix with 70% entries as 1 and others are 0. While RGD and the proposed RCDlin have similar convergence rate in (a), RCDlin has clear advantage in (b).

rics for evaluating the convergence: mean average precision (MAP) and mean rank (MR) (Nickel & Kiela, 2017; 2018). In Figure 5, we see that RCDlin converges at a similar rate compared to RGD in terms of runtime.

5.5. Weighted least squares (SPSD manifold)

The weighted least squares problem is  minX∈Sn×p+ ∥A ⊙X − B∥, where  A ∈ {0, 1}n×nmasks the known entries in B. It is an instance of the matrix completion problem (Han et al., 2021). For the experiment, we follow (Han et al., 2021) by generating  B = A ⊙ X∗where  X∗is an SPD/SPSD matrix with exponentially decaying eigenvalues. We consider two settings: (left)  n = p = 500, A = 1n1⊤nis a dense matrix and (right) n = 500, p = 100 and A is a random symmetric matrix with 70% entries as 1 and others are 0. We compare RCDlin with RGD (for  X = Y Y ⊤factorization). We set S = np/5 and select the coordinates randomly without replacement. In Figure 6, we observe that RCDlin performs competitively with RGD on the SPD manifold with dense A while performing significantly better on the SPSD manifold with sparse A.

In this work, we discuss how to develop computationally efficient CD updates for a number of matrix manifolds. The main bottleneck in developing CD methods is on finding the right basis parameterization of the tangent space. We show precise constructions for various manifolds and propose two CD algorithms: RCD and RCDlin. RCDlin specifically reduces the gradient computations of RCD further. Our experiments show the benefit of our proposed CD updates on a number of problem instances.

This paper presents work whose goal is to advance optimization methods with applications in Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.

Ablin, P. and Peyr´e, G. Fast and accurate optimization on the orthogonal manifold without retraction. In International Conference on Artificial Intelligence and Statistics, pp. 5636–5657. PMLR, 2022.

Ablin, P., Vary, S., Gao, B., and Absil, P.-A. Infeasible deterministic, stochastic, and variance-reduction algorithms for optimization under orthogonality constraints. arXiv:2303.16510, 2023.

Absil, P.-A., Mahony, R., and Sepulchre, R. Optimization algorithms on matrix manifolds. Princeton University Press, 2008.

Arjovsky, M., Shah, A., and Bengio, Y. Unitary evolution recurrent neural networks. In International Conference on Machine Learning, pp. 1120–1128. PMLR, 2016.

Bai, Z. and Li, R.-C. Minimization principles and com- putation for the generalized linear response eigenvalue problem. BIT Numerical Mathematics, 54:31–54, 2014.

Bhatia, R. Positive definite matrices. Princeton University Press, 2009.

Bhatia, R., Jain, T., and Lim, Y. On the bures–wasserstein distance between positive definite matrices. Expositiones Mathematicae, 37(2):165–191, 2019.

Bonnabel, S. Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58 (9):2217–2229, 2013.

Boumal, N. An introduction to optimization on smooth manifolds. Cambridge University Press, 2023.

Boumal, N., Mishra, B., Absil, P.-A., and Sepulchre, R. Manopt, a matlab toolbox for optimization on manifolds. The Journal of Machine Learning Research, 15(1):1455– 1459, 2014.

Cardoso, J. R. and Leite, F. S. Exponentials of skew-symmetric matrices and logarithms of orthogonal matrices. Journal of Computational and Applied Mathematics, 233(11):2867–2875, 2010.

Darmwal, Y. and Rajawat, K. Low-complexity subspace- descent over symmetric positive definite manifold. Technical report, arXiv preprint arXiv:2305.02041, 2023.

Douik, A. and Hassibi, B. Manifold optimization over the set of doubly stochastic matrices: A second-order geometry. IEEE Transactions on Signal Processing, 67 (22):5761–5774, 2019.

Edelman, A., Arias, T. A., and Smith, S. T. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.

Gao, B., Liu, X., and Yuan, Y.-x. Parallelizable algorithms for optimization problems with orthogonality constraints. SIAM Journal on Scientific Computing, 41(3):A1949– A1983, 2019.

Gao, B., Son, N. T., Absil, P.-A., and Stykel, T. Geometry of the symplectic Stiefel manifold endowed with the Euclidean metric. In Geometric Science of Information, pp. 789–796. Springer, 2021a.

Gao, B., Son, N. T., Absil, P.-A., and Stykel, T. Riemannian optimization on the symplectic Stiefel manifold. SIAM Journal on Optimization, 31(2):1546–1575, 2021b.

Gao, B., Son, N. T., and Stykel, T. Optimization on the symplectic Stiefel manifold: SR decomposition-based retraction and applications. Technical report, arXiv preprint arXiv:2211.09481, 2022.

Gutman, D. H. and Ho-Nguyen, N. Coordinate descent without coordinates: Tangent subspace descent on Riemannian manifolds. Mathematics of Operations Research, 48(1):127–159, 2023.

Han, A. and Gao, J. Improved variance reduction methods for Riemannian non-convex optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44 (11):7610–7623, 2021.

Han, A., Mishra, B., Jawanpuria, P. K., and Gao, J. On Riemannian optimization over positive definite matrices with the Bures-Wasserstein geometry. In Advances in Neural Information Processing Systems, volume 34, pp. 8940–8953, 2021.

Han, A., Mishra, B., Jawanpuria, P., and Gao, J. Rieman- nian block SPD coupling manifold and its application to optimal transport. Machine Learning, pp. 1–28, 2022.

Han, A., Mishra, B., Jawanpuria, P., and Gao, J. Nonconvex- nonconcave min-max optimization on riemannian manifolds. Transactions on Machine Learning Research, 2023a. ISSN 2835-8856.

Han, A., Mishra, B., Jawanpuria, P., Kumar, P., and Gao, J. Riemannian hamiltonian methods for min-max optimization on manifolds. SIAM Journal on Optimization, 33(3): 1797–1827, 2023b.

Han, A., Mishra, B., Jawanpuria, P., and Gao, J. Differen- tially private Riemannian optimization. Machine Learning, 113(3):1133–1161, 2024a.

Han, A., Mishra, B., Jawanpuria, P., and Takeda, A. A framework for bilevel optimization on Riemannian manifolds. Technical report, arXiv preprint arXiv:2402.03883, 2024b.

Hinton, G., Vinyals, O., and Dean, J. Distilling the knowl- edge in a neural network. Technical report, arXiv preprint arXiv:1503.02531, 2015.

Huang, M., Ma, S., and Lai, L. A Riemannian block co- ordinate descent method for computing the projection robust Wasserstein distance. In International Conference on Machine Learning, pp. 4446–4455. PMLR, 2021.

Huang, W., Absil, P.-A., and Gallivan, K. A. A Riemannian symmetric rank-one trust-region method. Mathematical Programming, 150(2):179–216, 2015.

Huang, Z., Huang, W., Jawanpuria, P., and Mishra, B. Federated learning on Riemannian manifolds with differential privacy. Technical report, arXiv preprint arXiv:2404.10029, 2024.

Jawanpuria, P. and Mishra, B. A unified framework for structured low-rank matrix learning. In International Conference on Machine Learning (ICML), 2018.

Jawanpuria, P., Balgovind, A., Kunchukuttan, A., and Mishra, B. Learning multilingual word embeddings in a latent metric space: a geometric approach. Transactions of the Association for Computational Linguistics, 7(3): 107–120, 2019a.

Jawanpuria, P., Meghwanshi, M., and Mishra, B. Low- rank approximations of hyperbolic embeddings. In IEEE Conference on Decision and Control, 2019b.

Jawanpuria, P., Meghwanshi, M., and Mishra, B. Geometry- aware domain adaptation for unsupervised alignment of word embeddings. In Annual Meeting of the Association for Computational Linguistics, 2020.

Jiang, Y., Zhang, H., Qiu, Y., Xiao, Y., Long, B., and Yang, W.-Y. Givens coordinate descent methods for rotation matrix learning in trainable embedding indexes. In International Conference on Learning Representations, 2022.

Kasai, H., Jawanpuria, P., and Mishra, B. Riemannian adaptive stochastic gradient algorithms on matrix manifolds. In International Conference on Machine Learning (ICML), 2019.

Knight, P. A. The Sinkhorn–Knopp algorithm: convergence and applications. SIAM Journal on Matrix Analysis and Applications, 30(1):261–275, 2008.

Kressner, D., Steinlechner, M., and Vandereycken, B. Low- rank tensor completion by Riemannian optimization. BIT Numerical Mathematics, 54:447–468, 2014.

Lezcano Casado, M. Trivializations for gradient-based opti- mization on manifolds. In Advances in Neural Information Processing Systems, volume 32, 2019.

Li, J. and Ma, S. Federated learning on Riemannian mani- folds. Technical report, arXiv preprint arXiv:2206.05668, 2022.

Luo, Z.-Q. and Tseng, P. On the convergence of the coordi- nate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72(1):7–35, 1992.

Massart, E. and Abrol, V. Coordinate descent on the or- thogonal group for recurrent neural network training. In AAAI Conference on Artificial Intelligence, volume 36, pp. 7744–7751, 2022.

Massart, E. and Absil, P.-A. Quotient geometry with sim- ple geodesics for the manifold of fixed-rank positivesemidefinite matrices. SIAM Journal on Matrix Analysis and Applications, 41(1):171–198, 2020.

Miller, G. A. WordNet: An electronic lexical database. MIT press, 1998.

Mishra, B., Kasai, H., Jawanpuria, P., and Saroop, A. A Riemannian gossip approach to subspace learning on Grassmann manifold. Machine Learning, 108(10):1783– 1803, 2019.

Mishra, B., Satyadev, N., Kasai, H., and Jawanpuria, P. Man- ifold optimization for non-linear optimal transport problems. Technical report, arXiv preprint arXiv:2103.00902, 2021.

Nesterov, Y. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.

Nickel, M. and Kiela, D. Poincar´e embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, volume 30, 2017.

Nickel, M. and Kiela, D. Learning continuous hierarchies in the Lorentz model of hyperbolic geometry. In International Conference on Machine Learning, pp. 3779–3788. PMLR, 2018.

Nimishakavi, M., Jawanpuria, P., and Mishra, B. A dual framework for low-rank tensor completion. In Conference on Neural Information Processing Systems (NeurIPS), 2018.

Peng, L. and Vidal, R. Block coordinate descent on smooth manifolds. Technical report, arXiv preprint arXiv:2305.14744, 2023.

Pennec, X., Fillard, P., and Ayache, N. A Riemannian Framework for Tensor Computing. International Journal of Computer Vision, 66(1):41–66, 2006.

Reimherr, M., Bharath, K., and Soto, C. Differential privacy over riemannian manifolds. In Advances in Neural Information Processing Systems, volume 34, pp. 12292–12303, 2021.

Sato, H., Kasai, H., and Mishra, B. Riemannian stochastic variance reduced gradient algorithm with retraction and vector transport. SIAM Journal on Optimization, 29(2): 1444–1472, 2019.

Shalit, U. and Chechik, G. Coordinate-descent for learning orthogonal matrices through Givens rotations. In International Conference on Machine Learning, pp. 548–556. PMLR, 2014.

Shi, D., Gao, J., Hong, X., Boris Choy, S., and Wang, Z. Coupling matrix manifolds assisted optimization for optimal transport problems. Machine Learning, 110:533–558, 2021.

Siegel, J. W. Accelerated optimization with orthogonality constraints. Journal of Computational Mathematics, 39 (2):207–206, 2020.

Sinkhorn, R. Diagonal equivalence to matrices with pre- scribed row and column sums. The American Mathematical Monthly, 74(4):402–405, 1967.

Tripuraneni, N., Flammarion, N., Bach, F., and Jordan, M. I. Averaging stochastic gradient descent on Riemannian manifolds. In Conference on Learning Theory, pp. 650– 687. PMLR, 2018.

Utpala, S., Han, A., Jawanpuria, P., and Mishra, B. Im- proved differentially private Riemannian optimization: Fast sampling and variance reduction. Transactions on Machine Learning Research, 2022.

Vandereycken, B. Low-rank matrix completion by Rieman- nian optimization. SIAM Journal on Optimization, 23(2): 1214–1236, 2013.

Vandereycken, B., Absil, P.-A., and Vandewalle, S. Embed- ded geometry of the set of symmetric positive semidefinite matrices of fixed rank. In IEEE/SP Workshop on Statistical Signal Processing, pp. 389–392. IEEE, 2009.

Vandereycken, B., Absil, P.-A., and Vandewalle, S. A Rie- mannian geometry with complete geodesics for the set of positive semidefinite matrices of fixed rank. IMA Journal of Numerical Analysis, 33(2):481–514, 2013.

Wang, J., Chen, Y., Chakraborty, R., and Yu, S. X. Or- thogonal convolutional neural networks. In Conference on Computer Vision and Pattern Recognition, pp. 11505– 11515, 2020.

Wen, Z. and Yin, W. A feasible method for optimization with orthogonality constraints. Mathematical Programming, 142(1-2):397–434, 2013.

Wilson, B. and Leimeister, M. Gradient descent in hyperbolic space. Technical report, arXiv preprint arXiv:1805.08207, 2018.

Wright, S. J. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, 2015.

Xiao, N. and Liu, X. Solving optimization problems over the Stiefel manifold by smooth exact penalty function. Technical report, arXiv preprint arXiv:2110.08986, 2021.

Xiao, N., Liu, X., and Yuan, Y.-x. A class of smooth exact penalty function methods for optimization problems with orthogonality constraints. Optimization Methods and Software, 37(4):1205–1241, 2022.

Xiao, N., Liu, X., and Toh, K.-C. Dissolving constraints for Riemannian optimization. Mathematics of Operations Research, 2023.

Yuan, G. A block coordinate descent method for nonsmooth composite optimization under orthogonality constraints. Technical report, arXiv preprint arXiv:2304.03641, 2023.

Zhang, H., J Reddi, S., and Sra, S. Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds. In Advances in Neural Information Processing Systems, volume 29, 2016.

Zhang, P., Zhang, J., and Sra, S. Sion’s minimax theorem in geodesic metric spaces and a Riemannian extragradient algorithm. SIAM Journal on Optimization, 33(4):2885– 2908, 2023.

Zhu, X. A Riemannian conjugate gradient method for op- timization on the Stiefel manifold. Computational Optimization and Applications, 67:73–110, 2017.

The experiments are run on a laptop with an i5-10500 3.1GHz CPU processor.

A.1. Orthogonal deep networks for distillation

Here we provide the detailed network architecture for the distillation task. In particular, we define a L-layer feed-forward neural network with Tanh activation function, i.e.,  Xℓ+1 = tanh(WℓXℓ + bℓ)for  ℓ ∈ [L], where  W1 ∈ Rdin×d, WL ∈Rd×dout, and Wℓ ∈ Rd×d for ℓ ̸= 1, L. In the experiment, we set L = 6.

A.2. Learning Lorentz embeddings

Here we provide the problem formulation for the task of learning Lorentz embeddings. Let D = {(u, v)} be the related word pairs and construct  Neg(u) = {v : (u, v) /∈ D}as the negative samples of word u. The objective is to learn embeddings  xufor all word u by solving

image

where  dist(xu, xv) = arccosh(−⟨xu, xv⟩L)is the Lorentz Riemannian distance.

We start by reviewing the developments of coordinate descent on the orthogonal manifold (Shalit & Chechik, 2014; Massart & Abrol, 2022; Jiang et al., 2022), Stiefel manifold (Gutman & Ho-Nguyen, 2023), and symmetric positive definite (SPD) manifold (Darmwal & Rajawat, 2023), which motivate the proposed general framework for other manifolds.

Some other works (Huang et al., 2021; Peng & Vidal, 2023) study (block) coordinate descent on a product of manifolds, where each update concerns a component manifold. This is different to our considered setting, where the update is defined for coordinate on the tangent space for a single manifold.

B.1. Orthogonal manifold

Orthogonal manifold O(n) is the smooth space formed by the orthogonality constraints, i.e.,  O(n) := {X ∈ Rn×n :XX⊤ = X⊤X = In}. The tangent space can be identified as  TXO(n) := {ΩX : Ω ∈ Skew(n)}. The Riemannian metric coincides with the Euclidean metric, i.e.,  ⟨U, V ⟩X = 12⟨U, V ⟩. The 12 is added to ensure consistency with the canonical metric for the Stiefel manifold, as we shall see later. This leads to the Riemannian gradient  gradf(X) = ∇f(X) − X∇f(X)⊤Xand the exponential retraction is given by  RetrX(θΩX) = expm(θΩ)X, for some θ ∈ R.

Remark B.1. We remark that in all the existing works (Shalit & Chechik, 2014; Massart & Abrol, 2022; Jiang et al., 2022), the tangent space is parameterized as  TXO(n) := {XΩ′ : Ω′ ∈ Skew(n)}and thus the exponential retraction amounts to RetrX(θXΩ′) = Xexpm(θΩ′). Such a formulation is equivalent to the above by letting  Ω = XΩ′X⊤ ∈ Skew(n). Our reformulation allows natural generalization of the coordinate descent framework to column orthonormal matrices (the Stiefel manifold), where the orthogonal matrix is a special case.

The manifold has a dimension of  n(n − 1)/2and its tangent space can be provided with an orthonormal basis  HijXwhere  Hij := eie⊤j − eje⊤iis the basis for the skew-symmetric matrices. In each basis direction  HijX, the exponential retraction reduces to the Givens rotation, which allows efficient updates, i.e.,  RetrX(θHijX) = Gij(θ)X where Gij(θ) =In + (cos(θ) − 1)(eie⊤i + eje⊤j ) + sin(θ)(eie⊤j − eje⊤i )is known as the Givens rotation matrix around axes i, j with angle −θ.

In order to minimize a function  f : O(n) → R, one needs to update the variables in the negative gradient direction. Here along the basis direction, coordinate descent aims to minimize the function  f(Gij(θ)X)with respect to  θ. One strategy is to solve this one-variable optimization problem directly as in (Shalit & Chechik, 2014). When the objective is more involved, we can approximately solve this problem by following a descent direction (Massart & Abrol, 2022; Jiang et al., 2022), which is given by  − ddθf(Gij(θ)X)|θ=0 = −⟨gradf(X), HijX⟩X. This leads to the coordinate descent update in the direction of −⟨gradf(X), HijX⟩XHijX, which modifies two rows of X every iteration, resulting in an O(n) complexity per update. One pass over all the coordinates requires n(n−1)2iterations, leading to  O(n3)complexity in total, which is comparable to the commonly considered retractions, including the exponential, Cayley and QR retractions.

B.2. Stiefel manifold

The Stiefel manifold St(n, p) is the set of column orthonormal matrices of size  Rn×p, i.e.,  St(n, p) := {X ∈ Rn×p :X⊤X = Ip}. When  p = n, St(n, n) ≡ O(n). The tangent space of Stiefel manifold is identified as  TXSt(n, p) = {U ∈Rn×p : X⊤U + U ⊤X = 0} = {XΩ + X⊥K : Ω ∈ Skew(p), K ∈ R(n−p)×p}. The Euclidean metric turns out to be a valid Riemannian metric (Edelman et al., 1998),  ⟨U, V ⟩X = ⟨U, V ⟩for any  U, V ∈ TXM. The Riemannian gradient is derived as  gradf(X) = ∇f(X) − Xsym(X⊤∇f(X)).

In (Gutman & Ho-Nguyen, 2023), a coordinate (subspace) descent algorithm has been developed for general manifolds, via selecting proper subspaces for the tangent space. Although showing theoretical guarantees, the paper only provides a concrete developments for Stiefel manifold (thus including the orthogonal manifold). The basis considered for the tangent space of Stiefel manifold is  {XHij}1≤i<j≤n ∪ {ve⊤k : X⊤v = 0}k∈[p], where  Hij = eie⊤j − eje⊤i. They show along the direction of basis, the exponential retraction can be computed efficiently. That is, for basis  XHij, we compute RetrX(tXHij) = XGij(t), which also leads to the Givens rotation. The projection of Riemannian gradient onto the basis is given by  ⟨gradf(X), XHij⟩XHij. For basis  ve⊤k , RetrX(ve⊤k ) = X +�cos(∥v∥)[X]:,j +sin(∥v∥) v∥v∥ −[X]:,j�e⊤k andthe projection of Riemannian gradient is  (In − XX⊤)[∇f(X)]:,ke⊤k . This essentially updates j-th column by Riemannian gradient descent over sphere. We notice that each coordinate update costs O(n) while the projection onto second basis costs O(np).

Further, we highlight a recent paper (Yuan, 2023), which considers block coordinate descent updates for Stiefel manifold by modifying k rows. The strategy is to decompose the the rows of the variable into two sets and solve for a subproblem that updates k rows instead. Nevertheless, each subproblem can be difficult to solve in general.

B.3. SPD manifold

A recent work (Darmwal & Rajawat, 2023) develops coordinate (subspace) descent algorithm for symmetric positive definite (SPD) manifold, i.e.,  Sn++ := {X ∈ Rn×n : X ≻ 0}, with tangent space given by  TXSn++ = {X ∈ Rn×n :X ∈ Sym(n)}. Optimization with SPD constraint is difficult as the cost of maintaining positive definiteness requires at least  O(n3). Under the affine-invariant metric (Bhatia, 2009), i.e.,  ⟨U, V ⟩X = tr(UX−1V X−1), the paper verifies that the tangent space can be provided with an orthonormal basis  LEijL⊤where  LL⊤ = Xis the Cholesky decomposition and  Eij = (eie⊤j + eje⊤i )/√2for  i ̸= jand  eie⊤jfor i = j. The exponential retraction along the basis direction can be simplified as  RetrX(tLEijL⊤) = Lexpm(tEij)L⊤, where the Cholesky factor of  expm(tEij)has a simple form. Thus it suffices to update the Cholesky factor. The coordinate descent then updates as  RetrX(−ηθLEijL⊤) where θis computed as the coordinate derivative  θ = ⟨gradf(X), LEijL⊤⟩X = ⟨∇f(X), LEijL⊤⟩because the Riemannian gradient has the form  gradf(X) = X∇f(X)X.

This section summarizes the existing works for optimization under the respective manifold constraints. These methods exploit the full gradient information and in general have advantage over CD methods.

Optimization on orthogonal, Stiefel, and Grassmann manifolds. Optimization with orthogonality constraint has been widely studied. Apart from the Riemannian optimization approach, many recent works turn to infeasible methods, either through converting the orthogonality constrained problem into an unconstrained counterpart in the Euclidean space (Xiao et al., 2022; Lezcano Casado, 2019; Xiao & Liu, 2021; Xiao et al., 2023), or following a direction that leads to the same critical point on the manifold (Gao et al., 2019; Ablin & Peyr´e, 2022; Ablin et al., 2023). We provide a through review of these methods in Appendix section D. Although the per-iteration cost is smaller without using a retraction to ensure feasibility, the methods are sensitive to the choice of stepsize and other parameters, mostly a regularization parameter. In the experiment sections, we observe the infeasible methods either require careful tuning of multiple parameters or some stepsize sequence to show good performance.

Optimization on hyperbolic, symplectic and SPSD manifold. Optimization over hyperbolic space H(n, p) has mostly focused on the case where p = 1 (Nickel & Kiela, 2018; Wilson & Leimeister, 2018). Given the exponential map is already efficient in this case, few works have explored more efficient alternatives to Riemannian optimization. Similarly for the symplectic manifold, existing works (Gao et al., 2021b;a; 2022) have focused on the retraction-based Riemannian optimization approach. Optimization over SPD matrices has a long history and can be solved with semidefinite programming if the objective is convex. For general cost functions and with additional rank constraint, many works (Vandereycken et al., 2009; 2013; Massart & Absil, 2020) have defined Riemannian geometries for the constraint set to leverage the tools from Riemannian optimization.

This section reviews various infeasible methods for solving optimization problems under orthogonality constraints, i.e.,

image

where  Λ = Λ⊤is the Lagrangian multipliers. Because  Λis symmetric, we can compute  Λ = X⊤∇f(X) =∇f(X)⊤X = sym(∇f(X)⊤X)(by left-multiplying the first equation by  X⊤). Subsequently, the first condition becomes  ∇f(X) − X∇f(X)⊤X = skew(∇f(X)X⊤)X = 0, which in fact corresponds to the first-order condition of Riemannian optimization (under the canonical metric). The augmented Lagrangian of (4) is given by

image

and the Augmented Lagrangian Method (ALM) outlined in (Gao et al., 2019) considers alternately updating X and  Λ. However the numerical performance of ALM is sensitive to the choice of  µ, which has been empirically verified in (Gao et al., 2019).

PLAM. Gao et al. (2019) propose an alternative update of the Lagrangian multipliers  Λby setting it to  sym(∇f(X)⊤X),which is optimal at first-order stationarity, and the proximal linearized augmented Lagrangian (PLAM) algorithm takes the update

image

which corresponds to the  Xk+1 = arg minX�∇XLµ(Xk, Λk), X −Xk�+ 12η∥X −Xk∥2 with Λk = sym(∇f(Xk)⊤Xk).However, the boundedness of the iterates cannot be expected without setting  λto be sufficiently large and  ηto be sufficiently small.

PCAL. Gao et al. (2019) further constrain the proximal linearized update in (5) to a redundant column-sphere constraint. This is used to reduce the sensitivity of convergence to  λ, µ. The update can be implemented in a column-wise parallel fashion as

image

PenCF Xiao et al. (2022) consider the merit function (used to analyze function value decrease in (Gao et al., 2019)) as an exact penalty function to be minimized. Specifically, the merit function is given by

image

Without proper constraints, directly minimizing h(X) can be problematic because h(X) can be unbounded. Hence, (Xiao et al., 2022) incorporates constraints to the minimization of h(X) (called PenC), including a ball, convex hull of Stiefel/oblique manifold, which includes the orthogonality constraint set. Then the paper shows the first-order and second-order critical points of PenC match those of the original problem (provided  βis sufficiently large). Nevertheless, the gradient  ∇h(X)involves second-order derivatives  ∇2f(X). Hence, the paper similarly considers approximating  ∇h(X)with  ∇f(X) − Xsym(∇f(X)⊤X) + βX(X⊤X − Ip), which is the same as in (Gao et al., 2019) and then project the update to the constraint set.

Landing field Ablin et al. (2023) consider the update

where the direction takes a combination of the Riemannian gradient and a landing direction orthogonal to the Riemannian gradient.

ExPen Xiao & Liu (2021); Xiao et al. (2023) convert the constrained optimization problem to an unconstrained problem by minimizing the following objective:

image

It can be shown that when  βis sufficiently large, first-order and second-order critical points recover those of the original problem. When compared to the framework of PCAL/PenC, the problem is now unconstrained and the gradient can be computed as

image

E.1. Stiefel manifold

In the main text, we focus on the Euclidean metric on Stiefel manifold, i.e.,  ⟨U, V ⟩EX = ⟨U, V ⟩, which leads to the Rieman- nian gradient  gradf(X) = ∇f(X) − Xsym(X⊤∇f(X)). We here review another popular metric on Stiefel manifold, namely the canonical metric (Edelman et al., 1998), defined as  ⟨U, V ⟩CX = ⟨U, (In − 12XX⊤)V ⟩. The corresponding Riemannian gradient is derived as  gradCf(X) = ∇f(X) − X∇f(X)⊤X.

There exists a variety of retractions for the Stiefel manifold, including the QR retraction  RetrqrX(tU) = qf(X + tU) whereqf(A) extracts the q-factor from the thin QR decomposition. The Cayley retraction (Wen & Yin, 2013; Zhu, 2017) is RetrclX(tU) = (I − t2W)−1(I + t2W)X, where U = WX for some W ∈ Skew(n). The exponential retraction is given by RetrexpX (tU) =�X U�expm� �X⊤U −U ⊤UI X⊤U

Here we verify the same coordinate derivative is recovered under the canonical metric.

Lemma E.1. Let Bℓ = HijXfor  1 ≤ i < j ≤ n, then we have θ  = ⟨gradEf(X), Bℓ⟩EX = ⟨gradCf(X), Bℓ⟩CX =⟨∇f(X), Bℓ⟩ = [∇f(X)X⊤ − X∇f(X)⊤]ij.

Proof of Lemma E.1. Recall the  gradEf(X) = ∇f(X) − Xsym(X⊤∇f(X)). Then

image

where we use the fact that skew-symmetric matrix is orthogonal to symmetric matrix with respect to the Euclidean inner product. Similarly for the canonical metric,

image

This verifies  θis the same under the two metrics.

E.2. Grassmann manifold

Proof of Proposition 3.2. It suffices to verify that  ∇f(XQ) = ∇f(X)Qas the Euclidean inner product ⟨∇f(X)Q, HijXQ⟩ = ⟨∇f(X), HijX⟩. We start by noticing f(X) = f(XQ) for any  Q ∈ O(p). Then taking derivative on both sides gives  ∇f(X) = ∇f(XQ)Q⊤ and thus ∇f(XQ) = ∇f(X)Q.

E.3. Hyperbolic manifold

E.3.1. PROOFS

Proof of Proposition 3.3. By the decomposition, we can express  A = XΩ + XJ⊥K + XSfor some  Ω ∈ Skew(p), S ∈Sym(p), K ∈ R(n−p)×p. Left-multiplying  X⊤J gives X⊤JA = −Ω − S. Summing both sides with the transposes yields S = −sym(X⊤JA). Hence the projection to  TXH(d, r)is given by  ProjX(A) = A + Xsym(X⊤JA).

From the definition of Riemannian gradient, we have  Df(X)[U] = ⟨∇f(X), U⟩F = ⟨J∇f(X), U⟩L =⟨J∇f(X), ProjX(U)⟩L = ⟨ProjX(J∇f(X)), U⟩L, where we use the self-adjoint property of orthogonal projection with respect to the metric. Thus the Riemannian gradient is  gradf(X) = ProjX(J∇f(X)) = J∇f(X) +Xsym(X⊤∇f(X)).

Proof of Proposition 3.4. First, we see c(0) = X and we compute  c′(t) = expm(tWJ)WJX and hence c′(0) = WJX =U. The only part left is to show  c(t) ∈ H(n, p). This can be verified by showing expm(tWJ) is a Lorentz transform. To see this, let  L(t) := expm(tWJ)⊤Jexpm(tWJ). Then we have

image

Thus  L(t) = L(0) = J, ∀t. This completes the proof as  c(t) = expm(tWJ)X ∈ H(n, p)because  c(t)⊤Jc(t) =X⊤L(t)X = X⊤JX = −Ip.

Proof of Lemma 3.5. For the case where  i, j ̸= 1, we can see  HijJ = Hijand thus  expm(θHij)leads to the Givens rotation. However in the case where  i = 1, HijJ = Eij := eie⊤j + eje⊤i. Thus, it remains to show  expm(θEij)can be simplified. To this end, we see  E2tij = 12(Eii + Ejj), E2t−1ij = Eij, for t ∈ N. Then we can show  expm(θEij) =

image

In addition to the Euclidean metric, we define the canonical metric on hyperbolic manifold as for  U, V ∈ TXH(n, p),

image

The normal space under the canonical metric is the same as that for the Euclidean metric  NXH(d, r) = {XS : S ∈ Sym(p)}and thus the orthogonal projection can be derived also to be the same. The Riemannian gradient can be derived as follows.

Proposition E.2. The orthogonal projection of  A ∈ Rn×pto  TXH(n, p)is given by  ProjX(A) = A + Xsym(X⊤JA). The Riemannian gradient with respect to the canonical metric is  gradf(X) = −J∇f(X) − X∇f(X)⊤X =2skew(J∇f(X)X⊤)JX.

Proof of Proposition E.2. First we notice that  (J + 12JXX⊤J)−1 = J − XX⊤. This implies  ⟨U, V ⟩ = tr(U ⊤V ) =tr(U ⊤(J − XX⊤)(J + 12JXX⊤J)V ) = −⟨(J − XX⊤)U, V ⟩X. Then by definition of Riemannian gradient, we have  ⟨∇f(X), U⟩ = −⟨(J − XX⊤)∇f(X), U⟩X = ⟨ProjX�(XX⊤ − J)∇f(X)�, U⟩X.Hence, gradf(X) = ProjX�(XX⊤ − J)∇f(X)�= (XX⊤ − J)∇f(X) = −J∇f(X) − X∇f(X)⊤X.

image

Motivated by the Cayley transform for (generalized) Stiefel manifold, we define the Cayley transform for generalized hyperbolic manifold as follows. Then in Proposition E.4, we show the Cayley transform naturally leads to a valid retraction on the generalized hyperbolic manifold.

Definition E.3. For  X ∈ H(d, r)and  U = WJX ∈ TXH(d, r), the Cayley transform is defined as  CayX(U) =(I − 12WJ)−1(I + 12WJ)X, which is well-defined if  I − 12WJis nonsingular.

Proposition E.4 (Cayley retraction). For  X ∈ H(d, r)and  U ∈ TXH(d, r), The map  RetrCX(tU) := CayX(tU) =(I − t2WUJ)−1(I + t2WUJ)Xis a retraction, where  WU = XU ⊤PX − P ⊤X UX⊤, with PX = In + 12JXX⊤.

Proof of Proposition E.4. First, because  U ∈ TXH(d, r), we can express U = WJX for some  W ∈ Skew(n). Since H(d, r) is an embedded submanifold of  Rd×r, let the curve  c(t) = RetrCX(tU)and we have c(0) = X and

image

with  c′(0) = WJX = U. It remains to show  c(t) ∈ H(d, r). First we notice that

image

Then

where the second equality uses the fact that  W ∈ Skew(d)and the last equality is due to  (I + A)(I − A) = (I − A)(I + A)for any A.

Finally we can verify that  WU = XU ⊤PX − P ⊤X UX⊤ where PX = In + 12JXX⊤ satisfies WUJX = U. That is,

image

where we use  U ∈ TXH(n, p).

E.4. Symplectic manifold

E.4.1. REVIEW OF CANONICAL METRIC AND RETRACTIONS

The canonical metric of symplectic manifold is developed in (Gao et al., 2021b), as  ⟨U, V ⟩X = 1ρ⟨Su, Sv⟩ + ⟨Ku, Kv⟩for some chosen  ρ > 0, where  U = XΩpSu + ΩnX⊥Kuand  V = XΩpSv + ΩnX⊥Kv. The Riemannian gradient associated with the metric is  gradρf(X) = ρXΩpsym(Ω⊤p X⊤∇f(X)) + ΩnX⊥X⊤⊥Ω⊤n ∇f(X). The quasi-geodesic retraction is derived by replacing the covariance derivative with the Euclidean derivative, given by  RetrqgeoX (tU) =[X, U]expm�t�−ΩpW ΩpU ⊤ΩnUI2p −ΩpW

� � �expm(tΩnW)0 �where  W = X⊤ΩnU. The symplectic Cayley retraction is

image

GX = I2n − 12XωpX⊤Ω⊤n. In (Gao et al., 2022), a SR decomposition based retraction is proposed. That is, let P2p := [e1, e3, ..., e2p−1, e2, ..., e2p]where  ejis the j-th basis vector of  R2p. Then denote a congruence matrix set as Tsk(P2p) := {P ⊤2p ˆRP2p : ˆR ∈ R2p×2pis upper triangular}. Then the RetrsrX(tU) = sf(X + tU) where A = SR is the SRdecomposition of  A ∈ R2n×2p, with S ∈ Sp(2n, 2p) and R ∈ T2p(P2p) and sf(A)extracts the S factor.

E.4.2. PROOFS

The proof of Proposition 3.6 follows immediately from the following Lemma.

Lemma E.5. For any  S ∈ Sym(2n), we have expm(tΩnS), expm(tSΩn) ∈ Sp(n, n).

Proof of Lemma E.5. The proof follows from (Gao et al., 2021b, Proposition 4.6).

Proof of Proposition 3.6. Similar to the previous sections, let  c(t) := RetrX(tU)and we have c(0) = X and  c′(0) =SΩnX = U. Finally from Lemma E.5, we have  c(t)⊤Ωnc(t) = X⊤expm(tSΩn)⊤Ωnexpm(tSΩn)X = X⊤ΩnX = Ωp,which verifies  c(t) ∈ Sp(n, p)

Proof of Proposition 3.7. We can partition the basis  Eij =�Eij,1 Eij,2E⊤ij,2 Eij,3

image

For  1 ≤ i ≤ j ≤ n, we have  Eij,2, Eij,3 = 0and thus we obtain  exp(θEijΩn) = I + θEijΩn + θ22 (EijΩn)2 + · · · =I + θEijΩn. Similarly for  n + 1 ≤ i < j ≤ 2n, we have Eij,1, Eij,2 = 0 and expm(θEijΩn) = I + θEijΩn. This verifies ∀1 ≤ i ≤ j ≤ n or n + 1 ≤ i < j ≤ 2n

image

For  1 ≤ i ≤ n < j ≤ 2n, we have  Eij,1 = Eij,3 = 0and  Eij,2 = eie⊤j−n. Then  EijΩn =�−Eij,2 00 E⊤ij,2

notice that for k = 2, 3, 4, ...

image

This suggests for k = 2, 3, 4, ...,

image

This verifies for  1 ≤ i ≤ n < j ≤ 2n,

image

The proof is now complete.

E.5. Doubly stochastic manifold

For the doubly stochastic manifold, the retraction applies the Sinkhorn algorithm (Knight, 2008) for matrix balancing, i.e., RetrX(tU) = SK(X ⊙ exp(tU ⊘ X)), where U is a tangent vector belonging to the tangent space  TXΠ(µ, ν)and the Sinkhorn algorithm SK(U) iteratively normalize rows and columns of U according to the given marginals (Shi et al., 2021; Cardoso & Leite, 2010).

E.5.1. PROOFS

Proof of Proposition 3.9. In fact, we can show for any  Hijkl := (ei − ej)(ek − el)⊤for  i ̸= j, k ̸= l. The coordinate Sinkhorn is a valid retraction along the direction  Hijkl. Let c(t) := cSK(X ⊙ exp(tHijkl ⊘ X))and we can immediately see c(0) = X. Also, Let �X := X ⊙exp(tHijkl ⊘X). Then �Xdiffers with X in only the entries at (i, k), (j, k), (i, l), (k, l), which forms the  2 × 2sub-matrix that we wish to balance. Also, by definition, the marginals are given by  ˜µ := ([X]ik +[X]il, [X]jk + [X]jl) and ˜ν := ([X]ik + [X]jk, [X]il + [Xjl]). It readily holds that  ˜µ⊤12 = ˜ν⊤12. For notational purposes, for any matrix  A ∈ Rm×n, let Aijkl ∈ Rm×n be the matrix that zeros out the entries except for the  2 × 2sub-matrix. Also, we denote  A♭ijkl :=�[A]ik [A]il[A]jk [A]jl

Sinkhorn on �X♭ijkl with marginals  ˜µ, ˜νwith other entries of  �Xunchanged. This is well-defined as the Sinkhorn algorithm converges to the unique doubly stochastic matrix of the form  diag(u) �X♭ijkldiag(v)for some positive vectors u, v (Sinkhorn, 1967). This verifies that  cSK( �X)results in a doubly stochastic matrix, which remains on the manifold. Lastly, it remains to show that  c′(0) = Hijkl. For this, we first have

image

where we use the first-order approximation of the exponential operations. Notice that  cSK(X + tHijkl)only modifies the 2×2sub-matrix of  X by SK(X♭ijkl+tH♭ijkl). From (Douik & Hassibi, 2019; Shi et al., 2021), we have  SK(X♭ijkl+tH♭ijkl) ≈X♭ijkl + tH♭ijkl. This suggests  limt→0(SK(X♭ijkl + tH♭ijkl) − X♭ijkl)/t = H♭ijkl,which verifies  c′(0) = Hijkl.

image

Lemma E.6. Given a positive  2 × 2matrix A =

�a bc d �, the Sinkhorn algorithm on A with marginals  p = [p1, p2], q =

[q1, q2] ∈ ∆2converges to�c11a c12bc21c c22d�where c12 = p1/(κa + b), c22 = p2/(κc + d), c11 = κc12, c21 = κc22where κ

is the positive root of the equation  q2acκ2 +�(bc + ad)q2 − bcp1 − adp2�κ − bdq1 = 0.

Proof of Lemma E.6. Sinkhorn algorithm converges to the unique doubly stochastic matrix of the form diag(u)Adiag(v)

for  u = [u1, u2], v = [v1, v2]. From the constraints, diag(u)Adiag(v)12 = pand diag(v)A⊤diag(u)12 = qwe need to

image

Let  c11 = u1v1, c12 = u1v2, c21 = u2v1, c22 = u2v2which transforms (6) into a set of linear equations for the variables c11, c12, c21, c22. The equation system, however, is under-determined and has many solutions. The unique solution that is sought should satisfy  c11/c12 = c21/c22 = κ. To this end, from the first two equations, we obtain

image

Substituting the expressions to the last equation yields bp1κa+b + dp2κc+d = q2, which we solve for  κas the positive root of q2acκ2 +�(bc + ad)q2 − bcp1 − adp2�κ − bdq1 = 0.

F.1. Developments

Assumption F.1. Consider a neighbourhood  X ⊆ Mthat contains a critical point.

F.1.1 The basis and its projection are bounded. Let the projection onto the basis  Bℓ,Xbe  PBℓ,X(U) := ⟨U, Bℓ,X⟩XBℓ,X. There exists constant  cb, cp > 0such that  ∀X ∈ X, U ∈ TXM, ℓ ∈ I, ∥Bℓ,X∥2X ≤ cband  �ℓ∈I ∥PBℓ,XU∥2X ≥cp∥U∥2X.

F.1.2 The objective f is retraction L-smooth in X, i.e.,  f(RetrX(U)) − f(U) − ⟨gradf(X), U⟩X ≤ L2 ∥U∥2X, ∀X ∈ Xand U ∈ TXM such that RetrX(U) ∈ X.

Remark F.2. Assumption F.1.1 requires the basis has a bounded norm and the projection of any tangent vector onto the basis does not vanish. Such an assumption is manifold-specific and we can verify that  ∥Bℓ,X∥2Xhas an upper bound (e.g., for Stiefel and Grassmann,  ∥Bℓ,X∥2X ≤ ∥Hi,j∥2F = 2). Then we note that the second requirement trivially holds for orthonormal basis due to the decomposition of U and Jensen’s inequality. For non-orthonormal basis, this assumption also holds as long as projection of a tangent vector does not vanish. Assumption F.1.2 can also be satisfied by the compactness of the domain, e.g., we can take L in Assumption F.1.2 to be  L = maxX∈X,U∈TXM:∥U∥X=1d2f(RetrX(tU))dt2. These are all reasonable assumptions within a compact neighbourhood X.

Theorem 4.1 (Formal). Under Assumption F.1, consider RCD algorithm with  S = 1 and ℓsk selected uniformly at random

image

To analyze the cyclic variant, we further require the assumption that bounds the difference between Riemannian distance and distance induced by general retraction. In addition, we require the gradient Lipschitzness.

Assumption F.3. Under the same settings as in Assumption F.1,

F.3.1 For all  X, Y = RetrX(U) ∈ X, there exists constants  ϑ0, ϑ1 > 0 such that ϑ0∥U∥2X ≤ dist2(X, Y ) ≤ ϑ1∥U∥2X.F.3.2 The objective has retraction  Lg-Lipschitz gradient, i.e.,  ∥gradf(X) − T XY gradf(Y )∥2X ≤ Lg∥U∥2X, ∀X, Y =RetrX(U) ∈ Xand  T YXis the an isometric vector transport that satisfies  ⟨T YX U, T YX V ⟩Y = ⟨U, V ⟩X, ∀X, Y ∈X, U, V ∈ TXM.F.3.3 For any fixed coordinate index  ℓ ∈ I, there exists a constant  υ > 0such that for all  X, Y ∈ X, V ∈ TY M, ∥PBℓ,XT XY V ∥2X ≥ υ∥PBℓ,Y V ∥2Y .

Remark F.4. Assumption F.3.1 bounds the difference Riemannian distance (relating to inverse exponential map) and the inverse retraction. Because retraction is a first-order approximation to the exponential map, this assumption naturally holds when the domain is sufficiently small (see (Huang et al., 2015; Sato et al., 2019)). Assumption F.3.2 is further required because when general retraction is used, gradient Lipschitzness is not equivalent to function smoothness. Assumption F.3.3 further claims that the difference between the same coordinate basis on different tangent spaces is bounded. We note that the RHS is identical to  ∥PT XY Bℓ,Y T XY V ∥due to the isometric vector transport. Then it reduces to whether  T XY Bℓ,Y and Bℓ,Xare related, which is expected because due to the compactness of the domain, X is bounded from Y . This allows to establish the convergence for cyclic selection of basis.

Theorem 4.2 (Formal). Under Assumption F.1 and consider RCD algorithm with  S = |I| and ℓsk = s+1 for s = 0, ..., |I|−1.Then selecting  η = 1Lcb gives min0≤k≤K−1 ∥gradf(Xk)∥Xk ≤ C∆0K , where C = 4Lc2bc−1p υ−1(1+|I|2c−1b L−2Lgϑ1ϑ−10 ).

Theorem 4.3 (Formal). Under Assumption F.1, F.3 and further let  ω0, ω1 > 0, such that for any fixed epoch k, ω0⟨∇f(Xsk), Bℓsk⟩2 ≤ θsk⟨∇f(Xsk), Bℓsk⟩ ≤ ω1⟨∇f(Xsk), Bℓsk⟩2, ∀s ≤ Smax − 1.

(Randomized). Consider RCDlin algorithm with  1 < S ≤ Smax and ℓsk selected uniformly at random from I. Then choosing

image

(Cyclic). Suppose  Smax ≥ |I|and consider RCDlin algorithm with S = |I| and  ℓsk = s + 1for  s = 0, ..., |I| −1. Then choosing  η = ω0Lcbω21, we have  min0≤k≤K−1 ∥gradf(Xsk)∥2Xsk ≤ �C∆0K, where  �C = 4Lc2bω21ω−20 c−1p ν−1(1 +

image

Remark F.5. We finally remark that the proof ideas of cyclic and randomized RCD follow from classic developments of coordinate descent (Wright, 2015) in the Euclidean space by showing sufficient descent in the objective function. On general manifolds, in order to generalize the proof ideas, we further require the assumptions outlined in Assumption F.1, F.3. In particular, for cyclic selection rule, we require Assumption F.3.3 to relate bases from different tangent spaces. Similar assumptions have been considered in (Gutman & Ho-Nguyen, 2023) for showing convergence of deterministic subspace descent algorithms on manifolds (see (C, r)-norm condition).

F.2. Proofs

Proof of Theorem 4.1. Because S = 1 and by retraction L-smoothness, we have

image

image

Telescoping this inequality and taking full expectation yields

image

where we let  ∆0 := f(X0) − f ∗.

Lemma F.6. Under Assumption F.1.1, we have  ∥PBℓ,XU∥X ≤ cb∥U∥X, ∀X ∈ X, ℓ ∈ I.

Proof of Lemma F.6.  ∥PBℓ,XU∥2X ≤ cb⟨U, Bℓ(X)⟩2X = cb�⟨U, Bℓ(X)⟩XBℓ(X), U�X ≤ cb∥PBℓ,XU∥X∥U∥X. Can- celling  ∥PBℓ,XU∥Xon both sides completes the proof.

Proof of Theorem 4.2. We first focus on a single epoch k and for notation simplicity, we let  T XkXsk = Ts→0 and T XskXk = T0→s.Similarly from retraction L-smoothness,

image

Then it remains to bound the RHS. From  Lg-Lipschitz,

where we use Assumption F.3 and triangle inequality of Riemannian distance.

Now we can show

∥PBℓsk T0→sgradf(Xk)∥2Xsk ≤ 2∥PBℓsk T0→sgradf(Xk) − PBℓsk gradf(Xsk)∥2Xsk + 2∥PBℓsk gradf(Xsk)∥2Xsk≤ 2cb∥T0→sgradf(Xk) − gradf(Xsk)∥2Xsk + 2∥PBℓsk gradf(Xsk)∥2Xsk

image

where the second inequality is due to Lemma F.6. Summing this inequality from  s = 0, ..., S − 1 gives

image

where we notice S = |I|. Also due to the cyclic selection of  ℓsk, we can see the LHS is

image

where we use Assumption F.1.1 and F.3.3. Combining with previous results, we finally obtain

image

Proof of Theorem 4.3. For the randomized setting, by retraction L-smoothness,

�∥⟨∇f(Xsk), Bℓsk⟩Bℓsk∥2Xsk

image

where we choose  η = ω0Lcbω21. The second inequality follows from the assumption  ω0⟨∇f(Xsk), Bℓsk⟩2 ≤θsk⟨∇f(Xsk), Bℓsk⟩ ≤ ω1⟨∇f(Xsk), Bℓsk⟩2and the third inequality is due to Assumption F.1.1. Following the similar proof strategy, we obtain the desired result. For the cyclic setting, the bound also readily follows by using the above result.


Designed for Accessibility and to further Open Science