Riemannian coordinate descent algorithms on matrix manifolds

2 months ago·arXiv

Abstract

1. Introduction

In this work, we consider the optimization problem

where M is a smooth, and often nonlinear constraint. Examples of M include orthogonality constraint (Edelman et al., 1998), positive (semi)definite constraint (Bhatia, 2009; Han et al., 2021), fixed-rank constraint (Vandereycken, 2013), hyperbolic constraint (Nickel & Kiela, 2018), doubly stochastic constraint (Douik & Hassibi, 2019), etc. Problem (1) has been explored in applications such as PCA (Zhang et al., 2016; Kasai et al., 2019), low-rank matrix/tensor completion (Jawanpuria & Mishra, 2018; Nimishakavi et al.,

2018; Kressner et al., 2014), computer vision (Pennec et al., 2006), natural language processing (Jawanpuria et al., 2019a; 2020), optimal transport (Mishra et al., 2021; Han et al., 2022; Shi et al., 2021), and deep learning (Arjovsky et al., 2016; Wang et al., 2020). Problem (1) has also been studied in various settings such as stochastic optimization (Bonnabel, 2013; Zhang et al., 2016; Tripuraneni et al., 2018; Sato et al., 2019; Kasai et al., 2019; Han & Gao, 2021), differential privacy (Reimherr et al., 2021; Han et al., 2024a; Utpala et al., 2022), federated learning (Li & Ma, 2022; Huang et al., 2024), decentralized learning (Mishra et al., 2019), and saddle point and bilevel optimization (Han et al., 2023b;a; Zhang et al., 2023; Han et al., 2024b).

The smooth constraint set can be turned into a Riemannian manifold by endowing a properly chosen metric structure. The Riemannian optimization approach (Absil et al., 2008; Boumal, 2023) then provides a principled approach to solve (1) intrinsically on the manifold space. The main idea is to iteratively update the variable along a descent direction without leaving the manifold. The descent direction is often computed using the Riemannian gradient, which is then followed by a retraction update to ensure feasibility of the manifold constraint. As the dimensionality of the constraint set increases, ensuring feasibility via retraction becomes a key computational bottleneck, e.g., the complexity of ensuring orthogonality and positive definiteness scales as with the input dimension n. This has led to many recent works (Gao et al., 2019; Xiao & Liu, 2021; Ablin et al., 2023) that develop infeasible methods for solving (1). However, such methods are largely limited to the orthogonality constraint and cannot be easily adapted to other manifolds.

In the Euclidean space, the coordinate descent (CD) method (Luo & Tseng, 1992; Nesterov, 2012; Wright, 2015) is a classic algorithm that successively solves a smalldimensional subproblem along a component of the vector variable while holding others fixed. Since each subproblem can be more easily solved than the original problem, this strategy leads to efficient variable update.

On manifolds, designing CD updates is inherently difficult (Gutman & Ho-Nguyen, 2023). A few works have proposed manifold specific CD updates, mainly for the orthogonal (Shalit & Chechik, 2014; Jiang et al., 2022; Massart & Abrol, 2022) and Stiefel (Gutman & Ho-Nguyen, 2023) manifolds. Although Gutman & Ho-Nguyen (2023) discuss a general framework for developing CD methods on manifolds, concrete developments have been shown only for the Stiefel manifold. Recently, for a class of optimization objectives, Darmwal & Rajawat (2023) have proposed CD updates on the symmetric positive definite manifold with the affine-invariant metric.

In this work, we provide a general approach for developing CD algorithms on matrix manifolds. We summarize our contributions below.

• We introduce a framework for designing CD algorithms on manifolds. In particular, we find a basis spanning the tangent space such that a chosen retraction along the direction of such a basis admits an efficient computation. We discuss a simple expression for the coordinate derivative. Finally, we provide optimization ingredients for various matrix manifolds of interest.

• A nonlinear objective f in (1) requires computation of gradient for every CD update. Using a first-order approximation of f, we develop a more efficient CD algorithm which requires gradient computations one in every fixed number of CD updates. We analyze the convergence and complexity of the two algorithms with randomized and cyclic selection of coordinates.

• We show the benefits of the proposed CD algorithms on the orthogonal Procrustes, PCA, orthogonal deep network distillation, nearest matrix, and learning hyperbolic embeddings problems.

2. Preliminaries

Riemannian manifolds and optimization. For a Riemannian manifold M, denote its tangent space at as . A Riemannian metric is an inner product structure that varies smoothly with the base point X. In this work, we particularly focus on matrix manifolds, i.e., where X can be represented in the ambient vector space . The orthogonal projection projects arbitrary ambient vectors to the tangent space with respect to the Riemannian metric. For a differentiable function , the Riemannian gradient at X is defined as the tangent vector such that where . A retraction allows points to move along the manifold, which satisfies the conditions: and

Related works. We provide a detailed review of the existing coordinate descent (CD) algorithms on specific manifolds, along with other related works in Appendix B.

Notations. We use without the subscript to represent the Euclidean inner product while we use to denote the Riemannian inner product on . The specific expression for depends on both M and X. Sym(n) and Skew(n) denote the sets of symmetric and skew-symmetric matrices, respectively. Let be the elementwise exponential, and be the matrix exponential. We also use to represent the i-th basis vector with the dimension to be determined from the context. denotes the i, j-th entry of a matrix represents a matrix with index i, j. We use to denote the identity matrix, to denote the size-n vector of all 1s, and define [n] := {1, 2, ..., n}.

3. Proposed CD Framework

As shown in (Shalit & Chechik, 2014; Gutman & Ho- Nguyen, 2023; Massart & Abrol, 2022; Jiang et al., 2022; Darmwal & Rajawat, 2023), for specific manifolds, the key in developing CD algorithms is the choice of the basis vectors and I denotes the index set) spanning the tangent space that allow efficient retraction. In general, our chosen basis need not be orthonormal with respect to the Riemannian metric. Once the basis and retraction are chosen, the CD update is given by , where is the stepsize and is the coordinate derivative, i.e.,

It can be verified that is indeed a descent direction, i.e., . The CD algorithm then involves iteratively selecting coordinate index , computing , and updating in the coordinate descent direction

The main challenges in developing CD algorithms on matrix manifolds are: 1) characterization of which facilitates efficient computation, 2) efficient computation of easy generalization to different manifolds. We propose to leverage the following connection between the Riemannian and Euclidean gradients:

where is the Euclidean gradient and the last equality follows from the definition of the Riemannian gradient. We exploit (3) to efficiently compute for several manifolds as it is independent of the Riemannian gradient and metric. In the subsequent sections, we develop concrete CD optimization ingredients for the manifolds of interest under the proposed approach. These are summarized in Table 1.

Table 1: Summary of CD ingredients over various manifolds. and are the basis for skew-symmetric and symmetric matrices, respectively. and corresponds to the Givens and hyperbolic rotations, respectively. , and is the k-th column of . We use to denote the exponential retraction over sphere. The complexity only considers the computation of coordinate derivative and coordinate update, while excluding the complexity of first-order oracle

3.1. CD on Stiefel manifold

The Stiefel manifold St(n, p) is the set of column orthonormal matrices of size , i.e., , the orthogonal manifold. The tangent space of Stiefel manifold is identified as . The Riemannian metric is defined as for any . The Riemannian gradient is derived as

Choice of basis. Taking inspiration from O(n), we adopt the parameterization of the tangent vectors (where ) and choose the basis as for In contrast to O(n), the chosen basis is not orthonormal for St(n, p). This is expected as the manifold St(n, p) has a dimension while we adopt an overparameterization of the tangent space using basis vectors.

Retraction. For the purpose of CD update, we first note that is a valid retraction on St(n, p) because: 1) and 2) are satisfied (Siegel, 2020).

CD update. Based on the above choices of the basis vectors and retraction, the proposed CD update is , where . Here, is known as the Givens rotation around axes i, j with angle . Overall, each CD update only requires O(p) as we modify only two rows of X.

Remark 3.1. Gutman & Ho-Nguyen (2023) propose a column-wise CD update on the Stiefel manifold which costs O(np) per iteration. On the other hand, our proposed CD update is row-wise and costs O(p), which is cheaper. Furthermore, the CD update strategy of (Gutman & Ho-Nguyen, 2023) cannot be applied to the sphere manifold, i.e., when p = 1, it reduces to the full gradient update on the sphere. This, however, is not an issue for our update. Finally, the update of Gutman & Ho-Nguyen (2023) is not invariant to the right action of orthogonal group and hence does not yield a valid CD strategy for the Grassmann manifold. In contrast, as shown in the next section, our strategy can be readily generalized to the Grassmann manifold.

3.2. CD on Grassmann manifold

The Grassmann manifold Gr(n, p) represents the p-dimensional subspaces in , which can be represented by an orthonormal matrix X, i.e., , where the columns span the subspace. The representation is not unique, with XQ representing the same subspace for arbitrary . Thus, the Grassmann manifold can be identified as , where . The tangent space can be uniquely characterized by the horizontal space at , i.e., . For a given , its unique horizontal lift is , where [X] is represented as X. The lift operator satisfies . On Gr(n, p), the Riemannian metric is pushed forward by the Euclidean metric on St(n, p) as and the corresponding Riemannian gradient gradf([X]) can be represented by . Retractions such as QR retraction for St(n, p) also work for Gr(n, p) as long as it preserves the equivalence class, i.e., for any . Below, we show that the proposed CD update for St(n, p) is also well-defined for Gr(n, p).

Proposition 3.2. Consider a function R. Let the coordinate descent update at [X] be given by for , where and for some fixed stepsize . Then,

3.3. CD on hyperbolic manifold

We now consider the generalized hyperbolic space (Bai & Li, 2014; Xiao et al., 2023) , where is the metric tensor. When p = 1, this reduces to the wellknown hyperbolic space (the hyperboloid model). The tangent space at is identified as

T

The Riemannian metric on TH(n, p) is the generalized Lorentz inner product as . The normal space is given by . The orthogonal projection to and the Riemannian gradient are derived below.

Proposition 3.3. The orthogonal projection of is given by . The Riemannian gradient is

Choice of basis. For generalized hyperbolic manifold, we consider the basis

Retraction. Taking inspiration from our Stiefel analysis in Section 3.1, we define the map expm(tWJ)X for . We next show such a map defines a valid retraction. As shown below, the retraction expression considerably simplifies along the chosen basis

Proposition 3.4. Given a tangent vector for some skew-symmetric matrix W, then is a retraction.

In fact, expm(tWJ) is a Lorentz transform that satisfies , which preserves the Lorentz inner product as . Hence by following the direction , we define a coordinate type of updates on (generalized) hyperbolic manifold as , which can be computed efficiently similar to the Givens rotation. Particularly, when and exactly recovers the Givens rotation. When i = 1, we have . We show in the follow- ing lemma that this also leads to a rotation known as the hyperbolic rotation.

Lemma 3.5. For with ,

When is known as the Lorentz boost with rapidity and can be thought of as rotation in the time domain. Hence, while the Givens rotation based CD updates have been explored for the orthogonal and Stiefel manifolds (Shalit & Chechik, 2014; Gutman & Ho-Nguyen, 2023), our approach generalizes the Givens rotation based CD updates to hyperbolic spaces.

CD update. Similar to the Stiefel case, the proposed CD update is , where In Appendix E.3, we additionally derive a canonical-type metric and a Cayley retraction.

3.4. CD on symplectic manifold

The symplectic manifold (Gao et al., 2021a;b; 2022) is defined as where

block matrix. The tangent space is given as

Here we consider the Euclidean metric (Gao et al., 2021a) as for any . The Riemannian gradient (Gao et al., 2021a, Proposition 3) is given by , where is the unique solution to the Lyapunov equation

Choice of basis. Similar to the Stiefel and hyperbolic manifolds, we consider the basis for the tangent space , where , for -th basis in

Retraction. We propose the following retraction for efficient CD updates.

Proposition 3.6. For any

, the map is a retraction.

The above retraction further simplifies when moving along the chosen basis direction.

Proposition 3.7. Let for . Then, we have if i = otherwise.

Remark 3.8 (Block coordinate updates). We may also consider block coordinate updates. Let E =

and , and we wish to update X in the direction of . First, we consider the upper-left and bottom-right blocks, i.e., where E =

Sym(if E =

CD update. Finally, based on the above discussion our proposed CD update is , where,

3.5. CD on doubly stochastic and multinomial manifolds

Given two marginals where denotes the k-simplex, the doubly stochastic manifold (Douik & Hassibi, 2019; Shi et al., 2021; Mishra et al., 2021) is defined as . The tangent space is , which can be endowed with the Fisher metric as represent the elementwise product and division operations, respectively. The orthogonal projection is , where are solutions to the linear system: . The Riemannian gradient is given by

Choice of basis. We consider the parameterization of the tangent space as . We notice that the tangent space has a dimension , and hence, we can let , where we denote as the i-th canonical basis for the corresponding vector space. Hence the tangent space is parameterized by . The basis we consider for

Retraction. We consider the Sinkhorn retraction applied in the direction of the basis as . Here, the Sinkhorn algorithm SK(U) iteratively normalize rows and columns of U according to the given marginals (Knight, 2008). We notice the input to the Sinkhorn algorithm only modifies a sub-matrix of X. It, thus, suffices to apply the Sinkhorn algorithm to the sub-matrix with the modified marginals, which largely simplifies the computation compared to running the Sinkhorn algorithm for the entire input. To this end, we define the coordinate Sinkhorn, denoted as or simply cSK(U) if the coordinates are clear from context, as performing the Sinkhorn algorithm for the sub-matrix formed by indices i, i+1 and j, j +1 with marginals and . The other entries of U remains unchanged. We show in the next proposition that applying coordinate Sinkhorn to the basis results in a valid retraction.

Proposition 3.9. The coordinate Sinkhorn applied to the basis is a valid retraction in the direction of

We can further simplify the computation of , which is equivalent to performing Sinkhorn on a matrix. Furthermore, in this case, we show in Lemma E.6 that the Sinkhorn admits a closed-form solution.

CD update. The CD update follows as , where the coordinate derivative is computed as

Remark 3.10 (CD on multinomial manifold). The developments in this section readily applies to the multinomial manifold (Douik & Hassibi, 2019), i.e., where we assume without loss of generality. The multinomial constraint corresponds to n independent simplex constraints restricted to positive entries. The tangent space is . Thus, the basis is similarly given by . The Riemannian metric is the same Fisher metric. A retraction is given in , where denotes the row normalization. It should be noted that P(V ) is a special case of the Sinkhorn algorithm without column normalization. Thus, in the basis direction, we can define the coordinate projection by modifying only two entries per row. The coordinate derivative can be computed as

3.6. CD on positive (semi)definite manifold

The set of fixed-rank symmetric positive semi-definite manifold (SPSD) matrices (Vandereycken et al., 2009; 2013; Massart & Absil, 2020) is defined as , we recover the set of symmetric positive definite (SPD) matrices as For the purpose of developing efficient CD updates on SPSD, we follow the parameterization purposed in (Massart & Ab- sil, 2020), i.e., is factorized as , , which is unique up to the right-action of the orthogonal group O(p). The Riemannian gradient can be computed as because the . The main advantage of this parameterization is its simple expression of retraction, i.e., Massart & Absil, 2020).

Choice of basis, retraction, and CD update. Using , the optimization problem is on with a simple retraction. For the objective tialize and update as for some stepsize . We choose the basis to be for , which is orthonormal for the tangent space . The CD update is given by , where . The simplicity of the geometry allows CD to be developed efficiently on , which coincides with the Euclidean CD update in the Euclidean space. When p = n, we obtain a CD update for the SPD manifold.

CD update with the BW metric. The Bures-Wasserstein (BW) metric for the SPD manifold (p = n) has been recently studied for various machine learning applications (Bhatia et al., 2019; Han et al., 2021). For the BW metric, the gradient descent update is Consider a basis where . The coordinate derivative is computed as . Finally, the CD update is given by update modifies two rows and two columns of X.

Remark 3.11. For the SPD manifold (p = n), Darmwal & Rajawat (2023) propose CD updates based on the affine-invariant metric and Cholesky factorization. They specifically focus on a class of objective functions and show that the exponential map computations are efficient. In contrast, our choices of parameterization/metric directly leads to a faster retraction.

4. Algorithms and Analysis

RCD. We present the proposed Riemannian coordinate descent (RCD) algorithm in Algorithm 1. The complexity of RCD per iteration is the complexity of one first-order oracle and the update complexity in Table 1. Although

for some problem settings, we may explore the structure of the objective to efficiently compute , for general problem instances, becomes the main computational bottleneck.

RCDLin. To reduce gradient computations in the RCD setup (especially for non-linear objectives), we also propose the Riemannian linearized coordinate descent (RCDlin) method in Algorithm 1. The main difference with RCD is that the variables are updated using an anchored gradient at (which does not change for inner iterations). This scheme is equivalent to taking a linearization of the original cost function at , and in the inner iterations, we solve: where the inner product and subtraction are defined in the ambient Euclidean space. Subsequently, the Euclidean gradient at is , and thus, . For the randomized setting with S = 1, RCDlin is equivalent to RCD. Additionally, for linear problems where is some constant matrix), RCDlin also reduces to RCD.

Convergence and complexity of Algorithm 1

We next discuss the convergence analysis of RCD and RCDlin. It follows the standard analysis for CD algorithms (Wright, 2015). Note that Gutman & Ho-Nguyen (2023) mostly consider the analysis of CD algorithms under exponential map and parallel transport operations. In contrast, we consider the more general retraction and vector transport operations. We also adapt our analysis for RCDlin. For brevity, our analysis is informally discussed here. The analysis is in a compact neighbourhood around a critical point, which is required for validating certain regularity assumptions, boundedness of basis and projection onto the basis, and smoothness of the objective (details in Appendix F).

On RCD. We start by showing the convergence of RCD under randomized selection of basis and certain regularity assumptions. L is the smoothness constant of the objective.

Theorem 4.1 (Randomized RCD). Under mild assumptions, consider RCD with selected uniformly at random from I. Then, choosing leads to

The convergence of RCD with cyclic selection of basis requires further assumptions that bound the difference of the constructed bases between tangent spaces. These are reasonable given the compactness of the domain.

Theorem 4.2 (Cyclic RCD). Under mild assumptions in addition to the ones required by randomized RCD, consider RCD with S = |I| and for . Then, for , we have

On RCDlin. The key idea here is to relate the coordinate derivative to the correct descent derivative . In randomized settings, we can show the same convergence rate as RCD up to some additional constants regulated by the difference between the descent direction. For the cyclic settings, however, we require S = |I| in order to cycle through all the basis.

Theorem 4.3 (Randomized and cyclic RCDlin). Under assumptions required in Theorems 4.1 & 4.2, suppose and are positively related. Then, consider randomized RCDlin with selected uniformly at random from I. Choosing leads to . In addition, consider cyclic RCDlin with S = |I| and s + 1 for . Also, if , then

Complexity analysis. Let the cost of computing the coordinate derivative and CD update be (last column of Table 1). Then, the total computational cost of RCD and RCDlin is , respectively, where F denotes the cost of computing . We note that the proposed algorithms can parallely update in disjoint basis directions. For example, in the Stiefel/Grassmann case, we can select n/2 non-overlapping index pairs, which results in n/2 independent Givens rotation, and can be parallelized.

5. Experiments

We now benchmark the performance of the proposed RCD and RCDlin algorithms in terms of computational efficiency (flop counts and/or runtime) and convergence quality (distance to optimality). One of the considered baselines is the Riemannian gradient descent method (RGD), a full gradient method. As RGD exploits the entire gradient direction, it has advantage over CD algorithms. However, RGD is significantly more costly than CD in every up-

Figure 1: The Procrustes problem with varying p: (a) p = 150 and (b) p = 50. (Top row) Comparing various algorithms in terms of flop counts. (Bottom row) Comparing various algorithms in terms of runtime. We observe that our RCD algorithm obtains better flop counts than the baselines in flop counts and is competitive in terms of runtime.

date. Our codes are implemented using the Manopt toolbox (Boumal et al., 2014) and run on a laptop with an i5-10500 3.1GHz CPU processor. The codes are available at https://github.com/andyjm3.

5.1. Orthogonal Procrustes and PCA

Orthogonal Procrustes problem. We aim to solve for given matrices . There exists a closed-form solution provided by the (thin) SVD of . For this, RCD and RCDlin have same updates as for . In experiments, we generate random matrices A, B and evaluate the performance against optimality gap computed as

Baselines. The closest baseline to RCD is TSD (Gutman & Ho-Nguyen, 2023), which is a CD method under an alternative construction of bases. As discussed, while TSD updates the columns, the proposed RCD updates the rows. Since RCD is equivalent to TSD for p = n, we focus only on the p < n setting. For both RCD and TSD, we use the cyclic selection of basis. We also compare against RGD methods with QR, Cayley (CL), and exponential (EXP) retractions. For all the methods, we tune the stepsize.

Results. In Figure 1, we show results with varying dimension p. While the proposed RCD obtains better flop counts than the baselines in flop counts, it is competitive in terms of runtime. We highlight that the runtime of RCD can be further improved with parallel implementation. In Figure 2c, we compare a variety of basis selection rules for both

Figure 2: (a) & (b): Experiments on the PCA problem with n = 200, p = 50. In (a), we observe that our algorithm RCDlin achieves the fastest convergence due to low per-iteration cost. In (b), we compare various strategies for basis selection: cyclic selection (-c) and uniformly random selection (-r) of basis for TSD, RCD, and RCDlin, and selection without replacement (-nr) for RCDlin. We observe that cyclic and selection without replacement strategies are better than random selection. (c) & (d): Experiments on the Procrustes problem with n = 200, p = 150. In (c), we again observe that cyclic selection performs better than random selection. In (d), RCD performs competitively against the infeasible methods.

TSD and RCD: cyclic selection (‘c’) and uniformly random selection (‘r’). We observe that cyclic selection is more favourable than the random selection rule for both the methods. We compare against full gradient infeasible methods in Figure 2d, including PLAM (Gao et al., 2019), PCAL (Gao et al., 2019), PenCF (Xiao et al., 2022), ExPen (Xiao & Liu, 2021; Xiao et al., 2023), and Landing (Ablin & Peyr´e, 2022; Ablin et al., 2023). RCD is performs competitively against infeasible methods for orthogonality constraints.

PCA problem. The PCA problem solves a quadratic maximization problem as for some positive definite matrix A, i.e., . This problem is in fact an optimization problem over the Grassmann manifold because the objective is invariant to basis change of X. Hence, we use the Riemannian distance to the optimal solution on the Grassmann manifold to measure the performance. As discussed in Section 3.2, our proposed RCD has well-defined updates on the Grassmann manifold. In contrast, TSD is not invariant to the basis change. For experiments, we generate A with a condition number with exponential decay of eigenvalues. For TSD, RCD, and RCDlin, we implement the cyclic selection of basis.

Results. In Figure 2a, we observe that RCDlin achieves the best performance due to its low per-iteration cost. We note that TSD converges slowly due to non-invariance of the CD updates. In Figure 2b, we compare the cyclic and uniformly random selection of the basis of RCD, RCDlin, and TSD. For RCDlin, we also implement the selection without replacement (‘nr’) strategy. We observe that cyclic and ‘nr’ strategies are better than random selection.

5.2. Orthogonal deep networks for distillation

We next evaluate RCD on a deep learning based distillation problem (Hinton et al., 2015). Let denote the parameters of the student network (S) while be the optimal parameters of the teacher network (T). Then, the aim

Figure 3: Experiments on the distillation problem. We observe that the proposed RCD algorithm performs better than the baselines both in terms of flop counts and runtime.

is to learn S that approximates T, i.e., minimize , where represent the output of the network for some input The network architecture is detailed in Appendix A.1. Here, we constrain all the weights to be orthonormal, thus posing the problem as optimization over the joint space of Stiefel and Euclidean manifolds. For experiments, we consider a six-layer network and set . We use stochastic versions of RGD and RCD where the input samples are randomly generated. In Figure 3, RCD outperforms the baselines in terms of flop counts and runtime. This is because RCD has the most cost-efficient update per iteration, while maintaining a competitive convergence rate.

5.3. Nearest matrix problem

We consider the problem: on the symplectic manifold (Gao et al., 2021b). We follow the setting in (Gao et al., 2021b) by generating A as a random matrix with p = n = 200. The algorithms are evaluated on optimality gap is obtained by running the conjugate gradient algorithm with the Cayley retraction. We implement RCD and RCDlin with both CD and block CD updates (discussed in Remark 3.8). As there

Figure 4: Experiments on the nearest matrix problem. We notice the utility of the block-update variants of our RCD and RCDlin algorithms in obtaining faster convergence.

Figure 5: Experiments on learning Lorentz (hyperbolic) embeddings. The performance of our RCDlin algorithms (with cyclic and time-cyclic basis selection) is competitive to RGD.

is no CD baseline on the symplectic manifold, we compare against the full gradient RGD algorithms with three retractions: Cayley (‘CL’), quasi-geodesic (‘QG’), and SR (‘SR’). In Figure 4, RCD with block update shows clear advantage in both flop counts and runtime.

5.4. Learning Lorentz embeddings

We consider the task of learning embeddings for word hierarchies, which is formulated on the hyperbolic manifold using the hyperboloid model (Nickel & Kiela, 2017; 2018; Jawanpuria et al., 2019b). The goal is to map word pairs with hypernymy relations closer while separate those without. We follow the formulation in (Nickel & Kiela, 2018), and the details are in Appendix A.2.

For experiment settings, we train 5-dimensional embeddings (n = 5) for WordNet mammals subtree (Miller, 1998). We adopt the RCDlin algorithm with two selection rules for the basis : cyclic (‘RCDlin-c’) and time cyclic (‘RCDlin-tc’). The cyclic selection loops through all pairs per iteration. The time cyclic selection only loops through all the space-time coordinate pairs, namely (1, 2), (1, 3), ..., (1, n), which reduces the computational cost to scale linearly with dimension n. For RCDlin-c and RCDlin-tc, we use a linearly-decaying stepsize, i.e., . For RGD we use a fixed stepsize which generally leads to better convergence. We tune and set for RCDlin and 0.5 for RGD. We use the met-

Figure 6: Experiments on the weighted least squares problem in two settings: (a) n = p = 500 and is a dense matrix and (b) n = 500, p = 100 and A is a random symmetric matrix with 70% entries as 1 and others are 0. While RGD and the proposed RCDlin have similar convergence rate in (a), RCDlin has clear advantage in (b).

rics for evaluating the convergence: mean average precision (MAP) and mean rank (MR) (Nickel & Kiela, 2017; 2018). In Figure 5, we see that RCDlin converges at a similar rate compared to RGD in terms of runtime.

5.5. Weighted least squares (SPSD manifold)

The weighted least squares problem is , where masks the known entries in B. It is an instance of the matrix completion problem (Han et al., 2021). For the experiment, we follow (Han et al., 2021) by generating where is an SPD/SPSD matrix with exponentially decaying eigenvalues. We consider two settings: (left) is a dense matrix and (right) n = 500, p = 100 and A is a random symmetric matrix with 70% entries as 1 and others are 0. We compare RCDlin with RGD (for factorization). We set S = np/5 and select the coordinates randomly without replacement. In Figure 6, we observe that RCDlin performs competitively with RGD on the SPD manifold with dense A while performing significantly better on the SPSD manifold with sparse A.

6. Conclusion

In this work, we discuss how to develop computationally efficient CD updates for a number of matrix manifolds. The main bottleneck in developing CD methods is on finding the right basis parameterization of the tangent space. We show precise constructions for various manifolds and propose two CD algorithms: RCD and RCDlin. RCDlin specifically reduces the gradient computations of RCD further. Our experiments show the benefit of our proposed CD updates on a number of problem instances.

Impact Statement

This paper presents work whose goal is to advance optimization methods with applications in Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.

References

Ablin, P. and Peyr´e, G. Fast and accurate optimization on the orthogonal manifold without retraction. In International Conference on Artificial Intelligence and Statistics, pp. 5636–5657. PMLR, 2022.

Ablin, P., Vary, S., Gao, B., and Absil, P.-A. Infeasible deterministic, stochastic, and variance-reduction algorithms for optimization under orthogonality constraints. arXiv:2303.16510, 2023.

Absil, P.-A., Mahony, R., and Sepulchre, R. Optimization algorithms on matrix manifolds. Princeton University Press, 2008.

Arjovsky, M., Shah, A., and Bengio, Y. Unitary evolution recurrent neural networks. In International Conference on Machine Learning, pp. 1120–1128. PMLR, 2016.

Bai, Z. and Li, R.-C. Minimization principles and com- putation for the generalized linear response eigenvalue problem. BIT Numerical Mathematics, 54:31–54, 2014.

Bhatia, R. Positive definite matrices. Princeton University Press, 2009.

Bhatia, R., Jain, T., and Lim, Y. On the bures–wasserstein distance between positive definite matrices. Expositiones Mathematicae, 37(2):165–191, 2019.

Bonnabel, S. Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58 (9):2217–2229, 2013.

Boumal, N. An introduction to optimization on smooth manifolds. Cambridge University Press, 2023.

Boumal, N., Mishra, B., Absil, P.-A., and Sepulchre, R. Manopt, a matlab toolbox for optimization on manifolds. The Journal of Machine Learning Research, 15(1):1455– 1459, 2014.

Cardoso, J. R. and Leite, F. S. Exponentials of skew-symmetric matrices and logarithms of orthogonal matrices. Journal of Computational and Applied Mathematics, 233(11):2867–2875, 2010.

Darmwal, Y. and Rajawat, K. Low-complexity subspace- descent over symmetric positive definite manifold. Technical report, arXiv preprint arXiv:2305.02041, 2023.

Douik, A. and Hassibi, B. Manifold optimization over the set of doubly stochastic matrices: A second-order geometry. IEEE Transactions on Signal Processing, 67 (22):5761–5774, 2019.

Edelman, A., Arias, T. A., and Smith, S. T. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.

Gao, B., Liu, X., and Yuan, Y.-x. Parallelizable algorithms for optimization problems with orthogonality constraints. SIAM Journal on Scientific Computing, 41(3):A1949– A1983, 2019.

Gao, B., Son, N. T., Absil, P.-A., and Stykel, T. Geometry of the symplectic Stiefel manifold endowed with the Euclidean metric. In Geometric Science of Information, pp. 789–796. Springer, 2021a.

Gao, B., Son, N. T., Absil, P.-A., and Stykel, T. Riemannian optimization on the symplectic Stiefel manifold. SIAM Journal on Optimization, 31(2):1546–1575, 2021b.

Gao, B., Son, N. T., and Stykel, T. Optimization on the symplectic Stiefel manifold: SR decomposition-based retraction and applications. Technical report, arXiv preprint arXiv:2211.09481, 2022.

Gutman, D. H. and Ho-Nguyen, N. Coordinate descent without coordinates: Tangent subspace descent on Riemannian manifolds. Mathematics of Operations Research, 48(1):127–159, 2023.

Han, A. and Gao, J. Improved variance reduction methods for Riemannian non-convex optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44 (11):7610–7623, 2021.

Han, A., Mishra, B., Jawanpuria, P. K., and Gao, J. On Riemannian optimization over positive definite matrices with the Bures-Wasserstein geometry. In Advances in Neural Information Processing Systems, volume 34, pp. 8940–8953, 2021.

Han, A., Mishra, B., Jawanpuria, P., and Gao, J. Rieman- nian block SPD coupling manifold and its application to optimal transport. Machine Learning, pp. 1–28, 2022.

Han, A., Mishra, B., Jawanpuria, P., and Gao, J. Nonconvex- nonconcave min-max optimization on riemannian manifolds. Transactions on Machine Learning Research, 2023a. ISSN 2835-8856.

Han, A., Mishra, B., Jawanpuria, P., Kumar, P., and Gao, J. Riemannian hamiltonian methods for min-max optimization on manifolds. SIAM Journal on Optimization, 33(3): 1797–1827, 2023b.

Han, A., Mishra, B., Jawanpuria, P., and Gao, J. Differen- tially private Riemannian optimization. Machine Learning, 113(3):1133–1161, 2024a.

Han, A., Mishra, B., Jawanpuria, P., and Takeda, A. A framework for bilevel optimization on Riemannian manifolds. Technical report, arXiv preprint arXiv:2402.03883, 2024b.

Hinton, G., Vinyals, O., and Dean, J. Distilling the knowl- edge in a neural network. Technical report, arXiv preprint arXiv:1503.02531, 2015.

Huang, M., Ma, S., and Lai, L. A Riemannian block co- ordinate descent method for computing the projection robust Wasserstein distance. In International Conference on Machine Learning, pp. 4446–4455. PMLR, 2021.

Huang, W., Absil, P.-A., and Gallivan, K. A. A Riemannian symmetric rank-one trust-region method. Mathematical Programming, 150(2):179–216, 2015.

Huang, Z., Huang, W., Jawanpuria, P., and Mishra, B. Federated learning on Riemannian manifolds with differential privacy. Technical report, arXiv preprint arXiv:2404.10029, 2024.

Jawanpuria, P. and Mishra, B. A unified framework for structured low-rank matrix learning. In International Conference on Machine Learning (ICML), 2018.

Jawanpuria, P., Balgovind, A., Kunchukuttan, A., and Mishra, B. Learning multilingual word embeddings in a latent metric space: a geometric approach. Transactions of the Association for Computational Linguistics, 7(3): 107–120, 2019a.

Jawanpuria, P., Meghwanshi, M., and Mishra, B. Low- rank approximations of hyperbolic embeddings. In IEEE Conference on Decision and Control, 2019b.

Jawanpuria, P., Meghwanshi, M., and Mishra, B. Geometry- aware domain adaptation for unsupervised alignment of word embeddings. In Annual Meeting of the Association for Computational Linguistics, 2020.

Jiang, Y., Zhang, H., Qiu, Y., Xiao, Y., Long, B., and Yang, W.-Y. Givens coordinate descent methods for rotation matrix learning in trainable embedding indexes. In International Conference on Learning Representations, 2022.

Kasai, H., Jawanpuria, P., and Mishra, B. Riemannian adaptive stochastic gradient algorithms on matrix manifolds. In International Conference on Machine Learning (ICML), 2019.

Knight, P. A. The Sinkhorn–Knopp algorithm: convergence and applications. SIAM Journal on Matrix Analysis and Applications, 30(1):261–275, 2008.

Kressner, D., Steinlechner, M., and Vandereycken, B. Low- rank tensor completion by Riemannian optimization. BIT Numerical Mathematics, 54:447–468, 2014.

Lezcano Casado, M. Trivializations for gradient-based opti- mization on manifolds. In Advances in Neural Information Processing Systems, volume 32, 2019.

Li, J. and Ma, S. Federated learning on Riemannian mani- folds. Technical report, arXiv preprint arXiv:2206.05668, 2022.

Luo, Z.-Q. and Tseng, P. On the convergence of the coordi- nate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72(1):7–35, 1992.

Massart, E. and Abrol, V. Coordinate descent on the or- thogonal group for recurrent neural network training. In AAAI Conference on Artificial Intelligence, volume 36, pp. 7744–7751, 2022.

Massart, E. and Absil, P.-A. Quotient geometry with sim- ple geodesics for the manifold of fixed-rank positivesemidefinite matrices. SIAM Journal on Matrix Analysis and Applications, 41(1):171–198, 2020.

Miller, G. A. WordNet: An electronic lexical database. MIT press, 1998.

Mishra, B., Kasai, H., Jawanpuria, P., and Saroop, A. A Riemannian gossip approach to subspace learning on Grassmann manifold. Machine Learning, 108(10):1783– 1803, 2019.

Mishra, B., Satyadev, N., Kasai, H., and Jawanpuria, P. Man- ifold optimization for non-linear optimal transport problems. Technical report, arXiv preprint arXiv:2103.00902, 2021.

Nesterov, Y. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.

Nickel, M. and Kiela, D. Poincar´e embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, volume 30, 2017.

Nickel, M. and Kiela, D. Learning continuous hierarchies in the Lorentz model of hyperbolic geometry. In International Conference on Machine Learning, pp. 3779–3788. PMLR, 2018.

Nimishakavi, M., Jawanpuria, P., and Mishra, B. A dual framework for low-rank tensor completion. In Conference on Neural Information Processing Systems (NeurIPS), 2018.

Peng, L. and Vidal, R. Block coordinate descent on smooth manifolds. Technical report, arXiv preprint arXiv:2305.14744, 2023.

Pennec, X., Fillard, P., and Ayache, N. A Riemannian Framework for Tensor Computing. International Journal of Computer Vision, 66(1):41–66, 2006.

Reimherr, M., Bharath, K., and Soto, C. Differential privacy over riemannian manifolds. In Advances in Neural Information Processing Systems, volume 34, pp. 12292–12303, 2021.

Sato, H., Kasai, H., and Mishra, B. Riemannian stochastic variance reduced gradient algorithm with retraction and vector transport. SIAM Journal on Optimization, 29(2): 1444–1472, 2019.

Shalit, U. and Chechik, G. Coordinate-descent for learning orthogonal matrices through Givens rotations. In International Conference on Machine Learning, pp. 548–556. PMLR, 2014.

Shi, D., Gao, J., Hong, X., Boris Choy, S., and Wang, Z. Coupling matrix manifolds assisted optimization for optimal transport problems. Machine Learning, 110:533–558, 2021.

Siegel, J. W. Accelerated optimization with orthogonality constraints. Journal of Computational Mathematics, 39 (2):207–206, 2020.

Sinkhorn, R. Diagonal equivalence to matrices with pre- scribed row and column sums. The American Mathematical Monthly, 74(4):402–405, 1967.

Tripuraneni, N., Flammarion, N., Bach, F., and Jordan, M. I. Averaging stochastic gradient descent on Riemannian manifolds. In Conference on Learning Theory, pp. 650– 687. PMLR, 2018.

Utpala, S., Han, A., Jawanpuria, P., and Mishra, B. Im- proved differentially private Riemannian optimization: Fast sampling and variance reduction. Transactions on Machine Learning Research, 2022.

Vandereycken, B. Low-rank matrix completion by Rieman- nian optimization. SIAM Journal on Optimization, 23(2): 1214–1236, 2013.

Vandereycken, B., Absil, P.-A., and Vandewalle, S. Embed- ded geometry of the set of symmetric positive semidefinite matrices of fixed rank. In IEEE/SP Workshop on Statistical Signal Processing, pp. 389–392. IEEE, 2009.

Vandereycken, B., Absil, P.-A., and Vandewalle, S. A Rie- mannian geometry with complete geodesics for the set of positive semidefinite matrices of fixed rank. IMA Journal of Numerical Analysis, 33(2):481–514, 2013.

Wang, J., Chen, Y., Chakraborty, R., and Yu, S. X. Or- thogonal convolutional neural networks. In Conference on Computer Vision and Pattern Recognition, pp. 11505– 11515, 2020.

Wen, Z. and Yin, W. A feasible method for optimization with orthogonality constraints. Mathematical Programming, 142(1-2):397–434, 2013.

Wilson, B. and Leimeister, M. Gradient descent in hyperbolic space. Technical report, arXiv preprint arXiv:1805.08207, 2018.

Wright, S. J. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, 2015.

Xiao, N. and Liu, X. Solving optimization problems over the Stiefel manifold by smooth exact penalty function. Technical report, arXiv preprint arXiv:2110.08986, 2021.

Xiao, N., Liu, X., and Yuan, Y.-x. A class of smooth exact penalty function methods for optimization problems with orthogonality constraints. Optimization Methods and Software, 37(4):1205–1241, 2022.

Xiao, N., Liu, X., and Toh, K.-C. Dissolving constraints for Riemannian optimization. Mathematics of Operations Research, 2023.

Yuan, G. A block coordinate descent method for nonsmooth composite optimization under orthogonality constraints. Technical report, arXiv preprint arXiv:2304.03641, 2023.

Zhang, H., J Reddi, S., and Sra, S. Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds. In Advances in Neural Information Processing Systems, volume 29, 2016.

Zhang, P., Zhang, J., and Sra, S. Sion’s minimax theorem in geodesic metric spaces and a Riemannian extragradient algorithm. SIAM Journal on Optimization, 33(4):2885– 2908, 2023.

Zhu, X. A Riemannian conjugate gradient method for op- timization on the Stiefel manifold. Computational Optimization and Applications, 67:73–110, 2017.

A. Additional experiment details

The experiments are run on a laptop with an i5-10500 3.1GHz CPU processor.

A.1. Orthogonal deep networks for distillation

Here we provide the detailed network architecture for the distillation task. In particular, we define a L-layer feed-forward neural network with Tanh activation function, i.e., for , where . In the experiment, we set L = 6.

A.2. Learning Lorentz embeddings

Here we provide the problem formulation for the task of learning Lorentz embeddings. Let D = {(u, v)} be the related word pairs and construct as the negative samples of word u. The objective is to learn embeddings for all word u by solving

where is the Lorentz Riemannian distance.

B. A review on coordinate descent for orthogonal and SPD manifold

We start by reviewing the developments of coordinate descent on the orthogonal manifold (Shalit & Chechik, 2014; Massart & Abrol, 2022; Jiang et al., 2022), Stiefel manifold (Gutman & Ho-Nguyen, 2023), and symmetric positive definite (SPD) manifold (Darmwal & Rajawat, 2023), which motivate the proposed general framework for other manifolds.

Some other works (Huang et al., 2021; Peng & Vidal, 2023) study (block) coordinate descent on a product of manifolds, where each update concerns a component manifold. This is different to our considered setting, where the update is defined for coordinate on the tangent space for a single manifold.

B.1. Orthogonal manifold

Orthogonal manifold O(n) is the smooth space formed by the orthogonality constraints, i.e., . The tangent space can be identified as . The Riemannian metric coincides with the Euclidean metric, i.e., is added to ensure consistency with the canonical metric for the Stiefel manifold, as we shall see later. This leads to the Riemannian gradient and the exponential retraction is given by

Remark B.1. We remark that in all the existing works (Shalit & Chechik, 2014; Massart & Abrol, 2022; Jiang et al., 2022), the tangent space is parameterized as and thus the exponential retraction amounts to . Such a formulation is equivalent to the above by letting . Our reformulation allows natural generalization of the coordinate descent framework to column orthonormal matrices (the Stiefel manifold), where the orthogonal matrix is a special case.

The manifold has a dimension of and its tangent space can be provided with an orthonormal basis where is the basis for the skew-symmetric matrices. In each basis direction , the exponential retraction reduces to the Givens rotation, which allows efficient updates, i.e., is known as the Givens rotation matrix around axes i, j with angle

In order to minimize a function , one needs to update the variables in the negative gradient direction. Here along the basis direction, coordinate descent aims to minimize the function with respect to . One strategy is to solve this one-variable optimization problem directly as in (Shalit & Chechik, 2014). When the objective is more involved, we can approximately solve this problem by following a descent direction (Massart & Abrol, 2022; Jiang et al., 2022), which is given by . This leads to the coordinate descent update in the direction of , which modifies two rows of X every iteration, resulting in an O(n) complexity per update. One pass over all the coordinates requires iterations, leading to complexity in total, which is comparable to the commonly considered retractions, including the exponential, Cayley and QR retractions.

B.2. Stiefel manifold

The Stiefel manifold St(n, p) is the set of column orthonormal matrices of size , i.e., . When . The tangent space of Stiefel manifold is identified as . The Euclidean metric turns out to be a valid Riemannian metric (Edelman et al., 1998), for any . The Riemannian gradient is derived as

In (Gutman & Ho-Nguyen, 2023), a coordinate (subspace) descent algorithm has been developed for general manifolds, via selecting proper subspaces for the tangent space. Although showing theoretical guarantees, the paper only provides a concrete developments for Stiefel manifold (thus including the orthogonal manifold). The basis considered for the tangent space of Stiefel manifold is , where . They show along the direction of basis, the exponential retraction can be computed efficiently. That is, for basis , we compute , which also leads to the Givens rotation. The projection of Riemannian gradient onto the basis is given by . For basis the projection of Riemannian gradient is . This essentially updates j-th column by Riemannian gradient descent over sphere. We notice that each coordinate update costs O(n) while the projection onto second basis costs O(np).

Further, we highlight a recent paper (Yuan, 2023), which considers block coordinate descent updates for Stiefel manifold by modifying k rows. The strategy is to decompose the the rows of the variable into two sets and solve for a subproblem that updates k rows instead. Nevertheless, each subproblem can be difficult to solve in general.

B.3. SPD manifold

A recent work (Darmwal & Rajawat, 2023) develops coordinate (subspace) descent algorithm for symmetric positive definite (SPD) manifold, i.e., , with tangent space given by . Optimization with SPD constraint is difficult as the cost of maintaining positive definiteness requires at least . Under the affine-invariant metric (Bhatia, 2009), i.e., , the paper verifies that the tangent space can be provided with an orthonormal basis where is the Cholesky decomposition and for and for i = j. The exponential retraction along the basis direction can be simplified as , where the Cholesky factor of has a simple form. Thus it suffices to update the Cholesky factor. The coordinate descent then updates as is computed as the coordinate derivative because the Riemannian gradient has the form

C. Other related works

This section summarizes the existing works for optimization under the respective manifold constraints. These methods exploit the full gradient information and in general have advantage over CD methods.

Optimization on orthogonal, Stiefel, and Grassmann manifolds. Optimization with orthogonality constraint has been widely studied. Apart from the Riemannian optimization approach, many recent works turn to infeasible methods, either through converting the orthogonality constrained problem into an unconstrained counterpart in the Euclidean space (Xiao et al., 2022; Lezcano Casado, 2019; Xiao & Liu, 2021; Xiao et al., 2023), or following a direction that leads to the same critical point on the manifold (Gao et al., 2019; Ablin & Peyr´e, 2022; Ablin et al., 2023). We provide a through review of these methods in Appendix section D. Although the per-iteration cost is smaller without using a retraction to ensure feasibility, the methods are sensitive to the choice of stepsize and other parameters, mostly a regularization parameter. In the experiment sections, we observe the infeasible methods either require careful tuning of multiple parameters or some stepsize sequence to show good performance.

Optimization on hyperbolic, symplectic and SPSD manifold. Optimization over hyperbolic space H(n, p) has mostly focused on the case where p = 1 (Nickel & Kiela, 2018; Wilson & Leimeister, 2018). Given the exponential map is already efficient in this case, few works have explored more efficient alternatives to Riemannian optimization. Similarly for the symplectic manifold, existing works (Gao et al., 2021b;a; 2022) have focused on the retraction-based Riemannian optimization approach. Optimization over SPD matrices has a long history and can be solved with semidefinite programming if the objective is convex. For general cost functions and with additional rank constraint, many works (Vandereycken et al., 2009; 2013; Massart & Absil, 2020) have defined Riemannian geometries for the constraint set to leverage the tools from Riemannian optimization.

D. Review of infeasible methods for optimization under orthogonality constraints

This section reviews various infeasible methods for solving optimization problems under orthogonality constraints, i.e.,

where is the Lagrangian multipliers. Because is symmetric, we can compute (by left-multiplying the first equation by ). Subsequently, the first condition becomes , which in fact corresponds to the first-order condition of Riemannian optimization (under the canonical metric). The augmented Lagrangian of (4) is given by

and the Augmented Lagrangian Method (ALM) outlined in (Gao et al., 2019) considers alternately updating X and . However the numerical performance of ALM is sensitive to the choice of , which has been empirically verified in (Gao et al., 2019).

PLAM. Gao et al. (2019) propose an alternative update of the Lagrangian multipliers by setting it to which is optimal at first-order stationarity, and the proximal linearized augmented Lagrangian (PLAM) algorithm takes the update

which corresponds to the However, the boundedness of the iterates cannot be expected without setting to be sufficiently large and to be sufficiently small.

PCAL. Gao et al. (2019) further constrain the proximal linearized update in (5) to a redundant column-sphere constraint. This is used to reduce the sensitivity of convergence to . The update can be implemented in a column-wise parallel fashion as

PenCF Xiao et al. (2022) consider the merit function (used to analyze function value decrease in (Gao et al., 2019)) as an exact penalty function to be minimized. Specifically, the merit function is given by

Without proper constraints, directly minimizing h(X) can be problematic because h(X) can be unbounded. Hence, (Xiao et al., 2022) incorporates constraints to the minimization of h(X) (called PenC), including a ball, convex hull of Stiefel/oblique manifold, which includes the orthogonality constraint set. Then the paper shows the first-order and second-order critical points of PenC match those of the original problem (provided is sufficiently large). Nevertheless, the gradient involves second-order derivatives . Hence, the paper similarly considers approximating with , which is the same as in (Gao et al., 2019) and then project the update to the constraint set.

Landing field Ablin et al. (2023) consider the update

where the direction takes a combination of the Riemannian gradient and a landing direction orthogonal to the Riemannian gradient.

ExPen Xiao & Liu (2021); Xiao et al. (2023) convert the constrained optimization problem to an unconstrained problem by minimizing the following objective:

It can be shown that when is sufficiently large, first-order and second-order critical points recover those of the original problem. When compared to the framework of PCAL/PenC, the problem is now unconstrained and the gradient can be computed as

E. Additional developments and proofs for Section 3

E.1. Stiefel manifold

In the main text, we focus on the Euclidean metric on Stiefel manifold, i.e., , which leads to the Rieman- nian gradient . We here review another popular metric on Stiefel manifold, namely the canonical metric (Edelman et al., 1998), defined as . The corresponding Riemannian gradient is derived as

There exists a variety of retractions for the Stiefel manifold, including the QR retraction qf(A) extracts the q-factor from the thin QR decomposition. The Cayley retraction (Wen & Yin, 2013; Zhu, 2017) is . The exponential retraction is given by

Here we verify the same coordinate derivative is recovered under the canonical metric.

Lemma E.1. Let Bfor , then we have θ

Proof of Lemma E.1. Recall the

where we use the fact that skew-symmetric matrix is orthogonal to symmetric matrix with respect to the Euclidean inner product. Similarly for the canonical metric,

This verifies is the same under the two metrics.

E.2. Grassmann manifold

Proof of Proposition 3.2. It suffices to verify that as the Euclidean inner product . We start by noticing f(X) = f(XQ) for any . Then taking derivative on both sides gives

E.3. Hyperbolic manifold

E.3.1. PROOFS

Proof of Proposition 3.3. By the decomposition, we can express for some . Left-multiplying . Summing both sides with the transposes yields . Hence the projection to is given by

From the definition of Riemannian gradient, we have , where we use the self-adjoint property of orthogonal projection with respect to the metric. Thus the Riemannian gradient is

Proof of Proposition 3.4. First, we see c(0) = X and we compute U. The only part left is to show . This can be verified by showing expm(tWJ) is a Lorentz transform. To see this, let . Then we have

Thus . This completes the proof as because

Proof of Lemma 3.5. For the case where , we can see and thus leads to the Givens rotation. However in the case where . Thus, it remains to show can be simplified. To this end, we see . Then we can show

In addition to the Euclidean metric, we define the canonical metric on hyperbolic manifold as for

The normal space under the canonical metric is the same as that for the Euclidean metric and thus the orthogonal projection can be derived also to be the same. The Riemannian gradient can be derived as follows.

Proposition E.2. The orthogonal projection of to is given by . The Riemannian gradient with respect to the canonical metric is

Proof of Proposition E.2. First we notice that . This implies . Then by definition of Riemannian gradient, we have Hence, gradf(X) =

Motivated by the Cayley transform for (generalized) Stiefel manifold, we define the Cayley transform for generalized hyperbolic manifold as follows. Then in Proposition E.4, we show the Cayley transform naturally leads to a valid retraction on the generalized hyperbolic manifold.

Definition E.3. For and , the Cayley transform is defined as , which is well-defined if is nonsingular.

Proposition E.4 (Cayley retraction). For and , The map is a retraction, where

Proof of Proposition E.4. First, because , we can express U = WJX for some . Since H(d, r) is an embedded submanifold of , let the curve and we have c(0) = X and

with . It remains to show . First we notice that

Then

where the second equality uses the fact that and the last equality is due to for any A.

Finally we can verify that

where we use

E.4. Symplectic manifold

E.4.1. REVIEW OF CANONICAL METRIC AND RETRACTIONS

The canonical metric of symplectic manifold is developed in (Gao et al., 2021b), as for some chosen , where and . The Riemannian gradient associated with the metric is . The quasi-geodesic retraction is derived by replacing the covariance derivative with the Euclidean derivative, given by

0 where . The symplectic Cayley retraction is

. In (Gao et al., 2022), a SR decomposition based retraction is proposed. That is, let where is the j-th basis vector of . Then denote a congruence matrix set as is upper triangulardecomposition of extracts the S factor.

E.4.2. PROOFS

The proof of Proposition 3.6 follows immediately from the following Lemma.

Lemma E.5. For any

Proof of Lemma E.5. The proof follows from (Gao et al., 2021b, Proposition 4.6).

Proof of Proposition 3.6. Similar to the previous sections, let and we have c(0) = X and . Finally from Lemma E.5, we have which verifies

Proof of Proposition 3.7. We can partition the basis

For , we have and thus we obtain . Similarly for . This verifies

For , we have and . Then

notice that for k = 2, 3, 4, ...

This suggests for k = 2, 3, 4, ...,

This verifies for

The proof is now complete.

E.5. Doubly stochastic manifold

For the doubly stochastic manifold, the retraction applies the Sinkhorn algorithm (Knight, 2008) for matrix balancing, i.e., , where U is a tangent vector belonging to the tangent space and the Sinkhorn algorithm SK(U) iteratively normalize rows and columns of U according to the given marginals (Shi et al., 2021; Cardoso & Leite, 2010).

E.5.1. PROOFS

Proof of Proposition 3.9. In fact, we can show for any for . The coordinate Sinkhorn is a valid retraction along the direction and we can immediately see c(0) = X. Also, Let differs with X in only the entries at (i, k), (j, k), (i, l), (k, l), which forms the sub-matrix that we wish to balance. Also, by definition, the marginals are given by . It readily holds that . For notational purposes, for any matrix be the matrix that zeros out the entries except for the sub-matrix. Also, we denote

Sinkhorn on with marginals with other entries of unchanged. This is well-defined as the Sinkhorn algorithm converges to the unique doubly stochastic matrix of the form for some positive vectors u, v (Sinkhorn, 1967). This verifies that results in a doubly stochastic matrix, which remains on the manifold. Lastly, it remains to show that . For this, we first have

where we use the first-order approximation of the exponential operations. Notice that only modifies the sub-matrix of Douik & Hassibi, 2019; Shi et al., 2021), we have . This suggests which verifies

Lemma E.6. Given a positive matrix A =

c d , the Sinkhorn algorithm on A with marginals

converges towhere c, cwhere κ

is the positive root of the equation

Proof of Lemma E.6. Sinkhorn algorithm converges to the unique doubly stochastic matrix of the form diag(u)Adiag(v)

for . From the constraints, diag(and diag(we need to

Let which transforms (6) into a set of linear equations for the variables . The equation system, however, is under-determined and has many solutions. The unique solution that is sought should satisfy . To this end, from the first two equations, we obtain

Substituting the expressions to the last equation yields , which we solve for as the positive root of

F. Formal developments and proofs for Section 4

F.1. Developments

Assumption F.1. Consider a neighbourhood that contains a critical point.

F.1.1 The basis and its projection are bounded. Let the projection onto the basis be . There exists constant such that and

F.1.2 The objective f is retraction L-smooth in X, i.e.,

Remark F.2. Assumption F.1.1 requires the basis has a bounded norm and the projection of any tangent vector onto the basis does not vanish. Such an assumption is manifold-specific and we can verify that has an upper bound (e.g., for Stiefel and Grassmann, ). Then we note that the second requirement trivially holds for orthonormal basis due to the decomposition of U and Jensen’s inequality. For non-orthonormal basis, this assumption also holds as long as projection of a tangent vector does not vanish. Assumption F.1.2 can also be satisfied by the compactness of the domain, e.g., we can take L in Assumption F.1.2 to be . These are all reasonable assumptions within a compact neighbourhood X.

Theorem 4.1 (Formal). Under Assumption F.1, consider RCD algorithm with selected uniformly at random

To analyze the cyclic variant, we further require the assumption that bounds the difference between Riemannian distance and distance induced by general retraction. In addition, we require the gradient Lipschitzness.

Assumption F.3. Under the same settings as in Assumption F.1,

F.3.1 For all , there exists constants F.3.2 The objective has retraction -Lipschitz gradient, i.e., and is the an isometric vector transport that satisfies F.3.3 For any fixed coordinate index , there exists a constant such that for all ,

Remark F.4. Assumption F.3.1 bounds the difference Riemannian distance (relating to inverse exponential map) and the inverse retraction. Because retraction is a first-order approximation to the exponential map, this assumption naturally holds when the domain is sufficiently small (see (Huang et al., 2015; Sato et al., 2019)). Assumption F.3.2 is further required because when general retraction is used, gradient Lipschitzness is not equivalent to function smoothness. Assumption F.3.3 further claims that the difference between the same coordinate basis on different tangent spaces is bounded. We note that the RHS is identical to due to the isometric vector transport. Then it reduces to whether are related, which is expected because due to the compactness of the domain, X is bounded from Y . This allows to establish the convergence for cyclic selection of basis.

Theorem 4.2 (Formal). Under Assumption F.1 and consider RCD algorithm with Then selecting

Theorem 4.3 (Formal). Under Assumption F.1, F.3 and further let , such that for any fixed epoch k,

(Randomized). Consider RCDlin algorithm with selected uniformly at random from I. Then choosing

(Cyclic). Suppose and consider RCDlin algorithm with S = |I| and for 1. Then choosing , we have , where

Remark F.5. We finally remark that the proof ideas of cyclic and randomized RCD follow from classic developments of coordinate descent (Wright, 2015) in the Euclidean space by showing sufficient descent in the objective function. On general manifolds, in order to generalize the proof ideas, we further require the assumptions outlined in Assumption F.1, F.3. In particular, for cyclic selection rule, we require Assumption F.3.3 to relate bases from different tangent spaces. Similar assumptions have been considered in (Gutman & Ho-Nguyen, 2023) for showing convergence of deterministic subspace descent algorithms on manifolds (see (C, r)-norm condition).

F.2. Proofs

Proof of Theorem 4.1. Because S = 1 and by retraction L-smoothness, we have

Telescoping this inequality and taking full expectation yields

where we let

Lemma F.6. Under Assumption F.1.1, we have

Proof of Lemma F.6. . Can- celling on both sides completes the proof.

Proof of Theorem 4.2. We first focus on a single epoch k and for notation simplicity, we let Similarly from retraction L-smoothness,

Then it remains to bound the RHS. From -Lipschitz,

where we use Assumption F.3 and triangle inequality of Riemannian distance.

Now we can show

where the second inequality is due to Lemma F.6. Summing this inequality from

where we notice S = |I|. Also due to the cyclic selection of , we can see the LHS is

where we use Assumption F.1.1 and F.3.3. Combining with previous results, we finally obtain

Proof of Theorem 4.3. For the randomized setting, by retraction L-smoothness,

where we choose . The second inequality follows from the assumption and the third inequality is due to Assumption F.1.1. Following the similar proof strategy, we obtain the desired result. For the cyclic setting, the bound also readily follows by using the above result.

Designed for Accessibility and to further Open Science