Riemannian coordinate descent algorithms on matrix manifolds

2 weeks ago·arXiv

Abstract

Abstract

Many machine learning applications are naturally formulated as optimization problems on Riemannian manifolds. The main idea behind Riemannian optimization is to maintain the feasibility of the variables while moving along a descent direction on the manifold. This results in updating all the variables at every iteration. In this work, we provide a general framework for developing computationally efficient coordinate descent (CD) algorithms on matrix manifolds that allows updating only a few variables at every iteration while adhering to the manifold constraint. In particular, we propose CD algorithms for various manifolds such as Stiefel, Grassmann, (generalized) hyperbolic, symplectic, and symmetric positive (semi)definite. While the cost per iteration of the proposed CD algorithms is low, we further develop a more efficient variant via a first-order approximation of the objective function. We analyze their convergence and complexity, and empirically illustrate their efficacy in several applications.

1. Introduction

In this work, we consider the optimization problem

where M is a smooth, and often nonlinear constraint. Examples of M include orthogonality constraint (Edelman et al., 1998), positive (semi)definite constraint (Bhatia, 2009; Han et al., 2021), fixed-rank constraint (Vandereycken, 2013), hyperbolic constraint (Nickel & Kiela, 2018), doubly stochastic constraint (Douik & Hassibi, 2019), etc. Problem (1) has been explored in applications such as PCA (Zhang et al., 2016; Kasai et al., 2019), low-rank matrix/tensor completion (Jawanpuria & Mishra, 2018; Nimishakavi et al.,

2018; Kressner et al., 2014), computer vision (Pennec et al., 2006), natural language processing (Jawanpuria et al., 2019a; 2020), optimal transport (Mishra et al., 2021; Han et al., 2022; Shi et al., 2021), and deep learning (Arjovsky et al., 2016; Wang et al., 2020). Problem (1) has also been studied in various settings such as stochastic optimization (Bonnabel, 2013; Zhang et al., 2016; Tripuraneni et al., 2018; Sato et al., 2019; Kasai et al., 2019; Han & Gao, 2021), differential privacy (Reimherr et al., 2021; Han et al., 2024a; Utpala et al., 2022), federated learning (Li & Ma, 2022; Huang et al., 2024), decentralized learning (Mishra et al., 2019), and saddle point and bilevel optimization (Han et al., 2023b;a; Zhang et al., 2023; Han et al., 2024b).

The smooth constraint set can be turned into a Riemannian manifold by endowing a properly chosen metric structure. The Riemannian optimization approach (Absil et al., 2008; Boumal, 2023) then provides a principled approach to solve (1) intrinsically on the manifold space. The main idea is to iteratively update the variable along a descent direction without leaving the manifold. The descent direction is often computed using the Riemannian gradient, which is then followed by a retraction update to ensure feasibility of the manifold constraint. As the dimensionality of the constraint set increases, ensuring feasibility via retraction becomes a key computational bottleneck, e.g., the complexity of ensuring orthogonality and positive definiteness scales as with the input dimension n. This has led to many recent works (Gao et al., 2019; Xiao & Liu, 2021; Ablin et al., 2023) that develop infeasible methods for solving (1). However, such methods are largely limited to the orthogonality constraint and cannot be easily adapted to other manifolds.

In the Euclidean space, the coordinate descent (CD) method (Luo & Tseng, 1992; Nesterov, 2012; Wright, 2015) is a classic algorithm that successively solves a smalldimensional subproblem along a component of the vector variable while holding others fixed. Since each subproblem can be more easily solved than the original problem, this strategy leads to efficient variable update.

On manifolds, designing CD updates is inherently difficult (Gutman & Ho-Nguyen, 2023). A few works have proposed manifold specific CD updates, mainly for the orthogonal (Shalit & Chechik, 2014; Jiang et al., 2022; Massart & Abrol, 2022) and Stiefel (Gutman & Ho-Nguyen, 2023) manifolds. Although Gutman & Ho-Nguyen (2023) discuss a general framework for developing CD methods on manifolds, concrete developments have been shown only for the Stiefel manifold. Recently, for a class of optimization objectives, Darmwal & Rajawat (2023) have proposed CD updates on the symmetric positive definite manifold with the affine-invariant metric.

In this work, we provide a general approach for developing CD algorithms on matrix manifolds. We summarize our contributions below.

• We introduce a framework for designing CD algorithms on manifolds. In particular, we find a basis spanning the tangent space such that a chosen retraction along the direction of such a basis admits an efficient computation. We discuss a simple expression for the coordinate derivative. Finally, we provide optimization ingredients for various matrix manifolds of interest.

• A nonlinear objective f in (1) requires computation of gradient for every CD update. Using a first-order approximation of f, we develop a more efficient CD algorithm which requires gradient computations one in every fixed number of CD updates. We analyze the convergence and complexity of the two algorithms with randomized and cyclic selection of coordinates.

• We show the benefits of the proposed CD algorithms on the orthogonal Procrustes, PCA, orthogonal deep network distillation, nearest matrix, and learning hyperbolic embeddings problems.

2. Preliminaries

Riemannian manifolds and optimization. For a Riemannian manifold M, denote its tangent space at as . A Riemannian metric is an inner product structure that varies smoothly with the base point X. In this work, we particularly focus on matrix manifolds, i.e., where X can be represented in the ambient vector space . The orthogonal projection projects arbitrary ambient vectors to the tangent space with respect to the Riemannian metric. For a differentiable function , the Riemannian gradient at X is defined as the tangent vector such that where . A retraction allows points to move along the manifold, which satisfies the conditions: and

Related works. We provide a detailed review of the existing coordinate descent (CD) algorithms on specific manifolds, along with other related works in Appendix B.

Notations. We use without the subscript to represent the Euclidean inner product while we use to denote the Riemannian inner product on . The specific expression for depends on both M and X. Sym(n) and Skew(n) denote the sets of symmetric and skew-symmetric matrices, respectively. Let be the elementwise exponential, and be the matrix exponential. We also use to represent the i-th basis vector with the dimension to be determined from the context. denotes the i, j-th entry of a matrix represents a matrix with index i, j. We use to denote the identity matrix, to denote the size-n vector of all 1s, and define [n] := {1, 2, ..., n}.

3. Proposed CD Framework

As shown in (Shalit & Chechik, 2014; Gutman & Ho- Nguyen, 2023; Massart & Abrol, 2022; Jiang et al., 2022; Darmwal & Rajawat, 2023), for specific manifolds, the key in developing CD algorithms is the choice of the basis vectors and I denotes the index set) spanning the tangent space that allow efficient retraction. In general, our chosen basis need not be orthonormal with respect to the Riemannian metric. Once the basis and retraction are chosen, the CD update is given by , where is the stepsize and is the coordinate derivative, i.e.,

It can be verified that is indeed a descent direction, i.e., . The CD algorithm then involves iteratively selecting coordinate index , computing , and updating in the coordinate descent direction

The main challenges in developing CD algorithms on matrix manifolds are: 1) characterization of which facilitates efficient computation, 2) efficient computation of easy generalization to different manifolds. We propose to leverage the following connection between the Riemannian and Euclidean gradients:

where is the Euclidean gradient and the last equality follows from the definition of the Riemannian gradient. We exploit (3) to efficiently compute for several manifolds as it is independent of the Riemannian gradient and metric. In the subsequent sections, we develop concrete CD optimization ingredients for the manifolds of interest under the proposed approach. These are summarized in Table 1.

Table 1: Summary of CD ingredients over various manifolds. and are the basis for skew-symmetric and symmetric matrices, respectively. and corresponds to the Givens and hyperbolic rotations, respectively. , and is the k-th column of . We use to denote the exponential retraction over sphere. The complexity only considers the computation of coordinate derivative and coordinate update, while excluding the complexity of first-order oracle

3.1. CD on Stiefel manifold

The Stiefel manifold St(n, p) is the set of column orthonormal matrices of size , i.e., , the orthogonal manifold. The tangent space of Stiefel manifold is identified as . The Riemannian metric is defined as for any . The Riemannian gradient is derived as

Choice of basis. Taking inspiration from O(n), we adopt the parameterization of the tangent vectors (where ) and choose the basis as for In contrast to O(n), the chosen basis is not orthonormal for St(n, p). This is expected as the manifold St(n, p) has a dimension while we adopt an overparameterization of the tangent space using basis vectors.

Retraction. For the purpose of CD update, we first note that is a valid retraction on St(n, p) because: 1) and 2) are satisfied (Siegel, 2020).

CD update. Based on the above choices of the basis vectors and retraction, the proposed CD update is , where . Here, is known as the Givens rotation around axes i, j with angle . Overall, each CD update only requires O(p) as we modify only two rows of X.

Remark 3.1. Gutman & Ho-Nguyen (2023) propose a column-wise CD update on the Stiefel manifold which costs O(np) per iteration. On the other hand, our proposed CD update is row-wise and costs O(p), which is cheaper. Furthermore, the CD update strategy of (Gutman & Ho-Nguyen, 2023) cannot be applied to the sphere manifold, i.e., when p = 1, it reduces to the full gradient update on the sphere. This, however, is not an issue for our update. Finally, the update of Gutman & Ho-Nguyen (2023) is not invariant to the right action of orthogonal group and hence does not yield a valid CD strategy for the Grassmann manifold. In contrast, as shown in the next section, our strategy can be readily generalized to the Grassmann manifold.

3.2. CD on Grassmann manifold

The Grassmann manifold Gr(n, p) represents the p-dimensional subspaces in , which can be represented by an orthonormal matrix X, i.e., , where the columns span the subspace. The representation is not unique, with XQ representing the same subspace for arbitrary . Thus, the Grassmann manifold can be identified as , where . The tangent space can be uniquely characterized by the horizontal space at , i.e., . For a given , its unique horizontal lift is , where [X] is represented as X. The lift operator satisfies . On Gr(n, p), the Riemannian metric is pushed forward by the Euclidean metric on St(n, p) as and the corresponding Riemannian gradient gradf([X]) can be represented by . Retractions such as QR retraction for St(n, p) also work for Gr(n, p) as long as it preserves the equivalence class, i.e., for any . Below, we show that the proposed CD update for St(n, p) is also well-defined for Gr(n, p).

Proposition 3.2. Consider a function R. Let the coordinate descent update at [X] be given by for , where and for some fixed stepsize . Then,

3.3. CD on hyperbolic manifold

We now consider the generalized hyperbolic space (Bai & Li, 2014; Xiao et al., 2023) , where is the metric tensor. When p = 1, this reduces to the wellknown hyperbolic space (the hyperboloid model). The tangent space at is identified as

T

The Riemannian metric on TH(n, p) is the generalized Lorentz inner product as . The normal space is given by . The orthogonal projection to and the Riemannian gradient are derived below.

Proposition 3.3. The orthogonal projection of is given by . The Riemannian gradient is

Choice of basis. For generalized hyperbolic manifold, we consider the basis

Retraction. Taking inspiration from our Stiefel analysis in Section 3.1, we define the map expm(tWJ)X for . We next show such a map de