1b:["$","$L29",null,{"isWhiteLabelled":false,"children":["$","$Lb",null,{"pt":{"compact":0,"expanded":3},"children":[["$","$L2a",null,{"noStar":true,"publisher":true,"task":true,"params":true,"size":"xl","product":{"id":"eyJwYXBlcklEIjoiMjQwNi4wMjIyNSIsInB1Ymxpc2hlciI6ImFyeGl2In0=","publisher":"arxiv","updated":"2024-06-04T00:00:00.000Z","paperID":"2406.02225","published":"2024-06-04T00:00:00.000Z","authors":"[\"Han Andi\",\"Jawanpuria Pratik\",\"Mishra Bamdev\"]","title":"Riemannian coordinate descent algorithms on matrix manifolds","scoreTrending":null,"summary":"Many machine learning applications are naturally formulated as optimization\nproblems on Riemannian manifolds. The main idea behind Riemannian optimization\nis to maintain the feasibility of the variables while moving along a descent\ndirection on the manifold. This results in updating all the variables at every\niteration. In this work, we provide a general framework for developing\ncomputationally efficient coordinate descent (CD) algorithms on matrix\nmanifolds that allows updating only a few variables at every iteration while\nadhering to the manifold constraint. In particular, we propose CD algorithms\nfor various manifolds such as Stiefel, Grassmann, (generalized) hyperbolic,\nsymplectic, and symmetric positive (semi)definite. While the cost per iteration\nof the proposed CD algorithms is low, we further develop a more efficient\nvariant via a first-order approximation of the objective function. 
We analyze\ntheir convergence and complexity, and empirically illustrate their efficacy in\nseveral applications.","lastCheckedForCode":"2024-08-28T17:36:15.783Z","links":[{"id":"eyJ1cmwiOiJodHRwczovL3BhcGVyc3dpdGhjb2RlLmNvbS9wYXBlci9yaWVtYW5uaWFuLWNvb3JkaW5hdGUtZGVzY2VudC1hbGdvcml0aG1zLW9uIn0=","type":"pwc","url":"https://paperswithcode.com/paper/riemannian-coordinate-descent-algorithms-on","data":"{\"date\":\"2024-09-04T21:24:12.431Z\"}"}],"reposConnection":{"edges":[{"official":null,"node":{"id":"eyJyZXBvSUQiOiI4MTc1MzUyNjciLCJzb3VyY2UiOiJnaXRodWIifQ==","source":"github","repoID":"817535267","url":"https://github.com/andyjm3/rcd","title":"rcd","language":"matlab","stars":1,"forks":0,"framework":null,"scoreTrending":null,"updated":null,"created":null,"downloads":null,"likes":null,"owner":[{"username":"andyjm3","avatar":"https://avatars.githubusercontent.com/u/68631477?v=4"}]}}]},"models":[],"tags":[{"id":"eyJuYW1lIjoicmllbWFubmlhbiBvcHRpbWl6YXRpb24iLCJ0eXBlIjoidGFzayJ9","name":"riemannian optimization","description":"Riemannian optimization in machine learning involves inputting a non-Euclidean data set and outputting an optimized solution that respects the geometry of the input space. 
This method is often used in computer vision and signal processing, where data naturally resides on non-Euclidean spaces, such as the space of symmetric positive definite matrices or the space of rotation matrices.","scoreTrending":null,"count":{"stars":113,"papers":81,"models":42},"__typename":"Tag"}],"summaries":[],"emailsConnection":{"edges":[{"author":"han andi","node":{"id":"eyJhZGRyZXNzIjoiYW5kaS5oYW5AcmlrZW4uanAifQ==","address":"andi.han@riken.jp","name":"Andi Han","avatar":null,"linkedin":null,"bio":null,"site":null,"override":null,"membership":[{"name":"Riken"}],"paper":[{"modelsAggregate":{"count":0}},{"modelsAggregate":{"count":0}},{"modelsAggregate":{"count":0}},{"modelsAggregate":{"count":0}},{"modelsAggregate":{"count":0}},{"modelsAggregate":{"count":0}},{"modelsAggregate":{"count":0}},{"modelsAggregate":{"count":0}},{"modelsAggregate":{"count":0}},{"modelsAggregate":{"count":0}},{"modelsAggregate":{"count":0}},{"modelsAggregate":{"count":0}},{"modelsAggregate":{"count":0}},{"modelsAggregate":{"count":0}}],"github":[],"scholar":[],"twitter":[],"location":[],"owner":[{"id":"eyJ1aWQiOiJjYTgyMmJjNy03MmZmLTQ3NDctOWRhYi01YzRkM2Y0ODI1M2MifQ==","name":"Andi Han","github":[],"email":[],"authored":[{"id":"eyJwYXBlcklEIjoiMjQwNi4wMjIxNCIsInB1Ymxpc2hlciI6ImFyeGl2In0=","publisher":"arxiv","paperID":"2406.02214"},{"id":"eyJwYXBlcklEIjoiMjQwMS4xNDU4MCIsInB1Ymxpc2hlciI6ImFyeGl2In0=","publisher":"arxiv","paperID":"2401.14580"},{"id":"eyJwYXBlcklEIjoiMjQwMS4wODExOSIsInB1Ymxpc2hlciI6ImFyeGl2In0=","publisher":"arxiv","paperID":"2401.08119"},{"id":"eyJwYXBlcklEIjoiMjQwMi4wMzg4MyIsInB1Ymxpc2hlciI6ImFyeGl2In0=","publisher":"arxiv","paperID":"2402.03883"},{"id":"eyJwYXBlcklEIjoiMjQwNi4wMjIyNSIsInB1Ymxpc2hlciI6ImFyeGl2In0=","publisher":"arxiv","paperID":"2406.02225"},{"id":"eyJwYXBlcklEIjoiMjQwNS4xMjUyMSIsInB1Ymxpc2hlciI6ImFyeGl2In0=","publisher":"arxiv","paperID":"2405.12521"}]}]}}]},"__typename":"paper","authorArray":["Han Andi","Jawanpuria Pratik","Mishra 
Bamdev"]}}],["$","$L18",null,{"container":true,"columns":100,"spacing":{"compact":0,"expanded":2,"large":3},"children":[["$","$L18",null,{"size":{"compact":100,"expanded":100,"large":68},"children":[["$","$7",null,{"children":["$","$L2b",null,{"publisher":"arxiv","paperID":"2406.02225","product":{"paper":"$1b:props:children:props:children:0:props:product","models":"$1b:props:children:props:children:0:props:product:models"},"isWhiteLabelled":false}]}],["$","$7",null,{"children":["$","$L2c",null,{"article":"$L2d","model":"$undefined"}]}]]}],["$","$L18",null,{"size":"grow","children":["$","$L2e",null,{}]}]]}],["$","$7",null,{"children":null}],[["$","audio",null,{"id":"tts"}],["$","$L2f",null,{"paperID":"2406.02225","publisher":"arxiv","paperJSON":{"title":"Riemannian coordinate descent algorithms on matrix manifolds","paperID":"2406.02225","avgLineHeight":11.96,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"Many machine learning applications are naturally formulated as optimization problems on Riemannian manifolds. The main idea behind Riemannian optimization is to maintain the feasibility of the variables while moving along a descent direction on the manifold. This results in updating all the variables at every iteration. In this work, we provide a general framework for developing computationally efficient coordinate descent (CD) algorithms on matrix manifolds that allows updating only a few variables at every iteration while adhering to the manifold constraint. In particular, we propose CD algorithms for various manifolds such as Stiefel, Grassmann, (generalized) hyperbolic, symplectic, and symmetric positive (semi)definite. While the cost per iteration of the proposed CD algorithms is low, we further develop a more efficient variant via a first-order approximation of the objective function. We analyze their convergence and complexity, and empirically illustrate their efficacy in several applications.","element":"span"}]]},{"heading":"1. 
Introduction","paragraphs":[[{"text":"In this work, we consider the optimization problem","element":"span"}],[{"id":"id-6","style":{"width":"78%"},"width":739,"height":61,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/0-0.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"is a smooth, and often nonlinear constraint. Examples of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"include orthogonality constraint (","element":"span"},{"href":"#id-0","referenceIndex":14,"text":"Edelman et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-0","referenceIndex":14,"text":"1998","element":"a"},{"text":"), positive (semi)definite constraint (","element":"span"},{"href":"#id-1","referenceIndex":6,"text":"Bhatia","element":"a"},{"text":", ","element":"span"},{"href":"#id-1","referenceIndex":6,"text":"2009","element":"a"},{"text":"; ","element":"span"},{"href":"#id-2","referenceIndex":21,"text":"Han ","element":"a"},{"href":"#id-2","referenceIndex":21,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-2","referenceIndex":21,"text":"2021","element":"a"},{"text":"), fixed-rank constraint (","element":"span"},{"href":"#id-3","referenceIndex":61,"text":"Vandereycken","element":"a"},{"text":", ","element":"span"},{"href":"#id-3","referenceIndex":61,"text":"2013","element":"a"},{"text":"), hyperbolic constraint (","element":"span"},{"href":"#id-4","referenceIndex":49,"text":"Nickel & Kiela","element":"a"},{"text":", ","element":"span"},{"href":"#id-4","referenceIndex":49,"text":"2018","element":"a"},{"text":"), doubly stochastic constraint (","element":"span"},{"href":"#id-5","referenceIndex":13,"text":"Douik & Hassibi","element":"a"},{"text":", ","element":"span"},{"href":"#id-5","referenceIndex":13,"text":"2019","element":"a"},{"text":"), etc. 
Problem (","element":"span"},{"href":"#id-6","text":"1","element":"a"},{"text":") has been explored in applications such as PCA (","element":"span"},{"href":"#id-7","referenceIndex":72,"text":"Zhang ","element":"a"},{"href":"#id-7","referenceIndex":72,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-7","referenceIndex":72,"text":"2016","element":"a"},{"text":"; ","element":"span"},{"href":"#id-8","referenceIndex":36,"text":"Kasai et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-8","referenceIndex":36,"text":"2019","element":"a"},{"text":"), low-rank matrix/tensor completion (","element":"span"},{"href":"#id-9","referenceIndex":31,"text":"Jawanpuria & Mishra","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":31,"text":"2018","element":"a"},{"text":"; ","element":"span"},{"href":"#id-10","referenceIndex":50,"text":"Nimishakavi et al.","element":"a"},{"text":",","element":"span"}],[{"href":"#id-10","referenceIndex":50,"text":"2018","element":"a"},{"text":"; ","element":"span"},{"href":"#id-11","referenceIndex":38,"text":"Kressner et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-11","referenceIndex":38,"text":"2014","element":"a"},{"text":"), computer vision (","element":"span"},{"href":"#id-12","referenceIndex":52,"text":"Pennec ","element":"a"},{"href":"#id-12","referenceIndex":52,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-12","referenceIndex":52,"text":"2006","element":"a"},{"text":"), natural language processing (","element":"span"},{"href":"#id-13","referenceIndex":32,"text":"Jawanpuria et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-13","referenceIndex":32,"text":"2019a","element":"a"},{"text":"; ","element":"span"},{"href":"#id-14","referenceIndex":34,"text":"2020","element":"a"},{"text":"), optimal transport (","element":"span"},{"href":"#id-15","referenceIndex":46,"text":"Mishra et al.","element":"a"},{"text":", 
","element":"span"},{"href":"#id-15","referenceIndex":46,"text":"2021","element":"a"},{"text":"; ","element":"span"},{"href":"#id-16","referenceIndex":22,"text":"Han ","element":"a"},{"href":"#id-16","referenceIndex":22,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-16","referenceIndex":22,"text":"2022","element":"a"},{"text":"; ","element":"span"},{"href":"#id-17","referenceIndex":56,"text":"Shi et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-17","referenceIndex":56,"text":"2021","element":"a"},{"text":"), and deep learning (","element":"span"},{"href":"#id-18","referenceIndex":4,"text":"Arjovsky ","element":"a"},{"href":"#id-18","referenceIndex":4,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-18","referenceIndex":4,"text":"2016","element":"a"},{"text":"; ","element":"span"},{"href":"#id-19","referenceIndex":64,"text":"Wang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-19","referenceIndex":64,"text":"2020","element":"a"},{"text":"). 
Problem (","element":"span"},{"href":"#id-6","text":"1","element":"a"},{"text":") has also been studied in various settings such as stochastic optimization (","element":"span"},{"href":"#id-20","referenceIndex":8,"text":"Bonnabel","element":"a"},{"text":", ","element":"span"},{"href":"#id-20","referenceIndex":8,"text":"2013","element":"a"},{"text":"; ","element":"span"},{"href":"#id-7","referenceIndex":72,"text":"Zhang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-7","referenceIndex":72,"text":"2016","element":"a"},{"text":"; ","element":"span"},{"href":"#id-21","referenceIndex":59,"text":"Tripuraneni et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-21","referenceIndex":59,"text":"2018","element":"a"},{"text":"; ","element":"span"},{"href":"#id-22","referenceIndex":54,"text":"Sato et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-22","referenceIndex":54,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-8","referenceIndex":36,"text":"Kasai et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-8","referenceIndex":36,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-23","referenceIndex":20,"text":"Han & Gao","element":"a"},{"text":", ","element":"span"},{"href":"#id-23","referenceIndex":20,"text":"2021","element":"a"},{"text":"), differential privacy (","element":"span"},{"href":"#id-24","referenceIndex":53,"text":"Reimherr et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-24","referenceIndex":53,"text":"2021","element":"a"},{"text":"; ","element":"span"},{"href":"#id-25","referenceIndex":25,"text":"Han et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-25","referenceIndex":25,"text":"2024a","element":"a"},{"text":"; ","element":"span"},{"href":"#id-26","referenceIndex":60,"text":"Utpala et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-26","referenceIndex":60,"text":"2022","element":"a"},{"text":"), 
federated learning (","element":"span"},{"href":"#id-27","referenceIndex":40,"text":"Li & Ma","element":"a"},{"text":", ","element":"span"},{"href":"#id-27","referenceIndex":40,"text":"2022","element":"a"},{"text":"; ","element":"span"},{"href":"#id-28","referenceIndex":30,"text":"Huang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-28","referenceIndex":30,"text":"2024","element":"a"},{"text":"), decentralized learning (","element":"span"},{"href":"#id-29","referenceIndex":45,"text":"Mishra ","element":"a"},{"href":"#id-29","referenceIndex":45,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-29","referenceIndex":45,"text":"2019","element":"a"},{"text":"), and saddle point and bilevel optimization (","element":"span"},{"href":"#id-30","referenceIndex":24,"text":"Han ","element":"a"},{"href":"#id-30","referenceIndex":24,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-30","referenceIndex":24,"text":"2023b","element":"a"},{"text":";","element":"span"},{"href":"#id-31","referenceIndex":23,"text":"a","element":"a"},{"text":"; ","element":"span"},{"href":"#id-32","referenceIndex":73,"text":"Zhang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-32","referenceIndex":73,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-33","referenceIndex":26,"text":"Han et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-33","referenceIndex":26,"text":"2024b","element":"a"},{"text":").","element":"span"}],[{"text":"The smooth constraint set can be turned into a Riemannian manifold by endowing a properly chosen metric structure. 
The Riemannian optimization approach (","element":"span"},{"href":"#id-34","referenceIndex":3,"text":"Absil et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-34","referenceIndex":3,"text":"2008","element":"a"},{"text":"; ","element":"span"},{"href":"#id-35","referenceIndex":9,"text":"Boumal","element":"a"},{"text":", ","element":"span"},{"href":"#id-35","referenceIndex":9,"text":"2023","element":"a"},{"text":") then provides a principled approach to solve (","element":"span"},{"href":"#id-6","text":"1","element":"a"},{"text":") intrinsically on the manifold space. The main idea is to iteratively update the variable along a descent direction without leaving the manifold. The descent direction is often computed using the Riemannian gradient, which is then followed by a retraction update to ensure feasibility of the manifold constraint. As the dimensionality of the constraint set increases, ensuring feasibility via retraction becomes a key computational bottleneck, e.g., the complexity of ensuring orthogonality and positive definiteness scales as ","element":"span"},{"style":{"height":17.38},"width":104.8,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/0-1.png","element":"img","alt":" O(n3)","inline":true,"padRight":true},{"text":"with the input dimension ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":". 
This has led to many recent works (","element":"span"},{"href":"#id-36","referenceIndex":15,"text":"Gao et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-36","referenceIndex":15,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-37","referenceIndex":68,"text":"Xiao & Liu","element":"a"},{"text":", ","element":"span"},{"href":"#id-37","referenceIndex":68,"text":"2021","element":"a"},{"text":"; ","element":"span"},{"href":"#id-38","referenceIndex":2,"text":"Ablin et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-38","referenceIndex":2,"text":"2023","element":"a"},{"text":") that develop infeasible methods for solving (","element":"span"},{"href":"#id-6","text":"1","element":"a"},{"text":"). However, such methods are largely limited to the orthogonality constraint and cannot be easily adapted to other manifolds.","element":"span"}],[{"text":"In the Euclidean space, the coordinate descent (CD) method (","element":"span"},{"href":"#id-39","referenceIndex":41,"text":"Luo & Tseng","element":"a"},{"text":", ","element":"span"},{"href":"#id-39","referenceIndex":41,"text":"1992","element":"a"},{"text":"; ","element":"span"},{"href":"#id-40","referenceIndex":47,"text":"Nesterov","element":"a"},{"text":", ","element":"span"},{"href":"#id-40","referenceIndex":47,"text":"2012","element":"a"},{"text":"; ","element":"span"},{"href":"#id-41","referenceIndex":67,"text":"Wright","element":"a"},{"text":", ","element":"span"},{"href":"#id-41","referenceIndex":67,"text":"2015","element":"a"},{"text":") is a classic algorithm that successively solves a small-dimensional subproblem along a component of the vector variable while holding others fixed. 
Since each subproblem can be more easily solved than the original problem, this strategy leads to efficient variable updates.","element":"span"}],[{"text":"On manifolds, designing CD updates is inherently difficult (","element":"span"},{"href":"#id-42","referenceIndex":19,"text":"Gutman & Ho-Nguyen","element":"a"},{"text":", ","element":"span"},{"href":"#id-42","referenceIndex":19,"text":"2023","element":"a"},{"text":"). A few works have proposed manifold-specific CD updates, mainly for the orthogonal (","element":"span"},{"href":"#id-43","referenceIndex":55,"text":"Shalit & Chechik","element":"a"},{"text":", ","element":"span"},{"href":"#id-43","referenceIndex":55,"text":"2014","element":"a"},{"text":"; ","element":"span"},{"href":"#id-44","referenceIndex":35,"text":"Jiang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-44","referenceIndex":35,"text":"2022","element":"a"},{"text":"; ","element":"span"},{"href":"#id-45","referenceIndex":42,"text":"Massart & Abrol","element":"a"},{"text":", ","element":"span"},{"href":"#id-45","referenceIndex":42,"text":"2022","element":"a"},{"text":") and Stiefel (","element":"span"},{"href":"#id-42","referenceIndex":19,"text":"Gutman & Ho-Nguyen","element":"a"},{"text":", ","element":"span"},{"href":"#id-42","referenceIndex":19,"text":"2023","element":"a"},{"text":") manifolds. ","element":"span"},{"text":"Although ","element":"span"},{"href":"#id-42","referenceIndex":19,"text":"Gutman & Ho-Nguyen ","element":"a"},{"text":"(","element":"span"},{"href":"#id-42","referenceIndex":19,"text":"2023","element":"a"},{"text":") discuss a general framework for developing CD methods on manifolds, concrete developments have been shown only for the Stiefel manifold. 
Recently, for a class of optimization objectives, ","element":"span"},{"href":"#id-46","referenceIndex":12,"text":"Darmwal & Rajawat ","element":"a"},{"text":"(","element":"span"},{"href":"#id-46","referenceIndex":12,"text":"2023","element":"a"},{"text":") have proposed CD updates on the symmetric positive definite manifold with the affine-invariant metric.","element":"span"}],[{"text":"In this work, we provide a general approach for developing CD algorithms on matrix manifolds. We summarize our contributions below.","element":"span"}],[{"text":"• We introduce a framework for designing CD algorithms on manifolds. In particular, we find a basis spanning the tangent space such that a chosen retraction along the direction of such a basis admits an efficient computation. We discuss a simple expression for the coordinate derivative. Finally, we provide optimization ingredients for various matrix manifolds of interest.","element":"span"}],[{"text":"• A nonlinear objective ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"in (","element":"span"},{"href":"#id-6","text":"1","element":"a"},{"text":") requires computing the gradient at every CD update. Using a first-order approximation of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":", we develop a more efficient CD algorithm that requires a gradient computation only once in every fixed number of CD updates. We analyze the convergence and complexity of the two algorithms with randomized and cyclic selection of coordinates.","element":"span"}],[{"text":"• We show the benefits of the proposed CD algorithms on the orthogonal Procrustes, PCA, orthogonal deep network distillation, nearest matrix, and learning hyperbolic embeddings problems.","element":"span"}]]},{"heading":"2. Preliminaries","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"Riemannian manifolds and optimization. 
","element":"span"},{"text":"For a Riemannian manifold ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M","element":"span"},{"text":", denote its tangent space at ","element":"span"},{"style":{"height":11.6},"width":152.31,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-0.png","element":"img","alt":" X ∈ M","inline":true,"padRight":true},{"text":"as ","element":"span"},{"style":{"height":13.19},"width":101.58,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-1.png","element":"img","alt":" TXM","inline":true},{"text":". A Riemannian metric is an inner product structure ","element":"span"},{"style":{"height":16},"width":679.18,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-2.png","element":"img","alt":" gX(·, ·) = ⟨·, ·⟩X : TXM × TXM → R","inline":true,"padRight":true},{"text":"that varies smoothly with the base point ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"text":". In this work, we particularly focus on matrix manifolds, i.e., where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X ","element":"span"},{"text":"can be represented in the ambient vector space ","element":"span"},{"style":{"height":11.78},"width":101.96,"height":29.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-3.png","element":"img","alt":" Rm×n","inline":true},{"text":". 
The orthogonal projection ","element":"span"},{"style":{"height":15.68},"width":417.54,"height":39.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-4.png","element":"img","alt":" ProjX : Rm×n → TXM","inline":true,"padRight":true},{"text":"projects arbitrary ambient vectors to the tangent space ","element":"span"},{"style":{"height":13.19},"width":101.58,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-5.png","element":"img","alt":" TXM","inline":true,"padRight":true},{"text":"with respect to the Riemannian metric. ","element":"span"},{"text":"For a differentiable function ","element":"span"},{"style":{"height":14},"width":275.59,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-6.png","element":"img","alt":" f : M → R","inline":true},{"text":", the Riemannian gradient at ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X ","element":"span"},{"text":"is defined as the tangent vector ","element":"span"},{"style":{"height":16},"width":339.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-7.png","element":"img","alt":" gradf(X) ∈ TXM","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":16},"width":768.79,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-8.png","element":"img","alt":" ⟨U, gradf(X)⟩X = Df(X)[U], ∀U ∈ TXM","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":16},"width":446.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-9.png","element":"img","alt":" Df(X)[U] = ⟨∇f(X), U⟩","inline":true},{"text":". 
A retraction ","element":"span"},{"style":{"height":13.19},"width":137.4,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-10.png","element":"img","alt":" RetrX :","inline":true},{"style":{"height":13.19},"width":246.56,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-11.png","element":"img","alt":"TXM −→ M","inline":true,"padRight":true},{"text":"allows points to move along the manifold, which satisfies the conditions: ","element":"span"},{"style":{"height":16},"width":258.47,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-12.png","element":"img","alt":" RetrX(0) = X","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":338.19,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-13.png","element":"img","alt":"DRetrX(0)[U] = U.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Related works. ","element":"span"},{"text":"We provide a detailed review of the existing coordinate descent (CD) algorithms on specific manifolds, along with other related works in Appendix ","element":"span"},{"text":"B","element":"span"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Notations. 
","element":"span"},{"text":"We use ","element":"span"},{"style":{"height":16},"width":71.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-14.png","element":"img","alt":" ⟨·, ·⟩","inline":true,"padRight":true},{"text":"without the subscript to represent the Euclidean inner product while we use ","element":"span"},{"style":{"height":16},"width":97.85,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-15.png","element":"img","alt":" ⟨·, ·⟩X","inline":true,"padRight":true},{"text":"to denote the Riemannian inner product on ","element":"span"},{"style":{"height":13.19},"width":101.58,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-16.png","element":"img","alt":" TXM","inline":true},{"text":". The specific expression for ","element":"span"},{"style":{"height":16},"width":97.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-17.png","element":"img","alt":" ⟨·, ·⟩X","inline":true,"padRight":true},{"text":"depends on both ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"text":". ","element":"span"},{"text":"Sym(","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":") ","element":"span"},{"text":"and ","element":"span"},{"text":"Skew(","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":") ","element":"span"},{"text":"denote the sets of ","element":"span"},{"style":{"height":8},"width":101.32,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-18.png","element":"img","alt":" n × n","inline":true,"padRight":true},{"text":"symmetric and skew-symmetric matrices, respectively. 
","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":16},"width":941.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-19.png","element":"img","alt":"sym(A) := (A + A⊤)/2, skew(A) := (A − A⊤)/2, exp(·)","inline":true,"padRight":true},{"text":"be the elementwise exponential, and ","element":"span"},{"style":{"height":16},"width":136.66,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-20.png","element":"img","alt":" expm(·)","inline":true,"padRight":true},{"text":"be the matrix exponential. We also use ","element":"span"},{"style":{"height":9.19},"width":29.56,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-21.png","element":"img","alt":" ei","inline":true,"padRight":true},{"text":"to represent the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"-th basis vector with the dimension to be determined from the context. ","element":"span"},{"style":{"height":16.79},"width":76.3,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-22.png","element":"img","alt":" [A]ij","inline":true,"padRight":true},{"text":"denotes the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i, j","element":"span"},{"text":"-th entry of a matrix ","element":"span"},{"style":{"height":16.39},"width":192.12,"height":40.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-23.png","element":"img","alt":" A while Aij","inline":true,"padRight":true},{"text":"represents a matrix with index ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i, j","element":"span"},{"text":". 
We use ","element":"span"},{"style":{"height":13.19},"width":37.52,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-24.png","element":"img","alt":" In","inline":true,"padRight":true},{"text":"to denote the ","element":"span"},{"style":{"height":8},"width":101.2,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-25.png","element":"img","alt":" n × n","inline":true,"padRight":true},{"text":"identity matrix, ","element":"span"},{"style":{"height":13.19},"width":39.92,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-26.png","element":"img","alt":" 1n","inline":true,"padRight":true},{"text":"to denote the size-","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"vector of all 1s, and define ","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":"] := ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":", ..., n","element":"span"},{"style":{"fontStyle":"italic"},"text":"}","element":"span"},{"text":".","element":"span"}]]},{"heading":"3. 
Proposed CD Framework","paragraphs":[[{"text":"As shown in (","element":"span"},{"href":"#id-43","referenceIndex":55,"text":"Shalit & Chechik","element":"a"},{"text":", ","element":"span"},{"href":"#id-43","referenceIndex":55,"text":"2014","element":"a"},{"text":"; ","element":"span"},{"href":"#id-42","referenceIndex":19,"text":"Gutman & Ho-","element":"a"},{"href":"#id-42","referenceIndex":19,"text":"Nguyen","element":"a"},{"text":", ","element":"span"},{"href":"#id-42","referenceIndex":19,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-45","referenceIndex":42,"text":"Massart & Abrol","element":"a"},{"text":", ","element":"span"},{"href":"#id-45","referenceIndex":42,"text":"2022","element":"a"},{"text":"; ","element":"span"},{"href":"#id-44","referenceIndex":35,"text":"Jiang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-44","referenceIndex":35,"text":"2022","element":"a"},{"text":"; ","element":"span"},{"href":"#id-46","referenceIndex":12,"text":"Darmwal & Rajawat","element":"a"},{"text":", ","element":"span"},{"href":"#id-46","referenceIndex":12,"text":"2023","element":"a"},{"text":"), for specific manifolds, the key in developing CD algorithms is the choice of the basis vectors ","element":"span"},{"style":{"height":14},"width":169.72,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-27.png","element":"img","alt":" Bℓ (ℓ ∈ I","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"I ","element":"span"},{"text":"denotes the index set) spanning the tangent space that allow efficient retraction. In general, our chosen basis need not be orthonormal with respect to the Riemannian metric. 
Once the basis and retraction are chosen, the CD update is given by ","element":"span"},{"style":{"height":16},"width":257.53,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-28.png","element":"img","alt":" RetrX(−ηθBℓ)","inline":true},{"text":", where ","element":"span"},{"style":{"height":14.4},"width":94.35,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-29.png","element":"img","alt":"η > 0","inline":true,"padRight":true},{"text":"is the stepsize and ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-30.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"is the coordinate derivative, i.e.,","element":"span"}],[{"style":{"width":"95%"},"width":898,"height":84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-31.png","element":"img"}],[{"text":"It can be verified that ","element":"span"},{"style":{"height":13.19},"width":95.04,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-32.png","element":"img","alt":" −θBℓ","inline":true,"padRight":true},{"text":"is indeed a descent direction, i.e., ","element":"span"},{"style":{"height":19.37},"width":865.28,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-33.png","element":"img","alt":" ⟨gradf(X), −θBℓ⟩X = −θ ddθf(RetrX(θBℓ))|θ=0 =","inline":true},{"style":{"height":15.78},"width":141.82,"height":39.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-34.png","element":"img","alt":"−θ2 ≤ 0","inline":true},{"text":". 
The CD algorithm then involves iteratively selecting coordinate index ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-35.png","element":"img","alt":" ℓ","inline":true},{"text":", computing ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-36.png","element":"img","alt":" θ","inline":true},{"text":", and updating in the coordinate descent direction ","element":"span"},{"style":{"height":16},"width":267.03,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-37.png","element":"img","alt":" RetrX(−ηθBℓ).","inline":true}],[{"text":"The main challenges in developing CD algorithms on matrix manifolds are: 1) characterization of ","element":"span"},{"style":{"height":13.19},"width":44.23,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-38.png","element":"img","alt":" Bℓ","inline":true,"padRight":true},{"text":"which facilitates efficient computation, 2) efficient computation of ","element":"span"},{"style":{"height":14.4},"width":140.67,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-39.png","element":"img","alt":" θ, and 3)","inline":true,"padRight":true},{"text":"easy generalization to different manifolds. 
We propose to leverage the following connection between the Riemannian and Euclidean gradients:","element":"span"}],[{"id":"id-47","style":{"width":"84%"},"width":792,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-40.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":124.65,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-41.png","element":"img","alt":" ∇f(X)","inline":true,"padRight":true},{"text":"is the Euclidean gradient and the last equality follows from the definition of the Riemannian gradient. We exploit (","element":"span"},{"href":"#id-47","text":"3","element":"a"},{"text":") to efficiently compute ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/1-42.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"for several manifolds as it is independent of the Riemannian gradient and metric. In the subsequent sections, we develop concrete CD optimization ingredients for the manifolds of interest under the proposed approach. These are summarized in Table ","element":"span"},{"href":"#id-48","text":"1","element":"a"},{"text":".","element":"span"}],[{"id":"id-48","text":"Table 1: Summary of CD ingredients over various manifolds. 
","element":"figcaption","subtype":"caption"},{"style":{"height":17.35},"width":312.16,"height":43.37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-0.png","element":"img","alt":" Hij = eie⊤j − eje⊤i","inline":true,"padRight":true},{"text":"and ","element":"figcaption","subtype":"caption"},{"style":{"height":15.59},"width":298.34,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-1.png","element":"img","alt":" Eij = eie⊤j + eje⊤i","inline":true,"padRight":true},{"text":"are the basis elements ","element":"figcaption","subtype":"caption"},{"text":"for skew-symmetric and symmetric matrices, respectively. ","element":"figcaption","subtype":"caption"},{"style":{"height":16.79},"width":110.71,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-2.png","element":"img","alt":" Gij(θ)","inline":true,"padRight":true},{"text":"and ","element":"figcaption","subtype":"caption"},{"style":{"height":16.79},"width":109.64,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-3.png","element":"img","alt":" Rij(θ)","inline":true,"padRight":true},{"text":"correspond to the Givens and hyperbolic rotations, respectively. 
","element":"figcaption","subtype":"caption"},{"style":{"height":13.19},"width":299.14,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-4.png","element":"img","alt":" PX := In − XX⊤","inline":true},{"text":", and ","element":"figcaption","subtype":"caption"},{"style":{"height":16},"width":141.14,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-5.png","element":"img","alt":" ∇f(X)k","inline":true,"padRight":true},{"text":"is the ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"k","element":"figcaption","subtype":"caption"},{"text":"-th column of ","element":"figcaption","subtype":"caption"},{"style":{"height":16},"width":124.65,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-6.png","element":"img","alt":" ∇f(X)","inline":true},{"text":". We use ","element":"figcaption","subtype":"caption"},{"style":{"height":18.55},"width":146.13,"height":46.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-7.png","element":"img","alt":" ExpSx(v)","inline":true,"padRight":true},{"text":"to denote the ","element":"figcaption","subtype":"caption"},{"text":"exponential retraction over the sphere. The complexity accounts only for the computation of the coordinate derivative and the coordinate update, and excludes the cost of the first-order oracle ","element":"figcaption","subtype":"caption"},{"style":{"height":16},"width":134.15,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-8.png","element":"img","alt":" ∇f(X).","inline":true}],[{"style":{"width":"99%"},"width":1935,"height":781,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-9.png","element":"img"}],[{"id":"id-52","style":{"fontWeight":"bold"},"text":"3.1. 
CD on Stiefel manifold","element":"span"}],[{"text":"The Stiefel manifold ","element":"span"},{"text":"St(","element":"span"},{"style":{"fontStyle":"italic"},"text":"n, p","element":"span"},{"text":") ","element":"span"},{"text":"is the set of column orthonormal matrices of size ","element":"span"},{"style":{"height":11.78},"width":90.39,"height":29.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-10.png","element":"img","alt":" Rn×p","inline":true},{"text":", i.e., ","element":"span"},{"style":{"height":16},"width":418.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-11.png","element":"img","alt":" St(n, p) := {X ∈ Rn×p :","inline":true},{"style":{"height":16.79},"width":727.88,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-12.png","element":"img","alt":"X⊤X = Ip}. When p = n, St(n, n) ≡ O(n)","inline":true},{"text":", the orthogonal manifold. The tangent space of Stiefel manifold is identified as ","element":"span"},{"style":{"height":16},"width":795.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-13.png","element":"img","alt":" TXSt(n, p) = {U ∈ Rn×p : X⊤U + U ⊤X = 0}","inline":true},{"text":". The Riemannian metric is defined as ","element":"span"},{"style":{"height":16},"width":317.98,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-14.png","element":"img","alt":" ⟨U, V ⟩X := ⟨U, V ⟩","inline":true,"padRight":true},{"text":"for any ","element":"span"},{"style":{"height":14},"width":227.29,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-15.png","element":"img","alt":" U, V ∈ TXM","inline":true},{"text":". 
The Riemannian gradient is derived as ","element":"span"},{"style":{"height":16},"width":728.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-16.png","element":"img","alt":" gradf(X) = ∇f(X) − Xsym(X⊤∇f(X)).","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Choice of basis. ","element":"span"},{"text":"Taking inspiration from ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":")","element":"span"},{"text":", we adopt the ","element":"span"},{"style":{"height":10.8},"width":63.78,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-17.png","element":"img","alt":" ΩX","inline":true,"padRight":true},{"text":"parameterization of the tangent vectors (where ","element":"span"},{"style":{"height":16},"width":235.41,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-18.png","element":"img","alt":"Ω ∈ Skew(n)","inline":true},{"text":") and choose the basis as ","element":"span"},{"style":{"height":15.59},"width":208.27,"height":38.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-19.png","element":"img","alt":" Bℓ = HijX","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":18.55},"width":943.22,"height":46.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-20.png","element":"img","alt":"ℓ ∈ I = {(i, j) : 1 ≤ i < j ≤ n} and Hij := eie⊤j − eje⊤i .","inline":true,"padRight":true},{"text":"In contrast to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":")","element":"span"},{"text":", the chosen basis is not orthonormal for ","element":"span"},{"text":"St(","element":"span"},{"style":{"fontStyle":"italic"},"text":"n, 
p","element":"span"},{"text":")","element":"span"},{"text":". ","element":"span"},{"text":"This is expected as the manifold ","element":"span"},{"text":"St(","element":"span"},{"style":{"fontStyle":"italic"},"text":"n, p","element":"span"},{"text":") ","element":"span"},{"text":"has a dimension ","element":"span"},{"style":{"height":21.63},"width":206.56,"height":54.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-21.png","element":"img","alt":" np − p(p−1)2","inline":true,"padRight":true},{"text":"while we adopt an overparameterization of the tangent space using ","element":"span"},{"style":{"height":16},"width":194.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-22.png","element":"img","alt":" n(n − 1)/2","inline":true,"padRight":true},{"text":"basis vectors.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Retraction. ","element":"span"},{"text":"For the purpose of CD update, we first note that ","element":"span"},{"style":{"height":16},"width":441.78,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-23.png","element":"img","alt":"RetrX(tU) = expm(tΩ)X","inline":true,"padRight":true},{"text":"is a valid retraction on ","element":"span"},{"text":"St(","element":"span"},{"style":{"fontStyle":"italic"},"text":"n, p","element":"span"},{"text":") ","element":"span"},{"text":"because: 1) ","element":"span"},{"style":{"height":16},"width":274.17,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-24.png","element":"img","alt":" RetrX(0) = X","inline":true,"padRight":true},{"text":"and 2) ","element":"span"},{"style":{"height":16},"width":298.9,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-25.png","element":"img","alt":" DRetrX(0)[U] 
=","inline":true},{"style":{"height":10.8},"width":149.06,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-26.png","element":"img","alt":"ΩX = U","inline":true,"padRight":true},{"text":"are satisfied (","element":"span"},{"href":"#id-49","referenceIndex":57,"text":"Siegel","element":"a"},{"text":", ","element":"span"},{"href":"#id-49","referenceIndex":57,"text":"2020","element":"a"},{"text":").","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"CD update. ","element":"span"},{"text":"Based on the above choices of the basis vectors and retraction, ","element":"span"},{"text":"the proposed CD update is ","element":"span"},{"style":{"height":16.79},"width":553.38,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-27.png","element":"img","alt":" RetrX(−ηθBℓ) = Gij(−ηθ)X","inline":true},{"text":", where ","element":"span"},{"style":{"height":10.8},"width":84.78,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-28.png","element":"img","alt":" θ =","inline":true},{"style":{"height":16.79},"width":794.56,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-29.png","element":"img","alt":"⟨∇f(X), Bℓ⟩ = [∇f(X)X⊤ − X∇f(X)⊤]ij","inline":true},{"text":". 
","element":"span"},{"text":"Here, ","element":"span"},{"style":{"height":18.55},"width":936.01,"height":46.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-30.png","element":"img","alt":"Gij(θ) = In +(cos(θ)−1)(eie⊤i +eje⊤j )+sin(θ)(eie⊤j −","inline":true},{"style":{"height":16.79},"width":96.8,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-31.png","element":"img","alt":"eje⊤i )","inline":true,"padRight":true},{"text":"is known as the Givens rotation around axes ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i, j ","element":"span"},{"text":"with ","element":"span"},{"text":"angle ","element":"span"},{"style":{"height":10.8},"width":50,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-32.png","element":"img","alt":" −θ","inline":true},{"text":". Overall, each CD update only requires ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":") ","element":"span"},{"text":"as we modify only two rows of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Remark ","element":"span"},{"text":"3.1","element":"span"},{"style":{"fontStyle":"italic"},"text":". ","element":"span"},{"href":"#id-42","referenceIndex":19,"text":"Gutman & Ho-Nguyen ","element":"a"},{"text":"(","element":"span"},{"href":"#id-42","referenceIndex":19,"text":"2023","element":"a"},{"text":") propose a column-wise CD update on the Stiefel manifold which costs ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"np","element":"span"},{"text":") ","element":"span"},{"text":"per iteration. 
On the other hand, our proposed CD update is row-wise and costs ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":")","element":"span"},{"text":", which is cheaper. Furthermore, the CD update strategy of (","element":"span"},{"href":"#id-42","referenceIndex":19,"text":"Gutman & Ho-Nguyen","element":"a"},{"text":", ","element":"span"},{"href":"#id-42","referenceIndex":19,"text":"2023","element":"a"},{"text":") cannot be applied to the sphere manifold: when ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"= 1","element":"span"},{"text":", it reduces to the full gradient update on the sphere. This, however, is not an issue for our update. Finally, the update of ","element":"span"},{"href":"#id-42","referenceIndex":19,"text":"Gutman & Ho-Nguyen ","element":"a"},{"text":"(","element":"span"},{"href":"#id-42","referenceIndex":19,"text":"2023","element":"a"},{"text":") is not invariant to the right action of the orthogonal group and hence does not yield a valid CD strategy for the Grassmann manifold. In contrast, as shown in the next section, our strategy can be readily generalized to the Grassmann manifold.","element":"span"}],[{"id":"id-71","style":{"fontWeight":"bold"},"text":"3.2. 
CD on Grassmann manifold","element":"span"}],[{"text":"The Grassmann manifold ","element":"span"},{"text":"Gr(","element":"span"},{"style":{"fontStyle":"italic"},"text":"n, p","element":"span"},{"text":") ","element":"span"},{"text":"represents the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":"-dimensional subspaces in ","element":"span"},{"style":{"height":10.8},"width":48.78,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-33.png","element":"img","alt":" Rn","inline":true},{"text":", which can be represented by an ","element":"span"},{"style":{"height":11.2},"width":101.75,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-34.png","element":"img","alt":" n × p","inline":true,"padRight":true},{"text":"orthonormal matrix ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"text":", i.e., ","element":"span"},{"style":{"height":16},"width":233.47,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-35.png","element":"img","alt":" X ∈ St(n, p)","inline":true},{"text":", where the columns span the subspace. The representation is not unique, with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"XQ ","element":"span"},{"text":"representing the same subspace for arbitrary ","element":"span"},{"style":{"height":16},"width":164.66,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-36.png","element":"img","alt":" Q ∈ O(p)","inline":true},{"text":". 
Thus, the Grassmann manifold can be identified as ","element":"span"},{"style":{"height":16.79},"width":721.54,"height":41.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-37.png","element":"img","alt":" Gr(n, p) = {[X] : X ∈ Rn×p, X⊤X = Ip}","inline":true},{"text":", where ","element":"span"},{"style":{"height":16},"width":481.81,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-38.png","element":"img","alt":" [X] := {XQ : Q ∈ O(p)}","inline":true},{"text":". The tangent space can be uniquely characterized by the horizontal space at ","element":"span"},{"style":{"height":16},"width":184.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-39.png","element":"img","alt":"TXSt(n, p)","inline":true},{"text":", i.e., ","element":"span"},{"style":{"height":17.68},"width":640.66,"height":44.19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-40.png","element":"img","alt":" T[X]Gr(n, p) = {[U] : X⊤U = 0}","inline":true},{"text":". For a given ","element":"span"},{"style":{"height":17.68},"width":297.57,"height":44.19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-41.png","element":"img","alt":" ξ ∈ T[X]Gr(n, p)","inline":true},{"text":", its unique horizontal lift is ","element":"span"},{"style":{"height":16},"width":222.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/2-42.png","element":"img","alt":" U = liftX(ξ)","inline":true},{"text":", where ","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"text":"] ","element":"span"},{"text":"is represented as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"text":". 
The lift operator satisfies ","element":"span"},{"style":{"height":16.79},"width":391.8,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-0.png","element":"img","alt":" liftXQ(ξ) = liftX(ξ)Q","inline":true},{"text":". On ","element":"span"},{"text":"Gr(","element":"span"},{"style":{"fontStyle":"italic"},"text":"n, p","element":"span"},{"text":")","element":"span"},{"text":", the Riemannian metric is pushed forward by the Euclidean metric on ","element":"span"},{"text":"St(","element":"span"},{"style":{"fontStyle":"italic"},"text":"n, p","element":"span"},{"text":") ","element":"span"},{"text":"as ","element":"span"},{"style":{"height":17.68},"width":508.72,"height":44.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-1.png","element":"img","alt":" ⟨ξ, ζ⟩[X] = ⟨liftX(ξ), liftX(ζ)⟩","inline":true,"padRight":true},{"text":"and the corresponding Riemannian gradient ","element":"span"},{"text":"grad","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"([","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"text":"]) ","element":"span"},{"text":"can be represented by ","element":"span"},{"style":{"height":16},"width":684.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-2.png","element":"img","alt":" liftX(gradf([X]) = (In − XX⊤)∇f(X)","inline":true},{"text":". 
Retractions such as the QR retraction for ","element":"span"},{"text":"St(","element":"span"},{"style":{"fontStyle":"italic"},"text":"n, p","element":"span"},{"text":") ","element":"span"},{"text":"also work for ","element":"span"},{"text":"Gr(","element":"span"},{"style":{"fontStyle":"italic"},"text":"n, p","element":"span"},{"text":") ","element":"span"},{"text":"as long as they preserve the equivalence class, i.e., ","element":"span"},{"style":{"height":16.79},"width":739.09,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-3.png","element":"img","alt":" [RetrXQ(t liftX(ξ)Q)] = [RetrX(t liftX(ξ))]","inline":true,"padRight":true},{"text":"for any ","element":"span"},{"style":{"height":16},"width":167.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-4.png","element":"img","alt":"Q ∈ O(p)","inline":true},{"text":". Below, we show that the proposed CD update for ","element":"span"},{"text":"St(","element":"span"},{"style":{"fontStyle":"italic"},"text":"n, p","element":"span"},{"text":") ","element":"span"},{"text":"is also well-defined for ","element":"span"},{"text":"Gr(","element":"span"},{"style":{"fontStyle":"italic"},"text":"n, p","element":"span"},{"text":")","element":"span"},{"text":".","element":"span"}],[{"id":"id-93","style":{"fontWeight":"bold"},"text":"Proposition 3.2. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Consider a function ","element":"span"},{"style":{"height":16},"width":282.89,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-5.png","element":"img","alt":" f : Gr(n, p) →","inline":true,"padRight":true},{"text":"R","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let the coordinate descent update at ","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"text":"] ","element":"span"},{"style":{"fontStyle":"italic"},"text":"be given by ","element":"span"},{"style":{"height":16.79},"width":609.43,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-6.png","element":"img","alt":" RetrX(−η θHijX) := Gij(−ηθ)X","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for ","element":"span"},{"style":{"height":13.2},"width":181.4,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-7.png","element":"img","alt":" 1 ≤ i <","inline":true},{"style":{"height":13.6},"width":112.27,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-8.png","element":"img","alt":"j ≤ n","inline":true},{"style":{"fontStyle":"italic"},"text":", where ","element":"span"},{"style":{"height":16.79},"width":360.08,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-9.png","element":"img","alt":" θ = ⟨∇f(X), HijX⟩","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and for some fixed stepsize ","element":"span"},{"style":{"height":14.4},"width":133.08,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-10.png","element":"img","alt":" η > 0","inline":true},{"style":{"fontStyle":"italic"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Then, ","element":"span"},{"style":{"height":16.79},"width":487.78,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-11.png","element":"img","alt":" RetrXQ(−η θXQHijXQ) =","inline":true},{"style":{"height":16.79},"width":350.31,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-12.png","element":"img","alt":"RetrX(−ηθHijX)Q.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"3.3. CD on hyperbolic manifold","element":"span"}],[{"text":"We now consider the generalized hyperbolic space (","element":"span"},{"href":"#id-50","referenceIndex":5,"text":"Bai ","element":"a"},{"href":"#id-50","referenceIndex":5,"text":"& Li","element":"a"},{"text":", ","element":"span"},{"href":"#id-50","referenceIndex":5,"text":"2014","element":"a"},{"text":"; ","element":"span"},{"href":"#id-51","referenceIndex":70,"text":"Xiao et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-51","referenceIndex":70,"text":"2023","element":"a"},{"text":") ","element":"span"},{"style":{"height":16},"width":431.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-13.png","element":"img","alt":" H(n, p) := {X ∈ Rn×p :","inline":true},{"style":{"height":16.79},"width":265.18,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-14.png","element":"img","alt":"−X⊤JX = Ip}","inline":true},{"text":", where ","element":"span"},{"style":{"height":16},"width":502.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-15.png","element":"img","alt":" J = diag(−1, 1, ..., 1) ∈ Rn×n","inline":true,"padRight":true},{"text":"is the metric tensor. When ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"= 1","element":"span"},{"text":", this reduces to the well-known hyperbolic space (the hyperboloid model). 
The tangent space at ","element":"span"},{"style":{"height":16},"width":212.07,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-16.png","element":"img","alt":" X ∈ H(n, p)","inline":true,"padRight":true},{"text":"is identified as","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"style":{"height":16},"width":875.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-17.png","element":"img","alt":"XH(n, p) = {U ∈ Rn×p : U ⊤JX + X⊤JU = 0}","inline":true},{"style":{"height":16},"width":496.05,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-18.png","element":"img","alt":"= {WJX : W ∈ Skew(n)}.","inline":true}],[{"text":"The Riemannian metric on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"n, p","element":"span"},{"text":") ","element":"span"},{"text":"is the generalized Lorentz inner product as ","element":"span"},{"style":{"height":16},"width":372.65,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-19.png","element":"img","alt":" ⟨U, V ⟩L := tr(U ⊤JV )","inline":true},{"text":". The normal space is given by ","element":"span"},{"style":{"height":16},"width":576.9,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-20.png","element":"img","alt":" NXH(d, r) = {XS : S ∈ Sym(p)}","inline":true},{"text":". The orthogonal projection to ","element":"span"},{"style":{"height":16},"width":180.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-21.png","element":"img","alt":" TXH(n, p)","inline":true,"padRight":true},{"text":"and the Riemannian gradient are derived below.","element":"span"}],[{"id":"id-94","style":{"fontWeight":"bold"},"text":"Proposition 3.3. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"The orthogonal projection of ","element":"span"},{"style":{"height":12.58},"width":209.09,"height":31.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-22.png","element":"img","alt":" A ∈ Rn×p to","inline":true},{"style":{"height":16},"width":180.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-23.png","element":"img","alt":"TXH(n, p)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is given by ","element":"span"},{"style":{"height":16},"width":558.23,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-24.png","element":"img","alt":" ProjX(A) = A + Xsym(X⊤JA)","inline":true},{"style":{"fontStyle":"italic"},"text":". The Riemannian gradient is ","element":"span"},{"style":{"height":16},"width":449.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-25.png","element":"img","alt":" gradf(X) = J∇f(X) +","inline":true},{"style":{"height":16},"width":334.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-26.png","element":"img","alt":"Xsym(X⊤∇f(X)).","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Choice of basis. ","element":"span"},{"text":"For generalized hyperbolic manifold, we consider the basis ","element":"span"},{"style":{"height":18.55},"width":642.42,"height":46.37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-27.png","element":"img","alt":" Bℓ = HijJX = (eie⊤j − eje⊤i )JX, for","inline":true},{"style":{"height":14},"width":245.67,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-28.png","element":"img","alt":"1 ≤ i < j ≤ n.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Retraction. 
","element":"span"},{"text":"Taking inspiration from our Stiefel analysis in Section ","element":"span"},{"href":"#id-52","text":"3.1","element":"a"},{"text":", we define the map ","element":"span"},{"style":{"height":16},"width":253.57,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-29.png","element":"img","alt":" RetrX(tU) :=","inline":true,"padRight":true},{"text":"expm(","element":"span"},{"style":{"fontStyle":"italic"},"text":"tWJ","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":"X ","element":"span"},{"text":"for ","element":"span"},{"style":{"height":16},"width":447.02,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-30.png","element":"img","alt":" U = WJX ∈ TXH(n, p)","inline":true},{"text":". We next show such a map defines a valid retraction. As shown below, the retraction expression considerably simplifies along the chosen basis ","element":"span"},{"style":{"height":15.59},"width":133.26,"height":38.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-31.png","element":"img","alt":" HijJX.","inline":true}],[{"id":"id-95","style":{"fontWeight":"bold"},"text":"Proposition 3.4. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Given a tangent vector ","element":"span"},{"style":{"height":11.6},"width":246.28,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-32.png","element":"img","alt":" U = WJX ∈","inline":true},{"style":{"height":16},"width":180.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-33.png","element":"img","alt":"TXH(n, p)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for some skew-symmetric matrix ","element":"span"},{"style":{"fontStyle":"italic"},"text":"W","element":"span"},{"style":{"fontStyle":"italic"},"text":", then ","element":"span"},{"style":{"height":16},"width":490.51,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-34.png","element":"img","alt":"RetrX(tU) := expm(tWJ)X","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is a retraction.","element":"span"}],[{"text":"In fact, ","element":"span"},{"text":"expm(","element":"span"},{"style":{"fontStyle":"italic"},"text":"tWJ","element":"span"},{"text":") ","element":"span"},{"text":"is a Lorentz transform that satisfies ","element":"span"},{"style":{"height":16},"width":573.09,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-35.png","element":"img","alt":"expm(tWJ)⊤Jexpm(tWJ) = J","inline":true},{"text":", which preserves the Lorentz inner product as ","element":"span"},{"style":{"height":16.79},"width":517.84,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-36.png","element":"img","alt":" (LX)⊤JLX = X⊤JX = −Ip","inline":true},{"text":". 
Hence by following the direction ","element":"span"},{"style":{"height":15.59},"width":226.63,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-37.png","element":"img","alt":" U = θHijJX","inline":true},{"text":", we define coordinate-type updates on the (generalized) hyperbolic manifold as ","element":"span"},{"style":{"height":16.79},"width":267.02,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-38.png","element":"img","alt":" expm(θHijJ)X","inline":true},{"text":", which can be computed efficiently, similarly to the Givens rotation. In particular, when ","element":"span"},{"style":{"height":16.39},"width":376.24,"height":40.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-39.png","element":"img","alt":"i, j ̸= 1, HijJ = Hij","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16.79},"width":410.96,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-40.png","element":"img","alt":" expm(θHijJ) = Gij(θ)","inline":true,"padRight":true},{"text":"exactly recovers the Givens rotation. When ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"text":", we have ","element":"span"},{"style":{"height":17.35},"width":495.78,"height":43.37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-41.png","element":"img","alt":"HijJ = Eij := eie⊤j + eje⊤i","inline":true,"padRight":true},{"text":". We show in the following lemma that this also leads to a rotation known as the hyperbolic rotation.","element":"span"}],[{"id":"id-96","style":{"fontWeight":"bold"},"text":"Lemma 3.5. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For ","element":"span"},{"style":{"height":15.59},"width":243.48,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-42.png","element":"img","alt":" U = θHijJX","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"with ","element":"span"},{"style":{"height":14},"width":286.28,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-43.png","element":"img","alt":" 1 ≤ i < j ≤ n","inline":true},{"style":{"fontStyle":"italic"},"text":",","element":"span"}],[{"style":{"width":"100%"},"width":939,"height":160,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-44.png","element":"img"}],[{"text":"When ","element":"span"},{"style":{"height":16.79},"width":235.42,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-45.png","element":"img","alt":" d = 4, Rij(θ)","inline":true,"padRight":true},{"text":"is known as the Lorentz boost with rapidity ","element":"span"},{"style":{"height":10.8},"width":50,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-46.png","element":"img","alt":" −θ","inline":true,"padRight":true},{"text":"and can be thought of as a rotation in the time domain. Hence, while Givens rotation based CD updates have been explored for the orthogonal and Stiefel manifolds (","element":"span"},{"href":"#id-43","referenceIndex":55,"text":"Shalit & Chechik","element":"a"},{"text":", ","element":"span"},{"href":"#id-43","referenceIndex":55,"text":"2014","element":"a"},{"text":"; ","element":"span"},{"href":"#id-42","referenceIndex":19,"text":"Gutman & Ho-Nguyen","element":"a"},{"text":", ","element":"span"},{"href":"#id-42","referenceIndex":19,"text":"2023","element":"a"},{"text":"), our approach generalizes the Givens rotation based CD updates to hyperbolic spaces.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"CD update. 
","element":"span"},{"text":"Similar to the Stiefel case, the proposed CD update is ","element":"span"},{"style":{"height":16.79},"width":335.24,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-47.png","element":"img","alt":" RetrX(−ηθHijJX)","inline":true},{"text":", where ","element":"span"},{"style":{"height":10.8},"width":80.95,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-48.png","element":"img","alt":" θ =","inline":true},{"style":{"height":16.79},"width":891.02,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-49.png","element":"img","alt":"⟨∇f(X), HijJX⟩ = [∇f(X)X⊤J − JX∇f(X)⊤]ij.","inline":true,"padRight":true},{"text":"In Appendix ","element":"span"},{"href":"#id-53","text":"E.3","element":"a"},{"text":", we additionally derive a canonical-type metric and a Cayley retraction.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.4. CD on symplectic manifold","element":"span"}],[{"text":"The symplectic manifold (","element":"span"},{"href":"#id-54","referenceIndex":16,"text":"Gao et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-54","referenceIndex":16,"text":"2021a","element":"a"},{"text":";","element":"span"},{"href":"#id-55","referenceIndex":17,"text":"b","element":"a"},{"text":"; ","element":"span"},{"href":"#id-56","referenceIndex":18,"text":"2022","element":"a"},{"text":") is defined as ","element":"span"},{"style":{"height":18.17},"width":759.93,"height":45.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-50.png","element":"img","alt":" Sp(n, p) := {X ∈ R2n×2p : X⊤ΩnX = Ωp},","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":13.19},"width":107.28,"height":32.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-51.png","element":"img","alt":" Ωk :=","inline":true}],[{"text":"block matrix. 
The tangent space is given as","element":"span"}],[{"style":{"height":17.38},"width":932.54,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-52.png","element":"img","alt":"TXSp(n, p) = {U ∈ R2n×2p : U ⊤ΩnX + X⊤ΩnU = 0}","inline":true},{"style":{"height":16},"width":466.13,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-53.png","element":"img","alt":"= {SΩnX : S ∈ Sym(2n)}.","inline":true}],[{"text":"Here we consider the Euclidean metric (","element":"span"},{"href":"#id-54","referenceIndex":16,"text":"Gao et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-54","referenceIndex":16,"text":"2021a","element":"a"},{"text":") as ","element":"span"},{"style":{"height":16},"width":360.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-54.png","element":"img","alt":" ⟨U, V ⟩X = tr(U ⊤V )","inline":true,"padRight":true},{"text":"for any ","element":"span"},{"style":{"height":16},"width":379.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-55.png","element":"img","alt":" X ∈ Sp(n, p), U, V ∈","inline":true},{"style":{"height":16},"width":191.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-56.png","element":"img","alt":"TXSp(n, p)","inline":true},{"text":". 
The Riemannian gradient (","element":"span"},{"href":"#id-54","referenceIndex":16,"text":"Gao et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-54","referenceIndex":16,"text":"2021a","element":"a"},{"text":", Proposition 3) is given by ","element":"span"},{"style":{"height":16},"width":443.42,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-57.png","element":"img","alt":" gradf(X) = ∇f(X) −","inline":true},{"style":{"height":13.19},"width":151.25,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-58.png","element":"img","alt":"ΩnXWX","inline":true},{"text":", where ","element":"span"},{"style":{"height":16},"width":304.53,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-59.png","element":"img","alt":" WX ∈ Skew(2p)","inline":true,"padRight":true},{"text":"is the unique solution to the Lyapunov equation ","element":"span"},{"style":{"height":12},"width":401.14,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-60.png","element":"img","alt":" X⊤XW + WX⊤X =","inline":true},{"style":{"height":16},"width":393.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-61.png","element":"img","alt":"2 skew(X⊤Ω⊤n ∇f(X)).","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Choice of basis. 
","element":"span"},{"text":"Similar to the Stiefel and hyperbolic manifolds, we consider the basis ","element":"span"},{"style":{"height":15.59},"width":254.36,"height":38.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-62.png","element":"img","alt":" Bℓ = EijΩnX","inline":true,"padRight":true},{"text":"for the tangent space ","element":"span"},{"style":{"height":16},"width":191.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-63.png","element":"img","alt":" TXSp(n, p)","inline":true},{"text":", where ","element":"span"},{"style":{"height":17.35},"width":329.56,"height":43.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-64.png","element":"img","alt":" Eij = eie⊤j + eje⊤i","inline":true,"padRight":true},{"text":", for ","element":"span"},{"style":{"height":14.4},"width":526.25,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-65.png","element":"img","alt":"1 ≤ i ≤ j ≤ 2n. Here, ei is the i","inline":true},{"text":"-th basis in ","element":"span"},{"style":{"height":13.78},"width":76.36,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-66.png","element":"img","alt":" R2n.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Retraction. ","element":"span"},{"text":"We propose the following retraction for efficient CD updates.","element":"span"}],[{"id":"id-99","style":{"fontWeight":"bold"},"text":"Proposition 3.6. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any ","element":"span"},{"style":{"height":16},"width":532.94,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/3-67.png","element":"img","alt":" X ∈ Sp(n, p) and U = SΩnX ∈","inline":true}],[{"style":{"height":16},"width":191.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-0.png","element":"img","alt":"TXSp(n, p)","inline":true},{"style":{"fontStyle":"italic"},"text":", the map ","element":"span"},{"style":{"height":16},"width":501.39,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-1.png","element":"img","alt":" RetrX(tU) = expm(tSΩn)X","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is a retraction.","element":"span"}],[{"text":"The above retraction further simplifies when moving along the chosen basis direction.","element":"span"}],[{"id":"id-101","style":{"fontWeight":"bold"},"text":"Proposition 3.7. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16.79},"width":514.86,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-2.png","element":"img","alt":" Uij = EijΩnX ∈ TXSp(n, p)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for ","element":"span"},{"style":{"height":14},"width":296.8,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-3.png","element":"img","alt":"1 ≤ i ≤ j ≤ 2n","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Then, we have ","element":"span"},{"style":{"height":16.79},"width":360.11,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-4.png","element":"img","alt":" RetrX(θUij) = X +","inline":true},{"style":{"height":17.75},"width":831.64,"height":44.37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-5.png","element":"img","alt":"(exp (−θ) − 1)eie⊤i X + (exp(θ) − 1)en+ie⊤n+iX,","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= ","element":"span"},{"style":{"height":16.79},"width":772.39,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-6.png","element":"img","alt":"j−n, j > n and RetrX(θUij) = X +θEijΩnX","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"otherwise.","element":"span"}],[{"id":"id-75","style":{"fontStyle":"italic"},"text":"Remark ","element":"span"},{"text":"3.8 (Block coordinate updates)","element":"span"},{"style":{"fontStyle":"italic"},"text":". ","element":"span"},{"text":"We may also consider block coordinate updates. 
Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"E ","element":"span"},{"text":"=","element":"span"}],[{"style":{"height":16},"width":270.34,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-7.png","element":"img","alt":"A, B ∈ Sym(n)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":12.58},"width":183.43,"height":31.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-8.png","element":"img","alt":" C ∈ Rn×n","inline":true},{"text":", and we wish to update ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X ","element":"span"},{"text":"in the direction of ","element":"span"},{"style":{"height":16},"width":464.83,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-9.png","element":"img","alt":" U = EΩnX ∈ TXSp(n, p)","inline":true},{"text":". First, we consider the upper-left and bottom-right blocks, i.e., where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"E ","element":"span"},{"text":"=","element":"span"}],[{"text":"Sym(","element":"span"},{"style":{"height":16},"width":849.34,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-10.png","element":"img","alt":"n). Here, RetrX(θEΩnX) = X +θEΩnX. Second,","inline":true,"padRight":true},{"text":"if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"E ","element":"span"},{"text":"=","element":"span"}],[{"style":{"width":"99%"},"width":935,"height":278,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-11.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"CD update. 
","element":"span"},{"text":"Finally, ","element":"span"},{"text":"based on the above discussion, our proposed CD update is ","element":"span"},{"style":{"height":16.79},"width":356.07,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-12.png","element":"img","alt":" RetrX(−ηθEijΩnX)","inline":true},{"text":", where ","element":"span"},{"style":{"height":16.79},"width":813.92,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-13.png","element":"img","alt":" θ = ⟨∇f(X), EijΩnX⟩ = [∇f(X)X⊤Ω⊤n +","inline":true},{"style":{"height":16.79},"width":297.26,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-14.png","element":"img","alt":"ΩnX∇f(X)⊤]i,j.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"3.5. CD on doubly stochastic and multinomial manifolds","element":"span"}],[{"text":"Given two marginals ","element":"span"},{"style":{"height":14.8},"width":278.04,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-15.png","element":"img","alt":" µ ∈ ∆m, ν ∈ ∆n","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":16},"width":192.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-16.png","element":"img","alt":" ∆k := {z ∈","inline":true},{"style":{"height":17.38},"width":375.8,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-17.png","element":"img","alt":"Rk : z ≥ 0, z⊤1k = 1}","inline":true,"padRight":true},{"text":"denotes the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":"-simplex, the doubly stochastic manifold (","element":"span"},{"href":"#id-5","referenceIndex":13,"text":"Douik & Hassibi","element":"a"},{"text":", ","element":"span"},{"href":"#id-5","referenceIndex":13,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-17","referenceIndex":56,"text":"Shi et 
al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-17","referenceIndex":56,"text":"2021","element":"a"},{"text":"; ","element":"span"},{"href":"#id-15","referenceIndex":46,"text":"Mishra et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-15","referenceIndex":46,"text":"2021","element":"a"},{"text":") is defined as ","element":"span"},{"style":{"height":16},"width":416.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-18.png","element":"img","alt":" Π(µ, ν) := {X ∈ Rm×n :","inline":true},{"style":{"height":16},"width":557.01,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-19.png","element":"img","alt":"X > 0, X1n = µ, X⊤1m = ν}","inline":true},{"text":". The tangent space is ","element":"span"},{"style":{"height":16},"width":930.9,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-20.png","element":"img","alt":"TXΠ(µ, ν) = {U ∈ Rm×n : U1n = 0, U ⊤1m = 0}","inline":true},{"text":", which can be endowed with the Fisher metric as ","element":"span"},{"style":{"height":16},"width":180.3,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-21.png","element":"img","alt":" ⟨U, V ⟩X =","inline":true},{"style":{"height":16},"width":522.54,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-22.png","element":"img","alt":"tr(U ⊤(V ⊘X)), where ⊙ and ⊘","inline":true,"padRight":true},{"text":"represent the elementwise product and division operations, respectively. 
The orthogonal projection is ","element":"span"},{"style":{"height":16},"width":637.78,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-23.png","element":"img","alt":" ProjX(A) = A − (α1⊤n + 1mβ⊤) ⊙ X","inline":true},{"text":", ","element":"span"},{"text":"where ","element":"span"},{"style":{"height":14.4},"width":275.55,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-24.png","element":"img","alt":" α ∈ Rm, β ∈ Rn","inline":true,"padRight":true},{"text":"are solutions to the linear system: ","element":"span"},{"style":{"height":14.8},"width":766.9,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-25.png","element":"img","alt":"α ⊙ µ + Xβ = A1n, β ⊙ ν + X⊤α = A⊤1m","inline":true},{"text":". The Riemannian gradient is given by ","element":"span"},{"style":{"height":16},"width":437.45,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-26.png","element":"img","alt":" gradf(X) = ProjX(X ⊙","inline":true},{"style":{"height":19.2},"width":750.91,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-27.png","element":"img","alt":"∇f(X)) = X ⊙ (∇f(X) − (α1⊤n + 1mβ⊤)).","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Choice of basis. 
","element":"span"},{"text":"We consider the parameterization of the tangent space as ","element":"span"},{"style":{"height":16},"width":563.86,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-28.png","element":"img","alt":" TXΠ(µ, ν) = {ACB⊤ : A ∈","inline":true},{"style":{"height":17.38},"width":936.43,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-29.png","element":"img","alt":"Rm×(m−1), B ∈ Rn×(n−1), A⊤1m = 0, B⊤1n = 0, C ∈","inline":true},{"style":{"height":18.18},"width":255.06,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-30.png","element":"img","alt":"R(m−1)×(n−1)}","inline":true},{"text":". We notice that the tangent space has a dimension ","element":"span"},{"style":{"height":16},"width":273.82,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-31.png","element":"img","alt":" (m−1)×(n−1)","inline":true},{"text":", and hence, we can let ","element":"span"},{"style":{"height":16},"width":164.79,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-32.png","element":"img","alt":" A = [e1−","inline":true},{"style":{"height":18.19},"width":936.01,"height":45.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-33.png","element":"img","alt":"e2, ..., em−1 −em] ∈ Rm×(m−1), B = [e1 −e2, ..., en−1 −","inline":true},{"style":{"height":18.19},"width":258.46,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-34.png","element":"img","alt":"en] ∈ Rn×(n−1)","inline":true},{"text":", where we denote ","element":"span"},{"style":{"height":9.19},"width":29.55,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-35.png","element":"img","alt":" ei","inline":true,"padRight":true},{"text":"as the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"-th canonical 
","element":"span"},{"text":"basis for the corresponding vector space. Hence the tangent space is parameterized by ","element":"span"},{"style":{"height":14.99},"width":317.67,"height":37.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-36.png","element":"img","alt":" C ∈ R(m−1)×(n−1)","inline":true},{"text":". The basis we consider is ","element":"span"},{"style":{"height":18.55},"width":717.41,"height":46.37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-37.png","element":"img","alt":" Bℓ = Aeie⊤j B⊤ = (ei − ei+1)(ej − ej+1)⊤","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":16},"width":398,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-38.png","element":"img","alt":" i ∈ [m − 1], j ∈ [n − 1].","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Retraction. ","element":"span"},{"text":"We consider the Sinkhorn retraction applied in the direction of the basis as ","element":"span"},{"style":{"height":16},"width":454.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-39.png","element":"img","alt":" RetrX(−ηθBℓ) = SK(X ⊙","inline":true},{"style":{"height":16},"width":311.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-40.png","element":"img","alt":"exp(−ηθBℓ ⊘ X))","inline":true},{"text":". Here, the Sinkhorn algorithm ","element":"span"},{"text":"SK(","element":"span"},{"style":{"fontStyle":"italic"},"text":"U","element":"span"},{"text":") ","element":"span"},{"text":"iteratively normalizes the rows and columns of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U ","element":"span"},{"text":"according to the given marginals (","element":"span"},{"href":"#id-57","referenceIndex":37,"text":"Knight","element":"a"},{"text":", ","element":"span"},{"href":"#id-57","referenceIndex":37,"text":"2008","element":"a"},{"text":"). 
We notice that the input to the Sinkhorn algorithm only modifies a ","element":"span"},{"style":{"height":10.8},"width":91.33,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-41.png","element":"img","alt":" 2 × 2","inline":true,"padRight":true},{"text":"sub-matrix of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"text":". It thus suffices to apply the Sinkhorn algorithm to the ","element":"span"},{"style":{"height":10.8},"width":91.4,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-42.png","element":"img","alt":" 2 × 2","inline":true,"padRight":true},{"text":"sub-matrix with the modified marginals, which greatly simplifies the computation compared to running the Sinkhorn algorithm on the entire input. To this end, we define the coordinate Sinkhorn, denoted as ","element":"span"},{"style":{"height":18.15},"width":144.25,"height":45.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-43.png","element":"img","alt":" SKij(U)","inline":true,"padRight":true},{"text":"or simply ","element":"span"},{"text":"cSK(","element":"span"},{"style":{"fontStyle":"italic"},"text":"U","element":"span"},{"text":") ","element":"span"},{"text":"if the coordinates are clear from context, as performing the Sinkhorn algorithm for the ","element":"span"},{"style":{"height":10.8},"width":93.91,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-44.png","element":"img","alt":" 2 × 2","inline":true,"padRight":true},{"text":"sub-matrix formed by indices ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i, i","element":"span"},{"text":"+1 ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j, j ","element":"span"},{"text":"+1 ","element":"span"},{"text":"with marginals 
","element":"span"},{"style":{"height":19.18},"width":871.34,"height":47.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-45.png","element":"img","alt":"([µ]i − Σk≠j,j+1[U]ik, [µ]i+1 − Σk≠j,j+1[U](i+1)k)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":19.18},"width":847.68,"height":47.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-46.png","element":"img","alt":"([ν]j − Σl≠i,i+1[U]lj, [ν]j+1 − Σl≠i,i+1[U]l(j+1))","inline":true},{"text":". The ","element":"span"},{"text":"other entries of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U ","element":"span"},{"text":"remain unchanged. We show in the next proposition that applying coordinate Sinkhorn to the basis results in a valid retraction.","element":"span"}],[{"id":"id-103","style":{"fontWeight":"bold"},"text":"Proposition 3.9. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The coordinate Sinkhorn applied to the basis ","element":"span"},{"style":{"height":16},"width":534.99,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-47.png","element":"img","alt":" Bℓ, i.e., cSK(X ⊙exp(tBℓ ⊘X))","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is a valid retraction in the direction of ","element":"span"},{"style":{"height":13.19},"width":55.56,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-48.png","element":"img","alt":" Bℓ.","inline":true}],[{"text":"We can further simplify the computation of ","element":"span"},{"style":{"height":16},"width":168.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-49.png","element":"img","alt":" cSK(X ⊙","inline":true},{"style":{"height":16},"width":243.03,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-50.png","element":"img","alt":"exp(tBℓ ⊘X))","inline":true},{"text":", which is equivalent to performing Sinkhorn 
on a ","element":"span"},{"style":{"height":10.8},"width":91.04,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-51.png","element":"img","alt":" 2 × 2","inline":true,"padRight":true},{"text":"matrix. Furthermore, in this case, we show in Lemma ","element":"span"},{"href":"#id-58","text":"E.6 ","element":"a"},{"text":"that the Sinkhorn admits a closed-form solution.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"CD update. ","element":"span"},{"text":"The CD update follows as ","element":"span"},{"style":{"height":16},"width":172.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-52.png","element":"img","alt":" cSK(X ⊙","inline":true},{"style":{"height":16},"width":325.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-53.png","element":"img","alt":"exp(−ηθBℓ ⊘ X))","inline":true},{"text":", where the coordinate derivative ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-54.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"is computed as ","element":"span"},{"style":{"height":16.79},"width":650.9,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-55.png","element":"img","alt":" θ = ⟨∇f(X), Bℓ⟩ = [∇f(X)]ij −","inline":true},{"style":{"height":17.68},"width":892.49,"height":44.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-56.png","element":"img","alt":"[∇f(X)]i(j+1) − [∇f(X)](i+1)j + [∇f(X)](i+1)(j+1).","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Remark ","element":"span"},{"text":"3.10 (CD on multinomial manifold)","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
","element":"span"},{"text":"The developments in this section readily apply to the multinomial manifold (","element":"span"},{"href":"#id-5","referenceIndex":13,"text":"Douik & Hassibi","element":"a"},{"text":", ","element":"span"},{"href":"#id-5","referenceIndex":13,"text":"2019","element":"a"},{"text":"), i.e., ","element":"span"},{"style":{"height":16},"width":261.77,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-57.png","element":"img","alt":" Mn,p := {X ∈","inline":true},{"style":{"height":16.79},"width":475.85,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-58.png","element":"img","alt":"Rn×p : X > 0, X1p = v}","inline":true,"padRight":true},{"text":"where we assume ","element":"span"},{"style":{"height":13.19},"width":132.32,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-59.png","element":"img","alt":" v = 1n","inline":true,"padRight":true},{"text":"without loss of generality. The multinomial constraint corresponds to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"independent simplex constraints restricted to positive entries. The tangent space is ","element":"span"},{"style":{"height":16},"width":300.69,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-60.png","element":"img","alt":" TXMn,p = {U ∈","inline":true},{"style":{"height":18.97},"width":936.44,"height":47.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-61.png","element":"img","alt":"Rn×p : U1p = 0} = {V B⊤ : V ∈ Rn×(p−1), B ∈","inline":true},{"style":{"height":18.98},"width":368.2,"height":47.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-62.png","element":"img","alt":"Rp×(p−1), B⊤1p = 0}","inline":true},{"text":". 
Thus, the basis is similarly given by ","element":"span"},{"style":{"height":16.79},"width":348.86,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-63.png","element":"img","alt":" Bℓ = ei(ej − ej+1)⊤","inline":true},{"text":". The Riemannian metric is the same Fisher metric. A retraction is given in ","element":"span"},{"style":{"height":16},"width":227.47,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-64.png","element":"img","alt":" RetrX(tU) =","inline":true},{"style":{"height":16},"width":410.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-65.png","element":"img","alt":"P(X ⊙ (exp(tU ⊘ X)))","inline":true},{"text":", where ","element":"span"},{"style":{"height":18.34},"width":393.99,"height":45.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-66.png","element":"img","alt":" P(V ) = V ⊘ (V 1p1⊤p )","inline":true,"padRight":true},{"text":"denotes the row normalization. It should be noted that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"V ","element":"span"},{"text":") ","element":"span"},{"text":"is a special case of the Sinkhorn algorithm without column normalization. Thus, in the basis direction, we can define the coordinate projection by modifying only two entries per row. The coordinate derivative can be computed as ","element":"span"},{"style":{"height":17.68},"width":817.16,"height":44.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/4-67.png","element":"img","alt":"θ = ⟨∇f(X), Bℓ⟩ = [∇f(X)]ij − [∇f(X)]i(j+1).","inline":true}],[{"style":{"fontWeight":"bold"},"text":"3.6. 
CD on positive (semi)definite manifold","element":"span"}],[{"text":"The set of fixed-rank symmetric positive semi-definite (SPSD) matrices (","element":"span"},{"href":"#id-59","referenceIndex":62,"text":"Vandereycken et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-59","referenceIndex":62,"text":"2009","element":"a"},{"text":"; ","element":"span"},{"href":"#id-60","referenceIndex":63,"text":"2013","element":"a"},{"text":"; ","element":"span"},{"href":"#id-61","referenceIndex":43,"text":"Massart & Absil","element":"a"},{"text":", ","element":"span"},{"href":"#id-61","referenceIndex":43,"text":"2020","element":"a"},{"text":") is defined as ","element":"span"},{"style":{"height":18.03},"width":353.32,"height":45.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-0.png","element":"img","alt":" Sn,p+ := {X ∈ Rn×n :","inline":true},{"style":{"height":16},"width":589.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-1.png","element":"img","alt":"X ⪰ 0, rank(X) = p}. 
When p = n","inline":true},{"text":", we recover the set of symmetric positive definite (SPD) matrices as ","element":"span"},{"style":{"height":18.03},"width":209.21,"height":45.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-2.png","element":"img","alt":" Sn,n+ ≡ Sn++.","inline":true,"padRight":true},{"text":"For the purpose of developing efficient CD updates on SPSD, we follow the parameterization proposed in (","element":"span"},{"href":"#id-61","referenceIndex":43,"text":"Massart & Ab","element":"a"},{"href":"#id-61","referenceIndex":43,"text":"sil","element":"a"},{"text":", ","element":"span"},{"href":"#id-61","referenceIndex":43,"text":"2020","element":"a"},{"text":"), i.e., ","element":"span"},{"style":{"height":19.23},"width":187.08,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-3.png","element":"img","alt":" X ∈ Sn×p+","inline":true,"padRight":true},{"text":"is factorized as ","element":"span"},{"style":{"height":10.8},"width":185.74,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-4.png","element":"img","alt":" X = Y Y ⊤","inline":true},{"text":", ","element":"span"},{"style":{"height":15.91},"width":171.09,"height":39.78,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-5.png","element":"img","alt":"Y ∈ Rn×p∗","inline":true,"padRight":true},{"text":", which is unique up to the right-action of the orthogonal group ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":")","element":"span"},{"text":". 
The Riemannian gradient can be computed as ","element":"span"},{"style":{"height":16},"width":790.06,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-6.png","element":"img","alt":" gradf(Y ) = ∇f(Y Y ⊤) = 2 sym(∇f(Y Y ⊤))Y","inline":true,"padRight":true},{"text":"because ","element":"span"},{"style":{"height":16.05},"width":286.26,"height":40.12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-7.png","element":"img","alt":" TY Rn×p∗ ≃ Rn×p","inline":true},{"text":". The main advantage of this parameterization is its simple expression of retraction, i.e., ","element":"span"},{"style":{"height":16},"width":360.85,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-8.png","element":"img","alt":"RetrY (tξ) = Y + tξ (","inline":true},{"href":"#id-61","referenceIndex":43,"text":"Massart & Absil","element":"a"},{"text":", ","element":"span"},{"href":"#id-61","referenceIndex":43,"text":"2020","element":"a"},{"text":").","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Choice of basis, retraction, and CD update. ","element":"span"},{"text":"Using ","element":"span"},{"style":{"height":10.8},"width":185.34,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-9.png","element":"img","alt":"X = Y Y ⊤","inline":true},{"text":", the optimization problem is on ","element":"span"},{"style":{"height":15.91},"width":90.39,"height":39.78,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-10.png","element":"img","alt":" Rn×p∗","inline":true,"padRight":true},{"text":"with a simple retraction. 
For the objective ","element":"span"},{"style":{"height":18.03},"width":348.42,"height":45.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-11.png","element":"img","alt":" f : Sn,p+ → R, we ini-","inline":true,"padRight":true},{"text":"tialize ","element":"span"},{"style":{"height":15.91},"width":172.32,"height":39.78,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-12.png","element":"img","alt":" Y ∈ Rn×p∗","inline":true,"padRight":true},{"text":"and update as ","element":"span"},{"style":{"height":16},"width":410.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-13.png","element":"img","alt":" RetrY (−η∇f(Y Y ⊤)) =","inline":true},{"style":{"height":16},"width":283.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-14.png","element":"img","alt":"Y − η∇f(Y Y ⊤)","inline":true,"padRight":true},{"text":"for some stepsize ","element":"span"},{"style":{"height":10.4},"width":20,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-15.png","element":"img","alt":" η","inline":true},{"text":". We choose the basis to be ","element":"span"},{"style":{"height":13.35},"width":64.38,"height":33.37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-16.png","element":"img","alt":" eie⊤j","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":16},"width":257.17,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-17.png","element":"img","alt":" i ∈ [n], j ∈ [p]","inline":true},{"text":", which is orthonormal ","element":"span"},{"text":"for the tangent space ","element":"span"},{"style":{"height":16.05},"width":141.27,"height":40.12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-18.png","element":"img","alt":" TY Rn×p∗","inline":true,"padRight":true},{"text":". 
The CD update is given by ","element":"span"},{"style":{"height":18.55},"width":286.54,"height":46.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-19.png","element":"img","alt":" RetrY (−ηθeie⊤j )","inline":true},{"text":", where ","element":"span"},{"style":{"height":18.55},"width":451.48,"height":46.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-20.png","element":"img","alt":" θ = ⟨∇f(Y Y ⊤), eie⊤j ⟩ =","inline":true},{"style":{"height":16.79},"width":225.3,"height":41.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-21.png","element":"img","alt":"[∇f(Y Y ⊤)]ij","inline":true},{"text":". The simplicity of the geometry allows CD to be developed efficiently on ","element":"span"},{"style":{"height":18.03},"width":68.29,"height":45.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-22.png","element":"img","alt":" Sn,p+","inline":true,"padRight":true},{"text":", which coincides with ","element":"span"},{"text":"the Euclidean CD update in the Euclidean space. When ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":", we obtain a CD update for the SPD manifold.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"CD update with the BW metric. 
","element":"span"},{"text":"The Bures-Wasserstein (BW) metric for the SPD manifold (","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":") has been recently studied for various machine learning applications (","element":"span"},{"href":"#id-62","referenceIndex":7,"text":"Bhatia et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-62","referenceIndex":7,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-2","referenceIndex":21,"text":"Han et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-2","referenceIndex":21,"text":"2021","element":"a"},{"text":"). For the BW metric, the gradient descent update is ","element":"span"},{"style":{"height":16},"width":394.46,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-23.png","element":"img","alt":" ExpX(−ηgradf(X)) =","inline":true},{"style":{"height":17.38},"width":935.94,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-24.png","element":"img","alt":"X − 2η(∇f(X)X + X∇f(X)) + 4η2∇f(X)X∇f(X).","inline":true,"padRight":true},{"text":"Consider a basis ","element":"span"},{"style":{"height":15.59},"width":243.66,"height":38.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-25.png","element":"img","alt":" EijX + XEij","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":17.35},"width":262.24,"height":43.37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-26.png","element":"img","alt":" Eij = eie⊤j +","inline":true},{"style":{"height":11.59},"width":67.9,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-27.png","element":"img","alt":"eje⊤i","inline":true,"padRight":true},{"text":". 
The coordinate derivative is computed as ","element":"span"},{"style":{"height":15.59},"width":100.63,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-28.png","element":"img","alt":" θij =","inline":true},{"style":{"height":16.79},"width":409.58,"height":41.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-29.png","element":"img","alt":"⟨EijX + XEij, ∇f(X)⟩","inline":true},{"text":". Finally, the CD update is given by ","element":"span"},{"style":{"height":19.93},"width":887.72,"height":49.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-30.png","element":"img","alt":" X −2ηθij(EijX +XEij)+4η2θ2ijEijXEij. Each CD","inline":true,"padRight":true},{"text":"update modifies two rows and two columns of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Remark ","element":"span"},{"text":"3.11","element":"span"},{"style":{"fontStyle":"italic"},"text":". ","element":"span"},{"text":"For the SPD manifold (","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":"), ","element":"span"},{"href":"#id-46","referenceIndex":12,"text":"Darmwal & ","element":"a"},{"href":"#id-46","referenceIndex":12,"text":"Rajawat ","element":"a"},{"text":"(","element":"span"},{"href":"#id-46","referenceIndex":12,"text":"2023","element":"a"},{"text":") propose CD updates based on the affine-invariant metric and Cholesky factorization. They specifically focus on a class of objective functions and show that the exponential map computations are efficient. In contrast, our choices of parameterization/metric directly lead to a faster retraction.","element":"span"}]]},{"heading":"4. Algorithms and Analysis","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"RCD. 
","element":"span"},{"text":"We present the proposed Riemannian coordinate descent (RCD) algorithm in Algorithm ","element":"span"},{"href":"#id-63","text":"1","element":"a"},{"text":". The complexity of RCD per iteration is the complexity of one first-order oracle and the update complexity in Table ","element":"span"},{"href":"#id-48","text":"1","element":"a"},{"text":". Although","element":"span"}],[{"id":"id-63","style":{"width":"99%"},"width":938,"height":704,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-31.png","element":"img"}],[{"text":"for some problem settings, we may exploit the structure of the objective to efficiently compute ","element":"span"},{"style":{"height":16},"width":296.94,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-32.png","element":"img","alt":" θ = ⟨∇f(X), Bℓ⟩","inline":true},{"text":", for general problem instances, ","element":"span"},{"style":{"height":16},"width":124.65,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-33.png","element":"img","alt":" ∇f(X)","inline":true,"padRight":true},{"text":"becomes the main computational bottleneck.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"RCDlin. ","element":"span"},{"text":"To reduce gradient computations in the RCD setup (especially for non-linear objectives), we also propose the Riemannian linearized coordinate descent (RCDlin) method in Algorithm ","element":"span"},{"href":"#id-63","text":"1","element":"a"},{"text":". The main difference from RCD is that the variables are updated using an anchored gradient at ","element":"span"},{"style":{"height":13.19},"width":50.02,"height":32.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-34.png","element":"img","alt":" Xk","inline":true,"padRight":true},{"text":"(which does not change for inner iterations). 
This scheme is equivalent to taking a linearization of the original cost function at ","element":"span"},{"style":{"height":13.19},"width":50.02,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-35.png","element":"img","alt":" Xk","inline":true},{"text":", and in the inner iterations, we solve: ","element":"span"},{"style":{"height":19.2},"width":836.54,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-36.png","element":"img","alt":" minX∈M {g(X) := f(Xk) + ⟨∇f(Xk), X − Xk⟩},","inline":true,"padRight":true},{"text":"where the inner product and subtraction are defined in the ambient Euclidean space. ","element":"span"},{"text":"Subsequently, the Euclidean gradient at ","element":"span"},{"style":{"height":15.32},"width":51.14,"height":38.29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-37.png","element":"img","alt":" Xsk","inline":true,"padRight":true},{"text":"is ","element":"span"},{"style":{"height":16.52},"width":350.44,"height":41.29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-38.png","element":"img","alt":" ∇g(Xsk) = ∇f(Xk)","inline":true},{"text":", and thus, ","element":"span"},{"style":{"height":15.32},"width":89.56,"height":38.29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-39.png","element":"img","alt":" θsk =","inline":true},{"style":{"height":18.25},"width":252.82,"height":45.62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-40.png","element":"img","alt":"⟨∇f(Xk), Bℓsk⟩","inline":true},{"text":". For the randomized setting with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S ","element":"span"},{"text":"= 1","element":"span"},{"text":", ","element":"span"},{"text":"RCDlin is equivalent to RCD. 
Additionally, for linear problems where ","element":"span"},{"style":{"height":16},"width":300.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/5-41.png","element":"img","alt":" ∇f(Xk) = C (C","inline":true,"padRight":true},{"text":"is some constant matrix), RCDlin also reduces to RCD.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Convergence and complexity of Algorithm ","element":"span"},{"href":"#id-63","style":{"fontWeight":"bold"},"text":"1","element":"a"}],[{"text":"We next discuss the convergence analysis of RCD and RCDlin. It follows the standard analysis for CD algorithms (","element":"span"},{"href":"#id-41","referenceIndex":67,"text":"Wright","element":"a"},{"text":", ","element":"span"},{"href":"#id-41","referenceIndex":67,"text":"2015","element":"a"},{"text":"). Note that ","element":"span"},{"href":"#id-42","referenceIndex":19,"text":"Gutman & Ho-Nguyen ","element":"a"},{"text":"(","element":"span"},{"href":"#id-42","referenceIndex":19,"text":"2023","element":"a"},{"text":") mostly consider the analysis of CD algorithms under the exponential map and parallel transport operations. In contrast, we consider the more general retraction and vector transport operations. We also adapt our analysis for RCDlin. For brevity, our analysis is informally discussed here. The analysis is carried out in a compact neighbourhood of a critical point, which is required for validating certain regularity assumptions, boundedness of basis and projection onto the basis, and smoothness of the objective (details in Appendix ","element":"span"},{"text":"F","element":"span"},{"text":").","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"On RCD. ","element":"span"},{"text":"We start by showing the convergence of RCD under randomized selection of basis and certain regularity ","element":"span"},{"text":"assumptions. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"is the smoothness constant of the objective.","element":"span"}],[{"id":"id-64","style":{"fontWeight":"bold"},"text":"Theorem 4.1 ","element":"span"},{"text":"(Randomized RCD)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under mild assumptions, consider RCD with ","element":"span"},{"style":{"height":15.71},"width":213.41,"height":39.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/6-0.png","element":"img","alt":" S = 1 and ℓsk ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"selected uniformly ","element":"span"},{"style":{"fontStyle":"italic"},"text":"at random from ","element":"span"},{"style":{"fontStyle":"italic"},"text":"I","element":"span"},{"style":{"fontStyle":"italic"},"text":". Then, choosing ","element":"span"},{"style":{"height":19.38},"width":188.5,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/6-1.png","element":"img","alt":" η = Θ(1/L)","inline":true},{"style":{"fontStyle":"italic"},"text":"leads to ","element":"span"},{"style":{"height":22.21},"width":726.24,"height":55.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/6-2.png","element":"img","alt":"min0≤k≤K−1 E∥gradf(Xk)∥2Xk ≤ O(|I|L/K).","inline":true}],[{"text":"The convergence of RCD with cyclic selection of basis requires further assumptions that bound the difference of the constructed bases between tangent spaces. These are reasonable given the compactness of the domain.","element":"span"}],[{"id":"id-65","style":{"fontWeight":"bold"},"text":"Theorem 4.2 ","element":"span"},{"text":"(Cyclic RCD)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under mild assumptions in addition to the ones required by randomized RCD, consider RCD with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"|I| ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":15.31},"width":237.04,"height":38.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/6-3.png","element":"img","alt":" ℓsk = s + 1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for ","element":"span"},{"style":{"height":16},"width":324.45,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/6-4.png","element":"img","alt":"s = 0, ..., |I| − 1","inline":true},{"style":{"fontStyle":"italic"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Then, for ","element":"span"},{"style":{"height":19.37},"width":207.75,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/6-5.png","element":"img","alt":" η = Θ(1/L)","inline":true},{"style":{"fontStyle":"italic"},"text":", we have","element":"span"}],[{"style":{"width":"75%"},"width":710,"height":47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/6-6.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"On RCDlin. 
","element":"span"},{"text":"The key idea here is to relate the coordinate derivative ","element":"span"},{"style":{"height":18.25},"width":344.26,"height":45.62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/6-7.png","element":"img","alt":" θsk = ⟨∇f(Xk), Bℓsk⟩","inline":true,"padRight":true},{"text":"to the correct descent ","element":"span"},{"text":"derivative ","element":"span"},{"style":{"height":18.25},"width":253.37,"height":45.62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/6-8.png","element":"img","alt":" ⟨∇f(Xsk), Bℓsk⟩","inline":true},{"text":". In randomized settings, we can ","element":"span"},{"text":"show the same convergence rate as RCD up to some additional constants regulated by the difference between ","element":"span"},{"style":{"height":15.72},"width":105.53,"height":39.29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/6-9.png","element":"img","alt":" θsk and","inline":true,"padRight":true},{"text":"the descent direction. For the cyclic settings, however, we require ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"|I| ","element":"span"},{"text":"in order to cycle through all the bases.","element":"span"}],[{"id":"id-114","style":{"fontWeight":"bold"},"text":"Theorem 4.3 ","element":"span"},{"text":"(Randomized and cyclic RCDlin)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under assumptions required in Theorems ","element":"span"},{"href":"#id-64","style":{"fontStyle":"italic"},"text":"4.1 ","element":"a"},{"text":"& ","element":"span"},{"href":"#id-65","style":{"fontStyle":"italic"},"text":"4.2","element":"a"},{"style":{"fontStyle":"italic"},"text":", suppose ","element":"span"},{"style":{"height":15.32},"width":35.71,"height":38.29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/6-10.png","element":"img","alt":"θsk","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":18.25},"width":253.37,"height":45.63,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/6-11.png","element":"img","alt":" ⟨∇f(Xsk), Bℓsk⟩","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"are positively related. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Then, consider randomized RCDlin with ","element":"span"},{"style":{"height":15.1},"width":33.6,"height":37.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/6-12.png","element":"img","alt":" ℓsk","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"selected uniformly ","element":"span"},{"style":{"fontStyle":"italic"},"text":"at random from ","element":"span"},{"style":{"fontStyle":"italic"},"text":"I","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Choosing ","element":"span"},{"style":{"height":19.37},"width":213.62,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/6-13.png","element":"img","alt":" η = Θ(1/L)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"leads to ","element":"span"},{"style":{"height":24.38},"width":860.5,"height":60.95,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/6-14.png","element":"img","alt":"min0≤k≤K−1,0≤s≤S−1 E∥gradf(Xsk)∥2Xsk ≤ O(|I|L/(KS))","inline":true},{"style":{"fontStyle":"italic"},"text":". In addition, consider cyclic RCDlin with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"|I| ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":15.1},"width":81.99,"height":37.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/6-15.png","element":"img","alt":" ℓsk =","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"+ 1 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"for ","element":"span"},{"style":{"height":16},"width":301.87,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/6-16.png","element":"img","alt":" s = 0, ..., |I| − 1","inline":true},{"style":{"fontStyle":"italic"},"text":". Also, if ","element":"span"},{"style":{"height":19.37},"width":191.62,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/6-17.png","element":"img","alt":" η = Θ(1/L)","inline":true},{"style":{"fontStyle":"italic"},"text":", then","element":"span"}],[{"style":{"width":"75%"},"width":710,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/6-18.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Complexity analysis. 
","element":"span"},{"text":"Let the cost of computing the coordinate derivative ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/6-19.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"and the CD update be ","element":"span"},{"style":{"height":11.6},"width":19,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/6-20.png","element":"img","alt":" δ","inline":true,"padRight":true},{"text":"(last column of Table ","element":"span"},{"href":"#id-48","text":"1","element":"a"},{"text":"). Then, the total computational costs of RCD and RCDlin are ","element":"span"},{"style":{"height":16},"width":582.54,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/6-21.png","element":"img","alt":" O(KS(F + δ)) and O(K(F + Sδ))","inline":true},{"text":", respectively, where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"text":"denotes the cost of computing ","element":"span"},{"style":{"height":16},"width":124.65,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/6-22.png","element":"img","alt":" ∇f(X)","inline":true},{"text":". We note that the proposed algorithms can update in parallel along disjoint basis directions. For example, in the Stiefel/Grassmann case, we can select ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n/","element":"span"},{"text":"2 ","element":"span"},{"text":"non-overlapping index pairs, which results in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n/","element":"span"},{"text":"2 ","element":"span"},{"text":"independent Givens rotations that can be applied in parallel.","element":"span"}]]},{"heading":"5. 
Experiments","paragraphs":[[{"text":"We now benchmark the performance of the proposed RCD and RCDlin algorithms in terms of computational efficiency (flop counts and/or runtime) and convergence quality (distance to optimality). One of the considered baselines is the Riemannian gradient descent method (RGD), a full gradient method. As RGD exploits the entire gradient direction, it has an advantage over CD algorithms. However, RGD is significantly more costly than CD in every update.","element":"span"}],[{"id":"id-67","style":{"width":"88%"},"width":830,"height":695,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/6-23.png","element":"img"}],[{"text":"Figure 1: The Procrustes problem with varying ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"p","element":"figcaption","subtype":"caption"},{"text":": (a) ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"p ","element":"figcaption","subtype":"caption"},{"text":"= 150 ","element":"figcaption","subtype":"caption"},{"text":"and (b) ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"p ","element":"figcaption","subtype":"caption"},{"text":"= 50","element":"figcaption","subtype":"caption"},{"text":". (Top row) Comparing various algorithms in terms of flop counts. (Bottom row) Comparing various algorithms in terms of runtime. 
We observe that our RCD algorithm obtains better flop counts than the baselines and is competitive in terms of runtime.","element":"figcaption","subtype":"caption"}],[{"text":"Our code is implemented using the Manopt toolbox (","element":"span"},{"href":"#id-66","referenceIndex":10,"text":"Boumal et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-66","referenceIndex":10,"text":"2014","element":"a"},{"text":") and run on a laptop with an i5-10500 3.1GHz CPU. 
The code is available at ","element":"span"},{"href":"https://github.com/andyjm3","text":"https://github.com/andyjm3","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"5.1. Orthogonal Procrustes and PCA","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Orthogonal Procrustes problem. ","element":"span"},{"text":"We aim to solve ","element":"span"},{"style":{"height":19.06},"width":698.6,"height":47.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/6-24.png","element":"img","alt":"minX∈St(n,p) ∥XA − B∥2(≡ −⟨XA, B⟩)","inline":true,"padRight":true},{"text":"for given matrices ","element":"span"},{"style":{"height":14.18},"width":372.05,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/6-25.png","element":"img","alt":" A ∈ Rp×p, B ∈ Rn×p","inline":true},{"text":". There exists a closed-form solution provided by the (thin) SVD of ","element":"span"},{"style":{"height":11.6},"width":76.11,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/6-26.png","element":"img","alt":" BA⊤","inline":true},{"text":". For this problem, RCD and RCDlin have the same updates because ","element":"span"},{"style":{"height":16},"width":291.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/6-27.png","element":"img","alt":" ∇f(X) = −BA⊤","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":16},"width":217.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/6-28.png","element":"img","alt":"X ∈ St(n, p)","inline":true},{"text":". 
In experiments, we generate random matrices ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A, B ","element":"span"},{"text":"and evaluate performance by the optimality gap, computed as ","element":"span"},{"style":{"height":16},"width":314.57,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/6-29.png","element":"img","alt":" |f(Xk) − f ∗|/|f ∗|.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Baselines. ","element":"span"},{"text":"The closest baseline to RCD is TSD (","element":"span"},{"href":"#id-42","referenceIndex":19,"text":"Gutman ","element":"a"},{"href":"#id-42","referenceIndex":19,"text":"& Ho-Nguyen","element":"a"},{"text":", ","element":"span"},{"href":"#id-42","referenceIndex":19,"text":"2023","element":"a"},{"text":"), which is a CD method under an alternative construction of bases. As discussed, while TSD updates the columns, the proposed RCD updates the rows. Since RCD is equivalent to TSD for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":", we focus only on the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p < n ","element":"span"},{"text":"setting. For both RCD and TSD, we use the cyclic selection of basis. We also compare against RGD methods with QR, Cayley (CL), and exponential (EXP) retractions. For all the methods, we tune the stepsize.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Results. ","element":"span"},{"text":"In Figure ","element":"span"},{"href":"#id-67","text":"1","element":"a"},{"text":", we show results with varying dimension ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":". The proposed RCD obtains better flop counts than the baselines and is competitive in terms of runtime. 
We highlight that the runtime of RCD can be further improved with parallel implementation. In Figure ","element":"span"},{"href":"#id-68","text":"2c","element":"a"},{"text":", we compare a variety of basis selection rules for both","element":"span"}],[{"id":"id-68","style":{"width":"89%"},"width":1741,"height":380,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/7-0.png","element":"img"}],[{"text":"Figure 2: (a) & (b): Experiments on the PCA problem with ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"n ","element":"figcaption","subtype":"caption"},{"text":"= 200","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":", p ","element":"figcaption","subtype":"caption"},{"text":"= 50","element":"figcaption","subtype":"caption"},{"text":". In (a), we observe that our algorithm RCDlin achieves the fastest convergence due to low per-iteration cost. In (b), we compare various strategies for basis selection: cyclic selection (-c) and uniformly random selection (-r) of basis for TSD, RCD, and RCDlin, and selection without replacement (-nr) for RCDlin. We observe that cyclic and selection without replacement strategies are better than random selection. (c) & (d): Experiments on the Procrustes problem with ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"n ","element":"figcaption","subtype":"caption"},{"text":"= 200","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":", p ","element":"figcaption","subtype":"caption"},{"text":"= 150","element":"figcaption","subtype":"caption"},{"text":". In (c), we again observe that cyclic selection performs better than random selection. In (d), RCD performs competitively against the infeasible methods.","element":"figcaption","subtype":"caption"}],[{"text":"TSD and RCD: cyclic selection (‘c’) and uniformly random selection (‘r’). 
We observe that cyclic selection is more favourable than the random selection rule for both methods. We compare against full-gradient infeasible methods in Figure ","element":"span"},{"href":"#id-68","text":"2d","element":"a"},{"text":", including PLAM (","element":"span"},{"href":"#id-36","referenceIndex":15,"text":"Gao et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-36","referenceIndex":15,"text":"2019","element":"a"},{"text":"), PCAL (","element":"span"},{"href":"#id-36","referenceIndex":15,"text":"Gao ","element":"a"},{"href":"#id-36","referenceIndex":15,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-36","referenceIndex":15,"text":"2019","element":"a"},{"text":"), PenCF (","element":"span"},{"href":"#id-69","referenceIndex":69,"text":"Xiao et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-69","referenceIndex":69,"text":"2022","element":"a"},{"text":"), ExPen (","element":"span"},{"href":"#id-37","referenceIndex":68,"text":"Xiao & Liu","element":"a"},{"text":", ","element":"span"},{"href":"#id-37","referenceIndex":68,"text":"2021","element":"a"},{"text":"; ","element":"span"},{"href":"#id-51","referenceIndex":70,"text":"Xiao et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-51","referenceIndex":70,"text":"2023","element":"a"},{"text":"), and Landing (","element":"span"},{"href":"#id-70","referenceIndex":1,"text":"Ablin & Peyr","element":"a"},{"text":"é","element":"span"},{"text":", ","element":"span"},{"href":"#id-70","referenceIndex":1,"text":"2022","element":"a"},{"text":"; ","element":"span"},{"href":"#id-38","referenceIndex":2,"text":"Ablin et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-38","referenceIndex":2,"text":"2023","element":"a"},{"text":"). RCD performs competitively against these infeasible methods for orthogonality constraints.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"PCA problem. 
","element":"span"},{"text":"The PCA problem is the quadratic maximization problem ","element":"span"},{"style":{"height":18.43},"width":409.97,"height":46.06,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/7-1.png","element":"img","alt":" maxX⊤X=Ip tr(X⊤AX)","inline":true,"padRight":true},{"text":"for some positive definite matrix ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"text":", i.e., ","element":"span"},{"style":{"height":17.14},"width":150.76,"height":42.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/7-2.png","element":"img","alt":" A ∈ Sn++","inline":true},{"text":". This problem is ","element":"span"},{"text":"in fact an optimization problem over the Grassmann manifold because the objective is invariant to basis change of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"text":". Hence, we use the Riemannian distance to the optimal solution on the Grassmann manifold to measure the performance. As discussed in Section ","element":"span"},{"href":"#id-71","text":"3.2","element":"a"},{"text":", our proposed RCD has well-defined updates on the Grassmann manifold. In contrast, TSD is not invariant to the basis change. For experiments, we generate ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"with a condition number ","element":"span"},{"style":{"height":13.78},"width":124.92,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/7-3.png","element":"img","alt":" 10^3 and","inline":true,"padRight":true},{"text":"with exponential decay of eigenvalues. For TSD, RCD, and RCDlin, we implement the cyclic selection of basis.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Results. 
","element":"span"},{"text":"In Figure ","element":"span"},{"href":"#id-68","text":"2a","element":"a"},{"text":", we observe that RCDlin achieves the best performance due to its low per-iteration cost. We note that TSD converges slowly due to non-invariance of the CD updates. In Figure ","element":"span"},{"href":"#id-68","text":"2b","element":"a"},{"text":", we compare the cyclic and uniformly random selection of the basis of RCD, RCDlin, and TSD. For RCDlin, we also implement the selection without replacement (‘nr’) strategy. We observe that cyclic and ‘nr’ strategies are better than random selection.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"5.2. Orthogonal deep networks for distillation","element":"span"}],[{"text":"We next evaluate RCD on a deep-learning-based distillation problem (","element":"span"},{"href":"#id-72","referenceIndex":27,"text":"Hinton et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-72","referenceIndex":27,"text":"2015","element":"a"},{"text":"). Let ","element":"span"},{"style":{"height":10.8},"width":31,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/7-4.png","element":"img","alt":" Θ","inline":true,"padRight":true},{"text":"denote the parameters of the student network (S) and let ","element":"span"},{"style":{"height":13.19},"width":54,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/7-5.png","element":"img","alt":" ΘT","inline":true,"padRight":true},{"text":"be the optimal parameters of the teacher network (T). Then, the aim","element":"span"}],[{"id":"id-74","style":{"width":"91%"},"width":860,"height":407,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/7-6.png","element":"img"}],[{"text":"Figure 3: Experiments on the distillation problem. 
We observe that the proposed RCD algorithm performs better than the baselines in terms of both flop counts and runtime.","element":"figcaption","subtype":"caption"}],[{"text":"is to learn S that approximates T, i.e., minimize ","element":"span"},{"style":{"height":16},"width":136.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/7-7.png","element":"img","alt":" L(Θ) =","inline":true},{"style":{"height":17.77},"width":377.36,"height":44.42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/7-8.png","element":"img","alt":"∥ΨΘ(X) − ΨΘT(X)∥2","inline":true},{"text":", where ","element":"span"},{"style":{"height":17.38},"width":317.66,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/7-9.png","element":"img","alt":" ΨΘ(X) ∈ RN×dout","inline":true,"padRight":true},{"text":"represents the output of the network for some input ","element":"span"},{"style":{"height":14.18},"width":220.36,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/7-10.png","element":"img","alt":" X ∈ RN×din.","inline":true,"padRight":true},{"text":"The network architecture is detailed in Appendix ","element":"span"},{"href":"#id-73","text":"A.1","element":"a"},{"text":". Here, we constrain all the weights to be orthonormal, thus posing the problem as optimization over the joint space of Stiefel and Euclidean manifolds. For the experiments, we consider a six-layer network and set ","element":"span"},{"style":{"height":13.6},"width":389.8,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/7-11.png","element":"img","alt":" din = 500, dout = 200","inline":true},{"text":". We use stochastic versions of RGD and RCD where the input samples are randomly generated. In Figure ","element":"span"},{"href":"#id-74","text":"3","element":"a"},{"text":", RCD outperforms the baselines in terms of flop counts and runtime. 
This is because RCD has the most cost-efficient update per iteration, while maintaining a competitive convergence rate.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"5.3. Nearest matrix problem","element":"span"}],[{"text":"We consider the problem: ","element":"span"},{"style":{"height":19.06},"width":401.14,"height":47.65,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/7-12.png","element":"img","alt":" minX∈Sp(n,p) ∥X − A∥2","inline":true,"padRight":true},{"text":"on the symplectic manifold (","element":"span"},{"href":"#id-55","referenceIndex":17,"text":"Gao et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-55","referenceIndex":17,"text":"2021b","element":"a"},{"text":"). We follow the setting in (","element":"span"},{"href":"#id-55","referenceIndex":17,"text":"Gao et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-55","referenceIndex":17,"text":"2021b","element":"a"},{"text":") by generating ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"as a random matrix with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"= 200","element":"span"},{"text":". The algorithms are evaluated on optimality gap ","element":"span"},{"style":{"height":16},"width":469.09,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/7-13.png","element":"img","alt":" |f(Xk) − f ∗|/|f ∗|, where f ∗ ","inline":true,"padRight":true},{"text":"is obtained by running the conjugate gradient algorithm with the Cayley retraction. We implement RCD and RCDlin with both CD and block CD updates (discussed in Remark ","element":"span"},{"href":"#id-75","text":"3.8","element":"a"},{"text":"). 
As there","element":"span"}],[{"id":"id-76","style":{"width":"95%"},"width":900,"height":336,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/8-0.png","element":"img"}],[{"text":"Figure 4: Experiments on the nearest matrix problem. We notice the utility of the block-update variants of our RCD and RCDlin algorithms in obtaining faster convergence.","element":"figcaption","subtype":"caption"}],[{"id":"id-81","style":{"width":"90%"},"width":847,"height":332,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/8-1.png","element":"img"}],[{"text":"Figure 5: Experiments on learning Lorentz (hyperbolic) embeddings. The performance of our RCDlin algorithms (with cyclic and time-cyclic basis selection) is competitive to RGD.","element":"figcaption","subtype":"caption"}],[{"text":"is no CD baseline on the symplectic manifold, we compare against the full gradient RGD algorithms with three retractions: Cayley (‘CL’), quasi-geodesic (‘QG’), and SR (‘SR’). In Figure ","element":"span"},{"href":"#id-76","text":"4","element":"a"},{"text":", RCD with block update shows clear advantage in both flop counts and runtime.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"5.4. Learning Lorentz embeddings","element":"span"}],[{"text":"We consider the task of learning embeddings for word hierarchies, which is formulated on the hyperbolic manifold using the hyperboloid model (","element":"span"},{"href":"#id-77","referenceIndex":48,"text":"Nickel & Kiela","element":"a"},{"text":", ","element":"span"},{"href":"#id-77","referenceIndex":48,"text":"2017","element":"a"},{"text":"; ","element":"span"},{"href":"#id-4","referenceIndex":49,"text":"2018","element":"a"},{"text":"; ","element":"span"},{"href":"#id-78","referenceIndex":33,"text":"Jawanpuria et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-78","referenceIndex":33,"text":"2019b","element":"a"},{"text":"). 
The goal is to map word pairs with hypernymy relations closer together while separating those without. We follow the formulation in (","element":"span"},{"href":"#id-4","referenceIndex":49,"text":"Nickel & Kiela","element":"a"},{"text":", ","element":"span"},{"href":"#id-4","referenceIndex":49,"text":"2018","element":"a"},{"text":"), and the details are in Appendix ","element":"span"},{"href":"#id-79","text":"A.2","element":"a"},{"text":".","element":"span"}],[{"text":"For the experiment settings, we train ","element":"span"},{"text":"5","element":"span"},{"text":"-dimensional embeddings (","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"= 5","element":"span"},{"text":") for the WordNet mammals subtree (","element":"span"},{"href":"#id-80","referenceIndex":44,"text":"Miller","element":"a"},{"text":", ","element":"span"},{"href":"#id-80","referenceIndex":44,"text":"1998","element":"a"},{"text":"). We adopt the RCDlin algorithm with two selection rules for the basis ","element":"span"},{"style":{"height":15.59},"width":122.13,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/8-2.png","element":"img","alt":" HijJX","inline":true},{"text":": cyclic (‘RCDlin-c’) and time cyclic (‘RCDlin-tc’). The cyclic selection loops through all ","element":"span"},{"style":{"height":16},"width":191.17,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/8-3.png","element":"img","alt":" n(n − 1)/2","inline":true,"padRight":true},{"text":"pairs per iteration. 
The time cyclic selection only loops through all the space-time coordinate pairs, namely ","element":"span"},{"text":"(1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2)","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"(1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"3)","element":"span"},{"style":{"fontStyle":"italic"},"text":", ..., ","element":"span"},{"text":"(1","element":"span"},{"style":{"fontStyle":"italic"},"text":", n","element":"span"},{"text":")","element":"span"},{"text":", which reduces the computational cost to scale linearly with dimension ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":". For RCDlin-c and RCDlin-tc, we use a linearly-decaying stepsize, i.e., ","element":"span"},{"style":{"height":16},"width":344.09,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/8-4.png","element":"img","alt":"η/(1 + 0.1 × epoch)","inline":true},{"text":". For RGD we use a fixed stepsize ","element":"span"},{"style":{"height":10.4},"width":20,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/8-5.png","element":"img","alt":" η","inline":true,"padRight":true},{"text":"which generally leads to better convergence. We tune and set ","element":"span"},{"style":{"height":14.4},"width":126.38,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/8-6.png","element":"img","alt":" η = 1.0","inline":true,"padRight":true},{"text":"for RCDlin and ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"5 ","element":"span"},{"text":"for RGD. 
We use two metrics ","element":"span"}],[{"id":"id-82","style":{"width":"94%"},"width":886,"height":421,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/8-7.png","element":"img"}],[{"text":"Figure 6: Experiments on the weighted least squares problem in two settings: (a) ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"n ","element":"figcaption","subtype":"caption"},{"text":"= ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"p ","element":"figcaption","subtype":"caption"},{"text":"= 500 ","element":"figcaption","subtype":"caption"},{"text":"and ","element":"figcaption","subtype":"caption"},{"style":{"height":15.54},"width":168.92,"height":38.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/8-8.png","element":"img","alt":" A = 1n1⊤n","inline":true,"padRight":true},{"text":"is a ","element":"figcaption","subtype":"caption"},{"text":"dense matrix and (b) ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"n ","element":"figcaption","subtype":"caption"},{"text":"= 500","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":", p ","element":"figcaption","subtype":"caption"},{"text":"= 100 ","element":"figcaption","subtype":"caption"},{"text":"and ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"A ","element":"figcaption","subtype":"caption"},{"text":"is a random symmetric matrix with ","element":"figcaption","subtype":"caption"},{"text":"70% ","element":"figcaption","subtype":"caption"},{"text":"of entries equal to ","element":"figcaption","subtype":"caption"},{"text":"1 ","element":"figcaption","subtype":"caption"},{"text":"and the rest equal to ","element":"figcaption","subtype":"caption"},{"text":"0","element":"figcaption","subtype":"caption"},{"text":". 
While RGD and the proposed RCDlin have similar convergence rates in (a), RCDlin has a clear advantage in (b).","element":"figcaption","subtype":"caption"}],[{"text":"for evaluating the convergence: mean average precision (MAP) and mean rank (MR) (","element":"span"},{"href":"#id-77","referenceIndex":48,"text":"Nickel & Kiela","element":"a"},{"text":", ","element":"span"},{"href":"#id-77","referenceIndex":48,"text":"2017","element":"a"},{"text":"; ","element":"span"},{"href":"#id-4","referenceIndex":49,"text":"2018","element":"a"},{"text":"). In Figure ","element":"span"},{"href":"#id-81","text":"5","element":"a"},{"text":", we see that RCDlin converges at a rate similar to RGD in terms of runtime.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"5.5. Weighted least squares (SPSD manifold)","element":"span"}],[{"text":"The weighted least squares problem is ","element":"span"},{"style":{"height":22.45},"width":288.1,"height":56.13,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/8-9.png","element":"img","alt":" minX∈Sn×p+ ∥A ⊙","inline":true},{"style":{"height":16},"width":139.43,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/8-10.png","element":"img","alt":"X − B∥","inline":true},{"text":", where ","element":"span"},{"style":{"height":16},"width":246.49,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/8-11.png","element":"img","alt":" A ∈ {0, 1}n×n","inline":true,"padRight":true},{"text":"masks the known entries in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"B","element":"span"},{"text":". It is an instance of the matrix completion problem (","element":"span"},{"href":"#id-2","referenceIndex":21,"text":"Han et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-2","referenceIndex":21,"text":"2021","element":"a"},{"text":"). 
For the experiment, we follow (","element":"span"},{"href":"#id-2","referenceIndex":21,"text":"Han ","element":"a"},{"href":"#id-2","referenceIndex":21,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-2","referenceIndex":21,"text":"2021","element":"a"},{"text":") by generating ","element":"span"},{"style":{"height":12.8},"width":224.36,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/8-12.png","element":"img","alt":" B = A ⊙ X∗","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":10.99},"width":52.14,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/8-13.png","element":"img","alt":" X∗","inline":true,"padRight":true},{"text":"is an SPD/SPSD matrix with exponentially decaying eigenvalues. We consider two settings: (left) ","element":"span"},{"style":{"height":15.54},"width":405.32,"height":38.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/8-14.png","element":"img","alt":" n = p = 500, A = 1n1⊤n","inline":true,"padRight":true},{"text":"is a dense matrix and (right) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"= 500","element":"span"},{"style":{"fontStyle":"italic"},"text":", p ","element":"span"},{"text":"= 100 ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"is a random symmetric matrix with ","element":"span"},{"text":"70% ","element":"span"},{"text":"of entries equal to ","element":"span"},{"text":"1 ","element":"span"},{"text":"and the rest equal to ","element":"span"},{"text":"0","element":"span"},{"text":". We compare RCDlin with RGD (with the ","element":"span"},{"style":{"height":10.8},"width":182.22,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/8-15.png","element":"img","alt":" X = Y Y ⊤","inline":true,"padRight":true},{"text":"factorization). 
We set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"np/","element":"span"},{"text":"5 ","element":"span"},{"text":"and select the coordinates randomly without replacement. In Figure ","element":"span"},{"href":"#id-82","text":"6","element":"a"},{"text":", we observe that RCDlin performs competitively with RGD on the SPD manifold with dense ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"while performing significantly better on the SPSD manifold with sparse ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"text":".","element":"span"}]]},{"heading":"6. Conclusion","paragraphs":[[{"text":"In this work, we discuss how to develop computationally efficient CD updates for a number of matrix manifolds. The main bottleneck in developing CD methods is finding the right basis parameterization of the tangent space. We show precise constructions for various manifolds and propose two CD algorithms: RCD and RCDlin. RCDlin further reduces the gradient computations of RCD. Our experiments show the benefit of our proposed CD updates on a number of problem instances.","element":"span"}]]},{"heading":"Impact Statement","paragraphs":[[{"text":"This paper presents work whose goal is to advance optimization methods with applications in Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.","element":"span"}]]},{"heading":"References","paragraphs":[[{"id":"id-70","text":"Ablin, P. and Peyr","element":"span"},{"text":"é, G. Fast and accurate optimization on the orthogonal manifold without retraction. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Artificial Intelligence and Statistics","element":"span"},{"text":", pp. 5636–5657. 
PMLR, 2022.","element":"span"}],[{"id":"id-38","text":"Ablin, P., Vary, S., Gao, B., and Absil, P.-A. Infeasible ","element":"span"},{"text":"deterministic, stochastic, and variance-reduction algorithms for optimization under orthogonality constraints. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv:2303.16510","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-34","text":"Absil, P.-A., Mahony, R., and Sepulchre, R. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Optimization algorithms on matrix manifolds","element":"span"},{"text":". Princeton University Press, 2008.","element":"span"}],[{"id":"id-18","text":"Arjovsky, M., Shah, A., and Bengio, Y. Unitary evolution ","element":"span"},{"text":"recurrent neural networks. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pp. 1120–1128. PMLR, 2016.","element":"span"}],[{"id":"id-50","text":"Bai, Z. and Li, R.-C. Minimization principles and ","element":"span"},{"text":"computation for the generalized linear response eigenvalue problem. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"BIT Numerical Mathematics","element":"span"},{"text":", 54:31–54, 2014.","element":"span"}],[{"id":"id-1","text":"Bhatia, R. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Positive definite matrices","element":"span"},{"text":". Princeton University Press, 2009.","element":"span"}],[{"id":"id-62","text":"Bhatia, R., Jain, T., and Lim, Y. On the Bures–Wasserstein ","element":"span"},{"text":"distance between positive definite matrices. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Expositiones Mathematicae","element":"span"},{"text":", 37(2):165–191, 2019.","element":"span"}],[{"id":"id-20","text":"Bonnabel, S. Stochastic gradient descent on Riemannian ","element":"span"},{"text":"manifolds. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Automatic Control","element":"span"},{"text":", 58(9):2217–2229, 2013.","element":"span"}],[{"id":"id-35","text":"Boumal, N. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"An introduction to optimization on smooth manifolds","element":"span"},{"text":". Cambridge University Press, 2023.","element":"span"}],[{"id":"id-66","text":"Boumal, N., Mishra, B., Absil, P.-A., and Sepulchre, R. ","element":"span"},{"text":"Manopt, a MATLAB toolbox for optimization on manifolds. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Journal of Machine Learning Research","element":"span"},{"text":", 15(1):1455–1459, 2014.","element":"span"}],[{"id":"id-102","text":"Cardoso, J. R. and Leite, F. S. ","element":"span"},{"text":"Exponentials of skew-symmetric matrices and logarithms of orthogonal matrices. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Computational and Applied Mathematics","element":"span"},{"text":", 233(11):2867–2875, 2010.","element":"span"}],[{"id":"id-46","text":"Darmwal, Y. and Rajawat, K. Low-complexity subspace-","element":"span"},{"text":"descent over symmetric positive definite manifold. Technical report, arXiv preprint arXiv:2305.02041, 2023.","element":"span"}],[{"id":"id-5","text":"Douik, A. and Hassibi, B. ","element":"span"},{"text":"Manifold optimization over the set of doubly stochastic matrices: A second-order geometry. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Signal Processing","element":"span"},{"text":", 67(22):5761–5774, 2019.","element":"span"}],[{"id":"id-0","text":"Edelman, A., Arias, T. A., and Smith, S. T. The geometry ","element":"span"},{"text":"of algorithms with orthogonality constraints. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"SIAM Journal on Matrix Analysis and Applications","element":"span"},{"text":", 20(2):303–353, 1998.","element":"span"}],[{"id":"id-36","text":"Gao, B., Liu, X., and Yuan, Y.-x. Parallelizable algorithms ","element":"span"},{"text":"for optimization problems with orthogonality constraints. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"SIAM Journal on Scientific Computing","element":"span"},{"text":", 41(3):A1949–A1983, 2019.","element":"span"}],[{"id":"id-54","text":"Gao, B., Son, N. T., Absil, P.-A., and Stykel, T. Geometry ","element":"span"},{"text":"of the symplectic Stiefel manifold endowed with the Euclidean metric. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Geometric Science of Information","element":"span"},{"text":", pp. 789–796. Springer, 2021a.","element":"span"}],[{"id":"id-55","text":"Gao, B., Son, N. T., Absil, P.-A., and Stykel, T. Riemannian ","element":"span"},{"text":"optimization on the symplectic Stiefel manifold. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"SIAM Journal on Optimization","element":"span"},{"text":", 31(2):1546–1575, 2021b.","element":"span"}],[{"id":"id-56","text":"Gao, B., Son, N. T., and Stykel, T. Optimization on the ","element":"span"},{"text":"symplectic Stiefel manifold: SR decomposition-based retraction and applications. Technical report, arXiv preprint arXiv:2211.09481, 2022.","element":"span"}],[{"id":"id-42","text":"Gutman, D. H. and Ho-Nguyen, N. Coordinate descent ","element":"span"},{"text":"without coordinates: Tangent subspace descent on Riemannian manifolds. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Mathematics of Operations Research","element":"span"},{"text":", 48(1):127–159, 2023.","element":"span"}],[{"id":"id-23","text":"Han, A. and Gao, J. Improved variance reduction methods ","element":"span"},{"text":"for Riemannian non-convex optimization. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Pattern Analysis and Machine Intelligence","element":"span"},{"text":", 44(11):7610–7623, 2021.","element":"span"}],[{"id":"id-2","text":"Han, A., Mishra, B., Jawanpuria, P. K., and Gao, J. On ","element":"span"},{"text":"Riemannian optimization over positive definite matrices with the Bures-Wasserstein geometry. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", volume 34, pp. 8940–8953, 2021.","element":"span"}],[{"id":"id-16","text":"Han, A., Mishra, B., Jawanpuria, P., and Gao, J. ","element":"span"},{"text":"Riemannian block SPD coupling manifold and its application to optimal transport. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Machine Learning","element":"span"},{"text":", pp. 1–28, 2022.","element":"span"}],[{"id":"id-31","text":"Han, A., Mishra, B., Jawanpuria, P., and Gao, J. Nonconvex-","element":"span"},{"text":"nonconcave min-max optimization on Riemannian manifolds. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Transactions on Machine Learning Research","element":"span"},{"text":", 2023a. ISSN 2835-8856.","element":"span"}],[{"id":"id-30","text":"Han, A., Mishra, B., Jawanpuria, P., Kumar, P., and Gao, J. ","element":"span"},{"text":"Riemannian Hamiltonian methods for min-max optimization on manifolds. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"SIAM Journal on Optimization","element":"span"},{"text":", 33(3):1797–1827, 2023b.","element":"span"}],[{"id":"id-25","text":"Han, A., Mishra, B., Jawanpuria, P., and Gao, J. ","element":"span"},{"text":"Differentially private Riemannian optimization. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Machine Learning","element":"span"},{"text":", 113(3):1133–1161, 2024a.","element":"span"}],[{"id":"id-33","text":"Han, A., Mishra, B., Jawanpuria, P., and Takeda, A. 
A ","element":"span"},{"text":"framework for bilevel optimization on Riemannian manifolds. Technical report, arXiv preprint arXiv:2402.03883, 2024b.","element":"span"}],[{"id":"id-72","text":"Hinton, G., Vinyals, O., and Dean, J. Distilling the ","element":"span"},{"text":"knowledge in a neural network. Technical report, arXiv preprint arXiv:1503.02531, 2015.","element":"span"}],[{"id":"id-83","text":"Huang, M., Ma, S., and Lai, L. A Riemannian block ","element":"span"},{"text":"coordinate descent method for computing the projection robust Wasserstein distance. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pp. 4446–4455. PMLR, 2021.","element":"span"}],[{"id":"id-111","text":"Huang, W., Absil, P.-A., and Gallivan, K. A. A Riemannian ","element":"span"},{"text":"symmetric rank-one trust-region method. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Mathematical Programming","element":"span"},{"text":", 150(2):179–216, 2015.","element":"span"}],[{"id":"id-28","text":"Huang, Z., Huang, W., Jawanpuria, P., and Mishra, B. ","element":"span"},{"text":"Federated learning on Riemannian manifolds with differential privacy. ","element":"span"},{"text":"Technical report, arXiv preprint arXiv:2404.10029, 2024.","element":"span"}],[{"id":"id-9","text":"Jawanpuria, P. and Mishra, B. A unified framework for ","element":"span"},{"text":"structured low-rank matrix learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning (ICML)","element":"span"},{"text":", 2018.","element":"span"}],[{"id":"id-13","text":"Jawanpuria, P., Balgovind, A., Kunchukuttan, A., and ","element":"span"},{"text":"Mishra, B. Learning multilingual word embeddings in a latent metric space: a geometric approach. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Transactions of the Association for Computational Linguistics","element":"span"},{"text":", 7(3): 107–120, 2019a.","element":"span"}],[{"id":"id-78","text":"Jawanpuria, P., Meghwanshi, M., and Mishra, B. Low- ","element":"span"},{"text":"rank approximations of hyperbolic embeddings. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Conference on Decision and Control","element":"span"},{"text":", 2019b.","element":"span"}],[{"id":"id-14","text":"Jawanpuria, P., Meghwanshi, M., and Mishra, B. Geometry- ","element":"span"},{"text":"aware domain adaptation for unsupervised alignment of word embeddings. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Annual Meeting of the Association for Computational Linguistics","element":"span"},{"text":", 2020.","element":"span"}],[{"id":"id-44","text":"Jiang, Y., Zhang, H., Qiu, Y., Xiao, Y., Long, B., and Yang, ","element":"span"},{"text":"W.-Y. Givens coordinate descent methods for rotation matrix learning in trainable embedding indexes. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations","element":"span"},{"text":", 2022.","element":"span"}],[{"id":"id-8","text":"Kasai, H., Jawanpuria, P., and Mishra, B. ","element":"span"},{"text":"Riemannian adaptive stochastic gradient algorithms on matrix manifolds. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning (ICML)","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-57","text":"Knight, P. A. The Sinkhorn–Knopp algorithm: convergence ","element":"span"},{"text":"and applications. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"SIAM Journal on Matrix Analysis and Applications","element":"span"},{"text":", 30(1):261–275, 2008.","element":"span"}],[{"id":"id-11","text":"Kressner, D., Steinlechner, M., and Vandereycken, B. 
Low-","element":"span"},{"text":"rank tensor completion by Riemannian optimization. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"BIT Numerical Mathematics","element":"span"},{"text":", 54:447–468, 2014.","element":"span"}],[{"id":"id-86","text":"Lezcano Casado, M. Trivializations for gradient-based ","element":"span"},{"text":"optimization on manifolds. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", volume 32, 2019.","element":"span"}],[{"id":"id-27","text":"Li, J. and Ma, S. Federated learning on Riemannian ","element":"span"},{"text":"manifolds. Technical report, arXiv preprint arXiv:2206.05668, 2022.","element":"span"}],[{"id":"id-39","text":"Luo, Z.-Q. and Tseng, P. On the convergence of the ","element":"span"},{"text":"coordinate descent method for convex differentiable minimization. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Optimization Theory and Applications","element":"span"},{"text":", 72(1):7–35, 1992.","element":"span"}],[{"id":"id-45","text":"Massart, E. and Abrol, V. Coordinate descent on the ","element":"span"},{"text":"orthogonal group for recurrent neural network training. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"AAAI Conference on Artificial Intelligence","element":"span"},{"text":", volume 36, pp. 7744–7751, 2022.","element":"span"}],[{"id":"id-61","text":"Massart, E. and Absil, P.-A. Quotient geometry with ","element":"span"},{"text":"simple geodesics for the manifold of fixed-rank positive-semidefinite matrices. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"SIAM Journal on Matrix Analysis and Applications","element":"span"},{"text":", 41(1):171–198, 2020.","element":"span"}],[{"id":"id-80","text":"Miller, G. A. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"WordNet: An electronic lexical database","element":"span"},{"text":". 
MIT Press, 1998.","element":"span"}],[{"id":"id-29","text":"Mishra, B., Kasai, H., Jawanpuria, P., and Saroop, A. A ","element":"span"},{"text":"Riemannian gossip approach to subspace learning on Grassmann manifold. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Machine Learning","element":"span"},{"text":", 108(10):1783–1803, 2019.","element":"span"}],[{"id":"id-15","text":"Mishra, B., Satyadev, N., Kasai, H., and Jawanpuria, P. ","element":"span"},{"text":"Manifold optimization for non-linear optimal transport problems. Technical report, arXiv preprint arXiv:2103.00902, 2021.","element":"span"}],[{"id":"id-40","text":"Nesterov, Y. Efficiency of coordinate descent methods on ","element":"span"},{"text":"huge-scale optimization problems. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"SIAM Journal on Optimization","element":"span"},{"text":", 22(2):341–362, 2012.","element":"span"}],[{"id":"id-77","text":"Nickel, M. and Kiela, D. Poincaré ","element":"span"},{"text":"embeddings for learning hierarchical representations. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", volume 30, 2017.","element":"span"}],[{"id":"id-4","text":"Nickel, M. and Kiela, D. Learning continuous hierarchies ","element":"span"},{"text":"in the Lorentz model of hyperbolic geometry. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pp. 3779–3788. PMLR, 2018.","element":"span"}],[{"id":"id-10","text":"Nimishakavi, M., Jawanpuria, P., and Mishra, B. A dual ","element":"span"},{"text":"framework for low-rank tensor completion. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Conference on Neural Information Processing Systems (NeurIPS)","element":"span"},{"text":", 2018.","element":"span"}],[{"id":"id-84","text":"Peng, L. and Vidal, R. 
","element":"span"},{"text":"Block coordinate descent on smooth manifolds. ","element":"span"},{"text":"Technical report, arXiv preprint arXiv:2305.14744, 2023.","element":"span"}],[{"id":"id-12","text":"Pennec, X., Fillard, P., and Ayache, N. A Riemannian ","element":"span"},{"text":"Framework for Tensor Computing. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Journal of Computer Vision","element":"span"},{"text":", 66(1):41–66, 2006.","element":"span"}],[{"id":"id-24","text":"Reimherr, M., Bharath, K., and Soto, C. Differential privacy ","element":"span"},{"text":"over riemannian manifolds. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", volume 34, pp. 12292–12303, 2021.","element":"span"}],[{"id":"id-22","text":"Sato, H., Kasai, H., and Mishra, B. Riemannian stochastic ","element":"span"},{"text":"variance reduced gradient algorithm with retraction and vector transport. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"SIAM Journal on Optimization","element":"span"},{"text":", 29(2): 1444–1472, 2019.","element":"span"}],[{"id":"id-43","text":"Shalit, U. and Chechik, G. Coordinate-descent for learning ","element":"span"},{"text":"orthogonal matrices through Givens rotations. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pp. 548–556. PMLR, 2014.","element":"span"}],[{"id":"id-17","text":"Shi, D., Gao, J., Hong, X., Boris Choy, S., and Wang, Z. ","element":"span"},{"text":"Coupling matrix manifolds assisted optimization for optimal transport problems. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Machine Learning","element":"span"},{"text":", 110:533–558, 2021.","element":"span"}],[{"id":"id-49","text":"Siegel, J. W. Accelerated optimization with orthogonality ","element":"span"},{"text":"constraints. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Computational Mathematics","element":"span"},{"text":", 39 (2):207–206, 2020.","element":"span"}],[{"id":"id-104","text":"Sinkhorn, R. Diagonal equivalence to matrices with pre- ","element":"span"},{"text":"scribed row and column sums. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The American Mathematical Monthly","element":"span"},{"text":", 74(4):402–405, 1967.","element":"span"}],[{"id":"id-21","text":"Tripuraneni, N., Flammarion, N., Bach, F., and Jordan, M. I. ","element":"span"},{"text":"Averaging stochastic gradient descent on Riemannian manifolds. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Conference on Learning Theory","element":"span"},{"text":", pp. 650– 687. PMLR, 2018.","element":"span"}],[{"id":"id-26","text":"Utpala, S., Han, A., Jawanpuria, P., and Mishra, B. Im- ","element":"span"},{"text":"proved differentially private Riemannian optimization: Fast sampling and variance reduction. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Transactions on Machine Learning Research","element":"span"},{"text":", 2022.","element":"span"}],[{"id":"id-3","text":"Vandereycken, B. Low-rank matrix completion by Rieman- ","element":"span"},{"text":"nian optimization. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"SIAM Journal on Optimization","element":"span"},{"text":", 23(2): 1214–1236, 2013.","element":"span"}],[{"id":"id-59","text":"Vandereycken, B., Absil, P.-A., and Vandewalle, S. Embed- ","element":"span"},{"text":"ded geometry of the set of symmetric positive semidefinite matrices of fixed rank. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE/SP Workshop on Statistical Signal Processing","element":"span"},{"text":", pp. 389–392. IEEE, 2009.","element":"span"}],[{"id":"id-60","text":"Vandereycken, B., Absil, P.-A., and Vandewalle, S. 
A ","element":"span"},{"text":"Riemannian geometry with complete geodesics for the set of positive semidefinite matrices of fixed rank. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IMA Journal of Numerical Analysis","element":"span"},{"text":", 33(2):481–514, 2013.","element":"span"}],[{"id":"id-19","text":"Wang, J., Chen, Y., Chakraborty, R., and Yu, S. X. ","element":"span"},{"text":"Orthogonal convolutional neural networks. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Conference on Computer Vision and Pattern Recognition","element":"span"},{"text":", pp. 11505–11515, 2020.","element":"span"}],[{"id":"id-90","text":"Wen, Z. and Yin, W. A feasible method for optimization ","element":"span"},{"text":"with orthogonality constraints. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Mathematical Programming","element":"span"},{"text":", 142(1-2):397–434, 2013.","element":"span"}],[{"id":"id-87","text":"Wilson, B. and Leimeister, M. ","element":"span"},{"text":"Gradient descent in hyperbolic space. ","element":"span"},{"text":"Technical report, arXiv preprint arXiv:1805.08207, 2018.","element":"span"}],[{"id":"id-41","text":"Wright, S. J. Coordinate descent algorithms. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Mathematical Programming","element":"span"},{"text":", 151(1):3–34, 2015.","element":"span"}],[{"id":"id-37","text":"Xiao, N. and Liu, X. Solving optimization problems over ","element":"span"},{"text":"the Stiefel manifold by smooth exact penalty function. Technical report, arXiv preprint arXiv:2110.08986, 2021.","element":"span"}],[{"id":"id-69","text":"Xiao, N., Liu, X., and Yuan, Y.-x. A class of smooth exact ","element":"span"},{"text":"penalty function methods for optimization problems with orthogonality constraints. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Optimization Methods and Software","element":"span"},{"text":", 37(4):1205–1241, 2022.","element":"span"}],[{"id":"id-51","text":"Xiao, N., Liu, X., and Toh, K.-C. Dissolving constraints for ","element":"span"},{"text":"Riemannian optimization. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Mathematics of Operations Research","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-85","text":"Yuan, G. A block coordinate descent method for nonsmooth ","element":"span"},{"text":"composite optimization under orthogonality constraints. Technical report, arXiv preprint arXiv:2304.03641, 2023.","element":"span"}],[{"id":"id-7","text":"Zhang, H., J Reddi, S., and Sra, S. Riemannian SVRG: ","element":"span"},{"text":"Fast stochastic optimization on Riemannian manifolds. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", volume 29, 2016.","element":"span"}],[{"id":"id-32","text":"Zhang, P., Zhang, J., and Sra, S. Sion’s minimax theorem in ","element":"span"},{"text":"geodesic metric spaces and a Riemannian extragradient algorithm. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"SIAM Journal on Optimization","element":"span"},{"text":", 33(4):2885– 2908, 2023.","element":"span"}],[{"id":"id-91","text":"Zhu, X. A Riemannian conjugate gradient method for op- ","element":"span"},{"text":"timization on the Stiefel manifold. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Computational Optimization and Applications","element":"span"},{"text":", 67:73–110, 2017.","element":"span"}]]},{"heading":"A. Additional experiment details","paragraphs":[[{"text":"The experiments are run on a laptop with an i5-10500 3.1GHz CPU processor.","element":"span"}],[{"id":"id-73","style":{"fontWeight":"bold"},"text":"A.1. 
Orthogonal deep networks for distillation","element":"span"}],[{"text":"Here we provide the detailed network architecture for the distillation task. In particular, we define an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":"-layer feed-forward neural network with the tanh activation function, i.e., ","element":"span"},{"style":{"height":16},"width":443.86,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/12-0.png","element":"img","alt":" Xℓ+1 = tanh(WℓXℓ + bℓ)","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":16},"width":120.49,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/12-1.png","element":"img","alt":" ℓ ∈ [L]","inline":true},{"text":", where ","element":"span"},{"style":{"height":16.58},"width":344.29,"height":41.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/12-2.png","element":"img","alt":" W1 ∈ Rdin×d, WL ∈","inline":true},{"style":{"height":16.98},"width":608.9,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/12-3.png","element":"img","alt":"Rd×dout, and Wℓ ∈ Rd×d for ℓ ≠ 1, L","inline":true},{"text":". In the experiment, we set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"= 6","element":"span"},{"text":".","element":"span"}],[{"id":"id-79","style":{"fontWeight":"bold"},"text":"A.2. Learning Lorentz embeddings","element":"span"}],[{"text":"Here we provide the problem formulation for the task of learning Lorentz embeddings. 
Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"u, v","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":"} ","element":"span"},{"text":"be the set of related word pairs and construct ","element":"span"},{"style":{"height":16},"width":441.7,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/12-4.png","element":"img","alt":" Neg(u) = {v : (u, v) ∉ D}","inline":true,"padRight":true},{"text":"as the negative samples of word ","element":"span"},{"style":{"fontStyle":"italic"},"text":"u","element":"span"},{"text":". The objective is to learn embeddings ","element":"span"},{"style":{"height":9.19},"width":41.77,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/12-5.png","element":"img","alt":" xu","inline":true,"padRight":true},{"text":"for all words ","element":"span"},{"style":{"fontStyle":"italic"},"text":"u ","element":"span"},{"text":"by solving","element":"span"}],[{"style":{"width":"46%"},"width":913,"height":115,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/12-6.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":600.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/12-7.png","element":"img","alt":" dist(xu, xv) = arccosh(−⟨xu, xv⟩L)","inline":true,"padRight":true},{"text":"is the Lorentz Riemannian distance.","element":"span"}]]},{"heading":"B. 
A review on coordinate descent for orthogonal and SPD manifold","paragraphs":[[{"text":"We start by reviewing the developments of coordinate descent on the orthogonal manifold (","element":"span"},{"href":"#id-43","referenceIndex":55,"text":"Shalit & Chechik","element":"a"},{"text":", ","element":"span"},{"href":"#id-43","referenceIndex":55,"text":"2014","element":"a"},{"text":"; ","element":"span"},{"href":"#id-45","referenceIndex":42,"text":"Massart ","element":"a"},{"href":"#id-45","referenceIndex":42,"text":"& Abrol","element":"a"},{"text":", ","element":"span"},{"href":"#id-45","referenceIndex":42,"text":"2022","element":"a"},{"text":"; ","element":"span"},{"href":"#id-44","referenceIndex":35,"text":"Jiang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-44","referenceIndex":35,"text":"2022","element":"a"},{"text":"), Stiefel manifold (","element":"span"},{"href":"#id-42","referenceIndex":19,"text":"Gutman & Ho-Nguyen","element":"a"},{"text":", ","element":"span"},{"href":"#id-42","referenceIndex":19,"text":"2023","element":"a"},{"text":"), and symmetric positive definite (SPD) manifold (","element":"span"},{"href":"#id-46","referenceIndex":12,"text":"Darmwal & Rajawat","element":"a"},{"text":", ","element":"span"},{"href":"#id-46","referenceIndex":12,"text":"2023","element":"a"},{"text":"), which motivate the proposed general framework for other manifolds.","element":"span"}],[{"text":"Some other works (","element":"span"},{"href":"#id-83","referenceIndex":28,"text":"Huang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-83","referenceIndex":28,"text":"2021","element":"a"},{"text":"; ","element":"span"},{"href":"#id-84","referenceIndex":51,"text":"Peng & Vidal","element":"a"},{"text":", ","element":"span"},{"href":"#id-84","referenceIndex":51,"text":"2023","element":"a"},{"text":") study (block) coordinate descent on a product of manifolds, where each update concerns a component manifold. 
This differs from our setting, where each update is defined along a coordinate of the tangent space of a single manifold.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"B.1. Orthogonal manifold","element":"span"}],[{"text":"The orthogonal manifold ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":") ","element":"span"},{"text":"is the smooth manifold defined by the orthogonality constraints, i.e., ","element":"span"},{"style":{"height":16},"width":399.46,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/12-8.png","element":"img","alt":" O(n) := {X ∈ Rn×n :","inline":true},{"style":{"height":16},"width":363.89,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/12-9.png","element":"img","alt":"XX⊤ = X⊤X = In}","inline":true},{"text":". The tangent space can be identified as ","element":"span"},{"style":{"height":16},"width":562.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/12-10.png","element":"img","alt":" TXO(n) := {ΩX : Ω ∈ Skew(n)}","inline":true},{"text":". The Riemannian metric coincides with the Euclidean metric up to scaling, i.e., ","element":"span"},{"style":{"height":19.37},"width":435.31,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/12-11.png","element":"img","alt":" ⟨U, V⟩X = (1/2)⟨U, V⟩. The 1/2 ","inline":true,"padRight":true},{"text":"is added to ensure consistency with the canonical metric ","element":"span"},{"text":"for the Stiefel manifold, as we shall see later. 
This leads to the Riemannian gradient ","element":"span"},{"style":{"height":16},"width":616.79,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/12-12.png","element":"img","alt":" gradf(X) = ∇f(X) − X∇f(X)⊤X","inline":true,"padRight":true},{"text":"and the exponential retraction is given by ","element":"span"},{"style":{"height":16},"width":764.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/12-13.png","element":"img","alt":" RetrX(θΩX) = expm(θΩ)X, for some θ ∈ R.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Remark ","element":"span"},{"text":"B.1","element":"span"},{"style":{"fontStyle":"italic"},"text":". ","element":"span"},{"text":"We remark that in all the existing works (","element":"span"},{"href":"#id-43","referenceIndex":55,"text":"Shalit & Chechik","element":"a"},{"text":", ","element":"span"},{"href":"#id-43","referenceIndex":55,"text":"2014","element":"a"},{"text":"; ","element":"span"},{"href":"#id-45","referenceIndex":42,"text":"Massart & Abrol","element":"a"},{"text":", ","element":"span"},{"href":"#id-45","referenceIndex":42,"text":"2022","element":"a"},{"text":"; ","element":"span"},{"href":"#id-44","referenceIndex":35,"text":"Jiang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-44","referenceIndex":35,"text":"2022","element":"a"},{"text":"), the tangent space is parameterized as ","element":"span"},{"style":{"height":16},"width":584.81,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/12-14.png","element":"img","alt":" TXO(n) := {XΩ′ : Ω′ ∈ Skew(n)}","inline":true,"padRight":true},{"text":"and thus the exponential retraction amounts to ","element":"span"},{"style":{"height":16},"width":510.83,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/12-15.png","element":"img","alt":"RetrX(θXΩ′) = Xexpm(θΩ′)","inline":true},{"text":". 
Such a formulation is equivalent to the above by letting ","element":"span"},{"style":{"height":16},"width":415.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/12-16.png","element":"img","alt":" Ω = XΩ′X⊤ ∈ Skew(n)","inline":true},{"text":". Our reformulation allows natural generalization of the coordinate descent framework to column orthonormal matrices (the Stiefel manifold), where the orthogonal matrix is a special case.","element":"span"}],[{"text":"The manifold has a dimension of ","element":"span"},{"style":{"height":16},"width":191.74,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/12-17.png","element":"img","alt":" n(n − 1)/2","inline":true,"padRight":true},{"text":"and its tangent space can be provided with an orthonormal basis ","element":"span"},{"style":{"height":15.59},"width":96.2,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/12-18.png","element":"img","alt":" HijX","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":17.35},"width":319.89,"height":43.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/12-19.png","element":"img","alt":" Hij := eie⊤j − eje⊤i","inline":true,"padRight":true},{"text":"is the basis for the skew-symmetric matrices. 
In each basis direction ","element":"span"},{"style":{"height":15.59},"width":96.2,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/12-20.png","element":"img","alt":" HijX","inline":true},{"text":", the exponential ","element":"span"},{"text":"retraction reduces to the Givens rotation, which allows efficient updates, i.e., ","element":"span"},{"style":{"height":16.79},"width":724.77,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/12-21.png","element":"img","alt":" RetrX(θHijX) = Gij(θ)X where Gij(θ) =","inline":true},{"style":{"height":18.55},"width":903.54,"height":46.37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/12-22.png","element":"img","alt":"In + (cos(θ) − 1)(eie⊤i + eje⊤j ) + sin(θ)(eie⊤j − eje⊤i )","inline":true,"padRight":true},{"text":"is known as the Givens rotation matrix around axes ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i, j ","element":"span"},{"text":"with angle ","element":"span"},{"style":{"height":11.2},"width":60.81,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/12-23.png","element":"img","alt":"−θ.","inline":true}],[{"text":"In order to minimize a function ","element":"span"},{"style":{"height":16},"width":235.81,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/12-24.png","element":"img","alt":" f : O(n) → R","inline":true},{"text":", one needs to update the variables in the negative gradient direction. 
Along a basis direction, coordinate descent aims to minimize the function ","element":"span"},{"style":{"height":16.79},"width":201.65,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/12-25.png","element":"img","alt":" f(Gij(θ)X)","inline":true,"padRight":true},{"text":"with respect to ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/12-26.png","element":"img","alt":" θ","inline":true},{"text":". One strategy is to solve this one-variable optimization problem directly as in (","element":"span"},{"href":"#id-43","referenceIndex":55,"text":"Shalit & Chechik","element":"a"},{"text":", ","element":"span"},{"href":"#id-43","referenceIndex":55,"text":"2014","element":"a"},{"text":"). When the objective is more involved, we can approximately solve this problem by following a descent direction (","element":"span"},{"href":"#id-45","referenceIndex":42,"text":"Massart & Abrol","element":"a"},{"text":", ","element":"span"},{"href":"#id-45","referenceIndex":42,"text":"2022","element":"a"},{"text":"; ","element":"span"},{"href":"#id-44","referenceIndex":35,"text":"Jiang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-44","referenceIndex":35,"text":"2022","element":"a"},{"text":"), which is given by ","element":"span"},{"style":{"height":19.37},"width":769.49,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/12-27.png","element":"img","alt":" −(d/dθ)f(Gij(θ)X)|θ=0 = −⟨gradf(X), HijX⟩X","inline":true},{"text":". 
This leads to the coordinate descent update in the direction of ","element":"span"},{"style":{"height":16.79},"width":472.06,"height":41.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/12-28.png","element":"img","alt":"−⟨gradf(X), HijX⟩XHijX","inline":true},{"text":", which modifies two rows of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X ","element":"span"},{"text":"at every iteration, resulting in an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":") ","element":"span"},{"text":"complexity per update. One pass over all the coordinates requires ","element":"span"},{"style":{"height":21.63},"width":104.66,"height":54.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/12-29.png","element":"img","alt":"n(n−1)/2","inline":true,"padRight":true},{"text":"iterations, leading to ","element":"span"},{"style":{"height":17.39},"width":104.8,"height":43.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/12-30.png","element":"img","alt":" O(n^3)","inline":true,"padRight":true},{"text":"complexity in total, which is comparable to the cost of commonly used retractions, including the exponential, Cayley, and QR retractions.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"B.2. 
Stiefel manifold","element":"span"}],[{"text":"The Stiefel manifold ","element":"span"},{"text":"St(","element":"span"},{"style":{"fontStyle":"italic"},"text":"n, p","element":"span"},{"text":") ","element":"span"},{"text":"is the set of column orthonormal matrices in ","element":"span"},{"style":{"height":11.79},"width":90.39,"height":29.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/13-0.png","element":"img","alt":" Rn×p","inline":true},{"text":", i.e., ","element":"span"},{"style":{"height":16},"width":436.33,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/13-1.png","element":"img","alt":" St(n, p) := {X ∈ Rn×p :","inline":true},{"style":{"height":16.79},"width":208.39,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/13-2.png","element":"img","alt":"X⊤X = Ip}","inline":true},{"text":". When ","element":"span"},{"style":{"height":16},"width":393.1,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/13-3.png","element":"img","alt":" p = n, St(n, n) ≡ O(n)","inline":true},{"text":". The tangent space of the Stiefel manifold is identified as ","element":"span"},{"style":{"height":16},"width":326.78,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/13-4.png","element":"img","alt":" TXSt(n, p) = {U ∈","inline":true},{"style":{"height":18.18},"width":1292.48,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/13-5.png","element":"img","alt":"Rn×p : X⊤U + U⊤X = 0} = {XΩ + X⊥K : Ω ∈ Skew(p), K ∈ R(n−p)×p}","inline":true},{"text":". 
The Euclidean metric turns out to be a valid Riemannian metric (","element":"span"},{"href":"#id-0","referenceIndex":14,"text":"Edelman et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-0","referenceIndex":14,"text":"1998","element":"a"},{"text":"), ","element":"span"},{"style":{"height":16},"width":302.69,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/13-6.png","element":"img","alt":" ⟨U, V⟩X = ⟨U, V⟩","inline":true,"padRight":true},{"text":"for any ","element":"span"},{"style":{"height":14},"width":230.12,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/13-7.png","element":"img","alt":" U, V ∈ TXM","inline":true},{"text":". The Riemannian gradient is derived as ","element":"span"},{"style":{"height":16},"width":728.81,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/13-8.png","element":"img","alt":" gradf(X) = ∇f(X) − Xsym(X⊤∇f(X)).","inline":true}],[{"text":"In (","element":"span"},{"href":"#id-42","referenceIndex":19,"text":"Gutman & Ho-Nguyen","element":"a"},{"text":", ","element":"span"},{"href":"#id-42","referenceIndex":19,"text":"2023","element":"a"},{"text":"), a coordinate (subspace) descent algorithm has been developed for general manifolds by selecting suitable subspaces of the tangent space. Although the paper establishes theoretical guarantees, it provides concrete developments only for the Stiefel manifold (and thus the orthogonal manifold). 
The basis considered for the tangent space of Stiefel manifold is ","element":"span"},{"style":{"height":17.68},"width":696.11,"height":44.19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/13-9.png","element":"img","alt":" {XHij}1≤i 0","inline":true},{"text":", where ","element":"span"},{"style":{"height":15.59},"width":441.03,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/17-20.png","element":"img","alt":" U = XΩpSu + ΩnX⊥Ku","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":15.59},"width":436.68,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/17-21.png","element":"img","alt":" V = XΩpSv + ΩnX⊥Kv","inline":true},{"text":". The Riemannian gradient associated with the metric is ","element":"span"},{"style":{"height":18.34},"width":1107.63,"height":45.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/17-22.png","element":"img","alt":" gradρf(X) = ρXΩpsym(Ω⊤p X⊤∇f(X)) + ΩnX⊥X⊤⊥Ω⊤n ∇f(X)","inline":true},{"text":". 
The quasi-geodesic ","element":"span"},{"text":"retraction is derived by replacing the covariant derivative with the Euclidean derivative, given by ","element":"span"},{"style":{"height":17.54},"width":271.21,"height":43.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/17-23.png","element":"img","alt":" RetrqgeoX (tU) =","inline":true},{"style":{"height":38.8},"width":613.64,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/17-24.png","element":"img","alt":"[X, U]expm�t�−ΩpW ΩpU ⊤ΩnUI2p −ΩpW","inline":true}],[{"style":{"height":38.8},"width":312.77,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/17-25.png","element":"img","alt":"� � �expm(tΩnW)","inline":true,"padRight":true},{"text":"0 ","element":"span"},{"style":{"height":38.8},"width":21,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/17-26.png","element":"img","alt":"�","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":13.19},"width":251.55,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/17-27.png","element":"img","alt":" W = X⊤ΩnU","inline":true},{"text":". The symplectic Cayley retraction is","element":"span"}],[{"style":{"width":"99%"},"width":1945,"height":74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/17-28.png","element":"img"}],[{"style":{"height":19.37},"width":465.5,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/17-29.png","element":"img","alt":"GX = I2n − 12XωpX⊤Ω⊤n","inline":true,"padRight":true},{"text":". In (","element":"span"},{"href":"#id-56","referenceIndex":18,"text":"Gao et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-56","referenceIndex":18,"text":"2022","element":"a"},{"text":"), an SR decomposition-based retraction is proposed. 
That is, let ","element":"span"},{"style":{"height":16.79},"width":582.76,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/17-30.png","element":"img","alt":"P2p := [e1, e3, ..., e2p−1, e2, ..., e2p]","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":11.59},"width":31.56,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/17-31.png","element":"img","alt":" ej","inline":true,"padRight":true},{"text":"is the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-th basis vector of ","element":"span"},{"style":{"height":13.39},"width":61.67,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/17-32.png","element":"img","alt":" R2p","inline":true},{"text":". Then denote a congruence matrix set as ","element":"span"},{"style":{"height":21.18},"width":612.05,"height":52.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/17-33.png","element":"img","alt":"Tsk(P2p) := {P ⊤2p ˆRP2p : ˆR ∈ R2p×2p","inline":true,"padRight":true},{"text":"is upper triangular","element":"span"},{"style":{"height":16},"width":1028.33,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/17-34.png","element":"img","alt":"}. Then the RetrsrX(tU) = sf(X + tU) where A = SR is the SR","inline":true,"padRight":true},{"text":"decomposition of ","element":"span"},{"style":{"height":18.18},"width":1030.22,"height":45.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/17-35.png","element":"img","alt":" A ∈ R2n×2p, with S ∈ Sp(2n, 2p) and R ∈ T2p(P2p) and sf(A)","inline":true,"padRight":true},{"text":"extracts the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S ","element":"span"},{"text":"factor.","element":"span"}],[{"text":"E.4.2. 
P","element":"span"},{"text":"ROOFS","element":"span"}],[{"text":"The proof of Proposition ","element":"span"},{"href":"#id-99","text":"3.6 ","element":"a"},{"text":"follows immediately from the following Lemma.","element":"span"}],[{"id":"id-100","style":{"fontWeight":"bold"},"text":"Lemma E.5. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any ","element":"span"},{"style":{"height":16},"width":1036.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/17-36.png","element":"img","alt":" S ∈ Sym(2n), we have expm(tΩnS), expm(tSΩn) ∈ Sp(n, n).","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-100","style":{"fontStyle":"italic"},"text":"E.5","element":"a"},{"style":{"fontStyle":"italic"},"text":". ","element":"span"},{"text":"The proof follows from (","element":"span"},{"href":"#id-55","referenceIndex":17,"text":"Gao et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-55","referenceIndex":17,"text":"2021b","element":"a"},{"text":", Proposition 4.6).","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Proposition ","element":"span"},{"href":"#id-99","style":{"fontStyle":"italic"},"text":"3.6","element":"a"},{"style":{"fontStyle":"italic"},"text":". 
","element":"span"},{"text":"Similar to the previous sections, let ","element":"span"},{"style":{"height":16},"width":320.07,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-0.png","element":"img","alt":" c(t) := RetrX(tU)","inline":true,"padRight":true},{"text":"and we have ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c","element":"span"},{"text":"(0) = ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":16},"width":126.42,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-1.png","element":"img","alt":" c′(0) =","inline":true},{"style":{"height":13.19},"width":197.5,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-2.png","element":"img","alt":"SΩnX = U","inline":true},{"text":". Finally from Lemma ","element":"span"},{"href":"#id-100","text":"E.5","element":"a"},{"text":", we have ","element":"span"},{"style":{"height":16.79},"width":1178.88,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-3.png","element":"img","alt":" c(t)⊤Ωnc(t) = X⊤expm(tSΩn)⊤Ωnexpm(tSΩn)X = X⊤ΩnX = Ωp,","inline":true,"padRight":true},{"text":"which verifies ","element":"span"},{"style":{"height":16},"width":248.79,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-4.png","element":"img","alt":" c(t) ∈ Sp(n, p)","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof of Proposition ","element":"span"},{"href":"#id-101","style":{"fontStyle":"italic"},"text":"3.7","element":"a"},{"style":{"fontStyle":"italic"},"text":". 
","element":"span"},{"text":"We can partition the basis ","element":"span"},{"style":{"height":39.41},"width":335.5,"height":98.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-5.png","element":"img","alt":" Eij =�Eij,1 Eij,2E⊤ij,2 Eij,3","inline":true}],[{"style":{"width":"96%"},"width":1876,"height":93,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-6.png","element":"img"}],[{"text":"For ","element":"span"},{"style":{"height":14},"width":251.26,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-7.png","element":"img","alt":" 1 ≤ i ≤ j ≤ n","inline":true},{"text":", we have ","element":"span"},{"style":{"height":15.59},"width":258.27,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-8.png","element":"img","alt":" Eij,2, Eij,3 = 0","inline":true,"padRight":true},{"text":"and thus we obtain ","element":"span"},{"style":{"height":22.18},"width":867.83,"height":55.45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-9.png","element":"img","alt":" exp(θEijΩn) = I + θEijΩn + θ22 (EijΩn)2 + · · · =","inline":true},{"style":{"height":15.59},"width":190.56,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-10.png","element":"img","alt":"I + θEijΩn","inline":true},{"text":". Similarly for ","element":"span"},{"style":{"height":16.79},"width":1303.6,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-11.png","element":"img","alt":" n + 1 ≤ i < j ≤ 2n, we have Eij,1, Eij,2 = 0 and expm(θEijΩn) = I + θEijΩn","inline":true},{"text":". 
This verifies ","element":"span"},{"style":{"height":14},"width":639.23,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-12.png","element":"img","alt":"∀1 ≤ i ≤ j ≤ n or n + 1 ≤ i < j ≤ 2n","inline":true}],[{"style":{"width":"31%"},"width":611,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-13.png","element":"img"}],[{"text":"For ","element":"span"},{"style":{"height":14},"width":355.84,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-14.png","element":"img","alt":" 1 ≤ i ≤ n < j ≤ 2n","inline":true},{"text":", we have ","element":"span"},{"style":{"height":15.59},"width":300.06,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-15.png","element":"img","alt":" Eij,1 = Eij,3 = 0","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.35},"width":250.12,"height":43.37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-16.png","element":"img","alt":" Eij,2 = eie⊤j−n","inline":true},{"text":". 
Then ","element":"span"},{"style":{"height":39.41},"width":419.1,"height":98.53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-17.png","element":"img","alt":" EijΩn =�−Eij,2 00 E⊤ij,2","inline":true}],[{"text":"notice that for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"= 2","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"3","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"4","element":"span"},{"style":{"fontStyle":"italic"},"text":", ...","element":"span"}],[{"style":{"width":"83%"},"width":780,"height":121,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-18.png","element":"img"}],[{"text":"This suggests for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"= 2","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"3","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"4","element":"span"},{"style":{"fontStyle":"italic"},"text":", ...","element":"span"},{"text":",","element":"span"}],[{"style":{"width":"67%"},"width":630,"height":256,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-19.png","element":"img"}],[{"text":"This verifies for ","element":"span"},{"style":{"height":14},"width":342.65,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-20.png","element":"img","alt":" 1 ≤ i ≤ n < j ≤ 2n,","inline":true}],[{"style":{"width":"69%"},"width":655,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-21.png","element":"img"}],[{"text":"The proof is now complete.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"E.5. 
Doubly stochastic manifold","element":"span"}],[{"text":"For the doubly stochastic manifold, the retraction applies the Sinkhorn algorithm (","element":"span"},{"href":"#id-57","referenceIndex":37,"text":"Knight","element":"a"},{"text":", ","element":"span"},{"href":"#id-57","referenceIndex":37,"text":"2008","element":"a"},{"text":") for matrix balancing, i.e., ","element":"span"},{"style":{"height":16},"width":639.26,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-22.png","element":"img","alt":"RetrX(tU) = SK(X ⊙ exp(tU ⊘ X))","inline":true},{"text":", where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U ","element":"span"},{"text":"is a tangent vector belonging to the tangent space ","element":"span"},{"style":{"height":16},"width":178.91,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-23.png","element":"img","alt":" TXΠ(µ, ν)","inline":true,"padRight":true},{"text":"and the Sinkhorn algorithm ","element":"span"},{"text":"SK(","element":"span"},{"style":{"fontStyle":"italic"},"text":"U","element":"span"},{"text":") ","element":"span"},{"text":"iteratively normalizes the rows and columns of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U ","element":"span"},{"text":"according to the given marginals (","element":"span"},{"href":"#id-17","referenceIndex":56,"text":"Shi et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-17","referenceIndex":56,"text":"2021","element":"a"},{"text":"; ","element":"span"},{"href":"#id-102","referenceIndex":11,"text":"Cardoso & Leite","element":"a"},{"text":", ","element":"span"},{"href":"#id-102","referenceIndex":11,"text":"2010","element":"a"},{"text":").","element":"span"}],[{"text":"E.5.1. 
P","element":"span"},{"text":"ROOFS","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Proposition ","element":"span"},{"href":"#id-103","style":{"fontStyle":"italic"},"text":"3.9","element":"a"},{"style":{"fontStyle":"italic"},"text":". ","element":"span"},{"text":"In fact, we can show that, for any ","element":"span"},{"style":{"height":16.79},"width":473.77,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-24.png","element":"img","alt":" Hijkl := (ei − ej)(ek − el)⊤","inline":true,"padRight":true},{"text":"with ","element":"span"},{"style":{"height":15.2},"width":210.54,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-25.png","element":"img","alt":" i ≠ j, k ≠ l","inline":true},{"text":", the coordinate Sinkhorn is a valid retraction along the direction ","element":"span"},{"style":{"height":16.79},"width":766.02,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-26.png","element":"img","alt":" Hijkl. Let c(t) := cSK(X ⊙ exp(tHijkl ⊘ X))","inline":true,"padRight":true},{"text":"and we can immediately see ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c","element":"span"},{"text":"(0) = ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"text":". Also, let ","element":"span"},{"style":{"height":16.79},"width":595.1,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-27.png","element":"img","alt":"�X := X ⊙exp(tHijkl ⊘X). 
Then �X","inline":true,"padRight":true},{"text":"differs from ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X ","element":"span"},{"text":"only in the entries at ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i, k","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"j, k","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i, l","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"j, l","element":"span"},{"text":")","element":"span"},{"text":", which form the ","element":"span"},{"style":{"height":10.8},"width":88.61,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-28.png","element":"img","alt":" 2 × 2","inline":true,"padRight":true},{"text":"sub-matrix that we wish to balance. Also, by definition, the marginals are given by ","element":"span"},{"style":{"height":16},"width":230.06,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-29.png","element":"img","alt":" ˜µ := ([X]ik +","inline":true},{"style":{"height":16.79},"width":985.22,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-30.png","element":"img","alt":"[X]il, [X]jk + [X]jl) and ˜ν := ([X]ik + [X]jk, [X]il + [X]jl)","inline":true},{"text":". It readily holds that ","element":"span"},{"style":{"height":14},"width":226.9,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-31.png","element":"img","alt":" ˜µ⊤12 = ˜ν⊤12","inline":true},{"text":". 
For notational purposes, for any matrix ","element":"span"},{"style":{"height":16.58},"width":487.86,"height":41.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-32.png","element":"img","alt":" A ∈ Rm×n, let Aijkl ∈ Rm×n ","inline":true,"padRight":true},{"text":"be the matrix that zeros out the entries except for the ","element":"span"},{"style":{"height":10.8},"width":88.54,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-33.png","element":"img","alt":" 2 × 2","inline":true,"padRight":true},{"text":"sub-matrix. Also, we denote ","element":"span"},{"style":{"height":38.8},"width":371.54,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-34.png","element":"img","alt":" A♭ijkl :=�[A]ik [A]il[A]jk [A]jl","inline":true}],[{"text":"Sinkhorn on ","element":"span"},{"style":{"height":17.72},"width":86.7,"height":44.29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-35.png","element":"img","alt":"�X♭ijkl ","inline":true,"padRight":true},{"text":"with marginals ","element":"span"},{"style":{"height":14},"width":63.98,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-36.png","element":"img","alt":" ˜µ, ˜ν","inline":true,"padRight":true},{"text":"with other entries of ","element":"span"},{"style":{"height":10.8},"width":35,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/18-37.png","element":"img","alt":" �X","inline":true,"padRight":true},{"text":"unchanged. 
This is well-defined as the Sinkhorn algorithm ","element":"span"},{"text":"converges to the unique doubly stochastic matrix of the form ","element":"span"},{"style":{"height":18.92},"width":342.21,"height":47.29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-0.png","element":"img","alt":" diag(u) �X♭ijkldiag(v)","inline":true,"padRight":true},{"text":"for some positive vectors ","element":"span"},{"style":{"fontStyle":"italic"},"text":"u, v ","element":"span"},{"text":"(","element":"span"},{"href":"#id-104","referenceIndex":58,"text":"Sinkhorn","element":"a"},{"text":", ","element":"span"},{"href":"#id-104","referenceIndex":58,"text":"1967","element":"a"},{"text":"). This verifies that ","element":"span"},{"style":{"height":16},"width":138.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-1.png","element":"img","alt":" cSK( �X)","inline":true,"padRight":true},{"text":"results in a doubly stochastic matrix, which remains on the manifold. Lastly, it remains to show that ","element":"span"},{"style":{"height":16.79},"width":219.29,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-2.png","element":"img","alt":" c′(0) = Hijkl","inline":true},{"text":". For this, we first have","element":"span"}],[{"style":{"width":"74%"},"width":1447,"height":85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-3.png","element":"img"}],[{"text":"where we use the first-order approximation of the exponential operations. 
Notice that ","element":"span"},{"style":{"height":16.79},"width":290.62,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-4.png","element":"img","alt":" cSK(X + tHijkl)","inline":true,"padRight":true},{"text":"only modifies the ","element":"span"},{"style":{"height":10.8},"width":72.64,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-5.png","element":"img","alt":"2×2","inline":true,"padRight":true},{"text":"sub-matrix of ","element":"span"},{"style":{"height":18.91},"width":526.86,"height":47.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-6.png","element":"img","alt":" X by SK(X♭ijkl+tH♭ijkl). From (","inline":true},{"href":"#id-5","referenceIndex":13,"text":"Douik & Hassibi","element":"a"},{"text":", ","element":"span"},{"href":"#id-5","referenceIndex":13,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-17","referenceIndex":56,"text":"Shi et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-17","referenceIndex":56,"text":"2021","element":"a"},{"text":"), we have ","element":"span"},{"style":{"height":18.91},"width":351.29,"height":47.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-7.png","element":"img","alt":" SK(X♭ijkl+tH♭ijkl) ≈","inline":true},{"style":{"height":17.72},"width":238.84,"height":44.29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-8.png","element":"img","alt":"X♭ijkl + tH♭ijkl","inline":true},{"text":". 
This suggests ","element":"span"},{"style":{"height":18.92},"width":798.38,"height":47.29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-9.png","element":"img","alt":" limt→0(SK(X♭ijkl + tH♭ijkl) − X♭ijkl)/t = H♭ijkl,","inline":true,"padRight":true},{"text":"which verifies ","element":"span"},{"style":{"height":16.79},"width":231.53,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-10.png","element":"img","alt":" c′(0) = Hijkl.","inline":true}],[{"style":{"width":"11%"},"width":105,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-11.png","element":"img"}],[{"id":"id-58","style":{"fontWeight":"bold"},"text":"Lemma E.6. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Given a positive ","element":"span"},{"style":{"height":10.8},"width":88.49,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-12.png","element":"img","alt":" 2 × 2","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"matrix ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"=","element":"span"}],[{"style":{"height":38.8},"width":100.8,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-13.png","element":"img","alt":"�a b","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"c ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"style":{"height":38.8},"width":21,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-14.png","element":"img","alt":"�","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":", the Sinkhorn algorithm on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"style":{"fontStyle":"italic"},"text":"with marginals 
","element":"span"},{"style":{"height":16},"width":267.89,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-15.png","element":"img","alt":" p = [p1, p2], q =","inline":true}],[{"style":{"height":16},"width":210.18,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-16.png","element":"img","alt":"[q1, q2] ∈ ∆2","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"converges to","element":"span"},{"style":{"height":38.8},"width":225.7,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-17.png","element":"img","alt":"�c11a c12bc21c c22d�","inline":true},{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c","element":"span"},{"style":{"height":16},"width":609.46,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-18.png","element":"img","alt":"12 = p1/(κa + b), c22 = p2/(κc + d)","inline":true},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c","element":"span"},{"style":{"height":10.4},"width":356.99,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-19.png","element":"img","alt":"11 = κc12, c21 = κc22","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"κ","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"is the positive root of the equation ","element":"span"},{"style":{"height":19.2},"width":869.23,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-20.png","element":"img","alt":" q2acκ2 +�(bc + ad)q2 − bcp1 − adp2�κ − bdq1 = 0.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-58","style":{"fontStyle":"italic"},"text":"E.6","element":"a"},{"style":{"fontStyle":"italic"},"text":". 
","element":"span"},{"text":"Sinkhorn algorithm converges to the unique doubly stochastic matrix of the form ","element":"span"},{"text":"diag(","element":"span"},{"style":{"fontStyle":"italic"},"text":"u","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"text":"diag(","element":"span"},{"style":{"fontStyle":"italic"},"text":"v","element":"span"},{"text":")","element":"span"}],[{"text":"for ","element":"span"},{"style":{"height":16},"width":411.02,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-21.png","element":"img","alt":" u = [u1, u2], v = [v1, v2]","inline":true},{"text":". From the constraints, ","element":"span"},{"text":"diag(","element":"span"},{"style":{"height":16},"width":310.54,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-22.png","element":"img","alt":"u)Adiag(v)12 = p","inline":true,"padRight":true},{"text":"and ","element":"span"},{"text":"diag(","element":"span"},{"style":{"height":16},"width":334.43,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-23.png","element":"img","alt":"v)A⊤diag(u)12 = q","inline":true,"padRight":true},{"text":"we need to","element":"span"}],[{"id":"id-105","style":{"width":"99%"},"width":1943,"height":269,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-24.png","element":"img"}],[{"text":"Let ","element":"span"},{"style":{"height":10},"width":783.09,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-25.png","element":"img","alt":" c11 = u1v1, c12 = u1v2, c21 = u2v1, c22 = u2v2","inline":true,"padRight":true},{"text":"which transforms (","element":"span"},{"href":"#id-105","text":"6","element":"a"},{"text":") into a set of linear equations for the variables 
","element":"span"},{"style":{"height":10},"width":255.31,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-26.png","element":"img","alt":"c11, c12, c21, c22","inline":true},{"text":". The equation system, however, is under-determined and has many solutions. The unique solution that is sought should satisfy ","element":"span"},{"style":{"height":16},"width":373.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-27.png","element":"img","alt":" c11/c12 = c21/c22 = κ","inline":true},{"text":". To this end, from the first two equations, we obtain","element":"span"}],[{"style":{"width":"34%"},"width":670,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-28.png","element":"img"}],[{"text":"Substituting the expressions to the last equation yields ","element":"span"},{"style":{"height":21.81},"width":304.34,"height":54.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-29.png","element":"img","alt":"bp1κa+b + dp2κc+d = q2","inline":true},{"text":", which we solve for ","element":"span"},{"style":{"height":7.2},"width":23,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-30.png","element":"img","alt":" κ","inline":true,"padRight":true},{"text":"as the positive root of ","element":"span"},{"style":{"height":19.2},"width":869.24,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-31.png","element":"img","alt":"q2acκ2 +�(bc + ad)q2 − bcp1 − adp2�κ − bdq1 = 0.","inline":true}]]},{"heading":"F. Formal developments and proofs for Section 4","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"F.1. Developments","element":"span"}],[{"id":"id-106","style":{"fontWeight":"bold"},"text":"Assumption F.1. 
","element":"span"},{"text":"Consider a neighbourhood ","element":"span"},{"style":{"height":13.2},"width":135.39,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-32.png","element":"img","alt":" X ⊆ M","inline":true,"padRight":true},{"text":"that contains a critical point.","element":"span"}],[{"id":"id-107","href":"#id-106","text":"F.1","element":"a"},{"text":".1 The basis and its projection are bounded. Let the projection onto the basis ","element":"span"},{"style":{"height":15.59},"width":80.02,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-33.png","element":"img","alt":" Bℓ,X","inline":true,"padRight":true},{"text":"be ","element":"span"},{"style":{"height":17.68},"width":496.46,"height":44.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-34.png","element":"img","alt":" PBℓ,X(U) := ⟨U, Bℓ,X⟩XBℓ,X","inline":true},{"text":". There exists constant ","element":"span"},{"style":{"height":15.59},"width":166.24,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-35.png","element":"img","alt":" cb, cp > 0","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":18.17},"width":734.37,"height":45.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-36.png","element":"img","alt":" ∀X ∈ X, U ∈ TXM, ℓ ∈ I, ∥Bℓ,X∥2X ≤ cb","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":19.07},"width":352.88,"height":47.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-37.png","element":"img","alt":" �ℓ∈I ∥PBℓ,XU∥2X ≥","inline":true},{"style":{"height":18.18},"width":147.36,"height":45.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-38.png","element":"img","alt":"cp∥U∥2X.","inline":true}],[{"id":"id-108","href":"#id-106","text":"F.1","element":"a"},{"text":".2 The objective 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"is retraction ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":"-smooth in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"text":", i.e., ","element":"span"},{"style":{"height":37.4},"width":1863.04,"height":93.51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-39.png","element":"img","alt":" f(RetrX(U)) − f(U) − ⟨gradf(X), U⟩X ≤ L2 ∥U∥2X, ∀X ∈ Xand U ∈ TXM such that RetrX(U) ∈ X.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Remark ","element":"span"},{"text":"F.2","element":"span"},{"style":{"fontStyle":"italic"},"text":". ","element":"span"},{"text":"Assumption ","element":"span"},{"href":"#id-106","text":"F.1.","element":"a"},{"href":"#id-107","text":"1 ","element":"a"},{"text":"requires the basis has a bounded norm and the projection of any tangent vector onto the basis does not vanish. Such an assumption is manifold-specific and we can verify that ","element":"span"},{"style":{"height":18.18},"width":150.18,"height":45.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-40.png","element":"img","alt":" ∥Bℓ,X∥2X","inline":true,"padRight":true},{"text":"has an upper bound ","element":"span"},{"text":"(e.g., for Stiefel and Grassmann, ","element":"span"},{"style":{"height":18.18},"width":416.75,"height":45.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-41.png","element":"img","alt":" ∥Bℓ,X∥2X ≤ ∥Hi,j∥2F = 2","inline":true},{"text":"). Then we note that the second requirement trivially holds for ","element":"span"},{"text":"orthonormal basis due to the decomposition of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U ","element":"span"},{"text":"and Jensen’s inequality. For non-orthonormal basis, this assumption also holds as long as projection of a tangent vector does not vanish. 
Assumption ","element":"span"},{"href":"#id-106","text":"F.1.","element":"a"},{"href":"#id-108","text":"2 ","element":"a"},{"text":"can also be satisfied by the compactness of the domain, e.g., we can take ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"in Assumption ","element":"span"},{"href":"#id-106","text":"F.1.","element":"a"},{"href":"#id-108","text":"2 ","element":"a"},{"text":"to be ","element":"span"},{"style":{"height":23.82},"width":739.41,"height":59.55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/19-42.png","element":"img","alt":" L = maxX∈X,U∈TXM:∥U∥X=1d2f(RetrX(tU))dt2","inline":true,"padRight":true},{"text":". These are all reasonable assumptions within a compact neighbourhood ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem ","element":"span"},{"href":"#id-64","style":{"fontWeight":"bold"},"text":"4.1 ","element":"a"},{"text":"(Formal)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumption ","element":"span"},{"href":"#id-106","style":{"fontStyle":"italic"},"text":"F.1","element":"a"},{"style":{"fontStyle":"italic"},"text":", consider RCD algorithm with ","element":"span"},{"style":{"height":15.72},"width":213.43,"height":39.29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/20-0.png","element":"img","alt":" S = 1 and ℓsk ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"selected uniformly at random","element":"span"}],[{"style":{"width":"98%"},"width":1910,"height":55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/20-1.png","element":"img"}],[{"text":"To analyze the cyclic variant, we further require the assumption that bounds the difference between Riemannian distance and distance induced by general retraction. 
In addition, we require gradient Lipschitzness.","element":"span"}],[{"id":"id-109","style":{"fontWeight":"bold"},"text":"Assumption F.3. ","element":"span"},{"text":"Under the same settings as in Assumption ","element":"span"},{"href":"#id-106","text":"F.1","element":"a"},{"text":",","element":"span"}],[{"id":"id-110","href":"#id-109","text":"F.3","element":"a"},{"text":".1 For all ","element":"span"},{"style":{"height":16},"width":389.49,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/20-2.png","element":"img","alt":" X, Y = RetrX(U) ∈ X","inline":true},{"text":", there exist constants ","element":"span"},{"style":{"height":19.12},"width":935.86,"height":47.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/20-3.png","element":"img","alt":" ϑ0, ϑ1 > 0 such that ϑ0∥U∥2X ≤ dist2(X, Y ) ≤ ϑ1∥U∥2X.","inline":true,"padRight":true},{"id":"id-112","href":"#id-109","text":"F.3","element":"a"},{"text":".2 The objective has retraction ","element":"span"},{"style":{"height":15.59},"width":43.12,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/20-4.png","element":"img","alt":" Lg","inline":true},{"text":"-Lipschitz gradient, i.e., ","element":"span"},{"style":{"height":36.51},"width":1864.3,"height":91.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/20-5.png","element":"img","alt":" ∥gradf(X) − T XY gradf(Y )∥2X ≤ Lg∥U∥2X, ∀X, Y =RetrX(U) ∈ X","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.78},"width":55.83,"height":44.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/20-6.png","element":"img","alt":" T YX","inline":true,"padRight":true},{"text":"is an isometric vector transport that satisfies ","element":"span"},{"style":{"height":17.78},"width":632.39,"height":44.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/20-7.png","element":"img","alt":" ⟨T YX U, T YX V 
⟩X, ∀X, Y ∈","inline":true},{"style":{"height":14},"width":289.05,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/20-8.png","element":"img","alt":"X, U, V ∈ TXM.","inline":true,"padRight":true},{"href":"#id-109","text":"F.3","element":"a"},{"text":".3 For any fixed coordinate index ","element":"span"},{"style":{"height":11.6},"width":109,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/20-9.png","element":"img","alt":" ℓ ∈ I","inline":true},{"text":", there exists a constant ","element":"span"},{"style":{"height":11.6},"width":112.78,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/20-10.png","element":"img","alt":" υ > 0","inline":true,"padRight":true},{"text":"such that for all ","element":"span"},{"style":{"height":14},"width":405.4,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/20-11.png","element":"img","alt":" X, Y ∈ X, V ∈ TY M","inline":true},{"text":", ","element":"span"},{"style":{"height":19.07},"width":549.3,"height":47.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/20-12.png","element":"img","alt":"∥PBℓ,XT XY V ∥2X ≥ υ∥PBℓ,Y V ∥2Y .","inline":true}],[{"id":"id-113","style":{"fontStyle":"italic"},"text":"Remark ","element":"span"},{"text":"F.4","element":"span"},{"style":{"fontStyle":"italic"},"text":". ","element":"span"},{"text":"Assumption ","element":"span"},{"href":"#id-109","text":"F.3.","element":"a"},{"href":"#id-110","text":"1 ","element":"a"},{"text":"bounds the difference between the Riemannian distance (related to the inverse exponential map) and the inverse retraction. 
Because a retraction is a first-order approximation to the exponential map, this assumption naturally holds when the domain is sufficiently small (see (","element":"span"},{"href":"#id-111","referenceIndex":29,"text":"Huang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-111","referenceIndex":29,"text":"2015","element":"a"},{"text":"; ","element":"span"},{"href":"#id-22","referenceIndex":54,"text":"Sato et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-22","referenceIndex":54,"text":"2019","element":"a"},{"text":")). Assumption ","element":"span"},{"href":"#id-109","text":"F.3.","element":"a"},{"href":"#id-112","text":"2 ","element":"a"},{"text":"is further required because when a general retraction is used, gradient Lipschitzness is not equivalent to function smoothness. Assumption ","element":"span"},{"href":"#id-109","text":"F.3.","element":"a"},{"href":"#id-113","text":"3 ","element":"a"},{"text":"further requires that the difference between the same coordinate basis on different tangent spaces is bounded. We note that the RHS is identical to ","element":"span"},{"style":{"height":20.95},"width":284.04,"height":52.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/20-13.png","element":"img","alt":" ∥PT XY Bℓ,Y T XY V ∥","inline":true,"padRight":true},{"text":"due to the isometric vector transport. Then it reduces to whether ","element":"span"},{"style":{"height":18.17},"width":300.27,"height":45.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/20-14.png","element":"img","alt":" T XY Bℓ,Y and Bℓ,X","inline":true,"padRight":true},{"text":"are related, which is expected because, by the compactness of the domain, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X ","element":"span"},{"text":"is at a bounded distance from ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Y ","element":"span"},{"text":". 
This allows us to establish convergence for the cyclic selection of the basis.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem ","element":"span"},{"href":"#id-65","style":{"fontWeight":"bold"},"text":"4.2 ","element":"a"},{"text":"(Formal)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumption ","element":"span"},{"href":"#id-106","style":{"fontStyle":"italic"},"text":"F.1","element":"a"},{"style":{"fontStyle":"italic"},"text":", consider the RCD algorithm with ","element":"span"},{"style":{"height":16.52},"width":692.53,"height":41.29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/20-15.png","element":"img","alt":" S = |I| and ℓsk = s+1 for s = 0, ..., |I|−1.","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"Then selecting ","element":"span"},{"style":{"height":21.72},"width":1713.62,"height":54.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/20-16.png","element":"img","alt":" η = 1Lcb gives min0≤k≤K−1 ∥gradf(Xk)∥Xk ≤ C∆0K , where C = 4Lc2bc−1p υ−1(1+|I|2c−1b L−2Lgϑ1ϑ−10 ).","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Theorem ","element":"span"},{"href":"#id-114","style":{"fontWeight":"bold"},"text":"4.3 ","element":"a"},{"text":"(Formal)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumption ","element":"span"},{"href":"#id-106","style":{"fontStyle":"italic"},"text":"F.1","element":"a"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"href":"#id-109","style":{"fontStyle":"italic"},"text":"F.3 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and further let ","element":"span"},{"style":{"height":14},"width":202.58,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/20-17.png","element":"img","alt":" ω0, ω1 > 0","inline":true},{"style":{"fontStyle":"italic"},"text":", such that for any fixed epoch ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"height":19.64},"width":1302.04,"height":49.09,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/20-18.png","element":"img","alt":"ω0⟨∇f(Xsk), Bℓsk⟩2 ≤ θsk⟨∇f(Xsk), Bℓsk⟩ ≤ ω1⟨∇f(Xsk), Bℓsk⟩2, ∀s ≤ Smax − 1.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"(Randomized). Consider RCDlin algorithm with ","element":"span"},{"style":{"height":15.32},"width":349.23,"height":38.29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/20-19.png","element":"img","alt":" 1 < S ≤ Smax and ℓsk ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"selected uniformly at random from ","element":"span"},{"style":{"fontStyle":"italic"},"text":"I","element":"span"},{"style":{"fontStyle":"italic"},"text":". Then choosing","element":"span"}],[{"style":{"width":"72%"},"width":1402,"height":60,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/20-20.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"(Cyclic). 
Suppose ","element":"span"},{"style":{"height":16},"width":202.85,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/20-21.png","element":"img","alt":" Smax ≥ |I|","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and consider RCDlin algorithm with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"|I| ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":15.31},"width":201.58,"height":38.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/20-22.png","element":"img","alt":" ℓsk = s + 1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for ","element":"span"},{"style":{"height":16},"width":268.34,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/20-23.png","element":"img","alt":" s = 0, ..., |I| −","inline":true,"padRight":true},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
Then choosing ","element":"span"},{"style":{"height":20.75},"width":180.58,"height":51.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/20-24.png","element":"img","alt":" η = ω0Lcbω21","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":", we have ","element":"span"},{"style":{"height":22.79},"width":637.5,"height":56.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/20-25.png","element":"img","alt":" min0≤k≤K−1 ∥gradf(Xsk)∥2Xsk ≤ �C∆0K","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":", where ","element":"span"},{"style":{"height":20.76},"width":498.88,"height":51.9,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/20-26.png","element":"img","alt":" �C = 4Lc2bω21ω−20 c−1p ν−1(1 +","inline":true}],[{"style":{"width":"38%"},"width":365,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/20-27.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Remark ","element":"span"},{"text":"F.5","element":"span"},{"style":{"fontStyle":"italic"},"text":". ","element":"span"},{"text":"We finally remark that the proof ideas of cyclic and randomized RCD follow from classic developments of coordinate descent (","element":"span"},{"href":"#id-41","referenceIndex":67,"text":"Wright","element":"a"},{"text":", ","element":"span"},{"href":"#id-41","referenceIndex":67,"text":"2015","element":"a"},{"text":") in the Euclidean space by showing sufficient descent in the objective function. On general manifolds, in order to generalize the proof ideas, we further require the assumptions outlined in Assumption ","element":"span"},{"href":"#id-106","text":"F.1","element":"a"},{"text":", ","element":"span"},{"href":"#id-109","text":"F.3","element":"a"},{"text":". 
In particular, for cyclic selection rule, we require Assumption ","element":"span"},{"href":"#id-109","text":"F.3.","element":"a"},{"href":"#id-113","text":"3 ","element":"a"},{"text":"to relate bases from different tangent spaces. Similar assumptions have been considered in (","element":"span"},{"href":"#id-42","referenceIndex":19,"text":"Gutman & Ho-Nguyen","element":"a"},{"text":", ","element":"span"},{"href":"#id-42","referenceIndex":19,"text":"2023","element":"a"},{"text":") for showing convergence of deterministic subspace descent algorithms on manifolds (see (","element":"span"},{"style":{"fontStyle":"italic"},"text":"C, r","element":"span"},{"text":")-norm condition).","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"F.2. Proofs","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Theorem ","element":"span"},{"href":"#id-64","style":{"fontStyle":"italic"},"text":"4.1","element":"a"},{"style":{"fontStyle":"italic"},"text":". ","element":"span"},{"text":"Because ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S ","element":"span"},{"text":"= 1 ","element":"span"},{"text":"and by retraction ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":"-smoothness, we have","element":"span"}],[{"style":{"width":"56%"},"width":1094,"height":321,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/20-28.png","element":"img"}],[{"style":{"width":"93%"},"width":1819,"height":415,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/21-0.png","element":"img"}],[{"text":"Telescoping this inequality and taking full expectation yields","element":"span"}],[{"style":{"width":"35%"},"width":333,"height":119,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/21-1.png","element":"img"}],[{"text":"where we let 
","element":"span"},{"style":{"height":16},"width":319.14,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/21-2.png","element":"img","alt":" ∆0 := f(X0) − f ∗.","inline":true}],[{"id":"id-115","style":{"fontWeight":"bold"},"text":"Lemma F.6. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumption ","element":"span"},{"href":"#id-106","style":{"fontStyle":"italic"},"text":"F.1.","element":"a"},{"href":"#id-107","style":{"fontStyle":"italic"},"text":"1","element":"a"},{"style":{"fontStyle":"italic"},"text":", we have ","element":"span"},{"style":{"height":17.68},"width":669.14,"height":44.21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/21-3.png","element":"img","alt":" ∥PBℓ,XU∥X ≤ cb∥U∥X, ∀X ∈ X, ℓ ∈ I.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-115","style":{"fontStyle":"italic"},"text":"F.6","element":"a"},{"style":{"fontStyle":"italic"},"text":". ","element":"span"},{"style":{"height":20.09},"width":1484.9,"height":50.22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/21-4.png","element":"img","alt":" ∥PBℓ,XU∥2X ≤ cb⟨U, Bℓ(X)⟩2X = cb⟨⟨U, Bℓ(X)⟩XBℓ(X), U⟩X ≤ cb∥PBℓ,XU∥X∥U∥X","inline":true},{"text":". Cancelling ","element":"span"},{"style":{"height":17.68},"width":198.92,"height":44.21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/21-5.png","element":"img","alt":" ∥PBℓ,XU∥X","inline":true,"padRight":true},{"text":"on both sides completes the proof.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Theorem ","element":"span"},{"href":"#id-65","style":{"fontStyle":"italic"},"text":"4.2","element":"a"},{"style":{"fontStyle":"italic"},"text":". 
","element":"span"},{"text":"We first focus on a single epoch ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"and for notation simplicity, we let ","element":"span"},{"style":{"height":25.63},"width":518.24,"height":64.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/21-6.png","element":"img","alt":" T XkXsk = Ts→0 and T XskXk = T0→s.","inline":true,"padRight":true},{"text":"Similarly from retraction ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":"-smoothness,","element":"span"}],[{"style":{"width":"73%"},"width":1431,"height":363,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/21-7.png","element":"img"}],[{"text":"Then it remains to bound the RHS. From ","element":"span"},{"style":{"height":15.59},"width":43.12,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/21-8.png","element":"img","alt":" Lg","inline":true},{"text":"-Lipschitz,","element":"span"}],[{"text":"where we use Assumption ","element":"span"},{"href":"#id-109","text":"F.3 ","element":"a"},{"text":"and triangle inequality of Riemannian distance.","element":"span"}],[{"text":"Now we can show","element":"span"}],[{"style":{"height":23.24},"width":1696.89,"height":58.11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/21-9.png","element":"img","alt":"∥PBℓsk T0→sgradf(Xk)∥2Xsk ≤ 2∥PBℓsk T0→sgradf(Xk) − PBℓsk gradf(Xsk)∥2Xsk + 2∥PBℓsk gradf(Xsk)∥2Xsk","inline":true},{"style":{"height":23.24},"width":1106.87,"height":58.11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/21-10.png","element":"img","alt":"≤ 2cb∥T0→sgradf(Xk) − gradf(Xsk)∥2Xsk + 2∥PBℓsk gradf(Xsk)∥2Xsk","inline":true}],[{"style":{"width":"61%"},"width":1189,"height":122,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/21-11.png","element":"img"}],[{"text":"where the second inequality is due to Lemma 
","element":"span"},{"href":"#id-115","text":"F.6","element":"a"},{"text":". Summing this inequality from ","element":"span"},{"style":{"height":14.4},"width":348.72,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/22-0.png","element":"img","alt":" s = 0, ..., S − 1 gives","inline":true}],[{"style":{"width":"69%"},"width":1358,"height":403,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/22-1.png","element":"img"}],[{"text":"where we notice ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"|I|","element":"span"},{"text":". Also due to the cyclic selection of ","element":"span"},{"style":{"height":15.1},"width":33.6,"height":37.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/22-2.png","element":"img","alt":" ℓsk","inline":true},{"text":", we can see the LHS is","element":"span"}],[{"style":{"width":"73%"},"width":1433,"height":121,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/22-3.png","element":"img"}],[{"text":"where we use Assumption ","element":"span"},{"href":"#id-106","text":"F.1.","element":"a"},{"href":"#id-107","text":"1 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-109","text":"F.3.","element":"a"},{"href":"#id-113","text":"3","element":"a"},{"text":". Combining with previous results, we finally obtain","element":"span"}],[{"style":{"width":"99%"},"width":1943,"height":216,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/22-4.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Theorem ","element":"span"},{"href":"#id-114","style":{"fontStyle":"italic"},"text":"4.3","element":"a"},{"style":{"fontStyle":"italic"},"text":". 
","element":"span"},{"text":"For the randomized setting, by retraction ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":"-smoothness,","element":"span"}],[{"style":{"height":28.8},"width":420.8,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/22-5.png","element":"img","alt":"�∥⟨∇f(Xsk), Bℓsk⟩Bℓsk∥2Xsk","inline":true}],[{"style":{"width":"20%"},"width":194,"height":55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/22-6.png","element":"img"}],[{"text":"where we choose ","element":"span"},{"style":{"height":20.75},"width":217.51,"height":51.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/22-7.png","element":"img","alt":" η = ω0Lcbω21","inline":true,"padRight":true},{"text":". ","element":"span"},{"text":"The second inequality follows from the assumption ","element":"span"},{"style":{"height":19.64},"width":381.79,"height":49.09,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/22-8.png","element":"img","alt":" ω0⟨∇f(Xsk), Bℓsk⟩2 ≤","inline":true},{"style":{"height":19.64},"width":673.57,"height":49.09,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2406.02225/images/22-9.png","element":"img","alt":"θsk⟨∇f(Xsk), Bℓsk⟩ ≤ ω1⟨∇f(Xsk), Bℓsk⟩2","inline":true,"padRight":true},{"text":"and the third inequality is due to Assumption ","element":"span"},{"href":"#id-106","text":"F.1.","element":"a"},{"href":"#id-107","text":"1","element":"a"},{"text":". Following a similar ","element":"span"},{"text":"proof strategy, we obtain the desired result. For the cyclic setting, the bound also readily follows by using the above result.","element":"span"}]]}],"_version":"3.3.4"},"paperNode":"$1b:props:children:props:children:0:props:product"}]]]}]}]