Constructing fast approximate eigenspaces with application to the fast graph Fourier transforms

2020·Arxiv

Abstract

Abstract

We investigate numerically efficient approximations of eigenspaces associated to symmetric and general matrices. The eigenspaces are factored into a fixed number of fundamental components that can be efficiently manipulated (we consider extended orthogonal Givens or scaling and shear transformations). The number of these components controls the trade-off between approximation accuracy and the computational complexity of projecting on the eigenspaces. We write minimization problems for the single fundamental components and provide closed-form solutions. Then we propose algorithms that iterative update all these components until convergence. We show results on random matrices and an application on the approximation of graph Fourier transforms for directed and undirected graphs.

1 Introduction

Matrix decomposition techniques [Stewart, 2000], and specifically eigenvalue decompositions [Golub and van der Vorst, 2000], are widely used in numerical linear algebra, scientific computing, machine learning, quantum computing and other scientific fields.

In general, given no assumptions on the structure of an eigenspace, the eigenvector matrix of a given linear operator of size exhibits no advantageous numerical properties and therefore they require ) operations when performing matrix-vector multiplications. In this paper, we want to perform an approximate eigenvalue decomposition so that we accurately represent the original eigenspace with a new one that also exhibits favorable numerical complexity, for example, it requires only O(n log n) operations when performing matrix-vector multiplication with a generic vector. In such cases, a trade-off between the accuracy of the approximation and its numerical complexity exists. Several previous works, such as [Lee et al., 2008] and [Kondor et al., 2014], have already introduced these ideas in the machine learning community with considerable success. Such approximations are particularly useful in situations where, once computed, the eigenspace is repeatedly used in matrix-vector calculations in downstream applications.

Eigendecomposition algorithms developed in the matrix computations literature are different for symmetric and unsymmetric matrices. In the symmetric case, the eigenspace is always full (non-defective), real-valued and furthermore, orthonormal [Golub and van Loan, 1996][Chapter 8]. We approximate these eigenspaces by using extended Givens transformations (which are themselves orthonormal and include as a particular case the well-known Givens, sometimes also called Jacobi, rotations [Givens, 1958]). In this case, given the spectrum or an estimation of it, we can provide a locally optimal iterative algorithm similar to Jacobi diagonalization for symmetric matrices [Jacobi, 1846]. The general, unsymmetric, case [Golub and van Loan, 1996][Chapter 7] is much more challenging as the given matrix might not even be diagonalizable and furthermore, even when it is, the factorization has to be done over the complex-valued field in general. As the eigenvector matrix is generally unstructured, in this case, we rely on a given number of scaling and shear transformations to approximate it [Rusu, 2018]. We formulate optimization problems for each of these basic components and show how to locally, optimally solve them with an iterative algorithm and closed-form solutions.

For both proposed algorithms, we show experimental results on the approximation of random unstructured symmetric matrices and then show an application to the construction of fast graph Fourier transforms on synthetic and real-world directed and undirected graphs.

2 Prior approaches

The literature has always distinguished between eigendecompositions of symmetric and unsymmetric matrices and we will do the same.

In the symmetric case, the diagonalization is done with orthonormal matrices, which are well understood in terms of their decomposition with Givens rotations or Householder reflectors (the QR algorithm, see Chapters 5.1 and 5.2 of [Golub and van Loan, 1996]). The starting point for some approaches in the literature is the Jacobi diagonalization process for symmetric matrices [Ja- cobi, 1846] which is an iterative procedure that uses Givens rotations to bring the symmetric matrix to a strongly diagonally dominant one. A truncated Jacobi procedure is used in [Le Magoarou et al., 2018] to compute fast graph Fourier transforms for undirected graphs (as undirected implies symmetry in the adjacency and Laplacian matrices). Other methods deal directly with the orthonormal eigenspace. For example, Treelets [Lee et al., 2008] and multiresolution [Kondor et al., 2014] structures use Givens rotations in a structured way to decompose the orthonormal components into hierarchies or multiple differ-ent scales, respectively. Another approach is to exploit manifold optimization techniques to find approximate factorizations of orthonormal matrices with few Givens rotations either by greedy coordinate descent [Shalit and Chechik, 2014] or by –style optimization [Frerix and Bruna, 2019]. Recently, an approach that combines rotations and reflections was proposed with an application to fast principal component analysis (PCA) projections [Rusu and Rosasco, 2019]. While this latter work needs to precompute the orthonormal eigenspace, in this paper we show how to perform the same factorization given the dataset.

In the unsymmetric case, we rely on sparse structured components. For example, the incomplete LU [Meijerink and Vorst, 1977], the randomized LU [Shabat et al., 2018] factorizations, the additive low-rank plus multiresolution decomposition [Mudrakarta et al., 2019], and approximate Gaussian elimination [Kyng and Sachdeva, 2016] all rely on structured sparse matrices to construct efficient approximations of a given unstructured matrix.

In this paper, we use structured matrices to construct numerically efficient approximations of eigenspaces. We describe the fundamental building blocks of our factorizations and provide exact optimization problems with closed-form solutions to optimally, locally update these blocks efficiently.

3 Problem setup and formulation

3.1 The symmetric case

Given a symmetric matrix the main result that we use is its eigenvalue factorization as

where we assume w.l.o.g. that the entries of s, which are the real-valued eigenvalues of S, are in descending algebraic order and U is the real-valued orthonormal eigenspace. Based on (1), we consider the problem

where a set of orthonormal matrices such that matrix-vector multiplication with any matrix from this set is O(g), instead of the classic ). Let us now consider a particular set . Based on all the 2 2 orthonormal matrices

we have the extended orthonormal Givens transformations [Rusu and Thomp- son, 2017, Rusu and Rosasco, 2019], which for simplicity we call a G-transform:

where the non-zero entries located at rows/columns i and j, denoted as “*”, are the two possible options in (3). The matrices in (3) are basic building blocks of the orthonormal group because every orthonormal matrix can be diagonalized (and the diagonal entries are ) using Givens rotations (the matrix (4) with the first, unsymmetric, component in (3)) by the QR decomposition of U, see Chapter 5.2.5 of [Golub and van Loan, 1996]. Then, in this paper, any has the following structure

where all matrices are G-transforms (4). The number is given and fixed. With this structure, matrix-vector multiplication takes 6g operations while storing takes approximately 2g logbits, where C is the number of bits required for a double precision floating-point representation. Similar structures to (5) have been previously proposed by [Lee et al., 2008], [Kondor et al., 2014], and [Frerix and Bruna, 2019], but they all consider only Givens rotations, and no reflectors.

3.2 The unsymmetric case

Given a general diagonalizable the main result that we use is its eigenvalue factorization as

where c contains the complex-valued eigenvalues and T has the complex-valued eigenvectors. Based on (6), we consider the problem

where a set of general matrices such that matrix-vector multiplication with any matrix from this set or its inverse is O(m), instead of ). Let us now consider a particular set . Based on 2 2 scaling and shear transformations

and, similarly to (4), we define the T-transform:

where the non-zero entries are the three possible options in (8). For the shear transformations we necessarily have j > i while for the the scaling transformations we abuse notation and impose i = j, i.e., in (9) is the identity matrix except for the diagonal element that is a. The matrices in (9) are building blocks of every diagonalizable because, by Gaussian elimination (see Chapter 3.2.1 of [Golub and van Loan, 1996]), shear transformations (9) diagonalize C and then n scaling transformations (9) exactly represent the resulting diagonal. We choose these three specific matrix as the optimization in (7) takes place over the variable and its inverse and these matrices have trivial inverses. Then, in this paper, any has the following structure

where all matrices are T-transforms (9). Their number is given and fixed. We assume that the factorization contains scalings and shears (). With this structure, matrix-vector multiplication takes operations while storing takes approximately ) logbits, where C is the number of bits required for a double precision floating-point representation.

There are two important differences when compared to G-transforms. Neither the scaling nor the shears are orthogonal but they are more efficient: two computations and one degree of freedom per transform (as compared to the G-transform where we have 6 operations and one degree of freedom). Therefore, for the same computational cost, we expect T-transforms to provide more accurate approximations. The two types of transforms are connected since any 2 2 orthonormal transformation can be written as a product of three shears and scalings by the lifting scheme [Daubechies and Sweldens, 1998].

4 Proposed factorizations and algorithms

In this section we propose approximate solutions to the optimization problems (1) and (6). Therefore, we distinguish between the symmetric and unsymmetric cases. Furthermore, we analyze separately the initialization and iterative procedures that improve the approximation for each of the two problems. Both (2) and (6) echo the fast circulant matrix-vector multiplication which is possible because every circulant matrix of size has a factorization as , Chapter 4.8 of [Golub and van Loan, 1996], where F is the Fourier matrix and = diag(. The idea is to replace the Fourier matrix with a new learned matrix ( as (5) or as (10)), with similar computational properties to the Fourier [Cooley and Tukey, 1965].

4.1 Approximation of symmetric matrices

Based on the eigenvalue decomposition, the idea is to approximate U in (1) with a fast approximation (5). The approximation of S would therefore be diag(. Multiplications with S can be viewed as a sequence of fast multiplications by , diag(s) and finally . Matrix-vector multiplication with a diagonal matrix is fast, n operations, so therefore our goal is to construct such that it also has advantageous numerical properties (for example, computations take less than 2operations). Therefore, based on (5), we propose an approximation as

where are the estimated eigenvalues of S. These can be the actual eigenvalues of S if they are known, or else they can be randomly initialized or set to the diagonal elements of S – in the latter case, we should ensure entries of are distinct for reasons that will be clear later in this section. To find the best approximation of S we can compute the best approximation to the spectrum by the following lemma.

Lemma 1 Let S and orthogonal be fixed, then the arg min of the expression diag() is given by

The complexity of computing is O(gn).

Let us now move to approximate the orthogonal eigenspace of S. Given an approximation as (11) where all G-transforms t = k + 1, . . . , g were initialized, we now study the problem of initializing such that we minimize

where we have defined the symmetric matrix

where we have denoted the quantity

and the 2 2 symmetric =

Theorem 1 provides an efficient way to find the optimal G-transform that minimizes (13): both the indices and the transform values. Starting from k = g we can continue down to k = 1 and initialize in this fashion all g G-transforms in (5). Also, notice that the unified approach we propose (allowing for both the rotation and the reflection) simplifies the results, i.e., we are not looking to optimize an angle of rotation but we get the optimal local solution by an eigenvalue decomposition (the solution to a two-sided 22 Procrustes problem). The computational cost of (15) is dominated by the sweep of the indices () operations).

Remark 1 (Connection to the Jacobi method) The Jacobi method, see Chapter 8.4 of [Golub and van Loan, 1996], used to diagonalize symmetric matrices, only uses Givens rotations and selects indices () that correspond to the off-diagonal element of with the highest magnitude, i.e., we have (15) with . The Jacobi algorithm is not concerned with the number of Givens rotations used to diagonalize and indeed, in general, more than rotations are used (see Chapter 8.4.3 of [Golub and van Loan, 1996] for a detailed discussion on how the number of rotations relates to the converge of the method). Furthermore, the Jacobi method uses only Givens rotations (while we have a richer structure for given in (3)) and it does not explicitly have an objective function as (2), i.e., there is no reference matrix to be reconstructed in the sense of minimizing a Frobenius norm (the Jacobi method minimizes the squared sum of the off-diagonal entries of the approximation). Also, the Jacobi method does not need an estimate of the eigenvalues . If we now ignore the eigenvalue information, we can consider in (15). When we have that , just as in the Jacobi method, while when we have . Therefore, the calculated score approximates the Jacobi approach when off-diagonal elements are large but the selection criterium for the indices is different as the iterative process makes progress and the working matrix becomes diagonally dominant. Finally, we note that = 0 whenever ¯= ¯which agrees with theoretical convergence results on the Jacobi method that hold when assuming distinct eigenvalues [Henrici, 1958].

The proposed approach is also significantly different from other previous approaches. As opposed to the approach in [Kondor et al., 2014], by maximizing (15) we find the indices of the optimal 22 transform without actually explicitly having to compute it. This saves up computational time (the A are easy to compute: only 10 operations) and also leads to better approximation (as we also consider the reflector simultaneously with the Givens rotation in (3)). The approach in [Frerix and Bruna, 2019] uses again only Givens rotations to perform coordinate descent on the orthonormal manifold using a particular basis for the tangent space such that the exponential map is a Givens rotation. Noting that

can be seen as a coordinate swap followed by a rotation, we can interpret our approach as a simultaneous dual tangent space descent. Unfortunately, efforts to integrate this view with manifold optimization, in general, seem difficult at this stage as many difficulties arise: for example, the logarithmic map of the reflector is complex-valued and therefore it is not clear how to choose a basis for the tangent space corresponding to the reflector. As rotations and reflections are disconnected components the unified approach used in this paper may work only for the objective function and constraints we consider (due to the existence of the Procrustes solutions based on eigendecompositions).

Given an approximation as (11) where all G-transforms were initialized, we now study the problem of improving each individual iteratively. We want to optimize each G-transform sequentially such that we minimize

where, due to the invariance of the Frobenius norm to multiplications by and its transpose, we have defined the symmetric matrices

Notice that Theorem 1 covered only the case when is a diagonal matrix. Next we state a general result that holds for the minimization of (17) and any and .

Theorem 2 (Optimal update of each G-transform) Let and be any symmetric matrices, then the minimizer of the quantity in (17) has its non-trivial values given by

where = min where for the optimal coordinates

where = (+ 2(. The new index runs through the two options (rotation and reflector) in (3). The matrices of size 2 2, the vectors of length 2 and w of length , and the matrices and all of size 4 4 depend only on the entries in and are given explicitly in the supplementary materials.

Unlike with the initialization procedure, Theorem 2 shows that considering both the rotation and the reflector simultaneously does not lead to a unified optimization problem. Indeed, the index runs through both transformations from (3). Still, solving the second problem, for = 2, brings an extra computational load that is negligible as it shares most calculations with the first problem, for = 1.

The iterative process is computationally expensive as it covers all ) unique pairs of indices () while the calculation of (20) is itself non-trivial and requires ) operations (substantially more expensive than the initialization (16)). If the running time is an important constraint, we can run the iterative process just as a “polishing step”: keep the indices of the G-transforms fixed all the time and update only the values of the transformations .

4.2 Approximation of unsymmetric matrices

Similarly to the symmetric case, based now on the eigenvalue decomposition (6), the idea is to approximate T with a numerically efficient approximation . The approximation of C would therefore be = diag(. Multiplications with can be viewed as a sequence of fast multiplications by , diag() and finally . Again, matrix-vector multiplication with a diagonal is fast and therefore the computational burden depends on the numerical properties of and its inverse. By using scaling and shear transformations (8) in the direct transformation the numerical properties transfer also to its inverse since inverses of scalings and shears are themselves scalings and shears, respectively. Therefore, based on (10), we propose an approximation as

where are the estimated eigenvalues of C, which we constraint to be real-valued. Just like in the symmetric case, there are several ways to set : randomly, the diagonal values of C or the true eigenvalues, if they are known. To find the best approximation of C we can compute the best approximation to the spectrum by the following lemma.

Lemma 2 Let C and be fixed, then the arg min of the expression diag(is given by

where is the Khatri-Rao product.

The computational complexity of getting is ), we solve a least squares problem of size where the columns of are Kronecker products. Alternatively, we can approximately solve the problem by some iterative method which exploits the Kronecker product structure to efficiently perform matrix-vector multiplications.

Let us now move to approximate the eigenspace of T. Given an approximation as (22) where all T-transforms t = 11 were initialized, we now study the problem of initializing such that we minimize

where we have defined the symmetric matrix

Theorem 3 (Optimal initialization of each T-transform) Let C and be fixed, let all components for t = 11 while for t = k + 1, . . . , m, then the optimal component that minimizes the quantity in (24) is given by

where the quantities (), for an index that runs through all three op- tions in (8), are rational functions in and are given explicitly in the supplementary materials.

Given an approximation as (22) where all T-transforms were initialized, we now study the problem of improving each individual iteratively. Therefore, we want to optimize each T-transform sequentially such that we minimize

where we have defined the matrix

Theorem 4 (Optimal update of each T-transform) Let and be any matrices, then the optimal component that minimizes the quantity in (27) is given by

where the quantities (), for an index that runs through all three op- tions in (8), are rational functions in and are given explicitly in the supplementary materials.

Although the initialization and iterative steps look very similar, (26) and (29), they are significantly different from a computational perspective. While

a particular () is computed in constant time O(1), for each of the )s we have quadratic complexity ).

Similarly to the symmetric case, we now have a locally optimal way of choosing and updating our T-transforms. Also, the update step is again significantly more computationally expensive than the initialization and more are more expensive than the results for the symmetric case. The basic difficulty stems from the fact that we are dealing now with building blocks that are not orthogonal and therefore are not invariant in the Frobenius norm. The exact initialization of T-transforms takes ) while the update takes ) operations. Due to this high computational cost, simplification can be brought to the algorithm: for example, in the update steps we no longer search over every index pair () but we keep these indices fixed and just calculate the locally optimal coefficient of the transformation . We call this step a polishing step and it reduces the computational complexity of updating the T-transforms to ).

We now make two remarks regarding the proposed decomposition for general matrices.

Remark 2 (Using T-transforms for the symmetric case) The ideas of this section can also be applied to the symmetric case. Consider an approxima-

tion analogous to (11) based on T-transforms as

If the factorization in (11) is based on the eigendecomposition of symmetric matrices, this one is similar to a Cholesky factorization. With this structure we reach minimization problems similar to (27) that have similar solutions to (67) following the same development as in (63), for brevity we omit these formulas. The obvious disadvantage of this approach is that we do not explicitly preserve the eigen-information of the original matrix: we lose control over the spectrum of the approximation and the T-transform do not approximate explicitly the eigenspace. Still, there are some advantages: i) T-transforms (2 operations per degree of freedom) are more efficient than G-transforms (6 operations per degree of freedom) and therefore we expect to get better approximation accuracy for the same numerical complexity or vice-versa, lower computational complexity for the same representation accuracy; and ii) direct and inverse matrix-vector multiplications with are still efficient, e.g., take at most 4g operations instead of 12g + n with .

An alternative construction is to keep the eigendecomposition but use real-valued approximate eigenvalues and use T-transforms to approximate the orthonormal eigenspace

We lose the orthogonality of the eigenspace but we expect this representation to be more accurate just because, as already discussed, T-transforms have better numerical properties as compared to G-transforms. As every matrix in (3) can be written as four transformations from (8) (two shears and two scalings by [Daubechies and Sweldens, 1998]), we can use the approximation of S from (11) as an initialization to the factorization in (31) with m = 4g.

Remark 3 (An approximate Schur decomposition for S) Notice that in (2), instead of the diagonal containing we can use for example an upper (or lower) triangular matrix which is also sparse, say O(g) off-diagonal elements. The computational complexity of using such an approximation (directly or inversely) would still be O(g) while we expect the approximation accuracy to be better, lower overall due to the extra degrees in freedom in the new triangular factor. This factorization would be similar to the Schur decomposition , where J is upper triangular and V is orthonormal (but all real-valued). This is in contrast with the eigenvalues decomposition of C which is done over the complex values, in general.

4.3 The proposed algorithm

Figure 1: Approximation accuracy (mean and std) for Laplacians of randomly generated graphs as a function of the number of transformations g going as log. All graphs are randomly generated using the default settings of the GSP box: community graphs (left), Erdos-Renyi random graphs with the probability of a connection between two nodes p = 0.3 (center) and a sensor graphs (right). Given n the number of nodes in the graph, results are shown for n = 512 (dotted), n = 256 (dashed) and n = 128 (solid) and all methods update also the spectrum of the estimation. Top row shows undirected graphs while bottom row show directed graph (created from undirected graphs, direction of the edge between the nodes is decided randomly with probability 0.5) results. Results are averaged over 100 realizations.

Given that we now have optimal ways to initialize (Theorems 1 and 3) and update G and T transforms (Theorems 2 and 4), we are ready the describe the full proposed procedure, in Algorithm 1. For brevity, we describe a single procedure for both the symmetric and general matrices. Whenever we allow for spectrum updates of or we explicitly mention so.

As previously discussed, every step of the algorithm is locally optimal and can only decrease the objective functions we consider. Thus, convergence to a stationary point is guaranteed. Compared to previous methods, there are two characteristics that we would like to highlight at this point: i) in the symmetric case, we use simultaneously both rotations and reflections to construct our approximation; and ii) for each sub-problem we define we can provide a closed-form solution based either on singular and eigenvalue decompositions or least squares. Also, because the proposed method relies on the calculation of scores that span indices i and j it is naturally parallelizable and is amenable to randomized linear algebra techniques.

Proofs of the lemmas and theorems from this section are collected in the supplementary materials.

Figure 2: Comparison of the proposed approach (black squares) against previously proposed methods from the literature: Jacobi (red circles) from [Le Magoarou et al., 2018], greedy Givens (green diamonds) from [Kondor et al., 2014] and L1 (blue triangles) from [Frerix and Bruna, 2019]. The comparison is based on Fig. 6 from [Frerix and Bruna, 2019] and we keep their measure of accuracy between U and .

Figure 3: For the pro- posed method, we show average estimation accuracy for the overall Laplacian L, not just the eigenspace U, for the graphs in Figure 2 as a function of the number of G-transforms g going as log. In all cases the proposed method also updates the spectrum of the approximation, i.e., = diag(¯. The initial estimated eigenvalues are assumed to be the diagonal of L.

Figure 4: Given random undirected ErdosRenyi graphs on size n = 1024 with Laplacian L = Udiag(we show the average approximation accuracy when we are approximating U directly, i.e., the eigendecomposition is explicitly given, or we are approximating L directly. When the eigendecomposition is available we approximate U and diag() (a weighted eigenspace) using [Rusu and Rosasco, 2019].

5 Experimental results

In this section, we describe numerical experimental results with the proposed algorithms and compare them to previous work. Following previous methods [Le Magoarou et al., 2018, Frerix and Bruna, 2019], we measure the quality of the approximation using the Frobenius norm objective functions of the optimization problems we consider. Source code for Algorithm 1 is available online. In all the experiments we use ‘update’ for the spectrum estimation and only polishing in the iterative part of the algorithm (keep indices found in the initialization fixed and optimize only the values of the transforms - monotonicity in the objective function is still preserved).

We show an application to the calculation of the fast graph Fourier transforms. Given a graph with n vertices we compute its Laplacian where D is the diagonal degree matrix and A is the adjacency matrix, i.e., = 1 if a directed edge exists between nodes i and j and = 0 otherwise. Given the eigenvalue decomposition of the Laplacian, we call U the graph Fourier transformation of the graph. Our goal is now to build approximations of U which enjoy the same numerical complexity as the classic time-domain Fourier transformation, i.e., O(n log n). We distinguish between undirected and directed graphs. With undirected graphs, the adjacency matrix A is symmetric and therefore the Laplacian is symmetric positive semidefinite allowing for an eigendecomposition with an orthonormal basis U as . In this case, we use G-transforms to approximate the eigenspace. For directed graphs, the Laplacian does not have an orthonormal structure and therefore we use the more general T-transforms in the factorization. In Figure 1 we show approximation results for different types of graphs of different sizes (number of vertices n).

In Figure 2 we compared the proposed method against the previous state of the art for the computation of fast graph Fourier transforms. The results are shown for four graphs: Minnesota graph from [Defferrard et al., 2015] with n = 2642 and 3304 edges, HumanProtein graph from [Rual et al., 2005] with n = 3133 and 6726 edges, Email graph from [Guimer`a et al., 2004] with n = 1133 and 5451 edges and the Facebook graph from [Leskovec and Mcauley, 2012] with n = 2888 and 2981 edges. Our proposed method performs best in all these situations.

In this paper, we assume we do not have information about the eigenspace of L, but have access to L itself. Previous work [Rusu and Rosasco, 2019, Frerix and Bruna, 2019] considered the possibility of performing first the eigendecomposition and then working directly with the eigenspace U. For the same graphs, we show in Figure 3 the evolution of the accuracy of the overall Laplacian L as a function of the number of basic transformations in their factorization. We emphasize again that this metric is different from the one used in Figure 2 on the accuracy of the eigenspace U.

In Figure 4 we analyze the approximation accuracy of estimating the Laplacian L for random undirected Erdos-Renyi graphs with n = 1024 and compare it against the approach in [Rusu and Rosasco, 2019] which needs directly the eigenspace U. We show that our proposed method, especially with the updated spectrum, is the most appropriate to build an accurate approximation of L.

The supplementary materials have further numerical experiments, comparisons, and measurements of the running time of the fast transformations (not just number of operations).

6 Conclusions

In this paper, we have described algorithms for the efficient approximate computation of eigenspaces which can be efficiently manipulated. We show an application to the computation of the fast graph Fourier transform (both for directed and undirected graphs) and compare against previous approaches from the literature, which we outperform. An open problem, that we cannot address at this time, is the setup of an appropriate theoretical framework where the proposed factorizations and algorithms can be analyzed and their limitations understood.

Acknowledgments

This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216, and the Italian Institute of Technology. Part of this work has been carried out at the Machine Learning Genoa (MaLGa) center, Universit`a di Genova (IT) L. Rosasco acknowledges the financial support of the European Research Council (grant SLING 819789), the AFOSR projects FA9550-17-1-0390 and BAA-AFRL-AFOSR-2016-0007 (European Office of Aerospace Research and Development), and the EU H2020-MSCA-RISE project NoMADS - DLV-777826. C. Rusu is supported by the Romanian Ministry of Education and Research, CNCS-UEFISCDI, project number PN-III-P1-1.1-TE-2019-1843, within PNCDI III.

Supplementary materials

Eigenvalues of symmetric matrices

given by

Proof of Lemma 1

The result follows directly by using the invariance of the Frobenius norm under orthogonal transformations:

Then, because the Frobenius norm is entry-wise we have the minimizer = diag( ).

Proof of Theorem 1

Given that we have initial values for all components for t = g, . . . , k + 1 (while for t = 11) and we now want to also initialize the component such that we minimize objective function of (2). That expression can be written as

where we have defined the cost

For convenience, we defined the 2 2 symmetric matrices

and = diag(. In the development of (34) we have used the trace definition of the Frobenius norm = tr(), the fact that the Frobenius norm is invariant to orthonormal transformation (in particular G-transformation) = = and that operates only on rows and columns and .

The problem of maximizing the quantity in (35) is known as the two-side orthonormal Procrustes problem [Schonemann, 1968] whose solution in our case is given by

We will use that = . We assume that in the eigenvalue decomposition of where the diagonal matrix = diag() contains the eigenvalues in algebraic descending order. The same ordering is also assumed in . Therefore, by the rearrangement inequality, see Section 10.2, Theorem 368 of [Hardy et al., 1952], and with the trace quantity is maximized and it reduces to

Therefore, the overall cost (35) reduces to

where we have denoted

and we noticed that = ¯and = ¯. The eigenvalues of in are computed by the formulas in (32). Therefore, the minimizer of (34) is

Proof of Theorem 2

With the definitions (18) and (19), the objective function (17) can be expressed to

with the cost that is

where we have defined =

and are both matrices of size 2 2) composed of only rows and , but with the columns and eliminated from both, from and respectively. Finally, we have used

the operation denotes the entry-wise matrix-matrix product. The quantity in (44) seems difficult to minimize, in the sense that a solution based on an eigenvalue decomposition, such as in (38), does not seems possible. This is because the cost contains a term (the first trace, similar to the one in (35)) whose maximum is given as the solution to the two-side orthonormal Procrustes problem [Schonemann, 1968] by eigenvalue decompositions of and but also another term (the second trace) whose maximum is given as the solution to the one-sided orthogonal Procrustes problem by the singular value decomposition of another quantity, . The computational simplification in (35) is possible because one of the matrices is diagonal and therefore the second trace term does not appear in the cost. It is for this reason that we have to analyze separately the initialization and iteratively procedures. To find the minimizer of (17), both in terms of the indices and the values of the

orthonormal , take the following

where (and (are the columns of and , respectively. We have introduced the matrix

with and where are the standard basis vectors for and is the Kronecker product. For ease, we also define and , the two options for P. We have these two variants for P due to the dual structure of (3). Unlike the previous section, where the dual structure enabled the initialization by an eigenvalue decomposition, here we actually need to solve two different, but related, problems. In the development of (46) we have used again the invariance of norms to orthonormal transformations and the fact that the Frobenius norm is element-wise = vec(and that vec(ABC) = ()vec(B). The rest of the section is dedicated to finding the minimizer of (46) in a numerically efficient manner.

Therefore, the minimization of (46) is equivalent to a constrained least squares problem that can be solved efficiently using the singular value decomposition, see Chapter 12.1 of [Golub and van Loan, 1996] and [Gander et al., 1989]. We have to solve the problem twice, for and in (47), and keep the best result. The vector w and the matrix P are never explicitly constructed, but we build the products

For these, we have the explicit formulas

• for = we have: = 2(++2) and = 2(+ + );

• for = we have: = + + + 2+ 4= 2(+ ) and + + + 2;

• for = we have: = 2(+ ) and = 2(+++).

Here we have also introduces the quantities:

that compute to the squared norms of the columns of and , respectively. With this setup, the minimizer of (46) with some fixed indices and is

where is chosen such that the solution has unit norm according to [Gander et al., 1989], the regularization parameter is computed by

i.e., the are the generalized real-valued eigenvalues of the 4 4 matrices M =

(50), the overall minimizer of (43) is found by searching for the indices

The quantities are already computed and =+ )2.

Proof of Lemma 2

The result follows by using the trace product formula:

Then, this is a simple least-squares problem in where is the Khatri-Rao product, the Kronecker products between the corresponding columns of and .

Proof of Theorem 3

Consider the scenario where we have initial values for all components for t = 11 (while for t = k + 1, . . . , m) and we want to also initialize the component such that we minimize the quantity

with the matrix

where we have defined the cost values as

and we have used the quantities:

The goal is to search for

Because all three ) are polynomials in of degree four and five their minimization therefore reduces to finding the roots of their derivatives and evaluating the values at those points searching for the minimum value. We have used the following explicit inverse formulas:

Proof of Theorem 4

The expression (27) can be developed to

where we have used twice the fact that () = (), twice that vec(ABC) = ()vec(B) and is a function only of . We have defined w = vec((). But this large vector of size and other large matrices (Kronecker products of size ) are never explicitly constructed but the objective function is minimized by exploiting the structures (9) and (8). The cost values for the scaling and the two shears respectively are

and we have used the quantities:

Similarly to the initialization step, the goal is to search for

Again, all three ) are polynomials in of degree four and five their minimization therefore reduces to finding the roots of their derivatives and evaluating the values at those points searching for the minimum value.

Additional experimental results

In Figure 5, we show the accuracy of the approximation for randomly generated symmetric (both positive definite and indefinite), and general matrices. Details

Figure 5: In red, approximation accuracy (mean and std) of the proposed method for randomly generated matrices as a function of the number of transformations g or m going as log. Given a matrix X with entries i.i.d. standard Gaussian we have results for: symmetric indefinite (left), symmetric positive semidefinite (central) and unsymmetric C = X (right). Results are shown for n = 512 (dotted), n = 256 (dashed) and n = 128 (solid) and all methods update also the spectrum of the estimation. Note that the achieved accuracy is better for the positive definite case. In black, for comparison, the results of r-rank approximations: for the symmetric case r = 3logwhile for the unsymmetric case log(for these values we match the numerical complexity of the transformations and , we count 2rn operations for matrix-vector multiplication with the r-rank matrix). Results are averaged over 100 realizations.

Figure 6: Average speedup achieved for matrix-vector multiplication between the full eigenspaces versus their approximations using Algorithm 1 for the graphs from Figure 2. We show the FLOP count (count of the number of operations: 6logfor the G-transformations and 2logfor the Ttransformations) as compared to regular matrix-vector multiplication (2) and the actual matrix-vector multiplication runtime as compated to the LAPACK, Level 2 BLAS, implementation (SGEMV). Application of the butterflies is implemented in the C programming language (scripting languages such as Matlab perform very poorly if all g or m basic transformations are applied sequentially). No parallelism is used in any of these experiments. Code runs on a 2.3GHz Quad-Core Intel Core i5 system with 16GB LPDDR3 memory.

about the experimental setup are given in the figure caption. The proposed algorithm uses again only a polishing step and not a full transform update, for computational efficiency.

Finally, in Figure 6 we show the speedup achieved by the proposed transformations. Throughout the paper, we define numerically efficient transformations as those that present a low number of additions and multiplications in matrix-vector operations, i.e., FLOP (floating-point operations) count. In this figure, we also show the speedup in terms of actual running time and compare it with the FLOP count. We stress that the speedup does not refer to the running time of the proposed Algorithm 1, but the application of the transformations reached by this algorithm. For this figure, the butterfly transformations (G and T transformations) are implemented in the C programming language (a Matlab implementation of the application of these butterflies is hopelessly slow as compared to just matrix-vector multiplication in Matlab, i.e., the “*” operation calls compiled BLAS functions as opposed to parsing and running Matlab scripts).

References

J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Comp., 19:297–301, 1965.

I. Daubechies and W. Sweldens. Factoring wavelet transforms into lifting steps. J. Fourier Anal. App., 4(3):247–269, 1998.

M. Defferrard, L. Martin, R. Pena, and N. Perraudin. PyGSP: Graph Signal Processing in Python, 2015.

T. Frerix and J. J. Bruna. Approximating orthogonal matrices with effective Givens factorization. In Proceedings 36th International Conference on Machine Learning (ICML), 2019.

W. Gander, G. H. Golub, and U. von Matt. A constrained eigenvalue problem. Linear Algebra Appl., 114-115:815–839, 1989.

W. Givens. Computation of plain unitary rotations transforming a general matrix to triangular form. Journal of the Society for Industrial and Applied Mathematics, 6(1):26–50, 1958.

G. H. Golub and H. A. van der Vorst. Eigenvalue computation in the 20th century. Journal of Computational and Applied Mathematics, 123(1-2):35– 65, 2000.

G. H. Golub and C. F. van Loan. Matrix Computations. Johns Hopkins University Press, 1996.

R. Guimer`a, L. Danon, A. Diaz-Guilera, F. Giralt, and A. Arenas. Self-similar community structure in a network of human interactions. Physical review. E, Statistical, nonlinear, and soft matter physics, 68:065103, 2004.

G. H. Hardy, J. E. Littlewood, and G. Polya. Cambridge University Press, 1952.

P. Henrici. On the speed of convergence of cyclic and quasicyclic Jacobi methods for computing eigenvalues of Hermitian matrices. Journal of the Society for Industrial and Applied Mathematics, 6(2):144–162, 1958.

C. Jacobi. Uber ein leichtes Verfahren die in der Theorie der Sacularstorungen vorkommenden Gleichungen numerisch aufzulosen. Journal fur die reine und angewandte Mathematik, 30:51–94, 1846.

R. Kondor, N. Teneva, and V. K. Garg. Multiresolution matrix factorization. In Proceedings 31st International Conference on Machine Learning (ICML), pages II–1620–II–1628, 2014.

R. Kyng and S. Sachdeva. Approximate Gaussian elimination for Laplacians - fast, sparse, and simple. In IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 573–582, 2016.

L. Le Magoarou, R. Gribonval, and N. Tremblay. Approximate fast graph Fourier transforms via multi-layer sparse approximations. IEEE Transactions on Signal and Information Processing over Networks, 4(2):407–420, 2018.

A. B. Lee, B. Nadler, and L. Wasserman. Treelets - an adaptive multi-scale basis for sparse unordered data. Annals of Applied Statistics, 2(2):435–471, 2008.

J. Leskovec and J. J. Mcauley. Learning to discover social circles in ego networks. In Advances in Neural Information Processing Systems 25, pages 539–547. 2012.

J. A. Meijerink and H. A. Vorst. An iterative solution method for linear systems of which the coefficient matrix is a symmetric M-matrix. Math. Comp., 31: 148–162, 1977.

P. K. Mudrakarta, S. Trivedi, and R. Kondor. Asymmetric multiresolution matrix factorization. arXiv 1910.05132, 2019.

J.-F. Rual, K. Venkatesan, T. Hao, T. Hirozane-Kishikawa, A. Dricot, N. Li, G. Berriz, F. Gibbons, M. Dreze, N. Ayivi-Guedehoussou, N. Klitgord, C. Simon, M. Boxem, S. Milstein, J. Rosenberg, D. Goldberg, L. Zhang, S. Wong, G. Franklin, and M. Vidal. Towards a proteome-scale map of the human protein-protein interaction network. Nature, 437:1173–8, 2005.

C. Rusu. Learning multiplication-free linear transformations. arXiv 1812.03412, 2018.

C. Rusu and L. Rosasco. Fast approximation of orthogonal matrices and appli- cation to PCA. arXiv 1907.08697, 2019.

C. Rusu and J. Thompson. Learning fast sparsifying transforms. IEEE Trans. Sig. Proc., 65(16):4367–4378, 2017.

P. Schonemann. On two-sided orthogonal Procrustes problems. Psychometrika, 33(1):19–33, 1968.

G. Shabat, Y. Shmueli, Y. Aizenbud, and A. Averbuch. Randomized LU de- composition. Applied and Computational Harmonic Analysis, 44(2):246 – 272, 2018.

U. Shalit and G. Chechik. Coordinate-descent for learning orthogonal matrices through Givens rotations. In Proceedings 31st International Conference on Machine Learning (ICML), pages I–548–I–556, 2014.

G. W. Stewart. The decompositional approach to matrix computation. Computing in Science Engineering, 2(1):50–59, 2000.