Graph Metric Learning via Gershgorin Disc Alignment

2020·Arxiv

ABSTRACT

ABSTRACT

We propose a general projection-free metric learning framework, where the minimization objective is a convex differentiable function of the metric matrix M, and M resides in the set S of generalized graph Laplacian matrices for connected graphs with positive edge weights and node degrees. Unlike low-rank metric matrices common in the literature, S includes the important positive-diagonal-only matrices as a special case in the limit. The key idea for fast optimization is to rewrite the positive definite cone constraint in S as signal-adaptive linear constraints via Gershgorin disc alignment, so that the alternating optimization of the diagonal and off-diagonal terms in M can be solved efficiently as linear programs via Frank-Wolfe iterations. We prove that left-ends of the Gershgorin discs can be aligned perfectly using the first eigenvector v of M, which we update iteratively using Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) with warm start as diagonal / off-diagonal terms are optimized. Experiments show that our effi-ciently computed graph metric matrices outperform metrics learned using competing methods in terms of classification tasks.

Index Terms— Metric Learning, graph signal processing

1. INTRODUCTION

Given a feature vector feature distance (Mahalanobis distance [1]) between two samples i and j in a feature space as where M is commonly assumed to be positive definite (PD). Metric learning—identifying the best metric M minimizing a chosen objective function Q(M) subject to —has been the focus of many recent machine learning research efforts [2, 3, 4, 5, 6].

One key challenge in metric learning is to satisfy the positive (semi-)definite (PSD) cone constraint minimizing Q(M) in a computation-efficient manner. A standard approach is alternating gradient-descent / projection (e.g., proximal gradient (PG) [7]), where a descent step from current solution at iteration t in the direction of the negative gradient is followed by a projection Pr() back to the PSD cone, i.e., . However, projection Pr() typically requires eigen-decomposition of M and soft-thresholding of its eigenvalues, which is computation-expensive.

Recent methods consider alternative search spaces of matrices such as sparse or low-rank matrices to ease optimization [3, 4, 5, 8, 9]. While efficient, the assumed restricted search spaces often degrade the quality of sought metric M in defining the Mahalanobis distance. For example, low-rank methods explicitly assume reducibility of the K available features to a lower dimension, and hence exclude the simple yet important weighted feature metric case where M contains only positive diagonal entries [10], i.e.,

show in our experiments that computed metrics by these methods may result in inferior performance for selected applications.

In this paper, we propose a metric learning framework that is both general and projection-free, capable of optimizing any convex differentiable objective Q(M). Compared to low-rank methods, our framework is more encompassing and includes positive-diagonal metric matrices as a special case in the limit1. The main idea is as follows. First, we define a search space S of general graph Laplacian matrices [11], each corresponding to a connected graph with positive edge weights and node degrees. The underlying graph edge weights capture pairwise correlations among the K features, and the self-loops designate relative importance among the features.

Assuming , we next rewrite the PD cone constraint as signal-adaptive linear constraints via Gershgorin disc alignment [12, 13]: i) compute scalars ’s from previous solution align the Gershgorin disc left-ends of matrix , ii) derive scaled linear constraints using ensure PDness of the next computed metric via the Gershgorin Circle Theorem (GCT) [14]. Linear constraints mean that our proposed alternating optimization of the diagonal and off-diagonal terms in M can be solved speedily as linear programs (LP) [15] via Frank-Wolfe iterations [16]. We prove that for any metric using scalars can perfectly align Gershgorin disc left-ends for matrix at the smallest eigenvalue . We efficiently update v iteratively using Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) [17] with warm start as diagonal / off-diagonal terms are optimized. Experiments show that our computed graph metrics outperform metrics learned using competing methods in terms of classification tasks.

2. REVIEW OF SPECTRAL GRAPH THEORY

We consider an undirected graph G = {V, E, W} composed of a node set V of cardinality |V| = N, an edge set E connecting nodes, and a weighted adjacency matrix W. Each edge a positive weight which reflects the degree of similarity (correlation) between nodes i and j. Specifically, it is common to compute edge weight as the exponential of the negative feature distance between nodes i and j [18]:

Using (1) means . We discuss feature distance in the next section.

There may be self-loops in graph the corresponding diagonal entries of W are positive. The combinatorial graph Laplacian [18] is defined as the degree matrix—a diagonal matrix where generalized graph Laplacian [11] accounts for self-loops in G also and is defined as tracts the diagonal entries of W. Alternatively we can write , where the generalized degree matrix is diagonal.

3. GRAPH METRIC LEARNING

3.1. Graph Metric Matrices

We first define the search space of metric matrices for our optimization framework. We assume that associated with each sample i is a length-K feature vector . A metric matrix fines the feature distance Mahalanobis distance [1]— between samples i and j as:

We require M to be a positive definite (PD) matrix2. The special case where M is diagonal with strictly positive entries was studied in [10]. Instead, we study here a more general case: M must be a graph metric matrix, which we define formally as follows.

Definition 1. A PD symmetric matrix M is a graph metric if it is a generalized graph Laplacian matrix with positive edge weights and node degrees for an irreducible graph.

Remark: A generalized graph Laplacian matrix M with positive degrees means ; in graph terminology, each graph node i may have a self-loop, but its loop weight must satisfy . Positive edge weights means Irreducible graph [20] essentially means that any graph node can commute with any other node.

3.2. Problem Formulation

Denote by S the set of all graph metric matrices. We pose an optimization problem for M: find the optimal graph metric M in S— leading to inter-sample distances in (2)—that yields the smallest value of a convex differential objective

where C > 0 is a chosen parameter. Constraint needed to avoid pathological solutions with infinite feature distances, i.e., . For stability, we assume also that the objective is lower-bounded, i.e., constant

Our strategy to solve (3) is to optimize M’s diagonal and off-diagonal terms alternately using Frank-Wolfe iterations [16], where each iteration is solved as an LP until convergence. We discuss first the initialization of M, then the two optimizations in order. For notation convenience, we will write the objective simply as Q(M), with the understanding that metric M affects first the feature distances , which in turn determine the objective

3.3. Initialization of M

2. Initialize off-diagonal terms

where is a small parameter. Initialization of the diagonal terms ensures that constraints satisfied. Initialization of the off-diagonal terms ensures that is symmetric and irreducible, and constraint satisfied; i.e., is a generalized graph Laplacian matrix for graph with positive edge weights. We can hence conclude that initial is a graph metric, i.e.,

3.4. Optimization of Diagonal Terms

When optimizing M’s diagonal terms , (3) becomes

where . Because the diagonal terms do not affect the irreducibility of matrix M, the only requirements for M to be a graph metric are: i) M must be PD, and ii) diagonals must be strictly positive.

3.4.1. Gershgorin-based Reformulation

To efficiently enforce the PD constraint , we derive sufficient (but not necessary) linear constraints using the Gershgorin Circle Theorem (GCT) [14]. By GCT, each eigenvalue of a real matrix M resides in at least one Gershgorin disc , corresponding to row i of M, with center and radius

Thus a sufficient condition to ensure M is PD (smallest eigenvalue ) is to ensure that all discs’ left-ends are strictly positive, i.e.,

This translates to a linear constraint for each row i:

where is a sufficiently small parameter.

However, GCT lower bound is often loose. When optimizing M’s diagonal terms, enforcing (8) directly means that we are searching for in a much smaller space than the original space in (5), resulting in an inferior solution. As an illustration, consider the following example matrix M:

Gershgorin disc left-ends for this matrix are is the smallest. Thus the diagonal terms {2, 5, 4} do not meet constraints (8). However, M is PD, since its smallest eigenvalue is

3.4.2. Gershgorin Disc Alignment

To derive more appropriate linear constraints—thus more suitable search space when solving , we examine instead the Gershgorin discs of a similar-transformed matrix B from M, i.e.,

where is a diagonal matrix with scalars along its diagonal, has the same eigenvalues as M, and thus the smallest Gershgorin disc left-end, is also a lower bound for est eigenvalue Our goal is to derive tight lower bounds by adapting to good solutions to (5)—by appropriately choosing scalars used to define similar-transformed B in (10).

Specifically, given scalars has center and radius . Thus to ensure B is PD (and hence M is PD), we can write similar linear constraints as (8):

It turns out that given a graph metric M, there exist scalars such that all Gershgorin disc left-ends are aligned at the same value . We state this formally as a theorem.

Theorem 1. Let M be a graph metric matrix. There exist strictly positive scalars such that all Gershgorin disc left-ends of are aligned exactly at the smallest eigenvalue, i.e.,

In other words, for matrix B the Gershgorin lower bound , and the bound is the tightest possible. The important corollary is the following:

Corollary 1. For any graph metric matrix M, which by definition is PD, there exist scalars is feasible using linear constraints in (11).

Proof. By Theorem 1, let be scalars such that all Gershgorin disc left-ends of

where is PD. Hence M must also satisfy (11) for all i for sufficiently small

Continuing our earlier example, using , we see that has all disc left-ends aligned at . Hence using these scalars and constraints (11), diagonal terms {2, 5, 4} now constitute a feasible solution.

To prove Theorem 1, we first establish the following lemma.

Lemma 1. There exists a first eigenvector v with strictly positive entries for a graph metric matrix M.

Proof. By definition, graph metric matrix M is a generalized graph Laplacian with positive edge weights in W and positive degrees in be the first eigenvector of M, i.e.,

where is PD. Since the matrix on the right contains only non-negative entries and W is an irreducible matrix, v is a positive eigenvector by the Perron-Frobenius Theorem [21].

We now prove Theorem 1 as follows.

Proof. Denote by v a strictly positive eigenvector corresponding to graph metric matrix M’s smallest eigenvalue

where

(14) means that

Note that the off-diagonal terms v is strictly positive and ii) off-diagonal terms of graph metric M satisfy

Thus defining has all its Gershgorin disc left-ends aligned at

Thus, using a positive first eigenvector v of a graph metric M, one can compute corresponding scalars to align all disc left-ends of satisfies (11) by Corollary 1. Note that these scalars are signal-adaptive, i.e., on v, which is computed from M. Our strategy then is to derive scalars ’s from a good solution , optimize for a better solu- tion using scaled Gershgorin linear constraints (11), derive new scalars again until convergence. Specifically,

1. Given scalars ’s, identify a good solution minimizing objective Q(M) subject to (11), i.e.,

2. Given , update scalars is the first eigenvector of

3. Increment t and repeat until convergence.

When the scalars in (16) are updated as tion t + 1, we show that previous solution at iteration t remains feasible at iteration t + 1:

Lemma 2. Solution in iteration t remains feasible in iteration t + 1, when scalars for the linear constraints in (16) are updated as is the first eigenvector of

Proof. Using the first eigenvector of graph metric at iteration t, by the proof of Theorem 1 we know that the Gershgorin disc left-ends of are aligned at is a feasible solution in (16), is also a feasible solution when scalars are updated as

The remaining issue is how to best compute first eigenvector given solution repeatedly. For this task, we employ Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) [17], a fast iterative algorithm known to compute extreme eigenpairs effi-ciently. Further, using previously computed eigenvector initial guess, LOBPCG benefits from warm start when computing , reducing its complexity in subsequent iterations [17].

3.4.3. Frank-Wolfe Algorithm

To solve (16), we employ the Frank-Wolfe algorithm [16] that iteratively linearizes the objective Q(M) using its gradient with respect to diagonal terms , computed using previous solution

Given gradient , optimization (16) becomes a linear program (LP) at each iteration t:

where is a vector composed of diagonal terms are off-diagonal terms of previous solution . LP (18) can be solved efficiently using known fast algorithms such as Simplex [15] and interior point method [22]. When a new solution is obtained, gradient updated, and LP (18) is solved again until convergence.

3.5. Optimization of Off-diagonal Entries

For off-diagonal entries of M, we design a block coordinate descent algorithm, which optimizes one row / column at a time.

3.5.1. Block Coordinate Iteration

First, we divide M into four sub-matrices:

where is symmetric, . We optimize in one iteration, i.e.,

In the next iteration, a different row / column i is selected, and with appropriate row / column permutation, we still optimize the first column off-diagonal terms as in (20).

Note that the constraint in (3) can be ignored, since it does not involve optimization variable to remain in the set S of graph metric matrices, i) M must be PD, ii) M must be irreducible, and iii)

As done for the diagonal terms optimization, we replace the PD constraint with Gershgorin-based linear constraints. To ensure irreducibility (i.e., the graph remains connected), we ensure that at least one off-diagonal term (say index ) in column 1 has magnitude at least . The optimization thus becomes:

Essentially any selection of in (21) can ensure M is irreducible. To encourage solution convergence, we select as the index of the previously optimized with the largest magnitude.

(21) also has a convex differentiable objective with a set of linear constraints. We thus employ the Frank-Wolfe algorithm again to iteratively linearize the objective using gradient spect to off-diagonal , where the solution in each iteration is solved as an LP. We omit the details for brevity.

4. EXPERIMENTS

We evaluate our proposed metric learning method in classification performance. Specifically, the objective function Q(M) we consider here is the graph Laplacian Regularizer (GLR) [18, 23]:

A small GLR means that signal z at connected node pairs are similar for a large edge weight with respect to the variation operator L(M). GLR has been used in the GSP literature to solve a range of inverse problems, including image denoising [23], deblurring [24], dequantization amd contrast enhancement [25], and soft decoding of JPEG [26].

We evaluate our method with the following competing schemes: three metric learning methods that only learn the diagonals of M, i.e., [27], [28], and [10], and two methods that learn the full matrix M, i.e., [6] and [29]. We perform classification tasks using one of the following two classifiers: 1) a k-nearest-neighbour clas-sifier, and 2) a graph-based classifier with quadratic formulation

in subset F are the observed labels. We evaluate all classifiers on wine (3 classes, 13 features and 178 samples), iris (3 classes, 4 features and 150 samples), seeds (3 classes, 7 features and 210 samples), and pb (2 classes, 10 features and 300 samples). All experiments were performed in Matlab R2017a on an i5-7500, 8GB of RAM, Windows 10 PC. We perform 2-fold cross validation 50 times using 50 random seeds (0 to 49) with one-against-all classification strategy. As shown in Table 1, our proposed metric learning method has the lowest classification error rates with a graph-based classifier.

Table 1. Classification error rates. (GB=Graph-based classifier.)

5. REFERENCES

[1] P. C. Mahalanobis, “On the generalized distance in statistics,” Proceedings of the National Institute of Sciences of India, vol. 2, no. 1, pp. 49–55, April 1936.

[2] K. Q. Weinberger and L. K. Saul, “Distance metric learning for large margin nearest neighbor classification,” Journal of Machine Learning Research, vol. 10, no. 2, pp. 207–244, Feb. 2009.

[3] G.-J. Qi, J. Tang, Z.-J. Zha, T.-S. Chua, and H.-J. Zhang, “An efficient sparse metric learning in high-dimensional space via -penalized log-determinant regularization,” in ICML, June 2009, pp. 841–848.

[4] D. Lim, G. Lanckriet, and B. McFee, “Robust structural metric learning,” in ICML, June 2013, pp. 615–623.

[5] W. Liu, C. Mu, R. Ji, S. Ma, J. R. Smith, and S.-F. Chang, “Low-rank similarity metric learning in high dimensions,” in AAAI, Jan. 2015, p. 27922799.

[6] P. Zadeh, R. Hosseini, and S. Sra, “Geometric mean metric learning,” in ICML, June 2016, pp. 2464–2471.

[7] N. Parikh and S. Boyd, “Proximal algorithms,” Foundations and Trends in Optimization, vol. 1, no. 3, pp. 127–239, Jan. 2014.

[8] Y. Mu, “Fixed-rank supervised metric learning on Riemannian manifold,” in AAAI, Feb. 2016, pp. 1941–1947.

[9] J. Zhang and L. Zhang, “Efficient stochastic optimization for low-rank distance metric learning,” in AAAI, Feb. 2017, pp. 933–939.

[10] C. Yang, G. Cheung, and V. Stankovic, “Alternating binary classifier and graph learning from partial labels,” in APSIPA, Nov. 2018, pp. 1137–1140.

[11] T. Biyikoglu, J. Leydold, and P. F. Stadler, “Nodal domain the- orems and bipartite subgraphs,” The Electronic Journal of Linear Algebra, vol. 13, pp. 344–351, Jan. 2005.

[12] Y. Bai, G. Cheung, F. Wang, X. Liu, and W. Gao, “Reconstruction-cognizant graph sampling using Gershgorin disc alignment,” in ICASSP, May 2019, pp. 5396–5400.

[13] Y. Bai, F. Wang, G. Cheung, Y. Nakatsukasa, and W. Gao, “Fast graph sampling set selection using Gershgorin disc alignment,” arXiv, 2019.

[14] R. S. Varga, Gershgorin and his circles. Springer, 2004.

[15] C. Papadimitriou and K. Steiglitz, Combinatorial Optimization. Dover Publications, Inc, 1998.

[16] M. Jaggi, “Revisiting Frank-Wolfe: Projection-free sparse con- vex optimization,” in ICML, Jun. 2013, pp. 427–435.

[17] A. V. Knyazev, “Toward the optimal preconditioned eigen- solver: Locally optimal block preconditioned conjugate gradient method,” SIAM Journal on Scientific Computing, vol. 23, no. 2, pp. 517–541, 2001.

[18] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, “The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains,” IEEE Signal Processing Magazine, vol. 30, pp. 83–98, May 2013.

[19] M. Vetterli, J. Kovacevic, and V. Goyal, Foundations of Signal Processing. Cambridge University Press, 2014.

[20] M. Milgram, “Irreducible graphs,” Journal Of Combinatorial Theory (B), vol. 12, pp. 6–31, Feb. 1972.

[21] R. Horn and C. Johnson, Matrix Analysis. Cambridge University Press, 2012.

[22] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2009.

[23] J. Pang and G. Cheung, “Graph Laplacian regularization for image denoising: Analysis in the continuous domain,” IEEE Transactions on Image Processing, vol. 26, no. 4, pp. 1770– 1785, April 2017.

[24] Y. Bai, G. Cheung, X. Liu, and W. Gao, “Graph-based blind image deblurring from a single photograph,” IEEE Transactions on Image Processing, vol. 28, no. 3, pp. 1404–1418, March 2019.

[25] X. Liu, G. Cheung, X. Ji, D. Zhao, and W. Gao, “Graph- based joint dequantization and contrast enhancement of poorly lit JPEG images,” IEEE Transactions on Image Processing, vol. 28, no. 3, pp. 1205–1219, March 2019.

[26] X. Liu, G. Cheung, X. Wu, and D. Zhao, “Random walk graph Laplacian-based smoothness prior for soft decoding of JPEG images,” IEEE Transactions on Image Processing, vol. 26, no. 2, pp. 509–524, Feb. 2017.

[27] X. Zhu, Z. Ghahramani, and J. Lafferty, “Semi-supervised learning using Gaussian fields and harmonic functions,” in ICML, Aug. 2003, pp. 912–919.

[28] Y. Mao, G. Cheung, C.-W. Lin, and Y. Ji, “Joint learning of similarity graph and image classifier from partial labels,” in APSIPA, Dec. 2016, pp. 1–4.

[29] W. Hu, X. Gao, G. Cheung, and Z. Guo, “Feature graph learning for 3d point cloud denoising,” arXiv, 2019.