b

DiscoverSearch
About
My stuff
Graph Metric Learning via Gershgorin Disc Alignment
2020·arXiv
ABSTRACT
ABSTRACT

We propose a general projection-free metric learning framework, where the minimization objective  minM∈S Q(M)is a convex differentiable function of the metric matrix M, and M resides in the set S of generalized graph Laplacian matrices for connected graphs with positive edge weights and node degrees. Unlike low-rank metric matrices common in the literature, S includes the important positive-diagonal-only matrices as a special case in the limit. The key idea for fast optimization is to rewrite the positive definite cone constraint in S as signal-adaptive linear constraints via Gershgorin disc alignment, so that the alternating optimization of the diagonal and off-diagonal terms in M can be solved efficiently as linear programs via Frank-Wolfe iterations. We prove that left-ends of the Gershgorin discs can be aligned perfectly using the first eigenvector v of M, which we update iteratively using Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) with warm start as diagonal / off-diagonal terms are optimized. Experiments show that our effi-ciently computed graph metric matrices outperform metrics learned using competing methods in terms of classification tasks.

Index Terms— Metric Learning, graph signal processing

Given a feature vector  fi ∈ RK per sample i, a metric matrix M ∈RK×K defines thefeature distance (Mahalanobis distance [1]) between two samples i and j in a feature space as  (fi−fj)⊤M(fi−fj),where M is commonly assumed to be positive definite (PD). Metric learning—identifying the best metric M minimizing a chosen objective function Q(M) subject to  M ≻ 0—has been the focus of many recent machine learning research efforts [2, 3, 4, 5, 6].

One key challenge in metric learning is to satisfy the positive (semi-)definite (PSD) cone constraint  M ≻ 0 (M ⪰ 0) whenminimizing Q(M) in a computation-efficient manner. A standard approach is alternating gradient-descent / projection (e.g., proximal gradient (PG) [7]), where a descent step  αfrom current solution  Mt at iteration t in the direction of the negative gradient −∇Q(Mt)is followed by a projection Pr() back to the PSD cone, i.e.,  Mt+1 := Pr�Mt − α∇Q(Mt)�. However, projection Pr() typically requires eigen-decomposition of M and soft-thresholding of its eigenvalues, which is computation-expensive.

Recent methods consider alternative search spaces of matrices such as sparse or low-rank matrices to ease optimization [3, 4, 5, 8, 9]. While efficient, the assumed restricted search spaces often degrade the quality of sought metric M in defining the Mahalanobis distance. For example, low-rank methods explicitly assume reducibility of the K available features to a lower dimension, and hence exclude the simple yet important weighted feature metric case where M contains only positive diagonal entries [10], i.e.,

image

show in our experiments that computed metrics by these methods may result in inferior performance for selected applications.

In this paper, we propose a metric learning framework that is both general and projection-free, capable of optimizing any convex differentiable objective Q(M). Compared to low-rank methods, our framework is more encompassing and includes positive-diagonal metric matrices as a special case in the limit1. The main idea is as follows. First, we define a search space S of general graph Laplacian matrices [11], each corresponding to a connected graph with positive edge weights and node degrees. The underlying graph edge weights capture pairwise correlations among the K features, and the self-loops designate relative importance among the features.

Assuming  M ∈ S, we next rewrite the PD cone constraint as signal-adaptive linear constraints via Gershgorin disc alignment [12, 13]: i) compute scalars  sk’s from previous solution  Mt thatalign the Gershgorin disc left-ends of matrix  SMtS−1, where S =diag(s1, . . . , sK), ii) derive scaled linear constraints using  sk’s toensure PDness of the next computed metric  Mt+1 via the Gershgorin Circle Theorem (GCT) [14]. Linear constraints mean that our proposed alternating optimization of the diagonal and off-diagonal terms in M can be solved speedily as linear programs (LP) [15] via Frank-Wolfe iterations [16]. We prove that for any metric  Mt in S,using scalars  sk = 1/vkcan perfectly align Gershgorin disc left-ends for matrix  SMtS−1 at the smallest eigenvalue  λmin, whereMtv = λminv. We efficiently update v iteratively using Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) [17] with warm start as diagonal / off-diagonal terms are optimized. Experiments show that our computed graph metrics outperform metrics learned using competing methods in terms of classification tasks.

We consider an undirected graph G = {V, E, W} composed of a node set V of cardinality |V| = N, an edge set E connecting nodes, and a weighted adjacency matrix W. Each edge  (i, j) ∈ E hasa positive weight  wi,j > 0which reflects the degree of similarity (correlation) between nodes i and j. Specifically, it is common to compute edge weight  wi,jas the exponential of the negative feature distance  δi,jbetween nodes i and j [18]:

image

Using (1) means  wi,j ∈ (0, 1] for δi,j ∈ [0, ∞). We discuss feature distance  δi,jin the next section.

There may be self-loops in graph  G, i.e., ∃i where wi,i > 0, andthe corresponding diagonal entries of W are positive. The combinatorial graph Laplacian [18] is defined as  L := D − W, where D isthe degree matrix—a diagonal matrix where  di,i = �Nj=1 wi,j. Ageneralized graph Laplacian [11] accounts for self-loops in G also and is defined as  Lg = D − W + diag(W), where diag(W) ex-tracts the diagonal entries of W. Alternatively we can write  Lg =Dg−W, where the generalized degree matrix  Dg = D+diag(W)is diagonal.

3.1. Graph Metric Matrices

We first define the search space of metric matrices for our optimization framework. We assume that associated with each sample i is a length-K feature vector  fi ∈ RK. A metric matrix  M ∈ RK×K de-fines the feature distance  δi,j(M)—theMahalanobis distance [1]— between samples i and j as:

image

We require M to be a positive definite (PD) matrix2. The special case where M is diagonal with strictly positive entries was studied in [10]. Instead, we study here a more general case: M must be a graph metric matrix, which we define formally as follows.

Definition 1. A PD symmetric matrix M is a graph metric if it is a generalized graph Laplacian matrix with positive edge weights and node degrees for an irreducible graph.

Remark: A generalized graph Laplacian matrix M with positive degrees means  mi,i > 0; in graph terminology, each graph node i may have a self-loop, but its loop weight  wi,imust satisfy  wi,i >− �j | j̸=i wi,j. Positive edge weights means  mi,j ≤ 0, i ̸= j.Irreducible graph [20] essentially means that any graph node can commute with any other node.

3.2. Problem Formulation

Denote by S the set of all graph metric matrices. We pose an optimization problem for M: find the optimal graph metric M in S— leading to inter-sample distances  δi,j(M)in (2)—that yields the smallest value of a convex differential objective  Q({δi,j(M)}):

image

where C > 0 is a chosen parameter. Constraint  tr(M) ≤ C isneeded to avoid pathological solutions with infinite feature distances, i.e.,  δi,i(M) = ∞. For stability, we assume also that the objective is lower-bounded, i.e.,  minM∈S Q({δi,j(M)}) ≥ κ > −∞ for someconstant  κ.

Our strategy to solve (3) is to optimize M’s diagonal and off-diagonal terms alternately using Frank-Wolfe iterations [16], where each iteration is solved as an LP until convergence. We discuss first the initialization of M, then the two optimizations in order. For notation convenience, we will write the objective simply as Q(M), with the understanding that metric M affects first the feature distances δi,j(M), which in turn determine the objective  Q({δi,j(M)}).

3.3. Initialization of M

image

2. Initialize off-diagonal terms  m0i,j, i ̸= j, as:

image

where  ǫ > 0is a small parameter. Initialization of the diagonal terms ensures that constraints  tr(M0) ≤ C, M0 ≻ 0 and m0i,i > 0 aresatisfied. Initialization of the off-diagonal terms ensures that  M0is symmetric and irreducible, and constraint  m0i,j ≤ 0, i ̸= j, issatisfied; i.e.,  M0 is a generalized graph Laplacian matrix for graph with positive edge weights. We can hence conclude that initial  M0is a graph metric, i.e.,  M0 ∈ S.

3.4. Optimization of Diagonal Terms

When optimizing M’s diagonal terms  mi,i, (3) becomes

image

where  tr(M) = �i mi,i. Because the diagonal terms do not affect the irreducibility of matrix M, the only requirements for M to be a graph metric are: i) M must be PD, and ii) diagonals must be strictly positive.

3.4.1. Gershgorin-based Reformulation

To efficiently enforce the PD constraint  M ≻ 0, we derive sufficient (but not necessary) linear constraints using the Gershgorin Circle Theorem (GCT) [14]. By GCT, each eigenvalue  λof a real matrix M resides in at least one Gershgorin disc  Ψi, corresponding to row i of M, with center  ci = mi,iand radius  ri = �j | j̸=i |mi,j|, i.e.,

image

Thus a sufficient condition to ensure M is PD (smallest eigenvalue λmin > 0) is to ensure that all discs’ left-ends are strictly positive, i.e.,

image

This translates to a linear constraint for each row i:

image

where  ρ > 0is a sufficiently small parameter.

However, GCT lower bound  mini ci−ri for λminis often loose. When optimizing M’s diagonal terms, enforcing (8) directly means that we are searching for  {mi,i}in a much smaller space than the original space  {M | M ≻ 0}in (5), resulting in an inferior solution. As an illustration, consider the following example matrix M:

image

Gershgorin disc left-ends  mi,i − �j | j̸=i |mi,j|for this matrix are {−1, 1, 1}, of which −1is the smallest. Thus the diagonal terms {2, 5, 4} do not meet constraints (8). However, M is PD, since its smallest eigenvalue is  λmin = 0.1078 > 0.

3.4.2. Gershgorin Disc Alignment

To derive more appropriate linear constraints—thus more suitable search space when solving  minM∈S Q(M), we examine instead the Gershgorin discs of a similar-transformed matrix B from M, i.e.,

image

where  S = diag(s1, . . . , sK)is a diagonal matrix with scalars s1, . . . , sKalong its diagonal,  sk > 0, ∀k. Bhas the same eigenvalues as M, and thus the smallest Gershgorin disc left-end, mini bi,i−�j | j̸=i |bi,j|, for Bis also a lower bound for  M’s small-est eigenvalue  λmin.Our goal is to derive tight  λminlower bounds by adapting to good solutions to (5)—by appropriately choosing scalars  s1, . . . , sKused to define similar-transformed B in (10).

Specifically, given scalars  s1, . . . , sK, a disc Ψi for Bhas center mi,iand radius  si�j | j̸=i |mi,j|/sj. Thus to ensure B is PD (and hence M is PD), we can write similar linear constraints as (8):

image

It turns out that given a graph metric M, there exist scalars s1, . . . , sKsuch that all Gershgorin disc left-ends are aligned at the same value  λmin. We state this formally as a theorem.

Theorem 1. Let M be a graph metric matrix. There exist strictly positive scalars  s1, . . . , sKsuch that all Gershgorin disc left-ends of  B = SMS−1 are aligned exactly at the smallest eigenvalue, i.e., bi,i − �j | j̸=i |bi,j| = λmin, ∀i.

In other words, for matrix B the Gershgorin lower bound mini ci − ri is exactly λmin, and the bound is the tightest possible. The important corollary is the following:

Corollary 1. For any graph metric matrix M, which by definition is PD, there exist scalars  s1, . . . , sK where Mis feasible using linear constraints in (11).

Proof. By Theorem 1, let  s1, . . . , sKbe scalars such that all Gershgorin disc left-ends of  B = SMS−1 align at λmin. Thus

image

where  λmin > 0 since Mis PD. Hence M must also satisfy (11) for all i for sufficiently small  ρ > 0.

Continuing our earlier example, using  s1 = 0.7511, s2 =0.4886 and s3 = 0.4440, we see that  B = SMS−1 for M in (9)has all disc left-ends aligned at  λmin = 0.1078. Hence using these scalars and constraints (11), diagonal terms {2, 5, 4} now constitute a feasible solution.

To prove Theorem 1, we first establish the following lemma.

Lemma 1. There exists a first eigenvector v with strictly positive entries for a graph metric matrix M.

Proof. By definition, graph metric matrix M is a generalized graph Laplacian  Lg = Dg − Wwith positive edge weights in W and positive degrees in  Dg. Let vbe the first eigenvector of M, i.e.,

image

where  λmin > 0 since Mis PD. Since the matrix on the right contains only non-negative entries and W is an irreducible matrix, v is a positive eigenvector by the Perron-Frobenius Theorem [21].

We now prove Theorem 1 as follows.

Proof. Denote by v a strictly positive eigenvector corresponding to graph metric matrix M’s smallest eigenvalue  λmin. Define S =

image

where  Sv = 1 = [1, . . . , 1]⊤. Let B = SMS−1. Then,

image

(14) means that

image

Note that the off-diagonal terms  bi,j = (vi/vj)mi,j ≤ 0, since i)v is strictly positive and ii) off-diagonal terms of graph metric M satisfy  mi,j ≤ 0. Thus,

image

Thus defining  S = diag(1/v1, . . . , 1/vK) means B = SMS−1has all its Gershgorin disc left-ends aligned at  λmin.

Thus, using a positive first eigenvector v of a graph metric M, one can compute corresponding scalars  sk = 1/vkto align all disc left-ends of  B = SMS−1 at λmin, and Msatisfies (11) by Corollary 1. Note that these scalars are signal-adaptive, i.e.,  sk’s dependon v, which is computed from M. Our strategy then is to derive scalars  stk’s from a good solution  Mt−1, optimize for a better solu- tion  Mt using scaled Gershgorin linear constraints (11), derive new scalars again until convergence. Specifically,

1. Given scalars  stk’s, identify a good solution  Mtminimizing objective Q(M) subject to (11), i.e.,

image

2. Given  Mt, update scalars  st+1k = 1/vtk where vtis the first eigenvector of  Mt.

3. Increment t and repeat until convergence.

When the scalars in (16) are updated as  st+1k = 1/vtk for itera-tion t + 1, we show that previous solution  Mt at iteration t remains feasible at iteration t + 1:

Lemma 2. Solution  Mt to (16)in iteration t remains feasible in iteration t + 1, when scalars  st+1ifor the linear constraints in (16) are updated as  st+1i = 1/vti, ∀i, where vtis the first eigenvector of Mt.

Proof. Using the first eigenvector  vt of graph metric  Mt at iteration t, by the proof of Theorem 1 we know that the Gershgorin disc left-ends of  B = SMtS−1 are aligned at  λmin. Since Mt is a feasible solution in (16),  Mt ≻ 0 and λmin > 0. Thus Mt is also a feasible solution when scalars are updated as  si = 1/vti, ∀i.

The remaining issue is how to best compute first eigenvector  vtgiven solution  Mt repeatedly. For this task, we employ Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) [17], a fast iterative algorithm known to compute extreme eigenpairs effi-ciently. Further, using previously computed eigenvector  vt−1 as aninitial guess, LOBPCG benefits from warm start when computing vt, reducing its complexity in subsequent iterations [17].

3.4.3. Frank-Wolfe Algorithm

To solve (16), we employ the Frank-Wolfe algorithm [16] that iteratively linearizes the objective Q(M) using its gradient  ∇Q(Mt)with respect to diagonal terms  {mi,i}, computed using previous solution  Mt, i.e.,

image

Given gradient  ∇Q(Mt), optimization (16) becomes a linear program (LP) at each iteration t:

image

where  vec({mi,i}) = [m1,1 m2,2 . . . mK,K]⊤ is a vector composed of diagonal terms  {mi,i}, and mti,j are off-diagonal terms of previous solution  Mt. LP (18) can be solved efficiently using known fast algorithms such as Simplex [15] and interior point method [22]. When a new solution  {mt+1i,i }is obtained, gradient  ∇Q(Mt+1) isupdated, and LP (18) is solved again until convergence.

3.5. Optimization of Off-diagonal Entries

For off-diagonal entries of M, we design a block coordinate descent algorithm, which optimizes one row / column at a time.

3.5.1. Block Coordinate Iteration

First, we divide M into four sub-matrices:

image

where  m1,1 ∈ R, M1,2 ∈ R1×(K−1), M2,1 ∈ R(K−1)×1 andM2,2 ∈ R(K−1)×(K−1). Assuming Mis symmetric,  M1,2 =M⊤2,1. We optimize  M2,1 in one iteration, i.e.,

image

In the next iteration, a different row / column i is selected, and with appropriate row / column permutation, we still optimize the first column off-diagonal terms  M2,1as in (20).

Note that the constraint  tr(M) ≤ Cin (3) can be ignored, since it does not involve optimization variable  M2,1. For Mto remain in the set S of graph metric matrices, i) M must be PD, ii) M must be irreducible, and iii)  M2,1 ≤ 0.

As done for the diagonal terms optimization, we replace the PD constraint with Gershgorin-based linear constraints. To ensure irreducibility (i.e., the graph remains connected), we ensure that at least one off-diagonal term (say index  ζ) in column 1 has magnitude at least  ǫ > 0. The optimization thus becomes:

image

Essentially any selection of  ζin (21) can ensure M is irreducible. To encourage solution convergence, we select  ζas the index of the previously optimized  Mt2,1 with the largest magnitude.

(21) also has a convex differentiable objective with a set of linear constraints. We thus employ the Frank-Wolfe algorithm again to iteratively linearize the objective using gradient  ∇Q(Mt) with re-spect to off-diagonal  M2,1, where the solution in each iteration is solved as an LP. We omit the details for brevity.

We evaluate our proposed metric learning method in classification performance. Specifically, the objective function Q(M) we consider here is the graph Laplacian Regularizer (GLR) [18, 23]:

image

A small GLR means that signal z at connected node pairs  (zi, zj)are similar for a large edge weight  wi,j, i.e., z is smoothwith respect to the variation operator L(M). GLR has been used in the GSP literature to solve a range of inverse problems, including image denoising [23], deblurring [24], dequantization amd contrast enhancement [25], and soft decoding of JPEG [26].

We evaluate our method with the following competing schemes: three metric learning methods that only learn the diagonals of M, i.e., [27], [28], and [10], and two methods that learn the full matrix M, i.e., [6] and [29]. We perform classification tasks using one of the following two classifiers: 1) a k-nearest-neighbour clas-sifier, and 2) a graph-based classifier with quadratic formulation

image

in subset F are the observed labels. We evaluate all classifiers on wine (3 classes, 13 features and 178 samples), iris (3 classes, 4 features and 150 samples), seeds (3 classes, 7 features and 210 samples), and pb (2 classes, 10 features and 300 samples). All experiments were performed in Matlab R2017a on an i5-7500, 8GB of RAM, Windows 10 PC. We perform 2-fold cross validation 50 times using 50 random seeds (0 to 49) with one-against-all classification strategy. As shown in Table 1, our proposed metric learning method has the lowest classification error rates with a graph-based classifier.

Table 1. Classification error rates. (GB=Graph-based classifier.)

image

[1] P. C. Mahalanobis, “On the generalized distance in statistics,” Proceedings of the National Institute of Sciences of India, vol. 2, no. 1, pp. 49–55, April 1936.

[2] K. Q. Weinberger and L. K. Saul, “Distance metric learning for large margin nearest neighbor classification,” Journal of Machine Learning Research, vol. 10, no. 2, pp. 207–244, Feb. 2009.

[3] G.-J. Qi, J. Tang, Z.-J. Zha, T.-S. Chua, and H.-J. Zhang, “An efficient sparse metric learning in high-dimensional space via l1-penalized log-determinant regularization,” in ICML, June 2009, pp. 841–848.

[4] D. Lim, G. Lanckriet, and B. McFee, “Robust structural metric learning,” in ICML, June 2013, pp. 615–623.

[5] W. Liu, C. Mu, R. Ji, S. Ma, J. R. Smith, and S.-F. Chang, “Low-rank similarity metric learning in high dimensions,” in AAAI, Jan. 2015, p. 27922799.

[6] P. Zadeh, R. Hosseini, and S. Sra, “Geometric mean metric learning,” in ICML, June 2016, pp. 2464–2471.

[7] N. Parikh and S. Boyd, “Proximal algorithms,” Foundations and Trends in Optimization, vol. 1, no. 3, pp. 127–239, Jan. 2014.

[8] Y. Mu, “Fixed-rank supervised metric learning on Riemannian manifold,” in AAAI, Feb. 2016, pp. 1941–1947.

[9] J. Zhang and L. Zhang, “Efficient stochastic optimization for low-rank distance metric learning,” in AAAI, Feb. 2017, pp. 933–939.

[10] C. Yang, G. Cheung, and V. Stankovic, “Alternating binary classifier and graph learning from partial labels,” in APSIPA, Nov. 2018, pp. 1137–1140.

[11] T. Biyikoglu, J. Leydold, and P. F. Stadler, “Nodal domain the- orems and bipartite subgraphs,” The Electronic Journal of Linear Algebra, vol. 13, pp. 344–351, Jan. 2005.

[12] Y. Bai, G. Cheung, F. Wang, X. Liu, and W. Gao, “Reconstruction-cognizant graph sampling using Gershgorin disc alignment,” in ICASSP, May 2019, pp. 5396–5400.

[13] Y. Bai, F. Wang, G. Cheung, Y. Nakatsukasa, and W. Gao, “Fast graph sampling set selection using Gershgorin disc alignment,” arXiv, 2019.

[14] R. S. Varga, Gershgorin and his circles. Springer, 2004.

[15] C. Papadimitriou and K. Steiglitz, Combinatorial Optimization. Dover Publications, Inc, 1998.

[16] M. Jaggi, “Revisiting Frank-Wolfe: Projection-free sparse con- vex optimization,” in ICML, Jun. 2013, pp. 427–435.

[17] A. V. Knyazev, “Toward the optimal preconditioned eigen- solver: Locally optimal block preconditioned conjugate gradient method,” SIAM Journal on Scientific Computing, vol. 23, no. 2, pp. 517–541, 2001.

[18] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, “The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains,” IEEE Signal Processing Magazine, vol. 30, pp. 83–98, May 2013.

[19] M. Vetterli, J. Kovacevic, and V. Goyal, Foundations of Signal Processing. Cambridge University Press, 2014.

[20] M. Milgram, “Irreducible graphs,” Journal Of Combinatorial Theory (B), vol. 12, pp. 6–31, Feb. 1972.

[21] R. Horn and C. Johnson, Matrix Analysis. Cambridge University Press, 2012.

[22] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2009.

[23] J. Pang and G. Cheung, “Graph Laplacian regularization for image denoising: Analysis in the continuous domain,” IEEE Transactions on Image Processing, vol. 26, no. 4, pp. 1770– 1785, April 2017.

[24] Y. Bai, G. Cheung, X. Liu, and W. Gao, “Graph-based blind image deblurring from a single photograph,” IEEE Transactions on Image Processing, vol. 28, no. 3, pp. 1404–1418, March 2019.

[25] X. Liu, G. Cheung, X. Ji, D. Zhao, and W. Gao, “Graph- based joint dequantization and contrast enhancement of poorly lit JPEG images,” IEEE Transactions on Image Processing, vol. 28, no. 3, pp. 1205–1219, March 2019.

[26] X. Liu, G. Cheung, X. Wu, and D. Zhao, “Random walk graph Laplacian-based smoothness prior for soft decoding of JPEG images,” IEEE Transactions on Image Processing, vol. 26, no. 2, pp. 509–524, Feb. 2017.

[27] X. Zhu, Z. Ghahramani, and J. Lafferty, “Semi-supervised learning using Gaussian fields and harmonic functions,” in ICML, Aug. 2003, pp. 912–919.

[28] Y. Mao, G. Cheung, C.-W. Lin, and Y. Ji, “Joint learning of similarity graph and image classifier from partial labels,” in APSIPA, Dec. 2016, pp. 1–4.

[29] W. Hu, X. Gao, G. Cheung, and Z. Guo, “Feature graph learning for 3d point cloud denoising,” arXiv, 2019.


Designed for Accessibility and to further Open Science