Given a feature vector feature distance (Mahalanobis distance [1]) between two samples i and j in a feature space as
where M is commonly assumed to be positive definite (PD). Metric learning—identifying the best metric M minimizing a chosen objective function Q(M) subject to
—has been the focus of many recent machine learning research efforts [2, 3, 4, 5, 6].
One key challenge in metric learning is to satisfy the positive (semi-)definite (PSD) cone constraint minimizing Q(M) in a computation-efficient manner. A standard approach is alternating gradient-descent / projection (e.g., proximal gradient (PG) [7]), where a descent step
from current solution
at iteration t in the direction of the negative gradient
is followed by a projection Pr() back to the PSD cone, i.e.,
. However, projection Pr() typically requires eigen-decomposition of M and soft-thresholding of its eigenvalues, which is computation-expensive.
Recent methods consider alternative search spaces of matrices such as sparse or low-rank matrices to ease optimization [3, 4, 5, 8, 9]. While efficient, the assumed restricted search spaces often degrade the quality of sought metric M in defining the Mahalanobis distance. For example, low-rank methods explicitly assume reducibility of the K available features to a lower dimension, and hence exclude the simple yet important weighted feature metric case where M contains only positive diagonal entries [10], i.e.,
show in our experiments that computed metrics by these methods may result in inferior performance for selected applications.
In this paper, we propose a metric learning framework that is both general and projection-free, capable of optimizing any convex differentiable objective Q(M). Compared to low-rank methods, our framework is more encompassing and includes positive-diagonal metric matrices as a special case in the limit1. The main idea is as follows. First, we define a search space S of general graph Laplacian matrices [11], each corresponding to a connected graph with positive edge weights and node degrees. The underlying graph edge weights capture pairwise correlations among the K features, and the self-loops designate relative importance among the features.
Assuming , we next rewrite the PD cone constraint as signal-adaptive linear constraints via Gershgorin disc alignment [12, 13]: i) compute scalars
’s from previous solution
align the Gershgorin disc left-ends of matrix
, ii) derive scaled linear constraints using
ensure PDness of the next computed metric
via the Gershgorin Circle Theorem (GCT) [14]. Linear constraints mean that our proposed alternating optimization of the diagonal and off-diagonal terms in M can be solved speedily as linear programs (LP) [15] via Frank-Wolfe iterations [16]. We prove that for any metric
using scalars
can perfectly align Gershgorin disc left-ends for matrix
at the smallest eigenvalue
. We efficiently update v iteratively using Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) [17] with warm start as diagonal / off-diagonal terms are optimized. Experiments show that our computed graph metrics outperform metrics learned using competing methods in terms of classification tasks.
We consider an undirected graph G = {V, E, W} composed of a node set V of cardinality |V| = N, an edge set E connecting nodes, and a weighted adjacency matrix W. Each edge a positive weight
which reflects the degree of similarity (correlation) between nodes i and j. Specifically, it is common to compute edge weight
as the exponential of the negative feature distance
between nodes i and j [18]:
Using (1) means . We discuss feature distance
in the next section.
There may be self-loops in graph the corresponding diagonal entries of W are positive. The combinatorial graph Laplacian [18] is defined as
the degree matrix—a diagonal matrix where
generalized graph Laplacian [11] accounts for self-loops in G also and is defined as
tracts the diagonal entries of W. Alternatively we can write
, where the generalized degree matrix
is diagonal.
3.1. Graph Metric Matrices
We first define the search space of metric matrices for our optimization framework. We assume that associated with each sample i is a length-K feature vector . A metric matrix
fines the feature distance
Mahalanobis distance [1]— between samples i and j as:
We require M to be a positive definite (PD) matrix2. The special case where M is diagonal with strictly positive entries was studied in [10]. Instead, we study here a more general case: M must be a graph metric matrix, which we define formally as follows.
Definition 1. A PD symmetric matrix M is a graph metric if it is a generalized graph Laplacian matrix with positive edge weights and node degrees for an irreducible graph.
Remark: A generalized graph Laplacian matrix M with positive degrees means ; in graph terminology, each graph node i may have a self-loop, but its loop weight
must satisfy
. Positive edge weights means
Irreducible graph [20] essentially means that any graph node can commute with any other node.
3.2. Problem Formulation
Denote by S the set of all graph metric matrices. We pose an optimization problem for M: find the optimal graph metric M in S— leading to inter-sample distances in (2)—that yields the smallest value of a convex differential objective
where C > 0 is a chosen parameter. Constraint needed to avoid pathological solutions with infinite feature distances, i.e.,
. For stability, we assume also that the objective is lower-bounded, i.e.,
constant
Our strategy to solve (3) is to optimize M’s diagonal and off-diagonal terms alternately using Frank-Wolfe iterations [16], where each iteration is solved as an LP until convergence. We discuss first the initialization of M, then the two optimizations in order. For notation convenience, we will write the objective simply as Q(M), with the understanding that metric M affects first the feature distances , which in turn determine the objective
3.3. Initialization of M
2. Initialize off-diagonal terms
where is a small parameter. Initialization of the diagonal terms ensures that constraints
satisfied. Initialization of the off-diagonal terms ensures that
is symmetric and irreducible, and constraint
satisfied; i.e.,
is a generalized graph Laplacian matrix for graph with positive edge weights. We can hence conclude that initial
is a graph metric, i.e.,
3.4. Optimization of Diagonal Terms
When optimizing M’s diagonal terms , (3) becomes
where . Because the diagonal terms do not affect the irreducibility of matrix M, the only requirements for M to be a graph metric are: i) M must be PD, and ii) diagonals must be strictly positive.
3.4.1. Gershgorin-based Reformulation
To efficiently enforce the PD constraint , we derive sufficient (but not necessary) linear constraints using the Gershgorin Circle Theorem (GCT) [14]. By GCT, each eigenvalue
of a real matrix M resides in at least one Gershgorin disc
, corresponding to row i of M, with center
and radius
Thus a sufficient condition to ensure M is PD (smallest eigenvalue ) is to ensure that all discs’ left-ends are strictly positive, i.e.,
This translates to a linear constraint for each row i:
where is a sufficiently small parameter.
However, GCT lower bound is often loose. When optimizing M’s diagonal terms, enforcing (8) directly means that we are searching for
in a much smaller space than the original space
in (5), resulting in an inferior solution. As an illustration, consider the following example matrix M:
Gershgorin disc left-ends for this matrix are
is the smallest. Thus the diagonal terms {2, 5, 4} do not meet constraints (8). However, M is PD, since its smallest eigenvalue is
3.4.2. Gershgorin Disc Alignment
To derive more appropriate linear constraints—thus more suitable search space when solving , we examine instead the Gershgorin discs of a similar-transformed matrix B from M, i.e.,
where is a diagonal matrix with scalars
along its diagonal,
has the same eigenvalues as M, and thus the smallest Gershgorin disc left-end,
is also a lower bound for
est eigenvalue
Our goal is to derive tight
lower bounds by adapting to good solutions to (5)—by appropriately choosing scalars
used to define similar-transformed B in (10).
Specifically, given scalars has center
and radius
. Thus to ensure B is PD (and hence M is PD), we can write similar linear constraints as (8):
It turns out that given a graph metric M, there exist scalars such that all Gershgorin disc left-ends are aligned at the same value
. We state this formally as a theorem.
Theorem 1. Let M be a graph metric matrix. There exist strictly positive scalars such that all Gershgorin disc left-ends of
are aligned exactly at the smallest eigenvalue, i.e.,
In other words, for matrix B the Gershgorin lower bound , and the bound is the tightest possible. The important corollary is the following:
Corollary 1. For any graph metric matrix M, which by definition is PD, there exist scalars is feasible using linear constraints in (11).
Proof. By Theorem 1, let be scalars such that all Gershgorin disc left-ends of
where is PD. Hence M must also satisfy (11) for all i for sufficiently small
Continuing our earlier example, using , we see that
has all disc left-ends aligned at
. Hence using these scalars and constraints (11), diagonal terms {2, 5, 4} now constitute a feasible solution.
To prove Theorem 1, we first establish the following lemma.
Lemma 1. There exists a first eigenvector v with strictly positive entries for a graph metric matrix M.
Proof. By definition, graph metric matrix M is a generalized graph Laplacian with positive edge weights in W and positive degrees in
be the first eigenvector of M, i.e.,
where is PD. Since the matrix on the right contains only non-negative entries and W is an irreducible matrix, v is a positive eigenvector by the Perron-Frobenius Theorem [21].
We now prove Theorem 1 as follows.
Proof. Denote by v a strictly positive eigenvector corresponding to graph metric matrix M’s smallest eigenvalue
where
(14) means that
Note that the off-diagonal terms v is strictly positive and ii) off-diagonal terms of graph metric M satisfy
Thus defining has all its Gershgorin disc left-ends aligned at
Thus, using a positive first eigenvector v of a graph metric M, one can compute corresponding scalars to align all disc left-ends of
satisfies (11) by Corollary 1. Note that these scalars are signal-adaptive, i.e.,
on v, which is computed from M. Our strategy then is to derive scalars
’s from a good solution
, optimize for a better solu- tion
using scaled Gershgorin linear constraints (11), derive new scalars again until convergence. Specifically,
1. Given scalars ’s, identify a good solution
minimizing objective Q(M) subject to (11), i.e.,
2. Given , update scalars
is the first eigenvector of
3. Increment t and repeat until convergence.
When the scalars in (16) are updated as tion t + 1, we show that previous solution
at iteration t remains feasible at iteration t + 1:
Lemma 2. Solution in iteration t remains feasible in iteration t + 1, when scalars
for the linear constraints in (16) are updated as
is the first eigenvector of
Proof. Using the first eigenvector of graph metric
at iteration t, by the proof of Theorem 1 we know that the Gershgorin disc left-ends of
are aligned at
is a feasible solution in (16),
is also a feasible solution when scalars are updated as
The remaining issue is how to best compute first eigenvector given solution
repeatedly. For this task, we employ Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) [17], a fast iterative algorithm known to compute extreme eigenpairs effi-ciently. Further, using previously computed eigenvector
initial guess, LOBPCG benefits from warm start when computing
, reducing its complexity in subsequent iterations [17].
3.4.3. Frank-Wolfe Algorithm
To solve (16), we employ the Frank-Wolfe algorithm [16] that iteratively linearizes the objective Q(M) using its gradient with respect to diagonal terms
, computed using previous solution
Given gradient , optimization (16) becomes a linear program (LP) at each iteration t:
where is a vector composed of diagonal terms
are off-diagonal terms of previous solution
. LP (18) can be solved efficiently using known fast algorithms such as Simplex [15] and interior point method [22]. When a new solution
is obtained, gradient
updated, and LP (18) is solved again until convergence.
3.5. Optimization of Off-diagonal Entries
For off-diagonal entries of M, we design a block coordinate descent algorithm, which optimizes one row / column at a time.
3.5.1. Block Coordinate Iteration
First, we divide M into four sub-matrices:
where is symmetric,
. We optimize
in one iteration, i.e.,
In the next iteration, a different row / column i is selected, and with appropriate row / column permutation, we still optimize the first column off-diagonal terms as in (20).
Note that the constraint in (3) can be ignored, since it does not involve optimization variable
to remain in the set S of graph metric matrices, i) M must be PD, ii) M must be irreducible, and iii)
As done for the diagonal terms optimization, we replace the PD constraint with Gershgorin-based linear constraints. To ensure irreducibility (i.e., the graph remains connected), we ensure that at least one off-diagonal term (say index ) in column 1 has magnitude at least
. The optimization thus becomes:
Essentially any selection of in (21) can ensure M is irreducible. To encourage solution convergence, we select
as the index of the previously optimized
with the largest magnitude.
(21) also has a convex differentiable objective with a set of linear constraints. We thus employ the Frank-Wolfe algorithm again to iteratively linearize the objective using gradient spect to off-diagonal
, where the solution in each iteration is solved as an LP. We omit the details for brevity.
We evaluate our proposed metric learning method in classification performance. Specifically, the objective function Q(M) we consider here is the graph Laplacian Regularizer (GLR) [18, 23]:
A small GLR means that signal z at connected node pairs are similar for a large edge weight
with respect to the variation operator L(M). GLR has been used in the GSP literature to solve a range of inverse problems, including image denoising [23], deblurring [24], dequantization amd contrast enhancement [25], and soft decoding of JPEG [26].
We evaluate our method with the following competing schemes: three metric learning methods that only learn the diagonals of M, i.e., [27], [28], and [10], and two methods that learn the full matrix M, i.e., [6] and [29]. We perform classification tasks using one of the following two classifiers: 1) a k-nearest-neighbour clas-sifier, and 2) a graph-based classifier with quadratic formulation
in subset F are the observed labels. We evaluate all classifiers on wine (3 classes, 13 features and 178 samples), iris (3 classes, 4 features and 150 samples), seeds (3 classes, 7 features and 210 samples), and pb (2 classes, 10 features and 300 samples). All experiments were performed in Matlab R2017a on an i5-7500, 8GB of RAM, Windows 10 PC. We perform 2-fold cross validation 50 times using 50 random seeds (0 to 49) with one-against-all classification strategy. As shown in Table 1, our proposed metric learning method has the lowest classification error rates with a graph-based classifier.
Table 1. Classification error rates. (GB=Graph-based classifier.)
[1] P. C. Mahalanobis, “On the generalized distance in statistics,” Proceedings of the National Institute of Sciences of India, vol. 2, no. 1, pp. 49–55, April 1936.
[2] K. Q. Weinberger and L. K. Saul, “Distance metric learning for large margin nearest neighbor classification,” Journal of Machine Learning Research, vol. 10, no. 2, pp. 207–244, Feb. 2009.
[3] G.-J. Qi, J. Tang, Z.-J. Zha, T.-S. Chua, and H.-J. Zhang, “An efficient sparse metric learning in high-dimensional space via -penalized log-determinant regularization,” in ICML, June 2009, pp. 841–848.
[4] D. Lim, G. Lanckriet, and B. McFee, “Robust structural metric learning,” in ICML, June 2013, pp. 615–623.
[5] W. Liu, C. Mu, R. Ji, S. Ma, J. R. Smith, and S.-F. Chang, “Low-rank similarity metric learning in high dimensions,” in AAAI, Jan. 2015, p. 27922799.
[6] P. Zadeh, R. Hosseini, and S. Sra, “Geometric mean metric learning,” in ICML, June 2016, pp. 2464–2471.
[7] N. Parikh and S. Boyd, “Proximal algorithms,” Foundations and Trends in Optimization, vol. 1, no. 3, pp. 127–239, Jan. 2014.
[8] Y. Mu, “Fixed-rank supervised metric learning on Riemannian manifold,” in AAAI, Feb. 2016, pp. 1941–1947.
[9] J. Zhang and L. Zhang, “Efficient stochastic optimization for low-rank distance metric learning,” in AAAI, Feb. 2017, pp. 933–939.
[10] C. Yang, G. Cheung, and V. Stankovic, “Alternating binary classifier and graph learning from partial labels,” in APSIPA, Nov. 2018, pp. 1137–1140.
[11] T. Biyikoglu, J. Leydold, and P. F. Stadler, “Nodal domain the- orems and bipartite subgraphs,” The Electronic Journal of Linear Algebra, vol. 13, pp. 344–351, Jan. 2005.
[12] Y. Bai, G. Cheung, F. Wang, X. Liu, and W. Gao, “Reconstruction-cognizant graph sampling using Gershgorin disc alignment,” in ICASSP, May 2019, pp. 5396–5400.
[13] Y. Bai, F. Wang, G. Cheung, Y. Nakatsukasa, and W. Gao, “Fast graph sampling set selection using Gershgorin disc alignment,” arXiv, 2019.
[14] R. S. Varga, Gershgorin and his circles. Springer, 2004.
[15] C. Papadimitriou and K. Steiglitz, Combinatorial Optimization. Dover Publications, Inc, 1998.
[16] M. Jaggi, “Revisiting Frank-Wolfe: Projection-free sparse con- vex optimization,” in ICML, Jun. 2013, pp. 427–435.
[17] A. V. Knyazev, “Toward the optimal preconditioned eigen- solver: Locally optimal block preconditioned conjugate gradient method,” SIAM Journal on Scientific Computing, vol. 23, no. 2, pp. 517–541, 2001.
[18] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, “The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains,” IEEE Signal Processing Magazine, vol. 30, pp. 83–98, May 2013.
[19] M. Vetterli, J. Kovacevic, and V. Goyal, Foundations of Signal Processing. Cambridge University Press, 2014.
[20] M. Milgram, “Irreducible graphs,” Journal Of Combinatorial Theory (B), vol. 12, pp. 6–31, Feb. 1972.
[21] R. Horn and C. Johnson, Matrix Analysis. Cambridge University Press, 2012.
[22] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2009.
[23] J. Pang and G. Cheung, “Graph Laplacian regularization for image denoising: Analysis in the continuous domain,” IEEE Transactions on Image Processing, vol. 26, no. 4, pp. 1770– 1785, April 2017.
[24] Y. Bai, G. Cheung, X. Liu, and W. Gao, “Graph-based blind image deblurring from a single photograph,” IEEE Transactions on Image Processing, vol. 28, no. 3, pp. 1404–1418, March 2019.
[25] X. Liu, G. Cheung, X. Ji, D. Zhao, and W. Gao, “Graph- based joint dequantization and contrast enhancement of poorly lit JPEG images,” IEEE Transactions on Image Processing, vol. 28, no. 3, pp. 1205–1219, March 2019.
[26] X. Liu, G. Cheung, X. Wu, and D. Zhao, “Random walk graph Laplacian-based smoothness prior for soft decoding of JPEG images,” IEEE Transactions on Image Processing, vol. 26, no. 2, pp. 509–524, Feb. 2017.
[27] X. Zhu, Z. Ghahramani, and J. Lafferty, “Semi-supervised learning using Gaussian fields and harmonic functions,” in ICML, Aug. 2003, pp. 912–919.
[28] Y. Mao, G. Cheung, C.-W. Lin, and Y. Ji, “Joint learning of similarity graph and image classifier from partial labels,” in APSIPA, Dec. 2016, pp. 1–4.
[29] W. Hu, X. Gao, G. Cheung, and Z. Guo, “Feature graph learning for 3d point cloud denoising,” arXiv, 2019.