Learning Global and Local Consistent Representations for Unsupervised Image Retrieval via Deep Graph Diffusion Networks

2020·Arxiv

Abstract

Abstract

Diffusion has shown great success in improving the accuracy of unsupervised image retrieval systems by utilizing high-order structures of image manifold. However, existing diffusion methods suffer from three major limitations: 1) they usually rely on local structures without considering global manifold information; 2) they focus on improving pairwise similarities within existing input images transductively while lacking the flexibility to learn representations for unseen instances inductively; 3) they fail to scale to large datasets due to prohibitive memory consumption and computational burden due to the intrinsic high-order operations on the whole graph. In this paper, to address these limitations, we propose a novel method, Graph Diffusion Networks (GRAD-Net), that adopts graph neural networks (GNNs), a novel variant of deep learning algorithms on irregular graphs. GRAD-Net learns semantic representations by exploiting both local and global structures of image manifold in an unsupervised fashion. By utilizing sparse coding techniques, GRAD-Net not only preserves global information on the image manifold, but also enables scalable training and efficient querying. Experiments on several large benchmark datasets demonstrate the effectiveness of our method over existing state-of-the-art diffusion algorithms for unsupervised image retrieval.

1 INTRODUCTION

UNSUPERVISED image retrieval refers to the task ofreturning relevant instances in a database given an unlabeled query, which is an important basis for various applications such as information search and database management [1], [2]. Traditionally, unsupervised image retrieval is accomplished by computing either predefined pairwise similarities (e.g. Euclidean distances) or by adopting a random-walk style process (e.g. diffusion [3]). It is widely acknowledged that no single metric can generate reliable retrieval performance due to the “curse” of dimensionality [4]. Consequently, most of the existing works on unsupervised image retrieval have been focusing on exploiting the diffusion process to learn context-sensitive affinity measures [5]–[7].

The principle of capturing geometric manifold applies to most existing diffusion methods. First, the manifold is interpreted as a weighted graph, where each instance is represented by a node, and weights on the edges connecting nodes represent the similarities. The pairwise affinities are then updated iteratively by diffusing along the graph geometry. This diffusion process, originally proposed by Zhou et al. [8], typically follows the concept of random walk, where a transition matrix derived from the edge weights determines the probabilities of transiting from one node to another. The updated affinity values in turn improve the retrieval results. To illustrate this concept, we consider the following toy example. As shown in Fig. 1, each point is a simple analogy to the feature vector of an image. The data points consist of four letters, each with 1,500 points. The queries are displayed as the star nodes. The ideal result for this retrieval is that the points from the same letter as the query point should have the highest ranks. Euclidean distance as a similarity metric is inadequate for this task (Fig. 1a), while in comparison, after diffusing the similarities on the graph, the retrieval result is significantly improved (Fig. 1b).

The success of deep neural networks on various applications partly results from the powerful image features with rich semantic information. Models, particularly deep convolutional networks [9]–[12] pre-trained on large datasets, such as ImageNet [13] and Landmarks [9], are increasingly being used for feature extraction. Diffusion is then deployed on the extracted features to further improve the retrieval results.

However, these diffusion-based methods suffer from three main limitations. First, they usually rely on local structures without considering global manifold information, that is sparsified graphs based on neighborhood search are extensively used without considering the cluster structures as a whole. Secondly, these diffusion-based methods focus on improving pairwise similarities within existing images transductively but lack the flexibility to learn representations

Fig. 1: The retrieval results on a synthetic toy dataset using (a) Euclidean distance and (b) diffusion. Four queries are marked as stars.

for unseen instances inductively. In other words, it is difficult to generalize the diffusion models to unobserved databases without re-computing the diffusion iterations. Lastly, they usually fail to scale to large datasets due to prohibitive memory consumption and computational burden resulting from the intrinsic high-order operations on the whole graphs.

Graph neural networks (GNNs) have recently emerged as a promising line of research that adopts convolutional operators on irregular inputs like graphs [14]–[17]. For instance, there has been an increasing interest in applying GNNs to citation network analysis [14], [18], collaborative filtering [19] and knowledge graphs [20], all of which demonstrated state-of-the-art performance. However, most existing GNNs rely on supervised learning to train an effective model, which prevents direct adoption of GNNs to the unsupervised image retrieval tasks due to the lack of labeled data.

In this paper, we propose a novel approach, Graph Diffusion Networks (GRAD-Net), for unsupervised image retrieval. GRAD-Net consists of GNNs that are trained with two loss functions that directly depict the diffusion process without any labeled information. The diffusion process is thus fully learnable, and the features are trained to transform on an optimal manifold subject to the defined loss in an unsupervised fashion. Benefiting from the generalizability of GNNs, our model can be easily extended to unseen queries. Furthermore, to consider the global structures, GRAD-Net constructs an additional bipartite graph to learn consistent global and local representations of all the images. GRAD-Net alleviates the common scalability issue of diffusion methods by effectively sampling subnetworks in a mini-batch manner at each training iteration. We demonstrate the effectiveness of GRAD-Net with extensive empirical experiments.

Our main contributions are summarized as follows:

1) Unsupervised representation learning on image manifolds. GRAD-Net, among the first-in-class deep learning models for unsupervised diffusion process on image retrieval, updates the instance features (i.e. the node attributes on the graphs) in an end-to-end manner rather than the conventional choice of diffusion rank values or similarities on

manifolds. We transform the conventional diffusion from a message-passing process to an optimization process through a multi-layer neural networks with an efficient diffusion-like operator. We show that this approach enriches the features with more semantic information, and the learned features can be used to perform efficient retrieval with a simple nearest neighbor search and support effective inductive learning.

2) Achieving global and local consistency. We introduce a series of novel techniques including: (i) second order propagation that accelerates the message aggregation in GNNs; (ii) local and global loss functions that leverage the structure of the manifold; (iii) sparse coding that adopts a bipartite graph technique on top of the sparsified pairwise graphs to reflect the global cluster information. These novel techniques collectively contribute to GRAD-Net to achieve global and local consistency.

3) Intrinsic scalability and high efficiency. Unlike traditional diffusion methods that suffer from at least quadratic computational complexity, our proposed GRAD-Net has linear complexity and is intrinsically scalable to large databases through mini-batch training on the graphs. GRAD-Net has proven its high efficiency in both training and query processes on very large-scale datasets.

2 RELATED WORK

2.1 Diffusion for Unsupervised Image Retrieval

Metric learning has been a central task for image retrieval. There is a myriad of works focusing on learning a general metric to compute pairwise distances between images [21], [22]. In contrast to conventional metric learning methods, diffusion-based methods have emerged as a promising alternative that learns pairwise similarities on image manifolds. Introduced by Zhou et al. [8], diffusion has shown consistent improvements over raw distances/similarities by exploiting intrinsic manifold geometry [23]. Inspired by semi-supervised learning, Graph Transduction (GT) [24] takes the query point as the only labeled data, and propagates the labeled point to unlabeled databases in a similar fashion to label propagation [25]. Motivated by the observation that a good ranking is usually asymmetrical, Contextual Dissimilarity Measure (CDM) [26] improves Bag-of-Words (BoW) [27] retrieval system by iteratively estimating the pairwise distance in the spirit of Sinkhorns scaling algorithm [28].

Further, noticing that diffusion is susceptible to noise edges in the affinity graph, Locally Constrained Diffusion Process (LCDP) [29] stresses that it is crucial to constrain the diffusion process “locally”. Along the same line, Tensor Product Graph diffusion (TPG) [30] manages to leverage the high-order information from the tensor product of the affinity graphs. However, TPG constrains the pathways of message passing to local neighbors only such that the computational complexity does not increase significantly. A detailed comparison of various diffusion methods has been conducted in a survey [3] by enumerating 72 variants of diffusion process (4 different affinity initializations, 6 different transition matrices and 3 different updating schemes). Bai et al. [31] provides empirical and theoretical results and suggests that metric learning on tensor product graphs with high-order information is more robust to retrieval. Recently, Bai et al. [6] enhances the tensor product to a regularization process on graphs and displays its potential of retrieval for heterogeneous instances.

2.2 Graph Neural Networks

Recent years have witnessed an increasing trend in applying deep learning algorithms to irregular inputs such as graphs. Early works include recursive neural networks [32] that represent and process graph data. GNNs were introduced by Gori et al. [33] and Scarselli et al. [34] as a generalization of recursive neural networks that can directly deal with graphs. Typical GNNs consist of an iterative random-walk process in which node states are propagated based on certain probability distribution until convergence. This idea was improved by Li et al. [35] that uses gated recurrent units [36] in the propagation step. However, these recurrent models usually suffer from large computational burdens and therefore hard to train and scale to large graphs.

Inspired by the tremendous success of convolutional neural networks (CNNs) in computer vision, Bruna et al. [37] introduced the convolution operation in the Fourier domain by computing the eigendecomposition of the graph Laplacian. It was then improved by Henaff et al. [38] that parameterizes the spectral filters with smooth coefficients such that the computations are spatially localized. The milestone work by Kipf et al. [14] introduced graph convolutional network (GCN) that simplifies the previous methods by restricting the filters to operate in the first-order neighborhood around each node. Since then, various modifications have been proposed to improve upon GCN. For example, Velikovi et al. [39] adopted the attention mechanism to address the high sensitivity to noise issue in the graph edges of GCN by computing learnable attentions among the nodes.

The applications of GNNs in computer vision are booming with notable examples such as few-shot image clas-sification [40], [41], semantic segmentation [42], [43] and visual question answering [44], [45]. However, most of these applications require an extensive amount of labeled data. Recently, Jiang et al. [46] introduces new GNN layer operations for feature diffusion in a semi-supervised setting, and Guided Similarity Separation (GSS) [47] applies a Kipfstyle GCN [14] to unsupervised image retrieval but lacks the scalability to large datasets. Apart from the aforementioned works, few has focused on unsupervised image retrieval.

The proposed GRAD-Net, to our best knowledge, is one of the first attempts to adopt GCN-like algorithms in unsupervised image retrieval. Our proposed network requires no label information yet extends the traditional diffusion models to learn global and local consistent representations and enables effective training and querying on large-scale datasets.

3 OUR APPROACH

In this section, we introduce the detailed components of the proposed Graph Diffusion Network (GRAD-Net). We first review the theoretical bases of the diffusion process on graphs and GCN, and then elaborate on the structures of our proposed GRAD-Net. Though our proposed method can be easily generalized to the retrieval of various types of data, in this work, we will focus on image retrieval to illustrate the idea of GRAD-Net.

3.1 Problem Setup and Notations

For image retrieval tasks, we denote the dataset as X = , where is a feature vector of an image. We assume the set of queries can be projected to the same feature space as the instances in the dataset, and denote the queries as , where represents the feature vector of the i-th query image and m is the number of query images. For simplicity, but without loss of generality, we consider the scenario that the instance (or query) is interpreted as a single vector. The combined set of dataset instances and queries can then be expressed as

and we use denotes the i-th instance in .

3.2 Diffusion on Graphs

A graph G = {V, E} consists of a set of nodes V and edges E. An affinity matrix A represents the adjacent node pairs in the graph. Two main approaches to conduct diffusion on graphs are (i) iteratively updating the pairwise similarities [3], [8] and (ii) solving the closed form of Eq.4 directly [5]. However, both approaches are essentially based on the same random walk mechanism proposed by Zhou et al. [8].

To perform a random walk on a graph G, a transition matrix is introduced to describe the probability of walking from one node to another, which is considered to be proportional to the affinity value. A degree matrix D is introduced to normalize the affinity matrix, which produces the transition matrix. The degree matrix

is a diagonal matrix with each diagonal element corresponding to the row-wise sum of a predefined affinity matrix A. Then the transition matrix S is computed as:

Based on the transition matrix S, random walk is then performed on the graph to update a state vector until it converges. The process iterates at the t-th step of the random walk as follows,

where is composed of both the query state and the dataset state . The initial state is a binary vector set to . Eq.3 can be intuitively interpreted as follows. Given a state , it transits based on S with a probability of and restarts from the initial state with a probability of . This process will converge to a closed-form solution [8]:

The final after iterations stands for the updated affinity matrix. As argued by Donoser et al. [3], the TPG diffusion achieved the most robust performance among all the compared methods. However, TPG diffusion suffers from heavy computational burden when handling large networks due to the large-scale tensor product graph. Our proposed method adopts a second-order operator similar to TPG diffusion, but overcomes the scalability issue by applying the second-order operator on a mutual k-NN graph (see detailed description in Section 3.4).

3.3 Graph Convolutional Networks

GCN [14] applies a first-order aggregation on graphs using affinity matrices. GCN contains several hidden layers that take a feature matrix as the input of the l-th layer and update the feature matrix by using a graph convolution operator.

Given an input feature matrix and the graph affinity matrix , GCN conducts the following layer-wise propagation,

where is the layer index and S is the same as in Eq.2. is the trainable weight matrix of the l-th layer. denotes an activation function. Eq.6 adopts S rather than A because multiplication with A will completely change the scale of the feature vectors, which can cause numerical instabilities. To further alleviate this issue, Kipf et al. [14] suggested to use the re-normalized (I + S) as follows:

where is the diagonal degree matrix of .

3.4 Graph Diffusion Networks (GRAD-Net)

In this section, we present our proposed GRAD-Net, illustrated in Fig. 2, that is composed of multiple graph diffusion layers. These layers aggregate messages between nodes according to the manifold structure and output the updated node features. Lastly, an affinity estimation module takes the output features and estimates affinities. Apart from the local manifold structure information, a sparse coding feature vector Z is also passed to the graph diffusion layer to provide global structure information. GRAD-Net is trained with two novel unsupervised loss functions, which we will discuss in details in Section 3.5. The details of the proposed GRAD-Net are as follows:

3.4.1 Graph Constructions

We first introduce how to construct graphs as the input to GRAD-Net. To encode both local and global structures from the image manifold, we design two types of graphs: local sparsified graph and global bipartite graph.

Local Sparsified Graph. To construct the graphs of locality, we follow the mutual k-NN method in Iscen et al. [5] that computes the k nearest neighbors of every instance. The dataset is interpreted as a weighted graph G = (V, E), consisting of N = n + m nodes , and the edges connect those node pairs that have a non-zero affinity value, for . The affinity matrix is defined as

where denotes the set of k nearest neighbors of node i, and is a predefined similarity metric typically based on cosine similarity or Euclidean distance. Eq.8 enables A to be sparse when k is small, which makes diffusion on large graphs feasible. The edge weight is initialized as , and then the diffusion processes propagate the affinity values through the entire graph. Global Bipartite Graph by Sparse Coding. To efficiently encode the global structure, we adopt a similar sparse coding in Liu et al. [48], in which we select anchor nodes to represent the original data distribution and reconstruct the original nodes.

We first apply a simple clustering algorithm, K-means, to select B anchor nodes, and denote the feature matrix of these anchor nodes as . We then construct a bipartite graph , where is the union of the original image nodes and the selected anchor nodes (i.e. ), and denotes all the edges between original images nodes and anchor nodes (i.e. ). Let denote all the weights on the edge set . We reconstruct each node by learning the edge weights, , that connect to its c closest neighboring anchor nodes by

The optimization of Eq.9 is a typical sparse coding problem, which we adopt the same technique as in Lan et al. [49]. We then obtain the global bipartite graph with edge weights , and concatenate Z as additional features to the image descriptors .

3.4.2 Graph Diffusion Layers.

The input graph G consists of an affinity matrix and instance features as in Eq.1. We denote the input feature matrix in the first layer as,

The graph diffusion layer of GRAD-Net has a layer function as follows,

where is the trainable parameters of the l-th layer, and and are the input and output features of the l-th layer, respectively. denotes

Fig. 2: The framework of GRAD-Net. Given the input image features, GRAD-Net first constructs local sparsified k-NN graph and bipartite graph for the image instances. GRAD-Net refines the image features by the stacked graph diffusion layers that are optimized by global and local loss functions. The right panel is a detailed structure of the graph diffusion layer in GRAD-Net.

where denotes the feature of node i in the l-th layer, and controls the scale of this message. We employ a similar first-order transition operation as in the conventional diffusion (i.e. in Eq.3) in our graph diffusion layer, and incorporate additional trainable parameters . In the l-th layer, is defined as,

where . This function turns out to share a similar form as the GCN layer in Eq.6.

The second-order operator, , is a function that takes in the information of node j, a second-order neighbor of node u (i.e. the neighbor of the neighbor of u), to update the feature vector of node u. The second-order message generated by is,

where is the feature vector of node j at the l-th layer, and controls the scale of this message. The double arrow, , represents the second-order connectivity between node j and node u.

The challenge of designing lies in the fact that node j and u are not adjacent, i.e. . Inspired by the TPG diffusion of Eq.5 [30], we introduce as

where denotes the element-wise product, and is the trainable parameters at the l-th layer. The second-order operator, , can be interpreted as a two-step hop. Let node i be the shared neighbor of node j and u. The product is a one-step of message passing that resembles the first-order operation in Eq.13, i.e. . Then can be considered as the weighted feature of node i. Lastly, passes the message from node j to u through their common shared neighbor node i. This second-order propagation enlarges the representational power of the graph diffusion layer.

Collectively, the graph diffusion layer performs the propagation as

where is an activation function and is set to LeakyRelu [50] in GRAD-Net, and are the trainable parameters. An identity matrix, I, is added to the first order operation to introduce a self-loop propagation. The learned features of GRAD-Net can be the output of the last layer, , or the concatenated layer features, H,

where is the concatenation along the feature dimension.

Affinity Estimation. Pairwise similarity is calculated using the concatenated layer features, (Eq.17). GRADNet adopts cosine similarity. The similarity between node i and node , is

where and are the i-th and j-th rows of H, respectively.

3.5 Loss Functions for Global and Local Consistency

Loss functions are crucial for training the GRAD-Net model. To capture both local and global structures of the underlying data manifold, we propose two loss functions to optimize local smoothness and global order, which will be illustrated in this section.

Local smoothness refers to the assumption that if two nodes are topologically closer in the graph G, the similarity of their features should be higher. To regularize local smoothness, we propose a similar loss as the pairwise Bayesian Personalized Ranking (BPR) loss [51],

Fig. 3: Illustration of sextet. (i) Triplet node and are first-order neighbors, and is a non-neighbor of on the graph. (ii) Quadruplet nodes and are first-order neighbors and and are first-order neighbors. (iii) Sextet (i, j, k, l, u, v): the collection of , and . Global order. and both measure the similarities between two neighborhoods and thus should be similar. takes in the triplet nodes takes in the quadruplet nodes (i, j, k, l).

where is the similarity between the features of instances denotes the triplet training samples, and refers to the nodes that share direct edge connections with i in the graph G.

The local smoothness loss in Eq.19 can be interpreted as the difference of the similarities between a close node-pair and a distant node-pair. The local loss function enforces larger similarities between the nodes in the same local neighborhood, which is consistent with the empirical findings that the diffusion process should be performed locally on the manifold [3], [29]. Global Order refers to the assumption that the similarities measured by different nodes from two neighborhoods should remain consistent [6], [30]. As illustrated in Fig. 3, if node i and j are from one neighborhood while node k and l are from another neighborhood, both and should reflect the similarities between the two neighborhoods, and thus should be similar under the global order assumption. To enforce global order, we introduce a loss function suitable for neural networks training,

where is a weighting coefficient, is the affinity in the manifold affinity matrix A, and is the similarity defined in Eq.18. The total loss function is a weighted combination of the local and global loss functions:

where is the set of all the trainable parameters and is the l2-norm function. and are the weights of different loss components.

while the global loss in Eq.20 works on quadruplet nodes (i, j, k, l). Considering this, we provide a compact solution to implement the training process. For every training iteration, a sextet (i, j, k, l, u, v) is selected to compute the total loss (Eq.21) as shown in Fig. 3, where the local loss is calculated twice from the two triplets (i, j, u) and (k, l, v), while the global loss is calculated once from the quadruplet (i, j, k, l). Therefore, the implemented loss function is

where is the number of sextets in a mini batch.

3.6 Inductive Learning on Unseen Instances

Most existing methods assume the queries are observed in the dataset, yet this setting cannot be met in many real-life scenarios. When queries are only accessible in the production stage after deployment, conventional diffusion methods suffer substantially from computation burden, and thus lack the feasibility to process queries in real-time. One solution to this issue is Query Expansion (QE) [52], in which a new query is constructed by averaging the image features of its top nearest neighbors in the original database.

We extend QE to derive a fast query expansion approach, the query feature expansion (QFE), that is compatible with our feature learning fashion. Given a new query q, QFE first finds the k nearest neighbor of q using original image descriptors, and directly computes the features of q in the learned feature space by

where is the learned feature of node i, and is the similarity between q and its neighbor i computed by the original image descriptors. is a feature approximation for q and can be used to retrieve instances via similarity search. We demonstrate the effectiveness of QFE with empirical results in Section 5.3.

3.7 Training on Large Datasets

Mini-batch Training on Graph. We found out that only the nodes in the computation flow of the loss function (Eq.21) will contribute to the parameter updating process.

To alleviate the computation burden on large-scale datasets, we adopt a mini-batch approach similar to Breadth-First Search (BFS) [53] on the sparsified mutual kNN graph. We start from a batch of sextets (nodes) to find relevant nodes, and then only include these of nodes in each training iteration. This acceleration process remarkably speeds up by over 20 times on training large-scale datasets (e.g. Oxford105k [5]). Truncation. As image retrieval on very large datasets is challenging to diffusion-based methods, truncation has become a common practice to alleviate this issue [5], [6]. Though our proposed GRAD-Net is capable of training and

retrieving datasets with over 100k instances efficiently, we implement a similar truncation method to further accelerate the training process. GRAD-Net first forms a union graph of the top 500 nearest neighbors for each query, and uses the learned features based on this union graph for the retrieval task. We demonstrate the efficacy of mini-batch training and truncation on various scales of datasets in Fig. 8.

3.8 Implementation Details

In this section, we address several implementation considerations, and introduce the default configurations of GRADNet. For all the experiments in Section 4 and 5, we deploy the default configurations unless otherwise specified.

We implemented GRAD-Net on the PyTorch[54] framework. All the experiments were conducted on a workstation with a 12-core Intel(R) Xeon(R) CPU W-2133 @ 3.60GHz and one NVIDIA RTX5000 GPU with 16GB GPU memory.

The number of neighbors, k, in the mutual k-NN search is set to 15. We set the number of anchor points B = 100 to compute the sparse features Z. The similarity metric s in Eq.8 is set to 1/(1 + Euclidean Distance). The number of graph diffusion layers, L = 3, with and . An l2-normalization followed by dropout [55] with a probability of 0.3 is applied to the output of each layer. We set in Eq.20, and in the total loss function (Eq.22). The sextet (i, j, k, l, u, v) are sampled as follows: node i, k are first randomly sampled from all nodes, then nodes j, l are randomly selected from the first-order neighbors of node i and k, and lastly nodes u, v are randomly sampled from the non-neighbor nodes of i, j, k, l. We use Adam [56] to optimize the parameters. The learning rate of Adam is initialized as and reduced by 50% at 30 and 100 epochs, respectively. The model is trained for 300 epochs. The training loss reaches a stable state after 200 epochs, and thus we take the model at the 300-th epoch as the final model for evaluation. The number of sextets in a mini-batch, . We also incorporate the efficient similarity search library, faiss [57], to boost up our mutual k-NN search and employ inverted index to reduce the memory consumption of the dataset instances. We store the highly compressed faiss index instead of the original representations in deployment.

4 EXPERIMENT

To demonstrate the performance of GRAD-Net, we conducted experiments on a synthetic toy dataset and seven popular benchmarking datasets, including face images (Section 4.2) and natural images (Sections 4.3, 4.4, 4.5). Table 1 summarizes the type and statistics of all the analyzed datasets. All the implementation settings follow the descriptions in Section 3.8.

4.1 Toy example

We generated 1,500 points in 2-D space that construct the word PAMI to demonstrate that GRAD-Net can well capture the geometric manifold. We take the coordinates of these points as the original features. The query points are chosen in each letter (marked by star) colored in green, yellow, blue, and purple, respectively.

TABLE 1: Summary of datasets

Fig. 4(a) uses Euclidean distance as the similarity metric to find the k nearest neighbors of each query point while Fig. 4(b) uses GRAD-Net to retrieve the relevant points for each query. Take the blue query point in the letter M for example. Fig. 4(e) depicts the initial, intermediate and final retrieval results using GRAD-Net. At the initial step, some points in the letter I that are closer to M by the Euclidean metric and were wrongly retrieved to the blue query point. However, as the training progresses, GRAD-Net is capable of correctly retrieving points from M for the blue query point.

Fig. 4(c) shows the t-SNE plot [62] of the original features, while Fig. 4(d) shows the t-SNE plot of the features produced by GRAD-Net. The t-SNE plots clearly verify that GRAD-Net better captures the intrinsic manifold than Euclidean distances.

4.2 ORL Dataset

The ORL dataset [58] is a face image dataset that contains 400 images of size taken from 40 distinct subjects. The images vary in illuminations, facial expressions and facial details. We first vectorize and normalize the raw image pixels as image descriptors.

Retrieval accuracy is measured by the average recall rate at a window size K for each query, also known as the bullseye score. For this dataset, we set K = 15 and the baseline bullseye score is 62.35%. Table 2 summarizes the performance of various approaches on this dataset. Our proposed GRAD-Net achieves the state-of-the-art performance.

4.3 Oxford5k and Paris6k Dataset

In this section, we examine the performance of GRADNet with real natural images on the Oxford5k and Paris6k datasets [59]. Experiments on Oxford5k. The Oxford5k dataset consists of 5,062 pictures of 11 different Oxford buildings collected from Flickr and 55 query images with the ground truth for evaluation.

For a fair comparison, we employed the image descriptors provided in Iscen et al. [5] to perform the experiments. In particular, the image descriptors are in two types: the first type consists of 512 dimensional descriptors [11] derived from a fine-tuned VGG net, while the second type consists of 2,048 dimensional descriptors [65] derived from a fine-tuned ResNet. For any given instance, Iscen et al. [5] extracts one global feature vector and multiple regional feature vectors. In this experiment, we only consider the scenario where an instance is represented by one global feature vector. Therefore, we only compare GRAD-Net to the models using the global features as input here and after.

Fig. 4: Synthetic toy dataset results. (a) Retrieval results based on Euclidean distance as similarity metric; (b) retrieval results using GRAD-Net; (c) t-SNE visualization of the original features; (d) t-SNE visualization of the features output by GRAD-Net; (e) initial, intermediate and final retrieval results of the blue query point using GRAD-Net.

TABLE 2: Performance comparison (bullseye score) on the ORL dataset.

TABLE 3: Performance comparison (mAP scores) on Oxford5k, Oxford105k, Paris6k, Paris106k, ROxford and RParis

We compare GRAD-Net with existing methods and summarizes the results in Table 3. We use the standard evaluation protocol, mean Average Precision (mAP) ranging from 0% to 100%, to quantify the retrieval accuracy. For both types of image descriptors, GRAD-Net significantly improves the baseline performances and achieves the highest mAP among all the methods, 90.1% for the VGG descriptors and 95.9% for the ResNet descriptors, exceeding the existing state-of-the-art method by a large margin. Fig. 5 depicts the retrieval results by GRAD-Net at the initial state, epoch 1, 20 and 60, respectively, which demonstrates that the training process improves the retrieval results of GRAD-Net.

We also showed that a simple nearest neighbor search using the sparse vectors Z achieves impressive mAP score of 87.5% on the ResNet image descriptors. This demonstrates that our proposed sparse coding method is able to capture global similarity relations and reduce the noise in original dense vectors. GRAD-Net further improves the mAP score by additional 8.4%, which indicates that GRAD-Net is capable of recovering the manifold structure.

Fig. 5: Retrieval results on the Oxford5k dataset at different epochs. The query instance is on the top left. Each row on the right contains the top five retrieved images after the corresponding training epoch. Images in the red frames are the incorrect results, while images in the blue frames are the correct ones. As the training progresses, GRAD-Net is capable of retrieving the correct instances for the query.

Experiments on Paris6k. The Paris6k dataset was collected in a similar fashion to the Oxford5k dataset. It includes 6,392 images of 11 buildings in Paris and uses 55 queries with the ground truth for evaluation. We also employed the VGG and ResNet features from Philbin et al. [60] and adopted the same model configurations as in the Oxford5k experiment, except for k = 60 for the mutual k-NN graph. Results in Table 3 show that GRAD-Net outperforms the existing state-of-the-art on both types of image descriptors.

4.4 Oxford105k and Paris106k Datasets

To demonstrate the efficacy of GRAD-Net on large-scale datasets, we evaluate our proposed method on the Ox-ford105k and Paris106k datasets [60]. These datasets are the extensions of the Oxford5k and Paris6k datasets with additional 100k irrelevant images from Flickr as distractors.

Due to the large scale of the datasets, most conventional diffusion methods fail to process the entire dataset within an acceptable time. For example, the diffusion iteration for one query from Oxford105k takes 13.9s with a 12-core CPU [5]. On the contrary, GRAD-Net, deploying mini-batch training and feature truncation, is capable of handling the >100k instances in both datasets (see Section 5.4 for the detailed complexity analyses).

As shown in Table 3, GRAD-Net achieves outstanding performance (mAP score 94.5%) on the Oxford105k dataset using the ResNet descriptors and outperforms the existing state-of-the-art method by 2.7%. On the Paris106k dataset, GRAD-Net also achieves comparable performance to the state-of-the-art method (mAP score 95.4% compared to 95.6%). Using the VGG descriptors, GRAD-Net also achieves comparable performance. We observed that the truncated union graphs generated from the VGG image descriptors omit some of the correct instances, which may partially explain the under-performance on the VGG descriptors.

4.5 ROxford and RParis Datasets

Radenovi et al. [61] generated the ROxford and RParis datasets by manually examining the instances in Oxford5k and Paris6k, and selected a set of 70 challenging queries with ground truth for evaluation. We adopted the image descriptors from Radenovi et al. [66] and used the Medium evaluation protocol.

5 DISCUSSION

5.1 Ablation Study

In this section, we vary the configurations of several important settings in GRAD-Net to verify the effectiveness of the corresponding variants. For the purpose of illustration, we focus on the Oxford5k dataset, and set the base model as the one discussed in Section 4.3 that achieves an mAP score of 95.95%.

For fair comparisons, we ran six trials for each variant and evaluated the performance with three evaluation metrics:

1) Epochs mAP92+ refers to the number of epochs that a model takes to first reach an mAP score of 92% during the training process. We use this metric to examine the learning efficiency.

2) mAP is the mean Average Precision score. We report the average and standard deviation of the mAP scores of the six trial runs.

3) Diff refers to the mAP drop compared to the base model (the first row in Table 4).

Loss functions. The ablation study of the loss functions shows that and have substantial impact on the model performance. Without the local loss, the proposed model can only reach an average mAP score of 90.16% and exhibits a slight increase of instability in the training process. The absence of the global loss, , also increases the standard deviation of the mAP score by 0.42%. We also tested the impact of global loss function on the Oxford105k dataset. The performance drops by 0.96% without global loss.

Sparse coding. The sparse coding vector Z is another important component of GRAD-Net. By default, Z is a 100-dimensional vector appended to the original features. We test on two variants related to Z: taking (i) original features only and (ii) Z only as the inputs to GRAD-Net. The ablation studies show that both Z and the original features are crucial to the retrieval performance.

Model Structure. We vary the layer settings of GRAD-Net, and the results show that GRAD-Net is robust to the layer size and the number of layers selection on the Oxford5k dataset. One interesting finding is that when replacing the graph diffusion layer (that includes a second-order operator) with the vanilla GCN network (that only includes a first-order operator), the mAP score drops by 0.96%. This finding empirically validates the effectiveness of the second-order

TABLE 4: Ablation study results of the variants in GRAD-Net.

operator in the diffusion layer of GRAD-Net. Moreover, we perform ablation studies on the GRAD-Net output features with or without the concatenation operation as in Eq.17. The results verify that the concatenation of features in Eq.17 is essential and accounts for 3.44% of the performance gain as well as decreases the variation of performance.

Manifold Structure. The test results on the manifold structures show that the value of k in the mutual k-NN search is crucial to the learned features. On the Oxford5k dataset, k is optimal around 15, yet GRAD-Net demonstrates robust performance for k ranging from 8 to 25.

Similarity metric and batch size. The variants in the similarity metrics show that GRAD-Net is robust to different similarity metric choices. Experiments of various batch sizes show that smaller batch size is preferable for better performance.

Fig. 6: The feature learning results. (a) t-SNE visualization of the original image features; (b) t-SNE visualization of the learned features. Colors denote the image categories. Grey points are denote the images that do not belong to any query category.

5.2 Feature Learning

The most noticeable difference between GRAD-Net and conventional diffusion methods is that GRAD-Net performs feature learning, while conventional diffusion methods only operates on the affinity matrix.

Based on the results from Fig. 4(c-d), we argue that GRAD-Net learns a better representation of semantic information. We further validate the feature learning ability of GRAD-Net using the Oxford5k dataset. Fig. 6 plots the t-SNE visualizations of the Oxford5k images using the original image descriptors and the learned features of GRADNet respectively. The nodes are colored by the ground truth labels. Fig. 6 shows that GRAD-Net groups the instances from the same category with similar features.

Fig. 7: Query Induction Example. The first row represents the initial search result in the original feature space. The second row represents the first QFE search result, and the third row is the second QFE search result, which implies QFE can improve the retrieval results for unseen queries.

Fig. 8: Memory and time consumption comparison. (a) CPU RAM memory consumption against the number of dataset instances; (b) GPU RAM memory consumption against the number of dataset instances. Efficient is omitted since the method cannot be implemented in GPU. (c) Time consumption against the number of instances.

5.3 Inductive Leaning

To test the generalizability of GRAD-Net for unseen queries, we manually put aside 11 queries from the 55 queries of the Oxford5k dataset and left them out in the training. We then performed retrieval of these 11 queries using the QFE method described in Section 3.6. The mAP score in the Table 5 shows the effectiveness of the QFE method. We also observe a slight improvement in the performance when applying QFE twice. As shown in Fig. 7, the QFE method can project approximated features of the unseen query to the learned feature space, which leads to more accurate retrieval performance.

TABLE 5: Performance comparison on unseen queries

5.4 Complexity Analysis

The first step in the pipeline of GRAD-Net is to build a mutual k-NN graph, which requires the pairwise similarity search among instances. This step has a computational complexity of , where N is the number of instances. However, since we only need to build the graph once and it is considerably accelerated by the faiss toolbox [57], we argue that the computational complexity of GRAD-Net is O(N) that is mainly determined by the training process of the neural network.

We conducted a series of experiments to evaluate the memory and time consumption of GRAD-Net and several other methods. All experiments were conducted using a workstation with a 12-core Intel(R) Xeon(R) CPU W-2133 @ 3.60GHz and one NVIDIA RTX5000 GPU with 16GB GPU memory. We started with the Oxford5k dataset, gradually added distractors to the original dataset, and recorded the time and memory consumption of training or the diffusion iterations. Empirical results verify our argument as follows. Memory consumption analysis. Fig. 8(a-b) show the memory consumption on CPU and GPU respectively. As the

number of instances increases, GRAD-Net can easily scale up to 10k samples even without using truncation, thanks to the mini-batch training on graphs. After integrating the truncation technique, the memory consumption is notably reduced by a large margin. This validates the advantage of GRAD-Net regarding memory consumption.

In comparison, the memory consumption of Efficient Diffusion [64], a conventional diffusion method, exceeds GRAD-Net as the number of instances reaches 20k regardless of that it employs a truncated search technique. We also compare with the memory consumption of GSS [47], a neural network based method, increases rapidly as the number of instances gets larger, and encounters out of memory (OOM) error when the number of instances exceeds 21k. Time consumption analysis. We analyzed the time consumption of GRAD-Net as the number of instances increases. Fig. 8(c) plots the average time per forward computing and gradient back-propagation against the numbers of instances. The time consumption results validate that GRAD-Net has linear complexity, O(N). We also compare GRAD-Net with the neural network based method, GSS (blue line in Fig. 8(c)), and argue that GRAD-Net has faster iteration time.

6 CONCLUSION

In this paper, we propose GRAD-Net, a novel deep learning based diffusion method to address the challenge of unsupervised image retrieval. In contrast to conventional diffusion methods, GRAD-Net enables effective representation learning for images while preserves both the local and global geometric properties of the image manifold. Moreover, GRADNet is trained in an unsupervised end-to-end fashion and can easily scale up to handle large-scale datasets. Extensive empirical results on multiple benchmarks demonstrate the efficacy of GRAD-Net. Besides its outstanding performance, GRAD-Net also shows its generalizability to unseen queries in an inductive learning manner. However, for highly dynamic image retrieval systems where instances are constantly modified in the database, we envision that GRADNet is capable of performing the retrieval tasks with regular re-training. The trigger to re-train the network can be determined by the mean average precision over a separate validation set. We will explore these directions in our future works.

REFERENCES

[1] M. R. Ghorab, D. Zhou, A. O‘connor, and V. Wade, “Personalised information retrieval: Survey and clas-sification,” User Modeling and User-Adapted Interaction, vol. 23, no. 4, pp. 381–443, 2013.

[2] D. Van Aken, A. Pavlo, G. J. Gordon, and B. Zhang, “Automatic database management system tuning through large-scale machine learning,” in Proceedings of the 2017 ACM International Conference on Management of Data, ACM, 2017, pp. 1009–1024.

[3] M. Donoser and H. Bischof, “Diffusion processes for retrieval revisited,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2013, pp. 1320–1327.

[4] F. Bach, “Breaking the curse of dimensionality with convex neural networks,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 629–681, 2017.

[5] A. Iscen, G. Tolias, Y. Avrithis, T. Furon, and O. Chum, “Efficient diffusion on region manifolds: Recovering small objects with compact cnn representations,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2077–2086.

[6] S. Bai, X. Bai, Q. Tian, and L. J. Latecki, “Regularized diffusion process on bidirectional context for object retrieval,” IEEE transactions on pattern analysis and machine intelligence, vol. 41, no. 5, pp. 1213–1226, 2018.

[7] B. Wang, Z. Tu, and J. K. Tsotsos, “Dynamic label propagation for semi-supervised multi-class multilabel classification,” in Proceedings of the IEEE international conference on computer vision, 2013, pp. 425–432.

[8] D. Zhou, J. Weston, A. Gretton, O. Bousquet, and B. Schlkopf, “Ranking on data manifolds,” in Advances in neural information processing systems, 2004, pp. 169–176.

[9] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky, “Neural codes for image retrieval,” in European conference on computer vision, Springer, 2014, pp. 584– 599.

[10] A. Gordo, J. Almazn, J. Revaud, and D. Larlus, “Deep image retrieval: Learning global representations for image search,” in European conference on computer vision, Springer, 2016, pp. 241–257.

[11] F. Radenovi, G. Tolias, and O. Chum, “Cnn image retrieval learns from bow: Unsupervised fine-tuning with hard examples,” in European conference on computer vision, Springer, 2016, pp. 3–20.

[12] A. S. Razavian, J. Sullivan, S. Carlsson, and A. Maki, “Visual instance retrieval with deep convolutional networks,” IEEE Transactions on Media Technology and Applications, vol. 4, no. 3, pp. 251–258, 2016.

[13] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in IEEE conference on computer vision and pattern recognition, Ieee, 2009, pp. 248–255.

[14] T. N. Kipf and M. Welling, “Semi-supervised classifi-cation with graph convolutional networks,” in 5th International Conference on Learning Representations, ICLR, 2017.

[15] J. Atwood and D. Towsley, “Diffusion-convolutional neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 1993–2001.

[16] M. Niepert, M. Ahmed, and K. Kutzkov, “Learning convolutional neural networks for graphs,” in International conference on machine learning, 2016, pp. 2014– 2023.

[17] S. Cao, W. Lu, and Q. Xu, “Deep neural networks for learning graph representations,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[18] R. Levie, F. Monti, X. Bresson, and M. M. Bronstein, “Cayleynets: Graph convolutional neural networks with complex rational spectral filters,” IEEE Transactions on Signal Processing, vol. 67, no. 1, pp. 97–109, 2018.

[19] X. Wang, X. He, M. Wang, F. Feng, and T.-S. Chua, “Neural graph collaborative filtering,” in Proceedings of the 42nd international ACM SIGIR conference on Research and development in Information Retrieval, 2019, pp. 165–174.

[20] N. Park, A. Kan, X. L. Dong, T. Zhao, and C. Faloutsos, “Estimating node importance in knowledge graphs using graph neural networks,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 596–606.

[21] D. P. Vassileios Balntas Edgar Riba and K. Mikolajczyk, “Learning local feature descriptors with triplets and shallow convolutional neural networks,” in Proceedings of the British Machine Vision Conference (BMVC), BMVA Press, Sep. 2016, pp. 119.1–119.11.

[22] R. Yu, Z. Dou, S. Bai, Z. Zhang, Y. Xu, and X. Bai, “Hard-aware point-to-set deep metric for person re-identification,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 188–204.

[23] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schlkopf, “Learning with local and global consistency,” in Advances in neural information processing systems, 2004, pp. 321–328.

[24] X. Bai, X. Yang, L. J. Latecki, W. Liu, and Z. Tu, “Learning context-sensitive shape similarity by graph transduction,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 5, pp. 861–874, 2009.

[25] U. N. Raghavan, R. Albert, and S. Kumara, “Near linear time algorithm to detect community structures in large-scale networks,” Physical review E, vol. 76, no. 3, p. 036 106, 2007.

[26] H. Jegou, C. Schmid, H. Harzallah, and J. Verbeek, “Accurate image search using the contextual dissimilarity measure,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 1, pp. 2–11, 2008.

[27] L. Fei-Fei and P. Perona, “A bayesian hierarchical model for learning natural scene categories,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), IEEE, vol. 2, 2005, pp. 524–531.

[28] R. Sinkhorn, “A relationship between arbitrary positive matrices and doubly stochastic matrices,” The annals of mathematical statistics, vol. 35, no. 2, pp. 876–879, 1964.

[29] B. Wang and Z. Tu, “Affinity learning via selfdiffusion for image segmentation and clustering,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2012, pp. 2312–2319.

[30] X. Yang, L. Prasad, and L. J. Latecki, “Affinity learning with diffusion on tensor product graph,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 1, pp. 28–38, 2012.

[31] S. Bai, X. Bai, Q. Tian, and L. J. Latecki, “Regularized diffusion process for visual retrieval,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[32] P. Frasconi, M. Gori, and A. Sperduti, “A general framework for adaptive processing of data structures,” IEEE transactions on Neural Networks, vol. 9, no. 5, pp. 768–786, 1998.

[33] M. Gori, G. Monfardini, and F. Scarselli, “A new model for learning in graph domains,” in Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., IEEE, vol. 2, 2005, pp. 729–734.

[34] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2008.

[35] Y. Li, D. Tarlow, M. Brockschmidt, and R. S. Zemel, “Gated graph sequence neural networks,” in 4th International Conference on Learning Representations, ICLR, 2016.

[36] K. Cho, B. van Merrienboer,. Glehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP, ACL, 2014, pp. 1724–1734.

[37] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral networks and locally connected networks on graphs,” in 2nd International Conference on Learning Representations, ICLR, 2014.

[38] M. Henaff, J. Bruna, and Y. LeCun, “Deep convolutional networks on graph-structured data,” ArXiv preprint arXiv:1506.05163, 2015.

[39] P. Velikovi, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” ArXiv preprint arXiv:1710.10903, 2017.

[40] V. Garcia and J. Bruna, “Few-shot learning with graph neural networks,” in 6th International Conference on Learning Representations, ICLR, 2018.

[41] M. Guo, E. Chou, D.-A. Huang, S. Song, S. Yeung, and L. Fei-Fei, “Neural graph matching networks for fewshot 3d action recognition,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 653–669.

[42] X. Qi, R. Liao, J. Jia, S. Fidler, and R. Urtasun, “3d graph neural networks for rgbd semantic segmentation,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5199–5208.

[43] L. Yi, H. Su, X. Guo, and L. J. Guibas, “Syncspeccnn: Synchronized spectral cnn for 3d shape segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2282–2290.

[44] X. Chen, L.-J. Li, L. Fei-Fei, and A. Gupta, “Iterative visual reasoning beyond convolutions,” in Proceedings

of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7239–7248.

[45] M. Narasimhan, S. Lazebnik, and A. Schwing, “Out of the box: Reasoning with graph convolution nets for factual visual question answering,” in Advances in Neural Information Processing Systems, 2018, pp. 2654– 2665.

[46] B. Jiang, D. Lin, J. Tang, and B. Luo, “Data representation and learning with graph diffusion-embedding networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 10 414–10 423.

[47] C. Liu, G. Yu, M. Volkovs, C. Chang, H. Rai, J. Ma, and S. K. Gorti, “Guided similarity separation for image retrieval,” in Advances in Neural Information Processing Systems, 2019, pp. 1554–1564.

[48] W. Liu, J. He, and S.-F. Chang, “Large graph construction for scalable semi-supervised learning,” in Proceedings of the 27th international conference on machine learning (ICML-10), 2010, pp. 679–686.

[49] X. Lan, S. Zhang, P. C. Yuen, and R. Chellappa, “Learning common and feature-specific patterns: A novel multiple-sparse-representation-based tracker,” IEEE Transactions on Image Processing, vol. 27, no. 4, pp. 2022–2037, 2017.

[50] B. Xu, N. Wang, T. Chen, and M. Li, “Empirical evaluation of rectified activations in convolutional network,” ArXiv preprint arXiv:1505.00853, 2015.

[51] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme, “Bpr: Bayesian personalized ranking from implicit feedback,” in Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence, AUAI Press, 2009, pp. 452–461.

[52] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman, “Total recall: Automatic query expansion with a generative feature model for object retrieval,” in 2007 IEEE 11th International Conference on Computer Vision, IEEE, 2007, pp. 1–8.

[53] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Advances in Neural Information Processing Systems, 2017, pp. 1024– 1034.

[54] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems, 2019, pp. 8024–8035.

[55] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014.

[56] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations, ICLR, 2015.

[57] J. Johnson, M. Douze, and H. Jgou, “Billion-scale similarity search with gpus,” IEEE Transactions on Big Data, 2019.

[58] F. S. Samaria and A. C. Harter, “Parameterisation of a stochastic model for human face identification,” in

Proceedings of 1994 IEEE Workshop on Applications of Computer Vision, IEEE, 1994, pp. 138–142.

[59] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object retrieval with large vocabularies and fast spatial matching,” in 2007 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2007, pp. 1–8.

[60] ——, “Lost in quantization: Improving particular object retrieval in large scale image databases,” in 2008 IEEE conference on computer vision and pattern recognition, IEEE, 2008, pp. 1–8.

[61] F. Radenovi, A. Iscen, G. Tolias, Y. Avrithis, and O. Chum, “Revisiting oxford and paris: Large-scale image retrieval benchmarking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5706–5715.

[62] L. v. d. Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of machine learning research, vol. 9, no. Nov, pp. 2579–2605, 2008.

[63] X. Yang, S. Koknar-Tezel, and L. J. Latecki, “Locally constrained diffusion process on locally densified distance spaces with applications to shape retrieval,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 357–364.

[64] F. Yang, R. Hinami, Y. Matsui, S. Ly, and S. Satoh, “Efficient image retrieval via decoupling diffusion into online and offline processing,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 9087–9094.

[65] A. Gordo, J. Almazan, J. Revaud, and D. Larlus, “End-to-end learning of deep visual representations for image retrieval,” International Journal of Computer Vision, vol. 124, no. 2, pp. 237–254, 2017.

[66] F. Radenovi, G. Tolias, and O. Chum, “Fine-tuning cnn image retrieval with no human annotation,” IEEE transactions on pattern analysis and machine intelligence, vol. 41, no. 7, pp. 1655–1668, 2018.

Zhiyong Dou received his B.S. in Electric and Electronic Engineering from Huazhong University of Science and Technology, Wuhan, China in 2017. He is currently a Ph.D candidate at the School of Electronic Information and Communications in Huazhong University of Science and Technology. He is also a visiting student at the University of Toronto beginning in 2019. His research interests include computer vision, computational biology and machine learning.

Haotian Cui received his B.S. in Biomedical Engineering from the Tsinghua University, China in 2015. He is currently pursuing his Ph.D. degree in Computer Science at the University of Toronto. His research interests include computer vision, computational biology and machine learning.

Lin Zhang received her HB.A. in Statistics and B.A. in Economics from University of California, Berkeley in 2012, and received her M.A. in Applied Statistics from University of California, Santa Barbara in 2015. She is currently a Ph.D. candidate at the Department of Statistical Sciences at the University of Toronto.

Bo Wang is the lead AI scientist for the Peter Munk Cardiac Centre at the University Health Network (UHN) in Toronto. He is also an Assistant Professor at the Department of Medical Biophysics at the University of Toronto and a CIFAR AI Chair of the Vector Institute. Dr. Wang obtained his Ph.D. from the Department of Computer Science at Stanford University in 2017 and his M.Sc. in Computer Science from the University of Toronto in 2012. His current research interests include computer vision, computational biology and machine learning. Dr. Wang is a member of IEEE.