Distance Metric Learning (DML) learns a discriminative embedding by bringing similar examples closer, and moving away dissimilar ones. Semi-Supervised DML (SSDML) is useful in applications where we have class labels for only a few examples, with additionally available unlabeled data. Surprisingly, apart from a few classical SSDML approaches [1–6] that learn a linear Mahalanobis metric, deep SSDML has not been studied (although deep semi-supervised methods for classifica-tion tasks [7–12], and supervised deep DML methods [13–20] exist). We begin our paper with a straightforward extension of classical SSDML to deep SSDML.
Let be a given dataset, consisting of a set of labeled examples
with the associated label set
is the number of classes), and a set of unlabeled examples
. Existing SSDML approaches learn the parametric matrix
of the squared Mahalanobis distance metric
, for a pair of descriptors
(classically obtained using hand-crafted features). The SSDML approaches can mainly be categorized under two major paradigms: i) entropy minimization [21], and ii) graph-based. The SERAPH [3,6] approach, the only representative from the first category, tries to maximize the entropy of a conditional probability distribution over the pairwise constraints obtained from labeled data, while minimizing the entropy over the unlabeled data. The other category, i.e., graph-based, consisting of LRML [2], ISDML [4], APLLR [5], and APIT [5], involves variation of a Laplacian regularizer [22] that preserves the pairwise distances among all the examples in X (hence, unlabeled data as well). In addition, each of them has a supervised term leveraging pairwise constraints obtained from the limited labeled data.
As the losses in these methods are differentiable, we can backpropagate their gradients while using mini-batch based stochastic optimization. For the graph-based methods, one could first sample a random partition of the unlabeled examples, and construct a sub-graph along with the limited number of available labeled examples. Then, the mini-batches can be drawn from this partition alone for several epochs (over this partition). This process could be iterated over several partitions over the entire dataset. The same partition based strategy could be used for SERAPH as well, with the only
Figure 1: (a) A schematic illustration providing the intuition of our method (Best viewed in color). The raw image belongs to the MNIST dataset [23], (b) Neighborhood triplet mining using propagated affinities (Best viewed in color). . An anchor
(shown in black) is a point in the dataset that is currently in consideration for triplet mining within its k- neighborhood
distances in current embedding space). Points in blue (
semantically similar (by virtue of propagated affinities) to the anchor, than the points in red (
). In the current embedding, the blue points are farther from the anchor
than the red ones, despite having more semantic similarity. Hence, they should be pulled closer to the anchor in the learned space, compared to the red ones.
Figure 2: Our architecture to facilitate Riemannian optimization.
exception that it does not require graph construction. Formally, we can learn a deep embedding that induces a distance metric of the form , such that the descriptors
are obtained using a deep neural network
with parameters
, while simultaneously learning
in an end-to-end manner. However, in many cases, the full-rank matrix M of the distance metric learns redundancies in the data, which degrades the quality of the nonlinear embedding. Hence, we utilize the property that
, and factorize it as:
s.t.
, and rather propose to learn the metric wrt the matrix L (O(dl) parameters).
Because of a few limitations of the existing SSDML methods (discussed later in the Appendix), we now propose a method for SSDML, which is illustrated in Figure 1a. In the semi-supervised DML setting, the first task is to mine informative constraints, and then use an appropriate loss function. As the first stage, we propose a novel method for mining constraints using a graph-based technique while leveraging affinity propagation [1]. Let be a randomly selected partition of unlabeled data. We construct a kNN graph using
, s.t. the nodes represent the examples, and edge weights denote the affinities (or similarities) among the examples. The initial affinity matrix
is defined as follows: i)
, ii)
, otherwise. Here,
are respective cardinalities of
The neighborhood structure of the kNN graph can be indicated using a matrix defined as: i)
, and ii)
, otherwise. Here,
is the set of k-nearest neighbor examples of
. The flow of information from the non-zero entries of
(representing the labeled pairs) to the zero entries (representing the unlabeled pairs), can be performed by using a closed-form expression described by Liu et al. [1], as:
. This step is called affinity propagation. This step essentially performs a random-walk to propagate the affinities, where
is a weight hyper-parameter between the affinities obtained at the current step to that of the initial one. As
not encounter any difficulty while scaling up to large datasets. To obtain a symmetric affinity matrix W , we perform a final symmetrization step as follows:
The finally obtained representation of the symmetric affinity matrix W is used to mine triplets for metric learning. In doing so, we take into account the k-neighborhood of an example
that we consider as an anchor. Let
be the k-neighboring examples of
sorted in descending order of their propagated affinities w.r.t.
, i.e.,
. Essentially,
is simply a sorted version of
. As the obtained affinities are an indication of the semantic similarities among examples, we take it as a guidance to form triplets. Given an anchor
intuitively we can consider an example
with more affinity towards
as a positive, and another example
with lesser affinity towards
, and form a triplet
. By consider- ing first half of examples in the sorted neighborhood
, and remaining half as negatives, we can form the following triplets:
One may select the set of anchors from entire , or by seeking the modes of the graph, with- out loss of generality. Figure 1b illustrates our proposed novel idea of affinity-based triplet mining. Given
and the corresponding W , assume that we have obtained a triplet set
Here, is a mini-batch of
triplets, and
. Let,
. Then, we propose a smooth objective to learn the parameters (
) of our deep metric as follows:
Here, , s.t.,
, and
tries to pull the anchor
and the positive
together, while moving away the negative
from the mean
, with respect to an angle
at the negative
[24]. Given the fact that
is fully-differentiable, we can backpropagate the gradients to learn the parameters
using SGD. This helps us in integrating our method within an end-to-end deep framework using Algorithm 1, the theoretical analysis of which is present in the Appendix.
Additionally, we constrain to be a orthogonal matrix, i.e.,
. This is because orthogonality acts as a regularizer to avoid degenerate embeddings due to a model collapse, as shown in our experiments. We conjecture that this is because orthogonality omits redundant correlations among different dimensions, thereby omitting redundancy in the metric, which could otherwise hurt the generalization ability due to overfitting. This is in contrast to other regularizers that only constrain the values of the elements of a parametric matrix (like
or
regularizers). As L with orthogonality naturally lies on a Stiefel manifold [25], we make use of Riemannian geometry to impose the orthogonality constraint. It should be noted that for the orthogonal group
, and a matrix
, replacing L in
by LB does not change the value of the objective, because
From a Riemannian geometric perspective [25], we say that the objective is invariant to the right action of the group O(l). Hence, the correct geometry to consider for L is that of the Grassmann manifold G(l, d) [25]. For each mini-batch of triplets , we perform Riemannian optimization to learn the parametric matrix L while maintaining the constraint
. After a few iterations of Riemannian optimization, we fix the value of the obtained L and backpropagate the gradients for the rest of the network to learn
. We can repeat the process for several randomly sampled
, until convergence. Note that we only have to perform Riemannian optimization in the loss layer. As such, we do not require any major changes to the standard deep metric learning architecture (as shown in Figure 2).
Table 1: Comparison against state-of-the-art SSDML approaches on MNIST, Fashion-MNIST and CIFAR-10.
Table 2: Comparison against state-of-the-art SS- DML approaches on fine-grained datasets.
Table 3: Comparison of our method against state-of-the-art deep unsupervised methods on fine-grained datasets.
Figure 3: tSNE embeddings for the test examples of CIFAR-10. The left column represents the embeddings obtained right after random initialization. The embeddings obtained by our method: without orthogonality constraint on L (middle column) and with orthogonality constraint on L (rightmost column). Orthogonality leads to better, well-separated embeddings (see Table 4).
Table 4: Quantitative comparison of the perfor- mance of our method, without (w/o orth) and with orthogonality (w/ orth) on the metric parameters.
Table 5: Comparison of our method against supervised deep metric learning baselines on CUB.
In this section we evaluate our proposed method in terms of its effectiveness in clustering (wrt NMI) and retrieval (wrt R@K) tasks on a number of benchmark datasets (MNIST [23], Fashion-MNIST [26], CIFAR-10 [27], CUB-200 [28] and Cars-196 [29]). We compare our method against 3 groups of approaches: i) Deep extensions of the discussed Semi-Supervised DML methods (Deep-LRML [2], Deep-ISDML [4], Deep-SERAPH [3], Deep-APLLR [5] and Deep-APIT [5]), ii) Recent state-of-the-art deep Unsupervised representation learning methods (Exemplar [30], Rotation Net [31], NCE [32], DeepCluster [33] and Synthetic [34]), and iii) Supervised deep metric learning methods (Triplet [35] and Angular [24]). We picked the Triplet [35] and Angular [24] methods as they perform at par with more sophisticated state-of-the-art methods [13–17,19,20], when compared under a fair experimental protocol (as studied recently [36,37]). Details of experiments, performance metrics, datasets and baselines can be found in the Appendix.
Tables 1-2 show that we outperform the SSDML baselines (explanation of why our method outperforms these baselines is provided in the appendix). As shown in Table 3, better performance of our method in contrast to the unsupervised approaches demonstrates the benefit of additionally available labeled examples while learning a metric. Interestingly, many SSDML baselines perform poorer than unsupervised ones. This is because of the redundant information learned by their full-ranked matrices, thereby degrading the embedding information. However, this is not the case in our method. Lastly, as shown in Table 5, better performance of our method in contrast to the fully-supervised baselines demonstrates the benefit of leveraging additional unlabeled examples via affinity propagation and our mining technique. On the other hand, the fully-supervised baselines overfit to the training data because of the availability of only a limited number of labeled examples. By using Table 4 and Figure 3, we quantitatively and qualitatively demonstrate that orthogonality in the metric avoids degenerate embeddings. Orthogonality especially showed more benefit on the more complex CIFAR in comparison to the simpler MNIST dataset, particularly for the clustering task.
The main contributions of this paper are: i) A simple extension of classical linear approaches to address deep SSDML, which was not studied before. ii) To overcome limitations of existing SSDML methods, we propose a novel method that by virtue of its affinity propagation based novel triplet mining strategy outperforms its competitors. iii) In particular, we studied deep SSDML from a Riemannian geometric perspective, due to nature of our constraints. Without requiring any modifications in existing deep metric learning architecture, we facilitate Riemannian optimization within a deep stochastic framework.
[1] Wei Liu, Shiqian Ma, Dacheng Tao, Jianzhuang Liu, and Peng Liu. Semi-supervised sparse metric learning using alternating linearization optimization. In Proc. of ACM International Conference on Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD), pages 1139–1148, 2010. 1, 2
[2] Steven C.H. Hoi, Wei Liu, and Shih-Fu Chang. Semi-supervised distance metric learning for collaborative image retrieval and clustering. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 6(3):18, 2010. 1, 4, 8
[3] Gang Niu, Bo Dai, Makoto Yamada, and Masashi Sugiyama. Information-theoretic semi-supervised metric learning via entropy regularization. Neural computation, 26(8):1717–1762, 2014. 1, 4, 9
[4] Shihui Ying, Zhijie Wen, Jun Shi, Yaxin Peng, Jigen Peng, and Hong Qiao. Manifold preserving: An intrinsic approach for semisupervised distance metric learning. IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 29(7):2731–2742, 2017. 1, 4, 9
[5] Ujjal Kr Dutta and C Chandra Sekhar. Affinity propagation based closed-form semi-supervised metric learning framework. In Proc. of International Conference on Artificial Neural Networks (ICANN), pages 556–565. Springer, 2018. 1, 4, 9
[6] Pengcheng Shen, Xin Du, and Chunguang Li. Distributed semi-supervised metric learning. IEEE Access, 4:8558–8571, 2016. 1
[7] Avital Oliver, Augustus Odena, Colin A Raffel, Ekin Dogus Cubuk, and Ian Goodfellow. Realistic evaluation of deep semi-supervised learning algorithms. In Proc. of Neural Information Processing Systems (NeurIPS), pages 3235–3246, 2018. 1
[8] Xiaojin Jerry Zhu. Semi-supervised learning literature survey. Technical report, University of WisconsinMadison Department of Computer Sciences, 2005. 1
[9] Olivier Chapelle, Bernhard Scholkopf, and Alexander Zien. Semi-supervised learning. IEEE Transactions on Neural Networks, 20(3):542–542, 2009. 1
[10] Otilia Stretcu, Krishnamurthy Viswanathan, Dana Movshovitz-Attias, Emmanouil Platanios, Sujith Ravi, and Andrew Tomkins. Graph agreement models for semi-supervised learning. In Proc. of Neural Information Processing Systems (NeurIPS), pages 8713–8723, 2019. 1
[11] Takeru Miyato, Shin-ichi Maeda, Shin Ishii, and Masanori Koyama. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2018. 1
[12] Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondrej Chum. Label propagation for deep semi-supervised learning. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5070–5079, 2019. 1
[20] Qi Qian, Lei Shang, Baigui Sun, Juhua Hu, Hao Li, and Rong Jin. Softtriple loss: Deep metric learning without triplet sampling. In Proc. of IEEE International Conference on Computer Vision (ICCV), pages 6450–6458, 2019. 1, 4
[21] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In Proc. of Neural Information Processing Systems (NeurIPS), pages 529–536, 2005. 1
[22] Xiaofei He and Partha Niyogi. Locality preserving projections. In Proc. of Neural Information Processing Systems (NeurIPS), pages 153–160, 2003. 1
[23] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. 2, 4, 8
[24] Jian Wang, Feng Zhou, Shilei Wen, Xiao Liu, and Yuanqing Lin. Deep metric learning with angular loss. In Proc. of IEEE International Conference on Computer Vision (ICCV), 2017. 3, 4, 9
[25] P-A Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009. 3
[26] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017. 4, 8
[27] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009. 4, 8
[30] A Dosovitskiy, P Fischer, JT Springenberg, M Riedmiller, and T Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 38(9):1734–1747, 2016. 4, 9
[31] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In Proc. of International Conference on Learning Representations (ICLR), 2018. 4, 9
[32] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3733–3742, 2018. 4, 9
[33] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In Proc. of European Conference on Computer Vision (ECCV), pages 132–149, 2018. 4, 9
[34] Ujjal Kr Dutta, Mehrtash Harandi, and C Chandra Sekhar. Unsupervised metric learning with synthetic examples. In Proc. of Association for the Advancement of Artificial Intelligence (AAAI), 2020. 4, 9
[35] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 815–823, 2015. 4, 9
[36] Kevin Musgrave, Serge Belongie, and Ser-Nam Lim. A metric learning reality check. In Proc. of European Conference on Computer Vision (ECCV), 2020. 4
[37] Karsten Roth, Timo Milbich, Samarth Sinha, Prateek Gupta, Bjoern Ommer, and Joseph Paul Cohen. Revisiting training strategies and generalization performance in deep metric learning. In Proc. of International Conference on Machine Learning (ICML), 2020. 4
[38] Andrea Vedaldi and Karel Lenc. Matconvnet: Convolutional neural networks for matlab. In Proceedings of the 23rd ACM international conference on Multimedia, pages 689–692. ACM, 2015. 8
[39] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2015. 8
[40] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision (IJCV), 115(3):211–252, 2015. 8
[41] Giorgos Tolias, Ronan Sicre, and Hervé Jégou. Particular object retrieval with integral max-pooling of cnn activations. In Proc. of International Conference on Learning Representations (ICLR), 2016. 8
[42] Nicolas Boumal, Bamdev Mishra, P-A Absil, and Rodolphe Sepulchre. Manopt, a Matlab toolbox for optimization on manifolds. The Journal of Machine Learning Research, 15(1):1455–1459, 2014. 8
An efficient matrix-based algorithm for our proposed approach: We now provide an efficient matrix based implementation of the algorithm for our method. Given , construct the following matrices:
and
, where
and
denote a vector containing 1 as all the components.
We can compute a vector as:
, and
, such that each component is computed as:
, where
denotes the i-th component of m. Then, the objective can be expressed as (assuming a fixed
):
is the trace operator and diag(f) is a diagonal matrix.
Computational Complexity: We now present the computational time complexity of the major steps involved: i) Cost Function: Computing the cost requires four matrix multiplications resulting in complexity of O(). Next, the transpose of the matrix products need to be computed, requiring O(
). Finally, the sum of losses across all triplets can be computed using a matrix trace operation requiring O(
) complexity. ii) Gradients: The gradient with respect to L requires the following computations: transposes requiring O(
), outer products requiring O(
). The subsequent products require O(
). Hence, the overall complexity is O(
). Our proposed algorithm is linear in terms of the number of triplets in a mini-batch, i.e.,
, which is usually low. The complexity of our algorithm is either linear or quadratic in terms of the original dimensionality d (which in practice is easily controllable within a neural network).
It should be noted that in contrast to existing SSDML approaches, our method also enjoys a lower theoretical computational complexity. For example, LRML has a complexity of O(), while APLLR and APIT are of the order
Convergence analysis Algorithm 1 provides a pseudo-code of our method to jointly learn L in an end-to-end manner. Here,
denote the values of the parameters at the
iteration. In line 3 of Algorithm 1, for a fixed
, we employ Riemannian optimization to learn
. It is done by computing the following Riemannian gradient: grad
. Here, let L(., .) denotes the objective
denotes the Euclidean gradient computed as discussed earlier. The next iterate is obtained as:
, where
is a tangent vector pointing in the negative direction of the Riemannian gradient grad
, and R(.) denotes the retraction. It is well known that moving along the negative direction of the Riemannian gradient minimizes the objective. Hence, we have
. In line 4 of Algorithm
1, for a fixed , we employ SGD to learn
. Therefore, theoretical guarantees of SGD can be directly applied, i.e., the last iterate can converge at the rate of
for updates of the form
. Thus we have
. Furthermore, due to the log-sum-exponential formulation, the objective is bounded, i.e.,
, where
. Since
and
holds true for each iteration,
and the objective is bounded, we have , indicating that our method converges.
Datasets: Following recent literature, the benchmark datasets that have been used are as follows:
• MNIST [23]: It is a benchmark dataset that consists of 70000 gray-scale images of handwritten digits. Each image is of pixels. There are 60000 training images and 10000 test images in the standard split.
• Fashion-MNIST [26]: It is a similar dataset as the MNIST, but consists of images from 10 categories of fashion products. There are 60000 training images and 10000 test images in the standard split.
• CIFAR-10 [27]: This dataset consists of colour images of pixels, containing animal or vehicle objects from 10 different categories. There are 50000 training images and 10000 test images.
• CUB-200 [28]: This dataset consists of images of 200 species of birds with first 100 species for training (5864 examples) and remaining for testing (5924 examples).
• Cars-196 [29]: It consists of images of cars belonging to 196 models. The first 98 models containing 8054 images are used for training. The remaining 98 models containing 8131 images are used for testing.
The MNIST, Fashion-MNIST and CIFAR-10 datasets are widely used benchmarks with sufficiently large number of images for a comparative evaluation of different approaches. The CUB-200 and Cars-196 datasets are well-known for their use in Fine-Grained Visual Categorization (FGVC), and have huge intra-class variances and inter-class similarities.
Implementation details: We adapted the network architectures in the MatConvNet tool [38]. For MNIST and Fashion datasets, the network for MNIST has been adapted as: Conv1(max-pool
Conv2(
max-pool
Conv3(
ReLU
FC(
). For CIFAR-10, we used the following adapted network: Conv1(
max-pool
ReLU
Conv2(
ReLU
avg-pool
Conv3(
ReLU
avg-pool
Conv4(
ReLU
). For our method, we set
in the affinity propagation step, k = 10 in the kNN graph,
in (1), and initial learning rate
. For MNIST and Fashion datasets, we choose 100 labeled examples (10 per class), while for CIFAR-10, we choose 1000 labeled examples (100 per class). We sample a random subset of 9k unlabeled examples and use it along with the labeled data to mine triplets. For each random subset, we run our method for 10 epochs (with mini-batch size of 100 triplets). In total, we run for a maximum of 50 epochs and choose the best model from across all runs. For MNIST and Fashion, we train upon randomly initialized networks. For CIFAR-10, we observed a better performance by pretraining with labeled examples (by replacing the L layer with softmax) for 30 epochs, and then fine-tune using our loss for 50 epochs. For all datasets, the graph has been constructed using the l2-normalized representations obtained just before L.
For the FGVC task, the GoogLeNet [39] architecture pretrained on ImageNet [40], has been used as the backbone CNN, using MatConvNet [38]. We used the Regional Maximum Activation of Convolutions (R-MAC) [41] right before the average pool layer, aggregated over three input scales (). We choose five labeled examples per class. For our method, we set
,
in (1), and embedding size of 128. We take the entire dataset without partition based sampling and run for a maximum of 200 epochs (mini-batch size of 100 triplets), and choose the best model. Specifically, using our triplet mining strategy, we mined 146600 triplets for the CUB dataset, and 201350 triplets for the Cars dataset. In all experiments, we fix a validation dataset by sampling 15% examples from each class of the training data. This validation dataset is used to tune the hyperparameters without taking any feedback from the test data. Note that to learn L with orthogonal constraints we made use of the Manopt [42] tool with CGD (max_iter = 10, with all other default settings).
Compared state-of-the-art baseline approaches: We compare our proposed SSDML method against the following baseline techniques:
• Deep-LRML: This is the stochastic extension of the classical LRML method [2] discussed earlier, by making use of our proposed stochastic approach. It follows the min-max principle for the labeled data (minimizing distances between similar pairs, while maximizing distances
between dissimilar pairs). A Laplacian regularizer is used to capture information from the unlabeled data.
• Deep-ISDML: Stochastic extension of the ISDML [4] method. It is similar to LRML, but makes use of densities around an example to adapt the Laplacian regularizer.
• Deep-SERAPH: Stochastic extension of the SERAPH [3] method that makes use of the entropy minimization principle.
• Deep-APLLR: Stochastic extension of the APLLR [5] method. Makes use of a LogLikelihood Ratio (LLR) based prior metric. The affinity propagation principle is used to propagate information from the labeled pairs to the unlabeled ones, and adapt the Laplacian (but no triplet mining like ours).
• Deep-APIT: Stochastic extension of the APIT [5] method. Makes use of an information-theoretic prior metric. The affinity propagation principle is used to adapt the Laplacian as in APLLR.
• Exemplar (TPAMI’16): This method [30] attempts to learn the parameters associated with certain elementary transformation operations like translation, scaling, rotation, contrast, and colorization applied on random image patches.
• Rotation Net (ICLR’18): This method [31] aims to learn representations of data that can accurately capture the information present in an image despite any rotation of the subject.
• NCE (CVPR’18): This method [32] aims at bringing augmentations of an example together while moving away augmentations from different examples.
• DeepCluster (ECCV’18): This method [33] aims to jointly cluster metric representations while learning pseudo-labels for data in an end-to-end manner.
• Synthetic (AAAI’20): This method [34] learns a metric using synthetic constraints generated in an adversarial manner.
• Triplet (CVPR’15): This method [35] learns a metric using a standard triplet loss. • Angular (ICCV’17): This method [24] learns a metric using an angular loss on a triplet of examples.
Evaluation metrics: For comparing the approaches, we first learn a metric with the training data, and using it we obtain the test embeddings. The methods are compared based on their clustering (wrt NMI) and retrieval (wrt Recall@K, K=1,2,4,8) performances on the test embeddings. NMI is defined as the ratio of mutual information and the average entropy of clusters and entropy of actual ground truth class labels. The Recall@K metric gives us the percentage of test examples that have at least one K nearest neighbor from the same class. A higher value of all these metrics indicates a better performance for an approach.
1. LRML / ISDML vs Ours: Both the LRML and ISDML methods define the affinity matrices directly using the distances among the initial representations. If the initial affinities are poor, the learned metric would be poor as well. On the other hand, our affinity matrix is adapted by the affinity propagation principle, while leveraging labeled data information as well. 2. APLLR / APIT vs Ours: Although APLLR and APIT make use of affinity propagation, they do not mine constraints using the enriched affinity matrix information, whereas we do. While their prior metric may be singular and hence poor, our method has no such dependency on a pre-computed metric. 3. SERAPH vs Ours: In contrast to SERAPH, our method has a stronger regularizer by virtue of orthogonality.