Person re-identification aims at retrieving the most similar images to the probe image, from a large scale gallery set captured by camera networks. Among the challenges which hinder person re-id tasks, include background clutter, Pose, view and illumination variation can be mentioned.
Person re-id can be taken as a person retrieval problem
Figure 1. Shows a variety of existing classification and similarity- based deep person re-id models. (a) Depicts a classification-based deep person re-id model, where refers to the person. (b) Illustrates a verification network whereby the similarity S and dissimilarity D for a pair of images is found. (c) A Triplet loss based DNN, where A, P, N indicate anchor, positive, and negative samples, respectively. (d) A quadruplet based DNN (e) Conventional diffusion-based DNN, which leverages the similarities among all the images in the gallery to learn a better similarity. (f) The proposed deep constrained dominant sets (DCDS), where, P indicates the constraint (probe-image); and, images in the constrained cluster, the enclosed area, indicates the positive samples to the probe image.
based on the ranked similarity score, which is obtained from the pairwise affinities between the probe and the dataset images. However, relying solely on the pairwise affinities of probe-gallery images, ignoring the underlying contextual information between the gallery images often leads to an undesirable similarity ranking. To tackle this, several works have been reported, which employ similarity diffusion to estimate a second order similarity that considers the intrinsic manifold structure of the given affinity matrix [3], [22], [12], [4]. Similarity diffusion is a process of exploiting the contextual information between all the gallery images to provide a context sensitive similarity. Nevertheless, all these methods do not leverage the advantage of deep neural networks. Instead, they employ the similarity diffusion process as a post-processing step on the top of the DNN model. Aiming to improve the discriminative power of a DNN model, there have been recent works which incorporate a similarity diffusion process in an end-to-end manner [25],[26],[7]. Following [5], which applies a random walk in an end-to-end fashion for solving semantic segmentation problem, authors in [25] proposed a group-shuffling random walk network for fully utilizing the affinity information between gallery images in both the training and testing phase. Also, the authors of [26] proposed similarity-guided graph neural network (SGGNN) to exploit the relationship between several prob-gallery image similarities.
However, most of the existing graph-based end-to-end learning methods apply the similarity diffusion without considering any constraint or attention mechanism to the spe-cific query image. Due to that the second order similarity these methods yield is highly prone to noise. To tackle this problem, one possible mechanism could be to guide the similarity propagation by providing seed (or constraint) and let the optimization process estimate the optimal similarity between the seed and nearest neighbors, while treating the seed as our attention point. To formalize this idea, in this paper, we model person re-id problem as finding an internally coherent and externally incoherent constrained cluster in an end-to-end fashion. To this end, we adopt a graph and game theoretic method called constrained dominant sets in an end-to-end manner. To the best of our knowledge, we are the first ones to integrate the well known unsupervised clustering method called dominant sets in a DNN model. To summarize, the contributions of the proposed work are:
• For the very first time, the dominant sets clustering method is integrated in a DNN and optimized in end-to-end fashion.
• A one-to-one correspondence between person re-identification and constrained clustering problem is established.
• State-of-the-art results are significantly improved.
The paper is structured as follow. In section 2, we review the related works. In section 3, we discuss the proposed method with a brief introduction to dominant sets and constrained dominant sets. Finally, in section 4, we provide an extensive experimental analysis on three different benchmark datasets.
Person re-id is one of the challenging computer vision tasks due to the variation of illumination condition, backgrounds, pose and viewpoints. Most recent methods train DNN models with different learning objectives including verification, classification, and similarity learning [9], [42], [31], [1]. For instance, verification network (V-Net) [19], Figure 1(b), applies a binary classification of image-pair representation which is trained under the supervision of binary softmax loss. Learning accurate similarity and robust feature embedding has a vital role in the course of person re-identification process. Methods which integrate siamese network with contrastive loss are a typical showcase of deep similarity learning for person re-id [8]. The optimization goal of these models is to estimate the minimum distance between the same person images, while maximizing the distance between images of different persons. However, these methods focus on the pairwise distance ignoring the contextual or relative distances. Different schemes have tried to overcome these shortcomings. In Figure 1(c), triplet loss is exploited to enforce the correct order of relative distances among image triplets [9], [11], [42] . In Figure 1(d), Quadruplet loss [8] leverages the advantage of both contrastive and triplet loss, thus it is able to maximize the intra-class similarity while minimizing the inter-class similarity. Emphasizing the fact that these methods entirely neglect the global structure of the embedding space, [7], [25], [26] proposed graph based end-to-end diffusion methods shown in Figure 1(e).
Graph based end-to-end learning. Graph-based methods have played a vital role in the rapid growth of computer vision applications in the past. However, lately, the advent of deep convolutional neural networks and their tremendous achievements in the field has attracted great attention of researchers. Accordingly, researchers have made a sig-nificant effort to integrate, classical methods, in particular, graph theoretical methods, in end-to-end learning. Shen et al. [26] developed two constructions of deep convolutional networks on a graph, the first one is based upon hierarchical clustering of the domain, and the other one is based on the spectrum of graph Laplacian. Yan et al. [37] proposed a model of dynamic skeletons called Spatial-Temporal Graph Convolutional Networks (ST-GCN), which provides a capability to automatically learn both the spatial and temporal pattern of data. Bertasius et al. [5] designed a convolutional random walk (RWN), where by jointly optimizing the objective of pixelwise affinity and semantic segmentation they are able to address the problem of blobby boundary and spatially fragmented predictions. Likewise, [25] integrates random walk method in end-to-end learning to tackle person re-identification problem. In [25], through the proposed deep random walk and the complementary feature grouping and group shuffling scheme, the authors demonstrate that one can estimate a robust probe-gallery affinity. Unlike recent Graph neural network (GNN) methods [26], [17], [25], [7], Shen et al. [26] learn the edge weights by exploiting the training label supervision, thus they are able to learn more accurate feature fusion weights for updating nodes feature.
Recent applications of dominant sets. Dominant sets (DS) clustering [24] and its constraint variant constrained dominant sets (CDS) [40] have been employed in several recent computer vision applications ranging from person tracking [29], [30], geo-localization [41], image retrieval [38], [2], 3D object recognition [32], to Image segmentation and co-segmentation [39]. Zemene et al. [40] presented CDS with its applications to interactive Image segmentation. Following, [39] uses CDS to tackle both image segmentation and co-segmentation in interactive and unsupervised setup. Wang et al. [32] recently used dominant sets clustering in a recursive manner to select representative images from a collection of images and applied a pooling operation on the refined images, which survive at the recursive selection process. Nevertheless, none of the above works have attempted to leverage the dominant sets algorithm in an end-to-end manner.
In this work, unlike most of the existing graph-based DNN model, we propose a constrained clustering based scheme in an end-to-end fashion, thereby, leveraging the contextual information hidden in the relationship among person images. In addition, the proposed scheme sig-nificantly magnifies the inter-class variation of different person-images while reducing the intra-class variation of the same person-images. The big picture of our proposed method is depicted in Figure 1(f), as can be seen, the objective here is to find a coherent constrained cluster which incorporates the given probe image P.
In this work, we cast probe-gallery matching as optimizing a constrained clustering problem, where the probe image is treated as a constraint, while the positive images to the probe are taken as members of the constrained-cluster. Thereby, we integrate such clustering mechanism into a deep CNN to learn a robust features through the leveraged contextual information. This is achieved by traversing through the global structure of the given graph to induce a compact set of images based on the given initial similarity(edge-weight).
3.1. Dominant Sets and Constrained Dominant Sets
Dominant sets is a graph theoretic notion of a cluster, which generalizes the concept of a maximal clique to edge-weighted graphs. First, the data to be clustered are represented as an undirected edge-weighted graph with no self-loops, G = (V, E, w), where V = {1, ..., M} is the vertex set, is the edge set, and is the (positive) weight function. Vertices in G correspond to data points, edges represent neighborhood relationships, and edge-weights reflect similarity between pairs of linked vertices. As customary, we represent the graph G with the corresponding weighted adjacency (or similarity) ma-
trix, which is the nonnegative, symmetric matrix , defined as , if , and otherwise. Note that the diagonal elements of the adjacency matrix A are always set to zero indicating that there is no self-loops in graph G. As proved in [24], one can extract a coherent cluster from a given graph by solving a quadratic program f(x) as,
where, is the standard simplex of . Zemene et. al [40] proposed an extension of dominant sets which allows one to constrained the clustering process to contain intended constraint nodes P. Constrained dominant set (CDS) is an extensions of dominant set which contains a parameterized regularization term that controls the global shape of the energy landscape. When the regularization parameter is zero the local solutions are known to be in one-to-one correspondence with the dominant sets. A compact constrained cluster could be easily obtained from a given graph by defining a paramertized quadratic program as,
for i = 1, ..., M.
3.2. Modeling person re-id as a Dominant Set
Recent methods [7], [5] have proposed different models, which leverage local and group similarity of images in an end-to-end manner. Authors in [7] define a group similarity which emphasizes the advantages of estimating a similarity of two images, by employing the dependencies among the whole set of images in a given group. In this work, we establish a natural connection between finding a robust probe-gallery similarity and constrained dominant sets. Let us first elaborate the intuitive concept of finding a coherent subset from a given set based on the global similarity of given images. For simplicity, we represent person-images as vertices of graph G, and their similarity as edge-weight . Given vertices V, and be a non-empty subset of
Figure 2. Workflow of the proposed DCDS. Given n number of gallery images, G, and probe image P, we first extract their Resent101 features right before the global average pooling (GAP) layer, which are then fed to CDS-Net (upper stream) and V-Net (lower stream) branches. In the CDS-branch, after applying GAP, we compute the similarity between pair of probe-gallery image features, using their dot products, where T denotes a transpose. Thereby, we obtain affinity matrix. Then, we run CDS taking the probe image as a constraint to find the solution (similarity), and the dissimilarity, is computed as an additive inverse of the similarity Likewise, in the lower stream we apply elementwise subtraction on M pair of probe-gallery features. This is followed by GAP, batch normalization (BN), and fully connected layer (FC) to obtain probe-gallery similarity score, and probe-gallery dissimilarity score, . Afterward, we elementwise multiply , to find the final similarity, disimilarity, scores, respectively. Finally, to find the prediction loss of our model, we apply a cross entropy loss, the ground truth (is given as
Figure 3. Let comprises probe, P, and gallery images . As can be observed from the above toy example, the proposed method asses the contribution of each participant node with respect to the subset S\i. (1) shows graph G, showing the pairwise similarities of query-gallery images. (2-5) show the relative weight (Equ. 4) of each node with respect to the overall similarity between i and shows that the Node has a negative impact on the coherency of the cluster. (3) shows that clustering has a positive contribution to the compactness of set (4), similarly, shows the relative weight of (5) shows the relative weight of . And, (6) is a coherent subset (cluster) extracted from the graph given in (1).
vertices and , average weighted degree of each i with regard to S is given as
where measures the (relative) similarity between node j and i, with respect to the average similarity between node i and its neighbors in S. Note that can be either positive or negative. Next, to each vertex we assign a weight defined (recursively) as follows:
where for all . Intuitively, gives us a measure of the overall similarity between vertex i and the vertices of S \ {i}, with respect to the overall similarity among the vertices in S \ {i}.
Hence, a positive indicates that adding i into its neighbors in S will raise the internal coherence of the set, whereas in the presence of a negative value we expect the overall coherence to decline. In CDS, besides the additional feature, which allows us to incorporate a constraint element in the resulting cluster, all the characters of DS are inherited.
3.2.1 A Set of a person images as a constrained cluster
We cast person re-identification as finding a constrained cluster, where, elements of the cluster correspond to a set of same person images and the constraint refers to the probe image used to extract the corresponding cluster. As customary, let us consider a given mini-batch with M number of person-images, and each mini batch with k person identities (ID), thus, each person-ID has images in the given mini-batch. Note that, here, instead of a random sampling we design a custom sampler which samples k number of person IDs in each mini-batch. Let refers to the set of images in a single mini-batch. Each time when we consider image as a probe image P, images which belong to the same person id, should be assigned a large membership score to be in that cluster. In contrast, the remaining images in the mini-batch should be assigned sig-nificantly smaller membership-score to be part of that cluster. Note that our ultimate goal here is to find a constrained cluster which comprises all the images of the corresponding person given in that specific mini-batch. Thus, each participant in a given mini-batch is assigned a membership-score to be part of a cluster. Furthermore, the characteristics vector, which contains the membership scores of all participants is always a stochastic vector, meaning that where denotes the membership score of each image in the cluster.
As can be seen from the toy example in Figure 3, the initial pairwise similarities between the query and gallery images hold valuable information, which define the relation of nodes in the given graph. However, it is not straightforward to redefine the initial pairwise similarities in a way which exploit the inter-images relationship. Dominant Sets (DS) overcome this problem with defining a weight of each image with regard to subset S\i as depicted in Figurerespectively. As can be observed from Figure 3, adding node to cluster S degrades the coherency of cluster whereas the relative similarity of the remaining images with respect to set has a positive impact on the coherency of the cluster. It is evident that the illustration in Figure 3 verifies that the proposed DCDS (Deep Constrained Dominant Set) could easily measure the contribution of each node in the graph and utilize it in an end-to-end learning process. Thereby, unlike a siamese, triplet and quadruplet based contrastive methods, DCDS consider the whole set of images in the mini-batch to measure the similarity of image pairs and enhance the learning process.
3.3. CDS Based End-to-end Learning
In this section, we discuss the integration of CDS in end-to-end learning. We adopt a siamese based Resent101, with a novel verification loss to find probe-gallery similarity, R, and dissimilarity, D scores. As can be seen from Figure 2, we have two main branches: CDS network branch (CDSNet) and verification network branch (V-Net). In the CDSNet, the elements of pairwise affinity matrix are computed first as a dot product of the global pooling feature of a pair of images. Afterward, the replicator dynamics [36] is applied, which is a discrete time solver of the parametrized quadratic program, Equ. 5, whose solution corresponds to the CDS. Thus, assuming that there are M images in the given mini-batch, the replicator dynamics, Equ. 3, is recursively applied M times taking each image in the mini-batch as a constraint. Given graph G = (V, E, w) and its corresponding adjacency matrix and probe First, a proper modification of the affinity matrix A is applied by setting parameter to the diagonal corresponding to the subset V \P and zero to the diagonal corresponding to the constraint image P. Next, the modified adjacency matrix, B, is feed to the Replicator dynamics, by initiating the dynamics with a characteristic vector of uniform distribution , such that initially all the images in the mini-batch are assigned equal membership probability to be part of the cluster. Then, to find a constrained cluster a parametrized quadratic program is defined as:
maximize subject to
The solution, of is a characteristics vector which indicates the probability of each gallery image to be included in a cluster, containing the probe image . Thus, once we obtain the CDS, for each probe image, we store each solution , in as
Likewise, for each probe, we store the probe-gallery similarity, R, and dissimilarity, D, obtained from the V-Net (shown in Figure 2) in and as, and Next, we fuse the similarity obtained from the CDS branch with the similarity from the V-Net as
is empirically set to 0.3. We then vectorize and into where, the first column stores the dissimilarity score, while the second column stores the similarity score. Afterward, we simply apply cross entropy loss to find the prediction loss. The intriguing feature of our model is that it does not need any custom optimization technique, it can be end-to-end optimized through a standard back-propagation algorithm. Note that, Figure 2 illustrates the case of a single probe-gallery, whereas Equ. 6 shows the solution of M probe images in a given mini-batch.
3.4. Auxiliary Net
In this work, we integrate an auxiliary net to further improve the performance of our model. The auxiliary net is trained based on the multi-scale prediction of Resnet50 [15]. It is a simple yet effective architecture, whereby we
Figure 4. Illustrates the auxiliary net, which consists of two branches which are jointly trained. We first use features at different layers, and then feed these to Global Maxpooling (GMP), Conv, BN, Relu and FC layers for further encoding. We then compute triplet losses employing the features from the lower three streams after Relu, shown by yellow, blue, and red circles. Next, after the final FC layer, we compute the cross-entropy loss for each of the six different outputs, from the upper and lower stream shown by distinct colored-boxes. Note that even if the upper and lower stream apply the same operations, on they do not share the weights; thus the encoding is different. We finally compute the final loss as the sum of the average of the triplet and cross entropy losses.
Figure 5. During testing, given a probe and gallery images, we extract DCDS and auxiliary features and concatenate them to find a single vector. Afterward, we build M x M affinity matrix and run CDS with constraint expansion mechanism to find the final probe-gallery similarity rank.
can easily compute both triplet and cross entropy loss of different layers of Resnet50 [15], hence further enhancing the learning capability. Consequently, we compute the average of both losses to find the final loss. As can be observed from Figure 4, we employ three features at different layers from Resnet50 conv5 x Layer, and then we fed these three features to the subsequent layers, MP, Conv, BN, and FC layers. Next, we compute triplet and cross entropy loss for each feature which comes from the Relu and FC layers, respectively. During testing phase we concatenate the features that come from the DCDS and Auxiliary Net to find 4096 dimensional feature. We then apply CDS to find the final ranking score, (See Figure 5).
3.5. Constraint Expansion During Testing
We propose a new scheme (illustrated in Figure 6) to expand the number of constraints in order to guide the similarity propagation during the testing phase. Given an affin-ity matrix, which is constructed using the features obtained from the concatinated feature (shown in Figure 5), we first collect k-NN’s of the probe image. Then, we run CDS on the graph of the NNs. Next, from the resulting constrained cluster, we select the one with the highest membership score, which is used as a constraint in the subsequent step. We then use multiple-constraints and run CDS over the entire graph.
To validate the performance of our method we have conducted several experiments on three publicly available benchmark datasets, namely CUHK03 [19], Market1501 [43], and DukeMTMC-reID [45].
4.1. Datasets and evaluation metrics
Datasets: CUHK03 [19] dataset comprises 14,097 manually and automatically cropped images of 1,467 identities, which are captured by two cameras on campus; in our experiments, we have used manually annotated images. Market1501 dataset [43] contains 32,668 images which are split into 12, 936 and 19,732 images as training and testing set, respectively. Market1501 dataset has totally 1501 identities which are captured by five high-resolution and one low-resolution cameras, the training and testing sets have 751 and 750 identities respectively. To obtain the person bounding boxes, Deformable part Model (DPM) [14] is utilized. DukeMTMC-reID is generated from a tracking dataset called DukeMTMC. DukeMTMC is captured by 8 high-resolution cameras, and person-bounding box is manually cropped; it is organized as 16,522 images of 702 person for training and 18, 363 images of 702 person for testing.
Evaluation Metrics: Following the recent person re-id methods, we use mean average precision (mAP) as suggested in [43], and Cumulated Matching Characteristics (CMC) curves to evaluate the performance of our model. Furthermore, all the experiments are conducted using the standard single query setting [43].
4.2. Implementation Details
We implement DCDS based on Resnet101 [15] architecture, which is pretrained on imagenet dataset. We adopt the training strategy of Kalayeh et al. [16], and aggregate eight different person re-id benchmark dataset to train our model. In total, the merged dataset contains 89,091 images, which comprises 4937 person-ID (detail of the eight datasets is given in the supplementary material). We first train our model using the merged dataset (denoted as multi-dataset (MD)) for 150 epochs and fine-tune it with CUHK03, Mar-ket1501, and DukeMTMC-reID dataset. To train our model using the merged dataset, we set image resolution to 450 150. Subsequently, for fine-tuning the model we set image resolution to 384 128. Mini-batch size is set to 64, each mini-batch has 16 person-ID and each person-ID has 4 images. We also experiment only using a single dataset for training and testing, denoted as single-dataset (SD). For data augmentation, we apply random horizontal flipping
Figure 6. Given a constraint (probe-image) we first collect kNNs to the probe-image, based on the pairwise similarities. Subsequently, we run CDS on the graph of the k-NN. Then, based on the cluster membership score obtained, we choose image highest membership score and re-run CDS, considering as constraints, over the graph of the all set of images, minibatch. Afterward, we consider the solution as our final rank.
Table 1. A comparison of the proposed method with state-of-the- art methods on Market1501 dataset. Upper block, without re-ranking methods. Lower block, with re-ranking method, w/RR, [46].
and random erasing [47]. For optimization we use Adam, we initially set the learning rate to 0.0001, and drop it by 0.1 in every 40 epochs. The fusing parameter in Equ. 6, , is set to 0.9.
4.3. Results on Market1501 Datasets
As can be seen from Table 1, on Market dataset, our proposed method improves state-of-the-art method [16] by 2.5%, 1.21%, and 0.6% in mAP, rank-1 and rank-5
Figure 7. Illustrates different experimental analysis performed on Market1501 dataset. a) shows the impact of fusing parameter Equ. 6. b) shows the performance of our model with varying the number of images per person in a given batch. c) and d) illustrate the similarity between the probe and gallery images obtained from the baseline and DCDS method, respectively. It can be observed that the baseline method has given larger similarity values for false positive samples (red asterisks above the blue dashed-line) and smaller similarity values for false negative samples (green circles below the blue dashed- line). On the other hand, the proposed DCDS has efficiently assigned the appropriate similarity scores to the true positive and negative samples.
scores, respectively. Moreover, comparing to state-of-the-art graph-based DNN method, SGGNN [26], the improvement margins are 3%, 2.5%, and 2% in mAP, rank-1, and rank-5 score, respectively. Thus, our framework has signifi-cantly demonstrated its benefits over state-of-the-art graph-based DNN models. To further improve the result we have adapted a re-ranking scheme [46], and we compare our method with state-of-the art methods which use a re-ranking method as a post-processing. As it can be seen from Table 1, our method has gain mAP of 2.2% over HSP [16], and 10.5 % over SGGNN[26], 10.8 % over DGSRW.
4.4. Results on CUHK03 Datasets
Table 5 shows the performance of our method on CUHK03 dataset. Since most of the Graph-based DNN models report their result on the standard protocol [20], we have experimented on the standard evaluation protocol, to make fair comparison. As can be observed from Table 5, our method gain a marginal improvement in the mAP. Using a reranking method [46], we have reported a competitive result in all evaluation metrics.
4.5. Results on DukeMTMC-reID Dataset
Likewise, in DukeMTMC-reID dataset, the improvements of our proposed method is noticeable. Our method has surpassed state-of-the-art method [16] by 1.7%/1.6% in mAP/rank-1 scores. Moreover, comparing to state-of-the-art graph-based DNN, our method outperforms DGSRW
Table 2. Ablation studies on the proposed method. SD and MD respectively refer to the method trained on single and multiple-aggregated datasets. Baseline is the proposed method without CDS branch.
Table 3. A comparison of the proposed method with state-of-the-art methods on CUHK03 dataset.
Table 4. A comparison of the proposed method with state-of-the-art methods on DukeMTMC-reID dataset.Upper block, without re-ranking methods. Lower block, with re-ranking method,w/RR, [46].
[25], SGGNN [26] and GCSL [7] by 9.1%, 7.3%, and 6% in mAP, respectively.
4.6. Ablation Study
To investigate the impact of each component in our architecture, we have performed an ablation study. Thus, we have reported the contributions of each module in Table 2. To make a fair comparison with the baseline and graph-based DNN models, the ablations study is conducted in a
single-dataset (SD) setup.
Improvements over the Baseline. As our main contribution is the DCDS, we examine its impact over the baseline method. The baseline method refers to the lower branch of our architecture that incorporates the verifica-tion network, which has also been utilized in [27], [25], [26]. On Market1501 dataset, DCDS provides improvements of 9.2%, 6.8% and 3.6% in mAP, rank-1, and rank-5 scores, respectively, over the baseline method; whereas in DukeMTMC-reID dataset the proposed DCDS improves the baseline method by 8.0%, 5.5% and 1.7% in mAP, rank-1, and rank-5 scores, respectively.
Comparison with graph-based deep models. We compare our method with recent graph-based-deep models, which adapt similar baseline method as ours, such as [25],[26]. As a result, on DukeMTMC-reID dataset our method surpass [25] by 9.1%/6.8%, and [26] by 17.9 % / 7.4 % in mAP / rank-1 scores. In light of this, We can conclude that incorporating a constrained-clustering mechanism in end-to-end learning has a significant benefit on finding a robust similarity ranking. In addition, experimental findings demonstrate the superiority of DCDS over existing graph-based DNN models.
Parameter analysis. Experimental results by varying several parameters are shown in Figure 7. Figure 7(a) shows the effect of fusing parameter, Equ. (6) on the mAP. Thereby, we can observe that the mAP tends to increase with a larger value. This shows that the result gets better when we deviate much from the CDS branch. Figure 7(b) shows the impact of the number of images per person-ID () in a given batch. We have experimented setting to 4, 8, and 16, as can be seen, we obtain a marginal improvement when we set to 16. However, considering the direct relationship between the running time and , the improvement is negligible. c) and d) show probe-gallery similarity obtained from baseline and DCDS method, using three different probe-images, with a batch size of 64, and setting to 4, 8 and 16.
In this work, we presented a novel insight to enhance the learning capability of a DNN through the exploitation of a constrained clustering mechanism. To validate our method, we have conducted extensive experiments on several benchmark datasets. Thereby, the proposed method not only improves state-of-the-art person re-id methods but also demonstrates the benefit of incorporating a constrained-clustering mechanism in the end-to-end learning process. Furthermore, the presented work could naturally be extended to other applications which leverage a similaritybased learning. As a future work, we would like to investigate dominant sets clustering as a loss function.
This research is partly supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. D17PC00345. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.
[1] E. Ahmed, M. J. Jones, and T. K. Marks. An improved deep learning architecture for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 3908–3916, 2015. 2
[2] L. T. Alemu and M. Pelillo. Multi-feature fusion for image retrieval using constrained dominant sets. CoRR, abs/1808.05075, 2018. 3
[3] S. Bai, X. Bai, and Q. Tian. Scalable person re-identification on supervised smoothed manifold. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 3356– 3365, 2017. 1
[4] S. Bai, Z. Zhou, J. Wang, X. Bai, L. J. Latecki, and Q. Tian. Ensemble diffusion for retrieval. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 774–783, 2017. 1
[5] G. Bertasius, L. Torresani, S. X. Yu, and J. Shi. Convolu- tional random walk networks for semantic image segmentation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 6137–6145, 2017. 2, 3
[6] X. Chang, T. M. Hospedales, and T. Xiang. Multi-level fac- torisation net for person re-identification. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR
2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 2109–2118, 2018. 7, 8
[7] D. Chen, D. Xu, H. Li, N. Sebe, and X. Wang. Group consistent similarity learning via deep CRF for person re-identification. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 8649–8658, 2018. 2, 3, 7, 8
[8] W. Chen, X. Chen, J. Zhang, and K. Huang. Beyond triplet loss: A deep quadruplet network for person re-identification. In CVPR, pages 1320–1329. IEEE Computer Society, 2017. 2
[9] D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng. Per- son re-identification by multi-channel parts-based CNN with improved triplet loss function. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 1335–1344, 2016. 2
[10] S. B. D. Gray and H. Tao. Evaluating appearance models for recognition, reacquisition. In IEEE International workshop on performance evaluation of track- ing and surveillance,, pages 31–44, 2007. 11
[11] S. Ding, L. Lin, G. Wang, and H. Chao. Deep feature learning with relative distance comparison for person re-identification. Pattern Recognition, 48(10):2993–3003, 2015. 2
[12] M. Donoser and H. Bischof. Diffusion processes for retrieval revisited. In CVPR, pages 1320–1327. IEEE Computer Society, 2013. 1
[13] H. Fan, L. Zheng, C. Yan, and Y. Yang. Unsupervised person re-identification: Clustering and fine-tuning. TOMCCAP, 14(4):83:1–83:18, 2018. 11
[14] P. F. Felzenszwalb, R. B. Girshick, D. A. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell., 32(9):1627–1645, 2010. 6
[15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learn- ing for image recognition. In CVPR, pages 770–778. IEEE Computer Society, 2016. 5, 6
[16] M. M. Kalayeh, E. Basaran, M. G¨okmen, M. E. Kamasak, and M. Shah. Human semantic parsing for person re-identification. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 1062–1071, 2018. 6, 7, 8
[17] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. CoRR, abs/1609.02907, 2016. 2
[18] W. Li, R. Zhao, and X. Wang. Human reidentification with transferred metric learning. In Computer Vision - ACCV 2012 - 11th Asian Conference on Computer Vision, Daejeon, Korea, November 5-9, 2012, Revised Selected Papers, Part I, pages 31–44, 2012. 11
[19] W. Li, R. Zhao, T. Xiao, and X. Wang. Deepreid: Deep filter pairing neural network for person re-identification. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pages 152–159, 2014. 2, 6
[20] W. Li, R. Zhao, T. Xiao, and X. Wang. Deepreid: Deep filter pairing neural network for person re-identification. In CVPR, pages 152–159. IEEE Computer Society, 2014. 7, 11
[21] W. Li, X. Zhu, and S. Gong. Harmonious attention network for person re-identification. In CVPR, pages 2285–2294. IEEE Computer Society, 2018. 7
[22] C. C. Loy, C. Liu, and S. Gong. Person re-identification by manifold ranking. In IEEE International Conference on Image Processing, ICIP 2013, Melbourne, Australia, September 15-18, 2013, pages 3567–3571, 2013. 1
[23] C. C. Loy, T. Xiang, and S. Gong. Multi-camera activity cor- relation analysis. In CVPR, pages 1988–1995. IEEE Computer Society, 2009. 11
[24] M. Pavan and M. Pelillo. Dominant sets and pairwise cluster- ing. IEEE Trans. Pattern Anal. Mach. Intell., 29(1):167–172, 2007. 3
[25] Y. Shen, H. Li, T. Xiao, S. Yi, D. Chen, and X. Wang. Deep group-shuffling random walk for person re-identification. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 2265–2274, 2018. 2, 7, 8
[26] Y. Shen, H. Li, S. Yi, D. Chen, and X. Wang. Person re- identification with deep similarity-guided graph neural network. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XV, pages 508–526, 2018. 2, 7, 8
[27] Y. Shen, T. Xiao, H. Li, S. Yi, and X. Wang. End-to-end deep kronecker-product matching for person re-identification. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 6886–6895, 2018. 7, 8
[28] Y. Suh, J. Wang, S. Tang, T. Mei, and K. M. Lee. Partaligned bilinear representations for person re-identification. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIV, pages 418–437, 2018. 7, 8
[29] Y. T. Tesfaye, E. Zemene, M. Pelillo, and A. Prati. Multi- object tracking using dominant sets. IET Computer Vision, 10(4):289–297, 2016. 3
[30] Y. T. Tesfaye, E. Zemene, A. Prati, M. Pelillo, and M. Shah. Multi-target tracking in multiple non-overlapping cameras using constrained dominant sets. CoRR, abs/1706.06196, 2017. 3
[31] R. R. Varior, M. Haloi, and G. Wang. Gated siamese convolutional neural network architecture for human re-identification. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII, pages 791–808, 2016. 2
[32] C. Wang, M. Pelillo, and K. Siddiqi. Dominant set clustering and pooling for multi-view 3d object recognition. In BMVC. BMVA Press, 2017. 3
[33] Y. Wang, Z. Chen, F. Wu, and G. Wang. Person re-identification with cascaded pairwise convolutions. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 1470–1478, 2018. 7, 8
[34] Y. Wang, L. Wang, Y. You, X. Zou, V. Chen, S. Li, G. Huang, B. Hariharan, and K. Q. Weinberger. Resource aware person re-identification across multiple resolutions. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 8042–8051, 2018. 7, 8
[35] L. Wei, S. Zhang, W. Gao, and Q. Tian. Person transfer GAN to bridge domain gap for person re-identification. In CVPR, pages 79–88. IEEE Computer Society, 2018. 11
[36] J. W. Weibull. Evolutionary Game Theory. MIT press, 1995. 5
[37] S. Yan, Y. Xiong, and D. Lin. Spatial temporal graph convo- lutional networks for skeleton-based action recognition. In Proceedings of the Thirty-Second AAAI Conference on Arti-ficial Intelligence, New Orleans, Louisiana, USA, February 2-7, 2018, pages 7444–7452, 2018. 2
[38] E. Zemene, L. T. Alemu, and M. Pelillo. Constrained dom- inant sets for retrieval. In 23rd International Conference on Pattern Recognition, ICPR 2016, Canc´un, Mexico, December 4-8, 2016, pages 2568–2573, 2016. 3
[39] E. Zemene, L. T. Alemu, and M. Pelillo. Dominant sets for ”constrained” image segmentation. CoRR, abs/1707.05309, 2017. 3
[40] E. Zemene and M. Pelillo. Interactive image segmentation using constrained dominant sets. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII, pages 278–294, 2016. 3
[41] E. Zemene, Y. T. Tesfaye, H. Idrees, A. Prati, M. Pelillo, and M. Shah. Large-scale image geo-localization using dominant sets. IEEE Trans. Pattern Anal. Mach. Intell., 41(1):148– 161, 2019. 3
[42] L. Zhao, X. Li, Y. Zhuang, and J. Wang. Deeply-learned part-aligned representations for person re-identification. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 3239–3248, 2017. 2
[43] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian. Scalable person re-identification: A benchmark. In ICCV, pages 1116–1124. IEEE Computer Society, 2015. 6, 11
[44] W. Zheng, S. Gong, and T. Xiang. Associating groups of people. In BMVC, pages 1–11. British Machine Vision Association, 2009. 11
[45] Z. Zheng, L. Zheng, and Y. Yang. Unlabeled samples gener- ated by GAN improve the person re-identification baseline in vitro. In ICCV, pages 3774–3782. IEEE Computer Society, 2017. 6, 11
[46] Z. Zhong, L. Zheng, D. Cao, and S. Li. Re-ranking person re-identification with k-reciprocal encoding. In CVPR, pages 3652–3661. IEEE Computer Society, 2017. 7, 8
[47] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang. Random erasing data augmentation. CoRR, abs/1708.04896, 2017. 7
Figure 8. On the right hand side, the target matrix is shown. There are total 16 persons in the mini-batch and 4 images per ID (batch size = 64. In the target matrix, the white-blocks represent the similarity between the same person-images in the mini-batch, whereas the black-blocks of the matrix define the dissimilarities between different person images. In the similarity matrix shown left ( after one epoch) and middle (after epochs) each row of the output matrix denotes the fused similarity obtained from the CDS-Net and V-Net, per Equ. (6) in the main manuscript. Thus, we optimize our model until we obtain an output with a similar distribution of the target matrix. As can be seen, our model has effectively learned and gives a similarity matrix (shown in the middle) which is closer to the target matrix.
In the supplementary material, we provide additional experiments on cross-dataset person-re-identification (re-id) using the proposed deep constrained dominant sets (DCDS) on Market1501 dataset. In section one, we summarize the datasets we used in our experiments. In section two, we present the experiments we have performed on cross-dataset person re-id. And, in section three, we provide hyper parameter analysis on DukeMTMC-reID and CUHK03 datasets. Figure 8 illustrates an example of our method training-output (left) and learning objective, target matrix, (right). Figure 9 demonstrates the similarity fusing process, between the V-Net and CDS-Net, alongside sample qualitative results.
Due to the lack of abundant labeled data, cross-dataset person re-id has attracted great interest. Recently, Fan et al. [13] have developed a progressive clustering-based method to attack cross-dataset person re-id problem. To further val-
Table 5. A comparison of the proposed method with PUL [13] on Market1501 dataset.
idate our proposed DCDS, we apply our method on cross-dataset person re-id problem and compare it with progressive unsupervised learning (PUL) [13]. To this end, we train our model on DukeMTMC-reID and CUHK03 datasets and test it on Market1501 dataset. We then compare it with PUL [13], which has also been trained on CUHK03 and DukeMTMC-reID datasets. As can be observed from Table 5, even though our proposed method is not intended for cross-dataset re-id, it has gained a substantial improvements over PUL [13], that was mainly designed to attack person re-id problem in a cross-dataset setup.
Similar to the parameter analysis reported in the main manuscript, we report hyper parameter analysis on DukeMTMC-reID and CUHK03 dataset. The performance of our method with respect to the fusing parameters on DukeMTMC-reID and CUHK03 are shown in Figure 10 (a) and Figure 10 (b), respectively. Thereby, as can be observed, the results show similar phenomena as in Mar-ket1501, where the mAP increases with a larger value. Figure 11 shows the similarity distribution given by the baseline and the proposed DCDS using three different probe-images, with a batch size of 64, and setting to 4, 8 and 16.
Figure 9. Exemplar results obtained as a result of the similarity fusion between the V-Net and CDS-Net. The Upper-row shows the probe and gallery similarity (R) obtained from the V-Net, where the green circles show persons similar to the probe (shown by purple-circle), while the red circles denote persons different from the probe image. Middle-row shows the workflow in CDS-Net. First, graph G is formed using the similarity obtained from the dot products. We then construct the modified affinity matrix B, followed by application of replicator dynamics on B to obtain the probe gallery similarity (). Finally, We elementwise multiply to find the final probe-gallery similarity (), shown in the third row. The intensity of the edges in, define the similarity value, where the bold ones denote larger similarity values, whereas the pale-edges depict smaller similarity values.
Figure 10. Performance of our model with respect to fusing parameter , on (a) CUHK03, and (b) DukeMTMC-reID, datasets.
Figure 11. Shows experimental analysis performed on CUHK03 and DukeMTMC-reID illustrate the similarity between the probe-gallery images obtained from the baseline and DCDS method, respectively. It can be observed that the baseline method has assigned larger similarity values for false positive samples (red asterisks above the blue dashed-line) and smaller similarity values for false negative samples (green circles below the blue dashed-line). On the other hand, the proposed DCDS has efficiently assigned the appropriate similarity scores to the true positive and negative samples. Note that, for better visibility, we have randomly assigned a large (close to 1) self-similarity value to the probe (blue-circle).
Thank you Leulseged Tesfaye Alemu, Marcello Pelillo, Mubarak Shah, who authored Deep Constrained Dominant Sets for Person Re-identification 🙏 This page is the html of their arXiv pdf, with no changes made other than format. Please cite their work