Unsupervised Representation Learning by Discovering Reliable Image Relations

2019·Arxiv

Abstract

Abstract

Learning robust representations that allow to reliably establish relations between images is of paramount importance for virtually all of computer vision. Annotating the quadratic number of pairwise relations between training images is simply not feasible, while unsupervised inference is prone to noise, thus leaving the vast majority of these relations to be unreliable. To nevertheless find those relations which can be reliably utilized for learning, we follow a divide-and-conquer strategy: We find reliable similarities by extracting compact groups of images and reliable dissimilarities by partitioning these groups into subsets, converting the complicated overall problem into few reliable local subproblems. For each of the subsets we obtain a representation by learning a mapping to a target feature space so that their reliable relations are kept. Transitivity relations between the subsets are then exploited to consolidate the local solutions into a concerted global representation. While iterating between grouping, partitioning, and learning, we can successively use more and more reliable relations which, in turn, improves our image representation. In experiments, our approach shows state-of-the-art performance on unsupervised classification on ImageNet with 46.0% and competes favorably on different transfer learning tasks on PASCAL VOC.

Keywords: Unsupervised Learning, Visual Representation Learning, Unsupervised Image Classification, Mining Reliable Relations, Divide-and-Conquer

2019. This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/

1. Introduction

The driving force of deep learning has been supervised training using vast amounts of tediously labeled training samples, such as object bounding boxes for visual recognition. Since easily accessible visual data is growing exponentially, manual labeling of training samples constitutes a bottleneck to utilizing all this valuable data. Consequently, there has recently been great interest in weakly supervised [39], self-supervised [21], and unsupervised [34, 20] approaches to representation learning. Fundamental computer vision problems like classifica-tion [35, 25], object detection [36] and image segmentation [23] all directly depend on such learned representations to find similar objects or group related image areas.

To learn a characteristic representation of images and the distances between them, different degrees of supervision can be considered: (i) supervised learning using samples with class labels [19], (ii) user feedback providing weakly supervised side information in terms of pairwise constraints [29], (iii) problem specific surrogate tasks such as colorization [17], permutations [21], or transitivity [32], and (iv) unsupervised feature learning [1]. Regardless of the training signal, be it unaries such as class labels [26], binary similarity constraints between samples

[29] or sample ordering constraints [24], a dataset of N training samples gives rise to N2 pairwise relations, exploitable for learning our representation. In the absence of supervisory information, these relations need to be automatically inferred during training. However, the vast majority of these inferred pairwise relations turn out to be unreliable as discussed in Sect. 3, Fig. 2, and Fig. 3. Despite the danger of diminished performance due to learning from spurious relations, recent approaches on unsupervised representation learning [2, 1], nevertheless, do not question the reliability of these relations. Now, assuming that only a small fraction of correct relations per sample can be iden-tified reliably (i.e. we are left with at most O(N) class labels or pairwise link constraints), how can we discover those few reliable relations, when no label or guiding side information is available?

In this work we propose a novel approach to visual representation learning that explicitly identifies and leverages reliable image relations without the need for annotations, supervision, problem-specific surrogate tasks for self-supervision, or pre-training. By extracting compact groups of images we are able to harness reliable similarities. Subsequently, we divide these compact groups constituting the overall learning problem into smaller, (potentially overlapping) subproblems, such that each contains only reliable dissimilarities between their groups. Thus, whereas the complicated global problem suffers from many of the N2 relations not being reliable, we ensure

Figure 1: Overview of our iterative learning procedure. We first find reliable similarity constraints by forming compact groups. To avoid unreliable dissimilarities, we partition the data into sets of mutually dissimilar groups, ¯Xk. Based on these (dis-)similarity constraints between/within groups, we learn a local representation ) for subset ¯Xk. Finally, we exploit sparse couplings between the local representations to arrive at a consolidated global representation. This iterative procedure improves the overall representation by successively adding reliable constraints into the learning process.

that the samples in each subproblem are either reliably similar or dissimilar. Optimization is then performed by learning a mapping from the images into a dedicated target space, built to reflect the structure and distribution estimated from the reliable relations for each subset. Next, coupling the local subproblems by utilizing transitivity between their samples allows us to consolidate the learned individual representations into a concerted global representation. Finally, by alternating between extracting reliable relations and learning, we successively incorporate more reliable relations and in turn more data which ultimately improves our image representation (cf. Fig. 1).

We evaluate our model on challenging benchmarks and achiev state-of-the-art performance on the ImageNet dataset, thus proving the scalability of our approach. Further, our approach performs comparably to the state-of-the-art in transfer learning on PASCAL VOC indicating its general applicability. By performing ablation and analysis studies, we finally provide insights into our learning procedure.

2. Related Work

Leveraging explicit relationships between training images for representation learning is widely studied using different orders of constraints (e.g. binary similarities [29], ranking constraints [24]). All these methods use label information or strong pre-training as the performance of such task heavily depends on the quality of the pairwise constraints. However, densely labeling all pairwise constraints is infeasible.

Due to the difficulty of extracting reliable relations from data, many recent label-free approaches resort to generic prior assumptions on the data distribution or single image- and classbased tasks. (Deep) clustering methods [2, 26] rely on a pre-defined number of pseudo-classes which typically is estimated by heuristics and further focus on similarity constraints only. Our model explicitly models both similarity and dissimilarity constraints estimated from data itself. Bojanowski et al.[1] find a mapping between images and a uniformly discretized target space, thus enforcing their representation to resemble a distribution of pairwise relationships independent of the actual data structure. Sanakoyeu et al. [26] cluster data into small surrogate classes to perform a global classification task, however, not considering that similar images may end up in competing, different classes. Thus, inferred relationships during training suffer from contradicting training signals. Also DeepCluster [2] follows this strategy based on disjoint k-means clustering, thus enforcing clear distinct boundaries which potentially disagree with the real data distribution. In contrast, our grouping process does not enforce hard class boundaries and is able to adapt to the data structure. Moreover by splitting groups into reliable subproblems and constructing a learning problem following their distance distribution, our groups corroborate their training signal. Dosovitski et al. [7] cast distance learning as an exemplar classification task utilizing heavy data augmentations at the cost of poor scalability to large data collections.

Self–supervised learning approaches aim to leverage data itself by typically solving surrogate tasks based on temporal [31] and spatial [21] coherence. These approaches are either domain specific or operate on images independently thus missing out on their relationships. Our work, in contrast, explicitly models relationships between images. Gidaris et al. [9] exploit image geometry and classify rotations applied to input images. Even though they report good results on image classification tasks on large datasets, this task is conceptually dependent on large variations in the underlying data distribution to avoid trivial image representation, potentially missing out fine-grained relationships between images.

Figure 2: Nearest neighbour distance ratios. For most of the N2 pairwise relations, the ratio d(xi(for sorted neighbors xj) is close to 1. Only O(N) have a robust ordering. This analysis is based on N = 5000 samples from the STL-10[4] dataset using euclidean distances based on an unsupervised representation.

Generative models based on GAN[11]- and VAE[14]-like architectures recently became a popular choice for unsupervised learning. These approaches learn mappings between images and a latent space driven by a generative task and thus implicitly learn an image representation. Unfortunately such approaches typically suffer from limited applicability for large datasets with high variance due to the difficult optimization problem. Further their training is regularized using data-independent priors, e.g. by enforcing the learned feature space to follow a gaussian distribution. Our approach on the other side explicitly learns an image representation using data dependent constraints and scales to large datasets.

3. Approach

Let us now learn a representation : X that allows to relate image samples xi, x j X to another. This is equivalent to learning a distance d(xi)(xj)d(xi, x j), i.e. learning a representation such that given image relations d(xi, xj) are reflected and preserved in the embedding space . Thus, learning is propelled by pairwise relationships between images indicated by d(xi, xj): In supervised training d(xi, xj) is typically defined on the basis of manually provided class labels l(xi), weak user feedback, problem specific surrogate tasks, dense triplet ranking constraints such as d(xi, xj) < d(xi, xk), or other sparse partial ordering constraints.

3.1. Reliable Relations for Learning a Representation:

Regardless of the origin of d(xi, xj), only a small number of all possible N2 pairwise relations (for N training samples) may be feasibly provided by manual annotation. For the particular case of unsupervised learning only a small number of the pairwise relations can be inferred correctly for training with high confidence. Let’s consider a triplet of images xi, xj, xk with the ground-truth distance between xi and xj being small and between xi and xk being large. A learned distance is correct, if it obeys these ground-truth constraints. Furthermore to ensure

Figure 3: Sorted pairwise similarities for a query image based on different representations and distance metrics resulting from supervised and unsupervised training on STL-10[4]. Only the strong (dis-)similarities at both ends are reliable to provide a robust ordering.

robustness to noise these constraints should be obeyed reliably by a clear margin d(xi, xj)/d(xi, xk) !1. In Fig. 2 we plot this ratio for consecutive nearest neighbors. This is the same ratio used in [18] to measure the reliability of a matching similar image. We observe that for only O(N) pairwise distances the ratio is significantly smaller than 0.95. All other relations would not even have an opportunity to exhibit above correctness constraints reliably and, thus, would be falsely identified to be correct rather than spurious. Further, in Fig. 3 we plot the sorted similarities for a query image. Observe from Fig. 2 that above triplet constraints can only be fulfilled where there is significant slope in Fig. 3. Thus, only relations from both ends, where we have strong (dis-)similarities, can be considered to be reliable. These relations are significantly less susceptible to change under noise than the vast majority, as analyzed in Fig. 4. However, recent work on unsupervised learning has nevertheless simply relied on all pairwise relations [1] inferred during training at the cost of incorporating corrupted relations. In contrast, we now present an approach for unsupervised representation learning which explicitly aims at extracting and leveraging these reliable relations.

3.2. Outline of our Iterative Representation Learning:

We first decompose the training set into subsets ¯Xk 1, . . . , K of images by extracting (Sec. 3.3) and dividing (Sec. 3.4) potentially overlapping compact groups of images, exhibiting reliable mutual similarity, based on the representation from the previous training iteration. Within each ¯Xk all mutual image relations are reliable. In each iteration, learning a representation (Sec. 3.5) then proceeds as follows: For each subset ¯Xk we seek an embedding : xi ¯Xk (xi) . To learn , we randomly sample target points such that the distribution of their pairwise distances matches those of the xi ¯Xk. Learning the mapping from images xi to targets then yields the local representations (Sec. 3.5.2). In a final step, all these local representations are merged into a single overall representation by exploiting transitivity relations between those samples xi

Figure 4: Sorted pairwise image similarities for an STL-10 query image. Color indicates amount of noise (gaussian, 2) needed for each image to change its rank w.r.t. query.

which are shared among subsets ¯Xk (Sec. 3.5.3). Based on representation , reliable relations are then again extracted to serve as input for the next iteration. Since no annotations are provided, the first iteration of our training starts from scratch with a randomly initialized CNN and random assignments.

3.3. Compact Groups for Finding Reliable Similarities

Every training iteration builds upon the distances learned in the previous round. Since the majority of inferred relations between our training samples is not reliable, how can we find the ones we can rely upon without annotations? In Fig. 2 to Fig. 4 we empirically demonstrate that only a few nearest neighbours (NN) can robustly be identified. Unfortunately, even these do not always give rise to correct relations. For instance, two samples may spuriously have a small distance or a sample may be an outlier, thus corrupting the nearest neighbors for a given sample xi. This issue can however be alleviated by forming compact groups of images. Fig. 5 (a) shows that considering dense groups of h samples increases the chance of inferring correct similarity relations, following the intuition that in dense areas of our feature space pairwise relations do not arise accidentally but due to actual commonalities. For different group sizes h on ImageNet, the plot illustrates the average percentage of correct image relations in a group based on their ground-truth labels (blue). As we observe, the chance that h samples appear erroneously close to another (forming a compact group) becomes increasingly unlikely as h increases. On the other hand, large compact groups are scarce and therefore only a small number of samples can be covered for large h (red).

Let the largest pairwise distance, , between members of a group G represent its compactness. Further, when building groups of random samples, it is highly unlikely to form a group with low compactness, i.e. correctly mutually close samples. Thus, we demand for each G to be smaller than the p-th percentile of the compactness of a set of randomly built groups of equal size. Consequently, we extract reliable groups G by starting with a seed xi and add its nearest neighbors as long as the resulting compactness is below the p-th percentile compactness of the randomly sampled groups of equal size. Thus, we grow G to be as large as possible. We denote G as the resulting set of all possible (potentially overlapping) groups G with reliable pairwise similarity relationships.

Figure 5: a) Group correctness vs. data coverage w.r.t group size. blue: average correctness of relations in a group. red: data coverage for groups G of different size extracted on ImageNet using our representation. We call a rela- tion correct if it links images with the same label. Note that we only use labels for evaluation in this plot, but not for extracting our groups. Coverage: fraction of all samples that can be covered by compact groups of a given size relative to the overall number of extracted groups G. b) Data- vs. Target-Distribution. Distribution of all pairwise intra-group and inter-group distances (blue) based on a fully supervised trained representation, of points uniformly sampled from an unit sphere (orange), and of a Gaussian distribution (yellow). Evidently, most data points are far apart, approximated by the orange mode. But there are also characteristic compact hubs approximated by the yellow mode (which has been magnified for the purpose of this illustration). Note that the sampled distributions can approximate the data distribution.

3.4. Reliable Dissimilarity by Partitioning Groups

Finding Reliable Dissimilarities:. The compact groups in G provide small distances that we can reliably use for learning a representation. However, the relationships between the groups can be arbitrary and many are unlikely to be reliable as discussed above and in Fig. 2. Therefore, to further increase reliability, instead of using all relations in X for learning our representation, we partition them into subsets of groups ¯Xk := {Gk1,Gk2, . . .}. By distributing overlapping groups across different ¯Xk while maximizing the distance between groups within a subset, we gathers the reliable dissimilarity relationships from the tail of the distance distributions shown in Fig. 3. Thus, all relations in each subset are as reliable as possible.

Optimization Problem:. Formally, we partition G using the following criteria: (i) all groups G ¯Xk should be mutually distant, (ii) partially overlapping groups should be distributed across different ¯Xk to establish couplings between the subproblems exploitable for transitivity relations, and (iii) the union of all subsets, ¯Xk, should cover X as much as possible to maximize the usage of training samples. Using these constraints, we formulate the partitioning process as the following optimization problem.

Let C be the assignment matrix of samples xi to groups G. Furthermore, the column vector ak indicates groups assigned to subset ¯Xk. Then A = (a1, a2, . . . ) are the assignments of groups to all ¯Xk. Moreover, S contains the mean pairwise distances between any two groups, S kl d(xi, x j).

The objective then becomes

Figure 6: Learning feature vectors by sampling a target space. Large distances between groups Gl are captured by sampling their centroids uniformly from the surface of a hypersphere. Then target feature points sampled from a Gaussian around to represent the compact group.

where tr(ASA) is regularized for the diagonal elements of S and enforces constraint (i), i.e. maximal distance between groups in a ¯Xk, Cmaximizes (ii), i.e. the distribution of groups across the subsets ¯Xk, and enforces (iii), i.e. maximal coverage of are weighting terms for adjusting relative impact of individual constraints. This optimization problem can be efficiently solved following CliqueCNN [26]. This work has addressed a similar problem to find a discrete partitioning of images into surrogate classes for optimizing a standard classification task equal to DeepCluster [2]. However, both [26] and [2] are using all resulting surrogate classes without considering their reliability. Thus, they are inevitably prone to introducing noise into the learning process. In contrast, we seek a partitioning of the set of already extracted groups G into subsets to further increase the reliability of our relations which we then use to construct a dedicated learning problem.

3.5. Unsupervised Representation Learning

Now we learn a representation (xi) that preserves all the reliable relations in ¯Xk. To this end, we build a target space by randomly sampling target points , such that the distribution of their pairwise distances matches the distances in ¯Xk. Finding a representation for the data points xi then corresponds to solving an assignment problem t (i) to find their targets t while simultaneously learning the corresponding mappings.

3.5.1. Constructing the target spaces

We sample the targets from the surface of a high- dimensional sphere with local Gaussian hubs accounting for the compact groups G. More precisely, we first sample centroids 1 : | ¯Xk| uniformly from the surface of the D-dimensional unit hypersphere (where D is the dimensionality of ). Then we sample from the Gaussians ) with fixed covari- ance matrix as illustrated in Fig. 6. Note that this sampling process is derived from and approximates the actual data distribution as illustrated in Fig. 5 (b). Using the representation of a fully supervised model as a proxy for ground-truth image relations (for this experiment only), we observe that the distribution of pairwise intra-group and inter-group distances (blue distribution) exhibits two distinctive modes as intuitively expected: A large mode representing mediocre to large inter-group distances and a minor second mode of small intra-group distances reflect-ing dense neighbourhoods of mutually similar samples. Sampling based on Gaussians ) uniformly distributed on a hypersphere explicitly approximates this distribution (orange and yellow distributions). In contrast, other approaches such as [1] rely on a data independent prior which does not sufficiently account for dense neighbourhoods of highly similar datapoints and consequently diminishes the expressiveness of the representation to be learned.

3.5.2. Learning the Representations φk

Learning is now formulated as establishing correspondences t (i) between the xi from a group G ¯Xk and the targets . This requires to minimize the distances between (xi) and . We obtain and by jointly minimizing the optimization problem

We solve this problem by alternating between two steps: (i) Find optimal assignments based on the current representation . Since global solvers for such an assignment problem typically exhibit the prohibitive cost of O(N3) in the number of input samples, we adopt the efficient algorithm of [1] which uses stochastic local updates to approximate the Hungarian method. (ii) Given an assignment, we optimize the representation by learning a CNN with weights to minimize the distances d(xi)). By alternating between these two steps (we re-assign between xi and every 3 epochs), the model needs to reason about which targets apply for which images and which images should be assigned next to each other. Thus it needs to find an optimal arrangement of groups G on the target space preserving their relations and further needs to infer meaningful relations between groups G. At initialization, we start with a randomly initialized CNN and, thus, random initial weights and random assignments (i).

Using this learning process, we obtain a representation for each of our subsets ¯Xk, each having its own distances dk(xi, xj) (xi)(xj)), with dbeing the L2 metric in our implementation.

3.5.3. Coupling Subproblems to Consolidate their Representa- tions

We now have an ensemble of representations , each representing a different subset of the data. The goal is now to consolidate their underlying learned relationships into a global representation reflecting all of the data. Therefore, we look for reliable relations between different subsets to establish local links that allows to locally transfer relations from one subset into the other. Groups G which are (partially) shared among subsets due to their overlap and distribution to different subsets by the partitioning act as anchors for such links as illustrated in Fig. 7. Using transitivity we thus find triplets (xi, xj, xk) across

Figure 7: Triplet constraints for consolidating subproblems: A reliable rela- tions between two subsets ¯Xm and ¯Xncouples them locally: solid black arrow between xi and x j, which appear in the same group G (green group). Further, we find xk which are reliably dissimilar to xj (dashed black lines). We now have a triplet (xi) relating the xk, which was previously unknown to subset ¯Xm.

subsets which allows to transfer information about previously unknown relations from one local representation to another.

Let xi be a member of a subset ¯Xm, but not of ¯Xn, i.e. ¯Xm xi ¯Xn. Similarly let ¯Xn x j ¯Xm and ¯Xn xk ¯Xm. Assume G : xi, xj G, thus providing a reliable similarity between xi, xj and consequently between both subsets. Similarly, we assume x j, xk to have a reliable dissimilarity. Using transitivity the triplet (xi, xj, xk) thus implies an ordering constraint under

the representation , i.e., we want d(xi, x j) !(xi, xk). Note that these are additional relations imputed from ¯Xn, which were previously not present in ¯Xm. Let be the set of all such triplets deduced by transitivity between ¯Xm and ¯Xn. We now incorporate this additional information by refining solving the triplet ranking problem [33]

The parameter controls the margin between xj and xk with respect to xi. Note that the relations between ¯Xm and the other subsets are only sparse. Thus the potentially large computational complexity of the triplet ranking loss does not dominate the complexity of the overall approach.

However, optimizing this objective alone would ignore and potentially forget about the dense relationships in ¯Xm exploited by . Thus we combine both objectives,

Optimizing (4) retains the constraints from ¯Xm while incorporating couplings to other subsets ¯Xnto improve . Due to the additional inter-set relations, ectively now covers more of X than before the refinement.

In the last training iteration, one representation becomes the final global representation. We conducted experiments using different aggregation strategies (averaging, random selection, etc.) and observed that after the final iteration all capture the dataset X nearly equally well, allowing to randomly select one.

Initialization:. As our overall iterative approach starts from random network initialization, initially we have no representa-

Algorithm 1: Unsupervised Representation Learning by Extracting Reliable Image Relations.

Input : X: Set of training images Parameter : T: Number of training iterations; K: Number of

tion provided to extract reliable (dis)similarities for the grouping process in Sect. 3.3. Therefore, we train the first iteration by optimizing only the problem based on the whole training set and randomly sample individual target points for each image, yielding our first representation . The iterative approach then gradually learns stronger representations ) from iteration to iteration by capturing more and more reliable relationships in our dataset.

3.5.4. Pseudo-Code for summary

For additional clarity we now present a pseudo-code overview for our iterative approach (cf. Algorithm 1).

Table 1: Comparison of our method to other state of the art unsupervised learning approaches on the ImageNet dataset. We report classification accuracy (Acc@1).

4. Experiments

Following we evaluate the performance of our model on large scale datasets and the usability of our learned representation for the tasks of classification, detection and segmentation. Further, we present ablation and analysis experiments providing insights into our iterative learning process.

4.1. Implementation Details and Benchmarking

We now evaluate our learned image representation on the ImageNet and PASCAL VOC dataset. We test on different vision tasks, such as image classification, objection detection, and semantic segmentation and compare our approach against state-of-the-art unsupervised methods. As preprocessing we convert our images to gradient images (obtained using a sobel filter) to avoid trivial solutions based on color, thus following the protocol of recent approaches [1, 2]. These also conducted experiments on supervised ImageNet classification and concluded that gradient images yield similar performance in comparison to RGB inputs, thus indicating a fair comparison. If not stated otherwise, for all experiments the number of subsets ¯Xk used for training is fixed to K = 5. In the grouping stage we set p to be the 3% percentile. We set the dimensionality of to D = 2048 and choose the parameters using crossvalidation on the training set. The margin parameter is set to 0.2 as suggested in various works [33]. We train our model using stochastic gradient descent using an initial learning rate of 0.01 and momentum of 0.9. Building groups can be effi-ciently performed using the FAISS [13] library for fast nearestneighbor retrieval with GPU support, thus leading to no significant computational overhead.

ImageNet:. We evaluate the ability of our model to capture differences between objects and its ability to scale to large image collections on the ImageNet dataset [5]. This dataset is

Table 2: Comparing our model to state-of-the-art unsupervised approaches on PASCAL VOC 2007 classification, detection, and segmentation (measured in mean average precision and mean intersection over union.

composed of 1.2M images distributed over 1,000 categories including subtle category boundaries (such as dog breeds). For fair comparison with other methods we use the AlexNet [16] architecture with batch normalization. Our evaluation follows the standard protocol for unsupervised ImageNet pretraining: First, performing our unsupervised training on a randomly initialized network without using any labels. Afterwards, the convolutional layers are fixed while the last layers are randomly reinitialized and trained using ImageNet labels. Table 1 (left) compares our results with other state-of-the-art unsupervised approaches. Our method converges to its final performance of 46.0% after 4 training iterations (excluding initialization). Hence, we are significantly improving upon all other unsupervised approaches including DeepCluster [2] by 2%, which is trained on all training data at the cost of also incorporating unreliable noisy information. In contrast, our model successfully leverages more and more reliable image relations over the iterations, thus alleviating the issue of noise. Additionally we report mean and standard deviation of our approach (Ours (meanstd)) over 5 runs. Note that all of the other methods are only reporting their best run.

PASCAL VOC:. We now illustrate the generalization capability of our learned representation on different transfer learning tasks. We utilize our representation trained on ImageNet without labels (using the same architecture as above) and fine tune it on the PASCAL VOC 2007 [8] classification, detection, and segmentation tasks (VOC 2012). For transfer learning we use the framework of Kr¨ahenb¨uhl et al. [15] for classification experiments, the Fast R–CNN [10] framework for object detection and the method of Long et al. [27] for semantic segmentation. Our results are summarized in Table 1 (right). On all transfer learning tasks, i.e. classification, detection and segmentation, our approach achieves comparable results to the unsupervised state-of-the-art which further demonstrates the expressive power of our representation. Additionally we report mean and

Table 3: Comparison to other approaches based on average classification accu- racy on STL-10, using the unlabeled split for training.

standard deviation of our approach (Ours (meanstd)) over 5 runs. Note that all of the other methods are only reporting their best run.

4.2. Analysis and Ablation

We now show ablation experiments and analyses to evaluate the individual parts of our approach and to provide insights into the iterative learning procedure.

4.2.1. Ablation studies

Ablation studies are conducted on the STL-10 dataset and summarized in Tab. 3 (Left) and (Right).

STL-10 Performance:. We contrast our approach to state-of-the-art methods on STL. For a fair comparison, we use the same network architecture as [7] and train our model on the unlabeled split of the dataset. For this experiment we set D = 128. Tab. 3 (Left) shows that our proposed approach achieves competitive performance to unsupervised state-of-the-art approaches, ExemplarCNN [7], Discriminative Attributes [12] and Chang et al. [3]. However, in contrast to ExemplarCNN [7] and Discriminative Attributes [12] whose learning procedures rely on dense, costly (instance-level-)exemplar classification tasks, our approach leverages only a sparse set of reliable constraints and thus scales to large datasets as shown on ImageNet. Chang et al. [3] leverages individual images in combination with strong augmentations, thus neglecting valuable information from direct relations between training samples. Further, they are operating on a more powerful network architecture leading to unfair comparison in their favor. Note that our approach outperforms CliqueCNN [26] by 6% which also learns from relations inferred by a grouping of images, however, without considering their reliability. This strongly indicates that the concept of reliability actually helps to reduce the amount of noise introduced into the learning process.

Divide-and-Conquer:. The performance of our representation increases with each iteration and improves over our initial representation by 7.8%, cf. Tab. 3 (Right). Now, to evaluate the effect of incorporating reliable dissimilarities, we train our model using only a single target space (No decomposition) for all groups for one iteration after initialization. As a result there

Table 4: Average classification accuracy on STL-10 for ablations of our approach.

are only reliable relations within the G , but lots of unreliable relationships between them. Due to the former, performance slightly increases over the initialization by 1.6%. The latter explains the 2.1% lower performance compared to our divide-and-conquer strategy. This highlights the importance of modelling reliable dissimilarities when learning visual representations.

Full model vs. triplet learning:. In this ablation experiment (Tab.3 (Right) Triplets Only) we use our extracted reliable (dis-)similarity relationships to mine triplet-constraints as input into a standard triplet-loss framework [33]. For a sample that serves as triplet anchor, reliable similarity relations act as positive constraints and reliable dissimilarity relations act as negative constraints. The aim of this experiment is to contrast our learning objective, Eq. (4), against popular ranking loss approaches. The massive drop of 8.6% in performance can be explained by the dependence of such frameworks on hard-negative mining strategies. Since reliable constraints are only based on high similarities and dissimilarities, the ranking framework obviously has no access to hard constraints, which are very difficult to find reliably without supervision [33] or strong pretraining [24]. Note that we are using triplet constraints only for transferring already learned information to refine our representations (Sec. 3.5.3) rather than using them as the driving overall learning signal.

Number of subsets:. To evaluate the sensitivity of our approach with respect to the number of subsets ¯Xk used for training, we train multiple models using different values of K. Tab. 5 illustrates, that training performance saturates for K > 10 subsets. This indicates that further partitioning of the data and thus further maximizing the dissimilarity between groups in ¯Xk has no effect and has been achieved to a sufficient degree.

4.2.2. Analysis of the model:

We now further analyze the effect of iterative training based on the model used for the ImageNet benchmark. The following experiments highlight the ability of our model to turn reliable (dis-)similarities into increasingly correct sample relations while simultaneously harnessing more and more data.

Table 5: Average classification accuracy on STL-10 dataset as number of subsets K is varied. Results are for one iteration of training after initialization.

Correctness of image relations in groups G:. In Sec. 3.3 and Sec. 3.4 we gather reliable relations driving the learning of our representation. Fig. 5 (a) illustrates that our procedure of extracting groups G for reliable relations increases the probability of finding correct relationships. Moreover, successive iterations of training improve the learned image relations. Fig. 8 confirms this by measuring how many relations within a group disagree. One can see that in consecutive iterations, our groups G exhibit more and more correct relationships. This proves that our model is able to extend the reliable relations learned from a group G to relations which so far could not be identified as reliable. Consequently the performance of our model increases.

Data coverage:. In our proposed method we deliberately explore an orthogonal approach to simply using all training samples by reducing the noise introduced into the training process using only reliable relations for learning. Consequently there is a trade-off between exploring more data and introducing more noise due to unreliable relations. Fig. 9 shows the amount of training data covered in each training iteration. Overall and per subset, our grouping process covers more and more data samples, since representations improve and more relations become reliable. Between iteration 1 and iteration 4 the amount of data available for training is almost doubled to 60%. Moreover, Fig. 8 proves that the additional data is meaningful and not just noise, since overall data quality is improving simultaneously. This demonstrates that our model not only reinforces already available reliable relations but is able to generalize its representation to previously unused data, i.e., discovering new reliable relationships. Further, the result that we achieve state-of-the-art performance using 60% of the available data proves (i) there exists an alternative way to using all but therefore noisy relations between training samples, (ii) we are able to successfully find and exploit reliable samples and (iii) the idea of considering data reliability for unsupervised representation learning is not fully solved, thus opening new promising directions for future research.

Fixing the seed of groups G. In Fig. 10 and Fig. 11 we present examples for groups of size 4 and 8 while training on the ImageNet dataset. To allow for a detailed comparison, we fix the seed elements xi and show how the constituents of a group change over the iterations. We observe that over the iterations the constituents of each group share more and more meaningful visual features. At the beginning the representation focuses on coarse visual commonalities like rough shape and scene. At the

Figure 8: Average correctness for groups G of different size and in different training iterations. Average correctness is the fraction of members with the same ground-truth ImageNet class label (not used for training).

end relationships between constituents are dominated by true (intra-)class-specific features and similar pose.

Typical sources of incorrectness. To explain common sources of incorrect relationships within groups, i.e., group constituents having different labels, we show typical examples in Fig. 12. As one can see, disagreements often arise due to subtle differences between the constituents’ classes (such as different dog breeds, hedgehogs vs. sea urchins, etc.) and misleading scene settings (such as a buildup of flags imitating a ship’s shape, a dog’s head above the surface while swimming looking like a duck, etc.).

Analysis of computational cost. As deriving exact computational complexities is often difficult, it is common practice for deep learning based methods to compare their computational complexities based on their training times. For our model learning one representation (including the refinement step) for a subset ¯Xk takes approx. 16h on a Titan X (Pascal), resulting in 14 days of training in total for the Imagenet dataset. Thus, our overall training time is comparable to the current state-of-the-art approach DeepCluster [2] (12 days). Further, Tab. 6 compares the computational cost of our method with recent unsupervised representation learning approaches in relation to their performance. As we observe, peak performances on Imagenet are computationally costly. Further, we observe that RotNet [9] offers the best trade-off between training efficiency and performance.

5. Conclusion and Future Work:

In our manuscript we have presented a novel iterative approach for unsupervised learning of visual feature representations. Experimental evaluation shows that our method yields a representation of competitive or even superior performance on the tasks of unsupervised classification and transfer learning compared to the state-of-the-art. We propose a novel technique for finding reliable relations (i.e. dis-/similarities) between training images which are likely to agree with the ground-truth. This reduces the amount of noise due to erroneous relations introduced into training, which is typically an issue when all possible training data relations are directly used. As using all

Figure 9: Fraction of training data covered by subsets ¯Xk in successive training iterations. Solid: overall data coverage. Dashed: mean data coverage per subset.

relations is typically common practice in the vast majority of unsupervised learning literature, our work offers a new direction of future research directions: Instead of solely looking for more powerful surrogate tasks to compensate for missing supervision, we also address the question which training samples and relations can be trusted and which are likely to obstruct learning. This question naturally leads to a trade-off between covering all available training data and exploiting only reliable relations for learning. Hence, future work should focus on further improving the estimation of reliability to increase data coverage while maintaining a high reliability for efficient learning. Further, as the concept of reliable relations is of general value and can be decoupled from the learning itself, we investigate its applicability in unsupervised learning in general.

We presented a technique for identifying reliable relations resulting in a set of local sub-problems. For each subproblem the data samples are either reliably similar or reliably dissimilar. Further, the learning process for each subproblem is formulated to preserve their reliable relations and approximate the actual distribution of distances of the training data. This stands in contrast to previous approaches whose feature space relies on data-independent prior assumptions which potentially disagree with the training data at hand. We then optimize each problem individually before using transitivity relations between them to efficiently merge their learned local representations into a single concerted representation. To increase the efficiency of our learning procedure, another line of future work addresses the trade-off between peak performance and computational cost. To this end, we investigate possible approximations of the introduced optimization and learning problem and analyze their implications on the learned representation.

Funding

This work has been supported in part by DFG grant OM81/1-1 and a hardware donation from NVIDIA Corporation.

Author Biographies

Timo Milbich received his masters degree in Scientific Computing from Ruprecht-Karls-University, Heidelberg, in 2014.

Table 6: Training time vs. performance on the ImageNet dataset. Comparison of training times is based on the reported timings in the manuscripts of each method using a single NVIDIA Titan X (Pascal). Additionally, their performance for Acc@1 on ImageNet is reported.

He is currently a Ph.D. Candidate in Heidelberg Collaboratory for Image Processing at Heidelberg University. His current research interests include computer vision focusing on deep representation and metric learning.

Omair Ghori received his masters in Information and Communication Engineering from TU Darmstadt. He is currently a senior data scientist at AGT International and is also pursuing his PhD at the Heidelberg Collaboratory for Image Processing at Heidelberg University. His research interests include Action Recognition and deep representation learning.

Ferran Diego received his Masters Degree and PhD in Computer Science from Autonomous University of Barcelona. He was a postdoctoral researcher in the IAL group at Heidelberg University, Germany. He is currently a Research Scientist at Telefonica R&D. His research interests include computer vision, machine learning, deep learning and neuroscience.

Bj¨orn Ommer received a diploma in computer science from the University of Bonn, Germany and a Ph.D. from ETH Zurich. After holding a postdoctoral position at the University of California at Berkeley he has joint the Department of Mathematics and Computer Science at Heidelberg University as a professor where he is heading the computer vision group. His research interests include computer vision, machine learning, and cognitive science.

References

[1] Bojanowski, P., Joulin, A., 2017. Unsupervised learning by predicting noise, in: Proceedings of the 34th International Conference on Machine Learning.

[2] Caron, M., Bojanowski, P., Joulin, A., Douze, M., 2018. Deep clustering for unsupervised learning of visual features, in: European Conference on Computer Vision.

[3] Chang, J., Wang, L., Meng, G., Xiang, S., Pan, C., 2018. Deep unsuper- vised learning with consistent inference of latent representations. Pattern Recognition (PR) 77, 438 – 453.

[4] Coates, A., Ng, A., Lee, H., 2011. An analysis of single-layer networks in unsupervised feature learning, in: Proceedings of the fourteenth international conference on artificial intelligence and statistics.

[5] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L., 2009. Ima- geNet: A Large-Scale Hierarchical Image Database, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6] Donahue, J., Kr¨ahenb¨uhl, P., Darrell, T., 2017. Adversarial feature learn- ing, in: Proceedings of the International Conference on Learning Representations (ICLR).

Figure 10: Groups of size 4 for fixed seed image (pink) over the iterations while training the ImageNet model. Each column represents a group G.

Figure 11: Groups of size 8 for fixed seed images (pink) over the iterations while training the ImageNet model. Each column represents a group G.

Figure 12: Examples of groups with incorrect constituent relationships (i.e. different labels) taken from the last iteration while training on ImageNet. Orange highlights the source of disagreement. Each column represents a group G.

[7] Dosovitskiy, A., Fischer, P., Springenberg, J.T., Riedmiller, M., Brox, T., 2016. Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 38, 1734–1747.

[8] Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A., 2007. The PASCAL Visual Object Classes Challenge 2007 (VOC2007).

[9] Gidaris, S., Singh, P., Komodakis, N., 2018. Unsupervised representation learning by predicting image rotations, in: International Conference on

Learning Representations.

[10] Girshick, R., 2015. Fast r-cnn, in: International Conference on Computer Vision (ICCV).

[11] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y., 2014. Generative adversarial nets, in: Advances in Neural Information Processing Systems (NIPS).

[12] Huang, C., Loy, C.C., Tang, X., 2016. Unsupervised learning of discrim- inative attributes and visual representations. 2016 IEEE Conference on

Computer Vision and Pattern Recognition (CVPR) .

[13] Johnson, J., Douze, M., J´egou, H., 2017. Billion-scale similarity search with gpus.

[14] Kingma, D.P., Welling, M., 2013. Auto-encoding variational bayes, in: Proceedings of the International Conference on Learning Representations (ICLR).

[15] Kr¨ahenb¨uhl, P., Doersch, C., Donahue, J., Darrell, T., 2016. Datadependent initializations of convolutional neural networks. Proceedings of the International Conference on Learning Representations (ICLR) .

[16] Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. Imagenet classification with deep convolutional neural networks, in: Proceedings of the International Conference on Neural Information Processing Systems.

[17] Larsson, G., Maire, M., Shakhnarovich, G., 2017. Colorization as a proxy task for visual understanding, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18] Lowe, D.G., 2004. Distinctive image features from scale-invariant key- points. Int. J. Comput. Vision 60.

[19] Ma, Q., Bai, C., Zhang, J., Liu, Z., Chen, S., 2019. Supervised learning based discrete hashing for image retrieval. Pattern Recognition (PR) 92, 156 – 164.

[20] Milbich, T., Bautista, M., Sutter, E., Ommer, B., 2017. Unsupervised video understanding by reconciliation of posture similarities, in: The IEEE International Conference on Computer Vision (ICCV).

[21] Noroozi, M., Favaro, P., 2016. Unsupervised learning of visual represen- tions by solving jigsaw puzzles, in: The IEEE European Conference on Computer Vision (ECCV).

[22] Noroozi, M., Pirsiavash, H., Favaro, P., 2017. Representation learning by learning to count, in: The IEEE International Conference on Computer Vision (ICCV).

[23] Randrianasoa, J.F., Kurtz, C., ric Desjardin, Passat, N., 2018. Binary partition tree construction from multiple features for image segmentation. Pattern Recognition (PR) 84, 237 – 250.

[24] Ren, C.X., Li, J.Z., Ge, P., Xu, X.L., 2019. Deep metric learning via subtype fuzzy clustering. Pattern Recognition (PR) 90, 210 – 219.

[25] Rubio, J.C., Eigenstetter, A., Ommer, B., 2015. Generative regularization with latent topics for discriminative object recognition. Pattern Recognition 48.

[26] Sanakoyeu, A., Bautista, M.A., Ommer, B., 2018. Deep unsupervised learning of visual similarities. Pattern Recognition (PR) 78, 331 – 343.

[27] Shelhamer, E., Long, J., Darrell, T., 2017. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence .

[28] Shuo Yang, Ping Luo, C.C.L.K.W.S., Tang, X., 2015. Deep representa- tion learning with target coding, in: Proceedings of AAAI Conference on Artificial Intelligence (AAAI).

[29] Sohn, K., 2016. Improved deep metric learning with multi-class n-pair loss objective, in: Advances in Neural Information Processing Systems 29.

[30] Wang, D., Tan, X., 2016. Unsupervised feature learning with c-svddnet. Pattern Recognition (PR) 60, 473 – 485.

[31] Wang, X., Gupta, A., 2015. Unsupervised learning of visual representa- tions using videos, in: The IEEE International Conference on Computer Vision (ICCV).

[32] Wang, X., He, K., Gupta, A., 2017. Transitive invariance for self-supervised visual representation learning, in: The IEEE International Conference on Computer Vision (ICCV).

[33] Wu, C.Y., Manmatha, R., Smola, A.J., Kr¨ahenb¨uhl, P., 2017. Sampling matters in deep embedding learning, in: ICCV.

[34] Xiong, W., Zhang, L., Du, B., Tao, D., 2017. Combining local and global: Rich and robust feature pooling for visual recognition. Pattern Recognition (PR) 62, 225 – 235.

[35] Zhang, J., Mei, K., Zheng, Y., Fan, J., 2019a. Learning multi-layer coarse- to-fine representations for large-scale image classification. Pattern Recognition (PR) 91, 175 – 189.

[36] Zhang, P., Liu, W., Lei, Y., Lua, H., 2019b. Hyperfusion-net: Hyper- densely reflective feature fusion for salient object detection. Pattern Recognition (PR) 93, 521 – 533.

[37] Zhang, R., Isola, P., Efros, A.A., 2016. Colorful image colorization, in: The IEEE European Conference on Computer Vision (ECCV).

[38] Zhang, R., Isola, P., Efros, A.A., 2017. Split-brain autoencoders: Unsu- pervised learning by cross-channel prediction, in: The IEEE Conference

on Computer Vision and Pattern Recognition (CVPR).

[39] Zhang, Y., Bai, Y., Ding, M., Li, Y., Ghanem, B., 2018. Weaklysupervised object detection via mining pseudo ground truth boundingboxes. Pattern Recognition (PR) 84, 68 – 81.

designed for accessibility and to further open science