Automatic discovery of discriminative parts as a quadratic assignment problem

2016·arXiv

Abstract

1. Introduction

The representation of images as set of patches has a long history in computer vision, especially for object recognition [1], image classification [2] or object detection [3]. Its biggest advantages are the robustness to spatial transformations (rotation, scale changes, etc.) and the ability to focus on the important information of the image while discarding clutter and background.

Part-based recognition raises the questions of i) how to automatically identify what are the parts to be included in the model and ii) how to use them to take a decision e.g. to assign a category to an image. As an illustration, [4] proposed to select informative patches using an entropy based criterion while the decision relies on a naive-Bayes classifier. Following [4], recent approaches separate the construction of the model (i.e. the learning of the parts) and the decision function [5] [6]. The reason behind this choice is that the number of candidate regions in the training images is very large and would lead to a highly non-convex decision function.

Optimizing both (parts and decision) is however possible for simple enough part detectors and decision functions. For instance, [7] unifies the two stages by jointly learning the image classifiers and a set of shared parts. Their claim is that the definition of the parts is directly related to the final classification function.

While this argument is true, the objective function of this joint optimization is highly non-convex with no guaranty of convergence. We believe that deciding which one of the two alternatives – the joint optimization vs separate one – is still an open problem. As an insight, the two stage part-based model of [8] performs better than the joint learning of [7]. We note that there are other differences between the two approaches, e.g. [7] models both positive and negative parts while [8] focuses only on the positive ones.

Interestingly, [8] addresses the learning of parts as an assignment problem. On one hand, regions are sampled randomly from the training images. On the other hand, the model is considered as a set of parts. The assignment is constrained by imposing that each part should be assigned to one image region in each positive image (those belonging to the category to be modeled). This results in a bipartite graph linking parts and regions.

The assignment problem of [8] poses the learning of part-based models in a very appealing way, yet their solution is based on heuristics leaving room for improvements. This paper’s contribution is an extensive study of this assignment problem: We first present of a well-founded formulation of the problem and propose different solutions in a rigorous way. These different methods are evaluated and compared on two different datasets and state-of-the-art performance is obtained. By experimenting with improvements in the underlying description and encoding, we demonstrate that the benefit of our part learning methodology remains complementary to the benefit of more powerful visual representations

obtained by state of the art deep learning approaches.

2. Previous work

Image classification has received a lot of attention during the last decades, with most of the related approaches focused on models based on aggregated features [9, 10] or the Spatial Pyramid Matching [11]. This was before the Convolutional Network revolution [12] still at the heart of most of the recent methods [13].

Several authors have investigated part-based models in which some parts of the image are combined in order to determine if a given object is depicted. This is in contrast to aggregation approaches where all the image regions are pooled without selecting sparse discriminative parts. For instance, [14] discovers sets of regions used as mid-level visual representation; the regions are selected for being representative (occurring frequently enough) and discriminative (different enough from others), during an iterative procedure which alternates between clustering and training classifiers. Similarly, [5] addresses this problem by learning parts incrementally, starting from a single part occurrence with an Exemplar SVM and collecting more occurrences from the training images.

In a different way, [6] poses the discovery of visual elements as a discriminative mode seeking problem solved with the mean-shift algorithm: it discovers visually-coherent patch clusters that are maximally discriminative. In [15], Maji et al investigate the problem of parts discovery when some correspondences between instances of a category are known. The work of [16] bears several similarities to our work in the encoding and classification pipeline. However, parts are assigned to regions using spatial max pooling without any constraint on the number of regions a part is assigned to (from zero to multiple); given this fixed assignment, part detectors are optimized using stochastic gradient descent.

The recent papers related to part-based models are those of Sicre et al [8] and Parizi et al [7]. As said before, the part-based representation of [7] relies on the joint learning of informative parts (using heuristics that promote distinctiveness and diversity) and linear classifiers trained on vectors of part responses. On the other hand, Sicre et al [8] follow the two stage formulation, formulating the discovery of parts as an assignment problem. We also mention the recent and unpublished work of Mettes et al [17] arguing that image categories may share parts and proposing a method to model them as such.

Finally, this paper is related to the assignment problem which finds a maximum weight matching in a weighted bipartite graph. A survey on this topic is the work of Burkard et al [18].

3. Discovering and Learning Parts

The studied approach comprises three steps: (i) distinctive parts are discovered and learned for every category; (ii) a global image signature is computed based on the presence of these parts; and (iii) image signatures are classified by a linear SVM. This paper focuses on the first step. For each category, we learn a set of P distinctive parts which are representative and discriminative.

This section presents different ways to formalize this task giving birth to interesting optimization alternatives in Sect. 4. We first present the parts learning problem as defined in [8]. We show that it boils down to a concave minimization under non convex constraints, which is recast as a quadratic assignment problem.

3.1. Notation

and tr(X) are the transpose and trace of matrix X; vec(X) is the column vector containing all elements of X in column-wise order. Given matrices X, Y of the same size, = is their (Frobenius) inner product, and are the spectral and Frobenius norms. The Euclidean norm of vector x is . Vector () denotes the i-th row (resp. j-th column) of matrix X. The identity matrix is denoted as , while vector (matrix ) is an 1 vector (resp. matrix) of ones. is the indicator function of set A and Projis the Euclidean projector onto A.

Following [8], we denote by with = the set of images of the category to be modeled, i.e. positive images, while represents the negative images. The training set is and contains n = |I| images. A set of regions is extracted from each image . The number of regions per image is fixed and denoted |R|. The total number of regions is thus R = n|R|. is the set of regions from positive images whose size is = .

Each region is represented by a descriptor . In this work, this descriptor is obtained by a CNN, and in particular it is the output of a convolutional or fully connected layer. More details are given in Section 5.2. By ) we denote the (resp. ) matrix whose columns are the descriptors of the complete training set (resp. positive images only).

3.2. Problem setting

A category is modeled by a set of parts P with |P| = P. We introduce the matching matrix M associating image regions of positive images to parts. Element of M corresponds to region r and part p. Ideally, = 1 if region r represents part p, and 0 otherwise. By we denote the submatrix of M that contains columns corresponding to image I.

We keep the requirements of [8]: (i) the P parts are different from one another, (ii) each part is present in every positive image, (iii) parts should occur more frequently in positive images than in negative ones. The first two requirements define constraints on the admissible set A of M:

This implies that each sub-matrix is a partial assignment matrix. Observe

that the set A is not convex. The third assumption is enforced by Linear Discriminant Analysis (LDA): given matching matrix M, the model ) of

part p is defined as

where and Σ = are the empirical mean and covariance matrix of region descriptors over all training images. The similarity between region r and a part p is then computed as the inner product

(M), x.

For a given category, we are looking for an optimal matching matrix

where W(M) is the matrix whose columns are ) for all parts .

3.3. Recasting as a quadratic assignment problem

The previous formulation limits optimization to alternatively resorting to (2) and (4), as done in [8]. Here, we express J as a function of M without W, recasting (3) as a quadratic assignment problem and opening the way to a number of alternative optimization algorithms. We define similarity matrix . Its entries represent the similarities between parts and regions. According to LDA (2), W(M) = Σ, which in

turn gives

where matrix is symmetric and positive definite and matrix has identical rows (rank 1). Now, observe

that problem (3) is equivalent to

for a matrix Q that is a function of A only. This shows that our task is closely related to the quadratic assignment problem [18], a NP-hard combinatorial problem. Moreover, in our setting, the objective function to be minimized is strictly concave.

This new formalism enables to leverage a classical procedure in optimization: the convex relaxation.

3.4. Convex relaxation with entropic regularization

In the specific case of fixed cost matrix C, the previous problem becomes tractable.

3.4.1. Convex relaxation

Solving a linear assignment problem is numerically demanding, with a complexity about ) [18]. It can be done exactly with dedicated methods, such as the Hungarian algorithm; or equivalently, with linear programming methods that assume convex relaxation of the binary constraints, i.e. considering bi-stochastic instead of permutation matrices.

3.4.2. Soft assignment

To reduce the complexity, the problem is approximated using negativeentropy regularization. Considering a bi-stochastic matrix M, the soft-

assignment problem is

where E is the entropy of the bistochastic matrix M, and 0 is the regularization parameter. As increases, the problem converges to the hard-assignment problem. Paper [8] uses the Sinkhorn algorithm [19], which normalizes iteratively the rows and the columns of M to one, initializing from the regularized cost matrix exp().

Observe that in our setting M is not square as we consider partial assignments between P rows and R > P columns. To solve this more general problem, a simple trick is to add as many rows than required and to define a maximal cost value when affecting columns to them.

Soft assignment has gained a lot attention because it solves large scale problems [20]. However, a major limitation is the loss of sparsity of the solution. As a consequence, approximate solutions of the linear soft-assignment are not

Figure 1: Illustration of the convex relaxation of our assignment problem in 3D.

Black lines are level-sets of the objective function in the plane of the simplex, which is a

triangle in . Lower values are displayed in cyan, larger in magenta. (Left) The original problem is the minimization of a concave quadratic function over matching matrices, that lies on the vertices of the simplex.(Middle) A small quadratic regularization of the objective function together with the relaxation of the constraint preserves the solution.(Right) A too large regularization yet shifts the minimum inside the simplex, thus giving less sparse solutions.

suitable for our problem, as observed in experiments. We describe in the section 4.2 how the authors of [8] have circumvented this problem by iterating soft assignment.

3.5. Convex relaxation with quadratic regularization

We consider now the quadratic regularization of the problem (see Figure 1):

This means that provided ) and ) differ by a constant. Therefore, the minimizers of on A are the minimizers of , for any value of . Indeed, if is sufficiently large (becomes convex (see Figure 1).

We also relax the constraints:

= arg min) (10a) [0, 1]: and for

In words, domain S is the convex hull of the set A and we will refer it to as a simplex. Yet, in general, ) + when . We may find different solutions than problem (6), as illustrated in Figure 1. Overrelaxing the problem for the sake of convexity is not interesting as it promotes parts described by many regions instead of a few ones. Indeed, when , the minimum of is achieved for the rank 1 matrix which may lie inside S. This shows that implicitly controls the number of regions used to describe parts.

4. Optimization

The previous section formalizes the part learning task into optimization problems. This section now presents various methods to numerically solve them. We see two kinds of techniques: i) hard assignment optimization directly finding , ii) soft assignment optimization (Sect. 3.4 and 3.5) that finds .

This latter strategy is not solving the initial problem. However, as already observed in [8] and [21] for classification, soft-assignment affecting several regions to describe a part, may provide better results. This lesson learned from previous works deserves an experimental investigation in our context.

4.1. Hard assignment methods

4.1.1. Hungarian Algorithm

As mentioned in Sect. 3.4, problem (3), when the cost matrix C(M) is fixed, is a variant of the linear assignment problem for which several dedicated methods give an exact solution. Solving this approximated problem can be seen as computing the orthogonal projection of the matrix ) (being

an initial guess, see Section 5.2.1) onto the set A

In our setting, we use the fast Hungarian algorithm variant of [22]. The experimental section shows that this gives surprisingly good results in comparison to more sophisticated methods.

4.1.2. IPFP

The Integer Projected Fixed Point (IPFP) method [23] can be seen as the iteration of the previous method, alternating between similarity matrix C(M) update and projections onto the constraints set A. More precisely, a first order Taylor approximation of the objective function is maximized (e.g. using the Hungarian algorithm) and combined with a linesearch (see Algorithm 1). This approach guaranties the convergence to a local minimizer of J(M) on the set

We observed that IPFP converges very fast nevertheless without improving much results. This is explained by the very specific structure of our problem, where the quadratic matrix Q of (7) is very sparse and negative definite.

4.2. Iterative Soft-assignment (ISA)

The strategy of [8] referred here to as Iterative Soft-Assign (ISA) solves a sequence of approximated linear assignment problems. It is based on the rationale: if we better detect regions matching a part, we will better learn that part; if we better learn a part, we will better detect region matching that part. Hence, the approach iteratively assigns regions to parts by yielding a M for a given C(M) (Sect. 3.4) and learns the parts by yielding W(M) for a given M thanks to LDA. The assignation resorted to a soft-assign algorithm (see [24] for instance) which is also an iterative algorithm solving a sequence of entropicregularized problems (Section 3.4) that converges to the target one. The general scheme of the algorithm is drawn in Algorithm 2.

The approach suffers from two major drawbacks: it is computationally demanding due to the three intricate optimization loops, and it is numerically very difficult to converge to an hard-assignment matrix (due to the entropy regularization). Nevertheless, as reported in [8], the latter limitation turns out to be an advantage for this classification task. Indeed, the authors found out that early stopping the algorithm actually improves the performance. However, the obtained matrix M does not satisfy the constraints (neither A nor S).

4.3. Quadratic soft assignment with Generalized Forward Backward (GFB)

To address the relaxed problem (10b), we split the constraints on the match-

ing matrix M for rows and columns: for each row and each column of

Problem (10b) is then equivalent to the following

where and respectively encode constraints on parts and regions:

The General Forward Backward (GFB) algorithm [25] alternates between ex-

plicit gradient descent on the primal problem and implicit gradient ascent on the dual problem. It offers theoretical convergence guaranties in the convex case. The positive parameter L controls the gradient descent step. Experimentally,

P is computed in linear time [26]. The projection onto is trivial. Note that other splitting schemes are possible and have been tested (for instance, using non-negativity constraint on a third variable), but this combination was particularly efficient (faster convergence). The main advantage of this algorithm is that it can be massively parallelized.

5. Experiments

5.1. Datasets

5.1.1. The Willow actions dataset [27]

is a dataset for action classification, which contains 911 images split into

7 classes of common human actions, namely interacting with a computer, pho-

tographing, playing music, riding cycle, riding horse, running, walking. There

are at least 108 images per actions, with around 60 images used as training and

the rest as testing images. The dataset also offers bounding boxes, but we do not use them as we want to detect the relevant parts of images automatically.

5.1.2. The MIT 67 scenes dataset [28]

is an indoor scene classification dataset, composed of 67 categories. These

include stores (e.g. bakery, toy store), home (e.g. kitchen, bedroom), public spaces (e.g. library, subway), leisure (e.g. restaurant, concert hall), and work (e.g. hospital, TV studio). Scenes may be characterized by their global layout (corridor), or by the objects they contain (bookshop). Each category has around 80 images for training and 20 for testing.

5.2. Improved description and classification pipeline

We follow the general learning and classification pipeline of [8], however we also introduce significant improvements. Such improvements makes sense in order to compete with recent works. In summary, during part learning, |R| = 1, 000 regions are extracted from each training image and used to learn the parts. During encoding, |R| regions are extracted from both training and test images, and all images are encoded based on the learned parts. Finally, a linear SVM is used to classify test images. For each stage, we briefly describe the choices made in [8] and discuss our improvements.

5.2.1. Initialization

The initialization step is achieved as in [8]. All training positive regions are clustered and for each cluster an LDA classifier is computed over all regions of the cluster. Maximum responses to the classifiers are then selected per image and averaged over positive and negative sets to obtain two scores. The ratio of these scores is used to select the top P clusters to build the initial part classifiers. Finally, an initial matching matrix M is built by softmax on classifier responses. This scheme is followed for all optimization algorithms, even if a part model matrix is not explicitly formed during iterations.

5.2.2. Extraction of image regions

Two strategies are investigated:

• Random regions (‘R’). As in [8], |R| regions are randomly sampled over the

• Region proposals (‘P’). Following [17], up to |R| regions are obtained based

5.2.3. Region descriptors

Again two strategies are investigated, based on fully connected CNN or convolutional layers:

• Fully connected (‘FC’). As in [8], we use the output of the 7th layer of

• Convolutional (‘C’). As an improvement, we use the last convolutional

5.2.4. Encoding

Given an image, either training or testing, region descriptors are tested against the learned part model to generate a global image descriptor, which is then used by a SVM classifier. We use several alternative strategies:

• Bag-of-Parts (‘BoP’) and Spatial Bag-of-Parts (‘SBoP’). According to

5.2.5. Parameters of the learning algorithms

For the Iterative Soft-Assign (ISA) method, we use the same parameters as [8]. Concerning the GFB method, we perform 2k iterations of the projection, except for the MIT67 dataset with convolutional descriptor, where iterations are limited to 1k. In all experiments performance remains stable after 1k iterations. For the GFB method with = 0, reffed to as GFB, we choose = 10after experimental evaluation on the Willow dataset. We denote by just GFB the case where = 0.

5.3. Results

In the following, we are showing results for (i) fully connected layer descriptor on random regions (R+FC), which follows [8], and (ii) convolutional layer descriptor on region proposals (P+C), which often yields the best performance. We evaluate different learning algorithms on BoP and CoP encoding, and then investigate the new encoding strategies SBoP and PCoP as well as combinations for the ISA algorithm. On Willow we always measure mean Average Precision (mAP) while on MIT67 we calculate both mAP and classification accuracy (Acc).

We start by providing, in Table 1, a baseline corresponding to our description methods on the full image without any part learning. Comparing to subsequent results with part learning reveals that part-based methods always provide better description of the content of an image.

Table 1: Baseline performance, without part learning.

We now focus on the part learning methods, which are evaluated in the context of action and scene classification in still images. Figure 5.3 shows some qualitative results of learned parts on MIT67. Then, Table 2 shows the performance of ISA, IPFP, Hungarian, GFB, and GFBon Willow and MIT67 datasets. After some evaluation on both MIT67 and Willow, IPFP was not evaluated in further experiments since it performs on par with the Hungarian or worst, as previously explained in 4.1.2. On the Willow dataset, we observe that GFBGFB > Hungarian and IPFP > ISA. However, on MIT67 the results are different and we have ISA > Hungarian and GFB > GFB. When using the improved P+C descriptor, we observe a similar trend for the BoP. Nevertheless, note that all methods perform similarly when using the CoP en-

coding.

Figure 2: Top scoring parts for various images of bowling, florists, gym and wine cellar

Based on this experimentation, we can draw two main conclusions. First, methods based on soft assignment (ISA, GFB) clearly outperform methods based on hard assignment. This result is also confirmed, in almost all cases, by the results of Table 3, where one iteration of the Hungarian algorithm is performed on the assignment matrix obtained after ISA (i.e. ISA+H).

Second, while the GFB offers some significant practical advantage, when combined with quadratic regularization it is out-performed by Iterative SoftAssign (except on the Willow dataset with BoP and CoP, first and third line in Table 2). Our explanation is that it demonstrates that quadratic regularization is less appropriate than entropic regularization for this problem. Indeed, as illustrated in Section 3.5, over-relaxing the objective function tends to yield a matrix with very similar rows, meaning that parts are described by the same regions, which is highly undesirable. While this problem also occurs when solv-

Table 2: Performance of various learning methods on Willow and MIT67.

Table 3: Performance of Hard vs Soft assignment. ISA+H refers to performing one iteration of the Hungarian algorithm on the solution obtained by ISA.

ing a soft-assignment with very large regularization, it does not happen when using ISA.

Another possible explanation of this difference in performance may lie in the fact that the Iterative Soft-Assign is stopped before convergence and does not satisfy the constraints P imposed on rows, whereas those constraints are satisfied when using the GFB algorithm. We conjecture that the constraint, i.e. “a part must occur in every positive image” in the original problem definition [8], is too strong and may need to be relaxed.Actually, as highlighted in the introduction (Sect. 1), the limitation of the separate optimization problem in comparison with the joint optimization is that a better optimization of the intermediate

goal does not necessarily produce better final performance.

Focusing on the ISA method, the improved region description and encoding are evaluated, see Table 4. Using region proposals along with convolutional layer descriptions shows a significant performance gain, especially on the Willow dataset. We can see a consistent improvement for the SBoP and PCoP encoding as well and note that PCA yields more improvement on the descriptors based on fully connected layer than on the ones based on convolutional layers. These improvements set a new state of the art on both datasets, obtaining 91.9% mAP on Willow and 81.4% mAP on MIT67. Table 5 compares our best performance on MIT67 to a number of previous methods. Furthermore, we outperform the previous state of the art on Willow [17] with 81.7% mAP.

Table 4: Results on Willow and MIT67 datasets for the ISA method, with improved re- gion descriptions P+C and improved encoding methods SBoP and PCoP. BoP+CoP and SBoP+PCoP refer to concatenated image descriptors.

6. Conclusion

To conclude, we have investigated in this work the problem of discovering parts for part-based image classification. We have shown that this problem can be recast as a quadratic assignment problem with concave objective function to be minimized with non-convex constraints. While being known to be a very difficult problem, several techniques have been proposed in the literature, either trying to find “hard assignment” in a greedy fashion, or based on optimization of the relaxed problem, resulting in “soft assignment”. Several methods to address

Table 5: Performance in terms of accuracy of existing part-based and non part-based methods on the MIT67 Scenes dataset.

this task have been investigated and compared to the previous method of [8] which achieves state of the art results.

We additionally proposed improvements on several stages of the classifica-tion pipeline, namely region extraction, region description and image encoding, using a recent very deep CNN architecture. This achieves a new state-of-the art performance on two different datasets. Furthermore, the new region description method is orders of magnitude faster, as this process was previously the bottleneck in [8].

Our experiments show that, in the context of part-based image classification, soft assignment outperforms hard assignment. Moreover, entropic regularization is more appropriate than quadratic regularization, while the best overall performance is obtained when one constraint is not fully satisfied. While it is a common constraint to consider that a part must occur in every positive image, this interesting finding shows that this constraint may need to be relaxed.

Our reformulation and investigation of different optimization methods allow the exploration of the limits of the original problem, such as defined in [8]. We believe this knowledge will help the community in the search for more appropriate models, potentially end-to-end trainable, using better network architec-

tures.

Acknowledgment

Part of this work was achieved in the context of the IDFRAud project ANR-14-CE28-0012.

References

[1] Y.-L. L. Boureau, F. Bach, Y. LeCun, J. Ponce, Learning mid-level features for recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2010.

[2] C. Doersch, S. Singh, A. Gupta, J. Sivic, A. A. Efros, What makes Paris look like Paris?, ACM Trans. Graph. 31 (4).

[3] J. J. Lim, C. L. Zitnick, P. Doll´ar, Sketch tokens: A learned mid-level representation for contour and object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2013, pp. 3158–3165.

[4] S. Ullman, E. Sali, M. Vidal-Naquet, A Fragment-Based Approach to Ob- ject Representation and Classification, in: Visual Form 2001, Springer Berlin Heidelberg, Berlin, Heidelberg, 2001, pp. 85–100.

[5] M. Juneja, A. Vedaldi, C. V. Jawahar, A. Zisserman, Blocks that shout: Distinctive parts for scene classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013.

[6] C. Doersch, A. Gupta, A. A. Efros, Mid-level visual element discovery as discriminative mode seeking, in: Advances in Neural Information Processing Systems, 2013, pp. 494–502.

[7] S. N. Parizi, A. Vedaldi, A. Zisserman, P. Felzenszwalb, Automatic dis- covery and optimization of parts for image classification, in: International Conference on Learning Representations, 2015.

[8] R. Sicre, F. Jurie, Discriminative part model for visual recognition, Computer Vision and Image Understanding 141 (2015) 28 – 37. doi:http://dx.doi.org/10.1016/j.cviu.2015.08.002. URL http://www.sciencedirect.com/science/article/pii/ S1077314215001642

[9] G. Csurka, C. R. Dance, L. Fan, J. Willamowski, C. Bray, Visual catego- rization with bags of keypoints, in: Intl. Workshop on Stat. Learning in Comp. Vision, 2004.

[10] F. Perronnin, J. Sanchez, T. Mensink, Improving the fisher kernel for large- scale image classification, in: European Conference on Computer Vision, 2010.

[11] S. Lazebnik, C. Schmid, J. Ponce, Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2006.

[12] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in neural information processing systems, 2012, pp. 1097–1105.

[13] K. Simonyan, A. Zisserman, Very deep convolutional networks for large- scale image recognition, CoRR abs/1409.1556.

[14] S. Singh, A. Gupta, A. A. Efros, Unsupervised discovery of mid-level dis- criminative patches, in: Proceedings of the European Conference on Computer Vision, Springer, 2012, pp. 73–86.

[15] S. Maji, G. Shakhnarovich, Part discovery from partial correspondence, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 931–938.

[16] J. Sun, J. Ponce, Learning discriminative part detectors for image clas- sification and cosegmentation, in: Proceedings of the IEEE International Conference on Computer Vision, 2013.

[17] P. Mettes, J. C. van Gemert, C. G. M. Snoek, No spare parts: Sharing part detectors for image categorization, CoRR abs/1510.04908. URL http://arxiv.org/abs/1510.04908

[18] R. Burkard, M. Dell’Amico, S. Martello, Assignment Problems, Society for Industrial and Applied Mathematics, 2012. arXiv:http:// epubs.siam.org/doi/pdf/10.1137/1.9781611972238, doi:10.1137/1. 9781611972238. URL http://epubs.siam.org/doi/abs/10.1137/1.9781611972238

[19] R. Sinkhorn, P. Knopp, Concerning nonnegative matrices and doubly stochastic matrices., Pacific J. Math. 21 (2) (1967) 343–348. URL http://projecteuclid.org/euclid.pjm/1102992505

[20] J. Solomon, F. de Goes, G. Peyr´e, M. Cuturi, A. Butscher, A. Nguyen, T. Du, L. Guibas, Convolutional wasserstein distances: Efficient optimal transportation on geometric domains, ACM Trans. Graph. 34 (4) (2015) 66:1–66:11. doi:10.1145/2766963. URL http://doi.acm.org/10.1145/2766963

[21] L. Liu, L. Wang, X. Liu, In defense of soft-assignment coding, in: 2011 International Conference on Computer Vision, 2011, pp. 2486–2493. doi: 10.1109/ICCV.2011.6126534.

[22] S. Bougleux, L. Brun, Linear sum assignment with edition, CoRR abs/1603.04380. URL http://arxiv.org/abs/1603.04380

[23] M. Leordeanu, M. Hebert , R. Sukthankar, An integer projected fixed point method for graph matching and map inference, in: Proceedings Neural Information Processing Systems, Springer, 2009.

[24] A. Rangarajan, A. Yuille, E. Mjolsness, Convergence properties of the soft- assign quadratic assignment algorithm, Neural Comput. 11 (6) (1999) 1455–

1474. doi:10.1162/089976699300016313. URL http://dx.doi.org/10.1162/089976699300016313

[25] H. Raguet, J. Fadili, G. Peyr´e, A generalized forward-backward splitting, SIIMS 6 (3) (2013) 1199–1226. doi:10.1137/120872802.

[26] L. Condat, Fast Projection onto the Simplex and the l1 Ball, to appear in Mathematical Programming Series A (Aug. 2015). doi:10.1007/ s10107-015-0946-6. URL https://hal.archives-ouvertes.fr/hal-01056171

[27] V. Delaitre, I. Laptev, J. Sivic, Recognizing human actions in still images: a study of bag-of-features and part-based representations., in: Proceedings of the British Machine Vision Conference, Vol. 2, 2010.

[28] A. Quattoni, A. Torralba., Recognizing indoor scenes, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.

[29] K. E. A. van de Sande, J. R. R. Uijlings, T. Gevers, A. W. M. Smeulders, Segmentation as selective search for object recognition, in: IEEE International Conference on Computer Vision, 2011.

[30] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: Convolutional architecture for fast feature embedding, in: ACM International Conference on Multimedia, 2014.

[31] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, A. Oliva, Learning Deep Features for Scene Recognition using Places Database., in: Advances in Neural Information Processing Systems, 2014.

[32] K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep con- volutional networks for visual recognition, in: European Conference on Computer Vision, 2014.

[33] R. Girshick, Fast R-CNN, in: Proceedings of the IEEE International Con- ference on Computer Vision, 2015.

[34] G. Tolias, R. Sicre, H. J´egou, Particular object retrieval with integral max- pooling of CNN activations, 2016. URL http://arxiv.org/abs/1511.05879

[35] L. Zheng, Y. Zhao, S. Wang, J. Wang, Q. Tian, Good practice in cnn feature transfer, arXiv preprint arXiv:1604.00133.

[36] R. Sicre, H. J´egou, Memory vectors for particular object retrieval with mul- tiple queries, in: Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, ACM, 2015, pp. 479–482.

[37] Z. Zuo, G. Wang, B. Shuai, L. Zhao, Q. Yang, X. Jiang, Learning dis- criminative and shareable features for scene classification, in: Computer Vision–ECCV 2014, Springer, 2014, pp. 552–568.

[38] M. Cimpoi, S. Maji, I. Kokkinos, A. Vedaldi, Deep filter banks for tex- ture recognition, description, and segmentation, International Journal of Computer Vision (2015) 1–30.

Designed for Accessibility and to further Open Science