Multi-Scale Superpatch Matching using Dual Superpixel Descriptors

2020·Arxiv

Abstract

Abstract

Over-segmentation into superpixels is a very effective dimensionality reduction strategy, enabling fast dense image processing. The main issue of this approach is the inherent irregularity of the image decomposition compared to standard hierarchical multi-resolution schemes, especially when searching for similar neighboring patterns. Several works have attempted to overcome this issue by taking into account the region irregularity into their comparison model. Nevertheless, they remain sub-optimal to provide robust and accurate superpixel neighborhood descriptors, since they only compute features within each region, poorly capturing contour information at superpixel borders. In this work, we address these limitations by introducing the dual superpatch, a novel superpixel neighborhood descriptor. This structure contains features computed in reduced superpixel regions, as well as at the interfaces of multiple superpixels to explicitly capture contour structure information. A fast multi-scale non-local matching framework is also introduced for the search of similar descriptors at different resolution levels in an image dataset. The proposed dual superpatch enables to more accurately capture similar structured patterns at different scales, and we demonstrate the robustness and performance of this new strategy on matching and supervised labeling applications.

1. Introduction

In many computer vision related applications, such as image classification and segmentation, there is a important need for fully automated results. To this end, a commonly employed strategy is to take inspiration from other data, in a supervised way when ground truth annotations are available. In this context, non-local methods have provided accurate results on various applications. In these methods, image regions are independently considered, generally using a square patch defined for each pixel, capturing a local pattern [4]. For exemplar-based classification or segmentation, matching algorithms are usually used to search for similar patterns in the available data, to then transfer the associated information, at the pixel or image scale after a global decision process.

This search for similar patterns is generally performed for each image patch, and fast matching algorithms have been developed e.g., PatchMatch [3], TreeCANN [24] or FLANN [22], to efficiently exploit a large number of example images in a reduced computational time. To find similar content with such matching algorithms, it is common to extract features from dense image patches or local interest points. These features are usually designed to be robust to transformations such as scaling, viewpoint or illumination changes. Standard descriptors include features based on gradient information such as SIFT [20] or HoG [7, 8], or binary patterns,

∗Corresponding author: Tel.: +33(0)540003552; Email address: michael.clement@labri.fr (Micha¨el Cl´ement)

e.g., BRIEF [5].

In the last few years, convolutional neural networks have also been able to extract particularly relevant features, and have shown promising results in many applications related to image processing [18]. However, although some recent architectures such as U-Net may learn from relatively small datasets [26], these methods rely on costly supervised learning strategies, and often require very large annotated datasets to be trained efficiently. Besides, the results of these models are generally difficult to interpret, and can be very sensitive to small perturbations of their inputs [21]. In many cases, such as medical imaging, these drawbacks can limit the potential of automated processing pipelines. Therefore, there is still an important need for methods that can perform without learning steps, as well as with limited training data and computing power.

In this context of fast image search and matching requirements, many works have first focused on hierarchical approaches using prior image over-segmentation into regular grids, e.g., [17]. To go further, other methods proposed to group similar pixels into connected components of homogeneous colors called superpixels, drastically reducing the number of elements to process while preserving contours and spatial structure [28]. Therefore, a process applied at such over-segmentation scale can be close to the optimal pixel-wise result. Several works have used superpixels in non-local frameworks, e.g., [12, 29], or in unsupervised learning-based superpixel matching approaches using random forests [6, 16]. Nevertheless, the geometrical irregularity of such decompositions [11] (i.e., in terms of shape, adjacency or contour smoothness) can become an issue, since neighborhood information is crucial to compute accurate matches in terms of context.

Other approaches have attempted to use superpixel neighboring information, e.g., [25, 27]. Among them, the SuperPatchMatch (SPM) framework [10] partially addresses this issue with a superpixel neighborhood structure called superpatch and a dedicated metric to compare two structures having different geometry and number of elements. However, SPM remains sub-optimal in terms of computational complexity and matching accuracy. The framework enables the search for matches at the same superpatch scale and it only computes features within each superpixel region, poorly capturing contour information. Several works indeed highlighted the need for accurate superpixel-wise features [23, 31, 30], while most image descriptors are locally computed on a regular square neighborhood.

Contributions

In this paper, we address the limitations of previous non-local methods only focusing on intra-region information within superpixels or superpixel neighborhoods [10], by introducing a novel dual superpixel neighborhood descriptor called dual superpatch (DSP), containing two independent descriptor sets (see Sec. 3).

First, intra-superpixel features capture color or texture information within cropped superpixel regions in order to avoid influence of pixel contours or inaccurate superpixel borders (see Sec. 3.1). Then, to capture structure information, for instance in terms of contour orientations, we extract a relatively regular grid of specific descriptors at superpixel interfaces (see Sec. 3.2). To efficiently compare such irregular dual descriptors, having different geometry and number of elements, we also propose new distances and optimizations, significantly reducing the computational complexity.

The SuperPatchMatch (SPM) search algorithm [10] with our accurate dual superpatch (DSPM), performs more relevant superpixel matching. To go further, we also extend DSPM to the search of matches at multiple scales (see Sec. 4), and propose a framework to perform automatic labeling using exemplar-based images with ground truth labels. In our framework, the comparison of DSP at different scales can be easily performed since we consider reduced spatial information, i.e., sets of barycenter positions. This way, we are able to match similar objects at different sizes in heterogeneous datasets.

Finally, to show the robustness of our framework, especially compared to [10], we consider several matching and exemplar-based labeling experiments on a standard face dataset [14] (see Sec. 5).

2. The SuperPatchMatch Framework

In this section, we first recall the SuperPatchMatch (SPM) framework initially introduced by [10], which constitutes the

Figure 1: Superpatch definition. For a superpixel S i (filled blue), neighboring superpixels (blue contours) having their barycenters (red dot) into a radius r, centered on XS i the barycenter of S i, are part of the superpatch Si.

basis of our approach.

2.1. The SuperPatch Structure

To generalize standard patch-based frameworks to irregular image decompositions, [10] proposed the superpatch structure. As for square patches of pixels defined around each pixel, a superpatch Si, associated to a superpixel S i, contains the adjacent neighbors of a superpixel S i with respect to a fixed radius r. The proximity is simply computed according to the superpixel spatial barycenters such that:

where XS i = [xS i, yS i] and XS i[xS iyS i] respectively denote the spatial barycenters of superpixels S i and S i. This way, the superpatch structure only includes the most significantly neighboring superpixels, using reduced spatial information. In Fig. 1, we show a superpatch example, defined for a superpixel S i, containing its adjacent superpixels to provide a superpixel neighborhood descriptor.

2.2. SuperPatch Comparison Distance

A comparison distance is also proposed in [10] to measure the similarity between two superpatches. The main issue to design such distance is that the two structures are very likely to have different geometry and number of elements. Hence, there is no one-to-one association between superpixels, contrary to pixels within regular patches. To preserve the ability to compare patterns, the spatiality must be taken into account, and [10] proposed to simply consider the proximity of superpixel barycenters after registration on the central superpixels. In the following, we consider two superpatches Si and Sj, for instance in two images A and B. A weight w(XS iXS jexp(XS j(XS i(XS i XS j))), measures the relative displacement between the registered barycenters XS iand XS jof superpixels S iSi and S jSj, with respect to central superpixels S i and S j, and is a scaling parameter set to 1/2 K, with |I| and K the respective number of pixels and superpixels in the image. This way a superpixel S ionly compares to the closest ones in Sj.

Figure 2: Comparison process of two superpatches Si and Sj from Eq. (2). Superpatches are registered according to the barycenters XS i and XS j of their central superpixel. The weights w in Eq. (2) favor the comparison to closest superpixels after registration and the weight values corresponding to the bottom superpixel of Si (filled blue) are represented within each compared superpixel of Sj (darker meaning a higher w weight).

The distance D between two superpatches Si and Sj is fi-nally defined as:

(2) where ws(XS i) also weights the influence of S iaccording to its spatial distance to S i such that ws(XS iexp(XS iXS i(2r2)), and d is the distance between the superpixel fea- tures FS iand FS j. Note that any distance d and feature F can be considered in Eq. (2).

The comparison process between two superpatches having different number of element and geometry is illustrated in Fig. 2. The weight w in Eq. (2) weights the feature distance d between a superpixel in Si and a superpixel in Sj after registration on their central barycenters. In this Figure, the weights w corresponding to the bottom superpixel of Si are represented within each superpixel of Sj.

2.3. SuperPatchMatch Correspondence Algorithm

Non-local methods have soon highlighted the need for fast patch-based matching algorithms to perform the search of correspondences within large areas, e.g., library of example images. A significant breakthrough has been obtained with PatchMatch (PM) [3], a fast partly random matching algorithm, providing for each patch of an image A, a match in an image B. PM has very interesting properties such as no requirements for learning or preprocessing steps, and its complexity only depends on the size of the image to process A, enabling to search for matches in a large set of example images.

The PM algorithm starts from random associations and iteratively refines them with a sequential processing of all image patches. The refinement process is mainly based on the fast propagation of good matches from spatially adjacent neighbors. Large regions are indeed very likely to correspond between images. According to the scan order, which is reversed at each iteration, the shifted correspondences of two

Figure 3: Core of the SPM algorithm. Full lines indicate current best matches. The direct adjacent neighbors of the blue superpixel are considered to propose new match candidates (dotted lines). The relative orientations between superpixels in A tend to be respected in B, e.g., the yellow superpixel remains on top of the blue one in both images. Remaining adjacent neighbors (gray) that have not yet being processed during this iteration will be considered at the next one, processing superpixels in the reverse order.

spatially adjacent patches are considered as new candidates. Also, random patches are tested near the current best match in B. Note that the PM process being partly random, several processes carried out in parallel can potentially provide different matches.

The SuperPatchMatch (SPM) method generalizes PM to superpatches [10], to provide a fast matching algorithm of superpixels. The method mainly requires to adapt the propagation of matches based on adjacent neighbors since there is no more consistent geometry between adjacent superpixels, contrary to the standard pixel grid case. The propagation step of SPM is illustrated in Fig. 3 in the case of two images A and B. The adjacent neighbors are considered to lead to new matches while respecting the relative orientation between superpixels in A and B, to favor the matching of larger regions.

Limitations

Nevertheless, the default SPM distance in Eq. (2) has a quadratic complexity, i.e., each superpixel of a superpatch is compared to all the ones of the other superpatch, and may result in important computational time. Besides, this framework only considers intra-superpixel descriptors. Although, they may be sufficient to capture information within regions, the superpatch does not explicitly focus on capturing contours or gradient information. This may be an issue since such information generally lies at the border of a superpixel, and can be shared between two regions, reducing the relevance of the descriptor. Finally, the SPM framework does not consider multi-scale information that would allow to capture objects of different sizes.

We address all these issues in the following sections with the proposed multi-scale superpatch matching framework that uses new dual superpixel descriptors.

3. Dual Superpixel Descriptors

In this section, we propose an approach to relevantly extract superpixel neighborhood descriptors. We introduce a dual descriptor to efficiently capture around a superpixel, both feature region content and contour information, potentially

Figure 4: Illustration of the dual superpatch (DSP) descriptor. Intra-region information (Ri) within each superpixel S ifrom the border (blue regions) are considered along with information at the superpixel interfaces Ii(green squares) within the same r radius.

lying at their borders. Features are computed within superpixel regions and additional contour features are extracted at multiple interfaces between adjacent superpixels. The proposed dual descriptor of superpixel neighborhood is called Dual SuperPatch (DSP), is denoted ¯Si for a superpixel S i, and is represented in Fig. 4 on the same decomposition example used in Fig. 1. In the following, we present the extraction approach for region (R), i.e., intra-superpixel, and interfaces (I) descriptors, and we propose a general framework to compare different DSP.

3.1. Superpixel Region Descriptors

The superpatch formulation of [10] considers features computed on each whole superpixel region contained into the neighborhood structure. However, superpixels tend to capture homogeneous regions, so pixels at thin contours can be arbitrarily associated to superpixels, leading to altered descriptors. A reduced block area or spatial weighting from the superpixel barycenter could not be applied since these approaches do not guarantee to relevantly extract information when superpixels have very irregular shapes [23].

In this work, we propose to consider the superpixel region information with an offset of pixels to its borders. This way, we take into account almost all the region, while being robust to inaccurate superpixel borders or contour information, that will be considered in another specific interface descriptors within our DSP (see Sec. 3.2). In Fig. 4, the considered regions Rifor superpixels S iare represented in blue. To each region Ri, feature FRiand spatial XRiinformation are considered, so a dual superpatch contains a set of tuples Ri = {FRiXRi.

To demonstrate the issue of considering the whole superpixel region to extract features, we consider two decompositions of images containing regions with 16 different oriented textures (see Fig. 5). The superpatch radius is set to r = 0 to only consider intra-region information, where HoG descriptors [7] are computed. For each superpixel of the left image, we compute in an exhaustive manner its closest match in terms of superpixel content in the right image. If the texture

Figure 5: Two synthetic images containing 16 oriented textures and decom- posed into superpixels with [9].

Table 1: Impact of border ofor intra-region descriptor extraction. Texture matching results between images in Fig. 5 for different values of Gaussian noise levels using both superpixel decompositions from [9] and ground truth ones. Best and second results are respectively bold and underlined.

are similar, we consider the matching as accurate (1), otherwise inaccurate (0). We report in Tab. 1 the average matching accuracy on all superpixels of the left image, according to different values and several levels of Gaussian noise applied to both images after decomposition. We perform this evaluation using the decompositions obtained with a texture-aware method [9], and also the ground truth ones, perfectly capturing texture changes. This experiment highlights the need to restrict the area to extract superpixel information since superpixel decompositions may be not perfectly accurate. Moreover, even on perfectly fitting decompositions, the inaccurate gradient information lying on superpixel borders is captured with 0 and may impact the results with a noise variance superior to 50, while we do not take into account texture in other superpixels with 0.

Fast comparison distance. The comparison between two sets of superpixel region descriptors can be performed in a more computationally efficient manner than with Eq. (2). We propose to only select one superpixel S jSj to compare for each superpixel S iSi. To do so, the superpixel barycenter XS iis first registered by the displacement between central superpixels S i and S j, and we denote this new position by XS j(iwhich is computed such that XS j(iXS i(XS i XS j). In Fig. 2, these correspond to the red superpixel barycenters. Then, we project the registered barycenters on the decomposition of the image from where Sj is extracted. In Fig. 2, the black superpixel containing a red dot XS j(iwould be selected to compare to superpixel S iof region Ri. This corresponding superpixel containing XS j(iin the compared image, is denoted S j(i), and its associated intra-region is denoted Rj(i). This way, we significantly reduce the distance complexity, while potentially increasing the comparison accuracy (see Sec. 5). The comparison between two region descriptors

Figure 6: Impact of the parameter in the DSP distance (5). (a) Only region descriptors Riare used (1), with average color information. This result corresponds to the one obtained using [10]. (b) Only interface HoG [7] descriptors Iiare used (0) and enable to capture structure information. Radius r is set to 25, i.e., approximately capturing the first adjacency ring.

Ri and Rj is defined using barycenter projections such that:

Note that barycenters falling outside the image limits are projected to select the closest superpixel on the image boundary.

A similar projected distance was suggested in [10], in a non-symmetric formulation. In our dual superpatch comparison model, we consider a symmetric projected distance Dp on intra-region descriptors defined as:

3.2. Superpixel Interface Descriptors

To efficiently capture image contour information, we propose to also consider specific descriptors at superpixel interfaces. These can be easily extracted with a low complexity by considering the presence of at least three superpixels in a 33 pixels neighborhood. To avoid over detection, a larger area can be neglected after selection of an interface point. This way, we directly obtain a relatively regular grid of potential interest points in terms of contours, without introducing further scaling or thresholding parameters. In Fig. 4, these interface regions denoted Iiare represented as green squares. On these regions, specific contour descriptors can be computed, e.g., HoG [7].

Acceleration of quadratic distance. Since interface regions do not provide a dense decomposition of the image domain, the distance Eq. (3) using projections cannot be used to fastly compare two sets of interface descriptors. A quadratic one-to-many distance such as Eq. (2) could be used, but at the expense of important computational cost. To address this issue, we propose a one-to-one association for each interface descriptor Ii. Each Iiis only compared to the spatially closest one I jin the other dual superpatch. This way, the framework only requires exhaustive spatial distances between interface barycenters. The distance is computed as for Eq. (3), where I j(i), the selected interface descriptor in Ij for Iiis defined as I j(iargmin I jXIiXI j. Finally, as for Eq. (4), the

distance is also computed from Ij to Ii to obtain a symmetric distance.

3.3. General Dual SuperPatch Comparison Framework

Our dual superpatch (DSP) ¯Si, for a superpixel S i, is described by a set of intra-superpixel regions Ri = {FRiXRiand superpixel interfaces Ii = {FIiXIidescriptors such that ¯Si = [Ri, Ii]. Note that Ri and Ii can have a different number of elements. To relevantly measure the similarity of two DSP ¯Si and ¯Sj having different geometry and number of elements, we propose the following general DSP comparison distance:

with Dp the fast distance on descriptors, using barycenter projections Eq. (4) for intra-region R, and selection of closest descriptor for interfaces I. and [0, 1] a setting parameter. Note that any feature can be considered in FRior FIi. Therefore, can have an intuitive tuning using the same or normalized descriptors for both R and I.

In Fig. 6, we show matching results obtained with our generalized model using average RGB color as intra-region features FRi(Fig. 6(a), 1) and HoG descriptors as interface features FII(Fig. 6(b), 0). Hence, we highlight the general aspect of our framework that allows to either focus on intra-region (Fig. 6(a)) or interface information (Fig. 6(b)). In Sec. 5, we further demonstrate the performances obtained using these complementary descriptors.

4. Multi-Scale Dual SuperPatchMatch

In this section, we propose to extend the SPM framework with our dual descriptor (DSPM), to perform the search of DSP at multiple scales. We first show how to compare two DSP of different sizes, then we propose a multi-scale fusion strategy.

4.1. Dual SuperPatch Rescale

In Sec. 3, we showed how to compare two DSP extracted with the same radius size r. Nevertheless, the proposed distance Eq. (5) can easily adapt to DSP of different sizes, since the spatial information is only measured by barycenter positions denoted with X. We consider two DSP ¯Si and ¯Sj, with different DSP extraction radius ri and r j in Eq. (1). To compare them, all spatial information contained in ¯Sj can be adjusted according to the ratio between the radiuses, such that:

Figure 7: Example of matching result without (gray lines) and with (green lines) a multi-scale search for matches. See text for more details.

This way, similar DSP can be searched at various scales, e.g., in example images. Note that features FRjand FI jremain unchanged by this scaling transformation.

4.2. Multi-Scale Exemplar-based Framework

Most non-local methods perform a search for similar content in a heterogeneous dataset, with no prior information on the targeted object size. In [10], no multi-scale strategy is proposed since it considers an exemplar-based labeling experiment on linearly registered images [14].

Here, we introduce a generalized multi-scale exemplar-based framework allowing to search for similar DSP of different radiuses, in order to capture objects of different sizes. The proposed DSP structure indeed enables to perform a simple automatic rescale, presented in Sec. 4.1. Hence, multiple DSP sizes can be considered in a set of examples images B. A set rB = {rB} is considered for setting the DSP radius in B.

In [10], a supervised labeling framework based on the non-local means algorithm [4] was introduced to merge the information of multiple superpatch matches computed in a library of example images for an image A to process. A label map LrBm (S i) is computed for a superpixel S i, for all M different labels m, such that:

where Bmis the set of matches for S i computed at scale rB and having a ground truth label m, and W is the normalization factor W (S i, S j), with a weight depending on the DSP similarity [10]. The final label of a superpixel

In Fig. 7, we represent matching results obtained without (gray lines) and with (green lines) our multi-scale strategy. We can see that the best match between searches at scales rB = [0.5, 1, 2, 4]rA enables to catch the larger flower with similar colors, instead of the one at the same scale.

5. Experimental Validations

In this section, we present several quantitative experiments to demonstrate the interest of the proposed DSPM framework. We first validate the behavior of our model on standard images with respect to the method parameters. Then, we propose larger scale segmentation experiments on a standard face dataset.

Figure 8: Influence of radius r (a) and trade-o(b) in the proposed DSPM framework compared to SPM. We report the average distance between each superpixel barycenter and the one of its closest match in another decomposition of the same image. See text for more details.

5.1. Parameter Settings

The proposed method was implemented with MATLAB using C-MEX code on a standard Linux computer with 4 cores at 1.90 GHz and 16 GB of RAM. The number of DSPM iterations is set to 5, and we use a -norm as distance d between features F as in [10]. To avoid over detection, interfaces are detected at least each 4 pixels. Default parameters are set such that 1, the border offset for region features R, 5, the trade-off parameter between intra-region and interface distances in Eq. (5), and superpatch radius r = 50. The considered descriptors in R and I are reported according to the application.

5.2. Influence of Parameters

To demonstrate the interest of each contribution, we consider a matching experiment on standard images Baboon, Barbara, House, Lena and Peppers, each decomposed with two superpixel methods: SLIC [1] and SNIC [2]. For each superpixel in a given decomposition, we compute the closest DSP match in the other one. A robust descriptor should indeed be robust to variations in the segmentations. In terms of features, we compute normalized cumulative RGB histogram with 9 bins per canal on intra-regions FRI, and HoG [7] on a local 99 pixels window for interface region descriptors FII.

We evaluate the matching accuracy by the average distance between the superpixel barycenters and the one of their match in the other decomposition. In Fig. 8(a) and Fig. 8(b), we respectively report the average distance with respect to the radius parameter r and with respect to the parameter for r = 50. On the first hand, Fig. 8(a) shows that the accuracy logically increases with the superpatch radius, and with each contribution, i.e., symmetrical projected distance (Eq. (4)), offset to border (0), and interface descriptors. On the other hand, Fig. 8(b) illustrates the interest of using interface descriptors in Eq. (5) in conjunction with cropped regions for 2, and that a balanced trade-off parameter 5 provides the best matching accuracy. In Fig. 9, we also show an example of matching result for DSPM with default parameters compared to SPM.

5.3. Segmentation and Labeling Experiments

Validation framework. We evaluate the proposed DSPM approach on the same face labeling experiment than [10]. The

Figure 9: Matching accuracy of DSPM vs SPM. (a) and (b): decompositions with SLIC and SNIC. (c) and (d): SPM and DSPM matching results (r=50). Distances between superpixel barycenters are shown with standard optical flow representation (stronger colors represent larger displacements).

Table 2: Labeling accuracy for the multi-scale experiment. Training images have been either downsampled or upsampled by a factor of 1.5 or 2 and rA is set to 50. The argmax columns correspond to the fusion strategy proposed in Sec. 4.2 for different combined scales.

considered Labeled Faces in the Wild (LFW) dataset [14], contains 1500 training and 927 testing images of 250250 pixels, linearly registered with [13], and already decomposed into approximately 250 superpixels. LFW contains decompositions and labeling ground truths, so comparisons with state-of-the-art methods do not depend on the superpixel decompositions. Note that to fairly compare with [10], we use the same HoG implementation on a regular grid [8] and compute 50 DSP matches by 50 independent DSPM processes for each superpixel, merged in Eq. (7).

Multi-scale validation. In this experiment, the goal is to validate the interest of our proposed multi-scale matching strategy introduced in Sec. 4. To this end, we have manually applied random scaling transformations to the registered training images. Each image and its corresponding decomposition has been either downsampled or upsampled randomly by a factor of 1.5 or 2 (with no interpolation). As a result, faces depicted in the images can appear up to twice as big or small compared to their initial scales. Hence, the dataset contains face patterns at different scales that would not be ef-ficiently captured using the same DSP radius in A and B.

We apply DSPM with rA=50 and rB 33, 50, 75, 100}. In Tab. 2, we report the labeling accuracy for each radius and for the multi-scale label fusion proposed in Sec. 4.2. From these results, when rA rB, we can observe that the performance for smaller or larger radiuses is always better after applying the rescale strategy, i.e., with Eq. (6). The multi-scale fusion averaging the result over the different radius sizes performs better than the default rB = 50 and than most single scales. Moreover, by considering only larger scales, i.e., rB = 75, 100, thus more precise DSP comparisons, we obtain the highest labeling accuracy, demonstrating the framework ability of the framework to merge information from multi-scale matching. In future works, we plan to apply this ap-

Figure 10: Influence of the number of selected k-ANN for SPM [10] and DSPM on face labeling accuracy.

Table 3: Exemplar-based labeling accuracy on the LFW dataset.

proach on non-registered datasets where objects might naturally appear at different scales.

Comparison to state-of-the-art methods. In Fig. 10, we first compare the influence of the number k of selected ANN for each superpixel, then merged in the label fusion process Eq. (7) for the proposed DSPM method and SPM [10] using Eq. (3). For all ANN numbers, DSPM provides the best results, and is already reaching 95.13% of accuracy with only k = 20 ANN. Note that for both methods, a plateau is reached around k = 50 ANN.

In Tab. 3, we also compare the performance of the proposed DSPM method with the results of state-of-the-art ones, mostly based on supervised (deep) learning approaches. In [15], several approaches are used to label the LFW dataset such as a spatial conditional random field (CRF) and a conditionnal restricted Boltzmann machine (CRBM). The GLOC (GLObal and LOCal) method [15] is also proposed to jointly use both CRF and CRNM approaches to introduce global shape priors in the training process. Finally, in [19], a deep convolutional neural network (DCNN) is proposed and dedicated to the face labeling application. Note that the results in Table 3 are the ones reported by the authors. For SPM, the results correspond to the initial framework using costly quadratic comparisons with Eq. (2), and the results reported by the authors using non-symmetric projected distances with Eq. (3). DSPM reports the best compared labeling accuracy at both superpixel and pixel-wise level. Labeling examples compared to SPM are also represented in Fig. 11. The proposed DSP enables to relevantly capture the context of a superpixel neighborhood in terms of texture and struc-

Figure 11: Examples of labeling results with superpixel accuracy obtained with the proposed DSPM approach on the LFW dataset, compared to the initial SPM method.

ture. Moreover, without further optimizations on non fully multi-threaded code, DSPM performs in less than 3s per subject, against 45s for SPM using Eq. (2). Note that compared state-of-the-art approaches may provide faster computational times but at the expense of previous costly learning-based processes.

Our method is particularly interesting due to its simplicity of use, parameter tuning, and interpretability compared to learning-based approaches, while providing more accurate results than SPM. Besides, any feature can be directly used in the method, even more advanced descriptors, e.g., [31, 30], eventually based on previously trained deep learning architectures.

6. Conclusion

In this work, we addressed some important limitations of existing superpixel matching frameworks, in terms of robustness and computational complexity. We introduced the dual superpatch, a new superpixel neighborhood descriptor containing both intra-region and interface information that are respectively robust to the inaccuracy of superpixel borders and capture contour structures. We also proposed optimized distances and a multi-scale framework to search for similar dual superpatches in an image dataset. Our validations showed an accuracy improvement with our method on matching and exemplar-based labeling applications. The relevance of the proposed dual approach should benefit to all superpixel-based non-local approaches and future works will focus on applying the method to heterogeneous computer vision and medical datasets, and tackling other applications such as superpixel-based image editing.

References

[1] Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., S¨usstrunk, S., 2012. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Trans. on Pattern Analysis and Machine Intelligence 34, 2274–2282.

[2] Achanta, R., S¨usstrunk, S., 2017. Superpixels and poly- gons using simple non-iterative clustering, in: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 4895–4904.

[3] Barnes, C., Shechtman, E., Finkelstein, A., Goldman, D.B., 2009. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Trans. on Graphics 28.

[4] Buades, A., Coll, B., Morel, J.M., 2005. A non-local algorithm for image denoising, in: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 60– 65.

[5] Calonder, M., Lepetit, V., Strecha, C., Fua, P., 2010. BRIEF: Binary robust independent elementary features, in: European Conf. on Computer Vision (ECCV), pp. 778–792.

[6] Conze, P.H., Tilquin, F., Noblet, V., Rousseau, F., Heitz, F., Pessaux, P., 2017. Hierarchical multi-scale supervoxel matching using random forests for automatic semi-dense abdominal image registration, in: International Symposium on Biomedical Imaging (ISBI), pp. 490–493.

[7] Dalal, N., Triggs, B., 2005. Histograms of oriented gra- dients for human detection, in: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 886– 893.

[8] Felzenszwalb, P.F., Girshick, R.B., McAllester, D., Ra- manan, D., 2010. Object detection with discriminatively trained part based models. IEEE Trans. on Pattern Analysis and Machine Intelligence 32, 1627–1645.

[9] Giraud, R., Berthoumieu, Y., 2019. Texture superpixel clustering from patch-based nearest neighbor matching, in: European Signal Processing Conf. (EUSIPCO).

[10] Giraud, R., Ta, V.T., Bugeau, A., Coup´e, P., Papadakis, N., 2017a. SuperPatchMatch: An algorithm for robust correspondences using superpixel patches. IEEE Trans. on Image Processing 26, 4068–4078.

[11] Giraud, R., Ta, V.T., Papadakis, N., 2017b. Evaluation framework of superpixel methods with a global regularity measure. Journal of Electronic Imaging (JEI) .

[12] Gould, S., Rodgers, J., Cohen, D., Elidan, G., Koller, D., 2008. Multi-class segmentation with relative location prior. Int. Journal of Computer Vision 80, 300–316.

[13] Huang, G.B., Jain, V., Learned-Miller, E., 2007a. Unsu- pervised joint alignment of complex images, in: IEEE Int. Conf. on Computer Vision (ICCV).

[14] Huang, G.B., Ramesh, M., Berg, T., Learned-Miller, E., 2007b. Labeled faces in the wild: A database for studying face recognition in unconstrained environments, in: Technical Report 07-49, Univ. of Massachusetts, Amherst.

[15] Kae, A., Sohn, K., Lee, H., Learned-Miller, E., 2013. Augmenting CRFs with Boltzmann machine shape priors for image labeling, in: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 2019–2026.

[16] Kanavati, F., Tong, T., Misawa, K., Fujiwara, M., Mori, K., Rueckert, D., Glocker, B., 2017. Supervoxel classifi-cation forests for estimating pairwise image correspondences. Pattern Recognition (PR) 63, 561–569.

[17] Lazebnik, S., Schmid, C., Ponce, J., 2006. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories, in: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 2169–2178.

[18] LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep Learn- ing. Nature 521, 436–444.

[19] Liu, ., Yang, J., Huang, C., Yang, M.H., 2015. Multi- objective convolutional learning for face labeling, in: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 3451–3459.

[20] Lowe, D.G., 2004. Distinctive image features form scale-invariant keypoints. Int. Journal of Computer Vision 60, 91–110.

[21] Moosavi-Dezfooli, S.M., Fawzi, A., Fawzi, O., Frossard, P., 2016. Universal Adversarial Perturbations, in: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 86–94.

[22] Muja, M., Lowe, D.G., 2014. Scalable Nearest Neigh- bor Algorithms for High Dimensional Data. IEEE Trans. on Pattern Analysis and Machine Intelligence 36, 2227–2240.

[23] Neubert, P., Protzel, P., 2015. Benchmarking superpixel descriptors, in: European Signal Processing Conf. (EUSIPCO), pp. 614–618.

[24] Olonetsky, I., Avidan, S., 2012. TreeCANN-kd tree coherence approximate nearest neighbor algorithm, in: European Conf. on Computer Vision (ECCV), pp. 602– 615.

[25] Pei, S.C., Chang, W.W., Shen, C.T., 2014. Saliency detection using superpixel belief propagation, in: IEEE Int. Conf. on Image Processing (ICIP), pp. 1135–1139.

[26] Ronneberger, O., Fischer, P., Brox, T., 2015. U-net: Convolutional networks for biomedical image segmentation, in: Int. Conf. on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 234– 241.

[27] Sawhney, R., Li, F., Christensen, H.I., 2014. GASP: Ge- ometric association with surface patches, in: Int. Conf. on 3D Vision (3DV).

[28] Stutz, D., Hermans, A., Leibe, B., 2018. Superpixels: An evaluation of the state-of-the-art. Computer Vision and Image Understanding 166, 1–27.

[29] Tighe, J., Lazebnik, S., 2010. SuperParsing: Scalable nonparametric image parsing with superpixels, in: European Conf. on Computer Vision (ECCV), pp. 352– 365.

[30] Tilquin, F., Conze, P.H., Pessaux, P., Lamard, M., Quel- lec, G., Noblet, V., Heitz, F., 2018. Robust supervoxel matching combining mid-level spectral and context-rich features, in: Int. Work. on Patch-based Techniques in Medical Imaging (Patch-MI, MICCAI), pp. 39–47.

[31] Zhang, Y., Pei, Y., Guo, Y., Ma, G., Xu, T., Zha, H., 2018. Consistent correspondence of cone-beam ct images using volume functional maps, in: Int. Conf. on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 801–809.