SPARSE representations have become popular in severalapplications of signal, image and video processing, such as denoising [1], [2], super-resolution, inpainting, compression [3]–[6] or classification. While it was common to analyze and reconstruct signals based on representations over predefined bases such as wavelets and DCT, research in the recent years has shown that learning overcomplete dictionaries adapted to the structure of the treated signals can significantly improve the representation quality. Observing that learning redundant dictionaries from collections of data samples under sparsity priors leads to models that fit and approximate well the characteristics of signals [7], [8], the learning of dictionaries in a supervised setting for the discrimination of different classes of signals has also become a popular research problem [9]. In this work, we propose a method to learn multilevel structured dictionaries with high discrimination capability for the problem of pixelwise image classification.
We consider a supervised classification setting where the classes are known and exemplars are available for each class. In particular, we are interested in image classification problems with a large amount of variability between data samples of the same class, resulting from e.g., dominant presence
J. Aghaei Mazaheri and E. Vural were with Institut National de Recherche en Informatique et Automatique (INRIA), Rennes, France. E. Vural is now with the Deparment of Electrical and Electronics Engineering at METU, Ankara, Turkey. C. Labit and C. Guillemot are with Institut National de Recherche en Informatique et Automatique (INRIA), Rennes, France.
of irregular high-frequency content in the image classes, or multiple subcategories within the same image class with little resemblance between them. Some example applications could be texture classification problems where the considered image texture classes are rich in high-frequency content with little correlation between several patterns belonging to the same class differing by shifts, orientation differences, etc.; or remote sensing satellite images with high variability within the same image class (e.g., the “city” class containing both smooth image regions corresponding to flat areas such as parks and rivers; and regions rich in texture corresponding to populated urban areas with buildings and streets).
In this setting, we consider the problem of learning a discriminative dictionary model for each class. In order to handle the large variability or the presence of multiple subcategories of patterns in each image class, we propose to use multilevel dictionaries having a tree-like structure. In the proposed setting, the overall class-representative dictionary consists of subdictionaries residing at multiple levels, such that each subdictionary in a level originates from a certain atom of a subdictionary in the preceding level. The representation of an image patch in a multilevel dictionary is simply computed by tracing down the branches, i.e., first choosing an atom in the first-level subdictionary, then selecting an atom from the second-level subdictionary corresponding to the first atom, and similarly descending until the desired sparsity level is attained. The patches of test images are classified with respect to their reconstruction errors over each class-representative multilevel dictionary.
Such a multilevel dictionary structure is particularly suitable for the considered image classification problem with high intra-class variability. In a setting with various patterns of little resemblance in the same class, the atoms in upperlevel subdictionaries capture the main characteristics of the patterns such as orientation, so that dissimilar patterns are represented with different atoms in these early levels. The lower-level subdictionaries originating from different atoms in upper levels are then particularly adapted to the structures of the different types of patterns present in the class and learn the fine details of these patterns. The representation of signals with the proposed multilevel structured dictionaries is illustrated in Figure 1.
Many methods in the literature use sparse representations and dictionary learning for the problem of supervised classifi-cation [9], [10], [11], [12]. Using the known labels of training data, these methods learn one or several dictionaries to allow the classification of test images based on their sparse representations in the learnt dictionaries. Although the traditional
Fig. 1. Illustration of the proposed structured dictionaries for two levels. The atoms in the first level capture the main characteristics of different types of patterns in the same class. Each second-level dictionary originates from a different first-level atom. Second-level atoms learn the details of the patterns selecting the corresponding first-level atom.
single-level flat dictionaries used typically in supervised dictionary learning can learn the main characteristics of different classes via sparse representations, the main philosophy of these methods is to tune the atoms to fit well the common features in the same class while pushing them away from the features of the other classes. While such methods give quite impressive results in applications such as face recognition with rather small variability within the same class, their performance may degrade in problems with large intra-class variability. On the other hand, the multilevel dictionary structures proposed in our method have a high learning capacity that explicitly and efficiently takes account of possible intra-class variations.
We propose to learn the class-representative multilevel dictionaries in a sequential way, by optimizing each atom of a subdictionary with respect to a discriminative learning objective. Our objective function seeks to update each atom in order to fit the residuals of the signals from the same class using that atom, while increasing the reconstruction error of the signals from other classes when represented with that atom. We first train the dictionaries with image patches from a set of known classes. Then, for the image patches centered around each pixel of a test picture, we compute the reconstruction errors over the learnt class-representative dictionaries. Finally, the label image is obtained by applying a combination of two smoothing methods: a label expansion algorithm based on a graph cut, and an erosion algorithm, which both use the information of the reconstruction errors of the patches over the learnt dictionaries. We evaluate our method with experiments on several texture classification problems. The experimental results show that our method gives competitive results with the state of the art.
In Section II, we give an overview of the related work. Section III presents the proposed method for learning supervised dictionaries with a multilevel adaptive structure, together with a description of the classification algorithm based on these learnt structured dictionaries. In Section IV, we describe the smoothing steps applied for improving the label estimates. We present the experimental results on texture images in Section V, and Section VI concludes the paper.
We now briefly overview some related works on sparse representations and dictionary learning.
A. Sparse representations and unsupervised dictionary learning
Sparse representations consist in representing a signal as a linear combination of only a few columns, known as atoms, from a dictionary
, under a sparsity constraint as
where is the coefficient vector corresponding to the sparse representation of y over D, and L > 0 is the sparsity constraint, i.e., the maximum number of non-zero coefficients in x. The
-norm
of x is equal to the number of non-zero coefficients in x. The dictionary D is composed of K atoms
, that are supposed to be normalized to have unit
-norm as
.
The computation of the sparse approximation of a signal in (1) is an NP-hard problem and some greedy algorithms have been developed to find an approximate solution, such as the Matching Pursuit (MP) [13] and the Orthogonal Matching Pursuit (OMP) [14] algorithms, which search in each iteration the atom of the dictionary that is the most correlated with the current residual vector. Several other methods such as the Basis Pursuit algorithm [15] propose to relax the optimization problem by replacing the -norm of x with its
-norm.
Many dictionary learning methods have been proposed to learn a dictionary from a set of N training vectors
under sparsity constraints, in order to better adapt the dictionary to the data. Unsupervised dictionary methods typically solve the problem
subject to
and
where L > 0 is the sparsity constraint applied to each column of X, i.e., the maximum number of non-zero coefficients in
, and
is the Frobenius norm.
Many dictionary learning algorithms such as the Method of Optimal Directions (MOD) [8], [16] and K-SVD [7] apply an iterative optimization procedure with two major steps. The first step consists of the sparse coding of the training vectors over the fixed dictionary D to compute X, which can be solved with pursuit algorithms; and the second step is the update of the dictionary D based on the decompositions computed in the previous step. Some dictionary learning algorithms impose constraints on the dictionary, such as the Sparse K-SVD method [17], which aims to learn a sparse dictionary, or the Non-Negative K-SVD method [18], which learns a non-negative dictionary. An online dictionary learning method based on stochastic approximations is proposed in [19]. Finally, structured multilevel dictionaries are learnt in [6], [20] based on the idea of adapting each dictionary to one iteration of the pursuit algorithm, so that atoms are sequentially selected from dictionaries at different levels by going down the branches in sparse coding. While the multi-level dictionary structure used in our method is based on the principle developed in these previous works, we focus here on the supervised learning problem for classification applications unlike these works.
B. Supervised dictionary learning
Supervised dictionary methods aim to learn dictionaries such that sparse representations of signals over the learnt dictionaries allow an accurate estimation of their class labels. Some dictionary learning methods learn one global dictionary to represent all classes. The study in [21] shows the advantage of learnt dictionaries over predefined dictionaries in classi-fication, where a dictionary is learnt with a discrimination term applied on the coefficients. A discriminative formulation with a linear and bilinear classifier applied to the sparse coefficients is employed in [11]. A discriminative version of K-SVD is presented in [12]. A classifier is jointly learnt with the dictionary and then applied to the coefficients of a test picture to classify it. Applied to face recognition, it offers better results than the K-SVD dictionary. The problem of [12] is extended in the Label Consistent K-SVD method [22], [9]. The dictionary is learnt along with a linear classifier using the sparse coefficients in order to increase the discrimination capability of the coefficients, while another term in the objective directly imposes the similarity of the sparse coefficients among the samples from the same class. The methods in [23] and [24] are based on similar formulations while they also include a graph-regularization term on the sparse coefficients. The authors of [23] further propose to remove the sparse reconstruction term from the objective function and include it only in the constraints of the optimization problem. In the sparse decomposition of a training sample, the coefficients corresponding to other classes are suppressed with a differentiable term based on the -norm in [25], while a graph-regularization term is also included in the objective. A Fisher criterion is applied on the sparse coefficients in the learning in [26]. The dictionary learning problem is formulated in a Bayesian setting in [27], such that sparsity is imposed via class-dependent Bernoulli random vectors, and a classifier is trained on sparse codes. A couple of other methods consider the semi-supervised dictionary learning problem. A linear classifier on sparse codes is learnt in [28] while the unlabeled samples are also incorporated in the discriminative term of the learning objective, proportionally to the confidence of their label estimates. The authors of [29] emphasize that there may be overlapping features between different classes and propose to learn a global dictionary along with the corresponding soft label vectors in a graph-regularized semi-supervised learning scheme.
Some other methods learn one dictionary per class and classify test data based on the reconstruction error over each dictionary. In [10], dictionaries that are both reconstructive and discriminative are learnt for each class by optimizing a sparse reconstruction error term and a discriminative term. The discriminative term in the objective function involves the reconstruction errors of samples over the dictionaries. Test samples are then classified by searching for the dictionary giving the minimum reconstruction error. A smoothing graph cut step is finally applied to refine the label image. A dictionary is learnt for each class in [30] with an incoherence criterion imposed on the dictionaries to make them independent. This incoherence term is also used in [31] where an additional dictionary is also learnt in order to capture the patterns common to different classes.
Finally, there are also discriminative dictionary learning methods relying on a categorical or relational organization of the image classes. A dictionary learning method for multilabel image annotation is proposed in [32], where the image labels are first organized into exclusive groups such that two labels that simultaneously occur in the same training image are in different groups. A discriminative dictionary is then learnt with a Fisher criterion for each label group. Test images are finally classified according to their sparse representations by imposing group sparsity in their sparse coding. The method in [33] learns discriminative dictionaries with a multilevel structure. Their method addresses the particular application of large scale classification with a high number of classes and relies strictly on the availability of a category hierarchy organization of the given classes in the form of a tree model. A global tree-structured dictionary is then learnt where the multilevel tree structure is directly inherited from the given category hierarchy tree model, and the dictionary in each node of the tree is specialized for a group of classes residing under the same subcategory. A similar tree structure is used for emotion classification in [34], where each node is associated with a dictionary and a classifier. While the dictionaries are learnt in an unsupervised manner, the classifiers are trained so as to discriminate between the confused classes branching from that node based on sparse codes. Although the methods in [33] and [34] learn tree-structured multilevel dictionaries, these methods differ significantly from ours in that their multilevel dictionary structures are formed quite differently for different usages and purposes.
Our classification method is based on the learning of discriminative structured dictionaries. Multilevel structured dictionaries, composed of many small dictionaries organized on several levels, have the ability to better specialize and thus more efficiently capture the high variability within a class.
This concept of structured dictionaries has been first introduced in [20], and developed in [6], under the name of Iteration-Tuned Dictionaries (ITD). The structure is based on the idea of learning a different dictionary for each iteration of the pursuit algorithm. Thus, each atom added in the decomposition of a signal is selected in a new dictionary by going down the multilevel dictionary strucuture. Several structures, represented in several levels that contain one or several dictionaries, have been developed within this concept, like the Basic ITD (BITD) composed of one dictionary per level, or the Tree-Structured ITD (TSITD) structured as a tree of dictionaries. In these structures, each dictionary at a level is
Fig. 2. The Adaptive Structure. Each atom at a given level leads to the generation of a specialized dictionary at the next level, learnt from only the data samples selecting that atom. At each level k, all branches without sufficiently many data samples to continue learning a dictionary at the next level are merged together to learn a new dictionary at the next level k + 1.
learnt based on a subset of residuals computed at the previous level. Another tree structure, called Tree K-SVD [35], has been derived from the TSITD structure. Each dictionary it contains is learnt with the K-SVD algorithm [7] with a sparsity of one atom. Starting with one dictionary at the first level, the principle of these tree structures is to learn for each atom at a level one child dictionary at the next level. They are thus quickly composed of too many dictionaries when the number of levels increases, and many can be incomplete or even empty, which can be problematic.
Motivated by these observations, we propose here the discriminative Adaptive Structure by building on our previous work [36], which focused on image compression by learning reconstructive Adaptive Structures. The Adaptive Structure is a new dictionary structure whose topology is adaptively determined during the learning in order to not contain any incomplete dictionary. The branches in the structure are progressively pruned, according to their usage rate, and merged into a unique and more general branch whenever there is not enough data to learn new dictionaries down the branches. This adaptive structure enables the learning of more levels than the tree structure while keeping the total number of atoms reasonable. In the sequel, we first describe the Adaptive Structure in Section III-A. Then, in Section III-B we present the proposed discriminative dictionary learning method based on Adaptive Structures in a supervised learning setting. Finally, in Section III-C, we present our supervised classification algorithm.
A. The Adaptive Structure
The Adaptive Structure demonstrated in Fig. 2 is learnt with a top-down approach level after level. Each dictionary in the structure consists of K atoms and is learnt with K-SVD [7], [37], for a sparsity of one atom. Let denote the set of training samples. The single dictionary at the first level
consisting of K unit-norm atoms is learnt using all training data Y , by setting the residual term of the first level simply as . Each training data sample
, is then approximated by one atom of the first dictionary
as
and the residual vectors
are computed to form the residual set for the next level. The residuals in
are split into K groups
such that each group
consists of the residuals of the training samples selecting the atom
in the first level
For each set with k = 1, ..., K, if it contains sufficiently many residuals to satisfy
a dictionary at the second level is learnt from , where
denotes the cardinality of a set. Otherwise, in order not to create an incomplete dictionary, the dictionary is not learnt and the set of residuals
is saved. At the end of the learning of the second level, all the saved residual sets at this level are merged in
as
The merged residual set is then used to learn a new dictionary
, the dictionary of the “merged branches” at the second level of the structure.
The same procedure is then applied to the dictionaries of the second level to learn the dictionaries of the third level. The residual sets of insufficient cardinality at the third level are merged, together with the residuals from at the previous level, to form
. The residual set
is then used to learn the corresponding dictionary
at the third level.
This procedure is continued to learn the multilevel Adaptive Structure until a desired number of levels is reached. With this method, the branches with a high usage rate, i.e., the branches selected by many training samples, will be further developed to result in new dictionaries down the tree. On the other hand, the branches with a low usage rate will be quickly pruned and the corresponding residuals will be merged to learn rather general dictionaries (in contrast to the more specialized ones residing at non-merged branches). Thus, during this learning process, the structure adapts itself according to the training vectors in order not to contain any incomplete or empty dictionaries.
Once the Adaptive Structure is learnt, the sparse decomposition of a test sample is computed by selecting one atom per level, beginning with the first level and descending down the multilevel structure. Given a test sample y, it is approximated at the first level by an atom of the first-level dictionary selected with the MP algorithm [13] with a sparsity of one
where is the sparse coefficient obtained as
The residual vector is then computed and approximated with the same procedure by another atom
from a dictionary at the second level, the child dictionary of the atom
chosen at the first level. The residual computation and atom selection procedure is continued by descending down the multilevel structure along a branch until a given sparsity is reached. The dictionary to use at each level l is thus determined by the atom
chosen at the previous level in the approximation of y. When the end of a branch is reached, the atom at the next level is selected within the dictionary of the “merged branch” of the structure, and the decomposition continues after that along this branch. For a structured multilevel dictionary D, the reconstruction error of the test sample y for a sparsity of L atoms is thus obtained as
where are the atoms chosen at the levels 1 to L from D and
are the corresponding coefficients.
B. Discriminative Learning with Adaptive Structures
We now describe our proposed method where discriminative Adaptive Structures are learnt for supervised image classifica-tion. We propose to learn one multilevel dictionary with the Adaptive Structure for each class. We have observed that in order to achieve satisfactory performance, it suffices to apply the discriminative learning procedure described below at the first level of the structures where there is only one dictionary. We learn the dictionaries at the other levels with the K-SVD algorithm with a sparsity of 1 atom, by following the Adaptive Structure as described in Section III-A. Since the dictionary structure is learnt with a top-down approach, applying a discrimination-based learning at the top level impacts the other levels as well and has an effect on the whole multilevel dictionary structure.
Let denote the dictionary at the first level of the Adaptive Structure to be learnt for the class c, for c = 1, ..., C. We aim to learn a dictionary that is both reconstructive and discriminative, which efficiently represents the data from its own class but yields a large reconstruction error for the data from other classes. Hence, the dictionaries are learnt considering the data from both their own class and the other classes. In this way, the reconstruction errors of test samples on the learnt class-representative dictionaries can be used to classify test data.
In the following, we first introduce our discriminative dictionary learning objective and then discuss its minimization. Next, we explain how the data samples included in the objective function are chosen and finally present the overall discriminative dictionary learning algorithm.
1) Discrimination model: We propose to update the dictionaries sequentially (atom by atom), by minimizing the following objective function for updating an atom of the dictionary
of class c
with
The first term in the above cost function is a reconstructive term aiming to adapt the atom to the training data from its own class c, where
denotes the restricted subset of data samples from class c that use the atom d in their decomposition. The second term is a discriminative term, whose goal is to push the atom d away from the training samples
of the other classes. Thus, we search for the atom d minimizing the reconstruction error of the data from its own class and maximizing the reconstruction error of the data from the other classes. The positive weight parameter
balances the two terms according to the ratio between the number of samples in
and
as
where denotes the number of columns in a matrix (i.e., the number of data samples) with a slight abuse of notation. The positive constant
adjusts the compromise between reconstruction and discrimination. The exact choice of the samples
for each class c and atom d will be explained later in Section III-B3.
2) Minimization of the objective function: The cost function to minimize can be rewritten as
This is equivalent to
With the constraint , we have
. Hence, we can simplify the cost function to
In order to solve this minimization problem under the constraint , we then apply the Lagrange multipliers method and minimize the function
Setting the derivative of with respect to
to 0 gives
We then evaluate the derivative with respect to d and equate
it to 0 as
∂d = ∂∂dtrdd
λ ∂∂dtr
which gives
Taking the transpose of both sides, we get
This equation is of the form
with
The atom d is thus an eigenvector of A with a unit -norm. Since our objective in (4) imposes the atom to fit the samples
while repulsing it from
, the sought atom is the eigenvector of A corresponding to its maximum eigenvalue.
3) Choice of the sample set : In order to adapt the discrimination term to each class and even to each atom to update, we follow a particular strategy when forming the matrix
that contains the data from the other classes than the current class c. Rather than choosing
to contain all data samples from the other classes, we wish to particularly discriminate the updated atom from the classes most similar to its class. For this purpose, we compute an affinity matrix that represents the similarity between each pair of classes.
In order to compute the class affinity matrix, for each class c we first compute a representative vector that best fits the data samples
. In order to avoid computing an almost constant vector as the representative vector, we first subtract the mean value of each data sample
in
. We then choose the representative vector as the one that maximizes the energy of the data samples when projected onto it as
where is the mean-removed version of the training sample
. The solution of the above problem gives
as the unit- norm eigenvector of
associated with its maximum eigenvalue, where
is the matrix containing the mean-removed samples
. We then obtain the class affinity matrix
such that the affinity
between the i-th and j-th classes is given by the similarity between their class representative vectors as
Hence, S is a symmetric matrix with 1’s on the diagonal and affinity values varying between 0 and 1 on its off-diagonal entries.
With this affinity matrix, we then determine by selecting from each class j a variable number of vectors according to its affinity
with the current class c. If the number of training samples in each class is the same and equal to N, we set the number of samples to be selected from class j as
where the round function rounds the values to the nearest integer. Note that this strategy can be easily adapted to the case where the number of training samples is different for each class, by choosing such that
where contains the samples in
from class j, with
. The samples
are chosen as the samples from class j that have the highest correlation with the atom d to update, i.e., the samples from class j that are the most susceptible to choose d for their sparse decomposition in the dictionary
of class c.
With this strategy, each dictionary becomes more discriminative towards the classes closest to its own class, instead of equally treating all the other classes. Indeed, two classes with high dissimilarity do not necessarily need an extra discrimination criterion to be distinguished.
4) Overall discriminative dictionary learning algorithm: Let us now describe the overall algorithm to learn a multilevel discriminative dictionary for each class c.
The dictionary that composes the first level of the Adaptive Structure for class c is computed as follows. The dictionary is first initialized by training vectors from its own class, randomly selected and normalized to be of unit
-norm. The algorithm iterates between a sparse decomposition step and a dictionary update step as frequently done.
In the sparse decomposition step, the decompositions of the data samples from class c are computed with the MP algorithm [13] for a sparsity of one atom. Thus, for each vector in
, we search for the atom in
that is the most correlated with it. This step will allow us to compute for each atom d the matrix
composed of the training vectors from the class c choosing this atom d at this decomposition step.
The dictionary is then updated sequentially, atom by atom. For each atom d of , the matrix
composed of training vectors from the other classes is formed with respect to the class affinities as described in Section III-B3 and the matrix
is computed. The matrix A in (15) can then be computed and the atom d is updated as the unit-norm eigenvector of A corresponding to the maximum eigenvalue.
Once the discriminative dictionary is computed by alternatingly updating the sparse codes and the atoms, the residual set for the next level is computed from
, and the reconstructive Adaptive Structure learning described in Section III-A continues until the desired number of levels. This procedure is repeated for each class c to obtain a class-representative multilevel structured dictionary
for each class.
C. Classification of test images based on learnt structured dictionaries
Test samples are classified with respect to their reconstruction errors over the learnt multilevel dictionaries as follows. A given test sample y is first decomposed over each one of the C class-representative dictionaries for a given sparsity L, where C is the number of classes. Then the reconstruction error of y is computed over each dictionary of class c, c = 1, ..., C, as described in Section III-A
Here is the atom selected at level l, chosen in the dictionary at level l that corresponds to the atom
selected at the previous level
; and
is the coefficient of
in the decomposition of y.
In this paper, we focus on pixelwise classification of a test image. In this case, the training and test samples are image patches. The test samples y are obtained by taking a square patch around each pixel of the given test image so as to assign a class label to each pixel. In such a setting, it is useful to normalize the reconstruction residuals by the norm of the test patch y, in order to prevent the patches of high norm from dominating the overall label estimation during the smoothing steps discussed in Section IV. We thus consider the normalized error
A simple classification strategy would be to search the class minimizing the error
for each patch y
However, this classification rule leads in general to a fractional segmentation of the test picture resulting in many small and disconnected label support regions. In order to improve the label image and obtain more uniform and smooth label supports, we apply two smoothing steps, discussed in Section IV.
In order to improve the estimate of the label image obtained via the reconstruction errors over the class-representative dictoinaries as described in (18), we apply a label smoothing procedure that comprises two steps: a label expansion step with a graph cut, followed by an erosion step to erode the remaining small undesirable label support regions.
A. Label expansion via graph cuts
The first smoothing step considers an -expansion algorithm minimizing an energy function with a graph cut [38]–[40]. The algorithm estimates the label
of each pixel p by minimizing the following energy function based on a Potts model:
The first term is the data cost and corresponds to the sum on all the pixels P of the cost of assigning a label
to the pixel
. Rather than applying a cost of 0 to the class offering the lowest reconstruction error, and 1 to the other classes as done in [10], we set the data cost as
where is the patch centered around the pixel p and
is the reconstruction error of
over the dictionary of the class
as defined in (17). Such a choice of the data cost provides a ranking of all the classes for each pixel.
The second term is the smoothing cost summed over all neighboring pixels p and q (denoted as ), which is defined as
Hence, if two neighboring pixels p and q share the same label, then the associated cost is 0. Otherwise, this cost is constant and equal to the parameter .
This model encourages a labeling with several large regions whose pixels share the same label. Adapting the parameter makes the label image more smooth or less smooth. By setting it to 0, only the data cost, i.e. the reconstruction errors, is considered and the resulting label image is composed of many small and disconnected regions as label supports. Meanwhile, choosing a too big
will fuse the label supports too much and the estimated label image will contain less label support regions than desired. The parameter
can possibly be chosen as a constant or depending on the class labels
and
.
An -expansion method [38] is applied to minimize the energy function. This method expands at each iteration label after label, searching for the optimal expansion for each label, in order to decrease the energy function J(f). The expansion consists of modifying possibly numerous pixels simultaneously by assigning the current label, called label
, to these pixels. We have used the Matlab wrapper [41] for the experiments.
B. Erosion
After a first smoothing step realized with a label expansion algorithm, some small undesirable label support regions can remain in the label image. In order to remove them, we add an erosion step [42] applied directly on the label image obtained after the first smoothing step and based on the same data cost .
The -erosion algorithm [42] (available online [43]) works with a close variant of the energy function (19) and seeks to erode the small segments in priority (a segment corresponds to a group of connected pixels of the same label) to decrease the energy function. Too small segments are always eroded whereas too big segments are never eroded. The segments between these limits are treated one after the other, beginning with the smaller ones. For each segment, its pixels are relabeled one by one, the segment being eroded by the segments around. If the new labeling of the segment decreases the energy function, then the erosion of the segment is accepted, otherwise it is canceled.
Fig. 3. Texture image dataset: test images.
A slight erosion step is also realized on the label support edges.
The proposed method is tested on a set of texture images, commonly used for supervised classification and texture segmentation. We compare our method to several state-of-the-art dictionary learning and texture classification algorithms.
A. Pre-processing of the data
In order to improve the classification, some pre-processing operations, already used in [10], are performed on the training and test patches. A Gaussian mask of standard deviation 4 is first applied with an element-wise multiplication with the patch, in order to give more weight to the center of the patches, as it is possible that the peripheral pixels of a patch are from a
different class if the patch lies on an edge between two classes. The weight is thus 1 at the central pixel(s) of the patch and decreases with the distance to the center of the patch. Each patch is then sharpened with a Laplacian filter (of size ), each Laplacian filtered patch being subtracted from the patch to get the sharpened patch. Note that when computing the affinity matrix S as described in Section III-B3, we use the non-processed (original) versions of the image patches, as we have observed that pre-processing may impair the estimation of the affinities between classes.
Besides, in order to be able to classify the pixels on the borders of the test picture, we generate some additional pixels along the borders by taking the mirror image of the pixels close to the borders. This allows the classification of the border pixels by using the square patches artificially generated around them.
B. Texture classification
1) Texture image dataset: The dataset composed of texture images, used in our experiments, has first been introduced in [44], and has since been used in several articles dealing with classification and segmentation. It has been created with pictures from the Brodatz album [45], from the Vision Texture database of the MIT, and from the texture image database MeasTex. The pictures have thus been captured with different equipments under different conditions. Each one of the 12 test images in Fig. 3 corresponds to a different supervised classi-fication problem with different texture classes. The number of classes in each problem varies between 2 and 16. The training images corresponding to each one of these 12 supervised classification problems are also available. The training and test images have been taken from different portions of each texture. The dataset is available online [46].
2) Parameters: For each one of the 12 texture classification problems, an Adaptive Structure is learnt per texture class from the corresponding training picture. Overlapping blocks are extracted from these
pictures to learn each dictionary structure on 62001 training vectors. The structures are composed of complete dictionaries of 64 atoms. We limit the size of each dictionary in the structures so that they capture the characteristics of their own class and do not become too efficient for the representation of other classes. The first level of the structures, made discriminant, is learnt in 50 iterations, whereas the next levels are learnt in 10 iterations. The dictionaries in the structures are initialized with randomly chosen training vectors. The parameter
balancing the reconstruction and discrimination at the first level is empirically set to
.
The test patches, also of size pixels, are decomposed over each class-representative dictionary structure of the corresponding texture classification problem, for a sparsity of 2 atoms. Since in a classification problem we do not look for the best approximation of a patch but rather would like to classify it based on its reconstruction errors over the different dictionary structures, it is better to avoid high sparsity values. In practice, we have observed that the sparsity of 2 atoms gives good results in general.
For the first smoothing step with the graph cut, the smoothing parameter is experimentally set to the constant value 0.16 for all different label pairs
with
, and to 0 for
, which has been observed to yield good results. Finally, for the second smoothing step of erosion, the parameter1
is set to 2 as in [42]. Areas of less than 2000 pixels are always eroded whereas areas of more than 10000 pixels never are. Between these limits, the erosion depends on the minimization of the cost function. A slight erosion of 2 pixels is also performed on the edges.
3) Results: We first present in Fig. 4 some example atoms from the multilevel dictionaries learnt with the proposed algorithm for the classification problem of experiment 6 (Fig. 3(f)). Sample regions from the training images of the texture classes 1 and 12 of this experiment are shown in Figures 4(a) and 4(b).
Fig. 4. Some example multilevel dictionaries learnt for classes 1 and 12 of experiment 6
Figures 4(c) and 4(d) show some of the multilevel dictionaries learnt for these two texture classes. For both texture classes, the first-level dictionary is displayed, together with the second-level dictionaries originating from two different atoms of the first-level dictionary.
It can be observed that the atoms in the first-level dictionaries capture well the main characteristics of each class. The first-level dictionary of class 1 consists of atoms containing rather smooth and curvy features, whereas the first-level dictionary of class 12 contains atoms capturing straight edges and corners. We can observe that, due to the large intra-class variation in these texture classes rich in content, the atoms in the first-level dictionary of the same class can be quite different from each other. The proposed multi-level dictionary structure is then seen to be well-adapted to this setting as it allows the specialization of the atoms at later levels based on the structure of the atoms at earlier levels they originate from. Indeed, it can be seen in Fig. 4 that the second-level dictionaries derived from two different first-level atoms of the same class capture finer details but tend to have different characteristics. In class 1, the second-level dictionary derived from atom 2 of the first level inherits the roundshaped circular structure of atom 2, while the second-level dictionary derived from atom 45 is tuned to represent more straight and diagonally-oriented texture features. Similarly, in class 12, the second-level dictionary originating from atom 60 contains mainly horizontally oriented atoms as the dominant orientation of the atom 60 is horizontal, while the second-level dictionary originating from atom 22 captures both vertically and horizontally oriented corner-like features just like the atom 22. This confirms that the dictionaries learnt at different levels are successfully specialized to adapt to different fine and coarse texture features present in image classes of large intra-class variation.
Next, we demonstrate the effect of the different stages of
Fig. 5. Test image 4 at the successive steps of the classification algorithm.
the proposed method. In Fig. 5 the label images obtained after each step of our classification method are shown for the test image 4. We can see the benefits of the smoothing steps whereas a straighforward estimation of the class labels based on the minimum reconstruction error leads to a noisy segmentation (Fig. 5(b)). The label expansion algorithm via graph cut is crucial to create larger label support regions and suppress the majority of the small isolated label supports (Fig. 5(c)). Note that the graph cut algorithm does not take the label image in Fig. 5(b) as an input parameter but the data cost matrix consisting of the reconstruction errors computed for each pixel and each class. The final erosion step erodes the last remaining small label supports and slightly erodes the edges in order to obtain a clean label image (Fig. 5(d)), close to the ground truth (Fig. 5(e)). The erosion algorithm takes the label image obtained after the graph cut algorithm (Fig. 5(c)) as an initial segmentation, and uses the same data cost matrix.
Finally, we compare in Table I the classification error rates of our method with several methods from the literature. In [44], introducing the dataset, numerous filtering methods are compared and the best result obtained for each test picture is presented. The authors of [47] have improved the previous results using the Local Binary Pattern operator on texture patches and by computing histograms of the values to characterize a texture.2 A multi-scale version, considering several patch sizes, has also been studied. The authors of [48] have then proposed to extract texture discriminative features in the frequency domain by applying a Fourier transform in polar coordinates, followed by dimensionality reduction via PCA (Principal Components Analysis) or the computation of Fisher coefficients. Centroids are then computed for each class with a vector quantization method. The results presented in [10] are also included in our comparison, using reconstructive (R) or discriminative (D) dictionaries, and a graph cut based smoothing method. Finally, the results obtained with the -erosion method in [42] are added. A dictionary is learnt per class with the RLS-DLA algorithm [49] and class labels are
TABLE I CLASSIFICATION ERROR RATES (IN %) OF OUR METHOD FOR THE TEST IMAGES IN COMPARISON WITH SEVERAL METHODS FROM THE STATE OF THE ART. THE BEST TWO RESULTS FOR EACH IMAGE ARE IN BOLD.
estimated based on the approximation errors computed for each pixel and each class in an energy minimization step. In this step, a Gaussian filter is applied before applying the -erosion algorithm, followed by further erosion of the edges of the label support regions in order to smooth their borders. In the smoothing steps of our method, the label expansion algorithm uses a random ordering of the labels to be expanded in each iteration. The classification results can thus change from one trial to another, despite the use of the same dictionaries and parameters. We thus perform the smoothing steps 20 times and report the average error over these 20 random trials. The difference between different trials remains small in general for the same image.
It is seen in Table I that, over the 12 different texture classification experiments, our method gives the best results in three experiments and is among the best two methods in nine experiments. Except for two problematic images (test images 5 and 6) the classification error of our method does not exceed that of the state of the art by more than 1%. Our average classification error over the 12 experiments is 2.86%, which is the smallest among the compared methods.
Some example classification results are presented for several test images in Figures 6, 7, and 8. We observe that the only zones of misclassification are concentrated within a thin band over the edges between label supports, and the classification performance of our method is quite satisfactory in these experiments.
Fig. 6. Test image 7 and its label image compared to the ground truth.
Meanwhile, the texture classification experiments corresponding to the test images 5 and 6 are particularly challenging and these are the only settings where our method gives a
Fig. 7. Test image 9 and its label image compared to the ground truth.
Fig. 8. Test image 11 and its label image compared to the ground truth.
classification error rate superior to 4% (Table I). Our results on these two test images are presented in Figures 9 and 10. For the test image 5 in Fig. 9, the main factor increasing the classification error is the misclassification of the region in the bottom-left corner of the picture, where the label of the bottom texture spreads too much on the leftmost texture. Indeed, the border between these two textures on the test image is difficult to see even for the human eye and the two texture classes have very similar characteristics. Hence, the erroneous label can easily be diffused over the leftmost texture and it is not surprising to observe a relatively high misclassification rate in this experiment.
For the test image 6 given in Fig. 10, the problem is different. When we observe the final label image (Fig. 10(d)), the major regions of misclassification are over the textures on the left at the bottom, where the label spreads too much, and on the right at the bottom, where the whole texture has a wrong label. In the label image obtained after the graph cut based smoothing step (Fig. 10(c)), we notice that the misclassification regions for these two textures lie respectively on the left part of the first one and at the top of the second one. When we look at these specific areas in the original image (Fig. 10(a)), we can see that they seem over-exposed and thus brighter in comparison to the rest of the textures. Meanwhile, this over-exposure is not present in the training images, which can disturb the classification algorithm as this kind of variation has not been learnt, and lead to confusion with the other classes.
We also notice that textures with regular and small patterns are easily classified, even without the smoothing steps (Fig. 10(b)), as the learning is easier.
C. Enriching the training dataset
In order to deal with the over-exposure problems, particularly present on the test image 6, we propose to enrich the training dataset by adding some over-exposed versions of the
Fig. 9. Test image 5 and the successive steps of the classification algorithm.
Fig. 10. Test image 6 and the successive steps of the classification algorithm.
training images. An over-exposed image, called , is created from the training image I with the following equations
where is the exposure offset added to the image I, M is the maximum value in
after the exposure offset
has been added to the image I, and m is the minimum value in I.
In this way, the training set corresponding to the test image 6 is augmented by generating different over-exposed versions of the training images with over-exposure levels , and 700. We balance the number of original and over-exposed training samples by generating a total of 40000 over-exposed samples (10000 samples for each
value) added to 40000 original training samples. Dictionaries are then learnt from this new training dataset for the classification problem 6.
Some classification results obtained for the test image are presented in Fig. 11. It can be observed that augmenting the training data set with over-exposed samples has the potential to improve the classification performance. However, we have also observed that in the smoothing step, the expansion of the labels with a random ordering in the graph cut method may produce a more erroneous label image in some random realizations of the experiment. We have obtained an average classification error of 4.36% over 20 different random repetitions of the same experiment, whereas the error rate was 8.20% before enriching the learning dataset for the test image 6. If we take into account this new error rate for the test image 6, the mean classification error rate computed over the 12 experiments is reduced to 2.54% from its previous value 2.86% in Table I.
Fig. 11. Label images obtained for the test image 6 after enriching the training dataset with over-exposed patches.
Enriching the training dataset has thus improved the results. The new over-exposed training data have been helpful for learning dictionary structures containing more information and more conscious of this possible intra-class exposure variation. This solution could be applied to other test images as well undergoing the same problem.
In this paper, we have proposed a method for learning discriminative multilevel structured dictionaries for supervised image classification. We have presented a classification algorithm that learns one dictionary per class, where test images are classified with respect to their reconstruction errors on these dictionaries. For the construction of the dictionaries, we have adopted the Adaptive Structure derived from a tree structure, which we made discriminant with a novel objective function to learn multilevel dictionaries that are both reconstructive and discriminative. The proposed dictionaries thus have a high learning capacity due to their multilevel topology and are well-adapted to the classification of images with high intra-class variation. An affinity matrix has been incorporated in the objective function to adjust the discrimination of a class from the others depending on their pairwise affinities. A combination of two smoothing methods has been used to obtain a clean segmentation and classification of the textures in the test image. Experiments conducted on a common dataset of texture images have shown competitive results with the state of the art. We have finally proposed to enrich the training dataset to deal with over-exposure problems.
Enriching the dataset seems promising and future efforts may focus on more complex and realistic over-exposure models. Applying discrimination to all the dictionaries in the multilevel structure may also potentially be of interest, but might increase the complexity of the learning. Finally, a last future direction is to explore other affinity measures in the construction of the affinity matrix, in order to better characterize the pairwise similarities of classes and thus enhance the discrimination capability of the learnt dictionaries.
This work was supported by Airbus Defence & Space.
[1] M. Elad and M. Aharon, “Image denoising via sparse and redundant representations over learned dictionaries,” IEEE Transactions on Image Processing, vol. 15, no. 12, pp. 3736–3745, 2006.
[2] W. Dong, X. Li, D. Zhang, and G. Shi, “Sparsity-based image denoising via dictionary learning and structural clustering,” in IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 457–464.
[3] R. M. Figueras i Ventura, P. Vandergheynst, and P. Frossard, “Low- rate and flexible image coding with redundant representations,” IEEE Transactions on Image Processing, vol. 15, no. 3, pp. 726–739, 2006.
[4] O. G. Sezer, O. Harmanci, and O. G. Guleryuz, “Sparse orthonormal transforms for image compression,” in IEEE International Conference on Image Processing, 2008, pp. 149–152.
[5] O. Bryt and M. Elad, “Compression of facial images using the K-SVD algorithm,” Journal of Visual Communication and Image Representation, vol. 19, no. 4, pp. 270–282, 2008.
[6] J. Zepeda, C. Guillemot, and E. Kijak, “Image Compression Using Sparse Representations and the Iteration-Tuned and Aligned Dictionary,” IEEE Journal of Selected Topics in Signal Processing, vol. 5, pp. 1061– 1073, 2011.
[7] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation,” IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4311–4322, 2006.
[8] K. Engan, S. O. Aase, and J. H. Husoy, “Method of optimal directions for frame design,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 5, 1999, pp. 2443–2446.
[9] Z. Jiang, Z. Lin, and L. S. Davis, “Label consistent K-SVD: learning a discriminative dictionary for recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 11, pp. 2651–2664, 2013.
[10] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, “Discrimina- tive learned dictionaries for local image analysis,” in IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
[11] ——, “Supervised dictionary learning,” in Advances in neural information processing systems (NIPS 2008), 2008, pp. 1033–1040.
[12] Q. Zhang and B. Li, “Discriminative K-SVD for dictionary learning in face recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 2691–2698.
[13] S. G. Mallat and Z. Zhang, “Matching pursuits with time-frequency dictionaries,” IEEE Transactions on Signal Processing, vol. 41, no. 12, pp. 3397–3415, 1993.
[14] Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad, “Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition,” in Conference Record of The Twenty-Seventh Asilomar Conference on Signals, Systems and Computers., 1993, pp. 40–44.
[15] S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition by basis pursuit,” SIAM journal on scientific computing, vol. 20, no. 1, pp. 33–61, 1998.
[16] K. Engan, B. D. Rao, and K. Kreutz-Delgado, “Frame design using FOCUSS with method of optimal directions (MOD),” in Proc. NORSIG, vol. 99, 1999.
[17] R. Rubinstein, M. Zibulevsky, and M. Elad, “Double sparsity: Learning sparse dictionaries for sparse signal approximation,” IEEE Transactions on Signal Processing, vol. 58, no. 3, pp. 1553–1564, 2010.
[18] M. Aharon, M. Elad, and A. M. Bruckstein, “K-SVD and its non- negative variant for dictionary design,” in Optics & Photonics 2005, 2005.
[19] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online learning for matrix factorization and sparse coding,” The Journal of Machine Learning Research, vol. 11, pp. 19–60, 2010.
[20] J. Zepeda, C. Guillemot, E. Kijak et al., “The iteration-tuned dictionary for sparse representations,” in Proc. of the IEEE International Workshop on MMSP, 2010.
[21] F. Rodriguez and G. Sapiro, “Sparse representations for image clas- sification: Learning discriminative and reconstructive non-parametric dictionaries,” DTIC Document, Tech. Rep., 2008.
[22] Z. Jiang, Z. Lin, and L. S. Davis, “Learning a discriminative dictionary for sparse coding via label consistent K-SVD,” in IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 1697–1704.
[23] P. Zhou, C. Zhang, and Z. Lin, “Bilevel model-based discriminative dictionary learning for recognition,” IEEE Trans. Image Processing, vol. 26, no. 3, pp. 1173–1187, 2017.
[24] Y. Yankelevsky and M. Elad, “Structure-aware classification using supervised dictionary learning,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2017, pp. 4421–4425.
[25] X. Wang and Y. Gu, “Cross-label suppression: A discriminative and fast dictionary learning with group regularization,” IEEE Trans. Image Processing, vol. 26, no. 8, pp. 3859–3873, 2017.
[26] M. Yang, D. Zhang, and X. Feng, “Fisher discrimination dictionary learning for sparse representation,” in IEEE International Conference on Computer Vision, 2011, pp. 543–550.
[27] N. Akhtar, F. Shafait, and A. S. Mian, “Discriminative bayesian dictio- nary learning for classification,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 12, pp. 2374–2388, 2016.
[28] X. Wang, X. Guo, and S. Z. Li, “Adaptively unified semi-supervised dictionary learning with active points,” in IEEE International Conference on Computer Vision, 2015, pp. 1787–1795.
[29] M. Jian and C. Jung, “Semi-supervised bi-dictionary learning for image classification with smooth representation-based label propagation,” IEEE Trans. Multimedia, vol. 18, no. 3, pp. 458–473, 2016.
[30] I. Ramirez, P. Sprechmann, and G. Sapiro, “Classification and clustering via dictionary learning with structured incoherence and shared features,” in IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 3501–3508.
[31] S. Kong and D. Wang, “A dictionary learning approach for classification: Separating the particularity and the commonality,” in 12th European Conference on Computer Vision, 2012, pp. 186–199.
[32] X. Cao, H. Zhang, X. Guo, S. Liu, and D. Meng, “SLED: semantic label embedding dictionary representation for multilabel image annotation,” IEEE Trans. Image Processing, vol. 24, no. 9, pp. 2746–2759, 2015.
[33] L. Shen, G. Sun, Q. Huang, S. Wang, Z. Lin, and E. Wu, “Multi-level discriminative dictionary learning with application to large scale image classification,” IEEE Trans. Image Processing, vol. 24, no. 10, pp. 3109– 3123, 2015.
[34] H. Chen, M. Z. Comiter, H. T. Kung, and B. McDanel, “Sparse coding trees with application to emotion classification,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops,, 2015, pp. 77–86.
[35] J. Aghaei Mazaheri, C. Guillemot, and C. Labit, “Learning a tree- structured dictionary for efficient image representation with adaptive sparse coding,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.
[36] ——, “Learning an adaptive dictionary structure for efficient image sparse coding,” in Picture Coding Symposium, 2013.
[37] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD Matlab toolbox.” [Online]. Available: http://www.cs.technion.ac.il/$\sim$elad/software/
[38] Y. Boykov, O. Veksler, and R. Zabih, “Fast approximate energy min- imization via graph cuts,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 11, pp. 1222–1239, 2001.
[39] V. Kolmogorov and R. Zabin, “What energy functions can be minimized via graph cuts?” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 2, pp. 147–159, 2004.
[40] Y. Boykov and V. Kolmogorov, “An experimental comparison of min- cut/max-flow algorithms for energy minimization in vision,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 9, pp. 1124–1137, 2004.
[41] S. Bagon, “Matlab Wrapper for Graph Cut,” Dec. 2006. [Online]. Avail- able: http://www.wisdom.weizmann.ac.il/$\sim$bagon/matlab.html
[42] K. Skretting and K. Engan, “Energy Minimization by alpha-Erosion for Supervised Texture Segmentation,” in International Conference on Image Analysis and Recognition, 2014, pp. 207–214.
[43] K. Skretting, “Karl Skretting web page. Texture classification (segmentation) tools for Matlab.” [Online]. Available: http://www.ux. uis.no/$\sim$karlsk/tct/index.html
[44] T. Randen and J. H. Husoy, “Filtering for texture classification: A comparative study,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 4, pp. 291–310, 1999.
[45] P. Brodatz, Textures: A Photographic Album for Artists and Designers, new york: dover ed., 1966.
[46] T. Randen, “Trygve Randen web page.” [Online]. Available: http: //www.ux.uis.no/$\sim$tranden/
[47] T. M¨aenp¨a¨a, M. Pietik¨ainen, and T. Ojala, “Texture classification by multi-predicate local binary pattern operators,” in 15th Int. Conf. Pattern Recognition, 2000, pp. 3951–3954.
[48] A. Di Lillo, G. Motta, and J. A. Storer, “Texture classification based on discriminative features extracted in the frequency domain,” in IEEE International Conference on Image Processing, vol. 2, 2007, pp. II–53.
[49] K. Skretting and K. Engan, “Recursive least squares dictionary learning algorithm,” IEEE Transactions on Signal Processing, vol. 58, no. 4, pp. 2121–2130, 2010.