Multilabel classification (MLC) extends conventional single label classification (SLC) by allowing an instance to be assigned to multiple labels from a label set. It occurs naturally from a wide range of practical problems, such as document categorization, image classification, music annotation, webpage classification and bioinformatics applications, where each instance can be simultaneously described by several class labels out of a candidate label set. MLC is also closely related to many other research areas, such as subspace learning [1], nonnegative matrix factorization [2], multi-view learning [3]
This research was supported in part by Australian Research Council Projects FT-130101457, DP-140102164 and LE-140100061. The funding support from the Hong Kong government under its General Research Fund (GRF) scheme (Ref. no. 152202/14E) and the Hong Kong Polytechnic University Central Research Grant is greatly appreciated.
Q. Li is with the Centre for Quantum Computation and Intelligent Systems, Faculty of Engineering and Information Technology, University of Technology Sydney, 81 Broadway, Ultimo, NSW 2007, Australia, and also with Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong (e-mail: leetsiang.cloud@gmail.com).
W. Bian and D. Tao are with the Centre for Quantum Computation and Intelligent Systems, Faculty of Engineering and Information Technology, University of Technology Sydney, 81 Broadway, Ultimo, NSW 2007, Australia (e-mail: wei.bian@uts.edu.au, dacheng.tao@uts.edu.au).
B. Xie is with College of Computing, Georgia Institute of Technology, Atlanta, GA 30345, USA (email: zixu1986@gmail.com).
J. You is with Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong (e-mail: csyjia@comp.polyu.edu.hk).
c20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
and multi-task learning [4]. Because of its great generality and wide applications, MLC has received increasing attentions in recent years from machine learning, data mining, to computer vision communities, and developed rapidly with both algorithmic and theoretical achievements [5], [6], [7], [8], [9], [10].
The key feature of MLC that makes it distinct from SLC is label correlation, without which classifiers can be trained independently for each individual label and MLC degenerates to SLC. The correlation between different labels can be verified by calculating the statistics, e.g., test and Pearson’s correlation coefficient, of their distributions. According to [11], there are two types of label correlations (or dependence), i.e., the conditional correlations and the unconditional correlations, wherein the former describes the label correlations conditioned on a given instance while the latter summarizes the global label correlations of only label distribution by marginalizing out the instance. From a classification point of view, modelling of label conditional correlations is preferable since they are directly related to prediction; however, proper utilization of unconditional correlations is also helpful, but in an average sense because of the marginalization. Accordingly, quite a number of MLC algorithms have been proposed in the past a few years, by exploiting either of the two types of label correlations,1 and below, we give a brief review of the representative ones. As it is a very big literature, we cannot cover all the algorithms. The recent surveys [8], [9] contain many references omitted from this paper.
• By exploiting unconditional label correlations: A large class of MLC algorithms that utilize unconditional label correlations are built upon label transformation. The key idea is to find new representation for the label vector (one dimension corresponds to an individual label), so that the transformed labels or responses are uncorrelated and thus can be predicted independently. Original label vector needs to be recovered after the prediction. MLC algorithms using label transformation include [12] which utilizes low-dimensional embedding and [7] and [13] which use random projections. Another strategy of using unconditional label correlations, e.g., used in the stacking method [6] and the “Curds” & “Whey” procedure [14], is first to predict each individual label independently and correct/adjust the prediction by proper post-processing. Algorithms are also proposed based on co-occurrence or structure information extracted from the label set,
which include random k-label sets (RAKEL) [15], pruned problem transformation (PPT) [16], hierarchical binary relevance (HBR) [17] and hierarchy of multilabel classi-fiers (HOMER) [8]. Regression-based models, including reduced-rank regression and multitask learning, can also be used for MLC, with an interpretation of utilizing unconditional label correlations [11].
• By exploiting conditional label correlations: MLC algorithms in this category are diverse and often developed by specific heuristics. For example, multilabel K-nearest neighbour (MLkNN) [5] extends KNN to the multilabel situation, which applies maximum a posterior (MAP) label prediction by obtaining the prior label distribution within the K nearest neighbours of an instance. Instancebased logistic regression (IBLR) [6] is also a localized algorithm, which modifies logistic regression by using label information from the neighbourhood as features. Classifier chain (CC) [18], as well as its ensemble and probabilistic variants [19], incorporate label correlations into a chain of binary classifiers, where the prediction of a label uses previous labels as features. Channel coding based MLC techniques such as principal label space transformation (PLST) [20] and maximum margin output coding (MMOC) [21] proposed to select codes that exploits conditional label correlations. Graphical models, e.g., conditional random fields (CRFs) [22], are also applied to MLC, which provides a richer framework to handle conditional label correlations.
A. Multilabel Image Classification
Multilabel image classification belongs to the generic scope of MLC, but handles the specific problem of predicting the presence or absence of multiple object categories in an image. Like many related high-level vision tasks such as object recognition [23], [24], visual tracking [25], image annotation [26], [27], [28] and scene classification [29], [30], [31], multilabel image classification [32], [33], [34], [35], [36], [37] is very challenging due to large intra-class variation. In general, the variation is caused by viewpoint, scale, occlusion, illumination, semantic context, etc.
On the one hand, many effective image representation schemes have been developed to handle this high-level vision task. Most of the classical approaches derive from handcrafted image features, such as GIST [38], dense SIFT [39], VLAD [40], and object bank [41]. In contrast, the very recent deep learning techniques have also been developed for image feature learning, such as deep CNN features [42], [43]. These techniques are more powerful than classical methods when learning from a very large amount of unlabeled images.
On the other hand, label correlations have also been exploited to significantly improve image classification performance. Most of the current multilabel image classification algorithms are motivated by considering label correlations conditioned on image features, thus intrinsically falls into the CRFs framework. For example, probabilistic label enhancement model (PLEM) [44] designed to exploit image label co-occurrence pairs based on a maximum spanning tree construction and a piecewise procedure is utilized to train the pairwise CRFs model. More recently, clique generating machine (CGM) [45] proposed to learn the image label graph structure and parameters by iteratively activating a set of cliques. It also belongs to the CRFs framework, but the labels are not constrained to be all connected which may result in isolated cliques.
B. Motivation and Organization
Correlated logistic model (CorrLog) provides a more principled way to handle conditional label correlations, and enjoys several favourable properties: 1) built upon independent logistic regressions (ILRs), it offers an explicit way to model the pairwise (second order) label correlations; 2) by using the pseudo likelihood technique, the parameters of CorrLog can be learned approximately with a computational complexity linear with respect to label number; 3) the learning of CorrLog is stable, and the empirically learned model enjoys a generalization error bound that is independent of label number. In addition, the results presented in this paper extend our previous study [46] in following aspects: 1) we introduce elastic net regularization to CorrLog, which facilitates the utilization of the sparsity in both feature selection and label correlations; 2) a learning algorithm for CorrLog based on soft thresholding is derived to handle the nonsmoothness of the elastic net regularization; 3) the proof of generalization bound is also extended for the new regularization; 4) we apply CorrLog to multilabel image classification, and achieve competitive results with the state-of-the-art methods of this area.
To ease the presentation, we first summarize the important notations in Table I. The rest of this paper is organized as follows. Section II introduces the model CorrLog with elastic net regularization. Section III presents algorithms for learning CorrLog by regularized maximum pseudo likelihood estimation, and for prediction with CorrLog by message passing. A generalization analysis of CorrLog based on the concept of algorithm stability is presented in Section IV. Section V to Section VII report results of empirical evaluations, including experiments on synthetic dataset and on benchmark multilabel image classification datasets.
We study the problem of learning a joint prediction y = , where the instance space and the label space . By assuming the conditional independence among labels, we can model MLC by a set of independent logistic regressions (ILRs). Specifically, the conditional probability of ILRs is given by
where is the coefficients for the i-th logistic regression (LR) in ILRs. For the convenience of expression,
TABLE I SUMMARY OF IMPORTANT NOTATIONS THROUGHOUT THIS PAPER.
the bias of the standard LR is omitted here, which is equivalent to augmenting the feature of x with a constant.
Clearly, ILRs (1) enjoys several merits, such as, it can be learned efficiently, in particular with a linear computational complexity with respect to label number m, and its probabilistic formulation inherently helps deal with the imbalance of positive and negative examples for each label, which is a common problem encountered by MLC. However, it ignores entirely the potential correlation among labels and thus tends to under-fit the true posterior , especially when the label number m is large.
A. Correlated Logistic Regressions
CorrLog tries to extend ILRs with as small effort as possible, so that the correlation among labels is explicitly modelled while the advantages of ILRs can be also preserved. To achieve this, we propose to augment (1) with a simple function q(y) and reformulate the posterior probability as
As long as q(y) cannot be decomposed into independent product terms for individual labels, it introduces label correlations into p(y|x). It is worth noticing that we assumed q(y) to be independent of x. Therefore, (2) models label correlations in an average sense. This is similar to the concept of “marginal correlations” in MLC [11]. However, they are intrinsically different, because (2) integrate the correlation into the posterior probability, which directly aims at prediction. In addition, the idea used in (2) for correlation modelling is also distinct from the “Curds and Whey” procedure in [14] which corrects outputs of multivariate linear regression by reconsidering their correlations to the true responses.
In this paper, we choose q(y) to be the following quadratic form,
It means that and are positively correlated given and negatively correlated given . It is also possible to define as functions of x, but this will drastically increase the number of model parameters, e.g., by if linear functions are used.
By substituting (3) into (2), we obtain the conditional probability for CorrLog
where the model parameter contains and . It can be seen that CorrLog is a simple modification of (1), by using a quadratic term to adjust the joint prediction, so that hidden label correlations can be exploited. In addition, CorrLog is closely related to popular statistical models for joint modelling of binary variables. For example, conditional on x, (4) is exactly an Ising model [47] for y. It can also be treated as a special instance of CRFs [22], by defining features and . Moreover, classical model multivariate probit (MP) [48] also models pairwise correlations in y. However, it utilizes Gaussian latent variables for correlation modelling, which is essentially different from CorrLog.
B. Elastic Net Regularization
Given a set of training data , CorrLog can be learned by regularized maximum log likelihood estimation (MLE), i.e.,
where is the negative log likelihood
and is a properly chosen regularization. A possible choice for is the regularizer,
with being the weighting parameters. The regularization enjoys the merits of computational flexibility and learning stability. However, it is unable to exploit any sparsity that can be possessed by the problem at hand. For example, for MLC, it is likely that the prediction of each label only depends on a subset of the D features of x, which implies the sparsity of . Besides, can also be sparse since not all labels in y are correlated to each other. regularizer is another choice for , especially regarding model sparsity. Nevertheless, it has been noticed by several studies that regularized algorithms are inherently unstable, that is, a slight change of the training data set can lead to substantially different prediction models. Based on above consideration, we propose to use the elastic net regularizer [49], which is a combination of and regularizers and inherits their individual advantages, i.e., learning stability and model sparsity,
where controls the trade-off between the regularization and the regularization, and large encourages a high level of sparsity.
In this section, we derive algorithms for learning and prediction with CorrLog. The exponentially large size of the label space makes exact algorithms for CorrLog computationally intractable, since the conditional probability (4) needs to be normalized by the partition function
which is a summation over an exponential number of terms. Thus, we turn to approximate learning and prediction algorithms, by exploiting the pseudo likelihood and the message passing techniques.
A. Approximate Learning via Pseudo Likelihood
Maximum pseudo likelihood estimation (MPLE) [50] provides an alternative approach for estimating model parameters, especially when the partition function of the likelihood cannot be evaluated efficiently. It was developed in the field of spatial dependence analysis and has been widely applied to the estimation of various statistical models, from the Ising model [47] to the CRFs [51]. Here, we apply MPLE to the learning of parameter in CorrLog.
The pseudo likelihood of the model over m jointly distributed random variables is defined as the product of the conditional probability of each individual random variables conditioned on all the rest ones. For CorrLog (4), its pseudo likelihood is given by
where and the conditional probability can be directly obtained from (4),
Accordingly, the negative log pseudo likelihood over the training data D is given by
To this end, the optimal model parameter of CorrLog can be learned approximately by the elastic net regularized MPLE,
where and are tuning parameters.
A First-Order Method by Soft Thresholding: Problem (13) is a convex optimization problem, thanks to the convexity of the logarithmic loss function and the elastic net regularization, and thus a unique optimal solution. However, the elastic net regularization is non-smooth due to the norm regularizer, which makes direct gradient based algorithm inapplicable. The main idea of our algorithm for solving (13) is to divide the objective function into smooth and non-smooth parts, and then apply the soft thresholding technique to deal with the nonsmoothness.
Denoting by the smooth part of , i.e.,
its gradient at the k-th iteration is given by
with
(16) Then, a surrogate of the objective function in (13) can be obtained by using , i.e.,
The parameter in (17) servers a similar role to the variable updating step size in gradient descent methods, and it is set such that is larger than the Lipschitz constant of . For such , it can be shown that
and . Therefore, the update of can be realized by the minimization
which is solved by the soft thresholding function , i.e.,
where
Iteratively applying (19) until convergence provides a first-order method for solving (13). Algorithm 1 presents the pseudo code for this procedure.
Remark 1 From the above derivation, especially equations (15) and (19), the computational complexity of our learning algorithm is linear with respect to the label number m. Therefore, learning CorrLog is no more expensive than learning m independent logistic regressions, which makes CorrLog scalable to the case of large label numbers.
Remark 2 It is possible to further speed up the learning algorithm. In particular, Algorithm 1 can be modified to have the optimal convergence rate in the sense of Nemirovsky and Yudin [52], i.e., wherein k is the number of iterations. However, its convergence is usually as slow as in standard gradient descent methods. Actually, we only need to replace the current variable in the surrogate (17) by a weighted combination of the variables from previous iterations. As such modification is a direct application of the fast iterative shrinkage thresholding, [53], we do not present the details here but leave readers to the reference.
B. Joint Prediction by Message Passing
For MLC, as the labels are not independent in general, the prediction task is actually a joint maximum a posterior (MAP) estimation over p(y|x). In the case of CorrLog, suppose the model parameter is learned by the regularized MPLE from the last subsection, the prediction of for a new instance x can be obtained by
We use the belief propagation (BP) to solve (21) [54]. Specifically, we run the max-product algorithm with uniformly initialized messages and an early stopping criterion with 50 iterations. Since the graphical model defined by in (21) has loops, we cannot guarantee the convergence of the algorithm. However, we found that it works well on all experiments in this paper.
An important issue in designing a machine learning algorithm is generalization, i.e., how the algorithm will perform on the test data compared to on the training data. In the section, we present a generalization analysis for CorrLog, by using the concept of algorithmic stability [55]. Our analysis follows two steps. First, we show that the learning of CorrLog by MPLE is stable, i.e., the learned model parameter does not vary much given a slight change of the training data set D. Then, we prove that the generalization error of CorrLog can be bounded by the empirical error, plus a term related to the stability but independent of the label number m.
A. The Stability of MPLE
The stability of a learning algorithm indicates how much the learned model changes according to a small change of the training data set. Denote by a modified training data set the same with D but replacing the k-th training example by another independent example . Suppose and are the model parameters learned by MPLE (13) on D and , respectively. We intend to show that the difference between these two models, defined as
∥−
is bounded by an order of O(1/n), so that the learning is stable for large n.
learned on , which is the same with D but without the k-th example
where
The following Lemma provides an upper bound of the difference .
Lemma 1. Given and defined in (13), and defined in (23), it holds for ,
Proof. Denote by RHS the righthand side of (25), we have
Furthermore, the definition of implies . Combining these two we have (25). This completes the proof.
Next, we show a lower bound of the difference .
Lemma 2. Given and defined in (13), and defined in (23), it holds for ,
Proof. We define the following function
Then, for (26), it is sufficient to show that . By using (13), we have
It is straightforward to verify that and in (13) have the same subgradient at , i.e.,
Since minimizes , we have and thus , which implies also minimizes . Therefore .
In addition, by checking the Lipschitz continuous property of , we have the following Lemma 3.
Lemma 3. Given defined in (13) and defined in (23), it holds for and
Proof. First, we have
and
That is is Lipschitz continuous with respect to and , with constant 2 and 4, respectively. Therefore, (29) holds.
By combining the above three Lemmas, we have the following Theorem 1 that shows the stability of CorrLog.
Theorem 1. Given model parameters and learned on training datasets D and , respectively, both by (13), it holds that
Proof. By combining (25), (26) and (29), we have
Further, by using
we have
Since and differ from each other with only the k-th training example, the same argument gives
Then, (30) is obtained immediately. This completes the proof.
B. Generalization Bound
We first define a loss function to measure the generalization error. Considering that CorrLog predicts labels by MAP estimation, we define the loss function by using the log probability
where the constant γ > 0 and
The loss function (35) is defined analogously to the loss function used in binary classification, where is replaced with the margin if a linear classifier w is used. Besides, (35) gives a 0 loss only if all dimensions of y are correctly predicted, which emphasizes the joint prediction in MLC. By using this loss function, the generalization error and the empirical error are given by
According to [55], an exponential bound exists for if CorrLog has a uniform stability with respect to the loss function (35). The following Theorem 2 shows this condition holds.
Theorem 2. Given model parameters and learned on training datasets D and , respectively, both by (13), it holds for ,
Then, by introducing notation
and rewriting
we have
Due to the fact that for any functions and it holds2
we have
loss of generality
Then, the proof is completed by applying Theorem 1.
Now, we are ready to present the main theorem on the generalization ability of CorrLog.
Theorem 3. Given the model parameter learned by (13), with i.i.d. training data 1, 2, ..., n} and regularization parameters , it holds with at least probability ,
Proof. Given Theorem 2, the generalization bound (46) is a direct result of Theorem 12 in [55] (Please refer to the reference for details).
Remark 3 A notable observation from Theorem 3 is that the generalization bound (46) of CorrLog is independent of the label number m. Therefore, CorrLog is preferable for MLC with a large number of labels, for which the generalization error still can be bounded with high confidence.
Remark 4 While the learning of CorrLog (13) utilizes the elastic net regularization , where is the weighting parameter on the regularization to encourage sparsity, the generalization bound (46) is independent of the parameter . The reason is that regularization does not lead to stable learning algorithms [56], and only the regularization in contributes to the stability of CorrLog.
We design a simple toy example to illustrate the capacity of CorrLog on label correlation modelling. In particular, we show that when ILRs fail drastically due to ignoring the label correlations (under-fitting), CorrLog performs well. Consider a two-label classification problem on a 2-D plane, where each instance x is sampled uniformly from the unit disc and the corresponding labels are defined by
where and the augmented feature is . The function takes value 1 or , and the operation outputs 1 if either of its input is 1. The definition of makes the two labels correlated. We generate 1000 random examples according to above setting and split them into training and test sets, each of which contains 500 examples. During training, we set the parameter of the elastic net regularization to 0, i.e., we actually used an regularization, this is because in this example the model is not sparse in terms of both feature selection and label correlation. In addition, as the number of the training examples is sufficiently large for this problem, we suppose there is no over-fitting and tune the regularization parameters for both ILRs and CorrLog by minimizing the 0-1 loss on the training set.
Figure 1 shows that true labels of test data, the predictions of ILRs and the predictions of CorrLog, where different labels
Fig. 1. A two-label toy example: (a) true labels of test data; (b) predictions given by ILRs; (c) predictions given by CorrLog. The dash and solid black boundaries are specified by . In the legend, “+” and “-” stand for positive and negative labels, respectively, e.g., “, and so on.
are marked by different colors. In (a), the disc is divided into three regions, and +/+, where the two black boundaries are specified by and , respectively. In (b), the first boundary is properly learned by ILRs, while the second one is learned wrongly. This is because the second label is highly correlated to the first label, but ILRs ignores such correlation. As a result, ILRs wrongly predicted the impossible case of . The misclassification rate measured by 0-1 loss is 0.197. In contrast, CorrLog predicts correct labels for most instances with a 0-1 loss 0.068. Besides, it is interesting to note that the correlation between the two labels are “asymmetric”, for the first label is not affected by the second. This asymmetry contributes the most to the misclassification of CorrLog, because the previous definition implies that only symmetric correlations are modelled in CorrLog.
In this section, we apply the proposed CorrLog to multilabel image classification. In particular, four multilabel image datasets are used in this paper, including MULAN scene (MULANscene)3, MIT outdoor scene (MITscene) [38], PASCAL VOC 2007 (PASCAL07) [57] and PASCAL VOC 2012 (PASCAL12) [58]. MULAN scene dataset contains 2047 images with 6 labels, and each image is represented by 294 features. MIT outdoor scene dataset contains 2688 images in 8 categories. To make it suitable for multilabel experiment, we transformed each category label with several tags according to the image contents of that category4. PASCAL VOC 2007 dataset consists of 9963 images with 20 labels. For PASCAL VOC 2012, we use the available train-validation subset which
4The 8 categories are coast, forest, highway, insidecity, mountain, opencountry, street, and tallbuildings. The 8 binary tags are building, grass, cement-road, dirt-road, mountain, sea, sky, and tree. The transformation follows,
TABLE II DATASETS SUMMARY. #IMAGES STANDS FOR THE NUMBER OF ALL IMAGES, #FEATURES STANDS FOR THE DIMENSION OF THE FEATURES, AND #LABELS STANDS FOR THE NUMBER OF LABELS.
contains 11540 images. In addition, two kinds of features are adopted to represent the last three datasets, i.e., the PHOW (a variant of dense SIFT descriptors extracted at multiple scales) features [39] and deep CNN (convolutional neural network) features [42], [43]. Summary of the basic information of the datasets is illustrated in Table II. To extract PHOW features, we use the VLFeat implementation [59]. For deep CNN features, we use the ’imagenet-vgg-f’ model pretrained on ImageNet database [43] which is available in MatConvNet matlab toolbox [60].
A. A Warming-Up Qualitative Experiment
As an extension to our previous work on CorrLog, this paper utilizes elastic net to inherit individual advantages of and regularization. To build up the intuition, we employ MITscene with PHOW features to visualize the difference between and elastic net regularization. Table III presents the learned CorrLog label graphs using these two types of regularization respectively. In the label graph, the color of each edge represents the correlation strength between two certain labels. We have also listed 8 representative example images, one for each category, and their binary tags for completeness.
According to the comparison, one can see that elastic net regularization results in a sparse label graph due to its component, while regularization can only lead to a fully-
TABLE III LEARNED CORRLOG LABEL GRAPH ON MITSCENE USING OR ELASTIC NET REGULARIZATION.
connected label graph. In addition, the learned label correlations in elastic net case are more reasonable than that of . For example, in the label graph, dirt-road and mountain have weekly positive correlation (according to the link between them), though they seldom co-occur on the images in the datasets, while in the elastic net graph, their correlation is corrected as negative. It has to be confessed that elastic net regularization also discarded some reasonable correlations such as cement-road and building. This phenomenon is a direct result of the compromise between learning stability and model sparsity. We shall mention that those reasonable correlations can be maintained by decreasing or , though more unreasonable connections will also be maintained. Thus, applying weak sparsity may impair the model performance. As a result, it is important to choose a good level of sparsity to achieve a compromise. In our experiments, CorrLog with elastic net regularization generally outperforms that with regularization, which confirms our motivation that appropriate level of sparsity in feature selection and label correlations help boost the performance of MLC. In the following presentation, we will use CorrLog with elastic net regularization in all experimental comparisons. To benefit following research, our code is available upon request.
B. Quantitative Experimental Setting
In this subsection, we present further comparisons between CorrLog and other MLC methods. First, to demonstrate the effectiveness of utilizing label correlation, we first compare CorrLog’s performance with ILRs. Moreover, four state-of-the-art MLC methods - instance-based learning by logistic regression (IBLR) [6], multilabel k-nearest neighbour (MLkNN) [5], classifier chains (CC) [18] and maximum margin output coding (MMOC) [21] were also employed for comparison study. Note that ILRs can be regarded as the basic baseline and other methods represent state-of-the-arts. In our experiments, LIBlinear [61] -regularized logistic regression is employed to build binary classifiers for ILRs. As for other methods, we use publicly available codes in MEKA5 or the authors’ homepages.
We used six different measures to evaluate the performance. These include different loss functions (Hamming loss and zero-one loss) and other popular measures (accuracy, F1 score, Macro-F1 and Micro-F1). The details of these evaluation measures can be found in [62], [15], [18], [19]. The parameters for CorrLog are fixed across all experiments as , and . On each dataset, all the methods are compared by 5-fold cross validation. The mean and standard deviation are reported for each criterion. In addition, paired t-tests at 0.05 significance level is applied to evaluate the statistical significance of performance difference.
C. Quantitative Results and Discussions
Tables IV, V, VI and VII summarized the experimental results on MULANscene, MITscene, PASCAL07 and PAS-CAL12 of all six algorithms evaluated by the six measures. By comparing the results of CorrLog and ILRs, we can clearly see the improvements obtained by exploiting label correlations for MLC. Except the Hamming loss, CorrLog greatly outperforms ILRs on all datasets. Especially, the reduction of zero-one loss is significant on all four datasets with different type of features. This confirms the value of correlation modelling to joint prediction. However, it should be noticed that the improvement of CorrLog over ILRs is less significant when the performance is measured by Hamming loss. This is because Hamming loss treats the prediction of each label individually.
In addition, CorrLog is more effective in exploiting label correlations than other four state-of-the-art MLC algorithms. For MULANscene dataset, CorrLog achieved comparable results with IBLR and both of them outperformed other methods. For MITscene dataset, both PHOW and CNN features are very effective representations and boost the classification results. As a consequence, the performance of CorrLog and the four MLC algorithms are very close to each other. It is worth noting that, the MMOC method is time-consuming in the training stage, though it achieved the best performance on this dataset. As for both PASCAL07 and PASCAL12 datasets, CNN features perform significantly better than PHOW features. CorrLog obtained much better results than the competing MLC schemes, except for the Hamming loss and zero-one loss. Note that the CorrLog also performs competitively with PLEM and CGM, according to the results reported in [45].
D. Complexity Analysis and Execution Time
Table VIII summarizes the algorithm computational complexity of all MLC methods. The training computational cost of both CorrLog and ILRs are linear to the number of labels, while CorrLog causes more testing computational cost than ILRs due to the iterative belief propagation algorithm. In contrast, the training complexity of CC and MMOC are polynomial to the number of labels. The two instance-based methods, MLkNN and IBLR, are relatively computational in both train and test stages due to the involvement of instance-based searching of nearest neighbours. In particular, training MLkNN requires estimating the prior label distribution from training data which needs the consideration of all k nearest neighbours of all training samples. Testing a given sample in MLkNN consists of finding its k-nearest neighbours and applying maximum a posterior (MAP) inference. Different from MLkNN, IBLR constructs logistic regression models by adopting labels of k-nearest neighbours as features.
To evaluate the practical efficiency, Table IX presents the execution time (train and test phase) of all comparison algorithms under Matlab environment. A Linux server equipped with Intel Xeon CPU (8 cores @ 3.4 GHz) and 32 GB memory is used for conducting all the experiments. CorrLog is implemented in Matlab language, while ILRs is implemented based on LIBlinear’s mex functions. MMOC is evaluated using the authors’ Matlab code which also builds upon LIBlinear. As for IBLR, MLkNN and CC, the MEKA Java library is called via a Matlab wrapper. Based on the comparison results, the following observations can be made: 1) the execution time is largely consistent with the complexity analysis, though there maybe some unavoidable computational differences between Matlab scripts, mex functions and Java codes; 2) CorrLog’s train phase is very efficient and its test phase is also comparable with ILRs, CC and MMOC; 3) CorrLog is more efficient than IBLR and MLkNN in both train and test stages.
We have proposed a new MLC algorithm CorrLog and applied it to multilabel image classification. Built upon IRLs, CorrLog explicitly models the pairwise correlation between labels, and thus improves the effectiveness for MLC. Besides, by using the elastic net regularization, CorrLog is able to exploit the sparsity in both feature selection and label correlations, and thus further boost the performance of MLC. Theoretically, we have shown that the generalization error of CorrLog is upper bounded and is independent of the number of labels. This suggests the generalization bound holds with high confidence even when the number of labels is large. Evaluations on four benchmark multilabel image datasets confirm the effectiveness of CorrLog for multilabel image classification and show its competitiveness with the state-of-the-arts.
[1] W. Bian and D. Tao, “Asymptotic generalization bound of fisher’s linear discriminant analysis,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 12, pp. 2325–2337, 2014.
[2] T. Liu and D. Tao, “On the performance of manhattan nonnegative matrix factorizatio,” IEEE Trans. Neural Netw. Learn. Syst., vol. PP, no. 99, pp. 1–1, 2016.
[3] C. Xu, D. Tao, and C. Xu, “Multi-view intact space learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 12, pp. 2531–2544, 2015.
[4] T. Liu, D. Tao, S. Mingli, and M. Stephen, “Algorithm-dependent generalization bounds for multi-task learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. PP, no. 99, pp. 1–1, 2016.
[5] M.-L. Zhang and Z.-H. Zhou, “Ml-knn: A lazy learning approach to multi-label learning,” Pattern Recognit., vol. 40, no. 7, pp. 2038–2048, 2007.
[6] W. Cheng and E. H¨ullermeier, “Combining instance-based learning and logistic regression for multilabel classification,” Mach. Learn., vol. 76, no. 2-3, pp. 211–225, 2009.
[7] D. Hsu, S. Kakade, J. Langford, and T. Zhang, “Multi-label prediction via compressed sensing.” in Proc. Adv. Neural Inf. Process. Syst., vol. 22, 2009, pp. 772–780.
[8] G. Tsoumakas, I. Katakis, and I. Vlahavas, “Mining multi-label data,” in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds. Boston, MA: Springer US, 2010, pp. 667–685.
[9] M.-L. Zhang and Z.-H. Zhou, “A review on multi-label learning algo- rithms,” IEEE Trans. Knowl. Data Eng., vol. 26, no. 8, pp. 1819–1837, 2014.
[10] J. Petterson and T. S. Caetano, “Reverse multi-label learning,” in Proc. Adv. Neural Inf. Process. Syst., 2010, pp. 1912–1920.
[11] K. Dembczy´nski, W. Waegeman, W. Cheng, and E. H¨ullermeier, “On label dependence in multi-label classification,” in Proc. Int. Conf. Mach. Learn. Workshop on Learning from Multi-label Data, 2010, pp. 5–13.
[12] S. Yu, K. Yu, V. Tresp, and H.-P. Kriegel, “Multi-output regularized feature projection,” IEEE Trans. Knowl. Data Eng., vol. 18, no. 12, pp. 1600–1613, 2006.
TABLE IV MULANSCENE PERFORMANCE COMPARISON VIA 5-FOLD CROSS VALIDATION. MARKER INDICATES WHETHER CORRLOG IS STATISTICALLY SUPERIOR/INFERIOR TO THE COMPARED METHOD (USING PAIRED T-TEST AT 0.05 SIGNIFICANCE LEVEL).
TABLE V MITSCENE PERFORMANCE COMPARISON VIA 5-FOLD CROSS VALIDATION. MARKER INDICATES WHETHER CORRLOG IS STATISTICALLY SUPERIOR/INFERIOR TO THE COMPARED METHOD (USING PAIRED T-TEST AT 0.05 SIGNIFICANCE LEVEL).
TABLE VI PASCAL07 PERFORMANCE COMPARISON VIA 5-FOLD CROSS VALIDATION. MARKER INDICATES WHETHER CORRLOG IS STATISTICALLY SUPERIOR/INFERIOR TO THE COMPARED METHOD (USING PAIRED T-TEST AT 0.05 SIGNIFICANCE LEVEL).
[13] T. Zhou, D. Tao, and X. Wu, “Compressed labeling on distilled labelsets for multi-label learning,” Mach. Learn., vol. 88, no. 1-2, pp. 69–126, 2012.
[14] L. Breiman and J. H. Friedman, “Predicting multivariate responses in multiple linear regression (with discussion),” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 59, no. 1, pp. 3–54, 1997.
[15] G. Tsoumakas and I. Vlahavas, “Random k-labelsets: An ensemble method for multilabel classification,” in Proc. Eur. Conf. Mach. Learn. Springer, 2007, pp. 406–417.
[16] J. Read, “A Pruned Problem Transformation Method for Multi-label classification,” in Proc. New Zealand Computer Science Research Student Conference, 2008, pp. 143–150.
[17] N. Cesa-Bianchi, C. Gentile, and L. Zaniboni, “Incremental algorithms for hierarchical classification,” J. Mach. Learn. Res., vol. 7, pp. 31–54, 2006.
[18] J. Read, B. Pfahringer, G. Holmes, and E. Frank, “Classifier chains for multi-label classification,” Mach. Learn., vol. 85, no. 3, pp. 333–359,
2011.
[19] W. Cheng, E. H¨ullermeier, and K. J. Dembczynski, “Bayes optimal multilabel classification via probabilistic classifier chains,” in Proc. Int. Conf. Mach. Learn., 2010, pp. 279–286.
[20] F. Tai and H.-T. Lin, “Multilabel classification with principal label space transformation,” Neural Computation, vol. 24, no. 9, pp. 2508–2542, 2012.
[21] Y. Zhang and J. G. Schneider, “Maximum margin output coding,” in Proc. Int. Conf. Mach. Learn. ACM, 2012, pp. 1575–1582.
[22] J. D. Lafferty, A. McCallum, and F. C. N. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in Proc. Int. Conf. Mach. Learn. ACM, 2001, pp. 282–289.
[23] S. Belongie, J. Malik, and J. Puzicha, “Shape matching and object recognition using shape contexts,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 4, pp. 509–522, 2002.
[24] G. Wang, D. Forsyth, and D. Hoiem, “Improved object categorization and detection using comparative object similarity,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 10, pp. 2442–2453, 2013.
TABLE VII PASCAL12 PERFORMANCE COMPARISON VIA 5-FOLD CROSS VALIDATION. MARKER INDICATES WHETHER CORRLOG IS STATISTICALLY SUPERIOR/INFERIOR TO THE COMPARED METHOD (USING PAIRED T-TEST AT 0.05 SIGNIFICANCE LEVEL).
TABLE VIII COMPUTATIONAL COMPLEXITY ANALYSIS. RECALL THAT STANDS FOR THE NUMBER OF TRAIN IMAGES, STANDS FOR THE DIMENSION OF THE FEATURES, AND STANDS FOR THE NUMBER OF LABELS. NOTE THAT IS THE ITERATION NUMBER OF THE MAX-PRODUCT ALGORITHM IN CORRLOG, AND IS THE NUMBER OF NEAREST NEIGHBOURS IN MLKNN AND IBLR.
[25] L. Wang, T. Liu, G. Wang, K. L. Chan, and Q. Yang, “Video track- ing using learned hierarchical features,” IEEE Trans. Image Process., vol. 24, no. 4, pp. 1424–1435, 2015.
[26] S. Feng, R. Manmatha, and V. Lavrenko, “Multiple bernoulli relevance models for image and video annotation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., vol. 2. IEEE, 2004, pp. II–1002.
[27] J. Fan, Y. Shen, C. Yang, and N. Zhou, “Structured max-margin learning for inter-related classifier training and multilabel image annotation,” IEEE Trans. Image Process., vol. 20, no. 3, pp. 837–854, 2011.
[28] Q. Mao, I. W.-H. Tsang, and S. Gao, “Objective-guided image anno- tation,” IEEE Trans. Image Process., vol. 22, no. 4, pp. 1585–1597, 2013.
[29] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown, “Learning multi-label scene classification,” Pattern Recognit., vol. 37, no. 9, pp. 1757–1771, 2004.
[30] D. Song and D. Tao, “Biologically inspired feature manifold for scene classification,” IEEE Trans. Image Process., vol. 19, no. 1, pp. 174–184, 2010.
[31] Z. Zuo, G. Wang, B. Shuai, L. Zhao, Q. Yang, and X. Jiang, “Learning discriminative and shareable features for scene classification,” in Proc. Eur. Conf. Comput. Vis. Springer, 2014, pp. 552–568.
[32] Y. Luo, D. Tao, C. Xu, C. Xu, H. Liu, and Y. Wen, “Multiview vector- valued manifold regularization for multilabel image classification,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 5, pp. 709–722, 2013.
[33] Y. Luo, D. Tao, B. Geng, C. Xu, and S. J. Maybank, “Manifold regularized multitask learning for semi-supervised multilabel image classification,” IEEE Trans. Image Process., vol. 22, no. 2, pp. 523– 536, 2013.
[34] F. Sun, J. Tang, H. Li, G.-J. Qi, and T. S. Huang, “Multi-label image categorization with sparse factor representation,” IEEE Trans. Image Process., vol. 23, no. 3, pp. 1028–1037, 2014.
[35] B. Zhang, Y. Wang, and F. Chen, “Multilabel image classification via high-order label correlation driven active learning,” IEEE Trans. Image Process., vol. 23, no. 3, pp. 1430–1441, 2014.
[36] Y. Luo, T. Liu, D. Tao, and C. Xu, “Multiview matrix completion for
multilabel image classification,” IEEE Trans. Image Process., vol. 24, no. 8, pp. 2355–2368, 2015.
[37] C. Xu, T. Liu, D. Tao, and C. Xu, “Local rademacher complexity for multi-label learning,” IEEE Trans. Image Process., vol. 25, no. 3, pp. 1495–1507, 2016.
[38] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” Int. J. Comput. Vis., vol. 42, no. 3, pp. 145–175, 2001.
[39] A. Bosch, A. Zisserman, and X. Munoz, “Image classification using random forests and ferns,” in Proc. IEEE Int. Conf. Comput. Vis. IEEE, 2007, pp. 1–8.
[40] H. J´egou, M. Douze, C. Schmid, and P. P´erez, “Aggregating local descriptors into a compact image representation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. IEEE, 2010, pp. 3304–3311.
[41] L.-J. Li, H. Su, L. Fei-Fei, and E. P. Xing, “Object bank: A high- level image representation for scene classification & semantic feature sparsification,” in Proc. Adv. Neural Inf. Process. Syst., 2010, pp. 1378– 1386.
[42] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[43] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the devil in the details: Delving deep into convolutional nets,” in Proc. Brit. Mach. Vis. Conf. BMVA Press, 2014, pp. 6:1–6:12.
[44] X. Li, F. Zhao, and Y. Guo, “Multi-label image classification with a probabilistic label enhancement model,” in Proc. Conf. Uncertain. Artif. Intell., 2014, pp. 430–439.
[45] M. Tan, Q. Shi, A. van den Hengel, C. Shen, J. Gao, F. Hu, and Z. Zhang, “Learning graph structure for multi-label image classification via clique generation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 4100–4109.
[46] W. Bian, B. Xie, and D. Tao, “Corrlog: Correlated logistic models for joint prediction of multiple labels,” in Proc. Int. Conf. Artif. Intell. Stat., 2012, pp. 109–117.
[47] P. Ravikumar, M. J. Wainwright, and J. D. Lafferty, “High-dimensional ising model selection using l 1 -regularized logistic regression,” Annals of Statistics, vol. 38, pp. 1287–1319, 2010.
[48] J. Ashford and R. Sowden, “Multi-variate probit analysis,” Biometrics, vol. 26, pp. 535–546, 1970.
[49] H. Zou and T. Hastie, “Regularization and variable selection via the elastic net,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 67, no. 2, pp. 301–320, 2005.
[50] J. Besag, “Statistical Analysis of Non-Lattice Data,” Journal of the Royal Statistical Society: Series D (The Statistician), vol. 24, no. 3, pp. 179– 195, 1975.
[51] C. Sutton and A. McCallum, “Piecewise pseudolikelihood for efficient training of conditional random fields,” in Proc. Int. Conf. Mach. Learn. ACM, 2007, pp. 863–870.
[52] A. S. Nemirovsky and D. B. Yudin, Problem Complexity and Method Efficiency in Optimization, ser. Wiley-Interscience Series in Discrete Mathematics. New York, USA: John Wiley & Sons, 1983.
[53] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM J. Img. Sci., vol. 2, no. 1, pp. 183–202, 2009.
TABLE IX AVERAGE EXECUTION TIME COMPARISON ON ALL DATASETS.
[54] C. M. Bishop, Pattern Recognition and Machine Learning. New York, NY: Springer, 2006.
[55] O. Bousquet and A. Elisseeff, “Stability and generalization,” J. Mach. Learn. Res., vol. 2, pp. 499–526, 2002.
[56] H. Xu, C. Caramanis, and S. Mannor, “Sparse algorithms are not stable: A no-free-lunch theorem,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 1, pp. 187–193, 2012.
[57] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisser- man, “The pascal visual object classes (voc) challenge,” Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, 2010.
[58] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes challenge: A retrospective,” Int. J. Comput. Vis., vol. 111, no. 1, pp. 98–136, 2015.
[59] A. Vedaldi and B. Fulkerson, “Vlfeat: An open and portable library of computer vision algorithms,” in Proc. ACM Int. Conf. Multimedia. ACM, 2010, pp. 1469–1472.
[60] A. Vedaldi and K. Lenc, “Matconvnet: Convolutional neural networks for matlab,” in Proc. ACM Int. Conf. Multimedia. ACM, 2015, pp. 689–692.
[61] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, “Liblinear: A library for large linear classification,” J. Mach. Learn. Res., vol. 9, pp. 1871–1874, 2008.
[62] G. Madjarov, D. Kocev, D. Gjorgjevikj, and S. Dˇzeroski, “An extensive experimental comparison of methods for multi-label learning,” Pattern Recognit., vol. 45, no. 9, pp. 3084–3104, 2012.