Nearest Labelset Using Double Distances for Multi-label Classification

2017·Arxiv

Abstract

Abstract

Multi-label classification is a type of supervised learning where an instance may belong to multiple labels simultaneously. Predicting each label independently has been criticized for not exploiting any correlation between labels. In this paper we propose a novel approach, Nearest Labelset using Double Distances (NLDD), that predicts the labelset observed in the training data that minimizes a weighted sum of the distances in both the feature space and the label space to the new instance. The weights specify the relative tradeoff between the two distances. The weights are estimated from a binomial regression of the number of misclassified labels as a function of the two distances. Model parameters are estimated by maximum likelihood. NLDD only considers labelsets observed in the training data, thus implicitly taking into account label dependencies. Experiments on benchmark multi-label data sets show that the proposed method on average outperforms other well-known approaches in terms of Hamming loss, 0/1 loss, and multi-label accuracy and ranks second after ECC on the F-measure.

Keywords Multi-label classification, Machine learning, Label correlations

1 Introduction

In multi-label classification, an instance can belong to multiple labels at the same time. This is different from multi-class or binary classification, where an instance can only be associated with a single label. For example, a newspaper article talking about electronic books may be labelled with multiple topics such as business, arts and technology simultaneously. Multi-label classification has been applied in many areas of application including text (Schapire and Singer 2000; Godbole and Sarawagi 2004), image (Boutell et al 2004; Zhang and Zhou 2007), music (Li and Ogihara 2003; Trohidis et al 2008) and bioinformatics (Elisseeff and Weston 2001). A labelset for an instance is the set of all labels that are associated with that instance.

Approaches for solving multi-label classification problems may be categorized into either problem transformation methods or algorithm adaptation methods (Tsoumakas and Katakis 2007). Problem transformation methods transform a multi-label problem into one or more single-label problems. For the single-label classification problems, binary or multi-class classifiers are used. The results are combined and transformed back into a multi-label representation. Algorithm adaptation methods, on the other hand, modify specific learning algorithms directly for multi-label problems. Individual approaches are explained in Section 2.

In this paper, we propose a new problem transformation approach to multi-label classification. Our proposed approach applies the nearest neighbor method to predict the label with the shortest distance in the feature space. However, because we have multiple labels, we additionally consider the shortest distance in the label space. We then find the labelset that minimizes the expected label misclassification rate as a function of both distances, feature space and label space, exploiting high-order interdependencies between labels. The nonlinear function is estimated using maximum likelihood.

The effectiveness of the proposed approach is evaluated with various multi-label data sets. Our experiments show that the proposed method performs on average better on standard evaluation metrics (Hammming loss, 0/1 loss, multi-label accuracy and the F-measure) than other commonly used algorithms.

The rest of this paper is organized as follows: In Section 2 we review previous work on multi-label classification. In Section 3, we present the details of the proposed method. In Section 4, we report on experiments that compare the proposed method with other algorithms on standard metrics. In Section 6 we discuss the results. In Section 7, we draw conclusions.

2 Related work

There are several approaches to classifying multi-label data. The most common approach, binary relevance (BR) (Zhang and Zhou 2005; Tsoumakas and Katakis 2007), transforms a multi-label problem into separate binary problems. That is, using training data, BR constructs a binary classifier for each label independently. For a test instance, the prediction set of labels is obtained simply by combining the individual binary results. In other words, the predicted labelset is the union of the results predicted from the L binary models. This approach requires one binary model for each label. The method has been adapted in many domains including text (Gonçalves and Quaresma 2003), music (Li and Ogihara 2003) and images (Boutell et al 2004). One drawback of the basic binary approach is that it does not account for any correlation that may exist between labels, because the labels are modelled independently. Taking correlations into account is often critical for prediction in multi-label problems (Godbole and Sarawagi 2004; Ji et al 2008).

A method related to BR is Subset-Mapping (SMBR) (Schapire and Singer 1999; Read et al 2011). For a new instance, a vector of labels is obtained by the binary outputs of BR and the final prediction is made by the training labelset with the shortest Hamming distance to the prediction set. For predictions SMBR only chooses labelsets observed in training data, thus SMBR exploits the interdependencies among labels.

An extension of binary relevance is Classifier Chain (CC) (Read et al 2011). CC fits labels sequentially using binary classifiers. Labels already predicted are included as features in subsequent classifiers until all labels have been fit. Including previous predictions as features “chains” the classifiers together and also takes into account potential label correlations. However, the order of the labels in a chain affects the predictive performances. Read et al (2011) also introduced the ensemble of classifier chains (ECC), where multiple CC are built with re-sampled training sets. The order of the labels in each CC is randomly chosen. The prediction label of an ECC is obtained by the majority vote of the CC models.

Label Powerset learning (LP) transforms a multi-label classification into a multi-class problem (Tsoumakas and Katakis 2007). In other words, LP treats each labelset as a single label. The transformed problem requires a single classifier. Although LP captures correlations between labels, the number of classes in the

transformed problem increases exponentially with the number of original labels. LP learning can only choose observed labelsets for predictions (Tsoumakas and Katakis 2007; Read et al 2008).

The random k-labelsets method, (RAKEL) (Tsoumakas and Vlahavas 2007), is a variation on the LP approach. In a multi-label problem with L different labels, RAKEL employs m multi-class models each of which considers randomly chosen labels, rather than the entire labelset. For a test instance, the prediction labelset is obtained by the majority vote of the results based on the m models. RAKEL overcomes the problem that the number of multinomial classes increases exponentially as a function of the number of labels. It also considers interdependencies between labels by using multi-class models with subsets of the labels.

A popular lazy learning algorithm based on the k Nearest Neighbours (kNN) approach is MLKNN (Zhang and Zhou 2007). Like other kNN-based methods, MLKNN identifies the k nearest training instances in the feature space for a test instance. Then for each label, MLKNN estimates the prior and likelihood for the number of neighbours associated with the label. Using Bayes theorem, MLKNN calculates the posterior probability from which a prediction is made.

The Conditional Bernoulli Mixtures (CBM) (Li et al 2016) approach transforms a multi-label problem into a mixture of binary and multi-class problems. CBM divides the feature space into K regions and learns a multi-class classifier for the regional components as well as binary classifiers in each region. The posterior probability for a labelset is obtained by mixing the multi-class and multiple binary classifiers. The model parameters are estimated using the Expectation Maximization algorithm.

3 The nearest labelset using double distances approach

3.1 Hypercube view of a multi-label problem

In multi-label classification, we are given a set of possible output labels L = {1, 2, ..., L}. Each instance with a feature vector is associated with a subset of these labels. Equivalently, the subset can be described as is associated with the instance and otherwise. A multi-label training data set is described as

Any labelset y can be described as a vertex in the L-dimensional unit hypercube (Tai and Lin 2012). Each component represents an axis of the hypercube. As an example, Figure 1 illustrates the label space of a multi-label problem with three labels (

Assume that the presence or absence of each label is modeled independently with a probabilistic classi-fier. For a new instance, the classifiers provide the probabilities, , that the corresponding labels are associated with the instance. Using the probability outputs, we may obtain a L-dimensional vector . Every element of has a value from 0 to 1 and the vector is an inner point in the hypercube (see Figure 1). Given the prediction task is completed by assigning the inner point to a vertex of the cube.

For the new instance, we may calculate the Euclidean distance, the labelset of the training instance). In Figure 1, three training instances corresponding distances are shown. A small distance indicates that is likely to be the labelset for the new instance.

3.2 Nearest labelset using double distances (NLDD)

In addition to computing the distance in the label space, , we may also obtain the (Euclidean) distance in the feature space, denoted by . The proposed method, NLDD, uses both as predictors to

Fig. 1: An illustration of the label space when L = 3. Each vertex represents a labelset. The inner point represents a fitted vector of an instance. represents the distance between

find a training labelset that minimizes the expected loss. For each test instance, we define loss as the number of misclassified labels out of L labels. The expected loss is then represents the probability of misclassifying each label. The predicted labelset, , is the labelset observed in the training data that minimizes the expected loss:

The loss follows a binomial distribution with L and a parameter as follows:

where are the model parameters. Greater values for imply that becomes more sensitive to the distances in the feature and label spaces, respectively. The misclassification probability decreases as approach zero.

A test instance with has a duplicate instance in the training data (i.e. with identical features). The predicted probabilities for the test instance are either 0 or 1 and the match the labels of the duplicate training observation. For such a “double”-duplicate instance (i.e. ), the probability of misclassification is . As expected, the uncertainty of a test observation with a “doubleduplicate” training observation is greater than zero.

The model in (2) implies is a monotone transformation of is a constant, the minimization problem in (1) is equivalent to

That is, NLDD predicts by choosing the labelset of the training instance that minimizes the weighted sum of the distances. For prediction, the only remaining issue is how to estimate the weights.

3.3 Estimating the relative weights of the two distances

We need to estimate the parameters . This requires computing , but of course the outcomes in the test data are not known. We therefore split the training data, T, equally into two data sets, is used for validation. Using , we next fit a binary classifier to each of the L labels separately and obtain the labelset predictions (i.e. probability outcomes) for the instances in . We then create a set of by pairing instances in with those in . Note that matching any single instance in in results in N/2 distance pairs. Most of the pairs are uninformative because the distance in either the feature space or the label space is very large. Moreover, since instances, the number of possible pairs is potentially large (/4). Therefore, to reduce computational complexity, for each instance we only identify two pairs: the pair with the smallest distance in x and the pair with the smallest distance in y. In case of ties in one distance, the pair with the smallest value in the other distance is chosen. More formally we identify the first pair

where is the set of pairs that are tied; i.e. that each corresponds to the minimum distance in Similarly, the second pair is found by

where is the set of labels that are tied with the minimal distance in . Figure 2 illustrates an example of how to identify . Our goal was to identify the instance with the smallest distance in x and the instance with the smallest distance in y. Note that may be the same instance If we find a single instance that minimizes both distances, we use just that instance. (A possible duplication of that instance is unlikely to make any difference in practice).

Fig. 2: An illustration of how to identify contain 10 instances each. The 10 points in the scatter plot were obtained by calculating between an instance in the 10 instances in . In this example two points have the lowest distance in and are candidates for . Among the candidates, the point with the lowest is chosen.

The two pairs corresponding to the instance in are denoted as the set , and their union for all instances is denoted as . The binomial regression specified in (2) is performed on the instances in S and maximum likelihood estimators of the parameters are obtained. Algorithm 1 outlines the training procedure.

For the classification of a new instance, we first obtain using the probabilistic classifiers fitted to the training data are obtained by matching the instance with the training instance. Using the , we calculate . The final prediction

of the new instance is obtained by

The second equality holds because is a constant. As in LP, NLDD chooses a training labelset as the predicted vector. Algorithm 2 outlines the classification procedure.

The training time of O(f(d, N)) is the complexity of each binary classifier with d features and N training instances, O(g(d, N/2)) is the complexity for predicting each label for is the complexity for obtaining the distance pairs for the regression and O(Nlog(k)) is the complexity for fitting a binomial regression. have N/2 instances respectively. O(Lf(d, N/2)) is the complexity for fitting binary classifiers using and obtaining the probability results for . For each instance of , we obtain N/2 numbers of distance pairs. This has complexity O((N/2)(d + L)). Since there are N/2 instances, overall it takes when omitting the constant. Among the N/2 pairs for each instance of , we only identify at most 2 pairs. This implies is the number of elements in S. Each iteration of the Newton-Raphson method has a complexity of O(N). For k-digit precision

complexity O(logk) is required (Ypma 1995). Combined, the complexity for estimating the parameters with k-digit precision is O(Nlog(k)). In practice, however, this term is dominated by as we can set k << N.

4 Experimental evaluation

In this section we compare the algorithms for multi-label classification on nine data sets in terms of Hamming loss, 0/1 loss, multi-label accuracy and F-measure. We next introduce the data sets and the evaluation measures and then present the results of our experiments.

4.1 Data sets

We evaluated the proposed approach using nine commonly used multi-label data sets from different domains. Table 1 shows basic statistics for each data set including its domain, numbers of labels and features. In the text data sets, all features are categorical (i.e. binary). The last column “lcard”, short for label cardinality, represents the average number of labels associated with an instance. The data sets are ordered by (

The emotions data set (Trohidis et al 2008) consists of pieces of music with rhythmic and timbre features. Each instance is associated with up to 6 emotion labels such as “sad-lonely”, “amazed-surprised” and “happy-pleased”. The scene data set (Boutell et al 2004) consists of images with 294 visual features. Each image is associated with up to 6 labels including “mountain”, “urban” and “beach”. The yeast data set (Elisseeff and Weston 2001) contains 2417 yeast genes in the Yeast Saccharomyces Cerevisiae. Each gene is represented by 103 features and is associated with a subset of 14 functional labels. The medical data set consists of documents that describe patient symptom histories. The data were made available in the Medical Natural language Processing Challenge in 2007. Each document is associated with a set of 45 disease codes. The slashdot data set consists of 3782 text instances with 22 labels obtained from Slashdot.org. The enron data set (Klimt and Yang 2004) contains 1702 email messages from the Enron corporation employees. The emails were categorized into 53 labels. The ohsumed data set (Hersh et al 1994) is a collection of medical research articles from MEDLINE database. We used the same data set as in Read et al (2011) that contains 13929 instances and 23 labels. The tmc2007 data set (Srivastava and Zane-Ulman 2005) contains 28596 aviation safety reports associated with up to 22 labels. Following Tsoumakas et al (2011), we used a reduced version of the data set with 500 features. The bibtex data set (Katakis et al 2008) consists of 7395 bibtex entries for automated tag suggestion. The entries were classified into 159 labels. All data sets are available online at: http://mulan.sourceforge.net/datasets-mlc.html and http://meka.sourceforge.net/#datasets.

Table 1: Multi-label data sets and their associated characteristics. Label cardinality (lcards) is the average number of labels associated with an instance

4.2 Evaluation metrics

Multi-label classifiers can be evaluated with various loss functions. Here, four of the most popular criteria are used: Hamming loss, 0/1 loss, multi-label accuracy and F-measure. These criteria are defined in the following paragraphs.

Let L be the number of labels in a multi-label problem. For a particular test instance, let y = be the labelset where label is associated with the instance and 0 otherwise. Let be the predicted values obtained by any machine learning method. Hamming loss refers to the percentage of incorrect labels. The Hamming loss for the instance is

where 1 is the indicator function. Despite its simplicity, the Hamming loss may be less discriminative than other metrics. In practice, an instance is usually associated with a small subset of labels. As the elements of the L-dimensional label vector are mostly zero, even the empty set (i.e. zero vector) prediction may lead to a decent Hamming loss.

The 0/1 loss is 0 if all predicted labels match the true labels and 1 otherwise. Hence,

Compared to other evaluation metrics, 0/1 loss is strict as all the L labels must match to the true ones simultaneously.

The multi-label accuracy (Godbole and Sarawagi 2004) (also known as the Jaccard index) is defined as the number of labels counted in the intersection of the predicted and true labelsets divided by the number of labels counted in the union of the labelsets. That is,

The multi-label accuracy measures the similarity between the true and predicted labelsets. The F-measure is the harmonic mean of precision and recall. The F-measure is defined as

The metrics above were defined for a single instance. On each metric, the overall value for an entire test data set is obtained by averaging out the individual values.

4.3 Experimental setup

We compared our proposed method against BR, SMBR, ECC, MLKNN, RAKEL and CBM. To train multi-label classifiers, the parameters recommended by the authors were used. In the case of MLKNN, we set the number of neighbors and the smoothing parameter to 10 and 1 respectively. For RAKEL, we set the number of separate models to 2L and the size of each sub-labelset to 3. For ECC, the number of CC models for each ensemble was set to 10. On the larger data sets (ohsumed, tmc2007 and bibtex), we fit ECC using reduced training data sets (75% of the instances and 50% of the features) as suggested in Read et al (2011). On the same data sets, we ran NLDD using 70% of the training data to reduce redundancy in learning.

For NLDD, we used support vector machines (SV M) (Vapnik 2000) as the base classifier on unscaled variables with a linear kernel and tuning parameter C = 1. The SV M scores were converted into probabilities using Platt’s method (Platt 2000). SV M was also used as the base classifier for BR, SMBR, ECC and RAKEL. The analysis was conducted in R (R Core Team 2014) using the e1071 package (Meyer et al 2014) for SV M. For the data sets with less than 5,000 instances 10-fold cross validations (CV ) were performed. On the larger data sets, we used 75/25 train/test splits. For fitting binomial regression models, we divided the training data sets at random into two parts of equal sizes.

For implementing CBM we used a Java program developed by the authors. The default settings (e.g. logistic regression and 10 iterations for the EM algorithm) were used on non-large data sets. For the large data sets tmc2007 and bibtex, the number of iterations was set to 5 and random feature reduction was applied as suggested by the developers. On each data set we used train/test split available at their website (https://github.com/cheng-li/pyramid).

We applied the Wilcoxon signed-rank test (Wilcoxon 1945; Demšar 2006) to compare the methods over multiple data sets because unlike the t–test it does not make a distributional assumption. Also, the Wilcoxon test is more robust to outliers than the t–test (Demšar 2006). Each test was one-sided at significance level 0.05. In multi-label classification, the Wilcoxon signed-ranks test was employed by Tsoumakas et al (2011).

In NLDD, when calculating distances in the feature spaces we used the standardized features so that no particular features dominated distances. For a numerical feature variable x, the standardized variable z is obtained by are the mean and standard deviation of x in the training data.

4.4 Results

Tables 2 to 5 summarize the results in terms of Hamming loss, 0/1 loss, multi-label accuracy and F- measure, respectively. We also ranked the algorithms for each metric. The Wilcoxon test results report whether or not any two methods were significantly different in their rankings across data sets. The results are shown at the bottom of each table. NLDD achieved highest average ranks on Hamming loss, 0/1 loss and multi-label accuracy while ECC achieved the highest average rank on the F-measure with NLDD taking the second place (and the difference between ECC and NLDD was not statistically significant). RAKEL achieved the second highest average rank on Hamming loss, while CBM achieved the second highest average rank on 0/1 loss and multi-label accuracy. The performance of CBM on the 0/1 loss was very variable achieving the highest rank on five out of nine data sets and the second worst on two data sets.

Table 6 shows the running time in seconds of the methods. On the non-large data sets, the relative differences of running time between NLDD and BR tended to increase with the size of the data sets. On two of the large data sets, ohsumend and tmc2007, NLDD required less time than BR as we only used 70% of the training data.

Table 2: Hamming loss (lower is better) averaged over 10 cross validations (with ranks in parentheses). The data sets are ordered as in Table 1. The results from the Wilcoxon test on whether or not any two results are statistically significant from one another are summarized at the bottom of the table.

Table 3: 0/1 loss (lower is better) averaged over 10 cross validations (with ranks in parentheses). The loss is 0 if a predicted labelset matches the true labelset exactly and 1 otherwise. The results from the Wilcoxon test on whether or not any two results are statistically significant from one another are summarized at the bottom of the table.

We next look at the performance of NLDD by whether or not the true labelsets were observed in the training data. A labelset has been observed if the exact labelset can be found in the training data and unobserved otherwise. Since NLDD makes a prediction by choosing a training labelset, a predicted labelset can only be partially correct on an unobserved labelset. Table 7 compares the evaluation results of BR and NLDD on two separate subsets of the test set of the bibtex data. The bibtex data were chosen because the data set contains by far the largest percentage of unobserved labelsets (33%) among the data sets investigated. The test data set was split into subsets A and B; if the labelset of a test instance was an observed labelset, the instance was assigned to A; otherwise the instance was assigned to B. For all of the four metrics, NLDD outperformed BR even though 33% of the labelsets in the test data were unobserved labelsets.

We next look at the three regression parameters the proposed method (NLDD) estimated (equation 2) for each data set in more detail. Table 8 displays the MLE of the parameters of the binomial model in

Table 4: Multi-label accuracy (higher is better) averaged over 10 cross validations (with ranks in parentheses). The results from the Wilcoxon test on whether or not any two results are statistically significant from one another are summarized at the bottom of the table.

Table 5: F-measure (higher is better) averaged over 10 cross validations (with ranks in parentheses). The results from the Wilcoxon test on whether or not any two results are statistically significant from one another are summarized at the bottom of the table.

each data set. In all data sets, the estimates of were all positive. The positive slopes imply that the expected loss (or, equivalently the probability of misclassification for each label) decreases as decreases.

From the values of we may infer how low the expected loss is when either is 0. For example, data set. If . This is the expected number of mismatched labels for choosing a training labelset whose distances to the new instance are zero in both feature and label spaces. The results suggest the expected loss would be very small when classifying a new instance that had a duplicate in the training data () and whose labels are predicted with probability 1 and the predicted labelset was observed in the training data (

Table 6: Running times (seconds) on benchmark multi-label data sets

Table 7: Evaluation results on the bibtex data set by whether or not the labelset was observed (Subset A) or unobserved (Subset B) in the training data. Subset A contains 67% of the test instances and subset B contains 33%. For Hamming loss and 0/1 loss, lower is better. For Multi-label accuracy and F-measure, higher is better.

Table 8: The maximum likelihood estimates of the parameters of equation 2 averaged over 10 cross validations

5 Scaling up NLDD

As seen in Section 3.2, the time complexity of NLDD is dependent on the size of the training data (N). In particular, the term makes the complexity of NLDD quadratic in N. For larger data sets the running time could be reduced by running the algorithm on a fraction of the N instances, but performance may be affected. This is investigated next.

Figure 3 illustrates the running time and the corresponding performance of NLDD as a function of the percentage of N. For the result, we used the tmc2007 data with 75/25 train/test splits. After splitting, we randomly chose 10% - 100% of the training data and ran NLDD with the reduced data. As before, we used SV M with a linear kernel as the base classifier.

The result shows that NLDD can obtain similar predictive performances for considerably less time. The running time increased quadratically as a function of N while the improvement of the performance of NLDD appeared to converge. Using 60% of the training data, NLDD achieved almost the same performance in the number of mismatched labels as using the full training data. Similar results were obtained on other large data sets.

Fig. 3: Running time (left) and the average number of mismatched labels (right) as a function of the percentage of the instance space for NLDD

6 Discussion

For the sample data sets selected, NLDD performed significantly better than BR, SMBR and MLKNN on all of the four metrics. NLDD also significantly outperformed ECC on Hamming loss and 0/1 loss, RAKEL on 0/1 loss, multi-label accuracy and F-measure. Although no significant difference was found between NLDD and CBM, NLDD achieved higher average ranks on all of the four metrics. On any evaluation metric, no method performed statistically significantly better than NLDD.

Like BR, NLDD uses outputs of independent binary classifiers. Using the distances in the feature and label spaces in binomial regression, NLDD can make more accurate predictions than BR. NLDD was also significantly superior to SMBR, which is similar to NLDD in the sense that it makes predictions by choosing training labelsets using binary classifiers. SMBR is based on the label space only, while NLDD uses the distances in the feature space as well.

Like LP, the proposed method treats each training labelset as a different class of a single-label problem in the prediction stage. Using a training labelset as a predicted vector, the proposed approach takes potentially high order label correlations into account.

In fitting the binomial regression, NLDD restricts the fit of the binomial model to distance pairs with low distances in the feature and label spaces. This dramatically reduces the size of the data used for regression fitting. In the yeast data set, the training data T contained 2178 instances. Since we equally divided the training data into , each of them contained 1089 instances. Hence the number of possible instances available for fitting is . On the other hand, NLDD used only 2, 018 instances which is less than 0.2% of all instances.

NLDD requires more time than BR. The relative differences of running time between NLDD and BR depended on the size of the training data (N). The number of labels and features had less impact on the differences, as the complexity of NLDD is linear in them. For the larger data sets, we reduced the running time of NLDD by using a subset (70%) of the training data. The results of ohsumed and tmc2007 data sets show that NLDD with reduced data can perform fast compared to not only BR but also the other methods on large data problems.

Because NLDD makes a prediction by choosing a training labelset, the prediction label vector is confined to a labelset appearing in the training data. If a new instance has a true labelset unobserved in the training data, there will be at least one incorrect predicted label. Even so, NLDD beat the other methods on average. How frequently an unobserved labelset occurs depends on the data set. For most data sets, less than 5% of the test data contained labelsets not observed in the training data. In other words, most of the labelsets of the test instances could be found in the training data. However, for the bibtex data set about 33% of the test data contained unobserved labelsets. As seen in Table 7, when the true labelsets of the test instances were not observed in the training data (subset B), BR performed slightly better than NLDD in terms of 0/1 loss, multi-label accuracy and F-measure. On the other hand, when the true labelsets of the test instances were observed in the training data (subset A), NLDD outperformed BR on all of the metrics. Combined, NLDD achieved higher performances than BR on the entire test data.

NLDD uses binomial regression to estimate the parameters. This setup assumes that the instances in S are independent. While it turned out that this assumption worked well in practice, dependencies may arise between the two pairs of a given . If required this dependency could be modeled using, for example, generalized estimating equations (Liang and Zeger 1986). We examined GEE using an exchangeable correlation structure. The estimates were almost the same and the prediction results were unchanged. The analogous results are not shown.

For prediction, the minimization in (3) only requires the estimates of the coefficients determine the tradeoff between . The estimate of is not needed. However, estimating allows estimating the probability of a misclassification of a label for an instance, . Such an assessment of uncertainty of the prediction can be useful. For example, one might only want to classify instances where the probability of misclassification is below a certain threshold value.

NLDD uses a linear model for binomial regression specified in 2. To investigate how the performance of NLDD changes in nonlinear models, we also considered a model: in which the distances are combined in a multiplicative way. The difference of prediction results obtained by the linear and multiplicative models was small. The analogous results are not shown.

While SV M was employed as the base classifier, other algorithms could be chosen provided the classifier can estimate posterior probabilities rather than just scores. Better predictions of binary classifiers will make distances in the label space more useful and hence lead to a better performance.

7 Conclusion

In this paper, we have presented NLDD based on probabilistic binary classifiers. The proposed method chooses a training labelset with the minimum expected loss, where the expected loss is a function of two variables: the distances in feature and label spaces. The parameters are estimated by maximum likelihood. The experimental study with 9 different multi-label data sets showed that NLDD outperformed other state-of-the-art methods on average in terms of Hamming loss, 0/1 loss, multi-label accuracy and F-measure.

NLDD relies on labelsets observed in the training data and is unable to predict previously unobserved labelsets. NLDD outperformed other methods on the data sets observed where most test data sets contained 5% unobserved labelsets. While the method still outperforms the other methods with 33% of unobserved labelsets on the bibtex data, the method might not fare as well when the percentage of unobserved labelsets are substantially greater.

Acknowledgement

This research was supported in part by National Science and Engineering Research Council of Canada (NSERC, Gweon) and by Social Sciences and Humanities Research Council of Canada (SSHRC # 435-2013-0128, Schonlau).

References

Boutell MR, Luo J, Shen X, Brown CM (2004) Learning multi-label scene classification. Pattern Recognition 37(9):1757 – 1771, DOI 10.1016/j.patcog.2004.03.009

Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7:1–30, http://dl.acm.org/citation.cfm?id=1248547.1248548

Elisseeff A, Weston J (2001) A kernel method for multi-labelled classification. In: Advances in Neural Information Processing Systems 14, MIT Press, pp 681–687

Godbole S, Sarawagi S (2004) Discriminative methods for multi-labeled classification, Springer, pp 22–30. DOI 10.1007/978-3-540-24775-3_5

Gonçalves T, Quaresma P (2003) A preliminary approach to the multilabel classification problem of portuguese juridical documents. In: Proceedings of the 11th Portuguese Conference on Artificial Intelligence, Springer, pp 435–444, DOI 10.1007/978-3-540-24580-3_50

Hersh W, Buckley C, Leone TJ, Hickam D (1994) Ohsumed: an interactive retrieval evaluation and new large test collection for research. In: Proceedings of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, London, pp 192–201, DOI 10.1007/978-1-4471-2099-5_20

Ji S, Tang L, Yu S, Ye J (2008) Extracting shared subspace for multi-label classification. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp 381–389, DOI 10.1145/1401890.1401939

Katakis I, Tsoumakas G, Vlahavas I (2008) Multilabel text classification for automated tag suggestion. In: Proceedings of the ECML/PKDD Discovery Challenge, Antwerp, Belgium

Klimt B, Yang Y (2004) The enron corpus: a new dataset for email classification research. In: Proceedings of the 15th European Conference on Machine Learning, Springer, Pisa, Italy, pp 217–226, DOI 10.1007/978-3-540-30115-8_ 22

Li C, Wang B, Pavlu V, Aslam JA (2016) Conditional bernoulli mixtures for multi-label classification. In: Proceedings of the 33rd International Conference on Machine Learning, New York City, NY, pp 2482–2491

Li T, Ogihara M (2003) Detecting emotion in music. In: Proceedings of the International Symposium on Music Information Retrieval, pp 239–240

Liang KY, Zeger SL (1986) Longitudinal data analysis using generalized linear models. Biometrika 73(1):13–22, DOI 10.2307/2336267

Meyer D, Dimitriadou E, Hornik K, Weingessel A, Leisch F (2014) e1071: misc functions of the department of statistics, TU Wien. http://CRAN.R-project.org/package=e1071

Platt J (2000) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Smola A, Bartlett P, Schoelkopf B, Schuurmans D (eds) Advances in Large Margin Classifiers, pp 61–74, mIT Press

R Core Team (2014) R: a language and environment for statistical computing. R Foundation for Statistical Comput- ing, Vienna, Austria, http://www.R-project.org/

Read J, Pfahringer B, Holmes G (2008) Multi-label classification using ensembles of pruned sets. In: Proceedings of the 8th IEEE International Conference on Data Mining, pp 995–1000, DOI 10.1109/ICDM.2008.74

Read J, Pfahringer B, Holmes G, Frank E (2011) Classifier chains for multi-label classification. Machine Learning 85(3):333–359, DOI 10.1007/s10994-011-5256-5

Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning 37(3):297–336, DOI 10.1023/A:1007614523901

Schapire RE, Singer Y (2000) Boostexter: a boosting-based system for text categorization. Machine Learning 39(2):135–168, DOI 10.1023/A:1007649029923

Srivastava A, Zane-Ulman B (2005) Discovering recurring anomalies in text reports regarding complex space systems. In: Proceedings of the 2005 IEEE Aerospace Conference, pp 3853–3862, DOI 10.1109/AERO.2005.1559692

Tai F, Lin HT (2012) Multilabel classification with principal label space transformation. Neural Computation 24(9):2508–2542

Trohidis K, Tsoumakas G, Kalliris G, Vlahavas I (2008) Multilabel classification of music into emotions. In: Proceed- ings of the 9th International Conference on Music Information Retrieval, Philadelphia, PA, USA, pp 325–330

Tsoumakas G, Katakis I (2007) Multi-label classification: an overview. International Journal of Data Warehousing and Mining 3:1–13

Tsoumakas G, Vlahavas I (2007) Random k-labelsets: an ensemble method for multilabel classification. In: Pro- ceedings of the 18th European Conference on Machine Learning, Springer, Berlin, pp 406–417, DOI 10.1007/ 978-3-540-74958-5_38

Tsoumakas G, Katakis I, Vlahavas I (2011) Random k-labelsets for multilabel classification. IEEE Transactions on Knowledge and Data Engineering 23(7):1079–1089, DOI 10.1109/TKDE.2010.164

Vapnik VN (2000) The nature of statistical learning theory. 2nd edition. Springer

Wilcoxon F (1945) Individual comparisons by ranking methods. Biometrics Bulletin 1(6):80–83, http://www.jstor. org/stable/3001968

Ypma TJ (1995) Historical development of the newton-raphson method. SIAM Review 37(4):531–551

Zhang ML, Zhou ZH (2005) A k-nearest neighbor based algorithm for multi-label classification. In: Proceedings of the 1st IEEE International Conference on Granular Computing, vol 2, pp 718–721, DOI 10.1109/GRC.2005.1547385

Zhang ML, Zhou ZH (2007) Ml-knn: a lazy learning approach to multi-label learning. Pattern Recognition 40(7):2038 – 2048, DOI 10.1016/j.patcog.2006.12.019

designed for accessibility and to further open science