Label ranking is a prediction task which deals with learning a mapping between an instance and a ranking (i.e., order) of labels from a finite set, representing their relevance to the instance [Zhou et al., 2014]. Due to its wide applicability, label ranking has attracted a lot of focus from the ar-tificial intelligence community in recent years [H¨ullermeier et al., 2008; Cheng et al., 2009; Aiguzhinov et al., 2010; Cheng et al., 2013; Zhou et al., 2014; Gurrieri et al., 2014; Destercke et al., 2015; Aledo et al., 2017; S´a et al., 2017; Zhou and Qiu, 2018].
Applications of label ranking include for example, text classification, where a news article may belong to multiple topics, and the goal of the label ranking algorithm is to rank the topics according to their relevance to the document. In pattern recognition, objects can be ordered according to their relevance to the image [Yang et al., 2016]. In meta-learning, a label ranking model can provide a list of algorithms to a given problem, ranked according to their fit to the problem, based on the characteristics of the problem at hand [Brazdil et al., 2003].
Label ranking tasks must not be confused with multi-label tasks [Tsoumakas et al., 2009] nor with learning to rank tasks [Cohen et al., 1998]. In label ranking tasks, the target attribute of each instance contains a ranking of labels, representing their relative relevance to the instance. In contrast, in multi-label tasks, the target attribute is a non-ranked subset of relevant labels, and in learning to rank tasks, the goal is to produce a ranked list of the instances themselves.
Boosting is a well-known and reliable ensemble technique [Schapire, 2003] that was shown to often outperform other learning algorithms. AdaBoost [Freund and Schapire, 1997] is one of the most widely used classification boosting techniques. Variations of AdaBoost were developed for multi-class tasks [Freund and Schapire, 1997], multi-label tasks [Schapire and Singer, 2000], regression tasks [Drucker, 1997; Solomatine and Shrestha, 2004], and learning to rank tasks [Cohen et al., 1998; Xu and Li, 2007; Wu et al., 2010]. However, to the best of our knowledge, no boosting algorithm was suggested for label ranking tasks.
In this paper, we propose a novel boosting algorithm, AdaBoost.LR, which was specifically designed for label ranking tasks. An extensive evaluation of AdaBoost.LR over 24 semi-synthetic and real-world datasets shows that it signifi-cantly outperforms existing state-of-the-art label ranking algorithms.
The rest of this paper is organized as follows: we first discuss label ranking ensembles in section 2. Thereafter, we describe our proposed method in section 3. This is followed by an overview of our experimental setting in section 4, a description of our extensive evaluation in section 5, and a summary and suggestions for future work in section 6.
Numerous label ranking algorithms were suggested in the literature. One approach is based on turning the problem into several binary classification problems and then combining them into output rankings. (e.g. [H¨ullermeier et al., 2008; Cheng et al., 2013; Destercke et al., 2015]). Another common approach is based on modifying existing probabilistic algorithms to directly support label ranking. Some main examples are: naive Bayes models [Aiguzhinov et al., 2010], k-nearest neighbor models [Brazdil et al., 2003] and decision tree models [Cheng et al., 2009; de S´a et al., 2015]. For a surveys on label ranking algorithms see e.g. [Zhou et al., 2014].
In order to improve the performance of label ranking algorithms, four recent papers suggested the use of ensembles [Aledo et al., 2017; S´a et al., 2017; Werbin-Ofir et al., 2019; Zhou and Qiu, 2018]. The ensembles proposed in these papers differ in several aspects. Aledo et al. [2017] used Label Ranking Trees (LRT) [Cheng et al., 2009] as the base label ranking algorithm , whereas Sa et al. [2017] used Ranking Trees (RT) and Entropy Ranking Trees (ERT) [de S´a et al., 2015], Zhou and Qiu. [2018] developed their own method named Top Label As Class (TLAC) and Werbin et al. [2019] used both LRT and Ranking by Pairwise Comparison (RPC) [H¨ullermeier et al., 2008]. To select the training data for each simple classifier, Aledo et al. [2017] and Werbin et al. [Werbin-Ofir et al., 2019] used a technique known as Bootstrap aggregation or Bagging [Breiman, 1996], whereas Sa et al. [2017] and Zhou and Qiu [2018] suggested modifications to the well-known Random Forest technique [Breiman, 2001]. As for the aggregation method used, three studies [de S´a et al., 2015; Aledo et al., 2017; Zhou and Qiu, 2018] used a single predefined voting rule (either Borda or Modal Ranking) and one study [Werbin-Ofir et al., 2019] used a Voting Rule Selector (VRS).
To the best of our knowledge, boosting for label ranking has been so far overlooked. The boosting variations that are perhaps the most closely related to the label ranking task are AdaBoost.MR [Schapire and Singer, 2000], which was designed for multi-label tasks, and AdaBoost.R2, which was designed for regression tasks [Drucker, 1997]. AdaBoost.MR assumes that each instance in the training set is associated with a subset of relevant labels, and given a new instance, the goal is to predict a subset of relevant labels. AdaBoost.R2 assumes that each instance in the training set is associated with a numeric value, and given a new instance, the goal is to predict a numeric value. In contrast, our proposed boosting-based algorithm assumes that each instance in the training set is associated with a ranking (not necessarily complete) of all labels, and given a new instance, the goal is to predict a complete ranking of all labels. We proceed to describe the proposed algorithm in section 3.
We now describe our proposed boosting algorithm for label ranking tasks: AdaBoost.LR. A pseudo-code of the algorithm is provided in Algorithm 1.
The algorithm receives as input a training set of m instances with rankings as targets, a base label ranking learning algorithm, and the maximum number of iterations T (line 1). The algorithm begins by initializing the weights of all training instances to the same value (line 2).
Then, the algorithm proceeds in iterations as follows (lines 3-13). First, a set of training instances is sampled based on the current weights
of the instances (line 4).. Next, the base learning algorithm is called with the sampled set of training instances, and the resulting trained model (hypothesis)
is returned (line 5). Then, the the trained model is applied on each training instance, and the loss for each training example, denoted by
, is calculated as the Kendall-tau-b coefficient [Kendall, 1938], kt, between the predicted ranking
and the target ranking
of training instance i (line 6):
Next, the adjusted loss for each training instance , is calculated by dividing its loss
by the maximum loss of all training instances (line 7):
(3) The performance of the weak model is then evaluated by computing the average (adjusted) loss over all instances (line 8):
In the next steps, the confidence of the model (line 9) and the weight of the model
(line 10) are calculated:
Then, the weights of the training instances are updated towards the next iteration. Training instances for which the model predicted poorly will receive a higher weight, thereby ”forcing” the next trained model to ”pay special attention” to them. The calculation of the new weight, , for a training instance i, depends on the confidence of the model at time t, (
), and the adjusted similarity,
, between the predicted value and the actual target (line 11):
Where is a normalization factor:
The process is repeated until T weak models are constructed or until (line 3). Finally, the algorithm outputs the weak models
and their weights
(line 14).
Prediction for a new (unseen) instance is done by retrieving the weak models’ outputs and their weights, and then employing weighted Borda. Formally, let there be n labels . Let
be 1 if
was ranked above
in the output of model t, and 0 otherwise. The score of label
is:
The n labels are then ranked according to their scores.
Our proposed boosting algorithm for label ranking tasks, AdaBoost.LR, was inspired by the AdaBoost.R2 algorithm for regression tasks [Drucker, 1997]. Notable differences between AdaBoost.R2 and AdaBoost.LR are highlighted below:
• Nature of target attribute and base algorithm - In regression tasks the target attribute contains numeric values. In contrast, in label ranking tasks, the target attribute contains rankings of labels from a predefined set. Consequently, the base algorithm used by a label ranking ensemble must be able to handle a label ranking target.
• Loss calculation - In regression tasks, the loss is computed as the absolute difference between the predicted and actual numeric values. However, in our case, we need to calculate the error between the predicted and actual rankings. To do so, we used the Kendall-tau coeffi-cient function. Kendall-tau counts the minimal number of swaps needed to transform one ranking into another ranking. Specifically, we used Kendall-tau-b which can handle ties in the order of the two rankings, and a normalized version which measures the similarity between two rankings on a scale of means that the two rankings are entirely opposite and 1 means the two rankings are identical). Thus, equation 2 measures (for each training instance i at time t) the difference between
the predicted ranking and the real ranking. A training instance i with high value means it was poorly predicted.
• Output calculation - In regression tasks the output of each weak model is a numeric value, and the different outputs are combined using weighted median. In the label ranking case, the output of each weak model is a ranking. Moreover, in the boosting case, each weak model has a different importance as measured by its weight. Therefore we employed a weighted Borda method that is able to aggregate a set of rankings into a single ranking, and take into account their corresponding weights. This aggregation step is formally described in equation 8.
In this section, we describe the compared methods, the datasets used and the experimental flow.
4.1 Compared Methods
We compared the prediction performance of a simple single model for label ranking based on LRT, and the following four label ranking ensembles:
• Boosting - the AdaBoost.LR algorithm for label ranking tasks that we developed.
• Bagging - Modal Ranking - an existing label ranking ensemble which was described in [Aledo et al., 2017]. This algorithm uses Bagging as the data sampling technique and Modal Ranking as the aggregation technique.
• Bagging - VRS - an existing label ranking ensemble which was described in [Werbin-Ofir et al., 2019]. This algorithm uses Bagging as the data sampling technique and a voting rule selector, which automatically selects the best voting rule to be used as the aggregation technique.
• Random Forest - an existing label ranking ensemble which was described in [S´a et al., 2017; Zhou and Qiu, 2018]. This algorithm uses Random Forest as the data sampling technique and Borda as the aggregation technique.
To ensure a fair comparison, we re-implemented the compared ensembles, and all ensemble implementations used LRT as the base label ranking algorithm. LRT was chosen as the base label ranking algorithm, since it was shown to work well in previous studies [Aledo et al., 2017; H¨ullermeier et al., 2008].
4.2 Datasets
As a benchmark, we used a total of 24 datasets. The first 16 were proposed in [Cheng et al., 2009] and used since then as a standard benchmark for the label ranking problem. They can be considered semi-synthetic (SS), as they were obtained by transforming a multi-class or regression problem to a label ranking one (see [Cheng et al., 2009] for the transformation details). The last eight are real-world (RW) datasets. Five of them correspond to real-world biological data problems and were used by [H¨ullermeier et al., 2008]. The labels in these datasets represent the expression level of genes of the yeast genome, where each gene has a phylogenetic profile of length 24 (i.e., 24 attributes). In this case the expression profile of the output gene was directly converted into a rank (e.g. was converted into (2, 1, 3, 4)). The other three datasets were used by [de S´a et al., 2018; Werbin-Ofir et al., 2019]. One dataset contains demographics and preferences over sushi types [Kamishima, 2003] and the other two contain social-economic and electoral results from Germany [Boley et al., 2013].
4.3 Experimental Flow
Our experimental flow follows a similar approach to that of Aledo et al. [Aledo et al., 2017] and Werbin et al. [Werbin- Ofir et al., 2019] and is composed of four main steps:
1. Cross Validation. Given a label ranking dataset, we employed a 10-fold cross-validation process, in which the dataset was partitioned into ten distinct folds, and then in ten iterations, one fold was used as a test set and the remaining were used for the training set.
2. Data Sampling and Training. Given a training set, multiple bags were sampled, and each such bag was used to train a single label ranking prediction model. For our
Table 1: Datasets Characteristics
evaluation, we used 50 bags. In the case of boosting, the sampling of bags was done sequentially, where the sampling of a bag in a given iteration depended on the weights of training instances, which in turn depended on the performance of the weak model trained in the previous iteration.
3. Prediction. Each one of the test instances was given as input to each one of the 50 trained models, and the 50 outputted rankings were aggregated into a single aggregated output, using Modal Ranking in the case of Bagging, Borda in the case of Random Forest and weighted Borda in the case of Boosting.
4. Evaluation. The aggregated ranking was compared to the real ranking using the Kendal-tau-b coefficient. The results were averaged over all test instances, and then over the ten test folds, to produce a single measure for a dataset.
In the first experiment, we wanted to compare the prediction performance of the five compared methods: Boosting, Random Forest, Bagging (two variations) and Single Model. To do so, we first calculated the average Kendall-Tau-b for each method (over all instances in a given dataset). Then, we calculated the improvement in percentages in comparison to the simple model approach, for each dataset. Finally, we averaged these numbers over all 24 datasets.
Fig. 1 reports the average improvement (in percentages) of each of the ensemble methods in comparison to the single
Figure 1: Improvement of each ensemble method in comparison to the single model method
model approach. As can be seen from the figure, all four ensemble methods performed better than the single model approach. This result emphasizes the superiority of ensemble methods over simple models in general, and in the case of label ranking tasks in particular. We further observe that Boosting outperforms the other ensemble methods with an average improvement of 60% (compared to 49% , 53%, and 15% obtained by Random Forest, Bagging: VRS and Bagging: Modal Ranking respectively).
Fig. 2 provides a breakdown of the results reported in Fig. 1 to real-world datasets (left) and semi-synthetic ones (right). As can be seen in the figure, the improvement obtained by all ensemble methods in the case of real-world datasets is substantially better. For example, in the case of Boosting we observe an average improvement of 121% in the case of real-world datasets, compared to an average improvement of 29% in the case of semi-synthetic datasets.
While the figures above reported the average results over multiple datasets, Fig 3 provides a breakdown to the 21 datasets, where each entry represents the rank of the given method for the given dataset (a rank of 1 means that the given method performed the best for the given dataset, and a rank of 4 means that it performed the worst). As can be seen from Fig 3, Boosting defeats the four other methods in 20 out of 24 of the datasets. In contrast, VRS was ranked first in only three of the 21 datasets, Random forest was ranked first once and for all 24 datasets - neither Bagging with Modal Ranking nor Simple Model obtained the first place.
Finally, we were interested in testing how the number of weak models used in the ensemble affects prediction performance. Fig 4 shows the improvement in percentages as a function of the number of weak models used for each of the aggregation methods. As can be seen in Fig 4, the prediction performance of all ensemble methods increases with the number of weak models used. Moreover, we observe that the Boosting method outperforms the rest of the methods for all considered numbers of weak models. Lastly, we notice a diminishing effect, where prediction performance seems to stabilize for all of the considered aggregation methods, when using 50 weak models.
In order to complete the picture, we performed statistical
Figure 2: Breakdown of improvement to real-world datasets (top) and semi-synthetic ones (bottom).
significance tests to compare between the five methods of interest. We followed the robust non-parametric procedure described in [Garc´ıa et al., 2010]. First, we used the Friedman Aligned Ranks test in order to reject the null hypothesis that all methods perform the same. The Friedman Aligned Ranks test with a confidence level of 95% rejected the nullhypothesis that all methods perform the same. Then, we used the Conover post hoc test to perform pairwise comparison between the five methods. We found that in accordance with our findings above, Boosting performed significantly better than the other four methods.
In this paper, we proposed a novel boosting-based algorithm, AdaBoost.LR, for improving the prediction performance of label ranking ensembles. An extensive evaluation that we performed showed that the proposed algorithm, performed significantly better than existing state-of-the-art algorithms for label ranking. In particular, our algorithm defeated the benchmark methods in 20 out of the 24 publicly available label ranking datasets.
Our evaluation considered the case of complete rankings only. However, given a base label ranking algorithm that accepts partial rankings as inputs (See for example [Aledo et al., 2017]), AdaBoost.LR can be applied without any change. Therefore, in future work, we plan to evaluate the performance of AdaBoost.LR in the case of partial rankings.
Figure 3: Rank of each method for each of the 21 dataset
Figure 4: Improvement as a function of the number of weak models used
It would be interesting to explore other types of label ranking ensembles in addition to the ones already studied (i.e., Bagging, Random Forest, and now Boosting). For example, the Stacking approach [Wolpert, 1992] can be extended to support label ranking tasks, by developing a meta label ranker that receives the output rankings of simple models as input attributes and returns a single aggregated ranking as output.
We believe that insights gained in the process of improving the competence of label ranking ensembles, may be valuable to a variety of fields that are concerned with rankings, including: combining voters’ preferences in computational social choice [Chevaleyre et al., 2007], ensembles in machine learning, rank aggregation in information retrieval [Dwork et al., 2001], group recommendations in recommender systems, phylogenetic profiling [Balasubramaniyan et al., 2004], and even error correction codes [Procaccia et al., 2016].
[Aiguzhinov et al., 2010] Artur Aiguzhinov, Carlos Soares, and Ana Paula Serra. A similarity-based adaptation of
naive bayes for label ranking: Application to the metalearning problem of algorithm recommendation. In International Conference on Discovery Science, pages 16–26. Springer, 2010.
[Aledo et al., 2017] Juan A Aledo, Jos´e A G´amez, and David Molina. Tackling the supervised label ranking problem by bagging weak learners. Information Fusion, 35:38–50, 2017.
[Balasubramaniyan et al., 2004] Rajarajeswari Balasubramaniyan, Eyke H¨ullermeier, Nils Weskamp, and J¨org K¨amper. Clustering of gene expression data using a local shape-based similarity measure. Bioinformatics, 21(7):1069–1077, 2004.
[Boley et al., 2013] Mario Boley, Michael Mampaey, Bo Kang, Pavel Tokmakov, and Stefan Wrobel. One click mining: Interactive local pattern discovery through implicit preference and performance learning. In Proceedings of the ACM SIGKDD workshop on interactive data exploration and analytics, pages 27–35. ACM, 2013.
[Brazdil et al., 2003] Pavel B Brazdil, Carlos Soares, and Joaquim Pinto Da Costa. Ranking learning algorithms: Using ibl and meta-learning on accuracy and time results. Machine Learning, 50(3):251–277, 2003.
[Breiman, 1996] Leo Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996.
[Breiman, 2001] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
[Cheng et al., 2009] Weiwei Cheng, Jens H¨uhn, and Eyke H¨ullermeier. Decision tree and instance-based learning for label ranking. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 161–168. ACM, 2009.
[Cheng et al., 2013] Weiwei Cheng, Sascha Henzgen, and Eyke H¨ullermeier. Labelwise versus pairwise decomposition in label ranking. In LWA, pages 129–136, 2013.
[Chevaleyre et al., 2007] Yann Chevaleyre, Ulle Endriss, J´erˆome Lang, and Nicolas Maudet. A short introduction to computational social choice. In International Conference on Current Trends in Theory and Practice of Computer Science, pages 51–69. Springer, 2007.
[Cohen et al., 1998] William W Cohen, Robert E Schapire, and Yoram Singer. Learning to order things. In Advances in Neural Information Processing Systems, pages 451–457, 1998.
[de S´a et al., 2015] Cl´audio Rebelo de S´a, Carla Rebelo, Carlos Soares, and Arno Knobbe. Distance-based decision tree algorithms for label ranking. In Portuguese Conference on Artificial Intelligence, pages 525–534. Springer, 2015.
[de S´a et al., 2018] Cl´audio Rebelo de S´a, Wouter Duivesteijn, Paulo Azevedo, Al´ıpio M´ario Jorge, Carlos Soares, and Arno Knobbe. Discovering a taste for the unusual: exceptional models for preference mining. Machine Learning, 107(11):1775–1807, 2018.
[Destercke et al., 2015] S´ebastien Destercke, Marie-H´el`ene Masson, and Michael Poss. Cautious label ranking with label-wise decomposition. European Journal of Operational Research, 246(3):927–935, 2015.
[Drucker, 1997] Harris Drucker. Improving regressors using boosting techniques. In The 14th Intenational Conferences on Machine Learning, pages 107–115. Morgan Kaufmann, 1997.
[Dwork et al., 2001] Cynthia Dwork, Ravi Kumar, Moni Naor, and Dandapani Sivakumar. Rank aggregation methods for the web. In Proceedings of the 10th international conference on World Wide Web, pages 613–622. ACM, 2001.
[Freund and Schapire, 1997] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
[Garc´ıa et al., 2010] Salvador Garc´ıa, Alberto Fern´andez, Juli´an Luengo, and Francisco Herrera. Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Information Sciences, 180(10):2044–2064, 2010.
[Gurrieri et al., 2014] Massimo Gurrieri, Philippe Fortemps, and Xavier Siebert. Alternative decomposition techniques for label ranking. In International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, pages 464–474. Springer, 2014.
[H¨ullermeier et al., 2008] Eyke H¨ullermeier, Johannes F¨urnkranz, Weiwei Cheng, and Klaus Brinker. Label ranking by learning pairwise preferences. Artificial Intelligence, 172(16-17):1897–1916, 2008.
[Kamishima, 2003] Toshihiro Kamishima. Nantonac collaborative filtering: recommendation based on order responses. In Proceedings of the ninth ACM SIGKDD in-
ternational conference on Knowledge discovery and data mining, pages 583–588. ACM, 2003.
[Kendall, 1938] Maurice G Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, 1938.
[Procaccia et al., 2016] Ariel D Procaccia, Nisarg Shah, and Yair Zick. Voting rules as error-correcting codes. Artificial Intelligence, 231:1–16, 2016.
[S´a et al., 2017] Cl´audio Rebelo S´a, Carlos Soares, Arno Knobbe, and Paulo Cortez. Label ranking forests. Expert Systems, 34(1), 2017.
[Schapire and Singer, 2000] Robert E Schapire and Yoram Singer. Boostexter: A boosting-based system for text categorization. Machine learning, 39(2-3):135–168, 2000.
[Schapire, 2003] Robert E. Schapire. The boosting approach to machine learning: An overview. In Nonlinear Estimation and Classification, pages 149–171. Springer, New York, NY, 2003.
[Solomatine and Shrestha, 2004] Dimitri P Solomatine and Durga L Shrestha. Adaboost. rt: a boosting algorithm for regression problems. In 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No. 04CH37541), volume 2, pages 1163–1168. IEEE, 2004.
[Tsoumakas et al., 2009] Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas. Mining multi-label data. In Data mining and knowledge discovery handbook, pages 667–685. Springer, 2009.
[Werbin-Ofir et al., 2019] Havi Werbin-Ofir, Lihi Dery, and Erez Shmueli. Beyond majority: Label ranking ensembles based on voting rules. Expert Systems with Applications, 2019.
[Wolpert, 1992] David H Wolpert. Stacked generalization. Neural networks, 5(2):241–259, 1992.
[Wu et al., 2010] Qiang Wu, Christopher JC Burges, Krysta M Svore, and Jianfeng Gao. Adapting boosting for information retrieval measures. Information Retrieval, 13(3):254–270, 2010.
[Xu and Li, 2007] Jun Xu and Hang Li. Adarank: a boost- ing algorithm for information retrieval. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 391–398. ACM, 2007.
[Yang et al., 2016] Hao Yang, Joey Tianyi Zhou, Yu Zhang, Bin-Bin Gao, Jianxin Wu, and Jianfei Cai. Exploit bounding box annotations for multi-label object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 280–288, 2016.
[Zhou and Qiu, 2018] Yangming Zhou and Guoping Qiu. Random forest for label ranking. Expert Systems with Applications, 112:99 – 109, 2018.
[Zhou et al., 2014] Yangming Zhou, Yangguang Liu, Jiangang Yang, Xiaoqi He, and Liangliang Liu. A taxonomy of label ranking algorithms. Journal of Computers, 9(3):557–
565, 2014.