An interpretable semi-supervised classifier using two different strategies for amended self-labeling

2020·Arxiv

ABSTRACT

In the context of some machine learning applications, obtaining data instances is a relatively easy process, but labeling them can become quite expensive or tedious. Such scenarios lead to datasets with few labeled instances and a larger number of unlabeled ones. Semi-supervised classification techniques combine labeled and unlabeled data during the learning phase in order to increase the classifier’s generalization capability. Regrettably, most successful semi-supervised classifiers do not allow explaining their outcome, thus behaving like black boxes. However, there is an increasing number of problem domains in which experts demand a clear understanding of the decision process. In this paper, we report on an extended experimental study presenting an interpretable self-labeling grey-box classifier that uses a black box to estimate the missing class labels and a white box to explain the final predictions. Two different approaches for amending the self-labeling process are explored: the first based on the confidence of the black box and the second based on measures from Rough Set Theory. The results of the extended experimental study support the interpretability of our classifier by means of transparency and simplicity, while attaining superior prediction rates when compared with state-of-the-art self-labeling classifiers reported in the literature.

Keywords: Semi-supervised Classification · Self-labeling · Interpretability · Explainable Artificial Intelligence · Grey-Box Model · Rough Set Theory

1 Introduction

Gathering data examples for training a machine learning classifier in a real-world scenario is often simple, but the process of assigning labels to the examples can be costly in terms of money, time or effort. In such scenarios we might obtain datasets with more unlabeled than labeled data. This is often the case in applications such as image classification [1], industrial fault classification [2], sentiment analysis [3], speaker identification [4] and bioinformatics or medical applications [5]. Semi-supervised classification (SSC) techniques arise from the need to address this problem using both labeled and unlabeled data for training a classifier. The aim is to increase the generalization ability of the classifier when compared to a supervised classifier that only uses the available labeled data.

The SSC literature reports several techniques including transductive Support Vector Machines [6], Graph-based methods [7], Generative Mixture Models [8], Self-labeling techniques [9] and, more recently, semi-supervised Generative Adversarial Networks [10]. In general, state-of-the-art SSC methods involve three main shortcomings that may vary from a specific family of algorithms to the whole field. The first potential issue, affecting all SSC models, refers to the assumption that the unlabeled data help elucidate the distribution of the labeled instances. When this assumption is not met in any of its forms, semi-supervised learning may not be useful. Secondly, some techniques such as Graph-based methods mainly focus on transductive learning, i.e., predicting the label for a given set of unlabeled data rather than finding a model capable of predicting the class of unseen instances with proper generalization. Thirdly, while self-labeling approaches such as Co-training [11], Self-training [12] and their variants perform quite well in terms of accuracy, they often result in complex structures that combine several classifiers and fail to give the user insight into how the classification process comes about.

An increasing requirement observed in machine learning is to obtain not only precise models but also interpretable ones. End users often demand an insight into how an algorithm arrives at a particular outcome and need an explanation of the decisions to some extent. In general, explainable artificial intelligence is starting to be a central concern in both governing and research communities. For example, the EU General Data Protection Regulation includes a right to obtain an explanation on the decisions made by an algorithm affecting human beings [13]. This regulation might limit the potential of using artificial intelligence in a variety of domains, unless we start developing more transparent models.

Recent studies [14, 15, 16, 17] formalize terms such as interpretability or explainability in sometimes overlapping concepts. However, a common conclusion is that a certain degree of global interpretability can be reached through the use of more transparent techniques as proxies for solving a task. In this paper, we refer to intrinsically interpretable models (e.g., linear regression, decision trees or rule induction algorithms) as white boxes, as opposed to the less interpretable black-box ones (e.g., artificial neural networks or support vector machines). Black boxes are normally more accurate techniques that learn exclusively from data, but they are not easily understandable at a global level. White boxes, in contrast, are models constructed from laws or principles of the problem domain, or models built from data whose structure allows for explanation or interpretation, since pure white boxes rarely exist [18]. Intrinsically interpretable models can be recommended when a transparent model that can be inspected as a whole is needed and the prediction problem does not require a very powerful technique. On the other hand, agnostic post-hoc methods [19, 20] are a suitable alternative when a black box is already built and we need to compute explanations for input and output pairs while preserving accuracy. However, post-hoc methods generate explanations that are often local or limited to feature attribution rather than a holistic view of the model. Grey-box models, i.e., white boxes used as surrogates for distilling previously trained black boxes, are an approach in between intrinsic interpretability and model-agnostic post-hoc explanation. While white boxes attempt to explain the problem domain directly, grey boxes explain the domain by approximating the predictions produced by a black-box classifier.

In this paper, we study the SSC problem from the interpretability angle. We conduct a detailed review of methods reported in the literature and discuss their shortcomings when interpretability comes into play. We explore the performance of our semi-supervised classifier termed self-labeling grey-box (SlGb) [21, 22], which combines the strength of black-box models as good classifiers with the interpretability of white boxes. In terms of interpretability, we refer to a grey-box model as the combination of a black-box model with a white-box one. Our classifier uses a black box to estimate the decision class for unlabeled instances in order to increase the amount of training data. Afterwards, our approach builds a surrogate white-box classifier from the enlarged dataset that allows explaining the predictions. In addition, we explore the effects of using two weighting strategies to reduce the effect of misclassifications when building the enlarged dataset. The former is based on the black box’s confidence for the inferred class label, while the latter is based on granular computing principles. The use of an enlarged dataset combined with a weighting strategy results in a white box with improved prediction rates. Numerical experiments using 55 datasets in different settings show that our proposal attains a good balance between prediction rates and explainability, while outperforming most state-of-the-art methods.

The rest of this paper is structured as follows. Section 2 provides an overview of state-of-the-art SSC algorithms reported in the literature and their interpretability, with an emphasis on self-labeling techniques. Section 3 describes the SlGb approach and Section 4 presents two alternatives for amending the self-labeling performed by the black-box classifier. Section 5 introduces the numerical simulations and an extensive discussion covering the performance and interpretability of the SlGb. Section 6 presents the concluding remarks and research directions to be explored in the future.

2 Semi-supervised Classification Methods and their Interpretability

In supervised classification the goal is to identify the right category (among those in a predefined set) to which an observation belongs. These observations (henceforth called instances) are often described by a set of numerical and/or nominal attributes. Solving this problem requires defining a mapping f that assigns a decision class y in Y to each instance x described by a set of attributes. The mapping is learned from data in a supervised fashion, i.e., by relying on a set of previously labeled examples used to train the classifier.

Semi-supervised techniques attempt to use both labeled and unlabeled instances during the learning process in order to increase the prediction capacity obtained when only labeled data is used. More formally, in an SSC scenario we have a set L of m instances, which are associated with their respective class labels in Y, and a set U of n unlabeled instances, where usually n > m. In the context of SSC, the classifier performance can be evaluated in two settings: (1) transductive learning, which only attempts to predict the labels for the given unlabeled instances in U; or (2) inductive learning, which tries to infer a mapping for predicting the class label of any instance associated with the classification problem.

In this section, we review the main state-of-the-art methods for semi-supervised classification, including an analysis of their interpretability. Here, we evaluate the interpretability as the inherent model transparency, as described in [16].

2.1 Semi-supervised Classification Methods

As mentioned, SSC methods often involve assumptions about the distribution or characteristics of the unlabeled data [23]. For example, transductive Support Vector Machines (tSVMs) [6] assume that the decision boundary lies in a low-density region. This method uses unlabeled data for maximizing the margin between the different classes by placing the decision boundaries in sparse regions. However, because the complexity of the optimization problem increases in the semi-supervised setting, its computational burden is quite high and it does not scale well to large datasets. Recent studies [24, 25] try to overcome this limitation by using the concave-convex procedure and variations of stochastic gradient descent to solve the optimization problem. Although SVMs are a powerful technique with a strong mathematical framework for building classifiers, they have the drawback of working as black boxes from the interpretability point of view. The lack of transparency of SVMs does not allow them to produce explanations or interpretations of the obtained model. In this case, post-hoc methods are necessary whenever explanations of the obtained predictions are required.

Graph-based methods [7] assume that high-dimensional data lie on a low-dimensional manifold [26]. These methods represent the data space as a graph (i.e., if two instances are strongly connected, then they likely belong to the same class) and estimate a continuous function which is close enough to the label values, with the ultimate goal of propagating labels between similar instances. Recent works on Label Propagation methods [27, 28] are mainly focused on the construction of an effective graph over data with complex distribution and reducing the risk of error propagation through outliers. This approach could be interpretable to some extent by visually inspecting the obtained graph from the structural point of view, allowing some transparency at the parameters level. A first work toward this direction can be found in [29], where the authors propose a flow sub-graph framework which visualizes the path along the information flow from a source labeled instance to a target unlabeled instance. These sub-graphs can be seen as rather local explanations in the form of visualizations of the model. Their usability is limited to data that can be represented in the graph replacing the abstract representation of the node (e.g. images). A more general option is to obtain kNN-like explanations with examples by leveraging the graph structure, e.g. “the predicted label of instance was propagated from instances

A third approach assumes that the data follow an identifiable mixture distribution (Generative Mixture Models [8], GMMs); hence, these methods learn a joint probability for identifying the mixture components using the unlabeled data. This approach may be convenient when the available data produce well-separated clusters [23], but most of the time the joint distribution is not easily identifiable. Here the estimated mixture distribution could be interpretable at a very high abstraction level if the representation space of the problem at hand is not too complex. However, the unlabeled data could have a negative effect on the algorithm’s performance if the generative model is wrong. From the interpretability point of view, the classification of a new instance can leverage the Bayes rule for building a (rather abstract) explanation: “is the most probable value of y for since the probability is high when ”. Moreover, the estimated mixture distribution could only be visualized in a low-dimensional feature space for gaining insights into the clusters found by the model. In our opinion, GMMs require the use of post-hoc methods or global surrogates to make their results interpretable. An interesting work in this direction includes generating rectangular regions from the clusters and transforming them into rules [30].

More recently, deep architectures have been explored by extending graph-based methods [31] and generative models [32]. Particularly successful has been the extension of Generative Adversarial Networks (GAN) [33] to the SSC context [34, 10]. For example, Feature Matching GANs [10] use a discriminator for c + 1 labels instead of the binary “real/fake” distinction, where the first c are the class labels of the problem and label c + 1 corresponds to the generated instances. The authors in [35] theoretically analyze whether a good generator and a good discriminator for semi-supervised learning can be obtained at the same time. The study concludes that the generator should be “bad” in the sense of assigning high probabilities to low-density regions of the input space according to the true distribution, in order to complement the true data distribution and improve the semi-supervised performance. Regarding interpretability, deep neural networks are black-box models that need post-hoc procedures for generating explanations of their predictions. The majority of contributions focus on local surrogate models or feature importance methods specifically designed for deep multilayer, convolutional or recurrent neural networks [36, 37]. Interesting works connected to the semi-supervised setting include learning disentangled latent representations in a variational autoencoder, i.e., latent variables with an interpretable meaning coming from labeled data are added to the latent representation [38]. These interpretable latent variables can be used later on for inspecting their influence on the prediction.

Finally, self-labeling refers to a wide family of very powerful and versatile wrapper methods that employ one or more base classifiers for enlarging the available labeled dataset, assuming the predictions they produce on the unlabeled data are correct. Since our contribution falls within this category, we review those SSC methods in a separate subsection for the sake of clarity.

2.2 Self-labeling Techniques

According to [9], self-labeling techniques can be categorized into single-view or multi-view methods based on whether they need one or multiple datasets for learning. Self-training approaches [12] are single-view wrapper classifiers, which rely on the prediction of only one base classifier to repeatedly increase the size of the labeled dataset by predicting the unlabeled instances. The instances are added incrementally, in batch [39] or in an amending procedure [40]. The use of amending procedures allows selecting or weighting the self-labeled instances for enlarging the labeled dataset, hence avoiding error propagation.

The multi-view methods assume that the data space can be described from two or more different viewpoints. These different views normally correspond to distinct sets of attributes describing the same instances [41]. A classic example of multi-view methods is the Co-training [42] approach, where different classifiers are trained separately, each using a different attribute subset. Thereafter, the prediction of each classifier over the unlabeled dataset is used for enlarging the training set of the other. Other alternatives using multiple classifiers but not needing multi-view datasets are Democratic Co-learning [43], Tri-training [44], Co-training by committee [11] and Co-Forest [45] which use several base classifiers of the same type. Tri-training uses three base classifiers that collaborate in the learning process by labeling an unlabeled example if the other two classifiers agree. An alternative to Co-training is Co-training by committee, which does not require multi-view nor different learning algorithms, and explores different ensemble strategies with Bagging as the best performing one. Similarly, Co-Forest adopts a Random Forest classifier as an alternative for Co-training.

Self-labeling techniques are easy to implement and applicable to almost all existing classifiers [46, 47, 48, 49]. An extensive experiment conducted in [9] shows that CoTraining using Support Vector Machines as a base classifier [11], TriTraining using the C4.5 decision tree [44], CoBagging using C4.5 [11] and Democratic Co-learning (as an ensemble of Naïve Bayes, C4.5 and K-Nearest Neighbors) [43] are the best performing self-labeling classifiers when evaluated against a comprehensive collection of benchmark datasets. Other semi-supervised classifiers that have demonstrated competitive performance on a variety of datasets are self-training using logistic model trees [46], differential evolution [50] or Naïve Bayes [48].

In terms of interpretability, a self-training scheme producing a simulatable model (e.g., a relatively simple tree structure) as the final classifier can be considered a transparent model. More complex schemes such as Tri-training, Co-Bagging, Co-Forest or Co-training are less likely to be interpretable due to the collaborative nature of the algorithms and the complexity of the resulting structure. However, the ensemble character of self-labeling is a perfect match with the use of local or global surrogate models for explaining predictions. Combining base classifiers using self-labeling in a way that the resulting ensemble works as a surrogate white box is the challenge we want to address. In the next section, we describe a simple yet effective self-labeling method which uses two base classifiers, a black box and a white box, for reaching a suitable trade-off between performance and interpretability.

3 Self-labeling Grey-box Approach

In this section we describe the self-labeling grey-box proposed in our previous work [21]. Here, we use a black-box classifier to predict the decision class of the unlabeled instances, while a surrogate white box is used to build an interpretable predictive model (e.g., a rule-based approach) based on the whole instance set. The aim is to outperform the base white-box component trained on only the originally labeled data, while maintaining a good balance between performance and interpretability. It is worth mentioning that the main motivation behind SlGb is not to outperform the most complicated state-of-the-art algorithms but to provide a simple approach allowing for interpretability. In other words, we should be able to produce competitive solutions without significantly increasing the complexity inherent to the base classifiers.

The learning process is performed in sequential order. In the first step, we provide the available labeled dataset (L, Y) to a black-box classifier for training. Once the supervised learning is completed, the black-box component has learned a function f from the hypothesis space that associates each instance with a class label. The function f can be computed from the scoring function of the black box. Thereafter, the trained black-box component is used for generating new tuples (u, y) by mapping each unlabeled instance u in U to a class label y in Y, adding a self-labeling character to the approach. From this step we obtain an enlarged training set comprising the original labeled instances and the extra labeled ones.

In the second step, the enlarged training set is used to train a surrogate white-box classifier. Once the learning process in the white-box component is completed, we obtain a second function, which defines a classifier that is more likely to generalize well than the original white-box component trained on the labeled data alone. Figure 1 summarizes this process.

Figure 1: Blueprint of the SlGb architecture. During the first step, labeled data is used for training a black-box model, which assigns labels to the unlabeled data. Later on, a white-box surrogate model is trained on the enlarged dataset, thus resulting in an interpretable model.

When applying self-labeling, we should be aware of the risk of having imbalanced data with respect to the class labels. It might be easier to obtain unlabeled data of a certain class, for example, in the context of credit fraud detection or rare disease classification. To deal with this problem, our approach additionally incorporates a simple strategy that weights the instances by class as a preprocessing step. The weight of an instance is computed from the sizes of two sets of labeled instances: those mapped to its class label and those mapped to the minority class. In this way we assign higher importance to instances belonging to the minority class.
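As a rough illustration of class-based balancing, the following Python sketch assigns each labeled instance a weight derived from the size of its class and the size of the minority class. The specific ratio used here (minority-class count divided by the count of the instance's class) is an assumption chosen for illustration and is not necessarily the exact expression used by SlGb; the function name `class_balance_weights` is hypothetical.

```python
from collections import Counter

def class_balance_weights(labels):
    """Return one weight per instance: |minority class| / |class of the instance|.

    Assumption: this ratio is an illustrative choice, not necessarily the
    paper's formula. Minority-class instances receive weight 1.0 and
    instances of larger classes receive proportionally smaller weights.
    """
    counts = Counter(labels)
    minority_size = min(counts.values())
    return [minority_size / counts[y] for y in labels]

labels = ["fraud", "ok", "ok", "ok", "ok", "fraud"]
weights = class_balance_weights(labels)
print(weights)
```

With two "fraud" and four "ok" instances, the minority instances keep full weight while each majority instance counts half, so both classes contribute the same total weight to the learner.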

In general, the SlGb approach is only based on the general assumption of SSC methods: the distribution of unlabeled instances helps elucidate the distribution of all examples. In addition, our approach allows retaining the inherent interpretability of the chosen white-box surrogate. According to the taxonomy proposed in [9], our approach can be categorized as follows:

single-view: the SlGb classifier does not need different attribute sets for describing the instances, adding simplicity to the model;

multi-classifier: two different base classifiers are used, connected in a sequential process: the first classifier should be a well-performing black-box supervised classifier, whereas the second should be a white-box technique guaranteeing the interpretability of the final model;

multi-learning: the learning process comprises two steps, where two different learning algorithms are used depending on the base classifiers.

It can be noticed that the performance of the whole SlGb approach largely depends on the prediction capability of the black-box classifier when classifying unseen instances. Obviously, like any other machine learning algorithm, when solving application problems the performance will also depend on the quality of the data and the application of domain-dependent preprocessing steps [52]. However, in the context of self-labeling, the classification mistakes can reinforce themselves if no amending procedure is used during self-training. Therefore, in the next section we describe two amending strategies for the self-labeled instances, in order to prevent the error from propagating through the model.

4 Amending Strategies

In this section, we describe two strategies for weighting the instances that result from the self-training stage. The goal is to improve the quality of the final model either in terms of performance or interpretability. The first strategy uses the confidence of the predictions made by the base black box and the second one focuses on the possible inconsistency of the enlarged dataset. Therefore, both procedures assign more importance to more reliable instances in the second learning step, avoiding the propagation of errors or inconsistent information.

4.1 Using the Class Membership Probabilities of the Black-box Classifier

For the first strategy, the amending of each unlabeled instance u is based on the class membership probability, which is computed by the black-box classifier during self-labeling. The weights are assigned to the instances after they are labeled by the black-box classifier, thus expressing the degree of confidence associated with the self-labeling process. Equation (2) shows how to compute the weight using the scoring function of the black-box base classifier that expresses the class membership probability of being correctly assigned to the

The proposed amending strategy constitutes an alternative to the use of incremental or batch procedures. Our amending does not need several iterations, thus reducing the computational burden of the self-labeling process. The pseudo-code in Algorithm 1 formalizes the method and incorporates the amending step in the general scheme.

Algorithm 1: SlGb learning scheme with the amending step.
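The two-step scheme with confidence-based amending can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the black box is a toy k-nearest-neighbor voter, the white box is a toy one-rule decision stump, each self-labeled instance is weighted by the probability of its predicted class, and originally labeled instances keep weight 1. All function names (`knn_predict_proba`, `fit_weighted_stump`, `slgb_fit`) are hypothetical.

```python
from collections import Counter

def knn_predict_proba(train_X, train_y, x, k=3):
    """Toy black box: class frequencies among the k nearest labeled neighbors."""
    nearest = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )[:k]
    votes = Counter(train_y[i] for i in nearest)
    return {label: count / k for label, count in votes.items()}

def fit_weighted_stump(X, y, w):
    """Toy white box: the rule 'feature <= threshold' with the least weighted error."""
    classes = sorted(set(y))
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for left in classes:
                for right in classes:
                    error = sum(
                        wi for row, yi, wi in zip(X, y, w)
                        if (left if row[feature] <= threshold else right) != yi
                    )
                    if best is None or error < best[0]:
                        best = (error, feature, threshold, left, right)
    return best[1:]

def slgb_fit(L, Y, U, k=3):
    """Step 1: self-label U with the black box. Step 2: fit the weighted white box."""
    X, y = list(L), list(Y)
    w = [1.0] * len(L)                 # assumption: labeled instances keep weight 1
    for u in U:
        proba = knn_predict_proba(L, Y, u, k)
        label = max(proba, key=proba.get)
        X.append(u)
        y.append(label)
        w.append(proba[label])         # assumption: weight = black-box confidence
    return fit_weighted_stump(X, y, w)

L = [(1.0,), (2.0,), (8.0,), (9.0,)]
Y = ["a", "a", "b", "b"]
U = [(1.5,), (2.5,), (7.5,), (5.2,)]
feature, threshold, left, right = slgb_fit(L, Y, U)
print(f"if x[{feature}] <= {threshold} then {left} else {right}")
```

The self-labeling pass runs once over the unlabeled set, with no iterative re-training, and the final model is a single readable rule. In practice any probabilistic classifier and any learner that accepts instance weights could take the place of the two toy components.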

It is important to mention that the black-box classifier should be able to produce calibrated probabilities so that they can be correctly interpreted as the confidence of its predictions. Not all machine learning models provide probabilities that match the expected distribution of probabilities for each class. According to a study on probability estimation by different supervised classifiers [53], logistic regression, multilayer perceptrons and bagged trees naturally provide well-calibrated probabilities, whereas others such as boosted trees and SVMs produce distorted ones. When the calibration of probabilities is needed, two main options are available: Platt’s scaling [54] and isotonic regression [55]. Platt’s scaling is recommended when the distortion in the predicted probabilities has a sigmoid shape, whereas isotonic regression is able to correct any monotonic distortion but requires large amounts of data to avoid overfitting.
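To make the isotonic option concrete, the sketch below fits a non-decreasing step function to (score, outcome) pairs with the pool-adjacent-violators procedure, which is the standard way to solve isotonic regression. It is a simplified illustration: the data are made up, the function names are hypothetical, and prediction uses a plain step function rather than the interpolation that library implementations typically apply. The pairs would normally come from a held-out calibration set.

```python
def isotonic_fit(scores, outcomes):
    """Pool-adjacent-violators fit of a non-decreasing step function."""
    blocks = []  # each block: [sum of outcomes, count, largest score covered]
    for score, outcome in sorted(zip(scores, outcomes)):
        blocks.append([float(outcome), 1, score])
        # Merge backwards while the previous block mean exceeds the last one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]

def isotonic_predict(model, score):
    """Calibrated value of the first block whose upper bound covers the score."""
    for upper, value in model:
        if score <= upper:
            return value
    return model[-1][1]

# Raw black-box scores and whether the predicted label was correct (1) or not (0).
scores = [0.1, 0.3, 0.4, 0.6, 0.8, 0.9]
outcomes = [0, 1, 0, 1, 1, 1]
model = isotonic_fit(scores, outcomes)
print(model)
print(isotonic_predict(model, 0.35))
```

Here the scores 0.3 and 0.4 violate monotonicity (a correct prediction followed by an incorrect one), so they are pooled into one block with calibrated value 0.5. With so few points the fitted steps are coarse, which illustrates why isotonic regression needs a reasonably large calibration set.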

The amending based on class membership probabilities assumes the ground truth labels are correct and induces the white box to focus its learning on the instances that are certain according to them. However, when dealing with limited labeled data we should not rule out the existence of noise in the class labels. This can generate class inconsistency, especially when unlabeled data is added from different sources.

4.2 Using the Inclusion Degree Measures from Rough Set Theory

In this subsection we describe a second strategy for amending the enlarged dataset, which is based on the knowledge structures of Rough Set Theory [56]. This formalism allows handling uncertainty in the form of inconsistency through the computation of the lower and upper approximations of any set of instances in the decision space. The rough regions associated with these approximations can be used to weight the instances after performing the self-labeling process. In particular, we assign higher weights to more confident instances, as they are more likely to have been correctly classified by the base black box.

4.2.1 Rough Set Theory

Rough Set Theory (RST) [56] is a mathematical formalism for handling uncertainty in the form of inconsistency. Given a decision system $(U, A \cup \{d\})$, where the universe of instances U is described by a non-empty finite set of attributes A and its respective decision class d, any concept (subset of instances) $X \subseteq U$ can be approximated by two crisp sets. These sets are called the lower and upper approximations of X ($\underline{B}X$ and $\overline{B}X$, respectively) and can be computed taking into account an equivalence relation, as follows:

$$\underline{B}X = \{x \in U : [x]_B \subseteq X\} \qquad (3)$$

$$\overline{B}X = \{x \in U : [x]_B \cap X \neq \emptyset\} \qquad (4)$$

The equivalence class $[x]_B$ gathers the instances in the universe U which are inseparable from x according to a subset of attributes $B \subseteq A$. From the formulations of the upper and lower approximations, we can derive the positive, negative and boundary regions of any subset $X \subseteq U$. The positive region $P(X) = \underline{B}X$ includes those instances that are surely contained in X; the negative region $U \setminus \overline{B}X$ denotes those instances that are surely not contained in X, while the boundary region $\overline{B}X \setminus \underline{B}X$ captures the instances whose membership to the set X is uncertain, i.e., they might be members of X.
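The lower and upper approximations, and the three regions derived from them, can be computed directly with set operations. The following Python sketch uses a made-up five-instance universe; the helper names `equivalence_classes` and `rough_regions` are hypothetical.

```python
def equivalence_classes(universe, key):
    """Group instances that are indistinguishable under `key` (the chosen attributes)."""
    classes = {}
    for name, attributes in universe.items():
        classes.setdefault(key(attributes), set()).add(name)
    return list(classes.values())

def rough_regions(universe, key, concept):
    """Positive, boundary and negative regions of `concept` under the relation."""
    classes = equivalence_classes(universe, key)
    # Lower approximation: union of classes fully contained in the concept.
    lower = set().union(*(c for c in classes if c <= concept))
    # Upper approximation: union of classes that intersect the concept.
    upper = set().union(*(c for c in classes if c & concept))
    return {
        "positive": lower,
        "boundary": upper - lower,
        "negative": set(universe) - upper,
    }

# Toy universe: each instance is described by two nominal attributes.
universe = {
    "x1": ("high", "yes"),
    "x2": ("high", "yes"),
    "x3": ("low", "no"),
    "x4": ("low", "no"),
    "x5": ("mid", "no"),
}
concept = {"x1", "x2", "x3"}            # instances carrying the target label
regions = rough_regions(universe, lambda attrs: attrs, concept)
print(regions)
```

Instances x3 and x4 share the same description but only x3 belongs to the concept, so both fall in the boundary region: this is exactly the kind of inconsistency that RST captures.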

The classic RST is regularly defined over a subset of discrete attributes, thus generating a partition of U. A more relaxed formulation of RST establishes the inseparability between instances based on a weak binary relation. Equation (5) formalizes the similarity relation used in this paper, which defines whether any pair of instances can be considered similar: a similarity function computes the extent to which the two instances are deemed inseparable, as indicated by the similarity threshold. Under this assumption, the universe is arranged in similarity classes that are no longer disjoint but overlapping.

In this paper, the inseparability relation is defined as the complement of a distance function, such as the Heterogeneous Euclidean-Overlap Metric [57]. This distance function computes the normalized Euclidean distance between numerical attributes and an overlap metric for nominal attributes. Equations (6) and (7) define this dissimilarity function in terms of the normalized values of the t-th attribute of the two heterogeneous instances being compared and the information gain of that attribute.
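A similarity relation of this kind can be sketched as follows. The distance is the standard Heterogeneous Euclidean-Overlap Metric (range-normalized absolute difference for numerical attributes, 0/1 overlap for nominal ones, combined in a Euclidean fashion). Three things are assumptions made for illustration rather than details confirmed by the text: the optional per-attribute weights (meant to stand for information-gain scores), the rescaling of the distance to [0, 1] before taking the complement, and the threshold value 0.8. Missing values are not handled.

```python
import math

def heom_distance(x, y, ranges, weights=None):
    """HEOM-style distance between two instances with mixed attribute types.

    ranges[t] holds (max - min) of a numerical attribute, or None when the
    attribute is nominal. weights[t] is an optional per-attribute weight
    (assumption: e.g. an information-gain score; uniform when omitted).
    """
    weights = weights if weights is not None else [1.0] * len(x)
    total = 0.0
    for xt, yt, value_range, wt in zip(x, y, ranges, weights):
        if value_range is None:
            diff = 0.0 if xt == yt else 1.0          # overlap metric
        elif value_range > 0:
            diff = abs(xt - yt) / value_range        # range-normalized difference
        else:
            diff = 0.0                               # constant attribute
        total += wt * diff ** 2
    return math.sqrt(total)

def are_similar(x, y, ranges, threshold, weights=None):
    """Similarity relation: complement of the rescaled distance reaches the threshold."""
    weights = weights if weights is not None else [1.0] * len(x)
    largest = math.sqrt(sum(weights))    # distance when every attribute differs fully
    similarity = 1.0 - heom_distance(x, y, ranges, weights) / largest
    return similarity >= threshold

ranges = [50.0, None]   # attribute 0 is numerical (range 50), attribute 1 is nominal
a, b, c = (30.0, "red"), (40.0, "red"), (30.0, "blue")
print(heom_distance(a, b, ranges))
print(are_similar(a, b, ranges, 0.8))   # hypothetical threshold
print(are_similar(a, c, ranges, 0.8))
```

Because the relation is reflexive and symmetric but not transitive, the similarity classes it induces can overlap, unlike the partition produced by an equivalence relation.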

Once the covering of the decision space is generated according to the similarity function, several RST-based measures can be computed for measuring the uncertainty contained in a dataset [58]. In the following subsection, we adopt one of these measures to weight the instances belonging to the enlarged training set obtained after performing the self-labeling process.

4.2.2 Inclusion Degree

The use of this amending strategy is motivated by the fact that the black box could produce wrong labels for unlabeled instances. In addition, there is no guarantee that the knowledge concerning the original labeled instances is reliable. To address both situations together, we propose a mechanism to weight the instances after the self-labeling process. Unlike the confidence-based strategy, this amending procedure is applied to the entire enlarged dataset, instead of only the self-labeled instances. Therefore, it treats the uncertainty in the form of inconsistency of the labeled and self-labeled instances together.

More explicitly, the second weighting strategy is based on the inclusion degree of both labeled and self-labeled instances into the RST granules. For any instance x and each class label, we consider three membership degrees: to the positive, boundary and negative regions of that class, respectively. These membership degrees are computed from the inclusion degree of the similarity class of x into each information granule, where each class label defines the set of instances carrying that label. The similarity class of an instance x groups all instances that are similar to x according to the subset of attributes taken into account. By computing how much x and its similar instances are included in the positive region of a class, we are estimating how sure we are of this classification. The same reasoning holds for the negative and boundary regions.
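The membership degrees can be illustrated with a small Python sketch. It assumes the common definition of the inclusion degree of a set A into a set B, namely |A ∩ B| / |A|, applied to the similarity class of an instance and each rough region of a class; whether this matches the paper's equations term by term is an assumption. The data, the similarity rule and the function names are made up for illustration.

```python
def similarity_class(i, instances, similar):
    """Indices of all instances similar to instance i (including i itself)."""
    return {j for j in range(len(instances)) if similar(instances[i], instances[j])}

def region_memberships(i, instances, labels, target, similar):
    """Inclusion degree of the similarity class of instance i into each region of `target`."""
    n = len(instances)
    classes = [similarity_class(j, instances, similar) for j in range(n)]
    concept = {j for j in range(n) if labels[j] == target}
    lower = {j for j in range(n) if classes[j] <= concept}   # surely in the class
    upper = {j for j in range(n) if classes[j] & concept}    # possibly in the class
    regions = {
        "positive": lower,
        "boundary": upper - lower,
        "negative": set(range(n)) - upper,
    }
    own = classes[i]
    # Assumed inclusion degree: |R(x) ∩ region| / |R(x)|.
    return {name: len(own & region) / len(own) for name, region in regions.items()}

# One integer attribute; two instances are similar when they differ by at most 3.
instances = [10, 12, 14, 30, 31]
labels = ["a", "a", "b", "b", "b"]     # labels after self-labeling
similar = lambda p, q: abs(p - q) <= 3

print(region_memberships(0, instances, labels, "a", similar))
print(region_memberships(3, instances, labels, "a", similar))
```

Instance 0 sits near the border between the two classes, so its similarity class is split between the positive and boundary regions of class "a", whereas instance 3 lies entirely in the negative region of that class. The three degrees always sum to 1 because the regions partition the universe.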

Equation (11) computes the weight for the instance x belonging to the enlarged dataset, given its label and a similarity relation R. The sigmoid function is used to maintain the weight in the (0, 1) range.

The intuition behind this weight is that if an instance and its similar ones are mostly included in the positive region of a class, then their label is likely correct and they should have a high weight in the second learning phase. In the same way, if the instance x and its similarity class are mostly contained in the negative region of a class, then the class assigned to x by the black box is likely a mistake. Observe that the boundary information is also of interest, since a high inclusion degree of an instance in the boundary region of a class is to some extent positive evidence (see Equation 4). The role of the boundary region can be reinforced or diluted according to the evidence coming from the inclusion degrees in the other two regions. When using the RST-based amending, Equation (11) replaces Equation (2) in the pseudo-code of Algorithm 1.

Figure 2: Blueprint of the SlGb architecture using amending procedures for correcting the influence of the misclassifications from the self-labeling process. When RST-based amending is used, it also tackles class inconsistency coming from noise in the labeled data.

Figure 2 illustrates the inclusion of the amending procedures into the learning algorithm of the SlGb approach. It is important to note that the amending process is only carried out in the learning phase of the SlGb. Therefore, the amending strategies do not affect the transparency of the white-box surrogate during the inference on new cases.

The use of amending by weighting could have some implications for the interpretability. Assigning high weights to a small subset of instances transforms the global surrogate model towards a more local one. In other words, the weighting of instances makes the white box biased towards learning from the most confident ones, thus providing explanations for that subspace of the domain mostly. However, it makes sense to provide interpretability or explanations over the predictions that are most certain in the problem domain. In addition, it could have a positive influence on reducing the number of explanations produced by the white box.

5 Experiments and Discussion

In this section, we evaluate the SlGb approach through a three-step methodology using standard benchmark datasets. Unlike other experiments reported in the literature, the one developed in this section evaluates both algorithms’ performance and interpretability, when having different percentages of labeled instances. As a complement, we propose three new evaluation measures that go beyond the prediction rates.

More specifically, the first step of our experimentation methodology is devoted to determining which black-box classifier produces the best results in terms of prediction performance. This step is quite important since the overall performance will depend on the discriminatory ability of the black box. The second step is dedicated to determining which combination of white box and amending reaches the best compromise between prediction rates and interpretability. As a third step, we further explore the impact of having different percentages of labeled and unlabeled instances on the algorithm’s performance.

As a complement of the evaluation methodology for interpretable SSC methods, in the last part of this section we compare SlGb against the best-performing state-of-the-art methods. In this case, the evaluation is confined to the prediction rates as these methods cannot be interpreted, thus the goal here is to show that SlGb is not just simple and elegant, but also able to outperform other self-labeling methods reported in the literature.

5.1 Benchmark Datasets, Base Classifiers and Parameter Settings

Our experimental design includes 55 challenging and diverse datasets for classification tasks where features are structured (i.e. the dataset has tabular form) and therefore are potentially interpretable. Four ratios of labeled instances in the training set (from 10% to 40%) allow studying the influence of the number of labeled examples on the overall performance. Testing with a 10% ratio means that the training set contains only 10% of labeled instances and the rest are unlabeled, while the instances in the test set are all labeled but set apart. These datasets comprise different characteristics: the number of instances ranges from 100 to 19000, the number of attributes from 2 to 90, and the number of decision classes from 2 to 28. Moreover, we have 25 datasets with different degrees of class imbalance and roughly half of the datasets are multiclass problems (see Table 5).

These datasets are partitioned into training and test sets as done in a 10-fold cross-validation process, but each training set consists of labeled and unlabeled instances. The subset of unlabeled instances is obtained by performing a random selection without replacement and neglecting the class label of such instances. The ratio (10% to 40%) determines the number of labeled instances that are kept in this process for each training set. These datasets (including the cross-validation fold partitions) were provided as supplementary material in [9] and constitute a standard in the evaluation of shallow SSC techniques. We use these datasets, including the partitions, to guarantee a fair comparison against state-of-the-art SSC methods.
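The label-hiding step described above can be sketched in a few lines of Python. This is only an illustration: the experiments reuse the fixed partitions distributed with [9], whereas the hypothetical helper `hide_labels` below draws its own random subset, marks unlabeled instances with `None`, and does not stratify by class.

```python
import random


def hide_labels(labels, labeled_ratio, seed=0):
    """Return a copy of labels in which all but a random subset are set to None."""
    rng = random.Random(seed)
    n_labeled = round(labeled_ratio * len(labels))
    # Sampling without replacement chooses which instances keep their label.
    keep = set(rng.sample(range(len(labels)), n_labeled))
    return [y if i in keep else None for i, y in enumerate(labels)]


train_labels = ["a", "b"] * 50  # 100 toy training labels
partial = hide_labels(train_labels, labeled_ratio=0.10)
n_labeled = sum(y is not None for y in partial)
```

With a 10% ratio, 10 of the 100 toy training instances keep their label and the remaining 90 play the role of the unlabeled set.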

There are several algorithms that can be adopted as base classifiers. On the one hand, the selected classifier for the base black box should exhibit a strong predictive capability as it is used to determine the decision class of unlabeled instances. Next, we describe three mainstream supervised classifiers that will be used in the experiments for instantiating the black-box component. Our choice is motivated by experimental evidence of their superior performance in a wide range of classification problems [59, 60, 61] and their ability to produce calibrated probabilities (except for support vector machines, where post-hoc calibration is needed).

Black-box classifiers

• Random Forests (RF) [62]: Ensemble of decision trees that uses bagging technique for aggregating the results in order to reduce the high variance of individual decision trees. Individual decision trees are built with a random subset of attributes and a random sample with replacement of instances. In our implementation 100 trees are aggregated and the number of random attributes to consider for each tree equals

• Multilayer Perceptron (MLP) [63]: Feed-forward neural network using the backpropagation algorithm for adjusting its weights. Our implementation uses a learning rate of 0.3, a momentum of 0.2, 500 epochs for learning and one hidden layer with (|A| + |Y|)/2 neurons.

• Support Vector Machine (SVM) [64, 65]: Support vector machine classifier using sequential minimal optimization algorithm for training. Our implementation uses a polynomial kernel with Platt’s scaling (logistic) calibration of probabilities.

On the other hand, for the white-box component any intrinsically interpretable classifier can be used as a surrogate model. Therefore, the choice of a white box must be driven by the type of explanations that are desired, e.g. rules, feature coefficients, probabilities, examples, etc. We decide to explore decision trees and decision lists as alternatives, since they provide both intuitive individual explanations in the form of if-then rules and a view of the model as a whole. For decision trees, the hierarchical structure provides this view and it can be considered transparent as long as the size of the tree remains manageable. For decision lists, rule sets are generally more concise than the ones extracted from decision trees. Additionally, these algorithms are able to handle weighted instances in the learning process. Next, we describe the three classifiers explored in the scope of this experiment.

White-box classifiers

• Decision Tree (C45) [66]: Our implementation uses the C4.5 algorithm for inducing a decision tree. We allow two instances as the minimum number of instances per leaf. The confidence factor for pruning is 0.25, where a lower value results in more pruning. When pruning, the subtree raising operation is used.

• PART Decision List (PART) [67]: PART uses the separate-and-conquer strategy for building a rule set by generating a partial C4.5 decision tree and making the most confident leaf into a rule. In the next iteration, all covered instances are removed from the dataset and the process is repeated. Thus, decision lists must be interpreted in order. Our implementation uses the same hyper-parameters as the decision tree described above for generating the partial C4.5 decision trees.

• RIPPER Decision List (RIP) [68]: This method is a propositional rule learner with a separate-and-conquer strategy, as described for PART. Additionally, the training data is split into a growing set and a pruning set for performing reduced error pruning. The rule set formed from the growing set is simplified with pruning operations optimizing the error on the pruning set. For our implementation, the minimum allowed support of a rule is two and the data is split in three folds where one is used for pruning. Besides, two optimization iterations are performed.
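Because decision lists must be interpreted in order, inference amounts to returning the label of the first rule that covers an instance. The following Python sketch illustrates this reading. The rule conditions, attribute names and default label are hypothetical and are not taken from any model learned in the experiments.

```python
def predict_with_decision_list(rules, default_label, instance):
    """Return the label of the first rule whose condition covers the instance."""
    for condition, label in rules:
        if condition(instance):
            return label
    # No rule fired: fall back to the default class of the list.
    return default_label


# Hypothetical ordered rule list over dict-based instances.
rules = [
    (lambda x: x["petal_length"] < 2.5, "setosa"),
    (lambda x: x["petal_width"] < 1.8, "versicolor"),
]

first = predict_with_decision_list(rules, "virginica", {"petal_length": 1.4, "petal_width": 0.2})
second = predict_with_decision_list(rules, "virginica", {"petal_length": 4.5, "petal_width": 1.5})
third = predict_with_decision_list(rules, "virginica", {"petal_length": 6.0, "petal_width": 2.3})
```

Note that the first instance also satisfies the second condition, yet it is labeled by the first rule. This is why a rule in a decision list cannot be read in isolation from the rules that precede it.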

For completeness, we enumerate the amending procedures that will be tested in combination with the previous base classifiers.

5.1.1 Amending procedures

• No amending (NONE): The first option is not using amending. All self-labeled instances are provided as extra data to the surrogate white box. This is used as a baseline for evaluating the contribution of the two amending procedures proposed.

• Amending based on class membership probabilities (CONF): Amending procedure based on calibrated class membership probabilities obtained from the black-box base classifier [22].

• Amending based on RST inclusion degree measure (RST): Amending procedure based on RST aiming to correct the inconsistency in the classifications, as described in the previous section.

Hereinafter, when referring to a particular configuration of SlGb we denote it as “bb-wb-am”, where bb represents the base black box, wb represents the surrogate white box and am represents the amending procedure.
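As a small illustration of this naming scheme, the following Python sketch enumerates every name that can be built from the abbreviations introduced above. It lists the full grid that the scheme can express; the experiments in the following subsections evaluate only parts of it (all black boxes without amending first, then all white boxes and amending procedures with RF fixed).

```python
from itertools import product

BLACK_BOXES = ["RF", "MLP", "SVM"]
WHITE_BOXES = ["C45", "PART", "RIP"]
AMENDINGS = ["NONE", "CONF", "RST"]

# Each configuration name follows the "bb-wb-am" pattern.
configurations = ["-".join(parts) for parts in product(BLACK_BOXES, WHITE_BOXES, AMENDINGS)]
```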

5.2 Impact of the Black-box Base Classifiers on the Performance

We first focus on evaluating the influence of the base black box on the performance of the algorithm. Here no amending procedure is taken into account yet since it does not directly affect the ability of the black box to produce correct classifications. In order to measure the configurations in terms of prediction rates we report Cohen’s kappa coefficient [69]. This measure estimates the inter-rater agreement for categorical items and ranges in [-1, 1], where -1 indicates no agreement between the prediction and the actual values, 0 means no learning (i.e., random prediction), and 1 total agreement or perfect performance. While accuracy is considered mainstream when measuring classification rates, the kappa is a more robust measure since this coefficient takes into account the agreement occurring by chance, which is especially relevant for datasets with class imbalance [70, 71].
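To make the chance correction concrete, the following Python sketch computes Cohen’s kappa from its standard definition, (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The function name and the toy label vectors are illustrative only, and the value returned in the degenerate case p_e = 1 is a convention chosen here.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    p_observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    # Chance agreement: both label sources pick the same class independently.
    p_expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_expected == 1.0:
        # Degenerate case (a single class everywhere); convention chosen here.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


perfect = cohen_kappa([0, 0, 1, 1], [0, 0, 1, 1])
inverted = cohen_kappa([0, 0, 1, 1], [1, 1, 0, 0])
majority = cohen_kappa([0, 0, 0, 1], [0, 0, 0, 0])
```

The last case shows why kappa is preferred under class imbalance: always predicting the majority class yields 75% accuracy on this toy example but a kappa of zero.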

Table 1 gives the mean and the standard deviation of the kappa coefficient achieved on each setting. We group the results for different percentages of labeled instances. The numerical simulations indicate that using RF as the black-box component leads to higher prediction rates. In particular, the RF-PART-NONE configuration stands as the best performing one for varying amounts of labeled instances, very closely followed by RF-C45-NONE and RF-RIP-NONE.

Table 1: Prediction rates (kappa) achieved by different combinations of black-box and white-box algorithms without using amending. Results are grouped by ratio and best results are highlighted in bold.

With the aim of providing a rigorous statistical analysis of the differences, we compute the Friedman two-way analysis of variances by ranks [72], per ratio. The test suggests rejecting the null hypotheses for all labeled ratios at a 95% confidence level (see Table 6 in the appendix). This means that there exist significant differences between at least two configurations on each ratio.

The next step is focused on determining whether RF black box is truly superior compared to other configurations. To do so, we adopt the Wilcoxon signed rank test [73] and Holm’s post-hoc procedure [74] to correct the p-values, as suggested by Benavoli et al. [75]. Table 7 reports the unadjusted p-value computed by the Wilcoxon test and the corrected p-value associated with each pairwise comparison. In order to discover the influence of the black box we compare the pairs of configurations using the same surrogate white box. Each section of the table represents the ratio of labeled instances. The null hypothesis states that there is no significant difference between the performance of each pair of configurations. All null hypotheses are rejected, except for RF-RIP vs. MLP-RIP in the 40% ratio (however RF still has higher prediction rates).
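The correction step can be sketched as follows. This is a minimal Python implementation of the standard Holm step-down adjustment, applied to hypothetical unadjusted p-values rather than to the values reported in Table 7; the Wilcoxon test itself is not reimplemented here.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease along the ranking.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03]  # hypothetical unadjusted p-values
corrected = holm_adjust(raw)
rejected = [p < 0.05 for p in corrected]
```

In this toy example only the first comparison remains significant at the 0.05 level after correction, although all three unadjusted p-values are below 0.05.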

This suggests that RF is clearly the best-performing base black box for the self-labeling grey-box. This result is not surprising since RF has proven to be a very competent classifier in different experimental studies [59, 60, 61, 76]. Furthermore, RF produces consistent probabilities that do not need to be calibrated [53], which is a requirement for the later use of confidence-based amending.

5.3 Impact of using Different White Boxes and Amending Configurations

In this section, we study how different choices of the amending processes and white-box surrogates impact the overall results. We propose some measures for evaluating performance taking both accuracy and interpretability into account.

We first explore the influence on the prediction rates. Based on the selection of RF as black-box base, Table 2 shows very similar results across each ratio. Going deeper with the statistical analysis, we apply Friedman and Wilcoxon tests with post-hoc correction. Although the Friedman test finds significant differences in the four groups (Table 8), examining the corrected Wilcoxon tests we ascertain that the null hypothesis cannot be rejected for the vast majority of pairs compared (see Tables 9 and 10 for details). This means that there are no statistically significant differences in the prediction rates when comparing different amending procedures with a fixed white box and vice versa. This behavior suggests that the overall prediction rates of the approach mostly rely on the correct choice of the black-box algorithm.

Table 2: Prediction rates achieved by different combinations of white boxes and amending strategies while using RF as black box. Results are grouped by ratio and best results are highlighted in bold.

However, when examining the number of rules obtained, the differences are clearly visible. Figure 3 plots the number of rules produced by each combination, per ratio of labeled data. Two results are consistent across ratios: both amending strategies (especially RST) reduce the number of rules, while RIP as surrogate white box produces the lowest number of rules for all possible combinations.

Figure 3: Number of rules produced by each combination of white box and amending, using random forests as black box. Both amending strategies (especially RST) reduce the number of rules while RIP white box produces the lowest number of rules.

Toward exploring this result further, we also propose two new measures to evaluate models’ interpretability via a quantifiable proxy. The first measure can be used in the context of self-labeling and the second one is applicable to any model containing explanation units. According to [14], there are three main forms of evaluating interpretability: application-grounded, human-grounded and functionally-grounded metrics. The functionally-grounded approach is the only one not requiring human experiments and collaboration. As an alternative it uses desiderata for interpretability (e.g. transparency, trust, etc.) as a proxy for assessing the quality of the model. Since we are working with benchmark datasets, we use the functionally-grounded approach for creating measures based on simplicity as a means for gaining transparency and simulatability (i.e. a human is able to simulate and reason about the model’s entire decision-making process). The first measure applies to base methods that produce tree structures, rules or decision lists. It involves the number of rules in the decision lists (or equivalently the number of leaves in a decision tree) and expresses the relative growth in structure as:

where the two rule sets compared are the one produced by the self-labeling method (here the grey-box) and the one produced by the baseline white box when using only labeled data. For this measure, a number much greater than one indicates that a major growth in the structure of the self-labeling method is needed when using the extra unlabeled data. In that case, the balance between interpretability and performance must be taken into account for further evaluation.
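As an illustration, the relative growth can be computed from two rule counts. The Python sketch below assumes that the measure is the ratio between the number of rules of the self-labeling model and the number of rules of the baseline white box, which is consistent with the reading that values much greater than one indicate a major growth. The function name and the rule counts are hypothetical.

```python
def relative_growth(n_rules_self_labeling, n_rules_baseline):
    """Ratio between the structure sizes of the grey-box and the baseline white box.

    ASSUMPTION: the measure is a plain ratio of rule counts.
    """
    if n_rules_baseline <= 0:
        raise ValueError("the baseline model must contain at least one rule")
    return n_rules_self_labeling / n_rules_baseline


# Hypothetical counts: 30 rules with self-labeling versus 12 with labeled data only.
growth = relative_growth(30, 12)
```

Here the self-labeled model is 2.5 times larger than the baseline, so any gain in prediction rates would have to be weighed against this loss of conciseness.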

The second measure is more general and applicable to any model whose structure is formed by quantifiable explanation units (e.g. rules, prototypes, features, derived features, etc.). For our case, this measure estimates the simplicity of the model according to the size of the structure in terms of the number of rules. Although the first notion would be that the smaller the rule set the better, this is not necessarily a linear relation. The desired simplicity in terms of the number of rules has a smooth behavior which can drop quickly. Therefore, we propose to measure simplicity through a generalized sigmoid function which has been historically used for fitting growth curves [77], since it allows representing this relation with enough flexibility. The simplicity can be formalized as the following equation:

The function is governed by five parameters: the upper and lower asymptotes, the slope of the curve, a parameter that regulates the shift over the x-axis, and a parameter that affects near which asymptote maximum growth occurs. In this way, a result value of one indicates high simplicity and it decreases smoothly towards zero. A bigger slope would make the function less smooth, generating a drastic drop in simplicity after a threshold in the number of rules is surpassed. The shift parameter defines where the middle value of the function is obtained. The last parameter can either leave the curve unchanged or move the growth toward the upper or the lower asymptote. Observe that two of these parameters jointly influence where 0.5 simplicity is obtained. Given the diversity of our benchmark data, we fix default parameter values for illustrating a general setting (see Figure 4).

Figure 4: Simplicity function with default parameters used for the benchmark datasets. For specific applications these parameters are domain dependent.

With these values, the function produces medium evaluations (around 0.5) when the number of rules is around 40. Similarly, it obtains rather high simplicity (higher than 0.8) when the number of rules goes below 30. However, parameter values should be estimated based on expert knowledge for specific applications. This highly flexible function allows customizing the value of simplicity according to the specifics of a given case study.
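The shape of such a simplicity function can be sketched in Python as a decreasing generalized sigmoid. The parameter names and default values below are hypothetical: they were chosen only so that the curve reproduces the behavior described above (about 0.5 at 40 rules and above 0.8 at 30 rules or fewer) and are not the values used in the experiments.

```python
import math


def simplicity(n_rules, upper=1.0, lower=0.0, slope=0.14, shift=40.0, nu=1.0):
    """Decreasing generalized sigmoid of the number of rules.

    ASSUMPTION: slope, shift and nu are illustrative values only.
    """
    return lower + (upper - lower) / (1.0 + math.exp(slope * (n_rules - shift))) ** (1.0 / nu)


few, medium, many = simplicity(10), simplicity(40), simplicity(80)
```

With `nu=1.0` the curve is symmetric around `shift`. Increasing `slope` sharpens the drop around that point, which corresponds to a stricter threshold on the acceptable number of rules.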

Table 3 shows the average relative growth and simplicity over the 55 datasets tested for the four ratios. Regarding the relative growth, the increase in the structure of the grey-box is on average larger when using small amounts of labeled data, while for bigger ratios this difference decreases. This growth in the structure is an expected consequence of providing more unlabeled data to the white-box surrogate in the grey-box scheme. However, the use of amending procedures alleviates this effect by giving more importance to relevant unlabeled instances. In general, a smaller growth is observed when using RST amending, especially in combination with PART as the white box, which makes it the winning combination for all ratios.

Table 3: Mean (and standard deviation) of the relative growth and simplicity achieved by different combinations of white boxes and amending strategies while using RF as black box. Results are grouped by ratio and best results are highlighted in bold.

In addition, the simplicity measure (the closer the value to one the better) also indicates that the use of amending is convenient for obtaining more concise sets of rules. It is also evident that using RIP as surrogate generates the smallest number of rules, followed by PART. For this measure the absolute winner is the RF-RIP-RST combination, exhibiting the highest values of simplicity for all ratio values used for experimentation. A similar statistical validation supports this statement (see Tables 11 and 12), finding statistically significant differences when comparing RF-RIP-RST with other configurations using simplicity as interpretability measure.

It is important to remark that the simplicity measure solely quantifies what would be considered a simulatable model. Of course, a very simple model with only one rule and poor prediction rates is not desirable, whereas for a very simple dataset three or four rules might be enough to reach accurate results. That is why taking into account the prediction performance is fundamental for a proper assessment. To measure algorithms’ quality based on the balance between the prediction rates and the simplicity of the learned model, we propose a third measure, called utility, which combines the kappa (re-scaled to (0,1)) and the simplicity values through a weighting parameter.

The weighting parameter is set to 0.6 in our experimental setting, representing a scenario where the accuracy and the interpretability have almost the same preference. Utility functions are commonly used in multi-objective optimization for mapping a vector of pay-offs to a single scalar value [78]. In this case, the utility function is a linear combination of two terms parameterized by this weight. The weighting parameter allows adjusting the preference of the user for prioritizing the accuracy or the interpretability objectives. Here, the two objectives are measured based on kappa and simplicity, respectively. It would be interesting to extend the proposed utility to involve more objectives where the parameters should be obtained from the preferences of a panel of domain experts [79, 80].
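A minimal Python sketch of such a utility function is given below. It makes two assumptions that are not confirmed by the text: the kappa is re-scaled linearly as (kappa + 1) / 2, and the weight of 0.6 (named `alpha` here, a hypothetical name) multiplies the accuracy term while its complement multiplies the simplicity term.

```python
def utility(kappa, simplicity, alpha=0.6):
    """Linear combination of re-scaled kappa and simplicity.

    ASSUMPTIONS: linear re-scaling of kappa from [-1, 1] to [0, 1], and
    alpha weighting the accuracy term.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    kappa_rescaled = (kappa + 1.0) / 2.0
    return alpha * kappa_rescaled + (1.0 - alpha) * simplicity


# Hypothetical model with kappa 0.6 and simplicity 0.9.
u = utility(kappa=0.6, simplicity=0.9)
```

Setting `alpha` closer to one favors accurate but possibly larger models, whereas values closer to zero favor concise rule sets.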

As a partial summary, Figure 5 visualizes the utility values in a heat-map plot. From this figure, it is easy to perceive that the RIP algorithm, as a white-box surrogate, positively contributes to the overall performance of the approach when taking both kappa and simplicity into account. Additionally, RST amending also increases the value of utility when compared with CONF amending or not using amending at all. This measure reflects that, in general, the best trade-off is reached when using the RF-RIP-RST combination and the highest values are achieved when more labeled data is available.

Figure 5: Mean utility values of each combination of white box and amending, using random forests as the black-box base classifier. The use of RIP as a white-box component in combination with the RST-based amending achieves the best trade-off between accuracy and interpretability, for all explored ratios.

5.4 Influence of the Number of Labeled and Unlabeled Instances

In this subsection, we use RF-RIP-RST to explore the impact of having different amounts of labeled and unlabeled instances on the algorithm’s results. In the evaluation of semi-supervised techniques, it is a common strategy to vary the size of L by systematically neglecting the label of different amounts of instances and adding them to U. However, this procedure does not explore the scenario where the unlabeled instances could also be hard to obtain [81]. Observe that since this is a controlled experiment we can safely assume that the unlabeled instances follow the same class distribution as the labeled ones. In reality, one might need to re-balance the dataset after self-labeling if the per-class distribution of the unlabeled instances differs significantly.

Due to the fact that we do not have truly unlabeled instances, we use the same datasets from the previous experiment. First, a test set with 20% of instances is kept aside for evaluation. Then, we divide the train set into two equally sized and disjoint subsets (each with 40% of the total instances). Each subset is a source for labeled and unlabeled instances, respectively, from where we vary the amount of instances we use for training. Figure 6 shows the surfaces resulting from the average of different measures over the 55 datasets.
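The partitioning protocol described above can be sketched in Python as follows. The helper names are hypothetical, the split is a plain random shuffle without stratification, and the fractions passed to `take_fraction` are interpreted as percentages of each pool, in line with the axes of Figure 6.

```python
import random


def split_pools(n_instances, seed=0):
    """Shuffle indices and split them into test (20%), labeled pool and unlabeled pool (40% each)."""
    rng = random.Random(seed)
    indices = list(range(n_instances))
    rng.shuffle(indices)
    n_test = n_instances // 5
    n_pool = (n_instances - n_test) // 2
    test = indices[:n_test]
    labeled_pool = indices[n_test:n_test + n_pool]
    unlabeled_pool = indices[n_test + n_pool:]
    return test, labeled_pool, unlabeled_pool


def take_fraction(pool, fraction):
    """Use only the first part of a pool for training."""
    return pool[:round(fraction * len(pool))]


test, labeled_pool, unlabeled_pool = split_pools(1000)
labeled = take_fraction(labeled_pool, 0.25)
unlabeled = take_fraction(unlabeled_pool, 0.50)
```

Varying the two fractions independently produces the grid of training settings over which the surfaces are averaged, while the test set stays fixed.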

From the first two surfaces (Figures 6a and 6b), it can be observed that the prediction rates (accuracy and kappa) have a pronounced increment when adding more labeled and unlabeled instances. The most dramatic change is observed when adding labeled data to a few unlabeled instances (5%), which is an expected result as it tends to be a more supervised setting. However, when labeled instances are very limited (5% of the dataset), adding unlabeled instances clearly increases the overall performance. In addition, even when more labeled data is available (40%), an increase in performance is observed by adding more unlabeled data. This result confirms that our approach fulfills the main aim of SSC approaches.

Figure 6: Performance of RF-RIP-RST when varying the number of labeled and unlabeled instances, for different measures. Axes x and y are expressed in percentage of instances taken for training from each subset. Sub-figures (d) and (e) are rotated for visualization purposes.

The number of rules (Figure 6c) increases almost linearly with the number of training instances, either labeled or unlabeled. However, the relative growth (Figure 6d) is more sensitive to adding unlabeled data when labeled data is very scarce, i.e. a bigger amount of unlabeled instances rapidly increases the structure and reduces interpretability, compared with the baseline white box. It should be noted that the base white boxes generally perform very poorly when the labeled data is scarce. When the labeled data is not so scarce, the growth is more robust to adding more unlabeled data. This means that even when a base white box can achieve good performance with some labeled data, adding unlabeled data does not generate too much growth in structure and can benefit the performance of the grey-box (Figures 6a and 6b).

The simplicity (Figure 6e) shows the expected behavior: the best values of this measure are observed with the smallest number of instances and it decreases uniformly in both directions. This means that adding more unlabeled instances does not generate a greater number of extra rules compared to adding more labeled instances. This is a consequence of using amending procedures for adjusting the confidence of the unlabeled instances, thus preventing the white box from learning from inconsistent instances. Finally, the utility surface (Figure 6f) summarizes all results, reflecting the increase in the overall performance when adding both labeled and unlabeled instances.

5.5 Comparing against State-of-the-Art Self-labeling Classifiers

In this section we compare the predictive capability of SlGb against the four best self-labeling techniques reported in the review paper [9]: Co-training using support vector machine [11] (CT(SMO)), Tri-training using C45 decision tree [44] (TT(C45)), Co-Bagging using C45 decision tree [11] (CB(C45)) and Democratic Co-learning [43] (DCT). Since these algorithms are not inherently interpretable we focus our comparison on the prediction rates only. For this section, SlGb refers to the RF-PART-RST combination which exhibits the best results in kappa, as shown in Subsection 5.3.

Table 4 reports the mean and standard deviation of the kappa coefficient for each classifier, taking into account the four studied ratios. The results reveal that SlGb has the highest mean for all ratios. In order to support this assertion, we compute the Friedman p-value per ratio. The test suggests rejecting the null hypotheses for all labeled ratios at a 95% confidence level (see Table 13). This indicates that there exist significant differences between at least two algorithms in each comparison.

Table 4: Mean and standard deviation of kappa coefficient obtained by SlGb and four self-labeling methods from the state-of-the-art. The best performance is highlighted in bold.

The next step is focused on determining whether the superiority of the SlGb classifier is responsible for the significant difference reported by the Friedman test. Similar to previous sections we use the Wilcoxon signed rank test and the Holm post-hoc procedure for computing the corrected p-values associated with each pairwise comparison. Each section of Table 14 represents a ratio of labeled instances. The null hypothesis states that there is no significant difference between the performance of each pair of algorithms, taking SlGb as the control one.

From the statistical tests we can draw the following conclusions. First, SlGb is significantly superior when tested with datasets with ratios of 10% and 20% of labeled instances, as all the null hypotheses were rejected. This result, in combination with the first place in the Friedman ranking, indicates that our algorithm significantly outperforms the other four algorithms in these settings. In the case of datasets comprising 30% and 40% of labeled instances, the results show that SlGb is the best-performing classifier, but with no significant differences observed between the pairs SlGb vs. DCT (for 30%), and SlGb vs. CT(SMO) (for both ratios), as these null hypotheses could not be rejected. However, DCT and CT(SMO) cannot be considered transparent due to their complex structure involving support vector machines and collaboration between base classifiers. Although our main goal was not to outperform the SSC methods in terms of classification rates, the analysis reported above supports our claim that we obtain a favorable balance between performance and interpretability by using the SlGb approach for solving SSC problems.

6 Conclusions

In this paper, we report on an extended experimental study to determine the suitability of the SlGb classifier for semi-supervised classification problems where interpretability is a requirement. We explore two different amending procedures for weighting the instances coming from the self-labeling process. Such procedures aim at preventing the effect of misclassifications from propagating across the whole model. The experiments have shown that using Random Forests as the base black box for the self-labeling process is the best choice in terms of prediction rates. The choice of a white box and amending does not significantly affect the prediction rates but it is relevant for the size of the structure.

Three measures based on the number of rules were proposed for estimating the relative growth, simplicity, and utility of the SlGb. SlGb produces simpler models when using decision lists instead of a C4.5 decision tree as surrogate white boxes, even when no amending is performed. However, the amending procedures help further increase the simplicity (and therefore transparency) without affecting the prediction rates by giving more importance to confident instances in the self-labeling. RST-based amending looks especially promising since it does not need the black-box base classifier to provide calibrated probabilities. Furthermore, RST-based amending could be a good choice for a given case study where the uncertainty coming from inconsistency is high, even on the available labeled data. Therefore, we strongly advise the use of random forests as a base black box and RST for amending the self-labeling, while the choice of white box can be adapted to the desired form of interpretability, either a decision tree or a decision list. Nevertheless, the best trade-off between accuracy and interpretability (utility) is reached when using the RF-RIP-RST combination.

The study varying the number of unlabeled instances and labeled instances together shows that even when the number of labeled instances is not that scarce, the SlGb is able to leverage unlabeled instances for increasing the performance. Another conclusion is that adding unlabeled instances does not make the interpretability worse compared to adding more labeled instances. This indicates that the amending procedure (in this case RST-based amending) prevents the SlGb from generating additional rules from inconsistent instances. Finally, the experimental comparison shows that our SlGb method outperforms the state-of-the-art self-labeling approaches while being far simpler in structure than these techniques.

References

[1] Chen Gong, Dacheng Tao, Stephen J Maybank, Wei Liu, Guoliang Kang, and Jie Yang. Multi-modal curriculum learning for semi-supervised image classification. IEEE Transactions on Image Processing, 25(7):3249–3260, 2016.

[2] Lili Yin, Huangang Wang, Wenhui Fan, Li Kou, Tingyu Lin, and Yingying Xiao. Incorporate active learning to semi-supervised industrial fault classification. Journal of Process Control, 78:88–97, 2019.

[3] Weidi Xu and Ying Tan. Semi-supervised target-oriented sentiment classification. Neurocomputing, 337:120–128, 2019.

[4] Nikos Fazakis, Stamatis Karlos, Sotiris Kotsiantis, and Kyriakos Sgarbas. Speaker identification using semi-supervised learning. In Proceeding of the 2015 International Conference on Speech and Computer, volume LNCS 931, pages 389–396. Springer, 2015.

[5] Yutong Xie, Jianpeng Zhang, and Yong Xia. Semi-supervised adversarial model for benign–malignant lung nodule classification on chest CT. Medical Image Analysis, 57:237–248, 2019.

[6] Kristin P. Bennett and Ayhan Demiriz. Semi-supervised support vector machines. In Advances in Neural Information Processing Systems 11, pages 368–374. MIT Press, 1999.

[7] Avrim Blum and Shuchi Chawla. Learning from labeled and unlabeled data using graph mincuts. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 19–26, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.

[8] Akinori Fujino, Naonori Ueda, and Kazumi Saito. Semisupervised learning for a hybrid generative/discriminative classifier based on the maximum entropy principle. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(3):424–437, 2008.

[9] Isaac Triguero, Salvador García, and Francisco Herrera. Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study. Knowledge and Information Systems, 42(2):245–284, Feb 2015.

[10] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems 28, pages 2234–2242. Curran Associates, Inc., 2016.

[11] Mohamed Farouk Abdel Hady and Friedhelm Schwenker. Co-training by committee: a new semi-supervised learning framework. In Proceedings of the 2008 IEEE International Conference on Data Mining Workshops, pages 563–572. IEEE, 2008.

[12] David Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics, pages 189–196, Stroudsburg, PA, USA, 1995. Association for Computational Linguistics.

[13] Bryce Goodman and Seth Flaxman. European union regulations on algorithmic decision-making and a “right to explanation”. AI Magazine, 38(3):50–57, 2017.

[14] Finale Doshi-Velez and Been Kim. Considerations for Evaluation and Generalization in Interpretable Machine Learning, pages 3–17. Springer, 2018.

[15] Leilani H Gilpin, David Bau, Ben Z Yuan, Ayesha Bajwa, Michael Specter, and Lalana Kagal. Explaining explanations: An overview of interpretability of machine learning. In Proceedings of the IEEE 5th International Conference on Data Science and Advanced Analytics, pages 80–89. IEEE, 2018.

[16] Zachary C. Lipton. The mythos of model interpretability. In Proceedings of the 33rd International Conference on Machine Learning. Workshop on Human Interpretability in Machine Learning, pages 96–100, 2016.

[17] Alejandro Barredo Arrieta, Natalia Díaz-Rodríguez, Javier Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador Garcia, Sergio Gil-Lopez, Daniel Molina, Richard Benjamins, Raja Chatila, and Francisco Herrera. Explainable artificial intelligence (xai): Concepts, taxonomies, opportunities and challenges toward responsible ai. Information Fusion, 58:82–115, 2020.

[18] Oliver Nelles. Nonlinear system identification: from classical approaches to neural networks and fuzzy models. Springer, 2013.

[19] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016.

[20] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4765–4774. Curran Associates, Inc., 2017.

[21] Isel Grau, Dipankar Sengupta, Maria M. Garcia Lorenzo, and Ann Nowe. Grey-box model: An ensemble approach for addressing semi-supervised classification problems. In Belgian-Dutch Conference on Machine Learning BENELEARN 2016, 9 2016.

[22] Isel Grau, Dipankar Sengupta, Maria M. Garcia Lorenzo, and Ann Nowe. Interpretable self-labeling semi-supervised classifier. In David Aha, Trevor Darrell, Patrick Doherty, and Daniel Magazzeni, editors, Proceedings of the 2nd Workshop on Explainable Artificial Intelligence, International Joint Conference on Artificial Intelligence IJCAI/ECAI 2018, pages 52–57, 7 2018.

[23] Xiaojin Zhu. Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2005.

[24] Hakan Cevikalp and Vojtech Franc. Large-scale robust transductive support vector machines. Neurocomputing, 235:199–209, 2017.

[25] Yanchao Li, Yongli Wang, Cheng Bi, and Xiaohui Jiang. Revisiting transductive support vector machines with margin distribution embedding. Knowledge-Based Systems, 152:200–214, 2018.

[26] Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien. Semi-Supervised Learning. The MIT Press, 1st edition, 2010.

[27] C. Gong, D. Tao, W. Liu, L. Liu, and J. Yang. Label propagation via teaching-to-learn and learning-to-teach. IEEE Transactions on Neural Networks and Learning Systems, 28(6):1452–1465, 2017.

[28] M. Fan, X. Zhang, L. Du, L. Chen, and D. Tao. Semi-supervised learning through label propagation on geodesics. IEEE Transactions on Cybernetics, 48(5):1486–1499, 2018.

[29] Raif M Rustamov and James T Klosowski. Interpretable graph-based semi-supervised learning via flows. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[30] J. Chen, Y. Chang, B. Hobbs, P. Castaldi, M. Cho, E. Silverman, and J. Dy. Interpretable clustering via discriminative rectangle mixture model. In Proceedings of the IEEE 16th International Conference on Data Mining (ICDM), pages 823–828, 2016.

[31] Jason Weston, Frédéric Ratle, Hossein Mobahi, and Ronan Collobert. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, volume LNCS 7700, pages 639–655. Springer, 2012.

[32] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems 27, pages 3581–3589. Curran Associates, Inc., 2014.

[33] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.

[34] Augustus Odena. Semi-supervised learning with generative adversarial networks. arXiv preprint arXiv:1606.01583, 2016.

[35] Zihang Dai, Zhilin Yang, Fan Yang, William W Cohen, and Ruslan R Salakhutdinov. Good semi-supervised learning that requires a bad GAN. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6510–6520. Curran Associates, Inc., 2017.

[36] Supriyo Chakraborty, Richard Tomsett, Ramya Raghavendra, Daniel Harborne, Moustafa Alzantot, Federico Cerutti, Mani Srivastava, Alun Preece, Simon Julier, Raghuveer M Rao, et al. Interpretability of deep learning models: a survey of results. In Proceedings of the IEEE Smart World Congress 2017, 2017.

[37] Jesper E Van Engelen and Holger H Hoos. A survey on semi-supervised learning. Machine Learning, 109(2):373–440, 2020.

[38] Siddharth N, Brooks Paige, Jan-Willem van de Meent, Alban Desmaison, Noah Goodman, Pushmeet Kohli, Frank Wood, and Philip Torr. Learning disentangled representations with semi-supervised deep generative models. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5925–5935. Curran Associates, Inc., 2017.

[39] Anindya Halder, Susmita Ghosh, and Ashish Ghosh. Ant based semi-supervised classification. In Marco Dorigo, Mauro Birattari, Gianni A. Di Caro, René Doursat, Andries P. Engelbrecht, Dario Floreano, Luca Maria Gambardella, Roderich Groß, Erol Şahin, Hiroki Sayama, and Thomas Stützle, editors, Proceedings of the 7th International Conference on Swarm Intelligence, volume LNCS 6234, pages 376–383. Springer, 2010.

[40] Ming Li and Zhi-Hua Zhou. Setred: Self-training with editing. In Proceedings of the 9th Pacific-Asia Conference on Knowledge Discovery and Data Mining, volume LNCS 3518, pages 611–621. Springer, 2005.

[41] Ian H. Witten, Eibe Frank, Mark A. Hall, and Christopher J. Pal. Chapter 11 - Beyond supervised and unsupervised learning, pages 467–478. Morgan Kaufmann, fourth edition, 2017.

[42] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100, New York, NY, USA, 1998. ACM.

[43] Yan Zhou and Sally Goldman. Democratic co-learning. In Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence, pages 594–602. IEEE, 2004.

[44] Zhi-Hua Zhou and Ming Li. Tri-training: Exploiting unlabeled data using three classifiers. IEEE Transactions on Knowledge and Data Engineering, 17(11):1529–1541, 2005.

[45] Ming Li and Zhi-Hua Zhou. Improve computer-aided diagnosis with machine learning techniques using undiagnosed samples. IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans, 37(6):1088–1098, 2007.

[46] Nikos Fazakis, Stamatis Karlos, Sotiris Kotsiantis, and Kyriakos Sgarbas. Self-trained LMT for semisupervised learning. Computational Intelligence and Neuroscience, 2016:10, 2016.

[47] Julio Albinati, Samuel E. L. Oliveira, Fernando E. B. Otero, and Gisele L. Pappa. An ant colony-based semi-supervised approach for learning classification rules. Swarm Intelligence, 9(4):315–341, 2015.

[48] Stamatis Karlos, Nikos Fazakis, Angeliki-Panagiota Panagopoulou, Sotiris Kotsiantis, and Kyriakos Sgarbas. Locally application of naive bayes for self-training. Evolving Systems, 8(1):3–18, 2017.

[49] Sarah Vluymans, Neil Mac Parthaláin, Chris Cornelis, and Yvan Saeys. Fuzzy rough sets for self-labelling: An exploratory analysis. In Proceedings of the 2016 IEEE International Conference on Fuzzy Systems, pages 931–938. IEEE, 2016.

[50] D. Wu, X. Luo, G. Wang, M. Shang, Y. Yuan, and H. Yan. A highly accurate framework for self-labeled semisupervised classification in industrial applications. IEEE Transactions on Industrial Informatics, 14(3):909–920, 2018.

[51] Emmanuel Pintelas, Ioannis E Livieris, and Panagiotis Pintelas. A grey-box ensemble model exploiting black-box accuracy and white-box intrinsic interpretability. Algorithms, 13(1):17, 2020.

[52] Cosmin Lazar, Stijn Meganck, Jonatan Taminau, David Steenhoff, Alain Coletta, Colin Molter, David Y Weiss-Solís, Robin Duque, Hugues Bersini, and Ann Nowé. Batch effect removal methods for microarray gene expression data integration: a survey. Briefings in Bioinformatics, 14(4):469–490, 2012.

[53] Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, pages 625–632, New York, NY, USA, 2005. ACM.

[54] John Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74, 1999.

[55] Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers. In Proceedings of the Eighteenth International Conference on Machine Learning, volume 1, pages 609–616, 2001.

[56] Zdzisław Pawlak. Rough sets. International Journal of Computer & Information Sciences, 11(5):341–356, 1982.

[57] D Randall Wilson and Tony R Martinez. Improved heterogeneous distance functions. Journal of Artificial Intelligence Research, 6:1–34, 1997.

[58] Rafael Bello and José Luis Verdegay. Rough sets in the Soft Computing environment. Information Sciences, 212:1–14, 2012.

[59] Chongsheng Zhang, Changchang Liu, Xiangliang Zhang, and George Almpanidis. An up-to-date comparison of state-of-the-art classification algorithms. Expert Systems with Applications, 82:128–150, 2017.

[60] Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. Do we need hundreds of classifiers to solve real world classification problems. Journal of Machine Learning Research, 15(1):3133–3181, 2014.

[61] Michael Wainberg, Babak Alipanahi, and Brendan J Frey. Are random forests truly the best classifiers? The Journal of Machine Learning Research, 17(1):3837–3841, 2016.

[62] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

[63] Robert Hecht-Nielsen. Theory of the backpropagation neural network. In Proceedings of the International Joint Conference on Neural Networks, pages 593–605. IEEE, 1989.

[64] J. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schoelkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1998.

[65] S. Sathiya Keerthi, Shirish Krishnaj Shevade, Chiranjib Bhattacharyya, and Karuturi Radha Krishna Murthy. Improvements to Platt’s SMO algorithm for SVM classifier design. Neural Computation, 13(3):637–649, 2001.

[66] J Ross Quinlan. C4.5: Programs for machine learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993.

[67] Eibe Frank and Ian H. Witten. Generating accurate rule sets without global optimization. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 144–151, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc.

[68] William W. Cohen. Fast effective rule induction. In Armand Prieditis and Stuart Russell, editors, Proceedings of the Twelfth International Conference on Machine Learning, pages 115–123, San Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc.

[69] Jacob Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46, 1960.

[70] Nathalie Japkowicz and Mohak Shah. Evaluating learning algorithms: a classification perspective. Cambridge University Press, 2011.

[71] Arie Ben-David. Comparison of classification accuracy using cohen’s weighted kappa. Expert Systems with Applications, 34(2):825–832, 2008.

[72] Milton Friedman. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32(200):675–701, 1937.

[73] F. Wilcoxon. Individual comparisons by ranking methods. Biometrics, 1:80–83, 1945.

[74] Sture Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, pages 65–70, 1979.

[75] Alessio Benavoli, Giorgio Corani, and Francesca Mangili. Should we really use post-hoc tests based on mean-ranks? Journal of Machine Learning Research, 17:1–10, 2016.

[76] Wouter G Touw, Jumamurat R Bayjanov, Lex Overmars, Lennart Backus, Jos Boekhorst, Michiel Wels, and Sacha AFT van Hijum. Data mining in the life sciences with random forest: a walk in the park or lost in the jungle? Briefings in Bioinformatics, 14(3):315–326, 2012.

[77] Colin PD Birch. A new generalized logistic sigmoid growth equation compared with the richards growth equation. Annals of Botany, 83(6):713–723, 1999.

[78] Roxana Rădulescu, Patrick Mannion, Yijie Zhang, Diederik M. Roijers, and Ann Nowé. A utility-based analysis of equilibria in multi-objective normal-form games. The Knowledge Engineering Review, 35:e32, June 2020.

[79] Luisa M Zintgraf, Diederik M Roijers, Catholijn M Jonker, and Ann Nowé. Ordered preference elicitation strategies for supporting multi-objective decision making. In Proceedings of the 17th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2018), volume 9. International Foundation for Autonomous Agents and Multiagent Systems, 2018.

[80] Diederik M Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48:67–113, 2013.

[81] Avital Oliver, Augustus Odena, Colin A Raffel, Ekin Dogus Cubuk, and Ian Goodfellow. Realistic evaluation of deep semi-supervised learning algorithms. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 3235–3246. Curran Associates, Inc., 2018.

7 Appendix: Benchmark Datasets and Detailed Results of Statistical Tests

Table 5: Characterization of the datasets used in experiments in Section 5. The imbalance is computed as the ratio of the number of instances between the majority and the minority class of the dataset, NA means a ratio smaller than two.

Table 6: Friedman p-values for all ratios when testing different black-box base classifiers. The prediction rates are measured using kappa coefficient. There are significant differences among all the configurations compared.

Table 7: Wilcoxon p-values and Holm’s post-hoc correction when comparing different black-box configurations. The test supports the superiority of RF as black-box base classifier when comparing prediction rates.

Table 8: Friedman p-values for all ratios when testing the prediction performance (kappa) for different white-box and amending configurations. There are statistical differences in the prediction rates in at least one pair of the configurations compared.

Table 9: Wilcoxon’s p-values and Holm’s post-hoc correction when comparing different white-box and amending configurations, for 10% and 20% ratio. Per ratio, the first subsection compares different amending procedures while fixing the white box, and the second subsection fixes the amending to compare the influence of the white boxes. The vast majority of null hypotheses cannot be rejected, indicating that amending or white-box alternatives do not strongly influence the prediction rates.

Table 10: Wilcoxon’s p-values and Holm’s post-hoc correction when comparing different white-box and amending configurations, for 30% and 40% ratio. Per ratio, the first subsection compares different amending procedures while fixing the white box, and the second subsection fixes the amending to compare the influence of the white boxes. The vast majority of null hypotheses cannot be rejected, indicating that amending or white-box alternatives do not strongly influence the prediction rates.

Table 11: Friedman p-values for all ratios when comparing the interpretability in terms of simplicity, for different white-box and amending configurations. There are significant differences among all the configurations compared, where RF-RIP-RST exhibits the highest mean for all ratios (see Table 3).

Table 12: Wilcoxon p-values and Holm’s post-hoc correction when comparing different white-box and amending configurations against the highest mean simplicity combination: RF-RIP-RST. All null hypotheses can be safely rejected, showing statistically significant superiority in terms of simplicity.

Table 13: Friedman p-values for all ratios when comparing SlGb (RF-PART-RST) with state-of-the-art semi-supervised classifiers in terms of prediction rates (kappa). There are significant differences for all ratios, where SlGb exhibits the highest mean (see Table 4).

Table 14: Wilcoxon p-values and Holm’s post-hoc correction using SlGb approach as control method against state-of-the-art semi-supervised classifiers. SlGb significantly outperforms other methods except for CT(SMO) and DCT when using 30% and 40% of labeled instances.
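Several of the tables above report Wilcoxon p-values together with Holm’s post-hoc correction. As a reading aid, the following is a minimal sketch of Holm’s step-down adjustment in plain Python. The function name `holm_adjust` and the example p-values are illustrative assumptions and are not taken from the experiments. The sketch returns adjusted p-values, which can be compared directly with the chosen significance level.

```python
from typing import List, Sequence


def holm_adjust(p_values: Sequence[float]) -> List[float]:
    """Return Holm-adjusted p-values in the original input order.

    The i-th smallest raw p-value (i = 0, 1, ...) is multiplied by
    (m - i), capped at 1.0, and forced to be non-decreasing along the
    sorted order, as required by the step-down procedure.
    """
    m = len(p_values)
    # Indices of the hypotheses sorted by increasing raw p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity of the adjusted values.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


if __name__ == "__main__":
    # Hypothetical raw p-values from three pairwise comparisons.
    raw = [0.01, 0.04, 0.03]
    for p, p_adj in zip(raw, holm_adjust(raw)):
        print(f"raw={p:.3f} adjusted={p_adj:.3f}")
```

A null hypothesis is rejected when its adjusted p-value falls below the significance level, which controls the family-wise error rate across all comparisons against the control method.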