Deep Inverse Feature Learning: A Representation Learning of Error

2020·Arxiv

Abstract

Abstract

This paper introduces a novel perspective about error in machine learning and proposes inverse feature learning (IFL) as a representation learning approach that learns a set of high-level features based on the representation of error for classification or clustering purposes. The proposed perspective about error representation is fundamentally different from current learning methods, where in classification approaches they interpret the error as a function of the differences between the true labels and the predicted ones or in clustering approaches, in which the clustering objective functions such as compactness are used. Inverse feature learning method operates based on a deep clustering approach to obtain a qualitative form of the representation of error as features. The performance of the proposed IFL method is evaluated by applying the learned features along with the original features, or just using the learned features in different classification and clustering techniques for several data sets. The experimental results show that the proposed method leads to promising results in classification and especially in clustering. In classification, the proposed features along with the primary features improve the results of most of the classification methods on several popular data sets. In clustering, the performance of different clustering methods is considerably improved on different data sets. There are interesting results that show some few features of the representation of error capture highly informative aspects of primary features. We hope this paper helps to utilize the error representation learning in different feature learning domains.

Index Terms—error representation learning, deep learning, inverse feature learning, clustering, classification

I. INTRODUCTION

Error as the main source of knowledge in learning has a unique and key role in machine learning approaches. In current machine learning techniques, the error is interpreted without considering its representation in a variety of forms in objective functions, loss functions, or cost functions. In supervised approaches, different loss functions and cost functions have been defined to measure the error as the differences between the true and the predicted labels during the training process in order to provide decision functions which minimize the risk of wrong decisions [1]. In clustering, high-level objectives in the form of clustering loss (e.g., compactness or connectivity) are used to consider similar instances as the clusters. The clustering objectives minimize the cost functions that are defined as the distances between the instances and the representative of clusters (e.g., the centers of clusters in k-means [2]). In this paper, we define the error in a more informative format which captures different representations of each instance based on its class or cluster and uses that toward learning several high-level features to represent the raw data in a compact way.

The proposed error representation learning offers a new approach in which the error is represented as a dynamic quantity that captures the relationships between the instances and the predicted labels in classification or the predicted clusters in clustering. This method is called “inverse feature learning (IFL)” as it firsts generates the error and then learn high-level features based on the representation of error. We propose IFL as a framework to transform the error representation to a compact set of impactful high-level features in different machine learning techniques. The IFL method can be particularly effective in the data sets, where the instances of each class have diverse feature representations or the ones with imbalanced classes. It can also be effective in active learning applications with a considerable cost of training the instances.

In summary, the error representation learning method offers a different source of knowledge that can be implemented by different learning techniques and be used alongside a variety of feature learning methods in order to enhance their performance. In this paper, we present a basic implementation of this strategy to learn the representation of error to focus on the concept, its potential role in improving the performance of state-of-the-art machine learning techniques when a general set of high-level features are utilized across all these techniques. Indeed, the proposed technique can be improved by using additional feature representations as well as defining other dynamic notions of error. The experimental results confirm that even a few simple features defined based on error representation can improve the performance of learning in classification and clustering.

II. RELATED WORKS

The error in most machine learning approaches, depending on whether the approach is supervised or unsupervised has a similar definition of the difference between the predicted and true labels in supervised ones or the clustering losses in unsupervised ones. Also, the term error refers to the same concept in similarity learning [3], semi-supervised learning [4], self-supervised learning [5]. Generative methods are distinguished from discriminative methods in machine learning in terms of considering the underlying distribution, but the error is the same depend on is that the approach is supervised or unsupervised.

Representation learning methods are known for their role in improving the performance of clustering and classification approaches by learning new data representations from the raw data which better presents different variations behind the data [6]. Classification with representation learning typically is a closed-loop learning with many different architectures. Clustering with deep learning, while does not use the labels, can lead to promising results [7]. Unsupervised representation learning methods can be used for pre-training nets, generally for deep networks, or for extracting high-level features, denoising, and dimensionality reductions.

Representation learning approaches in supervised or unsupervised applications are generally based on elements such as restricted Boltzmann machines (RBMs) [8], autoencoder [9, 10], convolutional neural networks (ConvNets) [11], and clustering methods [7]. In the majority of existing classifi-cation with representation learning methods, the term error refers to a function of the differences between the true and the predicted labels (i.e., loss functions) [12]. Also, the error in clustering with representation learning methods is defined as clustering loss of high-level objectives alongside other loss functions [13].

In unsupervised feature learning such as autoencoders, the error still is defined in the form of reconstruction error between the input data and the resultant of encoding and decoding process [9, 10]. Regularization terms [14] or dropout [15] are used alongside with loss functions to enhance the training process to have a better generalization of the learned features (e.g., learning the weights of neural nets). Generative adversarial nets (GANs) [16] use two separate neural networks competing against each other with the ultimate goal of generating synthetic data similar to the original inputs through the generator. However, the concept error in GANs is the same as the other ML approaches. In this paper, we propose a novel notion of error as the resultant representation of assigning the data instances to different available classes or clusters as a source of knowledge for learning. We would like to note that this approach is different from generative approaches such as GANs which are based on data generation while the proposed method is based on the virtual assignment of the data instances to different classes or clusters to consider the resultant representation of such assumptions for each instance in relation to the existing classes or clusters.

This proposed method is also different from similarity learning [3], which learns a similarity function to measure how similar two objects are, in the sense that it extracts the relationships between the objects and the classes. Selfsupervised learning methods are based on finding patterns inside of input instances [5]. The semi-supervised learning and active learning methods, that utilize a combination of classification and clustering, are based on the assumption that the instances which are in the same cluster have the same label and using this assumption toward predicting the labels for new instances [17, 18, 4, 19]. In these methods, the instances near the center of clusters are considered as the most representative objects for the purpose of determining the labels. Other approaches including [20, 21] utilized clustering for active learning in several different ways.

In summary, the proposed method is different from existing techniques in the literature since they focus on data representation learning and calculate errors in classification by the loss functions or cost functions to minimize the risk by taking the difference between the predicted labels and true labels. In clustering with deep learning, the objectives are defined in the form of clustering loss and auxiliary losses [7] which are still based on data representation. The distinctions of the inverse feature learning related to common trends in machine learning is that the inverse feature learning method generates the error by a trial and calculates the resultant representation as the error and then transform the error as high-level features to represent the characteristics of each instance related to the available classes or clusters.

III. NOTATIONS

First, we define the notations which are used for both clustering and classification. Let us consider X and Y to refer to the input and the output spaces, respectively in which each instance consists of h features — . We use notation s to denote the number of classes or the number of clusters. We consider C to denote the set of clusters in clustering, , as is the corresponding classes of instances in the classification. The set of clusters, C can simply correspond to Y in assignment problems using approaches such as Hungarian algorithm.

Notation refers to the center of cluster i. |b| indicates the number of instances in set b. In continue, we define the specific notations used for the classification and clustering approaches.

In clustering, the input data set is presented with , in which indicates the set of input instances and n shows the number of input instances — |X| = n.

In classification, the input training data set is presented with , in which indicates the set of input training instances and shows the number of input instances in the training partition. The label set is denoted by , where is a vector corresponded with data set . Thus, shows the corresponding label for denotes the test set in which refers to the number of test instances. — and .

Z refers to the latent feature space, in which is the new representation of , and e denotes the dimension of latent feature space, where the dimension of the latent space is usually much smaller than the original space (i.e., e << h).

IV. DEEP INVERSE FEATURE LEARNING

In this section, we describe the proposed inverse feature learning, which is defined as a trail process that learns a set of features of the error representation from the latent features of the instances using a deep clustering approach. The learned features of the representation of error can be used for clustering or classification purposes. Here, we describe how the representation of error is generated, and then presented in the form of learned features. First, we provide a high-level

(a) A simple autoencoder which learns a latent feature space, Z, and provides initial clusters’ centroids as the first phase of DEC. The hashed part is used in the second phase of DEC after the training phase.

(b) In the second phase of DEC architecture, a clustering based on KL divergence metric is repeated untill the method converges.

Fig. 1: The network structure of deep embedded clustering (DEC)[22].

review of different steps of this method and then describe them in more detail in the following sub-sections. The pseudo-codes for the IFL in clustering and classification are presented in Algorithm 1, and Algorithms 2, 3, and 4, respectively.

Clustering with deep learning is a practical and efficient solution to handle different data sets, especially the high dimensional ones. Autoencoders can provide a latent feature space which is much smaller than the original feature set [13]. Thus, a clustering method based on an autoencoder has a proper foundation for the proposed inverse feature learning. There are several types of deep clustering methods which can be categorized based on different factors such as the architecture, being deterministic or generative, or the loss functions [13, 7]. We use deep embedded clustering (DEC) as a general and basic approach in the literature of clustering with deep learning that simultaneously learns feature representations and cluster assignments [22]. The autoencoder in DEC offers a robust and embedded representation of data with a nonlinear mapping of the original features to a latent feature space Z. DEC is deterministic, and uses an autoencoder network with learnable parameters based on the fully connected networks (FCNs) and the reconstruction loss in its first phase. Then, the DEC clusters the latent space to s clusters by using cluster assignment hardening in its second phase to generate the centroid, of the clusters. The network structure of DEC is shown in Figure 1.

The inverse feature learning is based on four main steps: Step 1- Inner folding : We introduce inner folding as a method to perform a trial process to extract the desired characteristics of error for all the instances. This process is inspired by k-fold cross-validation, where the data set is partitioned to r folds. One partition of data is considered

Fig. 2: The inner folding process. Step 1: partition the input data to r non-overlapping folds or partitions. A loop with r runs is applied, where in each run, one fold is considered as an inner test and remaining folds are considered as innertrain. The inner-test fold in each run is colored with green.

Fig. 3: Steps 2 and 3: Learning the clustered representation of the inner train data as step 2 by using the DEC to train the autoencoder in the first phase and cluster the latent feature space and obtain their centers in the second phase. The trained encoder in the first phase is used in the second phase of the DEC and also in the extracting the latent features of inner test as step 3.

as inner test and other partitions as inner train in each run. However, the objective of the inner folding process is different from cross-validation in the sense that we cluster the inner train data to obtain the representation of inner test instances based on that clustered representation.

Step 2- Learning the clustered representation of the inner train: In each round of the inner training process, the inner train data is fed as input to the deep embedded clustering (DEC) [22]. DEC clusters the inner train instances. Then, the decoder of DEC which is trained with the inner train instances and the centroids of the clusters are used for the next steps.

Step 3- Extracting the latent features of inner test: The inner test instances are given as input to the trained decoder of the previous step to obtain the latent features of inner test. Step 4- Feature learning: In this step, the clusters’ centroids of the inner train which were calculated in step 2 and the latent features of inner test which were obtained in step 3 are used here to measure several features for the one inner test instance. The relations between the latent features of the inner test and the clusters’ centroids of the inner train are calculated and considered as new features for that inner test sample. Thus, we extract a new set of features for each inner test instance during each run of inner folding process.

In continue, we describe each of these steps in detail.

A. Step 1- Inner folding

Inner folding provides a framework to obtain the evaluation of the representation of instances based on each other. Inner folding partition the data into r non-overlapping folds, in which one fold as the inner test is evaluated against the remaining folds called inner train instances. The inner folding process is similar to the k-fold cross-validation in the sense that during each round, one data fold is considered as the test sample and the remaining folds are the training samples until all the samples are considered as an inner test sample one time. The goal of defining the inner folding process is different from cross-validation as it intends to learn the characteristics of the inner test samples against the inner training samples. This process is depicted in Figure. 2, where the one fold inner test instance of each run is colored with green. In other words, the inner folding partitions the input data to r folds and provides r different versions of the input data depending on which fold is the inner test.

In a formal definition, the inner folding is an indexing function that allocates each instance, to a Fold in which by randomization. , in which .

The inner folding is performed in r runs as a loop where in the loop of r run, the fold J is considered as the inner test and the remaining folds make up the inner train. Similar to the cross-validation procedure, it is assumed that , in which P refers to the distribution of data.

We should note that the inner folding process for the clustering task will be applied to the entire data set (i.e., X), however, during the IFL for the classification task, this inner folding process is only applied on the training data (see Algorithm.3). Inner folding provides features for all train instances through runs on different inner test folds. For test data, we can just evaluate test instance on the whole of training data.

B. Step 2- Learning the clustered representation of the inner train

To handle the data sets with a large number of input features, the deep embedded clustering [22] is selected in this paper to perform the clustering of inner instances, since this method is based on autoencoder that transforms the raw data to a latent feature space with a much smaller number of features. As shown in Figure 1, the DEC is based on two phases. In the first phase, the autoencoder is trained based on the reconstruction loss on input data. In the second phase, the

decoder and the centroid of latent features are given as input to KullbackLeibler (KL) divergence minimization to improve the clustering performance. In the following, we first describe how the DEC works and then how we use the DEC in step 2 and its outcomes in step 3 and step 4 of the IFL.

The autoencoder is an unsupervised data representation that minimizes the reconstruction loss through the network on the input data. The autoencoder network consists of an encoder and a decoder. The encoder is considered as a function that maps the input x into a latent representation z and the decoder is a function that reconstructs the latent representation to input . Parameters and denote

the weights of the encoder and decoder networks, receptively and are learned through the minimizing reconstruction loss — Figure.1a. We used the mean square error as the distance measure [22]; thus, the optimization objective is defined as follow:

The second phase of DEC is clustering with KL divergence which is based on two procedures that are repeated alternately till the clustering method converges. The first procedure, called procedure 2.1 here, computes a soft assignment between the embedded points and the centroids of the clusters. The second procedure, procedure 2.2, updates the weights of the encoder, , and uses auxiliary target distributions to learn from the current assignment to refine the cluster centers. Procedure 2.1: Soft Assignment

The probability of assignment, , as a soft assignment for a sample i to a cluster j is calculated by Student’s

t-distribution [23] as a kernel that measures the similarity between embedded point and centroid :

where denote the latent feature of is the degree of freedom of Student’s t-distribution. Following [22], we consider in all experiments. Procedure 2.2: KL Divergence Minimization

In this step, KL divergence loss is defined to train the model by matching the soft assignment, of the procedure 2.1, to the auxiliary target distribution, , as follow:

We follow the defined target distribution of [22]:

where are soft cluster frequencies. The learning is based on optimizing the cluster centers and the weight of network, .

As mentioned in the previous step, the inner folding is applied on in classification and on X in clustering resulting in folds as inner train and one fold as inner test. Inner train folds in each run, , are given as input, , to the DEC in which the autoencoder in the first phase is trained with inner train instances — Figure 3. Then, the latent features of inner train are learned, , and the centroid of the latent features are used as initial clusters’ centroid for the clustering in the second level. The DEC uses the encoder, the initial clusters centroid, and the latent features to improve the clustering accuracy by using KL divergence which provides and the corresponding centroid — in Figure.3. The encoder in the first phase of DEC is used in step 3 and the clusters and their centroid are utilized in step 4 of the IFL.

C. Step 3- Extracting the latent features of inner test

Now in step 3, the encoder part of autoencoder which was trained in the first phase of DEC, , is fed with the inner test instances of the run, , to obtain the corresponding latent features of inner test instances, , as shown in step 3 in Figure 3.

The output of this step is a set of the latent features learned for the inner test instances of the training data, , in classification and from X in clustering. It should be noted that for the test data in classification, there is not any inner train or inner test since labels are used in classification and the labels of test data in classification are not available. The latent representation of test data in classification, , is obtained based on the autoencoder that was trained over the test data in step 2 — see Algorithm 4.

D. Step 4- Feature Learning

We consider two sets of non-overlapping instances as inner train and inner test for training data in classification or the whole instances in clustering. Also, training and test data in classification are two separate sets. The representation of one set, inner test instances or test instances, is evaluated based on the clustered representation of other set instances, inner train or training. We assign each inner-test instance to all possible clusters and measure the representation of such assignment on the clustered representation and map in the form of features as the corresponding error representation of that inner test instance for that cluster. Here, we introduce three metrics which are measured per instance of each cluster, . Two metrics of confidence and weight are calculated easily without the labels and can be used in both classification and clustering. Also, we extract the accuracy of assignment for clusters by using the labels of training for classification purposes.

Confidence is the ratio of the number of elements in a cluster in which the inner-test instance has the closest distance to its center to the number of all instances in all clusters. Thus, we need to find the closest cluster by measuring the distance between the latent space of the instance to the centroids of the clusters which are obtained from the DEC and calculating the number of instances which belongs to that cluster to the number of all instances. If , in which is the corresponding latent feature of . Confidence is one feature. It can be used in both clustering and classification purposes.

In classification, we replicate the instances as the number of clusters and we place the learned features of each cluster in one of the replications.

Weight of belonging an instance in inner test to a cluster on inner train is the distance of the latent feature of the element to the center of the cluster, . We calculate the weight for each inner test and the centers of all clusters, , through different runs of inner folding. In other words, weight is a vector in length of number of clusters in which = dist . Thus, the number of features of weight is the same as the number of clusters. This measure can be used in both clustering and classification purposes.

Accuracy of Assignment is the unsupervised clustering accuracy (ACC) of a cluster. It requires the labels of the instances; thus, it can be only applied in classification purposes. Thus, we use the selected cluster for each inner test, , in confidence that has the closest distance and measure ACC for that cluster — = . The number of ACC feature is one.

We track the clusters through different runs of inner-folding to place the calculated weight feature of inner test instances of clusters which are the most representative of the same elements in the same columns through different runs. It can be calculated and verified that the clusters can be tracked correctly if the ACC of a clustering approach which is used in step 2 on instances is more than . In this paper, we set r = 10; thus, the ACC of the clustering approach should be at least 35% for a clustering problem with 4 clusters. The ACC of the clustering approach in step 2 should be at least 20% for a problem with 10 clusters. 20% and 35% are low standards for clustering approaches especially deep ones. DEC’s ACCs on such data sets are times bigger than such thresholds. Therefore, the clusters which are representative of similar elements are tracked accurately. This process is similar to using the labels to track the clusters as the representative of classes.

V. EXPERIMENTAL RESULTS

In this section, we evaluate the performance of the proposed feature learning model. Since this is the first work to propose the strategy of feature learning based on the representation of error, we consider a basic framework that can be applied across various data sets in order to demonstrate the capability of this strategy in achieving high accuracy in both classification and clustering applications when only using a small set of learned features. Indeed, this strategy can be further extended to develop refined implementations to better match the characteristics of different types of input data sets. To present a fundamental implementation of the IFL, we deployed DEC that uses an autoencoder based on fully connected networks (FCNs) [22]. Thus, DEC can be applied in diverse types of input data such as real-value, text, and some types of images. Although the DEC or other deep clustering based on FCNs [13, 7] cannot offer a comparable performance related to the models based on ConvNets for image data sets while they can be applied in a wide set of data sets. For example, deep clustering models based on ConvNets can learn the locality and shift-invariance in images much better than other architecture such as FCNs, however, the structures of ConvNets usually have limited range of usages in other domains[13, 7]. We

TABLE I: Descriptions of the data sets.

should note that the majority of the state-of-the-art deep learning methods are based on using augmentations or generative approaches and regularization inside the process that considerably improve their performance. As mentioned earlier, in this paper we mainly focus on the representation of error rather than attempting to develop the most optimal network architecture. Developing more advanced clustering techniques can be considered as future work. In this paper, following DEC [22], we used the autoencoder network that is based on FCNs in 9 layers that dimensions are d500500200010-2000-500-500-d, where d is the number of features of data sets and the batch size is 256 for all of the experiments.

In this paper, we evaluated the proposed method on five common data sets [22, 27, 28, 13, 7] that are described in Table I. These data sets cover a variety of data types, sizes, and dimensions out of which, two are handwritten digit image data sets (USPS 1, MNIST [11]), and two of them include extracted features of documents ( Reuters-10K [30] and Reuters [30]), and the other one is HHAR [31] which is a multi-class benchmark with real values of the sensory activities. The details of the data sets and pre-processing of them are described in the following:

MNIST: MNIST data set [11] include 70000 handwritten digits with size 28*28 pixels that range of the values is between [0, 255]. The value of each pixel divided into max values to transform into an interval [0, 1]. Since we are using FCN, we reshape that in form of a vector in size 784 [22].

USPS: USPS data set contains 9298 gray-scale handwritten digits with size 16*16 pixels that the range of the values is between [-1, 1]. The value of each pixel divided into two to transform into an interval [-0.5, 0.5].

Reuters: Reuters data set [30] includes 810000 English news stories that are labeled based on a category tree. We followed [22] to use 4 root categories: corporate/industrial, government/social, markets, and economics as labels while the documents with multiple labels have not considered. The result documents are 685071 ones that tf-idf features are computed on the 2000 word stems with most frequent occurring.

In Reuters-10K, the 10000 random documents of Reuters are sampled as same as [22].

HHAR: The Heterogeneity Human Activity Recognition (HHAR) data sets that contains sensory information about smartphones and smartwatches. The data set include 10299 instances with 561 features of 6 categories of human activities. we follow the data set as same as [28].

Here, we compare the performance of the proposed IFL with some state-of-the-art techniques that are fairly general such as

Fig. 4: The representation of HHAR data sets while the learned features of error are added to corresponding instances by tSNE through different phases of DEC.

[22, 27, 28] in clustering or several common classification methods which can handle real values, text or images. In this paper, we set the number of runs for the inner folding process to 10 (i.e., r = 10) in both classification and clustering for all experiments. We repeat the experiments 5 times to calculate the variance of the results in both classification and clustering. The numbers in the parenthesis show the variance. If there are not any parentheses, it means that the variance is smaller than 0.01.

A. Clustering

Two main categories in clustering methods are the classical clustering and deep clustering. Classical clustering methods include partition-based method (e.g. K-means), spectral clustering such as density-based clustering methods, and hierarchical clustering (e.g., agglomerative clustering). The spectral clustering methods are not scalable to large data sets considering their computational complexity [22]. The deep learning based clustering methods are developed based on different criteria such as the architecture or the loss as reviewed in [13, 7]. The clustering methods with deep learning have shown superior performance in several applications (e.g., ConvNets in image data sets, or the generative-based models). Here, we implement the idea of inverse feature learning using DEC that shows our results are comparable or even better than the state-of-the-art methods, where we select at least one clustering method out of each category.

The results are compared with classical clustering methods such as k-means[32], hierarchical cluster analysis (HCA) based on bottom up approach for the ward and average linkage criterion [33], and Gaussian mixture model (GMMs) as well as several deep learning based clustering methods including LDMGI [24], autoencoder+k-means, autoencoder + LDMGI [24], autoencoder+SEC [25], DEC, improved DEC (IDEC) [27], VaDE [28] which is a generative and based on variational autoencoder (VAE) [34], autoencoder+GMM, VAE+GMM,

TABLE II: Comparison of clustering accuracy for different data sets.

adversarial autoencoder (AAE) [26], and information maximizing self-augmented training (IMAST) [35]. We do not compare the results with the clustering approaches based on ConvNets as they can just be applied on image data sets.

The common unsupervised clustering accuracy, denoted by ACC, is used to evaluate the performance of this model as defined as follows:

where is the cluster assignment for instance i determined by the clustering approach, is the ground-truth label or the label of instance, and m is the mapping function that ranges over all possible one-to-one mapping between and .

The ACC of the clustering models is reported for the primary features, just with the learned features of error (IFL), and the primary features in addition to the learned features (Primary +IFL). It should be noted that the IFL features are highly abstract, (i.e., only 1 + s in which s shows the number of clusters). This number of features is significantly less than the number of original features of data sets — several hundred times. For instance, the number of IFL features in Reuters and Reuters 10k data set is 5 while the number of primary features is 2000. In MNIST and USPS, the numbers of primary features are 784 and 256, respectively while the number of IFL features is 11. For HHAR data set, the number of primary features is 561 while IFL features are just 7 ones.

We evaluate IFL based on two scenarios to show how the IFL feature works separately and along with the primary features in Table II. HCA clustering approaches and LDMGI cannot be applied to Reuters data set. We use the open libraries of scikit-learn [36] for HCA that cannot handle the data sets like Reuters. Also, as motioned in [22] LDMGI and SEC need “months of computation time and terabytes of memory” for Reuters.

As shown in Table II, the approaches such as AAE which is based on GAN autoencoder, variational deep embedding (VaDE) which is a generative model based on VAE or IMAST (VAT) which uses regularization and self-augmentation provide better performance compared to other techniques. It can be seen in the table that IMAST (VAT) provides the best results on MNIST data set (even better than ConvNets [35]), however. it cannot keep the superiority over other data sets. In fact, the self-training augmentation of IMAST helps the networks handle shift-invariance issues in image data sets, but in data sets such as Reuters 10K that this issue does not exist, its performance in comparison to other deep learning methods degrades considerably. VaDE which utilizes both variational autoencoders and generative models shows a robust superior performance related to other approaches except on MNIST data set. HHAR data set is not a sparse data set like other data sets that makes it a challenging one for classic clustering approaches.

Table II shows when the IFL features are used in addition to the original features in different clustering methods such as DEC and k-means, they provide robust results with a negligible variance (i.e., smaller than 0.01). In this table, “N/R” to the cases, for which the results were not reported and “N/A” refers to the cases where a particular approach cannot be applied one a data set because of the required memory or the required time as mentioned in [22].

In the first scenario, we only used the IFL features alone, and noting that the number of INF features is considerably less than the primary features, the k-means clustering method cannot work properly on most of the data sets except Reuters, in which there are considerable instances. HCA based on average and ward linkage offers interesting results in comparison to other deep learning methods except for the HHAR data set. The results of this table show that the IFL features are not descriptive enough to be used alone on the data sets which are not sparse. However, the performance of the DEC approach using only the IFL features is considerably better than the cases when this clustering is applied on the primary features or many other deep clustering methods including LDMGI, autoencoder +(k-means, SEC, LDMGI, GMM), VAE+GMM, AAE or IDEC.

In the second scenario, the learned features are used along with the primary features for DEC or other clustering approaches. In this scenario, we achieved the best results compared to several state-of-the-art techniques, except for the MNIST data set, where our best performance is about 2% lower than IMAST — as we mentioned earlier, IMAST uses self-training augmentation to handle shift-invariance of images and its results are not steady in other types of data sets such as Reuters 10k, where the accuracy of IMAST is 10% lower than our proposed approach. Moreover, the results in Table II demonstrate that adding the IFL features to the original features considerably improves the performance of the classical clustering techniques such as k-means and HCA to the point that they achieve comparable results with the state-of-the-art deep clustering methods.

We demonstrate the latent representations of input data and latent feature space for HHAR data set through different phases of DEC into 2D space by t-SNE [23] in Figure 4. As it can be seen in this figure, the clusters become well separated through the steps while the accuracy improves.

In summary, based on the extensive experimental results, we can conclude that the IFL features provide a new perspective that offers considerable accuracy improvement in DEC and even the classic clustering techniques. Even when the IFL features are fed alone to the clustering methods, they provide a proposer representation of the data set that improves the performance of the classic clustering techniques to the level of deep-learning-based clustering approaches. Thus, the IFL framework can provide a practical and fast solution to enhance the performance of the classical clustering techniques, in particular for sparse data sets such as image and text data sets.

B. Classification

We evaluated the performance of the IFL features in classi-fication by using several known and state-of-the-art classifica-tion methods in the literature which can handle different types of data including Decision Tree [37], Random Forests [38] as an general ensemble method, XGboost [39] as a known boosting method in machine learning, K-nearest neighbors (KNNs), and a Multi-layer Perceptron classifier (MLP) that utilizes adam as weight optimization. We use the libraries of scikit-learn [36] for the mentioned classifiers. Accuracy is used as the evaluation metric.

In classification, IFL features can be used based on two techniques. In the first technique, we consider the weight feature of each cluster for each instance separately. In other words, we form a new data set that the number of its instances is the number of instances of the original data set that is multiplied with the number of clusters and the number of IFL features equals 3 — one feature for confidence, one feature for weight, and one feature for accuracy of assignment. When the IFL features are considered along with the primary features in the first technique, there are versions as the number of clusters of each instance that the primary features and also the confidence and accuracy of the assignment are the same among them while the weight feature is different. The labels of all the versions of each instance are the same for training. The final label that is considered for each instance of test data is the label with the maximum value among its versions. We used the first technique for the data sets that are sparse like images and text. The second technique is like the procedure of IFL features in clustering. In the second technique, we consider the features of each instance including weight corresponding to different clusters as a whole. Thus, the number of IFL features equal with the number of clusters+2 — one feature for confidence, one feature for accuracy of assignment, and s features for the weight in which s is the number of classes. We found this technique is proper for the data sets that are not sparse like HHAR. The first technique is the reliable and robust one and we consider that as the default technique. The second technique is recommended for data sets that are not sparse. It can be seen in Table III for HHAR data set the second technique provides much better results in comparison to the first technique.

It should be noted that the IFL features are highly abstract, (i.e., only 3 in the first technique and only 2 + s in which s shows the number of clusters in the second technique). This number of features is significantly less than the number of original features of data sets — several hundred times. For instance, the number of IFL features in Reuters and Reuters 10k data set is 3 in the first technique while the number of primary features is 2000. In MNIST and USPS, the numbers of primary features are 784 and 256, respectively while the number of IFL features is 3. For HHAR data set, the number of primary features is 561 while IFL features are just 8 ones in the second technique.

TABLE III: It is compared accuracy of the classifiers without and with the inverse feature learning based on the first technique. In the proposed method, the learned features are independent of the classifier, hence using a better classifier can achieve higher accuracy over our learned features as it does over other sets of features.

TABLE IV: The accuracy of the classifiers are shown just with the learned features. It should be noted the features in IFL are highly abstract and much few in comparing the number of features of data sets (several hundred time fewer); for example, in Reuters data set, 2000 vs 3.

As shown in Table III, the IFL features improved the accuracy of the Decision Tree classifier on different data sets and also bring down the variance of the results compared to the scenario when only the primary features are used. We have not reported the results on Reuters data set for classification except the MLP classifier that results for the primary features and (primary features + IFL) are 97.41% & 98.12% correspondingly, since the required time and memory were out of the capacity of our systems and the libraries. Since the decision tree directly works with the feature space, the IFL features in different data sets improve the results from 2% to 12%. For the Random Forests classifier, the IFL features improve the results over all data sets except the USPS, where the accuracy is degraded for 0.03%. For the XGboost and KNNs classifiers, the features improve the results or at least offer a comparable accuracy. In MLP, we can see a slight improvement in the results in which the variance also decreased. In conclusion, the IFL features enhance the classifiers’ accuracy, especially on approaches in which the features are used directly. As shown in Table IV, the IFL features when used alone provide some interesting results. It shows the IFL features capture informative aspects of data.

VI. CONCLUSION

In this paper, we introduce error representation learning as a novel perspective about the error in machine learning and inverse feature learning as a representation learning strategy that stands on deep clustering to learn the representation of error in the form of high-level features. The strategy of inverse feature learning can be implemented by different learning techniques. We applied the strategy based on DEC which is a general and basic clustering approach, however, the IFL framework can be applied on other clustering techniques as well. The performance of the IFL framework is evaluated on five popular data sets using different clustering techniques. We consider two evaluation scenarios, where the IFL features are used alone or in addition to the primary features. The experimental results show that even in the cases that the small set of IFL features are used, the clustering accuracy for some clustering techniques (e.g., DEC) is comparable to several other approaches. This confirms that the IFL features are highly informative. The results of the second evaluation scenario, in which the IFL features are used along with the primary ones, demonstrate considerable improvement for DEC and other classical clustering approaches. In this scenario, we achieved the best result compared to several state-of-the-art techniques over the Reuters-10K, Reuters, and HHAR data sets. It should be noted that several of these state-of-the-art approaches are generative, based on VAE, and also use self-augmentation training and customized regularization. Moreover, the experimental results show the performance of the proposed IFL framework on a variety of known classification methods and in most cases the result is improved. Learning error representing features based on the generative and specialized network tailored to the data sets (e.g., ConvNets for images) can be considered as future works.

REFERENCES

[1] L. Rosasco, E. D. Vito, A. Caponnetto, M. Piana, and A. Verri, “Are loss functions all the same?” Neural Computation, vol. 16, no. 5, pp. 1063–1076, 2004.

[2] S. Lloyd, “Least squares quantization in pcm,” IEEE transactions on information theory, 28, no. 2, pp. 129– 137, 1982.

[3] M. Koestinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof, “Large scale metric learning from equivalence constraints,” in Computer Vision and Pattern

Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2288–2295.

[4] O. Chapelle, B. Scholkopf, and A. Zien, “Semisupervised learning (chapelle, o. et al., eds.; 2006)[book reviews],” IEEE Transactions on Neural Networks, 20, no. 3, pp. 542–542, 2009.

[5] V. R. de Sa, “Learning classification with unlabeled data,” in Advances in neural information processing systems, 1994, pp. 112–119.

[6] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” nature, vol. 521, no. 7553, pp. 436–444, 2015.

[7] E. Min, X. Guo, Q. Liu, G. Zhang, J. Cui, and J. Long, “A survey of clustering with deep learning: From the perspective of network architecture,” IEEE Access, vol. 6, pp. 39 501–39 514, 2018.

[8] P. Smolensky, “Information processing in dynamical systems: Foundations of harmony theory,” COLORADO UNIV AT BOULDER DEPT OF COMPUTER SCIENCE, 1986.

[9] H. Bourlard and Y. Kamp, “Auto-association by mul- tilayer perceptrons and singular value decomposition,” Biological cybernetics, 59, no. 4-5, pp. 291–294, 1988.

[10] G. E. Hinton and R. S. Zemel, “Autoencoders, minimum description length and helmholtz free energy,” in Advances in neural information processing systems, 1994, pp. 3–10.

[11] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, 86, no. 11, pp. 2278– 2324, 1998.

[12] K. Janocha and W. M. Czarnecki, “On loss functions for deep neural networks in classification,” arXiv preprint arXiv:1702.05659, 2017.

[13] E. Aljalbout, V. Golkov, Y. Siddiqui, M. Strobel, and D. Cremers, “Clustering with deep learning: Taxonomy and new methods,” arXiv preprint arXiv:1801.07648, 2018.

[14] J. Kukaˇcka, V. Golkov, and D. Cremers, “Regulariza- tion for deep learning: A taxonomy,” arXiv preprint arXiv:1710.10686, 2017.

[15] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014.

[16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.

[17] M. Seeger, “Learning with labeled and unlabeled data,” Tech. Rep., 2000.

[18] O. Chapelle, J. Weston, and B. Sch¨olkopf, “Cluster kernels for semi-supervised learning,” in Advances in neural information processing systems, 2003, pp. 601– 608.

[19] K. Benabdeslem and M. Hindawi, “Efficient semi- supervised feature selection: constraint, relevance, and redundancy,” IEEE transactions on knowledge and data

engineering, 26, no. 5, pp. 1131–1143, 2014.

[20] Z. Xu, K. Yu, V. Tresp, X. Xu, and J. Wang, “Repre- sentative sampling for text classification using support vector machines,” in European Conference on Information Retrieval. Springer, 2003, pp. 393–407.

[21] H. T. Nguyen and A. Smeulders, “Active learning using pre-clustering,” in Proceedings of the twenty-first international conference on Machine learning. ACM, 2004, p. 79.

[22] J. Xie, R. Girshick, and A. Farhadi, “Unsupervised deep embedding for clustering analysis,” in International conference on machine learning, 2016, pp. 478–487.

[23] L. v. d. Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of machine learning research, vol. 9, no. Nov, pp. 2579–2605, 2008.

[24] Y. Yang, D. Xu, F. Nie, S. Yan, and Y. Zhuang, “Image clustering using local discriminant models and global integration,” IEEE Transactions on Image Processing, vol. 19, no. 10, pp. 2761–2773, 2010.

[25] F. Nie, Z. Zeng, I. W. Tsang, D. Xu, and C. Zhang, “Spectral embedded clustering: A framework for insample and out-of-sample spectral clustering,” IEEE Transactions on Neural Networks, vol. 22, no. 11, pp. 1796–1808, 2011.

[26] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey, “Adversarial autoencoders,” arXiv preprint arXiv:1511.05644, 2015.

[27] X. Guo, L. Gao, X. Liu, and J. Yin, “Improved deep embedded clustering with local structure preservation.” in IJCAI, 2017, pp. 1753–1759.

[28] Z. Jiang, Y. Zheng, H. Tan, B. Tang, and H. Zhou, “Variational deep embedding: an unsupervised and generative approach to clustering,” in Proceedings of the 26th International Joint Conference on Artificial Intelligence, 2017, pp. 1965–1972.

[29] K. Ghasedi Dizaji, A. Herandi, C. Deng, W. Cai, and H. Huang, “Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 5736–5745.

[30] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, “Rcv1: A new benchmark collection for text categorization research,” Journal of machine learning research, vol. 5, no. Apr, pp. 361–397, 2004.

[31] A. Stisen, H. Blunck, S. Bhattacharya, T. S. Prentow, M. B. Kjærgaard, A. Dey, T. Sonne, and M. M. Jensen, “Smart devices are different: Assessing and mitigatingmobile sensing heterogeneities for activity recognition,” in Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems. ACM, 2015, pp. 127– 140.

[32] J. MacQueen et al., “Some methods for classification and analysis of multivariate observations,” in Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, vol. 1, no. 14. Oakland, CA, USA, 1967, pp. 281–297.

[33] T. Hastie, R. Tibshirani, and J. Friedman, The elements of statistical learning: data mining, inference, and pre-

diction. Springer Science & Business Media, 2009.

[34] D. P. Kingma and M. Welling, “Auto-encoding varia- tional bayes,” arXiv preprint arXiv:1312.6114, 2013.

[35] W. Hu, T. Miyato, S. Tokui, E. Matsumoto, and M. Sugiyama, “Learning discrete representations via information maximizing self-augmented training,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017, pp. 1558– 1567.

[36] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.

[37] L. Breiman, Classification and regression trees. Routledge, 2017.

[38] A. Liaw, M. Wiener et al., “Classification and regression by randomforest,” R news, vol. 2, no. 3, pp. 18–22, 2002.

[39] T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,” in Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining. ACM, 2016, pp. 785–794.