1.1 Biological Background
Cancer can be seen as a collection of diseases, all characterized by abnormal and uncontrolled cell growth that can spread to surrounding tissues. In 2018, cancer was the second leading cause of death worldwide, responsible for 9.6 million deaths, approximately 70% of which occurred in developing countries [6]. Gene expression is the phenotypic manifestation of a gene or genes by the processes of genetic transcription and translation [5]. Its analysis can improve our understanding of the molecular basis of cancer, which can directly influence the prognosis, diagnosis, and treatment of such conditions. The main cancer genomics projects, such as The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium, try to translate gene expression by cataloging and profiling, through next-generation sequencing, thousands of samples across different types of cancer. These projects provide genome-wide gene expression assay datasets with more than 50k gene-representative features. Working with this type of data can be challenging due to (1) the small number of examples, (2) the imbalanced distribution of classes, and (3) potential underlying noise caused by technical and biological covariates [4].
1.2 Technical Background
Given the high mortality rate of cancer, it is crucial to classify this type of disease correctly and accurately. This need has led many research groups to study the application of Machine Learning algorithms, with the aim of modeling the progression and treatment of cancerous conditions [3].
Xie et al. developed a predictive model based on a combination of a Multilayer Perceptron and a Stacked Denoising Autoencoder (MLP-SAE) to assess how well genetic variants contribute to gene expression changes [10]. The described model is composed of 4 layers (one for the input, two hidden layers from two autoencoders (AEs), and one for the output), with the Mean Squared Error (MSE) as the loss function. The authors first trained the AEs with a stochastic gradient descent algorithm and later used them in the multilayer perceptron training phase (i.e., they used the AEs as weight initialization). They used cross-validation to select the optimal model and subsequently (1) compared its performance with the Lasso and Random Forest methods, and (2) evaluated its performance when predicting gene expression values on an independent dataset. The authors concluded that the MLP-SAE model (1) outperformed both of these methods, with an MSE of 0.2890 (against 0.2912 and 0.2967, respectively), and (2) can capture the changes in gene expression quantification.
Teixeira et al. [8] analyze the combination of different unsupervised feature learning methods, namely Principal Component Analysis (PCA), Kernel Principal Component Analysis (KPCA), Denoising Autoencoder (DAE), and Stacked Denoising Autoencoder, with different sampling methods for classification purposes. The authors focused on the influence of the input nodes on the reconstructed output of the AEs when feeding the results of these combinations to a shallow artificial neural network that distinguishes papillary thyroid carcinoma from healthy samples. In 5-fold cross-validation, the combination of SMOTE and Tomek links with KPCA had the best overall performance, with a mean F1 score of 98.12%. Nevertheless, Teixeira et al. preferred the DAE, stating that it yielded similar results (though with a mean F1 score of 94.83%).
In [1], Ferreira et al. developed a methodology for the detection of papillary thyroid carcinoma. They studied and compared the performance of a deep neural network classifier architecture in which autoencoders (AEs) were used as a weight initialization method. The AEs were pre-trained to minimize the reconstruction error and subsequently used to initialize the weights of the top layers of the classification network, with two different strategies: (1) just the encoding layers, and (2) the complete pre-trained AE. Six types of AEs were used: Basic AE, Denoising AE, Sparse AE, Denoising Sparse AE, Deep AE, and Deep Sparse Denoising AE. No sampling, data augmentation, or normalization techniques were applied when pre-processing the data. To evaluate and support the results, the authors used stratified 5-fold cross-validation to split the data into training and validation partitions, reporting 4 different metrics: Loss, Precision, Recall, and F1 score. Their best result was the combination of unsupervised feature learning through a single-layer Denoising AE, followed by its complete import into the classification network and subsequent fine-tuning through supervised training, achieving an F1 score of 99.61%, with a variance of 0.54.
Table 1: An example of 5 samples of the thyroid dataset. The header line contains the names of the genes, and the column values represent the expression of each gene in each sample. NA means that the value is missing for that gene and sample.
2.1 The Data
We used 3 different RNA-Seq datasets from The Cancer Genome Atlas (TCGA), each one representing a type of cancer: thyroid, skin, and stomach. A small sample of one of the datasets is shown in Table 1. All three datasets are composed of the same 20442 features (genes). Each feature represents one gene, and each cell value in the table represents the expression of that gene in a given sample. The thyroid cancer dataset has 509 examples, the skin cancer dataset 472, and the stomach cancer dataset 415.
Each dataset is processed separately. For each one, we start by removing the features that have the same value for all the instances in the dataset. When a value is constant for all the examples, it carries no entropy (i.e., it is not possible to infer any information from it). We then impute the missing values (NAs, as shown in Table 1) with the average value of the respective column, and add to each dataset a column Label that matches each instance to its type of cancer. Our goal is to distinguish different types of cancer, so we assign a positive value (1) to the class we want to predict and 0 to the remaining ones: when training the model to detect thyroid cancer, all thyroid examples are labeled as 1 and the skin and stomach instances as 0. Likewise, when training to detect skin (or stomach) cancer, all skin (or stomach) examples are labeled as 1 and the instances of the remaining two types of cancer are labeled as 0. However, after this process, it is not guaranteed (and is actually quite unlikely) that the same features have been removed from the 3 cancer datasets. Thus, when merging the 3 sets of data, we only use the intersection of their features, so that the different types of cancer are represented by the same features. After the full data pre-processing, the final dataset has 18321 feature columns and 1396 examples (36% of thyroid cancer, 34% of skin cancer, and 30% of stomach cancer).
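The pre-processing steps described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' code: the function names, the toy gene names (G1, G2, G3), and the toy values are all hypothetical, and missing values are assumed to be encoded as NaN.

```python
import numpy as np

def drop_constant_genes(X, genes):
    """Remove columns whose observed (non-NaN) values are all identical."""
    keep = []
    for j in range(X.shape[1]):
        observed = X[~np.isnan(X[:, j]), j]
        keep.append(observed.size > 0 and np.unique(observed).size > 1)
    keep = np.array(keep, dtype=bool)
    return X[:, keep], [g for g, k in zip(genes, keep) if k]

def impute_column_mean(X):
    """Replace each NaN with the mean of the observed values in its column."""
    X = X.copy()
    col_mean = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_mean[cols]
    return X

def merge_on_shared_genes(datasets):
    """datasets maps cancer name -> (X, genes); keep genes present in all of them."""
    shared = set.intersection(*(set(genes) for _, genes in datasets.values()))
    first_genes = next(iter(datasets.values()))[1]
    ordered = [g for g in first_genes if g in shared]  # stable column order
    blocks, cancer_type = [], []
    for name, (X, genes) in datasets.items():
        index = {g: j for j, g in enumerate(genes)}
        blocks.append(X[:, [index[g] for g in ordered]])
        cancer_type += [name] * X.shape[0]
    return np.vstack(blocks), ordered, np.array(cancer_type)

def one_vs_rest_labels(cancer_type, target):
    """Label 1 for the cancer type to detect, 0 for the remaining ones."""
    return (cancer_type == target).astype(int)

# Toy data (hypothetical): G2 is constant in thyroid, G1 is constant in skin.
raw = {
    "thyroid": (np.array([[1.0, 5.0, np.nan],
                          [2.0, 5.0, 4.0],
                          [3.0, 5.0, 6.0]]), ["G1", "G2", "G3"]),
    "skin": (np.array([[3.0, 1.0, 7.0],
                       [3.0, 2.0, 9.0]]), ["G1", "G2", "G3"]),
}

processed = {}
for name, (X, genes) in raw.items():
    X, genes = drop_constant_genes(X, genes)   # each dataset separately
    processed[name] = (impute_column_mean(X), genes)

merged_X, merged_genes, cancer_type = merge_on_shared_genes(processed)
y_thyroid = one_vs_rest_labels(cancer_type, "thyroid")
```

In this toy case only G3 survives in both datasets, which mirrors why the merged dataset has fewer feature columns than the 20442 original ones.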
2.2 Autoencoders
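An AE is a neural network that aims to reproduce its input [7]. Let $f_{\theta_f}$ and $g_{\theta_g}$ correspond to the encoding and decoding functions of the AE, parameterized on $\theta_f$ and $\theta_g$ respectively, where $\theta = \theta_f \cup \theta_g$, $L$ being an appropriate loss function, and $J$ the cost function to be minimized. In its learning process, an AE tries to find the value for $\theta$ that leads to the minimal value of function $J$:

$$\theta^{*} = \arg\min_{\theta} J(\theta; X), \qquad J(\theta; X) = L\big(X, g_{\theta_g}(f_{\theta_f}(X))\big),$$

assigning a penalty to the reconstruction of the input, $\hat{X} = g_{\theta_g}(f_{\theta_f}(X))$, when it is distinct from the original data $X$ [2].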
In this work, we chose to evaluate only the AE that had the best result in [1] (the DAE) as weight initialization of a classification architecture, studying two different approaches for weight initialization and two different strategies for embedding the AE layers. A DAE [9] is a type of AE that tries to preserve the input's information by undoing the effect of a corruption process applied to the input of the AE: it minimizes $L(X, g(f(\tilde{X})))$, where $\tilde{X}$ is a copy of the input $X$ corrupted by some form of noise [2]. In our case, we apply a Dropout layer directly after the input layer as a form of Bernoulli noise, which randomly sets 10% of the input values to zero. The hidden encoding layer size is 128.
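The corruption step and the denoising objective can be illustrated with a small NumPy forward pass. This is a minimal sketch under assumptions that are not stated in the text: the ReLU encoder, the linear decoder, the MSE reconstruction loss, the random weights, and the toy number of genes are all hypothetical choices. Only the 10% Bernoulli noise and the 128-unit encoding layer come from the description above. Framework Dropout layers usually also rescale the surviving values during training; that rescaling is omitted here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def bernoulli_corrupt(X, rate=0.10):
    """Zero each input value independently with probability `rate`."""
    mask = rng.random(X.shape) >= rate   # True where the value is kept
    return X * mask, mask

def dae_forward(X_noisy, W_enc, b_enc, W_dec, b_dec):
    """Encode the corrupted input and decode it back to input space."""
    hidden = np.maximum(0.0, X_noisy @ W_enc + b_enc)  # ReLU encoder (assumed)
    return hidden @ W_dec + b_dec                      # linear decoder (assumed)

n_samples, n_genes, n_hidden = 32, 1000, 128   # n_genes is a toy size
X = rng.normal(size=(n_samples, n_genes))

# Randomly initialised weights; in practice these are learned by gradient descent.
W_enc = rng.normal(scale=0.01, size=(n_genes, n_hidden))
b_enc = np.zeros(n_hidden)
W_dec = rng.normal(scale=0.01, size=(n_hidden, n_genes))
b_dec = np.zeros(n_genes)

X_noisy, mask = bernoulli_corrupt(X)
X_hat = dae_forward(X_noisy, W_enc, b_enc, W_dec, b_dec)

# The reconstruction is compared with the clean input X, not with X_noisy.
loss = np.mean((X - X_hat) ** 2)
```

The key design point is the last line: a DAE is penalised for failing to recover the uncorrupted data, which is what distinguishes it from a basic AE.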
2.3 Methodology
As in [1], our experiment consists of assessing the performance of a deep neural network classifier architecture in which we vary the top layers. However, we aim to identify 3 distinct types of cancer instead of distinguishing cancerous from healthy samples. We pre-train the autoencoders to minimize the reconstruction error and subsequently use them to initialize the weights of the top layers of the classification network, with two different strategies: (1) just the encoding layers, and (2) the complete pre-trained autoencoder.
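Together with the two ways of handling the imported weights that this work compares (Approach A, fixed weights, and Approach B, fine-tuned weights), the two embedding strategies give four experimental configurations. The sketch below only enumerates them. The layer names and the `imported_layers` helper are hypothetical and stand in for whatever layer objects a deep learning framework would provide.

```python
from itertools import product

# Hypothetical names for the layers of a single-hidden-layer pre-trained DAE.
ENCODER_LAYERS = ["encoder"]
DECODER_LAYERS = ["decoder"]

def imported_layers(strategy, approach):
    """Return (layer_name, trainable) pairs for the layers copied from the AE."""
    if strategy == "encoding_only":        # Strategy 1
        layers = ENCODER_LAYERS
    elif strategy == "complete_ae":        # Strategy 2
        layers = ENCODER_LAYERS + DECODER_LAYERS
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    if approach == "A":                    # imported weights stay fixed
        trainable = False
    elif approach == "B":                  # imported weights are fine-tuned
        trainable = True
    else:
        raise ValueError(f"unknown approach: {approach}")

    return [(name, trainable) for name in layers]

configs = {
    (strategy, approach): imported_layers(strategy, approach)
    for strategy, approach in product(["encoding_only", "complete_ae"], ["A", "B"])
}

for key, layers in configs.items():
    print(key, layers)
```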
Each architecture is thus trained to classify the input data as either thyroid, skin, or stomach, according to the type of cancer. We use the same architecture as Ferreira et al. and, given that this architecture was built for a binary classification task, we decided to adapt this multi-class classification problem to a "binary label" one: for a type of cancer C, we train the model to distinguish C from not C, instead of distinguishing cancer from healthy samples. Besides the top layers imported from the AE, the classification region of the full network starts with a Batch Normalization layer and proceeds with two Fully Connected layers using Rectified Linear Unit (ReLU) activation; the last layer, the prediction layer, is a single-neuron layer with a Sigmoid non-linearity.
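The classification region can be pictured as a short forward pass. The sketch below is an assumption-laden illustration rather than the architecture actually used: it reads the description as two ReLU layers followed by a separate single-neuron sigmoid layer, the hidden widths (64 and 32) are hypothetical because the text does not give them, the weights are random, and the batch normalization uses batch statistics without the learnable scale and shift.

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_norm(H, eps=1e-5):
    """Normalise each feature over the batch (scale=1, shift=0 for simplicity)."""
    return (H - H.mean(axis=0)) / np.sqrt(H.var(axis=0) + eps)

def relu(Z):
    return np.maximum(0.0, Z)

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def classifier_head(H, params):
    """Batch Normalization -> two ReLU layers -> single sigmoid neuron."""
    H = batch_norm(H)
    H = relu(H @ params["W1"] + params["b1"])
    H = relu(H @ params["W2"] + params["b2"])
    return sigmoid(H @ params["W3"] + params["b3"]).ravel()

# Input: a batch of 16 samples as produced by a 128-unit encoding layer.
features = rng.normal(size=(16, 128))

params = {
    "W1": rng.normal(scale=0.1, size=(128, 64)), "b1": np.zeros(64),  # width assumed
    "W2": rng.normal(scale=0.1, size=(64, 32)),  "b2": np.zeros(32),  # width assumed
    "W3": rng.normal(scale=0.1, size=(32, 1)),   "b3": np.zeros(1),
}

probs = classifier_head(features, params)   # estimated P(sample is cancer type C)
pred = (probs >= 0.5).astype(int)           # 1 = C, 0 = not C
```

Because the output is a single probability, one such network is trained per cancer type, which is how the three-way problem is reduced to three "C versus not C" tasks.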
Figure 1: Loss values on the training set for the 300 epochs of autoencoder training, with the corresponding minima. The x-axis represents the number of epochs, and the y-axis the loss value. The grey area represents the variance of the loss value.
2.4 Evaluation
To provide statistical evidence for our results, we use stratified 5-fold cross-validation. The DAE and the classifier are trained for 100 and 300 epochs, respectively, with a batch size of 500. The loss of the classifier is the binary cross-entropy [2], and the model is trained with the Adam optimizer. We then evaluate its performance through 4 additional metrics, Accuracy, Precision, Recall, and F1 score, on both the training and the validation sets.
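For reference, the evaluation quantities named above can be written out with the Python standard library alone. This is a minimal sketch with hypothetical helper names and toy labels; it is not the evaluation code used in this work, and in practice an existing library implementation would normally be used instead.

```python
import math
import random
from collections import defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Split indices into k folds that preserve the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            folds[pos % k].append(i)     # deal each class round-robin
    return folds

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)

def classification_metrics(y_true, y_pred):
    """Accuracy, Precision, Recall, and F1 score for binary labels."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Toy labels (hypothetical): 10 positives and 20 negatives.
labels = [1] * 10 + [0] * 20
folds = stratified_folds(labels, k=5)

# Toy predictions with 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

Each fold serves once as the validation partition while the other four are used for training, and the reported values are averaged over the five runs.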
The methodology may be generalized to other datasets and problems: Importing the complete pre-trained DAE to the upper layers of the classification architecture and allowing subsequent fine-tuning achieved the best overall performance, with an F1 score of 98.04% (when detecting thyroid cancer), a result that is quite close to the overall best of 99.61% reported in [1].
Figure 2: Loss values on the validation set for the 300 epochs of autoencoder training, with the corresponding minima. The x-axis represents the number of epochs, and the y-axis the loss value. The grey area represents the variance of the loss value.
Table 2: Performance comparison of the classifier. We only import the top layers from a DAE, since it was the AE that led to the best results in [1] (where the best overall result was the combination of a Complete DAE with Approach B, achieving an F1 score of 99.61%). Results are reported separately for thyroid cancer detection, skin cancer detection (Sk), and stomach cancer detection (St). When measuring loss, lower is better. For all the other metrics, higher is better. All the values presented are averages over a 5-fold cross-validation on the validation set, selecting the best performing model according to its F1 score.
However, for the detection of skin and stomach cancers, the best results were 97.81% (± 1.76) and 97.54% (± 1.25), respectively, obtained with a combination that differs only in the DAE layers that are embedded into the classifier (only the encoding layers). We may assume that this methodology can generalize to other types of data.
Fine-tuning (Approach B) leads to better results than fixing the weights (Approach A): In [1], the authors stated that their results could not support the claim that Approach B gave better results than Approach A. However, with our data, it is clear that fine-tuning the weights of the top layers leads to better results, by a margin of 10–20% in the F1 score, as one can see in Table 2.
There is not enough evidence to support the assumption that the overall usage of AEs seems to capture the most relevant information for the task: Although our overall best was close to the overall best of the previously referred work, there is a large difference between the two approaches of weight initialization when experimenting with our data. There is also a large divergence between the AE curves in the training and validation phases, as can be observed in Figure 1 and Figure 2. One may assume that the AE learning process is being compromised, given that (1) in some cases, in the validation phase (for example the DSAE, Figure 2d, and the DSDAE, Figure 2f), the minimum is found too early, and (2) the data split in the cross-validation may influence the learning process.
In this work, we have compared the performance of a Denoising Autoencoder (DAE) as an unsupervised initialization method for deep classification neural networks applied to a cancer vs. cancer classification task. For that, we have used the methodology described in [1]: we combined a DAE with two different approaches when training the classification architecture: (a) fixing the imported weights, and (b) allowing them to be fine-tuned during supervised training. We studied two different strategies for embedding the DAE into the classification network: (1) using the encoding layers as weight initialization, and (2) using the complete AE, i.e., both the encoding and decoding layers.
Taking Ferreira et al. as a reference model, we think that it may be possible to generalize the methodology to other datasets and problems. Importing a complete pre-trained DAE to the top layers of the classifier (Strategy 2), followed by fine-tuning (Approach B), achieved the best overall results when detecting thyroid cancer, with an F1 score of 98.04% ± 1.09. Fine-tuning led to better results, improving the F1 score by between 10 and 20%. Contrary to the results obtained in the mentioned previous work, there is not enough evidence in this problem to support the assumption that the overall usage of AEs seems to capture the most relevant information for the task.
[1] Ferreira, M. F., Camacho, R., and Teixeira, L. F. (2018). Autoencoders as weight initialization of deep classification networks applied to papillary thyroid carcinoma. In 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 629–632.
[2] Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. The MIT Press.
[3] Kourou, K., Exarchos, T. P., Exarchos, K. P., Karamouzis, M. V., and Fotiadis, D. I. (2015). Machine learning applications in cancer prognosis and prediction. Computational and Structural Biotechnology Journal, 13:8–17.
[4] Kukurba, K. R. and Montgomery, S. B. (2015). RNA sequencing and analysis. Cold Spring Harb Protoc, 2015(11):951–969.
[5] NCBI - National Center for Biotechnology Information (2017). Gene expression. https://www.ncbi.nlm.nih.gov/probe/docs/applexpression/.
[6] World Health Organization (2018). Cancer fact sheet. https://www.who.int/en/news-room/fact-sheets/detail/cancer.
[7] Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Parallel distributed processing: Explorations in the microstructure of cognition, vol. 1. chapter Learning Internal Representations by Error Propagation, pages 318–362. MIT Press, Cambridge, MA, USA.
[8] Teixeira, V., Camacho, R., and Ferreira, P. G. (2017). Learning influential genes on cancer gene expression data with stacked denoising autoencoders. In 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 1201–1205.
[9] Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, pages 1096–1103, New York, NY, USA. ACM.
[10] Xie, R., Wen, J., Quitadamo, A., Cheng, J., and Shi, X. (2017). A deep auto-encoder model for gene expression prediction. BMC Genomics, 18(9):845.