Fig. 1: Chromatin modifications in epigenome: 1) DNA methylation: DNA sequence with cytosine followed by guanine from 5’ to 3’, modified by methyl (denoted by M); 2) Histone modification: histone composed of H2A, H2B, H3, and H4, modified by methyl, acetyl (denoted by A) and phosphate (denoted by P).
to normal cells. The CpG regions associated with genetic loci of Alzheimer’s disease are validated [9] and differentially methylated regions are identified. Brain DNA methylation in selected genetic loci was found [10] to be associated with the pathology of Alzheimer’s disease. It is found that global DNA methylation is increased in a study involving late-onset Alzheimer’s disease group [11] and the healthy control group. On schizophrenia disease, by the genome-wide DNA methylation analysis [12], unique genetic locations are found to be associated with schizophrenia. Differential expression and methylation of genes [13] are found for schizophrenia and bipolar disorder patients. There are more recent works discussing the genetic associations of DNA methylation in the schizophrenia disease [14]–[17]. The existing works on DNA methylation disease association are focusing on specific CpG islands and genetic locations where the disease cells have aberrant patterns compared with non-disease cells.
Existing cell-classification works based on DNA methylation are focusing on identifying certain CpG islands with most significant changes of DNA methylation between the disease and non-disease cells, and on the association study of the DNA methylation on these CpG islands to the disease status. A classification method is proposed to identify the cancer-driver genes [18] that have significant relevance in terms of statistical DNA methylation variation between cancer and non-cancer disease cells. Note that, our proposal is different from these existing works in the sense that it does not need to identify genes
TABLE I: The GEO datasets in the data analysis of this work.
that have disease associations. The DNA methylation vector is directly transformed by WHT and the resulting sequence is the input to the machine learning classifiers for the disease cell classification. The classification is studied based on the DNA methylation of 22 selected CpG islands [19] for cancer cell classification. The classification identifier of CpG Island Methylator Phenotype (CIMP) [20] is applied to classify the patients with T-Cell Acute Lymphoblastic Leukemia. A defined value, named FR-pair, is proposed [21] to classify the normal and cancer cells based on the DNA methylation beta value vector. To the best of our knowledge, no transform-domain method has been proposed on the DNA methylation vector for cell classification by other research groups.
Existing works on applying machine learning/deep learning approaches to DNA methylation are mostly on the regression computation of DNA methylation to predict the methylation states at unknown CpG sites. The deep convolutional neural network is designed in program CpGenie [22] for DNA methylation prediction at CpG sites, and the methylation prediction results are applied to the prediction of the impact of functional variants at non-coding regions on the DNA methylation. A hybrid deep learning network, named DeepCpG [23], has been proposed to obtain the regression results of DNA methylation states based on the single-cell profiled DNA and DNA methylation data, where the proposed structure is composed of a convolutional neural network to process DNA data, and a recurrent neural network to process DNA methylation data. Support Vector Machine (SVM) and stacked denoising autodecoder have been applied [24] on the prediction of binary DNA methylation states, and it is found that the stacked denoising autodecoder has close and slightly lower prediction accuracy than SVM. A gradient boosting method, named BoostMe [25], is studied on the prediction of the DNA methylation and shows performance gains over random forests and DeepCpG. Review work on deep learning on bioinformatics has also been reported in the same work. Other related computational methods studying the DNA methylation datasets include continuous-time Markov chain [26] and support vector regression model [27]. In these existing works, deep-learning techniques are applied to the regression to predict the DNA methylation states.
Our contributions: In contrast to existing works on human disease cell classification which identify the
TABLE II: Step values and ranges of selected normal tissue cells in the GEO datasets.
most aberrant CpG islands at genetic locations, we are proposing new classification approach based on the discovery of the low dimension and three-step property of DNA methylation profile after transform-domain method of Walsh-Hadamard Transform (WHT). We find that the WHT of the DNA methylation beta value vector produces a low dimension transform-domain vector. This transform domain vector has values with three steps of amplitude levels. The rest of the transform-domain vector has low amplitude similar to noisy elements. This low dimension and three-step property of DNA methylation with WHT is one of the main contributions and was published in our work [28]. In this journal extension, this dicovery is extended to the performance comparison between the proposal and the whole sequence classification by comprehensive evaluations of typical machine learning classifiers suited for this application. The comparisons are done on classification of disease and non-disease cells. To summarize, the contributions of our work are:
• The low-dimension and three-step properties are found on the WHT transform-domain vector of
• A new cell classification pipeline is proposed based on the transform-domain method of WHT. The
• It is found that, the proposal can speed up the disease cell classification by more than an order
Broader applications: This approach is also valuable for disease risk evaluations, for example, cancer risk or neural degenerative disease risk evaluations. The DNA methylation variations are associated with several internal factors such as the disease status, the age, the genetic information, and the tissue types.
Fig. 2: The WHT transform vectors for NCBI datasets GSE63384 and GSE40032. Three-step values are observed for both cancer and normal cells in each dataset. This property is found to be a consistent trend.
However, external factors such as dietary and lifestyle/living conditions complicate the study of these
variations, since both internal and external factors jointly determine the DNA methylation profile status, for a particular tissue cell at a particular time. For example, the tissue cells of potential cancer patients can be sequenced and subsequently classified by the proposed pipeline to obtain its disease status.
Article outline: This article is organized as follows. The proposed pipeline is described in Sect. II. The cell classification results are given in Sect. III. Finally, conclusions are drawn in Sect. IV.
In this work, the data in the analysis part is from the NCBI Gene Expression Omnibus (GEO) datasets. GEO contains DNA methylation profiles of a large number of human diseases and tissue types, such as colorectal cancer cell, lung cancer cell, etc. Four datasets of cancer cells are selected and the beta
Fig. 3: The WHT transform vectors for NCBI datasets GSE17648 and GSE73003. Three-step values are observed for both cancer and normal cells in each dataset.
values are the measured values for the CpG islands detected. The cell-classification results include the machine-learning and additional deep-learning results on the four datasets.
Data acquisition procedure: The Human Methylation27K DNA Analysis BeadChip (Illumina) can measure methylation information at a single-base resolution for 27,578 CpG islands. After the bisulfite-conversion of the samples, the whole-genome amplification was performed with the input of bisulfite-converted DNA. The output was fragmented, purified and applied to the BeadChips by Illumina-supplied reagents and conditions. The array was stained and scanned after extension. The intensities of the unmethylated and methylated bead types were measured by the fluorescent signals.
Extraction of CpG islands and computation of beta values: The DNA methylation signal intensities are extracted at the CpG islands measured by the DNA methylation Human Methylation27K BeadChip. The DNA methylation beta values are subsequently computed. The beta value [29] is defined on each CpG island as is the CpG island index, M(m) is the methylated signal intensity, and U(m) is the unmethylated signal intensity, and
a normalization factor commonly chosen to be 100. The methylated and unmethylated signal intensities are the fluorescent signal measurements. The beta value is in the range of 0 to 1, representing the level of methylation at particular CpG islands.
Performing Walsh-Hadamard Transform (WHT): We introduce now the low-dimension, three-step property of DNA methylation profile beta value vector after applying the Walsh-Hadamard Transform (WHT). Firstly WHT is introduced; then the property after transformation is presented. The WHT
is an orthogonal transform that is based on the Walsh matrix. The transform of length K is defined by
where a(i) is the i-th element in the data sequence of length K before transformation, W(k, i) is the k-th row and i-th column element of the Walsh matrix, and r(k) is the k-th element of the vector after the
transformation, with . The Walsh matrix is defined recursively as,
for -th Walsh matrix can also be expressed by,
where is the Kronecker product. WHT basically decomposes the original data sequence into a series of basis functions of Walsh matrix. The transform is also named Walsh Transform or Hadamard Transform in short. This transform has a fast implementation named Fast Walsh-Hadamard Transform (FWHT), therefore it is suitable for the transformation of large-sized data vector.
Machine learning classification analysis: The disease and normal cell samples are processed with the pipeline followed by machine learning classification. The beta value vectors are firstly truncated to short sequence with varying length; then, the short sequences are analyzed by the machine learning classifiers on disease association. Classification accuracy results are generated for the k-fold cross-validation on several machine-learning classifiers and the preferred classifiers are selected for this classification task.
Fig. 4: Scatter plot of colorectal cancer cell (GSE69954) and T-cell acute lymphoblastic leukemia cell (GSE17648).
In this part, the three-step properties of the transform-domain vector are firstly reported. This discovered property is further applied to the classification of disease and normal cells. The classification accuracies and computation time are evaluated for various machine learning classifiers.
Three-Step properties after WHT: We found that the beta value vector after WHT shows a three-step property, and this property is consistently shown for the human tissue cells we have evaluated. To illustrate this property, WHT is performed on the beta value vectors of three NCBI GEO datasets and the three-step effect are plotted in Fig. 2, which depicts the WHT domain value vs. the domain index (from 2 to 100). These three datasets are beta value vectors measured by the HumanMethylation 27K bead chip. The length of the sequence before WHT transformation is approximately 27 thousand. It can be observed that the transform-domain vector has the three-step property with low dimension. The three-step property for 27K bead chip array is, there is one step in the range of 2-nd to 4-th elements, which has the highest amplitude. The second step is in the range of 5-th to 8-th elements with the amplitude lower than in the first step. The third step is in the range of 9-th to 32-nd elements, and the amplitude is lower than in the second step. We have neglected the first element due to its very large amplitude compared with other samples. We have tested a large number of beta value vectors including the normal tissue cells and disease tissue cells and found that this property is consistently shown in all the human cell samples tested. Based on this observation, we further propose disease and non-disease cell classification method.
The averaged step values and ranges of the steps of normal human tissue cell from the GEO datasets
TABLE III: The classification accuracy comparisons of the whole original sequence classifiers and the WHT based classifiers. The k-fold cross-validation is applied to all the classifiers. In the table the value of k in the k-fold cross-validation is noted. The classification accuracy with WHT is the averaged result of classification accuracies at three WHT feature space values equaling 83, 89 and 95.
are given in Table II. The step values in the table are computed from the normal tissue cells only in the selected GEO datasets. For each step value, the mean of one sample within the range of the step is firstly computed, then the step value is averaged over all the normal tissue cells. It is found that the three-step amplitude values are distinctively different for different human tissue cells; it is further found that the range of the step value only depends on the sequencing technique for obtaining the data. In Table II, we
TABLE IV: The computation time (in seconds) of the classification with the whole original sequence and the proposed WHT data.
have shown one dataset measured by Illumina HumanMethylation 450K beadchip. The transform-domain vector also shown three-step property, with different cut-off elements than 27K beadchip measurements. For 450K beadchip, the range of the first step is from the 2-nd to the 4-th element, and the range of the second step is from the 5-th to the 16-th element, and the range of the third step is from the 17-th to the 64-th element. We have neglected the first element because this value is much larger than the other elements and also varies greatly from sample to sample. Later classification algorithm also indicates that including the first element does not improve the classification accuracy. Thus, the first element is excluded in the step-range partition.
The three-step property of the WHT transform-domain vector is related to the distribution of the beta values controlled by the chromatin state of human cells. The chromatin state is characterized by the modifications on the eight histone proteins, the methylation on DNA sequence, and the chromatinassociated proteins in a particular genome sequence. The chromatin state can be modelled by Hidden Markov Model (HMM) [30], [31], where the hidden states of the HMM are the states of chromatin; and the observed outputs of the HMM are the measured DNA methylation and histone modification. There are different chromatin states for different types of human cells [32]. The DNA methylation intensity measurements at the CpG islands are controlled by the chromatin state. The beta value distribution is determined by the DNA methylation intensity measurements at the CpG islands. The three-step property is determined by the distribution of the beta values. Thus the chromatin state of human cell is the major reason of the three-step property in the transform domain vector of the DNA methylation beta values of human cells. The human cells have chromatin states producing the DNA methylation intensity measurements that create the three-step property of the beta values after WHT transformation.
It is informative to note that, the transform-domain vector can be applied to classify different types of cells. In Fig. 4 the scatter plot of two different types of human tissue cells, the colorectal cancer cell (GSE69954) and the T-cell acute lymphoblastic leukemia cell (GSE17648). The mean of the first and second steps are the classification features for classifying these two types of cells in the disease condition. It can be observed from the scatter plot that the samples of the two cells are far apart; thus, a linear classifier can easily distinguish these two type of cells. Therefore, the mean of the step values can be the feature space for the classification of different human tissue cells. By testing, we find that the classification accuracy of the cell types by the Support Vector Machine in GSE69954 and GSE17648 can achieve 100% accuracy for the two types of issues. This is because, for different types of cells the scatter plot samples are well separated in the plane as indicated in Fig. 4. Due to this fact, the cell type classification is not the focus on this work.
The disease cell classification is a more difficult problem than cell type classification due to the subtle difference in the DNA methylation profile between disease cells and normal cells. The scatter plots of disease and non-disease cells of the same type of tissue have significant overlap. The major classification problem becomes how to classify the disease and non-disease cells for the same tissue type. It is thus necessary to test the machine learning classification algorithms on disease and non-disease cells of the same tissue type. In the following section, we show the results to classify the disease and non-disease cells by the proposed method.
Disease cell classifications: In this part, the disease cell classification is done for WHT and the whole original sequence. The numbers of disease and normal cells in the four types of NCBI datasets are given in Table I. Since the samples in the analysis are from NCBI datasets, the procedure to obtain these samples is following the respective groups that perform the data collection. Since all datasets are collected on the HumanMethylation 27K beadchip platform, the procedures are the same for the four datasets applied in this research. The experimental procedure to obtain these datasets is explained in Sect. II. The machine learning classification methods we considered are: Ensemble Method of Boosting, Ensemble Method of Bagging, Ensemble Method of Random Subspace, Decision Tree, Support Vector Machine, k-Nearest Neighbors, and Feedforward Neural Network. For the Ensemble Method of Boosting, the algorithm of LogitBoost is applied. For WHT data, these machine learning classifiers are applied on the truncated transform-domain data. Based on the discussions in previous sections, the proposed WHTbased transformation can reduce the length of the beta value vector to less than 100 for both Illumina HumanMethylation 27K beadchip and Illumina HumanMethylation 450K beadchip. Note that, the latent low-dimensional feature of DNA methylation is discovered by WHT, therefore in the article all the machine learning algorithms are compared with the WHT assumption. The comparisons with other latent feature methods are out of the scope of this work.
The classification accuracy performance is evaluated with k-fold cross-validation. There are two k values selected for the performance evaluation—(i) k = sample size: one sample is for each validation and the remain samples are for training. The training and validation are done for the times equaling to the sample size. The final classification accuracy performance result is obtained by averaging these intermediate results. (ii) k = 3: the datasets are partitioned into three folds, two folds for the training and one fold for the validation. The training and validation are done three times with three setups, and the final classification accuracy results are obtained by averaging the three results. The performance results of the WHT data and the original whole sequence data are obtained in Table III. The computation running time results are given in Table IV. The running time include the time of WHT computation, k-fold cross-validation training and inference of the classifiers evaluated.
From Table III, we can observe that the Ensemble Method of Boosting and Support Vector Machine achieve comparably high accuracy performance among all the classifiers evaluated. With Ensemble Method of Boosting and Support Vector Machine, the WHT has the close performance to the whole sequence, for both k-fold cross-validation options including k equaling to the sample size and k equal to 3. For the Ensemble Method of Boosting and k equal to sample size, with dataset GSE63384 the classification accuracy is 0.9381 and 0.9286 for WHT and whole sequence; with dataset GSE40032, the classification accuracy is 0.9119 and 0.9540 for WHT and whole sequence; with dataset GSE17648 the accuracy is 0.9773 and 0.9318 for WHT and whole sequence; with dataset GSE73003 the accuracy is 0.9250 and 0.9500 for WHT and whole sequence. For the Ensemble Method of Subspace, the proposal has performed better than the whole sequence classification for all the four datasets. For the classifier of Decision Tree, the proposed WHT method performs slightly better for GSE63384 and GSE40032 to the whole sequence classification. Similarly, with Support Vector Machine, WHT also achieves close performance to the whole sequence for both k-fold cross-validation options.
It can also be observed that for the classifiers of Ensemble Method of Boosting and Support Vector Machine, the overall performance of WHT is at the same level to the performance of the whole sequence— the classification accuracy of WHT is slightly higher than with the whole sequence in some cases, and slightly below in some other cases. The classification accuracies of WHT are in bold in Table III for the two selected classifiers of Ensemble Method of Boosting and Support Vector Machine. It is noted that for the Ensemble Method of Bagging there is one dataset GSE40032 that has performance discrepancy between WHT and whole sequence, therefore Ensemble Method of Bagging is not selected as the preferred classifier. It can also be observed that the following classifiers do not produce high classification accuracy results, therefore, are not selected: Ensemble Method of Subspace, Decision Tree, k-Nearest Neighbors, and Feedforward Neural Network. Note that, the Feedforward Neural Network adopts the basic structure of one input layer, and one hidden layer and one output layer.
From the running time results in Table IV, it can be observed that the proposed WHT based classifiers achieve one order of magnitude reduction in computation time compared with the whole sequence classification. For the Ensemble Method of Boosting, the proposed WHT based method achieves the following folds of running time speed improvement compared with the whole sequence classification: 13 folds for GSE63384, 9.5 folds for GSE17648, 8.8 folds for GSE73003, and 26 folds for GSE40032. The run time speed up folds observed for the classifier of Support Vector Machine are: 31 folds for GSE63384, 31 folds for GSE17648, 29 folds for GSE73003, and 36 folds for GSE40032. Together with the classification accuracy performance in Table III, it is identified that the Ensemble Method of Boosting and Support Vector Machine with WHT achieve comparable performance compared with the whole sequence while having significant advantages of reduced computational time. All the training is done on the same qual-core desktop computer so that the computational time durations of different machine learning algorithms are compared under the same hardware configuration. Note that the data in Table III is obtained by averaging the last three WHT feature space vector length values equal 83, 89, and 95. The reason we have selected the last three points to produce the results in Table III is that there are fluctuations while varying the feature space vector length, and the last three points are the most stable results.
We presented a novel human disease cell-classification approach based on the three-step property of the DNA methylation profile with the Walsh-Hadamard Transform (WHT). The feature space vector is chosen as the WHT transform domain vector, and classification accuracy results are presented for competing machine-learning classifiers. We found that our proposal can achieve significant running time reduction while maintaining comparable performance compared with the original sequence classification for the identified classifiers include the Ensemble Method of Boosting and Support Vector Machine. Our proposal has valueable practical applications to the disease human cell classification tasks that require expedited, real-time computation. This framework has the perspectives to be applied to human cell classifications of other disease types and be generalized to other types of epigenome and genome datasets.
[1] J. T. Poirier and et al., “DNA methylation in small cell lung cancer defines distinct disease subtypes and correlates with high expression of EZH2,” Oncogene, vol. 34, p. 58695878, Mar 2015.
[2] A. A. Slater and et al., “DNA methylation profiling distinguishes histological subtypes of renal cell carcinoma,” Epigenetics, vol. 8, no. 3, pp. 252–267, 2013.
[3] M. R. Hassler and et al., “Insights into the pathogenesis of anaplastic large-cell lymphoma through genome-wide DNA methylation profiling,” Cell Reports, vol. 17, pp. 596–608, Feb 2018.
[4] T. Mazor and et al., “DNA methylation and somatic mutations converge on cell cycle and define similar evolutionary histories in brain tumors,” Cancer Cell, vol. 28, pp. 307–317, Sep 2015.
[5] S. A. Farkas, B. G. Sorbe, and T. K. Nilsson, “Epigenetic changes as prognostic predictors in endometrial carcinomas,” Epigenetics, vol. 12, no. 1, pp. 19–26, 2017.
[6] M. M. Bjaan and et al., “Genome-wide DNA methylation analyses in lung adenocarcinomas: Association with EGFR, KRAS and TP53 mutation status, gene expression and prognosis,” Molecular Oncology, vol. 10, no. 2, pp. 330 – 343, 2016.
[7] E. Cappelli, G. Felici, and E. Weitschek, “Combining DNA methylation and RNA sequencing data of cancer for supervised knowledge extraction,” BioData Mining, vol. 11, no. 1, p. 22, 2018.
[8] E. Weitschek, F. Cumbo, E. Cappelli, G. Felici, and P. Bertolazzi, “Classifying big DNA methylation data: A gene-oriented approach,” Communications in Computer and Information Science, vol. 903, 2018.
[9] P. L. De Jager and et al., “Alzheimer’s disease: early alterations in brain DNA methylation at ANK1, BIN1, RHBDF2 and other loci,” Nature Neuroscience, vol. 17, pp. 1156–1163, Aug 2014.
[10] L. Yu and et al., “Association of brain DNA methylation in SORL1,ABCA7, HLA-DRB5, SLC24A4 and BIN1 with pathological diagnosis of Alzheimer’s disease,” JAMA Neurology, vol. 72, no. 1, pp. 15–24, 2015.
[11] A. D. Francesco and et al., “Global changes in DNA methylation in Alzheimers disease peripheral blood mononuclear cells,” Brain, Behavior, and Immunity, vol. 45, pp. 139 – 144, 2015.
[12] L. F. Wockner and et al., “Genome-wide DNA methylation analysis of human brain tissue from schizophrenia patients,” Translational Psychiatry, vol. 4, pp. 339 –, Jan 2014.
[13] C. Chen and et al., “Correlation between DNA methylation and gene expression in the brains of patients with bipolar disorder and schizophrenia,” Bipolar Disorders, vol. 16, no. 8, pp. 790–799, 2014.
[14] S. Deng, W. Hu, V. D. Calhoun, and Y. Wang, “Integrating imaging genomic data in the quest for biomarkers of schizophrenia disease,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 15, pp. 1480–1491, Sep. 2018.
[15] B. C. McKinney, H. Lin, Y. Ding, D. A. Lewis, and R. A. Sweet, “DNA methylation age is not accelerated in brain or blood of subjects with schizophrenia,” Schizophrenia Research, Feb. 2018.
[16] T. Jiang and et al., “Variation in global DNA hydroxymethylation with age associated with schizophrenia,” Psychiatry Research, vol. 257, pp. 497 – 500, 2017.
[17] N. E. Huda and et al., “DNA methylation of membrane-bound catechol-o-methyltransferase in malaysian schizophrenia patients,” Psychiatry and Clinical Neurosciences, Nov. 2017.
[18] F. Celli, F. Cumbo, and E. Weitschek, “Classification of large DNA methylation datasets for identifying cancer drivers,” Big Data Research, vol. 13, pp. 21 – 28, 2018.
[19] S. M. Langevin and et al., “CpG island methylation profile in non-invasive oral rinse samples is predictive of oral and pharyngeal carcinoma,” Clinical Epigenetics, vol. 7, no. 1, p. 125, 2015.
[20] M. Borssen and et. al., “DNA methylation adds prognostic value to minimal residual disease status in pediatric T-Cell acute lymphoblastic leukemia,” Pediatric Blood & Cancer, vol. 63, no. 7, pp. 1185–1192, 2016.
[21] H. Li, G. Hong, H. Xu, and Z. Guo, “Application of the rank-based method to DNA methylation for cancer diagnosis,” Gene, vol. 555, no. 2, pp. 203 – 207, 2015.
[22] H. Zeng and D. K. Gifford, “Predicting the impact of non-coding variants on dna methylation,” Nucleic Acids Research, vol. 45, no. 11, p. e99, 2017.
[23] C. Angermueller, H. J. Lee, W. Reik, and O. Stegle, “DeepCpG: accurate prediction of single-cell DNA methylation states using deep learning,” Genome Biology, vol. 18, p. 67, Apr 2017.
[24] Y. Wang and et al., “Predicting DNA methylation state of CpG dinucleotide using genome topological features and deep networks,” Scientific Reports, vol. 6, pp. 19598–, Jan 2016.
[25] L. S. Zou and et al., “BoostMe accurately predicts DNA methylation values in whole-genome bisulfite sequencing of multiple human tissues,” BMC Genomics, vol. 19, p. 390, May 2018.
[26] J. A. Capra and D. Kostka, “Modeling DNA methylation dynamics with approaches from phylogenetics,” Bioinformatics, vol. 30, no. 17, pp. 408–414, 2014.
[27] C. Xu and et al., “A novel strategy for forensic age prediction by DNA methylation and support vector regression model,” Scientific Reports, vol. 5, pp. 17788–, Dec 2015.
[28] X. Zhao and D. Pompili, “Walsh-Hadamard transform of DNA methylation profile for the classification of human cancer cells,” in ACM International Conference on Bioinformatics and Computational Biology, ICBCB ’17, pp. 26–29, 2017.
[29] P. Du and et al., “Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis,” BMC Bioinformatics, vol. 11, no. 1, p. 587, 2010.
[30] J. Ernst and M. Kellis, “Chromatin-state discovery and genome annotation with ChromHMM,” Nature Protocols, vol. 12, p. 2478, Nov 2017.
[31] E. Marco and et al., “Multi-scale chromatin state annotation using a hierarchical hidden markov model,” Nature Communications, vol. 8, p. 15011, Apr 2017.
[32] J. Ernst and et al., “Mapping and analysis of chromatin state dynamics in nine human cell types,” Nature, vol. 473, no. 7345, pp. 43–49, 2011.