Solving Problems of Clustering and Classification of Cancer Diseases Based on DNA Methylation Data 1,2

Similar documents
Machine-Learning on Prediction of Inherited Genomic Susceptibility for 20 Major Cancers

Supplementary Figures

Identification of Tissue Independent Cancer Driver Genes

Exploring TCGA Pan-Cancer Data at the UCSC Cancer Genomics Browser

Supplementary Figure 1: LUMP Leukocytes unmethylabon to infer tumor purity

The 16th KJC Bioinformatics Symposium Integrative analysis identifies potential DNA methylation biomarkers for pan-cancer diagnosis and prognosis

Elevated RNA Editing Activity Is a Major Contributor to Transcriptomic Diversity in Tumors

TCGA. The Cancer Genome Atlas

An integrative pan-cancer-wide analysis of epigenetic enzymes reveals universal patterns of epigenomic deregulation in cancer

Nature Genetics: doi: /ng Supplementary Figure 1. Workflow of CDR3 sequence assembly from RNA-seq data.

Nature Getetics: doi: /ng.3471

DiffVar: a new method for detecting differential variability with application to methylation in cancer and aging

FOLLICULAR LYMPHOMA- ILLUMINA METHYLATION. Jude Fitzgibbon

TCGA-Assembler: Pipeline for TCGA Data Downloading, Assembling, and Processing. (Supplementary Methods)

RNA SEQUENCING AND DATA ANALYSIS

Ahrim Youn 1,2, Kyung In Kim 2, Raul Rabadan 3,4, Benjamin Tycko 5, Yufeng Shen 3,4,6 and Shuang Wang 1*

Predicting Breast Cancer Survival Using Treatment and Patient Factors

Promoter methylation of DNA damage repair (DDR) genes in human tumor entities: RBBP8/CtIP is almost exclusively methylated in bladder cancer

RNA SEQUENCING AND DATA ANALYSIS

Tissue-specific DNA methylation loss during ageing and carcinogenesis is linked to chromosome structure, replication timing and cell division rates

An annotated list of bivalent chromatin regions in human ES cells: a new tool for cancer epigenetic research

Predicting Non-Small Cell Lung Cancer Diagnosis and Prognosis by Fully Automated Microscopic Pathology Image Features

38 Int'l Conf. Bioinformatics and Computational Biology BIOCOMP'16

Gene Selection for Tumor Classification Using Microarray Gene Expression Data

Identifying Thyroid Carcinoma Subtypes and Outcomes through Gene Expression Data Kun-Hsing Yu, Wei Wang, Chung-Yu Wang

Pan-cancer analysis of expressed somatic nucleotide variants in long intergenic non-coding RNA

Comparison of Methylation Density in Different Cancer Types by Illumina Infinium HumanMethylation450 Methods

Feature Selection and Extraction Framework for DNA Methylation in Cancer

Large-scale DNA methylation expression analysis across 12 solid cancers reveals hypermethylation in the calcium-signaling pathway

Tissue independent and tissue specific patterns of DNA methylation alteration in cancer

Survey on Breast Cancer Analysis using Machine Learning Techniques

Mammogram Analysis: Tumor Classification

CLASSIFICATION OF BRAIN TUMOUR IN MRI USING PROBABILISTIC NEURAL NETWORK

Predicting Kidney Cancer Survival from Genomic Data

Efficacy of the Extended Principal Orthogonal Decomposition Method on DNA Microarray Data in Cancer Detection

Model-free machine learning methods for personalized breast cancer risk prediction -SWISS PROMPT

Package MethPed. September 1, 2018

Pan-cancer patterns of DNA methylation

Mammogram Analysis: Tumor Classification

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014

Improved Intelligent Classification Technique Based On Support Vector Machines

Dominic J Smiraglia, PhD Department of Cancer Genetics. DNA methylation in prostate cancer

User s Manual Version 1.0

Multiclass Classification of Cervical Cancer Tissues by Hidden Markov Model

When Overlapping Unexpectedly Alters the Class Imbalance Effects

Augmented Medical Decisions

DNA methylation signatures for 2016 WHO classification subtypes of diffuse gliomas

Nature Methods: doi: /nmeth.3115

An Improved Algorithm To Predict Recurrence Of Breast Cancer

Statistics 202: Data Mining. c Jonathan Taylor. Final review Based in part on slides from textbook, slides of Susan Holmes.

Data analysis in microarray experiment

Boosted PRIM with Application to Searching for Oncogenic Pathway of Lung Cancer

Clustered mutations of oncogenes and tumor suppressors.

Lung Cancer Diagnosis from CT Images Using Fuzzy Inference System

Propensity scores and causal inference using machine learning methods

Identifying Relevant micrornas in Bladder Cancer using Multi-Task Learning

Distinct cellular functional profiles in pan-cancer expression analysis of cancers with alterations in oncogenes c-myc and n-myc

SRARP and HSPB7 are epigenetically regulated gene pairs that function as tumor suppressors and predict clinical outcome in malignancies

Big Image-Omics Data Analytics for Clinical Outcome Prediction

Model reconnaissance: discretization, naive Bayes and maximum-entropy. Sanne de Roever/ spdrnl

Expert-guided Visual Exploration (EVE) for patient stratification. Hamid Bolouri, Lue-Ping Zhao, Eric C. Holland

The Cancer Genome Atlas Clinical Explorer: a web and mobile interface for identifying clinical genomic driver associations

ECG Beat Recognition using Principal Components Analysis and Artificial Neural Network

LinkedOmics. A Web-based platform for analyzing cancer-associated multi-dimensional data. Manual. First edition 4 April 2017 Updated on July 3, 2017

Diagnosis Of Ovarian Cancer Using Artificial Neural Network

Contents. Just Classifier? Rules. Rules: example. Classification Rule Generation for Bioinformatics. Rule Extraction from a trained network

White Paper Estimating Complex Phenotype Prevalence Using Predictive Models

DNA methylation in Uropathology

Hybridized KNN and SVM for gene expression data classification

Bayesian Prediction Tree Models

Application of Artificial Neural Networks in Classification of Autism Diagnosis Based on Gene Expression Signatures

International Journal of Advance Research in Computer Science and Management Studies

Supplementary Materials for

The Cancer Genome Atlas Pan-cancer analysis Katherine A. Hoadley

The Open Access Institutional Repository at Robert Gordon University

Using Ensemble-Based Methods for Directly Estimating Causal Effects: An Investigation of Tree-Based G-Computation

LncRNA TUSC7 affects malignant tumor prognosis by regulating protein ubiquitination: a genome-wide analysis from 10,237 pancancer

Breast cancer. Risk factors you cannot change include: Treatment Plan Selection. Inferring Transcriptional Module from Breast Cancer Profile Data

Stage-Specific Predictive Models for Cancer Survivability

Deep Learning Analytics for Predicting Prognosis of Acute Myeloid Leukemia with Cytogenetics, Age, and Mutations

Classification. Methods Course: Gene Expression Data Analysis -Day Five. Rainer Spang

Beyond the DVH: Voxelbased Automated Planning and Quality Assurance

Integrative DNA methylome analysis of pan-cancer biomarkers in cancer discordant monozygotic twin-pairs

BIOINFORMATICS ORIGINAL PAPER

Evidence Contrary to the Statistical View of Boosting: A Rejoinder to Responses

A Learning Method of Directly Optimizing Classifier Performance at Local Operating Range

A Comparison of Collaborative Filtering Methods for Medication Reconciliation

Comprehensive analyses of tumor immunity: implications for cancer immunotherapy

Lecture Outline. Biost 517 Applied Biostatistics I. Purpose of Descriptive Statistics. Purpose of Descriptive Statistics

Supervised Learner for the Prediction of Hi-C Interaction Counts and Determination of Influential Features. Tyler Yue Lab

CS2220 Introduction to Computational Biology

Dynamic reprogramming of DNA methylation in SETD2- deregulated renal cell carcinoma

EXTRACT THE BREAST CANCER IN MAMMOGRAM IMAGES

Nicholas Borcherding, Nicholas L. Bormann, Andrew Voigt, Weizhou Zhang 1-4

Lecture 1: Course Introduction and Cancer ʻby the Numbersʼ. biochemistry 4450a

Pailin Group Executive Search Assistant Investigator: Cancer Genetics, Epigenetics and Biomarkers Dallas, TX

Variable Features Selection for Classification of Medical Data using SVM

Integrated genomic analysis of human osteosarcomas

Data complexity measures for analyzing the effect of SMOTE over microarrays

Transcription:

APPLIED PROBLEMS Solving Problems of Clustering and Classification of Cancer Diseases Based on DNA Methylation Data 1,2 A. N. Polovinkin a, I. B. Krylov a, P. N. Druzhkov a, M. V. Ivanchenko a, I. B. Meyerov a, A. A. Zaikin a,b, and N. Yu. Zolotykh a a Nizhni Novgorod, Institute of Information Technologies, Mathematics and Mechanics, Nizhni Novgorod, Russia b University College London Department of Mathematics, London, Great Britain e-mail: itlab.bio@cs.vmk.unn.ru Abstract The article deals with the problem of diagnosis of oncological diseases based on the analysis of DNA methylation data using algorithms of cluster analysis and supervised learning. The groups of genes are identified, methylation patterns of which significantly change when cancer appears. High accuracy is achieved in classification of patients impacted by different cancer types and in identification if the cell taken from a certain tissue is aberrant or normal. With method of cluster analysis two cancer types are highlighted for which the hypothesis was confirmed stating that among the people affected by certain cancer types there are groups with principally different methylation pattern. Keywords: cancer, DNA methylation, classification, clustering. DOI: 10.1134/S1054661816010181 2 1 INTRODUCTION In spite of all advances of modern medicine introduction of new diagnosis and treatment methods, cancer disease and mortality rates constantly keep steadily growing all over the world. Unfortunately show signs of clinical symptoms indicate the extensive-stage disease that is why pre-existing cancer detection seems to be the most promising approach. According to the international practices the selection of risk groups and screening study are the most prospective early detection of malignant neoplasms. This article discusses a new approach to the problem of early cancer diagnosis based on searching and analysis of low-level factors which can witness the development and existence of oncological disease in gene domain. At the moment the genetic risk factors of cancer are practically elusive, however, their identification is supposed to be breakthrough advance which will result in much more effective screening methods and early diagnostics. 1 This paper uses the materials of the report submitted at the 9th Open German-Russian Workshop on Pattern Recognition and Image Understanding, held on Koblenz, December 1 5, 2014 (OGRW-9-2014). 2 The article is published in the original. Received September 9, 2015 Epigenetic changes are DNA modifications resulted not from variations of nucleotide sequence, i.e., the variations occur not in genes but in external factors directly related to gene activities. One type of epigenetic changes is DNA methylation when methyl group ( CH 3 ) joins certain molecule regions. Aberrant structure of DNA methylation is one of the essential cancer signs, which enables its early diagnosis [7], however the exact role of this data in cancer genesis and clinical prediction remains elusive. Cancer is characterized by both hypermethylation (increase of methylation) and hypomethylation (decrease) of DNA. However, cancer can be witnessed not only by variation of mean level of gene methylation. The hypothesis has been proposed according to which dysregulation of stem cell genes results from aberrant variability (dispersion) of intragenic DNA methylation. This correlates with the fact that not only methylation level but also variability in certain genomic locations may be highly relevant to cancer development [6]. In particular, it has been shown that the increased stochasticity and variability in regions where the methylation level changes with cancer, results in aberrant and modified gene expression, thus explaining tumor heterogeneity [3]. Also some authors have shown that the markers reflecting differential variability of DNA methylation features may provide for better diagnosis and risk assessment of pre-cancer genesis [8, 9]. ISSN 1054-6618, Pattern Recognition and Image Analysis, 2016, Vol. 26, No. 1, pp. 176 180. Pleiades Publishing, Ltd., 2016.

SOLVING PROBLEMS OF CLUSTERING AND CLASSIFICATION 177 Table 1. Results of binary classification for 13 cancer types Cancer type Accuracy Type I error Type II error BLCA 0.965 0.166 0.008 BRCA 0.978 0.133 0.009 COAD 0.989 0.105 0 NHSC 0.975 0.12 0.006 KIRC 1.000 0 0 KIRP 0.984 0 0.023 LIHC 0.966 0.02 0.051 LUAD 1.000 0 0 LUSC 1.000 0 0 PRAD 0.933 0.183 0.04 READ 0.98 0.25 0 THCA 0.968 0.26 0.003 UCEC 0.983 0.139 0 Though the importance of study of intragenic and intergenic DNA methylation structure is clearly understood at the moment only modifications between different genomes have been studied but the problem how the remodeling of intragenic and intergenic DNA methylation is related to the origin of carcinomas. This article deals with the problem how to identify the gene group, methylation patterns of which significantly change with emerging of cancer disease, and analyses the application efficiency of certain DNA methylation methods to solve problems of classification and identification of essential features. Using methods of cluster analysis the hypothesis is studied which states that among the people affected by the same cancer type there are different groups which might be treated with potentially different methods. Authors analyze accuracy of solving problems of binary and multiclass classification between different cancer types under application of ensembles of decision trees. METHODS AND DATA As initial data for supposed study we propose to use inspection results of examinees from the international data base The Cancer Genome Atlas [10], which contain information about methylation level received with TheIlluminaInfinium HumanMethylation450 Bead- Chip [5]. Data contain circa 485000 loci per genome the intragenic location for 330000 of which is known as well as the name of related gene. Thus 15 17 loci per gene are available. These data are available for 13 different cancer types (Bladder Urothelial Carcenoma, BLCA; Breast Invasive Carcenomia, BRCA; etc.). The number of objects related to each cancer type varies from tens to hundreds. The following methods and markers are proposed to study intragenic DNA methylation structure. The first marker group does not depend on probe sequence inside the gene or related gene region. This marker group includes the mean value for gene methylation and dispersion. The second group includes markers considering the intragenic probe location. These markers are the mean value of outlier derivative, degree of spatial outlier asymmetry, degree of deviation from line linking methylation levels at the gene ends. Except the computation of their value, for the markers of the first and second groups Z-score is applied that is computation of deviation from the mean value of proper degree for all samples from normal selection, measured in mean-square deviation. The value obtained in such manner will be an instability degree for methylation outlier for the corresponding gene. As classifier we suggest to apply the decision trees [4] and their ensembles (in particular Random Forest [1]). Among the advantages of Random Forest the following features shall be mentioned: high quality of obtained models, similar to SVM and boosting, and better compared with neural nets [2], ability to effectively process data with large number of features and classes, invariance for monotonic transformations of features values, possibility to process both continuous and discrete features, presence of methods to evaluate the importance of specific features in the model. Moreover Random Forest model enables estimation of generalization error on-the-fly during its training (out-of-bag error [1]). RESULTS OF NUMERICAL EXPERIMENTS Classification Results From the practical point of view it is desirable to consider two types of problems: to define if a person is affected or normal using tissue samples of certain organ; to distinguish between different cancer types and normal cells. The developed measures of intragenic methylation are used as a sample description. Classification accuracy is expressed in terms of misclassified samples fraction and estimated using the out-of-bag error of the Random Forest model. As Table 1 shows, the achieved accuracy for problems of

178 POLOVINKIN et al. Table 2. Misclassification table to classify individuals affected by different cancer types HEALTHY BLCA BRCA COAD HNSC KIRC KIRP LIHC LUAD LUSC PRAD READ THCA UCEC HEALTHY 0.96 0 0.01 0 0 0 0 0.02 0 0 0 0 0.01 0 BLCA 0.01 0.85 0.01 0 0.09 0 0 0 0 0 0.04 0 0 0 BRCA 0.01 0 0.99 0 0 0 0 0 0 0 0 0 0 0 COAD 0 0 0 1.00 0 0 0 0 0 0 0 0 0 0 HNSC 0.01 0 0 0 0.97 0 0 0 0 0 0.02 0 0 0 KIRC 0.03 0 0 0 0 0.95 0.01 0 0 0 0.01 0 0 0 KIRP 0.02 0.03 0 0 0 0.13 0.82 0 0 0 0 0 0 0 LIHC 0.01 0 0 0 0 0 0 0.99 0 0 0 0 0 0 LUAD 0.07 0.01 0 0 0 0 0 0 0.92 0 0 0 0 0 LUSC 0.01 0 0 0 0 0 0 0 0 0.97 0.02 0 0 0 PRAD 0 0 0 0 0.08 0 0 0 0 0.08 0.84 0 0 0 READ 0.23 0 0 0.03 0 0 0 0 0 0.04 0 0.70 0 0 THCA 0.07 0 0 0 0 0 0 0 0 0 0 0 0.93 0 UCEC 0 0 0 0 0 0 0 0 0 0 0 0 0 1.00 binary classification makes up from 93 to 100% and for problem of multiclass classification (Table 2) is 96.5% which enables practical application of these results. Clustering Results The hypothesis has been proposed that within the same type of oncological diseases there are a number KIRC: 2 clusters 2 THCA: 2 clusters 5 1 0 0 1 5 2 3 10 4 2 0 2 4 6 2 0 2 4 6 8 10 Fig. 1. Clustering results with k-means algorithm for KIRC and THCA cancer types.

SOLVING PROBLEMS OF CLUSTERING AND CLASSIFICATION 179 Number of genes 10 3 10 2 10 1 10 0 1 2 3 4 5 6 Number of cancer types Fig. 2. Number of genes important for classification of several cancer types. of different groups which theoretically speaking might be treated differently. To test this hypothesis the following has been proposed for each type of cancer to divide classifying description of cells, related to specified type, in groups applying the methods of cluster analysis. The degree of mean level of methylation for the gene has been used as a classifying description, the k-means algorithm has been used for clustering [4] (for practical reasons 2 3 clusters has been supposed to be available). The experiment has shown (see Fig. 1) that there is a distinct separation of individuals affected by KIRC and THCA cancer types in clusters. So the further analysis of these results is required from practical point of view. Selection of Important Features One of the problems of practical importance is the existence of specific genes responsible for the development of one or another oncological disease. In this article the importance of each gene was analyzed regarding its usefulness to solve problems of binary classification (if a cell of certain tissue is malignant or normal). A set of features has been selected for each type of cancer thereby each feature ensures the quality of binary classification (cross-validation error in decision trees of depth 1) exceeding a certain threshold (within this article this value equals 0.9). The lists of significant genes for all cancer types obtained in this manner have been combined. The Fig. 2 shows the overall statistics (the number of genes simultaneously important for classification of a certain number of cancer types). As the diagram shows there are relatively small sets of genes important for classification of several cancer types. In the future the obtained results shall be analyzed medically. CONCLUSION Within this article the gene group methylation patterns of which significantly change with emerging of cancer disease has been identified. Using methods of cluster analysis the hypothesis has been studied which states that among the people affected by the same cancer type there are different groups. Two cancer types have been outlined for which the hypothesis has been confirmed. The obtained solution accuracy for the problems of binary and multiclass classification enables the practical application of the results. ACKNOWLEDGE We acknowledge financial support from Russian Foundation for Basic Research (14-02-01202). The research is supported (partly supported) by the grant (the agreement of August 27, 2013 no. 02.B.49.21.0003 between The Ministry of education and science of the Russian Federation and Lobachevsky State University of Nizhni Novgorod). REFERENCES 1. L. Breiman, Random forests, Mach. Learn. 45 (1), 5 32 (2011). 2. R. Caruana and A. Niculescu-Mizil, An empirical comparison of supervised learning algorithms using different performance metrics, in Proc. 23rd Int. Conf. on Machine Learning ICML 06 (Pittsburgh, 2006), pp. 161 168. 3. K. D. Hansen et al., Increased methylation variation in epigenetic domains across cancer types, Nat. Genet 43 (8), 768 775 (2011). 4. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer, 2001). 5. Infinium HumanMethylation450 BeadChip Illumina. http://products.illumina.com/products/methylation_450_beadchip_kits.ilmn 6. A. E. Jaffe et al., Significance analysis and statistical dissection of variably methylated regions, Biostatistics 13 (1), 166 178 (2012). 7. P. A. Jones and S. B. Baylin, The epigenomics of cancer, Cell 128 (4), 683 692 (2007). 8. A. E. Teschendorff and M. Widschwendter, Differential variability improves the identification of cancer risk markers in DNA methylation studies profiling precursor cancer lesions, Bioinformatics 28 (11), 1487 1494 (2012). 9. A. E. Teschendorff et al., Epigenetic variability in cells of normal cytology is associated with the risk of future morphological transformation, Genome Med. 4 (3), 24 (2012). 10. The Cancer Genome Atlas Cancer Genome TCGA. http://cancergenome.nih.gov/ 11. M. Widschwendter et al., Epigenetic stem cell signature in cancer, Nat. Genet. 39 (2), 157 158 (2007).

180 POLOVINKIN et al. Aleksei Nikolaevich Polovinkin. Born 1984. Graduated from the Nizhni Novgorod in 2007. He is a assistant of the Lobachevsky State University. Research interests: computer vision, machine learning, high-performance computing. and articles): 13. Il ya Borisovich Krylov. Born 1991, graduated from the Nizhni Novgorod in 2014. He is a assistant of thelobachevsky State University. Research interests: machine learning, bioinformatics. and articles): 2. Pavel Nikolaevich Druzhkov. Born 1989. In 2012, graduated He is a junior research of the Research interests: data mining, computer vision. and articles): 6. Mikhail Vasil evich Ivanchenko. Born 1981. Graduated Lobachevsky State University of Nizhni Novgorod in 2004. (Candidate s, Doctoral): 2007, Candidate of Physico-Mathematical Sciences; 2011, Doctor of Physico- Mathematical Sciences. He is a leading researcher of the Research interests: nonlinear dynamics, complex systems, bioinformatics. and articles): over 50. Aleksei Anatol evich Zaikin. Born 1973, graduated from the Lomonosov Moscow State University in 1995. (Candidate s, Doctoral): 1998, Candidate of Physico-Mathematical Sciences. He is a professor of the University College London. Research interests: systems medicine, systems biology, synthetic biology, applied mathematics, nonlinear dynamics. and articles): 99. Iosif Borisovich Meyerov. Born 1976, graduated from the Nizhni Novgorod in 1999. (Candidate s, Doctoral): 2005, Candidate of Engineering Sciences. He is a associate professor of the Research interests: high-performance computing, sparse algebra, code optimization, system programming. and articles): 50. Nikolai Yur evich Zolotykh. Born 1973. Graduated from the Nizhni Novgorod in 1995. (Candidate s, Doctoral): 1998, Candidate of Physico-Mathematical Sciences; 2014, Doctor of Physico- Mathematical Sciences. He is a associate professor of the Research interests: machine learning, discrete optimization, discrete geometry. and articles): 30.