CHAPTER 3 PROBLEM STATEMENT AND RESEARCH METHODOLOGY

3.1 PROBLEM DEFINITION

Clinical data mining (CDM) is an emerging field of research that applies data mining techniques to extract patterns from biological and clinical data. Oncogenomics is one of the key research areas in CDM; it applies high-throughput technologies to characterize genes associated with cancer. The major hurdles associated with this task are: (i) the need to transform oncogenic data for processing by computational methods; (ii) the extraction of interpretable and valid patterns from the processed, voluminous data; and (iii) the improvement of cancer prediction from data of diverse nature using the extracted patterns.

Since analysis of oncogenic data is labor and resource intensive, computational methods were investigated for faster and more efficient analysis, opening up an area of cancer research termed Computational Oncogenomics. This research focused on utilizing data mining methods to analyze and process oncogenic data for the detection of oncogene patterns, oncoprotein patterns, oncoprotein mutations and oncoprotein interactions from biological data, and for the detection of cancer-cause/symptom patterns from clinical data comprising patient records, laboratory investigations and image-based features.

Based on an exploration of the existing research issues (Kusiak et al, 2001; Kriegel et al, 2007; Hu, 2011; Huang et al, 2011) in data mining methodologies and their utilization in pattern discovery from oncogenic data, the following research objectives were formulated.

3.2 RESEARCH OBJECTIVES

The aim of this research was to investigate the utilization of data mining methodologies in detecting oncogene patterns (gene expression data), oncoprotein properties (lung cancer tumor data), oncomutation patterns (P53 mutation data) and oncoprotein interactions (HIV1-human PPI data), and in identifying cancer-cause/cancer-symptom patterns (clinical data), by formulating novel feature selection and predictive techniques. In view of this, the following objectives were articulated:

(i) Microarray-based gene expression data is characterized by a large number of oncogene attributes but comparatively very few instances. This data required categorization of the contributory oncogenes according to the specific cancer sub-types, and it contained more than two target classes. Hence, the gene expression data required a suitable feature selection algorithm to extract a minimal and optimal set of oncogenes that improved cancer prediction accuracy across the diverse gene expression cancer sub-types.

(ii) Existing approaches to the detection of oncoprotein properties for drug design relied on extensive data cleaning strategies and reported low prediction accuracy. This required a computationally efficient feature selection technique that could eliminate the need for the data cleaning procedures while achieving high cancer prediction accuracy with an optimal set of protein properties for drug design. Lung cancer has been reported as the leading cause of cancer death around the world, and lung cancer tumor data was therefore targeted for drug therapy in this research.

(iii) Detection of oncoprotein properties/mutations from P53 transcriptional activity data was a serious hurdle due to the heavy imbalance of records and the massive data size. This required the formulation of an embedded supervised machine learning technique to detect a minimal and optimal set of oncogenic structural properties from P53 mutations for the prediction of P53 transcriptional activity.

(iv) Detection of oncomutation patterns by predicting P53 transcriptional activity from amino-acid substitutions raised a new research issue: categorizing the oncomutations as hot-spot cancer, strong rescue and weak rescue mutants. This led to the formulation of a genetic mutant marker extraction methodology that could categorize the P53 mutants from amino-acid substitutions. The methodology needed to be computationally efficient and accurate in processing massive data.

(v) Discovery of novel oncoprotein interaction patterns was a challenging task due to the absence of established non-interacting protein pairs, and the methodologies devised thus far failed to identify many novel interactions. Because HIV1 infection is associated with several cancers, the objective was to predict novel HIV1-human protein-protein interactions (PPIs) through an association rule mining methodology that could capture the maximum number of novel HIV1-human PPIs with the least loss of information.

(vi) Research on oncogenic clinical data for the detection of cancer-cause/symptom patterns posed several issues: the diverse nature of the data (continuous/discrete), multi-class categorization and a biased class distribution. This led to an investigation of the proposed prediction method on oncogenic clinical data, with the goal of identifying the most efficient and diagnostically accurate method for predicting cancer-cause/symptom patterns. A clinical data classifier was also identified as necessary for predicting the nature of oncogenic data.

The formulation of these research objectives led to the design of a suitable research methodology to investigate the research issues and achieve the stated objectives.

3.3 RESEARCH METHODOLOGY

A basic research methodology involves identifying the problem and then formulating appropriate techniques to handle it. Analysis of the collected data led this research to focus on two categories of oncogenic data: (i) biological data, for the detection of oncogene and oncoprotein patterns, oncomutation patterns and oncoprotein interactions; and (ii) clinical data, for the detection of cancer-cause/cancer-symptom patterns. Following this, data mining techniques were explored to analyze and process the stated oncogenic data.

The research methodology involved the following phases: (i) data collection and pre-processing; (ii) data analysis and processing; and (iii) performance evaluation of the proposed methodologies.

Data collection and pre-processing is a prerequisite for analyzing oncogenic data. Data collection required the identification of authenticated data from publicly available repositories, namely the UCI repository, the NCBI database, the KEGG database and AI labs. This was followed by analysis of the collected data, which involved an investigation of the existing feature selection and classification algorithms to evaluate their performance in retrieving the relevant oncogenic features for cancer prediction from diverse types of oncogenic data. The subsequent process involved the development of improved and computationally efficient feature selection and classification techniques to yield enhanced cancer prediction accuracy with a minimal and optimal set of oncogenic features.

In addition, association rule mining techniques were investigated to mine valid and potentially useful association rules. The extracted rules may be utilized to identify novel and previously unknown oncogenic patterns with adequate justification. The developed data mining framework could be utilized to design a clinical data classifier to predict the nature of unknown oncogenic data.

3.4 SUMMARY

This chapter defined the problem, articulated the research objectives that address the identified challenges, and concisely presented the diverse oncogenic biological and clinical data used for the detection of novel oncogene, oncoprotein, oncomutation, oncoprotein interaction and cancer-cause/symptom patterns. It also presented a research methodology to address the identified objectives. The next chapter details the formulated pattern discovery framework for detecting the most significant oncopatterns in biological and clinical data and their contribution to oncogenic pattern discovery.
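
As an illustration of the three methodology phases in Section 3.3 (pre-processed data, feature selection with classification, and performance evaluation), the following is a minimal sketch and not the technique proposed in this research. It assumes a simple between-class/within-class variance ratio as a stand-in filter for feature selection, a nearest-centroid rule as a stand-in classifier, and leave-one-out accuracy as the evaluation measure. All function names and the toy data are hypothetical.

```python
from statistics import mean

def f_score(values, labels):
    """Between-class to within-class variance ratio for one feature (assumed stand-in score)."""
    overall = mean(values)
    between = within = 0.0
    for cls in set(labels):
        group = [v for v, lab in zip(values, labels) if lab == cls]
        centre = mean(group)
        between += len(group) * (centre - overall) ** 2
        within += sum((v - centre) ** 2 for v in group)
    return between / (within + 1e-12)  # small constant avoids division by zero

def select_features(X, y, k):
    """Return the indices of the k highest-scoring features (a simple filter method)."""
    n_features = len(X[0])
    scores = [f_score([row[j] for row in X], y) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]

def nearest_centroid_predict(X_train, y_train, x, features):
    """Predict the class whose centroid, over the selected features, is closest to x."""
    best_cls, best_dist = None, float("inf")
    for cls in sorted(set(y_train)):
        rows = [r for r, lab in zip(X_train, y_train) if lab == cls]
        dist = sum((x[j] - mean(r[j] for r in rows)) ** 2 for j in features)
        if dist < best_dist:
            best_cls, best_dist = cls, dist
    return best_cls

def loo_accuracy(X, y, k):
    """Leave-one-out accuracy; features are re-selected inside every fold."""
    hits = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        features = select_features(X_train, y_train, k)
        hits += nearest_centroid_predict(X_train, y_train, X[i], features) == y[i]
    return hits / len(X)

# Hypothetical toy data: 6 samples, 4 features, 3 classes; only feature 0 is informative.
X = [
    [0.1, 5.0, 3.0, 7.0],
    [0.2, 1.0, 9.0, 2.0],
    [5.1, 4.0, 2.0, 8.0],
    [5.0, 2.0, 8.0, 1.0],
    [9.9, 5.5, 3.5, 6.0],
    [10.0, 1.5, 8.5, 3.0],
]
y = ["A", "A", "B", "B", "C", "C"]

selected = select_features(X, y, k=1)
accuracy = loo_accuracy(X, y, k=1)
print(selected, accuracy)  # prints: [0] 1.0
```

Re-selecting the features inside each leave-one-out fold keeps the held-out sample from influencing which features are chosen, which matters for data with many attributes and very few instances.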