CHAPTER 3 PROBLEM STATEMENT AND RESEARCH METHODOLOGY

3.1 PROBLEM DEFINITION

Clinical data mining (CDM) is an emerging field of research that applies data mining techniques to extract patterns from biological and clinical data. Oncogenomics is one of the key research areas in CDM and applies high-throughput technologies to characterize genes associated with cancer. The major hurdles associated with this task are:

(i) the need to transform oncogenic data for processing by computational methods;
(ii) the ability to extract interpretable and valid patterns from the processed, voluminous data;
(iii) the improvement of cancer prediction from data of diverse nature using the extracted patterns.

Since analysis of oncogenic data is a labor- and resource-intensive task, computational methods were investigated for faster and more efficient analysis, opening up a potential area of cancer research named Computational Oncogenomics.

This research focused on utilizing data mining methods to analyze and process oncogenic data of two kinds. From the biological data, the goal was the detection of oncogene patterns, oncoprotein patterns, oncoprotein mutations and oncoprotein interactions. From the clinical data, comprising patient records, laboratory investigations and image-based features, the goal was the detection of cancer-cause/symptom patterns. Based on an exploration of the existing research issues (Kusiak et al., 2001; Kriegel et al., 2007; Hu, 2011; Huang et al., 2011) in data mining methodologies and their utilization for pattern discovery from oncogenic data, the following research objectives were formulated.

3.2 RESEARCH OBJECTIVES

The aim of this research was to investigate the utilization of data mining methodologies in detecting oncogene patterns (gene expression data), oncoprotein properties (lung cancer tumor data), oncomutation patterns (P53 mutation data) and oncoprotein interactions (HIV-1-human PPI data), and in identifying cancer-cause/cancer-symptom patterns (clinical data), by formulating novel feature selection and predictive techniques. In view of this, the following objectives were articulated:

Microarray-based gene expression data was characterized by a large number of oncogene attributes but comparatively very few instances. The contributory oncogenes had to be categorized according to the specific cancer sub-types, and the data contained more than two target classes. Hence, the gene expression data required a suitable feature selection algorithm to extract a minimal and optimal set of oncogenes that improved cancer prediction accuracy across the diverse cancer sub-types.

Existing approaches to the detection of oncoprotein properties for drug design relied on extensive data cleaning strategies and reported low prediction accuracy. This required a computationally efficient feature selection technique that could eliminate the need for the data cleaning procedures while generating high cancer prediction accuracy with an optimal set of protein properties for drug design. Lung cancer was stated to be a leading cause of death around the world, and hence lung cancer tumor data was targeted for drug therapy in this research.

Detection of oncoprotein properties/mutations from P53 transcriptional activity data was a serious hurdle due to the heavy imbalance of records and the massive data size. This required the formulation of an embedded supervised machine learning technique to detect a minimal and optimal set of oncogenic structural properties from P53 mutations for prediction of P53 transcriptional activity.

Detection of oncomutation patterns by predicting P53 transcriptional activity from amino-acid substitutions raised a new research issue: categorizing the oncomutations as hot-spot cancer, strong rescue and weak rescue mutants. This led to the formulation of a genetic mutant marker extraction methodology that could categorize the P53 mutants from amino-acid substitutions. The methodology needed to be computationally efficient and accurate in processing massive data.

Discovery of novel oncoprotein interaction patterns was a challenging task due to the absence of established non-interacting protein pairs, and the methodologies devised thus far failed to identify many novel interactions. HIV-1 infection is associated with an elevated risk of cancer, and hence the objective was to predict novel HIV-1-human protein-protein interactions (PPIs) through an association rule mining methodology that could capture the maximum number of novel HIV-1-human PPIs with the least loss of information.

Research on oncogenic clinical data for detection of cancer-cause/symptom patterns posed several issues in terms of the diverse nature of the data (continuous/discrete), multi-class categorization and the biased nature of the class distribution. This led to an investigation into the utilization of the proposed prediction method to predict cancer-cause/symptom patterns from oncogenic clinical data, in order to identify the most efficient and diagnostically accurate method. The need for a clinical data classifier to predict the nature of oncogenic data was also identified.

The formulation of the research objectives led to the design of a suitable research methodology to investigate the research issues and achieve the stated objectives.

3.3 RESEARCH METHODOLOGY

The basic research methodology is stated to involve identifying the problem, followed by the formulation of appropriate techniques to handle the defined problem. Analysis of the collected data led this research to focus on two categories of oncogenic data:

(i) biological data for detection of oncogene and oncoprotein patterns, oncomutation patterns and oncoprotein interactions;
(ii) clinical data for detection of cancer-cause/cancer-symptom patterns.

Following this, data mining techniques were explored to analyze and process the stated oncogenic data. The research methodology involved the following phases:

(i) data collection and pre-processing;
(ii) data analysis and processing;
(iii) performance evaluation of the proposed methodologies.

Data collection and pre-processing is a pre-requisite for analyzing oncogenic data. Data collection required identification of authenticated data from publicly available repositories, namely the UCI repository, the NCBI database, the KEGG database and AI labs. This was followed by analysis of the collected data, which involved an investigation of the existing feature selection and classification algorithms to evaluate their performance in retrieving the relevant oncogenic features for cancer prediction from diverse types of oncogenic data. The subsequent process would involve the development of improved and computationally efficient feature selection and classification techniques to yield enhanced cancer prediction accuracy with a minimal and optimal set of oncogenic features. In addition, association rule mining techniques were to be investigated to mine valid and potentially useful association rules. The extracted rules may be utilized to identify novel and previously unknown oncogenic patterns with adequate justification. The developed data mining framework could be utilized to design a clinical data classifier to predict the nature of unknown oncogenic data.

3.4 SUMMARY

This chapter outlined the definition of the problem, on the basis of which the research objectives were articulated to handle the identified challenges. It also concisely presented the diverse oncogenic biological and clinical data used for detection of novel oncogene, oncoprotein, oncomutation, oncoprotein interaction and cancer-cause/symptom patterns. A research methodology to address the identified objectives was also given. The next chapter details the formulated pattern discovery framework for detecting the most significant oncopatterns in biological and clinical data and their contribution to oncogenic pattern discovery.
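As an illustration of the phases listed in Section 3.3, the following minimal Python sketch runs feature selection followed by classification and performance evaluation on a data set with many attributes, few instances and more than two classes. It is not the technique developed in this research. The variance-ratio filter, the nearest-centroid classifier, the synthetic data and all function names (f_score, select_features, nearest_centroid_predict, loocv_accuracy, make_toy_data) are assumptions chosen only to keep the example self-contained.

```python
import random
from statistics import mean


def f_score(column, labels):
    """Ratio of between-class to within-class variation for one feature."""
    overall = mean(column)
    between = within = 0.0
    for cls in set(labels):
        values = [v for v, lab in zip(column, labels) if lab == cls]
        centre = mean(values)
        between += len(values) * (centre - overall) ** 2
        within += sum((v - centre) ** 2 for v in values)
    return between / (within + 1e-12)  # small constant avoids division by zero


def select_features(X, y, k):
    """Return the indices of the k highest-scoring features (a simple filter)."""
    n_features = len(X[0])
    scores = [f_score([row[j] for row in X], y) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]


def nearest_centroid_predict(X_train, y_train, x, features):
    """Assign x to the class whose centroid over the selected features is closest."""
    best_cls, best_dist = None, float("inf")
    for cls in sorted(set(y_train)):
        rows = [r for r, lab in zip(X_train, y_train) if lab == cls]
        centroid = [mean(r[j] for r in rows) for j in features]
        dist = sum((x[j] - c) ** 2 for j, c in zip(features, centroid))
        if dist < best_dist:
            best_cls, best_dist = cls, dist
    return best_cls


def loocv_accuracy(X, y, k):
    """Leave-one-out accuracy; features are re-selected inside every fold
    so that the held-out instance never influences the selection."""
    hits = 0
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        features = select_features(X_train, y_train, k)
        hits += nearest_centroid_predict(X_train, y_train, X[i], features) == y[i]
    return hits / len(X)


def make_toy_data(n_per_class=8, n_features=50, seed=0):
    """Synthetic 3-class data (hypothetical): only features 0, 1 and 2 carry signal."""
    rng = random.Random(seed)
    X, y = [], []
    for cls in range(3):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            row[cls] += 6.0  # feature `cls` is shifted upward for class `cls`
            X.append(row)
            y.append(cls)
    return X, y


X, y = make_toy_data()
selected = select_features(X, y, k=3)
accuracy = loocv_accuracy(X, y, k=3)
print("selected features:", sorted(selected))
print("leave-one-out accuracy:", accuracy)
```

Re-selecting the features inside each leave-one-out fold is a deliberate choice. With far more attributes than instances, selecting features once on the full data set and then cross-validating tends to overstate prediction accuracy.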