Data Fusion for Predicting Breast Cancer Survival

Similar documents

EXPLORING THE PROCESS OF ASSESSMENT AND OTHER RELATED CONCEPTS

PET FORM Planning and Evaluation Tracking ( Assessment Period)

FOUNDATIONS OF DECISION-MAKING...

The Interface Between Theory of Mind and Language Impairment

State Health Improvement Plan Choosing Priorities, Creating a Plan. DHHS DPH - SHIP Priorities (Sept2016) 1

A pre-conference should include the following: an introduction, a discussion based on the review of lesson materials, and a summary of next steps.

Chapter 3 Perceiving Ourselves and Others in Organizations

BRCA1 and BRCA2 Mutations

23/11/2015. Introduction & Aims. Methods. Methods. Survey response. Patient Survey (baseline)

The principles of evidence-based medicine

ICT4LIFE. Final Conference. ICT4Life field work - tailored solutions in diverse regional context Ariane Girault, E-Seniors Association

The estimator, X, is unbiased and, if one assumes that the variance of X7 is constant from week to week, then the variance of X7 is given by

Chapter 6: Impact Indicators

Swindon Joint Strategic Needs Assessment Bulletin

Q 5: Is relaxation training better (more effective than/as safe as) than treatment as usual in adults with depressive episode/disorder?

BIOLOGY 101. CHAPTER 15: The Chromosomal Basis of Inheritance: Locating Genes Along Chromosomes

Public consultation on the NHMRC s draft revised Australian alcohol guidelines for low-risk drinking

EDPS 475: Instructional Objectives for Midterm Exam Behaviorism

Related Policies None

Computer vision quality assessment of barley kernels

Imaging tests allow the cancer care team to check for cancer and other problems inside the body.

Reliability and Validity Plan 2017

Individual Assessments for Couples Treatment with HFCA

BIOLOGY 101. CHAPTER 13: Meiosis and Sexual Life Cycles: Variations on a Theme

Religious Beliefs, Knowledge about Science and Attitudes Towards Medical Genetics. Nick Allum, Elissa Sibley, Patrick Sturgis & Paul Stoneman

Code of employment practice on infant feeding

Record of Revisions to Patient Tracking Spreadsheet Template

Awareness of Autistic Spectrum Conditions

UNIT 2: mapping bananas

Frequently Asked Questions: IS RT-Q-PCR Testing

Campus Climate Survey

ANXIETY SYMPTOMS INTERVENTION SESSION HANDOUTS. Introduction to Fighting Fear by Facing Fear. Making a Fears and Worries List

Volume Measurement at CT

DATA RELEASE: UPDATED PRELIMINARY ANALYSIS ON 2016 HEALTH & LIFESTYLE SURVEY ELECTRONIC CIGARETTE QUESTIONS

EXECUTIVE SUMMARY INNOVATION IS THE KEY TO CHANGING THE PARADIGM FOR THE TREATMENT OF PAIN AND ADDICTION TO CREATE AN AMERICA FREE OF OPIOID ADDICTION

How to become an AME Online

Breast Cancer Awareness Month 2018 Key Messages (as of June 6, 2018)

DIRECTED FORGETIING: SHORT-TERM MEMORY OR CONDITIONED RESPONSE? WENDY S. MILLER and HARVARD L. ARMUS The University of Toledo

Completing the NPA online Patient Safety Incident Report form: 2016

Graduating Senior Forum

WHAT IS HEAD AND NECK CANCER FACT SHEET

Assessment Field Activity Collaborative Assessment, Planning, and Support: Safety and Risk in Teams

Using Causal Inference To Make Sense of Messy Data

The Mental Capacity Act 2005; a short guide for the carers and relatives of those who may need support. Ian Burgess MCA Lead 13 February 2017

Success Criteria: Extend your thinking:

Introduction to Psychological Disorders (Myers for AP 2 nd Edition, Module 65)

Podcast Transcript Title: Common Miscoding of LARC Services Impacting Revenue Speaker Name: Ann Finn Duration: 00:16:10

Annual Assembly Abstract Review Process

GSB of EDA Meeting Minutes

Structured Assessment using Multiple Patient. Scenarios (StAMPS) Exam Information

TASKFORCE REPORT AIMS TO BOOST CANCER SURVIVAL AND TRANSFORM PATIENT EXPERIENCE

STAKEHOLDER IN-DEPTH INTERVIEW GUIDE

detailed in Ward and Lockhead (1970), is only summarized here.

2018 Medical Association Poster Symposium Guidelines

Risk factors in health and disease

The Great Divide: Is it Operant or Classical? Lindsay Wood

True Patient & Partner Engagement How is it done? How can I do it?

Improving Surveillance and Monitoring of Self-harm in Irish Prisons

Name: Anchana Ganesh Age: 21 years Home Town: Chennai, Tamil Nadu Degree: B.Com. Profilometer Score. Profilometer Graph

P02-03 CALA Program Description Proficiency Testing Policy for Accreditation Revision 1.9 July 26, 2017

GUIDANCE DOCUMENT FOR ENROLLING SUBJECTS WHO DO NOT SPEAK ENGLISH

CONSENT FOR KYBELLA INJECTABLE FAT REDUCTION

SUICIDE AND MENTAL ILLNESS IN SINGAPORE

If, then. Homework: Finish entire guided notes packet. Name: Pod: Date: Which variable does a scientist manipulate or control in an experiment?

Herbal Medicines: Traditional Herbal Registration

Medicare Advantage 2019 Advance Notice Part 1 21 st Century Cures Act Methodological Changes

NFS284 Lecture 3. How much of a nutrient is required to maintain health? Types and amounts of foods to maintain health

Osteoporosis Fast Facts

Pain relief after surgery

Continuous Quality Improvement: Treatment Record Reviews. Third Thursday Provider Call (August 20, 2015) Wendy Bowlin, QM Administrator

PILI Ohana Facilitator s Guide

2019 Canada Winter Games Team NT Female Hockey Selection Camp August 16-19, 2018

CDC Influenza Division Key Points MMWR Updates February 20, 2014

National Programme on Technology Enhanced Learning (Phase II)

Career Confidence. by Kevin Gaw

Before Your Visit: Mohs Skin Cancer Surgery

Nutrition Care Process Model Tutorials. Nutrition Monitoring & Evaluation: Overview & Definition. By the end of this module, the participant will:

Session78-P.doc College Adjustment And Sense Of Belonging Of First-Year Students: A Comparison Of Learning Community And Traditional Students

Strategies for Avoiding Plagiarism Part Two: Paraphrasing

Module 6: Goal Setting

Study Design Open, three arm-stratified, non-randomized, prospective, multicentric study

Novel methods and approaches for sensing, evaluating, modulating and regulating mood and emotional states.

Impacts of State Level Dental Hygienist Scope of Practice on Oral Health Outcomes in the U.S. Population

Field Epidemiology Training Program

Key Points Enterovirus D68 in the United States, 2014 Note: Newly added information is in red.

Social Learning Theories

Head and neck cancers are often treated with radiotherapy. Radiotherapy can lead to faster rates of tooth decay and poor healing in the mouth.

A Phase I Study of CEP-701 in Patients with Refractory Neuroblastoma NANT (01-03) A New Approaches to Neuroblastoma Therapy (NANT) treatment protocol.

A foot x-ray series is required only if there is pain in the midfoot zone and any one of the following:

Dementia Cal MediConnect Project DEMENTIA CARE MANAGER TRAINING FACILITATOR GUIDE

The effects of a two-school. school-year. back education program. in elementary schoolchildren

PROCEDURAL SAFEGUARDS NOTICE PARENTAL RIGHTS FOR PRIVATE SCHOOL SPECIAL EDUCATION STUDENTS

Session 5: Is FOOD fair?

Referral Criteria: Inflammation of the Spine Feb

FDA Dietary Supplement cgmp

Concept paper on the need for revision of the guideline on clinical investigation of medicinal products in the treatment of depression

ARLA FOOD FOR HEALTH 4 th ANNUAL CALL FOR EXPRESSIONS OF INTEREST

Interpretation. Historical enquiry religious diversity

For homework, students continue their AIR through the lens of their focus standard.

Transcription:

Data Fusin fr Predicting Breast Cancer Linbailu Jiang, Yufei Zhang, Siyi Peng Mentr: Irene Kaplw December 11, 2015 1 Intrductin 1.1 Backgrund Cancer is mre f a severe health issue than ever in ur current sciety. As severe as it is lethal in general, there are many factrs that may affect a patient s survival. It is nt easy t find a clear pattern t predict the survival utcme f a cancer patient, which culd be a cmplex prcess invlving different bilgic cnditins. Based n previus studies, many features are cnsidered t ptentially affect a survival event cancer type, age at diagnsis, treatment pathway, time since diagnsis, and sme specific genme patterns that may relate t the prgressin f the cancer disease. While sme f these features, such as cancer type and age at diagnsis, are relatively explicit and have mre direct relatins with the survival utcme, sme ther features, such as gene mutatins and methylatin levels, seem mre cmplicated and require mre effrt t analyze the ptential interactins amng these features. In this paper, we mainly fcus n the pathlgical cnditins f the patients withut examining different treatments f cancer. 1.2 Gal and Outline In this paper, we want t understand a patient s survival rate given his/her genme pattern and time interval since diagnsis. Our ultimate gal is t predict the survival rate f a patient with breast cancer changing ver time based n sme related genmic data. Our first step is t understand hw genmic data is influenced by different factrs, such as gene expressin fr different RNA, cpy number value, and the methylatin level fr different DNA. By ding feature crrelatin analysis and feature selectin in genmic data, we identify cancer-related genes and their targets. The mst challenging task here is t distinguish the driver mutatins, as a subset that truly cntributes t the tumr s prgressin, frm a large number f neutral passenger mutatins that characterize the cancer. Based n sme previus studies, we guess that a supprt vectr machine methd might be helpful during this prcess. At the secnd stage, we have relatively fewer features fr genmic data that may affect the cancer survivals, s it culd be easier fr us t cnduct a merge based n the patient ID in the survival dataset and the sample ID in the genmic dataset. This wuld help us t relate all the genmic infrmatin t the survival results s we culd cmbine them with ther ptential features and start t train ur mdel. After that, we wuld estimate and cmpare the perfrmance f the mdels by using sme crss validatin methds. 1

1.3 Data The data f ur prject is frm NIH(Natinal Institutes f Health) Prject, and we btained them frm Prfessr Olivier Gevaert. The data are all pre-pssessed, lg-transfrmed, and well-separated based n cancer type int 11 dataset. The whle dataset is cmprised f tw parts. First part is a dataset f patient ID, cancer type, time since diagnsis ( TimeT- LastCntactOrVisit in days), and his/her survival status (nrmally 0=alive, 1=dead). Secnd part are sme large lists abut genmic data. Each cancer type has ne large list. In each large list, there are 3 datasets which separately cntain gene expressin data, cpy number data, and methylatin data. All the data are pre-prcessed s there can be sme negative values in these charts. The first genmic data we used in this prject is the gene expressin data. Basically, fr each sample, a high (large psitive) gene expressin value fr a specific gene cde means the infrmatin encded in this gene has been highly interpreted, while a lw value indicates that mst f its infrmatin has been hidden. The secnd genmic data is called cpy number variatins data. This data represents the structural variatin f a specific gene. A larger cpy number implies that the gene might be duplicated s that it becmes mre than the nrmal number while a smaller cpy number dentes a deletin in a specific regin s that the number is less than the nrmal number. During these years, methylatin als becmes an imprtant cncept in the cancer research, s ur third dataset recrds the pattern f methylatin fr each sample. An aberrant DNA methylatin pattern, such as hyper-methylatin r hypmethylatin pattern, can usually be assciated with many types f human malignancies. In this prject, due t time limit, we wuld nly fcus n breast cancer(brca). The sample sizes f the BRCA dataset is relatively large cmpared with thers, s we believe it s mre likely t get valuable results when training n this dataset. 2 Methds 2.1 Feature Selectin Fr mst f ur datasets, the numbers f features are much larger than the numbers f sample (p» n), which can cause sme difficulties when training the data. Fr instance, the sample size f the gene expressin dataset fr breast cancer is nly 985, while the crrespnding feature size is 16020, which is much larger than the sample size here. A ptential prblem that can be caused by this is verfitting. T avid this prblem, the first thing needed t be handled is t reduce the number f variables in these datasets. The first step we did here is t find variance f each feature and get rid f the features with lw variance. Fr example, the variance range f gene expressin features fr breast cancer is frm 0.03 t 25.50 and we find mst f these features have relatively lw variances. We assume that the lw-variance predictrs have less predictive pwer (which is nt always true, we just use this simple methd t btain a first impressin f the data), s we tried remving sme predictrs with small variance and use 10-fld crss validatin t see hw this prcess may affect the mdel perfrmance. Table 1 shws the mdel perfrmance based n different variance threshlds. # f Selected Features Based n the Threshld 5909 1673 598 78 13 Variance Threshld 1 3 5 10 15 SVM 0.110037 0.109963 0.109976 0.109975 0.110061 Naive Bayes 0.352112 0.347705 0.332234 0.269524 0.121026 Table 1: Errr estimatins f SVM & Naive Bayes using 10-fld crss validatin As we see, fr an SVM mdel, there s n bvius imprvement by reducing feature size; 2

hwever, fr a Naive Bayes mdel, the smaller the number f variables is, the better accuracy is achieved. One pssible explanatin fr this is that the actual mdel (with raw data) vilates the cnditinal independence assumptin f the Naive Bayes mdel, which means sme genes may wrk tgether, and their expressin values can prbably be highly crrelated. The accuracies fr bth SVM and Naive Bayes mdel are clse t 90 percent, hwever we dn t think these mdels are actually gd at this pint since we fund that all f them tend t predict 0(alive) rather than 1(dead), and the crrespnding ROC curves imply that the mdels are uninfrmative. T slve this prblem, we decided t imprve ur feature selectin prcess and use a mre reliable methd t pick imprtant features. Accrding t previus studies, training lgistic regressin mdels n each feature separately and ranking them by lwest CV errr can be a gd methd t find valuable features. We picked the tp 200 genes with the best perfrmance n individual training, and then ranked them again by using cx mdel t reduce the final feature size t be 20 (details in next sectin). 2.2 Survfit and Cx Mdel After significantly reducing the number f feature variables, we then cmbined all the imprtant features t the survival dataset. One imprtant pint t ntice is the feature variable TimeTLastCntactOrVisit. It indicates the number f days frm a patient was first diagnsed breast cancer t his/her last visit date. If the patient is dead, this variable represents the duratin t death; if nt, this variable becmes a censred data, as we dn t knw actually what the patient s current status is. We nly knw this patient was alive x days after the beginning f study (x = TimeTLastCntactOrVisit ). In previus part, t simplify the mdel, we treated the TimeTLastCntactOrVisit as a cntinuus feature variable and trained SVM mdels n it. Hwever, the perfrmance f these mdels seems unfavrable, which may be explained by us nt handling this term in a right way. An alternate methd t deal with it is t treat this term as a part f respnse instead f a predictr. We will discuss why it culd be imprtant t crrectly handle this term later when we analyze ur sample results. Based n previus study, a cmmn way t cmbine the timeline and status f an event is t create a new type bject, usually called Surv r survival bject, which is a small data matrix that cntains cmprehensive infrmatin f an event. In this prject, we used cx mdel t fit the survival data. The cx mdel has the frm: λ(t X) = λ 0 (t) exp(β 1 X 1 +... + β p X p ) = λ 0 (t) exp(xβ ) We used functins Survfit and cxph in the survival package in R t train the mdels with cx mdel and then fit the survival bject.the next prblem we met in this part was the large time cst f fitting cx mdels n hundreds f features. T imprve the efficiency f ur mdels, we ran cx mdels n each feature individually and pick the tp 20 cx-ranked genes t be ur final features in ur mdel. After training a cx mdel n the 20 features, and predicted a survival curve fr each test sample and calculate the crss validatin errr fr ur mdel. 3 Results and Discussin 3.1 Interpreting Curves Different frm ther types f respnse, survival bjects cannt be directly cmpared with the riginal test data. Therefre, we need t find a way t transfrm ur predictins back int tw-level (0/1) factrs t cal- 3

culate mdel errrs. Figure 1,2 and 3 are several examples f survival curves that can help t illustrate this transfrm prcess. Each plt represents ne patient s survival curve. the first patient s survival curve is much flatter and higher than the secnd ne. Hwever, since we are lking at different timeline, the tw patients are bth predicted dead at their timeline, respectively. (a) Ex1: Event 0 n the(b) Ex2: Event 0 n the 1321th Day 189th Day (a) Ex1: Event 0 n the(b) Ex5: Event 1 n the 1321th Day 1365th Day Figure 1: Curves fr test samples with Event 0(alive) Figure 1 cmpares tw examples with status 0 at last cntact. Althugh the tw survival curves lk very different, they can have the same event status based n different timelines. The survival curve f the 1st sample seems much better than the curve f the 2nd sample, s we may predict that the survival rate f the 1st persn is higher than the 2nd persn. Hwever, after cnsidering the time infrmatin f the last cntact, it s easy t find that the survival rate f the 1st persn n the 1321th Day is arund 0.75 while the survival rate f the 2nd persn n the 189th Day is arund 0.98, which is higher than the survival rate f the 1st persn in this case. This example clearly illustrates that Time- TLastCntactOrVisit is a crucial term that can affect ur predictins t a great extent. Figure 3: Curves fr test samples with Events 0 and 1 Figure 3 gives us an example f tw patients with similar timelines and different survival curves. Since at the 1321th day, the first patient s survival rate is arund 0.7, while the secnd patient s is arund 0.4, we predict the first patient alive and the secnd patient dead. Frm the example f three cmparisns f graphs, we can see that bth survival curve and and timeline determine hw we predict the living cnditin f a patient. 3.2 Cmparing Mdels n Different Datasets (a) Ex3: Event 1 n the(b) Ex4: Event 1 n the 1920th Day 825th Day Figure 2: Curves fr test samples with Event 1(dead) Figure 2 cmpares tw exapmles with status 1 at last cntact. Again, it s clear that tw survival curves are very different, where Breast Cancer Mdel Trained n: Gene Expressin Cpy Number Methylatin Tp 20 Cx-ranked Genes 83.6% 52.3% 65.3% Tp 40 Cx-ranked Genes 85.2% 54.1% 69.6% Tp 60 Cx-ranked Genes 86.1% 54.8% 70.0% Table 2: Accuracy f BRCA Mdels n Different Datasets & Feature Size Frm the mdel cmparisn result, we can see that bth gene expressin data and methylatin data have a decent predictin accuracy n the patient s survival rate. Tp 20 Cx-rankedg genes are gd enugh t make rather accurate predictins. 4

4 Cnclusin Frm varius methd, we fund that treating bth TimeTLastCntactOrVisit and event (the survival status) as a survival bject and fitting a cx mdel t it is a gd apprach t train and predict the survival status f cancer patients. It has much higher accuracy n predicting the patients survival status than simply treating the whle prblem as a classificatin mdel and implementing supprt vectr machine r Naive Bayes mdel. Bth gene expressin and methylatin dataset wrk well as feature variables, and using the tp 20 Cx-ranked features is enugh accurate t make gd predictins. 5 Future Wrk Currently, we set the threshld as 0.5 in the cx mdel t predict the cancer patient s survival status. In the future, we culd raise the threshld s that we will increase the specificity while nt decreasing the sensitivity t much. Mrever, we hpe t find a way t cmbine gene expressin and methylatin data and use cmbined dataset t have a better predictin n cancer patient s survival status. Other than breast cancer, we culd braden ur research n ther cancer types as well. [5] Geaghan M, Cairns MJ (2015). MicrRNA and Psttranscriptinal Dysregulatin in Psychiatry. Bil. Psychiatry 78 (4): 231-9. [6] Zaidi SK, Yung DW, Chi JY, Pratap J, Javed A, Mntecin M, Stein JL, Lian JB, van Wijnen AJ, Stein GS (Octber 2004). Intranuclear trafficking: rganizatin and assembly f regulatry machinery fr cmbinatrial bilgical cntrl. J. Bil. Chem. 279 (42): 43363-6. [7] Hegde RS, Kang SW (July 2008). The cncept f translcatinal regulatin. J. Cell Bil. 182 (2): 225-32. [8] https://en.wikipedia.rg/wiki/cpynumber variatin [9] https://en.wikipedia.rg/wiki/ Cpy number analysis References [1] https://en.wikipedia.rg/wiki/ Gene expressin [2] https://en.wikipedia.rg/wiki/ Methylatin [3] Magali Champin, Olivier Gevaert,Multimics data fusin fr cancer data [4] https://en.wikipedia.rg/wiki/ Prprtinal hazards mdel 5