Data Fusion for Predicting Breast Cancer
Linbailu Jiang, Yufei Zhang, Siyi Peng
Mentor: Irene Kaplow
December 11, 2015

1 Introduction

1.1 Background

Cancer is a more severe health issue than ever in our society. It is often lethal, and many factors may affect a patient's survival. It is not easy to find a clear pattern that predicts the survival outcome of a cancer patient, because the outcome can result from a complex process involving different biological conditions. Based on previous studies, many features are considered to potentially affect a survival event: cancer type, age at diagnosis, treatment pathway, time since diagnosis, and some specific genomic patterns that may relate to the progression of the disease. While some of these features, such as cancer type and age at diagnosis, are relatively explicit and relate more directly to the survival outcome, other features, such as gene mutations and methylation levels, are more complicated, and analyzing the potential interactions among them requires more effort. In this paper, we mainly focus on the pathological conditions of the patients without examining different cancer treatments.

1.2 Goal and Outline

In this paper, we want to understand a patient's survival rate given his or her genomic pattern and the time interval since diagnosis. Our ultimate goal is to predict how the survival rate of a patient with breast cancer changes over time, based on related genomic data. Our first step is to understand how genomic data is influenced by different factors, such as gene expression for different RNA, copy number value, and the methylation level for different DNA. By doing feature correlation analysis and feature selection on the genomic data, we identify cancer-related genes and their targets. The most challenging task here is to distinguish the driver mutations, the subset that truly contributes to the tumor's progression, from the large number of neutral passenger mutations that characterize the cancer.
Based on some previous studies, we expect that a support vector machine method might be helpful during this process.

At the second stage, we have relatively few genomic features that may affect cancer survival, so it is easier to merge the two datasets by matching the patient ID in the survival dataset to the sample ID in the genomic dataset. This relates all the genomic information to the survival results, so we can combine it with other potential features and start to train our model. After that, we estimate and compare the performance of the models using cross-validation.
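The merge step described above can be sketched as a simple inner join on the identifier. This is a minimal illustration only: the field names, gene names, and values are hypothetical, and it assumes each genomic sample ID maps one-to-one to a patient ID, which the real datasets may not guarantee.

```python
# Hypothetical survival records (field names and values are illustrative only).
survival_rows = [
    {"patient_id": "P001", "days_to_last_contact": 1321, "status": 0},
    {"patient_id": "P002", "days_to_last_contact": 825, "status": 1},
    {"patient_id": "P003", "days_to_last_contact": 189, "status": 0},
]

# Hypothetical genomic features keyed by sample ID (assumed equal to patient ID).
genomic_by_sample = {
    "P001": {"GENE_A": 1.8, "GENE_B": -0.4},
    "P002": {"GENE_A": -2.1, "GENE_B": 0.9},
    "P004": {"GENE_A": 0.3, "GENE_B": 0.1},  # has no survival record
}


def merge_on_id(survival_rows, genomic_by_sample):
    """Inner join: keep only patients that appear in both tables."""
    merged = []
    for row in survival_rows:
        features = genomic_by_sample.get(row["patient_id"])
        if features is None:
            continue  # patient without genomic data is dropped
        merged.append({**row, **features})
    return merged


merged = merge_on_id(survival_rows, genomic_by_sample)
for row in merged:
    print(row)
```

An inner join is used here so that every remaining row has both a survival outcome and genomic features; whether unmatched patients should instead be kept with missing values is a design choice the text does not specify.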

1.3 Data

The data for our project come from an NIH (National Institutes of Health) project, and we obtained them from Professor Olivier Gevaert. The data are all pre-processed, log-transformed, and separated by cancer type into 11 datasets. The whole dataset comprises two parts. The first part is a dataset of patient ID, cancer type, time since diagnosis (TimeToLastContactOrVisit, in days), and survival status (normally 0 = alive, 1 = dead). The second part consists of large lists of genomic data, one list per cancer type. Each list contains three datasets, which hold gene expression data, copy number data, and methylation data, respectively. Because all the data are pre-processed, some values in these tables can be negative.

The first genomic data we used in this project is the gene expression data. For each sample, a high (large positive) expression value for a specific gene means that the information encoded in this gene has been highly expressed, while a low value indicates that most of its information has remained hidden.

The second genomic data is the copy number variation data, which represents the structural variation of a specific gene. A larger copy number implies that the gene might be duplicated, so that there are more copies than normal, while a smaller copy number denotes a deletion in a specific region, so that there are fewer copies than normal.

In recent years, methylation has also become an important concept in cancer research, so our third dataset records the methylation pattern for each sample. An aberrant DNA methylation pattern, such as hypermethylation or hypomethylation, can usually be associated with many types of human malignancies.

In this project, due to limited time, we focus only on breast cancer (BRCA). The sample size of the BRCA dataset is relatively large compared with the others, so we believe training on this dataset is more likely to give valuable results.

2 Methods

2.1 Feature Selection

For most of our datasets, the number of features is much larger than the number of samples (p ≫ n), which can cause difficulties in training. For instance, the gene expression dataset for breast cancer has only 985 samples but 16020 features. A potential problem this can cause is overfitting. To avoid it, the first thing to handle is reducing the number of variables in these datasets.

Our first step was to compute the variance of each feature and discard the features with low variance. For example, the variances of the gene expression features for breast cancer range from 0.03 to 25.50, and most of these features have relatively low variance. We assume that low-variance predictors have less predictive power (which is not always true; we use this simple method only to obtain a first impression of the data), so we tried removing predictors with small variance and used 10-fold cross-validation to see how this affects model performance. Table 1 shows the model performance for different variance thresholds.

# of selected features   5909      1673      598       78        13
Variance threshold       1         3         5         10        15
SVM error                0.110037  0.109963  0.109976  0.109975  0.110061
Naive Bayes error        0.352112  0.347705  0.332234  0.269524  0.121026

Table 1: Error estimates of SVM and Naive Bayes using 10-fold cross-validation

As we see, for an SVM model there is no obvious improvement from reducing the feature size; however, for a Naive Bayes model, the fewer the variables, the better the accuracy. One possible explanation is that the model on the raw data violates the conditional independence assumption of Naive Bayes: some genes may work together, so their expression values can be highly correlated. The accuracies of both the SVM and Naive Bayes models are close to 90 percent; however, we do not think these models are actually good at this point, since we found that all of them tend to predict 0 (alive) rather than 1 (dead), and the corresponding ROC curves imply that the models are uninformative.

To solve this problem, we decided to improve our feature selection process and use a more reliable method to pick important features. According to previous studies, training a logistic regression model on each feature separately and ranking the features by lowest CV error can be a good way to find valuable features. We picked the top 200 genes with the best individual performance, and then ranked them again using a Cox model to reduce the final feature size to 20 (details in the next section).

2.2 Survfit and Cox Model

After significantly reducing the number of feature variables, we combined all the important features with the survival dataset. One important point to notice is the variable TimeToLastContactOrVisit. It indicates the number of days from when a patient was first diagnosed with breast cancer to his or her last visit. If the patient is dead, this variable represents the duration to death; if not, it is censored data, as we do not know the patient's actual current status. We only know that this patient was alive x days after the beginning of the study (x = TimeToLastContactOrVisit).

In the previous part, to simplify the model, we treated TimeToLastContactOrVisit as a continuous feature variable and trained SVM models on it. However, the performance of these models was unfavorable, which may be explained by our not handling this term in the right way.
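To make the role of censoring concrete, the following sketch computes a Kaplan-Meier estimate from (time, status) pairs. Note the assumptions: Kaplan-Meier is a standard non-parametric estimator swapped in here for illustration, not the Cox model used in this paper, and the toy cohort simply reuses the last-contact days and statuses of the five example patients from the figure captions. A censored patient (status 0) still counts as "at risk" until the last contact day, which is information lost when the time is treated as an ordinary feature.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct observed death time.

    times  -- days to death or to last contact
    events -- 1 if death was observed, 0 if the patient is censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients that share the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Both dead and censored patients leave the risk set after t.
        at_risk -= leaving
    return curve


# Toy cohort (illustrative only): days and statuses of the five example patients.
times = [189, 825, 1321, 1365, 1920]
events = [0, 1, 0, 1, 1]

curve = kaplan_meier(times, events)
for day, prob in curve:
    print(day, prob)
```

In this toy cohort, the patient censored at day 189 reduces the risk set from five to four before the first death, so the estimate at day 825 is 1 - 1/4 rather than 1 - 1/5.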

An alternative is to treat this term as part of the response instead of as a predictor. We discuss why handling this term correctly is important later, when we analyze our sample results. Based on previous studies, a common way to combine the timeline and status of an event is to create a new type of object, usually called a Surv or survival object, which is a small data matrix that contains comprehensive information about an event.

In this project, we used a Cox model to fit the survival data. The Cox model has the form:

$$\lambda(t \mid X) = \lambda_0(t)\exp(\beta_1 X_1 + \cdots + \beta_p X_p) = \lambda_0(t)\exp(X\beta)$$

where $\lambda_0(t)$ is the baseline hazard and $\beta_1, \ldots, \beta_p$ are the coefficients of the features $X_1, \ldots, X_p$.

We used the functions survfit and coxph in the survival package in R to train the Cox models and then fit the survival object. The next problem we met was the large time cost of fitting Cox models on hundreds of features. To improve efficiency, we ran a Cox model on each feature individually and picked the top 20 Cox-ranked genes as the final features of our model. After training a Cox model on these 20 features, we predicted a survival curve for each test sample and calculated the cross-validation error of our model.

3 Results and Discussion

3.1 Interpreting Curves

Unlike other types of response, survival objects cannot be directly compared with the original test data. Therefore, we need a way to transform our predictions back into two-level (0/1) factors to calculate model errors. Figures 1, 2, and 3 show several examples of survival curves that help illustrate this transformation. Each plot represents one patient's survival curve.

[Figure 1: Curves for test samples with Event 0 (alive). (a) Ex1: Event 0 on the 1321st day. (b) Ex2: Event 0 on the 189th day.]

Figure 1 compares two examples with status 0 at last contact. Although the two survival curves look very different, they can have the same event status based on different timelines. The survival curve of the first sample looks much better than that of the second, so we might predict that the first person's survival rate is higher than the second person's. However, once we consider the time of last contact, we find that the survival rate of the first person on the 1321st day is around 0.75, while the survival rate of the second person on the 189th day is around 0.98, which is higher than that of the first person in this case. This example clearly illustrates that TimeToLastContactOrVisit is a crucial term that can affect our predictions to a great extent.

[Figure 2: Curves for test samples with Event 1 (dead). (a) Ex3: Event 1 on the 1920th day. (b) Ex4: Event 1 on the 825th day.]

Figure 2 compares two examples with status 1 at last contact. Again, the two survival curves are clearly very different: the first patient's survival curve is much flatter and higher than the second one. However, since we are looking at different timelines, both patients are predicted dead at their respective last-contact days.

[Figure 3: Curves for test samples with Events 0 and 1. (a) Ex1: Event 0 on the 1321st day. (b) Ex5: Event 1 on the 1365th day.]

Figure 3 gives an example of two patients with similar timelines and different survival curves. At their last-contact days (the 1321st and 1365th, respectively), the first patient's survival rate is around 0.7 while the second patient's is around 0.4, so we predict the first patient alive and the second patient dead.

From these three comparisons, we can see that both the survival curve and the timeline determine how we predict the living condition of a patient.

3.2 Comparing Models on Different Datasets

Breast cancer model trained on:   Gene Expression   Copy Number   Methylation
Top 20 Cox-ranked genes           83.6%             52.3%         65.3%
Top 40 Cox-ranked genes           85.2%             54.1%         69.6%
Top 60 Cox-ranked genes           86.1%             54.8%         70.0%

Table 2: Accuracy of BRCA models on different datasets and feature sizes

From the model comparison, we can see that both the gene expression data and the methylation data give decent prediction accuracy for the patient's survival status. The top 20 Cox-ranked genes are good enough to make rather accurate predictions.
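The conversion from a predicted survival curve to a 0/1 status, as described in Section 3.1, can be sketched as follows. This is a minimal illustration under stated assumptions: the curves are hypothetical step functions given as (day, probability) pairs rather than real survfit output, the 0.5 cutoff is the threshold mentioned in the Future Work section, and predicting "alive" when the probability equals the threshold exactly is an arbitrary choice.

```python
def survival_at(curve, day):
    """Step-function lookup of S(day) from sorted (day, probability) pairs."""
    prob = 1.0  # before the first step, survival probability is 1
    for t, s in curve:
        if t > day:
            break
        prob = s
    return prob


def predict_status(curve, last_contact_day, threshold=0.5):
    """Return 0 (alive) if predicted survival at last contact >= threshold, else 1 (dead)."""
    return 0 if survival_at(curve, last_contact_day) >= threshold else 1


def accuracy(predicted, observed):
    """Fraction of patients whose predicted status matches the observed one."""
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)


# Hypothetical test patients: (predicted curve, last contact day, observed status).
patients = [
    ([(500, 0.9), (1000, 0.8), (1300, 0.7)], 1321, 0),
    ([(400, 0.8), (900, 0.55), (1300, 0.4)], 1365, 1),
    ([(100, 0.98), (600, 0.6), (1200, 0.45)], 189, 0),
    ([(300, 0.9), (800, 0.6)], 825, 1),
]

predicted = [predict_status(curve, day) for curve, day, _ in patients]
observed = [status for _, _, status in patients]
print(predicted, accuracy(predicted, observed))
```

The third patient shows why the last-contact day matters: the curve eventually drops below 0.5, but at day 189 the predicted survival is still 0.98, so the prediction is "alive".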

4 Conclusion

Across the methods we tried, we found that treating TimeToLastContactOrVisit and the event (the survival status) together as a survival object and fitting a Cox model to it is a good approach for training on and predicting the survival status of cancer patients. It predicts patients' survival status with much higher accuracy than simply treating the whole problem as a classification task and applying a support vector machine or Naive Bayes model. Both the gene expression and methylation datasets work well as feature variables, and the top 20 Cox-ranked features are accurate enough to make good predictions.

5 Future Work

Currently, we set the threshold to 0.5 in the Cox model to predict a cancer patient's survival status. In the future, we could raise the threshold to increase the specificity without decreasing the sensitivity too much. Moreover, we hope to find a way to combine the gene expression and methylation data and use the combined dataset to better predict a cancer patient's survival status. Beyond breast cancer, we could broaden our research to other cancer types as well.