Original Article

Impact of Imputation of Missing Data on Estimation of Survival Rates: An Example in Breast Cancer

Baneshi MR 1, Talei AR 2

Abstract

Background: Multifactorial regression models are frequently used in medicine to estimate survival rates of patients across risk groups. However, their results are not generalisable if the assumptions required are not satisfied when the models are developed. Missing data is a common problem in pathology. The aim of this paper is to address the danger of excluding cases with missing data, and to highlight the importance of imputing missing data before developing multifactorial models.

Methods: This study was performed on 310 breast cancer patients diagnosed in Shiraz (Southern Iran). Using a complete-case Cox regression model, a prognostic index was calculated so as to categorise the patients into 3 risk groups. Then, applying the Multivariate Imputation via Chained Equations (MICE) method, missing data were imputed 10 times. Using the imputed data sets, modelling was performed to assign patients to risk groups. Estimated actuarial Overall Survival (OS) rates from the analysis of complete-case and imputed data sets were compared.

Results: Cases with at least one missing datum experienced a significantly better survival curve. Estimates derived from complete-case data, relative to imputed data sets, underestimated the OS rate in all risk groups. In addition, confidence intervals were wider, indicating loss in precision due to attrition in sample size and power.

Conclusion: The results highlight the danger of excluding cases with missing data. Imputation of missing data avoids biased estimates, increases the precision of estimates, and improves the generalisability of results to other similar populations.

Key words: Missing data; Multiple imputation; Breast neoplasm; Overall survival; Iran

Please cite this article as: Baneshi MR, Talei AR. Impact of Imputation of Missing Data on Estimation of Survival Rates: An Example in Breast Cancer. Iran J Cancer Prev. 2010; Vol3, No3, p.127-31.

1. Health School, Kerman University of Medical Sciences, Department of Biostatistics and Epidemiology, Kerman, Iran
2. Shahid Faghihi Hospital, Shiraz University of Medical Sciences, Shiraz, Iran

Corresponding author: Mohammad Reza Baneshi, PhD in Biostatistics
Tel: (+98) 913 442 39 48
Email: m_baneshi@kmu.ac.ir
Received: 12 Jan, 2010
Accepted: 21 Jun, 2010

Iran J Cancer Prev 2010; 3: 127-31

Introduction

Cancer is one of the major health problems worldwide. In 2002, a quarter of the 11 million new cases of cancer reported worldwide occurred in Europe. Among new cancer patients diagnosed in the UK, who number more than a quarter of a million per year, the most common carcinomas (by incidence) were breast (16%), lung (13%), bowel or colorectal (13%), and prostate (12%) [1]. Breast carcinoma, with one million newly diagnosed cases annually, is the most prevalent malignancy, comprising 18% of all female cancers [2].

In Iran, cancer is the third leading cause of death after cardiovascular diseases and accidents [3]. Breast cancer is the most lethal cancer among women. The prevalence of breast cancer was reported as 25.4 and deaths due to breast cancer as 12.3 per 100,000 [3].

Clinical trials typically involve collection of patient data at entry and, in so far as possible, these data will include variables of potential relevance to the likely cause of the disease under study. These data sets have been used in the development of prognostic models, which provide a valuable resource in identifying important risk factors for disease course and hence also for risk stratification of patients. However, if model assumptions and limitations are ignored during the development of prognostic models, the models obtained might not be generalisable [4,5].

The presence of missing data is one issue which creates difficulties in model building. When missing data
are present, researchers frequently drop patients with missing data on any of the variables under study from consideration. This ad hoc method is known as Complete-Case (C-C) analysis [6]. It has been emphasized that exclusion of cases with missing data will diminish the precision of estimates and can lead to biased estimates [7].

Survival rates are frequently reported in the literature to compare treatment options, and to inform patients about their likely outcome [8]. Exclusion of cases with missing data results in biased estimates of cohort survival rates, in particular when the survival curve of cases with available data differs from that of the remainder (who had at least one missing datum) [9]. For example, when cases with missing data exhibit a lower survival curve than those with available data, omitting them results in overestimation of survival rates [9]. Therefore, appropriate methods should be applied to impute missing data so as to avoid attrition in sample size.

The aim of this paper is to compare estimation of survival rates under two scenarios: in complete-case analysis, and after imputation of missing data. The methods were applied to a breast cancer data set.

Materials and Methods

Patients and outcome

From 1994 to 2003, information on 310 breast cancer patients in Shiraz, southern Iran, was collected from the Hospital-based Cancer Registry of Nemazee Hospital, affiliated to Shiraz University of Medical Sciences. Median follow-up time was 2.5 years. The main outcome of the study was Overall Survival (OS). Survival time was defined as the period between diagnosis and death for patients who died, and between diagnosis and the last visit for censored patients. At the end of the study, there had been 56 deaths.

As a first step a multifactorial model was developed (see the rest of the text). The OS rates were then estimated within the risk groups derived (explained later).
Variables offered to the multifactorial models were those shown to have univariate predictive ability [10]: tumour stage with 3 levels (early, locally advanced, and advanced), tumour grade with 3 levels (1, 2, and 3), history of benign breast disease (positive versus negative), and age at diagnosis. Prior to analysis, the age variable was dichotomised at 48 years as a surrogate for approximate menopausal status [11].

Multifactorial Models

First, a dummy variable was created which took a value of 0 if the patient had available data on all variables under consideration and 1 otherwise. Survival curves of patients with and without missing data were compared by plotting Kaplan-Meier curves and performing the Log-Rank test. A linear Cox model was then applied to develop the multifactorial regression models [12].

Complete-Case (C-C) Model

In the C-C model, patients with missing data on any of the 4 candidate variables were excluded. A Cox regression model was then fitted using the ENTER variable selection method. A final risk score was calculated by multiplying each variable by its estimated regression coefficient and summing the products. Tertiles of the estimated risk score were applied as cut-offs to categorise patients into low, intermediate, and high risk groups.

MICE Model

The Multivariate Imputation via Chained Equations (MICE) method was then applied to impute missing data. The MICE method is a powerful tool for handling missing values: it replaces each missing value by multiple imputed values, typically 10, resulting in multiply imputed data sets [13,14]. Patients' outcome and the set of 4 risk factors were used in the MICE algorithm [15]. Polytomous and logistic regression were used to impute missing data for categorical and binary variables respectively.

The creation of 10 data sets means that 10 modelling analyses are required, one for each data set, and there will therefore be 10 different estimates for each parameter. A Cox regression model was fitted to each of the 10 imputed data sets. In each data set, a risk score was calculated by multiplying the variables by the data-set-specific estimates (10 scores in total).
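The risk-score step described above can be illustrated with a short sketch. The Python snippet below is a minimal illustration and not the authors' R code: the coefficient values, the names `COEFS`, `risk_score`, `tertile_cutoffs` and `risk_group`, and the example patients are all hypothetical. It only shows how a linear risk score is formed from regression coefficients and how tertile cut-offs assign patients to low, intermediate and high risk groups.

```python
from statistics import quantiles

# Hypothetical log hazard ratios; NOT the estimates from this study.
COEFS = {
    "stage_locally_advanced": 0.9,
    "stage_advanced": 1.8,
    "grade_2": 0.4,
    "grade_3": 0.8,
    "benign_history": -0.5,
    "age_over_48": 0.3,
}


def risk_score(patient, coefs=COEFS):
    """Linear predictor: sum of (coefficient x covariate value)."""
    return sum(beta * patient.get(name, 0) for name, beta in coefs.items())


def tertile_cutoffs(scores):
    """Two cut-offs that split the scores into three roughly equal groups."""
    lower, upper = quantiles(scores, n=3)
    return lower, upper


def risk_group(score, lower, upper):
    """Assign a risk group from the tertile cut-offs."""
    if score <= lower:
        return "low"
    if score <= upper:
        return "intermediate"
    return "high"


# Hypothetical patients coded with 0/1 indicator variables
# (an absent key means the indicator is 0).
patients = [
    {"benign_history": 1},
    {},
    {"age_over_48": 1},
    {"grade_2": 1},
    {"grade_3": 1},
    {"stage_locally_advanced": 1},
    {"stage_locally_advanced": 1, "grade_2": 1},
    {"stage_advanced": 1},
    {"stage_advanced": 1, "grade_3": 1, "age_over_48": 1},
]

scores = [risk_score(p) for p in patients]
lower, upper = tertile_cutoffs(scores)
groups = [risk_group(s, lower, upper) for s in scores]

for s, g in zip(scores, groups):
    print(f"score={s:5.2f}  group={g}")
```

With multiply imputed data, `risk_score` would presumably be called once per imputed data set, each time with that data set's own coefficients.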
Finally, for each patient a single risk score was calculated by averaging her estimated risk scores from the 10 imputed data sets. Tertiles of the final risk score were applied as cut-offs to divide patients into low, intermediate, and high risk groups.

Estimation of Overall Survival (OS) rates

To compare the OS rates in risk groups, actuarial 2-, 4-, and 5-year OS rates in the lowest, intermediate, and highest risk groups are reported. This was done for both the complete-case and imputed data sets. By definition the survival function at 4 years, say $S(4)$, is the probability of being alive at least until the 4th year of
follow-up. Therefore, survival at the 4th year depends on survival in the first, second, third and 4th years, which implies that $S(4) = P(T \geq 4)$. In the actuarial life-table procedure, the whole follow-up duration is split into intervals (for example, the 1-year intervals (0, 1], (1, 2], (2, 3], (3, 4]). If $n_i$ and $d_i$ denote the number of patients at risk just before the $i$-th interval and the number of events in the $i$-th interval, then the probability of surviving to the 4th year is given by

$$\hat{S}(4) = \prod_{i=1}^{4} \left(1 - \frac{d_i}{n_i}\right)$$

Based on Greenwood's formula, the variance of this estimator (here $t = 4$) can be estimated by

$$\hat{V}\big(\hat{S}(t)\big) = \hat{S}(t)^{2} \sum_{i=1}^{4} \frac{d_i}{n_i (n_i - d_i)}$$

To address loss in precision of estimates, confidence intervals of OS rates from the analysis of C-C and imputed data sets were estimated and compared.

Figure 1. K-M curves for cases with available data and cases with at least one missing datum. (Vertical axis: proportion alive; horizontal axis: time in years; curves: patients with at least one missing datum, patients with complete data.)

Table 1. Comparison of survival of patients with available data and with at least one missing datum

Group                                          # of patients   # of events   Log-Rank P-value
Cases with available data on all 4 variables   203             54            <0.0001
Cases with at least 1 missing datum            107             2

Table 2. Comparison of estimated OS rates in the risk groups derived from complete-case and imputed data sets

Model              Risk group     2-year OS (%)   4-year OS (%)   5-year OS (%)
Complete case      Low            92 (84, 100)    84 (70, 96)     84 (70, 96)
Complete case      Intermediate   79 (67, 91)     67 (51, 83)     67 (51, 83)
Complete case      High           52 (38, 66)     28 (12, 44)     16 (0, 32)
Imputed data set   Low            95 (91, 99)     90 (82, 98)     90 (82, 98)
Imputed data set   Intermediate   88 (80, 96)     82 (70, 94)     82 (70, 94)
Imputed data set   High           64 (52, 76)     42 (28, 56)     32 (16, 48)

Software

A series of packages which work under R software (version 2.5.1) were used [16]. Missing data were
imputed using the MICE package [17]. Performance of the models (discrimination and predictive ability) was assessed using the Design library [18]. K-M curves were plotted using SPSS software.

Results

The numbers (percentages) of patients with missing values on node status, grade, and history of benign disease were 63 (20.3%), 64 (20.6%), and 47 (15.2%) respectively. In total, out of 310 patients, 203 cases (65%) had data available on all 4 variables, of whom 54 had died. Table 1 reports the number of deaths for patients with complete data and for the remainder with at least one missing datum. The corresponding K-M curves are plotted in Figure 1. Cases with complete data had a much lower survival curve (Log-Rank P-value <0.0001). This indicates that exclusion of cases with missing data leads to underestimation of the true OS rates in the cohort analysed.

As explained in the Methods section, a risk score was estimated for the complete-case and imputed data sets. Using tertiles as cut-offs, patients were categorised into 3 risk groups (low, intermediate, and high). Estimated OS rates in the risk groups derived are summarised in Table 2. Estimates derived from patients with available data underestimated OS rates in all 3 risk groups and at all time points. For example, the estimated 2-year OS rates in the lowest risk group for the complete-case and imputed data sets were 92% and 95% respectively. The corresponding rates at 4 years were 84% and 90%. Furthermore, the confidence intervals from the imputed data sets were tighter than those from the complete-case data, since attrition in sample size is avoided.

Discussion

We have seen that confidence intervals of OS rates from the imputed data sets were narrower, indicating improvement in precision of estimates. Furthermore, comparing K-M curves of patients with available data with those of patients with at least one missing datum suggested that exclusion of cases with missing data leads to underestimation of OS rates. This was consistent with the estimates we obtained, which are summarised in Table 2. To provide more accurate estimates, we imputed missing data 10 times.
This was to protect against chance effects due to imputation. This protection was felt to be worth the inconvenience of having to average risk scores across 10 final models. Simpler imputation methods such as the Expectation-Maximisation (E-M) algorithm are likelihood-based and suitable approaches. However, the E-M method replaces each missing value by a single value and so does not take imputation uncertainty into account.

It has been noted that under the Missing Completely At Random (MCAR) assumption, subjects with complete data are a random sample of the full data set [19]. It has been argued that under the MCAR mechanism, if the missing rate is less than 5%, case deletion is a reasonable approach [20]. However, it should be emphasized that even when C-C analysis gives results comparable to MICE, a gold standard (MICE) is required against which to compare results from other, simpler methods [21]. On the other hand, when the missing rate is high, exclusion of cases with missing data will diminish the precision of estimates.

Another issue is that even a low rate of missing data on each variable might cause serious problems in multivariate modelling when patients with missing data are scattered across the data. That is because this might substantially reduce the number of complete cases available for analysis, and increase the chance of bias due to excluded cases.

There are many ad hoc methods (such as C-C, replacement by the mean, and missing-indicator approaches) and maximum likelihood methods (such as the E-M algorithm and the multiple imputation technique) for dealing with missing data [22]. Application and comparison of alternative imputation methods was beyond the scope of this paper and will be published elsewhere.

The ultimate consequence of complete-case analysis is a reduction in power. In addition, case deletion might result in biased regression coefficients if the remaining cases are not representative of the whole sample [7,23]. The results presented showed that exclusion of cases with missing data leads to biased and imprecise estimates. Therefore imputation of missing data should be a priority before any modelling practice.
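The point made above about scattered missing values can be made concrete with a small calculation. The sketch below is illustrative only and rests on an assumption that this paper does not make: that missingness is independent across variables, so that the expected fraction of complete cases is the product of the per-variable observed fractions. The function name `expected_complete_fraction` and the ten-variable scenario are hypothetical; the three missing rates and the 203 complete cases out of 310 are those reported in the Results.

```python
from math import prod


def expected_complete_fraction(missing_rates):
    """Expected share of complete cases, ASSUMING missingness is
    independent across variables (an assumption of this sketch)."""
    return prod(1.0 - rate for rate in missing_rates)


# Hypothetical scenario: ten variables, each with only 5% missing values.
low_rate_scenario = expected_complete_fraction([0.05] * 10)

# Missing rates reported in the Results for the three incomplete variables.
reported_rates = [0.203, 0.206, 0.152]
expected_here = expected_complete_fraction(reported_rates)

# Complete cases actually observed in the cohort.
observed_here = 203 / 310

print(f"10 variables at 5% missing: {low_rate_scenario:.1%} complete cases expected")
print(f"Reported rates, if independent: {expected_here:.1%} complete cases expected")
print(f"Observed in the cohort: {observed_here:.1%} complete cases")
```

Under the independence assumption, ten variables with only 5% missing each would leave roughly 60% of patients as complete cases. For the rates reported here the same assumption would give about 54%, somewhat below the 65% observed, which would be consistent with missing values tending to occur together in the same patients.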
Acknowledgment

We thank the staff of Motahhari Para clinic and Shahid Faghihi hospital, who facilitated our access to patients' folders and information.

Conflict of Interest

There is no conflict of interest in this article.

Authors' Contribution

The data set analyzed in this project was collected under the direction of Professor TAR at Shiraz University of Medical Sciences. All analyses and the writing of the manuscript were done by BMR. Both
authors have read and approved the final version of the manuscript.

References

1. Cancer Research UK. UK cancer incidence statistics. 2007 January [cited 2007 Feb 26]. Available from: http://info.cancerresearchuk.org/cancerstats/incidence/?a=5441
2. McPherson K, Steel CM, Dixon JM. ABC of breast diseases. Breast cancer-epidemiology, risk factors, and genetics. BMJ 2000 Sep 9; 321(7261):624-8.
3. Naghavi M. Iranian annual of national death registration report. Iran ministry of health and medical education; 2005.
4. Concato J, Feinstein AR, Holford TR. The risk of determining risk with multivariable models. Ann Intern Med 1993 Feb 1; 118(3):201-10.
5. Wyatt JC, Altman DG. Prognostic models: clinically useful or simply forgotten. BMJ 1995; 311:1539-41.
6. Burton A, Altman DG. Missing covariate data within cancer prognostic studies: a review of current reporting and proposed guidelines. Br J Cancer 2004 Jul 5; 91(1):4-8.
7. Altman DG, Bland JM. Missing data. BMJ 2007 Feb 24; 334(7590):424.
8. Altman DG, Lyman GH. Methodological challenges in the evaluation of prognostic factors in breast cancer. Breast Cancer Res Treat 1998; 52(1-3):289-303.
9. Van Buuren S, Boshuizen HC, Knook DL. Multiple imputation of missing blood pressure covariates in survival analysis. Stat Med 1999 Mar 30; 18(6):681-94.
10. Rajaeefard AR, Baneshi MR, Talei AR, Mehrabani D. Survival Models in Breast Cancer. Iranian Red Crescent Medical Journal 2009; 11(3):295-300.
11. Ayatollahi SM GHASA. Menstrual-reproductive factors and age at natural menopause in Iran. International journal of gynaecology and obstetrics 2003; 80(3):311-3.
12. Cox DR. Regression models and life tables. Journal of the Royal Statistical Society 1972; 34:187-220.
13. Schafer JL. Analysis of Incomplete Multivariate Data. Florida: Chapman and Hall; 1997.
14. Schafer JL. Multiple imputation: a primer. Stat Methods Med Res 1999 Mar; 8(1):3-15.
15. Moons KG, Donders RA, Stijnen T, Harrell FE, Jr. Using the outcome for imputation of missing predictor values was preferred. J Clin Epidemiol 2006 Oct; 59(10):1092-101.
16. R: A language and environment for statistical computing [computer program]. 2007.
17. Mice: Multivariate Imputation by Chained Equations [computer program]. 2007.
18. Design: Design Package [computer program]. 2008.
19. Donders AR, van der Heijden GJ, Stijnen T, Moons KG. Review: a gentle introduction to imputation of missing values. J Clin Epidemiol 2006 Oct; 59(10):1087-91.
20. Fairclough DL. Patient reported outcomes as endpoints in medical research. Stat Methods Med Res 2004 Apr; 13(2):115-38.
21. Greenland S, Finkle WD. A critical look at methods for handling missing covariates in epidemiologic regression analyses. Am J Epidemiol 1995 Dec 15; 142(12):1255-64.
22. Baneshi MR. Statistical Models in Prognostic Modelling of Many Skewed Variables and Missing Data: A Case Study in Breast Cancer (PhD thesis submitted at Edinburgh University) 2009.
23. Harrell FE. Regression modelling strategies with applications to linear models, logistic regression, and survival analysis. New York: Springer-Verlag; 2001.