Prediction and Inference under Competing Risks in High Dimension - An EHR Demonstration Project for Prostate Cancer

Size: px

Start display at page:

Download "Prediction and Inference under Competing Risks in High Dimension - An EHR Demonstration Project for Prostate Cancer"

Amice Armstrong
5 years ago
Views:

1 Prediction and Inference under Competing Risks in High Dimension - An EHR Demonstration Project for Prostate Cancer Ronghui (Lily) Xu Division of Biostatistics and Bioinformatics Department of Family Medicine and Public Health and Department of Mathematics University of California, San Diego rxu@ucsd.edu This is joint work with: Jiayi Hou, Ph.D., Anthony Paravati, M.D., James Murphy, M.D., MPH; and Jue (Marquis) Hou, Jelena Bradic, Ph.D. R. Xu (UCSD) Competing Risks in High Dimension 1 / 30

2 Linked SEER-Medicare database As an illustration project of how information contained in patients electronic medical records (EMR) can be harvested for the purposes of precision medicine, we consider the large data set linking: the Surveillance, Epidemiology and End Results (SEER) Program database of the National Cancer Institute with the federal health insurance program Medicare database for prostate cancer patients of age 65 or older. R. Xu (UCSD) Competing Risks in High Dimension 2 / 30

3 Cancer and non-cancer mortality Clinical guidelines for the management of prostate cancer based on two factors: an estimation of the aggressiveness of a patient s tumor; and estimation of a patient s overall life expectancy. Currently no tool exists to predict non-cancer survival for this patient population. Meanwhile non-cancer and cancer survival are not independent, i.e. competing risks, a comprehensive prediction tool should consider both types of risks simultaneously. R. Xu (UCSD) Competing Risks in High Dimension 3 / 30

4 Data We restrict our analysis to patients diagnosed between 2004 and 2009 in the SEER-Medicare database. After excluding additional patients with missing clinical records, we have a total of 57,011 patients 7 relevant clinical variables (age, PSA, Gleason score, AJCC stage, and AJCC stage T, N, M, respectively) 5 demographical variables (race, marital status, metro, registry and year of diagnosis) 8971 binary insurance claim codes, capturing medical diagnoses, procedures, hospitalization and outpatient activities, etc. 1 if the procedure/activity present, and 0 if absent. The clinical and demographic information are at the time of diagnosis, and insurance claims data during the year prior to diagnosis, so that survival prediction would occur at the time of diagnosis R. Xu (UCSD) Competing Risks in High Dimension 4 / 30

5 Data (cont d) Until December 2013 (end of follow-up) there were a total of 1,247 deaths due to cancer 5,221 deaths unrelated to cancer. As the number of events dictates the effective sample size, we have more predictors than the effective sample size. R. Xu (UCSD) Competing Risks in High Dimension 5 / 30

6 High dimensional problem Different approaches have been proposed to analyze survival data with high dimensional covariates: LASSO under the Cox model (Tibshirani 1997), adaptive LASSO under the Cox model (Zhang and Lu 2007), random forest for right-censoring data (Hothorn et al. 2006), non-concave penalized methods under ultra high dimension (Bradic et al. 2011). But very few high-dimensional methods have been developed in the presence of competing risks: Binder et al. (2009) first proposed a boosting approach for fitting the proportional subdistribution hazards model, Fu et al. (2017) considered penalized approaches under the same model (requires p n). R. Xu (UCSD) Competing Risks in High Dimension 6 / 30

7 The models Denote the observed data X i = min(t i, C i ), δ i = I(T i C i ), ɛ i = 1, 2 is the cause or type, Z i is a vector of covariates, i = 1,..., n. We are interested in the cumulative incidence function (CIF) F j (t Z) = P (T t, ɛ = j Z) for failure type j. The proportional cause-specific hazards (PCSH) model λ j (t Z) = λ 0j (t) exp(β jz), (1) for j = 1, 2, where 1 λ j (t) = lim t 0+ tp (t T < t + t, J = j T t) is the cause-specific hazard function. Then F j (t Z) = t 0 λ j(u Z)S(u Z)du, where S(t Z) = P (T > t Z) is the overall survival function. R. Xu (UCSD) Competing Risks in High Dimension 7 / 30

8 The models (cont d) The proportional subdistribution hazards (PSDH, Fine and Gray 1999) model λ 1 (t Z) = λ 0 (t) exp(β Z). (2) where λ 1 (t Z) = d dt log{1 F 1(t Z)}. Obviously F 1 (t Z) is readily estimated after fitting the PSDH model for cause 1 only. R. Xu (UCSD) Competing Risks in High Dimension 8 / 30

9 High-d algorithm simulation: PCSH model with LASSO (1) Figure: Continuous covariates: n = 500, the three columns correspond to p = 20, 500, and The rows are: 1) CV10, 2) CV+1SE, 3) minimum AIC, 4) minimum BIC, 5) elbow AIC and 6) elbow BIC. The true CIF 1 (2 z 0 ) = Ind Exch AR1 Oracle R. Xu (UCSD) Competing Risks in High Dimension 9 / 30

10 High-d algorithm simulation: PCSH model with LASSO (2) Figure: Balanced binary covariates: n = 500, the three columns correspond to p = 20, 500, and The rows are: 1) CV10, 2) CV+1SE, 3) minimum AIC, 4) minimum BIC, 5) elbow AIC and 6) elbow BIC. The true CIF 1 (2 z 0 ) = See Supplement See Supplement Ind Exch AR1 Oracle See Supplement See Supplement R. Xu (UCSD) Competing Risks in High Dimension 10 / 30

11 High-d algorithm simulation: PCSH model with LASSO (3) Figure: Sparse binary covariates: n = 500, the three columns correspond to p = 20, 500, and The rows are: 1) CV10, 2) CV+1SE, 3) minimum AIC, 4) minimum BIC, 5) elbow AIC and 6) elbow BIC. The true CIF 1 (2 z 0 ) = See Supplement See Supplement Ind Exch AR1 Oracle See Supplement See Supplement See Supplement See Supplement R. Xu (UCSD) Competing Risks in High Dimension 11 / 30

12 High-d algorithm simulation: PSDH with boosting (1) Figure: Continuous covariates: n = 500, the three columns correspond to p = 20, 500, and The rows are: 1) CV10, 2) minimum AIC, 3) minimum BIC, 4) elbow AIC and 5) elbow BIC. The true CIF 1 (2 z 0 ) = Ind Exch AR1 Oracle R. Xu (UCSD) Competing Risks in High Dimension 12 / 30

13 High-d algorithm simulation: PSDH with boosting (2) Figure: Balanced binary covariates: n = 500, the three columns correspond to p = 20, 500, and The rows are: 1) CV10, 2) minimum AIC, 3) minimum BIC, 4) elbow AIC and 5) elbow BIC. The true CIF 1 (2 z 0 ) = Ind Exch AR1 Oracle R. Xu (UCSD) Competing Risks in High Dimension 13 / 30

14 High-d algorithm simulation: PSDH with boosting (3) Figure: Sparse binary covariates: n = 500, the three columns correspond to p = 20, 500, and The rows are: 1) CV10, 2) minimum AIC, 3) minimum BIC, 4) elbow AIC and 5) elbow BIC. The true CIF 1 (2 z 0 ) = Ind Exch AR1 Oracle R. Xu (UCSD) Competing Risks in High Dimension 14 / 30

15 Apply to SEER-Medicare data: PCSH model with LASSO Random split into training (n = 28, 505) and test (n = 28, 506) datasets. On training data: p 1 = 2188 and p 2 = 1079 claim codes for non-cancer and cancer mortality after univariate screening; CV10 was used to choose the penalty parameter; final model contained 143 predictors for non-cancer mortality, and 9 predictors for cancer mortality. On test data: For each mortality type j = 1, 2, we divided into 4 risk strata: low (L), median low (ML), median high (MH) and high (H) by quartiles; this gives 16 strata for the combined two types of mortalities; after plotting the predicted CIF s for the 16 strata, 5 distinct risk groups emerged for both cancer and non-cancer mortalities. R. Xu (UCSD) Competing Risks in High Dimension 15 / 30

16 SEER-Medicare results: PCSH model with LASSO R. Xu (UCSD) Competing Risks in High Dimension 16 / 30

17 SEER-Medicare results: PCSH model with LASSO Risk Groups n # events Cause-Specific Non-cancer mortality: Non-cancer: Cancer: L L L, ML, MH, H ML ML L, ML, MH, H M MH L, ML, MH, H MH H L, ML, MH H H H Cancer mortality: Non-cancer: Cancer: L L, ML, MH, H L ML L, ML, MH, H ML M L, ML, MH, H MH MH L, ML, MH H H H H R. Xu (UCSD) Competing Risks in High Dimension 17 / 30

18 SEER-Medicare results: PCSH model with LASSO R. Xu (UCSD) Competing Risks in High Dimension 18 / 30

19 Apply to SEER-Medicare data: PSDH model with boosting Same training (n = 28, 505) and test (n = 28, 506) datasets. On training data: top 2000 claim codes (ranked by p-value) for each type of mortality (p 1 = 4634 and p 2 = 6088 after univariate screening); AIC was used, because it was a close second to CV10 and less computationally intensive; final model contained 53 predictors for non-cancer mortality, and 13 predictors for cancer mortality. On test data: For each mortality type j = 1, 2, we divided into 5 risk strata: low (L), median low (ML), median (M), median high (MH) and high (H) by quintiles. R. Xu (UCSD) Competing Risks in High Dimension 19 / 30

20 SEER-Medicare results: PSDH model with boosting R. Xu (UCSD) Competing Risks in High Dimension 20 / 30

21 Prediction, variable selection, and inference The field of high dimensional statistical theory has been advancing (Bühlmann, Peter and van de Geer, 2011). Consistent prediction can be relatively easily achieved (eg. LASSO can be consistent for estimating the underlying regression function); Consistent variable selection needs a beta-min condition (i.e. all non-zero coefficients are sufficiently large), in addition to the restrictive irrepresentable condition (cannot have too strongly correlated design); Recent developments by Zhang and Zhang (2014), van de Geer et al. (2014) allows confidence intervals, in the presence of many small non-zero coefficients. R. Xu (UCSD) Competing Risks in High Dimension 21 / 30

22 One-step estimator Consider the high-dimensional PSDH model where Z(t) may be time-dependent. λ 1 (t Z(t)) = λ 0 (t) exp{β Z(t)}. (3) The modified log partial likelihood that gives rise to the estimating equation for β (in low dim.) is n t n m(β) = n 1 β Z i (t) log ω j (t)y j (t)e β Z j (t) dn i o (t), i=1 0 where Ni o (t) = I(X i t, δ iɛ i = 1), ω i(t)y i(t) = r i(t)y i(t) Ĝ(t)/Ĝ(t Xi), G(t) = P (C > t) and r i(t)y i(t) = I(t X i) + I(t > X i)i(δ iɛ i > 1). j=1 (4) R. Xu (UCSD) Competing Risks in High Dimension 22 / 30

23 Initial LASSO estimator Following Zhang and Zhang (2014), van de Geer et al. (2014), let the initial estimator be } β(λ) = argmin { m(β) + λ β 1. (5) β Oracle inequality = for the regularization parameter λ choosen to be of the order log(n) log(p)/n, we have β β o 1 = O p (s o log(n) ) log(p)/n. R. Xu (UCSD) Competing Risks in High Dimension 23 / 30

24 One-step estimator (cont d) The one-step bias-corrected estimator is b := β + Θṁ( β), (6) where Θ is such that m(β o ) Θ I p, and is obtained by a version of nodewise LASSO based on the score vectors instead of the design matrix (Zhang and Zhang, 2014). R. Xu (UCSD) Competing Risks in High Dimension 24 / 30

25 Confidence intervals We can show that b β o = Θṁ(β o ) + o p(1), and that for any c R p such that c 1 = 1, c ṁ(β o ) d N(0, c Vc). An 1 α confidence interval for c β o is (c b Z1 α/2 c Θ V Θ c/n, c b ) + Z1 α/2 c Θ V Θ c/n. (7) R. Xu (UCSD) Competing Risks in High Dimension 25 / 30

26 Simulation results For cause 1, s o = 2. True Mean Est SD SE SE corrected Coverage Level/Power n=200, p=1000 β 1, β 1, β 1, n=500, p=1000 β 1, β 1, β 1, R. Xu (UCSD) Competing Risks in High Dimension 26 / 30

27 References 1 Hou J, Paravati A, Xu R, Murphy J. High-Dimensional Variable Selection and Prediction under Competing Risks with Application to SEER-Medicare Linked Data. 2 Hou J, Bradic J, Xu R, Fine-Gray Competing Risks Model with High Dimensional Covariates: Estimation and Inference. R. Xu (UCSD) Competing Risks in High Dimension 27 / 30

28 High dimensional algorithms: LASSO Tibshirani s (1997) LASSO can be applied to the PCSH model, as standard Cox regression software is used in low dimensions. Selection of penalty parameter: CV10: minimum 10-fold cross-validated (CV) negative predictive log partial likelihood (referred to as error in the following); CV+1SE: minimum 10-fold CV error plus one standard error of the CV estimated errors; min AIC / BIC; elbow AIC / BIC (Tibshirani et al. 2001). Under the Cox model AIC = 2 log(l) + 2s, where L is the partial likelihood and s = S( β) is the size of the active set (Verweij and van Houwelingen, 1993; Xu et al. 2009); BIC = 2 log(l) + 2s log(k), where k is the number of observed events (Volinsky and Raftery, 2000) from the cause of interest. R. Xu (UCSD) Competing Risks in High Dimension 28 / 30

29 LASSO (cont d) Fu et al. (R package crrp ) contains LASSO for the PSDH model, unfortunately was not able to apply to the linked SEER-Medicare data as it ran out of memory. (This becomes a challenge for the initial estimator in our 2nd paper.) R. Xu (UCSD) Competing Risks in High Dimension 29 / 30

30 High dimensional algorithms: boosting Binder et al. (2009) applied a likelihood based boosting approach to the weighted partial likelihood under the PSDH model. Selection of optimal number of boosting steps: CV10; min AIC / BIC; elbow AIC / BIC. Definitions of the criteria are similar to those under the PCSH model above, with the likelihood returned by the crr() function in R package cmprsk. R. Xu (UCSD) Competing Risks in High Dimension 30 / 30

Machine Learning to Inform Breast Cancer Post-Recovery Surveillance

Machine Learning to Inform Breast Cancer Post-Recovery Surveillance Final Project Report CS 229 Autumn 2017 Category: Life Sciences Maxwell Allman (mallman) Lin Fan (linfan) Jamie Kang (kangjh) 1 Introduction