Regression Tree Methods for Precision Medicine

1 Regression Tree Methods for Precision Medicine
Wei-Yin Loh
Department of Statistics, University of Wisconsin, Madison
W-Y Loh July 12,

2 Subgroup identification: breast cancer trial
Randomized trial of 686 subjects with primary node-positive breast cancer (Schumacher et al., 1994)
Response is recurrence-free survival time (in days; 299 uncensored, 387 censored)
Eight predictor variables with no missing values:
1. horth (hormone therapy, yes/no)
2. age (21-80 years)
3. tsize (tumor size, mm)
4. pnodes (number of positive lymph nodes, 1-51)
5. progrec (progesterone receptor status, fmol)
6. estrec (estrogen receptor status, fmol)
7. menostat (menopausal status, pre/post)
8. tgrade (tumor grade, 1, 2, 3)

3 [Figure: survival probability versus days for horth = no and horth = yes]
[Table: Coef and p-value columns for horth=yes, tsize, age, pnodes, meno=pre, progrec, tgrade and estrec; values not recoverable]

4 cor(estrec, progrec) = 0.39
cor(ln(estrec+1), ln(progrec+1)) = 0.64
[Figure: plots of estrec against progrec, on the raw scale and after adding 1]

5 GUIDE model (2nd best variable is estrec)
[Figure: tree with one split on progrec into Node 2 and Node 3; survival probability curves for horth = yes and horth = no in each node]

6 Earlier subgroup identification methods
Interaction trees (Su et al., 2008, 2009). For each $X$ and split set $S$ (e.g., $\{X < c\}$ or $\{X \in A\}$), fit $E(Y) = \beta_0 + \beta_1 Z + \beta_2 I(S) + \beta_3 Z\,I(S)$ to the data. Find the split $(X, S)$ with the most significant interaction ($\beta_3$).
SIDES (Lipkovich et al., 2011; Lipkovich and Dmitrienko, 2014). Find the split $(X, S)$ with the most significant between-node difference in treatment effects.
QUINT: Qualitative interaction tree (Dusseldorp and Van Mechelen, 2014). Find the split $(X, S)$ that optimizes a function of effect size and subgroup size.
VT: Virtual twins (Foster et al., 2011).
1. Fit a random forest model (Breiman, 2001) to the observed outcomes $y_{obs}$
2. Use the model to predict the counterfactual outcomes $y_{unobs}$ (other treatment)
3. Fit a CART model to $(y_{obs} - y_{unobs})$ to find subgroups
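A rough sketch of the interaction-tree search for a single ordered X and a binary treatment Z may make the criterion concrete. It scans cutpoints c for S = {X <= c}, fits the four-term model through its cell means (which is what least squares gives for this saturated model), and keeps the cutpoint whose interaction coefficient has the largest absolute t statistic. The function names and the toy data are hypothetical, and the published method also covers censored responses, categorical X and pruning, none of which is attempted here.

```python
from statistics import mean


def interaction_t(y, z, in_s):
    """t statistic of beta_3 in E(Y) = b0 + b1*Z + b2*I(S) + b3*Z*I(S), binary Z."""
    cells = {(a, b): [] for a in (0, 1) for b in (0, 1)}  # keyed by (Z, I(S))
    for yi, zi, si in zip(y, z, in_s):
        cells[(zi, int(si))].append(yi)
    if any(len(v) == 0 for v in cells.values()):
        return None  # interaction not estimable
    m = {k: mean(v) for k, v in cells.items()}
    sse = sum((yi - m[k]) ** 2 for k, v in cells.items() for yi in v)
    df = len(y) - 4
    if df <= 0 or sse == 0:
        return None  # no residual variance estimate available
    # beta_3 = treatment effect inside S minus treatment effect outside S
    beta3 = (m[(1, 1)] - m[(0, 1)]) - (m[(1, 0)] - m[(0, 0)])
    se = (sse / df * sum(1 / len(v) for v in cells.values())) ** 0.5
    return beta3 / se


def best_interaction_split(x, z, y):
    """Return (cutpoint, t) maximizing |t| over splits {X <= c}."""
    best = None
    for c in sorted(set(x))[:-1]:
        t = interaction_t(y, z, [xi <= c for xi in x])
        if t is not None and (best is None or abs(t) > abs(best[1])):
            best = (c, t)
    return best


# Toy data (made up): the treatment helps only when x > 5.
x = list(range(1, 11)) * 2
z = [0] * 10 + [1] * 10
noise = [0.05 * ((3 * i) % 7 - 3) for i in range(20)]
y = [float(zi == 1 and xi > 5) + e for xi, zi, e in zip(x, z, noise)]
cut, t_stat = best_interaction_split(x, z, y)
print(cut, round(t_stat, 2))
```

Because the criterion is a maximum over all candidate cutpoints, the selected t statistic is optimistically large, which is one reason the later slides treat naive significance statements with caution.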

7 Limitations
1. Most methods follow the CART approach of greedy search over all $(X, S)$; the result is bias in variable selection
2. Many are only applicable to 2 treatment levels
3. Most require imputation to deal with missing covariate values, but imputation is possibly the hardest problem in statistics!
4. All are designed for univariate response only; extension to multivariate or longitudinal, time-dependent response is not straightforward

8 Selection bias of CART and Random forest
Ordinal $X$ with $n$ distinct values allows $(n - 1)$ splits of the form $\{X \le c\}$
Categorical $X$ with $m$ levels has $(2^{m-1} - 1)$ splits of the form $\{X \in A\}$
Bias: Variables with large $n$ and $m$ have more chance to split a node

9 Example of selection bias: predicting heart disease
617 observations, no missing values
Response is diagnosis of heart disease (5 levels)
52 predictor variables (29 ordinal, 23 categorical), including
1. ekgmo: month of electrocardiogram (12 values)
2. ekgday: day of electrocardiogram (31 values)
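The split-count formulas behind the selection bias can be checked directly. The sketch below evaluates n - 1 and 2^(m-1) - 1 for variables with 12 and 31 distinct values, like ekgmo and ekgday. Whether those two variables are treated as ordinal or categorical is an assumption here, so both counts are printed; the function names are hypothetical.

```python
def ordinal_split_count(n_distinct: int) -> int:
    """Number of splits {X <= c} for an ordinal X with n distinct values."""
    return max(n_distinct - 1, 0)


def categorical_split_count(n_levels: int) -> int:
    """Number of splits {X in A} for a categorical X with m levels."""
    return 2 ** (n_levels - 1) - 1 if n_levels >= 1 else 0


for name, k in [("ekgmo", 12), ("ekgday", 31)]:
    print(name, ordinal_split_count(k), categorical_split_count(k))
# ekgmo 11 2047
# ekgday 30 1073741823
```

A greedy search that evaluates every split gives a 31-level categorical variable roughly a billion chances to look good by accident, against a few dozen for a typical ordinal variable.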

10 RPART tree (Breiman et al., 1984) (3.6 hrs)

11 GUIDE tree (Loh, 2002, 2009) (3 sec.)
[Figure: tree diagram with splits on lmt, rcaprox, ladprox, rcadist, cxmain, laddist, ramus and om; node labels and counts not recoverable]

12 Many missing values: a retrospective candidate gene study
1504 subjects randomized to treatment or placebo
Response is survival time in days, with 63% censored
23 baseline (17 ordered, 6 categorical) and 282 genetic (categorical) variables
95% of subjects have missing values; only 7 variables are complete
[Figure: survival probability versus days for treatment and placebo]

13 GUIDE model with 95% bootstrap intervals for relative risk (treatment vs placebo)
[Figure: tree with one split, a2 ≤ 0.1 or NA; intervals (0.73, 1.54) and (0.45, 0.81) beside the two terminal nodes; survival probability versus days for treatment and placebo in the nodes a2 ≤ 0.1 or NA and a2 > 0.1]
At each node, a case goes to the left child node if the stated condition is satisfied. Sample sizes are beside terminal nodes.

14 GUIDE method for subgroup identification (Loh, 2014; Loh et al., 2015)
1. Let $Z = 1, 2, \ldots$ be the treatment variable and $X$ a split variable
2. Do for each $X$ at each node:
(a) If $X$ is a categorical variable, add a category to $X$ for missing values and test lack of fit of the additive model
$EY = \eta + \sum_j \beta_j I(X = j) + \sum_k \gamma_k I(Z = k)$
(b) If $X$ is ordinal, convert it to categorical by discretization at quartiles
Lack of fit is assessed by comparison with
$EY = \eta + \sum_j \beta_j I(X = j) + \sum_k \gamma_k I(Z = k) + \sum_j \sum_k \omega_{jk} I(X = j, Z = k)$
3. Let $X^*$ be the variable with the most significant chi-squared
4. Find the split on $X^*$ that minimizes the sum of squared residuals of the model $EY = \eta + \sum_k \gamma_k I(Z = k)$ fitted to each subnode
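Step 4 can be sketched in a few lines once the split variable has been chosen. Fitting EY = eta + sum_k gamma_k I(Z = k) in a node amounts to fitting one mean per treatment group, so the node's residual sum of squares is the within-group sum of squares. The sketch below scans cutpoints of an ordered X and, as an assumption made only for this illustration, tries sending missing values (None) to either child and keeps whichever is better. The function names and toy data are hypothetical and this is not the GUIDE software.

```python
def treatment_sse(y, z):
    """Residual SS of E(Y) = eta + sum_k gamma_k I(Z = k): deviations from treatment means."""
    groups = {}
    for yi, zi in zip(y, z):
        groups.setdefault(zi, []).append(yi)
    total = 0.0
    for vals in groups.values():
        m = sum(vals) / len(vals)
        total += sum((v - m) ** 2 for v in vals)
    return total


def best_sse_split(x, z, y):
    """Return (cutpoint, missing_goes_left, total_sse) minimizing the summed SSE of both children."""
    best = None
    cuts = sorted({xi for xi in x if xi is not None})[:-1]
    for c in cuts:
        for na_left in (True, False):
            go_left = [na_left if xi is None else xi <= c for xi in x]
            yl = [yi for yi, g in zip(y, go_left) if g]
            zl = [zi for zi, g in zip(z, go_left) if g]
            yr = [yi for yi, g in zip(y, go_left) if not g]
            zr = [zi for zi, g in zip(z, go_left) if not g]
            total = treatment_sse(yl, zl) + treatment_sse(yr, zr)
            if best is None or total < best[2]:
                best = (c, na_left, total)
    return best


# Toy data (made up): treatment effect only when x > 5; one treated case has x missing.
x = list(range(1, 11)) * 2 + [None]
z = [0] * 10 + [1] * 10 + [1]
y = [float(zi == 1 and xi > 5) for xi, zi in zip(x[:-1], z[:-1])] + [1.0]
print(best_sse_split(x, z, y))
```

Unlike the interaction-tree criterion, this search does not maximize a treatment-effect difference directly; it only asks how well treatment-specific means fit each child node.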

15 Type 2 diabetes longitudinal study with missing values in responses and covariates (Loh et al., 2016)
1249 subjects from a multi-center, randomized double-blind trial (Charbonnel et al., 2004)
Subjects randomized to a 52-week treatment period of drug G (Gliclazide) or P (Pioglitazone)
24 baseline (time 0) variables measured for each subject, as well as their HbA1c at 10 time points (-2, 0, 4, 8, 12, 16, 24, 32, 42, and 52 weeks)
Gliclazide increases the amount of insulin produced by the pancreas
Pioglitazone improves how the body uses insulin ("insulin sensitizer")

16 HbA1c means for 747 subjects
[Figure: mean A1C versus weeks for Pioglitazone and Gliclazide]

17 Baseline variables and their missing values
Variable                #Missing    Variable                    #Missing
HDL                     7           Age                         0
LDL                     77          Weight                      1
Total cholesterol       6           BMI                         0
Triglycerides           6           Waist                       4
Creatinine              0           A1CBase                     0
Fasting insulin         46          HomaS                       62
ALT                     0           HomaIR                      62
AST                     0           HomaB                       62
GGT                     0           Diastolic blood pressure    0
C-peptide               593         Systolic blood pressure     0
Diabetes duration       0           Pulse                       0
Fasting blood glucose   0

18 GUIDE tree with 95% bootstrap CIs (Loh et al., 2016)
[Figure: tree with splits on HOMAB and fasting blood glucose; one panel per terminal node plotting HbA1c against weeks for Gliclazide and Pioglitazone]

19 Frequently (and not so frequently) asked questions
1. P(Type I error) controlled?
2. Subgroup correct?
3. Split points statistically significant?
4. Estimated subgroup treatment effects unbiased?
5. Estimated subgroup treatment effects statistically significant?
6. Estimated subgroup treatment effects confounded with covariates?

20 Q1. Does GUIDE control P(Type I error)?
As $n \to \infty$, the estimated regression function is consistent (Chaudhuri et al., 1994, 1995; Chaudhuri and Loh, 2002).
Hence P(Type I error) $\to 0$

21 Q2. Is the subgroup correctly identified?
Surprise! There is no correct subgroup
[Figure: tree with one split on progrec into Node 2 and Node 3; survival probability curves for horth = yes and horth = no in each node]

22 Model without progrec
[Figure: tree with one split on estrec into two terminal nodes; survival probability curves for horth = yes and horth = no in each node]

23 Where is the correct subgroup?
[Figure: plots of estrec against progrec, on the raw scale and after adding 1]

24 Q3. Are split points statistically significant?
Consider these two simulation models: the jump model and the broken line model
[Figure: Y versus X for the jump model and for the broken line model]

25 IT and SIDES vs GUIDE
Jump model (true subgroup marked by dotted line)
[Figure: response versus biomarker for drug and placebo, one panel each for Interaction Trees, SIDES and GUIDE]
Interaction Trees maximizes the significance of the treatment-biomarker interaction
SIDES minimizes the p-value of the difference between treatment effects
GUIDE minimizes the sum of squared residuals

26 Mean of split point for two models
Model          Interaction Trees    SIDES      GUIDE
Jump           (0.003)              (0.005)    (0.002)
Broken line    (0.006)              (0.013)    (0.006)
Simulation SEs in parentheses; the means themselves are not recoverable
For the Jump model, the true split point is 5.0
For the Broken line model, the true split point is undefined

27 Q4. Are estimated treatment effects unbiased?
Ans: Usually not, but some methods are better
Subgroup treatment effect bias
Model          Interaction Trees    SIDES      GUIDE
Jump           (0.001)              (0.001)    (0.001)
Broken line    (0.001)              (0.002)    (0.001)
Simulation SEs in parentheses; the bias values themselves are not recoverable

28 Q5. Are treatment effects statistically significant?
1. Subgroups are random because they are results of search algorithms
2. Hence, unlike classical theory, true subgroup effects $\theta$ are also random
3. Statistical significance of estimates $\hat{\theta}$ must account for the search
4. A p-value requires a null hypothesis $H_0$, but what is $H_0$?

29 Bootstrap calibration (Loh, 1987, 1991)
Naïve intervals are too short: they do not account for the subgroup search
Need to increase the nominal confidence level
Use the bootstrap to estimate the true confidence levels
Increase the nominal level of the intervals to reach the desired level

30 Tree from a bootstrap sample
[Figure: two panels, real data and bootstrap sample, each showing a partition with terminal-node values marked x]

31 Bootstrap calibrated intervals (Loh et al., 2016)
1. Let $F$ be the true (unknown) distribution of the data
2. Given a sample of data, construct a tree model
3. Given $\gamma$, construct a nominal $100\gamma\%$ interval at each terminal node
4. Let $C(F, \gamma)$ be the true average coverage of the nominal $100\gamma\%$ intervals
5. Let $\gamma_F$ be such that $C(F, \gamma_F)$ equals the desired coverage
6. If we know $F$, construct nominal $100\gamma_F\%$ intervals and we are finished
7. Because $F$ is unknown, let $\hat{F}$ be its bootstrap estimate
8. Use simulation to find the calibrated level $\gamma_{\hat{F}}$ such that $C(\hat{F}, \gamma_{\hat{F}})$ equals the desired coverage
9. Construct the desired intervals at nominal level $\gamma_{\hat{F}}$
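The calibration steps can be sketched on a deliberately small stand-in for the tree search. In the sketch below, the "search" just picks the subgroup with the largest mean, which is enough to make naive intervals undercover; a real application would rebuild the tree on every bootstrap sample. The empirical distribution of the sample plays the role of F-hat, so the "true" value for a bootstrap-selected subgroup is that subgroup's mean in the original sample. All names, the normal-theory interval, the grid of nominal levels and the toy data are assumptions made for illustration.

```python
import random
from statistics import NormalDist, mean, stdev


def select_and_estimate(groups):
    """Toy search: pick the subgroup with the largest mean; return its index, mean and SE."""
    idx = max(range(len(groups)), key=lambda g: mean(groups[g]))
    vals = groups[idx]
    return idx, mean(vals), stdev(vals) / len(vals) ** 0.5


def calibrated_level(groups, target=0.95, n_boot=200,
                     levels=(0.90, 0.95, 0.975, 0.99, 0.995, 0.999), seed=0):
    """Return (calibrated nominal level, bootstrap coverage at each level in the grid)."""
    rng = random.Random(seed)
    truth = [mean(g) for g in groups]  # subgroup means under F-hat
    hits = [0] * len(levels)
    for _ in range(n_boot):
        boot = [rng.choices(g, k=len(g)) for g in groups]  # resample within subgroup
        idx, est, se = select_and_estimate(boot)           # redo the search
        for i, gam in enumerate(levels):
            half = NormalDist().inv_cdf(0.5 + gam / 2) * se
            hits[i] += abs(est - truth[idx]) <= half
    coverage = [h / n_boot for h in hits]
    for gam, cov in zip(levels, coverage):
        if cov >= target:
            return gam, coverage  # smallest nominal level reaching the target
    return levels[-1], coverage   # target not reached on this grid


# Toy data (made up): three subgroups with no real difference between them.
data_rng = random.Random(1)
groups = [[data_rng.gauss(0.0, 1.0) for _ in range(30)] for _ in range(3)]
gamma_hat, coverage = calibrated_level(groups)
print(gamma_hat, coverage)
```

The calibrated level returned here would then be used as the nominal level for the interval reported on the original sample, which is step 9 of the list.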

32 Bootstrap calibrated alpha for 95% confidence intervals
[Figure: bootstrap coverage versus nominal alpha]

33 95% bootstrap intervals for RR (therapy vs none)
[Figure: tree with one split on progrec; intervals (0.56, 1.42) and (0.30, 0.89) beside Node 2 and Node 3; survival probability curves for horth = yes and horth = no in each node; the bootstrap calibrated $\hat{\alpha}$ value is not recoverable]

34 Coverage of 95% CIs for treatment effect for breast cancer data
[Table: coverage (± 2 simulation SEs) of the naïve t interval and of the bootstrap calibrated interval; values not recoverable]
Simulation trials used 25 bootstraps each

35 Q6. How to ensure treatment effects are unconfounded within subgroups?
Many studies include prognostic variables (e.g., age, tumor size)
Treatment randomization balances the overall effects of these variables
But balance may be upset within subgroups

36 95% bootstrap intervals for RR due to horth with linear control of prognostic variables
[Figure: tree with one split on progrec; intervals (0.56, 1.18) and (0.34, 0.82) beside Node 2 and Node 3; the bootstrap calibrated $\hat{\alpha}$ value is not recoverable]
[Table: coef and unadjusted p-value for constant, pnodes and horth=yes in Node 2 and in Node 3; values not recoverable]

37 Coverage (± 2 SEs) of 95% CIs for treatment effect with local linear prognostic control for breast cancer data
[Table: coverage of the naïve t interval and of the bootstrap calibrated interval; values not recoverable]
Based on 1200 simulation trials with 25 bootstraps per trial

38 Conclusions
1. Asking for the correct subgroup is naïve: often there is no unique subgroup
2. GUIDE handles missing values without imputation
3. GUIDE has no selection bias: it does not favor variables that have more splits
4. GUIDE seems to give less biased estimates of subgroup treatment effects
5. GUIDE can control for prognostic effects within subgroups
6. A simple way to assess statistical significance is bootstrap calibrated intervals
Some outstanding problems
1. Given a tree model, how to tell which node defines the subgroup?
2. How to remove the bias in the estimated treatment effect in the subgroup?
3. How to deal with longitudinal (time-dependent) covariates?

39 Acknowledgments
Xu He, Michael Man and Lei Shen
Probal Chaudhuri
Yu-Shan Shih, Wei Zheng and 18 other PhD students
US Army Research Office
US National Science Foundation
US Bureau of Labor Statistics
US National Institutes of Health
AbbVie, Eli Lilly, Gilead Sciences, Pfizer and Takeda

40 References
Breiman, L. (2001). Random forests. Machine Learning, 45:5-32.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Chapman & Hall/CRC.
Charbonnel, B. H., Matthews, D. R., Schernthaner, G., Hanefeld, M., and Brunetti, P. (2004). A long-term comparison of Pioglitazone and Gliclazide in patients with Type 2 diabetes mellitus: a randomized, double-blind, parallel-group comparison trial. Diabetic Medicine, 22.
Chaudhuri, P., Huang, M.-C., Loh, W.-Y., and Yao, R. (1994). Piecewise-polynomial regression trees. Statistica Sinica, 4.
Chaudhuri, P., Lo, W.-D., Loh, W.-Y., and Yang, C.-C. (1995). Generalized regression trees. Statistica Sinica, 5.
Chaudhuri, P. and Loh, W.-Y. (2002). Nonparametric estimation of conditional quantiles using quantile regression trees. Bernoulli, 8.
Dusseldorp, E. and Van Mechelen, I. (2014). Qualitative interaction trees: a tool to identify qualitative treatment-subgroup interactions. Statistics in Medicine, 33.
Foster, J. C., Taylor, J. M. G., and Ruberg, S. J. (2011). Subgroup identification from randomized clinical trial data. Statistics in Medicine, 30.
Lipkovich, I. and Dmitrienko, A. (2014). Strategies for identifying predictive biomarkers and subgroups with enhanced treatment effect in clinical trials using SIDES. Journal of Biopharmaceutical Statistics, 24.
Lipkovich, I., Dmitrienko, A., Denne, J., and Enas, G. (2011). Subgroup identification based on differential effect search: a recursive partitioning method for establishing response to treatment in patient subpopulations. Statistics in Medicine, 30.
Loh, W.-Y. (1987). Calibrating confidence coefficients. Journal of the American Statistical Association, 82.
Loh, W.-Y. (1991). Bootstrap calibration for confidence interval construction and selection. Statistica Sinica, 1.
Loh, W.-Y. (2002). Regression trees with unbiased variable selection and interaction detection. Statistica Sinica, 12.
Loh, W.-Y. (2009). Improving the precision of classification trees. Annals of Applied Statistics, 3.
Loh, W.-Y. (2014). Fifty years of classification and regression trees (with discussion). International Statistical Review, 34.
Loh, W.-Y., Fu, H., Man, M., Champion, V., and Yu, M. (2016). Identification of subgroups with differential treatment effects for longitudinal and multiresponse variables. Statistics in Medicine, 35.
Loh, W.-Y., He, X., and Man, M. (2015). A regression tree approach to identifying subgroups with differential treatment effects. Statistics in Medicine, 34.
Schumacher, M., Baster, G., Bojar, H., Hübner, K., Olschewski, M., Sauerbrei, W., Schmoor, C., Beyerle, C., Newmann, R. L. A., and Rauschecker, H. F. (1994). Randomized 2 × 2 trial evaluating hormonal treatment and the duration of chemotherapy in node-positive breast cancer patients. Journal of Clinical Oncology, 12.
Su, X., Tsai, C. L., Wang, H., Nickerson, D. M., and Bogong, L. (2009). Subgroup analysis via recursive partitioning. Journal of Machine Learning Research, 10.
Su, X., Zhou, T., Yan, X., Fan, J., and Yang, S. (2008). Interaction trees with censored survival data. International Journal of Biostatistics, 4. Article 2.

44 Model for 10-week A1C without linear control
[Figure: tree with splits on InsulinFastpmolLBase and A1CBase]
Sample sizes below nodes; treatment means for G and P beside nodes. The split symbol in the tree stands for "≤ or missing". Red nodes indicate significant treatment effects.

45 [Table: for each of terminal Nodes 2, 6 and 7, the coefficient, t-statistic and p-value of the regressor Thera.P, and the mean of A1C10; values not recoverable]

46 Model for 10-week A1C with linear control
[Figure: tree with splits on InsulinFastpmolLBase and A1CBase; linear covariates A1CBase and FastBGBase in the terminal nodes]
Sample size, mean A1C10 and linear covariate below node. Red nodes indicate significant treatment effects.

47 [Table: for each terminal node, the coefficient, t-statistic and p-value of the linear covariate and of Thera.P, and the mean of A1C10. Node 2: A1CBase and Thera.P. Node 6: A1CBase and Thera.P. Node 7: FastBGBase and Thera.P. Values not recoverable]

48 Extension to censored response data via Poisson regression
1. Let $U_i$ and $C_i$ be the survival and censoring times of subject $i$
2. Let $Y_i = \min(U_i, C_i)$ and let $\delta_i = I(U_i < C_i)$ be the event indicator
3. Let $\Lambda_0(\cdot)$ be the baseline cumulative hazard function of the proportional hazards (PH) model
4. Estimate the coefficients of the PH model by iteratively fitting a Poisson regression model with $\delta_i$ as response and $\log \Lambda_0(y_i)$ as offset:
(a) Use the Nelson-Aalen method to get an initial estimate of $\Lambda_0(\cdot)$
(b) Use GUIDE to construct a Poisson regression tree
(c) Update $\Lambda_0(\cdot)$ with the tree
(d) Repeat steps (b) and (c) four more times
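Step (a) and the offset in step 4 can be sketched as follows. The code computes the Nelson-Aalen estimate of the cumulative hazard, the running sum of (events at t) / (number at risk at t) over distinct event times, and then evaluates its logarithm at each subject's observed time. The function names and the tiny data set are hypothetical. Returning None when the estimate is still zero (a subject censored before the first event) is a placeholder choice, since the slides do not say how that case is handled.

```python
import math


def nelson_aalen(times, events):
    """Return [(t, cumulative hazard at t)] over the distinct event times, in increasing order."""
    event_times = sorted({t for t, d in zip(times, events) if d})
    curve, cum = [], 0.0
    for t in event_times:
        at_risk = sum(1 for s in times if s >= t)
        n_events = sum(1 for s, d in zip(times, events) if s == t and d)
        cum += n_events / at_risk
        curve.append((t, cum))
    return curve


def cumulative_hazard_at(curve, y):
    """Step-function value of the estimate at time y (0 before the first event)."""
    value = 0.0
    for t, h in curve:
        if t <= y:
            value = h
    return value


def poisson_offsets(times, events):
    """log of the estimated cumulative hazard at each observed time; None where it is zero."""
    curve = nelson_aalen(times, events)
    offsets = []
    for y in times:
        h = cumulative_hazard_at(curve, y)
        offsets.append(math.log(h) if h > 0 else None)
    return offsets


# Toy data (made up): observed times and event indicators (1 = event, 0 = censored).
times = [1, 2, 2, 3]
events = [1, 1, 0, 1]
curve = nelson_aalen(times, events)
offsets = poisson_offsets(times, events)
print(curve)
print(offsets)
```

In the iterative scheme, these offsets would feed the first Poisson regression tree, after which the cumulative hazard is re-estimated from the fitted tree and the offsets are recomputed.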

49 Extension to multiple responses
Do at each node:
1. For each response variable $Y_j$, find the chi-squared of each $X$ variable
2. Choose the variable $X$ with the largest sum of chi-squared values over $j$
3. Choose the split on $X$ that yields the smallest sum of squared residuals over all response variables
Extension to correlated response variables
Apply principal components of the $Y$ variables computed locally at each node
