PRINCIPLES OF EFFECTIVE MACHINE LEARNING APPLICATIONS IN REAL-WORLD EVIDENCE

Prepared and Presented by:
Gorana Capkun-Niggli, PhD, Global Head of Innovation, Health Economics and Outcomes Research, Novartis, Basel, Switzerland
Sreeram Ramagopalan, PhD, Director, Centre for Observational Research and Data Sciences, Bristol-Myers Squibb, London, UK
Andrew Cox, PhD, Research Scientist, Evidera, London, UK
David Vanness, PhD, Associate Professor, Department of Population Health Sciences, University of Wisconsin, Wisconsin, USA

2018 Evidera. All Rights Reserved.

Introduction and Background

Why are we hearing so much about Machine Learning?

Moore's Law (1965)
- Computing power is rapidly increasing
Data Flood
- Greater availability of real-world structured and unstructured data
Incidentally collected data
- Not collected for a specific question or an experiment, e.g. a clinical trial
Messiness of Data (Sets)
- Statistical assumptions often violated
- More outliers
- More variables than observations

What is Machine Learning? Is it all new to us?

Algorithmic techniques aiming at discovering data structure and relationships between variables.

Statistics      Machine Learning
Estimation      Learning
Classifier      Hypothesis
Data point      Example/Instance
Regression      Supervised Learning
Clustering      Unsupervised Learning
Covariate       Feature
Response        Label

Some interesting ideas: Training and testing independently; out of black box with decision trees

1. All patients
2. Split into training and test sets / independent set
3. Build a model based on the training data set (predictive algorithm)
4. Apply the predictive model to the test patients

The trained predictive model is tested on a test/independent set, e.g. using root mean square error (RMSE).

Use decision trees to make results actionable: jumping out of black-box

The High Value Subgroup method, in combination with medical expertise, can optimize treatment choice for patients.

Classical Approach (pre-defined univariate analyses)
- the usual approach
- subgroup is defined by one variable
- driven by prior medical knowledge
However,
- existing subgroups may not be detected

Higher Value Subgroups (Machine Learning)
- uses all collected data
- explores complex, nonlinear structures in data
- generates new hypotheses
However,
- subgroups may not have a straightforward medical interpretation
- paradigm shift towards post hoc nature of analyses

Workflow shown on the slide: Data splits 1000 times; Modeling samples; Develop predictive models; Validation samples; Treatment difference curve; Decision tree; HVS rules

Li J, Zhao L, Tian L, et al. A predictive enrichment procedure to identify potential responders to a new therapy for randomized, comparative controlled clinical studies. Biometrics. 2016;72(3):
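The four-step split, train and test procedure on the first slide above can be sketched in a few lines. This is a minimal illustration rather than the presenters' method: the synthetic "patients", the one-variable least-squares model and the helper names (train_test_split, fit_line, rmse) are all assumptions made for the example, and only the Python standard library is used.

```python
import math
import random

def train_test_split(rows, test_fraction=0.3, seed=0):
    """Step 2: shuffle the patients and hold out a test set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_line(train):
    """Step 3: fit y = slope * x + intercept by least squares on training data only."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def rmse(model, rows):
    """Step 4: root mean square error of the model on the given rows."""
    slope, intercept = model
    squared = [(y - (slope * x + intercept)) ** 2 for x, y in rows]
    return math.sqrt(sum(squared) / len(squared))

# Step 1: "all patients" (hypothetical data: outcome = 2x + 1 plus noise).
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(200)]
patients = [(x, 2.0 * x + 1.0 + rng.gauss(0, 1.0)) for x in xs]

train, test = train_test_split(patients)
model = fit_line(train)
train_rmse = rmse(model, train)
test_rmse = rmse(model, test)   # the figure to report: error on unseen patients
print(len(train), len(test), round(train_rmse, 2), round(test_rmse, 2))
```

The design point is that the test patients play no part in fitting, so the test RMSE is an estimate of performance on new patients.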

Seeking your opinion

Situation
A pharma company announces that its drug candidate ABC123, in a highly prevalent patient population, reduces mortality by 25% compared to placebo in the confirmatory Ph III RCT. The result is consistent across all predefined subpopulations.

A machine learning algorithm identifies a subpopulation with clear additional benefit
- A machine learning algorithm recently published in JASA identifies a subgroup of patients (e.g. moderate or severe patients who are hypertensive) for which mortality is reduced by 40%.
- This Higher Value Subgroup (HVS) outperforms the rest of the patients in all other relevant clinical endpoints with a similar safety profile.
- The size of the subpopulation is 30% of the total population.

Question 1: As a payer
A. I would welcome these results and give access to the subpopulation only
B. I would welcome these results but give access to all patients with same conditions
C. I would welcome these results and offer differential access for the subpopulation
D. I would ignore these findings in my decision making

Question 2: Complication

Same results as before, but our current medical and biological knowledge cannot explain why the selected subgroup would have increased benefit compared to the rest of the study population.

As a payer
A. I would welcome these results and give access to the subpopulation only
B. I would welcome these results but give access to all patients with same conditions
C. I would welcome these results and offer differential access for the subpopulation
D. I would ignore these findings in my decision making

ML versus Traditional Statistics

When should machine learning be applied?

Looking at the original risk model development and purpose

Performance of the algorithms

Top risk factor variables for CVD algorithms

Cannot forget good study design.

Cannot forget bias and confounding.

These issues also apply to non-ML studies

Key Concepts

The Five Key Questions for ML Projects

1. Is ML really needed/appropriate?
   - Is there a simpler, more appropriate approach?
   - Is the choice of ML more about ML than the research question?
2. Are there imbalanced classes?
   - How rare/common are the cases being predicted?
3. Data Leakage?
   - When information from outside the training dataset is used to create the model
4. Reporting Performance
   - Is the reporting adequate and fair?
5. Overfitting
   - Single holdout method is no longer acceptable

An Appropriate Approach?

Is there a good reason to use ML?
- Is the dataset small with few variables?
- Is the interest in inference rather than prediction?
- Is the data sufficiently self-contained, or relatively insulated from outside influences?
  o If the problem is likely to change in the future, then the ML model will no longer predict well
  o Example: Google Flu Trends Prediction
- When the data are not generalizable?
- Occam's Razor Test
  o If you can get equal performance from two models, then you should select the simpler approach

Imbalanced Classes

What is it?
- Where the class you are trying to predict is either rare or the large majority of cases
- ML algorithms work ideally with balanced data (50% of A and 50% of B)
- Imbalanced classes are very common!

Why is it a problem?
- Standard accuracy no longer reliably measures performance
- Example: If you are predicting a rare disease state (1% of the population), then you can have great apparent performance by just predicting all cases as "No Disease", i.e., just voting with the majority class. Therefore accuracy is very misleading

How can this be addressed?
- Oversampling
- Use appropriate performance metrics (ROC curve)
- Use penalized methods
- Tree-based ensemble methods (e.g. Random Forest, Gradient Boosting)
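The rare-disease example above can be checked with simple arithmetic. The sketch below assumes a hypothetical cohort of 1,000 patients with 1% prevalence and a "model" that always predicts "No Disease"; the cohort size, the helper name confusion_counts and the use of balanced accuracy as the contrasting metric are illustrative choices, not taken from the slides.

```python
def confusion_counts(truth, predicted):
    """Return (TP, FN, FP, TN) for binary labels coded 1 = disease, 0 = no disease."""
    tp = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(truth, predicted) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(truth, predicted) if t == 0 and p == 0)
    return tp, fn, fp, tn

# Hypothetical cohort: 10 diseased patients out of 1,000 (1% prevalence).
truth = [1] * 10 + [0] * 990
# A "model" that simply votes with the majority class.
predicted = [0] * 1000

tp, fn, fp, tn = confusion_counts(truth, predicted)
accuracy = (tp + tn) / len(truth)
sensitivity = tp / (tp + fn)          # true positive rate
specificity = tn / (tn + fp)          # true negative rate
balanced_accuracy = (sensitivity + specificity) / 2

print(accuracy, sensitivity, balanced_accuracy)
```

Accuracy looks excellent while sensitivity is zero: the model finds none of the patients it was built to find, which is why a metric that accounts for both classes is needed.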

Data Leakage

Data leakage is when information from outside the training dataset is used to create the model
- Can cause overoptimistic prediction performance, leading to invalid results

How do I know if I have data leakage?
- The performance results are too good to be true!
- Examples:
  o Develop model on the training dataset, test it on the training set, decide if you need to change something based on the results, adjust model and retest on the training set
  o Standardize your variables across the entire dataset, then create your training and test datasets

How can I avoid data leakage?
- Perform data preparation separately within each cross-validation fold
- Create a separate validation set early on, and only use it once

Reporting Performance

Confusion matrix should be shown

                   Predicted Status = 1      Predicted Status = 0
True Status = 1    2,221 (True Positives)    364 (False Negatives)
True Status = 0    (False Positives)         1,015 (True Negatives)

- Gives the most comprehensive view of model performance
- All other performance measures are summaries of these data

Summary Measures
- F Measure / F Statistic
- Accuracy
- True Positive Rate
- Out of Bag Error Rate

Receiver Operating Characteristic (ROC) curve
- Useful for assessment of model performance
- Shows the trade-offs for different probability thresholds
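The standardization example of data leakage above can be made concrete. In this minimal sketch the toy values and the helper names (fit_scaler, apply_scaler) are assumptions for illustration: the leaky version estimates the mean and standard deviation from all rows, while the leakage-free version estimates them from the training rows only and then applies them unchanged to the test rows.

```python
import statistics

def fit_scaler(values):
    """Estimate the standardization parameters (mean, standard deviation)."""
    return statistics.mean(values), statistics.stdev(values)

def apply_scaler(values, scaler):
    """Standardize values with parameters that were estimated elsewhere."""
    mean, sd = scaler
    return [(v - mean) / sd for v in values]

# Hypothetical covariate values, already split into training and test rows.
train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [10.0, 12.0]

# Leaky: the test rows influence the parameters used to prepare the training data.
leaky_scaler = fit_scaler(train + test)

# Leakage-free: parameters come from the training rows only.
clean_scaler = fit_scaler(train)
train_z = apply_scaler(train, clean_scaler)
test_z = apply_scaler(test, clean_scaler)

print(leaky_scaler[0], clean_scaler[0])
```

Within cross-validation the same rule applies per fold: fit the preparation step on that fold's training rows, then apply it to the held-out rows.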

Overfitting

What is overfitting?
- Overfitting refers to a model that models the training data too well
- Overfitting happens when a model learns the noise in the training data to the extent that it negatively impacts the performance of the model on new data
- The problem is that overfitting negatively impacts the model's ability to generalize to new data

How do I know when my model is overfitting?
- When you get a large difference in performance on training and testing datasets
  o e.g. 90% accuracy on the training dataset and only 50% accuracy on the test dataset

How can overfitting be avoided?
- Follow proper training and testing approaches
- K-fold cross-validation
- Hold back a validation dataset and use it only once
- Avoid data leakage

Hyperparameters / Tuning Parameters

What are hyperparameters?
- Hyperparameters are model-specific settings that can be tuned to control the behavior of an ML algorithm
- Example:
  o Random Forest: number of trees, number of features
  o SVM: C value and kernel parameters
- Often overlooked!

Why are they important?
- Can lead to better model performance, but also overfitting

How do I determine the best hyperparameters for my case?
- Grid searches
- Optimization algorithms
- Automated tuning algorithms
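K-fold cross-validation, a grid search over a hyperparameter, and a validation set that is used only once can be combined as in the sketch below. Everything here is an illustrative assumption rather than the presenters' setup: the data are synthetic, the model is a one-dimensional k-nearest-neighbour regressor written from scratch, and the hyperparameter being tuned is its number of neighbours k.

```python
import random

def knn_predict(train, x, k):
    """Average the outcomes of the k training rows closest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def mse(train, rows, k):
    """Mean squared error on rows for a model 'fitted' on train."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in rows) / len(rows)

def k_fold_indices(n, n_folds, seed=0):
    """Shuffle row indices and deal them into n_folds disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]

def cross_validated_mse(rows, k, n_folds=5):
    """Average held-out error over the folds for one hyperparameter value."""
    scores = []
    for held_out in k_fold_indices(len(rows), n_folds):
        held = set(held_out)
        train = [r for i, r in enumerate(rows) if i not in held]
        valid = [rows[i] for i in held_out]
        scores.append(mse(train, valid, k))
    return sum(scores) / len(scores)

# Hypothetical data: outcome = x squared plus noise.
rng = random.Random(1)
xs = [rng.uniform(-3, 3) for _ in range(120)]
rows = [(x, x * x + rng.gauss(0, 2.0)) for x in xs]
development, final_validation = rows[:100], rows[100:]   # hold back 20 rows

grid = [1, 3, 5, 9, 15, 31]                                # grid search over k
cv_scores = {k: cross_validated_mse(development, k) for k in grid}
best_k = min(cv_scores, key=cv_scores.get)

train_mse_k1 = mse(development, development, 1)   # k=1 memorizes the training data
final_mse = mse(development, final_validation, best_k)    # used once, at the end
print(best_k, train_mse_k1, round(cv_scores[1], 2), round(final_mse, 2))
```

With k = 1 the training error is zero because each row is its own nearest neighbour, yet the cross-validated error is far from zero. That gap between training and held-out performance is the overfitting signal described above.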

Practical Example

Application: Estimating Propensity Scores for the Receipt of Allogeneic Hematopoietic Cell Transplantation (AlloHCT) in Outcomes Research Using Claims Data: A Machine Learning Approach

Vanness DJ, Preussler JM, Burns LJ, Denzen EM, Leppke SN, Majhail NS, Mupfudze T, Saber W, Silver A, Steinert P, Mau LW. Biology of Blood and Marrow Transplantation Mar 1;24(3):S

Approach

- Propensity scores can be used to adjust for observed confounders that affect both selection of alloHCT and outcomes.
- Machine learning may offer advantages over logistic regression when there are many possible predictors relative to observations, or when interactions and discontinuities may be important but would be difficult to pre-specify.
- Using Optum Clinformatics Data Mart ( ), we identified patients age 20 receiving alloHCT (n=278) or chemotherapy only (n=570)
  o Demographics: age, gender, region (Midwest, Northeast, South, West, Unknown and combinations), and year of diagnosis
  o Insurance: payer (commercial, Medicare, combined commercial and Medicare), product (exclusive provider organization [EPO], point of service [POS], preferred provider organization [PPO], health maintenance organization [HMO], indemnity [IND], other, and combinations)
  o Baseline health status: Elixhauser Comorbidity Index, comorbidity indicators (hypertension, diabetes, coagulopathy, electrolyte imbalance, anemia), and time from diagnosis to chemotherapy initiation
- Compared propensity scores constructed with logistic regression (R function glm) to extreme gradient boosting (R package xgboost) with stochastic resampling and feature selection, tuned to maximize 10-fold cross-validated log-likelihood.

Illustration of the Boosting Algorithm

Balance Results

Stabilized inverse propensity-weighted datasets constructed using either logistic regression or xgboost each produced no covariates having an absolute standardized difference greater than 0.2, indicating that both approaches achieved acceptable covariate balance.

Distribution of Treatment Selection Errors

The xgboost algorithm outperformed logistic regression in terms of accurately predicting receipt of alloHCT, with both fewer false positives and fewer false negatives.
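The balance check described above (stabilized inverse propensity weights, then absolute standardized differences compared with 0.2) can be sketched as follows. This is an illustration under stated assumptions, not the study's code: the cohort is a hypothetical one with a single binary covariate and known propensity scores, the helper names are invented, and the standardized difference uses weighted group variances, which is one of several conventions in use.

```python
import math

def stabilized_weights(treated, propensity):
    """Treated: P(T=1) / e(x). Untreated: P(T=0) / (1 - e(x))."""
    p_treat = sum(treated) / len(treated)
    return [p_treat / e if t == 1 else (1 - p_treat) / (1 - e)
            for t, e in zip(treated, propensity)]

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def weighted_var(values, weights):
    m = weighted_mean(values, weights)
    return sum(w * (v - m) ** 2 for v, w in zip(values, weights)) / sum(weights)

def abs_std_diff(x, treated, weights):
    """Absolute standardized difference of covariate x between treatment groups."""
    xt = [v for v, t in zip(x, treated) if t == 1]
    wt = [w for w, t in zip(weights, treated) if t == 1]
    xc = [v for v, t in zip(x, treated) if t == 0]
    wc = [w for w, t in zip(weights, treated) if t == 0]
    pooled_sd = math.sqrt((weighted_var(xt, wt) + weighted_var(xc, wc)) / 2)
    return abs(weighted_mean(xt, wt) - weighted_mean(xc, wc)) / pooled_sd

# Hypothetical cohort: covariate x = 1 raises the chance of treatment to 0.8,
# x = 0 lowers it to 0.2 (these propensities are assumed known here).
x          = [1] * 80 + [1] * 20 + [0] * 20 + [0] * 80
treated    = [1] * 80 + [0] * 20 + [1] * 20 + [0] * 80
propensity = [0.8] * 100 + [0.2] * 100

unweighted = abs_std_diff(x, treated, [1.0] * len(x))
weights = stabilized_weights(treated, propensity)
weighted = abs_std_diff(x, treated, weights)

print(round(unweighted, 3), round(weighted, 3), weighted <= 0.2)
```

Before weighting the covariate is badly imbalanced; after weighting with the correct propensity scores the difference falls well inside the 0.2 threshold quoted above.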

Investigating performance under less-than-ideal circumstances

Our AlloHCT application demonstrated that boosting produced more accurate propensity scores with real-world data, but we do not know for sure that using it results in less-biased estimates of cost.

We conducted a Monte Carlo simulation using 3,000 replicated datasets where we do know the answer.
- Including extraneous variables that have no effect on selection or outcomes
- Unobservable confounders

T = \mathbf{1}\left[ \sum_{m=1}^{10} F_m(X) + \varepsilon_1 > 0 \right]

Y = T + \sum_{j=1} \left[ \Delta_j F_j(X) + (1 - \Delta_j)\, \beta_j X_j \right] + \sum_{k=11} \beta_k X_k + \varepsilon_2

Functions used in the simulation:
- F1(X1): Linear
- F2(X2): Threshold
- F3(X3): Range-Discontinuous
- F4(X4,X5): Range-Interaction
- F5(X6,X7,X8): Range-Interaction
- F6(X9): Exponential
- F7(X10,X11): Signed Interaction
- F8(X12,X13): Multiplicative Interaction
- F9(X14): Squared
- F10(X15): Cubed

Diagram labels: T; Y; ε1; ε2; X11-X25; X16-X25 Always Observed; F(X) Always Unobserved; X1-X15 Elements Observable at Random. Solid: Relationship Always Present. Dashed: Relationship Sometimes Present.

Treatment Effect Estimation Error

- The distributions of errors for the estimated treatment effects when using the boosted propensity score indicate substantially reduced bias at the cost of slightly higher variance when compared to the same approach with logit propensity scores
- Overall, the best approach appears to be to include boosted propensity scores as a regressor.

Difference in RMSE between Boost and Logit (including PS as a regressor)

- The average pairwise difference in RMSE was lower for the boosting approach (compared to logit) when the propensity score is included as a regressor.
- RMSE was lower with the boosted propensity score in 81.4% of simulations.
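The trade-off described above follows from the standard identity RMSE squared = bias squared + variance: a large reduction in bias can outweigh a small increase in variance. The sketch below illustrates this with two short lists of estimation errors. The numbers are invented for the illustration and are not the simulation's results; the function name error_summary is also hypothetical.

```python
import math
import statistics

def error_summary(errors):
    """Return (bias, variance, RMSE) for a list of estimation errors."""
    bias = statistics.fmean(errors)
    variance = statistics.pvariance(errors)
    rmse = math.sqrt(statistics.fmean([e * e for e in errors]))
    return bias, variance, rmse

# Invented errors (estimate minus truth) across four replications.
logit_errors = [0.9, 1.0, 1.1, 1.0]     # large bias, very small spread
boost_errors = [-0.3, 0.5, 0.1, 0.5]    # small bias, larger spread

logit_bias, logit_var, logit_rmse = error_summary(logit_errors)
boost_bias, boost_var, boost_rmse = error_summary(boost_errors)

# The decomposition holds for each method: RMSE^2 = bias^2 + variance.
print(round(logit_rmse ** 2, 4), round(logit_bias ** 2 + logit_var, 4))
print(round(boost_rmse ** 2, 4), round(boost_bias ** 2 + boost_var, 4))
```

In this toy case the second method has the higher variance but the lower RMSE, which is the pattern the slide reports for the boosted propensity score.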

Exploring what drove our results

- Used randomForest to investigate which factors varying from replication to replication led to better (or worse) relative performance of boosting compared to logit (for PS included as a regressor)
- Partial dependence plots show the partial relationship of each factor to comparative performance (lower values favor boosting over logit)

Summary

- Boosting is a particularly promising method for constructing propensity scores to adjust for confounding based on observables.
- For confounding based on unobservables, the next step will be to explore its performance in two-stage modeling with instruments
- Other even-more-promising approaches may await (including super-learners)

Other potential applications in HEOR:
- High dimensional risk prediction for treatment decisions and resource allocation
- Risk adjustment for fair contracting and quality reporting
- Exploration of treatment effect heterogeneity / treatment personalization

Limitations:
- Black box
- Inability to hypothesis test (though this may potentially be overcome with Targeted Machine Learning; see, e.g., Van der Laan MJ, Rose S. Targeted learning: causal inference for observational and experimental data. Springer Science & Business Media; 2011 Jun 17).

Reviewing an Abstract

Reviewing an Abstract Using the 5 Key Concepts

1. Is ML really needed/appropriate?
2. Are there imbalanced classes?
3. Data Leakage?
4. Reporting Performance?
5. Overfitting?
6. Score your faith in this research on a scale of 0 to 10, where 0 is no faith at all and 10 is complete faith and trust in the results, conclusions and methods

Is ML Really Needed/Appropriate?

Objective: Predict probability of death (POD) so that patients at high risk can be admitted to the hospital, while patients at low risk can be treated as outpatients. In total, 14,199 patients from the MCHD met all the eligibility criteria and composed the study cohort. There are 250 features describing each patient.

Are there Imbalanced Classes?

Data: 10.86% of the patients in the dataset (1,542 patients) died from pneumonia.
Methods: Random Forest was performed using a training and testing approach. The train set contains 9,847 patients and the test set has 4,352 patients (a 70%:30% train-test split)

Data Leakage?

The train set contains 9,847 patients and the test set has 4,352 patients (a 70%:30% train-test split).
- Does not say anything about data preparation

Is Performance Reporting Adequate?

Results: The AUC for the Random Forest method was Overall predictive accuracy was 0.82 and a false positive rate was
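One quick check a reviewer can run on the figures in this abstract is to compare the reported accuracy of 0.82 with the accuracy of always predicting the majority class. The sketch below uses only the counts stated in the abstract (1,542 deaths among 14,199 patients); it assumes, as a simplification, that the test set has the same death rate as the full cohort, which the abstract does not state.

```python
# Counts stated in the abstract.
n_patients = 14_199
n_deaths = 1_542
n_train, n_test = 9_847, 4_352
reported_accuracy = 0.82

prevalence = n_deaths / n_patients
# Accuracy of a rule that predicts "survives" for every patient
# (assumption: the test set mirrors the cohort's death rate).
majority_baseline = 1 - prevalence
test_share = n_test / n_patients

print(round(prevalence, 4), round(majority_baseline, 4), round(test_share, 4))
print(reported_accuracy < majority_baseline)
```

Under that assumption the trivial rule is right about 89% of the time, so an overall accuracy of 0.82 does not by itself show useful prediction; this is the imbalanced-classes point applied to a concrete abstract.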

Was There Enough Attention to Overfitting?

The train set contains 9,847 patients and the test set has 4,352 patients (a 70%:30% train-test split). There are 250 features describing each patient.
- No information on model tuning (hyperparameters)

Do you Trust the Research? (Scale 0 to 10)

Conclusions: Having a history of asthma prior to admission reduced the probability of death as an outcome of the pneumonia. It may be that asthma medication or the heightened respiratory awareness in asthma patients leads to a better outcome in those individuals.

Conclusions: Overall, the Random Forest algorithm used in this work showed a high level of predictive performance, indicating it can be a useful tool to predict the probability of death from pneumonia at time of admission to hospital.
