Predictive Model for Detection of Colorectal Cancer in Primary Care by Analysis of Complete Blood Counts


Kinar, Y., Kalkstein, N., Akiva, P., Levin, B., Half, E.E., Goldshtein, I., Chodick, G. and Shalev, V. JAMIA (2016)
Slides: Dan Coster

Colorectal Cancer (CRC)
Globally, more than 1 million people are diagnosed with colorectal cancer every year (1.4 million new cases in 2012).
2nd most common cancer in women (9.2% of diagnoses)
3rd most common in men (10% of diagnoses)
Screening tests:
  Colonoscopy (recommended above age 50)
  gFOBT (guaiac fecal occult blood test)
Many people avoid both tests, and CRC is discovered too late.
Can regular blood tests help?

Data
Data sets:
  Maccabi Health Care: persons aged 40+, between 01/2003 and 06/2011
  The Health Improvement Network (THIN), UK data: persons aged 40+, between 01/2007 and 05/2012
CRC labeling:
  Maccabi: via the Israel National Cancer Registry
  THIN: based on patients' records (documented tumors / medications)

Data (continued)
Median number of CBC tests per person: Israel, 8 in 7 years; UK, 3.

Design
Derivation (training) set: 80% of Maccabi's data (466,107 patients, 2,437 with CRC).
  Patients with other types of cancer were excluded.
  Each individual was assigned a single score, to avoid the bias of overrepresenting individuals with many blood counts.
Cases: the most recent CBC within a defined time window (3-6 months or 0-30 days) prior to CRC diagnosis was used. Individuals with no CBC in the time window were excluded.
Controls: individuals aged 50-75 (per current CRC screening guidelines). A single randomly selected CBC was used (age bias?).
Validation sets:
  Israel validation set: the other 20% (139,205 patients, 698 with CRC)
  UK validation set: THIN data (25,613 patients, 5,061 with CRC)

Data Preparation
Generated 62 features per patient:
  Age (on CBC date)
  Gender
  20 CBC values on the CBC date
  20 estimated CBC values* for 18 months before the current CBC date
  20 estimated CBC values* for 36 months before the current CBC date
*Computed by linear regression on the individual's past CBCs
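The estimated values could be computed roughly as follows. This is only a sketch: the slides do not say which time points enter the regression or how dates are encoded, so the `(day_offset, value)` history format, the function names, and the 30-day month are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:  # only one distinct date: fall back to a flat line
        return 0.0, mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_past_values(history, months_back=(18, 36), days_per_month=30):
    """Estimate one CBC parameter at fixed offsets before the current CBC.

    history: list of (day_offset, value); day_offset is in days relative to
    the current CBC date (0 = current CBC, negative = earlier tests).
    """
    slope, intercept = fit_line([d for d, _ in history], [v for _, v in history])
    return {m: intercept + slope * (-m * days_per_month) for m in months_back}


# Hypothetical hemoglobin history (g/dL), declining by 0.3 every 300 days.
hemoglobin = [(-900, 14.0), (-600, 13.7), (-300, 13.4), (0, 13.1)]
estimates = estimate_past_values(hemoglobin)
```

Repeating this for each of the 20 CBC parameters would yield the 40 estimated features listed above.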

Algorithm
Using Random Forest & Gradient Boosting.

Decision Trees (reminder)
Split (node, {examples}):
1. A ← the best feature for splitting the {examples}
2. Decision feature for this node ← A
3. For each value of A, create a new child node
4. Split the training examples among the child nodes
5. For each child node / subset:
   1. If the subset is pure: STOP
   2. Else: Split (child_node, {subset})
Splitting criterion: information gain
Quinlan, J.R., 1986. Induction of decision trees. Machine Learning, 1(1), pp. 81-106.
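The Split procedure with the information-gain criterion can be sketched in a few lines. This is a minimal illustration for categorical features only; the dictionary-based tree layout, the function names, and the toy data are assumptions, not part of the slides.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting on one categorical feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(row[feature] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def split(rows, labels, features):
    """Recursive Split(node, examples); returns a nested dict or a leaf label."""
    if len(set(labels)) == 1 or not features:  # pure subset, or nothing left to split on
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    node = {"feature": best, "children": {}}
    remaining = [f for f in features if f != best]
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["children"][value] = split(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining
        )
    return node


# Toy data: "anemia" separates the classes perfectly, "sex" carries no information.
rows = [
    {"anemia": "yes", "sex": "m"},
    {"anemia": "yes", "sex": "f"},
    {"anemia": "no", "sex": "m"},
    {"anemia": "no", "sex": "f"},
]
labels = ["crc", "crc", "healthy", "healthy"]
tree = split(rows, labels, ["sex", "anemia"])
```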

Random Forest
1. Select ntree, the number of trees to grow, and mtry, the number of variables to try at each split.
2. For i = 1 to ntree:
   1. Draw a bootstrap sample (a sample with replacement) from the data. Call the observations not in the sample the out-of-bag data.
   2. Grow a tree in which, at each node, the best split is chosen among mtry randomly selected variables. The tree is grown to maximum size and not pruned.
   3. Use the tree to predict the out-of-bag data.
3. In the end, use the predictions on the out-of-bag data to form majority votes.
4. Prediction of test data is done by majority vote over the predictions of the ensemble of trees.
Source: www.rci.rutgers.edu/~cabrera/587/
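The procedure above can be sketched in plain Python. This is an illustrative toy, not the implementation used in the study: it assumes numeric features, integer class labels, and Gini impurity as the split score (the slides do not name the criterion used inside the forest), and all function names are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def grow_tree(X, y, mtry, rng):
    """Unpruned tree; each node picks the best split among mtry random features."""
    if len(set(y)) == 1:
        return y[0]
    best = None
    for f in rng.sample(range(len(X[0])), mtry):
        for t in sorted(set(row[f] for row in X))[:-1]:  # keeps both children non-empty
            left = [i for i, row in enumerate(X) if row[f] <= t]
            right = [i for i, row in enumerate(X) if row[f] > t]
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right])) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:  # sampled features are constant here: majority leaf
        return Counter(y).most_common(1)[0][0]
    _, f, t, left, right = best
    return (f, t,
            grow_tree([X[i] for i in left], [y[i] for i in left], mtry, rng),
            grow_tree([X[i] for i in right], [y[i] for i in right], mtry, rng))


def predict_tree(tree, row):
    while isinstance(tree, tuple):  # internal node: (feature, threshold, left, right)
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


def random_forest(X, y, ntree=25, mtry=1, seed=0):
    rng = random.Random(seed)
    n = len(X)
    trees, oob_votes = [], [Counter() for _ in range(n)]
    for _ in range(ntree):
        sample = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        in_bag = set(sample)
        tree = grow_tree([X[i] for i in sample], [y[i] for i in sample], mtry, rng)
        trees.append(tree)
        for i in range(n):  # vote only on out-of-bag observations
            if i not in in_bag:
                oob_votes[i][predict_tree(tree, X[i])] += 1
    oob_pred = [v.most_common(1)[0][0] if v else None for v in oob_votes]
    return trees, oob_pred


def predict_forest(trees, row):
    return Counter(predict_tree(t, row) for t in trees).most_common(1)[0][0]


# Toy data: two informative features, class 1 for the upper half of the range.
X = [[i, 2 * i] for i in range(20)]
y = [0] * 10 + [1] * 10
trees, oob_pred = random_forest(X, y, ntree=25, mtry=1, seed=0)
```

Comparing `oob_pred` with `y` gives an out-of-bag error estimate without a separate validation set; entries are `None` for observations that were never out of bag.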

Results
On the Israel validation set:
  AUC: 0.82 ± 0.01
  OR at FP = 0.5%: 26 ± 5
  Specificity at 50% sensitivity: 88% ± 2%
On the UK validation set:
  AUC: 0.81
  OR at FP = 0.5%: 40
  Specificity at 50% sensitivity: 94%
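For readers less familiar with these metrics, the sketch below shows one common way to compute them from model scores. The exact definitions used in the study are not given in the slides, so the odds ratio here is assumed to be the usual diagnostic odds ratio (TP·TN)/(FP·FN), and the scores and labels are invented.

```python
from math import ceil


def auc(scores, labels):
    """Probability that a random case scores higher than a random control."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def specificity_at_sensitivity(scores, labels, sensitivity=0.5):
    """Specificity at the score threshold that flags the given share of cases."""
    pos = sorted((s for s, l in zip(scores, labels) if l == 1), reverse=True)
    neg = [s for s, l in zip(scores, labels) if l == 0]
    threshold = pos[ceil(sensitivity * len(pos)) - 1]
    return sum(s < threshold for s in neg) / len(neg)


def odds_ratio(scores, labels, threshold):
    """Diagnostic odds ratio when everyone scoring >= threshold is flagged."""
    tp = sum(s >= threshold and l == 1 for s, l in zip(scores, labels))
    fp = sum(s >= threshold and l == 0 for s, l in zip(scores, labels))
    fn = sum(s < threshold and l == 1 for s, l in zip(scores, labels))
    tn = sum(s < threshold and l == 0 for s, l in zip(scores, labels))
    return (tp * tn) / (fp * fn)


# Invented scores: 4 cases (label 1) and 4 controls (label 0).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
toy_auc = auc(scores, labels)
toy_spec = specificity_at_sensitivity(scores, labels, 0.5)
toy_or = odds_ratio(scores, labels, threshold=0.65)
```

In practice the threshold for "OR at FP = 0.5%" would be chosen so that 0.5% of controls score above it; the fixed threshold here is only for illustration.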

Sensitivity at a 0.5% false-positive rate (figure)

Variable importance
Goal: find out which variables are most important for the prediction.
Age is the most important contributing parameter (AUC = 0.72 with age alone).
What about the contribution of the blood variables?
Problem: some of the blood variables are highly correlated.
Solution here:
  Compute two parameters: contribution and redundancy
  Iteratively remove parameters from the model
  Evaluate performance by AUC

Computing variable importance
Repeat for each CBC variable p (example: Hg):
1. Let $\Delta_0$ be the decrease in AUC between the full model and the model without p.
2. Sort the other 19 variables by their correlation with p.
3. For k = 1, ..., 19:
   1. Remove from the data the k variables most correlated with p (but not p itself!). Construct a new partial model on these data.
   2. Let $\lambda_k$ be the decrease in AUC between the full model and the new partial model.
   3. Let $\Delta_k = \lambda_k - \Delta_0$.
4. Define the contribution of p as $\max_i \{\Delta_i\}$.
5. Define the redundancy of p as $\min \{ i : \Delta_i > \text{Threshold} \}$.
An important variable has high contribution and low redundancy.
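A literal reading of these steps can be sketched as follows. Several points are assumptions, since the slide leaves them open: the index i is taken to run from 0 to 19, variables are ranked by absolute correlation, and `auc_without` is a hypothetical callable that retrains the model without a given set of variables and returns its AUC. The toy AUC values are invented.

```python
def contribution_and_redundancy(p, corr_with_p, auc_without, threshold):
    """Contribution and redundancy of variable p, following the slide's steps.

    corr_with_p: dict mapping every other variable to its correlation with p.
    auc_without: callable taking a frozenset of removed variables, returning AUC.
    """
    full_auc = auc_without(frozenset())
    delta_0 = full_auc - auc_without(frozenset({p}))            # step 1
    others = sorted(corr_with_p, key=lambda v: abs(corr_with_p[v]),
                    reverse=True)                               # step 2
    deltas = [delta_0]
    for k in range(1, len(others) + 1):                         # step 3
        removed = frozenset(others[:k])                         # p itself is kept
        lambda_k = full_auc - auc_without(removed)
        deltas.append(lambda_k - delta_0)
    contribution = max(deltas)                                  # step 4
    above = [i for i, d in enumerate(deltas) if d > threshold]
    redundancy = min(above) if above else None                  # step 5
    return contribution, redundancy


# Invented AUCs for a 3-variable toy model (Hg plus two correlated variables).
toy_aucs = {
    frozenset(): 0.82,
    frozenset({"Hg"}): 0.81,
    frozenset({"MCH"}): 0.815,
    frozenset({"MCH", "RBC"}): 0.78,
}
contribution, redundancy = contribution_and_redundancy(
    "Hg", {"MCH": 0.9, "RBC": 0.5}, toy_aucs.__getitem__, threshold=0.02
)
```

In a real run, `auc_without` would retrain and evaluate the model on each call, which makes the procedure expensive: about 20 model fits per variable.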

Variable importance (figure)

Using the model + gFOBT
Compared the method's CRC detection rate to that of gFOBT on the Israeli cohort.
Data: 75,822 gFOBT tests for 63,847 individuals, compared to 210,923 individuals with CBCs.
The gFOBT positive rate was 5%, detecting 170 CRC cases.
The model discovered 252 CRC cases.
By considering individuals who were identified either by the model or by gFOBT, the number of CRC cases detected would increase by 115%: from 170 to 365.
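The 115% figure corresponds to a baseline of the 170 cases detected by gFOBT alone, as a quick check shows. The overlap is derived by inclusion-exclusion and assumes that all three counts refer to the same set of CRC cases, which the slide does not state explicitly.

```python
gfobt_cases = 170   # CRC cases detected by gFOBT
model_cases = 252   # CRC cases detected by the model
either_cases = 365  # CRC cases detected by the model or by gFOBT

# Relative increase over gFOBT alone, in percent.
increase_pct = (either_cases - gfobt_cases) / gfobt_cases * 100

# Cases flagged by both methods (inclusion-exclusion).
both_cases = gfobt_cases + model_cases - either_cases
```

Under that assumption, 57 cases were found by both methods, so most of the model's detections were cases that gFOBT missed.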

Performance on other cancers
Does the same model detect other cancer types?
Applied the model and examined the sensitivity at an FP rate of 3% (shifted from 0.5% to get reliable results on less common cancers).

Early predictions?
Results using only CBCs performed 0-2 months, 2-4, ..., 20-22 months prior to CRC diagnosis

http://www.israelhayom.co.il/article/436995 (12/2016)

Recap
CRC predictor based on age, gender, and temporal blood measurements
Uses decision tree ensemble methods: RF, gradient boosting
Feature engineering is key to the results
Results improve over FOBT and anemia guidelines
Can detect risk of CRC 3-6 (12?) months ahead of time
Great potential