Application of Local Control Strategy in analyses of the effects of Radon on Lung Cancer Mortality for 2,881 US Counties


Bob Obenchain, Risk Benefit Statistics, August 2015

Our motivation for using a cut-point of 2.6 pCi/L for the Radon level that defines a High-Low "Treatment" dichotomy is given by the initial binary split in the following Partition Regression (Tree) model.

Partition of (unadjusted) Lung Cancer Mortality on Radon

RSquare   RMSE        N      Number of Splits   AICc
0.157     16.200826   2881   1                  24229.5

Node          Count   Mean        Std Dev
All Rows      2881    78.126974   17.650253
Radon>=2.6    1220    69.962838   16.609661
Radon<2.6     1661    84.12351    15.903853

Initial split: LogWorth 108.34819, Difference 14.1607

"Low Radon": level strictly less than 2.6 pCi/L (picocuries per liter).
"High Radon": level of 2.6 pCi/L (picocuries per liter) or greater.

We will see below that higher Radon levels are associated with lower Lung Cancer Mortality rates. Neither this analysis nor the Ln[Rn] fits that follow have been "covariate adjusted" for possible X-confounding factors included in the datasets being analyzed here.
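The initial binary split above can be illustrated with a minimal sketch. It assumes a plain sum-of-squared-errors criterion as a stand-in; JMP's Partition platform ranks candidate splits by LogWorth, which this sketch does not compute. The function name `best_binary_split` is hypothetical. The last lines only re-derive the reported Difference and overall mean from the two node summaries.

```python
def best_binary_split(x, y):
    """Return (cut, sse) for the split 'x < cut' vs 'x >= cut' with the smallest total SSE."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    total = sum(v for _, v in pairs)
    total_sq = sum(v * v for _, v in pairs)
    best_cut, best_sse = None, float("inf")
    left_sum = left_sq = 0.0
    for i in range(1, n):
        v = pairs[i - 1][1]
        left_sum += v
        left_sq += v * v
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no cut-point can separate tied x values
        right_sum = total - left_sum
        right_sq = total_sq - left_sq
        # SSE of a group = sum of squares minus (sum squared / group size)
        sse = (left_sq - left_sum ** 2 / i) + (right_sq - right_sum ** 2 / (n - i))
        if sse < best_sse:
            best_cut, best_sse = pairs[i][0], sse
    return best_cut, best_sse


# Node summaries copied from the Partition output above.
n_high, mean_high = 1220, 69.962838   # Radon >= 2.6
n_low, mean_low = 1661, 84.12351      # Radon < 2.6

difference = mean_low - mean_high
overall_mean = (n_high * mean_high + n_low * mean_low) / (n_high + n_low)
print(round(difference, 4), round(overall_mean, 4))
```

Applied to the county-level Radon and mortality columns (not reproduced here), a search of this kind would be expected to land near the 2.6 pCi/L cut-point if the SSE and LogWorth criteria agree on the best split.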

Prediction of Lung Cancer Mortality from Ln[Rn], unadjusted for all other X-confounders

Ln[Rn] = natural logarithm of Radon level. Here, 10 US counties with Radon level coded as "0.0" have been Winsorized in the dataset to Ln[0.05] = -2.9957. The cut-point at Radon = 2.6 pCi/L (Ln[Rn] = 0.9555) is used in the fits displayed here only to color counties either Red or Blue.

Linear Fit: Lung Cancer Mortality = 82.89 - 7.1703 * Ln[Rn]

RSquare                      0.168045
RSquare Adj                  0.167756
Root Mean Square Error       16.10187
Mean of Response             78.12697
Observations (or Sum Wgts)   2881

Analysis of Variance
Source     DF     Sum of Squares   Mean Square   F Ratio    Prob > F
Model      1      150771.65        150772        581.5233   <.0001*
Error      2879   746438.87        259
C. Total   2880   897210.52

Parameter Estimates
Term        Estimate   Std Error   t Ratio   Prob>|t|
Intercept   82.89048   0.359184    230.77    <.0001*
Ln[Rn]      -7.17027   0.297339    -24.11    <.0001*

Smoothing Spline Fit, lambda=5
R-Square               0.185349
Sum of Squares Error   730913.5
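As a small check on the figures above, the following sketch applies the Winsorized log transform, evaluates the fitted line, and recomputes RSquare and the F Ratio from the reported sums of squares. The function names are hypothetical, and the rule "replace a coded 0.0 by 0.05" is taken directly from the text; nothing is assumed about other small Radon values.

```python
import math

RADON_FLOOR = 0.05  # value substituted for counties whose Radon level is coded 0.0


def ln_radon(radon):
    """Natural log of the Radon level, with coded zeros replaced by RADON_FLOOR."""
    return math.log(RADON_FLOOR if radon == 0.0 else radon)


def predicted_mortality(radon, intercept=82.89048, slope=-7.17027):
    """Fitted line from the Parameter Estimates table (unadjusted for X-confounders)."""
    return intercept + slope * ln_radon(radon)


# Sums of squares and degrees of freedom copied from the Analysis of Variance table.
ss_model, ss_error, ss_total = 150771.65, 746438.87, 897210.52
df_model, df_error = 1, 2879

r_square = ss_model / ss_total
f_ratio = (ss_model / df_model) / (ss_error / df_error)

print(round(ln_radon(0.0), 4), round(ln_radon(2.6), 4))
print(round(r_square, 6), round(f_ratio, 2))
print(round(predicted_mortality(2.6), 2))  # fitted mortality at the cut-point
```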

Output from the "Local Control" JMP Add-In

Outcome Variable:         Lung Cancer Mortality
Treatment Variable:       Radon Level Flag
Cluster Effect Type:      Fixed
Variability Assumption:   Homoskedastic
Random Number Seed:       12345
Number of Clusters:       50
Number of Permutations:   50

[Figure: LTD distribution for 50 clusters; axis label Mean_LTD]

Hierarchical Clustering Method = Fast Ward

Clustering variables: Obesity (%), Currently Smoke, Age Over 65 (%)

[Figure: Dendrogram of Hierarchically Clustered Differences]

Response Lung Cancer Mortality: Nested ANOVA (Treatment within Cluster)

RSquare                      0.460825
RSquare Adj                  0.441832
Root Mean Square Error       13.18662
Mean of Response             78.12697
Observations (or Sum Wgts)   2881

Analysis of Variance
Source     DF     Sum of Squares   Mean Square   F Ratio   Prob > F
Model      98     413456.80        4218.95       24.2626   <.0001*
Error      2782   483753.72        173.89
C. Total   2880   897210.52

Effect Tests
Source                Nparm   DF   Sum of Squares   F Ratio   Prob > F
Cluster               49      49   230589.37        27.0630   <.0001*
High Radon[Cluster]   49      49   58556.58         6.8725    <.0001*

NOTE: Cluster #10 is uninformative about Lung Cancer Mortality LTDs (High minus Low Radon) because all 11 US counties it contains have Radon levels less than 2.6 pCi/L (picocuries per liter). This explains why there are only 49 (rather than 50) Degrees-of-Freedom for Treatment-within-Cluster (LTD) effects. Only 49 Degrees-of-Freedom are attributed to main-effects within 50 Clusters by convention; the overall mean effect for mortality is simply removed and not shown in the ANOVA table.

Although it may not be obvious from the entries in the Nested ANOVA table above, the results depend upon the choice of the High-Low Radon cut-point (2.6 pCi/L here) as well as the numbers of both requested and informative clusters (50 and 49, respectively). Specifically, the y-outcome column vector here has 2,881 rows for US counties and consists of Lung Cancer Mortality rates being viewed as realizations of a continuous random variable. Furthermore, the "design" matrix has only two non-constant columns viewed as fixed (given) categorical variables:

1. The vector of treatment indicators has 2 levels - say, zeros (Low Radon) and ones (High Radon).
2. The vector of cluster membership indicators has 50 levels - say, the integers 1 through 50.

The analysis of this cross-classification of mortality rates is essentially nonparametric because no information is used on either how clusters were formed / defined from county X-characteristics or what the numerical values of X-characteristics are.
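The treatment-within-cluster structure described above can be sketched as follows: given only the two categorical columns (cluster membership and the High/Low flag) and the outcome, compute one Local Treatment Difference per cluster and mark clusters that lack one of the two arms as uninformative. This is an illustrative sketch, not the JMP Add-In's code; the function name is hypothetical. The final lines repeat the degrees-of-freedom bookkeeping from the note.

```python
from collections import defaultdict


def local_treatment_differences(clusters, high_flags, outcomes):
    """Return {cluster: LTD}, where LTD = mean(High) - mean(Low); None if an arm is empty."""
    sums = defaultdict(lambda: [0.0, 0, 0.0, 0])  # low_sum, low_n, high_sum, high_n
    for cluster, high, y in zip(clusters, high_flags, outcomes):
        s = sums[cluster]
        if high:
            s[2] += y
            s[3] += 1
        else:
            s[0] += y
            s[1] += 1
    ltds = {}
    for cluster, (low_sum, low_n, high_sum, high_n) in sums.items():
        if low_n == 0 or high_n == 0:
            ltds[cluster] = None  # uninformative, like Cluster #10 in the text
        else:
            ltds[cluster] = high_sum / high_n - low_sum / low_n
    return ltds


# Degrees-of-freedom bookkeeping, using the counts stated in the text.
n_counties, n_clusters, n_informative = 2881, 50, 49
df_cluster = n_clusters - 1            # overall mean removed
df_treatment_within = n_informative    # one LTD per informative cluster
df_model = df_cluster + df_treatment_within
df_error = n_counties - 1 - df_model
print(df_model, df_error)
```

In a toy call with three clusters where the second contains only Low Radon counties, the function returns `None` for that cluster, which mirrors how Cluster #10 drops out of the LTD analysis.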

Aggregate Phase: Observed LTD Distribution (49 Informative Clusters containing 2,870 US Counties)

[Figure: histogram of the Observed Local Treatment Difference (LTD) Distribution for 50 Ward Clusters]

Lung Cancer Mortality is measured in Deaths per 100,000 Person-Years. LTDs are differences in Mortality rates: Radon High minus Low.

The histogram depicts the Most Typical LTD Distribution derived from micro-aggregation of 2,881 US Counties on 3 primary X-confounders:
o Age Over 65 %
o Currently Smoke %
o Obesity %

Y-outcome = Lung Cancer Mortality
Binary Treatment Indicator: Radon High (at least 2.6 pCi/L) vs. Low

The best-fitting Normal approximation has mean µ = -8.52 deaths and std. dev. σ = 4.902.

Confirm Phase: Comparison of empirical Cumulative Distribution Functions (CDFs)

[Figure: empirical CDFs of the Random Permutation LTD-like Distribution and the Observed LTD Distribution]

These two distributions are rather clearly different; they differ most on statistical measures of location and shape (skewness, kurtosis, range). See also the histograms and summary statistics that follow. This means that clustering (local conditioning, matching) on 3 primary X-confounders [% over 65, % currently smoke and % obese] has indeed yielded appropriately adjusted treatment effect-size estimates.

Local treatment effect-size estimates are LTDs expressed as a difference in mortality rates (deaths per 100,000 person-years) of the form: Average for one-or-more High Radon counties minus Average for one-or-more Low Radon counties.
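The comparison above can be sketched in a few lines. The sketch assumes that the "Random Permutation" distribution is built by shuffling cluster labels across counties (keeping cluster sizes fixed) and recomputing LTDs, and that each county carries the LTD of its cluster; both are assumptions about the Add-In's behavior, suggested by the reported N values, rather than details stated in the text. All function names are hypothetical.

```python
import random
from bisect import bisect_right


def county_ltds(clusters, high_flags, outcomes):
    """Per-county LTDs: every county in an informative cluster gets its cluster's LTD."""
    arms = {}
    for cluster, high, y in zip(clusters, high_flags, outcomes):
        arm = arms.setdefault(cluster, {True: [], False: []})
        arm[bool(high)].append(y)
    ltd = {}
    for cluster, arm in arms.items():
        if arm[True] and arm[False]:  # skip clusters with an empty arm
            ltd[cluster] = (sum(arm[True]) / len(arm[True])
                            - sum(arm[False]) / len(arm[False]))
    return [ltd[c] for c in clusters if c in ltd]


def permutation_ltds(clusters, high_flags, outcomes, n_permutations=50, seed=12345):
    """Pool LTDs over random relabelings of clusters (assumed permutation scheme)."""
    rng = random.Random(seed)
    labels = list(clusters)
    pooled = []
    for _ in range(n_permutations):
        rng.shuffle(labels)  # cluster sizes are preserved, memberships are random
        pooled.extend(county_ltds(labels, high_flags, outcomes))
    return pooled


def ecdf(sample):
    """Return a function t -> fraction of the sample that is <= t."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda t: bisect_right(ordered, t) / n
```

Evaluating `ecdf(observed)` and `ecdf(permuted)` on a common grid of LTD values gives the two curves to compare; if clustering on the X-confounders matters, the observed curve should differ visibly from the permutation curve.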

[Figure: histograms of the Random Permutation LTD-like Distribution and the Observed LTD Distribution; horizontal axis from -50 to 20]

Statistic        Random Permutation LTD-like   Observed LTD
Mean             -14.19                        -8.52
Std Dev          5.090                         4.902
Std Err Mean     0.01344                       0.09149
Upper 95% Mean   -14.17                        -8.34
Lower 95% Mean   -14.22                        -8.70
N                50 * 2870                     2870
Skewness         0.02731                       -0.6327
Kurtosis         4.0870                        0.68719
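The standard errors and 95% limits in the table follow from the usual formula, Std Err Mean = Std Dev / sqrt(N). The sketch below recomputes them from the rounded values shown, using the Normal quantile 1.96 as an approximation (JMP would use a t quantile, which is nearly identical at these sample sizes); small differences in the last digit come from rounding of the inputs.

```python
import math


def mean_summary(mean, std_dev, n, z=1.96):
    """Return (standard error, lower 95% limit, upper 95% limit) for a sample mean."""
    std_err = std_dev / math.sqrt(n)
    return std_err, mean - z * std_err, mean + z * std_err


# Inputs copied from the summary statistics above.
observed = mean_summary(-8.52, 4.902, 2870)
permuted = mean_summary(-14.19, 5.090, 50 * 2870)

for name, (std_err, lower, upper) in (("observed", observed), ("permuted", permuted)):
    print(name, round(std_err, 5), round(lower, 2), round(upper, 2))
```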

Explore Phase:

Tried using Complete Linkage as well as Fast Ward clustering in JMP.

Tried using combinations of 3 out of 5 potential X-confounders for clustering:
o Age Over 65 %
o Obesity %
o Currently Smoke %
o Ever Smoke %
o Median Household Income ($1,000s)

Tried varying the total number of clusters used from 50 to 400.

Reveal Phase:

NOTE: Cluster #10 is uninformative about LTDs and contains 11 counties. Thus the following predictions use the data from only 2,870 US counties; i.e., LTD missingness is not considered informative of potential treatment effect-sizes.

Fitted Supervised Learning Models for predicting observed LTDs:
o JMP 11 Analyze > Modeling Platform > Partition option
  - single Tree (7 terminal nodes)
  - Bootstrap Forest Model Average of 100 Trees
o JMP Analyze > Fit Model Platform
  - Multi Variable Regression (Degree at most 2)

Tried using as many as 6 potential X-confounders for predicting observed LTDs:
o Age Over 65 %
o Obesity %
o Currently Smoke %
o Ever Smoke %
o Median Household Income ($1,000s)
o Numeric Radon (or Ln[Rn]) Level, as either an ordinal or continuous measure

Predicting LTDs using Supervised Learning: Method One (Single "Small" Tree), R² = 0.51

Partition: best such Tree for predicting LTDobserved (6 splits, 7 terminal nodes)

RSquare   RMSE       N      Number of Splits   AICc
0.511     3.426779   2870   6                  15230.3

Best "Small" Tree: Mean = average LTD within Leaf

Note that all 3 splits on "Age Over 65 %" are such that the counties with the higher % elderly population are predicted to have LARGER (more negative) ADVANTAGES of High Radon in keeping Lung Cancer Mortality low.

Note also that both splits on "Currently Smoke %" are such that the counties with the lower % smoking are predicted to have LARGER (more negative) ADVANTAGES of High Radon in keeping Lung Cancer Mortality low.

Finally, the single split on "Obesity %" is such that the counties with the lower % obese are predicted to have LARGER (more negative) ADVANTAGES of High Radon in keeping Lung Cancer Mortality low.

X-Confounder Contributions:
Term                   Number of Splits   SS           Portion
Age Over 65 (%)        3                  23366.6152   0.6633
Currently Smoke (%)    2                  6001.42539   0.1704
Obesity (%)            1                  5858.01953   0.1663
Radon level in pci/l   0                  0            0.0000
Ever Smoke (%)         0                  0            0.0000
Median HH Income       0                  0            0.0000

Although membership in the High or Low Treatment cohorts is perfectly predicted by Radon level within a county, it is somewhat interesting that Radon level is not used in the above predictions of the corresponding LTDs in Lung Cancer Mortality rate.
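The Portion column in the contributions table is each term's split sum of squares divided by the total over all terms. A minimal sketch that recomputes it from the SS values above (the helper name `ss_portions` is hypothetical):

```python
def ss_portions(ss_by_term):
    """Return each term's share of the total split sum of squares."""
    total = sum(ss_by_term.values())
    return {term: ss / total for term, ss in ss_by_term.items()}


# SS values copied from the X-Confounder Contributions table for the single tree.
tree_ss = {
    "Age Over 65 (%)": 23366.6152,
    "Currently Smoke (%)": 6001.42539,
    "Obesity (%)": 5858.01953,
    "Radon level in pci/l": 0.0,
    "Ever Smoke (%)": 0.0,
    "Median HH Income": 0.0,
}

tree_portions = ss_portions(tree_ss)
for term, portion in tree_portions.items():
    print(f"{term:22s} {portion:.4f}")
```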

Predicting LTDs using Supervised Learning: Method Two (Bootstrap Forest), R² = 0.78

Bootstrap Forest for LTDobserved

Number of trees in the forest:       250
Number of terms sampled per split:   4
Training rows:                       2870
Validation rows:                     0
Test rows:                           0
Number of terms:                     6
Bootstrap samples:                   2870
Minimum Splits Per Tree:             6
Minimum Size Split:                  20

Overall Statistics
RSquare   RMSE        N
0.782     2.2874704   2870

Individual Trees RMSE
In Bag       2.070927
Out of Bag   3.034105

[Figure: Observed LTD Estimates vs their Forest Predictions]

X-Confounder Contributions
Term                   Number of Splits   SS           Portion
Age Over 65 (%)        3704               4154123.04   0.5298
Currently Smoke (%)    3382               1793851.36   0.2288
Obesity (%)            3477               1556085.54   0.1984
Ever Smoke (%)         893                216352.226   0.0276
Median HH Income       513                83668.561    0.0107
Radon level in pci/l   281                37384.8095   0.0048

NOTE: Because Partitioning methods (Trees and Forests) use only the ordinal information about Clusters formed using X-confounders, this could help explain why they do not find Radon level particularly predictive of LTDs; i.e., Radon level ranks sixth of six in the above table of X-confounder predictability! On the other hand, traditional fully-parametric model fitting methods assume all continuous variables are measured on an interval scale; i.e., individual terms can represent linear or quadratic effects or hyperbolic interactions. We will see that numerical values of Radon level are much more predictive of LTDs under these much stronger assumptions.

Predicting LTDs using Supervised Learning: Method Three (MultiVariable Regression), R² = 0.49

RSquare                      0.486915
RSquare Adj                  0.48512
Root Mean Square Error       3.517104
Mean of Response             -8.51949
Observations (or Sum Wgts)   2870

Analysis of Variance
Source     DF     Sum of Squares   Mean Square   F Ratio    Prob > F
Model      10     33562.043        3356.20       271.3175   <.0001*
Error      2859   35365.895        12.37
C. Total   2869   68927.938

Parameter Estimates
Term                                                    Estimate    Std Error   t Ratio   Prob>|t|
Intercept                                               -13.99763   0.965559    -14.50    <.0001*
Radon                                                   -0.063917   0.019376    -3.30     0.0010*
Obesity (%)                                             0.1870584   0.023306    8.03      <.0001*
Age Over 65 (%)                                         -0.522728   0.021938    -23.83    <.0001*
Currently Smoke                                         0.2856342   0.02726     10.48     <.0001*
Ever Smoke                                              0.0118171   0.022495    0.53      0.5994
(Radon-3.09104)*(Age Over 65 (%)-14.8208)               -0.014809   0.003749    -3.95     <.0001*
(Age Over 65 (%)-14.8208)*(Currently Smoke-25.3327)     -0.011782   0.003529    -3.34     0.0009*
(Currently Smoke-25.3327)*(Ever Smoke-49.9641)          0.0109257   0.001997    5.47      <.0001*
(Obesity (%)-28.9948)*(Obesity (%)-28.9948)             0.0321816   0.002642    12.18     <.0001*
(Age Over 65 (%)-14.8208)*(Age Over 65 (%)-14.8208)     -0.028262   0.002687    -10.52    <.0001*

The 2 quadratic terms (in % obese and % over 65) used here seem particularly curious. Furthermore, including such terms in multi-variable regression model(s) can cause any predictions made strictly outside of the observed ranges of the given X-variables to represent potentially severe and unwarranted extrapolations. Finally, of the three methods considered for predicting LTD estimates from six available X-confounding factors, traditional MultiVariable Regression is the least accurate.
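To make the role of the centered product terms concrete, the sketch below evaluates the fitted regression equation using the coefficients and centering constants from the Parameter Estimates table. The function and argument names are hypothetical. At the centering values every product term is zero, so only the linear part contributes; moving one covariate far from its center shows how quickly the quadratic term comes to dominate, which is the extrapolation concern raised above.

```python
# Centering constants that appear inside the product terms of the fitted model.
CENTERS = {
    "radon": 3.09104,
    "obesity": 28.9948,
    "age65": 14.8208,
    "cur_smoke": 25.3327,
    "ever_smoke": 49.9641,
}


def predict_ltd(radon, obesity, age65, cur_smoke, ever_smoke):
    """Evaluate the fitted degree-2 regression for the LTD (coefficients from the table)."""
    r = radon - CENTERS["radon"]
    o = obesity - CENTERS["obesity"]
    a = age65 - CENTERS["age65"]
    c = cur_smoke - CENTERS["cur_smoke"]
    e = ever_smoke - CENTERS["ever_smoke"]
    linear = (-13.99763
              - 0.063917 * radon
              + 0.1870584 * obesity
              - 0.522728 * age65
              + 0.2856342 * cur_smoke
              + 0.0118171 * ever_smoke)
    products = (-0.014809 * r * a      # Radon x Age Over 65
                - 0.011782 * a * c     # Age Over 65 x Currently Smoke
                + 0.0109257 * c * e    # Currently Smoke x Ever Smoke
                + 0.0321816 * o * o    # Obesity squared
                - 0.028262 * a * a)    # Age Over 65 squared
    return linear + products


at_centers = predict_ltd(**CENTERS)
# Hypothetical county with 60% of residents over 65, far from the centering value.
far_out = predict_ltd(**{**CENTERS, "age65": 60.0})
print(round(at_centers, 2), round(far_out, 2))
```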

Correlations between Observed LTDs and their Predictions

                LTD observed   LTDtreePred   LTDforestPred   LTDmvregPred
LTD observed    1.0000         0.7149        0.8917          0.6978
LTDtreePred     0.7149         1.0000        0.8714          0.8015
LTDforestPred   0.8917         0.8714        1.0000          0.8495
LTDmvregPred    0.6978         0.8015        0.8495          1.0000
R-squared                      0.5111        0.7951*         0.4869

* The R² value for Bootstrap Forest "model averaging" of (only) 0.782 reported in the Bootstrap Forest output above apparently incorporates some sort of further "adjustment" or penalty for being thorough, complicated or versatile.

[Figure: Scatterplot Matrix of LTD, LTDtreePred, LTDforestPred and LTDmvregPred]
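The R-squared row is simply the square of each prediction's correlation with the observed LTDs (the first row of the matrix). A short check, using the rounded correlations as printed:

```python
# Correlations of each prediction with LTD observed, copied from the matrix above.
corr_with_observed = {
    "LTDtreePred": 0.7149,
    "LTDforestPred": 0.8917,
    "LTDmvregPred": 0.6978,
}

r_squared = {name: r * r for name, r in corr_with_observed.items()}
for name, value in r_squared.items():
    print(f"{name:14s} {value:.4f}")
```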