Application of Local Control Strategy in analyses of the effects of Radon on Lung Cancer Mortality for 2,881 US Counties


Page 1

Bob Obenchain, Risk Benefit Statistics, August 2015

Our motivation for using a cut-point of 2.6 pci/l for the Radon level that defines a High-Low "Treatment" dichotomy is given by the initial binary split in the following Partition Regression (Tree) Model:

[Partition of (unadjusted) Lung Cancer Mortality on Radon. Summary columns: RSquare, RMSE, N, Number of Splits, AICc. Nodes, each reporting Count, Mean and Std Dev: All Rows (with LogWorth and Difference for the split); Radon>=2.6, Count 1220; Radon<2.6, Count 1661.]

"Low Radon": level strictly less than 2.6 pci/l (picocuries per liter).
"High Radon": level of 2.6 pci/l (picocuries per liter) or greater.

We will see below that higher Radon levels are associated with lower Lung Cancer Mortality rates. Neither this analysis nor the ones depicted on page 2 have been "covariate adjusted" for possible X-confounding factors included in the datasets being analyzed here.
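The High-Low dichotomy described above can be sketched in a few lines. This is only an illustration: the function name `radon_flag` and the example radon levels are hypothetical and are not taken from the report or from the JMP add-in. The one point it pins down is that a county at exactly 2.6 pci/l is coded "High".

```python
CUT_POINT = 2.6  # pci/l; the first split of the partition tree in the report

def radon_flag(radon_level):
    """Return 'High' at or above the cut-point, 'Low' strictly below it."""
    return "High" if radon_level >= CUT_POINT else "Low"

# Hypothetical county radon levels, for illustration only.
example_levels = [0.0, 1.3, 2.59, 2.6, 7.4]
flags = [radon_flag(level) for level in example_levels]
```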

Page 2

Prediction of Lung Cancer Mortality from Ln[Rn], unadjusted for all other X-confounders.

Ln[Rn] = natural logarithm of Radon level. Here, 10 US counties with Radon level coded as "0.0" have been Winsorized in the dataset to Ln[0.05]. The cut-point at Radon = 2.6 pci/l (Ln[Rn] = Ln[2.6]) is used in the fits displayed on this page only to color counties either Red or Blue.

Linear Fit: Lung Cancer Mortality regressed on Ln[Rn] (intercept plus slope times Ln[Rn]).

[Summary of Fit: RSquare, RSquare Adj, Root Mean Square Error, Mean of Response, Observations (or Sum Wgts) = 2881.]

[Analysis of Variance: Source (Model, Error, C. Total), DF, Sum of Squares, Mean Square, F Ratio; Prob > F <.0001*.]

[Parameter Estimates: Term, Estimate, Std Error, t Ratio, Prob>|t|. Intercept: <.0001*. Ln[Rn]: <.0001*.]

Smoothing Spline Fit, lambda=5: [R-Square, Sum of Squares Error.]
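As a rough sketch of the transformation and the unadjusted linear fit described on this page, the snippet below replaces a radon level coded as 0.0 with 0.05 before taking the natural logarithm, then fits a straight line by ordinary least squares. The names `ln_radon` and `ols_fit` and the example data are hypothetical, and the fit is a plain textbook least-squares line rather than JMP's own routine.

```python
import math

FLOOR = 0.05  # value substituted for counties whose radon level is coded 0.0

def ln_radon(radon_level):
    """Natural log of radon, with a coded 0.0 replaced by 0.05 first."""
    value = FLOOR if radon_level == 0.0 else radon_level
    return math.log(value)

def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data lying exactly on a line, for illustration only.
example_x = [ln_radon(r) for r in [0.0, 1.0, 2.6, 5.0]]
example_y = [70.0 - 5.0 * xi for xi in example_x]
intercept, slope = ols_fit(example_x, example_y)
```

On this scale the cut-point of 2.6 pci/l sits at `ln_radon(2.6)`, roughly 0.96, and the Winsorized counties sit at `ln_radon(0.0)`, roughly -3.0.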

Page 3

Output from the "Local Control" JMP Add-In: Pages 3, 4 and 5.

Outcome Variable: Lung Cancer Mortality
Treatment Variable: Radon Level Flag
Cluster Effect Type: Fixed
Variability Assumption: Homoskedastic
Random Number Seed: 12345
Specify Number of Clusters = 50
Specify Number of Permutations =

[Figure: Mean_LTD; LTD distribution for 50 clusters.]

Page 4

Hierarchical Clustering
Method = Fast Ward
Clustering variables: Obesity (%), Currently Smoke, Age Over 65 (%)

[Figure: Dendrogram; Hierarchically Clustered Differences.]

Page 5

Response Lung Cancer Mortality: Nested ANOVA (Treatment within Cluster)

[Summary of Fit: RSquare, RSquare Adj, Root Mean Square Error, Mean of Response, Observations (or Sum Wgts) = 2881.]

[Analysis of Variance: Source (Model, Error, C. Total), DF, Sum of Squares, Mean Square, F Ratio; Prob > F <.0001*.]

[Effect Tests: Source, Nparm, DF, Sum of Squares, F Ratio, Prob > F. Cluster: <.0001*. High Radon[Cluster]: <.0001*.]

NOTE: Cluster #10 is uninformative about Lung Cancer Mortality LTDs (High minus Low Radon) because all 11 US counties it contains have Radon levels less than 2.6 pci/l (picocuries per liter). This explains why there are only 49 (rather than 50) Degrees-of-Freedom for Treatment-within-Cluster (LTD) effects. Only 49 Degrees-of-Freedom are attributed to main-effects within 50 Clusters by convention; the overall mean effect for mortality is simply removed and not shown in the ANOVA table.

Although it may not be obvious from the entries in the above Nested ANOVA table, results depend upon the choice of the High-Low Radon cut-point (2.6 pci/l here) as well as the numbers of both requested and informative clusters (50 and 49, respectively).

Specifically, the y-outcome column vector here has 2,881 rows for US counties and consists of Lung Cancer Mortality rates viewed as realizations of a continuous random variable. Furthermore, the "design" matrix has only two non-constant columns, viewed as fixed (given) categorical variables:

1. The vector of treatment indicators has 2 levels: say, zeros (Low Radon) and ones (High Radon).
2. The vector of cluster membership indicators has 50 levels: say, the integers 1 through 50.

The analysis of this cross-classification of mortality rates is essentially nonparametric because no information is used on either how clusters were formed / defined from county X-characteristics or what the numerical values of X-characteristics are.
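The degrees-of-freedom bookkeeping in the NOTE above can be checked with a small counting sketch. It assumes each county is represented as a hypothetical `(cluster_id, flag)` pair; the function name `nested_anova_df` and the toy rows are illustrative and are not part of the JMP output. A cluster counts as informative only when it contains both High and Low Radon counties.

```python
from collections import defaultdict

def nested_anova_df(rows):
    """rows: iterable of (cluster_id, flag) with flag 'High' or 'Low'.

    Cluster main effects get (number of clusters - 1) degrees of freedom;
    treatment-within-cluster gets one per cluster holding both flags.
    """
    flags_by_cluster = defaultdict(set)
    for cluster_id, flag in rows:
        flags_by_cluster[cluster_id].add(flag)
    n_clusters = len(flags_by_cluster)
    n_informative = sum(1 for flags in flags_by_cluster.values() if len(flags) == 2)
    return {
        "cluster_df": n_clusters - 1,
        "treatment_within_cluster_df": n_informative,
    }

# Toy example: cluster 3 holds only Low Radon counties, so it is uninformative.
toy_rows = [(1, "High"), (1, "Low"), (2, "High"), (2, "Low"), (3, "Low"), (3, "Low")]
toy_df = nested_anova_df(toy_rows)
```

With 50 clusters of which one is uninformative, the same counting gives 49 and 49, matching the report.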

Page 6

Aggregate Phase: Observed LTD Distribution (49 Informative Clusters containing 2,870 US Counties)

[Figure: Observed Local Treatment Difference (LTD) Distribution for 50 Ward Clusters.]

Lung Cancer Mortality is measured in Deaths per 100,000 Person-Years. LTDs are differences in Mortality rates: Radon High minus Low.

The above histogram depicts the Most Typical LTD Distribution derived from micro-aggregation of 2,881 US Counties on 3 primary X-confounders:
o Age Over 65 %
o Currently Smoke %
o Obesity %

Y-outcome = Lung Cancer Mortality
Binary Treatment Indicator: Radon High (at least 2.6 pci/l) vs. Low

Best fitting Normal approximation has mean µ = deaths and std. dev. σ =
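One way to summarize an LTD distribution by a mean and standard deviation is sketched below. It assumes that every county carries the LTD of its cluster, so larger clusters weigh more; that weighting is an assumption suggested by the distribution being described over 2,870 counties rather than over 49 clusters, and it is not confirmed as the add-in's exact computation. The function name and the toy numbers are hypothetical.

```python
import statistics

def county_weighted_ltd_summary(ltds, cluster_sizes):
    """Mean and sample std. dev. of LTDs, repeating each LTD once per county."""
    expanded = []
    for cluster_id, ltd in ltds.items():
        expanded.extend([ltd] * cluster_sizes[cluster_id])
    return statistics.mean(expanded), statistics.stdev(expanded)

# Toy values only: two informative clusters with 3 and 2 counties.
example_ltds = {1: -15.0, 2: -10.0}
example_sizes = {1: 3, 2: 2}
mu, sigma = county_weighted_ltd_summary(example_ltds, example_sizes)
```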

Page 7

Confirm Phase: Comparison of empirical Cumulative Distribution Functions (CDFs)

[Figure: empirical CDFs of the Random Permutation LTD-like Distribution and the Observed LTD Distribution.]

These two distributions are rather clearly different; they differ most on statistical measures of location and shape (skewness, kurtosis, range). See also the histograms and statistics listed on the next page.

This means that clustering (local conditioning, matching) on 3 primary X-confounders [% over 65, % currently smoke and % obese] has indeed yielded appropriately adjusted treatment effect-size estimates.

Local treatment effect-size estimates are LTDs expressed as a difference in mortality rates (deaths per 100,000 person-years) of the form: Average for one-or-more High Radon counties minus Average for one-or-more Low Radon counties.
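The LTD definition above, and one plausible way to build an "LTD-like" permutation distribution, can be sketched as follows. The first function follows the definition directly: High Radon average minus Low Radon average within each cluster, skipping clusters that lack either group. The second function is an assumption about the permutation scheme (cluster labels are shuffled across counties, keeping cluster sizes, so that clusters no longer reflect X-confounders); the report does not spell out the add-in's algorithm. All names and toy rows are hypothetical.

```python
import random
from collections import defaultdict

def local_treatment_differences(rows):
    """rows: (cluster_id, flag, mortality). Returns {cluster_id: LTD}."""
    groups = defaultdict(lambda: {"High": [], "Low": []})
    for cluster_id, flag, mortality in rows:
        groups[cluster_id][flag].append(mortality)
    ltds = {}
    for cluster_id, g in groups.items():
        if g["High"] and g["Low"]:  # informative clusters only
            high_mean = sum(g["High"]) / len(g["High"])
            low_mean = sum(g["Low"]) / len(g["Low"])
            ltds[cluster_id] = high_mean - low_mean
    return ltds

def permuted_ltds(rows, rng):
    """Assumed scheme: shuffle cluster labels across counties, recompute LTDs."""
    labels = [cluster_id for cluster_id, _, _ in rows]
    rng.shuffle(labels)
    shuffled = [(c, flag, y) for c, (_, flag, y) in zip(labels, rows)]
    return local_treatment_differences(shuffled)

# Toy rows: cluster 3 has no High Radon county, so it yields no LTD.
toy_rows = [
    (1, "High", 40.0), (1, "Low", 50.0), (1, "Low", 60.0),
    (2, "High", 45.0), (2, "Low", 55.0),
    (3, "Low", 70.0),
]
observed = local_treatment_differences(toy_rows)
one_permutation = permuted_ltds(toy_rows, random.Random(12345))
```

Repeating `permuted_ltds` many times and pooling the results gives a reference distribution against which the observed LTDs can be compared.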

Page 8

[Histograms and summary statistics for the two distributions. Each panel reports Mean, Std Dev, Std Err Mean, Upper 95% Mean, Lower 95% Mean, N, Skewness and Kurtosis.]

Random Permutation LTD-like Distribution: N = 50 * 2870
Observed LTD Distribution: N = 2870

Page 9

Explore Phases:

Tried using Complete Linkage as well as Fast Ward clustering in JMP.

Tried using combinations of 3 out of 5 potential X confounders for clustering:
o Age Over 65 %
o Obesity %
o Currently Smoke %
o Ever Smoke %
o Median Household Income ($1,000s)

Tried varying total # of clusters used from 50 to 400.

Reveal Phase:

NOTE: Cluster #10 is uninformative about LTDs and contains 11 counties. Thus the following predictions use the data from only 2,870 US counties, i.e. LTD missingness is not considered informative of potential treatment effect-sizes.

Fitted Supervised Learning Models for predicting observed LTDs:
o JMP 11 Analyze > Modeling Platform > Partition option
    Single Tree (7 terminal nodes)
    Bootstrap Forest Model Average of 100 Trees
o JMP Analyze > Fit Model Platform
    Multi Variable Regression (Degree at most 2)

Tried using as many as 6 potential X confounders for predicting observed LTDs:
o Age Over 65 %
o Obesity %
o Currently Smoke %
o Ever Smoke %
o Median Household Income ($1,000s)
o Numeric Radon (or Ln[Rn]) Level, as either an ordinal or continuous measure

Page 10

Predicting LTDs using Supervised Learning: Method One (Single "Small" Tree), R^2 = 0.51

Partition: Best such Tree for predicting LTDobserved (6 splits, 7 terminal nodes)

[Partition summary for LTD: RSquare, RMSE, N, Number of Splits, AICc.]

Page 11

Best "Small" Tree: Mean = average LTD within Leaf

Note that all 3 splits on "Age Over 65 %" are such that the counties with the higher % elderly population are predicted to have LARGER (more negative) ADVANTAGES of High Radon in keeping Lung Cancer Mortality low.

Note also that both splits on "Currently Smoke %" are such that the counties with the lower % smoking are predicted to have LARGER (more negative) ADVANTAGES of High Radon in keeping Lung Cancer Mortality low.

Finally, the single split on "Obesity %" is such that the counties with the lower % obese are predicted to have LARGER (more negative) ADVANTAGES of High Radon in keeping Lung Cancer Mortality low.

X-Confounder Contributions:
[Columns: Term, Number of Splits, SS, Portion. Terms in listed order: Age Over 65 (%), Currently Smoke (%), Obesity (%), Radon level in pci/l, Ever Smoke (%), Median HH Income.]

Although membership in the High or Low Treatment cohorts is perfectly predicted by Radon level within a county, it is somewhat interesting that Radon level is not used in the above predictions of the corresponding LTDs in Lung Cancer Mortality rate.

Page 12

Predicting LTDs using Supervised Learning: Method Two (Bootstrap Forest), R^2 = 0.78

Bootstrap Forest for LTDobserved
Number of trees in the forest: 250
Number of terms sampled per split: 4
Training rows: 2870
Validation rows: 0
Test rows: 0
Number of terms: 6
Bootstrap samples: 2870
Minimum Splits Per Tree: 6
Minimum Size Split: 20

[Overall Statistics: Individual Trees RMSE (In Bag, Out of Bag); RSquare, RMSE, N.]

Page 13

[Figure: Observed LTD Estimates vs their Forest Predictions.]

X-Confounder Contributions:
[Columns: Term, Number of Splits, SS, Portion. Terms in listed order: Age Over 65 (%), Currently Smoke (%), Obesity (%), Ever Smoke (%), Median HH Income, Radon level in pci/l.]

NOTE: Because Partitioning methods (Trees and Forests) use only the ordinal information about Clusters formed using X-confounders, this could help explain why they do not find Radon Level particularly predictive of LTDs, i.e. Radon level in pci/l ranks 6th out of six in the above table of X-confounder predictability!

On the other hand, traditional fully-parametric model fitting methods assume all continuous variables are measured on an interval scale, i.e. individual terms can represent linear or quadratic effects or hyperbolic interactions. We will see that numerical values of Radon level are much more predictive of LTDs under these much stronger assumptions.

Page 14

Predicting LTDs using Supervised Learning: Method Three (MultiVariable Regression), R^2 = 0.49

[Summary of Fit: RSquare, RSquare Adj, Root Mean Square Error, Mean of Response, Observations (or Sum Wgts) = 2870.]

[Analysis of Variance: Source (Model, Error, C. Total), DF, Sum of Squares, Mean Square, F Ratio; Prob > F <.0001*.]

Parameter Estimates (columns: Term, Estimate, Std Error, t Ratio, Prob>|t|; only the significance entries are legible):
Intercept: <.0001*
Radon: *
Obesity (%): <.0001*
Age Over 65 (%): <.0001*
Currently Smoke: <.0001*
Ever Smoke:
(Radon)*(Age Over 65 (%)): <.0001*
(Age Over 65 (%))*(Currently Smoke): *
(Currently Smoke)*(Ever Smoke): <.0001*
(Obesity (%))*(Obesity (%)): <.0001*
(Age Over 65 (%))*(Age Over 65 (%)): <.0001*

The 2 quadratic terms (in % obese and % over 65) used here seem particularly curious. Furthermore, including such terms in multi-variable regression model(s) can cause any predictions made strictly outside of the observed ranges of the given X-variables to represent potentially severe and unwarranted extrapolations.

Finally, of the three methods considered for predicting LTD estimates from six available X-confounding factors, traditional MultiVariable Regression is the least accurate.

Page 15

Correlations between Observed LTDs and their Predictions

[Correlation matrix with rows and columns LTD observed, LTDtreePred, LTDforestPred, LTDmvregPred, followed by an R-squared row in which the forest entry is marked *.]

* The R^2 value for Bootstrap Forest "model averaging" listed on page 12 apparently incorporates some sort of further "adjustment" or penalty for being thorough, complicated or versatile.

[Figure: Scatterplot Matrix of LTD, LTDtreePred, LTDforestPred, LTDmvregPred.]
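The R-squared row under a correlation matrix is most naturally read as the squared Pearson correlation between observed LTDs and each set of predictions. A minimal sketch of that calculation follows; the function names and the toy vectors are hypothetical, and this unpenalized quantity need not agree with the R^2 a forest platform reports for itself.

```python
import math

def pearson_r(a, b):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    s_ab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    s_aa = sum((x - mean_a) ** 2 for x in a)
    s_bb = sum((y - mean_b) ** 2 for y in b)
    return s_ab / math.sqrt(s_aa * s_bb)

def r_squared_from_correlation(observed, predicted):
    """Squared correlation between observed values and their predictions."""
    return pearson_r(observed, predicted) ** 2

# Toy vectors only, not values from the report.
toy_observed = [1.0, 2.0, 3.0, 4.0]
toy_predicted = [1.0, 3.0, 2.0, 4.0]
toy_r2 = r_squared_from_correlation(toy_observed, toy_predicted)
```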
