County-Level Small Area Estimation using the National Health Interview Survey (NHIS) and the Behavioral Risk Factor Surveillance System (BRFSS)

Similar documents
Model-based Small Area Estimation for Cancer Screening and Smoking Related Behaviors

JSM Survey Research Methods Section

Small-area estimation of mental illness prevalence for schools

Holey Data: Prediction and Mapping of US Cervical Cancer Incidence Rates

Geographical Accuracy of Cell Phone Samples and the Effect on Telephone Survey Bias, Variance, and Cost

Bayesian hierarchical modelling

Joint Spatio-Temporal Modeling of Low Incidence Cancers Sharing Common Risk Factors

Comparing Multiple Imputation to Single Imputation in the Presence of Large Design Effects: A Case Study and Some Theory

Noncoverage Adjustments in a Single-Frame Cell-Phone Survey: Weighting Approach to Adjust for Phoneless and Landline-Only Households

How do we combine two treatment arm trials with multiple arms trials in IPD metaanalysis? An Illustration with College Drinking Interventions

Baseline Health Data Report: Cambria and Somerset Counties, Pennsylvania

Bayesian Logistic Regression Modelling via Markov Chain Monte Carlo Algorithm

JSM Survey Research Methods Section

Nonresponse Adjustment Methodology for NHIS-Medicare Linked Data

USING THE CENSUS 2000/2001 SUPPLEMENTARY SURVEY AS A SAMPLING FRAME FOR THE NATIONAL EPIDEMIOLOGICAL SURVEY ON ALCOHOL AND RELATED CONDITIONS

Trends in COPD (Chronic Bronchitis and Emphysema): Morbidity and Mortality. Please note, this report is designed for double-sided printing

Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data

1.4 - Linear Regression and MS Excel

CARE Cross-project Collectives Analysis: Technical Appendix

An Introduction to Multiple Imputation for Missing Items in Complex Surveys

An Examination of Factors Affecting Incidence and Survival in Respiratory Cancers. Katie Frank Roberto Perez Mentor: Dr. Kate Cowles.

Improving ecological inference using individual-level data

An Introduction to Bayesian Statistics

Data Analysis Using Regression and Multilevel/Hierarchical Models

UNIVERSITY OF TORONTO SCARBOROUGH Department of Computer and Mathematical Sciences Midterm Test February 2016

Trends in Smoking Prevalence by Race based on the Tobacco Use Supplement to the Current Population Survey

Analyzing data from educational surveys: a comparison of HLM and Multilevel IRT. Amin Mousavi

A Bayesian Measurement Model of Political Support for Endorsement Experiments, with Application to the Militant Groups in Pakistan

A COMPARISON OF IMPUTATION METHODS FOR MISSING DATA IN A MULTI-CENTER RANDOMIZED CLINICAL TRIAL: THE IMPACT STUDY

Trends in Allergic Conditions Among Children: United States,

Combining Risks from Several Tumors Using Markov Chain Monte Carlo

Funnelling Used to describe a process of narrowing down of focus within a literature review. So, the writer begins with a broad discussion providing b

Bayesian approaches to handling missing data: Practical Exercises

Survey Errors and Survey Costs

Quantitative Data: Measuring Breast Cancer Impact in Local Communities

Introduction to Bayesian Analysis 1

AnExaminationoftheQualityand UtilityofInterviewerEstimatesof HouseholdCharacteristicsinthe NationalSurveyofFamilyGrowth. BradyWest

Supplementary Online Content

2.75: 84% 2.5: 80% 2.25: 78% 2: 74% 1.75: 70% 1.5: 66% 1.25: 64% 1.0: 60% 0.5: 50% 0.25: 25% 0: 0%

Adjustment for Noncoverage of Non-landline Telephone Households in an RDD Survey

An Application of Propensity Modeling: Comparing Unweighted and Weighted Logistic Regression Models for Nonresponse Adjustments

Fundamental Clinical Trial Design

Estimates of Influenza Vaccination Coverage among Adults United States, Flu Season

Unit 1 Exploring and Understanding Data

BAYESIAN HYPOTHESIS TESTING WITH SPSS AMOS

Advanced Bayesian Models for the Social Sciences. TA: Elizabeth Menninga (University of North Carolina, Chapel Hill)

Walkability vs. Several Health Diagnoses for Klamath Falls, OR

Weight Adjustment Methods using Multilevel Propensity Models and Random Forests

CHAPTER 2: TWO-VARIABLE REGRESSION ANALYSIS: SOME BASIC IDEAS

Design and Analysis Plan Quantitative Synthesis of Federally-Funded Teen Pregnancy Prevention Programs HHS Contract #HHSP I 5/2/2016

The Essential Role of Pair Matching in. Cluster-Randomized Experiments. with Application to the Mexican Universal Health Insurance Evaluation

Understandable Statistics

Probabilistic Projections of Populations With HIV: A Bayesian Melding Approach

Bias in regression coefficient estimates when assumptions for handling missing data are violated: a simulation study

Missing Data and Imputation

Nonresponse Rates and Nonresponse Bias In Household Surveys

An Exercise in Bayesian Econometric Analysis Probit and Linear Probability Models

Decline and Disparities in Mammography Use Trends by Socioeconomic Status and Race/Ethnicity

Disparities in Tobacco Product Use in the United States

Combining Estimates from Multiple Surveys

Statistical Issues in Road Safety Part I: Uncertainty, Variability, Sampling

The Impact of Cellphone Sample Representation on Variance Estimates in a Dual-Frame Telephone Survey

Profile of DeKalb County

Epidemiological Fact Sheet on HIV and AIDS. Core data on epidemiology and response. Costa Rica Update. July 2008 / Version 0.

Bayesian and Frequentist Approaches

A Model for Spatially Disaggregated Trends and Forecasts of Diabetes Prevalence

Methods for treating bias in ISTAT mixed mode social surveys

County-Level Analysis of U.S. Licensed Psychologists and Health Indicators

ST440/550: Applied Bayesian Statistics. (10) Frequentist Properties of Bayesian Methods

A Case Study: Two-sample categorical data

Preliminary Report on Simple Statistical Tests (t-tests and bivariate correlations)

Ecological Statistics

Underweight Children in Ghana: Evidence of Policy Effects. Samuel Kobina Annim

Bayesian Analysis of Between-Group Differences in Variance Components in Hierarchical Generalized Linear Models

Estimation of contraceptive prevalence and unmet need for family planning in Africa and worldwide,

LOGISTIC PROPENSITY MODELS TO ADJUST FOR NONRESPONSE IN PHYSICIAN SURVEYS

Propensity Score Methods for Estimating Causality in the Absence of Random Assignment: Applications for Child Care Policy Research

CDRI Cancer Disparities Geocoding Project. November 29, 2006 Chris Johnson, CDRI

Application of Local Control Strategy in analyses of the effects of Radon on Lung Cancer Mortality for 2,881 US Counties

Regression Discontinuity Analysis

Chapter 3. Producing Data

Behavioral Risk Factor Surveillance System (BRFSS)

GUIDELINE COMPARATORS & COMPARISONS:

For general queries, contact

Bayesian Methods for Small Population Analysis

Equatorial Guinea. Epidemiological Fact Sheet on HIV and AIDS Update. July 2008 / Version 0.1 beta. Core data on epidemiology and response

Inference and Error in Surveys. Professor Ron Fricker Naval Postgraduate School Monterey, California

Technical Specifications

The Influence of Rural Residence on the Use of Preventive Health Care Services

Lecture 14: Adjusting for between- and within-cluster covariates in the analysis of clustered data May 14, 2009

David V. McQueen. BRFSS Surveillance General Atlanta - Rome 2006

JSM Survey Research Methods Section

Introduction to NHANES and NAMCS

Design for Targeted Therapies: Statistical Considerations

Marno Verbeek Erasmus University, the Netherlands. Cons. Pros

Racial disparities in health outcomes and factors that affect health: Findings from the 2011 County Health Rankings

Using Soft Refusal Status in the Cell-Phone Nonresponse Adjustment in the National Immunization Survey

UK household structure and Infectious Disease Transmission: Supplementary Information

English summary Modus Via. Fine-tuning geographical offender profiling.

Transcription:

County-Level Small Area Estimation using the National Health Interview Survey (NHIS) and the Behavioral Risk Factor Surveillance System (BRFSS) Van L. Parsons, Nathaniel Schenker Office of Research and Methodology, National Center for Health Statistics, Centers for Disease Control and Prevention, 33 Belcrest Rd, Hyattsville, MD, 20782 Abstract This paper looks at two modeling approaches to a small area estimation problem, a hierarchical Bayesian procedure and a generalized linear mixed model. The problem is to produce U.S. county-level estimates for mammography prevalence using combined data from the NHIS and BRFSS surveys. Several comparisons of the two approaches are provided. Key Words: Bayesian, random effects, mixed models. Introduction A recent collaborative project among statisticians at the National Center for Health Statistics (NCHS), the National Cancer Institute (NCI), the University of Michigan and the University of Pennsylvania focused on developing Bayesian methodologies to produce small area estimates for the prevalence of cancer risk and cancer screening factors for counties in the United States. Data sources for this project were the Behavioral Risk Factor Surveillance System (BRFSS), an ongoing telephone survey of the health behaviors of adults in the United States, and the National Health Interview Survey (NHIS), a face-to-face household interview survey. An outcome of this project was the development of a hierarchical Bayesian model for combining data from the BRFSS and NHIS surveys to produce small area countylevel estimates of prevalence. The details of this model appear in Raghunathan et al. (2007). As there are numerous modeling methodologies for producing small area estimates, see Rao (2003), it was decided to compare the proposed Bayesian approach to that of using a Generalized Linear Mixed Model (GLMM). The GLMM is a modern, multi-level modeling technique that has gained popularity due to computational advancements; see Gelman and Hill (2007). The recent development of the R (2008) package lme4 provided a flexible means for fitting a wide range of models and was chosen for this study. The results of this Bayesian and GLMM comparison were presented as a poster session, and this paper provides a condensed summary of the tables, graphs and maps of the poster. 2. A Comparison of a Bayesian and GLMM Small Area Models for County-Level Mammography Screening Prevalence 2. Comparison of BRFSS and NHIS Surveys Table provides summary information, with an emphasis on the contrasts, of these two surveys. Detailed information about BRFSS and NHIS is available at http://www.cdc.gov/brfss/ and http://www.cdc.gov/nchs/nhis.htm, respectively. The findings and conclusions in this report are those of the authors and do not necessarily represent the views of the Centers for Disease Control and Prevention 00

For these two surveys we were interested in county-level estimates of the percentage of women age 40 and over who have had a mammogram in the past 2 years. The mammography screening questions were asked during different years for the two surveys, so the data years 997-999 and 2000-2003 were pooled to define two collapsed time periods. Using the SUDAAN software, the direct county-level estimates for each survey were computed along with the standard errors. For the NHIS estimates the weights were not poststratified to any control totals. For the proposed Bayesian and GLMM modeling, the standard errors were used to estimate effective sample sizes for the county-level estimates (Table2). For each county, socio-economic-demographic (SED) covariates were available (Table3). The BRFSS direct county-level estimates of mammography prevalence tend to be greater than the corresponding NHIS direct estimates over both time periods. At the National level this is also true, with the difference about 7 percent between BRFSS and NHIS estimates. We attributed this significant discrepancy to a bias within the BRFSS system. Our modeling procedures discussed in the next section attempt remove to this bias. 2.2 A Bayesian and GLMM small area approach to county-level estimation The BRFSS survey system provides sample for about 99% of all counties, but the BRFSS direct estimates have possible deficiencies: bias for landline-only sample, system biases (5 operationally different surveys, large nonresponse, telephone sample, and telephone-mode effect) and small samples for many counties. While the NHIS samples only about 800 counties, for those that are sampled we can make the assumption that the county direct NHIS estimates listed in Table 2 are unbiased. This assumption that the direct NHIS estimates are unbiased along with associated county-level SED covariates allows us to create models that separate a hypothesized BRFSS bias and borrow strength from the totality of data to produce model-based county-level estimates. A starting point was to make the following conditional distributional statement about the entries of Table 2: Given some model-specified stochastic parameters, say B = b, as fixed, the county-level response p times the effective sample of size n has a binomial distribution. The stochastic nature of the variable B will be modeled under Bayesian and GLMM frameworks. 2.2. Bayesian Model For this approach the arcsine transformation was used on the p s of Table 2. The specifics of the model chosen are described in detail in Raghunathan et al. (2007), but the following hierarchical model, (Bayesian Model Equation) provides an indication of the structures imposed on the direct estimates listed in Table 2. The posterior distributions of the model parameters, θ, φ, δ, were computed at the county level using an MCMC sampling algorithm. Customprogramming was done using the Gauss software package (Aptech Systems, 2003). We shall abbreviate this model in the remaining text by BAYES. 2.2.2 GLMM model We used the GLMM framework in the R package lme4 (Version: 0.999375-20) using the logit as the link function for the binomial. In terms of fixed and random effects the GLMM model can be specified by Fixed Effects: intercept, x (covariates), survey and time, County-level Random Effects: intercept, survey and time, and State-level Random Effects: intercept and survey. Reference points: survey = 0, for NHIS and BRFSS time = 0, for 997-999 and 2000-2003. 0

Table. Comparisons of BRFSS and NHIS Surveys BRFSS NHIS Type State-level, Telephone only National, face-to-face, multistage cluster sample Sample size/year 50-250 K Households 2000+ Households per state Cost/response Low High Sponsor and Data CDC/States NCHS/Census Collection Organizations Response rate Lower Higher 30-40 K Households 50-5000 Households per State (Non-state design) Coverage Available Geographical Information Landline Telephone Residential Households, About 99% of counties State (public) County (on request) Households (including dormitories and group quarters) Sample contains about 800 of 3000+ counties 4 Regions (public), State/County (restricted access) NCHS Research Data Center Table 2. Direct Estimates of Proportions for County d in sample ( 997-999 and 2000-2003) (time period index t omitted) Domain Direct Estimates Effective Sample Sizes NHIS BRFSS NHIS BRFSS Telephone Households p Td p Bd n Td n Bd Non-Telephone Households p Nd n Nd All Households p d n d Table 3: County-Level SED Covariates from Multiple Sources (independent of NHIS and BFSS) Small MSA status Percent persons high school graduates Non MSA status Percent persons college graduates Per capita property taxes Percent persons below poverty Per capita expenditures Civilian labor force unemployment rate Total Social Security benefit recipients Per capita income Number of serious crimes County population density Number of social service establishments per capita Population size Newspaper readership rate Percentage of persons work commute 30+ min Median effective buying income index Percentage over 65 Per capita personal income Percentage -person households Percent blue collar workers Percentage households with children Percent black Percent persons living in urban area Percent Hispanic Median home value 02

Bayesian Model Equation (BAYES) (details omitted) - sin - sin - sin 4n ptdt θ dt p Ndt ~ N 3 φ dt, p Bdt ( + δ dt ) θ dt θ = ( β, Σ ) = unknown parameters, vague prior distribution for θ. Tdt ρ t 4 n 4n Tdt Ndt n Ndt θ dt φ dt ~ N 3 ( X dt β, Σ ), δ dt where X = area - level covariates and Indicators for period, dt 0 0 4n Bdt Between-area model Within -area model The parameters θ, φ, δ correspond to telephone sample, non-telephone sample and bias parameters. 2.2.3 Differences in the Bayesian and GLMM models The Bayesian methods allow for a finer level of parameterization than traditional methods. Counties with sparse data are compensated by the prior distribution. In the GLMM framework the non-telephone NHIS data was too sparse at the county level to allow for a distinct non-telephone component. Thus, only the direct full-county NHIS estimates were modeled using the GLMM approach. Furthermore, the lack of NHIS county data limits the GLMM models to main effects. The state random effect helps to express the nature of the BRFSS state survey system and was included for GLMM. At the time of this research the customized program we used for Bayesian analysis was too rigidly defined to easily include any additional state parameters in a timely manner and they were not included. The fixed survey component in the GLMM model must be considered as a somewhat overall bias parameter aggregating all types of bias discussed in section 2.2. Since BAYES and GLMM models are slightly different, we will not make any statements as to which one is preferred, but focus on the strengths and weaknesses of each approach. Furthermore, both the BAYES and GLMM models should be considered as exploratory attempts, with many alternative options still to be investigated. 3.0 Findings The state of California is somewhat representative in showing the types of characteristics the model-based estimates will have. BRFSS has samples in most counties. By the nature of having highly populated areas, many of the California counties are in the NHIS sample with certainty; others are sampled, so it is possible to have counties with small or no samples. Figure shows the BAYES and GLMM California county level estimates along with the original data. The horizontal reference lines of 0.65 and 0.70 represent the national prevalence at time periods and 2 respectively. Figure 2 shows estimated levels of the coefficients of variation (CV) of the county-level estimators. Figure 3 shows the totality of the model-based estimates for the 997-999 time-period. 03

3. Observations Figure : Both models demonstrate a high degree of smoothing of the original data. The GLMM model had strong global survey and time fixed effects upon which the corresponding random effects had only moderate impact. For example, all differences in the county-level time prevalence are positive. The BAYES did not smooth the data as much as the GLMM model; it tracked the NHIS data somewhat more closely. Figure 2: The second time period had larger sample sizes which explains the general tendency of having smaller CV s in the second time period. CV patterns within BAYES or GLMM method were not as consistent for BAYES as for GLMM. For example, the GLMM second time period county-level CV was always smaller than the CV from the first time period, while the BAYES time period CV s reversed order for some counties. For those counties with NHIS sample less than 20, the CVs of the GLMM are less than those of BAYES within each time period, which is an evidence of a higher degree of smoothing in the GLMM. Figures and 2: For the counties with large NHIS samples, the two methods appear to be somewhat consistent with each other. Figures 3a and 3b: The maps show a substantial smoothing over the map of the original BRFSS data (omitted for space reasons). One apparent difference between the GLMM and BAYES is the inclusion of the state random effect in the former model. Some states show a distinct separation by boundary, for example, the adjacent states of Mississippi and Alabama show a distinct separation for GLMM, but not for BAYES. 3.2 Conclusions Both models appear to produce estimates in some proximity to each other. Independent estimates are not available to assess the degree of bias. The selected GLMM is a stronger data smoother than the BAYES. Both models provide examples of the methods but not the final model. Refinements are needed. Acknowledgments: This project was a spinoff of the project mentioned in section.0. The authors thank the team listed in Raghunathan et al. (2007) for the Bayesian estimates. References Aptech Systems (2003). Gauss: Advanced Mathematical and Statistical Systems. Version 5, Black Diamond, CA. Gelman A, Hill J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models, Cambridge University Press R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-90005-07-0, URL http://www.r-project.org. Raghunathan TE, Xie D, Schenker N, Parsons V, Davis WW, Dodd K, Feuer EJ. (2007). Combining information from multiple surveys for small area estimation: a Bayesian approach. Journal of the American Statistical Association, 02, 474-486. Rao, JNK. (2003) Small Area Estimation, Wiley 04

Figure : Bayesian and GLMM Model Fittings for CA County Estimates Women 40+, Had Mammography Within Past 2 years 997 999 and 2000 2003 0.9 0.8 p 0.75 0.7 0.65 0.6 0.55 0.5 0.45 0.4 0.35 0.3 0.25 0.5 0 2 3 4 5 6 7 8 9 0 2 3 4 5 6 7 8 9 20 2 22 23 24 25 CA Counties Blocked by NHIS Sample Size Ordered by GLMM Fitted p Figure : Bayesian and GLMM Model Fittings for CA County Estimates Women 40+, Had Mammography Within Past 2 years 997 999 and 2000 2003 p 0.9 0.8 0.75 0.7 0.65 0.6 0.55 0.5 0.45 0.4 0.35 0.3 0.25 0.5 0 Vertical Lines NHIS Sample n = 0 n in (0,20) n in [20,50) n in [50,oo) BRFSS 97 99 BRFSS 00 03 NHIS 97 99 NHIS 00 03 GLMM 97 99 GLMM 00 03 Bayesian 97 99 Bayesian 00 03 26 27 28 29 30 3 32 33 34 35 36 37 38 39 40 4 42 43 44 45 46 47 48 49 50 5 52 53 54 55 56 57 58 CA Counties Blocked by NHIS Sample Size Ordered by GLMM Fitted p 05

CV Figure 2: Bayesian and GLMM Coefficients of Variation (CV) for Fittings for CA County Estimates Women 40+, Had Mammography Within Past 2 years 997 999 & 2000 2003 0.0 0.09 0.08 0.07 0.06 0.05 0.04 0.03 0.02 2 3 4 5 6 7 8 9 0 2 3 4 5 6 7 8 9 20 2 22 23 24 25 CA Counties Blocked by NHIS Sample Size Ordered by GLMM Fitted p Figure 2: Bayesian and GLMM Coefficients of Variation (CV) for Fittings for CA County Estimates Women 40+, Had Mammography Within Past 2 years 997 999 & 2000 2003 0.0 CV 0.09 0.08 0.07 0.06 0.05 0.04 0.03 GLMM 97 99 GLMM 00 03 Bayesian 97 99 Bayesian 00 03 Vertical Lines NHIS Sample n = 0 n in (0,20) n in [20,50) n in [50,oo) 0.02 26 27 28 29 30 3 32 33 34 35 36 37 38 39 40 4 42 43 44 45 46 47 48 49 50 5 52 53 54 55 56 57 58 CA Counties Blocked by NHIS Sample Size Ordered by GLMM Fitted p 06

Figure 3a: Bayesian Model: Percent of Women 40+ having Mammography in past 2 years, 997 999 % [35,55) [55,60) [60,65) [65,75) [75,00] Bayesian County Quartiles : 56%,6%, 65%, range: 35% 82% Figure 3b: GLMM Model: Percent of Women 40+ having Mammography in past 2 years, 997 999 % [35,55) [55,60) [60,65) [65,75) [75,00] GLMM County Quartiles : 55%,60%, 64%, range: 36% 78% 07