Predictive statistical modelling approach to estimating TB burden
Sandra Alba, Ente Rood, Masja Straetemans and Mirjam Bakker
Overall aim, interim results
Overall aim of predictive models:
1. To enable predictions of TB incidence, prevalence and mortality, from 1990 to 2015, for a selection of countries for which the Task Force is mandated to produce estimates
2. To identify a set of conditions which warrant or do not warrant the use of these models
Focus of this presentation:
- Development of database and model structures to enable predictions for 2013
Approach to predictive modelling
Explanatory models: correctly capture causal pathways
Predictive models: maximize predictive power (training set)
Source: Shmueli, "To explain or to predict?"
Task 1: Incidence
Training set: use data from robust surveillance systems (mostly from high-income countries)
Predictions: incidence in middle/high income countries relying on indirect incidence estimates
Incidence: Training set vs. predictions
Task 2: Prevalence
Training set: use data from recent national TB prevalence surveys
Predictions: prevalence for countries where national surveys have not been implemented
- low/middle income countries
- with predicted prevalence of over 0.1%
Prevalence: Training set vs. predictions
Task 3: Mortality
Training set: use data from countries using vital registration systems to estimate mortality (mostly middle and high-income countries)
Predictions: mortality in countries without vital registration data (mostly low-income countries)
Mortality: Training set vs. predictions
Conceptual framework
TB outcome data
- TB case notification
- MDR TB
TB programmatic determinants
- Weak health system, case finding ability (suspects, diagnostics)
- Poor access to TB services, treatment outcomes
- Inappropriate health seeking behavior
Co-morbidities
- HIV
- Poor nutritional status
- Diabetes
- Lung diseases
BCG vaccination coverage in children
Socio environmental factors
- Weather: humidity and temperature
- High risk groups: prisoners, homeless people, migrants, drug addicts, refugees, IDP
- Urbanization: population density, poor water source and sanitation, crowded living conditions, poor ventilation, indoor air pollution
- Smoking, alcoholism
- Aging populations
Database compilation
Outcome variables
- National estimates: WHO TB database (Global TB data collection system)
- Subnational estimates (prevalence): survey reports/collaborators
Predictor variables (national level and subnational level)
- Tuberculosis Monitoring and Evaluation (TME)
- Global Health Repository (GHR)
- World Bank
- UNICEF reports (BCG prevalence)
- International Diabetes Federation
- National statistical agencies (e.g. census)
- Multiple Indicator Cluster Surveys (MICS)
- Demographic and Health Surveys (DHS)
- District level surveys (India)
- NTPs (TB notifications)
Database completeness
WHO TB estimates were produced in 2013 for 217 countries
- Countries with complete set of predictors in 2013: 166
Missing data imputation
1. Predictor data only available at set intervals (e.g. every 5 years)
- Missing values imputed using linear interpolation
- Start and end observations used as anchor points
2. Predictor data missing for the most recent years only
- Linear trend fitted to extrapolate the existing series
- Only if the fitted trend has R-sq > 90%
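The two imputation rules can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `interpolate_gaps` and `extrapolate_trend`, the use of `None` for missing values and the ordinary least squares trend fit are all assumptions made for the example.

```python
def interpolate_gaps(series):
    """Rule 1: fill gaps between observed values by straight-line interpolation.

    The observations on either side of a gap act as anchor points.
    Leading and trailing gaps are left untouched.
    """
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled


def extrapolate_trend(years, series, min_r2=0.90):
    """Rule 2: extend a series over its most recent missing years.

    A linear trend is fitted to the observed points (least squares, an
    assumption of this sketch) and used only if its R-squared exceeds
    min_r2. Otherwise the series is returned unchanged.
    """
    obs = [(x, y) for x, y in zip(years, series) if y is not None]
    if len(obs) < 3:
        return list(series)
    n = len(obs)
    mean_x = sum(x for x, _ in obs) / n
    mean_y = sum(y for _, y in obs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in obs)
    syy = sum((y - mean_y) ** 2 for _, y in obs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in obs)
    if sxx == 0 or syy == 0:
        return list(series)
    r2 = sxy ** 2 / (sxx * syy)
    if r2 <= min_r2:
        return list(series)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    last_obs = max(i for i, y in enumerate(series) if y is not None)
    return [
        intercept + slope * x if (y is None and i > last_obs) else y
        for i, (x, y) in enumerate(zip(years, series))
    ]
```

For a predictor observed only in 2005 and 2010, `interpolate_gaps` fills 2006 to 2009 on the line joining the two anchor points. `extrapolate_trend` leaves a series untouched when its trend is too noisy to pass the R-sq threshold.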
Before imputation
[Figure: data availability by source (World Bank, GHR, TME; climate, BCG, diabetes). Darker shades = data missing, lighter shades = data available.]
After imputation
[Figure: data availability by source after imputation (World Bank, GHR, TME; climate, BCG, diabetes).]
- Missing value imputation "successful" for World Bank and GHR data
- Still many missing covariates for 2013
Incidence: Model inputs and outputs
Outcome variable:
- Num: Estimated number of incident cases (all forms)
- Den: Estimated total population
Training set:
- First instance: 1,688 datapoints over 24 years (1990-2013)
- Final model: 213 datapoints with complete data
Predictions:
- 2013 estimates for 100 middle income + 6 high income countries
- Not all have complete data for predictions
[Figure: percentage of countries with complete covariate data out of all eligible countries for training set, per year (73 countries, 213 datapoints)]
Prevalence: Model inputs and outputs
Outcome variable:
- Bacteriologically confirmed (BC) TB prevalence
Training set:
- Country estimates from prevalence surveys conducted from 2007 onwards (standardised analysis methodology): 13 countries
- Subnational estimates from TB prevalence surveys: 5 countries
- India, district level prevalence survey estimates only: 2 districts
- Total: 30 datapoints
Predictions:
- 2013 estimates for 25 low and 49 middle income countries
  - without prevalence survey
  - with expected prevalence >0.1% according to WHO estimates
Mortality: Model inputs and outputs
Outcome variable:
- Num: Estimated number of deaths from TB (all forms, exc. HIV) from vital registration systems
- Den: Estimated total population
Training set:
- First instance: 3,022 datapoints over 24 years (1990-2013)
- Final model: 307 datapoints with complete data
Predictions:
- 2013 estimates for 11 high, 42 middle and 32 low income countries, and 6 countries with missing income status
- Not all have complete data for predictions
[Figure: percentage of countries with complete covariate data out of all eligible countries for training set, per year (128 countries, 307 datapoints)]
Selection criteria for predictors in model
1. Predictors selected based on completeness in training dataset
- <40% complete excluded for mortality and incidence models
- <100% complete excluded for prevalence models
2. Univariate relationships: complete predictors vs. outcome
3. Pairwise correlations
- Identify highly correlated predictors (> 0.8)
- Drop the predictor with the lowest relative fit vs. outcome
Incidence: Reduction of variables from 166 to 37
Prevalence: Reduction of variables from 166 to 30
Mortality: Reduction of variables from 166 to 54
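A rough sketch of the three screening steps is given here in Python. The function name `select_predictors`, the `None` convention for missing values, and the use of the absolute Pearson correlation with the outcome as a stand-in for "relative fit" are assumptions of this example. The slides do not specify how fit was measured.

```python
import math


def pearson(x, y):
    """Pearson correlation over the pairs where both values are observed."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    return sxy / math.sqrt(sxx * syy)


def select_predictors(predictors, outcome, min_complete=0.40, max_corr=0.80):
    """Screen predictors given as a dict of name -> list of values.

    Step 1: drop predictors whose share of observed values is below
            min_complete.
    Step 2: score each remaining predictor by |correlation| with the outcome.
    Step 3: walk through predictors from best to worst score and keep one
            only if it is not highly correlated with any predictor already
            kept, so the weaker member of each correlated pair is dropped.
    """
    complete = {
        name: values
        for name, values in predictors.items()
        if sum(v is not None for v in values) / len(values) >= min_complete
    }
    fit = {name: abs(pearson(values, outcome)) for name, values in complete.items()}
    selected = []
    for name in sorted(complete, key=fit.get, reverse=True):
        if all(abs(pearson(complete[name], complete[s])) <= max_corr for s in selected):
            selected.append(name)
    return selected
```

Setting `min_complete=1.0` reproduces the stricter completeness rule described for the prevalence models.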
Selected predictors (all models combined)
TB outcome data
- All new cases and relapse cases with unknown treatment history *
- New all forms cases/rate (+ lag 1/lag 2y) *
- New laboratory confirmed cases/rate (+ lags) *
- All notified cases *
- % MDR among new cases
Co-morbidities
- TB patients recorded as HIV positive
- % HIV positive among all patients notified
- Prevalence of diabetes
TB programmatic determinants
- Total expenditure on health/expressed as % GDP
- TB patients with HIV test result in TB register
- All previously treated TB cases
- Percentage retreated out of all cases
- Treatment success rate in new and retreatment cases
- Number of laboratories providing tuberculosis diagnostic services using sputum smear microscopy
- BCG vaccination rates in children
Socio environmental factors
- Gross National Income per capita
- Life expectancy at birth (overall, M, F)
- Life expectancy at 60 (overall, M, F)
- Total population (overall, M, F), sex ratio
- % population 15 or younger/60 or older
- Maternal mortality ratio, under-five mortality
- Percentage urban population, population density
- % population with access to improved water/sanitation
- Average temperature in the coldest/warmest month
- Average temperature
- Average precipitation
* Not included in incidence model
Model selection approach
Mortality and incidence:
- GLM with Poisson, negative binomial and zero-inflated distributions
Prevalence:
- GLM with binomial (logistic link) and negative binomial distributions
Final multivariate model selected based on the Akaike Information Criterion (AIC, likelihood based)
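To illustrate how AIC separates candidate distributions, the sketch below compares an intercept-only Poisson model with an intercept-only negative binomial model on a small set of counts. It assumes AIC = 2k - 2 log L, a mean/size parameterisation of the negative binomial, and a crude grid search for the size parameter. The data, function names and fitting shortcut are illustrative only and do not reproduce the models in this presentation.

```python
import math


def poisson_loglik(counts, mu):
    """Log-likelihood of counts under a Poisson distribution with mean mu."""
    return sum(y * math.log(mu) - mu - math.lgamma(y + 1) for y in counts)


def negbin_loglik(counts, mu, size):
    """Log-likelihood under a negative binomial with mean mu and
    variance mu + mu**2 / size."""
    return sum(
        math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
        + size * math.log(size / (size + mu))
        + y * math.log(mu / (size + mu))
        for y in counts
    )


def aic(loglik, n_params):
    """Akaike Information Criterion: lower values indicate a better trade-off
    between fit and number of parameters."""
    return 2 * n_params - 2 * loglik


def compare_by_aic(counts):
    """Return the AIC of intercept-only Poisson and negative binomial fits."""
    mu = sum(counts) / len(counts)  # maximum likelihood mean for both models
    if mu <= 0:
        raise ValueError("counts must contain at least one positive value")
    # Crude grid search over the size parameter (0.05 to 100), an
    # illustrative shortcut rather than a proper optimiser.
    sizes = [0.05 * i for i in range(1, 2001)]
    best_size = max(sizes, key=lambda s: negbin_loglik(counts, mu, s))
    return {
        "poisson": aic(poisson_loglik(counts, mu), 1),
        "negative binomial": aic(negbin_loglik(counts, mu, best_size), 2),
    }
```

With strongly overdispersed counts the negative binomial model pays for its extra parameter and still attains the lower AIC, which is the pattern that would favour it over Poisson.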
Final model: incidence (negative binomial)
Model predictor                                 Coefficient (log scale)  Strength
(Intercept)                                     -8.02000
Percentage retreatment (out of all cases)        0.02368                     23.7
% population 60 yrs +                            0.02053                     20.5
% MDR out of all new cases                       0.00466                      4.7
Average temperature in warmest month             0.00311                      3.1
TB patients recorded as HIV positive             0.00141                      1.4
All previously treated cases                     0.00054                      0.5
Average precipitation                            0.00027                      0.3
GNI                                              0.00001                      0.0
TB patients with HIV test result in TB register -0.00005                     -0.1
Total expenditure on health                     -0.00008                     -0.1
Life expectancy at 60 in males                  -0.08271                    -82.7
Government expenditure on health (% GDP)        -0.10510                   -105.1
Diabetes prevalence                             -2.31000                  -2310.0
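Because the incidence coefficients are on the log scale, exponentiating a coefficient gives the multiplicative change in the predicted rate for a one-unit increase in that predictor, holding the others fixed. The short Python sketch below applies this to a few coefficients copied from the incidence model. The helper names are hypothetical, and the units of each predictor are not stated in the slides, so "one unit" is in whatever units the predictor was recorded in.

```python
import math

# Coefficients (log scale) copied from the final incidence model.
coefficients = {
    "Percentage retreatment (out of all cases)": 0.02368,
    "% population 60 yrs +": 0.02053,
    "Life expectancy at 60 in males": -0.08271,
    "Government expenditure on health (% GDP)": -0.10510,
}


def rate_ratio(coef, delta=1.0):
    """Multiplicative change in the predicted rate when the predictor
    increases by delta units (log-link model)."""
    return math.exp(coef * delta)


def percent_change(coef, delta=1.0):
    """The same effect expressed as a percentage change in the rate."""
    return (rate_ratio(coef, delta) - 1.0) * 100.0


if __name__ == "__main__":
    for name, coef in coefficients.items():
        print(f"{name}: rate ratio {rate_ratio(coef):.3f} "
              f"({percent_change(coef):+.1f}% per unit)")
```

A positive coefficient therefore gives a rate ratio above 1 and a negative coefficient a ratio below 1. The size of the effect depends on the predictor's scale, so coefficients are not directly comparable across predictors measured in different units.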
Final model: incidence
[Figure, two panels: predicted vs. observed rate (log scale); predicted vs. observed rate]
Final model: prevalence (binomial logistic)
Model predictor                                 Coefficient (log scale)  Strength
(Intercept)                                     -3.03588
Climate score                                    0.16039                      160
New laboratory confirmed rate                    0.00812                        8
BCG coverage                                    -0.03610                      -36
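For a binomial model with a logistic link, a predicted prevalence is obtained by applying the inverse logit to the linear predictor. The Python sketch below uses the coefficients from the prevalence model. The covariate values passed in the usage note are hypothetical, and their units (climate score scale, notification rate, BCG coverage in percent) are assumptions made only to show the arithmetic.

```python
import math

# Coefficients copied from the final prevalence model.
INTERCEPT = -3.03588
COEFFICIENTS = {
    "climate_score": 0.16039,
    "new_lab_confirmed_rate": 0.00812,
    "bcg_coverage": -0.03610,
}


def inverse_logit(eta):
    """Map a linear predictor on the logit scale to a proportion in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def predicted_prevalence(covariates):
    """Predicted prevalence for one country or district.

    covariates: dict with the same keys as COEFFICIENTS.
    """
    eta = INTERCEPT + sum(COEFFICIENTS[name] * covariates[name] for name in COEFFICIENTS)
    return inverse_logit(eta)
```

As a usage example under those assumptions, `predicted_prevalence({"climate_score": 0.0, "new_lab_confirmed_rate": 50.0, "bcg_coverage": 90.0})` gives a linear predictor of about -5.88 and a predicted prevalence of roughly 0.28%.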
Final model: prevalence
[Figure, two panels: predicted vs. observed prevalence (logistic scale); predicted vs. observed prevalence]
Final model: mortality (negative binomial)
Model predictor                                 Coefficient (log scale)  Strength
(Intercept)                                    -10.57740
New all forms rate                               0.71356                      714
BCG coverage                                     0.17597                      176
TB patients recorded as HIV positive             0.16718                      167
Percentage retreatment (out of all cases)        0.15697                      157
Precipitation                                    0.11985                      120
% MDR out of all new cases                       0.05991                       60
Diabetes prevalence                             -0.09344                      -93
Government expenditure on health                -0.13399                     -134
Urban Population                                -0.14584                     -146
Treatment success rate for re-treatment cases   -0.21212                     -212
Final model: mortality
[Figure, two panels: predicted vs. observed rate (log scale); predicted vs. observed rate]
Model fit: deviance residuals
[Figure: deviance residuals for the incidence, prevalence and mortality models]
Model validation: cross validation
1. Split data randomly into k partitions
2. For each partition, fit the specified model using the other k-1 groups
3. Pseudo-R-sq: square of the correlation between predicted and observed values
[Diagram, training set with k=2: develop model in one subset, check predicted vs. observed in the other subset]
- Incidence: k=5, R-sq=0.94
- Prevalence: k=2, x5, R-sq=0.76
- Mortality: k=5, R-sq=0.89
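The cross-validation procedure can be sketched as follows in Python. A one-predictor least squares line stands in for the GLMs used in this work, and the pseudo-R-sq is computed once from the pooled out-of-fold predictions. Both choices, along with the function names, are assumptions of this sketch rather than a description of the actual analysis.

```python
import math
import random


def fit_line(xs, ys):
    """Least squares line, used here as a stand-in for the fitted model."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)


def cross_validated_r2(xs, ys, k=5, seed=0):
    """k-fold cross validation returning the pseudo-R-squared.

    1. Split the data randomly into k partitions.
    2. For each partition, fit the model on the other k-1 partitions and
       predict the held-out observations.
    3. Square the correlation between predicted and observed values.
    """
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    predicted = [None] * len(xs)
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        intercept, slope = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in fold:
            predicted[i] = intercept + slope * xs[i]
    return pearson(predicted, ys) ** 2
```

Repeating the call with different `seed` values and averaging the results would correspond to a repeated scheme such as the "k=2, x5" setting reported for prevalence, assuming that notation means five repetitions of a two-fold split.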
Discussion
Predictive models could be fitted for all three tasks
- Goodness of fit satisfactory
- Further refinement of models and database necessary before predictions can be made
Incidence and mortality:
- Include random effects for countries or income status
- Build one model just on high income countries and one just on middle income countries, and compare variables/coefficients
Include time lagged variables:
- So far only lagged TB notification rates (1 and 2 years) have been included
New predictor data recently compiled:
- Number of large cities with >500,000 and >1 million inhabitants
- Prevalence of prisoners (UNDP), migrants (World Bank), drug use (UNODC), refugees and displaced populations (UNHCR)
Thank you