On Missing Data and Genotyping Errors in Association Studies

On Missing Data and Genotyping Errors in Association Studies Department of Biostatistics Johns Hopkins Bloomberg School of Public Health May 16, 2008

Specific Aims of our R01 1 Develop and evaluate new statistical methods to prioritize genes through proper ranking in genome-wide association (GWA) studies that address GxE interactions. 2 Develop and evaluate new statistical methods to localize causal genes as part of linkage and fine mapping studies while considering GxE interactions. 3 Develop and evaluate new statistical methods to identify higher order interactions between environmental variables and SNPs in candidate genes studies.

Specific Aims (cont.) 4 Adapt existing and develop new statistical methods to address imprecise and missing environmental and genetic measurements. 5 Develop and disseminate efficient algorithms for GxE analyses, and apply these methods in several ongoing genetic studies of complex diseases.

Missing Data There are mainly three types of missing / unobserved data in genetic association studies: 1 Missing observations in some environmental variables. 2 Missing data at SNPs selected for genotyping. 3 Genotypes of SNPs not selected.

Approaches for Missing Data The most common approach for dealing with missing data is to omit the observations that have missing records in the model s covariates. This approach can have several shortcomings, including: Loss of power. Bias in the parameter estimates. A good reference on this topic is Greenland and Finkle (1995). Some other used approaches are: To impute a value from the marginal distribution of the covariate. To create an extra level indicating missingness, if the covariate is a factor. These choices tend to be not so great either.

Approaches for Missing Data Multiple imputation can be used to draw valid statistical inference from data with missing values when the data are missing at random (Little and Rubin 1987, Schafer 1997). In essence, multiple imputation acknowledges the uncertainty due to missing data, instead of simply ignoring it: several complete data sets are generated, and the uncertainty in the model parameter estimates incorporates the standard errors of the parameter estimates as well as the variability between the parameter estimates from the replicate data sets. While the hypothesis of missing at random cannot formally be tested, it is a lot less stringent than the requirement of missing completely at random, which is the underlying assumption made when observations are omitted.

Missing Environmental Data Number of Pairs Odds Ratio Confidence Interval XPD Lys751Gln original data set 202 1.90 ( 1.20 3.00 ) multiple imputations 321 1.45 ( 1.00 2.10 ) XPD Gln751Gln original data set 202 2.18 ( 1.08 4.40 ) multiple imputations 321 1.31 ( 0.74 2.34 ) Positive Family History original data set 202 2.53 ( 1.43 4.50 ) multiple imputations 321 2.53 ( 1.58 4.03 )

Missing Environmental Data Family History not complete Family History complete AA AC CC na AA AC CC na raw numbers case 43 54 5 5 61 121 25 7 control 35 57 12 3 90 102 22 0 percentages case 40.2 50.5 4.7 4.7 28.5 56.5 11.7 3.3 control 32.7 53.3 11.2 2.8 42.1 47.7 10.3 0.0 Reference: Brewster AM et al (2006). Polymorphisms of the DNA Repair Genes XPD (Lys751Gln) and XRCC1 (Arg399Gln and Arg194Trp): Relationship to Breast Cancer Risk and Familial Predisposition to Breast Cancer. Breast Cancer Res Treat, 95(1): 73-80.

Missing Environmental Data 4 XPD Lys751Gln odds ratio 3 2 1 4 XPD Gln751Gln odds ratio 3 2 1 4 Family History odds ratio 3 2 1 1 2 3 4 5 6 7 8 9 10 combined original The missing data were imputed using decision trees. Reference: Dai J et al (2006). Imputation Methods to Improve Inference in SNP Association Studies. Genetic Epidemiology, 30(8): 690-702.

Why Become a Biostatistician? Because people appreciate your help analyzing their data, and that means that people surely will like you.

Tree-based Imputation Classification trees are great for categorical data!

Dummy Levels for Missingness 6 5 4 3 hazards ratio 2 1 0.5 SNP 1 SNP 2 SNP 3 SNP 4 SNP 5 SNP 6 SNP 7 SNP 8 SNP 9 SNP 10 SNP 11 SNP 12 SNP 13 Unpublished data.

Incorporating Genotype Uncertainty The confidence in genotype calls can differ substantially between SNPs! 4 Concordant 2 AA AB BB Discordant called BB (AB) Sense 0 2 4 4 2 0 2 4 4 2 0 2 4 Antisense

Missingness at Random? From the white paper, http://www.affymetrix.com/support/technical/product updates/brlmm algorithm.affx

Incorporating Genotype Uncertainty 100 [ truncated ] 80 60 40 20 0 log(ratio) Ratio 0.30 0.20 0.10 0.00 0.10 0.20 0.30 0.40 0.50 0.74 0.82 0.90 1.00 1.11 1.22 1.35 1.49 1.65 correlation(genotype,crlmm) / correlation(genotype,brlmm)

Incorporating Genotype Uncertainty Easy SNP Difficult SNP 100 100 95 95 Median Rank [ 10,000 simulations ] 90 85 80 True Genotype 90 85 80 75 CRLMM (continuous) CRLMM (called) 75 BRLMM (all) BRLMM (selected) 70 70 RR: θ 1.2 1.3 1.4 1.5 1.6 1.7 0.18 0.26 0.34 0.41 0.47 0.53 RR: θ 1.2 1.3 1.4 1.5 1.6 1.7 0.18 0.26 0.34 0.41 0.47 0.53

Incorporating Genotype Uncertainty 100 [ truncated ] 80 60 40 20 0 log(ratio) Ratio 0.02 0.01 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.98 0.99 1.00 1.01 1.02 1.03 1.04 1.05 1.06 correlation(genotype,crlmm) / correlation(genotype,crlmm[called])

Incorporating Genotype Uncertainty - CNVs 5.0 A D B C E 2.0 1.0 0.5

Incorporating Genotype Uncertainty - CNVs 5 4 3 Deletion Normal LOH Amplification A D B C E 2 1 Van ICE A D B E 3 2 1 Van ICE 50 51 52 53 54 55 69 70 71 72 73 74 174 176 178 Mb 238 240 242 244 246

A HapMap Sample 5 4 3 2 Deletion Normal LOH Amplification 1 Van ICE 100 150 190 200 5 4 3 2 1 Van ICE 135 140 143 145 150 Mb 149 149.5 150 150.5 151 151.5 152

Many HapMap Samples

De Novo Deletion