On Missing Data and Genotyping Errors in Association Studies

Similar documents
CS2220 Introduction to Computational Biology

The Loss of Heterozygosity (LOH) Algorithm in Genotyping Console 2.0

Appendix 1. Sensitivity analysis for ACQ: missing value analysis by multiple imputation

BST227 Introduction to Statistical Genetics. Lecture 4: Introduction to linkage and association analysis

Genome-wide copy-number calling (CNAs not CNVs!) Dr Geoff Macintyre

Statistical Analysis of Single Nucleotide Polymorphism Microarrays in Cancer Studies

Help! Statistics! Missing data. An introduction

Multiplex target enrichment using DNA indexing for ultra-high throughput variant detection

Large-scale identity-by-descent mapping discovers rare haplotypes of large effect. Suyash Shringarpure 23andMe, Inc. ASHG 2017

Global variation in copy number in the human genome

Introduction to the Genetics of Complex Disease

Missing data. Patrick Breheny. April 23. Introduction Missing response data Missing covariate data

Catherine A. Welch 1*, Séverine Sabia 1,2, Eric Brunner 1, Mika Kivimäki 1 and Martin J. Shipley 1

Copy Number Variations and Association Mapping Advanced Topics in Computa8onal Genomics

The Relative Performance of Full Information Maximum Likelihood Estimation for Missing Data in Structural Equation Models

What to do with missing data in clinical registry analysis?

GENOME-WIDE ASSOCIATION STUDIES

Selected Topics in Biostatistics Seminar Series. Missing Data. Sponsored by: Center For Clinical Investigation and Cleveland CTSC

Tutorial on Genome-Wide Association Studies

Structural Variation and Medical Genomics

Master thesis Department of Statistics

Recent advances in non-experimental comparison group designs

A COMPARISON OF IMPUTATION METHODS FOR MISSING DATA IN A MULTI-CENTER RANDOMIZED CLINICAL TRIAL: THE IMPACT STUDY

Supplementary Figure 1: Attenuation of association signals after conditioning for the lead SNP. a) attenuation of association signal at the 9p22.

UNIVERSITY OF CALIFORNIA, LOS ANGELES

LTA Analysis of HapMap Genotype Data

DOES THE BRCAX GENE EXIST? FUTURE OUTLOOK

Advanced Handling of Missing Data

Statistical data preparation: management of missing values and outliers

How should the propensity score be estimated when some confounders are partially observed?

Problem 3: Simulated Rheumatoid Arthritis Data

BIOSTATISTICAL METHODS AND RESEARCH DESIGNS. Xihong Lin Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA

Analysis methods for improved external validity

Genetics and Pharmacogenetics in Human Complex Disorders (Example of Bipolar Disorder)

Quantitative genetics: traits controlled by alleles at many loci

Carrying out an Empirical Project

Methods for Computing Missing Item Response in Psychometric Scale Construction

Missing Data in Longitudinal Studies: Strategies for Bayesian Modeling, Sensitivity Analysis, and Causal Inference

Human Genetics of Tuberculosis. Laurent Abel Laboratory of Human Genetics of Infectious Diseases University Paris Descartes/INSERM U980

Impact and adjustment of selection bias. in the assessment of measurement equivalence

Supplementary Figure 1 Dosage correlation between imputed and genotyped alleles Imputed dosages (0 to 2) of 2-digit alleles (red) and 4-digit alleles

New Enhancements: GWAS Workflows with SVS

Genetics and Genomics in Medicine Chapter 8 Questions

Instrumental Variables Estimation: An Introduction

Human population sub-structure and genetic association studies

Propensity scores and instrumental variables to control for confounding. ISPE mid-year meeting München, 2013 Rolf H.H. Groenwold, MD, PhD

2) Cases and controls were genotyped on different platforms. The comparability of the platforms should be discussed.

Missing Data and Imputation

Challenges of CGH array testing in children with developmental delay. Dr Sally Davies 17 th September 2014

The RoB 2.0 tool (individually randomized, cross-over trials)

Sequential nonparametric regression multiple imputations. Irina Bondarenko and Trivellore Raghunathan

In this module I provide a few illustrations of options within lavaan for handling various situations.

A Brief Introduction to Bayesian Statistics

Practical Statistical Reasoning in Clinical Trials

Research Strategy: 1. Background and Significance

Measuring association in contingency tables

Supplementary Figure 1. Principal components analysis of European ancestry in the African American, Native Hawaiian and Latino populations.

Replacing IBS with IBD: The MLS Method. Biostatistics 666 Lecture 15

White Paper Estimating Complex Phenotype Prevalence Using Predictive Models

Lecture II: Difference in Difference. Causality is difficult to Show from cross

Challenges of Observational and Retrospective Studies

Application of chromosomal radiosensitivity assays to temporary nuclear power plant workers

Quality Control Analysis of Add Health GWAS Data

Book review of Herbert I. Weisberg: Bias and Causation, Models and Judgment for Valid Comparisons Reviewed by Judea Pearl

Understanding DNA Copy Number Data

Applied Medical. Statistics Using SAS. Geoff Der. Brian S. Everitt. CRC Press. Taylor Si Francis Croup. Taylor & Francis Croup, an informa business

Introduction to LOH and Allele Specific Copy Number User Forum

Developing and evaluating polygenic risk prediction models for stratified disease prevention

Imputation approaches for potential outcomes in causal inference

Imputation classes as a framework for inferences from non-random samples. 1

Practical challenges that copy number variation and whole genome sequencing create for genetic diagnostic labs

THE USE OF NONPARAMETRIC PROPENSITY SCORE ESTIMATION WITH DATA OBTAINED USING A COMPLEX SAMPLING DESIGN

Getting ready for propensity score methods: Designing non-experimental studies and selecting comparison groups

Assessing Accuracy of Genotype Imputation in American Indians

Detection of aneuploidy in a single cell using the Ion ReproSeq PGS View Kit

Dan Koller, Ph.D. Medical and Molecular Genetics

By: Mei-Jie Zhang, Ph.D.

Rare Variant Burden Tests. Biostatistics 666

Introduction to Genetics and Genomics

Sensitivity Analysis in Observational Research: Introducing the E-value

Friday, September 9, :00-11:00 am Warwick Evans Conference Room, Building D Refreshments will be provided at 9:45am

MISSING DATA AND PARAMETERS ESTIMATES IN MULTIDIMENSIONAL ITEM RESPONSE MODELS. Federico Andreis, Pier Alda Ferrari *

Propensity Score Methods for Estimating Causality in the Absence of Random Assignment: Applications for Child Care Policy Research

Introduction to Observational Studies. Jane Pinelis

STATISTICS & PROBABILITY

November 9, Johns Hopkins School of Medicine, Baltimore, MD,

Summary & general discussion

Lec 02: Estimation & Hypothesis Testing in Animal Ecology

Nature Genetics: doi: /ng Supplementary Figure 1

NGS panels in clinical diagnostics: Utrecht experience. Van Gijn ME PhD Genome Diagnostics UMCUtrecht

Imaging Genetics: Heritability, Linkage & Association

Conditions. Name : dummy Age/sex : xx Y /x. Lab No : xxxxxxxxx. Rep Centre : xxxxxxxxxxx Ref by : Dr. xxxxxxxxxx

(true) Disease Condition Test + Total + a. a + b True Positive False Positive c. c + d False Negative True Negative Total a + c b + d a + b + c + d

P E R S P E C T I V E S

Review of Pre-crash Behaviour in Fatal Road Collisions Report 1: Alcohol

Title: A robustness study of parametric and non-parametric tests in Model-Based Multifactor Dimensionality Reduction for epistasis detection

Estimating heritability for cause specific mortality based on twin studies

Inferring causality in observational epidemiology: Breast Cancer Risk as an Example

Missing data in medical research is

Adjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data

Transcription:

On Missing Data and Genotyping Errors in Association Studies Department of Biostatistics Johns Hopkins Bloomberg School of Public Health May 16, 2008

Specific Aims of our R01 1 Develop and evaluate new statistical methods to prioritize genes through proper ranking in genome-wide association (GWA) studies that address GxE interactions. 2 Develop and evaluate new statistical methods to localize causal genes as part of linkage and fine mapping studies while considering GxE interactions. 3 Develop and evaluate new statistical methods to identify higher order interactions between environmental variables and SNPs in candidate genes studies.

Specific Aims (cont.) 4 Adapt existing and develop new statistical methods to address imprecise and missing environmental and genetic measurements. 5 Develop and disseminate efficient algorithms for GxE analyses, and apply these methods in several ongoing genetic studies of complex diseases.

Missing Data There are mainly three types of missing / unobserved data in genetic association studies: 1 Missing observations in some environmental variables. 2 Missing data at SNPs selected for genotyping. 3 Genotypes of SNPs not selected.

Approaches for Missing Data The most common approach for dealing with missing data is to omit the observations that have missing records in the model s covariates. This approach can have several shortcomings, including: Loss of power. Bias in the parameter estimates. A good reference on this topic is Greenland and Finkle (1995). Some other used approaches are: To impute a value from the marginal distribution of the covariate. To create an extra level indicating missingness, if the covariate is a factor. These choices tend to be not so great either.

Approaches for Missing Data Multiple imputation can be used to draw valid statistical inference from data with missing values when the data are missing at random (Little and Rubin 1987, Schafer 1997). In essence, multiple imputation acknowledges the uncertainty due to missing data, instead of simply ignoring it: several complete data sets are generated, and the uncertainty in the model parameter estimates incorporates the standard errors of the parameter estimates as well as the variability between the parameter estimates from the replicate data sets. While the hypothesis of missing at random cannot formally be tested, it is a lot less stringent than the requirement of missing completely at random, which is the underlying assumption made when observations are omitted.

Missing Environmental Data Number of Pairs Odds Ratio Confidence Interval XPD Lys751Gln original data set 202 1.90 ( 1.20 3.00 ) multiple imputations 321 1.45 ( 1.00 2.10 ) XPD Gln751Gln original data set 202 2.18 ( 1.08 4.40 ) multiple imputations 321 1.31 ( 0.74 2.34 ) Positive Family History original data set 202 2.53 ( 1.43 4.50 ) multiple imputations 321 2.53 ( 1.58 4.03 )

Missing Environmental Data Family History not complete Family History complete AA AC CC na AA AC CC na raw numbers case 43 54 5 5 61 121 25 7 control 35 57 12 3 90 102 22 0 percentages case 40.2 50.5 4.7 4.7 28.5 56.5 11.7 3.3 control 32.7 53.3 11.2 2.8 42.1 47.7 10.3 0.0 Reference: Brewster AM et al (2006). Polymorphisms of the DNA Repair Genes XPD (Lys751Gln) and XRCC1 (Arg399Gln and Arg194Trp): Relationship to Breast Cancer Risk and Familial Predisposition to Breast Cancer. Breast Cancer Res Treat, 95(1): 73-80.

Missing Environmental Data 4 XPD Lys751Gln odds ratio 3 2 1 4 XPD Gln751Gln odds ratio 3 2 1 4 Family History odds ratio 3 2 1 1 2 3 4 5 6 7 8 9 10 combined original The missing data were imputed using decision trees. Reference: Dai J et al (2006). Imputation Methods to Improve Inference in SNP Association Studies. Genetic Epidemiology, 30(8): 690-702.

Why Become a Biostatistician? Because people appreciate your help analyzing their data, and that means that people surely will like you.

Tree-based Imputation Classification trees are great for categorical data!

Dummy Levels for Missingness 6 5 4 3 hazards ratio 2 1 0.5 SNP 1 SNP 2 SNP 3 SNP 4 SNP 5 SNP 6 SNP 7 SNP 8 SNP 9 SNP 10 SNP 11 SNP 12 SNP 13 Unpublished data.

Incorporating Genotype Uncertainty The confidence in genotype calls can differ substantially between SNPs! 4 Concordant 2 AA AB BB Discordant called BB (AB) Sense 0 2 4 4 2 0 2 4 4 2 0 2 4 Antisense

Missingness at Random? From the white paper, http://www.affymetrix.com/support/technical/product updates/brlmm algorithm.affx

Incorporating Genotype Uncertainty 100 [ truncated ] 80 60 40 20 0 log(ratio) Ratio 0.30 0.20 0.10 0.00 0.10 0.20 0.30 0.40 0.50 0.74 0.82 0.90 1.00 1.11 1.22 1.35 1.49 1.65 correlation(genotype,crlmm) / correlation(genotype,brlmm)

Incorporating Genotype Uncertainty Easy SNP Difficult SNP 100 100 95 95 Median Rank [ 10,000 simulations ] 90 85 80 True Genotype 90 85 80 75 CRLMM (continuous) CRLMM (called) 75 BRLMM (all) BRLMM (selected) 70 70 RR: θ 1.2 1.3 1.4 1.5 1.6 1.7 0.18 0.26 0.34 0.41 0.47 0.53 RR: θ 1.2 1.3 1.4 1.5 1.6 1.7 0.18 0.26 0.34 0.41 0.47 0.53

Incorporating Genotype Uncertainty 100 [ truncated ] 80 60 40 20 0 log(ratio) Ratio 0.02 0.01 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.98 0.99 1.00 1.01 1.02 1.03 1.04 1.05 1.06 correlation(genotype,crlmm) / correlation(genotype,crlmm[called])

Incorporating Genotype Uncertainty - CNVs 5.0 A D B C E 2.0 1.0 0.5

Incorporating Genotype Uncertainty - CNVs 5 4 3 Deletion Normal LOH Amplification A D B C E 2 1 Van ICE A D B E 3 2 1 Van ICE 50 51 52 53 54 55 69 70 71 72 73 74 174 176 178 Mb 238 240 242 244 246

A HapMap Sample 5 4 3 2 Deletion Normal LOH Amplification 1 Van ICE 100 150 190 200 5 4 3 2 1 Van ICE 135 140 143 145 150 Mb 149 149.5 150 150.5 151 151.5 152

Many HapMap Samples

De Novo Deletion