Genetic association analysis incorporating intermediate phenotypes information for complex diseases

University of Iowa
Iowa Research Online
Theses and Dissertations, Fall 2011

Genetic association analysis incorporating intermediate phenotypes information for complex diseases
Yafang Li, University of Iowa

Copyright 2011 Yafang Li

This dissertation is available at Iowa Research Online: http://ir.uiowa.edu/etd/2739

Recommended Citation
Li, Yafang. "Genetic association analysis incorporating intermediate phenotypes information for complex diseases." PhD (Doctor of Philosophy) thesis, University of Iowa, 2011. http://ir.uiowa.edu/etd/2739.

GENETIC ASSOCIATION ANALYSIS INCORPORATING INTERMEDIATE PHENOTYPES INFORMATION FOR COMPLEX DISEASES by Yafang Li An Abstract Of a thesis submitted in partial fulfillment of the requirements for the Doctor of Philosophy degree in Statistical Genetics in the Graduate College of The University of Iowa December 2011 Thesis Supervisor: Professor Jian Huang

ABSTRACT

Genome-wide association (GWA) studies have been successfully applied to detect susceptibility loci for complex diseases, but most of the identified variants have a large to moderate effect and explain only a limited proportion of the heritability of these diseases. It is believed that the majority of the latent risk alleles have very small effects that are difficult to identify, and a GWA study may have inadequate power to detect such small-effect variants. Researchers often collect other phenotypic information in addition to disease status to maximize the output from a study. Some of these phenotypes can lie on the pathway to the disease, i.e., they are intermediate phenotypes. Statistical methods based on both the disease status and an intermediate phenotype should be more powerful than a case-control analysis alone because they incorporate more information. Meta-analysis has been used in genetic association analysis for many years to combine information from multiple populations, but it has never been used in a single-population GWA study.

In this study, simulations were conducted, and the results show that when an intermediate phenotype is available, a meta-analysis incorporating the disease status and intermediate phenotype information from a single population has more power than a case-control analysis alone in a GWA study of complex diseases, especially for identifying loci that have a very small effect. Compared with Fisher's method, the modified inverse variance weighted meta-analysis method is more robust, as it is more powerful and has a lower type I error rate at the same time. It therefore provides a potent approach for detecting susceptibility loci associated with complex diseases, especially latent loci whose effects are very small.

In the meta-analysis of lung cancer with smoking data, the results replicate the signal in the CHRNA3 and CHRNA5 genes on chromosome 15q25. New signals in CYP2F1 on chromosome 19, SUMF1 on chromosome 3, and ARHGAP10 on chromosome 4 are also detected. The CYP2F1 gene, which is close to the known cigarette-induced lung cancer gene CYP2A6, is very possibly another cytochrome P450 (CYP) gene related to smoking-involved lung cancer. The meta-analysis of rheumatoid arthritis with anti-cyclic citrullinated peptide (anti-CCP) data identified new signals on 9q24 and 16q12. There is evidence that these two regions are involved in other autoimmune diseases and that different autoimmune/inflammatory diseases may share the same genetic susceptibility loci.

Both the theoretical and empirical studies show that the modified inverse variance weighted meta-analysis method is robust and is a potent approach for detecting susceptibility loci associated with complex diseases when an intermediate phenotype is available.


Graduate College The University of Iowa Iowa City, Iowa CERTIFICATE OF APPROVAL PH.D. THESIS This is to certify that the Ph.D. thesis of Yafang Li has been approved by the Examining Committee for the thesis requirement for the Doctor of Philosophy degree in Statistical Genetics at the December 2011 graduation. Thesis Committee: Jian Huang, Thesis Supervisor Deborah Dawson Kai Wang Trudy Burns Ying Zhang

ACKNOWLEDGEMENTS

I would like to thank my advisor, Jian Huang, who suggested this challenging and interesting project for me to work on. He provided me with valuable guidance throughout this work and encouraged me to become an independent researcher. It has been a privilege to benefit from his vast knowledge of both statistics and genetics. I am also grateful to Dr. Amos for the lung cancer and rheumatoid arthritis data he provided; his comments greatly improved this thesis and are much appreciated. I would like to express my gratitude to my husband for his constant support and encouragement during the past years. This work would not have been possible without his contribution to our family. I am also greatly indebted to my parents for the support they provided to my family. They can always be counted on whenever I need help. Finally, I would like to thank my little boys, Zachary and Ryan, for accompanying me throughout this journey and for the tremendous joy they bring to my life.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER

1 INTRODUCTION
  1.1 Genome-wide association (GWA) studies
    1.1.1 Complex diseases
    1.1.2 The missing heritability of complex diseases
  1.2 Estimation of effect size distribution from GWA studies
  1.3 Intermediate phenotype
  1.4 Quantitative trait association analysis
  1.5 Meta-analysis in GWA study
  1.6 Hypothesis
2 STATISTICAL METHODS FOR TESTING ASSOCIATION
  2.1 Logistic regression
  2.2 Linear regression
  2.3 Meta-analysis within a population
    2.3.1 Fisher's method
    2.3.2 Modified inverse variance weighted method
3 SIMULATION
  3.1 Disease models in the simulations
  3.2 Simulation
  3.3 Analysis of simulated data
    3.3.1 Common allele with medium risk
    3.3.2 Common variant with low risk
    3.3.3 Rare variant with high risk
    3.3.4 Quantitative trait available in only cases
    3.3.5 Type I error rate
4 APPLICATION IN LUNG CANCER DATA
  4.1 Introduction
  4.2 Lung cancer with smoking data
    4.2.1 Introduction
    4.2.2 Quantitative trait available in lung cancer
    4.2.3 Lung cancer with cigarettes per day
    4.2.4 Lung cancer with smoking years
    4.2.5 Lung cancer with pack year
    4.2.6 Imputed data analysis
      4.2.6.1 Cigarettes per day
      4.2.6.2 Smoking years
    4.2.7 Summary
5 APPLICATION IN RHEUMATOID ARTHRITIS DATA
  5.1 Introduction
  5.2 Rheumatoid arthritis and anti-CCP data
6 FUTURE DIRECTION

REFERENCES

LIST OF TABLES

Table
1.1 Heritability and number of loci for several complex traits
1.2 Additional number of susceptibility loci expected to be discovered from a single-stage GWAS with increasing sample sizes
1.3 Properties for a diallelic locus under random mating
1.4 Value of k and genetic models
1.5 No. of sib pairs required to detect linkage and association
1.6 Approximate sample sizes needed to detect a significantly increased allelic odds ratio
3.1 Haplotype frequency at disease locus (A) and SNP marker (M)
3.2 Genotype frequency at disease locus (A) and SNP marker (M)
3.3 Value of X at different genetic model
3.4 Parameters for medium risk model
3.5 Parameters for low risk model
3.6 Parameters for rare variant with high risk
3.7 Type I error rate for medium risk model
3.8 Type I error rate for low risk model
3.9 Type I error rate for rare variant with high risk
4.1 Summary information for selected SNPs at 15q25
4.2 Characteristics of study population
4.3 CPD: summary of association analysis
4.4 Smoking years: summary of association analysis
4.5 Pack years: summary of association analysis
4.6 Number of SNPs: genotyped vs imputed
4.7 Imputed data+cpd: SUMF1 on chromosome 3
4.8 Imputed data+cpd: CHRNA3 on chromosome 15
4.9 Imputed data+cpd: CYP2F1 on chromosome 19
4.10 Imputed data+smoking years: chromosome 1
4.11 Imputed data+smoking years: chromosome 4
4.12 Imputed data+smoking years: chromosome 5
4.13 Imputed data+smoking years: chromosome 10
5.1 Summary of association analysis for RA

LIST OF FIGURES

Figure
1.1 Cancer-associated genetic variants identified through GWA studies
1.2 Sample sizes required for genetic association studies
1.3 Allele frequency and effect sizes for genetic variants associated with breast cancer
1.4 Cumulative fraction of genetic variance explained by 71 Crohn's disease risk loci
1.5 Nonparametric estimates for distributions of effect sizes for susceptibility loci
1.6 Putative endophenotypes implicated in schizophrenia research
1.7 Intermediate phenotype for breast cancer
1.8 Regional association plot for ZNF365 across a 300-kb window
3.1 Null and alternative disease models in simulation
3.2 Power plot for medium risk model
3.3 Power plot for low risk model
3.4 Power plot for rare variant with high risk
3.5 Power plot for medium risk model (cases)
3.6 Power plot for low risk model (cases)
3.7 Power plot for rare variant with high risk (cases)
4.1 Allele frequency and effect sizes for genetic variants associated with lung cancer
4.2 Genome-wide smoking quantity meta-analysis
4.3 Genome-wide association analysis using Illumina 300K HumanHap v1.1 Beadchips
4.4 Descriptive plot of quantitative trait before transformation
4.5 Descriptive plot of quantitative traits after outlier removal and transformation
4.6 Association analysis of CPD
4.7 CPD: Fisher's method
4.8 CPD: Modified inverse variance weighted method
4.9 CPD: -log10(p) comparison between tests
4.10 Association analysis of smoking years
4.11 Smoking years: Fisher's method
4.12 Smoking years: Modified inverse variance weighted method
4.13 Smoking years: -log10(p) comparison between tests
4.14 Association analysis of pack years
4.15 Pack years: Fisher's method
4.16 Pack years: Modified inverse variance weighted method
4.17 Pack years: -log10(p) comparison between tests
4.18 Imputed data+cpd: Case-control study
4.19 Imputed data+cpd: Fisher's method
4.20 Imputed data+cpd: Modified inverse variance weighted method
4.21 Imputed data+smoking years: Case-control study
4.22 Imputed data+smoking years: Fisher's method
4.23 Imputed data+smoking years: Modified inverse variance weighted method
5.1 Descriptive plot of anti-CCP data in cases
5.2 Case-control analysis of rheumatoid arthritis
5.3 Association analysis of anti-CCP
5.4 Fisher's method
5.5 Modified inverse variance weighted method
5.6 -log10(p) comparison between tests

CHAPTER 1
INTRODUCTION

1.1 Genome-wide association (GWA) studies

1.1.1 Complex diseases

Complex diseases, such as asthma, heart disease, diabetes, mental illness, and cancer, account for the majority of human diseases. These diseases have a relatively high incidence rate. For example, hypertension has an incidence rate of 1 in 3 adults in the US [1], type 2 diabetes 1 in 10 [2], and coronary artery disease 1 in 17 [3]. Because they are so common among humans, complex diseases are also called common diseases. Unlike Mendelian diseases, which are caused by a mutation in a single gene, common diseases are determined by a number of genetic factors, each of which has only a minor effect on the development of the disease. To complicate matters further, environmental effects also influence the manifestation of these diseases. The multiple genetic factors, the environmental factors, and the interactions among them work together in the development of common diseases, so a complex disease is also called a multifactorial disease.

Type 2 diabetes is a good example of the complicated etiology of complex diseases. It results from the combined effect of lifestyle and genetic factors. A total of 38 genetic regions are known to be associated with type 2 diabetes, and other regions may remain unknown. Most of the genes associated with diabetes have an odds ratio from 1.1 to 1.2 [4, 5]. Lifestyle is also an important factor in the development of type 2 diabetes. People who have a high level of physical activity, eat a healthy diet, do not smoke, and consume alcohol in moderation have a much lower rate of diabetes [6].

A complex disease is affected by numerous genes, each of which contributes only slightly to the disease. Traditional linkage analysis loses power when applied to complex diseases because of their complicated inheritance pattern and the limited number of recombination events available in most human pedigrees [7]. Genetic association studies have been used for many years to find genetic variants that are associated with disease. They work when there is linkage disequilibrium (LD) between the disease gene and the genetic marker. LD is the non-random association of alleles at two or more loci, which can be caused by close linkage, non-random mating, population structure, and other factors. In LD mapping, the goal is to find a susceptibility gene that is so closely linked with a marker allele that the two cannot be separated by recombination. A high-density genetic map is usually required for LD mapping. The revolutionary development of microarray genotyping allows markers to be typed throughout the genome in a large cohort, which enables large-scale genome-wide association studies of complex diseases.

In a GWA study, marker allele frequencies are compared between cases and controls to see whether certain genetic variants are significantly more frequent in cases than in controls. The associated variants may not directly cause the disease themselves, but they serve as powerful pointers to the region of the human genome where the disease gene resides. Researchers have achieved considerable success in GWA studies of complex diseases. In 2007, a GWA study of colorectal cancer was conducted by genotyping 550,000 tag SNP markers in 930 cases and 960 controls. SNP rs6983267 on chromosome 8q was found to be associated with colorectal cancer. The replication study confirmed this susceptibility locus, with ORs of 1.27 and 1.47 for heterozygotes and rare homozygotes, respectively [8]. Scott and her group (2007) genotyped 1161 cases and 1174 controls with more than 315,000 SNP markers and identified a type 2 diabetes associated variant on chromosome 11 with an OR of 1.25 [4]. In 2008, researchers combined data from three independent Crohn's disease studies with a total of 3230 cases and 4829 controls and found significant evidence for 21 new loci associated with the disease; most of these loci had a small OR between 1.08 and 1.2 [5]. Most of these genes were not known to be involved in the diseases prior to the GWA studies. The GWA study is an effective tool for identifying genetic factors associated with complex diseases and has contributed substantially to our understanding of the underlying disease mechanisms. Figure 1.1 displays cancer-associated genetic variants identified through GWA studies up to 2010 [9].
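As a minimal sketch of the case-control comparison described above, the following Python snippet computes an allelic odds ratio and a 1-df Pearson chi-square test from a 2x2 table of allele counts. The function name `allelic_test` and the allele counts are hypothetical and chosen only for illustration; they are not taken from any of the studies cited here, and real GWA analyses typically add quality control and covariate adjustment.

```python
import math


def allelic_test(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Allelic odds ratio and 1-df Pearson chi-square test for a 2x2 allele table."""
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    # Odds of carrying the alternative allele in cases relative to controls
    odds_ratio = (case_alt * ctrl_ref) / (case_ref * ctrl_alt)
    # Pearson chi-square for a 2x2 table, without continuity correction
    chi2 = n * (case_alt * ctrl_ref - case_ref * ctrl_alt) ** 2 / (
        (case_alt + case_ref)
        * (ctrl_alt + ctrl_ref)
        * (case_alt + ctrl_alt)
        * (case_ref + ctrl_ref)
    )
    # Upper tail of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return odds_ratio, chi2, p_value


# Hypothetical counts: 1000 cases and 1000 controls, i.e. 2000 alleles per group
odds_ratio, chi2, p_value = allelic_test(
    case_alt=700, case_ref=1300, ctrl_alt=600, ctrl_ref=1400
)
print(f"OR = {odds_ratio:.3f}, chi2 = {chi2:.2f}, p = {p_value:.2e}")
```

In a genome-wide scan this test would be repeated for every marker, which is why a stringent genome-wide significance threshold is applied to the resulting p-values.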

Figure 1.1: Cancer-associated genetic variants identified through GWA studies. Genetic variants were identified from the NHGRI Genome-wide Association Study catalog (www.genome.gov/gwastudies) and include all cancer associations at $P < 5 \times 10^{-8}$ through 2010 [Hindorff et al., 2011].

1.1.2 The missing heritability of complex diseases

Both genetic and environmental influences play an important role in the development of complex traits. Researchers try to separate genetic and environmental effects and to quantify the genetic influence on a specific trait. Heritability is used to measure how much impact genetics has on a trait. The concept of heritability was introduced by Sewall Wright nearly a century ago [10]. It is defined as the proportion of phenotypic variation attributable to genetic variance. Heritability analysis is a widespread method for measuring the strength of the genetic effect on a particular trait, with the goal of finding how much of the variance of a phenotype can be attributed to genetic variance. In heritability analysis, it is assumed that the phenotypic variance ($V_P$) of a trait in a population can be expressed as the sum of a genetic variance component ($V_G$) and an environmental variance component ($V_E$), so that

$$V_P = V_G + V_E, \qquad V_G = V_A + V_D + V_I \tag{1.1}$$

where $V_A$, $V_D$ and $V_I$ are the additive variance, the dominance variance, and the interaction (epistatic) variance, respectively. The broad sense heritability is given by the ratio of the total genetic variance to the phenotypic variance:

$$H^2 = V_G / V_P \tag{1.2}$$
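To make Eqs. (1.1) and (1.2) concrete, the short sketch below computes the variance components and the resulting ratios for made-up values. The function name `heritability` and all numbers are illustrative assumptions, not estimates for any real trait; the additive-only ratio $V_A / V_P$ is included because it corresponds to the narrow sense heritability defined next in the text.

```python
def heritability(v_a, v_d, v_i, v_e):
    """Variance decomposition of Eq. (1.1) and the heritability ratios built on it."""
    v_g = v_a + v_d + v_i  # total genetic variance
    v_p = v_g + v_e        # phenotypic variance
    return {
        "V_G": v_g,
        "V_P": v_p,
        "H2": v_g / v_p,  # broad sense heritability, Eq. (1.2)
        "h2": v_a / v_p,  # additive share of the phenotypic variance
    }


# Hypothetical variance components in arbitrary squared trait units
result = heritability(v_a=3.0, v_d=1.0, v_i=0.5, v_e=5.5)
print(result)
```

Because $V_A \le V_G$, the additive ratio can never exceed the broad sense heritability for the same trait.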

The narrow sense heritability is based only on the additive component of the genetic variance:

$$h^2 = V_A / V_P \tag{1.3}$$

Broad sense heritability is used when the total genetic contribution to a trait, such as a psychological trait, is of interest. In association analysis, narrow sense heritability is used more often, because the aim is to pinpoint the specific loci associated with the disease and the additive model is easy to interpret, especially for traits with quantitative characteristics.

Although GWA studies have identified hundreds of common variants associated with complex diseases, providing valuable insights into the genetic architecture of human diseases, most of the variants confer relatively small risk and explain only a small proportion of the heritability of most complex traits. For example, the 18 variants associated with type 2 diabetes have minor allele frequencies (MAFs) ranging from 0.073 to 0.50 and odds ratios (ORs) ranging from 1.05 to 1.15, except for the TCF7L2 gene, which has an OR of 1.37. Altogether, they account for less than 4% of the estimated heritability. This implies that there are many more susceptibility loci to be identified. It is estimated that more than 800 genetic variants are required to explain the 40% heritability, assuming they have MAFs and ORs similar to those that have been identified [11]. Table 1.1 lists the number of variants and the proportion of heritability they explain for some complex diseases. The number of variants associated with each disease ranges from 4 to 40, and the heritability they explain ranges from 1.5% to 20%. The average heritability attributed to a single locus is very small, ranging from 0.22% to 2.50%.

Table 1.1: Heritability and number of loci for several complex traits [Manolio et al., 2009].

| Disease | Number of loci | Heritability explained | Heritability explained by each locus |
|---|---|---|---|
| Crohn's disease | 32 | 20% | 0.62% |
| Systemic lupus erythematosus | 6 | 15% | 2.50% |
| Type 2 diabetes | 18 | 4% | 0.22% |
| HDL cholesterol | 7 | 5.2% | 0.74% |
| Height | 40 | 15% | 0.38% |
| Early onset myocardial infarction | 9 | 2.8% | 0.31% |
| Fasting glucose | 4 | 1.5% | 0.38% |

For most complex diseases, less than 10% of the genetic variance is explained by the identified common variants, leaving the bulk of the heritability unexplained [12]. There are several reasons for the missing heritability. The first is that most of the associated variants identified so far have a small odds ratio, around 1.1 for the heterozygous genotype and 1.5-1.6 for the homozygous genotype. The latent variants should have even smaller effects, which are difficult to identify with a limited sample size. For example, in 2010 the number of loci associated with Crohn's disease increased to 71; 14 of them have an OR between 1.2 and 1.5, and 54 of them have an OR less than 1.2 [13]. Figure 1.2 shows the relationship between effect size, sample size, and minor allele frequency of the risk allele. To detect a variant with an odds ratio of 1.1 and a MAF of 0.3 with 90% power, a sample size of 60,000 (30,000 cases and 30,000 controls) is required [14]. This is a large sample size even when meta-analysis is considered. Many variants with small effect sizes may still remain to be discovered because of the limited sample sizes of current GWA studies.

Another possibility is that some of the missing heritability is contributed by rare variants. The tagging SNPs in current GWA studies are not designed to capture rare variants, so those variants will be missed. For example, the two rare, high-penetrance genes BRCA1 and BRCA2 account for about 10% of breast cancer cases (Fig 1.3). A similar proportion is attributed to SNPs identified through GWA studies [9]. Low-frequency variants with a 2-3 fold increase in risk could contribute substantially to the missing heritability without demonstrating clear Mendelian segregation. It has been reported that 20 variants with a MAF of 1% and an allelic odds ratio of 3 would account for most of the familial aggregation of type 2 diabetes [15].

Finally, the etiology of complex diseases is often complicated. It reflects the combined effects of genetic and environmental factors, usually many of them and most with a very small effect. Elucidating gene-gene and gene-environment interactions remains a big challenge for both statistical and genetic researchers.
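The dependence of power on sample size, effect size, and allele frequency discussed in this section can be sketched with a normal approximation to a two-sided test comparing allele frequencies between cases and controls. This is an illustrative assumption rather than the calculation behind the figures quoted from [14]: the function name `allelic_power`, the unpooled variance, and the significance level of $10^{-8}$ are choices made here, so the numbers it returns need not reproduce the cited sample sizes.

```python
from statistics import NormalDist


def allelic_power(maf_controls, odds_ratio, n_cases, n_controls, alpha):
    """Approximate power of a two-sided z-test on case vs. control allele frequencies."""
    norm = NormalDist()
    p0 = maf_controls
    # Risk-allele frequency in cases implied by the allelic odds ratio
    p1 = odds_ratio * p0 / (1.0 - p0 + odds_ratio * p0)
    # Each individual contributes two alleles to the sample
    se = (p1 * (1.0 - p1) / (2 * n_cases) + p0 * (1.0 - p0) / (2 * n_controls)) ** 0.5
    z_crit = -norm.inv_cdf(alpha / 2.0)
    z_effect = abs(p1 - p0) / se
    # The chance of rejecting in the wrong tail is negligible and is ignored
    return norm.cdf(z_effect - z_crit)


# OR = 1.1 and MAF = 0.3, as in the example in the text; alpha is an assumed threshold
power_large = allelic_power(0.3, 1.1, n_cases=30_000, n_controls=30_000, alpha=1e-8)
power_small = allelic_power(0.3, 1.1, n_cases=5_000, n_controls=5_000, alpha=1e-8)
print(f"30,000 + 30,000: power = {power_large:.3f}")
print(f" 5,000 +  5,000: power = {power_small:.3f}")
```

Under these assumptions the smaller study has very little chance of detecting the variant, which illustrates why small-effect loci are missed when sample sizes are limited.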

Figure 1.2: Sample sizes required for genetic association studies. The first three columns pertain to GWASs using common variants across the entire genome; the columns correspond to different levels of statistical power to achieve a significant result at $P < 10^{-8}$. The fourth column pertains to a search for rare variants where the frequency listed is the collective frequency of rare variants in controls, and the odds ratio is the excess in cases as compared to controls. Sample sizes assume correction for a genome-wide search of 20,000 protein-coding genes in the genome (aiming to achieve $P < 10^{-5}$ with one test performed per gene). The fifth column pertains to a test of a single hypothesis (e.g., testing association with a single SNP) [Altshuler et al., 2008].

Figure 1.3: Allele frequency and effect sizes for genetic variants associated with breast cancer. Allele frequencies and ORs are taken from published literature where available and are not depicted to scale. Associations identified through GWA or GWA follow-up studies are shown with solid colored bars; all others are shaded from dark (top) to light (bottom) [Hindorff et al., 2011].

1.2 Estimation of effect size distribution from GWA studies

Genome-wide association studies have been fruitful in identifying susceptibility loci for complex diseases in humans; however, the variants discovered so far explain only a small proportion of the heritability of these diseases. Consortia conducting large meta-analyses of common diseases are promising and have successfully found additional loci. However, the heritability explained does not increase proportionally. Fig 1.4 shows the cumulative fraction of genetic variance explained by 71 Crohn's disease risk loci. The 32 loci detected in 2008 accounted for about 20% of Crohn's disease heritability. Another 39 loci were identified in 2010, and they increased the proportion of heritability explained only to 23.2%. More Crohn's disease risk alleles of even smaller effect are likely to remain unidentified, and these loci would explain only a small additional proportion of the heritability. This pattern shows that common variants explain a logarithmically decreasing fraction of heritability. The small inset plot shows a logarithmic fit to these data extrapolated to an extreme scenario in which 20,000 independent common alleles are associated with the disease. Even in this situation, less than half of the genetic variance would be explained. This also demonstrates that other types of effect (for example, low-frequency and rare alleles with higher penetrance) must exist [13].

Figure 1.4: Cumulative fraction of genetic variance explained by 71 Crohn's disease risk loci. The loci are ordered from largest to smallest individual contribution. Black points were identified pre-GWAS, green points were identified in the first generation GWAS, blue points were identified in an earlier meta-analysis, and cyan points were identified in this analysis [Franke et al., 2010].

Park et al. estimated the effect size distribution based on the results of recent GWA studies for height, Crohn's disease, and breast, prostate and colorectal (BPC) cancers [16]. Figure 1.5 shows the distribution of effect sizes for common SNPs identified from the three GWA studies. The distribution based on observed susceptibility loci (plot a) is skewed because SNPs with smaller effect sizes are less likely to be identified. They corrected the bias by relying on the observation that the number of susceptibility loci with a given effect size that are expected to be discovered in a GWA study is proportional to the product of the power of that study at that effect size and the total number of underlying susceptibility loci with similar effect size. The estimated density of effect size for all the underlying susceptibility loci continues to increase at an accelerating rate as the effect size decreases (plots b and c). This suggests that the number of loci that could be expected to be discovered in a GWA study increases linearly with increasing sample size, whereas the associated proportion of genetic variance explained increases at a decelerating rate, because the additional loci detected in larger studies will tend to have smaller effect sizes.

Figure 1.5: Nonparametric estimates for distributions of effect sizes for susceptibility loci. (a) Curves based only on observed susceptibility loci; these curves are distorted because loci with larger effect sizes are more likely to have been detected. (b) Curves based on estimated susceptibility loci, representative of the population of all susceptibility loci. (c) Estimated nonparametric distributions after normalization over the common observed range for the three traits [Park et al., 2010].

Take Crohn's disease as an example: 26 loci are expected to be discovered with a sample size of 10,000, which increases to 140 when the sample size increases to 50,000; however, the expected genetic variance explained increases only from 11.1% to 19.8%. Dividing the additional genetic variance explained by the additional number of loci gives an estimate of the genetic variance explained by each additional locus. It varies from 0.04% to 1%, with most values around 0.05% (Tab 1.2) [13].

Table 1.2: Additional number of susceptibility loci expected to be discovered from a single-stage GWAS with increasing sample sizes [Franke et al., 2010].

| Trait | Sample size | Expected number of SNPs | Expected GV explained | GV explained by each locus |
|---|---|---|---|---|
| Height | 25,000 | 27.4 | 6.6 | 0.24 |
| | 50,000 | 47.2 | 3.7 | 0.08 |
| | 75,000 | 51.1 | 2.9 | 0.06 |
| | 100,000 | 35.9 | 1.7 | 0.05 |
| | 125,000 | 21.3 | 0.8 | 0.04 |
| Crohn's disease | 10,000 | 26.0 | 11.1 | 0.43 |
| | 20,000 | 38.4 | 3.5 | 0.09 |
| | 30,000 | 43.8 | 3.1 | 0.07 |
| | 40,000 | 24.5 | 1.6 | 0.07 |
| | 50,000 | 7.4 | 0.5 | 0.07 |
| BPC cancers | 10,000 | 2.8 | 2.8 | 1.0 |
| | 20,000 | 7.3 | 3.0 | 0.51 |
| | 30,000 | 11.1 | 2.9 | 0.26 |
| | 40,000 | 12.4 | 2.7 | 0.22 |
| | 50,000 | 10.9 | 2.1 | 0.19 |

BPC: breast, prostate and colorectal cancers; GV: genetic variance, shown as percentage.
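The per-locus figures in Table 1.2 follow from dividing the additional genetic variance explained by the additional number of loci at each sample-size step. The sketch below repeats that arithmetic for the height and Crohn's disease rows of the table; the variable and function names are illustrative only.

```python
# (sample size, expected additional loci, expected additional GV in %) from Table 1.2
height = [
    (25_000, 27.4, 6.6),
    (50_000, 47.2, 3.7),
    (75_000, 51.1, 2.9),
    (100_000, 35.9, 1.7),
    (125_000, 21.3, 0.8),
]
crohns = [
    (10_000, 26.0, 11.1),
    (20_000, 38.4, 3.5),
    (30_000, 43.8, 3.1),
    (40_000, 24.5, 1.6),
    (50_000, 7.4, 0.5),
]


def gv_per_locus(rows):
    """Additional GV explained divided by the additional number of loci, per step."""
    return [round(gv / n_loci, 2) for _, n_loci, gv in rows]


def cumulative(rows):
    """Total expected loci and total GV explained at the largest sample size."""
    return sum(n for _, n, _ in rows), sum(gv for _, _, gv in rows)


height_per_locus = gv_per_locus(height)
crohns_per_locus = gv_per_locus(crohns)
crohns_loci, crohns_gv = cumulative(crohns)
print(height_per_locus)
print(crohns_per_locus)
print(f"Crohn's disease at 50,000: {crohns_loci:.1f} loci, {crohns_gv:.1f}% GV")
```

The cumulative totals for Crohn's disease correspond to the roughly 140 loci and 19.8% of genetic variance quoted in the text.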

1.3 Intermediate phenotype

Complex disorders are the ultimate result of multiple genotypes and environmental factors. There are many intermediate phenotypes (endophenotypes) along the pathway to a complex disease, and even more quantitative trait loci (QTLs) are involved in the expression of those intermediate phenotypes. Each individual locus has only a very small effect and is also sensitive to environmental influences. Figure 1.6 illustrates the relationship among QTLs, intermediate phenotypes, and outcome, using schizophrenia as an example. The term endophenotype was coined by Gottesman and Shields about 40 years ago as a psychiatric concept; it is defined as an internal phenotype discoverable by a biochemical test or microscopic examination. There are several advantages of using intermediate phenotypes [17]:

1. For complex diseases, especially psychiatric disorders, the clinical definition of disease status is complicated, and an intermediate phenotype is clearer and easier to classify;
2. An intermediate phenotype is closer to the causal genes, and thus less genetically complex and more strongly associated with susceptibility loci;
3. Intermediate phenotypes capture more of the underlying heritable trait variation, thus enhancing the statistical power;
4. The study of intermediate phenotypes provides insight into the complicated etiologic pathway to the disease.

Figure 1.6: Putative endophenotypes implicated in schizophrenia research [Gottesman et al., 2003].

The concept of endophenotype was at first applied primarily to the study of psychiatric disorders and was then adopted by researchers for the analysis of other complex disorders [18]. For example, the breast cancer associated genes BRCA1 and BRCA2 account for only 25% of familial risk and 10% of overall breast cancer risk, and the causal genes for most cases remain unknown. In 2005, Boyd proposed that mammographic density could be used as an intermediate phenotype to help elucidate the genetic factors that contribute to the cause of breast cancer (Fig 1.7) [19]. Mammographic density is one of the strongest known risk factors for breast cancer. The risk of breast cancer is 4-5 times greater in women with density in more than 75% of the breast than in women with little or no density in the breast; density in more than 50% of the breast could explain about one third of breast cancer cases. Mammographic density is highly heritable and has the properties of a quantitative trait, so the genes that determine this density should be fewer in number and easier to locate than the genes associated with breast cancer itself. In 2010, researchers identified a gene, ZNF365 on chromosome 10, associated with both breast cancer and mammographic density (Fig 1.8) [20].

Figure 1.7: Computer-assisted measurement of mammographic density and categories of percentage mammographic density estimated by radiologists. A = 0, B = < 10%, C = < 25%, D = < 50%, E = < 75%, F = > 75% [Boyd et al., 2005].

Figure 1.8: Regional association plot for ZNF365 across a 300-kb window. Association of individual SNPs is plotted as $-\log_{10} P$ against the chromosomal base pair position. Results of both genotyped and imputed SNPs are provided. Colors indicate the LD relationship between rs10995190 and the other markers (red, $r^2 > 0.8$). The right y axis shows the recombination rate estimated from the HapMap CEU population. All P values are from the discovery phase [Lindstrom et al., 2010].

Another example of using an intermediate phenotype in genetic analysis comes from Alzheimer's disease (AD). The identification of genes associated with sporadic, late-onset AD has been challenging. Until recently, most GWA scans in AD have been relatively underpowered. Several susceptibility loci have been discovered and reported, but most of them have not been validated in follow-up studies. Shulman and his group used global AD neuropathology and global cognition as intermediate phenotypes and found a new gene, ZNF224, associated with AD [21]. Burns's group then replicated their result in a case-control study in 2011 [22]. These successful examples show that an intermediate phenotype can provide valuable information about the genetic mechanisms implicated in complex disease and can help identify common variants that are difficult to discover using traditional approaches.

To be a valid endophenotype, a trait has to meet several criteria. Cannon and Keller suggested six rules for justification [23]:

1. Intermediate phenotypes should be heritable;
2. Intermediate phenotypes should be associated with causes rather than effects of disorders;
3. Numerous phenotypes should affect a given complex disorder;
4. Intermediate phenotypes should vary continuously in the general population;
5. Intermediate phenotypes should optimally be measured across several levels of analysis;
6. Intermediate phenotypes that affect multiple disorders should be found for genetically related disorders.

In association analysis, researchers often collect other phenotypic information in addition to disease status. For example, smoking quantity is commonly recorded in lung cancer studies, optic disc size is an important component in the diagnosis of glaucoma, and urinary sodium level is recorded for cardiovascular disease patients. Full utilization of this quantitative information will improve the genetic analysis of complex diseases, especially for diseases where most of the heritability is still unexplained by the replicated genes from multiple genome-wide association studies.

1.4 Quantitative trait association analysis

Quantitative genetics, developed in the last century, is the study of continuous traits and their underlying mechanisms. The expression of a quantitative trait (QT) is affected by both genetic and environmental factors. The inheritance pattern of a quantitative trait is an extension of Mendel's laws of inheritance in which multiple genes act in combination. The variation in a quantitative trait can be divided into genetic and non-genetic components, and the genetic component $V_G$ at one locus can be further subdivided into the additive variance $V_A$ and the dominance variance $V_D$. Assume one locus has two alleles $B_1$ and $B_2$ with frequencies $p_1$ and $p_2$, respectively. The genotypic values of genotypes $B_1B_1$, $B_1B_2$ and $B_2B_2$ can then be expressed as $0$, $(1+k)a$ and $2a$, where $a$ and $k$ represent the additive and dominance effects (Tab 1.3). The value of $k$ determines the genetic model at this locus (Tab 1.4). Using Fisher's decomposition, the variance explained by one locus can be computed as

$$V_A = 2p_1p_2\alpha^2, \qquad V_D = (2p_1p_2ak)^2, \qquad \text{where } \alpha = a[1 + k(p_1 - p_2)]$$

[24, 25]. The heritability of complex diseases is usually less than 50% (Tab 1.1). If a number of loci are involved in the disease, the heritability explained by each locus can be very small.

Table 1.3: Properties for a diallelic locus under random mating.

| Genotype | Genotypic value | Frequency |
|---|---|---|
| $B_1B_1$ | $0$ | $p_1^2$ |
| $B_1B_2$ | $(1+k)a$ | $2p_1p_2$ |
| $B_2B_2$ | $2a$ | $p_2^2$ |

Genetic variance: $V_A = 2p_1p_2\alpha^2$, $V_D = (2p_1p_2ak)^2$.
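As a worked illustration of the decomposition above, the following minimal Python sketch computes $\alpha$, $V_A$ and $V_D$ for one locus and checks that their sum matches the variance of the genotypic values taken directly from Table 1.3. The function names and the example values ($p_1 = 0.7$, $a = 1$, $k = 0.5$) are illustrative assumptions, not values used in the thesis.

```python
def genetic_variance(p1, a, k):
    """Return (alpha, V_A, V_D) for a diallelic locus under random mating.

    Genotypic values are 0, (1 + k) * a and 2 * a for B1B1, B1B2 and B2B2.
    """
    p2 = 1.0 - p1
    alpha = a * (1.0 + k * (p1 - p2))     # alpha = a[1 + k(p1 - p2)]
    v_a = 2.0 * p1 * p2 * alpha ** 2      # additive variance
    v_d = (2.0 * p1 * p2 * a * k) ** 2    # dominance variance
    return alpha, v_a, v_d


def total_genetic_variance(p1, a, k):
    """Variance of the genotypic value computed directly from Table 1.3."""
    p2 = 1.0 - p1
    freqs = [p1 ** 2, 2.0 * p1 * p2, p2 ** 2]
    values = [0.0, (1.0 + k) * a, 2.0 * a]
    mean = sum(f * g for f, g in zip(freqs, values))
    return sum(f * (g - mean) ** 2 for f, g in zip(freqs, values))


if __name__ == "__main__":
    # hypothetical example values, chosen only for illustration
    alpha, v_a, v_d = genetic_variance(p1=0.7, a=1.0, k=0.5)
    print(alpha, v_a, v_d, total_genetic_variance(0.7, 1.0, 0.5))
```

With $k = 0$ the dominance term vanishes and the whole genetic variance is additive, which is the additive model of Table 1.4.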

Table 1.4: Value of k and genetic models.

| $k$ | Genetic model |
|---|---|
| $0$ | additive model |
| $1$ | $B_2$ completely dominant over $B_1$ |
| $-1$ | $B_1$ completely dominant over $B_2$ |
| $> 1$ | $B_2$ exhibits overdominance |
| $< -1$ | $B_1$ exhibits underdominance |

The multiple genes associated with a quantitative trait are called quantitative trait loci (QTL). There are two basic approaches to QTL mapping: linkage mapping and association mapping. Linkage mapping usually requires a large number of pedigrees; association mapping is more often used to analyze data from random-mating populations such as human populations. The power of QTL analysis is directly related to the QTL heritability. Sham compared the statistical power of linkage analysis and association analysis of quantitative traits for sib-pair data [7]. The result shows that the power of both linkage and association analysis depends crucially on the proportion of phenotypic variance explained by the QTL. In his simulation, the residual shared variance (environmental variance) is fixed at 0.25 and the additive variance varies from 0.01 to 0.15, so the narrow-sense heritability ranges from 0.038 to 0.375. When the heritability is as high as 0.375, 316 sib-pairs are required for 80% power to detect association with complete disequilibrium; but when the heritability is very low, as it is for most QTLs associated with complex diseases, 5524 sib-pairs are needed to achieve the same power (Tab 1.5).

Table 1.5: No. of sib pairs required to detect linkage and association [Sham et al., 2000]

| Analysis | $V_a = 0.01$ | $V_a = 0.05$ | $V_a = 0.10$ | $V_a = 0.15$ |
|---|---|---|---|---|
| Linkage: $\theta = 0.20$ | ... | 407,843 | 97,653 | 41,270 |
| Linkage: $\theta = 0.10$ | ... | 129,193 | 30,815 | 13,033 |
| Linkage: $\theta = 0.05$ | ... | 80,620 | 19,241 | 8,128 |
| Linkage: $\theta = 0$ | ... | 52,790 | 12,614 | 5,322 |
| Within-pairs association: $R^2 = 0.10$ | 55,440 | 10,770 | 5,190 | 2,329 |
| Within-pairs association: $R^2 = 0.25$ | 22,156 | 4,297 | 2,064 | 1,321 |
| Within-pairs association: $R^2 = 0.50$ | 11,068 | 2,139 | 1,023 | 651 |
| Within-pairs association: $R^2 = 1.0$ | 5,524 | 1,060 | 502 | 316 |

Dominance QTL variance ($V_D$) is assumed to be 0; residual shared variance ($V_S$) is assumed to be 0.25; the heritability of the quantitative trait ranges from 0.04 to 0.38.

Sham's study was limited to sib-pair data. Slatkin conducted his study under more general conditions [26]. He considered three tests in QTL mapping: (1) compare the marker allele frequency between selected samples (unusually high or low values of the QT) and the general population; (2) compare the mean of a quantitative trait between two different genotypes in the selected samples; (3) combine the previous two tests using Fisher's method. The first two tests are QT association tests and are independent of each other, so the information from them can be combined. He thus proposed combining the results of two independent tests within a single data set in quantitative trait association analysis. Fulker [27] also applied this idea of combining. He combined linkage and association sib-pair analysis for quantitative traits in a simulation study, and the result shows the power and efficacy of the method.

Linear regression analysis is another powerful method for quantitative trait analysis, and nearly all publications on quantitative traits rely on it, either in cases only or in both cases and controls with adjustment for disease status. However, most GWA studies adopt a case-control design, and this nonrandom ascertainment can lead to an inflated type I error rate for tests of association between the marker genotype and the quantitative trait [28]. This is not a concern for intermediate phenotype analysis. It has been shown that when the SNP markers are not independently associated with the disease risk, i.e., there is association between the quantitative trait and the SNP marker, the standard tests are still valid [29].

While many studies of QTLs for various traits have been published, few genes corresponding to those QTLs have been identified, and those genes usually have a large effect on the trait. It has become clear that hundreds of genes can affect a single trait, each with only a very small effect, which makes those loci very difficult to identify. This also explains why the detection of susceptibility genes for complex diseases is so challenging, as complex diseases have an inheritance pattern similar to that of quantitative traits.

1.5 Meta-analysis in GWA study

Most of the susceptibility alleles associated with complex diseases carry a moderate to low risk, and a large sample size is required to detect those loci. To detect an odds ratio of 1.3, a sample of 2000-10000 cases and as many controls is needed to achieve 80% power when the marker and disease allele frequencies are the same (Tab 1.6) [30]. A single site usually cannot collect such a large number of samples, but several sites together can. Methods for combining the evidence across different centers are therefore needed. Meta-analysis is such an approach: it provides an estimate of the overall effect across a number of independent studies. The null hypothesis of a meta-analysis is that the null hypothesis holds in all of the independent populations; the alternative hypothesis is that the null hypothesis is false in at least one of the populations. Meta-analysis has been increasingly applied in genetic association studies, and hundreds of papers using it are published each year. Meta-analyses combining several GWA-related investigations have already been performed for many complex diseases, such as type 2 diabetes [31] and Parkinson's disease [32], and may still be the norm in the near future [33].

Several statistical methods are available for meta-analysis. A well-known one is Fisher's method. It takes minus twice the logarithm of the product of a set of $L$ p-values, which is distributed as a chi-square with $2L$ degrees of freedom [34]:

$$X^2 = -2\sum_{i=1}^{L}\log_e(p_i) \tag{1.4}$$

Another popular meta-analysis method is based on inverse variance weighting. Let $\beta_i$ and $SE_i$ be the effect size estimate and standard error for study $i$, respectively. Then

$$w_i = \frac{1}{SE_i^2}, \qquad SE = \sqrt{\frac{1}{\sum_i w_i}}, \qquad \beta = \frac{\sum_i \beta_i w_i}{\sum_i w_i}, \qquad Z = \frac{\beta}{SE}$$

This method is popular because it creates a common effect estimator for multiple populations, but it requires the effect sizes and their standard errors to be in consistent units across studies [35], which may not be the case when studies differ.

1.6 Hypothesis

The idea of meta-analysis is to combine the information from multiple sites, thereby increasing the sample size and overcoming the lack of power in many association studies. However, this idea of combining information has never been applied to a GWA study in a single population. In case-control studies, researchers will often collect information on multiple traits in addition to disease status on the same set of subjects. Some of these traits can be intermediate phenotypes for the disease. Statistical tests based on both the disease status and the intermediate phenotype should be more powerful than a case-control analysis alone, as they integrate more information. In this study, simulation was conducted to compare the performance of meta-analysis methods and the traditional case-control study to test this hypothesis.

Table 1.6: Approximate sample sizes needed to detect a significantly increased allelic odds ratio [Zondervan et al., 2004]

| Disease allele frequency | Marker allele frequency | OR 3.0: no. cases (= no. controls) | OR 3.0: no. cases : no. controls (1:4) | OR 2.0: no. cases (= no. controls) | OR 2.0: no. cases : no. controls (1:4) | OR 1.3: no. cases (= no. controls) | OR 1.3: no. cases : no. controls (1:4) |
|---|---|---|---|---|---|---|---|
| 0.05 | 0.05 | 360 | 210:840 | 1110 | 650:2600 | 9500 | 5600:22400 |
| 0.05 | 0.1 | 600 | 350:1400 | 2000 | 1200:4800 | 19000 | 11500:46000 |
| 0.05 | 0.2 | 1170 | 700:2800 | 4150 | 2500:10000 | 40000 | 25000:100000 |
| 0.05 | 0.3 | 1900 | 1200:4800 | 6800 | 4300:13200 | 70000 | 43000:172000 |
| 0.05 | 0.5 | 4200 | 2700:10800 | 15000 | 9500:38000 | 160000 | 100000:400000 |
| 0.2 | 0.05 | 710 | 420:1680 | 1900 | 1090:4360 | 14000 | 8500:34000 |
| 0.2 | 0.1 | 350 | 200:800 | 900 | 500:2000 | 6600 | 4400:13600 |
| 0.2 | 0.2 | 150 | 85:340 | 360 | 220:880 | 2900 | 1750:7000 |
| 0.2 | 0.3 | 210 | 130:520 | 530 | 360:1440 | 4800 | 3000:12000 |
| 0.2 | 0.5 | 430 | 270:1080 | 1250 | 8000:32000 | 11000 | 6950:27800 |
| 0.5 | 0.05 | 3150 | 1870:7480 | 6800 | 4000:16000 | 40000 | 25000:100000 |
| 0.5 | 0.1 | 1500 | 900:3600 | 3200 | 2000:8000 | 19000 | 12000:48000 |
| 0.5 | 0.2 | 640 | 390:1560 | 1350 | 850:3400 | 8500 | 5300:21200 |
| 0.5 | 0.3 | 360 | 220:880 | 800 | 500:2000 | 5000 | 3100:12400 |
| 0.5 | 0.5 | 140 | 90:360 | 320 | 200:800 | 2100 | 1300:5200 |

*Linkage disequilibrium between marker and disease allele $D' = 0.7$, power = 80%, $\alpha = 0.001$.
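The two meta-analysis methods of Section 1.5 can be sketched in a few lines of Python. This is a minimal sketch, not the software used in the thesis: the function names are hypothetical, the inverse-variance weights are assumed to follow the usual convention $w_i = 1/SE_i^2$, and the chi-square tail probability uses the closed form that holds for even degrees of freedom, so only the standard library is needed.

```python
import math


def fisher_combined_p(p_values):
    """Fisher's method: X^2 = -2 * sum(log p_i), chi-square with 2L df.

    Returns (X^2, combined p-value).
    """
    x2 = -2.0 * sum(math.log(p) for p in p_values)
    half = x2 / 2.0
    # For even degrees of freedom 2L the chi-square upper tail is
    # exp(-x/2) * sum_{j=0}^{L-1} (x/2)^j / j!
    tail = math.exp(-half) * sum(half ** j / math.factorial(j)
                                 for j in range(len(p_values)))
    return x2, tail


def inverse_variance_meta(betas, ses):
    """Fixed-effect inverse-variance estimate; returns (beta, SE, Z)."""
    weights = [1.0 / se ** 2 for se in ses]   # assumed w_i = 1 / SE_i^2
    beta = sum(b * w for b, w in zip(betas, weights)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return beta, se, beta / se


if __name__ == "__main__":
    # hypothetical inputs, for illustration only
    print(fisher_combined_p([0.05, 0.05]))
    print(inverse_variance_meta([0.2, 0.4], [0.1, 0.1]))
```

Fisher's method needs only p-values, whereas the inverse-variance estimate needs effect sizes and standard errors on a common scale, which is the restriction noted in the text.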

CHAPTER 2 STATISTICAL METHODS FOR TESTING ASSOCIATION

2.1 Logistic regression

Logistic regression is a generalized linear model used to analyze a binary outcome, i.e., disease status, and is commonly used in genetic association studies [36]. It models the risk of disease as a function of the markers, often with adjustment for covariates such as age and gender. Let $\theta$ be the risk of the disease. Then, in logistic regression,

$$\theta(x) = \frac{e^{\alpha+\beta x}}{1+e^{\alpha+\beta x}}, \qquad \operatorname{logit}[\theta(x)] = \log\left[\frac{\theta(x)}{1-\theta(x)}\right] = \alpha + \beta x$$

where $\alpha$ is the constant and $\beta$ is the coefficient of the predictor variable. A Wald test [37, 38] is used to test the statistical significance of the coefficient $\beta$ in the model. The Wald test calculates a Z statistic:

$$Z = \frac{\hat\beta}{se(\hat\beta)} \tag{2.1}$$

where $se(\hat\beta)$ is the standard error of the maximum likelihood estimator. The Z statistic can be compared to a standard normal distribution. The hypotheses for testing the significance of $\beta$ are

$$H_0: \beta = 0 \qquad H_1: \beta \neq 0$$

2.2 Linear regression

Linear regression is used to test the association between a quantitative trait and the SNP markers [39]:

$$Y = X\lambda + \varepsilon \tag{2.2}$$

where

$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad X = \begin{pmatrix} 1 & x_{11} & x_{12} \\ 1 & x_{21} & x_{22} \\ \vdots & \vdots & \vdots \\ 1 & x_{n1} & x_{n2} \end{pmatrix}, \quad \lambda = \begin{pmatrix} \lambda_0 \\ \lambda_1 \\ \lambda_2 \end{pmatrix}, \quad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$

$$x_{i2} = \begin{cases} 0 & \text{if individual } i \text{ is unaffected} \\ 1 & \text{if individual } i \text{ is affected} \end{cases}$$

Here $y_i$ is the individual quantitative trait, which follows a normal distribution with mean $(X\lambda)_i$; $x_{i1}$ is the marker genotype; $x_{i2}$ is the binary variable that indicates the disease status; $\lambda_0$ is the intercept; $\lambda_1$ is the regression coefficient of interest; $\lambda_2$ is the coefficient of disease status; and $\varepsilon_i$ is the error term, which follows a normal distribution with mean 0 and variance $\sigma^2$. Least squares estimation is used to estimate the $\lambda$ values:

$$\hat\lambda = (X'X)^{-1}X'Y \tag{2.3}$$

$X'X$ and $(X'X)^{-1}$ are $3 \times 3$ symmetric matrices, and $X'Y$ is a $3 \times 1$ vector. The fitted values are

$$\hat Y = X\hat\lambda = X(X'X)^{-1}X'Y \tag{2.4}$$

The residuals are

$$\hat e = Y - \hat Y = (I - X(X'X)^{-1}X')Y \tag{2.5}$$

The error standard deviation is estimated as

$$\hat\sigma = \sqrt{\sum_i \hat e_i^2 / (n - 2 - 1)} \tag{2.6}$$

The variances of $\hat\lambda_0$, $\hat\lambda_1$ and $\hat\lambda_2$ are the diagonal elements of the covariance matrix

$$\hat\sigma^2 (X'X)^{-1} \tag{2.7}$$

The hypotheses for testing the significance of $\lambda_1$ are

$$H_0: \lambda_1 = 0 \qquad H_1: \lambda_1 \neq 0$$

The test statistic is approximately normally distributed when the sample size is large:

$$Z = \frac{\hat\lambda_1}{se(\hat\lambda_1)} \tag{2.8}$$

2.3 Meta-analysis within a population

In case-control studies, researchers will often collect information on multiple traits in addition to disease status on the same set of subjects to maximize the yield of the study, and some of these traits can be quantitative traits along the pathway to the disease. A statistical method relying on both the disease status and the intermediate quantitative trait should be more powerful than a case-control analysis alone, as it incorporates more information. Statistical methods that can combine the disease status and the quantitative trait are therefore needed, and meta-analysis is such a method. Meta-analysis has been used for many years to combine information from multiple independent data sets and thereby increase the sample size in GWA studies, but it has never been used in a GWA study within a single population. The estimators from the logistic regression of disease status and from the linear regression of the quantitative trait with adjustment for disease status should be comparable if the quantitative trait is an intermediate phenotype, and the two tests can be combined to give an overall test. This meta-analysis within a population should be a powerful approach, especially for identifying susceptibility loci with small effects. In this study, this idea of combining information is applied, and the meta-analysis was conducted on both theoretical and empirical data to test the hypothesis.
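The least squares quantities of Section 2.2, up to the Wald statistic of equation (2.8), can be traced in a short standard-library Python sketch. The helper names `invert` and `wald_z_linear` are hypothetical, and the data in the usage block are arbitrary simulated values, not data from this study.

```python
import math
import random


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def wald_z_linear(y, genotype, status):
    """Fit y on (1, genotype, status); return (lambda_hat, Z for lambda_1)."""
    rows = [[1.0, g, d] for g, d in zip(genotype, status)]
    p = 3
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    xtx_inv = invert(xtx)
    lam = [sum(xtx_inv[i][j] * xty[j] for j in range(p)) for i in range(p)]  # (2.3)
    resid = [yi - sum(l * v for l, v in zip(lam, r)) for r, yi in zip(rows, y)]
    sigma2 = sum(e * e for e in resid) / (len(y) - 2 - 1)   # square of (2.6)
    se_lam1 = math.sqrt(sigma2 * xtx_inv[1][1])             # from (2.7)
    return lam, lam[1] / se_lam1                            # (2.8)


if __name__ == "__main__":
    rng = random.Random(1)
    genotype = [rng.choice([0, 1, 2]) for _ in range(500)]
    status = [rng.choice([0, 1]) for _ in range(500)]
    y = [0.5 + 0.3 * g + 0.2 * d + rng.gauss(0.0, 1.0)
         for g, d in zip(genotype, status)]
    print(wald_z_linear(y, genotype, status))
```

In practice a statistics package would be used for both regressions; the explicit normal equations are shown here only to mirror equations (2.3) to (2.8).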

2.3.1 Fisher's method

Suppose test 1 tests for the association between a disease and a SNP marker, and test 2 tests for the association between a quantitative trait and a SNP marker with adjustment for disease status. These two tests are independent under the null hypothesis (null model in Fig 3.1), i.e., when there is no association between the SNP marker and the disease locus. Fisher's combined probability test [34] can be used to combine the p-values from these two tests to provide an overall p-value:

$$X^2 = -2[\ln(p_1) + \ln(p_2)] \tag{2.9}$$

which follows a chi-square distribution with 4 degrees of freedom.

2.3.2 Modified inverse variance weighted method

The inverse variance weighted meta-analysis method is popular in GWA studies because it creates a common effect estimator for multiple populations. This method can also be applied to combine information from a single population, but it must be modified for the within-population meta-analysis. We have two Z statistics from tests 1 and 2 (equations (2.1) and (2.8)). These two Z-scores are independent under the null hypothesis and can be combined by the inverse variance weighted method. Suppose the two Z-scores from tests 1 and 2 are $Z_1$ and $Z_2$, with

$$Z_1 \sim N(\mu_1, \sigma_1^2), \qquad Z_2 \sim N(\mu_2, \sigma_2^2)$$

Under the null hypothesis, $\mu_1 = \mu_2 = 0$, and $Z_1$ and $Z_2$ are independent. The effect size estimate from linear regression can be arbitrary, depending on the subjective choice of measurement unit and normalization procedure, and this will affect the final combined result. But if the quantitative trait is an intermediate phenotype involved in the development of disease (alternative model in Fig 3.1), the coefficient estimates from logistic regression and linear regression and their standard errors should be in consistent units, because these are two different tests of the same association, i.e., the association between the SNP marker and the disease status. $Z_2$ can therefore be scaled so that it has the same unit as $Z_1$. In this study, $Z_2$ is scaled so that it has the same standard error as $Z_1$:

$$Z_2' = \frac{Z_2}{c} \sim N(\mu_2 / c, \sigma_1^2) \tag{2.10}$$

where $c = \sqrt{\sigma_2^2 / \sigma_1^2}$ is a constant. Let $Z = bZ_1 + (1-b)Z_2'$, $b \in (0, 1)$. Since $Z_1$ and $Z_2'$ are independent under the null hypothesis, the variance of $Z$ is the sum of the variances of the two terms:

$$V(Z) = b^2 V(Z_1) + (1-b)^2 V(Z_2') \tag{2.11}$$

$V(Z)$ is a function of $b$, written $f(b)$. Taking the first derivative of $V(Z)$ with respect to $b$ gives

$$f'(b) = \frac{\partial V(Z)}{\partial b} = 2bV(Z_1) - 2(1-b)V(Z_2') = 2b[V(Z_1) + V(Z_2')] - 2V(Z_2') \tag{2.12}$$

and the second derivative with respect to $b$ is

$$f''(b) = \frac{\partial^2 V(Z)}{\partial b^2} = 2[V(Z_1) + V(Z_2')] > 0 \tag{2.13}$$

Since $f''(b) > 0$, $f(b)$ has a global minimum. Setting

$$f'(b) = 2b[V(Z_1) + V(Z_2')] - 2V(Z_2') = 0$$

gives

$$b = \frac{V(Z_2')}{V(Z_1) + V(Z_2')} = \frac{1/V(Z_1)}{1/V(Z_1) + 1/V(Z_2')} = \frac{1}{2}$$

because $V(Z_1) = V(Z_2') = \sigma_1^2$. When $b = \frac{1}{2}$,

$$V(Z) = \frac{V(Z_1)V(Z_2')}{V(Z_1) + V(Z_2')} = \frac{1}{2}V(Z_1) \tag{2.14}$$

achieves its minimum value. We then obtain a new statistic $L$:

$$L = \frac{bZ_1 + (1-b)Z_2'}{\sqrt{\frac{1}{2}V(Z_1)}} \tag{2.15}$$

This inverse variance weighted statistic $L$ has the smallest variance.
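The construction of the statistic $L$ in equations (2.10) to (2.15) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `modified_ivw` is hypothetical, the standard deviations of the two Z-scores are taken as known inputs (both equal to 1 by default, as for Wald Z-scores under the null hypothesis), and the scaling constant is taken as $c = \sigma_2 / \sigma_1$ so that the scaled $Z_2$ has variance $\sigma_1^2$. The two-sided p-value assumes that $L$ is standard normal under the null hypothesis.

```python
import math


def modified_ivw(z1, z2, sigma1=1.0, sigma2=1.0):
    """Combine two Z statistics; returns (L, two-sided normal p-value)."""
    c = sigma2 / sigma1                  # assumed scaling: Var(z2 / c) = sigma1^2
    z2_scaled = z2 / c                   # equation (2.10)
    var1 = sigma1 ** 2
    var2 = sigma1 ** 2                   # variance of the scaled Z2
    b = var2 / (var1 + var2)             # minimiser of V(Z); equals 1/2 here
    min_var = var1 * var2 / (var1 + var2)                         # equation (2.14)
    stat = (b * z1 + (1.0 - b) * z2_scaled) / math.sqrt(min_var)  # equation (2.15)
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))
    return stat, p_value


if __name__ == "__main__":
    # hypothetical Z-scores from the logistic and linear regressions
    print(modified_ivw(2.0, 2.0))
```

With unit variances the statistic reduces to $(Z_1 + Z_2)/\sqrt{2}$, so two Z-scores of the same sign reinforce each other while Z-scores of opposite sign cancel.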

CHAPTER 3 SIMULATION

3.1 Disease models in the simulations

The mechanism underlying a common disease can be complicated even when only one genetic locus is considered. Three elements are considered in this study: the genetic marker (G1), the quantitative trait (QT) and the disease status (D). Figure 3.1 displays the possible disease models involving the three elements when the quantitative trait is an intermediate phenotype for the disease. Arrows represent causal relationships. The null hypothesis is the model displayed on the left: the SNP marker G1 is not related to the disease locus and thus is not associated with the disease status. There are three possible models under the alternative hypothesis. In model 1, the quantitative trait and the SNP marker are independently associated with the disease status, and there is no relationship between the marker and the quantitative trait. In disease models 2 and 3, the intermediate phenotype lies between the SNP marker and the disease status. These two models differ from a biological point of view, but they cannot be differentiated statistically once the allele frequency, risk effect and disease incidence rate are fixed (result not shown). The simulation therefore focuses on alternative disease models 1 and 3; the null hypothesis is that the correlation coefficient between the SNP and the disease locus is 0. The two alternative models are:

1. the SNP marker and the quantitative trait are independently associated with the disease, and the quantitative trait is not associated with the SNP marker;
2. the quantitative trait lies between the SNP marker and the disease status, and the disease status is also directly associated with the SNP marker.

Figure 3.1: Null and alternative disease models in simulation.

3.2 Simulation

Suppose there is a disease locus A with two alleles $A_1$ and $A_2$ with allele frequencies $p_1$ and $p_2$, and a SNP marker M with two alleles $M_1$ and $M_2$ with allele frequencies $m_1$ and $m_2$. The marker and the disease locus are closely linked, so that there is linkage disequilibrium (LD) between them, which can be quantified by the correlation coefficient $r$ [40]:

$$D = h_{11} - p_1m_1, \qquad r = \frac{D}{\sqrt{p_1p_2m_1m_2}} \tag{3.1}$$

Once the correlation coefficient is specified, the haplotype frequencies at the SNP marker and disease locus can be expressed using the allele frequencies and the LD between them (Tab 3.1). Let $h_{11} = p_1m_1 + D$, $h_{12} = p_1m_2 - D$, $h_{21} = p_2m_1 - D$ and $h_{22} = p_2m_2 + D$. The genotype frequencies at the disease locus A and the marker locus M can then be expressed using those haplotype frequencies (Tab 3.2).

Table 3.1: Haplotype frequency at disease locus (A) and SNP marker (M)

| Disease \ Marker | $M_1$ | $M_2$ | Total |
|---|---|---|---|
| $A_1$ | $p_1m_1 + D$ $[h_{11}]$ | $p_1m_2 - D$ $[h_{12}]$ | $p_1$ |
| $A_2$ | $p_2m_1 - D$ $[h_{21}]$ | $p_2m_2 + D$ $[h_{22}]$ | $p_2$ |
| Total | $m_1$ | $m_2$ | 1 |
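The haplotype frequencies of Table 3.1 follow directly from the allele frequencies and $r$ by inverting equation (3.1). A minimal Python sketch, with the hypothetical function name `haplotype_frequencies` and a range check added as an assumption of this sketch:

```python
import math


def haplotype_frequencies(p1, m1, r):
    """Haplotype frequencies h11, h12, h21, h22 from allele frequencies and r."""
    p2, m2 = 1.0 - p1, 1.0 - m1
    d = r * math.sqrt(p1 * p2 * m1 * m2)   # invert r = D / sqrt(p1 p2 m1 m2)
    haps = {
        "h11": p1 * m1 + d,
        "h12": p1 * m2 - d,
        "h21": p2 * m1 - d,
        "h22": p2 * m2 + d,
    }
    if min(haps.values()) < 0.0:
        raise ValueError("r is not attainable for these allele frequencies")
    return haps


if __name__ == "__main__":
    # allele frequency 0.3 at both loci and r = 0.8, as in the simulation settings
    print(haplotype_frequencies(0.3, 0.3, 0.8))
```

The row and column sums of the resulting table reproduce the allele frequencies, and $r = 0$ gives the product frequencies expected under linkage equilibrium.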

Table 3.2: Genotype frequency at disease locus (A) and SNP marker (M)

| Disease \ Marker | $M_1M_1$ | $M_1M_2$ | $M_2M_2$ |
|---|---|---|---|
| $A_1A_1$ | $h_{11}^2$ | $2h_{11}h_{12}$ | $h_{12}^2$ |
| $A_1A_2$ | $2h_{11}h_{21}$ | $2h_{11}h_{22} + 2h_{12}h_{21}$ | $2h_{12}h_{22}$ |
| $A_2A_2$ | $h_{21}^2$ | $2h_{21}h_{22}$ | $h_{22}^2$ |

Let $Y$ denote a quantitative trait and $X$ the genotype at the marker locus. The relationship between $Y$ and $X$ can be expressed using a linear equation

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

where $\varepsilon$ is the error term, which follows $N(0, \sigma_e^2)$. $X$ takes different values under different genetic models; for example, $X = 0, 1, 2$ if $X$ has an additive effect on the quantitative trait (Tab 3.3).

Table 3.3: Value of X under different genetic models.

| Model | $M_1M_1$ | $M_1M_2$ | $M_2M_2$ |
|---|---|---|---|
| Additive model | 0 | 1 | 2 |
| Dominant model | 1 | 1 | 0 |
| Recessive model | 0 | 0 | 1 |

The disease status depends on both the marker genotype and the quantitative trait, which can be expressed using a logistic equation

$$P(D = 1 \mid X, Y) = \frac{e^{\gamma_0 + \gamma_1 X + \gamma_2 Y}}{1 + e^{\gamma_0 + \gamma_1 X + \gamma_2 Y}}$$

The frequencies of the different marker genotypes in cases and controls can therefore be expressed as

$$P(Y_i, X_i \mid D = 1) = \frac{P(D = 1 \mid Y_i, X_i)\,P(Y_i, X_i)}{P(D = 1)}, \qquad P(D = 1) = \sum_{Y}\sum_{X} P(D = 1 \mid Y, X)\,p(X)$$

$$P(Y_i, X_i \mid D = 0) = \frac{P(D = 0 \mid Y_i, X_i)\,P(Y_i, X_i)}{P(D = 0)}, \qquad P(D = 0) = \sum_{Y}\sum_{X} P(D = 0 \mid Y, X)\,p(X)$$

The minor allele frequencies (MAF) of the SNP marker and the disease allele are equal and set to 0.3. The correlation coefficient ($r$) between the marker and the disease locus varies from 0 to 0.8. When $r = 0$, there is no LD between the disease locus and the SNP marker, i.e., no association exists, which is the null hypothesis of the test. The values of the $\beta$ and $\gamma$ parameters are chosen to fix the disease incidence rate at 0.05, which is close to the prevalence of most common diseases. $\beta_1$ is fixed at 1, which means that each copy of the minor allele increases the trait value by one standard deviation in an additive model. The value of $\sigma_e^2$ represents the residual effect, which includes the effect of environmental factors and other genetic loci. When the quantitative trait is affected by multiple loci, the residual variance can be very large if the trait is regressed on only one of those loci, and the heritability of the quantitative trait explained by this one locus can be very small. The genetic variance explained by this single locus can be expressed as

$$\sigma_a^2 = 2m_1m_2\alpha^2, \qquad \sigma_d^2 = (2m_1m_2ak)^2, \qquad \text{where } \alpha = a[1 + k(m_1 - m_2)]$$

and $a$ and $k$ represent the additive and dominance effects (Tab 1.3). In this simulation, the heritability at a single locus ranges from 0.002 to 0.01 with a step size of 0.002, values that are typical in complex diseases (Fig 1.1). $\gamma_1$ determines the change in the log odds of disease for each copy of the minor allele of the SNP in an additive model, and $\gamma_2$ the change in the log odds of disease for a one standard deviation increase in the quantitative trait. These values are chosen to represent a common association between the disease status and the SNP and quantitative trait. The type I error rate is set at 0.01. The simulation was conducted on additive, dominant and recessive models, although the additive model is the one usually assumed in quantitative trait association analysis.
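The data-generating model of this section (trait from the linear equation, disease status from the logistic equation, then case-control sampling) can be sketched as below. This is a simplified illustration and not the simulation program used in the thesis: the function name is hypothetical, the marker is treated as the causal locus (so LD with a separate disease locus is not modelled), only the additive coding is shown, and the residual standard deviation `sigma_e` is a free input whose value here is an arbitrary assumption.

```python
import math
import random


def simulate_case_control(n_cases, n_controls, maf, beta0, beta1, sigma_e,
                          gamma0, gamma1, gamma2, seed=0):
    """Draw subjects from the population model until both quotas are filled.

    Returns two lists of (genotype, trait) pairs: cases and controls.
    """
    rng = random.Random(seed)
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        # additive coding: number of minor alleles under Hardy-Weinberg proportions
        x = sum(rng.random() < maf for _ in range(2))
        y = beta0 + beta1 * x + rng.gauss(0.0, sigma_e)
        risk = 1.0 / (1.0 + math.exp(-(gamma0 + gamma1 * x + gamma2 * y)))
        affected = rng.random() < risk
        if affected and len(cases) < n_cases:
            cases.append((x, y))
        elif not affected and len(controls) < n_controls:
            controls.append((x, y))
    return cases, controls


if __name__ == "__main__":
    # gamma values as in an additive row of the parameter tables; sigma_e = 1 is assumed
    cases, controls = simulate_case_control(
        200, 200, maf=0.3, beta0=0.0, beta1=1.0, sigma_e=1.0,
        gamma0=-3.36, gamma1=math.log(1.2), gamma2=math.log(1.2), seed=1)
    print(len(cases), len(controls))
```

Sampling by rejection in this way reproduces the nonrandom ascertainment of a case-control design, which is why the linear regression of the trait is adjusted for disease status.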

3.3 Analysis of simulated data

3.3.1 Common allele with medium risk

The simulation was first conducted on a common variant with an odds ratio ranging from 1.2 to 1.4. Table 3.4 lists the parameter values used in the simulation. 2000 cases and 2000 controls were simulated for the additive and dominant models; 6000 cases and 6000 controls were simulated for the recessive model.

Table 3.4: Parameters for medium risk model.

| Model | $\beta_1$ | $\gamma_0$ | $\gamma_1$ | $\gamma_2$ | $f_{11}$ | $f_{12}$ | $f_{22}$ | $OR_{hetero}$ | $OR_{homo}$ |
|---|---|---|---|---|---|---|---|---|---|
| Add1 | 1 | -3.36 | log 1.2 | log 1.2 | 0.045 | 0.053 | 0.063 | 1.20 | 1.44 |
| Add2 | 1 | -3.1 | log 1.15 | log 1.04 | 0.045 | 0.053 | 0.063 | 1.20 | 1.43 |
| Dom1 | 1 | -3.36 | log 1.14 | log 1.14 | 0.043 | 0.056 | 0.056 | 1.30 | 1.30 |
| Dom2 | 1 | -3.16 | log 1.07 | log 1.07 | 0.043 | 0.056 | 0.056 | 1.31 | 1.31 |
| Rec1 | 1 | -3.13 | log 1.14 | log 1.14 | 0.049 | 0.049 | 0.062 | 1 | 1.30 |
| Rec2 | 1 | -3.04 | log 1.07 | log 1.07 | 0.049 | 0.049 | 0.063 | 1 | 1.31 |

$f_{11}$, $f_{12}$ and $f_{22}$ are penetrance values for the genotypes. $OR_{hetero}$ and $OR_{homo}$ are odds ratios for the heterozygous and homozygous genotypes. *The suffixes 1 and 2 refer to alternative disease models 1 and 3 in Figure 3.1.

Figure 3.2 displays the power plot for the simulation result. The x axis denotes the correlation coefficient between the SNP marker and the disease locus. When the correlation coefficient is 0, there is no association between the SNP and the disease locus; this is the null hypothesis. The simulation was conducted on alternative disease models 1 and 3. When the SNP marker is directly associated with the disease but the disease-related quantitative trait is not associated with the SNP marker of interest, the quantitative trait carries no useful information about the SNP studied (i.e., disease model 1 in Fig 3.1); logistic regression is then the most powerful method, followed by Fisher's method. When the quantitative trait is indeed intermediate between the SNP marker and the disease status (i.e., disease model 3 in Fig 3.1), the linear regression analysis of the quantitative trait can provide valuable information for the association analysis. The power of the four tests increases as the correlation coefficient between the SNP marker and the disease locus increases (x axis). As the heritability of the quantitative trait increases from 0.002 to 0.01 (columns 1-5), the power of linear regression increases, and so does that of the meta-analysis methods, because they rely on the result from linear regression. The modified inverse variance weighted combination method is more powerful than Fisher's method in the meta-analysis. For the recessive model, where the power of a case-control study is small, the linear regression dominates the combined analysis, and the performance of Fisher's method and of the modified inverse variance weighted method is almost the same as that of the linear regression.
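The odds ratios reported alongside the penetrances in Table 3.4 can be recomputed from the penetrance values with the usual definition of an odds ratio relative to the reference genotype. The following sketch uses a hypothetical helper `odds_ratio`; because the tabulated penetrances are rounded to three decimals, the recomputed values for the Add1 row come out at about 1.19 and 1.43, close to but not exactly the tabulated 1.20 and 1.44.

```python
def odds_ratio(f_genotype, f_reference):
    """Odds ratio of disease for a genotype relative to the reference genotype."""
    return (f_genotype / (1.0 - f_genotype)) / (f_reference / (1.0 - f_reference))


if __name__ == "__main__":
    # penetrances of the Add1 row of Table 3.4
    f11, f12, f22 = 0.045, 0.053, 0.063
    print(round(odds_ratio(f12, f11), 2), round(odds_ratio(f22, f11), 2))
```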

Figure 3.2: Power plot for medium risk model. Panels (rows): Add_1, Add_2, Dom_1, Dom_2, Rec_1, Rec_2. Columns 1-5: heritability of 0.002, 0.004, 0.006, 0.008 and 0.01; x axis, correlation coefficient; y axis, power. Black solid line, logistic regression between disease status and SNP markers; black dashed line, linear regression between QT and SNP markers with adjustment for disease status; red dashed line, Fisher's combination method; red solid line, modified inverse variance weighted method. Model 1, SNP and QT are independently associated with the disease status; model 2, QT is between the SNP and disease status.

3.3.2 Common variant with low risk

The odds ratios associated with the heterozygous genotypes for many of the identified genetic variants are approximately 1.1, which is a very small effect size. The simulation was therefore also conducted on a common variant with an odds ratio ranging from 1.1 to 1.2. 4000 cases and 4000 controls were simulated for the additive and dominant models; 8000 cases and 8000 controls were simulated for the recessive model. The heritability still ranges from 0.002 to 0.01 with a step size of 0.002. Table 3.5 lists the parameter values for the simulation of a common variant with low risk. Figure 3.3 displays the power plot for the simulation result. The behavior of the tests is very similar to that for common variants with medium risk, and the proposed method is still better than the other tests.

Table 3.5: Parameters for low risk model.

| Model | $\beta_1$ | $\gamma_0$ | $\gamma_1$ | $\gamma_2$ | $f_{11}$ | $f_{12}$ | $f_{22}$ | $OR_{hetero}$ | $OR_{homo}$ |
|---|---|---|---|---|---|---|---|---|---|
| Add1 | 1 | -3.16 | log 1.1 | log 1.1 | 0.047 | 0.052 | 0.057 | 1.10 | 1.21 |
| Add2 | 1 | -3.05 | log 1.05 | log 1.05 | 0.047 | 0.052 | 0.057 | 1.10 | 1.22 |
| Dom1 | 1 | -3.18 | log 1.08 | log 1.08 | 0.046 | 0.054 | 0.054 | 1.17 | 1.17 |
| Dom2 | 1 | -3.07 | log 1.04 | log 1.04 | 0.046 | 0.053 | 0.053 | 1.17 | 1.17 |
| Rec1 | 1 | -3.05 | log 1.08 | log 1.08 | 0.049 | 0.049 | 0.057 | 1 | 1.17 |
| Rec2 | 1 | -3.00 | log 1.04 | log 1.04 | 0.049 | 0.049 | 0.057 | 1 | 1.17 |

$f_{11}$, $f_{12}$ and $f_{22}$ are penetrance values for the genotypes. $OR_{hetero}$ and $OR_{homo}$ are odds ratios for the heterozygous and homozygous genotypes. *The suffixes 1 and 2 refer to alternative disease models 1 and 3 in Figure 3.1.

Figure 3.3: Power plot for low risk model. Panels (rows): Add_1, Add_2, Dom_1, Dom_2, Rec_1, Rec_2. Columns 1-5: heritability of 0.002, 0.004, 0.006, 0.008 and 0.01; x axis, correlation coefficient; y axis, power. Black solid line, logistic regression between disease status and SNP markers; black dashed line, linear regression between QT and SNP markers with adjustment for disease status; red dashed line, Fisher's combination method; red solid line, modified inverse variance weighted method. Model 1, SNP and QT are independently associated with the disease status; model 2, QT is between the SNP and disease status.

3.3.3 Rare variant with high risk

There is strong evidence that rare variants with larger effect sizes are involved in the development of complex diseases. Data were therefore also simulated for a rare variant with an associated odds ratio ranging from 2 to 4. The disease incidence rate due to the rare variant was still set to 0.05. The minor allele frequencies (MAF) of the SNP marker and the disease allele were equal and set to 0.01. In a case-control sampling frame, rare variants and common variants are all sampled together, so the heritability of the quantitative trait at a specific rare variant was still set to range from 0.002 to 0.01 with a step size of 0.002. Table 3.6 lists the parameter values used in the simulation of a rare variant with high risk. 500 cases and 500 controls were simulated for the additive and dominant models; 1500 cases and 1500 controls were simulated for the recessive model.

Table 3.6: Parameters for rare variant with high risk.

| Model | $\beta_1$ | $\gamma_0$ | $\gamma_1$ | $\gamma_2$ | $f_{11}$ | $f_{12}$ | $f_{22}$ | $OR_{hetero}$ | $OR_{homo}$ |
|---|---|---|---|---|---|---|---|---|---|
| Add1 | 0 | -6.72 | log 2 | log 2 | 0.05 | 0.098 | 0.194 | 2.0 | 4.0 |
| Add2 | 1 | -6.36 | log 1.42 | log 1.42 | 0.048 | 0.098 | 0.198 | 2.0 | 4.1 |
| Dom1 | 1 | -6.62 | log 1.74 | log 1.74 | 0.048 | 0.146 | 0.146 | 3.0 | 3.0 |
| Dom2 | 1 | -6.3 | log 1.32 | log 1.32 | 0.048 | 0.146 | 0.146 | 3.0 | 3.0 |
| Rec1 | 1 | -6.54 | log 1.74 | log 1.74 | 0.05 | 0.05 | 0.15 | 1.0 | 3.0 |
| Rec2 | 1 | -6.27 | log 1.32 | log 1.32 | 0.05 | 0.05 | 0.15 | 1.0 | 3.0 |

$f_{11}$, $f_{12}$ and $f_{22}$ are penetrance values for the genotypes. $OR_{hetero}$ and $OR_{homo}$ are odds ratios for the heterozygous and homozygous genotypes. *The suffixes 1 and 2 refer to alternative disease models 1 and 3 in Figure 3.1.

Figure 3.4 displays the power plot for the simulation result. Rare variants have a very small allele frequency in the standard case-control sampling frame. The heterozygous genotype frequency is 0.0198 when the MAF is 0.01. For the additive or dominant model, a sample of 500 cases and 500 controls is large enough to capture the rare allele. The traditional case-control study has 60% power to detect the rare allele with a large effect. When an intermediate phenotype is available, the power of the combination method can increase to > 0.99 with such a sample size, and the modified inverse variance weighted (least variance) method is the best among the four tests. For the recessive model, the homozygous genotype frequency for the susceptibility locus is 0.0001; a sample of 1500 cases is not large enough to capture the risk allele, and it is very hard to detect such a rare variant with this sample size.
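The genotype frequencies quoted above follow from Hardy-Weinberg proportions at a MAF of 0.01. The sketch below, with a hypothetical helper name, recomputes them and the expected number of carriers in samples of 500 and 1500 unselected subjects; counts among cases would be somewhat higher because carriers are enriched by ascertainment, which this sketch does not model.

```python
def hwe_genotype_frequencies(maf):
    """Frequencies of (common homozygote, heterozygote, rare homozygote) under HWE."""
    q = maf
    p = 1.0 - q
    return p * p, 2.0 * p * q, q * q


if __name__ == "__main__":
    _, het, hom = hwe_genotype_frequencies(0.01)
    for n in (500, 1500):
        # expected heterozygote and rare-homozygote counts in n unselected subjects
        print(n, het * n, hom * n)
```

Fewer than one rare homozygote is expected in 1500 unselected subjects, which is consistent with the low power observed for the recessive model.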

Figure 3.4: Power plot for rare variant with high risk. Panels: Add_1, Add_2, Dom_1, Dom_2, Rec_1, Rec_2. Columns 1-5: heritability of 0.002, 0.004, 0.006, 0.008 and 0.01; x axis, correlation coefficient; y axis, power. Black solid line, logistic regression between disease status and SNP markers; black dashed line, linear regression between QT and SNP markers with adjustment for disease status; red dashed line, Fisher's combination method; red solid line, modified inverse variance weighted method. Model 1, SNP and QT are independently associated with the disease status; model 2, QT is between the SNP and disease status.

3.3.4 Quantitative trait available in only cases

Sometimes the quantitative trait can be obtained only from affected individuals because the trait is expressed only in those with the disease. For example, the three-dimensional data used to determine the degree of facial symmetry are available only for unilateral cleft lip and palate patients. Another possible reason is that the measurement can be very expensive and not affordable for unaffected people. To test the validity of the meta-analysis methods, the tests were also carried out with the quantitative trait available only in cases. All the simulation parameters were the same as in the previous simulations, except that the quantitative trait was available only in cases. Figures 3.5, 3.6 and 3.7 display the power plots for the simulation results. When the quantitative trait is available only in cases, the power of the linear regression is smaller because the sample size is reduced, but the quantitative trait can still provide valuable information for association analysis. The meta-analysis still has decent power, and the inverse variance weighted method is still the best.

Figure 3.5: Power plot for medium risk. Panels: Add_1, Add_2, Dom_1, Dom_2, Rec_1, Rec_2. Columns 1-5: heritability of 0.002, 0.004, 0.006, 0.008 and 0.01; x axis, correlation coefficient; y axis, power. Black solid line, logistic regression between disease status and SNP markers; black dashed line, linear regression between QT and SNP markers with adjustment for disease status; red dashed line, Fisher's combination method; red solid line, modified inverse variance weighted method. Model 1, SNP and QT are independently associated with the disease status; model 2, QT is between the SNP and disease status.

Figure 3.6: Power plot for low risk model. Panels: Add_1, Add_2, Dom_1, Dom_2, Rec_1, Rec_2. Columns 1-5: heritability of 0.002, 0.004, 0.006, 0.008 and 0.01; x axis, correlation coefficient; y axis, power. Black solid line, logistic regression between disease status and SNP markers; black dashed line, linear regression between QT and SNP markers with adjustment for disease status; red dashed line, Fisher's combination method; red solid line, modified inverse variance weighted method. Model 1, SNP and QT are independently associated with the disease status; model 2, QT is between the SNP and disease status.

Figure 3.7: Power plot for rare variant with high risk model. Panels: Add_1, Add_2, Dom_1, Dom_2, Rec_1, Rec_2. Columns 1-5: heritability of 0.002, 0.004, 0.006, 0.008 and 0.01; x axis, correlation coefficient; y axis, power. Black solid line, logistic regression between disease status and SNP markers; black dashed line, linear regression between QT and SNP markers with adjustment for disease status; red dashed line, Fisher's combination method; red solid line, modified inverse variance weighted method. Model 1, SNP and QT are independently associated with the disease status; model 2, QT is between the SNP and disease status.

3.3.5 Type I error rate

The nominal type I error rate (significance level) in the simulation was set to 0.01. To obtain a more accurate estimate of the type I error rate, 10,000 simulations were carried out for each set of parameters under the null hypothesis, i.e., no association between the SNP marker and the disease. No inflated type I error rate was detected in the simulation. Tables 3.7, 3.8 and 3.9 report the empirical type I error rates under the various conditions.

Table 3.7: Type I error rate for medium risk model. Column headings give the disease model (I or II) and the heritability.

| Genetic model | Test | I, 0.01 | I, 0.008 | I, 0.006 | I, 0.004 | I, 0.002 | II, 0.01 | II, 0.008 | II, 0.006 | II, 0.004 | II, 0.002 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Add | Test 1 | 0.0111 | 0.0096 | 0.0098 | 0.013 | 0.0101 | 0.0091 | 0.0102 | 0.0093 | 0.009 | 0.0097 |
| | Test 2 | 0.0085 | 0.0097 | 0.0096 | 0.0096 | 0.0096 | 0.0106 | 0.0112 | 0.0099 | 0.0103 | 0.0109 |
| | Test 3 | 0.01 | 0.0083 | 0.0102 | 0.0101 | 0.0084 | 0.0098 | 0.0093 | 0.0099 | 0.0091 | 0.0083 |
| | Test 4 | 0.0108 | 0.0083 | 0.0102 | 0.0118 | 0.0093 | 0.01 | 0.0107 | 0.0096 | 0.0085 | 0.0105 |
| Dom | Test 1 | 0.0101 | 0.0113 | 0.0101 | 0.0084 | 0.0097 | 0.0101 | 0.0114 | 0.0087 | 0.0105 | 0.0078 |
| | Test 2 | 0.0104 | 0.0099 | 0.0089 | 0.0094 | 0.0104 | 0.0111 | 0.0107 | 0.0099 | 0.011 | 0.0084 |
| | Test 3 | 0.0104 | 0.009 | 0.0091 | 0.0097 | 0.0099 | 0.0117 | 0.0115 | 0.0105 | 0.0098 | 0.0099 |
| | Test 4 | 0.0095 | 0.001 | 0.0102 | 0.0105 | 0.0105 | 0.0116 | 0.0119 | 0.0091 | 0.0095 | 0.0086 |
| Rec | Test 1 | 0.0098 | 0.0074 | 0.0102 | 0.0105 | 0.0011 | 0.0111 | 0.0103 | 0.0097 | 0.0096 | 0.0106 |
| | Test 2 | 0.011 | 0.0087 | 0.0098 | 0.0082 | 0.0103 | 0.0107 | 0.0102 | 0.0106 | 0.0111 | 0.0095 |
| | Test 3 | 0.0078 | 0.0082 | 0.0108 | 0.01 | 0.0111 | 0.0107 | 0.0111 | 0.0097 | 0.0097 | 0.0102 |
| | Test 4 | 0.0099 | 0.0087 | 0.0092 | 0.0088 | 0.0101 | 0.0102 | 0.009 | 0.0102 | 0.0102 | 0.0103 |

Test 1-4: logistic regression, linear regression, Fisher's method, modified inverse variance weighted method.
10,000 simulations were conducted for each of the conditions under the null hypothesis.

Table 3.8: Type I error rate for low risk model. Column headings give the disease model (I or II) and the heritability.

| Genetic model | Test | I, 0.01 | I, 0.008 | I, 0.006 | I, 0.004 | I, 0.002 | II, 0.01 | II, 0.008 | II, 0.006 | II, 0.004 | II, 0.002 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Add | Test 1 | 0.0101 | 0.0095 | 0.0098 | 0.0081 | 0.0105 | 0.0092 | 0.0105 | 0.0101 | 0.0101 | 0.0099 |
| | Test 2 | 0.01 | 0.0099 | 0.0109 | 0.0112 | 0.01 | 0.01 | 0.0095 | 0.0111 | 0.0094 | 0.0107 |
| | Test 3 | 0.0095 | 0.0109 | 0.0089 | 0.0087 | 0.0107 | 0.0103 | 0.0087 | 0.0107 | 0.0103 | 0.011 |
| | Test 4 | 0.0092 | 0.0109 | 0.0094 | 0.0083 | 0.0107 | 0.0112 | 0.0105 | 0.0112 | 0.0108 | 0.0123 |
| Dom | Test 1 | 0.0092 | 0.0081 | 0.0105 | 0.0096 | 0.0095 | 0.0109 | 0.0101 | 0.0121 | 0.0104 | 0.0097 |
| | Test 2 | 0.0096 | 0.0111 | 0.0103 | 0.0093 | 0.0092 | 0.0099 | 0.009 | 0.0098 | 0.0122 | 0.0114 |
| | Test 3 | 0.009 | 0.0084 | 0.0105 | 0.0085 | 0.0112 | 0.0103 | 0.01 | 0.0106 | 0.0117 | 0.0095 |
| | Test 4 | 0.0091 | 0.0096 | 0.0105 | 0.0087 | 0.0099 | 0.0094 | 0.0098 | 0.0102 | 0.0118 | 0.011 |
| Rec | Test 1 | 0.009 | 0.0087 | 0.0078 | 0.0102 | 0.0103 | 0.0092 | 0.0101 | 0.0109 | 0.0102 | 0.01 |
| | Test 2 | 0.01 | 0.009 | 0.0103 | 0.0102 | 0.0101 | 0.0107 | 0.0112 | 0.0115 | 0.0103 | 0.0081 |
| | Test 3 | 0.0087 | 0.0103 | 0.0088 | 0.0112 | 0.009 | 0.0103 | 0.0095 | 0.0106 | 0.01 | 0.0111 |
| | Test 4 | 0.0091 | 0.0105 | 0.0077 | 0.0104 | 0.0103 | 0.0096 | 0.0101 | 0.0113 | 0.0106 | 0.009 |

Test 1-4: logistic regression, linear regression, Fisher's method, modified inverse variance weighted method.
10,000 simulations were conducted for each of the conditions under the null hypothesis.

Table 3.9: Type I error rate for rare variant with high risk. Column headings give the disease model (I or II) and the heritability.

| Genetic model | Test | I, 0.01 | I, 0.008 | I, 0.006 | I, 0.004 | I, 0.002 | II, 0.01 | II, 0.008 | II, 0.006 | II, 0.004 | II, 0.002 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Add | Test 1 | 0.009 | 0.0076 | 0.0087 | 0.0095 | 0.0089 | 0.0085 | 0.0081 | 0.0087 | 0.0095 | 0.0094 |
| | Test 2 | 0.0109 | 0.0099 | 0.0099 | 0.0106 | 0.0115 | 0.0103 | 0.01 | 0.0118 | 0.0079 | 0.0102 |
| | Test 3 | 0.0084 | 0.0106 | 0.0095 | 0.0102 | 0.0094 | 0.0078 | 0.0088 | 0.0099 | 0.0096 | 0.011 |
| | Test 4 | 0.009 | 0.0096 | 0.0095 | 0.009 | 0.0099 | 0.0079 | 0.0101 | 0.01 | 0.0098 | 0.0103 |
| Dom | Test 1 | 0.0088 | 0.01 | 0.0095 | 0.008 | 0.0095 | 0.0089 | 0.0084 | 0.0073 | 0.0088 | 0.0093 |
| | Test 2 | 0.0126 | 0.0107 | 0.0088 | 0.0106 | 0.0099 | 0.0098 | 0.0094 | 0.0087 | 0.0108 | 0.0111 |
| | Test 3 | 0.0093 | 0.0117 | 0.0094 | 0.0095 | 0.0086 | 0.0096 | 0.009 | 0.0103 | 0.0103 | 0.0091 |
| | Test 4 | 0.0102 | 0.0095 | 0.0089 | 0.0091 | 0.0088 | 0.0091 | 0.0093 | 0.0095 | 0.0089 | 0.0091 |
| Rec | Test 1 | 0.01 | 0.0102 | 0.0099 | 0.0099 | 0.0099 | 0.0081 | 0.0092 | 0.0099 | 0.0081 | 0.0093 |
| | Test 2 | 0.0101 | 0.0089 | 0.0092 | 0.0106 | 0.0099 | 0.0106 | 0.0108 | 0.0105 | 0.0116 | 0.0108 |
| | Test 3 | 0.0098 | 0.0101 | 0.0085 | 0.0104 | 0.0101 | 0.01 | 0.0109 | 0.0097 | 0.0119 | 0.0105 |
| | Test 4 | 0.0092 | 0.0104 | 0.0099 | 0.0106 | 0.0107 | 0.0091 | 0.0111 | 0.0096 | 0.0106 | 0.0111 |

Test 1-4: logistic regression, linear regression, Fisher's method, modified inverse variance weighted method.
10,000 simulations were conducted for each of the conditions under the null hypothesis.
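When reading the empirical type I error rates in these tables, it can help to know how much they are expected to vary from simulation noise alone. The sketch below assumes that each of the 10,000 null replicates is an independent Bernoulli trial with rejection probability 0.01; the function name `mc_standard_error` is illustrative, and the calculation is a rough guide rather than a formal test of the tabulated values.

```python
import math


def mc_standard_error(alpha, n_sim):
    """Binomial standard error of an empirical rejection rate estimated
    from n_sim independent null simulations at nominal level alpha."""
    return math.sqrt(alpha * (1.0 - alpha) / n_sim)


alpha, n_sim = 0.01, 10_000
se = mc_standard_error(alpha, n_sim)
lower, upper = alpha - 1.96 * se, alpha + 1.96 * se
print(f"Monte Carlo standard error: {se:.6f}")
print(f"approximate 95% range under the null: {lower:.4f} to {upper:.4f}")
```

Under these assumptions the standard error is about 0.001, so most entries should fall roughly between 0.008 and 0.012, with an occasional entry outside that range expected when many conditions are tabulated.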

CHAPTER 4
APPLICATION IN LUNG CANCER DATA

4.1 Introduction

Lung cancer killed over 160 thousand people in the US in 2007, which exceeds the combined mortality attributable to breast, prostate and colon cancer. It has become the leading cause of cancer deaths in both men and women. In addition to well-established environmental factors, genetic factors play an important role in the development of lung cancer. A number of studies show that lung cancer cases have a significantly greater number of first-degree relatives with lung cancer than controls without lung cancer, irrespective of the relatives' smoking behavior [41]. Several familial and linkage studies have identified highly penetrant variants associated with lung cancer in RB1, TP53, and 6q23-25 [42, 43, 44]. More recently, GWA studies have consistently identified three loci at a genome-wide significance level: variants at 5p15.33 (TERT-CLPTM1L), 6p21 (BAT3, APOM) and 15q25.1 (CHRNA3, CHRNB4, and CHRNA5), which together explain about 7% of the familial risk of lung cancer [45, 9]. The identified risk alleles usually confer a small to medium predisposition to lung cancer. For example, rs1051730 and rs8034191 in CHRNA3 and CHRNA5, respectively, have an associated odds ratio of around 1.3 ($P < 1 \times 10^{-17}$) in both the study population and the replication population [46]. Figure 4.1 displays all the genes that have so far been identified as associated with lung cancer in humans.

Figure 4.1: Allele frequency and effect sizes for genetic variants associated with lung cancer. Allele frequencies and ORs are taken from published literature where available and are not depicted to scale. Associations identified through GWA or GWA follow-up studies are shown with solid colored bars; all others are shaded from dark (top) to light (bottom) [Hindorff et al., 2011].

In addition to the successful lung cancer studies in humans, researchers also use mice as study subjects because mice develop lung cancer that is similar to human lung cancer in histology, molecular characteristics, and histogenesis. Using inbred mice as the subjects in a lung cancer study can reduce the genetic complexity of the analysis. Many more lung cancer associated QTLs have been mapped on chromosomes 4, 6, 8, 9, 13 and 19, among others [47]. Lung cancer is therefore a typical complex disorder with multiple genes involved in disease etiology.

Smoking is the most important environmental risk factor for lung cancer and is estimated to be responsible for approximately 90% of lung cancer in men and 85% in women. Lung cancer risk attributable to tobacco smoking is strongly affected by the duration of smoking and declines with increasing time since cessation [48]. The death rate also increases as the use of cigarettes increases. GWA studies have been conducted to find genes associated with smoking behavior. Researchers used linear regression analysis to identify the significant markers associated with smoking quantity, smoking duration, etc. Several independent studies found that the nicotinic acetylcholine receptor subunit genes CHRNA3 and CHRNA5, which are associated with lung cancer, are also associated with smoking quantity (Figure 4.2) [49, 50]. Table 4.1 lists the significant SNPs in the CHRNA region [51]. These results provide important evidence that smoking is a heritable trait.

Figure 4.2: Genome-wide smoking quantity meta-analysis. Left, UK study [Liu et al., 2010]; right, TAG Consortium. Smoking quantity: (a) CPD, (b) former versus current smoking, (c) ever versus never smoking, and (d) age of smoking initiation [The Tobacco and Genetics Consortium, 2010].

Table 4.1: Summary information for selected SNPs at 15q25 [The Tobacco and Genetics Consortium, 2010].

| SNP | Chr. | Position | Coded allele | Coded allele freq. | Ox-GSK P | Ox-GSK Phet | TAG P | ENGAGE P | Combined P | Combined β | Combined s.e.m. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| rs588765 | 15 | 76,652,480 | T | 0.43 | 1.74 × 10^-3 | 0.50 | NA | NA | NA | NA | NA |
| rs16969968 | 15 | 76,669,980 | G | 0.65 | 1.64 × 10^-18 | 0.86 | 1.85 × 10^-27 | 1.53 × 10^-23 | 4.29 × 10^-65 | -0.078 | 0.0046 |
| rs1051730 | 15 | 76,681,394 | G | 0.66 | 9.45 × 10^-19 | 0.68 | 3.62 × 10^-27 | 9.98 × 10^-25 | 1.71 × 10^-66 | -0.079 | 0.0046 |
| rs6495308 | 15 | 76,694,711 | T | 0.77 | 3.30 × 10^-10 | 0.10 | 7.99 × 10^-24 | 1.60 × 10^-13 | 5.81 × 10^-44 | 0.073 | 0.0052 |

The association between smoking and lung cancer is among the strongest in the epidemiological firmament, and any gene variant that is modestly linked with smoking behavior will seem to be associated with lung cancer unless the matching of cancer cases and controls by smoking behavior is close to perfect [52]. There is therefore reason to believe that smoking quantity is an intermediate phenotype for lung cancer.

4.2 Lung cancer with smoking data

4.2.1 Introduction

The genome-wide lung cancer data come from Illumina HumanHap300 v1.1 BeadChips, which contain 317,498 tagging SNPs, genotyped in a series of 1154 ever-smoking lung cancer cases and 1137 ever-smoking controls. The cases and controls were matched by age and sex, and all the samples are restricted to European origin (Table 4.2). The results of the initial case-control analysis were published in Nature Genetics in 2008, and there was no evidence of genome-wide inflation of the χ² tests (λ = 1.014) (Figure 4.3). The quantitative traits of cigarettes per day, smoking years and pack years (pack years = cigarettes per day × smoking years / 20) were used in the analysis. The threshold for a significant signal was set to $1 \times 10^{-4.5}$ in the originally published paper to adjust for multiple comparisons. This cutoff value is kept in this study.

Table 4.2: Characteristics of study population [Amos et al., 2008]

| Characteristic | Cases (n=1154) | Controls (n=1137) |
|---|---|---|
| Age (s.d.) | 62.1 (10.8) | 61.1 (8.9) |
| Sex (percent male) | 57.0 | 56.6 |
| Never smokers (percent) | 0 | 0 |
| Former smokers (percent) | 52.3 | 57.8 |
| Current smokers (percent) | 47.8 | 42.2 |
| Cigarettes per day (s.d.) | 28.0 (13.6) | 26.6 (14.3) |
| Pack years (s.d.), current smokers | 57.3 (30.6) | 47.1 (29.1) |
| Pack years (s.d.), former smokers | 46.2 (31.2) | 42.8 (30.9) |
| Years smoker (s.d.), current smokers | 40.2 (11.0) | 39.0 (11.0) |
| Years smoker (s.d.), former smokers | 31.9 (12.8) | 28.2 (11.9) |
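The significance cutoff and the inflation factor λ mentioned in this subsection can be illustrated with a short sketch. It assumes the usual genomic-control definition of λ (median observed 1-df χ² statistic divided by the null median of about 0.455), which may differ in detail from the computation in the original paper; the function names and the evenly spaced null p values are hypothetical and used only for illustration.

```python
import math
import statistics
from statistics import NormalDist

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df


def is_significant(p, neg_log10_cutoff=4.5):
    """True when -log10(p) exceeds the cutoff, i.e. p < 10**-4.5 (about 3.2e-5)."""
    return -math.log10(p) > neg_log10_cutoff


def inflation_factor(p_values):
    """Genomic inflation factor from two-sided p values in (0, 1]:
    median of the implied 1-df chi-square statistics over the null median."""
    std_normal = NormalDist()
    # A two-sided p value corresponds to z with lower-tail probability p / 2.
    chi2 = [std_normal.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return statistics.median(chi2) / CHI2_1DF_MEDIAN


# Hypothetical, perfectly uniform null p values: lambda should be close to 1.
null_p = [(i + 0.5) / 1001 for i in range(1001)]
print(f"lambda for uniform p values: {inflation_factor(null_p):.3f}")
print(f"p = 1.14e-5 significant? {is_significant(1.14e-5)}")
print(f"p = 5.76e-4 significant? {is_significant(5.76e-4)}")
```

A λ close to 1, such as the reported 1.014, indicates that the bulk of the test statistics behaves as expected under the null hypothesis.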

Figure 4.3: Left: results from genome-wide association analysis of directly tested SNPs in the Texas discovery set using Illumina 300K HumanHap v1.1 BeadChips. Right: Q-Q plot of the -log10(P) and the inflation factor λ.

4.2.2 Quantitative trait available in lung cancer

The available quantitative traits from the lung cancer study are smoking years and cigarettes per day (CPD). A new quantity, pack years, was derived as smoking years × CPD / 20. Boxplots and histograms were used to visualize the distribution of the data, and Q-Q plots were used to check the normality of the data [53]. Some outliers were found and removed from smoking years and CPD. A square root transformation was used to normalize the CPD and pack years data (Figures 4.4 and 4.5).
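The derived trait and the transformation described above can be sketched as follows. The input values are hypothetical, the function names are illustrative, and the outlier rule used in the thesis is not stated in this section, so no outlier removal is shown.

```python
import math


def pack_years(cigarettes_per_day, smoking_years):
    """Pack years = cigarettes per day x smoking years / 20 (20 cigarettes per pack)."""
    return cigarettes_per_day * smoking_years / 20.0


def sqrt_transform(values):
    """Square root transformation, used to reduce right skew in non-negative traits."""
    return [math.sqrt(v) for v in values]


# Hypothetical smokers: (cigarettes per day, years smoked)
smokers = [(10.0, 15.0), (20.0, 30.0), (40.0, 35.0)]
py = [pack_years(cpd, years) for cpd, years in smokers]
print("pack years:      ", py)
print("sqrt(pack years):", [round(v, 3) for v in sqrt_transform(py)])
```

For example, a person who smoked 20 cigarettes per day for 30 years has 30 pack years under this definition.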

Figure 4.4: Descriptive plot of quantitative trait before transformation.

Figure 4.5: Descriptive plot of quantitative traits after outlier removal and square root transformation of CPD and pack years.

4.2.3 Lung cancer with cigarettes per day

The -log10(p) values from the linear regression of CPD with adjustment for disease status are plotted in Figure 4.6. The information from CPD was incorporated in the analysis using Fisher's method and the modified inverse variance weighted method. The p values from these two tests are plotted in Figure 4.7 and Figure 4.8. The inflation factors λ are around 1.01-1.02, which is small and indicates that there is no spurious association.

Figure 4.6: Association analysis of CPD.

Figure 4.7: Fisher's method.

Figure 4.8: Modified inverse variance weighted method.

SNPs with a -log10(p) greater than 4.5 are deemed significant, consistent with the criterion used in the Nature Genetics paper. Consecutive significant SNPs within a gene are listed in Table 4.3. The SNP rs1051730 within the CHRNA3 gene has a p value of $1.14 \times 10^{-5}$ in the case-control study and a p value of $9.18 \times 10^{-4}$ in the quantitative trait association analysis. When the two tests are combined, the p value reaches $2.03 \times 10^{-7}$ with Fisher's test and $5.04 \times 10^{-8}$ with the inverse variance weighted method, which is a very strong signal for an association analysis with such a small sample size. This result confirms that the CHRNA3 gene is associated with the CPD phenotype, as already shown in several independent studies, and that CPD is an intermediate phenotype for lung cancer.

Table 4.3: CPD: summary of association analysis (p values).

| Locus | SNP | Chromosome | Position (bp) | Test1 | Test2 | Test3 | Test4 |
|---|---|---|---|---|---|---|---|
| SUMF1^1 | rs1444056 | 3 | 4214953 | 1.83E-03 | 2.82E-03 | 6.81E-05 | 1.58E-05 |
| | rs313682 | 3 | 4236274 | 1.39E-04 | 1.07E-02 | 2.13E-05 | 6.75E-06 |
| CHRNA3, CHRNA5^2 | rs8034191 | 15 | 76593078 | 1.94E-05 | 5.27E-04 | 1.98E-07 | 4.37E-08 |
| | rs1051730 | 15 | 76681394 | 1.14E-05 | 9.18E-04 | 2.03E-07 | 5.04E-08 |
| CYP2F1^3 | rs2241714 | 19 | 46561232 | 5.76E-04 | 3.32E-03 | 2.71E-05 | 6.40E-06 |
| | rs10853751 | 19 | 46595060 | 1.44E-05 | 5.62E-02 | 1.21E-05 | 9.95E-06 |
| | rs1473248 | 19 | 46615154 | 1.10E-05 | 6.95E-02 | 1.16E-05 | 1.12E-05 |

1 SUMF1: sulfatase modifying factor 1 isoform 1
2 CHRNA3, CHRNA5: cholinergic receptor, nicotinic, alpha 3, alpha 5
3 CYP2F1: cytochrome P450, family 2, subfamily F, polypeptide 1

In addition to CHRNA3, another interesting finding is the CYP2F1 gene on chromosome 19, which is very close to the already known cigarette-induced lung cancer gene CYP2A6. Biochemical investigations have shown that the cytochrome P450 superfamily (CYP) is a large and diverse group of enzymes. Cytochrome P450-mediated bioactivation of toxicants is a particularly relevant process in lung diseases because the lungs are exposed directly to environmental pollutants, such as cigarette smoke. It has been shown that many P450 genes (CYP1A1, CYP1B1, CYP2B6, CYP2E1, CYP2F1, CYP2S1, CYP3A5, and CYP4B1) are transcribed in lung tissues, and that lung tissues activate carcinogens to produce organ-selective damage. The CYP2A6 gene has been found to be related to tobacco-induced lung cancers [41, 54]. However, there has been no previous evidence that CYP2F1 is associated with lung cancer. The results from this study indicate that CYP2F1 is also associated with lung cancer and with the smoking quantity CPD. Three consecutive SNPs in this gene have p values below 0.001 in the case-control study, and they have an average p value of 0.043 in the quantitative trait association analysis. In the meta-analysis, they have an average p value of $1.69 \times 10^{-5}$ using Fisher's test and $9.18 \times 10^{-6}$ using the inverse variance weighted method.
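The Fisher combination (Test3) entries in Table 4.3 can be checked by hand. The sketch below assumes the standard form of Fisher's method for two independent p values, where the statistic $-2(\ln p_1 + \ln p_2)$ is referred to a χ² distribution with 4 degrees of freedom; for two tests this tail probability has the closed form $p_1 p_2 (1 - \ln(p_1 p_2))$. The function name is illustrative.

```python
import math


def fisher_combined_p(p1, p2):
    """Fisher's combined p value for two independent tests.

    The statistic -2 * (ln p1 + ln p2) follows a chi-square distribution with
    4 degrees of freedom under the null; its upper tail probability simplifies
    to p1 * p2 * (1 - ln(p1 * p2))."""
    prod = p1 * p2
    return prod * (1.0 - math.log(prod))


# rs1051730 in Table 4.3: case-control (Test1) and CPD regression (Test2)
p_case_control = 1.14e-5
p_cpd = 9.18e-4
print(f"Fisher combined p: {fisher_combined_p(p_case_control, p_cpd):.2e}")
```

With these inputs the combined value is about $2.03 \times 10^{-7}$, in agreement with the Test3 entry for rs1051730.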

The p values from the inverse variance weighted method are all smaller than those from Fisher's method. This indicates that the combined analysis is more powerful in association analysis when an intermediate phenotype is available, and that the inverse variance weighted method is more powerful than Fisher's combination method in the meta-analysis. Figure 4.9 plots -log10(p) against -log10(p) for case-control vs. Fisher's method and for case-control vs. the inverse variance weighted method. The plot on the left shows that Fisher's method tends to produce more significant signals for the non-significant SNPs (-log10(p) < 4), which may introduce a higher false discovery rate in the analysis. The type I error rate is higher with Fisher's method than with the inverse variance weighted method, although no inflated type I error rate was detected in the simulation study. The reason is that Fisher's method is based on the p values from the tests and thus cannot tell the direction of the effects. Fisher's method can produce a significant result even when the effect sizes from the logistic regression and the linear regression are in opposite directions. This should not happen when the quantitative trait is an intermediate phenotype for the disease. The inverse variance weighted method does not have this problem because it is based on a linear combination of the two effect sizes. In this respect, the modified inverse variance weighted method is a better choice for a single-population meta-analysis.

Figure 4.9: CPD: -log10(p) comparison between tests. X and Y axes are -log10(p). Red line is the diagonal line; blue line is the cutoff of significance, -log10(p) > 4.5. Left, case-control vs. Fisher's method; right, case-control vs. inverse variance weighted method.

4.2.4 Lung cancer with smoking years

The p values from the linear regression with adjustment for disease status, Fisher's method and the modified inverse variance weighted method are plotted in Figures 4.10, 4.11 and 4.12. The smoking years association analysis has many spurious signals, but they are eliminated by the meta-analysis. The modified inverse variance weighted method has a smaller λ value than Fisher's method, which shows that it performs better in dealing with those spurious signals. Figure 4.13 plots -log10(p) against -log10(p) for case-control vs. Fisher's method and for case-control vs. the inverse variance weighted method.

Figure 4.10: Association analysis of smoking years.

Figure 4.11: Smoking years: Fisher's method.

Figure 4.12: Smoking years: Modified inverse variance weighted method.

Consecutive significant SNPs at a locus are listed in Table 4.4. The SNPs in the CHRNA3 and CYP2F1 regions are not associated with smoking years (Table 4.4). SNPs at 1q42.13 and 10q26.13 are associated with smoking years and lung cancer. These SNPs are located in an expressed sequence that is spliced, which is often found near an active regulatory element. Three SNPs near the ARHGAP10 gene on chromosome 4 were also detected. The three SNPs rs13119467, rs4835446 and rs11099666 have p values of $4.90 \times 10^{-5}$, $4.83 \times 10^{-5}$ and $4.45 \times 10^{-5}$ in the case-control study. In the meta-analysis, they have an average p value of $1.43 \times 10^{-5}$ using Fisher's test and $6.33 \times 10^{-6}$ using the inverse variance weighted method. In an Australian GWA study, one SNP in ARHGAP10 achieved genome-wide significance ($p < 5 \times 10^{-8}$) for nicotine dependence ($p = 4.43 \times 10^{-8}$) [55].

Table 4.4: Smoking years: summary of association analysis (p values).

| Locus | SNP | Chromosome | Position (bp) | Test1 | Test2 | Test3 | Test4 |
|---|---|---|---|---|---|---|---|
| Intron^1 | rs241301 | 1 | 225269162 | 1.86E-03 | 2.43E-03 | 6.00E-05 | 1.38E-05 |
| | rs801114 | 1 | 225304570 | 5.08E-03 | 4.89E-04 | 3.45E-05 | 8.56E-06 |
| | rs1341715 | 1 | 225325651 | 8.22E-04 | 1.87E-04 | 2.56E-06 | 5.40E-07 |
| near ARHGAP10^2 | rs13119467 | 4 | 148911066 | 4.90E-05 | 1.31E-02 | 9.77E-06 | 3.70E-06 |
| | rs4835446 | 4 | 148989132 | 4.83E-05 | 2.59E-02 | 1.82E-05 | 8.59E-06 |
| | rs11099666 | 4 | 148991033 | 4.45E-05 | 2.24E-02 | 1.48E-05 | 6.69E-06 |
| MGI QTL^3 | rs6872555 | 5 | 135800355 | 3.84E-03 | 1.47E-03 | 7.40E-05 | 1.74E-05 |
| | rs6898839 | 5 | 135806839 | 4.25E-03 | 4.70E-04 | 2.82E-05 | 6.85E-06 |
| NRG3^4 | rs951204 | 10 | 84038745 | 9.05E-04 | 4.74E-04 | 6.72E-06 | 8.98E-01 |
| | rs11193663 | 10 | 84066712 | 4.60E-03 | 4.22E-04 | 2.75E-05 | 6.22E-01 |
| | rs1317093 | 10 | 84109812 | 8.01E-04 | 1.48E-03 | 1.74E-05 | 9.04E-01 |
| Intron | rs7897813 | 10 | 127105588 | 3.21E-05 | 1.64E-02 | 8.13E-06 | 3.50E-06 |
| | rs7896776 | 10 | 127108323 | 3.81E-05 | 1.21E-02 | 7.20E-06 | 2.77E-06 |
| | rs1350606 | 10 | 127132796 | 5.84E-04 | 8.98E-03 | 6.90E-05 | 1.86E-05 |
| CHRNA3, CHRNA5^5 | rs8034191 | 15 | 76593078 | 1.94E-05 | 6.69E-01 | 1.59E-04 | 6.56E-03 |
| | rs1051730 | 15 | 76681394 | 1.14E-05 | 6.00E-01 | 8.83E-05 | 6.29E-03 |
| CYP2F1^6 | rs10853751 | 19 | 46595060 | 1.44E-05 | 7.36E-01 | 1.32E-04 | 4.67E-03 |
| | rs1473248 | 19 | 46615154 | 1.10E-05 | 4.87E-01 | 7.05E-05 | 8.87E-03 |
| | rs284662 | 19 | 46624115 | 2.56E-05 | 4.47E-01 | 1.41E-04 | 1.48E-02 |
| | rs2231940 | 19 | 46636077 | 1.62E-05 | 4.52E-01 | 9.37E-05 | 1.18E-02 |

1 Intron: human EST that has been spliced, often found near a regulatory element
2 ARHGAP10: Rho GTPase activating protein 10
3 MGI QTL: mouse QTLs
4 NRG3: neuregulin 3 isoform 1
5 CHRNA3, CHRNA5: cholinergic receptor, nicotinic, alpha 3, alpha 5
6 CYP2F1: cytochrome P450, family 2, subfamily F, polypeptide 1

The results from this study also suggest that ARHGAP10 or its surrounding region is related to nicotine dependence, and that nicotine dependence may be another intermediate phenotype for lung cancer. The SNPs in the NRG3 gene show significant associations in both the case-control study and the quantitative trait association analysis. However, their allelic effects are in opposite directions in the two tests, i.e., the allele is protective in one test but deleterious in the other. This should not happen if an intermediate phenotype is involved in the development of the disease. Fisher's method is based on the p values of the individual tests, which cannot tell the direction of the effect, whereas the inverse variance weighted method is based on the effect estimates from the individual tests and is therefore more accurate when working with intermediate phenotype information. Although the results from the smoking years association analysis are not significant at the CHRNA3 and CYP2F1 loci, Fisher's method still produces fairly strong signals at those two loci (Table 4.4), which indicates that it is not always a good meta-analysis method for intermediate phenotype information.
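The contrast between a p-value-based and a direction-aware combination can be illustrated with the NRG3 SNP rs951204 from Table 4.4. The sketch below is not the thesis's modified inverse variance weighted statistic, whose exact form is defined elsewhere; as a stand-in it uses an equally weighted combination of signed Z-scores, and the assignment of signs to the two tests is assumed, since the text states only that the effects are opposite. All function names are illustrative.

```python
import math
from statistics import NormalDist


def fisher_combined_p(p1, p2):
    """Fisher's method for two independent p values (chi-square, 4 df)."""
    prod = p1 * p2
    return prod * (1.0 - math.log(prod))


def signed_z(p_two_sided, effect_sign):
    """Convert a two-sided p value and an effect direction into a signed Z-score."""
    magnitude = -NormalDist().inv_cdf(p_two_sided / 2.0)
    return math.copysign(magnitude, effect_sign)


def signed_z_combined_p(p1, sign1, p2, sign2):
    """Equal-weight combination of two signed Z-scores (assumed stand-in for a
    direction-aware method); returns a two-sided p value."""
    z = (signed_z(p1, sign1) + signed_z(p2, sign2)) / math.sqrt(2.0)
    return math.erfc(abs(z) / math.sqrt(2.0))


# rs951204: Test1 (case-control) and Test2 (smoking years) p values from Table 4.4
p1, p2 = 9.05e-4, 4.74e-4
print(f"Fisher, direction ignored:   {fisher_combined_p(p1, p2):.2e}")
print(f"signed Z, opposite effects:  {signed_z_combined_p(p1, +1, p2, -1):.2f}")
print(f"signed Z, same direction:    {signed_z_combined_p(p1, +1, p2, +1):.2e}")
```

Fisher's method returns a small combined p value regardless of direction, while the signed combination yields a large p value when the two effects oppose each other, which mirrors the pattern seen for NRG3 in the Test3 and Test4 columns.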

Figure 4.13: Smoking years: -log10(p) comparison between tests. X and Y axes are -log10(p). Red line is the diagonal line; blue line is the cutoff of significance, -log10(p) > 4.5. Left, case-control vs. Fisher's method; right, case-control vs. modified inverse variance weighted method.

4.2.5 Lung cancer with pack years

Pack years are defined as cigarettes per day × smoking years / 20. The quantitative trait analysis of pack years also shows some spurious signals, like those from the smoking years analysis, but they are eliminated by the meta-analysis. The quantitative trait of pack years was incorporated in the analysis using Fisher's method and the inverse variance weighted method. The p values from the tests are plotted in Figures 4.15 and 4.16. Table 4.5 shows the p values across Test 1-4 for the markers that show significant p values at two or more consecutive SNPs within a gene. The results show that pack years is not associated with either the CHRNA3 gene or the CYP2F1 gene. Figure 4.17 plots -log10(p) against -log10(p) for case-control vs. Fisher's method and for case-control vs. the inverse variance weighted method.

Figure 4.14: Association analysis of pack years.

Figure 4.15: Pack years: Fisher's method.

Figure 4.16: Pack years: Modified inverse variance weighted method.

Table 4.5: Pack years: summary of association analysis (p values).

| Locus | SNP | Chromosome | Position (bp) | Test1 | Test2 | Test3 | Test4 |
|---|---|---|---|---|---|---|---|
| SUMF1^1 | rs1403124 | 3 | 4188033 | 1.02E-03 | 2.22E-03 | 3.16E-05 | 7.16E-06 |
| | rs1444056 | 3 | 4214953 | 1.83E-03 | 1.89E-03 | 4.69E-05 | 1.07E-05 |
| Intron^2 | rs720485 | 4 | 145820193 | 3.14E-05 | 3.06E-03 | 1.65E-06 | 3.97E-01 |
| | rs1828591 | 4 | 145838385 | 8.59E-05 | 4.21E-03 | 5.72E-06 | 4.53E-01 |
| | rs1512288 | 4 | 145848886 | 1.61E-04 | 5.22E-03 | 1.26E-05 | 4.89E-01 |
| CHRNA3, CHRNA5^3 | rs8034191 | 15 | 76593078 | 1.94E-05 | 1.95E-01 | 5.10E-05 | 8.23E-05 |
| | rs1051730 | 15 | 76681394 | 1.14E-05 | 2.37E-01 | 3.74E-05 | 8.15E-05 |
| CYP2F1^4 | rs2241714 | 19 | 46561232 | 5.76E-04 | 1.19E-02 | 8.87E-05 | 2.52E-05 |
| | rs10853751 | 19 | 46595060 | 1.44E-05 | 2.83E-01 | 5.44E-05 | 1.29E-04 |
| | rs1473248 | 19 | 46615154 | 1.10E-05 | 4.46E-01 | 6.50E-05 | 2.65E-04 |
| | rs284662 | 19 | 46624115 | 2.56E-05 | 6.16E-01 | 1.90E-04 | 8.63E-04 |
| | rs2231940 | 19 | 46636077 | 1.62E-05 | 5.69E-01 | 1.16E-04 | 5.56E-04 |

1 SUMF1: sulfatase modifying factor 1 isoform 1
2 Intron: human EST that has been spliced, often found near a regulatory element
3 CHRNA3, CHRNA5: cholinergic receptor, nicotinic, alpha 3, alpha 5
4 CYP2F1: cytochrome P450, family 2, subfamily F, polypeptide 1

Figure 4.17: Pack years: -log10(p) comparison between tests. X and Y axes are -log10(p). Red line is the diagonal line; blue line is the cutoff of significance, -log10(p) > 4.5. Left, case-control vs. Fisher's method; right, case-control vs. inverse variance weighted method.

4.2.6 Imputed data analysis

The 330K lung cancer genome data cover only a small fraction of the SNPs in the human genome. The majority of SNPs in the genome can be imputed indirectly using one or more of the genotyped SNPs as proxies, and the imputed data have a much higher coverage of the genome (Table 4.6) [56, 57]. Dr. Chris Amos from MD Anderson Cancer Center provided the imputed lung cancer data.

Table 4.6: Number of SNPs: genotyped vs imputed

| Chrom. | 330K | Imputed |
|---|---|---|
| 1 | 23275 | 193333 |
| 3 | 21579 | 176162 |
| 4 | 19113 | 164673 |
| 5 | 19272 | 169846 |
| 15 | 8900 | 72469 |
| 19 | 5927 | 36569 |

4.2.6.1 Cigarettes per day

Imputed data on chromosomes 3, 15 and 19 were analyzed with the quantitative trait CPD. More significant SNPs were found in the SUMF1, CHRNA3 and CYP2F1 genes, which supports the conclusion that these three genes are associated with smoking behavior and implicated in the development of lung cancer. CHRNA3 on chromosome 15 has a much stronger signal than the other two genes. The p values at these SNPs are listed in Tables 4.7, 4.8, and 4.9.

Figure 4.18: Imputed data + CPD: Case-control study.

Figure 4.19: Imputed data: Fisher's method.

Figure 4.20: Imputed data: Modified inverse variance weighted method.