Introduction to Genome-wide Complex Trait Analysis (GCTA)
Presenter: Yue Ming Chen
Location: Stat Gen Workshop
Date: 6/7/2013

Outline
  Brief review of quantitative genetics
  Overview of GCTA
    Ideas
    Main functions
  Practical: Estimating total heritability

Quantitative Genetics
  Quantitative traits
    Phenotypes with continuous variation
    Polygenic effects: the product of two or more genes and their environment
    Do not follow patterns of Mendelian inheritance
  Quantitative trait locus (QTL)
    Underlies quantitative traits
    Many QTLs are associated with a single trait
  Purpose of quantitative genetics
    To study how quantitative traits are determined by genetic factors and their interaction with environmental factors

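Phenotype-Genotype Relationship
  Define the model
    $P = G + E$
  Variance due to environment
    $\sigma_E^2 = E[(P - E[P \mid G])^2]$
  The phenotypic variance
    $\sigma_P^2 = E[(P - E[P])^2]$
    $\quad = E[(P - E[P \mid G] + E[P \mid G] - E[P])^2]$
    $\quad = E[(P - E[P \mid G])^2] + E[(E[P \mid G] - E[P])^2]$
  The cross term drops out because $P - E[P \mid G]$ has mean zero given $G$.
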
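The decomposition of the phenotypic variance into a within-genotype part and a between-genotype part can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `decompose_variance`, the simulated effect size 1.5 and the sample size are illustrative assumptions, not part of the original slides.

```python
import random
from statistics import fmean, pvariance

def decompose_variance(genotypes, phenotypes):
    """Return (total, between, within) population variances of P grouped by G."""
    groups = {}
    for g, p in zip(genotypes, phenotypes):
        groups.setdefault(g, []).append(p)
    n = len(phenotypes)
    grand_mean = fmean(phenotypes)
    between = 0.0  # accumulates E[(E[P|G] - E[P])^2]
    within = 0.0   # accumulates E[(P - E[P|G])^2]
    for values in groups.values():
        class_mean = fmean(values)  # sample version of E[P | G]
        between += len(values) * (class_mean - grand_mean) ** 2
        within += sum((p - class_mean) ** 2 for p in values)
    return pvariance(phenotypes), between / n, within / n

# Hypothetical toy data: one locus coded 0/1/2 plus unit-variance noise.
rng = random.Random(1)
geno = [rng.choice([0, 1, 2]) for _ in range(2000)]
pheno = [1.5 * g + rng.gauss(0.0, 1.0) for g in geno]

total, between, within = decompose_variance(geno, pheno)
print(total, between + within)  # the two numbers agree up to rounding error
```

In the sample, the identity holds exactly (up to floating-point error) when the class means are used as E[P | G], which mirrors the vanishing cross term in the derivation.
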
Simple Linear Regression for a Quantitative Trait
  $y_j = x_{ij} a_i + e_j$
  $y_j$: phenotypic value of individual j
  $x_{ij}$: genotype of individual j at SNP i
  $a_i$: allele substitution effect of SNP i
  $e_j$: residual effect, follows $N(0, \sigma_e^2)$
  Genotype coding:
    $x_{ij} = 0$ if bb
    $x_{ij} = 1$ if Bb
    $x_{ij} = 2$ if BB

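As an illustration of the single-SNP regression, the allele substitution effect can be estimated by ordinary least squares on the 0/1/2 genotype code. This is a sketch under stated assumptions: the function name `allele_substitution_effect` is hypothetical, an intercept is fitted even though the slide's model does not show one, and the data are simulated with a true effect of 0.8.

```python
import random
from statistics import fmean

def allele_substitution_effect(x, y):
    """Least-squares slope and intercept of phenotype y on genotype code x."""
    mean_x, mean_y = fmean(x), fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx  # estimate of the allele substitution effect a_i
    return slope, mean_y - slope * mean_x

# Hypothetical simulated data: genotypes coded 0 (bb), 1 (Bb), 2 (BB).
rng = random.Random(7)
true_effect = 0.8
x = [rng.choice([0, 1, 2]) for _ in range(5000)]
y = [true_effect * xi + rng.gauss(0.0, 1.0) for xi in x]

a_hat, intercept = allele_substitution_effect(x, y)
print(round(a_hat, 2))
```

With a few thousand individuals the estimate should land close to the simulated effect; the exact value depends on the random seed.
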
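Heritability
  An alternative expression
    $P_i = G_i + E_i + \varepsilon_i$
  Assume that G, E and ε are independent
  Variance of P
    $\sigma_P^2 = \sigma_G^2 + \sigma_E^2 + \sigma_\varepsilon^2$
  Dividing both sides by $\sigma_P^2$
    $1 = \sigma_G^2 / \sigma_P^2 + \sigma_E^2 / \sigma_P^2 + \sigma_\varepsilon^2 / \sigma_P^2$
  Heritability
    $h^2 = \sigma_G^2 / \sigma_P^2$
  Heritability is a population concept in that it deals with variation
  Heritability does not imply causation
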
Missing heritability
  Study-confirmed SNPs explain only a small fraction of the heritability. Why?
  A study [1] suggests hidden rather than missing heritability
    Many SNPs with small effects (infinitesimal model)
  GCTA implements the method of estimating the proportion of phenotypic variance explained by genome- or chromosome-wide SNPs for complex traits [2]
  1. Yang et al., 2010, Nature Genetics
  2. Yang et al., 2011, AJHG

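Statistical Framework of GCTA
  Fit the effects of all the SNPs as random effects by a mixed linear model
    $y = X\beta + Wu + \varepsilon$, with $\mathrm{var}(y) = V = WW'\sigma_u^2 + I\sigma_\varepsilon^2$ and $u \sim N(0, I\sigma_u^2)$
  Define the variance explained by all SNPs
    $\sigma_g^2 = N\sigma_u^2$
  Equivalent expression
    $y = X\beta + g + \varepsilon$, with $g \sim N(0, A\sigma_g^2)$ and $V = A\sigma_g^2 + I\sigma_\varepsilon^2$
    $A = WW'/N$: the genetic relationship matrix (GRM) between individuals
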
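The GRM definition A = WW'/N can be sketched in a few lines. The slide does not spell out how W is built; the sketch below assumes the standardization used in the cited Yang et al. papers, w_ij = (x_ij - 2p_i) / sqrt(2p_i(1 - p_i)), with p_i estimated from the sample. The function name `grm` and the toy genotype matrix are hypothetical, and the same formula is applied on the diagonal, which may differ from what GCTA does internally.

```python
from math import sqrt

def grm(genotypes):
    """Genetic relationship matrix A = W W' / N.

    genotypes[j][i] is the allele count (0, 1 or 2) of individual j at SNP i.
    """
    n_ind = len(genotypes)
    n_snp = len(genotypes[0])
    # Sample allele frequency p_i of each SNP.
    freqs = [sum(row[i] for row in genotypes) / (2 * n_ind) for i in range(n_snp)]
    # Monomorphic SNPs cannot be standardized, so they are skipped.
    keep = [i for i, p in enumerate(freqs) if 0.0 < p < 1.0]
    w = [
        [(row[i] - 2 * freqs[i]) / sqrt(2 * freqs[i] * (1 - freqs[i])) for i in keep]
        for row in genotypes
    ]
    n = len(keep)  # N: number of SNPs actually used
    return [
        [sum(a * b for a, b in zip(w[j], w[k])) / n for k in range(n_ind)]
        for j in range(n_ind)
    ]

# Hypothetical toy data: 4 individuals, 3 SNPs.
toy = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [1, 2, 1]]
A = grm(toy)
for row in A:
    print([round(v, 3) for v in row])
```

Because each column of W is centered on the sample allele frequency, every row of A sums to zero in this sketch, which is a quick sanity check on the implementation.
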
Main Functionalities
  Estimate the genetic relationship from genome-wide SNPs
  Estimate the inbreeding coefficient from genome-wide SNPs
  Estimate the variance explained by SNPs on a single chromosome or the whole genome by REML
  Estimate the LD structure encompassing a list of target SNPs
  Simulate GWAS data based upon the observed genotype data
  Predict the genome-wide additive genetic effects for individual subjects and for individual SNPs

QC Cautions
  Remove close relatives
    To minimize any confounding of shared environment with the GRM
  Control for ethnic principal components (PCs)
    To minimize confounding of ethnicity with the GRM
  Adapted from 2013 IBG slides of Keller and de Candia

Practical: Estimating total heritability
Workflow
  Data QC, using PLINK
    plink --noweb --bfile CDWTCCC --geno 0.0 --maf 0.05 --hwe 1e-3 --chr 10 --mind 0.0 --thin 0.1 --make-bed --out ../chr10_thin10/cdwtccc_chr10_thin10
      --geno 0.0    keep only SNPs with a 100% genotyping rate
      --maf 0.05    minor allele frequency of at least 5%
      --hwe 1e-3    HWE test p-value (in controls) of p > 0.001
      --chr 10      SNPs on chromosome 10 only
      --mind 0.0    exclude individuals with a genotyping rate of less than 100%
      --thin 0.1    keep a random 10% of SNPs
    PLINK output:
      After frequency and genotyping pruning, there are 151 SNPs
      After filtering, 1748 cases, 2938 controls and 0 missing
      After filtering, 2126 males, 2560 females, and 0 of unspecified sex
    plink --noweb --bfile CDWTCCC_chr10_thin10 --write-snplist    # write the SNP list file
  Simulate a quantitative trait with a heritability of 0.5, using GCTA
    gcta64 --bfile CDWTCCC_chr10_thin10 --simu-qt --simu-causal-loci plink.snplist --simu-hsq 0.5 --out test
    GCTA output:
      Simulation parameters:
      Number of simulation replicate(s) = 1 (Default = 1)
      Heritability = 0.5 (Default = 0.1)
      Simulated QTL effect(s) have been saved in [test.par].
      Simulating GWAS based on the real genotyped data with 1 replicate(s)...
      Simulated phenotypes of 4686 individuals have been saved in [test.phen].
    If the effect sizes are not specified in the file plink.snplist, they will be generated from a standard normal distribution.

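The idea behind simulating a trait with a target heritability can be sketched as follows: compute each individual's genetic value from the causal SNPs, then add residual noise whose variance is var(g) * (1/h2 - 1), so that the genetic share of the phenotypic variance is about h2. This is a simplified illustration and not GCTA's implementation: the function name `simulate_qt`, the use of raw allele counts instead of standardized genotypes, and the simulated genotypes are all assumptions made for the example.

```python
import random
from math import sqrt
from statistics import pvariance

def simulate_qt(genotypes, effects, h2, rng):
    """Return (genetic_values, phenotypes) for a target heritability h2."""
    # Genetic value: sum of allele counts times causal effects.
    g = [sum(x * u for x, u in zip(row, effects)) for row in genotypes]
    # Residual SD chosen so that var(g) / (var(g) + var(e)) equals h2.
    resid_sd = sqrt(pvariance(g) * (1.0 / h2 - 1.0))
    y = [gj + rng.gauss(0.0, resid_sd) for gj in g]
    return g, y

# Hypothetical genotypes: 2000 individuals, 20 causal SNPs.
rng = random.Random(42)
n_ind, n_snp = 2000, 20
freqs = [rng.uniform(0.05, 0.5) for _ in range(n_snp)]
genotypes = [
    [(rng.random() < p) + (rng.random() < p) for p in freqs]  # two allele draws
    for _ in range(n_ind)
]
# Effect sizes drawn from a standard normal distribution.
effects = [rng.gauss(0.0, 1.0) for _ in range(n_snp)]

g, y = simulate_qt(genotypes, effects, 0.5, rng)
realized_h2 = pvariance(g) / pvariance(y)
print(round(realized_h2, 2))
```

The realized ratio fluctuates around the target because the sampled residuals are not exactly uncorrelated with the genetic values in a finite sample.
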
Practical: Estimating total heritability
Workflow (Cont'd)
  Estimate the GRM from all the SNPs
    gcta64 --bfile CDWTCCC_chr10_thin10 --make-grm --out test
  Estimate the phenotypic variance explained by the SNPs using the REML method
    gcta64 --reml --grm test --pheno test.phen --out test

The summary result of REML analysis
  Performing REML analysis... (NOTE: may take hours depends on sample size).
  4686 observations, 1 fixed effect(s), and 2 variance component(s)(including residual variance).
  Calculating prior values by EM-REML...
  Prior values updated from EM-REML: 779.63 768.955
  Running AI-REML algorithm...
  Iter.  logL       V(G)       V(e)
  1      -18868.80  776.0151   767.48973
  2      -18868.76  769.64350  764.3893
  3      -18868.73  769.71337  764.33605
  4      -18868.73  769.71303  764.33616
  Log-likelihood ratio converged.
  Calculating the loglikelihood for the reduced model...
  (variance component 1 is dropped from the model)
  Calculating prior values by EM-REML...
  Prior values updated from EM-REML: 1565.8500
  Running AI-REML algorithm...
  Iter.  logL       V(e)
  1      -19577.74  1565.8500
  Log-likelihood ratio converged.
  Summary result of REML analysis:
  Source   Variance     SE
  V(G)     769.713025   41.7114
  V(e)     764.336155   19.516501
  Vp       1534.049180  41.70414
  V(G)/Vp  0.501753     0.01678
  Covariance/Variance/Correlation Matrix:
  1740.65  -191.155  380.894
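The reported ratio V(G)/Vp can be recomputed from the variance components in the summary. The sketch below takes V(e) and Vp as printed in the summary table and uses V(G) = Vp - V(e); the function name `snp_heritability` is a hypothetical helper, not a GCTA function.

```python
def snp_heritability(v_e, v_p):
    """Proportion of phenotypic variance explained by the SNPs.

    Uses V(G) = Vp - V(e), so the result is V(G) / Vp.
    """
    v_g = v_p - v_e
    return v_g / v_p

# Values taken from the REML summary table.
v_e = 764.336155
v_p = 1534.049180
h2_snp = snp_heritability(v_e, v_p)
print(round(h2_snp, 4))
```

The result agrees with the reported V(G)/Vp of 0.501753 and is close to the heritability of 0.5 used to simulate the trait.
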