Power Calculation for Testing If Disease is Associated with Marker in a Case-Control Study Using the GeneticsDesign Package Weiliang Qiu email: weiliang.qiu@gmail.com Ross Lazarus email: ross.lazarus@channing.harvard.edu October 30, 2017 1 Introduction In genetics association studies, investigators are interested in testing the association between disease and DSL (disease susceptibility locus) in a case-control study. However, DSL is usually unknown. Hence what one can do is to test the association between disease and DSL indirectly by testing the association between disease and a marker. The power of the association between disease and DSL depends on the power of the association between disease and the marker, and on the LD (linkage disequilibrium) between disease and the marker. Assume that both DSL and the marker are biallelic and assume additive model. Given the MAFs (minor allele frequencies) for both DSL and the marker, relative risks, r 2 or D between DSL and marker, the numbers of cases and controls, and the significant level for hypothesis testing, the tools in the GeneticsDesign package can calculate the power for the association test that disease is associated with DSL in a case-control study. This information can help investigator to determine appropriate sample sizes in experiment design stage. 2 Examples To call the functions in the R package GeneticsDesign, we first need to load it into R: > library(geneticsdesign) The following is a sample code to get a table of power for different combinations of high risk allele frequency P r(a) and genotype relative risk RR(Aa aa). 1
> set1<-seq(from0.1, to0.5, by0.1) > set2<-c(1.25, 1.5, 1.75, 2.0) > len1<-length(set1) > len2<-length(set2) > mat<-matrix(0, nrowlen1, ncollen2) > rownames(mat)<-paste("maf", set1, sep"") > colnames(mat)<-paste("rraa", set2, sep"") > for(i in 1:len1) + { a<-set1[i] + for(j in 1:len2) + { b<-set2[j] + res<-gpc.default(paa,pd0.1,rraa(1+b)/2, RRAAb, Dprime1,pBa, quiett) + mat[i,j]<-res$power + } + } > print(round(mat,3)) RRAA1.25 RRAA1.5 RRAA1.75 RRAA2 MAF0.1 0.145 0.400 0.690 0.884 MAF0.2 0.214 0.594 0.877 0.977 MAF0.3 0.259 0.682 0.926 0.989 MAF0.4 0.280 0.711 0.936 0.990 MAF0.5 0.282 0.702 0.925 0.986 A Technical Details Suppose that A is the minor allele for the disease locus; a is the common allele for the disease locus; B is the minor allele for the marker locus; and b is the common allele for the marker locus. We use D to denote the disease status diseased and use D to denote the disease status not diseased. Given the minor allele frequencies P r(a) and P r(b) and Linkage disequilibrium (LD) measure D, we can get haplotype frequencies P r(ab), P r(ab), P r(ab), and P r(ab): where P r(ab) P r(a)p r(b) + D P r(ab) P r(a)p r(b) D P r(ab) P r(a)p r(b) D P r(ab) P r(a)p r(b) + D, D D d max 2
and That is, d max { min [P r(a)p r(b), P r(a)p r(b)], if D > 0, max [ P r(a)p r(b), P r(a)p r(b)], if D < 0. D P r(ab) P r(a)p r(b). Note that D > 0 means P r(ab) > P r(a)p r(b), i.e., the probability of occuring the haplotype AB is higher than the probability that the haplotype AB occurs merely by chance. D < 0 means P r(ab) < P r(a)p r(b), i.e., the probability of occuring the haplotype AB is smaller than the probability that the haplotype AB occurs merely by chance. In other words, given the same sample size, we can observe haplotype AB more often when D > 0 than when D < 0. Hence the power of association test that marker is associated with disease is larger when D > 0 than when D < 0. Hence, we assume that D > 0. D can also be rewritten as D P r(b A)P r(a) P r(a)p r(b) [P r(b A) P r(b)] P r(a). Hence D > 0 is equivalent to P r(b A) > P r(a) which indicates positive association between minor allele A in disease locus and minor allele B in marker locus. The relative risks are RR(AA aa) P r(d AA) P r(d aa) RR(Aa aa) P r(d Aa) P r(d aa) The disease prevalence can be rewritten as P r(d AA)P r(aa) + P r(d Aa)P r(aa) + P r(d aa)p r(aa). Dividing both sides by P r(d aa), we get Hence P r(d aa) P r(d AA) P r(d Aa) P r(d aa) P r(aa) + P r(aa) + P r(d aa) P r(d aa) P r(d aa) P r(aa) RR(AA aa) [P r(a)] 2 + RR(Aa aa)2p r(a)p r(a) + [P r(a)] 2. P r(d aa) RR(AA aa) [P r(a)] 2 + RR(Aa aa)2p r(a)p r(a) + [P r(a)] 2 P r(d Aa) RR(Aa aa)p r(d aa) P r(d AA) RR(AA aa)p r(d aa) 3
Once we obtain the penetrances of disease locus (P r(d aa), P r(d Aa), and P r(d AA)), we can get the sampling probabilities: P r(d, BB) P r(bb D) P r(d, BB, AA) + P r(d, BB, Aa) + P r(d, BB, aa) P r(d BB, AA)P r(bb, AA) + P r(d BB, Aa)P r(bb, Aa) + P r(d BB, aa)p r(bb, aa) P r(bb D) P r(d, Bb) P r(d, Bb, AA) + P r(d, Bb, Aa) + P r(d, Bb, aa) P r(d Bb, AA)P r(bb, AA) + P r(d Bb, Aa)P r(bb, Aa) + P r(d Bb, aa)p r(bb, aa) P r(d, bb) P r(bb D) P r(d, bb, AA) + P r(d, bb, Aa) + P r(d, bb, aa) P r(d bb, AA)P r(bb, AA) + P r(d bb, Aa)P r(bb, Aa) + P r(d bb, aa)p r(bb, aa) We assume that P r(d BB, AA) P r(d AA), P r(d BB, Aa) P r(d Aa), P r(d BB, aa) P r(d aa). We also have P r(bb, AA) [P r(ab)] 2 P r(bb, Aa) 2P r(ab)p r(ab) P r(bb, aa) [P r(ab)] 2 P r(bb, AA) 2P r(ab)p r(ab) P r(bb, Aa) 2 [P r(ab) P r(ab) + P r(ab)p r(ab)] P r(bb, aa) 2P r(ab)p r(ab) P r(bb, AA) [P r(ab)] 2 P r(bb, Aa) 2P r(ab)p r(ab) P r(bb, aa) [P r(ab)] 2 4
Hence P r(d BB, AA)P r(bb, AA) + P r(d BB, Aa)P r(bb, Aa) + P r(d BB, aa)p r(bb, aa) P r(bb D) P r(d AA) [P r(ab)]2 + P r(d Aa) [2P r(ab)p r(ab)] + P r(d aa) [P r(ab)] 2 P r(d Bb, AA)P r(bb, AA) + P r(d Bb, Aa)P r(bb, Aa) + P r(d Bb, aa)p r(bb, aa) P r(bb D) P r(d AA) [2P r(ab)p r(ab)] + P r(d Aa) {2 [P r(ab)p r(ab) + P r(ab)p r(ab)]} P r(d aa) [2P r(ab)p r(ab)] + P r(d bb, AA)P r(bb, AA) + P r(d bb, Aa)P r(bb, Aa) + P r(d bb, aa)p r(bb, aa) P r(bb D) P r(d AA) [P r(ab)]2 + P r(d Aa) [2P r(ab)p r(ab)] + P r(d aa) [P r(ab)] 2 To calculate the sampling probabilities P r(bb D), P r(bb D), and P r(bb D), we can apply Bayesian rules again: P r(bb D) P r( D BB)P r(bb) P r( D) P r(bb D) P r( D Bb)P r(bb) P r( D) P r(bb D) P r( D bb)p r(bb) P r( D) [1 P r(d BB)][P r(b)]2 1 [1 P r(d Bb)]2P r(b)p r(b) 1 [1 P r(d bb)][p r(b)]2 1 and P r(d BB) P r(d Bb) P r(d bb) P r(bb D) P r(bb D) P r(bb) [P r(b)] 2 P r(bb D) P r(bb D) P r(bb) [2P r(b)p r(b)] P r(bb D) P r(bb D) P r(bb) [P r(b)] 2. 5
The expected allele frequencies are The counts for cases P r(b D) 2n casep r(bb D) + n case P r(bb D) 2n case P r(bb D) + P r(bb D)/2 P r(b D) 2n casep r(bb D) + n case P r(bb D) 2n case P r(bb D) + P r(bb D)/2 P r(b D) P r(bb D) + P r(bb D)/2 P r(b D) P r(bb D) + P r(bb D)/2. n 2c n cases P r(bb D), n 1c n cases P r(bb D), n 0c n cases P r(bb D). The counts for controls n 2n n controls P r(bb D), n 1n n controls P r(bb D), n 0n n controls P r(bb D). Table 1: Counts for cases and controls marker genotype cases controls BB n 2c n 2n Bb n 1c n 1n bb n 0c n 0n total n cases n controls Odds ratios are Define OR(BB bb) n 2cn 0n n 2n n 0c P r(bb D)P r(bb D) P r(bb D)P r(bb D) OR(Bb bb) n 1cn 0n n 1n n 0c P r(bb D)P r(bb D) P r(bb D)P r(bb D) U 3 ( S x i N r i R ) N s i, i1 6
where We have R n 0c + n 1c + n 2c, S n 0n + n 1n + n 2n, N R + S.x 1 2, x 2 1, x 3 0, r 1 n 0c, r 2 n 1c, r 3 n 2c, s 1 n 0n, s 2 n 1n, s 3 n 2n. V V ar(u) RS N N 3 ( 3 3 ) 2 x 2 i n i x i n i. i1 i1 The Linear Trend Test Statistic 1 Z 2 T ( U V ) 2 asymptotically follows a χ 2 1 distribution under the null hypothesis. Under alternative hypothesis, Z 2 T asymptotically follows a non-central chi-square distribution χ2 1,λ T with non-centrality parameter where λ T RS [ 3 i1 x i(p 1i p 0i ) ] 2 3 i1 x2 i (Rp 0i + Sp 1i ) [ 3 i1 x i(rp 0i + Sp 1i ] 2 /N, x 1 2, x 2 1, x 3 0, n 1 n 2c + n 2n, n 2 n 1c + n 1n, n 3 n 0c + n 0n, p 01 P r(bb D), p 02 P r(bb D), p 03 P r(bb D), p 11 P r(bb D), p 12 P r(bb D), p 13 P r(bb D). Given significant level α, the power 1 β satisfies: References P r(z 2 T > c H 0 ) α P r(z 2 T > c H a ) 1 β. [1] Armitage, P. Tests for linear trends in proportions and frequencies. Biometrics, 11, 375-386, 1955. [2] Cochran, W. G. Some methods for strengthening the common chi-squared tests. Biometrics, 10, 417-451, 1954. 1 http://linkage.rockefeller.edu/pawe3d/help/linear-trend-test-ncp.html 7
[3] Gordon D. and Finch S. J. and Nothnagel M. and Ott J. Power and sample size calculations for case-control genetic association tests when errors are present: application to single nucleotide polymorphisms. Hum. Hered., 54:22-33, 2002. [4] Gordon D. and Haynes C. and Blumenfeld J. and Finch S. J. PAWE-3D: visualizing Power for Association With Error in case/control genetic studies of complex traits. Bioinformatics, 21:3935-3937, 2005. [5] Purcell S. and Cherny S. S. and Sham P. C. Genetic Power Calculator: design of linkage and association genetic mapping studies of complex trait Bioinformatics, 19(1):149-150, 2003. [6] Sham P. Statistics in Human Genetics. Arnold Applications of Statistics, 1998. 8