False Discovery Rates and Copy Number Variation. Bradley Efron and Nancy Zhang Stanford University


1 False Discovery Rates and Copy Number Variation Bradley Efron and Nancy Zhang Stanford University

2 Three Statistical Centuries
19th (Quetelet): huge data sets, simple questions
20th (Fisher, Neyman, Hotelling, ...): small data sets, simple questions
21st (scientific mass production): huge data sets, complicated questions
FDRs and CNV 1

3 Example: Copy Number Variation (CNV)
Gains and losses of chromosome segments (disease association)
Instead of 2 copies, a subject might have 0, 1, 3, 4, ...
Data: $x_{ij}$ = noisy measurement of copy number for subject $j$ at marker position $i$, with $i = 1, 2, \dots, N$ ($N = 5000$) and $j = 1, 2, \dots, n$ ($n = 150$) (< 1% of data!)
$x_{ij}$ is approximately normal with mean 0 if copy number = 2

4 What We Expect To See
Hot positions: positions $i$ where several of the subjects $j$ show unusually high (or low) values $x_{ij}$
For some subjects $j$: intervals of high (or low) values $x_{ij}$
Information on CNV locations in both directions

5 [Figure: lowest .001 of the 750,000 entries x[i,j]. Subject 45 shows an interval of low values around position 3800 (marked "Interval"). Is position 1755 CNV prone? Axes: position number i, subject number j.]

6 Z-Values
$X$: $x_{ij}$ for $i = 1, 2, \dots, N = 5000$ positions, $j = 1, 2, \dots, n = 150$ subjects
$C_i$ = all subjects at the $i$th position ($n = 150$)
Moving averages: replace $x_{ij}$ with $\tilde{x}_{ij} = \sum_{l=i-5}^{i+5} x_{lj} / 11$, giving $\tilde{X}$ (subject $j$'s measurements averaged over nearby positions)
Standardize rows of $\tilde{X}$: subtract the row median, divide by a robust row scale (robust standardization)
Gives $Z$ matrix $z_{ij}$ $\rightarrow$ iterative $\mathrm{fdr}_i$ estimates
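The two preprocessing steps on this slide can be sketched in NumPy. This is only an illustration under stated assumptions: the array is stored with one row per subject and one column per position, windows are truncated at the ends of the chromosome (the slides do not say how edges are handled), and the robust scale is taken to be 1.4826 times the median absolute deviation, which is one common choice and not necessarily the one used in the talk. The function names are hypothetical.

```python
import numpy as np

def moving_average_rows(x, half_width=5):
    """Average each subject's measurements over positions i-5 .. i+5.

    x has shape (n_subjects, n_positions). Edge handling is an assumption:
    windows are truncated at the ends, so fewer than 11 values are averaged there.
    """
    n_pos = x.shape[1]
    # Prefix sums along positions turn every window sum into one subtraction.
    csum = np.concatenate([np.zeros((x.shape[0], 1)), np.cumsum(x, axis=1)], axis=1)
    idx = np.arange(n_pos)
    lo = np.maximum(idx - half_width, 0)
    hi = np.minimum(idx + half_width + 1, n_pos)
    return (csum[:, hi] - csum[:, lo]) / (hi - lo)

def robust_standardize_rows(x):
    """Subtract the row median and divide by a MAD-based scale (assumed choice)."""
    med = np.median(x, axis=1, keepdims=True)
    mad = np.median(np.abs(x - med), axis=1, keepdims=True)
    return (x - med) / (1.4826 * mad)

# Simulated stand-in for the CNV data: 150 subjects, 5000 positions, all null.
rng = np.random.default_rng(0)
x = rng.normal(size=(150, 5000))
z = robust_standardize_rows(moving_average_rows(x))
```

Each row of `z` then has median zero and roughly unit robust spread, which is what lets all 750,000 values be pooled into one set of z-values.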

7 Simultaneous Hypothesis Testing
$M$ null hypotheses $H_{01}, H_{02}, \dots, H_{0M}$ ($M = 750{,}000$ for CNV)
Case $m$ has test statistic $z_m$, null density $f_0(z)$
The problem: given $z = (z_1, z_2, \dots, z_M)$, simultaneously test all $M$ null hypotheses and don't make many mistakes!

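8 The Bayesian Two-Groups Model
Null: prior probability $\pi_0$, density $f_0(z)$
Non-null: prior probability $\pi_1 = 1 - \pi_0$, density $f_1(z)$
Mixture: $f(z) = \pi_0 f_0(z) + \pi_1 f_1(z)$
Local false discovery rate: $\mathrm{fdr}(z) = \Pr\{\text{null} \mid z\} = \pi_0 f_0(z) / f(z)$
Empirical Bayes: $z \rightarrow \hat{\pi}_0, \hat{f}_0, \hat{f} \rightarrow \widehat{\mathrm{fdr}}(z) = \hat{\pi}_0 \hat{f}_0(z) / \hat{f}(z)$
Reject $H_{0m}$ if $\widehat{\mathrm{fdr}}(z_m)$ is small (see Efron, 2008)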
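A small numerical illustration of the local false discovery rate formula may help. The sketch below assumes the two-groups model is fully known, with a made-up N(0, 1) null and N(3, 1) non-null density and $\pi_0 = 0.95$; in the talk these quantities are estimated from the z-values by empirical Bayes, which this sketch does not attempt. The helper names are hypothetical.

```python
import numpy as np

def normal_pdf(z, mean=0.0, sd=1.0):
    """Density of N(mean, sd^2) evaluated at z."""
    return np.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def local_fdr(z, pi0, f0, f1):
    """fdr(z) = pi0 * f0(z) / f(z), with mixture f = pi0 * f0 + (1 - pi0) * f1."""
    null_part = pi0 * f0(z)
    return null_part / (null_part + (1.0 - pi0) * f1(z))

# Illustrative (made-up) two-groups model, not the fitted CNV model.
pi0 = 0.95
f0 = lambda z: normal_pdf(z, 0.0, 1.0)
f1 = lambda z: normal_pdf(z, 3.0, 1.0)

z_grid = np.array([0.0, 2.0, 4.0])
fdr_grid = local_fdr(z_grid, pi0, f0, f1)
```

In this toy model `fdr_grid` falls from nearly 1 at z = 0 to a small value at z = 4, which is the behavior that justifies rejecting when the estimated fdr is small.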

9 [Figure: estimated local false discovery rate, all 750,000 z-values; pihat0 = .954, estimated null density N(.04, .93^2); fdrhat(z) = .1 at z = 3.30 and 3.57. Axes: z value, local fdr.]

10 [Figure, two panels: z-values at position i = 1755 (solid histogram) compared to all the others (line); second panel titled "Now for position i=" (position number not recoverable). Horizontal axis in both: << low cnv, high cnv >>.]

11 A More General Model
Classes: $C_1, C_2, \dots, C_i, \dots, C_N$ with $n_i$ cases in $C_i$
CNV: $C_i$ = $i$th column, $n_i = n = 150$ (the $n = 150$ subjects measured at position $i$)
$\mathrm{fdr}_i(z) = \pi_{i0} f_{i0}(z) / f_i(z) = \Pr\{\text{null} \mid z, C_i\}$

12 Combined and Separate Fdr's
Strategy: estimate $\mathrm{fdr}(z) = \pi_0 f_0(z) / f(z)$ from the combined data and then modify it for $C_i$
Assume $f_{i0}(z)$ and $f_{i1}(z)$ do not depend on $i$, with only $\pi_{i1} = \Pr\{\text{non-null} \mid C_i\}$ varying across classes:
$$\mathrm{fdr}_i(z) = \frac{\mathrm{fdr}(z)}{1 + \mathrm{tdr}(z)\, S_i}$$
where $\mathrm{tdr}(z) \equiv 1 - \mathrm{fdr}(z)$ is the true discovery rate and
$$S_i = \frac{\pi_{i1} / \pi_{i0}}{\pi_1 / \pi_0} - 1$$
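The class-wise formula can be checked numerically against the direct definition $\mathrm{fdr}_i(z) = \pi_{i0} f_0(z) / (\pi_{i0} f_0(z) + \pi_{i1} f_1(z))$. The sketch below uses made-up density values at a single z and made-up prior probabilities; the function name `class_fdr` is hypothetical.

```python
import numpy as np

def class_fdr(fdr, pi1, pi_i1):
    """fdr_i(z) = fdr(z) / (1 + tdr(z) * S_i), S_i = (pi_i1/pi_i0) / (pi1/pi0) - 1."""
    tdr = 1.0 - fdr
    s_i = (pi_i1 / (1.0 - pi_i1)) / (pi1 / (1.0 - pi1)) - 1.0
    return fdr / (1.0 + tdr * s_i)

# Made-up values of f0(z) and f1(z) at one z, shared by all classes.
f0_z, f1_z = 0.02, 0.15
pi1, pi_i1 = 0.05, 0.25   # overall and class-specific non-null probabilities

# Combined fdr, and class fdr computed directly from its definition.
fdr = (1 - pi1) * f0_z / ((1 - pi1) * f0_z + pi1 * f1_z)
direct = (1 - pi_i1) * f0_z / ((1 - pi_i1) * f0_z + pi_i1 * f1_z)

# Same quantity obtained by modifying the combined fdr.
via_formula = class_fdr(fdr, pi1, pi_i1)
```

Here `direct` and `via_formula` agree up to rounding, and a class with more non-nulls than average ($S_i > 0$) gets a smaller fdr than the combined analysis would give it.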

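13 Iterative Estimation of $\mathrm{fdr}_i(z)$ (Model 1)
First estimate $\widehat{\mathrm{fdr}}(z) = \hat{\pi}_0 \hat{f}_0(z) / \hat{f}(z)$ from the combined data $(z_1, z_2, \dots, z_M)$
If there are $k_i$ non-nulls in $C_i$: $\hat{\pi}_{i1} = k_i / n$ gives $\hat{S}_i$ and
$$\widehat{\mathrm{fdr}}_i(z) = \frac{\widehat{\mathrm{fdr}}(z)}{1 + \widehat{\mathrm{tdr}}(z)\, \hat{S}_i} \qquad \left(\widehat{\mathrm{tdr}}_i = 1 - \widehat{\mathrm{fdr}}_i\right)$$
But $\hat{k}_i = \sum_{C_i} \widehat{\mathrm{tdr}}_i(z_m)$ estimates $k_i$
Iterate! (5 cycles plenty in what follows)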
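The iteration on this slide can be sketched as follows. Several details are assumptions rather than anything stated in the talk: the input is a matrix of combined tdr values already computed from the pooled data, the starting value of $\hat{k}_i$ is the sum of the combined tdr values in class $i$, one "cycle" means one update of $\hat{S}_i$ followed by one update of $\hat{k}_i$, and $\hat{\pi}_{i1}$ is clipped away from 0 and 1 for numerical safety. The demo data are made up.

```python
import numpy as np

def iterate_khat(tdr, pi1, n_iter=5):
    """Iterate khat_i and fdr_i for classes given by columns (positions).

    tdr: array (n_subjects, n_positions) of combined tdr(z_ij) = 1 - fdr(z_ij).
    pi1: overall non-null probability from the combined analysis.
    """
    n = tdr.shape[0]
    fdr = 1.0 - tdr
    odds = pi1 / (1.0 - pi1)
    khat = tdr.sum(axis=0)          # starting value: sum of combined tdr per position
    fdr_i = fdr
    for _ in range(n_iter):
        # Clipping keeps the odds finite; this guard is an assumption.
        pi_i1 = np.clip(khat / n, 1e-6, 1.0 - 1e-6)
        s_i = (pi_i1 / (1.0 - pi_i1)) / odds - 1.0
        fdr_i = fdr / (1.0 + tdr * s_i)   # s_i broadcasts across subjects
        khat = (1.0 - fdr_i).sum(axis=0)  # khat_i = sum over C_i of tdr_i
    return khat, fdr_i

# Made-up input: mostly-null background plus one "hot" position.
rng = np.random.default_rng(1)
tdr = rng.uniform(0.0, 0.1, size=(150, 200))
tdr[:40, 17] = 0.9                  # 40 subjects look non-null at position 17
khat, fdr_i = iterate_khat(tdr, pi1=0.05)
```

In this toy input the hot position ends up with by far the largest `khat`, and its likely non-null entries receive a class-wise fdr below their combined fdr of 0.1.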

14 [Figure: points where fdrhat < .01; five iterations of Model 1, z[i,j] from moving averages (i-5, i+5). Axes: marker position, subject.]

15 [Figure, two panels. Upper: points where fdrhat.i < .01 (five iterations), z[i,j] from moving averages (i-5, i+5); axes: marker position i, subject j. Lower: khat[i] estimates for the 5000 positions, with khat[1755] marked (value not recoverable); axes: marker position i, khat[i].]

16 [Figure, two panels: points where fdrk < .01, closeup of positions 1700:1800; shows possible CNV region at 1750 (rest of caption not recoverable). Upper axes: marker position, subject. Lower axes: marker position, khat.]

17 Is Position 1755 Significant?
$\hat{k}_{1755} = 39.1$. Believe CNV action at 1755? [$\bar{k} = 8.13$]
Permutation test:
Randomly shift row $j$ of $X$ by $s_j$ units left (with wraparound): $x_j^* = (x_{s+1,j}, x_{s+2,j}, \dots, x_{5000,j}, x_{1j}, x_{2j}, \dots, x_{sj})$, with $s = s_j$
Do this for all 150 rows
Recalculate the $\hat{k}_i^*$ values
Compare $\hat{k}_{1755} = 39.1$ with $\{\hat{k}_i^*, i = 1, 2, \dots, 5000\}$
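The shifting scheme keeps each subject's values, and their order along the chromosome, intact while destroying the alignment between subjects. A minimal sketch follows, assuming one row per subject and one column per position. The per-position statistic used in the demo (a count of large entries) is a hypothetical stand-in for the iterative $\hat{k}_i$ recalculation, and the demo data are made up.

```python
import numpy as np

def circular_shift_rows(x, rng):
    """Shift each row (subject) left by its own random offset s_j, with wraparound."""
    n_rows, n_cols = x.shape
    shifts = rng.integers(0, n_cols, size=n_rows)
    # Column k of the shifted row takes the value originally at column k + s_j.
    cols = (np.arange(n_cols)[None, :] + shifts[:, None]) % n_cols
    return np.take_along_axis(x, cols, axis=1)

def permutation_values(x, statistic, n_perm, rng):
    """Per-position statistic recomputed on n_perm independently shifted copies."""
    return np.stack([statistic(circular_shift_rows(x, rng)) for _ in range(n_perm)])

# Made-up data: 30 subjects, 100 positions, one position shared by 15 subjects.
rng = np.random.default_rng(2)
x = rng.normal(size=(30, 100))
x[:15, 40] = 5.0

# Stand-in statistic (an assumption): number of large entries at each position.
count_large = lambda a: (a > 2.5).sum(axis=0)

observed = count_large(x)
perm = permutation_values(x, count_large, n_perm=20, rng=rng)
```

Comparing `observed[40]` with the values in `perm` mirrors the comparison on the slide: a position is convincing only if its statistic stands clear of what misaligned subjects produce by chance.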

18 [Figure: actual khat distribution compared to permutation distribution; maximum khat = 23.3; the value 39.1 is marked. Legend: actual, permutations. Axes: khat values, Frequency.]

19 Locally Most Powerful Tests
Let $r_i = \pi_{i1} / \pi_1 = \Pr\{\text{non-null} \mid C_i\} / \Pr\{\text{non-null}\}$.
$l_i = \sum_{j=1}^{n} \log\{1 + (r_i - 1)\, T(z_{ij})\}$, where $T(z) = (\mathrm{tdr}(z) - \pi_1) / \pi_0$
$\hat{k}_i$ is nearly the MLE in this model
Test $H_{0i}: r_i = 1$ vs $r_i > 1$.
The locally most powerful test rejects for large values of $\hat{k}_i^{(1)}$.
Use permutations to get p-values.

20 Bootstrapping $\hat{k}_i$ Estimates
Resample rows (i.e., subjects)
Recompute the iterative estimate $\hat{k}_i^*$ (5 iterations model)
$\widehat{\mathrm{sd}}_i$ = bootstrap standard deviation of $\hat{k}_i^*$, $B = 100$ resamples (did not recompute the original fdr curve each time)
$\hat{k}_i \;\dot\sim\; N(k_i, \widehat{\mathrm{sd}}_i^2)$ [$\widehat{\mathrm{sd}}_i$ between 6 and 7 for $\hat{k}_i > 20$]
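Resampling subjects can be sketched generically. The function below takes any per-position estimator and returns its bootstrap standard deviation at each position. In the talk the estimator is the five-iteration $\hat{k}_i$ with the combined fdr curve held fixed; here a simple column mean is used as a stand-in so the result can be compared with the familiar value of about $1/\sqrt{n}$. The function name and the demo data are hypothetical.

```python
import numpy as np

def bootstrap_sd(z, estimator, n_boot=100, rng=None):
    """Bootstrap standard deviation of a per-position estimator.

    z: array (n_subjects, n_positions); rows (subjects) are resampled with replacement.
    estimator: maps such an array to a vector with one value per position.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = z.shape[0]
    reps = np.stack([estimator(z[rng.integers(0, n, size=n)]) for _ in range(n_boot)])
    return reps.std(axis=0, ddof=1)

# Stand-in estimator (an assumption): the mean over subjects at each position.
column_mean = lambda a: a.mean(axis=0)

rng = np.random.default_rng(4)
z_demo = rng.normal(size=(150, 5))
sd_hat = bootstrap_sd(z_demo, column_mean, n_boot=100, rng=rng)
```

Resampling whole rows keeps each subject's values together across positions, so any dependence along the chromosome within a subject is preserved in every bootstrap sample.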

21 [Figure: bootstrap estimates of standard deviations for khat[i] values (5 iterations), plotted vs khat[i]; sdhat[1755] = 6.5. Axes: khat[i], bootstrap stdev.]

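22 Brown-Stein-Robbins Estimation
Suppose $\mu \sim g(\cdot)$ and $x \mid \mu \sim N(\mu, \sigma^2)$
$l(x) \equiv$ log marginal density of $x$
$\mu \mid x \sim \left( x + \sigma^2 l'(x),\; \sigma^2 \left(1 + \sigma^2 l''(x)\right) \right)$ (posterior mean, posterior variance)
Apply with $\mu = k_i$, $x = \hat{k}_i$, $\hat{l}(x)$ = log of the smoothed density of $\{\hat{k}_i\}$
For $\hat{k}_i = 39.1$, $\hat{\sigma} = 6.5$, this gave $k_{1755} \sim (41.3, \;)$ (second value not recoverable)
Conclusion: even taking account of selection effects, $k_{1755}$ is probably much larger than $\bar{k}$ (value not recoverable)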
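The posterior mean and variance formulas need only the log marginal density and its first two derivatives. The sketch below evaluates them by central differences, which is an implementation choice and not something the slides specify, and checks the result in a case where the answer is known: a normal prior $\mu \sim N(0, A)$, for which the marginal of $x$ is $N(0, A + \sigma^2)$. The function name and the values of $A$ and $\sigma$ are made up for the check.

```python
import numpy as np

def tweedie_posterior(x, log_marginal, sigma, h=1e-3):
    """Posterior mean and variance of mu given x, for x | mu ~ N(mu, sigma^2).

    log_marginal(x) is l(x); l'(x) and l''(x) are approximated by central
    differences with step h (an implementation choice).
    """
    l_minus, l_0, l_plus = log_marginal(x - h), log_marginal(x), log_marginal(x + h)
    d1 = (l_plus - l_minus) / (2.0 * h)
    d2 = (l_plus - 2.0 * l_0 + l_minus) / h**2
    mean = x + sigma**2 * d1
    var = sigma**2 * (1.0 + sigma**2 * d2)
    return mean, var

# Known case: mu ~ N(0, A) gives marginal x ~ N(0, A + sigma^2).
A, sigma = 4.0, 1.0
log_marginal = lambda x: -0.5 * x**2 / (A + sigma**2)   # additive constant dropped
post_mean, post_var = tweedie_posterior(3.0, log_marginal, sigma)
```

For this normal prior the exact posterior is $N(xA/(A + \sigma^2),\, A\sigma^2/(A + \sigma^2))$, that is, mean 2.4 and variance 0.8 at $x = 3$, and the formulas reproduce those values. In the talk the log marginal density is instead estimated by smoothing the observed $\hat{k}_i$ values.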

23 More General Model for $\mathrm{fdr}_i(z)$ (Method 2)
Multiclass Bayes model $\pi_{i0}, f_{i0}(z), f_{i1}(z)$ with all $f_{i0} = f_0$, but drop the assumption that the non-null distributions $f_{i1}(z)$ are the same.
Define $w_i(z) = \Pr\{C_i \mid z\}$
Empirical Bayes: regression of the $C_i$ indicator on $z_m$
$\mathrm{fdr}_i(z) \doteq \mathrm{fdr}(z)\, w_i(0) / w_i(z)$
Estimate $w_i(z)$ by logistic regression

24 [Figure, two panels. Upper: z-values for positions 1750:1759 (solid) compared to all the other positions (line); axes: z values (<< low cnv, high cnv >>), Frequency. Lower: logistic regression estimate of wi(t) = prob{1750:1759 | z}; axes: z value, wi(z)/wi(0).]

25 [Figure: three estimates of fdrhat for positions 1750:1759 (Method 1, Method 2, combined). Axes: z value (<< low cn, high cn >>), fdr estimate.]

26 References
Efron, B. (2008). Simultaneous inference: When should hypothesis testing problems be combined? Ann. Appl. Statist.
Tibshirani, R. and Wang, P. (2008). Spatial smoothing and hot spot detection for CGH data using the fused lasso. Biostatistics.
Walther, G. (2009). Optimal and fast detection of spatial clusters with scan statistics. Online, URL gwalther/.
Wang, P., Kim, Y., Pollack, J., Narasimhan, B. and Tibshirani, R. (2005). A method for calling gains and losses in array CGH data. Biostatistics.
Zhang, N., Siegmund, D., Ji, H. and Li, J. (2009). Detecting simultaneous change-points in multiple sequences. Biometrika. Accepted for publication, URL nzhang/.
