The Limits of Individual Identification from Sample Allele Frequencies: Theory and Statistical Analysis

Peter M. Visscher 1*, William G. Hill 2
1 Queensland Institute of Medical Research, Brisbane, Australia, 2 Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, United Kingdom

Abstract
It was shown recently using experimental data that it is possible under certain conditions to determine whether a person with known genotypes at a number of markers was part of a sample from which only allele frequencies are known. Using population genetic and statistical theory, we show that the power of such identification is, approximately, proportional to the number of independent SNPs divided by the size of the sample from which the allele frequencies are available. We quantify the limits of identification and propose likelihood and regression analysis methods for the analysis of data. We show that these methods have similar statistical properties and have more desirable properties, in terms of type-I error rate and statistical power, than test statistics suggested in the literature.

Citation: Visscher PM, Hill WG (2009) The Limits of Individual Identification from Sample Allele Frequencies: Theory and Statistical Analysis. PLoS Genet 5(10): e1000628. doi:10.1371/journal.pgen.1000628
Editor: Greg Gibson, The University of Queensland, Australia
Received May 29, 2009; Accepted August 3, 2009; Published October 2, 2009
Copyright: © 2009 Visscher, Hill. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This study was partially supported by the Australian National Health and Medical Research Council (grants 389892 and 442915) and the Australian Research Council (grant DP0770096). None of the funders had any role in the analyses and interpretation of the data or in the preparation, review or approval of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: peter.visscher@qimr.edu.au

Introduction
Homer et al. [1] showed that it was possible in some circumstances to identify whether a person with observed genotypes at multiple loci was part of a sample from which only estimated allele frequencies were known. Such identification would be particularly useful in forensic science if the presence or absence of a person's DNA in a mixture of DNA could be established. The authors also discussed the relevance of their findings when summary statistics such as allele frequencies were available in the public domain as part of genotype-phenotype studies, because it possibly could be established that individuals, or their close relatives, were part of a particular study. As a result of the publication of Homer et al., NIH and the Wellcome Trust added more restrictions to the access of such data to avoid potential identifiability (http://grants.nih.gov/grants/gwas/data_sharing_policy_modifications_20080828.pdf).

The approach taken by Homer et al. was to have two samples with estimated allele frequencies, here called the test and reference sample, and to ask whether an individual was close to either of these samples, using a statistic that measured a distance to the sample. The properties of the test statistic were not investigated theoretically (although simulation studies were performed), and the difference between sample and population was not always clear. In this note we take a best-case idealised setting in which there is a single population from which there is a test sample with allele frequencies at a number of loci and from which there is a single individual, called the proband, with full genotypes. The question is whether the person was part of this test sample from which allele frequencies are available. We use both likelihood and linear regression theory, which illustrate different approaches to the problem, to draw inference about the hypothesis that a proband was part of the test sample.
We show that the power of identification of a proband as part of a test sample is, approximately, proportional to the number of independent SNPs divided by the size of the sample from which the allele frequencies are available. The power is reduced by a predictable magnitude if the frequencies in the population are themselves estimated imprecisely. Properties of likelihood-ratio and regression test statistics and a comparison with the statistic used by Homer et al. were verified by simulation.

Author Summary
It was shown recently by Homer and colleagues that it may be possible to determine whether a person with known genotypes at a number of markers was part of a pool of DNA from which only frequencies of alleles at the markers are known. In this study, we quantify how well such identification can work in practice. The larger the size of the sample from which the allele frequencies are available, the more independent genetic markers are required to allow individual identification.

PLoS Genetics, www.plosgenetics.org, October 2009, Volume 5, Issue 10, e1000628

Methods

Notation and assumptions
There are $m$ independent SNP markers with a population frequency of $p_i$ for allele B at the $i$-th SNP. We assume Hardy-Weinberg equilibrium in the population, so that the genotype proportions for the $i$-th SNP are $(1-p_i)^2$, $2p_i(1-p_i)$ and $p_i^2$ for genotypes AA, AB and BB, respectively. We have estimated allele frequencies $\hat{p}_i$ based upon a test sample of $N$ unrelated individuals. In the test sample of $2N$ alleles, $n_i$ is the number of B alleles at locus $i$. In this study we assume that $N$ is known and individuals are equally represented in computing $\hat{p}_i$. Note that these conditions are unlikely to be fully met in forensic applications when the test sample may be a DNA pool, and we consider the implications later. The genotype for proband X at the $i$-th SNP is $g_i$, which can take values of 0, 1 and 2 for genotypes AA, AB and BB, and the expectation of $y_i = \tfrac{1}{2}g_i$ is the population frequency $p_i$, i.e. $E[\tfrac{1}{2}g_i] = p_i$.

To simplify derivations, we shall first assume the population frequencies $p_i$ are known. More generally, we assume we have prior unbiased estimates of the allele frequencies from the same population from a different finite sample (the "reference sample") of size $N^*$, in which there are $n_i^*$ B alleles at locus $i$. As both the test and reference samples are drawn independently from the population, the best estimate of the frequency in the population is given by the pooled value, written here as $\bar{p}_i$:

$$\bar{p}_i = \frac{n_i + n_i^*}{2N + 2N^*}$$

It is explained subsequently why this estimate, rather than say $n_i^*/2N^*$, the estimate of the allele frequency from the reference sample, is used in the statistical analysis.

Likelihood
Population frequencies known. If, under the assumptions described above, the numbers of individuals in the test sample and population frequencies are known, then we can compute the relative likelihood of sampling the observed genotypes under the two alternative hypotheses: the proband X is or is not in the test sample. If X is not a member of this sample, then $n_i \sim \text{Binomial}(2N, p_i)$ and $g_i$ is independently distributed $\text{Binomial}(2, p_i)$. Hence the joint probability of sample and proband is

$$P(\text{out}) = \binom{2N}{n_i} p_i^{n_i}(1-p_i)^{2N-n_i}\,\binom{2}{g_i} p_i^{g_i}(1-p_i)^{2-g_i}$$

If X is a member of the sample, $n_i$ has the same distribution, but $g_i$ is sampled from the $2N$ alleles without replacement and has the hypergeometric distribution:

$$P(\text{in}) = \binom{2N}{n_i} p_i^{n_i}(1-p_i)^{2N-n_i}\,\binom{n_i}{g_i}\binom{2N-n_i}{2-g_i}\Big/\binom{2N}{2}$$

Alternatively $P(\text{in})$ can be viewed as $n_i - g_i \sim \text{Binomial}(2N-2, p_i)$ and $g_i \sim \text{Binomial}(2, p_i)$ independently, giving the same formula.
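As a rough numerical check of the two joint probabilities above, the following sketch evaluates P(out) and P(in) at a single locus and confirms that the hypergeometric form of P(in) agrees with the alternative Binomial(2N-2, p) view. The function names and the example values (N = 100, n = 30, p = 0.1, proband BB) are illustrative assumptions and do not come from the paper.

```python
from math import comb, isclose


def p_out(n, g, N, p):
    """Joint probability of sample count n and proband genotype g when the
    proband is NOT in the test sample: Binomial(2N, p) times Binomial(2, p)."""
    sample = comb(2 * N, n) * p**n * (1 - p) ** (2 * N - n)
    proband = comb(2, g) * p**g * (1 - p) ** (2 - g)
    return sample * proband


def p_in(n, g, N, p):
    """Joint probability when the proband IS in the test sample: the proband's
    two alleles are drawn without replacement from the 2N sample alleles."""
    sample = comb(2 * N, n) * p**n * (1 - p) ** (2 * N - n)
    hypergeometric = comb(n, g) * comb(2 * N - n, 2 - g) / comb(2 * N, 2)
    return sample * hypergeometric


def p_in_alternative(n, g, N, p):
    """Same probability viewed as n - g ~ Binomial(2N - 2, p) and
    g ~ Binomial(2, p), drawn independently."""
    rest = n - g  # B alleles carried by the other N - 1 individuals
    if rest < 0 or rest > 2 * N - 2:
        return 0.0
    others = comb(2 * N - 2, rest) * p**rest * (1 - p) ** (2 * N - 2 - rest)
    proband = comb(2, g) * p**g * (1 - p) ** (2 - g)
    return others * proband


# Illustrative (assumed) values: 100 individuals, 30 B alleles, proband BB.
N, n, g, p = 100, 30, 2, 0.1
likelihood_ratio = p_in(n, g, N, p) / p_out(n, g, N, p)
same_formula = isclose(p_in(n, g, N, p), p_in_alternative(n, g, N, p), rel_tol=1e-9)
```

With these assumed values the ratio P(in)/P(out) is about 2.19, so this single locus only mildly favours the "in" hypothesis; discrimination comes from accumulating many loci.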
Hence the likelihood ratio for X in vs not in (out) the test sample reduces to a simple equation, but in view of the varying length of the factorial expressions, it is clearer to write three separate ones:

$$LR(\text{in/out}, AA) = \frac{(2N-n_i)(2N-n_i-1)}{2N(2N-1)(1-p_i)^2}$$

$$LR(\text{in/out}, AB) = \frac{n_i(2N-n_i)}{2N(2N-1)\,p_i(1-p_i)}$$

$$LR(\text{in/out}, BB) = \frac{n_i(n_i-1)}{2N(2N-1)\,p_i^2}$$

For example, if allele B is at low frequency in the population ($p_i$ small) and the proband is BB, then if the number in the sample $n_i < 2$, $LR(BB) = 0$, as it should; but as $n_i$ increases $LR(BB)$ becomes high. If the test sample is quite large, the correction for non-replacement sampling becomes less important, and the formulae simplify to, for example, $LR(\text{in/out}, BB) = (n_i/2N)^2/p_i^2$, i.e. a simple comparison of whether the genotype frequencies correspond more closely to those in the sample than in the population. For $m$ independent loci, the log likelihood ratio (logLR) is

$$\log LR(\text{in/out}) = -m[\log(2N) + \log(2N-1)] + \sum_0 [\log(2N-n_i) + \log(2N-n_i-1)] + \sum_1 [\log(n_i) + \log(2N-n_i)] + \sum_2 [\log(n_i) + \log(n_i-1)] - \sum_0 [2\log(1-p_i)] - \sum_1 [\log(p_i) + \log(1-p_i)] - \sum_2 [2\log(p_i)]$$

where $\sum_0$, $\sum_1$, $\sum_2$ denote sums over the loci at which the proband is AA, AB and BB, respectively. If the non-replacement sampling is ignored, this simplifies to a likelihood comparison of allele frequencies in an individual to one of two different populations

$$\log LR(\text{in/out}) = \sum_i (2g_0 + g_1)[\log(1 - n_i/2N) - \log(1-p_i)] + \sum_i (g_1 + 2g_2)[\log(n_i/2N) - \log(p_i)]$$

where $g_0$ etc. refer to counts over the corresponding genotypes.

Population frequencies estimated. If the marker frequencies are estimated from a reference sample of the population of size $N^*$, then the allele frequencies $p_i$ in the above equations have to be replaced in the analysis by an estimate of population frequency. Although it would be possible just to use the frequencies $n_i^*/2N^*$ in the reference sample, this should not be done as it leads to increased expectations of logLR and, if unadjusted, to bias in assignment of the proband to the test sample.
More appropriately, providing the reference and test samples are independent, the pooled estimate of the population frequency, $\bar{p}_i = (n_i + n_i^*)/(2N + 2N^*)$, should be used instead of $p_i$ in the above formulae.

Properties. The likelihood ratio (or its logarithm) contains all of the information and reflects the relative probabilities of the two hypotheses (in/out) given the data. We consider expectations of logLR under the different hypotheses. Standard statistical differentiation was employed, taking a Taylor series expansion of terms such as $\log(n_i)$ about $\log(2Np_i)$, ignoring higher order terms, and taking expectations over the sampling distributions of the observed frequencies under each hypothesis (see Text S1 for more details). The following formulae have also been verified by simulation.

1. If the population frequencies are known, then for a proband in the test sample, $E(\log LR \mid \text{in}) \approx \tfrac{1}{2}m/N$, and for a proband not in the test sample $E(\log LR \mid \text{out}) \approx -\tfrac{1}{2}m/N$. Therefore the ability to find whether the proband is in or not in the sample is proportional to the number of independent markers and inversely proportional to the size of the test sample.

2. The variance of logLR is approximately the same whether the proband is or is not present, and is close to $m/N = 2E(\log LR \mid \text{in})$. One measure of discriminating power is the difference in expected log-likelihoods for the two hypotheses, scaled by the variance of that difference, analogous to the non-centrality parameter of a test statistic: $[E(\log LR \mid \text{in}) - E(\log LR \mid \text{out})]^2 / [\operatorname{var}(\log LR \mid \text{in}) + \operatorname{var}(\log LR \mid \text{out})] \approx \tfrac{1}{2}m/N$. Hypothesis tests are discussed further in the subsequent section on the regression analysis, but note that the two hypotheses (in/out) are not nested. The variance under the "in" hypothesis is twice its expectation, as for a chi-square with 1 degree of freedom, so the proportion of LR exceeding some threshold can be predicted.

3. The allele frequencies have little influence on the distribution of the likelihoods. Unless the frequencies are very extreme, or the test sample very small, the expected likelihood ratios are little affected by whether the non-replacement sampling is accounted for, providing they are computable. With very small numbers of a homozygous class expected under the "out" hypothesis, exclusions can occur with some probability. In such a case, if genotype results are correct, then presence of the proband in the test sample has to be excluded. This can occur even with relatively large test sample sizes. The joint probability of the proband having genotype AB and the test sample being homozygous AA and thereby excluded is $2p(1-p)^{2N+1} \approx 2p\,e^{-2Np}$ for small $p$, and for example is 0.0027 for $p = 0.01$ and $N = 100$.

4. If the population frequencies are estimated as $\bar{p}_i$, the expectations of the likelihoods and their variances, and hence discriminating ability, are all scaled by a factor of approximately $N^*/(N+N^*)$, e.g. $E(\log LR \mid \text{in}) = [N^*/(N+N^*)](\tfrac{1}{2}m/N)$. For example, the reduction is by one-half if the frequency is estimated using a reference sample of the same size as the test sample, and essentially to zero if there are no such other data.

5. If there is linkage disequilibrium amongst the loci, but the data are analysed as if they are independent, the expectation of logLR is the same as if all were unlinked. The sampling variances are, however, increased. If the population frequencies are known without error, it can be shown that for any pair of loci, regardless of their frequency, $\operatorname{var}(\log LR \mid \text{in}) \approx \operatorname{var}(\log LR \mid \text{out}) \approx 2(1+r^2)/N$, approximately, where $r^2$ is the squared correlation of gene frequencies between these loci [2].
Hence, for $m$ loci, the discriminating ability is approximately $\tfrac{1}{2}m/\{N[1+(m-1)\overline{r^2}]\}$ and, as the number of loci increases, asymptotes to $\tfrac{1}{2}/(N\overline{r^2})$, where $\overline{r^2}$ is the mean of $r^2$ over all pairs of loci. If this quantity cannot be calculated directly it can be predicted from population parameters.

Linear regression
We show that the main results for the regression approach are based upon the expectation that the regression of the proband frequency, $y_i = \tfrac{1}{2}g_i$, on $\hat{p}_i$, each expressed as deviations from population frequencies, is distributed about unity for all loci if the proband was part of the test sample, and about zero otherwise.

Population frequencies known. Considering this case first for simplicity, the regression coefficient is estimated as

$$b = \frac{\sum_i (y_i - p_i)(\hat{p}_i - p_i)}{\sum_i (\hat{p}_i - p_i)^2}$$

If the proband is in the test sample, $y_i$ and $\hat{p}_i$ are correlated, so $\operatorname{cov}(y_i - p_i, \hat{p}_i - p_i \mid \text{in}) = \tfrac{1}{2}p_i(1-p_i)/N$, and if it is not in the test sample, $\operatorname{cov}(y_i - p_i, \hat{p}_i - p_i \mid \text{out}) = 0$. In both cases, $\operatorname{var}(\hat{p}_i - p_i \mid \text{in}) = \operatorname{var}(\hat{p}_i - p_i \mid \text{out}) = \tfrac{1}{2}p_i(1-p_i)/N$. Hence, assuming many loci such that the ratio of expectations approximates the expectation of the ratios,

$$E(b \mid \text{in}) = E\left[\sum_i \{\operatorname{cov}(y_i - p_i, \hat{p}_i - p_i) \mid \text{in}\} \Big/ \sum_i \{\operatorname{var}(\hat{p}_i - p_i) \mid \text{out}\}\right] = 1 \quad\text{and}\quad E(b \mid \text{out}) = 0$$

Therefore the regression of the proband's allele frequency on the estimated allele frequency in the test sample, both expressed as a deviation from the population frequency, is expected to be zero if the proband was not in the test sample and one if the proband was in the test sample. The corresponding sampling variances, assuming large $m$, are $\operatorname{var}(b \mid \text{in}) = (N-1)/m$ and $\operatorname{var}(b \mid \text{out}) = N/m$; i.e., the variance is slightly smaller if the proband is in the sample. These results correspond closely to the expectations of the conditional log-likelihood analysis, and show how they are related.

Population frequencies estimated. There are two approaches to estimating the population frequency and testing: comparison of the proband with either the reference sample of $N^*$ alone, or comparison of the proband with the pooled estimate $\bar{p}_i = (n_i + n_i^*)/(2N + 2N^*)$ from the combined sample of size $N+N^*$.
Whilst it might seem counterintuitive to use the latter, which includes the test data in the estimate, it provides simpler results, notably expected regression coefficients of 0 (out) and 1 (in); hence we use it here. Writing $\bar{p}_i$ for the pooled estimate, the estimate of the regression coefficient is

$$b = \frac{\sum_i (y_i - \bar{p}_i)(\hat{p}_i - \bar{p}_i)}{\sum_i (\hat{p}_i - \bar{p}_i)^2}$$

Now $\operatorname{var}(\hat{p}_i - \bar{p}_i) = \tfrac{1}{2}p_i(1-p_i)[1/N - 1/(N+N^*)]$. This is also $\operatorname{cov}(y_i - \bar{p}_i, \hat{p}_i - \bar{p}_i \mid \text{in})$, whereas $\operatorname{cov}(y_i - \bar{p}_i, \hat{p}_i - \bar{p}_i \mid \text{out}) = 0$. Hence, if the proband is in the test sample, $E(b \mid \text{in}) = 1$ and $\operatorname{var}(b \mid \text{in}) = [(N-1)/m][(N+N^*)/N^*]$. If the proband is not in the test sample, $E(b \mid \text{out}) = 0$ and $\operatorname{var}(b \mid \text{out}) = [N/m][(N+N^*)/N^*]$, where terms of 1 relative to $N+N^*$ are ignored. Hence the test statistics are simply $N^*/(N+N^*)$ of those where the population frequencies are known (i.e., $N^* \to \infty$).

Hypothesis testing. The null hypothesis is "out", $E(b) = 0$: the proband was not part of the test sample. The alternative hypothesis ("in", $E(b) > 0$) is that the proband (or a close relative) was part of the test sample. If hypothesis "out" is true, $b^2/\operatorname{var}(b \mid \text{out})$ is distributed approximately as $\chi^2(1)$. Conversely, a test statistic for the null hypothesis that the proband is part of the sample is $t = [b-1]^2/\operatorname{var}(b \mid \text{out})$. Again, $t \sim \chi^2(1)$ if this hypothesis is true. If it is false, i.e. the proband is not part of the sample, then $t$ has a non-central chi-square distribution, $t \sim \chi'^2(1), \lambda$, with non-centrality $\lambda \approx (m/N)[N^*/(N+N^*)]$. For large $N$, inferences from testing whether the proband is in or whether the proband is out of the test sample are identical, as in the likelihood approach: the probability of rejecting the null hypothesis that the proband is not part of the sample when that is false is the same as the probability of rejecting the null hypothesis that the proband is in the sample when that is false. For a type-I error rate of $\alpha$ and power of $1-\beta$, with corresponding normal deviates of $z_\alpha$ and $z_{1-\beta}$, the required ratio is $m/N = \lambda = (z_\alpha + z_{1-\beta})^2$, assuming a very large reference sample ($N^* \gg N$). For a type-I error rate of 0.05 and a power of 80%, the required $m/N$ ratio is therefore approximately 6, and for $\alpha = 10^{-6}$ and $1-\beta = 99\%$, the ratio is approximately 50.
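The required marker-to-sample-size ratio can be reproduced with a few lines of arithmetic. The sketch below is a minimal illustration that assumes one-sided normal deviates and uses the non-centrality λ ≈ (m/N)[N*/(N+N*)] given above; the function name and its optional arguments are hypothetical.

```python
from statistics import NormalDist


def required_m_over_n(alpha, power, n_test=None, n_ref=None):
    """Ratio m/N needed for a given type-I error rate and power.

    With no sample sizes given, the reference sample is treated as very
    large (N* >> N), so m/N = (z_alpha + z_power)^2. Otherwise the ratio is
    inflated by (N + N*) / N*, following lambda = (m/N) * N* / (N + N*).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-sided deviate (assumption)
    z_power = NormalDist().inv_cdf(power)
    ratio = (z_alpha + z_power) ** 2
    if n_test is None or n_ref is None:
        return ratio
    return ratio * (n_test + n_ref) / n_ref


ratio_modest = required_m_over_n(0.05, 0.80)
ratio_strict = required_m_over_n(1e-6, 0.99)
ratio_equal_sizes = required_m_over_n(0.05, 0.80, n_test=1000, n_ref=1000)
```

Under these assumptions `ratio_modest` is about 6.2 and `ratio_strict` about 50, in line with the ratios quoted above, and a reference sample of the same size as the test sample doubles the requirement.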
If, for example, the reference sample were the same size as the test sample, the number of loci would have to be doubled to give the same power.

Results

Simulations
Population allele frequencies on $m$ markers were drawn from a uniform distribution with lower bound 0.05 and upper bound 0.95 (i.e., minor allele frequency (MAF) > 0.05). For the $i$-th SNP, a genotype score ($y_i$) of a proband was simulated from a binomial distribution with probability $p_i$ and sample size 2. Allele frequencies in the reference and test samples were simulated from a binomial distribution with probability $p_i$ and sample size $2N^*$ and $2N$, respectively. If the proband was part of the test sample then the test sample was simulated on $N-1$ individuals and the allele count from the proband was added to that from this sample to create a sample from $N$ individuals. Linear regression was performed as described previously, for a type-I error rate of 0.05, and the Homer et al. [1] test statistic (see Text S2) was also implemented. 1000 simulations were performed for combinations of $N$ = 100, 1000, 10000, $N^*$ = 100, 1000, 10000 and $\infty$, and $m$ = 50,000, when the proband was either part or not part of the test sample.

The results are shown in Table 1. The regression type-I error rates are well controlled when the hypotheses tested are true. As predicted (Text S2), the type-I error rates for the Homer et al. test statistic are not well controlled. In many cases the probability of rejecting the null hypothesis when it is true is close to zero. Power to determine whether the proband is part of the test sample is good for test samples of 1000 if the reference sample size is large. Inference from the regression and likelihood-ratio approach is similar, as expected (Table S1).

Discussion
Simple methods were proposed to test the hypothesis of whether a proband was part of a test sample. The expected likelihood ratio or the power to reject the null hypothesis when it is false were derived and shown to be a simple function of $m/N$, the ratio of the number of markers and test sample size. If allele frequencies in the population are well-estimated then there is good power to determine if a proband is part of a sample of ~1000 individuals when using a whole genome scan of ~50,000 independent markers. There is a strong relationship between the logLR statistic and regression test statistics.
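The simulation design described under Simulations can be sketched in a few lines for the simplest case, in which the population frequencies are treated as known. This is a hedged illustration only: it uses far fewer markers and individuals than the m = 50,000 of the study so that it runs quickly with the standard library, and the function name, seed and sizes are assumptions made for the example.

```python
import random


def simulate_b(m, n_individuals, proband_in, rng):
    """Regression coefficient b for one simulated proband, with the
    population frequencies p_i treated as known."""
    numerator = 0.0
    denominator = 0.0
    for _ in range(m):
        p = rng.uniform(0.05, 0.95)                      # population frequency
        g = sum(rng.random() < p for _ in range(2))      # proband genotype 0/1/2
        others = n_individuals - 1 if proband_in else n_individuals
        n = sum(rng.random() < p for _ in range(2 * others))
        if proband_in:
            n += g                                       # proband's alleles join the sample
        y = g / 2                                        # proband allele frequency
        p_hat = n / (2 * n_individuals)                  # test-sample allele frequency
        numerator += (y - p) * (p_hat - p)
        denominator += (p_hat - p) ** 2
    return numerator / denominator


rng = random.Random(2009)  # arbitrary seed, for repeatability only
b_in = simulate_b(m=5000, n_individuals=25, proband_in=True, rng=rng)
b_out = simulate_b(m=5000, n_individuals=25, proband_in=False, rng=rng)
```

With m/N = 200 in this toy setting, `b_in` should fall near 1 and `b_out` near 0, as the regression theory predicts; repeating the call many times would give empirical type-I error rates and power comparable in spirit to Table 1.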
The difference in the two regression test statistics, in or out of the test sample, is approximately equal to twice the logLR statistic. Hence, twice the logLR statistic is very similar to a test statistic from regression that also tests for the in vs out hypothesis (Table S1).

Could any inference be drawn in the case where there are no prior estimates of allele frequencies? The analyses indicate that, even with many marker loci, there is little power as $N^*$ approaches 0 unless the sample size $N$ is also very small, and no larger than $N^*$.

Table 1. Simulation results (m = 50,000 SNPs; type-I error rate = 0.05; 1000 simulations).

                                       Linear regression               Homer et al.
Proband in test?   N*      N           b        P(b>0)    P(b<1)       P(D>0)    P(D<0)
                                                Type-I    Power        Type-I    Power
                                                error                  error
NO                 ∞       100         0.000    0.055     1.000        0.000     1.000
NO                 ∞       1000        0.002    0.064     1.000        0.002     0.486
NO                 ∞       10000       0.000    0.056     0.731        0.016     0.133
NO                 100     100         0.001    0.061     1.000        0.057     0.039
NO                 100     1000        0.005    0.065     0.678        0.994     0.000
NO                 100     10000       0.041    0.052     0.079        0.999     0.000
NO                 1000    100        -0.000    0.047     1.000        0.000     0.997
NO                 1000    1000        0.014    0.069     0.999        0.060     0.047
NO                 1000    10000       0.002    0.057     0.185        0.404     0.000
NO                 10000   100         0.002    0.067     1.000        0.000     0.999
NO                 10000   1000        0.001    0.065     1.000        0.001     0.408
NO                 10000   10000      -0.002    0.053     0.472        0.048     0.051
                                                Power     Type-I       Power     Type-I
                                                          error                  error
YES                ∞       100         0.999    1.000     0.048        1.000     0.000
YES                ∞       1000        1.003    1.000     0.051        0.996     0.000
YES                ∞       10000       0.997    0.709     0.053        0.396     0.000
YES                100     100         1.004    1.000     0.064        1.000     0.000
YES                100     1000        0.999    0.686     0.060        1.000     0.000
YES                100     10000       0.974    0.078     0.063        0.998     0.000
YES                1000    100         0.999    1.000     0.058        1.000     0.000
YES                1000    1000        1.002    1.000     0.063        0.992     0.000
YES                1000    10000       1.015    0.190     0.053        0.625     0.000
YES                10000   100         1.000    1.000     0.063        1.000     0.000
YES                10000   1000        0.999    1.000     0.059        0.993     0.000
YES                10000   10000       0.998    0.475     0.067        0.375     0.000

D refers to the Homer et al. test statistic.
doi:10.1371/journal.pgen.1000628.t001

The parameter $m$ was defined as the number of independent SNPs. When many SNPs are used, e.g. all common SNPs on a chip, then there is correlation (linkage disequilibrium) among the SNPs. Consequently, the $y$ variables (allele numbers in the proband) are correlated and not taking this into account will inflate the test statistic because the true variance of the estimated regression coefficient is larger than appears from the total number of SNPs. Similarly, the variance of the likelihood statistic is increased if allele frequencies across SNPs are correlated. There are a number of ways to deal with this correlation structure.

(i) Restrict the analyses to SNPs that are in linkage equilibrium. This seems wasteful because information is discarded.

(ii) Take the correlated nature of $y$ into account by fitting the covariance structure of $y$ into the regression or likelihood analysis. The effect of LD on the variance of the log likelihoods is shown earlier, and appropriate corrections using the mean $r^2$ given. In view of the correspondence of the likelihood and regression approaches, the same correction can be applied to the latter. The relevant quantity may be obtained from a separate data set (e.g. HapMap).

(iii) Perform a theoretical adjustment on the test statistic, by calibrating the variance of the test statistic on the equivalent number of independent markers. According to population genetics theory, the number of independent loci ("segments") in a random population with effective size $N_e$ and genome length $L$ (Morgan) is approximately $2N_eL/\log(4N_eL)$ [3]. For human populations, with $N_e$ = 10,000 and $L$ = 35, this implies a total of ~50,000 SNPs. This number can also be estimated using a simulation approach, conditioning on the observed LD structure in a sample where individual-level genotype data are available. Such an application resulted in ~55,000 independent SNPs for one genome-wide association study [4].
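The approximate number of independent segments quoted above is easy to reproduce. The sketch below assumes that "log" in the formula denotes the natural logarithm, which is the reading that recovers the ~50,000 figure; the function name is hypothetical.

```python
from math import log


def independent_segments(effective_size, genome_length_morgans):
    """Approximate number of independent loci, 2*Ne*L / log(4*Ne*L),
    using the natural logarithm (an assumption consistent with ~50,000)."""
    product = effective_size * genome_length_morgans
    return 2 * product / log(4 * product)


# Values quoted in the text for human populations.
human_segments = independent_segments(10_000, 35)
```

For $N_e$ = 10,000 and $L$ = 35 this evaluates to a little under 49,500, consistent with the rounded total of ~50,000.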
Population differences
In our derivations we have assumed that all samples (proband, reference and test) are from the same population and that within the population there is random mating. What if these assumptions are violated? If all samples are from the same population but there is deviation from Hardy-Weinberg equilibrium (HWE) then the tests are somewhat biased because HWE is assumed in computing the likelihood and the variance of sample allele frequencies. Population differences are more serious and can lead to the wrong inference. There are a large number of possibilities because, in principle, the proband, reference and test samples can all come from different populations. However, population differences between the reference and test sample can be tested explicitly using standard tests for differences in gene frequency. There seems little point in testing whether a proband was part of a specific test sample when there is no reference sample from the same population. Nevertheless, what can we predict if the reference sample is not actually from the same population, but is used as if it is? Then both the likelihood statistics for the hypotheses "in" and "out" are inflated, by essentially the same amount, so the problem is not the divergence between the two populations, but bias in the test statistic. If population frequencies are inappropriately or approximately estimated, the sample is more likely to be assigned as "in" when it should not be. The reference sample is of little value if the divergence between the populations, expressed as Wright's $F_{ST}$, approaches $1/(2N)$.

Can we quantify the limits of identification in practical situations? This is hard, because there are (at least) three difficulties in addition to the theoretical sample $m/N$ criterion:

1) The size of reference sample used to estimate the population frequency, in effect a sort of outgroup as $N$ gets very large. So if the test sample is much larger than the reference sample ($N \gg N^*$) the latter provides the limit.

2) The degree to which the test $N$ and the reference $N^*$ individuals are samples from the same population.
3) Linkage disequilibrium, which generates a limit regardless of numbers of loci.

For these reasons we cannot set a simple limit to identification without reference to other parameters (or speculation).

Relatives

In the analysis we have not considered the possibility that the proband is not in the test sample, but is related to one or more persons who are. For example, if a relative with relationship R (e.g. R = 1/2 for full sibs) is in the test sample, then the expectation of the regression coefficient is E(b) = R rather than 0 or 1. Similar calculations can be done if, for example, there are several relatives in the test or reference samples. If many markers are used, a value of b of approximately one-half would raise suspicions that in fact a full sib, parent or child is in the test sample. Lower, but non-zero, values could be consequences of sampling or relationship. The simulation results in Table 1 illustrate how sensitive the methods can be, and hence there seems a real possibility of identifying not just the proband but also his/her relatives.

Forensic applications

A problem frequently met in forensic applications is whether a particular individual's DNA appears in a mixture obtained at a crime scene, for example. In this case, it is usually unknown how many individuals' DNA is present in the sample (i.e., N is unknown), equal representation cannot be assumed, and there may be allelic drop out in the sample, although Homer et al. [1] showed empirically that probands could be detected even if their contribution to the DNA pool was small. We do not therefore consider the present results to be relevant for probabilistic inference in a forensic setting. However, exclusion of a proband from a pooled DNA sample is possible if many markers are used, the actual N is small and frequencies of alleles from the pool are estimated accurately. The likelihood framework is sensitive to genotyping errors in that false exclusions could occur, but the analysis could be adapted to model genotype counts with specified probability of errors or by assuming replacement sampling in computing P(n).
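The expectation E(b) = R stated in the Relatives section can be illustrated with a small simulation. This is a minimal sketch under simplifying assumptions that are ours rather than the paper's: population allele frequencies p are treated as known (equivalent to a very large reference sample), SNPs are independent, and b is taken as the slope through the origin of N(p_hat - p) on (y/2 - p), where p_hat is the test-sample frequency and y is the proband's allele count. The estimator defined earlier in the paper may differ in detail, and the function and argument names are hypothetical.

```python
import numpy as np

def simulate_b(relation, n_test=50, m=50_000, seed=0):
    """Return the regression slope b for one simulated data set.

    relation: "in" (proband is in the test sample), "sib" (a full sib of
    the proband is in the test sample) or "out" (no relative is present).
    Assumes known population frequencies and independent SNPs.
    """
    rng = np.random.default_rng(seed)
    p = rng.uniform(0.1, 0.9, size=m)      # known population frequencies

    # Two parents, each carrying two alleles per SNP (True = counted allele).
    parents = rng.random((2, 2, m)) < p

    def child():
        # Each parent transmits one of its two alleles at random per SNP.
        pick = rng.integers(0, 2, size=(2, 1, m))
        transmitted = np.take_along_axis(parents, pick, axis=1)[:, 0, :]
        return transmitted.sum(axis=0)     # allele count 0, 1 or 2

    proband = child()
    if relation == "in":
        member = proband
    elif relation == "sib":
        member = child()                   # shares both parents with proband
    elif relation == "out":
        member = rng.binomial(2, p, size=m)
    else:
        raise ValueError(f"unknown relation: {relation}")

    # The remaining N - 1 test-sample members are unrelated to the proband.
    others = rng.binomial(2, p, size=(n_test - 1, m))
    p_hat = (others.sum(axis=0) + member) / (2 * n_test)

    x = proband / 2 - p                    # proband deviation from p
    z = n_test * (p_hat - p)               # scaled test-sample deviation
    return float(x @ z / (x @ x))

for relation in ("in", "sib", "out"):
    print(relation, round(simulate_b(relation), 2))
```

With many independent markers the slope should fall near 1, 1/2 and 0 for the three cases, with sampling noise roughly of order sqrt(N/m) in this setup.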
The linear regression approach is likely to be robust to genotyping error.

Genome-wide association studies

In contrast to forensic applications, in the situation considered by Homer et al., in which the test sample is a database constructed using a specified number of individuals each with individual genotypes, and with the gene frequencies estimated as their average, our results support their conclusions. Probands that were part of a test sample could be identified even for sample sizes of 1000. If, for example, there are both diseased case and healthy control samples in the association test, each assumed to be sampled from the same population, then it is possible to test whether an individual is present in either the case or control group using the analysis we have described, but using each sample in turn as the test sample. Current genome-wide association studies (and meta-analyses based upon multiple studies) are conducted on large samples, often of the order of 10,000 or so, and in this case our results show that
the power to identify a proband who was part of such a large sample when the reference sample is of similar size is only about one-half (Table 1) assuming 50,000 independent loci, even under the ideal circumstances considered in this study.

Supporting Information

Table S1 Simulation results comparing the LR and Regression statistics.
Found at: doi:10.1371/journal.pgen.1000628.s001 (0.06 MB DOC)

Text S1 Computation of expected likelihoods.
Found at: doi:10.1371/journal.pgen.1000628.s002 (0.03 MB DOC)

Text S2 Homer et al. test statistic.
Found at: doi:10.1371/journal.pgen.1000628.s003 (0.03 MB DOC)

Acknowledgments

We thank Naomi Wray and two referees for helpful comments on the manuscript.

Author Contributions

Conceived and designed the experiments: PMV. Performed the experiments: PMV WGH. Analyzed the data: PMV WGH. Contributed reagents/materials/analysis tools: PMV WGH. Wrote the paper: PMV WGH.

References

1. Homer N, Szelinger S, Redman M, Duggan D, Tembe W, et al. (2008) Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet 4: e1000167. doi:10.1371/journal.pgen.1000167.
2. Hill WG, Robertson A (1968) Linkage disequilibrium in finite populations. Theor Appl Genet 38: 226-231.
3. Hayes BJ, Visscher PM, Goddard ME (2009) Increased accuracy of artificial selection by using the realized relationship matrix. Genetics Research 91: 47-60.
4. International Schizophrenia Consortium (2009) Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature (Epub July 1st 2009).

PLoS Genetics, www.plosgenetics.org, October 2009, Volume 5, Issue 10, e1000628