Releasing SNP Data and GWAS Results with Guaranteed Privacy Protection

integrating Data for Analysis, Anonymization, and SHaring (iDASH)
Xiaoqian Jiang, PhD and Shuang Wang, PhD

Overview
Introduction
iDASH Healthcare Privacy Protection Challenge
» Tasks overview
Summary of results
» Task 1: Privacy-preserving SNP data sharing
» Task 2: Privacy-preserving GWAS results sharing
Conclusions
9/25/2014 Supported by the NIH Grant U54 HL108460 to the University of California, San Diego

Human genome privacy
Human genomes are important to biomedical research, e.g., genome-wide association studies (GWAS)
But genomic data are also highly sensitive
» Disease association: predisposition to diabetes, cancer
» Re-identification: name
» Information disclosure of blood relatives
» A great fear of the unknown

Privacy risk at SNP level
Lin et al. 2004, Science: as few as 75 statistically independent SNPs (single-nucleotide polymorphisms) are sufficient to identify a single person
Gymrek et al. 2013, Science: surnames can be recovered from personal genomes by profiling short tandem repeats on the Y chromosome and querying recreational genetic genealogy databases


Even statistics might be unsafe
[Figure: a person of interest compared against a reference population and a mixture (labels G, Y, F). Outcomes: most likely to be in the mixture; equally likely to be in the mixture and in the reference population; most likely to be in the reference population]
Homer et al. 2008, PLoS Genetics: aggregate genome data (i.e., allele frequencies) can also be used to re-identify an individual in a case group with a certain disease
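The comparison in the figure can be sketched in code. The block below is a minimal illustration of the distance statistic from Homer et al. 2008, D = |Y - Ref| - |Y - Mix| per SNP, summarized by its mean and a one-sample t-statistic. The function name, the argument names, and the toy frequencies in the usage note are assumptions made for illustration; which of the slide labels G, Y, and F corresponds to which argument is not recoverable from the extracted text.

```python
import math

def homer_statistic(person, reference, mixture):
    """Mean and t-statistic of D_j = |y_j - ref_j| - |y_j - mix_j| over SNPs.

    person:    allele frequencies of the person of interest (0.0, 0.5 or 1.0)
    reference: allele frequencies in the reference population
    mixture:   allele frequencies in the mixture (e.g., a case group)
    """
    d = [abs(y - r) - abs(y - m) for y, r, m in zip(person, reference, mixture)]
    n = len(d)
    if n < 2:
        raise ValueError("need at least two SNPs to estimate the variance")
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)
    # One-sample t-statistic against a mean of zero; 0.0 if all D_j are equal.
    t = mean / math.sqrt(var / n) if var > 0 else 0.0
    return mean, t
```

A clearly positive value suggests the person is closer to the mixture, a value near zero suggests no evidence either way, and a clearly negative value suggests the person is closer to the reference population. For example, `homer_statistic([1.0, 0.0, 0.5, 1.0], [0.5] * 4, [0.9, 0.1, 0.5, 0.8])` uses made-up frequencies in which the person sits closer to the mixture.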

iDASH Healthcare Privacy Protection Challenge
Evaluate solutions that offer guaranteed privacy protection
Task 1: Privacy-preserving SNP data sharing
» Four teams (IU, OU, UT Dallas, and McGill)
Task 2: Privacy-preserving release of the top K most significant SNPs
» Two teams (UT Austin and CMU)

Data preparation
Task 1: data publishing
» Case: 200 PGP individuals
» Control: 174 CEU individuals
» Data set 1: 311 SNVs
» Data set 2: 600 SNVs
» Filtered and genotyped
Task 2: top-K SNP identification
» Case: 200 PGP individuals
» Control: 174 CEU individuals
» Data set 1: 5000 SNVs
» Data set 2: 106,129 SNVs
Overview and methodology papers are under review for BMC Medical Informatics and Decision Making

Task 1: Privacy-preserving SNP data sharing
Goal: understand the privacy-utility balance in released SNP data after proper protection
Utility: number of significant SNPs identified by the chi-square association test over the 200 case samples and 174 control samples
Also checked: the published data's resistance to the likelihood ratio attack (Sankararaman et al. 2009, Nature Genetics)

Task 2: Privacy-preserving GWAS results sharing
Goal: assess GWAS utility in differentially private data analysis
Utility: how likely the top-K (e.g., K = 1 or 5) most significant SNPs (by chi-square test) are to be preserved in differentially private queries
Privacy protection: differential privacy with a budget of ε = 1.0

Utility: Case-Control Association Test
Adapted from http://bioinformatics.org.au/ws09/presentations/day3_jstankovich.pdf
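As a rough illustration of a case-control association test, the sketch below computes a Pearson chi-square statistic with 1 degree of freedom on a 2x2 table of allele counts. It assumes an allelic test; the slide's original table did not survive extraction, so the exact form used in the challenge may differ. The function name and the counts in the usage note are hypothetical.

```python
import math

def allelic_chi_square(case_minor, case_major, ctrl_minor, ctrl_major):
    """Pearson chi-square (1 d.o.f.) on a 2x2 table of allele counts.

    Rows are cases and controls; columns are minor and major allele counts.
    Returns (chi2, p_value).
    """
    table = [[case_minor, case_major], [ctrl_minor, ctrl_major]]
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total
            if expected > 0:
                chi2 += (table[i][j] - expected) ** 2 / expected
    # For 1 d.o.f., P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

With 200 cases and 174 controls there are 400 and 348 alleles per SNP, so a call might look like `allelic_chi_square(120, 280, 70, 278)` for made-up counts. Ranking SNPs by this statistic gives the "most significant SNPs" that the utility measures refer to.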

Experimental results
*Re-identification power was calculated at the 0.95 confidence level (i.e., false positive rate of 0.05).

Results in Task 2
Probabilities that the top K (K = 1, 3, 5, 10, 30) most significant SNPs were preserved in the released data over 1000 trials
» UT Austin: mechanism: exponential mechanism; utility function: Hamming distance
» CMU: mechanism: exponential mechanism; utility function: chi-squared statistics
Small dataset: 201 cases and 174 controls, 5000 SNPs
Large dataset: all valid genotypes on 201 cases and 174 controls, 106,129 SNPs
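Both teams used the exponential mechanism, which can be sketched generically as follows. This is not either team's implementation: the score values, the `sensitivity` parameter, and the equal split of the budget across the K draws are assumptions made for illustration, and the teams' actual utility functions (Hamming distance, chi-squared statistics) are not reproduced here.

```python
import math
import random

def dp_top_k(scores, k, epsilon, sensitivity, rng=None):
    """Pick k SNPs without replacement using the exponential mechanism.

    scores:      dict mapping SNP id -> utility score (higher is better)
    epsilon:     total privacy budget, split evenly over the k draws (assumed)
    sensitivity: assumed bound on how much one individual can change a score
    """
    rng = rng or random.Random()
    eps_each = epsilon / k
    remaining = dict(scores)
    chosen = []
    for _ in range(k):
        names = list(remaining)
        top = max(remaining.values())
        # Subtracting the maximum keeps exp() from overflowing; it does not
        # change the sampling probabilities.
        weights = [
            math.exp(eps_each * (remaining[n] - top) / (2.0 * sensitivity))
            for n in names
        ]
        pick = rng.choices(names, weights=weights, k=1)[0]
        chosen.append(pick)
        del remaining[pick]
    return chosen
```

The reported utility could then be estimated by calling `dp_top_k` repeatedly (for example 1000 times with `epsilon=1.0`) and counting how often the returned set equals the true top K.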

Conclusions of Task 1
Privacy-preserving sharing of SNP data under differential privacy, while maintaining the data's utility for GWAS, remains a challenge
» Even for a single genomic locus involving a few hundred SNPs, the utility of the data was largely damaged after noise was added to ensure privacy protection
It is unlikely that current differential privacy techniques will scale well for sharing whole human genomic data

Conclusions of Task 2
Privacy-preserving techniques work surprisingly well for publishing the outcomes of GWAS-like analyses
» Good accuracy can be achieved when, from the user's perspective, only a small number of the most significant SNPs are of concern
This task is well aligned with the centralized data/computing model
» The centralized data/computing center hosts human genomic data as well as services for customized analyses on these data, and releases only the results of these analyses to users

Papers under review
Jiang, X., Zhao, Y., Wang, X., Malin, B., Wang, S., Ohno-, L., & Tang, H. (n.d.). A Community Assessment of Privacy Preserving Techniques for Human Genomes. BMC Medical Informatics and Decision Making (under review).
Wang, S., Mohammed, N., & Chen, R. (2014). Differentially Private Genome Data Dissemination through Top-Down Specialization. BMC Medical Informatics and Decision Making (under review).
Yu, F., & Ji, Z. (2014). Scalable Privacy-Preserving Data Sharing Methodology for Genome-Wide Association Studies: An Application to iDASH Healthcare Privacy Protection Challenge. BMC Medical Informatics and Decision Making (under review).
Roozgard, A., Barzigar, N., Verma, P., & Cheng, S. (2014). Genomic Data Privacy Protection using Compressed Sensing. BMC Medical Informatics and Decision Making (under review).

Acknowledgements
NIH iDASH U54HL108460
NIH R01 HG007078-01
NLM R00 LM011392
NHGRI K99 1K99HG008175

Thank you! Questions?

Protection against attack
[Figure: likelihood ratio (LR) test statistics of participants, with the 0.05 significance level marked]
Sankararaman, S., Obozinski, G., Jordan, M. I., & Halperin, E. (2009). Nature Genetics, 41(9), 965-7. doi:10.1038/ng.436
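A minimal sketch of a likelihood ratio attack in the spirit of Sankararaman et al. (2009) is given below. It assumes alleles coded as 0 or 1, independent SNPs, and pool and reference frequencies strictly between 0 and 1. The function names and the empirical thresholding on non-participants are illustrative choices, not the challenge's evaluation code.

```python
import math

def lr_statistic(alleles, pool_freq, ref_freq):
    """Log-likelihood ratio that an individual is in the pool vs. the reference.

    alleles:   0/1 allele per SNP for the individual being tested
    pool_freq: allele frequencies published for the pool (each in (0, 1))
    ref_freq:  allele frequencies in the reference population (each in (0, 1))
    """
    total = 0.0
    for x, p_hat, p in zip(alleles, pool_freq, ref_freq):
        total += x * math.log(p_hat / p)
        total += (1 - x) * math.log((1 - p_hat) / (1 - p))
    return total

def reidentification_power(participant_scores, nonparticipant_scores, fpr=0.05):
    """Fraction of participants whose LR exceeds an empirical threshold.

    The threshold is the (1 - fpr) empirical quantile of the non-participant
    scores, so roughly a fraction fpr of non-participants are flagged by
    mistake.
    """
    ranked = sorted(nonparticipant_scores)
    n = len(ranked)
    # Small tolerance guards against floating-point error in (1 - fpr) * n.
    idx = min(n - 1, max(0, int(math.ceil((1 - fpr) * n - 1e-9)) - 1))
    threshold = ranked[idx]
    hits = sum(1 for s in participant_scores if s > threshold)
    return hits / len(participant_scores), threshold
```

Under these assumptions, a protected release is better the closer the returned power stays to the false positive rate, since the attack then does little better than chance.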