Releasing SNP Data and GWAS Results with Guaranteed Privacy Protection

iDASH: integrating Data for Analysis, Anonymization, and SHaring
Releasing SNP Data and GWAS Results with Guaranteed Privacy Protection
Xiaoqian Jiang, PhD and Shuang Wang, PhD

Overview
Introduction
iDASH Healthcare Privacy Protection Challenge
  » Tasks overview
Summary of results
  » Task 1: Privacy-preserving SNP data sharing
  » Task 2: Privacy-preserving GWAS results sharing
Conclusions

9/25/2014. Supported by the NIH Grant U54 HL108460 to the University of California, San Diego.

Human genome privacy
Human genomes are important to biomedical research, e.g., genome-wide association studies (GWAS).
But genomic data are also highly sensitive:
  » Disease association: predisposition to diabetes, cancer
  » Re-identification: name
  » Information disclosure about blood relatives
  » A great fear of the unknown

Privacy risk at SNP level
Lin et al. 2004, Science: as few as 75 statistically independent SNPs (single-nucleotide polymorphisms) are sufficient to identify a single person.
Gymrek et al. 2013, Science: surnames can be recovered from personal genomes by profiling short tandem repeats on the Y chromosome and querying recreational genetic genealogy databases.

Even statistics might be unsafe
[Figure: a person of interest compared against a reference population and a mixture; panel labels G, Y, F.]
Three outcomes for the person of interest:
  » Most likely to be in the mixture
  » Equally likely to be in the mixture or in the reference population
  » Most likely to be in the reference population
Homer et al. 2008, PLoS Genetics: aggregate genome data (i.e., allele frequencies) can also be used to re-identify an individual in a case group with a certain disease.
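The comparison on this slide can be sketched in a few lines of Python. This is a minimal illustration of the distance statistic described by Homer et al., not code from the talk: the function name, argument names, and the example frequencies are all hypothetical, and the published method goes on to normalize the per-SNP values into a test statistic, which is omitted here.

```python
def homer_statistic(person, mixture_freq, reference_freq):
    """Sum over SNPs of |y - ref| - |y - mix|.

    person          per-SNP allele frequency of the individual (0.0, 0.5 or 1.0)
    mixture_freq    per-SNP allele frequency published for the mixture (case group)
    reference_freq  per-SNP allele frequency of the reference population

    A positive sum means the person is closer to the mixture than to the
    reference; a negative sum means the opposite; a value near zero means
    the two are about equally likely.
    """
    return sum(
        abs(y - ref) - abs(y - mix)
        for y, mix, ref in zip(person, mixture_freq, reference_freq)
    )


# Hypothetical frequencies for three SNPs (illustrative values only).
person = [1.0, 0.0, 0.5]
mixture = [0.9, 0.1, 0.5]
reference = [0.5, 0.5, 0.5]
score = homer_statistic(person, mixture, reference)
```

With many independent SNPs even a small per-SNP shift accumulates, which is why released allele frequencies alone can reveal membership.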

iDASH Healthcare Privacy Protection Challenge
Evaluate solutions with guaranteed privacy protection
Task 1: Privacy-preserving SNP data sharing
  » Four teams (IU, OU, UT Dallas, and McGill)
Task 2: Privacy-preserving release of the top K most significant SNPs
  » Two teams (UT Austin and CMU)

Data preparation
Task 1: data publishing
  » Case: 200 PGP individuals
  » Control: 174 CEU individuals
  » Data set 1: 311 SNVs
  » Data set 2: 600 SNVs
Filtered and genotyped
Task 2: top-k SNP identification
  » Case: 200 PGP individuals
  » Control: 174 CEU individuals
  » Data set 1: 5000 SNVs
  » Data set 2: 106,129 SNVs
Overview and methodology papers are under review at BMC Medical Informatics and Decision Making.

Task 1: Privacy-preserving SNP Data Sharing
Goal: understand the privacy-utility balance in released SNP data after proper protection
Utility: number of significant SNPs identified by the chi-square association test over the 200 case samples and 174 control samples
Also checked: the published data's resistance to the likelihood ratio attack (Sankararaman et al. 2009, Nature Genetics)

Task 2: Privacy-preserving GWAS results sharing
Goal: assess GWAS utility in differentially private data analysis
Utility: how likely the top-K (e.g., K = 1 or 5) most significant SNPs (by chi-square test) are to be preserved in differentially private queries
Privacy protection: differential privacy with a budget ε = 1.0
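For reference, the budget ε above refers to the standard definition of ε-differential privacy. The notation below (mechanism M, neighbouring datasets D and D', output set S) is introduced here for illustration and does not appear on the slides.

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S]
\quad \text{for all } S \text{ and all } D, D' \text{ differing in one individual.}
```

With ε = 1.0, adding or removing any single participant changes the probability of any released output by a factor of at most e ≈ 2.72.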

Utility: Case-Control Association Test
Adapted from http://bioinformatics.org.au/ws09/presentations/day3_jstankovich.pdf
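The slide's figure did not survive transcription. As a stand-in, here is a minimal sketch of a chi-square association test on a 2x2 allele-count table for one SNP. The function name and the counts are hypothetical, and the challenge may have used a different table layout (for example genotype counts rather than allele counts).

```python
import math


def allelic_chi_square(case_a, case_b, control_a, control_b):
    """Pearson chi-square (1 degree of freedom) for a 2x2 table.

    case_a / case_b       counts of allele A / allele B among cases
    control_a / control_b counts of allele A / allele B among controls
    Returns (chi2, p_value).
    """
    n = case_a + case_b + control_a + control_b
    cross = case_a * control_b - case_b * control_a
    denom = (
        (case_a + case_b)
        * (control_a + control_b)
        * (case_a + control_a)
        * (case_b + control_b)
    )
    if denom == 0:
        # A row or column is empty, so the test is undefined; report no signal.
        return 0.0, 1.0
    chi2 = n * cross * cross / denom
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical allele counts for a single SNP.
chi2, p = allelic_chi_square(case_a=10, case_b=20, control_a=20, control_b=10)
```

Ranking SNPs by this statistic, or counting those below a p-value cutoff, gives the kind of utility measure the two tasks refer to.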

Experimental results
*Re-identification power was calculated at the 0.95 confidence level (i.e., a false positive rate of 0.05).

Results in Task 2
Probabilities that the top K (K = 1, 3, 5, 10, 30) most significant SNPs have been preserved in the released data, over 1000 trials

Team        Mechanism               Utility function
UT Austin   Exponential mechanism   Hamming distance
CMU         Exponential mechanism   Chi-squared statistics

Small dataset: 201 cases and 174 controls, 5000 SNPs
Large dataset: all valid genotypes on 201 cases and 174 controls, 106,129 SNPs
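Both teams used the exponential mechanism. The sketch below shows one generic way to pick K SNPs with it; it is not either team's implementation. It assumes the utility score is a per-SNP number such as a chi-square statistic, that the caller supplies a valid sensitivity bound for that score, and that the budget ε is split evenly across the K picks. The function and parameter names are hypothetical.

```python
import math
import random


def dp_top_k(scores, k, epsilon, sensitivity, rng=None):
    """Select k SNPs by repeated draws from the exponential mechanism.

    scores       dict mapping SNP id -> utility score (higher is better)
    epsilon      total privacy budget, split evenly over the k draws
    sensitivity  assumed bound on how much one individual can change a score
    """
    rng = rng or random.Random()
    remaining = dict(scores)
    eps_per_pick = epsilon / k
    chosen = []
    for _ in range(k):
        names = list(remaining)
        best = max(remaining.values())
        # Subtracting the best score leaves the sampling distribution
        # unchanged and keeps math.exp from overflowing.
        weights = [
            math.exp(eps_per_pick * (remaining[n] - best) / (2.0 * sensitivity))
            for n in names
        ]
        pick = rng.choices(names, weights=weights, k=1)[0]
        chosen.append(pick)
        del remaining[pick]
    return chosen


# Hypothetical scores; SNP ids are placeholders.
scores = {"snp_a": 50.0, "snp_b": 30.0, "snp_c": 1.0, "snp_d": 0.5}
released = dp_top_k(scores, k=2, epsilon=1.0, sensitivity=1.0, rng=random.Random(0))
```

Repeating the selection many times and counting how often the true top K are returned gives a preservation probability of the kind reported on this slide.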

Conclusions of Task 1
Privacy-preserving sharing of SNP data with differential privacy, while maintaining the data's utility for GWAS, remains a challenge
  » Even for a single genomic locus involving a few hundred SNPs, the utility of the data was largely damaged after noise was added to ensure privacy protection
It is unlikely that current differential privacy techniques will scale well to sharing whole human genomic data

Conclusions of Task 2
Privacy-preserving techniques work surprisingly well for publishing outcomes of GWAS-like analyses
  » Good accuracy can be achieved when, from the user's perspective, only a small number of the most significant SNPs are of concern
This task is well aligned with the centralized data/computing model
  » The centralized data/computing center will host human genomic data as well as a service for customized analyses on these data, and will release only the results of these analyses to users

Papers under review
Jiang, X., Zhao, Y., Wang, X., Malin, B., Wang, S., Ohno-, L., & Tang, H. (n.d.). A Community Assessment of Privacy Preserving Techniques for Human Genomes. BMC Medical Informatics and Decision Making (under review).
Wang, S., Mohammed, N., & Chen, R. (2014). Differentially Private Genome Data Dissemination through Top-Down Specialization. BMC Medical Informatics and Decision Making (under review).
Yu, F., & Ji, Z. (2014). Scalable Privacy-Preserving Data Sharing Methodology for Genome-Wide Association Studies: An Application to iDASH Healthcare Privacy Protection Challenge. BMC Medical Informatics and Decision Making (under review).
Roozgard, A., Barzigar, N., Verma, P., & Cheng, S. (2014). Genomic Data Privacy Protection using Compressed Sensing. BMC Medical Informatics and Decision Making (under review).

Acknowledgements
NIH iDASH U54HL108460
NIH R01 HG007078-01
NLM R00 LM011392
NHGRI K99 1K99HG008175

Thank you! Questions?

Protection against attack
Likelihood ratio test at the 0.05 significance level
[Figure: LR test statistics plotted per participant.]
Sankararaman, S., Obozinski, G., Jordan, M. I., & Halperin, E. (2009). Nature Genetics, 41(9), 965-7. doi:10.1038/ng.436
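The attack on this slide can be sketched as follows. This is a simplified illustration in the spirit of the likelihood ratio test of Sankararaman et al., assuming one binary allele per SNP and independent SNPs; the function names, the clamping constant, and the way the threshold is calibrated are assumptions made here, not details from the talk.

```python
import math


def lr_statistic(alleles, pool_freq, ref_freq, eps=1e-6):
    """Log-likelihood ratio that an individual belongs to the pool.

    alleles    per-SNP allele of the individual, coded 0 or 1
    pool_freq  per-SNP frequency of allele 1 in the released (pool) data
    ref_freq   per-SNP frequency of allele 1 in the reference population
    """
    total = 0.0
    for x, p_pool, p_ref in zip(alleles, pool_freq, ref_freq):
        # Clamp frequencies away from 0 and 1 so the logarithms stay finite.
        p_pool = min(max(p_pool, eps), 1.0 - eps)
        p_ref = min(max(p_ref, eps), 1.0 - eps)
        if x == 1:
            total += math.log(p_pool / p_ref)
        else:
            total += math.log((1.0 - p_pool) / (1.0 - p_ref))
    return total


def power_at_fpr(participant_scores, null_scores, fpr=0.05):
    """Fraction of participants whose score exceeds a threshold chosen so
    that at most a fraction fpr of non-participants (null_scores) exceed it."""
    ordered = sorted(null_scores, reverse=True)
    threshold = ordered[min(len(ordered) - 1, int(fpr * len(ordered)))]
    hits = sum(1 for s in participant_scores if s > threshold)
    return hits / len(participant_scores)
```

A protected release is doing its job when this power stays close to the false positive rate, meaning participants cannot be told apart from non-participants.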