Identification of regions with common copy-number variations using SNP array

Similar documents
Genome-wide copy-number calling (CNAs not CNVs!) Dr Geoff Macintyre

STATISTICAL METHODS FOR THE DETECTION AND ANALYSES OF STRUCTURAL VARIANTS IN THE HUMAN GENOME. Shu Mei, Teo

Global variation in copy number in the human genome

November 9, Johns Hopkins School of Medicine, Baltimore, MD,

Understanding DNA Copy Number Data

Challenges of CGH array testing in children with developmental delay. Dr Sally Davies 17 th September 2014

Optimizing Copy Number Variation Analysis Using Genome-wide Short Sequence Oligonucleotide Arrays

Structural Variants and Susceptibility to Common Human Disorders Dr. Xavier Estivill

Comparative analyses of seven algorithms for copy number variant identification from single nucleotide polymorphism arrays

CNV Detection and Interpretation in Genomic Data

Comparing CNV detection methods for SNP arrays Laura Winchester, Christopher Yau and Jiannis Ragoussis

Integrated detection and population-genetic analysis of SNPs and copy number variation

Integrated detection and population-genetic analysis of SNPs and copy number variation

Using GWAS Data to Identify Copy Number Variants Contributing to Common Complex Diseases

LTA Analysis of HapMap Genotype Data

Integrated detection and population-genetic analysis. of SNPs and copy number variation

CHROMOSOMAL MICROARRAY (CGH+SNP)

Associating Copy Number and SNP Variation with Human Disease. Autism Segmental duplication Neurobehavioral, includes social disability

SUPPLEMENTARY INFORMATION

CNV detection. Introduction and detection in NGS data. G. Demidov 1,2. NGSchool2016. Centre for Genomic Regulation. CNV detection. G.

Chapter 1 : Genetics 101

Supplementary note: Comparison of deletion variants identified in this study and four earlier studies

Whole-genome detection of disease-associated deletions or excess homozygosity in a case control study of rheumatoid arthritis

Modeling genetic inheritance of copy number variations

GENOME-WIDE ASSOCIATION STUDIES

THE PENNSYLVANIA STATE UNIVERSITY SCHREYER HONORS COLLEGE DEPARTMENT OF BIOCHEMISTRY AND MOLECULAR BIOLOGY

New methods for discovering common and rare genetic variants in human disease

Genomic structural variation

Introduction to LOH and Allele Specific Copy Number User Forum

cnvhap: an integrative population and haplotype based multiplatform model of SNPs and CNVs

Statistical Analysis of Single Nucleotide Polymorphism Microarrays in Cancer Studies

Generating Spontaneous Copy Number Variants (CNVs) Jennifer Freeman Assistant Professor of Toxicology School of Health Sciences Purdue University

Multiple Copy Number Variations in a Patient with Developmental Delay ASCLS- March 31, 2016

Nature Biotechnology: doi: /nbt.1904

Nature Genetics: doi: /ng Supplementary Figure 1. PCA for ancestry in SNV data.

Cost effective, computer-aided analytical performance evaluation of chromosomal microarrays for clinical laboratories

Nature Genetics: doi: /ng Supplementary Figure 1

PennCNV: An integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data

Supplementary Information. Supplementary Figures

Introduction to the Genetics of Complex Disease

Applications of Chromosomal Microarray Analysis (CMA) in pre- and postnatal Diagnostic: advantages, limitations and concerns

Analysis with SureCall 2.1

Association for Molecular Pathology Promoting Clinical Practice, Basic Research, and Education in Molecular Pathology

Approach to Mental Retardation and Developmental Delay. SR Ghaffari MSc MD PhD

CURRENT GENETIC TESTING TOOLS IN NEONATAL MEDICINE. Dr. Bahar Naghavi

Vega: Variational Segmentation for Copy Number Detection

DNA-seq Bioinformatics Analysis: Copy Number Variation

Microarray Comparative Genomic Hybridisation (array CGH)

Practical challenges that copy number variation and whole genome sequencing create for genetic diagnostic labs

Tutorial on Genome-Wide Association Studies

and SNPs: Understanding Human Structural Variation in Disease. My

Multimarker Genetic Analysis Methods for High Throughput Array Data

Calling DNA variants SNVs, CNVs, and SVs. Steve Laurie Variant Effect Predictor Training Course Prague, 6 th November 2017

The Loss of Heterozygosity (LOH) Algorithm in Genotyping Console 2.0

Agilent s Copy Number Variation (CNV) Portfolio

Multiplex target enrichment using DNA indexing for ultra-high throughput variant detection

DETECTING HIGHLY DIFFERENTIATED COPY-NUMBER VARIANTS FROM POOLED POPULATION SEQUENCING

The role of cytogenomics in the diagnostic work-up of Chronic Lymphocytic Leukaemia

Chapter 15: The Chromosomal Basis of Inheritance

New Enhancements: GWAS Workflows with SVS

Human Cancer Genome Project. Bioinformatics/Genomics of Cancer:

Structural Variation and Medical Genomics

Abstract. Optimization strategy of Copy Number Variant calling using Multiplicom solutions APPLICATION NOTE. Introduction

Supplementary Figure 1 Dosage correlation between imputed and genotyped alleles Imputed dosages (0 to 2) of 2-digit alleles (red) and 4-digit alleles

Introduction to genetic variation. He Zhang Bioinformatics Core Facility 6/22/2016

Chapter 12-4 DNA Mutations Notes

Copy Number Variations

Investigating rare diseases with Agilent NGS solutions

Chapter 15: The Chromosomal Basis of Inheritance

Copy Number Variations and Association Mapping Advanced Topics in Computa8onal Genomics

UNIVERSITI TEKNOLOGI MARA COPY NUMBER VARIATIONS OF ORANG ASLI (NEGRITO) FROM PENINSULAR MALAYSIA

THE GENETIC STRUCTURE, FUNCTION AND RELEVANCE TO DISEASE OF THE SALIVARY AGGLUTININ GENE (DMBT1)

Biostatistical modelling in genomics for clinical cancer studies

Comprehensive performance comparison of high-resolution array platforms for genome-wide Copy Number Variation (CNV) analysis in humans

Genetic Association Testing of Copy Number Variation

Statistical Applications in Genetics and Molecular Biology

Discovery of common Asian copy number variants using integrated high-resolution array CGH and massively parallel DNA sequencing

Genomics 101 (2013) Contents lists available at SciVerse ScienceDirect. Genomics. journal homepage:

The Chromosomal Basis Of Inheritance

Global assessment of genomic variation in cattle by genome resequencing and high-throughput genotyping

Chapter 11 Patterns of Chromosomal Inheritance

5/2/18. After this class students should be able to: Stephanie Moon, Ph.D. - GWAS. How do we distinguish Mendelian from non-mendelian traits?

Detection of aneuploidy in a single cell using the Ion ReproSeq PGS View Kit

White Paper Estimating Complex Phenotype Prevalence Using Predictive Models

Clinical evaluation of microarray data

PSSV User Manual (V1.0)

A Multi-Sample Based Method for Identifying Common CNVs in Normal Human Genomic Structure Using High- Resolution acgh Data

Analysis of CGH and SNP arrays for the detection of chromosomal aberrations in single cells

Next Generation Sequencing as a tool for breakpoint analysis in rearrangements of the globin-gene clusters

Gene expression analysis. Roadmap. Microarray technology: how it work Applications: what can we do with it Preprocessing: Classification Clustering

MPS for translocations

Computational Analysis of Genome-Wide DNA Copy Number Changes

Genome-wide association studies (case/control and family-based) Heather J. Cordell, Institute of Genetic Medicine Newcastle University, UK

SUPPLEMENTARY DATA. 1. Characteristics of individual studies

No mutations were identified.

Unit 5 Review Name: Period:

Detection of copy number variants in sequencing data

Missing Heritablility How to Analyze Your Own Genome Fall 2013

Study the Evolution of the Avian Influenza Virus

What can we contribute to cancer research and treatment from Computer Science or Mathematics? How do we adapt our expertise for them

Transcription:

Identification of regions with common copy-number variations using SNP array Agus Salim Epidemiology and Public Health National University of Singapore

Copy Number Variation (CNV) Copy number alteration of a segment of DNA sequence >1kb >1kb >1kb Deletion Duplication Translocation Inversion Insertion Detected by sequencing Vast majority of CNV regions are inherited (Locke, 2006)

Centre for Molecular Epidemiology The link of CNVs with complex diseases Increasing evidence to link CNVs with complex diseases Schizophrenia Autism Crohn s disease Age related macular degeneration

Platforms for CNV detection Array CGH NimbleGen (2.1M probes), Agilent (250K?) SNP array Illumina 1M, Affy 6.0 (1.8M probes) Sequencing

SNP array Originally built to detect Single Nucleotide Polymorphism. Main output: total signal intensity at each probe. Total intensity y( (R) = signal intensity A + signal intensity B Log R ratio = log 2 (observed R/ Expected R) Measures copy-number changes relative to the reference genome. BAF = normalized measure of relative signal intensity (B/A)

Log R Ratio

Detection of CNV regions PennCNV, QuantiSNPs Use Log R ratio and B allele frequency (BAF) Based on hidden Markov model (HMM) Hidden state = copy number (0,1,2,3,4) DNACopy (CBS algorithm) Find change points in intensity data (Log R ratio) Segments genome into multiple regions defined by the change points. No copy number calls.

Sample-by-sample approach Using HMM or CBS algorithm, CNV regions are detected for each individual separately. Sample-by-sample detection is reasonable for rare CNV where individual-specific regions are expected. However, common CNV regions occur at relatively the same genomic regions across individuals (e.g, McCaroll et al., 2008).

Individual CNV regions For each region we have: Start, end, confidence score Confidence score = a score to reflect confidence that the detected region truly exists. How to determine common boundaries for common CNV regions?

Within and between individuals reliability Reliability of (within) an individual regions is reflected by the confidence scores. Between individual reliability of the region is reflected by Between-individual reliability of the region is reflected by how many other individuals have CNV in the region.

Cumulative Overlap of Very Reliable Regions (COVER) Key idea: Common CNV regions occur at almost the same genomic regions. Choose regions where multiple individuals have CNV. But need to do something about the unreliable calls with low confidence scores.

COVER

Composite Scores (COMPOSITE) Using COVER, only very reliable regions are selected. What about regions with average reliability but yet occur in many individuals? COVER put more emphasis on Within-individual reliability. Composite Scores = Sum of confidence scores at each probe, give more emphasis on between-individual reliability.

COMPOSITE

Clustering of Individual Regions Two clusters of region form one common region.

Hierarchical Clustering of Individual Regions Jaccard Coefficient Similarity (A,B) = number of bases shared/length of union of A and B (bases). Linkage: single, average, complete.

CLUSTER examples

Average linkage 13 groups 9 groups

Cut off @ Complete linkage Single linkage Average linkage No. of clusters 0.3 28 24 27 0.4 23 16 21 0.5 21 11 17 0.6 17 6 13 0.7 14 4 12 0.8 12 2 9 0.9 10 2 6 095 0.95 9 2 5 0.99 9 1 2

Comparison of Methods COVER COMPOSITE Both can be refined using CLUSTER. What are we comparing? Proportion of identified CNV regions violate HWE Discordant rates between the identified CNV regions and sequencing results (when available)

HWE of common CNV regions The number of copies in offspring = number of copies in father + number of copies in mother. For diallelic CNV with deletion only (CN=0,1,2), Population frequency of 0 copy allele = p Population frequency of 1 copy = 1-p = q Under random mating, the frequency of subjects with: 0 copy = p 2 1 copy = 2pq 2 copies = q 2

For diallelic CNV with duplication only (CN=2,3,4), Population frequency of 1 copy allele = p Population frequency of 2 copy = 1-p = q Under random mating, the frequency of subjects with: 2 copies = p 2 3 copies = 2pq 4 copies = q 2 For multiallelic CNV, HWE test cannot be applied to the unphased copy number calls.

Discordant rates What is the proportion of CNV regions identified in a sample that overlap with sequencing result of the same sample? overlap = the two regions overlap above certain threshold overlap = the two regions overlap above certain threshold. E.g, Jaccard coefficient >= 0.5

Application 112 HapMap samples (CEU, CHBJPT, YRI) on Illumina 1M platform; 83 of them are unrelated individuals; 8 of them were sequenced by Kidd et al (2008) Statistics i of interest: average discordant rates in the eight sample, Prop of diallelic li CNV do not follow HWE (among unrelated individuals) Average minor allele frequency of diallelic CNVs Average size of CNV regions

COVER results a) Discordant Rates b) HWE violation c) MAF d) Avg Size

Discordant rates can be lowered by choosing individual regions with higher confidence scores. > 90% of identified regions follow HWE.

COMPOSITE Results a) Discordant Rates b) HWE violation c) MAF d) Avg Size

COVER performs better than COMPOSITE; in particular it provides more control on discordant rates and HWE. Within-individual individual reliability is more useful when identifying common CNV regions. Discordant rates is high. Discordant rates between McCaroll et al (2008) results (Affy 6.0) and sequencing is also around 50-70%.

CLUSTER refinement of COVER Cluster cut-off = 0.6 a) Avg num clusters b) HWE violation c) PCA without CLUSTER d) PCA after CLUSTER

Software cnvpack R package available at http://www.meb.ki.se/~yudpaw

Conclusion COVER performs better than COMPOSITE; in particular it provides more control on discordant rates and HWE. CLUSTER provides potential refinement to COVER. Our approach only requires: start, end and confidence scores of individual CNV regions. These are output of CNV detection software.

Collaborators National University of Singapore Teo Shu Mei Chia Kee Seng Ku Chee Seng Karolinska Institute, Sweden Yudi Pawitan Universita Milano Bicocca Italy Universita Milano Bicocca, Italy Stefano Calza