Comparison of segmentation methods in cancer samples


Morgane Pierre-Jean, Guillem Rigaill, Pierre Neuvial
Laboratoire Statistique et Génome, Université d'Évry Val d'Essonne, UMR CNRS 8071, USC INRA
CERIM, Université Lille 2 Droit et Santé, Faculté de médecine
2013-04-16

Outline
  1
  2  Classical modelization / State of the art / Two-step approaches
  3  Simulated data creation / ROC curves
  4

Morgane Pierre-Jean, Segmentation methods in cancer samples, 2/24


[Figure: signals c and b plotted along chromosome 2 (Mb)]
Breakpoints occur at the exact same position in the two dimensions.


Goal of DNA copy number studies: identification of altered genome regions, in order to
- understand tumor progression
- lead to personalized therapies

We focused on the identification of breakpoints.
- Genomic signals from SNP arrays are bivariate.
- Breakpoints occur at exactly the same position in the two dimensions.


A change-point model

Biological assumption: DNA copy numbers or symmetrized B allele frequencies are piecewise constant.

Statistical model for $K$ change points at $(t_1, \dots, t_K)$:

\[ \forall j = 1, \dots, n, \qquad c_j = \gamma_j + \epsilon_j \]

where $\forall k \in \{1, \dots, K+1\}$, $\forall j \in [t_{k-1}, t_k[$, $\gamma_j = \Gamma_k$.

Complexity challenges:
- $K$ and $(t_1, \dots, t_K)$ are unknown.
- For a fixed $K$, the number of possible partitions is $C_{n-1}^{K} = O(n^{K})$.
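As a minimal sketch of the change-point model above, the following Python snippet simulates a piecewise-constant signal and counts the candidate partitions for a fixed K. The Gaussian noise, the parameter `sigma` and the function names `simulate_profile` and `count_partitions` are assumptions made for illustration; the slides do not specify a noise distribution.

```python
import math
import random


def simulate_profile(n, change_points, levels, sigma=0.1, seed=0):
    """Simulate c_j = gamma_j + eps_j for j = 1..n.

    change_points: sorted positions t_1 < ... < t_K (1-based); segment k
    covers the half-open interval [t_{k-1}, t_k[ with t_0 = 1 and
    t_{K+1} = n + 1.
    levels: the K + 1 segment means Gamma_1, ..., Gamma_{K+1}.
    The noise is Gaussian here, which is an assumption of this sketch.
    """
    if len(levels) != len(change_points) + 1:
        raise ValueError("need exactly K + 1 levels for K change points")
    rng = random.Random(seed)
    bounds = [1] + list(change_points) + [n + 1]
    signal = []
    for k in range(len(levels)):
        for _ in range(bounds[k], bounds[k + 1]):
            signal.append(levels[k] + rng.gauss(0.0, sigma))
    return signal


def count_partitions(n, K):
    """Number of ways to place K change points in the n - 1 gaps between points."""
    return math.comb(n - 1, K)
```

For instance, `count_partitions(5000, 5)` is already far too large for an exhaustive search, which is why exact dynamic programming or fast heuristics are needed.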

State of the art: exact solutions

One dimension
- [Picard et al. (2005)]: complexity in $O(Kn^2)$
- [Rigaill et al. (2010)]: mean complexity in $O(Kn \log(n))$

Two dimensions
- Extension of [Picard et al. (2005)]: complexity in $O(dKn^2)$, for smaller problems
- [Mosen-Ansorena, D. et al. (2013)]: complexity in $O(dKnl)$, where $l$ is the maximum length of segments

State of the art: heuristics

Type                 Name     Method                                                             Dimension
Convex relaxation    FLASSO   Total variation distance, with a complexity in O(Kn)               1d
Convex relaxation    GFLASSO  Group fused Lasso solved by LARS, O(Knd)                           2d
Binary segmentation  CBS      Circular binary segmentation                                       1d
Binary segmentation  CART     Classification and regression tree                                 1d
Binary segmentation  MCBS     Multivariate circular binary segmentation                          2d
Binary segmentation  PSCBS    CBS on copy number, then on B allele frequency                     2d
Binary segmentation  RBS      Recursive binary segmentation in 2 dimensions, adaptation of CART  2d
Other                PSCN     HMM (hidden Markov model)                                          2d

Two-step approaches for joint segmentation

[Gey, S. and Lebarbier, E. (2008)] and [Bleakley and Vert (2011)] proposed two-step approaches. Following them, we implemented a fast joint segmentation using CART in 2d, followed by a pruning step.
- First step: run a fast but approximate segmentation method (RBS).
- Second step: prune the resulting set of breakpoints using dynamic programming, which is slower but exact.
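The pruning step can be sketched as a dynamic program restricted to the candidate breakpoints returned by the first step. This is an illustrative one-dimensional version under stated assumptions, not the jointseg implementation: the cost is the residual squared error around each segment mean, breakpoints are 0-based indices of the first point of the right-hand segment, and the names `segment_cost` and `prune_breakpoints` are hypothetical. For a bivariate signal the segment cost would be summed over the two dimensions.

```python
def segment_cost(prefix, prefix_sq, a, b):
    """Residual squared error of y[a:b] around its mean, from prefix sums."""
    s = prefix[b] - prefix[a]
    return (prefix_sq[b] - prefix_sq[a]) - s * s / (b - a)


def prune_breakpoints(y, candidates, K):
    """Pick the K candidates (distinct values in 1..n-1) minimising total RSE."""
    if K > len(candidates):
        raise ValueError("cannot keep more breakpoints than candidates")
    n = len(y)
    prefix, prefix_sq = [0.0], [0.0]
    for v in y:
        prefix.append(prefix[-1] + v)
        prefix_sq.append(prefix_sq[-1] + v * v)
    bounds = [0] + sorted(candidates) + [n]
    m = len(bounds)
    inf = float("inf")
    # cost[k][j]: best RSE of y[0:bounds[j]] using k of the candidate breakpoints.
    cost = [[inf] * m for _ in range(K + 1)]
    back = [[0] * m for _ in range(K + 1)]
    for j in range(1, m):
        cost[0][j] = segment_cost(prefix, prefix_sq, 0, bounds[j])
    for k in range(1, K + 1):
        for j in range(k + 1, m):
            for i in range(k, j):  # bounds[i] is the last breakpoint kept
                c = cost[k - 1][i] + segment_cost(prefix, prefix_sq, bounds[i], bounds[j])
                if c < cost[k][j]:
                    cost[k][j], back[k][j] = c, i
    # Backtrack from the end of the signal to recover the kept breakpoints.
    kept, j = [], m - 1
    for k in range(K, 0, -1):
        j = back[k][j]
        kept.append(bounds[j])
    return sorted(kept)
```

Because the search only runs over the candidates, its cost depends on the number of candidates rather than on n, which is what makes the exact step affordable after a fast approximate one.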

Binary segmentation

Take the simple case where the dimension is equal to 1 ($d = 1$).

Hypotheses: $H_0$: no breakpoint vs $H_1$: exactly one breakpoint.

The likelihood ratio statistic is given by $\max_{1 \le i < n} |Z_i|$, with

\[ S_i = \sum_{1 \le l \le i} y_l, \qquad Z_i = \frac{\dfrac{S_i}{i} - \dfrac{S_n - S_i}{n - i}}{\sqrt{\dfrac{1}{i} + \dfrac{1}{n - i}}} \tag{1} \]

If $d > 1$, the likelihood ratio statistic becomes $\max_{1 \le i < n} \|Z_i\|_2^2$.
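The statistic in (1) can be computed for every candidate position in a single pass using a running sum. The sketch below covers the one-dimensional case only; the function names `split_statistics` and `best_split` are chosen here for illustration.

```python
import math


def split_statistics(y):
    """Return [Z_1, ..., Z_{n-1}] for a 1-d signal y (a list of numbers)."""
    n = len(y)
    if n < 2:
        raise ValueError("need at least two points to test for a breakpoint")
    total = sum(y)  # S_n
    stats = []
    s_i = 0.0
    for i in range(1, n):  # i = number of points in the left segment
        s_i += y[i - 1]    # running sum S_i
        mean_diff = s_i / i - (total - s_i) / (n - i)
        stats.append(mean_diff / math.sqrt(1.0 / i + 1.0 / (n - i)))
    return stats


def best_split(y):
    """Return (i, |Z_i|) maximising |Z_i|; the breakpoint sits after point i."""
    stats = split_statistics(y)
    k_best = max(range(len(stats)), key=lambda k: abs(stats[k]))
    return k_best + 1, abs(stats[k_best])
```

Since $1 / (1/i + 1/(n-i)) = i(n-i)/n$, the square of $Z_i$ equals the reduction in residual squared error obtained by splitting after point $i$, so maximising $|Z_i|$ and maximising the gain in fit select the same position.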

First step: Recursive Binary Segmentation (RBS)

Complexity: $O(dn \log(K))$

First breakpoint: for each $i$, we compute $Z_i$, and

\[ t_1 = \arg\max_{1 \le i < n} \|Z_i\|_2^2 \]

Second breakpoint (RBS): find the best split within each current segment,

\[ \max_{1 \le i \le t_1} \|Z_i\|_2^2 \qquad \max_{t_1 < i \le n} \|Z_i\|_2^2 \]

- Compute the RSE for each segment.
- Keep the split which brings the maximum gain in RSE.
- Add the corresponding breakpoint to the active set.

Third breakpoint (RBS): find the best split within each of the three current segments,

\[ \max_{1 \le i \le t_1} \|Z_i\|_2^2 \qquad \max_{t_1 < i \le t_2} \|Z_i\|_2^2 \qquad \max_{t_2 < i \le n} \|Z_i\|_2^2 \]

- Compute the RSE for each segment.
- Keep the split which brings the maximum gain in RSE.
- Add the corresponding breakpoint to the active set.
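The greedy procedure walked through above can be sketched as follows for a d-dimensional signal. This is a simplified illustration, not the jointseg code: it uses the squared norm of $Z_i$ as the gain of a split, returns breakpoints as 0-based indices of the first point of the right-hand segment, and rescans the list of segments at each step, so it does not reach the $O(dn \log(K))$ complexity quoted in the slides. The names `best_split_nd` and `recursive_binary_segmentation` are hypothetical.

```python
def best_split_nd(y, start, end):
    """Best split of rows y[start:end], each row being a sequence of d numbers.

    Returns (gain, t), where t is the index of the first point of the right
    part and gain = ||Z_i||_2^2, or None if the segment has fewer than 2 points.
    """
    m = end - start
    if m < 2:
        return None
    d = len(y[start])
    totals = [sum(y[j][c] for j in range(start, end)) for c in range(d)]
    left = [0.0] * d
    best = None
    for i in range(1, m):
        gain = 0.0
        for c in range(d):
            left[c] += y[start + i - 1][c]
            diff = left[c] / i - (totals[c] - left[c]) / (m - i)
            gain += diff * diff / (1.0 / i + 1.0 / (m - i))
        if best is None or gain > best[0]:
            best = (gain, start + i)
    return best


def recursive_binary_segmentation(y, K):
    """Greedily add up to K breakpoints and return them sorted."""
    segments = [(0, len(y))]  # active segments as half-open ranges [start, end)
    candidates = [best_split_nd(y, 0, len(y))]
    breakpoints = []
    for _ in range(K):
        scored = [(cand[0], idx) for idx, cand in enumerate(candidates)
                  if cand is not None]
        if not scored:
            break
        _, idx = max(scored)  # segment whose best split brings the largest gain
        start, end = segments[idx]
        t = candidates[idx][1]
        breakpoints.append(t)
        # Replace the split segment by its two halves and score only those.
        segments[idx:idx + 1] = [(start, t), (t, end)]
        candidates[idx:idx + 1] = [best_split_nd(y, start, t),
                                   best_split_nd(y, t, end)]
    return sorted(breakpoints)
```

Only the two new half-segments are rescored after each split, since the best split of every other segment is unchanged.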


Simulated data creation

How did we create the simulated data? From a real data set.

For each technology (Illumina or Affymetrix) we have several data sets with various levels of contamination by normal cells:
- Illumina: 34, 50, 79 and 100% of tumor cells
- Affymetrix: 30, 50, 70 and 100% of tumor cells

Breakpoints are known.
The states of the segments are also known.
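The slides do not detail how profiles are generated from the annotated data. One plausible reading, given here purely as an assumption, is to draw breakpoint positions at random, assign a state to each segment, and resample observations from the real data points annotated with that state. The function name `simulate_from_pools`, the `pools` structure and the state labels in the usage note are all hypothetical.

```python
import random


def simulate_from_pools(pools, n, K, seed=0):
    """Build a profile of length n with K breakpoints by resampling.

    pools: dict mapping a state label to a list of (c, b) observations taken
    from regions of the real data annotated with that state (assumed input).
    Returns (profile, breakpoints, segment_states).
    """
    states = sorted(pools)
    if len(states) < 2:
        raise ValueError("need at least two states to create breakpoints")
    rng = random.Random(seed)
    # Breakpoints are 0-based indices of the first point of each new segment.
    breakpoints = sorted(rng.sample(range(1, n), K))
    bounds = [0] + breakpoints + [n]
    profile, segment_states, previous = [], [], None
    for k in range(K + 1):
        # Force a change of state so that every breakpoint is a real one.
        state = rng.choice([s for s in states if s != previous])
        segment_states.append(state)
        for _ in range(bounds[k], bounds[k + 1]):
            profile.append(rng.choice(pools[state]))
        previous = state
    return profile, breakpoints, segment_states
```

A call such as `simulate_from_pools({"normal": [...], "gain": [...]}, n=5000, K=5)` would then yield a profile whose breakpoints and segment states are known by construction.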

Affymetrix
[Figure: example profiles]

Illumina
[Figure: example profiles]


ROC curves

Illumina: using 2 dimensions provides good results.
100 profiles, n = 5000, K = 5, purity = 79%, precision = 1
[Figure: ROC curves]


Illumina: univariate methods are as good as bivariate ones.
100 profiles, n = 5000, K = 5, purity = 100%, precision = 1
[Figure: ROC curves]
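A rough sketch of how such ROC points could be computed for one simulated profile is given below. It assumes that "precision" denotes the tolerance, in number of positions, within which a detected breakpoint counts as a true positive; the slides do not define it, so this reading and the names `score_breakpoints` and `roc_points` are assumptions.

```python
def score_breakpoints(found, truth, tolerance=1):
    """Return (TP, FP) for a set of detected breakpoints.

    TP: true breakpoints with at least one detection within `tolerance`.
    FP: detections that are not within `tolerance` of any true breakpoint.
    """
    tp = sum(1 for t in truth if any(abs(f - t) <= tolerance for f in found))
    fp = sum(1 for f in found if all(abs(f - t) > tolerance for t in truth))
    return tp, fp


def roc_points(ranked, truth, tolerance=1):
    """One (TP, FP) pair per model size.

    ranked: breakpoints in the order in which a method adds them, so that
    ranked[:k] is the method's answer when asked for k breakpoints.
    """
    return [score_breakpoints(ranked[:k], truth, tolerance)
            for k in range(len(ranked) + 1)]
```

Averaging these pairs over the simulated profiles, for each number of breakpoints, would give one curve per method.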


Results
- Creation of realistic simulated data
- R package development: jointseg on R-forge, https://r-forge.r-project.org/r/?group_id=1562
- Bivariate methods are not uniformly better than univariate ones
- No single method is superior

Perspectives
- Kernel approaches
- Labelling
- Other applications (several profiles, methylation data)

Thanks to Pierre Neuvial, Guillem Rigaill and Cyril Dalmasso

K. Bleakley and J.-P. Vert. The group fused lasso for multiple change-point detection. Technical report, Mines ParisTech, 2011.
Z. Harchaoui and C. Lévy-Leduc. Catching change-points with lasso. Advances in Neural Information Processing Systems, 2008.
G. Rigaill. Pruned dynamic programming for optimal multiple change-point detection. Technical report, http://arxiv.org/abs/1004.0887, 2010.
G. Rigaill, E. Lebarbier, and S. Robin. Exact posterior distributions and model selection criteria for multiple change-point detection problems. Statistics and Computing, 2012.
J.-P. Vert and K. Bleakley. Fast detection of multiple change-points shared by many signals using group LARS. Advances in Neural Information Processing Systems, 2010.
F. Picard, E. Lebarbier, M. Hoebeke, G. Rigaill, B. Thiam, and S. Robin. Joint segmentation, calling and normalization of multiple CGH profiles. Biostatistics, 2011.
H. Chen, H. Xing, and N. R. Zhang. Estimation of parent specific DNA copy number in tumors using high-density genotyping arrays. PLoS Comput Biol, 2011.
A. B. Olshen, E. S. Venkatraman, R. Lucito, and M. Wigler. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics, 2004.
N. R. Zhang, D. O. Siegmund, H. Ji, and J. Z. Li. Detecting simultaneous changepoints in multiple sequences. Biometrika, 2010.
T. L. Lai, H. Xing, and N. Zhang. Stochastic segmentation models for array-based comparative genomic hybridization data analysis. Biostatistics, 2008.
N. R. Zhang, Y. Senbabaoglu, and J. Z. Li. Joint estimation of DNA copy number from multiple platforms. Bioinformatics, 2010.
S. Gey and E. Lebarbier. Using CART to detect multiple change points in the mean for large sample. Statistics for Systems Biology research group, 2008.
A. B. Olshen, H. Bengtsson, P. Neuvial, P. T. Spellman, R. A. Olshen, and V. E. Seshan. Parent-specific copy number in paired tumor-normal studies using circular binary segmentation. Bioinformatics, 2011.
D. Mosen-Ansorena and A. M. Aransay. Bivariate segmentation of SNP-array data for allele-specific copy number analysis in tumour samples. BMC Bioinformatics, 2013.