Understanding DNA Copy Number Data Adam B. Olshen Department of Epidemiology and Biostatistics Helen Diller Family Comprehensive Cancer Center University of California, San Francisco http://cc.ucsf.edu/people/olshena_adam.php May 20, 2010
Background
The DNA sequence copy number at any locus in a genome is the number of copies of genomic DNA at that locus. The normal copy number is two for human autosomes.
Copy number alterations are gains or losses of DNA. They modify the function and/or expression of genes.
They are common in cancer: copy number is
- Increased at sites of oncogenes;
- Decreased at sites of tumor suppressor genes.
Array CGH: Hybridization and Analysis
[Figure: forward and reverse hybridization. Reference DNA and tumor DNA are labeled with Cy3 and Cy5 (dyes swapped between the forward and reverse hybridizations), co-hybridized to genomic markers (BAC, cDNA, or oligo), and scanned; markers show normal, gain, or loss.]
Array analysis: the resulting data are normalized log test-over-reference intensities for genomic markers.
DNA Copy Number Arrays-Array CGH
Platform     # Colors   Type of Markers    # of Markers          Commercial
BAC          2          BACs (100-200kb)   1000-3000; 31,000     Yes
cDNA         2          cDNAs (1kb)        1000-20,000           No
ROMA         2          long oligo         85,000; 400,000       No
Agilent      2          long oligo         244,000; 1,000,000    Yes
Nimblegen    2          long oligo         400,000               Yes
Affymetrix   1          short oligo        500,000; 1,800,000    Yes
Mips         1          short oligo        50,000                Yes
Illumina                beads              1,200,000             Yes
Review: Pinkel & Albertson (2005, Nature Genetics)
Example Breast ROMA array with 9820 probes
[Figure: log2(Test/Reference) versus genomic position, with reference lines marking 1 to 8 copies.]
Additional Complications 1. Tumor samples are a mixture of tumor and normal cells, and the amount of normal contamination is often unknown. 2. Tumor cells are not homogeneous. Gains and losses may occur in differing proportions of cells. 3. There is (usually) no gold standard of similar resolution.
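To illustrate the first complication, here is a minimal sketch of how normal contamination shrinks the observed log ratio. It assumes a simple linear mixing model in which the measured intensity is the cell-fraction-weighted average of tumor and normal copy number; that model, the function name `expected_log2_ratio`, and its parameters are illustrative assumptions, not something stated on the slides.

```python
import math

def expected_log2_ratio(tumor_copies, tumor_fraction=1.0, normal_copies=2):
    """Expected log2(test/reference) under an assumed linear mixing model.

    tumor_fraction is the (hypothetical, usually unknown) proportion of
    tumor cells in the sample; the remainder is taken to be normal cells
    carrying normal_copies copies.
    """
    mixed = tumor_fraction * tumor_copies + (1.0 - tumor_fraction) * normal_copies
    # Note: a homozygous deletion in a pure tumor sample gives mixed == 0,
    # for which the log ratio is undefined (math.log2 raises ValueError).
    return math.log2(mixed / normal_copies)

# A four-copy gain in a pure tumor versus the same gain at 50% tumor content.
pure = expected_log2_ratio(4)
diluted = expected_log2_ratio(4, tumor_fraction=0.5)
```

Under this assumed model the same four-copy gain moves from a log ratio of 1 toward 0 as the tumor fraction drops, which is why a fixed threshold on the log ratio cannot be read directly as a copy number.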
Analysis Goals 1. Reconstruct the copy number state for the entire genome 2. Identify the regions of gain and loss 3. Divide the genome into regions of equal copy number
Analysis Approaches Smoothing (Hsu et al.; Tibshirani and Wang) Hidden Markov models (Fridlyand et al.; Guha et al.) Segmentation (Picard et al.; Venkatraman and Olshen) Methods compared by Lai et al. (2005, Bioinformatics)
Segmentation Approach
[Figure: log2(T/R) versus position in chromosome, with segments labeled as gains and losses.]
Circular Binary Segmentation
View the data as if on a circle and segment into two arcs; hence the name circular binary segmentation (CBS). This results in two or three segments of the original data.
Test statistic: $T = \max_{0 < i < j \le m} |T_{ij}|$, where
$$T_{ij} = \frac{\bar{Y}_{ij} - \bar{Z}_{ij}}{s_{ij}\sqrt{(j-i)^{-1} + (m-j+i)^{-1}}},$$
$\bar{Y}_{ij} = (X_{i+1} + \dots + X_j)/(j-i)$, $\bar{Z}_{ij} = (X_1 + \dots + X_i + X_{j+1} + \dots + X_m)/(m-j+i)$, and $s_{ij}^2$ is the corresponding mean square error.
Split the data if $P(T > T_{obs}) \le \alpha$ (probability under the null of no change-point) and recurse until no further splits. Estimate the probability by permutation.
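The procedure on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the DNAcopy implementation: the function names `max_t` and `cbs`, the defaults, the use of a pooled two-arc mean square error with m - 2 degrees of freedom, and the choice to count permutations with a statistic at least as large as the observed one are all assumptions made for the sketch, and it has none of the speed-ups of the faster hybrid method.

```python
import math
import random
from itertools import accumulate

def max_t(x):
    """Return (T, i, j) maximizing |T_ij| over 0 < i < j <= m (needs m >= 3).

    The inner arc is x[i:j], i.e. X_{i+1}..X_j in 1-based notation; the
    outer arc is everything else. Returns (0.0, None, None) for flat data.
    """
    m = len(x)
    csum = [0.0] + list(accumulate(x))
    mean = csum[m] / m
    ss_total = sum((v - mean) ** 2 for v in x)
    best = (0.0, None, None)
    for i in range(1, m):
        for j in range(i + 1, m + 1):
            k = j - i                              # inner arc length
            inner = csum[j] - csum[i]
            ybar = inner / k
            zbar = (csum[m] - inner) / (m - k)
            # Between-arc sum of squares; the remainder is within-arc.
            between = (ybar - zbar) ** 2 * k * (m - k) / m
            # Pooled mean square error, floored to avoid division by zero.
            mse = max(ss_total - between, 1e-12) / (m - 2)
            t = math.sqrt(between / mse)           # equals |T_ij|
            if t > best[0]:
                best = (t, i, j)
    return best

def cbs(x, alpha=0.01, n_perm=200, rng=None, offset=0):
    """Recursively split x; return half-open (start, end) segments."""
    if rng is None:
        rng = random.Random(0)                     # fixed seed: an assumption
    x = list(x)
    m = len(x)
    whole = [(offset, offset + m)]
    if m < 4:
        return whole
    t_obs, i, j = max_t(x)
    if i is None:                                  # no variation, nothing to split
        return whole
    # Permutation estimate of the null probability of a statistic this large.
    hits = sum(max_t(rng.sample(x, m))[0] >= t_obs for _ in range(n_perm))
    if hits / n_perm > alpha:
        return whole
    segments = []
    for lo, hi in ((0, i), (i, j), (j, m)):        # two or three pieces
        if hi > lo:
            segments += cbs(x[lo:hi], alpha, n_perm, rng, offset + lo)
    return segments

# Toy profile: a 20-marker gain inside an otherwise flat chromosome.
toy = [0.0] * 40 + [1.0] * 20 + [0.0] * 40
toy_segments = cbs(toy, n_perm=50)
```

The double loop makes each evaluation quadratic in the number of markers, and every permutation repeats it, which is the cost that motivates a faster algorithm for arrays with many probes.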
Permutations
[Figure: four panels of log2(T/R) versus position in chromosome. Real data, T=30.9; Permuted 1, T=3.1; Permuted 2, T=3.1; Permuted 3, T=2.9.]
Circular?
We Made it Faster
9820 probes, max probe count 824, time: 347s vs 13s.
[Figure: log2(T/R) versus genomic position. Cyan: permutation. Black: hybrid with early stopping.]
Gains and Losses via Plateau Plot
Gains and Losses via Plateau Plot
Looking Across Samples
Clustering Samples
Defining Regions
[Figure: patients versus markers.]
Defining Regions cont.
[Figure: patients versus markers.]
Defining Regions cont.
Identifying regions of high-frequency gain or loss: minimal common regions (MCRs)
- Genomic Identification of Significant Targets in Cancer (GISTIC): integration of magnitude as well as frequencies (Beroukhim et al., PNAS, 2007)
- Significance Testing for Aberrant Copy number (STAC) (Diskin et al., Genome Res., 2006)
- Multiple Sample Analysis (MSA) (Guttman et al., PLoS Genetics, 2007)
Advanced Topics 1. SNP Arrays, Genotyping and Allele-Specific Copy Number 2. Copy Number Variation 3. Clonality
SNP Arrays and Genotyping
SNP arrays can be used for traditional copy number analysis. They can also be used for genotyping:
- genome-wide association studies
- integration of copy number and loss of heterozygosity (LOH)
Two Regions of Normal Copy Number?
[Figure: two panels of copy number versus position.]
Two Regions of Normal Copy Number?
[Figure: two panels of copy number versus position, each with a lower track marking whether each SNP is a homozygote (TRUE/FALSE) versus position.]
Allele-Specific Copy Number
Traditional methods measure the sum of the copy numbers from the two parental chromosomes. This is total copy number.
SNP arrays can be used for allele-specific, i.e. parent-specific, copy number (PSCN).
A copy number of 2 for a SNP could mean 1+1 (normal) or 0+2 (both altered).
Raw Data
[Figure: three panels of square-root copy number versus genomic position: total copy number, A copy number, and B copy number.]
PSCBS Segmentation
An Example
Sample data for PSCN of 1 (paternal chrom.) and 2 (maternal chrom.)

SNP        1  2  3  4  5  6  7  8
Maternal   A  A  A  B  B  B  A  A
Paternal   A  B  A  B  A  A  A  A

SNP   Genotype   A Copy Number   B Copy Number   Total Copy Number
1     AA         3.3             0               3.3
2     AB         1.9             1.1             3.0
3     AA         2.9             0               2.9
4     BB         0               3.0             3.0
5     AB         0.8             2.1             2.9
6     AB         1.0             2.2             3.2
7     AA         2.9             0               2.9
8     AA         3.0             0               3.0
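The noise-free values behind this example can be derived directly from the two haplotypes. The sketch below is illustrative only: the function name `expected_allele_copies` and its parameters are hypothetical, and it simply counts how many copies of each allele are present when the maternal chromosome has two copies and the paternal chromosome has one.

```python
def expected_allele_copies(maternal, paternal, maternal_cn=2, paternal_cn=1):
    """Return (snp, genotype, A copies, B copies, total) for each SNP.

    maternal and paternal are strings of 'A'/'B' alleles, one per SNP,
    and *_cn are the parent-specific copy numbers of the two chromosomes.
    """
    rows = []
    for snp, (m, p) in enumerate(zip(maternal, paternal), start=1):
        a = maternal_cn * (m == "A") + paternal_cn * (p == "A")
        b = maternal_cn * (m == "B") + paternal_cn * (p == "B")
        genotype = "".join(sorted(m + p))   # germline genotype: AA, AB or BB
        rows.append((snp, genotype, a, b, a + b))
    return rows

# Haplotypes taken from the example slide.
rows = expected_allele_copies("AAABBBAA", "ABABAAAA")
```

Comparing these expected counts with the measured values shows that the heterozygous SNPs (2, 5 and 6) carry the information about which parental chromosome was gained, while the homozygous SNPs only reflect the total.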
PSCBS 1. Genotyping 2. Two Rounds of Segmentation 3. Calling Copy Number
Finding Heterozygotes
[Figure: minimum of A allele and B allele (axis from 0.0 to 3.0), with heterozygotes and homozygotes labeled.]
Minimize p l, where p is the number of points in the window and l is the length of the window.
PSCBS 1. Genotyping 2. Two Rounds of Segmentation 3. Calling Copy Number
First Round of Segmentation If the total copy number has not changed, generally neither PSCN has changed. Start by running CBS on total copy number.
Second Round of Segmentation
Second Round of Segmentation
The B-allele frequency (BAF) is B/(A+B) = B/Total CN.
If PSCN changes, so should the BAF of the heterozygotes (based on paired normal).
We use BAF from TumorBoost (Bengtsson et al., in press), which adjusts the tumor BAF based on the paired normal BAF.
Segment on the decrease in heterozygosity $\rho = 2\,|\mathrm{BAF} - 0.5|$.
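As a small worked illustration of these two quantities, the sketch below computes the raw BAF and the decrease in heterozygosity for the three heterozygous SNPs of the earlier example slide. The function names are hypothetical, and the BAF here is the unadjusted B/(A+B); the TumorBoost adjustment against the paired normal is not reproduced.

```python
def b_allele_frequency(a_cn, b_cn):
    """Raw BAF = B / (A + B); no TumorBoost-style normalization is applied."""
    return b_cn / (a_cn + b_cn)

def decrease_in_heterozygosity(baf):
    """rho = 2 * |BAF - 0.5|: 0 for balanced alleles, 1 for complete LOH."""
    return 2.0 * abs(baf - 0.5)

# (A, B) copy number estimates for heterozygous SNPs 2, 5 and 6 of the example.
het_snps = [(1.9, 1.1), (0.8, 2.1), (1.0, 2.2)]
rho = [decrease_in_heterozygosity(b_allele_frequency(a, b)) for a, b in het_snps]
```

For parent-specific copy numbers of 1 and 2 the noise-free BAF of a heterozygote is 1/3 or 2/3, so both cases map to the same value of rho, 1/3; folding the BAF around 0.5 is what lets a single mean-shift segmentation be run on the heterozygotes.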
Second Round of Segmentation
PSCBS 1. Genotyping 2. Two Rounds of Segmentation 3. Calling Copy Number
Copy Number within Segments
Four distinct clusters if unequal parent-specific copy number, three if equal copy number, two if LOH.
[Figure: B allele copy number versus A allele copy number for chromosome arms 8p, 12q, and 1p.]
Densities of BAFs
Parent-Specific CBS Algorithm
1. Assign genotypes based on minimum allele.
2. Run CBS on total copy number to get an estimate of the sum of the parent-specific copy numbers.
3. Run CBS within segments on $2\,|\mathrm{BAF} - 0.5|$ of heterozygotes.
4. Test for regions of LOH.
5. Test whether the parent-specific copy numbers are equal, which is the case if the mode of the BAF is 0.5.
6. If not LOH and unequal, estimate the difference of the parent-specific copy numbers within every CBS segment using only the heterozygotes.
7. From 2) and 6) estimate the parent-specific copy numbers.
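The last step combines a sum and a difference, which can be sketched as follows. The slides do not give the estimator, so this is an assumption: for heterozygous SNPs, 2|BAF - 0.5| equals the absolute difference of the two parent-specific copy numbers divided by their sum, so the difference is taken here as the segment's decrease in heterozygosity times its total copy number. The function name `parent_specific_cn` is hypothetical.

```python
def parent_specific_cn(total_cn, rho):
    """Return (minor, major) parent-specific copy numbers for one segment.

    total_cn is the segment's total copy number (step 2) and rho is its
    decrease in heterozygosity, 2 * |BAF - 0.5|, among heterozygotes.
    Assumes difference = rho * total_cn (see the lead-in paragraph).
    """
    difference = rho * total_cn
    minor = (total_cn - difference) / 2.0
    major = (total_cn + difference) / 2.0
    return minor, major

# A segment with total copy number 3 and rho = 1/3, as in the example slide.
minor, major = parent_specific_cn(3.0, 1.0 / 3.0)
```

Only the unordered pair is identified this way: the data say that one parental chromosome has one copy and the other has two, not which parent contributed which.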
Reconstruction
Advanced Topics 1. SNP Arrays, Genotyping and Allele-Specific Copy Number 2. Copy Number Variation 3. Clonality
Copy Number Variation
Copy number variations are germline gains or losses larger than 1 kb (Redon et al., 2006).
They have been associated with familial cancer (Lucito et al., 2007) and other complex diseases (Sebat et al., 2007).
When analyzing cancer samples it is important to distinguish between variations and cancer aberrations.
Large regions of gain or loss are aberrations; small regions could be either.
The Cancer Genome Atlas (TCGA)
TCGA is an NIH-funded project whose goal is a better understanding of cancer through large-scale sequencing.
To decide which genes to sequence, Cancer Genome Characterization Centers (CGCCs) were set up.
Each CGCC applies a different array technology to the same hundreds of glioblastoma, lung, and ovarian samples.
TCGA samples are supposed to have matching normals.
ACGH Matched Sample
[Figure: two panels, tumor and normal, of log2(Test/Reference) with reference lines marking 1 to 8 copies.]
Current Methods for Handling CNVs
Compile data from the Toronto database (http://projects.tcag.ca/variation/). Then:
- Screen out CNV regions from analysis
- Screen out regions within samples that may be due to CNVs
- Screen out regions defined by multiple samples that may be due to CNVs
- Ignore the problem completely
Predicting Copy Number Variation Segment both the tumor and matching normal sample. An observation is a small region of gain or loss in the tumor. The class label is whether the region is a CNV in the normal sample. The predictors are whether the region was found in the literature, the length of the region, etc. We used 43 matched pairs for training and 36 as a test set. Once the classification model is fit, it can be used to predict in cases where there is no matching sample.
Univariate Results
Significant predictors:
- Segment mean
- Gains and losses
- Other patients
- Literature (count)
- Literature (subjects)
- Length
- Segmental duplication region
Not significant predictors:
- Near centromere
- Near telomere
Multivariate Results
[Figure: classification tree with splits on in.lit (at 2.606 and at 6.073), absmeansd (at 3.359), and segdup (at 0.5); leaves are labeled 0 or 1.]
Multivariate Results
- Best in literature models: 80% accurate
- CART models: 82% accurate
- Random Forests models: 84% accurate
- Matched normals: 94% accurate (no CNVs)
- Normals: 70% accurate (all CNVs)
Advanced Topics 1. SNP Arrays, Genotyping and Allele-Specific Copy Number 2. Copy Number Variation 3. Clonality
Second Cancers
When a second cancer appears, it could be a
1. Second primary - independent origin
2. Metastasis - clonal origin
Distinguishing between the two has clinical importance: a metastasis is more serious.
Pathological decision criteria include histological type, stage, and anatomic location.
Molecular markers may improve decision making.
Begg and others have used LOH data for this purpose.
Can array DNA copy number data be used?
LOH and Clonality
         Clonal Case    Ambiguous Case    Independent Case
Locus    L     R        L     R           L     R
1p       -     -        -     -           -     -
1q       -     -        -     -           -     +
3p       +     +        -     +           +     -
5q       -     -        -     +           -     -
6q       +     +        +     +           +     +
8p       -     -        +     +           NA    NA
11p      NA    NA       NA    NA          NA    NA
11q      -     -        -     -           -     -
13q      -     -        +     +           +     -
16q      +     +        +     +           -     -
17p      -     -        -     -           -     +
17q      +     +        +     +           +     +
18q      -     -        +     -           -     +
22q      -     -        +     +           NA    NA
DNA Copy Number Example
[Figure: two panels of copy number values by chromosome arm (genomic order).]
Lung Data
20 patients with paired non-small cell lung cancers (Girard et al., 2009).
Samples were hybridized to the Agilent 244K array.
Clinical and mutation data were included.
The clinical decision is whether the tumors are second primaries or metastases. Patients with second primaries may be spared adjuvant chemotherapy.
A Clonal Example (LR2=7.9E+23)
An Indep. Example (LR2=7.3E-06)
An Equivocal Example (LR2=0.3)
Lung Summary Copy number classification contradicted clinical classification in 4 of 20 cases. Additional 4 cases called equivocal by copy number. Three clinical indep. called clonal and supported by matching somatic mutations; one clinical clonal called indep. and no mutation data. Copy number data may be useful in clinical decision making.
Software
CBS can be found in an R package called DNAcopy that is part of Bioconductor (www.bioconductor.org).
PSCBS can be found in the package PSCBS on R-Forge (r-forge.r-project.org).
CNV and clonality methods can be found on Irina Ostrovnaya's web page.
References
1. Begg, C. et al. (2006). Statistical Tests for Clonality. Biometrics.
2. Beroukhim, R. et al. (2007). Assessing the significance of chromosomal aberrations in cancer: Methodology and application to glioma. PNAS 104 20007-20012.
3. Diskin, S. et al. (2006). STAC: A method for testing the significance of DNA copy number aberrations across multiple array-CGH experiments. Genome Res. 16 1149-1158.
4. Fridlyand, J. et al. (2004). Understanding Array CGH data. JMVA 90 132-153.
5. Girard, N. et al. (2009). Genomic and mutational profiling to assess clonal relationships between multiple non-small cell lung cancers. Clin Cancer Res. 15 5184-5190.
6. Guha, S. et al. (2006). Bayesian Hidden Markov Modeling of Array CGH Data. Harvard University Biostatistics Working Paper Series. Working paper 24.
7. Guttman, M. et al. (2007). Assessing the significance of conserved genomic aberrations using high resolution genomic microarrays. PLoS Genetics 3 e143.
8. Hardenbol, P. et al. (2003). Multiplexed genotyping with sequence-tagged molecular inversion probes. Nat. Biotechnol. 21 673-678.
9. Hsu, L. et al. (2005). Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics 6 211-226.
10. Lai, W.R. et al. (2005). Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 21 3763-3770.
11. Lucito, R. et al. (2007). Copy-number variants in patients with a strong family history of pancreatic cancer. Cancer Biol Ther. Epub ahead of print.
References cont.
12. Olshen, A. et al. (2010). Extension of circular binary segmentation to parent-specific copy number. Submitted.
13. Olshen, A., Venkatraman, E., Lucito, R. and Wigler, M. (2004). Circular Binary Segmentation for the analysis of array-based DNA copy number data. Biostatistics 5 557-572.
14. Ostrovnaya, I. et al. (2010). A metastasis or a second independent cancer? Evaluating the clonal origin of tumors using array copy number data. Stat Med. Epub ahead of print.
15. Ostrovnaya, I. et al. (2010). A classification model for distinguishing copy number variants from cancer-related alterations. BMC Bioinformatics. In press.
16. Picard, F. (2005). A statistical approach for array CGH data analysis. BMC Bioinformatics 6 27.
17. Pinkel, D. and Albertson, D.G. (2005). Array comparative genomic hybridization and its applications in cancer. Nat Genet. 37 S11-S17.
18. Redon, R. et al. (2006). Global variation in copy number in the human genome. Nature 444 444-454.
19. Sebat, J. et al. (2007). Strong association of de novo copy number mutations with autism. Science 316 445-449.
20. Tibshirani, R. and Wang, P. (2008). Spatial smoothing and hotspot detection for CGH data using the fused lasso. Biostatistics 9 18-29.
21. Venkatraman, E.S. and Olshen, A.B. (2007). A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 23 657-663.