The Loss of Heterozygosity (LOH) Algorithm in Genotyping Console 2.0

Introduction

Loss of heterozygosity (LOH) represents the loss of allelic differences. The SNP markers on the SNP Array 6.0 can be used to detect LOH. Specifically, the genotypes for each SNP marker are used to find regions with large numbers of homozygous genotype calls.

There are two distinct copy number/LOH algorithms in Genotyping Console 2.0 (GTC 2.0):

1. Copy number/LOH algorithm using a predefined reference model file. This utilizes a single-sample workflow that genotypes on the fly using an algorithm similar to the BRLMM-P-plus genotyping algorithm in Affymetrix Power Tools, which uses a no-call confidence threshold of 0.05.
2. Copy number/LOH algorithm in which both test and reference samples are specified. This utilizes a batch workflow that calls the Birdseed genotyping algorithm with a no-call confidence threshold of 0.1 to find the genotypes.

This white paper outlines the proof-of-concept work performed during the development of the LOH algorithms in the context of the first workflow above.

Results and Discussion

The performance of the LOH algorithm was evaluated using genotypes that were computed in the following manner. First, a reference quantile normalization distribution and a set of PLIER feature effects were determined from 270 HapMap sample CEL files using Affymetrix Power Tools (APT). Then, each sample CEL file was individually processed using the APT BRLMM-P-plus implementation with the default parameters (this includes a no-call confidence threshold of 0.05), specifying the pre-computed normalization distribution and feature effects. This procedure is analogous to the GTC 2.0 single-sample workflow. In practice, no-call rates and call accuracy are driven by sample quality rather than by differences between the genotyping algorithms; hence, we expect the conclusions in this white paper to be independent of the genotyping algorithm.
The algorithm frames the LOH problem in terms of a statistical hypothesis test. Given a specific region containing $N$ SNP markers, of which $n_{het}$ have heterozygous and $n_{hom}$ have homozygous genotype calls, decide between the following two hypotheses:

1. Null: Region is LOH
2. Alternative: Region is non-LOH

Treat the SNP markers as independent binomial variables that can be in one of two states: a heterozygous (AB) or a homozygous (AA or BB) genotype call. No-call SNP markers are ignored. Let the probability of making a heterozygous call at any position along the genome be $p_{het}$ and the heterozygous rate in a homozygous region be $p_{err}$. Assume a significance level of $\alpha$ and a power of $1 - \beta$. Two important values are chosen based on these quantities. The first is $N$, the number of markers in a region, and the second is $n_{crit}$, the smallest number of heterozygous calls that can be observed before we must conclude that a region is not LOH.

An iterative procedure is used to estimate these quantities. Specifically, first $n_{crit}$ is estimated by choosing the smallest $n_{crit}$ such that

$$P(X < n_{crit} \mid N, p_{err}) = 1 - \alpha$$

and then $N$ is estimated via the solution of

$$\frac{n_{crit} - N p_{het}}{\sqrt{N p_{het} (1 - p_{het})}} = Z^{-1}(\beta)$$

where $Z^{-1}(\beta)$ is the inverse of the standard normal cumulative distribution function. These two steps are iterated for a fixed number of iterations or until $N$ does not change, whichever occurs first. To simplify things, if the final $N$ is not odd we increase it by 1 to ensure that it is odd.

To decide between the two hypotheses, the number of heterozygous calls $n_{het}$ is compared with the critical value $n_{crit}$. If $n_{het} \geq n_{crit}$, decide for the alternative hypothesis that there is no LOH. If there are not a sufficient number of heterozygous calls, the decision is made in favor of LOH. Thus, the algorithm is very simple, consisting merely of counting how many heterozygous calls are in a sequence of $N$ consecutive genotype calls and comparing with an appropriate cut-off value.

Figure 1 visually demonstrates two different regions on different sections of the genome. The one where there are few heterozygous SNP genotype calls is called LOH. The other, where there are many heterozygous SNP genotype calls, is not called LOH.
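The iterative estimate of the window size and the cut-off can be sketched as follows. This is a minimal illustration under stated assumptions, not the Genotyping Console implementation: the names `p_het`, `p_err`, `n_crit`, `estimate_window`, the starting window size and the iteration cap are labels and choices made for this sketch; the equality in the first condition is read as "at least 1 - alpha" because the binomial distribution is discrete; and the cut-off is recomputed once after the window size is made odd.

```python
from math import ceil, comb, sqrt
from statistics import NormalDist


def prob_fewer_than(k, n, p):
    """P(X < k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(min(k, n + 1)))


def critical_count(n_markers, p_err, alpha):
    """Smallest n_crit with P(X < n_crit | N, p_err) >= 1 - alpha."""
    for k in range(n_markers + 2):
        if prob_fewer_than(k, n_markers, p_err) >= 1 - alpha:
            return k
    return n_markers + 1


def estimate_window(p_het, p_err, alpha, beta, n_start=10, max_iter=50):
    """Iterate between n_crit and N; assumes 0 < p_het < 1."""
    z = NormalDist().inv_cdf(beta)       # Z^{-1}(beta), negative for beta < 0.5
    c = sqrt(p_het * (1 - p_het))
    n_markers = n_start                  # assumed starting guess
    for _ in range(max_iter):
        n_crit = critical_count(n_markers, p_err, alpha)
        # Solve (n_crit - N*p_het) / sqrt(N*p_het*(1-p_het)) = z for sqrt(N):
        # p_het*s^2 + z*c*s - n_crit = 0, taking the positive root.
        s = (-z * c + sqrt(z * z * c * c + 4 * p_het * n_crit)) / (2 * p_het)
        new_n = max(1, ceil(s * s))
        if new_n == n_markers:
            break
        n_markers = new_n
    if n_markers % 2 == 0:               # force an odd window size
        n_markers += 1
    # Assumption of this sketch: recompute the cut-off for the final N.
    return n_markers, critical_count(n_markers, p_err, alpha)
```

A call such as `estimate_window(p_het=0.3, p_err=0.05, alpha=0.001, beta=0.005)` returns the pair `(N, n_crit)`; the value 0.3 for `p_het` is purely illustrative.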

Figure 1: Demonstrating how LOH calls are made based on counting the number of heterozygous SNP calls in a given window for two sections of a genome. Hypothetical values are chosen for the window size ($N = 11$) and the critical value ($n_{crit} = 2$). A) The region is called LOH ($n_{het} = 1$, LOH = TRUE) because there are fewer heterozygous SNP calls (1) than the cut-off (2). B) The region is not called LOH ($n_{het} = 5$, LOH = FALSE) because the number of heterozygous SNP calls (5) exceeds the critical value (2).

Figure 2 demonstrates how the procedure works for the entire set of SNP markers. The algorithm is applied to the genome by sliding a moving window of $N$ SNPs along each chromosome. The window is centered at each SNP position along the genome, and in each case the number of heterozygous genotype calls in the window is counted and a decision is made as to whether that SNP position is in the LOH state or the non-LOH state.

The transition between the states represents a special situation. When transitioning from the non-LOH state to the LOH state, a number of previously evaluated SNPs are also marked as LOH. Similarly, when transitioning from the LOH state to the non-LOH state, SNPs immediately adjacent along the genome can also be expected to be LOH and are marked as such. A special set of rules is used for these back and forward filling operations. In particular, starting at the central position, move toward the end of the window, stopping either when the end of the window is reached or at the last homozygous call before the second heterozygous call is reached. This second rule helps to prevent situations where a single heterozygous call divides two long stretches of homozygous calls. Additionally, it helps to estimate the boundaries of the LOH region more accurately.
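The per-position window decision described above can be sketched as below. This is a simplified illustration, not the shipped algorithm: it deliberately omits the back and forward filling rules, it assumes no-call markers have already been dropped from the input, and it truncates the window at the chromosome ends (the source does not say how ends are handled). The function name `window_calls` and the genotype strings are hypothetical.

```python
def window_calls(genotypes, n_markers, n_crit):
    """Return one boolean per SNP: True if the centered window is called LOH.

    genotypes: list of "AA", "AB" or "BB" calls in genomic order,
               with no-calls already removed (assumption of this sketch).
    n_markers: odd window size N.
    n_crit:    cut-off; a window with n_het >= n_crit is called non-LOH.
    """
    half = n_markers // 2
    is_het = [g == "AB" for g in genotypes]
    calls = []
    for i in range(len(is_het)):
        lo = max(0, i - half)                  # truncate at chromosome start
        hi = min(len(is_het), i + half + 1)    # truncate at chromosome end
        n_het = sum(is_het[lo:hi])
        calls.append(n_het < n_crit)
    return calls
```

With the hypothetical values from Figure 1, `window_calls(genotypes, 11, 2)` marks a position as LOH only when its 11-SNP window holds at most one heterozygous call.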

Figure 2: An illustration of how the LOH algorithm proceeds along the genome and how the transitions between LOH and non-LOH regions are handled, with $N = 11$ and $n_{crit} = 2$ (legend: heterozygous SNP, homozygous SNP, non-LOH, LOH).
A) The window contains three heterozygous SNPs, so call non-LOH at that position.
B) Moving to the next SNP position, the window now contains two heterozygous SNPs, so call non-LOH at that position.
C) At the next SNP position there is only a single heterozygous SNP, so the position is called LOH.
D) Because the state has transitioned to LOH, back fill the window to also mark it as LOH.
E) Continuing through the next five SNP positions along the genome, all are called LOH.
F) Now there are two heterozygous markers in the window, so the call for this position would be non-LOH, but it is not made because a transition is occurring.
G) Instead, forward fill the window until the last homozygous marker before the second heterozygous marker is reached, then move the window center to the next uncalled position along the genome.
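The stopping rule for back and forward filling can be sketched as a small helper. This is one possible reading of the rule, offered as an assumption: heterozygous calls are counted starting at the central position itself, a single heterozygous call inside the filled stretch is tolerated, and the fill also stops at the chromosome end. The name `fill_end` and its parameters are hypothetical.

```python
def fill_end(is_het, center, half_width, step):
    """Index of the last SNP to mark as LOH when filling from `center`.

    is_het:     list of booleans, True where the genotype call is heterozygous.
    center:     index of the window's central SNP.
    half_width: number of SNPs on each side of the center (N // 2).
    step:       +1 to fill forward, -1 to fill backward.
    """
    last_hom = center
    hets_seen = 0
    i = center
    for _ in range(half_width + 1):
        if i < 0 or i >= len(is_het):
            break                      # ran off the chromosome
        if is_het[i]:
            hets_seen += 1
            if hets_seen == 2:
                return last_hom        # last homozygous call before 2nd het
        else:
            last_hom = i
        i += step
    return i - step                    # reached the end of the window
```

For a forward fill the caller would mark positions `center` through `fill_end(is_het, center, N // 2, +1)` as LOH; a backward fill uses `step=-1`.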

Figure 3: Unsmoothed log2 copy number ratio (top), LOH (middle) and scaled allelic difference (lower) estimates in the region of a one-copy deletion.

Figure 3 demonstrates the results of running the algorithm on a particular section of the genome where a one-copy deletion is present. The LOH state is either 0, representing no LOH, or 1, representing LOH. Examining the top and middle panels, we see that the LOH region corresponds directly with the region of lower unsmoothed copy number ratio. The lower panel contains scaled allelic differences, helping to confirm that LOH is present. The left-most parts of the LOH region do not seem to correspond directly with the deletion. Instead, close examination of the scaled allelic differences suggests that this is perhaps a copy-neutral LOH region.

Figure 4: A simulation framework for assessing LOH calling algorithms. A 2 Mb region is selected from a normal chromosome not known to have any large deletion-related LOH. Genotypes from a region having a known one-copy deletion are substituted into this 2 Mb region. To ensure that no real copy-neutral LOH is detected, the ordering of the remaining markers on the chromosome is randomized. (Panel labels: normal chromosome; disorder chromosome with disorder region; 2 Mb region with n SNP markers; insert n SNP markers from the disorder region; randomize ordering.)

One method of assessing the performance of the LOH algorithm is via simulation. With a viable method of generating known sections of LOH and known sections of non-LOH, the sensitivity and specificity of the algorithm can be examined. Figure 4 shows the simulation framework used here. Specifically, samples with known, validated one-copy deletions greater than 2 Mb in size were chosen. For each sample, the following procedure was used:

1. Randomly select a 2 Mb region on a chromosome not known to have any large copy number deletions. Call this chromosome the normal chromosome and count how many SNP markers are in the selected 2 Mb region; call this n.
2. From the chromosome having the known deletion (call this the disorder chromosome), select n consecutive SNPs and their genotype calls. If there are fewer than n markers in the disorder region, so that a consecutive set cannot be found, select the n markers from the disorder region randomly with replacement.
3. Replace the genotype calls of the markers in the selected region of the normal chromosome with those from the disorder chromosome.
4. To ensure that any other pre-existing LOH on the normal chromosome is removed, randomly reorder the genotypes of the SNP markers outside the selected 2 Mb region.
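The simulation steps above can be sketched as follows. This is an illustrative sketch under assumptions, not the code used for the white paper: marker positions are taken to be sorted base-pair coordinates on a chromosome longer than 2 Mb, the deletion-region genotypes are assumed non-empty, and the function name `simulate_chromosome` and its parameters are hypothetical.

```python
import random


def simulate_chromosome(normal_positions, normal_genotypes, deletion_genotypes,
                        region_bp=2_000_000, rng=None):
    """Build one simulated chromosome; returns (genotypes, in_region mask)."""
    rng = rng or random.Random()
    # Step 1: pick a random 2 Mb region and count its markers (n).
    start = rng.uniform(normal_positions[0], normal_positions[-1] - region_bp)
    in_region = [start <= pos < start + region_bp for pos in normal_positions]
    n = sum(in_region)
    # Step 2: take n consecutive deletion-region calls if available,
    # otherwise sample n calls with replacement.
    if len(deletion_genotypes) >= n:
        offset = rng.randrange(len(deletion_genotypes) - n + 1)
        inserted = deletion_genotypes[offset:offset + n]
    else:
        inserted = rng.choices(deletion_genotypes, k=n)
    # Step 4: shuffle the genotypes outside the region to break up
    # any pre-existing LOH on the normal chromosome.
    outside = [g for g, inside in zip(normal_genotypes, in_region) if not inside]
    rng.shuffle(outside)
    # Step 3: substitute the deletion-region calls into the selected region.
    inserted_iter, outside_iter = iter(inserted), iter(outside)
    simulated = [next(inserted_iter) if inside else next(outside_iter)
                 for inside in in_region]
    return simulated, in_region
```

The returned mask records which markers lie in the substituted region, so calls made on the simulated genotypes can later be compared against it.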

Note that this has the downside that it removes the normal linkage structure. The LOH algorithm is then applied to this data to examine its performance. Assessment is conducted by repeating the simulation multiple times for each sample. In each case, a marker in the 2 Mb region that is correctly called LOH is a true positive. A marker outside the LOH region that is called LOH is a false positive. The ideal is to have a high true positive rate and a low false positive rate. Each of the algorithm parameters discussed above will have an impact on the results.

Figure 5: True and false positive rates for LOH detection from 100 simulations for each of 15 different samples, using $p_{err} = 0.05$, $\alpha = 0.001$ and $\beta = 0.005$. Horizontal lines indicate 99 percent true positives and 1 percent false positives.

Figure 5 shows the results of running the simulation 100 times for each of 15 different samples. Each sample has a different large known deletion. For this simulation, the same set of parameters was used for each sample: $p_{err} = 0.05$, $\alpha = 0.001$ and $\beta = 0.005$. Note that $p_{het}$ is calculated separately for each sample based on the entirety of its respective genotype calls. Two samples, 1 and 6, performed particularly poorly across all 100 simulations. Note that another two samples, 7 and 8, had simulation results with true positive rates of 0. Closer examination of these two cases showed that the selected 2 Mb region had few SNP markers, 49 and 26 markers, respectively. Because the LOH algorithm described here is implicitly a counting algorithm, LOH in a region with very few markers is difficult to detect correctly. Across all 1,500 simulations the median number of markers in the selected 2 Mb region was 632. The smallest region that had a non-zero true positive rate had 47 SNP markers.

Figure 6: Comparing the overall no-call rate with the heterozygous call rate in the deletion regions. A strong relationship exists.

Making the correct genotype calls in a deletion region could be expected to be more difficult than along a portion of the genome having normal copy number. In particular, in a one-copy region the desirable result would be to have only homozygous calls. But because only a single copy exists in such a region, rather than having AA and BB for a homozygous SNP, only A and B are really present. The lower signal for each allele in this situation makes it more difficult to discriminate between the correct homozygous state and the heterozygous state. This difficulty in making the appropriate genotype call also affects the ability to make the correct decision on the LOH state for each marker.

The $p_{err}$ parameter is used to account for this difficulty. If this parameter is too low, then the true positive rate will be lower. If this parameter is too high, then the false positive rate will be higher. Additionally, this parameter might differ between samples. In general, deletion regions are not known a priori, so this parameter cannot be directly estimated. Instead, another method is used. Figure 6 shows the relationship between the overall no-call rate for each of the 15 samples and the heterozygous call rate in the deletion region. There is a strong relationship between these two values. This suggests that $p_{err}$ can be estimated using the overall no-call rate. In particular, a linear regression fit to the data in Figure 6 could be used. The two points with heterozygous rates of approximately 0.1 correspond to samples 1 and 6.

Figure 7: True and false positive rates for LOH detection from 100 simulations for each of 15 different samples, using $\alpha = 0.001$ and $\beta = 0.001$ and dynamically selected $p_{err}$. Horizontal lines indicate 99 percent true positives and 1 percent false positives.

One procedure for using the overall no-call rate to dynamically select $p_{err}$ on a sample-by-sample basis is as follows. First, use the linear regression proposed above to make an initial estimate of $p_{err}$. If this estimate is less than a given minimum value (say, 0.04), then increase it to the minimum threshold. To accommodate error in estimating $p_{err}$ this way, its potential values are restricted to fall in equally sized steps. This is accomplished by rounding $p_{err}$ up in increments of 0.01 when necessary. Figure 7 shows the results of the simulation when using this procedure for dynamically selecting $p_{err}$. The results for sample 1 and sample 6 have improved considerably when compared to the previous results, indicating that allowing $p_{err}$ to be selected dynamically provided improved results.
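The dynamic selection procedure can be sketched as below. This is only an illustration under assumptions: the source reports no fitted regression coefficients, so `slope` and `intercept` are hypothetical inputs that would have to come from a fit of heterozygous call rate in deletion regions against overall no-call rate; the function name `select_p_err` is likewise invented here. The minimum of 0.04 and the step of 0.01 are the values mentioned in the text.

```python
import math


def select_p_err(no_call_rate, slope, intercept, p_min=0.04, step=0.01):
    """Estimate the heterozygous rate in homozygous regions for one sample.

    slope, intercept: coefficients of a linear fit (hypothetical; not given
                      in the source).
    p_min:            floor applied to the regression estimate.
    step:             estimates are rounded up to a multiple of this value.
    """
    estimate = intercept + slope * no_call_rate
    estimate = max(estimate, p_min)
    # Round up to the next multiple of `step`; the inner round() guards
    # against floating-point noise such as 7.000000000000001.
    steps = math.ceil(round(estimate / step, 9))
    return round(steps * step, 10)
```

For example, with a regression estimate of 0.062 the function returns 0.07, while any estimate below 0.04 is raised to 0.04.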