Lecture #4: Overabundance Analysis and Class Discovery


Topics in Microarray Data Analysis, Winter
November 15, 2004
Lecture #4: Overabundance Analysis and Class Discovery
Lecturer: Doron Lipson
Scribes: Itai Sharon & Tomer Shiran

Contents
1 Differentially Expressed Genes
2 Overabundance
2.1 What is Overabundance?
2.2 Statistical Significance of Overabundance
2.2.1 False Detection Rate
2.2.2 Binomial Surprise Score
3 Class Discovery
3.1 What is Class Discovery?
3.2 Class Discovery Algorithms
3.3 Discovering Additional Classes

1 Differentially Expressed Genes

In the previous lecture we discussed the concept of differentially expressed genes. A gene is given a score based on its ability to differentiate two groups of samples. For example, the expression of a gene in normal lung tissue might differ from its expression in tumor lung tissue.

The threshold number of misclassifications (TNoM) measures how well a simple threshold over a gene's expression values separates the two groups of samples. That is, we search for the threshold value of the gene's expression that best distinguishes the experimental conditions. A gene is scored by the number of misclassifications made by the best threshold that we can find for it. If the expression values of the gene separate the groups perfectly, the gene has a TNoM score of 0. On the other hand, if the two groups are interspersed, the gene has a score that may be close to the size of the smaller group of samples.

2 Overabundance

2.1 What is Overabundance?

When analyzing a data set we must ask: how surprising is the data set? We typically count the genes at different P values (i.e., significance levels) and compare these counts with the counts expected under the null hypothesis (the assumption that the separation of the samples is random). The difference between the observed and expected number of genes at each P value is an estimate of the overabundance of information in the analyzed data set.

Consider the following example:

                                                   Scenario A    Scenario B
Number of genes                                    1000          1000
Number of samples                                  (value missing)
Number of genes with TNoM ≤ 2 (P value = 0.03)     1             100

In this example the P value is 0.03 and there are 1000 genes. The expected number of genes with a TNoM of 2 or less is the P value multiplied by the number of genes in the data set (1000 * 0.03 = 30). In Scenario A we observed one gene with a TNoM of 2 or less. Under the null hypothesis we expected 30 such genes, so we cannot conclude that this gene is biologically significant. In Scenario B we observed 100 genes with a TNoM of 2 or less, so we can conclude that approximately 100 - 30 = 70 of them might be biologically significant.

When examining data sets with biologically meaningful classifications, we usually find an overabundance of significantly informative genes: the number of genes with small scores is much higher than expected. For example, in the Leukemia data set (Golub et al. 1999), there are 3 genes with TNoM score 3 (P value 7.8 * ) while the expected number is 5.5 * . Moreover, there are 294 genes with TNoM score 15 or less, while the expected number is roughly 1. This is an overabundance of informative genes, meaning that the expression profiles carry information relevant to the biological classification.

An overabundance graph can be used to visualize the significance of an experiment. For different P values (or TNoM scores) it shows the difference between the expected number of genes and the actual number. Figure 1 illustrates such a graph.

[Figure 1: Overabundance graph. Breast cancer BRCA1/BRCA2 data, actual and expected TNoM scores (number of genes vs. TNoM score).]
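The TNoM score and the expected-versus-observed comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the function names `tnom_score` and `overabundance` are hypothetical, labels are assumed to be 0/1, and a threshold is assumed to be placeable only between distinct expression values.

```python
def tnom_score(values, labels):
    """Fewest misclassifications made by any single threshold.

    values: expression values of one gene, one per sample.
    labels: 0/1 class label of each sample (assumed encoding).
    Both threshold directions are tried (high = class 1, or high = class 0).
    """
    pairs = sorted(zip(values, labels))
    total = len(pairs)
    n_neg = sum(1 for _, lab in pairs if not lab)
    # Threshold below every value: all samples are predicted as class 1,
    # so exactly the class-0 samples are misclassified.
    errors = n_neg
    best = min(errors, total - errors)
    for i, (val, lab) in enumerate(pairs):
        # Moving the threshold past this sample predicts it as class 0.
        errors += 1 if lab else -1
        # Only positions between distinct values are valid thresholds.
        at_boundary = i + 1 == total or pairs[i + 1][0] != val
        if at_boundary:
            best = min(best, errors, total - errors)
    return best


def overabundance(observed, p_value, n_genes):
    """Observed count minus the count expected under the null hypothesis."""
    expected = p_value * n_genes
    return observed - expected


# A perfectly separable gene and an interspersed one (made-up values).
separable = tnom_score([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [0, 0, 0, 1, 1, 1])
interspersed = tnom_score([1.0, 2.0, 3.0, 4.0], [0, 1, 0, 1])

# Scenario B from the text: 100 observed genes, P value 0.03, 1000 genes.
excess_b = overabundance(100, 0.03, 1000)
```

Sorting once and sweeping the threshold keeps the score computation at O(m log m) per gene for m samples, rather than re-counting errors for every candidate threshold.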

2.2 Statistical Significance of Overabundance

Our next step is to quantify the statistical significance of overabundance. This quantification is important in two types of situations:

1) Consider a biologically meaningful classification (e.g., two subtypes of cancer, as in the case of the Leukemia data set). We want to ascertain whether gene expression patterns reflect that classification. In the Leukemia data set this is clearly the case. In other classifications, with fewer tissue samples or a more subtle signal, the situation might not be obvious. Using standard methods (e.g., Bonferroni bounds), we can determine whether a single gene is significant for the classification. Our aim, however, is to take into account the global pattern, that is, the behavior of all the genes. An overabundance of informative genes is an indication of statistical significance, even if no single gene is Bonferroni significant.

2) Consider a putative classification, as in Bittner et al. ("Molecular classification of cutaneous malignant melanoma by gene expression profiling", 2000), that might correspond to a real biological distinction. Clearly, the ultimate test for a putative classification is biological validation (as described therein). However, statistics is a tool for evaluating classifications before planning further experiments. Thus, we want to develop statistical scores that measure the significance of suggested partitions.

2.2.1 False Detection Rate

For each TNoM score $s$ we define the False Detection Rate (FDR):

$$\mathrm{FDR}(s) = \frac{\text{expected number of genes with TNoM} \le s}{\text{actual number of genes with TNoM} \le s}$$

The FDR function can be used to select a subset of differentially expressed genes with a low expected number of false positives. We typically select the threshold (e.g., a TNoM score) that minimizes the FDR. Another option is to select a threshold that achieves a given FDR (e.g., 5%).

2.2.2 Binomial Surprise Score

An alternative to the FDR is the Binomial Surprise Score. The basic idea is to find the score at which we are most surprised by the number of observed genes.

Let $X(s)$ be the number of genes with TNoM $\le s$, for a given threshold $s$, that we would observe for uniformly and independently drawn labeling vectors. Let $p_s$ be the matching P value and let $n$ be the total number of genes in the data set. Assuming that the $n$ vectors are drawn independently, and that a vector with TNoM score $\le s$ is drawn with probability $p_s$, we conclude that:

$$X(s) \sim \mathrm{Binom}(n, p_s)$$

Let $n(s)$ be the observed number of genes in the data set with TNoM $\le s$. The surprise score is defined as $-\log \sigma(s)$, where:

$$\sigma(s) = \mathrm{Prob}\big(X(s) \ge n(s)\big) = \sum_{i=n(s)}^{n} \binom{n}{i}\, p_s^{\,i}\, (1 - p_s)^{\,n-i}$$

We are, of course, interested in the threshold $s$ that attains the maximum surprise score, defined as $\max_s \left[-\log \sigma(s)\right]$.

In the general case we would expect the Binomial Surprise Score to be:

1) 0 for the highest TNoM score $s_{max}$, since $p_{s_{max}} = 1 \Rightarrow \sigma(s_{max}) = 1 \Rightarrow -\log \sigma(s_{max}) = 0$.
2) Low for the TNoM score $s = 0$: there are usually very few genes, if any, with a perfect score, and if $n(0) = 0$ then $\sigma(0) = \mathrm{Prob}(X(0) \ge 0) = 1 \Rightarrow -\log \sigma(0) = 0$.
3) Positive for scores between the two extremes. There is usually a single maximum, attained between the two extremes.

Figure 2 presents the Binomial Surprise Score for a data set of 30 samples from normal and tumor lung tissues (taken from Naftali Kaminski's lab, Sheba Medical Center). As can be seen, $\arg\max_s \left[-\log \sigma(s)\right] = 6$.

[Figure 2: Lung cancer data. Two panels: actual and expected TNoM score distributions (number of genes vs. TNoM score), and -log(binomial surprise) vs. TNoM score.]

One obvious weakness of this approach is its independence assumption: genes are not really independent. A biological phenomenon is almost always associated with many genes.
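The FDR and the Binomial Surprise Score defined above can be computed directly from the per-threshold P values and observed counts. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the natural logarithm is used (the notes do not fix a base), and the tail probability is evaluated in log space so that very small values of $\sigma(s)$ do not underflow. The P values and counts in the usage lines are made up for illustration.

```python
import math


def fdr(n_genes, p_s, observed):
    """Expected / observed number of genes with TNoM <= s."""
    if observed == 0:
        return math.inf  # no genes selected; the ratio is undefined
    return n_genes * p_s / observed


def log_binom_tail(n, p, k):
    """log Prob(X >= k) for X ~ Binom(n, p), using log-sum-exp."""
    if k <= 0 or p >= 1.0:
        return 0.0           # the event is certain
    if k > n or p <= 0.0:
        return -math.inf     # the event is impossible
    log_terms = [
        math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
        + i * math.log(p) + (n - i) * math.log1p(-p)
        for i in range(k, n + 1)
    ]
    top = max(log_terms)
    return top + math.log(sum(math.exp(t - top) for t in log_terms))


def surprise(n, p_s, observed):
    """Binomial surprise score -log sigma(s)."""
    return -log_binom_tail(n, p_s, observed)


def most_surprising_threshold(n, p_values, observed_counts):
    """Index s maximizing the surprise score; inputs are indexed by s."""
    scores = [surprise(n, p, k) for p, k in zip(p_values, observed_counts)]
    return max(range(len(scores)), key=lambda s: scores[s])


# Hypothetical data: P value and observed gene count for thresholds s = 0..4.
p_values = [0.0001, 0.001, 0.03, 0.2, 1.0]
observed_counts = [0, 5, 100, 300, 1000]
best_s = most_surprising_threshold(1000, p_values, observed_counts)
```

In this toy input the score is zero at both extremes (no genes with a perfect score, and $p_s = 1$ at the largest threshold), matching the qualitative shape described in the text.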

3 Class Discovery

3.1 What is Class Discovery?

In many experimental designs it is useful to find tissue classifications in gene expression data. Such classifications might be due to biological phenomena (e.g., disease subtypes), or due to mechanical or protocol noise. Identifying classifications can lead to biological discovery or can uncover experimental or data handling errors.

Biologically meaningful classifications are often characterized by an overabundance of informative genes. This overabundance might be due to a small set of genes that are highly informative about the classification, or due to a larger set of genes, each of which is not as surprising on its own but which are surprising as a collection. This suggests considering partitions of the samples into two groups, evaluating each partition, and measuring the degree to which it shows an overabundance of informative genes. The partitions that display high overabundance are proposed as putative classifications. To carry out this intuition we need to choose a score for overabundance and then search for high-scoring partitions.

3.2 Class Discovery Algorithms

One approach to class discovery combines the maximum binomial surprise score with local search techniques. The surprise metric assigns a score to each partition, and a local search technique is used to seek partitions with a statistically significant overabundance of informative genes. The following local search algorithms are commonly used (the size of the search space is $2^m$ for bipartitions, or $k^m$ for partitions into $k$ groups, where $m$ is the number of samples):

1) Steepest ascent: Move to the next candidate partition if and only if $s_{new} > s_{current}$.
2) Simulated annealing: Move to the next candidate partition with probability $\min\left(1, \exp\left((s_{new} - s_{current}) / T\right)\right)$, where $T$ is a temperature parameter. Simulated annealing therefore allows occasional moves that worsen the current solution (here, moves to a lower score). Its advantage is that it can escape local maxima, unlike the steepest ascent algorithm.
3) Genetic algorithms: There are various genetic heuristics, which are beyond the scope of this course.

If we are interested in partitioning the samples into two groups, we can represent the assignment as a binary vector of length $m$ (the number of samples), in which each bit indicates the group of a specific sample. A simple successor function for the local search algorithms is to flip a single bit in this vector. Figure 3 illustrates the graph induced by this representation.
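The two local search strategies above, with the single-bit-flip successor function, can be sketched as follows. This is an illustration under assumptions, not the lecture's code: `score` is any caller-supplied function mapping a partition (a tuple of 0/1 bits) to a number, the geometric cooling schedule and its parameters are choices made here, and `toy_score` is a made-up stand-in for the maximum surprise score.

```python
import math
import random


def flip(partition, i):
    """Successor: the same partition with sample i moved to the other group."""
    return partition[:i] + (1 - partition[i],) + partition[i + 1:]


def steepest_ascent(score, start):
    """Move to the best single-flip neighbor while it improves the score."""
    current = tuple(start)
    s_current = score(current)
    while True:
        neighbors = [flip(current, i) for i in range(len(current))]
        candidate = max(neighbors, key=score)
        s_new = score(candidate)
        if s_new <= s_current:
            return current, s_current  # local maximum reached
        current, s_current = candidate, s_new


def simulated_annealing(score, start, t_start=1.0, cooling=0.95,
                        steps=500, seed=0):
    """Accept a random neighbor with probability min(1, exp((s_new - s_cur) / T))."""
    rng = random.Random(seed)
    current = tuple(start)
    s_current = score(current)
    best, s_best = current, s_current
    temperature = t_start
    for _ in range(steps):
        candidate = flip(current, rng.randrange(len(current)))
        s_new = score(candidate)
        accept = (s_new >= s_current
                  or rng.random() < math.exp((s_new - s_current) / temperature))
        if accept:
            current, s_current = candidate, s_new
            if s_current > s_best:
                best, s_best = current, s_current
        # Geometric cooling (an assumed schedule), floored to avoid dividing by zero.
        temperature = max(temperature * cooling, 1e-9)
    return best, s_best


# Toy score (hypothetical): agreement with a hidden bipartition, counting a
# partition and its complement as the same split.
HIDDEN = (0, 0, 0, 1, 1, 1)


def toy_score(partition):
    matches = sum(a == b for a, b in zip(partition, HIDDEN))
    return max(matches, len(HIDDEN) - matches)


climbed, climbed_score = steepest_ascent(toy_score, (0, 0, 0, 0, 0, 0))
annealed, annealed_score = simulated_annealing(toy_score, (0, 0, 0, 0, 0, 0))
```

Tracking the best partition seen so far in the annealing loop is a common safeguard, since the walk may leave a good partition before the temperature drops.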

[Figure 3: Graph of candidate partitions connected by single-bit flips, each node labeled with its score (legible values: 88.2, 73.2).]

3.3 Discovering Additional Classes

We can use a technique called peeling to fine-tune the classification:

1) Discover a significant partition via one of the preceding algorithms.
2) Remove all the genes that support the discovered partition (the threshold that attains the maximum surprise score can be used to decide which genes support it).
3) Repeat the previous steps with the remaining genes.

The peeling technique is effective because different sets of genes typically induce different partitions. Therefore, finding a fine-tuned set of classes is difficult (or impossible) when the expression of all the genes in the data set is considered at once. Sometimes relevant sets of genes are known in advance, so standard clustering over those genes can be used to partition the samples. Usually, however, there is no such prior knowledge, so peeling can be an effective way of discovering classes based solely on the input data set.
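The peeling loop above can be sketched as a small higher-order function. Everything here is an assumption made for illustration: `discover_partition` and `supports` are caller-supplied placeholders for a class discovery algorithm and a gene-support test, and the toy versions below (genes as already-binarized 0/1 patterns, discovery by most common pattern) are stand-ins rather than the surprise-based search described in the notes.

```python
from collections import Counter


def peel(genes, discover_partition, supports, max_rounds=5):
    """Repeatedly discover a partition, then remove the genes supporting it."""
    remaining = list(genes)
    found = []
    for _ in range(max_rounds):
        if not remaining:
            break
        partition = discover_partition(remaining)
        if partition is None:
            break  # no significant partition left
        supporters = [g for g in remaining if supports(g, partition)]
        if not supporters:
            break  # nothing to peel; stop rather than rediscover the same split
        found.append(partition)
        remaining = [g for g in remaining if not supports(g, partition)]
    return found


def canonical(pattern):
    """Treat a bipartition and its complement as the same labeling."""
    flipped = tuple(1 - b for b in pattern)
    return min(pattern, flipped)


def toy_discover(genes, min_support=2):
    """Toy discovery: the most common pattern, if enough genes share it."""
    counts = Counter(canonical(g) for g in genes)
    pattern, support = counts.most_common(1)[0]
    return pattern if support >= min_support else None


def toy_supports(gene, partition):
    """Toy support test: the gene's pattern matches the partition exactly."""
    return canonical(gene) == partition


# Made-up data: each gene is a 0/1 pattern over four samples.
toy_genes = (
    [(0, 0, 1, 1)] * 3 + [(1, 1, 0, 0)]   # four genes supporting one split
    + [(0, 1, 0, 1)] * 2                  # two genes supporting another
    + [(0, 1, 1, 0)]                      # a lone gene
)
partitions = peel(toy_genes, toy_discover, toy_supports)
```

Passing the discovery and support functions as parameters keeps the peeling loop independent of the particular score or search strategy used in step 1.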
