The tail-rank statistic for finding biomarkers from microarray data, with application to prostate cancer

Kevin R. Coombes (1), Jing Wang (1) and Keith A. Baggerly (1)

(1) Department of Biostatistics and Applied Mathematics, University of Texas M.D. Anderson Cancer Center, 1515 Holcombe Blvd., Box 447, Houston TX USA

Kevin R. Coombes - kcoombes@mdanderson.org; Jing Wang - wang@odin.mdacc.tmc.edu; Keith A. Baggerly - kabagg@odin.mdacc.tmc.edu; Corresponding author

Abstract

Background: High-throughput molecular biology technologies are increasingly being applied to discover biomarkers. The statistical analysis of these studies, however, is usually directed toward differential expression. As a result, it may miss important biomarkers that are only present in a subset of patients.

Methods: We introduce the tail-rank test, a novel statistical method for identifying candidate biomarkers. The tail-rank statistic uses nonparametric information from the tails of the distributions to estimate the specificity and sensitivity of individual biomarkers. We include sample size and power computations for studies using the tail-rank test, which account for multiple testing.

Results: We apply the tail-rank test to an existing microarray data set on prostate cancer, and we compare it to existing methods based on the t-test or Wilcoxon rank-sum test. We find that the tail-rank test selects different genes than the t-test or the Wilcoxon test. The methods largely agree at extreme values of either statistic. By examining the differences, however, we provide evidence that the tail-rank test can successfully identify biomarkers that pick out biologically relevant subsets of the patient samples.

Conclusions: We have described a robust, easily implemented, nonparametric statistical method for identifying candidate biomarkers in high-throughput molecular biology data.
Background

The search for biomarkers to use in the diagnosis, prognosis, or monitoring of disease is an active area of biomedical research. The availability of high-throughput molecular biology technologies (including RNA expression microarrays, serum proteomics, and array-based comparative genomic hybridization) makes it possible to perform experiments that simultaneously screen thousands of molecules to assess their value as biomarkers. The large data sets produced by these experiments present a statistical challenge to the analyst who would like to use them to discover biomarkers.

It is important to realize that a biomarker is not just a differentially expressed gene. To clarify the distinction, consider a microarray experiment that compares samples from two classes of individuals, healthy and cancer. In statistical terms, we say that a gene is differentially expressed if the distribution of expression values in the healthy samples differs from the distribution in the cancer samples. We say that a gene makes a good biomarker if a test to distinguish cancer samples from healthy samples based on the expression of this gene has high sensitivity and specificity. More generally, we might look at the area under the receiver operating characteristic (ROC) curve for a univariate test, or we might look at the ability of the gene to contribute to a multivariate test distinguishing the two classes. Distinct biologically important concepts (differential expression and good biomarker, respectively) are assessed using distinct statistical concepts (distributions and ROC curves, respectively).

Much of the statistical analysis of differential expression to date has focused on differences that can be assessed by looking at measures of central tendency. All methods based on the t-test, for instance, reduce to statements about differences in the mean expression in the two groups of samples. Rank-based tests like the Wilcoxon rank-sum test, while nonparametric, are also based on the idea that the center of the data has shifted.

However, a shift in mean expression is not enough to make a good biomarker. Consider, for example, a gene whose expression in diseased samples is 1.5-fold higher than in healthy samples. Assume that both groups of samples are normally distributed on a logarithmic scale with equal standard deviations of 0.7, which is typical in microarray experiments. The area under the receiver operating characteristic (ROC) curve associated with such a gene is only To achieve a specificity of 99%, one would have to set the threshold so high that the sensitivity would only be 6.8%. Changing the threshold to achieve a specificity of 90% would only provide a sensitivity of 32.8%.
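The 1.5-fold example above can be checked numerically. The following is a minimal sketch, assuming that the logarithmic scale is base 2 (the same scale used for the log ratios later in the paper) and that both groups are normal with a common standard deviation of 0.7; the function names are illustrative and are not taken from the paper. Under these assumptions the sketch reproduces the quoted sensitivities of 6.8% and 32.8%, and it also prints the corresponding area under the ROC curve.

```python
from math import log2, sqrt
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1


def separation(fold_change, sd):
    """Shift between the group means, in units of the common standard deviation.

    Assumes expression is measured on a base-2 logarithmic scale.
    """
    return log2(fold_change) / sd


def auc(d):
    """Area under the ROC curve for two equal-variance normal groups
    whose means differ by d standard deviations."""
    return std_normal.cdf(d / sqrt(2))


def sensitivity_at(specificity, d):
    """Sensitivity when the threshold is set at the given quantile of the healthy group."""
    threshold = std_normal.inv_cdf(specificity)
    return 1.0 - std_normal.cdf(threshold - d)


d = separation(1.5, 0.7)
print(f"AUC: {auc(d):.3f}")
for spec in (0.99, 0.90):
    print(f"specificity {spec:.0%}: sensitivity {sensitivity_at(spec, d):.1%}")
```

The point of the sketch is only that a modest shift in the mean leaves the two distributions heavily overlapping, so a threshold strict enough for high specificity catches few cancer samples.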
Statistical analyses based on the center of the data are likely to miss many promising biomarkers. The problem is that, by focusing on average behavior, they test the hypothesis that all cancer patients differ from healthy individuals in the same way. In a different context, Tolstoy wrote that all happy families resemble one another, but each unhappy family is unhappy in its own way. We believe that cancer patients, like those unhappy families, differ from healthy individuals in a variety of ways.

There is strong biological evidence to support this contention. Deletion of part of the short arm of chromosome 3 (3p14-p23) is found in 50% of non-small-cell lung cancers; MYC amplification is found in 14% of stomach cancers; BRCA1 mutations are found in a subset of breast cancers; a translocation between chromosomes 11 and 14 occurs in 35% of mantle cell lymphomas. Each of these genetic abnormalities directly causes specific differences in gene expression that only occur in a subset of otherwise histologically similar cancers. From a statistical perspective, these results suggest that the distributions of gene expression in cancer patients are likely to differ from the healthy distributions in much more than the location of the center.

There are, of course, statistical tools that look at other properties of distributions. The Kolmogorov-Smirnov statistic, for example, is specifically designed to test whether two distributions are the same. We do not know of any applications of the Kolmogorov-Smirnov statistic to microarrays; it may well be useful for analyzing differential expression in a more general manner than the existing tests. However, it does not address the specific problem of finding good biomarkers.

In this paper, we propose a simple, straightforward statistical test (which we call the tail-rank test) to discover biomarkers.
Because we believe that many useful biomarkers will only be present in a subset of the cancer patients, we base the test on the tails of distributions. The test we propose is nonparametric, in the sense that it makes no distributional assumptions on the measurements of the samples from the cancer patients. The tail-rank test is very closely related to the PPST that was recently introduced [1]. In this article, we provide a sound statistical basis for this test, which includes easily computed estimates of power and explicit sample size computations.

Results

Lapointe and colleagues [2] recently published a paper describing the results of microarray experiments using 41 samples of normal prostate, 62 samples of prostate cancer, and 9 samples from lymph node metastases of prostate cancer. Because the cancer samples included a clearly recognizable subset (the metastases), we felt it would provide a useful application of the tail-rank test. So, we downloaded the raw data from the Stanford Microarray Database.

Lapointe's prostate cancer experiments used glass microarrays printed with 42,129 spots containing 38,804 different cDNA clones representing 23,685 distinct UniGene clusters. The experiments were performed using the two-color fluorescence process, with a common reference material in the Cy3 channel and the experimental sample in the Cy5 channel. We performed intensity-dependent normalization on each microarray using loess [3, 4]. We further normalized the intensity of each channel by rescaling so the 75th percentile equaled 1000 and computed the base-two logarithmic ratios at each spot; all further analysis was performed on these ratios.

T-test

In order to have a baseline for comparison, we performed two-sample t-tests comparing normal prostate samples to the combined primary and metastatic prostate cancer samples, and computed p-values for each spot on the array. To adjust for multiple comparisons, we modeled the p-values as a beta-uniform mixture [5]. Using this method, we set a cutoff on the p-values to control the false discovery rate (FDR) [6]. We chose to bound the FDR to be less than 0.05, which corresponded in this data set to p < or to t > Using this cutoff, we detected 3,522 differentially expressed spots representing 2,531 differentially expressed UniGene clusters. Of these, 1,094 UniGene clusters (1,415 spots) were overexpressed in prostate cancer and 1,454 UniGene clusters (2,107 spots) were underexpressed.

Wilcoxon rank sum test

We also performed Wilcoxon rank sum tests for each gene, using an empirical Bayes method to determine which Wilcoxon statistics were significant [7]. In order to get results that were comparable to the t-test, we selected a cutoff corresponding to a posterior probability of 99.9% that the Wilcoxon statistic came from a differentially expressed gene. Using this cutoff, we detected 3,627 differentially expressed spots representing 2,576 UniGene clusters.
Of these, 1,129 UniGene clusters (1,498 spots) were overexpressed and 1,447 clusters (2,129 spots) were underexpressed in prostate cancer. Not surprisingly, given the number of samples, there was good agreement between the t-test and the Wilcoxon test. More than 90% (1,905) of underexpressed and 88% (1,244) of overexpressed spots that were found by the t-test were also detected by the Wilcoxon test.

Using the tail-rank test for biomarker detection

To apply the tail-rank test, we assumed that the log ratios of the normal prostate samples were normally distributed for each gene. Using this assumption, we estimated 90% tolerance bounds for both the 5th and 95th percentiles. We then counted the number of combined primary and metastatic prostate cancer samples whose log ratios fell outside these boundaries. Based on Table 2, we identified a gene as a biomarker if at least 16 of the 71 cancer samples were below the 5% or above the 95% levels from the normal prostate. Using this method, we identified 1,359 UniGene clusters (1,766 spots) that were positive biomarkers, since they were present at higher than normal levels in at least 16 cancer samples. We also identified 1,406 UniGene clusters (1,930 spots) that were negative biomarkers, since they were expressed at lower than normal levels in at least 16 samples. In total, we identified 2,743 UniGene clusters (3,692 spots) as candidate biomarkers.

Statistical significance of the results of the tail-rank test

The theoretical basis for the tail-rank test suggests that the number of false positives it finds should be extremely small. In the prostate cancer data set, where we identified 1,766 up-regulated and 1,930 down-regulated biomarkers, it would be useful to estimate the number of false positives empirically. To get such an estimate, we carried out both simulations and a permutation test (Figure 1).
First, we simulated 42,129 genes in 112 samples arbitrarily split into 41 healthy and 71 cancer samples, which were the sizes of the actual data set. The simulation assumed that all measurements were independent and identically distributed, with a standard normal distribution. In 100 simulations, the number of false positives ranged from 0 to 7 with a median of 2. Because we do not expect genes to be statistically independent, however, we also performed a permutation test. For each permutation, we randomly scrambled the labels on the samples and repeated the tail-rank test. In 100 permutations, the number of false positives ranged from 0 to 22 with a median of 2. Although the lack of independence between the genes may have slightly increased the number of false positives, it is still extremely small compared to the total number of positive calls made by the test.

[[Insert Figure 1 here]]

Comparison between the t-test and the tail-rank test

We next compared the list of genes called differentially expressed by the t-test to the list of genes called biomarkers by the tail-rank test (Figure 2). It is clear that the two tests agree at extreme values of either statistic. Overall, there were 1,745 genes (2,363 spots) identified by both methods. However, there were also 984 differentially expressed genes found by the t-test (1,159 spots) that were not flagged as candidate biomarkers by the tail-rank test. At the same time, there were 1,142 candidate biomarkers (1,329 spots) that were not flagged as differentially expressed. The results were essentially the same when we compared the Wilcoxon test to the tail-rank test (data not shown).

[[Insert Figure 2 here]]

In order to understand the differences between the methods, we examined many of the cases where the differences were extreme. For example, we looked at all 36 genes (38 spots) that were identified as differentially expressed for which only 0 or 1 cancer sample took on an extreme value. (A complete list of these genes is contained in Supplementary Table S1.) Plots of the intensities of four such genes are shown in Figure 3. In a large majority of these cases, including LAP1B, CDC14B, and CTF1, the measured intensities of the normal prostate samples included one or more gross outliers that had a large impact on the estimates of the 5th and 95th percentiles. This problem might have been avoided by using a more robust estimation method. In the remaining cases, such as SUPT3H, the normal prostate samples appeared to be significantly more variable than the prostate cancer samples.
Although such genes do indeed appear to be differentially expressed, their level of variability in normal prostate would make them a poor choice for biomarkers.

[[Insert Figure 3 here]]

We also looked at genes that were identified as candidate biomarkers but were not identified as differentially expressed. We looked at all 46 genes (52 spots) that were candidate biomarkers whose absolute t-statistic was less than (A complete list of these genes is contained in Supplementary Table S2.) This set included two kinds of genes (Figure 4). Some genes, including GDF11, simply appeared to be more variable in cancer samples than in normal prostate. It is unlikely that these genes would make useful biomarkers, but they may still provide information about pathways that are dysregulated in cancer. Other genes, including CANX, appeared to achieve the goal of identifying a biologically relevant subset of the cancer samples.

[[Insert Figure 4 here]]

The most promising biomarkers

Finally, we looked at all 53 spots (40 genes) for which more than 52 of the 71 combined primary and metastatic prostate cancer samples had expression levels either above the 95th or below the 5th percentile for normal prostate. We chose a cutoff of 52 since this level corresponds to a posterior estimate of sensitivity of 40% under the skeptical prior. A complete list of the genes is contained in Table 1.

[[Insert Table 1 here]]

Discussion

The most promising marker identified in this data set is caveolin-1 (CAV1), which occupies the top two spots in the table. Based on these results, CAV1 appears to be about 4-fold underexpressed in prostate cancer cells, and an additional 4-fold underexpressed in lymph node metastases. CAV1 has previously been proposed as a candidate tumor suppressor [10] and a negative regulator of the Ras-p42/44 MAP kinase cascade [11].
Although overexpression of CAV1 has been reported to promote cell survival in a mouse model of prostate cancer [12], its pattern of expression in benign prostate and androgen-sensitive human prostate cancer is more consistent with its role as a tumor suppressor [13, 14]. The caveolin-2 gene (CAV2) is located adjacent to CAV1 on chromosome 7, and it displays a parallel expression pattern in this data set. Its absence is repeatedly identified as a marker of prostate cancer, and it is expressed at even lower levels in lymph node metastases. Both CAV1 and CAV2 have also been seen to be underexpressed in lung cancer cell lines [15] and in human sarcomas [16].

A number of other interesting genes are identified as important markers. Connexin-43 (GJA1) is underexpressed and connexin-32 (GJB1) is overexpressed in prostate cancer compared to normal prostate. Alterations in connexin levels have been reported previously in prostate cancer [17, 18], and it has been suggested that the ratio of connexin-43 to connexin-32 is important [19]. The alpha-methylacyl coenzyme A racemase (AMACR) gene has also previously been identified as a potential marker of prostate cancer [20].

One of the most interesting examples of a gene selected by the tail-rank test but not by the t-test is calnexin (CANX). Five different clones represent calnexin on these microarrays. All five spots containing these clones were selected by the tail-rank test, even though the t-statistics were insignificant (0.80, 0.82, 0.89, 0.92, and 2.40). Depending on the clone, between 16 and 20 of the prostate cancer samples had expression levels that were higher than the 95th percentile of the expression in normal prostate. Interestingly, between 6 and 8 of the 9 lymph node metastases had levels that were below the 5th percentile of normal, and 8 lymph node metastases had levels that were well below the mean for the primary prostate cancers. This finding is particularly intriguing since it has recently been reported that downregulation of calnexin increases the metastatic potential of melanoma cells [8]. The GITA gene is also interesting; it has recently been shown to be identical to a thyroid adenoma associated gene (THADA) that encodes a death-receptor interacting domain [9].

Conclusions

In this paper, we have introduced a novel method, the tail-rank test, for identifying potential biomarkers from a high-throughput microarray study, based on the idea that a marker can prove valuable if it reliably picks out a subset of the samples. The tail-rank test can be applied without making any distributional assumptions. We have also provided sample size and power computations that account for multiple testing.
We have applied this method to a prostate cancer data set, where it identified a number of interesting potential biomarkers, some of which are novel and others of which have been reported previously.

Methods

We assume that we have collected data on $G$ genes or proteins from $n_H$ healthy individuals. We let $X_{g,i}$ be the random variable representing the measurement of gene $g = 1, 2, \ldots, G$ on individual $i = 1, 2, \ldots, n_H$. We assume for fixed $g$ that the $X_{g,i} \sim X_g$ are independent and identically distributed. Next, we specify a target value $\psi$ that represents the desired specificity of a (univariate) test to distinguish healthy individuals from cancer patients.

The first step in our proposed method is to estimate, for each $g$, a threshold $\tau_g$ such that $\mathrm{Prob}(X_g < \tau_g) = \psi$. In practical terms, we can compute $\tau_g$ using either parametric or nonparametric methods. If we collect enough samples from healthy individuals, we can estimate $\tau_g$ empirically. Alternatively, if we are willing to make distributional assumptions on the measurements for healthy individuals, such as assuming that they are normally distributed on the log scale, then we can estimate $\tau_g$ by fitting the model parameters from the data.

Given the desired specificity (and the threshold estimates), the second step in the tail-rank test is to estimate the sensitivity of a test for cancer based on gene $g$. To make this estimate, we collect data from $n_C$ cancer patients, and we observe the value of the random variable $Y_g$ that counts the number of cancer patients for which the measured expression level of gene $g$ exceeds $\tau_g$; we call $Y_g$ the tail-rank statistic. We will call $g$ a biomarker provided $Y_g$ exceeds a threshold that we specify in the next section.

Significance thresholds based on the null distribution

We understand the behavior of the tail-rank statistic under the null hypothesis that gene $g$ is not a useful biomarker.
More precisely, we use the null hypothesis that the measurements of $g$ on the cancer patients are independent and have the same distribution as the measurements from the healthy individuals. If this is the case, then all $Y_g$ have identical binomial distributions, $Y_g \sim Y = \mathrm{Binom}(n_C, 1 - \psi)$. Even when we perform such tests for $G$ genes, with $G$ large, the expected maximum value of $G$ independent instances of $Y_g$ remains small. To see this, first let $M_G = \max_{g = 1, \ldots, G}(Y_g)$ denote the maximum over $G$ independent, identical, binomial random variables. Also, write $\alpha = \alpha(m) = \mathrm{Prob}(Y > m)$. Finally, let $\gamma = \mathrm{Prob}(M_G > m)$ be the desired level of control on the family-wise Type I error rate. Then

$$1 - \gamma = \mathrm{Prob}(M_G \le m) = \mathrm{Prob}(Y_1 \le m, \ldots, Y_G \le m) = \mathrm{Prob}(Y \le m)^G = (1 - \alpha)^G.$$

We can determine the value of $m$ as the $(1 - \alpha)$th quantile of a single binomial distribution by first solving for $\alpha = 1 - (1 - \gamma)^{1/G}$. One can replace the exact answer by a Bonferroni-like approximation, since $(1 - \alpha)^G \approx 1 - \alpha G$ leads to $\alpha \approx \gamma / G$. The results are illustrated in Table 2, which shows the expected maximum value of the tail-rank statistic $Y_g$ for various values of the family-wise control parameter $\gamma$, the number $n_C$ of cancer samples, the specificity $\psi$, and the number $G$ of genes.

[[Insert Table 2 here]]

As an example of the use of Table 2, suppose we are interested in finding biomarkers with a specificity of at least 99%. Assume that we perform microarray experiments using 50 cancer samples (and enough healthy samples to estimate the 99th percentile). If we just look at one gene $g$, the number $Y_g$ of cancer samples whose expression exceeds the 99th percentile by chance is expected to equal $E[Y_g] = 50 \times (1 - 0.99) = 0.5 < 1$. However, suppose the microarray contains 10,000 genes. Just by chance, some values of $Y_g$ will be greater than 1. To determine the maximum that can occur, we need to choose an overall significance level, which we get by bounding the family-wise error rate (FWER). In this example, we take FWER < 5%. Now, look at the subtable of Table 2 with $\gamma = 0.05$ and $\psi = 0.99$. In the column with $G = 10{,}000$ and the row with $n_C = 50$, we find the value 6. This tells us that in 95% of the experiments that measure 10,000 genes on 50 cancer patients, we should not see more than 6 cancer patients whose expression levels exceed the 99th percentile for any gene. Thus, any genes for which $Y_g \ge 7$ are potential biomarkers.

As in the example, whenever the observed value of the tail-rank statistic $Y_g$ for a gene $g$ exceeds the value in Table 2, we can conclude with a high degree of confidence $(1 - \gamma)$ that $g$ is a candidate biomarker.
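The significance threshold described above can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the function names `binom_upper_tail`, `critical_value`, and `tail_rank` are hypothetical, the binomial tail is summed directly instead of calling a statistics library, and the cancer measurements passed to `tail_rank` are made-up toy values. For the worked example (50 cancer samples, specificity 99%, 10,000 genes, FWER of 5%) the sketch returns the same bound of 6 that the text reads from Table 2.

```python
from math import comb


def binom_upper_tail(m, n, p):
    """P(Y > m) for Y ~ Binom(n, p), summed directly over the upper tail."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(m + 1, n + 1))


def critical_value(n_cancer, psi, n_genes, gamma):
    """Smallest m with P(Y > m) <= alpha under the null Binom(n_cancer, 1 - psi),
    where alpha = 1 - (1 - gamma)**(1 / n_genes) controls the family-wise error rate."""
    alpha = 1.0 - (1.0 - gamma) ** (1.0 / n_genes)
    for m in range(n_cancer + 1):
        # At m = n_cancer the upper tail is empty (probability 0), so the loop always returns.
        if binom_upper_tail(m, n_cancer, 1.0 - psi) <= alpha:
            return m


def tail_rank(cancer_values, tau):
    """Tail-rank statistic Y_g: number of cancer measurements that exceed the threshold tau."""
    return sum(1 for y in cancer_values if y > tau)


m = critical_value(n_cancer=50, psi=0.99, n_genes=10_000, gamma=0.05)
print("critical value m:", m)

# Toy data (hypothetical): a gene is called a candidate biomarker when Y_g > m.
toy_cancer_values = [0.1, 2.5, 3.0, -1.0]
print("tail-rank statistic:", tail_rank(toy_cancer_values, tau=2.0))
```

Replacing `alpha` by the Bonferroni-like value `gamma / n_genes` gives nearly the same threshold, since the two expressions agree to first order in `gamma`.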
A key point to observe from this table is that the tail-rank test is relatively insensitive to the multiple testing problem that afflicts most methods for analyzing microarray data. In the usual framework, a Bonferroni bound on the FWER is extremely conservative and greatly inflates the statistic needed to reject the null hypothesis. By contrast, the bound for the tail-rank statistic grows very slowly as a function of the number of hypotheses tested.

Power and sample size considerations

In order to compute the power of the tail-rank test, we first fix the significance level $\gamma$, the specificity $\psi$, and the number $G$ of genes. Given these values, we estimate the expected maximum value of $G$ independent instances of $Y_g$ under the null hypothesis as a function of the sample size $n_C$. As before, we compute this estimate as the $(1 - \alpha) = (1 - \gamma)^{1/G}$ quantile of the null binomial distribution $\mathrm{Binom}(n_C, 1 - \psi)$. To express this notion compactly, for a binomial random variable $X \sim \mathrm{Binom}(N, p)$, we write $F(x \mid N, p) = \mathrm{Prob}(X \le x)$ for its cumulative distribution function. Now, the expected maximum value $m = E[M_G]$ of $Y_g$ over $G$ genes satisfies

$$F(m \mid n_C, 1 - \psi) = (1 - \gamma)^{1/G},$$

or, equivalently,

$$m = m(G, n_C, \psi) = F^{-1}\big((1 - \gamma)^{1/G} \mid n_C, 1 - \psi\big).$$

Since we identify a gene as a biomarker if the observed value of $Y_g$ is larger than $m$, the power $\pi$ to detect a biomarker whose true sensitivity equals $\phi$ is given by

$$\pi = \mathrm{Prob}(\mathrm{Binom}(n_C, \phi) > m) = 1 - F(m \mid n_C, \phi).$$

Thus, it is straightforward to compute the power provided we are given the sample size, the sensitivity, and the number of genes. The results of such computations are illustrated in Table 3. As an example, look at the subtable corresponding to γ = 0.05 and ψ = From the table, we see that even 500 samples are not enough to detect a biomarker that is present in only 10% of the cancer patients.
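The power formula above translates directly into code. The sketch below is an illustration under assumed settings, not a reproduction of Table 3: the helper names are hypothetical, and the values $\psi = 0.95$, $G = 10{,}000$, and $n_C = 100$ are chosen only for demonstration.

```python
from math import comb


def binom_upper_tail(m, n, p):
    """P(Y > m) for Y ~ Binom(n, p), i.e. 1 - F(m | n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(m + 1, n + 1))


def critical_value(n_cancer, psi, n_genes, gamma):
    """Smallest m with P(Y > m) <= 1 - (1 - gamma)**(1 / n_genes) under the null."""
    alpha = 1.0 - (1.0 - gamma) ** (1.0 / n_genes)
    for m in range(n_cancer + 1):
        if binom_upper_tail(m, n_cancer, 1.0 - psi) <= alpha:
            return m


def power(n_cancer, phi, psi, n_genes, gamma):
    """Probability that a gene with true sensitivity phi yields Y_g > m."""
    m = critical_value(n_cancer, psi, n_genes, gamma)
    return binom_upper_tail(m, n_cancer, phi)


# Illustrative settings only (assumed, not taken from Table 3).
for phi in (0.1, 0.2, 0.3):
    print(phi, round(power(100, phi, psi=0.95, n_genes=10_000, gamma=0.05), 3))
```

Looping over `n_cancer` for a fixed `phi` gives a simple way to find the smallest sample size that reaches a target power.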
By contrast, in the same subtable, 100 samples have enough power (> 70%) to reliably detect biomarkers with a sensitivity of 20%, fewer than 50 samples are needed to detect a biomarker with a sensitivity of 30%, and as few as 10 samples will suffice to detect biomarkers with a sensitivity of 70%.

[[Insert Table 3 here]]

Bayesian estimates of sensitivity

Our goal in biomarker discovery is not just to find candidate biomarkers, but also to estimate how well they perform. For each gene $g$, we are ultimately trying to use the observed data $Y_g$ to estimate the value of a parameter $\phi_g$, which is the sensitivity of a test for cancer based on whether the expression level of gene $g$ exceeds $\tau_g$, the $\psi$th quantile of expression in healthy individuals. In other words, we use the model $Y_g \sim \mathrm{Binom}(n_C, \phi)$, with $n_C$ a fixed part of the experimental design. Because $0 \le \phi_g \le 1$, it is convenient to place a beta prior distribution on the sensitivity. We write $\phi \sim \mathrm{Beta}(\alpha_0, \beta_0)$ for some choice of the hyperparameters $\alpha_0$ and $\beta_0$. If we observe $Y_g = y$, then the posterior distribution of the sensitivity is another beta distribution given by

$$(\phi \mid Y_g = y) \sim \mathrm{Beta}(\alpha_0 + y, \beta_0 + n_C - y).$$

We now consider how to choose reasonable values for the hyperparameters. One possibility is to use an uninformative prior. In this case, we might use $\mathrm{Beta}(1, 1)$, which is just the uniform distribution. The observed data would overwhelm this prior for even a modest number of cancer samples. Because of multiple testing, however, we strongly suspect that this method would overestimate the sensitivity of some biomarkers. To see why, suppose we measure the expression value of 10,000 genes on a large number of healthy individuals and estimate the gene-specific thresholds corresponding to the 95th percentile of healthy expression. Suppose we then measure the expression of all 10,000 genes on 100 cancer patients, and we find a gene for which 20 of the cancer patients have expression levels that exceed the threshold for that gene. A frequentist estimate of the sensitivity of such a cancer test is 20/100 = 20%. A Bayesian estimate using the uniform prior actually increases this value by shrinking it toward the prior mean of 50%; the expected value of the $\mathrm{Beta}(21, 81)$ distribution is 21/102 = 20.6%. In this situation, however, a gene that does not distinguish between healthy and cancer samples would have a sensitivity of 5% and a specificity of 95%. From Table 2, we would not be surprised to find at least one such gene for which 15 out of 100 samples exceed the threshold.
Should an increase from 15 to 20 really be enough to change our assessment of a gene from not useful to 20% sensitive? The problem with the uniform prior is that it is overly optimistic with highly multiple testing. With its mean of 1/2, the prior suggests that the expected sensitivity for a randomly chosen gene is 50%. But good biomarkers are rare, as can be easily demonstrated by looking at the biomedical literature from the beginning of time. In light of the historical difficulty of finding useful biomarkers, it seems reasonable to use a more skeptical prior.

We can construct a skeptical prior by building on what we know about the behavior of the tail-rank statistic. As above, we have $Y_g \sim \mathrm{Binom}(n_C, 1 - \psi)$ under the null hypothesis. A skeptical prior would assume that most genes are poor biomarkers, implying that the typical sensitivity $\phi$ would be close to $1 - \psi$. Thus, we should use a beta prior whose expected value is $1 - \psi$. There is a one-dimensional family of such priors, which we write as $\mathrm{Beta}(w(1 - \psi), w\psi)$ for some weight hyperparameter $w$. Increasing values of $w$ provide increasingly skeptical priors.

We return to our earlier example, where we find a gene $g$ for which 20 out of 100 cancer samples have measurements that exceed the 95th percentile of healthy expression. Then $\psi = 0.95$ and the posterior expectation of the sensitivity $\phi_g$ as a function of the weight $w$ is equal to $(0.05w + 20)/(w + 100)$. When $w = 0$, this formula discards the prior and uses the frequentist estimate of the sensitivity as 20%. When $w = 2$, we impose a weakly skeptical prior and get a posterior expectation of 19.7%. When $w = 99$, we impose a strongly skeptical prior and get a posterior expectation of 12.5%. Even this skeptical value, however, is significantly larger than the sensitivity of 5% that we would expect to see from a gene that was truly useless as a biomarker.

One can make an argument that reasonable weights are bounded by $0 \le w \le n_C - 1$.
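The posterior expectations in this example follow from the beta-binomial update and can be sketched as below. The function name `posterior_mean_sensitivity` is hypothetical; the sketch only restates the formula $(w(1 - \psi) + y)/(w + n_C)$ and evaluates it for the three weights discussed above.

```python
def posterior_mean_sensitivity(y, n_cancer, psi, w):
    """Posterior mean of the sensitivity under the skeptical prior
    Beta(w * (1 - psi), w * psi), after y of n_cancer samples exceed the threshold."""
    a = w * (1.0 - psi) + y            # updated first beta parameter
    b = w * psi + (n_cancer - y)       # updated second beta parameter
    return a / (a + b)


# Worked example from the text: 20 of 100 cancer samples exceed the 95th percentile.
for w in (0, 2, 99):
    estimate = posterior_mean_sensitivity(y=20, n_cancer=100, psi=0.95, w=w)
    print(f"w = {w:2d}: posterior mean sensitivity = {estimate:.1%}")
```

With $w = 0$ the prior carries no weight and the result is the frequentist estimate $y / n_C$; larger weights pull the estimate toward the null value $1 - \psi$.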
The binomial distribution $Y \sim \mathrm{Binom}(n_C, 1 - \psi)$ that corresponds to the null hypothesis yields a sampling distribution for $Z = Y/n_C$ with mean $E[Z] = 1 - \psi$ and variance $\mathrm{Var}[Z] = (1 - \psi)\psi/n_C$. The prior $\mathrm{Beta}(w(1 - \psi), w\psi)$ has the same mean, $1 - \psi$, and has variance $(1 - \psi)\psi/(w + 1)$. By equating both the mean and the variance of the beta prior to these sampling values, we find that $w = n_C - 1$. This corresponds to the distribution we expect to see if none of the genes on the microarray provides any useful biomarker information, and so has a legitimate claim to be the most skeptical prior that should be considered.

Dealing with uncertainty in the threshold

Our discussion of the tail-rank test does not yet account for the uncertainty in the estimate of the thresholds $\tau_g$ based on the samples from healthy individuals. This problem is addressed by constructing a statistical tolerance interval (more precisely, a one-sided tolerance bound) that contains a given fraction, $\psi$, of the population with a given confidence level, $\gamma$ [21]. With enough samples, one can obtain distribution-free tolerance bounds (op. cit., Chapter 5). For instance, one can use bootstrap or jackknife methods to estimate these bounds empirically. Here,

8 however, we assume that the measurements of the log expression of gene g in healthy individuals are normally distributed. We let X be the sample mean and let s be the sample standard deviation. The upper tolerance bound that, 100γ% of the time, exceeds 100ψ% of G values from a normal distribution is approximated by X U = X + k γ,ψ s, where z ψ + zψ 2 ab k γ,ψ =, a a = 1 z2 1 γ 2G 2, b = zψ 2 z2 1 γ G, and, for any π, z π is the critical value of the normal distribution that is exceeded with probability π [22]. For example, suppose, as in the prostate study below, that we collect data on 41 healthy individuals. A simple point estimate of the 95 th percentile is given by τ = X s. However, only the mean of the distribution is less than this value 95% of the time. Almost half of the time (43.5%), fewer than 95% of the observed values will be less than τ. The 90% tolerance bound on the 95 th percentile is X s, the 95% tolerance bound is X s, and the 99% tolerance bound is X s. Authors contributions All authors participated in developing the new method, analyzing the data, and writing the paper. Acknowledgements This work was supported in part by grants from the National Cancer Institute/National Institutes of Health (P30 CA and P50 CA91846) and from the Goodwin Foundation. References 1. Lyons-Weiler J, Patel S, Becich MJ, E GT: Tests for finding complex patterns of differntial expression in cancers: towards individualized medicine. BMC Bioirnformatics 2004, 5: Lapointe J, Li C, Higgins JP, van de Rijn M, Bair E, Montgomery K, Ferrari M, Egevad L, Rayford W, Bergerheim U, Ekman P, DeMarzo AM, Tibshirani R, Botstein D, Brown PO, Brooks JD, Pollack JR: Gene expression profiling identifies clinically relevant subtypes of prostate cancer. Proc Natl Acad Sci USA 2004, 101: Dudoit S, Yang YH, Callow MJ, Speed TP: Statistical methods for identifying differentially expressed genes in replicated cdna microarray experiments. 
Statistica Sinica 2002, 12:
4. Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, Speed TP: Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res 2002, 30:e
5. Pounds S, Morris SW: Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of p-values. Bioinformatics 2003, 19:
6. Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Royal Statist Soc, Series B 1995, 57:
7. Efron B, Tibshirani R: Empirical Bayes method and false discovery rates for microarrays. Genet Epidemiol 2002, 23:
8. Dissemond J, Busch M, Mors J, Weimann TK, Lindeke A, Goos M, Wagner SN: Differential downregulation of endoplasmic reticulum-residing chaperones calnexin and calreticulin in human metastatic melanoma. Cancer Lett 2004, 203:
9. Rippe V, Drieschner N, Meiboom M, Murua Escobar H, Bonk U, Belge G, Bullerdiek J: Identification of a gene rearranged by 2p21 aberrations in thyroid adenomas. Oncogene 2003, 22:
10. Engelman JA, Zhang XL, Galbiati F, Lisanti MP: Chromosomal localization, genomic organization, and developmental expression of the caveolin gene family (Cav-1, -2, and -3). Cav-1 and Cav-2 genes map to a known tumor suppressor locus (6-A2/7q31). FEBS Lett 1998, 249:
11. Galbiati F, Volante D, Engelman JA, Watanabe G, Burk R, Pestell RG, Lisanti MP: Targeted downregulation of caveolin-1 is sufficient to drive cell transformation and hyperactivate the p42/44 MAP kinase cascade. EMBO J 1998, 17:
12. Thompson TC, Timme TL, Li L, Goltsov A: Caveolin-1, a metastasis-related gene that promotes cell survival in prostate cancer. Apoptosis 1999, 4:
13. Pflug BR, Reiter RE, Nelson JB: Caveolin expression is decreased following androgen deprivation in human prostate cancer cell lines. Prostate 1999, 40:
14. Soos G, Haas GP, Wang CY, Jones RF: Differential gene expression in human prostate cancer cells adapted to growth in bone in Beige mice.
Urol Oncol 2003, 21:
15. Racine C, Belanger M, Hirabayashi H, Boucher M, Chakir J, Couet J: Reduction of caveolin 1 gene expression in lung carcinoma cell lines. Biochem Biophys Res Commun 1999, 255:
16. Wiechen K, Sers C, Agoulnik A, Arlt K, Dietel M, Schlag PM, Schneider U: Down-regulation of caveolin-1, a candidate tumor suppressor gene, in sarcomas. Am J Pathol 2001, 158:

17. Tsai H, Werber J, Davia MO, Edelman M, Tanaka KE, Melman A, Christ GJ, Geliebter J: Reduced connexin 43 expression in high grade, human prostatic adenocarcinoma cells. Biochem Biophys Res Commun 1996, 227:
18. Habermann H, Ray V, Habermann W, Prins GS: Alterations in gap junction protein expression in human benign prostatic hyperplasia and prostate cancer. J Urol 2002, 167:
19. Carruba G, Stefano R, Cocciadifero L, Saladino F, Di Cristina A, Tokar E, Quader ST, Webber MM, Castagnetta L: Intercellular communication and human prostate carcinogenesis. Ann N Y Acad Sci 2002, 963:
20. Rubin MA, Zhou M, Dhanasekaran SM, Varambally S, Barrette TR, Sanda MG, Pienta KJ, Ghosh D, Chinnaiyan AM: alpha-methylacyl coenzyme A racemase as a tissue biomarker for prostate cancer. JAMA 2002, 287:
21. Hahn GJ, Meeker WQ: Statistical Intervals: A Guide for Practitioners. New York: John Wiley and Sons
22. Natrella MG: Experimental Statistics, NBS Handbook 91. Washington DC: National Bureau of Standards

Figures

Figure 1 - Histogram of the number of false positives
This figure shows the number of false positives found among 100 simulations of i.i.d. normal data, and among 100 permutations of the prostate cancer data set.

Figure 2 - Comparison between the t-statistic and the tail-rank statistic
Differentially expressed genes lie outside the horizontal lines. Biomarkers lie above the vertical line.

Figure 3 - Plots of the log intensities of four genes that were called differentially expressed but not flagged as biomarkers
Vertical jitter has been added to distinguish individual samples. In three cases, the normal samples (N) include an extreme outlier that drives the estimate of variability and overwhelms the primary prostate cancer (T) or lymph node metastases (L). In the fourth case, SUPT3H, the normal samples appear to be more variable than the cancer samples.
Figure 4 - Plots of the log intensities of four genes that were flagged as biomarkers but not called differentially expressed
Vertical jitter has been added to distinguish individual samples. In all four cases, expression in the primary prostate cancer samples (T) and/or lymph node metastases (L) is more extreme than in the normal prostate samples (N).

Tables

Table 1 - Most promising biomarkers
The $K_{95\%}$ column shows the number of cancer samples (out of 71) with values greater than the 95th (if T > 0) or less than the 5th (if T < 0) percentile of expression in normal prostate.

Table 2 - Expected maximum tail-rank statistic under the null hypothesis
Expected maximum (over $G$ genes) of $Y_g$ under the null hypothesis as a function of $G$ and the sample size $n_C$, for different values of the family-wise type I error rate $\gamma$ and the specificity $\psi$.

Table 3 - Power in terms of sensitivity and sample size
Power as a function of the sensitivity $\phi$ to be detected and the sample size $n_C$, assuming $G = 10000$, for different values of the family-wise type I error rate $\gamma$ and the specificity $\psi$.

Additional Files

Additional file 1 - simhist.ps
PostScript version of Figure 1 (histograms based on simulations and permutations).

Additional file 2 - scissors.ps
PostScript version of Figure 2 (comparison of t-stat with tail-rank).

Additional file 3 - outlier.ps
PostScript version of Figure 3 (four genes).

Additional file 4 - borderline.ps
PostScript version of Figure 4 (four more genes).

Additional file 4 - tables.pdf
PDF file containing the three tables, each on a separate page.

Additional file 5 - SupplTableS1.xls
Excel file containing supplementary table with genes found by the t-test but not by the tail-rank test.

Additional file 6 - SupplTableS2.xls
Excel file containing supplementary table with genes found by the tail-rank test but not by the t-test.
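
As a supplement to the section on uncertainty in the threshold, the following minimal Python sketch evaluates the approximate one-sided tolerance factor described there. It is an illustration under stated assumptions, not the authors' code: it assumes that the sample size appearing in the formula (written G in the text) is the number of healthy samples, it takes the normal critical values from the standard library class statistics.NormalDist, and the function names tolerance_factor and upper_tolerance_bound are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def tolerance_factor(n, psi=0.95, gamma=0.90):
    """Approximate factor k for a one-sided upper normal tolerance bound.

    n     -- number of healthy (reference) samples; ASSUMED to be the
             sample size that the formula in the text calls G
    psi   -- fraction of the healthy population the bound should exceed
    gamma -- confidence level of the bound
    """
    std_normal = NormalDist()
    # Critical values: z_frac is exceeded with probability 1 - psi,
    # z_conf is exceeded with probability 1 - gamma.
    z_frac = std_normal.inv_cdf(psi)
    z_conf = std_normal.inv_cdf(gamma)
    a = 1.0 - z_conf ** 2 / (2.0 * n - 2.0)
    b = z_frac ** 2 - z_conf ** 2 / n
    return (z_frac + sqrt(z_frac ** 2 - a * b)) / a


def upper_tolerance_bound(values, psi=0.95, gamma=0.90):
    """Sample mean plus k times the sample standard deviation."""
    k = tolerance_factor(len(values), psi, gamma)
    return mean(values) + k * stdev(values)


if __name__ == "__main__":
    # Factors for 41 healthy samples at three confidence levels.
    for confidence in (0.90, 0.95, 0.99):
        print(confidence, round(tolerance_factor(41, 0.95, confidence), 3))
```

With 41 healthy samples and a 90% confidence level, the factor for the 95th percentile comes out just under 2, noticeably larger than the multiplier of about 1.645 used by the simple point estimate. The factor grows as the confidence level increases and shrinks toward the point-estimate multiplier as the number of healthy samples grows.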

Cancer outlier differential gene expression detection

Biostatistics (2007), 8, 3, pp. 566–575 doi:10.1093/biostatistics/kxl029 Advance Access publication on October 4, 2006 Cancer outlier differential gene expression detection BAOLIN WU Division of Biostatistics,