The tail-rank statistic for finding biomarkers from microarray data, with application to prostate cancer


Kevin R. Coombes, Jing Wang and Keith A. Baggerly

Department of Biostatistics and Applied Mathematics, University of Texas M.D. Anderson Cancer Center, 1515 Holcombe Blvd., Box 447, Houston TX USA

Kevin R. Coombes - kcoombes@mdanderson.org; Jing Wang - wang@odin.mdacc.tmc.edu; Keith A. Baggerly - kabagg@odin.mdacc.tmc.edu

Abstract

Background: High-throughput molecular biology technologies are increasingly being applied to discover biomarkers. The statistical analysis of these studies, however, is usually directed toward differential expression. As a result, it may miss important biomarkers that are only present in a subset of patients.

Methods: We introduce the tail-rank test, a novel statistical method for identifying candidate biomarkers. The tail-rank statistic uses nonparametric information from the tails of the distributions to estimate the specificity and sensitivity of individual biomarkers. We include sample size and power computations for studies using the tail-rank test, which account for multiple testing.

Results: We apply the tail-rank test to an existing microarray data set on prostate cancer, and we compare it to existing methods based on the t-test or Wilcoxon rank-sum test. We find that the tail-rank test selects different genes than the t-test or the Wilcoxon test. The methods largely agree at extreme values of either statistic. By examining the differences, however, we provide evidence that the tail-rank test can successfully identify biomarkers that pick out biologically relevant subsets of the patient samples.

Conclusions: We have described a robust, easily implemented, nonparametric statistical method for identifying candidate biomarkers in high-throughput molecular biology data.
Background

The search for biomarkers to use in the diagnosis, prognosis, or monitoring of disease is an active area of biomedical research. The availability of high-throughput molecular biology technologies (including RNA expression microarrays, serum proteomics, or array-based comparative genomic hybridization) makes it possible to perform experiments that simultaneously screen thousands of molecules to assess their value as biomarkers. The large data sets produced by these experiments present a statistical challenge to the analyst who would like to use them to discover biomarkers.

It is important to realize that a biomarker is not just a differentially expressed gene. To clarify the distinction, consider a microarray experiment that compares samples from two classes of individuals, healthy and cancer. In statistical terms, we say that a gene is differentially expressed if the distribution of expression values in the healthy samples differs from the distribution in the cancer samples. We say that a gene makes a good biomarker if a test to distinguish cancer samples from healthy samples based on the expression of this gene has high sensitivity and specificity. More generally, we might look at the area under the receiver operating characteristic (ROC) curve for a univariate test, or we might look at the ability of the gene to contribute to a multivariate test distinguishing the two classes. Distinct biologically important concepts (differential expression and good biomarker, respectively) are assessed using distinct statistical concepts (distributions and ROC curves, respectively).

Much of the statistical analysis of differential expression to date has focused on differences that can be assessed by looking at measures of central tendency. All methods based on the t-test, for instance, reduce to statements about differences in the mean expression in the two groups of samples. Rank-based tests like the Wilcoxon rank-sum test, while nonparametric, are also based on the idea that the center of the data has shifted.

However, a shift in mean expression is not enough to make a good biomarker. Consider, for example, a gene whose expression in diseased samples is 1.5-fold higher than in healthy samples. Assume that both groups of samples are normally distributed on a logarithmic scale with equal standard deviations of 0.7, which is typical in microarray experiments. The area under the receiver operating characteristic (ROC) curve associated with such a gene is only [...]. To achieve a specificity of 99%, one would have to set the threshold so high that the sensitivity would only be 6.8%. Changing the threshold to achieve a specificity of 90% would only provide a sensitivity of 32.8%.
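The arithmetic behind this example can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the logarithmic scale is base two (which is what the rest of the analysis uses), and the function name `shifted_normal_performance` is made up for this sketch.

```python
from math import log2, sqrt
from statistics import NormalDist


def shifted_normal_performance(fold_change, sd, specificity):
    """AUC and sensitivity for two equal-variance normal groups on a log2 scale.

    Assumption: the diseased group is the healthy group shifted upward by
    log2(fold_change), and both groups share the standard deviation `sd`.
    """
    std_normal = NormalDist()
    shift = log2(fold_change) / sd              # separation in SD units
    auc = std_normal.cdf(shift / sqrt(2))       # Prob(diseased value > healthy value)
    cutoff = std_normal.inv_cdf(specificity)    # threshold, in healthy SD units
    sensitivity = 1 - std_normal.cdf(cutoff - shift)
    return auc, sensitivity


auc, sens99 = shifted_normal_performance(1.5, 0.7, 0.99)
_, sens90 = shifted_normal_performance(1.5, 0.7, 0.90)
print(f"AUC = {auc:.2f}")
print(f"sensitivity at 99% specificity = {sens99:.1%}")  # 6.8%
print(f"sensitivity at 90% specificity = {sens90:.1%}")  # 32.8%
```

Under the base-two assumption the sketch reproduces the two sensitivities quoted above, which suggests the assumption matches the original calculation.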
Statistical analyses based on the center of the data are likely to miss many promising biomarkers. The problem is that, by focusing on average behavior, they test the hypothesis that all cancer patients differ from healthy individuals in the same way. In a different context, Tolstoy wrote that all happy families resemble one another, but each unhappy family is unhappy in its own way. We believe that cancer patients, like those unhappy families, differ from healthy individuals in a variety of ways. There is strong biological evidence to support this contention. Deletion of part of the short arm of chromosome 3 (3p14-p23) is found in 50% of non-small-cell lung cancers; MYC amplification is found in 14% of stomach cancers; BRCA1 mutations are found in a subset of breast cancers; a translocation between chromosomes 11 and 14 occurs in 35% of mantle cell lymphomas. Each of these genetic abnormalities directly causes specific differences in gene expression that only occur in a subset of otherwise histologically similar cancers. From a statistical perspective, these results suggest that the distributions of gene expression in cancer patients are likely to differ from the healthy distributions in much more than the location of the center.

There are, of course, statistical tools that look at other properties of distributions. The Kolmogorov-Smirnov statistic, for example, is specifically designed to test whether two distributions are the same. We do not know of any applications of the Kolmogorov-Smirnov statistic to microarrays; it may well be useful for analyzing differential expression in a more general manner than the existing tests. However, it does not address the specific problem of finding good biomarkers.

In this paper, we propose a simple, straightforward statistical test (which we call the tail-rank test) to discover biomarkers.
Because we believe that many useful biomarkers will only be present in a subset of the cancer patients, we base the test on the tails of distributions. The test we propose is nonparametric, in the sense that it makes no distributional assumptions on the measurements of the samples from the cancer patients. The tail-rank test is very closely related to the PPST that was recently introduced [1]. In this article, we provide a sound statistical basis for this test, which includes easily computed estimates of power and explicit sample size computations.

Results

Lapointe and colleagues [2] recently published a paper describing the results of microarray experiments using 41 samples of normal prostate, 62 samples of prostate cancer, and 9 samples from lymph node metastases of prostate cancer. Because the cancer samples included a clearly recognizable subset (the metastases), we felt it would provide a useful application of the tail-rank test. So, we downloaded the raw data from the Stanford Microarray Database.

Lapointe's prostate cancer experiments used glass microarrays printed with 42,129 spots containing 38,804 different cDNA clones representing 23,685 distinct UniGene clusters. The experiments were performed using the two-color fluorescence process, with a common reference material in the Cy3 channel and the experimental sample in the Cy5 channel. We performed intensity-dependent normalization on each microarray using loess [3, 4]. We further normalized the intensity of each channel by rescaling so the 75th percentile equaled 1000 and computed the base-two logarithmic ratios at each spot; all further analysis was performed on these ratios.

T-test

In order to have a baseline for comparison, we performed two-sample t-tests comparing normal prostate samples to the combined primary and metastatic prostate cancer samples, and computed p-values for each spot on the array. To adjust for multiple comparisons, we modeled the p-values as a beta-uniform mixture [5]. Using this method, we set a cutoff on the p-values to control the false discovery rate (FDR) [6]. We chose to bound the FDR to be less than 0.05, which corresponded in this data set to p < [...] or to t > [...]. Using this cutoff, we detected 3,522 differentially expressed spots representing 2,531 differentially expressed UniGene clusters. Of these, 1,094 UniGene clusters (1,415 spots) were overexpressed in prostate cancer and 1,454 UniGene clusters (2,107 spots) were underexpressed.

Wilcoxon rank sum test

We also performed Wilcoxon rank sum tests for each gene, using an empirical Bayes method to determine which Wilcoxon statistics were significant [7]. In order to get results that were comparable to the t-test, we selected a cutoff corresponding to a posterior probability of 99.9% that the Wilcoxon statistic came from a differentially expressed gene. Using this cutoff, we detected 3,627 differentially expressed spots representing 2,576 UniGene clusters.
Of these, 1,129 UniGene clusters (1,498 spots) were overexpressed and 1,447 clusters (2,129 spots) were underexpressed in prostate cancer. Not surprisingly, given the number of samples, there was good agreement between the t-test and the Wilcoxon test. More than 90% (1,905) of underexpressed and 88% (1,244) of overexpressed spots that were found by the t-test were also detected by the Wilcoxon test.

Using the tail-rank test for biomarker detection

To apply the tail-rank test, we assumed that the log ratios of the normal prostate samples were normally distributed for each gene. Using this assumption, we estimated 90% tolerance bounds for both the 5th and 95th percentiles. We then counted the number of combined primary and metastatic prostate cancer samples whose log ratios fell outside these boundaries. Based on Table 2, we identified a gene as a biomarker if at least 16 of the 71 cancer samples were below the 5% or above the 95% levels from the normal prostate. Using this method, we identified 1,359 UniGene clusters (1,766 spots) that were positive biomarkers, since they were present at higher than normal levels in at least 16 cancer samples. We also identified 1,406 UniGene clusters (1,930 spots) that were negative biomarkers, since they were expressed at lower than normal levels in at least 16 samples. In total, we identified 2,743 UniGene clusters (3,692 spots) as candidate biomarkers.

Statistical significance of the results of the tail-rank test

The theoretical basis for the tail-rank test suggests that the number of false positives it finds should be extremely small. In the prostate cancer data set, where we identified 1,766 up-regulated and 1,930 down-regulated biomarkers, it would be useful to estimate the number of false positives empirically. To get such an estimate, we carried out both simulations and a permutation test (Figure 1).
First, we simulated 42,129 genes in 112 samples arbitrarily split into 41 healthy and 71 cancer samples, which were the sizes of the actual data set. The simulation assumed that all measurements were independent and identically distributed, with a standard normal distribution. In 100 simulations, the number of false positives ranged from 0 to 7 with a median of 2. Because we do not expect genes to be statistically independent, however, we also performed a permutation test. For each permutation, we randomly scrambled the labels on the samples and repeated the tail-rank test. In 100 permutations, the number of false positives ranged from 0 to 22 with a median of 2. Although the lack of independence between the genes may have slightly increased the number of false positives, it is still extremely small compared to the total number of positive calls made by the test.

[[Insert Figure 1 here]]

Comparison between the t-test and the tail-rank test

We next compared the list of genes called differentially expressed by the t-test to the list of genes called biomarkers by the tail-rank test (Figure 2). It is clear that the two tests agree at extreme values of either statistic. Overall, there were 1,745 genes (2,363 spots) identified by both methods. However, there are also 984 differentially expressed genes found by the t-test (1,159 spots) that were not flagged as candidate biomarkers by the tail-rank test. At the same time, there were 1,142 candidate biomarkers (1,329 spots) that were not flagged as differentially expressed. The results were essentially the same when we compared the Wilcoxon test to the tail-rank test (data not shown).

[[Insert Figure 2 here]]

In order to understand the differences between the methods, we examined many of the cases where the differences were extreme. For example, we looked at all 36 genes (38 spots) that were identified as differentially expressed for which only 0 or 1 cancer sample took on an extreme value. (A complete list of these genes is contained in Supplementary Table S1.) Plots of the intensities of four such genes are shown in Figure 3. In a large majority of these cases, including LAP1B, CDC14B, and CTF1, the measured intensities of the normal prostate samples included one or more gross outliers that had a large impact on the estimates of the 5th and 95th percentiles. This problem might have been avoided by using a more robust estimation method. In the remaining cases, such as SUPT3H, the normal prostate samples appeared to be significantly more variable than the prostate cancer samples.
Although such genes do indeed appear to be differentially expressed, their level of variability in normal prostate would make them a poor choice for biomarkers.

[[Insert Figure 3 here]]

We also looked at genes that were identified as candidate biomarkers but were not identified as differentially expressed. We looked at all 46 genes (52 spots) that were candidate biomarkers whose absolute t-statistic was less than [...]. (A complete list of these genes is contained in Supplementary Table S2.) This set included two kinds of genes (Figure 4). Some genes, including GDF11, simply appeared to be more variable in cancer samples than in normal prostate. It is unlikely that these genes would make useful biomarkers, but they may still provide information about pathways that are dysregulated in cancer. Other genes, including CANX, appeared to achieve the goal of identifying a biologically relevant subset of the cancer samples.

[[Insert Figure 4 here]]

The most promising biomarkers

Finally, we looked at all 53 spots (40 genes) for which more than 52 of the 71 combined primary and metastatic prostate cancer samples had expression levels either above the 95th or below the 5th percentile for normal prostate. We chose a cutoff of 52 since this level corresponds to a posterior estimate of sensitivity of 40% under the skeptical prior. A complete list of the genes is contained in Table 1.

[[Insert Table 1 here]]

Discussion

The most promising marker identified in this data set is caveolin-1 (CAV1), which occupies the top two spots in the table. Based on these results, CAV1 appears to be about 4-fold underexpressed in prostate cancer cells, and an additional 4-fold underexpressed in lymph node metastases. CAV1 has previously been proposed as a candidate tumor suppressor [10] and a negative regulator of the Ras-p42/44 MAP kinase cascade [11].
Although overexpression of CAV1 has been reported to promote cell survival in a mouse model of prostate cancer [12], its pattern of expression in benign prostate and androgen-sensitive human prostate cancer is more consistent with its role as a tumor suppressor [13, 14]. The caveolin-2 gene (CAV2) is located adjacent to CAV1 on chromosome 7, and it displays a parallel expression pattern in this data set. Its underexpression is repeatedly identified as a marker of prostate cancer, and it is expressed at even lower levels in lymph node metastases. Both CAV1 and CAV2 have also been seen to be underexpressed in lung cancer cell lines [15] and in human sarcomas [16].

A number of other interesting genes are identified as important markers. Connexin-43 (GJA1) is underexpressed and connexin-32 (GJB1) is overexpressed in prostate cancer compared to normal prostate. Alterations in connexin levels have been reported previously in prostate cancer [17, 18], and it has been suggested that the ratio of connexin-43 to connexin-32 is important [19]. The alpha-methylacyl coenzyme A racemase (AMACR) gene has also previously been identified as a potential marker of prostate cancer [20].

One of the most interesting examples of a gene that picks out a subset of the cancer samples is calnexin (CANX). Five different clones represent calnexin on these microarrays. All five spots containing these clones were selected by the tail-rank test, even though the t-statistics were insignificant (0.80, 0.82, 0.89, 0.92, and 2.40). Depending on the clone, between 16 and 20 of the prostate cancer samples had expression levels that were higher than the 95th percentile of the expression in normal prostate. Interestingly, between 6 and 8 of the 9 lymph node metastases had levels that were below the 5th percentile of normal, and 8 lymph node metastases had levels that were well below the mean for the primary prostate cancers. This finding is particularly intriguing since it has recently been reported that downregulation of calnexin increases the metastatic potential of melanoma cells [8]. The GITA gene is also interesting; it has recently been shown to be identical to a thyroid adenoma associated gene (THADA) that encodes a death-receptor interacting domain [9].

Conclusions

In this paper, we have introduced a novel method, the tail-rank test, for identifying potential biomarkers from a high-throughput microarray study, based on the idea that a marker can prove valuable if it reliably picks out a subset of the samples. The tail-rank test can be applied without making any distributional assumptions. We have also provided sample size and power computations that account for multiple testing.
We have applied this method to a prostate cancer data set, where it identified a number of interesting potential biomarkers, some of which are novel and others of which have been reported previously.

Methods

We assume that we have collected data on $G$ genes or proteins from $n_H$ healthy individuals. We let $X_{g,i}$ be the random variable representing the measurement of gene $g = 1, 2, \ldots, G$ on individual $i = 1, 2, \ldots, n_H$. We assume for fixed $g$ that the $X_{g,i} \sim X_g$ are independent and identically distributed. Next, we specify a target value $\psi$ that represents the desired specificity of a (univariate) test to distinguish healthy individuals from cancer patients. The first step in our proposed method is to estimate, for each $g$, a threshold $\tau_g$ such that $\mathrm{Prob}(X_g < \tau_g) = \psi$. In practical terms, we can compute $\tau_g$ using either parametric or nonparametric methods. If we collect enough samples from healthy individuals, we can estimate $\tau_g$ empirically. Alternatively, if we are willing to make distributional assumptions on the measurements for healthy individuals, such as assuming that they are normally distributed on the log scale, then we can estimate $\tau_g$ by fitting the model parameters from the data.

Given the desired specificity (and the threshold estimates), the second step in the tail-rank test is to estimate the sensitivity of a test for cancer based on gene $g$. To make this estimate, we collect data from $n_C$ cancer patients, and we observe the value of the random variable $Y_g$ that counts the number of cancer patients for which the measured expression level of gene $g$ exceeds $\tau_g$; we call $Y_g$ the tail-rank statistic. We will call $g$ a biomarker provided $Y_g$ exceeds a threshold that we specify in the next section.

Significance thresholds based on the null distribution

We understand the behavior of the tail-rank statistic under the null hypothesis that gene $g$ is not a useful biomarker.
More precisely, we use the null hypothesis that the measurements of $g$ on the cancer patients are independent and have the same distribution as the measurements from the healthy individuals. If this is the case, then all $Y_g$ have identical binomial distributions, $Y_g \sim Y = \mathrm{Binom}(n_C, 1 - \psi)$. Even when we perform such tests for $G$ genes, with $G$ large, the expected maximum value of $G$ independent instances of $Y_g$ remains small. To see this, first let $M_G = \max_{g = 1, \ldots, G} Y_g$ denote the maximum over $G$ independent, identical, binomial random variables. Also, write $\alpha = \alpha(m) = \mathrm{Prob}(Y > m)$. Finally, let $\gamma = \mathrm{Prob}(M_G > m)$ be the desired level of control on the family-wise Type I error rate. Then

$$1 - \gamma = \mathrm{Prob}(M_G \le m) = \mathrm{Prob}(Y_1 \le m, \ldots, Y_G \le m) = \mathrm{Prob}(Y \le m)^G = (1 - \alpha)^G.$$

We can determine the value of $m$ as the $(1 - \alpha)$th quantile of a single binomial distribution by first solving for $\alpha = 1 - (1 - \gamma)^{1/G}$. One can replace the exact answer by a Bonferroni-like approximation, since $(1 - \alpha)^G \approx 1 - \alpha G$ leads to $\alpha \approx \gamma / G$. The results are illustrated in Table 2, which shows the expected maximum value of the tail-rank statistic $Y_g$ for various values of the family-wise control parameter $\gamma$, the number $n_C$ of cancer samples, the specificity $\psi$, and the number $G$ of genes.

[[Insert Table 2 here]]

As an example of the use of Table 2, suppose we are interested in finding biomarkers with a specificity of at least 99%. Assume that we perform microarray experiments using 50 cancer samples (and enough healthy samples to estimate the 99th percentile). If we just look at one gene $g$, the number $Y_g$ of cancer samples whose expression exceeds the 99th percentile by chance is expected to equal $E[Y_g] = 50 \times (1 - 0.99) = 0.5 < 1$. However, suppose the microarray contains 10,000 genes. Just by chance, some values of $Y_g$ will be greater than 1. To determine the maximum that can occur, we need to choose an overall significance level, which we get by bounding the family-wise error rate (FWER). In this example, we take FWER < 5%. Now, look at the subtable of Table 2 with $\gamma = 0.05$ and $\psi = 0.99$. In the column with $G = 10{,}000$ and the row with $n_C = 50$, we find the value 6. This tells us that in 95% of the experiments that measure 10,000 genes on 50 cancer patients, we should not see more than 6 cancer patients whose expression levels exceed the 99th percentile for any gene. Thus, any genes for which $Y_g \ge 7$ are potential biomarkers. As in the example, whenever the observed value of the tail-rank statistic $Y_g$ for a gene $g$ exceeds the value in Table 2, we can conclude with a high degree of confidence $(1 - \gamma)$ that $g$ is a candidate biomarker.
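The two steps of the method and the family-wise cutoff can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are invented here, the threshold uses a crude empirical order statistic (the paper itself uses normal-theory tolerance bounds for the prostate data), and the toy measurements are made up.

```python
from math import comb, expm1, log1p


def empirical_threshold(healthy, psi):
    """Crude empirical estimate of tau_g: an order statistic of the healthy values."""
    ordered = sorted(healthy)
    k = min(len(ordered) - 1, int(psi * len(ordered)))
    return ordered[k]


def tail_rank(cancer, tau):
    """Tail-rank statistic Y_g: number of cancer values strictly above tau."""
    return sum(1 for y in cancer if y > tau)


def binom_upper_tail(n, p, m):
    """Prob(Y > m) for Y ~ Binom(n, p), summed term by term."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(m + 1, n + 1))


def critical_value(n_c, psi, n_genes, gamma):
    """Smallest m with Prob(Y > m) <= alpha, where alpha = 1 - (1 - gamma)**(1/G)."""
    alpha = -expm1(log1p(-gamma) / n_genes)  # numerically stable form of alpha
    m = 0
    while binom_upper_tail(n_c, 1 - psi, m) > alpha:
        m += 1
    return m


# Worked example from the text: 50 cancer samples, 99% specificity, 10,000 genes.
m = critical_value(n_c=50, psi=0.99, n_genes=10_000, gamma=0.05)
print(m)  # 6, so genes with Y_g >= 7 would be flagged

# Toy data, made up purely to show the two steps on one gene.
healthy = [0.1 * i for i in range(100)]
cancer = [0.0, 2.0, 9.6, 9.9, 12.0]
tau = empirical_threshold(healthy, psi=0.95)
print(tail_rank(cancer, tau))
```

The upper tail is summed directly rather than computed as one minus the cumulative distribution, because the tail probabilities of interest are of order $\gamma / G$ and would otherwise be lost to rounding.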
A key point to observe from this table is that the tail-rank test is relatively insensitive to the multiple testing problem that afflicts most methods for analyzing microarray data. In the usual framework, a Bonferroni bound on the FWER is extremely conservative and greatly inflates the statistic needed to reject the null hypothesis. By contrast, the bound for the tail-rank statistic grows very slowly as a function of the number of hypotheses tested.

Power and sample size considerations

In order to compute the power of the tail-rank test, we first fix the significance level $\gamma$, the specificity $\psi$, and the number $G$ of genes. Given these values, we estimate the expected maximum value of $G$ independent instances of $Y_g$ under the null hypothesis as a function of the sample size $n_C$. As before, we compute this estimate as the $(1 - \alpha) = (1 - \gamma)^{1/G}$ quantile of the null binomial distribution $\mathrm{Binom}(n_C, 1 - \psi)$. To express this notion compactly, for a binomial random variable $X \sim \mathrm{Binom}(N, p)$, we write $F(x \mid N, p) = \mathrm{Prob}(X \le x)$ for its cumulative distribution function. Now, the expected maximum value $m = E[M_G]$ of $Y_g$ over $G$ genes satisfies

$$F(m \mid n_C, 1 - \psi) = (1 - \gamma)^{1/G},$$

or, equivalently,

$$m = m(G, n_C, \psi) = F^{-1}\left((1 - \gamma)^{1/G} \mid n_C, 1 - \psi\right).$$

Since we identify a gene as a biomarker if the observed value of $Y_g$ is larger than $m$, the power $\pi$ to detect a biomarker whose true sensitivity equals $\phi$ is given by

$$\pi = \mathrm{Prob}(\mathrm{Binom}(n_C, \phi) > m) = 1 - F(m \mid n_C, \phi).$$

Thus, it is straightforward to compute the power provided we are given the sample size, the sensitivity, and the number of genes. The results of such computations are illustrated in Table 3. As an example, look at the subtable corresponding to $\gamma = 0.05$ and $\psi = [\ldots]$. From the table, we see that even 500 samples are not enough to detect a biomarker that is present in only 10% of the cancer patients.
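The power formula $\pi = 1 - F(m \mid n_C, \phi)$ can be evaluated directly. The sketch below is an assumption-laden illustration rather than a reproduction of Table 3: the helper names are invented, and the settings in the loop ($\psi = 0.99$, $G = 10{,}000$, $\phi = 0.3$) are arbitrary choices for demonstration, not the values behind the table.

```python
from math import comb, expm1, log1p


def binom_upper_tail(n, p, m):
    """Prob(Y > m) for Y ~ Binom(n, p), summed term by term."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(m + 1, n + 1))


def critical_value(n_c, psi, n_genes, gamma):
    """Smallest m with Prob(Binom(n_c, 1 - psi) > m) <= 1 - (1 - gamma)**(1/G)."""
    alpha = -expm1(log1p(-gamma) / n_genes)
    m = 0
    while binom_upper_tail(n_c, 1 - psi, m) > alpha:
        m += 1
    return m


def power(n_c, psi, phi, n_genes, gamma):
    """Prob(Binom(n_c, phi) > m): chance that a marker with sensitivity phi is flagged."""
    m = critical_value(n_c, psi, n_genes, gamma)
    return binom_upper_tail(n_c, phi, m)


# Hypothetical settings, chosen only to show how power grows with sample size.
for n_c in (10, 50, 100):
    print(n_c, round(power(n_c, psi=0.99, phi=0.3, n_genes=10_000, gamma=0.05), 3))
```

Because the cutoff $m$ depends on $n_C$ through the null distribution, it is recomputed for every sample size before the tail probability under $\phi$ is evaluated.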
By contrast, 100 samples have enough power (> 70%) to reliably detect biomarkers with a sensitivity of 20%, fewer than 50 samples are needed to detect a biomarker with a sensitivity of 30%, and as few as 10 samples will suffice to detect biomarkers with a sensitivity of 70%.

[[Insert Table 3 here]]

Bayesian estimates of sensitivity

Our goal in biomarker discovery is not just to find candidate biomarkers, but also to estimate how well they perform. For each gene $g$, we are ultimately trying to use the observed data $Y_g$ to estimate the value of a parameter $\phi_g$, which is the sensitivity of a test for cancer based on whether the expression level of gene $g$ exceeds $\tau_g$, the $\psi$th quantile of expression in healthy individuals. In other words, we use the model $Y_g \sim \mathrm{Binom}(n_C, \phi)$, with $n_C$ a fixed part of the experimental design. Because $0 \le \phi_g \le 1$, it is convenient to place a beta prior distribution on the sensitivity. We write $\phi \sim \mathrm{Beta}(\alpha_0, \beta_0)$ for some choice of the hyperparameters $\alpha_0$ and $\beta_0$. If we observe $Y_g = y$, then the posterior distribution of the sensitivity is another beta distribution given by

$$(\phi \mid Y_g = y) \sim \mathrm{Beta}(\alpha_0 + y, \beta_0 + n_C - y).$$

We now consider how to choose reasonable values for the hyperparameters. One possibility is to use an uninformative prior. In this case, we might use $\mathrm{Beta}(1, 1)$, which is just the uniform distribution. The observed data would overwhelm this prior for even a modest number of cancer samples. Because of multiple testing, however, we strongly suspect that this method would overestimate the sensitivity of some biomarkers. To see why, suppose we measure the expression value of 10,000 genes on a large number of healthy individuals and estimate the gene-specific thresholds corresponding to the 95th percentile of healthy expression. Suppose we then measure the expression of all 10,000 genes on 100 cancer patients, and we find a gene for which 20 of the cancer patients have expression levels that exceed the threshold for that gene. A frequentist estimate of the sensitivity of such a cancer test is 20/100 = 20%. A Bayesian estimate using the uniform prior actually increases this value by shrinking it toward the prior mean of 50%; the expected value of the $\mathrm{Beta}(21, 81)$ distribution is 21/102 = 20.6%. In this situation, however, a gene that does not distinguish between healthy and cancer samples would have a sensitivity of 5% and a specificity of 95%. From Table 2, we would not be surprised to find at least one such gene for which 15 out of 100 samples exceed the threshold.
Should an increase from 15 to 20 really be enough to change our assessment of a gene from not useful to 20% sensitive? The problem with the uniform prior is that it is overly optimistic with highly multiple testing. With its mean of 1/2, the prior suggests that the most likely sensitivity for a randomly chosen gene is 50%. But good biomarkers are rare, as can be easily demonstrated by looking at the biomedical literature from the beginning of time. In light of the historical difficulty of finding useful biomarkers, it seems reasonable to use a more skeptical prior.

We can construct a skeptical prior by building on what we know about the behavior of the tail-rank statistic. As above, we have $Y_g \sim \mathrm{Binom}(n_C, 1 - \psi)$ under the null hypothesis. A skeptical prior would assume that most genes are poor biomarkers, implying that the typical sensitivity $\phi$ would be close to $1 - \psi$. Thus, we should use a beta prior whose expected value is $1 - \psi$. There is a one-dimensional family of such priors, which we write as $\mathrm{Beta}(w(1 - \psi), w\psi)$ for some weight hyperparameter $w$. Increasing values of $w$ provide increasingly skeptical priors.

We return to our earlier example, where we find a gene $g$ for which 20 out of 100 cancer samples have measurements that exceed the 95th percentile of healthy expression. Then $\psi = 0.95$ and the posterior expectation of the sensitivity $\phi_g$ as a function of the weight $w$ is equal to $(0.05w + 20)/(w + 100)$. When $w = 0$, this formula discards the prior and uses the frequentist estimate of the sensitivity as 20%. When $w = 2$, we impose a weakly skeptical prior and get a posterior expectation of 19.7%. When $w = 99$, we impose a strongly skeptical prior and get a posterior expectation of 12.5%. Even this skeptical value, however, is significantly larger than the sensitivity of 5% that we would expect to see from a gene that was truly useless as a biomarker.

One can make an argument that reasonable weights are bounded by $0 \le w \le n_C - 1$.
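The posterior means quoted in this example follow directly from the beta-binomial update. The sketch below is a minimal illustration with invented function names; it computes the posterior mean $(\alpha_0 + y)/(\alpha_0 + \beta_0 + n_C)$ for a general beta prior and for the skeptical family $\mathrm{Beta}(w(1 - \psi), w\psi)$.

```python
def beta_posterior_mean(alpha0, beta0, y, n_c):
    """Mean of Beta(alpha0 + y, beta0 + n_c - y), the posterior for the sensitivity."""
    return (alpha0 + y) / (alpha0 + beta0 + n_c)


def skeptical_posterior_mean(w, psi, y, n_c):
    """Posterior mean under the skeptical prior Beta(w * (1 - psi), w * psi)."""
    return beta_posterior_mean(w * (1 - psi), w * psi, y, n_c)


# Example from the text: 20 of 100 cancer samples exceed the 95th percentile.
print(f"{beta_posterior_mean(1, 1, 20, 100):.1%}")            # uniform prior: 20.6%
print(f"{skeptical_posterior_mean(0, 0.95, 20, 100):.1%}")    # w = 0:  20.0%
print(f"{skeptical_posterior_mean(2, 0.95, 20, 100):.1%}")    # w = 2:  19.7%
print(f"{skeptical_posterior_mean(99, 0.95, 20, 100):.1%}")   # w = 99: 12.5%
```

Setting $w = 0$ corresponds to an improper prior, so it is best read as the limiting case that recovers the frequentist estimate rather than as a prior one would actually sample from.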
The binomial distribution $Y \sim \mathrm{Binom}(n_C, 1 - \psi)$ that corresponds to the null hypothesis yields a sampling distribution for $Z = Y / n_C$ with mean $E[Z] = 1 - \psi$ and variance $\mathrm{Var}[Z] = (1 - \psi)\psi / n_C$. By equating both the mean and the variance of the beta prior to these sampling values, we find that $w = n_C - 1$. This corresponds to the distribution we expect to see if none of the genes on the microarray provides any useful biomarker information, and so has a legitimate claim to be the most skeptical prior that should be considered.

Dealing with uncertainty in the threshold

Our discussion of the tail-rank test does not yet account for the uncertainty in the estimate of the thresholds $\tau_g$ based on the samples from healthy individuals. This problem is addressed by constructing a statistical tolerance interval (more precisely, a one-sided tolerance bound) that contains a given fraction, $\psi$, of the population with a given confidence level, $\gamma$ [21]. With enough samples, one can obtain distribution-free tolerance bounds (op. cit., Chapter 5). For instance, one can use bootstrap or jackknife methods to estimate these bounds empirically. Here,

however, we assume that the measurements of the log expression of gene $g$ in healthy individuals are normally distributed. We let $\bar{X}$ be the sample mean and let $s$ be the sample standard deviation. The upper tolerance bound that, $100\gamma\%$ of the time, exceeds $100\psi\%$ of $G$ values from a normal distribution is approximated by

$$X_U = \bar{X} + k_{\gamma,\psi}\, s,$$

where

$$k_{\gamma,\psi} = \frac{z_\psi + \sqrt{z_\psi^2 - ab}}{a}, \qquad a = 1 - \frac{z_{1-\gamma}^2}{2G - 2}, \qquad b = z_\psi^2 - \frac{z_{1-\gamma}^2}{G},$$

and, for any $\pi$, $z_\pi$ is the critical value of the normal distribution that is exceeded with probability $\pi$ [22].

For example, suppose, as in the prostate study, that we collect data on 41 healthy individuals. A simple point estimate of the 95th percentile is given by $\tau = \bar{X} + [\ldots]\, s$. However, only the mean of the distribution is less than this value 95% of the time. Almost half of the time (43.5%), fewer than 95% of the observed values will be less than $\tau$. The 90% tolerance bound on the 95th percentile is $\bar{X} + [\ldots]\, s$, the 95% tolerance bound is $\bar{X} + [\ldots]\, s$, and the 99% tolerance bound is $\bar{X} + [\ldots]\, s$.

Authors' contributions

All authors participated in developing the new method, analyzing the data, and writing the paper.

Acknowledgements

This work was supported in part by grants from the National Cancer Institute/National Institutes of Health (P30 CA and P50 CA91846) and from the Goodwin Foundation.

References

1. Lyons-Weiler J, Patel S, Becich MJ, E GT: Tests for finding complex patterns of differential expression in cancers: towards individualized medicine. BMC Bioinformatics 2004, 5:
2. Lapointe J, Li C, Higgins JP, van de Rijn M, Bair E, Montgomery K, Ferrari M, Egevad L, Rayford W, Bergerheim U, Ekman P, DeMarzo AM, Tibshirani R, Botstein D, Brown PO, Brooks JD, Pollack JR: Gene expression profiling identifies clinically relevant subtypes of prostate cancer. Proc Natl Acad Sci USA 2004, 101:
3. Dudoit S, Yang YH, Callow MJ, Speed TP: Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica 2002, 12:
4. Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, Speed TP: Normalization for cDNA microarray data: a robust composite method [...]. Nucleic Acids Res 2002, 30:e
5. Pounds S, Morris SW: Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of p-values. Bioinformatics 2003, 19:
6. Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Royal Statist Soc, Series B 1995, 57:
7. Efron B, Tibshirani R: Empirical Bayes methods and false discovery rates for microarrays. Genet Epidemiol 2002, 23:
8. Dissemond J, Busch M, Mors J, Weimann TK, Lindeke A, Goos M, Wagner SN: Differential downregulation of endoplasmic reticulum-residing chaperones calnexin and calreticulin in human metastatic melanoma. Cancer Lett 2004, 203:
9. Rippe V, Drieschner N, Meiboom M, Murua Escobar H, Bonk U, Belge G, Bullerdiek J: Identification of a gene rearranged by 2p21 aberrations in thyroid adenomas. Oncogene 2003, 22:
10. Engelman JA, Zhang XL, Galbiati F, Lisanti MP: Chromosomal localization, genomic organization, and developmental expression of the caveolin gene family (Cav-1, -2, and -3) Cav-1 and Cav-2 genes map to a known tumor suppressor locus (6-A2/7q31). FEBS Lett 1998, 249:
11. Galbiati F, Volante D, Engelman JA, Watanabe G, Burk R, Pestell RG, Lisanti MP: Targeted downregulation of caveolin-1 is sufficient to drive cell transformation and hyperactivate the p42/44 MAP kinase cascade. EMBO J 1998, 17:
12. Thompson TC, Timme TL, Li L, Goltsov A: Caveolin-1, a metastasis-related gene that promotes cell survival in prostate cancer. Apoptosis 1999, 4:
13. Pflug BR, Reiter RE, Nelson JB: Caveolin expression is decreased following androgen deprivation in human prostate cancer cell lines. Prostate 1999, 40:
14. Soos G, Haas GP, Wang CY, Jones RF: Differential gene expression in human prostate cancer cells adapted to growth in bone in Beige mice. Urol Oncol 2003, 21:
15. Racine C, Belanger M, Hirabayashi H, Boucher M, Chakir J, Couet J: Reduction of caveolin 1 gene expression in lung carcinoma cell lines. Biochem Biophys Res Commun 1999, 255:
16. Wiechen K, Sers C, Agoulnik A, Arlt K, Dietel M, Schlag PM, Schneider U: Down-regulation of caveolin-1, a candidate tumor suppressor gene, in sarcomas. Am J Pathol 2001, 158:

17. Tsai H, Werber J, Davia MO, Edelman M, Tanaka KE, Melman A, Christ GJ, Geliebter J: Reduced connexin 43 expression in high grade, human prostatic adenocarcinoma cells. Biochem Biophys Res Commun 1996, 227:
18. Habermann H, Ray V, Habermann W, Prins GS: Alterations in gap junction protein expression in human benign prostatic hyperplasia and prostate cancer. J Urol 2002, 167:
19. Carruba G, Stefano R, Cocciadifero L, Saladino F, Di Cristina A, Tokar E, Quader ST, Webber MM, Castagnetta L: Intercellular communication and human prostate carcinogenesis. Ann N Y Acad Sci 2002, 963:
20. Rubin MA, Zhou M, Dhanasekaran SM, Varambally S, Barrette TR, Sanda MG, Pienta KJ, Ghosh D, Chinnaiyan AM: alpha-methylacyl coenzyme A racemase as a tissue biomarker for prostate cancer. JAMA 2002, 287:
21. Hahn GJ, Meeker WQ: Statistical Intervals: A Guide for Practitioners. New York: John Wiley and Sons
22. Natrella MG: Experimental Statistics, NBS Handbook 91. Washington DC: National Bureau of Standards

Figures

Figure 1 - Histogram of the number of false positives
This figure shows the number of false positives found among 100 simulations of i.i.d. normal data, and among 100 permutations of the prostate cancer data set.

Figure 2 - Comparison between the t-statistic and the tail-rank statistic
Differentially expressed genes lie outside the horizontal lines. Biomarkers lie above the vertical line.

Figure 3 - Plots of the log intensities of four genes that were called differentially expressed but not flagged as biomarkers
Vertical jitter has been added to distinguish individual samples. In three cases, the normal samples (N) include an extreme outlier that drives the estimate of variability and overwhelms the primary prostate cancer (T) or lymph node metastases (L). In the fourth case, SUPT3H, the normal samples appear to be more variable than the cancer samples.

Figure 4 - Plots of the log intensities of four genes that were flagged as biomarkers but not called differentially expressed
Vertical jitter has been added to distinguish individual samples. In all four cases, expression in the primary prostate cancer samples (T) and/or lymph node metastases (L) is more extreme than in the normal prostate samples (N).

Tables

Table 1 - Most promising biomarkers
The $K_{95\%}$ column shows the number of cancer samples (out of 71) with values greater than the 95th (if $T > 0$) or less than the 5th (if $T < 0$) percentile of expression in normal prostate.

Table 2 - Expected maximum tail-rank statistic under the null hypothesis
Expected maximum (over $G$ genes) of $Y_g$ under the null hypothesis as a function of $G$ and the sample size $n_C$, for different values of the family-wise type I error rate $\gamma$ and the specificity $\psi$.

Table 3 - Power in terms of sensitivity and sample size
Power as a function of the sensitivity $\phi$ to be detected and the sample size $n_C$, assuming $G = 10000$, for different values of the family-wise type I error rate $\gamma$ and the specificity $\psi$.

Additional Files

Additional file 1 - simhist.ps
PostScript version of Figure 1 (histograms based on simulations and permutations).

Additional file 2 - scissors.ps
PostScript version of Figure 2 (comparison of t-statistic with tail-rank).

Additional file 3 - outlier.ps
PostScript version of Figure 3 (four genes).

Additional file 4 - borderline.ps
PostScript version of Figure 4 (four more genes).

Additional file 4 - tables.pdf
PDF file containing the three tables, each on a separate page.

Additional file 5 - SupplTableS1.xls
Excel file containing supplementary table with genes found by the t-test but not by the tail-rank test.

Additional file 6 - SupplTableS2.xls
Excel file containing supplementary table with genes found by the tail-rank test but not by the t-test.
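The normal-theory tolerance bound from the section on uncertainty in the threshold can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. The function names `tolerance_factor` and `upper_tolerance_bound` are hypothetical, the argument `n` plays the role of the number of healthy samples in the formula for $k_{\gamma,\psi}$, and the code assumes that both critical values enter as positive upper-tail quantiles (about 1.645 when $\psi = 0.95$).

```python
from statistics import NormalDist


def tolerance_factor(n, psi=0.95, gamma=0.90):
    """Approximate one-sided normal tolerance factor k_{gamma,psi}.

    n     -- number of healthy samples
    psi   -- fraction of the population the bound should exceed
    gamma -- confidence level of the bound
    """
    std_normal = NormalDist()
    # Assumption: both critical values are taken as positive quantiles.
    z_psi = std_normal.inv_cdf(psi)
    z_gamma = std_normal.inv_cdf(gamma)  # exceeded with probability 1 - gamma
    a = 1.0 - z_gamma**2 / (2 * n - 2)
    b = z_psi**2 - z_gamma**2 / n
    return (z_psi + (z_psi**2 - a * b) ** 0.5) / a


def upper_tolerance_bound(values, psi=0.95, gamma=0.90):
    """Return sample mean + k * sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / (n - 1)) ** 0.5
    return mean + tolerance_factor(n, psi, gamma) * sd
```

With 41 samples and $\psi = 0.95$, this sketch returns multipliers of about 1.99, 2.11 and 2.36 for confidence levels of 90%, 95% and 99%. These are outputs of the approximation as coded here, not figures quoted from the paper. Each exceeds the naive multiplier of 1.645, which is the point of replacing the point estimate with a tolerance bound.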

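The moment-matching step that yields $w = n_C - 1$ for the skeptical beta prior can be checked numerically. The sketch below assumes that $w$ denotes the total weight $\alpha + \beta$ of a $\mathrm{Beta}(\alpha, \beta)$ prior, a parameterization that is consistent with the stated result but is not spelled out in this part of the text. The function names are hypothetical, and the value $n_C = 71$ is taken from the number of cancer samples mentioned in the caption of Table 1.

```python
def skeptical_beta_prior(n_c, psi):
    """Beta parameters whose mean and variance match Z = Y / n_c under the null."""
    mean = 1.0 - psi   # E[Z] under Binom(n_c, 1 - psi)
    w = n_c - 1        # assumption: w = alpha + beta
    return mean * w, (1.0 - mean) * w


def beta_variance(alpha, beta):
    """Variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    return alpha * beta / (total**2 * (total + 1))


alpha, beta = skeptical_beta_prior(n_c=71, psi=0.95)
sampling_variance = (1 - 0.95) * 0.95 / 71  # Var[Z] = (1 - psi) * psi / n_c
prior_variance = beta_variance(alpha, beta)
```

Under these assumptions the prior is Beta(3.5, 66.5), and `prior_variance` agrees with `sampling_variance` up to floating-point error, because a beta distribution with mean $\mu$ and total weight $w$ has variance $\mu(1 - \mu)/(w + 1)$.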
Cancer outlier differential gene expression detection

Cancer outlier differential gene expression detection Biostatistics (2007), 8, 3, pp. 566 575 doi:10.1093/biostatistics/kxl029 Advance Access publication on October 4, 2006 Cancer outlier differential gene expression detection BAOLIN WU Division of Biostatistics,

More information

Comments on Significance of candidate cancer genes as assessed by the CaMP score by Parmigiani et al.

Comments on Significance of candidate cancer genes as assessed by the CaMP score by Parmigiani et al. Comments on Significance of candidate cancer genes as assessed by the CaMP score by Parmigiani et al. Holger Höfling Gad Getz Robert Tibshirani June 26, 2007 1 Introduction Identifying genes that are involved

More information

Practical Experience in the Analysis of Gene Expression Data

Practical Experience in the Analysis of Gene Expression Data Workshop Biometrical Analysis of Molecular Markers, Heidelberg, 2001 Practical Experience in the Analysis of Gene Expression Data from Two Data Sets concerning ALL in Children and Patients with Nodules

More information

Comparison of Gene Set Analysis with Various Score Transformations to Test the Significance of Sets of Genes

Comparison of Gene Set Analysis with Various Score Transformations to Test the Significance of Sets of Genes Comparison of Gene Set Analysis with Various Score Transformations to Test the Significance of Sets of Genes Ivan Arreola and Dr. David Han Department of Management of Science and Statistics, University

More information

MOST: detecting cancer differential gene expression

MOST: detecting cancer differential gene expression Biostatistics (2008), 9, 3, pp. 411 418 doi:10.1093/biostatistics/kxm042 Advance Access publication on November 29, 2007 MOST: detecting cancer differential gene expression HENG LIAN Division of Mathematical

More information

Single SNP/Gene Analysis. Typical Results of GWAS Analysis (Single SNP Approach) Typical Results of GWAS Analysis (Single SNP Approach)

Single SNP/Gene Analysis. Typical Results of GWAS Analysis (Single SNP Approach) Typical Results of GWAS Analysis (Single SNP Approach) High-Throughput Sequencing Course Gene-Set Analysis Biostatistics and Bioinformatics Summer 28 Section Introduction What is Gene Set Analysis? Many names for gene set analysis: Pathway analysis Gene set

More information

The Tail Rank Test. Kevin R. Coombes. July 20, Performing the Tail Rank Test Which genes are significant?... 3

The Tail Rank Test. Kevin R. Coombes. July 20, Performing the Tail Rank Test Which genes are significant?... 3 The Tail Rank Test Kevin R. Coombes July 20, 2009 Contents 1 Introduction 1 2 Getting Started 1 3 Performing the Tail Rank Test 2 3.1 Which genes are significant?..................... 3 4 Power Computations

More information

Gene Selection for Tumor Classification Using Microarray Gene Expression Data

Gene Selection for Tumor Classification Using Microarray Gene Expression Data Gene Selection for Tumor Classification Using Microarray Gene Expression Data K. Yendrapalli, R. Basnet, S. Mukkamala, A. H. Sung Department of Computer Science New Mexico Institute of Mining and Technology

More information

MODEL-BASED CLUSTERING IN GENE EXPRESSION MICROARRAYS: AN APPLICATION TO BREAST CANCER DATA

MODEL-BASED CLUSTERING IN GENE EXPRESSION MICROARRAYS: AN APPLICATION TO BREAST CANCER DATA International Journal of Software Engineering and Knowledge Engineering Vol. 13, No. 6 (2003) 579 592 c World Scientific Publishing Company MODEL-BASED CLUSTERING IN GENE EXPRESSION MICROARRAYS: AN APPLICATION

More information

Doing Thousands of Hypothesis Tests at the Same Time. Bradley Efron Stanford University

Doing Thousands of Hypothesis Tests at the Same Time. Bradley Efron Stanford University Doing Thousands of Hypothesis Tests at the Same Time Bradley Efron Stanford University 1 Simultaneous Hypothesis Testing 1980: Simultaneous Statistical Inference (Rupert Miller) 2, 3,, 20 simultaneous

More information

Lecture Outline. Biost 590: Statistical Consulting. Stages of Scientific Studies. Scientific Method

Lecture Outline. Biost 590: Statistical Consulting. Stages of Scientific Studies. Scientific Method Biost 590: Statistical Consulting Statistical Classification of Scientific Studies; Approach to Consulting Lecture Outline Statistical Classification of Scientific Studies Statistical Tasks Approach to

More information

Unit 1 Exploring and Understanding Data

Unit 1 Exploring and Understanding Data Unit 1 Exploring and Understanding Data Area Principle Bar Chart Boxplot Conditional Distribution Dotplot Empirical Rule Five Number Summary Frequency Distribution Frequency Polygon Histogram Interquartile

More information

Lecture Outline Biost 517 Applied Biostatistics I

Lecture Outline Biost 517 Applied Biostatistics I Lecture Outline Biost 517 Applied Biostatistics I Scott S. Emerson, M.D., Ph.D. Professor of Biostatistics University of Washington Lecture 2: Statistical Classification of Scientific Questions Types of

More information

Application of Resampling Methods in Microarray Data Analysis

Application of Resampling Methods in Microarray Data Analysis Application of Resampling Methods in Microarray Data Analysis Tests for two independent samples Oliver Hartmann, Helmut Schäfer Institut für Medizinische Biometrie und Epidemiologie Philipps-Universität

More information

The 16th KJC Bioinformatics Symposium Integrative analysis identifies potential DNA methylation biomarkers for pan-cancer diagnosis and prognosis

The 16th KJC Bioinformatics Symposium Integrative analysis identifies potential DNA methylation biomarkers for pan-cancer diagnosis and prognosis The 16th KJC Bioinformatics Symposium Integrative analysis identifies potential DNA methylation biomarkers for pan-cancer diagnosis and prognosis Tieliu Shi tlshi@bio.ecnu.edu.cn The Center for bioinformatics

More information

Case Studies on High Throughput Gene Expression Data Kun Huang, PhD Raghu Machiraju, PhD

Case Studies on High Throughput Gene Expression Data Kun Huang, PhD Raghu Machiraju, PhD Case Studies on High Throughput Gene Expression Data Kun Huang, PhD Raghu Machiraju, PhD Department of Biomedical Informatics Department of Computer Science and Engineering The Ohio State University Review

More information

Sample Size Estimation for Microarray Experiments

Sample Size Estimation for Microarray Experiments Sample Size Estimation for Microarray Experiments Gregory R. Warnes Department of Biostatistics and Computational Biology Univeristy of Rochester Rochester, NY 14620 and Peng Liu Department of Biological

More information

False Discovery Rates and Copy Number Variation. Bradley Efron and Nancy Zhang Stanford University

False Discovery Rates and Copy Number Variation. Bradley Efron and Nancy Zhang Stanford University False Discovery Rates and Copy Number Variation Bradley Efron and Nancy Zhang Stanford University Three Statistical Centuries 19th (Quetelet) Huge data sets, simple questions 20th (Fisher, Neyman, Hotelling,...

More information

Computer Science, Biology, and Biomedical Informatics (CoSBBI) Outline. Molecular Biology of Cancer AND. Goals/Expectations. David Boone 7/1/2015

Computer Science, Biology, and Biomedical Informatics (CoSBBI) Outline. Molecular Biology of Cancer AND. Goals/Expectations. David Boone 7/1/2015 Goals/Expectations Computer Science, Biology, and Biomedical (CoSBBI) We want to excite you about the world of computer science, biology, and biomedical informatics. Experience what it is like to be a

More information

Examining differences between two sets of scores

Examining differences between two sets of scores 6 Examining differences between two sets of scores In this chapter you will learn about tests which tell us if there is a statistically significant difference between two sets of scores. In so doing you

More information

Checking the counterarguments confirms that publication bias contaminated studies relating social class and unethical behavior

Checking the counterarguments confirms that publication bias contaminated studies relating social class and unethical behavior 1 Checking the counterarguments confirms that publication bias contaminated studies relating social class and unethical behavior Gregory Francis Department of Psychological Sciences Purdue University gfrancis@purdue.edu

More information

Broad H3K4me3 is associated with increased transcription elongation and enhancer activity at tumor suppressor genes

Broad H3K4me3 is associated with increased transcription elongation and enhancer activity at tumor suppressor genes Broad H3K4me3 is associated with increased transcription elongation and enhancer activity at tumor suppressor genes Kaifu Chen 1,2,3,4,5,10, Zhong Chen 6,10, Dayong Wu 6, Lili Zhang 7, Xueqiu Lin 1,2,8,

More information

Visualizing Cancer Heterogeneity with Dynamic Flow

Visualizing Cancer Heterogeneity with Dynamic Flow Visualizing Cancer Heterogeneity with Dynamic Flow Teppei Nakano and Kazuki Ikeda Keio University School of Medicine, Tokyo 160-8582, Japan keiohigh2nd@gmail.com Department of Physics, Osaka University,

More information

Bayesian Prediction Tree Models

Bayesian Prediction Tree Models Bayesian Prediction Tree Models Statistical Prediction Tree Modelling for Clinico-Genomics Clinical gene expression data - expression signatures, profiling Tree models for predictive sub-typing Combining

More information

PSYCH-GA.2211/NEURL-GA.2201 Fall 2016 Mathematical Tools for Cognitive and Neural Science. Homework 5

PSYCH-GA.2211/NEURL-GA.2201 Fall 2016 Mathematical Tools for Cognitive and Neural Science. Homework 5 PSYCH-GA.2211/NEURL-GA.2201 Fall 2016 Mathematical Tools for Cognitive and Neural Science Homework 5 Due: 21 Dec 2016 (late homeworks penalized 10% per day) See the course web site for submission details.

More information

Sheila Barron Statistics Outreach Center 2/8/2011

Sheila Barron Statistics Outreach Center 2/8/2011 Sheila Barron Statistics Outreach Center 2/8/2011 What is Power? When conducting a research study using a statistical hypothesis test, power is the probability of getting statistical significance when

More information

Nature Methods: doi: /nmeth.3115

Nature Methods: doi: /nmeth.3115 Supplementary Figure 1 Analysis of DNA methylation in a cancer cohort based on Infinium 450K data. RnBeads was used to rediscover a clinically distinct subgroup of glioblastoma patients characterized by

More information

Objectives. Quantifying the quality of hypothesis tests. Type I and II errors. Power of a test. Cautions about significance tests

Objectives. Quantifying the quality of hypothesis tests. Type I and II errors. Power of a test. Cautions about significance tests Objectives Quantifying the quality of hypothesis tests Type I and II errors Power of a test Cautions about significance tests Designing Experiments based on power Evaluating a testing procedure The testing

More information

CHAPTER 6. Conclusions and Perspectives

CHAPTER 6. Conclusions and Perspectives CHAPTER 6 Conclusions and Perspectives In Chapter 2 of this thesis, similarities and differences among members of (mainly MZ) twin families in their blood plasma lipidomics profiles were investigated.

More information

List of Figures. List of Tables. Preface to the Second Edition. Preface to the First Edition

List of Figures. List of Tables. Preface to the Second Edition. Preface to the First Edition List of Figures List of Tables Preface to the Second Edition Preface to the First Edition xv xxv xxix xxxi 1 What Is R? 1 1.1 Introduction to R................................ 1 1.2 Downloading and Installing

More information

STATISTICAL INFERENCE 1 Richard A. Johnson Professor Emeritus Department of Statistics University of Wisconsin

STATISTICAL INFERENCE 1 Richard A. Johnson Professor Emeritus Department of Statistics University of Wisconsin STATISTICAL INFERENCE 1 Richard A. Johnson Professor Emeritus Department of Statistics University of Wisconsin Key words : Bayesian approach, classical approach, confidence interval, estimation, randomization,

More information

Institutional Ranking. VHA Study

Institutional Ranking. VHA Study Statistical Inference for Ranks of Health Care Facilities in the Presence of Ties and Near Ties Minge Xie Department of Statistics Rutgers, The State University of New Jersey Supported in part by NSF,

More information

A Case Study: Two-sample categorical data

A Case Study: Two-sample categorical data A Case Study: Two-sample categorical data Patrick Breheny January 31 Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 1/43 Introduction Model specification Continuous vs. mixture priors Choice

More information

Statistical analysis DIANA SAPLACAN 2017 * SLIDES ADAPTED BASED ON LECTURE NOTES BY ALMA LEORA CULEN

Statistical analysis DIANA SAPLACAN 2017 * SLIDES ADAPTED BASED ON LECTURE NOTES BY ALMA LEORA CULEN Statistical analysis DIANA SAPLACAN 2017 * SLIDES ADAPTED BASED ON LECTURE NOTES BY ALMA LEORA CULEN Vs. 2 Background 3 There are different types of research methods to study behaviour: Descriptive: observations,

More information

Bayesian Confidence Intervals for Means and Variances of Lognormal and Bivariate Lognormal Distributions

Bayesian Confidence Intervals for Means and Variances of Lognormal and Bivariate Lognormal Distributions Bayesian Confidence Intervals for Means and Variances of Lognormal and Bivariate Lognormal Distributions J. Harvey a,b, & A.J. van der Merwe b a Centre for Statistical Consultation Department of Statistics

More information

Biost 590: Statistical Consulting

Biost 590: Statistical Consulting Biost 590: Statistical Consulting Statistical Classification of Scientific Questions October 3, 2008 Scott S. Emerson, M.D., Ph.D. Professor of Biostatistics, University of Washington 2000, Scott S. Emerson,

More information

Introduction to statistics Dr Alvin Vista, ACER Bangkok, 14-18, Sept. 2015

Introduction to statistics Dr Alvin Vista, ACER Bangkok, 14-18, Sept. 2015 Analysing and Understanding Learning Assessment for Evidence-based Policy Making Introduction to statistics Dr Alvin Vista, ACER Bangkok, 14-18, Sept. 2015 Australian Council for Educational Research Structure

More information

Psy201 Module 3 Study and Assignment Guide. Using Excel to Calculate Descriptive and Inferential Statistics

Psy201 Module 3 Study and Assignment Guide. Using Excel to Calculate Descriptive and Inferential Statistics Psy201 Module 3 Study and Assignment Guide Using Excel to Calculate Descriptive and Inferential Statistics What is Excel? Excel is a spreadsheet program that allows one to enter numerical values or data

More information

A Review of Multiple Hypothesis Testing in Otolaryngology Literature

A Review of Multiple Hypothesis Testing in Otolaryngology Literature The Laryngoscope VC 2014 The American Laryngological, Rhinological and Otological Society, Inc. Systematic Review A Review of Multiple Hypothesis Testing in Otolaryngology Literature Erin M. Kirkham, MD,

More information

Statistical Techniques. Masoud Mansoury and Anas Abulfaraj

Statistical Techniques. Masoud Mansoury and Anas Abulfaraj Statistical Techniques Masoud Mansoury and Anas Abulfaraj What is Statistics? https://www.youtube.com/watch?v=lmmzj7599pw The definition of Statistics The practice or science of collecting and analyzing

More information

Supplement to SCnorm: robust normalization of single-cell RNA-seq data

Supplement to SCnorm: robust normalization of single-cell RNA-seq data Supplement to SCnorm: robust normalization of single-cell RNA-seq data Supplementary Note 1: SCnorm does not require spike-ins, since we find that the performance of spike-ins in scrna-seq is often compromised,

More information

Downregulation of serum mir-17 and mir-106b levels in gastric cancer and benign gastric diseases

Downregulation of serum mir-17 and mir-106b levels in gastric cancer and benign gastric diseases Brief Communication Downregulation of serum mir-17 and mir-106b levels in gastric cancer and benign gastric diseases Qinghai Zeng 1 *, Cuihong Jin 2 *, Wenhang Chen 2, Fang Xia 3, Qi Wang 3, Fan Fan 4,

More information

Investigating the robustness of the nonparametric Levene test with more than two groups

Investigating the robustness of the nonparametric Levene test with more than two groups Psicológica (2014), 35, 361-383. Investigating the robustness of the nonparametric Levene test with more than two groups David W. Nordstokke * and S. Mitchell Colp University of Calgary, Canada Testing

More information

SUPPLEMENTARY APPENDIX

SUPPLEMENTARY APPENDIX SUPPLEMENTARY APPENDIX 1) Supplemental Figure 1. Histopathologic Characteristics of the Tumors in the Discovery Cohort 2) Supplemental Figure 2. Incorporation of Normal Epidermal Melanocytic Signature

More information

Outline of Part III. SISCR 2016, Module 7, Part III. SISCR Module 7 Part III: Comparing Two Risk Models

Outline of Part III. SISCR 2016, Module 7, Part III. SISCR Module 7 Part III: Comparing Two Risk Models SISCR Module 7 Part III: Comparing Two Risk Models Kathleen Kerr, Ph.D. Associate Professor Department of Biostatistics University of Washington Outline of Part III 1. How to compare two risk models 2.

More information

Biostatistical modelling in genomics for clinical cancer studies

Biostatistical modelling in genomics for clinical cancer studies This work was supported by Entente Cordiale Cancer Research Bursaries Biostatistical modelling in genomics for clinical cancer studies Philippe Broët JE 2492 Faculté de Médecine Paris-Sud In collaboration

More information

Behavioral Data Mining. Lecture 4 Measurement

Behavioral Data Mining. Lecture 4 Measurement Behavioral Data Mining Lecture 4 Measurement Outline Hypothesis testing Parametric statistical tests Non-parametric tests Precision-Recall plots ROC plots Hardware update Icluster machines are ready for

More information

Nature Getetics: doi: /ng.3471

Nature Getetics: doi: /ng.3471 Supplementary Figure 1 Summary of exome sequencing data. ( a ) Exome tumor normal sample sizes for bladder cancer (BLCA), breast cancer (BRCA), carcinoid (CARC), chronic lymphocytic leukemia (CLLX), colorectal

More information

T. R. Golub, D. K. Slonim & Others 1999

T. R. Golub, D. K. Slonim & Others 1999 T. R. Golub, D. K. Slonim & Others 1999 Big Picture in 1999 The Need for Cancer Classification Cancer classification very important for advances in cancer treatment. Cancers of Identical grade can have

More information

SubLasso:a feature selection and classification R package with a. fixed feature subset

SubLasso:a feature selection and classification R package with a. fixed feature subset SubLasso:a feature selection and classification R package with a fixed feature subset Youxi Luo,3,*, Qinghan Meng,2,*, Ruiquan Ge,2, Guoqin Mai, Jikui Liu, Fengfeng Zhou,#. Shenzhen Institutes of Advanced

More information

Numerous hypothesis tests were performed in this study. To reduce the false positive due to

Numerous hypothesis tests were performed in this study. To reduce the false positive due to Two alternative data-splitting Numerous hypothesis tests were performed in this study. To reduce the false positive due to multiple testing, we are not only seeking the results with extremely small p values

More information

Understandable Statistics

Understandable Statistics Understandable Statistics correlated to the Advanced Placement Program Course Description for Statistics Prepared for Alabama CC2 6/2003 2003 Understandable Statistics 2003 correlated to the Advanced Placement

More information

Lecture Outline Biost 517 Applied Biostatistics I. Statistical Goals of Studies Role of Statistical Inference

Lecture Outline Biost 517 Applied Biostatistics I. Statistical Goals of Studies Role of Statistical Inference Lecture Outline Biost 517 Applied Biostatistics I Scott S. Emerson, M.D., Ph.D. Professor of Biostatistics University of Washington Statistical Inference Role of Statistical Inference Hierarchy of Experimental

More information

Lessons in biostatistics

Lessons in biostatistics Lessons in biostatistics The test of independence Mary L. McHugh Department of Nursing, School of Health and Human Services, National University, Aero Court, San Diego, California, USA Corresponding author:

More information

A Strategy for Identifying Putative Causes of Gene Expression Variation in Human Cancer

A Strategy for Identifying Putative Causes of Gene Expression Variation in Human Cancer A Strategy for Identifying Putative Causes of Gene Expression Variation in Human Cancer Hautaniemi, Sampsa; Ringnér, Markus; Kauraniemi, Päivikki; Kallioniemi, Anne; Edgren, Henrik; Yli-Harja, Olli; Astola,

More information

Memorial Sloan-Kettering Cancer Center

Memorial Sloan-Kettering Cancer Center Memorial Sloan-Kettering Cancer Center Memorial Sloan-Kettering Cancer Center, Dept. of Epidemiology & Biostatistics Working Paper Series Year 2007 Paper 14 On Comparing the Clustering of Regression Models

More information

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES 24 MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES In the previous chapter, simple linear regression was used when you have one independent variable and one dependent variable. This chapter

More information

The Loss of Heterozygosity (LOH) Algorithm in Genotyping Console 2.0

The Loss of Heterozygosity (LOH) Algorithm in Genotyping Console 2.0 The Loss of Heterozygosity (LOH) Algorithm in Genotyping Console 2.0 Introduction Loss of erozygosity (LOH) represents the loss of allelic differences. The SNP markers on the SNP Array 6.0 can be used

More information

Gene Ontology and Functional Enrichment. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein

Gene Ontology and Functional Enrichment. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein Gene Ontology and Functional Enrichment Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein The parsimony principle: A quick review Find the tree that requires the fewest

More information

A Comparison of Collaborative Filtering Methods for Medication Reconciliation

A Comparison of Collaborative Filtering Methods for Medication Reconciliation A Comparison of Collaborative Filtering Methods for Medication Reconciliation Huanian Zheng, Rema Padman, Daniel B. Neill The H. John Heinz III College, Carnegie Mellon University, Pittsburgh, PA, 15213,

More information

Practical Bayesian Design and Analysis for Drug and Device Clinical Trials

Practical Bayesian Design and Analysis for Drug and Device Clinical Trials Practical Bayesian Design and Analysis for Drug and Device Clinical Trials p. 1/2 Practical Bayesian Design and Analysis for Drug and Device Clinical Trials Brian P. Hobbs Plan B Advisor: Bradley P. Carlin

More information

Title: A robustness study of parametric and non-parametric tests in Model-Based Multifactor Dimensionality Reduction for epistasis detection

Title: A robustness study of parametric and non-parametric tests in Model-Based Multifactor Dimensionality Reduction for epistasis detection Author's response to reviews Title: A robustness study of parametric and non-parametric tests in Model-Based Multifactor Dimensionality Reduction for epistasis detection Authors: Jestinah M Mahachie John

More information

Bootstrapped Integrative Hypothesis Test, COPD-Lung Cancer Differentiation, and Joint mirnas Biomarkers

Bootstrapped Integrative Hypothesis Test, COPD-Lung Cancer Differentiation, and Joint mirnas Biomarkers Bootstrapped Integrative Hypothesis Test, COPD-Lung Cancer Differentiation, and Joint mirnas Biomarkers Kai-Ming Jiang 1,2, Bao-Liang Lu 1,2, and Lei Xu 1,2,3(&) 1 Department of Computer Science and Engineering,

More information

Methods for Determining Random Sample Size

Methods for Determining Random Sample Size Methods for Determining Random Sample Size This document discusses how to determine your random sample size based on the overall purpose of your research project. Methods for determining the random sample

More information

Risk-prediction modelling in cancer with multiple genomic data sets: a Bayesian variable selection approach

Risk-prediction modelling in cancer with multiple genomic data sets: a Bayesian variable selection approach Risk-prediction modelling in cancer with multiple genomic data sets: a Bayesian variable selection approach Manuela Zucknick Division of Biostatistics, German Cancer Research Center Biometry Workshop,

More information

EXERCISE: HOW TO DO POWER CALCULATIONS IN OPTIMAL DESIGN SOFTWARE

EXERCISE: HOW TO DO POWER CALCULATIONS IN OPTIMAL DESIGN SOFTWARE ...... EXERCISE: HOW TO DO POWER CALCULATIONS IN OPTIMAL DESIGN SOFTWARE TABLE OF CONTENTS 73TKey Vocabulary37T... 1 73TIntroduction37T... 73TUsing the Optimal Design Software37T... 73TEstimating Sample

More information

Full title: A likelihood-based approach to early stopping in single arm phase II cancer clinical trials

Full title: A likelihood-based approach to early stopping in single arm phase II cancer clinical trials Full title: A likelihood-based approach to early stopping in single arm phase II cancer clinical trials Short title: Likelihood-based early stopping design in single arm phase II studies Elizabeth Garrett-Mayer,

More information
