Prediction and uncertainty in the analysis of gene expression profiles.


Rainer Spang, Harry Zuzan, Mike West, Joseph Nevins, Carrie Blanchette, Jeffrey R. Marks

Institute of Statistics and Decision Sciences, Duke University, Durham, NC, USA
Department of Genetics, Howard Hughes Medical Institute, Duke University Medical Center, Durham, NC, USA
Department of Experimental Surgery, Duke University Medical Center, Durham, NC, USA

Abstract

We have developed a complete statistical model for the analysis of tumor specific gene expression profiles. The approach provides investigators with a global overview of large scale gene expression data, indicating aspects of the data that relate to tumor phenotype, and also summarizing the uncertainties inherent in the classification of tumor types. We demonstrate the use of this method in the context of a gene expression profiling study of 27 human breast cancers. The study is aimed at defining molecular characteristics of tumors that reflect estrogen receptor status. In addition to good predictive performance with respect to pure classification of the expression profiles, the model also uncovers conflicts in the data with respect to the classification of some of the tumors, highlighting them as critical cases for which additional investigations are appropriate.

1 Introduction

A comprehensive understanding of the mostly subtle differences in gene expression between tumor types is crucial for elucidating the molecular mechanisms of cancer as well as for the successful treatment of the disease. Large scale gene expression profiling using high-density oligonucleotide chips [9], arrays on nylon membranes [25, 24] or cDNA microarrays [23, 26] is certainly among the most promising novel techniques in molecular biology [27]. In particular, the enormous scientific potential of these technologies for uncovering the molecular variation among cancers has recently been pointed out [22, 8, 10, 4, 28, 3, 12].
At the current state of the technology, expression levels for a substantial fraction of all human genes can be assessed, and in the near future it is likely that the same analysis will be available genome wide. The technologies to generate large amounts of

gene expression data are already available and will likely improve within the next years. The bottleneck in dealing cogently with the upcoming data explosion is very clearly the development of data analysis tools that identify subtle differences in the gene expression profiles.

Statistical approaches have mainly focused on unsupervised learning procedures. In these approaches no knowledge of the true class of the tumor is used. Methods applied to gene expression analysis include hierarchical average linkage clustering [7], deterministic annealing based clustering [4], self organizing maps [29], principal component analysis [28] and singular value decompositions [21]. These methods provide very broad overviews of the internal structure of the data. The obvious shortcoming of unsupervised approaches is that available information, the true class of either genes or tumors, is not used in the analysis.

If this information is used, classical classification methods could in principle be applied. However, the very large number of predictors (genes) compared to the small number of samples (microarrays) makes most of them inapplicable. A preceding feature selection step is normally necessary. A comprehensive comparative study of several discrimination methods in the context of cancer classification based on filtered sets of genes can be found in [19]. Support vector machines have been applied to the classification of genes with respect to functional properties [5].

In [1] we describe a novel Bayesian regression approach for classification problems with far more predictors than samples. The methodology is based on developments in [2]. The method has no systematic limitations with respect to the number of predictors used. Here we demonstrate its appropriateness in the context of gene expression analysis.

First studies on cancer specific expression profiles focused on blood cancers, like leukemia [8] and B-cell lymphoma [3].
It was pointed out [4, 8] that studies on solid tumors are expected to be far more complex. RNA samples from biopsy specimens are heterogeneous and typically include RNA from stromal as well as tumor cells. Keeping the percentage of tumor specific RNA constant is difficult. In addition, a pool of tumor tissues that appears to be pathogenetically homogeneous with respect to the morphological appearance of the tumors may well be highly heterogeneous on the molecular level [3]. In fact, these pools might contain tumors representing essentially different diseases [3, 8].

Given these problems, it becomes clear that gene expression analysis goes beyond simple classification. Conflicts between the expression data and histopathological class assignments are expected and can actually be observed, as we will show below. It is crucial for a sensible data analysis to detect and explore such conflicts, so as to generate scientific understanding and insights in addition to pure classification.

In view of the possible heterogeneity in the examined RNA samples, it seems appropriate to describe profiles gradually on a scale between 0 and 1, instead of making fixed assignments to one or the other class. Small values indicate a strong inclination towards class 1 and values close to 1 suggest class 2. Intermediate values are a first indication of conflicting data, typical for heterogeneous specimens. Class probabilities put this concept into practice.

Moreover, a high predictive capability of the analysis is crucial. This requires a very careful experimental design as well as robust statistical analysis. The model needs to reflect the underlying tumor biology and not artifacts specific to the experiment or the data analysis. In particular, the profile analysis needs to be done out-of-sample, meaning that no prior class assignment for the profile under investigation is used in the analysis.

In addition to possible diagnostic applications of expression profiling, there is great interest in revealing the underlying molecular differences between tumor types. Consequently, the model should be transparent enough that genes that are highly informative for the class distinction can be easily identified.

We will discuss a complete statistical model that helps us understand molecular tumor characteristics. We first discuss the Bayesian regression model and then demonstrate its use and capabilities in the context of the estrogen receptor status of 27 human breast cancers.

2 Statistical framework

Here we summarize the Bayesian binary analysis that we describe in detail in [1]. Begin with a training set of $n$ tumor samples, each described by the expression levels of $p$ genes, namely $(x_{1,i}, \ldots, x_{p,i})$ for tumor $i$. In addition, assume that the tumors can be divided into two classes according to some well studied criterion, and that for each of the $n$ tumors the true class assignment is known. This knowledge is represented by a vector $(z_1, \ldots, z_n)$, where $z_i = 1$ if the $i$th tumor is assigned to class 1 and $z_i = 0$ otherwise. Typically, the number of gene expression levels $p$ is in the range of several thousands, whereas $n$, the number of tumors in the study, is far smaller. Hence, the context here is binary regression with far more predictor variables than samples. We use a standard probit regression model that includes the entire set of $p$ genes as predictor variables. This yields

$$\Pr(z_i = 1 \mid \beta) = \Phi(x_i'\beta), \tag{1}$$

where $x_i$ is the vector of gene expression levels of tumor $i$, $\beta = (\beta_1, \ldots, \beta_p)$ is a vector of $p$ unknown parameters and $\Phi$ is the cumulative distribution function of a standard normal distribution.
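As a small numerical illustration of equation (1), the following sketch evaluates the probit class probability for a single profile. The three expression levels and weights are made-up values chosen only so that the arithmetic is easy to follow; they are not taken from the study.

```python
from math import erf, sqrt


def standard_normal_cdf(t):
    """Phi(t), the standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))


def probit_class_probability(x, beta):
    """Pr(z = 1 | beta) = Phi(x' beta) for one expression profile x."""
    linear_predictor = sum(xj * bj for xj, bj in zip(x, beta))
    return standard_normal_cdf(linear_predictor)


# Hypothetical profile with p = 3 genes and hypothetical regression weights.
x = [0.5, -1.0, 2.0]
beta = [1.0, 0.5, 0.25]

# Linear predictor: 0.5 - 0.5 + 0.5 = 0.5, so the result is Phi(0.5).
p1 = probit_class_probability(x, beta)
print(round(p1, 4))  # prints 0.6915
```

A linear predictor of zero gives a probability of exactly 0.5, which is why zero-centered weights correspond to an undecided classification.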
$\Pr(z_i = 1 \mid \beta)$ is then the probability that tumor $i$ belongs to class 1 with respect to the regression model that is determined by the parameters $\beta$. Note that we not only model the class memberships via the binary indicator, but also use the probability scale, where tumor classification is described by the probability that a certain tumor is class 1. We refer to the classification probabilities as first order uncertainties. For the statistical analysis and model fitting, we use the latent variable construction of a probit model [14, 15]

$$y = X'\beta + \epsilon, \tag{2}$$

where $y$ is a vector of $n$ latent variables, $X$ is the $p \times n$ matrix whose columns are the gene expression profiles of the $n$ different tumors and $\epsilon$ is a vector of $n$ independent standard normal errors. The latent variables correspond to the class assignments by $y_i > 0$ if and only if tumor $i$ is assigned to class 1.

Dimension Reduction

The tumors are represented by points in a $p$ dimensional space. For typical applications, these are several thousand dimensions. However, there are only $n \ll p$ such points, and clearly these points all lie in a linear subspace which is at most $n$ dimensional. By projecting onto the subspace, the dimension of the data is reduced dramatically. Clearly, the projection is not unique. We use the singular value decomposition $X = ADF$

where $A$ is a $p \times n$ matrix with orthonormal columns, $D$ is a diagonal matrix with entries $d_1 \geq d_2 \geq \ldots \geq d_n$ and $F$ is an $n \times n$ square matrix with both orthonormal rows and columns. $A'$ is the projection onto the low dimensional factor space. Instead of the original $p$ expression levels of all genes we only have to deal with $n \ll p$ linear combinations of them. We refer to them as expression levels of super-genes. The tumors are represented by the projected expression levels $(A'x_1, \ldots, A'x_n)$, where $A'x_i$ is equal to column $i$ of $DF$. The fact that the singular value decomposition produces orthogonal tumor descriptors is of great use for the regression problem, and justifies the choice of this special projection for dimension reduction. The regression equation (1) can be rewritten as

$$\Pr(z_i = 1 \mid \gamma) = \Phi((A'x_i)'\gamma). \tag{3}$$

The challenge is to learn about the data by inferring the $n$-dimensional parameter $\gamma = (\gamma_1, \ldots, \gamma_n)$. Singular value decompositions have also been used by other authors in the context of large scale gene expression analysis [31, 21, 30]. However, the use of super-genes as predictors in a full binary regression model is novel.

Unbiased structured priors

At this stage we have $n$ data points, each of them representing a tumor specific gene expression profile described in an $n$-dimensional space. As one can easily see, $n$ points in $n$ dimensions can always be separated by a hyperplane no matter how class assignments are made, except for the unlikely case of collinearity. Consequently, there is little hope that we can learn from the data without any additional constraints. The picture changes completely in a Bayesian context, where informative prior distributions of the regression parameters are operating, providing partial stochastic constraints. In [1] we introduce the concept of singular generalized g-priors for this type of problem. The prior choice is guided by two aims.
First, it is desirable to keep the model simple, such that computation remains feasible and software can be constructed that allows for a fast and easy analysis of the data. Our second objective is to start from an unbiased perspective, both with respect to the classification of tumors and the decision of which genes are most likely to support the classification. Consistent informative priors for both the gene specific regression parameters $\beta$ and the super-gene specific parameters $\gamma$ will be constructed. We start by restricting the class of possible priors for $\gamma$ to independent normal priors. Normal priors are a standard choice, since they are conjugate to the likelihood function in the probit model. By choosing independent normals, we also adopt the covariance structure of the likelihood function, up to individual scaling parameters for each super-gene dimension. This is a consequence of the use of singular value decompositions, which produce orthogonal super-genes. The dimension specific scale parameters are treated as hyper-parameters with prior distributions centered at 1; in particular, we use gamma distributions with mean one and 2 degrees of freedom. These priors are a generalization of the g-priors introduced in [17], where only a single scaling parameter for the complete covariance matrix is considered. The overall setup allows for a routine and computationally efficient implementation of the binary regression model using MCMC methods [14, 15]. Details are given in [1]. Now we need to specify prior means. Note that for every prior on $\gamma$, equation (3) associates a unique prior on the classification probabilities

$\Pr(z_i = 1 \mid \gamma)$. In order to start off from an unbiased perspective, it is necessary that the prior classification probability is symmetric and centered around 0.5. This is equivalent to choosing a zero mean for the normal prior on $\gamma$. It is important to note that the zero mean normal priors are highly informative. Consider a flat prior instead: in this case posteriors would be maximal at $\pm\infty$ [1], reflecting the usual problem of discriminating $n$ points in an $n$ dimensional space. The normal priors pull the regression weights back towards zero, thus operating like additional constraints. In view of the actual regression model, the prior specifications for the super-gene dimensions are sufficient. However, for the purpose of selecting important genes, it is instructive to examine the original gene specific regression weights $\beta$ as well. The class of consistent priors on $\beta$ consists of highly singular multidimensional normals with support in the subspace that is spanned by the set of gene expression profiles $(x_1, \ldots, x_n)$ [1]. Note that any prior on $\beta$ with the appropriate covariance structure and a mean in the null space of the projection $A'$ induces the same zero mean prior for $\gamma$. Hence, in terms of classification these priors are all equivalent. However, a non-zero mean for a $\beta$ prior constitutes a prejudiced perspective on which genes are important for the actual tumor classification. To avoid this, we choose zero means for the high dimensional priors as well.

Posteriors and identification of influential genes

Given the prior specifications above, MCMC methods are used to sample from the posterior distribution. Having these samples, we construct posterior samples for the classification probabilities $\Pr(z_i = 1 \mid \gamma)$ by equation (3). It is also worth noting that the unbiased prior choice of zero mean normals for both super-gene and gene weights implies a one-to-one correspondence between these two sets of parameters.
Without the prior specifications, any set of parameters $\beta = A\gamma + d$ where $A'd = 0$ is consistent with $\gamma$. On the other hand, the expectation of the posterior mean given the data $y$ should produce the prior mean. This leads to $d = 0$ [1]. Hence, $A$ is the only pseudo-inverse of the projection $A'$ that is consistent with unbiased priors for the gene specific regression weights.

Out-of-sample prediction

The setup for a real application of the binary regression model is the following. We are given a set of $n$ tumor specific expression profiles. Class assignments for the corresponding tumors are known. Suppose we are also given the expression profile of a new tumor where nothing is known about its class membership. The challenge is to detect and report the trends in its expression profile as to which class it belongs. To reflect this in the analysis of our data, for evaluation and validation purposes, we hold back the true class assignment for one of the profiles at a time. This test profile is subject to investigation and the model compares it to the classified profiles. This is done by treating the class assignment of the test profile as an unknown variable; see [1]. The procedure of holding back the class assignment of a test profile is repeated separately for each tumor in the study, resulting in a comprehensive cross-validation type evaluation study.
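The dimension reduction of this section can be sketched with NumPy's thin singular value decomposition. This is only an illustration on random data with made-up sizes, not the authors' software; the variable names mirror the symbols $A$, $D$, $F$, $\gamma$ and $\beta$ used above. It computes the super-gene expression levels $A'X = DF$ and maps a super-gene weight vector back to gene weights via $\beta = A\gamma$, so that equations (1) and (3) can be compared numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 500, 12                      # hypothetical sizes: p genes, n tumors
X = rng.normal(size=(p, n))         # columns are expression profiles

# Thin SVD, X = A D F: A is p x n with orthonormal columns,
# d holds the diagonal of D in decreasing order, F is n x n orthogonal.
A, d, F = np.linalg.svd(X, full_matrices=False)

# Super-gene expression levels: column i is A'x_i, the projected profile.
S = A.T @ X                         # n x n, equal to D F

# Super-genes are orthogonal across tumors: S S' = D F F' D = D^2.
gram = S @ S.T

# A weight vector gamma on super-genes corresponds to gene weights
# beta = A gamma with the same linear predictor x_i' beta = (A'x_i)' gamma.
gamma = rng.normal(size=n)
beta = A @ gamma
predictor_genes = X.T @ beta        # argument of Phi in equation (1)
predictor_super = S.T @ gamma       # argument of Phi in equation (3)
```

Because $A$ has orthonormal columns, `A.T @ beta` recovers `gamma` exactly, which is the one-to-one correspondence used later to rank genes.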

3 Results

Here we demonstrate the use of the Bayesian binary regression model in a gene expression profiling study of 27 primary human breast cancers. We focus on the estrogen receptor status of the tumors. Estrogen receptor status is routinely assessed clinically by immunohistochemical methods, which actually detect the estrogen receptor protein in the tissues. Tumors with high levels of the estrogen and progesterone receptors are assigned to class 1 (ER+), whereas tumors with undetectable levels of the hormone receptors are assigned to class 2 (ER-). The study is controlled for tumor size; all tumors are between 1.5 and 5 cm in maximal dimension. We use the average log ratio measure reported by the Gene Chip software (Affymetrix 2). Each tumor is characterized by 7129 gene expression levels.

The original study comprised 30 tumors. Exactly half of these tumors were reported to have high levels of the estrogen and progesterone receptors (ER+/PR+) as measured by immunohistochemical staining and image analysis. The other half had undetectable levels of both nuclear hormone receptors (ER-/PR-). An inspection of the raw data showed that two arrays failed to hybridize correctly, so these were excluded from the analysis. Both excluded profiles correspond to ER- tumors. For a third tumor, it turned out that the result of the immunohistochemical analysis for ER status was inconsistent when done by two different laboratories. This sample was also removed. We applied the Bayesian regression analysis to the remaining 27 expression profiles. Profiles 1-15 correspond to ER+ tumors and profiles 16-27 to ER- tumors.

Probabilistic tumor classification

In a first step we fitted the regression model using the entire set of expression profiles and class assignments. We simulated 5 values from the posterior distribution of $\gamma$ and derived the corresponding sample of classification probabilities $\pi_i = \Pr(z_i = 1 \mid \gamma)$ for each of the 27 tumors. Here $z_i = 1$ means that tumor number $i$ is ER+.
The left plot in Figure 1 shows the means of the posterior samples. This mean probability is near one for all tumors that are actually ER+ and near zero for all ER- tumors except tumor number 16. At this stage of our analysis we would classify tumor 16 as a borderline case. However, the probability that it is ER- is higher than the probability that it is ER+. Note that if we draw a decision line at a probability of 0.5, we obtain a perfect classification of all 27 tumors. However, the analysis uses the true class assignments $z_1, \ldots, z_{27}$ of all the tumors. Hence, although the plot demonstrates a good fit of the model to the data, it does not give us reliable indications of good predictive performance. One might suspect that the method just stores the given class assignments in the parameters $\gamma_1, \ldots, \gamma_n$. Indeed, this would be the case if one used binary regression for $n$ samples and $n$ predictors without the additional constraints introduced by the priors. That this suspicion is unjustified with respect to the Bayesian method can be demonstrated by out-of-sample predictions.

We next excluded the true class assignment for one tumor at a time and analyzed this tumor with the Bayesian regression model, treating its class assignment as a missing value. This results in a separate model fitting procedure for each tumor, where the initial class assignment for the tumor is ignored and probabilities for the tumor to be class 1 are derived by comparing its expression profile to the remaining 26 profiles using only their initial class assignments. The posterior means of the classification probabilities are shown in the right plot of Figure 1. The classification probabilities for ER+ tumors are all above the 0.5 line. However, they are in general smaller than in the left plot, being in the range of 0.7 to 1. Tumor 1 is assigned a probability close to 0.95 of being ER+, showing that it has a typical expression profile for this class. This means that it is both similar to the other ER+ profiles and sufficiently different from the ER- profiles. Tumor 14 is different. It has a classification probability of only about 0.7. While it can still be correctly identified as ER+, it also becomes obvious that the tumor is different from the other ER+ tumors. The lower classification probability reflects conflicts in the data. The regression analysis correctly votes for ER+, but it also indicates a high degree of uncertainty in doing so. The ER- tumors show a similar behavior.

Figure 1: Posterior means for the probability of being an ER+ tumor (two panels; horizontal axis: sample, vertical axis: fitted classification probability). Filled circles refer to samples that are ER+ according to clinical data and open circles refer to ER- samples. The left plot shows the model fit when all samples are used to estimate the model parameters. The right plot shows the same probabilities in a cross-validation scenario.

Tumor 16 is the most interesting case. In the immunohistochemical analysis the estrogen receptor molecule was not detected at all. However, the model-fit analysis already raises some doubts that it is a typical ER- tumor. Its probability of being ER- is much lower than those of the other ER- tumors. However, it is still above 0.5. This might indicate a conflict between the expression profile and its actual class assignment. In fact, the out-of-sample analysis supports this possibility. Tumor 16 is now classified as ER+ with high predictive probability. Thus, while the estrogen receptor protein is absent in the tumor, analysis of gene expression provides evidence for a pattern typical for ER+. That is, several genes known to be regulated by the estrogen receptor are elevated in expression in this sample, whereas these same genes are low in the other ER- samples.

Figure 2: Posterior distributions of classification probabilities for two samples (panels for samples 17 and 16). The vertical dashed lines indicate posterior means.
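The hold-one-out evaluation described above can be sketched as follows. This is a simplified stand-in rather than the model of [1]: it assumes a fixed prior variance `prior_var` for all super-gene weights instead of the gamma hyper-priors on dimension specific scales, it centers each gene across tumors, and it runs on synthetic data with made-up sizes and signal. The sampler is the latent variable Gibbs scheme of [14]; the held-out tumor contributes its expression profile to the SVD but its label is never used.

```python
import numpy as np
from statistics import NormalDist

STD_NORMAL = NormalDist()


def draw_truncated_latent(mean, positive, rng):
    """Draw y ~ N(mean, 1) restricted to y > 0 if positive, else y <= 0."""
    cut = STD_NORMAL.cdf(-mean)                      # Pr(y <= 0)
    u = rng.uniform(cut, 1.0) if positive else rng.uniform(0.0, cut)
    u = min(max(u, 1e-12), 1.0 - 1e-12)              # stay inside inv_cdf's domain
    return mean + STD_NORMAL.inv_cdf(u)


def held_out_probability(S, z, held_out, prior_var, n_draws=300, burn_in=100, seed=0):
    """Posterior mean of Phi(s_h' gamma), ignoring the label of tumor held_out."""
    rng = np.random.default_rng(seed)
    k = S.shape[0]                                   # number of super-genes
    train = np.array([i for i in range(S.shape[1]) if i != held_out])
    H = S[:, train].T                                # training design matrix
    V = np.linalg.inv(H.T @ H + np.eye(k) / prior_var)
    V = (V + V.T) / 2.0                              # enforce exact symmetry
    L = np.linalg.cholesky(V)
    y = np.where(z[train] == 1, 0.5, -0.5)           # initial latent variables
    probs = []
    for it in range(burn_in + n_draws):
        # gamma | y is normal with mean V H'y and covariance V.
        gamma = V @ (H.T @ y) + L @ rng.normal(size=k)
        # y_i | gamma, z_i is a unit-variance normal truncated by the label.
        means = H @ gamma
        y = np.array([draw_truncated_latent(m, zi == 1, rng)
                      for m, zi in zip(means, z[train])])
        if it >= burn_in:
            probs.append(STD_NORMAL.cdf(float(S[:, held_out] @ gamma)))
    return float(np.mean(probs))


# Synthetic stand-in data: 12 tumors, 200 genes, 40 genes carry a class signal.
rng = np.random.default_rng(1)
p, n = 200, 12
z = np.array([1] * 6 + [0] * 6)
X = rng.normal(size=(p, n))
X[:40, z == 1] += 3.0
X -= X.mean(axis=1, keepdims=True)                   # center genes (an assumption)

A, d, F = np.linalg.svd(X, full_matrices=False)
S = A.T @ X                                          # super-gene expression levels
loo_probs = [held_out_probability(S, z, h, prior_var=1.0 / p) for h in range(n)]
```

Each entry of `loo_probs` plays the role of one point in the cross-validation panel of Figure 1. Keeping the individual draws in `probs` instead of averaging them would give the full posterior distribution examined in the next subsection.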

Second order uncertainty by analyzing the posterior distribution

Above, we have used the continuous scale of probabilities to model the class membership of tumors. Compared to pure classification approaches, this provides us with an additional indication of the strength of belief in the classification. However, there is also a fair amount of uncertainty in the determination of the classification probabilities. An examination of the entire posterior distribution is instructive. We refer to this step as second order uncertainty analysis. In Figure 2 the posterior distributions for the classification probabilities of tumors 17 (left plots) and 16 (right plots) are shown. The vertical dashed lines indicate posterior means. The top plots refer to the model-fit analysis, whereas the bottom plots correspond to out-of-sample evaluations. Tumor 17 is one of the typical good cases. In the model-fit analysis (top left plot) one can observe that almost all draws from the posterior distribution are numbers close to zero. There is very little variation in the judgment that this tumor is ER-. In the out-of-sample evaluations the variation increases significantly. Posterior values higher than 0.2 are observed more frequently, but there are still almost no posterior values that would prefer a classification of tumor 17 as ER+. The posterior plots for tumor 17 are typical; most of the other expression profiles result in very similar posteriors. Again, tumor 16 is an interesting and completely different case. The posterior in the model-fit scenario indicates that the regression method is fairly undecided as to which class the tumor belongs. In fact, one can still observe the reference U-shaped prior distribution in the plot. It becomes clear that the posterior mean of 0.38 does not indicate that the tumor has characteristics between ER+ and ER-, but that the model has detected inconsistencies between the expression profile of tumor 16 and its classification as ER-.
In cross-validation (bottom right plot), however, the model reports a clear indication with little uncertainty that the tumor has a gene expression profile that is typical for an ER+ tumor.

Important Genes

While classification of tumor specific expression profiles is important in its own right, there is certainly also high interest in identifying the differences in expression patterns between two types of cancer. A first step in this direction is to produce lists of genes that are significantly more influential in the classification process than others. In Section 2 we have shown that the unbiased prior choice realizes a one-to-one correspondence between the low dimensional regression parameters $\gamma$ and the high dimensional gene specific parameters $\beta$. From the MCMC analysis we obtain posterior samples $\gamma^i$, and the samples $(A\gamma^i)_j$ form the corresponding posterior distribution of the weight of gene $j$. Figure 3 is a plot of all the 7129 individual gene weights from the estrogen receptor status analysis. Obviously, there is a fair number of genes that clearly stand out, having significantly higher absolute weights than most others. Significance can be determined from the complete posterior distribution of the gene weights. The names of the top 4 up regulated genes in ER+ and the top 4 down regulated genes are indicated. Table 1 gives the list of the 25 genes with the highest absolute value of their posterior regression weight. The three underlined genes are the estrogen receptor gene itself and the two well known estrogen receptor targets pS2 and the Estrogen Regulated liv-1 Protein.

Figure 3: An inverse projection of the regression weights in the Bayesian binary regression procedure yields weights for all genes on the arrays according to their influence on the classification. Genes with weights standing out from the mass of genes are candidates for genes which actually make up the difference between the two tumor types. Labeled genes: Intestinal Trefoil Factor, Estrogen Receptor, Nat1 Gene for Arylamine N-Acetyltransferase, pS2, Matrilysin, Omega Light Chain Protein 14.1, RAR Responsive (tig1) mRNA, Map Kinase Phosphatase 4.3.

A parallel gene expression study on breast tumors is reported in [32]. There, 65 surgical specimens are analyzed using microarray technologies [26]. The data are analyzed using hierarchical clustering [7]. An inspection of the gene cluster that contains the estrogen receptor shows that it also contains the Nat1 Gene for Arylamine N-Acetyltransferase, the Hepatocyte Nuclear Factor 3 Alpha, the X-Box Binding Protein-1, Gata 3 and the Type 1 Angiotensin II Receptor. All these genes were also identified by our method. This coincidence is striking, since the Perou et al. study is based on a different technology, different experimental designs, a different statistical approach and of course different tumors. The fact that both studies result in a high intersection of relevant genes encourages us with respect to the general potential of large scale gene expression analysis.
Table 1: The top 25 genes

Genes up in ER+:
Estrogen Receptor
Intestinal Trefoil Factor
pS2
Nat1 Gene for Arylamine N-Acetyltransferase
Transacting T-Cell Specific TF
Hepatocyte Nuclear Factor 3 Alpha
Prolactin Induced Protein
Cardiac Gap Junction Protein
Estrogen Regulated liv-1 Protein
Clone mRNA Sequence
X-Box Binding Protein-1
Gata 3
Type 1 Angiotensin II Receptor
Lung Amiloride Sensitive Na+ Channel Protein
Nonspecific Crossreacting Antigen
Androgen Receptor
Neuropeptide Y Receptor Y1

Genes down in ER+:
Matrilysin
RAR Responsive (tig1)
Omega Light Chain Protein 14.1
Guanylate Binding Protein Isoform I
Cystic Fibrosis Antigen
gp 39 Cartilage Protein Gene
Antileukoprotease (alp) from Cervix Uterus
Mesothelial Keratin k7 (type II)
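The ranking behind a list such as Table 1 can be sketched as a back-projection of posterior draws. In the following illustration the data, the gene labels and the array `gamma_draws` are all made up; `gamma_draws` merely stands in for MCMC output, with one row per posterior draw of $\gamma$. Each draw is mapped to gene weights by $\beta = A\gamma$ and genes are ordered by the absolute value of their posterior mean weight.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, n_draws = 300, 10, 500                    # hypothetical sizes
gene_names = [f"gene_{j}" for j in range(p)]    # hypothetical labels

X = rng.normal(size=(p, n))                     # columns are expression profiles
A, d, F = np.linalg.svd(X, full_matrices=False)

# Stand-in for sampler output: posterior draws of the super-gene weights.
gamma_draws = rng.normal(loc=0.3, scale=1.0, size=(n_draws, n))

# Back-projection: row i of beta_draws is beta = A gamma for draw i.
beta_draws = gamma_draws @ A.T                  # n_draws x p
beta_mean = beta_draws.mean(axis=0)             # posterior mean gene weights

# Rank genes by absolute posterior mean weight and keep the top 25.
order = np.argsort(-np.abs(beta_mean))[:25]
top_genes = [(gene_names[j], float(beta_mean[j])) for j in order]

# Positive weights push towards class 1, negative weights towards class 2.
up_in_class_1 = [name for name, w in top_genes if w > 0]
down_in_class_1 = [name for name, w in top_genes if w < 0]
```

Since `beta_draws` holds the full posterior sample for every gene, quantiles per column could be used in place of the mean when judging whether a weight differs clearly from zero.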

4 Discussion

We have discussed a complete statistical model that helps us understand molecular tumor characteristics. The core of the method is a combination of singular value decompositions and Bayesian binary regression. The choice of a special type of unbiased, relatively informative but structured priors makes binary regression practicable when using far more predictor variables than samples. In a first evaluation experiment, the method displays a high predictive capability in classifying expression profiles of human breast tumors with respect to their estrogen receptor status. However, the main achievement of our work is that we perform the supervised gene expression analysis not in the setup of a simple zero-one classification but in the more complex setup of binary regression. This enables us to assess and link the characteristics of large-scale expression data that relate to tumor type on a probability scale. Furthermore, we have access to uncertainties in the determination of classification probabilities by analyzing their complete posterior distribution. Finally, we can obtain unique gene specific regression weights which highlight those genes that are most influential in the binary regression procedure.

In fact, this is only the start of the story: we aim to utilize complex expression data to extract molecular phenotypes of tumor samples. That is, rather than producing a list of differentially expressed genes, we want to extract patterns characterized by the co-behavior of subsets of genes. Assays of many single genes will be very much affected by experimental variability and sample heterogeneity. In contrast, when considering more complex expression patterns, the actual level of expression could vary, but the pattern should stay intact. For the binary regression we already exploit the co-behavior of genes. We now aim for methods to describe and extract significant expression patterns. The key is the posterior distribution of regression weights.
It fully summarizes the complex interactions between genes, and is available for exploration.

The breast cancer estrogen receptor study is a first test of the new Bayesian binary regression model. It was designed as a proof of principle experiment. Since the estrogen receptor is a transcription factor, we expected clear changes in gene expression in all those cells of the tumor specimen that actually express the estrogen receptor. However, biopsy specimens are heterogeneous tissues; not all cells in the sample are tumor cells, and even the tumor cells can exhibit some variability. Simple zero-one classification is hence inappropriate. Our method is designed to deal with heterogeneous specimens. Tissue heterogeneity results in conflicting expression profiles, but our method detects and reports these conflicts appropriately.

We are now investigating more complex problems, including the nodal status of tumors and survival probabilities of patients having ovarian cancer. There, the differences in gene expression profiles are more subtle changes of a probably larger number of genes. We have promising first results that we will extend and report in the near future.

Acknowledgments

We would like to thank Merlise Clyde for some helpful discussions. Rainer Spang and Harry Zuzan are partially supported by NISS under NSF grant DMS. Joseph Nevins is an

11 investigator of the Howard Hughes Medical Institute. References [1] West, M., Nevins, J.R., Marks, J.R., Spang, R., C. Blanchett & Zuzan, H. (2). (submitted to J. Amer. Stat. Assoc.). [2] West, M. (2) [3] Alizadeh, A. A. et al. (2) Nature 43, [11] Pollack, J. R., Perou, C. M., Alizadeh, A. A., Eisen, M. B., Pergamenschikov, A., Williams, C. F., Jeffrey, S. S., Botstein, D. & Brown, P. O. (1999) Nature Genetics 23, [12] Ross, D. T. et al. (2) Nature Genetics 24, [13] Aguilar, O. & West, M. (2) J. Bus. Econ. Stat. (in press). [14] Albert, J. H. & Chip, S. (1993) J. Am. Stat. Assoc. 88, [15] Albert, J. H. & Johnson, V. E. (1999) Ordinal Data Models (Springer, New York). [4] Alon, U., Barkai, A., Notterman, D.A., Gish, K., Ybarra, S., Mack, D. & Levine, A. J. (1999) Proc. Natl. Acad. Sci. 96, [5] Brown, M. P. S., Grundy, W.N., Lin, D., Cristianini, N., Sugnet, C. W., Furey, T. S.,Ares Jr., M. & Haussler, D. (2) Proc. Natl. Acad. Sci. 97, [6] DeRisi, J. L., Iyer, V. R., & Brown, P. O. (1997) Science 278, [7] Eisen, M. B., Spellman, P. T., Brown, P. O. & Botstein, D. (1998) Proc. Natl. Acad. Sci. 95, [8] Golub, T. R., et al. (1999) Science 286, [9] Lockhart D. J., et al. (1996) Nature Biotechnology, [1] Perou, C. M., et al. (1999) Proc. Natl. Acad. Sci. 96, [16] Schadt, E. E., Li, C., Su, C. & Wong, W. (2) J. Cell Biochem., (in press). [17] Zellner, A. in Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti (North-Holland, Amsterdam), [18] Dudoit, Yang, H.W., Callow, M. & Speed, T. P. (2) /users/terry/zarray/html/papersindex.html [19] Dudoit, S., Fridlyand, J. & Speed, T. P. (2) /users/terry/zarray/html/papersindex.html [2] Beissbarth, T. et al. (2) Bioinformatics (in press) [21] Alter, O., Brown, P. O. & Botstein, D. (2) Proc. Natl. Acad. Sci. 97, [22] DeRisi, J., Penland, L., Brown, P. O., Bittner, M., Meltzer, P. S., Ray, M., Chen, Y., Su, Y. A. & Trent, J. (1996) Nature Genetics 14,

[23] Friemert, C., Erfle, V. & Strauss, G. (1989) Methods Mol. Cell Biol. 1.
[24] Hauser, N. C., Vingron, M., Scheideler, M., Krems, B., Hellmuth, K., Entian, K. D. & Hoheisel, J. D. (1998) Yeast.
[25] Lennon, G. G. & Lehrach, H. (1991) Trends Genet. 7.
[26] Schena, M., Shalon, D., Davis, R. W. & Brown, P. O. (1995) Science.
[27] Lander, E. S. (1999) Nature Genetics.
[28] Hilsenbeck, S. G., Friedrichs, W. E., Schiff, R., O'Connell, P., Hansen, R. K., Osborne, C. K. & Fuqua, S. A. W. (1999) J. Natl. Cancer Inst. 91.
[29] Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E. S. & Golub, T. R. (1999) Proc. Natl. Acad. Sci. 97, 262.
[30] Holter, N. S., Madhusmita, M., Cieplak, M., Banavar, J. R. & Fedoroff, N. V. (2000) Proc. Natl. Acad. Sci. 97.
[31] Hastie, T. et al. Technical Report, Department of Health Research and Policy, Stanford University, 2000.
[32] Perou, C. M. et al. Nature 406.


More information

NEW METHODS FOR SENSITIVITY TESTS OF EXPLOSIVE DEVICES

NEW METHODS FOR SENSITIVITY TESTS OF EXPLOSIVE DEVICES NEW METHODS FOR SENSITIVITY TESTS OF EXPLOSIVE DEVICES Amit Teller 1, David M. Steinberg 2, Lina Teper 1, Rotem Rozenblum 2, Liran Mendel 2, and Mordechai Jaeger 2 1 RAFAEL, POB 2250, Haifa, 3102102, Israel

More information

Chapter 1. Introduction

Chapter 1. Introduction Chapter 1 Introduction 1.1 Motivation and Goals The increasing availability and decreasing cost of high-throughput (HT) technologies coupled with the availability of computational tools and data form a

More information

Cross-Study Projections of Genomic Biomarkers: An Evaluation in Cancer Genomics

Cross-Study Projections of Genomic Biomarkers: An Evaluation in Cancer Genomics of Genomic Biomarkers: An Evaluation in Cancer Genomics Joseph E. Lucas 1 *, Carlos M. Carvalho 2, Julia Ling-Yu Chen 1, Jen-Tsan Chi 1, Mike West 3 1 Institute for Genome Sciences and Policy, Duke University,

More information

arxiv: v2 [stat.ap] 7 Dec 2016

arxiv: v2 [stat.ap] 7 Dec 2016 A Bayesian Approach to Predicting Disengaged Youth arxiv:62.52v2 [stat.ap] 7 Dec 26 David Kohn New South Wales 26 david.kohn@sydney.edu.au Nick Glozier Brain Mind Centre New South Wales 26 Sally Cripps

More information

PSYCH-GA.2211/NEURL-GA.2201 Fall 2016 Mathematical Tools for Cognitive and Neural Science. Homework 5

PSYCH-GA.2211/NEURL-GA.2201 Fall 2016 Mathematical Tools for Cognitive and Neural Science. Homework 5 PSYCH-GA.2211/NEURL-GA.2201 Fall 2016 Mathematical Tools for Cognitive and Neural Science Homework 5 Due: 21 Dec 2016 (late homeworks penalized 10% per day) See the course web site for submission details.

More information

Investigating the robustness of the nonparametric Levene test with more than two groups

Investigating the robustness of the nonparametric Levene test with more than two groups Psicológica (2014), 35, 361-383. Investigating the robustness of the nonparametric Levene test with more than two groups David W. Nordstokke * and S. Mitchell Colp University of Calgary, Canada Testing

More information

Citation for published version (APA): Ebbes, P. (2004). Latent instrumental variables: a new approach to solve for endogeneity s.n.

Citation for published version (APA): Ebbes, P. (2004). Latent instrumental variables: a new approach to solve for endogeneity s.n. University of Groningen Latent instrumental variables Ebbes, P. IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document

More information

Part [2.1]: Evaluation of Markers for Treatment Selection Linking Clinical and Statistical Goals

Part [2.1]: Evaluation of Markers for Treatment Selection Linking Clinical and Statistical Goals Part [2.1]: Evaluation of Markers for Treatment Selection Linking Clinical and Statistical Goals Patrick J. Heagerty Department of Biostatistics University of Washington 174 Biomarkers Session Outline

More information