Classification of Microarray Gene Expression Data


1 Classification of Microarray Gene Expression Data Geoff McLachlan Department of Mathematics & Institute for Molecular Bioscience University of Queensland

2 Institute for Molecular Bioscience, University of Queensland

3 A wide range of supervised and unsupervised learning methods have been considered to better organize data, be it to infer coordinated patterns of gene expression, to discover molecular signatures of disease subtypes, or to derive various predictions. Statistical Methods for Gene Expression: Microarrays and Proteomics

4

5 by C. Tilstone, Nature 424. DNA microarrays have given geneticists and molecular biologists access to more data than ever before. But do these researchers have the statistical know-how to cope? Branching out: cluster analysis can group samples that show similar patterns of gene expression.

6 MICROARRAY DATA REPRESENTED by a $p \times n$ matrix $X = (x_1, \ldots, x_n)$, where $x_j$ contains the gene expressions for the $p$ genes of the $j$th tissue sample ($j = 1, \ldots, n$). $p$ = No. of genes; $n$ = No. of tissue samples. STANDARD STATISTICAL METHODOLOGY IS APPROPRIATE FOR $n \gg p$; HERE $p \gg n$.

7

8 Two Groups in Two Dimensions. All cluster information would be lost by collapsing to the first principal component. The principal ellipses of the two groups are shown as solid curves.

9 BioArray News (2, no. 35, 2002): Arrays Hold Promise for Cancer Diagnostics. Oncologists would like to use arrays to predict whether or not a cancer is going to spread in the body, how likely it is to respond to a certain type of treatment, and how long the patient will probably survive. It would be useful if the gene expression signatures could distinguish between subtypes of tumours that standard methods, such as histological pathology from a biopsy, fail to discriminate, and that require different treatments.

10 van t Veer & De Jong (2002, Nature Medicine 8): The microarray way to tailored cancer treatment. In principle, gene activities that determine the biological behaviour of a tumour are more likely to reflect its aggressiveness than general parameters such as tumour size and age of the patient. (Indistinguishable disease states in diffuse large B-cell lymphoma unravelled by microarray expression profiles: Shipp et al., 2002, Nature Med. 8)

11 by C. M. Schubert, Nature Medicine 9, 9. The Netherlands Cancer Institute in Amsterdam is to become the first institution in the world to use microarray techniques for the routine prognostic screening of cancer patients. Aiming for a June 2003 start date, the center will use a panoply of 70 genes to assess the tumor profile of breast cancer patients and to determine which women will receive adjuvant treatment after surgery.

12 Microarrays also to be used in the prediction of breast cancer by Mike West (Duke University) and the Koo Foundation Sun Yat-Sen Cancer Centre, Taipei Huang et al. (2003, The Lancet, Gene expression predictors of breast cancer).

13 CLASSIFICATION OF TISSUES: SUPERVISED CLASSIFICATION (DISCRIMINANT ANALYSIS). We OBSERVE the CLASS LABELS $y_1, \ldots, y_n$, where $y_j = i$ if the $j$th tissue sample comes from the $i$th class ($i = 1, \ldots, g$). AIM: TO CONSTRUCT A CLASSIFIER $C(x)$ FOR PREDICTING THE UNKNOWN CLASS LABEL $y$ OF A TISSUE SAMPLE $x$. e.g. $g = 2$ classes: $G_1$ = DISEASE-FREE, $G_2$ = METASTASES.

14

15 LINEAR CLASSIFIER. FORM $C(x) = \beta_0 + \beta^T x = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$ for the prediction of the group label $y$ of a future entity with feature vector $x$.

16 FISHER'S LINEAR DISCRIMINANT FUNCTION: $C(x) = \beta_0 + \beta^T x$, with
$$\beta = S^{-1}(\bar{x}_1 - \bar{x}_2), \qquad \beta_0 = -\tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)^T S^{-1}(\bar{x}_1 - \bar{x}_2),$$
where $\bar{x}_1$, $\bar{x}_2$, and $S$ are the sample means and pooled sample covariance matrix found from the training data.
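
As a concrete illustration of this rule, here is a minimal numpy sketch of Fisher's discriminant built from two training groups; numpy and the synthetic usage data are assumptions, not part of the slides, and with $p \gg n$ the pooled covariance $S$ is singular, so in practice the rule is applied only after gene selection.

```python
# Minimal sketch of Fisher's rule as written above (numpy assumed).
import numpy as np

def fisher_rule(X1, X2):
    """Return x -> C(x) = beta_0 + beta^T x from two training groups (rows = samples)."""
    n1, n2 = len(X1), len(X2)
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    # pooled sample covariance matrix S
    S = ((n1 - 1) * np.cov(X1, rowvar=False) +
         (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    beta = np.linalg.solve(S, xbar1 - xbar2)      # S^{-1}(xbar1 - xbar2)
    beta0 = -0.5 * (xbar1 + xbar2) @ beta         # -(1/2)(xbar1 + xbar2)^T beta
    return lambda x: x @ beta + beta0             # allocate to G1 if C(x) > 0

# usage on toy data:
rng = np.random.default_rng(0)
C = fisher_rule(rng.normal(0, 1, (20, 5)), rng.normal(1, 1, (20, 5)))
print(C(np.zeros(5)) > 0)   # True -> allocate to G1
```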

17 SUPPORT VECTOR CLASSIFIER (Vapnik, 1995): $C(x) = \beta_0 + \beta^T x = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$, where $\beta_0$ and $\beta$ are obtained as follows:
$$\min_{\beta_0,\,\beta} \|\beta\| \quad \text{subject to} \quad y_j\, C(x_j) \geq 1 - \xi_j, \quad \xi_j \geq 0 \quad (j = 1, \ldots, n), \quad \sum_{j=1}^{n} \xi_j \leq \text{constant}.$$
The $\xi_1, \ldots, \xi_n$ are the slack variables ($\xi_j \equiv 0$ in the separable case).

18 The solution has
$$\hat{\beta} = \sum_{j=1}^{n} \hat{\alpha}_j y_j x_j,$$
with non-zero $\hat{\alpha}_j$ only for those observations for which the constraints are exactly met (the support vectors), so that
$$\hat{C}(x) = \hat{\beta}_0 + \hat{\beta}^T x = \hat{\beta}_0 + \sum_{j=1}^{n} \hat{\alpha}_j y_j\, x^T x_j.$$

19 Support Vector Machine (SVM): REPLACE $x$ by $h(x)$. Then
$$\hat{C}(x) = \hat{\beta}_0 + \sum_{j=1}^{n} \hat{\alpha}_j y_j \langle h(x),\, h(x_j)\rangle = \hat{\beta}_0 + \sum_{j=1}^{n} \hat{\alpha}_j y_j\, K(x, x_j),$$
where the kernel function $K(x, x_j) = \langle h(x),\, h(x_j)\rangle$ is the inner product in the transformed feature space.
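
The kernel expansion above can be checked numerically. The sketch below (scikit-learn and the toy data are assumptions, not part of the slides) fits an RBF-kernel SVM and rebuilds its decision function from the support vectors, the coefficients $\hat{\alpha}_j y_j$, and the kernel.

```python
# Sketch (scikit-learn assumed) that the fitted decision function has the
# kernel expansion beta_0 + sum_j alpha_j y_j K(x, x_j) over the support vectors.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

svm = SVC(kernel="rbf", gamma=0.2, C=1.0).fit(X, y)

x_new = rng.normal(size=(1, 5))
# dual_coef_ stores alpha_j * y_j for the support vectors only
K = rbf_kernel(x_new, svm.support_vectors_, gamma=0.2)
by_hand = K @ svm.dual_coef_.ravel() + svm.intercept_
print(np.allclose(by_hand, svm.decision_function(x_new)))  # True
```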

20 HASTIE et al. (2001, Chapter 12): The Lagrange (primal) function is
$$L_P = \tfrac{1}{2}\|\beta\|^2 + \gamma \sum_{j=1}^{n} \xi_j - \sum_{j=1}^{n} \alpha_j\,[\,y_j C(x_j) - (1 - \xi_j)\,] - \sum_{j=1}^{n} \mu_j \xi_j, \qquad (1)$$
which we minimize w.r.t. $\beta$, $\beta_0$, and the $\xi_j$. Setting the respective derivatives to zero, we get
$$\beta = \sum_{j=1}^{n} \alpha_j y_j x_j, \qquad (2)$$
$$0 = \sum_{j=1}^{n} \alpha_j y_j, \qquad (3)$$
$$\alpha_j = \gamma - \mu_j \quad (j = 1, \ldots, n), \qquad (4)$$
with $\alpha_j \geq 0$, $\mu_j \geq 0$, and $\xi_j \geq 0$ $(j = 1, \ldots, n)$.

21 By substituting (2) to (4) into (1), we obtain the Lagrangian dual function
$$L_D = \sum_{j=1}^{n} \alpha_j - \tfrac{1}{2} \sum_{j=1}^{n} \sum_{k=1}^{n} \alpha_j \alpha_k y_j y_k\, x_j^T x_k. \qquad (5)$$
We maximize (5) subject to $0 \leq \alpha_j \leq \gamma$ and $\sum_{j=1}^{n} \alpha_j y_j = 0$. In addition to (2) to (4), the constraints include
$$\alpha_j\,[\,y_j C(x_j) - (1 - \xi_j)\,] = 0, \qquad (6)$$
$$\mu_j \xi_j = 0, \qquad (7)$$
$$y_j C(x_j) - (1 - \xi_j) \geq 0, \qquad (8)$$
for $j = 1, \ldots, n$. Together, equations (2) to (8) uniquely characterize the solution to the primal and dual problem.

22 Leo Breiman (2001). Statistical modeling: the two cultures (with discussion). Statistical Science 16. Discussants include Brad Efron and David Cox.

23 Selection bias in gene extraction on the basis of microarray gene-expression data. Ambroise and McLachlan, Proceedings of the National Academy of Sciences, Vol. 99, Issue 10, May 14, 2002.

24 GUYON, WESTON, BARNHILL & VAPNIK (2002, Machine Learning). COLON Data (Alon et al., 1999); LEUKAEMIA Data (Golub et al., 1999).

25 Since $p \gg n$, consideration is given to the selection of suitable genes. SVM: FORWARD or BACKWARD selection (in terms of the magnitude of the weight $\beta_i$), as in RECURSIVE FEATURE ELIMINATION (RFE). FISHER: FORWARD ONLY (in terms of the CVE).
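
A hedged sketch of SVM-based recursive feature elimination in the spirit of Guyon et al. (2002); scikit-learn, the synthetic data, and the elimination step size are assumptions rather than the settings used in the slides.

```python
# RFE with a linear SVM: at each step, drop the genes with the smallest |beta_i|.
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 500))                 # n = 60 tissues, p = 500 "genes"
y = (X[:, :3].sum(axis=1) > 0).astype(int)     # only the first 3 genes matter

selector = RFE(SVC(kernel="linear", C=1.0), n_features_to_select=4, step=0.5)
selector.fit(X, y)
print(np.where(selector.support_)[0])          # indices of the retained genes
```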

26 LEUKAEMIA DATA: only 2 genes are needed to obtain a zero CVE (cross-validated error rate). COLON DATA: using only 4 genes, the CVE is 2%.

27 The success of the RFE indicates that RFE has a built-in regularization mechanism, which we do not yet understand, that prevents overfitting the training data in its selection of gene subsets.

28 Figure 1: Error rates of the SVM rule with RFE procedure averaged over 50 random splits of colon tissue samples

29 Figure 2: Error rates of the SVM rule with RFE procedure averaged over 50 random splits of leukemia tissue samples

30 Figure 3: Error rates of Fisher s rule with stepwise forward selection procedure using all the colon data

31 Figure 4: Error rates of Fisher s rule with stepwise forward selection procedure using all the leukemia data

32 Figure 5: Error rates of the SVM rule averaged over 20 noninformative samples generated by random permutations of the class labels of the colon tumor tissues

33 [Schematic: the full data set $(x_1, x_2, x_3, \ldots, x_n)$ is used to form the classifier $C(x)$, which allocates a sample to $G_1$ or $G_2$.]

34 From the original data set, remove $x_1$ to give the reduced set $(x_2, x_3, \ldots, x_n)$. Then form the classifier $C^{(1)}(x)$ from this reduced set. Use $C^{(1)}(x_1)$ to allocate $x_1$ to either $G_1$ or $G_2$.

35 Repeat this process for the second data point, $x_2$, so that this point is assigned to either $G_1$ or $G_2$ on the basis of the classifier $C^{(2)}(x_2)$. And so on, up to $x_n$.
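
The point of this leave-one-out scheme, and of the selection-bias work cited earlier, is that any gene selection must be repeated inside each validation fold. The sketch below (scikit-learn, the selection rule, and the pure-noise data are assumptions) contrasts the optimistic internal estimate with the honest external one.

```python
# Selection bias illustration: gene selection done once on all the data (internal CV)
# versus redone inside every leave-one-out training fold (external CV).
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 1000))        # pure noise: the true error rate is 50%
y = rng.integers(0, 2, size=40)

# WRONG (internal CV): select genes using all samples, then cross-validate
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
internal = cross_val_score(SVC(kernel="linear"), X_sel, y, cv=LeaveOneOut()).mean()

# RIGHT (external CV): selection is repeated inside each training fold
pipe = make_pipeline(SelectKBest(f_classif, k=20), SVC(kernel="linear"))
external = cross_val_score(pipe, X, y, cv=LeaveOneOut()).mean()

print(f"internal CV accuracy {internal:.2f} (optimistic), external {external:.2f}")
```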

36 Figure 1: Error rates of the SVM rule with RFE procedure averaged over 50 random splits of colon tissue samples

37 ADDITIONAL REFERENCES. Selection bias ignored: XIONG et al. (2001, Molecular Genetics and Metabolism); XIONG et al. (2001, Genome Research); ZHANG et al. (2001, PNAS). Aware of selection bias: SPANG et al. (2001, In Silico Biology); WEST et al. (2001, PNAS); NGUYEN and ROCKE (2002).

38 BOOTSTRAP APPROACH: Efron's (1983, JASA) .632 estimator is
$$B^{.632} = 0.368\,\mathrm{AE} + 0.632\,B1,$$
where $\mathrm{AE}$ is the apparent error rate and $B1$ is the bootstrap error rate when the rule is applied to a point not in the training sample. A Monte Carlo estimate of $B1$ is
$$B1 = \frac{1}{n} \sum_{j=1}^{n} E_j, \qquad E_j = \frac{\sum_{k=1}^{K} I_{jk} Q_{jk}}{\sum_{k=1}^{K} I_{jk}},$$
with $I_{jk} = 1$ if $x_j \notin$ the $k$th bootstrap sample (and 0 otherwise), and $Q_{jk} = 1$ if the rule $R_k^*$ formed from the $k$th bootstrap sample misallocates $x_j$ (and 0 otherwise).

39 Toussaint & Sharpe (1975) proposed the ERROR RATE ESTIMATOR $A(w) = (1 - w)\,\mathrm{AE} + w\,\mathrm{CV2E}$, where $w = 0.5$. McLachlan (1977) proposed $w = w_0$, where $w_0$ is chosen to minimize the asymptotic bias of $A(w)$ in the case of two homoscedastic normal groups. The value of $w_0$ was found to range between 0.6 and 0.7, depending on the values of $n_1$, $n_2$, $p$, and the Mahalanobis distance between the two groups.

40 The .632+ estimate of Efron & Tibshirani (1997, JASA) is
$$B^{.632+} = (1 - w)\,\mathrm{AE} + w\,B1, \qquad w = \frac{0.632}{1 - 0.368\,r},$$
$$r = \frac{B1 - \mathrm{AE}}{\gamma - \mathrm{AE}} \ \ \text{(relative overfitting rate)}, \qquad \gamma = \sum_{i=1}^{g} p_i (1 - q_i) \ \ \text{(estimate of the no-information error rate)}.$$
If $r = 0$, then $w = 0.632$, and so $B^{.632+} = B^{.632}$; if $r = 1$, then $w = 1$, and so $B^{.632+} = B1$.
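
A compact sketch of the bootstrap estimators defined on the preceding slides; the linear SVM base rule, K = 100 replicates, and numpy/scikit-learn are assumptions, not choices made in the slides.

```python
# Sketch of AE, B1, B.632 and B.632+ for a given classifier (numpy/scikit-learn assumed).
import numpy as np
from sklearn.svm import SVC

def dot632_estimates(X, y, K=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    clf = SVC(kernel="linear").fit(X, y)
    AE = np.mean(clf.predict(X) != y)                  # apparent error rate
    err_sum, cnt = np.zeros(n), np.zeros(n)
    for _ in range(K):
        idx = rng.integers(0, n, n)                    # kth bootstrap sample
        out = np.setdiff1d(np.arange(n), idx)          # points not drawn
        if out.size == 0 or np.unique(y[idx]).size < 2:
            continue                                   # skip degenerate resamples
        rk = SVC(kernel="linear").fit(X[idx], y[idx])  # rule R_k*
        err_sum[out] += rk.predict(X[out]) != y[out]   # Q_jk
        cnt[out] += 1                                  # I_jk
    B1 = np.mean(err_sum[cnt > 0] / cnt[cnt > 0])
    B632 = 0.368 * AE + 0.632 * B1
    p_i = np.bincount(y) / n                           # observed class proportions
    q_i = np.bincount(clf.predict(X), minlength=p_i.size) / n
    gamma = np.sum(p_i * (1 - q_i))                    # no-information error rate
    r = (B1 - AE) / (gamma - AE) if gamma > AE else 0.0
    r = float(np.clip(r, 0.0, 1.0))                    # relative overfitting rate
    w = 0.632 / (1 - 0.368 * r)
    return AE, B1, B632, (1 - w) * AE + w * B1

# usage on toy data:
Xd = np.random.default_rng(0).normal(size=(40, 10))
yd = (Xd[:, 0] > 0).astype(int)
print(dot632_estimates(Xd, yd))
```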

41 One concern is the heterogeneity of the tumours themselves, which consist of a mixture of normal and malignant cells, with blood vessels in between. Even if one pulled out some cancer cells from a tumour, there is no guarantee that those are the cells that are going to metastasize, just because tumours are heterogeneous. What we really need are expression profiles from hundreds or thousands of tumours linked to relevant, and appropriate, clinical data. John Quackenbush

42 UNSUPERVISED CLASSIFICATION (CLUSTER ANALYSIS): INFER the CLASS LABELS $y_1, \ldots, y_n$ of $x_1, \ldots, x_n$. Initially, hierarchical distance-based methods of cluster analysis were used to cluster the tissues and the genes: Eisen, Spellman, Brown, & Botstein (1998, PNAS).

43 Hierarchical (agglomerative) clustering algorithms are largely heuristically motivated and there exist a number of unresolved issues associated with their use, including how to determine the number of clusters. in the absence of a well-grounded statistical model, it seems difficult to define what is meant by a good clustering algorithm or the right number of clusters. (Yeung et al., 2001, Model-Based Clustering and Data Transformations for Gene Expression Data, Bioinformatics 17)

44 Attention is now turning towards a model-based approach to the analysis of microarray data. For example: Broët, Richardson, and Radvanyi (2002). Bayesian hierarchical model for identifying changes in gene expression from microarray experiments. Journal of Computational Biology 9. Ghosh and Chinnaiyan (2002). Mixture modelling of gene expression data from microarray experiments. Bioinformatics 18. Liu, Zhang, Palumbo, and Lawrence (2003). Bayesian clustering with variable and transformation selection. In Bayesian Statistics 7. Pan, Lin, and Le (2002). Model-based cluster analysis of microarray gene expression data. Genome Biology 3. Yeung et al. (2001). Model-based clustering and data transformations for gene expression data. Bioinformatics 17.

45 The notion of a cluster is not easy to define. There is a very large literature devoted to clustering when there is a metric known in advance; e.g. k-means. Usually, there is no a priori metric (or equivalently a user-defined distance matrix) for a cluster analysis. That is, the difficulty is that the shape of the clusters is not known until the clusters have been identified, and the clusters cannot be effectively identified unless the shapes are known.

46 In this case, one attractive feature of adopting mixture models with elliptically symmetric components such as the normal or t densities, is that the implied clustering is invariant under affine transformations of the data (that is, under operations relating to changes in location, scale, and rotation of the data). Thus the clustering process does not depend on irrelevant factors such as the units of measurement or the orientation of the clusters in space.

47

48 MIXTURE OF $g$ NORMAL COMPONENTS:
$$f(x) = \pi_1 \phi(x;\, \mu_1, \Sigma_1) + \cdots + \pi_g \phi(x;\, \mu_g, \Sigma_g).$$
Here
$$-2 \log \phi(x;\, \mu, \Sigma) = (x - \mu)^T \Sigma^{-1} (x - \mu) + \log |\Sigma| + \text{constant},$$
where $(x - \mu)^T \Sigma^{-1} (x - \mu)$ is the MAHALANOBIS DISTANCE between $x$ and $\mu$; it reduces to the squared EUCLIDEAN DISTANCE $(x - \mu)^T (x - \mu)$, up to a scale factor, when $\Sigma$ is proportional to the identity.
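
A tiny numerical check of the two distances just defined; numpy and the example numbers are assumptions.

```python
# Check that the Mahalanobis distance reduces to squared Euclidean distance / sigma^2
# when Sigma = sigma^2 I (numpy assumed).
import numpy as np

x = np.array([1.0, 2.0, 3.0])
mu = np.array([0.0, 1.0, 1.0])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.5]])

mahalanobis = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
sigma2 = 0.5
spherical = (x - mu) @ np.linalg.inv(sigma2 * np.eye(3)) @ (x - mu)
euclidean_sq = (x - mu) @ (x - mu)

print(mahalanobis)
print(spherical, euclidean_sq / sigma2)   # these two agree
```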

49 MIXTURE OF $g$ NORMAL COMPONENTS:
$$f(x) = \pi_1 \phi(x;\, \mu_1, \Sigma_1) + \cdots + \pi_g \phi(x;\, \mu_g, \Sigma_g).$$
Taking $\Sigma_1 = \cdots = \Sigma_g = \sigma^2 I$ (as in k-means) implies SPHERICAL CLUSTERS.

50 Equal spherical covariance matrices

51 Figure 6: Plot of Crab Data

52 Figure 7: Contours of the fitted component densities on the 2nd & 3rd variates for the blue crab data set.

53 With a mixture model-based approach to clustering, an observation is assigned outright to the $i$th cluster if its density in the $i$th component of the mixture distribution (weighted by the prior probability of that component) is greater than in the other $(g-1)$ components, where
$$f(x) = \pi_1 \phi(x;\, \mu_1, \Sigma_1) + \cdots + \pi_i \phi(x;\, \mu_i, \Sigma_i) + \cdots + \pi_g \phi(x;\, \mu_g, \Sigma_g).$$
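
A hedged sketch of this outright assignment rule using a fitted two-component normal mixture; scikit-learn's GaussianMixture and the simulated data are assumptions (EMMIX itself is a separate program).

```python
# Fit a normal mixture and assign each tissue to the component with the largest
# weighted density, i.e. the largest posterior probability (scikit-learn assumed).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(4, 1, (30, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
posteriors = gmm.predict_proba(X)   # posterior probabilities of component membership
labels = gmm.predict(X)             # arg max_i of pi_i * phi(x; mu_i, Sigma_i)
print(labels[:5], posteriors[:2].round(3))
```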

54 Finite Mixture Models

55 It was the publication of the seminal paper of Dempster, Laird, and Rubin (1977) on the EM algorithm that greatly stimulated interest in the use of finite mixture distributions to model heterogeneous data. McLachlan and Krishnan (1997, Wiley)

56 If need be, the normal mixture model can be made less sensitive to outlying observations by using t component densities. With this t mixture model-based approach, the normal distribution for each component in the mixture is embedded in a wider class of elliptically symmetric distributions with an additional parameter called the degrees of freedom.

57 The advantage of the t mixture model is that, although the number of outliers needed for breakdown is almost the same as with the normal mixture model, the outliers have to be much larger.

58

59 Clustering of genes on the basis of tissues: the genes are not independent. Clustering of tissues on the basis of genes: the latter is a nonstandard problem in cluster analysis ($n \ll p$).

60 McLachlan, Peel, Adams, and Basford (1999)

61 EMMIX for Windows

62 PROVIDES A MODEL-BASED APPROACH TO CLUSTERING. McLachlan, Bean, and Peel (2002). A Mixture Model-Based Approach to the Clustering of Microarray Expression Data. Bioinformatics 18.

63

64 n = 62 (40 tumours; 22 normals) tissue samples of p = 2,000 genes in a 2,000 × 62 matrix.

65

66

67

68

69

70

71

72 In this process, the genes are being treated anonymously. We may wish to incorporate existing biological information on the function of genes into the selection procedure. Lottaz and Spang (2003, Proceedings of the 54th Meeting of the ISI) structure the feature space by using a functional grid provided by the Gene Ontology annotations.

73

74

75 Grouping for Colon Data

76

77

78

79 Grouping for Colon Data

80 A normal mixture model without restrictions on the component-covariance matrices may be viewed as too general for many situations in practice, in particular, with high dimensional data. One approach for reducing the number of parameters is to work in a lower dimensional space by adopting mixtures of factor analyzers (Ghahramani & Hinton, 1997).

81 For mixtures of factor analyzers,
$$f(x) = \sum_{i=1}^{g} \pi_i\, \phi(x;\, \mu_i, \Sigma_i), \qquad \Sigma_i = B_i B_i^T + D_i \quad (i = 1, \ldots, g),$$
where $B_i$ is a $p \times q$ matrix of factor loadings and $D_i$ is a diagonal matrix.
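
The sketch below (numpy assumed; an illustration, not EMMIX) builds one constrained covariance matrix $\Sigma_i = B_i B_i^T + D_i$ and compares the number of covariance parameters under the factor-analytic restriction with the unrestricted count, which is the saving motivating the model.

```python
# Factor-analytic covariance structure and its parameter saving (numpy assumed).
import numpy as np

p, q, g = 40, 4, 3                      # dimension (e.g. 40 group means), factors, components
rng = np.random.default_rng(5)
B = rng.normal(size=(p, q))             # p x q matrix of factor loadings
D = np.diag(rng.uniform(0.5, 1.5, p))   # diagonal matrix of uniquenesses
Sigma = B @ B.T + D                     # constrained component-covariance matrix

full_cov_params = g * p * (p + 1) // 2                   # unrestricted Sigma_i
mfa_cov_params = g * (p * q + p - q * (q - 1) // 2)      # B_i B_i^T + D_i, after identifiability
print(full_cov_params, mfa_cov_params)
```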

82 Testing for the number of components, g, in a mixture is an important but very difficult problem which has not been completely resolved.

83 A mixture density with g components might be empirically indistinguishable from one with either fewer than g components or more than g components. It is therefore sensible in practice to approach the question of the number of components in a mixture model in terms of an assessment of the smallest number of components in the mixture compatible with the data.

84 An obvious way of approaching the problem of testing for the smallest value of the number of components in a mixture model is to use the LRTS, $-2 \log \lambda$. Suppose we wish to test the null hypothesis $H_0\!: g = g_0$ versus $H_1\!: g = g_1$ for some $g_1 > g_0$.

85 We let $\hat{\Psi}_i$ denote the MLE of $\Psi$ calculated under $H_i$ $(i = 0, 1)$. Then the evidence against $H_0$ will be strong if $\lambda$ is sufficiently small, or equivalently, if $-2 \log \lambda$ is sufficiently large, where
$$-2 \log \lambda = 2\{\log L(\hat{\Psi}_1) - \log L(\hat{\Psi}_0)\}.$$

86 McLachlan (1987) proposed a resampling approach to the assessment of the P-value of the LRTS in testing $H_0\!: g = g_0$ versus $H_1\!: g = g_0 + 1$ for a specified value of $g_0$.
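
A hedged sketch of this resampling assessment: simulate data from the fit under $H_0$, refit both models, and compare the resulting LRTS values with the observed one. scikit-learn's GaussianMixture, B = 99 replications, and the multiple-restart settings are assumptions, not the original implementation.

```python
# Parametric-bootstrap P-value for -2 log lambda in testing g = g0 vs g = g0 + 1.
import numpy as np
from sklearn.mixture import GaussianMixture

def lrts(X, g0, seed=0):
    fit0 = GaussianMixture(g0, n_init=5, random_state=seed).fit(X)
    fit1 = GaussianMixture(g0 + 1, n_init=5, random_state=seed).fit(X)
    # score() is the average log likelihood per sample, so multiply by n
    return 2 * (fit1.score(X) - fit0.score(X)) * len(X), fit0

def bootstrap_pvalue(X, g0, B=99, seed=0):
    stat_obs, fit0 = lrts(X, g0, seed)
    exceed = 0
    for b in range(B):
        Xb, _ = fit0.sample(len(X))                  # simulate under H0
        if lrts(Xb, g0, seed + b + 1)[0] >= stat_obs:
            exceed += 1
    return (exceed + 1) / (B + 1)

# usage on toy data:
rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(3, 1, (50, 3))])
print(bootstrap_pvalue(X, g0=1, B=19))   # small B only to keep the example quick
```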

87 The Bayesian information criterion (BIC) of Schwarz (1978) gives
$$\log L(\hat{\Psi}) - \tfrac{1}{2}\, d \log n$$
as the penalized log likelihood to be maximized in model selection, where $d$ is the number of free parameters; this includes the present situation of choosing the number of components $g$ in a mixture model.
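
A short companion sketch for BIC-based choice of $g$; scikit-learn is assumed, and note that its bic() method returns $-2 \log L + d \log n$, to be minimized, which ranks models in the same order as the penalized log likelihood above.

```python
# Choose the number of mixture components g by BIC (scikit-learn assumed).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(3, 1, (50, 3))])

bics = {g: GaussianMixture(g, random_state=0).fit(X).bic(X) for g in range(1, 5)}
print(min(bics, key=bics.get))   # chosen g (smallest BIC)
```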

88

89

90

91 Grouping for Leukemia Data

92

93

94

95 Breast cancer data set of van t Veer et al. (2002, Gene Expression Profiling Predicts Clinical Outcome of Breast Cancer, Nature 415). These data were the result of microarray experiments on three patient groups with different classes of breast cancer tumours. The overall goal was to identify a set of genes that could distinguish between the different tumour groups based upon the gene expression information for these groups.

96 The chips are down; Diagnosing breast cancer (Gene chips have shown that there are two sorts of breast cancer)

97 Nature News feature (Ball)

98 Colour-coded: this plot of gene-expression data shows breast tumours falling into two groups

99 Microarray data from 98 patients with primary breast cancers, with p = 24,881 genes: 44 from the good prognosis group (remained metastasis free after a period of more than 5 years); 34 from the poor prognosis group (developed distant metastases within 5 years); 20 with a hereditary form of cancer (18 with BRCA1; 2 with BRCA2).

100 Pre-processing filter of van t Veer et al.: only genes with both a P-value less than 0.01 and at least a two-fold difference in more than 5 out of the 98 tissues were retained. This reduces the data set to 4,869 genes.
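
A rough numpy sketch of a filter of this kind; the log-ratio layout, the source of the P-values, and the simulated values are all assumptions rather than the actual van t Veer et al. pre-processing.

```python
# Gene filter: keep genes showing both P < 0.01 and at least a two-fold change
# in more than 5 tissues (numpy assumed; values here are random noise).
import numpy as np

rng = np.random.default_rng(8)
logratio = rng.normal(0, 0.5, size=(4000, 98))   # genes x tissues, log10 ratios
pvalue = rng.uniform(0, 1, size=(4000, 98))      # per-measurement P-values

significant = pvalue < 0.01
twofold = np.abs(logratio) > np.log10(2)         # |log10 ratio| > log10(2)
keep = (significant & twofold).sum(axis=1) > 5   # both criteria in more than 5 tissues
print(keep.sum(), "genes retained out of", len(keep))   # with pure noise, essentially none survive
```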

101 Heat Map Displaying the Reduced Set of 4,869 Genes on the 98 Breast Cancer Tumours

102 Unsupervised Classification Analysis Using EMMIX-GENE. Steps used in the application of EMMIX-GENE: 1. Select the most relevant genes from this filtered set of 4,869 genes; the set of retained genes is thus reduced to 1,867. 2. Cluster these 1,867 genes into forty groups; the majority of gene groups produced were reasonably cohesive and distinct. 3. Using these forty group means, cluster the tissue samples into two and three components using a mixture of factor analyzers model with q = 4 factors.
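
The rough sketch below is not EMMIX-GENE; it is only a scikit-learn analogue of steps 2 and 3 on simulated data, with k-means standing in for the mixture-model clustering of genes and a diagonal-covariance normal mixture standing in for the mixture of factor analyzers.

```python
# Rough analogue of the EMMIX-GENE pipeline: cluster genes into groups, then
# cluster the tissues on the group-mean profiles (scikit-learn assumed).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(9)
genes_by_tissues = rng.normal(size=(1867, 98))        # retained genes x tissues

# Step 2 analogue: cluster the 1,867 genes into 40 groups
gene_groups = KMeans(n_clusters=40, n_init=10, random_state=0).fit_predict(genes_by_tissues)
group_means = np.vstack([genes_by_tissues[gene_groups == k].mean(axis=0)
                         for k in range(40)])         # 40 x 98 group-mean profiles

# Step 3 analogue: cluster the 98 tissues on the 40 group means
tissue_labels = GaussianMixture(n_components=2, covariance_type="diag",
                                random_state=0).fit_predict(group_means.T)
print(np.bincount(tissue_labels))
```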

103 Heat Map of Top 1,867 Genes

104

105

106

107 [Table of gene groups, where i = group number, m_i = number of genes in group i, and U_i = -2 log λ_i.]

108 Heat Map of Genes in Group G1

109 Heat Map of Genes in Group G2

110 Heat Map of Genes in Group G3

111 1. A change in gene expression is apparent between the sporadic (first 78 tissue samples) and hereditary (last 20 tissue samples) tumours. 2. The final two tissue samples (the two BRCA2 tumours) show consistent patterns of expression. This expression is different from that exhibited by the set of BRCA1 tumours. 3. The problem of trying to distinguish between the two classes, patients who were disease-free after 5 years (class 1) and those with metastases within 5 years (class 2), is not straightforward on the basis of the gene expressions.

112 Selection of Relevant Genes. We compared the genes selected by EMMIX-GENE with those genes retained in the original study by van t Veer et al. (2002). van t Veer et al. used an agglomerative hierarchical algorithm to organise the genes into dominant gene groups. Two of these groups were highlighted in their paper, with their genes corresponding to biologically significant features.

113 Identification of van t Veer et al.: Cluster A contains genes co-regulated with the ER-α gene (ESR1); Cluster B contains co-regulated genes that are the molecular reflection of extensive lymphocytic infiltrate, and comprise a set of genes expressed in T and B cells. [Table: number of genes in each cluster and number of matches with the genes retained by select-genes.] We can see that of the 80 genes identified by van t Veer et al., only 47 are retained by the select-genes step of the EMMIX-GENE algorithm.

114 Comparing Clusters from the Hierarchical Algorithm with those from the EMMIX-GENE Algorithm. [Table: for Cluster A and Cluster B, the cluster index (EMMIX-GENE), the number of genes matched, and the percentage matched (%).] Subsets of these 47 genes appeared inside several of the 40 groups produced by the cluster-genes step of EMMIX-GENE.

115 Genes Retained by EMMIX-GENE Appearing in Cluster A (vertical blue lines indicate the three groups of tumours)

116 Genes Rejected by EMMIX-GENE Appearing in Cluster A

117 Genes Retained by EMMIX-GENE Appearing in Cluster B

118 Genes Rejected by EMMIX-GENE Appearing in Cluster B

119 Assessing the Number of Tissue Groups. To assess the number of components g to be used in the normal mixture, the likelihood ratio statistic was adopted, and the resampling approach was used to assess the P-value. By proceeding sequentially, testing the null hypothesis $H_0\!: g = g_0$ versus the alternative hypothesis $H_1\!: g = g_0 + 1$, starting with $g_0 = 1$ and continuing until a non-significant result was obtained, it was concluded that g = 3 components were adequate for this data set.

120 Clustering Tissue Samples on the Basis of Gene Groups using EMMIX-GENE. The tissue samples can be subdivided into two groups corresponding to 78 sporadic tumours and 20 hereditary tumours. When the two-cluster assignment of EMMIX-GENE is compared to this genuine grouping, only 1 of the 20 hereditary tumour patients is misallocated, although 37 of the sporadic tumour patients are incorrectly assigned to the hereditary tumour cluster.

121 Using a mixture of factor analyzers model with q = 8 factors, we would misallocate: 7 out of the 44 members of class 1; 24 out of the 34 members of class 2; and 1 of the 18 BRCA1 samples. The misallocation rate of 24/34 for the second class is not surprising, given both the gene expressions as summarized in the groups of genes and the fact that we are classifying the tissues in an unsupervised manner without using the knowledge of their true classification.

122 Supervised Classification. When knowledge of the groups' true classification is used (van t Veer et al.), the reported error rate was approximately 50% for members of class 2 when allowance was made for the selection bias in forming a classifier on the basis of an optimal subset of the genes. Further analysis of this data set in a supervised context confirms the difficulty in trying to discriminate between the disease-free class 1 and the metastases class 2. (Tibshirani and Efron, 2002, Pre-Validation and Inference in Microarrays, Statistical Applications in Genetics and Molecular Biology 1)

123

124 Investigating Underlying Signatures with Other Clinical Indicators. The three clusters constructed by EMMIX-GENE were investigated in order to determine whether they followed a pattern contingent upon the clinical predictors of histological grade, angioinvasion, oestrogen receptor status, and lymphocytic infiltrate.

125

126 Microarrays have become promising diagnostic tools for clinical applications. However, large-scale screening approaches in general, and microarray technology in particular, inescapably lead to the challenging problem of learning from high-dimensional data.

127 Hope to see you in Cairns in 2004!
