MOST: detecting cancer differential gene expression

Save this PDF as:
 WORD  PNG  TXT  JPG

Size: px
Start display at page:

Download "MOST: detecting cancer differential gene expression"

Transcription

1 Biostatistics (2008), 9, 3, pp doi: /biostatistics/kxm042 Advance Access publication on November 29, 2007 MOST: detecting cancer differential gene expression HENG LIAN Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore SUMMARY We propose a new statistics for the detection of differentially expressed genes when the genes are activated only in a subset of the samples. Statistics designed for this unconventional circumstance has proved to be valuable for most cancer studies, where oncogenes are activated for a small number of disease samples. Previous efforts made in this direction include cancer outlier profile analysis (Tomlins and others, 2005), outlier sum (Tibshirani and Hastie, 2007), and outlier robust t-statistics (Wu, 2007). We propose a new statistics called maximum ordered subset t-statistics (MOST) which seems to be natural when the number of activated samples is unknown. We compare MOST to other statistics and find that the proposed method often has more power then its competitors. Keywords: Cancer; COPA; Differential gene expression; Microarray. 1. INTRODUCTION The most popular method for differential gene expression detection in 2-sample microarray studies is to compute the t-statistics. The differentially expressed genes are those whose t-statistics exceed a certain threshold. Recently, many researchers have come to the realization that in many cancer studies, many genes show increased expressions in disease samples, but only for a small number of those samples. The study of Tomlins and others (2005) shows that t-statistics has low power in this case, and they introduced the so-called cancer outlier profile analysis (COPA). Their study shows clearly that COPA can perform better than the traditional t-statistics for cancer microarray data sets. More recently, several progresses have been made in this direction with the aim to design better statistics to account for the heterogeneous activation pattern of the cancer genes. In Tibshirani and Hastie (2007), the authors introduced a new statistics, which they called outlier sum. Later, Wu (2007) proposed outlier robust t-statistics (ORT) and showed it usually outperformed the previously proposed ones in both simulation study and application to real data set. In this paper, we propose another statistics for the detection of cancer differential gene expression which have similar power to ORT when the number of activated samples is very small, but perform better when more samples are differentially expressed. We call our new method the maximum ordered subset t-statistics (MOST). Through simulation studies, we found the new statistics outperformed the previously To whom correspondence should be addressed. c The Author Published by Oxford University Press. All rights reserved. For permissions, please

2 412 H. LIAN proposed ones under some circumstances and never significantly worse in all situations. Thus, we think it is a valuable addition to the dictionary of cancer outlier expression detection. 2. MAXIMUM ORDERED SUBSET t-statistics We consider the simple 2-class microarray data for detecting cancer genes. We assume there are n normal samples and m cancer samples. The gene expressions for normal samples are denoted by x ij for genes i = 1, 2,...,p and samples j = 1, 2,...n, while y ij denote the expressions for cancer samples with i = 1, 2,...,p and j = 1, 2,...m. In this paper, we are only interested in 1-sided test where the activated genes from cancer samples have a higher expression level. The extension to 2-sided test is straightforward. The usual t-statistics (up to a multiplication factor independent of genes) for 2-sample test of differences in means is defined for each gene i by T i = x i ȳ i, (2.1) s i where x i = j x ij/n is the average expression of gene i in normal samples, ȳ i = j y ij/m is the average expression of gene i in cancer samples, and s i is the usual pooled standard deviation estimate 1 si 2 j n = (x ij x i ) j m (y ij ȳ i ) 2. n + m 2 The t-statistics is powerful when the alternative distribution is such that y ij, j = 1, 2,...,m, all come from a distribution with a higher mean. Tomlins and others (2005) argues that for most cancer types, heterogeneous activation patterns make t-statistics inefficient for detecting those expression profiles. They defined the COPA statistics C i = q r ({y ij } 1 j m ) med i, (2.2) mad i where q r ( ) is the rth percentile of the data, med i = median({x ij } 1 j n, {y ij } 1 j m ) is the median of the pooled samples for gene i, and mad i = median({x ij med i } 1 j n, {y ij med i } 1 j m ) is the median absolute deviation of the pooled samples. The choice of r in (2.2) depends on the subjective judgement of the user. The use of med i and mad i to replace the mean and the standard deviation in (2.1) is due to robustness considerations since it is already known that some of the genes are differentially expressed. In (2.2), only one value of {y ij } is used in the computation. A more efficient strategy would be to use additional expression values. Let O i ={y ij : y ij > q 75 ({x ij } 1 j n, {y ij } 1 j m ) + IQR({x ij } 1 j n, {y ij } 1 j m )} (2.3) be the outliers from the cancer samples for gene i, where IQR( ) is the interquartile range of the data. The OS statistics from Tibshirani and Hastie (2007) is then defined as y ij O i (y ij med i ) OS i =. (2.4) mad i More recently, Wu (2007) studied ORT statistics, which is similar to OS statistics. The important difference that makes ORT superior is that outliers are defined relative to the normal sample instead of the pooled sample. So in their definition, O i ={y ij : y ij > q 75 ({x ij } 1 j n ) + IQR({x ij } 1 j n )}. (2.5)

3 MOST: detecting cancer differential gene expression 413 By similar reasoning, med i in OS is replaced by med ix and mad j by median({x ij med ix } 1 j n, {y ij med iy } 1 j m ), where med ix and med iy are the medians of normal and cancer samples, respectively. In both OS and ORT statistics, the outliers are defined somewhat arbitrarily with no convincing reasons. To address this question, we propose the following statistics that implicitly considers all possible values for outlier thresholds. Suppose for notational simplicity that {y ij } 1 j m are ordered for each i: y i1 y i2 y im. If the number of samples where oncogenes are activated is known, we would naturally define the statistics as 1 j k M ik = (y ij med ix ) median({x ij med ix } 1 j n, {y ij med iy } 1 j m ). (2.6) When k is not known to us, one would be tempted to define M i = max M ik. 1 k m But this does not quite work since obviously M ik for different values of k are not directly comparable under the null distribution that x ij, y ij N(0, 1). For example, when m = 2, we have E[y i1 med ix ] > 0, while E [ j=1,2 (y ij med ix ) ] = 0. This observation motivates us to normalize M ik such that each approximately has mean 0 and variance 1. This can be achieved by defining µ k = E [ 1 j k z j] and σk 2 = Var( 1 j k z j), where z1 > z 2 > > z m is the order statistics of m samples generated from the standard normal distribution. Then, we can define M ik as ( 1 j k M ik = (y )/ ij med ix ) median({x ij med ix } 1 j n, {y ij med iy } 1 j m ) µ k σ k (2.7) so that M ik has mean and variance approximately equal to 0 and 1, respectively. Finally, we can define our new statistics (called MOST) as M i = max M ik. (2.8) 1 k m With MOST, we practically consider every possible threshold above which y ij are taken to be outliers. In this formulation, the number of outliers is implicitly defined as arg max M ik. (2.9) 1 k m 3. SIMULATION STUDIES AND APPLICATION Some simulations are carried out to study MOST and compare its performance to OS, ORT, COPA, and t-statistics. For COPA, we choose to use the 90th percentile in its definition as in Tibshirani and Hastie (2007). We generate the expression data from standard normal with n = m = 20. For various values k, 1 k m, which is the number of differentially expressed cancer samples, a constant µ is added for differentially expressed genes. We simulated 1000 differentially and nondifferentially expressed genes and calculated the receiver operating characteristic (ROC) curves from them by choosing different thresholds for gene calls. Figures 1 and 2 plot the ROC curves for some combinations of k and µ. Forµ = 2 and k small, all 5 statistics behave similarly with t-statistics performing the worst. As k increases, t becomes better and OS and COPA begin to lose power. For µ = 1 and medium to large k, the performance of MOST is only worse than t and better than other statistics. Smaller k in this case basically leads to ROC curve that is close to a 45 line for all statistics since the signal µ = 1 is too weak in this case, so we do not show these

4 414 H. LIAN Fig. 1. ROC curves estimated based on simulation. The number of normal/cancer sample is n = m = 20. Various combinations of µ and k s are chosen. Other uninteresting results where all statistics have close to perfectly good or bad performances are excluded as explained in the main text.

5 MOST: detecting cancer differential gene expression 415 Fig. 2. More ROC curves. results. For µ = 4 and small k, MOST is better than ORT, COPA, and t, and in this situation, only OS is competitive with MOST. Larger k in this case will produce nearly perfect ROC curves for all statistics, and thus those results are also omitted. Besides ROC curves, we have also tried examining the possibility of using (2.9) for estimating the number of differentially expressed samples k but so far have been unable to get a reasonable estimate out of it. From the above simulations, we judge that our new estimate MOST is at least as good as other previously proposed statistics, sometimes much better. Thus, it is a valuable tool for detecting activated genes in many situations. As an example of real data application, the data from West and others (2001) is publicly available from The microarray used in the breast cancer study contains 7129 genes and 49 tumor samples, 25 of which with no positive lymph nodes identified and the other 24 with positive nodes. Similar to Wu (2007), we take the log transformation of the expressions after normalizing the data. We apply all the above mentioned statistics to the data. For real data application, we need to use 2-sided test. The modification for MOST required is straightforward. We just use 1 j k m ik = (y ij med ix ) median({x ij med ix } 1 j n, {y ij med iy } 1 j m ). (3.1)

6 416 H. LIAN This time with y ij ordered such that y i1 y i2 y im. We also need to normalize m ik as in (2.7) and then m i = min 1 k m m ik. This test will detect downward-regulated genes in disease samples. The 2-sided test will be the maximum of the absolute values of M i and m i, but the sign is kept. A search of PubMed returns 908 genes related to breast cancer, and these are mapped to 655 probe sets on the Affymetrix HuGeneFL microarray used in this experiment. The ranking of these are computed for all the 5 statistics. The differences in ranking between MOST and other statistics are plotted in Figure 3 as histograms. Negative values in the histograms show the superiority of the MOST statistics since they imply higher ranking to these breast cancer related genes given by MOST. The histograms show that for this data, the performance of MOST is superior to t-statistics (shown by the heavy left tail in the histogram) and similar to others. In a second example, we consider a subset of acute lymphoblastic leukemia (ALL) data consisting 79 samples patients with B-cell acute lymphoblastic leukemia, available from the Bioconductor project. The comparison is between 37 samples with the BCR/ABL fusion gene resulting from a translocation and 42 normal samples. In total, 358 probe sets on the array are annotated with the gene ontology term tyrosine Fig. 3. Comparison between MOST and other statistics on breast cancer data. The histogram shows the difference in ranking on 655 probe sets. In these figures, left-skewed histogram implies superiority of MOST.

7 MOST: detecting cancer differential gene expression 417 Fig. 4. Comparison between MOST and other statistics on ALL data. kinase activity, which is believed to mediate many of the effects due to BCR/ABL translocation. After a simple filtering to remove probe sets that are not expressed, 94 of those 358 probe sets remain. The differences in ranking between MOST and other statistics are shown as histograms in Figure 4. For this data, t-test and MOST seem to be superior to other methods. The R implementation of MOST statistics as well as sample code for simulation is available at REFERENCES TIBSHIRANI, R. AND HASTIE, T. (2007). Outlier sums for differential gene expression analysis. Biostatistics 8, 2 8. TOMLINS, S. A., RHODES, D. R., PERNER, S., DHANASEKARAN, S. M., MEHRA, R., SUN, X. W., VARAMBALLY, S., CAO, X., TCHINDA, J., KUEFER, R. and others (2005). Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science 310, WEST, M.,BLANCHETTE, C., DRESSMAN, H.,HUANG, E.,ISHIDA, S., SPANG, R., ZUZAN, H.,OLSON, J.A.J., MARKS, J. R. AND NEVINS, J. R. (2001). Predicting the clinical status of human breast cancer by using

8 418 H. LIAN gene expression profiles. Proceedings of the National Academy of Sciences of the United States of America 98, WU, B. (2007). Cancer outlier differential gene expression detection. Biostatistics 8, [Received August 8, 2007; first revision September 24, 2007; second revision October 23, 2007; accepted for publication October 25, 2007]

Cancer outlier differential gene expression detection

Cancer outlier differential gene expression detection Biostatistics (2007), 8, 3, pp. 566 575 doi:10.1093/biostatistics/kxl029 Advance Access publication on October 4, 2006 Cancer outlier differential gene expression detection BAOLIN WU Division of Biostatistics,

More information

Bayesian Confidence Intervals for Means and Variances of Lognormal and Bivariate Lognormal Distributions

Bayesian Confidence Intervals for Means and Variances of Lognormal and Bivariate Lognormal Distributions Bayesian Confidence Intervals for Means and Variances of Lognormal and Bivariate Lognormal Distributions J. Harvey a,b, & A.J. van der Merwe b a Centre for Statistical Consultation Department of Statistics

More information

VU Biostatistics and Experimental Design PLA.216

VU Biostatistics and Experimental Design PLA.216 VU Biostatistics and Experimental Design PLA.216 Julia Feichtinger Postdoctoral Researcher Institute of Computational Biotechnology Graz University of Technology Outline for Today About this course Background

More information

Chapter 1: Exploring Data

Chapter 1: Exploring Data Chapter 1: Exploring Data Key Vocabulary:! individual! variable! frequency table! relative frequency table! distribution! pie chart! bar graph! two-way table! marginal distributions! conditional distributions!

More information

AP Statistics. Semester One Review Part 1 Chapters 1-5

AP Statistics. Semester One Review Part 1 Chapters 1-5 AP Statistics Semester One Review Part 1 Chapters 1-5 AP Statistics Topics Describing Data Producing Data Probability Statistical Inference Describing Data Ch 1: Describing Data: Graphically and Numerically

More information

Lecture Outline. Biost 517 Applied Biostatistics I. Purpose of Descriptive Statistics. Purpose of Descriptive Statistics

Lecture Outline. Biost 517 Applied Biostatistics I. Purpose of Descriptive Statistics. Purpose of Descriptive Statistics Biost 517 Applied Biostatistics I Scott S. Emerson, M.D., Ph.D. Professor of Biostatistics University of Washington Lecture 3: Overview of Descriptive Statistics October 3, 2005 Lecture Outline Purpose

More information

Diagnosis of multiple cancer types by shrunken centroids of gene expression

Diagnosis of multiple cancer types by shrunken centroids of gene expression Diagnosis of multiple cancer types by shrunken centroids of gene expression Robert Tibshirani, Trevor Hastie, Balasubramanian Narasimhan, and Gilbert Chu PNAS 99:10:6567-6572, 14 May 2002 Nearest Centroid

More information

Chapter 1: Introduction to Statistics

Chapter 1: Introduction to Statistics Chapter 1: Introduction to Statistics Variables A variable is a characteristic or condition that can change or take on different values. Most research begins with a general question about the relationship

More information

Case Studies on High Throughput Gene Expression Data Kun Huang, PhD Raghu Machiraju, PhD

Case Studies on High Throughput Gene Expression Data Kun Huang, PhD Raghu Machiraju, PhD Case Studies on High Throughput Gene Expression Data Kun Huang, PhD Raghu Machiraju, PhD Department of Biomedical Informatics Department of Computer Science and Engineering The Ohio State University Review

More information

Sawtooth Software. MaxDiff Analysis: Simple Counting, Individual-Level Logit, and HB RESEARCH PAPER SERIES. Bryan Orme, Sawtooth Software, Inc.

Sawtooth Software. MaxDiff Analysis: Simple Counting, Individual-Level Logit, and HB RESEARCH PAPER SERIES. Bryan Orme, Sawtooth Software, Inc. Sawtooth Software RESEARCH PAPER SERIES MaxDiff Analysis: Simple Counting, Individual-Level Logit, and HB Bryan Orme, Sawtooth Software, Inc. Copyright 009, Sawtooth Software, Inc. 530 W. Fir St. Sequim,

More information

PRACTICAL STATISTICS FOR MEDICAL RESEARCH

PRACTICAL STATISTICS FOR MEDICAL RESEARCH PRACTICAL STATISTICS FOR MEDICAL RESEARCH Douglas G. Altman Head of Medical Statistical Laboratory Imperial Cancer Research Fund London CHAPMAN & HALL/CRC Boca Raton London New York Washington, D.C. Contents

More information

Title: A robustness study of parametric and non-parametric tests in Model-Based Multifactor Dimensionality Reduction for epistasis detection

Title: A robustness study of parametric and non-parametric tests in Model-Based Multifactor Dimensionality Reduction for epistasis detection Author's response to reviews Title: A robustness study of parametric and non-parametric tests in Model-Based Multifactor Dimensionality Reduction for epistasis detection Authors: Jestinah M Mahachie John

More information

Classification of cancer profiles. ABDBM Ron Shamir

Classification of cancer profiles. ABDBM Ron Shamir Classification of cancer profiles 1 Background: Cancer Classification Cancer classification is central to cancer treatment; Traditional cancer classification methods: location; morphology, cytogenesis;

More information

Determining the Vulnerabilities of the Power Transmission System

Determining the Vulnerabilities of the Power Transmission System Determining the Vulnerabilities of the Power Transmission System B. A. Carreras D. E. Newman I. Dobson Depart. Fisica Universidad Carlos III Madrid, Spain Physics Dept. University of Alaska Fairbanks AK

More information

Investigating the robustness of the nonparametric Levene test with more than two groups

Investigating the robustness of the nonparametric Levene test with more than two groups Psicológica (2014), 35, 361-383. Investigating the robustness of the nonparametric Levene test with more than two groups David W. Nordstokke * and S. Mitchell Colp University of Calgary, Canada Testing

More information

2016 Children and young people s inpatient and day case survey

2016 Children and young people s inpatient and day case survey NHS Patient Survey Programme 2016 Children and young people s inpatient and day case survey Technical details for analysing trust-level results Published November 2017 CQC publication Contents 1. Introduction...

More information

On the Combination of Collaborative and Item-based Filtering

On the Combination of Collaborative and Item-based Filtering On the Combination of Collaborative and Item-based Filtering Manolis Vozalis 1 and Konstantinos G. Margaritis 1 University of Macedonia, Dept. of Applied Informatics Parallel Distributed Processing Laboratory

More information

Louis Leon Thurstone in Monte Carlo: Creating Error Bars for the Method of Paired Comparison

Louis Leon Thurstone in Monte Carlo: Creating Error Bars for the Method of Paired Comparison Louis Leon Thurstone in Monte Carlo: Creating Error Bars for the Method of Paired Comparison Ethan D. Montag Munsell Color Science Laboratory, Chester F. Carlson Center for Imaging Science Rochester Institute

More information

Technical Specifications

Technical Specifications Technical Specifications In order to provide summary information across a set of exercises, all tests must employ some form of scoring models. The most familiar of these scoring models is the one typically

More information

Analysis of Rheumatoid Arthritis Data using Logistic Regression and Penalized Approach

Analysis of Rheumatoid Arthritis Data using Logistic Regression and Penalized Approach University of South Florida Scholar Commons Graduate Theses and Dissertations Graduate School November 2015 Analysis of Rheumatoid Arthritis Data using Logistic Regression and Penalized Approach Wei Chen

More information

On Regression Analysis Using Bivariate Extreme Ranked Set Sampling

On Regression Analysis Using Bivariate Extreme Ranked Set Sampling On Regression Analysis Using Bivariate Extreme Ranked Set Sampling Atsu S. S. Dorvlo atsu@squ.edu.om Walid Abu-Dayyeh walidan@squ.edu.om Obaid Alsaidy obaidalsaidy@gmail.com Abstract- Many forms of ranked

More information

breast cancer; relative risk; risk factor; standard deviation; strength of association

breast cancer; relative risk; risk factor; standard deviation; strength of association American Journal of Epidemiology The Author 2015. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health. All rights reserved. For permissions, please e-mail:

More information

A Comparison of Methods for Determining HIV Viral Set Point

A Comparison of Methods for Determining HIV Viral Set Point STATISTICS IN MEDICINE Statist. Med. 2006; 00:1 6 [Version: 2002/09/18 v1.11] A Comparison of Methods for Determining HIV Viral Set Point Y. Mei 1, L. Wang 2, S. E. Holte 2 1 School of Industrial and Systems

More information

Multi-class cancer outlier differential gene expression detection Supplementary Information

Multi-class cancer outlier differential gene expression detection Supplementary Information Multi-class cancer outlier differential gene expression detection Supplementary Information Fang Liu, Baolin Wu 1 Breast cancer microarray data analysis The breast cancer microarray data (West et al.,

More information

Understandable Statistics

Understandable Statistics Understandable Statistics correlated to the Advanced Placement Program Course Description for Statistics Prepared for Alabama CC2 6/2003 2003 Understandable Statistics 2003 correlated to the Advanced Placement

More information

Introduction & Basics

Introduction & Basics CHAPTER 1 Introduction & Basics 1.1 Statistics the Field... 1 1.2 Probability Distributions... 4 1.3 Study Design Features... 9 1.4 Descriptive Statistics... 13 1.5 Inferential Statistics... 16 1.6 Summary...

More information

Comparison of the Null Distributions of

Comparison of the Null Distributions of Comparison of the Null Distributions of Weighted Kappa and the C Ordinal Statistic Domenic V. Cicchetti West Haven VA Hospital and Yale University Joseph L. Fleiss Columbia University It frequently occurs

More information

Content. Basic Statistics and Data Analysis for Health Researchers from Foreign Countries. Research question. Example Newly diagnosed Type 2 Diabetes

Content. Basic Statistics and Data Analysis for Health Researchers from Foreign Countries. Research question. Example Newly diagnosed Type 2 Diabetes Content Quantifying association between continuous variables. Basic Statistics and Data Analysis for Health Researchers from Foreign Countries Volkert Siersma siersma@sund.ku.dk The Research Unit for General

More information

RELYING ON TRUST TO FIND RELIABLE INFORMATION. Keywords Reputation; recommendation; trust; information retrieval; open distributed systems.

RELYING ON TRUST TO FIND RELIABLE INFORMATION. Keywords Reputation; recommendation; trust; information retrieval; open distributed systems. Abstract RELYING ON TRUST TO FIND RELIABLE INFORMATION Alfarez Abdul-Rahman and Stephen Hailes Department of Computer Science, University College London Gower Street, London WC1E 6BT, United Kingdom E-mail:

More information

Chapter 4 Cellular Oncogenes ~ 4.6 -

Chapter 4 Cellular Oncogenes ~ 4.6 - Chapter 4 Cellular Oncogenes - 4.2 ~ 4.6 - Many retroviruses carrying oncogenes have been found in chickens and mice However, attempts undertaken during the 1970s to isolate viruses from most types of

More information

Attentional Theory Is a Viable Explanation of the Inverse Base Rate Effect: A Reply to Winman, Wennerholm, and Juslin (2003)

Attentional Theory Is a Viable Explanation of the Inverse Base Rate Effect: A Reply to Winman, Wennerholm, and Juslin (2003) Journal of Experimental Psychology: Learning, Memory, and Cognition 2003, Vol. 29, No. 6, 1396 1400 Copyright 2003 by the American Psychological Association, Inc. 0278-7393/03/$12.00 DOI: 10.1037/0278-7393.29.6.1396

More information

Students were asked to report how far (in miles) they each live from school. The following distances were recorded. 1 Zane Jackson 0.

Students were asked to report how far (in miles) they each live from school. The following distances were recorded. 1 Zane Jackson 0. Identifying Outliers Task Students were asked to report how far (in miles) they each live from school. The following distances were recorded. Student Distance 1 Zane 0.4 2 Jackson 0.5 3 Benjamin 1.0 4

More information

Responsiveness to feedback as a personal trait

Responsiveness to feedback as a personal trait Responsiveness to feedback as a personal trait Thomas Buser University of Amsterdam Leonie Gerhards University of Hamburg Joël van der Weele University of Amsterdam Pittsburgh, June 12, 2017 1 / 30 Feedback

More information

Section on Survey Research Methods JSM 2009

Section on Survey Research Methods JSM 2009 Missing Data and Complex Samples: The Impact of Listwise Deletion vs. Subpopulation Analysis on Statistical Bias and Hypothesis Test Results when Data are MCAR and MAR Bethany A. Bell, Jeffrey D. Kromrey

More information

Model Selection Methods for Cancer Staging and Other Disease Stratification Problems. Yunzhi Lin

Model Selection Methods for Cancer Staging and Other Disease Stratification Problems. Yunzhi Lin Model Selection Methods for Cancer Staging and Other Disease Stratification Problems by Yunzhi Lin A dissertation submitted in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

More information

Methodological skills

Methodological skills Methodological skills rma linguistics, week 3 Tamás Biró ACLC University of Amsterdam t.s.biro@uva.nl Tamás Biró, UvA 1 Topics today Parameter of the population. Statistic of the sample. Re: descriptive

More information

TITLE: Small Molecule Inhibitors of ERG and ETV1 in Prostate Cancer

TITLE: Small Molecule Inhibitors of ERG and ETV1 in Prostate Cancer AD Award Number: W81XWH-12-1-0399 TITLE: Small Molecule Inhibitors of ERG and ETV1 in Prostate Cancer PRINCIPAL INVESTIGATOR: Colm Morrissey CONTRACTING ORGANIZATION: University of Washington Seattle WA

More information

Bayesian Prediction Tree Models

Bayesian Prediction Tree Models Bayesian Prediction Tree Models Statistical Prediction Tree Modelling for Clinico-Genomics Clinical gene expression data - expression signatures, profiling Tree models for predictive sub-typing Combining

More information

Assessing the diagnostic accuracy of a sequence of tests

Assessing the diagnostic accuracy of a sequence of tests Biostatistics (2003), 4, 3,pp. 341 351 Printed in Great Britain Assessing the diagnostic accuracy of a sequence of tests MARY LOU THOMPSON Department of Biostatistics, Box 357232, University of Washington,

More information

An Empirical and Formal Analysis of Decision Trees for Ranking

An Empirical and Formal Analysis of Decision Trees for Ranking An Empirical and Formal Analysis of Decision Trees for Ranking Eyke Hüllermeier Department of Mathematics and Computer Science Marburg University 35032 Marburg, Germany eyke@mathematik.uni-marburg.de Stijn

More information

A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests

A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests David Shin Pearson Educational Measurement May 007 rr0701 Using assessment and research to promote learning Pearson Educational

More information

Detection Theory: Sensitivity and Response Bias

Detection Theory: Sensitivity and Response Bias Detection Theory: Sensitivity and Response Bias Lewis O. Harvey, Jr. Department of Psychology University of Colorado Boulder, Colorado The Brain (Observable) Stimulus System (Observable) Response System

More information

UNIVERSITY OF THE FREE STATE DEPARTMENT OF COMPUTER SCIENCE AND INFORMATICS CSIS6813 MODULE TEST 2

UNIVERSITY OF THE FREE STATE DEPARTMENT OF COMPUTER SCIENCE AND INFORMATICS CSIS6813 MODULE TEST 2 UNIVERSITY OF THE FREE STATE DEPARTMENT OF COMPUTER SCIENCE AND INFORMATICS CSIS6813 MODULE TEST 2 DATE: 3 May 2017 MARKS: 75 ASSESSOR: Prof PJ Blignaut MODERATOR: Prof C de Villiers (UP) TIME: 2 hours

More information

STT315 Chapter 2: Methods for Describing Sets of Data - Part 2

STT315 Chapter 2: Methods for Describing Sets of Data - Part 2 Chapter 2.5 Interpreting Standard Deviation Chebyshev Theorem Empirical Rule Chebyshev Theorem says that for ANY shape of data distribution at least 3/4 of all data fall no farther from the mean than 2

More information

Introduction to Discrimination in Microarray Data Analysis

Introduction to Discrimination in Microarray Data Analysis Introduction to Discrimination in Microarray Data Analysis Jane Fridlyand CBMB University of California, San Francisco Genentech Hall Auditorium, Mission Bay, UCSF October 23, 2004 1 Case Study: Van t

More information

The reliability and validity of the Index of Complexity, Outcome and Need for determining treatment need in Dutch orthodontic practice

The reliability and validity of the Index of Complexity, Outcome and Need for determining treatment need in Dutch orthodontic practice European Journal of Orthodontics 28 (2006) 58 64 doi:10.1093/ejo/cji085 Advance Access publication 8 November 2005 The Author 2005. Published by Oxford University Press on behalf of the European Orthodontics

More information

Sawtooth Software. The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? RESEARCH PAPER SERIES

Sawtooth Software. The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? RESEARCH PAPER SERIES Sawtooth Software RESEARCH PAPER SERIES The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? Dick Wittink, Yale University Joel Huber, Duke University Peter Zandan,

More information

Evidence Based Medicine

Evidence Based Medicine Course Goals Goals 1. Understand basic concepts of evidence based medicine (EBM) and how EBM facilitates optimal patient care. 2. Develop a basic understanding of how clinical research studies are designed

More information

1) What is the independent variable? What is our Dependent Variable?

1) What is the independent variable? What is our Dependent Variable? 1) What is the independent variable? What is our Dependent Variable? Independent Variable: Whether the font color and word name are the same or different. (Congruency) Dependent Variable: The amount of

More information

The Relative Performance of Full Information Maximum Likelihood Estimation for Missing Data in Structural Equation Models

The Relative Performance of Full Information Maximum Likelihood Estimation for Missing Data in Structural Equation Models University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln Educational Psychology Papers and Publications Educational Psychology, Department of 7-1-2001 The Relative Performance of

More information

Statistical analysis DIANA SAPLACAN 2017 * SLIDES ADAPTED BASED ON LECTURE NOTES BY ALMA LEORA CULEN

Statistical analysis DIANA SAPLACAN 2017 * SLIDES ADAPTED BASED ON LECTURE NOTES BY ALMA LEORA CULEN Statistical analysis DIANA SAPLACAN 2017 * SLIDES ADAPTED BASED ON LECTURE NOTES BY ALMA LEORA CULEN Vs. 2 Background 3 There are different types of research methods to study behaviour: Descriptive: observations,

More information

Fixed-Effect Versus Random-Effects Models

Fixed-Effect Versus Random-Effects Models PART 3 Fixed-Effect Versus Random-Effects Models Introduction to Meta-Analysis. Michael Borenstein, L. V. Hedges, J. P. T. Higgins and H. R. Rothstein 2009 John Wiley & Sons, Ltd. ISBN: 978-0-470-05724-7

More information

Building Multi-Marker Algorithms for Disease Prediction The Role of Correlations Among Markers

Building Multi-Marker Algorithms for Disease Prediction The Role of Correlations Among Markers Biomarker Insights Original Research Open Access Full open access to this and thousands of other papers at http://www.la-press.com. Building Multi-Marker Algorithms for Disease Prediction The Role of Correlations

More information

Understanding DNA Copy Number Data

Understanding DNA Copy Number Data Understanding DNA Copy Number Data Adam B. Olshen Department of Epidemiology and Biostatistics Helen Diller Family Comprehensive Cancer Center University of California, San Francisco http://cc.ucsf.edu/people/olshena_adam.php

More information

Motivation: Fraud Detection

Motivation: Fraud Detection Outlier Detection Motivation: Fraud Detection http://i.imgur.com/ckkoaop.gif Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1) 2 Techniques: Fraud Detection Features Dissimilarity Groups and

More information

Individualized Treatment Effects Using a Non-parametric Bayesian Approach

Individualized Treatment Effects Using a Non-parametric Bayesian Approach Individualized Treatment Effects Using a Non-parametric Bayesian Approach Ravi Varadhan Nicholas C. Henderson Division of Biostatistics & Bioinformatics Department of Oncology Johns Hopkins University

More information

Statistics 202: Data Mining. c Jonathan Taylor. Final review Based in part on slides from textbook, slides of Susan Holmes.

Statistics 202: Data Mining. c Jonathan Taylor. Final review Based in part on slides from textbook, slides of Susan Holmes. Final review Based in part on slides from textbook, slides of Susan Holmes December 5, 2012 1 / 1 Final review Overview Before Midterm General goals of data mining. Datatypes. Preprocessing & dimension

More information

Lab 2: Non-linear regression models

Lab 2: Non-linear regression models Lab 2: Non-linear regression models The flint water crisis has received much attention in the past few months. In today s lab, we will apply the non-linear regression to analyze the lead testing results

More information

MODEL PERFORMANCE ANALYSIS AND MODEL VALIDATION IN LOGISTIC REGRESSION

MODEL PERFORMANCE ANALYSIS AND MODEL VALIDATION IN LOGISTIC REGRESSION STATISTICA, anno LXIII, n. 2, 2003 MODEL PERFORMANCE ANALYSIS AND MODEL VALIDATION IN LOGISTIC REGRESSION 1. INTRODUCTION Regression models are powerful tools frequently used to predict a dependent variable

More information

Manifestation Of Differences In Item-Level Characteristics In Scale-Level Measurement Invariance Tests Of Multi-Group Confirmatory Factor Analyses

Manifestation Of Differences In Item-Level Characteristics In Scale-Level Measurement Invariance Tests Of Multi-Group Confirmatory Factor Analyses Journal of Modern Applied Statistical Methods Copyright 2005 JMASM, Inc. May, 2005, Vol. 4, No.1, 275-282 1538 9472/05/$95.00 Manifestation Of Differences In Item-Level Characteristics In Scale-Level Measurement

More information

Business Statistics (ECOE 1302) Spring Semester 2011 Chapter 3 - Numerical Descriptive Measures Solutions

Business Statistics (ECOE 1302) Spring Semester 2011 Chapter 3 - Numerical Descriptive Measures Solutions The Islamic University of Gaza Faculty of Commerce Department of Economics and Political Sciences Business Statistics (ECOE 1302) Spring Semester 2011 Chapter 3 - Numerical Descriptive Measures Solutions

More information

Chapter 21 Multilevel Propensity Score Methods for Estimating Causal Effects: A Latent Class Modeling Strategy

Chapter 21 Multilevel Propensity Score Methods for Estimating Causal Effects: A Latent Class Modeling Strategy Chapter 21 Multilevel Propensity Score Methods for Estimating Causal Effects: A Latent Class Modeling Strategy Jee-Seon Kim and Peter M. Steiner Abstract Despite their appeal, randomized experiments cannot

More information

Proposing a New Term Weighting Scheme for Text Categorization

Proposing a New Term Weighting Scheme for Text Categorization Proposing a New Term Weighting Scheme for Text Categorization Man LAN Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore 119613 lanman@i2r.a-star.edu.sg Chew-Lim TAN School of Computing

More information

Predicting clinical outcomes in neuroblastoma with genomic data integration

Predicting clinical outcomes in neuroblastoma with genomic data integration Predicting clinical outcomes in neuroblastoma with genomic data integration Ilyes Baali, 1 Alp Emre Acar 1, Tunde Aderinwale 2, Saber HafezQorani 3, Hilal Kazan 4 1 Department of Electric-Electronics Engineering,

More information

Statistical Considerations for Post Market Surveillance

Statistical Considerations for Post Market Surveillance Statistical Considerations for Post Market Surveillance 30 th International Conference on Phamacoepidemiology and Therapeutic Risk Management Ted Lystig, Ph.D. MEDTRONIC INC. Distinguished Statistician

More information

A Real Case Comparison of Average and Population Bioequivalence for Evaluation of APSD data

A Real Case Comparison of Average and Population Bioequivalence for Evaluation of APSD data A Real Case Comparison of Average and Population Bioequivalence for Evaluation of APSD data IPAC-RS/UF Orlando Inhalation Conference March 18-20, 2014 Dennis Sandell, S5 Consulting Contents The data brief

More information

Human population sub-structure and genetic association studies

Human population sub-structure and genetic association studies Human population sub-structure and genetic association studies Stephanie A. Santorico, Ph.D. Department of Mathematical & Statistical Sciences Stephanie.Santorico@ucdenver.edu Global Similarity Map from

More information

Online Supplement to A Data-Driven Model of an Appointment-Generated Arrival Process at an Outpatient Clinic

Online Supplement to A Data-Driven Model of an Appointment-Generated Arrival Process at an Outpatient Clinic Online Supplement to A Data-Driven Model of an Appointment-Generated Arrival Process at an Outpatient Clinic Song-Hee Kim, Ward Whitt and Won Chul Cha USC Marshall School of Business, Los Angeles, CA 989,

More information

Introduction to Meta-analysis of Accuracy Data

Introduction to Meta-analysis of Accuracy Data Introduction to Meta-analysis of Accuracy Data Hans Reitsma MD, PhD Dept. of Clinical Epidemiology, Biostatistics & Bioinformatics Academic Medical Center - Amsterdam Continental European Support Unit

More information

A Comparison of Several Goodness-of-Fit Statistics

A Comparison of Several Goodness-of-Fit Statistics A Comparison of Several Goodness-of-Fit Statistics Robert L. McKinley The University of Toledo Craig N. Mills Educational Testing Service A study was conducted to evaluate four goodnessof-fit procedures

More information

Thinking about Inference

Thinking about Inference CHAPTER 15 c Jupiterimages/Age fotostock Thinking about Inference IN THIS CHAPTER WE COVER... To this point, we have met just two procedures for statistical inference. Both concern inference about the

More information

Statistical Techniques. Masoud Mansoury and Anas Abulfaraj

Statistical Techniques. Masoud Mansoury and Anas Abulfaraj Statistical Techniques Masoud Mansoury and Anas Abulfaraj What is Statistics? https://www.youtube.com/watch?v=lmmzj7599pw The definition of Statistics The practice or science of collecting and analyzing

More information

ABSTRACT THE INDEPENDENT MEANS T-TEST AND ALTERNATIVES SESUG Paper PO-10

ABSTRACT THE INDEPENDENT MEANS T-TEST AND ALTERNATIVES SESUG Paper PO-10 SESUG 01 Paper PO-10 PROC TTEST (Old Friend), What Are You Trying to Tell Us? Diep Nguyen, University of South Florida, Tampa, FL Patricia Rodríguez de Gil, University of South Florida, Tampa, FL Eun Sook

More information

Linking Errors in Trend Estimation in Large-Scale Surveys: A Case Study

Linking Errors in Trend Estimation in Large-Scale Surveys: A Case Study Research Report Linking Errors in Trend Estimation in Large-Scale Surveys: A Case Study Xueli Xu Matthias von Davier April 2010 ETS RR-10-10 Listening. Learning. Leading. Linking Errors in Trend Estimation

More information

ROC Curve. Brawijaya Professional Statistical Analysis BPSA MALANG Jl. Kertoasri 66 Malang (0341)

ROC Curve. Brawijaya Professional Statistical Analysis BPSA MALANG Jl. Kertoasri 66 Malang (0341) ROC Curve Brawijaya Professional Statistical Analysis BPSA MALANG Jl. Kertoasri 66 Malang (0341) 580342 ROC Curve The ROC Curve procedure provides a useful way to evaluate the performance of classification

More information

Breast screening: understanding case difficulty and the nature of errors

Breast screening: understanding case difficulty and the nature of errors Loughborough University Institutional Repository Breast screening: understanding case difficulty and the nature of errors This item was submitted to Loughborough University's Institutional Repository by

More information

Internal Consistency: Do We Really Know What It Is and How to. Assess it?

Internal Consistency: Do We Really Know What It Is and How to. Assess it? Internal Consistency: Do We Really Know What It Is and How to Assess it? Wei Tang, Ying Cui, CRAME, Department of Educational Psychology University of Alberta, CANADA The term internal consistency has

More information

3. Fixed-sample Clinical Trial Design

3. Fixed-sample Clinical Trial Design 3. Fixed-sample Clinical Trial Design 3. Fixed-sample clinical trial design 3.1 On choosing the sample size 3.2 Frequentist evaluation of a fixed-sample trial 3.3 Bayesian evaluation of a fixed-sample

More information

Multiple Comparisons and the Known or Potential Error Rate

Multiple Comparisons and the Known or Potential Error Rate Journal of Forensic Economics 19(2), 2006, pp. 231-236 2007 by the National Association of Forensic Economics Multiple Comparisons and the Known or Potential Error Rate David Tabak * I. Introduction Many

More information

A New Bliss Independence Model to Analyze Drug Combination Data

A New Bliss Independence Model to Analyze Drug Combination Data 51867JBXXXX10.1177/108705711451867Journal of Biomolecular ScreeningZhao et al. research-article014 Technical Note A New Bliss Independence Model to Analyze Drug Combination Data Journal of Biomolecular

More information

Reliability, validity, and all that jazz

Reliability, validity, and all that jazz Reliability, validity, and all that jazz Dylan Wiliam King s College London Introduction No measuring instrument is perfect. The most obvious problems relate to reliability. If we use a thermometer to

More information

On perfect working-memory performance with large numbers of items

On perfect working-memory performance with large numbers of items Psychon Bull Rev (2011) 18:958 963 DOI 10.3758/s13423-011-0108-7 BRIEF REPORT On perfect working-memory performance with large numbers of items Jonathan E. Thiele & Michael S. Pratte & Jeffrey N. Rouder

More information

Statistics Guide. Prepared by: Amanda J. Rockinson- Szapkiw, Ed.D.

Statistics Guide. Prepared by: Amanda J. Rockinson- Szapkiw, Ed.D. This guide contains a summary of the statistical terms and procedures. This guide can be used as a reference for course work and the dissertation process. However, it is recommended that you refer to statistical

More information

THE IMPORTANCE OF THE NORMALITY ASSUMPTION IN LARGE PUBLIC HEALTH DATA SETS

THE IMPORTANCE OF THE NORMALITY ASSUMPTION IN LARGE PUBLIC HEALTH DATA SETS Annu. Rev. Public Health 2002. 23:151 69 DOI: 10.1146/annurev.publheath.23.100901.140546 Copyright c 2002 by Annual Reviews. All rights reserved THE IMPORTANCE OF THE NORMALITY ASSUMPTION IN LARGE PUBLIC

More information

Risk Aversion in Games of Chance

Risk Aversion in Games of Chance Risk Aversion in Games of Chance Imagine the following scenario: Someone asks you to play a game and you are given $5,000 to begin. A ball is drawn from a bin containing 39 balls each numbered 1-39 and

More information

TESTING at any level (e.g., production, field, or on-board)

TESTING at any level (e.g., production, field, or on-board) IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, VOL. 54, NO. 3, JUNE 2005 1003 A Bayesian Approach to Diagnosis and Prognosis Using Built-In Test John W. Sheppard, Senior Member, IEEE, and Mark A.

More information

Introduction to Quantitative Methods (SR8511) Project Report

Introduction to Quantitative Methods (SR8511) Project Report Introduction to Quantitative Methods (SR8511) Project Report Exploring the variables related to and possibly affecting the consumption of alcohol by adults Student Registration number: 554561 Word counts

More information

Chapter 11. Experimental Design: One-Way Independent Samples Design

Chapter 11. Experimental Design: One-Way Independent Samples Design 11-1 Chapter 11. Experimental Design: One-Way Independent Samples Design Advantages and Limitations Comparing Two Groups Comparing t Test to ANOVA Independent Samples t Test Independent Samples ANOVA Comparing

More information

Here are the various choices. All of them are found in the Analyze menu in SPSS, under the sub-menu for Descriptive Statistics :

Here are the various choices. All of them are found in the Analyze menu in SPSS, under the sub-menu for Descriptive Statistics : Descriptive Statistics in SPSS When first looking at a dataset, it is wise to use descriptive statistics to get some idea of what your data look like. Here is a simple dataset, showing three different

More information

Knowledge discovery tools 381

Knowledge discovery tools 381 Knowledge discovery tools 381 hours, and prime time is prime time precisely because more people tend to watch television at that time.. Compare histograms from di erent periods of time. Changes in histogram

More information

Learning to Cook: An Exploration of Recipe Data

Learning to Cook: An Exploration of Recipe Data Learning to Cook: An Exploration of Recipe Data Travis Arffa (tarffa), Rachel Lim (rachelim), Jake Rachleff (jakerach) Abstract Using recipe data scraped from the internet, this project successfully implemented

More information

JSM Survey Research Methods Section

JSM Survey Research Methods Section Methods and Issues in Trimming Extreme Weights in Sample Surveys Frank Potter and Yuhong Zheng Mathematica Policy Research, P.O. Box 393, Princeton, NJ 08543 Abstract In survey sampling practice, unequal

More information

How many attributes should be used in a paired comparison task?

How many attributes should be used in a paired comparison task? How many attributes should be used in a paired comparison task? An empirical examination using a new validation approach Heinz Holling + Torsten Melles Wolfram Reiners February 1998 + Heinz Holling is

More information

GCE. Statistics (MEI) OCR Report to Centres. June Advanced Subsidiary GCE AS H132. Oxford Cambridge and RSA Examinations

GCE. Statistics (MEI) OCR Report to Centres. June Advanced Subsidiary GCE AS H132. Oxford Cambridge and RSA Examinations GCE Statistics (MEI) Advanced Subsidiary GCE AS H132 OCR Report to Centres June 2013 Oxford Cambridge and RSA Examinations OCR (Oxford Cambridge and RSA) is a leading UK awarding body, providing a wide

More information

Review+Practice. May 30, 2012

Review+Practice. May 30, 2012 Review+Practice May 30, 2012 Final: Tuesday June 5 8:30-10:20 Venue: Sections AA and AB (EEB 125), sections AC and AD (EEB 105), sections AE and AF (SIG 134) Format: Short answer. Bring: calculator, BRAINS

More information

Diagnostic imaging evaluating image quality using visual grading characteristic (VGC) analysis

Diagnostic imaging evaluating image quality using visual grading characteristic (VGC) analysis Vet Res Commun (2010) 34:473 479 DOI 10.1007/s11259-010-9413-2 SHORT COMMUNICATION Diagnostic imaging evaluating image quality using visual grading characteristic (VGC) analysis Eberhard Ludewig & Andreas

More information

Individual- and trial-level surrogacy in colorectal cancer

Individual- and trial-level surrogacy in colorectal cancer Stat Methods Med Res OnlineFirst, published on February 19, 2008 as doi:10.1177/0962280207081864 SMM081864 2008/2/5 page 1 Statistical Methods in Medical Research 2008; 1 9 Individual- and trial-level

More information

Testing the representation of time in reference memory in the bisection and the generalization task: The utility of a developmental approach

Testing the representation of time in reference memory in the bisection and the generalization task: The utility of a developmental approach PQJE178995 TECHSET COMPOSITION LTD, SALISBURY, U.K. 6/16/2006 THE QUARTERLY JOURNAL OF EXPERIMENTAL PSYCHOLOGY 0000, 00 (0), 1 17 Testing the representation of time in reference memory in the bisection

More information

EPIDEMIOLOGY. Training module

EPIDEMIOLOGY. Training module 1. Scope of Epidemiology Definitions Clinical epidemiology Epidemiology research methods Difficulties in studying epidemiology of Pain 2. Measures used in Epidemiology Disease frequency Disease risk Disease

More information

Effective Implementation of Bayesian Adaptive Randomization in Early Phase Clinical Development. Pantelis Vlachos.

Effective Implementation of Bayesian Adaptive Randomization in Early Phase Clinical Development. Pantelis Vlachos. Effective Implementation of Bayesian Adaptive Randomization in Early Phase Clinical Development Pantelis Vlachos Cytel Inc, Geneva Acknowledgement Joint work with Giacomo Mordenti, Grünenthal Virginie

More information