MOST: detecting cancer differential gene expression

Size: px
Start display at page:

Download "MOST: detecting cancer differential gene expression"

Transcription

1 Biostatistics (2008), 9, 3, pp doi: /biostatistics/kxm042 Advance Access publication on November 29, 2007 MOST: detecting cancer differential gene expression HENG LIAN Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore SUMMARY We propose a new statistics for the detection of differentially expressed genes when the genes are activated only in a subset of the samples. Statistics designed for this unconventional circumstance has proved to be valuable for most cancer studies, where oncogenes are activated for a small number of disease samples. Previous efforts made in this direction include cancer outlier profile analysis (Tomlins and others, 2005), outlier sum (Tibshirani and Hastie, 2007), and outlier robust t-statistics (Wu, 2007). We propose a new statistics called maximum ordered subset t-statistics (MOST) which seems to be natural when the number of activated samples is unknown. We compare MOST to other statistics and find that the proposed method often has more power then its competitors. Keywords: Cancer; COPA; Differential gene expression; Microarray. 1. INTRODUCTION The most popular method for differential gene expression detection in 2-sample microarray studies is to compute the t-statistics. The differentially expressed genes are those whose t-statistics exceed a certain threshold. Recently, many researchers have come to the realization that in many cancer studies, many genes show increased expressions in disease samples, but only for a small number of those samples. The study of Tomlins and others (2005) shows that t-statistics has low power in this case, and they introduced the so-called cancer outlier profile analysis (COPA). Their study shows clearly that COPA can perform better than the traditional t-statistics for cancer microarray data sets. More recently, several progresses have been made in this direction with the aim to design better statistics to account for the heterogeneous activation pattern of the cancer genes. In Tibshirani and Hastie (2007), the authors introduced a new statistics, which they called outlier sum. Later, Wu (2007) proposed outlier robust t-statistics (ORT) and showed it usually outperformed the previously proposed ones in both simulation study and application to real data set. In this paper, we propose another statistics for the detection of cancer differential gene expression which have similar power to ORT when the number of activated samples is very small, but perform better when more samples are differentially expressed. We call our new method the maximum ordered subset t-statistics (MOST). Through simulation studies, we found the new statistics outperformed the previously To whom correspondence should be addressed. c The Author Published by Oxford University Press. All rights reserved. For permissions, please journals.permissions@oxfordjournals.org.

2 412 H. LIAN proposed ones under some circumstances and never significantly worse in all situations. Thus, we think it is a valuable addition to the dictionary of cancer outlier expression detection. 2. MAXIMUM ORDERED SUBSET t-statistics We consider the simple 2-class microarray data for detecting cancer genes. We assume there are n normal samples and m cancer samples. The gene expressions for normal samples are denoted by x ij for genes i = 1, 2,...,p and samples j = 1, 2,...n, while y ij denote the expressions for cancer samples with i = 1, 2,...,p and j = 1, 2,...m. In this paper, we are only interested in 1-sided test where the activated genes from cancer samples have a higher expression level. The extension to 2-sided test is straightforward. The usual t-statistics (up to a multiplication factor independent of genes) for 2-sample test of differences in means is defined for each gene i by T i = x i ȳ i, (2.1) s i where x i = j x ij/n is the average expression of gene i in normal samples, ȳ i = j y ij/m is the average expression of gene i in cancer samples, and s i is the usual pooled standard deviation estimate 1 si 2 j n = (x ij x i ) j m (y ij ȳ i ) 2. n + m 2 The t-statistics is powerful when the alternative distribution is such that y ij, j = 1, 2,...,m, all come from a distribution with a higher mean. Tomlins and others (2005) argues that for most cancer types, heterogeneous activation patterns make t-statistics inefficient for detecting those expression profiles. They defined the COPA statistics C i = q r ({y ij } 1 j m ) med i, (2.2) mad i where q r ( ) is the rth percentile of the data, med i = median({x ij } 1 j n, {y ij } 1 j m ) is the median of the pooled samples for gene i, and mad i = median({x ij med i } 1 j n, {y ij med i } 1 j m ) is the median absolute deviation of the pooled samples. The choice of r in (2.2) depends on the subjective judgement of the user. The use of med i and mad i to replace the mean and the standard deviation in (2.1) is due to robustness considerations since it is already known that some of the genes are differentially expressed. In (2.2), only one value of {y ij } is used in the computation. A more efficient strategy would be to use additional expression values. Let O i ={y ij : y ij > q 75 ({x ij } 1 j n, {y ij } 1 j m ) + IQR({x ij } 1 j n, {y ij } 1 j m )} (2.3) be the outliers from the cancer samples for gene i, where IQR( ) is the interquartile range of the data. The OS statistics from Tibshirani and Hastie (2007) is then defined as y ij O i (y ij med i ) OS i =. (2.4) mad i More recently, Wu (2007) studied ORT statistics, which is similar to OS statistics. The important difference that makes ORT superior is that outliers are defined relative to the normal sample instead of the pooled sample. So in their definition, O i ={y ij : y ij > q 75 ({x ij } 1 j n ) + IQR({x ij } 1 j n )}. (2.5)

3 MOST: detecting cancer differential gene expression 413 By similar reasoning, med i in OS is replaced by med ix and mad j by median({x ij med ix } 1 j n, {y ij med iy } 1 j m ), where med ix and med iy are the medians of normal and cancer samples, respectively. In both OS and ORT statistics, the outliers are defined somewhat arbitrarily with no convincing reasons. To address this question, we propose the following statistics that implicitly considers all possible values for outlier thresholds. Suppose for notational simplicity that {y ij } 1 j m are ordered for each i: y i1 y i2 y im. If the number of samples where oncogenes are activated is known, we would naturally define the statistics as 1 j k M ik = (y ij med ix ) median({x ij med ix } 1 j n, {y ij med iy } 1 j m ). (2.6) When k is not known to us, one would be tempted to define M i = max M ik. 1 k m But this does not quite work since obviously M ik for different values of k are not directly comparable under the null distribution that x ij, y ij N(0, 1). For example, when m = 2, we have E[y i1 med ix ] > 0, while E [ j=1,2 (y ij med ix ) ] = 0. This observation motivates us to normalize M ik such that each approximately has mean 0 and variance 1. This can be achieved by defining µ k = E [ 1 j k z j] and σk 2 = Var( 1 j k z j), where z1 > z 2 > > z m is the order statistics of m samples generated from the standard normal distribution. Then, we can define M ik as ( 1 j k M ik = (y )/ ij med ix ) median({x ij med ix } 1 j n, {y ij med iy } 1 j m ) µ k σ k (2.7) so that M ik has mean and variance approximately equal to 0 and 1, respectively. Finally, we can define our new statistics (called MOST) as M i = max M ik. (2.8) 1 k m With MOST, we practically consider every possible threshold above which y ij are taken to be outliers. In this formulation, the number of outliers is implicitly defined as arg max M ik. (2.9) 1 k m 3. SIMULATION STUDIES AND APPLICATION Some simulations are carried out to study MOST and compare its performance to OS, ORT, COPA, and t-statistics. For COPA, we choose to use the 90th percentile in its definition as in Tibshirani and Hastie (2007). We generate the expression data from standard normal with n = m = 20. For various values k, 1 k m, which is the number of differentially expressed cancer samples, a constant µ is added for differentially expressed genes. We simulated 1000 differentially and nondifferentially expressed genes and calculated the receiver operating characteristic (ROC) curves from them by choosing different thresholds for gene calls. Figures 1 and 2 plot the ROC curves for some combinations of k and µ. Forµ = 2 and k small, all 5 statistics behave similarly with t-statistics performing the worst. As k increases, t becomes better and OS and COPA begin to lose power. For µ = 1 and medium to large k, the performance of MOST is only worse than t and better than other statistics. Smaller k in this case basically leads to ROC curve that is close to a 45 line for all statistics since the signal µ = 1 is too weak in this case, so we do not show these

4 414 H. LIAN Fig. 1. ROC curves estimated based on simulation. The number of normal/cancer sample is n = m = 20. Various combinations of µ and k s are chosen. Other uninteresting results where all statistics have close to perfectly good or bad performances are excluded as explained in the main text.

5 MOST: detecting cancer differential gene expression 415 Fig. 2. More ROC curves. results. For µ = 4 and small k, MOST is better than ORT, COPA, and t, and in this situation, only OS is competitive with MOST. Larger k in this case will produce nearly perfect ROC curves for all statistics, and thus those results are also omitted. Besides ROC curves, we have also tried examining the possibility of using (2.9) for estimating the number of differentially expressed samples k but so far have been unable to get a reasonable estimate out of it. From the above simulations, we judge that our new estimate MOST is at least as good as other previously proposed statistics, sometimes much better. Thus, it is a valuable tool for detecting activated genes in many situations. As an example of real data application, the data from West and others (2001) is publicly available from The microarray used in the breast cancer study contains 7129 genes and 49 tumor samples, 25 of which with no positive lymph nodes identified and the other 24 with positive nodes. Similar to Wu (2007), we take the log transformation of the expressions after normalizing the data. We apply all the above mentioned statistics to the data. For real data application, we need to use 2-sided test. The modification for MOST required is straightforward. We just use 1 j k m ik = (y ij med ix ) median({x ij med ix } 1 j n, {y ij med iy } 1 j m ). (3.1)

6 416 H. LIAN This time with y ij ordered such that y i1 y i2 y im. We also need to normalize m ik as in (2.7) and then m i = min 1 k m m ik. This test will detect downward-regulated genes in disease samples. The 2-sided test will be the maximum of the absolute values of M i and m i, but the sign is kept. A search of PubMed returns 908 genes related to breast cancer, and these are mapped to 655 probe sets on the Affymetrix HuGeneFL microarray used in this experiment. The ranking of these are computed for all the 5 statistics. The differences in ranking between MOST and other statistics are plotted in Figure 3 as histograms. Negative values in the histograms show the superiority of the MOST statistics since they imply higher ranking to these breast cancer related genes given by MOST. The histograms show that for this data, the performance of MOST is superior to t-statistics (shown by the heavy left tail in the histogram) and similar to others. In a second example, we consider a subset of acute lymphoblastic leukemia (ALL) data consisting 79 samples patients with B-cell acute lymphoblastic leukemia, available from the Bioconductor project. The comparison is between 37 samples with the BCR/ABL fusion gene resulting from a translocation and 42 normal samples. In total, 358 probe sets on the array are annotated with the gene ontology term tyrosine Fig. 3. Comparison between MOST and other statistics on breast cancer data. The histogram shows the difference in ranking on 655 probe sets. In these figures, left-skewed histogram implies superiority of MOST.

7 MOST: detecting cancer differential gene expression 417 Fig. 4. Comparison between MOST and other statistics on ALL data. kinase activity, which is believed to mediate many of the effects due to BCR/ABL translocation. After a simple filtering to remove probe sets that are not expressed, 94 of those 358 probe sets remain. The differences in ranking between MOST and other statistics are shown as histograms in Figure 4. For this data, t-test and MOST seem to be superior to other methods. The R implementation of MOST statistics as well as sample code for simulation is available at REFERENCES TIBSHIRANI, R. AND HASTIE, T. (2007). Outlier sums for differential gene expression analysis. Biostatistics 8, 2 8. TOMLINS, S. A., RHODES, D. R., PERNER, S., DHANASEKARAN, S. M., MEHRA, R., SUN, X. W., VARAMBALLY, S., CAO, X., TCHINDA, J., KUEFER, R. and others (2005). Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science 310, WEST, M.,BLANCHETTE, C., DRESSMAN, H.,HUANG, E.,ISHIDA, S., SPANG, R., ZUZAN, H.,OLSON, J.A.J., MARKS, J. R. AND NEVINS, J. R. (2001). Predicting the clinical status of human breast cancer by using

8 418 H. LIAN gene expression profiles. Proceedings of the National Academy of Sciences of the United States of America 98, WU, B. (2007). Cancer outlier differential gene expression detection. Biostatistics 8, [Received August 8, 2007; first revision September 24, 2007; second revision October 23, 2007; accepted for publication October 25, 2007]

Cancer outlier differential gene expression detection

Cancer outlier differential gene expression detection Biostatistics (2007), 8, 3, pp. 566 575 doi:10.1093/biostatistics/kxl029 Advance Access publication on October 4, 2006 Cancer outlier differential gene expression detection BAOLIN WU Division of Biostatistics,

More information

Detection of Outliers in Gene Expression Data Using Expressed Robust-t Test

Detection of Outliers in Gene Expression Data Using Expressed Robust-t Test Malaysian Journal of Mathematical Sciences 10(2): 117 135 (2016) MALAYSIAN JOURNAL OF MATHEMATICAL SCIENCES Journal homepage: http://einspem.upm.edu.my/journal Detection of Outliers in Gene Expression

More information

Monte Carlo Analysis of Univariate Statistical Outlier Techniques Mark W. Lukens

Monte Carlo Analysis of Univariate Statistical Outlier Techniques Mark W. Lukens Monte Carlo Analysis of Univariate Statistical Outlier Techniques Mark W. Lukens This paper examines three techniques for univariate outlier identification: Extreme Studentized Deviate ESD), the Hampel

More information

Appendix III Individual-level analysis

Appendix III Individual-level analysis Appendix III Individual-level analysis Our user-friendly experimental interface makes it possible to present each subject with many choices in the course of a single experiment, yielding a rich individual-level

More information

VU Biostatistics and Experimental Design PLA.216

VU Biostatistics and Experimental Design PLA.216 VU Biostatistics and Experimental Design PLA.216 Julia Feichtinger Postdoctoral Researcher Institute of Computational Biotechnology Graz University of Technology Outline for Today About this course Background

More information

Bayesian Confidence Intervals for Means and Variances of Lognormal and Bivariate Lognormal Distributions

Bayesian Confidence Intervals for Means and Variances of Lognormal and Bivariate Lognormal Distributions Bayesian Confidence Intervals for Means and Variances of Lognormal and Bivariate Lognormal Distributions J. Harvey a,b, & A.J. van der Merwe b a Centre for Statistical Consultation Department of Statistics

More information

Chapter 1: Exploring Data

Chapter 1: Exploring Data Chapter 1: Exploring Data Key Vocabulary:! individual! variable! frequency table! relative frequency table! distribution! pie chart! bar graph! two-way table! marginal distributions! conditional distributions!

More information

Module 3: Pathway and Drug Development

Module 3: Pathway and Drug Development Module 3: Pathway and Drug Development Table of Contents 1.1 Getting Started... 6 1.2 Identifying a Dasatinib sensitive cancer signature... 7 1.2.1 Identifying and validating a Dasatinib Signature... 7

More information

Supplemental Data. Article. The Role of SPINK1. in ETS Rearrangement-Negative Prostate Cancers

Supplemental Data. Article. The Role of SPINK1. in ETS Rearrangement-Negative Prostate Cancers Cancer Cell, Volume 13 Supplemental Data Article The Role of SPINK1 in ETS Rearrangement-Negative Prostate Cancers Scott A. Tomlins, Daniel R. Rhodes, Jianjun Yu, Sooryanarayana Varambally, Rohit Mehra,

More information

Data analysis and binary regression for predictive discrimination. using DNA microarray data. (Breast cancer) discrimination. Expression array data

Data analysis and binary regression for predictive discrimination. using DNA microarray data. (Breast cancer) discrimination. Expression array data West Mike of Statistics & Decision Sciences Institute Duke University wwwstatdukeedu IPAM Functional Genomics Workshop November Two group problems: Binary outcomes ffl eg, ER+ versus ER ffl eg, lymph node

More information

Observational studies; descriptive statistics

Observational studies; descriptive statistics Observational studies; descriptive statistics Patrick Breheny August 30 Patrick Breheny University of Iowa Biostatistical Methods I (BIOS 5710) 1 / 38 Observational studies Association versus causation

More information

Introduction. We can make a prediction about Y i based on X i by setting a threshold value T, and predicting Y i = 1 when X i > T.

Introduction. We can make a prediction about Y i based on X i by setting a threshold value T, and predicting Y i = 1 when X i > T. Diagnostic Tests 1 Introduction Suppose we have a quantitative measurement X i on experimental or observed units i = 1,..., n, and a characteristic Y i = 0 or Y i = 1 (e.g. case/control status). The measurement

More information

Previously, when making inferences about the population mean,, we were assuming the following simple conditions:

Previously, when making inferences about the population mean,, we were assuming the following simple conditions: Chapter 17 Inference about a Population Mean Conditions for inference Previously, when making inferences about the population mean,, we were assuming the following simple conditions: (1) Our data (observations)

More information

AP Statistics. Semester One Review Part 1 Chapters 1-5

AP Statistics. Semester One Review Part 1 Chapters 1-5 AP Statistics Semester One Review Part 1 Chapters 1-5 AP Statistics Topics Describing Data Producing Data Probability Statistical Inference Describing Data Ch 1: Describing Data: Graphically and Numerically

More information

The normal curve and standardisation. Percentiles, z-scores

The normal curve and standardisation. Percentiles, z-scores The normal curve and standardisation Percentiles, z-scores The normal curve Frequencies (histogram) Characterised by: Central tendency Mean Median Mode uni, bi, multi Positively skewed, negatively skewed

More information

Lecture Outline. Biost 517 Applied Biostatistics I. Purpose of Descriptive Statistics. Purpose of Descriptive Statistics

Lecture Outline. Biost 517 Applied Biostatistics I. Purpose of Descriptive Statistics. Purpose of Descriptive Statistics Biost 517 Applied Biostatistics I Scott S. Emerson, M.D., Ph.D. Professor of Biostatistics University of Washington Lecture 3: Overview of Descriptive Statistics October 3, 2005 Lecture Outline Purpose

More information

PROFILE SIMILARITY IN BIOEQUIVALENCE TRIALS

PROFILE SIMILARITY IN BIOEQUIVALENCE TRIALS Sankhyā : The Indian Journal of Statistics Special Issue on Biostatistics 2000, Volume 62, Series B, Pt. 1, pp. 149 161 PROFILE SIMILARITY IN BIOEQUIVALENCE TRIALS By DAVID T. MAUGER and VERNON M. CHINCHILLI

More information

Chapter 11: Advanced Remedial Measures. Weighted Least Squares (WLS)

Chapter 11: Advanced Remedial Measures. Weighted Least Squares (WLS) Chapter : Advanced Remedial Measures Weighted Least Squares (WLS) When the error variance appears nonconstant, a transformation (of Y and/or X) is a quick remedy. But it may not solve the problem, or it

More information

Part [2.1]: Evaluation of Markers for Treatment Selection Linking Clinical and Statistical Goals

Part [2.1]: Evaluation of Markers for Treatment Selection Linking Clinical and Statistical Goals Part [2.1]: Evaluation of Markers for Treatment Selection Linking Clinical and Statistical Goals Patrick J. Heagerty Department of Biostatistics University of Washington 174 Biomarkers Session Outline

More information

Ulrich Mansmann IBE, Medical School, University of Munich

Ulrich Mansmann IBE, Medical School, University of Munich Ulrich Mansmann IBE, Medical School, University of Munich Content of the lecture How to assess the relevance of groups of genes: - outstanding gene expression in a specific group compared to other genes;

More information

T. R. Golub, D. K. Slonim & Others 1999

T. R. Golub, D. K. Slonim & Others 1999 T. R. Golub, D. K. Slonim & Others 1999 Big Picture in 1999 The Need for Cancer Classification Cancer classification very important for advances in cancer treatment. Cancers of Identical grade can have

More information

PsychTests.com advancing psychology and technology

PsychTests.com advancing psychology and technology PsychTests.com advancing psychology and technology tel 514.745.8272 fax 514.745.6242 CP Normandie PO Box 26067 l Montreal, Quebec l H3M 3E8 contact@psychtests.com Psychometric Report Emotional Intelligence

More information

Supplement to SCnorm: robust normalization of single-cell RNA-seq data

Supplement to SCnorm: robust normalization of single-cell RNA-seq data Supplement to SCnorm: robust normalization of single-cell RNA-seq data Supplementary Note 1: SCnorm does not require spike-ins, since we find that the performance of spike-ins in scrna-seq is often compromised,

More information

Review. Imagine the following table being obtained as a random. Decision Test Diseased Not Diseased Positive TP FP Negative FN TN

Review. Imagine the following table being obtained as a random. Decision Test Diseased Not Diseased Positive TP FP Negative FN TN Outline 1. Review sensitivity and specificity 2. Define an ROC curve 3. Define AUC 4. Non-parametric tests for whether or not the test is informative 5. Introduce the binormal ROC model 6. Discuss non-parametric

More information

Introduction to Statistical Data Analysis I

Introduction to Statistical Data Analysis I Introduction to Statistical Data Analysis I JULY 2011 Afsaneh Yazdani Preface What is Statistics? Preface What is Statistics? Science of: designing studies or experiments, collecting data Summarizing/modeling/analyzing

More information

Estimation of Area under the ROC Curve Using Exponential and Weibull Distributions

Estimation of Area under the ROC Curve Using Exponential and Weibull Distributions XI Biennial Conference of the International Biometric Society (Indian Region) on Computational Statistics and Bio-Sciences, March 8-9, 22 43 Estimation of Area under the ROC Curve Using Exponential and

More information

Classical Psychophysical Methods (cont.)

Classical Psychophysical Methods (cont.) Classical Psychophysical Methods (cont.) 1 Outline Method of Adjustment Method of Limits Method of Constant Stimuli Probit Analysis 2 Method of Constant Stimuli A set of equally spaced levels of the stimulus

More information

Homework Exercises for PSYC 3330: Statistics for the Behavioral Sciences

Homework Exercises for PSYC 3330: Statistics for the Behavioral Sciences Homework Exercises for PSYC 3330: Statistics for the Behavioral Sciences compiled and edited by Thomas J. Faulkenberry, Ph.D. Department of Psychological Sciences Tarleton State University Version: July

More information

Boosted PRIM with Application to Searching for Oncogenic Pathway of Lung Cancer

Boosted PRIM with Application to Searching for Oncogenic Pathway of Lung Cancer Boosted PRIM with Application to Searching for Oncogenic Pathway of Lung Cancer Pei Wang Department of Statistics Stanford University Stanford, CA 94305 wp57@stanford.edu Young Kim, Jonathan Pollack Department

More information

Basic Statistics for Comparing the Centers of Continuous Data From Two Groups

Basic Statistics for Comparing the Centers of Continuous Data From Two Groups STATS CONSULTANT Basic Statistics for Comparing the Centers of Continuous Data From Two Groups Matt Hall, PhD, Troy Richardson, PhD Comparing continuous data across groups is paramount in research and

More information

Sawtooth Software. MaxDiff Analysis: Simple Counting, Individual-Level Logit, and HB RESEARCH PAPER SERIES. Bryan Orme, Sawtooth Software, Inc.

Sawtooth Software. MaxDiff Analysis: Simple Counting, Individual-Level Logit, and HB RESEARCH PAPER SERIES. Bryan Orme, Sawtooth Software, Inc. Sawtooth Software RESEARCH PAPER SERIES MaxDiff Analysis: Simple Counting, Individual-Level Logit, and HB Bryan Orme, Sawtooth Software, Inc. Copyright 009, Sawtooth Software, Inc. 530 W. Fir St. Sequim,

More information

Chapter 9. Youth Counseling Impact Scale (YCIS)

Chapter 9. Youth Counseling Impact Scale (YCIS) Chapter 9 Youth Counseling Impact Scale (YCIS) Background Purpose The Youth Counseling Impact Scale (YCIS) is a measure of perceived effectiveness of a specific counseling session. In general, measures

More information

Conditional Distributions and the Bivariate Normal Distribution. James H. Steiger

Conditional Distributions and the Bivariate Normal Distribution. James H. Steiger Conditional Distributions and the Bivariate Normal Distribution James H. Steiger Overview In this module, we have several goals: Introduce several technical terms Bivariate frequency distribution Marginal

More information

University of Groningen

University of Groningen University of Groningen Effect of compensatory viewing strategies on practical fitness to drive in subjects with visual field defects caused by ocular pathology Coeckelbergh, Tanja Richard Maria IMPORTANT

More information

Case Studies on High Throughput Gene Expression Data Kun Huang, PhD Raghu Machiraju, PhD

Case Studies on High Throughput Gene Expression Data Kun Huang, PhD Raghu Machiraju, PhD Case Studies on High Throughput Gene Expression Data Kun Huang, PhD Raghu Machiraju, PhD Department of Biomedical Informatics Department of Computer Science and Engineering The Ohio State University Review

More information

IMPACT OF TRACE MINERAL VARIATION WITHIN FORAGES ON THE RATION FORMULATION PROCESS. J. R. Knapp Fox Hollow Consulting, LLC Columbus, Ohio INTRODUCTION

IMPACT OF TRACE MINERAL VARIATION WITHIN FORAGES ON THE RATION FORMULATION PROCESS. J. R. Knapp Fox Hollow Consulting, LLC Columbus, Ohio INTRODUCTION IMPACT OF TRACE MINERAL VARIATION WITHIN FORAGES ON THE RATION FORMULATION PROCESS J. R. Knapp Fox Hollow Consulting, LLC Columbus, Ohio INTRODUCTION While trace mineral (TM) concentrations of forages

More information

WDHS Curriculum Map Probability and Statistics. What is Statistics and how does it relate to you?

WDHS Curriculum Map Probability and Statistics. What is Statistics and how does it relate to you? WDHS Curriculum Map Probability and Statistics Time Interval/ Unit 1: Introduction to Statistics 1.1-1.3 2 weeks S-IC-1: Understand statistics as a process for making inferences about population parameters

More information

Full title: A likelihood-based approach to early stopping in single arm phase II cancer clinical trials

Full title: A likelihood-based approach to early stopping in single arm phase II cancer clinical trials Full title: A likelihood-based approach to early stopping in single arm phase II cancer clinical trials Short title: Likelihood-based early stopping design in single arm phase II studies Elizabeth Garrett-Mayer,

More information

YSU Students. STATS 3743 Dr. Huang-Hwa Andy Chang Term Project 2 May 2002

YSU Students. STATS 3743 Dr. Huang-Hwa Andy Chang Term Project 2 May 2002 YSU Students STATS 3743 Dr. Huang-Hwa Andy Chang Term Project May 00 Anthony Koulianos, Chemical Engineer Kyle Unger, Chemical Engineer Vasilia Vamvakis, Chemical Engineer I. Executive Summary It is common

More information

New Procedures for Identifying High Rates of Pesticide Use

New Procedures for Identifying High Rates of Pesticide Use New Procedures for Identifying High Rates of Pesticide Use Larry Wilhoit Department of Pesticide Regulation February 14, 2011 1 Detecting Outliers in Rates of Use 2 Current Outlier Criteria 3 New Outlier

More information

A Comparison of Robust and Nonparametric Estimators Under the Simple Linear Regression Model

A Comparison of Robust and Nonparametric Estimators Under the Simple Linear Regression Model Nevitt & Tam A Comparison of Robust and Nonparametric Estimators Under the Simple Linear Regression Model Jonathan Nevitt, University of Maryland, College Park Hak P. Tam, National Taiwan Normal University

More information

IEEE Proof Web Version

IEEE Proof Web Version IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 4, NO. 4, DECEMBER 2009 1 Securing Rating Aggregation Systems Using Statistical Detectors and Trust Yafei Yang, Member, IEEE, Yan (Lindsay)

More information

Chapter 23. Inference About Means. Copyright 2010 Pearson Education, Inc.

Chapter 23. Inference About Means. Copyright 2010 Pearson Education, Inc. Chapter 23 Inference About Means Copyright 2010 Pearson Education, Inc. Getting Started Now that we know how to create confidence intervals and test hypotheses about proportions, it d be nice to be able

More information

Analysis of gene expression in blood before diagnosis of ovarian cancer

Analysis of gene expression in blood before diagnosis of ovarian cancer Analysis of gene expression in blood before diagnosis of ovarian cancer Different statistical methods Note no. Authors SAMBA/10/16 Marit Holden and Lars Holden Date March 2016 Norsk Regnesentral Norsk

More information

caspa Comparison and Analysis of Special Pupil Attainment

caspa Comparison and Analysis of Special Pupil Attainment caspa Comparison and Analysis of Special Pupil Attainment Analysis and bench-marking in CASPA This document describes of the analysis and bench-marking features in CASPA and an explanation of the analysis

More information

Diagnosis of multiple cancer types by shrunken centroids of gene expression

Diagnosis of multiple cancer types by shrunken centroids of gene expression Diagnosis of multiple cancer types by shrunken centroids of gene expression Robert Tibshirani, Trevor Hastie, Balasubramanian Narasimhan, and Gilbert Chu PNAS 99:10:6567-6572, 14 May 2002 Nearest Centroid

More information

List of Figures. List of Tables. Preface to the Second Edition. Preface to the First Edition

List of Figures. List of Tables. Preface to the Second Edition. Preface to the First Edition List of Figures List of Tables Preface to the Second Edition Preface to the First Edition xv xxv xxix xxxi 1 What Is R? 1 1.1 Introduction to R................................ 1 1.2 Downloading and Installing

More information

Kidane Tesfu Habtemariam, MASTAT, Principle of Stat Data Analysis Project work

Kidane Tesfu Habtemariam, MASTAT, Principle of Stat Data Analysis Project work 1 1. INTRODUCTION Food label tells the extent of calories contained in the food package. The number tells you the amount of energy in the food. People pay attention to calories because if you eat more

More information

About OMICS International

About OMICS International About OMICS International OMICS International through its Open Access Initiative is committed to make genuine and reliable contributions to the scientific community. OMICS International hosts over 700

More information

Name: (First) (Middle) ( Last)

Name: (First) (Middle) ( Last) Prince Sultan University STAT 101 Final Examination Summer Semester 2007/2008, Term 073 Monday, 20 th August 2008 Dr. Aiman Mukheimer Time Allowed: 120 minutes Name: (First) (Middle) ( Last) ID Number:

More information

Chapter 1: Introduction to Statistics

Chapter 1: Introduction to Statistics Chapter 1: Introduction to Statistics Variables A variable is a characteristic or condition that can change or take on different values. Most research begins with a general question about the relationship

More information

Performance of intraclass correlation coefficient (ICC) as a reliability index under various distributions in scale reliability studies

Performance of intraclass correlation coefficient (ICC) as a reliability index under various distributions in scale reliability studies Received: 6 September 2017 Revised: 23 January 2018 Accepted: 20 March 2018 DOI: 10.1002/sim.7679 RESEARCH ARTICLE Performance of intraclass correlation coefficient (ICC) as a reliability index under various

More information

Response to Mease and Wyner, Evidence Contrary to the Statistical View of Boosting, JMLR 9:1 26, 2008

Response to Mease and Wyner, Evidence Contrary to the Statistical View of Boosting, JMLR 9:1 26, 2008 Journal of Machine Learning Research 9 (2008) 59-64 Published 1/08 Response to Mease and Wyner, Evidence Contrary to the Statistical View of Boosting, JMLR 9:1 26, 2008 Jerome Friedman Trevor Hastie Robert

More information

Investigating the robustness of the nonparametric Levene test with more than two groups

Investigating the robustness of the nonparametric Levene test with more than two groups Psicológica (2014), 35, 361-383. Investigating the robustness of the nonparametric Levene test with more than two groups David W. Nordstokke * and S. Mitchell Colp University of Calgary, Canada Testing

More information

Empirical Knowledge: based on observations. Answer questions why, whom, how, and when.

Empirical Knowledge: based on observations. Answer questions why, whom, how, and when. INTRO TO RESEARCH METHODS: Empirical Knowledge: based on observations. Answer questions why, whom, how, and when. Experimental research: treatments are given for the purpose of research. Experimental group

More information

Gene Selection for Tumor Classification Using Microarray Gene Expression Data

Gene Selection for Tumor Classification Using Microarray Gene Expression Data Gene Selection for Tumor Classification Using Microarray Gene Expression Data K. Yendrapalli, R. Basnet, S. Mukkamala, A. H. Sung Department of Computer Science New Mexico Institute of Mining and Technology

More information

Summarizing Data. (Ch 1.1, 1.3, , 2.4.3, 2.5)

Summarizing Data. (Ch 1.1, 1.3, , 2.4.3, 2.5) 1 Summarizing Data (Ch 1.1, 1.3, 1.10-1.13, 2.4.3, 2.5) Populations and Samples An investigation of some characteristic of a population of interest. Example: You want to study the average GPA of juniors

More information

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo Please note the page numbers listed for the Lind book may vary by a page or two depending on which version of the textbook you have. Readings: Lind 1 11 (with emphasis on chapters 10, 11) Please note chapter

More information

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1 Nested Factor Analytic Model Comparison as a Means to Detect Aberrant Response Patterns John M. Clark III Pearson Author Note John M. Clark III,

More information

OCW Epidemiology and Biostatistics, 2010 David Tybor, MS, MPH and Kenneth Chui, PhD Tufts University School of Medicine October 27, 2010

OCW Epidemiology and Biostatistics, 2010 David Tybor, MS, MPH and Kenneth Chui, PhD Tufts University School of Medicine October 27, 2010 OCW Epidemiology and Biostatistics, 2010 David Tybor, MS, MPH and Kenneth Chui, PhD Tufts University School of Medicine October 27, 2010 SAMPLING AND CONFIDENCE INTERVALS Learning objectives for this session:

More information

Standard Scores. Richard S. Balkin, Ph.D., LPC-S, NCC

Standard Scores. Richard S. Balkin, Ph.D., LPC-S, NCC Standard Scores Richard S. Balkin, Ph.D., LPC-S, NCC 1 Normal Distributions While Best and Kahn (2003) indicated that the normal curve does not actually exist, measures of populations tend to demonstrate

More information

Title: A robustness study of parametric and non-parametric tests in Model-Based Multifactor Dimensionality Reduction for epistasis detection

Title: A robustness study of parametric and non-parametric tests in Model-Based Multifactor Dimensionality Reduction for epistasis detection Author's response to reviews Title: A robustness study of parametric and non-parametric tests in Model-Based Multifactor Dimensionality Reduction for epistasis detection Authors: Jestinah M Mahachie John

More information

Chapter 20: Test Administration and Interpretation

Chapter 20: Test Administration and Interpretation Chapter 20: Test Administration and Interpretation Thought Questions Why should a needs analysis consider both the individual and the demands of the sport? Should test scores be shared with a team, or

More information

Gene-microRNA network module analysis for ovarian cancer

Gene-microRNA network module analysis for ovarian cancer Gene-microRNA network module analysis for ovarian cancer Shuqin Zhang School of Mathematical Sciences Fudan University Oct. 4, 2016 Outline Introduction Materials and Methods Results Conclusions Introduction

More information

Comparison of discrimination methods for the classification of tumors using gene expression data

Comparison of discrimination methods for the classification of tumors using gene expression data Comparison of discrimination methods for the classification of tumors using gene expression data Sandrine Dudoit, Jane Fridlyand 2 and Terry Speed 2,. Mathematical Sciences Research Institute, Berkeley

More information

Facial Expression Biometrics Using Tracker Displacement Features

Facial Expression Biometrics Using Tracker Displacement Features Facial Expression Biometrics Using Tracker Displacement Features Sergey Tulyakov 1, Thomas Slowe 2,ZhiZhang 1, and Venu Govindaraju 1 1 Center for Unified Biometrics and Sensors University at Buffalo,

More information

Normal Distribution. Many variables are nearly normal, but none are exactly normal Not perfect, but still useful for a variety of problems.

Normal Distribution. Many variables are nearly normal, but none are exactly normal Not perfect, but still useful for a variety of problems. Review Probability: likelihood of an event Each possible outcome can be assigned a probability If we plotted the probabilities they would follow some type a distribution Modeling the distribution is important

More information

NORTH SOUTH UNIVERSITY TUTORIAL 1

NORTH SOUTH UNIVERSITY TUTORIAL 1 NORTH SOUTH UNIVERSITY TUTORIAL 1 REVIEW FROM BIOSTATISTICS I AHMED HOSSAIN,PhD Data Management and Analysis AHMED HOSSAIN,PhD - Data Management and Analysis 1 DATA TYPES/ MEASUREMENT SCALES Categorical:

More information

Population. Sample. AP Statistics Notes for Chapter 1 Section 1.0 Making Sense of Data. Statistics: Data Analysis:

Population. Sample. AP Statistics Notes for Chapter 1 Section 1.0 Making Sense of Data. Statistics: Data Analysis: Section 1.0 Making Sense of Data Statistics: Data Analysis: Individuals objects described by a set of data Variable any characteristic of an individual Categorical Variable places an individual into one

More information

Applied Statistical Analysis EDUC 6050 Week 4

Applied Statistical Analysis EDUC 6050 Week 4 Applied Statistical Analysis EDUC 6050 Week 4 Finding clarity using data Today 1. Hypothesis Testing with Z Scores (continued) 2. Chapters 6 and 7 in Book 2 Review! = $ & '! = $ & ' * ) 1. Which formula

More information

The Impact of Continuity Violation on ANOVA and Alternative Methods

The Impact of Continuity Violation on ANOVA and Alternative Methods Journal of Modern Applied Statistical Methods Volume 12 Issue 2 Article 6 11-1-2013 The Impact of Continuity Violation on ANOVA and Alternative Methods Björn Lantz Chalmers University of Technology, Gothenburg,

More information

Gene expression analysis. Roadmap. Microarray technology: how it work Applications: what can we do with it Preprocessing: Classification Clustering

Gene expression analysis. Roadmap. Microarray technology: how it work Applications: what can we do with it Preprocessing: Classification Clustering Gene expression analysis Roadmap Microarray technology: how it work Applications: what can we do with it Preprocessing: Image processing Data normalization Classification Clustering Biclustering 1 Gene

More information

Human Cancer Genome Project. Bioinformatics/Genomics of Cancer:

Human Cancer Genome Project. Bioinformatics/Genomics of Cancer: Bioinformatics/Genomics of Cancer: Professor of Computer Science, Mathematics and Cell Biology Courant Institute, NYU School of Medicine, Tata Institute of Fundamental Research, and Mt. Sinai School of

More information

Students were asked to report how far (in miles) they each live from school. The following distances were recorded. 1 Zane Jackson 0.

Students were asked to report how far (in miles) they each live from school. The following distances were recorded. 1 Zane Jackson 0. Identifying Outliers Task Students were asked to report how far (in miles) they each live from school. The following distances were recorded. Student Distance 1 Zane 0.4 2 Jackson 0.5 3 Benjamin 1.0 4

More information

Comparison of Gene Set Analysis with Various Score Transformations to Test the Significance of Sets of Genes

Comparison of Gene Set Analysis with Various Score Transformations to Test the Significance of Sets of Genes Comparison of Gene Set Analysis with Various Score Transformations to Test the Significance of Sets of Genes Ivan Arreola and Dr. David Han Department of Management of Science and Statistics, University

More information

Pitfalls in Linear Regression Analysis

Pitfalls in Linear Regression Analysis Pitfalls in Linear Regression Analysis Due to the widespread availability of spreadsheet and statistical software for disposal, many of us do not really have a good understanding of how to use regression

More information

Outline of Part III. SISCR 2016, Module 7, Part III. SISCR Module 7 Part III: Comparing Two Risk Models

Outline of Part III. SISCR 2016, Module 7, Part III. SISCR Module 7 Part III: Comparing Two Risk Models SISCR Module 7 Part III: Comparing Two Risk Models Kathleen Kerr, Ph.D. Associate Professor Department of Biostatistics University of Washington Outline of Part III 1. How to compare two risk models 2.

More information

Unit 1 Exploring and Understanding Data

Unit 1 Exploring and Understanding Data Unit 1 Exploring and Understanding Data Area Principle Bar Chart Boxplot Conditional Distribution Dotplot Empirical Rule Five Number Summary Frequency Distribution Frequency Polygon Histogram Interquartile

More information

Empirical Formula for Creating Error Bars for the Method of Paired Comparison

Empirical Formula for Creating Error Bars for the Method of Paired Comparison Empirical Formula for Creating Error Bars for the Method of Paired Comparison Ethan D. Montag Rochester Institute of Technology Munsell Color Science Laboratory Chester F. Carlson Center for Imaging Science

More information

Basic Statistics 01. Describing Data. Special Program: Pre-training 1

Basic Statistics 01. Describing Data. Special Program: Pre-training 1 Basic Statistics 01 Describing Data Special Program: Pre-training 1 Describing Data 1. Numerical Measures Measures of Location Measures of Dispersion Correlation Analysis 2. Frequency Distributions (Relative)

More information

Comments on Significance of candidate cancer genes as assessed by the CaMP score by Parmigiani et al.

Comments on Significance of candidate cancer genes as assessed by the CaMP score by Parmigiani et al. Comments on Significance of candidate cancer genes as assessed by the CaMP score by Parmigiani et al. Holger Höfling Gad Getz Robert Tibshirani June 26, 2007 1 Introduction Identifying genes that are involved

More information

Name: Biostatistics 1 st year Comprehensive Examination: Applied Take Home exam. Due May 29 th, 2015 by 5pm. Late exams will not be accepted.

Name: Biostatistics 1 st year Comprehensive Examination: Applied Take Home exam. Due May 29 th, 2015 by 5pm. Late exams will not be accepted. Name: Biostatistics 1 st year Comprehensive Examination: Applied Take Home exam Due May 29 th, 2015 by 5pm. Late exams will not be accepted. Instructions: 1. There are 2 questions and 4 pages. Answer each

More information

Understandable Statistics

Understandable Statistics Understandable Statistics correlated to the Advanced Placement Program Course Description for Statistics Prepared for Alabama CC2 6/2003 2003 Understandable Statistics 2003 correlated to the Advanced Placement

More information

International Statistical Literacy Competition of the ISLP Training package 3

International Statistical Literacy Competition of the ISLP   Training package 3 International Statistical Literacy Competition of the ISLP http://www.stat.auckland.ac.nz/~iase/islp/competition Training package 3 1.- Drinking Soda and bone Health http://figurethis.org/ 1 2 2.- Comparing

More information

Quantitative Methods in Computing Education Research (A brief overview tips and techniques)

Quantitative Methods in Computing Education Research (A brief overview tips and techniques) Quantitative Methods in Computing Education Research (A brief overview tips and techniques) Dr Judy Sheard Senior Lecturer Co-Director, Computing Education Research Group Monash University judy.sheard@monash.edu

More information

Stats 95. Statistical analysis without compelling presentation is annoying at best and catastrophic at worst. From raw numbers to meaningful pictures

Stats 95. Statistical analysis without compelling presentation is annoying at best and catastrophic at worst. From raw numbers to meaningful pictures Stats 95 Statistical analysis without compelling presentation is annoying at best and catastrophic at worst. From raw numbers to meaningful pictures Stats 95 Why Stats? 200 countries over 200 years http://www.youtube.com/watch?v=jbksrlysojo

More information

SUPPLEMENTAL MATERIAL

SUPPLEMENTAL MATERIAL 1 SUPPLEMENTAL MATERIAL Response time and signal detection time distributions SM Fig. 1. Correct response time (thick solid green curve) and error response time densities (dashed red curve), averaged across

More information

Study of cigarette sales in the United States Ge Cheng1, a,

Study of cigarette sales in the United States Ge Cheng1, a, 2nd International Conference on Economics, Management Engineering and Education Technology (ICEMEET 2016) 1Department Study of cigarette sales in the United States Ge Cheng1, a, of pure mathematics and

More information

Quantitative Data and Measurement. POLI 205 Doing Research in Politics. Fall 2015

Quantitative Data and Measurement. POLI 205 Doing Research in Politics. Fall 2015 Quantitative Fall 2015 Theory and We need to test our theories with empirical data Inference : Systematic observation and representation of concepts Quantitative: measures are numeric Qualitative: measures

More information

MEA DISCUSSION PAPERS

MEA DISCUSSION PAPERS Inference Problems under a Special Form of Heteroskedasticity Helmut Farbmacher, Heinrich Kögel 03-2015 MEA DISCUSSION PAPERS mea Amalienstr. 33_D-80799 Munich_Phone+49 89 38602-355_Fax +49 89 38602-390_www.mea.mpisoc.mpg.de

More information

Nature Methods: doi: /nmeth.3115

Nature Methods: doi: /nmeth.3115 Supplementary Figure 1 Analysis of DNA methylation in a cancer cohort based on Infinium 450K data. RnBeads was used to rediscover a clinically distinct subgroup of glioblastoma patients characterized by

More information

Application of Resampling Methods in Microarray Data Analysis

Application of Resampling Methods in Microarray Data Analysis Application of Resampling Methods in Microarray Data Analysis Tests for two independent samples Oliver Hartmann, Helmut Schäfer Institut für Medizinische Biometrie und Epidemiologie Philipps-Universität

More information

1) What is the independent variable? What is our Dependent Variable?

1) What is the independent variable? What is our Dependent Variable? 1) What is the independent variable? What is our Dependent Variable? Independent Variable: Whether the font color and word name are the same or different. (Congruency) Dependent Variable: The amount of

More information

V. Gathering and Exploring Data

V. Gathering and Exploring Data V. Gathering and Exploring Data With the language of probability in our vocabulary, we re now ready to talk about sampling and analyzing data. Data Analysis We can divide statistical methods into roughly

More information

Classification of cancer profiles. ABDBM Ron Shamir

Classification of cancer profiles. ABDBM Ron Shamir Classification of cancer profiles 1 Background: Cancer Classification Cancer classification is central to cancer treatment; Traditional cancer classification methods: location; morphology, cytogenesis;

More information

SISCR Module 4 Part III: Comparing Two Risk Models. Kathleen Kerr, Ph.D. Associate Professor Department of Biostatistics University of Washington

SISCR Module 4 Part III: Comparing Two Risk Models. Kathleen Kerr, Ph.D. Associate Professor Department of Biostatistics University of Washington SISCR Module 4 Part III: Comparing Two Risk Models Kathleen Kerr, Ph.D. Associate Professor Department of Biostatistics University of Washington Outline of Part III 1. How to compare two risk models 2.

More information

The Lens Model and Linear Models of Judgment

The Lens Model and Linear Models of Judgment John Miyamoto Email: jmiyamot@uw.edu October 3, 2017 File = D:\P466\hnd02-1.p466.a17.docm 1 http://faculty.washington.edu/jmiyamot/p466/p466-set.htm Psych 466: Judgment and Decision Making Autumn 2017

More information

A Critique of Two Methods for Assessing the Nutrient Adequacy of Diets

A Critique of Two Methods for Assessing the Nutrient Adequacy of Diets CARD Working Papers CARD Reports and Working Papers 6-1991 A Critique of Two Methods for Assessing the Nutrient Adequacy of Diets Helen H. Jensen Iowa State University, hhjensen@iastate.edu Sarah M. Nusser

More information

Module 2: Target Discovery and Validation

Module 2: Target Discovery and Validation Module 2: Target Discovery and Validation Table of Contents 1.1 Getting Started... 6 1.2 Nomination, validation, and associations of AGTR1 as an outlier in a subset of breast cancers... 7 1.2.1 AGTR1 discovery

More information

Determining the Vulnerabilities of the Power Transmission System

Determining the Vulnerabilities of the Power Transmission System Determining the Vulnerabilities of the Power Transmission System B. A. Carreras D. E. Newman I. Dobson Depart. Fisica Universidad Carlos III Madrid, Spain Physics Dept. University of Alaska Fairbanks AK

More information