Monte Carlo Analysis of Univariate Statistical Outlier Techniques
Mark W. Lukens

This paper examines three techniques for univariate outlier identification: the Extreme Studentized Deviate (ESD), the Hampel identifier and the Rousseeuw identifier. The purpose is to determine how these outlier identification techniques perform under varying conditions. An experimental design along with two different Monte Carlo simulations provides insight into the problem. Under certain assumptions it is shown that the ESD identifier performs well with very small data contamination, while the robust Hampel and Rousseeuw identifiers perform better with large samples and with multiple outliers.

1. Introduction

In the article "The Identification of Multiple Outliers" (JASA, September 1993), Laurie Davies and Ursula Gather examine the outlier identification problem. Their approach was to develop an outlier-generating model that allows a small number of observations in a random sample to come from a different distribution than the target distribution. This contamination is then considered as outliers. The target distribution is assumed to be normal throughout the article. The outlier region for (µ, σ) is defined by:

out(α, µ, σ) = { x : |x - µ| > z_{1-α/2} · σ }

where z_{1-α/2} denotes the 1 - α/2 quantile of the standard normal distribution. A number x is called an outlier with respect to (µ, σ) if x lies in out(α, µ, σ). The authors then formulate the outlier identification problem and describe three robust outlier identification techniques using this methodology. The techniques are: the Extreme Studentized Deviate with Extreme Deviate Removal (ESD-EDR), the Hampel identifier and the Rousseeuw identifier. A Monte Carlo simulation is used to determine which of these techniques performs best. This paper attempts to confirm the results of that Monte Carlo experiment, but not to replicate it, using a slightly different experimental design and approach.
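As a small numerical illustration of the outlier region defined above, the boundaries of out(α, µ, σ) for the standard normal can be computed directly. The Python sketch below is purely illustrative (the paper itself works in S-PLUS), and the α value used is arbitrary, chosen only for the example:

```python
# Illustration of out(alpha, mu, sigma) = { x : |x - mu| > z_{1-alpha/2} * sigma }
# for the standard normal target. The alpha value is arbitrary, used only for this example.
from statistics import NormalDist

mu, sigma, alpha = 0.0, 1.0, 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)          # z_{1-alpha/2}; about 1.96 here
lower, upper = mu - z * sigma, mu + z * sigma

def is_outlier(x):
    """True if x falls in the outlier region for the known (mu, sigma)."""
    return x < lower or x > upper

print(f"outlier region: (-inf, {lower:.3f}) U ({upper:.3f}, +inf)")
print(is_outlier(2.5), is_outlier(1.0))          # True, False
```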

2. Outlier Identification

Given the above definition of the outlier region, an example graph of this region for the standard normal is shown below.

[Figure One: the standard normal density with the outlier region shaded in both tails.]

An outlier is any data point x that falls into the outlier region. Now consider a random sample of size N: X_N = (X_1, ..., X_N). Outliers are any data points which lie in the outlier region defined as:

OR(X_N, α_N) = { x : |x - µ̂| > σ̂ · g(N, α_N) }

where µ̂ is a statistical measure of location and σ̂ is a statistical measure of scale or spread of the random sample X_N. The function g(N, α_N) conveys the distributional information of the statistics under consideration for a given level α_N. The Davies and Gather article calls methods that have this type of outlier region single-step identifiers. Three single-step outlier identifiers were examined in that article: the Extreme Studentized Deviate (ESD), the Hampel identifier and the Rousseeuw identifier. The latter two are robust, with a breakdown point of 1/2. The ESD identifier is not robust and has a breakdown point of 1/(N + 1). For that reason the authors expanded this technique into a multiple-step identifier; the technique they tested is the ESD-EDR (Extreme Deviate Removal), which is robust with a breakdown point of 1/2. This paper will only consider the single-step outlier identifiers, so that a comparison between robust (Hampel and Rousseeuw) and non-robust (ESD) methods can be made.

3. Single Step Outlier Identifiers

Lower and upper bounds can be derived for single-step identifiers from the outlier region equation above. Another way to state the region is as follows:

OR(X_N, α_N) = (-∞, L(X_N, α_N)) ∪ (R(X_N, α_N), +∞)

where L(X_N, α_N) is the lower bound and R(X_N, α_N) is the upper bound of the outlier region. These bounds are useful, especially in a Monte Carlo setting, for testing whether a data point lies in the outlier region.

The Extreme Studentized Deviate (ESD) identifier uses the sample mean and sample standard deviation to form the outlier region. The outlier region is defined as:

OR(X_N, α_N) = { x : |x - X̄| > S · g(N, α_N) }

where X̄ is the sample mean and S is the sample standard deviation. The lower and upper bounds of the outlier region are:

L(X_N, α_N) = X̄ - S · g(N, α_N)
R(X_N, α_N) = X̄ + S · g(N, α_N)

The Hampel identifier uses the sample median and the sample median absolute deviation to form the outlier region. The outlier region is defined as:

OR(X_N, α_N) = { x : |x - med(X_N)| > mad(X_N) · g(N, α_N) }

where med(X_N) is the sample median and mad(X_N) is the sample median absolute deviation. The lower and upper bounds of the outlier region are:

L(X_N, α_N) = med(X_N) - mad(X_N) · g(N, α_N)
R(X_N, α_N) = med(X_N) + mad(X_N) · g(N, α_N)

The Rousseeuw identifier uses the sample midpoint and the length of the shortest half sample of size [N/2] + 1 to form the outlier region. The outlier region is defined using the following equation.

OR(X_N, α_N) = { x : |x - mid(X_N)| > length(shortest half sample of X_N) · g(N, α_N) }

where mid(X_N) is the sample midpoint. The lower and upper bounds for the Rousseeuw outlier region are derived in the same way as for the other two methods.

The authors provide values of g(N, α_N) for each of the three identifiers. They derived these values through simulation, and the values are assumed correct here. The g(N, α_N) values provided are for N = 20, N = 50 and N = 100; these are the sample sizes used in the experimental design.

4. Experimental Design

Two experimental deviations from the article "The Identification of Multiple Outliers" have already been stated: the ESD-EDR technique is replaced by the ESD technique (to compare robust versus non-robust methods), and the sample size has only three levels rather than four (N = 10 is excluded since no g(N, α_N) values are provided for it). One more deviation concerns the way the outliers are generated. The authors generated their contamination by placing the outliers so that they were not identified as outliers but had values as large as possible; this value varies depending on the random sample. The approach taken here is to fix the outlier location by creating another factor within the experimental design. This factor has four levels, placing the outlier contamination at 3, 4, 5 and 6 for a standard normal. This provides a reasonable range over which to examine how the outlier distance from the mean affects the outlier techniques.

For each sample size, random data are generated with k outliers: N - k observations from the standard normal distribution and k observations placed at 3, 4, 5 or 6, depending on the level of the experiment.

[Figure Two: example histogram of a generated data set with the outliers placed at x = 6.]

For each generated data set of this type (standard normal with contamination), each of the three outlier identification techniques is applied using the statistical software package S-PLUS.

Treatments: ESD identifier, Hampel identifier and Rousseeuw identifier.

The experimental design contains the following factors and levels:

Factor: Sample Size (N). Levels: 20, 50 and 100.
Factor: Number of Contaminants (k). Levels: 1, 2, 3, 5, 7, 10, 15, 20, 25, 30 and 49, not to exceed [N/2].
Factor: Outlier Location. Levels: 3, 4, 5 and 6.
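To make the bound formulas and the data-generation scheme concrete, here is an illustrative Python sketch. It is not the author's S-PLUS code from Appendix A; the g(N, α_N) constants used below are hypothetical placeholders rather than the tabulated Davies and Gather values, and the Rousseeuw midpoint is computed as the midpoint of the shortest half sample, which is one reading of the "sample midpoint" described above.

```python
# Illustrative sketch only, not the author's S-PLUS code from Appendix A.
# Generates one contaminated sample and computes the upper bound R(X_N, alpha_N)
# of each identifier. The g values used below are hypothetical placeholders.
import numpy as np

def contaminated_sample(n, k, loc, rng):
    """n - k standard normal observations plus k outliers fixed at `loc`."""
    return np.concatenate([rng.standard_normal(n - k), np.full(k, float(loc))])

def esd_upper(x, g):
    return x.mean() + x.std(ddof=1) * g            # sample mean + g * sample SD

def hampel_upper(x, g):
    med = np.median(x)
    mad = np.median(np.abs(x - med))               # median absolute deviation
    return med + mad * g

def rousseeuw_upper(x, g):
    xs = np.sort(x)
    h = len(xs) // 2 + 1                           # half sample size [N/2] + 1
    lengths = xs[h - 1:] - xs[:len(xs) - h + 1]    # lengths of all contiguous half samples
    i = int(np.argmin(lengths))                    # shortest half sample
    mid = (xs[i] + xs[i + h - 1]) / 2              # its midpoint
    return mid + lengths[i] * g

rng = np.random.default_rng(1)
x = contaminated_sample(n=20, k=2, loc=6, rng=rng)
g_esd, g_hamp, g_rou = 2.9, 5.0, 2.5               # hypothetical g(N, alpha_N) values
print(esd_upper(x, g_esd), hampel_upper(x, g_hamp), rousseeuw_upper(x, g_rou))
```

Within one Monte Carlo run, any observation above R (or below the corresponding lower bound L) is flagged as an outlier by that identifier.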

The following data table displays how the design looks and how the experiment was recorded:

Outlier = ___,  N = ___
 k  | ESD | ROU | HAMP
 1  |     |     |
 2  |     |     |
 3  |     |     |
 5  |     |     |
 7  |     |     |
 10 |     |     |
 15 |     |     |
 20 |     |     |
 25 |     |     |
 30 |     |     |
 49 |     |     |

Table One

To keep the performance metric similar to the original experiment, I used the upper bound R(X_N, α_N) for each of the three treatments. This represents the largest possible non-identified outlier; the smaller this upper bound, the better the outlier method.

5. Experimental Results

S-PLUS was used to code each of the outlier identification techniques and to analyze performance. Each treatment has its own separate code (Appendix A). Changes in the design points are made by changing the initialization of the variables at the beginning of each program and then rerunning the code. Each design point had 2000 runs. The output of the code was the mean of the 2000 calculated upper bounds R(X_N, α_N). The variance of this mean was quite small for all but the Rousseeuw identifier. The resulting upper bounds are shown in Appendix B. One can easily compare the mean upper bound to the actual outlier value to see whether it was captured by the method. The values highlighted there are considered the best based on this metric.
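The per-design-point computation just described (2000 simulated samples, averaging each identifier's upper bound) could be sketched as follows. It reuses the helper functions from the earlier sketch; the module name `outlier_sketch` is hypothetical, standing in for wherever those helpers are defined, and the g values remain placeholders.

```python
# Sketch of one design point of the first experiment: average the upper bounds over
# 2000 simulated contaminated samples. `outlier_sketch` is a hypothetical module name
# standing in for the helper functions defined in the earlier sketch.
import numpy as np
from outlier_sketch import contaminated_sample, esd_upper, hampel_upper, rousseeuw_upper

def design_point(n, k, loc, g_esd, g_hamp, g_rou, runs=2000, seed=0):
    rng = np.random.default_rng(seed)
    bounds = np.empty((runs, 3))
    for r in range(runs):
        x = contaminated_sample(n, k, loc, rng)
        bounds[r] = (esd_upper(x, g_esd),
                     hampel_upper(x, g_hamp),
                     rousseeuw_upper(x, g_rou))
    return bounds.mean(axis=0)        # mean upper bound for ESD, Hampel, Rousseeuw

# Example design point: N = 20, k = 2 contaminants placed at 6 (placeholder g values).
print(design_point(20, 2, 6, g_esd=2.9, g_hamp=5.0, g_rou=2.5))
```

Comparing each mean upper bound to the planted outlier value (here 6) shows whether the corresponding identifier would, on average, capture the contamination at that design point.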

Using the upper bound as the measure of performance gave the following results:

1. Performed the best for the small sample size (N = 20) with close-in outliers (3 and 4).
2. Did the best for all design points with very small contamination (k = 1).
3. Generally did better when the sample size was large (N = 100) and the contamination was greater than 5%.
4. Performed the best at only 3 design points, and only when the outlier was far away from the mean (x = 6).

Overall, it appears that the robust methods do not perform well when the sample size is small and the outliers are close in. Correspondence with one of the authors, Ursula Gather, implied that this is why they used the largest non-identified outlier as a metric: close-in outliers are not easily picked up by the robust techniques. The ESD method generally did not perform well in larger samples with contamination greater than 5%. These conclusions coincide with the results of the simulation in the article by Davies and Gather. Their findings state that the ESD-EDR method is superior when there is a very small number of outliers. They also state that the Hampel and Rousseeuw identifiers work well in worst-case situations (a large number of outliers).

Using the metric from the original article has its faults, however. Having the lowest upper bound does not imply that the outlier technique actually identified the outliers in the random sample. After a few minor revisions of the code in Appendix A, a different experiment was derived to determine the percentage of the time each technique correctly identified outliers in the sample. The same experimental design was used while changing the performance metric to the percentage of outliers detected (out of the 2000 runs). This yielded different results (Appendix C). The three outlier identification techniques correctly identified outliers at only 103 design points, often in very low percentages. When the outlier was closest to the mean, the robust techniques failed almost completely, and the ESD technique detected outliers only a small percentage of the time when k = 1. Here is a summary of the results.

ESD:
1. Worked best when the contamination is less than 5%.
2. Performance was superior when the contamination was small and the outliers were closer to the mean.

Rousseeuw:
1. Performed the best at only 3 of the design points.
2. Appears to do better than the Hampel for smaller contamination.
3. Did not perform well for the small sample size (N = 20).

Hampel:
1. Generally did the best for a large number of outliers (greater than 5%).
2. Better performance as the outlier distance from the mean increases.

The second experiment, using outliers detected as the metric, might seem more reasonable than the metric from the first experiment. Based on these results it appears that the non-robust ESD is superior for contamination of less than 5%, and the Hampel identifier is generally better for data sets with larger contamination.

6. Departures from Normality

The main underlying assumption so far has been that the target distribution is standard normal. This section of the paper examines what happens to the above outlier identification techniques when the target distribution is not normal. Two departures from normality are examined. The first examines what happens when the target distribution is uniform (no tails). The second uses a double exponential as the target distribution (heavy tails). The original paper by Davies and Gather only provided g(N, α_N) values for normal data. To find these values the authors used simulations to derive the distribution of

(x_i - µ̂) / σ̂

where x_i is from the target distribution, µ̂ is the estimate of location and σ̂ is the estimate of scale.
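A hedged sketch of this calibration step: simulate many clean samples from the target distribution, compute the standardized statistic for each, and read g(N, α_N) off the simulated distribution. Treating g as a high quantile of max_i |x_i - µ̂| / σ̂, together with the 0.95 level and the repetition count used here, is an assumption made for illustration rather than the exact recipe of the paper or of Davies and Gather.

```python
# Hedged sketch of calibrating g(N, alpha_N) by simulation. The statistic
# max_i |x_i - mu_hat| / sigma_hat, the 0.95 quantile and the repetition count are
# illustrative assumptions, not the paper's exact recipe.
import numpy as np

def simulate_g(sampler, n, location, scale, quantile=0.95, reps=10000, seed=0):
    rng = np.random.default_rng(seed)
    stats = np.empty(reps)
    for r in range(reps):
        x = sampler(n, rng)                              # clean sample, no contamination
        stats[r] = np.max(np.abs(x - location(x))) / scale(x)
    return np.quantile(stats, quantile)

normal_target = lambda n, rng: rng.standard_normal(n)
mad = lambda x: np.median(np.abs(x - np.median(x)))      # median absolute deviation

# ESD-style g uses mean/SD; Hampel-style g uses median/MAD (Rousseeuw is analogous).
print(simulate_g(normal_target, 20, np.mean, lambda x: np.std(x, ddof=1)))
print(simulate_g(normal_target, 20, np.median, mad))
```

Swapping `normal_target` for a uniform or double exponential sampler would give the corresponding values for the non-normal targets considered next.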

Changing the target distribution to uniform or to double exponential changes the values of g(N, α_N), so new simulations are needed. A uniform distribution with a minimum of negative three and a maximum of three was chosen to keep similar outlier values (three to six). A double exponential with a parameter of one was chosen to keep a symmetric distribution centered at zero. Using S-PLUS code similar to that found in Appendix D, the necessary values were derived. They are shown in the table below.

g(N, α_N) values, Uniform(-3, 3):
ESD:   N = 20: 1.73    N = 50: 1.73    N = 100: 1.74
HAMP:  N = 20: 2.01    N = 50: 2.01    N = 100: 2.02
ROU:   N = 20: 1.04    N = 50: 1.04    N = 100: 1.05

g(N, α_N) values, Double Exponential:
ESD:   N = 20: 3.71    N = 50: 4.35    N = 100: 4.75
HAMP:  N = 20: 7.54    N = 50: 8.65    N = 100: 9.9
ROU:   N = 20: 3.64    N = 50: 3.81    N = 100: 4.0

Table Two

With these values calculated, new experimental result tables were derived through Monte Carlo simulation for both the uniform and the double exponential distributions.

6.2 Departures from Normality: Experimental Results

Overall, the three outlier identification techniques appeared to work better when the normal was replaced by a uniform (Appendix E). This is likely due to the absence of tails, which otherwise can create masking. In the largest data set, when the outlier was far enough out, all three techniques captured up to 15% contamination; this was the best performance by far. In general these outlier techniques seem to improve when the tails of the distribution are removed or reduced. Below are the summarized results for the Monte Carlo experiment using a uniform target distribution.

ESD:
1. Worked better with larger contamination than under the normal (up to 15%).
2. No longer the best method for close-in contamination.

Rousseeuw:
1. This method performed much better; clearly the best for close-in contamination under a uniform target distribution.
2. Worked well for very large contamination (up to 49%).
3. The high percentage values at the right boundary of the distribution, however, might indicate a lack of efficiency. There may be potential to tag valid data as outliers.

Hampel:
1. Generally outperformed by the other outlier identification techniques when using uniform data.
2. Better performance as the outlier distance increases with small data sets.

Next, let us examine the replacement of the normal distribution with a double exponential (Appendix F). It is clear from the experimental results that the non-robust ESD usually fails under these conditions. The heavy tails of the double exponential create masking when using the sample mean and sample standard deviation. The Hampel outlier identification technique clearly works the best when the outlier distance is large. Below are the summarized results for the Monte Carlo experiment using a double exponential target distribution.

ESD:
1. Was never the best.
2. Did achieve 93% identification with a large sample and one outlier at a large distance.

Rousseeuw:
1. The best for close-in contamination with a large sample size.
2. Generally outperformed by the Hampel technique.

Hampel:
1. Clearly the best for the double exponential (heavy tails).
2. Outperformed the other two techniques when the outlier distance was medium to large.

7. Conclusions

The results from the above Monte Carlo experiments reinforce and extend the results of the Davies and Gather article in JASA. No one statistical outlier technique dominated in every situation. The Extreme Studentized Deviate (ESD) technique performed well under the assumptions of normality and small contamination. The Rousseeuw outlier identification method performed the best when the underlying distribution had no tails. The Hampel outlier identification technique performed the best when the target probability distribution had heavy tails, or under normality when large contamination was present. Development of a multi-step technique incorporating all three methods might be considered.

However, the experiments in this paper used the percentage of outliers detected as the measure of performance; they did not consider efficiency. It is possible that an outlier identification technique not only identifies outliers but also captures a large part of the data from the target distribution. This seemed to be the case with the Rousseeuw identifier under uniform assumptions. This behavior might be similar to that of the Minimum Volume Ellipsoid (MVE) estimator for multivariate data; this multivariate extension of the Rousseeuw identifier is known for its lack of efficiency (Wilcox, p. 204). Measuring the efficiency of these three outlier methods is the logical next step toward the development of a multi-step univariate outlier identifier.

8. References

Barnett, V. and Lewis, T. (1994), Outliers in Statistical Data (3rd ed.), New York: Wiley.

Davies, Laurie and Gather, Ursula (1993), "The Identification of Multiple Outliers," Journal of the American Statistical Association, 88, 782-792.

Wilcox, Rand R. (1997), Introduction to Robust Estimation and Hypothesis Testing, San Diego: Academic Press.