Outlier Analysis. Lijun Zhang

Similar documents
Knowledge Discovery and Data Mining I

Motivation: Fraud Detection

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014

INTRODUCTION TO MACHINE LEARNING. Decision tree learning

Statistics 202: Data Mining. c Jonathan Taylor. Final review Based in part on slides from textbook, slides of Susan Holmes.

Formulating Emotion Perception as a Probabilistic Model with Application to Categorical Emotion Classification

Data Mining. Outlier detection. Hamid Beigy. Sharif University of Technology. Fall 1395

OUTLIER DETECTION : A REVIEW

Reveal Relationships in Categorical Data

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo

Statistics is the science of collecting, organizing, presenting, analyzing, and interpreting data to assist in making effective decisions

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Midterm, 2016

Still important ideas

Classical Psychophysical Methods (cont.)

Statistics is the science of collecting, organizing, presenting, analyzing, and interpreting data to assist in making effective decisions

Comparative Study of K-means, Gaussian Mixture Model, Fuzzy C-means algorithms for Brain Tumor Segmentation

Business Statistics Probability

Identification of Tissue Independent Cancer Driver Genes

Nature Neuroscience: doi: /nn Supplementary Figure 1. Behavioral training.

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo

Intelligent Edge Detector Based on Multiple Edge Maps. M. Qasim, W.L. Woon, Z. Aung. Technical Report DNA # May 2012

Class Outlier Detection. Zuzana Pekarčíková

Derivative-Free Optimization for Hyper-Parameter Tuning in Machine Learning Problems

Still important ideas

Quantitative Evaluation of Edge Detectors Using the Minimum Kernel Variance Criterion

V. Gathering and Exploring Data

Readings: Textbook readings: OpenStax - Chapters 1 13 (emphasis on Chapter 12) Online readings: Appendix D, E & F

SURVEY ON OUTLIER DETECTION TECHNIQUES USING CATEGORICAL DATA

OUTLIER SUBJECTS PROTOCOL (art_groupoutlier)

Lecture Outline. Biost 517 Applied Biostatistics I. Purpose of Descriptive Statistics. Purpose of Descriptive Statistics

Chapter 1. Introduction

Applications. DSC 410/510 Multivariate Statistical Methods. Discriminating Two Groups. What is Discriminant Analysis

White Paper Estimating Complex Phenotype Prevalence Using Predictive Models

Modelling Spatially Correlated Survival Data for Individuals with Multiple Cancers

Stats 95. Statistical analysis without compelling presentation is annoying at best and catastrophic at worst. From raw numbers to meaningful pictures

Comparison of discrimination methods for the classification of tumors using gene expression data

Mammogram Analysis: Tumor Classification

EECS 433 Statistical Pattern Recognition

Detection of Glaucoma and Diabetic Retinopathy from Fundus Images by Bloodvessel Segmentation

Gene-microRNA network module analysis for ovarian cancer

A Learning Method of Directly Optimizing Classifier Performance at Local Operating Range

Quantitative Methods in Computing Education Research (A brief overview tips and techniques)

Comparison of volume estimation methods for pancreatic islet cells

Readings: Textbook readings: OpenStax - Chapters 1 11 Online readings: Appendix D, E & F Plous Chapters 10, 11, 12 and 14

Standard Scores. Richard S. Balkin, Ph.D., LPC-S, NCC

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo

Unit 1 Exploring and Understanding Data

Machine Learning! Robert Stengel! Robotics and Intelligent Systems MAE 345,! Princeton University, 2017

12.1 Inference for Linear Regression. Introduction

NeuroMem. RBF Decision Space Mapping

Monte Carlo Analysis of Univariate Statistical Outlier Techniques Mark W. Lukens

Medical Statistics 1. Basic Concepts Farhad Pishgar. Defining the data. Alive after 6 months?

Review: Logistic regression, Gaussian naïve Bayes, linear regression, and their connections

A Semi-supervised Approach to Perceived Age Prediction from Face Images

TITLE: A Data-Driven Approach to Patient Risk Stratification for Acute Respiratory Distress Syndrome (ARDS)

Semi-Supervised Disentangling of Causal Factors. Sargur N. Srihari

Pooling Subjective Confidence Intervals

6. Unusual and Influential Data

Introduction & Basics

Performance evaluation of the various edge detectors and filters for the noisy IR images

COST EFFECTIVENESS STATISTIC: A PROPOSAL TO TAKE INTO ACCOUNT THE PATIENT STRATIFICATION FACTORS. C. D'Urso, IEEE senior member

SUPPLEMENTARY INFORMATION

ISIR: Independent Sliced Inverse Regression

Multiclass Classification of Cervical Cancer Tissues by Hidden Markov Model

IDENTIFICATION OF OUTLIERS: A SIMULATION STUDY

PSYCH-GA.2211/NEURL-GA.2201 Fall 2016 Mathematical Tools for Cognitive and Neural Science. Homework 5

Classification. Methods Course: Gene Expression Data Analysis -Day Five. Rainer Spang

NIH Public Access Author Manuscript Conf Proc IEEE Eng Med Biol Soc. Author manuscript; available in PMC 2013 February 01.

Normal Distribution. Many variables are nearly normal, but none are exactly normal Not perfect, but still useful for a variety of problems.

MS&E 226: Small Data

Predictive performance and discrimination in unbalanced classification

NMF-Density: NMF-Based Breast Density Classifier

Small Group Presentations

Chapter 1: Exploring Data

Zainab M. AlQenaei. Dissertation Defense University of Colorado at Boulder Leeds School of Business Operations and Information Management Division

Estimation of Area under the ROC Curve Using Exponential and Weibull Distributions

Information-Theoretic Outlier Detection For Large_Scale Categorical Data

Organizing Data. Types of Distributions. Uniform distribution All ranges or categories have nearly the same value a.k.a. rectangular distribution

Unsupervised MRI Brain Tumor Detection Techniques with Morphological Operations

Predicting Breast Cancer Survival Using Treatment and Patient Factors

Quantitative Data and Measurement. POLI 205 Doing Research in Politics. Fall 2015

The SAGE Encyclopedia of Educational Research, Measurement, and Evaluation Multivariate Analysis of Variance

STAT 503X Case Study 1: Restaurant Tipping

COMPARATIVE STUDY ON FEATURE EXTRACTION METHOD FOR BREAST CANCER CLASSIFICATION

Automatic Medical Coding of Patient Records via Weighted Ridge Regression

Ch.20 Dynamic Cue Combination in Distributional Population Code Networks. Ka Yeon Kim Biopsychology

Learning to Identify Irrelevant State Variables

An Empirical and Formal Analysis of Decision Trees for Ranking

Prediction of Successful Memory Encoding from fmri Data

CLASSIFICATION OF BREAST CANCER INTO BENIGN AND MALIGNANT USING SUPPORT VECTOR MACHINES

Machine Learning for Predicting Delayed Onset Trauma Following Ischemic Stroke

Announcements. Perceptual Grouping. Quiz: Fourier Transform. What you should know for quiz. What you should know for quiz

Seeing is Behaving : Using Revealed-Strategy Approach to Understand Cooperation in Social Dilemma. February 04, Tao Chen. Sining Wang.

Review. Imagine the following table being obtained as a random. Decision Test Diseased Not Diseased Positive TP FP Negative FN TN

Nearest Shrunken Centroid as Feature Selection of Microarray Data

Improved Intelligent Classification Technique Based On Support Vector Machines

Selection of Linking Items

Introduction to Discrimination in Microarray Data Analysis

T. R. Golub, D. K. Slonim & Others 1999

Transcription:

Outlier Analysis Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj

Outline Introduction Extreme Value Analysis Probabilistic Models Clustering for Outlier Detection Distance-Based Outlier Detection Density-Based Methods Information-Theoretic Models Outlier Validity Summary

Introduction (1) A Quote You are unique, and if that is not fulfilled, then something has been lost. Martha Graham An Informal Definition A Complementary Concept to Clustering Clustering attempts to determine groups of data points that are similar Outliers are individual data points that are different from the remaining data

Introduction (2) Applications Data cleaning Remove noise in data Credit card fraud Unusual patterns of credit card activity Network intrusion detection Unusual records/changes in network traffic

Introduction (3) The Key Idea Create a model of normal patterns Outliers are data points that do not naturally fit within this normal model The outlierness of a data point is quantified by a outlier score Outputs of Outlier Detection Algorithms Real-valued outlier score Binary label

Outline Introduction Extreme Value Analysis Probabilistic Models Clustering for Outlier Detection Distance-Based Outlier Detection Density-Based Methods Information-Theoretic Models Outlier Validity Summary

Extreme Value Analysis (1) Statistical Tails http://www.regent sprep.org/regents/ math/algtrig/ats2/ normallesson.htm All extreme values are outliers Outliers may not be extreme values 1 and 100 are extreme values 50 is an outlier but not extreme value

Extreme Value Analysis (2) All extreme values are outlies Outlies may not be extreme values

Univariate Extreme Value Analysis (1) Statistical Tail Confidence Tests Suppose the density distribution is Tails are extreme regions s.t. Symmetric Distribution Two symmetric tails The areas inside tails represent the cumulative probability

Univariate Extreme Value Analysis (2) Statistical Tail Confidence Tests Suppose the density distribution is Tails are extreme regions s.t. Asymmetric Distribution Areas in two tails are different Regions in the interior are not tails

The Procedure (1) A model distribution is selected Normal Distribution with mean and standard deviation Parameter Selection Prior domain knowledge Estimate from data 1 1 1

The Procedure (2) -value of a random variable Large positive values of correspond to the upper tail Large negative values of correspond to the lower tail follows the standard normal distribution Extreme values

Multivariate Extreme Values (1) Unimodal probability distributions with a single peak Suppose the density distribution is Tails are extreme regions s.t. Multivariate Gaussian Distribution where distance between is the Mahalanobis and

Multivariate Extreme Values (2) Extreme-value Score of Larger values imply more extreme behavior

Multivariate Extreme Values (2) Extreme-value Score of Larger values imply more extreme behavior Extreme-value Probability of Let be the region,,σ,,σ Cumulative probability of Cumulative Probability of distribution for which the value is larger than

Why distribution? The Mahalanobis distance Let be the covariance matrix,,σ Σ Projection+Normalization Let Σ Λ Then, Σ Λ,,Σ

Adaptive to the Shape is an extreme value

Depth-Based Methods Convex Hull Corners

The Procedure The index is the outlier score Smaller values indicate a grate tendency

An Example Peeling Layers of an Onion

Limitations No Normalization Many data points are indistinguishable The computational complexity increases significantly with dimensionality

Outline Introduction Extreme Value Analysis Probabilistic Models Clustering for Outlier Detection Distance-Based Outlier Detection Density-Based Methods Information-Theoretic Models Outlier Validity Summary

Probabilistic Models Related to Probabilistic Model-Based Clustering The Key Idea Assume data is generated from a mixture-based generative model Learn the parameter of the model from data EM algorithm Evaluate the probability of each data point being generated by the model Points with low values are outliers

Mixture-based Generative Model Data was generated from a mixture of distributions with probability distribution represents a cluster/mixture component Each point is generated as follows Select a mixture component with probability, Assume the -th component is selected Generate a data point from

Learning Parameter from Data The probability that generated by the mixture model is given by, The probability of the data set generated by Learning parameters that maximize

Identify Outliers Outlier Score is defined as,

Outline Introduction Extreme Value Analysis Probabilistic Models Clustering for Outlier Detection Distance-Based Outlier Detection Density-Based Methods Information-Theoretic Models Outlier Validity Summary

Clustering for Outlier Detection Outlier Analysis v.s. Clustering Clustering is about finding crowds of data points Outlier analysis is about finding data points that are far away from these crowds Every data point is Either a member of a cluster Or an outlier Some clustering algorithms also detect outliers DBSCAN, DENCLUE

The Procedure (1) A Simple Way 1. Cluster the data 2. Define the outlier score as the distance of the data point to its cluster centroid

The Procedure (2) A Better Approach 1. Cluster the data 2. Define the outlier score as the local Mahalanobis distance Suppose belongs to cluster is the mean vector of the -th cluster Σ is the covariance matrix of the -th cluster Multivariate Extreme Value Analysis Global Mahalanobis distance

A Post-processing Step Remove Small-Size Clusters

Outline Introduction Extreme Value Analysis Probabilistic Models Clustering for Outlier Detection Distance-Based Outlier Detection Density-Based Methods Information-Theoretic Models Outlier Validity Summary

Distance-Based Outlier Detection An Instance-Specific Definition The distance-based outlier score of an object is its distance to its -th nearest neighbor

Distance-Based Outlier Detection An Instance-Specific Definition The distance-based outlier score of an object is its distance to its -th nearest neighbor Sometimes, average distance is used High-computational Cost Index structure Effective when the dimensionality is low Pruning tricks Designed for the case that only the top- outliers are needed

The Naïve Approach for Finding Top -Outliers 1. Evaluate the distance matrix

The Naïve Approach for Finding Top -Outliers 1. Evaluate the distance matrix 2. Find the -th smallest value in each row

The Naïve Approach for Finding Top -Outliers 1. Evaluate the distance matrix 2. Find the -th smallest value in each row 3. Choose data points with largest

Pruning Methods Sampling 1. Evaluate a distance matrix

Pruning Methods Sampling 1. Evaluate a distance matrix 2. Find the -th smallest value in each row

Pruning Methods Sampling 1. Evaluate a distance matrix 2. Find the -th smallest value in each row 3. Identify the -th score in top -rows

Pruning Methods Sampling 1. Evaluate a distance matrix 2. Find the -th smallest value in each row 3. Identify the -th score in top -rows 4. Remove points with

Pruning Methods Early Termination When completing the empty area

Pruning Methods Early Termination When completing the empty area Update known when more distances are Stop if Update if necessary

Local Distance Correction Methods Impact of Local Variations

Local Outlier Factor (LOF) Let be the distance of to its - nearest neighbor Let be the set of points within the -nearest neighbor distance of Reachability Distance Not symmetric between and If Otherwise, is large, Smoothed out by, more stable

Local Outlier Factor (LOF) Average Reachability Distance Local Outlier Factor Larger for Outliers Close to 1 for Others Outlier Score

Instance-Specific Mahalanobis Distance (1) Define a local Mahalanobis distance for each point Based on the covariance structure of the neighborhood of a data point The Challenge Neighborhood of a data point is hard to define with the Euclidean distance Euclidean distance is biased toward capturing the circular region around that point

Instance-Specific Mahalanobis Distance (2) An agglomerative approach for neighborhood construction Add to Data points are iteratively added to that have the smallest distance to argmin min Instance-specific Mahalanobis score Outlier score max

Instance-Specific Mahalanobis Distance (3) Can be applied to both cases Relation to clustering-based approaches

Outline Introduction Extreme Value Analysis Probabilistic Models Clustering for Outlier Detection Distance-Based Outlier Detection Density-Based Methods Information-Theoretic Models Outlier Validity Summary

Density-Based Methods The Key Idea Determine sparse regions in the underlying data Limitations Cannot handle variations of density

Histogram- and Grid-Based Techniques Histogram for -dimensional data Data points that lie in bins with very low frequency are reported as outliers https://www.mathsisfun.com/data/histograms.html Grid for high-dimensional data Challenges Size of grid Too local Sparsity

Kernel Density Estimation Given data points is a kernel function The density at each data point Computed without including the point itself in the density computation Low values of the density indicate greater tendency to be an outlier

Outline Introduction Extreme Value Analysis Probabilistic Models Clustering for Outlier Detection Distance-Based Outlier Detection Density-Based Methods Information-Theoretic Models Outlier Validity Summary

Information-Theoretic Models An Example The 1 st One: AB 17 times C in 2 nd string increases its minimum description length Conventional Methods Fix model, then calculate the deviation Information-Theoretic Models Fix the deviation, then learn the model Outlier score of : increase of the model size when is present

Probabilistic Models The Conventional Method Learn the parameters of generative model with a fixed size Use the fit of each data point as the outlier score Information-Theoretic Method Fix a maximum allowed deviation (a minimum value of fit) Learn the size and values of parameters Increase of size is used as the outlier score

Outline Introduction Extreme Value Analysis Probabilistic Models Clustering for Outlier Detection Distance-Based Outlier Detection Density-Based Methods Information-Theoretic Models Outlier Validity Summary

Outlier Validity Methodological Challenges Internal criteria are rarely used in outlier analysis A particular validity measure will favor an algorithm using a similar objective function criterion Magnified because of the small sample solution space External Measures The known outlier labels from a synthetic data set The rare class labels from a real data set

Receiver Operating Characteristic (ROC) curve is the set of outliers (ground-truth) An algorithm outputs a outlier score Given a threshold, we denote the set of outliers by True-positive rate (recall) The false positive rate ROC Curve Plot versus

An Example

Outline Introduction Extreme Value Analysis Probabilistic Models Clustering for Outlier Detection Distance-Based Outlier Detection Density-Based Methods Information-Theoretic Models Outlier Validity Summary

Summary Extreme Value Analysis Univariate, Multivariate, Depth-Based Probabilistic Models Clustering for Outlier Detection Distance-Based Outlier Detection Pruning, LOF, Instance-Specific Density-Based Methods Histogram- and Grid-Based, Kernel Density Information-Theoretic Models Outlier Validity ROC curve