Knowledge Discovery and Data Mining I

Ludwig-Maximilians-Universität München Lehrstuhl für Datenbanksysteme und Data Mining Prof. Dr. Thomas Seidl Knowledge Discovery and Data Mining I Winter Semester 2018/19

Introduction: What is an outlier?
Hawkins (1980): "An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism."
Statistics-based intuition:
- Normal data objects follow a generating mechanism, e.g. some given statistical process.
- Abnormal objects deviate from this generating mechanism.
Unsupervised Methods: Outlier Detection, January 25, 2019

Introduction: Example: Hadlum vs. Hadlum (1949) [Barnett 1978]
- The birth of a child to Mrs. Hadlum happened 349 days after Mr. Hadlum left for military service.
- The average human gestation period is 280 days (40 weeks).
- Statistically, 349 days is an outlier.

Introduction: Example: Hadlum vs. Hadlum (1949) [Barnett 1978]
[Figure: distribution of gestation periods, with the following legend.]
- Blue: statistical basis (13634 observations of gestation periods).
- Green: assumed underlying Gaussian process. The probability that the birth of Mrs. Hadlum's child was generated by this process is very low.
- Red: assumption of Mr. Hadlum (another Gaussian process is responsible for the observed birth, where the gestation period starts later).

Introduction: Applications
Fraud detection
- The purchasing behavior of a credit card owner usually changes when the card is stolen.
- Abnormal buying patterns can characterize credit card abuse.
Medicine
- Whether a particular test result is abnormal may depend on other characteristics of the patient (e.g. gender, age, ...).
- Unusual symptoms or test results may indicate potential health problems of a patient.
Public health
- The occurrence of a particular disease, e.g. tetanus, scattered across various hospitals of a city indicates problems with the corresponding vaccination program in that city.
- Whether an occurrence is abnormal depends on different aspects like frequency, spatial correlation, etc.

Introduction: Applications (cont'd)
Sports statistics
- In many sports, various parameters are recorded for players in order to evaluate the players' performance.
- Outstanding players (in a positive as well as a negative sense) may be identified as having abnormal parameter values.
- Sometimes, players show abnormal values only on a subset or a special combination of the recorded parameters.
Detecting measurement errors
- Data derived from sensors (e.g. in a given scientific experiment) may contain measurement errors.
- Abnormal values could provide an indication of a measurement error.
- Removing such errors can be important in other data mining and data analysis tasks.
"One person's noise could be another person's signal."

Introduction: Important Properties of Outlier Models
- Global vs. local approach: outlierness regarding the whole dataset (global) or regarding a subset of the data (local)?
- Labeling vs. scoring: binary decision or outlier degree score?
- Assumptions about outlierness: what are the characteristics of an outlier object?

Agenda
1. Introduction
2. Basics
3. Unsupervised Methods
   3.1 Frequent Pattern Mining
   3.2 Clustering
   3.3 Outlier Detection
       3.3.1 Clustering-based Outliers
       3.3.2 Statistical Outliers
       3.3.3 Distance-based Outliers
       3.3.4 Density-based Outliers
       3.3.5 Angle-based Outliers
       3.3.6 Summary
4. Supervised Methods
5. Advanced Topics

Clustering-based Outliers
An object is a cluster-based outlier if it does not strongly belong to any cluster.
Basic idea:
- Cluster the data into groups.
- Choose points in small clusters as candidate outliers.
- Compute the distance between candidate points and non-candidate clusters.
- If candidate points are far from all other non-candidate points and clusters, they are outliers.

Clustering-based Outliers: More Systematic Approaches
- Find clusters and then assess the degree to which a point belongs to any cluster, e.g. for k-means, use the distance to the centroid.
- If eliminating a point results in a substantial improvement of the objective function, we could classify it as an outlier.
- Clustering creates a model of the data, and the outliers distort that model.
- Applicable to clustering algorithms that optimize some objective function (e.g. k-means).
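The centroid-distance idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy points, the two starting centroids, and the helper names `kmeans` and `centroid_scores` are invented here and are not part of the lecture.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain Lloyd iterations starting from the given (assumed) centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = mean of the members; keep the old one if a cluster is empty.
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
    return centroids

def centroid_scores(points, centroids):
    """Outlier score of a point = distance to its nearest centroid."""
    return [min(math.dist(p, c) for c in centroids) for p in points]

# Hypothetical toy data: two tight groups and one stray point (last entry).
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2),
          (2.5, 9.0)]
centroids = kmeans(points, centroids=[(0.0, 0.0), (5.0, 5.0)])
scores = centroid_scores(points, centroids)
print(max(zip(scores, points)))  # the stray point has the largest score
```

Note that the stray point is assigned to the second cluster and pulls its centroid away from the three regular members, which is a small instance of outliers distorting the clustering model.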

Statistical Tests: General Idea
- Given a certain kind of statistical distribution (e.g. Gaussian).
- Compute the parameters (e.g. mean and standard deviation), assuming all data points have been generated by such a distribution.
- Outliers are points that have a low probability of being generated by the overall distribution (e.g. points that deviate from the mean by more than 3 times the standard deviation).
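A minimal sketch of the 3-standard-deviation rule for univariate data, using only the Python standard library. The function name `three_sigma_outliers` and the sample values are made up for illustration; they are not real measurements.

```python
import statistics

def three_sigma_outliers(values, factor=3.0):
    """Return the values lying more than `factor` standard deviations from the mean."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # parameters are fitted on ALL points
    return [v for v in values if abs(v - mu) > factor * sigma]

# Hypothetical sample: 20 values near 280 and one value far away.
days = [270, 275, 278, 280, 281, 283, 285, 279, 277, 282,
        284, 276, 280, 281, 279, 283, 278, 282, 280, 277, 349]
print(three_sigma_outliers(days))
```

Because the mean and standard deviation are computed from all points, the extreme value itself inflates both estimates. With few observations this can prevent any point from crossing the threshold.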

Statistical Tests: Basic Assumption
- Normal data objects follow a (known) distribution and occur in a high-probability region of this model.
- Outliers deviate strongly from this distribution.

Statistical Tests
A huge number of different tests are available, differing in:
- Type of data distribution (e.g. Gaussian)
- Number of variables, i.e. dimensions of the data objects (univariate/multivariate)
- Number of distributions (mixture models)
- Parametric versus non-parametric (e.g. histogram-based)
Example on the following slides: Gaussian distribution, multivariate, single model, parametric.

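Statistical Outliers: Gaussian Distribution
Probability density function of a normal distribution (univariate form):
$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
In the multivariate case with $d$ dimensions, the variance $\sigma^2$ is replaced by the covariance matrix $\Sigma$:
$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d \lvert\Sigma\rvert}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$
- $\mu$ is the mean value of all points (usually data are normalized such that $\mu = 0$).
- $\Sigma$ is the covariance matrix from the mean.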

Statistical Outliers: Mahalanobis Distance
- Mahalanobis distance of point $x$ to $\mu$:
$$\mathrm{MDist}(x, \mu) = (x-\mu)^T \Sigma^{-1} (x-\mu)$$
- MDist follows a $\chi^2$-distribution with $d$ degrees of freedom ($d$ = data dimensionality).
- Outliers: all points $x$ with $\mathrm{MDist}(x, \mu) > \chi^2(0.975)$ ($\approx 3\sigma$).
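The Mahalanobis test can be sketched for 2-dimensional data without external libraries, using the closed-form inverse of a 2x2 covariance matrix. The data points and the name `mahalanobis_sq` are assumptions made for this example. For d = 2 the chi-square distribution function is 1 - exp(-x/2), so the 0.975 quantile can be computed directly as -2 ln(0.025).

```python
import math

def mahalanobis_sq(points):
    """MDist (as defined above, without square root) of each 2-D point to the mean."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Entries of the 2x2 covariance matrix, estimated from all points.
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    det = sxx * syy - sxy ** 2  # zero for degenerate (collinear) data
    result = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # (dx, dy) * Sigma^{-1} * (dx, dy)^T using the closed-form 2x2 inverse.
        result.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return result

# Hypothetical data: ten strongly correlated points and one point off the trend.
points = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1),
          (6, 6.0), (7, 6.9), (8, 8.2), (9, 9.1), (10, 9.8),
          (3, 9.0)]
threshold = -2 * math.log(1 - 0.975)  # chi-square 0.975 quantile for d = 2
mdist = mahalanobis_sq(points)
flagged = [i for i, m in enumerate(mdist) if m > threshold]
print(flagged)
```

The last point has unremarkable coordinates in each dimension taken separately. It is flagged only because it violates the correlation captured by the covariance matrix.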

Statistical Outliers: Problems
- Curse of dimensionality: the larger the degree of freedom, the more similar the MDist values for all points.
[Figure: histograms of MDist values; x-axis = observed MDist values, y-axis = frequency of observation.]

Statistical Outliers: Problems (cont'd)
Robustness:
- Mean and standard deviation are very sensitive to outliers.
- These values are computed for the complete data set (including potential outliers).
- The MDist is used to determine outliers although the MDist values are themselves influenced by these outliers.

Statistical Outliers: Problems (cont'd)
- Data distribution is fixed.
- Low flexibility (if no mixture models are used).
- Global method.

Distance-Based Approaches
General idea:
- Judge a point based on the distance(s) to its neighbors (several variants have been proposed).
Basic assumption:
- Normal data objects have a dense neighborhood.
- Outliers are far apart from their neighbors, i.e. have a less dense neighborhood.

Distance-Based Approaches: D(ε, π)-Outliers
- Given: radius ε, percentage π.
- A point p is considered an outlier if at most π percent of all points (including p) have a distance to p less than ε:
$$\mathrm{OutlierSet}(\varepsilon, \pi) = \left\{ p \in D \;\middle|\; \frac{\lvert \{ q \in D \mid \mathrm{dist}(p, q) < \varepsilon \} \rvert}{\lvert D \rvert} \le \pi \right\}$$
Reference: E. Knorr, R. Ng. A Unified Notion of Outliers: Properties and Computation. KDD '97.
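The set definition translates almost literally into code. The following is a quadratic-time sketch: the function name `d_outliers` and the toy data are assumptions for illustration, and `pi` is given as a fraction rather than a percentage.

```python
import math

def d_outliers(points, eps, pi):
    """Return all p for which at most a fraction `pi` of the points lie within eps of p."""
    n = len(points)
    result = []
    for p in points:
        # p itself is counted, since dist(p, p) = 0 < eps.
        close = sum(1 for q in points if math.dist(p, q) < eps)
        if close / n <= pi:
            result.append(p)
    return result

# Hypothetical data: a 3x3 grid with spacing 0.3 and one distant point.
points = [(0.3 * i, 0.3 * j) for i in range(3) for j in range(3)] + [(5.0, 5.0)]
print(d_outliers(points, eps=1.0, pi=0.1))
```

With these parameters only the distant point is returned: it has no neighbor other than itself within ε = 1.0, while every grid point has the eight other grid points within that radius.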

[Figure: D(ε, π) example. Left panel: score (ε = 0.3). Right panel: decision (π = 0.02).]

Distance-Based Approaches: kNN
Outlier scoring based on kNN distances
- General model: take the kNN distance of a point as its outlier score.
- Decision: k-distance above some threshold τ → outlier.

[Figure: kNN example. Left panel: score (k = 1). Right panel: decision (τ = 0.3).]

[Figure: kNN example. Left panel: score (k = 5). Right panel: decision (τ = 0.3).]

kNN: Problems
- Highly sensitive to k:
  - k too small: a small number of close neighbors can cause low outlier scores.
  - k too large: all objects in a cluster with fewer than k objects might become outliers.
- Cannot handle datasets with regions of widely different densities, due to the global threshold.
Image source: P. Tan, M. Steinbach, V. Kumar (2006). Classification: basic concepts, decision trees, and model evaluation. Introduction to data mining, 1, 145-205.
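The kNN-distance score and its sensitivity to k can be sketched as follows. The helper name `knn_distance`, the threshold value, and the data are hypothetical and chosen only to make the effect visible.

```python
import math

def knn_distance(points, k):
    """Outlier score of each point = distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical data: a 3x3 grid (spacing 0.3) and two isolated points close to each other.
points = [(0.3 * i, 0.3 * j) for i in range(3) for j in range(3)]
points += [(5.0, 5.0), (5.1, 5.0)]

tau = 1.0  # global decision threshold (an arbitrary choice for this example)
flagged_k1 = [i for i, s in enumerate(knn_distance(points, 1)) if s > tau]
flagged_k2 = [i for i, s in enumerate(knn_distance(points, 2)) if s > tau]
print(flagged_k1, flagged_k2)
```

With k = 1 the two isolated points are each other's nearest neighbor, so neither is flagged. With k = 2 the second neighbor lies in the distant grid and both are flagged.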

Density-Based Approaches
General idea:
- Compare the density around a point with the density around its local neighbors.
- The relative density of a point compared to its neighbors is computed as an outlier score.
- Approaches also differ in how they estimate density.
Basic assumption:
- The density around a normal data object is similar to the density around its neighbors.
- The density around an outlier is considerably different from the density around its neighbors.

Density-Based Approaches: Problems
- There are different definitions of density, e.g. the number of points within a specified distance ε from the given object.
- The choice of ε is critical (too small: normal points are considered outliers; too big: outliers are considered normal).
- A global notion of density is problematic (as it is in clustering); it fails when the data contain regions of different densities.
[Figure: points A and D in regions of different density.] D has a higher absolute density than A, but compared to its neighborhood, D's density is lower.

Density-Based Approaches: Failure Case of Distance-Based Approaches
[Figure: sparse cluster C_1 containing point q, and outlier o_2 close to a denser cluster.]
- D(ε, π): the parameters ε, π cannot be chosen such that o_2 is an outlier but none of the points in C_1 (e.g. q) is.
- kNN-distance: the kNN-distance of objects in C_1 (e.g. q) is larger than the kNN-distance of o_2.

Density-Based Approaches: Solution
Consider the relative density w.r.t. the neighbourhood.
Model:
- Local density (ld) of point p (inverse of the average distance to the kNNs of p):
$$ld_k(p) = \left( \frac{1}{k} \sum_{o \in kNN(p)} dist(p, o) \right)^{-1}$$
- Local Outlier Factor (LOF) of p (average ratio of the lds of the kNNs of p to the ld of p):
$$LOF_k(p) = \frac{1}{k} \sum_{o \in kNN(p)} \frac{ld_k(o)}{ld_k(p)}$$
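A brute-force sketch of this simple LOF variant (local density without smoothing) is given below. The helper names and the data layout are assumptions for illustration: a dense grid, a sparse grid, and one point `o2` placed near the dense grid, loosely mirroring the failure case of the distance-based methods.

```python
import math

def knn(points, i, k):
    """Indices of the k nearest neighbors of points[i] (ties broken by index)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

def local_density(points, i, k):
    """ld_k: inverse of the average distance to the kNNs (assumes no duplicate points)."""
    return k / sum(math.dist(points[i], points[j]) for j in knn(points, i, k))

def lof(points, i, k):
    """LOF_k: average ratio of the neighbors' local density to the point's own."""
    own = local_density(points, i, k)
    return sum(local_density(points, j, k) for j in knn(points, i, k)) / (k * own)

dense = [(0.1 * i, 0.1 * j) for i in range(3) for j in range(3)]
sparse = [(10.0 + i, 10.0 + j) for i in range(3) for j in range(3)]
o2 = (0.7, 0.1)  # hypothetical outlier next to the dense cluster
points = dense + sparse + [o2]

scores = [lof(points, i, k=3) for i in range(len(points))]
flagged = [i for i, s in enumerate(scores) if s > 2]
print(flagged)
```

In this layout the 3NN-distance of `o2` (about 0.51) is smaller than that of every point in the sparse grid (1.0 or more), so no global kNN-distance threshold can separate it. The relative density does: only `o2` receives a LOF above 2.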

[Figure: LOF example. Left panel: score (k = 7). Right panel: decision (LOF_k(o) > 2).]

Density-Based Approaches: Extension (Smoothing Factor)
- Reachability distance, where kdist(o) is the distance from o to its k-th nearest neighbor:
$$rd_k(p, o) = \max\{ kdist(o),\ dist(p, o) \}$$
- Local reachability density lrd_k:
$$lrd_k(p) = \left( \frac{1}{k} \sum_{o \in kNN(p)} rd_k(p, o) \right)^{-1}$$
- Replace ld by lrd:
$$LOF_k(p) = \frac{1}{k} \sum_{o \in kNN(p)} \frac{lrd_k(o)}{lrd_k(p)}$$
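The smoothed variant changes only the density estimate. A brute-force sketch follows, with hypothetical helper names and the same kind of toy layout (dense grid, sparse grid, one point near the dense grid); it is not an optimized implementation.

```python
import math

def knn(points, i, k):
    """Indices of the k nearest neighbors of points[i] (ties broken by index)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

def k_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbor."""
    return math.dist(points[i], points[knn(points, i, k)[-1]])

def reach_dist(points, i, j, k):
    """rd_k(p, o) = max(kdist(o), dist(p, o)) with p = points[i], o = points[j]."""
    return max(k_distance(points, j, k), math.dist(points[i], points[j]))

def lrd(points, i, k):
    """Local reachability density: inverse of the average reachability distance."""
    return k / sum(reach_dist(points, i, j, k) for j in knn(points, i, k))

def lof(points, i, k):
    own = lrd(points, i, k)
    return sum(lrd(points, j, k) for j in knn(points, i, k)) / (k * own)

dense = [(0.1 * i, 0.1 * j) for i in range(3) for j in range(3)]
sparse = [(10.0 + i, 10.0 + j) for i in range(3) for j in range(3)]
points = dense + sparse + [(0.7, 0.1)]  # last point: hypothetical outlier

scores = [lof(points, i, k=3) for i in range(len(points))]
flagged = [i for i, s in enumerate(scores) if s > 2]
print(flagged)
```

Taking the maximum with kdist(o) means that all points inside the k-neighborhood of o are treated as equally far from o, which damps fluctuations of the score for points deep inside a cluster.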

Density-Based Approaches: Discussion
- LOF ≈ 1: point in a cluster.
- LOF ≫ 1: outlier.
- The choice of k defines the reference set.

Angle-Based Approach
General idea:
- Angles are more stable than distances in high-dimensional spaces.
- Object o is an outlier if most other objects are located in similar directions.
- Object o is not an outlier if many other objects are located in varying directions.
[Figure: an inlier and an outlier with the directions to the other objects.]
Basic assumption:
- Outliers are at the border of the data distribution.
- Normal points are in the center of the data distribution.

Angle-Based Approach: Model
- For a given point p, consider the angle between $\overrightarrow{px}$ and $\overrightarrow{py}$ for any two points x, y from the database.
- Measure the variance of the angle spectrum.

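Angle-Based Approach: Model (cont'd)
- Weighted by the corresponding distances (for lower-dimensional data sets where angles are less reliable):
$$ABOD(p) = \mathrm{VAR}_{x,y \in D}\left( \frac{1}{\lVert \overrightarrow{xp} \rVert_2 \, \lVert \overrightarrow{yp} \rVert_2} \cos\left(\overrightarrow{xp}, \overrightarrow{yp}\right) \right) = \mathrm{VAR}_{x,y \in D}\left( \frac{\langle \overrightarrow{xp}, \overrightarrow{yp} \rangle}{\lVert \overrightarrow{xp} \rVert_2^2 \, \lVert \overrightarrow{yp} \rVert_2^2} \right)$$
- Small ABOD → outlier.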
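The ABOD score can be sketched directly from the second form of the formula. The function name `abod` and the toy data are hypothetical. As an assumption of this sketch, the variance is taken over all unordered pairs of distinct points other than p, which has cubic cost over the whole data set.

```python
import itertools
import statistics

def abod(points, i):
    """Variance of the distance-weighted angle terms seen from points[i]."""
    p = points[i]
    others = [q for j, q in enumerate(points) if j != i]
    values = []
    for x, y in itertools.combinations(others, 2):
        xp = [a - b for a, b in zip(x, p)]
        yp = [a - b for a, b in zip(y, p)]
        dot = sum(a * b for a, b in zip(xp, yp))
        # <xp, yp> / (||xp||^2 * ||yp||^2); assumes no duplicates of p in the data.
        values.append(dot / (sum(a * a for a in xp) * sum(b * b for b in yp)))
    return statistics.pvariance(values)

# Hypothetical data: a 3x3 unit grid and one point far outside it.
points = [(float(i), float(j)) for i in range(3) for j in range(3)] + [(10.0, 10.0)]
scores = [abod(points, i) for i in range(len(points))]
print(scores.index(min(scores)))  # index of the point with the smallest ABOD
```

Seen from the distant point, all other points lie in nearly the same direction and at large distances, so the weighted terms barely vary and its ABOD is the smallest in the data set.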

[Figure: angle-based example. Left panel: score (all pairs). Right panel: decision (ABOD(o) < 0.2).]

Summary
- Properties: global vs. local, labeling vs. scoring.
- Clustering-based outliers: identification as non-(cluster-members).
- Statistical outliers: assume a probability distribution; outliers = unlikely to be generated by that distribution.
- Distance-based outliers: distance to neighbors as outlier metric.
- Density-based outliers: relative density around the point as outlier metric.
- Angle-based outliers: angles between outliers and random point pairs vary only slightly.