Class Outlier Detection. Zuzana Pekarčíková

Similar documents
Motivation: Fraud Detection

Knowledge Discovery and Data Mining I

Data Mining. Outlier detection. Hamid Beigy. Sharif University of Technology. Fall 1395

Statistics 202: Data Mining. c Jonathan Taylor. Final review Based in part on slides from textbook, slides of Susan Holmes.

Outlier Analysis. Lijun Zhang

OUTLIER DETECTION : A REVIEW

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014

From Biostatistics Using JMP: A Practical Guide. Full book available for purchase here. Chapter 1: Introduction... 1

SURVEY ON OUTLIER DETECTION TECHNIQUES USING CATEGORICAL DATA

A Semi-supervised Approach to Perceived Age Prediction from Face Images

Comparison of discrimination methods for the classification of tumors using gene expression data

Information-Theoretic Outlier Detection For Large_Scale Categorical Data

T. R. Golub, D. K. Slonim & Others 1999

10CS664: PATTERN RECOGNITION QUESTION BANK

Introduction to Discrimination in Microarray Data Analysis

Comparative study of Naïve Bayes Classifier and KNN for Tuberculosis

IMPACTS OF SOCIAL NETWORKS AND SPACE ON OBESITY. The Rights and Wrongs of Social Network Analysis

ISIR: Independent Sliced Inverse Regression

Chapter 1. Introduction

Brain Tumour Detection of MR Image Using Naïve Beyer classifier and Support Vector Machine

Statistics for Clinical Trials: Basics of Phase III Trial Design

Social Effects in Blau Space:

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Midterm, 2016

Identification of Tissue Independent Cancer Driver Genes

FUNNEL: Automatic Mining of Spatially Coevolving Epidemics

Colon cancer subtypes from gene expression data

A Comparison of Collaborative Filtering Methods for Medication Reconciliation

Data mining for Obstructive Sleep Apnea Detection. 18 October 2017 Konstantinos Nikolaidis

1.4 - Linear Regression and MS Excel

Supplementary Materials Extracting a Cellular Hierarchy from High-dimensional Cytometry Data with SPADE

Bayesian Models for Combining Data Across Subjects and Studies in Predictive fmri Data Analysis

The Outlier Approach How To Triumph In Your Career As A Nonconformist

PREDICTION OF BREAST CANCER USING STACKING ENSEMBLE APPROACH

Data Mining and Knowledge Discovery: Practice Notes

Sample Size Considerations. Todd Alonzo, PhD

Searching for Temporal Patterns in AmI Sensor Data

Improving k Nearest Neighbor with Exemplar Generalization for Imbalanced Classification

Application of Local Control Strategy in analyses of the effects of Radon on Lung Cancer Mortality for 2,881 US Counties

Statistical Applications in Genetics and Molecular Biology

On Bayesian Network and Outlier Detection

CS709a: Algorithms and Complexity

COST EFFECTIVENESS STATISTIC: A PROPOSAL TO TAKE INTO ACCOUNT THE PATIENT STRATIFICATION FACTORS. C. D'Urso, IEEE senior member

Ensembles for Unsupervised Outlier Detection: Challenges and Research Questions

SUPPLEMENTARY INFORMATION. Table 1 Patient characteristics Preoperative. language testing

STAT100 Module 4. Detecting abnormalities. Dr. Matias Salibian-Barrera Winter 2009 / 2010

A STATISTICAL PATTERN RECOGNITION PARADIGM FOR VIBRATION-BASED STRUCTURAL HEALTH MONITORING

Chapter 14 Noise Versus Outliers

EECS 433 Statistical Pattern Recognition

Unsupervised Learning of Micro-Action Exemplars using a Product Manifold

Outlier Detection Based on Surfeit Entropy for Large Scale Categorical Data Set

Multiple Choice Questions

Radiotherapy Outcomes

On the Combination of Collaborative and Item-based Filtering

BayesRandomForest: An R

Fusing Generic Objectness and Visual Saliency for Salient Object Detection

Review of Chronic Kidney Disease based on Data Mining Techniques

Performance Analysis of Different Classification Methods in Data Mining for Diabetes Dataset Using WEKA Tool

Evaluating Classifiers for Disease Gene Discovery

Yeast Cells Classification Machine Learning Approach to Discriminate Saccharomyces cerevisiae Yeast Cells Using Sophisticated Image Features.

Tissue Classification Based on Gene Expression Data

BIOINFORMATICS ORIGINAL PAPER

Identifying Peer Influence Effects in Observational Social Network Data: An Evaluation of Propensity Score Methods

Case Studies on High Throughput Gene Expression Data Kun Huang, PhD Raghu Machiraju, PhD

Outlier detection in datasets with mixed-attributes

Noise-Robust Speech Recognition Technologies in Mobile Environments

Numerical Integration of Bivariate Gaussian Distribution

Recognizing Scenes by Simulating Implied Social Interaction Networks

Machine Learning! Robert Stengel! Robotics and Intelligent Systems MAE 345,! Princeton University, 2017

Classification of Thyroid Nodules in Ultrasound Images using knn and Decision Tree

Agent-Based Models. Maksudul Alam, Wei Wang

Statistical Analysis Using Machine Learning Approach for Multiple Imputation of Missing Data

Mayuri Takore 1, Prof.R.R. Shelke 2 1 ME First Yr. (CSE), 2 Assistant Professor Computer Science & Engg, Department

Nearest Shrunken Centroid as Feature Selection of Microarray Data

Project title: Using Non-Local Connectivity Information to Identify Nascent Disease Outbreaks

Sound Analysis Research at LabROSA

Personalized Colorectal Cancer Survivability Prediction with Machine Learning Methods*

Knowledge Discovery and Data Mining Applied to Engineering Applications

Opleiding Informatica

The Long Tail of Recommender Systems and How to Leverage It

A NOVEL VARIABLE SELECTION METHOD BASED ON FREQUENT PATTERN TREE FOR REAL-TIME TRAFFIC ACCIDENT RISK PREDICTION

Statement of research interest

Mining Dynamic Relationships From Spatio-temporal Datasets: An Application to Brain fmri Data

Previously, when making inferences about the population mean,, we were assuming the following simple conditions:

A Concise Guide to Market

International Journal of Pharma and Bio Sciences A NOVEL SUBSET SELECTION FOR CLASSIFICATION OF DIABETES DATASET BY ITERATIVE METHODS ABSTRACT

Prediction of heart disease using k-nearest neighbor and particle swarm optimization.

An Empirical and Formal Analysis of Decision Trees for Ranking

Deep Networks and Beyond. Alan Yuille Bloomberg Distinguished Professor Depts. Cognitive Science and Computer Science Johns Hopkins University

Hacettepe University Department of Computer Science & Engineering

INTRODUCTION TO MACHINE LEARNING. Decision tree learning

Supervised Learner for the Prediction of Hi-C Interaction Counts and Determination of Influential Features. Tyler Yue Lab

Exploration of Linked Anomalies in Sensor Data for Suspicious Behavior Detection

INTRODUCTION TO ECONOMETRICS (EC212)

Diagnosis of Breast Cancer Using Ensemble of Data Mining Classification Methods

An Improved Algorithm To Predict Recurrence Of Breast Cancer

Data Mining of Access to Tetanus Toxoid Immunization Among Women of Childbearing Age in Ethiopia

Three principles of data science: predictability, computability, and stability (PCS)

Basic SPSS for Postgraduate

Forecasting Cyber Attacks with Imbalanced Data Sets and Different Time Granularities

Interpreting and Unifying Outlier Scores

Transcription:

Class Outlier Detection Zuzana Pekarčíková

Outline λ What is an Outlier? λ Applications of Outlier Detection λ Types of Outliers λ Outlier Detection Methods Types λ Basic Outlier Detection Methods λ High-dimensional Outlier Detection Methods λ Class Outlier Detection Random Forests λ Context-based Approach λ Weka Extensions by Frederick Livingston λ Weka CODB Extensions λ Our Implementation λ Our Future Plans

What is an Outlier? λ Definition of Hawkins [Hawkins 1980]: An outlier is an observation which deviates so much from the other observations as to

Applications of Outlier Detection λ Fraud detection λ Purchasing behavior of a credit card owner usually changes when the card is stolen λ Medicine λ Unusual symptoms or test results may indicate potential health problems of a patient λ Whether a particular test result is abnormal may depend on other characteristics of the patients λ Detecting measurement errors λ Data derived from sensors may contain measurement errors λ Removing such errors can be important in other data mining and data analysis tasks

Types of Outliers λ Point Anomalies λ An individual data instance can be considered as anomalous with respect to the rest o λ The simplest type of Outliers. λ Example: credit card fraud - detection

Types of Outliers λ Contextual Anomalies λ If a data instance is anomalous in a specific context, but not otherwise. Example: temp λ - t1 = t2, however t1 is in winter whereas t2 in summer

Types of Outliers λ Collective Anomalies λ A collection of related data instances is anomalous with respect to the entire data set. λ The individual data instances in a collective anomaly may not be anomalies by themse λ Example: λ human λ cardiogram

Outlier Detection Methods Types λ The labels associated with a data instance denote whether that instance is normal or anomalous λ Supervised Methods - availability of a training data set that has labeled instances for normal - as well as anomaly classes - building a predictive model for normal vs. anomaly classes problem is transforma - Problems: λ anomalous instances are far fewer than normal instances λ obtaining acurate labels for the anomaly class is challenging λ Semi-supervised Methods - training data has labeled instances only for the normal class λ Unsupervised Methods - no labels, most widely used - assumption: normal instances are far more frequent than anomalies in the test dat

Outlier Detection Methods λ Statistical Methods - normal data objects are generated by a statistical (stochastic) model, and data not fol - Example: statistical distribution: Gaussina λ Outliers are points that have a low probability to be generated by Gaussian distrib - Problems: Mean and standard deviation are very sensitive to outliers λ These values are computed for the complete data set (including potential outliers) - Advantage: existence of statistical λ proof why the object is an λ outlier λ µ DB

Outlier Detection Methods λ Proximity-Based Methods λ An object is an outlier if the proximity of the object to its neighbors significantly deviates fr - Distance-based Detection λ Radius r λ k nearest neighbors - Density-based Detection λ Relative density of object counted λ from density of its neighbors λ Clustering-Based Methods λ Normal data objects belong to large λ and dense clusters, whereas outliers λ belong to small or sparse clusters, λ or do not belong to any clusters. λ q

Outlier Detection Methods λ Classification-Based Methods - Main idea: training a classification model that can distinguish normal data from outlier - Problem: imbalanced classes - Solution: using one-class model classifier describe only the normal class and samp

Hight-dimensional Outlier Detection Methods Problems in high-dimensional: λ Relative contrast between distances decreases with increasing dimensionality λ Data are very sparse, almost all points are outliers λ Concept of neighborhood becomes meaningless Solutions: λ Use more robust distance functions and find full-dimensional outliers λ Find outliers in projections (subspaces) of the original feature space

High-dimensional Outlier Detection Methods ABOD angle-based outlier degree - Object o is an outlier if most other objects are located in similar directions - Object o is no outlier if many other objects are located in varying directions λ o λ oulier λ no outlier

Class Outlier Detection λ semantic outlier λ A semantic outlier is a data point, which behaves differently with other data points in the s

Class Outlier Detection λ Multi-class classification based anomaly detection techniques assume that the training da λ Anomaly detection techniques teach a classifier to distinguish between each normal class

Class Outlier Detection Random Forests λ Random Forests is an enensemble classification and regression approach. λ Ensemble methods use multiple models to obtain better predictive performance than coul λ Random Forests: - consists of many classification trees - 1/3 of all samples are left out OOB (out of bag) data for classification error - each tree is constructed by a different bootstrap sample from the original data - all data are run down the tree and proximities are computed for each pair of cases T - Outliers are cases whose proximities to all other cases in the data are generally smal - Used in outliers relative to their class an outlier in class j is a case whose prosimitie

Context-based Approach Is the temperature 28 C outlier? If we are in Brno in summer NO If we are in Brno in winter YES it dependes on the location and time CONTEXT

Context-based Approach Contextual outlier significantly deviates from model with respect to a specific contex Generally the attributes of the data objects are divided into two groups: λ Contextual attributes: Define the object s context. In the example, the contextual attributes may λ Behavioral attributes: Define the object s characteristics, and are used to evaluate whether the Contextual outlier detection methods can be devided into two categories according to whe λ Transforming Contextual Outlier Detection to Conventional Outlier Detectio - The context can be easily identified. λ Modeling Normal Behavior with Respect to Contexts - The context identification is more difficult

Context-based Approach Transforming Contextual Outlier Detection to Conventional Outlier Detection General Idea: λ Evaluation wheater the object is an outlier is done in two steps: - identifycation the context of the object using the contextual attributes - calculation the outlier score for the object in the context using a conventional outlier d Example: λ In customerrelationship management, we can detect outlier customers in the context of customer g λ 3 attributes: - contextual: age group (25, 25-45, 45-65, and over 65), post code - behavioral: number of transactions per yer λ Is customer c outlier? - locate the context of c using the attributes age group and post code - compare c with the other customers in the same group, and use a conventional outlier detection method

Context-based Approach Modeling Normal Behavior with Respect to Contexts Context is not easy to identify Example: λ An online store records the sequence of products seached for by each customer. Outlier behavior is - contexts cannot be easily specified because it is unclear how many products browsed earlie General idea: λ modelation of normal behaviour with respect to contexts λ With using a training data set a method trains a model that predicts the expected behavior attribute Is an object outlier? λ We apply the model to the contextual attributes of the object. If the behavior attribute values deviate

+ Proximities in Random Forests λ It is one of the most important feature when consider Outlier Detection. λ It describes corelations between data elements. λ Proximities Matrix: - NxN, where N is the number of data set - One for all trees in the forest - The cell [8, 10] represents how often the elements 8 and 10 share together one node - The outlier score is computed as the average proximity value for each element. If this

+ Proximities in Random Forests - Example λ Four elements λ Five Random Forest Trees λ Table in the left is ordinary proximities matrix λ Table in the right is normalized by number of trees λ Element C is outlier because it has the minimum sharing notes with the other elements

+ Weka Extensions by Fred Livingston Implementation of Proximities Extensional classes: λ weka.classifiers.meta.baggingext λ weka.trees.randomforestext λ implement the proximities matrix The version which was used has implemented property prox_matrix and its computing as well, how

+ Our Implementation I λ GUI option for computing proximities matrix - It is possible to turn on the showing of proximities matrix as a new feature of RandomFor

+ Our Implementation II λ After computation of classification the proximities matrix will by shown if it is turn on.

+ Our Implementation III λ Ensemble Outlier Detection: - OutlierEnsembleEngine λ Is the main class whitch coordinate the whole computation. λ By the constructor all needed properties are loaded. - connector, data, subdatanum, odmarray, odmweights, odmweights, randattrnum λ Method compute() - for each data element and for each method computes the outlier score - for each data element then it is combined (with using connector) the re - returns the array of result ensemble outlier score for each data elemen

+ Our Implementation IV λ OutlierEnsembleEngine pseudocode: public Double[] compute() { foreach (Method m IN MethodArray) { Element[] randomsubdata = getrandomsubdata(); OutliersScoreByMethods[] = m.computeoutliersscores(randomsubdata); } outliersscoreresult = connector.compute(outliersscorebymethods, MethodWeig } return outliersscoreresult;

+ Our Implementation V - OutlierEnsembleConnector λ Abstract class with abstract method compute() λ Each class, which extends this one, computes the result outlier score for data element with λ OutlierEnsembleConnectorAddition class - extends OutlierEnsembleConnector - it uses as combination function the addition

+ Our Implementation VI - OutlierDetectionMethod class λ Abstract class λ Each class extending this one take care of computing the outlier score for data elements. λ OutlierDetectionMethodUsingRandomForest - Extends OutlierDetectionMethod - Uses the RandomTree class for computation outlier scores - Not implemented yet

+ Weka CODB Extensions λ Class Outlier Detection λ Distance based approach λ It is based on the Class Outlier Factor (COF) whitch represents the degree λ COF computes as a deviation of the instance from the instances of the sam

+ Weka CODB Extensions II λ PCL(T, K) is the Probability of the class label of the instance T with respect to the class la λ Deviation(T) is how much the instance T deviates from instances of the same class λ α and β are factors to control the importance and the effects of Deviation(T) and KDist(T) λ KDist(T) is the summation of distance between the instance T and its K nearest neighbor