Data Mining with Weka

Data Mining with Weka Class 2 Lesson 1 Be a classifier! Ian H. Witten Department of Computer Science University of Waikato New Zealand weka.waikato.ac.nz

Lesson 2.1: Be a classifier! Course outline: Class 1 Getting started with Weka; Class 2 Evaluation (Lesson 2.1 Be a classifier!, Lesson 2.2 Training and testing, Lesson 2.3 More training/testing, Lesson 2.4 Baseline accuracy, Lesson 2.5 Cross validation, Lesson 2.6 Cross validation results); Class 3 Simple classifiers; Class 4 More classifiers; Class 5 Putting it all together

Lesson 2.1: Be a classifier! Interactive decision tree construction. Load segment-challenge.arff and look at the dataset. Select UserClassifier (a tree classifier). Use the test set segment-test.arff. Examine the data visualizer and the tree visualizer. Plot region-centroid-row vs. intensity-mean. Use the Rectangle, Polygon and Polyline selection tools to make several selections. Right-click in the Tree visualizer and Accept the tree. Over to you: how well can you do?

Lesson 2.1: Be a classifier! Build a tree: what strategy did you use? Given enough time, you could produce a "perfect" tree for the dataset, but would it perform well on the test data? Course text: Section 11.2, Do it yourself: the User Classifier

Data Mining with Weka Class 2 Lesson 2 Training and testing Ian H. Witten Department of Computer Science University of Waikato New Zealand weka.waikato.ac.nz

Lesson 2.2: Training and testing [Diagram: training data → ML algorithm → classifier; test data → classifier → evaluation results; classifier → Deploy!]

Lesson 2.2: Training and testing [Same diagram as above] Basic assumption: training and test sets are produced by independent sampling from an infinite population

Lesson 2.2: Training and testing Use J48 to analyze the segment dataset. Open file segment-challenge.arff. Choose the J48 decision tree learner (trees > J48). Supplied test set: segment-test.arff. Run it: 96% accuracy. Evaluate on the training set: 99% accuracy. Evaluate on a percentage split: 95% accuracy. Do it again: you get exactly the same result!
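The same experiment can be scripted outside the Explorer. Here is a minimal sketch using Weka's Java API (standard Weka 3 classes; the ARFF file paths are assumptions about where the data files sit):

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainAndTest {
    public static void main(String[] args) throws Exception {
        // Load the training set and the supplied test set (paths are assumptions)
        Instances train = DataSource.read("segment-challenge.arff");
        Instances test  = DataSource.read("segment-test.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // Build a J48 decision tree on the training data
        J48 tree = new J48();
        tree.buildClassifier(train);

        // Evaluate the tree on the supplied test set
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);
        System.out.printf("Accuracy on test set: %.1f%%%n", eval.pctCorrect());
    }
}
```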

Lesson 2.2: Training and testing Basic assumption: training and test sets are sampled independently from an infinite population. Just one dataset? Hold some out for testing. Expect slight variation in results, but Weka produces the same results each time (J48 on the segment-challenge dataset). Course text: Section 5.1, Training and testing

Data Mining with Weka Class 2 Lesson 3 Repeated training and testing Ian H. Witten Department of Computer Science University of Waikato New Zealand weka.waikato.ac.nz

Lesson 2.3: Repeated training and testing Evaluate J48 on segment-challenge. With segment-challenge.arff and J48 (trees > J48), set the percentage split to 90%. Run it: 96.7% accuracy. Repeat with seeds 2, 3, 4, 5, 6, 7, 8, 9, 10 ([More options]). Accuracies for seeds 1-10: 0.967, 0.940, 0.940, 0.967, 0.953, 0.967, 0.920, 0.947, 0.933, 0.947

Lesson 2.3: Repeated training and testing Evaluate J48 on segment-challenge. Sample mean: \bar{x} = \frac{\sum x_i}{n}. Variance: \sigma^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}. Standard deviation: \sigma. For the ten accuracies 0.967, 0.940, 0.940, 0.967, 0.953, 0.967, 0.920, 0.947, 0.933, 0.947 this gives \bar{x} = 0.949, \sigma = 0.018

Lesson 2.3: Repeated training and testing Basic assumption: training and test sets are sampled independently from an infinite population. Expect slight variation in results; get it by setting the random number seed. Can calculate the mean and standard deviation experimentally
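For reference, a sketch of that repeated experiment with Weka's Java API: shuffle the data with each seed, hold out the last 10% for testing, and compute the sample mean and standard deviation of the ten accuracies (the Explorer's percentage-split option may partition the data slightly differently, and the file path is an assumption):

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RepeatedHoldout {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("segment-challenge.arff");  // path is an assumption
        data.setClassIndex(data.numAttributes() - 1);

        int n = 10;
        double[] acc = new double[n];
        for (int seed = 1; seed <= n; seed++) {
            // Shuffle, then use 90% for training and the remaining 10% for testing
            Instances copy = new Instances(data);
            copy.randomize(new Random(seed));
            int trainSize = (int) Math.round(copy.numInstances() * 0.9);
            Instances train = new Instances(copy, 0, trainSize);
            Instances test  = new Instances(copy, trainSize, copy.numInstances() - trainSize);

            J48 tree = new J48();
            tree.buildClassifier(train);
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(tree, test);
            acc[seed - 1] = eval.pctCorrect() / 100.0;
        }

        // Sample mean and standard deviation of the ten accuracy estimates
        double mean = 0;
        for (double a : acc) mean += a;
        mean /= n;
        double var = 0;
        for (double a : acc) var += (a - mean) * (a - mean);
        var /= (n - 1);
        System.out.printf("mean = %.3f, std dev = %.3f%n", mean, Math.sqrt(var));
    }
}
```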

Data Mining with Weka Class 2 Lesson 4 Baseline accuracy Ian H. Witten Department of Computer Science University of Waikato New Zealand weka.waikato.ac.nz

Lesson 2.4: Baseline accuracy Use the diabetes dataset and the default holdout. Open file diabetes.arff. Test option: Percentage split. Try these classifiers: trees > J48 76%, bayes > NaiveBayes 77%, lazy > IBk 73%, rules > PART 74% (we'll learn about them later). 768 instances (500 negative, 268 positive). Always guess "negative": 500/768 = 65%. rules > ZeroR: the most likely class!
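A sketch of the same comparison in code, assuming the standard Weka Java API and a shuffled 66%/34% split (the Explorer's default percentage split); the file path and random seed are assumptions:

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.lazy.IBk;
import weka.classifiers.rules.PART;
import weka.classifiers.rules.ZeroR;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BaselineComparison {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("diabetes.arff");   // path is an assumption
        data.setClassIndex(data.numAttributes() - 1);

        // Shuffle, then use 66% for training and 34% for testing
        data.randomize(new Random(1));
        int trainSize = (int) Math.round(data.numInstances() * 0.66);
        Instances train = new Instances(data, 0, trainSize);
        Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

        // ZeroR is the baseline; the other classifiers should be judged against it
        Classifier[] models = { new ZeroR(), new J48(), new NaiveBayes(), new IBk(), new PART() };
        for (Classifier model : models) {
            model.buildClassifier(train);
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(model, test);
            System.out.printf("%-12s %.1f%%%n", model.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}
```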

Lesson 2.4: Baseline accuracy Sometimes the baseline is best! Open supermarket.arff and blindly apply: rules > ZeroR 64%, trees > J48 63%, bayes > NaiveBayes 63%, lazy > IBk 38% (!!), rules > PART 63%. The attributes are not informative. Don't just apply Weka to a dataset: you need to understand what's going on!

Lesson 2.4: Baseline accuracy Consider whether differences are likely to be significant. Always try a simple baseline, e.g. rules > ZeroR. Look at the dataset. Don't blindly apply Weka: try to understand what's going on!

Data Mining with Weka Class 2 Lesson 5 Cross validation Ian H. Witten Department of Computer Science University of Waikato New Zealand weka.waikato.ac.nz

Lesson 2.5: Cross validation Can we improve upon repeated holdout (i.e. reduce variance)? Two ideas: cross validation, and stratified cross validation

Lesson 2.5: Cross validation Repeated holdout (as in Lesson 2.3: hold out 10% for testing, repeat 10 times with different random seeds)

Lesson 2.5: Cross validation 10-fold cross validation: divide the dataset into 10 parts (folds), hold out each part in turn, and average the results. Each data point is used once for testing and 9 times for training. Stratified cross validation: ensure that each fold has the right proportion of each class value

Lesson 2.5: Cross validation After cross validation, Weka outputs an extra model built on the entire dataset. [Diagram: 10 times, 90% of data → ML algorithm → classifier, evaluated on the remaining 10% → evaluation results; an 11th time, 100% of data → ML algorithm → classifier → Deploy!]

Lesson 2.5: Cross validation Cross validation is better than repeated holdout, and stratified cross validation is even better. With 10-fold cross validation, Weka invokes the learning algorithm 11 times. Practical rule of thumb: lots of data? Use percentage split. Otherwise, use stratified 10-fold cross validation. Course text: Section 5.3, Cross-validation
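Programmatically, stratified 10-fold cross validation is a single call on an Evaluation object. A minimal sketch with Weka's Java API (the file path is an assumption):

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidate {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("diabetes.arff");   // path is an assumption
        data.setClassIndex(data.numAttributes() - 1);

        // Stratified 10-fold cross validation: J48 is built and tested on each
        // of the 10 train/test splits, and the results are pooled
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.printf("10-fold cross validation accuracy: %.1f%%%n", eval.pctCorrect());

        // As in the Explorer, the model you would actually deploy is built
        // one more time (the 11th) on all of the data
        J48 finalModel = new J48();
        finalModel.buildClassifier(data);
    }
}
```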

Data Mining with Weka Class 2 Lesson 6 Cross validation results Ian H. Witten Department of Computer Science University of Waikato New Zealand weka.waikato.ac.nz

Lesson 2.6: Cross validation results Is cross validation really better than repeated holdout? Diabetes dataset: baseline accuracy (rules > ZeroR) 65.1%; trees > J48 with 10-fold cross validation 73.8%. With different random number seeds 1-10, J48 gives: 73.8, 75.0, 75.5, 75.5, 74.4, 75.6, 73.6, 74.0, 74.5, 73.0 (% accuracy)

Lesson 2.6: Cross validation results Sample mean: \bar{x} = \frac{\sum x_i}{n}. Variance: \sigma^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}. Standard deviation: \sigma. Holdout (10%): 75.3, 77.9, 80.5, 74.0, 71.4, 70.1, 79.2, 71.4, 80.5, 67.5, giving \bar{x} = 74.8, \sigma = 4.6. Cross validation (10-fold): 73.8, 75.0, 75.5, 75.5, 74.4, 75.6, 73.6, 74.0, 74.5, 73.0, giving \bar{x} = 74.5, \sigma = 0.9
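The spread of the cross validation estimate can be checked in the same way as the holdout estimate in Lesson 2.3: run 10-fold cross validation with ten different seeds and compute the mean and standard deviation of the accuracies. A sketch, reusing the pattern from the earlier snippets (file path assumed):

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidationSpread {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("diabetes.arff");   // path is an assumption
        data.setClassIndex(data.numAttributes() - 1);

        int n = 10;
        double[] acc = new double[n];
        for (int seed = 1; seed <= n; seed++) {
            // One full 10-fold cross validation per random seed
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(seed));
            acc[seed - 1] = eval.pctCorrect();
        }

        // Mean and standard deviation of the ten cross validation estimates
        double mean = 0;
        for (double a : acc) mean += a;
        mean /= n;
        double var = 0;
        for (double a : acc) var += (a - mean) * (a - mean);
        System.out.printf("mean = %.1f%%, std dev = %.1f%%%n", mean, Math.sqrt(var / (n - 1)));
    }
}
```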

Lesson 2.6: Cross validation results Why 10-fold? E.g. 20-fold: 75.1%. Cross validation really is better than repeated holdout: it reduces the variance of the estimate

Data Mining with Weka Department of Computer Science University of Waikato New Zealand Creative Commons Attribution 3.0 Unported License creativecommons.org/licenses/by/3.0/ weka.waikato.ac.nz