Statistics 202: Data Mining. © Jonathan Taylor. Final review. Based in part on slides from the textbook and slides of Susan Holmes.


Final review
Based in part on slides from the textbook and slides of Susan Holmes.
December 5, 2012

Overview: before the midterm
- General goals of data mining
- Data types
- Preprocessing & dimension reduction
- Distances
- Multidimensional scaling
- Multidimensional arrays
- Decision trees
- Performance measures for classifiers
- Discriminant analysis

Overview: after the midterm
More classifiers:
- Rule-based classifiers
- Nearest-neighbour classifiers
- Naive Bayes classifiers
- Neural networks
- Support vector machines
- Random forests
- Boosting (AdaBoost / gradient boosting)
Clustering.
Outlier detection.

Rule-based classifiers: example
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Rule-based classifiers: concepts
- Coverage
- Accuracy
- Mutual exclusivity
- Exhaustivity
- Laplace accuracy
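To make the first two concepts concrete, here is a minimal sketch (with hypothetical toy records) computing the coverage and accuracy of rule R1 above:

# Hypothetical toy data: coverage and accuracy of
# R1: (Give Birth = no) and (Can Fly = yes) -> Birds
records = [
    {"give_birth": "no",  "can_fly": "yes", "label": "Birds"},
    {"give_birth": "no",  "can_fly": "yes", "label": "Birds"},
    {"give_birth": "no",  "can_fly": "yes", "label": "Fishes"},   # say, a flying fish
    {"give_birth": "no",  "can_fly": "no",  "label": "Reptiles"},
    {"give_birth": "yes", "can_fly": "no",  "label": "Mammals"},
]

covered = [r for r in records if r["give_birth"] == "no" and r["can_fly"] == "yes"]
coverage = len(covered) / len(records)                                 # fraction of records the rule fires on
accuracy = sum(r["label"] == "Birds" for r in covered) / len(covered)  # fraction of fired records it gets right
print(f"coverage = {coverage:.2f}, accuracy = {accuracy:.2f}")         # coverage = 0.60, accuracy = 0.67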

Nearest-neighbour classifiers
Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck.
- Compute the distance from the test record to the training records.
- Choose the k nearest records and classify by their majority vote.

Nearest-neighbour classifiers: choosing k
If k is too large, the neighbourhood may include points from other classes.
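A minimal kNN sketch matching this description (plain NumPy, Euclidean distance; the array names are illustrative):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=5):
    """Classify one test point by majority vote among its k nearest training records."""
    dists = np.linalg.norm(X_train - x_test, axis=1)  # distance to every training record
    nearest = np.argsort(dists)[:k]                   # indices of the k closest records
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]                 # majority vote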

Naive Bayes classifiers
Model:

P(Y = c \mid X_1 = x_1, \dots, X_p = x_p) \propto \left( \prod_{l=1}^{p} P(X_l = x_l \mid Y = c) \right) P(Y = c)

For continuous features, typically a one-dimensional QDA model is used (i.e. Gaussian within each class). For discrete features, use the Laplace-smoothed probabilities

P(X_j = l \mid Y = c) = \frac{\#\{i : X_{ij} = l, Y_i = c\} + \alpha}{\#\{i : Y_i = c\} + \alpha k}

where k is the number of levels of feature j.
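A minimal sketch of the discrete case (hypothetical names; α defaults to 1):

import numpy as np

def laplace_smoothed(X_col, y, level, c, alpha=1.0):
    """P(X_j = level | Y = c), Laplace smoothed; k = number of observed levels of X_j."""
    k = len(np.unique(X_col))
    in_class = (y == c)
    return ((X_col[in_class] == level).sum() + alpha) / (in_class.sum() + alpha * k)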

Neural networks: single layer (Artificial Neural Networks, ANN)

Neural networks: double layer

Support vector machines

Support vector machines
Solves the problem

\min_{\beta, \alpha, \xi} \ \|\beta\|_2^2 \quad \text{subject to} \quad y_i(x_i^T \beta + \alpha) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad \sum_{i=1}^{n} \xi_i \leq C

Support vector machines: non-separable problems
The ξ_i's can be removed from this problem, yielding

\min_{\beta, \alpha} \ \|\beta\|_2^2 + \gamma \sum_{i=1}^{n} \left(1 - y_i f_{\alpha,\beta}(x_i)\right)_+

where (z)_+ = \max(z, 0) is the positive-part function. Or, equivalently,

\min_{\beta, \alpha} \ \sum_{i=1}^{n} \left(1 - y_i f_{\alpha,\beta}(x_i)\right)_+ + \lambda \|\beta\|_2^2
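A direct evaluation of the second (penalized hinge-loss) form, as a sketch (names illustrative; here f_{\alpha,\beta}(x) = x^T\beta + \alpha and y_i ∈ {-1, +1}):

import numpy as np

def svm_objective(beta, alpha, X, y, lam=1.0):
    """Penalized hinge loss: sum_i (1 - y_i f(x_i))_+ + lam * ||beta||_2^2."""
    margins = y * (X @ beta + alpha)         # y_i * f_{alpha,beta}(x_i)
    hinge = np.maximum(1.0 - margins, 0.0)   # positive part (1 - y_i f(x_i))_+
    return hinge.sum() + lam * np.dot(beta, beta)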

Logistic vs. SVM
[Figure: the logistic loss and the SVM (hinge) loss plotted together over margins from -3 to 3.]
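A short sketch reproducing the two curves (assuming, as is standard for this comparison, that both are plotted against the margin m = y f(x)):

import numpy as np

m = np.linspace(-3, 3, 201)        # margin y * f(x)
hinge = np.maximum(1 - m, 0)       # SVM (hinge) loss
logistic = np.log(1 + np.exp(-m))  # logistic (binomial deviance) loss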

Ensemble methods: general idea

Ensemble methods: bagging / random forests
In this method, one takes several bootstrap samples (samples with replacement) of the data. For each bootstrap sample S_b, 1 ≤ b ≤ B, fit a model, retaining the classifier f_{*,b}. After all models have been fit, classify by majority vote:

f_*(x) = \text{majority vote of } (f_{*,b}(x))_{1 \leq b \leq B}

We also defined the OOB (out-of-bag) estimate of error.
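A minimal bagging sketch along these lines (illustrative; scikit-learn decision trees as the base classifier, integer class labels assumed):

import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagged_fit(X, y, B=25, seed=0):
    """Fit B trees, each on its own bootstrap sample (drawn with replacement)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(B):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample S_b
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagged_predict(models, x):
    """Majority vote of f_{*,b}(x) over the B fitted classifiers."""
    votes = Counter(int(m.predict(x.reshape(1, -1))[0]) for m in models)
    return votes.most_common(1)[0][0]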

Ensemble methods: illustrating AdaBoost
[Figures: the data points for training, with equal initial weights for each data point; subsequent AdaBoost rounds reweight the points.]

Ensemble methods: boosting as gradient descent
It turns out that boosting can be thought of as something like gradient descent: in some sense, the boosting algorithm is a steepest-descent algorithm for finding

\mathop{\mathrm{argmin}}_{f \in \mathcal{F}} \sum_{i=1}^{n} L(y_i, f(x_i))
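A minimal gradient-boosting sketch in this spirit, using squared-error loss so that the negative gradient is just the current residual (stumps as weak learners; names illustrative):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_rounds=100, lr=0.1):
    """Each round fits a stump to the negative gradient of L(y, f) = (y - f)^2 / 2, i.e. the residual."""
    f0 = y.mean()                             # start from the constant fit
    f = np.full(len(y), f0)
    stumps = []
    for _ in range(n_rounds):
        residual = y - f                      # negative gradient at the current fit
        stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)
        f += lr * stump.predict(X)            # steepest-descent step in function space
        stumps.append(stump)
    return f0, stumps                         # predict via f0 + lr * sum of stump predictions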

What is cluster analysis?
Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups: intra-cluster distances are minimized, while inter-cluster distances are maximized.

Clustering: types of clustering
- Partitional: a division of the data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
- Hierarchical: a set of nested clusters organized as a hierarchical tree; each data object is in exactly one subset for any horizontal cut of the tree.

Cluster analysis: a partitional example
[Textbook Figure 14.4: simulated data in the plane, clustered into three classes (represented by orange, blue and green) by the K-means clustering algorithm.]

K-means: choosing K with the gap statistic
[Textbook Figure 14.11. Left panel: observed (green) and expected (blue) values of log W_K for the simulated data of Figure 14.4; both curves are translated to equal zero at one cluster. Right panel: the gap curve, equal to the difference between the observed and expected values of log W_K. The gap estimate K* is the smallest K producing a gap within one standard deviation of the gap at K + 1.]
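The selection rule from that caption, as a sketch (assuming gap[k] and its standard error se[k] have already been computed for K = 1, ..., Kmax by comparing log W_K on the data with its average over uniform reference datasets):

def gap_estimate(gap, se):
    """Smallest K with Gap(K) >= Gap(K+1) - se(K+1); gap and se are lists indexed by K - 1."""
    for k in range(len(gap) - 1):
        if gap[k] >= gap[k + 1] - se[k + 1]:
            return k + 1                     # K is 1-based
    return len(gap)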

K-medoids: algorithm
Same as K-means, except that the centroid is estimated not by the average, but by the observation having minimum total pairwise distance to the other cluster members.
Advantages: the centroid is one of the observations (useful, e.g., when features are 0 or 1), and one only needs pairwise distances for K-medoids rather than the raw observations.
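A sketch of the medoid update step (assuming D is a precomputed pairwise-distance matrix and members lists the indices of one cluster):

import numpy as np

def medoid(D, members):
    """Return the cluster member with minimum total distance to the other members."""
    sub = D[np.ix_(members, members)]              # pairwise distances within the cluster
    return members[int(np.argmin(sub.sum(axis=1)))]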

Silhouette plot

Cluster analysis: a hierarchical example
[Textbook Figure 14.12: dendrogram from agglomerative hierarchical clustering with average linkage applied to the human tumor microarray data; the leaves are labelled by tumor type (LEUKEMIA, BREAST, CNS, NSCLC, RENAL, MELANOMA, OVARIAN, PROSTATE, COLON, ...). Hierarchical methods impose a hierarchical structure whether or not one exists in the data.]

Hierarchical clustering: concepts
- Top-down vs. bottom-up
- Different linkages: single linkage (minimum distance), complete linkage (maximum distance)
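A short sketch contrasting the two linkages with SciPy (illustrative data; linkage takes a condensed distance matrix, and fcluster performs a horizontal cut of the tree):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.default_rng(0).normal(size=(20, 2))   # placeholder data
d = pdist(X)                                        # condensed pairwise distances

Z_single = linkage(d, method="single")              # merge on minimum inter-cluster distance
Z_complete = linkage(d, method="complete")          # merge on maximum inter-cluster distance

labels = fcluster(Z_complete, t=3, criterion="maxclust")  # cut into 3 clusters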

Mixture models
Similar to K-means, but assignment to clusters is soft. Often applied with a multivariate normal model within each class. The EM algorithm is used to fit the model:
- Estimate responsibilities.
- Estimate within-class parameters, replacing the (unobserved) labels with responsibilities.
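A sketch of one EM iteration for a two-component, one-dimensional Gaussian mixture (illustrative names; scipy.stats.norm supplies the densities):

import numpy as np
from scipy.stats import norm

def em_step(x, pi, mu, sigma):
    """One EM iteration: E-step computes responsibilities, M-step re-estimates parameters."""
    # E-step: responsibility r[i, k] of component k for point i
    dens = np.column_stack([pi[k] * norm.pdf(x, mu[k], sigma[k]) for k in range(2)])
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: weighted estimates, responsibilities standing in for the unobserved labels
    nk = r.sum(axis=0)
    pi = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return pi, mu, sigma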

Model-based clustering: summary
1. Choose a type of mixture model (e.g. multivariate normal) and a maximum number of clusters, K.
2. Use a specialized hierarchical clustering technique: model-based hierarchical agglomeration.
3. Use the clusters from the previous step to initialize EM for the mixture model.
4. Use BIC to compare different mixture models and models with different numbers of clusters (see the sketch below).
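A minimal sketch of step 4 with scikit-learn (its GaussianMixture initializes with k-means rather than model-based agglomeration, and lower BIC is better in its convention):

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(200, 2))   # placeholder data

bics = {k: GaussianMixture(n_components=k, covariance_type="full", random_state=0).fit(X).bic(X)
        for k in range(1, 9)}
best_k = min(bics, key=bics.get)                     # number of clusters minimizing BIC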

Outliers

Outliers: general steps
- Build a profile of the normal behaviour.
- Use these summary statistics to detect anomalies, i.e. points whose characteristics are very far from the normal profile.
General schemes involve a statistical model of normal behaviour, with "far" measured in terms of likelihood.
Example: Grubbs' test chooses an outlier threshold that controls the Type I error of any declared outliers when the data actually do follow the model.
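A sketch of the standard two-sided Grubbs test under a normal model; the critical value derives from the t-distribution, so the Type I error is α when the data really are normal:

import numpy as np
from scipy import stats

def grubbs_outlier(x, alpha=0.05):
    """Flag the most extreme point if the Grubbs statistic exceeds its critical value."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    G = np.max(np.abs(x - x.mean())) / x.std(ddof=1)   # test statistic
    t = stats.t.ppf(1 - alpha / (2 * N), N - 2)        # upper t critical value, N - 2 df
    G_crit = (N - 1) / np.sqrt(N) * np.sqrt(t**2 / (N - 2 + t**2))
    return G > G_crit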
