Machine Learning
Robert Stengel
Robotics and Intelligent Systems, MAE 345, Princeton University, 2017

A.K.A. Artificial Intelligence

Outline:
- Unsupervised learning
  - Cluster analysis
  - Patterns, clumps, and joining
- Supervised learning
  - Graph/tree search
  - Hypothesis testing
  - Linear discriminant
  - Nearest-neighbor method
  - Estimating classification errors

[Scatter plot: expression of gene D14657 vs. gene M94363, tumor and normal samples]

Copyright 2017 by Robert Stengel. All rights reserved. For educational use only. http://www.princeton.edu/~stengel/mae345.html

Some Machine Learning Objectives
- Logical inference
- Classification
- Pattern recognition
- Image processing
- System modeling
- Decision analysis
- Data representation
- Linguistic translation
- Explainable AI

Old-Fashioned A.I.: Expert Systems
- Communication/information theory
- Decision rules
- Graph and tree searches
- Asymmetric structure
- Explanation facility

Trendy A.I.: Deep-Learning Neural Networks
- Unsupervised shallow networks
- Supervised shallow networks
- Back-propagation
- Associative/recurrent networks

Explainable Artificial Intelligence?

Classification Objectives
- Class comparison: identify feature sets for predefined classes
- Class prediction: develop a mathematical function/algorithm that predicts class membership for a novel feature set
- Class discovery: identify new classes, sub-classes, or features related to classification objectives

DNA Microarray Chip
Glass plate with short strands of synthesized DNA arrayed in spots (probes) across the surface. Typically:
- A million spots containing different nucleotide sequences
- Each spot contains 10^6 to 10^7 strands of the same sequence
- 25 nucleotides (base pairs) in each strand
- Strands are short segments of ~20,000 genes; 10-20 probes per gene

See Supplemental Material for Lecture 15.

Microarray Processing
- RNA from a biological sample (the target) is reverse transcribed to cDNA, transcribed to cRNA, labeled, and hybridized to complementary nucleotides on the chip
- The array is washed, stained, and scanned to quantify the expression level of genes in the sample
- Perfect and mismatched features for each gene in separate probes (base pairing: A-T, C-G)

Detection of Gene Expression Level in cDNA (from RNA)
- Each tissue sample is evaluated by a separate microarray
- Intensity of a dot represents over- or under-expression of an RNA gene transcript in the sample

[Scatter plot: expression of gene H57136 vs. gene M96839, tumor and normal samples]

Class Comparison
Feature sets for predefined classes:
- Group A: samples from tumor tissue
- Group B: samples from normal tissue
- Genes overexpressed in Group A
- Genes overexpressed in Group B

[Figure: genes overexpressed in A vs. overexpressed in B; up in normal vs. down in normal]

Class Prediction
Algorithm that predicts class membership for a novel feature set:
- Genes of a new sample are analyzed
- Is the new sample in Group A or Group B?

Class Discovery
New features revealed in classification:
- New class in the universal set?
- Novel sample type (e.g., antibody) correlates with a group?
- Novel characteristic (e.g., gender, age, or metastasis) correlates with a group?

Example: tissue sample features revealed by staining (histology)

Example for Data Classification
Data set characterized by two features.

Clustering of Data
- What characterizes a cluster?
- How many clusters are there?

Discriminants of Data
Where are the boundaries between sets?

The Data Set Revealed
The discriminant is the Delaware River.

Towns and Crossroads of Pennsylvania and New Jersey

Choosing Features for Classification
- How many? How strong?
- Correlation between strong and weak features
- Degree of overlap
- Use of exogenous information for selection
- Statistical significance
- Closeness to boundaries

To distinguish New Jersey from Pennsylvania, we could consider:
- Longitude
- Latitude
- Altitude
- Temperature
- Population
- Number of fast-food stores
- Cultural factors
- Zip code

Recall: Membership in a Set
- A = a particular set in U, defined by a list, a rule, or a membership function
- Universal set = all guests at a party
- Particular sets = distinguishing features of guests

Distorted Membership Functions (via Photoshop): Photo
Ambiguity and uncertainty in data sets to be classified.

Distorted Membership Functions (via Photoshop): Map

Characteristics of Classification Features
Strong feature:
- Individual feature provides good classification
- Minimal overlap of feature values in each class
- Significant difference in class mean values
- Low variance within each class

Additional features:
- Orthogonal feature (low correlation) adds new information to the set
- Co-expressed feature (high correlation) is redundant; averaging reduces error

Feature Sets
- The best line or curve may classify with significant error
- The best plane or surface classifies with equal or less error

[Scatter plot: gene D14657 vs. gene M94363, tumor and normal samples]

Separable Sets
- Gene analysis (2-D)
- Bacterial response to antibiotics (3-D)

[Scatter plot: average value of down-regulated vs. up-regulated genes (primary colon cancer), primary colon cancer and normal mucosa samples]

Expected Error in Classification
- Minimum possible error with a statistically optimal discriminant (e.g., the Delaware River), plus
- Error due to the constraint imposed by a sub-optimal discriminant (e.g., straight vs. curved line), plus
- Error due to sampling (i.e., number and distribution of points)

Errors in Classification
Over-/under-fitting:
- Excessive/inadequate sensitivity to details in the training data set
- Lack of generalization to novel data

Validation:
- Train with less than all available data
- Reserve some data for evaluation of the trained classifier
- Vary the sets used for training and validation

Validation of Classifier
Train, validate, and test; reserve some data for evaluation of the trained classifier.
- Train with A, test with B
  - A: training set (or sample)
  - B: novel set (or sample)
- Vary the sets used for training and validation
- Leave-one-out validation (combined validation and test), sketched below:
  - Remove a single sample
  - Train on the remaining samples
  - Does the trained classifier identify the removed sample?
  - Repeat, removing each sample one-by-one

3 x 3 Confusion Matrix
Number of cases predicted to be in each class vs. actual numbers.

                   Actual Class
Predicted Class   Cats   Dogs   Rabbits
Cats                5      2       0
Dogs                3      3       2
Rabbits             0      1      11

Interpretation: Actually, there are
- 8 cats: 5 predicted to be cats, 3 to be dogs, and none to be rabbits
- 6 dogs: 2 predicted to be cats, 3 to be dogs, and 1 to be a rabbit
- 13 rabbits: none predicted to be cats, 2 to be dogs, and 11 to be rabbits
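To make the leave-one-out procedure concrete, here is a minimal sketch in Python. The slide does not specify a classifier, so a plain nearest-centroid rule is assumed; the arrays `X` (samples x features) and `y` (labels) are illustrative names.

```python
import numpy as np

def nearest_centroid_predict(X_train, y_train, x_new):
    """Assign x_new to the class whose training centroid is closest."""
    classes = np.unique(y_train)
    centroids = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    distances = np.linalg.norm(centroids - x_new, axis=1)
    return classes[np.argmin(distances)]

def leave_one_out_accuracy(X, y):
    """Remove each sample in turn, train on the rest, and test on it."""
    correct = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i          # all samples except i
        y_hat = nearest_centroid_predict(X[mask], y[mask], X[i])
        correct += (y_hat == y[i])
    return correct / len(X)

# Toy example: two well-separated 2-D classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
print(leave_one_out_accuracy(X, y))   # close to 1.0 for separable sets
```

Every sample serves as the "novel set" exactly once, so the accuracy estimate uses all the data without ever testing on a training point.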

Classification of Overlapping Sets
- Tumor = positive; Normal = negative
- Altering the discriminant changes the classification errors
- In the example, the classification error can never be zero with a simple discriminant

False Classification May Be Inevitable
[Scatter plot: gene D14657 vs. gene M94363, tumor and normal samples, with regions labeled TP, FP, TN, FN]

Categories of Classification Performance (2 x 2 Confusion Matrix)

                          Actual Class
Predicted Class    Positive               Negative
Positive           True Positive (TP)     False Positive (FP)
Negative           False Negative (FN)    True Negative (TN)

Number in predicted class:
- # Predicted(+) = # TP + # FP
- # Predicted(-) = # FN + # TN

Number in actual class:
- # Actual(+) = # TP + # FN
- # Actual(-) = # FP + # TN

Measures of Classification Performance (same 2 x 2 confusion matrix)

$\text{Sensitivity (\%/100)} = \dfrac{\#\,\text{True Positive}}{\#\,\text{Actual Positive}}$

$\text{Specificity (\%/100)} = \dfrac{\#\,\text{True Negative}}{\#\,\text{Actual Negative}}$
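A minimal sketch of these counts and ratios in NumPy; the label arrays are illustrative, with 1 = positive (tumor) and 0 = negative (normal).

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """2 x 2 confusion-matrix counts; labels: 1 = positive, 0 = negative."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    return tp, fp, fn, tn

tp, fp, fn, tn = confusion_counts([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
sensitivity = tp / (tp + fn)   # true-positive rate: TP / # actual positives
specificity = tn / (tn + fp)   # true-negative rate: TN / # actual negatives
print(sensitivity, specificity)
```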

Measures of Classification Performance

$\text{Accuracy (\%/100)} = \dfrac{\#\,\text{True Positive} + \#\,\text{True Negative}}{\#\,\text{Actual Positive} + \#\,\text{Actual Negative}}$

$\text{Positive Predictive Value (\%/100)} = \dfrac{\#\,\text{True Positive}}{\#\,\text{Predicted Positive}}$

$\text{Negative Predictive Value (\%/100)} = \dfrac{\#\,\text{True Negative}}{\#\,\text{Predicted Negative}}$

Receiver Operating Characteristic (ROC) Curve*
- Comparison of 4 discriminant functions
- True-positive rate (sensitivity) vs. false-positive rate (1 - specificity) for a varying parameter (or criterion) of the discriminant function
- The more area under the curve, the better the discriminant function [ideal AUC = 1]
- For a given discriminant function, choose the criterion that is closest to (0, 1) [perfect classification] OR farthest from random

* Devised during WWII to evaluate radar target detection.
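A minimal sketch of how an ROC curve and its AUC can be computed by sweeping the decision criterion over a scalar discriminant score. The scores and labels are illustrative; a real analysis might use a library routine such as scikit-learn's `roc_curve`.

```python
import numpy as np

def roc_curve_points(scores, labels):
    """Sweep the decision threshold over all scores; return (FPR, TPR) arrays.
    A sample is called positive when its score meets or exceeds the threshold."""
    thresholds = np.sort(np.unique(scores))[::-1]
    pos, neg = np.sum(labels == 1), np.sum(labels == 0)
    fpr, tpr = [0.0], [0.0]
    for t in thresholds:
        called_pos = scores >= t
        tpr.append(np.sum(called_pos & (labels == 1)) / pos)
        fpr.append(np.sum(called_pos & (labels == 0)) / neg)
    return np.array(fpr), np.array(tpr)

rng = np.random.default_rng(1)
labels = np.array([1] * 50 + [0] * 50)
scores = np.concatenate([rng.normal(1.0, 1.0, 50),    # tumor scores, shifted up
                         rng.normal(0.0, 1.0, 50)])   # normal scores
fpr, tpr = roc_curve_points(scores, labels)
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2) # trapezoidal area
print(f"AUC = {auc:.2f}")                             # roughly 0.76 for this overlap
```

Each threshold gives one (FPR, TPR) point; lowering the threshold trades false negatives for false positives, tracing the curve from (0, 0) to (1, 1).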

Unsupervised Learning
- Learning depends on the closeness of related features
- Previously unknown correlations or features may be detected
- The meaning of a classification is revealed only after learning, via exogenous knowledge
- The same answer is given for all questions

Cluster Analysis (Unsupervised Learning)
- Recognize patterns within data sets
- Group data points that are close to each other (see the k-means sketch below)
  - Hierarchical trees
  - Two-way clustering
  - k-means clustering
- Data compaction
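As one concrete instance of the clustering methods listed above, here is a minimal sketch of Lloyd's k-means algorithm; the random initialization and the toy data are illustrative.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate point assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (Euclidean distance)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = np.argmin(d, axis=1)
        # Move each centroid to the mean of its assigned points
        new_centroids = np.array([X[assign == j].mean(axis=0)
                                  if np.any(assign == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):   # converged
            break
        centroids = new_centroids
    return centroids, assign

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(3, 0.5, (30, 2))])
centroids, assign = k_means(X, k=2)
print(centroids)   # near (0, 0) and (3, 3)
```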

Pattern Recognition
Minimum spanning tree:
- Find the smallest total edge length
- Eliminate inconsistent edges (e.g., A-B much longer than the others)
- Delete noisy points (e.g., a bubble-chamber track)
- Recognize and span gaps
- Delete necks by diameter comparison
- Group by similar density
... plus other methods of digital image analysis (shapes, edges, ...)

Distance Measures Between Data Points
Distance between real vectors $x_1$ and $x_2$:
- Euclidean distance: $\sqrt{(x_1 - x_2)^T (x_1 - x_2)}$
- Weighted Euclidean distance: $\sqrt{(x_1 - x_2)^T Q (x_1 - x_2)}$
- Squared Euclidean distance: $(x_1 - x_2)^T (x_1 - x_2)$
- Manhattan distance: $\sum_i |x_{1i} - x_{2i}|$
- Chebychev distance: $\max_i |x_{1i} - x_{2i}|$

Distance between categorical vectors $x_1$ and $x_2$:
- Categorical disagreement distance: (number of components with $x_{1i} \ne x_{2i}$), normalized by the number of components
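A minimal sketch of these distance measures in NumPy; `Q` stands for a positive-definite weighting matrix, and all names are illustrative.

```python
import numpy as np

def euclidean(x1, x2):
    return np.sqrt((x1 - x2) @ (x1 - x2))

def weighted_euclidean(x1, x2, Q):
    d = x1 - x2
    return np.sqrt(d @ Q @ d)          # Q: positive-definite weighting matrix

def squared_euclidean(x1, x2):
    return (x1 - x2) @ (x1 - x2)

def manhattan(x1, x2):
    return np.sum(np.abs(x1 - x2))

def chebychev(x1, x2):
    return np.max(np.abs(x1 - x2))

def categorical_disagreement(x1, x2):
    """Fraction of components on which two category vectors disagree."""
    return np.mean(np.asarray(x1) != np.asarray(x2))

x1, x2 = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 3.0])
print(euclidean(x1, x2), manhattan(x1, x2), chebychev(x1, x2))
```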

Hierarchical Trees (Dendrograms)
Classification based on distance between points.

Top-down evolution:
- Begin with the 2 best clusters
- Plot against linkage distance, e.g., the distance between cluster centroids:
  $\bar{x}_{\text{cluster}} = \dfrac{1}{N}\sum_{i=1}^{N} x_i$
- Divide each cluster into the 2 best clusters, until arriving at individuals

Bottom-up evolution (see the sketch below):
- Start with each point in the set
- Link each point to a neighbor
  - Single linkage: distance between nearest neighbors in clusters
  - Complete linkage: distance between farthest neighbors in clusters
  - Pair-group average/centroid linkage
- Link pairs to the closest pairs
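A minimal sketch of bottom-up (agglomerative) clustering using SciPy's hierarchical-clustering routines, assuming SciPy is available; the data are illustrative, and the `method` argument selects single, complete, or average linkage as described above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (10, 2)),    # cluster near the origin
               rng.normal(2, 0.3, (10, 2))])   # cluster near (2, 2)

# Bottom-up merge tree; 'single' = nearest-neighbor linkage,
# 'complete' = farthest-neighbor, 'average' = pair-group average
Z = linkage(X, method='single')

# Cut the tree to obtain a flat assignment into 2 clusters
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)   # first 10 points in one cluster, last 10 in the other
```

The linkage matrix `Z` encodes the full dendrogram; `scipy.cluster.hierarchy.dendrogram(Z)` would plot it.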

Dual Hierarchical Trees
Two-way joining: trees derived from two independent variables.
- Cluster by feature and by sample
- Cluster by different components of measurement

Supervised Learning
- Learning depends on prior definition and knowledge of the classes
- Complex correlation between features is revealed
- Classification is inherent in learning
- Different answers are given for different questions

Simple Hypothesis Test: t Test
Is A greater than B? Welch's t test compares the mean values of two data sets:
- Unequal sample sizes and variances are allowed
- t is reduced by uncertainty in the data sets ($\sigma$)
- t is increased by the number of points in the data sets ($n$)
- The distributions are not necessarily Gaussian, but classification is based on means and variances

$t = \dfrac{m_A - m_B}{\sqrt{\dfrac{\sigma_A^2}{n_A} + \dfrac{\sigma_B^2}{n_B}}}$

where $m$ = mean value of a data set, $\sigma$ = standard deviation of a data set, and $n$ = number of points in a data set.

If $t > 3$, then $m_A \ne m_B$ with $\ge 99.7\%$ confidence (error probability $\le 0.003$ for Gaussian distributions) [$n > 25$].

Analysis of Variance

Variance: $\sigma_x^2 = \dfrac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}$

F statistic: $F_{AB} = \dfrac{\sigma_{x_1}^2}{\sigma_{x_2}^2} = \dfrac{\sigma_A^2}{\sigma_B^2}$

F test of two populations:
- Mean value is of secondary importance
- The populations are equivalent if $F_{\min} < F_{AB} < F_{\max}$, or $F_{AB} \approx 1$
- The populations are strongly equivalent if $F_{AB} \approx 1$ and $t_{AB} \approx 0$
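A minimal sketch of Welch's t statistic and the F ratio as defined above; `scipy.stats.ttest_ind(a, b, equal_var=False)` would give the same t along with a p-value. The sample arrays are illustrative.

```python
import numpy as np

def welch_t(a, b):
    """Welch's t statistic for two samples with unequal sizes and variances."""
    ma, mb = np.mean(a), np.mean(b)
    va, vb = np.var(a, ddof=1), np.var(b, ddof=1)   # unbiased (N - 1) variance
    return (ma - mb) / np.sqrt(va / len(a) + vb / len(b))

def f_ratio(a, b):
    """F statistic: ratio of the two sample variances."""
    return np.var(a, ddof=1) / np.var(b, ddof=1)

rng = np.random.default_rng(4)
tumor = rng.normal(1.0, 1.0, 36)    # e.g., expression of one gene in tumors
normal = rng.normal(0.0, 1.0, 22)   # the same gene in normal tissue
print(welch_t(tumor, normal), f_ratio(tumor, normal))
```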

Example of Gene-by-Gene Tumor/Normal Classification by t Test (data from Alon et al., 1999)
- 58 RNA samples representing tumor and normal tissue
- 1,151 genes are over/under-expressed in the tumor/normal comparison, $p \le 0.003$
- Genetically dissimilar samples are apparent
- Dimension reduction by neglecting genes with $t < 3$, where
  $t = \dfrac{m_T - m_N}{\sqrt{\dfrac{\sigma_T^2}{36} + \dfrac{\sigma_N^2}{22}}}$
- Cancer-positive and cancer-negative gene sets
- Some samples possibly misclassified by the pathologist

[Figure: genes up in normal vs. down in normal]

Sample and Gene Correlation Matrices over the Entire Data Set
- Gene correlation: $C_G = D D^T$
- Sample correlation: $C_S = D^T D$

[Correlation-matrix images, color scale from -1 to 1]
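A minimal sketch of the two correlation matrices: assuming each row of the data matrix `D` (genes x samples) is standardized to zero mean and unit norm, $DD^T$ holds gene-gene correlations, and standardizing columns instead gives the sample-sample correlations $D^T D$. The matrix sizes are illustrative.

```python
import numpy as np

def standardize_rows(D):
    """Zero-mean each row and scale it to unit norm,
    so that D @ D.T becomes a correlation matrix."""
    D = D - D.mean(axis=1, keepdims=True)
    return D / np.linalg.norm(D, axis=1, keepdims=True)

rng = np.random.default_rng(5)
D = rng.normal(size=(100, 58))          # 100 genes x 58 samples (illustrative)

C_G = standardize_rows(D) @ standardize_rows(D).T        # gene correlation
C_S = standardize_rows(D.T) @ standardize_rows(D.T).T    # sample correlation

print(C_G.shape, C_S.shape)             # (100, 100) and (58, 58)
print(np.allclose(np.diag(C_G), 1.0))   # unit diagonal, entries in [-1, 1]
```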

Discriminant Analysis
- Hypothesis test: are 2 given populations different?
- Linear discriminant: what is(are) the best line(s)/plane(s)/hyperplane(s) for separating 2 (or k) populations? e.g., $y = mx + b$

[Scatter plot: gene D14657 vs. gene M94363, tumor and normal samples]

Statistical Linear Discriminant
What is(are) the best line(s)/plane(s)/hyperplane(s) for separating 2 (or k) populations?
- Fisher's linear discriminant (sketched below)
- Gradient descent
- Perceptron

Nonseparable sets:
- Minimum square error

http://en.wikipedia.org/wiki/Linear_discriminant_analysis
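A minimal sketch of Fisher's linear discriminant for two classes: the weight vector that maximizes between-class separation relative to within-class scatter is $w = S_W^{-1}(m_1 - m_2)$. The data and the midpoint threshold are illustrative choices.

```python
import numpy as np

def fisher_discriminant(X1, X2):
    """Fisher's linear discriminant: w = S_W^{-1} (m1 - m2)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of the two class scatter matrices
    S_w = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_w, m1 - m2)
    # Place the decision threshold midway between the projected class means
    b = -0.5 * (w @ m1 + w @ m2)
    return w, b

rng = np.random.default_rng(6)
tumor = rng.normal([3, 1], 1.0, (40, 2))
normal = rng.normal([0, 0], 1.0, (40, 2))
w, b = fisher_discriminant(tumor, normal)

scores = np.vstack([tumor, normal]) @ w + b       # positive => call tumor
labels = np.array([1] * 40 + [0] * 40)
print(np.mean((scores > 0).astype(int) == labels))  # training accuracy
```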

Nearest-Neighbor Discriminant Analysis
- Ignore all points except those closest to the evolving discriminant
- Support vector machine
  - Linear classifier
  - Reshape the space by transformation (via a kernel)

Ensemble Mean Values: Pre-Processing to Reduce the Feature Space
Treat each probe set (row) as a redundant, corrupted measurement of the same tumor/normal indicator:

$z_{ij} = k_i y + \epsilon_{ij}, \quad i = 1,\ldots,m, \quad j = 1,\ldots,n$

Compute column averages for each sample sub-group (i.e., sum each column and divide by n), as sketched below:

$\hat{z}_j = \dfrac{1}{n}\sum_{i=1}^{n} z_{ij}$

- Dimension reduction: the feature space is reduced from (# samples x # genes) to (# samples)
- The statistics of sums of random variables are ~Gaussian, by the central limit theorem
- A simple step toward deep learning
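A minimal sketch of this ensemble-average reduction: given a (genes x samples) expression matrix and two gene lists (up- and down-regulated, e.g., chosen by the t test above), each sample collapses to two averaged features. All array names and sizes are illustrative.

```python
import numpy as np

def ensemble_features(Z, up_rows, down_rows):
    """Reduce a (genes x samples) matrix to two features per sample:
    the column average over up-regulated and over down-regulated genes."""
    z_up = Z[up_rows].mean(axis=0)       # one averaged 'up' value per sample
    z_down = Z[down_rows].mean(axis=0)   # one averaged 'down' value per sample
    return np.column_stack([z_up, z_down])   # shape: (# samples, 2)

rng = np.random.default_rng(7)
Z = rng.normal(size=(2000, 58))          # 2000 probe sets x 58 samples
up_rows = np.arange(0, 800)              # indices of up-regulated genes
down_rows = np.arange(800, 1100)         # indices of down-regulated genes
features = ensemble_features(Z, up_rows, down_rows)
print(features.shape)                    # (58, 2): ready for a 2-D discriminant
```

The two averaged features are exactly the axes of the scatter plots on the following slides.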

Two-Feature Discriminants for Class Prediction and Evaluation (Alon, Notterman, 1999, data)
- Scatter plot presents ensemble averages of up genes vs. down genes for each sample: $(\hat{z}_j^{\,\text{up}}, \hat{z}_j^{\,\text{down}})$
- Classification based on ensemble averages
- Mislabeled samples are identifiable

[Scatter plot: average value for down-regulated genes vs. average value for up-regulated genes in tumors; tumor and normal samples]

Clustering of Sample Averages for Primary Colon Cancer vs. Normal Mucosa (NIH PPG data, 2004)
- 144 samples, 3,437 probe sets analyzed
- 47 primary colon cancer, 22 normal mucosa
- Affymetrix HGU-133A GeneChip; all transcripts present in all samples
- Probe sets: 1,067 up, 290 down, 19 constant

[Scatter plot: average value of down-regulated vs. up-regulated genes (primary colon cancer), primary colon cancer and normal mucosa samples]

Clustering of Sample Averages for Primary Polyp vs. Normal Mucosa
- 21 primary polyp, 22 normal mucosa
- Probe sets: 1,014 up, 219 down, 40 constant

[Scatter plot: average value of down-regulated vs. up-regulated genes (primary polyp), primary polyp and normal mucosa samples]

Clustering of Sample Averages for Primary Polyp vs. Primary Colon Cancer
- 21 primary polyp, 47 primary colon cancer
- Probe sets: 273 up, 126 down, 172 constant

[Scatter plot: average value of down-regulated vs. up-regulated genes (primary polyp), primary polyp and primary colon cancer samples]

Next Time: Introduction to Neural Networks

Supplemental Material

Big Data and Data Mining
- Multi-dimensional classification
- Ray of hope? Or infinite harm?

http://en.wikipedia.org/wiki/big_data