Final review
Based in part on slides from the textbook and slides of Susan Holmes
December 5, 2012
Overview: Before Midterm
- General goals of data mining
- Datatypes
- Preprocessing & dimension reduction
- Distances
- Multidimensional scaling
- Multidimensional arrays
- Decision trees
- Performance measures for classifiers
- Discriminant analysis
Overview: After Midterm
More classifiers:
- Rule-based classifiers
- Nearest-neighbour classifiers
- Naive Bayes classifiers
- Neural networks
- Support vector machines
- Random forests
- Boosting (AdaBoost / gradient boosting)
Clustering.
Outlier detection.
Rule-based classifiers
Rule-based classifier (example)
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Rule-based classifiers: concepts
- coverage
- accuracy
- mutual exclusivity
- exhaustivity
- Laplace accuracy
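To make the first two concepts concrete, here is a minimal Python sketch; the dict-based rule format and the helper function are illustrative assumptions, not notation from the slides.

```python
# Minimal sketch: coverage and accuracy of one classification rule.
# Rule format (hypothetical): {"if": {attribute: value, ...}, "then": class}.

def coverage_and_accuracy(rule, records, labels):
    covered = [i for i, r in enumerate(records)
               if all(r.get(a) == v for a, v in rule["if"].items())]
    coverage = len(covered) / len(records)          # fraction satisfying the antecedent
    accuracy = (sum(labels[i] == rule["then"] for i in covered) / len(covered)
                if covered else 0.0)                # fraction of covered records correctly labelled
    # Laplace accuracy would instead use (correct + 1) / (covered + #classes).
    return coverage, accuracy

# Rule R1: (Give Birth = no) AND (Can Fly = yes) -> Birds
r1 = {"if": {"Give Birth": "no", "Can Fly": "yes"}, "then": "Birds"}
records = [{"Give Birth": "no", "Can Fly": "yes"},
           {"Give Birth": "no", "Can Fly": "no"}]
labels = ["Birds", "Reptiles"]
print(coverage_and_accuracy(r1, records, labels))   # (0.5, 1.0)
```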
Nearest neighbour classifiers
Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck.
[Figure: compute the distance from the test record to the training records, then choose the k nearest.]
Choice of k: if k is too large, the neighbourhood may include points from other classes.
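A minimal numpy sketch of the classifier itself; Euclidean distance and majority vote are the usual defaults, and the toy data are illustrative:

```python
import numpy as np

def knn_predict(X_train, y_train, x_test, k=3):
    dists = np.linalg.norm(X_train - x_test, axis=1)   # distance to each training record
    nearest = np.argsort(dists)[:k]                    # indices of the k closest records
    classes, counts = np.unique(y_train[nearest], return_counts=True)
    return classes[np.argmax(counts)]                  # majority vote among the k

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [3.0, 3.0], [3.1, 2.9]])
y_train = np.array(["duck", "duck", "goose", "goose"])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1]), k=3))  # "duck"
```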
Naive Bayes classifiers
Model:
$$P(Y = c \mid X_1 = x_1, \ldots, X_p = x_p) \propto \left( \prod_{l=1}^{p} P(X_l = x_l \mid Y = c) \right) P(Y = c)$$
For continuous features, typically a 1-dimensional QDA model is used (i.e. Gaussian within each class). For discrete features, use the Laplace-smoothed probabilities
$$P(X_j = l \mid Y = c) = \frac{\#\{i : X_{ij} = l,\ Y_i = c\} + \alpha}{\#\{Y_i = c\} + \alpha k},$$
where $k$ is the number of levels of feature $j$.
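A sketch of the Laplace-smoothed estimate for one discrete feature, taking $k$ to be the number of observed levels (an assumption consistent with the formula above):

```python
import numpy as np

def smoothed_cond_prob(xj, y, level, cls, alpha=1.0):
    """P(X_j = level | Y = cls) with Laplace smoothing."""
    k = len(np.unique(xj))                        # number of levels of feature j
    num = np.sum((xj == level) & (y == cls)) + alpha
    den = np.sum(y == cls) + alpha * k
    return num / den

xj = np.array(["a", "a", "b", "b", "b"])
y  = np.array([0, 0, 0, 1, 1])
print(smoothed_cond_prob(xj, y, "a", 0))          # (2 + 1) / (3 + 2) = 0.6
```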
Artificial Neural Networks (ANN): single layer
Neural networks: double layer
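The network slides are figures; as a stand-in, here is a minimal numpy sketch of a forward pass through a double-layer network (one hidden layer). The sigmoid activations and random weights are illustrative assumptions, not trained values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    h = sigmoid(W1 @ x + b1)       # hidden-layer activations
    return sigmoid(W2 @ h + b2)    # output activation

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)   # 2 inputs -> 3 hidden units
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # 3 hidden units -> 1 output
print(forward(np.array([0.5, -1.0]), W1, b1, W2, b2))
```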
Support vector machines
Support vector machines
Solves the problem
$$\min_{\beta, \alpha, \xi}\ \|\beta\|^2 \quad \text{subject to} \quad y_i(x_i^T \beta + \alpha) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad \sum_{i=1}^{n} \xi_i \le C.$$
Non-separable problems
The $\xi_i$'s can be removed from this problem, yielding
$$\min_{\beta, \alpha}\ \|\beta\|_2^2 + \gamma \sum_{i=1}^{n} \bigl(1 - y_i f_{\alpha,\beta}(x_i)\bigr)_+$$
where $(z)_+ = \max(z, 0)$ is the positive-part function. Or,
$$\min_{\beta, \alpha}\ \sum_{i=1}^{n} \bigl(1 - y_i f_{\alpha,\beta}(x_i)\bigr)_+ + \lambda \|\beta\|_2^2.$$
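A sketch evaluating the second (penalized) form with $f_{\alpha,\beta}(x) = x^T\beta + \alpha$; the toy data are illustrative:

```python
import numpy as np

def svm_objective(beta, alpha, X, y, lam):
    margins = y * (X @ beta + alpha)
    hinge = np.maximum(1.0 - margins, 0.0)   # (z)_+ = max(z, 0)
    return hinge.sum() + lam * beta @ beta   # hinge loss + ridge penalty

X = np.array([[1.0, 2.0], [-1.0, -1.5]])
y = np.array([1.0, -1.0])
print(svm_objective(np.array([0.5, 0.5]), 0.0, X, y, lam=0.1))  # 0.05
```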
Logistic vs. SVM
[Figure: comparison of the logistic and SVM (hinge) loss functions.]
Ensemble methods: general idea
Bagging / random forests
In this method, one takes several bootstrap samples (samples with replacement) of the data. For each bootstrap sample $S_b$, $1 \le b \le B$, fit a model, retaining the classifier $\hat{f}_b$. After all models have been fit, classify by majority vote:
$$\hat{f}(x) = \text{majority vote of } \bigl(\hat{f}_b(x)\bigr)_{1 \le b \le B}.$$
Also defined: the OOB (out-of-bag) estimate of error.
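A minimal sketch of the procedure, with a generic base learner passed in as a function; the toy 1-NN learner is an illustrative choice:

```python
import numpy as np

def bag_predict(X, y, x_new, fit_model, B=25, seed=0):
    rng = np.random.default_rng(seed)
    n, votes = len(y), []
    for b in range(B):
        idx = rng.integers(0, n, size=n)    # bootstrap sample S_b (with replacement)
        f_b = fit_model(X[idx], y[idx])     # fit and retain the classifier
        votes.append(f_b(x_new))
    classes, counts = np.unique(votes, return_counts=True)
    return classes[np.argmax(counts)]       # majority vote over the B models

def fit_1nn(Xb, yb):                        # toy base learner
    return lambda x: yb[np.argmin(np.linalg.norm(Xb - x, axis=1))]

X = np.array([[0.0], [0.2], [3.0], [3.2]])
y = np.array(["a", "a", "b", "b"])
print(bag_predict(X, y, np.array([0.1]), fit_1nn))   # "a"
```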
Illustrating AdaBoost
[Figures: each training data point starts with an equal initial weight; the weights are then updated from round to round of boosting.]
Boosting as gradient descent
It turns out that boosting can be thought of as something like gradient descent. In some sense, the boosting algorithm is a steepest-descent algorithm for finding
$$\operatorname{argmin}_{f \in \mathcal{F}} \sum_{i=1}^{n} L(y_i, f(x_i)).$$
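A sketch of this view for squared-error loss: each round fits a weak learner (a one-split stump, an illustrative choice) to the current residuals, which are the negative gradient of $L(y, f) = \frac{1}{2}(y - f)^2$.

```python
import numpy as np

def fit_stump(x, r):
    """Best single split on one feature, minimizing squared error."""
    best = None
    for s in np.unique(x)[:-1]:                       # splits with non-empty sides
        left, right = r[x <= s], r[x > s]
        err = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if best is None or err < best[0]:
            best = (err, s, left.mean(), right.mean())
    _, s, cl, cr = best
    return lambda xnew: np.where(xnew <= s, cl, cr)

def boost(x, y, rounds=50, lr=0.1):
    f, stumps = np.zeros_like(y, dtype=float), []
    for _ in range(rounds):
        g = fit_stump(x, y - f)        # fit to residuals = negative gradient
        f += lr * g(x)                 # a gradient step in function space
        stumps.append(g)
    return lambda xnew: sum(lr * g(xnew) for g in stumps)

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 0.0, 1.0, 1.0, 1.0])
print(np.round(boost(x, y)(x), 2))     # approaches y as rounds grow
```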
Cluster analysis
Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups:
- intra-cluster distances are minimized;
- inter-cluster distances are maximized.
Types of clustering
- Partitional: a division of the data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
- Hierarchical: a set of nested clusters organized as a hierarchical tree; each data object is in exactly one subset for any horizontal cut of the tree.
A partitional example
[ESL Figure 14.4: simulated data in the plane, clustered into three classes (represented by orange, blue and green) by the K-means clustering algorithm.]
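A minimal numpy sketch of the algorithm behind the figure, alternating assignment and centroid updates; the random initialization and fixed iteration count are simplifying assumptions:

```python
import numpy as np

def kmeans(X, K, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)                           # assign to nearest centroid
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)  # recompute the average
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(3, 0.3, (10, 2))])
labels, _ = kmeans(X, K=2)
print(labels)    # two blocks of identical labels
```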
K-means: choosing K with the gap statistic
[ESL Figure 14.11. Left panel: observed (green) and expected (blue) values of $\log W_K$ for the simulated data of Figure 14.4; both curves have been translated to equal zero at one cluster. Right panel: the Gap curve, equal to the difference between the observed and expected values of $\log W_K$.]
The Gap estimate $K^*$ is the smallest $K$ producing a gap within one standard deviation of the gap at $K + 1$.
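A sketch of the selection rule just stated; the gap and standard-error values below are hypothetical, not computed from data:

```python
import numpy as np

def choose_k(gap, se):
    """Smallest K with gap[K] >= gap[K+1] - se[K+1] (K is 1-indexed)."""
    for k in range(len(gap) - 1):
        if gap[k] >= gap[k + 1] - se[k + 1]:
            return k + 1
    return len(gap)

gap = np.array([0.00, 0.30, 0.55, 0.56, 0.57])   # hypothetical Gap curve, K = 1..5
se  = np.array([0.02, 0.02, 0.03, 0.03, 0.03])
print(choose_k(gap, se))                          # 3
```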
K-medoid algorithm
Same as K-means, except that the centroid is estimated not by the average, but by the observation having minimum total pairwise distance to the other cluster members.
Advantage: the centroid is one of the observations, which is useful, e.g., when features are 0 or 1. Also, one only needs pairwise distances for K-medoids rather than the raw observations.
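A sketch of the medoid update, showing that only a pairwise distance matrix D is needed:

```python
import numpy as np

def medoid(D, members):
    sub = D[np.ix_(members, members)]           # within-cluster pairwise distances
    return members[sub.sum(axis=1).argmin()]    # member with minimum total distance

D = np.array([[0.0, 1.0, 2.0, 9.0],
              [1.0, 0.0, 1.5, 9.0],
              [2.0, 1.5, 0.0, 9.0],
              [9.0, 9.0, 9.0, 0.0]])
print(medoid(D, np.array([0, 1, 2])))           # 1 (smallest summed distance)
```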
Silhouette plot
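The plot is built from per-point silhouette values. As a reminder, here is a sketch using the standard definition $s(i) = (b - a)/\max(a, b)$, where $a$ is the mean distance to the point's own cluster and $b$ the smallest mean distance to another cluster (the formula is the usual one, not spelled out on the slide):

```python
import numpy as np

def silhouette_one(D, labels, i):
    own = labels == labels[i]
    own[i] = False                                   # exclude the point itself
    a = D[i, own].mean()                             # mean distance within own cluster
    b = min(D[i, labels == c].mean()                 # nearest other cluster
            for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

D = np.array([[0.0, 0.5, 4.0],
              [0.5, 0.0, 4.5],
              [4.0, 4.5, 0.0]])
labels = np.array([0, 0, 1])
print(silhouette_one(D, labels, 0))                  # (4.0 - 0.5) / 4.0 = 0.875
```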
A hierarchical example
[ESL Figure 14.12: dendrogram from agglomerative hierarchical clustering with average linkage applied to the human tumor microarray data; leaves are labelled by tumor type (LEUKEMIA, MELANOMA, RENAL, COLON, BREAST, NSCLC, ...).]
Hierarchical clustering: concepts
- Top-down vs. bottom-up.
- Different linkages (see the sketch below):
  - single linkage (minimum distance)
  - complete linkage (maximum distance)
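A tiny sketch of the two linkages for a pair of clusters, given the block of pairwise distances between them (the numbers are illustrative):

```python
import numpy as np

# Rows: members of cluster A; columns: members of cluster B.
D_AB = np.array([[1.0, 4.0],
                 [2.0, 5.0]])
print("single linkage:  ", D_AB.min())   # minimum pairwise distance -> 1.0
print("complete linkage:", D_AB.max())   # maximum pairwise distance -> 5.0
```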
Mixture models
Similar to K-means, but assignment to clusters is soft. Often applied with a multivariate normal as the model within classes. The EM algorithm is used to fit the model (a sketch follows below):
- Estimate responsibilities.
- Estimate within-class parameters, replacing the (unobserved) labels with responsibilities.
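A minimal EM sketch for a two-component, one-dimensional Gaussian mixture; the initialization and iteration count are arbitrary simplifications:

```python
import numpy as np

def em_gmm(x, iters=50):
    mu, sigma, pi = np.array([x.min(), x.max()]), np.ones(2), np.full(2, 0.5)
    for _ in range(iters):
        # E-step: responsibilities gamma[i, k], proportional to pi_k * N(x_i; mu_k, sigma_k)
        dens = (pi / (np.sqrt(2 * np.pi) * sigma)
                * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2))
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: the usual estimates, with labels replaced by responsibilities
        nk = gamma.sum(axis=0)
        mu = (gamma * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        pi = nk / len(x)
    return mu, sigma, pi

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(5, 1, 100)])
print(np.round(em_gmm(x)[0], 2))   # component means near 0 and 5
```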
Model-based clustering: summary
1. Choose a type of mixture model (e.g. multivariate normal) and a maximum number of clusters, K.
2. Use a specialized hierarchical clustering technique: model-based hierarchical agglomeration.
3. Use the clusters from the previous step to initialize EM for the mixture model.
4. Use BIC to compare different mixture models and models with different numbers of clusters.
Outliers
Outliers: general steps
- Build a profile of the normal behaviour.
- Use these summary statistics to detect anomalies, i.e. points whose characteristics are very far from the normal profile.
- General types of schemes involve a statistical model of "normal", with "far" measured in terms of likelihood.
Example: Grubbs' test chooses an outlier threshold that controls the Type I error of declaring any outliers when the data actually do follow the model.
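A sketch of a two-sided Grubbs test; the critical-value formula is the standard t-distribution-based one, stated here as a reminder rather than taken from the slides:

```python
import numpy as np
from scipy import stats

def grubbs_outlier(x, alpha=0.05):
    n = len(x)
    G = np.max(np.abs(x - x.mean())) / x.std(ddof=1)   # most extreme standardized deviation
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    G_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return G > G_crit, G, G_crit

x = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 15.0])       # 15.0 looks suspicious
print(grubbs_outlier(x))                                # (True, ...)
```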