Section 4.1. Chapter 4. Classification into Groups: Discriminant Analysis. Introduction: Canonical Discriminant Analysis.

Similar documents
Applications. DSC 410/510 Multivariate Statistical Methods. Discriminating Two Groups. What is Discriminant Analysis

Small Group Presentations

CHAPTER VI RESEARCH METHODOLOGY

The SAGE Encyclopedia of Educational Research, Measurement, and Evaluation Multivariate Analysis of Variance

Lunchtime Seminar. Risper Awuor, Ph.D. Department of Graduate Educational and Leadership. January 30, 2013

The FASTCLUS Procedure as an Effective Way to Analyze Clinical Data

Representational similarity analysis

Introduction to Discrimination in Microarray Data Analysis

12/30/2017. PSY 5102: Advanced Statistics for Psychological and Behavioral Research 2

Lev Sverdlov, Ph.D., Innapharma, Inc., Park Ridge, NJ

investigate. educate. inform.

Profile Analysis. Intro and Assumptions Psy 524 Andrew Ainsworth

Chapter 14: More Powerful Statistical Methods

CHAPTER 6. Conclusions and Perspectives

Reveal Relationships in Categorical Data

Introduction to Machine Learning. Katherine Heller Deep Learning Summer School 2018

Application of Local Control Strategy in analyses of the effects of Radon on Lung Cancer Mortality for 2,881 US Counties

Linear Regression in SAS

Correlation and Regression

Knowledge Discovery and Data Mining I

Lev Sverdlov, Ph.D.; John F. Noble, Ph.D.; Gabriela Nicolau, Ph.D. Innapharma, Inc., Upper Saddle River, NJ

A Comparison of Linear Mixed Models to Generalized Linear Mixed Models: A Look at the Benefits of Physical Rehabilitation in Cardiopulmonary Patients

Overview of Lecture. Survey Methods & Design in Psychology. Correlational statistics vs tests of differences between groups

(C) Jamalludin Ab Rahman

Technical Specifications

Selection and Combination of Markers for Prediction

Outlier Analysis. Lijun Zhang

Discriminant Analysis with Categorical Data

Part 8 Logistic Regression

Context of Best Subset Regression

Chapter 21 Multilevel Propensity Score Methods for Estimating Causal Effects: A Latent Class Modeling Strategy

A Biostatistics Applications Area in the Department of Mathematics for a PhD/MSPH Degree

From Biostatistics Using JMP: A Practical Guide. Full book available for purchase here. Chapter 1: Introduction... 1

isc ove ring i Statistics sing SPSS

Using SAS to Calculate Tests of Cliff s Delta. Kristine Y. Hogarty and Jeffrey D. Kromrey

Department of Allergy-Pneumonology, Penteli Children s Hospital, Palaia Penteli, Greece

Biostatistics II

Index. E Eftekbar, B., 152, 164 Eigenvectors, 6, 171 Elastic net regression, 6 discretization, 28 regularization, 42, 44, 46 Exponential modeling, 135

10CS664: PATTERN RECOGNITION QUESTION BANK

WELCOME! Lecture 11 Thommy Perlinger

Motivation: Fraud Detection

Practical Multivariate Analysis

List of Figures. List of Tables. Preface to the Second Edition. Preface to the First Edition

Meta-Analysis. Zifei Liu. Biological and Agricultural Engineering

UNIVERSITY OF FLORIDA 2010

How to analyze correlated and longitudinal data?

Chapter 1 - Sampling and Experimental Design

Generalized Estimating Equations for Depression Dose Regimes

National Surgical Adjuvant Breast and Bowel Project (NSABP) Foundation Annual Progress Report: 2008 Formula Grant

Linear Regression Analysis

A SAS Macro to Investigate Statistical Power in Meta-analysis Jin Liu, Fan Pan University of South Carolina Columbia

Chapter 3. Psychometric Properties

Georgiev et al. (2008) have investigated how fractal dimension of EEGs changes as a response to different mental activities in healthy volunteers.

CLASSICAL AND. MODERN REGRESSION WITH APPLICATIONS

Logistic Regression Predicting the Chances of Coronary Heart Disease. Multivariate Solutions

Introduction to MVPA. Alexandra Woolgar 16/03/10

Applied Medical. Statistics Using SAS. Geoff Der. Brian S. Everitt. CRC Press. Taylor Si Francis Croup. Taylor & Francis Croup, an informa business

Reliability. Internal Reliability

Aristomenis Kotsakis,Matthias Nübling, Nikolaos P. Bakas, George Pelekanakis, John Thanopoulos

Survey research (Lecture 1) Summary & Conclusion. Lecture 10 Survey Research & Design in Psychology James Neill, 2015 Creative Commons Attribution 4.

Survey research (Lecture 1)

04/12/2014. Research Methods in Psychology. Chapter 6: Independent Groups Designs. What is your ideas? Testing

CHAPTER 4 RESULTS. In this chapter the results of the empirical research are reported and discussed in the following order:

Chapter 13 Estimating the Modified Odds Ratio

Multiple Regression Using SPSS/PASW

Understanding the Hypothesis

MAKING THE NSQIP PARTICIPANT USE DATA FILE (PUF) WORK FOR YOU

A macro of building predictive model in PROC LOGISTIC with AIC-optimal variable selection embedded in cross-validation

Repeated Measures ANOVA and Mixed Model ANOVA. Comparing more than two measurements of the same or matched participants

MS&E 226: Small Data

A STUDY ON SOCIAL IMPACT OF WOMEN SELF HELP GROUPS IN METTUR TALUK, SALEM DISTRICT, TAMILNADU

Lecture Outline. Biost 590: Statistical Consulting. Stages of Scientific Studies. Scientific Method

Causal Methods for Observational Data Amanda Stevenson, University of Texas at Austin Population Research Center, Austin, TX

Midterm Exam ANSWERS Categorical Data Analysis, CHL5407H

Classification. Methods Course: Gene Expression Data Analysis -Day Five. Rainer Spang

Behavioral Intervention Rating Rubric. Group Design

Doing Quantitative Research 26E02900, 6 ECTS Lecture 6: Structural Equations Modeling. Olli-Pekka Kauppila Daria Kautto

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES

EMPIRICAL STRATEGIES IN LABOUR ECONOMICS

STATISTICS AND RESEARCH DESIGN

Use the above variables and any you might need to construct to specify the MODEL A/C comparisons you would use to ask the following questions.

Randomized experiments vs. Propensity scores matching: a Meta-analysis.

Functionality. A Case For Teaching Functional Skills 4/8/17. Teaching skills that make sense

A review of caveats in statistical nuclear image analysis

The Food Consumption Analysis

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison

Part 1. Online Session: Math Review and Math Preparation for Course 5 minutes Introduction 45 minutes Reading and Practice Problem Assignment

SURVEY TOPIC INVOLVEMENT AND NONRESPONSE BIAS 1

CHAPTER OBJECTIVES - STUDENTS SHOULD BE ABLE TO:

Today: Binomial response variable with an explanatory variable on an ordinal (rank) scale.

Propensity Score Methods for Estimating Causality in the Absence of Random Assignment: Applications for Child Care Policy Research

Types of Data. Systematic Reviews: Data Synthesis Professor Jodie Dodd 4/12/2014. Acknowledgements: Emily Bain Australasian Cochrane Centre

GENERALIZABILITY AND RELIABILITY: APPROACHES FOR THROUGH-COURSE ASSESSMENTS

SUPPLEMENTAL MATERIAL

Assigning B cell Maturity in Pediatric Leukemia Gabi Fragiadakis 1, Jamie Irvine 2 1 Microbiology and Immunology, 2 Computer Science

Bangor University Laboratory Exercise 1, June 2008

Problem #1 Neurological signs and symptoms of ciguatera poisoning as the start of treatment and 2.5 hours after treatment with mannitol.

Summary & Conclusion. Lecture 10 Survey Research & Design in Psychology James Neill, 2016 Creative Commons Attribution 4.0

Daniel Boduszek University of Huddersfield

Transcription:

Chapter 4 Classification into Groups: Discriminant Analysis Section 4.1 Introduction: Canonical Discriminant Analysis Understand the goals of discriminant Identify similarities between discriminant analysis and multivariate general linear models. Explain how to perform canonical discriminant Use the CANDISC procedure to perform canonical discriminant The Research Questions A credit card company wants to use financial information to decide whether a potential customer is a good risk or a bad risk before offering a credit card. A school district wants to use classroom behavior and scores to identify candidate students for a learning intervention program. An insurance company wants to understand what demographic and behavioral variables are most characteristic of different types of drivers. 3 4 Why Discriminant Analysis? With discriminant analysis, you can interpret variables that are most characteristic of group differences use a linear combination of variables to predict group membership validate the model on a new sample in one step easily score new observations into groups. Supervised Data Analysis There are numerous types of analysis that are used to classify observations based on a set of variables. However, discriminant analysis is not the same as cluster analysis to perform discriminant analysis, you must have information about actual group membership in order to estimate discriminant functions discriminant analysis finds combinations of predictors that best differentiate the groups, so you can apply those linear combinations in the future to predict groups when group membership is not known. 5 6 1

How Does Discriminant Analysis Work? Canonical Discriminant Analysis If you were to perform canonical analysis on the crops data with all four remote sensing measures as one set of variables and the crops (dummy-coded) as the second set of variables, you would be performing the equivalent of canonical discriminant The number of discriminant functions is the minimum of the number of predictors or the number of groups minus one. For the crops example, you would have min (2,4) = 2 discriminant functions. 7 8 The Multivariate Linear Model The linear model underlying discriminant analysis is essentially the same as MANOVA: Y = Xβ + E The assumptions are the same as for MANOVA If the data are not multivariate normal, a nonparametric method may be preferred. Two Goals for Discriminant Analysis 1. Interpretation: How are the groups different? Find and interpret linear combinations of variables that optimally predict group differences 2. Classification: How accurately can observations be classified into groups? Using functions of variables to predict group membership for a data set and evaluate expected error rates 9 10 Two Goals for Discriminant Analysis 1. Interpretation: How are the groups different? Find and interpret linear combinations of variables that optimally predict group differences 2. Classification: How accurately can observations be classification into groups? Using functions of variables to predict group membership for a data set and evaluate expected error rates The CANDISC Procedure General form of the CANDISC procedure: PROC CANDISC <options>; CLASS variable; VAR variables; RUN; 11 12 2

The %PLOTIT Macro General form of the %PLOTIT macro: %plotit (data=data set, plotvars=var1 var2, labelvar = varname, symvar=group_var, typevar=group_var, symsize = option, symlen=option); Pathological Gambling Example 13 14 Canonical Discriminant Analysis ch4s1d1.sas This demonstration illustrates the CANDISC procedure for canonical discriminant analysis, and shows the %PLOTIT macro for visualizing discriminant functions. Exercises This exercise reinforces the concepts discussed previously. 15 16 Section 4.2 Describe the steps involved in Fisher linear discriminant Contrast Fisher linear discriminant analysis with canonical discriminant Perform Fisher linear discriminant analysis using the DISCRIM procedure. Fisher Linear Discriminant Analysis 18 3

What Have You Learned? The predictors discriminate between groups. Both functions discriminate significantly. There are useful interpretations for the two functions. Both functions discriminate between different pairs of groups. Two Goals for Discriminant Analysis Recall the two aspects of discriminant analysis: Interpretation: How are the groups different? Find and interpret linear combinations of variables that optimally predict group differences 2. Classification: How accurately can observations be classified into groups? Using functions of variables to predict group membership for a data set and evaluate expected error rates 19 20 Compare and Contrast Assuming number of groups < number of predictors: Canonical discriminant analysis Number of functions = groups 1 Seek functions that maximally separate group centroids. Fisher linear discriminant analysis Number of functions = groups Score observations on similarity to group centroids. Scores are converted to probability of membership in each group. The DISCRIM Procedure PROC DISCRIM can be used for many different types of analysis including canonical discriminant analysis assessing and confirming the usefulness of the functions (empirical validation and crossvalidation) predicting group membership on new data using the functions (scoring) linear and quadratic discriminant analysis nonparametric discriminant analysis 21 22 The DISCRIM Procedure General form of the DISCRIM procedure: PROC DISCRIM <options>; CLASS variable; PRIORS expression; VAR variables; RUN; Prior Probability Estimates In discriminant analysis, it may be useful to specify prior probabilities, or priors. By using PRIORS statement, you can specify how to estimate the probabilities of group membership in the population. You will use proportional, equal, and user-specified priors in this chapter. 23 24 4

Fisher Linear Discriminant Analysis ch4s2d1.sas This demonstration illustrates the DISCRIM procedure for Fisher linear discriminant Exercises This exercise reinforces the concepts discussed previously. 25 26 Use the DISCRIM procedure to test the assumption of homogeneous covariance matrices Perform quadratic discriminant analysis Section 4.3 Quadratic Discriminant Analysis 28 Homogeneity of Covariance Matrices Recall that one assumption of the multivariate linear model is homogeneity of covariance matrices. 2 2 Dt (x) = dt (x) + g2( t) Mahalanobis distance -2(ln(prior)) Quadratic Discriminant Analysis Quadratic discriminant analysis uses a separate estimate of the covariance matrix for each group in calculating distances from group centroids: D (x) = d (x) + ) + g () t 2 2 t t g1( t ln S 2 If groups do have heterogeneous covariance structures, classifications based on a pooled covariance matrix can be prone to greater error in classification. 29 30 5

Quadratic Discriminant Analysis PROC DISCRIM POOL=option SLPOOL=option; OPTION: ANALYSIS: POOL=YES Linear POOL=NO Quadratic POOL=TEST Depends on test Demonstration ch4s3d1.sas This demonstration illustrates a test for homogeneity of covariance matrices and performs quadratic discriminant analysis using the DISCRIM procedure. 31 32 Use the DISCRIM procedure to validate discriminant Section 4.4 Discriminant Analysis and Empirical Validation 34 What Have You Learned? Using quadratic discriminant analysis with proportional priors, your expected error rate in classification is only 4%. Most of the errors in classification are expected to be among Controls. Empirical Validation Validation is particularly important in discriminant The observations that were classified were the same ones used to develop the equations, resulting in downwardly biased error count estimates. Discriminant analysis capitalizes on chance associations in the data. Apply the equations to a new set of data to get a better estimate of the expected error rate for the population. 35 36 6

PROC DISCRIM for Empirical Validation General form of the DISCRIM procedure: PROC DISCRIM DATA = old-data TESTDATA = new-data TESTLIST; CLASS variable; PRIORS priors; VAR variables; RUN; Testing Discriminant Functions on a New Data Set ch4s4d1.sas This demonstration illustrates the DISCRIM procedure for empirical validation of discriminant functions. 37 38 What Have You Learned? Validating the discriminant functions on a new sample resulted in an estimated error rate of about 19%, which is still quite low. Most of the errors in the validation data set were from the Binge and Steady groups. Section 4.5 Stepwise Discriminant Analysis 39 Explain different methods for stepwise discriminant Understand some of the potential problems with stepwise Fit a stepwise model using PROC STEPDISC and interpret the output. Validate the results of a stepwise analysis using PROC DISCRIM. Too Many Variables! dsm3 dsm8 dsm1 dsm11 dsm9 dsm6 dsm5 dsm2 dsm12 dsm4 dsm7 dsm10 41 42 7

Stepwise Selection Methods Forward Selection Backward Selection Analyze with Care Although it is useful and efficient, stepwise methods have limitations: Correlated predictors Chance relationships in the data Always validate your findings from a stepwise Stepwise Selection 43 44 The STEPDISC Procedure General form of the STEPDISC procedure: PROC STEPDISC DATA = data-set METHOD = method; CLASS variable; VAR variables; RUN; Stepwise Discriminant Analysis ch4s5d1.sas This demonstration illustrates the STEPDISC procedure for stepwise discriminant analysis and the DISCRIM procedure for empirical validation of the final model. 45 46 What Have You Learned? The reduced model with four predictors resulted in an estimated error rate of 11% in the calibration data (compared to 4% with the full, 12-predictor model). The reduced model resulted in an estimated error rate of about 21% in the validation data, compared with 19% with the full model. Reducing the model from 12 predictors to 4 predictors resulted in very little loss of predictive power. Exercises This exercise reinforces the concepts discussed previously. 47 48 8