Applications. DSC 410/510 Multivariate Statistical Methods. Discriminating Two Groups. What is Discriminant Analysis

Similar documents
Small Group Presentations

Section 4.1. Chapter 4. Classification into Groups: Discriminant Analysis. Introduction: Canonical Discriminant Analysis.

isc ove ring i Statistics sing SPSS

Department of Allergy-Pneumonology, Penteli Children s Hospital, Palaia Penteli, Greece

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES

Reliability of Ordination Analyses

CRITERIA FOR USE. A GRAPHICAL EXPLANATION OF BI-VARIATE (2 VARIABLE) REGRESSION ANALYSISSys

Chapter 14: More Powerful Statistical Methods

The SAGE Encyclopedia of Educational Research, Measurement, and Evaluation Multivariate Analysis of Variance

Reveal Relationships in Categorical Data

CHAPTER VI RESEARCH METHODOLOGY

(C) Jamalludin Ab Rahman

bivariate analysis: The statistical analysis of the relationship between two variables.

Daniel Boduszek University of Huddersfield

Overview of Lecture. Survey Methods & Design in Psychology. Correlational statistics vs tests of differences between groups

Tutorial 3: MANOVA. Pekka Malo 30E00500 Quantitative Empirical Research Spring 2016

Discriminant Analysis with Categorical Data

11/24/2017. Do not imply a cause-and-effect relationship

Multiple Regression Using SPSS/PASW

12/30/2017. PSY 5102: Advanced Statistics for Psychological and Behavioral Research 2

Correlation and Regression

Business Research Methods. Introduction to Data Analysis

SUPPLEMENTARY INFORMATION. Table 1 Patient characteristics Preoperative. language testing

List of Figures. List of Tables. Preface to the Second Edition. Preface to the First Edition

The Food Consumption Analysis

Problem #1 Neurological signs and symptoms of ciguatera poisoning as the start of treatment and 2.5 hours after treatment with mannitol.

Business Statistics Probability

STATISTICAL METHODS FOR DIAGNOSTIC TESTING: AN ILLUSTRATION USING A NEW METHOD FOR CANCER DETECTION XIN SUN. PhD, Kansas State University, 2012

Unit 1 Exploring and Understanding Data

2.75: 84% 2.5: 80% 2.25: 78% 2: 74% 1.75: 70% 1.5: 66% 1.25: 64% 1.0: 60% 0.5: 50% 0.25: 25% 0: 0%

MULTIPLE REGRESSION OF CPS DATA

A STUDY ON SOCIAL IMPACT OF WOMEN SELF HELP GROUPS IN METTUR TALUK, SALEM DISTRICT, TAMILNADU

Review: Logistic regression, Gaussian naïve Bayes, linear regression, and their connections

A COMPARISON BETWEEN MULTIVARIATE AND BIVARIATE ANALYSIS USED IN MARKETING RESEARCH

Still important ideas

Analysis and Interpretation of Data Part 1

MMI 409 Spring 2009 Final Examination Gordon Bleil. 1. Is there a difference in depression as a function of group and drug?

From Biostatistics Using JMP: A Practical Guide. Full book available for purchase here. Chapter 1: Introduction... 1

Multiple Regression Analysis

Ecological Statistics

1.4 - Linear Regression and MS Excel

Dr. Kelly Bradley Final Exam Summer {2 points} Name

Tutorial #7A: Latent Class Growth Model (# seizures)

Statistical questions for statistical methods

Statistics as a Tool. A set of tools for collecting, organizing, presenting and analyzing numerical facts or observations.

Technical Specifications

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo

AP Statistics. Semester One Review Part 1 Chapters 1-5

10. LINEAR REGRESSION AND CORRELATION

WELCOME! Lecture 11 Thommy Perlinger

Survey research (Lecture 1) Summary & Conclusion. Lecture 10 Survey Research & Design in Psychology James Neill, 2015 Creative Commons Attribution 4.

Survey research (Lecture 1)

Mnemonic representations of transient stimuli and temporal sequences in the rodent hippocampus in vitro

Daniel Boduszek University of Huddersfield

CHILD HEALTH AND DEVELOPMENT STUDY

Comparing Ecological Communities Part Two: Ordination. Ordination vs. classification. Species responses to gradients. Read: Ch.

Still important ideas

Data Analysis with SPSS

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo

Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data

POL 242Y Final Test (Take Home) Name

Practical Multivariate Analysis

Classification and Statistical Analysis of Auditory FMRI Data Using Linear Discriminative Analysis and Quadratic Discriminative Analysis

Lunchtime Seminar. Risper Awuor, Ph.D. Department of Graduate Educational and Leadership. January 30, 2013

Profile Analysis. Intro and Assumptions Psy 524 Andrew Ainsworth

Understandable Statistics

Multiple Regression. James H. Steiger. Department of Psychology and Human Development Vanderbilt University

Choosing a Significance Test. Student Resource Sheet

Preliminary Report on Simple Statistical Tests (t-tests and bivariate correlations)

Simple Linear Regression

Readings: Textbook readings: OpenStax - Chapters 1 13 (emphasis on Chapter 12) Online readings: Appendix D, E & F

Lecture Outline. Biost 517 Applied Biostatistics I. Purpose of Descriptive Statistics. Purpose of Descriptive Statistics

Overview of Multivariate Methods

CLASSICAL AND. MODERN REGRESSION WITH APPLICATIONS

Comparison of discrimination methods for the classification of tumors using gene expression data

CHAPTER 3 RESEARCH METHODOLOGY

Regression Including the Interaction Between Quantitative Variables

NORTH SOUTH UNIVERSITY TUTORIAL 2

CHAPTER TWO REGRESSION

Data Analysis in Practice-Based Research. Stephen Zyzanski, PhD Department of Family Medicine Case Western Reserve University School of Medicine

HANDOUTS FOR BST 660 ARE AVAILABLE in ACROBAT PDF FORMAT AT:

Statistics: A Brief Overview Part I. Katherine Shaver, M.S. Biostatistician Carilion Clinic

Basic Biostatistics. Chapter 1. Content

Multiple Bivariate Gaussian Plotting and Checking

Day 11: Measures of Association and ANOVA

SUPPLEMENTAL MATERIAL

Biostatistics II

Repeated Measures ANOVA and Mixed Model ANOVA. Comparing more than two measurements of the same or matched participants

Choosing the Correct Statistical Test

Chapter 6 Measures of Bivariate Association 1

Chapter 11 Multiple Regression

Representational similarity analysis

Georgiev et al. (2008) have investigated how fractal dimension of EEGs changes as a response to different mental activities in healthy volunteers.

Nature Neuroscience: doi: /nn Supplementary Figure 1. Behavioral training.

Linear Regression Analysis

Machine Learning! Robert Stengel! Robotics and Intelligent Systems MAE 345,! Princeton University, 2017

Summary & Conclusion. Lecture 10 Survey Research & Design in Psychology James Neill, 2016 Creative Commons Attribution 4.0

6. Unusual and Influential Data

How to describe bivariate data

Transcription:

DSC 4/5 Multivariate Statistical Methods Applications DSC 4/5 Multivariate Statistical Methods Discriminant Analysis Identify the group to which an object or case (e.g. person, firm, product) belongs: predict the success or failure of a new product decide whether a student should be admitted to graduate school classify students as to vocational interests determine the category of credit risk for a person predict whether a firm will be successful What is Discriminant Analysis Discriminating Two Groups Dependent (response) variable is categorical (nominal or nonmetric), e.g. defining two groups Independent (predictor) variables are metric Empirically derive a variate or discriminant function (linear combination of the predictor variables) that discriminates best between pre-defined groups calculate discriminant (Z)( ) function scores for each individual (e.g. Z = β + β X + β X +...) maximize between-group variance of Z relative to within-group variance of Little overlap in the discriminant function High between-group variance relative to within-group variance Greater overlap in the discriminant function Low between-group variance relative to within-group variance group variance of Z 4 Discriminating More Than Two Groups Derive K variates, K = number of groups E.g. three groups, two discriminant Z functions can be represented on a scatterplot: Z 6 5 4 4 5 6 Z 5 Two Group Discriminant Analysis Example Goal: identify cluster of DSC 4/5 students Response: cluster (found in last chapter) Predictors: age, years of work experience (exp), expected salary (sal) Data: students (discrim.xls( discrim.xls) Question: Can we form a linear combination of age, exp, and sal to discriminate one cluster from the other? 6 Discriminant Analysis

DSC 4/5 Multivariate Statistical Methods Scatterplot of exp vs. age Scatterplot of sal vs. age 7 8 Scatterplot of sal vs. exp SAS Instructions Use SAS procedure discrim Code: see discrim.sas Class variable: cluster Predictor variables: age, exp, sal Option canonical runs a canonical analysis (which is the type we ll focus on) Option list displays results for everyone Statement priors proportional assumes sample sizes reflect population sizes 9 Discriminant Results Z = Can = constant.5 Age +.59 Exp +.78 Sal Plot of Can*Cluster. Legend: A = obs, B = obs, etc. Can ˆ A ˆ A A A A ˆ A A A A ˆ A B A A A A A - ˆ A A B - ˆ A Šƒƒˆƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒ C C Cluster Classification Probabilities An alternative to classifying objects using Z is calculating (posterior) probabilities of belonging to each group classify each object into the group with the highest probability displayed in SAS output when list option used used to produce classification matrix: from C C C 5 into C Discriminant Analysis

DSC 4/5 Multivariate Statistical Methods Three Group Discriminant Analysis Example Goal: identify years of work experience category for DSC 4/5 students Response: low (-), medium (-6) or high Predictors: age, expected salary (sal) Data: students (discrim.xls( discrim.xls) Question: Can we form two linear combinations of age and sal to discriminate students in low (), medium () and high () work experience categories? Scatterplot 4 Discriminant Results SAS code: see discrim.sas Z = Can = constant +.4 Age +. Salary Z = Can = constant. Age +.9 Salary Correctly classifies all but one: from 9 into 9 5 Scatterplot of Discriminant Functions Plot of Can*Can. Symbol is value of Expcat. Can. ˆ.5 ˆ. ˆ ˆ.5. ˆ ˆ -.5 -. ˆ -.5 ˆ Šƒƒˆƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒ -4-4 6 8 Can 6 Discriminant Analysis Objectives Determine significant differences between groups on a set of variables discriminant analysis provides an objective assessment of differences Determine which variables account the most for the differences discriminant analysis lends insight into the role of individual variables Classify new objects (individuals, firms, products) into groups based on variable values 7 Selection of Variables Dependent variable must be categorical, with two or more mutually exclusive and exhaustive categories can be created from a metric variable, e.g. work experience category (low, medium, high) Polar Extremes approach compares only the two most extreme categories Independent (predictor) variables can be based on previous research, economic theory, intuition, etc. 8 Discriminant Analysis

DSC 4/5 Multivariate Statistical Methods Sample Considerations Aim for 5-5 observations per predictor variable Aim for at least observations per group minimum group size is the number of predictor variables Aim for roughly equal group sizes Consider dividing the sample in two (5-5, 5, 6-4, or 75-5) if possible (total size > ): analysis sample for developing model holdout sample for validating the model 9 Discriminant Assumptions Multivariate normality of predictor variables logistic regression is a possible alternative if this assumption is violated Each group has same variance-covariance covariance matrix Separate-groups covariance matrices or quadratic techniques can be used for classification if this assumption is violated All relationships are linear No multicollinearity (remove redundant predictors) No outliers (remove aberrant cases) Examples Two-group Illustrative Example page 8-96 96 hatco.xls data file hatcogroup.sas SAS file Three-group Illustrative Example page 96-4 hatco.xls data file hatcogroup.sas SAS file Computational Method Simultaneous (direct) estimation all predictor variables are considered concurrently, regardless of discriminating power Stepwise estimation find single best discriminating variable find variable best able to improve discrimination when paired with first variable, etc. until remaining predictors do not contribute significantly to further discrimination Statistical Significance Various criteria available for assessing discriminant function(s), e.g. Wilks lambda indicates whether group means are significantly different (small: yes, large: no) Mahalanobis D indicates (generalized) Euclidean distance between groups Lead to chi-squared and F hypothesis tests: used to select predictors in stepwise estimation used to select number of discriminant functions to retain (up to K ) used to assess differences between all groups Discriminant Functions Discriminant function coefficients Unstandardized: used in calculating Z scores Standardized: used in interpreting relative contribution of predictors to the function(s) Fisher s s linear discriminant functions One for each group For every case, calculate scores for each group Used for classifying cases: assign case to group with the highest classification function score Equivalent to assigning case to group with highest posterior probability 4 Discriminant Analysis 4

DSC 4/5 Multivariate Statistical Methods Discriminant Function Results The discriminant analysis explains a % of the variation in the response variable Each discriminant function accounts for a proportion of this Proportion of variation explained by st function is given by its (canonical correlation) Proportion of remaining variation explained by nd function given by its (canonical correlation) Etc until we can finally sum to get the total % of the variation explained by the analysis c +c ( c )+c ( c c ( c ))+ 5 Graphical Representations Group membership scatterplot scatterplot of first two (unstandardized) discriminant functions, with cases identified by symbols indicating group, and centroids marked Territorial map scatterplot of first two (unstandardized) discriminant functions, with classification boundaries and group centroids marked Combination! (see p9) 6 Classification Cutting scores based on (unstandardized) discriminant function centroids (see p65) N AZ B + N BZ A Zcut = N A + N B assign case to A if Z < Z cut, to B otherwise only works with two groups Fisher s s linear discriminant functions for every case, calculate scores for each group assign case to group with the highest classification function score Posterior probabilities assign case to group with the highest posterior probability 7 Classification Matrices Analysis Cross- validated Holdout Actual Predicted Group Total 8 Prediction Accuracy Hit ratio = % correctly classified Should be at least 5% higher than the larger of: # in group maximum chance criterion = max sample size proportional chance criterion = p + p + L+ png Press s s Q statistic should be significantly high [ N nk] Q = N( K ) compare with chi-squared (df( = ) critical value 9 Casewise Diagnostics Identify misclassified cases Profile the correctly classified and misclassified cases on the predictor variables Identify significant differences using two-sample t-teststests Relate to graphical representations if possible Patterns in misclassification may indicate the possibility of additional group(s) Discriminant Analysis 5

DSC 4/5 Multivariate Statistical Methods Interpret Results. Assess relative contribution of predictors to the function(s): tabulate and/or plot standardized discriminant function coefficients better to use discriminant loadings: correlations between predictor variables and discriminant function(s) loadings may be rotated to aid interpretation. Potency index = loading ij relative eigenvalue j functions, j. Compare and with univariate F-ratios for each predictor to check consistency Validate Results Assess prediction accuracy on holdout sample, or, failing that, using cross- validation ( leave-one-out out prediction ) Profile groups on the predictor variables assess whether results conform to prior expectations, theory, common sense, etc. Profile groups on additional variables expected to reflect differences between groups Discriminant Analysis 6