PROC CORRESP: Different Perspectives for Nominal Analysis. Richard W. Cole. Systems Analyst Computation Center. The University of Texas at Austin


ABSTRACT

Correspondence Analysis, a fundamental approach to the analysis of data having any scale of measurement, has failed to achieve wide acceptance in the United States. As a descriptive and inferential statistical technique, it should be riding the crest of currently popular categorical modeling procedures for a number of reasons: 1) it allows one to plot, in metric space, the relationships among nominal categories of variables; 2) it quantifies the size and number of dimensions comprising these relationships; and 3) it generates 'optimal scores' that reflect the interval-scaled metric values underlying each variable. This paper compares this technique to more traditional categorical modeling procedures and emphasizes its relative advantages for business and behavioral science applications.

INTRODUCTION

Analysis of variables having a nominal scale of measurement has traditionally been accomplished through the use of various approaches such as Pearson's chi-square test of independence or a likelihood-ratio chi-square test of independence (using maximum likelihood estimation). While the latter has been expanded upon in recent years into a multivariate modeling approach in which one seeks to explain the data using some parsimonious set of variables, few researchers, at least in the United States, have examined nominal variables from the more structural perspective now available through what is consistently being called Correspondence Analysis. This form of analysis has its foundations in a matrix transformation procedure known as singular value decomposition, whereby a matrix of any dimension can be reduced to its basic structure, with the resulting dimensions quantified in terms of their explanatory power. These dimensions may be statistically examined, and the levels of each variable (known as points) plotted on these axes.
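As a sketch of the decomposition just described, the relations below use notation that is common in the correspondence analysis literature rather than symbols taken from this paper. Every symbol is therefore an assumption made for illustration: N is the table of counts with I rows and J columns, n is its grand total, r and c are the row and column masses, and D_r and D_c are diagonal matrices holding those masses.

```latex
% Assumed notation (not from the paper): N is the I x J table of counts, n its grand total.
\begin{align*}
P &= N / n, \qquad r = P\mathbf{1}, \qquad c = P^{\mathsf T}\mathbf{1}, \\
S &= D_r^{-1/2}\,\bigl(P - r c^{\mathsf T}\bigr)\,D_c^{-1/2} = U \Sigma V^{\mathsf T}, \\
\sum_k \sigma_k^2 &= \sum_{i,j} \frac{(p_{ij} - r_i c_j)^2}{r_i c_j} = \frac{\chi^2}{n}.
\end{align*}
```

Under these assumptions the singular values \sigma_k on the diagonal of \Sigma measure the size of each dimension, the squared values \sigma_k^2 are the principal inertias, and the share \sigma_k^2 / \sum_k \sigma_k^2 is the proportion of the total that a dimension explains. The last line ties the decomposition back to the Pearson chi-square statistic of the same table.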
This fundamental approach to contingency table analysis has been rediscovered across and within numerous disciplines, though the earliest treatment would seem to have been Eckart & Young (1936). At present, correspondence analysis appears to be increasing in popularity with the recent releases of SAS's PROC CORRESP and the SPSS CATEGORIES product. Additionally, publications by Sage (Weller & Romney, 1990) and the classic text by Greenacre (1984) provide the reader with a thorough introduction to the topic. In this brief paper a number of examples are given to introduce users to correspondence analysis and the advantages it may provide over traditional tests of independence and fit.

Traditional Approaches to Nominal Analysis

When faced with examining the relationship between two nominal variables, most researchers opt for some type of chi-square test of independence in which the hypothesis of interest is that there is no interaction or dependency among the variables. That is, the obtained cell frequencies may be explained adequately by the marginal distributions of each variable. Having examined the chi-square statistic associated with this test and found it significant, one may conduct some form of follow-up analysis, either by generating contrasts among specific cells or by simply examining standardized residuals, whose squares (the cell chi-square values) sum to the overall chi-square value. These values help the researcher to understand relations among the various levels of each variable.

Correspondence Analysis

While the approach above is useful for understanding relations among specific cells and conducting tests of various hypotheses, little may be understood about the interrelationship or dimensionality among the variables. For variables assumed to have linear relationships, various factor analysis and principal component analysis approaches are used to generate a reduced dimensional space, and the variables may then be plotted along its axes. Correspondence Analysis is the nominal counterpart to such an analysis. By default, PROC CORRESP output contains information about the dimensions of the data, including the variance accounted for by each dimension and its associated chi-square statistic. Users may output coordinates for the dimensions to SAS/GRAPH and generate a plot of the various levels of each variable. This graphic is more appropriately called a 'profile', containing a point for each level of a variable.
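The independence test described under Traditional Approaches can be sketched in a few lines of plain Python. This is an illustrative re-computation, not PROC FREQ itself: the function name `independence_tests` is hypothetical, the counts are transcribed from Table 1 of this paper, and no probability values are computed because that would need a chi-square distribution function outside the standard library.

```python
import math

# Counts transcribed from Table 1 (Greenacre, 1984).
# Columns: hvysmoke, ltsmoke, medsmoke, nosmoke.
COUNTS = [
    [13, 24, 33, 18],   # junior-emp
    [4, 3, 7, 4],       # junior-mgmt
    [2, 6, 7, 10],      # secret
    [4, 10, 12, 25],    # senior-emp
    [2, 2, 3, 4],       # senior-mgmt
]


def independence_tests(counts):
    """Return expected counts, cell chi-squares, Pearson X2, LR G2 and df."""
    row_tot = [sum(row) for row in counts]
    col_tot = [sum(col) for col in zip(*counts)]
    n = sum(row_tot)
    # Expected frequency under independence: product of the marginals over n.
    expected = [[rt * ct / n for ct in col_tot] for rt in row_tot]
    # Cell chi-square: squared deviation from expectation, scaled by expectation.
    cell = [[(o - e) ** 2 / e for o, e in zip(orow, erow)]
            for orow, erow in zip(counts, expected)]
    pearson = sum(map(sum, cell))
    # Likelihood-ratio statistic; empty cells contribute nothing.
    g2 = 2.0 * sum(o * math.log(o / e)
                   for orow, erow in zip(counts, expected)
                   for o, e in zip(orow, erow) if o > 0)
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    return expected, cell, pearson, g2, df


expected, cell, pearson, g2, df = independence_tests(COUNTS)
print(f"Pearson chi-square = {pearson:.3f}, LR chi-square = {g2:.3f}, df = {df}")
```

The cell chi-squares are kept separate from their total so that the follow-up step, looking for the cells that contribute most, needs no further computation.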
A COMPARISON OF APPROACHES

To illustrate the differences in these two approaches, sample data from Greenacre (1984) involving smoking habits across various levels of employment will be used. Here the researcher has an interest in the "dangers of smoking" at levels of: 1) none, 2) light, 3) medium, and 4) heavy, for employees categorized as: 1) senior management, 2) junior management, 3) senior employees, 4) junior employees, and 5) secretarial staff.

Independence Testing Approach

Table 1 contains a contingency table of these data and results from both traditional chi-square independence tests using SAS's PROC FREQ. Note that these tests of independence fail to detect the presence of an interaction between smoking and employment status. In such a case an examination of the standardized residuals is not necessarily warranted, though these values do give an indication of the 'heavy' and 'light' cells contributing to the overall chi-square value. The large cell chi-square values represent a discrepancy between the observed and expected cell frequencies (seen in rows one and two of each cell). Based on these results one might conclude that, while junior employees tend not to be 'non smokers', senior employees are often found to be 'non smokers'. Still, no overall 'dependency' exists, as indicated by the high probability values associated with either test of independence.

Correspondence Analysis Approach

With Correspondence Analysis quite a different picture emerges. Figure 1 contains a plot of the first two correspondence analysis dimensions calculated from the data. In this plot, Dimension 1 (along the x axis) may be expressed along a no smoking to heavy smoking continuum, with medium smoking best representing the smoking end of the continuum (due to its distance from the origin and its low Dimension 2 value). A reexamination of the contingency table analysis of Table 1 may support this initial interpretation. Note that high cell chi-square values are only obtained for no smoking (in the junior-emp and senior-emp cells), which may therefore be viewed as contributing primarily to the definition of Dimension 1. As might be expected, the three smoking categories (light, medium, and heavy) are represented on the smoking end of Dimension 1, with nonsmoking independently representing the nonsmoking end of the continuum.
As can be seen from the PROC CORRESP output in Table 2, Dimension 1 explains 87 percent of the variance (distances across points) in the data. For the smoking levels we may examine inertia values that are analogous to variance accounted for within a factor, but in this case within a dimension. Of obvious interest is the partial inertia for the no smoking category, which has a value of .6539. This indicates that the relative contribution of this category to Dimension 1 is around 65 percent. This is a clear indication that no smoking plays a major role in defining Dimension 1. Dimension 2 would seem reflective of some construct ranging from light to heavy smoking, though its interpretation is more difficult.

Regarding employment status, notice that senior employees and junior employees fall roughly along Dimension 1 as well. This finding would seem to be consistent with findings from the standardized residuals, especially in the case of nonsmoking senior employees, who are overly abundant.

TABLE 1
Employment By Smoking Habits
Rows = Employment Status, Columns = Smoking Habits
Each cell lists: Frequency, Expected Frequency, Cell Chi-Square

               hvysmoke   ltsmoke   medsmoke   nosmoke    Total
junior-emp        13         24        33         18        88
                11.399     20.518    28.269     27.813
                0.2249     0.5909    0.7916     3.4625
junior-mgmt        4          3         7          4        18
                2.3316     4.1969    5.7824     5.6891
                1.1938     0.3413    0.2564     0.5015
secret             2          6         7         10        25
                3.2383     5.829     8.0311     7.9016
                0.4735     0.005     0.1324     0.5573
senior-emp         4         10        12         25        51
                6.6062     11.891    16.383     16.119
                1.0282     0.3008    1.1728     4.8929
senior-mgmt        2          2         3          4        11
                1.4249     2.5648    3.5337     3.4767
                0.2321     0.1244    0.0806     0.0788
Total             25         45        62         61       193
Percent         12.95      23.32     32.12      31.61    100.00

STATISTICS
Statistic                       DF     Value     Prob
Chi-Square                      12    16.442    0.172
Likelihood Ratio Chi-Square     12    16.348    0.176
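The standardized residuals referred to here are not printed in Table 1, but they can be recovered from the frequencies as (observed - expected) / sqrt(expected). The short sketch below does this for the no smoking column only. The function name `standardized_residuals` is hypothetical and the counts are transcribed from Table 1.

```python
import math

# No smoking column of Table 1, with the row totals needed for expectations.
NOSMOKE = {"junior-emp": 18, "junior-mgmt": 4, "secret": 10,
           "senior-emp": 25, "senior-mgmt": 4}
ROW_TOTAL = {"junior-emp": 88, "junior-mgmt": 18, "secret": 25,
             "senior-emp": 51, "senior-mgmt": 11}
COL_TOTAL = 61   # all non-smokers
N = 193          # grand total


def standardized_residuals():
    """Return (observed - expected) / sqrt(expected) for each employment level."""
    out = {}
    for name, observed in NOSMOKE.items():
        expected = ROW_TOTAL[name] * COL_TOTAL / N
        out[name] = (observed - expected) / math.sqrt(expected)
    return out


for name, z in standardized_residuals().items():
    print(f"{name:12s} {z:+.3f}")
```

Squaring each residual gives back the cell chi-square of Table 1. The senior-emp residual is about +2.21 (more non-smokers than expected) and the junior-emp residual is about -1.86 (fewer than expected), which matches the direction of the two large cell chi-square values.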

TABLE 2
Smoking Habits by Employment Level Analysis

Inertia and Chi-Square Decomposition

Singular   Principal   Chi-
Values     Inertias    Squares    Percents    18   36   54   72   90
                                          ----+----+----+----+----+---
0.27342    0.07476     14.4285    87.76%   ************************
0.10009    0.01002      1.9333    11.76%   ***
0.02034    0.00041      0.0798     0.49%
           -------     -------
           0.08519     16.4416    (Degrees of Freedom = 12)

Summary Statistics for the Row Points

               Quality     Mass        Inertia
junior-emp     0.999810    0.455959    0.308354
junior-mgmt    0.991082    0.093264    0.139467
secret         0.998603    0.129534    0.071053
senior-emp     0.999817    0.264249    0.449750
senior-mgmt    0.892568    0.056995    0.031376

Partial Contributions to Inertia for the Row Points

               Dim1        Dim2
junior-emp     0.330974    0.151772
junior-mgmt    0.083659    0.551151
secret         0.070064    0.080522
senior-emp     0.512006    0.002998
senior-mgmt    0.003298    0.213558

Summary Statistics for the Column Points

               Quality     Mass        Inertia
hvysmoke       0.994552    0.129534    0.191743
ltsmoke        0.984016    0.233161    0.082860
medsmoke       0.983228    0.321244    0.148025
nosmoke        0.999995    0.316062    0.577372

Partial Contributions to Inertia for the Column Points

               Dim1        Dim2
hvysmoke       0.149538    0.505754
ltsmoke        0.030850    0.463174
medsmoke       0.165617    0.001737
nosmoke        0.653996    0.029336
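The decomposition in Table 2 can be approximated from the Table 1 counts without SAS. The sketch below is not PROC CORRESP's algorithm. It assumes the usual formulation in which the matrix of standardized residuals S is decomposed, and it swaps in power iteration with deflation on S'S as a simple standard-library stand-in for a full singular value decomposition. The function names are hypothetical.

```python
import math

COLS = ["hvysmoke", "ltsmoke", "medsmoke", "nosmoke"]
COUNTS = [
    [13, 24, 33, 18],   # junior-emp
    [4, 3, 7, 4],       # junior-mgmt
    [2, 6, 7, 10],      # secret
    [4, 10, 12, 25],    # senior-emp
    [2, 2, 3, 4],       # senior-mgmt
]


def residual_matrix(counts):
    """Return S = (p_ij - r_i c_j) / sqrt(r_i c_j) with row and column masses."""
    n = sum(map(sum, counts))
    r = [sum(row) / n for row in counts]
    c = [sum(col) / n for col in zip(*counts)]
    s = [[(x / n - ri * cj) / math.sqrt(ri * cj) for x, cj in zip(row, c)]
         for row, ri in zip(counts, r)]
    return s, r, c


def principal_axes(s, n_axes, iters=500):
    """Leading eigenpairs of S'S by power iteration with deflation."""
    k = len(s[0])
    a = [[sum(row[i] * row[j] for row in s) for j in range(k)] for i in range(k)]
    axes = []
    for _axis in range(n_axes):
        v = [float(i + 1) for i in range(k)]          # arbitrary start vector
        for _ in range(iters):
            w = [sum(a[i][j] * v[j] for j in range(k)) for i in range(k)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * a[i][j] * v[j] for i in range(k) for j in range(k))
        axes.append((lam, v))
        # Deflate: remove the axis just found before searching for the next.
        a = [[a[i][j] - lam * v[i] * v[j] for j in range(k)] for i in range(k)]
    return axes


s, r, c = residual_matrix(COUNTS)
total_inertia = sum(x * x for row in s for x in row)
axes = principal_axes(s, n_axes=min(len(COUNTS), len(COLS)) - 1)

for lam, _vec in axes:
    print(f"singular value {math.sqrt(lam):.5f}  inertia {lam:.5f}  "
          f"percent {100 * lam / total_inertia:.2f}")

# Partial contribution of each column point to Dimension 1: the squared
# elements of the first right singular vector.
dim1 = {name: v * v for name, v in zip(COLS, axes[0][1])}
# Share of the total inertia carried by each column point.
share = {name: sum(row[j] ** 2 for row in s) / total_inertia
         for j, name in enumerate(COLS)}
print(dim1)
print(share)
```

Only min(rows, columns) - 1 = 3 axes are requested because the remaining direction carries no inertia. The column masses in `c` correspond to the Mass column of Table 2, `share` to its Inertia column, and `dim1` to the Dim1 column of the partial contributions.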

While it may be appealing to interpret a level (point) of smoking in relation to a level of employment, this is advised against since these distances are not defined. Any interpretation of row points relative to column points may possibly be supported by an examination of the standardized residual data. For example, the no smoking point appears close to the senior-employee point; although this proximity cannot technically be interpreted, it is supported by the significant cell chi-square value (residuals may be interpreted as z-scores, though they tend to underestimate the true value of the deviate). Conversely, the only other significant standardized residual, that of the no smoke by junior-emp cell, has an expected value higher than its observed frequency, which would suggest that these points lie at opposite ends of Dimension 1, though again no distance can actually be inferred from the graph.

[Figure 1. Smoking Habits by Employment Level. Plot of Dimension 2 (y axis) against Dimension 1 (x axis), with labeled points for nosmoke, ltsmoke, medsmoke, hvysmoke, senior-emp, senior-mgmt, junior-mgmt, junior-emp, and secret.]

CONCLUSION

This paper has presented an overview of contingency analysis available through PROC CORRESP. While the procedure is often stressed as a graphical approach to understanding the relationship among variables, this technique can be used in conjunction with more traditional expectancy table approaches to examine information other than hypothesis testing.

REFERENCES

Eckart, C., & Young, G. (1936). The approximation of one matrix by another of lower rank. Psychometrika, 1, 211-218.

Greenacre, M. (1984). Theory and applications of correspondence analysis. London: Academic Press.

Weller, S. C., & Romney, A. K. (1990). Metric Scaling: Correspondence Analysis. Sage University Paper Series on Quantitative Applications in the Social Sciences, 07-075. Newbury Park, CA: Sage.

Richard W. Cole
Systems Analyst, Computation Center
University of Texas @ Austin
Austin, TX 78701-1110
INTERNET: rcole@emx.cc.utexas.edu
BITNET: rwc@utxvm