AN ALTERNATIVE WAY TO COMPUTE THE KAPPA STATISTICS IN OBSERVER AGREEMENT STUDIES

Yukiko Ellis, U.S. Bureau of the Census, Washington, D.C.

The kappa statistic, first proposed by Cohen (1960), is a coefficient of observer agreement for categorical data. Researchers in medicine, epidemiology, education, and psychology have widely used the statistic to evaluate measurement error due to observer error. This paper presents a SAS computer program that computes the percent agreement and kappa statistics that are often needed in reliability studies based on behavior coding. The first section briefly describes behavior coding, the percent agreement, and the kappa statistic. The second section describes three versions of my SAS program, which differ in the purpose for computing the kappa and in how the data are set up. The third section describes an option to compute the kappa statistic through use of PROC FREQ with the AGREE option under Release 6.10 of SAS. The program presented here has a few advantages over PROC FREQ in that it provides flexibility in the input data setup and in the number of observers and measurements involved in the comparison. The appendix presents the SAS code for Version 2 of my program.

I. Background

Behavior coding is a procedure in which trained staff listen to audiotapes of survey interviews and code interviewer and respondent behaviors. Interviewer codes include appropriate reading of a question, changes to question wording, verification or suggestion of an answer, and erroneous skipping of a question. Respondent codes include question interruptions, expressions of uncertainty, qualified and uncodable answers, and 'don't know' and refusal responses. The purpose of behavior coding is thus to evaluate the quality of survey questions and to identify the cognitive problems they pose.

To assess the extent to which behavior coding is reliable, we must have all coders in a project behavior-code the same set of interviews and examine how well they apply the same codes under the same conditions. If agreement among the coders is low, then the usefulness of the behavior codings is severely limited.

Suppose now that each of a sample of n questions is coded in regard to interviewer behavior by two behavior coders, with the ratings on a nominal scale with m categories. The results of behavior coding can be summarized in a two-way frequency table as in Table 1. In this table, 'm' is the number of all possible behavior code categories for interviewers. Each survey question is classified in this table according to how the two behavior coders coded the question. For example, if Coder 1 assigned a question to the fourth category of the interviewer behavior code, while Coder 2 assigned the second category, then this question would be classified into cell (4,2).

Table 1. Joint distribution of codings by two coders for classification with m categories

                         Coder 2
  Coder 1    1      2      ...    m      Total
  1          n11    n12    ...    n1m    n1+
  2          n21    n22    ...    n2m    n2+
  ...        ...    ...    ...    ...    ...
  m          nm1    nm2    ...    nmm    nm+
  Total      n+1    n+2    ...    n+m    n

There are many ways one can measure intercoder agreement. The simplest index of agreement is the overall percent agreement,

$$p_o = p_{11} + p_{22} + \cdots + p_{mm} = \sum_{i=1}^{m} p_{ii}, \qquad (1)$$

where $p_{ij} = n_{ij}/n$, $1 \le i, j \le m$. Kappa is another measure of agreement. Kappa has a desirable property that the percent agreement does not have; namely, kappa incorporates a correction for chance-expected agreement into the assessment of intercoder reliability.
Let $p_o$ denote the overall proportion of observed agreement as in equation (1). If the codings are statistically independent, then the overall proportion of chance-expected agreement is

$$p_e = p_{1+}p_{+1} + p_{2+}p_{+2} + \cdots + p_{m+}p_{+m} = \sum_{i=1}^{m} p_{i+}p_{+i}, \qquad (2)$$

where $p_{i+} = \sum_j p_{ij}$ and $p_{+j} = \sum_i p_{ij}$. The overall value of kappa is then defined as

$$\kappa = \frac{p_o - p_e}{1 - p_e}. \qquad (3)$$

If there is complete agreement between the two coders, $\kappa = +1$. This happens, for example, when all observations in the frequency table belong to two or more cells on the diagonal, resulting in $p_o = 1.0$ and $p_e < 1$. If all observations fall in one cell on the diagonal, then $p_e = 1.0$ and $\kappa$ becomes undefined. The occurrence of an undefined kappa is rare as long as we have a sufficiently large number of observations. If observed agreement is greater than or equal to chance agreement, $\kappa \ge 0$; if observed agreement is less than or equal to chance agreement, $\kappa \le 0$. If $p_o$ and $p_e$ are such that $p_e = (1 + p_o)/2$, then the minimum value of $\kappa$ equals -1. Otherwise, the minimum value depends on the marginal proportions and is between -1 and 0, inclusive. Landis and Koch (1977) suggest that, for most practical purposes, values of kappa greater than .75 may be taken to represent excellent agreement beyond chance, values below .40 poor agreement beyond chance, and values between .40 and .75 fair to good agreement beyond chance.

The standard error of kappa computed by the SAS program presented in this paper is appropriate for testing the null hypothesis that the underlying value of kappa is zero (Fleiss, Cohen, and Everitt, 1969). It is estimated by

$$\widehat{s.e.}(\hat{\kappa}) = \frac{1}{(1 - p_e)\sqrt{n}} \sqrt{p_e + p_e^2 - \sum_{i=1}^{m} p_{i+}p_{+i}(p_{i+} + p_{+i})}. \qquad (4)$$

Note that the standard error depends only on the marginal distributions and not at all on the individual cell frequencies, an observation that is useful from a programming point of view.
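As a worked illustration (with hypothetical counts, not taken from the paper's data), suppose two coders classify n = 100 questions into m = 2 categories, with cell counts $n_{11} = 40$, $n_{12} = 10$, $n_{21} = 5$, $n_{22} = 45$, so that $p_{1+} = 0.50$, $p_{+1} = 0.45$, $p_{2+} = 0.50$, $p_{+2} = 0.55$. Then

$$p_o = 0.40 + 0.45 = 0.85, \qquad p_e = (0.50)(0.45) + (0.50)(0.55) = 0.50,$$

$$\kappa = \frac{0.85 - 0.50}{1 - 0.50} = 0.70,$$

$$\widehat{s.e.}(\hat{\kappa}) = \frac{\sqrt{0.50 + 0.50^2 - [(0.225)(0.95) + (0.275)(1.05)]}}{(1 - 0.50)\sqrt{100}} \approx 0.0995.$$

The resulting z-statistic of about 7.0 rejects the null hypothesis of no agreement beyond chance, and by the Landis and Koch guideline a kappa of .70 represents fair to good agreement.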

II. Three Versions of SAS Program

There are three versions of the SAS program that I wrote. They differ based on how the data are set up, whether there are one or two behavior codes assigned to respondent behaviors, and whether the kappa statistics are computed to evaluate coders or individual questions. All three versions assume that there are four behavior coders.

Data Setup

All three versions of the program assume that your data set is a SAS data set. The data set can have either of the two types of data setup, as follows:

Setup 1: There is one observation for each unique combination of interview number (INTNUM), question number (QNUM), and behavior coder (BCODER). Each observation has additional information on the interviewer behavior code (FRCODE) and one respondent behavior code (R1CODE). See an example of this data setup in Figure 1; in this example, Q is the question number. The data-specific variable names in the input data are renamed to a standard set of variable names in the first part of my SAS program. This is usually the setup for codes that are keyed into a data base from a paper questionnaire form.

Figure 1. Example of Data Setup 1
First 10 observations in a hypothetical data set:

  Obs  BCODER  INTNUM  Q   FRCODE*  R1CODE**
  1    coder1  703484  1   M        A
  2    coder1  703484  1A  E        A
  3    coder1  703484  2   V        B
  4    coder1  703484  3   E        A
  5    coder2  703484  1   E        A
  6    coder2  703484  1A  E        Q
  7    coder2  703484  2   S        A
  8    coder2  703484  3   E        I
  9    coder1  260972  1   E        C
  10   coder1  260972  1A  E        N

  * E=exact wording, M=major change, S=slight change, V=verify answer.
  ** A=adequate answer, B=break-in, C=clarification, I=inadequate answer, N=no answer, Q=qualified answer.

Setup 2: There is one observation for each unique combination of INTNUM and BCODER. Each observation has a set of all interviewer behavior codes (FRCODE) in the order of the questions. The interviewer behavior codes are followed by a set of all respondent behavior codes (R1CODE). If there is a set of second respondent behavior codes (R2CODE), it follows the set of first respondent behavior codes. Hence, there is no explicit QNUM variable in this data setup; QNUM is implicit in the order of the behavior codes.
See an example of this data setup in Figure 2. This is usually the setup for codes that have been entered using an automated Computer Assisted Interview (CAI) system. Setup 2 requires less computer storage space than Setup 1. Suppose five interviews are behavior-coded by four coders, and there are 50 questions in the questionnaire. Then the data file in Setup 1 has 5 x 4 x 50 = 1,000 records, while the data file in Setup 2 has 5 x 4 = 20 records, although each record has a much longer record length.
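Converting between the two setups is mechanical. The sketch below, which is not part of the original program and assumes 50 questions and hypothetical data set names, expands each Setup 2 record into one Setup 1 record per question using arrays:

  /* Sketch: expand one Setup 2 record (wide) into one Setup 1 */
  /* record (long) per question. Names are illustrative only.  */
  data setup1;
    set setup2;                        /* one obs per INTNUM x BCODER   */
    array fr{50} frcode1-frcode50;     /* interviewer codes by question */
    array r1{50} r1code1-r1code50;     /* first respondent codes        */
    do qnum = 1 to 50;
      frcode = fr{qnum};
      r1code = r1{qnum};
      output;                          /* one record per question       */
    end;
    keep bcoder intnum qnum frcode r1code;
  run;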

Figure 2. Example of Data Setup 2
First five observations in a hypothetical data set:

  INTNUM  BCODER  FRCODE1-FRCODE4  R1CODE1-R1CODE4  R2CODE1-R2CODE4
  1       P       S  M  M  E       B  Q  A  B       N  N  I  I
  2       N       S  M  M  E       B  Q  A  C       N  N  N  N
  3       R       S  M  M  E       A  A  A  I       N  N  B  I
  4       S       S  S  S  E       B  A  A  B       N  N  I  N
  5       N       E  S  E  E       A  B  B  C       N  N  N  N

Number of Respondent Behavior Codes

Some behavior coding projects allow only one code each for interviewer and respondent behaviors, while others allow more than one code. In the latter case, it is up to the researcher to decide whether to summarize the multiple codes into one code or to use all multiple behavior codes. Currently, the programs presented here compute the percent agreement and the kappa statistic for one or two codes for respondent behavior, but the programs can be easily adapted for more codes. When there are two respondent behavior codes, the kappa statistic is computed in two ways, as follows:

1. The two respondent behavior codes assigned by Coder 1 are compared with those assigned by Coder 2 to see if there is at least one common behavior code between the two sets of codes. This common code cannot be the 'no response' category ('N'). For example, 'AB' and 'BN' would result in an agreement, while 'AN' and 'CN' would result in a disagreement. The two coders are required to have assigned the same number of codes for each interview. If one coder assigned a behavior code to a question while the other coder did not, then a dummy category of 'no response' has to be created to ensure the balance before the program can be used.

2. The two sets of codes are compared to see whether they are exactly the same. For example, 'AB' and 'BA' would result in an agreement, 'AN' and 'AN' would result in an agreement, and 'AB' and 'BN' would result in a disagreement.

Different Uses of Kappa

Kappa statistics are computed for various purposes. The SAS program presented here handles either of the two situations described below.

Coder Evaluation: A common use of the kappa statistics based on behavior coding is to evaluate whether the coding is done consistently by coders. For this type of evaluation, the result of behavior coding by two coders is collected across all questions and interviews and summarized in a two-way frequency table like the one in Table 1. A kappa statistic is computed for each pair of coders. Hence, if there are four coders, six kappa statistics are computed. See Figure 3 for an example of the output of coder evaluation.

Question Evaluation: It is possible to use the kappa statistics to evaluate whether the coding was done consistently for each survey question in the questionnaire. For each individual question, the result of behavior coding by two coders is first combined across all interviews. Hence, the number of interviews that both coders behavior-coded becomes 'n', the total number of observations, in Table 1. It is important to have a sufficiently large 'n' to avoid having all observations classified into one cell on the diagonal, which will result in an undefined kappa. After kappa is computed for all possible combinations of two coders for a particular question, the estimate of the overall kappa for the question can be computed as

$$\hat{\kappa}_{overall} = \frac{\sum_{j=1}^{x} \hat{\kappa}_j / \hat{V}(\hat{\kappa}_j)}{\sum_{j=1}^{x} 1 / \hat{V}(\hat{\kappa}_j)}, \qquad (5)$$

where x is the number of all possible pairs of the coders, and $\hat{V}(\hat{\kappa}_j)$ is the squared standard error of $\hat{\kappa}_j$. The standard error of $\hat{\kappa}_{overall}$ is given by

$$s.e.(\hat{\kappa}_{overall}) = \sqrt{\frac{1}{\sum_{j=1}^{x} 1/\hat{V}(\hat{\kappa}_j)}}. \qquad (6)$$
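In SAS terms, the pooling in equations (5) and (6) amounts to a few DATA step lines. The sketch below is illustrative only; it assumes a hypothetical data set KAPPAS with one observation per coder pair holding that pair's kappa (K) and standard error (SE) for a single question:

  data pooled;
    set kappas end=last;         /* assumed: one obs per coder pair, variables k and se */
    w = 1/se**2;                 /* weight = reciprocal of the squared standard error   */
    sum_w + w;                   /* sum statements retain across observations           */
    sum_wk + w*k;
    if last then do;
      k_overall = sum_wk/sum_w;        /* equation (5) */
      se_overall = sqrt(1/sum_w);      /* equation (6) */
      output;
    end;
    keep k_overall se_overall;
  run;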

Suppose we have four coders, for example, and each coded eleven interviews. Then 'n' is 11 in this example, and there are six unique combinations of two coders. Therefore, six kappa statistics are computed for each question. These six estimates of kappa are then combined to compute the estimate of the overall kappa for each question.

Summary of Three Program Versions

Table 2 below describes the three versions of my SAS program in terms of the data setup, number of respondent behavior codes, and use of kappa. Currently, Data Setup 1 is combined with one respondent behavior code in Version 1, and Data Setup 2 is combined with two respondent behavior codes in Versions 2 and 3. Version 3 meets the most complex condition defined by the three factors. Examples of the outputs for Versions 1 and 2 are shown in Figure 3, and for Version 3 in Figure 4.

Table 2. Three Program Versions by Data Setup, Number of Respondent Behavior Codes, and Use of Kappa

  Use of Kappa          Setup 1 / one respondent code    Setup 2 / two respondent codes
  Coder evaluation      Version 1                        Version 2
  Question evaluation   --                               Version 3

III. Kappa Statistic from PROC FREQ

With Release 6.10 of the SAS System, the kappa statistic can be computed using PROC FREQ. This option requires a data setup in which there is one observation for each unique combination of interview number and question number; each record then includes the behavior codes given by all coders.

Example: Suppose we have the following data on interviewer behavior from two behavior coders:

  INTNUM  QNUM  FRCODE1  FRCODE2
  1       1     A        A
  1       2     E        A
  1       3     A        S

FRCODE1 is the code for interviewer behavior given by Coder 1, and FRCODE2 by Coder 2. Either Data Setup 1 or Setup 2 mentioned in Section II can be readily converted to this arrangement. Then the following SAS statements will compute a kappa statistic (see pp. 98-102 in Stokes, Davis, and Koch, 1995):

  PROC FREQ;
    TABLES FRCODE1*FRCODE2 / AGREE;
  RUN;

This method of computation would be an excellent choice if a researcher is dealing with two behavior coders, only one code is assigned for respondent behavior, and the kappa is computed for coder evaluation. If there are more than two coders, a SAS macro can be used to repeat the computation for each unique combination of two coders, as sketched below. If two codes are given for respondent behavior, they will have to be combined into one code before this method can be used. If individual questions, rather than coders, are being evaluated, the overall kappa estimate and its standard error will have to be computed by hand.
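For the more-than-two-coders case, the repetition could be scripted along the following lines. This is a sketch rather than part of the paper's program; the data set name CODES and the variables FRCODE1-FRCODE4 are assumed for illustration:

  %macro pairkappa(c1, c2);
    proc freq data=codes;                    /* assumed data set name       */
      tables frcode&c1*frcode&c2 / agree;    /* kappa via the AGREE option  */
    run;
  %mend pairkappa;

  %pairkappa(1,2)  %pairkappa(1,3)  %pairkappa(1,4)
  %pairkappa(2,3)  %pairkappa(2,4)  %pairkappa(3,4)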

IV. Conclusion

The SAS program presented in this paper was developed primarily for reliability studies based on behavior coding. The program was written to accommodate production of multiple kappa statistics when there are more than two observers, different input modes, two respondent codes to compare, and different reasons for use of the kappa statistic. The program can be easily adapted, however, for other types of observer agreement studies that meet similar requirements.

Figure 3: Example of an Output from Versions 1 and 2

Percent Agreement, Kappa Statistic, and Estimated Standard Error: "p_" stands for percent, "k_" for kappa, and "se_" for standard error; "int" stands for interviewer code, "rsp" for agreement on at least one of the two respondent codes, and "rboth" for agreement on both respondent codes.

  CODER1  CODER2  p_int   k_int   se_int  p_rsp   k_rsp   se_rsp  p_rboth  k_rboth  se_rboth
  1       2       0.8658  0.7457  0.0562  0.8841  0.7379  0.0451  0.7682   0.5528   0.0390
  1       3       0.8719  0.7557  0.0563  0.9207  0.8007  0.0502  0.8536   0.7185   0.0442
  1       4       0.8353  0.6868  0.0561  0.9329  0.8389  0.0487  0.8597   0.7399   0.0433
  2       3       0.8963  0.8009  0.0620  0.8780  0.7260  0.0457  0.7926   0.5839   0.0416
  2       4       0.8719  0.7520  0.0637  0.9268  0.8408  0.0466  0.7987   0.6140   0.0394
  3       4       0.9024  0.8090  0.0648  0.9695  0.9273  0.0492  0.9207   0.8485   0.0443

Figure 4: Example of an Output from Version 3

The Common Underlying Value of Kappa & Its Standard Error

  QNUM  k_int   se_int  k_rsp   se_rsp  k_rboth  se_rboth
  1     1.0000  0.1826  0.8373  0.1685  0.6406   0.1425
  2     0.5247  0.1139  1.0000  0.1058  0.8431   0.0973
  3     0.8337  0.1332  1.0000  0.1282  0.6495   0.0837
  4     1.0000  0.0000  0.6145  0.1277  0.6145   0.1276
  5     0.4570  0.1136  0.8805  0.0939  0.4191   0.0770
  6     0.6298  0.1346  0.9145  0.1094  0.5602   0.0900
  7     0.8299  0.1281  0.7500  0.1907  0.3589   0.0797

Appendix: Version 2 Program

/*******************************************
This SAS program is written for the UNIX workstation platform. Version 2 handles Data Setup 2, in which there is one observation for each unique combination of interview number (INTNUM) and behavior coder (BCODER). Each observation has a set of all interviewer behavior codes (FRCODE) in the order of the questions. The interviewer codes are followed by one or two sets of respondent codes (R1CODE, R2CODE). Version 2 computes the kappa statistic for each pair of coders. The program assumes four behavior coders. The estimated standard error is appropriate for testing the null hypothesis of 'kappa=0'.

To use this program, you have to replace the code typed in capital letters with your data below. For example, you have to supply:
- information in the LIBNAME where your file resides.
- codes for the two sets of respondent behavior if the respondent codes (R1CODE & R2CODE) are not in alphabetical order.
- the code for each behavior coder (BCODER).
*******************************************/

options nosymbolgen;
libname behavior '/HOME/YEWSIBEHAVB';

* Rename your variables to the program variable names.;
* R1CODE is the first behavior code and R2CODE is the second behavior code for a respondent. R1CODE and R2CODE should be in alphabetical order. If not, exchange the order of the codes.;
data recode;
  set behavior.new(rename=(CODER=bcoder INTRVW=intnum QUESTN=qnum
                           INTRVWER=frcode RCODE1=r1code RCODE2=r2code));
  /* The rename is done on the SET statement so the new names can be
     used within this step. Skip the next four lines if R1CODE and
     R2CODE for a respondent are already ordered alphabetically. */
  if r2code='A' then do;
    r2code=r1code;
    r1code='A';
  end;
  if r1code='C' & r2code=' ' then r2code='N';
  attrib rpcode length=$2;
  rpcode = r1code || r2code;
run;

* Create one file for each behavior coder. This program assumes four behavior coders.;
* The user has to supply each behavior coder's code shown in capital letters in single quotes below.;
data coder1 coder2 coder3 coder4;
  set recode;
  if bcoder='IP61' then output coder1;
  else if bcoder='IN94' then output coder2;
  else if bcoder='IR76' then output coder3;
  else if bcoder='IS58' then output coder4;
  else abort;
run;

/*******************************************
MACRO SORT first sorts each file created in the previous step by the interview number and question. Then the marginal percent distribution of behavior codes is obtained, one for interviewers and the others for respondents, for each behavior coder. From this point on, the program runs on its own.
*******************************************/
%macro sort(file1);
  proc sort data=&file1; by intnum qnum; run;
  proc freq data=&file1 noprint; tables frcode/out=fr&file1; run;
  proc freq data=&file1 noprint; tables r1code/out=r1&file1; run;
  proc freq data=&file1 noprint; tables r2code/out=r2&file1; run;
  proc freq data=&file1 noprint; tables rpcode/out=rp&file1; run;
%mend sort;

* fr&file1 has the variables frcode, count, and percent, with one observation per frcode (i.e., interviewer behavior code) category. Execute macro sort for each behavior coder.;

%sort(coder1)
%sort(coder2)
%sort(coder3)
%sort(coder4)

%macro score;
data one&first&second two&first&second comb&first&second both;
  merge coder&first(in=one)
        coder&second(in=two rename=(bcoder=bcoder2 frcode=frcode2
                     r1code=r1code2 r2code=r2code2 rpcode=rpcode2));
  by intnum qnum;
  attrib count score1 score2 score3 length=4;
  if one=1 & two=0 then output one&first&second;
  else if one=0 & two=1 then output two&first&second;
  else do;
    if frcode=' ' and frcode2=' ' then delete;
    if r1code='N' or r1code2='N' then abort;
    if frcode=frcode2 then score1=1; else score1=0;
    if r1code=r1code2 or r1code=r2code2 or r2code=r1code2
       or (r2code^='N' & r2code=r2code2) then score2=1;
    else score2=0;
    if rpcode=rpcode2 then score3=1; else score3=0;
    if r2code^='N' and r2code2^='N' then output both;
    count+1;
    call symput('cases',left(put(count,12.)));
    output comb&first&second;
  end;
run;

proc summary data=comb&first&second;
  var score1 score2 score3;
  output out=result&first&second sum=;
run;

data pct&first&second;
  set result&first&second;
  attrib coder1 coder2 length=4;
  coder1=&first;
  coder2=&second;
  p_int=score1/_freq_;
  p_rsp=score2/_freq_;
  p_rboth=score3/_freq_;
run;
%mend score;

%let first=1; %let second=2; %score
%let second=3; %score
%let second=4; %score
%let first=2; %let second=3; %score
%let second=4; %score
%let first=3; %let second=4; %score

data join;
  set pct12 pct13 pct14 pct23 pct24 pct34;
run;

/*******************************************
Compute the overall proportion of chance-expected agreement.
*******************************************/
%macro chance(who);
data marginal(drop=i j);
  merge &who.coder1(keep=percent &who.code rename=(percent=pct1))
        &who.coder2(keep=percent &who.code rename=(percent=pct2))
        &who.coder3(keep=percent &who.code rename=(percent=pct3))
        &who.coder4(keep=percent &who.code rename=(percent=pct4));
  by &who.code;
  attrib coder1 coder2 length=4;
  array pct(4) pct1-pct4;
  do i=1 to 3;
    do j=i+1 to 4;
      coder1=i;
      coder2=j;
      p=pct(i)*pct(j)/10000;        /* product of marginal proportions      */
      sp=p*(pct(i)+pct(j))/100;     /* term for the standard error, eq. (4) */
      drop pct1 pct2 pct3 pct4;
      output;
    end;
  end;
run;
proc sort; by coder1 coder2; run;

proc summary data=marginal;
  by coder1 coder2;
  var p sp;
  output out=&who.chance(drop=_freq_ _type_) sum=;
run;
%mend chance;

%chance(fr)
%chance(r1)
%chance(rp)

options symbolgen;
data all;
  merge join
        frchance(rename=(p=fr_p sp=fr_sp))
        r1chance(rename=(p=r1_p sp=r1_sp))
        rpchance(rename=(p=rp_p sp=rp_sp));
  by coder1 coder2;
  k_int=(p_int - fr_p)/(1 - fr_p);
  fr_rad=fr_p+fr_p**2-fr_sp;
  se_int=1/((1-fr_p)*sqrt(&cases))*sqrt(fr_rad);
  k_rsp=(p_rsp - r1_p)/(1 - r1_p);
  r1_rad=r1_p+r1_p**2-r1_sp;
  se_rsp=1/((1-r1_p)*sqrt(&cases))*sqrt(r1_rad);
  k_rboth=(p_rboth - rp_p)/(1 - rp_p);
  rp_rad=rp_p+rp_p**2-rp_sp;
  se_rboth=1/((1-rp_p)*sqrt(&cases))*sqrt(rp_rad);
  drop _freq_ score1 score2 score3;
run;

proc print;
  id coder1;
  var coder2 p_int k_int se_int p_rsp k_rsp se_rsp p_rboth k_rboth se_rboth;
  title1 'Percent Agreement, Estimated Kappa Statistic, and Estimated Standard Error:';
  title2 '"p_" stands for percent, "k_" stands for kappa, "se_" stands for standard error,';
  title3 '"int" for interviewer code,';
  title4 '"rsp" for agreement on at least one of the two respondent codes,';
  title5 '"rboth" for agreement on both respondent codes';
run;

REFERENCES

Cohen, J. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20:37-46.

Fleiss, J. L., J. Cohen, and B. S. Everitt. 1969. Large-sample standard errors of kappa and weighted kappa. Psychological Bulletin 72:323-327.

Landis, J. R., and G. G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics 33:159-179.

Stokes, M. E., C. S. Davis, and G. G. Koch. 1995. Categorical Data Analysis Using the SAS System. SAS Institute Inc., Cary, NC.

NOTE: SAS is a registered trademark of SAS Institute, Inc., Cary, NC, USA.