AN ALTERNATIVE WAY TO COMPUTE THE KAPPA STATISTICS IN OBSERVER AGREEMENT STUDIES

Yukiko Ellis, U.S. Bureau of the Census, Washington, D.C.

The kappa statistic, first proposed by Cohen (1960), is a coefficient of observer agreement for categorical data. Researchers in medicine, epidemiology, education, and psychology have widely used the statistic to evaluate measurement error due to observer error. This paper presents a SAS computer program that computes the percent agreement and kappa statistics that are often needed in reliability studies based on behavior coding. The first section briefly describes behavior coding, the percent agreement, and the kappa statistic. The second section describes three versions of my SAS program, which differ in the purpose for computing the kappa and in how the data are set up. The third section describes an option to compute the kappa statistic through use of PROC FREQ with the AGREE option under Release 6.10 of SAS. The program presented here has a few advantages over PROC FREQ in that it provides flexibility in the input data setup and in the number of observers and measurements involved in the comparison. The appendix presents the SAS code for Version 2 of my program.

I. Background

Behavior coding is a procedure in which trained staff listen to audiotapes of survey interviews and code interviewer and respondent behaviors. Interviewer codes include appropriate reading of a question, changes to question wording, verification or suggestion of an answer, and erroneous skipping of a question. Respondent codes include question interruptions, expressions of uncertainty, qualified and uncodable answers, and 'don't know' and refusal responses. The purpose of behavior coding is thus to evaluate the quality of survey questions and to identify the cognitive problems they pose.

To assess the extent to which behavior coding is reliable, we must have all coders in a project behavior-code the same set of interviews and examine how well they apply the same codes under the same conditions. If agreement among the coders is low, then the usefulness of the behavior codings is severely limited.

Suppose now that each of a sample of n questions is coded in regard to interviewer behavior by two behavior coders, with the ratings on a nominal scale with m categories. The results of behavior coding can be summarized in a two-way frequency table as in Table 1. In this table, 'm' is the number of all possible behavior code categories for interviewers. Each survey question is classified in this table according to how the two behavior coders coded the question. For example, if Coder 1 assigned a question to the fourth category of the interviewer behavior code, while Coder 2 assigned the second category, then this question would be classified into cell (4,2).

Table 1. Joint distribution of codings by two coders for classification with m categories

                         Coder 2
  Coder 1    1      2      ...    m      Total
  1          n11    n12    ...    n1m    n1+
  2          n21    n22    ...    n2m    n2+
  ...        ...    ...    ...    ...    ...
  m          nm1    nm2    ...    nmm    nm+
  Total      n+1    n+2    ...    n+m    n

There are many ways one can measure intercoder agreement. The simplest index of agreement is the overall percent agreement,

$$p_o = p_{11} + p_{22} + \cdots + p_{mm} = \sum_{i=1}^{m} p_{ii}, \qquad (1)$$

where $p_{ij} = n_{ij}/n$, $1 \le i, j \le m$. Kappa is another measure of agreement. Kappa has a desirable property that the percent agreement does not have; namely, kappa incorporates a correction for chance-expected agreement into the assessment of intercoder reliability.
Let $p_o$ denote the overall proportion of observed agreement as in equation (1). If the codings are statistically independent, then the overall proportion of chance-expected agreement is

$$p_e = p_{1+}p_{+1} + p_{2+}p_{+2} + \cdots + p_{m+}p_{+m} = \sum_{i=1}^{m} p_{i+}p_{+i}, \qquad (2)$$

where $p_{i+} = \sum_j p_{ij}$ and $p_{+j} = \sum_i p_{ij}$. The overall value of kappa is then defined as

$$\kappa = \frac{p_o - p_e}{1 - p_e}. \qquad (3)$$

If there is complete agreement between the two coders, $\kappa = +1$. This happens, for example, when all observations in the frequency table belong to two or more cells on the diagonal, resulting in $p_o = 1.0$ and $p_e < 1$. If all observations fall in one cell on the diagonal, then $p_e = 1.0$ and $\kappa$ becomes undefined. The occurrence of an undefined kappa is rare as long as we have a sufficiently large number of observations. If observed agreement is greater than or equal to chance agreement, $\kappa \ge 0$; if observed agreement is less than or equal to chance agreement, $\kappa \le 0$. If $p_o$ and $p_e$ are such that $p_e = (1 + p_o)/2$, then the minimum value of $\kappa$ equals -1. Otherwise, the minimum value depends on the marginal proportions and is between -1 and 0, inclusive. Landis and Koch (1977) suggest that, for most practical purposes, values of kappa greater than .75 may be taken to represent excellent agreement beyond chance, values below .40 poor agreement beyond chance, and values between .40 and .75 fair to good agreement beyond chance.

The standard error of kappa computed by the SAS program presented in this paper is appropriate for testing the null hypothesis that the underlying value of kappa is zero (Fleiss, Cohen, and Everitt, 1969). It is estimated by

$$\widehat{s.e.}(\hat{\kappa}) = \frac{1}{(1 - p_e)\sqrt{n}} \sqrt{p_e + p_e^2 - \sum_{i=1}^{m} p_{i+}p_{+i}(p_{i+} + p_{+i})}. \qquad (4)$$

Note that the standard error depends only on the marginal distributions and not at all on the individual cell frequencies, an observation that is useful from a programming point of view.
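As a worked illustration (with hypothetical counts, not taken from the paper's data), suppose two coders classify n = 100 questions into m = 2 categories, with cell counts $n_{11} = 40$, $n_{12} = 10$, $n_{21} = 5$, $n_{22} = 45$, so that $p_{1+} = 0.50$, $p_{+1} = 0.45$, $p_{2+} = 0.50$, $p_{+2} = 0.55$. Then

$$p_o = 0.40 + 0.45 = 0.85, \qquad p_e = (0.50)(0.45) + (0.50)(0.55) = 0.50,$$

$$\kappa = \frac{0.85 - 0.50}{1 - 0.50} = 0.70,$$

$$\widehat{s.e.}(\hat{\kappa}) = \frac{\sqrt{0.50 + 0.50^2 - [(0.225)(0.95) + (0.275)(1.05)]}}{(1 - 0.50)\sqrt{100}} \approx 0.0995.$$

The resulting z-statistic of about 7.0 rejects the null hypothesis of no agreement beyond chance, and by the Landis and Koch guideline a kappa of .70 represents fair to good agreement.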

II. Three Versions of SAS Program

There are three versions of the SAS program that I wrote. They differ based on how the data are set up, whether there are one or two behavior codes assigned to respondent behaviors, and whether the kappa statistics are computed to evaluate coders or individual questions. All three versions assume that there are four behavior coders.

Data Setup

All three versions of the program assume that your data set is a SAS data set. The data set can have either of the two types of data setup, as follows:

Setup 1: There is one observation for each unique combination of interview number (INTNUM), question number (QNUM), and behavior coder (BCODER). Each observation has additional information on the interviewer behavior code (FRCODE) and one respondent behavior code (R1CODE). See an example of this data setup in Figure 1; in this example, Q is the question number. The data-specific variable names in the input data are renamed to a standard set of variable names in the first part of my SAS program. This is usually the setup for codes that are keyed into a data base from a paper questionnaire form.

Figure 1. Example of Data Setup 1
First 10 observations in a hypothetical data set:

  Obs  BCODER  INTNUM  Q   FRCODE*  R1CODE**
  1    coder1  703484  1   M        A
  2    coder1  703484  1A  E        A
  3    coder1  703484  2   V        B
  4    coder1  703484  3   E        A
  5    coder2  703484  1   E        A
  6    coder2  703484  1A  E        Q
  7    coder2  703484  2   S        A
  8    coder2  703484  3   E        I
  9    coder1  260972  1   E        C
  10   coder1  260972  1A  E        N

  * E=exact wording, M=major change, S=slight change, V=verify answer.
  ** A=adequate answer, B=break-in, C=clarification, I=inadequate answer, N=no answer, Q=qualified answer.

Setup 2: There is one observation for each unique combination of INTNUM and BCODER. Each observation has a set of all interviewer behavior codes (FRCODE) in the order of the questions. The interviewer behavior codes are followed by a set of all respondent behavior codes (R1CODE). If there is a set of second respondent behavior codes (R2CODE), it follows the set of first respondent behavior codes. Hence, there is no explicit QNUM variable in this data setup; QNUM is implicit in the order of the behavior codes.
See an example of this data setup in Figure 2. This is usually the setup for codes that have been entered using an automated Computer Assisted Interview (CAI) system. Setup 2 requires less computer storage space than Setup 1. Suppose five interviews are behavior-coded by four coders, and there are 50 questions in the questionnaire. Then the data file in Setup 1 has 5 x 4 x 50 = 1,000 records, while the data file in Setup 2 has 5 x 4 = 20 records, although each record has a much longer record length.
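Converting between the two setups is mechanical. The sketch below, which is not part of the original program and assumes 50 questions and hypothetical data set names, expands each Setup 2 record into one Setup 1 record per question using arrays:

  /* Sketch: expand one Setup 2 record (wide) into one Setup 1 */
  /* record (long) per question. Names are illustrative only.  */
  data setup1;
    set setup2;                        /* one obs per INTNUM x BCODER   */
    array fr{50} frcode1-frcode50;     /* interviewer codes by question */
    array r1{50} r1code1-r1code50;     /* first respondent codes        */
    do qnum = 1 to 50;
      frcode = fr{qnum};
      r1code = r1{qnum};
      output;                          /* one record per question       */
    end;
    keep bcoder intnum qnum frcode r1code;
  run;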

Figure 2. Example of Data Setup 2
First five observations in a hypothetical data set:

  INTNUM  BCODER  FRCODE1-FRCODE4  R1CODE1-R1CODE4  R2CODE1-R2CODE4
  1       P       S  M  M  E       B  Q  A  B       N  N  I  I
  2       N       S  M  M  E       B  Q  A  C       N  N  N  N
  3       R       S  M  M  E       A  A  A  I       N  N  B  I
  4       S       S  S  S  E       B  A  A  B       N  N  I  N
  5       N       E  S  E  E       A  B  B  C       N  N  N  N

Number of Respondent Behavior Codes

Some behavior coding projects allow only one code each for interviewer and respondent behaviors, while others allow more than one code. In the latter case, it is up to the researcher to decide whether to summarize the multiple codes into one code or to use all multiple behavior codes. Currently, the programs presented here compute the percent agreement and the kappa statistic for one or two codes for respondent behavior, but the programs can be easily adapted for more codes. When there are two respondent behavior codes, the kappa statistic is computed in two ways, as follows:

1. The two respondent behavior codes assigned by Coder 1 are compared with those assigned by Coder 2 to see if there is at least one common behavior code between the two sets of codes. This common code cannot be the 'no response' category ('N'). For example, 'AB' and 'BN' would result in an agreement, while 'AN' and 'CN' would result in a disagreement. The two coders are required to have assigned the same number of codes for each interview. If one coder assigned a behavior code to a question while the other coder did not, then a dummy category of 'no response' has to be created to ensure the balance before the program can be used.

2. The two sets of codes are compared to see whether they are exactly the same. For example, 'AB' and 'BA' would result in an agreement, 'AN' and 'AN' would result in an agreement, and 'AB' and 'BN' would result in a disagreement.

Different Uses of Kappa

Kappa statistics are computed for various purposes. The SAS program presented here handles either of the two situations described below.

Coder Evaluation: A common use of the kappa statistics based on behavior coding is to evaluate whether the coding is done consistently by coders. For this type of evaluation, the result of behavior coding by two coders is collected across all questions and interviews and summarized in a two-way frequency table like the one in Table 1. A kappa statistic is computed for each pair of coders. Hence, if there are four coders, six kappa statistics are computed. See Figure 3 for an example of the output of coder evaluation.

Question Evaluation: It is possible to use the kappa statistics to evaluate whether the coding was done consistently for each survey question in the questionnaire. For each individual question, the result of behavior coding by two coders is first combined across all interviews. Hence, the number of interviews that both coders behavior-coded becomes 'n', the total number of observations, in Table 1. It is important to have a sufficiently large 'n' to avoid having all observations classified into one cell on the diagonal, which will result in an undefined kappa. After kappa is computed for all possible combinations of two coders for a particular question, the estimate of the overall kappa for the question can be computed as

$$\hat{\kappa}_{overall} = \frac{\sum_{j=1}^{x} \hat{\kappa}_j / \hat{V}(\hat{\kappa}_j)}{\sum_{j=1}^{x} 1 / \hat{V}(\hat{\kappa}_j)}, \qquad (5)$$

where x is the number of all possible pairs of the coders, and $\hat{V}(\hat{\kappa}_j)$ is the squared standard error of $\hat{\kappa}_j$. The standard error of $\hat{\kappa}_{overall}$ is given by

$$s.e.(\hat{\kappa}_{overall}) = \sqrt{\frac{1}{\sum_{j=1}^{x} 1/\hat{V}(\hat{\kappa}_j)}}. \qquad (6)$$
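In SAS terms, the pooling in equations (5) and (6) amounts to a few DATA step lines. The sketch below is illustrative only; it assumes a hypothetical data set KAPPAS with one observation per coder pair holding that pair's kappa (K) and standard error (SE) for a single question:

  data pooled;
    set kappas end=last;         /* assumed: one obs per coder pair, variables k and se */
    w = 1/se**2;                 /* weight = reciprocal of the squared standard error   */
    sum_w + w;                   /* sum statements retain across observations           */
    sum_wk + w*k;
    if last then do;
      k_overall = sum_wk/sum_w;        /* equation (5) */
      se_overall = sqrt(1/sum_w);      /* equation (6) */
      output;
    end;
    keep k_overall se_overall;
  run;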

Suppose we have four coders, for example, and each coded eleven interviews. Then 'n' is 11 in this example, and there are six unique combinations of two coders. Therefore, six kappa statistics are computed for each question. These six estimates of kappa are then combined to compute the estimate of the overall kappa for each question.

Summary of Three Program Versions

Table 2 below describes the three versions of my SAS program in terms of the data setup, number of respondent behavior codes, and use of kappa. Currently, Data Setup 1 is combined with one respondent behavior code in Version 1, and Data Setup 2 is combined with two respondent behavior codes in Versions 2 and 3. Version 3 meets the most complex condition defined by the three factors. Examples of the outputs for Versions 1 and 2 are shown in Figure 3, and for Version 3 in Figure 4.

Table 2. Three Program Versions by Data Setup, Number of Respondent Behavior Codes, and Use of Kappa

  Use of Kappa          Setup 1 / one respondent code    Setup 2 / two respondent codes
  Coder evaluation      Version 1                        Version 2
  Question evaluation   --                               Version 3

III. Kappa Statistic from PROC FREQ

With Release 6.10 of the SAS System, the kappa statistic can be computed using PROC FREQ. This option requires a data setup in which there is one observation for each unique combination of interview number and question number; each record then includes the behavior codes given by all coders.

Example: Suppose we have the following data on interviewer behavior from two behavior coders:

  INTNUM  QNUM  FRCODE1  FRCODE2
  1       1     A        A
  1       2     E        A
  1       3     A        S

FRCODE1 is the code for interviewer behavior given by Coder 1, and FRCODE2 by Coder 2. Either Data Setup 1 or Setup 2 mentioned in Section II can be readily converted to this arrangement. Then the following SAS statements will compute a kappa statistic (see pp. 98-102 in Stokes, Davis, and Koch, 1995):

  PROC FREQ;
    TABLES FRCODE1*FRCODE2 / AGREE;
  RUN;

This method of computation would be an excellent choice if a researcher is dealing with two behavior coders, only one code is assigned for respondent behavior, and the kappa is computed for coder evaluation. If there are more than two coders, a SAS macro can be used to repeat the computation for each unique combination of two coders, as sketched below. If two codes are given for respondent behavior, they will have to be combined into one code before this method can be used. If individual questions, rather than coders, are being evaluated, the overall kappa estimate and its standard error will have to be computed by hand.
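For the more-than-two-coders case, the repetition could be scripted along the following lines. This is a sketch rather than part of the paper's program; the data set name CODES and the variables FRCODE1-FRCODE4 are assumed for illustration:

  %macro pairkappa(c1, c2);
    proc freq data=codes;                    /* assumed data set name       */
      tables frcode&c1*frcode&c2 / agree;    /* kappa via the AGREE option  */
    run;
  %mend pairkappa;

  %pairkappa(1,2)  %pairkappa(1,3)  %pairkappa(1,4)
  %pairkappa(2,3)  %pairkappa(2,4)  %pairkappa(3,4)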

IV. Conclusion

The SAS program presented in this paper was developed primarily for reliability studies based on behavior coding. The program was written to accommodate production of multiple kappa statistics when there are more than two observers, different input modes, two respondent codes to compare, and different reasons for use of the kappa statistic. The program can be easily adapted, however, for other types of observer agreement studies that meet similar requirements.

Figure 3: Example of an Output from Versions 1 and 2

Percent Agreement, Kappa Statistic, and Estimated Standard Error: "p_" stands for percent, "k_" for kappa, and "se_" for standard error; "int" stands for interviewer code, "rsp" for agreement on at least one of the two respondent codes, and "rboth" for agreement on both respondent codes.

  CODER1  CODER2  p_int   k_int   se_int  p_rsp   k_rsp   se_rsp  p_rboth  k_rboth  se_rboth
  1       2       0.8658  0.7457  0.0562  0.8841  0.7379  0.0451  0.7682   0.5528   0.0390
  1       3       0.8719  0.7557  0.0563  0.9207  0.8007  0.0502  0.8536   0.7185   0.0442
  1       4       0.8353  0.6868  0.0561  0.9329  0.8389  0.0487  0.8597   0.7399   0.0433
  2       3       0.8963  0.8009  0.0620  0.8780  0.7260  0.0457  0.7926   0.5839   0.0416
  2       4       0.8719  0.7520  0.0637  0.9268  0.8408  0.0466  0.7987   0.6140   0.0394
  3       4       0.9024  0.8090  0.0648  0.9695  0.9273  0.0492  0.9207   0.8485   0.0443

Figure 4: Example of an Output from Version 3

The Common Underlying Value of Kappa & Its Standard Error

  QNUM  k_int   se_int  k_rsp   se_rsp  k_rboth  se_rboth
  1     1.0000  0.1826  0.8373  0.1685  0.6406   0.1425
  2     0.5247  0.1139  1.0000  0.1058  0.8431   0.0973
  3     0.8337  0.1332  1.0000  0.1282  0.6495   0.0837
  4     1.0000  0.0000  0.6145  0.1277  0.6145   0.1276
  5     0.4570  0.1136  0.8805  0.0939  0.4191   0.0770
  6     0.6298  0.1346  0.9145  0.1094  0.5602   0.0900
  7     0.8299  0.1281  0.7500  0.1907  0.3589   0.0797

Appendix: Version 2 Program

/*******************************************
This SAS program is written for the UNIX workstation platform. Version 2 handles Data Setup 2, in which there is one observation for each unique combination of interview number (INTNUM) and behavior coder (BCODER). Each observation has a set of all interviewer behavior codes (FRCODE) in the order of the questions. The interviewer codes are followed by one or two sets of respondent codes (R1CODE, R2CODE). Version 2 computes the kappa statistic for each pair of coders. The program assumes four behavior coders. The estimated standard error is appropriate for testing the null hypothesis of 'kappa=0'.

To use this program, you have to replace the code typed in capital letters with your data below. For example, you have to supply:
- information in the LIBNAME where your file resides.
- codes for the two sets of respondent behavior if the respondent codes (R1CODE & R2CODE) are not in alphabetical order.
- the code for each behavior coder (BCODER).
*******************************************/

options nosymbolgen;
libname behavior '/HOME/YEWSIBEHAVB';

* Rename your variables to the program variable names.;
* R1CODE is the first behavior code and R2CODE is the second behavior code for a respondent. R1CODE and R2CODE should be in alphabetical order. If not, exchange the order of the codes.;
data recode;
  set behavior.new(rename=(CODER=bcoder INTRVW=intnum QUESTN=qnum
                           INTRVWER=frcode RCODE1=r1code RCODE2=r2code));
  /* The rename is done on the SET statement so the new names can be
     used within this step. Skip the next four lines if R1CODE and
     R2CODE for a respondent are already ordered alphabetically. */
  if r2code='A' then do;
    r2code=r1code;
    r1code='A';
  end;
  if r1code='C' & r2code=' ' then r2code='N';
  attrib rpcode length=$2;
  rpcode = r1code || r2code;
run;

* Create one file for each behavior coder. This program assumes four behavior coders.;
* The user has to supply each behavior coder's code shown in capital letters in single quotes below.;
data coder1 coder2 coder3 coder4;
  set recode;
  if bcoder='IP61' then output coder1;
  else if bcoder='IN94' then output coder2;
  else if bcoder='IR76' then output coder3;
  else if bcoder='IS58' then output coder4;
  else abort;
run;

/*******************************************
MACRO SORT first sorts each file created in the previous step by the interview number and question. Then the marginal percent distribution of behavior codes is obtained, one for interviewers and the others for respondents, for each behavior coder. From this point on, the program runs on its own.
*******************************************/
%macro sort(file1);
  proc sort data=&file1; by intnum qnum; run;
  proc freq data=&file1 noprint; tables frcode/out=fr&file1; run;
  proc freq data=&file1 noprint; tables r1code/out=r1&file1; run;
  proc freq data=&file1 noprint; tables r2code/out=r2&file1; run;
  proc freq data=&file1 noprint; tables rpcode/out=rp&file1; run;
%mend sort;

* fr&file1 has the variables frcode, count, and percent, with one observation per frcode (i.e., interviewer behavior code) category. Execute macro sort for each behavior coder.;

%sort(coder1)
%sort(coder2)
%sort(coder3)
%sort(coder4)

%macro score;
data one&first&second two&first&second comb&first&second both;
  merge coder&first(in=one)
        coder&second(in=two rename=(bcoder=bcoder2 frcode=frcode2
                     r1code=r1code2 r2code=r2code2 rpcode=rpcode2));
  by intnum qnum;
  attrib count score1 score2 score3 length=4;
  if one=1 & two=0 then output one&first&second;
  else if one=0 & two=1 then output two&first&second;
  else do;
    if frcode=' ' and frcode2=' ' then delete;
    if r1code='N' or r1code2='N' then abort;
    if frcode=frcode2 then score1=1; else score1=0;
    if r1code=r1code2 or r1code=r2code2 or r2code=r1code2
       or (r2code^='N' & r2code=r2code2) then score2=1;
    else score2=0;
    if rpcode=rpcode2 then score3=1; else score3=0;
    if r2code^='N' and r2code2^='N' then output both;
    count+1;
    call symput('cases',left(put(count,12.)));
    output comb&first&second;
  end;
run;

proc summary data=comb&first&second;
  var score1 score2 score3;
  output out=result&first&second sum=;
run;

data pct&first&second;
  set result&first&second;
  attrib coder1 coder2 length=4;
  coder1=&first;
  coder2=&second;
  p_int=score1/_freq_;
  p_rsp=score2/_freq_;
  p_rboth=score3/_freq_;
run;
%mend score;

%let first=1; %let second=2; %score
%let second=3; %score
%let second=4; %score
%let first=2; %let second=3; %score
%let second=4; %score
%let first=3; %let second=4; %score

data join;
  set pct12 pct13 pct14 pct23 pct24 pct34;
run;

/*******************************************
Compute the overall proportion of chance-expected agreement.
*******************************************/
%macro chance(who);
data marginal(drop=i j);
  merge &who.coder1(keep=percent &who.code rename=(percent=pct1))
        &who.coder2(keep=percent &who.code rename=(percent=pct2))
        &who.coder3(keep=percent &who.code rename=(percent=pct3))
        &who.coder4(keep=percent &who.code rename=(percent=pct4));
  by &who.code;
  attrib coder1 coder2 length=4;
  array pct(4) pct1-pct4;
  do i=1 to 3;
    do j=i+1 to 4;
      coder1=i;
      coder2=j;
      p=pct(i)*pct(j)/10000;        /* product of marginal proportions      */
      sp=p*(pct(i)+pct(j))/100;     /* term for the standard error, eq. (4) */
      drop pct1 pct2 pct3 pct4;
      output;
    end;
  end;
run;
proc sort; by coder1 coder2; run;

proc summary data=marginal;
  by coder1 coder2;
  var p sp;
  output out=&who.chance(drop=_freq_ _type_) sum=;
run;
%mend chance;

%chance(fr)
%chance(r1)
%chance(rp)

options symbolgen;
data all;
  merge join
        frchance(rename=(p=fr_p sp=fr_sp))
        r1chance(rename=(p=r1_p sp=r1_sp))
        rpchance(rename=(p=rp_p sp=rp_sp));
  by coder1 coder2;
  k_int=(p_int - fr_p)/(1 - fr_p);
  fr_rad=fr_p+fr_p**2-fr_sp;
  se_int=1/((1-fr_p)*sqrt(&cases))*sqrt(fr_rad);
  k_rsp=(p_rsp - r1_p)/(1 - r1_p);
  r1_rad=r1_p+r1_p**2-r1_sp;
  se_rsp=1/((1-r1_p)*sqrt(&cases))*sqrt(r1_rad);
  k_rboth=(p_rboth - rp_p)/(1 - rp_p);
  rp_rad=rp_p+rp_p**2-rp_sp;
  se_rboth=1/((1-rp_p)*sqrt(&cases))*sqrt(rp_rad);
  drop _freq_ score1 score2 score3;
run;

proc print;
  id coder1;
  var coder2 p_int k_int se_int p_rsp k_rsp se_rsp p_rboth k_rboth se_rboth;
  title1 'Percent Agreement, Estimated Kappa Statistic, and Estimated Standard Error:';
  title2 '"p_" stands for percent, "k_" stands for kappa, "se_" stands for standard error,';
  title3 '"int" for interviewer code,';
  title4 '"rsp" for agreement on at least one of the two respondent codes,';
  title5 '"rboth" for agreement on both respondent codes';
run;

REFERENCES

Cohen, J. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20:37-46.

Fleiss, J. L., J. Cohen, and B. S. Everitt. 1969. Large-sample standard errors of kappa and weighted kappa. Psychological Bulletin 72:323-327.

Landis, J. R., and G. G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics 33:159-179.

Stokes, M. E., C. S. Davis, and G. G. Koch. 1995. Categorical Data Analysis Using the SAS System. SAS Institute Inc., Cary, NC.

NOTE: SAS is a registered trademark of SAS Institute, Inc., Cary, NC, USA.