Maltreatment Reliability Statistics
Last updated 11/22/05

Historical Information
In July 2004, the Coordinating Center (CORE) / Collaborating Studies Coordinating Center (CSCC) identified a protocol to assess the reliability of RNA/MB coding. The protocol was approved by 4 of 5 sites (plus CORE) in August 2004. The plan was to run reliability statistics on: (1) the number of codeable allegations, (2) the number of codeable findings, (3) the type of maltreatment (allegations), (4) the type of maltreatment (findings), and (5) the conclusion codes. All sites coded all selected observations, yielding a final total of 129 cases (x 5 sites + the original data).

Case Selection
Five percent of cases were selected from the RNA/MB pool (for 0407 data). The selection criteria were as follows: (1) there must be both an allegation and a findings narrative, (2) there must be at least one valid NIS2 and MMCS code from the allegations and findings sections, (3) there must be a valid date of referral and/or incident, and (4) the allegation/findings narratives must be available in the dataset or obtainable from the sites. Cases meeting these criteria were randomly selected by the CSCC. On the initial run, all sites were able to provide narratives, or narratives were already in the datasets, with the exception of the Southern site (SO), which was unable to provide narratives for two of its selected cases. A second round of random selection for SO identified two additional cases that did have narratives in the dataset.

Data Entry
The CSCC created a data entry system, built with FSEDIT (in SAS), so that the reliability coding from the five sites could be entered. Laura Respess (CORE Program Assistant) was charged with entering the data. Jamie Smith (CORE Applications Analyst) reviewed 10% (n = 60) of the entered cases for data entry accuracy; there were no data entry errors in any of the 60 cases.
During data entry, Jamie Smith reviewed the coding from the sites for any (non-reliability) errors. Jamie (and Liz Knight, CORE co-investigator) communicated with the sites to correct the errors and/or clarify the instructions. A subsample of cases (n = 8) required additional information, not otherwise included in the narrative, to enable sites to code the narrative accurately. The sites that contributed the original data provided the additional information, and all sites were asked to use this supplemental information when coding the reliability narratives for the identified subsample.

Reliability Statistics
Reliability statistics were computed separately for the allegation and findings data. Kappas and intraclass correlation coefficients (ICCs) were computed depending on the type of data. The specific analysis variables (and the statistic used for each) are as follows:
1. Number of codeable allegations - ICC
2. Number of codeable findings - ICC
3. Conclusion codes - Kappa
4. Maltreatment type, allegations, NIS2 - Kappa
5. Maltreatment type, findings, NIS2 - Kappa
6. Maltreatment type, allegations, MMCS - Kappa
7. Maltreatment type, findings, MMCS - Kappa
8. Maximum severity codes by maltreatment type for MMCS allegations - ICC

Note that sites coded conclusions and maltreatment types according to the RNA/MB reliability codebook as if they were coding an original case file. However, analyses will be conducted only on the broad types of maltreatment (not the subtypes) and on whether an allegation was substantiated (yes or no). See below:

NIS2 allegation & findings codes = broad type
Physical Abuse = 420-423
Sexual Abuse = 430-433
Emotional Abuse = 440-443
Physical Neglect = 450-457
Educational Neglect = 460-463
Emotional Neglect = 470-477
Other Maltreatment = 480-484

MMCS allegation & findings codes = broad type
Physical Abuse = 100-109
Sexual Abuse = 200
Physical Neglect, failure to provide = 300-305
Physical Neglect, lack of supervision = 400-403
Emotional Maltreatment = 500
Moral/Legal Maltreatment = 600
Educational Maltreatment = 700
Drugs/Alcohol = 800

Conclusion Codes
Codes of 1 or 3 = substantiated
Codes of 2, 4, 5, 6, or 7 = not substantiated

Severity Codes
The maximum severity value for a particular type of maltreatment across a record.

Results
Analyses were completed on 4/29/05. See Tables 1 and 2 for kappa statistics for type of allegation, type of finding, and conclusion codes for the MMCS and NIS2 systems. The methodology used for computing kappa statistics is presented in Fleiss (1981). The macro used to compute intraclass correlations was based on Shrout and Fleiss (1979); the Shrout-Fleiss random-set reliability is reported for the current analyses. Analyses on the maximum severity codes were conducted on 9/22/05; Table 3 details the maximum severity code reliability statistics.
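As an illustration of this collapsing step, the NIS2 broad-type ranges and the substantiation rule above could be expressed as a small lookup. This is a sketch only; the function and dictionary names are invented, and only the NIS2 ranges are shown.

```python
# Illustrative collapsing of detailed codes into analysis categories.
# The numeric ranges come from the memo; names here are hypothetical.

NIS2_BROAD = {
    "Physical Abuse": range(420, 424),       # 420-423
    "Sexual Abuse": range(430, 434),         # 430-433
    "Emotional Abuse": range(440, 444),      # 440-443
    "Physical Neglect": range(450, 458),     # 450-457
    "Educational Neglect": range(460, 464),  # 460-463
    "Emotional Neglect": range(470, 478),    # 470-477
    "Other Maltreatment": range(480, 485),   # 480-484
}

def nis2_broad_type(code):
    """Map a detailed NIS2 subtype code to its broad maltreatment type."""
    for broad, codes in NIS2_BROAD.items():
        if code in codes:
            return broad
    return None  # code falls outside the listed ranges

def substantiated(conclusion_code):
    """Collapse conclusion codes: 1 or 3 = substantiated; 2, 4-7 = not."""
    return conclusion_code in (1, 3)

print(nis2_broad_type(452))  # Physical Neglect
print(substantiated(3))      # True
```

The same pattern would apply to the MMCS ranges; only the dictionary contents change.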
Allegations
Kappas for MMCS codes from the allegations narrative ranged from .49 to .88 (M = .76). All of the categories except moral/legal had kappas exceeding .70 (the value typically considered acceptable). Similar results were obtained for kappas of NIS2 allegation codes, with a range of .58 to .88 (M = .77); again, all kappas except one category (emotional neglect) were above .70. Intraclass correlation coefficients for the number of allegations coded were .79 for MMCS and .74 for NIS2.

Findings
Kappas for MMCS codes from the findings narrative ranged from .45 to .84 (M = .72). All of the categories except moral/legal had kappas exceeding .70. Similar results were obtained for kappas of NIS2 findings codes, with a range of .54 to .85 (M = .73); again, all kappas except one category (emotional neglect) were above .70. Intraclass correlation coefficients (Shrout-Fleiss random set) for the number of findings coded were .75 for MMCS and .65 for NIS2.

Conclusion Codes
Kappas for conclusion codes based on the MMCS coding of the findings narrative ranged from .14 to .73 (M = .54). The lowest value was obtained for the moral/legal category (k = .14), the highest for educational maltreatment (k = .73). Only two kappa values were at or above .70. Kappas for conclusion codes based on the NIS2 coding of the findings narrative ranged from .34 to .73 (M = .56). The lowest value was obtained for the emotional abuse category (k = .34), the highest for educational neglect (k = .73). Only one kappa value was at or above .70.

Maximum Severity
Reliability statistics were conducted on the maximum severity codes for each MMCS maltreatment category except moral/legal and drugs/alcohol. Two sets of analyses were conducted.
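For readers without access to the SAS macro, the Shrout-Fleiss two-way random-effects, single-rater ICC (the "random set" figure reported above; ICC(2,1) in Shrout and Fleiss's notation) can be sketched as follows. This is an illustrative reimplementation, not the macro the CSCC actually used, and the function name is invented.

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, single rater, absolute agreement.

    ratings: n x k array (n targets, k raters), no missing values.
    Formula: (MSR - MSE) / (MSR + (k-1)*MSE + k*(MSC - MSE)/n)
    """
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    # Sums of squares for targets (rows), raters (columns), and residual.
    ss_rows = k * ((ratings.mean(axis=1) - grand) ** 2).sum()
    ss_cols = n * ((ratings.mean(axis=0) - grand) ** 2).sum()
    ss_err = ((ratings - grand) ** 2).sum() - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Perfect agreement between two raters yields an ICC of 1.0.
print(icc_2_1([[1, 1], [2, 2], [3, 3], [4, 4]]))  # 1.0
```

In the memo's design the rows would be the 129 cases and the columns the six coding sources (5 sites plus the original data).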
The first set substituted a missing value when the maltreatment type was not coded from the allegation (range of values = 1-6, depending on the maltreatment type). The second set substituted a value of 0 when the maltreatment type was not coded from the allegation (range of values = 0-6). Note that the first set is a much more restrictive set of analyses, given that any disagreement over the maltreatment type is thrown out before agreement on maximum severity is assessed; the latter set of runs may potentially inflate agreement. ICCs for the first set ranged from .30 (educational neglect) to .65 (lack of supervision); most were in the .6 range. ICCs for the second set ranged from .57 (educational neglect) to .87 (sexual abuse).
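The two substitution rules can be illustrated with a small sketch. The data layout and the `prepare_severity` name are invented for the example; in the actual analyses the substitution was done in SAS before computing the ICCs.

```python
# Illustrative handling of maximum severity when a rater did not code the
# maltreatment type (represented here as None). Set 1 drops those ratings
# (conservative); Set 2 scores them as severity 0 (may inflate agreement).

def prepare_severity(max_severity, treat_missing_as_zero):
    """max_severity: list of severity values, None where type was not coded."""
    if treat_missing_as_zero:
        # Set 2: missing type counts as "no severity" (range 0-6).
        return [0 if s is None else s for s in max_severity]
    # Set 1: missing type is excluded before assessing agreement (range 1-6).
    return [s for s in max_severity if s is not None]

raters = [3, None, 4, 3, None, 2]
print(prepare_severity(raters, treat_missing_as_zero=False))  # [3, 4, 3, 2]
print(prepare_severity(raters, treat_missing_as_zero=True))   # [3, 0, 4, 3, 0, 2]
```

Set 1 shrinks the denominator (only agreeing coders remain), while Set 2 keeps all 129 x 6 ratings and credits agreement on "no maltreatment of this type."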
A Word about Kappas
Landis and Koch (1977) attempted to provide some measure of agreement for kappa values in various ranges:

<= 0 = poor
0 - .2 = slight
.2 - .4 = fair
.4 - .6 = moderate
.6 - .8 = substantial
.8 - 1.0 = almost perfect

However, Munoz and Bangdiwala (1997) compared the kappa statistic of Cohen (1960) with the B statistic of Bangdiwala (1985) and offer an alternative labeling system for the kappa statistic:

-.35 - -.20 = poor
-.05 - .07 = fair
.25 - .33 = moderate
.55 - .60 = substantial
.85 - .87 = almost perfect
1.0 = perfect

Conclusions
With either interpretation, the kappas obtained in the current analyses range from moderate to almost perfect in most instances. Caution is suggested in using the conclusion codes. Assessment of the severity codes proved more difficult, given the conditional nature of the presence of severity codes (a maltreatment type had to be coded). Analyses were run two ways: one that is highly conservative and one that may potentially inflate agreement. Although neither approach is optimal, arguments can be made that either is sufficient. From the conservative perspective, reliability was assessed only for those coders/records that agreed a maltreatment type was codeable from the referral narrative. The denominator for these analyses is significantly lower than that used in the second set, so disagreements have a greater impact on the reliability. In the second approach, by contrast, coders who agreed that a maltreatment type did not occur receive credit for agreeing there is no severity code (= 0). Thus all records (N = 129 x 6 raters) were included in the analyses, potentially minimizing disagreements compared with the first approach. Either approach is defensible and reportable; however, the CSCC recommends reporting the range (or the upper and lower figures) from these two sets of analyses when describing reliability in manuscripts, with a brief description of the assessment process.
Given the complexity of coding CPS records, the span of the ages represented in the current sample, and coder turnover at the sites, we are very encouraged by these figures and congratulate (and thank) everyone for their efforts with this process.
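For quick reference, the Landis and Koch (1977) bands quoted above can be expressed as a simple lookup. This is a convenience sketch; how values falling exactly on a cut point are labeled is our choice, not specified by the source.

```python
# Landis & Koch (1977) qualitative bands for kappa, as listed in the memo.
# Values exactly on a boundary are assigned to the lower band here.

def landis_koch_label(kappa):
    """Return the Landis-Koch descriptive label for a kappa value."""
    if kappa <= 0:
        return "poor"
    if kappa <= 0.2:
        return "slight"
    if kappa <= 0.4:
        return "fair"
    if kappa <= 0.6:
        return "moderate"
    if kappa <= 0.8:
        return "substantial"
    return "almost perfect"

print(landis_koch_label(0.76))  # substantial
print(landis_koch_label(0.87))  # almost perfect
```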
Reliability Summary
In the winter of 2004-2005, a formal assessment of CPS narrative coding reliability was conducted among all active coders at each of the five sites plus the original data. Approximately five percent of the CPS records in the LONGSCAN cross-site database as of 9/20/04 (N = 129) were selected for review. Analyses were conducted to measure agreement on (a) the number of allegations and substantiations, (b) the type of maltreatment at referral and at the CPS investigation, (c) conclusions about maltreatment based on the CPS investigation, and (d) the severity of maltreatment based on the referral information. These categories are consistent with the way the CPS data are commonly used for analyses within LONGSCAN and for classifying the maltreatment experiences of the study child participants. Reliability analyses focused on coding with the MMCS and NIS2 classification systems. Results indicated that reliability ranged from moderate to almost perfect for nearly every category of analysis except the coding of substantiated maltreatment based on the CPS findings narratives. Given the complexity of coding CPS records across agencies and states, the span of ages at the time of referral, and the change in coders inherent in a longitudinal study, these figures are encouraging and reflect the quality and consistency of training.

Results

Table 1. Kappas for MMCS Allegation, Findings, and Conclusion Codes

Type of Maltx                    Allegation   Findings   Conclusion Code
Physical Abuse                      .87          .84           .56
Sexual Abuse                        .77          .71           .51
Neglect - Failure to Provide        .88          .83           .70
Neglect - Lack of Supervision       .77          .70           .51
Emotional Maltx                     .73          .72           .55
Moral/Legal                         .49          .45           .14
Educational Maltx                   .72          .74           .73
Drugs/Alcohol                       .87          .79           .66

Table 2. Kappas for NIS2 Allegation, Findings, and Conclusion Codes

Type of Maltx            Allegation   Findings   Conclusion Code
Physical Abuse              .88          .85           .56
Sexual Abuse                .78          .73           .51
Emotional Abuse             .73          .71           .34
Physical Neglect            .81          .75           .63
Educational Neglect         .74          .75           .73
Emotional Neglect           .58          .54           .51
Other Maltx                 .87          .78           .65
Table 3. ICCs for Maximum Severity Codes for MMCS Maltreatment Types

                         Set 1 (severity without a   Set 2 (severity without a
Type of Maltx            maltreatment code = .)      maltreatment code = 0)
Physical Abuse                    .60                         .84
Sexual Abuse                      .60                         .88
Failure to Provide                .61                         .76
Lack of Supervision               .65                         .68
Emotional Maltx                   .54                         .73
Educational Maltx                 .30                         .57

Note: Maximum severity codes were selected from MMCS allegations.

Bibliography
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.
Fleiss, J. (1981). Balanced incomplete block designs for inter-rater reliability studies. Applied Psychological Measurement, 5, 105-112.
Landis, J.R., & Koch, G.G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159-174.
Munoz, S.R., & Bangdiwala, S.I. (1997). Interpretation of Kappa and B statistics measures of agreement. Journal of Applied Statistics, 24, 105-111.
Shrout, P.E., & Fleiss, J.L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86, 420-428.