Maltreatment Reliability Statistics last updated 11/22/05


Historical Information

In July 2004, the Coordinating Center (CORE) / Collaborating Studies Coordinating Center (CSCC) developed a protocol to assess the reliability of RNA/MB coding. The protocol was approved by 4 of the 5 sites (plus CORE) in August 2004. The plan was to run reliability statistics on: (1) the number of codeable allegations, (2) the number of codeable findings, (3) the type of maltreatment (allegations), (4) the type of maltreatment (findings), and (5) the conclusion codes. All sites coded all selected cases, yielding a final sample of 129 cases, each coded by the five sites plus the original data (six ratings per case).

Case Selection

Five percent of cases were selected from the RNA/MB pool (for 0407 data). The selection criteria were as follows: (1) both an allegation narrative and a findings narrative must be present, (2) there must be at least one valid NIS2 and MMCS code in the allegations and findings sections, (3) there must be a valid date of referral and/or incident, and (4) the allegation/findings narratives must be available in the dataset or obtainable from the sites. Cases meeting these criteria were randomly selected by the CSCC. On the initial run, all sites were able to provide narratives, or the narratives were already in the datasets, with the exception of the Southern site (SO), which was unable to provide narratives for two of its selected cases. A second round of random selection for SO identified two additional cases that did have narratives in the dataset.

Data Entry

The CSCC created a data entry system, built with FSEDIT (in SAS), so that the reliability coding from the five sites could be entered. Laura Respess (CORE Program Assistant) was charged with entering the data. Jamie Smith (CORE Applications Analyst) reviewed 10% (n = 60) of the entered cases for data entry accuracy; there were no data entry errors in any of the 60 cases. During data entry, Jamie Smith also reviewed the sites' coding for any (non-reliability) errors, and Jamie (with Liz Knight, CORE co-investigator) communicated with the sites to correct the errors and/or clarify the instructions. A subsample of cases (n = 8) required additional information, not otherwise included in the narrative, to enable sites to code the narrative accurately. The sites that contributed the original data provided the additional information, and all sites were asked to use this supplemental information when coding the reliability narratives for the identified subsample.

Reliability Statistics

Reliability statistics were computed separately for the allegation and findings data. Kappas and intraclass correlation coefficients (ICCs) were used, depending on the type of data. The specific analysis variables (and the statistic applied to each) are as follows:

1. Number of codeable allegations: ICC
2. Number of codeable findings: ICC
3. Conclusion codes: Kappa
4. Maltreatment type, allegations (NIS2): Kappa
5. Maltreatment type, findings (NIS2): Kappa
6. Maltreatment type, allegations (MMCS): Kappa
7. Maltreatment type, findings (MMCS): Kappa
8. Maximum severity codes by maltreatment type, MMCS allegations: ICC

Note that sites coded conclusions and maltreatment types according to the RNA/MB reliability codebook, as if they were coding an original case file. However, analyses were conducted only on the broad types of maltreatment (not the subtypes) and on whether an allegation was substantiated (yes or no). The recodes are as follows:

NIS2 allegation and findings codes, by broad type:
Physical Abuse = 420-423
Sexual Abuse = 430-433
Emotional Abuse = 440-443
Physical Neglect = 450-457
Educational Neglect = 460-463
Emotional Neglect = 470-477
Other Maltreatment = 480-484

MMCS allegation and findings codes, by broad type:
Physical Abuse = 100-109
Sexual Abuse = 200
Physical Neglect, Failure to Provide = 300-305
Physical Neglect, Lack of Supervision = 400-403
Emotional Maltreatment = 500
Moral/Legal Maltreatment = 600
Educational Maltreatment = 700
Drugs/Alcohol = 800

Conclusion codes:
Codes of 1 or 3 = substantiated
Codes of 2, 4, 5, 6, or 7 = not substantiated

Severity codes:
The maximum severity value for a particular type of maltreatment across a record.

Results

Analyses were completed on 4/29/05. See Tables 1 and 2 for kappa statistics for type of allegation, type of finding, and conclusion codes under the MMCS and NIS2 systems. The methodology used to compute the kappa statistics is presented in Fleiss (1981). The macro used to compute the intraclass correlations was based on Shrout and Fleiss (1979); the Shrout-Fleiss "random set" reliability estimates are reported for the current analyses. Analyses of the maximum severity codes were conducted on 9/22/05; Table 3 details the maximum severity code reliability statistics.
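To make the recoding and the kappa computation concrete, below is a minimal Python sketch of the broad-type collapse, the substantiation dichotomy, and a Fleiss-style multi-rater kappa. The code ranges come from this memo; everything else (the function names, the data layout, and the use of an overall rather than category-specific kappa) is an illustrative assumption, not the SAS macro actually used.

import numpy as np

# NIS2 code ranges from the memo, collapsed to broad maltreatment types.
NIS2_BROAD_TYPES = {
    "Physical Abuse": range(420, 424),
    "Sexual Abuse": range(430, 434),
    "Emotional Abuse": range(440, 444),
    "Physical Neglect": range(450, 458),
    "Educational Neglect": range(460, 464),
    "Emotional Neglect": range(470, 478),
    "Other Maltreatment": range(480, 485),
}

def nis2_broad_type(code: int) -> str:
    """Collapse a NIS2 allegation/findings code to its broad type.
    (An analogous table applies to the MMCS ranges listed above.)"""
    for label, code_range in NIS2_BROAD_TYPES.items():
        if code in code_range:
            return label
    raise ValueError(f"unrecognized NIS2 code: {code}")

def is_substantiated(conclusion_code: int) -> bool:
    """Conclusion codes 1 and 3 = substantiated; 2, 4, 5, 6, or 7 = not."""
    return conclusion_code in (1, 3)

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' multi-rater kappa (Fleiss, 1981). counts[i, j] is the number
    of raters placing case i in category j; every row must sum to the same
    number of raters."""
    n_cases = counts.shape[0]
    k = counts.sum(axis=1)[0]                      # raters per case (here, 6)
    p_cat = counts.sum(axis=0) / (n_cases * k)     # overall category proportions
    p_case = ((counts ** 2).sum(axis=1) - k) / (k * (k - 1))  # per-case agreement
    p_obs, p_exp = p_case.mean(), (p_cat ** 2).sum()
    return (p_obs - p_exp) / (1 - p_exp)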

Allegations

Kappas for MMCS codes from the allegations narrative ranged from .49 to .88 (M = .76). All of the categories except moral/legal had kappas exceeding .70, the value typically considered acceptable. Similar results were obtained for kappas of the NIS2 allegation codes, with a range of .58 to .88 (M = .77); again, with the exception of one category (emotional neglect), all kappas were above .70. Intraclass correlation coefficients for the number of allegations coded were .79 for MMCS and .74 for NIS2.

Findings

Kappas for MMCS codes from the findings narrative ranged from .45 to .84 (M = .72). All of the categories except moral/legal had kappas exceeding .70. Similar results were obtained for kappas of the NIS2 findings codes, with a range of .54 to .85 (M = .73); again, with the exception of one category (emotional neglect), all kappas were above .70. Intraclass correlation coefficients (Shrout-Fleiss random set) for the number of findings coded were .75 for MMCS and .65 for NIS2.

Conclusion Codes

Kappas for conclusion codes based on the MMCS coding of the findings narrative ranged from .14 to .73 (M = .54). The lowest value was obtained for the moral/legal category (k = .14) and the highest for educational maltreatment (k = .73); only two kappa values were at or above .70. Kappas for conclusion codes based on the NIS2 coding of the findings narrative ranged from .34 to .73 (M = .56). The lowest value was obtained for the emotional abuse category (k = .34) and the highest for educational neglect (k = .73); only one kappa value was at or above .70.

Maximum Severity

Reliability statistics were computed on the maximum severity codes for each MMCS maltreatment category except Moral/Legal and Drugs/Alcohol. Two sets of analyses were conducted. In the first set, a missing value was substituted when the maltreatment type was not coded from the allegation (range of values = 1-6, depending on the maltreatment type). In the second set, a value of 0 was substituted when the maltreatment type was not coded from the allegation (range of values = 0-6). Note that the first set is the more restrictive analysis, since any record with disagreement over the maltreatment type is dropped before agreement on maximum severity is assessed; the second set may potentially inflate agreement. ICCs for the first set ranged from .30 (Educational Maltreatment) to .65 (Lack of Supervision), with most in the .60 range. ICCs for the second set ranged from .57 (Educational Maltreatment) to .88 (Sexual Abuse).
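For readers who want to reproduce statistics of this kind, here is a minimal numpy sketch of a two-way random-effects ICC and of the two severity analysis sets described above. This is an illustrative reimplementation under assumed data layouts, not the SAS macro used for these analyses: the macro's "random set" output may correspond to a different ICC variant than the single-rating ICC(2,1) shown here, and Set 1 is interpreted as a complete-case analysis, which is one plausible reading of substituting missing values.

import numpy as np

def icc_2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, single rating (Shrout & Fleiss, 1979).
    ratings is a complete n_cases x k_raters matrix."""
    n, k = ratings.shape
    grand = ratings.mean()
    ss_rows = k * ((ratings.mean(axis=1) - grand) ** 2).sum()  # between-case
    ss_cols = n * ((ratings.mean(axis=0) - grand) ** 2).sum()  # between-rater
    ss_err = ((ratings - grand) ** 2).sum() - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

def severity_rating_sets(sev: np.ndarray, coded: np.ndarray):
    """Build the two analysis sets for one maltreatment type.
    sev: n x k matrix of maximum severity codes (ignored where uncoded);
    coded: n x k boolean mask, True where the rater coded that type at all."""
    set1 = sev[coded.all(axis=1)]    # conservative: only fully coded cases kept
    set2 = np.where(coded, sev, 0)   # liberal: uncoded type scored as severity 0
    return set1, set2

Passing set1 or set2 to icc_2_1 then yields the corresponding Set 1 or Set 2 estimate.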

A Word About Kappas

Landis and Koch (1977) attempted to provide descriptive labels for kappa values in various ranges (a small code sketch of this lookup follows the Conclusions below):

<= 0 = poor
0 - .20 = slight
.20 - .40 = fair
.40 - .60 = moderate
.60 - .80 = substantial
.80 - 1.00 = almost perfect

However, Munoz and Bangdiwala (1997), comparing Cohen's (1960) kappa statistic with Bangdiwala's (1985) B statistic, offer an alternative labeling system for kappa:

-.35 to -.20 = poor
-.05 to .07 = fair
.25 to .33 = moderate
.55 to .60 = substantial
.85 to .87 = almost perfect
1.0 = perfect

Conclusions

Under either interpretation, the kappas obtained in the current analyses range from moderate to almost perfect in most instances. Caution is suggested in using the conclusion codes. Assessing the severity codes proved more difficult given their conditional nature (the maltreatment type had to be coded before a severity code could exist). Analyses were therefore run two ways: one highly conservative, and one that may potentially inflate agreement. Although neither approach is optimal, arguments can be made that either is sufficient. From the conservative perspective, reliability was assessed only for those coders/records that agreed a maltreatment type was codeable from the referral narrative; the denominator for these analyses is substantially lower than that used in the second set, so disagreements have a greater impact on the reliability estimate. In the second set, by contrast, raters who agreed that a maltreatment type did not occur get credit for agreeing that there is no severity code (= 0); all records (N = 129 cases x 6 raters) are included, potentially minimizing disagreements relative to the first approach. Either approach is defensible and reportable; however, the CSCC recommends reporting the range (or the upper and lower figures) from these two sets of analyses when describing reliability in manuscripts, along with a brief description of the assessment process. Given the complexity of coding CPS records, the span of ages represented in the current sample, and coder turnover at the sites, we are very encouraged by these figures and congratulate (and thank) everyone for their efforts in this process.
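Returning briefly to the Landis and Koch bands above: they translate directly into a small lookup helper. This hypothetical convenience function is offered only as an illustration and is not part of the original analyses.

def landis_koch_label(kappa: float) -> str:
    """Descriptive label for a kappa value per Landis & Koch (1977)."""
    if kappa <= 0:
        return "poor"
    for upper_bound, label in [(0.20, "slight"), (0.40, "fair"),
                               (0.60, "moderate"), (0.80, "substantial"),
                               (1.00, "almost perfect")]:
        if kappa <= upper_bound:
            return label
    raise ValueError("kappa cannot exceed 1.0")

print(landis_koch_label(0.76))  # mean MMCS allegation kappa -> "substantial"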

Reliability Summary

In the winter of 2004-2005, a formal assessment of CPS narrative coding reliability was conducted among all active coders at each of the five sites, plus the original data. Approximately five percent of the CPS records in the LONGSCAN cross-site database as of 9/20/04 (N = 129) were selected for review. Analyses measured agreement on (a) the number of allegations and substantiations, (b) the type of maltreatment at referral and at investigation by CPS, (c) conclusions about maltreatment based on the CPS investigation, and (d) the severity of maltreatment based on the referral information. These categories are consistent with the ways the CPS data are commonly used for analyses within LONGSCAN and for classifying the maltreatment experiences of the study child participants. Reliability analyses focused on coding under the MMCS and NIS2 classification systems. Results indicated that reliability ranged from moderate to almost perfect for nearly every category of analysis, except the coding of substantiated maltreatment based on the CPS findings narratives. Given the complexity of coding CPS records across agencies and states, the span of ages at the time of referral, and the coder turnover inherent in a longitudinal study, these figures are encouraging and reflect the quality and consistency of training.

Results

Table 1. Kappas for MMCS Allegation, Findings, and Conclusion Codes

Type of Maltx                   Allegation   Findings   Conclusion Code
Physical Abuse                     .87          .84          .56
Sexual Abuse                       .77          .71          .51
Neglect, Failure to Provide        .88          .83          .70
Neglect, Lack of Supervision       .77          .70          .51
Emotional Maltx                    .73          .72          .55
Moral/Legal                        .49          .45          .14
Educational Maltx                  .72          .74          .73
Drugs/Alcohol                      .87          .79          .66

Table 2. Kappas for NIS2 Allegation, Findings, and Conclusion Codes

Type of Maltx                   Allegation   Findings   Conclusion Code
Physical Abuse                     .88          .85          .56
Sexual Abuse                       .78          .73          .51
Emotional Abuse                    .73          .71          .34
Physical Neglect                   .81          .75          .63
Educational Neglect                .74          .75          .73
Emotional Neglect                  .58          .54          .51
Other Maltx                        .87          .78          .65

Table 3. ICCs for Maximum Severity Codes for MMCS Maltreatment Types

Set 1 substitutes a missing value where the maltreatment type was not coded; Set 2 substitutes 0.

Type of Maltx            Set 1 (uncoded = missing)   Set 2 (uncoded = 0)
Physical Abuse                    .60                        .84
Sexual Abuse                      .60                        .88
Failure to Provide                .61                        .76
Lack of Supervision               .65                        .68
Emotional Maltx                   .54                        .73
Educational Maltx                 .30                        .57

Note: Maximum severity codes were selected from MMCS allegations.

Bibliography

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.

Fleiss, J.L. (1981). Balanced incomplete block designs for inter-rater reliability studies. Applied Psychological Measurement, 5, 105-112.

Landis, J.R., & Koch, G.G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159-174.

Munoz, S.R., & Bangdiwala, S.I. (1997). Interpretation of Kappa and B statistics measures of agreement. Journal of Applied Statistics, 24, 105-111.

Shrout, P.E., & Fleiss, J.L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86, 420-428.