Journal of Autism and Developmental Disorders, Vol. 30, No. 2, 2000

Brief Report: Interrater Reliability of Clinical Diagnosis and DSM-IV Criteria for Autistic Disorder: Results of the DSM-IV Autism Field Trial

Ami Klin,1,2 Jason Lang,1 Domenic V. Cicchetti,1 and Fred R. Volkmar1

1 Yale Child Study Center, New Haven, Connecticut.
2 Address all correspondence to Ami Klin, Yale Child Study Center, 230 South Frontage Road, New Haven, Connecticut 06520; e-mail: Ami.Klin@Yale.Edu

INTRODUCTION

One of the most important goals of diagnostic classification systems such as the Diagnostic and Statistical Manual of Mental Disorders, 4th ed. (DSM-IV; American Psychiatric Association [APA], 1994) is to enhance the agreement on a specific diagnosis among clinicians with diverse backgrounds and levels of experience. Although historically autism has been one of the most reliably diagnosed disorders in child psychiatry (Mattison, Cantwell, Russell, & Will, 1979), some aspects of this and related pervasive developmental disorders (PDD) present challenges for diagnosis, particularly among less experienced clinicians. For example, there is a broad range of syndrome expression in terms of level of intellectual and communicative functioning, and symptoms change somewhat as a function of both age and developmental level (Lord, Pickles, McLennan, Rutter, et al., 1997; Volkmar, Klin, & Cohen, 1997). Knowledge of, and experience in appreciating, the manifestations of autism at different levels of developmental ability are central to the diagnostic process (Rutter, 1978).

An additional, and more recent, complexity is the addition of other explicitly defined categories to the PDD class of conditions. Before DSM-IV there were only two categories under the PDD class of disorders: autism and the residual category Pervasive Developmental Disorder Not Otherwise Specified (PDDNOS). DSM-IV now recognizes three additional disorders: Rett Disorder, Childhood Disintegrative Disorder, and Asperger Disorder, each of which must be differentiated from autism (Volkmar et al., 1994).

Finally, two other complexities should be noted. It has been increasingly recognized, both in clinical practice (Klin et al., 1997) and in recent epidemiological studies (Fombonne, 1998), that many children are now identified who have an autistic-like condition but do not present the classic syndrome of autism. In addition, as awareness of autism and related conditions has increased, more children who are cognitively higher functioning have been identified (Klin et al., 1997).

Despite advances in the neuroscience of autism, there are no biological markers for the identification of the disorder. Consequently, the diagnostic process is still based on developmental history and behavioral observations made by clinicians. Although there are a number of excellent instruments for the diagnosis of autism (e.g., Lord, Rutter, & DiLavore, 1996; Lord, Rutter, & Le Couteur, 1994), the most reliable ones require specialized intensive training, and none substitutes for clinical expertise and experience (Lord et al., 1997). And whether or not one utilizes such diagnostic instruments, the diagnostic assignment still depends on the adoption of consensual definitions as operationalized by DSM-IV or its international equivalent, the International Classification of Diseases, Tenth Revision (ICD-10; World Health Organization [WHO], 1993). The two systems are now conceptually identical (Volkmar et al., 1994). To what extent the adoption of these systems yields reliable diagnoses therefore needs to be empirically examined.
Within this context, while the gold standard for the diagnosis of autism is the best clinical judgment of experienced clinicians (Spitzer & Williams, 1988), diagnostic assignments are made in the larger clinical
community probably as often as by the smaller number of autism experts. Hence the importance of expanding the empirical verification of the reliability of DSM-IV into the community of less experienced clinicians. In the very small number of studies that have systematically examined diagnostic reliability among clinicians with different levels of experience (e.g., Goodman & Simonoff, 1991) or professional training (Perry, Veleno, & Factor, 1998), acceptable levels of agreement were reported, although none of these studies specifically addressed the reliability of diagnostic assignment based on the DSM-IV definition of autism among both experienced and inexperienced clinicians.

The present study uses diagnostic data collected during the DSM-IV Autism Field Trial (Volkmar et al., 1994) to answer four questions related to the issues outlined above: (a) What is the interrater reliability of clinician-assigned diagnosis of autism (i.e., without the use of DSM-IV criteria)? (b) What is the interrater reliability for the various DSM-IV criteria for autistic disorder? (c) What is the interrater reliability of DSM-IV-assigned diagnosis of autism (i.e., when clinicians rate each diagnostic criterion and the diagnosis of autism is assigned depending on whether or not the algorithm for autism is met)? and (d) How do these two diagnostic strategies compare? These four questions were examined in the context of comparisons between clinicians with more and less experience.

METHOD

The DSM-IV Field Trial

The DSM-IV Autism Field Trial was a collaborative project involving 13 sites in North America, 4 sites in Europe, and 4 sites in the Middle East, Asia, and Oceania (Volkmar et al., 1994). These sites provided diagnostic ratings on consecutive cases of individuals with either autism or another developmental disorder that would reasonably include autism in the differential diagnosis.
The study involved 977 rated cases with clinical (i.e., clinician-assigned) diagnoses of autism (n = 454), other (nonautistic) pervasive developmental disorders (PDD) (n = 240), and non-PDD disorders (e.g., primary diagnoses of language disorders, mental retardation; n = 283). The goal of the Field Trial was to empirically derive the definition of autism for DSM-IV based on a series of analyses including reliability and validity considerations. Previous reports describe several aspects of the Field Trial in greater detail (Buitelaar, Van der Gaag, Klin, & Volkmar, 1999; Volkmar et al., 1994; Volkmar & Rutter, 1995).

Participants

Of the entire sample of 977 participants, 131 cases received diagnostic ratings by at least two clinicians for the purpose of assessing interrater reliability. Of the 131 cases, 62% had a clinician-assigned diagnosis of autism, 14% had a diagnosis of a nonautistic PDD, and 24% had a diagnosis of a non-PDD disorder; 71% were male and 29% were female; 42% were below age 5, 41% were between ages 5 and 10, 15% were between ages 10 and 20, and 2% were above age 20; 66% were Caucasian, 15% were of African origin (e.g., African American), 13% were Hispanic, 3% were Asian, and the remainder were of other race/ethnicity.

Eighty-three clinicians rated at least one reliability case: 36% of these raters were male and 64% were female; 21% were below age 30, 46% were between ages 30 and 40, 25% were between ages 40 and 50, and 8% were above age 50; 53% were psychiatrists or residents in child psychiatry, 34% were psychologists or psychology trainees, and 13% were speech and language pathologists, nurses, social workers, or special educators. Of these clinicians, 51% had extensive experience in the assessment and diagnosis of autism (defined as involvement in the assessment and diagnosis of over 25 patients), while the remaining 49% reported lesser degrees of experience (25% with 10 to 25 cases, and 24% with fewer than 10 cases).
Of the 131 reliability cases, 37% received diagnostic ratings by at least two experienced clinicians, whereas 83% had at least one experienced clinician involved.

Procedure

The overall clinical diagnosis was assigned before, and independently of, the clinicians' ratings of the various individual DSM-IV criteria for autism; as in the DSM-III-R autism field trial (Spitzer & Siegel, 1990), the diagnoses of experienced clinicians served as a first approximation of a diagnostic gold standard. Subsequently, clinicians completed the ratings for each of the potential DSM-IV criteria for autism, which had been developed on the basis of the results of the various literature reviews and data reanalyses that preceded the project (e.g., Szatmari, 1992; Volkmar, Cicchetti, Bregman, & Cohen, 1992). In addition, a standard data coding system was used to provide information on characteristics of patients (e.g., age, IQ, communicative ability, nature and quality of information available, and, at the discretion of the clinician, information on standard tests or assessment instruments). Similarly, standard forms were provided for raters to indicate clinician-assigned
diagnosis and level of confidence, ratings for each of the DSM-IV criteria for autism, and the rater's own personal data (e.g., age, gender, experience). Coordinators at each site were provided with a summary of procedures, but no systematic training in the application of the potential DSM-IV criteria was provided. Measures were taken to protect patient (and rater) confidentiality, and the research procedures had been approved by the institutional human investigation committees at the different sites. A series of quality control, data entry, and data management procedures were adopted (e.g., double entry and range checks, data audits). When ratings included missing data, site coordinators were asked to secure the information if possible. Of the entire DSM-IV sample (i.e., 977 cases), only 6 cases had to be excluded because of multiple and major missing data points.

RESULTS

Interrater Reliability of Clinical Diagnosis

Interrater reliability coefficients were obtained for primary clinician-assigned diagnoses. The kappa coefficient (Cohen, 1960) was used as the preferred chance-corrected measure of agreement for the dichotomous data (Fleiss, 1981). Table I lists kappas obtained for agreement between pairs of raters according to clinical experience and professional training, as well as overall levels of agreement on clinical diagnosis. These values are provided for case comparisons between autism and a non-PDD disorder, between autism and other developmental disorders (both non-PDD and nonautistic PDD), and between autism and nonautistic PDDs. Levels of clinical significance are defined as per Cicchetti and Sparrow's (1981) criteria. Values are ranked by kappa to clarify the patterns of agreement observed. It should be emphasized that kappas in groups with small ns must be interpreted with caution and are likely to be less stable than values derived from larger samples.

As might be expected, the more experienced raters exhibited excellent agreement, with the most disagreement over the more fine-grained distinctions between autism and other possible disorders in the PDD class. The same pattern was obtained, with slightly lower levels of agreement, between raters from different professional backgrounds, and, with lower levels still, between pairs of experienced and inexperienced raters. It is important to note, however, that agreement was generally quite high. Even when inexperienced raters were compared with each other, agreement was reasonably good, with the exception, as expected, of the comparison between autism and other PDD categories, where the level of agreement was only fair.

Table I. Interrater Reliability for Clinician-Assigned Diagnoses by Diagnostic Group and Rater Experience/Professional Background (a)

                                   Autism vs. non-PDD        Autism vs. other          Autism vs. nonautistic PDD
Rater groups                       k     Cases  Clin. sig.   k     Cases  Clin. sig.   k     Cases  Clin. sig.
Experienced vs. Experienced        1.00  44     E            0.94  48     E            0.85  40     E
Psychologist vs. Psychiatrist      1.00  38     E            0.86  45     E            0.67  33     G
Inexperienced vs. Inexperienced    1.00  14     E            0.79  19     E            0.41  11     F
All reliability raters             0.95  103    E            0.81  131    E            0.65  95     G
Experienced vs. Inexperienced      0.89  42     E            0.70  61     G            0.59  43     F

(a) Levels of clinical significance of kappa values were defined as follows (criteria as per Cicchetti & Sparrow, 1981): E = Excellent Agreement (k between 0.75 and 1.00); G = Good Agreement (k between 0.60 and 0.74); F = Fair Agreement (k between 0.40 and 0.59); P = Poor Agreement (k less than 0.40).

Interrater Reliability of DSM-IV Criteria for Autistic Disorder

Interrater reliability coefficients were also obtained for the potential DSM-IV criteria for autism. These criteria, which now form the definition of autistic disorder in DSM-IV (APA, 1994), are listed in Table II together with their respective kappa coefficients. Given that kappa is a chance-corrected coefficient, low coefficients may be a function of a high chance probability of agreement rather than of poor rates of observed agreement. Therefore, percentages of observed agreement (PO) were obtained for those criteria for which kappas were not in the Excellent or Good categories of clinical significance. The clinical significance of POs was defined as follows: 90-100%, Excellent Agreement; 80-89%, Good Agreement; 70-79%, Fair Agreement; and <70%, Poor Agreement. Only criteria with kappa < .60 and PO < 80% were judged to be of poor or suboptimal reliability.

Table II lists the DSM-IV criteria for autistic disorder, their corresponding kappas and clinical significance, and the POs and their clinical significance. As can be seen in Table II, kappas and POs were generally in the Good to Excellent range, and none of the criteria had poor reliability as defined above.

Table II. Kappas, Percentage of Observed Agreement (PO), and Their Clinical Significance for the DSM-IV Criteria for Autistic Disorder

Criterion (a)  Kappa  Clinical significance  PO    Clinical significance
1A             0.73   Good                   0.89  Good
1B             0.76   Excellent              0.93  Excellent
1C             0.77   Excellent              0.89  Good
1D             0.74   Good                   0.90  Excellent
2A             0.75   Excellent              0.90  Excellent
2B             0.58   Fair                   0.83  Good
2C             0.79   Excellent              0.89  Good
2D             0.71   Good                   0.90  Excellent
3A             0.77   Excellent              0.88  Good
3B             0.63   Good                   0.84  Good
3C             0.69   Good                   0.85  Good
3D             0.64   Good                   0.82  Good
Onset          0.66   Good                   0.93  Excellent

(a) Criteria are listed in the order in which they appear in DSM-IV.

Interrater Reliability for DSM-IV-Assigned Diagnosis and How It Compares with Interrater Reliability for Clinician-Assigned Diagnosis

Interrater reliability coefficients for DSM-IV-assigned diagnosis (i.e., when the diagnostic assignment is made using DSM-IV criteria) and clinician-assigned diagnosis (i.e., when the diagnostic assignment is made without the use of DSM-IV criteria) were compared for pairs of experienced-experienced raters and inexperienced-inexperienced raters. Kappas and POs are presented in Table III. The interrater reliability coefficients for the pairs of experienced-experienced raters fell in the Excellent category of clinical significance for both the clinician-assigned and the DSM-IV-assigned diagnostic strategies. The level of agreement decreased somewhat with the utilization of DSM-IV criteria (though it remained in the Excellent range). This may be accounted for by the fact that experienced clinicians take into account a broader range of clinical phenomena beyond those captured and defined in the DSM-IV criteria for autistic disorder.
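The two agreement statistics used throughout these results, Cohen's kappa and the percentage of observed agreement (PO), can be sketched concretely. The following Python sketch uses made-up ratings (not Field Trial data) for two raters scoring a single dichotomous criterion, and classifies the kappa using the Cicchetti and Sparrow (1981) bands given in the Table I footnote; function names and example data are illustrative only.

```python
# Illustrative sketch: Cohen's (1960) kappa and percent observed agreement
# (PO) for two raters scoring one dichotomous criterion (1 = present,
# 0 = absent) across a set of cases. Ratings are hypothetical.

def kappa_and_po(rater1, rater2):
    """Return (kappa, PO) for two equal-length lists of 0/1 ratings."""
    n = len(rater1)
    # PO: proportion of cases on which the two raters concur.
    po = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement: from each rater's marginal "present"/"absent" rates.
    p1, p2 = sum(rater1) / n, sum(rater2) / n
    pe = p1 * p2 + (1 - p1) * (1 - p2)
    # Kappa: observed agreement corrected for chance agreement.
    return (po - pe) / (1 - pe), po

def clinical_significance(k):
    """Cicchetti & Sparrow (1981) bands, as in the Table I footnote."""
    if k >= 0.75:
        return "Excellent"
    if k >= 0.60:
        return "Good"
    if k >= 0.40:
        return "Fair"
    return "Poor"

# Hypothetical ratings of 10 cases by two raters.
r1 = [1, 1, 1, 0, 0, 1, 1, 0, 1, 0]
r2 = [1, 1, 0, 0, 0, 1, 1, 0, 1, 1]
k, po = kappa_and_po(r1, r2)
print(round(k, 2), po, clinical_significance(k))  # 0.58 0.8 Fair
```

Note how a criterion can show Good observed agreement (PO = 80%) while its chance-corrected kappa falls only in the Fair band; this is exactly why both statistics are reported, and why a criterion is flagged as suboptimal only when kappa < .60 and PO < 80%.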
In contrast, the interrater reliability coefficients for the pairs of inexperienced-inexperienced raters, which fell in the Poor category for clinician-assigned diagnoses, were elevated to the uppermost level of the Fair or even the Good category of clinical significance when these raters utilized the DSM-IV criteria for making diagnostic assignments. Therefore, the utilization of DSM-IV criteria appeared to improve the interrater reliability of diagnosis.

To further examine the changes in the reliability coefficients obtained in the comparison between the two diagnostic strategies, the statistical significance of the difference between the kappa coefficients was calculated within the rater pairs (Cicchetti & Heavens, 1981; Fleiss & Cicchetti, 1978). While the difference in kappas for the experienced-experienced raters was not statistically significant (Z = 1.35, ns), the difference in kappas for the inexperienced-inexperienced raters approached statistical significance (Z = 1.70, p = .089; p = .045 for a one-tailed test). Although this comparison only bordered on statistical significance, it is of interest that there was a clinically significant improvement in interrater reliability among the inexperienced raters when they utilized the DSM-IV criteria, going, as noted, from Poor to Fair/Good rates of agreement. It is likely, therefore, that in contrast with the experienced raters, the utilization of DSM-IV criteria by inexperienced raters improved their clinical considerations and, in turn, their diagnostic reliability.

DISCUSSION

This study focused on issues of interrater reliability in the diagnosis of autistic disorder. Based on reliability analyses of diagnostic data collected on cases rated by two clinicians in the context of the DSM-IV Autism Field Trial, a series of important questions could be clarified.
Table III. Kappas, Percentage of Observed Agreement (PO), and Their Clinical Significance for DSM-IV-Assigned Diagnosis and Clinician-Assigned Diagnosis

Raters                             Diagnostic strategy   Kappa  Clinical significance  PO    Clinical significance
Experienced vs. experienced        Clinician-assigned    0.94   Excellent              0.98  Excellent
                                   DSM-IV-assigned       0.84   Excellent              0.91  Excellent
Inexperienced vs. inexperienced    Clinician-assigned    0.34   Poor                   0.67  Poor
                                   DSM-IV-assigned       0.59   Fair                   0.80  Good

First, it was shown that the interrater reliability among clinicians making a diagnosis of autism and related PDDs without the use of DSM-IV criteria was overall quite high, although agreement decreased
somewhat when the differential diagnosis involved a comparison between autism and other forms of PDD. Differences in professional background among the raters were of little significance. In contrast, differences in clinical experience had a more marked impact on the reliability coefficients, with inexperienced raters showing lower rates of agreement, particularly with regard to comparisons between autism and other forms of PDD.

The second question concerned the interrater reliability of the various DSM-IV criteria for autistic disorder. Without exception, the coefficients of agreement for the various criteria fell in the Good to Excellent range of clinical significance. None of the criteria had suboptimal or poor reliability.

The final set of questions addressed probably the most important aspect of this study, namely, to what extent the use of DSM-IV criteria improves the reliability of diagnosis when compared to a diagnostic process making no use of the criteria (i.e., when the clinician assigns a diagnosis based on overall clinical impressions only). The answer to this question depended on the raters in question. When pairs of experienced raters were involved, there was little difference between the reliability coefficients obtained for DSM-IV-based and clinician-assigned diagnoses. In fact, the clinician-assigned coefficients were a little higher, possibly reflecting the fact that experienced clinicians consider a broader range of information than that captured and defined in the DSM-IV definition when making the diagnosis of autism. In contrast, there was a clinically significant improvement in diagnostic reliability when inexperienced raters used the DSM-IV criteria, suggesting that in their case the use of these criteria was beneficial and clearly superior to their overall clinical judgments.
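The statistical comparison quoted above (Z = 1.70 for the inexperienced-rater pairs) can be sketched in Python. The conversion from a normal deviate Z to a p value is standard; the standard errors that feed into the Z statistic are placeholders here, since the study's actual standard-error computations (Fleiss & Cicchetti, 1978) are not reproduced in the text.

```python
# Hedged sketch: converting a kappa-difference Z statistic to a p value.
# The standard errors used by the study are not given in the text, so
# kappa_difference_z is shown with placeholder inputs only.
from math import erf, sqrt

def z_to_p(z, one_tailed=False):
    """p value for a standard normal deviate (two-tailed by default)."""
    p_two = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return p_two / 2 if one_tailed else p_two

def kappa_difference_z(k1, se1, k2, se2):
    """Z for the difference between two independent kappas, given their SEs."""
    return (k1 - k2) / sqrt(se1 ** 2 + se2 ** 2)

# The reported Z = 1.70 corresponds to:
print(round(z_to_p(1.70), 3))                    # two-tailed: 0.089
print(round(z_to_p(1.70, one_tailed=True), 3))   # one-tailed: 0.045
```

Consistent with the text, the experienced-pair value Z = 1.35 yields a two-tailed p of about .18, i.e., nonsignificant.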
Although the reason why this may have been so is beyond the scope of this study, it is likely that the use of DSM-IV criteria both broadened and structured these clinicians' observations and clinical considerations. This is, at any rate, an opinion commonly voiced by trainees, who can benefit from the structure and guidance provided by the DSM-IV criteria. If so, one may say that DSM-IV makes an important contribution to clinical practice.

REFERENCES

American Psychiatric Association. (1994). Diagnostic and statistical manual of mental disorders (4th ed.). Washington, DC: Author.
Buitelaar, J. K., Van der Gaag, R., Klin, A., & Volkmar, F. R. (1999). Exploring the boundaries of pervasive developmental disorder not otherwise specified: Analyses of data from the DSM-IV autistic disorder field trial. Journal of Autism and Developmental Disorders, 29, 33-43.
Cicchetti, D. V., & Heavens, R., Jr. (1981). A computer program for determining the significance of the difference between pairs of independently derived values of kappa or weighted kappa. Educational and Psychological Measurement, 41, 189-193.
Cicchetti, D. V., & Sparrow, S. S. (1981). Developing criteria for establishing inter-rater reliability of specific items in a given inventory. American Journal of Mental Deficiency, 86, 127-137.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.
Fleiss, J. (1981). Statistical methods for rates and proportions (2nd ed.). New York: Wiley.
Fleiss, J. L., & Cicchetti, D. V. (1978). Inference about weighted kappa in the non-null case. Applied Psychological Measurement, 2, 113-117.
Fombonne, E. (1998). Epidemiological surveys of autism. In F. R. Volkmar (Ed.), Autism and pervasive developmental disorders (pp. 32-63). Cambridge, UK: Cambridge University Press.
Goodman, R., & Simonoff, E. (1991). Reliability of clinical ratings by trainee child psychiatrists: A research note.
Journal of Child Psychology and Psychiatry, 32, 551-555.
Klin, A., Carter, A., Volkmar, F. R., Cohen, D. J., Marans, W. D., & Sparrow, S. S. (1997). Assessment issues in children with autism. In D. J. Cohen & F. R. Volkmar (Eds.), Handbook of autism and pervasive developmental disorders (pp. 411-447). New York: Wiley.
Lord, C., Pickles, A., McLennan, J., Rutter, M., Bregman, M., Folstein, S., Fombonne, E., Leboyer, M., & Minshew, N. (1997). Diagnosing autism: Analyses of data from the Autism Diagnostic Interview. Journal of Autism and Developmental Disorders, 27, 501-517.
Lord, C., Rutter, M., & DiLavore, P. (1996). Autism Diagnostic Observation Schedule-Generic (ADOS-G). Unpublished manuscript, University of Chicago, Chicago, IL.
Lord, C., Rutter, M., & Le Couteur, A. (1994). Autism Diagnostic Interview-Revised: A revised version of a diagnostic interview for caregivers of individuals with possible pervasive developmental disorders. Journal of Autism and Developmental Disorders, 24, 659-685.
Mattison, R., Cantwell, D. P., Russell, A. T., & Will, L. (1979). A comparison of DSM-II and DSM-III in the diagnosis of childhood psychiatric disorders: 2. Inter-rater agreement. Archives of General Psychiatry, 36, 1217-1222.
Perry, A., Veleno, P., & Factor, D. (1998). Inter-rater agreement between direct care staff and psychologists for the diagnosis of autism according to DSM-III, DSM-III-R, and DSM-IV. Journal of Developmental Disabilities, 6, 32-43.
Rutter, M. (1978). Diagnosis and definition of childhood autism. Journal of Autism and Childhood Schizophrenia, 8, 139-161.
Spitzer, R. L., & Siegel, B. (1990). The DSM-III-R field trial of pervasive developmental disorders. Journal of the American Academy of Child and Adolescent Psychiatry, 29, 855-862.
Spitzer, R. L., & Williams, J. B. (1988). Having a dream: A research strategy for DSM-IV. Archives of General Psychiatry, 45, 871-874.
Szatmari, P. (1992). A review of the DSM-III-R criteria for autistic disorder.
Journal of Autism and Developmental Disorders, 22, 507-524.
Volkmar, F. R., Cicchetti, D. V., Bregman, J., & Cohen, D. J. (1992). Developmental aspects of DSM-III-R criteria for autism. Journal of Autism and Developmental Disorders, 22, 657-662.
Volkmar, F. R., Klin, A., & Cohen, D. J. (1997). Diagnosis and classification of autism and related conditions: Consensus and issues. In D. J. Cohen & F. R. Volkmar (Eds.), Handbook of autism and pervasive developmental disorders (2nd ed., pp. 5-40). New York: Wiley.
Volkmar, F. R., Klin, A., Siegel, B., Szatmari, P., Lord, C., Campbell, M., Freeman, B. J., Cicchetti, D. V., Rutter, M., Kline, W., Buitelaar, J., Hattab, Y., Fombonne, E., Fuentes, J., Werry, J., Stone, W., Kerbeshian, J., Hoshino, Y., Bregman, J., Loveland, K., Szymanski, L., & Towbin, K. (1994). DSM-IV Autism/pervasive developmental disorder field trial. American Journal of Psychiatry, 151, 1361-1367.
Volkmar, F. R., & Rutter, M. (1995). Childhood Disintegrative Disorder: Results of the DSM-IV autism field trial. Journal of the American Academy of Child and Adolescent Psychiatry, 34, 1092-1095.
World Health Organization. (1993). International classification of diseases (10th rev.): Chap. 5, Mental and behavioral disorders (including disorders of psychological development), diagnostic criteria for research. Geneva: Author.