Practical Radiation Oncology (2013) 3, 74-78
www.practicalradonc.org

Special Article

Reliability of oral examinations: Radiation oncology certifying examination

June C. Yang PhD, Paul E. Wallner DO, Gary J. Becker MD, Jennifer L. Bosma PhD, Anthony M. Gerdeman PhD

American Board of Radiology, Tucson, Arizona

Received 28 July 2011; revised 24 October 2011; accepted 25 October 2011

Conflicts of interest: None.
Corresponding author: American Board of Radiology, 5441 E. Williams Blvd, Tucson, AZ 85711. E-mail address: jyang@theabr.org (J.C. Yang).

Abstract

Purpose: Oral examinations are used as certifying examinations by many medical specialty boards. They represent daily clinical practice situations more realistically than written or computer-based tests do. However, concerns have been raised repeatedly in the literature regarding objectivity, fairness, extraneous factors arising from interpersonal interactions, item bias, reliability, and validity. In this study, the reliability of the oral component of the radiation oncology certifying examination, administered in May 2010, was analyzed.

Methods and Materials: One hundred fifty-two candidates rotated through 8 examination stations. Stations consisted of a hotel room equipped with a computer and software that exhibited images appropriate to the content areas. Each candidate had a 25- to 30-minute face-to-face encounter with an oral examiner who was a content expert in one of the following areas: gastrointestinal, gynecology, genitourinary, lymphoma/leukemia/transplant/myeloma, head/neck/skin, breast, central nervous system/pediatrics, or lung/sarcoma. This type of design is typically referred to as a repeated-measures design or a subject-by-treatment design, although the oral examination was a routine event without any experimental manipulation.

Results: The reliability coefficient, obtained by applying Feldt and Charter's simple computational alternative to analysis of variance formulas, yielded a KR-20 (Cronbach's coefficient alpha) of 0.81.

Conclusions: An experimental design to develop a blueprint to improve the consistency of evaluation is suggested.

© 2013 American Society for Radiation Oncology. Published by Elsevier Inc. All rights reserved.

Introduction

Oral examinations continue to be administered as certifying examinations by 14 of 24 medical specialty boards. Among the advantages of oral examinations is that they represent daily clinical practice situations more realistically than written examinations do. An oral examination can test the limits of a candidate's knowledge of a given topic. Oral examinations are considered particularly effective in assessing a candidate's clinical decision-making ability and interpersonal skills, as well as intrapersonal qualities such as confidence and self-awareness.1,2 However, concerns about the reliability, validity, and fairness of oral examinations have been expressed repeatedly in the literature.1-5
The major concerns that have been raised pertain to the inherent nature of oral examinations, that is, the effects of personal interaction. These include the following: examinees' communication or language skills; examinees' familiarity with the oral examination format; inhibition of performance due to stress; examiners' judgments being influenced by demographic characteristics such as age, gender, race, or socioeconomic status; and other factors unrelated to actual capabilities (knowledge, skills, and judgment). Would a candidate receive the same score if a different examiner tested him or her? Reliability is a major concern because subjective judgments may affect the scores. Different examiners may differ in their degree of harshness or leniency, and content or follow-up questions may vary from one examiner to another.

Contrary to the above concerns, Lunz and Bashook2 found that, when scores on communication ability were compared with scores on oral examinations in a medical specialty board certification process, the correlation coefficient between the communication scores and the oral examination scores was 0.10. A random sample of 90 candidates was observed, and nonmedical researchers measured the candidates' communication ability using a 21-item communication survey instrument. The authors concluded that candidates' oral examination scores were not influenced by their communication abilities.

Recently, the American Board of Psychiatry and Neurology decided to eliminate oral examinations from its certification examination process, citing as concerns the stress and anxiety the oral examinations caused candidates, as well as their financial burden.6 The American Board of Radiology (ABR) has also decided to discontinue, beginning in 2014, the oral examination as the final certifying examination for the diagnostic radiology certification process. This decision was based on concerns about subjectivity and technically inadequate presentations. In years past, case presentations during the oral examination were similar to what could be observed in daily clinical situations; however, with the advancement of electronic and technical modalities, such as computed tomography, magnetic resonance (MR) imaging, MR angiography, and MR spectroscopy, it has become more difficult for the oral examination to simulate actual clinical practice. Medical physics and radiation oncology (RO) will continue to use oral examinations in their certifying processes for those who have passed the initial written qualifying examinations.

Methods and materials

The radiation oncology oral examination, a criterion-referenced test, is administered as the final certifying examination to those who have completed 1 internship/transitional year plus 4 years of radiation oncology residency. This generally occurs 10 months after candidates obtain a passing score on the qualifying examination, which is taken at the end of the fourth year of residency. In this study, the oral examination scores were obtained from the regularly scheduled RO oral examination administered in May 2010. The oral examination was not an experimental design, and no modification or manipulation was made in obtaining the candidates' performance scores. No Institutional Review Board application was submitted because no identifying human subject information was used in the study.
Following a long-established procedure, all RO oral examiners received about 1 hour of general orientation followed by several hours of discussion within their specific category groups before the beginning of the oral examination. In addition, all new examiners were required to observe 2 oral examinations before giving their first examinations. No announcements or comments were made to the examiners or examinees regarding a study of the reliability of the oral examination; that is, the reliability coefficient was calculated simply from the existing data.

In this routinely administered RO oral examination, 152 candidates rotated through 8 examination stations. Stations consisted of a hotel room equipped with a computer and software that exhibited images appropriate to the content areas. Each candidate had a face-to-face encounter with an oral examiner who was a content expert in one of the following areas: gastrointestinal, gynecology, genitourinary, lymphoma/leukemia/transplant/myeloma, head/neck/skin, breast, central nervous system/pediatrics, or lung/sarcoma. Each oral examination lasted 25-30 minutes. The scoring rubrics were developed over the years; the continuous score scale ranged from 68 to 72 in 1-point intervals, with 70 as the passing score for each category. During each face-to-face oral examination, the examiners, independent of each other, recorded specifics of the candidate's performance on a blank form provided with case numbers. At the end of the oral examinations, the examiners of each category met to discuss in depth the accuracy and fairness of some candidates' failing grades. After the meeting, candidates with failing or borderline performances were presented to the oral examiner panel of all 8 clinical categories so that their performances could be reconsidered and their final grades determined.

The textbook Educational Measurement presents several methods for estimating reliability coefficients, depending on the type of test and the theoretical framework.7 The design of this RO oral examination is referred to as a one-way repeated-measures design, a subject-by-treatment design, or a completely randomized block design; the analyses of the results are derived from analysis of variance (ANOVA) and produce exactly the same results. Although the result will be identical, one may also apply a generalizability model, as presented by Brennan in his GENOVA program.8 The authors chose a simple computational alternative to ANOVA (or GENOVA), published by Feldt and Charter in 2004,9 as follows:

$$ r_k = 1 - \frac{SD_x^2 - SD_j^2 - SD_s^2}{(k-1)\,SD_s^2}, $$

where $r_k$ is the k-judge reliability; $SD_x$ is the standard deviation of all scores, using $kn$ as the divisor ($n$ is the number of subjects and $k$ is the number of judges); $SD_j$ is the standard deviation of the judge means, using $k$ as the divisor; and $SD_s$ is the standard deviation of the subjects' test performances, where each subject's best score is taken as the mean of the judges' ratings and $n$ is the divisor. Note that $SD^2$ denotes a variance. The above formula is identical to Cronbach's coefficient alpha, which is the same as the Kuder-Richardson formula 20 (KR-20) and can also be obtained from ANOVA results with the interaction term subtracted; that is, $\rho = (MS_s - MS_{A \times S})/MS_s$, where $MS$ denotes a mean square, the subscript $s$ denotes subjects, and the subscript $A \times S$ denotes the judge-by-subject interaction.9
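To make the computation concrete, the following minimal sketch (not part of the original study) implements the Feldt-Charter shortcut in Python with NumPy and cross-checks it against the ANOVA form, using the first 5 candidates' scores from Table 1 for illustration:

```python
import numpy as np

def feldt_charter_alpha(x):
    """k-judge reliability (Cronbach's coefficient alpha) for an
    n-subjects x k-judges score matrix, via the Feldt-Charter shortcut.
    All variances use the population divisor, as in the formula above."""
    n, k = x.shape
    sd_x2 = x.var()               # variance of all n*k scores (divisor nk)
    sd_j2 = x.mean(axis=0).var()  # variance of the k judge means (divisor k)
    sd_s2 = x.mean(axis=1).var()  # variance of the n subject means (divisor n)
    return 1.0 - (sd_x2 - sd_j2 - sd_s2) / ((k - 1) * sd_s2)

def anova_alpha(x):
    """The equivalent ANOVA form: rho = (MS_s - MS_AxS) / MS_s."""
    n, k = x.shape
    grand = x.mean()
    ms_s = k * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    resid = (x - x.mean(axis=1, keepdims=True)
               - x.mean(axis=0, keepdims=True) + grand)
    ms_axs = (resid ** 2).sum() / ((n - 1) * (k - 1))
    return (ms_s - ms_axs) / ms_s

# First 5 candidates from Table 1 (8 raters each), for illustration only;
# the full 152 x 8 matrix yields the 0.81 reported in the Results.
scores = np.array([
    [70, 70, 70, 70, 68, 70, 70, 69],
    [70, 70, 69, 69, 69, 68, 71, 69],
    [70, 71, 70, 70, 70, 71, 71, 71],
    [70, 70, 70, 70, 70, 70, 69, 70],
    [70, 71, 72, 70, 69, 70, 71, 70],
], dtype=float)

print(feldt_charter_alpha(scores))
print(anova_alpha(scores))  # identical by construction
```

Because the Feldt-Charter expression is an algebraic rearrangement of the ANOVA mean squares, the two functions return identical values for any complete score matrix.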
Results

Applying Feldt and Charter's formula,9 the simple calculation method yielded an inter-rater reliability coefficient (KR-20) of 0.81 for the oral examination administered by 8 content expert examiners. That is, an estimated 81% of the variance in observed scores reflects true differences among examinees, so the examinees' rank orders would be largely preserved if a comparable examination were administered. Table 1 exhibits the data entry format, with partial data on the oral examination scores of the 152 candidates, the mean of each candidate across the 8 competency areas, and the mean of each rater, ie, content expert examiner (the terms rater and examiner are used interchangeably in this paper).

Correlation coefficients were obtained among the 8 oral examination scores and the 3 RO written examination scores (biology, physics, and clinical), acknowledging that the examinees may have taken different written examination forms at different times and that the reliability coefficients varied somewhat from test to test. During the past 5 years, the reliability coefficients of the RO biology tests varied from 0.92 to 0.96 (0.95 on average), those of the RO physics examinations ranged from 0.88 to 0.96 (0.94 on average), and those of the RO clinical component examinations ranged from 0.93 to 0.95 (0.95 on average). Restriction in range was not corrected because the 152 examinees had taken different versions of the written examinations over the years. All correlation coefficients were statistically significant at 2-tailed P = .05, and most were statistically significant at P = .0001, suggesting highly consistent relationships between the written test scores and the oral examination scores.
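For reference, a correlation of this kind, with its 2-tailed probability, can be computed as sketched below. The data here are simulated stand-ins (the candidate-level ABR scores are not published), and SciPy is assumed to be available:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2010)
n = 152  # number of candidates

# Simulated stand-ins for one written test and one oral category score;
# a shared ability factor induces a moderate positive correlation.
ability = rng.normal(0.0, 1.0, n)
clinical_written = 70.0 + 2.0 * ability + rng.normal(0.0, 2.0, n)
oral_gi = np.clip(np.rint(70.0 + 0.8 * ability + rng.normal(0.0, 0.8, n)), 68, 72)

r, p = pearsonr(clinical_written, oral_gi)  # Pearson r with 2-tailed P
print(f"r = {r:.3f}, two-tailed P = {p:.1e}")
```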
Table 1. Data format of May 2010 radiation oncology oral examination scores

Subject      Rater 1  Rater 2  Rater 3  Rater 4  Rater 5  Rater 6   Rater 7     Rater 8  Mean
(candidate)  (GI)     (Gyn)    (GU)     (Lymph)  (HNS)    (Breast)  (CNS/PEDS)  (Lung)   (subject)
1            70       70       70       70       68       70        70          69       69.63
2            70       70       69       69       69       68        71          69       69.38
3            70       71       70       70       70       71        71          71       70.50
4            70       70       70       70       70       70        69          70       69.88
5            70       71       72       70       69       70        71          70       70.38
6            71       71       71       71       71       70        71          71       70.88
7            70       71       71       70       70       71        70          70       70.38
8            69       70       68       70       68       70        70          69       69.25
9            69       69       69       70       70       71        68          70       69.50
10           71       69       70       69       71       71        71          70       70.25
11           70       71       71       71       70       69        69          71       71.00
...
152          71       71       71       72       71       71        70          70       70.88
Mean         70.43    70.66    70.52    70.52    70.44    70.54     70.50       70.50    70.51
SD           0.64     0.89     0.77     0.76     0.88     0.84      0.87        0.85     0.53

CNS/PEDS, central nervous system/pediatrics; GI, gastrointestinal; GU, genitourinary; Gyn, gynecology; HNS, head/neck/skin; Lymph, lymphoma/leukemia/transplant/myeloma.
The clinical examination, a written test, was highly correlated with all 8 categories of the oral examination, yielding correlation coefficients ranging from 0.412 to 0.625, all statistically significant at probabilities ranging from 2 × 10⁻⁶ to 8 × 10⁻¹⁷. Such high correlations could not have occurred merely by chance; one can expect that candidates who obtain high scores on the clinical written examination will perform well on the oral examinations. The correlation coefficients between the physics written examination and the 8 oral examination categories were also statistically significant, with probabilities ranging from 0.005 to 5 × 10⁻²⁴. The same inference can be drawn about the predictability of the physics written test: these relationships could not have been the result of chance, and candidates who perform well on the physics written test will perform well on the oral examinations. Similarly, the biology written examination was a strong predictor of performance on the oral examinations, with probabilities ranging from 0.004 to 2 × 10⁻⁶. Intercorrelations among the 8 oral examination categories were also statistically significant, with probabilities ranging from 0.034 to 3 × 10⁻¹¹, which suggests that candidates who perform well on one oral examination category will also perform well on the other categories. Table 2 exhibits the detailed correlation coefficients for reference.

Discussion

The fact that many correlation coefficients ranged from 0.300 to 0.600 suggests that unique factors existed in each form of examination; that is, the oral examinations and the written examinations each measured some aspects of knowledge, skill, or judgment that were not measured by the other form. If a correlation coefficient between a written test and an oral examination approached 1.0, the two tests would probably be measuring the same attributes of the criteria. Although statistically significant at P = .05, the correlation between the breast category oral examination and the RO physics written test was less high (P = .028), as was the correlation between the breast category and the lymphoma/leukemia/transplant/myeloma category oral examinations (P = .034). No explanation is certain, but one possibility is that the reliability of the breast category was lower than that of the other categories because all of the regular breast category examiners were first-time examiners; the 2 relief examiners were experienced, but each examined fewer candidates.
Table 2. Intercorrelation among radiation oncology oral examination categories and computerized (written) test scores (n = 152)

             Clinical Physics Biology GI     Gyn    GU     Lymph  HNS    Breast CNS/PEDS Lung
Clinical  r  1        0.575   0.625   0.510  0.523  0.412  0.390  0.497  0.375  0.472    0.453
          P           0.000   0.000   0.000  0.000  0.000  0.000  0.000  0.000  0.000    0.000
Physics   r           1       0.715   0.300  0.338  0.225  0.327  0.342  0.178  0.220    0.335
          P                   0.000   0.000  0.000  0.005  0.000  0.000  0.028  0.007    0.000
Biology   r                   1       0.355  0.326  0.307  0.301  0.407  0.257  0.233    0.353
          P                           0.000  0.000  0.000  0.000  0.000  0.001  0.004    0.000
GI        r                           1      0.400  0.333  0.387  0.362  0.347  0.437    0.297
          P                                  0.000  0.000  0.000  0.000  0.000  0.000    0.000
Gyn       r                                  1      0.432  0.527  0.484  0.296  0.378    0.457
          P                                         0.000  0.000  0.000  0.000  0.000    0.000
GU        r                                         1      0.290  0.428  0.306  0.266    0.392
          P                                                0.000  0.000  0.000  0.001    0.000
Lymph     r                                                1      0.372  0.172  0.338    0.232
          P                                                       0.000  0.034  0.000    0.004
HNS       r                                                       1      0.306  0.336    0.332
          P                                                              0.000  0.000    0.000
Breast    r                                                              1      0.318    0.242
          P                                                                     0.000    0.003
CNS/PEDS  r                                                                     1        0.232
          P                                                                              0.004
Lung      r                                                                              1

Probabilities are for 2-tailed tests. Oral examination categories correspond to Raters 1-8 (GI, Gyn, GU, Lymph, HNS, Breast, CNS/PEDS, Lung). CNS/PEDS, central nervous system/pediatrics; GI, gastrointestinal; GU, genitourinary; Gyn, gynecology; HNS, head/neck/skin; Lymph, lymphoma/leukemia/transplant/myeloma.
As mentioned above, there are concerns in the literature about the reliability and validity of oral examinations because of potential bias, the effects of interpersonal communication skills, and possible subjectivity. However, some research studies dispute some of these concerns. In this study, each candidate was examined by 8 content experts for a period of 25-30 minutes each, and the raters scored the candidates' performances independently. The inter-rater reliability of the May 2010 ABR RO certifying oral examination was 0.81. Considering the lack of extensive systematic training, this is remarkably high for an oral examination administered by 8 content expert examiners, and higher than might be expected from the concerns expressed in the literature and mentioned in the introduction of this paper.1-5 Nevertheless, more systematic examiner training before the beginning of the oral examination may further increase the consistency of the raters' evaluations of the candidates' performances, and training to standardize cases and questions may also help increase consistency in grading. In addition, variability in scores beyond the 68 to 72 range would probably increase the reliability coefficient (see the illustrative sketch at the end of this section).

The reliability of each category can affect the inter-rater reliability: if the reliability of a content category is low, the category will not correlate highly with the other content categories. The reliability of the breast category may be suspect because the correlation coefficient between the breast category and the lymphoma/leukemia/transplant/myeloma category (P = .034), as well as the correlation coefficient between the breast category and the physics written test (P = .028), was not as high as the correlation coefficients among the other content categories, including the written test scores. It was not possible in this study to determine the reliability coefficient of a single category by itself.

Though somewhat arbitrary, a scale with 5-point intervals could be designed, with scoring guidelines that are familiar to the raters and easy to follow. The trustees could be involved in the decision-making process regarding the development of the scoring rubric, especially in deciding whether to allow raters to assign scores on a 5-point interval scale with a wider range of points, for example, from 65 to 85 or from 70 to 100, with specific guidelines for each interval.

A systematic method to increase consistency among the raters could be implemented by providing systematic training to standardize cases and questions. Each 25- to 30-minute oral examination could be audiotaped by content category; the audiotapes could then be transcribed, analyzed, and categorized by the questions the examiners asked. This method could produce the blueprint for the oral examination in each category. During the process of developing the blueprint, a scale with 5-point intervals could be implemented instead of the 1-point interval scale (although this would have no effect on developing the oral examination blueprints).

Although the reliability coefficient of the ABR May 2010 oral examination was remarkably high, the ABR, like any organization administering high-stakes examinations, must continue to strive for high reliability, and thus validity, of its examinations to protect candidates and the public. Reliability is a necessary condition for validity, but not a sufficient one.
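As a rough illustration of the range argument above (a simulation under assumed score distributions, not an analysis of ABR data), quantizing the same latent ratings to the current 68-72 window and to a hypothetically wider window suggests how a narrow range can depress the coefficient:

```python
import numpy as np

def feldt_charter_alpha(x):
    # k-judge reliability from an n x k score matrix (population-variance form)
    n, k = x.shape
    sd_x2, sd_j2, sd_s2 = x.var(), x.mean(axis=0).var(), x.mean(axis=1).var()
    return 1.0 - (sd_x2 - sd_j2 - sd_s2) / ((k - 1) * sd_s2)

rng = np.random.default_rng(0)
n, k = 152, 8
# Assumed latent model (for demonstration only): candidate ability with
# SD 2 plus judge-specific error with SD 1, centered at 70.
latent = rng.normal(70.0, 2.0, (n, 1)) + rng.normal(0.0, 1.0, (n, k))

narrow = np.clip(np.rint(latent), 68, 72)  # current 68-72 integer scale
wide = np.clip(np.rint(latent), 60, 80)    # hypothetically wider scale
print(feldt_charter_alpha(narrow))  # clipping compresses subject variance
print(feldt_charter_alpha(wide))    # typically higher: more true spread survives
```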
Currently, the associate executive director for radiation oncology is working on a revision of the case formats to better rationalize oral examination content and thereby improve the reliability, and thus the validity, of the radiation oncology oral examinations.

References

1. Memon AM, Joughin GR, Memon B. Oral assessment and postgraduate medical examinations: establishing conditions for validity, reliability and fairness. Adv Health Sci Educ Theory Pract. 2010;15:277-279.
2. Lunz ME, Bashook PG. Relationship between candidate communication ability and oral certification examination scores. Med Educ. 2008;42:1227-1233.
3. US Congress. The Court Interpreters Act of 1978, PL 95-539, 28 USC 1827. Washington, DC: United States Congress; 1978.
4. Steinberg PI. Oral examination anxiety in physicians, narcissism, and object relations. J Appl Psychoanal Stud. 2002;4:379-388.
5. Stansfield CW, Hewitt WE. Examining the predictive validity of a screening test for court interpreters. Lang Test. 2005;22:438-462.
6. Pascuzzi RM. Opinion/education: the ABPN is the neurology resident's best friend. Neurology. 2008;70:e16-e19.
7. Brennan RL, ed. Educational Measurement. 4th ed. Westport, CT: American Council on Education/Praeger Publishers; 2006:65-110, 221-256.
8. Brennan RL. Generalizability Theory. New York, NY: Springer-Verlag; 2001:453-469.
9. Feldt LS, Charter RA. A simple computational alternative to analysis of variance formulas for estimating the k-judge reliability. Psychol Rep. 2004;94:514-516.