UNIVERSITY OF CALGARY

Reliability & Validity of the Objective Structured Clinical Examination (OSCE): A Meta-Analysis

by

Ibrahim Al Ghaithi

A THESIS SUBMITTED TO THE FACULTY OF GRADUATE STUDIES IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

GRADUATE PROGRAM IN COMMUNITY HEALTH SCIENCES

CALGARY, ALBERTA

OCTOBER, 2016

© Ibrahim Al Ghaithi 2016

Abstract

Background: The objective structured clinical examination (OSCE) provides one of the most commonly used methods for assessing clinical skill competencies in the health professions. Objectives: To investigate the existing published research on the reliability, validity and feasibility of the OSCE in the assessment of physicians and residents in medical education programs. Methods: The literature search for peer-reviewed journal publications that used an OSCE assessment method to evaluate clinical skill competence included the MEDLINE, PsycINFO, ERIC and EMBASE databases. Results: In total, 49 studies met the inclusion and exclusion criteria for the final analysis. The OSCE assessment method has moderate internal reliability [mean alpha coefficient (α) = 0.70], low to moderate criterion validity [mean Pearson correlation (r) = 0.46] and low to moderate construct validity (mean r = 0.42). High heterogeneity was observed, a large part of which was attributed to multiple sources of measurement error. The mean cost per candidate was $353 ± $362 (95% confidence interval: $25 to $1,083). Conclusions: The OSCE method for the assessment of clinical skill competence was found to be reliable and valid; however, its administration costs are much higher than those of written examinations or direct observation of clinical skill performance in practice.

Acknowledgments

I am heartily thankful to my supervisor, Dr. Tyrone Donnon, whose encouragement, guidance, and support from the formative stages of this thesis to the final draft enabled me to develop an understanding of the whole subject. I would also like to thank Dr. Elizabeth Oddone-Paolucci and Dr. Aliya Kassam for being part of my supervisory committee.

Dedication

For my wife, Fatma, and my son, Khalid, who have always been there through the hard times and offered me unconditional love and support throughout the course of this thesis.

Table of Contents

Abstract
Acknowledgments
Dedication
Table of Contents
List of Tables
List of Figures
CHAPTER ONE: INTRODUCTION
  Overview and History of the OSCE
  What is an OSCE?
  Why or why not use an OSCE?
  Development of an OSCE
  Evolution of the OSCE
  Rationale for the Study
CHAPTER TWO: LITERATURE REVIEW
  The Role of the OSCE in High-Stakes Decision Making
    1) Checklists vs. Global Rating Scales
    2) Setting the Standards for OSCE Cutoff Scores
  Objectivity
  The OSCE's Psychometric Qualities
    Reliability
      Inter-station reliability or internal consistency
      Inter-rater reliability
    Validity
      Traditional Concept of Validity
      Modern Concept of Validity
  Feasibility
  Examinees' perception of the OSCE
  Research Question and Hypothesis
CHAPTER THREE: METHODS
  Selection of studies
  Inclusion and exclusion criteria for eligible studies
  Coding protocol and data extraction
  Statistical analysis
CHAPTER FOUR: RESULTS
  Reliability
  Validity
    1) Criterion Validity
    2) Construct Validity
    3) Predictive Validity
  Feasibility

    The Cost
    Number of Sites
    Number of days over which the OSCE has been conducted
    OSCE's Duration
    Number of Forms
    Number of Languages
CHAPTER FIVE: DISCUSSION
  Reliability
    1) Station-related measurement errors
    2) Examiner-/rater-related measurement errors
    3) Scoring-related measurement errors
    4) Examinee-related measurement errors
  Validity
    1) Is the OSCE content valid
    2) Is the OSCE constructively valid
    3) Is the OSCE concurrently valid
  Feasibility
    Number of stations
    Number of SPs/Patients
    Examiners
    Large-scale exam
    Post-processing/scoring
    Space/Venue
  Strengths and Limitations of the Study
    Strengths
    Limitations
  Recommendations for Future Research
REFERENCES
Appendix 1: List of Search Terms
Appendix 2: Coding Sheet

List of Tables

Table 4.1: Summary of the 49 studies included in the OSCE meta-analysis
Table 4.2: Moderator analyses
Table 4.3: Characteristics of studies that have provided cost per candidate
Table 4.4: Number of sites required to complete the OSCE
Table 4.5: Number of days required to complete the OSCE
Table 4.6: The duration of the included studies' OSCEs in hours
Table 4.7: Number of forms used in OSCEs
Table 4.8: Languages used in OSCEs

List of Figures

Figure 1.1: The CanMEDS role framework (18)
Figure 3.1: Selection of studies for the OSCE meta-analysis
Figure 4.1: Overall distributions of reliability and generalizability coefficients
Figure 4.2: Scatterplot of reliability coefficient (alpha) by number of stations with unweighted regression line
Figure 4.3: Forest plot for Criterion Validity (r)
Figure 4.4: Forest plot for Criterion Validity (r) by Number of Stations
Figure 4.5: Forest plot for Criterion Validity (r) by Station's Duration
Figure 4.6: Forest plot for Criterion Validity (r) by Scoring Methods
Figure 4.7: Forest plot for Criterion Validity (r) by Number of Raters
Figure 4.8: Forest plot for Criterion Validity (r) by Examiner
Figure 4.9: Forest plot for Criterion Validity (r) by Exam's Content
Figure 4.10: Forest plot for Criterion Validity (r) by Exam's Context
Figure 4.11: Forest plot for Criterion Validity (r) by Candidates' Background
Figure 4.12: Forest plot for Criterion Validity (r) by Comparison's Test
Figure 4.13: Forest plot for Construct Validity (r)
Figure 4.14: Forest plot for Construct Validity (r) by Number of Stations
Figure 4.15: Forest plot for Construct Validity (r) by Station's Duration
Figure 4.16: Forest plot for Construct Validity (r) by Scoring Methods
Figure 4.17: Forest plot for Construct Validity (r) by Number of Raters
Figure 4.18: Forest plot for Construct Validity (r) by Examiners
Figure 4.19: Forest plot for Construct Validity (r) by Exam's Content
Figure 4.20: Forest plot for Construct Validity (r) by Exam's Context
Figure 4.21: Forest plot for Construct Validity (r) by Candidates' Background
Figure 4.22: Forest plot for Construct Validity (r) by Comparison's Test
Figure 4.23: Forest plot for Predictive Validity (r)
Figure 4.24: Correlation analysis between OSCE's cost and other factors that may affect feasibility

CHAPTER ONE: INTRODUCTION

Overview and History of the OSCE

Clinical certifying and licensing examinations are implemented to assess medical students' and physicians' clinical competency (1). The Objective Structured Clinical Examination (OSCE) provides one of the methods for assessing clinical skill competencies in a well-organized and structured way, with particular attention paid to maintaining objectivity across examinees (2). Since the first publication by Ronald Harden et al. in 1975 (3), OSCEs have been used primarily for formative assessment purposes. In the last decade, however, OSCEs have become a key component of several high-stakes certification and licensure programs. The last few decades have seen major changes in assessment measures and methods in medical education. Traditional written tests have been replaced by more comprehensive competency-based evaluations that involve observation of the candidate, including the use of In-Training Evaluation Reports (ITERs), Direct Observation of Procedural Skills (DOPS) checklists, and standardized patient evaluation and simulation scenarios (4). Currently, these evaluation methods are used for both summative and formative decisions about physician competency and have been included as part of the licensure assessment process. As more attention is given to competency-based education frameworks, corresponding evaluation and assessment strategies have gained similar importance in the literature. This has been, in large part, reinforced by societal changes that demand more accountability to ensure that patient rights and safety are being met. Subsequently, this has put the traditional written and oral examination assessment methods under question and has called for testing that reflects realistic patient encounters in medical education. In response to the need to improve the evaluation of medical students' clinical performance, the OSCE was introduced for the first time in 1975 by Ronald Harden at the University of Dundee in Scotland (3).

What is an OSCE?

The Objective Structured Clinical Examination (OSCE) is an interactive, competency-based examination often used in the health care professions. It is designed to test knowledge as well as clinical skills such as history taking, physical examination, communication, patient education, medical/surgical procedures, order writing and test interpretation. The OSCE usually comprises a circuit of multiple stations at which each examinee is presented with a series of standardized clinical problems or presentations that are often portrayed by standardized patients (SPs) such as professional actors, faculty members or medical students. At each station or encounter, independent examiners and/or SPs evaluate the examinees' competencies as a function of their clinical knowledge and skills. Typically, the evaluation is premised on an initial set of instructions (e.g., posted outside of the examination room) that are provided to the examinee before each 5- to 15-minute patient encounter begins at any one particular OSCE station (5). Clinical performance is based on the examinee's ability to demonstrate competencies rated by the examiner using a station-specific checklist and/or Likert-type global rating scale, where the total score dictates a pass or fail using a minimum performance level cut-off score determined a priori. The number of stations, their duration, and the number of examiners included may vary from one OSCE to another depending on many factors, such as the purpose of the OSCE (i.e., certifying exam vs. teaching purposes), the medical discipline, and available resources (5).
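To make the station-level scoring logic concrete, the following is a minimal sketch, not taken from any particular OSCE described in this thesis, of how a checklist and a global rating might be combined into a station score and compared against an a priori minimum performance level. The equal weighting and the cut-off value are illustrative assumptions only.

```python
# Illustrative only: combine a station's checklist and global rating into a
# percentage score and apply an a priori minimum performance level (MPL).
def station_score(checklist_done, global_rating, max_rating=5):
    """checklist_done: list of booleans, one per checklist item.
    global_rating: examiner's Likert-type rating, from 1 to max_rating."""
    checklist_pct = 100.0 * sum(checklist_done) / len(checklist_done)
    global_pct = 100.0 * global_rating / max_rating
    # Assumed equal weighting of checklist and global rating (illustrative).
    return 0.5 * checklist_pct + 0.5 * global_pct

mpl_cutoff = 60.0  # hypothetical a priori cut-off score for this station
score = station_score([True, True, False, True, True, True], global_rating=4)
print(f"Station score: {score:.1f}% -> {'pass' if score >= mpl_cutoff else 'fail'}")
```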

Why or why not use an OSCE?

The OSCE has many advantages over traditional assessment strategies or methods. It has a unique structure in which trainees are tested in well-controlled clinical scenarios where more than one competency can be assessed simultaneously. In fact, each competency can be tested at more than one station. This represents a major advantage of the OSCE because there is no limit to the variety of clinical scenarios that can be created to test any number of competencies. These clinical scenarios are usually portrayed by SPs, who have been shown to reliably mimic actual patient encounters (6). The simulated clinical scenarios involve more realistic content, context and procedures, which makes the OSCE superior to traditional static assessment approaches. The use of SPs provides a standardized environment that allows the assessment of trainees' abilities to be nearly equivalent to real-life encounters, depending on the complexity of the clinical presentation. However, actual patients may be a better choice for demonstrating real physical findings such as splenomegaly. Therefore, some schools have used a combination of both SPs and real patients in the same OSCE, which adds more validity and credibility to their final examinee performance results (7-10). The clinical scenarios used in an OSCE cannot be duplicated by the written cases commonly used in traditional assessment tests. Moreover, these clinical scenarios allow assessment of non-cognitive clinical skills that are usually difficult to assess, such as interpersonal skills (IPS). In fact, a positive correlation has been demonstrated between trainees' IPS and clinical performance (11). Further, trainees with poor IPS were easily identified. Therefore, it has been shown that these non-cognitive aspects of clinical competency can be reliably measured using the OSCE method of assessment. Although the OSCE is used primarily as an evaluation method, it also has an educational or teaching element.

By having trainees directly observed by faculty members, the OSCE sends a clear message to the trainee that basic clinical skills are an important part of being an effective practitioner. Furthermore, it is well known that faculty ratings of students' clinical performances are subject to variability. In fact, it has been demonstrated that faculty members are generally reluctant to criticize deficits in trainees' clinical competency (12). Alternatively, the OSCE is designed to be more objective and, therefore, provides more constructive and accurate feedback on students' performance. Such feedback is particularly valuable for residency training programs and program directors because it identifies weak trainees and, accordingly, areas of weakness within the curriculum (13, 14). Compared with common traditional assessment tools, the OSCE is well known to be more expensive and labour-intensive (15, 16). The time required to prepare and conduct an OSCE is greater than that required for traditional examinations (3, 17). Scoring is usually manual, which represents another disadvantage of conducting OSCEs: many studies have shown that manual scoring is time-consuming and prone to mistakes (18). Therefore, computer scoring with optical scanning has been suggested (19); however, this increases the cost further. Ideally, all candidates should go through similar stations. In reality, however, it is difficult to find similar real patients with similar clinical findings, which is why SPs are used in OSCEs. Training SPs, though, is also expensive and labour-intensive (20), and having children as SPs is almost impossible (21). In summary, the OSCE is a well-established approach that is superior to traditional assessment methods in many respects, making it the new gold standard for evaluating the clinical competency of students in the health care professions. However, its practicality is a drawback that limits the use of OSCEs.

Development of an OSCE

The process of developing an OSCE consists of seven main steps, each of which is presented in the following paragraphs (3, 22, 23).

1. Establishment of a table of specifications or blueprint. The initial step of the process is to define the content domain and the skills to be assessed, based on learner outcome expectations or competency standards of practice appropriate to the level of the examinees. Two main parameters, the level (e.g., 3rd-year medical student vs. 2nd-year resident) and the range (e.g., generic vs. discipline specific), define each skill or domain of measurement. While the level defines the complexity of the expected performance, the range describes the scope and specificity of the expected performance (23).

2. OSCE station format. There are two main types of stations: the long station and the couplet station (22, 24). Determination of station type usually depends on the task being assessed. For example, long stations may be used for conducting a thorough history taking and physical examination, patient counseling and education, or common medical/surgical procedures. On the other hand, a couplet station may be used for an initial patient encounter (e.g., history taking or physical examination) followed by a post-encounter probe separate from the patient encounter (e.g., interpretation of test results or recommendations for further patient management). The optimal number of stations varies depending on the nature of the assessment. While more stations may improve the psychometric properties, they also increase the cost of administration. In general, the number of stations is a function of what needs to be measured (i.e., the table of specifications) and the psychometric rigor (i.e., reliability and validity) needed to ensure adequate assessment across the stations identified (25).

3. Case design and development. This is the process of translating the blueprint into standardized clinical presentations or scenarios that are appropriate for the skill or task being assessed and that can be portrayed appropriately by SPs. OSCE cases are usually written by specialty experts or subject-matter faculty members who draw on practice experience. Instructions are developed for the SPs on how to depict the clinical presentation, along with a comprehensive checklist that outlines the knowledge, skills and/or attitudes expected to be elicited by the examinees, and any additional expectations regarding the need for the examiner to draw out a final differential diagnosis, investigations, and/or short-/long-term management plans.

4. Case Review. The review of a case is a crucial step in which an expert panel typically assesses three main aspects of each task:

- Criticality of the tasks and processes described (e.g., "Do they have to be able to do this?")
- Frequency of the tasks and processes described (e.g., "Will they be required to be able to do this?")
- Relevancy of the tasks and processes described (e.g., "Are they generally expected to be able to do this?")

5. Standard Setting.

At this stage, a recognized standard-setting method is used to ensure that almost all aspects of the exam are judged on their importance (relevancy) or difficulty. Several standard-setting methods have been used in OSCEs; two commonly used ones are the Angoff and Ebel methods. Using the Ebel approach, for example, judges determine a pass/fail cut-score for each item; these are averaged across all items to establish a minimum performance level (MPL) summative score that must be met by all candidates to pass any one specific OSCE station. In part, the MPLs set for an OSCE are a function of the exam content, length, environment and scoring procedures. In the Angoff approach, on the other hand, the judges first estimate the probability that a borderline candidate will pass each item in the exam. Then, based on these estimates, they determine the pass/fail mark (26).

6. Piloting. This stage involves an initial piloting or feasibility study in which the proposed OSCE is tested on a small group of candidates in order to evaluate its logistics and gather information prior to its application in a larger administration. In this way, the quality and efficiency of the OSCE may be improved and deficiencies in its design and delivery may be revealed, so that they can be addressed before further time and resources are expended.

7. Administration of the OSCE. This is the final stage, in which the piloted OSCE is administered to the final, targeted group of examinees for its primary summative (e.g., certifying exam) or formative (e.g., teaching) purpose.
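As an illustration of the Angoff-style arithmetic described in step 5, the following sketch averages each judge's estimated probability that a borderline candidate would complete each checklist item, then averages across items to obtain a station cut score. The judge ratings and number of items are hypothetical, not data from this thesis.

```python
# Hypothetical Angoff-style standard setting for one OSCE station.
# Each row is one judge; each entry is the judge's estimated probability
# that a *borderline* candidate would complete that checklist item.
judge_estimates = [
    [0.7, 0.5, 0.8, 0.6, 0.4],  # judge 1
    [0.6, 0.6, 0.9, 0.5, 0.5],  # judge 2
    [0.8, 0.4, 0.7, 0.6, 0.3],  # judge 3
]

n_items = len(judge_estimates[0])

# Average the judges' estimates for each item, then average across items to
# obtain the station's minimum performance level (MPL).
item_means = [
    sum(judge[i] for judge in judge_estimates) / len(judge_estimates)
    for i in range(n_items)
]
mpl_fraction = sum(item_means) / n_items  # expected borderline score, 0..1

print(f"Per-item borderline estimates: {[round(m, 2) for m in item_means]}")
print(f"Station cut score: {100 * mpl_fraction:.1f}% of checklist items")
```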

Evolution of the OSCE

The OSCE was initially introduced as a measurement approach for the assessment of clinical competency in response to social and academic concerns regarding deficiencies of testing in medical school and residency in-training settings that relied on traditional written or oral examinations (1, 2). The alternative, direct observation of skills, in which students and residents were supervised by preceptors or attending clinicians on the ward, was acknowledged as being too subjective for making comparisons between students (27-29). Moreover, each patient encounter and the skills demonstrated varied considerably (27). Following the development of the OSCE, competency-based education (CBE) became the new curriculum and assessment framework for medical education programs (30). Worldwide, CBE has been adopted and presented in a variety of different competency frameworks (31, 32). In the USA, for instance, the most commonly used CBE model was largely developed by the Accreditation Council for Graduate Medical Education (ACGME) (33). This model defines CBE in terms of six related domains: medical knowledge, patient care, professionalism, communication and interpersonal skills, practice-based learning and improvement, and systems-based practice. In Canada, on the other hand, CBE has taken the form of the CanMEDS roles framework (Figure 1.1), which defines key competencies in seven related and overlapping domains: Medical Expert, Communicator, Collaborator, Manager, Health Advocate, Scholar, and Professional (34). Over the last two decades, CBE has gained recognition and popularity, and has become a standard framework for medical education worldwide. Currently, CBE has been incorporated within many training programs in North America, Europe and Australia (35, 36). CBE has addressed the social and academic concerns for accountability and patient safety because it implies a developmental progression of the trainee from novice to competent physician (30-32).

However, the evaluation of this transitional process needs an assessment method that can be used for both formative and summative purposes. Since the OSCE was developed specifically to evaluate clinical competencies, it has become the assessment method of choice for CBE and, therefore, the standard and most objective method of assessing clinical competence (4, 37).

Figure 1.1: The CanMEDS role framework (18).

Since the original description of the OSCE, the literature and research have evolved to address its format and psychometric properties, primarily at the undergraduate level (38-40). Recently, however, more research has been completed investigating the use of OSCEs at the post-graduate level (16, 41). Nonetheless, the role and value of the OSCE have been shown to be effective in a variety of disciplines at both the undergraduate and post-graduate levels (21, 38, 42-44).

The OSCE has emerged as the standard assessment method for measuring examinees' clinical competencies. Initially, it was used mostly at individual institutions for the assessment of students' and residents' clinical skills for both formative and summative purposes (16, 45). More recently, however, it has been implemented across multiple institutions and has been used on a national scale for certification and licensure (46). In fact, four major North American testing bodies (the Educational Commission for Foreign Medical Graduates, the National Board of Medical Examiners, the Medical Council of Canada, and La Corporation Professionnelle des Médecins du Québec) have already implemented the OSCE as a standard assessment process in their examinations.

Rationale for the Study

The main purpose of this study is to investigate the reliability, validity and feasibility of the OSCE in the context of measuring clinical competence, through the method of meta-analysis (i.e., a method of combining the existing published research). The literature tends to support the use of the OSCE as a method to improve the state of the art in evaluating clinical competence at the level of post-graduate training programs and licensing examinations. However, there has not been a comprehensive, empirical study that examines the use of the OSCE as a valid, reliable and feasible assessment method to measure clinical competencies in medical education.

CHAPTER TWO: LITERATURE REVIEW

The Role of the OSCE in High-Stakes Decision Making

In the past, several traditional assessment tools were used in high-stakes decision examinations. However, many of these tools were paper-based, did not assess skills, or had poor psychometric properties (4). Moreover, there has been an increased social demand for high-quality services as well as for patient safety and rights (5). Therefore, medical education has adopted a competency-based framework that has been translated into many areas, including curriculum, training and assessment (30). Being primarily competency-based, the OSCE's use for high-stakes decision-making has provided a great benefit to trainees, institutions and society at large. For the same reason, however, care must be taken to achieve high levels of validity and reliability in the OSCE process while maintaining its feasibility (5).

1) Checklists vs. Global Rating Scales

The two main and widely used OSCE scoring methods are the performance checklist (task specific) and the global rating scale (general competency) (47). The two methods differ in both their format (number of items or scales) and their application (specific or general). The OSCE performance checklist score is based on whether specific clinical item tasks have been completed or not and, accordingly, produces a sum total score of the performed items. The global rating scales, on the other hand, are usually based on anchored global ratings of general competencies that may be knowledge-, skill- or attitude-related (e.g., demonstrates compassion and empathy toward the needs of the patient and his/her family); accordingly, they produce a global rating score for each general competency scale that can be summed to create a total score (48, 49). Checklist items and global rating scales can be completed by faculty member examiners, or they can also be designed for SPs if patient feedback on performance is required (47).

Although performance checklists are more commonly used in OSCEs, they may be considered inferior to global rating scales when it comes to assessing a skilled trainee who uses fast pattern recognition instead of lengthy questioning or physical examination. In fact, when scored by experts, global rating scales have been shown to be superior and to have higher inter-station reliability (48, 49). Regehr compared the two approaches in the context of OSCE exams and concluded: "Global rating scales scored by experts showed higher inter-station reliability, better construct validity, and better concurrent validity than did checklists. Further the presence of checklists did not improve the reliability or validity of the global rating scale over that of the global rating alone. These results suggest that global rating scales administered by experts are a more appropriate summative measure when assessing candidates on performance based assessment" (48). Nonetheless, there has always been concern about the degree of inter-rater reliability regardless of the scoring method. Despite intensive rater training and experience, research has identified four common rater errors: leniency, inconsistency, the halo effect, and restriction of range (50).

1) Leniency describes a rater who always rates candidates higher than their true performance. The opposite is called a hard or severe rater. However, the term leniency error is generally used to describe both raters because they produce a similar effect on the interpretation of ratings (51).

2) Inconsistency describes a rater who tends to apply the rating scale in a way that is inconsistent with the standard manner. This is usually due to the rater's misunderstanding of the rating criteria. Inconsistency is also called the randomness effect because it produces more randomness in ratings, i.e., more random variability than expected (52).

3) The Halo Effect describes a rater whose rating of a candidate on a particular trait influences his or her subsequent ratings of the same candidate on other traits. It usually results when the rater fails to discriminate between different aspects of the candidate's behavior. The halo effect is a threat to internal reliability because it produces inappropriately similar ratings across items (53).

4) Restriction of Range describes a rater whose ratings are restricted around one point of the scale, i.e., the midpoint or one extreme of the scale. Raters who exhibit a restriction-of-range problem usually do not use other parts of the scale. The best example of restriction of range is a rater who tends to restrict his or her ratings around the midpoint of the scale and, therefore, rates all candidates as average. By doing so, the variability of the ratings is limited, and the ratings fail to discriminate between good and poor performers (51).

2) Setting the Standards for OSCE Cutoff Scores

Setting the pass/fail cutoff score represents a major obstacle to the success of any OSCE-based assessment. The two commonly used standards for setting the cutoff score are the norm-referenced (based on relative standards) and the criterion-referenced (based on absolute standards) methods (54).

Norm-Reference

In this approach, a relative minimum cut-off is set for the pass mark in advance based on the performance of the test takers as a group (55). For example, a medical school may accept an overall 75% average as the standard pass mark for the OSCE, or allow only the top 40 performers to move on to the next level. A major drawback of this approach is that outstanding or mastery performance is not assessed. Therefore, it would be more accurate if examinees demonstrated a minimum performance level (based on mastery of a pre-determined set of competencies reflected across stations) and this mastery-of-competency level were considered the minimum cut-off score to pass the OSCE (55, 56).

Criterion-Reference

For the OSCE, the Angoff and Ebel methods are the two commonly used approaches to set absolute or mastery-level standards. In both the Angoff and Ebel standard-setting methods, the pass marks are determined based on estimates made by an expert panel familiar with the purpose of the test and the level of the learners. These estimates reflect the probability that a borderline examinee will pass each item in the test (57). While this minimum performance level (MPL) setting is somewhat hypothetical, the resulting performance of the examinees is reviewed to ensure accuracy is achieved. Nevertheless, both methods can be labour-intensive and time-consuming, as they may require as many as 10 examiners to obtain reproducible results (26). Moreover, different schools may set different pass marks as their standard (58). In the borderline approach, the examinee's overall performance is scored directly by the station's examiner using a performance checklist and/or global rating scales (GRS). These produce four levels of performance, identified as fail, borderline, pass, and distinction. For each station, the pass mark is defined by the mean score of the examinees who were rated as borderline candidates. For each examinee, the overall pass mark is the sum of his/her means across all stations (59). The borderline approach, initially formulated by the Medical Council of Canada, has recently become the method of choice for setting pass marks. Unlike the Angoff or Ebel methods, where the variability in the judges' opinions sets the pass cutoff scores, in the borderline approach the checklist and global scores are collated and statistically regressed against each other to derive the cut score. Further, the borderline approach is less time-consuming and is based on actual examinee performances rather than on hypothetical a priori judgments of pass/fail cutoffs.
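As a concrete illustration of the borderline-regression idea just described (checklist and global scores regressed against each other to derive a cut score), here is a minimal sketch using made-up station data; the scores, the 1-to-5 global scale with 2 as the "borderline" anchor, and the simple least-squares fit are illustrative assumptions rather than a procedure reported in the studies reviewed here.

```python
# Hypothetical borderline-regression standard setting for one station.
# Global ratings on a 1-5 scale (2 = "borderline"); checklist scores in %.
global_ratings = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]
checklist_pct  = [35, 48, 52, 60, 63, 66, 72, 75, 84, 88]

n = len(global_ratings)
mean_x = sum(global_ratings) / n
mean_y = sum(checklist_pct) / n

# Ordinary least-squares slope and intercept for checklist % on global rating.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(global_ratings, checklist_pct))
         / sum((x - mean_x) ** 2 for x in global_ratings))
intercept = mean_y - slope * mean_x

borderline_rating = 2  # assumed scale anchor for the borderline group
cut_score = intercept + slope * borderline_rating
print(f"Borderline-regression cut score: {cut_score:.1f}% on the checklist")
```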

Objectivity

Like any SP-based testing, the OSCE's objectivity depends on the standardization of the process and the selection of the scoring methods. In other words, it rests on the skills of the expert panel who prepare the OSCE station scenarios and on the appropriateness of the checklists and/or global ratings designed to measure examinees' performance (3). Therefore, the reliability of an assessment tool is indirectly used to assess the objectivity of that particular tool. In order to improve objectivity, considerable effort must be made to develop detailed checklists and comprehensive global rating scales. However, this has unfortunately created another problem known as "trivialization", in which each task is broken down into many small components that may or may not be clinically relevant. Although it has been shown that global rating scores yield higher inter-rater reliability than checklists (48), global ratings are by and large broad-based assessments of an examinee's proficiency and are best suited to measures of general skills or competencies. Therefore, higher objectivity does not necessarily mean high reliability (3).

The OSCE's Psychometric Qualities

The idea that the OSCE is superior to other traditional methods in the assessment of competencies has been challenged several times. For example, in a review of selected publications, Barman challenged the validity and reliability of the OSCE format (60). Norman also challenged the psychometric qualities of the OSCE evaluation and showed that evidence to support its superiority is lacking. Further, compared with traditional methods, an OSCE was found to be more expensive and resource intensive (61).

Reliability

In general, reliability refers to the "consistency" or "repeatability" of measurements. In a high-stakes decision examination, however, reliability does not simply mean reproducibility of scores; rather, it reflects how confident we are that these scores predict future performance (i.e., how confident we are that a particular examinee qualifies as an independent physician) (62). Therefore, reliability is not simply an intrinsic characteristic of the OSCE; rather, it is a property of the inferences we draw from the results of the OSCE (63). Further, reliability is generally content specific. Thus, if an examinee does well on a particular case, it is difficult to predict that the same examinee will do well on a different case (62, 63). In the context of the OSCE, there are two types of reliability to consider: inter-station reliability (internal consistency) and inter-rater reliability.

Inter-station reliability or internal consistency

Internal consistency refers to the reproducibility of a trainee's performance over items or stations (62). Although less-than-ideal reliability has been reported, acceptable internal consistency has been achieved in many studies. While a high reliability coefficient (Cronbach's alpha, α = 0.95) was reported in one study (64), no correlation between individual examinees' performances across stations was reported in another (65). In the ECFMG studies, the reliability coefficient for the Clinical Skills Assessment (CSA) was α = 0.64 (66). While some consider this reliability coefficient adequate, others consider it inadequate when compared with written examinations, which have been shown to have internal consistency higher than 0.8 (41, 67). Thus, over the years, wide variation in the OSCE's reliability has been reported. This variability has been shown to be due to several sources of error; the reliability may therefore be improved by addressing these sources of error:

1) Format and scoring issues. In a systematic review, three factors were identified that improve an OSCE's reliability: larger numbers of stations, more raters, and good standardization of patients (68).

2) The number of stations. Reliability is influenced by the number of stations, which in turn depends on several factors, including the purpose of the OSCE and the level of the examinees. One study reported a reliability coefficient of α = 0.69 when 5 stations were used for postgraduate pediatric residents (69). However, the same reliability coefficient was reported in another study when 10 stations were used for medical students (70). Nonetheless, in another study, a reliability coefficient of α = 0.80 was achieved when 34 stations were used for both medical students and residents (71). In short, to maintain an acceptable degree of reliability, the caveat is to ensure that a large enough number of stations is included, keeping the purpose of the OSCE in context.

3) The length of the examination. This depends on the skills being assessed. For example, communication skills achieved a reliability coefficient higher than α = 0.70 with only 2 hours of examination. On the other hand, 6 hours of examination were needed to achieve the same reliability coefficient when assessing data gathering (72). In general, 4 to 8 hours of testing were needed to achieve reliability coefficients of α = 0.85 to 0.90 (73). However, when the descriptions of these stations are examined critically, they are found to focus primarily on one clinical skill or task. Hence, one can conclude that reliability increases with homogeneity of the tasks measured at different stations.

4) Trainee's physical and mental status. Many factors have been identified, including test anxiety, fatigue, and memory lapses (73, 74).

Nonetheless, even when efforts were made to address these sources of error, reliability coefficients generally ranged between α = 0.41 and 0.88 (63, 71). For licensure purposes, an internal consistency of α = 0.80 or greater has been considered acceptable. For example, in the Educational Commission for Foreign Medical Graduates (ECFMG) examination, an internal consistency of α = 0.81 was reported by Boulet (75) and 0.85 by Sutnick (76). Similarly, Reznick reported an internal consistency of α = 0.80 for Part II of the Medical Council of Canada Qualifying Examination (MCCQE II) (72).

Inter-rater reliability

Inter-rater reliability refers to the degree of consistency between different raters/examiners in scoring the same trainee's performance, i.e., how well raters agree when scoring an individual trainee's performance (63). Although inter-rater reliability varies between studies, agreement within studies is generally good. In one study evaluating the introduction of an OSCE into the Israeli Board Examination in Anesthesia, inter-rater reliability ranged from α = 0.75 to 0.89 (with a mean of α = 0.82) (77). In a series of three studies evaluating the surgical skills of obstetrics and gynecology residents, Goff reported inter-rater reliability coefficients of α = 0.87 to 0.95 for global rating scales and α = 0.78 to 0.96 for checklist scores (78-80). In another study, however, an OSCE used to assess the communication and interpersonal skills (CIS) of radiology residents reported fair to moderate inter-rater reliability coefficients of α = 0.30 to 0.50 (81).
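For reference, the following is a minimal sketch, with made-up scores, of how the inter-station reliability coefficient (Cronbach's alpha) reported throughout this section is computed from a candidates-by-stations score matrix; the data and the five-station design are illustrative assumptions only. Note that the number of stations enters the formula directly, which is one reason adding stations tends to raise the coefficient.

```python
from statistics import pvariance

# Hypothetical candidates-by-stations score matrix (rows = candidates).
scores = [
    [62, 70, 55, 66, 72],
    [58, 64, 60, 61, 67],
    [75, 80, 70, 78, 82],
    [50, 55, 48, 52, 58],
    [68, 72, 66, 70, 74],
]

k = len(scores[0])  # number of stations
station_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
total_var = pvariance([sum(row) for row in scores])

# Cronbach's alpha: k/(k-1) * (1 - sum of station variances / variance of totals)
alpha = (k / (k - 1)) * (1 - sum(station_vars) / total_var)
print(f"Inter-station reliability (Cronbach's alpha) = {alpha:.2f}")
```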

Like any other examination, the goal of the OSCE is to generalize from the trainee's performance on the specific skills sampled to the larger domain or competency from which those skills are drawn. Depending on the nature and size of the sample, these evaluations can be more or less credible (valid) and more or less consistent (reliable). Accordingly, the sampling process (content validity) is one of the most important parts of any test design and development process, because it reflects the skills and domains to be assessed. For the OSCE, the stations, SPs and raters are the three main factors that influence the reliability and validity of the assessment process and, in essence, are samples from the larger domains that might have been used in the OSCE process. Thus, the OSCE would be considered reliable if the trainees' performance were consistent across different but similar samples of these three factors. The reliability coefficient, then, reflects the scores derived from these similar samples. Further, for a trainee's performance to be consistent, the test sample must contain an adequate number of stations, and there must be consistency in the SPs' and raters' involvement in the testing process. Therefore, the OSCE's reliability is influenced by variation in these three factors, as well as by other sources of measurement error.

Validity

In general, validity refers to "accuracy"; in statistics, the validity of a measurement tool refers to the degree to which the tool measures what it is intended to measure (82). In high-stakes decisions, highly valid results are critical: if the assessment process has low validity, the decision made is doubtful and may be compromised. In fact, validity is not a characteristic of the test itself but of the inferences made from its scores. Therefore, depending on its purpose and the inferences made from its scores, each test can have different levels of validity (82). Over the last few decades, validity has gone through an extensive evolutionary process that has refined many aspects of its concepts and utilization. The next few sections highlight some significant parts of this evolution.

Traditional Concept of Validity

In the classical framework, validity has three main types: content validity, construct validity and criterion validity (which has two subtypes, concurrent and predictive criterion validity) (83). In general, this model has been a convenient way to understand and organize validity. In practice, however, there are no sharp distinctions between these types of validity; for instance, evidence that may be identified with one particular validity type may also be relevant to the other two (82).

Content validity

Content validity refers to the extent to which an assessment measure accurately represents aspects of the construct being tested. Therefore, if the construct and the test's elements are not related, the test is potentially measuring something else, creating an invalid assessment of the domain or construct of interest (84). Content validation also tries to estimate how much each element adds to or subtracts from a test. In a written exam, for instance, an expert panel may review and rate each question on its relevance and importance in measuring the examinee's performance in a particular specialty or content area. The overall results and ratings are then statistically analyzed and, accordingly, the questions are modified to improve the test's validity. As with any other measurement tool, the OSCE's content validity is critical, especially when the sample represents only a small portion of the domain being tested. Several studies evaluating the OSCE's validity have approached and assured content validity using different methods, including the use of a test blueprint, the modified Delphi technique, expert panels, and surveys and questionnaires completed by examiners and/or examinees (68, 77, 85, 86). In fact, by using these different methods, early studies claim that high content validity is obtainable in the OSCE format (3, 87).

Given the definition of content validity, one may question the content validity of those studies with fewer stations in terms of their ability to represent every important element and measure of the examinee's clinical competency (68, 88).

Construct validity

In education, the construct validity of a test refers to the extent to which the test is able to differentiate between examinees at different points in their education according to their performance (84). In the context of the OSCE, several studies have addressed construct validity by comparing examinees' OSCE scores with their level of training. The majority of these studies have shown that the OSCE format has high construct validity. For example, in a study evaluating the technical skills of general surgical residents, Reznick (89) reported a significant effect of level of training for both the checklist scores (F(3,44) = 20.08, P < 0.001) and the GRS (F(3,44) = 24.63, P < 0.001). This effect accounted for 62.7% of the variance in the GRS and 57.8% in the checklist scores. In another study evaluating the surgical skills of Post-Graduate Year (PGY) 1-4 Obstetrics and Gynecology residents, Neilsen (64) reported significant differences on the checklist, global surgical skills, and pass/fail score sheets by residency level. The mean global score was 40% for PGY-1, 64% for PGY-2, 80% for PGY-3, and 90% for PGY-4. The mean checklist score was 46% for PGY-1, 69% for PGY-2, 84% for PGY-3, and 92% for PGY-4. The passing rate was 0% for PGY-1, 0% for PGY-2, 60% for PGY-3, and 100% for PGY-4. Simply put, as expected, as trainees progress in their training, their OSCE performance (mean global score, mean checklist score, and passing rate) improves. These studies therefore demonstrate that the OSCE is able to differentiate between examinees at different points in their education, i.e., that it has good construct validity. Further, in a study evaluating the CanMEDS roles as part of a Neonatal-Perinatal Medicine subspecialty in-training exam, Jefferies reported statistically significant differences between the two levels of training for each of the CanMEDS Roles, with more senior trainees achieving higher scores (8).

Using Cohen's d, the effect size was in the medium range (d = 0.50 to 0.79) for the Manager and Professional roles and large (d > 0.80) for the remaining five roles. Second-year candidates achieved significantly higher average checklist scores (M = 60.2% ± 6.9 vs. 70.1% ± 4.0, d = 1.76) and average examiner overall global scores (M = 54.3% ± 10.2 vs. 73.4% ± 8.1, p = 0.001, d = 2.08) than first-year candidates. Other small-scale studies have observed similar trends (10, 78-80, 87). In another study evaluating the CanMEDS roles, however, there were no significant differences in the mean checklist scores for the different CanMEDS roles (p > 0.05) between the two levels of training (90). Moreover, Hillard found that the OSCE did not demonstrate statistically significant differences among pediatric residents in each of the four years of training (68).
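For reference, here is a minimal sketch, with made-up score lists, of the Cohen's d effect-size calculation used in construct-validity comparisons like those above, based on the pooled standard deviation of two training-level groups; the group sizes and scores are illustrative assumptions.

```python
from statistics import mean, stdev

def cohens_d(group1, group2):
    """Effect size between two independent groups using the pooled SD."""
    n1, n2 = len(group1), len(group2)
    s1, s2 = stdev(group1), stdev(group2)
    pooled_sd = (((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)) ** 0.5
    return (mean(group2) - mean(group1)) / pooled_sd

# Hypothetical OSCE checklist scores (%) for junior vs. senior trainees.
junior = [55, 58, 60, 62, 59, 61, 57]
senior = [68, 72, 70, 75, 69, 73, 71]
print(f"Cohen's d = {cohens_d(junior, senior):.2f}")  # d > 0.80 would be 'large'
```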

33 reports (ITERs), clinical ratings, self rating, undergraduate GPA, and other SP-based tests. However, unlike these traditional tests, the OSCE evaluates more unique clinical competencies in more standardized settings (63). Therefore, it is not surprising to find that studies comparing the OSCE to other measures of clinical competence have produced mixed results with correlation coefficients ranging from r = 0.00 to For instance, in one study evaluating the clinical performance of family residents the OSCE was reported to be more inferior to a traditional written test (91). On the other hand, another study found OSCEs to be one of the most valid evaluation methods for clinical competence assessment (92). In residency programs, trainees clinical competency is usually evaluated by two common measurement methods: ITERs and in-training examination. Literature has shown that the OSCE is well correlated with in-training examinations. The correlation coefficients ranged from r = 0.37 to 0.71, with statistically significant P-values < 0.01 or less (69, 70, 93-96). Correspondingly, correlations of the OSCE with ITERs also showed statistical significance, but with a lower range of correlation coefficients; r = 0.39 to 0.57 with a P-values of 0.05 or less (40, 68-70, 93, 95). Therefore, by comparing the OSCE with ITERs and in-training examination (the criteria, already held to be valid), the OSCE is, therefore, a valid test to assess clinical competency in residency programs. In terms of evidence for the OSCE s predictive validity, the literature is minimal. Nonetheless, in a study using an OSCE to evaluate IMGs two discriminant analyses were used to provide evidence for the OSCE's predictive validity. These were promotion to the 3-month rotation, and pass/fail status on the rotation. The discriminant analysis showed a 100% correct classification rate in the pass/fail rotation results. In addition, the analysis of the promoted/not promoted results from the project administrators compared to the classification results from the 33

Several other studies have addressed the OSCE's predictive validity using residency supervisor ratings as the outcome measure (98-100). Across many of these studies, the analyses yielded similar results: the OSCEs showed small to moderate correlations with residency supervisor ratings, with correlation coefficients of r = 0.27 and above. In comparison, the residency supervisor ratings correlated poorly with more traditional methods such as multiple-choice medical licensing examination scores, with correlation coefficients ranging upward from near zero. All in all, the OSCE scores appear to have better predictive validity than other traditional assessment tools.

Modern Concept of Validity

For decades, the classical framework of validity provided a convenient way to understand and utilize validity evidence. However, many researchers have argued that this framework is fragmented and incomplete, particularly because it could not translate evidence of the value implications of score meaning into action (82). In their view, it also fails to take into account the social consequences of score use (82). Accordingly, in 1989 an alternative validity framework was introduced by Messick (101). This modern approach views validity as a unified concept in which all validity is considered construct validity. Since 1999, Messick's framework has been adopted as the standard validity framework by the American Educational Research Association (AERA), the American Psychological Association (APA), and the National Council on Measurement in Education (NCME) (102). In this modern framework, validity evidence is collected from five main sources, namely content, response process, internal structure, relations with other variables, and consequences (see below) (101). More recently, Kane has proposed the validity argument, in which validity evidence can be utilized to support or refute the proposed score interpretations and uses (103).

Content evidence: the processes taken to ensure that assessment content (including tasks, questions, and instructions) reflects the construct it is intended to measure (e.g., communication) (101, 102).

Response process evidence: analyses evaluating the actions and responses of both the examinees and the examiners, and how these actions and responses relate to the intended construct (101, 102).

Internal structure evidence: data evaluating the relationships between different assessment items and how well they align with the intended construct (101, 102).

Relations with other variables evidence: the statistical associations between assessment scores and another variable that has a specified theoretical relationship with them (101, 102).

Consequences evidence: the consequences and impact of the assessment itself, as well as the decisions and actions that result from it (101, 102).

Feasibility

An ideal OSCE is one that has been shown to be not only valid and reliable but also feasible. Regardless of its psychometric properties, it will not be practical to implement an OSCE if the resources are not available. These resources may vary according to the purpose and the context of the OSCE. For example, OSCEs evaluating surgical competence usually involve surgical simulation or mannequins, which may increase the cost and may not always be available to all institutions. In high-stakes decision-making situations such as licensure examinations, the OSCE may need to be conducted in multiple formats, locations and even languages. All these factors may make the OSCE impractical.

Though it is crucial, OSCE feasibility has not been sufficiently addressed or reported in the literature. In one European study evaluating the resources needed for an OSCE, several items were included in the calculation of the financial cost, including employment, status of staff involved, and the subject-matter and temporal dimensions of the task (104). The exam included 4 stations and 145 students. The results showed that the OSCE cost 86 per student, with a total exam cost of 12,486. When the cost was analyzed by exam content, it was 4,677 for preparation, 5,625 for implementation, and 2,166 for post-processing. In another study, 62 dental students were evaluated through 11 OSCE stations on the same day. The cost analysis included 120 hours for development, 130 hours for the examination itself, and 20 hours for the post-review process; in addition, 200 for materials was added. The results showed that this OSCE cost 181 per student, with a total cost of 11,200 (105). Given these results, the OSCE appears to be an expensive assessment method. In fact, when the OSCE involves simulation, the cost becomes higher, ranging from $53 to $1,080 per student (60, 106). One study reported a cost of $10 (USD) per hour of SP training or participation in an OSCE examination; however, the cost rises to $20 to $30 per hour if the SP has to undergo an invasive procedure such as a rectal examination (107). In pediatrics, it is nearly impossible to have children as real patients. Further, when normal children act as SPs, fatigue is a major concern, especially with prolonged testing periods. In such situations, larger numbers of children are needed; however, this may not always be feasible, or there may not even be enough child SPs available. Another major concern when children act as SPs is the potential psychological impact the scenario may have on them. In one study, a negative incident occurred when a 6-year-old SP overheard a discussion about death, a concept she had not yet experienced at her age (108).
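To make the cost-per-candidate arithmetic explicit, the following sketch totals hypothetical preparation, implementation and post-processing costs and divides by the number of examinees; the figures are placeholders and are not the values from the studies cited above.

```python
# Hypothetical OSCE cost breakdown (any currency); not data from the cited studies.
costs = {
    "preparation": 5000,      # blueprinting, case writing, SP training
    "implementation": 6000,   # examiners, SPs, venue on exam day
    "post_processing": 2500,  # scoring, analysis, feedback reports
}
n_candidates = 150

total_cost = sum(costs.values())
print(f"Total cost: {total_cost}")
print(f"Cost per candidate: {total_cost / n_candidates:.2f}")
```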

In the context of surgical skills assessment, where animal laboratories may be needed, the cost becomes even higher. In a study using pigs to assess the surgical skills of 24 residents, the cost of each animal laboratory was estimated to be approximately $1,500; this included the pigs, a veterinary technician, and surgical supplies. The total cost of the program was approximately $6,000 ($250 per resident), not including faculty time or data entry (79). Compared with traditional methods, the OSCE requires more time for preparation, implementation and post-administration processing (3, 17). In a study evaluating 117 students with 22 stations and 93 SPs, the total examination time was 527 personnel-hours (109). The cost per student for each OSCE hour was $15, compared with $10 per student per hour for the National Board of Medical Examiners tests (110). Further, some studies have considered the OSCE's reliability insufficient if the exam is less than 6 hours long (111). This may be impractical, especially when a large number of candidates or real patients are required for some of the station scenarios. The post-processing stage also adds to the high cost of the OSCE. Manual scoring of the OSCE, for example, is time-consuming and increases the probability of error (19). Because of the lengthy time needed for examiners to complete the scoring, some studies have reported difficulty in ensuring a well-timed OSCE administration (18). Thus, some authors have suggested computer scoring, in which optical scanners are used to read coded answers, to keep the OSCE administration on time and reduce the risk of scoring errors (18). For high-stakes decisions, a single OSCE that covers all the relevant clinical skills to be assessed for all candidates would be ideal. In reality, however, multiple forms (i.e., parallel stations), formats, locations, languages, and administrations over several days may be needed. These multiple administrations substantially increase the cost and stretch resources to the limit in terms of OSCE station development, trained examiners and patients/SPs, and appropriate infrastructure requirements (i.e., both facilities and administrative support).

limit in terms of OSCE station development, trained examiners and patients/SPs, and appropriate infrastructure requirements (i.e., both facilities and administrative support). Therefore, it is very difficult to administer a single OSCE to assess a large number of candidates at the same time (21), as is the case in high-stakes licensure or certification examinations. Finally, another major concern with high-stakes OSCE examinations that involve multiple administrations over time or at various locations is the very high potential for lapses in security or breaches of station content due to the risk of information being shared immediately after an administration. This makes the use of multiple formats/forms mandatory, but greatly increases the cost at the same time. In summary, the OSCE is a costly and labour-intensive assessment process. The wide variation in costs is related to the variability in the number of stations required, examinees tested, total time for administration, cost of SPs and examiners, and related fees for coordinators and facilities. Therefore, there are many drawbacks that need to be addressed in order to make the OSCE a more feasible and practical method for assessing clinical competency.
Examinees' perception of OSCE
Although it is a stressful experience, examinees generally express positive feedback after an OSCE examination. The majority report that the OSCE format is a fair and comprehensive assessment of their clinical competency, and they appreciate the faculty time and commitment to the process (112). After an OSCE, examinees report greater confidence and less anxiety about an upcoming clinical rotation (113). Many examinees state it has been a good educational experience that helped them identify their strengths and areas of weakness (112, 113). According to a subjective feedback questionnaire during the Israeli Board of Anesthesiology Examination, Berkenstadt reported that most (70-90%) participants found the difficulty level of the exam

stations reasonable to very easy, and a minority (< 10%) did not prefer this method of examination to a conventional oral exam. The realism (examinee's familiarity/comfort) was rated as high by examinees (80-90%) (77). In a study evaluating the consultative skills of respirology fellows, 91% of the candidates believed that the OSCE was useful in evaluating consultative skills. Further, 88% felt that the OSCE would be fairer than the traditional orals administered for the purpose of certification (85). In another study evaluating the clinical competence of second-year internal medicine residents, Dupras reported that 15% of residents felt the time limit was too short to perform an adequate cardiac examination, 22% felt the OSCE was stressful, 42% regarded the OSCE as a good learning experience, and 18% believed that the OSCE was a strong motivator for independent studying (114).
Research Question and Hypothesis
Is the OSCE a valid, reliable and feasible assessment method to measure clinical competencies in medical education? We hypothesize that the OSCE demonstrates sound psychometric characteristics that will improve the state of the art in evaluating clinical competencies at the level of post-graduate training and licensure examinations. Accordingly, the main objective of this study was to conduct a comprehensive meta-analysis of the existing published research on the reliability, validity and feasibility of the OSCE in the context of clinical competence assessment.

CHAPTER THREE: METHODS
Selection of studies
This review was conducted and reported in adherence to standards of quality for reporting meta-analyses, based on the MOOSE guidelines for the reporting of observational studies (115). In addition to MEDLINE (1970 to November 2012), the search also included the PsychINFO (1970 to November 2012), ERIC (1970 to November 2012) and EMBASE (1970 to November 2012) databases. Reference lists of the initial primary studies reviewed were also searched to identify other potential studies. Although the study did not involve human participants, ethical approval from the Conjoint Health Research Ethics Board (CHREB) was required and obtained for this thesis-based research project (ID E-24772). The selected search criteria included English-language peer-reviewed journal studies, using keywords including, but not restricted to: OSCE, reliability, validity, ITER and licensure (a summary of search terms is available in Appendix 1). Publication date was not specified as a criterion for inclusion; however, the introduction and use of OSCEs as a formal assessment method did not occur until 1975. Therefore, this meta-analysis covered the period from 1975, when OSCEs were introduced, to November 2012. This resulted in a total of 299 retrieved article abstracts. Two authors (IA and TD) independently identified the studies and critically appraised the abstracts for inclusion based on the pre-established inclusion/exclusion criteria used to guide the review of the titles and abstracts of all the articles identified. After discussion of any abstracts or full articles where there were issues regarding the extraction of data, 100% agreement between the two reviewers was achieved. It was expected and well known that OSCE reliability coefficients have been reported

using different statistical approaches that typically include Cronbach's alpha, Cohen's kappa statistic, and generalizability indices. However, alpha is the most commonly reported index of reliability. Therefore, only those studies reporting Cronbach's alpha as the OSCE's reliability index were included in the reliability meta-analysis. Nonetheless, generalisability coefficients computed across stations were also included as a comparison distribution. Generalisability coefficients were not meta-analyzed with Cronbach's alpha because the two coefficients treat differences in means as part of the error differently: while the generalisability coefficient considers differences in means as part of the error, Cronbach's alpha does not. This is because generalisability coefficients are concerned with the absolute standing of examinees, rather than the relative standing with which Cronbach's alpha is concerned (116). Moreover, the meaning of the alpha coefficient depends on whether it has been calculated across stations or across items. While an alpha computed across items estimates the consistency of a given skill within a station, an alpha computed across stations estimates the consistency as examinees move from one station to another (117). Similarly, it is common to find different statistical methods used to evaluate the construct and criterion-related validity of OSCEs, including the Pearson product-moment correlation coefficient (r), mean differences in performance and P values. However, Pearson's correlation (the strength and direction of the relationship between measures) is the most commonly reported index of validity. Therefore, only those studies reporting Pearson's correlation coefficients as the OSCE's validity index were included in the validity meta-analysis. However, the majority of studies that have evaluated construct validity have not reported the absolute value of the construct validity coefficient; rather, they just report whether the difference between the two candidate groups being compared is significant or not using a P value. Therefore, in order to run our analysis, we have converted

these P values to Pearson's correlation coefficient (r) using the Practical Meta-Analysis Effect Size Calculator (118). Out of the 299 articles collected initially, 188 articles were excluded because they were either duplicates or did not meet the stipulated criteria for inclusion (e.g., not peer-reviewed, review article, evaluated other allied health professionals). The remaining 109 studies were retrieved with full copies of the articles. A manual search of the reference lists was performed and an additional 14 articles were identified. Further review of the retrieved studies yielded a final 49 eligible articles that met all the components of our inclusion and exclusion criteria. Articles were excluded during the final review process because either: (1) psychometric properties (validity/reliability) were not assessed or reported; or (2) the study focused on the evaluation of an undergraduate population (Figure 3.1).
Inclusion and exclusion criteria for eligible studies
These criteria were formulated based on reviewing published articles in the field of SP-based evaluation and, in particular, evaluations using the OSCE. In accordance with the research question, to be included a study had to meet the following eligibility criteria:
1. Be a published peer-reviewed article in the English language.
2. Include any sample size for both examinees and examiners.
3. Use an OSCE or any other SP-based examination to evaluate clinical competence.
4. Evaluate examinees who were trainees [e.g., residents, fellows, or International Medical Graduates (IMGs)] or qualified physicians.
5. Present empirical findings on the reliability, validity or feasibility of the OSCE.
Studies were excluded if:

1. They assessed only non-physicians, allied health professionals or medical students.
2. Results were presented in a review format that may have had common theme or topic headings, but did not include empirical findings.

Figure 3.1: Selection of studies for the OSCE meta-analysis.

Coding protocol and data extraction
The coding protocol for extracting data was developed, in part, based on a careful review of the 49 eligible studies. Different outcome domains were established to investigate all possible independent variables that might have a relationship with the dependent variables. All the coded data were then incorporated into an Excel summary sheet for instant access and importing into statistical software programs (Appendix 2). The coding protocol included each study's title, author(s) name(s), year, source of publication, study design, sample size, examinees' level, intervention/assessment tool(s), number of stations, duration of stations, number of examiners per station, evaluated competence, scoring method, outcome measured, validity, reliability, and feasibility. All 49 articles were independently coded by two coders (IA and TD). After iterative reviews, 100% agreement on the data coded was achieved.
Statistical analysis
The statistical analysis was performed using Microsoft Office Excel (2010 edition), SPSS (version 20) for the reliability analysis, and the Stata (version 12) data analysis and statistical software program for the validity meta-analysis. The two reviewers also reviewed the data entries for accuracy before any of the meta-analyses were completed. For reliability, the meta-analysis was conducted based on the recommendations of Rodriguez and Maeda (116). These recommendations provide suggestions for conducting meta-analyses of coefficient alpha with the use of weighting based on a function of the precision of each coefficient alpha value. This precision of the alpha value is usually established from its sampling distribution. However, the level of precision will differ among different alpha values

because each alpha value comes from a different study with different characteristics, such as sample size. Therefore, all calculations were conducted using the alpha value, study sample size and number of stations (or number of items). The number of stations (or items) was used as a covariate. Also, as per Rodriguez and Maeda's recommendations, when there are multiple types of reliabilities being analyzed, or when the inferences being made extend beyond the included studies, it is more appropriate to run multivariate mixed-effects or random-effects analyses. Therefore, a random-effects model was used to calculate effect sizes, using a transformation of alpha and a weighting scheme that involves both the estimated sampling error of each study and the estimated random-effects variance component (116). The estimated population coefficient alphas for across-stations and across-items studies were calculated initially, followed by the moderator analyses. Moderators were modeled separately across stations and across items. The P-value was calculated using a t-test when a moderator had only two levels, and with a one-way ANOVA when there were three. A P-value of less than 0.05 was considered significant, rejecting the null hypothesis that there is no difference between the means of the groups. For the comparison between alpha and generalizability coefficients, an unweighted average was computed and the overall distributions of both coefficients were illustrated with boxplots (Figure 4.1). In addition, the joint distribution of effect size and number of stations was illustrated with a scatterplot using an unweighted regression line that related the number of stations to the alpha coefficient (Figure 4.2). For validity, one of the main concerns in the integration of results from different studies

is the diversity that can occur in the research designs and methods those studies used to assess examinees' competencies using an OSCE. The fixed-effect model assumes that a common true effect is shared among the included studies, and the common effect size is estimated by a summary effect. On the other hand, the random-effects model assumes that the true effect varies between the included studies, and the weighted average of these different effects is an estimate of the common effect size (119). In meta-analysis, any kind of variability among studies is known as heterogeneity. If there is no heterogeneity, the two models should agree on the effect size. However, if heterogeneity is present, the random-effects model should be preferred because it tends to give a wider confidence interval (i.e., a more conservative estimate) (120). Therefore, it is important to determine whether heterogeneity is present or not. Statistically, this is done with the chi-squared (Q) statistic, which is a weighted sum of squares on a standardized scale. It is usually reported with a P value, where the presence of heterogeneity is indicated by low P-values ( ). Inevitably, in any meta-analysis, statistical heterogeneity will be present (123). Therefore, testing for the presence of heterogeneity is not the ultimate focus; instead, quantifying it and assessing its impact on the results is more important. A useful statistic for quantifying heterogeneity is I², which assesses whether observed variability in effect size is due to true heterogeneity rather than sampling error (chance). It is calculated as I² = 100% x (Q - df)/Q, where Q is Cochran's heterogeneity statistic and df is the degrees of freedom (120). With 95% confidence intervals (CI), the results of the different studies and the overall effect are illustrated in a graph called a "forest plot" (120). In this meta-analysis, for validity, we calculated effect sizes using both fixed-effect and random-effects models. The untransformed effect-size estimates (i.e., Pearson's correlation

coefficient, r) were used in calculating the weighted mean effect size. In addition, forest plots with Cochran Q tests for heterogeneity of effect sizes were used to test for heterogeneity. Heterogeneity was considered significant if P values were < 0.05. Based on the assumption of a null hypothesis, however, the absence of a significant P value for Q does not by itself imply homogeneity, because it could reflect low power within studies rather than actual consistency. Therefore, a review of the actual dispersion of the studies on the forest plot becomes an important visual indicator of consistency between studies. The combined effect sizes were calculated irrespective of the number of trials included under each outcome in order to obtain a uniform effect estimate based on the domain of measurement.
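As an illustration of the quantities described above, the minimal sketch below pools hypothetical correlations under both a fixed-effect and a random-effects model and computes Cochran's Q and I². The (r, n) pairs are invented, the sampling variance of r uses the standard large-sample approximation (1 - r^2)^2 / (n - 1), and the between-study variance is estimated with the common DerSimonian-Laird estimator; this is a sketch of the general approach, not a reproduction of the Stata routines used in the thesis.

import numpy as np

# Hypothetical (r, n) pairs: one correlation and one sample size per study.
studies = [(0.34, 145), (0.52, 123), (0.23, 336), (0.63, 72)]
r = np.array([s[0] for s in studies])
n = np.array([s[1] for s in studies])

v = (1 - r**2) ** 2 / (n - 1)        # approximate sampling variance of each r
w = 1 / v                            # fixed-effect (inverse-variance) weights

r_fixed = np.sum(w * r) / np.sum(w)  # fixed-effect pooled correlation
Q = np.sum(w * (r - r_fixed) ** 2)   # Cochran's heterogeneity statistic
df = len(r) - 1
I2 = max(0.0, 100 * (Q - df) / Q)    # I^2 = 100% x (Q - df) / Q

# DerSimonian-Laird between-study variance, then random-effects pooling.
C = np.sum(w) - np.sum(w**2) / np.sum(w)
tau2 = max(0.0, (Q - df) / C)
w_re = 1 / (v + tau2)
r_random = np.sum(w_re * r) / np.sum(w_re)

print(round(r_fixed, 3), round(r_random, 3), round(Q, 2), round(I2, 1))

When heterogeneity is present, tau2 is greater than zero, the random-effects weights become more even across studies, and the confidence interval around the pooled r widens, which is the "more conservative estimate" referred to above.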

CHAPTER FOUR: RESULTS
The 49 studies included in the present meta-analysis are summarized in Table 4.1. The majority of these studies (86%) were completed in North America. The total sample size for the included studies is 61,796, with a combined total of 732 OSCE stations. A total of 43% of the studies were conducted in a High Stakes Decision (HSD) context, and the remaining 57% focused on investigating aspects of the OSCE method in a research context. Several areas of competency were assessed, including clinical skills, communication, data interpretation, knowledge, consultative skills, ethical decision, professionalism, evidence-based medicine, decision making, written skills and English proficiency. Information on specific demographic characteristics such as candidates' sex, age, or socioeconomic status was very rarely reported or even referred to in the identified studies.
Reliability
A total of 109 alpha values were coded, with a mean reliability coefficient of α = 0.70 (95% CI: ). The mean of the generalizability coefficients is Ep² = 0.73 (95% CI: ). The unweighted average of the generalizability coefficients was computed for comparison with the alpha coefficients. The overall distributions of both alpha and generalizability coefficients are illustrated with box-plots in Figure 4.1. The joint distribution of the effect size for unweighted reliability coefficients and number of stations is illustrated with a scatterplot in Figure 4.2.

50 Table 4.1: Summary of the 49 studies included in the OSCE meta-analysis. Study Context Area Population s Level Population Intervention Station Items Station duration (min) Examiners Examiners /station Scoring 1 Baig HSD IMG IMG 39 2 Bansal R Surgical residency Trainee 24 3 Berkenstadt HSD Anesthesia residency Trainee 145 OSCE, ITERs, mini- CEX, PAR 14 NA NA FP + SP 2 CL OSCE + MCQE 6 25 to FP 1 CL + GR Simulationbased OSCE NA NA 2 CL + GR 4 Boudreau R Respiratory medicine, (ABIM) Trainee + Physicians 22 OSCE 6 NA FP + SP 2-3 CL + GR 5 Boulet HSD (ECFMG) IMG 123 OSCE 10 NA NA FP + SP 2 CL + GR 6 Brailovsky HSD Family medicine Trainee 235 OSCE 40 NA 7-14 FP + SP 1 CL 7 Cohen 1990 R Surgical resdency Trainee 27 OSCE FP 1 CL 8 Cohen 1996 HSD PIP IMG 72 OSCE + MCQE 29 NA 10 FP 1 CL + GR 9 Dupras R Internal medicine residency Trainee 51 OSCE 12 NA 6 FP 2-3 CL 10 Gerrow 2003 HSD NDEB Trainee 2317 OSCE + MCQE 25 NA 5 NA NA NA 11 Goff 2000 R 12 Goff 2001 R Obstetrics and gynecology residency Trainee 24 OSAT FP 1-2 Obstetrics and gynecology residency Trainee 24 OSAT FP 1 CL + GR + P/F CL + GR + P/F 13 Goff 2002 R Obstetrics and gynecology residency Trainee 16 OSAT + written examination FP 2 CL + GR + P/F 14 Grand'Maison 1992 HSD Family medicine residency + CFPC Trainee 539 OSCE 2 NA 7-14 FP 1 CL 50

51 15 Grand'Maison 1996 HSD QLEx for family medicine Physician 13 OSCE + written Questionnaire s 38 NA 7 or 14 NA NA NA 16 Grand'Maison 1997 HSD CFPC Trainee 172 OSCE 40 NA 7 or 14 FP 1 CL + GR 17 Hamadeh R Family medicine residency Trainee 31 OSCE FP 1 CL 18 Hatala HSD RCPSC IM Trainee 251 OSCE 1 NA NA FP 2 CL + GR 19 Hilliard R Pediatric residency Trainee 43 OSCE 5 NA NA FP + SP 2 CL + GR 20 Hofmeister HSD Family medicine residency + AIMG IMG 71 MMI + OSCE FP 1-2 CL 21 Jefferies 2007 R Neonatal-Perinatal Medicine subspecialty training program Trainee: 24 OSCE 10 NA 12 FP + SP + SHP 2 CL + GR + CanMEDS ratings 22 Jefferies 2011 R Neonatal-Perinatal Medicine subspecialty training program Trainee 68 SOE 8 NA 15 FP 1-2 CL + GR 23 Joorabchi 96 R Pediatric residency Trainee 126 OSCE NA 5 NA NA CL + GR 24 Kramer R General practice Trainee + Physicians 121 OSCE 16 NA 7 or 15 FP 4-11 CL + GR 25 MacRae 1997 R Surgical residency Trainee 18 PAME 6 NA 30 FP + SP 2 CL + GR 26 MacRae2000 R Surgical residency Trainee 24 OSATS + PAME or r 15 FP 1 GR 27 Martin R Surgical residency Trainee 20 OSAT FP 2 CL + GR + P/F 28 Nadeem R Radiology residency Trainee 42 OSCE FP + SP 2 CL + GR 29 Neilsen R Obstetrics and Gynecology Trainee 18 OSATS 1 6 or 7 NA FP 2 CL + GR + P/F 30 Petrusa R Internal medicine residency Trainee 74 OSCE 17 NA 8 FP + SP + RP 2 CL 31 Regehr R Surgical residency Trainee 53 OSAT or 7 15 FP 2 CL + GR 51

52 32 Reznick 1992 HSD MCCQE Trainee 240 OSCE 20 NA 10 FP + SP 3 CL + GR + communica tion scores + PEP score 33 Reznick 1993 HSD MCCQE Trainee 401 OSCE 20 NA 10 FP 1 CL + GR + communica tion scores + PEP score 34 Reznick 1996 HSD MCCQE II Trainee + Physicians 3024 OSCE 20 NA 10 FP 1 CL + GR + PEP score 35 Reznick 97 R Surgical residency Trainee 48 OSAT or 7 15 FP 1 CL + GR 36 Rifkin Research Internal medicine residency Trainee: (75% IMG) 34 OSCE 4 NA NA NA NA NA 37 Rothman HSD MCCQE II Trainee 744 OSCE 20 NA 10 FP 1 CL + P/F + PEP score 38 Skinner R Family practice residency. Trainee 17 OSCE 27 NA Sloan 1993 R Surgical internship Trainee 30 OSCE 35 NA 10 FP+ senior residents NA CL + GR FP+ senior residents 1 CL 40 Sloan 1995 R Surgical residency Trainee 56 OSCE 38 NA 10 FP 1 CL 41 Stillman P 1991 HSD ABIM Trainee (17% IMG) 310 OSCE + MCQ 19 NA FP + SP 1-2 CL + GR 42 Stillman PL 1986 HSD ABIM Trainee (8% IMG) 336 OSCE or 8 40 SP 1 CL + GR 43 Sutnick HSD ECFMG Trainee (84% IMG) 624 1) clinical encounters with SP; (2) laser videodisk pictorials; (3) written clinical vignettes; and (4) assessment of spoken 8 NA 22 SPs + record room clerks 2 CL + GR + spoken English rating 52

53 44 Taghva R Psychiatry residency Trainee 22 OSCE FP 2 CL + GR 45 Tudiver R Family Medicine residency Trainee 42 OSCE 2 8 or 2 30 FP 3 CL + GR English 46 Vallevand HSD WAAIP IMG 39 OSCE 14 NA NA FP + SP 2 CL + GR + P/F van Zanten HSD ECFMG IMG OSCE FP + SP 2 CL + GR van Zanten HSD USMLE II IMG OSCE FP + SP 2 CL + GR 49 Yang R Internal medicine residency Trainee 209 DOPS, IM- ITE & OSCE FP 1-2 CL 53

Figure 4.1: Overall distributions of reliability and generalizability coefficients (Ep²: generalisability coefficient; α: reliability coefficient).
Figure 4.2: Scatterplot of reliability coefficient (alpha) by number of stations with unweighted regression line.

The unweighted regression line is also plotted in the scatterplot (Figure 4.2), relating the number of stations to the reliability coefficients. This scatterplot shows that the effect size (i.e., reliability) tends to increase as the number of stations increases. However, there is considerable variability in the reported coefficients. Moderator analyses were conducted using seven potential explanatory variables or moderators: content, context, number of raters, examiner type, examinee level, station duration, and scale type (Table 4.2). These moderators were analyzed separately across stations and across items. For the results across stations, the difference in mean reliability is significant for three moderators: content (clinical versus communication), number of raters, and examiner type (FP versus SP). There were 34 reliability coefficients reported for clinical content and 20 for communication content. The mean reliability estimate was α = 0.65 for the clinical content (95% CI: ). For the communication content, the mean reliability estimate was α = 0.64 (95% CI: ). When the number of raters was used as a moderator, there were 16 reliability coefficients reported for one rater and 45 for two or more raters. The mean reliability estimate was α = 0.61 for one rater (95% CI: ). For two or more raters, the mean reliability estimate was α = 0.69 (95% CI: ). When the type of examiner was used as a moderator, there were 50 reliability coefficients reported for the FP and 30 for the SP. The mean reliability estimate was α = 0.69 for the FP (95% CI: ). For the SP, the mean reliability estimate was α = 0.64 (95% CI: ). Note that two of the three moderators (content and number of raters) were also significant across items. For the results across items, four moderators were significant: exam content, exam context (Research versus HSD), number of raters, and level of examinee (physicians versus trainees). There were 48 reliability coefficients for the clinical content and 8 for the

communication content. The mean reliability estimate was α = 0.61 for the clinical content (95% CI: ), and α = 0.66 (95% CI: ) for the communication content. When the exam's context was used as a moderator, there were 48 reliability coefficients for the research context and no coefficients were coded for the HSD context. For the research context, the mean reliability estimate was α = 0.61 (95% CI: ). When the number of raters was used as a moderator, there were 15 reliability coefficients for one rater and 30 for two or more raters. The mean reliability estimate was α = 0.59 for one rater (95% CI: ). For two or more raters, the mean reliability estimate was α = 0.60 (95% CI: ). When the examinees' level was used as a moderator, there were 48 reliability coefficients for the trainee level and no coefficients were coded for the physician level. For the trainee level, the mean reliability estimate was α = 0.61 (95% CI: ). Note that the exam's context and the examinees' level were not significant across stations. Also, for both stations and items, mean reliability estimates for short and long station durations did not differ significantly. In summary, for the results across stations, three moderators were significant: content, number of raters, and examiner type. For the results across items, four moderators were significant: exam content, exam context, number of raters, and level of examinee (Table 4.2).
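The reliability coefficients summarized above are Cronbach's alpha values computed either across stations or across items, as distinguished in the Methods. For concreteness, a minimal sketch of the across-stations calculation is shown below; the score matrix, values, and function name are hypothetical and purely illustrative, and applying the same function to the item-level scores within a single station would yield an across-items alpha for that station.

import numpy as np

def cronbach_alpha(scores):
    """scores: rows = candidates, columns = stations (or items within one station)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of stations (or items)
    column_vars = scores.var(axis=0, ddof=1)     # variance of each station score
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of candidates' total scores
    return (k / (k - 1)) * (1 - column_vars.sum() / total_var)

# Hypothetical station totals for 5 candidates across 4 stations.
station_scores = [
    [14, 12, 15, 11],
    [10,  9, 12,  8],
    [16, 15, 17, 14],
    [11, 10, 11,  9],
    [13, 12, 14, 12],
]
print(round(cronbach_alpha(station_scores), 2))  # across-stations alpha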

Table 4.2: Moderator analyses.
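As a rough illustration of the significance testing behind Table 4.2 (a t-test when a moderator has two levels and a one-way ANOVA when it has three, as described in the Methods), the sketch below uses hypothetical, unweighted alpha values for each moderator level; the actual analysis was carried out within the weighted random-effects framework described earlier, so this only illustrates the form of the tests.

from scipy.stats import ttest_ind, f_oneway

# Hypothetical alpha values per moderator level (illustration only).
clinical      = [0.61, 0.58, 0.70, 0.66, 0.59]   # two-level moderator: content
communication = [0.64, 0.68, 0.62, 0.71]
t_stat, p_two_levels = ttest_ind(clinical, communication)

one_rater    = [0.55, 0.60, 0.63]                # three-level moderator: number of raters
two_raters   = [0.66, 0.70, 0.68]
three_raters = [0.71, 0.69, 0.74]
f_stat, p_three_levels = f_oneway(one_rater, two_raters, three_raters)

print(round(p_two_levels, 3), round(p_three_levels, 3))  # compare each against 0.05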

Validity
1) Criterion Validity
A total of 33 studies reported criterion (concurrent) validity values for the association between testing measures using Pearson's correlation coefficient (r). The majority of these studies reported low to moderate validity values, with a mean of r = 0.46 ± 0.21 (95% CI: ) and a pooled fixed-effects estimate of r = 0.47 (95% CI: ); see Figure 4.3. High heterogeneity was observed (I² = 98.6%, Q = , p < 0.001). This implies that over 98% of the observed variability is attributed to between-studies variability. As this heterogeneity is very large and significant, it is important to understand its sources. Based on the variability between OSCE administrations within studies, it is expected that criterion validity may be affected by multiple sources of error, including the exam's context, the candidates' background, and the traditional assessment measures with which the OSCE's results were being compared. Therefore, a subgroup analysis was performed to try to explain some sources of this heterogeneity.

Figure 4.3: Forest plot for Criterion Validity (r).

Number of Stations
Three categories of number of stations were formed: Low (1-10 stations), Moderate (11-20 stations), and High (21 or more stations). A subgroup analysis was then performed using a random-effects effect-size estimation model. The heterogeneity statistics are presented below the forest plot (Figure 4.4). In each of the subgroups, a Cochran Q test was performed to assess heterogeneity, and the degree of heterogeneity was quantified by the I² index. Based on this subgroup analysis, the effect sizes within the High subgroup are more homogeneous than those in the other two subgroups. Accordingly, the between-studies variance is higher in these subgroups (Q = for Moderate; Q = for Low) compared to the High subgroup (Q = 13.33). This indicates that part of the heterogeneity could be explained by the factor number of stations. Moreover, the I² index of the High subgroup (47.5%) is much smaller than that of the other two subgroups (93.1% for Moderate, 96.4% for Low) as well as the overall variance (98.6%). This implies that a small part of the observed between-studies variability might be attributed to the factor number of stations (Figure 4.4).
Station's Duration
Three categories of station duration were formed: Short (1-10 minutes), Moderate (11-20 minutes), and Long (21 minutes or more). A subgroup analysis was then performed using a random-effects effect-size estimation model. The heterogeneity statistics are presented below the forest plot (Figure 4.5). In each of the subgroups, a Cochran Q test was performed to assess heterogeneity, and the degree of heterogeneity was quantified by the I² index.

Based on the subgroup analysis, the effect sizes within the Short and Moderate subgroups are more homogeneous than those in the Long subgroup. Accordingly, the between-studies variance is markedly higher in the Long subgroup (Q = ) compared to the Short (Q = ) and Moderate subgroups (Q = 79.43). Moreover, the I² indices of the Short (87.0%) and Moderate (92.4%) subgroups are relatively smaller than that of the Long subgroup (99.4%) as well as the overall variance (98.6%); see Figure 4.5. This indicates that part of the heterogeneity could be explained by the factor station duration (Figure 4.5).
Scoring Methods
Three categories of scoring methods were formed: Checklist (CL), Global Rating (GR), and CL & GR. A subgroup analysis was then performed using a random-effects effect-size estimation model. The heterogeneity statistics are presented below the forest plot (Figure 4.6). In each of the subgroups, a Cochran Q test was performed to assess heterogeneity, and the degree of heterogeneity was quantified by the I² index. This subgroup analysis effectively resulted in only two subgroups, because the GR scoring method was used in only one study. Nonetheless, the effect size estimates within the CL subgroup were more homogeneous than those in the CL & GR subgroup. Accordingly, the between-studies variance is lower in the CL subgroup (Q = 25.59) compared to the CL & GR subgroup (Q = ). This indicates that part of the heterogeneity could be explained by the factor scoring method. Moreover, the I² index of the CL subgroup (68.7%) is smaller than that of the CL & GR subgroup (99.0%) as well as the overall variance (98.6%).

This implies that a small part of the observed between-studies variability might be attributed to the factor scoring method (Figure 4.6).
Number of Raters per Station
Three categories of number of raters were formed: Low (1 rater), Moderate (2 raters), and High (3 or more raters). A subgroup analysis was then performed using a random-effects effect-size estimation model. The heterogeneity statistics are presented below the forest plot (Figure 4.7). In each of the subgroups, a Cochran Q test was performed to assess heterogeneity, and the degree of heterogeneity was quantified by the I² index. Based on this subgroup analysis, the effect sizes within the High subgroup are more homogeneous than those in the other two subgroups. Accordingly, the between-studies variance is higher in these subgroups (Q = for Moderate; Q = 85.75 for Low) compared to the High subgroup (Q = 18.01). This indicates that part of the heterogeneity could be explained by the factor number of raters. Moreover, the I² index of the High subgroup (66.7%) is smaller than that of the other two subgroups (90.7% for Low, 99.2% for Moderate) as well as the overall variance (98.6%). This emphasizes that a small part of the heterogeneity could be explained by the factor number of raters (Figure 4.7).
Examiners
Three categories of examiners were formed: Faculty Physician (FP), Standardized Patient (SP), and FP & SP. A subgroup analysis was then performed using a random-effects effect-size estimation model. The heterogeneity statistics are presented below the forest plot (Figure 4.8). In

each of the subgroups, a Cochran Q test was performed to assess heterogeneity, and the degree of heterogeneity was quantified by the I² index. This subgroup analysis effectively resulted in only two subgroups, because the SP subgroup included only one study. Nonetheless, the effect sizes within the FP subgroup are more homogeneous than those in the FP & SP subgroup. Accordingly, the between-studies variance is lower in the FP subgroup (Q = ) compared to the FP & SP subgroup (Q = ). This indicates that part of the heterogeneity could be explained by the factor examiner. Moreover, the I² index of the FP subgroup (82.7%) is smaller than that of the FP & SP subgroup (99.4%) as well as the overall variance (98.6%). This emphasizes that a small part of the heterogeneity could be explained by the factor examiner (Figure 4.8).
Exam's Content
Three categories of exam content were formed: Clinical, Communication, and Clinical & Communication. A subgroup analysis was then performed using a random-effects effect-size estimation model. The heterogeneity statistics are presented below the forest plot (Figure 4.9). In each of the subgroups, a Cochran Q test was performed to assess heterogeneity, and the degree of heterogeneity was quantified by the I² index. Based on the subgroup analysis, the effect sizes within the Clinical and Communication subgroups are somewhat more homogeneous than those in the Clinical & Communication subgroup. Accordingly, the between-studies variance is slightly higher in the Clinical & Communication subgroup (Q = ) compared to the Clinical (Q = 84.51) and Communication subgroups (Q = 15.66). Moreover, the I² indices of the Clinical (85.8%) and

Communication (87.2%) subgroups are relatively smaller than that of the Clinical & Communication subgroup (91.4%) as well as the overall variance (98.6%). This indicates that part of the heterogeneity could be explained by the factor exam content (Figure 4.9).
Exam's Context
Two categories of exam context were formed: High Stakes Decision (HSD) and Research (R). A subgroup analysis was then performed using a random-effects effect-size estimation model. The heterogeneity statistics are presented below the forest plot (Figure 4.10). In each of the subgroups, a Cochran Q test was performed to assess heterogeneity, and the degree of heterogeneity was quantified by the I² index. Based on the subgroup analysis, the effect sizes within the R subgroup are more homogeneous than those in the HSD subgroup. Accordingly, the between-studies variance is markedly higher in the HSD subgroup (Q = ) compared to the R subgroup (Q = ). This indicates that part of the heterogeneity could be explained by the factor exam context. Moreover, the I² index of the R subgroup (81.6%) is relatively smaller than that of the HSD subgroup (99.4%) as well as the overall variance (98.6%). This may indicate that a small part of the heterogeneity could be explained by the factor exam context (Figure 4.10).
Candidates' Background
Three categories of candidates' background were formed: International Medical Graduates (IMG), non-IMG, and IMG & non-IMG. A subgroup analysis was then performed using a random-effects effect-size estimation model. The heterogeneity statistics are presented below

the forest plot (Figure 4.11). In each of the subgroups, a Cochran Q test was performed to assess heterogeneity, and the degree of heterogeneity was quantified by the I² index. Based on the subgroup analysis, the effect sizes within the IMG & non-IMG subgroup, though it contains only three studies, are uniquely homogeneous, with no heterogeneity (i.e., an I² index of zero) and a markedly low between-studies variance (Q = 0.21). In addition, the effect sizes within the non-IMG subgroup are more homogeneous than those in the IMG subgroup. Accordingly, the between-studies variance is clearly higher in the IMG subgroup (Q = ) compared to the non-IMG subgroup (Q = ). Moreover, the I² index of the IMG & non-IMG subgroup is zero, and that of the non-IMG subgroup (84.9%) is smaller than that of the IMG subgroup (99.7%) as well as the overall variance (98.6%). This indicates that part of the heterogeneity could be explained by the factor candidates' background (Figure 4.11).
Comparison's Test
Three categories of the traditional assessment tools with which the OSCE's results were compared (the comparison test) were formed: clinical, non-clinical, and clinical & non-clinical. A subgroup analysis was then performed using a random-effects effect-size estimation model. The heterogeneity statistics are presented below the forest plot (Figure 4.12). In each of the subgroups, a Cochran Q test was performed to assess heterogeneity, and the degree of heterogeneity was quantified by the I² index. Based on the subgroup analysis, the effect sizes within the clinical subgroup are also uniquely homogeneous, with minimal heterogeneity (Q = 7.12 and an I² index of 43.8%). In addition, the effect sizes within the clinical & non-clinical subgroup are slightly more homogeneous than those in the non-clinical subgroup. Accordingly, the between-studies variance is

slightly higher in the non-clinical subgroup (Q = ) compared to the clinical & non-clinical subgroup (Q = ). Moreover, the I² index of the clinical subgroup is very low, and that of the clinical & non-clinical subgroup (89.4%) is smaller than that of the non-clinical subgroup (96.6%) as well as the overall variance (98.6%). This may indicate that part of the heterogeneity could be explained by the factor comparison test (Figure 4.12).

Figure 4.4: Forest plot for Criterion Validity (r) by Number of Stations.

Figure 4.5: Forest plot for Criterion Validity (r) by Station's Duration.

Figure 4.6: Forest plot for Criterion Validity (r) by Scoring Methods.

Figure 4.7: Forest plot for Criterion Validity (r) by Number of Raters.

Figure 4.8: Forest plot for Criterion Validity (r) by Examiner.

Figure 4.9: Forest plot for Criterion Validity (r) by Exam's Content.

Figure 4.10: Forest plot for Criterion Validity (r) by Exam's Context.

Figure 4.11: Forest plot for Criterion Validity (r) by Candidates' Background.

Figure 4.12: Forest plot for Criterion Validity (r) by Comparison's Test.

2) Construct Validity
Twenty-one studies evaluated construct validity. Unfortunately, the majority of these studies did not report the absolute values of the construct validity coefficients; rather, they reported only whether the difference between the two candidate groups being compared was significant or not using a P-value. Except for two studies, the reported P-values were all significant (i.e., P < 0.05). In order to run our analysis, we converted these P-values to Pearson's correlation coefficients (r). Accordingly, the majority of these studies reported low to moderate validity coefficients, with a mean of r = 0.42 ± 0.19 (95% CI: ) and a pooled fixed-effect size estimate of r = 0.41 (95% CI: ); see Figure 4.13. Moderate heterogeneity was observed (I² = 68.9%, Q = 64.31, p < 0.001). This implies that over 60% of the observed variability is attributed to between-studies variability. As this heterogeneity is large and significant, it is important to understand its sources. As for criterion validity, it is expected that construct validity may be affected by the same sources of error, including the exam's context, the candidates' background, and the traditional assessment tools with which the OSCE's results are being compared. Therefore, a subgroup analysis was performed to try to explain some sources of this heterogeneity.
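To make this conversion concrete, the minimal sketch below assumes that a reported P value comes from a two-tailed comparison of two candidate groups (an independent-groups t-test) with known group sizes; this is an assumption made for illustration, since the cited calculator handles several designs and not every primary study will match this one.

from math import sqrt
from scipy.stats import t as t_dist

def p_to_r(p, n1, n2):
    """Approximate r implied by a two-tailed P value from an independent-groups t-test."""
    df = n1 + n2 - 2
    t_val = t_dist.ppf(1 - p / 2, df)   # |t| corresponding to the two-tailed P value
    return t_val / sqrt(t_val**2 + df)  # r = t / sqrt(t^2 + df)

# Hypothetical example: P = 0.01 reported for two groups of 30 candidates each.
print(round(p_to_r(0.01, 30, 30), 2))

Because only the P value (and not the exact test statistic) is used, the recovered r is conservative whenever a study reports a threshold such as "P < 0.05" rather than an exact value.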

Figure 4.13: Forest plot for Construct Validity (r).

Number of Stations
Based on the subgroup analysis (Figure 4.14), the effect sizes within the High subgroup are uniquely homogeneous, with no heterogeneity (i.e., an I² index of zero) and a markedly low between-studies variance (Q = 1.24). In addition, the effect sizes within the Moderate subgroup are more homogeneous than those in the Low subgroup. Accordingly, the between-studies variance is relatively higher in the Low subgroup (Q = 24.60) compared to the Moderate subgroup (Q = 7.25). Moreover, the I² index of the High subgroup is zero, and that of the Moderate subgroup (44.8%) is relatively smaller than that of the Low subgroup (51.2%) as well as the overall variance (68.9%). This indicates that part of the heterogeneity could be explained by the factor number of stations (Figure 4.14).
Station's Duration
Based on the subgroup analysis (Figure 4.15), the effect sizes within the Long subgroup are also uniquely homogeneous, with no heterogeneity (i.e., an I² index of zero) and a markedly low between-studies variance (Q = 2.18). In addition, the effect sizes within the Short subgroup are more homogeneous than those in the Moderate subgroup. Accordingly, the between-studies variance is relatively higher in the Moderate subgroup (Q = 22.25) compared to the Short subgroup (Q = 19.17). Moreover, the I² index of the Long subgroup is zero, and that of the Short subgroup (58.3%) is relatively smaller than that of the Moderate subgroup (77.5%) as well as the overall variance (68.9%). This indicates that part of the heterogeneity could be explained by the factor station duration (Figure 4.15).

Scoring Methods

The scoring methods subgroup analysis has only two usable subgroups because the GR scoring method was reported in only one study (Figure 4.16). Nonetheless, the between-studies variance cannot be explained by the factor Scoring Methods because the combined effect-size estimates within both subgroups remain heterogeneous, with I² indices (68.6% for CL, 71.2% for CL & GR) almost the same as the overall value (68.9%).

Number of Raters per Station

The number of raters per station subgroup analysis has only two usable subgroups because the High subgroup was used in only one study (Figure 4.17). Nonetheless, the effect sizes within the Low subgroup are homogeneous, with no heterogeneity (I² = 0; Q = 3.00), compared to the Moderate subgroup (I² = 75.2%; Q = 44.44). Accordingly, this may indicate that part of the heterogeneity could be explained by the factor Number of Raters (Figure 4.17).

Examiners

The examiners subgroup analysis has only two subgroups because none of the studies that reported construct validity coefficients used SPs only as examiners (Figure 4.18). Nonetheless, the between-studies variance cannot be explained by the factor Examiners because the effect sizes within the two subgroups remain heterogeneous, with I² indices (71.3% for FP & SP, 42.7% for FP) broadly similar to the overall value (68.9%).

Exam's Content

The exam's content subgroup analysis has only two subgroups because none of the studies that reported construct validity coefficients assessed communication only (Figure 4.19). Nonetheless, the between-studies variance cannot be explained by the factor Content because the effect sizes within the two subgroups remain heterogeneous, with I² indices (63.2% for Clinical, 69.4% for Clinical & Communication) almost the same as the overall value (68.9%).

Exam's Context

Based on the exam's context subgroup analysis, the effect sizes within the HSD subgroup are more homogeneous than those in the R subgroup (Figure 4.20); accordingly, the between-studies variance is relatively higher in the R subgroup (Q = 37.15) than in the HSD subgroup (Q = 1.07). Moreover, the I² index of the HSD subgroup (6.6%) is clearly smaller than that of the R subgroup (51.6%) and the overall value (68.9%). This may indicate that a small part of the heterogeneity could be explained by the factor Exam's Context (Figure 4.20).

Candidates' Background

Based on the candidates' background subgroup analysis, the between-studies variance cannot be explained by this factor because the IMG and IMG & non-IMG subgroups each included only one study (Figure 4.21).

Comparison's Test

Based on the comparison's test subgroup analysis (Figure 4.22), the effect sizes within the Clinical and non-Clinical subgroups are homogeneous, with no heterogeneity (I² = 0) and markedly low between-studies variance (Q = 0.06 for Clinical, Q = 1.11 for non-Clinical). In contrast, the effect sizes within the Clinical & non-Clinical subgroup are more heterogeneous (Q = 52.76, I² = 71.6%). Therefore, this may indicate that part of the heterogeneity could be explained by the factor Comparison's Test (Figure 4.22).

Figure 4.14: Forest plot for Construct Validity (r) by Number of Stations. [Subgroup estimates for Moderate (I-squared = 44.8%), Low (I-squared = 51.2%), and High (I-squared = 0.0%) numbers of stations, with the overall random-effects estimate (I-squared = 68.9%); horizontal axis: Pearson's correlation coefficient r.]

Figure 4.15: Forest plot for Construct Validity (r) by Station's Duration. [Subgroup estimates for Long (I-squared = 0.0%), Short (I-squared = 58.3%), and Moderate (I-squared = 77.5%) station durations, with the overall random-effects estimate (I-squared = 68.9%); horizontal axis: Pearson's correlation coefficient r.]

Figure 4.16: Forest plot for Construct Validity (r) by Scoring Methods. [Subgroup estimates for CL (I-squared = 68.6%), CL & GR (I-squared = 71.2%), and GR (single study), with the overall random-effects estimate (I-squared = 68.9%); horizontal axis: Pearson's correlation coefficient r.]

Figure 4.17: Forest plot for Construct Validity (r) by Number of Raters. [Subgroup estimates for Moderate (I-squared = 75.2%), Low (I-squared = 0.0%), and High (single study) numbers of raters, with the overall random-effects estimate (I-squared = 68.9%); horizontal axis: Pearson's correlation coefficient r.]

Figure 4.18: Forest plot for Construct Validity (r) by Examiners. [Subgroup estimates for FP & SP (I-squared = 71.3%) and FP (I-squared = 42.7%) examiners, with the overall random-effects estimate (I-squared = 68.9%); horizontal axis: Pearson's correlation coefficient r.]

Figure 4.19: Forest plot for Construct Validity (r) by Exam's Content. [Subgroup estimates for Clinical & Communication (I-squared = 69.4%) and Clinical (I-squared = 63.2%) content, with the overall random-effects estimate (I-squared = 68.9%); horizontal axis: Pearson's correlation coefficient r.]

Figure 4.20: Forest plot for Construct Validity (r) by Exam's Context. [Subgroup estimates for HSD (I-squared = 6.6%) and R (I-squared = 51.6%) contexts, with the overall random-effects estimate (I-squared = 68.9%); horizontal axis: Pearson's correlation coefficient r.]

Figure 4.21: Forest plot for Construct Validity (r) by Candidates' Background. [Subgroup estimates for IMG (single study), non-IMG (I-squared = 51.6%), and IMG & non-IMG (single study) candidates, with the overall random-effects estimate (I-squared = 68.9%); horizontal axis: Pearson's correlation coefficient r.]

Figure 4.22: Forest plot for Construct Validity (r) by Comparison's Test. [Subgroup estimates for Clinical (I-squared = 0.0%), non-Clinical (I-squared = 0.0%), and Clinical & non-Clinical (I-squared = 71.6%) comparison tests, with the overall random-effects estimate (I-squared = 68.9%); horizontal axis: Pearson's correlation coefficient r.]

3) Predictive Validity

Only two studies reported predictive validity correlation coefficients (r), with a mean of r = 0.47 ± 0.38 and a pooled fixed-effect size estimate (SMD) of r = 0.47; see Figure 4.23. High heterogeneity was observed (I² = 97.9%, Q = 47.95, p < 0.001). This implies that over 90% of the observed variability is attributed to between-studies variability. As this heterogeneity is very large and significant, it is important to understand its sources. As with criterion and construct validity, predictive validity is expected to be affected by the same sources of error, including the exam's context, the candidates' background, and the traditional assessment tools with which the OSCE's results are compared. Unfortunately, a subgroup analysis, similar to that for internal reliability, could not be applied because of the small number of studies reporting predictive validity correlation coefficients.
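The pooled fixed-effect estimates quoted in this chapter are inverse-variance weighted means; for correlations this is conventionally done on Fisher's z scale and back-transformed to r. A minimal sketch under that assumption is given below; the two input studies are hypothetical, not the two predictive-validity studies.

```python
# Minimal sketch (assumed method, not the thesis code): fixed-effect
# pooling of Pearson correlations via Fisher's z with weights n - 3.
import math

def pool_fixed_effect(rs, ns):
    """Return the pooled r and its 95% CI on the r scale."""
    zs = [math.atanh(r) for r in rs]
    ws = [n - 3 for n in ns]
    z_bar = sum(w * z for w, z in zip(ws, zs)) / sum(ws)
    se = 1.0 / math.sqrt(sum(ws))
    lo, hi = z_bar - 1.96 * se, z_bar + 1.96 * se
    return math.tanh(z_bar), (math.tanh(lo), math.tanh(hi))

# Hypothetical pair of studies (values are illustrative only)
print(pool_fixed_effect([0.25, 0.70], [120, 60]))
```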

Figure 4.23: Forest plot for Predictive Validity (r).

Feasibility

Regardless of the OSCE's strong psychometric properties, it will not be practical to implement an OSCE approach to the assessment of clinical competency if the necessary resources are not available. These resources may vary according to the purpose and context of the OSCE. In the following sections we summarize the results for the six major factors that may affect the feasibility of any OSCE.
