Using Generalizability Theory to Investigate the Psychometric Property of an Assessment Center in Indonesia



Urip Purwono received his Ph.D. in psychology from the University of Massachusetts at Amherst, USA, specializing in psychometrics and educational measurement and evaluation. Founder of the Center for Psychometric Study, UNPAD, his methodological research interests include test theory, test construction, test adaptation, and structural equation modeling. In addition, he does extensive work on cognitive and educational test development for use in Indonesian settings. He is currently the first Vice President of the Indonesian Psychological Association and the President-Elect of the ASEAN Regional Union of Psychological Societies (ARUPS).

Using Generalizability Theory to Investigate the Psychometric Property of an Assessment Center in Indonesia. R. Urip Purwono, Ph.D., Padjadjaran University

Using Generalizability Theory to Investigate the Psychometric Properties of an Assessment Center Program Urip Purwono, M.Sc., Ph.D., Psi. The Center for Psychometric Study Faculty of Psychology Universitas Padjadjaran Bandung, Indonesia

INTRODUCTION Background First formally introduced in 1990 by PT Telkom, the Assessment Center method has become increasingly popular in Indonesia. It is now used for many purposes by many organizations, in both the public and private sectors, and it has many variations. However, research lags far behind practice, and psychometric study even more so, so an appeal to fill this gap is warranted. Without research, it is difficult to perfect the method and to improve the service.

INTRODUCTION Measurement Measurement is the process of assigning numbers to an attribute of an entity according to a certain rule (Stevens, 1951). The Assessment Center fits this definition: using performance tasks, it assigns numbers to attributes of an individual (namely, competencies), and there are rules for the assignment of those numbers (i.e., the rubrics). Other forms of measurement: tests, objective personality assessment.

INTRODUCTION Technical Quality Indices Paper-and-pencil measures use many indices of technical quality: reliability, test/item information function, item fit, person fit, standard error of measurement. For example: What is the reliability of the test? How big is the SEM? What does the Item Information Function look like (IRT)? What about the in-fit and out-fit statistics (Rasch measurement)? What about the Assessment Center? Is there any number we can use to indicate its technical quality? These questions fall under reliability concerns (General Session 1, William Byham).

INTRODUCTION Technical Quality Indices (Cont.) The issue of reliability will never be over as long as we are using numbers, and methods are continuously being developed. We cannot interpret a number if we cannot rely on it: ratings change with the time of assessment, ratings change from rater to rater, and ratings change across exercises.

Purpose To familiarize us with indices derived from the Generalizability Theory framework that can be used as quality indices of an assessment center program, consistent with the assertions of Cronbach, Gleser, Nanda, and Rajaratnam (1972), Brennan (1991), Kane (1982), and Shavelson, Webb, and Rowley (1989). To present the basic idea of Generalizability Theory and its simplest model appropriate for use in the context of the Assessment Center. To illustrate the use of the approach with data from a real Assessment Center.

The Assessment Center is a complex performance-based assessment. It involves standardized performance tasks, criteria to use in rating candidate performances, and a rubric. The complexity: multiple raters and multiple dimensions.

Indices of technical quality Classical Test Theory provides ways to estimate the amount of error contained in every observed score. Item Response Theory provides ways to estimate ability and assign scores more accurately; the Test Information Function is the inverse function of the Standard Error of Measurement. Inter-rater reliability provides information on the variation in the ratings that two or more raters give to a person's performances.
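The classical error index mentioned above can be made concrete. A minimal sketch in Python, with hypothetical numbers, of the CTT standard error of measurement, SEM = SD * sqrt(1 - reliability) — a single, undifferentiated error figure:

```python
import math

def ctt_sem(sd: float, reliability: float) -> float:
    """Classical test theory standard error of measurement:
    SEM = SD * sqrt(1 - reliability). One undifferentiated error
    figure; it does not say where the error comes from."""
    return sd * math.sqrt(1.0 - reliability)

# Hypothetical example: observed-score SD of 10, reliability .84
print(round(ctt_sem(10.0, 0.84), 6))  # 4.0
```

Whatever the reliability estimate used, the resulting SEM lumps all sources of error together, which is the limitation the G-theory approach below addresses.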

Indices of technical quality (Cont.) The three approaches yield an undifferentiated error of measurement (Feldt & Brennan, 1989), and thus provide too gross a characterization of the potential and/or actual sources of measurement error. G theory, on the other hand, provides information on the sources of error variation, disentangles them, and estimates each one, assuming randomly parallel tasks sampled from the same universe.

The Tenets of GT Generalizability (G) theory is a statistical theory for evaluating the dependability (or reliability) of behavioral measurements (Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Brennan, 2001; Shavelson & Webb, 1991). It is an application and extension of Analysis of Variance procedures applied to solving measurement problems, and an extension of Classical Test Theory.

The Tenets of GT (Cont.) The building blocks are variance and covariance components for a universe of admissible observations (UAO). In performance-based assessment, a sample of a candidate's performance is drawn from a complex universe defined by the combination of all possible tasks, raters, measurement methods, and so on. Tasks, raters, and methods are called facets. The simplest Assessment Center facets: tasks and raters.

The Tenets of GT (Cont.) An individual's performance rating is seen as affected by sampling variability due to tasks, raters, and so forth (the facets) and their combinations, which provides estimates of the magnitude of measurement error in the form of variance components. G theory also provides a summary coefficient reflecting the "reliability" of generalizing from a sample score or profile to the much larger domain of interest: the generalizability coefficient (G coefficient). We use the results of the G-studies to design a measurement procedure that minimizes error.
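The ANOVA machinery behind a G-study can be sketched directly. Below is a minimal Python/NumPy illustration (not the software used in the study) that estimates the seven variance components of a fully crossed persons x tasks x raters design from the expected mean squares:

```python
import numpy as np

def g_study_pxtxr(X):
    """Estimate variance components for a fully crossed
    persons x tasks x raters (p x t x r) G-study, one observation
    per cell, via ANOVA expected mean squares."""
    n_p, n_t, n_r = X.shape
    grand = X.mean()
    # Marginal means for each main effect and two-way margin
    m_p = X.mean(axis=(1, 2)); m_t = X.mean(axis=(0, 2)); m_r = X.mean(axis=(0, 1))
    m_pt = X.mean(axis=2); m_pr = X.mean(axis=1); m_tr = X.mean(axis=0)
    # Sums of squares for each effect
    ss_p = n_t * n_r * ((m_p - grand) ** 2).sum()
    ss_t = n_p * n_r * ((m_t - grand) ** 2).sum()
    ss_r = n_p * n_t * ((m_r - grand) ** 2).sum()
    ss_pt = n_r * ((m_pt - m_p[:, None] - m_t[None, :] + grand) ** 2).sum()
    ss_pr = n_t * ((m_pr - m_p[:, None] - m_r[None, :] + grand) ** 2).sum()
    ss_tr = n_p * ((m_tr - m_t[:, None] - m_r[None, :] + grand) ** 2).sum()
    resid = (X - m_pt[:, :, None] - m_pr[:, None, :] - m_tr[None, :, :]
             + m_p[:, None, None] + m_t[None, :, None] + m_r[None, None, :] - grand)
    ss_ptr = (resid ** 2).sum()
    # Mean squares
    ms = {
        "p": ss_p / (n_p - 1), "t": ss_t / (n_t - 1), "r": ss_r / (n_r - 1),
        "pt": ss_pt / ((n_p - 1) * (n_t - 1)),
        "pr": ss_pr / ((n_p - 1) * (n_r - 1)),
        "tr": ss_tr / ((n_t - 1) * (n_r - 1)),
        "ptr": ss_ptr / ((n_p - 1) * (n_t - 1) * (n_r - 1)),
    }
    # Variance components from expected mean squares
    v = {"ptr": ms["ptr"]}
    v["pt"] = max((ms["pt"] - ms["ptr"]) / n_r, 0.0)
    v["pr"] = max((ms["pr"] - ms["ptr"]) / n_t, 0.0)
    v["tr"] = max((ms["tr"] - ms["ptr"]) / n_p, 0.0)
    v["p"] = max((ms["p"] - ms["pt"] - ms["pr"] + ms["ptr"]) / (n_t * n_r), 0.0)
    v["t"] = max((ms["t"] - ms["pt"] - ms["tr"] + ms["ptr"]) / (n_p * n_r), 0.0)
    v["r"] = max((ms["r"] - ms["pr"] - ms["tr"] + ms["ptr"]) / (n_p * n_t), 0.0)
    return v
```

Negative component estimates, which can arise from sampling error, are conventionally truncated to zero here; other resolutions exist in the literature.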

Illustration Job target: a high-rank position in a public-sector/local government office. Two exercises, with four assessors. Design: fully crossed p x i x r (persons x exercises x raters).

Results

ANOVA Table
Source     df     SS         MS       Variance   Proport.
P          34     152.036    4.472    1.897      .137
F1         2      .804       .402     .082       .006
F2         3      4.811      1.604    .170       .012
P*F1       68     7.821      .115     .000       .000
P*F2       102    94.564     .927     .000       .000
F1*F2      6      3.211      .535     .000       .000
P*F1*F2    204    2395.664   11.743   11.743     .845

G-coefficients: G = .564, Phi = .550
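The reported coefficients can be reproduced from the variance components in the table. A sketch, assuming F1 indexes the exercises (sampled with n = 2, matching the two-exercise design described above) and F2 the four assessors; relative error variance feeds the G coefficient and absolute error variance the Phi coefficient:

```python
# Variance components from the G-study table
var_p = 1.897        # persons (universe-score variance)
var_f1 = 0.082       # exercises
var_f2 = 0.170       # raters
var_pf1 = 0.000
var_pf2 = 0.000
var_f1f2 = 0.000
var_res = 11.743     # p x f1 x f2 interaction, confounded with error

n1, n2 = 2, 4        # two exercises, four assessors

# Relative error (affects rank ordering) -> generalizability coefficient
var_rel = var_pf1 / n1 + var_pf2 / n2 + var_res / (n1 * n2)
g_coef = var_p / (var_p + var_rel)

# Absolute error (affects absolute score level) -> dependability (Phi)
var_abs = var_rel + var_f1 / n1 + var_f2 / n2 + var_f1f2 / (n1 * n2)
phi_coef = var_p / (var_p + var_abs)

print(f"G = {g_coef:.3f}, Phi = {phi_coef:.3f}")  # G = 0.564, Phi = 0.550
```

Phi is smaller than G because it also charges the main effects of exercises and raters (overall leniency or difficulty) against the measurement.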

D-Studies
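A D-study asks how the coefficient would change under alternative measurement designs. A sketch using the two non-negligible components from the G-study table (person variance 1.897 and residual 11.743; the remaining components are essentially zero), projecting G for hypothetical numbers of exercises and assessors:

```python
var_p, var_res = 1.897, 11.743   # from the G-study table; other components ~ 0

def projected_g(n_tasks: int, n_raters: int) -> float:
    """Project the generalizability coefficient for a D-study design
    with n_tasks exercises and n_raters assessors, assuming only the
    person and residual variance components are non-zero."""
    var_rel = var_res / (n_tasks * n_raters)   # relative error variance
    return var_p / (var_p + var_rel)

# Error averages over more conditions as facets are sampled more heavily,
# so G rises: e.g., doubling the exercises lifts G from about .56 to .72.
for n_t, n_r in [(2, 4), (4, 4), (6, 4), (4, 8)]:
    print(n_t, n_r, round(projected_g(n_t, n_r), 3))
```

This is how G-study results guide design: the projected coefficients show whether adding exercises or adding assessors buys more dependability per unit of cost.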

Validity The G and D coefficients do not speak to the issue of validity. Under the modern theory of validity, evidence needs to be collected with regard to the interpretation and use of the scores. Sources of validity evidence: content, internal structure, relations to other variables, response processes, and the consequences of implementing the Assessment Center program. However, we cannot validate the method unless the results (the numbers) have been shown to be reliable. The two go hand in hand.

Conclusion and Recommendations Generalizability Theory provides an appropriate framework for investigating the technical quality of an Assessment Center, better than the CTT and IRT approaches, although it does not address issues related to validity. These investigations need to be adopted as common practice if we are to perfect our Assessment Center method and provide better, more responsible service to society. While large data sets are available, the association can take an initial step by setting up a Research and Development body within the organization.

THANK YOU