Empowered by Psychometrics: The Fundamentals of Psychometrics. Jim Wollack, University of Wisconsin-Madison

Psycho-what? Psychometrics is the field of study concerned with the measurement of mental and psychological traits, abilities, and attitudes.

Purpose of Session Introduce several key psychometric concepts and gain an appreciation for the theoretical underpinnings of standardized tests. Scales, Norms, and Equating Validity Reliability Test Theory (as time permits) Classical Test Theory Item Response Theory

Poll Question Because of my comfort level with psychometrics, I am ______ to discuss measurement-related issues with faculty and students. A. eager B. willing C. hesitant D. unwilling

SECTION I Scales, Norms, and Equating

Measurement and Scaling Measurement is the process of assigning scores in a systematic and coherent way For purposes of reporting, these scores are often transformed in some way to facilitate interpretations Scaling is the process of constructing a score scale that associates numbers or other ordered indicators with the performance of examinees.

Why Scale? Imagine three students (A, B, and C) take a standardized test Student A answered 20 of 30 items correct. Student B answered 20 of 29 items correct. Student C answered 21 of 30 items correct. What can we say about the achievement level of these three students?

What can we say about A, B, and C? A: 20/30 B: 20/29 C: 21/30 Suppose you learned that the students completed different test forms? the 3 hardest items were all on Test A? the 12 easiest items were all on Test C? The overall difficulty levels of two different tests are rarely identical Even if tests are of equal average difficulty, they may be differently difficult for students at different levels.

The Rationale Behind Scaling Raw scores (number correct scores) depend on the items on the test and do not have consistent meaning across forms. Same is true for percentage correct scores Makes score interpretations very difficult

Score Scales The score scale is the metric which is actually used for purposes of reporting scores to users. Moving from raw scores to the score scale involves either a linear or non-linear transformation The transformed scores themselves are called scaled scores or derived scores

Common Scales

Common Scales [figure: common score scales shown against the normal distribution, with a 16% region marked]

Advantage of Measurement Scales Standardization A scale must not measure differently depending on what it is that's being measured. Pieces, bites, handfuls, and number/percent correct vs. pounds, inches, ºF, level on the construct. Without a standardized reporting metric, direct comparisons are impossible.

Transforming between two scales [figure: two score distributions on a 35–100 scale] Blue test: Mean = 67, St. Dev. = 8.5. Red test: Mean = 72, St. Dev. = 5.7.

Transforming between two scales [figure: the same two distributions] Blue test: Mean = 67, St. Dev. = 8.5. Red test: Mean = 72, St. Dev. = 5.7. Linear transformation, step 1: make the means equal. Add (72 - 67) = 5 to all blue scores.

Transforming between two scales [figure: the distributions after step 1] Blue test: Mean = 72, St. Dev. = 8.5. Red test: Mean = 72, St. Dev. = 5.7. Linear transformation, step 1: make the means equal. Add (72 - 67) = 5 to all blue scores.

Transforming between two scales [figure: the distributions after step 1] Blue test: Mean = 72, St. Dev. = 8.5. Red test: Mean = 72, St. Dev. = 5.7. Linear transformation, step 2: make the standard deviations equal. Multiply all blue scores by (5.7 / 8.5).

Transforming between two scales [figure: the distributions now coincide] Blue test: Mean = 72, St. Dev. = 5.7. Red test: Mean = 72, St. Dev. = 5.7. Linear transformation, step 2: make the standard deviations equal. Multiply all blue scores by (5.7 / 8.5).
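The two steps above amount to a single linear map. A minimal Python sketch follows (the function name and the example score are illustrative, not from the slides); note that in practice the standard-deviation adjustment is applied to deviations from the mean, so that both the target mean and the target standard deviation are reproduced.

```python
# A minimal sketch of the linear scale transformation described above.
# The rescaling acts on deviations from the old mean so that both the
# target mean (72) and target standard deviation (5.7) come out right.

def linear_transform(score, old_mean=67.0, old_sd=8.5, new_mean=72.0, new_sd=5.7):
    """Re-express a blue-test score on the red-test scale by matching means and SDs."""
    z = (score - old_mean) / old_sd      # distance from the old mean, in SD units
    return new_mean + z * new_sd         # the same distance, expressed on the new scale

if __name__ == "__main__":
    print(round(linear_transform(75), 1))   # a blue score of 75 becomes about 77.4
```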

Any set of test scores can easily be transformed to some other metric. This allows for direct norm-referenced comparisons. Candidate A scored 82 on the red test (96th percentile). Candidate B scored 82 on the blue test (96th percentile).

Transforming between two scales [figure: both distributions with Mean = 72 and St. Dev. = 5.7; a score of 82 falls at the 96th percentile]

Transforming between two scales Are these two students comparable? [figure: both distributions, with the score of 82 at the 96th percentile marked]

Poll Question Are these two students comparable? A. Yes B. No C. It cannot be determined

Are two students comparable? We don't know. Students' scaled scores and percentile ranks are relative to other students who completed that same form. If the populations of test takers were different, it is quite likely that the examinees are not of equal ability.

SAT / GRE / GED [figure: a score of 600 marked on three different score scales] Is SAT = 600 the same as GRE = 600?

Norming Perform initial scaling on a single (base) test form (calibration) The sample completing the base form should be large and representative of the target population The sample taking the base form is used as a reference point for purposes of comparison with all subsequent samples Norming

Norm Group The norm group is the group of individuals for whom the test scale was established. The SAT is scaled to have an average of 500; that is, the average score of the norm group was set to 500.

Test Equating Need to transform the data so that the scores candidates receive are the same scores they would have received if they Were part of the normative sample, and Had been administered the base form This process, known as equating, ensures that test scores have identical meaning across administrations, even as items and populations change.

Equating The process of determining the transformation to convert between the raw score metric and the reporting metric (based on the norm group). Equating is a topic worthy of a full-length graduate-level course. It requires comparison across common elements between the base form and the new form: common items, or assumed randomly equivalent populations (a very difficult assumption to make across years).

Simple Equating Design

                              Average   Std. Dev.
Base Test                       500        100
Common Items (on Base form)     495        105
Common Items (on New form)      510         98
New Test                        500        100
New Test (after equating)       515         93

After equating: in much the same way we did before, we can now align these two assessments using only the data from the common items. Add (510 - 495) = 15 to all New Test scores. Multiply all New Test scores by 98 / 105 = 0.93.
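A minimal Python sketch (not part of the original slides) of the adjustment just described. The statistics are the ones in the table; the two steps are combined into a single linear function applied to deviations from the New Test mean, which reproduces the after-equating row (mean 515, SD of about 93).

```python
# A minimal sketch of the common-item adjustment described on the slide.
# Parameter names and defaults come from the table above; the shift and the
# SD-ratio slope are applied to deviations from the New Test mean.

def equate_to_base(new_score,
                   new_mean=500,
                   common_mean_new=510, common_sd_new=98,
                   common_mean_base=495, common_sd_base=105):
    shift = common_mean_new - common_mean_base    # 510 - 495 = 15
    slope = common_sd_new / common_sd_base        # 98 / 105 ~= 0.93
    return (new_mean + shift) + slope * (new_score - new_mean)

if __name__ == "__main__":
    for x in (400, 500, 600):
        print(x, "->", round(equate_to_base(x), 1))   # 421.7, 515.0, 608.3
```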

SECTION I Scales, Norms, and Equating Questions?

SECTION II Validity

Definitions of Validity Formal Definition of Validity: the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of interpretations and actions based on test scores. Less Formal Definition of Validity: the degree to which the inferences made from a test (e.g., that the student knows the material of interest, is competent to practice in the discipline, is likely to be successful in training/school, etc.) are justified and accurate. Informal Definition of Validity: a test is valid if it measures what it's supposed to measure.

Problem with the Informal Definition Informal Definition of Validity: a test is valid if it measures what it's supposed to measure. Tests themselves cannot be valid; tests are valid only for specific purposes (e.g., placement/admissions exams used for exemption, VSA testing with differences in target population). Validity is a matter of degree.

Measuring Educ/Psychological Variables Unlike physical attributes (height, weight, etc.), educational and psychological attributes cannot be directly observed, hence cannot be directly measured. To the extent that a variable is abstract and latent rather than concrete and observable, it is called a construct. In order to move from construct formation to measurement, the construct which is to be measured must first be operationally defined.

Embodiment of a Construct Operationalize What behaviors comprise the construct? How does this construct relate to or distinguish itself from other constructs? Plan How will samples of these behaviors be obtained? Instrument development Develop a standard procedure for obtaining the sample of behaviors from the specified domain. Measurement

Test Validation The degree to which the evidence supports the claim that the assessment measures the intended construct. Three types of validity? Construct Validity, Criterion-Related Validity, Content Validity. No, just one. It's really all about Construct Validity.

Assessing Construct Validity Target group differences Is there a logical differentiation between groups? e.g., placement test math scores for students who completed HS Calculus vs. those who only completed HS Algebra Correlational studies between test and related (or unrelated) measures Convergent validity: Does test correlate with other measures that are theoretically related? ACT and SAT, Compass & Accuplacer, Different IQ tests Divergent validity: Does test fail to correlate with other measures that are theoretically unrelated? ACT and Stanford Binet Math and English placement scores

Assessing Construct Validity Factor Analysis Statistical procedure to empirically test whether performance on observed variables (items) can be explained by a smaller number of unobserved constructs. Dimensionality assessment Does empirical structure match theoretical structure? Can also assess whether clusters of items are related in ways that are expected. Are items that are intended to measure the same subscores (e.g., trigonometry, algebra, etc.) more similar to each other than to other items?
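A minimal sketch of the kind of dimensionality check described above, using scikit-learn's FactorAnalysis. The simulated item scores, the choice of two factors, and all names are illustrative assumptions rather than anything from the slides.

```python
# A minimal sketch of an empirical dimensionality check for construct validity.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n_examinees = 500

# Simulate two clusters of items (e.g., "algebra" items and "trigonometry" items).
algebra = rng.normal(size=(n_examinees, 1))
trig = rng.normal(size=(n_examinees, 1))
items = np.hstack([
    algebra + rng.normal(scale=0.5, size=(n_examinees, 4)),   # 4 algebra-like items
    trig + rng.normal(scale=0.5, size=(n_examinees, 4)),      # 4 trigonometry-like items
])

fa = FactorAnalysis(n_components=2, random_state=0).fit(items)
loadings = fa.components_.T          # rows = items, columns = factors
print(np.round(loadings, 2))         # items 1-4 should load mainly on one factor, items 5-8 on the other
```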

Assessing Construct Validity Content Validity The extent to which the set of items on the test are representative of and relevant to the construct Items should cover the breadth and depth of the construct Weight assigned to each content area should reflect importance of that content area within construct For employment and certification exams, often necessary to conduct a practice analysis Panels of content experts are often utilized to assess relevance of items

Assessing Construct Validity Criterion-Related Validity Examines the relationship of the test results to other variables/criteria external to the test Predictive The extent to which an individual s future level on the criterion can be predicted from prior test performance Correlation between ACT/SAT scores and first year GPA Concurrent The extent to which test scores estimate an individual s present standing on the criterion. Correlation between Prior Learning Assessment and final course grade
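A minimal sketch of estimating a predictive validity coefficient as described above (test score vs. first-year GPA). The data are simulated and every number is illustrative.

```python
# A minimal sketch of a predictive-validity check: correlate test scores
# with a later criterion (here, simulated first-year GPA).
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
test_scores = rng.normal(loc=24, scale=4, size=200)                        # e.g., ACT-like scores
first_year_gpa = 2.0 + 0.04 * test_scores + rng.normal(scale=0.4, size=200)

r, p = pearsonr(test_scores, first_year_gpa)
print(f"validity coefficient r = {r:.2f} (p = {p:.3f})")
```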

SECTION II Validity Questions?

SECTION III Reliability

A Game of Darts Validity: Confidence that the test will hit the bullseye. Reliability: Confidence that any one dart is a good predictor of where the next dart would go (clustering the darts together).

Unreliable

Reliable, but not valid

Reasonably reliable and valid

Highly reliable and valid

Reliability and Validity A test cannot be valid (for any purpose) unless it is reliable. Validity: Confidence that the test will hit the bullseye Not that it will average out to the bullseye

Working Definitions of Reliability The degree to which a test is consistent and stable in measuring what it is intended to measure Measurement repeatability Will an examinee score similarly when administered an independent alternate form of the test administered under the same conditions and with no opportunity for learning (or forgetting)?

Understanding Reliability No two tests will consistently produce identical results. All test scores contain some random error Observed Score = True Score + Random Error = Signal + Noise This equation is often written as X = T + E

What is Random Error? Any non-systematic source of variance that is unrelated to the construct of interest. Examinee-specific factors Motivation Concentration Fatigue Boredom Test-specific factors Specific questions Ambiguous items Memory lapses Carelessness Luck in guessing Clarity of directions Reading load of items Scoring-specific factors Non-uniform scoring Carelessness Computational errors

Formal Definition of Reliability X = T + E A measure of the extent to which an examinee's score reflects their true score (as opposed to random measurement error). Reliability = Variance True / Variance Observed = 1 - Variance Error / Variance Observed. A test with a reliability of .80 contains 20% random error.
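In symbols, with a worked example (the observed-variance value of 100 is an illustrative assumption; the .80 figure is the one used on the slide):

```latex
% Reliability as the proportion of observed-score variance that is true-score variance.
\[
\rho_{XX'} \;=\; \frac{\sigma^2_T}{\sigma^2_X} \;=\; 1 - \frac{\sigma^2_E}{\sigma^2_X}
\qquad\text{e.g., if } \sigma^2_X = 100 \text{ and } \rho_{XX'} = .80,
\text{ then } \sigma^2_T = 80 \text{ and } \sigma^2_E = 20.
\]
```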

Reliability and the SEM If reliability is a measure of the stability of measurement, the standard error of measurement (SEM) provides a measure of the instability of measurement. SEM = (St. Dev.) × √(1 - reliability). The SEM provides a measure of the expected variability in an individual's score (X_i) upon retesting: X_i ± 1 SEM contains the score with 68% probability; X_i ± 2 SEM with 95% probability.
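A worked example of the SEM formula (the SD of 10 and reliability of .80 are assumed values, not from the slide):

```latex
% SEM worked example with illustrative numbers.
\[
\mathrm{SEM} \;=\; \sigma_X \sqrt{1 - \rho_{XX'}}
\qquad\text{e.g., } \sigma_X = 10,\ \rho_{XX'} = .80
\;\Rightarrow\; \mathrm{SEM} = 10\sqrt{.20} \approx 4.5,
\]
so a score of 70 would be expected to fall within roughly $70 \pm 4.5$ (68\%) or $70 \pm 9$ (95\%) on retesting.
```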

Why Care About Reliability? Measurement error is random; its effect on a student's test score is unpredictable. In an unreliable test, students' scores consist largely of measurement error. An unreliable test offers no advantage over randomly assigning test scores to students. Reliability is a necessary precursor to validity.

Estimating Reliability Test-Retest Reliability Administer the same exam to the same group of candidates and correlate the scores Interval should be short enough for no learning, and long enough for no remembering Parallel/Alternate Forms Reliability 1. Develop equivalent forms of the test. 2. Have examinees take both tests. 3. Correlate the scores.

Estimating Reliability: 1 Administration Split-Half Reliability 1. Split the exam into two random halves. 2. Correlate scores across the two halves. 3. Apply a formula to estimate reliability. Internal Consistency Cronbach's Coefficient α, KR-20, KR-21: measures of the extent to which the items throughout a test are homogeneous. α is the average split-half reliability across all possible split-halves. α and KR-20 are lower-bound estimates of reliability.
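A minimal sketch of coefficient α computed directly from its definition on a persons-by-items score matrix (for dichotomous 0/1 items this equals KR-20). The simulated responses and all names are illustrative.

```python
# A minimal sketch of Cronbach's coefficient alpha from an examinee-by-item matrix.
import numpy as np

def cronbach_alpha(scores):
    """scores: 2-D array, rows = examinees, columns = items."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of the total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ability = rng.normal(size=300)
    # 10 dichotomous items whose correctness depends on ability plus noise
    responses = (ability[:, None] + rng.normal(size=(300, 10)) > 0).astype(int)
    print(round(cronbach_alpha(responses), 2))
```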

Reliability In Practice
.90 – 1.00   High-stakes standardized testing
.80 – .90    Subscores or low-stakes tests
.70 – .80
.60 – .70
.50 – .60
.40 – .50
.30 – .40
.20 – .30
.10 – .20
.00 – .10

Improving Reliability Improve item quality Increase the number of points or item alternatives Increase the number of items
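The slide does not name it, but the usual way to quantify the last point (adding comparable items) is the Spearman-Brown prophecy formula; the .70 example below is illustrative.

```latex
% Spearman-Brown prophecy formula (not named on the slide): projected reliability
% when a test is lengthened by a factor of n with comparable items.
\[
\rho_{\text{new}} \;=\; \frac{n\,\rho_{\text{old}}}{1 + (n-1)\,\rho_{\text{old}}}
\qquad\text{e.g., doubling a test with } \rho_{\text{old}} = .70
\;\Rightarrow\; \rho_{\text{new}} = \frac{1.4}{1.7} \approx .82.
\]
```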

SECTION III Reliability Questions? End?

SECTION IV Test Theory Classical Test Theory Item Response Theory

Classical Test Theory X = T + E Person characteristics Total test score serves as a proxy for the examinee's level on the construct. Item characteristics Item difficulty is estimated as the proportion of examinees who answer an item correctly. Item discrimination measures how effectively the item differentiates between high- and low-performing examinees: the correlation between item score (1/0) and total score.
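A minimal sketch of the two classical item statistics just described: difficulty as the proportion correct, and discrimination as the item-total correlation (computed here against the total with the item removed, a common refinement). The simulated response matrix and all names are illustrative.

```python
# A minimal sketch of classical (CTT) item analysis.
import numpy as np

def ctt_item_stats(responses):
    """responses: 2-D 0/1 array, rows = examinees, columns = items."""
    responses = np.asarray(responses, dtype=float)
    difficulty = responses.mean(axis=0)          # p-value (proportion correct) per item
    total = responses.sum(axis=1)
    # correlate each item with the total score computed from the remaining items
    discrimination = np.array([
        np.corrcoef(responses[:, j], total - responses[:, j])[0, 1]
        for j in range(responses.shape[1])
    ])
    return difficulty, discrimination

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ability = rng.normal(size=500)
    responses = (ability[:, None] + rng.normal(size=(500, 8)) > 0).astype(int)
    p, r = ctt_item_stats(responses)
    print(np.round(p, 2), np.round(r, 2))
```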

Item Response Theory Mathematical modeling approach to test scoring and analysis Less intuitive, but more sophisticated approach Solves many problems with CTT Sample-dependency of item/exam statistics Test-dependency of total scores Tough to compare people and items Equal item weighting No good way to account for guessing

Trait Level vs. Prob. Correct Response [figure: probability of a correct response to Item 1 plotted against θ (Examinee Trait Level)]

An Item Characteristic Curve [figure: S-shaped curve of the probability of a correct response against θ (Examinee Trait Level)]

Sample Independent [figure: the same item characteristic curve, probability against θ (Examinee Trait Level), is obtained regardless of the examinee sample]

Item Response Theory Directly models the probability of a candidate getting an item correct based on their overall level on the construct and item characteristics. θ is the person's level on the construct; a_i, b_i, and c_i are item parameters corresponding to the item's discrimination, difficulty, and guessing likelihood.
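The model equation on this slide did not survive transcription; the standard three-parameter logistic (3PL) response function matching the description is:

```latex
% Standard three-parameter logistic (3PL) item response function.
\[
P_i(\theta) \;=\; c_i + (1 - c_i)\,\frac{1}{1 + e^{-a_i(\theta - b_i)}}
\]
% theta: examinee trait level; a_i: discrimination; b_i: difficulty; c_i: guessing (lower asymptote).
```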

Item Difficulty [figure: an item characteristic curve; the probability of a correct response reaches .50 at θ = 1.0, the item's difficulty]

Item Difficulty [figure: two item characteristic curves of different difficulty; at the same trait level, the probability of a correct response is .68 for the easier item and .41 for the harder item]

Item Discrimination [figure: two item characteristic curves of different steepness; over the same range of θ, the flatter curve rises from .41 to .59 (.59 - .41 = .18) while the steeper curve rises from .32 to .68 (.68 - .32 = .36)]

Accounting for Guessing [figure: an item characteristic curve with a nonzero lower asymptote, so that even low-θ examinees have some probability of a correct response]

Putting it all Together [figure: an item characteristic curve reflecting difficulty, discrimination, and guessing; probability against θ (Examinee Trait Level)]

Test Characteristic Curve (TCC) Describes the relationship between total test score and examinee trait level (θ). The TCC is obtained by summing the item characteristic curves of all items on the test, at each value of θ. Each test has its own TCC.
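A minimal sketch of building a TCC by summing 3PL item characteristic curves over a grid of θ values; the three items' parameters are made up for illustration.

```python
# A minimal sketch of a test characteristic curve (TCC) as the sum of ICCs.
import numpy as np

def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)
items = [  # (a = discrimination, b = difficulty, c = guessing)
    (1.2, -1.0, 0.20),
    (0.8,  0.0, 0.25),
    (1.5,  1.0, 0.20),
]

tcc = sum(p_3pl(theta, a, b, c) for a, b, c in items)   # expected total score at each theta
for t, score in zip(theta, tcc):
    print(f"theta = {t:+.1f}  expected score = {score:.2f}")
```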

PT Test Characteristic Curve A form with slightly easier items will shift the TCC to the left, requiring the examinee to answer a greater number of items correctly in order to pass

PT Test Characteristic Curve A form with slightly harder items will shift the TCC to the right, requiring the examinee to answer a smaller number of items correctly in order to pass

3 Hypothetical TCCs [figure: projected test score (0–200) plotted against θ from -3 to 3 for three forms: Easier (top curve), Anchor (middle), Harder (bottom)] IRT is also independent of the characteristics of the specific test form.

IRT Summary Although dealing with raw scores is conceptually appealing, it is problematic in practice. IRT overcomes many of these problems. IRT difficulty and person trait estimates are scaled together. Item and person parameters are properties of the items and people, and do not change across samples or test forms. The majority of programs use IRT scoring and linearly transform θ to the scale of interest.

SECTION IV Test Theory Classical Test Theory Item Response Theory Questions?

Thank you For more information, please contact Jim Wollack University of Wisconsin Madison jwollack@wisc.edu