Reliability Theory for Total Test Scores. Measurement Methods Lecture 7 2/27/2007

Today's Class. Reliability theory; the true-score model; applications of the model. Lecture 7, Psych 892.

Great Moments in Measurement

2006 Rose Bowl Game, January 4, 2006. Vince Young leads Texas to a come-from-behind victory over USC to claim College Football's National Championship. He subsequently declares for the NFL draft.

Wonderlic Test. From Wikipedia: The Wonderlic Personnel Test (often referred to as Wunderlich) is an intelligence test primarily known for being administered to prospective players in the National Football League since the 1970s. The Wonderlic is a twelve-minute, fifty-question exam to assess aptitude for learning a job and adapting to solve problems for employees in a wide range of occupations. The score is calculated as the number of correct answers given in the allotted time. A score of 20 is intended to indicate average intelligence (corresponding to an intelligence quotient of 100). It is rumored that at least one player has scored a 1 on the test.

What About Reliability? From http://cps.nova.edu/~cpphelp/wpt.html: Description: The Wonderlic Personnel Test (WPT), so named to reduce the possibility that job applicants will think they are taking an intelligence test, was originally a revision of the Otis Self-Administering Tests of Mental Ability. The WPT is a 50-item, 12-minute omnibus test of intelligence. The items and the order in which they are presented provide a broad range of problem types (e.g., analogies, analysis of geometric figures, disarranged sentences, definitions) intermingled and arranged to become increasingly difficult. The WPT exists in 16 forms and was designed for testing adult job applicants in business and industrial situations. Scoring: The WPT yields one final score, which is the sum of correct answers. Reliability: The manual reports odd-even reliabilities, which are not appropriate for speeded tests; however, it also reports test-retest reliabilities of .82 to .94 and interform reliabilities of .73 to .95.

Reliability Theory

Basic Motivation. The basic motivation for classical true-score theory is to provide a workable method for estimating the precision of measurement of a test score. The test score is the focus of this method. Historically, it was one of the first measurement theories to be developed.

Measurement by Analogy. To begin our class, consider the measurement of a physical trait: length. We take a ruler, tape measure, or similar tool and use it to determine the length of an object. If we wish to estimate the amount of error in our measurement, how would we proceed?

Multiple Measurements. To estimate the error of measurement of the object's length, we must take multiple measurements. The mean of our measurements is the best estimate of the object's length. The standard deviation is the best estimate of the error in the measuring process.
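As a minimal sketch of this idea, the Python snippet below summarizes five repeated measurements of one object's length with their mean and sample standard deviation. The readings are invented for illustration.

```python
import statistics

# Hypothetical repeated measurements (in cm) of the same object's length.
measurements = [30.2, 29.9, 30.1, 30.0, 29.8]

# The mean of the replications estimates the object's length.
length_estimate = statistics.mean(measurements)

# The sample standard deviation estimates the error of the measuring process.
error_estimate = statistics.stdev(measurements)

print(f"Estimated length: {length_estimate:.2f} cm")
print(f"Estimated measurement error (SD): {error_estimate:.3f} cm")
```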

Assumptions of Our Procedure. To use the mean as our estimate of length and the SD as our estimate of measurement error, we must make several assumptions. What do you think we are assuming?

Assumptions. Replications are independent trials, which makes the errors on different trials independent of each other. The measurement instrument contains no source of constant error. The length of the object does not change over the time we take the measurements.
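The next sketch shows why the no-constant-error assumption matters. It uses hypothetical readings and adds the same 0.5 cm offset to every one, as a miscalibrated ruler would. The offset shifts the mean but leaves the SD unchanged. The spread of the replications therefore cannot detect constant error.

```python
import statistics

# Hypothetical readings (in cm) from an accurate ruler.
accurate = [30.2, 29.9, 30.1, 30.0, 29.8]

# The same readings from a hypothetical ruler whose zero mark is 0.5 cm off:
# every reading carries the same constant error.
biased = [x + 0.5 for x in accurate]

# The constant error shifts the mean by 0.5 cm...
print(statistics.mean(accurate), statistics.mean(biased))

# ...but the SD is unchanged, so the replications cannot reveal the bias.
print(statistics.stdev(accurate), statistics.stdev(biased))
```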

Transitioning to Psychological Measurement. As soon as we move from our example to administering psychological tests, our task becomes much more difficult. Replication, as a first example, is much harder. Can you envision having to take our midterm 10 times just to get an estimate of the error in our measurement?

Problems with Replication. There are further problems with repeated administrations of psychological tests to the same examinee(s). Do the replications constitute independent trials? That is, do they yield uncorrelated errors? More exposures to the test will lead to stereotyped responses.

Problems with Replication, continued. How much time should be allowed to elapse between measurements? Are the psychological attributes constant over time? Few psychological attributes are constant enough to be considered traits; many may be more appropriately called psychological states. As an example, consider mood.

Additional Concerns. When it comes to length, it is fairly well understood that the number we arrive at with a tape measure or ruler represents the length of an object. When it comes to psychological attributes, the number we arrive at is not guaranteed to represent the attribute we intend to measure.

Psychological Attribute Measures. To resolve the issue of what a score may represent psychologically, we consider three distinct (yet interrelated) concepts:
1. Reliability: the precision with which the test score measures the attribute.
2. Validity: the extent to which the test measures the attribute it was designed to measure.
3. Generalizability: the extent to which the composite test score generalizes beyond the specific items chosen to form the composite, to the domain of further indicators that might have been used.

Reliability: Where We Are Going. You may feel that there is no solution to the general problem of estimating the precision of measurement of a test score. We will postpone the conceptual issues of reliability by treating the classical true-score model as a piece of pure mathematics. This lets us illustrate the model, its assumptions, and when it can be applied.

The True-Score Model for Test Scores

Preliminaries. Before introducing the true-score model, imagine that we sample a single examinee at random from a population of interest. We administer a test of $m$ items and form the total number-right (or number-keyed) score, $Y$.
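Here is a minimal sketch of forming the number-right score $Y$ for one examinee. Each item whose response matches the keyed answer adds 1 to the total. The function name `number_right`, the 5-item key, and the responses are all hypothetical.

```python
def number_right(responses, key):
    """Total number-right (or number-keyed) score Y for one examinee.

    responses and key are sequences of length m, one entry per item;
    an item adds 1 to Y when the response matches the keyed answer.
    """
    if len(responses) != len(key):
        raise ValueError("responses and key must have the same length m")
    return sum(1 for r, k in zip(responses, key) if r == k)


# Hypothetical 5-item test (m = 5) and one examinee's answers.
key = ["B", "D", "A", "C", "A"]
responses = ["B", "D", "C", "C", "B"]
print(number_right(responses, key))  # 3
```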

Classical True-Score Model. The classical true-score model hypothesizes that the total score consists of two components: a portion representing the true score and a portion representing the error of measurement. Mathematically, $$Y = T + E,$$ where $T$ is the true score and $E$ is the error score.

Demonstration of the True-Score Model. To show the true-score model, Table 5.1 (p. 64) lists the results of a simulation demonstrating the sampling process for 10 examinees. These numbers were drawn from distributions with a certain mean and a certain variance for $T$ and for $E$. We will now simulate our own numbers for examinees to show how the process works, using R. For the moment we will work with $Y$ (not $Y'$).
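The lecture runs this simulation in R. As a rough stand-in, here is a minimal Python sketch of the same sampling process for 10 examinees. The means and standard deviations are hypothetical choices, not the values behind Table 5.1. Both $T$ and $E$ are assumed to be normally distributed. Each examinee gets a true score, an independently drawn error, and the observed score $Y = T + E$.

```python
import random

# Hypothetical population parameters; these are NOT the values behind Table 5.1.
MEAN_T, SD_T = 30.0, 5.0   # true-score distribution
MEAN_E, SD_E = 0.0, 3.0    # error distribution, centered at zero


def simulate_examinees(n, seed=None):
    """Draw n examinees as (T, E, Y) tuples with Y = T + E."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        t = rng.gauss(MEAN_T, SD_T)
        e = rng.gauss(MEAN_E, SD_E)  # drawn independently of t
        rows.append((t, e, t + e))
    return rows


for i, (t, e, y) in enumerate(simulate_examinees(10, seed=1), start=1):
    print(f"Examinee {i:2d}: T = {t:6.2f}  E = {e:6.2f}  Y = {y:6.2f}")
```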

Properties of $T$ and $E$. 1. $T$ and $E$ are measured on the scale of $Y$: they are bounded within the range of $Y$ and have the same floor and ceiling. 2. $T$ and $E$ are uncorrelated, $\rho_{TE} = 0$. In our example, this means that $E$ is chosen independently of $T$.

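Properties of $T$ and $E$, continued. 3. The variance of $Y$ is the sum of the variances of $T$ and $E$: $$\sigma_Y^2 = \sigma_T^2 + \sigma_E^2.$$ This can be shown by the algebra of expectations: $$\operatorname{Var}(Y) = \operatorname{Var}(T + E) = \operatorname{Var}(T) + \operatorname{Var}(E) + 2\operatorname{Cov}(T, E) = \operatorname{Var}(T) + \operatorname{Var}(E),$$ where the last step uses $\operatorname{Cov}(T, E) = 0$ (property 2).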
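The following sketch checks properties 2 and 3 numerically. It assumes hypothetical parameters ($\sigma_T = 5$, $\sigma_E = 3$) and normal draws. In any sample, $\operatorname{Var}(T+E) = \operatorname{Var}(T) + \operatorname{Var}(E) + 2\operatorname{Cov}(T,E)$ holds exactly, up to floating-point rounding. Because $T$ and $E$ are drawn independently, the covariance term comes out near zero and $\operatorname{Var}(Y)$ lands near $25 + 9 = 34$.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def cov(xs, ys):
    """Covariance with divisor n (cov(xs, xs) is the variance)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(42)
n = 100_000
# Hypothetical parameters: sigma_T = 5, sigma_E = 3.
T = [rng.gauss(30.0, 5.0) for _ in range(n)]
E = [rng.gauss(0.0, 3.0) for _ in range(n)]  # independent of T
Y = [t + e for t, e in zip(T, E)]

var_T, var_E, var_Y = cov(T, T), cov(E, E), cov(Y, Y)
cov_TE = cov(T, E)

# Algebraic identity, true in every sample (up to rounding).
print(var_Y, var_T + var_E + 2 * cov_TE)

# Independence makes cov_TE close to 0, so var_Y is close to var_T + var_E.
print(cov_TE, var_T + var_E)
```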

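Properties of $T$ and $E$, continued. 4. The variances of $T$ and $E$ are each at most equal to the variance of $Y$: $$\sigma_T^2 \le \sigma_Y^2 \quad\text{and}\quad \sigma_E^2 \le \sigma_Y^2.$$ This follows from property 3, because neither variance can be negative.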

Properties of $T$ and $E$, continued. 5. The ratio of the variance of $T$ to the variance of $Y$, $$\rho = \frac{\sigma_T^2}{\sigma_Y^2} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2},$$ is bounded by zero and one. By definition, this is called the reliability coefficient of $Y$.
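The definition can be written as a small helper. The function name `reliability` and the example variances are hypothetical. The two edge cases show where the bounds of zero and one come from.

```python
def reliability(var_true, var_error):
    """Reliability coefficient sigma_T^2 / (sigma_T^2 + sigma_E^2)."""
    if var_true < 0 or var_error < 0:
        raise ValueError("variances must be non-negative")
    total = var_true + var_error  # sigma_Y^2, by property 3
    if total == 0:
        raise ValueError("total variance must be positive")
    return var_true / total


# Hypothetical variances: sigma_T^2 = 25, sigma_E^2 = 9.
print(reliability(25, 9))  # 25 / 34, about 0.735
print(reliability(25, 0))  # no error variance: reliability 1.0
print(reliability(0, 9))   # no true-score variance: reliability 0.0
```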

Further Information. The properties of the classical true-score model are, by themselves, relatively uninformative. To expand our example, suppose we obtain, from the same sample of examinees, a second total test score, called $Y'$.

The Second Score. You can envision generating the second score for each examinee by keeping the same $T$ for each person. The error score, $E'$, however, would be drawn independently (but with the same error variance, $\sigma_E^2$). The classical true-score formulation for the new total test score is then $$Y' = T + E'.$$

Simulated Data. Using our former parameters, we can simulate the data for $Y'$ using a process similar to that used for $Y$. We keep the same $T$ for each simulated examinee and draw a new $E'$ for each examinee.
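A minimal Python sketch of this step follows (the lecture itself uses R). It assumes hypothetical normal distributions with $\sigma_T = 5$ and $\sigma_E = 3$. Each simulated examinee keeps one true score. The two observed scores differ only through independently drawn errors that share the same standard deviation.

```python
import random

# Hypothetical parameters: sigma_T = 5, sigma_E = 3.
MEAN_T, SD_T, SD_E = 30.0, 5.0, 3.0


def simulate_parallel(n, seed=None):
    """Return lists T, Y, Y_prime with Y = T + E and Y' = T + E'.

    Each examinee keeps one true score T; the errors E and E' are
    drawn independently with the same standard deviation SD_E.
    """
    rng = random.Random(seed)
    T = [rng.gauss(MEAN_T, SD_T) for _ in range(n)]
    Y = [t + rng.gauss(0.0, SD_E) for t in T]        # first score
    Y_prime = [t + rng.gauss(0.0, SD_E) for t in T]  # same T, new error
    return T, Y, Y_prime


T, Y, Y_prime = simulate_parallel(10, seed=7)
for t, y, yp in zip(T, Y, Y_prime):
    print(f"T = {t:6.2f}  Y = {y:6.2f}  Y' = {yp:6.2f}")
```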

More Properties. For each examinee, $Y$ and $Y'$ share the same randomly drawn $T$ value, but each has its own independently drawn error, $E$ and $E'$ respectively. By construction, $E$ and $E'$ are uncorrelated with $T$ and with each other.

Independent Error Implications. Because the error terms are independent, we get the following result: $$\rho_{TE} = \rho_{TE'} = \rho_{EE'} = 0.$$ That is, each pair of these elements is uncorrelated.

More Implications. Another property we set was that the variances of $E$ and $E'$ are equal: $$\sigma_{E'}^2 = \sigma_E^2.$$ It follows that the variance of $Y'$ equals the variance of $Y$: $$\sigma_{Y'}^2 = \sigma_Y^2.$$

Variance Formation Practice. Show that $\sigma_{Y'}^2 = \sigma_Y^2$.
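One way to work the exercise uses only the properties stated above. Apply the variance-of-a-sum rule to $Y' = T + E'$, drop the covariance term, and substitute the equal error variances:

$$
\begin{aligned}
\sigma_{Y'}^2 &= \operatorname{Var}(T + E') \\
&= \sigma_T^2 + \sigma_{E'}^2 + 2\operatorname{Cov}(T, E') \\
&= \sigma_T^2 + \sigma_{E'}^2 && (\rho_{TE'} = 0) \\
&= \sigma_T^2 + \sigma_E^2 && (\sigma_{E'}^2 = \sigma_E^2) \\
&= \sigma_Y^2 && \text{(property 3)}
\end{aligned}
$$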

More Properties. A further property of the two tests now follows: $$\rho_{YY'} = \frac{\sigma_T^2}{\sigma_Y^2} = \rho,$$ where $\rho_{YY'}$ is the correlation between $Y$ and $Y'$. This has important consequences: $\rho$ can now be computed from observations, so reliability can be estimated from finite samples.
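This sketch illustrates why the result matters. It simulates a large hypothetical sample of parallel scores ($\sigma_T = 5$, $\sigma_E = 3$, normal draws) and compares the correlation of the two observed scores with the theoretical reliability $25/34$. The estimate never touches $T$. The Pearson correlation is computed by hand so the sketch does not depend on a particular Python version.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical parameters: sigma_T = 5, sigma_E = 3, so rho = 25 / 34.
rng = random.Random(2007)
n = 50_000
T = [rng.gauss(30.0, 5.0) for _ in range(n)]
Y = [t + rng.gauss(0.0, 3.0) for t in T]
Y_prime = [t + rng.gauss(0.0, 3.0) for t in T]

rho_theory = 25 / 34
rho_hat = pearson(Y, Y_prime)  # uses only the observable scores
print(f"theoretical {rho_theory:.3f}, estimated r(Y, Y') {rho_hat:.3f}")
```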

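How Does That Happen? A variance ratio does not usually equal a correlation. In our case, it is easy to show why it does: $$\operatorname{Cov}(Y, Y') = \operatorname{Cov}(T + E, T + E') = \sigma_{TT} + \sigma_{TE'} + \sigma_{TE} + \sigma_{EE'} = \sigma_T^2,$$ because $\sigma_{TT} = \sigma_T^2$ and the three cross terms are zero. Dividing by $\sigma_Y \sigma_{Y'} = \sigma_Y^2$ then gives $\rho_{YY'} = \sigma_T^2 / \sigma_Y^2$.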

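More Connections. We note that $$\rho = \rho_{YT}^2 = \rho_{Y'T}^2.$$ The reliability coefficient is the square of the correlation between $Y$ and $T$ (or between $Y'$ and $T$). This follows because $\operatorname{Cov}(Y, T) = \operatorname{Cov}(T + E, T) = \sigma_T^2$, so $\rho_{YT} = \sigma_T^2 / (\sigma_Y \sigma_T) = \sigma_T / \sigma_Y$.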
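In a simulation the true scores are visible, so this identity can be checked directly. The sketch below assumes the same kind of hypothetical setup ($\sigma_T = 5$, $\sigma_E = 3$, normal draws). Both squared observed-true correlations should land close to $25/34$.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical parameters: sigma_T = 5, sigma_E = 3, so rho = 25 / 34.
rng = random.Random(892)
n = 50_000
T = [rng.gauss(30.0, 5.0) for _ in range(n)]
Y = [t + rng.gauss(0.0, 3.0) for t in T]
Y_prime = [t + rng.gauss(0.0, 3.0) for t in T]

rho_theory = 25 / 34
r2_YT = pearson(Y, T) ** 2         # squared correlation of Y with T
r2_YpT = pearson(Y_prime, T) ** 2  # squared correlation of Y' with T
print(rho_theory, r2_YT, r2_YpT)
```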

Wrapping Up. Today we scratched the surface of reliability concepts, using classical true-score theory. We will build upon these concepts next time.

Next Time. More of Chapter 5: Reliability Theory for Total Test Scores.