Reliability Theory for Total Test Scores
Measurement Methods, Lecture 7
2/27/2007

Today's Class
Reliability theory.
The true-score model.
Applications of the model.
Lecture 7 Psych 892

Great Moments in Measurement

2006 Rose Bowl Game
January 4, 2006. Vince Young leads Texas to a come-from-behind victory over USC to claim College Football's National Championship. He subsequently declares for the NFL draft.

Wonderlic Test
From Wikipedia: The Wonderlic Personnel Test (often referred to as Wunderlich) is an intelligence test primarily known for being administered to prospective players in the National Football League since the 1970s. The Wonderlic is a twelve minute, fifty question exam to assess aptitude for learning a job and adapting to solve problems for employees in a wide range of occupations. The score is calculated as the number of correct answers given in the allotted time. A score of 20 is intended to indicate average intelligence (corresponding to an intelligence quotient of 100). It is rumored that at least one player has scored a 1 on the test.

What About Reliability?
From http://cps.nova.edu/~cpphelp/wpt.html:
Description: The Wonderlic Personnel Test (WPT), so named to reduce the possibility that job applicants will think they are taking an intelligence test, was originally a revision of the Otis Self-Administering Tests of Mental Ability. The WPT is a 50-item, 12-minute omnibus test of intelligence. The items and the order in which they are presented provide a broad range of problem types (e.g., analogies, analysis of geometric figures, disarranged sentences, definitions) intermingled and arranged to become increasingly difficult. The WPT exists in 16 forms, and was designed for testing adult job applicants in business and industrial situations.
Scoring: The WPT yields one final score which is the sum of correct answers.
Reliability: The manual reports odd-even reliabilities, which are not appropriate for speeded tests; however, it also reports test-retest reliabilities of .82 to .94, and interform reliabilities of .73 to .95.

Reliability Theory

Basic Motivation
The basic motivation for classical true-score theory is to provide a workable method for estimating the precision of measurement of a test score. The test score is the focus of this method. It was one of the first such methods to be developed.

Measurement By Analogy
To begin our class, consider the process of measurement of a physical trait: length. We take our ruler/tape measure/whatever and use it to come up with the length of an object. If we wish to estimate the amount of error in our measurement, how would we proceed?

Multiple Measurements
If we wish to estimate the error of measurement of the length of the object, we must take multiple measurements. The mean of our measurements is the best estimate of the object's length. The standard deviation is the best estimate of the error in the measuring process.
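As a minimal sketch of this idea, the Python snippet below computes the mean and the sample standard deviation of a set of repeated measurements. The eight values are hypothetical numbers chosen for illustration; they do not come from the lecture.

```python
import statistics

# Hypothetical repeated measurements of one object's length, in centimetres.
measurements = [100.2, 99.8, 100.1, 99.9, 100.3, 99.7, 100.0, 100.0]

# The mean is the estimate of the object's length.
length_estimate = statistics.mean(measurements)

# The sample standard deviation estimates the error in the measuring process.
error_estimate = statistics.stdev(measurements)

print(f"estimated length: {length_estimate:.2f} cm")
print(f"estimated measurement error (SD): {error_estimate:.2f} cm")
```

The estimates are only as good as the assumptions behind them: independent trials, no constant error in the instrument, and an object whose length does not change.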

Assumptions of Our Procedure
To have the mean as our estimate of length and the SD as our estimate of measurement error, there are several assumptions we must make. What do you think we are assuming?

Assumptions
Replications are independent trials, which makes the errors for each trial independent of each other.
The measurement instrument contains no source of constant error.
The length of the object does not change over the time we take the measurements.

Transitioning to Psychological Measurement
As soon as we move from our example to what we do in administering psychological tests, we see that our task is much more difficult. Replication, as a first example, becomes much more difficult. Can you envision having to take our midterm 10 times just to get an estimate of the error in our measurement?

Problems with Replication
There are further problems with repeated administrations of psychological tests to the same examinee(s). Do the replications constitute independent trials? Hence, do the results yield uncorrelated errors? More exposures to the test will lead to stereotyped responses.

Problems with Replication
How much time should be allowed to lag between measurements? Are the psychological attributes constant over time? Few psychological attributes are constant enough to be considered traits. It may be more appropriate to call many of these psychological states. As an example, consider mood.

Additional Concerns
When it comes to length, it is fairly well understood that the number we arrive at based on a tape measure or ruler will represent the length of an object. When it comes to psychological attributes, the number we arrive at is not guaranteed to represent the attribute we intend to measure.

Psychological Attribute Measures
To resolve the issue of what a score may represent psychologically, we consider three distinct (yet interrelated) concepts:
1. Reliability: the precision with which the test score measures the attribute.
2. Validity: the extent to which the test measures the attribute it was designed to measure.
3. Generalizability: the extent to which the composite test score generalizes beyond the specific items chosen to form the composite, to the domain of further indicators that might have been used.

Reliability: Where We Are Going
You may feel that there is no solution to the general problem: estimating the precision of measurement of a test score. We will postpone the conceptual issues of reliability by treating the classical true-score model as a piece of pure mathematics. In doing this, we will be able to illustrate the model, its assumptions, and when it can be applied.

The True-Score Model for Test Scores

Preliminaries
Prior to introducing the true-score model, we introduce the following. Imagine we sample a single examinee at random from a population of interest. We administer a test of m items. We form the total number-right (or number-keyed) score, Y.

Classical True-Score Model
The classical true-score model hypothesizes that the total score consists of two components: a portion representing the true score, and a portion representing the error of measurement. The model can be expressed mathematically:
$Y = T + E$
Here T is the true score and E is the error score.

Demonstration of True Score Model
To show the true-score model, Table 5.1 (p. 64) lists the results of a simulation to demonstrate the sampling process for 10 examinees. These numbers were drawn from a distribution with a certain mean for T and E, and a certain variance for T and E. We will now simulate our own numbers for examinees to show how the process works. We will be using R. For the moment we will work with Y (not Y').
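The sampling process can be sketched as follows. The lecture works in R; this is a Python stand-in using only the standard library, and it is an illustration rather than a reproduction of Table 5.1. The means and standard deviations are hypothetical, since the slides do not state the values used, and drawing from normal distributions ignores the floor and ceiling of a real test score.

```python
import random

rng = random.Random(892)  # fixed seed so the run is repeatable

# Hypothetical parameters; the lecture does not state the values it used.
MEAN_T, SD_T = 50.0, 10.0   # true-score distribution
MEAN_E, SD_E = 0.0, 5.0     # error-score distribution


def simulate_examinees(n):
    """Return a list of (T, E, Y) triples with Y = T + E."""
    rows = []
    for _ in range(n):
        t = rng.gauss(MEAN_T, SD_T)   # true score for this examinee
        e = rng.gauss(MEAN_E, SD_E)   # error score, drawn independently of T
        rows.append((t, e, t + e))    # observed score is the sum
    return rows


examinees = simulate_examinees(10)
for t, e, y in examinees:
    print(f"T = {t:6.2f}   E = {e:6.2f}   Y = {y:6.2f}")
```

In practice only the Y column would be observed; T and E are visible here only because we generated them ourselves.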

Properties of T and E
1. T and E are measured on the scale of Y. They are bounded within the range of Y. They have the same floor and ceiling.
2. T and E are uncorrelated:
$\rho_{TE} = 0$
In our example, this means that E is chosen independently of T.

Properties of T and E
3. The variance of Y is the sum of the variances of T and E:
$\sigma_Y^2 = \sigma_T^2 + \sigma_E^2$
This can be shown by the algebra of expectations, using Cov(T,E) = 0:
Var(Y) = Var(T+E) = Var(T) + Var(E) + 2Cov(T,E) = Var(T) + Var(E)
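A simulation can make the variance decomposition concrete. The sketch below is Python with hypothetical parameter values (true-score variance 100, error variance 25) and a large sample. In any finite sample the identity Var(Y) = Var(T) + Var(E) + 2Cov(T,E) holds exactly, and the covariance term is small because T and E are drawn independently.

```python
import random
import statistics

rng = random.Random(7)
n = 20000  # large sample so that sample moments sit close to population values

# Hypothetical parameters: Var(T) = 100, Var(E) = 25.
T = [rng.gauss(50.0, 10.0) for _ in range(n)]
E = [rng.gauss(0.0, 5.0) for _ in range(n)]
Y = [t + e for t, e in zip(T, E)]


def covariance(x, y):
    """Sample covariance with the n - 1 denominator."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


var_t = statistics.variance(T)
var_e = statistics.variance(E)
var_y = statistics.variance(Y)
cov_te = covariance(T, E)

print(f"Var(T) = {var_t:.2f}, Var(E) = {var_e:.2f}, Cov(T,E) = {cov_te:.2f}")
print(f"Var(Y) = {var_y:.2f}")
print(f"Var(T) + Var(E) + 2Cov(T,E) = {var_t + var_e + 2 * cov_te:.2f}")
```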

Properties of T and E
4. The variances of T and E are each less than or equal to the variance of Y:
$\sigma_T^2 \le \sigma_Y^2$ and $\sigma_E^2 \le \sigma_Y^2$

Properties of T and E
5. The ratio of the variance of T to the variance of Y is
$\rho_r = \frac{\sigma_T^2}{\sigma_Y^2} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}$
This term is bounded by zero and one. By definition this is called the reliability coefficient of Y.
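As a worked example with hypothetical values (not taken from the lecture), suppose the true-score variance is 100 and the error variance is 25. Writing the reliability coefficient as $\rho_r$:

```latex
\[
\rho_r \;=\; \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}
       \;=\; \frac{100}{100 + 25}
       \;=\; 0.80
\]
```

Because both variances are non-negative, the ratio is 0 when the score is all error ($\sigma_T^2 = 0$) and 1 when there is no error ($\sigma_E^2 = 0$).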

Further Information
The properties of the classical true-score model by themselves are relatively uninformative. To further expand our example, consider what happens if we obtain, from our same sample of examinees, a second total test score, called Y'.

The Second Score
You can envision generating the second score for each examinee by keeping the same T for each person. The error score, E', however, would be independently drawn (but it would have the same error variance, $\sigma_E^2$). The classical true-score theory formulation for the new total test score would then be:
$Y' = T + E'$

Simulated Data
Using our former parameters, we can simulate the data for Y', using a process similar to that used for Y. We use the same T for each simulated examinee. We draw a new E' for each examinee.
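A sketch of this second draw, again in Python rather than the R used in class, and again with hypothetical parameters. Each simulated examinee keeps one true score T and receives two independent error draws, giving a first score Y and a second score Y' (named `Y_prime` in the code).

```python
import random

rng = random.Random(2007)  # fixed seed so the run is repeatable

# Hypothetical parameters; the lecture does not state the values it used.
MEAN_T, SD_T = 50.0, 10.0
SD_E = 5.0  # the same error SD is used for both E and E'

rows = []
for _ in range(10):
    t = rng.gauss(MEAN_T, SD_T)       # one true score per examinee
    e = rng.gauss(0.0, SD_E)          # error for the first score
    e_prime = rng.gauss(0.0, SD_E)    # independent error for the second score
    rows.append({"T": t, "Y": t + e, "Y_prime": t + e_prime})

for row in rows:
    print(f"T = {row['T']:6.2f}   Y = {row['Y']:6.2f}   Y' = {row['Y_prime']:6.2f}")
```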

More Properties
For each examinee, Y and Y' have the same randomly drawn T value. Each has its own independently drawn error, E for Y and E' for Y'. By construction, E and E' are uncorrelated with T and with each other.

Independent Error Implications
Because of the independence of error terms, we get the following result:
$\rho_{TE} = \rho_{TE'} = \rho_{EE'} = 0$
The correlation between each of these elements is zero.

More Implications
Another property we set was for the variances of E and E' to be equal:
$\sigma_{E'}^2 = \sigma_E^2$
What follows from this result is that the variance of Y' is equal to the variance of Y:
$\sigma_{Y'}^2 = \sigma_Y^2$

Variance Formation Practice
Show that $\sigma_{Y'}^2 = \sigma_Y^2$.
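One way to work the exercise, sketched from the properties stated so far: the second score is $Y' = T + E'$, the error $E'$ is uncorrelated with $T$, and $E'$ has the same variance as $E$.

```latex
\begin{align*}
\sigma_{Y'}^2 &= \operatorname{Var}(T + E') \\
              &= \sigma_T^2 + \sigma_{E'}^2 + 2\,\sigma_{TE'} \\
              &= \sigma_T^2 + \sigma_{E'}^2 && \text{since } \sigma_{TE'} = 0 \\
              &= \sigma_T^2 + \sigma_E^2    && \text{since } \sigma_{E'}^2 = \sigma_E^2 \\
              &= \sigma_Y^2                 && \text{by property 3.}
\end{align*}
```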

More Properties
A further property of the two tests now follows:
$\rho_{YY'} = \rho_r = \frac{\sigma_T^2}{\sigma_Y^2}$
Note that $\rho_{YY'}$ is the correlation between Y and Y'. This holds important consequences: $\rho_r$ can now be computed from observations, and reliability can be estimated from finite samples.

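How Does That Happen?
It is not usually the case that a variance ratio will equal a correlation. In our case, it is easy to show why:
$\mathrm{Cov}(Y, Y') = \mathrm{Cov}(T + E, T + E') = \sigma_{TT} + \sigma_{TE} + \sigma_{TE'} + \sigma_{EE'} = \sigma_T^2$
The three covariances involving error terms are zero, and $\sigma_{TT}$ is simply $\sigma_T^2$. Because Y and Y' have equal variances, dividing by $\sigma_Y \sigma_{Y'} = \sigma_Y^2$ gives $\rho_{YY'} = \sigma_T^2 / \sigma_Y^2$.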
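The result can be checked numerically. The sketch below, in Python with hypothetical variances of 100 for T and 25 for each error term, simulates two scores that share a true score and have independent errors. It then compares their sample correlation with the variance ratio, which is the reliability coefficient.

```python
import math
import random
import statistics

rng = random.Random(36)
n = 20000  # large sample so the estimate sits close to the population value

# Hypothetical population variances (not from the lecture).
VAR_T, VAR_E = 100.0, 25.0

T = [rng.gauss(50.0, math.sqrt(VAR_T)) for _ in range(n)]
Y = [t + rng.gauss(0.0, math.sqrt(VAR_E)) for t in T]         # Y  = T + E
Y_prime = [t + rng.gauss(0.0, math.sqrt(VAR_E)) for t in T]   # Y' = T + E'


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


reliability_true = VAR_T / (VAR_T + VAR_E)   # variance ratio from the model
reliability_est = pearson(Y, Y_prime)        # correlation of the two scores

print(f"variance ratio:        {reliability_true:.3f}")
print(f"correlation of Y, Y':  {reliability_est:.3f}")
```

The estimate uses only the observed scores, never T itself, which is what makes the result useful in practice.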

More Connections
We note that:
$\rho_r = \rho_{YT}^2 = \rho_{Y'T}^2$
The reliability coefficient is the square of the correlation between Y and T, or equivalently between Y' and T.
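A short derivation of this connection, using only the properties already stated (T and E uncorrelated, and the reliability coefficient $\rho_r$ defined as the ratio of true-score variance to observed-score variance):

```latex
\begin{align*}
\operatorname{Cov}(Y, T) &= \operatorname{Cov}(T + E,\, T)
                          = \sigma_T^2 + \sigma_{TE}
                          = \sigma_T^2 \\[4pt]
\rho_{YT} &= \frac{\operatorname{Cov}(Y, T)}{\sigma_Y\,\sigma_T}
           = \frac{\sigma_T^2}{\sigma_Y\,\sigma_T}
           = \frac{\sigma_T}{\sigma_Y} \\[4pt]
\rho_{YT}^2 &= \frac{\sigma_T^2}{\sigma_Y^2} = \rho_r
\end{align*}
```

The same steps with $Y' = T + E'$ in place of $Y$ give $\rho_{Y'T}^2 = \rho_r$, since $\sigma_{Y'}^2 = \sigma_Y^2$.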

Wrapping Up
Today, we scratched the surface of concepts about reliability. To do so, we used classical true-score theory. We will build upon these concepts next time.

Next Time
More of Chapter 5: Reliability Theory for Total Test Scores.