Psychometrics in context: Test Construction with IRT. Professor John Rust University of Cambridge

1 Psychometrics in context: Test Construction with IRT Professor John Rust University of Cambridge

2 Plan
Guttman scaling
Guttman errors and Loevinger's H statistic
Non-parametric IRT
Traces in Stata
Parametric IRT
IRT in Mplus

3 Guttman Scaling
Guttman, L. (1944) A basis for scaling qualitative data. American Sociological Review, 9.
Binary items are ranked in some order (e.g. difficulty).
Agreement with an item implies agreement with items of a lower order.
E.g. there is no need to find out whether a weightlifter who lifts 200 kg can also lift 150 kg, or whether one who multiplies 453 by 234 can multiply 6 by 3.
For a rational respondent, a single index on the scale is enough to describe the whole response pattern:
RRRRRRRRRRWWWWWWWW
AAAAAADDDDDDDDDDDDDDD

4 Guttman Errors
In practice, the pattern is more likely to be:
RRRRRRWRRWWRWWWWWWWWW
The out-of-order responses (set in bold on the original slide) are described as Guttman errors.
Hence we need the notion of probability.
A respondent answering an item positively will have a significantly greater probability of also having answered less difficult items positively.
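Counting Guttman errors can be sketched in a few lines of Python. This is a minimal illustration, not the routine used in any particular package: it assumes items are already ordered from easiest to hardest, codes R as 1 and W as 0, and uses the pairwise convention in which every (easier item wrong, harder item right) pair counts as one error. The function names are made up for the example.

```python
def parse(pattern):
    """Turn a string such as 'RRWR' into 0/1 scores (R = 1, W = 0)."""
    return [1 if ch == "R" else 0 for ch in pattern]


def guttman_errors(responses):
    """Count Guttman errors in one response pattern.

    `responses` holds 0/1 answers with items ordered easiest to hardest.
    Each pair (easier item answered 0, harder item answered 1) is one error.
    """
    errors = 0
    wrong_so_far = 0  # easier items answered 0 up to this point
    for r in responses:
        if r == 1:
            errors += wrong_so_far  # this success contradicts every earlier failure
        else:
            wrong_so_far += 1
    return errors


ideal = parse("RRRRRRRRRRWWWWWWWW")
observed = parse("RRRRRRWRRWWRWWWWWWWWW")
print(guttman_errors(ideal))     # 0
print(guttman_errors(observed))  # 5
```

Other conventions exist (for example, counting the minimum number of responses that would have to change), so the count of 5 is specific to the pairwise definition assumed here.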

5 Loevinger
Loevinger, J. (1948) The technic of homogeneous tests compared with some aspects of scale analysis and factor analysis. Psychological Bulletin, 45(6).
Loevinger's H statistic measures the extent to which items appear in the same relative order.
It is based on a comparison of the actual number of Guttman errors (A) with the number expected if responses are random (R). E.g.
RRRRRRRRRWRRWWRWWWWWWWWWWWW (A)
RWRWRRWWRWRRWWWRWRWWRRRWRWRW (R)

6 Loevinger H
LoevH is a function of (Random - Actual)/Random.
If there are no errors, then H = 1.
If the number of errors is as expected by chance alone, then H = 0.
It can be calculated for each pair of items.
It can be averaged across all respondents to give an index for a particular item.
It can be averaged across all items to give an index for the test.
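The (Random - Actual)/Random ratio can be illustrated with a short Python sketch. It assumes a complete 0/1 data matrix (one row per respondent), takes "Random" for a pair of items to be the number of Guttman errors expected if the two items were statistically independent, and pools errors over pairs to get the item and scale indices. The function name and data are made up for illustration; the sketch does not claim to reproduce the output of the Stata LoevH routine.

```python
from itertools import combinations


def loevinger_h(data):
    """Return (scale H, list of item H values) for a 0/1 data matrix.

    Assumes every item has some positive and some negative responses,
    otherwise the expected error count is zero and H is undefined.
    """
    n = len(data)
    k = len(data[0])
    p = [sum(row[i] for row in data) / n for i in range(k)]  # proportion positive

    actual = [0.0] * k    # observed Guttman errors involving each item
    random_ = [0.0] * k   # errors expected under independence
    total_actual = total_random = 0.0

    for i, j in combinations(range(k), 2):
        # The item with the higher proportion positive is the easier one.
        easy, hard = (i, j) if p[i] >= p[j] else (j, i)
        a = sum(1 for row in data if row[easy] == 0 and row[hard] == 1)
        r = n * (1 - p[easy]) * p[hard]
        for item in (i, j):
            actual[item] += a
            random_[item] += r
        total_actual += a
        total_random += r

    item_h = [(random_[i] - actual[i]) / random_[i] for i in range(k)]
    scale_h = (total_random - total_actual) / total_random
    return scale_h, item_h


# A perfect Guttman pattern: no errors, so H = 1.
perfect = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
# Two unrelated items: errors occur at the chance rate, so H = 0.
chance = [[1, 1], [1, 0], [0, 1], [0, 0]]

print(loevinger_h(perfect))
print(loevinger_h(chance))
```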

7 Criteria for Loevinger's H in a good scale (Mokken)
The usually accepted (but somewhat arbitrary) criterion is that H should be greater than 0.3 for each item, and that H should be greater than 0.3 for the scale as a whole.
If H > 0.5: a strong Mokken scale.
If H > 0.4 and < 0.5: a moderate scale.
If H > 0.3 and < 0.4: a weak scale.

8 Mokken
Mokken, R. J. (1971) A theory and procedure of scale analysis. De Gruyter: Netherlands.
Criteria for a good scale are based on traces.
A trace is a plot of the probability of agreement with an item against the total score (number correct).
The probability of a positive response to an item should increase monotonically as the latent trait increases.
Double monotonicity must hold (i.e. the trace lines of items in a Mokken scale should not intersect).
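An empirical trace can be sketched as the proportion of positive responses to an item at each total score. The Python below is a minimal illustration on made-up data; the function names are hypothetical, and the total score here includes the item itself, which is a simplifying assumption.

```python
from collections import defaultdict


def item_trace(data, item):
    """Proportion of positive responses to `item` at each observed total score."""
    counts = defaultdict(lambda: [0, 0])  # score -> [positives, respondents]
    for row in data:
        score = sum(row)
        counts[score][0] += row[item]
        counts[score][1] += 1
    return {s: pos / n for s, (pos, n) in sorted(counts.items())}


def is_nondecreasing(trace):
    """Check the monotonicity requirement on an empirical trace."""
    values = list(trace.values())
    return all(a <= b for a, b in zip(values, values[1:]))


# Toy data: five respondents, three binary items.
data = [[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0], [1, 1, 1]]
trace = item_trace(data, 0)
print(trace)                    # {0: 0.0, 1: 0.5, 2: 1.0, 3: 1.0}
print(is_nondecreasing(trace))  # True
```

With real data, small groups at some total scores make the raw proportions noisy, so a visual check of the plotted trace is usually more informative than a strict test like this one.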

9 Items in Short GRIMS
My partner is sensitive to and aware of my needs. (P)
My partner doesn't listen to me any more. (N)
I'm sometimes lonely when I'm with my partner. (N)
Our relationship is full of joy and excitement. (P)
I wish there was more warmth and affection between us. (N)
I suspect we are on the brink of separation. (N)
We can make up quickly after an argument. (P)

10 Run Stata
Item analysis using the Mokken procedure in Stata.
Stata is available on the Public Workstation Facilities (PWFs), e.g. in the Cathy Marsh Room.
It contains the routines traces, LoevH and msp.
First import data into Stata from SPSS.

11 Example of a trace
[Figure: trace of the item BItem6 as a function of the score. Horizontal axis: total score. Vertical axis: rate of positive response.]

12 Linear prediction
What do we need to know about an item in order to predict the probability of a person of known ability getting the item right?
Linear prediction: y = α + βx

13 Linear prediction in scattergram

14 True/False (binary) data
A classical scattergram doesn't show much.
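One way to see the limitation of linear prediction for binary items is to fit y = α + βx to right/wrong answers and look at the predictions. The sketch below assumes ordinary least squares as the fitting method and uses invented ability scores and answers; the function name is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares estimates of alpha and beta in y = alpha + beta * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


# Made-up data: ability scores and whether the item was answered correctly.
ability = [-2, -1, 0, 1, 2, 3]
correct = [0, 0, 0, 1, 1, 1]

alpha, beta = fit_line(ability, correct)
for x in (-2, 3):
    # Predicted "probabilities" at the extremes of the ability range.
    print(x, round(alpha + beta * x, 3))
```

For these data the fitted line predicts a value below 0 at the lowest ability and above 1 at the highest, which cannot be probabilities. A curve bounded between 0 and 1 avoids this.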

15 Item Response Theory (IRT)
Arose from the need to link the behaviour of binary items to the scale non-linearly.
Devised independently by Georg Rasch and by Lord, Novick and Birnbaum.
Plots the probability of getting an item right (for each item) against a latent trait of ability (or personality), now called θ (theta).

16 Item Characteristic Curve (ICC)

17 IRT Models
Predicting probability from ability:
One parameter or Rasch model
Two parameter model
Three parameter model

18 Example 1

19 Example 2

20 Example 3

21 Example 4

22 Difficulty Parameter

23 Discrimination Parameter

24 Guessing Parameter

25 Item Characteristic Curve
$$P(\theta) = c + \frac{1 - c}{1 + e^{-a(\theta - b)}}$$
where
θ = ability parameter (a person's ability)
P(θ) = probability of a correct response given θ
a = discrimination parameter
b = difficulty parameter
c = guessing parameter
e = growth constant (approximately 2.718)
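The item characteristic curve translates directly into code. The Python sketch below uses the same symbols as the slide; the function name and the default values are illustrative choices. Setting c = 0 gives the two parameter model, and additionally holding a equal across items gives a one parameter (Rasch-type) model.

```python
import math


def icc(theta, a=1.0, b=0.0, c=0.0):
    """Probability of a correct response at ability theta.

    a: discrimination, b: difficulty, c: guessing (lower asymptote).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Illustrative parameter values only.
for theta in (-3, -1, 0, 1, 3):
    one_pl = icc(theta, b=0.5)
    two_pl = icc(theta, a=2.0, b=0.5)
    three_pl = icc(theta, a=2.0, b=0.5, c=0.2)
    print(theta, round(one_pl, 3), round(two_pl, 3), round(three_pl, 3))
```

At θ = b the curve sits halfway between c and 1, so with no guessing (c = 0) the difficulty parameter is the ability at which the probability of a correct response is 0.5.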

26 Run Mplus
Item analysis using Confirmatory Factor Analysis in Mplus.
Mplus is available on the Public Workstation Facilities (PWFs), e.g. in the Cathy Marsh Room.
It is suitable for modelling binary and ordinal as well as continuous data.
Download the demo version (6 items only) from Statmodel.com.

27 IRT in Mplus
Show modelling with exploratory factor analysis (EFA).
Repeat with confirmatory factor analysis (CFA).
Note that CFA with binary data is IRT.
Show ICC output with plots.
Demonstrate the information function (reliability differs at different points of the scale).

28 Rehash of classical item analysis
Take a set of items.
Find their difficulty values.
Find their discrimination indices.
Eliminate items that don't meet certain criteria: extreme values, or poor correlation with the scale itself.
Maximize internal consistency.
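The classical steps above can be sketched in Python. This is one possible reading, under stated assumptions: difficulty is taken as the proportion answering positively, discrimination as the correlation between an item and the total of the remaining items, and internal consistency as Cronbach's alpha. The data and function names are invented for the example.

```python
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def item_analysis(data):
    """Return (per-item report, Cronbach's alpha) for a 0/1 data matrix."""
    k = len(data[0])
    totals = [sum(row) for row in data]
    report = []
    for i in range(k):
        item = [row[i] for row in data]
        rest = [t - x for t, x in zip(totals, item)]  # total score without this item
        report.append({
            "item": i,
            "difficulty": mean(item),               # proportion positive
            "discrimination": pearson(item, rest),  # item-rest correlation
        })
    item_variances = sum(variance([row[i] for row in data]) for i in range(k))
    alpha = k / (k - 1) * (1 - item_variances / variance(totals))
    return report, alpha


# Toy data: six respondents, four binary items.
data = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
]
report, alpha = item_analysis(data)
for row in report:
    print(row)
print(round(alpha, 3))
```

Items with extreme difficulty values or a low item-rest correlation would be the candidates for removal, after which alpha can be recomputed on the reduced set.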

29 Comparison of IRT with CTT in test construction
IRT searches for hierarchies rather than correlations.
In IRT, person scores and item difficulties are plotted on the same scale (theta).
IRT makes use of item thresholds to incorporate item difficulty into the score (important in test equating).
IRT does not assume linearity, hence IRT works with binary or ordinal data.
Some IRT models (e.g. Rasch, Mokken) require double monotonicity.
In CTT, item discrimination indices only have to be above certain criteria.

30 Advantages of IRT
The information function allows tests to be optimised at thresholds.
Reliability can be more accurately related to test score.
Individual reliabilities for each person.
Basis for test equating and adaptive testing.
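The idea that precision varies along the scale can be sketched with the information function. The Python below assumes the two parameter logistic model, for which the information of an item at θ is a² P(θ)(1 - P(θ)), and takes the standard error of the ability estimate as one over the square root of the summed information. The item parameters and function names are invented for illustration.

```python
import math


def p_2pl(theta, a, b):
    """Two parameter logistic probability of a positive response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Information contributed by one item at ability theta (2PL assumption)."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)


def scale_information(theta, items):
    """Sum of item information over (a, b) pairs."""
    return sum(item_information(theta, a, b) for a, b in items)


def standard_error(theta, items):
    """Approximate standard error of the ability estimate at theta."""
    return 1.0 / math.sqrt(scale_information(theta, items))


# Hypothetical (discrimination, difficulty) pairs for a three-item scale.
items = [(1.5, -1.0), (1.0, 0.0), (2.0, 1.0)]
for theta in (-4, -2, 0, 2, 4):
    print(theta, round(scale_information(theta, items), 3),
          round(standard_error(theta, items), 3))
```

Information peaks near the item difficulties and falls away at the extremes, so the same test measures people in the middle of the range more precisely than people far above or below it.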

31 Next week: Measuring intelligence
What is intelligence?
IQ testing
Controversies in intelligence testing
Eugenics
Multiple intelligences
The Flynn Effect
Intelligence testing today
