Item Response Theory
Steven P. Reise, University of California, U.S.A.


Item response theory (IRT), or modern measurement theory, provides alternatives to classical test theory (CTT) methods for the construction, analysis, and scoring of psychological measures. A clear difference between traditional CTT and modern IRT psychometric methods is that the former represents constructs through an aggregate composite score, whereas the latter represents constructs through a latent variable (or latent trait) that is assumed to underlie item responses. Moreover, IRT measurement models are formal statistical models that attempt to capture the interaction between person and item properties as they jointly determine an individual's response to an item (Embretson, 1996). As such, IRT modeling rests on a set of testable assumptions, and IRT models can be statistically evaluated as to their fit to item-response data. From a technical perspective, IRT measurement models are closely related to confirmatory factor analytic models for ordinal data and have their origin in the work of Derrick Lawley (Lawley, 1943), Frederick Lord (Lord, 1980), and Darrell Bock (Bock & Aitkin, 1981), among many others. From an applied perspective, IRT measurement models were developed to solve real-world practical problems in large-scale aptitude and achievement testing that were challenging, and in some cases impossible, under CTT psychometrics. It is only since the beginning of the century, however, that IRT models have been more commonly employed in the measurement of personality, psychopathology (Reise & Waller, 2009), and medical outcomes constructs (Cella et al., 2007). The defining features of IRT measurement models are the specification and estimation of the parameters of a mathematical function, typically a logistic function.
This function is called an item response function (IRF), and its purpose is to model the relation between individual differences on a continuous latent trait construct (theta, θ) and the probability of responding to a scale item in a given way (e.g., endorsing a true/false item in the keyed direction, or responding in category 3 on a five-point ordered rating item). Thus, in the following two sections, commonly applied IRT models appropriate for dichotomous and polytomous item response data are described. The discussion is restricted to unidimensional IRT models (i.e., models that assume only a single common latent variable underlies the covariances among item responses) because all the basic principles described here generalize easily to multidimensional IRT models. Interpretation of indices derived from IRT model parameters and assumptions underlying IRT models are then detailed, and common applications of IRT models, such as computerized adaptive testing, are presented.

Unidimensional IRT Models for Dichotomous Item Responses

In large-scale national and statewide achievement testing, the most commonly used item response format is multiple choice, where responses are dichotomously scored as either correct (1) or incorrect (0). Moreover, for many popular personality and psychopathology measures, the endorsed (1) versus not endorsed (0) dichotomous yes/no, true/false, or agree/disagree response formats are very common. In all these cases, and assuming that a single common latent variable (θ) underlies item responses, a researcher may be interested in estimating an IRF to describe the relation between standing on a latent variable (representing the construct of interest) and the probability of endorsing an item in the keyed direction.

The Encyclopedia of Clinical Psychology, First Edition. Edited by Robin L. Cautin and Scott O. Lilienfeld. 2015 John Wiley & Sons, Inc. Published 2015 by John Wiley & Sons, Inc. DOI: /wbecp357

The primary goal of fitting a dichotomous IRT model is to find an IRF that best represents, or fits, the observed item response data. To achieve this goal, one must select among several commonly applied models that vary in complexity. Dichotomous IRT models differ primarily in the number of item properties that need to be accounted for. The least complex IRT measurement model is the one-parameter logistic (1PL) model shown in Equation 1, where the subscript i refers to an item, and x refers to a particular item response, scored 0 for a not keyed/endorsed and 1 for a keyed/endorsed response. The a represents the common-item slope parameter, and b is an item location parameter that is allowed to vary between items within a scale.

P(x_i = 1) = exp[a(θ − b_i)] / {1 + exp[a(θ − b_i)]}   (1)

The 1PL model in Equation 1, depending on its specification, is sometimes referred to as a Rasch model. Specifically, if a is fixed to 1 for all items (for identification) and the variance of the latent variable is estimated, then technically Equation 1 is a Rasch model. If, as considered here, a is constant across items and the variance of the latent variable is fixed to 1 (for identification), then the model is more appropriately referred to as a 1PL model. In describing this model, and throughout this entry, it is assumed that the metric for the latent variable has been identified by fixing its mean to 0 and its variance to 1. Thus, assuming normality, the metric for the latent trait can be interpreted like a z-score. In the 1PL model, all items have a constant slope a, but items may differ from each other in their location parameter (represented by b). Item location parameters typically range between −2.0 and 2.0 and indicate the point on the latent trait metric where the probability of endorsing the item in the keyed direction is .5. Thus, the b parameter serves to shift the IRF from left to right along the latent trait continuum.
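To make Equation 1 concrete, the IRF can be evaluated directly. The sketch below is illustrative only; the function name and example parameter values are ours, chosen to echo the example items in Figure 1A:

```python
import math

def irf_1pl(theta, b, a=1.5):
    """1PL item response function: P(x = 1 | theta) for an item located at b.

    In the 1PL model the slope a is shared by all items; a = 1.5 echoes
    the example items in Figure 1A (illustrative values, not estimates).
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Three example items located at -1, 0, and 1.
for b in (-1.0, 0.0, 1.0):
    # At theta = b the endorsement probability is exactly .5,
    # so b shifts the curve left or right along the trait continuum.
    print(f"b = {b:+.1f}: P(theta = b) = {irf_1pl(b, b):.2f}")
```

Note that an item with a negative b yields a high endorsement probability for an average (θ = 0) respondent, matching the proportion-endorsed intuition described in the text.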
Item location parameters are analogous, but not equivalent, to the item proportion endorsed in traditional item analysis. Items that are commonly endorsed tend to have negative location parameters, and the IRF will be shifted to the left. Items that are rarely endorsed (endorsed only by individuals high on the latent variable) will have positive location parameters, and the IRF will be shifted to the right. To illustrate, Figure 1A shows the IRFs for three example scale items with a fixed to 1.5 and b parameters set to −1, 0, and 1, respectively. The a slope parameter, which typically ranges between 1.0 and 2.0, is often referred to as a discrimination parameter (constant across items within the 1PL model). It determines the steepness of the IRF at its inflection point. Slope parameters are analogous to item-test biserial correlations in traditional scale analysis. Figures 1B and 1C show three items with the same location parameters as in Figure 1A, but with a values of 2.5 and 0.5, respectively. These figures make clear the interpretation of the parameter; the higher the slope, the more differentiating or discriminating the item is, in the sense that response probabilities change rapidly as scores on the latent variable increase. This is especially true in the latent variable range around the item's location. The 1PL model requires all scale items to have equal slopes, but items may vary in location, and thus items vary in where along the latent trait continuum they provide the most discrimination (see subsequent section). The 1PL model is analogous to the concept of essential tau-equivalence in classical test theory. A slightly more complex, or less restricted, model can be specified by allowing the scale items to vary in discrimination. This two-parameter logistic (2PL) model is shown in Equation 2. The 2PL model is analogous to the concept of congeneric measurement in classical test theory.
P(x_i = 1) = exp[a_i(θ − b_i)] / {1 + exp[a_i(θ − b_i)]}   (2)

In this model, items are allowed to vary in two ways, slope (a) and location (b), and the interpretation of the parameters remains the same as in the 1PL model. Equation 2 states that

Figure 1 A, IRFs, slope = 1.5. B, IRFs, slope = 2.5. C, IRFs, slope = 0.5.

the probability of endorsing an item in the keyed direction is a function of the difference between an individual's standing on the latent variable (θ) and the item's location parameter (b), and this difference is weighted by the slope (a). Thus, for items with relatively low slopes (discrimination), the difference between an individual's trait level and the item location has little effect on the response probability. In contrast, when the slope parameter is relatively high, differences between an individual's trait level and the item's location have a large effect on the response probability. To illustrate, Figure 2 displays the IRFs for two items that have the same location parameter (b = 0) but different slopes (0.5 and 1.5, respectively). Clearly, the item with the larger slope provides relatively more discrimination around the middle of the trait continuum in the sense that the response probabilities are changing very rapidly. As a consequence, it is easier to discriminate between individuals who are in the middle of the trait range using the item with the larger slope, relative to the item with the lower slope. In the case of multiple-choice aptitude or achievement tests, where examinees who are low on the latent variable can obtain a correct answer by guessing, a commonly applied IRT model is the three-parameter logistic model

Figure 2 IRFs, location = 0, slopes = 0.5 and 1.5.

(3PL), shown in Equation 3.

P(x_i = 1) = c_i + (1 − c_i) exp[a_i(θ − b_i)] / {1 + exp[a_i(θ − b_i)]}   (3)

Equation 3 expands the 2PL model by adding an additional parameter (c), often referred to as the pseudo-guessing or lower asymptote parameter. The c parameter is on a proportion metric and typically ranges between 0 and .5 (for multiple-choice items with four response options). The value of this parameter sets a boundary on the lower asymptote of the IRF such that, at low trait levels, the response probability never goes toward 0 but rather stays constant. To illustrate, Figure 3 displays the IRFs for three items, each having a slope of 1.5 and location of 1.0 but differing lower asymptotes (0, .1, and .5). It is important to notice that, in Figure 3, the interpretation of the location parameter has now changed slightly relative to its interpretation in the 1PL or 2PL models. Specifically, the location parameter remains the point on the latent trait continuum where the IRF is steepest (i.e., the inflection point), but this point no longer corresponds to the location on the latent trait at which the probability of endorsement is .5. Instead, the probability of endorsing an item in the keyed direction at θ = b is (1 + c)/2.

Figure 3 IRFs, slope = 1.5, location = 1.0, lower = 0, .1, .5.

The 3PL model has rarely been applied in personality or psychopathology measurement; nevertheless, an extension of this model that allows for both a nonzero lower asymptote and a non-one upper asymptote, called the four-parameter logistic model (Equation 4), has received some attention (Reise & Waller, 2009).

P(x_i = 1) = c_i + (d_i − c_i) exp[a_i(θ − b_i)] / {1 + exp[a_i(θ − b_i)]}   (4)

Equation 4 is simply Equation 3 with 1 replaced by d.
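The asymptote algebra is easy to verify in code. The sketch below implements Equation 4; setting d = 1 recovers the 3PL of Equation 3, and c = 0 with d = 1 recovers the 2PL (the function name and parameter values are illustrative, not from the source):

```python
import math

def irf_4pl(theta, a, b, c=0.0, d=1.0):
    """4PL item response function (Equation 4).

    c is the lower (pseudo-guessing) asymptote and d the upper asymptote.
    With d = 1 this is the 3PL (Equation 3); with c = 0 and d = 1, the 2PL.
    """
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (d - c) * logistic

# At theta = b the logistic part equals .5, so the 3PL probability
# is (1 + c) / 2 and the 4PL probability is (c + d) / 2.
p3 = irf_4pl(theta=1.0, a=1.5, b=1.0, c=0.2)         # 3PL: d defaults to 1
p4 = irf_4pl(theta=1.0, a=1.5, b=1.0, c=0.1, d=0.9)  # 4PL, as in Figure 4
assert abs(p3 - (1 + 0.2) / 2) < 1e-12
assert abs(p4 - (0.1 + 0.9) / 2) < 1e-12
```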
This highly complex model may be appropriate for measurement contexts where the probability of endorsing an item (e.g., having a symptom) is not zero even for low-trait individuals (e.g., endorsing sad mood within the last 7 days has a nonzero probability even for individuals who are low on depression). Conversely, the probability of endorsing an item (e.g., having a symptom) does not approach 1.0 even for individuals who are in the high range on the latent variable (e.g., suicidal ideation within the last 7 days is not universally endorsed even by individuals at the highest levels of depression). To illustrate this model, Figure 4 displays a 4PL IRF with a slope of 1.5, location of 1.0, lower asymptote of .1, and upper asymptote of .9. In this model, the interpretation of the location parameter is even more complicated because it now reflects

Figure 4 IRF: a = 1.5, b = 1.0, c = .1, d = .9.

the point on the latent trait scale where the response probability is (c + d)/2.

Unidimensional IRT Models for Polytomous Item Responses

When item responses are scored in ordered categories, polytomous IRT models are required. In the models for dichotomous items described above, only one IRF, reflecting the probability of a keyed response, needed to be estimated. This is because once the IRF is known, the probability of responding in the nonkeyed direction as a function of the latent variable is known by subtraction, that is, 1 − P. Conceptually, one can think of a dichotomous item as having a single threshold between the nonkeyed (0) and keyed (1) response, and one goal of IRT modeling is to estimate this threshold via a location parameter indicating where on the latent variable a keyed response becomes more likely than a nonkeyed one. With a polytomous item response format, the complexity of the IRT model increases, and a slightly different terminology is needed to describe response propensities. Specifically, instead of estimating a single IRF for a dichotomous item, a researcher needs to estimate K category response curves (CRCs) for a K-category polytomous item. Here, let response categories be coded 0 to K − 1. Each CRC will model the relation between standing on the latent variable and the probability of responding to an item in a specific category. Although there are numerous potential polytomous IRT models that one may consider, the illustration here is the graded-response model (GRM; Samejima, 1969). In the GRM, for each item, a set of K − 1 location (b) parameters needs to be estimated, along with one common item slope (a).
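The GRM's threshold-and-subtraction construction can be sketched as follows. This is an illustrative implementation (the function names are ours), using the four-category example item of Figure 5 (a = 2, thresholds at −1, 0, and 1):

```python
import math

def trf(theta, a, b):
    """Threshold response function: a 2PL curve for responding above threshold b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def grm_crcs(theta, a, bs):
    """Category response curves for one graded-response item.

    bs holds the K - 1 ordered threshold locations. By stipulation the
    probability of responding in at least the lowest category is 1 and of
    responding above the highest category is 0; each CRC is then the
    difference of adjacent cumulative curves.
    """
    cum = [1.0] + [trf(theta, a, b) for b in bs] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(bs) + 1)]

# Four-category example item: the K category probabilities sum to 1
# at every point on the latent trait continuum.
probs = grm_crcs(theta=0.5, a=2.0, bs=(-1.0, 0.0, 1.0))
assert abs(sum(probs) - 1.0) < 1e-12
```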
Stated differently, in the GRM, for each item, a set of K − 1 2PL IRFs is estimated, with the slopes constrained to be equal within an item (but not between items). These functions are called threshold response functions (TRFs; Equation 5), with location parameters that indicate the trait level necessary to have a 50% chance of responding above one of the K − 1 thresholds between the response categories.

P*_xi(θ) = exp[a_i(θ − b_ji)] / {1 + exp[a_i(θ − b_ji)]},   (5)

where j = 1, ..., K − 1 (the number of response categories minus 1) indexes the thresholds, and x is the response category. For example, for a four-point item, three 2PL IRFs are estimated: for response 0 versus 1, 2, 3; for responses 0, 1 versus 2, 3; and for 0, 1, 2 versus 3. To illustrate, Figure 5A displays, from left to right, threshold response functions for a four-category item with the slope parameter equal to 2 and location parameters of −1, 0, and 1, respectively. Given the parameters of the threshold response functions, and the stipulation that the conditional probability of responding in at least the first category is 1 and the conditional probability of responding above the highest category is 0, the CRCs can be estimated by subtraction, as shown below. To illustrate, Figure 5B shows the CRCs for the example item in Figure 5A. Going from left to right, the probability of responding in the lowest category (x = 0) monotonically decreases as a function of trait level. For the middle two response categories (x = 1 or 2), response propensity is a unimodal function that increases and then

Figure 5 A, TRFs: a = 2.0, b1 = −1, b2 = 0, b3 = 1. B, Category response curves.

decreases as a function of trait level. Finally, the probability of responding in the highest category (x = 3) monotonically increases with increasing trait level. Observe that, at any point on the latent trait continuum, the category response probabilities sum to 1. Graded-response model item parameters are easily interpretable and determine the shapes and locations of the TRFs (and thus the CRCs). The higher the slope parameter, the steeper the TRFs and the narrower and more peaked the CRCs, indicating that the response categories differentiate well among individuals at different trait levels. The threshold parameters (b) determine the locations of the TRFs and where each of the CRCs for the middle response options peaks. Specifically, each CRC peaks in the middle of two adjacent threshold parameters. The distances between adjacent location parameters are also important. A large distance between locations shows that an item discriminates across the entire trait range. Ideally, an item will be highly discriminating (high slope) and will have location parameters spread out across an appropriate range of the trait. Finally, it is important to note that the CRCs for a polytomous item can be aggregated into a single IRF that is analogous to the IRF in the dichotomous models. By weighting the CRCs (i.e., the conditional probabilities of responding in a specific category) by the integers used to score the responses (e.g., 0, 1, 2, 3), an item response curve (IRC) for a polytomous item is obtained.

E(X_i) = IRC_i = Σ_{x=0}^{K−1} x P_xi(θ)   (6)

Figure 6 Item response curve.

The one important difference is that the y-axis for a polytomous model will range from 0 to K − 1 (assuming categories scored 0 to K − 1), whereas the y-axis of an IRF for a dichotomous model will range between 0 and 1. The IRC in Figure 6 displays how the expected raw

score on an item changes as a function of the latent trait for the example item.

Model Features: Information and Conditional Standard Errors

Interpretation of the parameter estimates of the models described above is critical to the psychometric analysis of an instrument. Generally speaking, researchers are most concerned that the items provide good discrimination and that the location parameters are spread out (between items in dichotomous models, and within items for polytomous models) across the full range of the latent trait continuum. However, to aid in the psychometric assessment of a set of scale items, IRT modeling provides several useful tools that are derived from the estimated item parameters. Most useful are the item and scale information functions and the corresponding conditional standard error function, described below. For any item, once the model parameters are estimated (i.e., the IRFs are known), their values can be transformed into an item-information function (IIF). An IIF describes how much psychometric information, or discrimination, an item provides at each level of the latent variable. For dichotomous items, items with higher discrimination (slope) parameters provide more information, and the position along the latent variable continuum where that information is concentrated is determined by the item's location. Some items may provide information in the high trait range, whereas others differentiate best among low-trait individuals or among individuals in the middle of the trait range. For polytomous items, similar principles apply in that items with higher slopes provide more information, and the concentration of the information is peaked around the item's location parameters. However, because polytomous items have multiple location parameters, ideally spread across the latent trait continuum, polytomous information functions tend to spread the information out across the trait range.
Indeed, that is the entire purpose of a polytomous response format: to allow one item to make multiple (and hopefully meaningful) distinctions between people across the trait range. To illustrate the concept of information, Figure 7A shows the IRFs for five items that vary widely in slope and location. Figure 7B displays the corresponding item information functions and the scale information function derived by summing the IIFs across the five items. Item information, considered alone, is difficult to interpret because its metric has no simple definition. However, as described below, item information is critically related to the conditional standard error. Specifically, assuming that items are locally independent after controlling for the latent variable (see next section), IIFs are additive across items within a scale. Thus, a researcher can easily create a scale-information function (SIF) that indicates the amount of psychometric information an item set provides at each trait level. Then, the square root of 1 divided by the scale information yields the conditional standard error of the maximum likelihood trait-level estimate. When this transformation is made, the resulting function is a standard error function, indicating how precisely trait levels can be estimated. This function is shown in Figure 7C for the five example items. The SIF and resulting standard error function are extremely useful in scale or short-form construction and in diagnosing the strengths and weaknesses of various instruments. They are also valuable in designing instruments to meet specific measurement needs (e.g., selecting items to differentiate best among high-trait individuals).

Item Response Theory Model Assumptions and Consequences

The utility of latent variable measurement models depends critically on the extent to which the data meet the assumptions.
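The chain from item information to scale information to conditional standard error can be sketched for the 2PL, whose item information has the well-known closed form a²P(1 − P); the five example items below are hypothetical, not the items plotted in Figure 7:

```python
import math

def p_2pl(theta, a, b):
    """2PL response probability."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """2PL item information, a^2 * P * (1 - P); it peaks at theta = b."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

def conditional_se(theta, items):
    """Conditional standard error of the maximum likelihood trait estimate.

    Assuming local independence, the item informations add up to the
    scale information, and the SE is the square root of its reciprocal.
    """
    sif = sum(item_information(theta, a, b) for a, b in items)
    return math.sqrt(1.0 / sif)

# Five hypothetical (a, b) pairs varying widely in slope and location.
items = [(1.8, -1.5), (1.2, -0.5), (2.0, 0.0), (0.9, 0.8), (1.5, 1.5)]
for theta in (-2.0, 0.0, 2.0):
    print(f"theta = {theta:+.1f}: SE = {conditional_se(theta, items):.3f}")
```

Adding items (or raising slopes) near a trait level lowers the standard error there, which is exactly the logic used when building short forms to target a specific trait range.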
Moreover, even if data are consistent with the requirements of IRT modeling, after model parameter estimation one then needs to show that the selected model provides an acceptable

Figure 7 A, IRFs for five items. B, IIFs and SIF for five items. C, Standard error function for five items.

fit to the data. This section considers the former topic, IRT modeling assumptions, only. The complex topic of fit assessment is difficult to summarize, and readers are referred to the recommended readings provided in the Further Reading section. Commonly applied IRT models make three fundamental assumptions about item-response data. First, they assume that there is a fully continuous dimensional latent variable (or variables, for multidimensional IRT models) that underlies the reliable item response variance. If there is no continuous underlying latent trait, then estimating a latent trait measurement model is a meaningless exercise because the model parameters would have no sensible interpretation. Second, IRT models assume that response probabilities are monotonically increasing: as individuals increase on the latent variable, their probabilities of endorsing a dichotomous item, or of responding in a higher response category of a polytomous item, increase. This is a necessary assumption because the parametric models described above fit (or force) monotonically increasing IRFs onto the data. Alternative nonparametric and parametric (e.g., unfolding) IRT models are available when this assumption is not met, but these are beyond the scope of the present discussion.

The most critical assumption, and the one that has drawn the most research attention, is that item responses be locally independent (uncorrelated) after controlling for the latent variable (or latent variables in multidimensional IRT models). In unidimensional IRT models, it must be assumed that all the common variance in an item set can be explained by a single common factor; this is analogous to having no correlated residuals in structural equation modeling. When the local independence assumption is not met (or at least well approximated), item parameter estimates can be biased because the latent trait is not properly specified. In turn, all functions derived from the item parameter estimates, such as the item or scale information and standard error, may also be erroneous to some degree, depending on the severity of the violation. The most serious consequence of a local independence violation is that IRT models may lose their most important property, namely, the invariance of item and person parameters. The concepts of item and person invariance are commonly misunderstood. Simply stated, item-parameter invariance means that an item's parameters do not depend on the other items that are included in the analysis or on the subsample of the population that is used to calibrate the item parameters, within a linear transformation. Person-parameter invariance means that an individual's standing on the latent variable does not depend on which items are administered, again within a linear transformation. These item- and person-parameter invariance properties depend entirely on meeting the IRT model's assumptions. When the assumptions described above are not met, especially local independence, all the applications of IRT modeling, including those described in the next section, are questionable.

Item Response Theory Applications

Beyond providing a more informed basis for basic psychometric analysis, the increasing popularity of IRT models is driven by their utility.
For example, in large-scale aptitude and achievement testing, IRT models are used to link the scales for different versions of a test administered to different examinee subgroups so that scores (latent trait estimates) are on the same scale (i.e., comparable). More generally, across a wide range of disciplines, IRT models have been used to form the basis for computerized adaptive testing (CAT) and for the examination of measurement equivalence across sociodemographic groups. These two topics, which depend critically on the assumption of item and person parameter invariance, are briefly reviewed below. The creation of a precalibrated item pool (i.e., a set of items measuring the same trait with known IRT model parameters) and the efficient administration of a subset of items tailored to an individual's trait level is an attractive alternative to the CTT counterpart of short-form creation. A simple CAT algorithm may begin by administering one or more items with location parameters in the middle of the trait range. The individual's responses are then used to estimate the person's position on the latent trait continuum. If, for example, the person is estimated to be relatively high on the latent variable, a new item that has a higher location parameter is administered and scored, and the response is used to update the estimate of the individual's trait standing. This process continues until either the individual has responded to a predetermined number of items or the standard error of the trait estimate falls below some threshold. The key to CAT is that individuals are administered the items most relevant to differentiating among people in their trait range. In theory, high-trait individuals would receive only hard items, whereas low-trait individuals would receive only easy items. In this way, individuals do not waste their time responding to items that are not discriminating because their endorsement probability is either nearly 0 or close to 1.
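A toy version of this adaptive loop can be simulated with a maximum-information item-selection rule and a crude grid-search trait estimator; the item pool, stopping rule, and estimator below are deliberate simplifications for illustration, not the algorithm of any particular testing program:

```python
import math
import random

def p_2pl(theta, a, b):
    """2PL response probability."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def info(theta, a, b):
    """2PL item information at theta."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

def estimate_theta(responses):
    """Crude maximum-likelihood estimate over a bounded grid of theta values."""
    grid = [g / 10.0 for g in range(-40, 41)]
    def loglik(t):
        return sum(math.log(p_2pl(t, a, b)) if x else math.log(1.0 - p_2pl(t, a, b))
                   for (a, b), x in responses)
    return max(grid, key=loglik)

def run_cat(pool, true_theta, n_items=5, seed=0):
    """Administer the most informative remaining item at each step."""
    rng = random.Random(seed)
    remaining, responses, theta = list(pool), [], 0.0  # start in the middle
    for _ in range(n_items):
        item = max(remaining, key=lambda it: info(theta, *it))
        remaining.remove(item)
        x = 1 if rng.random() < p_2pl(true_theta, *item) else 0  # simulated answer
        responses.append((item, x))
        theta = estimate_theta(responses)  # update after each response
    return theta
```

For instance, `run_cat([(1.5, -2.0), (1.5, -1.0), (1.5, 0.0), (1.5, 1.0), (1.5, 2.0)], true_theta=1.0)` adaptively walks the simulated examinee toward items located near their trait level and returns the final estimate.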
A second popular application of IRT models is as the basis for modern explorations of measurement invariance hypotheses, traditionally called item-bias analysis but now known as differential item functioning analysis

(DIF analysis). Because of parameter invariance (within a linear transformation), items may be calibrated in two sociodemographic samples that differ in mean and variance on the latent trait, and the IRT item parameter estimates in one sample can be placed onto the same scale as the item parameter estimates in the second sample. The IRFs estimated separately in the two groups may then be tested for equivalence. If equivalence is found, a researcher may conclude that the item functions the same as a trait indicator across the groups, and a common set of item parameters may be used. If, on the other hand, the IRFs differ in slope or location after being placed onto a common metric, a researcher may conclude that the item functions differently for the two groups. In other words, if the IRFs for the same item estimated in the two samples are not equal, then conditional on any trait level, one group will have a higher (or lower) expected score on the item. Depending on the severity of DIF, it may be difficult to validly apply the measure in different groups of examinees.

SEE ALSO: Coefficient Alpha and Coefficient Omega Hierarchical; Item Response Theory, Approach to Test Construction; Measurement Invariance; Reliability; Scale Development; Structural Equation Modeling

References

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46.

Cella, D., Yount, S., Rothrock, N., Gershon, R., Cook, K., Reeve, B., ... Rose, M. (2007). The Patient-Reported Outcomes Measurement Information System (PROMIS): Progress of an NIH roadmap cooperative group during its first two years. Medical Care, 45(5 Suppl. 1), S3–S11.

Embretson, S. E. (1996). The new rules of measurement. Psychological Assessment, 8.

Lawley, D. N. (1943). The application of the maximum likelihood method to factor analysis. British Journal of Psychology, General Section, 33.

Lord, F. M. (1980).
Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.

Reise, S. P., & Waller, N. G. (2009). Item response theory and clinical measurement. Annual Review of Clinical Psychology, 5.

Samejima, F. (1969). Estimation of ability using a response pattern of graded scores. Psychometrika Monograph Supplement, 34(4, Pt. 2).

Further Reading

de Ayala, R. J. (2009). The theory and practice of item response theory. New York, NY: Guilford Press.

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.

Millsap, R. E. (2011). Statistical approaches to measurement invariance. New York, NY: Routledge.

Thissen, D., & Wainer, H. (2001). Test scoring. Mahwah, NJ: Erlbaum.

Wainer, H. (2000). Computerized adaptive testing: A primer (2nd ed.). Mahwah, NJ: Erlbaum.


Connexion of Item Response Theory to Decision Making in Chess. Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan Connexion of Item Response Theory to Decision Making in Chess Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan Acknowledgement A few Slides have been taken from the following presentation

More information

A Comparison of Several Goodness-of-Fit Statistics

A Comparison of Several Goodness-of-Fit Statistics A Comparison of Several Goodness-of-Fit Statistics Robert L. McKinley The University of Toledo Craig N. Mills Educational Testing Service A study was conducted to evaluate four goodnessof-fit procedures

More information

Differential Item Functioning

Differential Item Functioning Differential Item Functioning Lecture #11 ICPSR Item Response Theory Workshop Lecture #11: 1of 62 Lecture Overview Detection of Differential Item Functioning (DIF) Distinguish Bias from DIF Test vs. Item

More information

Survey Sampling Weights and Item Response Parameter Estimation

Survey Sampling Weights and Item Response Parameter Estimation Survey Sampling Weights and Item Response Parameter Estimation Spring 2014 Survey Methodology Simmons School of Education and Human Development Center on Research & Evaluation Paul Yovanoff, Ph.D. Department

More information

References. Embretson, S. E. & Reise, S. P. (2000). Item response theory for psychologists. Mahwah,

References. Embretson, S. E. & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, The Western Aphasia Battery (WAB) (Kertesz, 1982) is used to classify aphasia by classical type, measure overall severity, and measure change over time. Despite its near-ubiquitousness, it has significant

More information

Adaptive EAP Estimation of Ability

Adaptive EAP Estimation of Ability Adaptive EAP Estimation of Ability in a Microcomputer Environment R. Darrell Bock University of Chicago Robert J. Mislevy National Opinion Research Center Expected a posteriori (EAP) estimation of ability,

More information

Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure. Rob Cavanagh Len Sparrow Curtin University

Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure. Rob Cavanagh Len Sparrow Curtin University Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure Rob Cavanagh Len Sparrow Curtin University R.Cavanagh@curtin.edu.au Abstract The study sought to measure mathematics anxiety

More information

Likelihood Ratio Based Computerized Classification Testing. Nathan A. Thompson. Assessment Systems Corporation & University of Cincinnati.

Likelihood Ratio Based Computerized Classification Testing. Nathan A. Thompson. Assessment Systems Corporation & University of Cincinnati. Likelihood Ratio Based Computerized Classification Testing Nathan A. Thompson Assessment Systems Corporation & University of Cincinnati Shungwon Ro Kenexa Abstract An efficient method for making decisions

More information

Development, Standardization and Application of

Development, Standardization and Application of American Journal of Educational Research, 2018, Vol. 6, No. 3, 238-257 Available online at http://pubs.sciepub.com/education/6/3/11 Science and Education Publishing DOI:10.12691/education-6-3-11 Development,

More information

Scoring Multiple Choice Items: A Comparison of IRT and Classical Polytomous and Dichotomous Methods

Scoring Multiple Choice Items: A Comparison of IRT and Classical Polytomous and Dichotomous Methods James Madison University JMU Scholarly Commons Department of Graduate Psychology - Faculty Scholarship Department of Graduate Psychology 3-008 Scoring Multiple Choice Items: A Comparison of IRT and Classical

More information

André Cyr and Alexander Davies

André Cyr and Alexander Davies Item Response Theory and Latent variable modeling for surveys with complex sampling design The case of the National Longitudinal Survey of Children and Youth in Canada Background André Cyr and Alexander

More information

The Influence of Test Characteristics on the Detection of Aberrant Response Patterns

The Influence of Test Characteristics on the Detection of Aberrant Response Patterns The Influence of Test Characteristics on the Detection of Aberrant Response Patterns Steven P. Reise University of California, Riverside Allan M. Due University of Minnesota Statistical methods to assess

More information

Item Response Theory (IRT): A Modern Statistical Theory for Solving Measurement Problem in 21st Century

Item Response Theory (IRT): A Modern Statistical Theory for Solving Measurement Problem in 21st Century International Journal of Scientific Research in Education, SEPTEMBER 2018, Vol. 11(3B), 627-635. Item Response Theory (IRT): A Modern Statistical Theory for Solving Measurement Problem in 21st Century

More information

Type I Error Rates and Power Estimates for Several Item Response Theory Fit Indices

Type I Error Rates and Power Estimates for Several Item Response Theory Fit Indices Wright State University CORE Scholar Browse all Theses and Dissertations Theses and Dissertations 2009 Type I Error Rates and Power Estimates for Several Item Response Theory Fit Indices Bradley R. Schlessman

More information

The Psychometric Development Process of Recovery Measures and Markers: Classical Test Theory and Item Response Theory

The Psychometric Development Process of Recovery Measures and Markers: Classical Test Theory and Item Response Theory The Psychometric Development Process of Recovery Measures and Markers: Classical Test Theory and Item Response Theory Kate DeRoche, M.A. Mental Health Center of Denver Antonio Olmos, Ph.D. Mental Health

More information

Rasch Versus Birnbaum: New Arguments in an Old Debate

Rasch Versus Birnbaum: New Arguments in an Old Debate White Paper Rasch Versus Birnbaum: by John Richard Bergan, Ph.D. ATI TM 6700 E. Speedway Boulevard Tucson, Arizona 85710 Phone: 520.323.9033 Fax: 520.323.9139 Copyright 2013. All rights reserved. Galileo

More information

Using the Rasch Modeling for psychometrics examination of food security and acculturation surveys

Using the Rasch Modeling for psychometrics examination of food security and acculturation surveys Using the Rasch Modeling for psychometrics examination of food security and acculturation surveys Jill F. Kilanowski, PhD, APRN,CPNP Associate Professor Alpha Zeta & Mu Chi Acknowledgements Dr. Li Lin,

More information

Termination Criteria in Computerized Adaptive Tests: Variable-Length CATs Are Not Biased. Ben Babcock and David J. Weiss University of Minnesota

Termination Criteria in Computerized Adaptive Tests: Variable-Length CATs Are Not Biased. Ben Babcock and David J. Weiss University of Minnesota Termination Criteria in Computerized Adaptive Tests: Variable-Length CATs Are Not Biased Ben Babcock and David J. Weiss University of Minnesota Presented at the Realities of CAT Paper Session, June 2,

More information

Utilizing the NIH Patient-Reported Outcomes Measurement Information System

Utilizing the NIH Patient-Reported Outcomes Measurement Information System www.nihpromis.org/ Utilizing the NIH Patient-Reported Outcomes Measurement Information System Thelma Mielenz, PhD Assistant Professor, Department of Epidemiology Columbia University, Mailman School of

More information

Assessing Measurement Invariance in the Attitude to Marriage Scale across East Asian Societies. Xiaowen Zhu. Xi an Jiaotong University.

Assessing Measurement Invariance in the Attitude to Marriage Scale across East Asian Societies. Xiaowen Zhu. Xi an Jiaotong University. Running head: ASSESS MEASUREMENT INVARIANCE Assessing Measurement Invariance in the Attitude to Marriage Scale across East Asian Societies Xiaowen Zhu Xi an Jiaotong University Yanjie Bian Xi an Jiaotong

More information

Item Response Theory and Health Outcomes Measurement in the 21st Century

Item Response Theory and Health Outcomes Measurement in the 21st Century MEDICAL CARE Volume 38, Number 9 Supplement II, pp II-28 II-42 2000 Lippincott Williams & Wilkins, Inc. Item Response Theory and Health Outcomes Measurement in the 21st Century RON D. HAYS, PHD,* LEO S.

More information

A Bayesian Nonparametric Model Fit statistic of Item Response Models

A Bayesian Nonparametric Model Fit statistic of Item Response Models A Bayesian Nonparametric Model Fit statistic of Item Response Models Purpose As more and more states move to use the computer adaptive test for their assessments, item response theory (IRT) has been widely

More information

Item Response Theory. Author's personal copy. Glossary

Item Response Theory. Author's personal copy. Glossary Item Response Theory W J van der Linden, CTB/McGraw-Hill, Monterey, CA, USA ã 2010 Elsevier Ltd. All rights reserved. Glossary Ability parameter Parameter in a response model that represents the person

More information

Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments

Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments Greg Pope, Analytics and Psychometrics Manager 2008 Users Conference San Antonio Introduction and purpose of this session

More information

accuracy (see, e.g., Mislevy & Stocking, 1989; Qualls & Ansley, 1985; Yen, 1987). A general finding of this research is that MML and Bayesian

accuracy (see, e.g., Mislevy & Stocking, 1989; Qualls & Ansley, 1985; Yen, 1987). A general finding of this research is that MML and Bayesian Recovery of Marginal Maximum Likelihood Estimates in the Two-Parameter Logistic Response Model: An Evaluation of MULTILOG Clement A. Stone University of Pittsburgh Marginal maximum likelihood (MML) estimation

More information

Multidimensional Item Response Theory in Clinical Measurement: A Bifactor Graded- Response Model Analysis of the Outcome- Questionnaire-45.

Multidimensional Item Response Theory in Clinical Measurement: A Bifactor Graded- Response Model Analysis of the Outcome- Questionnaire-45. Brigham Young University BYU ScholarsArchive All Theses and Dissertations 2012-05-22 Multidimensional Item Response Theory in Clinical Measurement: A Bifactor Graded- Response Model Analysis of the Outcome-

More information

A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests

A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests David Shin Pearson Educational Measurement May 007 rr0701 Using assessment and research to promote learning Pearson Educational

More information

THE NATURE OF OBJECTIVITY WITH THE RASCH MODEL

THE NATURE OF OBJECTIVITY WITH THE RASCH MODEL JOURNAL OF EDUCATIONAL MEASUREMENT VOL. II, NO, 2 FALL 1974 THE NATURE OF OBJECTIVITY WITH THE RASCH MODEL SUSAN E. WHITELY' AND RENE V. DAWIS 2 University of Minnesota Although it has been claimed that

More information

The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing

The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing Terry A. Ackerman University of Illinois This study investigated the effect of using multidimensional items in

More information

Sensitivity of DFIT Tests of Measurement Invariance for Likert Data

Sensitivity of DFIT Tests of Measurement Invariance for Likert Data Meade, A. W. & Lautenschlager, G. J. (2005, April). Sensitivity of DFIT Tests of Measurement Invariance for Likert Data. Paper presented at the 20 th Annual Conference of the Society for Industrial and

More information

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1 Nested Factor Analytic Model Comparison as a Means to Detect Aberrant Response Patterns John M. Clark III Pearson Author Note John M. Clark III,

More information

GENERALIZABILITY AND RELIABILITY: APPROACHES FOR THROUGH-COURSE ASSESSMENTS

GENERALIZABILITY AND RELIABILITY: APPROACHES FOR THROUGH-COURSE ASSESSMENTS GENERALIZABILITY AND RELIABILITY: APPROACHES FOR THROUGH-COURSE ASSESSMENTS Michael J. Kolen The University of Iowa March 2011 Commissioned by the Center for K 12 Assessment & Performance Management at

More information

INVESTIGATING FIT WITH THE RASCH MODEL. Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form

INVESTIGATING FIT WITH THE RASCH MODEL. Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form INVESTIGATING FIT WITH THE RASCH MODEL Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form of multidimensionality. The settings in which measurement

More information

Confirmatory Factor Analysis and Item Response Theory: Two Approaches for Exploring Measurement Invariance

Confirmatory Factor Analysis and Item Response Theory: Two Approaches for Exploring Measurement Invariance Psychological Bulletin 1993, Vol. 114, No. 3, 552-566 Copyright 1993 by the American Psychological Association, Inc 0033-2909/93/S3.00 Confirmatory Factor Analysis and Item Response Theory: Two Approaches

More information

The Patient-Reported Outcomes Measurement Information

The Patient-Reported Outcomes Measurement Information ORIGINAL ARTICLE Practical Issues in the Application of Item Response Theory A Demonstration Using Items From the Pediatric Quality of Life Inventory (PedsQL) 4.0 Generic Core Scales Cheryl D. Hill, PhD,*

More information

Mantel-Haenszel Procedures for Detecting Differential Item Functioning

Mantel-Haenszel Procedures for Detecting Differential Item Functioning A Comparison of Logistic Regression and Mantel-Haenszel Procedures for Detecting Differential Item Functioning H. Jane Rogers, Teachers College, Columbia University Hariharan Swaminathan, University of

More information

Using the Score-based Testlet Method to Handle Local Item Dependence

Using the Score-based Testlet Method to Handle Local Item Dependence Using the Score-based Testlet Method to Handle Local Item Dependence Author: Wei Tao Persistent link: http://hdl.handle.net/2345/1363 This work is posted on escholarship@bc, Boston College University Libraries.

More information

An item response theory analysis of Wong and Law emotional intelligence scale

An item response theory analysis of Wong and Law emotional intelligence scale Available online at www.sciencedirect.com Procedia Social and Behavioral Sciences 2 (2010) 4038 4047 WCES-2010 An item response theory analysis of Wong and Law emotional intelligence scale Jahanvash Karim

More information

Influences of IRT Item Attributes on Angoff Rater Judgments

Influences of IRT Item Attributes on Angoff Rater Judgments Influences of IRT Item Attributes on Angoff Rater Judgments Christian Jones, M.A. CPS Human Resource Services Greg Hurt!, Ph.D. CSUS, Sacramento Angoff Method Assemble a panel of subject matter experts

More information

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report Center for Advanced Studies in Measurement and Assessment CASMA Research Report Number 39 Evaluation of Comparability of Scores and Passing Decisions for Different Item Pools of Computerized Adaptive Examinations

More information

Information Structure for Geometric Analogies: A Test Theory Approach

Information Structure for Geometric Analogies: A Test Theory Approach Information Structure for Geometric Analogies: A Test Theory Approach Susan E. Whitely and Lisa M. Schneider University of Kansas Although geometric analogies are popular items for measuring intelligence,

More information

Description of components in tailored testing

Description of components in tailored testing Behavior Research Methods & Instrumentation 1977. Vol. 9 (2).153-157 Description of components in tailored testing WAYNE M. PATIENCE University ofmissouri, Columbia, Missouri 65201 The major purpose of

More information

The Classification Accuracy of Measurement Decision Theory. Lawrence Rudner University of Maryland

The Classification Accuracy of Measurement Decision Theory. Lawrence Rudner University of Maryland Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, April 23-25, 2003 The Classification Accuracy of Measurement Decision Theory Lawrence Rudner University

More information

Scaling TOWES and Linking to IALS

Scaling TOWES and Linking to IALS Scaling TOWES and Linking to IALS Kentaro Yamamoto and Irwin Kirsch March, 2002 In 2000, the Organization for Economic Cooperation and Development (OECD) along with Statistics Canada released Literacy

More information

Does factor indeterminacy matter in multi-dimensional item response theory?

Does factor indeterminacy matter in multi-dimensional item response theory? ABSTRACT Paper 957-2017 Does factor indeterminacy matter in multi-dimensional item response theory? Chong Ho Yu, Ph.D., Azusa Pacific University This paper aims to illustrate proper applications of multi-dimensional

More information

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida Vol. 2 (1), pp. 22-39, Jan, 2015 http://www.ijate.net e-issn: 2148-7456 IJATE A Comparison of Logistic Regression Models for Dif Detection in Polytomous Items: The Effect of Small Sample Sizes and Non-Normality

More information

Evaluating the quality of analytic ratings with Mokken scaling

Evaluating the quality of analytic ratings with Mokken scaling Psychological Test and Assessment Modeling, Volume 57, 2015 (3), 423-444 Evaluating the quality of analytic ratings with Mokken scaling Stefanie A. Wind 1 Abstract Greatly influenced by the work of Rasch

More information

Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida

Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida and Oleksandr S. Chernyshenko University of Canterbury Presented at the New CAT Models

More information

Introduction to Item Response Theory

Introduction to Item Response Theory Introduction to Item Response Theory Prof John Rust, j.rust@jbs.cam.ac.uk David Stillwell, ds617@cam.ac.uk Aiden Loe, bsl28@cam.ac.uk Luning Sun, ls523@cam.ac.uk www.psychometrics.cam.ac.uk Goals Build

More information

Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria

Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria Thakur Karkee Measurement Incorporated Dong-In Kim CTB/McGraw-Hill Kevin Fatica CTB/McGraw-Hill

More information

CYRINUS B. ESSEN, IDAKA E. IDAKA AND MICHAEL A. METIBEMU. (Received 31, January 2017; Revision Accepted 13, April 2017)

CYRINUS B. ESSEN, IDAKA E. IDAKA AND MICHAEL A. METIBEMU. (Received 31, January 2017; Revision Accepted 13, April 2017) DOI: http://dx.doi.org/10.4314/gjedr.v16i2.2 GLOBAL JOURNAL OF EDUCATIONAL RESEARCH VOL 16, 2017: 87-94 COPYRIGHT BACHUDO SCIENCE CO. LTD PRINTED IN NIGERIA. ISSN 1596-6224 www.globaljournalseries.com;

More information

Jason L. Meyers. Ahmet Turhan. Steven J. Fitzpatrick. Pearson. Paper presented at the annual meeting of the

Jason L. Meyers. Ahmet Turhan. Steven J. Fitzpatrick. Pearson. Paper presented at the annual meeting of the Performance of Ability Estimation Methods for Writing Assessments under Conditio ns of Multidime nsionality Jason L. Meyers Ahmet Turhan Steven J. Fitzpatrick Pearson Paper presented at the annual meeting

More information

Copyright. Kelly Diane Brune

Copyright. Kelly Diane Brune Copyright by Kelly Diane Brune 2011 The Dissertation Committee for Kelly Diane Brune Certifies that this is the approved version of the following dissertation: An Evaluation of Item Difficulty and Person

More information

ABERRANT RESPONSE PATTERNS AS A MULTIDIMENSIONAL PHENOMENON: USING FACTOR-ANALYTIC MODEL COMPARISON TO DETECT CHEATING. John Michael Clark III

ABERRANT RESPONSE PATTERNS AS A MULTIDIMENSIONAL PHENOMENON: USING FACTOR-ANALYTIC MODEL COMPARISON TO DETECT CHEATING. John Michael Clark III ABERRANT RESPONSE PATTERNS AS A MULTIDIMENSIONAL PHENOMENON: USING FACTOR-ANALYTIC MODEL COMPARISON TO DETECT CHEATING BY John Michael Clark III Submitted to the graduate degree program in Psychology and

More information

MCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and. Lord Equating Methods 1,2

MCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and. Lord Equating Methods 1,2 MCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and Lord Equating Methods 1,2 Lisa A. Keller, Ronald K. Hambleton, Pauline Parker, Jenna Copella University of Massachusetts

More information

EFFECTS OF OUTLIER ITEM PARAMETERS ON IRT CHARACTERISTIC CURVE LINKING METHODS UNDER THE COMMON-ITEM NONEQUIVALENT GROUPS DESIGN

EFFECTS OF OUTLIER ITEM PARAMETERS ON IRT CHARACTERISTIC CURVE LINKING METHODS UNDER THE COMMON-ITEM NONEQUIVALENT GROUPS DESIGN EFFECTS OF OUTLIER ITEM PARAMETERS ON IRT CHARACTERISTIC CURVE LINKING METHODS UNDER THE COMMON-ITEM NONEQUIVALENT GROUPS DESIGN By FRANCISCO ANDRES JIMENEZ A THESIS PRESENTED TO THE GRADUATE SCHOOL OF

More information

Building Evaluation Scales for NLP using Item Response Theory

Building Evaluation Scales for NLP using Item Response Theory Building Evaluation Scales for NLP using Item Response Theory John Lalor CICS, UMass Amherst Joint work with Hao Wu (BC) and Hong Yu (UMMS) Motivation Evaluation metrics for NLP have been mostly unchanged

More information

On indirect measurement of health based on survey data. Responses to health related questions (items) Y 1,..,Y k A unidimensional latent health state

On indirect measurement of health based on survey data. Responses to health related questions (items) Y 1,..,Y k A unidimensional latent health state On indirect measurement of health based on survey data Responses to health related questions (items) Y 1,..,Y k A unidimensional latent health state A scaling model: P(Y 1,..,Y k ;α, ) α = item difficulties

More information

Item-Rest Regressions, Item Response Functions, and the Relation Between Test Forms

Item-Rest Regressions, Item Response Functions, and the Relation Between Test Forms Item-Rest Regressions, Item Response Functions, and the Relation Between Test Forms Dato N. M. de Gruijter University of Leiden John H. A. L. de Jong Dutch Institute for Educational Measurement (CITO)

More information

Modeling DIF with the Rasch Model: The Unfortunate Combination of Mean Ability Differences and Guessing

Modeling DIF with the Rasch Model: The Unfortunate Combination of Mean Ability Differences and Guessing James Madison University JMU Scholarly Commons Department of Graduate Psychology - Faculty Scholarship Department of Graduate Psychology 4-2014 Modeling DIF with the Rasch Model: The Unfortunate Combination

More information

The Impact of Item Sequence Order on Local Item Dependence: An Item Response Theory Perspective

The Impact of Item Sequence Order on Local Item Dependence: An Item Response Theory Perspective Vol. 9, Issue 5, 2016 The Impact of Item Sequence Order on Local Item Dependence: An Item Response Theory Perspective Kenneth D. Royal 1 Survey Practice 10.29115/SP-2016-0027 Sep 01, 2016 Tags: bias, item

More information

COMPARING THE DOMINANCE APPROACH TO THE IDEAL-POINT APPROACH IN THE MEASUREMENT AND PREDICTABILITY OF PERSONALITY. Alison A. Broadfoot.

COMPARING THE DOMINANCE APPROACH TO THE IDEAL-POINT APPROACH IN THE MEASUREMENT AND PREDICTABILITY OF PERSONALITY. Alison A. Broadfoot. COMPARING THE DOMINANCE APPROACH TO THE IDEAL-POINT APPROACH IN THE MEASUREMENT AND PREDICTABILITY OF PERSONALITY Alison A. Broadfoot A Dissertation Submitted to the Graduate College of Bowling Green State

More information

Nonparametric DIF. Bruno D. Zumbo and Petronilla M. Witarsa University of British Columbia

Nonparametric DIF. Bruno D. Zumbo and Petronilla M. Witarsa University of British Columbia Nonparametric DIF Nonparametric IRT Methodology For Detecting DIF In Moderate-To-Small Scale Measurement: Operating Characteristics And A Comparison With The Mantel Haenszel Bruno D. Zumbo and Petronilla

More information

Differential Item Functioning from a Compensatory-Noncompensatory Perspective

Differential Item Functioning from a Compensatory-Noncompensatory Perspective Differential Item Functioning from a Compensatory-Noncompensatory Perspective Terry Ackerman, Bruce McCollaum, Gilbert Ngerano University of North Carolina at Greensboro Motivation for my Presentation

More information

The effects of ordinal data on coefficient alpha

The effects of ordinal data on coefficient alpha James Madison University JMU Scholarly Commons Masters Theses The Graduate School Spring 2015 The effects of ordinal data on coefficient alpha Kathryn E. Pinder James Madison University Follow this and

More information

Incorporating Measurement Nonequivalence in a Cross-Study Latent Growth Curve Analysis

Incorporating Measurement Nonequivalence in a Cross-Study Latent Growth Curve Analysis Structural Equation Modeling, 15:676 704, 2008 Copyright Taylor & Francis Group, LLC ISSN: 1070-5511 print/1532-8007 online DOI: 10.1080/10705510802339080 TEACHER S CORNER Incorporating Measurement Nonequivalence

More information

Exploring rater errors and systematic biases using adjacent-categories Mokken models

Exploring rater errors and systematic biases using adjacent-categories Mokken models Psychological Test and Assessment Modeling, Volume 59, 2017 (4), 493-515 Exploring rater errors and systematic biases using adjacent-categories Mokken models Stefanie A. Wind 1 & George Engelhard, Jr.

More information

A DIFFERENTIAL RESPONSE FUNCTIONING FRAMEWORK FOR UNDERSTANDING ITEM, BUNDLE, AND TEST BIAS ROBERT PHILIP SIDNEY CHALMERS

A DIFFERENTIAL RESPONSE FUNCTIONING FRAMEWORK FOR UNDERSTANDING ITEM, BUNDLE, AND TEST BIAS ROBERT PHILIP SIDNEY CHALMERS A DIFFERENTIAL RESPONSE FUNCTIONING FRAMEWORK FOR UNDERSTANDING ITEM, BUNDLE, AND TEST BIAS ROBERT PHILIP SIDNEY CHALMERS A DISSERTATION SUBMITTED TO THE FACULTY OF GRADUATE STUDIES IN PARTIAL FULFILMENT

More information

PROMIS ANXIETY AND MOOD AND ANXIETY SYMPTOM QUESTIONNAIRE PROSETTA STONE ANALYSIS REPORT A ROSETTA STONE FOR PATIENT REPORTED OUTCOMES

PROMIS ANXIETY AND MOOD AND ANXIETY SYMPTOM QUESTIONNAIRE PROSETTA STONE ANALYSIS REPORT A ROSETTA STONE FOR PATIENT REPORTED OUTCOMES PROSETTA STONE ANALYSIS REPORT A ROSETTA STONE FOR PATIENT REPORTED OUTCOMES PROMIS ANXIETY AND MOOD AND ANXIETY SYMPTOM QUESTIONNAIRE SEUNG W. CHOI, TRACY PODRABSKY, NATALIE MCKINNEY, BENJAMIN D. SCHALET,

More information

Issues That Should Not Be Overlooked in the Dominance Versus Ideal Point Controversy

Issues That Should Not Be Overlooked in the Dominance Versus Ideal Point Controversy Industrial and Organizational Psychology, 3 (2010), 489 493. Copyright 2010 Society for Industrial and Organizational Psychology. 1754-9426/10 Issues That Should Not Be Overlooked in the Dominance Versus

More information
