An Introduction to Modern Measurement Theory

This tutorial was written as an introduction to the basics of item response theory (IRT) modeling and its applications to health outcomes measurement for the National Cancer Institute's Cancer Outcomes Measurement Working Group (COMWG). This tutorial is in no way meant to replace any text on measurement theory; it is intended only as a stepping-stone for health care researchers to learn this methodology in a framework that may be more appealing to their research field. An illustration of IRT modeling is provided in the appendix and can provide better insight into the utility of these methods. This tutorial is only a draft version that may undergo many revisions. Please feel free to comment or ask questions of the author of this piece:

Bryce B. Reeve, Ph.D.
Outcomes Research Branch
Applied Research Program
Division of Cancer Control and Population Sciences
National Cancer Institute
Tel: (301) 594-6574
Fax: (301) 435-3710
Email: reeveb@mail.nih.gov

Outline

Reference Terms
An Introduction to Modern Measurement Theory
A Brief History of Item Response Theory
Item Response Theory Basics
Assumptions of Item Response Theory Models
Item Response Theory Models
    Two Theories towards Measurement
    The IRT Models
    The Rasch Simple Logistic Model
    The One-Parameter Logistic Model
    The Two-Parameter Logistic Model
    The Three-Parameter Logistic Model
    The Graded Model
    The Nominal Model
    The Partial Credit Model
    The Rating Scale Model
Trait Scoring
Classical Test Theory and Item Response Theory
Applications of Item Response Theory in Research
    Item and Scale Analysis
    Differential Item Functioning
    Instrument Equating and Computerized Adaptive Tests
Conclusion
References
Illustration of IRT Modeling (appendix)

Reference: Some Terms You Will See in this Tutorial as well as in the Literature

Assumption of Local Independence: A response to a question is independent of responses to other questions in a scale after controlling for the latent trait (construct) measured by the scale.

Assumption of Unidimensionality: The set of questions measures a single continuous latent variable (construct).

Construct (a.k.a. trait, domain, ability, latent variable, or theta): See the definition of theta below.

Information Function (for a scale or item): An index, typically displayed in a graph, indicating the range of trait level θ over which an item or test is most useful for distinguishing among individuals. The information function characterizes the precision of measurement for persons at different levels of the underlying latent construct, with higher information denoting more precision.

Item: A question in a scale.

Item Characteristic Curve (ICC, a.k.a. item response function, IRF, or item trace line): The ICC models the relationship between a person's probability of endorsing an item category and the person's level on the construct measured by the scale.

Item Difficulty (Threshold) Parameter b: The point on the latent scale θ where a person has a 50% chance of responding positively to the scale item.

Item Discrimination (Slope) Parameter a: Describes the strength of an item's discrimination between people with trait levels (θ) below and above the threshold b. The a parameter may also be interpreted as describing how an item may be related to the trait measured by the scale.

Scale (measure, questionnaire, or test): A scale in this tutorial is assumed to measure a single construct or domain.

Slope Parameter: See Item Discrimination.

Test Characteristic Curve (TCC, a.k.a. Test Response Function): The TCC describes the expected number of scale items endorsed as a function of the underlying latent variable.

Theta (θ): The unobservable (or latent) construct being measured by the questionnaire. These constructs or traits are measured along a continuous scale. Examples of constructs in health outcomes measurement are depression level, physical functioning, anxiety, or social support.

Threshold Parameter: See Item Difficulty.

An Introduction to Modern Measurement Theory

Each year new health outcomes measures are developed or revised from previous measures in the hope of obtaining instruments that are more reliable, valid, sensitive, and interpretable. This increasing need for psychometrically sound measures calls for better analytical tools beyond what traditional measurement theory (or classical test theory, CTT) methods can provide. In the past decade, applications of item response theory (IRT) in health research measurement have increased considerably because of its utility in item and scale analysis, scale scoring, instrument linking, and adaptive testing. IRT is a model-based measurement in which trait level estimates (e.g., level of physical functioning or level of depression) depend on both persons' responses and on the properties of the questions that were administered (Embretson and Reise, 2000).

IRT has a number of advantages over CTT methods for assessing health outcomes. CTT statistics such as item difficulty (proportion of correct responses), item discrimination (corrected item-total correlation), and reliability are contingent on the sample of respondents to whom the questions were administered. IRT item parameters are not dependent on the sample used to generate the parameters, and are assumed to be invariant (within a linear transformation) across divergent groups within a research population and across populations. In addition, CTT yields only a single estimate of reliability and corresponding standard error of measurement, whereas IRT models measure scale precision across the underlying latent variable being measured by the instrument (Cooke & Michie, 1997; Hays, Morales, & Reise, 2000). A further disadvantage of CTT methods is that a participant's score is dependent on the set of questions used for analysis, whereas an IRT-estimated person's trait level is independent of the questions being used.

Because the expected participant's scale score is computed from their responses to each item (each of which is characterized by a set of properties), the IRT-estimated score is sensitive to differences among individual response patterns and is a better estimate of the individual's true level on the trait continuum than CTT's summed scale score (Santor & Ramsay, 1998).

Because of these advantages, IRT is being applied in health outcomes research to develop new measures or improve existing measures, to investigate group differences in item and scale functioning, to equate scales for cross-walking patient scores, and to develop computerized adaptive tests. This handbook provides an overview of IRT and the more commonly used models in health outcomes research, along with a discussion of the applications of these models to develop valid, reliable, sensitive, and feasible endpoint measures.

A Brief History of Item Response Theory (incomplete)

While many think of item response theory as modern psychometric theory, the concepts and methodology of IRT have been developed over more than three-quarters of a century. L. L. Thurstone (1925) laid down the conceptual foundation for IRT in his paper entitled "A Method of Scaling Psychological and Educational Tests". In it, he provides a technique for placing the items of the Binet and Simon (1905) test of children's mental development on an age-graded scale. Plots of the proportions of children in successive age cross-sections succeeding on successive Binet tasks and the effective location of each item on chronological age reflect many of the features suggestive of IRT (Bock, 1997). Thurstone dropped his work in measurement to pursue the development of multiple factor analysis, but his colleagues and students continued to refine the theoretical bases of IRT (Steinberg & Thissen, in draft). Richardson (1936) and Ferguson (1943) introduced the normal ogive model as a means to display the proportions correct for individual items as a function of

normalized scores. Lawley (1943) extended the statistical analysis of the properties of the normal ogive curve and described maximum-likelihood estimation procedures for the item parameters and linear approximations to those estimates. Fred Lord (1952) introduced the idea of a latent trait or ability and differentiated this construct from the observed test score. Lazarsfeld (1950) described the unobserved variable as accounting for the observed interrelationships among the item responses.

Considered a milestone in psychometrics (Embretson & Reise, 2000), Lord and Novick's (1968) textbook entitled Statistical Theories of Mental Test Scores provides a rigorous and unified statistical treatment of classical test theory. The remaining half of the book, written by Allan Birnbaum, provides an equally solid description of the IRT models. Bock and several student collaborators at the University of Chicago, including David Thissen, Eiji Muraki, Richard Gibbons, and Robert Mislevy, developed effective estimation methods and computer programs such as Bilog, Multilog, Parscale, and Testfact. Along with Aitkin (Bock & Aitkin, 1981), Bock developed the marginal maximum likelihood algorithm for estimating the item parameters that is used in many of these IRT programs.

In a separate line of development of IRT models, Georg Rasch (1960) discussed the need for creating statistical models that maintain the property of specific objectivity, the idea that person and item parameters be estimated separately but be comparable on a similar metric. Rasch inspired Gerhard Fischer (1968) to extend the applicability of the Rasch models into psychological measurement, and Ben Wright to teach these methods and inspire other students to develop the Rasch models further. These students, including David Andrich, Geoffrey Masters, Graham Douglas, and Mark Wilson, helped to push the methodology into education and behavioral medicine (Wright, 1997).

Item Response Theory Model Basics

IRT is a model for expressing the association between an individual's response to an item and the underlying latent variable (often called "ability" or "trait") being measured by the instrument. The underlying latent variable in health research may be any measurable construct such as physical functioning, risk for cancer, or depression. The latent variable, expressed as theta (θ), is a continuous unidimensional construct that explains the covariance among item responses (Steinberg & Thissen, 1995). People at higher levels of θ have a higher probability of responding correctly or endorsing an item.

IRT models use item responses to obtain scaled estimates of θ, as well as to calibrate items and examine their properties (Mellenbergh, 1994). Each item is characterized by one or more model parameters. The item difficulty, or threshold, parameter b is the point on the latent scale θ where a person has a 50% chance of responding positively to the scale item (question). Items with high thresholds are less often endorsed (Steinberg & Thissen, 1995). The slope, or discrimination, parameter a describes the strength of an item's discrimination between people with trait levels (θ) below and above the threshold b. The a parameter may also be interpreted as describing how an item may be related to the trait measured by the scale and is directly related, under the assumption of a normal θ distribution, to the biserial item-test correlation ρ (Linden & Hambleton, 1997). For item i the relationship is:

$$a_i = \frac{\rho_i}{\sqrt{1 - \rho_i^2}}$$

The slope parameter is often thought of as, and is linearly related (under some conditions) to, the variable loading in a factor analysis. Some IRT models, in education research, include a lower-asymptote parameter or guessing parameter c to possibly explain why people at low levels of the trait θ respond positively to an item.

To model the relation of the probability of a correct response to an item conditional on the latent variable θ, trace lines, estimated from the item parameters, are plotted. Most IRT models in research assume that the normal ogive or logistic function describes this relationship accurately and fits the data. The logistic function is similar to the normal ogive function, is mathematically simpler to use and, as a result, is predominantly used in research. The trace line (sometimes called the item characteristic curve, ICC) can be viewed as the regression of item score on the underlying variable θ (Lord, 1980, p. 34). The left graph in Figure 1 models the probability of endorsing an item conditional on the level of the underlying trait.

[Figure 1: Left panel is an IRT item characteristic curve (trace line) for one item, plotting P(X) against the underlying latent variable (θ). Right panel is the test characteristic curve for thirty items, plotted against the underlying latent variable (θ).]

The higher a person's trait level (moving from left to right along the θ scale), the greater the probability that the person will endorse the item. For example, if a question asked, "Are you unhappy most of the day?", then the left graph of Figure 1 shows that people with higher levels of depression (θ) will have higher probabilities of answering "yes" to the question. For dichotomous items, the probability of a negative response is high for low values of the underlying variable being measured, and decreases for higher levels of θ. The collection of the item trace lines forms the

scale; thus, the sum of the probabilities of the correct response of the item trace lines yields the test characteristic curve (TCC). The TCC describes the expected number of scale items endorsed as a function of the underlying latent variable. The right graph of Figure 1 presents a TCC for 30 items. When the sum of the probabilities is divided by the number of items, the TCC gives the average probability or expected proportion correct as a function of the underlying construct (Weiss, 1995).

Another important feature of IRT models is the information function, an index indicating the range of trait level θ over which an item or test is most useful for distinguishing among individuals. In other words, the information function characterizes the precision of measurement for persons at different levels of the underlying latent construct, with higher information denoting more precision. Graphs of the information function place persons' trait level on the horizontal x-axis, and amount of information on the vertical y-axis (left graph in Figure 2).

[Figure 2: Left panel is an item information curve, plotting information against the underlying latent variable (θ). Right panel is a test information curve with approximate test reliability marked at r = .80, .90, .93, .95, .96, and .97.]

The shape of the item information function is dependent on the item parameters. The higher the item's discrimination, the more peaked the information function will be; thus, higher discrimination parameters provide more information about individuals whose trait levels (θ) lie near the item's threshold value. The item's difficulty parameter(s) determines where the item

information function is located (Flannery, Reise, & Widaman, 1995). With the assumption of local independence (reviewed below), the item information values can be summed across all of the items in the scale to form the test information curve (Lord, 1980).

At each level of the underlying trait θ, the information function is approximately equal to the expected value of the inverse of the squared standard errors of the θ-estimates (Lord, 1980). The smaller the standard error of measurement (SEM), the more information or precision the scale provides about θ. For example, if a measure has a test information value of 16 at θ = 2.0, then examinee scores at this trait level have a standard error of measurement of $1/\sqrt{16} = .25$, indicating good precision (reliability approximately .94, that is, $1 - .25^2$ when θ is scaled to unit variance) at that level of theta (Flannery et al., 1995). The right graph in Figure 2 presents a test information function. Most information (precision in measurement) is contained within the middle of the scale (-1.0 < θ < 1.5), with less reliability (labeled r in the graph) at the high and low ends of the underlying trait.

To observe the conditional standard error of measurement for a given scale, the inverse of the square root of the test information function across all levels of the θ continuum is graphed.

[Figure 3: Test standard error of measurement plotted against θ.]

Figure 3 presents the test SEM, with little error (more precision) in measurement for middle to high levels of the underlying trait, and high error in measuring respondents with low levels of θ. If the scale, presented in Figures 2 and 3, measures physical functioning from poor (θ = -3) to

high (θ = +3), then the scale lacks precision in measuring physically disabled patients (θ < -1.5), but adequately captures the physical functioning of average to healthy (ambulatory) individuals.

Scale scoring in item response theory has a major advantage over classical test theory. In classical test theory, the summed scale score is dependent on the difficulty of the items used in the selected scale and is therefore not an accurate measure of a person's trait level. The procedure assumes that equal ratings on each item of the scale represent equal levels of the underlying trait (Cooke & Michie, 1997). Item response theory, on the other hand, estimates individual latent trait level scores based on all the information in a participant's response pattern. That is, IRT takes into consideration which items were answered correctly (positively) and which ones were answered incorrectly, and utilizes the difficulty and discrimination parameters of the items when estimating trait levels (Weiss, 1995). Persons with the same summed score but different response patterns may have different IRT-estimated latent scores. One person may answer more of the highly discriminating and difficult items and receive a higher latent score than one who answers the same number of items with low discrimination or difficulty.

IRT trait level estimation uses the item response curves associated with the individual's response pattern. A statistical procedure, such as maximum likelihood estimation, finds the maximum of a likelihood function created from the product of the population distribution with the individual's trace curves associated with each item's right or wrong response. A full discussion of trait scoring follows the overview of the IRT models.

Assumptions of Item Response Theory Models

The IRT model is based on the assumption that the items are measuring a single continuous latent variable θ ranging from -∞ to +∞. The unidimensionality of a scale can be evaluated by performing an item-level factor analysis, designed to evaluate the factor structure

underlying the observed covariation among item responses. The assumption can be examined by comparing the ratio of the first to the second eigenvalue for each scaled matrix of tetrachoric correlations. This ratio is an index of the strength of the first dimension of the data. Similarly, another indication of unidimensionality is that the first factor accounts for a substantial proportion of the matrix variance (Lord, 1980; Reise & Waller, 1990). For tests using many items, the assumption of unidimensionality may be unrealistic; however, Cooke and Michie (1997) report that IRT models are moderately robust to departures from unidimensionality. If multidimensionality exists, the investigator may want to consider dividing the test into subtests based on both theory and the factor structure provided by the item-level factor analysis. Multidimensional IRT models do exist, but these models, as well as informative documentation and user-friendly software, are still in development.

In the IRT model, the item responses are assumed to be independent of one another: the assumption of local independence. The only relationship among the items is explained by the conditional relationship with the latent variable θ. In other words, local independence means that if the trait level is held constant, there should be no association among the item responses (Thissen & Steinberg, 1988). Violation of this assumption may result in parameter estimates that are different from what they would be if the data were locally independent; thus, selecting items for scale construction based on these estimates may lead to erroneous decisions (Chen & Thissen, 1997). The assumptions of unidimensionality and local independence are related in that items found to be locally dependent will appear as a separate dimension in a factor analysis.

For some IRT models, the latent variable (not the data response distribution) is assumed to be normally distributed within the population. Without this assumption, estimates of θ for

some response patterns (e.g., respondents who do not endorse any of the scale items) have no finite values, resulting in unstable parameter estimates (Chen & Thissen, 1997).

Item Response Theory Models

Two Theories towards Measurement

Thissen and Orlando (2001) discuss two approaches to model building in item response theory. One approach is to develop a well-fitting model to reflect the item response data by parameterizing the ability or trait of interest as well as the properties of the items. The goal of this approach is item analysis. The model should reflect the properties of the item response data sufficiently and accurately, so that the behavior of the item is summarized by the item parameters. The philosophy is that the items are assumed to measure as they do, not as they should (Thissen & Orlando, 2001). This approach to model building holds that the purpose of a theory of measurement is to explain (i.e., model) the data.

Another approach to IRT model building is to obtain specific measurement properties defined by the model, to which the item response data must fit. If an item or a person does not fit within the measurement properties of the IRT model, as assessed by analysis of residuals (i.e., item and person fit statistics), the item or person is discarded. This approach follows that of the Rasch (1960) models and, in the cases where the data fit the model, offers a simple interpretation for item analysis and scale scoring. This approach to model building holds that optimal measurement is defined mathematically, and then the class of item response models that yield such measurement is derived.

The two approaches described above yield a division in psychometrics. Those who believe health research measurement should be about describing the behaviors behind the response patterns in a survey will use the most appropriate IRT model (e.g., Rasch/One-Parameter Logistic Model, Two-Parameter Logistic Model, Graded Model) to fit the data. The choice of the IRT model is data dependent. Researchers from the Rasch tradition believe that the only appropriate models to use are the Rasch family of models, which retain strong mathematical properties such as specific objectivity (person parameters and item parameters estimated separately) and summed score simple sufficiency (no information from the response pattern is needed) (see model descriptions below). Advantages of the Rasch model include its ability to produce more stable estimates of person and item properties when there is a small number of respondents, when extremely non-representative samples are used, and when the population distribution over the underlying trait is heavily skewed.

Embretson and Reise (2000) suggest one should use the Rasch family of models when each item carries equal weight (i.e., each item is equally important) in defining the underlying variable, and when strong measurement model properties (i.e., specific objectivity, simple sufficiency) are desired. If one desires to fit an IRT model to existing data or desires highly accurate parameter estimates, then a more complex model such as the Two-Parameter Logistic Model or Graded Model should be used.

The IRT Models

Table 1 presents seven common IRT models with potential application to health-related research. The table also indicates whether the model is appropriate for dichotomous (binary) or polytomous (3 or more options) responses, and some characteristics associated with each model. Models noted with an asterisk are part of the Rasch family of models and, therefore, retain the unique properties of summed score simple sufficiency and specific objectivity. Each of the models is discussed below. Because of the separate development of the Rasch and One-Parameter Logistic models, they are discussed individually.
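The contrast between summed score sufficiency in the Rasch family and pattern-dependent scoring in models with varying slopes can be illustrated with a small Python sketch. It is not taken from this tutorial: the four items, their thresholds and slopes, and the function names `trace` and `ml_theta` are hypothetical, and the trait estimate is a plain maximum likelihood estimate found by bisection, with no population distribution included.

```python
import math


def trace(theta, a, b):
    """Logistic trace line: probability of a positive response at trait level theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def ml_theta(responses, slopes, thresholds, lo=-6.0, hi=6.0, tol=1e-9):
    """Maximum likelihood estimate of theta, found by bisection.

    The derivative of the log-likelihood is sum(a_i * (u_i - P_i(theta))).
    It decreases as theta increases, so its root is the ML estimate.
    Assumes a mixed pattern (not all 0s or all 1s) with a root inside [lo, hi].
    """
    def score(theta):
        return sum(a * (u - trace(theta, a, b))
                   for u, a, b in zip(responses, slopes, thresholds))

    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if score(mid) > 0:
            lo = mid   # likelihood still rising: estimate lies to the right
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical four-item scale (values invented for illustration).
thresholds = [-1.0, 0.0, 1.0, 2.0]
equal_slopes = [1.0, 1.0, 1.0, 1.0]      # Rasch-type: same discrimination
unequal_slopes = [0.5, 1.0, 1.5, 2.0]    # two-parameter logistic type

# Two response patterns with the same summed score of 2.
pattern_easy = [1, 1, 0, 0]   # endorsed the two lowest-threshold items
pattern_hard = [0, 0, 1, 1]   # endorsed the two highest-threshold items

for label, slopes in [("equal slopes", equal_slopes),
                      ("unequal slopes", unequal_slopes)]:
    t_easy = ml_theta(pattern_easy, slopes, thresholds)
    t_hard = ml_theta(pattern_hard, slopes, thresholds)
    print(f"{label}: theta(easy) = {t_easy:.3f}, theta(hard) = {t_hard:.3f}")
```

With equal slopes the two patterns receive the same estimate, because only the summed score enters the likelihood equation. With unequal slopes the respondent who endorsed the more discriminating, higher-threshold items receives the higher estimate.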

Table 1: Seven common IRT models (* = belongs to the Rasch family)

Model                                          Item Response Format   Model Characteristics
Rasch Model* / One-Parameter Logistic Model    Dichotomous            Discrimination power equal across all items. Threshold varies across items.
Two-Parameter Logistic Model                   Dichotomous            Discrimination and threshold parameters vary across items.
Three-Parameter Logistic Model                 Dichotomous            Includes pseudo-guessing parameter.
Graded Model                                   Polytomous             Ordered responses. Discrimination varies across items.
Nominal Model                                  Polytomous             No pre-specified item order. Discrimination varies across items.
Partial Credit Model*                          Polytomous             Discrimination power constrained to be equal across items.
Rating Scale Model*                            Polytomous             Discrimination equal across items. Item threshold steps equal across items.

The Rasch Simple Logistic Model

Rasch (1960) was the first to develop the one-parameter logistic model (sometimes referred to as the simple logistic model); however, this model differed from the models discussed below. In the Rasch model, a person is characterized by a level on a latent trait ξ, and an item is characterized by a degree of difficulty δ. The probability of an item endorsement is a function of the ratio of a person's level on the trait to the item difficulty, ξ/δ (Tinsley, 1992). Given that the data adequately fit the Rasch model, one can make simple comparisons of the items and respondents according to the principles of specific objectivity. Specific objectivity means that the comparison of two items' difficulty parameters is assumed to be independent of any group of subjects being surveyed, and the comparison of two subjects' trait levels does not depend on any subset of items being administered (Mellenbergh, 1994). The Rasch model assumes that the items are all equal in discrimination (weight equally on a factor) and that chance factors (e.g., guessing) do not influence the response. For a particular item, Rasch proposed a simple trace line (probability) function, which increases from zero to one with trait level, as:

$$T = \frac{\xi}{\xi + \delta}$$

(Thissen & Orlando, 2001). The model in this form has the interpretation of the probability of a positive response being equal to the value of the person parameter ξ relative to the value of the item parameter δ (Linden & Hambleton, 1997). If we use current item response theory notation, substituting exp θ for ξ and exp b for δ, we have:

$$T = \frac{\exp\theta}{\exp\theta + \exp b} = \frac{1}{1 + \exp[-(\theta - b)]}$$

(Thissen & Orlando, 2001). As before, theta (θ) represents a person's trait level, and b represents the item threshold. This model shows the dependent variable, the probability of endorsing an item, as a function of the difference between two independent variables, the person's level on the underlying trait θ and the item threshold b (difficulty). Rasch constrained the sum of the difficulty parameters for all scale items to be equal to zero ($\sum_i b_i = 0$), thus setting the scale of the θ parameter. Given this constraint, the population distribution of θ is unspecified. The distribution "has some mean, relative to the average item difficulty, and some variance, relative to the unit slope of the trace lines... The shape of the population distribution [of θ] is unknown; it is whatever shape it has to be to produce the observed score distribution" (Thissen & Orlando, 2001, p. 76-77).

Figure 4 presents seven item trace lines for varying ranges of threshold parameters (-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5). Items with higher threshold parameters are less often endorsed and require high levels of the underlying trait θ to endorse the item.
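The two forms of the Rasch trace line can be checked numerically. The following is a minimal sketch, not code from the source: the function names are hypothetical, and the respondent at θ = 0 is an arbitrary choice made only to show how endorsement probability falls as the threshold rises across the seven thresholds listed for Figure 4.

```python
import math


def rasch_trace(theta, b):
    """Trace line in current IRT notation: 1 / (1 + exp[-(theta - b)])."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def rasch_trace_ratio(xi, delta):
    """Rasch's original form xi / (xi + delta), with xi = exp(theta), delta = exp(b)."""
    return xi / (xi + delta)


# Thresholds of the seven items described for Figure 4.
thresholds = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]

# Endorsement probabilities for a respondent at theta = 0 (arbitrary, for illustration).
theta = 0.0
probabilities = [rasch_trace(theta, b) for b in thresholds]

for b, p in zip(thresholds, probabilities):
    print(f"b = {b:+.1f}  P(endorse | theta = 0) = {p:.3f}")
```

At θ = b the probability is one half, and the printed probabilities decrease as the threshold increases, which is the pattern the text describes for items with higher thresholds.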

[Figure 4. Rasch / One-Parameter Logistic IRT Model (7 items). Horizontal axis: θ from -3 to 3. Vertical axis: P(X) from 0.0 to 1.0.]

The One-Parameter Logistic Model

The development of the Rasch model was independent of the development of the one-parameter logistic model, but both have similar features and are mathematically equivalent. The one-parameter logistic (1PL) model trace line for a given item i is:

$$T(u_i = 1 \mid \theta) = \frac{1}{1 + \exp[-a(\theta - b_i)]}.$$

$T(u_i = 1 \mid \theta)$ traces the conditional probability of a positive ($u_i = 1$) response to item i as a function of the trait parameter θ, the threshold or difficulty parameter $b_i$, and the discrimination parameter a. Where the Rasch model had a fixed slope of one for all items, the 1PL model only requires the slope to be equal for all items (Thissen & Orlando, 2001).

The population distribution of the underlying variable θ for the one-parameter logistic model (as well as the two- and three-parameter logistic models) is usually specified to have a population mean of zero and a variance of one. The threshold (or difficulty) parameters $b_i$ are located relative to zero, which is the average trait level in the population, and the slope parameter a takes some value relative to the unit standard deviation of the latent variable (Thissen & Orlando, 2001). Thus, it is the latent variable θ that the model assumes to be normally distributed, not the categorical item responses (Thissen & Steinberg, 1988).
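One way to see the equivalence of the two models is a change of scale. The short derivation below is a sketch under the notation used above; the starred symbols θ* and b_i* are introduced here only for illustration and do not appear in the source.

```latex
% Rescale the 1PL trait and thresholds by the common slope a:
%   \theta^{*} = a\,\theta, \qquad b_i^{*} = a\,b_i .
\begin{align*}
T(u_i = 1 \mid \theta)
  &= \frac{1}{1 + \exp[-a(\theta - b_i)]} \\
  &= \frac{1}{1 + \exp[-(a\theta - a b_i)]} \\
  &= \frac{1}{1 + \exp[-(\theta^{*} - b_i^{*})]} .
\end{align*}
% The last line is the Rasch trace line with unit slope. If \theta has
% standard deviation 1, then \theta^{*} has standard deviation a, so the
% common slope of the 1PL model reappears as the spread of the trait
% distribution on the Rasch scale.
```

Under this reading, the two models describe the same response probabilities and differ only in whether the unit of the scale is fixed by the slope (Rasch) or by the population standard deviation (1PL).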

The Two-Parameter Logistic Model

The two-parameter logistic model (2PL; Birnbaum, 1968) allows the slope or discrimination parameter a to vary across items instead of being constrained to be equal as in the one-parameter logistic or Rasch model. The relative importance of the difference between a person's trait level and item threshold is determined by the magnitude of the discriminating power of the item (Embretson & Reise, 2000). The two-parameter logistic model trace line for the probability of a positive response to item i for a person with latent trait level θ is:

$$T(u_i = 1 \mid \theta) = \frac{1}{1 + \exp[-1.7\,a_i(\theta - b_i)]}.$$

The constant, 1.7, is added to the model as an adjustment so that the logistic model approximates the normal ogive model. Approximately half of the literature includes the adjustment and half does not (Thissen & Steinberg, 1988).

Figure 5 presents 2PL trace lines for five items with varying threshold/difficulty b and discrimination a parameters. The item marked by a dashed line represents an item with little relationship with the underlying variable being measured by the survey. Items with steeper sloped trace lines have more discriminating power.

[Figure 5. Two-Parameter Logistic Trace Lines (5 items). Horizontal axis: θ from -3 to 3. Vertical axis: P(X) from 0.0 to 1.0.]
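The role of the slope and of the 1.7 constant can be illustrated with a short sketch. This is not the authors' code: the function names, the two example items, and the evaluation grid are assumptions made for illustration. The normal ogive is computed from the standard normal CDF via `math.erf`.

```python
import math

D = 1.7  # scaling constant discussed in the text


def trace_2pl(theta, a, b, scale=D):
    """2PL trace line: 1 / (1 + exp[-scale * a * (theta - b)])."""
    return 1.0 / (1.0 + math.exp(-scale * a * (theta - b)))


def normal_ogive(theta, a, b):
    """Normal ogive counterpart: standard normal CDF at a * (theta - b)."""
    return 0.5 * (1.0 + math.erf(a * (theta - b) / math.sqrt(2.0)))


def largest_gap(a=1.0, b=0.0):
    """Largest absolute difference between the two curves on a grid from -4 to 4."""
    grid = [i / 100.0 for i in range(-400, 401)]
    return max(abs(trace_2pl(t, a, b) - normal_ogive(t, a, b)) for t in grid)


def rise_near_threshold(item, half_width=0.5):
    """Change in endorsement probability across a window centered on the threshold."""
    a, b = item
    return trace_2pl(b + half_width, a, b) - trace_2pl(b - half_width, a, b)


# Hypothetical items with the same threshold but different discrimination.
weak_item = (0.3, 0.0)
strong_item = (2.0, 0.0)

print("largest logistic vs. ogive gap:", round(largest_gap(), 4))
print("rise for weak item:  ", round(rise_near_threshold(weak_item), 3))
print("rise for strong item:", round(rise_near_threshold(strong_item), 3))
```

The gap between the scaled logistic and the normal ogive stays at roughly one percentage point or less on this grid, and the item with the larger slope separates respondents just below and just above its threshold much more sharply.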

The Three-Parameter Logistic IRT Model

The three-parameter logistic model (3PL; Lord, 1980) was developed in educational testing to extend the application of item response theory to multiple choice items that may elicit guessing. For item i, the three-parameter logistic trace line is:

$$T(u_i = 1 \mid \theta) = c_i + \frac{1 - c_i}{1 + \exp[-1.7\,a_i(\theta - b_i)]}.$$

The guessing parameter $c_i$ is the probability of a positive response to item i even if the person does not know the answer. When $c_i = 0$, the three-parameter model is equivalent to the 2PL model. Including the guessing parameter changes the interpretation of other parameters in the model. The threshold parameter $b_i$ is the value of theta at which respondents have a (.5 + .5c)*(100)% chance of responding correctly to the item (Thissen & Orlando, 2001).

[Figure 6. Three-Parameter Logistic Model (1 item). Horizontal axis: θ from -3 to 3. Vertical axis: probability from 0.00 to 1.00.]

Figure 6 presents a 3PL model trace line for one item. Survey respondents low on the underlying trait have a 20 percent probability (c = .2) of endorsing the item. The interpretation of the guessing parameter for multiple choice tests in educational measurement is straightforward, but can be vague in health measurement. In most health-related measurement research, the guessing parameter is left out, thus opting for the 2PL model. However, the guessing parameter may provide insightful information to understanding the

behavior of participants in the questionnaire. There may be a considerable proportion of participants in the survey who respond positively to an item for reasons other than the trait being measured. In ability testing (in an educational context), it is usually assumed that respondents will inflate their abilities by guessing at the right answer. However, in self-report surveys, respondents may over-report desirable behaviors or attitudes, and under-report painful or embarrassing behaviors (Schaeffer, 1988). In other words, persons may be motivated to conceal their true trait level by claiming to have or not have symptoms.

The Graded Model

For questions with three or more response categories, Samejima (1969) proposed a model for graded or ordered responses. A response may be graded on a range of scores, for example, from poor (0) to excellent (9). For survey measurement, a subject may choose one option out of a number of graded options, such as a five-point Likert-type scale: "strongly disagree", "disagree", "neutral", "agree", and "strongly agree" (Mellenbergh, 1994). In moving from dichotomously scored items to polytomously scored items, item response theory adapts more easily than classical (or traditional) test theory because only the trace line models themselves need to change (Thissen, Nelson, Billeaud, & McLeod, 2001).

Samejima's (1969) graded model is based on the logistic function giving the probability that an item response will be observed in category k or higher. For ordered responses u = k, k = 1, 2, 3, ..., m, where response m reflects the highest θ value, the graded model trace line is:

$$T(u_i = k \mid \theta) = \frac{1}{1 + \exp[-a_i(\theta - b_{ik})]} - \frac{1}{1 + \exp[-a_i(\theta - b_{i,k+1})]} = T^*(k \mid \theta) - T^*(k+1 \mid \theta)$$

(Thissen, Nelson, et al., 2001). The trace line models the probability of observing each response alternative as a function of the underlying construct (Steinberg & Thissen, 1995). The slope $a_i$ varies by item, but within an item, all response trace lines share the same slope (discrimination). This constraint of equal slope for responses within an item keeps trace lines from crossing, thus avoiding negative probabilities. The threshold parameters $b_{ik}$ vary within an item with the constraint $b_{i,k-1} < b_{ik} < b_{i,k+1}$. At each value $\theta = b_{ik}$, the respondent has a 50% probability of responding in category k or higher. $T^*(k \mid \theta)$ is the trace line describing the probability that a respondent at any particular level of θ will respond in that scoring category or a higher category. The graded model trace line $T(u_i = k \mid \theta)$ represents the proportion of participants responding to that category across θ, which will be a nonmonotonic curve, except for the first and last response categories (Thissen, Nelson, et al., 2001).

For the first response category k = 1, $T^*(1 \mid \theta) = 1$; therefore, the trace line $T(u_i = 1 \mid \theta)$ will have a monotonically decreasing logistic function with the lowest threshold parameter:

$$T(u_i = 1 \mid \theta) = 1 - \frac{1}{1 + \exp[-a_i(\theta - b_{i2})]}$$

(Thissen, Nelson, et al., 2001). For the last response category k = m, $T^*(m+1 \mid \theta) = 0$; therefore, the trace line $T(u_i = m \mid \theta)$ will have a monotonically increasing logistic function with the highest threshold parameter:

$$T(u_i = m \mid \theta) = \frac{1}{1 + \exp[-a_i(\theta - b_{im})]}$$

(Thissen, Nelson, et al., 2001).
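The construction of category probabilities as differences of adjacent "category k or higher" curves can be sketched in a few lines. This is an illustrative sketch only: the function names, the slope of 1.5, and the four threshold values are hypothetical, and the boundary conventions T*(1 | θ) = 1 and T*(m+1 | θ) = 0 follow the text above.

```python
import math


def cumulative_trace(theta, a, b):
    """T*(k | theta): probability of responding in category k or higher."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def graded_trace(theta, a, thresholds):
    """Category probabilities for one graded item.

    thresholds holds the m - 1 increasing threshold values of an item with
    m categories; the result is a list of m probabilities.
    """
    if any(lo >= hi for lo, hi in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must be strictly increasing")
    # Boundary conventions: T*(1) = 1 and T*(m + 1) = 0.
    star = [1.0] + [cumulative_trace(theta, a, b) for b in thresholds] + [0.0]
    return [star[k] - star[k + 1] for k in range(len(star) - 1)]


# Hypothetical five-category item (slope and thresholds chosen for illustration).
slope = 1.5
item_thresholds = [-1.5, -0.5, 0.5, 1.5]

for theta in (-3.0, 0.0, 3.0):
    probs = graded_trace(theta, slope, item_thresholds)
    print(theta, [round(p, 3) for p in probs])
```

Because the thresholds are ordered and share one slope, every difference is non-negative and the category probabilities sum to one at each θ; the lowest category dominates at low θ and the highest at high θ.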

[Figure 7. Graded Model (1 item with 5 response categories). Curves labeled Strongly Disagree, Disagree, Neutral, Agree, and Strongly Agree. Horizontal axis: θ from -3 to 3. Vertical axis: probability from 0.00 to 1.00.]

Figure 7 displays a graded item with five response categories. The model presents over what levels of the underlying trait θ a person is likely to endorse one of the response options.

The Nominal Model

Proposed by Bock (1972), the Nominal Model is an alternative to the Graded Model for polytomously scored items, not requiring any a priori specification of the order of the mutually exclusive response categories with respect to θ. The nominal model trace line for scores u = 1, 2, ..., m, for item i is:

$$T(u_i = x \mid \theta) = \frac{\exp[a_{ix}\theta + c_{ix}]}{\sum_{k=1}^{m} \exp[a_{ik}\theta + c_{ik}]},$$

where θ is the latent variable, the $a_{ik}$ are discrimination parameters, and the $c_{ik}$ are the intercepts. To identify the model, additional constraints are imposed. The sum of each set of parameters must equal zero, i.e.,

$$\sum_{k=0}^{m-1} a_{ik} = \sum_{k=0}^{m-1} c_{ik} = 0$$

(Thissen, Nelson, et al., 2001).
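The need for identification constraints can be seen numerically: adding the same constant to every slope, or to every intercept, of an item leaves all category probabilities unchanged. The sketch below uses hypothetical names and parameter values, and centers each parameter set so that it sums to zero, as the constraint above requires.

```python
import math


def nominal_trace(theta, slopes, intercepts):
    """Category probabilities exp[a_k * theta + c_k] / sum over categories."""
    z = [a * theta + c for a, c in zip(slopes, intercepts)]
    z_max = max(z)  # subtract the maximum for numerical stability
    weights = [math.exp(v - z_max) for v in z]
    total = sum(weights)
    return [w / total for w in weights]


def center(values):
    """Shift a parameter set so that it sums to zero."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


# Hypothetical three-category item (values chosen for illustration only).
raw_slopes = [0.2, 1.2, 2.2]
raw_intercepts = [1.0, 1.5, 0.5]

slopes = center(raw_slopes)
intercepts = center(raw_intercepts)

theta = 0.7
print([round(p, 3) for p in nominal_trace(theta, raw_slopes, raw_intercepts)])
print([round(p, 3) for p in nominal_trace(theta, slopes, intercepts)])
```

Both print statements give the same probabilities, which is why one of the infinitely many equivalent parameter sets has to be singled out by a constraint. The category with the largest slope becomes the most likely response at high θ, which is how the model recovers an ordering from the data.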

The Nominal Model is used when no pre-specified order can be determined among the response alternatives. In other words, the model allows one to determine which response alternative order is associated with higher levels on the underlying trait. This model has also been applied to determine the location of the neutral response in a Likert-type scale in relation to the ordered responses. Once the item order has been confirmed, the Graded IRT Model is often fit to the data.

The Partial Credit Model

For items with two or more ordered responses, Masters (1982) created the partial credit model within the Rasch model framework, and thus the model shares the desirable characteristics of the Rasch family: simple sum as a sufficient statistic for trait level measurement, and separate person and item parameter estimation allowing specifically objective comparisons. The partial credit model contains two sets of location parameters, one for persons and one for items, on an underlying unidimensional construct (Masters & Wright, 1997).

The partial credit model is a simple adaptation of Rasch's model for dichotomies. The model follows that from the intended order $0 < 1 < 2 < \dots < m_i$ of a set of categories, the conditional probability of scoring x rather than x - 1 on an item should increase monotonically throughout the latent variable range. For the partial credit model, the expectation for person j scoring in category x over x - 1 for item i is modeled:

$$\frac{\exp(\theta_j - \delta_{ix})}{1 + \exp(\theta_j - \delta_{ix})},$$

where $\delta_{ix}$ is an item parameter governing the probability of scoring x rather than x - 1. The $\delta_{ix}$ parameter can be thought of as an item step difficulty associated with the location on the underlying trait where categories x - 1 and x intersect. Rewriting the model, the response function

for the probability of person j scoring x on one of the possible outcomes 0, 1, 2, ..., $m_i$ of item i can be written:

$$T(u_i = x \mid \theta_j) = \frac{\exp \sum_{k=0}^{x} (\theta_j - \delta_{ik})}{\sum_{h=0}^{m_i} \exp \sum_{k=0}^{h} (\theta_j - \delta_{ik})}, \quad x = 0, 1, \dots, m_i$$

(Masters & Wright, 1997). Thus, the probability of a respondent j endorsing category x for item i is a function of the difference between their level on the underlying trait and the step difficulty ($\theta_j - \delta_{ik}$).

Thissen, Nelson, et al. (2001, p. 148) note the Partial Credit Model to be a constrained version of the Nominal Model in which "not only are the a's constrained to be linear functions of the category codes, all of those linear functions are constrained to have the same slope [for all the items]." The Generalized Partial Credit Model is a generalization of the Partial Credit Model that allows the discrimination parameter to vary among the items.

The Rating Scale Model

The Rating Scale Model (Andrich, 1978a, 1978b) is another member of the Rasch family because the model retains the elegant measurement property of simple score sufficiency. The Rating Scale Model is derived from the Partial Credit Model with the same constraint of equal discrimination power across all items. The Rating Scale Model differs from the Partial Credit Model in that the distance between difficulty steps (or levels) from category to category within each item is the same across all items. The Rating Scale Model includes an additional parameter $\lambda_i$, which locates where item i is on the underlying construct being measured by the scale. The response function for the unconditional probability of person j scoring x on one of the possible outcomes 0, 1, 2, ..., m of item i can be written:

$$T(u_i = x \mid \theta_j) = \frac{\exp \sum_{k=0}^{x} \big(\theta_j - (\lambda_i + \delta_k)\big)}{\sum_{h=0}^{m} \exp \sum_{k=0}^{h} \big(\theta_j - (\lambda_i + \delta_k)\big)}, \quad \text{where } \sum_{k=0}^{0} \big(\theta_j - (\lambda_i + \delta_k)\big) = 0$$

(Embretson & Reise, 2000). The constraint that a fixed set of rating points is used for the entire item set requires the item formats to be similar throughout the scale (e.g., all items have four response categories).

Trait Scoring

In classical test theory, scales are typically scored by summing the responses to the items. This summed score may then be linearly transformed to a scaled score estimate of a person's trait level. For example, if a respondent endorses 20 out of 50 items, he receives a score of 40%. On the other hand, item response theory uses the properties of the items (i.e., item discrimination, item difficulty) as well as knowledge of how item properties influence behavior (i.e., the item trace line) to estimate a person's trait score based on their responses to the items (Embretson & Reise, 2000).

IRT models are used to calculate a person's trait level by first estimating the likelihood of the pattern of responses to the items, given the level on the underlying trait being measured by the scale. Because the items are locally independent, the likelihood function L is

$$L = \prod_{i=1}^{n\,\text{items}} T_i(u_i \mid \theta),$$

which is simply the product of the individual item trace lines $T_i(u_i \mid \theta)$. $T_i(u_i \mid \theta)$ models the probability of the response $u_i$ to item i conditional on the underlying trait θ. Often, information about the population is included in the estimation process along with the information of the item response patterns. Therefore, the likelihood function is a product of the IRT trace

lines for each individual item multiplied by the population distribution of the latent construct φ(θ):

$$L = \prod_{i=1}^{n\,\text{items}} T_i(u_i \mid \theta)\,\phi(\theta).$$

Next, trait levels are typically estimated by a maximum likelihood method; specifically, the person's trait level maximizes the likelihood function given the item properties. Thus, a respondent's trait level is estimated by a process that 1) calculates the likelihood of a response pattern across the continuous levels of the underlying trait θ, and 2) uses some search method to find the trait level at the maximum of the likelihood (Embretson & Reise, 2000). Often, this search method uses some form of the estimate of the mode (highest peak) or the average of the likelihood function. These estimates can be linearly transformed to have any mean and standard deviation a researcher may desire.

As an illustration of trait scoring, Figure 8 presents graphs of the population distribution of the underlying variable, four items' two-parameter logistic IRT trace lines, and the likelihood function for a person's response pattern of endorsing (i.e., true response) the first two items and not endorsing (i.e., false response) the last two items of a four-item scale. In Figure 8a, the population distribution of θ is assumed to be normally distributed, and the scale is set to a mean of zero and variance of one. Other distributions can be used, or no population information can be provided in the estimation process. However, the distributions of trait levels are often assumed to be normally distributed in the population. The trace lines for item 1 (a = 2.33, b = -0.14; see Figure 8b) and item 2 (a = 2.05, b = -0.02; see Figure 8c) represent the probability of a respondent endorsing the item given their level on the underlying trait. The trace lines for item 3 (a = 3.47, b = 0.26; see Figure 8d) and item 4 (a = 2.41, b = 1.27; see Figure 8e) represent the probability of a respondent not endorsing the item given their level on the underlying trait.
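The two-step scoring process described above (evaluate the likelihood across trait levels, then locate its peak or average) can be sketched with a simple grid search. The item parameters are the four (a, b) pairs quoted in the text; everything else is an assumption made for illustration: the function names, the grid from -4 to 4, and the `scale` argument, which defaults to 1.0 because the source does not say whether the 1.7 constant was used for these items. The sketch is therefore not guaranteed to reproduce the estimates reported in the source.

```python
import math

# (a, b) pairs for the four items quoted in the text.
ITEMS = [(2.33, -0.14), (2.05, -0.02), (3.47, 0.26), (2.41, 1.27)]


def trace_2pl(theta, a, b, scale=1.0):
    """2PL probability of endorsing an item (scale = 1.7 adds the ogive adjustment)."""
    return 1.0 / (1.0 + math.exp(-scale * a * (theta - b)))


def likelihood(theta, pattern, scale=1.0):
    """Product of item trace lines times a standard normal population density."""
    value = math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)
    for (a, b), endorsed in zip(ITEMS, pattern):
        p = trace_2pl(theta, a, b, scale)
        value *= p if endorsed else 1.0 - p
    return value


def score(pattern, lo=-4.0, hi=4.0, steps=801, scale=1.0):
    """Return (mode, average) of the likelihood evaluated on an evenly spaced grid."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    weights = [likelihood(t, pattern, scale) for t in grid]
    mode = grid[max(range(steps), key=weights.__getitem__)]
    average = sum(t * w for t, w in zip(grid, weights)) / sum(weights)
    return mode, average


first_two_endorsed = score((1, 1, 0, 0))
all_but_third_endorsed = score((1, 1, 0, 1))
print("pattern 1100: mode, average =", first_two_endorsed)
print("pattern 1101: mode, average =", all_but_third_endorsed)
```

A grid search is used here only because it mirrors the verbal description; production software typically uses Newton-type iterations or numerical quadrature instead. Endorsing an additional item moves both the mode and the average upward.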

These trace lines for non-endorsement are represented by a monotonically decreasing curve to reflect that respondents high on the latent trait are less likely to respond false to (or not endorse) the items. The likelihood function (see Figure 8f) describes the likelihood of a respondent's trait level given the population distribution of θ and the person's response pattern of endorsing the first two items and not endorsing the last two items, as represented by the following equation:

$$L = \phi(\theta) \cdot T_1(u_1 = \text{true} \mid \theta) \cdot T_2(u_2 = \text{true} \mid \theta) \cdot T_3(u_3 = \text{false} \mid \theta) \cdot T_4(u_4 = \text{false} \mid \theta)$$

The maximum of the likelihood (i.e., the respondent's trait level) for this response pattern is estimated to be $\hat{\theta} = 0.14$. The average of the distribution can also be used as an estimate of the respondent's trait level, $\hat{\theta} = 0.12$.

As another example, Figure 9 presents graphs of the population distribution, four item trace lines, and likelihood function for a respondent who endorses the first, second, and fourth item, but responds negatively to the third item. The maximum-likelihood estimate for this response pattern is $\hat{\theta} = 0.72$. As expected, this response pattern yields a higher estimated score because the respondent endorsed one more item than in the prior response pattern. The likelihood function for the second response pattern is smaller (has less area under the curve) than the likelihood function for the first response pattern. The decreased area reflects an inconsistency of item responses in the second pattern. The items are ordered by thresholds (i.e., item 1 is the least difficult and item 4 is the most difficult item to endorse; see b parameter estimates); therefore, one would expect that if a person endorses item four, then they should also endorse the first three items with lower difficulties. As an example, if these four items represent physical tasks of increasing difficulty, from walking ten steps (item 1), walking to your mailbox (item 2), and walking a block (item 3), to walking a mile (item 4), then you would expect someone

[Figure 8. a) Population distribution (normal; mean = 0; variance = 1). b) Item 1 (a = 2.33, b = -0.14), true response. c) Item 2 (a = 2.05, b = -0.02), true response. d) Item 3 (a = 3.47, b = 0.26), false response. e) Item 4 (a = 2.41, b = 1.27), false response. f) Likelihood function (mode = 0.14, average = 0.12). Horizontal axis in each panel: θ from -3 to 3.]

[Figure 9. a) Population distribution (normal; mean = 0; variance = 1). b) Item 1 (a = 2.33, b = -0.14), true response. c) Item 2 (a = 2.05, b = -0.02), true response. d) Item 3 (a = 3.47, b = 0.26), false response. e) Item 4 (a = 2.41, b = 1.27), true response. f) Likelihood function (mode = 0.72, average = 0.80). Horizontal axis in each panel: θ from -3 to 3.]

who endorsed the item "walking a mile" to also endorse the item "walking a block."

The one-parameter logistic model or Rasch model constrains the discrimination parameter, a, to be equal across the four items in the example. Therefore, any response pattern with the same number of endorsed items will have the same estimated trait level. Knowledge of an individual's particular response pattern is not needed and, by the same token, information about that individual's trait level that might be derived from the response pattern is ignored. Thus, the total score is a sufficient statistic for estimating trait levels.

Table 2 lists all possible response patterns for the four binary items ($2^4 = 16$ patterns), the summed score of the response pattern, and the associated maximum-likelihood estimates of the trait level calculated by using either the two-parameter logistic (2PL) IRT model or the Rasch model. The table shows that response patterns with two item endorsements (#6 to #11) are estimated by the Rasch model to have the same trait level ($\hat{\theta} = 0.22$). For response patterns with an equal number of endorsed items, the 2PL IRT model will give higher estimates of the trait level for patterns with higher item thresholds (difficulties), and likewise, lower estimated trait levels for patterns with lower item thresholds. Thus, the 2PL IRT model will estimate a higher latent trait score for a person who endorses two harder items than for a person who endorses two easier items. Those who exclusively use the Rasch model view this property of the 2PL IRT model as a weakness: it is inconsistent for a person to endorse a harder item and not an easier item.

The Pearson correlations among the summed score, the Rasch model score, and the 2PL IRT model score for the above example are above .97. Although the presented data are only an example, adding more items still yields correlations among the scores above .9. So the question is often asked, "Why not use the summed score? It is easier to compute."
Summed scores are on an ordinal scale that assumes the distance between any consecutive scores is equal. IRT model scores are on an interval scale, and the distance between scores varies depending on the difficulty (and sometimes discriminating power) of the question. For example, the difference in physical ability needed to endorse two items that ask if one can walk 10 feet and 20 feet is certainly different from the difference needed to endorse two items that ask if one can walk 100 feet and two miles. Ability scores should be close together for the first two items (walking 10 feet and 20 feet), and scores should be farther apart for the last two items (walking 100 feet and 2 miles).

Table 2

| # | Item Response Pattern (0 = not endorse, 1 = endorse) | Summed Score | 2PL IRT Model Maximum Likelihood Estimate | 1PL IRT / Rasch Model Maximum Likelihood Estimate |
|---|---|---|---|---|
| 1 | 0 0 0 0 | 0 | -0.82 | -0.84 |
| 2 | 1 0 0 0 | 1 | -0.27 | -0.22 |
| 3 | 0 1 0 0 | 1 | -0.21 | -0.22 |
| 4 | 0 0 1 0 | 1 | -0.19 | -0.22 |
| 5 | 0 0 0 1 | 1 | -0.01 | -0.22 |
| 6 | 1 1 0 0 | 2 | 0.14 | 0.22 |
| 7 | 1 0 1 0 | 2 | 0.15 | 0.22 |
| 8 | 0 1 1 0 | 2 | 0.19 | 0.22 |
| 9 | 1 0 0 1 | 2 | 0.31 | 0.22 |
| 10 | 0 1 0 1 | 2 | 0.36 | 0.22 |
| 11 | 0 0 1 1 | 2 | 0.37 | 0.22 |
| 12 | 1 1 1 0 | 3 | 0.52 | 0.71 |
| 13 | 1 1 0 1 | 3 | 0.72 | 0.71 |
| 14 | 1 0 1 1 | 3 | 0.74 | 0.71 |
| 15 | 0 1 1 1 | 3 | 0.80 | 0.71 |
| 16 | 1 1 1 1 | 4 | 1.35 | 1.36 |

Note: The four items are ordered by difficulty, with the last item having the highest threshold.

Classical Test Theory and Item Response Theory

The past and most of the present research in health measurement has been grounded in classical test theory (CTT) models; however, works by Embretson and Reise (2000) and Reise (1999) point out several advantages to moving to IRT modeling. Table 3 provides several key differences between CTT and IRT models.

Precision of measurement statistics such as standard error of measurement (SEM) and reliability indicate how well an instrument measures a single construct. The SEM describes an expected score fluctuation due to error in the measurement tool (Embretson & Reise, 2000).

Table 3

| Classical Test Theory | Item Response Theory |
|---|---|
| Measures of precision fixed for all scores | Precision measures vary across scores |
| Longer scales increase reliability | Shorter, targeted scales can be equally reliable |
| Test properties are sample dependent | Test properties are sample free |
| Mixed item formats lead to unbalanced impact on total test scores | Easily handles mixed item formats |
| Comparing respondents requires parallel scales | Different scales can be placed on a common metric |
| Summed scores are on ordinal scale | Scores on interval scale |
| | Graphical tools for item and scale analysis |

Reliability is the fraction of observed score variance that is true score variance, or the proportion that is not error variance (Wainer & Thissen, in press). In CTT, both SEM and reliability (such as Cronbach's α, internal consistency) measures are fixed for all scale scores. In other words, CTT models assume that measurement error is distributed normally and equally for all score levels (Embretson & Reise, 2000). In IRT, measures of precision are estimated separately for each score level or response pattern, controlling for the characteristics (e.g., difficulty) of the items in the scale. Precision of measurement is typically best (low SEM, high reliability/information) in the middle of the scale range (or trait continuum), and precision is least at the low and high ends of the continuum, where items do not discriminate well among respondents.

In CTT, scale reliability is a function of the number of items in the scale. Higher reliability requires longer scales. Many times, redundant or similar items are included in such instruments. In IRT, shorter and equally reliable scales can be developed with appropriate item placement. Redundant items are discouraged and actually violate the assumption of local independence of the IRT model. These short, reliable scales are often accomplished through the use of adaptive tests that choose a set of items targeting a respondent's level on an underlying trait.
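How precision varies along the trait continuum can be illustrated with the item information function. The formula used below, a² P (1 - P) for a 2PL item without the 1.7 constant, is the standard textbook expression and is not stated in the source; the standard error is taken as one over the square root of the summed item information. The function names are hypothetical, the four (a, b) pairs are reused from the trait scoring example, and population information is ignored.

```python
import math

# (a, b) pairs reused from the four-item trait scoring example.
ITEMS = [(2.33, -0.14), (2.05, -0.02), (3.47, 0.26), (2.41, 1.27)]


def trace_2pl(theta, a, b):
    """2PL probability of endorsing an item (no 1.7 constant)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Standard 2PL item information: a^2 * P * (1 - P)."""
    p = trace_2pl(theta, a, b)
    return a * a * p * (1.0 - p)


def standard_error(theta, items=ITEMS):
    """Approximate standard error: 1 / sqrt(sum of item information)."""
    total = sum(item_information(theta, a, b) for a, b in items)
    return 1.0 / math.sqrt(total)


for theta in (-3.0, -1.0, 0.2, 1.0, 3.0):
    print(f"theta = {theta:+.1f}  SE = {standard_error(theta):.2f}")
```

For this item set the standard error is smallest near the item thresholds, around the middle of the range, and grows toward both extremes, which is the pattern the text describes. Adding items with thresholds in the tails would be one way to improve precision there.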

CTT scale measures such as reliability (Cronbach's α), item-total score correlation (point-biserial correlation), standard error of measurement, and difficulty (proportion) are sample dependent, meaning that these measures vary across samples, especially for non-representative samples. IRT item properties are assumed to be sample-invariant within a linear transformation. This property of IRT makes the model very attractive for researchers investigating population differences.

Mixed item formats are surveys that include items scored as true/false, Likert-type/graded scales, or open-ended responses. In CTT, mixed item formats have an unbalanced impact on the total scale score. Items are unequally weighted, so items with a high number of response options drive the survey score. Methods to correct for mixed item formats are limited because CTT's statistics are sample dependent. IRT has models for both dichotomously scored items (e.g., true/false) and polytomously scored questions (e.g., 5-category Likert-type scale). IRT item parameters are set to relate responses to the underlying trait (Embretson & Reise, 2000); thus, IRT can easily model the mixed item formats included in many surveys.

In health care research, there is a great need to compare respondents who take different surveys. CTT requires instruments to have a parallel form (e.g., equal means, variances, and covariances) to equate scores. This is difficult, almost impossible, to accomplish given the multitude of existing surveys in health research. Error in equating scores is influenced by any survey form differences (e.g., number of responses for each item, number of items). IRT models control for differences in item properties. Using a set of anchor items, IRT can place new items or items with different formats on a similar metric to link respondent scores. Once IRT item parameters have been estimated with an IRT model, investigators may calculate comparable