Odds Ratio, Delta, ETS Classification, and Standardization Measures of DIF Magnitude for Binary Logistic Regression


Journal of Educational and Behavioral Statistics, March 2007, Vol. 32, No. 1. © AERA and ASA.

Patrick O. Monahan, Indiana University
Colleen A. McHorney, Merck & Co., Inc.
Timothy E. Stump and Anthony J. Perkins, Regenstrief Institute, Inc. and Indiana University

Previous methodological and applied studies that used binary logistic regression (LR) for detection of differential item functioning (DIF) in dichotomously scored items either did not report an effect size or did not employ several useful measures of DIF magnitude derived from the LR model. Equations are provided for these effect size indices. Using two large data sets, the authors demonstrate the usefulness of these effect sizes for judging practical importance: the LR adjusted odds ratio and its conversions to the delta metric, the Educational Testing Service (ETS) classification system, and the p metric; the LR model-based standardization indices, using various weights for averaging stratum-specific differences in fitted probabilities; and a p metric classification system. Pros and cons of these effect sizes are discussed. Recommendations are offered. These LR effect sizes will be valuable to practitioners, particularly for preventing flagging of statistically significant but practically unimportant DIF in large samples.

Keywords: differential item functioning; logistic regression; effect sizes

This research was supported by NIA Grant R01 AG022067, NCI Grant R03 CA, and the Mary Margaret Walther Program for Cancer Care Research. Suggestions by the editor and two anonymous reviewers led to improved presentation.

In differential item functioning (DIF) analyses, groups are compared on item performance after adjusting for overall performance on the measured trait (Holland & Wainer, 1993). Since Swaminathan and Rogers (1990) applied the binary logistic regression (LR) procedure to the detection of DIF in dichotomous test items, the LR method has become increasingly popular for this purpose. However, Swaminathan and colleagues focused on hypothesis testing (Narayanan & Swaminathan, 1996; Rogers & Swaminathan, 1993; Swaminathan & Rogers, 1990). It is important to incorporate an effect size into flagging rules, especially in large samples, because high power can yield significance for practically unimportant effect sizes (e.g., Kirk, 1996).

Several methodological and applied studies investigating binary LR for DIF have flagged items for DIF based only on statistical significance (Clauser, Nungester, Mazor, & Ripkey, 1996; Huang & Dunbar, 1998; Kwak, Davenport, & Davison, 1998; Marshall, Mungas, Weldon, Reed, & Haan, 1997; Mazor, Kanjee, & Clauser, 1995; Whitmore & Schumacker, 1999; Woodard, Auchus, Godsall, & Green, 1998). Previous attempts to report effect sizes for binary LR have included using the LR Wald chi-square value (Huang & Dunbar, 1998), reporting raw or standardized LR coefficients on the log odds scale (Borsboom, Mellenbergh, & van Heerden, 2002; Clauser & Mazor, 1998; Millsap & Everson, 1993; Swanson, Clauser, Case, Nungester, & Featherman, 2002), presenting $R^2$-like measures (Swanson et al., 2002; Zumbo, 1999), calculating the partial gamma (Groenvold, Bjorner, Klee, & Kreiner, 1995), listing eta-squared (Whitmore & Schumacker, 1999), adopting a chance-corrected proportion of correct classification (Hess, Olejnik, & Huberty, 2001), and plotting fitted probabilities or fitted logits (Schmitt, Holland, & Dorans, 1993).

These attempts contributed to the DIF literature. However, none of these works focused on several intuitive effect sizes that can easily be derived from binary LR: the adjusted odds ratio, delta statistic, Educational Testing Service (ETS) classification system, adjusted odds ratio reported on the p metric, and model-based standardization indices of conditional differences in proportions. We found only one DIF study that reported odds ratios for binary LR (Volk, Cantor, Steinbauer, & Cass, 1997).
The purposes of this article are to (a) provide and explain the equations for obtaining these useful effect sizes for the LR procedure, (b) demonstrate the application of these effect sizes, and (c) present the pros and cons of these effect sizes and offer guidance in how to use them. We focus here on effect sizes for uniform DIF. We are investigating LR effect sizes for nonuniform DIF. Although a strength of LR is powerful detection of nonuniform DIF, corresponding effect sizes require more research because the choice of weights for averaging stratum-specific measures is especially critical when interactions are present (e.g., Mosteller & Tukey, 1977).

Theoretical Foundation and Effect Size Formulas

The Logistic Regression (LR) Procedure for DIF Detection

In the binary LR model, the probability of endorsing a dichotomously scored item is

\[ P(u = 1 \mid x, g, xg) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x + \beta_2 g + \beta_3 xg)}}, \tag{1} \]

and the log odds (or logit) of endorsing the item is modeled as

\[ \ln\left(\frac{P}{1 - P}\right) = \beta_0 + \beta_1 x + \beta_2 g + \beta_3 xg, \tag{2} \]

where ln is the natural logarithm, $x$ is a measure of overall proficiency (usually total score), $g$ is a dummy variable representing group membership (traditionally, 1 = reference group, 0 = focal group), $xg$ is the interaction term between total score and group membership, and $\beta_0$ is the intercept (Swaminathan & Rogers, 1990). The 1 df test of $\beta_3 = 0$ is a test of nonuniform DIF. If nonuniform DIF is absent, the $xg$ term can be deleted from the model, and then the 1 df test of $\beta_2 = 0$ provides a test of uniform DIF (Note 1). Effect sizes complement these tests. We describe two categories of effect sizes distinguished by the metric of defining departures from the null hypothesis: conditional log odds ratios versus conditional differences in proportions.

Effect Sizes for LR Based on the Conditional-Log-Odds-Ratio Definition of DIF

A natural effect size for LR is the odds ratio. LR coefficients ($\hat{\beta}_j$) are estimated on the log odds scale. The exponential of $\hat{\beta}_j$ [i.e., $\exp(\hat{\beta}_j)$] yields the maximum likelihood estimated odds ratio of the event of interest for every one-unit increase in the $j$th predictor, adjusted for other covariates in the LR model (Hosmer & Lemeshow, 2000). Thus, the exponential of $\hat{\beta}_2$ provides the reference-to-focal odds ratio of endorsing the item, conditional on proficiency (Note 2):

\[ \hat{\alpha}_{LR} = \exp(\hat{\beta}_2). \tag{3} \]

Odds ratios range from 0 to $\infty$. Values of $\hat{\alpha}_{LR}$ further from 1.0 represent greater DIF magnitude. An odds ratio and its reciprocal are equivalent in strength but not symmetrical in distance from the null value of 1.0 (e.g., 4.0 and 0.25).

Another option for an effect size is to transform $\hat{\alpha}_{LR}$ to the logistic definition of the delta scale, used by ETS to measure item difficulty. We use the formula Holland and Thayer (1988) used to convert the Mantel-Haenszel (MH) odds ratio ($\hat{\alpha}_{MH}$) to the MH delta-DIF statistic (MH-D-DIF or $\hat{\Delta}_{MH}$):

\[ \text{D-DIF or } \hat{\Delta}_{LR} = -2.35 \ln(\hat{\alpha}_{LR}) = -2.35\,\hat{\beta}_2. \tag{4} \]

It is apparent that D-DIF is a simple linear rescaling of the regression coefficient, $\hat{\beta}_2$.
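The conversions in Equations 3 and 4 can be sketched in a few lines. This is only an illustration: the function names and the value of `beta2_hat` are hypothetical and are not taken from the data sets analyzed here.

```python
import math

D_SCALE = -2.35  # rescaling constant from Equation 4


def odds_ratio(beta2: float) -> float:
    """Equation 3: reference-to-focal odds ratio, exp(beta2)."""
    return math.exp(beta2)


def d_dif(beta2: float) -> float:
    """Equation 4: D-DIF = -2.35 * ln(odds ratio) = -2.35 * beta2."""
    return D_SCALE * beta2


def beta2_from_d_dif(delta: float) -> float:
    """Invert Equation 4 to recover the LR coefficient from D-DIF."""
    return delta / D_SCALE


if __name__ == "__main__":
    beta2_hat = 0.50  # hypothetical estimate, for illustration only
    print(round(odds_ratio(beta2_hat), 3))  # about 1.649
    print(round(d_dif(beta2_hat), 3))       # about -1.175
    # An odds ratio and its reciprocal give D-DIF values of equal size
    # and opposite sign, unlike the odds ratios themselves.
    print(d_dif(math.log(4.0)), d_dif(math.log(0.25)))
```

Because the rescaling is linear, any threshold on the delta metric maps directly to a threshold on $\hat{\beta}_2$; for instance, `beta2_from_d_dif(-1.0)` is 1/2.35, or about .4255.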

In addition, one can calculate the ETS classification system (Dorans & Holland, 1993):

Category A. Items with negligible or nonsignificant DIF. Defined by D-DIF not significantly different from zero or absolute value less than 1.0.

Category B. Items with slight to moderate magnitude of statistically significant DIF. Defined by D-DIF significantly different from zero and absolute value of at least 1.0 and either less than 1.5 or not significantly greater than 1.0.

Category C. Items with moderate to large magnitude of statistically significant DIF. Defined by absolute value of D-DIF of at least 1.5 and significantly greater than 1.0.

Notice that assigning Categories A and B entails using LR to test $H_0: \beta_2 = 0$. Assigning Categories B and C requires testing $H_0: |\text{D-DIF}| \le 1.0$. Practitioners can perform the latter test in LR by testing $H_0: |\beta_2| \le .4255$ (i.e., $\Delta_{LR}$ of 1.0 equals $\beta_2$ of .4255).

It is also possible for reporting purposes to convert a conditional log-odds-ratio-based index to the metric of differences in item-proportion-correct called the p metric. We use the formula that Dorans and Holland (1993) used to convert $\hat{\alpha}_{MH}$ to MH-P-DIF:

\[ \text{P-DIF} = P_f - P_r^C, \tag{5} \]

where

\[ P_r^C = \frac{\hat{\alpha}_{LR} P_f}{(1 - P_f) + \hat{\alpha}_{LR} P_f}. \tag{6} \]

The term $P_r^C$ is the predicted proportion of examinees endorsing the item in the reference group based on $\hat{\alpha}_{LR}$, and $P_f$ is the proportion of examinees endorsing the item in the focal group.

Effect Sizes for LR Based on the Conditional-Difference-in-Proportions Definition of DIF

The contingency-table standardization (STD) procedure defines departures from the null DIF hypothesis with conditional differences in proportions, and the resulting measure is usually reported in the p metric (STD-P-DIF) (Dorans & Holland, 1993; Dorans & Kulick, 1986). One could estimate a LR model-based STD measure of DIF:

\[ \text{STD-P-DIF} = \frac{\sum_m w_m \left(P^{LR}_{fm} - P^{LR}_{rm}\right)}{\sum_m w_m}, \tag{7} \]

where $P^{LR}_{fm}$ and $P^{LR}_{rm}$ are predicted from the LR model.
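As a rough sketch of Equations 5 through 7, the two p metric indices could be computed from LR estimates as follows. The function names, the coefficient values, and the stratum weights are hypothetical; fitted probabilities come from Equation 1 without the interaction term, with g = 1 for the reference group and g = 0 for the focal group.

```python
import math


def p_dif(alpha_lr: float, p_f: float) -> float:
    """Equations 5 and 6: odds ratio expressed on the p metric."""
    p_r_c = alpha_lr * p_f / ((1.0 - p_f) + alpha_lr * p_f)
    return p_f - p_r_c


def fitted_prob(b0: float, b1: float, b2: float, x: float, g: int) -> float:
    """Equation 1 with the xg term dropped (uniform DIF model)."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x + b2 * g)))


def std_p_dif(b0: float, b1: float, b2: float, weights: dict) -> float:
    """Equation 7: weighted mean of focal-minus-reference fitted
    probabilities; `weights` maps each total score m to w_m."""
    numerator = sum(
        w * (fitted_prob(b0, b1, b2, m, 0) - fitted_prob(b0, b1, b2, m, 1))
        for m, w in weights.items()
    )
    return numerator / sum(weights.values())


if __name__ == "__main__":
    # Hypothetical estimates and focal-group counts per total score.
    b0, b1, b2 = -3.0, 0.4, 0.6
    focal_counts = {0: 400, 1: 250, 2: 150, 3: 80, 4: 40, 5: 20}
    print(p_dif(math.exp(b2), p_f=0.10))
    print(std_p_dif(b0, b1, b2, focal_counts))
```

Passing focal-group counts as `weights` corresponds to one common choice of $w_m$; any other set of stratum weights can be substituted without changing the function.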

The STD-P-DIF index is reminiscent of item response theory (IRT) model-based standardization (Wainer, 1993) except that instead of integrating over $\theta$, averaging occurs over total scores. Historically, absolute values between .05 and .10 are inspected to ensure that no possible DIF is overlooked, and absolute values above .10 are considered more unusual and should be examined (Dorans & Kulick, 1986). One could implement a p metric classification system for LR, applicable to STD-P-DIF or P-DIF:

Category A. Items with negligible or nonsignificant DIF. Defined by p index not significantly different from zero or $0 \le |p\ \text{index}| \le .05$.

Category B. Items with marginal magnitude of statistically significant DIF. Defined by p index significantly different from zero and $.05 < |p\ \text{index}| \le .10$.

Category C. Items with definite magnitude of statistically significant DIF. Defined by p index significantly different from zero and $|p\ \text{index}| > .10$.

The weights ($w_m$) for averaging conditional differences in proportions in the STD procedure have traditionally been based on intuitive rationale. In DIF studies, $w_m$ is often chosen to be the number of focal group examinees at each stratum ($N_{fm}$) (Dorans & Kulick, 1986). Other plausible weights include (Dorans & Holland, 1993; Mosteller & Tukey, 1977) (a) the number of reference group examinees at each stratum ($N_{rm}$), (b) the number of the total examinees at each stratum ($N_{tm}$), or (c) the relative frequency of some real or hypothetical standard group. One could also use Cochran's (1954) statistically driven weights (Dorans & Holland, 1993):

\[ c_m = \frac{N_{rm} N_{fm}}{N_{tm}}. \tag{8} \]

Another option, available in the STDIF software (Robin, 2001; Zenisky, Hambleton, & Robin, 2003), is the equal weight ($w_m = 1$), which yields an unweighted average.

Examples

We performed a gender (1 = male, 0 = female) DIF analysis in two data sets. The first data set was the Supplement on Aging (SOA) to the 1984 National Health Interview Survey (U.S. Department of Health and Human Services, 1997). This study was designed to assess the future needs of the elderly in the United States. Participants were 55 and older ($n = 12{,}943$). We analyzed 23 dichotomous functional status items. Each item measured whether participants reported a problem (1 = yes, 0 = no) performing an activity.

The second data set was from the Established Populations for Epidemiologic Studies of the Elderly (EPESE). Persons (age $\ge$ 65) were interviewed to identify predictors of mortality (Taylor, Wallace, Ostfeld, & Blazer, 1998). We analyzed the 20 dichotomous items of the Center for Epidemiologic Studies Depression Scale (CES-D; Radloff, 1977) obtained at baseline on 3,401 participants from the Duke site. The CES-D is a widely used self-report measure of depressive symptomatology for the general population. Each item was scored for presence (1) or absence (0) of a depressive symptom.

Statistical Methods

We modeled Equation 2, without the interaction term, using binary LR with the SAS LOGISTIC procedure. The matching score was the total score including the studied item (Holland & Thayer, 1988). For the purposes of this article, we examined only uniform DIF. Graphs of empirical logits and LOWESS smoothed curves indicated the LR assumption of linearity of the logit was reasonably satisfied for all items (Monahan, 2004). The SOA and EPESE items were approximately unidimensional because the cross-validated DETECT index was .16 and .28, respectively (Monahan, Stump, Finch, & Hambleton, in press; Roussos & Ozbek, 2006). In SOA data, there were 7,822 women and 5,121 men. In EPESE data, there were 2,203 women and 1,198 men. For both data sets, total score was skewed right (less skewed for EPESE), and men had a slightly lower mean and variance than women.

Result of Using LR Statistical Test Alone to Detect Uniform DIF

Table 1 displays DIF effect sizes for functional status items from the SOA data set. Each row (studied item) in Table 1 represents a different LR model. The right-most column shows the observed significance of the two-sided LR Wald chi-square test of uniform DIF. Based on this test alone, even if we used a conservative Bonferroni-adjusted significance level of .0022 (i.e., .05/23), 11 of 23 items would be flagged for DIF (Table 1, bold items). Most DIF studies using LR have relied on Wald tests alone. We now illustrate the importance of effect sizes.

Effect Sizes for LR Based on the Conditional-Log-Odds-Ratio Definition of DIF

We will interpret the effect sizes in Table 1 from left to right, beginning with the LR odds ratio ($\hat{\alpha}_{LR}$).
By sorting on ascending D-DIF (equivalently, descending $\hat{\alpha}_{LR}$), items were conveniently grouped by direction and magnitude of DIF. Thus, for the 11 statistically significant items, the 6 items at the top were more greatly endorsed by men and the 5 items at the bottom were more greatly endorsed by women after adjusting for total score. The $\hat{\alpha}_{LR}$ for these 11 items varied in strength from 1.55 to 2.94 (men displayed greater functional problems after adjustment) and from 0.74 to 0.21 (women displayed greater functional problems after adjustment) (Table 1). For example, the LR model estimated that the odds that men reported having a problem with lifting and carrying 25 pounds were about one fifth (0.21) of the odds that women reported this problem, adjusted for overall functional status. The estimated odds of having a problem with using the telephone were almost 3 (2.94) times greater for men compared to women, controlling for overall functional status (Table 1). However, the odds of having a problem with walking were only 1½ times greater for men. Therefore, $\hat{\alpha}_{LR}$ indicates that not all 11 statistically significant items exhibited equally important DIF magnitude.

Likewise, D-DIF ($\hat{\Delta}_{LR}$) for the 11 statistically significant items varied in strength from −1.02 to −2.53 and from 0.71 to 3.63 (Table 1). According to the ETS classification system (ETS-class in Table 1), only 5 of 11 items displayed moderate to large magnitude of statistically significant DIF (C), and 5 items showed moderate DIF (B).

We also used the transformation of $\hat{\alpha}_{LR}$ to the p metric (P-DIF in Table 1) to classify items according to the p metric system described earlier (P-class in Table 1). This P-DIF classification system indicated weaker DIF than the ETS classification system for 5 items and stronger DIF than ETS classification for 2 items (Table 1). This was because the nonlinear relationship between D-DIF and P-DIF depends on the proportion of focal group examinees endorsing the item ($P_f$). Notice that the values of $P_f$ for Items 2, 3, 9, 11, and 23 were closer to zero than those for Items 14 and 16 (Table 1). Using Equations 4, 5, and 6, it can be shown that for a given value of $\hat{\Delta}_{LR}$ or $\hat{\alpha}_{LR}$, P-DIF will be less for lowly or highly endorsed items than for moderately endorsed items. Likewise, for a given value of P-DIF, $\hat{\Delta}_{LR}$ and $\hat{\alpha}_{LR}$ will be greater for $P_f$ near zero or one than for $P_f$ near .50 (see Discussion).

Effect Sizes for LR Based on the Conditional-Difference-in-Proportions Definition of DIF

Using the traditional weight for DIF analyses ($w_m = N_{fm}$), the magnitudes of the standardization index (STD-P-focal in Table 1) for the 11 statistically significant items were even smaller than the log-odds-ratio-based p metric magnitudes of P-DIF (Table 1).
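The disagreement between the ETS and p metric classifications for a fixed odds ratio can be illustrated with a short sketch. Several choices here are assumptions rather than part of the source: the function names, the .05 significance level, the use of a one-sided normal-theory Wald test for the null hypothesis that the absolute D-DIF is at most 1.0, the standard error of .10, and the $P_f$ values in the loop. Only the odds ratio of 2.94 is a value reported above.

```python
import math


def wald_p_two_sided(estimate: float, se: float) -> float:
    """Two-sided normal-theory p value for H0: parameter = 0."""
    return math.erfc(abs(estimate / se) / math.sqrt(2.0))


def ets_class(beta2: float, se: float, alpha: float = 0.05) -> str:
    """ETS category (A, B, or C) from an LR coefficient and its SE."""
    d = abs(-2.35 * beta2)  # |D-DIF| from Equation 4
    sig_zero = wald_p_two_sided(beta2, se) < alpha
    # Assumed one-sided test of H0: |beta2| <= 1/2.35 (|D-DIF| <= 1.0).
    z_one = (abs(beta2) - 1.0 / 2.35) / se
    sig_above_one = 0.5 * math.erfc(z_one / math.sqrt(2.0)) < alpha
    if not sig_zero or d < 1.0:
        return "A"
    if d >= 1.5 and sig_above_one:
        return "C"
    return "B"


def p_dif(alpha_lr: float, p_f: float) -> float:
    """Equations 5 and 6."""
    return p_f - alpha_lr * p_f / ((1.0 - p_f) + alpha_lr * p_f)


def p_class(p_index: float, significant: bool) -> str:
    """p metric category using the .05 and .10 thresholds."""
    magnitude = abs(p_index)
    if not significant or magnitude <= 0.05:
        return "A"
    return "B" if magnitude <= 0.10 else "C"


if __name__ == "__main__":
    alpha_lr, se = 2.94, 0.10  # se is hypothetical
    beta2 = math.log(alpha_lr)
    for p_f in (0.02, 0.05, 0.20, 0.50):  # hypothetical focal proportions
        pd = p_dif(alpha_lr, p_f)
        print(p_f, ets_class(beta2, se), round(pd, 3), p_class(pd, True))
```

The ETS category stays the same across the loop because it depends only on the coefficient and its standard error, whereas the p metric category changes with $P_f$.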
A p metric classification system based on STD-P-focal (i.e., STD-P-focal-class in Table 1) resulted in flagging only 1 item as definite DIF (C) and only 1 item as marginal DIF (B). For both data sets, the LR standardization effect size was very similar when two other weights were used [total group distribution ($N_{tm}$) and Cochran ($c_m$)], differing at most from STD-P-focal by .02 for any item but usually by .01 or less (data not shown). The standardization index using $w_m = 1$ (STD-P-equal in Table 1) differed from the other three STD-P-DIF indices ($w_m = N_{fm}$, $N_{tm}$, $c_m$), generally displaying slightly greater absolute values, resulting in 1 item demonstrating definite DIF (C) and 5 items revealing marginal DIF (B).

Abbreviated Results for EPESE Data

Table 2 shows effect sizes for the CES-D depression items. Again, fewer items were flagged when effect sizes supplemented the LR Wald test. Of five statistically significant items, only one item displayed moderate to large DIF (C) (text continues after Table 2)

TABLE 1
Logistic Regression (LR) Effect Sizes for Measuring Uniform Differential Item Functioning (DIF): Gender DIF in Supplement on Aging (SOA) Functional Status Items

Column groups (definition for measuring departure from null DIF hypothesis): Conditional Log Odds Ratio; Conditional Difference in Proportions
Columns: Item; Description; LR Odds Ratio ($\hat{\alpha}_{LR}$); D-DIF ($\hat{\Delta}_{LR}$); ETS-Class; P-DIF; P-Class; $P_f$; STD-P-Focal; STD-P-Focal-Class; STD-P-Equal; STD-P-Equal-Class; LR Wald $H_0: \beta_2 = 0$ p value

11 Using the telephone C .0495 A A .12 C <
Dressing C .07 B A .08 B <
Eating B .02 A A .08 B <
Walking quarter mile B .13 C A .03 A <.0001
16 Standing 2 hours B .101 C A .02 A <
Walking B .07 B A .03 A <
Light housework A .02 A A .02 A
10 Managing money A .01 A A .03 A .03
8 Preparing meals A .02 A A .02 A .09
1 Bathing A .02 A A .02 A
Using the toilet A .00 A A .01 A .68
20 Reaching out A .00 A A .01 A
Reaching over head A .00 A A .00 A .88
18 Stooping, crouching, kneeling A .01 A A .00 A .62
17 Sitting 2 hours A .00 A A .01 A
Walk up steps A .02 A A .01 A .18
4 Getting in and out of bed A .01 A A .01 A .34
6 Getting outside A .02 A A .02 A
Using fingers to grasp A .03 A A .05 A .0002
9 Shopping B .04 A A .03 A
Lifting, carrying 10 pounds C .08 B A .08 B <.0001
12 Heavy housework C .14 C B .051 B <
22 Lifting, carrying 25 pounds C .26 C C .08 B <.0001

Note: Items where men displayed greater functional problems than women, adjusted for overall functional problems, are indicated by: LR odds ratio > 1.0 and negative D-DIF, P-DIF, and STD-P-DIF. The Bonferroni-corrected significance level = .05/23 = .0022 (bold items = significant uniform DIF). Items are sorted by D-DIF.

TABLE 2
Logistic Regression (LR) Effect Sizes for Measuring Uniform Differential Item Functioning (DIF): Gender DIF in Established Populations for Epidemiologic Studies of the Elderly (EPESE) Depression Items

Column groups (definition for measuring departure from null DIF hypothesis): Conditional Log Odds Ratio; Conditional Difference in Proportions
Columns: Item; Description; LR Odds Ratio ($\hat{\alpha}_{LR}$); D-DIF ($\hat{\Delta}_{LR}$); ETS-Class; P-DIF; P-Class; $P_f$; STD-P-Focal; STD-P-Focal-Class; STD-P-Equal; STD-P-Equal-Class; LR Wald $H_0: \beta_2 = 0$ p value

9 Life had been a failure B .04 A A .07 B
Was not happy B .054 B A .08 B
Did not enjoy life A .03 A A .051 B
13 Talked less than usual A .03 A A .04 A
7 Felt everything an effort A .04 A A .03 A
15 People were unfriendly A .01 A A .03 A
19 Felt people disliked me A .01 A A .02 A
4 Felt not as good as others A .01 A A .02 A
8 Did not feel hopeful about future A .01 A A .01 A
2 Appetite was poor A .01 A A .01 A
11 Sleep was restless A .02 A A .02 A
1 Bothered by things usually do not A .02 A A .02 A
20 I could not get going A .02 A A .02 A .25
6 Felt depressed A .04 A A .02 A .10
3 Could not shake off the blues A .03 A A .03 A .09
10 Felt fearful A .03 A A .04 A
Felt sad A .06 B A .03 A .01
5 Trouble keeping mind on doing A .06 B A .054 B .001
14 Felt lonely A .07 B A .04 A
Had crying spells C .07 B B .15 C <.0001

Note: Items where men displayed greater depressive symptoms than women, adjusted for overall depression, are indicated by: LR odds ratio > 1.0 and negative D-DIF, P-DIF, and STD-P-DIF. The Bonferroni-corrected significance level = .05/20 = .0025 (bold items = significant uniform DIF). Items are sorted by D-DIF.

by any classification system (Table 2). The main difference between Table 1 and Table 2 is that compared to functional status items (Table 1), depression items (Table 2) showed less difference between the log-odds-ratio-based p index (P-DIF) and the focal-group standardization p index (STD-P-focal). This was because depression items revealed less DIF magnitude and less skewness of total score. Specifically, using Equations 5, 6, and 7, where $w_m = N_{fm}$, it can be shown that the difference between P-DIF and STD-P-focal depends on $\hat{\alpha}_{LR}$, $P^{LR}_{fm}$, and $N_{fm}$ ($P_f$ is a function of $N_{fm}$ and $P^{LR}_{fm}$; $P^{LR}_{rm}$ is a function of $\hat{\alpha}_{LR}$ and $P^{LR}_{fm}$ because the odds ratio is assumed to be constant across strata in uniform-DIF LR). In short, the magnitude of STD-P-focal increases as $P^{LR}_{fm}$ values near .50 receive greater weight (i.e., larger $N_{fm}$) than $P^{LR}_{fm}$ values near zero or one.

Sensitivity of Results

Results were very similar after deleting examinees at the floor and ceiling. We computed Cochran's (1954) test criterion by specifying $w_m = c_m$ in STD-P-DIF and by using predicted proportions in the standard error; the observed significance was extremely similar to the LR Wald observed significance for all items in both data sets (differed at most by .008). This is not surprising given that Cochran (1954) derived these weights for a test criterion that would be powerful for detecting an alternative hypothesis of a constant difference on either the logit or probit scale. Thus, in LR, although Cochran weights are an option when computing STD-P-DIF, the Cochran test might be an unnecessary adjunct to the LR Wald test.

Discussion

Choosing an Effect Size: Pros and Cons

The effect sizes can be contrasted on a number of dimensions. First, as for ease of interpretation, indices reported on the delta and p metric are symmetrical around their null value of zero, which facilitates interpreting DIF in opposite directions. However, those experienced with interpreting odds ratios may find $\hat{\alpha}_{LR}$ easier to interpret than D-DIF, which is on the ETS-preferred delta metric. For data conforming to the two-parameter logistic (2PL) IRT model, one advantage of D-DIF is that the MH-D-DIF parameter ($\Delta_{2PL}$) can be written as a linear rescaling of the difference between $b$ parameters (Roussos, Schnipke, & Pashley, 1999; Note 4): $\Delta_{2PL} = 4a(b_R - b_F)$. The $\alpha_{MH}$ parameter also shares the advantage of being related, although nonlinearly, to IRT b-DIF (Roussos et al., 1999). The p metric is probably the most universally understood metric, conveniently connected to total and true score metrics. Practitioners should choose an effect size that they and their readership can easily interpret.

Second, practitioners should choose an effect size whose metric for defining departures from the null hypothesis supplies the most valid definition of DIF for their purpose. Specifically, relative to conditional odds ratios ($\hat{\alpha}_{LR}$) or log odds ratios ($\hat{\Delta}_{LR}$), conditional differences between proportions (STD-P-DIF) will be compressed for items with low or high endorsement rates. However, from another perspective, odds ratios and log odds ratios play up small differences in proportions that are near zero or near one. P-DIF is based on the conditional log-odds-ratio definition but is reported on the p metric. Therefore, P-DIF shares some properties with STD-P-DIF (ease of interpretation and lower magnitude, relative to odds ratios, for lowly or highly endorsed items).

Third, in terms of fundamental connections to the LR model, $\hat{\alpha}_{LR}$ is a natural estimator, fundamentally connected to a parameter of the LR model. D-DIF is also a simple rescaling of an estimated LR parameter and therefore is fundamentally connected to the LR model. A disadvantage of the STD-P-DIF index is that it is the most removed from the LR parameter estimation method. However, this does not invalidate it as a descriptive measure of DIF.

Fourth, as far as ease of programming, standard software for LR automatically provides $\hat{\alpha}_{LR}$. D-DIF can easily be calculated. P-DIF requires slightly more programming to convert $\hat{\alpha}_{LR}$ to P-DIF. STD-P-DIF requires the most additional programming because predicted proportions at each total score level must be first computed and then weighted.

Fifth, the purpose of weights in standardization is not only to standardize according to the distribution of interest but also to yield smaller weight to sparse strata that provide less precise information. Using equal weights could be dangerous if one or more strata are sparse. (None of the values for total score were sparse for the present data sets.) If one chooses an outside (real or hypothetical) standard distribution, one must be careful to not combine large weights with ill-determined differences in proportions (Mosteller & Tukey, 1977).
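The caution about equal weights and sparse strata can be made concrete with a small sketch of the weighting schemes, including the Cochran weight of Equation 8. All counts and stratum differences below are invented for illustration, and the function names are hypothetical.

```python
def standardization_weights(n_ref: dict, n_foc: dict) -> dict:
    """Return, for each total score m, the candidate weights w_m:
    focal count, reference count, total count, Cochran (Equation 8),
    and the equal weight."""
    weights = {}
    for m in sorted(set(n_ref) | set(n_foc)):
        nr, nf = n_ref.get(m, 0), n_foc.get(m, 0)
        nt = nr + nf
        weights[m] = {
            "focal": float(nf),
            "reference": float(nr),
            "total": float(nt),
            "cochran": nr * nf / nt if nt else 0.0,
            "equal": 1.0,
        }
    return weights


def weighted_mean_difference(diffs: dict, weights: dict, scheme: str) -> float:
    """Weighted average of stratum differences in fitted proportions."""
    numerator = sum(weights[m][scheme] * diffs[m] for m in diffs)
    denominator = sum(weights[m][scheme] for m in diffs)
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical data: stratum 2 is sparse but has a large difference.
    n_ref = {0: 500, 1: 300, 2: 2}
    n_foc = {0: 600, 1: 400, 2: 3}
    diffs = {0: -0.01, 1: -0.02, 2: -0.30}
    w = standardization_weights(n_ref, n_foc)
    for scheme in ("focal", "reference", "total", "cochran", "equal"):
        print(scheme, round(weighted_mean_difference(diffs, w, scheme), 4))
```

In this invented example the equal-weight average is dominated by the sparse stratum, while the count-based and Cochran weights keep its influence small.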
Recommendations on How to Use the Effect Sizes

First, we recommend using an effect size and a statistical test when deciding whether items exhibit DIF. The effect size prevents flagging unimportant differences in large samples, and the statistical test prevents flagging noise in small samples. Unimportant differences can be significant, as demonstrated here, when using a statistical test alone, even if statistical tests are conservatively adjusted for multiple comparisons. In addition, effect sizes and their classifications help distinguish between levels of nonnegligible DIF (e.g., B vs. C).

Second, practitioners must decide what values of the effect size represent negligible, moderate, and large magnitudes for the intended purpose. For example, ETS uses thresholds of 1.0 and 1.5 on the absolute value of the delta metric, which are equivalent to odds ratios greater than 1.53 (or less than 0.65) and greater than 1.89 (or less than 0.53), respectively. Users of STD procedures often use .05 and .10 thresholds on the absolute value of the p metric. In the medical sciences, the $\hat{\alpha}_{LR}$ thresholds of 1.5 and 2.0 are common due to convenient interpretations of one and one-half and twice the odds, respectively. However, smaller thresholds are used (e.g., 1.1) if the exposure is prevalent and disease serious (e.g., when determining whether risk of heart disease is associated with hormone supplements). Interestingly, $\hat{\alpha}_{LR}$ thresholds of 1.5 and 2.0 are nearly equivalent to the delta thresholds used in the ETS classification system.

Third, one can take steps to facilitate interpretations. One could calculate the reciprocal of odds ratios less than one. By calculating 1/.74 = 1.35, one can readily see that .74 for Item 21 in Table 1 is not as strong as 1.55 for Items 5 and 16. One can use graphs (e.g., scatter, line, and bar), which help discern relative distances between DIF magnitudes. In addition, sorting items by direction and magnitude of DIF in tables, as we did here, aids interpretation.

Fourth, these effect sizes can be used to facilitate comparisons of DIF procedures. One could compare MH, LR, SIBTEST, STD, and IRT procedures on the p metric (using P-DIF or STD-P-DIF for LR). Likewise, one could compare procedures on the odds ratio or delta metric, where STD-P-DIF and the latent-true-score adjusted difference in proportions of SIBTEST are converted using a formula similar to Equation 22 in Dorans and Holland (1993).

Limitations

The present analyses employed large sample sizes. In smaller sample sizes, the discrepancy between flagging items on statistical significance alone and flagging items with the combination of effect sizes and statistical significance should not be as great. The degree of discrepancy observed here between D-DIF, P-DIF, and STD-P-DIF may differ for other data sets.

Conclusions

These effect sizes and classification systems have received little attention in the DIF literature for binary LR: the adjusted odds ratio ($\hat{\alpha}_{LR}$), D-DIF ($\hat{\Delta}_{LR}$), P-DIF, LR model-based standardization indices (STD-P-DIF), and the ETS and p metric classification systems. The present examples demonstrate that these effect sizes are quite useful for preventing practically unimportant DIF from being flagged, especially in large samples. There are various pros and cons for choosing among these effect sizes. When steps are taken for their proper use, these effect sizes should be of great benefit to practitioners.

Notes

1. The original proposal was to use the two df simultaneous test of uniform and nonuniform differential item functioning (DIF); however, when only uniform DIF is present, including the interaction term in the test may decrease power (Swaminathan & Rogers, 1990).

2. We considered the ML subscript (maximum likelihood estimation); however, the LR subscript in Equation 3 reminds practitioners that the odds ratio was estimated by assuming a logistic regression (LR) model.

3. Jodoin and Gierl (2001) suggested that $R^2$-like indices are preferable to effect sizes based on $\hat{\beta}_2$ because the latter would depend on the coding of the group variable [i.e., reference cell (0/1) vs. deviations-from-means method (-1/1)]. However, we agree with Hosmer and Lemeshow (2000) that the reference cell method is more useful for LR because the exponential of $\beta_2$ is interpreted as a ratio of odds for one group versus the other group. If one codes the focal group as 0 and the reference group as 1 and then models item endorsement, as we did here, $\hat{\alpha}_{LR}$ and $\hat{\Delta}_{LR}$ have the same interpretations as for the Mantel-Haenszel (MH) procedure. Reference cell coding is no more arbitrary than row and column specification in the MH procedure.

4. In this formula (i.e., Equation 16 in Roussos, Schnipke, & Pashley, 1999), item discrimination (a) for the two-parameter logistic (2PL) item response theory (IRT) model varies over items and the MH delta-DIF (MH-D-DIF) parameter is conditional on theta, whereas in Equation 13 in Donoghue, Holland, and Thayer (1993), a is constant across items because the MH-D-DIF parameter is conditional on observed total score where the corresponding IRT model is the Rasch model.

References

Borsboom, D., Mellenbergh, G. J., & Heerden, J. v. (2002). Different kinds of DIF: A distinction between absolute and relative forms of measurement invariance and bias. Applied Psychological Measurement, 26,
Clauser, B. E., & Mazor, K. M. (1998). Using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice, 17(1),
Clauser, B. E., Nungester, R. J., Mazor, K., & Ripkey, D. (1996). A comparison of alternative matching strategies for DIF detection in tests that are multidimensional. Journal of Educational Measurement, 33,
Cochran, W. G. (1954). Some methods for strengthening the common $\chi^2$ tests. Biometrics, 10,
Donoghue, J. R., Holland, P. W., & Thayer, D. T. (1993). A Monte Carlo study of factors that affect the Mantel-Haenszel and standardization measures of differential item functioning. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp ). Hillsdale, NJ: Lawrence Erlbaum.
Dorans, N. J., & Holland, P. W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp ). Hillsdale, NJ: Lawrence Erlbaum.
Dorans, N. J., & Kulick, E. (1986). Demonstrating the utility of the standardization approach to assessing unexpected differential item performance on the Scholastic Aptitude Test. Journal of Educational Measurement, 23,
Groenvold, M., Bjorner, J. B., Klee, M. C., & Kreiner, S. (1995). Test for item bias in a quality of life questionnaire. Journal of Clinical Epidemiology, 48,
Hess, B., Olejnik, S., & Huberty, C. J. (2001, April). The efficacy of two improvement-over-chance effect size measures for two-group univariate comparisons under variance heterogeneity and nonnormality. Paper presented at the annual meeting of the American Educational Research Association, Seattle, WA.

Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp ). Hillsdale, NJ: Lawrence Erlbaum.
Holland, P. W., & Wainer, H. (Eds.). (1993). Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum.
Hosmer, D. W., & Lemeshow, S. (2000). Applied logistic regression (2nd ed.). New York: John Wiley.
Huang, C.-Y., & Dunbar, S. B. (1998, April). Factors influencing the reliability of DIF detection methods. Paper presented at the annual meeting of the American Educational Research Association, San Diego, CA.
Jodoin, M. G., & Gierl, M. J. (2001). Evaluating Type I error and power rates using an effect size measure with the logistic regression procedure for DIF detection. Applied Measurement in Education, 14,
Kirk, R. E. (1996). Practical significance: A concept whose time has come. Educational and Psychological Measurement, 56,
Kwak, N., Davenport, E. C., Jr., & Davison, M. L. (1998, April). A comparative study of observed score approaches and purification procedures for detecting differential item functioning. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.
Marshall, S. C., Mungas, D., Weldon, M., Reed, B., & Haan, M. (1997). Differential item functioning in the Mini-Mental State Examination in English- and Spanish-speaking older adults. Psychology and Aging, 12,
Mazor, K. M., Kanjee, A., & Clauser, B. E. (1995). Using logistic regression and the Mantel-Haenszel with multiple ability estimates to detect differential item functioning. Journal of Educational Measurement, 32,
Millsap, R. E., & Everson, H. T. (1993). Methodological review: Statistical approaches for assessing measurement bias. Applied Psychological Measurement, 17,
Monahan, P. O. (2004, April). Examining the assumption of linearity of the logit in the logistic regression procedure for detecting DIF. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.
Monahan, P. O., Stump, T. E., Finch, H., & Hambleton, R. K. (in press). Bias of exploratory and cross-validated DETECT index under unidimensionality. Applied Psychological Measurement.
Mosteller, F., & Tukey, J. W. (1977). Data analysis and regression: A second course in statistics. Reading, MA: Addison-Wesley.
Narayanan, P., & Swaminathan, H. (1996). Identification of items that show nonuniform DIF. Applied Psychological Measurement, 20,
Radloff, L. S. (1977). The CES-D scale: A self-report depression scale for research in the general population. Applied Psychological Measurement, 1,
Robin, F. (2001). STDIF: Standardization-DIF analysis program. Amherst: University of Massachusetts, School of Education.
Rogers, H. J., & Swaminathan, H. (1993). A comparison of logistic regression and Mantel-Haenszel procedures for detecting differential item functioning. Applied Psychological Measurement, 17,
Roussos, L. A., & Ozbek, O. (2006). Formulation of the DETECT population parameter and evaluation of DETECT estimator bias. Journal of Educational Measurement, 43,

Roussos, L. A., Schnipke, D. L., & Pashley, P. J. (1999). A generalized formula for the Mantel-Haenszel differential item functioning parameter. Journal of Educational and Behavioral Statistics, 24,
Schmitt, A. P., Holland, P. W., & Dorans, N. J. (1993). Evaluating hypotheses about differential item functioning. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp ). Hillsdale, NJ: Lawrence Erlbaum.
Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27,
Swanson, D. B., Clauser, B. E., Case, S. M., Nungester, R. J., & Featherman, C. (2002). Analysis of differential item functioning (DIF) using hierarchical logistic regression models. Journal of Educational and Behavioral Statistics, 27,
Taylor, J. O., Wallace, R. B., Ostfeld, A. M., & Blazer, D. G. (1998). Established populations for epidemiologic studies of the elderly, (3rd ICPSR version) [Electronic version]. Ann Arbor, MI: Inter-university Consortium for Political and Social Research.
U.S. Department of Health and Human Services. (1997). Longitudinal study of aging, (6th ICPSR version) [Electronic version]. Ann Arbor, MI: Inter-university Consortium for Political and Social Research.
Volk, R. J., Cantor, S. B., Steinbauer, J. R., & Cass, A. R. (1997). Item bias in the CAGE screening test for alcohol use disorders. Journal of General Internal Medicine, 12,
Wainer, H. (1993). Model-based standardized measurement of an item's differential impact. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp ). Hillsdale, NJ: Lawrence Erlbaum.
Whitmore, M. L., & Schumacker, R. E. (1999). A comparison of logistic regression and analysis of variance differential item functioning detection methods. Educational and Psychological Measurement, 59,
Woodard, J. L., Auchus, A. P., Godsall, R. E., & Green, R. C. (1998). An analysis of test bias and differential item functioning due to race on the Mattis Dementia Rating Scale. Journals of Gerontology. Series B, Psychological Sciences and Social Sciences, 53, P370-P374.
Zenisky, A. L., Hambleton, R. K., & Robin, F. (2003). Detection of differential item functioning in large-scale state assessments: A study evaluating a two-stage approach. Educational and Psychological Measurement, 63,
Zumbo, B. D. (1999). A handbook on the theory and methods of differential item functioning (DIF): Logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Ottawa, Canada: Directorate of Human Resources Research and Evaluation, Department of National Defense.

Authors

PATRICK O. MONAHAN is assistant professor, Division of Biostatistics, Department of Medicine, School of Medicine, Indiana University, 410 West 10th Street Suite 3000, Indianapolis, IN 46202. His area of interest is measurement and statistics applied to the behavioral and social sciences.

COLLEEN A. MCHORNEY, PhD, is director of outcomes research at Merck & Co., Inc., WP39-166, 770 Sumneytown Pike, West Point, PA. Her areas of expertise relate to the measurement and evaluation of patient-reported outcomes, including health status, quality of life, patient satisfaction, and patient preferences.

TIMOTHY E. STUMP is statistician, Regenstrief Institute, Inc. and the Indiana University Center for Aging Research. His area of interest is measurement and statistics in the medical sciences.

ANTHONY J. PERKINS is a statistical consultant for the Regenstrief Institute, Inc. and the Indiana University Center for Aging Research. His area of interest is item bias in quality of life instruments.

Manuscript received July 15, 2004
Accepted August 2,

An Introduction to Missing Data in the Context of Differential Item Functioning

An Introduction to Missing Data in the Context of Differential Item Functioning A peer-reviewed electronic journal. Copyright is retained by the first or sole author, who grants right of first publication to Practical Assessment, Research & Evaluation. Permission is granted to distribute

More information

Differential Item Functioning

Differential Item Functioning Differential Item Functioning Lecture #11 ICPSR Item Response Theory Workshop Lecture #11: 1of 62 Lecture Overview Detection of Differential Item Functioning (DIF) Distinguish Bias from DIF Test vs. Item

More information

The Influence of Conditioning Scores In Performing DIF Analyses

The Influence of Conditioning Scores In Performing DIF Analyses The Influence of Conditioning Scores In Performing DIF Analyses Terry A. Ackerman and John A. Evans University of Illinois The effect of the conditioning score on the results of differential item functioning

More information

The Matching Criterion Purification for Differential Item Functioning Analyses in a Large-Scale Assessment

The Matching Criterion Purification for Differential Item Functioning Analyses in a Large-Scale Assessment University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln Educational Psychology Papers and Publications Educational Psychology, Department of 1-2016 The Matching Criterion Purification

More information

A Modified CATSIB Procedure for Detecting Differential Item Function. on Computer-Based Tests. Johnson Ching-hong Li 1. Mark J. Gierl 1.

A Modified CATSIB Procedure for Detecting Differential Item Function. on Computer-Based Tests. Johnson Ching-hong Li 1. Mark J. Gierl 1. Running Head: A MODIFIED CATSIB PROCEDURE FOR DETECTING DIF ITEMS 1 A Modified CATSIB Procedure for Detecting Differential Item Function on Computer-Based Tests Johnson Ching-hong Li 1 Mark J. Gierl 1

More information

Impact of Differential Item Functioning on Subsequent Statistical Conclusions Based on Observed Test Score Data. Zhen Li & Bruno D.

Impact of Differential Item Functioning on Subsequent Statistical Conclusions Based on Observed Test Score Data. Zhen Li & Bruno D. Psicológica (2009), 30, 343-370. SECCIÓN METODOLÓGICA Impact of Differential Item Functioning on Subsequent Statistical Conclusions Based on Observed Test Score Data Zhen Li & Bruno D. Zumbo 1 University

More information

Manifestation Of Differences In Item-Level Characteristics In Scale-Level Measurement Invariance Tests Of Multi-Group Confirmatory Factor Analyses

Manifestation Of Differences In Item-Level Characteristics In Scale-Level Measurement Invariance Tests Of Multi-Group Confirmatory Factor Analyses Journal of Modern Applied Statistical Methods Copyright 2005 JMASM, Inc. May, 2005, Vol. 4, No.1, 275-282 1538 9472/05/$95.00 Manifestation Of Differences In Item-Level Characteristics In Scale-Level Measurement

More information

Item purification does not always improve DIF detection: a counterexample. with Angoff s Delta plot

Item purification does not always improve DIF detection: a counterexample. with Angoff s Delta plot Item purification does not always improve DIF detection: a counterexample with Angoff s Delta plot David Magis 1, and Bruno Facon 3 1 University of Liège, Belgium KU Leuven, Belgium 3 Université Lille-Nord

More information

Revisiting Differential Item Functioning: Implications for Fairness Investigation

Revisiting Differential Item Functioning: Implications for Fairness Investigation Revisiting Differential Item Functioning: Implications for Fairness Investigation Jinyan Huang** and Turgay Han* **Associate Professor and Ph.D. Faculty Member College of Education, Niagara University

More information

THE APPLICATION OF ORDINAL LOGISTIC HEIRARCHICAL LINEAR MODELING IN ITEM RESPONSE THEORY FOR THE PURPOSES OF DIFFERENTIAL ITEM FUNCTIONING DETECTION

THE APPLICATION OF ORDINAL LOGISTIC HEIRARCHICAL LINEAR MODELING IN ITEM RESPONSE THEORY FOR THE PURPOSES OF DIFFERENTIAL ITEM FUNCTIONING DETECTION THE APPLICATION OF ORDINAL LOGISTIC HEIRARCHICAL LINEAR MODELING IN ITEM RESPONSE THEORY FOR THE PURPOSES OF DIFFERENTIAL ITEM FUNCTIONING DETECTION Timothy Olsen HLM II Dr. Gagne ABSTRACT Recent advances

More information

André Cyr and Alexander Davies

André Cyr and Alexander Davies Item Response Theory and Latent variable modeling for surveys with complex sampling design The case of the National Longitudinal Survey of Children and Youth in Canada Background André Cyr and Alexander

More information

Using the Rasch Modeling for psychometrics examination of food security and acculturation surveys

Using the Rasch Modeling for psychometrics examination of food security and acculturation surveys Using the Rasch Modeling for psychometrics examination of food security and acculturation surveys Jill F. Kilanowski, PhD, APRN,CPNP Associate Professor Alpha Zeta & Mu Chi Acknowledgements Dr. Li Lin,

More information

Three Generations of DIF Analyses: Considering Where It Has Been, Where It Is Now, and Where It Is Going

Three Generations of DIF Analyses: Considering Where It Has Been, Where It Is Now, and Where It Is Going LANGUAGE ASSESSMENT QUARTERLY, 4(2), 223 233 Copyright 2007, Lawrence Erlbaum Associates, Inc. Three Generations of DIF Analyses: Considering Where It Has Been, Where It Is Now, and Where It Is Going HLAQ

More information

A Comparison of Several Goodness-of-Fit Statistics

A Comparison of Several Goodness-of-Fit Statistics A Comparison of Several Goodness-of-Fit Statistics Robert L. McKinley The University of Toledo Craig N. Mills Educational Testing Service A study was conducted to evaluate four goodnessof-fit procedures

More information

Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria

Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria Thakur Karkee Measurement Incorporated Dong-In Kim CTB/McGraw-Hill Kevin Fatica CTB/McGraw-Hill

More information

A Comparison of Pseudo-Bayesian and Joint Maximum Likelihood Procedures for Estimating Item Parameters in the Three-Parameter IRT Model

A Comparison of Pseudo-Bayesian and Joint Maximum Likelihood Procedures for Estimating Item Parameters in the Three-Parameter IRT Model A Comparison of Pseudo-Bayesian and Joint Maximum Likelihood Procedures for Estimating Item Parameters in the Three-Parameter IRT Model Gary Skaggs Fairfax County, Virginia Public Schools José Stevenson

More information

Author's response to reviews

Author's response to reviews Author's response to reviews Title: Comparison of two Bayesian methods to detect mode effects between paper-based and computerized adaptive assessments: A preliminary Monte Carlo study Authors: Barth B.

More information

Emotions as infectious diseases in a large social network: the SISa. model

Emotions as infectious diseases in a large social network: the SISa. model 1 2 Emotions as infectious diseases in a large social network: the SISa model 3 4 Alison L. Hill, David G. Rand, Martin A. Nowak, Nicholas A. Christakis June 6, 2010 5 6 7 8 9 10 11 12 13 14 Supplementary

More information

UCLA UCLA Electronic Theses and Dissertations

UCLA UCLA Electronic Theses and Dissertations UCLA UCLA Electronic Theses and Dissertations Title Detection of Differential Item Functioning in the Generalized Full-Information Item Bifactor Analysis Model Permalink https://escholarship.org/uc/item/3xd6z01r

More information

Describing and Categorizing DIP. in Polytomous Items. Rebecca Zwick Dorothy T. Thayer and John Mazzeo. GRE Board Report No. 93-1OP.

Describing and Categorizing DIP. in Polytomous Items. Rebecca Zwick Dorothy T. Thayer and John Mazzeo. GRE Board Report No. 93-1OP. Describing and Categorizing DIP in Polytomous Items Rebecca Zwick Dorothy T. Thayer and John Mazzeo GRE Board Report No. 93-1OP May 1997 This report presents the findings of a research project funded by

More information

Parameter Estimation of Cognitive Attributes using the Crossed Random- Effects Linear Logistic Test Model with PROC GLIMMIX

Parameter Estimation of Cognitive Attributes using the Crossed Random- Effects Linear Logistic Test Model with PROC GLIMMIX Paper 1766-2014 Parameter Estimation of Cognitive Attributes using the Crossed Random- Effects Linear Logistic Test Model with PROC GLIMMIX ABSTRACT Chunhua Cao, Yan Wang, Yi-Hsin Chen, Isaac Y. Li University

More information

Louis Leon Thurstone in Monte Carlo: Creating Error Bars for the Method of Paired Comparison

Louis Leon Thurstone in Monte Carlo: Creating Error Bars for the Method of Paired Comparison Louis Leon Thurstone in Monte Carlo: Creating Error Bars for the Method of Paired Comparison Ethan D. Montag Munsell Color Science Laboratory, Chester F. Carlson Center for Imaging Science Rochester Institute

More information

Comparison of the Null Distributions of

Comparison of the Null Distributions of Comparison of the Null Distributions of Weighted Kappa and the C Ordinal Statistic Domenic V. Cicchetti West Haven VA Hospital and Yale University Joseph L. Fleiss Columbia University It frequently occurs

More information

Comparing DIF methods for data with dual dependency

Comparing DIF methods for data with dual dependency DOI 10.1186/s40536-016-0033-3 METHODOLOGY Open Access Comparing DIF methods for data with dual dependency Ying Jin 1* and Minsoo Kang 2 *Correspondence: ying.jin@mtsu.edu 1 Department of Psychology, Middle

More information

CHAPTER VI RESEARCH METHODOLOGY

CHAPTER VI RESEARCH METHODOLOGY CHAPTER VI RESEARCH METHODOLOGY 6.1 Research Design Research is an organized, systematic, data based, critical, objective, scientific inquiry or investigation into a specific problem, undertaken with the

More information

Inferential Statistics

Inferential Statistics Inferential Statistics and t - tests ScWk 242 Session 9 Slides Inferential Statistics Ø Inferential statistics are used to test hypotheses about the relationship between the independent and the dependent

More information

Differential Item Functioning Analysis of the Herrmann Brain Dominance Instrument

Differential Item Functioning Analysis of the Herrmann Brain Dominance Instrument Brigham Young University BYU ScholarsArchive All Theses and Dissertations 2007-09-12 Differential Item Functioning Analysis of the Herrmann Brain Dominance Instrument Jared Andrew Lees Brigham Young University

More information

Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data

Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data TECHNICAL REPORT Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data CONTENTS Executive Summary...1 Introduction...2 Overview of Data Analysis Concepts...2

More information

Statistical analysis DIANA SAPLACAN 2017 * SLIDES ADAPTED BASED ON LECTURE NOTES BY ALMA LEORA CULEN

Statistical analysis DIANA SAPLACAN 2017 * SLIDES ADAPTED BASED ON LECTURE NOTES BY ALMA LEORA CULEN Statistical analysis DIANA SAPLACAN 2017 * SLIDES ADAPTED BASED ON LECTURE NOTES BY ALMA LEORA CULEN Vs. 2 Background 3 There are different types of research methods to study behaviour: Descriptive: observations,

More information

EPIDEMIOLOGY. Training module

EPIDEMIOLOGY. Training module 1. Scope of Epidemiology Definitions Clinical epidemiology Epidemiology research methods Difficulties in studying epidemiology of Pain 2. Measures used in Epidemiology Disease frequency Disease risk Disease

More information

Connexion of Item Response Theory to Decision Making in Chess. Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan

Connexion of Item Response Theory to Decision Making in Chess. Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan Connexion of Item Response Theory to Decision Making in Chess Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan Acknowledgement A few Slides have been taken from the following presentation

More information

Sawtooth Software. MaxDiff Analysis: Simple Counting, Individual-Level Logit, and HB RESEARCH PAPER SERIES. Bryan Orme, Sawtooth Software, Inc.

Sawtooth Software. MaxDiff Analysis: Simple Counting, Individual-Level Logit, and HB RESEARCH PAPER SERIES. Bryan Orme, Sawtooth Software, Inc. Sawtooth Software RESEARCH PAPER SERIES MaxDiff Analysis: Simple Counting, Individual-Level Logit, and HB Bryan Orme, Sawtooth Software, Inc. Copyright 009, Sawtooth Software, Inc. 530 W. Fir St. Sequim,

More information

Small Group Presentations

Small Group Presentations Admin Assignment 1 due next Tuesday at 3pm in the Psychology course centre. Matrix Quiz during the first hour of next lecture. Assignment 2 due 13 May at 10am. I will upload and distribute these at the

More information

Logistic Regression and Bayesian Approaches in Modeling Acceptance of Male Circumcision in Pune, India

Logistic Regression and Bayesian Approaches in Modeling Acceptance of Male Circumcision in Pune, India 20th International Congress on Modelling and Simulation, Adelaide, Australia, 1 6 December 2013 www.mssanz.org.au/modsim2013 Logistic Regression and Bayesian Approaches in Modeling Acceptance of Male Circumcision

More information

The Influence of Test Characteristics on the Detection of Aberrant Response Patterns

The Influence of Test Characteristics on the Detection of Aberrant Response Patterns The Influence of Test Characteristics on the Detection of Aberrant Response Patterns Steven P. Reise University of California, Riverside Allan M. Due University of Minnesota Statistical methods to assess

More information

Test item response time and the response likelihood

Test item response time and the response likelihood Test item response time and the response likelihood Srdjan Verbić 1 & Boris Tomić Institute for Education Quality and Evaluation Test takers do not give equally reliable responses. They take different

More information

Investigating Causal DIF via Propensity Score Methods

Investigating Causal DIF via Propensity Score Methods A peer-reviewed electronic journal. Copyright is retained by the first or sole author, who grants right of first publication to Practical Assessment, Research & Evaluation. Permission is granted to distribute

More information

On the Performance of Maximum Likelihood Versus Means and Variance Adjusted Weighted Least Squares Estimation in CFA

On the Performance of Maximum Likelihood Versus Means and Variance Adjusted Weighted Least Squares Estimation in CFA STRUCTURAL EQUATION MODELING, 13(2), 186 203 Copyright 2006, Lawrence Erlbaum Associates, Inc. On the Performance of Maximum Likelihood Versus Means and Variance Adjusted Weighted Least Squares Estimation

More information

ITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION SCALE

ITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION SCALE California State University, San Bernardino CSUSB ScholarWorks Electronic Theses, Projects, and Dissertations Office of Graduate Studies 6-2016 ITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION

More information

linking in educational measurement: Taking differential motivation into account 1

linking in educational measurement: Taking differential motivation into account 1 Selecting a data collection design for linking in educational measurement: Taking differential motivation into account 1 Abstract In educational measurement, multiple test forms are often constructed to

More information

Adaptive EAP Estimation of Ability

Adaptive EAP Estimation of Ability Adaptive EAP Estimation of Ability in a Microcomputer Environment R. Darrell Bock University of Chicago Robert J. Mislevy National Opinion Research Center Expected a posteriori (EAP) estimation of ability,

More information

CHAPTER 3 RESEARCH METHODOLOGY

CHAPTER 3 RESEARCH METHODOLOGY CHAPTER 3 RESEARCH METHODOLOGY 3.1 Introduction 3.1 Methodology 3.1.1 Research Design 3.1. Research Framework Design 3.1.3 Research Instrument 3.1.4 Validity of Questionnaire 3.1.5 Statistical Measurement

More information

SAMPLING AND SCREENING PROBLEMS IN RHEUMATIC HEART DISEASE CASE. FINDING STUDY

SAMPLING AND SCREENING PROBLEMS IN RHEUMATIC HEART DISEASE CASE. FINDING STUDY This is a discussion of statistical methods in case-finding studies where there is no accurate or precise diagnostic test for the disease and where the frequency of its occurrence in the population is

More information

Introduction to Survey Research. Clement Stone. Professor, Research Methodology.

Introduction to Survey Research. Clement Stone. Professor, Research Methodology. Clement Stone Professor, Research Methodology Email: cas@pitt.edu 1 Presentation Outline What is survey research and when is it used? Stages of survey research 1. Specifying research questions, target

More information

What are Indexes and Scales

What are Indexes and Scales ISSUES Exam results are on the web No student handbook, will have discussion questions soon Next exam will be easier but want everyone to study hard Biggest problem was question on Research Design Next

More information

Do Your Online Friends Make You Pay? A Randomized Field Experiment on Peer Influence in Online Social Networks Online Appendix

Do Your Online Friends Make You Pay? A Randomized Field Experiment on Peer Influence in Online Social Networks Online Appendix Forthcoming in Management Science 2014 Do Your Online Friends Make You Pay? A Randomized Field Experiment on Peer Influence in Online Social Networks Online Appendix Ravi Bapna University of Minnesota,

More information

Equivalence of Testing Instruments in Canada: Studying Item Bias in a Cross-Cultural Assessment for Preschoolers

Equivalence of Testing Instruments in Canada: Studying Item Bias in a Cross-Cultural Assessment for Preschoolers Equivalence of Testing Instruments in Canada: Studying Item Bias in a Cross-Cultural Assessment for Preschoolers Luana Marotta Stanford University Lucia Tramonte University of New Brunswick J. Douglas

More information

Technical Specifications

Technical Specifications Technical Specifications In order to provide summary information across a set of exercises, all tests must employ some form of scoring models. The most familiar of these scoring models is the one typically

More information

Binary Diagnostic Tests Two Independent Samples

Binary Diagnostic Tests Two Independent Samples Chapter 537 Binary Diagnostic Tests Two Independent Samples Introduction An important task in diagnostic medicine is to measure the accuracy of two diagnostic tests. This can be done by comparing summary

More information

Overview of Lecture. Survey Methods & Design in Psychology. Correlational statistics vs tests of differences between groups

Overview of Lecture. Survey Methods & Design in Psychology. Correlational statistics vs tests of differences between groups Survey Methods & Design in Psychology Lecture 10 ANOVA (2007) Lecturer: James Neill Overview of Lecture Testing mean differences ANOVA models Interactions Follow-up tests Effect sizes Parametric Tests

More information

Using Bayesian Decision Theory to

Using Bayesian Decision Theory to Using Bayesian Decision Theory to Design a Computerized Mastery Test Charles Lewis and Kathleen Sheehan Educational Testing Service A theoretical framework for mastery testing based on item response theory

More information

Quantitative Research Methods and Tools

Quantitative Research Methods and Tools Quantitative Research Methods and Tools Fraser Health Authority, 2011 The Fraser Health Authority ( FH ) authorizes the use, reproduction and/or modification of this publication for purposes other than

More information

ANXIETY A brief guide to the PROMIS Anxiety instruments:

ANXIETY A brief guide to the PROMIS Anxiety instruments: ANXIETY A brief guide to the PROMIS Anxiety instruments: ADULT PEDIATRIC PARENT PROXY PROMIS Pediatric Bank v1.0 Anxiety PROMIS Pediatric Short Form v1.0 - Anxiety 8a PROMIS Item Bank v1.0 Anxiety PROMIS

More information

Experimental Psychology

Experimental Psychology Title Experimental Psychology Type Individual Document Map Authors Aristea Theodoropoulos, Patricia Sikorski Subject Social Studies Course None Selected Grade(s) 11, 12 Location Roxbury High School Curriculum

More information

Validating Measures of Self Control via Rasch Measurement. Jonathan Hasford Department of Marketing, University of Kentucky

Validating Measures of Self Control via Rasch Measurement. Jonathan Hasford Department of Marketing, University of Kentucky Validating Measures of Self Control via Rasch Measurement Jonathan Hasford Department of Marketing, University of Kentucky Kelly D. Bradley Department of Educational Policy Studies & Evaluation, University

More information

Centre for Education Research and Policy

Centre for Education Research and Policy THE EFFECT OF SAMPLE SIZE ON ITEM PARAMETER ESTIMATION FOR THE PARTIAL CREDIT MODEL ABSTRACT Item Response Theory (IRT) models have been widely used to analyse test data and develop IRT-based tests. An

More information

Structural Equation Modeling (SEM)

Structural Equation Modeling (SEM) Structural Equation Modeling (SEM) Today s topics The Big Picture of SEM What to do (and what NOT to do) when SEM breaks for you Single indicator (ASU) models Parceling indicators Using single factor scores

More information

Two-Way Independent ANOVA

Two-Way Independent ANOVA Two-Way Independent ANOVA Analysis of Variance (ANOVA) a common and robust statistical test that you can use to compare the mean scores collected from different conditions or groups in an experiment. There

More information

Computerized Adaptive Testing for Classifying Examinees Into Three Categories

Computerized Adaptive Testing for Classifying Examinees Into Three Categories Measurement and Research Department Reports 96-3 Computerized Adaptive Testing for Classifying Examinees Into Three Categories T.J.H.M. Eggen G.J.J.M. Straetmans Measurement and Research Department Reports

More information

12/31/2016. PSY 512: Advanced Statistics for Psychological and Behavioral Research 2

12/31/2016. PSY 512: Advanced Statistics for Psychological and Behavioral Research 2 PSY 512: Advanced Statistics for Psychological and Behavioral Research 2 Introduce moderated multiple regression Continuous predictor continuous predictor Continuous predictor categorical predictor Understand

More information

Supplementary Materials:

Supplementary Materials: Supplementary Materials: Depression and risk of unintentional injury in rural communities a longitudinal analysis of the Australian Rural Mental Health Study (Inder at al.) Figure S1. Directed acyclic

More information

Analysis of Environmental Data Conceptual Foundations: En viro n m e n tal Data

Analysis of Environmental Data Conceptual Foundations: En viro n m e n tal Data Analysis of Environmental Data Conceptual Foundations: En viro n m e n tal Data 1. Purpose of data collection...................................................... 2 2. Samples and populations.......................................................

More information

Initial Report on the Calibration of Paper and Pencil Forms UCLA/CRESST August 2015

Initial Report on the Calibration of Paper and Pencil Forms UCLA/CRESST August 2015 This report describes the procedures used in obtaining parameter estimates for items appearing on the 2014-2015 Smarter Balanced Assessment Consortium (SBAC) summative paper-pencil forms. Among the items

More information

breast cancer; relative risk; risk factor; standard deviation; strength of association

American Journal of Epidemiology The Author 2015. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health. All rights reserved. For permissions, please e-mail:

INTRODUCTION TO ITEM RESPONSE THEORY APPLIED TO FOOD SECURITY MEASUREMENT. Basic Concepts, Parameters and Statistics

INTRODUCTION TO ITEM RESPONSE THEORY APPLIED TO FOOD SECURITY MEASUREMENT Basic Concepts, Parameters and Statistics The designations employed and the presentation of material in this information product

(entry, )

http://www.eolss.net (entry, 6.27.3.4) Reprint of: THE CONSTRUCTION AND USE OF PSYCHOLOGICAL TESTS AND MEASURES Bruno D. Zumbo, Michaela N. Gelin, & Anita M. Hubley The University of British Columbia,

Propensity Score Methods for Causal Inference with the PSMATCH Procedure

Paper SAS332-2017 Propensity Score Methods for Causal Inference with the PSMATCH Procedure Yang Yuan, Yiu-Fai Yung, and Maura Stokes, SAS Institute Inc. Abstract In a randomized study, subjects are randomly

Copyright. Kelly Diane Brune

Copyright by Kelly Diane Brune 2011 The Dissertation Committee for Kelly Diane Brune Certifies that this is the approved version of the following dissertation: An Evaluation of Item Difficulty and Person

Journal of Educational and Psychological Studies - Sultan Qaboos University (Pages ) Vol.7 Issue

Journal of Educational and Psychological Studies - Sultan Qaboos University (Pages 537-548) Vol.7 Issue 4 2013 Constructing a Scale of Attitudes toward School Science Using the General Graded Unfolding

Lecture Outline. Biost 517 Applied Biostatistics I. Purpose of Descriptive Statistics. Purpose of Descriptive Statistics

Biost 517 Applied Biostatistics I Scott S. Emerson, M.D., Ph.D. Professor of Biostatistics University of Washington Lecture 3: Overview of Descriptive Statistics October 3, 2005 Lecture Outline Purpose

DAZED AND CONFUSED: THE CHARACTERISTICS AND BEHAVIOR OF TITLE CONFUSED READERS

Worldwide Readership Research Symposium 2005 Session 5.6 DAZED AND CONFUSED: THE CHARACTERISTICS AND BEHAVIOR OF TITLE CONFUSED READERS Martin Frankel, Risa Becker, Julian Baim and Michal Galin, Mediamark

Measurement invariance and Differential Item Functioning. Short course in Applied Psychometrics Peterhouse College, January 2012

Measurement invariance and Differential Item Functioning Short course in Applied Psychometrics Peterhouse College, 10-12 January 2012 This course The course is funded by the ESRC RDI and hosted by The

Regression Including the Interaction Between Quantitative Variables

Regression Including the Interaction Between Quantitative Variables The purpose of the study was to examine the inter-relationships among social skills, the complexity of the social situation, and performance

Psychology Research Process

Psychology Research Process Logical Processes Induction Observation/Association/Using Correlation Trying to assess, through observation of a large group/sample, what is associated with what? Examples:

AP Statistics. Semester One Review Part 1 Chapters 1-5

AP Statistics Semester One Review Part 1 Chapters 1-5 AP Statistics Topics Describing Data Producing Data Probability Statistical Inference Describing Data Ch 1: Describing Data: Graphically and Numerically

Current Directions in Mediation Analysis David P. MacKinnon 1 and Amanda J. Fairchild 2

CURRENT DIRECTIONS IN PSYCHOLOGICAL SCIENCE Current Directions in Mediation Analysis David P. MacKinnon 1 and Amanda J. Fairchild 2 1 Arizona State University and 2 University of South Carolina ABSTRACT

Ruth A. Childs Ontario Institute for Studies in Education, University of Toronto

The Alberta Journal of Educational Research Vol. 56, No. 4, Winter 2010, 459-469 Barnabas C. Emenogu Olesya Falenchuk and Ruth A. Childs Ontario Institute for Studies in Education, University of Toronto

Analysis of Rheumatoid Arthritis Data using Logistic Regression and Penalized Approach

University of South Florida Scholar Commons Graduate Theses and Dissertations Graduate School November 2015 Analysis of Rheumatoid Arthritis Data using Logistic Regression and Penalized Approach Wei Chen

The Impact of Other Factors: Confounding, Mediation, and Effect Modification

The Impact of Other Factors: Confounding, Mediation, and Effect Modification Amy Yang Senior Statistical Analyst Biostatistics Collaboration Center Oct. 14 2016 BCC: Biostatistics Collaboration Center

A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests

A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests David Shin Pearson Educational Measurement May 007 rr0701 Using assessment and research to promote learning Pearson Educational

Statistical reports Regression, 2010

Statistical reports Regression, 2010 Niels Richard Hansen June 10, 2010 This document gives some guidelines on how to write a report on a statistical analysis. The document is organized into sections that

Panel: Using Structural Equation Modeling (SEM) Using Partial Least Squares (SmartPLS)

Panel: Using Structural Equation Modeling (SEM) Using Partial Least Squares (SmartPLS) Presenters: Dr. Faizan Ali, Assistant Professor Dr. Cihan Cobanoglu, McKibbon Endowed Chair Professor University of

Self-Oriented and Socially Prescribed Perfectionism in the Eating Disorder Inventory Perfectionism Subscale

Self-Oriented and Socially Prescribed Perfectionism in the Eating Disorder Inventory Perfectionism Subscale Simon B. Sherry, 1 Paul L. Hewitt, 1 * Avi Besser, 2 Brandy J. McGee, 1 and Gordon L. Flett 3

Detection Theory: Sensitivity and Response Bias

Detection Theory: Sensitivity and Response Bias Lewis O. Harvey, Jr. Department of Psychology University of Colorado Boulder, Colorado The Brain (Observable) Stimulus System (Observable) Response System

Non-Randomized Trials

Non-Randomized Trials ADA Research Toolkit ADA Research Committee 2011 American Dietetic Association. This presentation may be used for educational purposes Learning Objectives At the end of this presentation

Copyright. Hwa Young Lee

Copyright by Hwa Young Lee 2012 The Dissertation Committee for Hwa Young Lee certifies that this is the approved version of the following dissertation: Evaluation of Two Types of Differential Item Functioning

Investigating the robustness of the nonparametric Levene test with more than two groups

Psicológica (2014), 35, 361-383. Investigating the robustness of the nonparametric Levene test with more than two groups David W. Nordstokke * and S. Mitchell Colp University of Calgary, Canada Testing

Title: Identifying work ability promoting factors for home care aides and assistant nurses

Author's response to reviews Title: Identifying work ability promoting factors for home care aides and assistant nurses Authors: Agneta Larsson (agneta.larsson@ltu.se) Lena Karlqvist (lena.karlqvist@ltu.se)

Heterogeneity and statistical significance in meta-analysis: an empirical study of 125 meta-analyses

STATISTICS IN MEDICINE Statist. Med. 2000; 19: 1707-1728 Heterogeneity and statistical significance in meta-analysis: an empirical study of 125 meta-analyses Eric A. Engels *, Christopher H. Schmid, Norma

Section on Survey Research Methods JSM 2009

Missing Data and Complex Samples: The Impact of Listwise Deletion vs. Subpopulation Analysis on Statistical Bias and Hypothesis Test Results when Data are MCAR and MAR Bethany A. Bell, Jeffrey D. Kromrey

10. LINEAR REGRESSION AND CORRELATION

10. LINEAR REGRESSION AND CORRELATION The contingency table describes an association between two nominal (categorical) variables (e.g., use of supplemental oxygen and mountaineer survival). We have

Sawtooth Software. The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? RESEARCH PAPER SERIES

Sawtooth Software RESEARCH PAPER SERIES The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? Dick Wittink, Yale University Joel Huber, Duke University Peter Zandan,

Item Bias in the Center for Epidemiologic Studies Depression Scale: Effects of Physical Disorders and Disability in an Elderly Community Sample

Journal of Gerontology: PSYCHOLOGICAL SCIENCES 2000, Vol. 55B, No. 5, P273-P282 Copyright 2000 by The Gerontological Society of America Item Bias in the Center for Epidemiologic Studies Depression Scale:

Statistically Speaking Lecture Series

Statistically Speaking Lecture Series Sponsored by the Biostatistics Collaboration Center The Impact of Other Factors: Confounding, Mediation, and Effect Modification Amy Yang, MSc Senior Statistical Analyst

Testing for non-response and sample selection bias in contingent valuation: Analysis of a combination phone/mail survey

Whitehead, J.C., Groothuis, P.A., and Blomquist, G.C. (1993) Testing for Nonresponse and Sample Selection Bias in Contingent Valuation: Analysis of a Combination Phone/Mail Survey, Economics Letters, 41(2):

Propensity Score Analysis: Its rationale & potential for applied social/behavioral research. Bob Pruzek University at Albany

Propensity Score Analysis: Its rationale & potential for applied social/behavioral research Bob Pruzek University at Albany Aims: First, to introduce key ideas that underpin propensity score (PS) methodology

LSAC RESEARCH REPORT SERIES. Law School Admission Council Research Report March 2008

LSAC RESEARCH REPORT SERIES Conceptual Issues in Response-Time Modeling Wim J. van der Linden University of Twente, Enschede, The Netherlands Law School Admission Council Research Report 08-01 March 2008

Tutorial 3: MANOVA. Pekka Malo 30E00500 Quantitative Empirical Research Spring 2016

Tutorial 3: Pekka Malo 30E00500 Quantitative Empirical Research Spring 2016 Step 1: Research design Adequacy of sample size Choice of dependent variables Choice of independent variables (treatment effects)

Chapter 11. Multilevel Modeling for Psychologists. John B. Nezlek

Chapter 11 Multilevel Modeling for Psychologists John B. Nezlek Multilevel analyses have become increasingly common in psychological research, although unfortunately, many researchers' understanding

ANXIETY. A brief guide to the PROMIS Anxiety instruments:

ANXIETY A brief guide to the PROMIS Anxiety instruments: ADULT ADULT CANCER PEDIATRIC PARENT PROXY PROMIS Bank v1.0 Anxiety PROMIS Short Form v1.0 Anxiety 4a PROMIS Short Form v1.0 Anxiety 6a PROMIS Short