Author's response to reviews


Title: Comparison of two Bayesian methods to detect mode effects between paper-based and computerized adaptive assessments: A preliminary Monte Carlo study

Authors: Barth B. Riley (barthriley@comcast.net), Adam C. Carle (adam.carle.cchmc@gmail.com)

Version: 2

Date: 3 May 2012

Author's response to reviews: see over

April 24th, 2012

Ralph D'Agostino, Vern Farewell, Joel Greenhouse, and Louise Ryan
Editors, BMC Medical Research Methodology

RE: MS 1963685156217877

Dear Editors:

Thank you for the opportunity to resubmit our manuscript, originally entitled "Comparison of two Bayesian Methods to Detect Differential Item Functioning between Paper-Based and Computerized Adaptive Assessments: A Monte Carlo Study." With this letter, we submit a revised version that takes into account all of the comments of each reviewer. We appreciate the reviewers' thoughtful recommendations and believe that the manuscript has been strengthened as a result of this process.

We note that the revised version is lengthy due to the reviewers' recommendations for additional text and for reorganization of sections to improve clarity. We have also added supplementary online material (example data files, referred to in the text as Additional Files 2-4). Consequently, we have attempted to shorten the manuscript as much as possible (e.g., removing Figure 1 of the original version and deleting unnecessary verbiage). Table 4 from the original version has been split into two tables in order to allow each table to be presented in portrait format.

This manuscript has not been previously published, nor is it under consideration in the same or similar form in any other journal. Again, we are truly grateful to the reviewers for their helpful comments and to you for the opportunity to address these comments and provide a revised manuscript. We look forward to hearing from you.

Sincerely,

Barth Riley, Ph.D.
Department of Health Systems Science, M/C 802, University of Illinois at Chicago
845 S. Damen Ave., Chicago, IL 60612, USA
Email: barthr@uic.edu

Reviewer 1

Minor Essential Revisions

1. Bayesian method. I think the authors should explain the reasoning behind each step in the three-step Bayesian method. For instance, I am not sure why step 1 is necessary. It is very standard to obtain item parameters without estimates of underlying ability.

We have made some revisions to the text on pages 8-9 to provide a rationale for each step. For step 1, we added the sentence: "This is to ensure that item parameters estimated in subsequent steps are on a common metric." Step 2 now reads: "Using θi obtained in Step 1, estimate the posterior distributions of mode-specific item parameters for subsequent comparison in step 3" (bold italicized text added). The first sentence describing step 3 has been changed to: "Estimate DIF for each item common across modes by assessing the difference in the posterior distributions of the item parameters (i.e., between βj^CAT and βj^P&P)."

2. Design. I am sorry, but I can't find how many simulated test-takers were included in each replicate. Perhaps I am missing this information. If absent, this detail should be added to the manuscript.

On p. 11 (end of paragraph 0), the following was modified: "Person measures (θi) for 500 simulees were generated using an N(0, 1) standard normal distribution." (Bold italicized text was added.) On p. 12 (second paragraph) we added the following: "Generation of CAT Item Response Data: Prior to performing CAT simulations, response data for all 100 items in the simulated item banks described above were generated for a total of 3,000 simulees in each iteration. This sample size permitted examination of the effect of CAT item usage on DIF detection rates."

3. Conclusion. Page 24, para 2: The authors write, "The present study lends support to the use of a Bayesian framework for assessing DIF between CAT and conventional modes of administering an instrument." As I look at Table 2, false discovery looks well controlled, but power seems really low. What I take away from this is that neither of the two approaches is particularly good at detecting DIF in CAT vs. P&P given the study design. Am I missing the point? If yes, please spell it out for me. If no, perhaps an additional dimension to add to the simulation study is the number of simulated test-takers, so that power gets up to a level at which a person might make a decision about how to go about designing a study to assess mode DIF effects in CAT.

We acknowledge the reviewer's point and have deleted the sentence in question. While it is difficult to predict the number of test-takers required, since there is no direct correspondence between sample size and item usage, on p. 16 (paragraph 1) we added the following:

"For RZ, item usage values of 369 and 422 were associated with 80 percent power to detect DIF for the 1PL and 2PL conditions, respectively. For CrI, 80 percent power was associated with CAT item usage of 305 and 341 for the 1PL and 2PL conditions, respectively."

Discretionary Revisions

4. Title: Consider including "CAT Mode Effects" or some such in the title.

The title has been changed as follows: "Comparison of two Bayesian Methods to Detect Mode Effects between Paper-Based and Computerized Adaptive Assessments: A Preliminary Monte Carlo Study."

5. Page 7, para 1: Consider using "power" instead of "power rates."

The term "power rates" has been replaced with "power."

6. Page 7, para 2: Consider inserting an "s" in "representing the number of time[s]."

The modification has been made.

7. Page 11, para 1: Are 10 replications per condition a sufficient number of replications?

We acknowledged the small number of iterations as a limitation of the study findings in the Discussion section. As stated in the Abstract and at the end of the Introduction, this is a preliminary study. The decision to use 10 iterations per combination of conditions was due to the time-intensive nature of the Markov chain Monte Carlo simulations. Other studies (e.g., Glickman et al., 2009; see references) have employed this number of iterations.

8. Page 12, para 0: Could the authors comment on what kinds of health constructs are represented by an item bank that has difficulty parameters uniformly distributed in a range [-3, +3] and population severity/ability parameters forming a distribution N(0, 1)? Specifically, is it reasonable to propose that such an item difficulty and severity/ability distribution reflects the kinds of item banks researchers are likely to encounter in health research settings?

We realize that the simulation results may not generalize to all applications in which CAT and paper-based health assessments are compared. We have added a sentence to the Discussion acknowledging the importance of testing the procedures using various distributions of parameters (p. 26, paragraph 2): "Further research is needed to examine the robustness of the method under varying prior assumptions concerning the distribution of item and person parameters and when data fail to conform to these prior assumptions." It was nevertheless important for our purposes to assess the ability of the two procedures to detect DIF in items at varying difficulty levels, which we believed was best achieved through use of a uniform distribution of item difficulty values. We also wanted to

minimize floor and ceiling effects, in both items and simulees, for which parameters could not be estimated and which may result from skewed, low/high average, or widely varying distributions of person/item parameters. Further, it is not uncommon practice to set distributional properties of certain parameters in order to identify the model, e.g., setting the standard deviation of person measures to 1.0.

9. Page 12, para 1: The authors report RMSE: is this reasonably good fit?

We are not aware of specific criteria or cutoffs for "good" RMSE in this context. However, we have added the following (p. 13, paragraph 1): "RMSEs and correlations between item parameters and their estimates were consistent with parameter recovery results presented elsewhere [38-40]."

10. Pages 11-12: Perhaps somewhere in here the authors could describe the sample size used for estimating the item bank parameter estimates (using Mplus).

See response to #2 above.

12. Page 13, para 1: Is it reasonable to envision a researcher seeking to examine DIF due to mode effects (P&P vs. CAT) when μ_CAT is 1 and μ_P&P is 0? I think this would be an extremely suspect design and probably a condition not worth considering. Perhaps I am being unimaginative. Regardless, the paper might be stronger if the authors would comment on this point.

Given that mean differences in group measures have been shown to affect DIF detection (see references 26-28 in the manuscript), we wanted to include this as a condition in our simulation study. We have included in the Discussion (p. 26, paragraph 1) the following sentence: "These results highlight the need for careful sampling of respondents who complete each form of the instrument and assessment of trait-level differences prior to assessment of mode effects."

13. Page 16, para 2: Consider using "specificity" instead of "specificity rates." And elsewhere throughout the manuscript, consider deprecating "rate" in favor of percentage, percent, or no additional qualification.

The modification was made.

14. Results: Is it reasonable to present results to three places of precision given a sample size of 160 replications?

We have rounded all values in Table 2 to two decimal places and have made similar changes throughout the text. Consequently, we also revised the text on p. 14 (last paragraph), which now includes the sentence: "CAT to full-instrument correlations were .97 across all conditions. MSEs were 0.26 for the 1PL and 0.23-0.24 for the 2PL conditions."
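As an aside for readers of this letter, the following is a minimal sketch, in Python, of the kind of data-generation step described in our responses to points 2, 7, and 8 above. This is illustrative only, not the study code; the seed and variable names are arbitrary, and the 1PL case shown fixes all discriminations at 1.

    import numpy as np

    rng = np.random.default_rng(0)  # illustrative seed

    n_items, n_simulees = 100, 500            # 500 simulees per replicate (3,000 for the CAT response data)
    b = rng.uniform(-3.0, 3.0, n_items)       # item difficulties, uniform on [-3, +3]
    a = np.ones(n_items)                      # 1PL: discriminations fixed at 1; a 2PL bank would vary these
    theta = rng.normal(0.0, 1.0, n_simulees)  # person measures, N(0, 1)

    # Response probabilities P(X = 1 | theta) = 1 / (1 + exp(-a * (theta - b)))
    p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
    responses = rng.binomial(1, p)            # simulated 0/1 item responses

Mode DIF of a given size would then be induced by shifting b for the designated 10% or 30% of items in one mode before generating that mode's responses.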

15. Page 22, para 0: This is a comment/suggestion and not a point that needs addressing in this manuscript (unless the authors agree with the suggestion). The authors write, "Research is clearly needed to determine the causes of elevated false positive rate for easy-to-endorse items." This reminds me of the "constant bias" problem associated with observed-variable conditioning approaches to DIF detection (Millsap and Everson, 1993, Applied Psychological Measurement 17:297-334; Camilli, in Differential Item Functioning, Holland et al., Eds. [Lawrence Erlbaum Associates, Hillsdale, NJ, 1993], pp. 397-413). I suspect this is due to step 1 of the procedure, which estimates a latent ability trait assuming no DIF and uses this to estimate item parameters in a second step. I think what you need is a method that allows for simultaneously estimating item parameters and possible differences across the P&P and CAT administration groups in selected items, while controlling for that DIF effect on underlying ability estimates and underlying ability, in a single step. This could be done with Mplus and a MIMIC model approach, but the authors would have to translate SEM parameters (measurement slopes, thresholds, and direct effects) into IRT parameters (cf. Macintosh and Hashim, 2003, Applied Psychological Measurement 27:372-9).

The reviewer's point is well taken. We have added a sentence following the sentence on p. 22 that makes this and one other recommendation for further research: "Two possible avenues of research in this area include: (1) further examination of different priors for item parameters and their effect on DIF detection for easy-to-endorse items, and (2) an iterative process of identifying DIF items and then removing or appropriately weighting them in the estimation of person measures." Regarding the suggestion of using a MIMIC model in Mplus, our experience with this approach has been problematic, since it may be difficult to converge on a solution when there is a considerable amount of missing data, as is the case in CAT.

16. Page 22, para 2: One more explanatory sentence in this paragraph ("As might be expected...") would be a welcome addition.

The revised paragraph is as follows: "As might be expected, DIF magnitude (i.e., the difference between CAT and P&P item parameters for a given item) was significantly and positively related to power. However, the same was not true for the percentage of items with DIF in the item bank. The latter result suggests that the power to detect a single DIF item is not significantly affected by the presence of other DIF items in the bank, which may 'contaminate' the person measures." (Bold italicized text was added.)

17. Page 22, para 2: The authors report that a number of factors were related to power and are usually explained by their association with frequency of use in CAT (e.g., difficulty, discrimination). Perhaps the authors have not adequately modeled the dependency of power to detect DIF on CAT item frequency. That is, the relationship may be non-linear with respect to CAT

item frequency. In this case, the explanatory models for power would be misspecified.

Given the length and complexity of the paper, we decided to focus only on main effects in the prediction of true and false DIF detection. It is quite possible that inclusion of interaction effects may be warranted and may explicate the relationship of CAT item usage, item difficulty, and discrimination with respect to power. We may see, for instance, that the relationship of CAT item usage to true DIF detection varies with item difficulty and/or discrimination. This would perhaps reflect the non-linear effect the reviewer is suggesting. However, since the relationship of item usage to power was statistically significant and model AUCs were high (.95 and .93) for both RZ and CrI, the models in their present form are quite strong.

18. Reproducibility and replication. I think the authors should consider including one or two CSVs of their generated data as a supplementary file. Researchers interested in developing other approaches to DIF detection in CAT might benefit from comparing their results to the ones published in this manuscript.

We are certainly open to including additional material via a weblink associated with the paper. While we can provide sample data used in different phases of the project (parameter estimation, CAT simulation, DIF estimation), it would be helpful to know which data would be most useful.

Reviewer 2

Major Compulsory Revisions

Introduction

1. There has been quite a bit of literature comparing P&P and computerized administration (although not CAT) of self-report measures. For example, a recent meta-analysis [1] of 65 papers comparing computerized and P&P administration of patient-reported outcomes (PRO) found that the methods are equivalent. The authors should consider this research and determine whether it is relevant to their study.

There are two points to consider when examining the Gwaltney et al. (2008) investigation in the context of the present study. First, Gwaltney et al. (2008) did not examine mode DIF but rather mode effects at the person level. Second, unlike computer-based assessment, CAT relies heavily on item parameters for item selection and on standard errors to determine when to stop the assessment. DIF, even if it does not result in differential test functioning, can adversely affect CAT efficiency and possibly cause some measures to be biased. We therefore maintain that the issue of mode effects remains relevant when comparing conventional and CAT-administered assessments.

2. A related issue is the generalizability of this work. It is not clear how often one will need to compare P&P and CAT administration to assess DIF. Ideally, one would determine the existence of DIF by administering the

entire item bank via both computer and paper and then assess DIF. This procedure would allow one to use traditional methods to assess DIF, which are well developed and far easier to implement.

A critical question is whether comparison of paper-administered and computer-administered banks of items would generalize to the CAT vs. P&P comparison of interest here. A unique feature of CAT is that items are selected sequentially based on previously administered items and estimates of trait level. The use of item parameters that are not appropriate for the CAT context can affect the efficiency and precision of the CAT. Variation in item administration order may also lead to differences in how respondents answer certain items, which would not be reflected in a fixed-order computer-based assessment.

Methods

3. The authors need to include a data section so that they can describe: (i) how the data were developed; (ii) the assumptions underlying the data; (iii) the item banks constructed from the data; and (iv) any other relevant information that will help the reader understand the dataset and replicate it.

We have inserted the heading Data Generation Procedures at the top of p. 11. This section extends to the middle of p. 13. In it, we describe the distributional assumptions underlying item and person parameters and the CAT procedures used in the CAT simulation. We also provide a flowchart summarizing the overall process of the simulations. In addition, we plan to provide example datasets as online material.

4. The authors state that their primary interest was identification of DIF in psychological and health-outcome measures. To do this, the authors generated item banks that included 100 items and simulated CAT administration using 30 items for each pseudo-respondent. This does not seem to be consistent with CATs used to assess health outcomes and probably does not reflect those assessing psychological outcomes. Item banks developed through the PROMIS initiative, for example, have as few as 9 items, and most have between 30 and 40 items. Further, in practice, most CATs in healthcare are not of a fixed length. Rather, most are designed to terminate after a minimum criterion is exceeded. PROMIS CATs, for example, often terminate after 5 items are administered.

As stated above, we recognize that the selection of the number and distribution of items and pseudo-respondents may not reflect actual application. First, we have modified the sentence in question as follows: "Second, we focused on instruments fitting a one- (1PL) or two-parameter (2PL) IRT model [10-12], which are commonly applied to health outcome measures." Second, we have modified the sentence on p. 25 concerning the study limitations: "Additional limitations of the present study include the small number of replications per experimental condition, the use of a fixed-length CAT, and the fixed item bank size."
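For context on this point: in a CAT, each successive item is typically selected to be maximally informative at the current trait estimate, which is why item usage, difficulty, and discrimination are intertwined in our results. Below is a minimal sketch of a generic maximum-information selection step; it is not the specific routine used in our simulations, and all names are illustrative.

    import numpy as np

    def next_item(theta_hat, a, b, administered):
        """Return the index of the unused item with the largest Fisher
        information at the current trait estimate; for the 2PL model,
        I(theta) = a^2 * p * (1 - p) with p = P(X = 1 | theta)."""
        p = 1.0 / (1.0 + np.exp(-a * (theta_hat - b)))
        info = a ** 2 * p * (1.0 - p)
        info[list(administered)] = -np.inf  # exclude items already given
        return int(np.argmax(info))

Under this kind of rule, items that are very easy or very hard to endorse are seldom administered to simulees near the center of an N(0, 1) trait distribution, which is consistent with the item-usage effects on power discussed elsewhere in this letter.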

5. I believe the paper would benefit from reorganizing this section. For example, virtually all the information in the Methods section should be subsumed under the Analysis subsection. Within this section, the analyses should be presented in a coherent order, possibly the sequence in which the analyses were conducted. Each of these subsections would then include parts the authors have already written, such as Generation of the Validation (Paper-and-Pencil) Item Parameters and Response Data.

We have made a number of changes to this section in order to more clearly convey the study procedures and their order. We have done this by including additional section headings and sentences or paragraphs describing the organization of the subsequent section. For instance, we have inserted the following (top of p. 8) as an introduction to the Methods section (p. 9): "The methods employed in this study will be presented in three main sections. First, we describe the development and underlying assumptions of two Bayesian methods for detecting item-level mode effects. Second, we describe the simulation study, including its design and data generation procedures. The third section outlines the analysis of the simulated data."

Results

1. The authors should report the findings using the revised format of the Analysis section, or with some alterations that facilitate reader understanding.

We have made revisions throughout the Results section to convey the order of the analyses outlined in the Analysis section. We have, for instance, added text to link results to the research questions, as was done in the Analysis section.

Discussion

1. The authors provide a thoughtful presentation of the pros and cons of both procedures; however, it is not until the Summary section that they finally state their finding that the modified robust Z statistic is preferred. This should be stated earlier, perhaps in the 2nd paragraph of the Discussion section. Further, it is not clear whether this finding is consistent across 1- and 2-parameter models.

In the first paragraph of the Discussion we wrote: "The CrI method resulted in slightly higher power, but this was offset by a higher false positive rate relative to RZ." For both 1PL and 2PL conditions, power was higher for CrI than RZ, but control of Type I error was better for RZ compared to CrI. Thus, power and Type I error control results were consistent across these two conditions.

2. The authors do not discuss the limitations of their methodology. Will these findings be replicated using real data? Will the inclusion of items with non-uniform DIF affect the results? If not, why? The authors do report in the

Conclusion section that the procedure can be adapted to assess non-uniform DIF too, but this is not a conclusion.

We have removed any statement in the manuscript concerning the potential utility of the studied procedures in detecting non-uniform mode DIF.

Minor Essential Revisions

Abstract (all of these are minor)

1. The Methods section needs to be modified. I understand that there are 16 conditions, but it takes the reader some work to sort this out.

To provide greater clarity as to how the factors contributed to the total of 16 conditions, we modified the sentence in the abstract as shown below: "A simulation study was conducted under the following conditions: (1) data-generating model (one- vs. two-parameter IRT model); (2) moderate vs. large DIF sizes; (3) percentage of DIF items (10% vs. 30%); and (4) mean difference in θ estimates across modes of 0 vs. 1 logits."

2. It is not clear whether the results are consistent across 1- and 2-parameter models.

We modified the first sentence of the Results paragraph of the Abstract as follows: "Results: Both methods evidenced good to excellent false positive control, with RZ providing better control of false positives and with slightly higher power for CrI, irrespective of measurement model."

3. The data the authors used are relevant to the findings; thus, it seems a brief description is in order.

As data for the study were simulated, we have provided a detailed description of the data-generation procedures under Data Generation (beginning of p. 11). We will also provide example datasets as online material.

4. In the Discussion section, the authors state that the modified robust Z statistic is preferred. This should be stated in the abstract.

We state in the Abstract the relative performance differences between the two methods (see the Results and Conclusion paragraphs).

Introduction

1. The authors discuss the purpose of the study in the final paragraph of the Introduction but restate it, using more detail, in the Methods section (page 14). This seems redundant. The Introduction should be reorganized so that the research questions are listed once.

We deleted the research questions originally listed on page 14 and revised the last section of the Introduction to clearly state the research questions and place them in a single section.
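Because several of the points above turn on how the CrI rule reaches a decision, a minimal sketch may be helpful here. Consistent with step 3 of the procedure (see our response to Reviewer 1, Minor Essential Revisions, point 1), we assume for illustration that an item is flagged when the two-sided credible interval of the CAT-minus-P&P difference in an item parameter excludes zero; the names below are illustrative and this is not the study code.

    import numpy as np

    def cri_dif_flag(beta_cat_draws, beta_pp_draws, cred=0.95):
        """Flag mode DIF for one item when the two-sided credible interval
        of the CAT-minus-P&P posterior difference excludes zero."""
        diff = np.asarray(beta_cat_draws) - np.asarray(beta_pp_draws)
        alpha = (1.0 - cred) / 2.0
        lo, hi = np.quantile(diff, [alpha, 1.0 - alpha])
        return not (lo <= 0.0 <= hi)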

Methods

1. The authors should list the software used, and how it was used, at the end of the Methods section.

We have moved all references to software under a new Software subheading at the end of the Methods section (p. 16).

2. Figure 1 does not optimally depict the steps of the analysis because the boxes in the 1PL and 2PL columns are virtually identical. However, the authors may wish to use that framework to organize the sections of the Analysis section (they would then also need to describe the DIF steps in more detail).

We agree with the reviewer that the sequences of steps for the 1PL and 2PL conditions are virtually identical. Consequently, the inclusion of a flowchart may not add significantly to the reader's understanding of the processes being described. We have therefore decided to remove this figure and have instead made changes to the organization of the text to improve clarity.

3. The authors clearly state that non-uniform DIF was not used in this study, but they do not indicate why. It seems they should include it, or explain why they did not. Perhaps it is also a limitation of the study.

See response to Reviewer 2, Major Compulsory Revisions, Discussion, point 2.

Results

1. It is not clear whether the authors should report information about the data they created.

We have included the section heading Data Generation Procedures and reorganized this section in order to clearly describe how data were simulated at each step of the study. We believe it is important to describe these procedures to facilitate replication of the study. However, we are also open to placing this information in an Appendix if the reviewer feels this would be more appropriate.

2. Page 15 (bottom): The authors use the word "medium," but it seems "median" is appropriate. It is not clear what the authors are demonstrating with this statistic, because they do not include a rationale for its inclusion in the Analysis section.

The word should be "median" instead of "medium." We have made this correction.

3. The paper would benefit from a more thorough treatment of the results. For example, on page 16, the authors state that the results of a comparison of the area under the curve for RZ and CrI are significant, but they do not indicate what this means.

On p. 17 (paragraph 1) we have added the following sentence concerning the ROC analysis comparing RZ and CrI: "This indicates that RZ values are a significantly stronger predictor of the presence of mode DIF compared to CrI values."

4. The authors report ordinary least squares regressions for the first time in the Results section, but they should report all analyses conducted in the Methods section.

We have moved the description of the analysis procedures used to create Figure 2 from the Results to the Analysis section in the Methods. Both the description of the analytic procedures and the results of the ordinary least squares regression are placed under the heading Relationship of Item Difficulty to Power and Type I Error.

Discussion

1. The authors should indicate whether these findings are replicable when CATs do not have a fixed length.

On p. 21 (paragraph 2), we added the following (see bold italicized text): "Thus, power to detect DIF in items that are very easy or difficult to endorse is lower than that for items of average difficulty. This would likely explain why absolute item difficulty was a significant predictor of power even after controlling for CAT item usage. These findings may in part reflect the use of a fixed-length CAT during the simulation. In the case of a variable-length CAT, more items would likely be administered to simulees at the extremes of the trait continuum in order to achieve sufficient measurement precision, including items that are very easy or difficult to endorse. Conversely, we would expect fewer items to be administered to simulees who are in the center of the trait distribution if a variable-length CAT were used."

Miscellaneous

1. The authors provide two supplements but do not refer to them in the paper.

We have made reference to Additional File 1 [Appendix A: Generating and Estimated Item Parameters for the Two Simulated Item Banks (Validation Data) Based on the One- and Two-Parameter IRT Models, Respectively] on p. 11, paragraph 4, and to Additional File 5 (Appendix B: WinBUGS code) on p. 15, paragraph 2. We have also included references in the text to example data files (Additional Files 2-4).
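Finally, to illustrate the ROC analysis referred to under Results, point 3 above: each method's per-item statistic is treated as a predictor of the item's true (simulated) DIF status, and discrimination is summarized by the area under the ROC curve. A self-contained toy example follows; all values are fabricated for illustration and are not study data.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(1)

    # Toy setup: 100 items, the first 10 simulated with DIF.
    true_dif = np.zeros(100, dtype=int)
    true_dif[:10] = 1
    rz_stat = np.abs(rng.normal(0.0, 1.0, 100) + 2.5 * true_dif)  # hypothetical |robust Z| values

    # AUC near 1.0 means the statistic cleanly separates DIF from non-DIF items.
    print(f"AUC: {roc_auc_score(true_dif, rz_stat):.2f}")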