Violating the Independent Observations Assumption in IRT-Based Analyses of 360 Instruments: Can We Get Away with It?


In R. B. Kaiser & S. B. Craig (Co-chairs), Modern Analytic Techniques in the Study of 360 Performance Ratings. Symposium presented at the 16th annual conference of the Society for Industrial-Organizational Psychology, San Diego, April 28, 2001.

Violating the Independent Observations Assumption in IRT-Based Analyses of 360 Instruments: Can We Get Away with It?

S. Bartholomew Craig and Robert B. Kaiser
Kaplan DeVries Inc.

Multisource, or 360, ratings of performance provide an opportunity to violate a key assumption of many statistical procedures: the assumption of independent observations. This study examined the observable consequences of violating that assumption in an application of item response theory (IRT) to real-world ratings of two different domains of leadership performance. Using Raju, van der Linden, and Fleer's (1995) framework for identifying differential functioning in items and tests (DFIT), we found no significant differences between item parameters estimated from samples in which the independence assumption was violated and parameters estimated from samples in which it was not. This finding, which was replicated with tests as short as five items and samples as small as 200, suggests that IRT can be safely applied to multisource ratings without trimming sample sizes to avoid violating the independence assumption.

Author note. The authors would like to thank the Center for Creative Leadership for providing the data used in this research, and Marlene Barlow and Sharon Denny for their assistance with the preparation of this manuscript. Correspondence concerning this article may be sent to Bart Craig or Rob Kaiser, Kaplan DeVries Inc., 1903-G Ashwood Ct., Greensboro, NC. Electronic mail may be sent via Internet to bcraig@kaplandevries.com or rkaiser@kaplandevries.com.

Multisource, or 360, performance assessments are becoming increasingly common in organizations (Tornow & London, 1998). With the popularity of 360 assessments has come an increased interest in the psychometric properties of the instruments involved, and applications of sophisticated techniques such as item response theory (IRT) have become correspondingly more frequent. For example, a number of studies have used IRT methods to investigate the measurement equivalence of 360 instruments across rating sources, language translations, and racial groups (e.g., Facteau & Craig, 2001; Maurer, Raju, & Collins, 1998; Raju, 1999). Although the analysis of multisource performance data presents problems for which such techniques are well suited, the application of these methods is not without caveats. One potential pitfall concerns the nonindependent nature of such ratings. Because multiple raters rate the same target, some data points are not statistically independent of one another, presenting an opportunity to violate a basic assumption of many analytic techniques, including the marginal maximum likelihood estimation procedures used to estimate item parameters under IRT (Bock & Aitkin, 1981). A straightforward solution to the nonindependence problem exists, however: randomly selecting one rater per ratee and thus limiting IRT-based analyses to a subset of the sample within which the independent ratings assumption holds. This random rater procedure effectively solves the problem and has been used to good effect in past research (Raju, 1999).
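To make the procedure concrete, here is a minimal sketch of random rater selection, assuming long-format data in a pandas DataFrame; the file and column names (e.g., `ratee_id`) are hypothetical, not taken from the Benchmarks database:

```python
import pandas as pd

def random_rater_subsample(ratings: pd.DataFrame,
                           ratee_col: str = "ratee_id",
                           seed: int = 42) -> pd.DataFrame:
    """Keep one randomly chosen rater (row) per ratee, so that the
    retained ratings satisfy the independent observations assumption."""
    return (ratings
            .groupby(ratee_col, group_keys=False)
            .sample(n=1, random_state=seed))

# With an average of five raters per ratee, this discards about 80%
# of the available rows, which is the cost discussed below.
# df = pd.read_csv("ratings.csv")            # hypothetical file
# independent_df = random_rater_subsample(df)
```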

Unfortunately, the obvious sacrifice in the use of this solution is sample size. At least two factors make this sacrifice particularly undesirable in this application. First, the magnitude of the sample size reduction is severe. If the average number of raters per target is only five, which is a commonly occurring case (e.g., London & Smither, 1995; Greguras & Robie, 1998), then the random rater procedure excludes 80% of the available data from analysis. This problem is further compounded by the fact that organizational samples are often small to begin with, and by the need to analyze rating sources separately in many research designs, which reduces effective sample size even further. Second, these severe reductions in usable sample size occur in the context of analytic procedures (i.e., those based on IRT) that are among the most data-hungry techniques applied in organizational research. The foundation of most IRT-based analyses is item parameter estimation by means of marginal maximum likelihood (MML) procedures, which can require samples of 1,000 or more just to satisfy convergence criteria, let alone produce parameter estimates with acceptable stability. Further, investigations of differential item functioning (DIF), which constitute a large proportion of the extant IRT-based research on 360 instruments, call for extremely precise item parameter estimates, lest equivalently functioning items be misidentified as showing DIF due to parameter estimate instability. In sum, although the random rater procedure successfully avoids the problem of nonindependent ratings, its associated price tag, in terms of reduced sample size, warrants a search for alternative solutions.

As many researchers have come to realize, statistical procedures are often fairly robust in practice to violations of their assumptions. For example, correlation-based techniques assume at least an interval level of measurement, yet can still be useful when applied to ordinal data. A similar state of affairs exists with regard to distributional normality in a variety of procedures (e.g., regression, factor analysis). If it could be established that IRT-based procedures are robust to violations of response independence, then future research might benefit from researchers having the latitude to decide whether the advantages of the random rater procedure justify the cost. The present study examined the real-world impact of violating the independent ratings assumption in IRT-based analyses under a variety of conditions. This was accomplished by framing the research question as one of differential functioning: do item parameter estimates derived from data that violate the independence assumption differ from those estimated from data that meet the assumption?

Method

Sample and Instrument

An archival database of responses to the Benchmarks 360 leadership assessment instrument (McCauley, Lombardo, & Usher, 1989; Lombardo & McCauley, 1994) was used for this study. Benchmarks, a commercially available instrument published by the Center for Creative Leadership (CCL), contains 164 items in 22 scales (see Note 1), entitled:

1. Resourcefulness
2. Doing Whatever It Takes
3. Being a Quick Study
4. Decisiveness
5. Leading Employees
6. Setting a Developmental Climate
7. Confronting Problem Employees
8. Work Team Orientation
9. Hiring Talented Staff
10. Building and Mending Relationships
11. Compassion and Sensitivity
12. Straightforwardness and Composure

13. Balance between Personal Life and Work
14. Self-Awareness
15. Putting People at Ease
16. Acting with Flexibility
17. Problems with Interpersonal Relationships
18. Difficulty in Molding a Staff
19. Difficulty in Making Transitions
20. Lack of Follow-Through
21. Overdependence
22. Strategic Differences with Management

Benchmarks performance ratings are collected from the superiors, peers, and directly reporting subordinates of focal managers (ratees), as well as from focal managers themselves. Each item is rated on a five-point Likert-type scale ranging from "not at all" to "to a very great extent" to reflect the degree to which the item is typical of the ratee's behavior. The ratings analyzed here were collected between 1992 and 1997, most in preparation for focal managers' participation in a leadership development program at CCL, although data from external users of the instrument also contributed to the database. This instrument has been examined in multiple validation studies (for a review, see McCauley & Lombardo, 1990) and has received favorable reviews as a reliable measure of important aspects of leadership related to management development (e.g., Zedeck, 1995). Although differential item functioning is rarely found in comparisons among performance rating sources (Facteau & Craig, 2001), it was still desirable to eliminate any possible confounding effects due to rating source in the present study. Therefore, only ratings from direct reports were used in the analyses presented below, yielding an initial sample size of 31,731.

Analyses

Overview. The basic research design involved testing for differential functioning of items and tests, where the groups under examination were (1) data that met the independence assumption and (2) data in which the assumption was violated. The steps involved in creating these conditions are described below.

Exploratory Factor Analyses. In our experience, the intended a priori structures of leadership assessment instruments rarely converge with their empirically identified factor structures. For this reason, and because the large number of items in Benchmarks made detailed item-level analyses unwieldy, we used exploratory factor analysis (EFA) to construct empirically coherent scales from a subset of the Benchmarks items. The intent of this stage of the analyses was twofold: (1) to derive shorter scales covering multiple content domains from the Benchmarks item pool, and (2) to establish that each of those scales met the IRT assumption of unidimensionality. Only the 106 items from the first 16 scales of Benchmarks (i.e., the Managerial Skills and Perspectives section) were submitted to EFA because of their conceptual distinctiveness from the items of scales 17 through 22. That latter group, referred to as the "derailment" scales, targets career progression problems rather than the behavior per se that is the focus of the first 16 scales. In the first step, ratings from 15,704 subordinates of 6,531 target managers on the 106 Benchmarks items were submitted to EFA using maximum likelihood estimation and oblique rotation (see Note 2). We hypothesized, from reports in the literature (e.g., Bass, 1990) and our own experience with analyzing multisource leadership ratings, that at least two large factors would emerge.
It was anticipated that these two factors could be interpreted as variants of two basic dimensions of leader behavior that have been discussed over the years: for example, initiating structure and consideration (Fleishman, 1973), task-orientation and relationship-orientation (Fiedler, 1967), and directive and participative leadership (Bass, 1990).

Bass (1990) noted in his exhaustive review that these distinctions, respectively, are elements of what he called the overarching autocratic and democratic clusters of active leadership behavior. Further, he noted that, as a rule, the two factors will be found in some form in any adequate description of leadership. Because of their broad applicability to leadership assessment and their near-ubiquitous occurrence in leadership assessment instruments, our intent was to construct optimized indicators of those two constructs to use in the remainder of the study.

Interpreting the results of the EFA against the Kaiser (1960) criterion and Cattell's (1966) scree test suggested that the structure of subordinate ratings on the 106 Benchmarks items could be described with anywhere from 2 to 13 factors. However, our objective was not to examine the factor structure of Benchmarks per se, nor to use all of the discovered factors in subsequent analyses. Rather, it was to identify indicators for the two familiar factors discussed above. Inspection of the item content for the first two rotated factors in the present analysis did indeed suggest elements of Bass's (1990) autocratic and democratic clusters. To construct our two scales, we selected 20 items from each factor on the basis of high (primary) factor loadings and low cross-factor loadings. Although previous research has found these two dimensions to be related (e.g., Schriesheim, House, & Kerr, 1976), our goal was to produce scales with the least possible variance overlap while still retaining the best available indicators of their respective constructs. Following Bass's (1990, Ch. 23) review, we labeled these scales task-orientation and relations-orientation because their content was very similar to Bass's definitions for constructs with the same names. Three sample items from each of the final scales are presented in Appendix A. After the items to be included were selected, cases with missing data on the retained items were deleted from the data set. This process was conducted independently for each scale, resulting in largely but not completely overlapping samples for the two scales. After cases with incomplete data were deleted, the remaining sample sizes were 19,624 for the task-orientation scale (α = .93) and 25,236 for the relations-orientation scale (α = .95). As expected, the resulting two 20-item scales were significantly correlated (r = .52, p < .001), at approximately the same magnitude found for measures of conceptually similar constructs (Schriesheim et al., 1976).

Reckase (1979) has suggested that tests can be considered to meet IRT's assumption of unidimensionality if their first unrotated factors account for at least 20% of the items' common variance. Because that recommendation was derived from research on dichotomously scored items, it may be somewhat liberal when applied to polytomous items, which require the estimation of more parameters per item. However, the scales used in the present study exceeded Reckase's recommended lower limit by such a large margin that we believed the unidimensionality assumption to be met in these data. When the two newly created 20-item scales were submitted separately to subsequent EFAs, their first unrotated factors accounted for 42.81% (task-orientation) and 51.67% (relations-orientation) of their respective common variances.
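As a rough illustration of this kind of screening, the sketch below estimates the share of variance carried by the first factor from an inter-item correlation matrix. Note that Reckase's criterion is defined on the first unrotated factor's share of common variance, so treating the first eigenvalue's share of total variance, as here, is a simplified proxy; the data file and the way the 20% bound is applied are our own illustration:

```python
import numpy as np

def first_factor_proportion(item_scores: np.ndarray) -> float:
    """Proportion of total variance captured by the first principal
    axis of the inter-item correlation matrix (rows = respondents,
    columns = items); a rough stand-in for Reckase's (1979) index."""
    corr = np.corrcoef(item_scores, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # sorted descending
    return float(eigenvalues[0] / eigenvalues.sum())

# X = np.loadtxt("task_orientation_items.csv", delimiter=",")  # hypothetical
# print(first_factor_proportion(X) >= 0.20)  # Reckase's 20% lower bound
```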

Table 1 displays descriptive statistics for the newly constructed scales after removing cases with incomplete data.

Table 1
Descriptive Statistics for 20-Item Scales in Sample of Direct Reports with Complete Data
[The table reported the mean, standard deviation, minimum, and maximum scale scores for the task-orientation scale (N = 19,624) and the relations-orientation scale (N = 25,236); the numeric values are not recoverable from the source.]
Note. Scale score is the mean item response for that scale.

Subsamples, Groups, and Conditions. Next, subsamples were randomly drawn from the larger data set of complete cases. Sample sizes of 200, 500, and 1,000 were chosen to reflect the realities of data availability in organizations and to respect a lower boundary below which IRT may be an inappropriate choice of analysis. As will be explained in more detail later, it was also desirable to select subsets of items in order to examine our research question under conditions of varying scale length. Scale lengths of five and 10 items were chosen for examination, in addition to the aforementioned 20-item scales. In each case, the shorter scales consisted of the items with the highest factor loadings from the 20-item scales (i.e., the five highest loading items composed the five-item version of that scale, etc.). Recall that the two groups being compared in the present study were defined by whether the assumption of independent observations had or had not been violated. To maximize the similarity of these groups to the alternatives available to researchers in applied settings, we defined the "one rater" group as one in which a single rater for each ratee had been randomly chosen from the available raters, and the "all raters" group as one in which all available ratings were included, without any attempt to manipulate the number of raters per ratee. In the initial randomly selected subsamples, the number of raters per rating target ranged from one to twelve (median = 3) for the task-orientation scale and from one to nine (median = 4) for the relations-orientation scale (see Note 3). Where this article uses the term "condition," we refer to the conditions under which the two groups were compared, determined by the crossing of scale length (20, 10, and 5 items), sample size (1,000, 500, and 200), and content domain (task-orientation and relations-orientation).

Item Parameter Estimation. The MULTILOG computer program (Thissen, 1995) was used to estimate item parameters under Samejima's (1969) graded response model. The graded response model estimates one discrimination parameter (a) for each item, along with one difficulty parameter (b) for each boundary between adjacent response categories. Thus, in the present study, each item has one a parameter and four b parameters. In order to test whether parameter estimates would differ as a function of the conditions examined here, each condition (content domain × scale length × sample size) was analyzed in an independent run of the MULTILOG software, yielding a (potentially) unique set of parameter estimates for each condition.
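To make this parameterization concrete, here is a minimal sketch of Samejima's graded response model for a five-category item, computing category probabilities and the expected ("true") item score at a given θ; the parameter values in the usage comment are invented for illustration, not taken from Appendix B or C:

```python
import numpy as np

def grm_category_probs(theta: float, a: float, b: np.ndarray) -> np.ndarray:
    """Category response probabilities under the graded response model.
    b holds the four boundary (difficulty) parameters of a
    five-category item, in increasing order."""
    # Each boundary k has a 2PL curve: P(X > k) = 1 / (1 + exp(-a(theta - b_k))).
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    upper = np.concatenate(([1.0], p_star))   # P(X >= category k)
    lower = np.concatenate((p_star, [0.0]))
    return upper - lower                      # P(X = k) for k = 1..5

def expected_item_score(theta: float, a: float, b: np.ndarray) -> float:
    """Expected raw response (1-5), i.e., the item true score at theta."""
    return float(np.dot(np.arange(1, 6), grm_category_probs(theta, a, b)))

# print(expected_item_score(0.0, a=1.3, b=np.array([-2.1, -1.0, -0.1, 1.1])))
```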
Equating Parameter Metrics. Item parameters that are estimated from independent executions of the MULTILOG software (e.g., groups, conditions) are not on identical metrics. As a result, the parameters must be linearly transformed so as to place all the parameters in a given comparison on the same scale. This was accomplished using the EQUATE computer program (Baker, 1995). This software implements the iterative equating procedure described by Stocking and Lord (1983), which derives a multiplicative and an additive constant for use in the linear transformation.
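A sketch of the rescaling step this implies, assuming the Stocking-Lord constants A (multiplicative) and B (additive) have already been derived; the transformation shown is the standard IRT linear rescaling, and the constants in the usage comment are invented:

```python
import numpy as np

def rescale_parameters(a: np.ndarray, b: np.ndarray,
                       A: float, B: float) -> tuple[np.ndarray, np.ndarray]:
    """Place one group's graded response model parameters on another
    group's metric: if theta* = A * theta + B, then b* = A * b + B
    and a* = a / A."""
    return a / A, A * b + B

# a_one, b_one = ...  # estimates from a separate MULTILOG run
# a_star, b_star = rescale_parameters(a_one, b_one, A=1.04, B=-0.12)
```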

To derive the most stable transformation coefficients possible, the estimation process is conducted twice, with any items showing DIF from the first run excluded from the second.

Differential Functioning Analysis. Raju, van der Linden, and Fleer (1995) proposed a framework for examining measurement equivalence, at both the item and scale levels, called "differential functioning of items and tests" (DFIT). The DFIT framework, which is based on IRT, defines differential functioning as a difference in the expected item or scale scores for individuals with the same standing on the latent construct (θ) that is attributable to their membership in different groups. In the language of the DFIT framework, the expected score on an item or test, given θ, is called the "true score," and is expressed on the raw metric of the item or test (e.g., on a one-to-five scale for a five-point Likert-type response format). Analyses based on DFIT yield several types of differential functioning indices. Noncompensatory differential item functioning (NCDIF) is a purely item-level statistic that reflects true score differences for the two groups under examination. As the "noncompensatory" moniker suggests, the NCDIF index considers each item separately, without regard for the functioning of other items in the scale. Mathematically, NCDIF is the square of the difference in true scores for the two groups, averaged across θ. Thus, the square root of NCDIF gives the expected difference in item responses for individuals with the same standing on θ but belonging to different groups. Differential test functioning (DTF) is analogous to NCDIF, but is a test-level index. The square root of DTF is the average expected test score difference between individuals from different groups who have the same standing on θ. Compensatory differential item functioning (CDIF) is an item-level index that represents an item's net contribution to DTF. Computationally, CDIF is the change in DTF associated with the deletion of the focal item from the test.

An important concept in DFIT is the directionality of differential functioning. Two items may exhibit similar levels of differential functioning, but in opposite directions (e.g., one item might "favor" Group 1 and the other might "favor" Group 2). In such a scenario, the two items would cancel each other out and produce no net differential functioning at the test level. Conversely, a number of items that have nonsignificant, but nonzero, levels of NCDIF in the same direction can produce a test with significant DTF, due to the accumulation of item-level differential functioning at the test level. Because the CDIF index represents an item's net contribution to DTF, it takes into account the functioning of other items on the test, in contrast to NCDIF. The relative importance placed on NCDIF, CDIF, and DTF depends on the purpose of the analysis. Because the focus of the present study is on how violation of the independence assumption affects item parameter estimates, all three indicators are of interest here. The significance tests for the three differential functioning indices are biconditional: the differential functioning reflected in the magnitude of any of the indices is considered significant only in the context of (1) a significant chi-square statistic and (2) an associated index that exceeds an a priori specified critical value. The critical value for significant NCDIF depends on the number of response categories for that item.
Exceeding the critical value is included as a condition for significance to mitigate the high sensitivity of chi-square tests to sample size. Raju (personal communication, March 1999) has recommended that the critical NCDIF value for an item with five response categories be set to .096.
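For concreteness, here is a sketch of how NCDIF might be computed from equated parameters, assuming an expected-score function such as the one sketched earlier and a normal quadrature over θ; the grid width and the callable interface are our own choices, not DFIT4P's:

```python
import numpy as np
from scipy.stats import norm

def ncdif(expected_focal, expected_reference, n_points: int = 61) -> float:
    """NCDIF: squared true-score difference between the two groups,
    averaged over a N(0, 1) distribution of theta.

    expected_focal / expected_reference: callables mapping theta to the
    expected item score, with the focal group's parameters already
    equated to the reference metric."""
    theta = np.linspace(-4.0, 4.0, n_points)
    weights = norm.pdf(theta)
    weights /= weights.sum()
    diffs = np.array([expected_focal(t) - expected_reference(t) for t in theta])
    return float(np.dot(weights, diffs ** 2))

# An item is flagged only if its chi-square test is significant AND
# its NCDIF exceeds .096 (the critical value for five response categories).
```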

The critical DTF value for a test composed of such items would be .096 multiplied by the number of items on the test.

Logic for Order of Analyses

The stability of item parameter estimates can vary with certain sample and scale characteristics (Mislevy & Bock, 1990). Specifically, estimates tend to become more stable (i.e., their associated standard error estimates become smaller) as sample size increases and as the number of items on the test increases. Because of these tendencies, it was desirable to test whether violation of the independence assumption produced detectable differences under a variety of conditions. The conditions examined were chosen to reflect the circumstances under which IRT might be applied to performance appraisal data in organizational settings. Because sample sizes and scale lengths both tend to be smaller in applied settings than in research settings, a sample size of 1,000 and a scale length of 20 items were used to anchor the favorable ends of those two continua. At the unfavorable ends, a sample size of 200 and a scale length of five items were used. We believe that most researchers would agree that samples of 200 approach a level below which the use of IRT-based techniques may be inadvisable altogether. The two factors (sample size and scale length) were crossed with each other and with the two content domains described earlier to produce the matrix shown in Table 2.

Table 2
Conditions under which assumption violations will be examined for differential functioning

                        N = 1,000                       N = 200
Content Domain          20 items  10 items  5 items     20 items  10 items  5 items
Task-orientation        Phase 1   Phase 5   Phase 4     Phase 3   Phase 5   Phase 2
Relations-orientation   Phase 1   Phase 5   Phase 4     Phase 3   Phase 5   Phase 2

Because DFIT detects differences in item parameter estimates, and such differences should be more likely to occur under less favorable conditions (as described above), a logical order for the conduct of the analyses presented itself. Specifically, we chose to test first whether violating the independence assumption under favorable conditions (Phase 1) would produce different parameter estimates, in order to answer immediately the question of whether the violation presented an unavoidable problem for researchers in applied settings. That is, if the violation produced deleterious effects under the best conditions that could reasonably be expected in organizations, then we could conclude that the assumption should be fairly strictly respected. Next, we would test the least favorable condition (Phase 2) and compare the results with those from Phase 1 to determine whether there were any effects due to sample size and scale length. The remaining conditions would be tested in an order that depended on the outcomes of Phases 1 and 2. That order would be designed to identify, in the least possible number of analyses, the boundary between the conditions under which the assumption violation produced detectable effects and the conditions under which it did not.

Results

Phase 1: 20 Items / N = 1,000. Following the logic described earlier, the condition most favorable to IRT-based analyses was tested first (i.e., longer test, larger sample). Across the 40 items for the two content domains, no item or scale met both conditions for significant differential functioning, suggesting that violating the independent observations assumption has no effect on the parameter estimates computed by MULTILOG. Not surprisingly, given the large sample size, nearly all of the chi-square tests were significant even though no NCDIF

or DTF statistic exceeded its critical value. Only item 8 (task-orientation) and item 19 (relations-orientation) manifested nonsignificant chi-square statistics. Additionally, the chi-square test of the DTF statistic for the task-orientation scale was nonsignificant. It is noteworthy that no DFIT statistic even approached its critical value; the NCDIF index computed for item 12 on the relations-orientation scale (.025) was the largest for any item or scale, and it had less than one fourth the magnitude required for significance (.096). Because there was no significant item-level (NCDIF) or scale-level (DTF) differential functioning, the CDIF statistic was not meaningful in this analysis.

Phase 2: 5 Items / N = 200. Having determined that violating the independence assumption under conditions of relative parameter stability produced no detectable effects, we next examined the least favorable condition: short scale length and small sample size. Testing for differential functioning under these conditions was tantamount to stacking the deck in favor of finding it. That is, any instability of parameters estimated under these conditions should

manifest as random fluctuations between groups and be detected as differential functioning (albeit erroneously). Despite these hostile conditions, the results of this phase mirrored those from the previous phase: not a single item or scale met both conditions for significant differential functioning. The proportion of significant chi-square statistics was actually lower in this phase, probably due to the smaller sample sizes. Items 3 and 5 on both scales, as well as both DTF statistics, had nonsignificant chi-square tests. Also as in Phase 1, no item's or scale's DFIT statistic even approached its critical value. The largest (NCDIF = .018, for item 3 on the relations-orientation scale) had less than one fifth the necessary magnitude (.096).

Other Conditions

Recall that the logic described earlier specified that the remaining conditions would be examined with the goal of identifying the boundary between the conditions under which differential functioning was detectable and the conditions under which it was not. Because 54 separate tests for differential functioning (42 in Phase 1, 12 in Phase 2) failed to yield even a single case of DIF or DTF, we decided that repeating our analyses under the remaining conditions would be superfluous. We reasoned that, if we had found no differences in parameter estimates under the most unstable of conditions, we would almost certainly not find any under more stable ones. Thus, the remaining conditions were not tested. This decision was consistent with the logic specified in the original design of this study. Table 3 summarizes the results of the DFIT analyses.

Table 3
Differential Functioning Indices (NCDIF and DTF): One Rater per Ratee vs. Multiple Raters per Ratee, by Condition
[The table reported item-level NCDIF and scale-level DTF values for the task-orientation and relations-orientation scales under each condition; the numeric values are not recoverable from the source.]
Note. Item-level values (NCDIF) higher than .096 are required for significant DIF. Scale-level values higher than 1.92 for 20-item scales or .48 for 5-item scales are required for significant DTF.

Discussion

This study explored the question of whether violating the assumption of independent observations in IRT-based analyses has a deleterious effect on item parameters estimated by MML procedures. More specifically, we framed this question as one of differential functioning and applied the DFIT framework of Raju et al. (1995) to its investigation. We found that, under four distinct combinations of scale length, sample size, and content domain, not a single item or scale came near to meeting appropriate criteria for differential functioning. This finding suggests that, to the extent these results are generalizable, inclusion of multiple raters per rating target does not affect the estimation of IRT item parameters. We believe this finding to be of some importance to researchers wishing to apply IRT to multisource ratings in general, and to 360 leadership assessments in particular. Prior to this research, investigators examining multirater data under IRT have tended to err on the side of caution by randomly selecting one rater per target. The result has been the exclusion of sizable amounts of potentially useful data from analyses. The results presented here suggest that researchers may be able to reap the benefits of using all of the data available (e.g., parameter stability, generalizability) without incurring deleterious effects due to nonindependence.

Although compelling, these findings have a few limitations that warrant mentioning. First, the two content areas from which we constructed scales (task-orientation and relations-orientation) were both domains of leadership behavior. These results should be replicated with multisource ratings in other domains, such as education, in order to evaluate their generalizability.
Second, the two scales evaluated here were carefully constructed to exhibit excellent psychometric properties (e.g., unidimensionality, internal consistency). It is conceivable that other scales with less sound properties might be more vulnerable to these effects than those studied

here. Additionally, all of our comparisons involved equal sample sizes (i.e., the one rater group included more ratees in order to maintain the same sample size as the all raters group). This feature allowed us to unconfound effects due to rating independence from effects due to sample size. But in a field setting, those effects almost certainly would be confounded. Thus, this study did not test for effects of the sample size reduction associated with limiting analyses to one rater per ratee.

The analyses reported here do not provide an explanation for why the parameter estimates were unaffected by violation of the independence assumption, but it is possible to speculate. One possible interpretation is that the procedures employed by the MULTILOG software (i.e., marginal maximum likelihood estimation by means of the expectation-maximization algorithm) are at least somewhat robust to violations of the assumption. If this is the case, more extreme violations of rater independence may produce effects that did not occur here. The use of real data in this study was both a strength and a limitation. Using real data allowed us to create conditions that closely resemble those likely to be encountered in applied research settings. However, we were not able to control the sample characteristics (e.g., degree of rating dependence) to the extent possible in Monte Carlo simulations. Future research may be able to identify the boundaries beyond which effects occur by using data simulation techniques.

Another, more intriguing, possibility is that the assumption was not actually violated. A core concept in IRT is the latent construct (θ) being measured by the items under consideration. The notion that multisource ratings violate the independent observations assumption is predicated on the belief that θ is objective leader behavior. That is, it assumes that the raters are in fact rating the objective performance behavior of a common target. However, an alternative view is that θ is really the rater's perception of leader behavior and, further, that each rater has a unique perception of the leader. Under this interpretation, the independence assumption is not really violated because each rater is rating a different target: his or her idiosyncratic perception of the leader. This interpretation is consistent with recent research on 360 leadership assessments. For example, Scullen, Mount, and Goff (2000) found that between 53 and 62% of the variance in ratings on two widely used 360 instruments was due to idiosyncratic individual rater effects. Variance in ratings attributable to the performance of the target leaders was about half that size: between 21 and 25%. Thus it appears that leadership performance ratings are considerably more influenced by raters' own unique perceptions of leaders than by leaders' objective performance. It may well be, then, that multisource ratings of managers do not violate the independence of observations assumption after all. Future research should address this question directly by comparing ratings within ratee to ratings across ratees for evidence of dependence.

In summary, the results reported here suggest that it may not be necessary to randomly select one rater per ratee in IRT-based analyses of multisource rating data. In situations like those examined here, incurring the associated (severe) sample size reduction may yield no benefits with regard to parameter estimation.
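As a sketch of the dependence check suggested above, one could compare within-ratee to across-ratee agreement via ICC(1) from a one-way random-effects ANOVA of ratings grouped by ratee; the data layout and column names here are hypothetical, and using the average group size is a common approximation for unbalanced designs:

```python
import numpy as np
import pandas as pd

def icc1(scores: pd.Series, ratee: pd.Series) -> float:
    """ICC(1): the proportion of rating variance attributable to ratees.
    Values near zero suggest ratings of the same ratee behave almost
    like independent observations."""
    groups = [g.values for _, g in scores.groupby(ratee)]
    n_total, n_groups = len(scores), len(groups)
    k = np.mean([len(g) for g in groups])   # average raters per ratee
    grand_mean = scores.mean()
    ms_between = sum(len(g) * (g.mean() - grand_mean) ** 2
                     for g in groups) / (n_groups - 1)
    ms_within = sum(((g - g.mean()) ** 2).sum()
                    for g in groups) / (n_total - n_groups)
    return float((ms_between - ms_within) / (ms_between + (k - 1) * ms_within))

# df = pd.read_csv("ratings.csv")  # hypothetical columns: ratee_id, scale_score
# print(icc1(df["scale_score"], df["ratee_id"]))
```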
References

Baker, F. (1995). EQUATE 2.1: Computer program for equating two metrics in item response theory. Madison, WI: University of Wisconsin, Laboratory of Experimental Design.

Bass, B. M. (1990). Bass and Stogdill's handbook of leadership: Theory, research, and managerial applications (3rd ed.). New York: Free Press.

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: An application of an EM algorithm. Psychometrika, 46.

Cattell, R. B. (1966). The meaning and strategic use of factor analysis. In R. B. Cattell (Ed.), Handbook of multivariate experimental psychology. Chicago: Rand McNally.

Facteau, J., & Craig, S. B. (2001). Are performance appraisal ratings obtained from different rating sources comparable? Journal of Applied Psychology, 86 (in press).

Fiedler, F. E. (1967). A theory of leadership effectiveness. New York: McGraw-Hill.

Fleishman, E. A. (1973). Twenty years of consideration and structure. In E. A. Fleishman & J. G. Hunt (Eds.), Current developments in the study of leadership. Carbondale: Southern Illinois University Press.

Greguras, G. J., & Robie, C. (1998). A new look at within-source interrater reliability of 360-degree feedback ratings. Journal of Applied Psychology, 83.

Kaiser, H. F. (1960). The application of electronic computers to factor analysis. Educational and Psychological Measurement, 20.

Lombardo, M. M., & McCauley, C. D. (1994). Benchmarks: A manual and trainer's guide. Greensboro, NC: Center for Creative Leadership.

London, M., & Smither, J. W. (1995). Can multi-source feedback change perceptions of goal accomplishment, self-evaluations, and performance-related outcomes? Theory-based applications and directions for research. Personnel Psychology, 48.

Maurer, T. J., Raju, N. S., & Collins, W. C. (1998). Peer and subordinate performance appraisal measurement equivalence. Journal of Applied Psychology, 83.

McCauley, C. D., & Lombardo, M. (1990). Benchmarks: An instrument for diagnosing managerial strengths and weaknesses. In K. E. Clark & M. B. Clark (Eds.), Measures of leadership. West Orange, NJ: Leadership Library of America.

McCauley, C. D., Lombardo, M. M., & Usher, C. J. (1989). Diagnosing management development needs: An instrument based on how managers develop. Journal of Management, 15.

Mislevy, R. J., & Bock, R. D. (1990). BILOG 3 user manual. Mooresville, IN: Scientific Software, Inc.

Raju, N. S. (1999). DFIT4P: A computer program for analyzing differential item and test functioning. Chicago: Illinois Institute of Technology.

Raju, N. S. (Chair). (1999, April). IRT-based evaluation of 360 feedback assessments: The BENCHMARKS story. Symposium presented at the annual conference of the Society for Industrial-Organizational Psychology, Atlanta, GA.

Raju, N. S., van der Linden, W., & Fleer, P. (1995). An IRT-based internal measure of test bias with applications for differential item functioning. Applied Psychological Measurement, 19.

Reckase, M. D. (1979). Unifactor latent trait models applied to multi-factor tests: Results and implications. Journal of Educational Statistics, 4.

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores (Psychometrika Monograph No. 17).

Schriesheim, C. A., House, R. J., & Kerr, S. (1976). Leader initiating structure: A reconciliation of discrepant research results and some empirical tests. Organizational Behavior and Human Performance, 15.

Scullen, S. E., Mount, M. K., & Goff, M. (2000). Understanding the latent structure of job performance ratings. Journal of Applied Psychology, 85.

Stocking, M. L., & Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7.

Thissen, D. (1995). MULTILOG 6.3: A computer program for multiple, categorical item analysis and test scoring using item response theory. Chicago: Scientific Software, Inc.

Tornow, W. W., & London, M. (1998). Maximizing the value of 360-degree feedback. San Francisco: Jossey-Bass.

Zedeck, S. (1995). [Review of Benchmarks]. In J. Conoley & J. Impara (Eds.), The twelfth mental measurements yearbook (Vol. 1). Lincoln, NE: Buros Institute of Mental Measurements.

Notes

1. The version of the Benchmarks instrument used in this study was replaced in its publisher's product line with an updated version in April. More information is available on the Internet.

2. Only half of the cases were submitted to EFA in order to allow the other half to be used in a confirmatory factor analysis as part of a different study of these same data.

3. Although the median number of raters per ratee was identical for the 1,000 and 200 sample sizes, the range differed somewhat. In the smaller (N = 200) condition, raters per target ranged from one to five for the task-orientation scale and from one to seven for the relations-orientation scale.

Appendix A
Sample Items from Task-orientation and Relations-orientation Scales

Task-orientation
Is action-oriented.
Faces difficult situations with guts and tenacity.
Takes charge when trouble comes up.

Relations-orientation
Has a warm personality that puts people at ease.
Tries to understand what other people think before making judgments about them.
Shows interest in the needs, hopes, and dreams of other people.

(Item text copyright Center for Creative Leadership, 1994.)

Appendix B
Item Parameter Estimates and Standard Error Estimates: Task-orientation
[The table listed the b1 through b4 and a parameter estimates, each with its standard error, for all 20 task-orientation items under the one rater ("One") and all raters ("All") conditions; the tabled values are too garbled in the source to reconstruct.]
Note. Item parameter estimates (but not standard error estimates) for the one rater per ratee condition ("One") have been transformed to the metric of the all available raters condition ("All") using EQUATE 2.1 (Baker, 1995).

Appendix C
Item Parameter Estimates and Standard Error Estimates: Relations-orientation
[The table listed the b1 through b4 and a parameter estimates, each with its standard error, for all 20 relations-orientation items under the one rater ("One") and all raters ("All") conditions; the tabled values are too garbled in the source to reconstruct.]
Note. Item parameter estimates (but not standard error estimates) for the one rater per ratee condition ("One") have been transformed to the metric of the all available raters condition ("All") using EQUATE 2.1 (Baker, 1995).


More information

Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure. Rob Cavanagh Len Sparrow Curtin University

Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure. Rob Cavanagh Len Sparrow Curtin University Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure Rob Cavanagh Len Sparrow Curtin University R.Cavanagh@curtin.edu.au Abstract The study sought to measure mathematics anxiety

More information

Bruno D. Zumbo, Ph.D. University of Northern British Columbia

Bruno D. Zumbo, Ph.D. University of Northern British Columbia Bruno Zumbo 1 The Effect of DIF and Impact on Classical Test Statistics: Undetected DIF and Impact, and the Reliability and Interpretability of Scores from a Language Proficiency Test Bruno D. Zumbo, Ph.D.

More information

Contents. What is item analysis in general? Psy 427 Cal State Northridge Andrew Ainsworth, PhD

Contents. What is item analysis in general? Psy 427 Cal State Northridge Andrew Ainsworth, PhD Psy 427 Cal State Northridge Andrew Ainsworth, PhD Contents Item Analysis in General Classical Test Theory Item Response Theory Basics Item Response Functions Item Information Functions Invariance IRT

More information

Doing Quantitative Research 26E02900, 6 ECTS Lecture 6: Structural Equations Modeling. Olli-Pekka Kauppila Daria Kautto

Doing Quantitative Research 26E02900, 6 ECTS Lecture 6: Structural Equations Modeling. Olli-Pekka Kauppila Daria Kautto Doing Quantitative Research 26E02900, 6 ECTS Lecture 6: Structural Equations Modeling Olli-Pekka Kauppila Daria Kautto Session VI, September 20 2017 Learning objectives 1. Get familiar with the basic idea

More information

Essential Skills for Evidence-based Practice Understanding and Using Systematic Reviews

Essential Skills for Evidence-based Practice Understanding and Using Systematic Reviews J Nurs Sci Vol.28 No.4 Oct - Dec 2010 Essential Skills for Evidence-based Practice Understanding and Using Systematic Reviews Jeanne Grace Corresponding author: J Grace E-mail: Jeanne_Grace@urmc.rochester.edu

More information

11/24/2017. Do not imply a cause-and-effect relationship

11/24/2017. Do not imply a cause-and-effect relationship Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are highly extraverted people less afraid of rejection

More information

GMAC. Scaling Item Difficulty Estimates from Nonequivalent Groups

GMAC. Scaling Item Difficulty Estimates from Nonequivalent Groups GMAC Scaling Item Difficulty Estimates from Nonequivalent Groups Fanmin Guo, Lawrence Rudner, and Eileen Talento-Miller GMAC Research Reports RR-09-03 April 3, 2009 Abstract By placing item statistics

More information

Context of Best Subset Regression

Context of Best Subset Regression Estimation of the Squared Cross-Validity Coefficient in the Context of Best Subset Regression Eugene Kennedy South Carolina Department of Education A monte carlo study was conducted to examine the performance

More information

Extraversion. The Extraversion factor reliability is 0.90 and the trait scale reliabilities range from 0.70 to 0.81.

Extraversion. The Extraversion factor reliability is 0.90 and the trait scale reliabilities range from 0.70 to 0.81. MSP RESEARCH NOTE B5PQ Reliability and Validity This research note describes the reliability and validity of the B5PQ. Evidence for the reliability and validity of is presented against some of the key

More information

ITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION SCALE

ITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION SCALE California State University, San Bernardino CSUSB ScholarWorks Electronic Theses, Projects, and Dissertations Office of Graduate Studies 6-2016 ITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION

More information

LEDYARD R TUCKER AND CHARLES LEWIS

LEDYARD R TUCKER AND CHARLES LEWIS PSYCHOMETRIKA--VOL. ~ NO. 1 MARCH, 1973 A RELIABILITY COEFFICIENT FOR MAXIMUM LIKELIHOOD FACTOR ANALYSIS* LEDYARD R TUCKER AND CHARLES LEWIS UNIVERSITY OF ILLINOIS Maximum likelihood factor analysis provides

More information

Empirical Formula for Creating Error Bars for the Method of Paired Comparison

Empirical Formula for Creating Error Bars for the Method of Paired Comparison Empirical Formula for Creating Error Bars for the Method of Paired Comparison Ethan D. Montag Rochester Institute of Technology Munsell Color Science Laboratory Chester F. Carlson Center for Imaging Science

More information

Adaptive EAP Estimation of Ability

Adaptive EAP Estimation of Ability Adaptive EAP Estimation of Ability in a Microcomputer Environment R. Darrell Bock University of Chicago Robert J. Mislevy National Opinion Research Center Expected a posteriori (EAP) estimation of ability,

More information

Impact of Differential Item Functioning on Subsequent Statistical Conclusions Based on Observed Test Score Data. Zhen Li & Bruno D.

Impact of Differential Item Functioning on Subsequent Statistical Conclusions Based on Observed Test Score Data. Zhen Li & Bruno D. Psicológica (2009), 30, 343-370. SECCIÓN METODOLÓGICA Impact of Differential Item Functioning on Subsequent Statistical Conclusions Based on Observed Test Score Data Zhen Li & Bruno D. Zumbo 1 University

More information

A Comparison of Item Response Theory and Confirmatory Factor Analytic Methodologies for Establishing Measurement Equivalence/Invariance

A Comparison of Item Response Theory and Confirmatory Factor Analytic Methodologies for Establishing Measurement Equivalence/Invariance 10.1177/1094428104268027 ORGANIZATIONAL Meade, Lautenschlager RESEARCH / COMP ARISON METHODS OF IRT AND CFA A Comparison of Item Response Theory and Confirmatory Factor Analytic Methodologies for Establishing

More information

Gezinskenmerken: De constructie van de Vragenlijst Gezinskenmerken (VGK) Klijn, W.J.L.

Gezinskenmerken: De constructie van de Vragenlijst Gezinskenmerken (VGK) Klijn, W.J.L. UvA-DARE (Digital Academic Repository) Gezinskenmerken: De constructie van de Vragenlijst Gezinskenmerken (VGK) Klijn, W.J.L. Link to publication Citation for published version (APA): Klijn, W. J. L. (2013).

More information

INVESTIGATING FIT WITH THE RASCH MODEL. Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form

INVESTIGATING FIT WITH THE RASCH MODEL. Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form INVESTIGATING FIT WITH THE RASCH MODEL Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form of multidimensionality. The settings in which measurement

More information

TEST REVIEWS. Myers-Briggs Type Indicator, Form M

TEST REVIEWS. Myers-Briggs Type Indicator, Form M TEST REVIEWS Myers-Briggs Type Indicator, Form M Myers-Briggs Type Indicator, Form M Purpose Designed for "the identification of basic preferences on each of the four dichotomies specified or implicit

More information

Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data

Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data TECHNICAL REPORT Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data CONTENTS Executive Summary...1 Introduction...2 Overview of Data Analysis Concepts...2

More information

Sawtooth Software. The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? RESEARCH PAPER SERIES

Sawtooth Software. The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? RESEARCH PAPER SERIES Sawtooth Software RESEARCH PAPER SERIES The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? Dick Wittink, Yale University Joel Huber, Duke University Peter Zandan,

More information

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report Center for Advanced Studies in Measurement and Assessment CASMA Research Report Number 39 Evaluation of Comparability of Scores and Passing Decisions for Different Item Pools of Computerized Adaptive Examinations

More information

Investigating the robustness of the nonparametric Levene test with more than two groups

Investigating the robustness of the nonparametric Levene test with more than two groups Psicológica (2014), 35, 361-383. Investigating the robustness of the nonparametric Levene test with more than two groups David W. Nordstokke * and S. Mitchell Colp University of Calgary, Canada Testing

More information

2 Types of psychological tests and their validity, precision and standards

2 Types of psychological tests and their validity, precision and standards 2 Types of psychological tests and their validity, precision and standards Tests are usually classified in objective or projective, according to Pasquali (2008). In case of projective tests, a person is

More information

Cochrane Pregnancy and Childbirth Group Methodological Guidelines

Cochrane Pregnancy and Childbirth Group Methodological Guidelines Cochrane Pregnancy and Childbirth Group Methodological Guidelines [Prepared by Simon Gates: July 2009, updated July 2012] These guidelines are intended to aid quality and consistency across the reviews

More information

Multidimensionality and Item Bias

Multidimensionality and Item Bias Multidimensionality and Item Bias in Item Response Theory T. C. Oshima, Georgia State University M. David Miller, University of Florida This paper demonstrates empirically how item bias indexes based on

More information

Ecological Statistics

Ecological Statistics A Primer of Ecological Statistics Second Edition Nicholas J. Gotelli University of Vermont Aaron M. Ellison Harvard Forest Sinauer Associates, Inc. Publishers Sunderland, Massachusetts U.S.A. Brief Contents

More information

BAD ITEMS, BAD DATA Item Characteristics and Rating Discrepancies in Multi-source Assessments

BAD ITEMS, BAD DATA Item Characteristics and Rating Discrepancies in Multi-source Assessments BAD ITEMS, BAD DATA Item Characteristics and Rating Discrepancies in Multi-source Assessments Robert B. Kaiser, Kaplan DeVries Inc. S. Bartholomew Craig, North Carolina State University Abstract The authors

More information

Work Personality Index Factorial Similarity Across 4 Countries

Work Personality Index Factorial Similarity Across 4 Countries Work Personality Index Factorial Similarity Across 4 Countries Donald Macnab Psychometrics Canada Copyright Psychometrics Canada 2011. All rights reserved. The Work Personality Index is a trademark of

More information

Personality Traits Effects on Job Satisfaction: The Role of Goal Commitment

Personality Traits Effects on Job Satisfaction: The Role of Goal Commitment Marshall University Marshall Digital Scholar Management Faculty Research Management, Marketing and MIS Fall 11-14-2009 Personality Traits Effects on Job Satisfaction: The Role of Goal Commitment Wai Kwan

More information

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison Empowered by Psychometrics The Fundamentals of Psychometrics Jim Wollack University of Wisconsin Madison Psycho-what? Psychometrics is the field of study concerned with the measurement of mental and psychological

More information

Addendum: Multiple Regression Analysis (DRAFT 8/2/07)

Addendum: Multiple Regression Analysis (DRAFT 8/2/07) Addendum: Multiple Regression Analysis (DRAFT 8/2/07) When conducting a rapid ethnographic assessment, program staff may: Want to assess the relative degree to which a number of possible predictive variables

More information

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES Correlational Research Correlational Designs Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are

More information

Personal Style Inventory Item Revision: Confirmatory Factor Analysis

Personal Style Inventory Item Revision: Confirmatory Factor Analysis Personal Style Inventory Item Revision: Confirmatory Factor Analysis This research was a team effort of Enzo Valenzi and myself. I m deeply grateful to Enzo for his years of statistical contributions to

More information

Psychological Experience of Attitudinal Ambivalence as a Function of Manipulated Source of Conflict and Individual Difference in Self-Construal

Psychological Experience of Attitudinal Ambivalence as a Function of Manipulated Source of Conflict and Individual Difference in Self-Construal Seoul Journal of Business Volume 11, Number 1 (June 2005) Psychological Experience of Attitudinal Ambivalence as a Function of Manipulated Source of Conflict and Individual Difference in Self-Construal

More information

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida Vol. 2 (1), pp. 22-39, Jan, 2015 http://www.ijate.net e-issn: 2148-7456 IJATE A Comparison of Logistic Regression Models for Dif Detection in Polytomous Items: The Effect of Small Sample Sizes and Non-Normality

More information

André Cyr and Alexander Davies

André Cyr and Alexander Davies Item Response Theory and Latent variable modeling for surveys with complex sampling design The case of the National Longitudinal Survey of Children and Youth in Canada Background André Cyr and Alexander

More information

CHAPTER VI RESEARCH METHODOLOGY

CHAPTER VI RESEARCH METHODOLOGY CHAPTER VI RESEARCH METHODOLOGY 6.1 Research Design Research is an organized, systematic, data based, critical, objective, scientific inquiry or investigation into a specific problem, undertaken with the

More information

Latent Trait Standardization of the Benzodiazepine Dependence. Self-Report Questionnaire using the Rasch Scaling Model

Latent Trait Standardization of the Benzodiazepine Dependence. Self-Report Questionnaire using the Rasch Scaling Model Chapter 7 Latent Trait Standardization of the Benzodiazepine Dependence Self-Report Questionnaire using the Rasch Scaling Model C.C. Kan 1, A.H.G.S. van der Ven 2, M.H.M. Breteler 3 and F.G. Zitman 1 1

More information

Chapter 11. Experimental Design: One-Way Independent Samples Design

Chapter 11. Experimental Design: One-Way Independent Samples Design 11-1 Chapter 11. Experimental Design: One-Way Independent Samples Design Advantages and Limitations Comparing Two Groups Comparing t Test to ANOVA Independent Samples t Test Independent Samples ANOVA Comparing

More information

Adjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data

Adjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data Adjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data Karl Bang Christensen National Institute of Occupational Health, Denmark Helene Feveille National

More information

CHAPTER 3 METHOD AND PROCEDURE

CHAPTER 3 METHOD AND PROCEDURE CHAPTER 3 METHOD AND PROCEDURE Previous chapter namely Review of the Literature was concerned with the review of the research studies conducted in the field of teacher education, with special reference

More information

Abstract. In this paper, I will analyze three articles that review the impact on conflict on

Abstract. In this paper, I will analyze three articles that review the impact on conflict on The Positives & Negatives of Conflict 1 Author: Kristen Onkka Abstract In this paper, I will analyze three articles that review the impact on conflict on employees in the workplace. The first article reflects

More information

Chapter 3. Psychometric Properties

Chapter 3. Psychometric Properties Chapter 3 Psychometric Properties Reliability The reliability of an assessment tool like the DECA-C is defined as, the consistency of scores obtained by the same person when reexamined with the same test

More information

Organizational readiness for implementing change: a psychometric assessment of a new measure

Organizational readiness for implementing change: a psychometric assessment of a new measure Shea et al. Implementation Science 2014, 9:7 Implementation Science RESEARCH Organizational readiness for implementing change: a psychometric assessment of a new measure Christopher M Shea 1,2*, Sara R

More information

(CORRELATIONAL DESIGN AND COMPARATIVE DESIGN)

(CORRELATIONAL DESIGN AND COMPARATIVE DESIGN) UNIT 4 OTHER DESIGNS (CORRELATIONAL DESIGN AND COMPARATIVE DESIGN) Quasi Experimental Design Structure 4.0 Introduction 4.1 Objectives 4.2 Definition of Correlational Research Design 4.3 Types of Correlational

More information

LEADERSHIP ATTRIBUTES AND CULTURAL VALUES IN AUSTRALIA AND NEW ZEALAND COMPARED: AN INITIAL REPORT BASED ON GLOBE DATA

LEADERSHIP ATTRIBUTES AND CULTURAL VALUES IN AUSTRALIA AND NEW ZEALAND COMPARED: AN INITIAL REPORT BASED ON GLOBE DATA LEADERSHIP ATTRIBUTES AND CULTURAL VALUES IN AUSTRALIA AND NEW ZEALAND COMPARED: AN INITIAL REPORT BASED ON GLOBE DATA ABSTRACT Neal M. Ashkanasy Edwin Trevor-Roberts Jeffrey Kennedy This paper reports

More information

Discriminant Analysis with Categorical Data

Discriminant Analysis with Categorical Data - AW)a Discriminant Analysis with Categorical Data John E. Overall and J. Arthur Woodward The University of Texas Medical Branch, Galveston A method for studying relationships among groups in terms of

More information

STATISTICS AND RESEARCH DESIGN

STATISTICS AND RESEARCH DESIGN Statistics 1 STATISTICS AND RESEARCH DESIGN These are subjects that are frequently confused. Both subjects often evoke student anxiety and avoidance. To further complicate matters, both areas appear have

More information

Multiple Act criterion:

Multiple Act criterion: Common Features of Trait Theories Generality and Stability of Traits: Trait theorists all use consistencies in an individual s behavior and explain why persons respond in different ways to the same stimulus

More information

Analysis of the Reliability and Validity of an Edgenuity Algebra I Quiz

Analysis of the Reliability and Validity of an Edgenuity Algebra I Quiz Analysis of the Reliability and Validity of an Edgenuity Algebra I Quiz This study presents the steps Edgenuity uses to evaluate the reliability and validity of its quizzes, topic tests, and cumulative

More information

An Empirical Study on Causal Relationships between Perceived Enjoyment and Perceived Ease of Use

An Empirical Study on Causal Relationships between Perceived Enjoyment and Perceived Ease of Use An Empirical Study on Causal Relationships between Perceived Enjoyment and Perceived Ease of Use Heshan Sun Syracuse University hesun@syr.edu Ping Zhang Syracuse University pzhang@syr.edu ABSTRACT Causality

More information

Business Statistics Probability

Business Statistics Probability Business Statistics The following was provided by Dr. Suzanne Delaney, and is a comprehensive review of Business Statistics. The workshop instructor will provide relevant examples during the Skills Assessment

More information

Unit 1 Exploring and Understanding Data

Unit 1 Exploring and Understanding Data Unit 1 Exploring and Understanding Data Area Principle Bar Chart Boxplot Conditional Distribution Dotplot Empirical Rule Five Number Summary Frequency Distribution Frequency Polygon Histogram Interquartile

More information

Multilevel IRT for group-level diagnosis. Chanho Park Daniel M. Bolt. University of Wisconsin-Madison

Multilevel IRT for group-level diagnosis. Chanho Park Daniel M. Bolt. University of Wisconsin-Madison Group-Level Diagnosis 1 N.B. Please do not cite or distribute. Multilevel IRT for group-level diagnosis Chanho Park Daniel M. Bolt University of Wisconsin-Madison Paper presented at the annual meeting

More information

Instrument Validation Study

Instrument Validation Study Instrument Validation Study REGARDING LEADERSHIP CIRCLE PROFILE By Industrial Psychology Department Bowling Green State University INSTRUMENT VALIDATION STUDY EXECUTIVE SUMMARY AND RESPONSE TO THE RECOMMENDATIONS

More information

REPORT. Technical Report: Item Characteristics. Jessica Masters

REPORT. Technical Report: Item Characteristics. Jessica Masters August 2010 REPORT Diagnostic Geometry Assessment Project Technical Report: Item Characteristics Jessica Masters Technology and Assessment Study Collaborative Lynch School of Education Boston College Chestnut

More information

CONTENT ANALYSIS OF COGNITIVE BIAS: DEVELOPMENT OF A STANDARDIZED MEASURE Heather M. Hartman-Hall David A. F. Haaga

CONTENT ANALYSIS OF COGNITIVE BIAS: DEVELOPMENT OF A STANDARDIZED MEASURE Heather M. Hartman-Hall David A. F. Haaga Journal of Rational-Emotive & Cognitive-Behavior Therapy Volume 17, Number 2, Summer 1999 CONTENT ANALYSIS OF COGNITIVE BIAS: DEVELOPMENT OF A STANDARDIZED MEASURE Heather M. Hartman-Hall David A. F. Haaga

More information

Rasch Versus Birnbaum: New Arguments in an Old Debate

Rasch Versus Birnbaum: New Arguments in an Old Debate White Paper Rasch Versus Birnbaum: by John Richard Bergan, Ph.D. ATI TM 6700 E. Speedway Boulevard Tucson, Arizona 85710 Phone: 520.323.9033 Fax: 520.323.9139 Copyright 2013. All rights reserved. Galileo

More information

PERCEIVED TRUSTWORTHINESS OF KNOWLEDGE SOURCES: THE MODERATING IMPACT OF RELATIONSHIP LENGTH

PERCEIVED TRUSTWORTHINESS OF KNOWLEDGE SOURCES: THE MODERATING IMPACT OF RELATIONSHIP LENGTH PERCEIVED TRUSTWORTHINESS OF KNOWLEDGE SOURCES: THE MODERATING IMPACT OF RELATIONSHIP LENGTH DANIEL Z. LEVIN Management and Global Business Dept. Rutgers Business School Newark and New Brunswick Rutgers

More information