Violating the Independent Observations Assumption in IRT-Based Analyses of 360 Instruments: Can We Get Away with It?


In R. B. Kaiser & S. B. Craig (Co-chairs), Modern Analytic Techniques in the Study of 360 Performance Ratings. Symposium presented at the 16th annual conference of the Society for Industrial-Organizational Psychology, San Diego, April 28, 2001.

Violating the Independent Observations Assumption in IRT-Based Analyses of 360 Instruments: Can We Get Away with It?

S. Bartholomew Craig and Robert B. Kaiser
Kaplan DeVries Inc.

Multisource, or 360, ratings of performance provide an opportunity to violate a key assumption of many statistical procedures: the assumption of independent observations. This study examined the observable consequences of violating that assumption in an application of item response theory (IRT) to real-world ratings of two different domains of leadership performance. Using Raju, van der Linden, and Fleer's (1995) framework for identifying differential functioning in items and tests (DFIT), we found no significant differences between item parameters estimated from samples in which the independence assumption was violated and parameters estimated from samples in which it was not. This finding, which was replicated with tests as short as five items and samples as small as 200, suggests that IRT can be safely applied to multisource ratings without trimming sample sizes to avoid violating the independence assumption.

Author note. The authors would like to thank the Center for Creative Leadership for providing the data used in this research, and Marlene Barlow and Sharon Denny for their assistance with the preparation of this manuscript. Correspondence concerning this article may be sent to Bart Craig or Rob Kaiser, Kaplan DeVries Inc., 1903-G Ashwood Ct., Greensboro, NC. Electronic mail may be sent via Internet to bcraig@kaplandevries.com or rkaiser@kaplandevries.com.

Multisource, or 360, performance assessments are becoming increasingly common in organizations (Tornow & London, 1998). With the popularity of 360 assessments has come an increased interest in the psychometric properties of the instruments involved, and applications of sophisticated techniques such as item response theory (IRT) have become correspondingly more frequent. For example, a number of studies have used IRT methods to investigate the measurement equivalence of 360 instruments across rating sources, language translations, and racial groups (e.g., Facteau & Craig, 2001; Maurer, Raju, & Collins, 1998; Raju, 1999). Although the analysis of multisource performance data presents problems for which such techniques are well suited, the application of these methods is not without caveats. One potential pitfall concerns the nonindependent nature of such ratings. Because multiple raters rate the same target, some data points are not statistically independent of one another, presenting an opportunity to violate a basic assumption of many analytic techniques, including the marginal maximum likelihood estimation procedures used to estimate item parameters under IRT (Bock & Aitkin, 1981). A straightforward solution to the nonindependence problem exists, however: randomly selecting one rater per ratee and thus limiting IRT-based analyses to a subset of the sample within which the independent ratings assumption holds. This random rater procedure effectively solves the problem and has been used to good effect in past research (Raju, 1999).
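To make the procedure concrete, here is a minimal sketch of random rater selection, assuming long-format data in a pandas DataFrame; the file and column names (e.g., `ratee_id`) are hypothetical, not taken from the Benchmarks database:

```python
import pandas as pd

def random_rater_subsample(ratings: pd.DataFrame,
                           ratee_col: str = "ratee_id",
                           seed: int = 42) -> pd.DataFrame:
    """Keep one randomly chosen rater (row) per ratee, so that the
    retained ratings satisfy the independent observations assumption."""
    return (ratings
            .groupby(ratee_col, group_keys=False)
            .sample(n=1, random_state=seed))

# With an average of five raters per ratee, this discards about 80%
# of the available rows, which is the cost discussed below.
# df = pd.read_csv("ratings.csv")            # hypothetical file
# independent_df = random_rater_subsample(df)
```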

Unfortunately, the obvious sacrifice in the use of this solution is sample size. At least two factors make this sacrifice particularly undesirable in this application. First, the magnitude of the sample size reduction is severe. If the average number of raters per target is only five, which is a commonly occurring case (e.g., London & Smither, 1995; Greguras & Robie, 1998), then the random rater procedure excludes 80% of the available data from analysis. This problem is further compounded by the fact that organizational samples are often small to begin with, and by the need to analyze rating sources separately in many research designs, which reduces effective sample size even further. Second, these severe reductions in usable sample size occur in the context of analytic procedures (i.e., those based on IRT) that are among the most data-hungry techniques applied in organizational research. The foundation of most IRT-based analyses is item parameter estimation by means of marginal maximum likelihood (MML) procedures, which can require samples of 1,000 or more just to satisfy convergence criteria, let alone produce parameter estimates with acceptable stability. Further, investigations of differential item functioning (DIF), which constitute a large proportion of the extant IRT-based research on 360 instruments, call for extremely precise item parameter estimates, lest equivalently functioning items be misidentified as showing DIF due to parameter estimate instability. In sum, although the random rater procedure successfully avoids the problem of nonindependent ratings, its associated price tag, in terms of reduced sample size, warrants a search for alternative solutions.

As many researchers have come to realize, statistical procedures are often fairly robust in practice to violations of their assumptions. For example, correlation-based techniques assume at least an interval level of measurement, yet can still be useful when applied to ordinal data. A similar state of affairs exists with regard to distributional normality in a variety of procedures (e.g., regression, factor analysis). If it could be established that IRT-based procedures are robust to violations of response independence, then future research might benefit from researchers having the latitude to decide whether the advantages of the random rater procedure justify the cost. The present study examined the real-world impact of violating the independent ratings assumption in IRT-based analyses under a variety of conditions. This was accomplished by framing the research question as one of differential functioning: do item parameter estimates derived from data that violate the independence assumption differ from those estimated from data that meet the assumption?

Method

Sample and Instrument

An archival database of responses to the Benchmarks 360 leadership assessment instrument (McCauley, Lombardo, & Usher, 1989; Lombardo & McCauley, 1994) was used for this study. Benchmarks, a commercially available instrument published by the Center for Creative Leadership (CCL), contains 164 items in 22 scales (see Note 1), entitled:

1. Resourcefulness
2. Doing Whatever It Takes
3. Being a Quick Study
4. Decisiveness
5. Leading Employees
6. Setting a Developmental Climate
7. Confronting Problem Employees
8. Work Team Orientation
9. Hiring Talented Staff
10. Building and Mending Relationships
11. Compassion and Sensitivity
12. Straightforwardness and Composure

13. Balance between Personal Life and Work
14. Self-Awareness
15. Putting People at Ease
16. Acting with Flexibility
17. Problems with Interpersonal Relationships
18. Difficulty in Molding a Staff
19. Difficulty in Making Transitions
20. Lack of Follow-Through
21. Overdependence
22. Strategic Differences with Management

Benchmarks performance ratings are collected from the superiors, peers, and directly reporting subordinates of focal managers (ratees), as well as from focal managers themselves. Each item is rated on a five-point Likert-type scale ranging from "not at all" to "to a very great extent" to reflect the degree to which the item is typical of the ratee's behavior. The ratings analyzed here were collected between 1992 and 1997, most in preparation for focal managers' participation in a leadership development program at CCL, although data from external users of the instrument also contributed to the database. This instrument has been examined in multiple validation studies (for a review, see McCauley & Lombardo, 1990) and has received favorable reviews as a reliable measure of important aspects of leadership related to management development (e.g., Zedeck, 1995). Although differential item functioning is rarely found in comparisons among performance rating sources (Facteau & Craig, 2001), it was still desirable to eliminate any possible confounding effects due to rating source in the present study. Therefore, only ratings from direct reports were used in the analyses presented below, yielding an initial sample size of 31,731.

Analyses

Overview. The basic research design involved testing for differential functioning of items and tests, where the groups under examination were (1) data that met the independence assumption and (2) data in which the assumption was violated. The steps involved in creating these conditions are described below.

Exploratory Factor Analyses. In our experience, the intended a priori structures of leadership assessment instruments rarely converge with their empirically identified factor structures. For this reason, and because the large number of items in Benchmarks made detailed item-level analyses unwieldy, we used exploratory factor analysis (EFA) to construct empirically coherent scales from a subset of the Benchmarks items. The intent of this stage of the analyses was twofold: (1) to derive shorter scales covering multiple content domains from the Benchmarks item pool, and (2) to establish that each of those scales met the IRT assumption of unidimensionality. Only the 106 items from the first 16 scales of Benchmarks (i.e., the Managerial Skills and Perspectives section) were submitted to EFA because of their conceptual distinctiveness from the items of scales 17 through 22. That latter group, referred to as the "derailment" scales, targets career progression problems rather than the behavior per se that is the focus of the first 16 scales. In the first step, ratings from 15,704 subordinates of 6,531 target managers on the 106 Benchmarks items were submitted to EFA using maximum likelihood estimation and oblique rotation (see Note 2). We hypothesized, from reports in the literature (e.g., Bass, 1990) and our own experience with analyzing multisource leadership ratings, that at least two large factors would emerge.
It was anticipated that these two factors could be interpreted as variants of two basic dimensions of leader behavior that have been discussed over the years: for example, initiating structure and consideration (Fleishman, 1973), task-orientation and relationship-orientation (Fiedler, 1967), and directive and participative leadership (Bass, 1990).

Bass (1990) noted in his exhaustive review that these distinctions, respectively, are elements of what he called the overarching autocratic and democratic clusters of active leadership behavior. Further, he noted that, as a rule, the two factors will be found in some form in any adequate description of leadership. Because of their broad applicability to leadership assessment and their near-ubiquitous occurrence in leadership assessment instruments, our intent was to construct optimized indicators of those two constructs to use in the remainder of the study.

Interpreting the results of the EFA against the Kaiser (1960) criterion and Cattell's (1966) scree test suggested that the structure of subordinate ratings on the 106 Benchmarks items could be described with anywhere from 2 to 13 factors. However, our objective was not to examine the factor structure of Benchmarks per se, nor to use all of the discovered factors in subsequent analyses. Rather, it was to identify indicators for the two familiar factors discussed above. Inspection of the item content for the first two rotated factors in the present analysis did indeed suggest elements of Bass's (1990) autocratic and democratic clusters. To construct our two scales, we selected 20 items from each factor on the basis of high (primary) factor loadings and low cross-factor loadings. Although previous research has found these two dimensions to be related (e.g., Schriesheim, House, & Kerr, 1976), our goal was to produce scales with the least possible variance overlap while still retaining the best available indicators of their respective constructs. Following Bass's (1990, Ch. 23) review, we labeled these scales task-orientation and relations-orientation because their content was very similar to Bass's definitions for constructs with the same names. Three sample items from each of the final scales are presented in Appendix A. After the items to be included were selected, cases with missing data on the retained items were deleted from the data set. This process was conducted independently for each scale, resulting in largely but not completely overlapping samples for the two scales. After cases with incomplete data were deleted, the remaining sample sizes were 19,624 for the task-orientation scale (α = .93) and 25,236 for the relations-orientation scale (α = .95). As expected, the resulting two 20-item scales were significantly correlated (r = .52, p < .001), at approximately the same magnitude found for measures of conceptually similar constructs (Schriesheim et al., 1976).

Reckase (1979) has suggested that tests can be considered to meet IRT's assumption of unidimensionality if their first unrotated factors account for at least 20% of the items' common variance. Because that recommendation was derived from research on dichotomously scored items, it may be somewhat liberal when applied to polytomous items, which require the estimation of more parameters per item. However, the scales used in the present study exceeded Reckase's recommended lower limit by such a large margin that we believed the unidimensionality assumption to be met in these data. When the two newly created 20-item scales were submitted separately to subsequent EFAs, their first unrotated factors accounted for 42.81% (task-orientation) and 51.67% (relations-orientation) of their respective common variances.
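As a rough illustration of this kind of screening, the sketch below estimates the share of variance carried by the first factor from an inter-item correlation matrix. Note that Reckase's criterion is defined on the first unrotated factor's share of common variance, so treating the first eigenvalue's share of total variance, as here, is a simplified proxy; the data file and the way the 20% bound is applied are our own illustration:

```python
import numpy as np

def first_factor_proportion(item_scores: np.ndarray) -> float:
    """Proportion of total variance captured by the first principal
    axis of the inter-item correlation matrix (rows = respondents,
    columns = items); a rough stand-in for Reckase's (1979) index."""
    corr = np.corrcoef(item_scores, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # sorted descending
    return float(eigenvalues[0] / eigenvalues.sum())

# X = np.loadtxt("task_orientation_items.csv", delimiter=",")  # hypothetical
# print(first_factor_proportion(X) >= 0.20)  # Reckase's 20% lower bound
```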

Table 1 displays descriptive statistics for the newly constructed scales after removing cases with incomplete data.

Table 1
Descriptive Statistics for 20-Item Scales in Sample of Direct Reports with Complete Data
[The table reported the mean, standard deviation, minimum, and maximum scale scores for the task-orientation scale (N = 19,624) and the relations-orientation scale (N = 25,236); the numeric values are not recoverable from the source.]
Note. Scale score is the mean item response for that scale.

Subsamples, Groups, and Conditions. Next, subsamples were randomly drawn from the larger data set of complete cases. Sample sizes of 200, 500, and 1,000 were chosen to reflect the realities of data availability in organizations and to respect a lower boundary below which IRT may be an inappropriate choice of analysis. As will be explained in more detail later, it was also desirable to select subsets of items in order to examine our research question under conditions of varying scale length. Scale lengths of five and 10 items were chosen for examination, in addition to the aforementioned 20-item scales. In each case, the shorter scales consisted of the items with the highest factor loadings from the 20-item scales (i.e., the five highest loading items composed the five-item version of that scale, etc.). Recall that the two groups being compared in the present study were defined by whether the assumption of independent observations had or had not been violated. To maximize the similarity of these groups to the alternatives available to researchers in applied settings, we defined the "one rater" group as one in which a single rater for each ratee had been randomly chosen from the available raters, and the "all raters" group as one in which all available ratings were included, without any attempt to manipulate the number of raters per ratee. In the initial randomly selected subsamples, the number of raters per rating target ranged from one to twelve (median = 3) for the task-orientation scale and from one to nine (median = 4) for the relations-orientation scale (see Note 3). Where this article uses the term "condition," we refer to the conditions under which the two groups were compared, determined by the crossing of scale length (20, 10, and 5 items), sample size (1,000, 500, and 200), and content domain (task-orientation and relations-orientation).

Item Parameter Estimation. The MULTILOG computer program (Thissen, 1995) was used to estimate item parameters under Samejima's (1969) graded response model. The graded response model estimates one discrimination parameter (a) for each item, along with one difficulty parameter (b) for each boundary between adjacent response categories. Thus, in the present study, each item has one a parameter and four b parameters. In order to test whether parameter estimates would differ as a function of the conditions examined here, each condition (content domain × scale length × sample size) was analyzed in an independent run of the MULTILOG software, yielding a (potentially) unique set of parameter estimates for each condition.
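To make this parameterization concrete, here is a minimal sketch of Samejima's graded response model for a five-category item, computing category probabilities and the expected ("true") item score at a given θ; the parameter values in the usage comment are invented for illustration, not taken from Appendix B or C:

```python
import numpy as np

def grm_category_probs(theta: float, a: float, b: np.ndarray) -> np.ndarray:
    """Category response probabilities under the graded response model.
    b holds the four boundary (difficulty) parameters of a
    five-category item, in increasing order."""
    # Each boundary k has a 2PL curve: P(X > k) = 1 / (1 + exp(-a(theta - b_k))).
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    upper = np.concatenate(([1.0], p_star))   # P(X >= category k)
    lower = np.concatenate((p_star, [0.0]))
    return upper - lower                      # P(X = k) for k = 1..5

def expected_item_score(theta: float, a: float, b: np.ndarray) -> float:
    """Expected raw response (1-5), i.e., the item true score at theta."""
    return float(np.dot(np.arange(1, 6), grm_category_probs(theta, a, b)))

# print(expected_item_score(0.0, a=1.3, b=np.array([-2.1, -1.0, -0.1, 1.1])))
```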
Equating Parameter Metrics. Item parameters that are estimated from independent executions of the MULTILOG software (e.g., groups, conditions) are not on identical metrics. As a result, the parameters must be linearly transformed so as to place all the parameters in a given comparison on the same scale. This was accomplished using the EQUATE computer program (Baker, 1995). This software implements the iterative equating procedure described by Stocking and Lord (1983), which derives a multiplicative and an additive constant for use in the linear transformation.
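A sketch of the rescaling step this implies, assuming the Stocking-Lord constants A (multiplicative) and B (additive) have already been derived; the transformation shown is the standard IRT linear rescaling, and the constants in the usage comment are invented:

```python
import numpy as np

def rescale_parameters(a: np.ndarray, b: np.ndarray,
                       A: float, B: float) -> tuple[np.ndarray, np.ndarray]:
    """Place one group's graded response model parameters on another
    group's metric: if theta* = A * theta + B, then b* = A * b + B
    and a* = a / A."""
    return a / A, A * b + B

# a_one, b_one = ...  # estimates from a separate MULTILOG run
# a_star, b_star = rescale_parameters(a_one, b_one, A=1.04, B=-0.12)
```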

To derive the most stable transformation coefficients possible, the estimation process is conducted twice, with any items showing DIF from the first run excluded from the second.

Differential Functioning Analysis. Raju, van der Linden, and Fleer (1995) proposed a framework for examining measurement equivalence, at both the item and scale levels, called "differential functioning of items and tests" (DFIT). The DFIT framework, which is based on IRT, defines differential functioning as a difference in the expected item or scale scores for individuals with the same standing on the latent construct (θ) that is attributable to their membership in different groups. In the language of the DFIT framework, the expected score on an item or test, given θ, is called the "true score," and is expressed on the raw metric of the item or test (e.g., on a one-to-five scale for a five-point Likert-type response format). Analyses based on DFIT yield several types of differential functioning indices. Noncompensatory differential item functioning (NCDIF) is a purely item-level statistic that reflects true score differences for the two groups under examination. As the "noncompensatory" moniker suggests, the NCDIF index considers each item separately, without regard for the functioning of other items in the scale. Mathematically, NCDIF is the square of the difference in true scores for the two groups, averaged across θ. Thus, the square root of NCDIF gives the expected difference in item responses for individuals with the same standing on θ but belonging to different groups. Differential test functioning (DTF) is analogous to NCDIF, but is a test-level index. The square root of DTF is the average expected test score difference between individuals from different groups who have the same standing on θ. Compensatory differential item functioning (CDIF) is an item-level index that represents an item's net contribution to DTF. Computationally, CDIF is the change in DTF associated with the deletion of the focal item from the test.

An important concept in DFIT is the directionality of differential functioning. Two items may exhibit similar levels of differential functioning, but in opposite directions (e.g., one item might "favor" Group 1 and the other might "favor" Group 2). In such a scenario, the two items would cancel each other out and produce no net differential functioning at the test level. Conversely, a number of items that have nonsignificant, but nonzero, levels of NCDIF in the same direction can produce a test with significant DTF, due to the accumulation of item-level differential functioning at the test level. Because the CDIF index represents an item's net contribution to DTF, it takes into account the functioning of other items on the test, in contrast to NCDIF. The relative importance placed on NCDIF, CDIF, and DTF depends on the purpose of the analysis. Because the focus of the present study is on how violation of the independence assumption affects item parameter estimates, all three indicators are of interest here. The significance tests for the three differential functioning indices are biconditional: the differential functioning reflected in the magnitude of any of the indices is considered significant only in the context of (1) a significant chi-square statistic and (2) an associated index that exceeds an a priori specified critical value. The critical value for significant NCDIF depends on the number of response categories for that item.
Exceeding the critical value is included as a condition for significance to mitigate the high sensitivity of chi-square tests to sample size. Raju (personal communication, March 1999) has recommended that the critical NCDIF value for an item with five response categories be set to .096.
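For concreteness, here is a sketch of how NCDIF might be computed from equated parameters, assuming an expected-score function such as the one sketched earlier and a normal quadrature over θ; the grid width and the callable interface are our own choices, not DFIT4P's:

```python
import numpy as np
from scipy.stats import norm

def ncdif(expected_focal, expected_reference, n_points: int = 61) -> float:
    """NCDIF: squared true-score difference between the two groups,
    averaged over a N(0, 1) distribution of theta.

    expected_focal / expected_reference: callables mapping theta to the
    expected item score, with the focal group's parameters already
    equated to the reference metric."""
    theta = np.linspace(-4.0, 4.0, n_points)
    weights = norm.pdf(theta)
    weights /= weights.sum()
    diffs = np.array([expected_focal(t) - expected_reference(t) for t in theta])
    return float(np.dot(weights, diffs ** 2))

# An item is flagged only if its chi-square test is significant AND
# its NCDIF exceeds .096 (the critical value for five response categories).
```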

The critical DTF value for a test composed of such items would be .096 multiplied by the number of items on the test.

Logic for Order of Analyses

The stability of item parameter estimates can vary with certain sample and scale characteristics (Mislevy & Bock, 1990). Specifically, estimates tend to become more stable (i.e., their associated standard error estimates become smaller) as sample size increases and as the number of items on the test increases. Because of these tendencies, it was desirable to test whether violation of the independence assumption produced detectable differences under a variety of conditions. The conditions examined were chosen to reflect the circumstances under which IRT might be applied to performance appraisal data in organizational settings. Because sample sizes and scale lengths both tend to be smaller in applied settings than in research settings, a sample size of 1,000 and a scale length of 20 items were used to anchor the favorable ends of those two continua. At the unfavorable ends, a sample size of 200 and a scale length of five items were used. We believe that most researchers would agree that samples of 200 approach a level below which the use of IRT-based techniques may be inadvisable altogether. The two factors (sample size and scale length) were crossed with each other and with the two content domains described earlier to produce the matrix shown in Table 2.

Table 2
Conditions under which assumption violations will be examined for differential functioning

                        N = 1,000                       N = 200
Content Domain          20 items  10 items  5 items     20 items  10 items  5 items
Task-orientation        Phase 1   Phase 5   Phase 4     Phase 3   Phase 5   Phase 2
Relations-orientation   Phase 1   Phase 5   Phase 4     Phase 3   Phase 5   Phase 2

Because DFIT detects differences in item parameter estimates, and such differences should be more likely to occur under less favorable conditions (as described above), a logical order for the conduct of the analyses presented itself. Specifically, we chose to test first whether violating the independence assumption under favorable conditions (Phase 1) would produce different parameter estimates, in order to answer immediately the question of whether the violation presented an unavoidable problem for researchers in applied settings. That is, if the violation produced deleterious effects under the best conditions that could reasonably be expected in organizations, then we could conclude that the assumption should be fairly strictly respected. Next, we would test the least favorable condition (Phase 2) and compare the results with those from Phase 1 to determine whether there were any effects due to sample size and scale length. The remaining conditions would be tested in an order that depended on the outcomes of Phases 1 and 2. That order would be designed to identify, in the least possible number of analyses, the boundary between the conditions under which the assumption violation produced detectable effects and the conditions under which it did not.

Results

Phase 1: 20 Items / N = 1,000. Following the logic described earlier, the condition most favorable to IRT-based analyses was tested first (i.e., longer test, larger sample). Across the 40 items for the two content domains, no item or scale met both conditions for significant differential functioning, suggesting that violating the independent observations assumption has no effect on the parameter estimates computed by MULTILOG. Not surprisingly, given the large sample size, nearly all of the chi-square tests were significant even though no NCDIF

or DTF statistic exceeded its critical value. Only item 8 (task-orientation) and item 19 (relations-orientation) manifested nonsignificant chi-square statistics. Additionally, the chi-square test of the DTF statistic for the task-orientation scale was nonsignificant. It is noteworthy that no DFIT statistic even approached its critical value; the NCDIF index computed for item 12 on the relations-orientation scale (.025) was the largest for any item or scale, and it had less than one fourth the magnitude required for significance (.096). Because there was no significant item-level (NCDIF) or scale-level (DTF) differential functioning, the CDIF statistic was not meaningful in this analysis.

Phase 2: 5 Items / N = 200. Having determined that violating the independence assumption under conditions of relative parameter stability produced no detectable effects, we next examined the least favorable condition: short scale length and small sample size. Testing for differential functioning under these conditions was tantamount to stacking the deck in favor of finding it. That is, any instability of parameters estimated under these conditions should

manifest as random fluctuations between groups and be detected as differential functioning (albeit erroneously). Despite these hostile conditions, the results of this phase mirrored those from the previous phase: not a single item or scale met both conditions for significant differential functioning. The proportion of significant chi-square statistics was actually lower in this phase, probably due to the smaller sample sizes. Items 3 and 5 on both scales, as well as both DTF statistics, had nonsignificant chi-square tests. Also as in Phase 1, no item's or scale's DFIT statistic even approached its critical value. The largest (NCDIF = .018, for item 3 on the relations-orientation scale) had less than one fifth the necessary magnitude (.096).

Other Conditions

Recall that the logic described earlier specified that the remaining conditions would be examined with the goal of identifying the boundary between the conditions under which differential functioning was detectable and the conditions under which it was not. Because 54 separate tests for differential functioning (42 in Phase 1, 12 in Phase 2) failed to yield even a single case of DIF or DTF, we decided that repeating our analyses under the remaining conditions would be superfluous. We reasoned that, if we had found no differences in parameter estimates under the most unstable of conditions, we would almost certainly not find any under more stable ones. Thus, the remaining conditions were not tested. This decision was consistent with the logic specified in the original design of this study. Table 3 summarizes the results of the DFIT analyses.

Table 3
Differential Functioning Indices (NCDIF and DTF): One Rater per Ratee vs. Multiple Raters per Ratee, by Condition
[The table reported item-level NCDIF and scale-level DTF values for the task-orientation and relations-orientation scales under each condition; the numeric values are not recoverable from the source.]
Note. Item-level values (NCDIF) higher than .096 are required for significant DIF. Scale-level values higher than 1.92 for 20-item scales or .48 for 5-item scales are required for significant DTF.

Discussion

This study explored the question of whether violating the assumption of independent observations in IRT-based analyses has a deleterious effect on item parameters estimated by MML procedures. More specifically, we framed this question as one of differential functioning and applied the DFIT framework of Raju et al. (1995) to its investigation. We found that, under four distinct combinations of scale length, sample size, and content domain, not a single item or scale came near to meeting appropriate criteria for differential functioning. This finding suggests that, to the extent these results are generalizable, inclusion of multiple raters per rating target does not affect the estimation of IRT item parameters. We believe this finding to be of some importance to researchers wishing to apply IRT to multisource ratings in general, and to 360 leadership assessments in particular. Prior to this research, investigators examining multirater data under IRT have tended to err on the side of caution by randomly selecting one rater per target. The result has been the exclusion of sizable amounts of potentially useful data from analyses. The results presented here suggest that researchers may be able to reap the benefits of using all of the data available (e.g., parameter stability, generalizability) without incurring deleterious effects due to nonindependence.

Although compelling, these findings have a few limitations that warrant mentioning. First, the two content areas from which we constructed scales (task-orientation and relations-orientation) were both domains of leadership behavior. These results should be replicated with multisource ratings in other domains, such as education, in order to evaluate their generalizability.
Second, the two scales evaluated here were carefully constructed to exhibit excellent psychometric properties (e.g., unidimensionality, internal consistency). It is conceivable that other scales with less sound properties might be more vulnerable to these effects than those studied

here. Additionally, all of our comparisons involved equal sample sizes (i.e., the one rater group included more ratees in order to maintain the same sample size as the all raters group). This feature allowed us to unconfound effects due to rating independence from effects due to sample size. But in a field setting, those effects almost certainly would be confounded. Thus, this study did not test for effects of the sample size reduction associated with limiting analyses to one rater per ratee.

The analyses reported here do not provide an explanation for why the parameter estimates were unaffected by violation of the independence assumption, but it is possible to speculate. One possible interpretation is that the procedures employed by the MULTILOG software (i.e., marginal maximum likelihood estimation by means of the expectation-maximization algorithm) are at least somewhat robust to violations of the assumption. If this is the case, more extreme violations of rater independence may produce effects that did not occur here. The use of real data in this study was both a strength and a limitation. Using real data allowed us to create conditions that closely resemble those likely to be encountered in applied research settings. However, we were not able to control the sample characteristics (e.g., degree of rating dependence) to the extent possible in Monte Carlo simulations. Future research may be able to identify the boundaries beyond which effects occur by using data simulation techniques.

Another, more intriguing, possibility is that the assumption was not actually violated. A core concept in IRT is the latent construct (θ) being measured by the items under consideration. The notion that multisource ratings violate the independent observations assumption is predicated on the belief that θ is objective leader behavior. That is, it assumes that the raters are in fact rating the objective performance behavior of a common target. However, an alternative view is that θ is really the rater's perception of leader behavior and, further, that each rater has a unique perception of the leader. Under this interpretation, the independence assumption is not really violated because each rater is rating a different target: his or her idiosyncratic perception of the leader. This interpretation is consistent with recent research on 360 leadership assessments. For example, Scullen, Mount, and Goff (2000) found that between 53 and 62% of the variance in ratings on two widely used 360 instruments was due to idiosyncratic individual rater effects. Variance in ratings attributable to the performance of the target leaders was about half that size: between 21 and 25%. Thus it appears that leadership performance ratings are considerably more influenced by raters' own unique perceptions of leaders than by leaders' objective performance. It may well be, then, that multisource ratings of managers do not violate the independence of observations assumption after all. Future research should address this question directly by comparing ratings within ratee to ratings across ratees for evidence of dependence.

In summary, the results reported here suggest that it may not be necessary to randomly select one rater per ratee in IRT-based analyses of multisource rating data. In situations like those examined here, incurring the associated (severe) sample size reduction may yield no benefits with regard to parameter estimation.
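As a sketch of the dependence check suggested above, one could compare within-ratee to across-ratee agreement via ICC(1) from a one-way random-effects ANOVA of ratings grouped by ratee; the data layout and column names here are hypothetical, and using the average group size is a common approximation for unbalanced designs:

```python
import numpy as np
import pandas as pd

def icc1(scores: pd.Series, ratee: pd.Series) -> float:
    """ICC(1): the proportion of rating variance attributable to ratees.
    Values near zero suggest ratings of the same ratee behave almost
    like independent observations."""
    groups = [g.values for _, g in scores.groupby(ratee)]
    n_total, n_groups = len(scores), len(groups)
    k = np.mean([len(g) for g in groups])   # average raters per ratee
    grand_mean = scores.mean()
    ms_between = sum(len(g) * (g.mean() - grand_mean) ** 2
                     for g in groups) / (n_groups - 1)
    ms_within = sum(((g - g.mean()) ** 2).sum()
                    for g in groups) / (n_total - n_groups)
    return float((ms_between - ms_within) / (ms_between + (k - 1) * ms_within))

# df = pd.read_csv("ratings.csv")  # hypothetical columns: ratee_id, scale_score
# print(icc1(df["scale_score"], df["ratee_id"]))
```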
References

Baker, F. (1995). EQUATE 2.1: Computer program for equating two metrics in item response theory. Madison, WI: University of Wisconsin, Laboratory of Experimental Design.

Bass, B. M. (1990). Bass and Stogdill's handbook of leadership: Theory, research, and managerial applications (3rd ed.). New York: Free Press.

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: An application of an EM algorithm. Psychometrika, 46.

Cattell, R. B. (1966). The meaning and strategic use of factor analysis. In R. B. Cattell (Ed.), Handbook of multivariate experimental psychology. Chicago: Rand McNally.

Facteau, J., & Craig, S. B. (2001). Are performance appraisal ratings obtained from different rating sources comparable? Journal of Applied Psychology, 86 (in press).

Fiedler, F. E. (1967). A theory of leadership effectiveness. New York: McGraw-Hill.

Fleishman, E. A. (1973). Twenty years of consideration and structure. In E. A. Fleishman & J. G. Hunt (Eds.), Current developments in the study of leadership. Carbondale: Southern Illinois University Press.

Greguras, G. J., & Robie, C. (1998). A new look at within-source interrater reliability of 360-degree feedback ratings. Journal of Applied Psychology, 83.

Kaiser, H. F. (1960). The application of electronic computers to factor analysis. Educational and Psychological Measurement, 20.

Lombardo, M. M., & McCauley, C. D. (1994). Benchmarks: A manual and trainer's guide. Greensboro, NC: Center for Creative Leadership.

London, M., & Smither, J. W. (1995). Can multi-source feedback change perceptions of goal accomplishment, self-evaluations, and performance-related outcomes? Theory-based applications and directions for research. Personnel Psychology, 48.

Maurer, T. J., Raju, N. S., & Collins, W. C. (1998). Peer and subordinate performance appraisal measurement equivalence. Journal of Applied Psychology, 83.

McCauley, C. D., & Lombardo, M. (1990). Benchmarks: An instrument for diagnosing managerial strengths and weaknesses. In K. E. Clark & M. B. Clark (Eds.), Measures of leadership. West Orange, NJ: Leadership Library of America.

McCauley, C. D., Lombardo, M. M., & Usher, C. J. (1989). Diagnosing management development needs: An instrument based on how managers develop. Journal of Management, 15.

Mislevy, R. J., & Bock, R. D. (1990). BILOG 3 user manual. Mooresville, IN: Scientific Software, Inc.

Raju, N. S. (1999). DFIT4P: A computer program for analyzing differential item and test functioning. Chicago: Illinois Institute of Technology.

Raju, N. S. (Chair). (1999, April). IRT-based evaluation of 360 feedback assessments: The BENCHMARKS story. Symposium presented at the annual conference of the Society for Industrial-Organizational Psychology, Atlanta, GA.

Raju, N. S., van der Linden, W., & Fleer, P. (1995). An IRT-based internal measure of test bias with applications for differential item functioning. Applied Psychological Measurement, 19.

Reckase, M. D. (1979). Unifactor latent trait models applied to multi-factor tests: Results and implications. Journal of Educational Statistics, 4.

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores (Psychometrika Monograph No. 17).

Schriesheim, C. A., House, R. J., & Kerr, S. (1976). Leader initiating structure: A reconciliation of discrepant research results and some empirical tests. Organizational Behavior and Human Performance, 15.

Scullen, S. E., Mount, M. K., & Goff, M. (2000). Understanding the latent structure of job performance ratings. Journal of Applied Psychology, 85.

Stocking, M. L., & Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7.

Thissen, D. (1995). MULTILOG 6.3: A computer program for multiple, categorical item analysis and test scoring using item response theory. Chicago: Scientific Software, Inc.

Tornow, W. W., & London, M. (1998). Maximizing the value of 360-degree feedback. San Francisco: Jossey-Bass.

Zedeck, S. (1995). [Review of Benchmarks]. In J. Conoley & J. Impara (Eds.), The twelfth mental measurements yearbook (Vol. 1). Lincoln, NE: Buros Institute of Mental Measurements.

Notes

1. The version of the Benchmarks instrument used in this study was replaced in its publisher's product line with an updated version in April. More information is available on the Internet.

2. Only half of the cases were submitted to EFA in order to allow the other half to be used in a confirmatory factor analysis as part of a different study of these same data.

3. Although the median number of raters per ratee was identical for the 1,000 and 200 sample sizes, the range differed somewhat. In the smaller (N = 200) condition, raters per target ranged from one to five for the task-orientation scale and from one to seven for the relations-orientation scale.

Appendix A
Sample Items from Task-orientation and Relations-orientation Scales

Task-orientation
Is action-oriented.
Faces difficult situations with guts and tenacity.
Takes charge when trouble comes up.

Relations-orientation
Has a warm personality that puts people at ease.
Tries to understand what other people think before making judgments about them.
Shows interest in the needs, hopes, and dreams of other people.

(Item text copyright Center for Creative Leadership, 1994.)

Appendix B
Item Parameter Estimates and Standard Error Estimates: Task-orientation
[The table listed the b1 through b4 and a parameter estimates, each with its standard error, for all 20 task-orientation items under the one rater ("One") and all raters ("All") conditions; the tabled values are too garbled in the source to reconstruct.]
Note. Item parameter estimates (but not standard error estimates) for the one rater per ratee condition ("One") have been transformed to the metric of the all available raters condition ("All") using EQUATE 2.1 (Baker, 1995).

Appendix C
Item Parameter Estimates and Standard Error Estimates: Relations-orientation
[The table listed the b1 through b4 and a parameter estimates, each with its standard error, for all 20 relations-orientation items under the one rater ("One") and all raters ("All") conditions; the tabled values are too garbled in the source to reconstruct.]
Note. Item parameter estimates (but not standard error estimates) for the one rater per ratee condition ("One") have been transformed to the metric of the all available raters condition ("All") using EQUATE 2.1 (Baker, 1995).


More information

Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure. Rob Cavanagh Len Sparrow Curtin University

Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure. Rob Cavanagh Len Sparrow Curtin University Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure Rob Cavanagh Len Sparrow Curtin University R.Cavanagh@curtin.edu.au Abstract The study sought to measure mathematics anxiety

More information

Bruno D. Zumbo, Ph.D. University of Northern British Columbia

Bruno D. Zumbo, Ph.D. University of Northern British Columbia Bruno Zumbo 1 The Effect of DIF and Impact on Classical Test Statistics: Undetected DIF and Impact, and the Reliability and Interpretability of Scores from a Language Proficiency Test Bruno D. Zumbo, Ph.D.

More information

Contents. What is item analysis in general? Psy 427 Cal State Northridge Andrew Ainsworth, PhD

Contents. What is item analysis in general? Psy 427 Cal State Northridge Andrew Ainsworth, PhD Psy 427 Cal State Northridge Andrew Ainsworth, PhD Contents Item Analysis in General Classical Test Theory Item Response Theory Basics Item Response Functions Item Information Functions Invariance IRT

More information

Doing Quantitative Research 26E02900, 6 ECTS Lecture 6: Structural Equations Modeling. Olli-Pekka Kauppila Daria Kautto

Doing Quantitative Research 26E02900, 6 ECTS Lecture 6: Structural Equations Modeling. Olli-Pekka Kauppila Daria Kautto Doing Quantitative Research 26E02900, 6 ECTS Lecture 6: Structural Equations Modeling Olli-Pekka Kauppila Daria Kautto Session VI, September 20 2017 Learning objectives 1. Get familiar with the basic idea

More information

Essential Skills for Evidence-based Practice Understanding and Using Systematic Reviews

Essential Skills for Evidence-based Practice Understanding and Using Systematic Reviews J Nurs Sci Vol.28 No.4 Oct - Dec 2010 Essential Skills for Evidence-based Practice Understanding and Using Systematic Reviews Jeanne Grace Corresponding author: J Grace E-mail: Jeanne_Grace@urmc.rochester.edu

More information

11/24/2017. Do not imply a cause-and-effect relationship

11/24/2017. Do not imply a cause-and-effect relationship Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are highly extraverted people less afraid of rejection

More information

GMAC. Scaling Item Difficulty Estimates from Nonequivalent Groups

GMAC. Scaling Item Difficulty Estimates from Nonequivalent Groups GMAC Scaling Item Difficulty Estimates from Nonequivalent Groups Fanmin Guo, Lawrence Rudner, and Eileen Talento-Miller GMAC Research Reports RR-09-03 April 3, 2009 Abstract By placing item statistics

More information

Context of Best Subset Regression

Context of Best Subset Regression Estimation of the Squared Cross-Validity Coefficient in the Context of Best Subset Regression Eugene Kennedy South Carolina Department of Education A monte carlo study was conducted to examine the performance

More information

Extraversion. The Extraversion factor reliability is 0.90 and the trait scale reliabilities range from 0.70 to 0.81.

Extraversion. The Extraversion factor reliability is 0.90 and the trait scale reliabilities range from 0.70 to 0.81. MSP RESEARCH NOTE B5PQ Reliability and Validity This research note describes the reliability and validity of the B5PQ. Evidence for the reliability and validity of is presented against some of the key

More information

ITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION SCALE

ITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION SCALE California State University, San Bernardino CSUSB ScholarWorks Electronic Theses, Projects, and Dissertations Office of Graduate Studies 6-2016 ITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION

More information

LEDYARD R TUCKER AND CHARLES LEWIS

LEDYARD R TUCKER AND CHARLES LEWIS PSYCHOMETRIKA--VOL. ~ NO. 1 MARCH, 1973 A RELIABILITY COEFFICIENT FOR MAXIMUM LIKELIHOOD FACTOR ANALYSIS* LEDYARD R TUCKER AND CHARLES LEWIS UNIVERSITY OF ILLINOIS Maximum likelihood factor analysis provides

More information

Empirical Formula for Creating Error Bars for the Method of Paired Comparison

Empirical Formula for Creating Error Bars for the Method of Paired Comparison Empirical Formula for Creating Error Bars for the Method of Paired Comparison Ethan D. Montag Rochester Institute of Technology Munsell Color Science Laboratory Chester F. Carlson Center for Imaging Science

More information

Adaptive EAP Estimation of Ability

Adaptive EAP Estimation of Ability Adaptive EAP Estimation of Ability in a Microcomputer Environment R. Darrell Bock University of Chicago Robert J. Mislevy National Opinion Research Center Expected a posteriori (EAP) estimation of ability,

More information

Impact of Differential Item Functioning on Subsequent Statistical Conclusions Based on Observed Test Score Data. Zhen Li & Bruno D.

Impact of Differential Item Functioning on Subsequent Statistical Conclusions Based on Observed Test Score Data. Zhen Li & Bruno D. Psicológica (2009), 30, 343-370. SECCIÓN METODOLÓGICA Impact of Differential Item Functioning on Subsequent Statistical Conclusions Based on Observed Test Score Data Zhen Li & Bruno D. Zumbo 1 University

More information

A Comparison of Item Response Theory and Confirmatory Factor Analytic Methodologies for Establishing Measurement Equivalence/Invariance

A Comparison of Item Response Theory and Confirmatory Factor Analytic Methodologies for Establishing Measurement Equivalence/Invariance 10.1177/1094428104268027 ORGANIZATIONAL Meade, Lautenschlager RESEARCH / COMP ARISON METHODS OF IRT AND CFA A Comparison of Item Response Theory and Confirmatory Factor Analytic Methodologies for Establishing

More information

Gezinskenmerken: De constructie van de Vragenlijst Gezinskenmerken (VGK) Klijn, W.J.L.

Gezinskenmerken: De constructie van de Vragenlijst Gezinskenmerken (VGK) Klijn, W.J.L. UvA-DARE (Digital Academic Repository) Gezinskenmerken: De constructie van de Vragenlijst Gezinskenmerken (VGK) Klijn, W.J.L. Link to publication Citation for published version (APA): Klijn, W. J. L. (2013).

More information

INVESTIGATING FIT WITH THE RASCH MODEL. Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form

INVESTIGATING FIT WITH THE RASCH MODEL. Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form INVESTIGATING FIT WITH THE RASCH MODEL Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form of multidimensionality. The settings in which measurement

More information

TEST REVIEWS. Myers-Briggs Type Indicator, Form M

TEST REVIEWS. Myers-Briggs Type Indicator, Form M TEST REVIEWS Myers-Briggs Type Indicator, Form M Myers-Briggs Type Indicator, Form M Purpose Designed for "the identification of basic preferences on each of the four dichotomies specified or implicit

More information

Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data

Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data TECHNICAL REPORT Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data CONTENTS Executive Summary...1 Introduction...2 Overview of Data Analysis Concepts...2

More information

Sawtooth Software. The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? RESEARCH PAPER SERIES

Sawtooth Software. The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? RESEARCH PAPER SERIES Sawtooth Software RESEARCH PAPER SERIES The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? Dick Wittink, Yale University Joel Huber, Duke University Peter Zandan,

More information

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report Center for Advanced Studies in Measurement and Assessment CASMA Research Report Number 39 Evaluation of Comparability of Scores and Passing Decisions for Different Item Pools of Computerized Adaptive Examinations

More information

Investigating the robustness of the nonparametric Levene test with more than two groups

Investigating the robustness of the nonparametric Levene test with more than two groups Psicológica (2014), 35, 361-383. Investigating the robustness of the nonparametric Levene test with more than two groups David W. Nordstokke * and S. Mitchell Colp University of Calgary, Canada Testing

More information

2 Types of psychological tests and their validity, precision and standards

2 Types of psychological tests and their validity, precision and standards 2 Types of psychological tests and their validity, precision and standards Tests are usually classified in objective or projective, according to Pasquali (2008). In case of projective tests, a person is

More information

Cochrane Pregnancy and Childbirth Group Methodological Guidelines

Cochrane Pregnancy and Childbirth Group Methodological Guidelines Cochrane Pregnancy and Childbirth Group Methodological Guidelines [Prepared by Simon Gates: July 2009, updated July 2012] These guidelines are intended to aid quality and consistency across the reviews

More information

Multidimensionality and Item Bias

Multidimensionality and Item Bias Multidimensionality and Item Bias in Item Response Theory T. C. Oshima, Georgia State University M. David Miller, University of Florida This paper demonstrates empirically how item bias indexes based on

More information

Ecological Statistics

Ecological Statistics A Primer of Ecological Statistics Second Edition Nicholas J. Gotelli University of Vermont Aaron M. Ellison Harvard Forest Sinauer Associates, Inc. Publishers Sunderland, Massachusetts U.S.A. Brief Contents

More information

BAD ITEMS, BAD DATA Item Characteristics and Rating Discrepancies in Multi-source Assessments

BAD ITEMS, BAD DATA Item Characteristics and Rating Discrepancies in Multi-source Assessments BAD ITEMS, BAD DATA Item Characteristics and Rating Discrepancies in Multi-source Assessments Robert B. Kaiser, Kaplan DeVries Inc. S. Bartholomew Craig, North Carolina State University Abstract The authors

More information

Work Personality Index Factorial Similarity Across 4 Countries

Work Personality Index Factorial Similarity Across 4 Countries Work Personality Index Factorial Similarity Across 4 Countries Donald Macnab Psychometrics Canada Copyright Psychometrics Canada 2011. All rights reserved. The Work Personality Index is a trademark of

More information

Personality Traits Effects on Job Satisfaction: The Role of Goal Commitment

Personality Traits Effects on Job Satisfaction: The Role of Goal Commitment Marshall University Marshall Digital Scholar Management Faculty Research Management, Marketing and MIS Fall 11-14-2009 Personality Traits Effects on Job Satisfaction: The Role of Goal Commitment Wai Kwan

More information

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison Empowered by Psychometrics The Fundamentals of Psychometrics Jim Wollack University of Wisconsin Madison Psycho-what? Psychometrics is the field of study concerned with the measurement of mental and psychological

More information

Addendum: Multiple Regression Analysis (DRAFT 8/2/07)

Addendum: Multiple Regression Analysis (DRAFT 8/2/07) Addendum: Multiple Regression Analysis (DRAFT 8/2/07) When conducting a rapid ethnographic assessment, program staff may: Want to assess the relative degree to which a number of possible predictive variables

More information

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES Correlational Research Correlational Designs Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are

More information

Personal Style Inventory Item Revision: Confirmatory Factor Analysis

Personal Style Inventory Item Revision: Confirmatory Factor Analysis Personal Style Inventory Item Revision: Confirmatory Factor Analysis This research was a team effort of Enzo Valenzi and myself. I m deeply grateful to Enzo for his years of statistical contributions to

More information

Psychological Experience of Attitudinal Ambivalence as a Function of Manipulated Source of Conflict and Individual Difference in Self-Construal

Psychological Experience of Attitudinal Ambivalence as a Function of Manipulated Source of Conflict and Individual Difference in Self-Construal Seoul Journal of Business Volume 11, Number 1 (June 2005) Psychological Experience of Attitudinal Ambivalence as a Function of Manipulated Source of Conflict and Individual Difference in Self-Construal

More information

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida Vol. 2 (1), pp. 22-39, Jan, 2015 http://www.ijate.net e-issn: 2148-7456 IJATE A Comparison of Logistic Regression Models for Dif Detection in Polytomous Items: The Effect of Small Sample Sizes and Non-Normality

More information

André Cyr and Alexander Davies

André Cyr and Alexander Davies Item Response Theory and Latent variable modeling for surveys with complex sampling design The case of the National Longitudinal Survey of Children and Youth in Canada Background André Cyr and Alexander

More information

CHAPTER VI RESEARCH METHODOLOGY

CHAPTER VI RESEARCH METHODOLOGY CHAPTER VI RESEARCH METHODOLOGY 6.1 Research Design Research is an organized, systematic, data based, critical, objective, scientific inquiry or investigation into a specific problem, undertaken with the

More information

Latent Trait Standardization of the Benzodiazepine Dependence. Self-Report Questionnaire using the Rasch Scaling Model

Latent Trait Standardization of the Benzodiazepine Dependence. Self-Report Questionnaire using the Rasch Scaling Model Chapter 7 Latent Trait Standardization of the Benzodiazepine Dependence Self-Report Questionnaire using the Rasch Scaling Model C.C. Kan 1, A.H.G.S. van der Ven 2, M.H.M. Breteler 3 and F.G. Zitman 1 1

More information

Chapter 11. Experimental Design: One-Way Independent Samples Design

Chapter 11. Experimental Design: One-Way Independent Samples Design 11-1 Chapter 11. Experimental Design: One-Way Independent Samples Design Advantages and Limitations Comparing Two Groups Comparing t Test to ANOVA Independent Samples t Test Independent Samples ANOVA Comparing

More information

Adjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data

Adjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data Adjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data Karl Bang Christensen National Institute of Occupational Health, Denmark Helene Feveille National

More information

CHAPTER 3 METHOD AND PROCEDURE

CHAPTER 3 METHOD AND PROCEDURE CHAPTER 3 METHOD AND PROCEDURE Previous chapter namely Review of the Literature was concerned with the review of the research studies conducted in the field of teacher education, with special reference

More information

Abstract. In this paper, I will analyze three articles that review the impact on conflict on

Abstract. In this paper, I will analyze three articles that review the impact on conflict on The Positives & Negatives of Conflict 1 Author: Kristen Onkka Abstract In this paper, I will analyze three articles that review the impact on conflict on employees in the workplace. The first article reflects

More information

Chapter 3. Psychometric Properties

Chapter 3. Psychometric Properties Chapter 3 Psychometric Properties Reliability The reliability of an assessment tool like the DECA-C is defined as, the consistency of scores obtained by the same person when reexamined with the same test

More information

Organizational readiness for implementing change: a psychometric assessment of a new measure

Organizational readiness for implementing change: a psychometric assessment of a new measure Shea et al. Implementation Science 2014, 9:7 Implementation Science RESEARCH Organizational readiness for implementing change: a psychometric assessment of a new measure Christopher M Shea 1,2*, Sara R

More information

(CORRELATIONAL DESIGN AND COMPARATIVE DESIGN)

(CORRELATIONAL DESIGN AND COMPARATIVE DESIGN) UNIT 4 OTHER DESIGNS (CORRELATIONAL DESIGN AND COMPARATIVE DESIGN) Quasi Experimental Design Structure 4.0 Introduction 4.1 Objectives 4.2 Definition of Correlational Research Design 4.3 Types of Correlational

More information

LEADERSHIP ATTRIBUTES AND CULTURAL VALUES IN AUSTRALIA AND NEW ZEALAND COMPARED: AN INITIAL REPORT BASED ON GLOBE DATA

LEADERSHIP ATTRIBUTES AND CULTURAL VALUES IN AUSTRALIA AND NEW ZEALAND COMPARED: AN INITIAL REPORT BASED ON GLOBE DATA LEADERSHIP ATTRIBUTES AND CULTURAL VALUES IN AUSTRALIA AND NEW ZEALAND COMPARED: AN INITIAL REPORT BASED ON GLOBE DATA ABSTRACT Neal M. Ashkanasy Edwin Trevor-Roberts Jeffrey Kennedy This paper reports

More information

Discriminant Analysis with Categorical Data

Discriminant Analysis with Categorical Data - AW)a Discriminant Analysis with Categorical Data John E. Overall and J. Arthur Woodward The University of Texas Medical Branch, Galveston A method for studying relationships among groups in terms of

More information

STATISTICS AND RESEARCH DESIGN

STATISTICS AND RESEARCH DESIGN Statistics 1 STATISTICS AND RESEARCH DESIGN These are subjects that are frequently confused. Both subjects often evoke student anxiety and avoidance. To further complicate matters, both areas appear have

More information

Multiple Act criterion:

Multiple Act criterion: Common Features of Trait Theories Generality and Stability of Traits: Trait theorists all use consistencies in an individual s behavior and explain why persons respond in different ways to the same stimulus

More information

Analysis of the Reliability and Validity of an Edgenuity Algebra I Quiz

Analysis of the Reliability and Validity of an Edgenuity Algebra I Quiz Analysis of the Reliability and Validity of an Edgenuity Algebra I Quiz This study presents the steps Edgenuity uses to evaluate the reliability and validity of its quizzes, topic tests, and cumulative

More information

An Empirical Study on Causal Relationships between Perceived Enjoyment and Perceived Ease of Use

An Empirical Study on Causal Relationships between Perceived Enjoyment and Perceived Ease of Use An Empirical Study on Causal Relationships between Perceived Enjoyment and Perceived Ease of Use Heshan Sun Syracuse University hesun@syr.edu Ping Zhang Syracuse University pzhang@syr.edu ABSTRACT Causality

More information

Business Statistics Probability

Business Statistics Probability Business Statistics The following was provided by Dr. Suzanne Delaney, and is a comprehensive review of Business Statistics. The workshop instructor will provide relevant examples during the Skills Assessment

More information

Unit 1 Exploring and Understanding Data

Unit 1 Exploring and Understanding Data Unit 1 Exploring and Understanding Data Area Principle Bar Chart Boxplot Conditional Distribution Dotplot Empirical Rule Five Number Summary Frequency Distribution Frequency Polygon Histogram Interquartile

More information

Multilevel IRT for group-level diagnosis. Chanho Park Daniel M. Bolt. University of Wisconsin-Madison

Multilevel IRT for group-level diagnosis. Chanho Park Daniel M. Bolt. University of Wisconsin-Madison Group-Level Diagnosis 1 N.B. Please do not cite or distribute. Multilevel IRT for group-level diagnosis Chanho Park Daniel M. Bolt University of Wisconsin-Madison Paper presented at the annual meeting

More information

Instrument Validation Study

Instrument Validation Study Instrument Validation Study REGARDING LEADERSHIP CIRCLE PROFILE By Industrial Psychology Department Bowling Green State University INSTRUMENT VALIDATION STUDY EXECUTIVE SUMMARY AND RESPONSE TO THE RECOMMENDATIONS

More information

REPORT. Technical Report: Item Characteristics. Jessica Masters

REPORT. Technical Report: Item Characteristics. Jessica Masters August 2010 REPORT Diagnostic Geometry Assessment Project Technical Report: Item Characteristics Jessica Masters Technology and Assessment Study Collaborative Lynch School of Education Boston College Chestnut

More information

CONTENT ANALYSIS OF COGNITIVE BIAS: DEVELOPMENT OF A STANDARDIZED MEASURE Heather M. Hartman-Hall David A. F. Haaga

CONTENT ANALYSIS OF COGNITIVE BIAS: DEVELOPMENT OF A STANDARDIZED MEASURE Heather M. Hartman-Hall David A. F. Haaga Journal of Rational-Emotive & Cognitive-Behavior Therapy Volume 17, Number 2, Summer 1999 CONTENT ANALYSIS OF COGNITIVE BIAS: DEVELOPMENT OF A STANDARDIZED MEASURE Heather M. Hartman-Hall David A. F. Haaga

More information

Rasch Versus Birnbaum: New Arguments in an Old Debate

Rasch Versus Birnbaum: New Arguments in an Old Debate White Paper Rasch Versus Birnbaum: by John Richard Bergan, Ph.D. ATI TM 6700 E. Speedway Boulevard Tucson, Arizona 85710 Phone: 520.323.9033 Fax: 520.323.9139 Copyright 2013. All rights reserved. Galileo

More information

PERCEIVED TRUSTWORTHINESS OF KNOWLEDGE SOURCES: THE MODERATING IMPACT OF RELATIONSHIP LENGTH

PERCEIVED TRUSTWORTHINESS OF KNOWLEDGE SOURCES: THE MODERATING IMPACT OF RELATIONSHIP LENGTH PERCEIVED TRUSTWORTHINESS OF KNOWLEDGE SOURCES: THE MODERATING IMPACT OF RELATIONSHIP LENGTH DANIEL Z. LEVIN Management and Global Business Dept. Rutgers Business School Newark and New Brunswick Rutgers

More information