Using GeneralIzabliIty Theory I«Aosess Interrater Reliability of Contract Proposal Evaluations

Size: px

Start display at page:

Download "Using GeneralIzabliIty Theory I«Aosess Interrater Reliability of Contract Proposal Evaluations"

Erika McDaniel
5 years ago
Views:

1 / Using GeneralIzabliIty Theory I«Aosess Interrater Reliability of Contract Proposal Evaluations Iß 00 Richard A. Kass, Timothy El ig and Karen Mitchell US Army Research Institute for the Behavioral and Social Sciences QO The purpose of this report Is to present a technique for estimating Interrater reliability In terms of a generalizabllity coefficient, give an example of this technique from five recent contract proposal evaluations, and ^ ^ present the Implications of these data for organizing future contract proposal ^^ rev lews. GeneralIzabllIty Theory OH Most Investigations of Interrater reliability report the product moment correlation between the ratings of the raters. When more than two raters are employed, the product moment correlation may be reported for all possible pairings of raters. There are three general disadvantages with the correlational approach to assess Interrater reliability. First, there Is a theoretical problem of conceptualizing proposal evaluation scores In terms of the classical notion of true scores. Second, the correlational method does not permit the Investigation of different sources of error. Third, when more than two evaluators are Involved, pair-wise correlations do not readily allow for estimates of rater reliability based on composite ratings. GeneralIzabllIty Theory Is an analysis of variance approach to interrater reliability explicated most completely In a book by Cronbach, Gleser, Nanda and Rajaratnam (1972) entitled The Dependability of Behavioral Measurements. Brennen (1977) provides an amplification of the basic principles and procedures. The first advantage of General Izabll Ity Theory is that It does not rest on the classical notion of true and error scores. Evaluating contract proposals in L terms of classical test theory assumes that there is associated with each Ml proposal a true score, and the more (or better) raters employed the better the ^ final observed score will approximate a proposal's true score. In Generalizabllity Theory, there is no single true score which the evaluators are attempting to approximate. The General izabll ity Coefficient (GC) is an index of how well we are measuring (approximating) one particular specified universe out of any number of possible universes of interest. fl A universe is a collection of behavioral measurements. A "articular set of behavioral measurements in a universe is further defined in terms of the facets or conditions of measurement. With respect to contract proposal evaluations, there are often three facets: raters, criteria and proposals. It will later be shown that the calculation of the GC on the data in this report involves computing a three-factor (facets) completely crossed ANOVA. The "generalizabllity" (universe of interest) of Generalizabllity Theory refers to the extent that the facets defining the universe of interest may be fixed or random. It will be usetul to show the relationship between the calculation of the reliability coefficient (Rxx) and the Generalizabllity Coefficient (GC). V 491 LV -.- "

2 - - ^_i Reliability can be written as: o- 2 ( T ) Rxx - (1) cr 2 ( T ) + <r 2 < E > Where T and E represent true and error scores, respectively. If we substitute universe score U for true score, the equation for the generalizablllty coefficient (GC) Is: cr 2 < u ) GC - (2) (r 2 ( u ) + <r 2 ( E > It can be seen that the relationship among the terms remains the sane for reliability and general izablllty coefficients. The major difference Is that the relative size of the U and E terms In the GC formulation will vary depending on the number of facets defining the universe score and whether these facets are considered fixed or random facets. It was stated earlier that the second major limitation of the correlational approach to Interrater reliability Is Its Inability to distinguish different sources of error. In classical test theory there Is one complex error term. In General Izablllty Theory error variance may be Identified for each facet. Estimation of the source» of error variance Is most useful In making decisions concerning the design of future contract proposal evaluations. One can answer the question of how much Interrater reliability would be affected by Increasing or decreasing the number of raters or number of criteria, or both. The third limitation of the traditional correlational approach Is that It becomes awkward when more than two raters are used In the evaluation. The traditional approach Is to report the product moment correlation between all possible pairings of raters. In some cases an average or median correlation may be given as a single Index for the Interrater reliability. There are problems with this approach. An Individual correlation between any pair ol raters represents the reliability of the evaluation score. If either rater's score was used as the proposal's final score. In practice, this Is never done. Both raters' scores are used to yield a composite score. Consequently, the correlation between Individual rater's scores Is an underestimate of the reliability of the composite score. Since all correlations between possible pairs of ratings are underestimates, the average or median of these correlations will be an underestimate also. The extent to which the correlation underestimates the reliability of a composite score Increases as the number of raters Increases. The General Izablllty coefficient provides an index of the reliability of the composite rating. In this manner it may be noted that generalizablllty coefficients are interclass correlations (Ebel, 1951). Generalizablllty Theory, however, is an expansion of the interclass coefficient approach to allow for more complex experimental designs. 492

«or ch«(?xrxc) D«ilgn Source df SS MS Proposals (P) 8 791 98.9 Racers (R)" 3 121 40.

3 UULt I A luur by Critirlon by Proposal (PidC) ÄNOVA Maim MM PI pa 11 n M «3 Cl «C3 C1 Cl C2 C} C««C) C2 C3 Cl Ct) Cl C2 C3 «C5 1«8 E 0 p«p5 N P-/ P8 PV TABU 2 ANOVA Sumwry I.bl. «or ch«(?xrxc) D«ilgn Source df SS MS Proposals (P) Racers (R)" Criterion (C) 4 6,809 2,202.2 PR PC RC PRC I i i r

^MteMH^i An Empirical Example In this section, the Interrater reliability of five different sets of contract proposals are analyzed using the generalizab11ity theory approach.

Proposals Number of Criteria Set Evaluated ARI Raters Used A 3 5 3 B 6 3 3 c 8 4 4 D 9 4 5 E 31 3 4 To Illustrate the ANOVA method, the Interrater reliability of contract proposal evaluations set "D"

4 ^MteMH^i An Empirical Example In this section, the Interrater reliability of five different sets of contract proposals are analyzed using the generalizab11ity theory approach. The contract evaluations are actual evaluations conducted at the US Army Research Institute (ARI) and they vary along the following dimensions: Contract Proposal Number of Number of Evaluation Proposals Number of Criteria Set Evaluated ARI Raters Used A B c D E To Illustrate the ANOVA method, the Interrater reliability of contract proposal evaluations set "D" Is worked out In a step-by-step fashion. Table 1 depicts set "D" contract proposals evaluation In terms of a three-way ANOVA experimental design. Nine proposals were received, four raters were used. Each rater (R) rated all proposals (P) with respect to five criteria (C). These criteria reflect separate ratings for different aspects of the proposals, for example, technical adequacy, organizational experience, etc. Accordingly, each proposal received a total of 20 ratings (4 raters x 5 criteria). In contract proposal evaluations, raters are considered a random facet so that the final evaluation scores will generalize to the use of other raters having similar levels of expertise. The criterion facet Is considered a fixed facet In that the final evaluation scores do not generalize to other criteria. That Is, the use of some other criteria for a proposal evaluation may result In a different final rank ordering of the proposals. The proposals facet Is considered a random facet In that having more or fewer proposals would not change the score assigned to any one proposal. Table 2 presents the traditional ANOVA summary data for the actual ratings obtained In the proposal evaluation. In the traditional ANOVA, emphasis Is on the statistical tests of the "main" and "Interaction" effects by selecting the ratio of the appropriate Mean Square effect and appropriate Mean Square error term. In GeneralIzabllIty Theory the ANOVA summary table Is used only to obtain the quantities for the Mean Squares. The next step Is to compute the unique variance estimates for each facet using data In the ANOVA summary table and the formulations of the components of the Expected Mean Squares. Fortunately, there are well worked out procedures for this (Brennan, 1977). The final variance estimates for the separate facets are presented In Table 3 under the column for G-study variance estimates. GeneralIzabllIty theory distinguishes between G studies and D studies. G studies are oriented towards obtaining estimates of the various sources of error variances and G studies are characterized by random-effects ANOVA models. D studies, on the other hand, are designed to determine variance estimates In an 494

TAUU i Cliaii <«la latmtrittmt K*l lability DIM CO Ch»ng«t In tha Nuabar or Kataca or Criteria foaalbla 0 Studie» Co«mnuitt C Study Vjrluiicu I'.Ut 1 01.

5 TAUU i Cliaii <«la latmtrittmt K*l lability DIM CO Ch»ng«t In tha Nuabar or Kataca or Criteria foaalbla 0 Studie» Co«mnuitt C Study Vjrluiicu I'.Ut HC K-4 c-s C-5 K C-5 C-l K-4 C-7 <r 2 N o-' N ^ N o- 2 N 0-' I'rupoiuU (P) I U Katura (H) U UlaMiiwiuiia (C) itt.i I'H 1.» 4.O } 4.4} 1 N».» J 1.10 S 1.10 i l.ll J.91 U KC 7.J. l-kc J.7 2U * 1}.IS Tuul Uiilvutaa Varluiicv (U) rmat trrur Varluniii (ti) Cwnuralliability CuuK (ot) TADLE 4 Coapulcd General IzabllUy Coefficient for bucli of the Five Contract Proposal Evaluations Number of Katers NUMBER OF CRITERIA B.99 E.88 C.94 D.85 A

6 - -.. ^ - actual situation where some facets of the ANOVA model are fixed. While our empirical example is a D study, the results can be used to estimate G-study variances by temporarily assuming that the three facets are random effects. These estimated G-study variances can, in turn, be used to estimate variances for various D-study configurations of interest. The individual D study variance estimates are obtained by dividing the G-study variance estimates by their respective sampling frequencies. The D-study universe (U) and error (E) variances are combined according to equation 2 to compute the GO. For data set "D" with four raters (R - 4) and five criteria (C - 4), the generalizability coefficient is.85. Extrapolation of "D" to Other Evaluation Designs One can compute the extent of expected change in the GC when either (or both) the number of raters or number of criteria is changed. The necessary computation is quite easy. To determine the effect on interrater reliability of increasing the number of raters from four to six, the sampling frequency (N) is changed accordingly and the G-study variances are divided by the new sampling frequencies. This procedure is equivalent to using the Spearman-Brown prophesy formula to determine increases in reliability as test length is increased. Data in Table 3 summarize changes in the GC for data set "D" when the number of raters or criteria is changed. Increasing or decreasing the number of raters directly Increases or decreases the GC. This is because both ANOVA component«involving raters contribute to the error term. This may be contrasted to the negllgable effect resulting from changes in the number of criteria. Since criteria contribute to both the universe score variance and error variance, the GC ratio of these two terms changes little. Extrapolation to Other Evaluation Designs Using All Five s The projected changes in interrater reliability in Table 3 are based on the G-study variance estimates from one data set. Estimates of the effects of increasing and decreasing the number of raters and/or criteria on interrater reliability are strengthened to the extent that more G-study variance estimates are obtained. The procedure outlined for data set "D" was applied to the other four data sets. The computed general Izabil ity coefficients for all five data sets are presented in Table 4. The information in Table 4 can be used to compute the effects on reliability of changing the number of raters and/or criteria. Five replications of Table 4 can be estimated by using each data set independently to estimate changes in GC due to changes in the number of raters and criteria. Combining these five sets of independent estimates would yield five interrater reliability coefficients in each cell of Table 4. Moreover, the table can be expanded to provide estimates for combinations of one to seven raters and one to seven criteria. For comparison purposes, the means for each cell have been plotted in Figure 1. Figure 1 Indicates that as the number of raters used in the evaluation Increases, so does the Interrater reliability. The rate of increase decreases, however, as the number of raters exceeds five. In a similar manner, there is little effect of increasing the number of criteria beyond three. These data suggest for similar evaluations an average level of Interrater reliability of.90 can be attained by using three raters and three criteria per contract proposal evaluation. 496

NUMBER OF KATEKS.«.90. INTER-RATER.85 RELIABILITY.80 R - 1.75.70 2 3 4 5 6 Nuaber of Criteria FIGURE 1.

Generallzabillty analyses; principles and procedures (ACT Technical Bulletin No 26). Iowa City, Iowa: The American College Testing Program, 1977.

7 NUMBER OF KATEKS.«.90. INTER-RATER.85 RELIABILITY.80 R Nuaber of Criteria FIGURE 1. Int rrotcr Reliability a Function of the Number of Kacam and Crltorl». REFERENCES Brennan, R. L. Generallzabillty analyses; principles and procedures (ACT Technical Bulletin No 26). Iowa City, Iowa: The American College Testing Program, Cronbach, L. J. K., Gleser, G. C, Nanda, H.A.N. and Rajaratman, N. The dependability of behavioral measurements. New York: John Wiley and Sons, Inc., Ebel, R. L. Estimation of reliability of rating. Psychometrica, 1951, 16, pp, ^ »-..

Using Generalizability Theory to Investigate the Psychometric Property of an Assessment Center in Indonesia

Using Generalizability Theory to Investigate the Psychometric Property of an Assessment Center in Indonesia Urip Purwono received his Ph.D. (psychology) from the University of Massachusetts at Amherst,