Frame of reference rater training


In combination with goal setting and feedback, performance appraisal is one of the most effective instruments in Human Resource Management, and was found to enhance employee performance by about one-half standard deviation of the average performance level (Guzzo, Jette & Katzell, 1985). Performance appraisal is of considerable importance; for example, personnel selection, training programs, feedback to employees, promotions, and pay decisions are all based on the results of performance appraisals (Borman, 1975; Landy & Farr, 1980; Pulakos, 1984). However, the quality of performance appraisals has been questioned by past researchers: ratings were found to be biased and of questionable accuracy, reliability, and validity (Bernardin & Pence, 1980). Thus, it is important to know whether training raters can improve the quality of performance appraisals. This study first presents some background information about a particular form of rater training: frame-of-reference (FOR) training, which was first introduced and termed as such by Bernardin and Buckley (1981). A selective review of past FOR rater trainings is provided, followed by a summary of the essential components of FOR training and a discussion of crucial issues in both developing training materials and evaluation. We discuss the importance of considering alternatives to evaluation designs based on randomization and present an evaluation study of a FOR training using a double-pretest one-posttest design.

Frame of reference rater training

In one of the first studies on rater training, Latham, Wexley and Pursell (1975) examined the impact of rater error training on the reduction of rating errors such as halo, first impressions, similarity, and contrast effects. The study found that managers participating in a special workshop committed none of these rating errors, whereas managers in a group discussion or control group showed a tendency to commit some of them. Similarly, Borman (1975) investigated the effects of a short rater training on halo error and found that halo decreased significantly after a 5-minute training session. More importantly, the validity of performance ratings, measured as the correlation between raters' ratings and corresponding true scores, was not affected by this training format.

These two studies are examples of traditional rater error trainings, which consist of familiarizing participants with common rating errors and instructing raters on how to avoid them (Woehr & Huffcutt, 1994). Typical rating errors include leniency, central tendency, and halo errors. Researchers found that rater error trainings were effective in reducing rating errors (e.g., Borman, 1975; Latham et al., 1975), though this kind of training had nearly no impact on rating accuracy (Borman, 1975; Pulakos, 1984). But as accuracy is considered the most essential criterion for evaluating the quality of performance appraisals, it has been suggested that former training programs may have misleadingly focused on reducing rating errors, and that accuracy might be better improved by developing training in accordance with relevant aspects of the rating task (Pulakos, 1986, p. 77).

Frame-of-reference (FOR) training is a format that aims at providing raters with an appropriate performance standard in order to facilitate the rating task and to improve rating accuracy. Typically, the multidimensionality of performance is emphasized, followed by a definition of the relevant performance dimensions. Examples of outstanding, average, and insufficient performance are presented to provide raters with a common evaluative standard. Further, participants practice rating behaviors and receive feedback from a trainer. The primary focus of FOR training is rating accuracy, although error indices are also often computed for evaluation purposes. McIntyre et al. (1984) were among the first researchers to use a training procedure previously suggested by Bernardin and Buckley (1981) and to test the effectiveness of FOR training on rating accuracy. In fact, FOR training was more effective in enhancing rating accuracy than rater error training or no training. Similarly, Pulakos (1984) evaluated the effectiveness of rater training programs that focused on rating accuracy, on rating errors, or on a combination of error and accuracy training. Accuracy training consisted of imposing a mental performance structure on raters in order to help them categorize particular behavioral cues, whereas the rater error training program was designed to reduce halo, leniency, and contrast errors, and the combined error-accuracy program integrated aspects of both training formats. Pulakos (1984) found that the rater accuracy training resulted in significantly more accurate ratings than any other training or control format.
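To make the distinction between error reduction and accuracy concrete, the following sketch computes two common error indices alongside a simple distance-based accuracy score for a small, hypothetical ratings matrix. The data, variable names, and the particular operationalizations (leniency as the mean deviation from the scale midpoint, halo as the average within-ratee standard deviation across dimensions) are illustrative assumptions, not taken from the studies cited above.

```python
import numpy as np

# Hypothetical data: one rater's scores for 4 ratees on 3 dimensions (1-5 scale).
ratings = np.array([[4, 5, 4],
                    [4, 4, 5],
                    [2, 2, 1],
                    [3, 2, 2]], dtype=float)

# True scores for the same ratees and dimensions (e.g., expert consensus).
true_scores = np.array([[5, 4, 4],
                        [4, 3, 4],
                        [2, 1, 2],
                        [2, 2, 3]], dtype=float)

scale_midpoint = 3.0

# Leniency: mean signed deviation of all ratings from the scale midpoint;
# positive values indicate overly favorable ratings.
leniency = ratings.mean() - scale_midpoint

# Halo: average within-ratee standard deviation across dimensions;
# small values suggest the rater barely differentiates between dimensions.
halo = ratings.std(axis=1).mean()

# A simple accuracy index: mean squared deviation from the true scores
# (lower = more accurate). Reducing leniency or halo need not reduce
# this quantity, which is the point made in the text above.
accuracy = ((ratings - true_scores) ** 2).mean()

print(f"leniency = {leniency:.2f}, halo (SD) = {halo:.2f}, accuracy (D2) = {accuracy:.2f}")
```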

More recently, Woehr (1994) found significant differences between raters who participated in a FOR training and raters who participated in the control condition. The author used four accuracy measures developed by Cronbach (1955) and found more accurate ratings only on those measures that focus on individual target ratees (i.e., differential elevation and differential accuracy). Finally, a meta-analysis on rater training effectiveness conducted by Woehr and Huffcutt (1994) showed that FOR training was most effective in improving rating accuracy compared with three other training formats (i.e., rater error training, performance dimension training, and behavioral observation training). Although a variety of different rater training formats exists, past research has increasingly focused on FOR trainings rather than on alternative formats such as rater error training (Athey & McIntyre, 1987).

The most cited authors presenting recommendations for FOR training are Bernardin and Buckley (1981), McIntyre et al. (1984), and Pulakos (1984, 1986). Although these training programs differ somewhat in the material used, the training procedures are similar (see, however, Sulsky & Kline, 2007). FOR training consists of an instruction phase, where participants are introduced to the goal and content of the training, a phase where evaluative standards are practiced, and a feedback phase where the trainer provides participants with rationales for the true ratings. More specifically, a FOR rater training starts by presenting trainees with an example job description; participants are instructed to discuss the necessary qualifications of a person executing this job, and they are then given several vignettes demonstrating critical performance situations. These incidents represent outstanding, average, and unsatisfactory performance, respectively, and are rated by the participants. Afterwards, participants explain their rationales for the ratings to the other group members, and the trainer then presents the true scores for each vignette together with rationales for these scores. Finally, the group discusses discrepancies between participants' ratings and the true scores. Instead of written vignettes, Pulakos (1986) used videotapes for rater training. That training began by lecturing participants on the multidimensionality of jobs, and the participants were instructed to pay attention to these dimensions when appraising employee performance. The trainer read aloud the definitions of the five rating scales, each representing one of the relevant performance dimensions, and for each dimension the group discussed different types of behaviors indicative of a certain effectiveness level.

The group then identified differences between how a person rated at 7 might perform as compared to a person rated at 2, 3, or 5, and so on. Practice ratings with two videotaped managers followed; each participant explained his or her rating to the rest of the group, and the trainer placed the results on a flipchart. Finally, participants discussed their ratings and received feedback on their accuracy from the trainer.

Other authors have also adapted these training rationales (e.g., Day & Sulsky, 1995; McIntyre et al., 1984; Noonan & Sulsky, 2001; Roch & O'Sullivan, 2003). For example, Day and Sulsky's (1995) training procedure mainly followed the program developed by Pulakos (1984, 1986), and the training conducted by McIntyre et al. (1984) was similar to the procedure recommended by Bernardin and Buckley (1981), including three essential phases: informing participants about the job to be evaluated, practicing the ratings and feedback, and providing participants with rationales given by expert raters. Despite these similarities, there remain some important differences that may prove to be essential for the effectiveness of a rater training program. For example, McIntyre et al. (1984) used only one practice ratee instead of the three targets suggested by Bernardin and Buckley (1981). Furthermore, the presentation of the true scores was not followed by a group discussion session; this phase may be important in increasing acceptance and understanding of the underlying rationales for these ratings. Finally, the training procedure by McIntyre et al. (1984) lasted approximately 30 minutes, which appears rather short in comparison to other rater trainings. The duration of the training may be important because participants may need this time to fully understand and accept the information that is provided during the training. By contrast, the trainings conducted by Pulakos (1986), Day and Sulsky (1995), and Sulsky and Kline (2007) lasted approximately 90 minutes, a duration that appears more appropriate for an effective rater training.

Evaluation design issues

Currently, the state of the art is far from conclusive concerning the effectiveness of rater trainings in general and FOR trainings in particular (Woehr & Huffcutt, 1994).

Although recent research has concluded that the effects of FOR training on accuracy are overwhelmingly positive (Roch & O'Sullivan, 2003), the body of available evidence is still small. For example, the widely cited meta-analysis by Woehr and Huffcutt (1994) included only a total of six studies (N = 365) for FOR trainings. Therefore, we are reluctant to draw conclusions about the generalizability of FOR trainings on such a basis. Rather, regular evaluations of rater trainings are necessary.

There are several reasons why we currently see a necessity for evaluating the effectiveness of rater trainings. First, rater trainings usually concentrate on a specific type of job and a specific target group (e.g., students rate the performance of university professors). This means that even if raters are successfully trained to rate the performance of professors accurately, these raters still might not be able to produce accurate ratings for a different type of job (e.g., rating a secretary's performance), and the respective training rationale cannot be transferred to other trainee groups and/or target jobs without considerable modifications. Second, rater trainings are usually embedded within a specific organizational context. In fact, according to Bernardin and Buckley (1981), it is important to identify raters who apply idiosyncratic standards by comparing their ratings of critical work behaviors to normative ratings ("true scores"). Each organization possesses its own set of particular standards for appraising performance; for example, a behavior that is appropriate in one firm may be inconceivable in another. Therefore, rater training has to take particular organizational standards into consideration. (Note that the core of FOR training is the development of common evaluation standards between raters.) Third, the number of available studies is meagre, and generalizability is low, both because specific standards for FOR trainings are still not reliably developed and because the statistical power of most studies is not very high (which is not uncommon within the area of training research in general; Yang, Sackett & Arvey, 1996).

Provided that evaluations of rater trainings are recommendable, the question becomes: Which design should be used? To date, most studies have examined the effects of FOR trainings within experimental designs with randomization (e.g., McIntyre, Smith & Hassett, 1984; Pulakos, 1984, 1986). Of course, experimental evaluation designs have advantages, and randomization should not be frivolously abandoned (Shadish et al., 2002, p. 488).

Nevertheless, within organizations the evaluation of trainings faces two challenges: restricted sample sizes and practical difficulties with implementing a randomized control group. For example, there are often only small groups of employees (e.g., supervisors) who are eligible for training, and treatment diffusion, especially in the case of highly attractive interventions, remains a serious problem. Therefore, one alternative is a one-group pretest-posttest design. In fact, FOR trainings have also been evaluated by comparing pretest and posttest results (Hedge & Kavanagh, 1988; Stamoulis & Hauenstein, 1993). This kind of pre-experimental design has been both severely attacked and defended (Sackett & Mullen, 1993). There are various threats to the validity of the conclusions that might be drawn, e.g., history, maturation, regression to the mean, and testing (cf. Shadish, Cook & Campbell, 2002). As with every design, researchers have to consider these threats closely. In the case of most rater trainings, history and maturation effects can be excluded because the evaluation data are typically collected directly before and after the training. However, regression to the mean and testing effects remain issues that may complicate matters. Regression to the mean might occur because participants are often selected on the basis of their initially low scores (those who are trained are typically those who "need it most"). And given that participants know that they are being or have been trained, testing effects are another possible issue with rater trainings. In such situations, a double pretest is a possible solution: the first pretest can be considered a dry run (Shadish et al., 2002), and an increase of test scores from the first to the second pretest might indicate testing and/or regression effects. The following study is an evaluation of a FOR rater training within a double-pretest one-posttest design.
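The diagnostic role of the dry-run pretest can be illustrated with a brief sketch. The data below are simulated for illustration only (the effect sizes and variable names are assumptions, not results from the study): if accuracy scores already shift reliably between the two pretests, testing and/or regression effects are plausible, and the training effect should be judged against the second pretest.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated accuracy scores (lower = more accurate) for 24 trainees.
pretest1 = rng.normal(1.20, 0.25, size=24)
pretest2 = pretest1 + rng.normal(-0.05, 0.15, size=24)   # small drift between pretests
posttest = pretest1 + rng.normal(-0.40, 0.20, size=24)   # improvement after training

# Dry-run check: a reliable pretest 1 -> pretest 2 shift suggests testing
# or regression effects rather than a training effect.
t_dry, p_dry = stats.ttest_rel(pretest1, pretest2)

# Training effect estimated against the second (post-dry-run) pretest.
t_train, p_train = stats.ttest_rel(pretest2, posttest)

print(f"pretest 1 vs. pretest 2: t = {t_dry:.2f}, p = {p_dry:.3f}")
print(f"pretest 2 vs. posttest:  t = {t_train:.2f}, p = {p_train:.3f}")
```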

Method

Development of true scores

The most important method for developing true scores (Sulsky & Balzer, 1988) is the use of expert raters. Therefore, the present study uses expert ratings that were collected in four groups, each group containing two experts. Experts were defined as supervisors whose background was in human resources or psychology and who had appraising experience (for example, by appraising two or more secretaries). It was considered important that these experts had had the opportunity to observe the performance of different secretaries in the past, in order to compare positive and negative behaviors and to form appropriate rating standards. Ratings were made by teams consisting of two experts. First, expert ratings were conducted within two of the four expert teams, with each team making its ratings independently. Following this, the ratings from the other two teams were collected; after having completed their ratings, these last two teams obtained the results from the first two teams and were asked to discuss any deviations. Within their own team, experts were asked to agree on one score for each dimension and on one global rating as well. The detailed rating procedure is described in the following paragraph.

A few days before they met with the first author, the expert raters were given a definition of the three performance dimensions and of the scale anchors, in order to ensure that every expert rater had the same information. At the beginning of the meeting, all important aspects were reviewed and the definitions of the dimensions and scales were explained once more. The video clips were then shown to the expert raters in a randomly determined order, and after having viewed a clip, each rater made his or her own rating and noted the result on a sheet. The raters were then asked to reveal their ratings, discuss differences, and agree on one score for every dimensional and global rating they had to make. The trainer documented the final results and the rationales for making the ratings. Moreover, experts were asked how a secretary should behave in order to reach a higher or lower score on a particular scale; this was important for distinguishing, for example, a behavior that was rated 4 from a behavior that was rated 5. The final true scores were obtained by identifying, for each dimensional and global rating, the scale point that received the most endorsements across the expert ratings; thus, each true score represents the mode of all four expert teams. Interrater agreement among the four expert teams was ICC(4) = .98 (McGraw & Wong, 1996).
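As an illustration of how such consensus true scores can be derived, the sketch below computes the modal rating across expert teams for each clip and scale, together with a coarse agreement check. The data are hypothetical and the agreement index is a simplification; they are not the study's actual expert ratings or its ICC computation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical consensus ratings from 4 expert teams for 12 clips on
# 4 scales (3 dimensions + global), on the 1-5 scale used in the study.
team_ratings = rng.integers(1, 6, size=(4, 12, 4))   # (teams, clips, scales)

# True score per clip and scale = modal rating across the four teams.
true_scores = stats.mode(team_ratings, axis=0, keepdims=False).mode

# Coarse agreement check: proportion of team ratings matching the mode.
agreement = (team_ratings == true_scores).mean()

print(true_scores.shape)                    # (12, 4)
print(f"exact agreement with the mode: {agreement:.2f}")
```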

Participants

Graduate students with organizational psychology as a major (N = 24) voluntarily enrolled in one of the training sessions and received course credit for their participation. The mean age of the group was 24.6 years (SD = 3.15), ranging from 21 to 34 years. Most of the students were female (n = 15; 63%). Sixteen students (67%) studied social sciences, seven (29%) studied management, and one student (4%) studied business education. The vast majority (79%) was employed part-time when the study took place. Eleven of the 24 trainees (46%) reported previous experience with being formally appraised by others in a work context, and four participants (17%) reported having actively appraised other individuals in the past.

Stimulus material

The video material used in the present study consisted of twelve videotaped scenes exhibiting different work situations. In each clip, one of two secretaries was presented either alone or in interaction with a supervisor. The scenes ranged from 19 to 84 seconds in duration and showed how the secretaries accomplished various job tasks. Two sample scripts of the video scenes can be found in Appendix A. In each of the three evaluation segments of the training (i.e., pretest 1, pretest 2, posttest), four videotaped scenes were presented to the participants. Two of the four scenes showed a secretary delivering superior performance, whereas the other two scenes showed a secretary delivering rather poor performance.

Supervisors usually observe their employees in a more or less random order; that is, they interact with different individuals at different times and cannot observe behaviors that are concealed from them, so their observations are limited. Following this logic, the video scenes were presented within each test phase in a randomly determined order. Pretest 1 was conducted one week before pretest 2; pretest 2, the training, and the posttest were integrated into one session lasting approximately 120 minutes. Each training session was conducted in small groups of two to six participants.

Performance dimensions and levels

The general latent structure of job performance is described by Campbell (1990) in terms of eight distinct dimensions: job-specific task proficiency, non-job-specific task proficiency, written and oral communication, demonstrating effort, maintaining personal discipline, facilitating peer and team performance, supervision, and management or job administration. Campbell, McCloy, Oppler and Sager (1993) point out that these eight factors sufficiently represent the general structure of job performance, although it must be emphasized that the relevance of each dimension depends upon the specific occupation. The ratings in the present study focus on three of these performance dimensions, namely job-specific task proficiency, demonstrating effort, and oral communication. Job-specific task proficiency is the degree to which the person (i.e., the secretary) can complete the core tasks of the job; these tasks distinguish the secretary's job from other jobs and include, for example, organizing appointments, making telephone calls, and welcoming the supervisor's clients. Demonstrating effort describes the intensity and perseverance of a person when accomplishing a task, especially under aggravating conditions. Oral communication captures the ability to express ideas and problems properly and to exchange information with other individuals in an effective way. Moreover, participants were also asked to provide a global rating for each videotaped scene. Positive correlations across different dimensions of job performance were found in empirical research conducted by Viswesvaran (1993, as cited in Viswesvaran & Ones, 2000). The global ratings were primarily collected for exploratory purposes in the current study.

Former studies focusing on FOR training typically used 7-point rating scales (e.g., Athey & McIntyre, 1987; McIntyre et al., 1984; Pulakos, 1984, 1986) based on Borman's (1977) recommendations and materials. Lievens (2001) used videotaped targets in an assessment center context and a 5-point rating scale ranging from 1 (poor) to 5 (excellent). As the relatively short video clips in the present study exhibit only brief extracts of the secretaries' performances and thus include only a restricted amount of information, a 7-point rating scale appeared inappropriate. Therefore, the performances were appraised on a 5-point scale with the anchors 1 (poor), 2 (sufficient), 3 (average), 4 (good), and 5 (excellent). For example, a secretary exhibited an average performance when she accomplished the tasks sufficiently without committing serious mistakes and thus contributed to a smooth workflow; the key aspect of an average performance was that the individual executed orders rather than demonstrating proactive behaviors. In contrast, an excellent performance was shown by a secretary who anticipated future steps and facilitated the workflow, and a poor performance was shown by a secretary who committed serious mistakes and thus interrupted a smooth workflow.

Accuracy measures

The present study assessed six different types of accuracy. Cronbach's (1955) four component indices (i.e., elevation, or E; differential elevation, or DE; stereotype accuracy, or SA; and differential accuracy, or DA) were computed, as well as distance accuracy and the overall D index. In general, accuracy is defined as the strength and kind of relation between one set of scores (e.g., participants' ratings) and a corresponding set of scores (e.g., true scores) that is accepted as a standard for comparison (Guion, 1965). A common way of computing accuracy measures is to determine distances between individuals' ratings and the ratings accepted as the standard. Applied to the context of performance appraisal, rating accuracy is the difference between a rater's performance scores for n ratees on k dimensions and the corresponding true scores determined by expert raters. The smaller the difference between a participant's and an expert's rating, the more accurate the appraisal is. Therefore, lower scores on accuracy measures indicate better accuracy. In the following paragraphs, a summary of the accuracy measures used in the present study is presented.

D² and Cronbach Accuracy Component Scores

The four Cronbach accuracy measures have been used in several former studies on performance appraisal (e.g., Pulakos, 1984, 1986; Sulsky & Day, 1994; Stamoulis & Hauenstein, 1993) and are based on the D² index, an accuracy index representing the squared difference between subject ratings x and true scores t, averaged across n ratees and k dimensions:

$$D^2 = \frac{1}{nk}\sum_{i=1}^{n}\sum_{j=1}^{k}\left(x_{ij} - t_{ij}\right)^2$$

Cronbach (1955) criticized that the D² index provides only aggregated information about accuracy, as it summarizes rating deviations across ratees and dimensions and therefore collapses potentially meaningful information. Cronbach's (1955) solution to this problem was to decompose this overall accuracy measure into four separate accuracy scores with independent conceptualizations. These accuracy components are based on the logic of a two-way analysis of variance, with each component representing a part of the deviation between participants' ratings and true scores (Sulsky & Balzer, 1988):

1) elevation (E), expressing the differential grand mean;
2) differential elevation (DE), expressing the differential main effect of ratees;
3) stereotype accuracy (SA), expressing the differential main effect of dimensions; and
4) differential accuracy (DA), expressing the differential ratee x dimension interaction.

The formulas are as follows (Cronbach, 1955):

$$E^2 = \left(\bar{x}_{..} - \bar{t}_{..}\right)^2$$

$$DE^2 = \frac{1}{n}\sum_{i=1}^{n}\left[\left(\bar{x}_{i.} - \bar{x}_{..}\right) - \left(\bar{t}_{i.} - \bar{t}_{..}\right)\right]^2$$

$$SA^2 = \frac{1}{k}\sum_{j=1}^{k}\left[\left(\bar{x}_{.j} - \bar{x}_{..}\right) - \left(\bar{t}_{.j} - \bar{t}_{..}\right)\right]^2$$

$$DA^2 = \frac{1}{nk}\sum_{i=1}^{n}\sum_{j=1}^{k}\left[\left(x_{ij} - \bar{x}_{i.} - \bar{x}_{.j} + \bar{x}_{..}\right) - \left(t_{ij} - \bar{t}_{i.} - \bar{t}_{.j} + \bar{t}_{..}\right)\right]^2$$

where $x_{ij}$ and $t_{ij}$ = rating and true score for ratee i on item j; $\bar{x}_{i.}$ and $\bar{t}_{i.}$ = mean rating and mean true score for ratee i; $\bar{x}_{.j}$ and $\bar{t}_{.j}$ = mean rating and mean true score for item j; and $\bar{x}_{..}$ and $\bar{t}_{..}$ = mean rating and mean true score over all ratees and items. Note again that a lower score indicates greater accuracy. In the present study, the secretaries' performances will be assessed on three dimensions, and raters will additionally provide an overall rating, such that four ratings will be made for each video scene. As the elevation and differential elevation components both provide figures aggregated across dimensions, separate accuracy measures will be computed for dimensional and global ratings, whereas stereotype accuracy and differential accuracy can only refer to dimensional ratings.
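To make the decomposition concrete, the following sketch implements the four component scores and the overall D² for a single rater. The data and variable names are hypothetical; the function is an illustration of the formulas above rather than the scoring code used in the study.

```python
import numpy as np

def cronbach_components(x, t):
    """Cronbach (1955) accuracy components for one rater.

    x, t: arrays of shape (n_ratees, k_dimensions) with the rater's ratings
    and the corresponding true scores. Lower values indicate greater accuracy.
    Returns the squared components (E^2, DE^2, SA^2, DA^2) and the overall D^2.
    """
    dx, dt = x - x.mean(), t - t.mean()                       # grand-mean centered

    e2 = (x.mean() - t.mean()) ** 2                           # elevation
    de2 = ((dx.mean(axis=1) - dt.mean(axis=1)) ** 2).mean()   # ratee main effect
    sa2 = ((dx.mean(axis=0) - dt.mean(axis=0)) ** 2).mean()   # dimension main effect

    # Ratee x dimension interaction residuals for ratings and true scores.
    rx = dx - dx.mean(axis=1, keepdims=True) - dx.mean(axis=0, keepdims=True)
    rt = dt - dt.mean(axis=1, keepdims=True) - dt.mean(axis=0, keepdims=True)
    da2 = ((rx - rt) ** 2).mean()

    d2 = ((x - t) ** 2).mean()                                # overall D^2
    return e2, de2, sa2, da2, d2

# Hypothetical example: 4 ratees (clips) rated on 3 dimensions (1-5 scale).
x = np.array([[4, 5, 4], [4, 4, 5], [2, 2, 1], [3, 2, 2]], dtype=float)
t = np.array([[5, 4, 4], [4, 3, 4], [2, 1, 2], [2, 2, 3]], dtype=float)
print(cronbach_components(x, t))
```

For global ratings, which form a single column, only E and DE are meaningful, which mirrors the distinction between dimensional and global ratings drawn above.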

Distance Accuracy

As an alternative to the four Cronbach (1955) components, some researchers developed a modified form of the overall D² index. This measure is termed distance accuracy and represents the average absolute deviation of subject ratings from true scores. Distance accuracy was defined by McIntyre et al. (1984) as follows:

$$\text{Distance Accuracy}_k = \frac{\sum_{i=1}^{n}\sum_{j=1}^{d}\left|r_{ijk} - t_{ij}\right|}{nd}$$

where k refers to the kth rater, n is the number of ratees, d is the number of dimensions, r refers to subject ratings, and t refers to true scores. Although the distance accuracy measure is similar to the overall D² index, it is not equal to the square root of this measure. A comparison between these two indices is rather difficult, as the D² index is computed by squaring the differences between trainees' and experts' ratings prior to aggregating them; unlike the distance accuracy index, the D² index therefore weights large deviations between ratings disproportionately.

From a theoretical point of view, it must be noted that each index introduced in this section claims to measure accuracy. If this assumption is true, high correlations among accuracy measures should result. However, low correlations among accuracy indices have been found in previous studies (e.g., Murphy, Garcia, Kerkar, Martin & Balzer, 1982; Sulsky & Balzer, 1988). Therefore, we decided to compute a variety of accuracy scores. At the same time, FOR training should not be expected to improve all types of rating accuracy (Stamoulis & Hauenstein, 1993). For example, FOR training should not be expected to improve the elevation component, which represents the mean rating over all ratees and dimensions, or the differential elevation component, which is the accuracy of the mean rating given to each ratee across all performance dimensions. Of the four Cronbach (1955) components, only differential accuracy and stereotype accuracy should be improved by FOR training. Differential accuracy measures a rater's ability to make correct ratings for each ratee on each performance dimension according to a set of comparison scores. Stereotype accuracy refers to the average rating for each performance dimension collapsed across ratees, i.e., it depicts the extent to which raters are able to accurately differentiate between performance dimensions.
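A minimal sketch of this difference, using the same hypothetical data format as above (names and values are illustrative assumptions), shows how the two indices diverge once large deviations occur:

```python
import numpy as np

def distance_accuracy(r, t):
    """Mean absolute deviation of one rater's ratings r from true scores t.

    r, t: arrays of shape (n_ratees, d_dimensions). Lower = more accurate.
    """
    return np.abs(r - t).mean()

def overall_d2(r, t):
    """Overall D^2: mean squared deviation, which weights large errors more."""
    return ((r - t) ** 2).mean()

r = np.array([[4, 5, 4], [2, 2, 1]], dtype=float)
t = np.array([[5, 3, 4], [2, 1, 3]], dtype=float)

# With deviations of 0, 1, and 2 rating points, the two indices diverge:
print(distance_accuracy(r, t))   # 1.00
print(overall_d2(r, t))          # about 1.67 -- squaring inflates the 2-point errors
```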

Procedure

The purpose of the training was to create a common performance theory (i.e., a frame of reference) among raters such that they would agree on a particular standard for evaluating the performance of the two secretaries. Furthermore, the goal was to encourage group members to actively participate in group discussions. Moreover, it was important to provide participants with rating experience and feedback about the accuracy of their appraisals. In the following section, the training procedure is described in detail. It should be noted that the first session included the introduction to the training, the presentation of the relevant performance dimensions, and the first pretest; the second pretest, the FOR training and feedback, and the posttest were part of the second session.

Introduction to the training (session 1)

Session 1 started with an introduction to the topic of performance ratings. Raters were given a synopsis of what rater trainings are and were told that the basis of this training consisted of short video clips exhibiting the performances of two secretaries. Further, the trainer explained the importance of performance appraisals in organizations, for example, that such ratings are used for decisions concerning remuneration, promotion, personnel development, and the selection of employees. Participants were told that the overall goal of the study was to gain more information about how people evaluate the performance of other individuals in work situations and to provide participants with experience in evaluating job performance. In order to enhance the acceptance of the expert ratings that served as feedback to participants, the trainer presented some facts characterizing the individuals chosen as experts, such as information about their professional position and their experience with performance appraisal. At the end of this introduction, trainees filled out a general questionnaire containing questions about demographic data (i.e., age, gender, course of studies) and their previous experience with performance appraisals.

Explaining the performance dimensions (session 1)

During the second part, participants were introduced to their role: a general manager who assesses a secretary's performance. The group was given the information that the secretaries' performances are not appraised solely on a global level, but that performance is considered a multidimensional construct. The trainer asked the participants which performance dimensions they thought were important for the job of a secretary; this question was expected to increase activity and identification with the appraisal task. Afterwards, the three performance dimensions to be assessed during the sessions were revealed to the group (the secretary's professional competency, her ability to communicate, and her effort in fulfilling a task).

Definitions of these three dimensions were presented on a flip chart and explained by the trainer. Moreover, the participants were told that the three dimensional ratings and the overall rating were to be made on 5-point scales ranging from 1 (poor) to 5 (excellent). Short definitions of the scale anchors were given to the group in order to ensure a clear understanding of these terms. After having presented these definitions, the trainer explained to the group how to fill out the performance rating sheets. This was followed by some general instructions concerning the videos; for example, participants were asked not to communicate while making their ratings, in order to ensure that every individual appraised the performance according to his or her own impression.

Pretest no. 1 (session 1)

In order to give raters an impression of what they were supposed to assess during the training, four work situations on videotape were presented to the trainees, with each secretary appearing in two of the video clips. One target primarily exhibited job performance below average, whereas the other target mainly performed at a level above average. Raters evaluated the secretaries' performances after each clip and filled out the first appraisal sheet. These ratings were not discussed, since participants were asked to appraise similar video clips in the second pretest one week after the first session. The trainer then collected the appraisal sheets and concluded the first session by mentioning that a variety of different work situations would be discussed in the second session.

Pretest no. 2 (session 2)

One week after the first session, the group met again for the second pretest, the training, the posttest, and the evaluation of the training. In order to refresh the information given to the participants in the first session, the group briefly reviewed the short definitions of the performance dimensions and performance levels. After the general instructions had been repeated, the trainer showed the next four video clips. These four clips exhibited similar situations at similar performance levels and covered the same performance dimensions as the first pretest.

The raters appraised the performances presented in the video clips and copied their ratings onto a second sheet. Finally, the original appraisal sheets were collected by the trainer, who explained that the group would discuss the results later during the training.

Frame-of-reference training, group discussions and feedback

The training continued by developing detailed definitions of the performance dimensions, in accordance with recommendations by McIntyre et al. (1984) and Pulakos (1984), who pointed out the need to familiarize raters with the performance dimensions. For this purpose, more detailed behaviors belonging to the relevant dimensions were presented to participants on slides in order to convey a clearer understanding of the different performance dimensions. Afterwards, the group reviewed the video clips of the first and second pretest, and after each clip participants indicated to the rest of the group the ratings they had noted on a separate sheet. The trainer recorded the participants' results on a slide in order to offer a visualization for the whole group. Participants then explained why they had decided on a certain rating and which cues in the secretaries' performances had been helpful for this decision. The trainer revealed the true scores from the experts' ratings and indicated differences between the ratings of participants and experts. Afterwards, participants were asked to give their opinion about the discrepancies between their own ratings and the experts' results and whether the expert ratings appeared reasonable to them. This procedure was repeated for every clip of the first and second pretest. The group was now expected to possess a common performance theory of a secretary's job.

Posttest

Immediately after the training, participants appraised four further video sequences. Note that at this point in time, the performance of the targets varied from the performance levels shown in the two pretests. Participants again made their ratings and copied them onto a separate sheet of paper in order to compare them with the rest of the group afterwards. The rating sheets were collected, and participants explained their ratings to the rest of the group.

The trainer again wrote down the participants' results on a slide in order to offer a visual aid for the whole group. Afterwards, participants were asked to give their opinion about the discrepancies between their own ratings and the experts' results and whether the expert ratings appeared reasonable to them. The experts' results and rationales for the video clips were given to the group and served as feedback.

Trainee reactions

Satisfaction with the present FOR training was measured with eleven items covering various aspects of the training sessions, such as the content of the training and material, the trainer's competence, the learning environment, the ambiance in class, the structure of the training, and the balance between presentation and group discussions. All answers were measured on 5-point scales ranging from 1 (absolutely true) to 5 (absolutely not true). Additionally, participants were required to provide an overall rating of the training; this item was measured on a 5-point scale with the anchors 1 (poor), 2 (sufficient), 3 (satisfactory), 4 (good), and 5 (excellent).

Results

To examine the effectiveness of the present FOR training, contrast analyses for repeated measures were conducted for various accuracy indices, namely the Cronbach accuracy components (i.e., elevation, differential elevation, stereotype accuracy, and differential accuracy), the overall D index, and distance accuracy. Note that lower scores on all of these indices denote greater accuracy. It was assumed that participants would rate less accurately in the two pretests and more accurately in the posttest, with no change in accuracy expected between the first and the second pretest. The expected pattern is represented by the contrast weights +1, +1, -2, such that no change occurs from the first to the second pretest and a decrease in accuracy scores, representing an improvement after the FOR training, occurs in the posttest (see Appendix B). According to Furr and Rosenthal (2003), the effect size computed in this analysis is r_contrast. The hypothesis of a match between observed data and expected pattern is tested with a one-sample t-test.¹

¹ The test value for this one-sample t-test was zero. The null hypothesis is that there is no difference between the mean of the L scores and the test value zero; the alternative hypothesis is that there is such a difference. The L score represents the degree to which a participant exhibits the expected pattern of scores (Furr & Rosenthal, 2003). Appendix B explains the computation of L scores.
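A sketch of this analysis is shown below. The scores are simulated and the function names are arbitrary; only the contrast weights, the zero test value, and the conversion of t to r_contrast follow the description above.

```python
import numpy as np
from scipy import stats

def contrast_test(scores, weights=(1, 1, -2)):
    """Repeated-measures contrast analysis via L scores (Furr & Rosenthal, 2003).

    scores: array of shape (n_participants, 3) with accuracy scores at
    pretest 1, pretest 2, and posttest (lower = more accurate).
    Returns t, p, and r_contrast from a one-sample t-test of the
    per-participant L scores against zero.
    """
    w = np.asarray(weights, dtype=float)
    L = scores @ w                                 # one L score per participant
    t, p = stats.ttest_1samp(L, 0.0)
    df = len(L) - 1
    r_contrast = np.sqrt(t ** 2 / (t ** 2 + df))
    return t, p, r_contrast

# Simulated accuracy scores for 24 participants across the three phases.
rng = np.random.default_rng(2)
base = rng.normal(1.0, 0.2, size=(24, 1))
scores = np.hstack([base, base + 0.05, base - 0.35]) + rng.normal(0, 0.1, (24, 3))

print(contrast_test(scores))                        # expected pattern +1, +1, -2
print(contrast_test(scores, weights=(1, 2, -3)))    # alternative pattern considered below
```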

An overview of descriptive statistics for the computed accuracy measures is presented in Table 1.

Table 1

The results of the contrast analysis for Cronbach's elevation component turned out to be non-significant for both dimensional and global ratings (t = 1.44, p = .163 for dimensional and t = -1.47, p = .156 for global ratings). Furthermore, the effect size shows only a rather weak association between the observed data and the expected pattern for both dimensional and global ratings (r_contrast = .29 in both cases). Results of the contrast analysis for Cronbach's differential elevation component are similar: the hypothesis of a match between observed and expected data cannot be supported (t = -.26, p = .794 for dimensional and t = -.06, p = .956 for global ratings). The effect sizes for both dimensional and global ratings are negligible (r_contrast = .06 and r_contrast = .01, respectively).

So far, results have been presented for two accuracy measures that cannot be recommended for evaluating rating accuracy with respect to FOR training (Stamoulis & Hauenstein, 1993; Day & Sulsky, 1995). Unlike these accuracy indices, stereotype accuracy and differential accuracy should improve after such a training. In fact, the contrast analysis results for stereotype accuracy are significant (t = 4.71, p < .001), and the effect size indicates a strong association between the observed data and the expected pattern (r_contrast = .70). Contrary to expectations, stereotype accuracy scores even increased from the first to the second pretest before decreasing after the training (see Table 1); this means that participants rated less accurately in the second pretest than in the first. Like stereotype accuracy, differential accuracy significantly improved (t = 11.39, p < .001). The large effect (r_contrast = .92) suggests that the present data substantially match the expected pattern. Similar to stereotype accuracy, differential accuracy scores increased in the second pretest before decreasing in the posttest (see Table 1 for descriptive statistics).

Figure 1 presents a visual comparison between differential accuracy (DA) and stereotype accuracy (SA); basically, the distances between pretest and posttest scores are greater for differential accuracy, whereas for stereotype accuracy they are relatively smaller.

Figure 1

Finally, contrast analyses were also computed for the overall D index and for distance accuracy. Although these accuracy measures have hardly been reported in previous studies on rater training, they were computed in the present study in order to compare them with the traditional Cronbach components. Interestingly, the contrast analysis on the overall D index for dimensional ratings is significant (t = 3.45, p = .002), whereas for global ratings no significant effect emerged (t = .42, p = .678). Moreover, the effect size exhibits a strong relation between observed and expected data for dimensional ratings, but a weak one for global ratings (r_contrast = .58 and r_contrast = .09, respectively). The descriptive statistics in Table 1 show that the accuracy of dimensional ratings continually improved across the three test phases; for global ratings, participants rated more accurately in the second pretest but less accurately in the posttest. The contrast analyses for distance accuracy are not significant (t = -.88, p = .390 for dimensional and t = 1.22, p = .234 for global ratings), and the relationship is rather weak for both dimensional and global ratings (r_contrast = .18 and r_contrast = .25, respectively). The descriptive statistics in Table 1 show that distance accuracy for both dimensional and global ratings improved in the second pretest; in the posttest, participants provided slightly more accurate global ratings, whereas dimensional ratings were less accurate.

As noted above, participants rated less accurately in the second pretest on stereotype accuracy and differential accuracy. Therefore, the strong effect sizes of r = .70 for stereotype accuracy and r = .92 for differential accuracy raise the question of how exactly to interpret these results. In fact, Furr and Rosenthal (2003) emphasize that large effect sizes are not necessarily associated with a perfect relation between observed data and expected pattern.

Given that participants generally rated less accurately in the second pretest, we compared our results with an alternative model that takes a decrease in accuracy between the first and the second pretest into account (the respective pattern is represented by the contrast weights +1, +2, -3). The general pattern of results did not change: again, only the scores for stereotype accuracy (t = 5.23; r_contrast = .74), differential accuracy (t = 14.63; r_contrast = .95), and the D index (dimensional ratings; t = 2.96; r_contrast = .53) were significant. A test of the difference between the two patterns can be computed for exploratory purposes (however, see Rosenthal, Rosnow & Rubin, 2000, pp. 170). The modified contrast pattern provides a better fit for both stereotype accuracy (t = 2.79; p < .05; r = .50) and differential accuracy (t = 6.06; p < .01; r = .78), though a significantly worse fit for the D index (dimensional ratings: t = ; p < .05; r = .57).

Satisfaction with rater training

The highest ratings were attained for the trainer's preparation (M = 4.96; SD = 0.20), the balance between presentation and group discussion (M = 4.67; SD = 0.48), and the structure of the training (M = 4.67; SD = 0.48). The lowest ratings were given for the questions concerning cooperation in the group (M = 4.08; SD = 0.83) and the speed of the training (M = 4.33; SD = 0.82). Note that although ratings on these items were lowest, they should not be considered negative, since a 4 still indicates a good appraisal. Finally, the mean overall rating for the training was 4.58 (SD = 0.50), indicating that the training was on average evaluated as good to excellent.

Discussion

So far, most evaluation studies have used control group designs in order to detect differences between untrained and FOR-trained raters, whereas some have examined improvements in accuracy by comparing pretest and posttest results (e.g., Pulakos, 1984, 1986; McIntyre et al., 1984). Given the limits of both types of designs, we presented an evaluation study based on a double-pretest one-posttest design (Shadish et al., 2002).

Consistent with previous research, some of the accuracy components improved through FOR training whereas others did not (Stamoulis & Hauenstein, 1993; Sulsky & Day, 1994). More specifically, stereotype accuracy and differential accuracy showed significant improvements in rating accuracy. According to the logic of analysis of variance (ANOVA), stereotype accuracy is the differential main effect of dimensions and represents the accuracy of the average rating across ratees for each performance dimension (Pulakos, 1986). Thus, a significant improvement in stereotype accuracy in the posttest indicates that participants were able to rate the different performance dimensions more accurately after the training. Moreover, differential accuracy also improved after the FOR training. According to the same ANOVA logic, differential accuracy is the differential ratee x dimension interaction and represents a rater's ability to appraise different targets accurately on different performance dimensions (Pulakos, 1986); here, both sensitivity to ratee differences and the distinction between dimensions are important. Day and Sulsky (1995) even suggested restricting the measurement of accuracy to differential accuracy, because FOR training is supposed to result in accurate ratee impressions on different performance dimensions. From a theoretical point of view, differential accuracy is clearly a relevant component because it measures a ratee x performance dimension interaction (Sulsky & Kline, 2007). As the other Cronbach (1955) components provide accuracy measures aggregated across ratees (stereotype accuracy), dimensions (differential elevation), or both ratees and dimensions (elevation), they appear less appropriate for measuring whether a person correctly rates a target on a specific dimension. However, stereotype accuracy can also be interpreted as a measure of the appropriate use of the rating dimensions, i.e., of how they relate to each other, which is a central function of FOR trainings. Therefore, at least for the evaluation of FOR trainings, both differential accuracy and stereotype accuracy are appropriate accuracy measures. In the current study, the D index (dimensional) also changed in the expected direction. Note that this score is a general measure of accuracy that summarizes the four accuracy components, and because the training was expected to improve two of these components, a certain amount of change in the D index (dimensional) should be found.

At the same time, the current study clearly demonstrates that the decomposition of accuracy into four components (Cronbach, 1955) is much more informative.

Two further measures of accuracy were not analyzed in the present study. One of them is Borman's differential accuracy, which is computed by correlating a rater's ratings for each dimension with the corresponding true scores across ratees. Although it is based on the idea of Cronbach's differential accuracy component, it provides only correlational information and therefore neglects differences between participants' and experts' ratings. Sulsky and Balzer (1988) recommend that Borman's differential accuracy not be considered an accuracy index because it is not based on the psychometric definition of accuracy provided earlier. Likewise, halo error was not computed in the present study. Halo is defined as the difference between the variance of an expert's ratings and the variance of a participant's ratings, averaged across dimensions and ratees; this measure is insensitive to distances between raters' scores and true scores and hence does not qualify as an appropriate accuracy measure.

Contrary to expectations, stereotype accuracy and differential accuracy deteriorated from the first to the second pretest. There are various explanations for this phenomenon. First, the design of the study required that the work situations presented in the second pretest be similar to the situations in the first pretest. Participants might have been irritated by the similarity of the rating scenes, which could have induced them to generate variance in their ratings because they believed that stability in performance was not possible, was not wanted by the trainer, or was just too simple for such a study. Second, another reason for the deteriorated accuracy in the second pretest may lie in raters' motivation to provide accurate ratings. More specifically, it is possible that participants were frustrated by the procedure of the rater training: at the beginning of the second session, raters first had to appraise further videotaped performances before the actual training started. This might have demotivated trainees who were waiting for feedback, for general input on how to become an accurate rater, or for information about how experts appraise the targets' performances. This frustration could have contributed to less accurate ratings.

Further research should scrutinize the stability of this decrease in accuracy in double-pretest designs when participants expect a training session. In the current context, this retesting effect does not undermine the conclusion that the rater training was effective; in fact, the effect size might even be underestimated.

Limitations

As with any study, there are several limitations to the present research. First, generalizability to performance situations in organizations is at issue because the participants of the present study were students and not supervisors. However, consistent with the majority of research on rater training (e.g., Roch & O'Sullivan, 2003), we used a student sample, and almost all trainees had some job experience. Second, the training material consisted of videotaped behavior. In order to generalize these findings to live situations in organizations, the question must be answered whether videotaped behaviors are rated similarly to live performances. Heslin et al. (2005) also used videotapes for analyzing performance appraisals and summarized findings that indicate the generalizability of videotaped performances. For example, Lifson (1953, as cited in Heslin et al., 2005) found that filmed performances are rated similarly to live performances, and Ryan et al. (1995, as cited in Heslin et al., 2005) found that rating accuracy does not differ between live and videotaped performances. Therefore, videotaped performances appear to be a generally appropriate stimulus material for conducting research on performance appraisals. Although past research reported that individuals do not differ in rating live or videotaped performances, it is possible that an individual rates differently in an appraisal situation than in a natural environment. Under normal conditions, raters interact with the target persons, as a supervisor does who communicates with employees in day-to-day situations. Therefore, it remains open whether participants would have provided the same ratings if they had had the chance to interact with the targets. In addition, the videotaped scenes and the samples of behavior presented were rather brief. Under normal conditions, raters would have the opportunity to observe a larger sample

Impact of Frame-of-Reference and Behavioral Observation Training on Alternative Training Effectiveness Criteria in a Canadian Military Sample

Impact of Frame-of-Reference and Behavioral Observation Training on Alternative Training Effectiveness Criteria in a Canadian Military Sample HUMAN PERFORMANCE, 14(1), 3 26 Copyright 2001, Lawrence Erlbaum Associates, Inc. Impact of Frame-of-Reference and Behavioral Observation Training on Alternative Training Effectiveness Criteria in a Canadian

More information

Rater training revisited: An updated meta-analytic review of frame-of-reference training

Rater training revisited: An updated meta-analytic review of frame-of-reference training 370 Journal of Occupational and Organizational Psychology (2012), 85, 370 395 C 2011 The British Psychological Society The British Psychological Society www.wileyonlinelibrary.com Rater training revisited:

More information

Rater Effects as a Function of Rater Training Context

Rater Effects as a Function of Rater Training Context Rater Effects as a Function of Rater Training Context Edward W. Wolfe Aaron McVay Pearson October 2010 Abstract This study examined the influence of rater training and scoring context on the manifestation

More information

In this chapter we discuss validity issues for quantitative research and for qualitative research.

In this chapter we discuss validity issues for quantitative research and for qualitative research. Chapter 8 Validity of Research Results (Reminder: Don t forget to utilize the concept maps and study questions as you study this and the other chapters.) In this chapter we discuss validity issues for

More information

An International Study of the Reliability and Validity of Leadership/Impact (L/I)

An International Study of the Reliability and Validity of Leadership/Impact (L/I) An International Study of the Reliability and Validity of Leadership/Impact (L/I) Janet L. Szumal, Ph.D. Human Synergistics/Center for Applied Research, Inc. Contents Introduction...3 Overview of L/I...5

More information

Evaluating Frame-of-Reference Rater Training Effectiveness Using Performance Schema Accuracy

Evaluating Frame-of-Reference Rater Training Effectiveness Using Performance Schema Accuracy Journal of Applied Psychology 2009 American Psychological Association 2009, Vol. 94, No. 5, 1336 1344 0021-9010/09/$12.00 DOI: 10.1037/a0016476 Evaluating Frame-of-Reference Rater Training Effectiveness

More information

CHAPTER VI RESEARCH METHODOLOGY

CHAPTER VI RESEARCH METHODOLOGY CHAPTER VI RESEARCH METHODOLOGY 6.1 Research Design Research is an organized, systematic, data based, critical, objective, scientific inquiry or investigation into a specific problem, undertaken with the

More information

Combining Dual Scaling with Semi-Structured Interviews to Interpret Rating Differences

Combining Dual Scaling with Semi-Structured Interviews to Interpret Rating Differences A peer-reviewed electronic journal. Copyright is retained by the first or sole author, who grants right of first publication to the Practical Assessment, Research & Evaluation. Permission is granted to

More information

Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data

Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data TECHNICAL REPORT Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data CONTENTS Executive Summary...1 Introduction...2 Overview of Data Analysis Concepts...2

More information

Can Performance-Feedback Accuracy Be Improved? Effects of Rater Priming and Rating-Scale Format on Rating Accuracy

Can Performance-Feedback Accuracy Be Improved? Effects of Rater Priming and Rating-Scale Format on Rating Accuracy Journal of Applied Psychology 2001, Vol. 86, No. 1, 134-144 Copyright 2001 by the American Psychological Association, Inc. 0021-9010/01/$5.00 DOI: I0.1037//0021-9010.86.1.134 Can Performance-Feedback Accuracy

More information

The Role of Modeling and Feedback in. Task Performance and the Development of Self-Efficacy. Skidmore College

The Role of Modeling and Feedback in. Task Performance and the Development of Self-Efficacy. Skidmore College Self-Efficacy 1 Running Head: THE DEVELOPMENT OF SELF-EFFICACY The Role of Modeling and Feedback in Task Performance and the Development of Self-Efficacy Skidmore College Self-Efficacy 2 Abstract Participants

More information

EVIDENCE FOR THE EFFECTIVENESS OF AN ALTERNATIVE MULTISOURCE PERFORMANCE RATING METHODOLOGY

EVIDENCE FOR THE EFFECTIVENESS OF AN ALTERNATIVE MULTISOURCE PERFORMANCE RATING METHODOLOGY PERSONNEL PSYCHOLOGY 2012, 65, 531 563 EVIDENCE FOR THE EFFECTIVENESS OF AN ALTERNATIVE MULTISOURCE PERFORMANCE RATING METHODOLOGY BRIAN J. HOFFMAN University of Georgia C. ALLEN GORMAN Radford University

More information

AOTA S EVIDENCE EXCHANGE CRITICALLY APPRAISED PAPER (CAP) GUIDELINES Annual AOTA Conference Poster Submissions Critically Appraised Papers (CAPs) are

AOTA S EVIDENCE EXCHANGE CRITICALLY APPRAISED PAPER (CAP) GUIDELINES Annual AOTA Conference Poster Submissions Critically Appraised Papers (CAPs) are AOTA S EVIDENCE EXCHANGE CRITICALLY APPRAISED PAPER (CAP) GUIDELINES Annual AOTA Conference Poster Submissions Critically Appraised Papers (CAPs) are at-a-glance summaries of the methods, findings and

More information

Comparing Vertical and Horizontal Scoring of Open-Ended Questionnaires

Comparing Vertical and Horizontal Scoring of Open-Ended Questionnaires A peer-reviewed electronic journal. Copyright is retained by the first or sole author, who grants right of first publication to the Practical Assessment, Research & Evaluation. Permission is granted to

More information

Professional Counseling Psychology

Professional Counseling Psychology Professional Counseling Psychology Regulations for Case Conceptualization Preparation Manual Revised Spring 2015 Table of Contents Timeline... 3 Committee Selection and Paperwork... 3 Selection of Client

More information

Overview of the Logic and Language of Psychology Research

Overview of the Logic and Language of Psychology Research CHAPTER W1 Overview of the Logic and Language of Psychology Research Chapter Outline The Traditionally Ideal Research Approach Equivalence of Participants in Experimental and Control Groups Equivalence

More information

AP Psychology -- Chapter 02 Review Research Methods in Psychology

AP Psychology -- Chapter 02 Review Research Methods in Psychology AP Psychology -- Chapter 02 Review Research Methods in Psychology 1. In the opening vignette, to what was Alicia's condition linked? The death of her parents and only brother 2. What did Pennebaker s study

More information

Evaluating Differential Rater Functioning in. Performance Ratings: Using a Goal-Based Approach. A dissertation presented to.

Evaluating Differential Rater Functioning in. Performance Ratings: Using a Goal-Based Approach. A dissertation presented to. Evaluating Differential Rater Functioning in Performance Ratings: Using a Goal-Based Approach A dissertation presented to the faculty of the College of Arts and Sciences of Ohio University In partial fulfillment

More information

Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments

Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments Greg Pope, Analytics and Psychometrics Manager 2008 Users Conference San Antonio Introduction and purpose of this session

More information

Evaluation of Pseudo-Scoring as an Extension of Rater Training

Evaluation of Pseudo-Scoring as an Extension of Rater Training Evaluation of Pseudo-Scoring as an Extension of Rater Training Research Report Edward W. Wolfe Melodie Jurgens Bob Sanders Daisy Vickers Jessica Yue April 2014 PSEUDO-SCORING 1 About Pearson Everything

More information

How Do We Assess Students in the Interpreting Examinations?

How Do We Assess Students in the Interpreting Examinations? How Do We Assess Students in the Interpreting Examinations? Fred S. Wu 1 Newcastle University, United Kingdom The field of assessment in interpreter training is under-researched, though trainers and researchers

More information

PLANNING THE RESEARCH PROJECT

PLANNING THE RESEARCH PROJECT Van Der Velde / Guide to Business Research Methods First Proof 6.11.2003 4:53pm page 1 Part I PLANNING THE RESEARCH PROJECT Van Der Velde / Guide to Business Research Methods First Proof 6.11.2003 4:53pm

More information

Test Validity. What is validity? Types of validity IOP 301-T. Content validity. Content-description Criterion-description Construct-identification

Test Validity. What is validity? Types of validity IOP 301-T. Content validity. Content-description Criterion-description Construct-identification What is? IOP 301-T Test Validity It is the accuracy of the measure in reflecting the concept it is supposed to measure. In simple English, the of a test concerns what the test measures and how well it

More information

June 26, E. Mineral Circle, Suite 350 Englewood, CO Critical Role of Performance Ratings

June 26, E. Mineral Circle, Suite 350 Englewood, CO Critical Role of Performance Ratings How Do You Rate What You Don t Know: The Impact of ity Between Raters and Ratees on Performance Evaluations John M. Ford Michael D. Blair Critical Role of Performance Ratings Administrative Decisions Compensation

More information

RESEARCH METHODS. Winfred, research methods, ; rv ; rv

RESEARCH METHODS. Winfred, research methods, ; rv ; rv RESEARCH METHODS 1 Research Methods means of discovering truth 2 Research Methods means of discovering truth what is truth? 3 Research Methods means of discovering truth what is truth? Riveda Sandhyavandanam

More information

Administering and Scoring the CSQ Scales

Administering and Scoring the CSQ Scales CSQ Scales 2012 All Rights Reserved Administering and Scoring the CSQ Scales Inquiries: Web: info@csqscales.com www.csqscales.com Copyright of the CSQ Scales Copyright:, Tamalpais Matrix Systems, 35 Miller

More information

Value Differences Between Scientists and Practitioners: A Survey of SIOP Members

Value Differences Between Scientists and Practitioners: A Survey of SIOP Members Value Differences Between Scientists and Practitioners: A Survey of SIOP Members Margaret E. Brooks, Eyal Grauer, Erin E. Thornbury, and Scott Highhouse Bowling Green State University The scientist-practitioner

More information

UNIVERSITY OF NORTH TEXAS Counseling Psychology Program Evaluation of Practicum

UNIVERSITY OF NORTH TEXAS Counseling Psychology Program Evaluation of Practicum UNT Counseling Psychology Program Practicum Evaluation Form Page 1-6 UNIVERSITY OF NORTH TEXAS Counseling Psychology Program Evaluation of Practicum Practicum Student: Year in the Program: 1 st 2 nd 3

More information

A New Frame for Frame-of-Reference Training: Enhancing the Construct Validity of Assessment Centers

A New Frame for Frame-of-Reference Training: Enhancing the Construct Validity of Assessment Centers Journal of Applied Psychology Copyright 2002 by the American Psychological Association, Inc. 2002, Vol. 87, No. 4, 735 746 0021-9010/02/$5.00 DOI: 10.1037//0021-9010.87.4.735 A New Frame for Frame-of-Reference

More information

CHAPTER III RESEARCH METHODOLOGY

CHAPTER III RESEARCH METHODOLOGY CHAPTER III RESEARCH METHODOLOGY Research methodology explains the activity of research that pursuit, how it progress, estimate process and represents the success. The methodological decision covers the

More information

Why do Psychologists Perform Research?

Why do Psychologists Perform Research? PSY 102 1 PSY 102 Understanding and Thinking Critically About Psychological Research Thinking critically about research means knowing the right questions to ask to assess the validity or accuracy of a

More information

English 10 Writing Assessment Results and Analysis

English 10 Writing Assessment Results and Analysis Academic Assessment English 10 Writing Assessment Results and Analysis OVERVIEW This study is part of a multi-year effort undertaken by the Department of English to develop sustainable outcomes assessment

More information

TTI Personal Talent Skills Inventory Coaching Report

TTI Personal Talent Skills Inventory Coaching Report TTI Personal Talent Skills Inventory Coaching Report "He who knows others is learned. He who knows himself is wise." Lao Tse Mason Roberts District Manager YMCA 8-1-2008 Copyright 2003-2008. Performance

More information

Technical Specifications

Technical Specifications Technical Specifications In order to provide summary information across a set of exercises, all tests must employ some form of scoring models. The most familiar of these scoring models is the one typically

More information

CHAPTER 3 RESEARCH METHODOLOGY

CHAPTER 3 RESEARCH METHODOLOGY CHAPTER 3 RESEARCH METHODOLOGY 3.1 Introduction 3.1 Methodology 3.1.1 Research Design 3.1. Research Framework Design 3.1.3 Research Instrument 3.1.4 Validity of Questionnaire 3.1.5 Statistical Measurement

More information

Department of Psychological Sciences Learning Goals and Outcomes

Department of Psychological Sciences Learning Goals and Outcomes Department of Psychological Sciences Learning Goals and Outcomes Upon completion of a Bachelor s degree in Psychology, students will be prepared in content related to the eight learning goals described

More information

Definitions of Nature of Science and Scientific Inquiry that Guide Project ICAN: A Cheat Sheet

Definitions of Nature of Science and Scientific Inquiry that Guide Project ICAN: A Cheat Sheet Definitions of Nature of Science and Scientific Inquiry that Guide Project ICAN: A Cheat Sheet What is the NOS? The phrase nature of science typically refers to the values and assumptions inherent to scientific

More information

ACDI. An Inventory of Scientific Findings. (ACDI, ACDI-Corrections Version and ACDI-Corrections Version II) Provided by:

ACDI. An Inventory of Scientific Findings. (ACDI, ACDI-Corrections Version and ACDI-Corrections Version II) Provided by: + ACDI An Inventory of Scientific Findings (ACDI, ACDI-Corrections Version and ACDI-Corrections Version II) Provided by: Behavior Data Systems, Ltd. P.O. Box 44256 Phoenix, Arizona 85064-4256 Telephone:

More information

Personality Traits Effects on Job Satisfaction: The Role of Goal Commitment

Personality Traits Effects on Job Satisfaction: The Role of Goal Commitment Marshall University Marshall Digital Scholar Management Faculty Research Management, Marketing and MIS Fall 11-14-2009 Personality Traits Effects on Job Satisfaction: The Role of Goal Commitment Wai Kwan

More information

Adjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data

Adjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data Adjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data Karl Bang Christensen National Institute of Occupational Health, Denmark Helene Feveille National

More information

On the purpose of testing:

On the purpose of testing: Why Evaluation & Assessment is Important Feedback to students Feedback to teachers Information to parents Information for selection and certification Information for accountability Incentives to increase

More information

Inconsistency in Assessment Center Performance: Measurement Error or Something More?

Inconsistency in Assessment Center Performance: Measurement Error or Something More? Inconsistency in Assessment Center Performance: Measurement Error or Something More? Alyssa Mitchell Gibbons Deborah E. Rupp University of Illinois at Urbana-Champaign Deidra J. Schleicher Krannert School

More information

CHAPTER III RESEARCH METHODOLOGY

CHAPTER III RESEARCH METHODOLOGY CHAPTER III RESEARCH METHODOLOGY In this chapter, the researcher will elaborate the methodology of the measurements. This chapter emphasize about the research methodology, data source, population and sampling,

More information

RESEARCH METHODS. Winfred, research methods,

RESEARCH METHODS. Winfred, research methods, RESEARCH METHODS Winfred, research methods, 04-23-10 1 Research Methods means of discovering truth Winfred, research methods, 04-23-10 2 Research Methods means of discovering truth what is truth? Winfred,

More information

The Effects of Rater Training on the Relationship between Item Observability and Rater Agreement

The Effects of Rater Training on the Relationship between Item Observability and Rater Agreement Western Kentucky University TopSCHOLAR Masters Theses & Specialist Projects Graduate School 5-2010 The Effects of Rater Training on the Relationship between Item Observability and Rater Agreement Keaton

More information

Incorporating quantitative information into a linear ordering" GEORGE R. POTTS Dartmouth College, Hanover, New Hampshire 03755

Incorporating quantitative information into a linear ordering GEORGE R. POTTS Dartmouth College, Hanover, New Hampshire 03755 Memory & Cognition 1974, Vol. 2, No.3, 533 538 Incorporating quantitative information into a linear ordering" GEORGE R. POTTS Dartmouth College, Hanover, New Hampshire 03755 Ss were required to learn linear

More information

In this chapter, we discuss the statistical methods used to test the viability

In this chapter, we discuss the statistical methods used to test the viability 5 Strategy for Measuring Constructs and Testing Relationships In this chapter, we discuss the statistical methods used to test the viability of our conceptual models as well as the methods used to test

More information

GENERALIZABILITY AND RELIABILITY: APPROACHES FOR THROUGH-COURSE ASSESSMENTS

GENERALIZABILITY AND RELIABILITY: APPROACHES FOR THROUGH-COURSE ASSESSMENTS GENERALIZABILITY AND RELIABILITY: APPROACHES FOR THROUGH-COURSE ASSESSMENTS Michael J. Kolen The University of Iowa March 2011 Commissioned by the Center for K 12 Assessment & Performance Management at

More information

Everything DiSC 363 for Leaders. Research Report. by Inscape Publishing

Everything DiSC 363 for Leaders. Research Report. by Inscape Publishing Everything DiSC 363 for Leaders Research Report by Inscape Publishing Introduction Everything DiSC 363 for Leaders is a multi-rater assessment and profile that is designed to give participants feedback

More information

Job stress, psychological empowerment, and job satisfaction among the IT employees in Coimbatore

Job stress, psychological empowerment, and job satisfaction among the IT employees in Coimbatore 2015; 1(8): 126-131 ISSN Print: 2394-7500 ISSN Online: 2394-5869 Impact Factor: 5.2 IJAR 2015; 1(8): 126-131 www.allresearchjournal.com Received: 13-05-2015 Accepted: 16-06-2015 Deepa J Assistant Professor,

More information

UN Handbook Ch. 7 'Managing sources of non-sampling error': recommendations on response rates

UN Handbook Ch. 7 'Managing sources of non-sampling error': recommendations on response rates JOINT EU/OECD WORKSHOP ON RECENT DEVELOPMENTS IN BUSINESS AND CONSUMER SURVEYS Methodological session II: Task Force & UN Handbook on conduct of surveys response rates, weighting and accuracy UN Handbook

More information

ADMS Sampling Technique and Survey Studies

ADMS Sampling Technique and Survey Studies Principles of Measurement Measurement As a way of understanding, evaluating, and differentiating characteristics Provides a mechanism to achieve precision in this understanding, the extent or quality As

More information

Observational Category Learning as a Path to More Robust Generative Knowledge

Observational Category Learning as a Path to More Robust Generative Knowledge Observational Category Learning as a Path to More Robust Generative Knowledge Kimery R. Levering (kleveri1@binghamton.edu) Kenneth J. Kurtz (kkurtz@binghamton.edu) Department of Psychology, Binghamton

More information

IDEA Technical Report No. 20. Updated Technical Manual for the IDEA Feedback System for Administrators. Stephen L. Benton Dan Li

IDEA Technical Report No. 20. Updated Technical Manual for the IDEA Feedback System for Administrators. Stephen L. Benton Dan Li IDEA Technical Report No. 20 Updated Technical Manual for the IDEA Feedback System for Administrators Stephen L. Benton Dan Li July 2018 2 Table of Contents Introduction... 5 Sample Description... 6 Response

More information

Everyday Problem Solving and Instrumental Activities of Daily Living: Support for Domain Specificity

Everyday Problem Solving and Instrumental Activities of Daily Living: Support for Domain Specificity Behav. Sci. 2013, 3, 170 191; doi:10.3390/bs3010170 Article OPEN ACCESS behavioral sciences ISSN 2076-328X www.mdpi.com/journal/behavsci Everyday Problem Solving and Instrumental Activities of Daily Living:

More information

PTHP 7101 Research 1 Chapter Assignments

PTHP 7101 Research 1 Chapter Assignments PTHP 7101 Research 1 Chapter Assignments INSTRUCTIONS: Go over the questions/pointers pertaining to the chapters and turn in a hard copy of your answers at the beginning of class (on the day that it is

More information

Empirical Knowledge: based on observations. Answer questions why, whom, how, and when.

Empirical Knowledge: based on observations. Answer questions why, whom, how, and when. INTRO TO RESEARCH METHODS: Empirical Knowledge: based on observations. Answer questions why, whom, how, and when. Experimental research: treatments are given for the purpose of research. Experimental group

More information

Results & Statistics: Description and Correlation. I. Scales of Measurement A Review

Results & Statistics: Description and Correlation. I. Scales of Measurement A Review Results & Statistics: Description and Correlation The description and presentation of results involves a number of topics. These include scales of measurement, descriptive statistics used to summarize

More information

Since light travels faster than. sound, people appear bright until. you hear them speak

Since light travels faster than. sound, people appear bright until. you hear them speak Since light travels faster than sound, people appear bright until you hear them speak Oral Exams Pro s and Con s Zeev Goldik Oral examinations Can generate marks unrelated to competence? Oral exams- pro

More information

Assignment 4: True or Quasi-Experiment

Assignment 4: True or Quasi-Experiment Assignment 4: True or Quasi-Experiment Objectives: After completing this assignment, you will be able to Evaluate when you must use an experiment to answer a research question Develop statistical hypotheses

More information

RESULTS. Chapter INTRODUCTION

RESULTS. Chapter INTRODUCTION 8.1 Chapter 8 RESULTS 8.1 INTRODUCTION The previous chapter provided a theoretical discussion of the research and statistical methodology. This chapter focuses on the interpretation and discussion of the

More information

INTERNATIONAL STANDARD ON ASSURANCE ENGAGEMENTS 3000 ASSURANCE ENGAGEMENTS OTHER THAN AUDITS OR REVIEWS OF HISTORICAL FINANCIAL INFORMATION CONTENTS

INTERNATIONAL STANDARD ON ASSURANCE ENGAGEMENTS 3000 ASSURANCE ENGAGEMENTS OTHER THAN AUDITS OR REVIEWS OF HISTORICAL FINANCIAL INFORMATION CONTENTS INTERNATIONAL STANDARD ON ASSURANCE ENGAGEMENTS 3000 ASSURANCE ENGAGEMENTS OTHER THAN AUDITS OR REVIEWS OF HISTORICAL FINANCIAL INFORMATION (Effective for assurance reports dated on or after January 1,

More information

College Student Self-Assessment Survey (CSSAS)

College Student Self-Assessment Survey (CSSAS) 13 College Student Self-Assessment Survey (CSSAS) Development of College Student Self Assessment Survey (CSSAS) The collection and analysis of student achievement indicator data are of primary importance

More information

Psychology Department Assessment

Psychology Department Assessment Psychology Department Assessment 2008 2009 The 2008-2009 Psychology assessment included an evaluation of graduating psychology seniors regarding their experience in the program, an analysis of introductory

More information

Trainer Assessment Report Guidelines (TARG) 2019 Approved : December Contents

Trainer Assessment Report Guidelines (TARG) 2019 Approved : December Contents Contents TARG1 TARG2 TARG3 TARG4 TARG5 TARG6 TARG7 to 11 TARG12 Before Completing this Report Applicants Declaration (no guidelines) Eligibility of Trainer Trainer s Details (no guidelines) Trainer s Qualifications

More information

School orientation and mobility specialists School psychologists School social workers Speech language pathologists

School orientation and mobility specialists School psychologists School social workers Speech language pathologists 2013-14 Pilot Report Senate Bill 10-191, passed in 2010, restructured the way all licensed personnel in schools are supported and evaluated in Colorado. The ultimate goal is ensuring college and career

More information

Psychological Experience of Attitudinal Ambivalence as a Function of Manipulated Source of Conflict and Individual Difference in Self-Construal

Psychological Experience of Attitudinal Ambivalence as a Function of Manipulated Source of Conflict and Individual Difference in Self-Construal Seoul Journal of Business Volume 11, Number 1 (June 2005) Psychological Experience of Attitudinal Ambivalence as a Function of Manipulated Source of Conflict and Individual Difference in Self-Construal

More information

An International Multi-Disciplinary Journal, Ethiopia Vol. 4 (1) January, 2010

An International Multi-Disciplinary Journal, Ethiopia Vol. 4 (1) January, 2010 An International Multi-Disciplinary Journal, Ethiopia Vol. 4 (1) January, 2010 ISSN 1994-9057 (Print) ISSN 2070-0083 (Online) Gender, Age and Locus of Control as Correlates of Remedial Learners Attitude

More information

RELIABILITY OF THE DENISON ORGANISATIONAL CULTURE SURVEY (DOCS) FOR USE IN A FINANCIAL INSTITUTION IN SOUTH AFRICA CHRISSTOFFEL JACOBUS FRANCK

RELIABILITY OF THE DENISON ORGANISATIONAL CULTURE SURVEY (DOCS) FOR USE IN A FINANCIAL INSTITUTION IN SOUTH AFRICA CHRISSTOFFEL JACOBUS FRANCK RELIABILITY OF THE DENISON ORGANISATIONAL CULTURE SURVEY (DOCS) FOR USE IN A FINANCIAL INSTITUTION IN SOUTH AFRICA by CHRISSTOFFEL JACOBUS FRANCK submitted in part fulfilment of the requirements for the

More information

Unit 1 Exploring and Understanding Data

Unit 1 Exploring and Understanding Data Unit 1 Exploring and Understanding Data Area Principle Bar Chart Boxplot Conditional Distribution Dotplot Empirical Rule Five Number Summary Frequency Distribution Frequency Polygon Histogram Interquartile

More information

Measurement of the Distance at which a Defender Feels Pressure in One-on-One Situations - the Relation with the Theory of the Personal Space -

Measurement of the Distance at which a Defender Feels Pressure in One-on-One Situations - the Relation with the Theory of the Personal Space - Short Communication : Psychology Measurement of the Distance at which a Defender Feels Pressure Measurement of the Distance at which a Defender Feels Pressure in One-on-One Situations - the Relation with

More information

BYE LAW 10 INTERIM CRITERIA (2009) FOR ACCREDITATION OF HUMANISTIC AND INTEGRATIVE PSYCHOTHERAPISTS

BYE LAW 10 INTERIM CRITERIA (2009) FOR ACCREDITATION OF HUMANISTIC AND INTEGRATIVE PSYCHOTHERAPISTS IRISH ASSOCIATION OF HUMANISTIC AND INTEGRATIVE PSYCHOTHERAPY BYE LAW 10 INTERIM CRITERIA (2009) FOR ACCREDITATION OF HUMANISTIC AND INTEGRATIVE PSYCHOTHERAPISTS 1. This Bye Law sets out the minimum criteria

More information

Toastmasters District 55 New Member Orientation Guidelines

Toastmasters District 55   New Member Orientation Guidelines www.toastmasters.org Toastmasters District 55 www.tmd55.org New Member Orientation Guidelines 1 Table of Contents Orientation of New Members 3 Orientation Officer Responsibilities 4 Setting up the Orientation

More information

existing statistical techniques. However, even with some statistical background, reading and

existing statistical techniques. However, even with some statistical background, reading and STRUCTURAL EQUATION MODELING (SEM): A STEP BY STEP APPROACH (PART 1) By: Zuraidah Zainol (PhD) Faculty of Management & Economics, Universiti Pendidikan Sultan Idris zuraidah@fpe.upsi.edu.my 2016 INTRODUCTION

More information

Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination

Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination Timothy N. Rubin (trubin@uci.edu) Michael D. Lee (mdlee@uci.edu) Charles F. Chubb (cchubb@uci.edu) Department of Cognitive

More information

Catching the Hawks and Doves: A Method for Identifying Extreme Examiners on Objective Structured Clinical Examinations

Catching the Hawks and Doves: A Method for Identifying Extreme Examiners on Objective Structured Clinical Examinations Catching the Hawks and Doves: A Method for Identifying Extreme Examiners on Objective Structured Clinical Examinations July 20, 2011 1 Abstract Performance-based assessments are powerful methods for assessing

More information

C O N T E N T S ... v vi. Job Tasks 38 Job Satisfaction 39. Group Development 6. Leisure Activities 41. Values 44. Instructions 9.

C O N T E N T S ... v vi. Job Tasks 38 Job Satisfaction 39. Group Development 6. Leisure Activities 41. Values 44. Instructions 9. C O N T E N T S LIST OF TABLES LIST OF FIGURES v vi INTRODUCTION TO THE FIRO-B INSTRUMENT 1 Overview of Uses 1 THEORY OF INTERPERSONAL NEEDS 3 The Interpersonal Needs 3 Expressed and Wanted Needs 4 The

More information

Psychometric qualities of the Dutch Risk Assessment Scales (RISc)

Psychometric qualities of the Dutch Risk Assessment Scales (RISc) Summary Psychometric qualities of the Dutch Risk Assessment Scales (RISc) Inter-rater reliability, internal consistency and concurrent validity 1 Cause, objective and research questions The Recidive InschattingsSchalen

More information

CHAPTER 3 METHOD AND PROCEDURE

CHAPTER 3 METHOD AND PROCEDURE CHAPTER 3 METHOD AND PROCEDURE Previous chapter namely Review of the Literature was concerned with the review of the research studies conducted in the field of teacher education, with special reference

More information

What Works Clearinghouse

What Works Clearinghouse What Works Clearinghouse U.S. DEPARTMENT OF EDUCATION July 2012 WWC Review of the Report Randomized, Controlled Trial of the LEAP Model of Early Intervention for Young Children With Autism Spectrum Disorders

More information

Analyzing the Relationship between the Personnel s Achievement Motivation and their Performance at the Islamic Azad University, Shoushtar branch

Analyzing the Relationship between the Personnel s Achievement Motivation and their Performance at the Islamic Azad University, Shoushtar branch Analyzing the Relationship between the Personnel s Achievement Motivation and their Performance at the Islamic Azad University, Shoushtar branch Masoud Ahmadinejad *, Omme Kolsom Gholamhosseinzadeh **,

More information

Creating a Positive Professional Image

Creating a Positive Professional Image Creating a Positive Professional Image Q&A with: Laura Morgan Roberts Published: June 20, 2005 Author: Mallory Stark As HBS professor Laura Morgan Roberts sees it, if you aren't managing your own professional

More information

Direct Observation in CBME: Importance and Challenges

Direct Observation in CBME: Importance and Challenges Direct Observation in CBME: Importance and Challenges Eric Holmboe Jennifer Kogan Portions of this work have been supported by the American Board of Internal Medicine 1 Direct Observation Research Team

More information

FRASER RIVER COUNSELLING Practicum Performance Evaluation Form

FRASER RIVER COUNSELLING Practicum Performance Evaluation Form FRASER RIVER COUNSELLING Practicum Performance Evaluation Form Semester 1 Semester 2 Other: Instructions: To be completed and reviewed in conjunction with the supervisor and the student, signed by both,

More information

Develop Computerized Patient Simulation Modules to Enhance. Learning of Therapeutic Modalities. Teaching Grant Proposal

Develop Computerized Patient Simulation Modules to Enhance. Learning of Therapeutic Modalities. Teaching Grant Proposal Develop Computerized Patient Simulation Modules to Enhance Learning of Therapeutic Modalities Teaching Grant Proposal 2009-2010 Abstract Therapeutic Modalities (TM) such as ultrasound, electrical muscle

More information

Rating Mental Impairment with AMA Guides 6 th edition:

Rating Mental Impairment with AMA Guides 6 th edition: Rating Mental Impairment with AMA Guides 6 th edition: Practical Considerations and Strategies CSME/CAPDA C-CAT Course, March 24, 2018 William H. Gnam, PhD, MD, FRCPC (william.gnam@gmail.com) Consultant

More information

CFSD 21 st Century Learning Rubric Skill: Critical & Creative Thinking

CFSD 21 st Century Learning Rubric Skill: Critical & Creative Thinking Comparing Selects items that are inappropriate to the basic objective of the comparison. Selects characteristics that are trivial or do not address the basic objective of the comparison. Selects characteristics

More information

Validity and reliability of measurements

Validity and reliability of measurements Validity and reliability of measurements 2 3 Request: Intention to treat Intention to treat and per protocol dealing with cross-overs (ref Hulley 2013) For example: Patients who did not take/get the medication

More information

THE USE OF CRONBACH ALPHA RELIABILITY ESTIMATE IN RESEARCH AMONG STUDENTS IN PUBLIC UNIVERSITIES IN GHANA.

THE USE OF CRONBACH ALPHA RELIABILITY ESTIMATE IN RESEARCH AMONG STUDENTS IN PUBLIC UNIVERSITIES IN GHANA. Africa Journal of Teacher Education ISSN 1916-7822. A Journal of Spread Corporation Vol. 6 No. 1 2017 Pages 56-64 THE USE OF CRONBACH ALPHA RELIABILITY ESTIMATE IN RESEARCH AMONG STUDENTS IN PUBLIC UNIVERSITIES

More information

Asthma: Evaluate and Improve Your Practice

Asthma: Evaluate and Improve Your Practice Potential Barriers and Suggested Ideas for Change Key Activity: Initial assessment and management Rationale: The history and physical examination obtained from the patient and family interviews form the

More information

The Effects of Societal Versus Professor Stereotype Threats on Female Math Performance

The Effects of Societal Versus Professor Stereotype Threats on Female Math Performance The Effects of Societal Versus Professor Stereotype Threats on Female Math Performance Lauren Byrne, Melannie Tate Faculty Sponsor: Bianca Basten, Department of Psychology ABSTRACT Psychological research

More information

Does the Use of Personality Inventory Cause Bias on Assessment Center Results Because of Social Desirability? Yasin Rofcanin Levent Sevinç

Does the Use of Personality Inventory Cause Bias on Assessment Center Results Because of Social Desirability? Yasin Rofcanin Levent Sevinç Does the Use of Personality Inventory Cause Bias on Assessment Center Results Because of Social Desirability? Yasin Rofcanin Levent Sevinç Introduction According to the guidelines, in an assessment center

More information

VARIABLES AND MEASUREMENT

VARIABLES AND MEASUREMENT ARTHUR SYC 204 (EXERIMENTAL SYCHOLOGY) 16A LECTURE NOTES [01/29/16] VARIABLES AND MEASUREMENT AGE 1 Topic #3 VARIABLES AND MEASUREMENT VARIABLES Some definitions of variables include the following: 1.

More information

Training Requirements for Certification as a PCIT Therapist

Training Requirements for Certification as a PCIT Therapist Training Requirements for Certification as a PCIT Therapist These requirements were developed by the PCIT International Training Task Force in collaboration with the PCIT International Board of Directors.

More information

Sawtooth Software. The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? RESEARCH PAPER SERIES

Sawtooth Software. The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? RESEARCH PAPER SERIES Sawtooth Software RESEARCH PAPER SERIES The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? Dick Wittink, Yale University Joel Huber, Duke University Peter Zandan,

More information

Chapter 11. Experimental Design: One-Way Independent Samples Design

Chapter 11. Experimental Design: One-Way Independent Samples Design 11-1 Chapter 11. Experimental Design: One-Way Independent Samples Design Advantages and Limitations Comparing Two Groups Comparing t Test to ANOVA Independent Samples t Test Independent Samples ANOVA Comparing

More information

Regression Discontinuity Analysis

Regression Discontinuity Analysis Regression Discontinuity Analysis A researcher wants to determine whether tutoring underachieving middle school students improves their math grades. Another wonders whether providing financial aid to low-income

More information

Analysis A step in the research process that involves describing and then making inferences based on a set of data.

Analysis A step in the research process that involves describing and then making inferences based on a set of data. 1 Appendix 1:. Definitions of important terms. Additionality The difference between the value of an outcome after the implementation of a policy, and its value in a counterfactual scenario in which the

More information

PERCEIVED TRUSTWORTHINESS OF KNOWLEDGE SOURCES: THE MODERATING IMPACT OF RELATIONSHIP LENGTH

PERCEIVED TRUSTWORTHINESS OF KNOWLEDGE SOURCES: THE MODERATING IMPACT OF RELATIONSHIP LENGTH PERCEIVED TRUSTWORTHINESS OF KNOWLEDGE SOURCES: THE MODERATING IMPACT OF RELATIONSHIP LENGTH DANIEL Z. LEVIN Management and Global Business Dept. Rutgers Business School Newark and New Brunswick Rutgers

More information