Differentiating Rater Accuracy Training Programs

by

Andrea L. Sinclair

Thesis submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of Master of Science in Industrial/Organizational Psychology

Approved: Neil M. A. Hauenstein, Roseanne J. Foti, John Donovan

October 4, 2000
Blacksburg, VA

Keywords: Performance Appraisal and Rater Training


Differentiating Rater Training Programs

Andrea L. Sinclair

(ABSTRACT)

Prior investigation of a new rater training paradigm, rater variability training (RVT), found no clear empirical distinction between RVT and the more established frame-of-reference (FOR) training (Hauenstein, Facteau, & Schmidt, 1999). The purpose of the present study was to expand upon this previous investigation by including a purpose manipulation, alternative operationalizations of Cronbach's accuracy components, finer-grained distinctions in the rating stimuli, and a second control group receiving quantitative accuracy feedback without a substantive training lecture. Results indicate that finer-grained distinctions in the rating stimuli result in the best differential elevation accuracy for RVT trainees. Furthermore, RVT may be best suited for improving raters' ability to accurately evaluate average-performing ratees when the performance appraisal is used for an administrative purpose. Evidence also suggests that in many cases, the use of Cronbach's accuracy components obscures underlying patterns of rating accuracy. Finally, there is evidence to suggest that accuracy feedback without a training lecture improves some types of rating accuracy.


Table of Contents

Introduction .... 1
Review of Literature .... 3
Methods .... 22
Results .... 30
Discussion .... 48
References .... 56
Appendix A .... 61
Appendix B .... 65
Appendix C .... 68
Appendix D .... 72
Appendix E .... 74
Appendix F .... 89
Appendix G .... 90
Appendix H .... 148
Table 4.1 .... 149
Table 4.2 .... 150
Table 4.3 .... 151
Table 4.4 .... 152
Table 4.5 .... 153
Table 4.6 .... 154
Table 4.7 .... 157
Figure 4.1 .... 161
Figure 4.2 .... 162
Figure 4.3 .... 163
Figure 4.4 .... 164
Figure 4.5 .... 165
Figure 4.6 .... 166
Figure 4.7 .... 167
Figure 4.8 .... 168
Figure 4.9 .... 169
Figure 4.10 .... 170
Figure 4.11 .... 171
Figure 4.12 .... 172
Figure 4.13 .... 173
Figure 4.14 .... 174
Figure 4.15 .... 175
Figure 4.16 .... 176
Figure 4.17 .... 177
Vita .... 178


Chapter 1. Introduction

Organizations rely on performance appraisals for making many organizational decisions. For example, organizations use appraisal information to make decisions about employee development, motivation, promotions, and terminations.
Hence, the information gained through the performance appraisal process has critical implications for both the individual and the organization. Because so much importance is placed on appraisal information, it is important to recognize that performance measurement typically relies on subjective measures and is therefore subject to distortion. In order to deal with such distortions, various rater training models have been proposed. Recently, considerable progress has been made in improving the effectiveness of rater training methods; however, many pertinent questions await investigation. The present study investigates some of these questions.

Fundamental to this investigation is whether training leads to increases in performance appraisal rating accuracy. Of particular interest is the issue of whether the accuracy gains in one training program are greater than the accuracy gains in other training programs. Much research has demonstrated that frame-of-reference (FOR) training (Bernardin & Buckley, 1981) effectively improves rating accuracy (Athey & McIntyre, 1987; McIntyre, Smith, & Hassett, 1984; Pulakos, 1986; Woehr, 1994). However, there is evidence to suggest that FOR training may not be the best method for increasing all types of rating accuracy (Stamoulis & Hauenstein, 1993). Using Cronbach's (1955) components of accuracy as their dependent measures, Stamoulis and Hauenstein found that FOR training led to significant increases in the accuracy components that collapse across ratees (stereotype accuracy and differential accuracy), but not in those components that collapse across rating dimensions (elevation and differential elevation). This is because during the FOR training session trainees discuss ratings within each ratee across performance dimensions. Therefore, the focus of FOR training is on the interrelated nature of performance dimensions, not on between-ratee differentiation (Hauenstein, Facteau, & Schmidt, 1999). As a result of this finding, a new training program was proposed that focused on accurately differentiating the performance levels of ratees (Stamoulis & Hauenstein, 1993; Hauenstein, 1998). However, when systematically evaluated, no empirical distinction was found between the proposed rater variability training (RVT) and FOR training (Hauenstein et al., 1999).

The purpose of the present study is to address limitations in the Hauenstein et al. (1999) study. This study improves upon four aspects of the previous study. To begin with, there is research suggesting that the benefits of RVT will emerge when a contextual constraint is imposed upon the raters. Hauenstein (1998) noted that FOR training is a more appropriate training method when performance appraisals are used for training and employee development because dimensional accuracy is important for those types of decisions. In contrast, RVT may be a more appropriate training method when performance appraisals are used for administrative decisions because administrative decisions require accurate distinctions between ratees, collapsed over dimensions. Consequently, the present study includes a purpose manipulation such that raters in one group are informed that their evaluations will be used for developmental purposes (providing feedback), while raters in the other group are informed that their evaluations will be used for administrative purposes (hiring decisions).

In addition to the purpose manipulation, this study also uses an alternative operationalization of accuracy.
Cronbach's components of accuracy serve as the traditional operationalization of accuracy. This analysis requires squaring of components and a substantial amount of aggregation. As a result, valuable information about the direction of the rater's accuracy scores (in relation to the target score) and the underlying patterns of over- and underestimation is discarded (Edwards, 1993, 1994, 1995). Therefore, in addition to computing Cronbach's components of accuracy, the present study includes an alternative analysis in which the distributions of the raw deviations between the observed components and the target components are inspected in relation to perfect accuracy.

The third improvement of this study is to develop a rating task in which participants are required to make finer-grained distinctions between levels of performance. Hauenstein et al. (1999) note that asking raters to make distinctions between performance levels that are readily distinguishable (i.e., good, average, and poor) is not a sensitive test of the RVT model. The strength of the RVT model lies in its ability to accurately discriminate among ratees. However, when there are overt differences in the rating stimuli, raters have very little difficulty differentiating between performances, and thus the rater does not need to rely upon RVT to make these distinctions. This may explain why Hauenstein et al. did not find an empirical distinction between FOR and RVT. In order to incorporate finer-grained distinctions in the rating stimuli, the present study will ask participants to rate five different performances at posttest (as opposed to three). The raters will evaluate poor, below average, average, above average, and good performances.

A final point of interest is to isolate the effects of the accuracy feedback component in the FOR and RVT training programs. Both FOR and RVT propose that it is the combination of lecture, practice, and accuracy feedback that produces increases in rating accuracy. Until now, there has been no research investigating whether the accuracy feedback component alone is responsible for improvements in rating accuracy. The present study includes a condition in which raters only receive quantitative accuracy feedback (for this condition the lecture component contains no instruction on how to improve the accuracy of performance ratings). If accuracy scores significantly improve for this group, this would suggest that increases in rating accuracy are largely due to receiving quantitative accuracy feedback, and not to a training lecture on how to improve the accuracy of ratings.


Chapter 2. Review of Literature

Rater Training Methods

Among the first rater training programs proposed was rater error training (RET; Latham, Wexley, & Pursell, 1975). Latham et al. attributed the unreliability of performance appraisal to such rating errors as central tendency, leniency/severity, and halo error. These biases are problematic because they prevent raters from making distinctions between ratees. RET is designed to combat these biases by first providing raters the opportunity to evaluate the performance of a hypothetical ratee. Then, in a workshop context, the trainer discusses the ratings with the raters. During this discussion, the various rating errors are clearly defined, and raters are informed of the types of errors they committed. At that point, the raters are instructed to increase the variability in subsequent observed ratings by refraining from committing such errors.
Initial trials of the RET program proved successful. Latham et al. (1975) found that six months after participating in the RET workshop, raters committed fewer rating errors, whereas raters in the control group committed more similarity, contrast, and halo errors. Further investigation of the RET program, however, proved less promising. Smith (1986) reviewed the rater training literature and found that while RET reduced halo error, it did not improve rating accuracy, and on two occasions RET negatively affected accuracy. In addition, the results of Murphy and Balzer's (1989) meta-analysis illustrated that the absence of rating errors is not an indication of accuracy. Thus, even though RET has been successful in prompting raters to provide ratings with lower means (i.e., less leniency) and lower inter-item correlations (i.e., less halo), RET does not foster rating accuracy (Smith, 1986).

In response to RET's failure to improve rating accuracy, Bernardin and Pence (1980) asserted that RET merely encourages raters to replace one response set (e.g., leniency) with a different response set (e.g., severity). This criticism of the RET model led to the proposal of a new model: frame-of-reference training (Bernardin & Buckley, 1981). The primary goal of frame-of-reference (FOR) training is to increase the accuracy of ratings. FOR training sets out to accomplish this goal by establishing a common frame of reference to which raters refer when evaluating ratee performance. In this process, the raters first receive performance dimension training, which involves a discussion of job performance dimensions focusing on behavioral examples of a range of performance effectiveness. Next, raters are exposed to videotaped vignettes that contain critical incidents representing good, average, and poor job performance. The raters are then asked to evaluate these critical incidents. Once the raters have recorded justifications for their ratings, the trainer informs the raters of the target scores. Finally, raters are given quantitative and qualitative feedback regarding their ratings, and any discrepancies between their scores and the target scores are discussed (Bernardin & Buckley, 1981). Woehr and Huffcutt (1994) found an average effect size of .83 for studies comparing FOR training with control or no training on the dependent variable of rating accuracy.

Studies that have compared the effectiveness of RET and FOR training generally show that while RET is effective at reducing rating errors such as halo and leniency, rater error training is an ineffective method for improving accuracy. FOR training, on the other hand, demonstrates significant improvements in rating accuracy, and has therefore been judged the superior approach (Athey & McIntyre, 1987; Bernardin & Buckley, 1981; Woehr, 1994). Nonetheless, there are findings in the FOR research that raise serious questions about the general superiority of FOR training. For example, Sulsky and Day (1992) found that FOR trainees recognized impression-consistent behaviors even when the behaviors did not occur, meaning that FOR-trained raters (as compared with controls) exhibited significantly worse behavioral accuracy for specific ratees. It appears that FOR training fosters accurate judgments of ratee performance, but decreases the accuracy of recall for the specific behaviors upon which those judgments are based (Hauenstein, 1998; Sulsky & Day, 1992; Sulsky & Day, 1994).
In other words, once raters integrate the behaviors into the dimensional impressions, they are less likely to accurately remember the behaviors from which the dimensional impressions were formed. The failure to remember individual behaviors becomes problematic when raters are required to give ratees specific feedback (Sulsky & Day, 1994). Furthermore, false recognition of performance behaviors may lead to disputes between supervisors and subordinates (Sulsky & Day, 1992).

In addition to Sulsky and Day's concerns about FOR training, Stamoulis and Hauenstein (1993) voice concerns about whether FOR is best for all types of rating accuracy. In regard to types of accuracy, Cronbach's (1955) four components of rating accuracy have received much use in rater training research. Elevation, differential elevation, stereotype accuracy, and differential accuracy are based on the true psychometric conceptualization of accuracy, and contain both correlational and distance information in relation to the target scores. Elevation is the average rating, over all ratees and items, given by the rater. A highly accurate rating occurs when the rater's overall average is close to the overall true score. To illustrate, Bost (1994) provides the example of judging an Olympic figure skating competition. The average of the judges' scores for a single skater serves as the true score. Each judge's elevation accuracy can be determined by comparing the true score to the score given by an individual judge for a single skater. Differential elevation is the accuracy component that reflects the average rating assigned to each ratee across all performance dimensions; the average rating given to each ratee allows the ratees to be ranked from best performer to worst performer. Keeping with the figure skating example, the judge who correctly rank ordered the performances of all the figure skaters from best to worst would be the most accurate in terms of differential elevation. Stereotype accuracy refers to accuracy in discriminating among performance dimensions, averaging over ratees. A judge with high stereotype accuracy would accurately assess which performance, the fixed program or the freestyle skate, was better. Differential accuracy refers to accuracy in distinguishing among ratees within dimensions. In other words, differential accuracy is the rater's ability to recognize differences between ratees in their patterns of performance. In this case, the accurate judge would be able to determine which skater had high creativity and low difficulty and which skater exhibited the opposite pattern.

Stamoulis and Hauenstein (1993) utilized Cronbach's components of accuracy as their dependent measures and found that FOR trainees demonstrated the greatest improvements on the accuracy components that collapse over ratees (stereotype accuracy and differential accuracy), but that FOR did not demonstrate improvements on the accuracy components that collapse over dimensions (elevation and differential elevation). In other words, they found that FOR training improved the accuracy of between-dimension discrimination, but did not increase the accuracy of between-ratee discrimination. Stamoulis and Hauenstein explain that this is because FOR training is not designed to foster between-ratee discrimination as measured by elevation and differential elevation accuracy. Rather, Hauenstein et al.
(1999) explain that the focus of FOR training is on the interrelated nature of performance dimensions and the development of internal frames of reference, as opposed to the accurate portrayal of the true variability among individual performances.

Murphy, Garcia, Kerkar, Martin, and Balzer (1982) maintain that elevation and differential elevation are more important than the dimensional components of accuracy when it comes to organizational decision making. Similarly, Murphy and Cleveland (1995) recommend differential elevation over differential accuracy. Also, the most common types of appraisal-oriented decisions require elevation accuracy and differential elevation accuracy (Murphy et al., 1982); that is, decisions that call for accurate differentiation among individual ratees. If this is the case, then it appears that FOR is not the best training method for many organizational decisions (Hauenstein, 1998).

Clearly there are limitations to the FOR training method. In light of these limitations, Sulsky and Day (1992) suggested developing a training program that combines elements of existing training programs. They argued that a combination of training approaches would supply raters with increased access to behavioral information. Expanding on this idea, Stamoulis and Hauenstein (1993) proposed a new model termed rater variability training (RVT). The RVT program has its roots in the RET and FOR models. The goal of RVT is to increase rater variability so that performance ratings accurately correspond to the true variability in performance, thereby increasing the discriminability among ratees (Hauenstein, 1998). To achieve this goal, RVT combines elements of the traditional RET and FOR programs. Hauenstein (1998) explained that RVT incorporates RET's emphasis on distinguishing among different ratees, but unlike the RET model, RVT does not focus on rating errors because rating errors are not correlated with accuracy (Murphy & Balzer, 1989). In order to achieve accurate ratings, the RVT model incorporates the accuracy feedback component of frame-of-reference training, whereby trainees are given quantitative feedback about the discrepancies between their ratings and the target scores (Hauenstein et al., 1999).

There are critical distinctions, however, between RVT and FOR training (Hauenstein et al., 1999). First of all, FOR training emphasizes the importance of identifying differences between practice ratings and target scores, and FOR trainees discuss the practice ratings for each ratee in comparison to the target ratings. In contrast, RVT emphasizes the importance of identifying the relative performance differences between individual ratees, and focuses on discussing ratings within each dimension across ratees. In other words, all practice ratings for the first dimension are discussed in relation to target scores for the first dimension, then all practice ratings for the second dimension are discussed in relation to target scores for the second dimension, and so forth until all dimensions have been discussed.

A recent study by Hauenstein, Facteau, and Schmidt (1999) represents the first systematic evaluation of RVT. The design of this study replicated Stamoulis and Hauenstein (1993) with the addition of an RVT condition. Hauenstein et al. compared FOR and RVT using Cronbach's (1955) accuracy components as the dependent measures.
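For reference, these components are typically computed as squared difference scores between a rater's ratings and the target scores. The following is a standard formulation from the rating accuracy literature, not a reproduction of the equations used in this thesis, and some studies report the square roots of these quantities instead. Let x_{ij} denote the rating assigned to ratee i on dimension j, t_{ij} the corresponding target score, n the number of ratees, and k the number of dimensions, with bars denoting means over the dotted index:

% Assumed standard squared-difference formulation of Cronbach's (1955) components;
% lower values indicate greater accuracy.
\begin{align*}
\text{Elevation:}\quad & E = \left(\bar{x}_{..} - \bar{t}_{..}\right)^{2} \\
\text{Differential elevation:}\quad & DE = \frac{1}{n}\sum_{i=1}^{n}\Bigl[\bigl(\bar{x}_{i.} - \bar{x}_{..}\bigr) - \bigl(\bar{t}_{i.} - \bar{t}_{..}\bigr)\Bigr]^{2} \\
\text{Stereotype accuracy:}\quad & SA = \frac{1}{k}\sum_{j=1}^{k}\Bigl[\bigl(\bar{x}_{.j} - \bar{x}_{..}\bigr) - \bigl(\bar{t}_{.j} - \bar{t}_{..}\bigr)\Bigr]^{2} \\
\text{Differential accuracy:}\quad & DA = \frac{1}{nk}\sum_{i=1}^{n}\sum_{j=1}^{k}\Bigl[\bigl(x_{ij} - \bar{x}_{i.} - \bar{x}_{.j} + \bar{x}_{..}\bigr) - \bigl(t_{ij} - \bar{t}_{i.} - \bar{t}_{.j} + \bar{t}_{..}\bigr)\Bigr]^{2}
\end{align*}

The squaring step in each component is what discards the sign of the underlying deviations, which is the property at issue in the raw-deviation analyses described in Chapter 1.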
The purpose of their study was to determine whether RVT was a more effective strategy for improving raters' ability to accurately discriminate among the performance levels of different ratees. Accordingly, they predicted that RVT would lead to greater improvements in elevation and differential elevation accuracy than would FOR training, because these accuracy components measure the accuracy of ratings for ratees collapsed over dimensions. In addition, they hypothesized that FOR-trained raters would exhibit the greatest improvement on stereotype accuracy because FOR training focuses on the relationship among dimensions. Finally, there was no prediction of differences on differential accuracy because both FOR and RVT provide quantitative feedback about ratees on individual dimensions; therefore, both attempt to improve differential accuracy. The training emphasis is different, however, in that FOR focuses on relative differences among performance dimensions and RVT focuses on relative differences among ratees.

Results of the study show support for the hypothesis that FOR-trained raters exhibit the greatest improvement on stereotype accuracy. However, there was no empirical distinction between RVT and FOR training. The authors offer three possible explanations for the lack of empirical distinction between FOR and RVT. One explanation is that raters were not under any contextual pressures to make evaluations for either administrative or developmental reasons. In other words, Hauenstein et al. (1999) did not take into account the intended purpose of the appraisal information. It may be that empirical differences between FOR and RVT emerge when raters operate under motivational presses when making posttest ratings.

Secondly, Hauenstein et al. (1999) suggest that the lack of empirical differences between RVT and FOR may be due to problems with the dependent measures. They found that the sensitivity of Cronbach's components of accuracy varies as a function of the number of ratees and dimensions used. Hauenstein et al. report that with only three ratees at posttest, systematic differences in the unsquared differential elevation components were not replicated when squared and aggregated into the differential elevation score. In contrast, analyses of the eighteen unsquared differential accuracy components produced no training effects, yet when these were squared and aggregated into differential accuracy scores, reliable training effects were found. Thus, they report that the aggregation of many systematic small effects leads to an overall effect that is more readily detected. This suggests that there may be inherent limitations in the calculation of Cronbach's accuracy scores such that the raters' accuracy scores are obscured in the calculation of the four components.

A third possibility for RVT's failure to demonstrate superior effectiveness over FOR (on elevation and differential elevation) is that the discrimination among ratees was too easy due to overt differences in the rating stimuli. That is, raters had little difficulty in distinguishing between a "good" performer, an "average" performer, and a "poor" performer. Hauenstein et al. (1999) suggest that if the raters were asked to make more fine-grained distinctions, for example, distinguishing between three ratees that are all above average, then the benefits of RVT may emerge.
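To make the second explanation concrete, the following is a minimal sketch, using made-up numbers and a hypothetical differential_elevation helper rather than code or data from the original studies, of how squaring and aggregating ratee-level deviations into a single differential elevation score can hide opposite patterns of over- and underestimation. It assumes the squared-difference formulation sketched above.

import numpy as np

def differential_elevation(ratings, targets):
    # Squared differential elevation: mean squared difference between the
    # grand-mean-centered ratee means of the ratings and of the targets.
    # Lower values indicate greater accuracy. (Assumed standard formulation.)
    r_dev = ratings.mean(axis=1) - ratings.mean()
    t_dev = targets.mean(axis=1) - targets.mean()
    return np.mean((r_dev - t_dev) ** 2)

# Hypothetical target scores: 3 ratees x 4 dimensions with ratee means of 3, 5, and 7.
targets = np.array([[3.0] * 4, [5.0] * 4, [7.0] * 4])

# Rater A compresses the range: overrates the poor ratee, underrates the good ratee.
rater_a = np.array([[4.0] * 4, [5.0] * 4, [6.0] * 4])
# Rater B exaggerates the range: underrates the poor ratee, overrates the good ratee.
rater_b = np.array([[2.0] * 4, [5.0] * 4, [8.0] * 4])

print(differential_elevation(rater_a, targets))     # ~0.667
print(differential_elevation(rater_b, targets))     # ~0.667 -- identical aggregate score
print(rater_a.mean(axis=1) - targets.mean(axis=1))  # [ 1.  0. -1.]
print(rater_b.mean(axis=1) - targets.mean(axis=1))  # [-1.  0.  1.] -- opposite raw pattern

Both hypothetical raters receive the same differential elevation score even though one restricts the true range of ratee performance and the other exaggerates it; the unsquared ratee-level deviations preserve that distinction, which is the kind of information the raw-deviation analyses proposed in the present study are intended to recover.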
In addition to addressing these three limitations, the present study also examines the possibility that the accuracy feedback component is the sole factor responsible for improvements in rating accuracy. Both FOR and RVT propose that the combination of lecture and feedback produces increases in rating accuracy. None of the rater training research, however, has investigated whether increases in rating accuracy are simply a product of providing raters with quantitative accuracy feedback. This study examines this possibility by including a second control group that receives accuracy feedback without a lecture on how to improve the quality of ratings. By addressing these four limitations, the present study seeks to improve and expand upon the Hauenstein et al. (1999) study.

Purpose of Performance Appraisal

In the realm of rater training there has been a de-emphasis on the use of rater error training. In its place have emerged FOR training and, more recently, rater variability training. Initial trials of RVT, however, failed to yield empirical distinctions between FOR and RVT. One explanation for this lack of empirical distinction may be that there was no purpose manipulation. Consequently, the rater training literature is in need of methodologically sound research that controls for both training format and the intended use of the appraisal data.

Landy and Farr (1980), in their process model of performance appraisal, identify purpose of rating as a significant component in the rating process. They suggest that the purpose of the performance appraisal affects both the observation and the recall of behavior, as well as the evaluation of performance. In fact, it has been suggested that purpose of appraisal is the most important contextual factor for understanding performance appraisal processes and outcomes (Jawahar & Stone, 1997). Furthermore, Cleveland and Murphy (1992) state that ratings that are "used for one purpose may not (under similar circumstances) yield the same outcome when the appraisal system is used for a different purpose" (p. 138).

Cleveland, Murphy, and Williams (1989) identified twenty different organizational uses of appraisal data. They found that the majority of these purposes could be dichotomized into two general classifications: Developmental Decisions and Administrative Decisions. When performance appraisal information is intended to be used for developmental purposes, employees receive concrete feedback about their job performance. This serves a valuable function because, in order to improve performance in the future, employees need to know what their weaknesses were in the past and how to correct them. This also enables supervisors to identify which employees would receive the most benefit from additional training. Pointing out strengths and weaknesses is a coaching function for the supervisor, while receiving meaningful feedback and acting upon it is a motivational experience for the subordinate (Klimoski & Inks, 1990). In this way, performance appraisals serve as vehicles for personal development. Hence, the ultimate goal of developmental feedback is performance improvement. On the other hand, making and carrying out employment decisions are the fundamental goals of administrative decision-making. Administrative decisions include deciding which employees to promote,
