2.4 Raters
Rater training is an important part of assessing speaking: it is how raters come to a common understanding and application of a scale. Because the standard for speaking assessment procedures involving high-stakes decisions is an inter-rater reliability coefficient of 0.80, some variability among raters is expected and tolerated. Under optimal conditions, the sources of error associated with the use of a scale are expected to be random rather than systematic. Research therefore aims to identify and control systematic error resulting from rater performance.
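The 0.80 standard mentioned above is often operationalized as a correlation between two raters' scores on the same set of performances. A minimal sketch, with invented illustrative scores (not data from any real assessment), of how such a coefficient might be computed:

```python
# Minimal sketch: estimating inter-rater reliability as the Pearson
# correlation between two raters' scores on the same performances.
# The score lists below are hypothetical, for illustration only.

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical scores from two raters on the same ten speaking samples
rater_a = [4, 3, 5, 2, 4, 3, 5, 4, 2, 3]
rater_b = [4, 4, 5, 2, 3, 3, 5, 4, 3, 3]

r = pearson_r(rater_a, rater_b)
print(f"inter-rater reliability r = {r:.2f}")
print("meets 0.80 standard" if r >= 0.80 else "below 0.80 standard")
```

Note that a high correlation only shows the raters rank candidates similarly; a systematically harsh rater can still correlate highly with a lenient one, which is why the severity effects discussed below require separate scrutiny.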
One type of systematic error results from a rater’s tendency to assign either harsh or lenient scores. When such a pattern is identified in comparison with other raters in a pool, a rater may be identified as negatively or positively biased. Systematic effects on score assignment have been found in association with rater experience, rater native language background, and examinee native language background. Every effort should be made to identify and remove such effects, as their presence negatively affects the accuracy, utility, interpretability, and fairness of the scores we report.
With fairness at issue, researchers have studied factors affecting ratings. One study compared Japanese language teachers and professional tour guides in their assignment of scores to 51 Japanese tour guide candidates. While no differences were found in the scores assigned, the two pools of raters applied different criteria: teachers tended to focus on grammar, vocabulary, and fluency, while tour guides tended to focus on pronunciation. Another study examined the performance of three rater groups who differed in professional background and place of residence and found a tendency for the teachers to rate grammar more harshly in comparison to the nonteaching groups, who emphasized communicative success. A further study compared native-speaking English raters from Australia, Canada, the UK, and the USA and found that raters from the UK were the harshest while raters from the USA were the most lenient. Differences in raters’ application of a scale have been found not only across raters of different backgrounds and experiences, but also across trained raters of similar backgrounds.
Studies comparing native speakers and nonnative speakers as raters have produced mixed findings. While some studies have identified tendencies for non-native speakers to assign harsher scores, others have found the opposite. In Winke, raters whose first language backgrounds matched those of the candidates were found to be more lenient when rating second language English oral proficiency, and the authors suggest that this effect may be due to familiarity with accent. In an attempt to ameliorate such potential effects, researchers provided special training for Indian raters who were evaluating the English language responses of Indian examinees on the TOEFL iBT. While the performance of the Indian raters was found comparable to that of Educational Testing Service raters both before and after the training, the Indian raters showed some improvement and increased confidence after participating in the training. Far fewer studies have been conducted on differences in ratings assigned by interviewers; however, there is no reason to expect that interviewers would be less subject to interviewer effects than raters are to rater effects. Indeed, an examination of variability across two interviewers with respect to how they structured the interview, their questioning techniques, and the feedback they provided identified differences that could easily result in different score assignments as well as different interpretations of the interviewee’s ability.
These findings underscore the importance of rater training; however, the positive effects of training tend to be short-lived. In a study examining rater severity over time, Lumley and McNamara found that many raters tended to drift over time. The phenomenon of rater drift calls into question the practice of certifying raters once and for all after completion of a single training program, and it highlights the importance of ongoing training in order to maintain rater consistency. A more important concern raised by studies of rater variability, and one that rater training can only partially address, is whose standard is the more appropriate to apply: that of an experienced rater, an inexperienced rater, a teacher, a native speaker, or a non-native speaker.
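Rater drift of the kind Lumley and McNamara describe can be monitored by tracking a rater's severity, that is, the signed difference between their scores and the panel average for the same performances, across successive scoring sessions. A minimal sketch with invented session data (the scores and panel means are hypothetical, not from the study):

```python
# Minimal sketch of a rater-drift check: track one rater's severity
# (their score minus the panel average for the same performance)
# across scoring sessions. All numbers are invented for illustration.

def severity(rater_scores, panel_means):
    """Mean signed difference between a rater and the panel average.
    Negative values indicate harshness relative to the panel."""
    diffs = [r - p for r, p in zip(rater_scores, panel_means)]
    return sum(diffs) / len(diffs)

# Hypothetical sessions: one rater's scores paired with panel means
sessions = [
    ([4, 3, 5, 4], [4.0, 3.2, 4.8, 4.1]),  # at initial certification
    ([3, 3, 4, 3], [4.1, 3.5, 4.6, 3.9]),  # some months later
]

for i, (scores, panel) in enumerate(sessions, 1):
    print(f"session {i}: severity = {severity(scores, panel):+.2f}")
```

In this invented example the rater's severity moves from roughly zero toward a markedly negative value, the kind of harshness drift that periodic retraining and recalibration are intended to catch.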