Rating Design for Field Trial
The field trial responses were rated between September and October 2007. To analyze the effects of the design factors (tasks, rating criteria, raters, student proficiencies), the following two rating designs were used. In the first, the so-called multiple marking design, all raters worked in groups of four, with the members of each group independently rating the same set of selected student responses. For this, 30 responses from each of the 13 booklets were randomly chosen and allocated to the rater groups in a Youden square design (Preece, 1990; see Frey et al., 2009). The Youden square design is a particular form of incomplete block design that in our case ensured a linkage of ratings across all booklets and an even distribution of rater combinations across booklets. The resulting linkage of students, tasks, and raters allowed us to perform variance component analyses motivated by generalizability theory (g-theory), as described next.
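The linkage property of a Youden square can be illustrated with a minimal sketch. It assumes, purely for illustration, 13 rater groups that each receive 4 of the 13 booklets, and it builds the design cyclically from the difference set {0, 1, 3, 9} modulo 13. The number of groups, the block size, and the function name `cyclic_youden` are hypothetical and are not taken from the study.

```python
from collections import Counter
from itertools import combinations


def cyclic_youden(n_treatments=13, base_block=(0, 1, 3, 9)):
    """Build a cyclic Youden design (illustrative, not the study's layout).

    Returns one row per block position and one column per block.
    Here a column is read as the booklets seen by one hypothetical rater group.
    """
    return [
        [(start + shift) % n_treatments for start in range(n_treatments)]
        for shift in base_block
    ]


rows = cyclic_youden()
blocks = [tuple(column) for column in zip(*rows)]  # booklets per hypothetical group

# Row property: every booklet occurs exactly once in each position.
assert all(sorted(row) == list(range(13)) for row in rows)

# Balance property: every pair of booklets shares exactly one block,
# which is what links ratings across all booklets.
pair_counts = Counter(
    frozenset(pair) for block in blocks for pair in combinations(block, 2)
)
assert len(pair_counts) == 78 and set(pair_counts.values()) == {1}
```

Because every pair of booklets meets in exactly one block, no booklet is isolated from the others, which is the property that makes a joint analysis across booklets possible.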
The second design, the so-called single marking design, allocated all student responses randomly to all raters, with each rater rating an equal number of responses and each response being rated once. This design allowed controlling for systematic rater effects by ensuring an approximately balanced allocation of student responses across tasks to different raters.
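A balanced random allocation of this kind could be sketched as follows. The sketch assumes hypothetical response and rater identifiers and hypothetical counts (4 tasks, 30 responses per task, 6 raters). It shuffles the responses, groups them by task, and deals them out in turn, so that each response is rated once and each rater receives a near-equal share of every task. It is an illustration of the idea, not the procedure used in the study.

```python
import random


def allocate_single_marking(responses, raters, seed=2007):
    """Deal (response_id, task) pairs to raters; returns {rater: [pairs]}.

    All names and the seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    pool = list(responses)
    rng.shuffle(pool)                    # random order overall
    pool.sort(key=lambda item: item[1])  # stable sort: group by task, random within task
    allocation = {rater: [] for rater in raters}
    for index, item in enumerate(pool):
        # Round-robin dealing gives each response exactly one rater.
        allocation[raters[index % len(raters)]].append(item)
    return allocation


# Hypothetical data: 4 tasks with 30 responses each, 6 raters.
responses = [(f"resp{t}_{k}", f"task{t}") for t in range(4) for k in range(30)]
raters = [f"rater{j}" for j in range(6)]
allocation = allocate_single_marking(responses, raters)
```

With these counts each rater receives 20 responses, 5 from each task. When the counts do not divide evenly, the shares differ by at most one response.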
The multifaceted Rasch analyses, which are described next, are based on the combined data from the two rating designs to ensure a sufficiently strong linkage between tasks, raters, and students.
Data Analysis
Given the previous considerations about quality control in rating procedures and the lack of a strong research base for level-specific approaches to assessing writing proficiency, the primary objective of the study we report is to establish the psychometric qualities of the writing tasks using the ratings from the field trial. This is critical because the ratings form the basis of inferences about task difficulty estimates, raters' performance, and students' proficiency estimates. If the rating quality is poor vis-à-vis the design characteristics used in the field trial, the defensibility of any resulting narratives about the difficulty of the writing tasks and their alignment to the CEFR levels is compromised.
In more specific terms, the two research questions for this study based on the primary objective are as follows:
RQ1: What are the relative contributions of each of the design factors (tasks, criteria, raters, students) to the overall variability in the ratings of the HSA and MSA student samples?
RQ2: Based on the analyses in RQ1, how closely do empirical estimates of task difficulty and a priori estimates of task difficulty by task developers align? Is it possible to arrive at empirically grounded cut-scores in alignment with the CEFR using suitable statistical analyses?
We answer both research questions separately for students from the HSA and MSA school tracks. We do this primarily because preliminary calibrations of the writing data, which are not the main focus of this article, and of related data from the reading and listening proficiency tests suggested that separate calibrations lead to more reliable and defensible interpretations. This decision was also partly politically motivated by the need for consistent and defensible reporting strategies across reading, listening, and writing proficiency tests.
To answer the first research question, we use descriptive statistics of the rating data as well as variance components analyses grounded in generalizability theory, which decomposes the overall variation in ratings according to the relative contribution of each of the design factors listed previously. To answer the second question, we take into account the interaction effects of our design facets on the variability of the ratings via multifaceted Rasch modeling. This is a parametric latent-variable approach that goes beyond identifying the influence of individual design factors and statistically corrects for potential biases in the resulting estimates of task and criteria difficulty, rater performance, and student proficiency. Using descriptive statistics, g-theory analyses, and multifaceted Rasch model analyses in concert helps to triangulate the empirical evidence for the rating quality and to illustrate the different kinds of inferences supported by each analytic approach.
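The logic of a variance components decomposition can be shown in a deliberately simplified sketch. It assumes a fully crossed students-by-raters design with one rating per cell and estimates the components from the ANOVA mean squares. The study itself involves more facets (tasks and criteria) and an incomplete design, so this is not the analysis reported here. The function name and the ratings are hypothetical.

```python
def variance_components(scores):
    """Estimate variance components for a crossed students x raters table.

    scores: list of rows, one row per student, one column per rater.
    Negative estimates are truncated at zero, a common convention.
    """
    n_p, n_r = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_r)
    p_means = [sum(row) / n_r for row in scores]
    r_means = [sum(row[j] for row in scores) / n_p for j in range(n_r)]

    # Sums of squares for students, raters, and the residual
    # (student-by-rater interaction confounded with error).
    ss_p = n_r * sum((m - grand) ** 2 for m in p_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in r_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = max(ss_total - ss_p - ss_r, 0.0)

    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))

    return {
        "student": max((ms_p - ms_res) / n_r, 0.0),
        "rater": max((ms_r - ms_res) / n_p, 0.0),
        "residual": ms_res,
    }


# Hypothetical ratings: 4 students (rows) rated by 3 raters (columns).
ratings = [[2, 3, 2], [4, 4, 3], [1, 2, 1], [3, 4, 3]]
components = variance_components(ratings)
total = sum(components.values())
shares = {name: value / total for name, value in components.items()}
```

The `shares` dictionary expresses each component as a proportion of the total estimated variance, which is the form in which relative contributions of design factors are usually compared. A large rater share relative to the student share would signal that ratings depend heavily on who did the rating.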