Assessing productive and interactive skills
per sentence and is scored B, the machine would then predict that
words per sentence might be a useful feature in telling the difference
between an A and a B level essay. According to this hypothesis, the
next essay, which has 18 words per sentence, should score A. If the
prediction is right, the model is strengthened; if it is wrong, the machine
makes adjustments and reduces the importance given to sentence
length. Of course, no automated scoring system could actually rely on
just one or two features; some take account of hundreds or even
thousands of features. Once the machine has developed a model that
works well on the samples it has been given, it can then be used to
score performances that have not previously been rated by humans.
At present, automated scorers have to be retrained every time a new
prompt is used. The levels of investment required and the number of
samples needed for training put them beyond the means of most
organisations. Outside a few large testing agencies, until more generic
automated scoring systems can be developed, human judges working
with rating scales will probably remain the standard means of obtaining
consistent scores on performance tests.
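As a rough illustration of the training cycle described above, the sketch below fits a single-feature scorer in Python. The essays, scores, words-per-sentence figures and the simple weight-adjustment rule are all invented for the example; operational systems draw on hundreds or thousands of features and far more sophisticated statistical models.

    # An illustrative single-feature scorer: the data and the update rule
    # are assumptions made for this sketch, not any operational system.

    # Hand-rated training samples: (words per sentence, human score)
    training_data = [
        (22.0, "A"), (19.5, "A"), (21.0, "A"), (20.0, "A"),
        (12.0, "B"), (10.5, "B"), (11.0, "B"), (13.5, "B"),
    ]

    # Centre the feature on its mean so a single weight can act as the model.
    mean_wps = sum(wps for wps, _ in training_data) / len(training_data)

    weight = 0.0        # the importance given to sentence length
    learning_rate = 0.1

    def predict(words_per_sentence):
        """Predict 'A' when the weighted, centred feature is positive."""
        return "A" if weight * (words_per_sentence - mean_wps) > 0 else "B"

    # Compare predictions with the human scores; when a prediction is wrong,
    # increase or reduce the importance given to the feature accordingly.
    for _ in range(20):
        for wps, human_score in training_data:
            if predict(wps) != human_score:
                direction = 1 if human_score == "A" else -1
                weight += direction * learning_rate * (wps - mean_wps)

    # Once the model agrees with the hand-rated samples, it can be used on
    # essays that no human has scored.
    print(predict(18.0))    # an unseen essay with 18 words per sentence -> 'A'

Even in this toy form, the cycle matches the description above: a prediction is checked against a human judgement, a wrong prediction triggers an adjustment to the weight given to the feature, and the trained model is then applied to performances that have not been rated.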
Scales can be developed through a wide range of methodologies.
Best practice in development will bring together theoretical insights,
evidence of the language elicited by test tasks and the practical needs
of assessors. Validity arguments supporting the use of rating scales in
the assessment of spoken language will need to show not only that the
descriptors reflect theories of spoken language use and the nature of
the language actually elicited by the testing procedures, but also that
raters are interpreting the scales in the ways intended.
Task 6.5
Look at the table of task types on the website.
If you work as a teacher, what kind of short answer tests would suit
your students’ needs? Why?
What kinds of extended response tasks would be suitable? Why?
What kind of marking of written and spoken performance seems to best
suit the teaching or testing situation you are in? Why is this suitable?
Would you ever use error counts or impression marking? Under what
circumstances?
Consider a group of language learners that you know; suggest some
descriptors that would be suitable to include in rating scales to assess
their writing or speaking. Draft a rating scale using these.
Compare your rating scale with examples you can find on the internet
from other assessment systems. What differences do you notice?