Methodology and Results
Automated Move and Step Identification
The RWT is intended for use as a computer-assisted aid to academic writing instruction that
focuses on teaching the research article writing conventions in terms of communicative
effectiveness and rhetorical structure realized through communicative moves and functional
steps. Thus, our task was to build an analysis engine capable of classifying texts into moves
and steps. We approached the identification of these discourse units as a supervised
classification problem and employed a process of corpus data annotation, feature selection,
sentence representation, and training leading to classification (see Burstein et al., 2003;
Pendar & Cotos, 2008). Following this approach, we considered each sentence in a text as
an independent unit of analysis to be classified into two categories: one corresponding to a
move and the second corresponding to a step within the identified move.
Feature Construction and Sentence Representations
One of the main challenges of the problem was devising a representation that is suitable
for automated classification. To represent sentences, we used sets of word k-grams (i.e.,
sequences of k adjacent words). In our experiments, k-grams were computed for k = 1
(i.e., unigrams), k = 2 (i.e., bigrams), and k = 3 (i.e., trigrams). To prepare the feature
set for the classification task, the sentences were pre-processed by stemming, by
using regular expressions to replace genre-specific discourse units (e.g., references and
years) with appropriate strings, and by removing low-frequency k-grams to avoid
overfitting and to reduce noise. In addition, the OpenNLP software package was
used to annotate each word with its part-of-speech. The sequence of part-of-speech
tokens was then used to compute sets of part-of-speech k-grams for k = 1, k = 2, and
k = 3. Our initial hypothesis was that these additional features might improve classification
results by providing the classifiers with syntactic information.
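The pre-processing and k-gram steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the regular expressions, the placeholder strings `reftoken` and `yeartoken`, the function names, and the `min_count` threshold are all hypothetical, and stemming and part-of-speech tagging are omitted (they would be applied to the token list before k-grams are computed).

```python
import re
from collections import Counter

# Hypothetical patterns: the source does not give its actual regular expressions.
CITATION_RE = re.compile(r"\([^()]*\b(?:19|20)\d{2}[a-z]?[^()]*\)")
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def normalize(sentence):
    """Lowercase, replace citations and years with placeholders, and tokenize."""
    text = sentence.lower()
    text = CITATION_RE.sub(" reftoken ", text)   # parenthetical references
    text = YEAR_RE.sub(" yeartoken ", text)      # remaining bare years
    return re.findall(r"[a-z]+", text)


def kgrams(tokens, k):
    """Return every sequence of k adjacent tokens as a tuple."""
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


def build_vocabulary(token_lists, k, min_count=2):
    """Keep only k-grams seen at least min_count times across all sentences."""
    counts = Counter(g for tokens in token_lists for g in kgrams(tokens, k))
    return sorted(g for g, c in counts.items() if c >= min_count)
```

The same `kgrams` and `build_vocabulary` functions apply unchanged to a sequence of part-of-speech tags, since they only assume a list of string tokens.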
We considered each sentence as an item to be classified into a move and a step. Given a
set of n k-grams, a sentence is represented as an n-dimensional vector in the Euclidean
space $\mathbb{R}^n$. Formally, each sentence $c_i$ is represented as
$c_i = \langle f_1, f_2, f_3, \ldots, f_n \rangle$, where each $f_j$ is equal to the number
of times the j-th k-gram occurs in sentence $c_i$. This
representation was computed for k = 1 through 3 on both the sequence of stemmed
words and the sequence of part-of-speech tokens. Below, we describe how classifiers
were trained for each value of k and for each type of sequence to classify sentences into a
move and a step.
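As a small illustration of the count-vector representation defined above, the sketch below maps a tokenized sentence to a vector whose j-th entry is the number of occurrences of the j-th k-gram. The function name `vectorize` and the toy vocabulary are assumptions made for this example only; the source does not describe its implementation.

```python
def vectorize(tokens, vocabulary, k):
    """Return [f_1, ..., f_n], where f_j counts the j-th k-gram in the sentence."""
    index = {gram: j for j, gram in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    # zip over k shifted copies of the token list yields all adjacent k-tuples
    for gram in zip(*(tokens[i:] for i in range(k))):
        j = index.get(gram)
        if j is not None:  # k-grams outside the vocabulary are ignored
            vector[j] += 1
    return vector


# Toy example with a three-entry unigram vocabulary (hypothetical data)
vocabulary = [("the",), ("results",), ("show",)]
tokens = ["the", "results", "of", "the", "study"]
print(vectorize(tokens, vocabulary, k=1))  # [2, 1, 0]
```

K-grams that were pruned from the vocabulary (for example, low-frequency ones) simply contribute nothing to the vector, so every sentence maps to a vector of the same length n.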