Table 4.6: Frequency lists for top 25 words in TravCorp and SettCorp
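| Number | TravCorp word | TravCorp frequency | TravCorp % | SettCorp word | SettCorp frequency | SettCorp % |
|---|---|---|---|---|---|---|
| 1 | you | 121 | 3.81 | the | 494 | 3.94 |
| 2 | the | 120 | 3.78 | you | 346 | 2.76 |
| 3 | go | 79 | 2.49 | it | 339 | 2.71 |
| 4 | it | 66 | 2.08 | I | 252 | 2.01 |
| 5 | to | 64 | 2.02 | to | 233 | 1.86 |
| 6 | on | 52 | 1.64 | a | 227 | 1.81 |
| 7 | a | 50 | 1.57 | and | 194 | 1.55 |
| 8 | now | 48 | 1.51 | of | 168 | 1.34 |
| 9 | out | 46 | 1.45 | that | 162 | 1.29 |
| 10 | I | 45 | 1.42 | in | 153 | 1.22 |
| 11 | no | 43 | 1.35 | is | 152 | 1.21 |
| 12 | and | 41 | 1.29 | yeah | 146 | 1.17 |
| 13 | there | 37 | 1.17 | no | 143 | 1.14 |
| 14 | get | 36 | 1.13 | it’s | 134 | 1.07 |
| 15 | me | 34 | 1.07 | on | 124 | 0.99 |
| 16 | in | 32 | 1.01 | what | 111 | 0.89 |
| 17 | that | 32 | 1.01 | do | 110 | 0.88 |
| 18 | here | 30 | 0.94 | we | 103 | 0.82 |
| 19 | I’m | 29 | 0.91 | now | 98 | 0.78 |
| 20 | daddy | 28 | 0.88 | was | 95 | 0.76 |
| 21 | goin | 27 | 0.85 | have | 92 | 0.73 |
| 22 | way | 27 | 0.85 | one | 90 | 0.72 |
| 23 | what | 27 | 0.85 | there | 89 | 0.71 |
| 24 | yeah | 27 | 0.85 | like | 83 | 0.66 |
| 25 | look | 26 | 0.82 | all | 80 | 0.64 |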
In Section 4.3, frequency lists were compared using the % column. However,
frequency lists also allow for the normalisation of raw frequency counts between
corpora. In this study, due to the contrasting size of the corpora, normalised
results are given per 10,000 words. For example, “the” occurs 120 times in the
3,172 words of TravCorp. To normalise per 10,000 words, the raw count is multiplied
by 10,000 ÷ 3,172 ≈ 3.15: 120 × 3.15 = 378 instances of “the” per 10,000 words. In
SettCorp, on the other hand, “the” occurs 494 times in 12,531 words. Here the raw
count is divided by 12,531 ÷ 10,000 ≈ 1.25: 494 ÷ 1.25 = 395 instances of “the” per
10,000 words. It should be pointed out here that the raw frequencies in Table 4.6
are calculated on the basis of individual types. Therefore, “I” appears at position
10 of the frequency list for TravCorp, and “I’m” appears at position 19. Similarly,
“is” appears at position 11 in SettCorp and “it’s” at position 14. Accordingly,
where a grammatical term such as “I” is analysed (see Chapter 5), it is the lemma I
that is analysed. The lemma of a word is its canonical form; the I lemma therefore
consists of the lexemes “I’m”, “I’ve”, “I’ll” and “I’d”.
5 A type refers to a unique word form in a corpus.
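The normalisation arithmetic described above can be sketched in a few lines of Python. This is an illustrative sketch only: the helper name `normalise` and its `base` parameter are assumptions introduced here, not part of WordSmith Tools or of the study's procedure. The counts and corpus sizes are those reported for “the” in Table 4.6.

```python
def normalise(raw_count: int, corpus_size: int, base: int = 10_000) -> float:
    """Return the frequency of an item per `base` words of running text.

    `normalise` is a hypothetical helper written for this illustration.
    """
    return raw_count / corpus_size * base


# Raw counts of "the" and corpus sizes as reported in the text.
trav_the = normalise(120, 3_172)    # TravCorp
sett_the = normalise(494, 12_531)   # SettCorp

print(f"TravCorp: {trav_the:.0f} per 10,000 words")
print(f"SettCorp: {sett_the:.0f} per 10,000 words")
```

Because the sketch does not round the scaling factor, it gives roughly 378 for TravCorp and roughly 394 for SettCorp. The figure of 395 in the text arises from rounding the divisor to 1.25 before dividing.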
WordSmith also generates a Wordlist (S) which calculates a standardised type/token
ratio. A type/token ratio is ‘the average number of tokens per type’ (Baker, 2006:
54). If a corpus is above 2,000 words in size, WordSmith Tools™ calculates the
standardised type/token ratio by taking a type/token ratio every 2,000 words in
the corpus and then calculating the mean of all these ratios. The standardised
type/token ratio of TravCorp is 29.47 and that of SettCorp is 32.68, whereas the
ratio for the informal spoken conversations from the British National Corpus is
32.96⁶. These ratios, although slightly ‘crude’ in nature, may again point towards
the relative representativeness of SettCorp and TravCorp. Baker (2006: 71)
maintains that ‘frequency lists can be helpful in determining the focus of a text,
but care must be taken not to make presuppositions about the way that words are
actually used within it.’ In order to look at the way in which words are actually
used within the corpora, concordance lines provide a tool which enables the
researcher to perform a much closer examination.
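The chunk-and-average procedure behind a standardised type/token ratio can be illustrated with a minimal Python sketch. It is not WordSmith's implementation, and it rests on several assumptions made here for illustration: the ratio is expressed as types per 100 tokens (which is consistent with the magnitude of the values reported above), tokens are already lower-cased word forms, and any incomplete final chunk is discarded. The function names are hypothetical.

```python
def type_token_ratio(tokens: list[str]) -> float:
    """Types per 100 tokens (assumed scaling, see lead-in)."""
    if not tokens:
        raise ValueError("cannot compute a ratio for an empty token list")
    return len(set(tokens)) / len(tokens) * 100


def standardised_ttr(tokens: list[str], chunk_size: int = 2000) -> float:
    """Mean of the type/token ratios of consecutive full chunks.

    Assumption: a trailing chunk shorter than `chunk_size` is ignored.
    For a text shorter than one chunk, the plain ratio is returned.
    """
    chunks = [
        tokens[start:start + chunk_size]
        for start in range(0, len(tokens) - chunk_size + 1, chunk_size)
    ]
    if not chunks:
        return type_token_ratio(tokens)
    return sum(type_token_ratio(chunk) for chunk in chunks) / len(chunks)


# Toy token list (invented, not corpus data) with a tiny chunk size.
toy = ["a", "b", "a", "b", "a", "b", "c", "d"]
print(type_token_ratio(toy))                 # ratio over the whole list
print(standardised_ttr(toy, chunk_size=4))   # mean of the two chunk ratios
```

On the toy list the plain ratio is 50.0 while the standardised ratio is 75.0. This shows why the standardised measure is preferred when comparing corpora of different sizes: the plain ratio falls as a text grows and words repeat, whereas the chunked mean is computed over equal-sized stretches of text.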
4.4.2 Concordance lines
According to Sinclair (2003: 173), ‘a concordance is an index to the places in a text
where particular words and phrases occur.’ Visually, as Figure 4.2 shows, the
software programmes used to generate concordances generally present results in a
Key Word in Context (KWIC) format, which features a node word, the subject of the
query by the researcher, surrounded by the cotext, words that occur before and after
it:
6 Type/token ratio for the British National Corpus taken from Baker (2006: 52).
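The KWIC layout described in this section, a node word flanked by its cotext, can be sketched in Python as follows. This is an illustrative approximation of the format only, not the output or the algorithm of any particular concordancer. The function names, the `width` parameter (number of cotext words on each side) and the sample sentence are all invented for this example and are not drawn from TravCorp or SettCorp.

```python
def kwic(tokens: list[str], node: str, width: int = 4) -> list[tuple[str, str, str]]:
    """Return (left cotext, node, right cotext) for every hit of `node`.

    Matching is case-insensitive; `width` is the number of words of
    cotext kept on each side (an assumption made for this sketch).
    """
    target = node.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


def format_kwic(hits: list[tuple[str, str, str]]) -> str:
    """Right-align the left cotext so the node words line up in a column."""
    left_width = max((len(left) for left, _, _ in hits), default=0)
    return "\n".join(
        f"{left:>{left_width}}  {node}  {right}" for left, node, right in hits
    )


# Invented sample utterance, used only to show the layout.
sample = "you go on now and I go out there".split()
print(format_kwic(kwic(sample, "go", width=2)))
```

Each output line places the node word in the same column, with the preceding words to its left and the following words to its right, which is what allows the researcher to scan recurring patterns around the node.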