were also recorded. Recently, two additional corpora, the British Academic Spoken
English corpus (BASE) and the Limerick-Belfast Corpus of Academic Spoken
English (Li-Bel CASE), have been designed as companions to MICASE. The BASE
corpus contains 1.6 million words, whereas Li-Bel CASE, when completed, will
hold one million words. In addition, there are a number of spoken corpora that
represent specific social groupings. For example, the Bergen Corpus of London
Teenage Language (COLT) is a half-million-word corpus of spontaneous teenage
talk. This corpus distinguishes between speaker-specific information (for example, gender, age,
social class) and context-specific information (location and setting).
Thus far, it seems that the major spoken corpora are substantial, at half a
million words or more. In relation to corpus size, Sinclair (2004: 189) maintains that
‘there is no virtue in being small. Small is not beautiful; it is simply a limitation.’
Nevertheless, small corpora may be more adept than
larger ones at capturing the fine-grained distinctions that exist between registers.
Biber et al.’s (1999) forty million word Longman Spoken and Written English
Corpus (LSWE) is divided into six registers: conversation, fiction, newspaper
language, academic prose, non-conversational speech and general prose. However,
within each of these registers is an enormous amount of variation. For example,
Hunston (2002) notes that newspaper language contains a variety of newspaper types
(for example, broadsheet and tabloid) in addition to a range of article types (hard
news, letters, sport, business etc.). Indeed, it could be argued that conversation
contains an even wider variation of types. Therefore, for larger corpora such as the
one used in Biber et al.’s (1999) grammar, ‘to make distinctions between ‘smaller’
registers would quickly become unmanageable’ (Hunston, 2002: 161). Studies of small
corpora, by contrast, have highlighted the range of variation that exists both within and between different
language varieties and registers.
Small corpora have allowed researchers to identify linguistic characteristics of
particular spoken registers. Vaughan (2007, 2008) uses a 40,000 word corpus of
meetings of English language teachers (C-MELT) to explore particular linguistic
features characteristic of this community of practice. For example, the size of
C-MELT allowed specific instances of humour to be isolated in order that they might
be assigned a function. Vaughan (2007: 186) found that teachers ‘use [humour] to
study, it is proposed that the datasets used provide a basis for a more in-depth
interpretation of the linguistic characteristics of both families. Therefore, the data
from the settled family and from the Traveller family will subsequently be referred
to as SettCorp and TravCorp respectively.