4.2 Spoken corpora and corpus size
Building a spoken corpus can initially be a daunting experience. Assembling a large
amount of spoken data is associated with high costs because of the difficulties
involved in recording, transcribing and coding the data. In addition, the
representativeness and balance of many large spoken corpora could be questioned as
there is no definitive list of spoken genres and certain speech contexts, for example,
family discourse, have proven difficult to access (see McCarthy, 1998). However,
this has not deterred corpus builders. It is now possible to access a range of spoken
corpora designed for a variety of purposes. Corpora such as the American National
Corpus (ANC) and the British National Corpus (BNC) are designed to represent the
language varieties of American and British English respectively and are also
designed to be comparable across genres. The BNC contains 100 million words, of
which 10 million are spoken.² In order to achieve representativeness, one part of the
spoken corpus was collected by a process of demographic sampling. Texts were
collected from individuals and demographic information such as name, age,
occupation, sex and social class was noted. This was further subdivided into region
and interaction type (monologue or dialogue). The demographically sampled corpus
was complemented by texts collected on context-governed criteria. These texts
related to more formal speech contexts such as those encountered in educational or
business settings (see Aston and Burnard, 1998 for a full description of the design of
the BNC).
² Almost 15 million words of the ANC are currently available, divided into approximately
11.5 million words of written language and 3.5 million words of spoken language (see www.anc.org).
The International Corpus of English (ICE), a project that has been in place for
almost twenty years and involves eighteen research teams in different countries
across the globe, comprises 60% spoken texts and 40% written texts. The ICE corpus,
when complete, will provide a range of one-million-word corpora of English from
countries where English is a first or major language. As in the BNC, the spoken
component of ICE contains 60% dialogic and 40% monologic material; the dialogues are
divided into public and private, and the monologues into scripted and unscripted
(see Meyer, 2002). In the ICE corpus, the speakers chosen were adults of eighteen
years of age or older who had received a formal education through the medium of
English to at least secondary school level (however, this design proved to be flexible
in the case of well-known, established political leaders and radio or television
broadcasters whose public status made their inclusion appropriate). Information was
also recorded about sex, ethnic group, region, occupation, status in occupation
and role in relation to other participants (Greenbaum, 1991).
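The proportions quoted above can be turned into approximate word targets for a single ICE component. The following is only an illustrative sketch: it assumes that the percentages, which the text gives for texts, carry over directly to word counts, and that each component contains exactly one million words. The function and variable names are hypothetical and do not come from the ICE project.

```python
# Illustrative sketch: word targets implied by the percentages quoted in the
# text for one ICE component. Assumes the text-level percentages apply equally
# to word counts; all names here are hypothetical.
ICE_COMPONENT_WORDS = 1_000_000


def share(total: int, percent: int) -> int:
    """Return `percent` per cent of `total`, using integer arithmetic."""
    return total * percent // 100


def ice_breakdown(total_words: int = ICE_COMPONENT_WORDS) -> dict[str, int]:
    """Split a component into spoken/written, then spoken into dialogue/monologue."""
    spoken = share(total_words, 60)    # 60% spoken
    written = share(total_words, 40)   # 40% written
    return {
        "spoken": spoken,
        "written": written,
        "spoken_dialogue": share(spoken, 60),   # 60% of the spoken material
        "spoken_monologue": share(spoken, 40),  # 40% of the spoken material
    }


if __name__ == "__main__":
    for category, words in ice_breakdown().items():
        print(f"{category}: {words:,} words")
```

Under these assumptions, the spoken material amounts to 600,000 words per component, of which 360,000 are dialogic and 240,000 monologic, leaving 400,000 words of written material.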
Among exclusively spoken corpora, the five-million-word Cambridge and
Nottingham Corpus of Discourse in English (CANCODE) is designed to
represent spoken British (and some Irish) English. In their initial corpus design
phase, the CANCODE team developed a set of spoken text-types to correspond to
existing text typologies for the written language. They adopted what McCarthy
(1998) terms a ‘genre-based’ approach, in which not only is a population of speakers
targeted, but the context and environment in which the speech is produced are also
taken into consideration. The framework used for CANCODE sought to combine the
nature of speaker relationship with goal-types prevalent in everyday, spoken
interaction. The nature of the speaker relationship was divided into five broad
contexts: transactional, professional, pedagogical, socialising and intimate. For
each of these contexts, three goal-types were identified: information provision,
collaborative task and collaborative idea (see Section 4.3 for a definition of the
terms), and these are operationalised in Table 4.2: