Hurry up baby son all the boys is finished their breakfast

Clancy, B. (2010) A socio-pragmatic analysis of Irish settled and Traveller family discourse (PhD Thesis)

4.2 Spoken corpora and corpus size

Building a spoken corpus can initially be a daunting experience. Assembling a large
amount of spoken data is associated with high costs because of the difficulties
involved in recording, transcribing and coding the data. In addition, the
representativeness and balance of many large spoken corpora could be questioned as
there is no definitive list of spoken genres and certain speech contexts, for example,
family discourse, have proven difficult to access (see McCarthy, 1998). However,
this has not deterred corpus builders. It is now possible to access a range of spoken
corpora designed for a variety of purposes. Corpora such as the American National
Corpus (ANC) and the British National Corpus (BNC) are designed to represent the
language varieties of American and British English respectively and are also
designed to be comparable across genres. The BNC contains 100 million words, of
which 10 million are spoken.² In order to achieve representativeness, one part of the
spoken corpus was collected by a process of demographic sampling. Texts were
collected from individuals and demographic information such as name, age,
occupation, sex and social class was noted. This was further subdivided into region
and interaction type (monologue or dialogue). The demographically sampled corpus
was complemented by texts collected on context-governed criteria. These texts
related to more formal speech contexts such as those encountered in educational or
business settings (see Aston and Burnard, 1998 for a full description of the design of
the BNC).

2 Almost 15 million words of the ANC are currently available. This is divided into approximately
11.5 million words of written language and 3.5 million words of spoken language (see www.anc.org).

The International Corpus of English (ICE), a project that has been in place for
almost twenty years and involves eighteen research teams in different countries
across the globe, comprises 60% spoken texts and 40% written texts. The ICE corpus,
when complete, will provide a range of one-million-word corpora of English from
countries where English is a first or major language. Similar to the BNC, the spoken
component of ICE contains 60% dialogic and 40% monologic material; the dialogues are
divided into public and private, and the monologues into scripted and unscripted
(see Meyer, 2002). In the ICE corpus, the speakers chosen were adults of eighteen
years of age or older who had received a formal education through the medium of
English to at least secondary school level (however, this design proved to be flexible
in the case of well-known, established political leaders and radio or television
broadcasters whose public status made their inclusion appropriate). Information was
also recorded about sex, ethnic group, region, occupation, status in occupation
and role in relation to other participants (Greenbaum, 1991).
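
The proportions reported above for ICE can be turned into approximate word counts. The following Python sketch is illustrative only: it assumes that the percentages apply to word counts and that a single component corpus holds exactly one million words; the function name `ice_breakdown` is hypothetical.

```python
def ice_breakdown(total_words: int = 1_000_000) -> dict[str, int]:
    """Split one ICE component corpus using the proportions given in the text.

    Assumption: the 60/40 splits are applied to word counts.
    """
    spoken = total_words * 60 // 100      # 60% spoken texts
    written = total_words - spoken        # remaining 40% written texts
    dialogic = spoken * 60 // 100         # 60% of the spoken component
    monologic = spoken - dialogic         # remaining 40% of the spoken component
    return {
        "spoken": spoken,
        "written": written,
        "dialogic": dialogic,
        "monologic": monologic,
    }


breakdown = ice_breakdown()
for label, words in breakdown.items():
    print(f"{label:>9}: {words:,} words")
```

Under these assumptions, a one-million-word corpus holds 600,000 spoken words, of which 360,000 are dialogic and 240,000 monologic.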

In relation to exclusively spoken corpora, the five-million-word Cambridge and
Nottingham Corpus of Discourse in English (CANCODE) is a corpus designed to
represent spoken British (and some Irish) English. In their initial corpus design
phase, the CANCODE team developed a set of spoken text-types to correspond to
existing text typologies for the written language. They adopted what McCarthy
(1998) terms a ‘genre-based’ approach where not only is a population of speakers
targeted, but the context and environment in which the speech is produced are also
taken into consideration. The framework used for CANCODE sought to combine the
nature of speaker relationship with goal-types prevalent in everyday, spoken
interaction. The nature of the speaker relationship was divided into five broad
contexts: transactional, professional, pedagogical, socialising and intimate. For
each of these contexts, three goal-types were identified: information provision,
collaborative task and collaborative idea (see Section 4.3 for a definition of the
terms), and these are operationalised in Table 4.2:
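
Read as a data structure, the framework described above is a grid of five contexts by three goal-types, giving fifteen cells. The following Python sketch is illustrative only and does not reproduce the layout of Table 4.2; it simply enumerates the context and goal-type pairings named in the text, and the function name `cancode_cells` is hypothetical.

```python
from itertools import product

# Context types and goal-types as named in the text (McCarthy, 1998).
CONTEXTS = (
    "transactional",
    "professional",
    "pedagogical",
    "socialising",
    "intimate",
)
GOAL_TYPES = (
    "information provision",
    "collaborative task",
    "collaborative idea",
)


def cancode_cells() -> list[tuple[str, str]]:
    """Return every (context, goal-type) pairing in the framework."""
    return list(product(CONTEXTS, GOAL_TYPES))


cells = cancode_cells()
for context, goal in cells:
    print(f"{context} / {goal}")
```

Each recorded conversation would then be assigned to one of these fifteen cells according to the relationship between the speakers and the goal of the interaction.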
