Chinese Lexicography in the Contemporary Period (Huang et al. 2016)




Corpora and dictionaries 
The field of dictionary making has long been influenced by empirical and corpus-based
methods. However, the early text collections ‘did not mean to be representative of the
language; rather, dictionary makers stressed the normative function of their work, aiming
to describe the “proper” use of words’ (McArthur 1996: 235). Today, a corpus refers to a
much larger collection of authentic data which is machine readable and can be processed
by a computer with different queries.
Language corpora have been used to construct dictionaries since the release of the
Collins Birmingham University International Language Database, COBUILD (Sinclair 1987).
The consensus among lexicographers and computational linguists is that statistical word
modeling and corpus support are indispensable to modern dictionary compilation.
Corpus linguistics benefits lexicography in three ways: providing authentic texts,
building lexical databases, and supporting dictionary compilation. A number of Chinese
mega-corpora have been compiled in the last three decades; some were sponsored by
governments, others were developed at the institutional level. Compared with English
corpora, the construction of Chinese corpora started late, when better computer technology
had become available and corpus-linguistic theories were already well developed. Unlike
English corpora, few Chinese corpora have so far been constructed with lexicography as an
explicit goal.
The corpus of the Center for Chinese Linguistics (CCL) of Peking University contains
more than 500 million characters. The data were collected from a balanced range of genres:
spoken language, fiction, popular magazines, newspapers and academic journals. As with
many Chinese corpora, the main corpus is not segmented or tagged. A small portion
(1 million words) was tagged and annotated with different grammatical and semantic
markers and served as the basis of the book The Grammatical Knowledge-base of
Contemporary Chinese: A Complete Specification, which has become a reference dictionary
for Chinese language processing in many institutions worldwide.
The Sinica Corpus was constructed at Academia Sinica in Taiwan in the 1990s under the
direction of Keh-jiann Chen and Chu-Ren Huang (Chen et al. 1996). It is the first fully
POS-tagged balanced Chinese corpus as well as the first Chinese corpus to be made available
on the World Wide Web. Like many modern balanced corpora, its content distribution
largely follows the original design of the Brown Corpus but is also influenced by the
designs of COBUILD and the BNC. It is unique among modern Chinese corpora in having the
full corpus manually checked word by word for both segmentation and POS tagging after
its initial automatic annotation. The Sinica Corpus is publicly available and freely
searchable on the internet (http://app.sinica.edu.tw/kiwi/mkiwi/). Its latest version,
Sinica 5.0, has more than 10 million words.

Huang et al. (2016) [Pre-publication draft]

Another widely used corpus of Chinese is the one-million-word Lancaster Corpus of
Mandarin Chinese (LCMC) (McEnery and Xiao 2004). Although smaller and later than
the two corpora mentioned above, the LCMC adopts the Brown/LOB
(Lancaster-Oslo-Bergen) balanced corpus format, with 500 texts of roughly 2,000 words
each from 15 different genres. This conventional set-up allows users of the LCMC to
readily compare it with English using the LOB or Brown corpus. However, its size and
format constraints also mean that it is often inadequate for modern computational
lexicographic studies, which typically require at least 10 million words (i.e. BNC size)
of natural and non-truncated texts.
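The size argument can be checked with a few lines of arithmetic. The sketch below is illustrative only: the constants are the figures quoted in the paragraph above, and the function and variable names are hypothetical, not part of any corpus toolkit.

```python
# Illustrative arithmetic only. The figures are those quoted in the text;
# every name here is a hypothetical helper introduced for this sketch.
TEXTS = 500                               # text samples in the Brown/LOB-style design
WORDS_PER_TEXT = 2_000                    # approximate length of each sample
MIN_WORDS_FOR_LEXICOGRAPHY = 10_000_000   # minimum size quoted in the text


def corpus_size(texts: int, words_per_text: int) -> int:
    """Approximate total size of a corpus built from fixed-length samples."""
    return texts * words_per_text


def shortfall_factor(size: int, minimum: int) -> float:
    """How many times larger the corpus would need to be to reach the minimum."""
    return minimum / size


lcmc_size = corpus_size(TEXTS, WORDS_PER_TEXT)
print(lcmc_size)                                                # 1000000
print(shortfall_factor(lcmc_size, MIN_WORDS_FOR_LEXICOGRAPHY))  # 10.0
```

Under these figures, a corpus in the Brown/LOB format is about one tenth of the size the text gives as the minimum for computational lexicographic work.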
Table 1 is a description of some important Chinese corpora: 
