Annexa 2







Author(s): Asunción Moreno, Climent Nadeu, José Mª Cazorla, Gonzalo Bustamante
Institute: Universitat Politècnica de Catalunya
Address: Jordi Girona 1-3 Mod D-5, 08034 Barcelona, Spain
email: asuncion@gps.tsc.upc.es
Date: December 17, 2006
Version: V1.6
Status: DRAFT


CONTENTS


1. Introduction
1.1 Speech file formats
1.2 Directory structure
1.3 File naming conventions
1.4 Label files
1.4.1 Label file header
1.4.2 Label file body
1.4.3 Example of label file
2. Database design and collection
2.1 Recording platform
2.2 Speaker recruitment
2.3 Design of prompting and prompt-sheet
3. Database contents definition
3.1 Application words (A1-6)
3.2 Isolated digits (I1-2, B1)
3.2.1 Single digit (I1-2)
3.2.2 Isolated-digits string (B1)
3.3 Connected digits (C1-4)
3.3.1 Sheet number (C1)
3.3.2 Telephone number (C2)
3.3.3 Credit card number (C3)
3.3.4 PIN code (C4)
3.4 Dates (D1-3)
3.4.1 Spontaneous date (D1)
3.4.2 Prompted date (D2)
3.4.3 Relative and general date expression (D3)
3.5 Application word phrase (E1)
3.6 Spelt items (L1-3)
Alternative Name
3.6.1 Spelt personal name (L1)
3.6.2 Spelt city name (L2)
3.6.3 Spelt artificial word (L3)
3.7 Money amount (M1)
3.8 Natural number (N1)
3.9 Directory assistance names (O1-7)
3.9.1 Prompted surname (O1)
3.9.2 Spontaneous city name (O2)
3.9.3 Prompted city name (O3)
3.9.4 Prompted company name (O5)
3.9.5 Prompted personal name (O7)
3.10 Yes/No (Q1-2)
3.11 Phonetically rich sentences (S1-9)
3.12 Times (T1-2)
3.12.1 Spontaneous time (T1)
3.12.2 Prompted time (T2)
3.13 Phonetically rich words (W1-4)
4. Orthographical transcription
5. Lexicon
6. Speaker demographic information
6.1 Accent/Regions
6.2 Speaker ages
7. Recording conditions
7.1 Environments for mobile network
7.2 Network providers
All the recordings are GSM.
7.3 Handsets
8. Test material
9. Deviations from general SALA II specifications
10. Sample sheets
10.1 Sample instruction sheet and sample prompt sheet
Gràcies per la seva col·laboració. S’ha realitzat un enregistrament complet. Adéu-siau.
Appendix 1. Credit Card Checksum Algorithm
Appendix 2. Credit card numbers
Appendix 3. PIN codes
Appendix 4. Surnames
Appendix 5. Cities
Appendix 6. Companies
Appendix 7. Full names





1. Introduction

The Catalan Database for Mobile Telephone Network was recorded within the scope of a research project supported by the Catalonian Government. The corpus was designed and collected at the Universitat Politècnica de Catalunya (UPC). Transcription and formatting were performed at Verbio Speech Technologies, Spain.


The owner of the database is the Catalonian Government.
This database is an intermediate deliverable. It comprises mobile (GSM) telephone recordings from 2000 speakers, recorded directly over the fixed PSTN using two analogue lines. Signals were sampled at 8 kHz and A-law encoded without automatic gain control. The database is distributed on 10 CD-ROM volumes that contain the speech signals and one CD-ROM volume that contains the documentation.
File types are identified with the following extensions:


Extension  File type
DOC        Microsoft Word V6.0 document
LST        DOS text index file with ISO Latin 1 symbols
TBL        DOS text file with ISO Latin 1 symbols
TXT        DOS text file
SES        DOS text file
CAO        SAM label file, text file with ISO Latin 1 symbols
CAA        A-law speech signal file
PS         PostScript file

The DVD-ROM has the following directory structure:


\+-COPYRIGH.TXT
 +-DISK.ID
 +-README.TXT
 +-MOBIL_CA--+-DOC------+-DESIGN.DOC
             |          +-ISO88591.PS
             |          +-SAMPALEX.PS
             |          +-SAMPSTAT.TXT
             |          +-SUMMARY.TXT
             |          +-VALREP.DOC
             +-INDEX----+-CONTENTS.LST
             |          +-B_TSTCA.SES
             |          +-B_TRNCA.SES
             +-TABLE----+-LEXICON.TBL
             |          +-SESSION.TBL
             +-BLOCK00--+-SES0000--+--B_0000A1.CAA
             |          |          +--B_0000A1.CAO
             .          .          ...
             |          |          +--B_0000W4.CAO
             |          +-SES0099--+--B_0099A1.CAA
             |                     +--B_0099A1.CAO
             .          .          ...
             |                     +--B_0099W4.CAO
             +-BLOCK01--+-SES0100--+--B_0100A1.CAA
                        |          +--B_0100A1.CAO
             .          .          ...
                        |          +--B_0100W4.CAO
                        +--SES0199-+--B_0199A1.CAA
                                   +--B_0199A1.CAO
                                   ...
                                   +--B_0199W4.CAO
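
The layout above can be turned into a small path helper. The Python sketch below is only an illustration read off the listing: it assumes that each BLOCKnn directory holds 100 consecutive sessions (SES0000 to SES0099 in BLOCK00, SES0100 to SES0199 in BLOCK01) and that file names follow the pattern B_ + four-digit session number + item code. Neither rule is stated explicitly here, and the function names are hypothetical.

```python
from pathlib import PurePosixPath

SESSIONS_PER_BLOCK = 100  # assumption inferred from the listing above


def session_dir(session: int, root: str = "MOBIL_CA") -> PurePosixPath:
    """Directory that holds all files of one recording session."""
    if not 0 <= session <= 9999:
        raise ValueError("session number must fit in four digits")
    block = session // SESSIONS_PER_BLOCK
    return PurePosixPath(root) / f"BLOCK{block:02d}" / f"SES{session:04d}"


def item_files(session: int, item: str, root: str = "MOBIL_CA"):
    """Return (signal file, label file) paths for one item, e.g. 'A1'."""
    stem = f"B_{session:04d}{item}"  # name pattern assumed from the listing
    directory = session_dir(session, root)
    return directory / f"{stem}.CAA", directory / f"{stem}.CAO"


signal, label = item_files(199, "W4")
print(signal)  # MOBIL_CA/BLOCK01/SES0199/B_0199W4.CAA
print(label)   # MOBIL_CA/BLOCK01/SES0199/B_0199W4.CAO
```

Because signal files and label files are distributed on different disks, the two returned paths would normally be resolved against different mount points.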



The signal files are stored on 10 CD-ROMs in ISO 9660 format. Label files and documentation files are stored on one CD-ROM.
The list of distribution disks and directories is contained in the README.TXT file, stored on each disk. Further details regarding the database contents, files and directories are provided in the documentation files in the DOC directory, the session and lexicon tables in the TABLE directory, and the contents files in the INDEX directory.

