Annexa 2 Introduction Database design and collection Database contents definition




1.1 Speech file formats

The speech signal is recorded from the mobile telephone network via a PSTN line connection. The signals are stored directly in digital format using A-law coding, at a sampling rate of 8 kHz with 8-bit quantization, i.e. one byte per sample. Because each sample occupies a single byte, the byte order is not significant for these files (see the SBF mnemonic in section 1.4.1). The sample rate, the quantization, and the byte order are described in a SAM label file that corresponds to each speech file.
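As an illustration of the format described above, the following sketch expands A-law bytes into signed linear samples. It assumes the files use standard ITU-T G.711 A-law coding, including the even-bit inversion, which the text does not state explicitly; the function names `alaw_to_linear` and `decode_alaw` are hypothetical.

```python
def alaw_to_linear(code):
    """Expand one 8-bit A-law code to a signed linear sample (G.711 assumed)."""
    code ^= 0x55                        # undo the even-bit inversion of G.711
    mantissa = (code & 0x0F) << 4       # 4-bit mantissa, shifted into place
    segment = (code & 0x70) >> 4        # 3-bit segment (exponent)
    if segment == 0:
        magnitude = mantissa + 8
    else:
        magnitude = (mantissa + 0x108) << (segment - 1)
    # In A-law the sign bit is set for positive samples.
    return magnitude if code & 0x80 else -magnitude


def decode_alaw(data):
    """Decode a bytes object holding one A-law sample per byte."""
    return [alaw_to_linear(byte) for byte in data]
```

Under this assumption the largest magnitudes are +32256 and -32256, the same values that appear as minimum and maximum sample values in the LBR row of the example label file in section 1.4.3.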



1.2 Directory structure

The directory structure uses shallow nesting, with consecutive numbers identifying the individual block and session directories; a session corresponds to a single telephone call. The following three-level directory structure is defined:

/<database>/<block>/<session>

where:

<database>   Defined as <name><#><language>, where <name> is MOBIL, <#> is _, and <language> is a 2-character language code (CA for Catalan).
<block>      Defined as BLOCK<NN>, where <NN> is a number from 00 to 99.
<session>    Defined as SES<NNMM>, where <NN> is the block number and <MM> is a session number from 00 to 99.

Table 1-1 Directory structure
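The three-level layout can be sketched as a small path builder. This is only an illustration: the function name `session_dir` and its parameters are hypothetical, and the backslash-separated form follows the DIR examples shown later in the label file section.

```python
def session_dir(block, session, name="MOBIL", language="CA"):
    """Build the DOS-style session directory for a block/session pair."""
    if not 0 <= block <= 99:
        raise ValueError("block number must be between 00 and 99")
    if not 0 <= session <= 99:
        raise ValueError("session number must be between 00 and 99")
    database = f"{name}_{language}"
    # The session directory repeats the block number before the session number.
    return f"\\{database}\\BLOCK{block:02d}\\SES{block:02d}{session:02d}"
```

For block 0 and session 5 this yields `\MOBIL_CA\BLOCK00\SES0005`.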
Both signal files and label files belong to the same directory, although they are distributed on different disks.
In addition to the previous structure the following directories are used to store some other files:


/                    README.TXT, DISK.ID, and COPYRIGH.TXT
/<database>/DOC      Documentation
/<database>/TABLE    Speaker, session, recording condition, and lexicon tables
/<database>/INDEX    Index files

Table 1-2 Non-speech related directory structure
All sessions have complete recordings for all prompted items. Exceptions can be found in the summary text files.
The root directory contains three mandatory files:

  • COPYRIGH.TXT: a copyright text in ASCII format.

  • DISK.ID: an 11-character string with the volume name (required for systems that cannot read the physical volume label).

  • README.TXT: an ASCII text file that lists all files of the database, except for signal and label files which can be indicated by their name template.

The first two of these are duplicated on all the disks.

1.3 File naming conventions

File names adhere to the common subset of the ISO 9660 standard, i.e. file names of 8 characters followed by a 3-character file extension:

<DD><NNMM><CC>.<LL><F>

where:

<DD>     Database identification code (00-ZZ): B_
<NNMM>   Progressive recording session number (0000 to 9999), where NN is the block number and MM is the session number
<CC>     Corpus code
<LL>     Language code: CA for Catalan
<F>      File type code: O for an orthographic label file, A for an A-law speech file

Table 1-3 File naming conventions
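A file name built this way can be split back into its parts with a regular expression. The sketch below is hypothetical (the name `parse_file_name` and the dictionary keys are not part of the database definition), and it assumes that the database code may contain an underscore, as in B_.

```python
import re

# 2-char database code, 2-digit block, 2-digit session, 2-char corpus code,
# then a 2-letter language code and a 1-letter file type as the extension.
FILE_NAME = re.compile(
    r"^(?P<database>[0-9A-Z_]{2})(?P<block>\d{2})(?P<session>\d{2})"
    r"(?P<corpus>[0-9A-Z]{2})\.(?P<language>[A-Z]{2})(?P<file_type>[AO])$"
)


def parse_file_name(name):
    """Split a signal or label file name into its naming-convention fields."""
    match = FILE_NAME.match(name.upper())
    if match is None:
        raise ValueError(f"not a valid database file name: {name!r}")
    fields = match.groupdict()
    fields["block"] = int(fields["block"])
    fields["session"] = int(fields["session"])
    return fields
```

Applied to `B_0005W1.CAA`, this gives database code B_, block 0, session 5, corpus code W1, language CA, and file type A.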
An overview of the corpus codes is presented in the tables in section 3. A list of separate documentation files, tables and listings follows below:


Directory   File
DOC         DESIGN.DOC
            ISO88591.PS
            SAMPALEX.PS
            SAMPSTAT.TXT
            SUMMARY.TXT
            VALREP.DOC
TABLE       LEXICON.TBL
            SESSION.TBL
INDEX       CONTENTS.LST
            B_TRNCA.SES
            B_TSTCA.SES

Table 1-4 Summary of documentation files

1.4 Label files

Associated SAM label files are text files in which each row ends with <CR><LF> (the DOS line-ending convention). Rows follow the main SAM paradigm:

ABC: x, y, z, ...

Where:


  • ABC is a three letter mnemonic followed by a colon: no spaces are allowed between them, so we can define as SAM-mnemonic the set “ABC:”;

  • after the mnemonic are all the defined items separated by commas;

  • missing items are accepted and nothing needs to be put between commas to substitute them;

  • spaces are not significant, but consistent spacing improves readability.

A label file begins with the mnemonic “LHD:” and ends with “ELF:”. The mnemonic “LBD:” splits a label file into two sections: the LABEL FILE HEADER and the LABEL FILE BODY.
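The row paradigm and the header/body split can be sketched as a small parser. This is a minimal illustration, not the reference implementation: the name `parse_label_file` is hypothetical, and every comma is treated as an item separator, so a free-text item that itself contains commas would need field-specific handling.

```python
import re

ROW = re.compile(r"^([A-Z0-9]{3}):(.*)$")   # three-character mnemonic + colon


def parse_label_file(text):
    """Return (header_rows, body_rows) as lists of (mnemonic, items)."""
    rows = []
    for line in text.splitlines():          # splitlines() accepts CR LF endings
        line = line.strip()
        if not line:
            continue
        match = ROW.match(line)
        if match is None:
            raise ValueError(f"not a SAM row: {line!r}")
        mnemonic, rest = match.groups()
        # Missing items are kept as empty strings between the commas.
        items = [item.strip() for item in rest.split(",")] if rest.strip() else []
        rows.append((mnemonic, items))
    names = [mnemonic for mnemonic, _ in rows]
    if not names or names[0] != "LHD" or names[-1] != "ELF":
        raise ValueError("label file must start with LHD: and end with ELF:")
    if names.count("LBD") != 1:
        raise ValueError("label file must contain exactly one LBD: row")
    split = names.index("LBD")
    # LBD: and ELF: are structural markers and are not returned.
    return rows[:split], rows[split + 1:-1]
```

A file that has lost its final “ELF:” row is rejected, which is the truncation check that mnemonic is meant to provide.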


All mnemonics used in the database are listed below. For each one the explanation, the format and the domain accepted for the related items are given.

1.4.1 Label file header

In the following we will present, grouped by similarity, all the attributes that have been selected.


The format defined for the various items is described in the second column of the following tables by using a C printf() format string.
The order of the mnemonics in the label file header is irrelevant except that "LHD:" is the first mnemonic and "LBD:" is the last one.
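Since the item formats are given as C printf() format strings, they can be tried directly with Python's `%` operator, which accepts the same conversions for the cases shown here. The values are taken from the examples in this document; the dictionary name `formatted` is only illustrative.

```python
# Each entry applies one printf-style item format to an example value.
formatted = {
    "SES": "%04d" % 345,                               # zero-padded session number
    "SCD": "%06d" % 32200,                             # zero-padded speaker code
    "VOL": "%.11s" % "MOBIL_CA_01",                    # at most 11 characters
    "RED": "%02d/%.3s/%4d" % (25, "November", 2005),   # month cut to 3 letters
    "RET": "%02d:%02d:%02d" % (16, 21, 35),
}

for mnemonic, value in formatted.items():
    print(f"{mnemonic}: {value}")
```

The precision in `%.3s` and `%.11s` sets a maximum length, so longer strings are cut rather than rejected.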
Identification rows
LHD: label file header. This is the starting mnemonic; it identifies the format name and the version number.

ELF: terminates a label file and it provides a check for accidental file truncation.

CMT: comment rows can be placed anywhere, but they should not be the first or last row.
Session rows
DBN: database name.

VOL: volume identification name (this allows multi-volume speech databases to be built). It coincides with the volume name of the mastered CD-ROM and can be up to 11 chars long. The volume sequence number must start from “1”.

SES: session number, that is, the code associated with a recording session at run time during collection; it is a simple sequential four-digit number ranging from 0000 to 9999.
Recording condition rows
REG: calling region.

ENV: calling environment.

NET: telephone network.

PHM: telephone hand set model.



Speaker rows
As in the recording condition set, the main field is the speaker code. It is used as a pointer to a speaker file/table, but to speed up searches some information is also stored inside the label file. Relevant items are:
SCD: unique speaker code.

SEX: speaker sex (‘M’ale or ‘F’emale; void if unknown).

AGE: speaker age (precise or class mid-point; void if unknown).

ACC: speaker accent, i.e. the regional/dialectal colouring factor.


File rows
DIR: Directory of the signal file. It is reported in DOS format with a leading backslash (it allows unambiguous paths that start from the root directory).

SRC: Signal file name.

CCD: Corpus code. As reported before, this field is included also in the filename construction and has been designed as a two alphanumeric character string; it is a compound code built by joining together the one letter corpus identifier and the item identifier.

CRP: Corpus repetition. It is void for this database.

REP: Recording place. For telephone speech databases, the location of the recording machine is mandatory because, together with the recording condition fields, it completely specifies the environment.

RED: Recording date of the signal file.

RET: Recording time. The time is the session start time.

BEG: The begin field specifies the starting point of the speech inside the data file, skipping any header; the SAM format used in this database sets this field to zero.

END: The end field specifies the end of the speech inside the data file; for this database END coincides with the file length minus one, as samples are one byte long.
Data file coding rows
SAM: the sampling frequency in hertz (set to 8000).

SNB: the number of bytes per sample (set to 1).

SBF: the sample byte order is fundamental for 16 bit samples, because it is different between DOS systems and UNIX. It is specified as a pair of digits “0” and “1”: the latter specifies the position of the most significant byte. However, this is irrelevant for A-law coding, which uses one byte only, so it will be left void.

SSB: number of significant bits per sample (set to 8).

QNT: the quantization attribute specifies the speech coding chosen, e.g. “A-LAW” or “MU-LAW”.
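The BEG, END, SAM, and SNB items together determine the size and duration of a signal file. The following sketch shows the arithmetic; the function name `signal_stats` and the returned keys are hypothetical.

```python
def signal_stats(beg, end, sam=8000, snb=1):
    """Derive sample count, byte count, and duration from label file items.

    beg, end: first and last sample positions (BEG and END)
    sam: sampling frequency in hertz (SAM)
    snb: number of bytes per sample (SNB)
    """
    samples = end - beg + 1            # END is inclusive: file length minus one
    return {
        "samples": samples,
        "bytes": samples * snb,
        "seconds": samples / sam,
    }
```

With BEG 0 and END 22910, as in the example label file of section 1.4.3, this gives 22911 samples, 22911 bytes, and a duration of about 2.86 seconds.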
Information about the labelling session
These fields store information about the labelling session, usually a phonetic one.
EXP: expert name.

SYS: system used to label the file.

DAT: date of completion of labelling.

1.4.2 Label file body

LBR: stores information about the acquisition window (LaBelling during Recording) and the prompt text, i.e. what the speaker should have uttered. The related fields are: sequence beginning (in sample), sequence end, input gain on recording, minimum sample value, maximum sample value, orthographic text prompt. Input gain, minimum and maximum values are optional fields and can be left blank.



LBO: specifies a broad segmentation (LaBelling Orthographic) with the transcription of what the speaker actually said. The related fields are: sequence beginning (in sample), sequence centre, sequence end, corrected orthographic text. The sequence centre is optional and can be left blank (provided as an alternative when labelling phonetic events). Also the segmentation process is optional and need not be performed; in this case begin and end coincide with the “LBR:” end points.
Blanks should not be treated as significant in label fields, except to separate words in the orthographic transcriptions.
The orthographic text is written using the ISO-8859-1 (Latin 1) character set. It is directly compatible with the Windows environment, but not with DOS and UNIX, where some of its characters are not displayed correctly.
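The field layout of the “LBR:” and “LBO:” rows can be sketched as follows. The helper names are hypothetical; the split is limited to the number of leading numeric fields so that the orthographic text stays in one piece, and optional fields left blank are returned as `None`. Raw file bytes are assumed to be decoded with the `iso-8859-1` codec first.

```python
def _optional_int(field):
    """Convert a numeric field to int, or None when it was left blank."""
    field = field.strip()
    return int(field) if field else None


def parse_lbr(row):
    """Parse 'LBR: begin, end, gain, min, max, prompt'."""
    if not row.startswith("LBR:"):
        raise ValueError("not an LBR row")
    begin, end, gain, low, high, prompt = row[4:].split(",", 5)
    return {"begin": int(begin), "end": int(end),
            "gain": _optional_int(gain), "min": _optional_int(low),
            "max": _optional_int(high), "prompt": prompt.strip()}


def parse_lbo(row):
    """Parse 'LBO: begin, centre, end, transcription'."""
    if not row.startswith("LBO:"):
        raise ValueError("not an LBO row")
    begin, centre, end, text = row[4:].split(",", 3)
    return {"begin": int(begin), "centre": _optional_int(centre),
            "end": int(end), "text": text.strip()}
```

For instance, `parse_lbo(raw_line.decode("iso-8859-1"))` would handle a row read in binary mode, where `raw_line` is a hypothetical bytes object holding one row.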
Spontaneous questions are transcribed on “LBR:” rows by using a code word in angled brackets, according to the following table:


Corpus code   Orthographic text prompt, in “LBR:”
D1
O2
Q1
Q2
T1

Table 1-5 Spontaneous orthographic text

These code words allow the user to easily distinguish spontaneous speech transcriptions from the normal read speech ones.


mnem.  item format                       example                                                       comments
LHD:   “%s, %d.%02d”                     SAM, 6.0                                                      format name + version
ELF:   “”                                                                                              end of label file
CMT:   “%.75s”                           This is a comment row                                         comment row
DBN:   “%.75s”                           Catalan_Mobile_Network                                        database name
VOL:   “%.11s”                           MOBIL_CA_01                                                   database volume ID
SES:   “%04d”                            0345                                                          session number
REG:   “%.75s”                           CENTRAL                                                       calling region
ENV:   “%.75s”                           VEHICLE                                                       calling environment
NET:   “%.75s”                           GSM                                                           telephone network
PHM:   “%.75s”                           CELLULAR/HH                                                   telephone hand set model
SCD:   “%06d”                            172001                                                        speaker code
SEX:   “%c”                              M                                                             speaker sex
AGE:   “%d”                              35                                                            speaker age
ACC:   “%.75s”                           CENTRAL                                                       speaker accent
DIR:   “\\%.8s\\...\\%.8s”               \MOBIL_CA\BLOCK01\SES0100                                     signal file directory
SRC:   “%8s.%.3s”                        B_0100S3.CAA                                                  signal file name
CCD:   “%.2s”                            S3                                                            corpus code
CRP:   “%.02d”                                                                                         corpus repetition (void)
REP:   “%s, %s, %s”                      UPC,Barcelona,Spain                                           recording place: place, city, country
RED:   “%02d/%.3s/%4d”                   25/Nov/2005                                                   recording date
RET:   “%02d:%02d:%02d”                  16:21:35                                                      recording time (:SS = :00)
BEG:   “%lu”                             0                                                             labelled sequence start position
END:   “%lu”                             file length - 1                                               labelled sequence end position
SAM:   “%d”                              8000                                                          sampling frequency
SNB:   “%d”                              1                                                             number of (8-bit) bytes per sample
SBF:   “%2s”                                                                                           sample byte order (meaningless with single byte samples, “SNB: 1”)
SSB:   “%d”                              8                                                             number of significant bits per sample
QNT:   “%.75s”                           A-LAW                                                         quantization
EXP:   “%s”                              Sergio Oller                                                  labelling expert
SYS:   “%s, %d.%02d”                     UPC_RevBD,7.32                                                labelling system
DAT:   “%02d/%.3s/%4d,%02d:%02d:%02d”    27/Jul/2006,10:10:44                                          labelling date
LBD:   “”                                                                                              label body keyword
LBR:   “%lu, %lu, %d, %d, %d, %s”        0, 12457, , , , This is what the speaker should have uttered  labelling during recording: begin, end, gain, min, max, orthographic text prompt
LBO:   “%lu, %lu, %lu, %s”               0, 6229, 12457, This is what the                              orthographic labelling: begin, centre, end, orthographic transcription text
EXT:   “%.75s”                           speaker actually said                                         line extension

Table 1-6 Summary of SAM labels, formats and definition.

1.4.3 Example of label file

Below is an example of a label file:


LHD: SAM,6.00
DBN: Catalan_Mobile_Network
VOL: MOBIL_CA_01
SES: 0005
CMT: *** Speech file information ***
DIR: \MOBIL_CA\BLOCK00\SES0005
SRC: B_0005W1.CAA
CCD: W1
CRP:
BEG: 0
END: 22910
REP: UPC,Barcelona,Spain
RED: 30/Sep/2005
RET: 22:20:37
CMT: *** Speech data coding ***
SAM: 8000
SNB: 1
SBF:
SSB: 8
QNT: A-LAW
CMT: *** Speaker information ***
SCD: 032200
SEX: M
AGE: 18
ACC: CENTRAL
CMT: *** Recording conditions ***
REG: CENTRAL
ENV: STREET
NET: GSM
PHM: CELLULAR/HH
CMT: *** Labelling information ***
EXP: Elsa Vicente
SYS: UPC_RevBD,7.32
DAT: 24/Jul/2006,12:32:43
LBD:
CMT: *** Label file body ***
CMT: LBR,begin,end,input gain,min,max,prompt
CMT: LBO,begin,centre,end,transcription
CMT: ***********************
LBR: 0,22910,,-32256,32256,bitxac
LBO: 0,11455,22910,[int] bitxac
ELF:

Figure 1.1 - Example of label file

