Annexa 2 Introduction Database design and collection Database contents definition


1.1 Speech file formats

The speech signal is recorded from the mobile telephone network via a PSTN line connection. The signals are stored directly in digital form using A-law coding, at a sampling rate of 8 kHz with 8-bit quantization, as (signed) integers with the least significant byte first (“lohi” or Intel format). The sample rate, the quantization, and the byte order are described in a SAM label file that accompanies each speech file.
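For readers who need linear samples, the A-law coding mentioned above can be expanded with the standard ITU-T G.711 rule. The following is only a sketch: the function names are illustrative and not part of the database tools, and it assumes the speech files contain raw A-law bytes with no header.

```python
def alaw_to_linear(code: int) -> int:
    """Expand one 8-bit A-law code word to a signed 16-bit linear value (G.711)."""
    code ^= 0x55                       # undo the alternate-bit inversion of G.711
    mantissa = (code & 0x0F) << 4      # 4-bit mantissa, moved into position
    segment = (code & 0x70) >> 4       # 3-bit segment (exponent)
    if segment == 0:
        value = mantissa + 8
    else:
        value = (mantissa + 0x108) << (segment - 1)
    # In A-law the sign bit is set for positive samples.
    return value if code & 0x80 else -value


def decode_alaw(data: bytes) -> list[int]:
    """Decode a whole buffer of A-law bytes (assumed to be headerless)."""
    return [alaw_to_linear(byte) for byte in data]


if __name__ == "__main__":
    print(decode_alaw(bytes([0xD5, 0x55, 0xAA, 0x2A])))
```

The largest magnitude this expansion can produce is 32256, which is the same value that appears as the minimum and maximum sample in the example label file of this annex.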

1.2 Directory structure

The directory structure uses a shallow directory nesting with consecutive numbers to identify the individual block and session directories where a session corresponds to a single telephone call. The following three-level directory structure is defined:

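<database>/<block>/<session>

where:

<database>	Defined as <name><#><language code>, where <name> is MOBIL, <#> is _, and <language code> is a 2-character language code (CA for Catalan).
<block>	Defined as BLOCK<NN>, where <NN> is a number from 00 to 99.
<session>	Defined as SES<NNMM>, where <NN> is the block number and <MM> is a session number from 00 to 99.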

Table 1-1 Directory structure
Both signal files and label files belong to the same directory, although they are distributed on different disks.
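As an illustration of the three-level layout, the sketch below derives a session directory from a four-digit session number. It assumes, based on the paths MOBIL_CA\BLOCK01\SES0100 and MOBIL_CA\BLOCK00\SES0005 that appear in this annex, that the block number is the first two digits of the session number; the function name is hypothetical.

```python
from pathlib import PurePosixPath


def session_directory(session: int, database: str = "MOBIL_CA") -> PurePosixPath:
    """Build the database/block/session path for one recording session."""
    if not 0 <= session <= 9999:
        raise ValueError("session number must lie between 0000 and 9999")
    code = f"{session:04d}"            # e.g. 5 -> "0005"
    block = code[:2]                   # assumption: block = first two digits
    return PurePosixPath(database, f"BLOCK{block}", f"SES{code}")


if __name__ == "__main__":
    print(session_directory(5))
    print(session_directory(100))
```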
In addition to the previous structure the following directories are used to store some other files:

/	README.TXT, DISK.ID, and COPYRIGH.TXT
/<database>/DOC	Documentation
/<database>/TABLE	Speaker, session, recording condition, and lexicon tables
/<database>/INDEX	Index files

Table 1-2 Non-speech related directory structure
Sessions contain complete recordings for all prompted items; any exceptions are listed in the summary text files.
The root directory contains three mandatory files:

COPYRIGH.TXT: a copyright text in ASCII format.
DISK.ID: an 11-character string with the volume name (required for systems that cannot read the physical volume label).
README.TXT: an ASCII text file that lists all files of the database, except for signal and label files which can be indicated by their name template.

The first two of these are duplicated on all the disks.

1.3 File naming conventions

File names adhere to the common subset of the ISO 9660 standard, i.e. file names with 8 characters followed by a 3-character file extension:

<DD><NNMM><CC>.<LL><F>

where:

<DD>	Database identification code (00-ZZ): B_
<NNMM>	Progressive recording session number (0000 to 9999), where NN is the block number and MM is the session number
<CC>	Corpus code
<LL>	Language code: CA for Catalan
<F>	File type code: O for orthographic label file, A for A-law speech file

Table 1-3 File naming conventions
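The naming convention can be checked mechanically. The sketch below splits a name such as B_0005W1.CAA (taken from the example label file in this annex) into its parts; the regular expression and the field names are assumptions made for illustration, not part of the specification.

```python
import re
from typing import NamedTuple


class SpeechFileName(NamedTuple):
    database: str    # database identification code, e.g. "B_"
    block: str       # block number NN
    session: str     # session number MM within the block
    corpus: str      # two-character corpus code
    language: str    # two-character language code, e.g. "CA"
    file_type: str   # "A" (A-law speech) or "O" (orthographic label)


_NAME_RE = re.compile(
    r"(?P<database>[0-9A-Z_]{2})"
    r"(?P<block>\d{2})(?P<session>\d{2})"
    r"(?P<corpus>[0-9A-Z]{2})"
    r"\.(?P<language>[A-Z]{2})(?P<file_type>[AO])"
)


def parse_file_name(name: str) -> SpeechFileName:
    """Split an 8.3 file name into the fields of the naming convention."""
    match = _NAME_RE.fullmatch(name.upper())
    if match is None:
        raise ValueError(f"not a valid speech or label file name: {name!r}")
    return SpeechFileName(**match.groupdict())


if __name__ == "__main__":
    print(parse_file_name("B_0005W1.CAA"))
```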
An overview of the corpus codes is presented in the tables in section 3. A list of separate documentation files, tables and listings follows below:

Directory	File
DOC	DESIGN.DOC
	ISO88591.PS
	SAMPALEX.PS
	SAMPSTAT.TXT
	SUMMARY.TXT
	VALREP.DOC
TABLE	LEXICON.TBL
TABLE	SESSION.TBL
INDEX	CONTENTS.LST
	B_TRNCA.SES
	B_TSTCA.SES

Table 1-4 Summary of documentation files

1.4 Label files

The associated SAM label files are text files in which each row is terminated by <CR><LF> (according to the DOS format). Rows follow the main SAM paradigm:

ABC: x, y, z, ...

Where:

ABC is a three-letter mnemonic followed by a colon, with no space between them; the complete string “ABC:” is referred to as the SAM mnemonic;
the mnemonic is followed by the defined items, separated by commas;
missing items are allowed, and nothing needs to be placed between the commas to stand in for them;
spaces are not significant, although consistent spacing improves readability.

A label file begins with the mnemonic “LHD:” and ends with “ELF:”. The mnemonic “LBD:” splits a label file into two sections: the LABEL FILE HEADER and the LABEL FILE BODY.
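The row paradigm and the header/body split can be sketched in a few lines. This is a minimal reader under stated assumptions: the function names are illustrative, items are split on every comma (so an orthographic text that itself contains commas would need special handling), and the “LBD:” row is kept as the last row of the header.

```python
def parse_row(row: str) -> tuple[str, list[str]]:
    """Split one label row into its SAM mnemonic and its list of items."""
    mnemonic, colon, rest = row.strip().partition(":")
    if colon != ":" or len(mnemonic) != 3:
        raise ValueError(f"not a SAM label row: {row!r}")
    # Empty strings are kept so that missing items remain visible.
    items = [item.strip() for item in rest.split(",")] if rest.strip() else []
    return mnemonic + ":", items


def split_label_file(text: str):
    """Return (header_rows, body_rows) for the text of one label file."""
    rows = [parse_row(line) for line in text.splitlines() if line.strip()]
    names = [mnemonic for mnemonic, _ in rows]
    if not rows or names[0] != "LHD:" or names[-1] != "ELF:":
        raise ValueError("label file must start with LHD: and end with ELF:")
    body_start = names.index("LBD:") + 1   # raises ValueError if LBD: is missing
    return rows[:body_start], rows[body_start:-1]


if __name__ == "__main__":
    sample = "LHD: SAM,6.00\r\nSES: 0005\r\nCRP:\r\nLBD:\r\nLBR: 0,22910,,-32256,32256,bitxac\r\nELF:\r\n"
    header, body = split_label_file(sample)
    print(header)
    print(body)
```

A missing final “ELF:” row, for example after an accidental truncation, makes `split_label_file` raise an error instead of returning partial data.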

All mnemonics used in the database are listed below. For each one the explanation, the format and the domain accepted for the related items are given.

1.4.1 Label file header

The selected attributes are presented below, grouped by similarity.

The format defined for the various items is described in the second column of the following tables by using a C printf() format string.
The order of the mnemonics in the label file header is irrelevant except that "LHD:" is the first mnemonic and "LBD:" is the last one.
Identification rows
LHD: Label File Header. It is the starting mnemonic and identifies the format name and the version number.

ELF: terminates a label file and provides a check for accidental file truncation.

CMT: comment rows can be placed anywhere except as the first or last row.
Session rows
DBN: database name.

VOL: volume identification name (this allows multi-volume speech databases to be built). It coincides with the volume name of the mastered CD-ROM and can be up to 11 chars long. The volume sequence number must start from “1”.

SES: session number, that is, the code associated with a recording session at run time during collection; it is a simple sequential four-digit number ranging from 0000 to 9999.
Recording condition rows
REG: calling region.

ENV: calling environment.

NET: telephone network.

PHM: telephone hand set model.

Speaker rows
As in the recording condition set, the main field is the speaker code. It is used as pointer to a speaker file/table, but to speed up search processes some information is put also inside the label file. Relevant items are:
SCD: unique speaker code.

SEX: speaker sex (‘M’ale or ‘F’emale; void if unknown).

AGE: speaker age (precise or class mid-point; void if unknown).

ACC: speaker accent, i.e. the regional/dialectal colouring factor.

File rows
DIR: Directory of the signal file. It is reported in DOS format with a leading backslash (it allows unambiguous paths that start from the root directory).

SRC: Signal file name.

CCD: Corpus code. As reported before, this field is included also in the filename construction and has been designed as a two alphanumeric character string; it is a compound code built by joining together the one letter corpus identifier and the item identifier.

CRP: Corpus repetition. It is void for this database.

REP: Recording place. For telephone speech databases, the place where the recording machine was located is mandatory because, together with the recording condition fields, it completely specifies the environment.

RED: Recording date of the signal file.

RET: Recording time. The time is the starting session time.

BEG: The begin field specifies the starting point of the speech inside the data file, skipping any header; the SAM format used in these speech databases sets this field to zero.

END: The end field specifies the end of the speech inside the data file; for this database END coincides with the file length minus one, as samples are one byte long.
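Since BEG and END are sample positions and each sample occupies one byte, the labelled duration and a simple truncation check follow directly. The helper names below are hypothetical; the default rate of 8000 Hz is the value this database records in its “SAM:” row.

```python
def labelled_duration(beg: int, end: int, sample_rate: int = 8000) -> float:
    """Duration in seconds of the labelled sequence, both end points included."""
    if end < beg:
        raise ValueError("END must not precede BEG")
    return (end - beg + 1) / sample_rate


def end_matches_file(end: int, file_size: int, bytes_per_sample: int = 1) -> bool:
    """True if END equals the number of samples in the file minus one."""
    return end == file_size // bytes_per_sample - 1


if __name__ == "__main__":
    # Values from the example label file in this annex: BEG 0, END 22910.
    print(labelled_duration(0, 22910))
    print(end_matches_file(22910, 22911))
```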
Data file coding rows
SAM: the sampling frequency in hertz (set to 8000).

SNB: the number of bytes per sample (set to 1).

SBF: the sample byte order is fundamental for 16 bit samples, because it is different between DOS systems and UNIX. It is specified as a pair of digits “0” and “1”: the latter specifies the position of the most significant byte. However, this is irrelevant for A-law coding, which uses one byte only, so it will be left void.

SSB: number of significant bits per sample (set to 8).

QNT: the quantization attribute specifies the speech coding chosen, e.g. “A-LAW” or “MU-LAW”.
Information about the labelling session
These fields store information about the labelling session, usually a phonetic one.
EXP: expert name.

SYS: system used to label the file.

DAT: date of completion of labelling.

1.4.2 Label file body

LBR: stores information about the acquisition window (LaBelling during Recording) and the prompt text, i.e. what the speaker should have uttered. The related fields are: sequence beginning (in samples), sequence end, input gain on recording, minimum sample value, maximum sample value, orthographic text prompt. Input gain, minimum and maximum values are optional fields and can be left blank.

LBO: specifies a broad segmentation (LaBelling Orthographic) with the transcription of what the speaker actually said. The related fields are: sequence beginning (in samples), sequence centre, sequence end, corrected orthographic text. The sequence centre is optional and can be left blank (it is provided as an alternative when labelling phonetic events). The segmentation process is also optional and need not be performed; in this case begin and end coincide with the “LBR:” end points.
Blanks should not be treated as significant in label fields, except to separate words in the orthographic transcriptions.
The orthographic text is written using the ISO-8859-1 (Latin 1) character set. It is directly compatible with the Windows environment, but not with DOS and UNIX, where some of its characters are not displayed correctly.
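The optional fields of the body rows can be handled by mapping blank items to a missing value. The sketch below is illustrative only: the function and key names are assumptions, and the split is limited to the numeric fields so that commas inside the orthographic text are preserved. When reading a label file from disk, opening it with `encoding="latin-1"` matches the ISO-8859-1 character set described above.

```python
from typing import Optional


def _optional_int(field: str) -> Optional[int]:
    """Convert a numeric item, returning None when it was left blank."""
    field = field.strip()
    return int(field) if field else None


def _items(row: str, mnemonic: str, count: int) -> list[str]:
    """Check the mnemonic and split the row into exactly `count` items."""
    name, _, rest = row.partition(":")
    if name.strip() != mnemonic:
        raise ValueError(f"expected a {mnemonic}: row, got {row!r}")
    items = rest.split(",", count - 1)   # the last item keeps any further commas
    if len(items) != count:
        raise ValueError(f"{mnemonic}: row needs {count} items: {row!r}")
    return items


def parse_lbr(row: str) -> dict:
    begin, end, gain, low, high, prompt = _items(row, "LBR", 6)
    return {"begin": int(begin), "end": int(end), "gain": _optional_int(gain),
            "min": _optional_int(low), "max": _optional_int(high),
            "prompt": prompt.strip()}


def parse_lbo(row: str) -> dict:
    begin, centre, end, text = _items(row, "LBO", 4)
    return {"begin": int(begin), "centre": _optional_int(centre),
            "end": int(end), "text": text.strip()}


if __name__ == "__main__":
    print(parse_lbr("LBR: 0,22910,,-32256,32256,bitxac"))
    print(parse_lbo("LBO: 0,11455,22910,[int] bitxac"))
```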
Spontaneous questions are transcribed on “LBR:” rows by using a code word in angled brackets, according to the following table:

Corpus Code	Orthographic text prompt, in “LBR:”
D1
O2
Q1
Q2
T1

Table 1-5 Spontaneous orthographic text
These code words allow the user to easily distinguish spontaneous speech transcriptions from the normal read speech ones.

mnem.	item format	example	comments
LHD:	“%s, %d.%02d”	SAM, 6.00	format name + version
ELF:	“”		end of label file
CMT:	“%.75s”	This is a comment row	comment row
DBN:	“%.75s”	Catalan_Mobile_Network	database name
VOL:	“%.11s”	MOBIL_CA_01	database volume ID
SES:	“%04d”	0345	session number
REG:	“%.75s”	CENTRAL	calling region
ENV:	“%.75s”	VEHICLE	calling environment
NET:	“%.75s”	GSM	telephone network
PHM:	“%.75s”	CELLULAR/HH	telephone hand set model
SCD:	“%06d”	172001	speaker code
SEX:	“%c”	M	speaker sex
AGE:	“%d”	35	speaker age
ACC:	“%.75s”	CENTRAL	speaker accent
DIR:	“\\%.8s\\...\\%.8s”	\MOBIL_CA\BLOCK01\SES0100	signal file directory
SRC:	“%8s.%.3s”	B_0100S3.CAA	signal file name
CCD:	“%.2s”	S3	corpus code
CRP:	“%.02d”		corpus repetition (void)
REP:	“%s, %s, %s”	UPC,Barcelona,Spain	recording place: place, city, country
RED:	“%02d/%.3s/%4d”	25/Nov/2005	recording date
RET:	“%02d:%02d:%02d”	16:21:35	recording time (:SS = :00)
BEG:	“%lu”	0	labelled sequence start position
END:	“%lu”	file length - 1	labelled sequence end position
SAM:	“%d”	8000	sampling frequency
SNB:	“%d”	1	number of (8-bit) bytes per sample
SBF:	“%2s”		sample byte order (meaningless with single byte samples, “SNB: 1”)
SSB:	“%d”	8	number of significant bits per sample
QNT:	“%.75s”	A-LAW	Quantization
EXP:	“%s”	Sergio Oller	labelling expert
SYS:	“%s, %d.%02d”	UPC_RevBD,7.32	labelling system
DAT:	“%02d/%.3s/%4d,%02d:%02d:%02d”	27/Jul/2006,10:10:44	labelling date
LBD:	“”		label body keyword
LBR:	“%lu, %lu, %d, %d, %d, %s”	0, 12457, , , , This is what the speaker should have uttered	labelling during recording: begin, end, gain, min, max, orthographic text prompt
LBO:	“%lu, %lu, %lu, %s”	0, 6229, 12457, This is what the	orthographic labelling: begin, centre, end, orthographic transcription text
EXT:	“%.75s”	speaker actually said	line extension

Table 1-6 Summary of SAM labels, formats and definitions.
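The printf() format strings in the table can be tried out with Python's %-formatting, which accepts the same conversions (the “l” length modifier is ignored and “%u” behaves like “%d”). The helper `format_row` is a hypothetical name used only for this sketch.

```python
def format_row(mnemonic: str, fmt: str, *values) -> str:
    """Render one label row from a printf-style item format and its values."""
    body = fmt % values if values else ""
    # Rows without items, such as ELF: or LBD:, consist of the mnemonic only.
    return f"{mnemonic}: {body}".rstrip()


if __name__ == "__main__":
    print(format_row("SES", "%04d", 345))
    print(format_row("VOL", "%.11s", "MOBIL_CA_01"))
    print(format_row("RED", "%02d/%.3s/%4d", 25, "November", 2005))
    print(format_row("RET", "%02d:%02d:%02d", 16, 21, 35))
    print(format_row("BEG", "%lu", 0))
    print(format_row("ELF", ""))
```

The precision in “%.3s” and “%.11s” truncates over-long strings, which is how a full month name is reduced to three letters and a volume name is held to 11 characters.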

1.4.3 Example of label file

Below is an example of a label file:

LHD: SAM,6.00
DBN: Catalan_Mobile_Network
VOL: MOBIL_CA_01
SES: 0005
CMT: *** Speech file information ***
DIR: \MOBIL_CA\BLOCK00\SES0005
SRC: B_0005W1.CAA
CCD: W1
CRP:
BEG: 0
END: 22910
REP: UPC,Barcelona,Spain
RED: 30/Sep/2005
RET: 22:20:37
CMT: *** Speech data coding ***
SAM: 8000
SNB: 1
SBF:
SSB: 8
QNT: A-LAW
CMT: *** Speaker information ***
SCD: 032200
SEX: M
AGE: 18
ACC: CENTRAL
CMT: *** Recording conditions ***
REG: CENTRAL
ENV: STREET
NET: GSM
PHM: CELLULAR/HH
CMT: *** Labelling information ***
EXP: Elsa Vicente
SYS: UPC_RevBD,7.32
DAT: 24/Jul/2006,12:32:43
LBD:
CMT: *** Label file body ***
CMT: LBR,begin,end,input gain,min,max,prompt
CMT: LBO,begin,centre,end,transcription
CMT: ***********************
LBR: 0,22910,,-32256,32256,bitxac
LBO: 0,11455,22910,[int] bitxac
ELF:

Figure 1.1 - Example of label file
