The speech signal is recorded from the mobile telephone network via a PSTN line connection. The signals are stored directly in digital format using A-law coding. They are recorded at a sampling rate of 8 kHz with 8-bit quantization, least significant byte first (“lohi” or Intel format), as (signed) integers. A description of the sample rate, the quantization, and the byte order used is stored in the SAM label file that corresponds to each speech file.
1.2 Directory structure
The directory structure uses a shallow directory nesting with consecutive numbers to identify the individual block and session directories where a session corresponds to a single telephone call. The following three-level directory structure is defined:
///
where:
Defined as <#>
where is MOBIL, <#> is _, is a 2-character language code, CA for Catalan.
Defined as BLOCK
Where is a number from 00 to 99
Defined as SES
Where is the block number and a session number from 00 to 99
Table 1-1 Directory structure
Both signal files and label files belong to the same directory, although they are distributed on different disks.
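The block and session directory names described above can be sketched in a few lines of Python. This is a minimal illustration, assuming that a block directory is named BLOCK followed by the two-digit block number and that a session directory is named SES followed by the block number and the two-digit session number. The function name `session_dir` is hypothetical, and the top-level database directory is left to the caller.

```python
from pathlib import PurePosixPath


def session_dir(block: int, session: int) -> PurePosixPath:
    """Return the relative BLOCK/SES path for one telephone call.

    Assumption: directories are named BLOCK<nn> and SES<nn><mm>, where
    nn is the block number and mm the session number, both 00 to 99.
    """
    if not (0 <= block <= 99 and 0 <= session <= 99):
        raise ValueError("block and session must both be in the range 0..99")
    block_name = f"BLOCK{block:02d}"
    session_name = f"SES{block:02d}{session:02d}"
    return PurePosixPath(block_name) / session_name
```

For example, `session_dir(1, 23)` yields the relative path `BLOCK01/SES0123`, which stays within the 8-character limit for directory names.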
In addition to the previous structure the following directories are used to store some other files:
Table 1-2 Non-speech related directory structure
All sessions have complete recordings for all prompted items. Exceptions can be found in the summary text files.
The root directory contains three mandatory files:
COPYRIGH.TXT: a copyright text in ASCII format.
DISK.ID: an 11-character string with the volume name (required for systems that cannot read the physical volume label).
README.TXT: an ASCII text file that lists all files of the database, except for signal and label files which can be indicated by their name template.
The first two of these are duplicated on all the disks.
File names adhere to the common subset of the ISO 9660 standard, i.e. file names with 8 characters followed by a 3-character file extension:
.
where:
Database Identification Code (00-ZZ): B_
Progressive recording session number (0000 to 9999), where NN is the block number and MM is the session number
Corpus code
language code: CA for Catalan
File type code:
O: orthographic label file
A: A-law speech file
Table 1-3 File naming conventions
An overview of the corpus codes is presented in the tables in section 3. A list of separate documentation files, tables and listings follows below:
Directory
File
DOC
DESIGN.DOC
ISO88591.PS
SAMPALEX.PS
SAMPSTAT.TXT
SUMMARY.TXT
VALREP.DOC
TABLE
LEXICON.TBL
SESSION.TBL
INDEX
CONTENTS.LST
B_TRNCA.SES
B_TSTCA.SES
Table 1-4 Summary of documentation files
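The file naming convention described earlier in this section can be illustrated with a small parser. This is only a sketch: it assumes that the 8-character name is the 2-character database identification code, followed by the 4-digit session number (block NN, session MM) and the 2-character corpus code, and that the 3-character extension is the 2-character language code followed by the file type code. This layout is inferred from the field widths. The names `SpeechFileName` and `parse_file_name` and the sample file name in the usage note are hypothetical.

```python
import re
from typing import NamedTuple


class SpeechFileName(NamedTuple):
    database_id: str   # database identification code, e.g. "B_"
    block: int         # NN part of the session number
    session: int       # MM part of the session number
    corpus_code: str   # two-character corpus code
    language: str      # two-character language code, "CA" for Catalan
    file_type: str     # "O" orthographic label file, "A" A-law speech file


# Assumed layout: DD NNMM CC . LL F  (8 characters + 3-character extension)
_NAME_RE = re.compile(
    r"^(?P<db>[0-9A-Z_]{2})(?P<nn>\d{2})(?P<mm>\d{2})(?P<cc>[0-9A-Z]{2})"
    r"\.(?P<lang>[A-Z]{2})(?P<type>[AO])$"
)


def parse_file_name(name: str) -> SpeechFileName:
    """Split a file name into its fields, raising ValueError on mismatch."""
    match = _NAME_RE.match(name.upper())
    if match is None:
        raise ValueError(f"not a valid speech or label file name: {name!r}")
    return SpeechFileName(
        match["db"], int(match["nn"]), int(match["mm"]),
        match["cc"], match["lang"], match["type"],
    )
```

Under these assumptions, a made-up name such as `B_0123D1.CAO` would be read as database `B_`, block 1, session 23, corpus code `D1`, language `CA`, orthographic label file.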
1.4 Label files
Associated SAM label files are text files in which each row ends with a carriage return and line feed (according to the DOS format). Rows are produced according to the main SAM paradigm:
ABC: x, y, z, ...
Where:
ABC is a three-letter mnemonic followed by a colon, with no spaces allowed between them; the string “ABC:” as a whole is referred to as the SAM mnemonic;
after the mnemonic are all the defined items separated by commas;
missing items are accepted, and nothing needs to be put between the commas to substitute them;
spaces are not significant, but consistent spacing improves readability.
A label file begins with the mnemonic “LHD:” and ends with “ELF:”. The mnemonic “LBD:” splits a label file into two sections: the LABEL FILE HEADER and the LABEL FILE BODY.
All mnemonics used in the database are listed below. For each one the explanation, the format and the domain accepted for the related items are given.
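The row format and the header/body split described above can be sketched as a small reader. This is a minimal illustration rather than a reference implementation: the function name `parse_label_file` is hypothetical, the file content is assumed to be ISO-8859-1 text, and items are split naively on every comma, so an orthographic text that itself contains commas would need field-aware handling.

```python
from typing import List, Tuple

Row = Tuple[str, List[str]]


def parse_label_file(data: bytes) -> Tuple[List[Row], List[Row]]:
    """Split a SAM label file into (header rows, body rows).

    Each row is returned as (mnemonic, items). The header runs from
    "LHD:" up to and including "LBD:"; the body is everything after it,
    ending with "ELF:".
    """
    text = data.decode("iso-8859-1")
    rows: List[Row] = []
    # Rows end with CR LF; a bare LF is tolerated as well.
    for line in text.replace("\r\n", "\n").split("\n"):
        if not line.strip():
            continue
        mnemonic, colon, rest = line.partition(":")
        mnemonic = mnemonic.strip()
        if not colon or len(mnemonic) != 3:
            raise ValueError(f"malformed row: {line!r}")
        # Missing items stay as empty strings; surrounding spaces are dropped.
        rows.append((mnemonic, [item.strip() for item in rest.split(",")]))

    names = [name for name, _ in rows]
    if not names or names[0] != "LHD":
        raise ValueError("label file must begin with LHD:")
    if names[-1] != "ELF":
        raise ValueError("label file must end with ELF: (file may be truncated)")
    if "LBD" not in names:
        raise ValueError("label file has no LBD: row")
    split_at = names.index("LBD") + 1
    return rows[:split_at], rows[split_at:]
```

Because the final row must be “ELF:”, a file that was cut short is rejected instead of being silently accepted.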
1.4.1 Label file header
In the following we will present, grouped by similarity, all the attributes that have been selected.
The format defined for the various items is described in the second column of the following tables by using a C printf() format string.
The order of the mnemonics in the label file header is irrelevant except that "LHD:" is the first mnemonic and "LBD:" is the last one.
Identification rows
LHD: Label File Header. This is the starting mnemonic; it identifies the format name and the version number.
ELF: terminates a label file and provides a check for accidental file truncation.
CMT: comment rows. They can be placed anywhere except as the first or last row.
Session rows
DBN: database name.
VOL: volume identification name (this allows multi-volume speech databases to be built). It coincides with the volume name of the mastered CD-ROM and can be up to 11 characters long. The volume sequence number must start from “1”.
SES: session number, that is, the code associated with a recording session at run time during collection; it is a sequential four-digit number from 0000 to 9999.
Recording condition rows
REG: calling region.
ENV: calling environment.
NET: telephone network.
PHM: telephone handset model.
Speaker rows
As in the recording condition set, the main field is the speaker code. It is used as a pointer to a speaker file/table, but to speed up search processes some information is also put inside the label file. Relevant items are:
SCD: unique speaker code.
SEX: speaker sex (‘M’ale or ‘F’emale; void if unknown).
AGE: speaker age (precise or class mid-point; void if unknown).
ACC: speaker accent, i.e. the regional/dialectal colouring factor.
File rows
DIR: Directory of the signal file. It is reported in DOS format with a leading backslash (it allows unambiguous paths that start from the root directory).
SRC: Signal file name.
CCD: Corpus code. As reported before, this field is also included in the filename construction and has been designed as a two-character alphanumeric string; it is a compound code built by joining together the one-letter corpus identifier and the item identifier.
CRP: Corpus repetition. It is void for this database.
REP: Recording place. For telephone speech databases, the place where the recording machine was located is mandatory because, together with the recording condition fields, it completely specifies the environment.
RED: Recording date and time of the signal file.
RET: Recording time. The time is the starting session time.
BEG: The begin field specifies the starting point of the speech inside the data file, skipping any header; the SAM format used in this speech database sets this field to zero.
END: The end field specifies the end of the speech inside the data file; for this database END coincides with the file length minus one, as samples are one byte long.
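The relationship between BEG, END and the size of the signal file can be checked with a few lines of arithmetic. This sketch assumes the values stated in this section (8000 samples per second, one byte per sample, headerless files); the function names are hypothetical.

```python
SAMPLE_RATE_HZ = 8000   # sampling frequency stated for this database
BYTES_PER_SAMPLE = 1    # A-law samples are one byte long


def expected_end(file_size_bytes: int,
                 bytes_per_sample: int = BYTES_PER_SAMPLE) -> int:
    """Return the END value implied by the signal file size.

    END is the index of the last sample, i.e. the sample count minus one.
    """
    if file_size_bytes <= 0 or file_size_bytes % bytes_per_sample != 0:
        raise ValueError("file size must be a positive multiple of the sample size")
    return file_size_bytes // bytes_per_sample - 1


def duration_seconds(beg: int, end: int,
                     sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Return the duration covered by samples beg..end inclusive."""
    return (end - beg + 1) / sample_rate_hz
```

For a 16000-byte signal file, `expected_end(16000)` gives 15999, and `duration_seconds(0, 15999)` gives 2.0 seconds.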
Data file coding rows
SAM: the sampling frequency in hertz (set to 8000).
SNB: the number of bytes per sample (set to 1).
SBF: the sample byte order is fundamental for 16 bit samples, because it is different between DOS systems and UNIX. It is specified as a pair of digits “0” and “1”: the latter specifies the position of the most significant byte. However, this is irrelevant for A-law coding, which uses one byte only, so it will be left void.
SSB: number of significant bits per sample (set to 8).
QNT: the quantization attribute specifies the speech coding chosen, e.g. “A-LAW” or “MU-LAW”.
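Since the signal files hold one A-law byte per sample, a reader usually expands them to linear PCM before further processing. The following sketch uses the widely used table-free expansion for ITU-T G.711 A-law, producing values on a 16-bit scale. The function names are hypothetical, and the signal file is assumed to be headerless raw A-law data.

```python
from typing import List


def alaw_to_linear(code: int) -> int:
    """Expand one 8-bit A-law code to a linear sample on a 16-bit scale."""
    code ^= 0x55                      # undo the even-bit inversion of G.711
    magnitude = (code & 0x0F) << 4    # 4-bit mantissa
    segment = (code & 0x70) >> 4      # 3-bit segment (exponent)
    if segment == 0:
        magnitude += 8
    else:
        magnitude += 0x108
        magnitude <<= segment - 1
    # After the XOR, a set top bit marks a positive sample.
    return magnitude if code & 0x80 else -magnitude


def decode_alaw(data: bytes) -> List[int]:
    """Decode a raw A-law byte string into a list of linear samples."""
    return [alaw_to_linear(byte) for byte in data]
```

With this expansion the decoded values range from -32256 to 32256, and the two smallest magnitudes (codes 0xD5 and 0x55) map to +8 and -8.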
Information about the labelling session
These fields store information about the labelling session, usually a phonetic one.
EXP: expert name.
SYS: system used to label the file.
DAT: date of completion of labelling.
1.4.2 Label file body
LBR: stores information about the acquisition window (LaBelling during Recording) and the prompt text, i.e. what the speaker should have uttered. The related fields are: sequence beginning (in samples), sequence end, input gain on recording, minimum sample value, maximum sample value, orthographic text prompt. Input gain, minimum and maximum values are optional fields and can be left blank.
LBO: specifies a broad segmentation (LaBelling Orthographic) with the transcription of what the speaker actually said. The related fields are: sequence beginning (in samples), sequence centre, sequence end, corrected orthographic text. The sequence centre is optional and can be left blank (it is provided as an alternative when labelling phonetic events). The segmentation process is also optional and need not be performed; in this case the beginning and end coincide with the “LBR:” end points.
Blanks should not be treated as significant in label fields, except to separate words in the orthographic transcriptions.
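The field layout of the “LBR:” and “LBO:” rows can be illustrated with two small parsers. This is a sketch under stated assumptions: the class and function names are hypothetical, the input gain is kept as a string because its numeric format is not specified here, and the orthographic text is taken to be everything after the last fixed field so that commas inside it are preserved.

```python
from typing import NamedTuple, Optional


class LbrRow(NamedTuple):
    begin: int
    end: int
    gain: Optional[str]
    min_value: Optional[int]
    max_value: Optional[int]
    prompt: str


class LboRow(NamedTuple):
    begin: int
    centre: Optional[int]
    end: int
    text: str


def _optional_int(field: str) -> Optional[int]:
    """Return an int, or None when the optional field is left blank."""
    return int(field) if field else None


def _fields(row: str, mnemonic: str, count: int) -> list:
    """Check the mnemonic and split the row into `count` stripped fields."""
    name, _, rest = row.partition(":")
    if name.strip() != mnemonic:
        raise ValueError(f"expected an {mnemonic}: row, got {row!r}")
    # Limit the split so commas in the trailing text field survive.
    parts = [part.strip() for part in rest.split(",", count - 1)]
    if len(parts) != count:
        raise ValueError(f"{mnemonic}: row needs {count} fields: {row!r}")
    return parts


def parse_lbr(row: str) -> LbrRow:
    begin, end, gain, low, high, prompt = _fields(row, "LBR", 6)
    return LbrRow(int(begin), int(end), gain or None,
                  _optional_int(low), _optional_int(high), prompt)


def parse_lbo(row: str) -> LboRow:
    begin, centre, end, text = _fields(row, "LBO", 4)
    return LboRow(int(begin), _optional_int(centre), int(end), text)
```

Leading and trailing blanks are dropped from every field, while blanks between words of the orthographic text are kept, in line with the rule above.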
The orthographic text is written using the ISO-8859-1 (Latin 1) character set. It is directly compatible with the Windows environment but not with DOS and UNIX, where some of its characters are not displayed correctly.
Spontaneous questions are transcribed on “LBR:” rows by using a code word in angled brackets, according to the following table:
Corpus Code
Orthographic text prompt, in “LBR:”
D1
O2
Q1
Q2
T1
Table 1-5 Spontaneous orthographic text
These code words allow the user to easily distinguish spontaneous speech transcriptions from the normal read speech ones.