The challenge of TTS intelligibility for CALL
Intelligibility must be regarded as the first and most significant criterion for the assessment of synthetic speech. The term 'intelligibility' here is not confined to the recognition of individual phonemes or words in isolation, as it was used by a number of investigators at the beginning of the century, such as Koul (2003). Rather, it refers to the listener's ability to recognise (and orthographically transcribe) a set of sentences, without requiring any higher-level cognitive functioning such as constructing a coherent mental representation of the information contained in the text and relating it to any pre-existing knowledge which the listener may have (Kintsch & van Dijk, 1978).
Questions are raised as to the adequacy of tests of ‘intelligibility’ which are used for
synthetic voices.
It is argued in this paper that intelligibility tests alone are inadequate to describe the complexity of human reaction to synthetic voices, and a method for measuring 'ease of intelligibility', otherwise referred to here as 'clarity', is proposed.
Evaluating TTS Synthesis
There exists a considerable body of research into the evaluation procedures which may be used for synthetic speech in both commercial and educational environments. Amongst the many facets of the speech investigated are: syllable articulation; word intelligibility; sentence intelligibility; and overall quality, including rhythm, speaking rate, continuity, intonation, nearness to the human voice and suitability for the users' purpose (Itahashi, 2000). Campbell (2007) refers to the many different ways speech synthesis can be evaluated. These include diagnostic or comparative, subjective or objective, modular or global, and task-based or generic evaluations.
The Expert Advisory Group on Language Engineering Standards (EAGLES) produced a TTS assessment taxonomy to explore the many dimensions of synthesised speech, distinguishing between what they call 'black box' and 'glass box' assessments (van Bezooijen & van Heuven, 1997, p. 485). Glass box assessments focus on testing the output of specific components of a speech system in a laboratory environment and may use human raters or automated methods. This type of investigation is considered objective in that it produces objective data with which the quality of the synthetic speech output can be measured. The glass box assessment is used primarily as a diagnostic tool by system developers to improve the quality of the output of the TTS system being developed (Lampert, 2004). The main problem with this methodology is that there is not always a precise match between the objective measures it yields and more subjective measures of the quality of synthetic speech. Some objective measures may be over-sensitive compared to the human ear. Conversely, synthetic speech may be perfectly intelligible and rated highly in glass box assessments but nonetheless regarded as unnatural by the listener (Cryer & Home, 2010). It is accepted that current objective measures are not suitable for predicting the subjective quality of synthetic speech (Huang, 2011).
The black box assessment concentrates on the functionality of the overall system and its
fitness for use in specific situations such as the telephone answering systems used in
banking (Léwy & Hornstein, 1994). This is typically done by way of human evaluation of
the synthesised speech (Bachan, 2008). In a CALL context, evaluations of synthetic speech are conducted mainly by way of subjective judgements gleaned through Likert-type questionnaires. Such assessments try to evaluate the extent to which the
synthetic speech can help end users perform their intended task (Lampert, 2004) or try
to determine the fitness of a system for a purpose (Furui, 2007). Functionality rather
than form is the focus of attention.
It is widely claimed in recent literature reviews that state-of-the-art synthesised speech has developed to a point where 'intelligibility' is no longer a factor (Mayo, Clark, & King, 2011), and research is focusing more on factors such as naturalness, likeability, prosody, the ability to express emotions, persuasive abilities, etc. (Cryer & Home, 2010). However, the TTS systems used in specific CALL applications may differ considerably, and so high intelligibility cannot be taken for granted. This is particularly true for new or developing TTS systems, especially systems in a new language, as is the case here with the ABAIR system, and any evaluation of an emerging TTS system must include tests of 'intelligibility'.
The term ‘intelligibility’ itself needs examination since it should not be seen as an “all or
nothing” concept. While we undoubtedly need a task/measure to assess the degree of
intelligibility of utterances produced by a specific system, we would argue that it is
equally important to ascertain the ease with which listeners can carry it out. This latter
factor is likely to be a critical indicator of the eventual acceptability of a particular
synthesis-based CALL application. For that reason the term 'ease of intelligibility' is being introduced.
This may be seen as being closely related to 'clarity', or the mental effort required for successful completion of a task, frequently referred to as 'cognitive loading' (Pillay, 1994). 'Clarity' seems a more suitable concept when TTS synthetic speech is being used for a functional purpose such as CALL games. Two types of test are therefore introduced: a transcription task to ascertain intelligibility (i.e. Performance measures) and subjects' rating of clarity, which indicates Opinion measures (Cryer & Home, 2010).
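To make the distinction between the two measures concrete, the sketch below illustrates one plausible way of scoring them: word-level transcription accuracy against a reference sentence for the Performance measure, and a mean Likert rating for the Opinion measure. This is a minimal illustration under stated assumptions, not the scoring procedure used in the study; the function names, the bag-of-words accuracy formula and the 5-point scale are all hypothetical.

```python
# Illustrative scoring sketch only (not the authors' actual procedure).
# Assumptions: intelligibility is scored as the proportion of reference
# words recovered in the listener's transcription; clarity is the mean
# of subjects' 1-5 Likert ratings.

def intelligibility_score(reference: str, transcription: str) -> float:
    """Proportion of reference words correctly transcribed (Performance measure)."""
    ref_words = reference.lower().split()
    hyp_words = transcription.lower().split()
    # Count position-independent word matches (a simple bag-of-words overlap).
    matches = 0
    remaining = hyp_words.copy()
    for word in ref_words:
        if word in remaining:
            matches += 1
            remaining.remove(word)
    return matches / len(ref_words) if ref_words else 0.0

def clarity_score(likert_ratings: list[int]) -> float:
    """Mean of subjects' 1-5 Likert ratings of clarity (Opinion measure)."""
    return sum(likert_ratings) / len(likert_ratings)

if __name__ == "__main__":
    ref = "the cat sat on the mat"
    heard = "the cat sat on a mat"
    print(f"Intelligibility: {intelligibility_score(ref, heard):.2f}")  # 0.83
    print(f"Clarity: {clarity_score([4, 5, 3, 4]):.2f}")                # 4.00
```

Note that the two scores can diverge: a voice may yield near-perfect transcriptions yet receive low clarity ratings if listening to it demands considerable effort, which is precisely why both measures are collected.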