The Speech Synthesis Markup Language for the W3C VoiceXML Standard
(Continued from Part 1)
Future Study
Several element types and issues are proposed for inclusion
in future versions of the Speech Synthesis Markup Language.
Readers are referred to the specification itself for
the entire listing.
Other Phoneme Alphabets
All of the phoneme alphabets currently supported by SSML
share the same defect: they contain phonemic symbols
that were not designed to be expressed within
XML documents. Many symbols must be escaped
to display properly. The design of a new, XML-optimal
phoneme alphabet is currently under study.
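As a rough illustration of the escaping burden, the Python sketch below serializes two phoneme strings as XML attribute values. The element and attribute names ("phoneme", "alphabet", "ph") and both symbol strings are assumptions chosen for illustration; they are not quoted from the specification.

```python
import xml.etree.ElementTree as ET


def phoneme_element(symbols, text, alphabet):
    """Serialize a hypothetical <phoneme> element as ASCII-only XML bytes."""
    element = ET.Element("phoneme", {"alphabet": alphabet, "ph": symbols})
    element.text = text
    # The default "us-ascii" encoding forces every non-ASCII character
    # into a numeric character reference.
    return ET.tostring(element)


# Illustrative ASCII symbol string containing XML-special characters.
ascii_markup = phoneme_element('t@"m&<to', "tomato", "xsampa")

# Illustrative IPA string; none of the phonetic symbols are ASCII.
ipa_markup = phoneme_element("təˈmeɪtoʊ", "tomato", "ipa")

print(ascii_markup.decode("ascii"))
print(ipa_markup.decode("ascii"))
```

In both cases the serialized attribute value is noticeably harder to read than the symbol string the author typed, which is the problem an XML-optimal alphabet would aim to avoid.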
"lowlevel" Elements: Fine-Grained Acoustic-Prosodic Control
The "lowlevel" element has been proposed as a
container for a sequence of phoneme and pitch controls.
A lowlevel element may contain a sequence of zero or
more "ph" (phoneme with duration) and "f0" (timed
pitch target) elements, both of which are empty. The
elements may be interleaved or placed in separate
sequences. The "ph" element is specified with attributes
that designate the phoneme symbol and phoneme duration.
The "f0" element is specified with attributes that designate
a pitch target value and a time offset from the previous
pitch target.
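The structure just described might be generated as in the following Python sketch. The text names the elements but not their attributes, so the attribute names used here ("p" for the phoneme symbol, "d" for duration, "v" for the pitch value, "t" for the time offset) and the units (milliseconds and hertz) are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_lowlevel(phonemes, pitch_targets):
    """Build a <lowlevel> element from (symbol, duration) and (pitch, offset) pairs.

    Attribute names p, d, v and t are assumptions made for this sketch.
    """
    lowlevel = ET.Element("lowlevel")
    # Phonemes first, pitch targets second: the "separate sequences" layout.
    for symbol, duration_ms in phonemes:
        ET.SubElement(lowlevel, "ph", {"p": symbol, "d": str(duration_ms)})
    for hertz, offset_ms in pitch_targets:
        ET.SubElement(lowlevel, "f0", {"v": str(hertz), "t": str(offset_ms)})
    return lowlevel


element = build_lowlevel(
    phonemes=[("h", 60), ("@", 80), ("l", 70), ("oU", 150)],
    pitch_targets=[(110, 0), (130, 70), (95, 250)],
)
print(ET.tostring(element, encoding="unicode"))
```

Because neither child element is given text or children, each one serializes as an empty element, matching the description above.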
It is anticipated that synthesis content authoring tools
could automatically generate documents composed only
of low-level sequences. These documents would be rendered
into audio by low-complexity waveform generators. For
this reason, compactness of the individual elements
has been given priority over readability.
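A consumer of such a document would first have to turn the compact, relative encoding into an absolute timeline. The Python sketch below shows one way this might be done for an interleaved sequence. The attribute names (p, d, v, t) and the units are assumptions, since the text does not name them.

```python
import xml.etree.ElementTree as ET
from itertools import accumulate

# Hypothetical interleaved low-level sequence (attribute names assumed).
SAMPLE = """<lowlevel>
  <ph p="h" d="60"/><f0 v="110" t="0"/>
  <ph p="@" d="80"/><f0 v="130" t="70"/>
  <ph p="l" d="70"/>
  <ph p="oU" d="150"/><f0 v="95" t="250"/>
</lowlevel>"""


def read_lowlevel(xml_text):
    """Return (segments, contour) with the pitch targets on an absolute time axis."""
    root = ET.fromstring(xml_text)
    segments = [(ph.get("p"), int(ph.get("d"))) for ph in root.iter("ph")]
    targets = list(root.iter("f0"))
    # Each offset is relative to the previous target, so a running sum
    # yields the absolute time of every target.
    times = accumulate(int(f0.get("t")) for f0 in targets)
    contour = [(time, int(f0.get("v"))) for time, f0 in zip(times, targets)]
    return segments, contour


segments, contour = read_lowlevel(SAMPLE)
total_duration = sum(duration for _, duration in segments)
print(total_duration, contour)
```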
An overarching problem with SSML is that implementing a
speech synthesizer capable of responding in a perceptually
acceptable way to an 'emphasis' tag, for example, may
initially impose a substantial hardship on some engine
providers, especially small firms. For this reason it
has been proposed to create a document, separate from
the SSML specification, that provides high-level implementation
assistance for SSML application developers. In this
document, the high-level markup elements would be
defined, at least approximately, as sequences of low-level elements.
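One hypothetical reading of this proposal is sketched below in Python: emphasis is approximated by lengthening each phoneme and raising each pitch target in a low-level sequence. The scaling factors and the attribute names (p, d, v, t) are assumptions for illustration only. The proposed companion document may define emphasis quite differently.

```python
import xml.etree.ElementTree as ET


def emphasize(lowlevel, stretch=1.3, pitch_boost=1.2):
    """Return a new <lowlevel> element approximating an emphasized rendering.

    stretch and pitch_boost are arbitrary illustrative factors.
    """
    result = ET.Element("lowlevel")
    for child in lowlevel:
        attributes = dict(child.attrib)
        if child.tag == "ph":
            # Lengthen every phoneme.
            attributes["d"] = str(round(int(attributes["d"]) * stretch))
        elif child.tag == "f0":
            # Raise the pitch target and keep it aligned with the longer phonemes.
            attributes["v"] = str(round(int(attributes["v"]) * pitch_boost))
            attributes["t"] = str(round(int(attributes["t"]) * stretch))
        ET.SubElement(result, child.tag, attributes)
    return result


plain = ET.fromstring(
    '<lowlevel><ph p="n" d="50"/><f0 v="100" t="0"/>'
    '<ph p="oU" d="200"/><f0 v="90" t="200"/></lowlevel>'
)
strong = emphasize(plain)
print(ET.tostring(strong, encoding="unicode"))
```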
Intonational Controls
The existing specification supports many ways by which a
document author can affect the intonational rendering
of speech output. In part, this reflects the broad communicative
role of intonation in spoken language: it reflects document
structure, paragraph and sentence elements, prominence
and prosodic boundaries. Intonation also reflects emotion
and many less definable characteristics. The specification
could be enhanced to provide specific intonational controls
at boundaries, and at points of emphasis. In both cases
there are existing elements to which intonational attributes
could possibly be added.
Pronunciation Lexicon
There is often a need to use proper nouns or other unusual
words within text to be read aloud by a TTS system.
These words may not be present in the built-in lexicons
that accompany the platform. The goal of the pronunciation
lexicon markup specification is to provide a mechanism
for application developers to supply high-quality additional
pronunciations in a platform-independent manner. In
many cases application developers will need to provide
only one or two additional pronunciations inline
within other voice markups. There are other cases, however,
where an application may make use of large pronunciation
lexica that cannot conveniently be specified inline
and will have to be provided as separate documents.
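The two usage patterns can be connected, as the following Python sketch suggests: pronunciations are read from a separate lexicon document and then applied inline to the text being synthesized. No lexicon format is defined in this article, so the "lexicon" and "entry" elements, their attributes, and the phoneme strings are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Hypothetical external lexicon document; the format is an assumption.
LEXICON_XML = """<lexicon>
  <entry word="Nguyen" ph="wIn"/>
  <entry word="Worcester" ph="wUst@r"/>
</lexicon>"""


def load_lexicon(xml_text):
    """Map each listed word to its pronunciation string."""
    root = ET.fromstring(xml_text)
    return {entry.get("word"): entry.get("ph") for entry in root.iter("entry")}


def apply_lexicon(text, lexicon):
    """Wrap every word found in the lexicon in an inline phoneme element."""
    pieces = []
    # The capturing group keeps the words and the text between them.
    for token in re.split(r"(\w+)", text):
        if token in lexicon:
            pieces.append(
                "<phoneme ph=%s>%s</phoneme>"
                % (quoteattr(lexicon[token]), escape(token))
            )
        else:
            pieces.append(escape(token))
    return "<speak>" + "".join(pieces) + "</speak>"


lexicon = load_lexicon(LEXICON_XML)
document = apply_lexicon("Ask Nguyen about Worcester & Sons.", lexicon)
print(document)
```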
Standard Conformance Criteria
The conformance criteria in the SSML specification are derived
from the general XML 1.0 specification [6], and are
designed to ensure consistency and portability of SSML
documents across disparate platforms. XML 1.0 defines
the properties required of well-formed and valid
XML documents. These criteria also apply to conforming
SSML documents.
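The well-formedness half of this requirement can be checked with any XML parser. The Python sketch below uses the standard library, which tests well-formedness only; checking validity would additionally require a validating parser and the SSML DTD, which this sketch does not attempt. The sample documents are illustrative.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document):
    """Return True if the text parses as well-formed XML."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


good = "<speak>Hello <emphasis>world</emphasis>.</speak>"
bad = "<speak>Hello <emphasis>world</speak>"  # emphasis is never closed

print(is_well_formed(good), is_well_formed(bad))
```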
The additional conformance criteria are less strictly specified.
It is recommended, for example, that the SSML processor
generate an error event to inform its hosting environment
if an unsupported element or element form is encountered
within the SSML document. A conforming processor
should also produce some tangible output in response to each
output-altering markup element present in an SSML document.
The output should be generated in a manner that complies
with the functional description of the element in the
specification. Exceptions are allowed when an unsupported
language is specified within an element, or when a parameter
is specified that exceeds local computing or rendering
capabilities. In these cases, the SSML processor should
generate an error event. Subsequent behavior by the
application is platform dependent.
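The recommended error-event behavior might look like the following Python sketch, in which the hosting environment supplies a callback. The set of supported elements, the callback signature, and the message text are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Arbitrary set of elements this hypothetical processor can render.
SUPPORTED = {"speak", "paragraph", "sentence", "emphasis", "break", "prosody"}


def process(document, on_error):
    """Walk the document, reporting unsupported elements to the host.

    Returns the tags that were accepted for rendering, in document order.
    """
    accepted = []
    for element in ET.fromstring(document).iter():
        if element.tag not in SUPPORTED:
            # Raise an error event, then carry on with the rest of the document.
            on_error("unsupported element: " + element.tag)
            continue
        accepted.append(element.tag)
    return accepted


events = []
accepted = process(
    "<speak><paragraph>Hi <emphasis>there</emphasis>"
    "<whisper>psst</whisper></paragraph></speak>",
    events.append,
)
print(accepted, events)
```

Here the processor keeps going after reporting the error, which is only one of the platform-dependent behaviors the text allows.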
Conclusion
Widespread adoption of SSML by TTS engine developers may energize
the development of new classes of speech-enabled applications,
as well as new tools for authoring synthesizable content.
Because W3C specifications are not static, but are
updated to reflect common practice
and usage, it is anticipated that the SSML specification
will undergo changes in the future to reflect the practices
of SSML engine and content developers. As TTS technology
advances and synthesis engines produce
more natural-sounding speech, SSML may serve
the developers of synthesis content by providing for
more emotive speech output and less variation in the
rendered output across different synthesis platforms.
References
[1] Java Speech Markup Language Specification, Version
0.5, Sun Microsystems Inc., August 28, 1997.
[2] "SABLE: A Standard for TTS Markup", R. Sproat,
A. Hunt, M. Ostendorf, P. Taylor, A. Black, K. Lenzo,
M. Edgington, Proceedings Intl. Conf. Spoken Language
Processing, Sydney, November, 1998.
[3] Voice eXtensible Markup Language (VoiceXML),
version 1.0, May 2000 (http://www.voicexml.org/).
[4] Speech Synthesis Markup Language Specification, M. Walker,
A. Hunt, W3C Last-Call Working Draft, Jan 3, 2001 (http://www.w3.org/TR/speech-synthesis).
[5] Microsoft Speech API, version 5.0, Microsoft Inc., 1999.
[6] Extensible Markup Language (XML) 1.0, October, 2000
(http://www.w3.org/TR/REC-xml).
Copyright © 2001 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the IEEE Industry Standards
and Technology Organization (IEEE-ISTO).