VoiceXML Review - Feature Articles

Volume 1, Issue 4 - April 2001

The Speech Synthesis Markup Language for the W3C VoiceXML Standard

By Mark R. Walker and Andrew Hunt

(Continued from Part 1)

Future Study

Several element types and issues are proposed for inclusion in future versions of the Speech Synthesis Markup Language. Readers are referred to the specification itself for the entire listing.

Other Phoneme Alphabets

All of the phoneme alphabets currently supported by SSML share the same defect: their phonemic symbols were not designed for use within XML documents. Many symbols must be written with escape sequences in order to display properly. The design of a new, XML-optimal phoneme alphabet is currently under study.
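As a rough illustration of the escaping problem, the Python sketch below converts a string of IPA symbols into the numeric character references an ASCII-encoded XML document would need. The helper name to_ascii_xml and the sample transcription are assumptions made for this example; neither comes from the SSML specification.

```python
def to_ascii_xml(text):
    """Replace every non-ASCII character with an XML numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

# Illustrative IPA transcription of "tomato" (an assumption, not from the spec).
ipa_word = "təˈmɑːtoʊ"

escaped = to_ascii_xml(ipa_word)
print(escaped)  # each non-ASCII symbol appears as &#NNN;
```

A nine-symbol transcription grows to several times its length and is no longer readable by eye, which is the motivation for an alphabet built only from characters that need no escaping.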

"lowlevel" Elements: Fine-Grained Acoustic-Prosodic Control

The "lowlevel" element has been proposed as a container for a sequence of phoneme and pitch controls. A lowlevel element may contain zero or more "ph" (phoneme with duration) and "f0" (timed pitch target) elements, both of which are empty. The two element types may be interleaved or placed in separate sequences. The "ph" element carries attributes that designate the phoneme symbol and the phoneme duration. The "f0" element carries attributes that designate a pitch target value and a time offset from the previous pitch target.

It is anticipated that synthesis content authoring tools could automatically generate documents composed only of low-level sequences. These documents would be rendered into audio by low-complexity waveform generators. For this reason, compactness of the individual elements has been given priority over readability.
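The lowlevel structure described above can be sketched in Python as follows. The attribute names (p, d, v, t), the units (milliseconds and hertz) and the sample values are all assumptions made for illustration, since the element is only a proposal. The sketch parses an interleaved sequence and converts the relative timings into absolute ones, roughly as a simple waveform generator might before rendering.

```python
import xml.etree.ElementTree as ET

# Hypothetical attribute names: p = phoneme symbol, d = duration (ms),
# v = pitch target (Hz), t = offset from the previous pitch target (ms).
LOWLEVEL = """
<lowlevel>
  <ph p="h" d="60"/>
  <f0 v="120" t="0"/>
  <ph p="e" d="90"/>
  <f0 v="140" t="80"/>
  <ph p="l" d="70"/>
  <ph p="ou" d="150"/>
  <f0 v="100" t="250"/>
</lowlevel>
"""

def parse_lowlevel(xml_text):
    """Return (phonemes, pitch_targets) with absolute times in ms."""
    root = ET.fromstring(xml_text)
    phonemes, pitch_targets = [], []
    ph_clock = 0  # start time of the next phoneme
    f0_clock = 0  # absolute time of the most recent pitch target
    for child in root:
        if child.tag == "ph":
            duration = int(child.get("d"))
            phonemes.append((child.get("p"), ph_clock, duration))
            ph_clock += duration
        elif child.tag == "f0":
            # Each offset is relative to the previous pitch target.
            f0_clock += int(child.get("t"))
            pitch_targets.append((f0_clock, int(child.get("v"))))
    return phonemes, pitch_targets

phonemes, pitch_targets = parse_lowlevel(LOWLEVEL)
for symbol, start, duration in phonemes:
    print(symbol, start, duration)
for time_ms, hertz in pitch_targets:
    print(time_ms, hertz)
```

Because the phoneme and pitch timelines advance independently, the same result is obtained whether the "ph" and "f0" elements are interleaved or grouped into separate runs.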

An overarching problem with SSML is that implementing a speech synthesizer capable of responding in a perceptually acceptable way to an 'emphasis' tag, for example, may initially impose a substantial hardship on some engine providers, especially small firms. For this reason it has been proposed to create a document, separate from the SSML specification, that provides high-level implementation assistance for SSML application developers. In this document, the high-level markup elements would be loosely defined in terms of sequences of low-level elements.

Intonational Controls

The existing specification supports many ways by which a document author can affect the intonational rendering of speech output. In part, this reflects the broad communicative role of intonation in spoken language: it reflects document structure, paragraph and sentence elements, prominence and prosodic boundaries. Intonation also reflects emotion and many less definable characteristics. The specification could be enhanced to provide specific intonational controls at boundaries, and at points of emphasis. In both cases there are existing elements to which intonational attributes could possibly be added.

Pronunciation Lexicon

There is often a need to use proper nouns or other unusual words within text to be read aloud by a TTS system. These words may not be present in the built-in lexicons that accompany the platform. The goal of the pronunciation lexicon markup specification is to provide a mechanism for application developers to supply high-quality additional pronunciations in a platform-independent manner. In many cases application developers will need to provide only one or two additional pronunciations inline within other voice markups. There are other cases, however, where an application may make use of large pronunciation lexica that cannot conveniently be specified inline and will have to be provided as separate documents.

Standard Conformance Criteria

The conformance criteria in the SSML specification are derived from the general XML 1.0 specification [6], and are designed to ensure consistency and portability of SSML documents across disparate platforms. XML 1.0 defines the properties required of well-formed and valid XML documents. These criteria also apply to conforming SSML documents.

The additional conformance criteria are less strictly specified. It is recommended, for example, that the SSML processor generate an error event to inform its hosting environment if an unsupported element or element form is encountered within the SSML document. A conforming processor also should produce some tangible output in response to each output-altering markup element present in an SSML document. The output should be generated in a manner that complies with the functional description of the element in the specification. Exceptions are allowed when a non-supported language is specified within an element, or when a parameter is specified that exceeds local computing or rendering capabilities. In these cases, the SSML processor should generate an error event. Subsequent behavior by the application is platform dependent.
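A minimal Python sketch of the checks described above is given here, under stated assumptions. The SUPPORTED set is a hypothetical subset of element names, and the on_error callback stands in for whatever error-event mechanism a real hosting environment provides. The standard-library parser used here checks only well-formedness, not validity against a DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical subset of element names that this processor supports.
SUPPORTED = {"speak", "paragraph", "sentence", "emphasis", "break", "prosody"}

def check_document(ssml_text, on_error):
    """Report problems through on_error(kind, detail); return True if none are found."""
    try:
        root = ET.fromstring(ssml_text)
    except ET.ParseError as exc:
        # The document is not well-formed XML, so it cannot conform.
        on_error("not-well-formed", str(exc))
        return False
    ok = True
    for element in root.iter():
        if element.tag not in SUPPORTED:
            # Inform the hosting environment about the unsupported element.
            on_error("unsupported-element", element.tag)
            ok = False
    return ok

events = []
document = "<speak>Hello <emphasis>world</emphasis><lowlevel/></speak>"
result = check_document(document, lambda kind, detail: events.append((kind, detail)))
print(result, events)
```

What the application does after receiving such an event is left open here, in keeping with the platform-dependent behavior noted above.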

Conclusion

Widespread adoption of SSML by TTS engine developers may energize the development of new classes of speech-enabled applications, as well as new tools for authoring synthesizable content. Because W3C specifications are not static but are updated to reflect common practice and usage, it is anticipated that the SSML specification will change in the future to reflect the practices of SSML engine and content developers. As TTS technology advances and synthesis engines produce more natural-sounding speech, SSML may serve the developers of synthesis content by providing for more emotive speech output and less variation in the rendered output across different synthesis platforms.

References

[1] Java Speech Markup Language Specification, Version 0.5, Sun Microsystems Inc., August 28, 1997.

[2] "SABLE: A Standard for TTS Markup", R. Sproat, A. Hunt, M. Ostendorf, P. Taylor, A. Black, K. Lenzo, M. Edgington, Proceedings Intl. Conf. Spoken Language Processing, Sydney, November, 1998.

[3] Voice eXtensible Markup Language (VoiceXML), version 1.0, May 2000 (http://www.voicexml.org/).

[4] Speech Synthesis Markup Language Specification, M. Walker, A. Hunt, W3C Last-Call Working Draft, Jan 3, 2001 (http://www.w3.org/TR/speech-synthesis).

[5] Microsoft Speech API, version 5.0, Microsoft Inc., 1999.

[6] Extensible Markup Language (XML) 1.0, October, 2000 (http://www.w3.org/TR/REC-xml).

Copyright © 2001 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the IEEE Industry Standards and Technology Organization (IEEE-ISTO).