Volume 1, Issue 4 - April 2001
   
   
 

The Speech Synthesis Markup Language for the W3C VoiceXML Standard

By Mark R. Walker and Andrew Hunt

A new set of XML-based markup standards developed for the purpose of enabling voice browsing of the Internet will begin emerging in 2001 from the Voice Browser Working Group, which was recently organized under the auspices of the W3C. Among the first in this series of soon-to-be-released specifications is the speech synthesis text markup standard. The Speech Synthesis Markup Language (SSML) Specification is largely based on the Java Speech Markup Language (JSML) [1], but also incorporates elements and concepts from SABLE [2], a previously published text markup standard, and from VoiceXML [3], which itself is based on JSML and SABLE. SSML also includes new elements designed to optimize the capabilities of contemporary speech synthesis engines in the task of converting text into speech. This article summarizes the markup element design philosophy and includes descriptions of each of the speech synthesis markup elements.

The Voice Browser Working Group has utilized the open processes of the W3C for the purpose of developing standards that enable access to the web using spoken interaction. The nearly completed SSML specification [4] is part of a new set of markup specifications for voice browsers, and is designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in web and other applications. The essential role of the markup language is to give authors of synthesizable content a standard way to control aspects of speech output such as pronunciation, volume, pitch and rate across different synthesis-capable platforms.

It is anticipated that SSML will enable a large number of new applications simply because XML documents will be able to support viewable and audio output forms simultaneously. Email messages could contain SSML elements automatically inserted by synthesis-enabled mail editing tools that render the messages into speech when no text display is present. Web sites designed for sight-impaired users would likely acquire a standard form, and would be accessible with a potentially larger variety of Internet access devices. Finally, SSML has been designed to integrate with the Voice Dialogue markup standard in the creation of text-based dialogue prompts.

The greatest impact of SSML may be the way it spurs the development of new generations of synthesis-knowledgeable tools for assisting synthesis text authors. It is anticipated that authors of synthesizable documents will initially possess differing amounts of expertise. The effect of such differences may diminish as high-level tools for generating SSML content eventually appear. Some authors with little expertise may rely on choices made by the SSML processor at render time. Authors possessing higher levels of expertise will mark up as many details of the document as possible, both to ensure consistent speech quality across platforms and to more precisely specify output qualities. Other document authors, those who demand the highest possible control over the rendered speech, may utilize synthesis-knowledgeable tools to produce "low-level" synthesis markup sequences composed of phoneme, pitch and timing information for segments of documents or for entire documents.

The Markup Elements

"speak" and "prompt" elements

The Speech Synthesis Markup Language is an XML application. The root element "speak" is required when the document type is synthesis only.

<?xml version="1.0"?>
<speak>
SSML body ...
</speak>

SSML fragments are denoted by "prompt" containers when embedded in a VoiceXML dialogue:

<prompt>Please say your city.</prompt>

The prompt element controls the output of synthesized speech and prerecorded audio within a dialogue instance. The <prompt> ... </prompt> container is not required when there is no need to specify a prompt attribute and the prompt contains no speech markup, or when it consists of just an <audio> element:

<audio src="say_your_city.wav"/>

"xml:lang"

Following the XML convention, languages are indicated by an "xml:lang" attribute on the enclosing element, with the attribute value defining the language and an optional country code.

<speak xml:lang="en-US">
<paragraph>I don't speak Japanese.</paragraph>
<paragraph xml:lang="ja">Nihongo-ga wakarimasen.
</paragraph>
</speak>

Language information is inherited down the document hierarchy. The speech output platform determines its own behavior when a document requires speech output in a language that the platform does not support.

"paragraph" and "sentence" elements

The "paragraph" element represents the paragraph structure in text. A "sentence" element represents the sentence structure in text. A paragraph contains zero or more sentences.

<paragraph>
<sentence>This is the first sentence of the paragraph.
</sentence>
<sentence>Here's another sentence.</sentence>
</paragraph>

The use of paragraph and sentence elements is optional. Where text occurs without enclosing paragraph or sentence elements, the SSML processor should attempt to determine the structure according to common text formatting patterns of the language.

"say-as" element

The "say-as" element indicates the type of text construct contained within the element. This information is used to help disambiguate the pronunciation of the contained text. In any case, it is assumed that pronunciations generated through the use of explicit text markup always take precedence over pronunciations produced by a lexicon.

The "type" attribute is a required attribute that indicates the contained text construct. The base set of enumerated type values includes spell-out, (contained text is pronounced as individual characters), acronym (contained text is an acronym), number, date, time (time of day), duration (temporal duration), currency, measure (measurement), telephone (telephone number), name, net (internet identifier), and address (indicates a postal address).

<say-as type="spell-out">
USA </say-as>
<!-- U. S. A. -->

<say-as type="acronym">
DEC </say-as>

Rocky <say-as type="number">XIII</say-as>
<!-- Rocky thirteen -->

Pope John the
<say-as type="number:ordinal">VI</say-as>
<!-- Pope John the sixth -->

Deliver to
<say-as type="number:digits">123 </say-as>
Brookwood.
<!-- Deliver to one two three Brookwood-->

<say-as type="date:ymd"> 2000/1/20 </say-as>
<!-- January 20th two thousand -->

Proposals are due in
<say-as type="date:my"> 5/2001 </say-as>
<!-- Proposals are due in May two thousand and one -->

The total is <say-as type="currency">$20.45</say-as>
<!-- The total is twenty dollars and forty-five cents -->

<say-as type="net:email">
road.runner@acme.com
</say-as>
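
The remaining types follow the same pattern. As a brief sketch, with illustrative values and platform-dependent spoken renderings:

<say-as type="telephone"> (781) 555-1212 </say-as>
<!-- seven eight one, five five five, one two one two -->

It measures <say-as type="measure"> 10 cm </say-as>
<!-- It measures ten centimeters -->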

The "sub" attribute is a say-as attribute employed to indicate that the specified text replaces the contained text for pronunciation. This allows a document to contain both a spoken and written form.

<say-as sub="World Wide Web Consortium">
W3C </say-as>
<!-- World Wide Web Consortium -->

"phoneme" Element

The "phoneme" element provides a phonetic pronunciation for the contained text. The "phoneme" element may be empty. However, it is recommended that the element contain human-readable text that can be used for non-spoken rendering of the document. The "ph" attribute is a required attribute that specifies the phoneme string itself. The "alphabet" attribute is an optional attribute that specifies the phonetic alphabet. Phoneme alphabets currently supported by SSML include International Phonetic Alphabet (IPA), WorldBet, and X-SAMPA.

Well
<phoneme alphabet="worldbet" ph="h;&amp;l;ou">
hello
</phoneme>
there!
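
A similar sketch using the IPA alphabet (the character references below encode the IPA symbols for "hello" and are illustrative):

<phoneme alphabet="ipa" ph="h&#x259;&#x2C8;lo&#x28A;">
hello
</phoneme>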

"voice" element

The "voice" element is a production element that requests a change in speaking voice. Optional attributes include "gender", (gender of the voice to speak the contained text) with enumerated values male, female, neutral, "age", taking on (integer) values, "category", (indicates preferred age category of the voice) with enumerated values child, teenager, adult, elder, "variant", (indicates a preferred variant of the other voice) which takes on value (integer), and "name", (a platform-specific voice name). The value of "name" may be a space-separated list of names ordered from top preference down.

<voice gender="female" category="child">
Mary had a little lamb
</voice>

<!-- now request a different female child's voice -->
<voice gender="female" category="child" variant="2">
Its fleece was white as snow.
</voice>

<!-- platform-specific voice selection -->
<voice name="Mike">I want to be like Mike.</voice>

When there is no voice available that exactly matches the attributes specified in the document, a conforming SSML processor should throw an error event. Subsequent behavior of the application after the error event may be platform-specific. Voice attributes are inherited down the tree, including elements that change the language:

<voice gender="female">
Any female voice here.
<voice category="child">
A female child voice here.
<paragraph xml:lang="ja">
<!-- A female child voice in Japanese. -->
</paragraph>
</voice>
</voice>

A change in voice resets the prosodic parameters since different voices have different natural pitch and speaking rates. The "xml:lang" attribute may also be used to request usage of a voice with a specific dialect or other variant of the enclosing language.
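
For example, a sketch requesting a British English variant (assuming the platform offers such a voice):

<voice xml:lang="en-GB">
Mind the gap.
</voice>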

"emphasis" element

The "emphasis" element requests that the contained text be spoken with emphasis (also referred to as prominence or stress). The synthesizer essentially determines how to render emphasis, since the nature of emphasis differs between languages, dialects or even voices. The optional "level" attribute indicates the strength of emphasis to be applied.

That is a <emphasis> big </emphasis> car!
That is a <emphasis level="strong">
huge
</emphasis>bank account!

"break" Element

The "break" element is an empty element that controls the pausing or other prosodic boundaries between words. If the element is not defined, the speech synthesizer is expected to automatically determine a break based on the linguistic context. Optional attributes include "size" and "time".

Take a deep breath <break/> then continue.
Press 1 or wait for the tone. <break time="3s"/>
I didn't hear you!

In practice, the "break" element is most often used to override the typical automatic behavior of a speech synthesizer.
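
For example, a sketch using the "size" attribute (the enumerated relative values, such as "large" here, follow the working draft and are illustrative):

These are the headlines. <break size="large"/>
The stock market rose sharply today.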

"prosody" Element

The "prosody" element permits control of the pitch, speaking rate and volume of the speech output. The attributes include "pitch", the baseline pitch for the contained text in Hertz; "contour", which sets the actual pitch contour for the contained text; "rate", the speaking rate for the contained text; "duration", the time to take to read the element contents; and "volume", the volume for the contained text in the range 0.0 to 100.0. Relative changes for any of these attributes above are specified as floating-point values. For the pitch and range attributes, relative changes in semitones are permitted: "+5st", "-2st". Since speech synthesizers are not able to apply arbitrary prosodic values, conforming speech synthesis processors may set platform-specific limits on the values.

The price of the package is
<prosody rate="-10%">
<say-as type="currency">$45</say-as>
</prosody>
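
A sketch combining several prosody attributes (the values are illustrative, and a platform may clamp them to its own limits):

<prosody pitch="+5st" volume="80.0">
Thank you for calling.
</prosody>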

The "contour" attribute is used to define a set of pitch targets at specified intervals in the speech output. The algorithm for interpolating between the targets is platform-specific. In each pair of the form (interval,target), the first value is a percentage of the period of the contained text and the second value is the value of the "pitch" attribute.

<prosody contour="(0%,+20)(10%,+30%)(40%,+10)">
good morning
</prosody>

"audio" Element

The "audio" element supports the insertion of recorded audio files and the insertion of other audio formats in conjunction with synthesized speech output. The audio element may be empty. If the audio element is not empty, the contents correspond to the marked-up text to be spoken if the audio document is not available. The required attribute is "src", which is the URI of a document with an appropriate mime-type.

<!-- Empty element -->
Please say your name after the tone.
<audio src="beep.wav"/>
<!-- Container element with alternative text -->
<audio src="prompt.au">
What city do you want to fly from?</audio>

The "audio" element is not intended to be a complete mechanism for synchronizing synthetic speech output with other audio output or other output media (video, etc.).

"mark" Element

A "mark" element is an empty element that places a marker into the output stream for asynchronous notification. When audio output of the TTS document reaches the mark, the speech synthesizer issues an event that includes the required "name" attribute of the element. The platform defines the destination of the event. The "mark" element does not affect the speech output process.

Go from <mark name="here"/>
here, to
<mark name="there"/> there!

When supported by the implementation, requests can be made to pause and resume at document locations specified by the mark values.

Continued...


 

Copyright © 2001 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the
IEEE Industry Standards and Technology Organization (IEEE-ISTO).