The Speech Synthesis Markup Language for the W3C VoiceXML Standard
A
new set of XML-based markup standards developed for
the purpose of enabling voice browsing of the Internet
will begin emerging in 2001 from the Voice Browser Working
Group, which was recently organized under the auspices
of the W3C. Among the first in this series of soon-to-be-released
specifications is the speech synthesis text markup standard.
The Speech Synthesis Markup Language (SSML) Specification
is largely based on the Java Speech Markup Language
(JSML) [1], but also incorporates elements and concepts
from SABLE [2], a previously published text markup standard,
and from VoiceXML [3], which itself is based on JSML
and SABLE. SSML also includes new elements designed
to optimize the capabilities of contemporary speech
synthesis engines in the task of converting text into
speech. This article summarizes the markup element design
philosophy and includes descriptions of each of the
speech synthesis markup elements.
The
Voice Browser Working Group has utilized the open processes
of the W3C for the purpose of developing standards that
enable access to the web using spoken interaction. The
nearly completed SSML specification [4] is part of a
new set of markup specifications for voice browsers,
and is designed to provide a rich, XML-based markup
language for assisting the generation of synthetic speech
in web and other applications. The essential role of
the markup language is to give authors of synthesizable
content a standard way to control aspects of speech
output such as pronunciation, volume, pitch and rate
across different synthesis-capable platforms.
It
is anticipated that SSML will enable a large number
of new applications simply because XML documents would
be able to simultaneously support viewable and audio
output forms. Email messages would potentially contain
SSML elements automatically inserted by synthesis-enabled
mail editing tools that render the messages into speech
when no text display is present. Web sites designed
for sight-impaired users would likely acquire a standard
form, and would be accessible with a potentially larger
variety of Internet access devices. Finally, SSML has
been designed to integrate with the Voice Dialogue markup
standard in the creation of text-based dialogue prompts.
The
greatest impact of SSML may be the way it spurs the
development of new generations of synthesis-knowledgeable
tools for assisting synthesis text authors. It is anticipated
that authors of synthesizable documents will initially
possess differing amounts of expertise. The effect of
such differences may diminish as high-level tools for
generating SSML content eventually appear. Some authors
with little expertise may rely on choices made by the
SSML processor at render time. Authors possessing higher
levels of expertise will make considerable effort to
mark up as many details of the document as possible to ensure consistent
speech quality across platforms and to more precisely
specify output qualities. Other document authors, those
who demand the highest possible control over the rendered
speech, may utilize synthesis-knowledgeable tools to
produce "low-level" synthesis markup sequences
composed of phoneme, pitch and timing information for
segments of documents or for entire documents.
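A low-level fragment of this kind might look like the following sketch, in which the wrapping prosody values and the IPA phoneme string are illustrative assumptions rather than the output of any particular authoring tool:
<prosody pitch="120Hz" duration="450ms">
<phoneme alphabet="ipa" ph="h&#x259;&#x2C8;lo&#x28A;"> hello </phoneme>
</prosody>
<!-- Assumed values: a 120 Hz baseline pitch, a 450 ms duration,
and an IPA transcription of "hello". -->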
The
Markup Elements
"speak"
and "prompt" elements
The Speech
Synthesis Markup Language is an XML application. The
root element "speak" is required when the
document type is synthesis only.
<?xml
version="1.0"?>
<speak>
SSML body ...
</speak>
SSML
fragments are denoted by "prompt" containers
when embedded in a VoiceXML dialogue:
<prompt>Please
say your city.</prompt>
The
prompt element controls the output of synthesized speech
and prerecorded audio within a dialogue instance. The
<prompt> ... </prompt> container is not
required in cases where there is no need to specify
a prompt attribute, and the prompt further contains
no speech markups, or consists of just an <audio>
element:
<audio
src="say_your_city.wav"/>
"xml:lang"
Following
the XML convention, languages are indicated by an "xml:lang"
attribute on the enclosing element, with the attribute value
defining the language and, optionally, country code.
<speak
xml:lang="en-US">
<paragraph>I don't speak Japanese.</paragraph>
<paragraph xml:lang="ja">Nihongo-ga
wakarimasen.
</paragraph>
</speak>
Language
information is inherited down the document hierarchy.
The speech output platform determines its behavior in the
case that a document requires speech output in a language
that the platform does not support.
"paragraph"
and "sentence" elements
The
"paragraph" element represents the paragraph
structure in text. A "sentence" element represents
the sentence structure in text. A paragraph contains
zero or more sentences.
<paragraph>
<sentence>This is the first sentence of the
paragraph.
</sentence>
<sentence>Here's another sentence.</sentence>
</paragraph>
The
use of paragraph and sentence elements is optional.
Where text occurs without enclosing paragraph or
sentence elements, the SSML processor should attempt
to determine the structure according to common text
formatting patterns of the language.
"say-as"
element
The
"say-as" element indicates the type of text
construct contained within the element. This information
is used to help disambiguate the pronunciation of the
contained text. In any case, it is assumed that pronunciations
generated through the use of explicit text markup always
take precedence over pronunciations produced by a lexicon.
The
"type" attribute is a required attribute
that indicates the contained text construct. The base
set of enumerated type values includes spell-out
(contained text is pronounced as individual characters),
acronym (contained text is an acronym), number,
date, time (time of day), duration
(temporal duration), currency, measure
(measurement), telephone (telephone number),
name, net (internet identifier), and address
(indicates a postal address).
<say-as
type="spell-out">
USA </say-as>
<!-- U. S. A. -->
<say-as
type="acronym">
DEC </say-as>
Rocky
<say-as type="number">XIII</say-as>
<!-- Rocky thirteen -->
Pope
John the
<say-as type="number:ordinal">VI</say-as>
<!-- Pope John the sixth -->
Deliver
to
<say-as type="number:digits">123 </say-as>
Brookwood.
<!-- Deliver to one two three Brookwood-->
<say-as
type="date:ymd"> 2000/1/20 </say-as>
<!-- January 20th two thousand -->
Proposals
are due in
<say-as type="date:my"> 5/2001 </say-as>
<!-- Proposals are due in May two thousand and
one -->
The
total is <say-as type="currency">$20.45</say-as>
<!-- The total is twenty dollars and forty-five
cents -->
<say-as
type="net:email">
road.runner@acme.com
</say-as>
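The remaining types follow the same pattern. For instance, a telephone number might be marked up as shown below; the rendering given in the comment is only an assumption, since the exact expansion is platform-specific.
Call <say-as type="telephone"> 555-1234 </say-as> for assistance.
<!-- Call five five five, one two three four for assistance -->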
The
"sub" attribute is a say-as attribute
employed to indicate that the specified text replaces
the contained text for pronunciation. This allows a
document to contain both a spoken and written form.
<say-as
sub="World Wide Web Consortium">
W3C </say-as>
<!-- World Wide Web Consortium -->
"phoneme"
Element
The
"phoneme" element provides a phonetic pronunciation
for the contained text. The "phoneme" element
may be empty. However, it is recommended that the element
contain human-readable text that can be used for non-spoken
rendering of the document. The "ph"
attribute is a required attribute that specifies the
phoneme string itself. The "alphabet"
attribute is an optional attribute that specifies the
phonetic alphabet. Phoneme alphabets currently supported
by SSML include International Phonetic Alphabet (IPA),
WorldBet, and X-SAMPA.
Well
<phoneme alphabet="worldbet" ph="h;&l;ou">
hello
</phoneme>
there!
"voice"
element
The
"voice" element is a production element that
requests a change in speaking voice. Optional attributes
include "gender", (gender of the voice
to speak the contained text) with enumerated values
male, female, neutral, "age", taking
on (integer) values, "category", (indicates
preferred age category of the voice) with enumerated
values child, teenager, adult, elder, "variant",
(indicates a preferred variant of the other voice) which
takes on value (integer), and "name", (a platform-specific
voice name). The value of "name" may be a
space-separated list of names ordered from top preference
down.
<voice
gender="female" category="child">
Mary had a little lamb
</voice>
<!--
now request a different female child's voice -->
<voice gender="female" category="child"
variant="2">
Its fleece was white as snow.
</voice>
<!--
platform-specific voice selection -->
<voice name="Mike">I want to be like
Mike.</voice>
When
there is no voice available that exactly matches the
attributes specified in the document, a conforming SSML
processor should throw an error event. Subsequent behavior
of the application after the error event may be platform-specific.
Voice attributes are inherited down the tree, including
elements that change the language:
<voice
gender="female">
Any female voice here.
<voice category="child">
A female child voice here.
<paragraph xml:lang="ja">
<!-- A female child voice in Japanese. -->
</paragraph>
</voice>
</voice>
A
change in voice resets the prosodic parameters since
different voices have different natural pitch and speaking
rates. The "xml:lang" attribute may also be
used to request usage of a voice with a specific dialect
or other variant of the enclosing language.
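For example, a document might request a female voice with a British English dialect as follows; whether such a voice is available, and how the request is satisfied, is platform-dependent:
<voice xml:lang="en-GB" gender="female">
Mind the gap.
</voice>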
"emphasis" element
The
"emphasis" element requests that the contained
text be spoken with emphasis (also referred to as prominence
or stress). The synthesizer essentially determines how
to render emphasis, since the nature of emphasis differs
between languages, dialects or even voices. The optional
"level" attribute indicates the strength
of emphasis to be applied.
That
is a <emphasis> big </emphasis> car!
That is a <emphasis level="strong">
huge
</emphasis> bank account!
"break"
Element
The
"break" element is an empty element that controls
the pausing or other prosodic boundaries between words.
If the element is not present, the speech synthesizer
is expected to automatically determine a break based
on the linguistic context. Optional attributes include
"size" and "time".
Take
a deep breath <break/> then continue.
Press 1 or wait for the tone. <break time="3s"/>
I didn't hear you!
In
practice, the "break" element is most often
used to override the typical automatic behavior of a
speech synthesizer.
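The "size" attribute can be used instead of an explicit time to request a relative boundary strength. The value "large" used below is an assumed enumerated value; consult the specification for the exact set.
Welcome to the main menu. <break size="large"/>
Please select one of the following options.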
"prosody"
Element
The
"prosody" element permits control of the pitch,
speaking rate and volume of the speech output. The attributes
include "pitch", the baseline pitch
for the contained text in Hertz; "contour",
which sets the actual pitch contour for the contained
text; "range", the pitch range (variability) for
the contained text in Hertz; "rate", the speaking rate for
the contained text; "duration", the
time to take to read the element contents; and "volume",
the volume for the contained text in the range 0.0 to
100.0. Relative changes for any of these attributes
above are specified as floating-point values. For the
pitch and range attributes, relative changes in semitones
are permitted: "+5st", "-2st". Since
speech synthesizers are not able to apply arbitrary
prosodic values, conforming speech synthesis processors
may set platform-specific limits on the values.
The
price of the package is
<prosody rate="-10%">
<say-as type="currency">$45</say-as>
</prosody>
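Several of these attributes may be combined on one element, as in the following sketch; the values fall within the ranges described above, but a platform may clip them to its own limits:
<prosody pitch="+5st" rate="-10%" volume="80.0">
Please listen carefully to the following announcement.
</prosody>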
The
"contour" attribute is used to define
a set of pitch targets at specified intervals in the
speech output. The algorithm for interpolating between
the targets is platform-specific. In each pair of the
form (interval,target), the first value is a percentage
of the period of the contained text and the second value
is the value of the "pitch" attribute.
<prosody
contour="(0%,+20)(10%,+30%)(40%,+10)">
good morning
</prosody>
"audio"
Element
The "audio" element supports the insertion
of recorded audio files and the insertion of other audio
formats in conjunction with synthesized speech output.
The audio element may be empty. If the audio element
is not empty, the contents correspond to the marked-up
text to be spoken if the audio document is not available.
The required attribute is "src", which
is the URI of a document with an appropriate MIME type.
<!--
Empty element -->
Please say your name after the tone.
<audio src="beep.wav"/>
<!-- Container element with alternative text -->
<audio src="prompt.au">
What city do you want to fly from?</audio>
The
"audio" element is not intended to be a complete
mechanism for synchronizing synthetic speech output
with other audio output or other output media (video,
etc.).
"mark"
Element
A
"mark" element is an empty element that places
a marker into the output stream for asynchronous notification.
When audio output of the TTS document reaches the mark,
the speech synthesizer issues an event that includes
the required "name" attribute of the element.
The platform defines the destination of the event. The
"mark" element does not affect the speech
output process.
Go
from <mark name="here"/>
here, to
<mark name="there"/> there!
When
supported by the implementation, requests can be made
to pause and resume at document locations specified
by the mark values.
Continued...
Copyright
© 2001 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the
IEEE
Industry Standards and Technology Organization
(IEEE-ISTO).