Volume 1, Issue 3 - Mar. 2001
   
   
 

The Fundamentals of Text-to-Speech Synthesis

By Juergen Schroeter

Introduction

Text-to-speech (TTS) synthesis technology gives machines the ability to convert arbitrary text into audible speech, with the goal of being able to provide textual information to people via voice messages. Key target TTS applications in communications include: voice rendering of text-based messages such as email or fax as part of a unified messaging solution, as well as voice rendering of visual/text information (e.g., web pages). In the more general case, TTS systems provide voice output for all kinds of information stored in databases (e.g., phone numbers, addresses, car navigation information) and information services (e.g., restaurant locations and menus, movie guides, etc.). Ultimately, given an acceptable level of speech quality, TTS could also be used for reading books (i.e., Talking Books) and for voice access to large information stores such as encyclopedias, reference books, law volumes, etc.

VoiceXML, a voice markup language, enables ubiquitous, interactive speech services over the Web and supports telephone access to Web pages. The following simple, yet illustrative, example is taken from the online tutorial of the VoiceXML Forum [1]. It simply plays a TTS prompt:

<?xml version="1.0"?>
<vxml application="tutorial.vxml" version="1.0">
      <form id="someName">
            <block>
                <prompt>                                       Where do you want to go?
                </prompt>                             </block>
      </form>
</vxml>


Here the <prompt> tag causes the question to be rendered by a TTS engine and then played over the phone.

Complementary to (and a bit more realistic than) the above, a functional block diagram for an application of highly advanced speech technologies in a telecommunications setting is depicted in Figure 1. The customer, shown at the top center, makes a voice request to an automated customer-care application. The speech signal related to this request is analyzed by the Automatic Speech Recognition (ASR) subsystem shown on the top right. The ASR system "decodes" the words spoken and feeds these into the Spoken Language Understanding (SLU) component shown at the bottom right. The task of the SLU component is to extract the meaning of the words. Here, the words "I dialed a wrong number" imply that the customer wants a billing credit. Next, the Dialog Manager depicted in the bottom left determines the next action the customer-care system should take ("determine the correct number") and instructs the TTS component (shown in the top left) to synthesize the question "What number did you want to call?"

Figure 1: Flow Diagram of a Voice-Enabled Customer-Care Application
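
To make the flow concrete, the sketch below wires the four components together in plain Python. Everything in it is a hypothetical stand-in: the function names, the keyword test in the understanding step, and the canned prompt strings are illustrative only and do not correspond to any particular speech API.

def asr(audio):
    """Automatic Speech Recognition: decode the words that were spoken."""
    return "I dialed a wrong number"                  # stubbed recognition result

def slu(words):
    """Spoken Language Understanding: extract the meaning of the words."""
    return "billing_credit_request" if "wrong number" in words else "unknown"

def dialog_manager(intent):
    """Decide the next action and the prompt the system should speak."""
    if intent == "billing_credit_request":
        return "What number did you want to call?"
    return "How may I help you?"

def tts(text):
    """Text-to-Speech: render the prompt that is played back over the phone."""
    return text.encode()                              # stand-in for a synthesized waveform

def handle_turn(customer_audio):
    return tts(dialog_manager(slu(asr(customer_audio))))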

The attentive reader will have noticed that, in the illustration, the TTS output is "closest to the customer's ear". Experience indeed shows that customers tend to weight TTS/speech output quality very heavily in judging the quality of the overall voice-enabled system, and that they tend to make this judgment very quickly, after hearing just a few prompts. Therefore, application developers and system integrators are rightfully reluctant to adopt TTS technology, accepting only the highest-quality systems or, at least for simple applications, recording static prompts using a human voice talent.

What constitutes a high-quality TTS system? TTS quality is characterized by two factors: the intelligibility of the speech that is produced and the naturalness of the overall message that is spoken. For the past 30 years or more, intelligibility has been the driving factor in building TTS systems, since without high intelligibility a TTS system serves no useful purpose. As a result, most modern TTS systems are highly intelligible, with formal tests showing TTS word intelligibility approaching that of naturally spoken speech. Until the mid-1990s, however, significantly less success had been achieved in making synthetic speech sound natural, as if it came from a recording of a human speaker. Experience has shown that, even with high intelligibility, there is a minimum level of voice quality (we call this 'customer quality') that is essential before consumers will agree both to listen to synthetic speech on a regular basis and to pay for the services associated with using it. Hence, the objective of most modern research in TTS systems is to continue achieving high intelligibility while, at the same time, providing synthetic speech that is of customer quality or higher.

Clearly, passing the Turing Test in synthesis for all possible applications, for all kinds of input text, and with all possible emotions expressed in the voice is not possible today, but it might well remain the topic of speech synthesis research for years to come. A more practical, short-term approach is to start from the application side and ask what synthesis quality is "good enough" for a given application and whether there is technology today that might satisfy the requirements of that specific application. For example, if all the application needs to do is synthesize telephone numbers, close-to-perfect results can be achieved quite easily [try, e.g., http://www.research.att.com/~mjm/cgi-bin/saynum]. However, for a reasonably open domain such as news or email reading, it would be dishonest to claim that synthesis quality today is high enough to be judged completely "natural-sounding."

This article is organized as follows. First, we review established methods for synthesizing speech from text. Then, we focus on the method that has led to the highest quality synthetic speech: concatenative TTS. Finally, we extend concatenative TTS to include on-line Unit Selection and explain how it achieves high naturalness while keeping the computational requirements moderate.

"Traditional Synthesis Methods

There exist several different methods to synthesize speech. Each method falls into one of the following categories: articulatory synthesis, formant synthesis, and concatenative synthesis.

Articulatory synthesis uses computational biomechanical models of speech production, such as models of the glottis (which generates the periodic and aspiration excitation) and of the moving vocal tract. Ideally, an articulatory synthesizer would be controlled by simulated muscle actions of the articulators, such as the tongue, the lips, and the glottis. It would solve time-dependent, three-dimensional differential equations to compute the synthetic speech output. The interested reader is referred to [2] for more information. Unfortunately, besides having notoriously high computational requirements, articulatory synthesis does not, at present, result in natural-sounding fluent speech (static vowels, for example, as well as vowel-to-vowel transitions, can be synthesized to sound "natural", but most stop consonants sound mediocre at best). Speech scientists still lack significant knowledge needed to achieve this somewhat elusive goal.

Formant synthesis uses a set of rules to control a highly simplified source-filter model, which assumes that the (glottal) source is completely independent of the filter (the vocal tract). The filter is determined by control parameters such as formant frequencies and bandwidths. Each formant is associated with a particular resonance (a "peak" in the filter characteristic) of the vocal tract. The source generates either stylized glottal or other pulses (for periodic sounds) or noise (for aspiration or frication). Formant synthesis generates highly intelligible, but not completely natural-sounding speech. However, it has the advantage of a low memory footprint and only moderate computational requirements.
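
As a rough illustration of the source-filter idea (not of any particular formant synthesizer's rule set), the following NumPy/SciPy sketch excites a cascade of second-order resonators with an impulse train standing in for glottal pulses. The formant frequencies and bandwidths below are merely plausible values for a static vowel, chosen for illustration.

import numpy as np
from scipy.signal import lfilter

fs = 16000                        # sampling rate in Hz
f0 = 120                          # fundamental frequency of the "glottal" source
dur = 0.5                         # half a second of a static vowel

# Source: an impulse train at the pitch period (a crude stand-in for glottal pulses).
n = int(fs * dur)
source = np.zeros(n)
source[::int(fs / f0)] = 1.0

# Filter: a cascade of digital resonators, one per (formant frequency, bandwidth) pair in Hz.
formants = [(700, 80), (1200, 90), (2600, 120)]
speech = source
for freq, bw in formants:
    r = np.exp(-np.pi * bw / fs)                  # pole radius set by the bandwidth
    theta = 2 * np.pi * freq / fs                 # pole angle set by the formant frequency
    a = [1.0, -2 * r * np.cos(theta), r * r]      # resonator denominator coefficients
    b = [1.0 - r]                                 # rough gain scaling
    speech = lfilter(b, a, speech)

# 'speech' now holds a buzzy, vowel-like waveform; a full synthesizer would also add
# noise excitation for aspiration/frication and rules that move the formants over time.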

Concatenative synthesis uses actual snippets of recorded speech that were cut from recordings and stored in an inventory ("voice database"), either as "waveforms" (uncoded), or encoded by a suitable speech coding method (see footnote 1). Elementary "units" (i.e., speech segments) are, for example, phones (a vowel or a consonant), or phone-to-phone transitions ("diphones") that encompass the second half of one phone plus the first half of the next phone (e.g., a vowel-to-consonant transition). Some concatenative synthesizers use so-called demi-syllables (i.e., half-syllables; syllable-to-syllable transitions), in effect, applying the "diphone" method to the time scale of syllables. Concatenative synthesis itself then strings together (concatenates) units selected from the voice database, and, after optional decoding, outputs the resulting speech signal. Because concatenative systems use snippets of recorded speech, they have the highest potential for sounding "natural". In order to understand why this goal was, until recently, hard to achieve and what has changed in the last few years, we need to take a closer look.
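
A toy sketch of the diphone flavor of concatenative synthesis follows, under the simplifying assumption that a voice database ('voice_db', a hypothetical dictionary mapping diphone names to waveform arrays) already exists; the phone names and padding convention are illustrative only.

import numpy as np

def phones_to_diphones(phones):
    """Turn a phone sequence into the diphone units that span each transition."""
    padded = ["sil"] + phones + ["sil"]           # pad with silence at both ends
    return [f"{a}-{b}" for a, b in zip(padded[:-1], padded[1:])]

def concatenate_units(unit_names, voice_db):
    """String together the stored waveform snippets for the requested units."""
    return np.concatenate([voice_db[name] for name in unit_names])

# For example, phones_to_diphones(["n", "ow"]) yields ["sil-n", "n-ow", "ow-sil"];
# 'voice_db' would map each of these names to a recorded (or decoded) waveform snippet.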

Concatenative TTS Systems

A block diagram of a typical concatenative TTS system is shown in Figure 2. The first block is the text analysis module, which takes ASCII message text and converts it into a series of phonetic symbols and prosody (fundamental frequency, duration, and amplitude) targets. The text analysis module actually consists of a series of modules with separate, but in many cases intertwined, functions. The input text is first analyzed, and non-alphabetic symbols and abbreviations are expanded into full words. For example, in the sentence "Dr. Smith lives at 4305 Elm Dr.", the first "Dr." is transcribed as "Doctor", while the second one is transcribed as "Drive". Next, "4305" is expanded to "forty three oh five". Then, a syntactic parser recognizes the part of speech of each word in the sentence and labels the text accordingly. One function of this syntactic analysis is to disambiguate the constituents of the sentence so that, with the help of a pronunciation dictionary, the correct string of phones can be generated. Thus, for the above sentence, the verb "lives" is disambiguated from the (potential) noun "lives" (plural of "life"). If the dictionary look-up fails, general letter-to-sound rules are used. Finally, with the punctuated text and the syntactic and phonological information available, a prosody module predicts sentence phrasing and word accents and, from those, generates targets for, e.g., fundamental frequency, phoneme duration, and amplitude.

The second block in Fig. 2 assembles the units according to the list of targets set by the front end. It is this block that is responsible for the recent innovation towards much more natural-sounding synthetic speech. The selected units are then fed into a back-end speech synthesizer that generates the speech waveform for presentation to the listener.
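
To make the text-normalization step described above concrete, here is a toy sketch that expands the address number and the ambiguous abbreviation from the example sentence. The capitalization heuristic and the two-digit grouping are deliberate simplifications; a real front end relies on a dictionary, a syntactic parser, and letter-to-sound rules as described in the text.

ONES = ["oh", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = {2: "twenty", 3: "thirty", 4: "forty", 5: "fifty",
        6: "sixty", 7: "seventy", 8: "eighty", 9: "ninety"}

def expand_pair(pair):
    """Read a two-digit group as spoken in an address, e.g. '43' -> 'forty three'."""
    tens, ones = int(pair[0]), int(pair[1])
    if tens == 0:
        return f"oh {ONES[ones]}"
    if tens == 1:
        return TEENS[ones]
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"

def expand_address_number(digits):
    """'4305' -> 'forty three oh five' (spoken as two-digit groups)."""
    return " ".join(expand_pair(digits[i:i + 2]) for i in range(0, len(digits), 2))

def expand_dr(token, next_token):
    """Toy heuristic: 'Dr.' before a capitalized name reads 'Doctor', else 'Drive'."""
    if token == "Dr.":
        return "Doctor" if next_token and next_token[0].isupper() else "Drive"
    return token

# expand_dr("Dr.", "Smith") -> "Doctor"; expand_dr("Dr.", None) -> "Drive"
# expand_address_number("4305") -> "forty three oh five"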

In the last 3-4 years, TTS systems have become much more natural sounding, mostly due to a wider acceptance of corpus-driven unit-selection synthesis paradigms. In a sense, the desire for more natural-sounding synthetic voices that is driving this work was a natural extension of the earlier desire to achieve high intelligibility. We have started a new era in synthesis, where, under certain conditions, listeners cannot say with certainty whether the speech they are listening to was recorded from a live talker, or is being synthesized. The new paradigm for achieving very high quality synthesis using large inventories of recorded speech units is called "unit-selection synthesis".

What is behind unit-selection synthesis and the corresponding sea change in voice quality it achieves? Many dimensions come into play. One important aspect is the ever-increasing power and storage capacity of computers, which has a direct effect on the size of the voice inventory we can store and work with. Where early concatenative synthesizers used very few (mostly one) prototypical units for each class of inventory elements, we can now easily afford to store many such units. Another important aspect is that efficient search techniques are now available that allow searching potentially millions of available sound units in real time for the optimal sequence that makes up a target utterance. Finally, we now have automatic labelers that speed up labeling a voice database phonetically and prosodically. It is important to note that both the automatic labelers and the optimal search strategies borrow heavily from speech recognition. In the following, we will briefly touch upon these issues, after first reviewing "diphone synthesis."
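
The kind of efficient search mentioned above can be sketched as a dynamic-programming (Viterbi-style) pass over candidate units. In the sketch below, 'candidates[i]' is the list of database units that could realize target position i, 'target_cost' measures how well a unit matches the prosodic/phonetic target, and 'join_cost' measures how smoothly two units concatenate; these names and the cost functions themselves are assumptions for illustration, not the specifics of any deployed system.

def select_units(targets, candidates, target_cost, join_cost):
    """Return the lowest-total-cost sequence of units, one per target position."""
    # best[i][j] = (cumulative cost, index of best predecessor) for candidate j at position i
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            tc = target_cost(targets[i], u)
            # choose the predecessor minimizing cumulative cost plus the join cost to u
            cost, prev = min(
                (best[i - 1][k][0] + join_cost(v, u) + tc, k)
                for k, v in enumerate(candidates[i - 1])
            )
            row.append((cost, prev))
        best.append(row)
    # backtrack from the cheapest final candidate
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))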

Figure 2: Block Diagram of TTS Synthesis System

Continued...

Footnote 1: Note that, in order to achieve complete naturalness, any speech codec (coder/decoder) used for TTS needs to be completely transparent; that is, speech that was encoded and then decoded again needs to sound just like the original. This means that the codec needs to work at a relatively high bit rate.


 

Copyright © 2001 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the
IEEE Industry Standards and Technology Organization (IEEE-ISTO).