Volume 1, Issue 5 - May 2001

Answers to Your Questions About VoiceXML

By Jeff Kunins

(Continued from Part 1)

Q: How is the information that the caller speaks into the phone recorded? Is it converted into text? Is it subject to high error rates due to different styles of talking?

A: DSP (Digital Signal Processing) hardware from companies such as Intel Dialogic and Natural MicroSystems digitally captures audio coming in over the phone line. Speech recognition never "converts" the audio stream to text - it uses advanced statistical models to determine a confidence level that what the user said matches an entry in an expected list or "grammar" of phrases for a given interaction. This technology is quite resilient to users with varying accents of a given language, and the underlying "acoustic" models are available for many different languages and locales (e.g. U.S. English vs. British English vs. Brazilian Portuguese vs. European Portuguese).
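
For illustration, here is a minimal VoiceXML sketch of a hypothetical "city guide" service (the service name and call flow are made up for this example, not taken from any particular deployment). The two <choice> phrases constitute the entire "grammar" for the interaction: the recognizer listens only for "movies" or "restaurants," scores the incoming audio against those phrases, and reports its confidence in the best match rather than producing free-form text.

    <?xml version="1.0"?>
    <vxml version="1.0">

      <!-- The <choice> phrases below are the whole "grammar" for this menu:
           the recognizer listens only for "movies" or "restaurants". -->
      <menu id="main">
        <prompt>Welcome to the city guide. Say one of: <enumerate/></prompt>
        <choice next="#movies">movies</choice>
        <choice next="#restaurants">restaurants</choice>
      </menu>

      <form id="movies">
        <block>Movie listings would play here. <goto next="#main"/></block>
      </form>

      <form id="restaurants">
        <block>Restaurant reviews would play here. <goto next="#main"/></block>
      </form>

    </vxml>

If the caller says something outside this short list, the platform throws a "nomatch" event rather than guessing at arbitrary text; the issues below look at what it takes to make such interactions work well in practice.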

That said, voice recognition is demonstrably mature enough for mission-critical applications, and production systems successfully automate hard tasks such as driving directions and stock trading every day. However, recognizing human speech is still an enormously complex computational task that relies on applying sophisticated heuristic techniques to massive statistical data models in real time. Voice recognition software does not perform optimally "out of the box," and applications require specialized design and tuning by qualified experts to be successful. Consider the following issues:

  • People will always say unexpected things. People are accustomed to having "real" conversations over the phone; they immediately assume voice applications can understand whatever they say, and they can quickly get frustrated when those expectations aren't met. Even applications that prompt callers to choose from a short menu can consistently get hundreds of distinct responses. Voice applications can only "understand" the specific things they have been trained for, much like a traveler who arrives in a foreign country armed only with a phrasebook. As with people, voice applications do their best to match what they're hearing with the phrases they know, and can easily mistake similar-sounding words for ones that are actually in their list. Depending on the situation, this can quickly lead to a frustrating experience. For example, consider a simple menu of keywords that includes "movies" and "restaurants." Callers who say "moving," not knowing it is an invalid choice, are likely to be routed consistently into "movies" and become very frustrated. For this reason, applications must use clear, concise prompting to guide callers toward "the right" things to say, and must draw on data from large amounts of real-world usability testing to account for the unexpected things people tend to say (the first sketch following this list shows how an application can re-prompt callers when recognition fails). Minute shifts in prompt wording or in the underlying "grammars" can have dramatic effects on usability, and ultimately on the automation rate and ROI of voice applications.

  • "Grammars" must be tuned. As stated above, voice recognition technology works by comparing what the caller said to a specific list of expected choices. These "grammars" are required to make it computationally feasible to do speaker-independent voice recognition in real time. Large grammars, such as the 10,000+ companies listed on U.S. stock exchanges, can work very well in production today, but doing so requires careful attention from both the application designers and the speech scientists "tuning" the underlying recognition engine. For example, "Pfizer" and "Fiserv" sound almost identical; the underlying grammar must be tuned to know which choice is more commonly selected, and the application must be carefully crafted to help callers get back on track when the system makes a mistake.

  • People pronounce the same words differently. Pronunciations for words and phrases can vary widely across regions of a given country. Proper names further complicate the matter - consider, for example, how to pronounce "Qantas Airways" or "Worcester Court." Voice recognition engines rely on built-in dictionaries that specify each of the ways callers may say each word and common phrase. If a grammar includes a word that isn't in the dictionary, the recognition engine must "guess" how it is supposed to be pronounced. While this can work reasonably well, it's likely to make mistakes or miss common alternative pronunciations. Especially because voice recognition is rapidly being deployed in new industries for new applications, it's critical to ensure that all relevant pronunciations are in the dictionaries - without this, recognition quality and automation rates can suffer significantly.

  • "Acoustic Models" must be continually refined. Voice recognizers use "acoustic models" to decide whether a caller has said something that matches a given grammar. Acoustic models are essentially a mathematical representation of how a wide variety of people sound when they say each of the building blocks of words (e.g. "buh" or "ing"). Acoustic models are built by analyzing millions of diverse recordings of real people actually speaking over the telephone. The more data that gets used to "train" these acoustic models, the better recognition quality becomes, particularly when the data is collected under real-world conditions using the same hardware and software. In addition, it's critical to ensure that the voice recognition software has been adequately trained on all of the words and phrases that make up the grammars for a particular application.

  • Noisy environments are problematic. Phone conversations - particularly those on mobile phones - often have a lot of background noise. This noise can be ambient sound (e.g. wind or cars honking), ambient conversation (e.g. other tables in a restaurant), side speech (e.g. "Kids, I said stop it!"), or unintended sounds (e.g. a cough or sneeze). Consider how difficult it sometimes is even for "real people" to distinguish the actual conversation from background noise; the problem is compounded for voice recognition engines because they have far less intelligent context for differentiating sounds and speakers' voices from one another. Voice applications, and voice recognition platforms, must be carefully designed to accommodate and minimize the difficulties presented by background noise (the second sketch following this list shows the kind of recognizer settings an application can adjust).

  • Hundreds of thousands of calls must be transcribed by hand. In order to compile the data necessary to address most of the problems listed above, it is necessary to manually compare what callers "actually" say with what the voice recognition software "thought" they said. Very large numbers of calls must be manually transcribed in this fashion so that speech scientists can analyze the data and determine how accurately each grammar in an application is performing. This is a very labor-intensive process, but it is critical: it gives designers the information they need to make the adjustments to call flows, grammars, prompts, pronunciation dictionaries, and acoustic models that are necessary to achieve the expected benefits of great voice applications.
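
To make the prompting and error-recovery points above concrete, the sketch below extends the same hypothetical city-guide menu with VoiceXML <noinput> and <nomatch> handlers. The wording and the escalation strategy (re-prompt, then give more explicit guidance, then hand off) are illustrative choices rather than the one correct design; the point is that the application, not the recognition engine, decides how to guide callers back on track.

    <?xml version="1.0"?>
    <vxml version="1.0">

      <menu id="main">
        <prompt>Say movies or restaurants.</prompt>
        <choice next="#movies">movies</choice>
        <choice next="#restaurants">restaurants</choice>

        <!-- The caller said nothing: repeat the question. -->
        <noinput>I didn't hear anything. <reprompt/></noinput>

        <!-- First misrecognition: a short, neutral re-prompt. -->
        <nomatch count="1">Sorry, I didn't catch that. <reprompt/></nomatch>

        <!-- Second misrecognition: spell out exactly what to say. -->
        <nomatch count="2">Please say just the word movies, or the word restaurants.</nomatch>

        <!-- Repeated trouble: stop looping and hand the caller off. -->
        <nomatch count="3"><goto next="#agent"/></nomatch>
      </menu>

      <form id="movies">
        <block>Movie listings would play here. <goto next="#main"/></block>
      </form>

      <form id="restaurants">
        <block>Restaurant reviews would play here. <goto next="#main"/></block>
      </form>

      <form id="agent">
        <block>Please hold while we connect you to an agent. <exit/></block>
      </form>

    </vxml>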

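On the background-noise point specifically, VoiceXML also exposes generic recognizer properties that an application can adjust, for example to demand a higher confidence score before a result is accepted. The sketch below shows the idea; the particular values are illustrative guesses, and their exact effect (and the platform defaults) varies from one recognition engine to another.

    <?xml version="1.0"?>
    <vxml version="1.0">

      <!-- Document-level recognizer settings. The values are illustrative only;
           defaults and behavior are platform-specific. -->
      <property name="confidencelevel" value="0.6"/> <!-- reject low-confidence results -->
      <property name="sensitivity" value="0.4"/>     <!-- be less sensitive to quiet background sound -->
      <property name="timeout" value="5s"/>          <!-- wait longer before a "no input" event -->

      <menu id="main">
        <prompt>Say movies or restaurants.</prompt>
        <choice next="#movies">movies</choice>
        <choice next="#restaurants">restaurants</choice>
        <nomatch>Sorry, I didn't catch that. <reprompt/></nomatch>
      </menu>

      <form id="movies">
        <block>Movie listings would play here. <exit/></block>
      </form>

      <form id="restaurants">
        <block>Restaurant reviews would play here. <exit/></block>
      </form>

    </vxml>
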
Q: Other than VoiceXML, are there other languages that will permit voice-activated web access?

A: There will likely be other attempts, but the industry has very rapidly converged on VoiceXML as the standard of choice. It's unlikely that another language will attain significant market share in the next several years.


Copyright © 2001 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the
IEEE Industry Standards and Technology Organization (IEEE-ISTO).