VoiceXML Review - Feature Articles

Volume 1, Issue 5 - May 2001

City CarShare Reservation System: A VoiceXML Case Study

By Rachel McConnell and Bryan Michael

(Continued from Part 1)

Speech Recognition

Automatic Speech Recognition (ASR) provides a way to capture a caller's utterances, compare these utterances to acceptable grammars in a meaningful way, and return an accurate result. Generally, as the number of allowable utterances in a grammar grows, complexity increases and recognition accuracy decreases -- in many cases non-linearly. True natural language processing is thus not yet practicable. However, advances in recognition algorithms and lower infrastructure costs, associated with both servers and telephony minutes, have driven adoption of speech technology, and voice recognition has come a long way in the last few years.

Indigo egg designed the City CarShare prompts to elicit dissimilar caller utterances, reducing or eliminating the need for disambiguation. In many applications, however, this is not possible, and the question arises of how to disambiguate between rhyming or otherwise similar-sounding utterances. Some techniques for improving speech recognition include:

Grammar tuning can reduce many types of recognition errors. For example, cross-wording can fix utterances in which words run together into phrases. Adding representative probabilities to confusion pairs can fix substitution errors. Finally, adding out-of-grammar elements can fix false accepts and correct rejects.
Using N-Best lists that return multiple results with associated confidence levels can also provide more control in deciding between various interpretations of a captured utterance.
Using multiple interpretation results for disambiguation can also improve accuracy and the user experience.
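
As a rough illustration of the N-Best idea above, the following Python sketch decides what to do with a ranked list of recognition results. It is not the City CarShare implementation: the `Hypothesis` type, the `decide` function, the example utterances, and all three threshold values are assumptions made for this example, and real thresholds would come from tuning on a particular platform.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Hypothesis:
    text: str          # what the recognizer thinks the caller said
    confidence: float  # recognizer score, assumed to lie in [0.0, 1.0]


# Hypothetical thresholds, chosen only for illustration.
ACCEPT = 0.80   # at or above this, take the top result without confirming
REJECT = 0.40   # below this, treat the result as a no-match
MARGIN = 0.15   # results this close to the best one count as rivals


def decide(nbest: List[Hypothesis]) -> Tuple[str, List[str]]:
    """Return (action, candidates) for an N-Best list.

    action is one of "accept", "confirm", "disambiguate", or "reject".
    """
    ranked = sorted(nbest, key=lambda h: h.confidence, reverse=True)
    if not ranked or ranked[0].confidence < REJECT:
        return ("reject", [])

    best = ranked[0]
    # Collect every usable result that scores nearly as well as the best one.
    rivals = [
        h.text
        for h in ranked
        if h.confidence >= REJECT and best.confidence - h.confidence <= MARGIN
    ]
    if len(rivals) > 1:
        # Two or more plausible readings: ask the caller to choose.
        return ("disambiguate", rivals)
    if best.confidence >= ACCEPT:
        return ("accept", [best.text])
    # A single plausible reading, but not a confident one: confirm it.
    return ("confirm", [best.text])
```

With these assumed thresholds, two close results such as "Austin" at 0.62 and "Boston" at 0.58 would lead to a disambiguation prompt, while a lone result at 0.95 would be accepted outright.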

Technical Issues

Alongside the design challenge of creating a smooth and usable dialog flow, the City CarShare application posed technical challenges. We have all felt the pain of long downloads on the Internet. In many cases, the latency associated with large audio files and streaming media makes a web application frustrating to use. These issues are exacerbated on the telephone because of the limited feedback mechanisms inherent to audio interfaces.

To ensure that the time the caller is required to wait for the application is as short as possible, indigo egg used several techniques. For quick delivery of audio files, we set caching of all audio to fast and enabled prefetching as well. As many decisions as possible are made in the VoiceXML itself, rather than by calling a server-side routine. For example, embedded JavaScript translates the return value from the caller's date utterance into an audio file name, so the application can repeat the date back to the caller for verification.

For login, we use a subdialog instead of switching pages. This allows the dynamic content to be separated out from the main page, which can then be cached. The call to the server is no faster, but a faster return is possible because the application returns control to the original page rather than loading a new one. Also, the caller is always informed when a database lookup is taking place and when it is completed, so that they are never left hanging on the line.
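
The date-to-audio-file translation mentioned above was done in JavaScript embedded in the VoiceXML. The sketch below shows the same idea in Python, purely for illustration. The article does not give the format of the recognizer's return value or the naming of the audio files, so the `YYYYMMDD` input format, the `audio/dates` directory, and the `month_day.wav` naming scheme are all assumptions.

```python
from datetime import datetime

# Hypothetical location and naming scheme for the pre-recorded date prompts.
AUDIO_DIR = "audio/dates"
MONTHS = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]


def date_to_audio_file(value: str) -> str:
    """Map a date string (assumed to be YYYYMMDD) to a prompt file name.

    Raises ValueError if the string is not a valid calendar date, so the
    dialog can reprompt instead of playing a missing file.
    """
    parsed = datetime.strptime(value, "%Y%m%d")
    month = MONTHS[parsed.month - 1]
    return f"{AUDIO_DIR}/{month}_{parsed.day:02d}.wav"
```

Under these assumptions, `date_to_audio_file("20010517")` yields `audio/dates/may_17.wav`. Because the mapping is pure computation, it needs no round trip to the server, which is the point of making the decision in the page itself.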

The following general strategies for improving application performance apply to any implementation of VoiceXML. Note, however, that each vendor platform has its own idiosyncrasies when it comes to performance tuning.

Keep as many resources as close to the VoiceXML interpreter as possible - this avoids fetching resources across the Internet and the delays that can come with it.
Where possible, cache resources to prevent network access delays.
When using <submit>, use the GET method rather than POST if the result can be cached for later use (POST results generally expire at once).
Where possible, avoid extensive page transitions. One large document usually performs better than several small documents, because each additional document means another hit on the server. Likewise, transferring between forms in the current document will most likely be faster than transferring to another document.
For computational tasks, use JavaScript functions within the document instead of sending data to a server and waiting for a result.
The fast access and persistence of application root documents make them useful for storing variables and preserving the application's state while transferring between documents.
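
To make the GET-versus-POST point above concrete, here is a minimal Python sketch of a fetch cache that reuses GET results and always passes POST requests through to the server. It does not describe any particular VoiceXML platform: the `ResourceCache` class, its `request` method, the injected `fetch` callable, and the 300-second lifetime are hypothetical names and values chosen for this example.

```python
import time
from typing import Callable, Dict, Tuple


class ResourceCache:
    """Cache GET responses for a fixed lifetime; never cache other methods."""

    def __init__(
        self,
        fetch: Callable[[str, str], bytes],   # fetch(method, url) -> body
        max_age: float = 300.0,               # assumed lifetime in seconds
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._max_age = max_age
        self._clock = clock
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def request(self, method: str, url: str) -> bytes:
        if method.upper() != "GET":
            # POST results are treated as expiring at once: always refetch.
            return self._fetch(method, url)

        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and now - entry[0] < self._max_age:
            return entry[1]  # still fresh, no network access needed

        body = self._fetch(method, url)
        self._entries[url] = (now, body)
        return body
```

In this sketch, two GET requests for the same URL within the lifetime cost one server hit, while two POST requests always cost two.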

Results

The City CarShare reservation system has been deployed for only a short time, but so far user feedback has been excellent. Callers feel very comfortable with the system. User comments have included:

"Great system! Quick, … straightforward, and it even lets you screw up (enter in incorrect info) without much hassle!"
"It's frightening how much it's like talking to a real person."

Having commercially deployed a variety of different voice applications, including indigo egg's City CarShare application, BeVocal has had the opportunity to see the positive results that can be achieved with the VoiceXML standard as it exists today. While there are many challenges ahead in continuing to push the standard forward to provide more sophisticated functionality to developers, VoiceXML 1.0 can solve real-world problems leading to increased levels of service at lower costs. We hope the techniques set forth in this article will help you increase the usability and thus the use of your VoiceXML 1.0 applications.

Copyright © 2001 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the IEEE Industry Standards and Technology Organization (IEEE-ISTO).