Volume 1, Issue 6 - June 2001

Turning GUIs into VUIs:
Dialog Design Principles for Making Web Applications Accessible By Telephone

By Bill Byrne

(Editor's Note: This article has links to example WAV files woven throughout it. The links to the WAV files are indicated with a symbol.)


There's no question that the compatibility of VoiceXML with the familiar and ubiquitous web infrastructure has greatly simplified the implementation of speech recognition applications. Instead of a PC, you can use a telephone; instead of HTML, you can use VoiceXML; instead of a web browser, you can use a "voice browser" (or VoiceXML interpreter). It follows, then, that a reasonable way to develop voice applications is to simply translate web pages into "voice pages", right?


While web and voice applications may share similar infrastructure, usability considerations for each type of application are significantly different. After all, people approach graphical user interfaces (GUIs) and voice user interfaces (VUIs) in fundamentally different ways, some obvious and some rather subtle. Yet the infrastructure parallels between web and voice applications, together with the well-established mental model we have of web-based applications, often cause developers to overlook these important differences. This, in turn, affects the usability and, ultimately, the success of many voice applications.

As we'll see below, good VUI design starts with a solid understanding of the most important differences between GUIs and VUIs and ends with the application of linguistic and social principles to the overall development effort.

Inherent Restrictions With Voice

GUIs are able to present a lot of information in parallel. Whether you're working in the consumer or enterprise space, screens in a web-based application typically offer pull-down menus, click boxes, tables, and audio, as well as pictures and icons to aid navigation. The user in turn can scan hundreds of items to get to the desired information in just a few minutes. This kind of interface inherently satisfies some of the basic principles of user interface design:

  1. User Control: Users (even novice users) don't need to wait for direction. They initiate or terminate each step at their own pace.
  2. Flexibility: The available options are constantly visible. A simple click on a new item at any time redirects the session immediately.
  3. Simplicity: In addition to the simplicity of "point and click", GUIs offer numerous ways to present information, e.g. pull-down menus, tables, pictures, text or audio.
  4. Predictability: The schemas employed in most web-based applications are familiar and consistent, which has given users a very clear mental model of how any GUI is likely to work.

In contrast, speech is sequential, so certain luxuries provided by GUIs just don't work in voice applications.

Since there's no way to present more than one piece of information simultaneously, things slow down considerably as users must carefully listen to various lists, dialog flow cues, and help prompts before they can proceed. Yet trying to present users with too much information in this way taxes short-term memory. For example, it's well known that most people can only remember between five and nine numbers for around twenty seconds after hearing them. Consequently, listening to long lists of choices is unreasonable, and purely hierarchical, menu-driven applications are exhausting. This is one of the major drawbacks of the thousands of touch-tone or "interactive voice response" (IVR) systems deployed today, in which callers use the keypad to choose from among the many items in a menu. [] And, if you think banner ads on a web page can be a little distracting, then being forced to listen to the same ads on a VUI will seem downright intrusive.
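
One way to respect the memory limit described above is to break a long menu into short groups and read only one group at a time. The following is a minimal sketch under stated assumptions, not a prescribed implementation: the group size of five, the MORE command, and the helper names are all illustrative choices that do not come from any particular platform.

```python
from typing import List

# Assumption: stay at the low end of the five-to-nine range mentioned above.
MAX_ITEMS_PER_PROMPT = 5


def chunk_menu(items: List[str], size: int = MAX_ITEMS_PER_PROMPT) -> List[List[str]]:
    """Split a long list of choices into short groups that are read one at a time."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def spoken_list(group: List[str]) -> str:
    """Join choices the way a person would say them aloud."""
    if len(group) <= 2:
        return " or ".join(group)
    return ", ".join(group[:-1]) + ", or " + group[-1]


def menu_prompt(group: List[str], has_more: bool) -> str:
    """Render one group as a prompt, offering MORE only when groups remain."""
    tail = " Or say MORE to hear other choices." if has_more else ""
    return f"You can say {spoken_list(group)}.{tail}"
```

With seven choices, the caller hears five of them plus the offer of MORE, and the remaining two only on request.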

In addition, with speech, users focus on a much narrower context, which is built up temporally. Consider, for example, the difference in list orders between e-mail viewed on a GUI vs. voice mail heard over the telephone. In the case of e-mail, the latest message header is displayed first, at the top of the page, since the user can easily scan the list below to see if there are older related messages. In contrast, there is no way to "scan" a list of voice mail messages over the phone. The oldest messages are played first so the user can understand the context in which the latest messages were recorded. This same first-in-first-out (FIFO) order has been adopted by most universal messaging voice applications and VUI e-mail readers.
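
The difference between the two orderings is easy to see in code. This sketch assumes a hypothetical Message record with only a sender and a received time; a real messaging system would carry far more, but the sort direction is the point.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class Message:
    sender: str          # hypothetical field, for illustration only
    received: datetime   # hypothetical field, for illustration only


def gui_order(messages: List[Message]) -> List[Message]:
    """Newest first: the user can scan down the screen for older messages."""
    return sorted(messages, key=lambda m: m.received, reverse=True)


def vui_order(messages: List[Message]) -> List[Message]:
    """Oldest first (FIFO): each message is heard with its context already built."""
    return sorted(messages, key=lambda m: m.received)
```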

To sum up so far, the sequential nature of speech means that VUIs are inherently more restrictive than GUIs. Far fewer choices can be explored with a VUI in a given amount of time compared to what can be scanned with a GUI. Furthermore, the burden on memory means VUI users are focused on a smaller set of choices and are tuned in to a narrower context. Without visual cues and a well-established mental model for VUIs, users have fewer ways to understand what choices are available to them. Without careful attention to design, these limitations can severely diminish system flexibility and user control.

General Usability for Voice

Despite the limitations noted above, well-designed voice applications have proved to be both engaging and effective. After all, a carefully developed VUI lets people interact conversationally with the application to do things that up until very recently required a desktop computer, a live operator, or a personal visit. To overcome the challenges described above, we can employ certain design techniques that restore user control and system flexibility and that take advantage of the user's intuitive knowledge of human discourse. In other words, a well-designed VUI puts users at ease, allowing them to interact in a way that is as easy as talking to a friend.

Since users inherently have less control and flexibility with VUIs as compared to GUIs, it's the designer's job to foster the perception of user control and flexibility. There are several general techniques we can use to accomplish this goal. Here are a few examples:

Tutorials and General Help

As mentioned above, VUI users aren't likely to have a clear mental model of how the application works, and they don't have the luxury of clicking through several pages in a few seconds or looking at a site map to orient themselves. Tutorials and context-sensitive help are an effective substitute. Tutorials are often played automatically for first-time users and can be accessed later with a command like "Play the tutorial". They outline how the application is organized and point out the availability of "global" commands such as GO BACK, HELP, PAUSE, START OVER, etc. They also give general tips for using speech recognition applications by explaining barge-in, the effect of ambient noise, etc. [] The always-available help command gives users a way to figure out what to say in unfamiliar dialog states, or in cases where outside distractions cause them to miss important prompts.
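
What makes a command "global" is simply that it is active in every dialog state, alongside whatever that state accepts locally. Here is a rough sketch of the idea, assuming the recognizer hands back plain text; the function names and the tuple it returns are illustrative, not part of any VoiceXML interpreter.

```python
from typing import Iterable, Set, Tuple

# The global commands named above, always active regardless of dialog state.
GLOBAL_COMMANDS: Set[str] = {"go back", "help", "pause", "start over"}


def active_phrases(state_phrases: Iterable[str]) -> Set[str]:
    """Everything the caller may say in a state: its own phrases plus the globals."""
    return {p.lower() for p in state_phrases} | GLOBAL_COMMANDS


def interpret(utterance: str, state_phrases: Iterable[str]) -> Tuple[str, str]:
    """Classify a recognized utterance as a global command, a local phrase, or a miss."""
    phrase = utterance.strip().lower()
    if phrase in GLOBAL_COMMANDS:
        return ("global", phrase)
    if phrase in {p.lower() for p in state_phrases}:
        return ("local", phrase)
    return ("nomatch", phrase)
```

Checking the globals first means HELP behaves the same way everywhere, which is what lets a tutorial promise that it is always available.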

Timely Help

Unfortunately, voice applications often inadvertently set traps. That is, users sometimes assume they can say something that turns out not to be "in-grammar". After trying several phrases that are met with the usual "I didn't understand that", the user hangs up out of frustration. The key is to foresee likely scenarios that don't necessarily follow the garden path. Consider a merchandise availability application that requires an item number as its input.

System: Tell me the four-digit item number or touch it in on the keypad.
User: Five one five two.
System: Five one five two. Is that right?
User: Yes.
System: OK. Here's the item matching the number you gave me. Megachip II. How many would you like?
User (confused): Megachip II? I'm looking for the ProPrinter 2000!
System: I didn't quite catch that. Tell me the number of units you'd like again or enter it on the keypad.

In this case the user may have simply entered and confirmed the wrong item number (maybe he misread someone's handwriting) and needs to return to the previous state. However, since the designer hasn't considered this scenario, the user will be severely penalized for making an honest mistake. If, on the other hand, the error prompt included a timely help phrase such as "If this isn't the item you're looking for, say START OVER", the user would have a way to continue in the dialog.
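
The fix can be sketched as a single turn handler for the quantity state. This is only an illustration under simplifying assumptions: the recognizer is assumed to return the quantity as a string of digits, and the state names and the quantity_step function are made up for the example.

```python
from typing import Tuple

# Hypothetical state names for the item-availability dialog shown above.
ASK_ITEM, ASK_QUANTITY, DONE = "ask_item", "ask_quantity", "done"


def quantity_step(utterance: str) -> Tuple[str, str]:
    """Handle one user turn in the quantity state; return (next_state, prompt)."""
    text = utterance.strip().lower()
    if text == "start over":
        # The escape hatch: go back to the state where the mistake was made.
        return ASK_ITEM, ("OK, let's start over. Tell me the four-digit item "
                          "number or touch it in on the keypad.")
    # Assumption: spoken numbers arrive from the recognizer as digit strings.
    if text.isdecimal() and int(text) > 0:
        return DONE, f"OK, {int(text)} units."
    # Anything else is out of grammar: re-prompt, but include the timely help.
    return ASK_QUANTITY, ("I didn't quite catch that. Tell me the number of units "
                          "you'd like again or enter it on the keypad. If this isn't "
                          "the item you're looking for, say START OVER.")
```

The confused caller from the dialog above now hears a way out in the very prompt that reports the error, instead of being left to guess.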

Framing Statements

Menus and lists often become much more manageable if the user simply knows how many items are listed and perhaps the kind of items contained in the list, which can be done with a short framing statement at the beginning. [] I know I can browse a list of ten items by phone much more easily than a list of 100.
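
A framing statement is easy to generate, since the application already knows how many items it is about to read. The sketch below is an assumption-laden example rather than a recommended wording: the "I found" phrasing, the numbered items, and both function names are invented for illustration.

```python
from typing import List


def framing_statement(singular: str, plural: str, count: int) -> str:
    """Tell the caller how many items are coming, and of what kind."""
    if count == 0:
        return f"I found no {plural}."
    if count == 1:
        return f"I found one {singular}."
    return f"I found {count} {plural}."


def framed_list_prompt(singular: str, plural: str, items: List[str]) -> str:
    """Put the framing statement in front of the list itself."""
    frame = framing_statement(singular, plural, len(items))
    if not items:
        return frame
    body = " ".join(f"Number {i}: {item}." for i, item in enumerate(items, 1))
    return f"{frame} {body}"
```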

Flexible Grammars

Most experienced designers know you can't guarantee that users will always repeat the commands explicitly requested of them by the application. For one thing, people often mimic the phrases and style that are used in the application. For example, in one recently released application, subscribers consistently began to say MOVE ON to get to the next domain even though they were never explicitly told to do so. As it turned out, the application repeatedly used the same phrase in a slightly different context. [] In other words, the application itself primed them for MOVE ON as opposed to, for example, NEXT DOMAIN or some other equally likely choice. In addition, a grammar developer can't expect to include all the reasonable phrases that users might utter in a VUI, and will have to study the utterances collected from the application logs to add new phrases as well as delete unused phrases.
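
The log study described above boils down to two counts: phrases callers say often that the grammar lacks, and phrases the grammar contains that nobody says. Here is a minimal sketch, assuming the logs are available as a list of transcribed utterances; the grammar_report function and its min_count threshold are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def grammar_report(grammar: Iterable[str],
                   logged_utterances: Iterable[str],
                   min_count: int = 2) -> Tuple[List[str], List[str]]:
    """Return (phrases worth adding, phrases never used) for one dialog state."""
    counts = Counter(u.strip().lower() for u in logged_utterances)
    in_grammar = {p.lower() for p in grammar}
    # Out-of-grammar phrases heard at least min_count times are candidates to add.
    to_add = sorted(p for p, n in counts.items()
                    if p not in in_grammar and n >= min_count)
    # In-grammar phrases that never appear in the logs are candidates to delete.
    unused = sorted(p for p in in_grammar if counts[p] == 0)
    return to_add, unused
```

Run against logs like those from the application above, MOVE ON would surface as a phrase to add, while a never-spoken NEXT DOMAIN would be flagged as unused. A person should still review both lists before changing the grammar.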


Latency

Silence during a phone call often leads users to believe there may be a problem, and many will hang up. In fact, people's tolerance of "dead air" on the telephone is limited to two or three seconds, so reducing latency in VUIs is crucial. Yet there are other situations in which adding latency (or a pause) to the system improves the experience. Consider the following prompts. [] Without the pause, it sounds as if the system knew ahead of time that there was no information. So, we think to ourselves, "Why didn't it just say that?" To correct the problem, a short pause was added to all prompts that were played as a result of this type of exception. []

Considering the Caller's Environment

Interacting with a computer screen is basically the same whether the user is indoors or outdoors, standing or sitting, in an airport or in the office. In contrast, talking on the phone can be a very different experience depending on where the caller is or what she might be doing. VUIs designed for in-vehicle use are especially challenging, not only because of the ambient noise that will compromise recognition, but because the driver's first priority is to concentrate on the road. A thorough discussion of in-vehicle VUI design techniques goes beyond the scope of this article, but suffice it to say that this type of VUI would do well to add a "pause" command to the system, as well as limit the number of states in which the user is asked open-ended questions such as "What can I do for you?" After all, asking users to remember what their choices are distracts them from their primary objective of driving the vehicle.




Copyright © 2001 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the IEEE Industry Standards and Technology Organization (IEEE-ISTO).