GUIs into VUIs:
Dialog Design Principles for Making Web Applications
Accessible By Telephone
Note: This article has links to example WAV files woven
throughout it.
There's no question that the compatibility of VoiceXML with
the familiar and ubiquitous web infrastructure has greatly
simplified the implementation of speech recognition
applications. Instead of a PC, you can use a telephone;
instead of HTML, you can use VoiceXML; instead of a
web browser, you can use a "voice browser"
(or VoiceXML interpreter). It follows, then, that a
reasonable way to develop voice applications is to simply
translate web pages into "voice pages", right?
While web and voice applications may share similar infrastructure,
usability considerations for each type of application
are significantly different. After all, people approach
graphical user interfaces (GUIs) and voice user interfaces
(VUIs) in fundamentally different ways, some obvious
and some rather subtle. Yet infrastructure parallels
between web and voice applications together with the
well-established mental model we have of web-based applications
often cause developers to overlook these important differences.
This, in turn, affects the usability and, ultimately,
the success of many voice applications.
As we'll see below, good VUI design starts with a solid
understanding of the most important differences between
GUIs and VUIs and ends with the application of linguistic
and social principles to the overall development effort.
Restrictions With Voice
GUIs are able to present a lot of information in parallel.
Whether you're working in the consumer or enterprise
space, screens in a web-based application typically
offer pull-down menus, click boxes, tables, audio, as
well as pictures and icons to aid navigation. The user
in turn can scan hundreds of items to get to the desired
information in just a few minutes. This kind of interface
inherently satisfies some of the basic principles of
user interface design:
Users (even novice users) don't need to wait for direction.
They initiate and/or terminate each step at their own pace.
The available options are constantly visible. A simple
click on a new item at any time redirects the session.
In addition to the simplicity of "point and click",
GUIs offer numerous ways to present information, e.g.
pull-down menus, tables, pictures, text or audio.
The schemas employed in most web-based applications
are familiar and consistent, which has given users
a very clear mental model of how any GUI is likely to work.
In contrast, speech is sequential, so certain luxuries
provided by GUIs just don't work in voice applications.
Since there's no way to present more than one piece of information
simultaneously, things slow down considerably as users
must carefully listen to various lists, dialog flow
cues, and help prompts before they can proceed. Yet
trying to present users with too much information in
this way taxes short-term memory. For example, it's
well known that most people can only remember between
five and nine numbers for around twenty seconds after
hearing them. Consequently, listening to long lists
of choices is unreasonable, and purely hierarchical,
menu-driven applications are exhausting. This is one
of the major drawbacks of the thousands of touch-tone
or "interactive voice response" (IVR) systems
deployed today in which callers use the keypad to choose
from among the many items in a menu. 
And, if you think banner ads on a web page can be a
little distracting, then being forced to listen to the
same ads on a VUI will seem downright intrusive.
In addition, with speech, users focus on a much narrower
context, which is built up temporally. Consider, for
example, the difference in list orders between e-mail
viewed on a GUI vs. voice mail heard over the telephone.
In the case of e-mail, the latest message header is
displayed first, at the top of the page, since the user
can easily scan the list below to see if there are older
related messages. In contrast, there is no way to "scan"
a list of voice mail messages over the phone. The oldest
messages are played first so the user can understand
the context in which the latest messages were recorded.
This same first-in-first-out (FIFO) order has been adopted
by most universal messaging voice applications and VUI
To sum up so far, the sequential nature of speech means
that VUIs are inherently more restrictive than GUIs.
Far fewer choices can be explored with a VUI in a given
amount of time compared to what can be scanned with
a GUI. Furthermore, the burden on memory means VUI users
are focused on a smaller set of choices and are tuned
in to a narrower context. Without visual cues and a
well-established mental model for VUIs, users have fewer
ways to understand what choices are available to them.
Without careful attention to design, these limitations
can severely diminish system flexibility and user control.
Usability for Voice
Despite the limitations noted above, well-designed voice applications
have proved to be both engaging and effective. After
all, a carefully developed VUI lets people interact
conversationally with the application to do things that
up until very recently required a desktop computer,
a live operator, or a personal visit. To overcome the
challenges described above, we can employ certain design
techniques that restore user control and system flexibility
and take advantage of the user's intuitive knowledge
of human discourse. In other words, a well-designed
VUI puts users at ease, allowing them to interact in
a way that is as easy as talking to a friend.
Since users inherently have less control and flexibility with
VUIs than with GUIs, it's the designer's job to
foster the perception of user control and flexibility.
There are several general techniques we can use to accomplish
this goal. Here are a few examples:
and General Help
As mentioned above, VUI users aren't likely to have a clear
mental model of how the application works, and they don't
have the luxury of clicking through several pages in
a few seconds or looking at a site map to orient themselves.
Tutorials and context-sensitive help are an effective
substitute. Tutorials are often played automatically
for first-time users and can be accessed later with
a command like "Play the tutorial". They outline
how the application is organized and point out the availability
of "global" commands such as GO BACK, HELP,
PAUSE, START OVER, etc. They also give general tips
for using speech recognition applications by explaining
barge-in, the effect of ambient noise, etc. 
The always-available HELP command gives users a
way to figure out what to say in unfamiliar dialog states,
or in cases where outside distractions cause them to
miss important prompts.
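One way to picture "global" commands is as a small set of phrases that is checked before the grammar of whatever dialog state the caller happens to be in. The following is only a minimal sketch under that assumption. The function name `interpret`, the tuple it returns, and the use of plain text in place of real recognizer output are all hypothetical and are not taken from any particular VoiceXML platform.

```python
# Hypothetical sketch: global commands are matched first, so they are
# available in every dialog state regardless of the state's own grammar.
GLOBAL_COMMANDS = {"go back", "help", "pause", "start over"}


def interpret(utterance, state_grammar):
    """Classify an utterance as a global command, an in-grammar phrase, or a no-match."""
    # Normalize case and whitespace; a real recognizer would hand back a grammar result.
    phrase = " ".join(utterance.lower().split())
    if phrase in GLOBAL_COMMANDS:
        return ("global", phrase)
    if phrase in state_grammar:
        return ("state", phrase)
    return ("nomatch", phrase)


if __name__ == "__main__":
    quantity_grammar = {"one", "two", "three"}  # assumed grammar for one state
    print(interpret("Start  Over", quantity_grammar))
    print(interpret("two", quantity_grammar))
    print(interpret("what can I say", quantity_grammar))
```

Because the global set is consulted first, a caller who says HELP or START OVER is never met with a no-match response, whichever state is active.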
Voice applications often inadvertently set traps. That
is, users sometimes assume they can say something that
turns out not to be "in-grammar". After trying
several phrases that are met with the usual "I
didn't understand that", the user hangs up out
of frustration. The key is to foresee likely scenarios
that don't necessarily follow the garden path. Consider
a merchandise availability application that requires
an item number as its input.
System: Tell me the four-digit item number or touch it in on the keypad.
User: Five one five two.
System: Five one five two. Is that right?
User: Yes.
System: OK. Here's the item matching the number you gave me. Megachip II. How many would you like?
User (confused): Megachip II? I'm looking for the ProPrinter 2000!
System: I didn't quite catch that. Tell me the number of units you'd like again, or enter it on the keypad.
In this case the user may have simply entered and confirmed
the wrong item number (maybe he misread someone's handwriting)
and needs to return to the previous state. However,
since the designer hasn't considered this scenario, the
user will be severely penalized for making an honest
mistake. If, on the other hand, the error prompt included
a timely help phrase such as "If this isn't the
item you're looking for, say START OVER", the user
would have a way to continue in the dialog.
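The recovery path described here can be sketched as a handler for the quantity state that always offers the escape phrase in its error prompt and honors it when spoken. This is an illustrative sketch only. The state names, the `handle_quantity_input` function, and the assumption that the recognizer delivers quantities as digit strings are all hypothetical.

```python
# Hypothetical handler for the "how many would you like?" state.
START_OVER_HINT = "If this isn't the item you're looking for, say START OVER."


def handle_quantity_input(utterance):
    """Return (next_state, prompt) for one caller utterance in the quantity state."""
    phrase = utterance.strip().lower()
    if phrase == "start over":
        # Escape hatch: send the caller back to the item-number state.
        return ("ask_item_number", "OK, let's start over. Tell me the four-digit item number.")
    if phrase.isdigit():
        # Assumes the recognizer has already converted spoken numbers to digits.
        return ("confirm_quantity", f"{phrase} units. Is that right?")
    # Out-of-grammar input: re-prompt and include the timely help phrase.
    return ("ask_quantity", "I didn't quite catch that. " + START_OVER_HINT)


if __name__ == "__main__":
    print(handle_quantity_input("Megachip II? I'm looking for the ProPrinter 2000!"))
    print(handle_quantity_input("START OVER"))
```

With the hint in the error prompt, the confused caller in the example above hears how to get back to the item-number question instead of being stuck in the quantity state.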
Menus and lists often become much more manageable if the user
simply knows how many items are listed and perhaps the
kind of items contained in the list, which can be done
with a short framing statement at the beginning. 
I know I can browse a list of ten items by phone much
more easily than a list of 100.
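A framing statement of this kind is easy to generate from the list itself. The sketch below is a hypothetical helper, `frame_list`, that announces how many items are coming and what kind they are before reading them out. The wording of the prompt and the naive "add an s" pluralization are assumptions made for illustration.

```python
def frame_list(kind, items):
    """Build a spoken list prompt that opens with the item count and item kind."""
    count = len(items)
    if count == 0:
        return f"You have no {kind}s."
    # Naive pluralization, good enough for a sketch.
    noun = kind if count == 1 else f"{kind}s"
    return f"You have {count} {noun}: " + ", ".join(items) + "."


if __name__ == "__main__":
    print(frame_list("appointment", ["dentist at nine", "lunch at noon"]))
```

Hearing the count up front lets the caller decide whether to listen to the whole list or to ask for something more specific.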
Experienced designers know you can't guarantee that
users will always repeat the commands explicitly requested
of them by the application. For one, people often mimic
the phrases and style that are used in the application.
For example, in one recently released application, subscribers
consistently began to say MOVE ON to get to the next
domain even though they were never explicitly told to
do so. As it turned out, the application repeatedly
used the same phrase in a slightly different context.
In other words, the application itself primed them for
MOVE ON as opposed to, for example, NEXT DOMAIN or some
other equally likely choice. In addition, a grammar
developer can't expect to include all the reasonable
phrases that users might utter in a VUI, and will have
to study the utterances collected from the application
logs to add new phrases as well as delete unused phrases.
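The log review described above can be approximated with a simple frequency count. The following is a rough sketch, assuming the application logs are available as a plain list of transcribed utterances and the grammar as a set of lowercase phrases. The function name `review_grammar` and the `min_count` threshold are hypothetical.

```python
from collections import Counter


def review_grammar(grammar, logged_utterances, min_count=2):
    """Suggest phrases to add to, and remove from, a grammar based on logged utterances."""
    # Normalize case and whitespace so "Move on" and "move  on" count together.
    counts = Counter(" ".join(u.lower().split()) for u in logged_utterances)
    # Frequent out-of-grammar phrases are candidates for addition.
    to_add = sorted(p for p, n in counts.items() if p not in grammar and n >= min_count)
    # Grammar phrases that nobody ever said are candidates for deletion.
    to_remove = sorted(p for p in grammar if counts[p] == 0)
    return to_add, to_remove


if __name__ == "__main__":
    grammar = {"next domain", "go back"}
    log = ["move on", "Move on", "go back", "uh what"]
    print(review_grammar(grammar, log))
```

In this example the callers' own phrase MOVE ON surfaces as a candidate to add, while NEXT DOMAIN, which no one said, surfaces as a candidate to delete. A designer would still review both lists by hand before changing the grammar.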
Silence during a phone call often leads users to believe there
may be a problem, and many will hang up. In fact, people's
tolerance of "dead air" on the telephone is
limited to two or three seconds, so reducing latency
in VUIs is crucial. Yet there are other situations in
which adding latency (or a pause) to the system improves
the experience. Consider the following prompts. 
Without the pause, it sounds like the system already
knew there was no information ahead of time. So, we
think to ourselves, "Why didn't it just say that?"
To correct the problem, a short pause was added to all
prompts that were played as a result of this type of
the Caller's Environment
Interacting with a computer screen is basically the same whether
the user is indoors or outdoors, standing or sitting,
in an airport or in the office. In contrast, talking
on the phone can be a very different experience depending
on where the caller is or what she might be doing. VUIs
designed for in-vehicle use are especially challenging,
not only because of the ambient noise that will compromise
recognition, but because the driver's first priority
is to concentrate on the road. A thorough discussion
of in-vehicle VUI design techniques goes beyond the
scope of this article, but suffice it to say that this
type of VUI would do well to add a "pause"
command to the system, as well as limit the number of
states in which the user is asked open-ended questions
such as "What can I do for you?". After all,
asking users to remember what their choices are distracts
them from their primary objective of driving the vehicle.
© 2001 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the
Industry Standards and Technology Organization