Volume 1, Issue 6 - June 2001
   
   
 

Turning GUIs into VUIs:
Dialog Design Principles for Making Web Applications Accessible By Telephone

By Bill Byrne

(Continued from Part 1)

Social Factors

Recent research has shown that people tend to treat computers, television, and other new media as real people, whether they're interacting with an animated figure on a computer screen or a computer-generated voice on a telephone [1]. As a result, the voice featured on a VUI, even if it's synthetic, is actually perceived as an individual with a unique personality. This has several important consequences. First, before dialog flows or prompt wording can be decided on, designers must understand "who" is talking and carefully develop the character who will be featured in the application. How friendly, efficient, casual, chatty, young, humorous, experienced, or forgiving is he or she? The answers to these questions depend on the type of application and the company behind it. Think about the difference between a stock broker and a music store clerk, or a major bank and a major Hollywood studio, or an application that gives you traffic updates and one that lets you change the percentage of your 401(k) plan. Second, the personality must be consistent throughout the application. It doesn't make sense for the application to seem warm and forgiving in one state and then cold and impatient in the next. While people will grow to like a personality different from their own, no one gets used to someone whose personality is unpredictable. Finally, it's crucial to find a voice actor talented enough to play this role consistently and a director who can ensure that the character originally developed for the application is the one that ends up being portrayed in the dialog.

Linguistic Factors

If VUIs are inherently social, it follows that the language they use should be as close to naturally occurring spoken discourse as possible. However, while everyone can tell what sounds natural and what doesn't when they hear it, spoken discourse is much more complicated than most people realize, and replicating it in a voice application requires a certain amount of linguistic expertise. Let me give a few examples of linguistic principles that play an important role in VUI design, some general, some more detailed.

Speaking "Correctly"

Most of us were taught from an early age (either directly or indirectly) that there is a "correct" and an "incorrect" way to speak and write, and that if we drift too far from the "correct" way, we might as well hide our heads in shame. Perhaps you'll remember some of the old favorites still found in English grammar books (and grammar checkers): Don't leave your prepositions dangling. You mustn't split infinitives. It's "the woman whom I love", not "who I love". You can't start a sentence with "but". But as it turns out, many of these rules come from an eighteenth-century fad in which scholars tried to force the structure of English to be more like Latin, while others have no basis at all [2]. What's more, the language we use and expect to hear in our everyday conversations with friends, neighbors, bank tellers, stock brokers, store clerks, and human resource managers has never followed the rules of standard written grammar. It has its own rules and patterns, which have evolved naturally over hundreds of years and which every speaker intuitively follows from a very young age.

The problem is that the pressure to be "correct" causes many prompt writers to produce overly formal or even stilted-sounding applications, as shown in the following prompts.

Odd: [] Better: []

Odd: [] Better: []

Sometimes the clients themselves make the requests. In one extreme case the clients were so worried about sounding "correct" that they banned the use of contractions in the entire application. Needless to say, the result was odd at best. In other cases, jargon can creep into prompts, especially phrases related to speech recognition. Take the following prompt, for example: [ ]

Concerned that callers might not realize they could use their own voice to interact with the system, this prompt writer decided to make it clear by using "speak your response". But this phrase is technical jargon typically used by engineers to describe speech recognition input, and it doesn't fit an application directed at the general public.

In general, VUI designers need to understand how spoken discourse works in order to give users a quality experience. Otherwise users are asked to interact conversationally with a system that doesn't sound at all conversational.

Information Structure and Word Order

Information structure refers to the way "old" or presupposed information and "new" or asserted information are reflected in sentence structure [3]. In English, new information typically comes at the end of the sentence while old information comes at the beginning. For example, if I asked you, "Why did you hit that guy?" your answer might be "I hit that guy because he insulted me". However, in this context you certainly wouldn't say "Because that guy insulted me I hit him." This is because "I hit that guy" is now the old or presupposed information and should come in the first part of the sentence, while "because he insulted me" is the new or asserted information and should come at the end. The importance of information structure is especially clear in help prompts. Suppose the user needs to know what phrase he should use in order for the system to play the rest of a message. Since the phrase he's looking for constitutes the "new" information, it should come at the end of the help prompt. [] Putting the phrase at the beginning in this context doesn't conform to English information structure. []
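The help prompt principle can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not wording from a real application: the function names, the goal text, and the command phrase "keep going" are all hypothetical. The template puts the caller's goal (old information) first and the command phrase (new information) last, and a small check flags prompts that lead with the command instead.

```python
def help_prompt(goal: str, command: str) -> str:
    """Build a help prompt: goal (old information) first, command (new) last."""
    return f"To {goal}, say '{command}'."


def command_comes_last(prompt: str, command: str) -> bool:
    """Return True when only punctuation follows the command phrase."""
    trimmed = prompt.rstrip(" .!?'\"")
    return trimmed.lower().endswith(command.lower())


# Hypothetical goal and command wording, chosen only for illustration.
good = help_prompt("hear the rest of this message", "keep going")
odd = "Say 'keep going' to hear the rest of this message."

print(good)
print(command_comes_last(good, "keep going"))
print(command_comes_last(odd, "keep going"))
```

A check like this could run over a prompt list before recording, so that help prompts that open with the command are caught while rewording is still cheap.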

Phonetics and Phonology

Even the simplest voice application typically involves a lot of prompt concatenation. And while a good ear is indispensable, a clear understanding of intonation patterns, stress, and the way people pronounce conversational language helps to make the prompt boundaries disappear when you hear the application in real time. In addition, knowledge of these patterns makes it easier for designers to adjust the grammars for better recognition. The following pair of prompts shows just how important it is to pay attention to phonetics and phonology in VUI design.

Odd: [] Better: []

Future Directions

Language is a dynamic and collaborative process. That is, in any given conversation there's no way to plan what we're going to say next until we've heard the other person's contribution. As the conversation progresses, its participants in turn modify what they say and how they say it to accommodate the growing pool of shared knowledge [4, 5]. In other words, they get to know each other. However, very few voice applications today have any built-in mechanisms for adapting to users. As a result, people are often annoyed with the repetitive nature of even the best applications after only a few weeks. While a thorough discussion of ways to make VUIs more dynamic goes beyond the scope of this article, let me say that we can begin to mimic this behavior by simply keeping track of certain user events and then responding with different prompts and dialog flows accordingly. Some of these events include the number of times logged in, domains visited, time elapsed since last login, error rates, and changes in user preferences.
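As a rough sketch of what tracking user events might look like, the Python below picks a greeting from three of the events listed above: number of logins, time since last login, and error rate. Everything specific here is an assumption made for illustration: the UserHistory fields, the thresholds (30 days, a 30 percent error rate, 10 logins), and the prompt wording are hypothetical and would need tuning against real callers.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class UserHistory:
    login_count: int = 0                    # completed logins so far
    last_login: Optional[datetime] = None   # None for a first-time caller
    error_rate: float = 0.0                 # share of recent turns with errors


# Hypothetical thresholds, not taken from any study.
LONG_ABSENCE = timedelta(days=30)
HIGH_ERROR_RATE = 0.3
EXPERIENCED_LOGINS = 10


def choose_greeting(history: UserHistory, now: datetime) -> str:
    """Pick a greeting that reflects what the system knows about the caller."""
    if history.login_count == 0 or history.last_login is None:
        # New caller: introduce the options in full.
        return "Welcome. You can say 'messages', 'news', or 'weather'."
    if now - history.last_login > LONG_ABSENCE:
        # Returning after a long gap: offer a short refresher.
        return "Welcome back. As a reminder, you can say 'messages', 'news', or 'weather'."
    if history.error_rate >= HIGH_ERROR_RATE:
        # Caller has been struggling: point to help early.
        return "Welcome back. If you get stuck, just say 'help'. What would you like?"
    if history.login_count >= EXPERIENCED_LOGINS:
        # Frequent caller: keep it short.
        return "Hi again. What would you like?"
    return "Welcome back. What would you like to do?"


now = datetime(2001, 6, 15)
print(choose_greeting(UserHistory(), now))
print(choose_greeting(UserHistory(25, now - timedelta(days=1), 0.05), now))
```

The order of the checks is itself a design choice: here a long absence or a high error rate outranks experience, on the assumption that a frequent caller who is struggling still benefits from the longer prompt.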

Conclusions

VoiceXML has simplified the deployment of voice applications and has given dialog developers an easier way to implement their designs. However, the principles of dialog design have not changed. An application's usability depends on how well its designers can ensure system flexibility and user control and on how well they understand the linguistic and social principles that affect the users' perception of the voice or "character" being portrayed. Finally, it's important to remember that VUIs cannot completely replace web applications. Rather, they are best when used to enhance them.

References

1. Reeves, B. & Nass, C. (1996). The media equation. New York: Cambridge.
2. Pinker, S. (1994). The language instinct. New York: W. Morrow and Co.
3. Lambrecht, K. (1994). Information structure and sentence form: topic, focus, and the mental representations of discourse referents. New York: Cambridge.
4. Karttunen, L. & Peters, S. (1975). Conventional implicature in Montague grammar. Berkeley Linguistics Society, 1, 266-278.
5. Clark, H.H. (1992). Arenas of language use. Chicago: University of Chicago Press.


Copyright © 2001 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the
IEEE Industry Standards and Technology Organization (IEEE-ISTO).