VoiceXML Review - Feature Articles

Volume 1, Issue 6 - June 2001

Human Factors and Voice Applications

By Ed Halpern

(Continued from Part 1)

Compelling Applications

Another dimension of application quality, one that goes slightly beyond usability, is how compelling the application is. If it is not compelling, consumers may use it once or twice and then turn to alternatives they find more interesting or fun. Several things can make an application compelling: the personality and voice characteristics of the prompts, the use of sound effects or music, and variety within the application so that the user experience is not always the same.

Personality or Not?

This is a stylistic consideration that may make the application seem more interesting to a caller. It is not a substitute for a usable application. One can have a usable application that has a boring personality.

To complete a task, the user needs clear instructions about their options and a reasonable grammar. An interesting personality might make them more motivated to try, but after several calls the novelty can wear off. What makes an interesting demo may not make a successful application: a demo may be played with only once or twice, whereas a successful application will be used repeatedly over time.

Some other design decisions that should be made include:

Use text to speech (TTS) or stored speech: While it may be expeditious and frugal to write an application using TTS, many people still do not find TTS pleasant or acceptable when given the choice. On the other hand, some people prefer the cyber sound of TTS over stored speech because the application is indeed a machine and should sound like one or because they like the high tech quality a TTS voice gives to an application. Understanding the users and the usage context will guide this decision.
Prompt Personality: When using stored speech, questions invariably come up about the sex and personality of the speaker. The answers are of the "it depends" variety: it depends on the subject of the application, and it depends on the quality of that particular voice. Choosing a particular voice personality is more a matter of creative entertainment than of usability. The only comment I dare make here is that there is probably some interaction between the application content and an appropriate personality.
Prompting Style: The degree to which a prompt sounds conversational or command-like has significant implications for the grammars. A command-like prompt tells the caller explicitly what the command options are. A conversational style may be more open-ended, and the more open-ended the prompt, the more natural it may seem to the caller. But a conversational, open-ended prompt may also lead callers to feel they have more latitude in what they can say. When the caller engages in an open-ended dialogue, the risk increases that they will say something that is outside the grammar.

Application Complexity

Generally, the Keep it Simple, Stupid (KISS) model works well with speech applications.
Complex applications may provide more functionality than simple applications, but they might also cause usage problems. Two aspects of application complexity are:

How many user-system interactions are there?
How many different things can the person say during each user-system interaction?

The more times a user must interact with the application, the greater the chance that the user or the recognition engine will make an error. Error recovery is the toughest part of good user interface design. Users do not like it when they are not understood. When there are more interactions, the application takes longer to navigate, the task takes longer to complete, and the risk of errors increases.
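The compounding effect described above can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that every interaction succeeds with the same probability and that interactions are independent of each other; real dialogues will not behave this neatly. The function name and the 95% figure are hypothetical and do not come from any measured system.

```python
def error_free_probability(per_turn_accuracy: float, turns: int) -> float:
    """Chance that every turn in a dialogue succeeds.

    Assumes (for illustration only) that each turn succeeds with the same
    probability and that turns are independent of one another.
    """
    if not 0.0 <= per_turn_accuracy <= 1.0:
        raise ValueError("per_turn_accuracy must be between 0 and 1")
    if turns < 0:
        raise ValueError("turns must not be negative")
    return per_turn_accuracy ** turns


if __name__ == "__main__":
    # Hypothetical 95% per-turn success rate across dialogues of varying length.
    for turns in (1, 3, 5, 10):
        print(turns, round(error_free_probability(0.95, turns), 3))
```

Under these assumptions, a dialogue with ten interactions at 95% per-turn accuracy completes without any error only about 60% of the time, which is one way to see why removing a step can matter more than polishing a prompt.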

A second dimension of complexity is perplexity. As used here, perplexity refers to how many different things a person can say within a particular state. Allowing users to say many different things causes two problems. First, users may not remember all of the options, or they may recall a synonym instead of the actual command. Second, with a larger grammar, recognition accuracy may suffer because the probability increases that some utterances are acoustically similar to others and are therefore confusable by the recognizer.
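One rough way to spot potentially confusable entries before testing with a recognizer is to compare the commands in a state pairwise. The sketch below uses spelling similarity from Python's standard `difflib` module as a crude stand-in for acoustic similarity. That substitution is an assumption made for illustration: words that are spelled alike do not always sound alike, and the reverse is also true, so the output is only a list of pairs worth a closer look. The function name and the 0.75 threshold are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def confusable_pairs(phrases, threshold=0.75):
    """Return (phrase_a, phrase_b, score) for pairs with similar spellings.

    Spelling similarity is only a rough proxy for acoustic similarity;
    treat the result as candidates to review, not as a recognizer prediction.
    """
    pairs = []
    for a, b in combinations(phrases, 2):
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    return pairs


if __name__ == "__main__":
    # Hypothetical grammar for a single dialogue state.
    print(confusable_pairs(["forward", "forwards", "help"]))
```

A real check would compare phonetic transcriptions or, better, use confusion data collected from the recognizer itself.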

Other good design practices will make the application more usable. Using language that is familiar to the user will make the prompts easier to understand and will make the commands seem more intuitive and easier to recall. Using language consistently will have the same effect. And, as a general guideline, it is probably best not to include more than 6 or 7 options per step.

Errors

Imperfect recognition is what makes speech applications different from other types of applications. Because of this imperfection, much of the user interface design work focuses on error handling. User errors are utterances that the application does not anticipate and therefore will not recognize. Or worse, they might be misrecognized and substituted for a different command.

In a typical PC or telephone IVR application, there is a manageable set of possible inputs. For example, on the phone, the user can press any of the 12 keys on the keypad, and the key pressed is recognized by the system 100% of the time. The application can be designed to ignore inputs with no associated option. The range of possibilities is known.

In a speech application, however, the user is not confined by anything except that the input be audible. They may say anything. And will. Now the system can certainly ignore odd inputs, but if the aberrant input is similar in sound to something the system expects, then that aberrant input could be misinterpreted.

The other thing that is different when compared to a GUI is that the information needed by the user to help them figure out what to do comes one chunk at a time (serially). Only after one option has been played can the next one be offered. GUIs, on the other hand, can present many options or controls together (in parallel).

How do we know what people might do so that we can design it in? In many cases we can make good guesses, but we are not always right. That is the reason behind iterative design and testing. Letting actual users interact with the system will allow you to observe first hand what someone might say when given the prompt stimulus. If many of the users say things outside what is expected, the grammar can be extended to include those things, or the prompt can be re-written to stimulate different spoken utterances.
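The bookkeeping behind this kind of iteration can be very simple. The sketch below assumes that utterances from test sessions have been transcribed into plain strings and that the grammar for a given prompt can be listed as a flat set of phrases; both are simplifications, and the function name and the `min_count` parameter are hypothetical. It tallies what callers said that the grammar did not cover, so that frequent phrases can be considered for inclusion or the prompt can be reworded.

```python
from collections import Counter


def out_of_grammar_candidates(utterances, grammar, min_count=2):
    """Count transcribed utterances that are not covered by the grammar.

    Returns (phrase, count) pairs, most frequent first, for phrases heard
    at least `min_count` times. Matching is exact after trimming whitespace
    and lowercasing, which is a deliberate simplification.
    """
    covered = {phrase.strip().lower() for phrase in grammar}
    counts = Counter(
        said.strip().lower()
        for said in utterances
        if said.strip().lower() not in covered
    )
    return [(phrase, n) for phrase, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    # Hypothetical transcripts collected for one prompt during user testing.
    heard = ["check balance", "my balance", "My Balance", "transfer",
             "uh what", "my balance "]
    print(out_of_grammar_candidates(heard, ["check balance", "transfer"]))
```

A phrase that many test callers use is a candidate for the grammar. A long tail of one-off phrases is more often a sign that the prompt itself needs rewording.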

Error Handling

When an application error occurs, for example a misrecognition, a message like "an error has occurred" is not helpful. It is better to tell the caller what to do next, for example, "please repeat," "I didn't get that," or "please try again." These phrases are ambiguous with regard to the nature of the error and will work for both in-grammar and out-of-grammar recognition decisions.

In many cases it is a good idea to follow up that phrase with a list of available commands, in case there is confusion about what is allowable. Combined with the more generic phrase above, this is a graceful way to ask the caller to repeat, choosing from the list of available commands.

There may also be no detectable speech energy. It may be that the caller did not say anything, or it may be that the caller spoke too softly. The above phrases work in either case, but for the soft speaker they are less helpful than a prompt that says "I didn't hear that," which might stimulate a louder reply on the next try. If the caller never said anything, however, a prompt saying "I didn't hear that" might be irritating and provoke a reaction like "Of course you didn't hear that. I didn't say anything." The caller might even say that out loud.
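The prompt choices discussed in this section can be sketched as a small function. The event names `nomatch` and `noinput` are borrowed from VoiceXML's event names; everything else, including the function name, the exact wording, and the decision to always list the commands, is an assumption made for illustration and only one of several reasonable policies.

```python
from typing import Sequence


def recovery_prompt(event: str, commands: Sequence[str]) -> str:
    """Build a recovery prompt that tells the caller what to do next.

    "nomatch" means speech was detected but not recognized; "noinput" means
    no speech was detected. Any other event falls back to a generic phrase.
    """
    openings = {
        "nomatch": "I didn't get that.",
        "noinput": "I didn't hear that.",
    }
    opening = openings.get(event, "Please try again.")
    if not commands:
        return opening
    # Join the available commands into a short spoken list.
    if len(commands) == 1:
        options = commands[0]
    elif len(commands) == 2:
        options = f"{commands[0]} or {commands[1]}"
    else:
        options = ", ".join(commands[:-1]) + f", or {commands[-1]}"
    return f"{opening} You can say {options}."


if __name__ == "__main__":
    # Hypothetical command list for one dialogue state.
    print(recovery_prompt("nomatch", ["balance", "transfer", "operator"]))
```

Given the irritation risk noted above, a design could just as well use the neutral "Please try again" opening for `noinput` and reserve "I didn't hear that" for a repeated silence. User testing is the way to find out which works better.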

Parting Thoughts

Although speech is the most natural way for people to communicate, machine understanding differs dramatically from human understanding. Humans can easily understand poorly pronounced words from the context. Speech recognizers have more difficulty. Humans can ask clarification questions about meaning and word recognition. Speech recognizers have more difficulty. Humans can detect emotion, emphasis, and intonation. Recognizers cannot. To make an effective human-machine dialogue, effort must be made to experiment with the application and to make modifications based on human speech behavior. This will help to reduce the guesswork and to enhance the user experience.


Copyright © 2001 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the
IEEE Industry Standards and Technology Organization (IEEE-ISTO).