Human Factors and Voice Applications
(Continued from Part 1)
Compelling Applications
Another dimension of application quality that goes
slightly beyond usability is how compelling the
application is. If the application is not compelling,
consumers may use it once or twice and then turn to
alternative methods that they find more interesting or
fun. Some of the things that make applications compelling
are the personality and voice characteristics of the
prompts. Another is the use of sound effects or music.
Another is variety within the application so that the
user experience is not always the same.
Personality or Not?
This is a stylistic consideration that may make the
application seem more interesting to a caller. It is not
a substitute for a usable application. One can have a
usable application that has a boring personality.
To complete a task the user needs clear instructions
about their options and a reasonable grammar. But an
interesting personality might make them more motivated
to try. After several calls, however, the novelty of the
personality can wear off. What makes an interesting demo
may not make a successful application. A demo may only be
played with once or twice. A successful application will
be used repeatedly over time.
Some other design decisions that should be made include:
- Use text to speech (TTS) or stored speech: While
it may be expeditious and frugal to write an application
using TTS, many people still do not find TTS pleasant
or acceptable when given the choice. On the other
hand, some people prefer the cyber sound of TTS over
stored speech, either because the application is indeed
a machine and should sound like one or because they
like the high-tech quality a TTS voice gives to an
application. Understanding the users and the usage
context will guide this decision.
- Prompt Personality: When using stored speech,
questions invariably come up about the sex and personality
of the speaker. There are only "it depends" answers
to such questions. It depends on the subject of the
application. It depends on the quality of that particular
voice.
Selecting a particular voice personality is more a matter
of creative entertainment than of usability. The
only comment I dare make here is that there is probably
some interaction between the application content and
an appropriate personality.
- Prompting Style: The degree to which a prompt
sounds conversational or command-like has significant
implications for the grammars. A command-like prompt
is explicit to the caller about what the command options
are. A conversational style may be more open-ended.
The more open-ended the prompt, the more natural it
may seem to the caller. But when a prompt is more
conversational and open-ended, the caller may feel
they have more latitude in what they can say. When
the caller engages in an open-ended dialogue, the risk
increases that they will say something that is outside
the grammar.
Application Complexity
Generally, the Keep it Simple, Stupid (KISS) model works
well with speech applications.
Complex applications may provide more functionality
than simple applications, but they might also cause
usage problems. Two aspects of application complexity
are:
- How many user-system interactions are there?
- How many different things can the person say during
each user-system interaction?
The more times a user must interact with the application,
the greater the chance that the user or the recognition
engine will make an error. Error recovery is the toughest
part of good user interface design. Users do not like
it when they are not understood. When there are more
interactions, the application takes longer to navigate,
the task takes longer to complete, and the risk of errors
increases.
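
As a rough illustration of how the risk grows with the
number of interactions, the sketch below assumes that every
interaction succeeds independently and with the same
probability. That assumption, the function name, and the
accuracy figure are all hypothetical and chosen only to make
the arithmetic visible; real callers and recognizers will
not behave this neatly.

```python
def task_success_probability(per_step_accuracy: float, steps: int) -> float:
    """Chance that all `steps` interactions are handled correctly.

    Assumes, for illustration only, that each interaction succeeds
    independently with the same probability.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


if __name__ == "__main__":
    # 0.95 is a made-up per-interaction accuracy, not a measured value.
    for steps in (1, 3, 5, 10):
        print(steps, round(task_success_probability(0.95, steps), 3))
```

Under these assumptions, each added interaction multiplies
the chance of an error-free call by a number below one,
which is one way to see why shorter dialogues tend to be
more robust.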
A second dimension of complexity is perplexity. Perplexity
refers to how many different things a person can say
within a particular state. High perplexity causes two
problems. First, when an application allows users to say
many different things, they may not remember them all, or
they may recall a synonym instead of the expected word.
Second, with a larger grammar, the recognition accuracy
may suffer because the probability increases that some
utterances are acoustically similar to other utterances,
and thus confusable by the recognizer.
Other good design practices will make the application
usable. Using language that is familiar to the user will
make the prompts easier to understand, and will make the
commands seem more intuitive and easier to recall. Using
language consistently will have the same effect. And,
as a general guideline, it is probably best not to include
more than 6 or 7 options per step.
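
The two guidelines above (keep each step to about 6 or 7
options, and watch for phrases the recognizer could confuse)
can be turned into a quick design-time check. The sketch
below is only an approximation: it compares spellings with
Python's difflib, which is a crude stand-in for acoustic
similarity, and the function name and threshold are
hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations

MAX_OPTIONS_PER_STEP = 7  # upper end of the "6 or 7" guideline in the text


def review_state_grammar(phrases, similarity_threshold=0.8):
    """Return warnings for one dialogue state's list of accepted phrases.

    Spelling similarity is only a rough proxy for acoustic similarity;
    a real check would compare pronunciations, which this sketch does not.
    """
    warnings = []
    if len(phrases) > MAX_OPTIONS_PER_STEP:
        warnings.append(
            f"{len(phrases)} options exceeds the guideline of "
            f"{MAX_OPTIONS_PER_STEP}"
        )
    # Compare every pair of phrases and flag the ones spelled alike.
    for first, second in combinations(phrases, 2):
        ratio = SequenceMatcher(None, first.lower(), second.lower()).ratio()
        if ratio >= similarity_threshold:
            warnings.append(
                f"'{first}' and '{second}' look confusable ({ratio:.2f})"
            )
    return warnings
```

A designer might run such a check on each state's phrase
list before user testing, treating the warnings as prompts
for a closer look, not as verdicts.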
Errors
The issue of imperfect recognition is what makes speech
applications different from other types of applications.
Because of this imperfection, much of the user interface
design work focuses on error handling. User errors
are utterances that the application did not anticipate
and therefore will not recognize. Or worse, they
might be misrecognized and substituted for
a different command.
In a typical PC or telephone IVR application, there is
a manageable set of possible inputs. For example, on
the phone, the user can press any of the 12 keys on
the keypad, and the key pressed is recognized by
the system 100% of the time. The application can be
designed to ignore inputs with no associated option.
The range of possibilities is known.
In a speech application, however, the user is not confined
by anything except that the input be audible. They may
say anything. And will. Now the system can certainly
ignore odd inputs, but if the aberrant input is similar
in sound to something the system expects, then that
aberrant input could be misinterpreted.
Another difference from a GUI is that the information
the user needs to figure out what to do comes one chunk
at a time (serially). Only after one option has been
played can the next one be offered. GUIs, on the other
hand, can present many options or controls together
(in parallel).
How do we know what people might do so that we can design
for it? In many cases we can make good guesses, but we
are not always right. That is the reason behind iterative
design and testing. Letting actual users interact with
the system will allow you to observe firsthand what
someone might say when given the prompt stimulus. If
many of the users say things outside what is expected,
the grammar can be extended to include those things,
or the prompt can be rewritten to stimulate different
spoken utterances.
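
One way to act on such test sessions is to tally what
callers actually said against the current grammar. The
sketch below assumes the test utterances have already been
transcribed by hand into plain strings; the function name,
the matching rule (exact match after lowercasing), and the
sample phrases are hypothetical.

```python
from collections import Counter


def out_of_grammar_candidates(transcribed_utterances, grammar, min_count=2):
    """Count utterances that callers said but the grammar does not cover.

    Returns (utterance, count) pairs, most frequent first, for utterances
    heard at least `min_count` times.
    """
    accepted = {phrase.strip().lower() for phrase in grammar}
    counts = Counter(
        utterance.strip().lower()
        for utterance in transcribed_utterances
        if utterance.strip().lower() not in accepted
    )
    return [(text, n) for text, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    heard = ["check balance", "my balance", "My balance", "operator"]
    print(out_of_grammar_candidates(heard, ["check balance", "transfer funds"]))
```

Frequent entries in the result are candidates either for
adding to the grammar or for rewording the prompt, which is
the choice the paragraph above describes.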
Error Handling
When an application error occurs, for example a
misrecognition, a message like "an error has occurred" is
not helpful. It is better to tell the caller what to
do next, for example, "please repeat" or "I didn't get
that", or "please try again."
These phrases are ambiguous with regard to the nature
of the error and so will work for both in-grammar and
out-of-grammar recognition decisions.
In many cases it is a good idea to follow up that phrase
with a list of available commands, in case there is
confusion about what is allowable. When combined with the
more generic phrase above, it is a graceful way to ask the
caller to repeat from the following list of available
commands.
Sometimes no speech energy is detected. It may be that the
caller did not say anything. Or, it may be that the
caller spoke too softly. The above phrases work for
either case, but they are less helpful for the soft
speaker than a prompt that says "I didn't hear that,"
which might stimulate a louder reply on the next try.
If the caller never did say anything, a prompt saying
"I didn't hear that" might be irritating and
provoke a reaction like "of course you didn't hear
that. I didn't say anything." The caller might
even say that out loud.
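
The prompt choices discussed in this section can be
sketched as a small selection routine. Everything named
here is hypothetical: the two outcome labels, the
`hint_at_volume` switch, and the "You can say" wording are
illustrative, and the sketch leaves it to the designer to
decide whether "I didn't hear that" is worth the risk of
irritating a caller who said nothing.

```python
from enum import Enum


class RecognitionOutcome(Enum):
    NO_MATCH = "no_match"  # speech was heard but not matched to the grammar
    NO_INPUT = "no_input"  # no speech energy was detected


def spoken_list(commands):
    """Join commands the way a prompt would read them aloud."""
    if len(commands) <= 2:
        return " or ".join(commands)
    return ", ".join(commands[:-1]) + ", or " + commands[-1]


def build_reprompt(outcome, commands, hint_at_volume=False):
    """Pick an opening phrase, then append the available commands."""
    if outcome is RecognitionOutcome.NO_INPUT and hint_at_volume:
        # May prompt a louder reply, but can annoy a silent caller.
        opening = "I didn't hear that."
    elif outcome is RecognitionOutcome.NO_MATCH:
        opening = "I didn't get that."
    else:
        # Generic phrase that does not commit to the cause of the error.
        opening = "Please try again."
    if not commands:
        return opening
    return f"{opening} You can say {spoken_list(commands)}."
```

For example, `build_reprompt(RecognitionOutcome.NO_MATCH,
["balance", "transfer", "operator"])` produces a generic
apology followed by the three commands, which is the
"repeat from the following list" pattern described above.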
Parting Thoughts
Although speech is the most natural way for people to
communicate, machine understanding differs dramatically
from human understanding. Humans can easily understand
poorly pronounced words from the context. Speech
recognizers have more difficulty. Humans can ask
clarification questions about meaning and word
recognition. Speech recognizers have more difficulty.
Humans can detect emotion, emphasis, and intonation.
Recognizers cannot. To make an effective
human-machine dialogue, effort must be made to experiment
with the application and to make modifications based
on human speech behavior. This will help to reduce the
guesswork and to enhance the user experience.
Copyright © 2001 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the IEEE Industry
Standards and Technology Organization (IEEE-ISTO).