Testing VoiceXML Applications (Continued from Part 1)
Testing Usability
Usability testing, a critical part of testing VoiceXML applications, is an area many new VoiceXML developers are unfamiliar with. Its purpose is to detect potential problems that were not anticipated during the design review. Usability testing is a process that continues even after launch, since the data generated by real users is invaluable for understanding user interactions.
Pre-launch usability testing involves getting a small group of people together (about 10) and having each person test scenarios that represent a variety of situations real users are likely to experience. You then analyze the behavioral data recorded from the actual interactions and recommend solutions to dialogue designers and engineers.
Post-launch testing involves gathering data from use of the service and identifying problems experienced in real-world situations. Given the large pool of data available, you might need to be selective about which interactions to focus on. User interactions with the longest task completion times or the most loops to capture a single input are good starting points.
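As a sketch of this selection step, the snippet below ranks call records by completion time, breaking ties by reprompt loops. The record fields are assumptions for illustration, not a real log format.

```python
# Hypothetical call-log records: the field names are assumptions.
def worst_interactions(calls, top_n=3):
    """Return the calls with the longest task completion time,
    breaking ties by the number of reprompt loops."""
    return sorted(
        calls,
        key=lambda c: (c["completion_secs"], c["reprompt_loops"]),
        reverse=True,
    )[:top_n]

calls = [
    {"id": "c1", "completion_secs": 42, "reprompt_loops": 0},
    {"id": "c2", "completion_secs": 180, "reprompt_loops": 4},
    {"id": "c3", "completion_secs": 65, "reprompt_loops": 2},
]
print([c["id"] for c in worst_interactions(calls, top_n=2)])  # → ['c2', 'c3']
```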
There are various metrics that can measure how usable an application is:
- Task completion time: How long does it take a user to complete a task? How long does it take to get driving directions in a driving directions application? How long does it take to get the status of an order in an order status application?
- User experience to complete a task (goal completion): Were users confused by certain prompts? Did users have to go through several loops to get to the information they needed?
- Out-of-grammar utterances: Anything the user says that is not in the grammar. For example, did the user say 'stop' when that wasn't a valid utterance?
Usability testing is an ongoing cycle that captures user data to continually measure and improve the user experience.
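The three metrics above can be computed from session logs. A minimal sketch, assuming hypothetical per-session fields (not a standard log format):

```python
# Minimal sketch: the session fields below are assumed for illustration.
def usability_metrics(sessions):
    n = len(sessions)
    total_utts = sum(s["utterances"] for s in sessions)
    return {
        # Task completion time, averaged across sessions.
        "avg_completion_secs": sum(s["completion_secs"] for s in sessions) / n,
        # Goal completion: share of sessions needing more than one loop.
        "multi_loop_rate": sum(1 for s in sessions if s["loops"] > 1) / n,
        # Out-of-grammar rate across all recorded utterances.
        "oog_rate": sum(s["oog_utterances"] for s in sessions) / total_utts,
    }

sessions = [
    {"completion_secs": 30, "loops": 1, "utterances": 4, "oog_utterances": 0},
    {"completion_secs": 90, "loops": 3, "utterances": 6, "oog_utterances": 2},
]
print(usability_metrics(sessions))
# → {'avg_completion_secs': 60.0, 'multi_loop_rate': 0.5, 'oog_rate': 0.2}
```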
Performance Testing
Before a web site is launched, it has to go through rigorous load and stress tests to ensure that it can live up to certain service requirements. For example, the number of concurrent users, expected response time, and site availability are all important performance metrics. Similarly, voice applications for the phone need to set service-level requirements and be tested to ensure that they meet these minimum requirements.
Considerations when writing performance tests for voice applications
There is a key distinction between running web applications and voice applications over the phone. While the hard limit on how many sessions a web site can serve is fuzzy, most phone applications have a firm upper limit on how many calls they can receive: the number of ports the application is provisioned with. A port can serve one call at any given time, so if the call capacity of a given application is 46 ports, no more than 46 callers can be served simultaneously. In practice, the number of ports often becomes the maximum throughput of a voice application. Statistical models such as the Erlang B queuing model can estimate the number of ports your application needs based on the frequency of calls and the average call length.
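The Erlang B model can be sketched in a few lines using its standard recurrence; the traffic figures below are illustrative only, not from the article.

```python
def erlang_b(traffic_erlangs, ports):
    """Blocking probability via the numerically stable recurrence
    B(E, 0) = 1;  B(E, m) = E*B(E, m-1) / (m + E*B(E, m-1))."""
    b = 1.0
    for m in range(1, ports + 1):
        b = traffic_erlangs * b / (m + traffic_erlangs * b)
    return b

def ports_needed(calls_per_hour, avg_call_minutes, target_blocking=0.01):
    """Smallest port count that keeps blocking at or below the target."""
    traffic = calls_per_hour * avg_call_minutes / 60.0  # offered load, erlangs
    ports = 1
    while erlang_b(traffic, ports) > target_blocking:
        ports += 1
    return ports

# Illustrative figures: 600 calls/hour averaging 3 minutes ≈ 30 erlangs.
print(ports_needed(600, 3, target_blocking=0.01))
```

The recurrence avoids computing factorials directly, so it stays accurate even for large port counts.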
When running performance tests on voice applications, a call generator tool is needed to simulate the required testing load. A call generator requires access to an infrastructure with the capacity to make multiple simultaneous calls. The call generator tool also needs to be able to measure the responsiveness of the application, such as speech and DTMF recognition and TTS playback. One way to do this is to have the application play a DTMF tone at the end of every time slice to be measured, and register the length of each of those time slices. If you don't have access to such a testing tool, the load can be generated manually by having multiple people call into the application. However, this process does not scale well and is prone to human error when measuring responsiveness.
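The time-slice idea can be sketched as follows, assuming the call generator logs each DTMF marker with a timestamp relative to call start (a made-up log shape):

```python
def slice_durations(events):
    """events: ordered (dtmf_marker, seconds_since_call_start) pairs.
    Returns the length of each measured time slice."""
    durations = {}
    prev_t = 0.0
    for marker, t in events:
        durations[marker] = round(t - prev_t, 3)
        prev_t = t
    return durations

# e.g. tone "1" marks the end of the greeting, "2" the end of recognition, ...
events = [("1", 2.5), ("2", 6.0), ("3", 7.2)]
print(slice_durations(events))  # → {'1': 2.5, '2': 3.5, '3': 1.2}
```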
Strategies for performance testing
The purpose of running load tests is to ensure that all services continue to function under certain minimum requirements even at peak usage loads. Bottlenecks can occur at various points in a VoiceXML application. Below we discuss how to test and measure the performance of various components in a VoiceXML application.
- Fetching VoiceXML resources such as documents, external grammars and audio files: Web servers and any back-end infrastructure, such as application servers and databases, serve this content. There are two levels of testing that can be performed on these components. A web site testing tool such as LoadRunner, or even a simple HTTP client that can make requests and measure response time, can generate enough load to test responsiveness. However, in order to stress test the network connectivity between the VoiceXML gateway and the web server, a call-in test is necessary. Generally, a voice application tends to stress a web server much less than a corresponding web application, and with content caching of audio files at the VoiceXML gateway, the web server only needs to handle requests for VoiceXML documents. It is therefore unlikely that the web server will be a bottleneck in a voice application.
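For the first level of testing, a small harness in the style below can drive concurrent fetches and report response times. The URL is a placeholder and the fetch function is pluggable (e.g. `urllib.request.urlopen` in real use); here it is stubbed with a sleep.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def timed_fetch(fetch, url):
    """Time a single fetch of the given URL."""
    start = time.perf_counter()
    fetch(url)
    return time.perf_counter() - start

def load_test(fetch, url, concurrency=10, requests_total=100):
    """Issue requests_total fetches with the given concurrency and
    summarize response times."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        times = list(pool.map(lambda _: timed_fetch(fetch, url),
                              range(requests_total)))
    times.sort()
    return {
        "mean_secs": sum(times) / len(times),
        "p95_secs": times[int(0.95 * (len(times) - 1))],
    }

# Stubbed fetch; the URL is a placeholder, not a real endpoint.
stats = load_test(lambda url: time.sleep(0.01),
                  "http://voice.example/app.vxml",
                  concurrency=5, requests_total=20)
print(stats["mean_secs"] >= 0.01)  # → True
```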
- Execution of a VoiceXML document: Executing VoiceXML documents tests the interpreter's ability to scale and share cached resources as the number of calls increases. A VoiceXML script with common tags such as <field>, <transfer>, <subdialog>, and <audio> is a good baseline to start with. Other features to test are the interpreter's ability to parse large VoiceXML documents, compile dynamic grammars efficiently and use caching where appropriate.
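To exercise parsing of large documents, one can generate a synthetic VoiceXML file of arbitrary size. A sketch follows; the tag mix mirrors the baseline above, and the audio, grammar, and destination names are placeholders.

```python
def large_vxml(num_fields):
    """Build a synthetic VoiceXML document containing the common baseline
    tags plus num_fields fields, for interpreter parsing/scaling tests."""
    fields = "\n".join(
        f'    <field name="f{i}">\n'
        f'      <prompt><audio src="q{i}.wav"/></prompt>\n'
        f'      <grammar src="q{i}.grxml"/>\n'
        f'    </field>'
        for i in range(num_fields)
    )
    return (
        '<?xml version="1.0"?>\n'
        '<vxml version="2.0">\n'
        '  <form id="stress">\n'
        f'{fields}\n'
        '    <subdialog name="lookup" src="lookup.vxml"/>\n'
        '    <transfer name="agent" dest="tel:+15550100"/>\n'
        '  </form>\n'
        '</vxml>\n'
    )

doc = large_vxml(500)
print(doc.count("<field"))  # → 500
```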
- Recognition accuracy and response time: Recognition accuracy and response time can be tested by playing recorded messages through a call simulator and listening to the response of the application. For example, if the valid inputs at a particular prompt are "next," "previous" and "last," the call simulator would play an audio file for "next," and the application would respond with a particular touchtone sequence: say, "111" when it recognizes "next," "112" when it recognizes "previous" and "113" when it recognizes "last." The call generator would then recognize what the application played and register it as an accurate recognition or a misrecognition.
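The scoring side of this DTMF-echo scheme can be sketched as below, using the mapping from the example; the trial data is made up.

```python
# DTMF codes the application is expected to echo, per the example above.
EXPECTED_DTMF = {"next": "111", "previous": "112", "last": "113"}

def score_recognition(trials):
    """trials: (utterance_played, dtmf_heard) pairs.
    Returns the fraction of trials recognized correctly."""
    correct = sum(1 for utt, dtmf in trials if EXPECTED_DTMF.get(utt) == dtmf)
    return correct / len(trials)

# Made-up trial log: one "last" utterance was misrecognized as "next".
trials = [("next", "111"), ("previous", "112"), ("last", "111"), ("next", "111")]
print(score_recognition(trials))  # → 0.75
```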
- Text-to-speech (TTS) synthesis engine quality and response time: TTS quality can be evaluated by generating multiple calls through the call generator, recording all TTS playbacks, and measuring the time it takes the TTS engine to play back the text.
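Once per-call TTS playback latencies are gathered, they can be summarized against a service-level target; the 500 ms target and latency figures below are assumptions, not values from the article.

```python
def tts_latency_report(latencies_ms, target_ms=500):
    """Summarize measured TTS playback latencies and flag whether the
    worst case meets an assumed service-level target."""
    worst = max(latencies_ms)
    return {
        "avg_ms": sum(latencies_ms) / len(latencies_ms),
        "worst_ms": worst,
        "meets_target": worst <= target_ms,
    }

print(tts_latency_report([120, 340, 290, 610]))
# → {'avg_ms': 340.0, 'worst_ms': 610, 'meets_target': False}
```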
Once you have performed the above tests individually to identify any bottlenecks, you will gradually need to run more complicated tests that reflect the general functionality used in your application.
Copyright © 2001 VoiceXML Forum. All rights reserved. The VoiceXML Forum is a program of the IEEE Industry Standards and Technology Organization (IEEE-ISTO).