Volume 1, Issue 7 - July 2001
Testing VoiceXML Applications

By Bryan Michael and Mukund Bhagavan

(Continued from Part 1)

Testing Usability

Usability testing, a critical part of testing VoiceXML applications, is an area many new VoiceXML developers are unfamiliar with. Its purpose is to detect potential problems that were not anticipated during the design review. Usability testing continues even after launch, since the data generated by real users is invaluable for understanding how they interact with the application.

Pre-launch usability testing involves assembling a small group of testers (about 10) and having each person work through scenarios that represent the variety of situations real users are likely to encounter. You then analyze the behavioral data recorded from these interactions and recommend solutions to dialogue designers and engineers.

Post-launch testing involves gathering data from live use of the service and identifying problems users experience in real situations. Given the large pool of data available, you may need to be selective about which interactions to focus on. User interactions with the longest task completion times, or those requiring the most loops to capture a single input, are good starting points.

There are various metrics that can measure how usable an application is:

  • Task completion time: How long does it take a user to complete a task? How long does it take to get driving directions in a driving directions application? How long does it take to get the status of an order in an order status application?

  • User experience to complete a task (goal completion): Were users confused by certain prompts? Did they have to go through several loops to get to the information they needed?

  • Out-of-grammar utterances: Anything the user says that is not in the grammar. For example, did the user say 'stop' when that wasn't a valid utterance? A sketch of how to capture these appears below.

Usability testing is an ongoing cycle that captures user data to continually measure and improve the user experience.
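
To illustrate capturing out-of-grammar utterances, here is a minimal sketch, assuming a VoiceXML 1.0 platform; the grammar and the /log.cgi endpoint are hypothetical stand-ins. The <nomatch> handler fires whenever the caller says something the grammar does not cover, so counting those events and submitting the count to a logging script gives the usability team a record of how often callers go out of grammar.

  <?xml version="1.0"?>
  <vxml version="1.0">
    <form id="nav">
      <!-- counts utterances the grammar did not cover -->
      <var name="oog_count" expr="0"/>
      <field name="command">
        <grammar type="application/x-jsgf">
          next | previous | last
        </grammar>
        <prompt>Say next, previous, or last.</prompt>
        <nomatch>
          <!-- the caller said something out of grammar -->
          <assign name="oog_count" expr="oog_count + 1"/>
          Sorry, I didn't get that.
          <reprompt/>
        </nomatch>
        <filled>
          <!-- report the count to a hypothetical logging endpoint -->
          <submit next="/log.cgi" namelist="command oog_count"/>
        </filled>
      </field>
    </form>
  </vxml>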

Performance Testing

Before a web site is launched, it goes through rigorous load and stress tests to ensure that it can meet its service-level requirements. For example, the number of concurrent users, expected response time, and site availability are all important performance metrics. Similarly, phone-based voice applications need defined service-level requirements and must be tested to ensure they meet these minimums.

Considerations when writing performance tests for voice applications

There is a key distinction between running web applications and phone-based voice applications. Where a web site's session capacity is a soft limit that degrades gradually under load, most phone applications have a hard upper limit on how many calls they can receive: the number of ports. A port can serve one call at any given time, so if the call capacity for a given application is 46 ports, no more than 46 callers can be served simultaneously. In effect, the number of ports becomes the maximum throughput of a voice application. Statistical models such as Erlang B can estimate the number of ports your application needs based on the frequency of calls and the average call length.
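
For reference, the Erlang B formula gives the probability B that an arriving call finds all m ports busy (and is therefore blocked), given an offered load of E = λh erlangs, where λ is the call arrival rate and h is the average call length:

  B(E, m) = \frac{E^m / m!}{\sum_{k=0}^{m} E^k / k!}

You then choose the smallest m that keeps B below your target blocking rate. For example, 100 calls per hour averaging 3 minutes each is an offered load of E = 5 erlangs, and 10 ports would give a blocking probability of roughly 1.8%.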

When running performance tests on voice applications, a call generator tool is needed to simulate the required load. A call generator requires access to an infrastructure with the capacity to place many simultaneous calls. The tool also needs to measure the responsiveness of the application, such as speech and DTMF recognition and TTS playback. One way to do this is to have the application play a DTMF tone at the end of every time slice to be measured, and have the call generator register the length of each slice (a sketch follows). If you don't have access to such a testing tool, the load can be generated manually by having multiple people call into the application; however, this process does not scale well and is prone to human error when measuring responsiveness.
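
Here is a minimal sketch of the marker technique, assuming a VoiceXML 1.0 platform; the tone files are hypothetical recordings of DTMF digits. The call generator times the interval between the two marker tones, which here brackets a TTS playback, but the same bracketing works for any time slice of interest.

  <?xml version="1.0"?>
  <vxml version="1.0">
    <form id="slice">
      <block>
        <!-- marker tone: start of the measured time slice -->
        <audio src="tones/start.wav"/>
        <!-- the TTS playback being measured -->
        <prompt>Your order shipped on July ninth.</prompt>
        <!-- marker tone: end of the measured time slice -->
        <audio src="tones/end.wav"/>
        <exit/>
      </block>
    </form>
  </vxml>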

Strategies for performance testing

The purpose of running load tests is to ensure that all services continue to function under certain minimum requirements even at peak usage loads. Bottlenecks can occur at various points in a VoiceXML application. Below we discuss how to test and measure the performance of various components in a VoiceXML application.

  • Fetching VoiceXML resources such as documents, external grammars and audio files: Content is served by web servers and back-end infrastructure such as application servers and databases. There are two levels of testing that can be performed on these components. A web site testing tool, such as LoadRunner, or even a simple HTTP client that can make requests and measure response time, can generate enough load to test responsiveness. However, in order to stress test network connectivity between the VoiceXML gateway and the web server, a call-in test is necessary. Generally, a voice application tends to stress a web server much less than a corresponding web application, and with content caching of audio files at the VoiceXML gateway (the sketch after this list shows the relevant fetch attributes), the web server only needs to handle requests for VoiceXML documents. It is therefore unlikely that the web server will be a bottleneck in a voice application.

  • Execution of a VoiceXML document: This tests the interpreter's ability to scale and to share cached resources as the number of calls increases. A VoiceXML script with common tags such as <field>, <transfer>, <subdialog>, and <audio> is a good baseline to start with (see the sketch after this list). Other features to test are the interpreter's ability to parse large VoiceXML documents, compile dynamic grammars efficiently, and cache where appropriate.

  • Recognition accuracy and response time: These can be tested by playing recorded utterances from the call generator and listening to the application's response. For example, if the valid inputs at a particular prompt are "next," "previous" and "last," the call generator would play an audio file for "next," and the application would respond with a distinct touchtone sequence: say, "111" when it recognizes "next," "112" when it recognizes "previous," and "113" when it recognizes "last" (the sketch after this list illustrates this mapping). The call generator then recognizes what the application played and registers the result as a correct recognition or a misrecognition.

  • Text-to-speech (TTS) synthesis engine quality and response time: TTS quality can be evaluated by generating multiple calls through the call generator, recording all TTS playbacks, and measuring the time it takes the TTS engine to play back the text, for instance with the DTMF marker technique described above.
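
As a concrete illustration of several of the points above, here is a minimal baseline test document, assuming a VoiceXML 1.0 platform; the grammar, audio files, and tone recordings (tones/111.wav and so on) are hypothetical stand-ins. The <audio> element uses fetchhint and caching attributes so the gateway can prefetch and cache static audio, the <field> exercises recognition, and the <filled> block plays a distinct tone recording per result so the call generator can score accuracy.

  <?xml version="1.0"?>
  <vxml version="1.0">
    <form id="baseline">
      <block>
        <!-- static audio; let the gateway prefetch and cache it -->
        <audio src="audio/welcome.wav" fetchhint="prefetch" caching="fast">
          Welcome to the load test.
        </audio>
      </block>
      <field name="nav">
        <grammar type="application/x-jsgf">
          next | previous | last
        </grammar>
        <prompt>Say next, previous, or last.</prompt>
        <filled>
          <!-- play a distinct tone recording per result so the
               call generator can score recognition accuracy -->
          <if cond="nav == 'next'">
            <audio src="tones/111.wav"/>
          <elseif cond="nav == 'previous'"/>
            <audio src="tones/112.wav"/>
          <else/>
            <audio src="tones/113.wav"/>
          </if>
          <!-- clear the field so the form loops and keeps the port busy -->
          <clear namelist="nav"/>
        </filled>
      </field>
    </form>
  </vxml>

A fuller baseline would also exercise <transfer> and <subdialog>, which this sketch omits for brevity.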

Once you have performed the above tests individually to identify bottlenecks, you will gradually need to run more complex tests that reflect the general functionality of your application.


Copyright © 2001 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the
IEEE Industry Standards and Technology Organization (IEEE-ISTO).