VoiceXML Review - Feature Articles

Volume 1, Issue 7 - July 2001

Testing VoiceXML Applications

By Bryan Michael and Mukund Bhagavan

Testing VoiceXML applications includes four major components:

Testing the expected application flow or logic
Testing the recognition accuracy
Performing usability testing
Performance testing

This article will discuss the key considerations related to VoiceXML application testing as well as strategies and tactics used to speed testing in each of these areas. Once tested applications are ready for commercial deployment, many VoiceXML developers choose to outsource hosting to a third party. This article will also discuss important VoiceXML performance testing considerations for large-scale commercial deployments of VoiceXML applications. By examining various factors that affect performance and discussing strategies to increase performance, developers can be confident their applications will provide the best possible experience to end-users.

Testing Application Flow

Today many VoiceXML developers perform application logic testing and voice recognition testing simultaneously. These two components are in fact independent and should be treated as such. The most efficient way to test VoiceXML applications is to decouple application flow testing from voice recognition testing. VoiceXML applications describe a dialog flow in which a user transitions from one state to the next via prompts and responses. The dialog flow can be represented visually using state diagrams. For example, the following state diagram represents an email reader written in VoiceXML:

Figure 1: State Diagram for Email Reader

To test the application flow, developers have historically dialed into their application and interacted with it to check that the logic behaves consistently with the design. This method can be quite tedious because of the serial nature of an audio interface, and when problems with recognition accuracy occur, testing application flow becomes even more laborious. Today many tools are available for testing application flow without using voice, audio, or a telephone. Interacting with your VoiceXML application via text commands can dramatically speed application flow testing. Once the application flow behaves correctly in all tested scenarios, developers can move on to testing recognition accuracy.
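
As a rough illustration of text-driven flow testing, the sketch below models a dialog as a table of states and transitions and feeds it typed commands instead of speech. The state names and commands are hypothetical (they are not taken from Figure 1), and the code does not represent any particular vendor's tool.

```python
# Hypothetical dialog flow for an email reader: state -> {typed command: next state}.
# State and command names are illustrative assumptions, not taken from Figure 1.
FLOW = {
    "main_menu": {"read mail": "read_message", "goodbye": "exit"},
    "read_message": {"next": "read_message", "delete": "confirm_delete",
                     "main menu": "main_menu"},
    "confirm_delete": {"yes": "read_message", "no": "read_message"},
}


def run_dialog(flow, start, commands):
    """Feed typed commands to the flow and return the list of states visited."""
    state = start
    visited = [state]
    for command in commands:
        transitions = flow.get(state, {})
        if command not in transitions:
            raise ValueError(f"no transition for {command!r} in state {state!r}")
        state = transitions[command]
        visited.append(state)
    return visited


def unreachable_states(flow, start):
    """Return states defined in the flow that no command sequence can reach."""
    seen, stack = {start}, [start]
    while stack:
        for target in flow.get(stack.pop(), {}).values():
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return set(flow) - seen


if __name__ == "__main__":
    path = run_dialog(FLOW, "main_menu",
                      ["read mail", "next", "delete", "yes", "main menu", "goodbye"])
    print(" -> ".join(path))
    print("unreachable:", unreachable_states(FLOW, "main_menu"))
```

Because the whole exchange is text, a scenario that takes minutes over the telephone can be replayed in a fraction of a second, and every path through the state diagram can be scripted once and rerun after each change.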

Testing Recognition Accuracy

Testing recognition accuracy, particularly for large grammars, can also be tedious and time-consuming. The process generally involves the following steps:

Data collection
Transcription of data
Tuning of grammars

Collecting data involves correlating audio and recognition logs of user interactions with your application. Generally the larger the sample data size, the more thorough developers can be in testing recognition accuracy. Diversity of data is also important so that you can incorporate a variety of voice samples for each different scenario.

Human transcription is one of the most important aspects of testing recognition accuracy because it is the only way to objectively ascertain recognition performance. Transcription of data can begin during the data collection phase. Usually all three processes run in parallel for a period of time so that what is learned in each process feeds back into the others. For example, if one particular recognition sequence continues to cause problems, adjustments can be made to accommodate the sequence, and the scenario can be retested quickly in later test rounds.
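
The feedback loop described above depends on pairing each recognition result with its human transcription. The following is a minimal sketch, assuming a hypothetical log layout in which every utterance has an ID, the recognizer's result, and (once transcribed) the words actually spoken. The IDs and sample data are invented for illustration.

```python
from collections import Counter

# Hypothetical records; the field layout is an assumption for illustration.
recognition_log = {          # utterance id -> recognizer result
    "u1": "market street",
    "u2": "next",
    "u3": "market street",
    "u4": "last one",
}
transcriptions = {           # utterance id -> human transcription
    "u1": "marquette street",
    "u2": "next",
    "u3": "marquette street",
    "u4": "last one",
}


def score(recognition_log, transcriptions):
    """Return (accuracy, Counter of (spoken, recognized) mismatches)."""
    confusions = Counter()
    correct = total = 0
    for utt_id, spoken in transcriptions.items():
        if utt_id not in recognition_log:
            continue  # transcribed but never logged; skip rather than guess
        total += 1
        recognized = recognition_log[utt_id]
        if recognized == spoken:
            correct += 1
        else:
            confusions[(spoken, recognized)] += 1
    accuracy = correct / total if total else 0.0
    return accuracy, confusions


accuracy, confusions = score(recognition_log, transcriptions)
print(f"accuracy: {accuracy:.0%}")
for (spoken, recognized), count in confusions.most_common():
    print(f"{count}x  said {spoken!r}, recognized {recognized!r}")
```

Sorting the mismatches by frequency points tuning effort at the sequences that cause the most trouble first.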

Below are the most common types of recognition errors:

Out-of-grammar utterances - This error occurs when users say things outside the grammar, and it is often important in testing usability and dialog design. For example, if the user is navigating a browsable list whose grammar contains commands such as "previous," "next," "first one," and "last one," and the user says "go back," which is not in the grammar, an out-of-grammar error will occur.

Substitution errors - Occurs when words are confused because they sound similar. For example, "Marquette street" and "Market street" sound very similar and could produce a substitution error if the grammar is not properly tuned.

Insertion errors - Occurs when the recognition result returns extraneous words appended to the original utterance. An example would be "Airport Way" as a returned result of the utterance "airport."

Deletion errors - Occurs when the recognition result returns an abbreviated version of the original utterance. An example would be "two three" as a result from the utterance "nine two three."

False Accepts - The utterance is not in the grammar and should have been rejected, but the recognizer accepted it.

False Rejects - The utterance is in the grammar and should have been recognized, but the recognizer rejected it.
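
The error types listed above can be counted automatically once transcriptions are available. The sketch below is one possible approach, not a standard scoring tool: it aligns the transcribed and recognized word sequences with Python's difflib (a convenient approximation rather than a guaranteed minimum-edit alignment) and labels accept/reject outcomes from two flags. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def word_errors(spoken, recognized):
    """Count word-level substitutions, insertions and deletions.

    `spoken` is the human transcription, `recognized` the recognizer result.
    difflib's alignment is an approximation, not a guaranteed
    minimum-edit-distance alignment.
    """
    ref, hyp = spoken.lower().split(), recognized.lower().split()
    counts = {"substitution": 0, "insertion": 0, "deletion": 0}
    matcher = SequenceMatcher(None, ref, hyp, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        n_ref, n_hyp = i2 - i1, j2 - j1
        if tag == "replace":
            # Paired words are substitutions; any surplus on either side
            # is counted as deletions or insertions.
            counts["substitution"] += min(n_ref, n_hyp)
            counts["deletion"] += max(n_ref - n_hyp, 0)
            counts["insertion"] += max(n_hyp - n_ref, 0)
        elif tag == "delete":
            counts["deletion"] += n_ref
        elif tag == "insert":
            counts["insertion"] += n_hyp
    return counts


def accept_outcome(in_grammar, accepted):
    """Label an utterance by grammar membership and accept/reject decision.

    This says nothing about whether an accepted result had the right words.
    """
    if in_grammar:
        return "correct accept" if accepted else "false reject"
    return "false accept" if accepted else "correct reject"


print(word_errors("marquette street", "market street"))
print(word_errors("airport", "airport way"))
print(word_errors("nine two three", "two three"))
print(accept_outcome(in_grammar=False, accepted=True))
```

The three calls to word_errors reuse the substitution, insertion, and deletion examples from the list above.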

Based on the types of errors above, there are several ways to tune an application to improve recognition accuracy. Below are the types of tuning developers can leverage to increase recognition accuracy:

Phonetic tuning - Words in a given language can be modeled by translating the sound of the words into a sequence of phonemes. The English language has roughly 40 different phonemes. Here are some examples of Computer Phonetic Alphabet representations:

bevocal b I v o k * l
menu m E n j u
stock s t O k
quotes k w o t s
weather w E D *r
directions d *r E k S * n z

By altering the phonetic dictionary, developers can increase recognition rates for some sequences. Other examples of phonetic tuning include adding alternate pronunciations and adding crossword phonetic modeling. Here is an example of adding crossword phonetic modeling:

palo_alto p a l o a l t o
palo_alto p a l o w a l t o
palo_alto p a l a l t o
palo_alto p A l a l t o
palo_alto p a l w a l t o
palo_alto p A l w a l t o
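
The six entries above can be read as combinations of three pronunciations of the first word, an optional "w" glide at the word boundary, and one pronunciation of the second word. That decomposition is an assumption made here for illustration, not a method the article states. Under that assumption, a short sketch can generate the dictionary lines instead of typing them by hand:

```python
from itertools import product

# Assumed decomposition of the palo_alto entries (an illustrative reading,
# not the authors' stated method): pronunciations of "palo", an optional
# glide at the word boundary, then "alto".
FIRST_WORD = ["p a l o", "p a l", "p A l"]
BOUNDARY = ["", "w"]
SECOND_WORD = ["a l t o"]


def crossword_variants(first, boundary, second):
    """Return every combination of first-word, boundary and second-word phonemes."""
    variants = []
    for parts in product(first, boundary, second):
        # Drop the empty boundary so no double spaces appear in the entry.
        variants.append(" ".join(p for p in parts if p))
    return variants


for pronunciation in crossword_variants(FIRST_WORD, BOUNDARY, SECOND_WORD):
    print("palo_alto", pronunciation)
```

The output contains the same six pronunciations as the listing, though in a different order.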

Grammar tuning - By assigning representative probabilities to the members of confusion pairs, developers can fix substitution errors. Also, adding frequently heard phrases from the "out of grammar" list to the grammar lets those utterances be recognized instead of being falsely accepted or (correctly) rejected.

Recognition tuning - There are many runtime recognition parameters that can be altered to yield better recognition rates. Examples include confidence rejection thresholds, pruning, and endpointing behavior.

Tuning the acoustic models and the endpointer - Acoustic models use Hidden Markov Models of dozens of speech features that represent all sounds occurring in a given language. Endpointing determines where the recognition engine detects the beginning and end of speech.
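
To make the confidence rejection threshold mentioned under recognition tuning more concrete, the sketch below sweeps a threshold over a handful of transcribed samples and reports the resulting false accept and false reject rates. The sample data and the 0 to 100 confidence scale are assumptions for illustration; real recognizers expose their own scales and parameter names.

```python
# Hypothetical transcribed samples:
# (recognizer confidence on an assumed 0-100 scale, utterance was in grammar).
SAMPLES = [
    (92, True), (85, True), (71, True), (48, True), (35, True),
    (66, False), (52, False), (40, False), (22, False), (15, False),
]


def rates_at(threshold, samples):
    """Return (false_accept_rate, false_reject_rate) for a rejection threshold.

    Results with confidence below `threshold` are treated as rejected.
    Assumes `samples` holds at least one in-grammar and one
    out-of-grammar utterance.
    """
    in_grammar = [c for c, ok in samples if ok]
    out_of_grammar = [c for c, ok in samples if not ok]
    false_accepts = sum(1 for c in out_of_grammar if c >= threshold)
    false_rejects = sum(1 for c in in_grammar if c < threshold)
    return false_accepts / len(out_of_grammar), false_rejects / len(in_grammar)


for threshold in (30, 45, 60):
    fa, fr = rates_at(threshold, SAMPLES)
    print(f"threshold {threshold}: false accepts {fa:.0%}, false rejects {fr:.0%}")
```

Raising the threshold trades false accepts for false rejects, so the right setting depends on which error is more costly for the application.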

Copyright © 2001 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the IEEE Industry Standards and Technology Organization (IEEE-ISTO).