Volume 2, Issue 3 - April/May 2002
   
   
 

N-best Recognition Results

By Rob Marchand

Welcome to First Words, VoiceXML Review's column that teaches you about VoiceXML and how you can use it. We hope you enjoy the lesson.

This month, we're going to take a quick look at N-best results, and how to get at them from within VoiceXML.

Today's speech recognizers are usually capable of providing not only a single recognition result, but also a list of candidate results, ordered by some criteria.

Why would we want to do this? The main reason is to give the application more flexibility in how it handles user interaction. By receiving a list of results from the recognizer, the VoiceXML application can react more reasonably when the 'best' result is not the correct one.

Consider the situation where the user is prompted for a city name for some purpose (say a travel destination). If the recognizer returns 'Austin', but the user said 'Boston', the application has several options:

  • Re-prompt the user, potentially repeating the same misrecognition;
  • Constrain the user further (Austin Texas, versus Boston Massachusetts);
  • Remove 'Austin' from the grammar, and try again.

With the availability of a list of results, the application has another option: prompt the user with the next result from the list. Or, if the first two results are very close in confidence, then the application might offer the user the option to select between the two. In any event, this capability provides the application designer with another tool to use when developing a user-friendly VoiceXML interface.
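
The choice between these strategies can be sketched in ECMAScript. This is only an illustration, not part of VoiceXML: the function name chooseNextAction, the rejected list, and the 0.1 confidence margin are all hypothetical, and it assumes each candidate carries an utterance and a confidence score between 0.0 and 1.0, ordered best-first.

```javascript
// Hypothetical helper: decide how to react to a ranked list of candidates.
// `results` is assumed to be ordered best-first, each entry having
// `utterance` (string) and `confidence` (number from 0.0 to 1.0).
// `rejected` lists utterances the caller has already declined.
// `margin` is an arbitrary illustrative threshold (default 0.1).
function chooseNextAction(results, rejected, margin) {
    if (margin === undefined) { margin = 0.1; }

    // Skip candidates the caller has already said "no" to.
    var remaining = [];
    for (var i = 0; i < results.length; i++) {
        if (rejected.indexOf(results[i].utterance) === -1) {
            remaining.push(results[i]);
        }
    }

    // Nothing left on the list: fall back to asking again.
    if (remaining.length === 0) {
        return { action: "reprompt", options: [] };
    }

    // Top two are close in confidence: let the caller pick between them.
    if (remaining.length >= 2 &&
        remaining[0].confidence - remaining[1].confidence < margin) {
        return {
            action: "disambiguate",
            options: [remaining[0].utterance, remaining[1].utterance]
        };
    }

    // Otherwise offer the best remaining candidate.
    return { action: "confirm", options: [remaining[0].utterance] };
}
```

In the 'Austin' versus 'Boston' case, passing "Austin" in the rejected list once the caller has declined it moves the application on to the next candidate instead of repeating the same misrecognition.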

N-best in VoiceXML

VoiceXML provides access to N-best results in a well-defined way. We will begin with our tried and true pizza interface:

<?xml version = "1.0"?>
<vxml version = "2.0">
    <form>
        <block>
            <prompt>
                Welcome to the Voice X M L review pizza franchise
            </prompt>
        </block>
        <field name = "orderItem">
            <grammar>
                pizza | pizzas | pie | drinks | salad | wings
            </grammar>
            <grammar mode = "dtmf">
                1 {pizza} | 2 {drinks} | 3 {salad} | 4 {wings}
            </grammar>
            <prompt>
                What would you like to order?
                We have pizza, drinks, salad or wings.
            </prompt>
            <noinput>Say pizza, drinks, salad, or wings.</noinput>
            <nomatch>You can say pizza, drinks, salad, or wings.</nomatch>
            <filled>
                You said <value expr = "orderItem" />
            </filled>
        </field>
    </form>
</vxml>

 

We've slimmed this down a bit from our earlier examples to focus on the task at hand. As you can see from the example, we prompt the user and ask what they would like to order. Once the user says something in-grammar, the item is read back to them.

In order to extend this to support N-best, we need to decide how many results we want to receive. This is done with the maxnbest property:

<property name="maxnbest" value="5"/>

VoiceXML specifies that this defines the maximum number of results you will receive. Your application should be prepared to receive fewer than this number (i.e., anywhere from one to maxnbest results can be returned).

N-best results are made available in the ECMAScript object application.lastresult$.

You may recall that a field has an associated shadow variable fieldname$, with component members fieldname$.confidence, fieldname$.utterance, fieldname$.interpretation, and fieldname$.inputmode. These define the recognizer confidence, the actual utterance (words) matched by the recognizer, the semantic interpretation returned by the recognizer, and the input mode (DTMF or voice) for the collected data. The variable application.lastresult$ provides a similar structure, with the exception that it is an array of objects, application.lastresult$[n], with members application.lastresult$[n].confidence, application.lastresult$[n].utterance, application.lastresult$[n].inputmode, and application.lastresult$[n].interpretation. The array holds the N-best results (indexed from 0 to at most maxnbest-1). You should check the length of the array using the application.lastresult$.length member before performing any operations based on application.lastresult$. Note also that application.lastresult$.confidence, etc., correspond to the 0th element of the application.lastresult$ array.
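
As a rough ECMAScript sketch of that length check, the function below walks an array shaped like application.lastresult$ and copies out each candidate. The name collectCandidates and the sample data are hypothetical, and the array is passed in as an ordinary argument so the snippet can be run outside a VoiceXML interpreter; inside a document you would read application.lastresult$ directly.

```javascript
// Sketch: copy the N-best candidates out of an array shaped like
// application.lastresult$. `lastresult` stands in for that variable and
// `maxnbest` for the value of the maxnbest property.
function collectCandidates(lastresult, maxnbest) {
    var candidates = [];

    // The recognizer may return fewer than maxnbest entries,
    // so the array length is the limit that matters.
    var count = Math.min(lastresult.length, maxnbest);

    for (var i = 0; i < count; i++) {
        candidates.push({
            rank: i,
            utterance: lastresult[i].utterance,
            interpretation: lastresult[i].interpretation,
            confidence: lastresult[i].confidence,
            inputmode: lastresult[i].inputmode
        });
    }
    return candidates;
}

// Made-up stand-in data: maxnbest is 5, but only two candidates came back.
var sample = [
    { utterance: "pizza", interpretation: "pizza", confidence: 0.71, inputmode: "voice" },
    { utterance: "pie", interpretation: "pie", confidence: 0.44, inputmode: "voice" }
];
var found = collectCandidates(sample, 5);
```

Here found holds two entries even though up to five were requested, which is the case the length check is meant to handle.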

 

Copyright © 2001-2002 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the
IEEE Industry Standards and Technology Organization (IEEE-ISTO).