VoiceXML Review - Columns - Speak & Listen

Volume 5, Issue 1 - January / February 2005

In this monthly column, an industry expert will answer common questions about VoiceXML and related technologies. Readers are encouraged to submit questions about VoiceXML, including development, voice-user interface design, and speech technology in general, or how VoiceXML is being used commercially in the marketplace. If you have a question about VoiceXML, e-mail it to speak.and.listen@voicexmlreview.org and be sure to read future issues of VoiceXML Review for the answer.

By Matt Oshry

Q: Some dialogs within my voice application use very large or ambiguous grammars, so the recognition can be tricky. In those situations I may need to confirm with the user that my application received the user's intended response from the recognizer. How do I decide when to confirm the selected response with the user?

A: The technique you're referring to is called "confidence-based confirmation", and there are several ways to implement it in VoiceXML. Let's start with a simple dialog that requests a city and state from the user:

<vxml version="2.0"
  xmlns="http://www.w3.org/2001/vxml" >
<form>
  <field name="where">
    <prompt>Say a city and state.</prompt>
    <grammar type="application/srgs+xml" mode="voice"
        src="citystate.grxml"/>
    <noinput>
      Sorry. I didn't hear you.
      <reprompt/>
    </noinput>
    <nomatch>
      Sorry. I didn't get that.
      <reprompt/>
    </nomatch>
    <filled>
      <log>where = <value expr="where"/></log>
    </filled>
  </field>
</form>
</vxml>

If the filled element gets executed, we know that the recognizer was at least as confident as the value of the confidencelevel property. Otherwise a nomatch event is thrown. According to section 6.3.2 of the VoiceXML 2.0 specification, the default confidencelevel is 0.5, but you can tweak the value to be as low as 0.0 or as high as 1.0. Assuming you haven't modified the property, we know the confidence score for the result is somewhere between 0.5 and 1.0. We can determine the exact value by checking either of the following:

where$.confidence
application.lastresult$.confidence

Using the confidence score from lastresult$ can get tricky if your grammar fills multiple slots, and section 2.3.1 of the VoiceXML 2.0 specification states that the "distinction between field and utterance level confidence is platform dependent", so we'll use the field's confidence shadow variable for maximum portability across Voice Browser implementations.
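To see how the two values compare on your own platform, you can log both of them. The following is a minimal sketch rather than a recommended configuration: the confidencelevel value of 0.4 is an arbitrary number chosen only to show where the property goes, and citystate.grxml is the same grammar assumed in the example above.

```xml
<vxml version="2.0"
  xmlns="http://www.w3.org/2001/vxml" >
<form>
  <field name="where">
    <!-- Assumption: 0.4 is an illustrative value only. Results scoring
         below it are rejected with a nomatch event. -->
    <property name="confidencelevel" value="0.4"/>
    <prompt>Say a city and state.</prompt>
    <grammar type="application/srgs+xml" mode="voice"
        src="citystate.grxml"/>
    <filled>
      <!-- Field-level score, taken from the field's shadow variable -->
      <log>field confidence = <value expr="where$.confidence"/></log>
      <!-- Utterance-level score, taken from the last recognition result -->
      <log>utterance confidence = <value expr="application.lastresult$.confidence"/></log>
    </filled>
  </field>
</form>
</vxml>
```

Whether the two logged numbers differ depends on your platform and on how many slots the grammar fills.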

Now that we know how to access the confidence score, we need to pick a threshold. When the confidence score meets or exceeds the threshold, we'll assume the recognizer is correct; otherwise, we'll confirm the recognizer's selection with the user. For this example I'll choose 0.75, but you should consult your resident speech expert, since the threshold will vary depending on the grammar you're using.

Here's an implementation of a confidence-based confirmation dialog that leverages the Form Interpretation Algorithm (FIA) by embedding the confirmation dialog in the same form as the field that collects the city and state. The cond attribute of the confirm field controls whether or not the interpreter executes it. If the cond attribute evaluates to false, the FIA skips the confirm field and selects and executes the block. If the cond attribute evaluates to true, the FIA selects the confirm field. If the user says 'no' in response to the confirm field, the clear element resets the where and confirm form item variables, making both fields eligible for selection again during the next iteration of the FIA's main loop. In fact, the block will not be executed until either

1) the form item variable where is filled, and the associated confidence score is greater than or equal to 0.75, or

2) the user says 'yes' in response to the confirm field's prompt.

<vxml version="2.0"
   xmlns="http://www.w3.org/2001/vxml" >

<catch event="noinput nomatch">
   Sorry. I didn't get that.
   <reprompt/>
</catch>

<form>
  <field name="where">
     <prompt>Say a city and state.</prompt>
     <grammar type="application/srgs+xml" mode="voice"
            src="citystate.grxml"/>
  </field>
  <field name="confirm" 
       cond="typeof(where) != 'undefined' &amp;&amp; where$.confidence &lt; 0.75">
    <prompt>
     I heard you say <value expr="where"/>, is that correct?
    </prompt>
    <grammar type="application/srgs+xml" src="yesno.grxml"/>
    <filled>
      <if cond="confirm == 'no'">
        <clear namelist="confirm where"/>
      </if>
    </filled>
  </field>

  <block>
     <submit next="listing.cgi" namelist="where"/>
  </block>

</form>
</vxml>
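The confirm field compares its value against the string 'no', so yesno.grxml has to return 'yes' or 'no' as its result. The grammar itself isn't shown in this column, so here is one possible sketch of it. It assumes your platform supports SRGS XML grammars with the "semantics/1.0" tag format; the synonyms ("yeah", "nope") are illustrative choices, and tag syntax differs on some platforms, so check your vendor's documentation.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical yesno.grxml: maps a few spoken variants onto the
     two values that the confirm field tests for. -->
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" mode="voice"
         root="yesno" tag-format="semantics/1.0">
  <rule id="yesno" scope="public">
    <one-of>
      <!-- Affirmative variants all yield the string "yes" -->
      <item>yes <tag>out = "yes";</tag></item>
      <item>yeah <tag>out = "yes";</tag></item>
      <!-- Negative variants all yield the string "no" -->
      <item>no <tag>out = "no";</tag></item>
      <item>nope <tag>out = "no";</tag></item>
    </one-of>
  </rule>
</grammar>
```

Normalizing every variant to one of two strings keeps the cond expression in the filled element simple.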

What if the recognizer never obtains a confidence score greater than or equal to 0.75, and the user repeatedly responds 'no' to the confirm dialog? After the second and certainly the third confirmation attempt, the user is likely to give up. You can preempt this situation by keeping track of the number of times the user is asked to confirm her choice and taking an appropriate action, for example, transferring the user to your call center. If you don't have that luxury, you can attempt to obtain the information from the user in a different way - for example, via DTMF ("type the first few letters..."), or by presenting a list of choices.

Here's a slightly modified version of the previous example that keeps track of confirmation attempts using a variable named confirmCount. Because the confirmCount variable is declared at form scope, it automatically gets initialized each time you enter the form. When the confirmCount reaches 2 in the confirm dialog, the application throws the event "com.yourcompany.yourapp.transfertoagent". The handler for this event presumably navigates to a form containing a transfer element that whisks the user off to a customer care representative.

<vxml version="2.0"
   xmlns="http://www.w3.org/2001/vxml" >

<catch event="noinput nomatch">
   Sorry. I didn't get that.
   <reprompt/>
</catch>

<form>
  <var name="confirmCount" expr="0"/>
  
  <field name="where">
     <prompt>Say a city and state.</prompt>
     <grammar type="application/srgs+xml" mode="voice"
            src="citystate.grxml"/>
  </field>
  <field name="confirm" 
       cond="typeof(where) != 'undefined' &amp;&amp; where$.confidence &lt; 0.75">
    <prompt>
     I heard you say <value expr="where"/>, is that correct?
    </prompt>
    <grammar type="application/srgs+xml" src="yesno.grxml"/>
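    <filled>
      <!-- Only a 'no' counts as a failed attempt; on 'yes' the FIA
           moves on to the block -->
      <if cond="confirm == 'no'">
        <if cond="++confirmCount &lt; 2">
          <clear namelist="confirm where"/>
        <else/>
          <throw event="com.yourcompany.yourapp.transfertoagent"/>
        </if>
      </if>
    </filled>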
  </field>

  <block>
     <submit next="listing.cgi" namelist="where"/>
  </block>

</form>
</vxml>
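The handler for the transfer event isn't shown above. As a rough sketch, it could look like the following, assuming the handler and the transfer form live in the same document as the confirmation form. The form id "agent", the prompt wording, and the telephone number are hypothetical placeholders, and transfer support and behavior vary by platform.

```xml
<vxml version="2.0"
   xmlns="http://www.w3.org/2001/vxml" >

<!-- Document-level handler for the event thrown by the confirm field -->
<catch event="com.yourcompany.yourapp.transfertoagent">
   <goto next="#agent"/>
</catch>

<!-- The confirmation form from the example above would also go here -->

<form id="agent">
  <!-- Assumption: placeholder number; bridge="false" requests a
       blind transfer -->
  <transfer name="call" dest="tel:+15555550100" bridge="false">
    <prompt>
      Please hold while I connect you to a representative.
    </prompt>
  </transfer>
</form>
</vxml>
```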

You can reuse the confirm field anywhere you require confidence-based confirmation by doing the following:

1) Copy and paste the confirm field into the form after the field that requires confidence-based confirmation.

2) Replace the use of the variable 'where' with the value of the name attribute of the data collection field. The data collection field is referenced in three places:

   a) The cond attribute of the confirm field
   b) The prompt
   c) The namelist attribute of the clear element

3) Adjust the confidence threshold to an appropriate value.

4) If you track the number of confirmation attempts, be sure to declare the counter within the form into which you copy the confirmation field.
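As an illustration of those steps, here is a sketch of the confirm field pasted into a different form. Everything specific to it is hypothetical: the field name airline, the grammar airline.grxml, the target flights.cgi, and the threshold of 0.6 are placeholders, not values from a real application.

```xml
<vxml version="2.0"
   xmlns="http://www.w3.org/2001/vxml" >
<form>
  <!-- Hypothetical data collection field -->
  <field name="airline">
     <prompt>Which airline are you flying?</prompt>
     <grammar type="application/srgs+xml" mode="voice"
            src="airline.grxml"/>
  </field>
  <!-- Pasted confirm field: 'where' replaced by 'airline' in the cond
       attribute, the prompt, and the clear namelist; threshold adjusted -->
  <field name="confirm"
       cond="typeof(airline) != 'undefined' &amp;&amp; airline$.confidence &lt; 0.6">
    <prompt>
     I heard you say <value expr="airline"/>, is that correct?
    </prompt>
    <grammar type="application/srgs+xml" src="yesno.grxml"/>
    <filled>
      <if cond="confirm == 'no'">
        <clear namelist="confirm airline"/>
      </if>
    </filled>
  </field>

  <block>
     <submit next="flights.cgi" namelist="airline"/>
  </block>
</form>
</vxml>
```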

Q: According to the latest draft of VoiceXML 2.1, the markname and marktime variables store values corresponding to the mark that was last executed "before barge-in occurred or the end of audio playback occurred." If the user listens to the prompt in its entirety, how can my application accurately detect the user's reaction time if they speak during the timeout interval?

A: Your careful reading of the specification is correct. If the user doesn't barge in on the prompt, the marktime variable will only reflect the interval between the last mark that was executed and the end of audio playback. It will not include the timeout interval. You can work around this by setting the timeout property to zero seconds and adding to the end of the prompt queue a silent audio file (e.g. timeout.wav) whose duration equals the desired timeout. Because the silence is then part of the prompt, a response during it is treated as barge-in after the final mark. Here's some sample code:

<field name="city">
  <property name="timeout" value="0s"/>
  <grammar mode="voice" src="citystate.srgs"/>
  <prompt>
    <mark name="pre"/>
    <audio src="citystate.wav">say a city and state</audio>
    <mark name="timeout"/>
    <audio src="timeout.wav"/>
  </prompt>
  <filled>
    <log>
       markname=<value expr="city$.markname"/>, 
       marktime=<value expr="city$.marktime"/>
    </log>
      Okay. <value expr="city"/>
  </filled>
</field>
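With the two marks in place, the filled element can tell a barge-in on the spoken prompt apart from a response during the silent interval. The following variation is a sketch under the same assumptions as the sample above (the audio files and grammar are the ones already named there) and requires a platform that implements the VoiceXML 2.1 mark variables. The log wording is my own.

```xml
<vxml version="2.1"
   xmlns="http://www.w3.org/2001/vxml" >
<form>
  <field name="city">
    <property name="timeout" value="0s"/>
    <grammar mode="voice" src="citystate.srgs"/>
    <prompt>
      <mark name="pre"/>
      <audio src="citystate.wav">say a city and state</audio>
      <mark name="timeout"/>
      <audio src="timeout.wav"/>
    </prompt>
    <filled>
      <if cond="city$.markname == 'timeout'">
        <!-- The user waited for the spoken prompt to finish; marktime is
             the time in milliseconds since the 'timeout' mark -->
        <log>reaction time after prompt = <value expr="city$.marktime"/> ms</log>
      <else/>
        <!-- The user interrupted the spoken prompt; marktime is measured
             from the 'pre' mark -->
        <log>barge-in at <value expr="city$.marktime"/> ms into the prompt</log>
      </if>
    </filled>
  </field>
</form>
</vxml>
```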
