Volume 3, Issue 5 - September/October 2003

Developing Multimodal Applications using XHTML+Voice

By:

Executive Summary

On the Internet, people use browsers to visit Web sites, access documents from networks, and fill out forms. With this growing capability to retrieve information, communication between users and their devices is receiving more attention. As devices become smaller, other means of input -- in addition to the keyboard or tap screen -- are becoming necessary. Small handheld devices, including cell phones and PDAs, now contain sufficient processing power to handle multiple tasks. On some devices it is difficult to perform these tasks using only a keyboard, stylus, or handwriting recognition. This has led to a new application technology called multimodal: the use of multiple methods of communication between the user and a device. These methods include keypad, touch or tap screen, handwriting recognition, and voice recognition.

This paper illustrates the basic structure and contents of an XHTML+Voice multimodal application, describing its fundamental building blocks. It is intended for those who are familiar with XHTML, VoiceXML, and HTML.

Each of the building blocks is described and coding samples are provided. A multimodal implementation of a hypothetical Pizza Order Form application is presented as an example.

The Structure of an XHTML+Voice Application

A basic XHTML+Voice multimodal application consists of a Namespace Declaration, Visual Part, Voice Part, and a Processing Part. Figure 1 illustrates these components and their relationship to each other.

Figure 1 -- The structure of an XHTML+Voice application

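As a rough sketch of how these parts fit together in a single document (the ids, field names, and file names below, such as voice_example and example.jsgf, are illustrative placeholders rather than code from the pizza example), the voice handlers are declared as <vxml:form> elements in the document head, and each visual field in the body is bound to its handler through XML Events attributes:

<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ev="http://www.w3.org/2001/xml-events"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xml:lang="en-US">
  <head>
    <title>Multimodal skeleton</title>
    <!-- Voice Part: a VoiceXML form that prompts for one field -->
    <vxml:form id="voice_example">
      <vxml:field name="example">
        <vxml:prompt>Please make a selection.</vxml:prompt>
        <vxml:grammar src="example.jsgf"/>
      </vxml:field>
    </vxml:form>
  </head>
  <body>
    <!-- Visual Part: an XHTML field whose focus event activates the voice handler -->
    <form action="order-handler">
      <input type="text" name="example" ev:event="focus" ev:handler="#voice_example"/>
    </form>
    <!-- Processing Part (not shown here): typically script that copies the recognized value into the visual field -->
  </body>
</html>
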
Namespace Declaration

The Namespace Declaration for a typical XHTML+Voice application is written in XHTML, with additional namespace declarations for VoiceXML and XML Events. Figure 2 is an example of the namespace declaration for an XHTML+Voice application.

<?xml version="1.0" encoding="iso-8859-1" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+Voice 1.0//EN" "xhtml+voice.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ev="http://www.w3.org/2001/xml-events"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xml:lang="en-US">
Figure 2 -- Namespace declaration

Visual Part

The Visual Part of an XHTML+Voice application is XHTML code that is used to display the various form elements on the device's screen, if one is available. This can be ordinary XHTML code and may include check boxes, radio buttons, and the other form items found in a typical HTML form. For example, Figure 3 displays the pizza size choices and their corresponding radio buttons. Figure 4 illustrates a typical form using XHTML+Voice.

<b>Size:</b><br/>
<!-- Each radio button's focus event is bound to the #voice_size handler via XML Events -->
<input type="radio" name="size" id="sizeSmall" value="small" ev:event="focus" ev:handler="#voice_size"/>
Small 12&quot;
<input type="radio" name="size" id="sizeMedium" value="medium" ev:event="focus" ev:handler="#voice_size"/>
Medium 16&quot;
<input type="radio" name="size" id="sizeLarge" value="large" ev:event="focus" ev:handler="#voice_size"/>
Large 22&quot;
Figure 3 -- Visual part of a multimodal application

Figure 4 -- A typical form using XHTML+Voice

Voice Part

The Voice Part of an application is the section of code that prompts the user for the value of a desired field within a form. This VoiceXML code uses an external grammar to define the possible field choices. If there are many choices, or a combination of choices is required, the external grammar can be used to handle the valid combinations.

For example, to select the vegetable toppings for a pizza, there are multiple ways to say the selections. The VoiceXML code in Figure 5 is used with the vegtoppings.jsgf grammar file to prompt the user to select vegetable toppings for the pizza. To add additional vegetable topping choices, modify the vegtoppings.jsgf file.
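
Figure 5 itself appears in the continuation of this article, but a minimal sketch of such a voice handler might look like the following (the form id, field name, and prompt wording are illustrative assumptions; only the reference to the external vegtoppings.jsgf grammar comes from the description above):

<!-- Voice Part: prompts for vegetable toppings; the valid phrases and combinations are defined in the external grammar -->
<vxml:form id="voice_vegtoppings">
  <vxml:field name="vegtoppings">
    <vxml:prompt>
      Which vegetable toppings would you like on your pizza?
    </vxml:prompt>
    <vxml:grammar src="vegtoppings.jsgf"/>
    <vxml:catch event="nomatch noinput">
      <vxml:prompt>Sorry, I did not catch that. Please say the vegetable toppings you want.</vxml:prompt>
      <vxml:reprompt/>
    </vxml:catch>
  </vxml:field>
</vxml:form>

Because the permissible phrases live in the grammar file rather than in the markup, new topping choices can be added without touching the VoiceXML code.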

Continued...


Copyright © 2001-2003 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the
IEEE Industry Standards and Technology Organization (IEEE-ISTO).