Volume 5, Issue 1 - January / February 2005
 
   
   
  W3C Multimodal Interaction Working Group Activities

By Dr. Deborah Dahl

The W3C (World Wide Web Consortium) Multimodal Interaction Working Group (MMIWG) works on standards that will allow users to interact with applications through a combination of input modalities, including speech, pen, keyboard, and pointing, as well as future input modalities. Multimodal interaction is particularly valuable for applications on cell phones and PDAs, which have small keyboards and displays. On the desktop, multimodal interaction is useful for hands-busy and eyes-busy applications, as well as for making applications accessible to users who have difficulty using their hands or who are visually impaired. On any platform, multimodality allows applications to leverage the strengths of the different modalities to provide an optimal user experience.

Important goals of the work on multimodal standards in the W3C are:

  • to ensure interoperability of multimodal web applications by standardizing common, royalty-free architectures, data formats, and authoring languages
  • to provide standards that will support useful and compelling multimodal applications on a variety of platforms based on current technology

The MMIWG was first chartered by the W3C in 2002 and its charter was renewed for two more years in January of this year. The rechartered group is focusing on the following areas:

  • Architecture: How are component modalities included in platforms, how are they controlled, and how do they communicate with each other?
  • Authoring: Using authoring tools such as markup, scripting, and styling, how do authors create multimodal applications and control modality components?
  • Representing users' intent: How do modality components such as speech and handwriting recognizers communicate the intent of users' inputs to the application?
  • Representing digital ink: How is pen input represented so that it can be incorporated into multimodal applications?

In addition to these activities, the MMIWG has also produced a specification for a Dynamic Properties Framework, which defines how applications can interact with dynamically changing properties of client devices. This work has been transitioned to the Device Independence Working Group.

The most mature specifications produced by the MMIWG are in the areas of representing users' intent and representing digital ink, so this article will focus on our work in those areas. We hope to produce Working Draft specifications on architecture and authoring in 2005.

Representing Users' Intent -- EMMA

Components such as speech recognizers that interpret users' inputs need a way to represent their results, which can be quite complex, so that other components using those results can reliably access them. For this reason, it is extremely valuable to have a common, cross-vendor representation format for users' intents. This need will become more and more apparent as recognizers become increasingly decoupled from platforms. Proprietary solutions for representing users' intents are workable as long as the interface between the recognizer and the rest of the platform is not exposed. However, the value of decoupling recognizers from platforms is becoming increasingly recognized in the industry, as demonstrated by the emergence of protocols like the IETF's Media Resource Control Protocol (MRCPv2).

Standardization also becomes more and more important as the users' inputs and their representations increase in complexity. Consider the simplest case -- for example, the user says "yes", and the speech recognition result is "yes". Obviously in this case the simple text result suffices as a format for conveying the user's intent to the application. But, in real systems, things quickly become more complicated. Additional information, such as confidences, literal tokens, the input mode, the nbest list, and structured interpretations are important to real applications. For this reason, they are included in VoiceXML 2.0 as shadow variables and the "application.lastresult$" variable. The information represented in VoiceXML 2.0 is clearly much more useful than just the simple text ASR results, but some applications would benefit from even more information, as we will see below.
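
As a brief illustration, here is a minimal VoiceXML 2.0 sketch showing how a field's shadow variables and application.lastresult$ expose this information (the grammar URI, field name, and prompt are invented for this example):

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="order">
    <field name="drink">
      <prompt> What would you like to drink? </prompt>
      <grammar src="drinks.grxml" type="application/srgs+xml"/>
      <filled>
        <!-- Shadow variables and application.lastresult$ expose the literal
             tokens, the confidence, and the input mode of the last result -->
        <log> heard "<value expr="drink$.utterance"/>"
              with confidence <value expr="drink$.confidence"/>
              via <value expr="application.lastresult$.inputmode"/> </log>
      </filled>
    </field>
  </form>
</vxml>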

Because multimodal applications by definition deal with more than just speech, it is even more important for multimodal applications to have a common format for representing results. This is particularly true for representing results that require the integration of information from two or more modalities, such as an utterance like "zoom in here", accompanied by a circling pen gesture.

To address this need, the MMIWG is developing an XML specification called Extensible MultiModal Annotation (EMMA) for representing the intentions reflected in a user's input, in any modality, along with other information about the circumstances of the utterance. Historically, EMMA builds upon an earlier specification called the "Natural Language Semantics Markup Language" (NLSML), which was developed in the W3C Voice Browser Working Group, but EMMA supports the annotation of much more comprehensive information than NLSML.

Application Instance Data

We can divide the information contained in an EMMA document into the representation of the user's input (application instance data) and information about the user's input (annotations). The format of the application instance data is largely under the control of the application developer. While EMMA itself doesn't specify how the application instance data is created, one typical scenario is that the data is derived from Semantic Interpretation for Speech Recognition (SISR) tags included in Speech Recognition Grammar Specification (SRGS) grammars. The SISR specification defines an XML serialization of the speech recognition results, which can serve as the basis for the application instance data in an EMMA document. So, looking at the example from Section 7 of the SISR Last Call Working Draft, the ECMAScript semantic interpretation result for "I want a medium coke and three large pizzas with pepperoni and mushrooms" would look like this:

{
  drink: {
    liquid: "coke"
    drinksize: "medium"
  }
  pizza: {
    number: "3"
    pizzasize: "large"
    topping: [ "pepperoni" "mushrooms" ]
  }
}
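
For readers unfamiliar with SISR, the following is a minimal, purely illustrative sketch of an SRGS grammar with SISR tags that could produce the drink portion of such a result; it is not the actual grammar from the SISR draft, and the rule names and covered phrases are invented for this example:

<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         xml:lang="en-US" root="drink_order" tag-format="semantics/1.0">
  <rule id="drink_order" scope="public">
    I want a
    <ruleref uri="#size"/>
    <ruleref uri="#liquid"/>
    <!-- Build the structured interpretation from the referenced rules -->
    <tag>
      out.drink = new Object();
      out.drink.drinksize = rules.size;
      out.drink.liquid = rules.liquid;
    </tag>
  </rule>
  <rule id="size">
    <one-of>
      <item> medium </item>
      <item> medium-sized </item>
    </one-of>
    <!-- Normalize both spoken forms to the same value -->
    <tag> out = "medium"; </tag>
  </rule>
  <rule id="liquid">
    <one-of>
      <item> coke </item>
      <item> coca cola </item>
    </one-of>
    <tag> out = "coke"; </tag>
  </rule>
</grammar>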

Following the SISR rules for XML serialization, the XML version of this ECMAScript object would look like this:

<drink>
  <liquid> coke </liquid>
  <drinksize> medium </drinksize>
</drink>
<pizza>
  <number> 3 </number>
  <pizzasize> large </pizzasize>
  <topping length="2">
    <item index="0"> pepperoni </item>
    <item index="1"> mushrooms </item>
  </topping>
</pizza>

This XML version of the semantic interpretation result is the primary raw material for the final EMMA document, which will be augmented with additional information as we'll see in the next section. The simplest EMMA document that could be constructed from this semantic result would just wrap the XML result with the <emma> root element and an enclosing <interpretation> element, as in the following example:

<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
<emma:interpretation id="int1" >    
  <drink>
<liquid> coke </liquid>
<drinksize> medium </drinksize>
</drink>
<pizza>
<number> 3 </number>
<pizzasize> large </pizzasize>
<topping length="2">
<item index="0"> pepperoni </item>
<item index="1"> mushrooms </item>
</topping>
</pizza>
</emma:emma>

EMMA also supports several additional formats for representing the user's input when a more complex representation is required: nbest lists, lattices, and derivations. If the recognizer returns more than one candidate for the result (for example, if it can't be certain whether the user said "Boston" or "Austin"), the EMMA document will represent this with a list of <interpretation> elements enclosed by a <one-of> tag. Lattices also represent uncertainty, but in a more compact fashion than an nbest list, by factoring out the common parts of the nbest list into single arcs. Derivations are used when the input undergoes several stages of processing and it is desirable to represent each succeeding stage as a separate EMMA interpretation. Since lattices and derivations are more typical of advanced systems, the EMMA specification itself is the best source for details about these elements.
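
For instance, a two-item nbest list for the "Boston"/"Austin" case might look roughly like this (the <city> element, the confidence values, and the ids are invented for illustration):

<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:one-of id="nbest1">
    <emma:interpretation id="int1" confidence="0.55" tokens="Boston">
      <city> Boston </city>
    </emma:interpretation>
    <emma:interpretation id="int2" confidence="0.45" tokens="Austin">
      <city> Austin </city>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>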

Annotations

EMMA also supports a wide range of additional information that some applications may find useful, including an <info> element for application-specific or vendor-specific annotations. The most commonly used annotations are probably the confidence and the literal speech recognition result. The following EMMA document adds "confidence" and "tokens" attributes to the interpretation above to provide this information. Note that the tokens and the interpretation don't completely match; the user said "coca cola", not "coke", and "medium-sized", not "medium". This type of regularization is typical of speech applications.

<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
<emma:interpretation id="int1" confidence="0.6" tokens="I want 
  a medium-sized coca cola and three large pizzas with pepperoni and mushrooms">    
  <drink>
<liquid> coke </liquid>
<drinksize> medium </drinksize>
</drink>
<pizza>
<number> 3 </number>
<pizzasize> large </pizzasize>
<topping length="2">
<item index="0"> pepperoni </item>
<item index="1"> mushrooms </item>
</topping>
</pizza> </emma:emma>

Other annotations supported by EMMA include relative and absolute timestamps, port information, the grammar used to produce the input, the modality of the input, the data model of the application instance data, how multiple inputs might be grouped to support a single composite multimodal input, and many more.
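
As an illustration only, timestamps and an input mode might be attached to an interpretation roughly as follows; the "start", "end", and "mode" attribute names and their values are placeholders invented for this example, and the EMMA specification should be consulted for the exact annotation vocabulary:

<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <!-- "start", "end", and "mode" below are illustrative annotation names,
       not necessarily the normative EMMA vocabulary -->
  <emma:interpretation id="int1" confidence="0.6"
      start="1107536402000" end="1107536404500" mode="speech">
    <drink>
      <liquid> coke </liquid>
      <drinksize> medium </drinksize>
    </drink>
  </emma:interpretation>
</emma:emma>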

Next Steps with EMMA

The MMIWG is now considering which features of EMMA should be required and which should be optional in preparation for publishing the Last Call Working Draft. The group's current plan is to publish the Last Call Working Draft in March 2005. The next step will be for the group to create an Implementation Report with EMMA tests so that the EMMA specification can be tested by organizations that have implemented EMMA processors. In addition to companies that are interested in using EMMA processors in their platforms, the IETF speechsc (Speech Services Control) working group is also very interested in using EMMA in its MRCPv2 protocol for distributed speech processing. The current plan is for EMMA to reach W3C Recommendation status in January of 2006.

Representing Digital Ink -- InkML

Future multimodal applications will enable users to create inputs using pen or stylus input -- digital ink. The MMIWG is creating a standard specification for an XML representation of digital ink called InkML. InkML can be used, for example, as input to handwriting recognizers. Once digital ink is converted to text by a handwriting recognizer and interpreted, the resulting interpretation would be represented within an application using EMMA, just like a speech recognition result.
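
For example, if a handwriting recognizer interpreted an ink input as the word "hello", the result might be wrapped in EMMA along these lines (the <greeting> element, the id, and the confidence value are invented for illustration):

<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation id="ink1" confidence="0.8" tokens="hello">
    <greeting> hello </greeting>
  </emma:interpretation>
</emma:emma>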

The fundamental component of an ink input is a trace, which represents the stylus track over a surface (or canvas). Consider, as an example, a handwritten input of the word "hello" (the example is taken from the InkML specification).

There are five traces in this input ("h", "e", "l", "l", "o"), and each trace is represented as a series of points. In InkML, this input would be represented by the following traces:

<ink>
  <trace>
    10 0 9 14 8 28 7 42 6 56 6 70 8 84 8 98 8 112 9 126 10 140
    13 154 14 168 17 182 18 188 23 174 30 160 38 147 49 135
    58 124 72 121 77 135 80 149 82 163 84 177 87 191 93 205
  </trace>
  <trace>
    130 155 144 159 158 160 170 154 179 143 179 129 166 125
    152 128 140 136 131 149 126 163 124 177 128 190 137 200
    150 208 163 210 178 208 192 201 205 192 214 180
  </trace>
  <trace>
    227 50 226 64 225 78 227 92 228 106 228 120 229 134
    230 148 234 162 235 176 238 190 241 204
  </trace>
  <trace>
    282 45 281 59 284 73 285 87 287 101 288 115 290 129
    291 143 294 157 294 171 294 185 296 199 300 213
  </trace>
  <trace>
    366 130 359 143 354 157 349 171 352 185 359 197
    371 204 385 205 398 202 408 191 413 177 413 163
    405 150 392 143 378 141 365 150
  </trace>
</ink>

As in EMMA, InkML provides for a rich variety of additional information about the ink input beyond the representation of the basic trace. This includes timestamps, detailed information about the ink capture device and brushes, and metadata.
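
As a rough sketch only, channel declarations and a brush definition might accompany the traces along the following lines; the element and attribute names here follow the InkML working draft and should be checked against the specification itself:

<ink>
  <!-- Declare the channels recorded for each point of a trace -->
  <traceFormat>
    <channel name="X" type="decimal"/>
    <channel name="Y" type="decimal"/>
  </traceFormat>
  <!-- A brush that traces can refer to -->
  <brush id="fineTipPen"/>
  <trace brushRef="#fineTipPen">
    10 0 9 14 8 28 7 42 6 56
  </trace>
</ink>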

Next Steps with InkML

The MMIWG is now working on the Last Call Working Draft, which is expected to be published in April 2005. As with EMMA, the next step will be for the group to create an Implementation Report with tests so that the specification can be tested by organizations that have implemented InkML processors. The current plan is for InkML to reach W3C Recommendation status in March of 2006.

Conclusions

The W3C Multimodal Interaction Working Group is working on specifications that will support multimodal interaction on the web. This year we plan to publish first Working Drafts on architecture and authoring, as well as to move EMMA and InkML closer to Recommendation. The MMIWG welcomes feedback from the public on all of our specifications. Feedback can be sent to the group's public mailing list, www-multimodal@w3.org. The group also welcomes new members who are interested in working on multimodal standards. You can find more information about how to participate on our public page, http://www.w3.org/2002/mmi/.

 

Dr. Deborah Dahl is a consultant in speech technologies and their application to business solutions. Dr. Dahl chairs the W3C Multimodal Interaction Working Group and is a member of the W3C Voice Browser Working Group. She is a frequent speaker at speech industry trade shows and is the editor of the forthcoming book, "Practical Spoken Dialog Systems". She can be reached at dahl@conversational-technologies.com.




Copyright © 2001-2005 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the
IEEE Industry Standards and Technology Organization (IEEE-ISTO).