W3C Multimodal Interaction Working Group Activities

By Dr. Deborah Dahl
The W3C (World Wide Web Consortium) Multimodal
Interaction Working Group (MMIWG) works on standards
that will allow users to interact with applications
with a combination of input modalities that include
speech, pen, keyboard, and pointing, as well as future
input modalities. Multimodal interaction is particularly
valuable for applications on cell phones and PDA's,
which have small keyboards and displays. On the desktop,
multimodal interaction is useful for hands-busy and
eyes-busy applications, as well as for making applications
accessible to users who have difficulty using their
hands or who are visually impaired. On any platform,
multimodality allows applications to leverage the
strengths of the different modalities to provide
an optimal user experience.
Important goals of the work on multimodal standards
in the W3C are:
- to ensure interoperability of multimodal web applications
by standardizing common, royalty-free architectures,
data formats, and authoring languages
- to provide standards that will support useful and
compelling multimodal applications on a variety of
platforms based on current technology
The MMIWG was first chartered by the W3C in 2002 and
its charter was renewed for two more years in January
of this year. The rechartered group is focusing on
the following areas:
- Architecture: How are component modalities included
in platforms, how are they controlled, and how do
they communicate with each other?
- Authoring: Using authoring tools such as markup,
scripting, and styling, how do authors create multimodal
applications and control modality components?
- Representing users' intent: How do modality components
such as speech and handwriting recognizers communicate
the intent of users' inputs to the application?
- Representing digital ink: How is pen input represented
so that it can be incorporated into multimodal applications?
In addition to these activities, the MMIWG has also
produced a specification for a Dynamic
Properties Framework, which defines how applications
can interact with dynamically changing properties of
client devices. This work has been transitioned to
the Device Independence
Working Group.
The most mature specifications produced by the MMIWG
are in the areas of representing users' intent and
representing digital ink, so this article will focus
on our work in those areas. We hope to produce Working
Draft specifications on architecture and authoring
in 2005.
Representing Users' Intent -- EMMA

Components such as speech recognizers that interpret
users' inputs need a way to represent their results,
which can be quite complex, so that other components
using those results can reliably access them. For this
reason, it is extremely valuable to have a common,
cross-vendor, representation format for users' intents.
This need will become more and more apparent as recognizers
become increasingly decoupled from platforms. Proprietary
solutions for representing users' intents are workable
as long as the interface between the recognizer and
the rest of the platform is not exposed. However, the
flexibility of decoupling recognizers from platforms
is becoming increasingly recognized in the industry,
as demonstrated by the emergence of protocols like
the IETF's Media Resources Control Protocol (MRCPv2).
Standardization
also becomes more and more important as the users'
inputs and their representations increase
in complexity. Consider the simplest case -- for example,
the user says "yes", and the speech recognition
result is "yes". Obviously in this case the
simple text result suffices as a format for conveying
the user's intent to the application. But, in real
systems, things quickly become more complicated. Additional
information, such as confidences, literal tokens, the
input mode, the nbest list, and structured interpretations,
is important to real applications. For this reason,
they are included in VoiceXML 2.0 as shadow variables
and the "application.lastresult$" variable.
The information represented in VoiceXML 2.0 is clearly
much more useful than just the simple text ASR results,
but some applications would benefit from even more
information, as we will see below.
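For reference, a VoiceXML 2.0 application reads this kind of information from the fields of application.lastresult$, roughly as follows (the values shown here are purely illustrative):

application.lastresult$[0].utterance       // the literal words, e.g. "yes"
application.lastresult$[0].confidence      // e.g. 0.85
application.lastresult$[0].inputmode       // "voice" or "dtmf"
application.lastresult$[0].interpretation  // the structured semantic result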
Because
multimodal applications by definition deal with more
than just speech, it is even more important
for such applications to have a common format
for representing results. This is particularly true
for representing results that require the integration
of information from two or more modalities, such as
an utterance like "zoom in here", accompanied
by a circling pen gesture.
To address this need, the MMIWG is developing an XML
specification called Extensible MultiModal Annotation
(EMMA) for
representing the intentions reflected in a user's input,
in any modality, along with other information about
the circumstances of the utterance. Historically, EMMA
builds upon an earlier specification called "Natural
Language Semantics Markup Language" (NLSML), which
was developed in the W3C Voice Browser Working Group,
but EMMA supports the annotation of much more comprehensive
information than NLSML.
Application Instance Data

We can divide the information contained in an EMMA
document into the representation of the user's input
(application instance data) and information
about the user's input (annotations). The
format of the application instance data is largely
under the control of an application developer. While
EMMA itself doesn't specify how the application instance
data is created, one typical scenario is that the data
is derived from Semantic Interpretation for Speech
Recognition (SISR) tags
included in Speech Recognition Grammar Specification (SRGS) format
grammars. The SISR specification defines an XML serialization
of the speech recognition results which can serve as
the basis for the application instance data in an EMMA
document. So, looking at the example from the SISR
Last Call Working Draft in Section 7, the ECMAScript semantic interpretation result
for "I want a medium coke and three large pizzas
with pepperoni and mushrooms" would look like
this:
{
  drink: {
    liquid: "coke",
    drinksize: "medium"
  },
  pizza: {
    number: "3",
    pizzasize: "large",
    topping: [ "pepperoni", "mushrooms" ]
  }
}
Following the SISR rules for XML serialization, the
XML version of this ECMAScript object would look like
this:
<drink>
  <liquid> coke </liquid>
  <drinksize> medium </drinksize>
</drink>
<pizza>
  <number> 3 </number>
  <pizzasize> large </pizzasize>
  <topping length="2">
    <item index="0"> pepperoni </item>
    <item index="1"> mushrooms </item>
  </topping>
</pizza>
This XML
version of the semantic interpretation result is
the primary raw material for the final EMMA document,
which will be augmented with additional information
as we'll see in the next section. The simplest EMMA
document that could be constructed from this semantics
would just wrap the XML result in the <emma:emma> root
element and an enclosing <emma:interpretation> element,
as in the following example:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation id="int1">
    <drink>
      <liquid> coke </liquid>
      <drinksize> medium </drinksize>
    </drink>
    <pizza>
      <number> 3 </number>
      <pizzasize> large </pizzasize>
      <topping length="2">
        <item index="0"> pepperoni </item>
        <item index="1"> mushrooms </item>
      </topping>
    </pizza>
  </emma:interpretation>
</emma:emma>
EMMA also supports several additional formats for
representing the user's input when a more complex representation
is required: nbest lists, lattices, and derivations.
If the recognizer returns more than one candidate for
the result, for example, if it can't be certain whether
the user said "Boston" or "Austin",
the EMMA document will represent this with a list of <interpretation> elements
enclosed by a <one-of> tag (a sketch appears after this paragraph). Lattices also represent
uncertainty, but in a more compact fashion than an
nbest list by factoring out the common parts of the
nbest list into single arcs. Derivations are
used when the input undergoes several stages of processing
and it is desirable to represent each succeeding stage
as a separate EMMA interpretation. Since lattices and
derivations are more typical of advanced systems, the
EMMA specification itself is the best source for details
about these elements.
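For example, an nbest result for the "Boston"/"Austin" case above might look roughly like the following sketch, where the <origin> element simply stands in for whatever application instance data the grammar produces:

<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:one-of id="nbest1">
    <emma:interpretation id="int1" confidence="0.75">
      <origin> Boston </origin>
    </emma:interpretation>
    <emma:interpretation id="int2" confidence="0.68">
      <origin> Austin </origin>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>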
Annotations

EMMA also
supports a wide range of additional information that
some applications may find useful, including an <info> element
for application-specific or vendor-specific annotations.
The most commonly used annotations are probably confidence
and the literal speech recognition result. The following
EMMA document adds "confidence" and "tokens" attributes
to the interpretation above to provide this
information. Note that the tokens and the interpretation
don't completely match; that is, the user said "coca
cola", not "coke" and "medium-sized",
not "medium". This type of regularization
is typical of speech applications.
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation id="int1" confidence="0.6"
      tokens="I want a medium-sized coca cola and three large pizzas with pepperoni and mushrooms">
    <drink>
      <liquid> coke </liquid>
      <drinksize> medium </drinksize>
    </drink>
    <pizza>
      <number> 3 </number>
      <pizzasize> large </pizzasize>
      <topping length="2">
        <item index="0"> pepperoni </item>
        <item index="1"> mushrooms </item>
      </topping>
    </pizza>
  </emma:interpretation>
</emma:emma>
Other annotations supported by EMMA include relative
and absolute timestamps, port information, the grammar
used to produce the input, the modality of the input,
the data model of the application instance data, how
multiple inputs might be grouped to support a single
composite multimodal input, and many more.
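As a rough illustration, an interpretation carrying timestamp and modality annotations might look something like the sketch below; the attribute names here are only meant to suggest the kinds of annotations listed above, and the EMMA specification itself defines the exact names and values:

<emma:interpretation id="int1" confidence="0.6"
    mode="speech" medium="acoustic"
    start="2005-02-14T10:01:12" end="2005-02-14T10:01:15">
  <!-- application instance data goes here -->
</emma:interpretation>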
Next Steps with EMMA

The MMIWG is now considering which features of EMMA
should be required and which features should be optional
in preparation for publishing the Last Call Working
Draft. The group's current plan is to publish the Last
Call Working Draft in March, 2005. The next step will
be for the group to create an Implementation Report
with EMMA tests so that the EMMA specification can
be tested by organizations that have implemented EMMA
processors. In addition to companies that are interested
in using EMMA processors in their platforms, the IETF
SpeechSC working group is also very interested in using EMMA
in its MRCPv2 protocol for distributed speech processing.
The current plan is for EMMA to reach W3C Recommendation
status in January of 2006.
Representing Digital Ink -- InkML

Future multimodal applications will enable users to
create inputs using pen or stylus input -- digital
ink. The MMIWG is creating a standard specification
for an XML representation of digital ink called InkML.
InkML can be used, for example, as input to handwriting
recognizers. Once digital ink is converted to text
by a handwriting recognizer and interpreted, the resulting
interpretation would be represented within an application
using EMMA, just like a speech recognition result.
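For example, if a user hand-writes the word "Boston" and a handwriting recognizer interprets it as a city name, the result could be wrapped in EMMA in much the same way as a speech result; in this sketch the <city> element and the mode annotation are illustrative rather than taken from the specification:

<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation id="ink1" confidence="0.8"
      tokens="Boston" mode="ink">
    <city> Boston </city>
  </emma:interpretation>
</emma:emma>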
The fundamental component of an ink input is a trace,
which represents the stylus track over a surface (or canvas).
Consider a handwritten input of the word "hello"
(the example is taken from the InkML specification).
There are five traces in this input ("h", "e", "l", "l", "o"),
and each trace is represented as a series of points.
This input would be represented by the following
traces in InkML:
<ink>
<trace>
10 0 9 14 8 28 7 42 6 56 6 70 8 84 8 98 8 112 9 126 10 140
13 154 14 168 17 182 18 188 23 174 30 160 38 147 49 135
58 124 72 121 77 135 80 149 82 163 84 177 87 191 93 205
</trace>
<trace>
130 155 144 159 158 160 170 154 179 143 179 129 166 125
152 128 140 136 131 149 126 163 124 177 128 190 137 200
150 208 163 210 178 208 192 201 205 192 214 180
</trace>
<trace>
227 50 226 64 225 78 227 92 228 106 228 120 229 134
230 148 234 162 235 176 238 190 241 204
</trace>
<trace>
282 45 281 59 284 73 285 87 287 101 288 115 290 129
291 143 294 157 294 171 294 185 296 199 300 213
</trace>
<trace>
366 130 359 143 354 157 349 171 352 185 359 197
371 204 385 205 398 202 408 191 413 177 413 163
405 150 392 143 378 141 365 150
</trace>
</ink>
Like EMMA, InkML provides for a rich variety
of additional information about the ink input beyond
the representation of the basic trace. This includes
timestamps, detailed information about the ink capture
device and brushes, and metadata.
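As a rough sketch (element and attribute names here follow the InkML working draft and may differ in the final specification), a trace format declaring X, Y, and time channels could look something like this:

<traceFormat>
  <channel name="X" type="decimal"/>
  <channel name="Y" type="decimal"/>
  <channel name="T" type="integer" units="ms"/>
</traceFormat>

A trace captured under this format would then list an X, Y, and T value for each point.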
Next Steps with InkML

The MMIWG is now working on the Last Call Working
Draft, which is expected to be published in April 2005.
As with EMMA, the next step will be for the group to
create an Implementation Report with tests so that
the specification can be tested by organizations that
have implemented InkML processors. The current plan
is for InkML to reach W3C Recommendation status in
March of 2006.
Conclusions

The W3C Multimodal Interaction Working Group is working
on specifications that will support multimodal interaction
on the web. This year we plan to publish the first Working
Drafts on architecture and authoring, as well as to
move EMMA and InkML close to Recommendation. The MMIWG
welcomes feedback from the public on all of our specifications.
Feedback can be sent to the group's public mailing
list, www-multimodal@w3.org. The group also welcomes
new members who are interested in working on multimodal
standards. You can find out more information about
how to participate on our public page, http://www.w3.org/2002/mmi/.
Dr. Deborah Dahl is a consultant in speech technologies
and their application to business solutions. Dr. Dahl
chairs the W3C Multimodal Interaction Working Group and is
a member of the W3C Voice Browser Working Group.
Dr. Dahl is a frequent speaker at speech industry
trade shows and is the editor of the forthcoming
book, "Practical Spoken Dialog Systems". She
can be reached at dahl@conversational-technologies.com.