Volume 3, Issue 2 - March/April 2003
   
   
 

Elvira - a VoiceXML Platform for Research

By Pavel Cenek

Introduction
Research in the field of dialogue systems often involves creating an experimental application, which is used for testing new ideas, statistical data collection, various measurements, performance tests, etc. A significant amount of time is spent on the creation of such an application before the scientific work itself can start.

This article introduces Elvira (http://gin2.itek.norut.no/elvira/) - a VoiceXML platform focused on the specific needs of researchers. Elvira makes it possible to set up sophisticated research environments and to design a dialogue system quickly, so that researchers can concentrate on the scientific problem itself. During its design, special attention was paid to flexibility and easy extensibility, so that it can be used in a wide variety of research tasks and experiments.

The development of Elvira started in the Laboratory of Speech and Dialogue (http://www.fi.muni.cz/lsd/) at the Faculty of Informatics, Masaryk University, Brno, Czech Republic in January 2001. The initial impetus was the need for a suitable tool for creating AudiC, an experimental dialogue system that allowed visually impaired people to program in C.

The first versions of Elvira implemented a very limited subset of VoiceXML 1.0. Nevertheless, Elvira helped to finish the project successfully. The development of AudiC revealed many requirements for a good VoiceXML interpreter for research and influenced Elvira's later design.

Since October 2001, Elvira has been developed in cooperation with Norut IT (http://www.itek.norut.no/), an applied research institute located in Tromsø, Norway. Based on experience from the AudiC project and influenced by the first public draft of VoiceXML 2.0, Elvira's architecture was completely redesigned to achieve better flexibility and easier extensibility.

The extensibility and flexibility of our VoiceXML platform are achieved by using a component paradigm for its design and development. The component-based architecture makes the system highly modular.

A component can be viewed as a self-contained binary object that provides its services to the outer world through a set of precisely defined interfaces. Elvira is a system formed of such components. The components are selected at run-time, so Elvira can operate in dozens of different configurations, with features and capabilities that depend on the components currently in use.

Elvira's general architecture is depicted in the following figure, where components are represented by gray rectangles.


VoiceXML platform Elvira - system architecture

The heart of the platform is Elvira Core, which interprets VoiceXML and controls the other components. The Core is the only component that is never meant to be replaced by a custom implementation. Our aim is therefore to concentrate as many tasks as possible within the Core, so that the other components can be as simple as possible.

Input collection is handled by an input component. One input component may process a voice stream delivered by the telephony component and another may support a microphone, but components are not required to support only voice input. It is possible to use, for example, a keyboard for simulating speech, as well as more "exotic" devices such as a stylus with handwriting recognition, touch screens, haptic devices for handicapped people, or any combination of such devices. The output component, which is responsible for output generation, has a similar degree of freedom.

The wide diversity of devices implies a wide diversity of capabilities among input/output components. Some operations make sense only for some components, e.g. prosody modeling is useful only for speech synthesis. To embrace this diversity, components have to implement only a few mandatory interfaces that are essential for the system to run correctly. They can provide extended functionality by implementing other interfaces, if this is meaningful for the supported devices and useful for the current application. If an interface is not implemented, Elvira Core detects this and handles it.

This principle allows users to deal only with issues relevant to their current work and to keep things as simple as possible.
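The split between mandatory and optional interfaces can be illustrated with a minimal sketch. It is written in Python for brevity and is not Elvira's actual API: the names `OutputComponent`, `ProsodyControl`, `SpeechOutput`, `ScreenOutput` and `core_say` are hypothetical and were chosen only for this illustration. The point is that the core checks whether a component offers an optional interface and simply skips the operation when it does not.

```python
from abc import ABC, abstractmethod
from typing import Optional


class OutputComponent(ABC):
    """Mandatory interface (hypothetical): every output component renders text."""

    @abstractmethod
    def render(self, text: str) -> str:
        ...


class ProsodyControl(ABC):
    """Optional interface (hypothetical): only meaningful for speech synthesis."""

    @abstractmethod
    def set_pitch(self, percent: int) -> None:
        ...


class SpeechOutput(OutputComponent, ProsodyControl):
    """Implements the mandatory interface and the optional prosody interface."""

    def __init__(self) -> None:
        self.pitch = 100

    def set_pitch(self, percent: int) -> None:
        self.pitch = percent

    def render(self, text: str) -> str:
        return f"[speech pitch={self.pitch}%] {text}"


class ScreenOutput(OutputComponent):
    """Implements only the mandatory interface."""

    def render(self, text: str) -> str:
        return f"[screen] {text}"


def core_say(component: OutputComponent, text: str,
             pitch: Optional[int] = None) -> str:
    """Stand-in for the core: use prosody only if the component offers it."""
    if pitch is not None and isinstance(component, ProsodyControl):
        component.set_pitch(pitch)
    # A missing optional interface is not an error; the request is ignored.
    return component.render(text)
```

With this sketch, the same call works for both components: `core_say(SpeechOutput(), "Hello", pitch=120)` applies the pitch, while `core_say(ScreenOutput(), "Hello", pitch=120)` silently ignores it.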

Every component is characterized by a unique name and by a category. The names are typically used to specify which component should be used for a specific task in the system (e.g. input collection and output generation). A component can also be selected based on its category. This is used, for instance, to support new grammar types automatically. Each component that performs grammar analysis defines its category so that it contains the MIME type of the supported grammar format. When Elvira Core needs a grammar analyzer for a specific grammar format, it simply uses the component with the right category. Thus, all that is needed to support a new grammar type is to copy the proper component into a location where Elvira Core can find it.
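Category-based selection can be sketched as a small registry. This is an illustration under stated assumptions, not Elvira's implementation: the class names, the `grammar:<MIME type>` category format and the component names are all hypothetical, and the sketch is in Python rather than the platform's own language.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Component:
    name: str      # unique component name
    category: str  # hypothetical format: "grammar:<MIME type>"


class ComponentRegistry:
    """Keeps components by unique name and finds them by category."""

    def __init__(self) -> None:
        self._by_name: Dict[str, Component] = {}

    def register(self, component: Component) -> None:
        if component.name in self._by_name:
            raise ValueError(f"duplicate component name: {component.name}")
        self._by_name[component.name] = component

    def by_name(self, name: str) -> Component:
        return self._by_name[name]

    def grammar_analyzer_for(self, mime_type: str) -> Optional[Component]:
        """Return the first component whose category matches the MIME type."""
        wanted = f"grammar:{mime_type.lower()}"
        for component in self._by_name.values():
            if component.category == wanted:
                return component
        return None


registry = ComponentRegistry()
registry.register(Component("SrgsXmlAnalyzer", "grammar:application/srgs+xml"))

# An unsupported format is simply not found ...
missing = registry.grammar_analyzer_for("application/srgs")

# ... until a component for it is made available; the lookup code is unchanged.
registry.register(Component("SrgsAbnfAnalyzer", "grammar:application/srgs"))
found = registry.grammar_analyzer_for("application/srgs")
```

In this sketch, registering a component plays the role of copying it into a location where the core can find it.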

The same principle is used to select a stream component for fetching a resource according to the protocol specified in the resource's URI, and also to select a component that can handle a resource with a specific MIME type (the grammar components mentioned in the previous paragraph are examples of such resource components).
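Selecting a stream component by protocol might look like the following sketch. It is an assumption-laden illustration in Python: `open_http`, `open_file`, `STREAM_COMPONENTS` and `select_stream_component` are hypothetical names, and the "streams" are placeholder strings rather than real connections. Only the URI parsing uses a real library call, `urllib.parse.urlsplit`.

```python
from typing import Callable, Dict
from urllib.parse import urlsplit

# A stream component is modelled here as a function from URI to a placeholder.
StreamOpener = Callable[[str], str]


def open_http(uri: str) -> str:
    return f"http stream for {uri}"


def open_file(uri: str) -> str:
    return f"file stream for {uri}"


# Hypothetical table mapping a protocol (URI scheme) to its stream component.
STREAM_COMPONENTS: Dict[str, StreamOpener] = {
    "http": open_http,
    "file": open_file,
}


def select_stream_component(uri: str) -> StreamOpener:
    """Pick the stream component registered for the URI's protocol."""
    scheme = urlsplit(uri).scheme.lower()
    try:
        return STREAM_COMPONENTS[scheme]
    except KeyError:
        raise LookupError(f"no stream component for protocol {scheme!r}") from None
```

Supporting another protocol then amounts to adding one more entry to the table, without touching the selection logic.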

Extensions for Research Purposes

Researchers in the field of human language technologies require great flexibility in order to be able to perform virtually any task they need. We decided to keep the number of VoiceXML extensions as low as possible and instead to provide a general, unified mechanism that addresses the problem. The mechanism is implemented in Elvira Core and allows external functions written in C++ to be called from any ECMAScript expression within VoiceXML.

Continued...


Copyright © 2001-2003 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the
IEEE Industry Standards and Technology Organization (IEEE-ISTO).