VoiceXML traces its lineage back to a series of informal gatherings held in 1995 by Dave Ladd, Chris Ramming, Ken Rehor, and Curt Tuckey of AT&T Research. They were brainstorming about how the Internet would affect telephony applications when all the pieces fell into place: why not have a gateway system run a voice browser that interprets a voice dialog markup language, delivering web content and services to ordinary phones? Thus began the AT&T Phone Web project. When AT&T spun off Lucent, a separate Phone Web project continued there. Chris remained at AT&T, Ken went with Lucent, and Dave and Curt moved on to Motorola.

By early 1999, AT&T and Lucent had incompatible dialects of the Phone Markup Language (PML), Motorola had its new VoxML, and other companies were experimenting with similar ideas, notably IBM with SpeechML. A standard language had to be designed to enable the voice web. The original Phone Web people had remained close friends, so AT&T, Lucent, and Motorola began organizing the VoiceXML Forum; IBM joined as a founder soon afterwards. From March to August of 1999, a small team of Forum technologists worked together to produce a new language, VoiceXML 0.9, combining the best features of the earlier languages and pushing into new areas, especially DTMF (touch-tone key) support and mixed-initiative dialogs. After 0.9 was published, an extensive comment period began within the growing VoiceXML Forum community. These comments led to major improvements to the language, including client-side scripting, properties, and subdialogs. VoiceXML 1.0 came out in March 2000, and almost overnight fifteen or twenty different implementations sprang up.

The following month, the VoiceXML Forum submitted the 1.0 language to the World Wide Web Consortium (W3C) for consideration. In May, the W3C “accepted” VoiceXML, an event that generated a lot of press coverage, but which merely acknowledged receipt of the submission. But the W3C’s Voice Browser Working Group eagerly took on the job of the next revision.

The W3C process has taken more time than any of us expected, but the emphasis on consensus among the many participating companies has led to a strong standard. The first public Working Draft of VoiceXML 2.0 was published in October 2001, the Last Call Working Draft came out in April 2002, and VoiceXML 2.0 became a Candidate Recommendation in January 2003.

The changes from VoiceXML 1.0 to 2.0 were fairly conservative. Much thought and effort went into clarifying expected behaviors and correcting a few errors in the specification. A great deal of work also went into developing, and weaving in, new standards for speech recognition grammars and text-to-speech markup. There were a few extensions, such as the new <log> element, but overall there is a high degree of similarity between 1.0 and 2.0.
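
To give a concrete flavor of the 2.0 language, here is a minimal sketch of a VoiceXML 2.0 document that references an external Speech Recognition Grammar Specification (SRGS) grammar and uses the <log> element; the grammar file name and the prompts are invented for illustration:

    <?xml version="1.0" encoding="UTF-8"?>
    <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
      <form id="order">
        <field name="drink">
          <prompt>Would you like coffee or tea?</prompt>
          <!-- An external SRGS grammar in its XML form. -->
          <grammar src="drink.grxml" type="application/srgs+xml"/>
          <filled>
            <!-- <log> writes to the platform's debug log; new in 2.0. -->
            <log>Caller chose <value expr="drink"/></log>
            <prompt>You said <value expr="drink"/>.</prompt>
          </filled>
        </field>
      </form>
    </vxml>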

VoiceXML’s Future

The W3C is now completing the Implementation Report, part of which consists of hundreds of interoperability tests to ensure that the VoiceXML standard is implementable, and that different implementations of VoiceXML can execute the same content in the same way. The VoiceXML Forum’s Conformance Committee will then round these tests out into a complete conformance suite, which will be a powerful tool to ensure interoperability between VoiceXML implementations.

In 2003, the W3C’s Voice Browser Working Group will begin work on VoiceXML 3.0. Suggestions that were too large to incorporate in 2.0 will be addressed, along with other new extensions. Some of the improvements being discussed are:

  • Using the proposed W3C Natural Language Semantics Markup Language to represent recognition results.
  • Currently, <form> ties together the notion of an input task with the data collected by that task. Should a new high-level, task-oriented dialog construct parallel to <form> and <menu> be defined?
  • In some cases, the Form Interpretation Algorithm (FIA) does not give application developers fine enough control. Should a new low-level, procedural dialog construct parallel to <form> and <menu> be defined? (A sketch of how <form> and the FIA interact today appears after this list.)
  • Should grammar and audio resources be defined centrally and then referenced by “id” attributes elsewhere?
  • What about standardized audio playback controls for changing the speed and volume of the audio, and for moving back and forward in the audio stream? These would be analogous to CD player controls.
  • Should standard speaker verification features be added to VoiceXML for additional security? What about enabling the generation of speaker-trained grammars, for use in personal address books and similar applications?
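
As background for the two <form> questions above, here is a sketch of how a form and the FIA interact today: the FIA repeatedly selects the first unfilled <field>, plays its prompt, and collects the caller’s input, and only once every field is filled does the form-level <filled> run. The field names, grammar file, and submit target are hypothetical:

    <form id="transfer">
      <field name="amount" type="currency">
        <prompt>How much would you like to transfer?</prompt>
      </field>
      <field name="account">
        <prompt>From checking or savings?</prompt>
        <grammar src="account.grxml" type="application/srgs+xml"/>
      </field>
      <filled>
        <!-- Runs once the FIA has filled both fields. -->
        <submit next="transfer.cgi" namelist="amount account"/>
      </filled>
    </form>

A procedural construct would open up exactly the loop that the FIA performs implicitly here.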

There will also likely be changes to VoiceXML to support new multimodal markup standards. The conceptually cleanest approaches to multimodal markup use XHTML as a container for mode-specific markup (XHTML for visual, VoiceXML for voice, InkXML for ink, etc.), and then define how the modes interact using XML Events. As part of this effort, a modularization of VoiceXML would be defined so that one subset could be used for multimodal markup.
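
No such standard exists yet, but a hypothetical sketch along these lines might embed a VoiceXML form in an XHTML page and bind it to a text field with XML Events. The event binding and the particular elements shown are illustrative assumptions, not a published specification:

    <html xmlns="http://www.w3.org/1999/xhtml"
          xmlns:vxml="http://www.w3.org/2001/vxml"
          xmlns:ev="http://www.w3.org/2001/xml-events">
      <head>
        <title>Multimodal city entry</title>
        <!-- A voice dialog declared with a VoiceXML module. -->
        <vxml:form id="sayCity">
          <vxml:field name="city">
            <vxml:prompt>Which city?</vxml:prompt>
            <vxml:grammar src="city.grxml" type="application/srgs+xml"/>
          </vxml:field>
        </vxml:form>
      </head>
      <body>
        <!-- XML Events: focusing this input activates the voice dialog. -->
        <p>City: <input type="text" name="city"
                        ev:event="focus" ev:handler="#sayCity"/></p>
      </body>
    </html>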

The final official act of the original VoiceXML 1.0 language design team was to sign the Taylor Brewing Company Accord. The TBCA sought to rectify the chief imperfection of the VoiceXML 1.0 standard: its lack of author names. Here they are, for posterity: Linda Boyer, IBM; Peter Danielsen, Lucent; Jim Ferrans, Motorola; Gerald Karam, AT&T; David Ladd, Motorola; Bruce Lucas, IBM; and Kenneth Rehor, Lucent. We hope you have as much fun learning and using VoiceXML as we did putting it together. Enjoy!