Automated speech recognition (ASR) systems have greatly improved in recent years as better algorithms and acoustic models have been developed, and as more computing power can be brought to bear on the task. An ASR system running on an inexpensive home or office computer with a good microphone can take free-form dictation, as long as it has been pre-trained for the speaker’s voice. Over the phone, and with no speaker training, a speech recognition system needs to be given a set of speech grammars that tell it what words and phrases it should expect. Within these constraints a surprisingly large set of possible utterances can be recognized (e.g., a particular mutual fund name out of thousands). Recognition over mobile phones in noisy environments, while problematic, can be improved with a new technology called distributed speech recognition, where the early analysis is done on the handset. Speech recognition is used today in a large number of commercial applications.

Advances are also being made in speech synthesis, or text-to-speech (TTS). Older TTS systems generate speech completely from scratch, and tend to sound like “drunken robots”. They can be hard to listen to, and at times even incomprehensible. But newer TTS systems are much more lifelike – they use a technique called waveform concatenation, in which speech is generated from libraries of pre-recorded waveforms.

Note that VoiceXML can be used even in environments that lack speech technology. Audio output can consist entirely of pre-recorded prompts, and input can come exclusively from the keypad. While speech technology makes applications much more powerful and pleasant to use, VoiceXML also brings the advantages of web development and deployment to older styles of computer telephony applications.
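
As an illustration of a speech-free application, the following is a minimal sketch of a VoiceXML document that plays only pre-recorded audio and accepts only keypad input. It assumes VoiceXML 2.0 syntax and a platform that supports the built-in digits field type; the audio file names and the submit URL are hypothetical placeholders, not part of any real service.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: assumes VoiceXML 2.0; all file names and URLs are hypothetical. -->
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- Accept keypad (DTMF) input only, so no speech recognizer is involved. -->
  <property name="inputmodes" value="dtmf"/>

  <form id="get_pin">
    <!-- Built-in digits type: collects exactly four keypad digits. -->
    <field name="pin" type="digits?length=4">
      <prompt>
        <!-- Pre-recorded prompt played from an audio file. -->
        <audio src="prompts/enter_pin.wav"/>
      </prompt>
      <!-- Caller pressed nothing before the timeout. -->
      <noinput>
        <prompt><audio src="prompts/no_input.wav"/></prompt>
        <reprompt/>
      </noinput>
      <!-- Caller's input did not match four digits. -->
      <nomatch>
        <prompt><audio src="prompts/try_again.wav"/></prompt>
        <reprompt/>
      </nomatch>
    </field>

    <!-- Once the field is filled, send the digits to the (hypothetical) server. -->
    <block>
      <submit next="http://www.example.com/check_pin" namelist="pin" method="post"/>
    </block>
  </form>
</vxml>
```

Because every prompt is an audio file and the only active input mode is DTMF, a document along these lines would need neither a recognizer nor a synthesizer, yet it is still fetched from a web server like any other page.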

The Ubiquitous Web

The Internet extends to more devices than personal computers. Some examples are personal organizers with wireless data connections, mobile phones supporting the Wireless Application Protocol (WAP), and NTT Docomo’s i-mode phones. The future will bring more web-enabled devices: overnight delivery drop-off boxes that schedule pickups and record their contents, networked MP3 portables, vending machines that reorder supplies when running low, wall displays that download artwork, web-based stereo receivers and televisions, and many others.

Speech technology is a very natural and powerful interface for ubiquitous web devices. Microphones are much smaller than keyboards and keypads, and speakers are smaller than screens. It therefore seems quite likely that many future web devices will have on-board speech recognition (as some mobile phones do today).