VoiceXML Review - Feature Articles

Volume 1, Issue 3 - Mar. 2001

An Introduction to Speech Recognition

By Kimberlee A. Kemble

Have you ever talked to your computer? (And no, yelling at it when your Internet connection goes down or making polite chit-chat with it as you wait for all 25MB of that very important file to download doesn't count). Have you really talked to your computer? Where it actually recognized what you said and then did something as a result? If you have, then you've used a technology known as speech recognition.

VoiceXML takes speech recognition even further. Instead of talking to your computer, you're essentially talking to a web site, and you're doing this over the phone.

OK, you say, well, what exactly is speech recognition? Simply put, it is the process of converting spoken input to text. Speech recognition is thus sometimes referred to as speech-to-text.

Speech recognition allows you to provide input to an application with your voice. Just like clicking with your mouse, typing on your keyboard, or pressing a key on the phone keypad provides input to an application, speech recognition allows you to provide input by speaking. For example, you might say something like "checking account balance", to which your bank's VoiceXML application replies "one million, two hundred twenty-eight thousand, six hundred ninety eight dollars and thirty seven cents." (We can dream, can't we?)

Or, in response to hearing "Please say coffee, tea, or milk," you say "coffee" and the VoiceXML application you're calling tells you what the flavor of the day is and then asks if you'd like to place an order.

Pretty cool, wouldn't you say?

A Closer Look

The speech recognition process is performed by a software component known as the speech recognition engine. The primary function of the speech recognition engine is to process spoken input and translate it into text that an application understands. The application can then do one of two things:

The application can interpret the result of the recognition as a command. In this case, the application is a command and control application. An example of a command and control application is one in which the caller says "check balance", and the application returns the current balance of the caller's account.
If an application handles the recognized text simply as text, then it is considered a dictation application. In a dictation application, if you said "check balance," the application would not interpret the result, but simply return the text "check balance".

Note that VoiceXML 1.0 uses a command and control model for speech recognition.

Terms and Concepts

Following are a few of the basic terms and concepts that are fundamental to speech recognition. It is important to have a good understanding of these concepts when developing VoiceXML applications.

Utterances

When the user says something, this is known as an utterance. An utterance is any stream of speech between two periods of silence. Utterances are sent to the speech engine to be processed.

Silence, in speech recognition, is almost as important as what is spoken, because silence delineates the start and end of an utterance. Here's how it works. The speech recognition engine is "listening" for speech input. When the engine detects audio input--in other words, a lack of silence--the beginning of an utterance is signaled. Similarly, when the engine detects a certain amount of silence following the audio, the end of the utterance occurs.

Utterances are sent to the speech engine to be processed. If the user doesn't say anything, the engine returns what is known as a silence timeout--an indication that there was no speech detected within the expected timeframe, and the application takes an appropriate action, such as reprompting the user for input.

An utterance can be a single word, or it can contain multiple words (a phrase or a sentence). For example, "checking", "checking account," or "I'd like to know the balance of my checking account please" are all examples of possible utterances--things that a caller might say to a banking application written in VoiceXML. Whether these words and phrases are valid at a particular point in a dialog is determined by which grammars are active. (We'll present grammars in more detail later in the article.) Note that there are small snippets of silence between the words spoken within a phrase. If the user pauses too long between the words of a phrase, the end of an utterance can be detected too soon, and only a partial phrase will be processed by the engine.

Pronunciations

A speech recognition engine uses all sorts of data, statistical models, and algorithms to convert spoken input into text. One piece of information that a speech recognition engine uses to process a word is its pronunciation. A word's pronunciation represents what the speech engine thinks a word should sound like.

Words can have multiple pronunciations associated with them. For example, the word "the" has at least two pronunciations in the U.S. English language: "thee" and "thuh." As a VoiceXML application developer, you may want to provide multiple pronunciations for certain words and phrases to allow for variations in the ways your callers may speak them.

Grammars

As a VoiceXML application developer, you must specify the words and phrases that users can say to your application. These words and phrases are presented to the speech recognition engine and are used in the recognition process.

You can specify the valid words and phrases in a number of different ways, but in VoiceXML, you do this by specifying a grammar. A grammar uses a particular syntax, or set of rules, to define the words and phrases that can be recognized by the engine. A grammar can be as simple as a list of words, or it can be flexible enough to allow such variability in what can be said that it approaches natural language capability.

Grammars define the domain, or context, within which the recognition engine works. The engine compares the current utterance against the words and phrases in the active grammars. If the user says something that is not in the grammar, the speech engine will not be able to decipher it correctly.

Let's look at a specific example: "Welcome to VoiceXML Bank. At any time, say main menu to return to this point. Choose one: accounts, loans, transfers, or exit." The grammar to support this interaction might contain the following words and phrases:

Accounts
Account balances
My account information
Loans
Loan balances
My loan information
Transfers
Exit
Help

In this grammar, you can see that there are multiple ways to say each command.

You can define a single grammar for your application, or you may have multiple grammars. Chances are, you will have multiple grammars, and you will activate each grammar only when it is needed.

You can imagine that you want to put careful thought into the design of application grammars. They can be as restrictive or as flexible as your users and application require. Of course, there are tradeoffs between recognition speed (response time) and accuracy, versus the size of your grammar(s). You will need to experiment with different grammar designs to validate one that best matches the requirements and expectations of your users.

Speaker Dependence vs. Speaker Independence

Speaker dependence describes the degree to which a speech recognition system requires knowledge of a speaker's individual voice characteristics to successfully process speech. The speech recognition engine can "learn" how you speak words and phrases; it can be trained to your voice.

Speech recognition systems that require a user to train the system to his/her voice are known as speaker-dependent systems. If you are familiar with desktop dictation systems, most are speaker dependent. Because they operate on very large vocabularies, dictation systems perform much better when the speaker has spent the time to train the system to his/her voice.

Speech recognition systems that do not require a user to train the system are known as speaker-independent systems. Speech recognition in the VoiceXML world must be speaker-independent. Think of how many users (hundreds, maybe thousands) may be calling into your web site. You cannot require that each caller train the system to his or her voice. The speech recognition system in a voice-enabled web application must successfully process the speech of many different callers without having to understand the individual voice characteristics of each caller.

Accuracy

The performance of a speech recognition system is measurable. Perhaps the most widely used measurement is accuracy. It is typically a quantitative measurement and can be calculated in several ways. Arguably, the most important measurement of accuracy is whether the desired end result occurred. This measurement is useful in validating application design. For example, if the user said "yes," the engine returned "yes," and the "YES" action was executed, it is clear that the desired end result was achieved. But what happens if the engine returns text that does not exactly match the utterance? For example, what if the user said "nope," the engine returned "no," yet the "NO" action was executed? Should that be considered a successful dialog? The answer to that question is yes because the desired end result was achieved.

Another measurement of recognition accuracy is whether the engine recognized the utterance exactly as spoken. This measure of recognition accuracy is expressed as a percentage and represents the number of utterances recognized correctly out of the total number of utterances spoken. It is a useful measurement when validating grammar design. Using the previous example, if the engine returned "nope" when the user said "no," this would be considered a recognition error. Based on the accuracy measurement, you may want to analyze your grammar to determine if there is anything you can do to improve accuracy. For instance, you might need to add "nope" as a valid word to your grammar. You may also want to check your grammar to see if it allows words that are acoustically similar (for example, "repeat/delete," "Austin/Boston," and "Addison/Madison"), and determine if there is any way you can make the allowable words more distinctive to the engine.

Recognition accuracy is an important measure for all speech recognition applications. It is tied to grammar design and to the acoustic environment of the user. You need to measure the recognition accuracy and adjust your application and its grammars based on the results obtained when you test your application with typical users.

Continued...

back to the top

Copyright © 2001 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the
IEEE Industry Standards and Technology Organization (IEEE-ISTO).