An Introduction to Speech Recognition
Have you ever talked to your computer? (And no, yelling at
it when your Internet connection goes down or making
polite chit-chat with it as you wait for all 25MB of
that very important file to download doesn't count.)
Have you really talked to your computer? Where it actually
recognized what you said and then did something as a
result? If you have, then you've used a technology known
as speech
recognition.
VoiceXML
takes speech recognition even further. Instead of talking
to your computer, you're essentially talking to a web
site, and you're doing this over the phone.
OK,
you say, well, what exactly is speech recognition? Simply
put, it is the process of converting spoken input to
text. Speech recognition is thus sometimes referred
to as speech-to-text.
Speech
recognition allows you to provide input to an application
with your voice. Just like clicking with your mouse,
typing on your keyboard, or pressing a key on the phone
keypad provides input to an application, speech recognition
allows you to provide input by speaking. For example,
you might say something like "checking account
balance", to which your bank's VoiceXML application
replies "one million, two hundred twenty-eight
thousand, six hundred ninety eight dollars and thirty
seven cents." (We can dream, can't we?)
Or,
in response to hearing "Please say coffee, tea,
or milk," you say "coffee" and the VoiceXML
application you're calling tells you what the flavor
of the day is and then asks if you'd like to place an
order.
Pretty
cool, wouldn't you say?
A Closer Look
The
speech recognition process is performed by a software
component known as the speech
recognition engine. The primary function of the
speech recognition engine is to process spoken input
and translate it into text that an application understands.
The application can then do one of two things:
- The
application can interpret the result of the recognition
as a command. In this case, the application is a command
and control application.
An example of a command and control application is
one in which the caller says "check balance",
and the application returns the current balance of
the caller's account.
- If
an application handles the recognized text simply
as text, then it is considered a dictation application.
In a dictation application,
if you said "check balance," the application
would not interpret the result, but simply return
the text "check balance".
Note
that VoiceXML 1.0 uses a command and control model for
speech recognition.
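The command and control model is easiest to see in markup. The following is a minimal sketch, assuming a VoiceXML 1.0 platform that accepts inline JSGF-style grammars (a common vendor convention at the time); the form, field, and target names are hypothetical:

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Hypothetical banking dialog: the recognized text is
       interpreted as a command, not returned as plain text -->
  <form id="checkBalance">
    <field name="command">
      <prompt>Say check balance or main menu.</prompt>
      <grammar type="application/x-jsgf">
        check balance | main menu
      </grammar>
      <filled>
        <if cond="command == 'check balance'">
          <!-- Act on the command: jump to a form that speaks the balance -->
          <goto next="#playBalance"/>
        <else/>
          <goto next="#mainMenu"/>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```

The key point is the `<filled>` block: the application never hands the raw text back to the caller; it branches on it, which is what makes this command and control rather than dictation.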
Terms and Concepts
Following
are a few of the basic terms and concepts that are fundamental
to speech recognition. It is important to have a good
understanding of these concepts when developing VoiceXML
applications.
Utterances
When
the user says something, this is known as an utterance.
An utterance is any stream of speech between two periods
of silence. Utterances are sent to the speech engine
to be processed.
Silence, in speech recognition, is almost as important
as what is spoken, because silence delineates the start
and end of an utterance. Here's how it works. The speech
recognition engine is "listening" for speech
input. When the engine detects audio input--in other
words, a lack of silence--the beginning of an utterance
is signaled. Similarly, when the engine detects a certain
amount of silence following the audio, the end of the
utterance occurs.
If the user doesn't say anything, the engine returns
what is known as a silence timeout--an indication that
no speech was detected within the expected timeframe--and
the application takes an appropriate action, such
as reprompting the user for input.
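In VoiceXML, the silence timeout surfaces as a noinput event that the application can catch and handle. A minimal sketch, with illustrative field name and wording:

```xml
<field name="choice">
  <prompt>Please say coffee, tea, or milk.</prompt>
  <grammar type="application/x-jsgf">
    coffee | tea | milk
  </grammar>
  <!-- Fired when the engine reports a silence timeout -->
  <noinput>
    <prompt>Sorry, I didn't hear you.</prompt>
    <reprompt/>
  </noinput>
</field>
```

The `<reprompt/>` element replays the field's prompt, giving the caller another chance to respond.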
An utterance can be a single word, or it can contain
multiple words (a phrase or a sentence). For example,
"checking", "checking account,"
or "I'd like to know the balance of my checking
account please" are all examples of possible utterances--things
that a caller might say to a banking application written
in VoiceXML. Whether these words and phrases are valid
at a particular point in a dialog is determined by which
grammars are active. (We'll present grammars
in more detail later in the article.) Note that there
are small snippets of silence between the words spoken
within a phrase. If the user pauses too long between
the words of a phrase, the end of an utterance can be
detected too soon, and only a partial phrase will be
processed by the engine.
Pronunciations
A speech recognition engine uses all sorts of data,
statistical models, and algorithms to convert spoken
input into text. One piece of information that a speech
recognition engine uses to process a word is its pronunciation.
A word's pronunciation represents what the speech engine
thinks a word should sound like.
Words can have multiple pronunciations associated with
them. For example, the word "the" has at least
two pronunciations in U.S. English: "thee"
and "thuh." As a VoiceXML application developer,
you may want to provide multiple pronunciations for
certain words and phrases to allow for variations in
the ways your callers may speak them.
Grammars
As a VoiceXML application developer, you must specify
the words and phrases that users can say to your application.
These words and phrases are presented to the speech
recognition engine and are used in the recognition process.
You can specify the valid words and phrases in a number
of different ways, but in VoiceXML, you do this by specifying
a grammar. A grammar uses
a particular syntax, or set of rules, to define the
words and phrases that can be recognized by the engine.
A grammar can be as simple as a list of words, or it
can be flexible enough to allow such variability in
what can be said that it approaches natural language
capability.
Grammars define the domain, or context, within which
the recognition engine works. The engine compares the
current utterance against the words and phrases in the
active grammars. If the user says something that is
not in the grammar, the speech engine will not be able
to decipher it correctly.
Let's look at a specific example: "Welcome to VoiceXML
Bank. At any time, say main menu to return to this point.
Choose one: accounts, loans, transfers, or exit."
The grammar to support this interaction might contain
the following words and phrases:
- Accounts
- Account balances
- My account information
- Loans
- Loan balances
- My loan information
- Transfers
- Exit
- Help
In
this grammar, you can see that there are multiple ways
to say each command.
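One way to express this list is as a grammar of alternatives. The sketch below uses the inline JSGF style accepted by many VoiceXML 1.0 platforms; the exact grammar format is platform-dependent:

```xml
<grammar type="application/x-jsgf">
  accounts | account balances | my account information
  | loans | loan balances | my loan information
  | transfers | exit | help
</grammar>
```

Each alternative is a complete phrase the caller may speak; adding a phrase to the list is all it takes to make it recognizable while this grammar is active.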
You can define a single grammar for your application,
or you may have multiple grammars. Chances are, you
will have multiple grammars, and you will activate each
grammar only when it is needed.
You can imagine that you want to put careful thought
into the design of application grammars. They can be
as restrictive or as flexible as your users and application
require. Of course, there are tradeoffs: as a grammar
grows, recognition speed (response time) and accuracy
tend to suffer. You will need to experiment with
different grammar designs to find the one that best
matches the requirements and expectations of your users.
Speaker Dependence vs. Speaker Independence
Speaker dependence describes
the degree to which a speech recognition system requires
knowledge of a speaker's individual voice characteristics
to successfully process speech. The speech recognition
engine can "learn" how you speak words and
phrases; it can be trained to your voice.
Speech recognition systems that require a user to train
the system to his or her voice are known as speaker-dependent
systems. If you are familiar with desktop dictation
systems, most are speaker dependent. Because they operate
on very large vocabularies, dictation systems perform
much better when the speaker has spent the time to train
the system to his or her voice.
Speech recognition systems that do not require a user
to train the system are known as speaker-independent
systems. Speech recognition in the VoiceXML world must
be speaker-independent. Think of how many users (hundreds,
maybe thousands) may be calling into your web site.
You cannot require that each caller train the system
to his or her voice. The speech recognition system in
a voice-enabled web application must successfully
process the speech of many different callers without
having to understand the individual voice characteristics
of each caller.
Accuracy
The performance of a speech recognition system is measurable.
Perhaps the most widely used measurement is accuracy.
It is typically a quantitative measurement and can be
calculated in several ways. Arguably, the most important
measurement of accuracy is whether the desired end result
occurred. This measurement is useful in validating application
design. For example, if the user said "yes,"
the engine returned "yes," and the "YES"
action was executed, it is clear that the desired end
result was achieved. But what happens if the engine
returns text that does not exactly match the utterance?
For example, what if the user said "nope,"
the engine returned "no," yet the "NO"
action was executed? Should that be considered a successful
dialog? The answer to that question is yes because the
desired end result was achieved.
Another measurement of recognition accuracy is whether
the engine recognized the utterance exactly as
spoken. This measure of recognition accuracy is expressed
as a percentage and represents the number of utterances
recognized correctly out of the total number of utterances
spoken. It is a useful measurement when validating grammar
design. Using the previous example, if the engine returned
"no" when the user said "nope,"
this would be considered a recognition error. Based
on the accuracy measurement, you may want to analyze
your grammar to determine if there is anything you can
do to improve accuracy. For instance, you might need
to add "nope" as a valid word to your grammar.
You may also want to check your grammar to see if it
allows words that are acoustically similar (for example,
"repeat/delete," "Austin/Boston,"
and "Addison/Madison"), and determine if there
is any way you can make the allowable words more distinctive
to the engine.
Recognition accuracy is an important measure for all
speech recognition applications. It is tied to grammar
design and to the acoustic environment of the user.
You need to measure the recognition accuracy and adjust
your application and its grammars based on the results
obtained when you test your application with typical
users.
Copyright © 2001 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the IEEE Industry
Standards and Technology Organization (IEEE-ISTO).