Answers to Your Questions About VoiceXML
(Continued from Part 1)
Q: How is the information that the caller speaks into the phone recorded? Is it converted into text? Is it subject to high error rates due to different styles of talking?
A: DSP (Digital Signal Processing) hardware from companies such as Intel Dialogic and Natural MicroSystems digitally captures the audio coming in over the phone line. Speech recognition never "converts" the audio stream to text; instead, it uses statistical models to compute a confidence level that what the caller said matches an entry in an expected list, or "grammar," of phrases for a given interaction. This technology is quite resilient to callers with varying accents of a given language, and the underlying "acoustic" models are available for many different languages and locales (e.g., U.S. English vs. British English, or Brazilian vs. European Portuguese).
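
As a concrete illustration, the VoiceXML 2.0 fragment below sketches what such an interaction looks like. The form name, field name, and prompt wording here are invented for illustration, not taken from any production application. The recognizer only attempts to match the caller's audio against the phrases in the inline grammar, and the application can inspect the engine's confidence score afterward:

    <?xml version="1.0" encoding="UTF-8"?>
    <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
      <form id="account">
        <field name="accttype">
          <prompt>Which account: checking or savings?</prompt>
          <!-- The inline grammar is the complete list of expected phrases. -->
          <grammar xmlns="http://www.w3.org/2001/06/grammar"
                   version="1.0" root="accounts" mode="voice">
            <rule id="accounts">
              <one-of>
                <item>checking</item>
                <item>savings</item>
              </one-of>
            </rule>
          </grammar>
          <filled>
            <!-- lastresult$ carries the engine's confidence, from 0.0 to 1.0. -->
            <if cond="application.lastresult$.confidence &lt; 0.5">
              <prompt>I think you said <value expr="accttype"/>.</prompt>
            </if>
            <prompt>Okay, <value expr="accttype"/>.</prompt>
          </filled>
        </field>
      </form>
    </vxml>

If the caller says anything outside that two-phrase list, the platform raises a nomatch event rather than returning text, which is exactly the "expected list" behavior described above.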
Voice recognition is demonstrably mature enough for mission-critical applications, and production systems successfully automate hard tasks such as driving directions and stock trading every day. However, recognizing human speech is still an enormously complex computational task that relies on applying sophisticated heuristic techniques to massive statistical data models in real time. Voice recognition software does not perform optimally "out of the box"; successful applications require specialized design and tuning by qualified experts. Consider the following issues:
- People will always say unexpected things. People are accustomed to having "real" conversations over the phone; they immediately assume voice applications can understand whatever they say, and they can quickly get frustrated when their expectations aren't met. Even applications that prompt callers to choose from a short menu can consistently elicit hundreds of distinct responses. Voice applications can only "understand" the specific things they're trained for, much as people can when they first bring a phrasebook to a foreign country. Like people, voice applications do their best to match what they're hearing against the phrases they know, and they can easily mistake similar-sounding words for ones that are actually in their list. Depending on the situation, this can quickly lead to a frustrating experience. For example, consider a simple menu of keywords that includes "movies" and "restaurants." Callers who say "moving," not knowing that it is an invalid choice, are likely to be consistently thrown into "movies" and become very frustrated. For this reason, applications must use clear, concise prompting to guide callers to say "the right" things, and must use data from large amounts of real-world usability testing to account for the unexpected things people tend to say (a sketch of such prompting follows this list). Minute shifts in prompt wording or the underlying "grammars" can have dramatic effects on usability, and ultimately on the automation rate and ROI of voice applications.
- "Grammars"
must be tuned. As stated above, voice recognition
technology works by comparing what the caller said
to a specific list of expected choices. These "grammars"
are required to make it computationally feasible to
do speaker-independent voice recognition in real time.
Large grammars, such as the 10,000+ companies on U.S.
stock exchanges, can work very well in production
today- doing so requires careful attention by both
application designers and speech scientists "tuning"
the underlying recognition engine. For example, "Pfizer"
and "Fiserv" sound almost identical; the
underlying grammar must be tuned to know which choice
is more commonly selected, and the application must
be carefully crafted to help callers get back on track
when the system makes a mistake.
- People pronounce the same words differently. Pronunciations for words and phrases can vary widely across regions of a given country. Proper names complicate the matter further; consider, for example, how to pronounce "Qantas Airways" or "Worcester Court." Voice recognition engines rely on built-in dictionaries that specify each of the ways callers may say each word and common phrase. If a grammar includes a word that isn't in the dictionary, the recognition engine must "guess" how it is supposed to be pronounced. While this can work reasonably well, the engine is likely to make mistakes or miss common alternative pronunciations. Especially because voice recognition is rapidly being deployed in new industries for new applications, it is critical to ensure that all relevant pronunciations are in the dictionaries; without this, recognition quality and automation rates can suffer significantly.
- "Acoustic
Models" must be continually refined.
Voice recognizers use "acoustic models"
to decide whether a caller has said something that
matches a given grammar. Acoustic models are essentially
a mathematical representation of how a wide variety
of people sound when they say each of the building
blocks of words (e.g. "buh" or "ing").
Acoustic models are built by analyzing millions of
diverse recordings of real people actually speaking
over the telephone. The more data that gets used to
"train" these acoustic models, the better
recognition quality becomes; particularly when the
data is collected under real-world conditions using
the same hardware and software. In addition, it's
critical to ensure that the voice recognition software
has been adequately trained on all of the words and
phrases that make up the grammars for a particular
application.
- Noisy environments are problematic. Phone conversations, particularly on mobile phones, often have a lot of background noise. This noise can be ambient sound (e.g., wind or cars honking), ambient conversation (e.g., other tables in a restaurant), side speech (e.g., "Kids, I said stop it!"), or unintended sounds (e.g., a cough or sneeze). Consider how difficult it sometimes is even for "real people" to distinguish the actual conversation from background noise; the problem is compounded for voice recognition engines because they have far less intelligent context for differentiating sounds and speakers' voices from one another. Voice applications and voice recognition platforms must be carefully designed to accommodate and minimize the difficulties presented by background noise (see the recognizer-property sketch after this list).
- Hundreds of thousands of calls must be transcribed by hand. To compile the data needed to address most of the problems listed above, it is necessary to manually compare what callers "actually" say with what the voice recognition software "thought" they said. Very large numbers of calls must be manually transcribed in this fashion so that speech scientists can analyze the data and determine how accurately each grammar in an application is performing. This is a very labor-intensive process, but it is critical for giving designers the information they need to make the adjustments to call flows, grammars, prompts, pronunciation dictionaries, and acoustic models that are necessary to achieve the expected benefits of great voice applications.
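
As promised above, here is a sketch of the defensive prompting described in the first item, using VoiceXML 2.0's nomatch and noinput handlers with escalating wording. The prompt text is illustrative, not from a real deployment:

    <field name="category">
      <prompt>Main menu. You can say movies, or restaurants.</prompt>
      <grammar xmlns="http://www.w3.org/2001/06/grammar"
               version="1.0" root="menu" mode="voice">
        <rule id="menu">
          <one-of>
            <item>movies</item>
            <item>restaurants</item>
          </one-of>
        </rule>
      </grammar>
      <!-- First out-of-grammar response: apologize briefly, replay the prompt. -->
      <nomatch count="1">Sorry, I didn't catch that. <reprompt/></nomatch>
      <!-- Second failure: spell out the valid choices explicitly. -->
      <nomatch count="2">Please say just one of these choices: movies, or restaurants.</nomatch>
      <!-- Silence: nudge the caller, then replay the prompt. -->
      <noinput>Are you still there? <reprompt/></noinput>
    </field>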
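
The "Pfizer" versus "Fiserv" tuning in the second item maps directly onto SRGS grammar weights. The numbers below are invented for illustration; in practice they would come from analyzing real usage data:

    <grammar xmlns="http://www.w3.org/2001/06/grammar"
             version="1.0" root="company" mode="voice">
      <rule id="company">
        <one-of>
          <!-- Hypothetical weights biasing the recognizer toward the
               name that callers request far more often. -->
          <item weight="10.0">pfizer</item>
          <item weight="1.0">fiserv</item>
        </one-of>
      </rule>
    </grammar>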
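
Finally, for the background-noise issues in the fifth item, VoiceXML 2.0 exposes generic recognizer properties that applications can adjust. The values below are plausible starting points, not tuned recommendations:

    <!-- Make the recognizer less sensitive, so ambient noise is less
         likely to be mistaken for speech (0.0 = least sensitive). -->
    <property name="sensitivity" value="0.3"/>
    <!-- Reject results the engine is not reasonably sure about;
         rejected results raise a nomatch event. -->
    <property name="confidencelevel" value="0.6"/>
    <!-- Give callers in noisy settings a bit longer before noinput. -->
    <property name="timeout" value="5s"/>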
Q: Other than VoiceXML, are there other languages that will permit voice-activated web access?
A: There will likely be other attempts, but the industry has very rapidly converged on VoiceXML as the dominant standard. It is unlikely that another language will attain significant market share in the next several years.
Copyright © 2001 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the IEEE Industry Standards and Technology Organization (IEEE-ISTO).