Enabling Speech & Multimodal Services on Mobile Devices:
The ETSI Aurora DSR Standards & 3GPP Speech Enabled Services
By David Pearce
The desire for improved user interfaces for distributed
speech and multimodal services on mobile devices has
motivated the need for reliable recognition performance
over mobile channels. Performance needs to be robust
both to background noise and to any errors introduced
by the mobile transmission channel. There has been
much work in the telecommunications standards bodies
of ETSI and 3GPP to develop standards to achieve this
and enable interoperable services of high performance.
This paper provides an overview of the latest Distributed
Speech Recognition (DSR) standards that will be used
to support mobile speech services.
3GPP
(3rd Generation Partnership Project) is the body
that sets the standards for GSM and UMTS mobile communications.
In June 2004 3GPP approved the DSR Extended Advanced
Front-end as the recommended codec for “Speech
Enabled Services”. This selection was based on
extensive evaluations undertaken by two of the leading
ASR vendors (IBM & ScanSoft) that confirmed the
performance advantages of DSR compared to the normal
voice codec. The significance of the selection by 3GPP
is that DSR will find widespread deployment in future
GSM and 3G mobile handsets that will usher in a new
wave of applications both for speech only services
and for distributed multimodal interfaces. This brings
with it implications not only for handset device manufacturers
but also for server vendors and application developers.
These developments are likely to be of interest to
the VoiceXML community as it extends the reach of applications
(existing and new) to the large numbers of mobile users
while delivering substantial improvements in performance
compared to using the normal mobile voice channel.
By transporting the DSR speech features on the packet
data channel, speech can be easily combined with other
media, enabling new distributed multimodal services
on a single data channel. Thus DSR can be seen as an
enabler for the use of VoiceXML in multimodal services
using the capabilities of XHTML + VoiceXML (X+V).
1. Introduction to mobile services & Distributed Speech Recognition

It is estimated that there are now 1.4 billion mobile
phone subscribers worldwide and the numbers continue
to grow. The market was originally fueled by person-to-person
voice communications and this remains the dominant “application”.
Recently we have seen increasingly sophisticated devices
packed with many new features including messaging,
cameras, browsers, games and music. Alongside device
developments the mobile networks have improved, giving
increased coverage and widespread availability of
2.5G packet data services such as GPRS. There is also the prospect
of many new deployments of 3G networks, bringing much
larger bandwidths to mobile users. The 2.5G and 3G
data capabilities provide the opportunity to deliver
a range of different audio and visual information to
the user’s device and enable access to “content” while
on the move. The user interface for these devices has
certainly improved but the small keypad remains a barrier
to data entry. Reliable speech input holds the potential
to help greatly. Alongside pure speech input and output,
the benefits of a multimodal interface are well appreciated.
The ability to combine alternative input modalities
(e.g. speech and/or keypad) with visual (e.g. graphics,
text, pictures) and/or audio output can greatly enhance
the user experience and effectiveness of the interaction.
For some applications it is best to use a recognizer
on the device itself, e.g. for interfacing to the phone
functions and voice dialing from a personal address
book. Although the computational power of these devices
is increasing, the complexity of medium and large vocabulary
speech recognition systems is beyond the memory and
computational resources of many devices. Also, the delay
associated with downloading speech data files (e.g. grammars,
acoustic models, language models, vocabularies) may
be prohibitive, or the data may be confidential (e.g. a
corporate directory).
Server-side processing of the combined speech input
and speech output can overcome many of these constraints
by taking full advantage of memory and processing power
as well as specialized speech engines and data files.
New applications can also be more easily introduced,
refined, extended and upgraded at the server.
So, with the speech input remote from the recognition
engine in the server, we are faced with the challenge
of how to obtain reliable recognition performance over
the mobile network and hence be robust to the wireless
transmission channel. In addition we would like to
have an architecture that can provide a multimodal
user interface. These have been two motivators that
have led to the creation of the standards for Distributed
Speech Recognition (DSR):
- Improved recognition performance over wireless channels.
  The use of DSR avoids the degradations introduced by the
  speech codec and channel transmission errors over mobile
  voice channels:
  - By using a packet data channel (for example GPRS for
    GSM) to transport the DSR features instead of the
    circuit switched voice channel that is normally used
    for voice calls, the effects of channel transmission
    errors are greatly reduced and consistent performance
    is obtained over the coverage area.
  - By performing the front-end processing in the device
    directly on the speech waveform, rather than after
    transcoding with a voice codec, the degradations
    introduced by the codec are avoided.
  - In addition, the DSR advanced front-end is very noise
    robust, halving the error rate in background noise
    compared to the mel-cepstrum front-end and giving
    robust performance for mobile users, who are often
    calling from environments with background noise.
- Ease of integration of combined speech and data
  applications for multimodal interfaces. In addition to
  applications using only speech input and speech output,
  the benefits of multimodal interaction are now well
  appreciated. In such multimodal interfaces, different
  modes of input (including speech or keypad) may be used,
  and different media for output (e.g. audio or visual on
  the device display) are used to convey the information
  back to the user. The use of DSR enables these to operate
  over a single wireless data transport rather than having
  separate speech and data channels. As such, DSR is a
  building block for distributed multimodal interfaces.
2. The ETSI DSR Advanced Front-end Standard ES 202 050
Between
1999 and 2002 ETSI Aurora conducted a competitive
selection process to create an Advanced DSR front-end
standard that would provide improved robustness compared
to the mel-cepstrum front-end. To support this, a new
performance evaluation process and associated speech
databases were created to allow comparison between
candidates [8]. Three sets of noisy databases were used
for these performance evaluations:
- Aurora-2: connected digits with simulated addition of noises
- Aurora-3: connected digits from real-world data collected in vehicles (5 languages)
- Aurora-4: large vocabulary Wall Street Journal dictation with simulated noise addition.
A scoring procedure was agreed that gave appropriate
weight to the results from each of the databases. The
winning candidate gave an average of 53% reduction
in word error rate compared to the DSR mel-cepstrum
standard (ES 202 108). Details of the Aurora-3 performance
results are given in section 2.1 below.
The front-end calculation is a frame-based scheme
that produces an output vector every 10 ms. In the
front-end feature extraction, noise reduction by two
stages of Wiener filtering is performed first. Waveform
processing is applied to the de-noised signal and mel-cepstral
features are calculated. Finally, blind equalization
is applied to the cepstral features.
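As an illustration of the noise-reduction idea (not the ES 202 050 algorithm itself, which specifies two cascaded Wiener filter stages with mel-domain smoothing), a single-stage spectral-domain Wiener filter might look like the sketch below. The frame length, noise-estimation window and gain floor are illustrative assumptions, not values from the standard:

```python
import numpy as np

def wiener_denoise(signal, frame_len=200, noise_frames=10, gain_floor=0.1):
    """Much-simplified single-stage Wiener filter sketch.

    Estimates the noise spectrum from the first few frames and
    applies the classic Wiener gain SNR/(1+SNR) per frequency bin.
    Frames are simply concatenated (no overlap-add) for brevity.
    """
    window = np.hanning(frame_len)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    spectra = np.fft.rfft(frames * window, axis=1)
    power = np.abs(spectra) ** 2
    noise_power = power[:noise_frames].mean(axis=0)   # noise estimate
    snr = np.maximum(power / (noise_power + 1e-12) - 1.0, 0.0)
    gain = np.maximum(snr / (1.0 + snr), gain_floor)  # Wiener gain
    denoised = np.fft.irfft(spectra * gain, n=frame_len, axis=1)
    return denoised.reshape(-1)
```

The real front-end then applies waveform processing and blind equalization on top of a second, refined Wiener stage; this sketch only captures the core gain rule.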
The features produced from the Advanced Front-end
are the familiar 12 cepstral coefficients plus C0 and log
energy, the same as for the mel-cepstrum standard, to
ensure easy integration with existing server recognition
technology.
The compression algorithm for the cepstral features
uses the same split vector quantisation scheme as the
earlier standard but with the quantiser tables retrained
for the Advanced Front-end.
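A minimal sketch of the split vector quantisation idea, assuming 2-coefficient subvectors; the 64-entry random codebooks here are purely illustrative, whereas the standards specify the actual bit allocations and trained tables:

```python
import numpy as np

def split_vq_encode(features, codebooks):
    """Split VQ sketch: quantise each 2-coefficient subvector with
    its own codebook, returning one codeword index per subvector."""
    assert len(features) == 2 * len(codebooks)
    indices = []
    for i, cb in enumerate(codebooks):
        sub = features[2 * i: 2 * i + 2]
        # nearest codeword in squared Euclidean distance
        indices.append(int(np.argmin(np.sum((cb - sub) ** 2, axis=1))))
    return indices

def split_vq_decode(indices, codebooks):
    """Reassemble the feature vector from the chosen codewords."""
    return np.concatenate([cb[j] for cb, j in zip(codebooks, indices)])

# 14 features (12 cepstra + C0 + log energy) -> 7 subvectors of 2
rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((64, 2)) for _ in range(7)]
vec = rng.standard_normal(14)
rec = split_vq_decode(split_vq_encode(vec, codebooks), codebooks)
```

Transmitting only the per-subvector indices, rather than the coefficients themselves, is what brings the feature stream down to its low fixed bit rate.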
2.1 Performance results on Aurora 3 database
In
this section results are presented for the five languages
making up the Aurora 3 database and using
the Hidden Markov Toolkit (HTK) recogniser in its “simple” configuration,
i.e. 3 mixtures per state. The row in each table labelled “0.4W+0.35M+0.25H”
represents the weighted average of the well matched, medium mismatch
and high mismatch results. Table 1 shows the absolute
performance for DSR using the Mel-Cepstrum Front-End,
which then serves as a baseline for the performance
comparisons with the Advanced Front-end.
Aurora 3, Mel-Cepstrum Front-End: absolute performance

| Training Mode    | Italian | Finnish | Spanish | German | Danish | Average |
|------------------|---------|---------|---------|--------|--------|---------|
| Well Matched     | 92.39%  | 92.00%  | 92.51%  | 91.00% | 86.24% | 90.83%  |
| Medium Mismatch  | 74.11%  | 78.59%  | 83.60%  | 79.50% | 64.45% | 76.05%  |
| High Mismatch    | 50.16%  | 35.62%  | 52.30%  | 72.85% | 35.01% | 49.19%  |
| 0.4W+0.35M+0.25H | 75.43%  | 73.21%  | 79.34%  | 82.44% | 65.81% | 75.25%  |

Table 1
The top half of Table 2 shows the absolute performance
that is obtained when the speech is processed by
the DSR Advanced Front-end. The bottom half of
the table shows the relative performance when compared
to the DSR baseline that was given in Table 1.
Aurora 3, Advanced Front-End: absolute performance

| Training Mode    | Italian | Finnish | Spanish | German | Danish | Average |
|------------------|---------|---------|---------|--------|--------|---------|
| Well Matched     | 96.90%  | 95.99%  | 96.66%  | 95.15% | 93.65% | 95.67%  |
| Medium Mismatch  | 93.41%  | 80.10%  | 93.73%  | 89.60% | 81.10% | 87.59%  |
| High Mismatch    | 88.64%  | 84.77%  | 90.50%  | 91.30% | 78.35% | 86.71%  |
| 0.4W+0.35M+0.25H | 93.61%  | 87.62%  | 94.09%  | 92.25% | 85.43% | 90.60%  |

Performance relative to Mel-Cepstrum Front-End

| Training Mode    | Italian | Finnish | Spanish | German | Danish | Average |
|------------------|---------|---------|---------|--------|--------|---------|
| Well Matched     | 59.26%  | 49.87%  | 55.41%  | 46.11% | 53.85% | 52.90%  |
| Medium Mismatch  | 74.55%  | 7.05%   | 61.77%  | 49.27% | 46.84% | 47.89%  |
| High Mismatch    | 77.21%  | 76.34%  | 80.08%  | 67.96% | 66.69% | 73.66%  |
| 0.4W+0.35M+0.25H | 69.10%  | 41.50%  | 63.80%  | 52.68% | 54.60% | 56.34%  |

Table 2
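The “0.4W+0.35M+0.25H” rows and the relative figures in Table 2 follow directly from the tabulated accuracies, as a quick check on the Italian column shows:

```python
def weighted_accuracy(well, medium, high):
    """The "0.4W+0.35M+0.25H" row: weighted average of the well
    matched, medium mismatch and high mismatch accuracies."""
    return 0.4 * well + 0.35 * medium + 0.25 * high

def relative_error_reduction(baseline_acc, new_acc):
    """Relative reduction in word error rate, as reported in the
    bottom half of Table 2 (accuracies are in percent)."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return 100.0 * (baseline_err - new_err) / baseline_err

# Italian column:
print(round(weighted_accuracy(92.39, 74.11, 50.16), 2))  # 75.43, as in Table 1
print(round(relative_error_reduction(92.39, 96.90), 2))  # 59.26, as in Table 2
```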
As
shown in the tables above, the Advanced front-end
consistently gives half the error rate compared to
the mel-cepstrum. It provides state-of-the-art robustness
to background noise which was the major performance
criterion in the standard selection. The other important
aspect of robustness is robustness to channel
transmission errors. DSR
has been demonstrated to be very robust in this dimension
too and it is possible to achieve negligible degradation
in performance when tested over realistic mobile
channel operating conditions. Reference [10] provides
a review of channel robustness issues.
2.2 VAD
Compared to the DSR mel-cepstrum standard, one further
enhancement coming from the Advanced Front-end is the
inclusion of a bit in the bitstream to communicate
voice activity detection (VAD) information. The VAD algorithm marks each
10 ms frame in an utterance as speech/non-speech so
that this information can optionally be used for frame
dropping at the server recogniser. During recognition,
frame dropping reduces insertion errors in pauses
between the spoken words, particularly in noisy utterances,
and can also be used for endpointing in training. It has
been found that performance is particularly helped
by model training with endpointed data. The VAD information
can also be used to reduce response time latencies
experienced by users in deployed applications by giving
early information on utterance completion.
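Server-side frame dropping driven by the VAD bit can be sketched as follows; the “hangover” of trailing frames kept after each speech region is an illustrative assumption to avoid clipping weak word endings, not a parameter of the standard:

```python
def drop_nonspeech_frames(frames, vad_bits, hangover=2):
    """Keep frames flagged as speech by the per-frame VAD bit,
    plus a short trailing hangover after each speech region.

    frames   -- sequence of feature vectors (one per 10 ms frame)
    vad_bits -- 1 for speech, 0 for non-speech, same length
    """
    kept = []
    since_speech = hangover + 1          # start outside any speech region
    for frame, is_speech in zip(frames, vad_bits):
        since_speech = 0 if is_speech else since_speech + 1
        if since_speech <= hangover:
            kept.append(frame)
    return kept

# frames 2-3 and 8 are speech; two trailing frames follow each region
print(drop_nonspeech_frames(list(range(10)),
                            [0, 0, 1, 1, 0, 0, 0, 0, 1, 0]))
# -> [2, 3, 4, 5, 8, 9]
```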
3. The ETSI DSR Extended Front-end Standards ES 202 211 & ES 202 212
ES 202 211 is an extension of the mel-cepstrum DSR
Front-end standard ES 201 108 [2]. The mel-cepstrum
front-end provides the features for speech recognition
but these are not available for human listening. The
purpose of the extension is to allow the reconstruction
of the speech waveform from these features so that
they can be replayed. The front-end feature extraction
part of the processing is exactly the same as for ES
201 108. To allow speech reconstruction additional
fundamental frequency (perceived as pitch) and voicing
class (e.g. non-speech, voiced, unvoiced and mixed)
information is needed. This extra information is provided
by the extended front-end processing algorithms on the
device side and is compressed and transmitted along
with the front-end features to the
server. This extra information may also be useful for
improved speech recognition performance with tonal
languages such as Mandarin, Cantonese and Thai. The
compressed extension bits need an extra 800 bps on
top of the 4800 bps for the cepstral features.
In a similar way, ES 202 212 is the extension of the
DSR Advanced Front-end ES 202 050.
Figure 1: Extended DSR front-ends
One of the main use cases for the reconstruction is
to assist dialogue design and refinement. During pre-deployment
trials of services it is desirable to be able to listen
to dialogues and check the overall flow of the application
and refine the vocabulary used in the grammars. For
this and other applications of the reconstruction the
designer needs to be able to replay what was spoken
to the system at the server (off-line) and understand
what was spoken. To test the intelligibility of the reconstructed
speech, two evaluations were conducted. The first is
a formal listening test for intelligibility called
the Diagnostic Rhyme Test (DRT) that was conducted
by Dynastat listening laboratories. The results of
this are shown in table 3. For comparison the MELP
codec used for military communications was chosen as
a suitable reference. The DSR reconstruction performs
as well as MELP in the DRT tests giving confidence
that the intelligibility is good. The transcription
task is closer to what would occur in an actual application.
For this a professional transcription house was used
to transcribe sentences from the Wall Street Journal
that had been passed through the DSR reconstruction
and other reference codecs. Table 4 shows the results,
with around 1% transcription errors for the DSR reconstructions.
| Coder               | Clean | Car 10dB | Street 15dB | Babble 15dB |
|---------------------|-------|----------|-------------|-------------|
| Unprocessed         | 95.7  | 95.5     | 92.4        | 93.8        |
| XFE Reconstruction  | 93.0  | 88.8     | 85.0        | 87.1        |
| XAFE Reconstruction | 92.8  | 88.9     | 87.5        | 87.9        |
| LPC-10              | 86.9  | 81.3     | 81.2        | 81.2        |
| MELP                | 91.6  | 86.8     | 85.0        | 85.3        |

Table 3: Intelligibility listening tests using Diagnostic Rhyme Tests (conducted by Dynastat listening laboratory)
| Coder               | Clean  | Car     | Street  | Babble  | Clean   | Average Error (%) |
|---------------------|--------|---------|---------|---------|---------|-------------------|
| Unprocessed         | 1,1,2  | 1,0,1   | 0,2,4   | 3,9,3   | 0,4,1   | 0.6 |
| XFE Reconstruction  | 1,6,1  | 0,3,6   | 2,9,4   | 5,9,2   | 1,4,5   | 1.0 |
| XAFE Reconstruction | 0,6,2  | 0,5,4   | 0,4,3   | 3,5,2   | 1,6,5   | 0.8 |
| LPC-10              | 8,18,6 | 62,26,7 | 67,22,7 | 47,12,3 | 18,10,9 | 5.5 |
| MELP                | 0,3,1  | 1,6,3   | 4,6,2   | 16,10,3 | 1,9,5   | 1.2 |
| No. of words in message | 1166 | 1153  | 1155    | 1149    | 1204    | Total: 5827 |

Table 4: Listening test transcription task results: number of missed / wrongly transcribed / partially transcribed words
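The “Average Error” column in Table 4 is simply the error triples summed over the five listening conditions and divided by the total word count; a quick check on two rows reproduces the published values to rounding:

```python
def average_error_pct(triples, words_per_condition):
    """Total missed + wrongly + partially transcribed words across
    all conditions, as a percentage of the total word count."""
    total_errors = sum(sum(t) for t in triples)
    return 100.0 * total_errors / sum(words_per_condition)

words = [1166, 1153, 1155, 1149, 1204]   # words per listening condition

xfe  = [(1, 6, 1), (0, 3, 6), (2, 9, 4), (5, 9, 2), (1, 4, 5)]
melp = [(0, 3, 1), (1, 6, 3), (4, 6, 2), (16, 10, 3), (1, 9, 5)]

print(round(average_error_pct(xfe, words), 1))    # 1.0, as in Table 4
print(round(average_error_pct(melp, words), 1))   # 1.2
```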
The
pitch feature was also tested for tonal language
recognition of Mandarin and Cantonese and shown to
give better performance than proprietary pitch extraction
algorithms. Further information about the extension
algorithms and their performance can be found in
references [6, 7].
4. Transport Protocols: the IETF RTP Payload Formats for DSR
In addition to the standards for the front-end features
themselves the protocols for the transport of these
features from the device to the server are also needed.
The IETF Real-time Transport Protocol (RTP) is a well established
mechanism for the transport of many different media
types including video, VoIP and music. Associated
with RTP is the SIP protocol for session initiation
and codec negotiation. By defining an RTP payload format for
the DSR features, services benefit from all the added
functionality of this set of protocols as well as the
support of other media types for multimodal applications.
The RTP payload formats for DSR have been agreed at
the IETF [11,12].
Within these payloads any number of frame pairs may
be sent within a packet. For the front-ends on their
own this takes 12 bytes per frame pair and with the
extension it is 14 bytes per frame pair. The choice
of the number of frame pairs to send in each payload
depends on the latency and bandwidth of the channel
available.
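These per-frame-pair sizes translate directly into feature bit rates, since each frame pair covers 2 x 10 ms of speech (i.e. 50 frame pairs per second):

```python
def dsr_payload_bytes(frame_pairs, extended=False):
    """DSR feature bytes in one RTP payload: 12 bytes per frame
    pair for the front-ends alone, 14 with the extension."""
    return frame_pairs * (14 if extended else 12)

def feature_bitrate_bps(extended=False):
    """50 frame pairs per second (2 x 10 ms frames per pair)."""
    return dsr_payload_bytes(1, extended) * 8 * 50

print(feature_bitrate_bps())       # 4800 bit/s for the cepstral features
print(feature_bitrate_bps(True))   # 5600 bit/s with the 800 bit/s extension
```

The 4800 and 5600 bit/s figures match the rates quoted in section 3 for the plain and extended front-ends.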
For a GSM GPRS channel, the raw uplink data capacity
available is 20 bytes per 20 ms slot (i.e. 8 kbit/s). The
total overhead for the protocol headers in the stack
can be quite high, as shown in the table below.
| Protocol layer | Size (bytes) |
|----------------|--------------|
| RTP            | 12 |
| UDP            | 8  |
| IP             | 20 |
| LLC+SNDCP      | 10 |
| Total          | 50 |
In future networks it is expected that header compression
(RoHC) will be available, reducing the 40 bytes for
the RTP, UDP and IP layers to about 4 bytes. For current
GPRS networks the use of 4 or 8 frame pairs per payload
is a good compromise, while for future networks with
RoHC it can be lower, i.e. down to 1 frame pair per payload.
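Combining the header stack with the payload sizes shows the bandwidth/latency trade-off behind these recommendations. This is a sketch assuming plain 12-byte frame pairs and the approximate 4-byte RoHC header figure quoted above:

```python
HEADER_BYTES = {"RTP": 12, "UDP": 8, "IP": 20, "LLC+SNDCP": 10}

def uplink_bitrate_bps(frame_pairs_per_payload, rohc=False):
    """Total uplink bit rate for a DSR feature stream, assuming
    RoHC compresses the 40 bytes of RTP/UDP/IP headers to about
    4 bytes while the LLC+SNDCP overhead is unchanged."""
    payload = 12 * frame_pairs_per_payload
    header = (4 if rohc else 40) + HEADER_BYTES["LLC+SNDCP"]
    packets_per_second = 50.0 / frame_pairs_per_payload  # 20 ms per pair
    return (payload + header) * 8 * packets_per_second

# 8 frame pairs: (96 + 50) bytes every 160 ms
print(uplink_bitrate_bps(8))             # 7300.0 bit/s, plain GPRS
print(uplink_bitrate_bps(8, rohc=True))  # 5500.0 bit/s with RoHC
```

Larger payloads amortise the fixed header cost but add packetisation delay (160 ms of speech per packet at 8 frame pairs), which is why smaller payloads become attractive once RoHC shrinks the per-packet overhead.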
For speech output it is expected that speech encoded
for the network of the target device will be used, e.g.
AMR for GSM devices. This output speech is also transported
over the GPRS packet data network using RTP.
It is common to have a 4-slot downlink on GPRS networks,
but even so it is recommended to use lower AMR data
rates such as 4.75 kbit/s to keep up with the real-time
replay requirements.
5.
Speech Enabled Services in the 3rd Generation Partnership
Project (3GPP)
3GPP
is the body that sets the standards for GSM and UMTS
mobile communications. In 2002 3GPP conducted
a study and produced a technical report on the feasibility
of speech enabled services. The technical report [13]
provides an overview of the speech and multimodal services
envisaged and a new work item called Speech Enabled
Services (SES) was started. The SA4 codecs group within
3GPP has responsibility for the selection and recommendation
of the codec for SES. A selection procedure was agreed
in this working group consisting of “design constraints”, “test
and processing plan” and “recommendation
criteria” in the usual way. Two candidates for
the SES codec were considered: AMR and AMR-WB (the
existing voice codecs for 3GPP) and the DSR Extended
Advanced Front-end. Both were used over the
packet data channel rather than the circuit switched
channel, which also suffers degradations due to
the effects of transmission errors. To justify the
introduction of a new codec for SES services it was
seen as necessary to provide substantial performance
gain compared to the existing voice codec. Rather than
using HTK for the performance evaluations, it was decided
that it would be best to use the talents of major server
recognition vendors. This would
enable a comparison between the performance that would
be obtained by a service either with DSR or using the
AMR voice codec. Two ASR vendors volunteered to undertake
the extensive testing, IBM and SpeechWorks (now ScanSoft).
The performance evaluations were conducted over a wide
range of different databases, some brought from 3GPP
and some proprietary databases owned by the ASR vendors.
Testing covered many different languages (German, Italian,
Spanish, Japanese, US English, Mandarin), environments
(handheld, vehicle) and tasks (digits, name dialling,
place names ….). In addition the codecs were
tested under block transmission errors. Results were
reported at the SA4#30 meeting in February 2004 in Malaga
and are summarised below. Note that results from both
ASR vendors have been averaged in these tables to
preserve the anonymity of the source.
5.1 Results from ASR vendor evaluations in 3GPP
| 8 kHz              | No. of databases tested | AMR 4.75 average absolute performance | DSR average absolute performance | Average improvement |
|--------------------|-------------------------|---------------------------------------|----------------------------------|---------------------|
| Digits             | 11 | 13.2 | 7.7 | 39.9% |
| Sub-word           | 5  | 9.1  | 6.5 | 30.0% |
| Tone confusability | 1  | 3.6  | 3.1 | 14.8% |
| Channel errors     | 4  | 6.1  | 2.4 | 52.8% |
| Weighted average   |    |      |     | 36%   |

Table 5: Low data-rate test
| 8 kHz              | No. of databases tested | AMR 12.2 average absolute performance | DSR average absolute performance | Average improvement |
|--------------------|-------------------------|---------------------------------------|----------------------------------|---------------------|
| Digits             | 11 | 10.9 | 7.7 | 27.6% |
| Sub-word           | 5  | 7.1  | 6.4 | 14.5% |
| Tone confusability | 1  | 3.8  | 3.1 | 19.7% |
| Channel errors     | 4  | 5.5  | 2.4 | 40.9% |
| Weighted average   |    |      |     | 25%   |

Table 6: High data-rate test at 8 kHz
| 16 kHz           | No. of databases tested | AMR-WB 12.65 average absolute performance | DSR average absolute performance | Average improvement |
|------------------|-------------------------|-------------------------------------------|----------------------------------|---------------------|
| Digits           | 8 | 9   | 5.6 | 35%   |
| Sub-word         | 5 | 8.2 | 5.9 | 23.5% |
| Channel errors   | 4 | 6.1 | 3.4 | 42.2% |
| Weighted average |   |     |     | 31%   |

Table 7: High data-rate test at 16 kHz
The results show a substantial performance advantage
for DSR compared to AMR both at 8 kHz and at 16 kHz.
Based on the agreed recommendation criteria, DSR
was selected in SA4 for SES [14] and approved by 3GPP
SA in June 2004.
6.
Conclusions
The performance advantages of DSR have been clear for
a while, but to some extent the deployment of DSR
in the market has been constrained by the need to simultaneously
develop both ends of the system, i.e. DSR in the mobile
devices and in the network recognition servers. It has
been a bit of a “chicken and egg” conundrum,
with server vendors waiting to see widespread availability
of DSR in handsets before making product commitments,
and handset manufacturers similarly asking “where
are the recognition servers to support applications?”.
The DSR Extended Advanced Front-end standards are now
in place together with their transport protocols. The
2.5G GPRS networks providing the data connectivity
to support packet switched speech and multimodal
services are already widely deployed, and the wider
bandwidths of 3G data networks are being launched.
With the adoption of DSR by 3GPP we have an egg!
References
[1] D. Pearce, “Enabling New Speech Driven Services for Mobile Devices: An overview of the ETSI standards activities for Distributed Speech Recognition Front-ends”, Applied Voice Input/Output Society Conference (AVIOS2000), San Jose, CA, May 2000.

[2] ETSI Standard ES 201 108, “Distributed Speech Recognition; Front-end Feature Extraction Algorithm; Compression Algorithm”, April 2000.

[3] ETSI Standard ES 202 050, “Distributed Speech Recognition; Advanced Front-end Feature Extraction Algorithm; Compression Algorithm”, Oct 2002.

[4] ETSI Standard ES 202 211, “Distributed Speech Recognition; Extended Front-end Feature Extraction Algorithm; Compression Algorithm; Back-end Speech Reconstruction Algorithm”, Nov 2003.

[5] ETSI Standard ES 202 212, “Distributed Speech Recognition; Extended Advanced Front-end Feature Extraction Algorithm; Compression Algorithm; Back-end Speech Reconstruction Algorithm”, Nov 2003.

[6] 3GPP TS 26.243, “ANSI C code for the Fixed-Point Distributed Speech Recognition Extended Advanced Front-end”.

[7] T. Ramabadran, A. Sorin et al., “The ETSI Extended Distributed Speech Recognition (DSR) Standards: Client Side Processing and Tonal Language Recognition Evaluation”, ICASSP 2004.

[8] T. Ramabadran, A. Sorin et al., “The ETSI Extended Distributed Speech Recognition (DSR) Standards: Server-Side Speech Reconstruction”, ICASSP 2004.

[9] D. Pearce, “Developing the ETSI Aurora Advanced Distributed Speech Recognition Front-End & What Next?”, ASRU 2001, Dec 2001.

[10] D. Pearce, “Robustness to Transmission Channel – the DSR Approach”, keynote paper at COST278 & ISCA Research Workshop on Robustness Issues in Conversational Interaction, Aug 2004.

[11] Q. Xie, “RTP Payload Formats for ETSI European Standard ES 202 050, ES 202 211, and ES 202 212 Distributed Speech Recognition Encoding”, http://www.ietf.org/internet-drafts/draft-ietf-avt-rtp-dsr-codecs-03.txt

[12] Q. Xie, “RTP Payload Format for ETSI European Standard ES 201 108 Distributed Speech Recognition Encoding”, RFC 3557, July 2003. http://www.ietf.org/rfc/rfc3557.txt

[13] 3GPP TR 22.977, “Feasibility study for speech enabled services”, Sept 2002.

[14] 3GPP TR 26.943, “Recognition performance evaluations of codecs for Speech Enabled Services (SES)”, Nov 2004.