VoiceXML Forum, Tools Working Group
Draft 1.8 – Internal Working Draft – 31 July 2007
For Public Review 20 August 2007
Editor:
Chung Pak Lai, LumenVox
About the VoiceXML Forum:
Founded in 1999, the VoiceXML Forum is an industry organization whose mission is to promote and accelerate the worldwide adoption of VoiceXML-based applications. To this end, the Forum serves as an educational and technical resource, a certification authority, and a contributor and liaison to international standards bodies such as the World Wide Web Consortium (W3C), IETF, ANSI and ISO. The VoiceXML Forum is organized as a program of the IEEE Industry Standards and Technology Organization (IEEE-ISTO). Membership in the Forum is open to any interested company. For more information, please visit the Website at www.voicexml.org.
Disclaimers:
This document is subject to change without notice and may be updated, replaced or made obsolete by other documents at any time. The VoiceXML Forum disclaims any and all warranties, whether express or implied, including (without limitation) any implied warranties of merchantability or fitness for a particular purpose.
The descriptions contained herein do not imply the granting of licenses to make, use, sell, license or otherwise transfer any technology required to implement systems or components conforming to this specification. The VoiceXML Forum, and its member companies, makes no representation on technology described in this specification regarding existing or future patent rights, copyrights, trademarks, trade secrets or other proprietary rights.
By submitting information to the VoiceXML Forum, and its member companies, including but not limited to technical information, you agree that the submitted information does not contain any confidential or proprietary information, and that the VoiceXML Forum may use the submitted information without any restrictions or limitations.
Table of Contents
- 1. Overview
- 2. Elements
- 2.1 <sl:log> element
- 2.2 <speech-resource-configuration> element
- 2.3 <speech-resource-allocation> element
- 2.4 <grammar-define-handler> element
- 2.5 <grammar-content> element
- 2.6 <recognize-request> element
- 2.7 <audio-stream-analysis> element
- 2.8 <audio-stream-configuration> element
- 2.9 <audio-features> element
- 2.10 <execute-recognition> element
- 2.11 <recognize-configuration> element
- 2.12 <recognize-results> element
- 2.13 <grammar-undefine-handler> element
- 3. ASR-Specific Parameters
- Appendix A: Example
- Appendix B: Acknowledgments
- Appendix C: Revision History
- Appendix D: References
1. Overview
This document describes tags for logging run-time data in an ASR (automatic speech recognition) server. Typically, the ASR server is part of a speech services system that includes an ASR server, a text-to-speech server, a VoiceXML browser, an application server, a database server, interfaces to external servers, and possibly other servers. The data logging (DL) specification, defined by the Tools Committee within the VoiceXML Forum, is called SLAML and comprises an SLAML overview plus a specification for the individual servers mentioned above (ASR, browser, etc.).
1.1 Specification Format
In the descriptions below, the name for an element tag is listed along with a description of the tag, a description of the SLAML attributes, proposed attributes, and a list of XML child elements.
Description: The description of the element.
SLAML Attributes: XML attributes that are defined in the SLAML document. Refer to the external documentation for the SLAML.
Proposed Attributes: XML attributes defined for the ASR server only.
XML child elements: The elements that may be included inside of this element.
Example: An example using the element.
Content: The content of the element, if it has any.
Attribute specification: Each attribute has its own set of values and type.
XML namespaces provide a simple method for qualifying element and attribute names used in Extensible Markup Language documents by associating them with namespaces identified by URI references. For more information, visit http://www.w3.org/TR/REC-xml-names/. In this specification, all the SLAML defined attributes have the namespace “sl” to indicate they are part of the SLAML standard. Proposed attributes are only specified for ASR logging and thus do not use the “sl” namespace identifier.
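For example, a log document can declare both namespaces up front (a sketch only; the namespace URIs shown are those used in the example in Appendix A):
<sl:log xmlns:sl="https://voicexml.org/2006/slaml"
xmlns="https://voicexml.org/2006/asr-log">
<!-- sl:-prefixed attributes are SLAML-defined; unprefixed attributes are ASR-specific -->
...
</sl:log>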
2. Elements
This section defines elements and attributes allowed in the ASR logging specification. The attributes or elements from the SLAML namespace are prefixed with sl:
2.1 <sl:log> element
Element | sl:log |
---|---|
Description | The root element of the ASR log. |
Children | Zero or more ASR-settings-handler elements. Zero or more grammar-define-handler elements. Zero or more recognize-request elements. Zero or more grammar-undefine-handler elements. Zero or more speech-resource-configuration elements. |
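For example, a skeletal log might look as follows (a sketch only; attribute values are omitted and child contents are elided):
<sl:log sl:class="ASREngine" sl:entity="Example Resource">
<speech-resource-configuration> ... </speech-resource-configuration>
<speech-resource-allocation ... />
<grammar-define-handler ... > ... </grammar-define-handler>
<recognize-request ... > ... </recognize-request>
<grammar-undefine-handler ... />
</sl:log>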
2.2 <speech-resource-configuration> element
Element | speech-resource-configuration |
---|---|
Description | Contains engine-specific information about the speech resource. This is where various vendor-specific parameters can be logged. |
Attributes | None |
Children | Zero or more vendor-specified elements. |
Parent | sl:log |
For example,
<speech-resource-configuration>
..
</speech-resource-configuration>
2.3 <speech-resource-allocation> element
Element | speech-resource-allocation |
---|---|
Description | This element handles the speech resource allocation request. It can be used to pinpoint errors when allocating speech resources, including license resources, memory resources, etc. |
Attributes | |
Children | None |
Parent | sl:log |
For example,
<speech-resource-allocation sl:start="1124126927612"
sl:end="1124126927613"
sl:handle-request="asr-session-req-1"
outcome="success" />
2.4 <grammar-define-handler> element
Element | grammar-define-handler |
---|---|
Description | An element that denotes the speech engine loaded a grammar into the speech resource. This makes the grammar available for use in a speech request. |
Attributes | |
Children | Zero or one grammar-content element. |
Parent | sl:log or recognize-request |
For example,
<grammar-define-handler sl:start="1124126927613"
sl:end="1124126927617"
sl:handle-request ="define-grammar-msg-1"
name="Global"
uri="http://server.example.com/globalgram.xml"
format="application/srgs"
lang="en-US"
outcome="success">
</grammar-define-handler>
2.5 <grammar-content> element
Element | grammar-content |
---|---|
Description | The actual inline/resolved grammar content. This would contain the grammar that was loaded into the speech resource. |
Attributes | None |
Children | None |
Parent | grammar-define-handler |
For example,
<grammar-content>
#ABNF 1.0 UTF-8;
language en-US;
mode voice;
tag-format <semantics/1.0>;
root $MainMenu;
$MainMenu = operator | customer service | main menu;
</grammar-content>
2.6 <recognize-request> element
Element | recognize-request |
---|---|
Description | This element is a record of the recognition request made by a voice browser. |
Attributes | |
Children | Zero or more ASR-settings-handler elements. Zero or more grammar-define-handler elements. Zero or more grammar-undefine-handler elements. One audio-stream-analysis element. One execute-recognition element. One recognize-results element. |
Parent | sl:log |
For example,
<recognize-request sl:start="1124126927612"
sl:end="1124126937794"
sl:handle-request ="recognize-rqt-1"
computerid="ASR Server 1194"
outcome="nomatch">
...
</recognize-request>
2.7 <audio-stream-analysis> element
Element | audio-stream-analysis |
---|---|
Description | This element indicates the ASR server began receiving an audio stream. |
Attributes | |
Children | Zero or one audio-stream-configuration element. One audio-features element. |
Parent | recognize-request |
For example,
<audio-stream-analysis sl:start="1124126927630"
sl:end="1124126927774"
outcome="success"
mode="speech">
...
</audio-stream-analysis>
2.8 <audio-stream-configuration> element
Element | audio-stream-configuration |
---|---|
Description | This element contains information specific to the ASR engine about the audio stream. These are all vendor-specific elements. |
Attributes | None |
Children | Zero or more vendor-specified elements |
Parent | audio-stream-analysis |
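For example (the child elements shown are vendor-specified and are taken from the example in Appendix A; other engines may log different settings):
<audio-stream-configuration>
<start-of-speech-detection>true</start-of-speech-detection>
<end-of-speech-detection>true</end-of-speech-detection>
<start-of-speech-rewind>300(ms)</start-of-speech-rewind>
<start-of-speech-energy-level>0.4</start-of-speech-energy-level>
</audio-stream-configuration>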
2.9 <audio-features> element
Element | audio-features |
---|---|
Description | Information about the streamed audio. |
Attributes | |
Children | None |
Parent | audio-stream-analysis |
For example,
<audio-features audiolength="21347"
audiolocation="http://ASR.server.com/134dkdhsT.vox"
audioformat="ulaw">
...
</audio-features>
2.10 <execute-recognition> element
Element | execute-recognition |
---|---|
Description | Information about the ASR engine’s recognition process. |
Attributes | |
Children | One recognize-configuration element. One recognize-results element. |
Parent | recognize-request |
For example,
<execute-recognition sl:start="1124126940000"
sl:end="1124126947700"
active-grammar="http://server.example.com/globalgram.xml;http://server.example.com/movie.grxml"
outcome="match">
...
</execute-recognition>
2.11 <recognize-configuration> element
Element | recognize-configuration |
---|---|
Description | Various engine-specific settings and information active during recognition. |
Attributes | None |
Children | Zero or More vendor-specified elements |
Parent | execute-recognition |
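For example (the child elements are vendor-specified; the settings shown mirror the example in Appendix A and are illustrative only):
<recognize-configuration>
<confidence-threshold>500</confidence-threshold>
<n-best-list-length>5</n-best-list-length>
<sensitivity-level>0.4</sensitivity-level>
</recognize-configuration>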
2.12 <recognize-results> element
Element | recognize-results |
---|---|
Description | The result of the recognition, based on the audio, the activated grammars and the requested format. The results are expressed using a standard called NLSML (Natural Language Semantics Markup Language) or EMMA (Extensible MultiModal Annotation markup language). For more information, visit http://www.w3.org/TR/nl-spec/ and http://www.w3.org/TR/emma/. |
Attributes | |
Children | None |
Parent | execute-recognition |
For example,
EMMA result:
<recognize-results type="application/emma+xml">
<emma:emma version="1.0"
xmlns:emma="http://www.w3.org/2003/04/emma" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2003/04/emma http://www.w3.org/TR/emma/emma10.xsd"
xmlns="http://www.example.com/example">
<emma:one-of>
<emma:interpretation id="interp1"
emma:confidence="0.8" emma:medium="acoustic" emma:mode="speech" emma:function="dialog"
emma:verbal="true">
<movie>Fantastic Four</movie>
</emma:interpretation>
</emma:one-of>
</emma:emma>
</recognize-results>
NLSML result:
<recognize-results type="application/nlsml+xml">
<interpretation x-model="http://dataModel" confidence="860">
<instance>
<movie>
<name>
Fantastic Four
</name>
</movie>
</instance>
<input mode="speech" confidence="0.5" timestamp-start="2006-08-13T00:00:00"
timestamp-end="2006-08-13T00:00:02">
Ticket for Fantastic Four
</input>
</interpretation>
</recognize-results>
2.13 <grammar-undefine-handler> element
Element | grammar-undefine-handler |
---|---|
Description | An element that denotes the speech engine unloaded a grammar from the speech resource, making it unavailable for further speech requests. |
Attributes | |
Children | None |
Parent | sl:log or recognize-request |
For example,
<grammar-undefine-handler sl:start="1124126947775"
sl:handle-request ="undefine-grammar-msg-1"
name="Movienames"/>
3. ASR-Specific Parameters
Each parameter is an individual element whose value is specified as the content of the parameter element. For example:
<confidence-threshold>0.5</confidence-threshold>
<n-best-list-length>4</n-best-list-length>
<vendor-specific>enable_grammar_cache=true</vendor-specific>
The ASR-specific parameters are listed below:
confidence-threshold | When a recognition resource recognizes an utterance against some portion of the grammar, it associates a confidence level with that result. The confidence-threshold parameter tells the recognizer resource what confidence level should be considered a successful match. This is a float from 0.0 to 1.0 indicating the recognizer's confidence in the recognition. If the recognizer determines that the confidence in all of its recognition results is less than the confidence threshold, it returns no-match as the recognition result. |
sensitivity-level | This parameter specifies the sensitivity on detecting speech. A higher value for this parameter means higher sensitivity. |
speed-vs-accuracy | This parameter specifies the balance between performance and accuracy for the recognizer resource. A higher value for this parameter means higher speed and less accuracy. |
n-best-list-length | When the recognizer matches an incoming stream against the grammar, it may come up with more than one alternative match because of confidence levels in certain words or conversation paths. By default, if this parameter is not specified, the recognition resource returns only the best match above the confidence threshold. By setting this parameter, the client can ask the recognition resource to return more than one alternative. All alternatives must still be above the confidence-threshold. A value greater than one does not guarantee that the recognizer will return the requested number of alternatives. |
no-input-timeout | When recognition is started and there is no speech detected for a certain period of time, the recognizer signals the client. The no-input-timeout parameter sets this timeout value. The value is in milliseconds. |
recognition-timeout | When recognition is started and there is no match for a certain period of time, the recognizer signals the client to terminate the recognition operation. The recognition-timeout parameter sets this timeout value. The value is in milliseconds. |
speech-complete-timeout | This parameter specifies the length of silence required following user speech before the speech recognizer finalizes a result. Specified in milliseconds. The value for this field ranges from 0 to MAXTIMEOUT, where MAXTIMEOUT is platform specific. |
speech-incomplete-timeout | This parameter specifies the length of silence required following user speech before the speech recognizer finalizes a result, where the speech prior to the silence is incomplete: it is still possible to speak further and match the grammar. By contrast, the speech-complete-timeout applies when the speech prior to the silence is complete and it is not possible to speak further. Specified in milliseconds. The value for this field ranges from 0 to MAXTIMEOUT, where MAXTIMEOUT is platform specific. |
dtmf-interdigit-timeout | This parameter specifies the terminating timeout to use when recognizing DTMF input. The value is in milliseconds. |
dtmf-term-char | This parameter specifies the terminating DTMF character for DTMF input recognition. |
vendor-specific | This parameter specifies any vendor-specific parameters. The value is a property/value pair for the recognizer (e.g., enable_grammar_cache=true). |
Appendix A: Example
The following example shows a complete ASR log:
<?xml version="1.0" encoding="UTF-8"?>
<sl:slaml xmlns:sl="https://voicexml.org/2006/slaml"
sl:version="1.0">
<sl:manifest>
<sl:session name="Session-12345"
start="asr-session-req-1"
sl:class="ASREngine" sl:log-tag="asr-log-1"/>
</sl:manifest>
<sl:log tag="svr-log-tag-1"
sl:class="ASREngine"
sl:entity="LumenVox Resource112"
xmlns="https://voicexml.org/2006/asr-log">
<!-- the configuration of the speech resource -->
<speech-resource-configuration>
<license-type-configuration>Full</license-type-configuration> <!-- a description of the type of license opened -->
<feature-enabled>unlimited vocabulary size</feature-enabled> <!-- Various features available on this resource -->
<feature-enabled>online adaptation</feature-enabled>
</speech-resource-configuration>
<speech-resource-allocation sl:start="1124126927612"
sl:end="1124126927613"
sl:handle-request="asr-session-req-1"
outcome="success"/>
<!-- Example of a global grammar being loaded -->
<grammar-define-handler
sl:start="1124126927613"
sl:end="1124126927617"
name="Global"
rank="0"
uri="http://server.example.com/globalgram.xml"
format="SRGS_ABNF"
lang="en-US"
sl:handle-request="define-grammar-msg-1"
outcome="success">
<grammar-content> <!-- Optional if this was captured by the ASR engine at the time of execution -->
#ABNF 1.0 UTF-8;
language en-US;
mode voice;
tag-format <lumenvox/1.0>;
root $MainMenu;
$MainMenu = operator | customer service | main menu;
</grammar-content>
</grammar-define-handler>
<recognize-request sl:start="1124126927612"
sl:end="1124126937794"
computerid="LumenVox ASR Server 1194"
sl:handle-request="asr-1234"
sl:trace-id="aae7-12adv-ef54-ea12"
outcome="no-match"
sl:mode="sequential">
<!-- showing an example of defining and activating a grammar inside recognize-request -->
<grammar-define-handler sl:start="1124126927613"
sl:end="1124126927617"
name="Movienames"
uri="http://server.example.com/movie.xml"
format="SRGS_ABNF"
lang="en-US"
outcome="success">
<grammar-content>
#ABNF 1.0 UTF-8;
language en-US;
mode voice;
tag-format <lumenvox/1.0>;
root $Movie;
$Movie = Fantastic Four { $.movie.name="Fantastic Four"; } |
Superman Returns { $.movie.name="Superman Returns"; };
</grammar-content>
</grammar-define-handler>
<audio-stream-analysis sl:start="1124126927630"
sl:end="1124126927774"
outcome="success"
mode="speech">
<audio-stream-configuration>
<start-of-speech-detection>true</start-of-speech-detection>
<end-of-speech-detection>true</end-of-speech-detection>
<start-of-speech-rewind>300(ms)</start-of-speech-rewind>
<start-of-speech-energy-level>.4</start-of-speech-energy-level>
</audio-stream-configuration>
<audio-features audiolength="16475"
audiolocation="http://ASR.server.com/134dkdhss.vox"
audioformat="PCMU">
</audio-features>
</audio-stream-analysis>
<execute-recognition sl:start="1124126927774"
sl:end="1124126937794"
outcome="no-match"
active-grammar="http://server.example.com/globalgram.xml;http://server.example.com/movie.xml">
<recognize-configuration>
<confidence-threshold>500</confidence-threshold>
<n-best-list-length>5</n-best-list-length>
<sensitivity-level>0.4</sensitivity-level>
<noise-cancelation>false</noise-cancelation>
</recognize-configuration>
<recognize-results type="application/nlsml+xml">
<interpretation>
<instance/>
<input>
<nomatch/>
</input>
</interpretation>
</recognize-results>
</execute-recognition>
<!-- showing an example of undefining and deactivating a grammar inside recognize-request -->
<grammar-undefine-handler sl:time="1124126947775"
name="Movienames"/>
</recognize-request>
<recognize-request sl:start="1124126937775"
sl:end="1124126947775"
sl:handle-request="asr-1237"
sl:outcome="success"
sl:trace-id="aae7-12adv-ef54-ea31">
<audio-stream-analysis sl:start="1124126937900"
sl:end="1124126940000"
sl:outcome="success"
sl:mode="speech">
<audio-stream-configuration>
<end-of-speech-detection>true</end-of-speech-detection>
<start-of-speech-rewind>300(ms)</start-of-speech-rewind>
<start-of-speech-energy-level>0.4</start-of-speech-energy-level>
</audio-stream-configuration>
<audio-features audiolength="21347"
audiolocation="http://ASR.server.com/134dkdhsT.vox"
audioformat="PCMU">
</audio-features>
</audio-stream-analysis>
<execute-recognition sl:start="1124126940000"
sl:end="1124126947700"
sl:outcome="success"
active-grammar="http://server.example.com/movie.xml">
<recognize-configuration>
<confidence-threshold>500</confidence-threshold>
<n-best-list-length>5</n-best-list-length>
<sensitivity-level>0.4</sensitivity-level>
</recognize-configuration>
<!-- Note: we are using EMMA -->
<recognize-results type="application/emma+xml">
<emma:emma version="1.0"
xmlns:emma="http://www.w3.org/2003/04/emma"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2003/04/emma
http://www.w3.org/TR/emma/emma10.xsd"
xmlns="http://www.example.com/example">
<emma:one-of id="list">
<emma:interpretation id="interp1"
emma:confidence="0.8"
emma:medium="acoustic"
emma:mode="speech"
emma:function="dialog"
emma:verbal="true">
<movie>Fantastic Four</movie>
</emma:interpretation>
</emma:one-of>
</emma:emma>
</recognize-results>
</execute-recognition>
</recognize-request>
<recognize-request sl:start="1124126937775"
sl:end="1124126947775"
sl:handle-request="asr-1239"
sl:outcome="success"
sl:trace-id="aae7-12adv-ef54-ea32">
<!-- Contents omitted -->
</recognize-request>
<!-- showing an example of undefining and deactivating a grammar outside recognize-request -->
<grammar-undefine-handler sl:time="1124126947775"
sl:handle-request="undefine-grammar-msg-3"
name="global"/>
</sl:log>
</sl:slaml>
Appendix B: Acknowledgments
Appendix C: Revision History
- 07/10/2007:
- Added parameter section
- Revised some of the elements’ description
- 04/10/2007:
- Converted from Word Document into HTML format
- 01/29/2006:
- Accommodated Jeff Marcus comments including:
- Remove <speech-resource> element
- Remove all examples contain vendor’s name
- Changed the grammar formats to follow the MRCP standard, including application/srgs+xml, application/srgs, etc.
- Updated the outcome of each element to follow MRCP standard
- Instead of having active-grammar element, replace that with active-grammar attribute list in <execute-recognition> element
- Remove <SOS-detect-event> and <EOS-detect-event>
- Rejection-threshold renamed to confidence-level to follow VoiceXML
- Modified each element's outcome value to follow the MRCP completion-cause standard (http://www.ietf.org/internet-drafts/draft-ietf-speechsc-mrcpv2-11.txt, page 76)
- Updated ASR logging example to follow the above changes
- Added mode attribute to <audio-stream-analysis> element for distinguishing dtmf and speech
- 01/15/2006:
- Modified the following elements:
- Renamed <speech-request> to <recognize-request>
- Renamed <audio-stream> to <audio-stream-analysis>
- Renamed <recognize> to <execute-recognition>
- speech-resource-allocation – changed sl:handle-request to sl:handle-interaction
- recognize-request – changed from reference-id to sl:traceid to follow SLAML
- Updated some examples to use new elements’ name
- Changed sl:atomic to sl:time to follow SLAML
- 01/09/2006:
- Modified the following elements:
- speech-resource-configuration – changed the attribute to VendorSpecific
- speech-request: instead of <sl:send-response>, changed to <sl:handle-request>
- Added new element speech-resource-allocation
- For now, remove the parameters’ list, update this section with mrcp standard parameters later
- 12/05/2006:
- Modified the following elements:
- Recognize-results – added EMMA script
- Speech-resource – sl:log contains the information necessary for referencing a section in the log file; speech-resource is not required to contain this extra referencing information, so some attributes were dropped
- Added new references to the updated SLAML document
- Removed <grammar-activate-handler>, <grammar-deactivate-handler>, instead, added <active-grammar> in <recognize> element to follow the preference in MRCP
- Replace trace-id mechanism with updated SLAML message communication mechanism
- 09/11/2006:
- Accommodated David Thomson comments including:
- Overview of the ASR server data logging
- Specification format description
- Example for each element to show the usage
- Added more ASR server specified parameters
- 08/15/2006:
- Defined a set of parameters for audio-stream-parameters, recognize-parameters, and speech-resource-parameters.
- Removed prev-id (parent-id) in this spec until there is an agreement on that for the SLAML.
- Added the transcription element for storing transcript information about an audio.
- Uses NLSML for recognition result.
- Newly defined elements are appended with (New).
- 07/25/2006:
- Added the element sl:time which is a timestamp when there is no sl:end attribute expected in that element.
- All time values will be specified by the standard CSS2 Time Format
- The value of parameters which represent a range will be represented by a float between 0 and 1
- 07/19/2006:
- The specification accommodated the suggestions provided by Andrew Wahbe (VoiceGenie) and David Thomson (Speechphone) two weeks ago in the VoiceXML DL call.
- Reviewed this specification against the ASR draft created several months ago by Jeffery Marcus (Nuance).
- Made the following naming changes:
- Renamed <SOS-detect-event> to <start-of-speech-detect-event>
- Renamed <EOS-detect-event> to <end-of-speech-detect-event>
- Andrew suggested it should be useful to add the following to the SLAML standard:
- outcome – This attribute indicates the event's outcome; for instance, a recognition outcome can be match, no-match or no-input
- 07/06/2006:
- Made a couple of SLAML attributes naming convention changes:
- Renamed <sl:log-id> to <sl:trace>
- Renamed <sl:prev-id> to <sl:parent>
Appendix D: References
[MRCP] "Media Resource Control Protocol (MRCP)", IETF RFC 4463, 2006. See http://www.ietf.org/rfc/rfc4463.txt
[SLAML] "Session Log Annotation Markup Language", Andrew Wahbe, 2007.
[VOICEXML-2.0] "Voice Extensible Markup Language (VoiceXML) Version 2.0", McGlashan et al., W3C Recommendation, March 2004. See http://www.w3.org/TR/voicexml20/