VoiceXML Forum, Tools Working Group
Draft 1.8 – Internal Working Draft – 31 July 2007
For Public Review 20 August 2007
Editor:
Chung Pak Lai, LumenVox
About the VoiceXML Forum:
Founded in 1999, the VoiceXML Forum is an industry organization whose mission is to promote and accelerate the worldwide adoption of VoiceXML-based applications. To this end, the Forum serves as an educational and technical resource, a certification authority, and a contributor and liaison to international standards bodies such as the World Wide Web Consortium (W3C), IETF, ANSI and ISO. The VoiceXML Forum is organized as a program of the IEEE Industry Standards and Technology Organization (IEEE-ISTO). Membership in the Forum is open to any interested company. For more information, please visit the Website at www.voicexml.org.
Disclaimers:
This document is subject to change without notice and may be updated, replaced or made obsolete by other documents at any time. The VoiceXML Forum disclaims any and all warranties, whether express or implied, including (without limitation) any implied warranties of merchantability or fitness for a particular purpose.
The descriptions contained herein do not imply the granting of licenses to make, use, sell, license or otherwise transfer any technology required to implement systems or components conforming to this specification. The VoiceXML Forum, and its member companies, makes no representation on technology described in this specification regarding existing or future patent rights, copyrights, trademarks, trade secrets or other proprietary rights.
By submitting information to the VoiceXML Forum, and its member companies, including but not limited to technical information, you agree that the submitted information does not contain any confidential or proprietary information, and that the VoiceXML Forum may use the submitted information without any restrictions or limitations.
Table of Contents
- 1. Overview
- 2. Elements
- 2.1 <sl:log> element
- 2.2 <speech-resource-configuration> element
- 2.3 <speech-resource-allocation> element
- 2.4 <grammar-define-handler> element
- 2.5 <grammar-content> element
- 2.6 <recognize-request> element
- 2.7 <audio-stream-analysis> element
- 2.8 <audio-stream-configuration> element
- 2.9 <audio-features> element
- 2.10 <execute-recognition> element
- 2.11 <recognize-configuration> element
- 2.12 <recognize-results> element
- 2.13 <grammar-undefine-handler> element
- 3. ASR-Specific Parameters
- Appendix A: Example
- Appendix B: Acknowledgments
- Appendix C: Revision History
- Appendix D: References
1. Overview
This document describes tags for logging run-time data in an ASR (automatic speech recognition) server. Typically, the ASR server is part of a speech services system that includes an ASR server, a text-to-speech server, a VoiceXML browser, an application server, a database server, interfaces to external servers, and possibly other servers. The data logging (DL) specification, defined by the Tools Committee within the VoiceXML Forum, is called SLAML and comprises an SLAML overview plus a specification for the individual servers mentioned above (ASR, browser, etc.).
1.1 Specification Format
In the descriptions below, the name for an element tag is listed along with a description of the tag, a description of the SLAML attributes, proposed attributes, and a list of XML child elements.
Description: The description of the element.
SLAML Attributes: XML attributes that are defined in the SLAML document. Refer to the external documentation for the SLAML.
Proposed Attributes: XML attributes defined for the ASR server only.
XML child elements: The elements that may be included inside of this element.
Example: An example using the element.
Content: The content of the element, if it has any.
Attribute specification: Each attribute has its own set of values and type.
XML namespaces provide a simple method for qualifying element and attribute names used in Extensible Markup Language documents by associating them with namespaces identified by URI references. For more information, visit http://www.w3.org/TR/REC-xml-names/. In this specification, all the SLAML defined attributes have the namespace “sl” to indicate they are part of the SLAML standard. Proposed attributes are only specified for ASR logging and thus do not use the “sl” namespace identifier.
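For example, a log document can declare both namespaces up front (a sketch only; the namespace URIs shown are those used in the example in Appendix A):
<sl:log xmlns:sl="https://voicexml.org/2006/slaml"
xmlns="https://voicexml.org/2006/asr-log">
<!-- sl:-prefixed attributes are SLAML-defined; unprefixed attributes are ASR-specific -->
...
</sl:log>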
2. Elements
This section defines elements and attributes allowed in the ASR logging specification. The attributes or elements from the SLAML namespace are prefixed with sl:
2.1 <sl:log> element
Element | sl:log |
---|---|
Description | The root element of the ASR log. |
Children | Zero or more ASR-settings-handler elements. Zero or more grammar-define-handler elements. Zero or more recognize-request elements. Zero or more grammar-undefine-handler elements. Zero or more speech-resource-configuration elements. |
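For example, a skeletal log might look as follows (a sketch only; attribute values are omitted and child contents are elided):
<sl:log sl:class="ASREngine" sl:entity="Example Resource">
<speech-resource-configuration> ... </speech-resource-configuration>
<speech-resource-allocation ... />
<grammar-define-handler ... > ... </grammar-define-handler>
<recognize-request ... > ... </recognize-request>
<grammar-undefine-handler ... />
</sl:log>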
2.2 <speech-resource-configuration> element
Element | speech-resource-configuration |
---|---|
Description | Contains engine-specific information about the speech resource. This is where various vendor-specific parameters can be logged. |
Attributes | None |
Children | Zero or more vendor-specified elements. |
Parent | sl:log |
For example,
<speech-resource-configuration>
..
</speech-resource-configuration>
2.3 <speech-resource-allocation> element
Element | speech-resource-allocation |
---|---|
Description | This element handles the speech resource allocation request. It can be used to pinpoint errors when allocating speech resources, including license resources, memory resources, etc. |
Attributes | |
Children | None |
Parent | sl:log |
For example,
<speech-resource-allocation sl:start="1124126927612"
sl:end="1124126927613"
sl:handle-request="asr-session-req-1"
outcome="success" />
2.4 <grammar-define-handler> element
Element | grammar-define-handler |
---|---|
Description | An element that denotes the speech engine loaded a grammar into the speech resource. This makes the grammar available for use in a speech request. |
Attributes | |
Children | Zero or one grammar-content element. |
Parent | sl:log or recognize-request |
For example,
<grammar-define-handler sl:start="1124126927613"
sl:end="1124126927617"
sl:handle-request ="define-grammar-msg-1"
name="Global"
uri="http://server.example.com/globalgram.xml"
format="application/srgs"
lang="en-US"
outcome="success">
</grammar-define-handler>
2.5 <grammar-content> element
Element | grammar-content |
---|---|
Description | The actual inline/resolved grammar content. This would contain the grammar that was loaded into the speech resource. |
Attributes | None |
Children | None |
Parent | grammar-define-handler |
For example,
<grammar-content>
#ABNF 1.0 UTF-8;
language en-US;
mode voice;
tag-format <semantics/1.0>;
root $MainMenu;
$MainMenu = operator | customer service | main menu;
</grammar-content>
2.6 <recognize-request> element
Element | recognize-request |
---|---|
Description | This element is a record of the recognition request made by a voice browser. |
Attributes | |
Children | Zero or more ASR-settings-handler elements. Zero or more grammar-define-handler elements. Zero or more grammar-undefine-handler elements. One audio-stream-analysis element. One execute-recognition element. One recognize-results element. |
Parent | sl:log |
For example,
<recognize-request sl:start="1124126927612"
sl:end="1124126937794"
sl:handle-request ="recognize-rqt-1"
computerid="ASR Server 1194"
outcome="nomatch">
...
</recognize-request>
2.7 <audio-stream-analysis> element
Element | audio-stream-analysis |
---|---|
Description | This element indicates the ASR server began receiving an audio stream. |
Attributes | |
Children | Zero or one audio-stream-configuration element. One audio-features element. |
Parent | recognize-request |
For example,
<audio-stream-analysis sl:start="1124126927630"
sl:end="1124126927774"
outcome="success"
mode="speech">
...
</audio-stream-analysis>
2.8 <audio-stream-configuration> element
Element | audio-stream-configuration |
---|---|
Description | This element contains information specific to the ASR engine about the audio stream. These are all vendor-specific elements. |
Attributes | None |
Children | Zero or more vendor-specified elements |
Parent | audio-stream-analysis |
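For example (the child elements shown are vendor-specified and are taken from the example in Appendix A; other engines may log different settings):
<audio-stream-configuration>
<start-of-speech-detection>true</start-of-speech-detection>
<end-of-speech-detection>true</end-of-speech-detection>
<start-of-speech-rewind>300(ms)</start-of-speech-rewind>
<start-of-speech-energy-level>0.4</start-of-speech-energy-level>
</audio-stream-configuration>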
2.9 <audio-features> element
Element | audio-features |
---|---|
Description | Information about the streamed audio. |
Attributes | |
Children | None |
Parent | audio-stream-analysis |
For example,
<audio-features audiolength="21347"
audiolocation="http://ASR.server.com/134dkdhsT.vox"
audioformat="ulaw">
...
</audio-features>
2.10 <execute-recognition> element
Element | execute-recognition |
---|---|
Description | Information about the ASR engine’s recognition process. |
Attributes | |
Children | One recognize-configuration element. One recognize-results element. |
Parent | recognize-request |
For example,
<execute-recognition sl:start="1124126940000"
sl:end="1124126947700"
active-grammar="http://server.example.com/globalgram.xml;http://server.example.com/movie.grxml"
outcome="match">
...
</execute-recognition>
2.11 <recognize-configuration> element
Element | recognize-configuration |
---|---|
Description | Various engine-specific settings and information active during recognition. |
Attributes | None |
Children | Zero or More vendor-specified elements |
Parent | execute-recognition |
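For example (the child elements are vendor-specified; the settings shown mirror the example in Appendix A and are illustrative only):
<recognize-configuration>
<confidence-threshold>500</confidence-threshold>
<n-best-list-length>5</n-best-list-length>
<sensitivity-level>0.4</sensitivity-level>
</recognize-configuration>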
2.12 <recognize-results> element
Element | recognize-results |
---|---|
Description | The result of the recognition, based on the audio, the activated grammars and the requested format. The results are expressed using a standard called NLSML (Natural Language Semantics Markup Language) or EMMA (Extensible MultiModal Annotation markup language). For more information, visit http://www.w3.org/TR/nl-spec/ and http://www.w3.org/TR/emma/. |
Attributes | |
Children | None |
Parent | execute-recognition |
For example,
EMMA result:
<recognize-results type="application/emma+xml">
<emma:emma version="1.0"
xmlns:emma="http://www.w3.org/2003/04/emma" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2003/04/emma http://www.w3.org/TR/emma/emma10.xsd"
xmlns="http://www.example.com/example">
<emma:one-of>
<emma:interpretation id="interp1"
emma:confidence="0.8" emma:medium="acoustic" emma:mode="speech" emma:function="dialog"
emma:verbal="true">
<movie>Fantastic Four</movie>
</emma:interpretation>
</emma:one-of>
</emma:emma>
</recognize-results>
NLSML result:
<recognize-results type="application/nlsml+xml">
<interpretation x-model="http://dataModel" confidence="860">
<instance>
<movie>
<name>
Fantastic Four
</name>
</movie>
</instance>
<input mode="speech" confidence="0.5" timestamp-start="2006-08-13T00:00:00"
timestamp-end="2006-08-13T00:00:02">
Ticket for Fantastic Four
</input>
</interpretation>
</recognize-results>
2.13 <grammar-undefine-handler> element
Element | grammar-undefine-handler |
---|---|
Description | An element that denotes the speech engine unloaded a grammar from the speech resource, making it unavailable for further speech requests. |
Attributes | |
Children | None |
Parent | sl:log or recognize-request |
For example,
<grammar-undefine-handler sl:start="1124126947775"
sl:handle-request ="undefine-grammar-msg-1"
name="Movienames"/>
3. ASR-Specific Parameters
Each parameter is an individual element whose value is specified as the content of the parameter element. For example:
<confidence-threshold>0.5</confidence-threshold>
<n-best-list-length>4</n-best-list-length>
<vendor-specific>enable_grammar_cache=true</vendor-specific>
The ASR-specific parameters are listed below:
confidence-threshold | When a recognition resource recognizes an utterance against some portion of the grammar, it associates a confidence level with that result. The confidence-threshold parameter tells the recognizer resource what confidence level should be considered a successful match. This is a float from 0.0 to 1.0 indicating the recognizer's confidence in the recognition. If the recognizer determines that the confidence in all of its recognition results is less than the confidence threshold, it returns no-match as the recognition result. |
sensitivity-level | This parameter specifies the sensitivity on detecting speech. A higher value for this parameter means higher sensitivity. |
speed-vs-accuracy | This parameter specifies the balance between performance and accuracy for the recognizer resource. A higher value for this parameter means higher speed and less accuracy. |
n-best-list-length | When the recognizer matches an incoming stream against the grammar, it may come up with more than one alternative match because of confidence levels in certain words or conversation paths. By default, if this parameter is not specified, the recognition resource returns only the best match above the confidence threshold. By setting this parameter, the client can ask the recognition resource to return more than one alternative. All alternatives must still be above the confidence-threshold. A value greater than one does not guarantee that the recognizer will return the requested number of alternatives. |
no-input-timeout | When recognition is started and there is no speech detected for a certain period of time, the recognizer signals the client. The no-input-timeout parameter sets this timeout value. The value is in milliseconds. |
recognition-timeout | When recognition is started and there is no match for a certain period of time, the recognizer signals the client to terminate the recognition operation. The recognition-timeout parameter sets this timeout value. The value is in milliseconds. |
speech-complete-timeout | This parameter specifies the length of silence required following user speech before the speech recognizer finalizes a result. Specified in milliseconds. The value for this field ranges from 0 to MAXTIMEOUT, where MAXTIMEOUT is platform specific. |
speech-incomplete-timeout | This parameter specifies the length of silence required following user speech before the speech recognizer finalizes a result, where the speech prior to the silence is incomplete: it is still possible to speak further and match the grammar. By contrast, the speech-complete-timeout applies when the speech prior to the silence is complete and it is not possible to speak further. Specified in milliseconds. The value for this field ranges from 0 to MAXTIMEOUT, where MAXTIMEOUT is platform specific. |
dtmf-interdigit-timeout | This parameter specifies the terminating timeout to use when recognizing DTMF input. The value is in milliseconds. |
dtmf-term-char | This parameter specifies the terminating DTMF character for DTMF input recognition. |
vendor-specific | This parameter specifies any vendor-specific parameters. The value is a property/value pair for the recognizer (e.g., enable_grammar_cache=true). |
Appendix A: Example
The following example shows a complete ASR log:
<?xml version="1.0" encoding="UTF-8"?>
<sl:slaml xmlns:sl="https://voicexml.org/2006/slaml"
sl:version="1.0">
<sl:manifest>
<sl:session name="Session-12345"
start="asr-session-req-1"
sl:class="ASREngine" sl:log-tag="asr-log-1"/>
</sl:manifest>
<sl:log tag="svr-log-tag-1"
sl:class="ASREngine"
sl:entity="LumenVox Resource112"
xmlns="https://voicexml.org/2006/asr-log">
<!-- the configuration of the speech resource -->
<speech-resource-configuration>
<license-type-configuration>Full</license-type-configuration> <!-- a description of the type of license opened -->
<feature-enabled>unlimited vocabulary size</feature-enabled> <!-- Various features available on this resource -->
<feature-enabled>online adaptation</feature-enabled>
</speech-resource-configuration>
<speech-resource-allocation sl:start="1124126927612"
sl:end="1124126927613"
sl:handle-request="asr-session-req-1"
outcome="success"/>
<!-- Example of a global grammar being loaded -->
<grammar-define-handler
sl:start="1124126927613"
sl:end="1124126927617"
name="Global"
rank="0"
uri="http://server.example.com/globalgram.xml"
format="SRGS_ABNF"
lang="en-US"
sl:handle-request="define-grammar-msg-1"
outcome="success">
<grammar-content> <!-- Optional if this was captured by the ASR engine at the time of execution -->
#ABNF 1.0 UTF-8;
language en-US;
mode voice;
tag-format <lumenvox/1.0>;
root $MainMenu;
$MainMenu = operator | customer service | main menu;
</grammar-content>
</grammar-define-handler>
<recognize-request sl:start="1124126927612"
sl:end="1124126937794"
computerid="LumenVox ASR Server 1194"
sl:handle-request="asr-1234"
sl:trace-id="aae7-12adv-ef54-ea12"
outcome="no-match"
sl:mode="sequential">
<!-- showing an example of defining and activating a grammar inside recognize-request -->
<grammar-define-handler sl:start="1124126927613"
sl:end="1124126927617"
name="Movienames"
uri="http://server.example.com/movie.xml"
format="SRGS_ABNF"
lang="en-US"
outcome="success">
<grammar-content>
#ABNF 1.0 UTF-8;
language en-US;
mode voice;
tag-format <lumenvox/1.0>;
root $Movie;
$Movie = Fantastic Four { $.movie.name="Fantastic Four"; } |
Superman Returns { $.movie.name="Superman Returns"; };
</grammar-content>
</grammar-define-handler>
<audio-stream-analysis sl:start="1124126927630"
sl:end="1124126927774"
outcome="success"
mode="speech">
<audio-stream-configuration>
<start-of-speech-detection>true</start-of-speech-detection>
<end-of-speech-detection>true</end-of-speech-detection>
<start-of-speech-rewind>300(ms)</start-of-speech-rewind>
<start-of-speech-energy-level>.4</start-of-speech-energy-level>
</audio-stream-configuration>
<audio-features audiolength="16475"
audiolocation="http://ASR.server.com/134dkdhss.vox"
audioformat="PCMU">
</audio-features>
</audio-stream-analysis>
<execute-recognition sl:start="1124126927774"
sl:end="1124126937794"
outcome="no-match"
active-grammar="http://server.example.com/globalgram.xml;http://server.example.com/movie.xml">
<recognize-configuration>
<confidence-threshold>500</confidence-threshold>
<n-best-list-length>5</n-best-list-length>
<sensitivity-level>0.4</sensitivity-level>
<noise-cancelation>false</noise-cancelation>
</recognize-configuration>
<recognize-results type="application/nlsml+xml">
<interpretation>
<instance/>
<input>
<nomatch/>
</input>
</interpretation>
</recognize-results>
</execute-recognition>
<!-- showing an example of undefining and deactivating a grammar inside recognize-request -->
<grammar-undefine-handler sl:time="1124126947775"
name="Movienames"/>
</recognize-request>
<recognize-request sl:start="1124126937775"
sl:end="1124126947775"
sl:handle-request="asr-1237"
sl:outcome="success"
sl:trace-id="aae7-12adv-ef54-ea31">
<audio-stream-analysis sl:start="1124126937900"
sl:end="1124126940000"
sl:outcome="success"
sl:mode="speech">
<audio-stream-configuration>
<end-of-speech-detection>true</end-of-speech-detection>
<start-of-speech-rewind>300(ms)</start-of-speech-rewind>
<start-of-speech-energy-level>0.4</start-of-speech-energy-level>
</audio-stream-configuration>
<audio-features audiolength="21347"
audiolocation="http://ASR.server.com/134dkdhsT.vox"
audioformat="PCMU">
</audio-features>
</audio-stream-analysis>
<execute-recognition sl:start="1124126940000"
sl:end="1124126947700"
sl:outcome="success"
active-grammar="http://server.example.com/movie.xml">
<recognize-configuration>
<confidence-threshold>500</confidence-threshold>
<n-best-list-length>5</n-best-list-length>
<sensitivity-level>0.4</sensitivity-level>
</recognize-configuration>
<!-- Note: we are using EMMA -->
<recognize-results type="application/emma+xml">
<emma:emma version="1.0"
xmlns:emma="http://www.w3.org/2003/04/emma"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2003/04/emma
http://www.w3.org/TR/emma/emma10.xsd"
xmlns="http://www.example.com/example">
<emma:one-of id="list">
<emma:interpretation id="interp1"
emma:confidence="0.8"
emma:medium="acoustic"
emma:mode="speech"
emma:function="dialog"
emma:verbal="true">
<movie>Fantastic Four</movie>
</emma:interpretation>
</emma:one-of>
</emma:emma>
</recognize-results>
</execute-recognition>
</recognize-request>
<recognize-request sl:start="1124126937775"
sl:end="1124126947775"
sl:handle-request="asr-1239"
sl:outcome="success"
sl:trace-id="aae7-12adv-ef54-ea32">
<!-- Contents omitted -->
</recognize-request>
<!-- showing an example of undefining and deactivating a grammar outside recognize-request -->
<grammar-undefine-handler sl:time="1124126947775"
sl:handle-request="undefine-grammar-msg-3"
name="global"/>
</sl:log>
</sl:slaml>
Appendix B: Acknowledgments
Appendix C: Revision History
- 07/10/2007:
- Added parameter section
- Revised some of the elements’ description
- 04/10/2007:
- Converted from Word Document into HTML format
- 01/29/2006:
- Accommodated Jeff Marcus comments including:
- Remove <speech-resource> element
- Remove all examples contain vendor’s name
- Changed the grammar formats to follow the MRCP standard, including application/srgs+xml, application/srgs, etc.
- Updated the outcome of each element to follow MRCP standard
- Instead of having active-grammar element, replace that with active-grammar attribute list in <execute-recognition> element
- Remove <SOS-detect-event> and <EOS-detect-event>
- Rejection-threshold renamed to confidence-level to follow VoiceXML
- Modified each element's outcome value to follow the MRCP completion-cause standard (http://www.ietf.org/internet-drafts/draft-ietf-speechsc-mrcpv2-11.txt, page 76)
- Updated ASR logging example to follow the above changes
- Added mode attribute to <audio-stream-analysis> element for distinguishing dtmf and speech
- 01/15/2006:
- Modified the following elements:
- Renamed <speech-request> to <recognize-request>
- Renamed <audio-stream> to <audio-stream-analysis>
- Renamed <recognize> to <execute-recognition>
- speech-resource-allocation – changed sl:handle-request to sl:handle-interaction
- recognize-request – changed from reference-id to sl:traceid to follow SLAML
- Updated some examples to use new elements’ name
- Changed sl:atomic to sl:time to follow SLAML
- 01/09/2006:
- Modified the following elements:
- speech-resource-configuration – changed the attribute to VendorSpecific
- speech-request: instead of <sl:send-response>, changed to <sl:handle-request>
- Added new element speech-resource-allocation
- For now, remove the parameters’ list, update this section with mrcp standard parameters later
- 12/05/2006:
- Modified the following elements:
- Recognize-results – added EMMA script
- Speech-resource – sl:log contains the information necessary for referencing a section in the log file; speech-resource is not required to contain this extra referencing information, so some attributes were dropped
- Added new references to the updated SLAML document
- Removed <grammar-activate-handler>, <grammar-deactivate-handler>, instead, added <active-grammar> in <recognize> element to follow the preference in MRCP
- Replace trace-id mechanism with updated SLAML message communication mechanism
- 09/11/2006:
- Accommodated David Thomson comments including:
- Overview of the ASR server data logging
- Specification format description
- Example for each element to show the usage
- Added more ASR server specified parameters
- 08/15/2006:
- Defined a set of parameters for audio-stream-parameters, recognize-parameters, and speech-resource-parameters.
- Removed prev-id (parent-id) in this spec until there is an agreement on that for the SLAML.
- Added the transcription element for storing transcript information about an audio.
- Uses NLSML for recognition result.
- Newly defined elements are appended with (New).
- 07/25/2006:
- Added the element sl:time which is a timestamp when there is no sl:end attribute expected in that element.
- All time values will be specified by the standard CSS2 Time Format
- The value of parameters which represent a range will be represented by a float between 0 and 1
- 07/19/2006:
- The specification accommodated the suggestions provided by Andrew Wahbe (VoiceGenie) and David Thomson (Speechphone) two weeks ago in the VoiceXML DL call.
- Reviewed this specification against the ASR draft created several months ago by Jeffery Marcus (Nuance).
- Made the following naming changes:
- Renamed <SOS-detect-event> to <start-of-speech-detect-event>
- Renamed <EOS-detect-event> to <end-of-speech-detect-event>
- Andrew suggested it should be useful to add the following to the SLAML standard:
- outcome – This attribute indicates the event's outcome; for instance, a recognition outcome can be match, no-match or no-input
- 07/06/2006:
- Made a couple of SLAML attributes naming convention changes:
- Renamed <sl:log-id> to <sl:trace>
- Renamed <sl:prev-id> to <sl:parent>
Appendix D: References
[MRCP] "Media Resource Control Protocol (MRCP)", IETF RFC 4463, 2006. See http://www.ietf.org/rfc/rfc4463.txt
[SLAML] "Session Log Annotation Markup Language", Andrew Wahbe, 2007.
[VOICEXML-2.0] "Voice Extensible Markup Language (VoiceXML) Version 2.0", McGlashan et al., W3C Recommendation, March 2004. See http://www.w3.org/TR/voicexml20/