Automatic Speech Recognition Logging Specification


VoiceXML Forum, Tools Working Group
Draft 1.8 - Internal Working Draft - 31 July 2007
For Public Review 20 August 2007


Editors:
Kyle Danielson, LumenVox
Chung Pak Lai, LumenVox

About the VoiceXML Forum:

Founded in 1999, the VoiceXML Forum is an industry organization whose mission is to promote and to accelerate the worldwide adoption of VoiceXML-based applications. To this end, the Forum serves as an educational and technical resource, a certification authority, and a contributor and liaison to international standards bodies such as the World Wide Web Consortium (W3C), the IETF, ANSI and ISO. The VoiceXML Forum is organized as a program of the IEEE Industry Standards and Technology Organization (IEEE-ISTO). Membership in the Forum is open to any interested company. For more information, please visit the website at www.voicexml.org.

Disclaimers:

This document is subject to change without notice and may be updated, replaced or made obsolete by other documents at any time. The VoiceXML Forum disclaims any and all warranties, whether express or implied, including (without limitation) any implied warranties of merchantability or fitness for a particular purpose.

The descriptions contained herein do not imply the granting of licenses to make, use, sell, license or otherwise transfer any technology required to implement systems or components conforming to this specification. The VoiceXML Forum and its member companies make no representations regarding existing or future patent rights, copyrights, trademarks, trade secrets or other proprietary rights covering technology described in this specification.

By submitting information to the VoiceXML Forum, and its member companies, including but not limited to technical information, you agree that the submitted information does not contain any confidential or proprietary information, and that the VoiceXML Forum may use the submitted information without any restrictions or limitations.


Table of Contents

1. Overview
	1.1 Specification Format
2. Elements
	2.1 <sl:log> element
	2.2 <speech-resource-configuration> element
	2.3 <speech-resource-allocation> element
	2.4 <grammar-define-handler> element
	2.5 <grammar-content> element
	2.6 <recognize-request> element
	2.7 <audio-stream-analysis> element
	2.8 <audio-stream-configuration> element
	2.9 <audio-features> element
	2.10 <execute-recognition> element
	2.11 <recognize-configuration> element
	2.12 <recognize-results> element
	2.13 <grammar-undefine-handler> element
3. ASR-Specific Parameters
Appendix A: Examples
Appendix B: Acknowledgments
Appendix C: Revision History
Appendix D: References

1. Overview

This document describes tags for logging run-time data in an ASR (automatic speech recognition) server. Typically, the ASR server is part of a speech services system that includes an ASR server, a text-to-speech server, a VoiceXML browser, an application server, a database server, interfaces to external servers, and possibly other servers. The data logging (DL) specification, defined by the Tools Working Group within the VoiceXML Forum, is called SLAML and comprises an SLAML overview plus a specification for each of the individual servers mentioned above (ASR, browser, etc.).

1.1 Specification Format

In the descriptions below, the name of each element is listed along with a description of the element, a description of its SLAML attributes, its proposed attributes, and a list of its XML child elements:

Description: A description of the element.
SLAML Attributes: XML attributes that are defined in the SLAML specification. Refer to the SLAML documentation [SLAML] for details.
Proposed Attributes: XML attributes defined for the ASR server only.
XML child elements: The elements that may be included inside this element.
Example: An example using the element.
Content: The content of the element, if any.
Attribute specification: Each attribute has its own type and set of values.

XML namespaces provide a simple method for qualifying element and attribute names used in Extensible Markup Language documents by associating them with namespaces identified by URI references. For more information, visit http://www.w3.org/TR/REC-xml-names/. In this specification, all SLAML-defined attributes carry the namespace prefix “sl” to indicate they are part of the SLAML standard. Proposed attributes are specified only for ASR logging and thus do not use the “sl” namespace prefix.
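
For example, in the following sketch (the values are illustrative, taken from the element examples later in this document), sl:start, sl:end and sl:outcome are SLAML-defined attributes, while name is a proposed ASR-only attribute:

<grammar-define-handler sl:start="1124126927613"
                        sl:end="1124126927617"
                        sl:outcome="success"
                        name="Global"/>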

2. Elements

This section defines the elements and attributes allowed in the ASR logging specification. Attributes and elements from the SLAML namespace are prefixed with “sl:”.

2.1 <sl:log> element

Element sl:log
Description The root element of the ASR log.
Children

Zero or more ASR-settings-handler elements.
Zero or more grammar-define-handler elements.
Zero or more recognize-request elements.
Zero or more grammar-undefine-handler elements.
Zero or more speech-resource-configuration elements.
Zero or more speech-resource-allocation elements.
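
For example, a structural sketch of an ASR log (attributes and contents elided; see Appendix A for a complete log):
<sl:log sl:class="ASREngine"
	sl:entity="Example ASR Resource"
	xmlns="http://voicexml.org/2006/asr-log">
	<speech-resource-configuration>...</speech-resource-configuration>
	<speech-resource-allocation ... />
	<grammar-define-handler ... >...</grammar-define-handler>
	<recognize-request ... >...</recognize-request>
	<grammar-undefine-handler ... />
</sl:log>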


2.2 <speech-resource-configuration> element

Element speech-resource-configuration
Description Contains engine-specific information about the speech resource. This is where various vendor-specific parameters can be logged.
Attributes None
Children Zero or more vendor-specified elements.
Parent sl:log

For example,
<speech-resource-configuration>
	...
</speech-resource-configuration>
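
A fuller example, using the vendor-specified elements from the Appendix A log:
<speech-resource-configuration>
	<license-type-configuration>Full</license-type-configuration>
	<feature-enabled>unlimited vocabulary size</feature-enabled>
	<feature-enabled>online adaptation</feature-enabled>
</speech-resource-configuration>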

2.3 <speech-resource-allocation> element

Element speech-resource-allocation
Description This element records a speech resource allocation request. It can be used to pinpoint errors in allocating speech resources, such as license resources, memory resources, etc.
Attributes
  • Required:
    • sl:handle-request specifies the request identifier.
    • sl:start specifies the start time of the segment.
    • sl:end specifies the end time of the segment.
    • sl:outcome specifies the high-level outcome classification.
Children None
Parent sl:log

For example,
<speech-resource-allocation sl:start="1124126927612"
                            sl:end="1124126927613"
                            sl:handle-request="asr-session-req-1"
                            sl:outcome="success"/>

2.4 <grammar-define-handler> element

Element grammar-define-handler
Description An element that denotes that the speech engine loaded a grammar into the speech resource. This makes the grammar available for use in a speech request.
Attributes
  • Required:
    • name specifies the name of the grammar.
    • sl:start specifies the start time of the action.
    • sl:end specifies the end time of the action.
    • sl:handle-request specifies the request identifier.
    • sl:outcome specifies the high-level outcome classification.
    • format specifies the grammar format.
    • lang specifies the grammar's language.
    • uri specifies the URI of the grammar's location.
Children Zero or one grammar-content element.
Parent sl:log or recognize-request

For example,
<grammar-define-handler sl:start="1124126927613"
                        sl:end="1124126927617"
                        sl:handle-request="define-grammar-msg-1"
                        name="Global"
                        uri="http://server.example.com/globalgram.xml"
                        format="application/srgs"
                        lang="en-US"
                        sl:outcome="success">
</grammar-define-handler>

2.5 <grammar-content> element

Element grammar-content
Description The actual inline/resolved grammar content: the grammar that was loaded into the speech resource. Because grammar text may contain XML-significant characters (such as the "<" in a tag-format declaration), the content should be escaped or wrapped in a CDATA section, as in the example below.
Attributes None
Children None
Parent grammar-define-handler

For example,
<grammar-content><![CDATA[
	    #ABNF 1.0 UTF-8;
	    language en-US;
	    mode voice;
	    tag-format <semantics/1.0>;
	    root $MainMenu;
	    $MainMenu = operator | customer service | main menu;
]]></grammar-content>


2.6 <recognize-request> element

Element recognize-request
Description This element is a record of the recognition request made by a voice browser.
Attributes
  • Required:
    • sl:start specifies the start time of the action.
    • sl:end specifies the end time of the action.
    • sl:handle-request specifies the request identifier.
    • sl:outcome specifies the high-level outcome classification.
    • sl:trace-id specifies the unique transaction identifier.
    • sl:mode specifies either parallel or sequential processing.
Children

Zero or more ASR-settings-handler elements.
Zero or more grammar-define-handler elements.
Zero or more grammar-undefine-handler elements.
One audio-stream-analysis element.
One execute-recognition element.

Parent sl:log

For example,
<recognize-request sl:start="1124126927612"
	sl:end="1124126937794"
	sl:handle-request="recognize-rqt-1"
	sl:trace-id="aae7-12adv-ef54-ea12"
	sl:mode="sequential"
	computerid="ASR Server 1194"
	sl:outcome="no-match">
	...
</recognize-request>


2.7 <audio-stream-analysis> element

Element audio-stream-analysis
Description This element indicates the ASR server began receiving an audio stream.
Attributes
  • Required:
    • sl:start specifies the start time of the action.
    • sl:end specifies the end time of the action.
    • sl:outcome specifies the high-level outcome classification.
    • input-mode specifies either dtmf or speech.
Children Zero or one audio-stream-configuration element.
One audio-features element.
Parent recognize-request

For example,
<audio-stream-analysis sl:start="1124126927630"
	sl:end="1124126927774"
	sl:outcome="success"
	input-mode="speech">
	...
</audio-stream-analysis>

2.8 <audio-stream-configuration> element

Element audio-stream-configuration
Description This element contains information specific to the ASR engine about the audio stream. These are all vendor-specific elements.
Attributes None
Children Zero or more vendor-specified elements
Parent audio-stream-analysis
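
For example, using the vendor-specified elements from the Appendix A log:
<audio-stream-configuration>
	<start-of-speech-detection>true</start-of-speech-detection>
	<end-of-speech-detection>true</end-of-speech-detection>
	<start-of-speech-rewind>300(ms)</start-of-speech-rewind>
	<start-of-speech-energy-level>0.4</start-of-speech-energy-level>
</audio-stream-configuration>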


2.9 <audio-features> element

Element audio-features
Description Information about the streamed audio.
Attributes
  • Required:
    • audiolength specifies the length of the audio in bytes.
    • audiolocation specifies the location URI of the saved audio.
    • audioformat specifies the format of the audio (pcmu, pcma...).
Children None
Parent audio-stream-analysis

For example,
<audio-features audiolength="21347"
	audiolocation="http://ASR.server.com/134dkdhsT.vox"
	audioformat="pcmu"/>

2.10 <execute-recognition> element

Element execute-recognition
Description Information about the ASR engine’s recognition process.
Attributes
  • Required:
    • sl:start specifies the start time of the action.
    • sl:end specifies the end time of the action.
    • sl:outcome specifies the high-level outcome classification.
    • active-grammar specifies the active grammar set.
    • utterance specifies the raw text recognized by the ASR.
    • semantic-interpretation specifies the text of the returned semantic interpretation.
    • language specifies the language used for this recognition.
    • confidence-level specifies the confidence threshold.
    • confidence specifies the confidence for this recognition.
Children One recognize-configuration element
One recognize-results element
Parent recognize-request

For example,
<execute-recognition sl:start="1124126940000"
	sl:end="1124126947700"
	sl:outcome="match"
	active-grammar="http://server.example.com/globalgram.xml;http://server.example.com/movie.grxml">
	...
</execute-recognition>

2.11 <recognize-configuration> element

Element recognize-configuration
Description Various engine-specific settings and information active during recognition.
Attributes None
Children Zero or more vendor-specified elements.
Parent execute-recognition
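
For example, using the parameter elements from Section 3 (the values are illustrative):
<recognize-configuration>
	<confidence-threshold>0.5</confidence-threshold>
	<n-best-list-length>5</n-best-list-length>
	<sensitivity-level>0.4</sensitivity-level>
</recognize-configuration>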


2.12 <recognize-results> element

Element recognize-results
Description The result of the recognition, based on the audio and using the activated grammars and format. The results are defined using a standard called NLSML (Natural Language Semantics Markup Language) or EMMA (Extensible MultiModal Annotation Markup Language). For more information, visit http://www.w3.org/TR/nl-spec/ and http://www.w3.org/TR/emma/
Attributes
  • Required:
    • type specifies the format of the recognition result.
Children None
Parent execute-recognition

For example,

EMMA result:
<recognize-results type="application/emma+xml">
	<emma:emma version="1.0"
			xmlns:emma="http://www.w3.org/2003/04/emma"
			xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
			xsi:schemaLocation="http://www.w3.org/2003/04/emma http://www.w3.org/TR/emma/emma10.xsd"
			xmlns="http://www.example.com/example">
		<emma:one-of>
			<emma:interpretation id="interp1"
					emma:confidence="0.8"
					emma:medium="acoustic"
					emma:mode="speech"
					emma:function="dialog"
					emma:verbal="true">
				<movie>Fantastic Four</movie>
			</emma:interpretation>
		</emma:one-of>
	</emma:emma>
</recognize-results>
NLSML result:
<recognize-results type="application/nlsml+xml">
	<interpretation x-model="http://dataModel" confidence="0.86">
		<instance>
			<movie>
				<name>
					Fantastic Four
				</name>
			</movie>
		</instance>
		<input mode="speech" confidence="0.5" timestamp-start="2006-08-13T00:00:00"
		timestamp-end="2006-08-13T00:00:02">
			Ticket for Fantastic Four
		</input>
	</interpretation>
</recognize-results>

2.13 <grammar-undefine-handler> element

Element grammar-undefine-handler
Description An element that denotes that the speech engine unloaded a grammar from the speech resource.
Attributes
  • Required:
    • name specifies the name of the grammar.
    • sl:start specifies the start time of the action.
    • sl:end specifies the end time of the action.
    • sl:handle-request specifies the request identifier.
Children None
Parent sl:log or recognize-request

For example,
<grammar-undefine-handler sl:start="1124126947775"
	sl:end="1124126947775"
	sl:handle-request="undefine-grammar-msg-1"
	name="Movienames"/>


3. ASR-Specific Parameters

Each parameter is an individual element whose value is specified as the element's content. For example:

<confidence-threshold>0.5</confidence-threshold>
<n-best-list-length>4</n-best-list-length>
<vendor-specific>enable_grammar_cache=true</vendor-specific>

The ASR-specific parameters are listed below:

confidence-threshold When a recognition resource recognizes an utterance against some portion of the grammar, it associates a confidence level with that conclusion. The confidence-threshold parameter tells the recognizer resource what confidence level should be considered a successful match. The value ranges from 0.0 to 1.0. If the recognizer determines that its confidence in all of its recognition results is below the confidence threshold, it returns no-match as the recognition result.
sensitivity-level This parameter specifies the sensitivity on detecting speech. A higher value for this parameter means higher sensitivity.
speed-vs-accuracy This parameter specifies the balance between performance and accuracy for the recognizer resource. A higher value for this parameter means higher speed and less accuracy.
n-best-list-length When the recognizer matches an incoming stream against the grammar, it may come up with more than one alternative match because of confidence levels in certain words or conversation paths. If this parameter is not specified, the recognition resource returns only the best match above the confidence threshold. By setting this parameter, the client can ask the recognition resource to return more than one alternative. All alternatives must still be above the confidence-threshold. A value greater than one does not guarantee that the recognizer will return the requested number of alternatives.
no-input-timeout When recognition is started and no speech is detected for a certain period of time, the recognizer signals a no-input condition. The no-input-timeout parameter sets this timeout value. The value is in milliseconds.
recognition-timeout When recognition is started and there is no match for a certain period of time, the recognizer signals the client to terminate the recognition operation. The recognition-timeout parameter sets this timeout value. The value is in milliseconds.
speech-complete-timeout This parameter specifies the length of silence required following user speech before the speech recognizer finalizes a result. Specified in milliseconds. The value for this field ranges from 0 to MAXTIMEOUT, where MAXTIMEOUT is platform specific.
speech-incomplete-timeout This parameter specifies the length of silence required following user speech before the speech recognizer finalizes a result, when the speech prior to the silence is incomplete: it is possible to speak further and still match the grammar. By contrast, the speech-complete-timeout applies when the speech prior to the silence is complete and it is not possible to speak further. Specified in milliseconds. The value for this field ranges from 0 to MAXTIMEOUT, where MAXTIMEOUT is platform specific.
dtmf-interdigit-timeout This parameter specifies the terminating timeout to use when recognizing DTMF input. The value is in milliseconds.
dtmf-term-char This parameter specifies the terminating DTMF character for DTMF input recognition.
vendor-specific This parameter specifies any vendor-specific parameters. The value of this will be a pair of property and value for the recognizer. (e.g. enable_grammar_cache=true)
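
For example, a recognize-configuration carrying several of these parameters (a sketch; the values shown are illustrative, and, as in the Appendix A log, the parameter elements appear as children of recognize-configuration):
<recognize-configuration>
	<confidence-threshold>0.5</confidence-threshold>
	<n-best-list-length>4</n-best-list-length>
	<no-input-timeout>5000</no-input-timeout>
	<recognition-timeout>20000</recognition-timeout>
	<speech-complete-timeout>1000</speech-complete-timeout>
	<speech-incomplete-timeout>500</speech-incomplete-timeout>
	<dtmf-interdigit-timeout>3000</dtmf-interdigit-timeout>
	<dtmf-term-char>#</dtmf-term-char>
	<vendor-specific>enable_grammar_cache=true</vendor-specific>
</recognize-configuration>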

 

Appendix A: Examples


The following example shows a complete ASR log:
<?xml version="1.0" encoding="UTF-8"?>
<sl:slaml xmlns:sl="http://voicexml.org/2006/slaml"
	sl:version="1.0">
	<sl:manifest>
		<sl:session name="Session-12345"
				    start="asr-session-req-1"
					sl:class="ASREngine" sl:log-tag="asr-log-1"/>
	</sl:manifest>

	<sl:log tag="svr-log-tag-1"
		    sl:class="ASREngine"
		    sl:entity="LumenVox Resource112"
		    xmlns="http://voicexml.org/2006/asr-log">

			<!-- the configuration of the speech resource -->
			<speech-resource-configuration>
				<license-type-configuration>Full</license-type-configuration> <!-- a description of the type of license opened -->
	    		<feature-enabled>unlimited vocabulary size</feature-enabled> <!-- Various features available on this resource -->
	    		<feature-enabled>online adaptation</feature-enabled>
			</speech-resource-configuration>

			<speech-resource-allocation sl:start="1124126927612"
										sl:end="1124126927613"
										sl:handle-request="asr-session-req-1"
										sl:outcome="success"/>

			<!-- Example of a global grammar being loaded -->
			<grammar-define-handler
				sl:start="1124126927613"
				sl:end="1124126927617"
				name="Global"
				rank="0"
				uri="http://server.example.com/globalgram.xml"
				format="SRGS_ABNF"
				lang="en-US"
				sl:handle-request="define-grammar-msg-1"
				sl:outcome="success">
				<!-- grammar-content is optional; present if captured by the ASR engine at the time of execution -->
				<grammar-content><![CDATA[
					#ABNF 1.0 UTF-8;
					language en-US;
					mode voice;
					tag-format <lumenvox/1.0>;
					root $MainMenu;
					$MainMenu = operator | customer service | main menu;
				]]></grammar-content>
			</grammar-define-handler>
			<recognize-request sl:start="1124126927612"
				sl:end="1124126937794"
				computerid="LumenVox ASR Server 1194"
				sl:handle-request="asr-1234"
				sl:trace-id="aae7-12adv-ef54-ea12"
				sl:outcome="no-match"
				sl:mode="sequential">

				<!-- showing an example of defining and activating a grammar inside recognize-request -->
				<grammar-define-handler sl:start="1124126927613"
					sl:end="1124126927617"
					sl:handle-request="define-grammar-msg-2"
					name="Movienames"
					uri="http://server.example.com/movie.xml"
					format="SRGS_ABNF"
					lang="en-US"
					sl:outcome="success">
					<grammar-content><![CDATA[
						#ABNF 1.0 UTF-8;
						language en-US;
						mode voice;
						tag-format <lumenvox/1.0>;
						root $Movie;
						$Movie = Fantastic Four { $.movie.name="Fantastic Four"; } |
								 Superman Returns { $.movie.name="Superman Returns"; };
					]]></grammar-content>
				</grammar-define-handler>

				<audio-stream-analysis sl:start="1124126927630"
					sl:end="1124126927774"
					sl:outcome="success"
					input-mode="speech">
					<audio-stream-configuration>
						<start-of-speech-detection>true</start-of-speech-detection>
						<end-of-speech-detection>true</end-of-speech-detection>
						<start-of-speech-rewind>300(ms)</start-of-speech-rewind>
						<start-of-speech-energy-level>0.4</start-of-speech-energy-level>
					</audio-stream-configuration>
					<audio-features audiolength="16475"
						audiolocation="http://ASR.server.com/134dkdhss.vox"
						audioformat="PCMU">
					</audio-features>
				</audio-stream-analysis>
				<execute-recognition sl:start="1124126927774"
						sl:end="1124126937794"
						sl:outcome="no-match"
						active-grammar="http://server.example.com/globalgram.xml;http://server.example.com/movie.xml">
					<recognize-configuration>
						<confidence-threshold>0.5</confidence-threshold>
						<n-best-list-length>5</n-best-list-length>
						<sensitivity-level>0.4</sensitivity-level>
						<noise-cancelation>false</noise-cancelation>
					</recognize-configuration>
					<recognize-results type="application/nlsml+xml">
						<interpretation>
							<instance/>
							<input>
								<nomatch/>
							</input>
						</interpretation>
					</recognize-results>
				</execute-recognition>

				<!-- showing an example of undefining and deactivating a grammar inside recognize-request -->
				<grammar-undefine-handler sl:start="1124126947775"
					sl:end="1124126947775"
					sl:handle-request="undefine-grammar-msg-2"
					name="Movienames"/>
			</recognize-request>
			<recognize-request sl:start="1124126937775"
				sl:end="1124126947775"
				sl:handle-request="asr-1237"
				sl:outcome="success"
				sl:trace-id="aae7-12adv-ef54-ea31"
				sl:mode="sequential">

				<audio-stream-analysis sl:start="1124126937900"
					sl:end="1124126940000"
					sl:outcome="success"
					input-mode="speech">

					<audio-stream-configuration>
						<end-of-speech-detection>true</end-of-speech-detection>
						<start-of-speech-rewind>300(ms)</start-of-speech-rewind>
						<start-of-speech-energy-level>0.4</start-of-speech-energy-level>
					</audio-stream-configuration>
					<audio-features audiolength="21347"
						audiolocation="http://ASR.server.com/134dkdhsT.vox"
						audioformat="PCMU">
					</audio-features>
				</audio-stream-analysis>

				<execute-recognition sl:start="1124126940000"
					sl:end="1124126947700"
					sl:outcome="success"
					active-grammar="http://server.example.com/movie.xml">

					<recognize-configuration>
						<confidence-threshold>0.5</confidence-threshold>
						<n-best-list-length>5</n-best-list-length>
						<sensitivity-level>0.4</sensitivity-level>
					</recognize-configuration>

					<!-- Note: we are using EMMA -->
					<recognize-results type="application/emma+xml">
						<emma:emma version="1.0"
							    xmlns:emma="http://www.w3.org/2003/04/emma"
							    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
							    xsi:schemaLocation="http://www.w3.org/2003/04/emma
							    http://www.w3.org/TR/emma/emma10.xsd"
		    					xmlns="http://www.example.com/example">
							<emma:one-of id="list">
    							<emma:interpretation id="interp1"
     							  		emma:confidence="0.8"
     									emma:medium="acoustic"
     									emma:mode="speech"
     									emma:function="dialog"
     									emma:verbal="true">
       								<movie>Fantastic Four</movie>
    							</emma:interpretation>
							</emma:one-of>
						</emma:emma>
					</recognize-results>
  				</execute-recognition>
  			</recognize-request>

			<recognize-request sl:start="1124126937775"
				sl:end="1124126947775"
				sl:handle-request="asr-1239"
				sl:outcome="success"
				sl:trace-id="aae7-12adv-ef54-ea32"
				sl:mode="sequential">
				  <!-- Contents omitted -->
			</recognize-request>

			<!-- showing an example of undefining and deactivating a grammar outside recognize-request -->
			<grammar-undefine-handler sl:start="1124126947775"
				sl:end="1124126947775"
				sl:handle-request="undefine-grammar-msg-3"
				name="Global"/>
	</sl:log>
</sl:slaml>

Appendix B: Acknowledgments

Jeffrey Marcus (Nuance)
David Thomson (SpeechPhone)
Andrew Wahbe (Genesys)

Appendix C: Revision History

Appendix D: References

[MRCP]
"Media Resource Control Protocol (MRCP)", IETF RFC 4463, 2006.
See http://www.ietf.org/rfc/rfc4463.txt

[SLAML]
"Session Log Annotation Markup Language", Andrew Wahbe, 2007.

[VOICEXML-2.0]
"Voice Extensible Markup Language (VoiceXML) Version 2.0", McGlashan et al., W3C Recommendation, March 2004.
See http://www.w3.org/TR/voicexml20/

 

Copyright © 2000 - 2007 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the
IEEE Industry Standards and Technology Organization (IEEE-ISTO)
For inquiries contact voicexml-admin@voicexml.org