VoiceXML Forum, Tools Working Group
Draft 1.8 - Internal Working Draft - 31 July 2007
For Public Review 20 August 2007
About the VoiceXML Forum:
Founded in 1999, the VoiceXML Forum is an industry organization whose mission is to promote and accelerate the worldwide adoption of VoiceXML-based applications. To this end, the Forum serves as an educational and technical resource, a certification authority, and a contributor and liaison to international standards bodies such as the World Wide Web Consortium (W3C), IETF, ANSI and ISO. The VoiceXML Forum is organized as a program of the IEEE Industry Standards and Technology Organization (IEEE-ISTO). Membership in the Forum is open to any interested company. For more information, please visit the Website at www.voicexml.org.
Disclaimers:
This document is subject to change without notice and may be updated, replaced or made obsolete by other documents at any time. The VoiceXML Forum disclaims any and all warranties, whether express or implied, including (without limitation) any implied warranties of merchantability or fitness for a particular purpose.
The descriptions contained herein do not imply the granting of licenses to make, use, sell, license or otherwise transfer any technology required to implement systems or components conforming to this specification. The VoiceXML Forum, and its member companies, makes no representation on technology described in this specification regarding existing or future patent rights, copyrights, trademarks, trade secrets or other proprietary rights.
By submitting information to the VoiceXML Forum, and its member companies, including but not limited to technical information, you agree that the submitted information does not contain any confidential or proprietary information, and that the VoiceXML Forum may use the submitted information without any restrictions or limitations.
Element | sl:log |
---|---|
Description | The root element of the ASR log. |
Children | Zero or more child elements (speech-resource-configuration, speech-resource-allocation, grammar-define-handler, recognize-request, grammar-undefine-handler). |
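For orientation, a minimal sl:log skeleton is sketched below; the opening tag is taken from the complete example at the end of this section, and the comment simply lists the children that may appear.
<sl:log tag="svr-log-tag-1" sl:class="ASREngine" sl:entity="LumenVox Resource112" xmlns="http://voicexml.org/2006/asr-log">
  <!-- speech-resource-configuration, speech-resource-allocation, grammar-define-handler,
       recognize-request and grammar-undefine-handler entries appear here -->
</sl:log>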
Element | speech-resource-configuration |
---|---|
Description | Contains engine-specific information about the speech resource. This is where various vendor-specific parameters can be logged. |
Attributes | None |
Children | Zero or more vendor-specified elements. |
Parent | sl:log |
<speech-resource-configuration> .. </speech-resource-configuration>
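For illustration, the fragment below is drawn from the complete example at the end of this section; the child elements (license-type-configuration, feature-enabled) are vendor-specific and are not defined by this specification.
<speech-resource-configuration>
  <license-type-configuration>Full</license-type-configuration>
  <feature-enabled>unlimited vocabulary size</feature-enabled>
  <feature-enabled>on line adaptation</feature-enabled>
</speech-resource-configuration>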
Element | speech-resource-allocation |
---|---|
Description | This element handles the speech resource allocation request. It can be used to pinpoint errors when allocating speech resources, including license resources, memory resources, etc. |
Attributes | sl:start, sl:end, sl:handle-request, outcome |
Children | None |
Parent | sl:log |
<speech-resource-allocation sl:start="1124126927612" sl:end="1124126927613" sl:handle-request="asr-session-req-1" outcome="success" />
Element | grammar-define-handler |
---|---|
Description | An element that denotes the speech engine loaded a grammar into the speech resource. This makes the grammar available for use in a speech request. |
Attributes | sl:start, sl:end, sl:handle-request, name, uri, format, lang, outcome |
Children | Zero or one grammar-content element. |
Parent | sl:log or recognize-request |
<grammar-define-handler sl:start="1124126927613" sl:end="1124126927617" sl:handle-request="define-grammar-msg-1" name="Global" uri="http://server.example.com/globalgram.xml" format="application/srgs" lang="en-US" outcome="success"> </grammar-define-handler>
Element | grammar-content |
---|---|
Description | The actual inline/resolved grammar content. This would contain the grammar that was loaded into the speech resource. |
Attributes | None |
Children | None |
Parent | grammar-define-handler |
<grammar-content>
  #ABNF 1.0 UTF-8;
  language en-US;
  mode voice;
  tag-format <semantics/1.0>;
  root $MainMenu;
  $MainMenu = operator | customer service | main menu;
</grammar-content>
Element | recognize-request |
---|---|
Description | This element is a record of the recognition request made by a voice browser. |
Attributes | sl:start, sl:end, sl:handle-request, computerid, outcome |
Children | Zero or more grammar-define-handler, audio-stream-analysis, execute-recognition, and grammar-undefine-handler elements. |
Parent | sl:log |
<recognize-request sl:start="1124126927612" sl:end="1124126937794" sl:handle-request="recognize-rqt-1" computerid="ASR Server 1194" outcome="nomatch"> ... </recognize-request>
Element | audio-stream-analysis |
---|---|
Description | This element indicates the ASR server began receiving an audio stream. |
Attributes | sl:start, sl:end, outcome, mode |
Children | Zero or one audio-stream-configuration element; one audio-feature element. |
Parent | recognize-request |
<audio-stream-analysis sl:start="1124126927630" sl:end="1124126927774" outcome="success" mode="speech"> ... </audio-stream-analysis>
Element | audio-stream-configuration |
---|---|
Description | This element contains information specific to the ASR engine about the audio stream. These are all vendor-specific elements. |
Attributes | None |
Children | Zero or more vendor-specified elements |
Parent | audio-stream-analysis |
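For illustration, the fragment below is drawn from the complete example at the end of this section; the child elements shown (start-of-speech-detection, end-of-speech-detection, start-of-speech-rewind, start-of-speech-energy-level) are vendor-specific and not defined by this specification.
<audio-stream-configuration>
  <start-of-speech-detection>true</start-of-speech-detection>
  <end-of-speech-detection>true</end-of-speech-detection>
  <start-of-speech-rewind>300(ms)</start-of-speech-rewind>
  <start-of-speech-energy-level>0.4</start-of-speech-energy-level>
</audio-stream-configuration>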
Element | audio-feature |
---|---|
Description | Information about the streamed audio. |
Attributes | sl:audiolength, sl:audiolocation, sl:audioformat |
Children | None |
Parent | audio-stream-analysis |
<audio-features sl:audiolength="21347" sl:audiolocation="http://ASR.server.com/134dkdhsT.vox" sl:audioformat="ulaw"> ... </audio-features>
Element | execute-recognition |
---|---|
Description | Information about the ASR engine’s recognition process. |
Attributes | active-grammar, sl:start, sl:end, outcome |
Children | One recognize-configuration element; one recognize-results element. |
Parent | recognize-request |
<execute-recognition active-grammar="http://server.example.com/globalgram.xml;http://server.example.com/movie.grxml" sl:start="1124126940000" sl:end="1124126947700" outcome="match"> ... </execute-recognition>
Element | recognize-configuration |
---|---|
Description | Various engine-specific settings and information active during recognition. |
Attributes | None |
Children | Zero or more vendor-specified elements |
Parent | execute-recognition |
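For illustration, the fragment below is drawn from the complete example at the end of this section; the child elements shown are engine-specific and their values are illustrative only.
<recognize-configuration>
  <confidence-threshold>500</confidence-threshold>
  <n-best-list-length>5</n-best-list-length>
  <sensitivity-level>0.4</sensitivity-level>
  <noise-cancelation>false</noise-cancelation>
</recognize-configuration>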
Element | recognize-results |
---|---|
Description | The result of the recognition, based on the audio and using the activated grammars and format. The results are defined using a standard format: NLSML (Natural Language Semantics Markup Language) or EMMA (Extensible MultiModal Annotation markup language). For more information, visit http://www.w3.org/TR/nl-spec/ and http://www.w3.org/TR/emma/ |
Attributes | type |
Children | None |
Parent | execute-recognition |
EMMA result:
<recognize-results type="application/emma+xml"> <emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2003/04/emma http://www.w3.org/TR/emma/emma10.xsd" xmlns="http://www.example.com/example"> <emma:one-of> <emma:interpretation id="interp1" emma:confidence="0.8" emma:medium="acoustic" emma:mode="speech" emma:function="dialog" emma:verbal="true"> <movie>Fantastic Four</movie> </emma:interpretation> </emma:one-of> </emma:emma> </recognize-results>
NLSML result:
<recognize-results type="application/nlsml+xml"> <interpretation x-model="http://dataModel" confidence="860"> <instance> <movie> <name> Fantastic Four </name> </movie> </instance> <input mode="speech" confidence="0.5" timestamp-start="2006-08-13T00:00:00" timestamp-end="2006-08-13T00:00:02"> Ticket for Fantastic Four </input> </interpretation> </recognize-results>
Element | grammar-undefine-handler |
---|---|
Description | An element that denotes that the speech engine unloaded a grammar from the speech resource. |
Attributes | sl:start, sl:handle-request, Name |
Children | None |
Parent | speech-resource-allocation or recognize-request |
<grammar-undefine-handler sl:start="1124126947775" sl:handle-request="undefine-grammar-msg-1" Name="Movienames"/>
All parameters are individual elements whose value is specified as the content of the parameter element, e.g.:
<confidence-threshold>0.5</confidence-threshold>
<n-best-list-length>4</n-best-list-length>
<vendor-specific>enable_grammar_cache=true</vendor-specific>
The ASR-specific parameters are listed below:
Parameter | Description |
---|---|
confidence-threshold | When a recognition resource recognizes an utterance with some portion of the grammar, it associates a confidence level with that conclusion. The confidence-threshold parameter tells the recognizer resource what confidence level should be considered a successful match. This is a value from 0.0 to 1.0 indicating the recognizer's confidence in the recognition. If the recognizer determines that its confidence in all its recognition results is less than the confidence threshold, then it returns no-match as the recognition result. |
sensitivity-level | This parameter specifies the sensitivity on detecting speech. A higher value for this parameter means higher sensitivity. |
speed-vs-accuracy | This parameter specifies the balance between performance and accuracy for the recognizer resource. A higher value for this parameter means higher speed and less accuracy. |
n-best-list-length | When the recognizer matches an incoming stream with the grammar, it may come up with more than one alternative match because of confidence levels in certain words or conversation paths. If this parameter is not specified, by default, the recognition resource will only return the best match above the confidence threshold. The client, by setting this parameter, could ask the recognition resource to send it more than 1 alternative. All alternatives must still be above the confidence-threshold. A value greater than one does not guarantee that the recognizer will send the requested number of alternatives. |
no-input-timeout | When recognition is started and no speech is detected within a certain period of time, the recognizer can end the recognition operation; the no-input-timeout parameter sets this timeout value. |
recognition-timeout | When recognition is started and there is no match for a certain period of time, the recognizer signals the client to terminate the recognition operation. The recognition-timeout parameter sets this timeout value. The value is in milliseconds. |
speech-complete-timeout | This parameter specifies the length of silence required following user speech before the speech recognizer finalizes a result. Specified in milliseconds. The value for this field ranges from 0 to MAXTIMEOUT, where MAXTIMEOUT is platform specific. |
speech-incomplete-timeout | This parameter specifies the length of silence required following user speech before the speech recognizer finalizes a result. The speech prior to the silence is incomplete in that it is still possible to speak further and match the grammar. By contrast, the speech-complete-timeout applies when the speech prior to the silence is complete and it is not possible to speak further. Specified in milliseconds. The value for this field ranges from 0 to MAXTIMEOUT, where MAXTIMEOUT is platform specific. |
dtmf-interdigit-timeout | This parameter specifies the terminating timeout to use when recognizing DTMF input. The value is in milliseconds. |
dtmf-term-char | This parameter specifies the terminating DTMF character for DTMF input recognition. |
vendor-specific | This parameter specifies any vendor-specific parameters. The value of this will be a pair of property and value for the recognizer. (e.g. enable_grammar_cache=true) |
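As a sketch, the fragment below shows how several of these parameters might appear together inside a recognize-configuration element; all values are illustrative only. The complete example that follows shows a subset of these parameters in the context of a full ASR session log.
<recognize-configuration>
  <confidence-threshold>0.5</confidence-threshold>
  <n-best-list-length>3</n-best-list-length>
  <sensitivity-level>0.5</sensitivity-level>
  <speed-vs-accuracy>0.5</speed-vs-accuracy>
  <no-input-timeout>5000</no-input-timeout>
  <recognition-timeout>10000</recognition-timeout>
  <speech-complete-timeout>800</speech-complete-timeout>
  <speech-incomplete-timeout>1500</speech-incomplete-timeout>
  <dtmf-interdigit-timeout>3000</dtmf-interdigit-timeout>
  <dtmf-term-char>#</dtmf-term-char>
  <vendor-specific>enable_grammar_cache=true</vendor-specific>
</recognize-configuration>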
<?xml version="1.0" encoding="UTF-8"?>
<sl:slaml xmlns:sl="http://voicexml.org/2006/slaml" sl:version="1.0">
  <sl:manifest>
    <sl:session name="Session-12345" start="asr-session-req-1" sl:class="ASREngine" sl:log-tag="asr-log-1"/>
  </sl:manifest>
  <sl:log tag="svr-log-tag-1" sl:class="ASREngine" sl:entity="LumenVox Resource112" xmlns="http://voicexml.org/2006/asr-log">
    <!-- the configuration of the speech resource -->
    <speech-resource-configuration>
      <license-type-configuration>Full</license-type-configuration> <!-- a description of the type of license opened -->
      <feature-enabled>unlimited vocabulary size</feature-enabled> <!-- Various features available on this resource -->
      <feature-enabled>on line adaptation</feature-enabled>
    </speech-resource-configuration>
    <speech-resource-allocation sl:start="1124126927612" sl:end="1124126927613" sl:handle-request="asr-session-req-1" outcome="success"/>
    <!-- Example of a global grammar being loaded -->
    <grammar-define-handler sl:start="1124126927613" sl:end="1124126927617" name="Global" rank="0" uri="http://server.example.com/globalgram.xml" format="SRGS_ABNF" lang="en-US" sl:handle-request="define-grammar-msg-1" outcome="success">
      <grammar-content>
        <!-- Optional if this was captured by the ASR engine at the time of execution -->
        #ABNF 1.0 UTF-8;
        language en-US;
        mode voice;
        tag-format <lumenvox/1.0>;
        root $MainMenu;
        $MainMenu = operator | customer service | main menu;
      </grammar-content>
    </grammar-define-handler>
    <recognize-request sl:start="1124126927612" sl:end="1124126937794" computerid="LumenVox ASR Server 1194" sl:handle-request="asr-1234" sl:trace-id="aae7-12adv-ef54-ea12" outcome="no-match" sl:mode="sequential">
      <!-- showing an example of defining and activating a grammar inside a speech request -->
      <grammar-define-handler sl:start="1124126927613" sl:end="1124126927617" name="Movienames" uri="http://server.example.com/movie.xml" format="SRGS_ABNF" lang="en-US" outcome="success">
        <grammar-content>
          #ABNF 1.0 UTF-8;
          language en-US;
          mode voice;
          tag-format <lumenvox/1.0>;
          root $Movie;
          $Movie = Fantastic Four { $.movie.name="Fantastic Four"; } |
                   Superman Returns { $.movie.name="Superman Returns"; };
        </grammar-content>
      </grammar-define-handler>
      <audio-stream-analysis sl:start="1124126927630" sl:end="1124126927774" outcome="success" mode="speech">
        <audio-stream-configuration>
          <start-of-speech-detection>true</start-of-speech-detection>
          <end-of-speech-detection>true</end-of-speech-detection>
          <start-of-speech-rewind>300(ms)</start-of-speech-rewind>
          <start-of-speech-energy-level>.4</start-of-speech-energy-level>
        </audio-stream-configuration>
        <audio-features audiolength="16475" audiolocation="http://ASR.server.com/134dkdhss.vox" audioformat="PCMU">
        </audio-features>
      </audio-stream-analysis>
      <execute-recognition sl:start="1124126927774" sl:end="1124126937794" outcome="no-match" active-grammar="http://server.example.com/globalgram.xml;http://server.example.com/movie.xml">
        <recognize-configuration>
          <confidence-threshold>500</confidence-threshold>
          <n-best-list-length>5</n-best-list-length>
          <sensitivity-level>0.4</sensitivity-level>
          <noise-cancelation>false</noise-cancelation>
        </recognize-configuration>
        <recognize-results type="NLSML">
          <interpretation>
            <instance/>
            <input>
              <nomatch/>
            </input>
          </interpretation>
        </recognize-results>
      </execute-recognition>
      <!-- showing an example of undefining and deactivating a grammar inside a speech request -->
      <grammar-undefine-handler sl:time="1124126947775" Name="Movienames"/>
    </recognize-request>
    <recognize-request sl:start="1124126937775" sl:end="1124126947775" sl:handle-request="asr-1237" sl:outcome="success" sl:trace-id="aae7-12adv-ef54-ea31">
      <audio-stream-analysis sl:start="1124126937900" sl:end="1124126940000" sl:outcome="success" sl:mode="speech">
        <audio-stream-configuration>
          <end-of-speech-detection>true</end-of-speech-detection>
          <start-of-speech-rewind>300(ms)</start-of-speech-rewind>
          <start-of-speech-energy-level>0.4</start-of-speech-energy-level>
        </audio-stream-configuration>
        <audio-features audiolength="21347" audiolocation="http://ASR.server.com/134dkdhsT.vox" audioformat="PCMU">
        </audio-features>
      </audio-stream-analysis>
      <execute-recognition sl:start="1124126940000" sl:end="1124126947700" sl:outcome="success" activegrammars="http://server.example.com/movie.xml">
        <recognize-configuration>
          <confidence-threshold>500</confidence-threshold>
          <n-best-list-length>5</n-best-list-length>
          <sensitivity-level>0.4</sensitivity-level>
        </recognize-configuration>
        <!-- Note: we are using EMMA -->
        <recognize-results type="EMMA">
          <emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2003/04/emma http://www.w3.org/TR/emma/emma10.xsd" xmlns="http://www.example.com/example">
            <emma:one-of id="list">
              <emma:interpretation id="interp1" emma:confidence="0.8" emma:medium="acoustic" emma:mode="speech" emma:function="dialog" emma:verbal="true">
                <movie>Fantastic Four</movie>
              </emma:interpretation>
            </emma:one-of>
          </emma:emma>
        </recognize-results>
      </execute-recognition>
    </recognize-request>
    <recognize-request sl:start="1124126937775" sl:end="1124126947775" sl:handle-request="asr-1239" sl:outcome="success" sl:trace-id="aae7-12adv-ef54-ea32">
      <!-- Contents omitted -->
    </recognize-request>
    <!-- showing an example of undefining and deactivating a grammar outside a speech request -->
    <grammar-undefine-handler sl:time="1124126947775" sl:handle-request="undefine-grammar-msg-3" Name="global"/>
  </sl:log>
</sl:slaml>