VoiceXML Review - Feature - VoiceXML and Voice-over-IP

Volume 6, Issue 3 - Sep/Oct 2006

VoiceXML and Voice-over-IP

Ian Sutherland and Pete Danielsen

VoiceXML 2.0 is a W3C Recommendation for specifying audio dialogs between machines and people. VoiceXML directs a machine to output pre-recorded audio or synthesized speech, to recognize spoken or DTMF (telephone keypad) input, and to record audio input. VoiceXML can also perform some telephony functions, such as call transfer, through the <transfer> and <disconnect> tags, which provide an abstract interface to the underlying telephony platform.

VoiceXML follows the web model of separating presentation from logic. A VoiceXML interpreter client places a request to a web server via HTTP. An application on the server contains the logic to handle the request and returns a VoiceXML document containing the "presentation", which is the dialog to conduct next. Under the direction of the VoiceXML document, the interpreter plays prompts, collects input, and submits input to the logic running on the web server. This cycle repeats over the course of a session.
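The request/response cycle described above can be illustrated with a small server-side sketch. It is only an illustration under stated assumptions: the function name next_dialog, the /weather URL, the city field, and the city.grxml grammar file are all hypothetical, and a real application would sit behind an HTTP framework rather than being called directly.

```python
from xml.etree import ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"  # VoiceXML 2.0 namespace


def next_dialog(city=None):
    """Server-side logic: decide which VoiceXML 'presentation' to return next.

    `city` stands for input the interpreter submitted on a previous turn
    (hypothetical field name); None means this is the first request.
    """
    vxml = ET.Element("vxml", {"version": "2.0", "xmlns": VXML_NS})
    form = ET.SubElement(vxml, "form")
    if city is None:
        # First turn: prompt the caller, collect input, submit it back.
        field = ET.SubElement(form, "field", {"name": "city"})
        ET.SubElement(field, "prompt").text = "Which city?"
        # Hypothetical grammar file describing the accepted city names.
        ET.SubElement(field, "grammar",
                      {"src": "city.grxml", "type": "application/srgs+xml"})
        filled = ET.SubElement(field, "filled")
        # Hypothetical URL of the logic that handles the collected input.
        ET.SubElement(filled, "submit", {"next": "/weather", "namelist": "city"})
    else:
        # Later turn: the logic has the input and returns the next dialog.
        block = ET.SubElement(form, "block")
        ET.SubElement(block, "prompt").text = f"Here is the forecast for {city}."
    return ET.tostring(vxml, encoding="unicode")
```

Calling next_dialog() yields the document for the first turn; calling next_dialog("Boston") yields the follow-up document the interpreter would fetch after submitting the caller's answer.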

VoiceXML interpreters have, in the past, been implemented on traditional TDM telephony platforms, but Voice-over-IP (VoIP) is increasingly being used instead of, or in addition to, TDM. We asked several experts on VoIP and VoiceXML to answer some questions about the impact of VoIP on VoiceXML, and on telephony and IVR applications in general. The participants are:

Dave Burke, Chief Technology Officer, Voxpilot
Don Jackson, VP of Advanced Telephony, Tellme
Mark Scott, VP of Development, Genesys

What do we mean by "Voice over Internet Protocol" exactly?

Dave Burke: Voice-over-IP (VoIP) is a general term referring to the approach of routing voice traffic over packet-switched Internet Protocol (IP) networks. Traditionally, voice traffic carried by the Public Switched Telephone Network (PSTN) is routed via circuit-switched means involving a dedicated circuit for each call maintained for the entire duration of the call. The voice signal is carried by analog means over the circuit, as a time-varying voltage, or by digital means, via Pulse Code Modulation (PCM), in which audio is sampled at a fixed rate, each sample is represented by a binary value, and the binary values are sent at a fixed bit-rate.

VoIP, on the other hand, uses a packet-switched approach. The voice signal is digitally sampled and broken up into short time segments, typically with a fixed duration in the region of 20ms. Each segment is carried in a data packet that is sent sequentially. The data packet includes information such as the destination address in addition to a sequence number and timestamp to indicate to the receiver in what order and exactly when to render each segment. One notable advantage of the packet-based approach over the circuit-based one stems from the use of compressing codecs to reduce the size of the voice data. Smaller packets have lower bandwidth requirements, thus allowing the network to handle an increased number of simultaneous calls. With the circuit-based approach, since the circuit's bandwidth is fixed, there is no saving obtained from reducing the data size. Since new codecs can be deployed easily on IP networks, VoIP has spurred considerable innovation in codec algorithms, with the most popular examples using a variation of the Code Excited Linear Prediction (CELP) approach such as G.723.1, G.729, and AMR.
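The bandwidth argument can be made concrete with a little arithmetic. The sketch below assumes 20 ms packets, IPv4 with no header compression, and the nominal codec rates of 64 kbit/s for G.711 and 8 kbit/s for G.729; link-layer overhead is ignored, so real figures will be somewhat higher.

```python
# Per-packet header overhead: IPv4 (20) + UDP (8) + RTP fixed header (12) bytes.
IP_UDP_RTP_HEADER_BYTES = 20 + 8 + 12


def pcm_bit_rate(sample_rate_hz, bits_per_sample):
    """Bit rate of a PCM stream: fixed sample rate times bits per sample."""
    return sample_rate_hz * bits_per_sample


def voip_bit_rate(codec_bit_rate, packet_ms=20):
    """Approximate on-the-wire bit rate of one direction of a VoIP call."""
    packets_per_second = 1000 // packet_ms
    payload_bytes = codec_bit_rate * packet_ms // 8000  # bits/s -> bytes/packet
    return (payload_bytes + IP_UDP_RTP_HEADER_BYTES) * 8 * packets_per_second


g711 = pcm_bit_rate(8000, 8)         # 64000 bit/s, the PSTN's PCM rate
print(voip_bit_rate(g711))           # 80000 bit/s with headers
print(voip_bit_rate(8000))           # 24000 bit/s for an 8 kbit/s codec
```

Under these assumptions a compressing codec cuts the per-call rate from 80 kbit/s to 24 kbit/s, even though the fixed header overhead limits the saving to less than the raw 8:1 codec ratio.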

Borrowing lessons learned from the PSTN world, VoIP paradigms separate call signaling from media transport. This separation allows signaling to follow a different path from the media. The signaling will often traverse several nodes (possibly undergoing some manipulation en route) while the media will flow directly end-to-end for efficiency. In VoIP, the two most common signaling protocols are H.323 and SIP. The former resulted from a mapping of PSTN approaches to the Internet while the latter evolved from applying Internet approaches to telecommunications (whimsically, H.323 is the result of "telco geeks discovering the Internet" while SIP is the result of "Internet geeks discovering telecoms"!). Of the two, SIP has become the dominant approach. Media transport is usually achieved via the Real-time Transport Protocol (RTP), a lightweight protocol that provides end-to-end network transport functions for real-time data such as audio and video. RTP prepends sequence numbers, timestamps, and payload identifiers to voice packets and is usually transported over UDP.
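To show what RTP prepends to each voice packet, here is a minimal sketch that packs and unpacks the 12-byte fixed RTP header. It assumes no padding, no header extension, and no contributing sources; the function names are illustrative and not part of any library.

```python
import struct

# Fixed RTP header: 2 flag bytes, 16-bit sequence number,
# 32-bit timestamp, 32-bit synchronization source (SSRC) identifier.
RTP_HEADER = struct.Struct("!BBHII")


def pack_rtp_header(payload_type, seq, timestamp, ssrc, marker=False):
    """Build the fixed header that precedes one segment of voice data."""
    byte0 = 2 << 6  # version 2; padding, extension, and CSRC count all zero
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return RTP_HEADER.pack(byte0, byte1, seq & 0xFFFF,
                           timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)


def unpack_rtp_header(packet):
    """Read the fields a receiver needs to order and time the audio."""
    byte0, byte1, seq, timestamp, ssrc = RTP_HEADER.unpack(packet[:12])
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "seq": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }
```

For example, a 20 ms G.711 mu-law packet (payload type 0) would carry 160 payload bytes after this header, and consecutive packets would advance the sequence number by 1 and the timestamp by 160.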

As VoIP technologies have advanced, the concept of Next-Generation Networks (NGN) has evolved. The shining example of an NGN today is the IP Multimedia Subsystem (IMS). IMS is the 3rd Generation Partnership Project's (3GPP) vision for a converged telecommunications architecture that merges cellular and Internet technologies to uniformly deliver voice, video, and data on a single network. IMS is a standardized architecture that employs VoIP technology based on a 3GPP profile of SIP, and runs over the standard packet-based IP network. Since an IMS network is independent of the access network, it is equally applicable to fixed-line and wireless networks. IMS defines a collection of logical nodes. For example, the interface with the legacy PSTN is provided by specialized signaling and media gateways, interactive voice response (IVR) is provided by media servers, referred to as the Media Resource Function (MRF) in IMS, and call routing logic can be programmed on SIP application servers residing in the "service layer". Telecoms operators and service providers can source different components from different network equipment providers to tailor their network to their exact needs and budget.

Don Jackson: Two other VoIP protocols that one often hears about are the MGCP/Megaco/H.248 family and IAX. The protocols in the MGCP family are frequently used as control protocols within a functionally decomposed VoIP system, but MGCP is not a signaling protocol, and is really complementary to, rather than competing with, SIP and H.323. IAX (http://www.cornfed.com/iax.pdf) is a public protocol that has not yet been blessed by a standards body. It was developed and used by the Asterisk open source PBX project. IAX doesn't separate signaling and voice transport. Skype has an integrated VoIP solution that uses proprietary voice and signaling protocols.

Mark Scott: VoIP has been used for many years, despite having gained visibility only recently. The earliest use was purely internal, within carrier and enterprise networks, to minimize transport costs - a few years ago, 15% of all "traditional" international long distance calls were carried over VoIP; that number is surely higher today. More recently, VoIP has gained increased visibility, being used to deliver full local and long-distance phone service, via offerings such as Vonage and AT&T CallVantage. Direct, user-to-user VoIP via services such as Skype has also gained widespread acceptance.

What are the advantages of VoIP?

Don Jackson: VoIP systems use widely available IP networks for transport, and, given the scale and ongoing investment in Internet technologies, the cost of these networks is often lower than the alternatives. There are also savings to be found in having just one network to deploy and manage, versus having separate voice and data networks.

Some VoIP protocols like SIP act in a peer-to-peer manner in the sense that, ultimately, the two endpoints are the only two elements that agree on the parameters of the phone call, so new telephony features can be implemented by the endpoints/phones without changing the network in the middle. This just isn't possible in the TDM implementation of today's PSTN.

Mark Scott: VoIP offers a number of substantial advantages over traditional telephony, but the benefits that apply depend on the way VoIP is used.

Consider enterprise telephony: there is no such thing as a "switch" or "hub" for voice circuits, so cables running from every phone back to the central phone system (or some extension of it) are required. Data networks are much more readily deployed, with the ability to leverage wireless technology where necessary. Data-based communications systems also allow richer overall communication - video, text (IM), application sharing/co-browsing can also be part of a single communication session. As mentioned in the previous answer, VoIP typically results in lower switching costs, something that drove the initial growth of VoIP and that remains relevant today.

For service providers, the flexible nature of voice - any call can be routed anywhere, via any chain of intermediate proxies - provides a much richer service creation environment, allowing them to create new services without having to rebuild their networks or to roll out network-wide upgrades to their switching software. For both service providers and enterprises, VoIP typically has much lower operational and administrative costs.

Dave Burke: In a nutshell, the advantages of VoIP are functionality, cost, and mobility. VoIP eases the addition of new features and capabilities such as location independence, video calling and conferencing, picture and video sharing, "click-to-call" Web features, and integrated presence. Since economies of scale may be exploited by combining voice and data on a single network, costs can often be reduced. The very concept of a public Internet and high quality, robust, and effective services such as Skype have rendered telephone calls free anywhere an Internet connection is available. Finally, VoIP brings with it mobility. No longer tethered to a fixed line or the coverage of one's mobile network and its partners, VoIP devices may be connected anywhere on the Internet and still be globally reachable.

What are the impacts of VoIP on speech recognition and speech synthesis? Do you have to worry about things like dropped packets?

Mark Scott: The impact of VoIP on speech recognition depends on how it is used. Many speech recognition systems use VoIP simply to separate traditional PSTN interfaces (such as T-1 or E-1 circuits) from the systems used to deliver speech recognition applications (such as VoiceXML platforms). This kind of separation can use G.711 - the same codec as used on PSTN circuits - without any kind of conversion or transcoding, and without any degradation in quality. Further, because the PSTN interfaces, provided by a VoIP/PSTN gateway, may even be in the same physical rack as the voice platform, a high-speed LAN with no dropped packets can be used to interconnect the two, resulting in absolutely no difference to the end user.

If non-G.711 codecs are used, which is often the case if VoIP is used in bandwidth-sensitive environments (such as consumer VoIP over the Internet), then this can have an impact on speech recognition. Low bit-rate speech coders, which reduce the number of bytes required for voice communication, are generally based on the principle of eliminating information that has minimal or zero impact on human perception. For instance, most humans, when listening to a loud 1,000Hz tone played simultaneously with a soft 1,100Hz tone, cannot hear the softer tone, and thus do not hear anything different when the tone is removed. Unfortunately, speech recognition engines make use of different principles than the human brain in recognizing speech; codecs that remove certain kinds of information can have a negative impact on recognition success rates. Normally, the reduction is nominal, and speech recognition applications are still entirely usable - indeed, mobile networks use similar compression technology to reduce bandwidth required, yet speech recognition applications can still be used on mobile phones despite the diminished accuracy compared to a land line. However, where possible, implementers of speech recognition systems should use G.711 unless it is prohibited by other constraints.

On the topic of codecs, VoIP does open the door for end-to-end systems implemented using VoIP to deliver better speech recognition quality than existing PSTN networks. Through applying wideband codecs, which provide a higher fidelity representation of the human voice, or distributed speech recognition (DSR) coders, which extract the key features of human speech that are critical for automated recognizers and encode those features with very small amounts of data, recognition rates that are superior to those of "perfect" PSTN environments are entirely achievable. This is particularly true in mobile networks with DSR, since it avoids the recognition-degrading codecs that are otherwise used.

Packet loss in VoIP systems does indeed affect recognition quality. However, this is by no means specific to speech recognition systems; human callers are just as sensitive, if not more so, to the impacts of packet loss. Fortunately, much of the early concern about packet loss stemmed from early VoIP systems over late-1990s Internet networks, which were subject to substantially higher packet loss (generally due to saturation) than the networks of today. While packet loss is a reality, it has a much smaller impact than other effects, such as dropouts on mobile networks, which occur with much greater frequency.

VoIP affects speech synthesis no differently than it affects other kinds of content, such as pre-recorded prompts. Quality loss due to the use of VoIP codecs, or due to packet loss, is the same for live person-to-person audio, pre-recorded audio, and synthesized speech.

There is a small impact on efficiency in non-G.711 networks when speech synthesis is used. Pre-recorded audio can be transcoded once to the format in use - such as G.729. Synthesized audio, on the other hand, is dynamic, and must be transcoded every time that it is used. This is very much a second order effect that does not substantially affect most systems, however.

Dave Burke: VoIP technologies have had a positive impact on how we integrate speech recognition and speech synthesis technologies. The Media Resource Control Protocol (MRCP), for example, is a recent protocol based on VoIP concepts that provides a network interface to speech resources. MRCP takes advantage of SIP to create a simplex VoIP "call" to a speech resource (for a speech recognizer, one sends audio data to the resource, while for a speech synthesizer, one receives audio data from the resource). In addition to the media stream, MRCP uses SIP to establish a control channel in parallel with the media stream. The control channel is used to control the resource (e.g. to tell the recognizer to start recognizing) as well as to receive notifications of events from the resource (e.g. that the caller has started speaking). The availability of a common, standard interface eases the integration of speech technologies into network equipment and accelerates their adoption, enabling operators and service providers to efficiently and effectively deliver exciting and compelling interactive services over the telephone.

VoIP does, however, bring new challenges related to quality of service (QoS). Since packets are sent independently of each other, there is no guarantee that they will arrive on time or in the order they were sent. Worse, packets may be lost due to network congestion. Delays are unavoidable but can be reduced by marking voice traffic as "high priority", for example using the Differentiated Services (DiffServ) approach. This allows voice packets to be labeled with a priority above non-time-sensitive data such as e-mail traffic. RTP includes a sequence number in each packet allowing the receiver to reorder the packets before playing them. The reordering occurs in a jitter buffer of the receiver - a small buffer capable of temporarily storing on the order of 200 ms of audio data. The ideal jitter buffer size requires careful tuning: too small and jitter will not be concealed; too large and delays will be perceived. Modern systems use adaptive jitter buffers to optimally adjust the size of the buffer in real time. The sequence number can also be used to detect packet loss: the sender of the voice data uses a contiguous, monotonic sequence number for each packet, so a skipped sequence number implies a missing packet. Lost packets can be concealed using different strategies: one simple approach is to repeat the last packet received.
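The reordering and concealment steps can be sketched in a few lines. This is a deliberately simplified model, not a production jitter buffer: it assumes all packets for the window have already arrived (or been given up on), ignores playout timing and sequence-number wraparound, and uses the hypothetical function name play_out.

```python
def play_out(received, first_seq, last_seq):
    """Reorder packets by sequence number and conceal any that are missing.

    received: list of (seq, payload) tuples in arrival order.
    Returns (payloads in playout order, list of lost sequence numbers).
    """
    by_seq = dict(received)      # index the buffered packets by sequence number
    played, lost = [], []
    last_payload = b""           # nothing to repeat before the first packet
    for seq in range(first_seq, last_seq + 1):
        if seq in by_seq:
            last_payload = by_seq[seq]
        else:
            lost.append(seq)     # a skipped sequence number implies a loss
        # On loss, last_payload is unchanged, so the previous packet repeats.
        played.append(last_payload)
    return played, lost


# Packets 2 and 3 arrive swapped and packet 4 never arrives.
audio, missing = play_out([(1, b"a"), (3, b"c"), (2, b"b"), (5, b"e")], 1, 5)
print(audio)    # [b'a', b'b', b'c', b'c', b'e']
print(missing)  # [4]
```

A real receiver would also use the RTP timestamp to decide when each segment must be played, and an adaptive buffer would vary how long it waits before declaring a packet lost.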

Any poorly concealed jitter will have a negative impact on the recognition accuracy and perceived quality of synthesized audio. In today's VoIP networks, one often encounters reliable, uncongested IP networks connected to media gateways which bridge to the PSTN network, thus mitigating QoS issues. Out-of-order mismatches can be reduced or eliminated by simplifying network topologies (only allowing one route at a given time for voice traffic, for example). As VoIP technologies spread right out to the handset, the QoS of each part of the network matters since the end-to-end QoS is only as good as the weakest link. Modern network architectures such as the 3GPP's IP Multimedia Subsystem address QoS issues through the use of approaches such as the Resource Reservation Protocol (RSVP) and DiffServ to prioritize voice and video traffic through the network.

What are the advantages in using VoIP with VoiceXML?

Don Jackson: VoiceXML brings the web development model to IVR applications, with the added benefit of a W3C standard language specification. VoIP brings open standard protocols like SIP, and the Internet transport model to telephony. VoiceXML is a great language for service creation of speech applications, and VoIP provides for modern, open, functionally-decomposed transport networks, into which it is easy to integrate VoiceXML applications. VoIP + VoiceXML is an incredibly powerful combination.

For more detail on VoIP and VoiceXML you may wish to visit https://studio.tellme.com/general/voip.html

Dave Burke: While VoiceXML is independent of the underlying telephony interface, there are advantages and synergies to be exploited with using VoiceXML in conjunction with VoIP. Modern, IP-based VoiceXML platforms are typically subsumed within the media server node (VoiceXML can be viewed as a media server controller). An IP-based VoiceXML media server is essentially "future-proof": the platform can be integrated with legacy PSTN networks using signaling/media gateways, yet can also be used within all-IP networks of the future. Within the standards community, there are a number of initiatives to define the SIP interface to VoiceXML. For example, see RFC 4240 and the IETF Internet-Draft entitled "SIP Interface to VoiceXML Media Services" (draft-burke-vxml). There are several other initiatives in the standards community to define a media server control protocol that includes the ability to trigger VoiceXML dialogs as part of a wider umbrella of lower-level media server control. Examples include the Media Server Markup Language (MSML), the Media Server Control Protocol (MSCP), the Media Server Control Markup Language (MSCML) and Protocol, as well as legacy control protocols such as H.248.

As NGN architectures evolve, there is a move toward converged application servers combining both SIP and HTTP functionality. This allows a VoiceXML application comprised of dynamic VoiceXML documents generated by the application server (i.e. the traditional Web model) to interact with the SIP signaling. For example, in response to an utterance of "send an alert to Jane", the VoiceXML platform can submit an HTTP request to the application server that could then trigger a SIP outbound call to Jane and connect Jane to the media server to hear an alert. Other compelling applications involving the visual Web are possible. For example, a "click-to-call" link on a Web page might issue an HTTP request to the converged application server that would trigger an outbound SIP call to the user and connect them to a media server or human agent to access a service.

Mark Scott: The answer to this question depends largely on the context in which VoiceXML is used. When used to provide services in the context of a call center or for voice hosting, the benefits of VoIP cited above apply independently of the fact that VoiceXML is used in authoring the application.

VoIP-based VoiceXML platforms help reduce costs because they can be used as generic dialog resources available to a variety of applications. Non-VoIP platforms generally require some application-specific provisioning to determine which VoiceXML application to contact for an incoming call. This is typically a database query to map a phone number (or port) to a URL. A SIP INVITE request to a VoIP-based platform, however, already contains the VoiceXML URL. Besides eliminating the database query by the media server, VoIP with VoiceXML offers several operational and architectural advantages:

A SIP application developer can change VoiceXML URLs without having to coordinate a change on the media server, or the database it queries.
A SIP application and the VoiceXML media server may exchange data directly in SIP messages without requiring a separate communication channel, simplifying application integration.

SIP applications treat their interaction with a VoiceXML media server as a remote procedure call. The INVITE to the media server identifies the "function" (the VoiceXML URL) and the "arguments" (INVITE header data are accessible in VoiceXML session variables). The BYE from the media server contains the results (VoiceXML variables identified in an <exit> element). This allows a SIP application to bring a media server into a call, delegate its dialog needs to it (e.g. "Get a valid credit card number"), and then release it and continue with the result, using only the SIP signaling channel for application data exchange.
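As an illustration of how an INVITE can carry the VoiceXML URL, the sketch below extracts it from a SIP Request-URI. It assumes the convention described in RFC 4240, in which the user part is "dialog" and the URL travels in a "voicexml" URI parameter; the host names are placeholders and the function name is hypothetical. A real media server would use a full SIP URI parser.

```python
from urllib.parse import unquote


def voicexml_url_from_request_uri(request_uri):
    """Return the VoiceXML URL carried in a SIP Request-URI, or None.

    Assumes an RFC 4240-style URI such as
    sip:dialog@host;voicexml=http://... (characters that are special in
    SIP URIs are expected to be percent-escaped inside the parameter value).
    """
    _address, *params = request_uri.split(";")
    for param in params:
        name, _, value = param.partition("=")
        if name.strip().lower() == "voicexml":
            return unquote(value.strip())
    return None


uri = ("sip:dialog@mediaserver.example.net;"
       "voicexml=http://vxmlserver.example.net/cgi-bin/script.vxml")
print(voicexml_url_from_request_uri(uri))
# http://vxmlserver.example.net/cgi-bin/script.vxml
```

Because the URL arrives with the call itself, the media server in this sketch needs no phone-number-to-URL lookup of its own.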

A key benefit to using VoIP with VoiceXML is that the openness and flexibility of both environments can be used - particularly when SIP is used as the VoIP protocol (which it normally is). This flexibility allows rich information to be passed to and from voice applications (see sidebar). For instance, a call could be presented to a VoiceXML application, along with an indication that we already know that the user speaks Spanish and is calling about home loans. Unlike traditional systems, this would not require a complex, custom integration of the application with other infrastructure. The information about the caller's language and intent would simply be parameters of the incoming call, no different than the ANI and DNIS information in VoiceXML session variables. With this information, the VoiceXML application could interact with the user, pre-qualify them for the home loan, and transfer to an agent, attaching all collected information to the call. Again, this would require no integration with external systems; the collected information could be passed simply by using VoiceXML's <exit> and passing a namelist.

What are the advantages in using VoIP for voice hosting centers?

Don Jackson: Given the advantages of VoIP, it is the obvious technology to use when building a new voice product or network. So, over time, the entire telephony network will migrate to VoIP. This transition may take decades to be completed. Almost all new investment in telephone networks today is at least VoIP-capable, if not exclusively VoIP-based.

At Tellme, we are believers in complete VoIP. A pure VoIP network enables us to develop and deploy just one version of our platform, and to get the highest utilization for our capital equipment. We can then offer a complete IP-based call - VoIP transport and VoiceXML at the application layer.

Mark Scott: VoIP can be leveraged by a hosting provider in a number of ways:

It can be used for international operational efficiency. Use of PSTN/VoIP gateways separately from the voice platforms used to deliver services allows the use of high-end gateways, which deliver very high reliability/redundancy at low costs per port. For instance, OC-3 gateways handling over 2,000 ports on a single trunk can be used; no voice platform directly supports such interconnections.
VoIP can be used for load balancing across multiple voice platforms, and across multiple sites. For instance, consider two sites that each had 2,000 ports of connectivity to the network, but only 1,500 ports of voice platform capacity. A traditional system would only be able to handle 1,500 calls per site. Using VoIP, overflow calls at one site could be routed internally to the other site, assuming that unused capacity was present. During a spike at one site - which is entirely possible due to geographic events such as a regional marketing promotion or localized natural disasters such as a hurricane - it would be possible to handle 2,000 ports at site A, while handling 1,000 ports at site B.
VoIP can be used to directly accept calls, either from VoIP-enabled end users, or from VoIP carriers such as Level 3, who own worldwide networks of gateways and can provide call termination at low costs.
VoIP can be used for integration with call centers and agents, avoiding the switching costs of transfers using PSTN networks.
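The load-balancing arithmetic in the two-site example above can be checked with a short sketch. The function name and the simple "fill local capacity first, then spill over" policy are assumptions made for illustration; a real deployment would route call by call.

```python
def distribute(offered, capacity):
    """Spread calls across sites, overflowing to sites with spare capacity.

    offered:  dict of site -> calls arriving at that site's network ports
    capacity: dict of site -> voice platform ports at that site
    Returns (dict of site -> calls served by that site's platforms,
             number of calls that could not be served anywhere).
    """
    # Each site first serves as many of its own calls as it can.
    served = {site: min(offered[site], capacity[site]) for site in offered}
    overflow = sum(offered[site] - served[site] for site in offered)
    # Overflow calls are then routed over VoIP to any site with idle ports.
    for site in served:
        moved = min(capacity[site] - served[site], overflow)
        served[site] += moved
        overflow -= moved
    return served, overflow


served, unserved = distribute({"A": 2000, "B": 1000}, {"A": 1500, "B": 1500})
print(served)    # {'A': 1500, 'B': 1500}
print(unserved)  # 0
```

Without the overflow step, site A would have had to turn away 500 of its 2,000 calls even though site B had 500 idle platform ports.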

What are the advantages in using VoIP in call centers?

Dave Burke: A significant advantage of VoIP in the call or contact center space stems from the mobility aspects of VoIP - the human agents can be distributed, possibly working from home. The agent's VoIP phone simply registers its location with the call center's network, allowing calls from customers to be appropriately routed to the agent's current location.
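The registration idea can be pictured as a lookup table from an agent's public address to wherever the agent's phone currently is. The sketch below is a toy model only: in SIP the binding is created with a REGISTER request and also carries an expiry time, while the class name and addresses here are hypothetical.

```python
class Registrar:
    """Toy location service: maps an agent's address to a current contact."""

    def __init__(self):
        self._bindings = {}

    def register(self, address_of_record, contact):
        # A later registration replaces the earlier one, so the agent
        # stays reachable after moving (for example, from office to home).
        self._bindings[address_of_record] = contact

    def route(self, address_of_record):
        # Returns None when the agent has not registered anywhere.
        return self._bindings.get(address_of_record)


registrar = Registrar()
registrar.register("sip:agent42@callcenter.example.com", "sip:agent42@10.0.0.5")
registrar.register("sip:agent42@callcenter.example.com", "sip:agent42@198.51.100.7")
print(registrar.route("sip:agent42@callcenter.example.com"))
# sip:agent42@198.51.100.7
```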

The functionality benefits of VoIP are apparent in the call center too. Video calling can be added relatively easily for an enriched user experience (e.g. the caller can be shown tutorial videos by the agent during the call). Video-over-IP works in the same way as voice-over-IP except there is an additional media stream for the video channel. The video data, like the voice data, is carried over RTP and uses specific compressing codecs.

Don Jackson: Many current call centers have evolved their implementations over many years, with one brand of PBX, another brand of ACD, and a separate call center management system to integrate operations and reporting. Modern call center implementations based on Internet standard VoIP and VoiceXML technologies can avoid the hodgepodge of vendor specific protocols, and lower their overall costs by adopting new IP-based systems.

Mark Scott: Contact centers benefit from VoIP in largely the same ways as hosting providers, since many contact centers deploy on-premise voice platforms to front-end calls to agents (and many hosted voice applications end with a transfer to an agent, as opposed to complete self-service).

Do I need to use a special phone to make VoIP calls?

Don Jackson: Not necessarily. Some VoIP systems integrate with traditional TDM phones, and do the conversion to/from VoIP at a server level. Another popular VoIP device is the Analog Telephone Adapter, or ATA. These boxes (often about the size of a deck of playing cards) can convert between POTS and VoIP.

There are standalone VoIP phones that are very popular. The Cisco phones are beautifully designed, and seem to be the de rigueur prop in episodes of "24" and "The West Wing". There are other excellent VoIP phones available in the market. There are also many VoIP "soft" phones available, the Skype client being one example. There is also the Gizmo Project from SIP Phone, and X-Lite from Counterpath, which are two nice SIP-based softphones.

Can VoIP be used for multimodal systems?

Don Jackson: Absolutely. Another advantage of VoIP-based systems is the relative richness of the signaling protocols. It is far easier to provide for multiple communication channels with VoIP, integrating speech, along with data, text, and graphics.

Mark Scott: VoIP is a fundamental part of multimodal systems. Since non-audio modalities such as video or visual content presentation, clicked/typed/stylus input, ink, and other input/output mechanisms are virtually always carried over a data network, it makes no sense to use traditional networks and approaches for audio while utilizing a completely separate network for other modalities.
