HTTP/TTS Protocol

The Voximal VoiceXML browser can connect to a TTS engine over HTTP. An HTTP request converts the prompt text into an audio file. The audio file can be stored in a cache directory to reduce the load on the TTS engine. The first request generates the audio file and saves it in the cache. Afterwards, whenever the same text content is used, the VoiceXML browser plays the cached file directly, as if it were prerecorded audio.

The protocol is simple:

  • You configure the VoiceXML browser to use HTTP. It sends a request (POST recommended) that contains the text content and additional parameters (language, voice…).
  • The web server hosting the TTS engine processes the request.
  • The VoiceXML browser receives an audio file (in a format compatible with Asterisk), stores it in the cache and plays it.
  • If the same content is requested again, the VoiceXML browser checks the cache and uses the cached file instead of calling the TTS engine.
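
The steps above can be sketched from the client side as follows. This is only an illustration of the "check the cache, otherwise call the TTS service" logic, not the browser's actual implementation: the function names, the cache file naming (a SHA-256 hash of the sorted parameters) and the form-encoded POST body are all assumptions.

```python
import hashlib
from pathlib import Path
from typing import Callable, Dict
from urllib import parse, request


def cache_key(params: Dict[str, str]) -> str:
    """Derive a stable file name from the request parameters (assumed scheme)."""
    canonical = "&".join(f"{name}={params[name]}" for name in sorted(params))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def get_prompt_audio(
    params: Dict[str, str],
    cache_dir: Path,
    fetch: Callable[[Dict[str, str]], bytes],
) -> Path:
    """Return the audio file for a prompt, calling the TTS service only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    extension = params.get("format", "wav")
    audio_path = cache_dir / f"{cache_key(params)}.{extension}"
    if not audio_path.exists():
        # Cache miss: ask the TTS web service to synthesize the text, then store the result.
        audio_path.write_bytes(fetch(params))
    # Cache hit (or freshly stored file): it can be played like prerecorded audio.
    return audio_path


def http_post_fetch(uri: str) -> Callable[[Dict[str, str]], bytes]:
    """Build a fetch function that POSTs the parameters as a form to `uri`."""

    def fetch(params: Dict[str, str]) -> bytes:
        body = parse.urlencode(params).encode("utf-8")
        req = request.Request(uri, data=body, method="POST")
        with request.urlopen(req) as response:
            return response.read()

    return fetch
```

Passing the fetch function as an argument keeps the cache logic independent of the network, so it can be exercised with a stub instead of a live TTS server.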

The main TTS configuration is set in /etc/asterisk/voximal.conf, in the “[prompt]” section:

  • method : When 'method' is set to POST or GET, the HTTP/TTS protocol is used to process <prompt> text content. When it is set to ASTERISK, the VoiceXML browser sends the content to the Asterisk module (as text/UTF8 or XML/UTF8 content).
  • uri : The 'uri' of the TTS (or TextToVideo) service. Our scripts install the services at http://ip/tts/provider/tts.php.
  • urivideo : Same as uri, but used when you set xml:lang=“video”.
  • format : The audio file 'format' to use. Not all scripts support every format; check the installation documentation to choose the correct one.
  • formatvideo : Same as format, but used when you set xml:lang=“video”.
  • maxage : The 'maxage' parameter forces a cache refresh after a given time. The value 0 disables caching: an HTTP request is sent for each prompt. The value -1 means an infinite age: if the file exists in the cache, it is always used directly.
  • checkBreak : Parses the prompt content (in SSML) and searches for <break> tags. The <break> tags are processed by the VoiceXML browser to insert pauses in the prompt.
  • cutPrompt : Slices the prompt at punctuation marks ('.', ',', ':' …) so that fragments can be shared between prompts and reused from the cache as much as possible.
  • ssml : Sends the text as well-formed SSML/XML content (with the <?xml> and <ssml> root tags).
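
To make the 'maxage' and 'cutPrompt' descriptions more concrete, here is a minimal sketch of both behaviours. It is not the browser's code: the function names are hypothetical, the unit of 'maxage' is assumed to be seconds, and the exact cutting rule (split after '.', ',' or ':' when followed by whitespace) is a guess based on the description above.

```python
import re
from typing import List


def cache_entry_usable(age_seconds: float, maxage: int) -> bool:
    """Decide whether a cached audio file may be reused (maxage assumed to be in seconds)."""
    if maxage == 0:
        # Caching disabled: every prompt triggers an HTTP request.
        return False
    if maxage < 0:
        # -1 means infinite age: an existing cache file is always reused.
        return True
    return age_seconds < maxage


def cut_prompt(text: str) -> List[str]:
    """Split a prompt after '.', ',' or ':' so that each fragment can be cached separately."""
    fragments = re.split(r"(?<=[.,:])\s+", text.strip())
    return [fragment for fragment in fragments if fragment]
```

With this rule, "Hello, your balance is 10 euros. Goodbye." is cut into "Hello,", "your balance is 10 euros." and "Goodbye.", so the greeting and the closing can be reused from the cache even when the middle fragment changes.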

Configuration example:

[prompt]
api=microsoft
method=POST
ssml=0
cutprompt=1
maxage=-1
key=xxxxxxxxxxxxxxxxxxxxxxxxx
user=xxxxxxx
password=xxxxxxx

Most of these parameters can be changed from VoiceXML using properties. The property name is 'prompt' followed by the parameter name.

VoiceXML example:

<property name="promptvoice" value="Paola"/>

The HTTP request sent to the TTS service carries the following parameters:

  • text : the text to synthesize, taken from the <prompt> content (UTF-8).
  • language : the language to use (en-GB, fr-FR…), taken from the xml:lang attribute.
  • format : the audio format to return (wav, gsm, mp4… any format supported by Asterisk), taken from the configuration.
  • voice : the voice (Carla, Marcos… depending on the TTS provider), taken from the third component of the xml:lang attribute (e.g. “it-IT-Paola”).
  • size* : the size of the image, taken from the property promptsize.
  • background* : the image reference or color used for the background, taken from the property promptbackground.
  • color* : the color of the text, taken from the property promptcolor.
  • font* : the size of the font, taken from the property promptfont.
  • offset* : the horizontal (X) offset applied to the text, taken from the property promptoffset.
  • position* : the vertical (Y) position applied to the text, taken from the property promptposition.
  • hmac : MD5 key generated for Voxygen Cloud integration.

* : Only for the TextToVideo function, when you set xml:lang=“video”.
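
As an illustration of how the audio parameters listed above could be assembled, the sketch below splits an xml:lang value such as “it-IT-Paola” into a language and a voice, then builds a form-encoded POST body. The function names are hypothetical, and both the splitting rule and the body encoding are assumptions; the real browser may assemble the request differently.

```python
from typing import Dict, Optional, Tuple
from urllib.parse import urlencode


def split_xml_lang(xml_lang: str) -> Tuple[str, Optional[str]]:
    """Split 'it-IT-Paola' into ('it-IT', 'Paola'); return (value, None) when no voice is given."""
    parts = xml_lang.split("-", 2)
    if len(parts) == 3:
        return f"{parts[0]}-{parts[1]}", parts[2]
    return xml_lang, None


def build_tts_form(text: str, xml_lang: str, audio_format: str) -> bytes:
    """Build the form-encoded body carrying the text, language, format and voice parameters."""
    language, voice = split_xml_lang(xml_lang)
    params: Dict[str, str] = {"text": text, "language": language, "format": audio_format}
    if voice is not None:
        # The voice is only sent when xml:lang carries a third component.
        params["voice"] = voice
    return urlencode(params).encode("utf-8")
```

For example, build_tts_form("Ciao", "it-IT-Paola", "wav") yields a body with text=Ciao, language=it-IT, format=wav and voice=Paola, while an xml:lang of "en-GB" produces no voice parameter.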