The 3CX Voice Application Designer includes a new feature that converts text to speech (TTS), that is, it produces an artificial human voice. This opens the door to many possibilities, allowing you to create voice applications that play audio that is not pre-recorded, but is generated each time as needed.
For example, this could be used to create a voice application that reads information from a database and plays that information to the user, or an application that checks a web service and reads the results to the user.
This article shows how to use this new functionality to create powerful applications, add new voices, support more languages, and troubleshoot common problems you may find.
Text to Speech Functionality
Windows includes the “Microsoft Speech Synthesizer” in its most modern versions – a series of components that allow for speech synthesis. 3CX Voice Application Designer uses these components to synthesize speech, allowing you to use all the features provided by these components and also letting you choose other voices in different languages which were originally created for Windows.
At the time of this writing, the Windows operating systems that include the “Microsoft Speech Synthesizer” are:
- Windows 7
- Windows Vista SP2 or later
- Windows Server 2008 (Server Core not supported)
- Windows Server 2008 R2 (Server Core supported with SP1 or later)
This feature has been available since 3CX Phone System 11.
Speech Synthesis Methods
The 3CX Voice Application Designer offers two formats for speech synthesis, a simple one named “Text” and another more complex one named “SSML”.
- Text Mode: this mode is appropriate for most users since it is easier. The text is converted to speech as it is entered.
- SSML Mode: this mode allows controlling aspects of the synthesized audio, such as pronunciation, volume, pitch and rate. The text must be an XML document that follows the SSML (Speech Synthesis Markup Language) recommendation available at http://www.w3.org/TR/speech-synthesis/
Synthesizing Speech Using the Text Mode
Step 1 – Creating the project
To begin, we need to create a new project. Open the 3CX Voice Application Designer and go to “File” > “New” > “Project”. Enter a name for the project, for example “TTS Demo”, and save it into the desired folder. The main callflow “Main.flow” will be automatically opened.
Step 2 – Defining some variables
As an example, let’s define some variables and then play back their values. This will show that the text to convert to speech can be created dynamically using expressions.
In the Project Explorer select the main callflow node, then go to the “Properties” window and press the button to the right of the item “Variables”.
This will open the variable collection editor. Let’s add two variables, one containing a name and another containing an amount. To do this:
- Press the “Add” button to add the first variable. Change the default name “Variable1” to “Name”. Then enter the value 'John Doe' (including the single quotes, because it is a constant string value) into the “InitialValue” field.
- Press the “Add” button again to add the second variable. Change the default name “Variable1” to “Amount”. Then enter the value 120.45 into the “InitialValue” field.
Now that we have both variables defined, one containing the name “John Doe” and the other containing the amount “120.45”, we’ll create an expression to play the following message: “Mr. John Doe has a balance of $ 120.45”.
Step 3 – Adding the component to synthesize speech
To synthesize speech we simply need to add to our callflow a “Prompt Playback” component. Drag it from the toolbox and drop it into the design surface. Change the name of the component to “playTTS”. Change the property “AllowBargeIn” to “False” so that the message is played even if no digits are to be entered later (see this article for more information). Now click the button to the right of the property “Prompts” to open the prompt collection editor. You will see an arrow pointing down to the right of the “Add” button. Pressing that arrow will show you the different types of prompts that can be added. In this case choose “TextToSpeechAudioPrompt”.
Now we need to configure this prompt. First modify the name to “ttsPrompt”. Keep the format as “Text”, the name of the voice empty so the default voice is used, and the volume to 100 (maximum value). Now create the expression for the property “Text”. You can press the button to the right of this property to open the expression editor. We’ll use the VAD function named “CONCATENATE” to concatenate the different parts of the text, completing the fields as follows:
- The first text to concatenate will be 'Mr. '. Remember to enter the single quotes to denote a constant string value.
- Then select the variable callflow$.Name
- Because the VAD function shows two parts to concatenate by default, we need to press the button to the right of the second concatenated part in order to add a third part. Then we’ll enter ' has a balance of $ '. Again, remember to enter the single quotes to denote a constant string value.
- Finally we’ll select the variable callflow$.Amount
The final value of the expression will be as follows:
CONCATENATE('Mr. ',callflow$.Name,' has a balance of $ ',callflow$.Amount)
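For readers who prefer to reason in a general-purpose language, the expression above behaves roughly like the following Python sketch. This is only an analogy, not VAD code: the `concatenate` helper is hypothetical, and we are assuming that each part is converted to its string form before being joined.

```python
def concatenate(*parts):
    """Join all parts into one string, converting non-string parts first."""
    return "".join(str(part) for part in parts)


# Values mirroring the callflow variables defined in Step 2.
name = "John Doe"
amount = 120.45

message = concatenate("Mr. ", name, " has a balance of $ ", amount)
print(message)
```

Note that the spaces around the name and the amount come from the constant strings themselves, which is why they are typed inside the single quotes in the expression editor.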
Step 4 – Build, deploy and test
The project is ready. Now all you need to do is save, build, deploy it to your 3CX Phone System installation and make a call to the assigned extension. When calling you will hear the synthesized message: “Mr. John Doe has a balance of $ 120.45”.
Synthesizing Speech Using the SSML Mode
If you need detailed control over certain aspects of the synthesized speech, SSML mode should be used. This mode is more complex than Text mode, but it gives us more control. In this case we must set the “Text” property of the prompt to an XML document that follows the SSML recommendation available at http://www.w3.org/TR/speech-synthesis/.
For example, to emphasize certain parts of a sentence we could use the following XML:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
That is a <emphasis> big </emphasis> car!
That is a <emphasis level="strong"> huge </emphasis> bank account!
</speak>
To enter a constant string value into the Text property, single and double quotes must be escaped, otherwise they will be misinterpreted and a runtime “ecmascript.semantic” error will occur. Therefore, if we modify the project created above to enter the above XML as a constant value, we need to assign the following value to the Text property of the prompt “ttsPrompt”:
<?xml version=\"1.0\"?><speak version=\"1.0\" xmlns=\"http://www.w3.org/2001/10/synthesis\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd\" xml:lang=\"en-US\">That is a <emphasis> big </emphasis> car! That is a <emphasis level=\"strong\">huge</emphasis> bank account!</speak>
If we build, deploy and call the extension assigned to the callflow, we’ll hear how the indicated words are emphasized.
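Escaping every quote by hand is error-prone, so it can help to prepare the constant value with a small script before pasting it into the Text property. The following Python sketch is one possible way to do that. The helper name `escape_for_vad_constant` is hypothetical, and we are assuming that only single and double quotes need a backslash, as described above.

```python
import xml.etree.ElementTree as ET


def escape_for_vad_constant(ssml):
    """Check that the SSML is well-formed, then backslash-escape its quotes."""
    # Raises xml.etree.ElementTree.ParseError if the XML is not well-formed.
    ET.fromstring(ssml)
    # Collapse the document to a single line so it fits in the property field.
    single_line = " ".join(
        line.strip() for line in ssml.splitlines() if line.strip()
    )
    return single_line.replace('"', '\\"').replace("'", "\\'")


ssml = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
That is a <emphasis> big </emphasis> car!
That is a <emphasis level="strong">huge</emphasis> bank account!
</speak>"""

escaped = escape_for_vad_constant(ssml)
print(escaped)
```

Parsing the document first means that a missing closing tag is caught on your desktop rather than at runtime during a call.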
The potential of SSML to control the generation of synthesized speech is enormous; this is just a simple example. For more information, see the SSML recommendation at http://www.w3.org/TR/speech-synthesis/.
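To give an idea of what else SSML can express, the following Python sketch builds a document that uses the `prosody` and `break` elements defined in the SSML recommendation. It is an illustration under assumptions: the `build_ssml` helper and its parameters are hypothetical, and how closely a given voice honors the rate, volume and pause settings depends on the installed synthesizer.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Serialize SSML elements without a prefix (xmlns="..." on the root element).
ET.register_namespace("", SSML_NS)


def build_ssml(name, amount):
    """Return an SSML document that reads a balance slowly and loudly."""
    speak = ET.Element(
        f"{{{SSML_NS}}}speak",
        {"version": "1.0", f"{{{XML_NS}}}lang": "en-US"},
    )
    speak.text = f"Mr. {name} has a balance of "
    # Speak the amount more slowly and louder than the surrounding text.
    prosody = ET.SubElement(
        speak, f"{{{SSML_NS}}}prosody", {"rate": "slow", "volume": "loud"}
    )
    prosody.text = f"{amount} dollars."
    # Insert a half-second pause before the closing sentence.
    pause = ET.SubElement(speak, f"{{{SSML_NS}}}break", {"time": "500ms"})
    pause.tail = " Thank you for calling."
    return ET.tostring(speak, encoding="unicode")


document = build_ssml("John Doe", 120.45)
print(document)
```

Building the XML with a library rather than by string concatenation has one practical advantage: characters such as & or < coming from a database are escaped automatically, so the resulting document stays well-formed.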
Adding Voices and Supporting More Languages
The operating system provides a voice by default, for example “Microsoft Anna”. You can install new voices which can then be used to synthesize speech. The new voices can implement the synthesis in different languages, so for example we can add a voice in Spanish and have the functionality to convert text to speech in Spanish.
In order to check the voices already installed and select the voice to be used by default, go to “Control Panel” > “Speech Recognition” > “Text to Speech”. The voice selection list displays the available options.
To add new voices you can follow the steps outlined in the following article:
For example, that article mentions the Cepstral voices, which include voices in American English, British English, Spanish, French, German and Italian.
After installing the voices, you may need to restart your computer before using them.
Troubleshooting
When you encounter a problem, first make sure:
- You have installed 3CX Voice Application Designer with an activated license (the text to speech functionality is not available in the free version).
- You have installed 3CX Phone System version 11 or above with an activated license.
- You’re running 3CX Phone System on one of the supported operating systems (see the list above in this article).
If a call to the extension assigned to the callflow is dropped when it reaches the synthesized audio playback, and the logs show an error of type “error.badfetch”, the cause is probably the security configuration of the web server (Abyss or IIS). This usually happens when using the SSML format, because in that case the XML is sent as a parameter between the web pages that compose the callflow, and the web server may restrict this to prevent code injection attacks.
In order to modify the web server configuration to allow the passage of XML between pages, do the following:
- Go to the folder %ProgramData%\3CX\Data\Http\Interface\ivr
- Open the web.config file from that folder.
- Look for the <system.web> starting element.
- Below that element add the following:
<httpRuntime requestValidationMode="2.0" />
<pages validateRequest="False" />
- Restart your web server service (Abyss or IIS).
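If you need to apply this change on several machines, the edit can be scripted. The following Python sketch shows one way to do it on the text of a web.config file. The helper name `relax_request_validation` is hypothetical, and the sample configuration is invented for illustration. Keep in mind that this parser discards XML comments, so work on a backup copy of the real file.

```python
import xml.etree.ElementTree as ET


def relax_request_validation(config_xml):
    """Return config_xml with the two request validation settings applied."""
    root = ET.fromstring(config_xml)
    system_web = next(root.iter("system.web"), None)
    if system_web is None:
        raise ValueError("no <system.web> element found")
    wanted = [
        ("httpRuntime", "requestValidationMode", "2.0"),
        ("pages", "validateRequest", "False"),
    ]
    for position, (tag, attribute, value) in enumerate(wanted):
        element = system_web.find(tag)
        if element is None:
            # Add the element right below the <system.web> start tag.
            element = ET.Element(tag)
            system_web.insert(position, element)
        # Reuse an existing element so it is never duplicated.
        element.set(attribute, value)
    return ET.tostring(root, encoding="unicode")


# Invented sample standing in for the real web.config contents.
sample = (
    "<configuration><system.web>"
    '<compilation debug="false" />'
    "</system.web></configuration>"
)
patched = relax_request_validation(sample)
print(patched)
```

Running the function twice on the same text leaves a single `httpRuntime` and a single `pages` element, so it is safe to re-run after an upgrade that may or may not have reset the file.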
Using text to speech and creating voice applications for 3CX Phone System is simpler than ever. Tasks such as playing numbers or dates, which previously required creating user components, are now as easy as placing a Prompt Playback component and entering the value. The synthesizer takes care of every detail.
This new feature of the 3CX Voice Application Designer allows you to create applications that play back voice messages that are not necessarily pre-recorded, but are generated dynamically from information obtained from the sources you want.