TTS Synthesis Markup Language

Speech Synthesis Markup Language (SSML) lets you make your TTS responses sound more like natural speech. This article gives some examples of how to use it; they apply to both dynamic and static TTS.

The full List of SSML elements may be helpful for additional context and examples:

Created: May 2020

Permalink: https://wildix.atlassian.net/wiki/x/YwLOAQ

Do not use the <speak> element; it is already hardcoded.

<break>

An optional element that you can use to insert pauses between words.

Attributes

Attribute	Description
strength	Optional. Specify the relative duration of a pause using one of the following values: none, x-weak, weak, medium (default), strong, x-strong.
time	Optional. Specify the absolute duration of a pause in seconds or milliseconds. For example, 2s or 500ms.

Syntax

<break />
<break strength="string" />
<break time="string" />

Usage

Play sound -> Welcome to Wildix <break time="2s"/> Please wait for the next available operator
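
As a further illustration, the relative `strength` attribute can be used instead of an absolute `time`. This is a minimal sketch for the same Play sound application: the sentence text is invented for the example, and the actual pause length behind each strength value depends on the TTS engine.

```xml
Thank you for calling <break strength="strong"/> Your call is important to us <break time="500ms"/> Please stay on the line
```

The fragment is not wrapped in <speak>, since that element is already added for you.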

<prosody>

An optional element that specifies the pitch, contour, range, rate, duration, and volume for speaking the element's text.

Attributes

Attribute	Description
pitch	Optional. Indicates the baseline pitch for the text. You may express the pitch as: (1) an absolute value, expressed as a number followed by "Hz" (Hertz), for example 600Hz; (2) a relative value, expressed as a number preceded by "+" or "-" and followed by "Hz" or "st", that specifies an amount to change the pitch, for example +80Hz or -2st, where "st" means the change unit is a semitone, which is half of a tone (a half step) on the standard diatonic scale; (3) a constant value: x-low, low, medium, high, x-high, or default.
contour	Optional. Represents changes in pitch for speech content as an array of targets at specified time positions in the speech output. Each target is defined by sets of parameter pairs. For example: `<prosody contour="(0%,+20Hz) (10%,-2st) (40%,+10Hz)">` The first value in each set of parameters specifies the location of the pitch change as a percentage of the duration of the text. The second value specifies the amount to raise or lower the pitch, using a relative value or an enumeration value for pitch (see `pitch`).
range	Optional. A value that represents the range of pitch for the text. You may express `range` using the same absolute values, relative values, or enumeration values used to describe `pitch`.
rate	Optional. Indicates the speaking rate of the text. You may express `rate` as: (1) a relative value, expressed as a number that acts as a multiplier of the default, so that a value of 1 results in no change in the rate, a value of 0.5 halves the rate, and a value of 3 triples the rate; (2) a constant value: x-slow, slow, medium, fast, x-fast, or default.
duration	Optional. The period of time that should elapse while the TTS engine reads the text, in seconds or milliseconds. For example, 2s or 1800ms.
volume	Optional. Indicates the volume level of the speaking voice. You may express the volume as: (1) an absolute value, expressed as a number in the range of 0.0 to 100.0, from quietest to loudest, for example 75 (the default is 100.0); (2) a relative value, expressed as a number preceded by "+" or "-" that specifies an amount to change the volume, for example +10 or -5.5; (3) a constant value: silent, x-soft, soft, medium, loud, x-loud, or default.

Syntax

<prosody pitch="value" contour="value" range="value" rate="value" duration="value" volume="value"></prosody>
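
Usage

A minimal usage sketch, assuming the TTS engine in use honors the `rate`, `volume`, `pitch`, and `contour` attributes (support can vary between engines and voices). The sentence text is illustrative only, and the attribute values are taken from the table above.

```xml
Welcome to Wildix.
<prosody rate="slow" volume="loud">Please listen carefully, as our menu options have changed.</prosody>
<prosody pitch="+2st">Thank you for your patience.</prosody>
<prosody contour="(0%,+20Hz) (10%,-2st) (40%,+10Hz)">Have a nice day.</prosody>
```

Each <prosody> element affects only the text it encloses, so unwrapped text such as the first sentence is spoken with the voice's default settings.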

<say-as>

An optional element that indicates the content type (such as number or date) of the element's text.

Attributes

Attribute	Description
interpret-as	Required. Indicates the content type of the element's text. For a list of types, see the table below.
format	Optional. Provides additional information about the precise formatting of the element's text for content types that may have ambiguous formats. SSML defines formats for content types that use them (see table below).
detail	Optional. Indicates the level of detail to be spoken. For example, this attribute might request that the speech synthesis engine pronounce punctuation marks. There are no standard values defined for `detail`.

The following are the supported content types for the interpret-as and format attributes. Include the format attribute only if interpret-as is set to date, time, or telephone.

interpret-as	format	Interpretation
address		The text is spoken as an address: `I'm at <say-as interpret-as="address">West Midlands, CV1 4LY, Coventry</say-as>`
cardinal, number		The text is spoken as a cardinal number: `There are <say-as interpret-as="cardinal">4</say-as> levels`
characters, spell-out		The text is spoken as individual letters (spelled out): `<say-as interpret-as="characters">test</say-as>`
date	dmy, mdy, ymd, ydm, ym, my, md, dm, d, m, y	The text is spoken as a date. The `format` attribute specifies the date's format (d=day, m=month, and y=year): `Today is <say-as interpret-as="date" format="mdy">12-05-2020</say-as>`
digits, number_digit		The text is spoken as a sequence of individual digits: `<say-as interpret-as="number_digit">123456789</say-as>`
fraction		The text is spoken as a fractional number: `<say-as interpret-as="fraction">3/8</say-as> of an inch`
ordinal		The text is spoken as an ordinal number: `Select the <say-as interpret-as="ordinal">3rd</say-as> option`
telephone		The text is spoken as a telephone number. The `format` attribute may contain digits that represent a country code, for example "1" for the United States or "39" for Italy. If the phone number itself also includes a country code, that code takes precedence over the one in `format`. For example: `The number is <say-as interpret-as="telephone" format="44">3300 563 634</say-as>`
time	hms12, hms24	The text is spoken as a time. The `format` attribute specifies whether the time is specified using a 12-hour clock (hms12) or a 24-hour clock (hms24). Use a colon to separate numbers representing hours, minutes, and seconds: `The office opens at <say-as interpret-as="time" format="hms12">4:00am</say-as>`

Syntax

<say-as interpret-as="string" format="digit string" detail="string"></say-as>

Usage

The following example combines the <break> element with <say-as> to pause between sentences and to read out a date and a time:

Dialplan application Play sound -> The person you're trying to reach isn't available <break time="2s"/> Please call back on <say-as interpret-as="date" format="dmy">12-05-2020</say-as> at <say-as interpret-as="time" format="hms12">4:00</say-as>
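
The three elements can also be nested. The fragment below is a sketch only, assuming the TTS engine accepts <say-as> inside <prosody>; the announcement text is invented, while the telephone number and time values are reused from the table above.

```xml
Our office is currently closed <break time="1s"/>
<prosody rate="slow">Please call <say-as interpret-as="telephone" format="44">3300 563 634</say-as></prosody>
after <say-as interpret-as="time" format="hms12">4:00am</say-as>
```

Slowing the rate only around the phone number gives the caller time to write it down without slowing the rest of the message.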