TTS Synthesis Markup Language
With Speech Synthesis Markup Language (SSML), you can make your TTS responses sound more like natural speech. This article gives some examples of how to use it; they apply to both dynamic and static TTS.
The full lists of SSML elements may be helpful for additional context and examples:
- https://www.w3.org/TR/speech-synthesis/
- https://developers.google.com/assistant/actions/reference/ssml
Created: May 2020
Permalink: https://wildix.atlassian.net/wiki/x/YwLOAQ
Do not use the <speak> element: it is already hardcoded.
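As a minimal illustration of this rule, the text entered for TTS is just the SSML fragment itself. The prompt wording is made up for this sketch, and the comment is there for the reader only; it can be left out when entering the text.

```xml
<!-- Fragment only: no <speak> wrapper around it, because that wrapper is already hardcoded -->
Welcome to Wildix <break time="500ms"/> How can we help you?
```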
<break>
An optional element that you can use to insert pauses between words.
Attributes
Attribute | Description |
---|---|
strength | Optional. Specify the relative duration of a pause using one of the following values: none, x-weak, weak, medium, strong, x-strong. |
time | Optional. Specify the absolute duration of a pause in seconds or milliseconds. Examples: 2s and 500ms. |
Syntax
<break />
<break strength="string" />
<break time="string" />
Usage
- Play sound -> Welcome to Wildix <break time="2s"/> Please wait for the next available operator
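The strength attribute can be used in the same place as time. The following is a minimal sketch, assuming the TTS engine in use honors the standard W3C strength values (none, x-weak, weak, medium, strong, x-strong). The prompt text is invented for illustration, and the comments are for the reader only.

```xml
<!-- strength gives a relative pause; the exact length is chosen by the TTS engine -->
Thank you for calling <break strength="strong"/> Your call is important to us
<!-- time gives an absolute pause, here 500 milliseconds -->
Press 1 for sales <break time="500ms"/> Press 2 for support
```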
<prosody>
An optional element that specifies the pitch, contour, range, rate, duration, and volume for speaking the element's text.
Attributes
Attribute | Description |
---|---|
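pitch | Optional. Indicates the baseline pitch for the text. You may express the pitch as an absolute value in hertz (for example, 150Hz), a relative value (for example, +10Hz or -2st), or one of the enumeration values x-low, low, medium, high, x-high, default. |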
contour | Optional. Represents changes in pitch for speech content as an array of targets at specified time positions in the speech output. Each target is defined by sets of parameter pairs. For example: <prosody contour="(0%,+20Hz) (10%,-2st) (40%,+10Hz)">. The first value in each set of parameters specifies the location of the pitch change as a percentage of the duration of the text. The second value specifies the amount to raise or lower the pitch, using a relative value or an enumeration value for pitch (see pitch). |
range | Optional. A value that represents the range of pitch for the text. You may express range using the same absolute values, relative values, or enumeration values used to describe pitch. |
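rate | Optional. Indicates the speaking rate of the text. You may express rate as a percentage (for example, 80% or +10%) or as one of the enumeration values x-slow, slow, medium, fast, x-fast, default. |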
duration | Optional. The period of time that should elapse while the TTS engine reads the text, in seconds or milliseconds. For example, 2s or 1800ms. |
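volume | Optional. Indicates the volume level of the speaking voice. You may express the volume as a relative change in decibels (for example, +6dB) or as one of the enumeration values silent, x-soft, soft, medium, loud, x-loud, default. |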
Syntax
<prosody pitch="value" contour="value" range="value" rate="value" duration="value" volume="value"></prosody>
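A hedged usage sketch for <prosody>, written in the same style as the <break> usage. The attribute values (slow, -2st, loud) are standard W3C SSML values, but the prompt text is invented, and how strongly each value is rendered, or whether it is supported at all, depends on the TTS engine.

```xml
<!-- Slower, slightly lower-pitched reading for the important part of the prompt -->
Your ticket number is <prosody rate="slow" pitch="-2st">4 7 1 1</prosody>
<!-- Louder reading for a warning -->
<prosody volume="loud">Please do not hang up</prosody>
```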
<say-as>
An optional element that indicates the content type (such as number or date) of the element's text.
Attributes
Attribute | Description |
---|---|
interpret-as | Required. Indicates the content type of the element's text. For a list of types, see the table below. |
format | Optional. Provides additional information about the precise formatting of the element's text for content types that may have ambiguous formats. SSML defines formats for content types that use them (see table below). |
detail | Optional. Indicates the level of detail to be spoken. For example, this attribute might request that the speech synthesis engine pronounce punctuation marks. There are no standard values defined for detail. |
The following are the supported content types for the interpret-as and format attributes. Include the format attribute only if interpret-as is set to date, time, or telephone.
interpret-as | format | Interpretation |
---|---|---|
address | | The text is spoken as an address. |
cardinal, number | | The text is spoken as a cardinal number. Example: There are <say-as interpret-as="cardinal">4</say-as> levels |
characters, spell-out | | The text is spoken as individual letters (spelled out). Example: <say-as interpret-as="characters">test</say-as> |
date | dmy, mdy, ymd, ydm, ym, my, md, dm, d, m, y | The text is spoken as a date. The format attribute specifies the date's format (d=day, m=month, and y=year). Example: Today is <say-as interpret-as="date" format="mdy">12-05-2020</say-as> |
digits, number_digit | | The text is spoken as a sequence of individual digits. Example: <say-as interpret-as="number_digit">123456789</say-as> |
fraction | | The text is spoken as a fractional number. Example: <say-as interpret-as="fraction">3/8</say-as> of an inch |
ordinal | | The text is spoken as an ordinal number. Example: Select the <say-as interpret-as="ordinal">3rd</say-as> option |
telephone | country code (digits) | The text is spoken as a telephone number. The format attribute may contain digits that represent a country code, for example "1" for the United States or "39" for Italy. The phone number itself may also include the country code; if it does, that code takes precedence over the country code in the format attribute. Example: The number is <say-as interpret-as="telephone" format="44">3300 563 634</say-as> |
time | hms12, hms24 | The text is spoken as a time. The format attribute specifies whether the time is given using a 12-hour clock (hms12) or a 24-hour clock (hms24). Use a colon to separate the numbers representing hours, minutes, and seconds. Example: The office opens at <say-as interpret-as="time" format="hms12">4:00am</say-as> |
Syntax
<say-as interpret-as="string" format="digit string" detail="string"></say-as>
Usage
The following example shows how to combine the <break> element with <say-as> elements for a date and a time:
- Dialplan application Play sound -> The person you're trying to reach isn't available <break time="2s"/> Please call back on <say-as interpret-as="date" format="dmy">12-05-2020</say-as> at <say-as interpret-as="time" format="hms12">4:00</say-as>
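Other content types from the table can be combined in one prompt in the same way. The following is a sketch with made-up prompt text that reuses the ordinal and telephone examples from the table; the actual pronunciation depends on the TTS engine and language, and the comments are for the reader only.

```xml
<!-- ordinal: the text is spoken as an ordinal number -->
Select the <say-as interpret-as="ordinal">3rd</say-as> option
<break time="1s"/>
<!-- telephone: country code 44 is supplied through the format attribute -->
or call <say-as interpret-as="telephone" format="44">3300 563 634</say-as>
<break time="1s"/>
<!-- characters: the code is spelled out letter by letter -->
Your reference code is <say-as interpret-as="characters">WX</say-as>
```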