TTS Synthesis Markup Language

With Speech Synthesis Markup Language (SSML), you can make your TTS responses sound more like natural speech. This article gives examples of how to use it; they apply to both dynamic and static TTS.

The full list of SSML elements may be helpful for additional context and examples.

Created: May 2020

Permalink: https://wildix.atlassian.net/wiki/x/YwLOAQ

Do not use the <speak> element: it is already hardcoded.

<break>

An optional element that you can use to insert pauses between words.

Attributes

Attribute

Description

strength

Optional. Specify the relative duration of a pause using one of the following values:

  • none

  • x-weak

  • weak

  • medium (default)

  • strong

  • x-strong

time 

Optional. Specify the absolute duration of a pause in seconds or milliseconds. Examples: 2s and 500ms.

Syntax

<break />
<break strength="string" />
<break time="string" />

Usage

  • Play sound -> Welcome to Wildix <break time="2s"/> Please wait for the next available operator
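If the TTS text is assembled programmatically (for example, for dynamic TTS), the <break> rules above can be checked before the string is sent. The following is a minimal Python sketch, not part of any Wildix API: the helper name `break_tag` and the duration pattern are assumptions made for illustration.

```python
import re

# Values taken from the strength attribute list above.
STRENGTHS = {"none", "x-weak", "weak", "medium", "strong", "x-strong"}

# Assumed duration format: a number followed by "s" or "ms", e.g. 2s or 500ms.
TIME_PATTERN = re.compile(r"^\d+(\.\d+)?(s|ms)$")


def break_tag(strength=None, time=None):
    """Return a <break> element as a string (hypothetical helper)."""
    attrs = []
    if strength is not None:
        if strength not in STRENGTHS:
            raise ValueError(f"unknown strength: {strength!r}")
        attrs.append(f'strength="{strength}"')
    if time is not None:
        if not TIME_PATTERN.match(time):
            raise ValueError(f"time must look like 2s or 500ms, got {time!r}")
        attrs.append(f'time="{time}"')
    # With no attributes, fall back to the bare element.
    return "<break " + " ".join(attrs) + "/>" if attrs else "<break/>"


# Rebuild the usage example from this section.
prompt = (
    f"Welcome to Wildix {break_tag(time='2s')} "
    "Please wait for the next available operator"
)
print(prompt)
```

The printed string is the same text as in the Play sound usage above, with the two-second pause between the sentences.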

<prosody>

An optional element that specifies the pitch, contour, range, rate, duration, and volume for speaking the element's text.

Attributes

Attribute

Description

pitch

Optional. Indicates the baseline pitch for the text. You may express the pitch as:

  • An absolute value, expressed as a number followed by "Hz" (Hertz). For example, 600Hz.

  • A relative value, expressed as a number preceded by "+" or "-" and followed by "Hz" or "st", that specifies an amount to change the pitch. For example: +80Hz or -2st. The "st" indicates the change unit is semitone, which is half of a tone (a half step) on the standard diatonic scale.

  • A constant value:

    • x-low

    • low

    • medium

    • high

    • x-high

    • default
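To get a feel for how a relative value in semitones compares with one in Hertz, the arithmetic can be sketched in Python. This assumes equal temperament, where each semitone multiplies the frequency by 2^(1/12). The 200 Hz baseline is a made-up figure for illustration; the actual baseline pitch depends on the voice.

```python
def semitones_to_ratio(semitones):
    """Frequency multiplier for a shift of `semitones` (equal temperament assumed)."""
    return 2 ** (semitones / 12)


def shifted_pitch_hz(base_hz, semitones):
    """Pitch in Hz after shifting `base_hz` by the given number of semitones."""
    return base_hz * semitones_to_ratio(semitones)


base = 200.0  # hypothetical baseline pitch, in Hz
for shift in (-2, 2, 12):
    print(f"{shift:+d}st -> {shifted_pitch_hz(base, shift):.1f} Hz")
```

Under these assumptions, -2st lowers a 200 Hz voice by roughly 22 Hz, and +12st (one octave) doubles it.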

contour

Optional. Represents changes in pitch for speech content as an array of targets at specified time positions in the speech output. Each target is defined by sets of parameter pairs. For example:

<prosody contour="(0%,+20Hz) (10%,-2st) (40%,+10Hz)">

The first value in each set of parameters specifies the location of the pitch change as a percentage of the duration of the text. The second value specifies the amount to raise or lower the pitch, using a relative value or an enumeration value for pitch (see pitch).
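When a contour has several targets, it can be easier to keep them as data and join them into the attribute value. The following Python sketch does that; the `contour` helper is hypothetical, and the check that positions lie between 0 and 100 is an assumption.

```python
def contour(targets):
    """Build a contour attribute value from (position_percent, pitch_change) pairs."""
    parts = []
    for position, change in targets:
        # Assumed constraint: positions are percentages of the text duration.
        if not 0 <= position <= 100:
            raise ValueError(f"position must be between 0 and 100, got {position}")
        parts.append(f"({position}%,{change})")
    return " ".join(parts)


value = contour([(0, "+20Hz"), (10, "-2st"), (40, "+10Hz")])
print(f'<prosody contour="{value}">')
```

This prints the same opening tag as the example above.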

range

Optional. A value that represents the range of pitch for the text. You may express range using the same absolute values, relative values, or enumeration values used to describe pitch.

rate

Optional. Indicates the speaking rate of the text. You may express rate as:

  • A relative value, expressed as a number that acts as a multiplier of the default rate. For example, a value of 1 leaves the rate unchanged, a value of 0.5 halves it, and a value of 3 triples it.

  • A constant value:

    • x-slow

    • slow

    • medium

    • fast

    • x-fast

    • default
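As a rough illustration of the multiplier, the Python sketch below estimates how long a prompt takes at different rates. It assumes speaking time scales inversely with the rate and ignores pauses. The 6-second default and the `estimated_duration` helper are hypothetical.

```python
def estimated_duration(default_seconds, rate):
    """Estimate speaking time when the default rate is multiplied by `rate`."""
    if rate <= 0:
        raise ValueError("rate multiplier must be positive")
    return default_seconds / rate


# A prompt that takes 6 seconds at the default rate (made-up figure).
for rate in (1, 0.5, 3):
    print(f'rate="{rate}": about {estimated_duration(6.0, rate):g} s')
```

Under that assumption, the same prompt takes about 12 seconds at a rate of 0.5 and about 2 seconds at a rate of 3.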

duration

Optional. The period of time that should elapse while the TTS engine reads the text, in seconds or milliseconds. For example, 2s or 1800ms.

volume

Optional. Indicates the volume level of the speaking voice. You may express the volume as:

  • An absolute value, expressed as a number in the range of 0.0 to 100.0, from quietest to loudest. For example, 75. The default is 100.0.

  • A relative value, expressed as a number preceded by "+" or "-" that specifies an amount to change the volume. For example +10 or -5.5.

  • A constant value:

    • silent

    • x-soft

    • soft

    • medium

    • loud

    • x-loud

    • default
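The <prosody> attributes above can be combined on one element. The following Python sketch wraps text in a <prosody> element and rejects attribute names not listed in this article. The `prosody` helper is hypothetical, and escaping special characters such as "&" is an assumption based on SSML being XML.

```python
from xml.sax.saxutils import escape, quoteattr

# Attribute names described in this article, in a fixed output order.
ALLOWED = ("pitch", "contour", "range", "rate", "duration", "volume")


def prosody(text, **attrs):
    """Wrap `text` in a <prosody> element (hypothetical helper)."""
    unknown = set(attrs) - set(ALLOWED)
    if unknown:
        raise ValueError(f"unsupported prosody attributes: {sorted(unknown)}")
    rendered = "".join(
        f" {name}={quoteattr(str(attrs[name]))}"
        for name in ALLOWED
        if name in attrs
    )
    # Escape &, < and > in the spoken text so the markup stays well-formed.
    return f"<prosody{rendered}>{escape(text)}</prosody>"


print(prosody("Please hold the line", rate="slow", volume="+10"))
```

The printed element speaks the text slowly with the volume raised by 10. As noted at the top of this article, the result is used without a surrounding <speak> element.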