Speech synthesis

Speech synthesis in SpeechKit lets you convert any text to speech in multiple languages. You can choose the voice and manage speech parameters.

A highlight of Yandex speech technology is that we don't stitch together fragments of recorded speech. Instead, we use neural networks to train an acoustic model on the speaker's voice. This produces smooth speech with natural intonation for any text.
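
The sketch below shows what a minimal synthesis request could look like. The endpoint URL, the Api-Key authorization scheme, and the parameter names (text, lang, voice, format) are assumptions made for illustration; check the API reference for the exact values.

```python
# A minimal sketch of a synthesis request, assuming a REST endpoint and
# Api-Key authorization. Parameter names (text, lang, voice, format) are
# assumptions for illustration; check the API reference for the exact names.
import requests

API_KEY = "<your API key>"
URL = "https://tts.api.cloud.yandex.net/speech/v1/tts:synthesize"  # assumed endpoint

response = requests.post(
    URL,
    headers={"Authorization": f"Api-Key {API_KEY}"},
    data={
        "text": "Hello! This text will be converted to speech.",
        "lang": "en-US",       # language code (see the list below)
        "voice": "<voice>",    # one of the available voices
        "format": "oggopus",   # assumed audio format name
    },
)
response.raise_for_status()

# The response body is the synthesized audio.
with open("speech.ogg", "wb") as f:
    f.write(response.content)
```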

Languages

You can synthesize speech in three languages:

  • Russian (ru-RU).
  • English (en-US).
  • Turkish (tr-TR).

Keep in mind that if you select Russian and synthesize text in English, it will still be spoken, but with an accent. For example, try synthesizing the phrase Let me speak from my heart! by selecting Russian in the language settings.
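
For this accent example, the request parameters could look as follows; the parameter names are the same assumptions as in the sketch above.

```python
# Request parameters for the accent example: an English phrase synthesized
# with Russian selected as the language. Parameter names are assumptions,
# as in the sketch above.
params = {
    "text": "Let me speak from my heart!",
    "lang": "ru-RU",             # Russian selected as the language
    "voice": "<russian voice>",  # a voice whose main language is Russian
}
```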

Voices and speech quality

Each voice corresponds to a model trained on the speaker's speech, so the voice determines the tone, the main language, and the gender (male or female) of the synthesized speech. See the list of available voices.

To get the voice you selected and maintain speech quality, pay attention to the speech settings and the quality of the transmitted text.

Speech settings

The quality of speech and voice depend on the speech settings:

  • Language: Each voice was created for a specific language that the speaker spoke. To get the desired quality, use a voice from the list whose main language is the one selected.

    If you select a language other than the voice's main language, the speech quality will be lower, and the voice used may differ from the one you specified.

  • Speech rate: Speech that is too fast or too slow sounds unnatural. However, changing the rate can be useful in commercials, where every second of air time counts.

    Warning

    Changing speed is temporarily unavailable for premium voices.

  • Emotional tone is supported only for Russian (ru-RU) with the jane or omazh voices. Don't use this parameter with other voices or languages, as the speech generated for individual phrases may not match your settings (see the sketch after this list).

    For these voices, a neural network was trained on three datasets in which the speaker read samples with cheerful, irritated, and neutral intonation. We currently have no plans to support emotional tones for other voices. For premium voices, the tone is selected automatically.
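
A minimal sketch of these settings, reusing the request shape from the first example. The parameter names (speed, emotion) and the emotion values are assumptions; check the API reference for the exact names and supported values.

```python
# Sketch of rate and emotional tone settings. The parameter names ("speed",
# "emotion") and the emotion values are assumptions; check the API reference.
params = {
    "text": "Добрый день!",
    "lang": "ru-RU",       # emotional tone is supported only for Russian
    "voice": "jane",       # or "omazh": the two voices with emotion support
    "speed": "1.2",        # relative rate; currently unavailable for premium voices
    "emotion": "neutral",  # assumed value; the model was also trained on cheerful and irritated samples
}
```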

Quality of transmitted text

The quality of synthesized speech also depends on the text you send. A voice other than the one you specified may be used for the following reasons (see the sketch after the list):

  • Long text without punctuation marks. For better quality, insert periods and commas.
  • Sentences on highly specialized or complex topics.
  • Many occurrences of words from other languages.
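
As an illustration of these recommendations, the hypothetical helper below flags long fragments without punctuation before the text is sent for synthesis; the function name and threshold are arbitrary.

```python
import re

def check_text_for_synthesis(text: str, max_words: int = 30) -> list[str]:
    """Flag text properties that may degrade synthesis quality.

    A hypothetical helper illustrating the recommendations above;
    the 30-word threshold is arbitrary.
    """
    warnings = []
    # Long stretches without periods or commas make natural intonation harder.
    for fragment in re.split(r"[.,!?;:]", text):
        if len(fragment.split()) > max_words:
            warnings.append("Long fragment without punctuation; insert periods and commas.")
    return warnings
```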

Premium voices

Some voices were trained using our new technology, and speech synthesized with it sounds more natural.

Key differences:

  • Understanding of context. Before starting speech synthesis, the premium voice engine evaluates the whole text rather than individual sentences. This allows for intonation that is more typical of human speech.
  • Attention to detail. By using deep neural networks for premium voice synthesis, we analyze the original voice in much greater depth. This lets us generate a much clearer voice that is richer in detail and avoids the distortions typical of standard voices.

Important

The new technology currently supports only Russian voices. Choosing a different language reduces the speech quality.

SSML support

To get more control over speech synthesis, you can use Speech Synthesis Markup Language (SSML). This is an XML-based markup language that lets you set the duration of pauses, the pronunciation of individual sounds, and much more. For more information about supported tags and how to use them, see Using SSML.
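
A sketch of a request with SSML markup is shown below. The speak and break tags are standard SSML; the full set of supported tags is described in Using SSML, and passing the markup in an ssml field instead of text is an assumption about the request format.

```python
import requests

API_KEY = "<your API key>"
URL = "https://tts.api.cloud.yandex.net/speech/v1/tts:synthesize"  # assumed endpoint, as above

# <speak> and <break> are standard SSML tags; see Using SSML for the supported set.
ssml = (
    "<speak>"
    "This sentence is followed by a two-second pause."
    '<break time="2s"/>'
    "And this one is spoken after it."
    "</speak>"
)

response = requests.post(
    URL,
    headers={"Authorization": f"Api-Key {API_KEY}"},
    # Passing markup in an "ssml" field instead of "text" is an assumption.
    data={"ssml": ssml, "lang": "en-US", "voice": "<voice>"},
)
response.raise_for_status()
```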

What's next