
Speech synthesis

  • Languages
  • Voices and speech quality
    • Speech settings
    • Quality of transmitted text
    • Premium voices
  • SSML support

Speech synthesis in SpeechKit lets you convert any text to speech in multiple languages. You can choose the voice and manage speech parameters.

A distinctive feature of Yandex speech technology is that it does not stitch together fragments of recorded speech. Instead, we use neural networks to train an acoustic model on the speaker's voice. This produces smooth speech with natural intonation for any text.

Languages

You can synthesize speech in three languages:

  • Russian (ru-RU).
  • English (en-US).
  • Turkish (tr-TR).

Keep in mind that if you select Russian and synthesize text in English, it will still be spoken, but with an accent. For example, try synthesizing the phrase Let me speak from my heart! by selecting Russian in the language settings.
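
For illustration, here is a minimal Python sketch of a synthesis request that sets the language explicitly. It assumes the v1 REST method tts:synthesize described in the API method description and uses the requests library; the IAM token, folder ID, and the oksana voice are placeholders to replace with your own values.

    # Minimal sketch of a synthesis request with the language set explicitly.
    # Assumes the v1 REST method tts:synthesize, an IAM token, and the Python
    # `requests` library; the token, folder ID, and voice are placeholders.
    import requests

    IAM_TOKEN = "<IAM token>"   # see Authentication in the API
    FOLDER_ID = "<folder ID>"   # your Yandex.Cloud folder

    def synthesize(text: str, lang: str = "ru-RU", voice: str = "oksana") -> bytes:
        """Send text to the synthesis endpoint and return the audio bytes."""
        resp = requests.post(
            "https://tts.api.cloud.yandex.net/speech/v1/tts:synthesize",
            headers={"Authorization": f"Bearer {IAM_TOKEN}"},
            data={
                "text": text,
                "lang": lang,          # ru-RU, en-US, or tr-TR
                "voice": voice,
                "folderId": FOLDER_ID,
            },
        )
        resp.raise_for_status()
        return resp.content

    # English text with the Russian language selected: it is still synthesized,
    # but with an accent.
    audio = synthesize("Let me speak from my heart!", lang="ru-RU")
    with open("speech.ogg", "wb") as f:
        f.write(audio)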

Voices and speech quality

Each voice corresponds to a model trained on a particular speaker's speech, so the voice determines the tone, main language, and gender of the speaker (male or female). See the list of available voices.

To use the preferred voice and maintain speech quality:

  • Specify the recommended speech settings.
  • Track the quality of the text you transmit.
  • Try premium voices for communicating with clients.

Speech settings

The quality of the synthesized speech and voice depends on the speech settings (see the request sketch after this list):

  • Language: Each voice was created for the specific language spoken by its speaker. For the best quality, choose a voice from the list whose main language matches the language you select.

    If the selected language does not match the voice's main language, speech quality degrades and a voice other than the one you specified may be used.

  • Speech rate: If the speech is too fast or too slow, it sounds unnatural. However, this can be useful in commercials where every second of air time counts.

  • Emotional tone: Supported only for Russian (ru-RU) with the jane and omazh voices. Don't use this parameter with other voices or languages, because the speech generated for some phrases may not match your settings.

    For these voices, a neural network was trained on three datasets in which the speaker read samples with different intonations: cheerful, irritated, and neutral. We currently have no plans to support emotional tones for other voices. For premium voices, the tone is selected automatically.
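
A sketch of how these settings map to request fields, reusing the IAM_TOKEN and FOLDER_ID placeholders from the sketch in the Languages section; the parameter names (voice, speed, emotion) and the value ranges shown are assumptions to verify against the API method description.

    # Settings-related fields for the same v1 request (parameter names and the
    # speed range are assumptions; verify them against the API method description).
    settings = {
        "text": "Здравствуйте! Чем я могу помочь?",  # "Hello! How can I help?"
        "lang": "ru-RU",       # main language of the chosen voice
        "voice": "jane",       # a voice whose main language matches `lang`
        "speed": "1.0",        # assumed range 0.1-3.0; 1.0 is the natural rate
        "emotion": "good",     # cheerful tone; only meaningful for jane and omazh
        "folderId": FOLDER_ID,
    }
    audio = requests.post(
        "https://tts.api.cloud.yandex.net/speech/v1/tts:synthesize",
        headers={"Authorization": f"Bearer {IAM_TOKEN}"},
        data=settings,
    ).content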

Quality of transmitted text

Reasons why a voice other than the one you specified may be used:

  • Long text without punctuation marks. For better quality, insert periods and commas.
  • Specific sentences on a complex topic.
  • Many occurrences of words from other languages.

Premium voices

Some voices were trained using our new technology, which makes the synthesized speech sound more natural.

Key differences:

  • Understanding of context. Before starting speech synthesis, the premium voice engine evaluates the whole text rather than individual sentences. This allows for intonation that is more typical of human speech.
  • Attention to detail. Premium voice synthesis uses deep neural networks to analyze the original voice in much greater depth. This lets us generate a much clearer voice that is richer in detail and avoids the distortions typical of standard voices.

Warning

The new technology currently only supports voices for Russian. Choosing a different language affects the speech quality.
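
Trying a premium voice only changes the voice field of the same request. The sketch below reuses the synthesize() helper from the Languages section; the voice name alena is an assumed example, so take actual names from the list of voices and keep the language set to ru-RU, as noted in the warning above.

    # Switching to a premium voice (the name "alena" is an assumed example; see
    # the list of voices). Premium voices currently support Russian only.
    audio = synthesize(
        "Добрый день! Ваш заказ подтверждён.",  # "Good afternoon! Your order is confirmed."
        lang="ru-RU",
        voice="alena",
    )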

SSML support

To get more control over speech synthesis, you can use Speech Synthesis Markup Language (SSML). This is an XML-based markup language that lets you set the duration of pauses, the pronunciation of individual sounds, and much more. For more information about supported tags and how to use them, see Using SSML.
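
As a sketch of how marked-up text is passed, the example below sends SSML in place of plain text. It assumes the v1 method accepts an ssml form field instead of text, and it reuses the placeholders from the earlier sketches; see Using SSML for the supported tags.

    # Passing SSML markup instead of plain text (assumes the `ssml` field
    # replaces `text` in the same v1 request; see Using SSML for supported tags).
    ssml = (
        "<speak>"
        'Здравствуйте.<break time="500ms"/>'   # "Hello." followed by a 500 ms pause
        "Ваша заявка зарегистрирована."        # "Your request has been registered."
        "</speak>"
    )
    audio = requests.post(
        "https://tts.api.cloud.yandex.net/speech/v1/tts:synthesize",
        headers={"Authorization": f"Bearer {IAM_TOKEN}"},
        data={
            "ssml": ssml,       # used instead of the `text` field
            "lang": "ru-RU",
            "voice": "oksana",
            "folderId": FOLDER_ID,
        },
    ).content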

What's next

  • Try speech synthesis using the demo on the service page.
  • API method description