Yandex SpeechKit

Speech recognition

  • Recognition methods
  • Recognition process
  • Recognition accuracy

Speech recognition (speech-to-text, STT) is the process of converting speech to text.

The service can recognize speech in several languages:

  • Russian
  • English
  • Turkish

Recognition methods

There are three recognition methods:

  1. Recognition of short audio fragments. This method suits short single-channel audio fragments.

  2. Streaming mode for short audio recognition. This allows you to send audio fragments and get results, including intermediate recognition results, over a single connection.

  3. Recognition of long audio fragments. This lets you recognize long multi-channel audio recordings, but the response may be slower.

    For now, long audio recognition is available only for Russian.
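
The choice between the three methods can be sketched as a small decision function. This is only an illustration of the rules described above, not part of the SpeechKit API: the names Method and choose_method, the parameters, and the decision to route all multi-channel audio to long audio recognition are assumptions made for the example.

```python
from enum import Enum


class Method(Enum):
    SHORT = "short audio recognition"
    STREAMING = "streaming mode for short audio recognition"
    LONG = "recognition of long audio fragments"


# Languages listed on this page.
SUPPORTED_LANGUAGES = {"Russian", "English", "Turkish"}


def choose_method(*, is_long: bool, channels: int,
                  needs_intermediate_results: bool, language: str) -> Method:
    """Pick a recognition method (hypothetical helper, not a SpeechKit call)."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language}")
    if channels < 1:
        raise ValueError("audio must have at least one channel")

    # Assumption: long or multi-channel recordings go to long audio recognition.
    if is_long or channels > 1:
        if language != "Russian":
            raise ValueError("long audio recognition is Russian-only for now")
        return Method.LONG

    # Intermediate results are only described for streaming mode.
    if needs_intermediate_results:
        return Method.STREAMING

    return Method.SHORT
```

For example, a short single-channel English recording that does not need intermediate results maps to Method.SHORT, while a long Russian two-channel recording maps to Method.LONG.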

Recognition process

Audio is recognized in three stages:

  1. Words are detected. Usually several candidate words (hypotheses) are recognized.
  2. Hypotheses are checked using the language model. The model evaluates how consistent each new word is with the words that have already been recognized.
  3. The recognized text is processed: numbers are converted to digits, certain punctuation marks (such as hyphens) are added, and so on. The converted text is the final recognition result that is sent in the response body.
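
The three stages above can be illustrated with a toy sketch. It is not the service's actual implementation: the bigram table, the scores, the greedy selection, and the names pick_words, normalize, and recognize are all hypothetical and chosen only to show how a language model can prefer one hypothesis over another before the text is post-processed.

```python
# Hypothetical bigram log-probabilities standing in for a language model.
BIGRAM_LOGPROB = {
    ("<s>", "call"): -0.5,
    ("<s>", "cold"): -2.5,
    ("call", "two"): -1.0,
    ("call", "too"): -2.0,
    ("two", "taxis"): -0.7,
    ("two", "taxes"): -2.2,
}
UNSEEN_LOGPROB = -5.0  # score for word pairs missing from the toy table

NUMBER_WORDS = {"one": "1", "two": "2", "three": "3", "four": "4", "five": "5"}


def pick_words(hypotheses):
    """Stage 2: greedily keep, at each position, the candidate whose acoustic
    score plus language-model score given the previous word is highest."""
    previous = "<s>"
    chosen = []
    for candidates in hypotheses:
        word, _ = max(
            candidates,
            key=lambda c: c[1] + BIGRAM_LOGPROB.get((previous, c[0]), UNSEEN_LOGPROB),
        )
        chosen.append(word)
        previous = word
    return chosen


def normalize(words):
    """Stage 3: convert number words to digits (punctuation is omitted here)."""
    return " ".join(NUMBER_WORDS.get(w, w) for w in words)


def recognize(hypotheses):
    return normalize(pick_words(hypotheses))


# Stage 1 output: per-position candidates as (word, acoustic log-score).
hypotheses = [
    [("cold", -0.9), ("call", -1.0)],
    [("too", -0.8), ("two", -0.9)],
    [("taxes", -1.0), ("taxis", -1.1)],
]
print(recognize(hypotheses))  # call 2 taxis
```

In this example the acoustic scores alone would favor "cold too taxes", but the language-model scores shift each choice toward the word that fits the words already recognized.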

Recognition accuracy

To increase the accuracy of recognition, specify the language model that the service should use. The model should match the speech topic.

The accuracy of speech recognition is also affected by:

  • Original sound quality.
  • Audio encoding quality.
  • Speech intelligibility and rate.
  • Utterance complexity and length.

See also

  • Supported audio formats
  • Recognition models
  • Short audio recognition
  • Streaming mode for short audio recognition
  • Recognition of long audio fragments
© 2021 Yandex.Cloud LLC