Speech recognition

Updated on March 26, 2024

Speech recognition is speech-to-text (STT) conversion.

You work with SpeechKit through its API. For more information about working with the Yandex Cloud API, see API concepts.

The service is available at stt.api.cloud.yandex.net:443.

You can also work with SpeechKit using the Python SDK, which is built on the SpeechKit API v3.

To try out the Text-to-Speech and Speech-to-Text product demos, visit the SpeechKit page on our website.

Recognition methods

SpeechKit provides two speech recognition methods:

Streaming recognition is used for real-time speech recognition. During streaming recognition, SpeechKit receives short audio fragments and sends the results, including intermediate ones, over a single connection.
Audio file recognition. SpeechKit can recognize audio recordings in synchronous or asynchronous mode.
- Synchronous mode has strict limitations on the size and duration of a file and is suitable for recognizing single-channel audio fragments of up to 30 seconds.
- Asynchronous mode can process multi-channel audio fragments. Maximum recording duration: 4 hours.
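
As a rough illustration of synchronous mode, the sketch below builds (but does not send) an HTTP request for a short audio fragment and rejects files over the 1 MB limit. The host comes from this page. The path `/speech/v1/stt:recognize`, the `folderId` and `lang` query parameters, the Bearer-token header, and the reading of 1 MB as 1024 × 1024 bytes are assumptions to check against the REST v1 reference. `build_sync_request` is a hypothetical helper, not part of any SDK.

```python
import urllib.parse
import urllib.request

# Host is taken from this page; the path is an assumption about REST v1.
STT_URL = "https://stt.api.cloud.yandex.net:443/speech/v1/stt:recognize"
# Synchronous mode accepts at most 1 MB; binary megabytes are assumed here.
MAX_SYNC_BYTES = 1024 * 1024


def build_sync_request(audio: bytes, iam_token: str, folder_id: str,
                       lang: str = "en-US") -> urllib.request.Request:
    """Build, without sending, a synchronous recognition request."""
    if len(audio) > MAX_SYNC_BYTES:
        raise ValueError("audio exceeds the synchronous limit; use asynchronous mode")
    # Query parameter names are assumptions, not confirmed by this page.
    query = urllib.parse.urlencode({"folderId": folder_id, "lang": lang})
    return urllib.request.Request(
        url=f"{STT_URL}?{query}",
        data=audio,  # raw audio bytes in the request body
        method="POST",
        headers={"Authorization": f"Bearer {iam_token}"},
    )
```

Passing the returned object to `urllib.request.urlopen` would send the request. The size check runs first so an oversized file fails locally instead of on the server.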

Which recognition to choose

	Streaming recognition	Synchronous recognition	Asynchronous recognition
Use cases	Telephone assistants and robots; Virtual assistants	Virtual assistants; Voice control; Recognition of short voice messages in messengers	Transcription of audio calls and presentations; Subtitling; Ensuring script adherence in call centers; Identifying successful scripts; Evaluating performance of call center operators
Input data	Real-time voice	Pre-recorded short single-channel audio files	Pre-recorded multi-channel and long audio files
How it works	Exchanging messages with the server over a single connection	Request followed by a quick response	Request followed by a delayed response
Supported APIs	gRPC v2, gRPC v3	REST v1	REST v2
Maximum duration of audio data	5 minutes	30 seconds	4 hours
Maximum amount of transmitted data	10 MB	1 MB	1 GB
Number of recognition channels	1	1	2
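
The limits in the table can be encoded as a small lookup to pick a method programmatically. This is a minimal sketch: `Limits`, `LIMITS`, and `choose_method` are hypothetical names, and megabytes are assumed to be binary (1024 × 1024 bytes), which the table does not specify.

```python
from dataclasses import dataclass

MB = 1024 * 1024  # binary megabytes assumed; the table only says "MB"


@dataclass(frozen=True)
class Limits:
    max_seconds: int
    max_bytes: int
    max_channels: int


# Values copied from the comparison table above.
LIMITS = {
    "streaming": Limits(5 * 60, 10 * MB, 1),
    "synchronous": Limits(30, 1 * MB, 1),
    "asynchronous": Limits(4 * 60 * 60, 1024 * MB, 2),
}


def choose_method(seconds: float, size_bytes: int, channels: int = 1,
                  real_time: bool = False) -> str:
    """Return the first recognition method whose limits fit the audio."""
    # Live audio can only use streaming; files try the quick mode first.
    candidates = ["streaming"] if real_time else ["synchronous", "asynchronous"]
    for name in candidates:
        lim = LIMITS[name]
        if (seconds <= lim.max_seconds and size_bytes <= lim.max_bytes
                and channels <= lim.max_channels):
            return name
    raise ValueError("audio exceeds the limits of every applicable method")
```

For example, `choose_method(10, 200_000)` selects synchronous recognition, while a ten-minute stereo recording falls through to asynchronous recognition.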

Recognition process

Audio is recognized in three stages:

The acoustic model determines which set of low-level attributes corresponds to the audio signal.
The language model uses the acoustic model output to generate the text word by word.
The service post-processes the text: it adds punctuation, converts numerals into numbers, and more.
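
To make the third stage more concrete, here is a toy sketch of text post-processing. It is not how SpeechKit implements this stage. It only shows the kind of transformation involved: spelled-out numbers from zero to ninety-nine become digits, the first letter is capitalized, and a final period is added. The function name `normalize` and the English word lists are illustrative assumptions.

```python
UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}


def normalize(raw: str) -> str:
    """Toy post-processing: numerals to digits, capitalization, final period."""
    words = raw.split()
    out = []
    i = 0
    while i < len(words):
        word = words[i]
        if word in TENS:
            value = TENS[word]
            # Merge a following unit 1-9, e.g. "twenty five" becomes 25.
            nxt = words[i + 1] if i + 1 < len(words) else None
            if nxt in UNITS and 1 <= UNITS[nxt] <= 9:
                value += UNITS[nxt]
                i += 1
            out.append(str(value))
        elif word in UNITS:
            out.append(str(UNITS[word]))
        else:
            out.append(word)
        i += 1
    text = " ".join(out)
    return text[0].upper() + text[1:] + "." if text else text
```

A rule-based pass like this mishandles phrases such as "no one", which is one reason real systems rely on context-aware models for this step.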

Recognition accuracy

Recognition accuracy depends on the recognition model. You can improve the model's accuracy by providing data to tune it. For more information about model tuning, see Extending a speech recognition model.

The accuracy of speech recognition is also affected by:

Original sound quality.
Audio encoding quality.
Speech intelligibility and rate.
Utterance complexity and length.
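
One common way to quantify how these factors affect accuracy is word error rate (WER): the word-level edit distance between a reference transcript and the recognized text, divided by the number of reference words. This page does not say which metric SpeechKit uses, so the sketch below is a generic illustration, and `word_error_rate` is a hypothetical helper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

Comparing WER for the same recording at different encoding qualities or speech rates gives a simple way to see how much each factor matters for a given use case.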
