Streaming speech recognition

Updated on February 1, 2024

Streaming mode allows you to simultaneously send audio for recognition and get recognition results over the same connection. You can also get intermediate recognition results when the speaker has not yet finished the utterance. After a pause, SpeechKit returns the final results and starts recognizing the next utterance.

Voice assistants and smart speakers work using this recognition mode. When you activate the assistant, it starts transmitting speech to the server for recognition. The server processes the data and returns the intermediate and final recognition results of each utterance. The intermediate results are used to show the recognition progress. After the final results, the assistant performs an action, such as playing music or calling another person.

Warning

Streaming mode is designed for real-time audio recognition. To recognize a recorded audio file, use synchronous or asynchronous audio recognition.

Streaming recognition restrictions

SpeechKit streaming recognition has a number of restrictions that need to be taken into account when creating an application. For a full list of SpeechKit restrictions, see Quotas and limits in SpeechKit.

	Streaming recognition
Use cases	Telephone assistants and robots; virtual assistants
Input data	Real-time voice
How it works	Exchanging messages with the server over a single connection
Supported APIs	gRPC v2, gRPC v3
Maximum duration of audio data	5 minutes
Maximum amount of transmitted data	10 MB
Number of recognition channels	1

Using the service

To use the service, create an application that will send audio fragments and process responses with recognition results.

Client application interface code

SpeechKit has two streaming recognition API versions: API v3 and API v2. We recommend API v3 for new projects.

For the application to access the service, clone the Yandex Cloud API repository and generate the client interface code for your programming language from the API v2 or API v3 specification file.

Client application examples:

See also the gRPC documentation for detailed instructions on how to generate interfaces and implement client apps in various programming languages.

Warning

By default, gRPC clients accept response messages of no more than 4 MB. If a response with recognition results exceeds this size, an error is returned.

To get the entire response, increase the maximum message size limit:

For Go, use the MaxCallRecvMsgSize function.
For C++, in the call method, set the max_receive_message_size value.
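
For a Python client, the same limit can be raised through a channel argument. The sketch below only builds the argument list, so it runs without gRPC installed. It assumes the grpcio package, whose channels accept the grpc.max_receive_message_length argument. The helper name channel_options and the 16 MB value are illustrative and are not taken from the SpeechKit documentation.

```python
# Sketch: raise the gRPC receive limit for a Python client (grpcio).
# The 16 MB figure is an arbitrary illustration, not a SpeechKit recommendation.

DEFAULT_LIMIT_BYTES = 4 * 1024 * 1024  # default per-message receive limit


def channel_options(max_receive_bytes: int) -> list[tuple[str, int]]:
    """Build channel arguments that raise the per-message receive limit."""
    if max_receive_bytes <= DEFAULT_LIMIT_BYTES:
        raise ValueError("limit must exceed the 4 MB default to have any effect")
    return [("grpc.max_receive_message_length", max_receive_bytes)]


options = channel_options(16 * 1024 * 1024)
# With grpcio installed, pass the options when creating the channel:
#   channel = grpc.secure_channel(endpoint, credentials, options=options)
```

The endpoint and credentials names in the trailing comment are placeholders for values that your application supplies.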

Authentication with the service

In each request, the application must pass an IAM token or API key for authentication in the service and the ID of the folder for which the account has the ai.speechkit-stt.user role or higher. For more information about permissions, see Access management.

The most straightforward way to authenticate an application is to use a service account. When authenticating as a service account, do not specify the folder ID in your requests: SpeechKit uses the folder in which the service account was created.

Learn more about authentication in SpeechKit.
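
The authentication rules above can be sketched as a helper that builds per-request gRPC metadata. The header names and value formats (authorization with a Bearer or Api-Key prefix, x-folder-id) are assumptions that are not stated on this page, and auth_metadata is a hypothetical helper name. Check both against the authentication documentation before relying on them.

```python
from typing import Optional


def auth_metadata(iam_token: Optional[str] = None,
                  api_key: Optional[str] = None,
                  folder_id: Optional[str] = None) -> list[tuple[str, str]]:
    """Build per-request gRPC metadata from exactly one credential."""
    if (iam_token is None) == (api_key is None):
        raise ValueError("pass exactly one of iam_token or api_key")
    # Assumed header formats; verify them in the authentication docs.
    if iam_token is not None:
        credential = f"Bearer {iam_token}"
    else:
        credential = f"Api-Key {api_key}"
    metadata = [("authorization", credential)]
    # Omit the folder ID when authenticating as a service account.
    if folder_id is not None:
        metadata.append(("x-folder-id", folder_id))
    return metadata
```

A service account would call auth_metadata(api_key=...) with no folder ID, while a user account would pass folder_id as well.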

Recognition request

To recognize speech, the application must first send a message with recognition settings:

For API v3: The RecognizeStreaming message with the session_options type.
For API v2: The StreamingRecognitionRequest message with the RecognitionConfig type.

Once the session is set up, the server waits for messages with audio fragments (chunks). In API v3, send the RecognizeStreaming message with the chunk type; in API v2, send the StreamingRecognitionRequest message with the audio_content type. Take the following recommendations into account when sending messages:

Do not send audio fragments too frequently or too infrequently. The interval between messages should approximately match the duration of the audio fragments you send and must not exceed 5 seconds. For example, send 400 ms of audio every 400 ms.
Maximum duration of transmitted audio for the entire session: 5 minutes.
Maximum size of transmitted audio data: 10 MB.

If messages aren't sent to the service within 5 seconds or the data duration or size limit is reached, the session is terminated. To continue speech recognition, reconnect and send a new message with the speech recognition settings.
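
The message order and session limits described above can be sketched as a generator. Plain dictionaries stand in for the generated gRPC request classes, and the session_options and chunk keys are illustrative stand-ins rather than the real message types. The sketch also assumes 16-bit mono linear PCM input and treats 10 MB as 10 × 1024 × 1024 bytes. Neither assumption is confirmed by this page.

```python
from typing import Iterator

MAX_AUDIO_SECONDS = 5 * 60          # session limit on audio duration
MAX_AUDIO_BYTES = 10 * 1024 * 1024  # session limit on data (assumed binary MB)
CHUNK_MS = 400                      # chunk duration from the example above
BYTES_PER_SAMPLE = 2                # assumes 16-bit mono linear PCM


def chunk_size_bytes(sample_rate_hz: int, chunk_ms: int = CHUNK_MS) -> int:
    """Bytes of PCM audio that cover chunk_ms milliseconds."""
    return sample_rate_hz * BYTES_PER_SAMPLE * chunk_ms // 1000


def session_messages(settings: dict, pcm: bytes,
                     sample_rate_hz: int) -> Iterator[dict]:
    """Yield the settings message, then chunks until a session limit is hit."""
    yield {"session_options": settings}
    size = chunk_size_bytes(sample_rate_hz)
    bytes_per_second = sample_rate_hz * BYTES_PER_SAMPLE
    sent_bytes = 0
    for start in range(0, len(pcm), size):
        chunk = pcm[start:start + size]
        total = sent_bytes + len(chunk)
        if total > MAX_AUDIO_BYTES or total / bytes_per_second > MAX_AUDIO_SECONDS:
            break  # the caller should reconnect and resend the settings
        sent_bytes = total
        yield {"chunk": chunk}
```

A real client would wait about 400 ms between chunk messages, for example with time.sleep, so that audio arrives at roughly real-time speed. When the generator stops early, the remaining audio belongs to a new session that starts with a new settings message.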

SpeechKit returns intermediate speech recognition results before a message stream with audio fragments has finished.

Recognition result

In each recognition result message (StreamingResponse or StreamingRecognitionResponse), the SpeechKit server returns one or more speech fragments that it recognized during this period (chunks). A list of recognized text alternatives is specified for each speech fragment (alternatives).

The SpeechKit server returns recognition results and specifies their type: partial for intermediate results or final for final results. In API v2, the result type is determined by the final flag: a value of False means that the result is intermediate and may change with the next response.
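
One way a client might consume this stream is to keep only the latest intermediate text and commit an utterance when a final result arrives. In the sketch below, the Result class is a simplified stand-in for the response messages, not the generated gRPC class, and collect_transcript is a hypothetical helper name. A boolean final field is used for both API versions for brevity.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Result:
    """Stand-in for one recognition result; not the generated gRPC class."""
    text: str    # top alternative of the recognized fragment
    final: bool  # False: the text may still change with the next response


def collect_transcript(results: Iterable[Result]) -> tuple[list[str], str]:
    """Return (final utterances, latest unconfirmed partial text)."""
    utterances: list[str] = []
    pending = ""
    for result in results:
        if result.final:
            utterances.append(result.text)
            pending = ""           # the next partial belongs to a new utterance
        else:
            pending = result.text  # each partial supersedes the previous one
    return utterances, pending
```

An assistant would typically display the pending text as recognition progress and act only on the committed utterances.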
