Streaming speech recognition
Streaming mode allows you to simultaneously send audio for recognition and get recognition results over the same connection. You can also get intermediate recognition results when the speaker has not yet finished the utterance. After a pause, SpeechKit returns the final results and starts recognizing the next utterance.
Voice assistants and smart speakers work using this recognition mode. When you activate the assistant, it starts transmitting speech to the server for recognition. The server processes the data and returns the intermediate and final recognition results of each utterance. The intermediate results are used to show the recognition progress. After the final results, the assistant performs an action, such as playing music or calling another person.
Streaming recognition restrictions
SpeechKit streaming recognition has a number of restrictions that need to be taken into account when creating an application. For a full list of SpeechKit restrictions, see Quotas and limits in SpeechKit.
|Characteristic|Streaming recognition|
|---|---|
|Use cases|Telephone assistants and robots, virtual assistants|
|Input data|Real-time voice|
|How it works|Exchanging messages with the server over a single connection|
|Supported APIs|gRPC v2, gRPC v3|
|Maximum duration of audio data|5 minutes|
|Maximum amount of transmitted data|10 MB|
|Number of recognition channels|1|
Using the service
To use the service, create an application that will send audio fragments and process responses with recognition results.
Client application interface code
Client application examples:
- Audio file streaming recognition using API v3.
- Microphone speech streaming recognition using API v3.
- Example use of streaming recognition with API v2.
See also the gRPC documentation for detailed instructions on how to generate interfaces and implement client apps in various programming languages.
By default, gRPC clients limit the maximum size of a message they can accept as a response to 4 MB. If a response with recognition results exceeds this limit, an error is returned.
To get the entire response, increase the maximum message size limit:
- For Go, use the `MaxCallRecvMsgSize` function.
- For C++, set the `max_receive_message_size` value in the call method.
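For languages other than Go and C++, the same limit is controlled through a gRPC channel argument. Below is a minimal Python sketch; `grpc.max_receive_message_length` is the standard grpc-core channel argument, and the 64 MB value is an arbitrary example, not a SpeechKit requirement.

```python
# Sketch: lifting the default 4 MB gRPC receive limit via channel options.
# The option name is the standard grpc-core channel argument; the size
# chosen here (64 MB) is an assumption, not a SpeechKit-mandated value.

MAX_MESSAGE_MB = 64  # example limit; pick a value above your largest expected response

def channel_options(max_mb=MAX_MESSAGE_MB):
    """Build channel options that raise the receive-message size limit."""
    return [("grpc.max_receive_message_length", max_mb * 1024 * 1024)]

# These options would be passed when creating the channel, e.g.:
# channel = grpc.secure_channel("stt.api.cloud.yandex.net:443",
#                               grpc.ssl_channel_credentials(),
#                               options=channel_options())
print(channel_options())
```

The option applies to the whole channel, so every streaming call made over it inherits the raised limit.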
Authorization in the service
In each request, the application must pass an IAM token or API key to authenticate with the service, along with the ID of a folder for which the account has the
ai.speechkit-stt.user role or higher. For more information about permissions, see Access management.
It is easier to use a service account to authorize the application. When authorizing with a service account, do not pass the folder ID in requests: SpeechKit uses the folder where the service account was created.
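The credential handling above can be sketched as a small helper that builds gRPC request metadata. This is a hedged illustration: the header names follow common Yandex Cloud conventions (`authorization` with a `Bearer` or `Api-Key` prefix, `x-folder-id` for the folder), so verify them against the current API reference before use.

```python
# Sketch: building gRPC metadata for SpeechKit authentication.
# Assumption: header names ("authorization", "x-folder-id") and value
# prefixes ("Bearer", "Api-Key") follow Yandex Cloud conventions.

def auth_metadata(iam_token=None, api_key=None, folder_id=None):
    """Return gRPC metadata with either an IAM token or an API key.

    folder_id is only needed when the credentials do not already pin a
    folder (e.g. a user account rather than a service account).
    """
    if iam_token:
        md = [("authorization", "Bearer " + iam_token)]
    elif api_key:
        md = [("authorization", "Api-Key " + api_key)]
    else:
        raise ValueError("either an IAM token or an API key is required")
    if folder_id:
        md.append(("x-folder-id", folder_id))
    return md
```

The resulting list would be passed as the `metadata` argument of the streaming call.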
To recognize speech, the application must first send a message with recognition settings:
- For API v3: the `RecognizeStreaming` message with the `session_options` type.
- For API v2: the `StreamingRecognitionRequest` message with the `RecognitionConfig` type.
When the session is set up, the server will wait for messages with audio fragments (chunks). Send the
`RecognizeStreaming` message with the `chunk` type in API v3 or the
`StreamingRecognitionRequest` message with the `audio_content` type in API v2. Take the following recommendations into account when sending messages:
- Do not send audio fragments too often or too rarely. The interval between messages to the service should roughly match the duration of the audio fragments you send, but must not exceed 5 seconds. For example, send 400 ms of audio for recognition every 400 ms.
- Maximum duration of transmitted audio for the entire session: 5 minutes.
- Maximum size of transmitted audio data: 10 MB.
If messages aren't sent to the service within 5 seconds or the data duration or size limit is reached, the session is terminated. To continue speech recognition, reconnect and send a new message with the speech recognition settings.
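The pacing and session limits above can be sketched as a chunking helper. The byte rate below assumes 16 kHz, 16-bit mono LPCM audio; for other formats, adjust the constant accordingly.

```python
# Sketch: slicing raw audio into ~400 ms chunks and checking the session
# limits described above (5 minutes / 10 MB). The byte rate is an
# assumption for 16 kHz, 16-bit mono LPCM audio.

BYTES_PER_SECOND = 16000 * 2          # 16 kHz, 16-bit mono LPCM (assumed format)
CHUNK_MS = 400                        # recommended pacing from the text
MAX_SESSION_SECONDS = 5 * 60          # 5-minute per-session limit
MAX_SESSION_BYTES = 10 * 1024 * 1024  # 10 MB per-session limit

def audio_chunks(pcm):
    """Yield audio slices suitable for one streaming session.

    Raises if the audio would exceed the per-session limits, in which
    case the caller should split the work across several sessions.
    """
    if len(pcm) > MAX_SESSION_BYTES:
        raise ValueError("audio exceeds the 10 MB per-session limit")
    if len(pcm) / BYTES_PER_SECOND > MAX_SESSION_SECONDS:
        raise ValueError("audio exceeds the 5-minute per-session limit")
    step = BYTES_PER_SECOND * CHUNK_MS // 1000  # bytes per 400 ms
    for i in range(0, len(pcm), step):
        # In a real client each slice would be wrapped in a request message
        # and sent roughly every CHUNK_MS milliseconds.
        yield pcm[i:i + step]

# One second of silence splits into three chunks: 400 + 400 + 200 ms.
chunks = list(audio_chunks(b"\x00" * BYTES_PER_SECOND))
print(len(chunks))  # → 3
```

A real client would sleep between yields (or read from a live microphone buffer) so that chunks arrive at roughly real-time pace rather than all at once.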
SpeechKit returns intermediate speech recognition results while the stream of audio fragments is still in progress.
In each recognition result message (`StreamingResponse` or `StreamingRecognitionResponse`), the SpeechKit server returns one or more speech fragments (chunks) that it recognized during this period. A list of recognized text alternatives is specified for each speech fragment (alternatives).
The SpeechKit server returns recognition results and specifies their type:
`partial` for intermediate results and
`final` for final results. When using API v2, the recognition result type is determined by the
`final` flag: the
`False` value means that the result may change with the next response.
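The partial/final distinction above can be sketched with a small client-side handler. The response objects here are stand-ins for the API v2 `StreamingRecognitionResponse` results; the `final` flag and `alternatives` list mirror the fields described in the text.

```python
# Sketch: separating intermediate and final results on the client side.
# `Result` is a stand-in for one API v2 recognition result; the `final`
# flag and `alternatives` list mirror the fields described above.

from dataclasses import dataclass, field

@dataclass
class Result:
    final: bool
    alternatives: list = field(default_factory=list)  # best guess first

def collect_transcript(results):
    """Keep only final utterances; partials are progress indicators only."""
    utterances = []
    for result in results:
        best = result.alternatives[0] if result.alternatives else ""
        if result.final:
            utterances.append(best)  # utterance finished; text is stable
        # else: partial result — display it, but it may change next response
    return " ".join(utterances)

stream = [
    Result(final=False, alternatives=["hel"]),
    Result(final=False, alternatives=["hello wor"]),
    Result(final=True,  alternatives=["hello world"]),
]
print(collect_transcript(stream))  # → hello world
```

An interactive assistant would render the partial text as it arrives and trigger its action only on the final result, as described at the start of this article.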