Streaming speech recognition
Data streaming mode allows you to simultaneously send audio for recognition and get recognition results over the same connection.
Unlike other recognition methods, you can get intermediate results while speech is in progress. After a pause, the service returns final results and starts recognizing the next utterance.
For example, as soon as the user starts talking to Yandex.Station, the speaker begins transmitting the speech to the server for recognition. The server processes the data and returns the intermediate and final results of each utterance recognition. The intermediate results are used for showing the user the progress of speech recognition. Once the final results are available, Yandex.Station performs the requested action, such as playing a movie.
To use the service, create an app that will perform speech recognition in data streaming mode: send audio fragments and process responses with recognition results.
Using the service
Creating a client app
SpeechKit can return intermediate recognition results before the stream of messages with audio fragments has finished.
See examples of client applications on the Example uses for Streaming Recognition API v2 page. For detailed instructions on generating interfaces and deploying client apps in various programming languages, see the gRPC documentation.
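The overall message pattern of a session can be sketched as follows. Plain dicts stand in for the generated protobuf messages here; the real message and field names come from the Streaming Recognition API v2 .proto files, and the function name is ours.

```python
# Sketch of the client-side request stream: the first message carries the
# recognition settings, every later message carries raw audio bytes.
def request_stream(settings, audio_chunks):
    """Yield the settings message first, then one message per audio fragment."""
    yield {"config": settings}          # first message: recognition settings
    for chunk in audio_chunks:
        yield {"audio_content": chunk}  # subsequent messages: audio data

msgs = list(request_stream({"language_code": "ru-RU"}, [b"\x00\x01", b"\x02\x03"]))
```

A real client passes a generator like this to the bidirectional gRPC call, so sending audio and receiving results happen over the same connection.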
Authorization in the service
The application must be authenticated with each request, for example by passing an IAM token. Learn more about service authentication.
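With gRPC, the IAM token is typically passed as call metadata in an authorization header. A minimal sketch, assuming the usual Bearer scheme (the helper name and the placeholder token are ours):

```python
# Hypothetical helper: build the gRPC call metadata carrying the IAM token.
def auth_metadata(iam_token: str):
    """Return metadata suitable for the metadata= argument of a gRPC call."""
    return (("authorization", f"Bearer {iam_token}"),)

md = auth_metadata("example-token")  # pass as metadata= when opening the stream
```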
In each recognition result message, the server returns one or more speech fragments that it managed to recognize during that time (chunks). For each speech fragment, a list of recognized text alternatives is specified (alternatives).
During recognition, speech is split into utterances, and the end of each utterance is marked with the endOfUtterance flag. By default, the server returns a response only after an utterance has been fully recognized. Use the partialResults flag to make the server return intermediate recognition results as well. Intermediate results let you respond to recognized speech quickly, without waiting for the end of the utterance.
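Extracting text from a response can be sketched like this. Dicts mimic the response shape described above (chunks with alternatives, plus the endOfUtterance flag); exactly where the flag sits in the real protobuf message may differ, so treat this as an illustration of the structure only.

```python
# Sketch: pull the top-ranked alternative out of each recognized fragment.
def best_texts(response):
    """Return (texts, end_of_utterance) for one recognition result message."""
    texts = [chunk["alternatives"][0]["text"] for chunk in response["chunks"]]
    return texts, response.get("endOfUtterance", False)

resp = {
    "chunks": [{"alternatives": [{"text": "play a movie"}, {"text": "play a move"}]}],
    "endOfUtterance": True,
}
texts, final = best_texts(resp)  # texts == ["play a movie"], final == True
```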
Limitations of a speech recognition session
After receiving the message with the recognition settings, the service starts a recognition session. The following limitations apply to each session:
You can't send audio fragments too often or too rarely. The time between messages to the service should be approximately the same as the duration of the audio fragments you send, but no more than 5 seconds.
For example, send 400 ms of audio for recognition every 400 ms.
Maximum duration of transmitted audio for the entire session: 5 minutes.
Maximum size of transmitted audio data: 10 MB.
If no message is sent to the service within 5 seconds, or the limit on audio duration or data size is reached, the session is terminated. To continue speech recognition, reconnect and send a new message with the speech recognition settings.
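The limits above can be sketched as a chunking routine that slices audio into ~400 ms fragments and stops before the session caps are exceeded. The PCM format constants (16 kHz, 16-bit mono) are our assumption; a real client would also pause ~400 ms between sends so the message rate matches the fragment duration.

```python
# Sketch: split PCM audio into ~400 ms fragments and enforce session limits.
BYTES_PER_SECOND = 16000 * 2                 # 16 kHz, 16-bit mono PCM (assumed)
CHUNK_BYTES = int(0.4 * BYTES_PER_SECOND)    # ~400 ms per fragment
MAX_SESSION_SECONDS = 5 * 60                 # 5 minutes of audio per session
MAX_SESSION_BYTES = 10 * 1024 * 1024         # 10 MB of data per session

def split_into_chunks(pcm: bytes):
    """Yield ~400 ms fragments, raising once a session limit would be exceeded."""
    sent = 0
    for start in range(0, len(pcm), CHUNK_BYTES):
        chunk = pcm[start:start + CHUNK_BYTES]
        sent += len(chunk)
        if sent > MAX_SESSION_BYTES or sent > MAX_SESSION_SECONDS * BYTES_PER_SECOND:
            raise RuntimeError("session limit reached: reconnect and start a new session")
        yield chunk

chunks = list(split_into_chunks(b"\x00" * 32000))  # 1 s of audio -> 3 fragments
```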