Streaming mode for short audio recognition

  • Using the service
    • Creating a client app
    • Authorization in the service
    • Recognition result
    • Limitations of a speech recognition session
  • Service API
    • Message with recognition settings
    • Experimental additional recognition settings
    • Audio message
    • Message with recognition results
    • Error codes returned by the server
  • Examples

Data streaming mode allows you to simultaneously send audio for recognition and get recognition results over the same connection.

Unlike other recognition methods, you can get intermediate results while speech is in progress. After a pause, the service returns final results and starts recognizing the next utterance.

For example, as soon as the user starts talking to Yandex.Station, the speaker begins transmitting the speech to the server for recognition. The server processes the data and returns the intermediate and final results of each utterance recognition. The intermediate results are used for showing the user the progress of speech recognition. Once the final results are available, Yandex.Station performs the requested action, such as playing a movie.

To use the service, create an app that will perform speech recognition in data streaming mode, i.e., send audio fragments and process responses with recognition results.

Warning

Streaming mode is designed for real-time audio recognition. If you need to send a recorded audio file, use a different method.

Using the service

Creating a client app

For speech recognition, the app should first send a message with recognition settings and then send messages with audio fragments.

While the audio fragments are sent, the service simultaneously returns recognized text fragments for processing (such as outputting them to the console).

To enable the app to access the service, you need to generate the client interface code for the programming language you use. Generate this code from the stt_service.proto file hosted in the Yandex.Cloud API repository.

See examples of client apps below. See also the gRPC documentation for detailed instructions on how to generate interfaces and implement client apps in various programming languages.

Authorization in the service

In each request, the application must transmit the ID of the folder for which you have been granted the editor role or higher. For more information, see Access management.

The application must also be authenticated for each request, such as with an IAM token. Learn more about service authentication.

Recognition result

In each recognition result message, the server returns one or more speech fragments that it managed to recognize during this period (chunks). A list of recognized text alternatives is specified for each speech fragment (alternatives).

During the recognition process, speech is split into utterances and the end of the utterance is marked with the endOfUtterance flag. By default, the server returns a response only after an utterance is fully recognized. You can use the partialResults flag to make the server return intermediate recognition results as well. Intermediate results let you quickly respond to the recognized speech without waiting for the end of the utterance.

Limitations of a speech recognition session

After receiving the message with the recognition settings, the service starts a recognition session. The following limitations apply to each session:

  • You can't send audio fragments too often or too rarely. The time between messages to the service should be approximately the same as the duration of the audio fragments you send, but no more than 5 seconds.

    For example, send 400 ms of audio for recognition every 400 ms.

  • Maximum duration of transmitted audio for the entire session: 5 minutes.

  • Maximum size of transmitted audio data: 10 MB.

If messages aren't sent to the service within 5 seconds or the data duration or size limit is reached, the session is terminated. To continue speech recognition, reconnect and send a new message with the speech recognition settings.
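
The pacing rule and the session limits above can be sketched as follows for headerless LPCM audio. This is only an illustration: the helper names (chunk_duration_seconds, check_session_limits, paced_chunks) are hypothetical, and the sketch assumes 16-bit mono samples (2 bytes each) and reads "10 MB" as 10 * 1024 * 1024 bytes, neither of which is stated in this document.

```python
import time

BYTES_PER_SAMPLE = 2                    # assumption: 16-bit mono LPCM
MAX_SESSION_SECONDS = 5 * 60            # session limit: 5 minutes of audio
MAX_SESSION_BYTES = 10 * 1024 * 1024    # assumption: "10 MB" read as 10 MiB
MAX_GAP_SECONDS = 5                     # no more than 5 s between messages


def chunk_duration_seconds(chunk_size, sample_rate_hertz):
    """Return how many seconds of audio a block of LPCM bytes holds."""
    return chunk_size / (BYTES_PER_SAMPLE * sample_rate_hertz)


def check_session_limits(total_bytes, chunk_size, sample_rate_hertz):
    """Return the list of limits that a single session with this audio would break."""
    problems = []
    if chunk_duration_seconds(total_bytes, sample_rate_hertz) > MAX_SESSION_SECONDS:
        problems.append('audio is longer than 5 minutes')
    if total_bytes > MAX_SESSION_BYTES:
        problems.append('audio is larger than 10 MB')
    if chunk_duration_seconds(chunk_size, sample_rate_hertz) > MAX_GAP_SECONDS:
        problems.append('chunk is longer than 5 seconds')
    return problems


def paced_chunks(audio, chunk_size, sample_rate_hertz, sleep=time.sleep):
    """Yield chunks of audio, pausing after each one for the time it represents."""
    for offset in range(0, len(audio), chunk_size):
        chunk = audio[offset:offset + chunk_size]
        yield chunk
        sleep(chunk_duration_seconds(len(chunk), sample_rate_hertz))
```

With these assumptions, a 4000-byte chunk at 8000 Hz holds 0.25 seconds of audio, so paced_chunks would send one chunk roughly every 250 ms. If check_session_limits reports a problem, the audio would have to be split across several sessions, each started with its own settings message.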

Service API

The service is located at: stt.api.cloud.yandex.net:443

Message with recognition settings

config (object)
Field with the recognition settings and folder ID.

config.specification (object)
Recognition settings.

config.specification.languageCode (string)
The language to use for recognition.
Acceptable values:
  • ru-ru (case-insensitive, used by default): Russian.
  • en-us (case-insensitive): English.
  • tr-tr (case-insensitive): Turkish.

config.specification.model (string)
The language model to be used for recognition.
The closer the model is matched, the better the recognition result. You can only specify one model per request.
Acceptable values depend on the selected language. Default value: general.

config.specification.profanityFilter (boolean)
The profanity filter.
Acceptable values:
  • true: Exclude profanity from recognition results.
  • false (default): Do not exclude profanity from recognition results.

config.specification.partialResults (boolean)
The intermediate results filter.
Acceptable values:
  • true: Return intermediate results (part of the recognized utterance). For intermediate results, final is set to false.
  • false (default): Return only the final results (the entire recognized utterance).

config.specification.singleUtterance (boolean)
Flag that disables recognition after the first utterance.
Acceptable values:
  • true: Recognize only the first utterance, stop recognition, and wait for the user to disconnect.
  • false (default): Continue recognition until the end of the session.

config.specification.audioEncoding (string)
The format of the submitted audio.
Acceptable values:
  • LINEAR16_PCM: LPCM with no WAV header.
  • OGG_OPUS (default): OggOpus format.

config.specification.sampleRateHertz (integer, int64)
The sampling frequency of the submitted audio.
Required if audioEncoding is set to LINEAR16_PCM. Acceptable values:
  • 48000 (default): Sampling rate of 48 kHz.
  • 16000: Sampling rate of 16 kHz.
  • 8000: Sampling rate of 8 kHz.

config.specification.rawResults (boolean)
Flag that indicates how to write numbers.
Acceptable values:
  • true: In words.
  • false (default): In figures.

config.folderId (string)
ID of the folder that you have access to. Required for authorization with a user account (see the UserAccount resource). Don't specify this field if you make a request on behalf of a service account.
Maximum string length: 50 characters.
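
The acceptable values listed above can be checked on the client before a session is opened. The following is a minimal sketch, not part of the service API: validate_config is a hypothetical helper, the settings are modeled as a plain dict with the field names from this section, and the server remains the authority on what it accepts.

```python
ALLOWED_LANGUAGES = {'ru-ru', 'en-us', 'tr-tr'}
ALLOWED_ENCODINGS = {'LINEAR16_PCM', 'OGG_OPUS'}
ALLOWED_SAMPLE_RATES = {48000, 16000, 8000}
MAX_FOLDER_ID_LENGTH = 50


def validate_config(config):
    """Return a list of problems found in a settings dict; empty if none."""
    problems = []
    spec = config.get('specification', {})

    # Language codes are case-insensitive; ru-ru is used by default.
    language = spec.get('languageCode', 'ru-RU')
    if language.lower() not in ALLOWED_LANGUAGES:
        problems.append('unsupported languageCode: %s' % language)

    # OGG_OPUS is the default audio format.
    encoding = spec.get('audioEncoding', 'OGG_OPUS')
    if encoding not in ALLOWED_ENCODINGS:
        problems.append('unsupported audioEncoding: %s' % encoding)

    # The sampling rate is required for LPCM audio.
    rate = spec.get('sampleRateHertz')
    if encoding == 'LINEAR16_PCM' and rate is None:
        problems.append('sampleRateHertz is required for LINEAR16_PCM')
    if rate is not None and rate not in ALLOWED_SAMPLE_RATES:
        problems.append('unsupported sampleRateHertz: %s' % rate)

    if len(config.get('folderId', '')) > MAX_FOLDER_ID_LENGTH:
        problems.append('folderId is longer than 50 characters')
    return problems
```

An empty list means that the settings are consistent with the values documented here. Models are not checked because their acceptable values depend on the selected language.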

Experimental additional recognition settings

For streaming recognition models starting from the Marcus Aurelius version, additional recognition settings are supported. They are passed to the gRPC procedure via metadata.

x-sensitivity-reduction-flag (boolean)
A flag that reduces the sensitivity of background noise recognition.
Acceptable values:
  • true: Sensitivity is reduced.
  • false (default): Sensitivity isn't reduced.

x-normalize-partials (boolean)
A flag that lets you get intermediate recognition results (parts of a recognized utterance) in normalized form: numbers are passed as digits, the profanity filter is enabled, and so on.
Acceptable values:
  • true: Return a normalized result.
  • false (default): Return an unnormalized result.
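
Because these settings travel as gRPC metadata, they can be assembled together with the authorization header. The sketch below is an assumption-laden illustration: build_metadata is a hypothetical helper, and encoding the boolean values as the lowercase strings 'true' and 'false' is an assumption, since this document doesn't specify how the flags are serialized.

```python
def _flag(value):
    """Serialize a boolean as 'true' or 'false' (assumed encoding)."""
    return 'true' if value else 'false'


def build_metadata(iam_token, sensitivity_reduction=None, normalize_partials=None):
    """Build gRPC call metadata as a tuple of (key, value) string pairs.

    A flag left as None is omitted, so the service default applies.
    """
    metadata = [('authorization', 'Bearer %s' % iam_token)]
    if sensitivity_reduction is not None:
        metadata.append(('x-sensitivity-reduction-flag', _flag(sensitivity_reduction)))
    if normalize_partials is not None:
        metadata.append(('x-normalize-partials', _flag(normalize_partials)))
    return tuple(metadata)
```

The resulting tuple has the same shape as the metadata argument passed to StreamingRecognize in the Python example in this article.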

Audio message

Parameter Description
audio_content An audio fragment represented as an array of bytes. The audio must match the format specified in the message with recognition settings.

Message with recognition results

If speech fragment recognition is successful, you will receive a message containing a list of recognition results (chunks[]). Each result contains the following fields:

  • alternatives[]: List of recognized text alternatives. Each alternative contains the following fields:

    • text: Recognized text.
    • confidence: This field currently isn't supported. Don't use it.
  • final: Flag that indicates that this recognition result is final and will not change anymore. If the value is false, it means that the recognition result is intermediate and may change as the following speech fragments are recognized.

  • endOfUtterance: Flag that indicates that this result contains the end of the utterance. If the value is true, the new utterance will start with the next result obtained.

    Note

    If you specified singleUtterance=true in the settings, only one utterance will be recognized per session. After sending a message where endOfUtterance is true, the server doesn't recognize the following utterances and waits until you end the session.
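
The way a client might consume these messages can be sketched with stand-in classes instead of the generated gRPC types. Alternative, Chunk, Response, and collect_utterances are hypothetical names used only for this illustration; the sketch also assumes that the first alternative is the one to display and treats every result with final set to true as a finished piece of text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Alternative:
    text: str


@dataclass
class Chunk:
    alternatives: List[Alternative] = field(default_factory=list)
    final: bool = False


@dataclass
class Response:
    chunks: List[Chunk] = field(default_factory=list)


def collect_utterances(responses):
    """Return (finished, pending): final texts and the latest intermediate text."""
    finished = []
    pending = ''
    for response in responses:
        for chunk in response.chunks:
            if not chunk.alternatives:
                continue
            text = chunk.alternatives[0].text  # assumption: show the first alternative
            if chunk.final:
                # A final result will not change anymore, so keep it.
                finished.append(text)
                pending = ''
            else:
                # An intermediate result replaces the previous intermediate text.
                pending = text
    return finished, pending
```

In a live session, the pending text would be shown to the user as recognition progress, and each finished text would trigger the requested action.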

Error codes returned by the server

To see how gRPC statuses correspond to HTTP codes, see google.rpc.Code.

List of possible gRPC errors returned by the service:

Code Status Description
3 INVALID_ARGUMENT Incorrect request parameters specified. Details are provided in the details field.
8 RESOURCE_EXHAUSTED The client exceeded a quota.
13 INTERNAL Internal server error. The operation cannot be performed due to a server-side technical problem, for example, insufficient computing resources.
16 UNAUTHENTICATED The operation requires authentication. Check the IAM token and the folder ID that you passed.

Examples

To try the examples in this section:

  1. Clone the Yandex.Cloud API repository:

    git clone https://github.com/yandex-cloud/cloudapi
    
  2. Get the ID of the folder your account has been granted access to.

  3. For authentication, the examples use an IAM token (see other authentication methods). Get an IAM token:

    • Instructions for a Yandex account.
    • Instructions for a service account.
  4. Download a sample audio file for recognition. The audio file is in LPCM format with a sampling rate of 8000 Hz.

Then proceed to creating a client app.

Python 3

  1. Install the grpcio-tools package using the pip package manager:

    $ pip install grpcio-tools
    
  2. Go to the directory hosting the Yandex.Cloud API repository, create an output directory, and generate the client interface code there:

    $ cd cloudapi
    $ mkdir output
    $ python -m grpc_tools.protoc -I . -I third_party/googleapis --python_out=output --grpc_python_out=output google/api/http.proto google/api/annotations.proto yandex/cloud/api/operation.proto google/rpc/status.proto yandex/cloud/operation/operation.proto yandex/cloud/ai/stt/v2/stt_service.proto
    

    As a result, the stt_service_pb2.py and stt_service_pb2_grpc.py client interface files as well as dependency files will be created in the output directory.

  3. Create a file (for example, test.py) in the root of the output directory and add the following code to it:

    #coding=utf8
    import argparse
    
    import grpc
    
    import yandex.cloud.ai.stt.v2.stt_service_pb2 as stt_service_pb2
    import yandex.cloud.ai.stt.v2.stt_service_pb2_grpc as stt_service_pb2_grpc
    
    
    CHUNK_SIZE = 4000
    
    def gen(folder_id, audio_file_name):
        # Configure recognition settings.
        specification = stt_service_pb2.RecognitionSpec(
            language_code='ru-RU',
            profanity_filter=True,
            model='general',
            partial_results=True,
            audio_encoding='LINEAR16_PCM',
            sample_rate_hertz=8000
        )
        streaming_config = stt_service_pb2.RecognitionConfig(specification=specification, folder_id=folder_id)
    
        # Send a message with the recognition settings.
        yield stt_service_pb2.StreamingRecognitionRequest(config=streaming_config)
    
        # Read the audio file and send its contents in chunks.
        with open(audio_file_name, 'rb') as f:
            data = f.read(CHUNK_SIZE)
            while data != b'':
                yield stt_service_pb2.StreamingRecognitionRequest(audio_content=data)
                data = f.read(CHUNK_SIZE)
    
    
    def run(folder_id, iam_token, audio_file_name):
        # Establish a connection with the server.
        cred = grpc.ssl_channel_credentials()
        channel = grpc.secure_channel('stt.api.cloud.yandex.net:443', cred)
        stub = stt_service_pb2_grpc.SttServiceStub(channel)
    
        # Send data for recognition.
        it = stub.StreamingRecognize(gen(folder_id, audio_file_name), metadata=(('authorization', 'Bearer %s' % iam_token),))
    
        # Process server responses and output the result to the console.
        try:
            for r in it:
                try:
                    print('Start chunk: ')
                    for alternative in r.chunks[0].alternatives:
                        print('alternative: ', alternative.text)
                    print('Is final: ', r.chunks[0].final)
                    print('')
                except LookupError:
                    print('Not available chunks')
        except grpc.RpcError as err:
            # grpc.RpcError is the public base class for RPC failures.
            print('Error code %s, message: %s' % (err.code(), err.details()))
    
    
    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument('--token', required=True, help='IAM token')
        parser.add_argument('--folder_id', required=True, help='folder ID')
        parser.add_argument('--path', required=True, help='audio file path')
        args = parser.parse_args()
    
        run(args.folder_id, args.token, args.path)
    
  4. Execute the created file by passing arguments with the IAM token, folder ID, and path to the audio file to recognize:

    $ export FOLDER_ID=b1gvmob95yysaplct532
    $ export IAM_TOKEN=CggaATEVAgA...
    $ python test.py --token ${IAM_TOKEN} --folder_id ${FOLDER_ID} --path speech.pcm
    Start chunk:
    alternative: Hello
    Is final: False
    
    Start chunk:
    alternative: Hello world
    Is final: True
    
Node.js

  1. Go to the directory with the Yandex.Cloud API repository, create a directory named src, and generate a dependency file named package.json in it:

    $ cd cloudapi
    $ mkdir src
    $ cd src
    $ npm init
    
  2. Install the necessary packages using npm:

    $ npm install grpc @grpc/proto-loader google-proto-files --save
    
  3. Download a gRPC public key certificate from the official repository and save it as roots.pem in the root of the src directory.

  4. Create a file, for example index.js, in the root of the src directory and add the following code to it:

    const fs = require('fs');
    const grpc = require('grpc');
    const protoLoader = require('@grpc/proto-loader');
    const CHUNK_SIZE = 4000;
    
    // Get the folder ID and IAM token from the environment variables.
    const folderId = process.env.FOLDER_ID;
    const iamToken = process.env.IAM_TOKEN;
    
    // Read the file specified in the arguments.
    const audio = fs.readFileSync(process.argv[2]);
    
    // Specify the recognition settings.
    const request = {
        config: {
            specification: {
                languageCode: 'ru-RU',
                profanityFilter: true,
                model: 'general',
                partialResults: true,
                audioEncoding: 'LINEAR16_PCM',
                sampleRateHertz: '8000'
            },
            folderId: folderId
        }
    };
    
    // Interval between audio messages, in milliseconds.
    // For 16-bit LPCM, the interval can be calculated using the formula: CHUNK_SIZE * 1000 / (2 * sampleRateHertz).
    const FREQUENCY = 250;
    
    const serviceMetadata = new grpc.Metadata();
    serviceMetadata.add('authorization', `Bearer ${iamToken}`);
    
    const packageDefinition = protoLoader.loadSync('../yandex/cloud/ai/stt/v2/stt_service.proto', {
        includeDirs: ['node_modules/google-proto-files', '..']
    });
    const packageObject = grpc.loadPackageDefinition(packageDefinition);
    
    // Establish a connection with the server.
    const serviceConstructor = packageObject.yandex.cloud.ai.stt.v2.SttService;
    const grpcCredentials = grpc.credentials.createSsl(fs.readFileSync('./roots.pem'));
    const service = new serviceConstructor('stt.api.cloud.yandex.net:443', grpcCredentials);
    const call = service['StreamingRecognize'](serviceMetadata);
    
    // Send a message with the recognition settings.
    call.write(request);
    
    // Send the contents of the audio file in chunks.
    let offset = 0;
    const interval = setInterval(() => {
        if (offset < audio.length) {
            // Buffer.slice clamps the end index, so the last chunk may be shorter.
            call.write({audioContent: audio.slice(offset, offset + CHUNK_SIZE)});
            offset += CHUNK_SIZE;
        } else {
            call.end();
            clearInterval(interval);
        }
    }, FREQUENCY);
    
    // Process server responses and output the result to the console.
    call.on('data', (response) => {
        console.log('Start chunk: ');
        response.chunks[0].alternatives.forEach((alternative) => {
            console.log('alternative: ', alternative.text)
        });
        console.log('Is final: ', Boolean(response.chunks[0].final));
        console.log('');
    });
    
    call.on('error', (response) => {
        // Handle errors.
        console.log(response);
    });
    
  5. Set the FOLDER_ID and IAM_TOKEN variables used in the script and run the created file. Specify the path to the audio file in the arguments:

    $ export FOLDER_ID=b1gvmob95yysaplct532
    $ export IAM_TOKEN=CggaATEVAgA...
    $ node index.js speech.pcm
    
    Start chunk:
    alternative: Hello world
    Is final:  true
    
© 2021 Yandex.Cloud LLC