
Recognition of long audio fragments

  • Recognizing long audio fragments
    • Before you start
    • Send a file for recognition
    • Get recognition results
    • Using gRPC
  • Examples
    • Recognize Russian speech in OggOpus format
    • Recognize speech in LPCM format

Long audio fragment recognition can be used for multi-channel audio files up to 1 GB.

Long audio fragment recognition is somewhat cheaper than other recognition methods. However, it's not suitable for online speech recognition due to its longer response time. For more information about pricing, see Pricing for SpeechKit.

Note

For now, you can only recognize long audio in Russian.

Recognizing long audio fragments

To recognize long audio fragments, you need to make two requests:

  1. Send a file for recognition.
  2. Get recognition results.

If you send files using gRPC, see Using gRPC.

Before you start

  1. Send the recognition request on behalf of a service account that has the editor role for the folder where the service account was created.

    If necessary, follow the instructions:

    • Creating a service account. In the management console, you can assign roles when creating a service account.
    • Viewing assigned roles.
    • Assigning roles to a service account.
  2. Get an IAM token or API key for your service account. In our examples, an IAM token is used for authentication.

    To use an API key, pass it in the Authorization header in the following format:

    Authorization: Api-Key <API key>
    
  3. Upload an audio file to Yandex Object Storage and get a link to the uploaded file:

    1. If you don't have a bucket in Object Storage, create one.

    2. Upload an audio file to your bucket. In Object Storage, uploaded files are called objects.

    3. Get a link to the uploaded file. Use this link in your audio recognition request.

      The link to the uploaded file has the following format:

      https://storage.yandexcloud.net/<bucket name>/<file path>
      

      For buckets with restricted access, the link contains additional query parameters (after ?). You don't need to pass these parameters to SpeechKit: they are ignored.
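As a quick sketch, the link format above can be assembled and cleaned up in Python (the bucket and object names here are placeholders):

```python
from urllib.parse import urlsplit, urlunsplit

def object_storage_uri(bucket: str, path: str) -> str:
    # Build a SpeechKit-ready link to an object in Yandex Object Storage.
    return f"https://storage.yandexcloud.net/{bucket}/{path}"

def strip_query(url: str) -> str:
    # SpeechKit ignores query parameters (everything after "?"),
    # so it is safe to drop them from links to restricted buckets.
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
```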

Send a file for recognition

HTTP request

POST https://transcribe.api.cloud.yandex.net/speech/stt/v2/longRunningRecognize

Parameters in the request body

{
    "config": {
        "specification": {
            "languageCode": "string",
            "model": "string",
            "profanityFilter": "boolean",
            "audioEncoding": "string",
            "sampleRateHertz": "integer",
            "audioChannelCount": "integer",
            "rawResults": "boolean"
        }
    },
    "audio": {
        "uri": "string"
    }
}
config (object)
Field with the recognition settings.

config.specification (object)
Recognition settings.

config.specification.languageCode (string)
The language to perform recognition for. Only Russian (ru-RU) is currently supported.

config.specification.model (string)
The language model to use for recognition. The better the model matches your audio, the better the recognition result. Only one model can be specified per request. Acceptable values depend on the selected language. Default value: general. Pricing may vary depending on the selected model.

config.specification.profanityFilter (boolean)
The profanity filter. Acceptable values:
  • true: Exclude profanity from recognition results.
  • false (default): Do not exclude profanity from recognition results.

config.specification.audioEncoding (string)
The format of the submitted audio. Acceptable values:
  • LINEAR16_PCM: LPCM with no WAV header.
  • OGG_OPUS (default): OggOpus format.

config.specification.sampleRateHertz (integer, int64)
The sampling rate of the submitted audio. Required if the format is LINEAR16_PCM. Acceptable values:
  • 48000 (default): sampling rate of 48 kHz.
  • 16000: sampling rate of 16 kHz.
  • 8000: sampling rate of 8 kHz.

config.specification.audioChannelCount (integer, int64)
The number of channels in LPCM files. Default value: 1. Do not use this field for OggOpus files.

config.specification.rawResults (boolean)
Flag that indicates how to write numbers: true for words, false (default) for figures.

audio.uri (string)
The URI of the audio file to recognize. Only links to files stored in Yandex Object Storage are supported.
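To reduce the risk of typos in the nested structure, the request body can be assembled with a small helper. This is a sketch, not part of the SpeechKit API; optional fields are omitted so the service defaults apply:

```python
def make_recognition_body(uri, language_code="ru-RU", model=None,
                          audio_encoding=None, sample_rate_hertz=None,
                          audio_channel_count=None, profanity_filter=None):
    # Assemble the body for POST .../stt/v2/longRunningRecognize.
    spec = {"languageCode": language_code}
    if model is not None:
        spec["model"] = model
    if audio_encoding is not None:
        spec["audioEncoding"] = audio_encoding
    if sample_rate_hertz is not None:
        spec["sampleRateHertz"] = sample_rate_hertz
    if audio_channel_count is not None:
        spec["audioChannelCount"] = audio_channel_count
    if profanity_filter is not None:
        spec["profanityFilter"] = profanity_filter
    return {"config": {"specification": spec}, "audio": {"uri": uri}}
```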

Response

If the request is valid, the service returns an Operation object containing the recognition operation ID (id):

{
 "done": false,
 "id": "e03sup6d5h7rq574ht8g",
 "createdAt": "2019-04-21T22:49:29Z",
 "createdBy": "ajes08feato88ehbbhqq",
 "modifiedAt": "2019-04-21T22:49:29Z"
}

Use this ID at the next step.

Get recognition results

Use the received ID to monitor the operation status. The number of status requests is limited, so take the recognition speed into account: recognizing 1 minute of single-channel audio takes about 10 seconds.
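Because status requests are limited, poll at a modest interval rather than in a tight loop. A minimal sketch, where get_status is a hypothetical callable that performs the GET request and returns the Operation JSON as a dict:

```python
import time

def wait_for_operation(get_status, poll_seconds=10, timeout_seconds=3600):
    # Poll the operation at most once per poll_seconds until done is true.
    deadline = time.monotonic() + timeout_seconds
    while True:
        operation = get_status()
        if operation.get("done"):
            return operation
        if time.monotonic() >= deadline:
            raise TimeoutError("recognition did not finish before the timeout")
        time.sleep(poll_seconds)
```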

Warning

Recognition results are stored on the server for 3 days. During this period, you can request them using the received ID.

HTTP request

GET https://operation.api.cloud.yandex.net/operations/{operationId}

Path parameters

Parameter Description
operationId The operation ID received when sending the recognition request.

Response

Once the recognition is complete, the done field will be set to true and the response field will contain a list of recognition results (chunks[]).

Each result in the chunks[] list contains the following fields:

  • alternatives[]: List of recognized text alternatives. Each alternative contains the following fields:
    • words[]: List of recognized words.
      • startTime: Time stamp of the beginning of the word in the recording. An error of 1-2 seconds is possible.
      • endTime: Time stamp of the end of the word. An error of 1-2 seconds is possible.
      • word: Recognized word. Recognized numbers are written in words (for example, twelve rather than 12).
      • confidence: This field currently isn't supported. Don't use it.
    • text: Full recognized text. By default, numbers are written in figures. To output the entire text in words, specify true in the raw_results field.
    • confidence: This field currently isn't supported. Don't use it.
  • channelTag: Audio channel that recognition was performed for.
{
 "done": true,
 "response": {
  "@type": "type.googleapis.com/yandex.cloud.ai.stt.v2.LongRunningRecognitionResponse",
  "chunks": [
   {
    "alternatives": [
     {
      "words": [
       {
        "startTime": "0.879999999s",
        "endTime": "1.159999992s",
        "word": "when",
        "confidence": 1
       },
       {
        "startTime": "1.219999995s",
        "endTime": "1.539999988s",
        "word": "writing",
        "confidence": 1
       },
       ...
      ],
      "text": "when writing The Hobbit, Tolkien referred to the Norse mythology of the Old English poem Beowulf",
      "confidence": 1
     }
    ],
    "channelTag": "1"
   },
   ...
  ]
 },
 "id": "e03sup6d5h7rq574ht8g",
 "createdAt": "2019-04-21T22:49:29Z",
 "createdBy": "ajes08feato88ehbbhqq",
 "modifiedAt": "2019-04-21T22:49:36Z"
}
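Once done is true, the recognized text can be pulled out of the chunks[] list. A sketch, assuming the Operation JSON shape shown above:

```python
def texts_by_channel(operation):
    # Collect the top recognition alternative of each chunk,
    # grouped by the audio channel it was recognized from.
    result = {}
    for chunk in operation["response"]["chunks"]:
        text = chunk["alternatives"][0]["text"]
        result.setdefault(chunk["channelTag"], []).append(text)
    return result
```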

Using gRPC

To use the service, create an app that will send audio fragments and process responses with recognition results.

To enable the app to send requests and get results, you need to generate the client interface code for the programming language you use. Generate this code from the files stt_service.proto and operation_service.proto in the Yandex.Cloud API repository.

See the gRPC documentation for detailed instructions on how to generate interfaces and deploy client apps for various programming languages.

Warning

When requesting the results of an operation, gRPC clients by default limit the maximum size of a response message to 4 MB. If a response with recognition results exceeds this limit, an error is returned.

To get the entire response, increase the maximum message size limit:

  • For Go, use the MaxCallRecvMsgSize function.
  • For C++, in the call method, set the max_receive_message_size value.
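In Python's grpcio, the corresponding knob is the grpc.max_receive_message_length channel option. A sketch; the endpoint and the 64 MB limit are illustrative:

```python
import grpc

MAX_MESSAGE_BYTES = 64 * 1024 * 1024  # raise the 4 MB default to 64 MB

# Create a TLS channel whose receive limit fits large recognition results.
channel = grpc.secure_channel(
    "transcribe.api.cloud.yandex.net:443",
    grpc.ssl_channel_credentials(),
    options=[("grpc.max_receive_message_length", MAX_MESSAGE_BYTES)],
)
```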

Examples

  • Recognize Russian speech in OggOpus format
  • Recognize speech in LPCM format

Recognize Russian speech in OggOpus format

To recognize speech in OggOpus format, just specify the recognition language in the languageCode field of the configuration. The language model used by default is general.

cURL
Python
  1. Create a request body and save it to a file (such as body.json). Enter the link to the audio file in Object Storage in the uri field:

    {
        "config": {
            "specification": {
                "languageCode": "ru-RU"
            }
        },
        "audio": {
            "uri": "https://storage.yandexcloud.net/speechkit/speech.ogg"
        }
    }
    
  2. Send a recognition request:

    $ export IAM_TOKEN=CggaATEVAgA...
    $ curl -X POST \
        -H "Authorization: Bearer ${IAM_TOKEN}" \
        -d '@body.json' \
        https://transcribe.api.cloud.yandex.net/speech/stt/v2/longRunningRecognize
    
    {
        "done": false,
        "id": "e03sup6d5h1qr574ht99",
        "createdAt": "2019-04-21T22:49:29Z",
        "createdBy": "ajes08feato88ehbbhqq",
        "modifiedAt": "2019-04-21T22:49:29Z"
    }
    

    Save the recognition operation ID that you receive in the response.

  3. Wait a while for the recognition to complete. It takes about 10 seconds to recognize 1 minute of a single-channel audio file.

  4. Send a request to get information about the operation:

    $ curl -H "Authorization: Bearer ${IAM_TOKEN}" \
        https://operation.api.cloud.yandex.net/operations/e03sup6d5h1qr574ht99
    
    {
     "done": true,
     "response": {
      "@type": "type.googleapis.com/yandex.cloud.ai.stt.v2.LongRunningRecognitionResponse",
      "chunks": [
       {
        "alternatives": [
         {
          "text": "your number is 212-85-06",
          "confidence": 1
         }
        ],
        "channelTag": "1"
       }
      ]
     },
     "id": "e03sup6d5h1qr574ht99",
     "createdAt": "2019-04-21T22:49:29Z",
     "createdBy": "ajes08feato88ehbbhqq",
     "modifiedAt": "2019-04-21T22:49:36Z"
    }
    
  1. Create an API key; this example uses it for authentication. To authenticate with an IAM token instead, edit the header variable: replace Api-Key with Bearer and pass an IAM token instead of the API key.

  2. Create a Python file (such as test.py) and add the following code to it:

    # -*- coding: utf-8 -*-
    
    import requests
    import time
    import json
    
    # Specify your API key and the link to the audio file in Object Storage.
    key = '<API key>'
    filelink = 'https://storage.yandexcloud.net/speechkit/speech.ogg'
    
    POST = "https://transcribe.api.cloud.yandex.net/speech/stt/v2/longRunningRecognize"
    
    body = {
        "config": {
            "specification": {
                "languageCode": "ru-RU"
            }
        },
        "audio": {
            "uri": filelink
        }
    }
    
    # If you want to use an IAM token for authentication, replace Api-Key with Bearer.
    header = {'Authorization': 'Api-Key {}'.format(key)}
    
    # Send a recognition request.
    req = requests.post(POST, headers=header, json=body)
    data = req.json()
    print(data)
    
    id = data['id']
    
    # Request the operation status on the server until recognition is complete.
    while True:
    
        time.sleep(1)
    
        GET = "https://operation.api.cloud.yandex.net/operations/{id}"
        req = requests.get(GET.format(id=id), headers=header)
        req = req.json()
    
        if req['done']: break
        print("Not ready")
    
    # Show the full server response in JSON format.
    print("Response:")
    print(json.dumps(req, ensure_ascii=False, indent=2))
    
    # Show only text from recognition results.
    print("Text chunks:")
    for chunk in req['response']['chunks']:
        print(chunk['alternatives'][0]['text'])
    
  3. Run the created file:

    $ python test.py
    

Recognize speech in LPCM format

To recognize speech in LPCM format, specify the file sampling frequency and the number of audio channels in the recognition settings. Set the recognition language in the languageCode field and the language model in the model field.

  1. Create a request body and save it to a file (for example, body.json):

    Note

    To use the default language model, don't pass the model field in the request.

    {
        "config": {
            "specification": {
                "languageCode": "ru-RU",
                "model": "general:rc",
                "audioEncoding": "LINEAR16_PCM",
                "sampleRateHertz": 8000,
                "audioChannelCount": 1
            }
        },
        "audio": {
            "uri": "https://storage.yandexcloud.net/speechkit/speech.pcm"
        }
    }
    
  2. Send a recognition request:

    $ export IAM_TOKEN=CggaATEVAgA...
    $ curl -X POST \
        -H "Authorization: Bearer ${IAM_TOKEN}" \
        -d '@body.json' \
        https://transcribe.api.cloud.yandex.net/speech/stt/v2/longRunningRecognize
    
    {
        "done": false,
        "id": "e03sup6d5h1qr574ht99",
        "createdAt": "2019-04-21T22:49:29Z",
        "createdBy": "ajes08feato88ehbbhqq",
        "modifiedAt": "2019-04-21T22:49:29Z"
    }
    

    Save the recognition operation ID that you receive in the response.

  3. Wait a while for the recognition to complete. It takes about 10 seconds to recognize 1 minute of a single-channel audio file.

  4. Send a request to get information about the operation:

    $ curl -H "Authorization: Bearer ${IAM_TOKEN}" \
        https://operation.api.cloud.yandex.net/operations/e03sup6d5h1qr574ht99
    
    {
     "done": true,
     "response": {
      "@type": "type.googleapis.com/yandex.cloud.ai.stt.v2.LongRunningRecognitionResponse",
      "chunks": [
       {
        "alternatives": [
         {
          "text": "hello world",
          "confidence": 1
         }
        ],
        "channelTag": "1"
       }
      ]
     },
     "id": "e03sup6d5h1qr574ht99",
     "createdAt": "2019-04-21T22:49:29Z",
     "createdBy": "ajes08feato88ehbbhqq",
     "modifiedAt": "2019-04-21T22:49:36Z"
    }
    
© 2021 Yandex.Cloud LLC