Recognition of long audio fragments
Long audio fragment recognition can be used for multi-channel audio files up to 1 GB.
This method is somewhat cheaper than other recognition methods, but its longer response time makes it unsuitable for online speech recognition. For more information about pricing, see Pricing for SpeechKit.
Note
For now, you can only recognize long audio in Russian.
Recognizing long audio fragments
To recognize long audio fragments, you need to execute two requests: first send the file for recognition, then request the recognition results.
If you send files using gRPC, see Using gRPC.
Before you start
- A recognition request should be sent on behalf of a service account with the `editor` role for the folder where the service account was created. If necessary, follow the instructions:
  - Creating a service account. In the management console, you can assign roles when creating a service account.
  - Viewing assigned roles.
  - Assigning roles to a service account.
- Get an IAM token or API key for your service account. Our examples use an IAM token for authentication. To use an API key, pass it in the `Authorization` header in the following format:

  ```
  Authorization: Api-Key <API key>
  ```
- Upload an audio file to Yandex Object Storage and get a link to the uploaded file:
  - If you don't have a bucket in Object Storage, create one.
  - Upload an audio file to your bucket. In Object Storage, uploaded files are called objects.
  - Get a link to the uploaded file and use it in your audio recognition request. The link has the following format:

    ```
    https://storage.yandexcloud.net/<bucket name>/<file path>
    ```

    For a bucket with restricted access, the link also contains query parameters (after `?`). You don't need to pass these parameters to SpeechKit: they are ignored.
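The steps above can be sketched as two small helpers: one builds the Object Storage link in the format shown, the other builds the `Authorization` header for either credential type. The function names are illustrative, not part of any API.

```python
from urllib.parse import quote

# Illustrative helpers for the steps above. The link format
# (https://storage.yandexcloud.net/<bucket name>/<file path>) and the header
# formats ("Bearer <IAM token>" / "Api-Key <API key>") come from this guide;
# the helper functions themselves are our own convenience sketch.

def object_storage_link(bucket: str, file_path: str) -> str:
    """Build a link to an object uploaded to Yandex Object Storage."""
    return f"https://storage.yandexcloud.net/{bucket}/{quote(file_path)}"

def auth_header(credential: str, use_api_key: bool = False) -> dict:
    """Authorization header for an IAM token (default) or an API key."""
    scheme = "Api-Key" if use_api_key else "Bearer"
    return {"Authorization": f"{scheme} {credential}"}

print(object_storage_link("speechkit", "speech.ogg"))
# https://storage.yandexcloud.net/speechkit/speech.ogg
```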
Send a file for recognition
HTTP request
POST https://transcribe.api.cloud.yandex.net/speech/stt/v2/longRunningRecognize
Parameters in the request body
```json
{
  "config": {
    "specification": {
      "languageCode": "string",
      "model": "string",
      "profanityFilter": "boolean",
      "audioEncoding": "string",
      "sampleRateHertz": "integer",
      "audioChannelCount": "integer",
      "rawResults": "boolean"
    }
  },
  "audio": {
    "uri": "string"
  }
}
```
Parameter | Description |
---|---|
config | object Field with the recognition settings. |
config.specification | object Recognition settings. |
config.specification.languageCode | string The language that recognition will be performed for. Only Russian (`ru-RU`) is currently supported. |
config.specification.model | string The language model to be used for recognition. The closer the model matches the audio content, the better the recognition result. You can only specify one model per request. Acceptable values depend on the selected language. Default value: `general`. Pricing may depend on the selected model. |
config.specification.profanityFilter | boolean The profanity filter. Acceptable values: `true` (exclude profanity from recognition results), `false` (default, leave profanity in recognition results). |
config.specification.audioEncoding | string The format of the submitted audio. Acceptable values: `LINEAR16_PCM` (LPCM with no WAV header), `OGG_OPUS` (default, OggOpus format). |
config.specification.sampleRateHertz | integer (int64) The sampling frequency of the submitted audio. Required if the format is set to `LINEAR16_PCM`. Acceptable values: `48000` (default), `16000`, `8000`. |
config.specification.audioChannelCount | integer (int64) The number of channels in LPCM files. Default value: `1`. Don't use this field for OggOpus files. |
config.specification.rawResults | boolean Flag that indicates how to write numbers: `true` for words, `false` (default) for figures. |
audio.uri | string The URI of the audio file for recognition. Only links to files stored in Yandex Object Storage are supported. |
Response
If your request is written correctly, the service returns the Operation object with the recognition operation ID (`id`):
{
"done": false,
"id": "e03sup6d5h7rq574ht8g",
"createdAt": "2019-04-21T22:49:29Z",
"createdBy": "ajes08feato88ehbbhqq",
"modifiedAt": "2019-04-21T22:49:29Z"
}
Use this ID at the next step.
Get recognition results
Monitor the recognition results using the received ID. The number of result monitoring requests is limited, so consider the recognition speed: it takes about 10 seconds to recognize 1 minute of single-channel audio.
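Since monitoring requests are limited, it can help to wait for roughly the expected processing time before the first poll. This sketch derives the wait from the ~10 seconds per minute of single-channel audio figure above; the safety margin and the per-channel scaling are our own assumptions, not API guarantees.

```python
# Estimate an initial wait before polling the operation, based on the
# ~10 s of processing per 1 minute of single-channel audio noted above.
# The 10% margin and the per-channel scaling are assumptions for this sketch.

def estimated_wait_seconds(audio_seconds: float, channels: int = 1) -> float:
    """Rough processing-time estimate for a recognition operation."""
    per_minute = 10.0  # seconds of processing per minute of audio
    return (audio_seconds / 60.0) * per_minute * channels * 1.1

print(estimated_wait_seconds(300))  # 5 minutes of audio -> roughly 55 s
```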
Warning
Recognition results are stored on the server for 3 days. During this period, you can request them using the received ID.
HTTP request
GET https://operation.api.cloud.yandex.net/operations/{operationId}
Path parameters
Parameter | Description |
---|---|
operationId | The operation ID received when sending the recognition request. |
Response
Once the recognition is complete, the `done` field will be set to `true` and the `response` field will contain a list of recognition results (`chunks[]`).
Each result in the `chunks[]` list contains the following fields:

- `alternatives[]`: List of recognized text alternatives. Each alternative contains the following fields:
  - `words[]`: List of recognized words.
    - `startTime`: Time stamp of the beginning of the word in the recording. An error of 1-2 seconds is possible.
    - `endTime`: Time stamp of the end of the word. An error of 1-2 seconds is possible.
    - `word`: Recognized word. Recognized numbers are written in words (for example, `twelve` rather than `12`).
    - `confidence`: This field currently isn't supported. Don't use it.
  - `text`: Full recognized text. By default, numbers are written in figures. To output the entire text in words, set the `raw_results` field to `true`.
  - `confidence`: This field currently isn't supported. Don't use it.
- `channelTag`: Audio channel that recognition was performed for.
{
"done": true,
"response": {
"@type": "type.googleapis.com/yandex.cloud.ai.stt.v2.LongRunningRecognitionResponse",
"chunks": [
{
"alternatives": [
{
"words": [
{
"startTime": "0.879999999s",
"endTime": "1.159999992s",
"word": "when",
"confidence": 1
},
{
"startTime": "1.219999995s",
"endTime": "1.539999988s",
"word": "writing",
"confidence": 1
},
...
],
"text": "when writing The Hobbit, Tolkien referred to the Norse mythology of the Old English poem Beowulf",
"confidence": 1
}
],
"channelTag": "1"
},
...
]
},
"id": "e03sup6d5h7rq574ht8g",
"createdAt": "2019-04-21T22:49:29Z",
"createdBy": "ajes08feato88ehbbhqq",
"modifiedAt": "2019-04-21T22:49:36Z"
}
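The response above can be walked with a few lines of code. This sketch follows the field layout shown (`done`, `response.chunks[].alternatives[].text`, `channelTag`); the sample data is abbreviated from the example and the function name is our own.

```python
# Minimal sketch of extracting text from a completed recognition operation.
# Field layout follows the response example above; sample data is abbreviated.

operation = {
    "done": True,
    "response": {
        "chunks": [
            {
                "alternatives": [
                    {"text": "when writing The Hobbit, Tolkien referred to ...",
                     "confidence": 1}
                ],
                "channelTag": "1",
            }
        ]
    },
}

def texts_by_channel(operation: dict) -> dict:
    """Collect the first (most likely) alternative's text for each channel."""
    result = {}
    if not operation.get("done"):
        return result  # recognition is still in progress
    for chunk in operation["response"]["chunks"]:
        channel = chunk["channelTag"]
        text = chunk["alternatives"][0]["text"]
        result.setdefault(channel, []).append(text)
    return result

print(texts_by_channel(operation))
```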
Using gRPC
To use the service, create an app that will send audio fragments and process responses with recognition results.
To enable the app to send requests and get results, you need to generate the client interface code for the programming language you use. Generate this code from the files stt_service.proto and operation_service.proto in the Yandex.Cloud API repository.
See the gRPC documentation for detailed instructions on how to generate interfaces and deploy client apps for various programming languages.
Warning
When requesting the results of an operation, gRPC clients by default limit the maximum message size they can accept as a response to 4 MB. If a response with recognition results exceeds this limit, an error is returned.
To get the entire response, increase the maximum message size limit:
- For Go, use the MaxCallRecvMsgSize function.
- For C++, set the `max_receive_message_size` value in the call method.
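For Python clients, the standard gRPC channel option `grpc.max_receive_message_size` serves the same purpose. A sketch of the option setup (the channel creation shown in the comment assumes the `grpcio` package and stubs generated from `stt_service.proto` / `operation_service.proto`):

```python
# Raising the gRPC response size limit for a Python client.
# "grpc.max_receive_message_size" is the standard gRPC core channel option;
# the endpoint and channel creation in the comment below are assumptions
# based on this guide, not verified calls.

MAX_MESSAGE_SIZE = 64 * 1024 * 1024  # 64 MB instead of the 4 MB default

channel_options = [("grpc.max_receive_message_size", MAX_MESSAGE_SIZE)]

# With grpcio installed, pass the options when creating the channel:
#
#   import grpc
#   creds = grpc.ssl_channel_credentials()
#   channel = grpc.secure_channel(
#       "operation.api.cloud.yandex.net:443", creds, options=channel_options
#   )

print(channel_options)
```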
Examples
Recognize Russian speech in OggOpus format
To recognize speech in OggOpus format, just specify the recognition language in the `languageCode` field of the configuration. The language model used by default is `general`.
- Create a request body and save it to a file (such as `body.json`). Enter the link to the audio file in Object Storage in the `uri` field:

  ```json
  {
    "config": {
      "specification": {
        "languageCode": "ru-RU"
      }
    },
    "audio": {
      "uri": "https://storage.yandexcloud.net/speechkit/speech.ogg"
    }
  }
  ```
- Send a recognition request:

  ```bash
  $ export IAM_TOKEN=CggaATEVAgA...
  $ curl -X POST \
      -H "Authorization: Bearer ${IAM_TOKEN}" \
      -d '@body.json' \
      https://transcribe.api.cloud.yandex.net/speech/stt/v2/longRunningRecognize
  ```

  ```json
  {
    "done": false,
    "id": "e03sup6d5h1qr574ht99",
    "createdAt": "2019-04-21T22:49:29Z",
    "createdBy": "ajes08feato88ehbbhqq",
    "modifiedAt": "2019-04-21T22:49:29Z"
  }
  ```
Save the recognition operation ID that you receive in the response.
- Wait a while for the recognition to complete. It takes about 10 seconds to recognize 1 minute of a single-channel audio file.
- Send a request to get information about the operation:

  ```bash
  $ curl -H "Authorization: Bearer ${IAM_TOKEN}" \
      https://operation.api.cloud.yandex.net/operations/e03sup6d5h1qr574ht99
  ```

  ```json
  {
    "done": true,
    "response": {
      "@type": "type.googleapis.com/yandex.cloud.ai.stt.v2.LongRunningRecognitionResponse",
      "chunks": [
        {
          "alternatives": [
            {
              "text": "your number is 212-85-06",
              "confidence": 1
            }
          ],
          "channelTag": "1"
        }
      ]
    },
    "id": "e03sup6d5h1qr574ht99",
    "createdAt": "2019-04-21T22:49:29Z",
    "createdBy": "ajes08feato88ehbbhqq",
    "modifiedAt": "2019-04-21T22:49:36Z"
  }
  ```
- Create an API key for authentication in this example. To use an IAM token for authentication instead, correct the header in the `header` variable: replace `Api-Key` with `Bearer` and pass the code that gets an IAM token instead of the API key.

- Create a Python file (such as `test.py`) and add the following code to it:

  ```python
  # -*- coding: utf-8 -*-

  import requests
  import time
  import json

  # Specify your API key and the link to the audio file in Object Storage.
  key = '<API key>'
  filelink = 'https://storage.yandexcloud.net/speechkit/speech.ogg'

  POST = "https://transcribe.api.cloud.yandex.net/speech/stt/v2/longRunningRecognize"

  body = {
      "config": {
          "specification": {
              "languageCode": "ru-RU"
          }
      },
      "audio": {
          "uri": filelink
      }
  }

  # If you want to use an IAM token for authentication, replace Api-Key with Bearer.
  header = {'Authorization': 'Api-Key {}'.format(key)}

  # Send a recognition request.
  req = requests.post(POST, headers=header, json=body)
  data = req.json()
  print(data)

  id = data['id']

  # Request the operation status on the server until recognition is complete.
  while True:
      time.sleep(1)

      GET = "https://operation.api.cloud.yandex.net/operations/{id}"
      req = requests.get(GET.format(id=id), headers=header)
      req = req.json()

      if req['done']:
          break
      print("Not ready")

  # Show the full server response in JSON format.
  print("Response:")
  print(json.dumps(req, ensure_ascii=False, indent=2))

  # Show only text from recognition results.
  print("Text chunks:")
  for chunk in req['response']['chunks']:
      print(chunk['alternatives'][0]['text'])
  ```
- Run the created file:

  ```bash
  $ python test.py
  ```
Recognize speech in LPCM format
To recognize speech in LPCM format, specify the file sampling frequency and the number of audio channels in the recognition settings. Set the recognition language in the `languageCode` field and the language model in the `model` field.
- Create a request body and save it to a file (for example, `body.json`):

  Note

  To use the default language model, don't pass the `model` field in the request.

  ```json
  {
    "config": {
      "specification": {
        "languageCode": "ru-RU",
        "model": "general:rc",
        "audioEncoding": "LINEAR16_PCM",
        "sampleRateHertz": 8000,
        "audioChannelCount": 1
      }
    },
    "audio": {
      "uri": "https://storage.yandexcloud.net/speechkit/speech.pcm"
    }
  }
  ```
- Send a recognition request:

  ```bash
  $ export IAM_TOKEN=CggaATEVAgA...
  $ curl -X POST \
      -H "Authorization: Bearer ${IAM_TOKEN}" \
      -d '@body.json' \
      https://transcribe.api.cloud.yandex.net/speech/stt/v2/longRunningRecognize
  ```

  ```json
  {
    "done": false,
    "id": "e03sup6d5h1qr574ht99",
    "createdAt": "2019-04-21T22:49:29Z",
    "createdBy": "ajes08feato88ehbbhqq",
    "modifiedAt": "2019-04-21T22:49:29Z"
  }
  ```
Save the recognition operation ID that you receive in the response.
- Wait a while for the recognition to complete. It takes about 10 seconds to recognize 1 minute of a single-channel audio file.

- Send a request to get information about the operation:

  ```bash
  $ curl -H "Authorization: Bearer ${IAM_TOKEN}" \
      https://operation.api.cloud.yandex.net/operations/e03sup6d5h1qr574ht99
  ```

  ```json
  {
    "done": true,
    "response": {
      "@type": "type.googleapis.com/yandex.cloud.ai.stt.v2.LongRunningRecognitionResponse",
      "chunks": [
        {
          "alternatives": [
            {
              "text": "hello world",
              "confidence": 1
            }
          ],
          "channelTag": "1"
        }
      ]
    },
    "id": "e03sup6d5h1qr574ht99",
    "createdAt": "2019-04-21T22:49:29Z",
    "createdBy": "ajes08feato88ehbbhqq",
    "modifiedAt": "2019-04-21T22:49:36Z"
  }
  ```
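The LPCM settings in the example above also determine the size of a raw audio file, which is a handy sanity check before uploading. This sketch assumes 16-bit samples (which is what `LINEAR16_PCM` implies); the helper function is illustrative.

```python
# Sanity check for headerless LPCM files: with LINEAR16_PCM each sample is
# 2 bytes, so file size = duration * sampleRateHertz * audioChannelCount * 2.
# Values below match the example request body; the helper itself is ours.

BYTES_PER_SAMPLE = 2  # 16-bit linear PCM

def lpcm_duration_seconds(file_size_bytes: int,
                          sample_rate_hertz: int,
                          channels: int) -> float:
    """Duration of a headerless LINEAR16_PCM file in seconds."""
    return file_size_bytes / (sample_rate_hertz * channels * BYTES_PER_SAMPLE)

# A 1-minute mono recording at 8000 Hz takes 8000 * 1 * 2 * 60 = 960,000 bytes.
print(lpcm_duration_seconds(960_000, 8000, 1))  # 60.0
```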