Extending a speech recognition model

Updated at March 26, 2024

SpeechKit provides two ways to improve speech recognition quality: auto-tuning and model retraining.

Auto-tuning

By default, SpeechKit does not save data provided by users. However, the most effective way to improve a speech recognition model is to train it on real user data.

To improve recognition quality, you can enable auto-tuning. With auto-tuning enabled, the data transmitted in your requests is saved and can be used for further training of the model. To enable it, specify the x-data-logging-enabled: true header in your API requests. For an example with logging enabled, see Troubleshooting in Yandex SpeechKit.
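As a minimal sketch of how the header could be attached to a request: the x-data-logging-enabled header comes from the text above, while the endpoint URL, the Api-Key authorization scheme, and the function names are assumptions made for illustration and should be checked against the API version you use. The code only builds the request and does not send it.

```python
import urllib.request

# Assumed endpoint; verify it against the SpeechKit API version you use.
STT_URL = "https://stt.api.cloud.yandex.net/speech/v1/stt:recognize"


def build_headers(api_key: str, enable_logging: bool = True) -> dict:
    """Return request headers, optionally opting in to data logging."""
    # Assumed authorization scheme (API key of a service account).
    headers = {"Authorization": f"Api-Key {api_key}"}
    if enable_logging:
        # Header named in the documentation: allows the request data to be saved.
        headers["x-data-logging-enabled"] = "true"
    return headers


def build_request(audio: bytes, api_key: str,
                  enable_logging: bool = True) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the audio bytes."""
    return urllib.request.Request(
        STT_URL,
        data=audio,
        headers=build_headers(api_key, enable_logging),
        method="POST",
    )
```

To actually send the request, the resulting object could be passed to urllib.request.urlopen. Keeping the header behind a flag makes it easy to leave logging off for requests that contain data you do not want stored.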

Auto-tuning helps improve recognition quality while the model is in use, without any additional effort on your part.

Model retraining

The basic speech recognition model is designed for everyday language and may not reliably recognize specialized vocabulary. Tuning trains the model to recognize domain-specific terms from different fields:

Medicine: Diagnoses, biological terms, and drug names.
Business: Company names.
Trade: Product ranges (jewelry, electronics, and so on).
Finance: Banking terms and names of banking products.

Data required for tuning

The following data is required for model tuning:

Glossary: Full list of terms. The glossary may contain words from audio recordings used for testing and other vocabulary. The glossary should be provided in a separate file, with each term placed on a separate line.
Text templates (patterns): Phrases of the same type with variable parts, from which training utterances are generated. The length of a template together with its variables must not exceed 200 characters.

The glossary and the text templates must be in TSV format and normalized:

Numerals: Written as words.
Latin words and characters: Transcribed.
Abbreviations: Spelled out.
Acronyms: Spelled out or transcribed.

Not normalized: We are giving away, i.e. as a free gift, 2 kilos of potatoes, a model DNA helix, and ABC magazine from 2020.
Normalized: We're giving away, that is as a free gift, two kilos of potatoes, a model dee-en-a helix, and a-bee-cee magazine from twenty-twenty.
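Before submitting files, it may help to flag lines that still look unnormalized. The following is a rough sketch, not a SpeechKit tool: the function name and the three heuristics (digits, all-caps tokens, dotted abbreviations) are assumptions chosen to match the rules listed above, and they will not catch every case.

```python
import re

# Heuristics only: each pattern targets one normalization rule.
DIGITS = re.compile(r"\d")                         # numerals not written as words
ACRONYM = re.compile(r"\b[A-Z]{2,}\b")             # all-caps tokens such as DNA
DOTTED_ABBR = re.compile(r"\b(?:[A-Za-z]\.){2,}")  # abbreviations such as i.e.


def normalization_issues(line: str) -> list:
    """Return the names of the heuristics that the line triggers."""
    issues = []
    if DIGITS.search(line):
        issues.append("digits")
    if ACRONYM.search(line):
        issues.append("acronym")
    if DOTTED_ABBR.search(line):
        issues.append("abbreviation")
    return issues
```

Applied to the two example sentences above, the first triggers all three heuristics and the second triggers none. A line with an empty result is not guaranteed to be normalized; the check only catches the most obvious problems.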

Training text is generated from the files you provide: glossary terms are inserted into the variable parts of the templates. Tuning is effective only if a sufficient amount of data is used:

At least 1,000 utterances.
At least 3 to 5 phrases, preferably in proportion to the frequency of a term's use in real life.

For example, the first-name.tsv, middle-name.tsv, and last-name.tsv glossary files used for tuning a call center model may contain the first, middle, and last names of customers.

first-name.tsv	middle-name.tsv	last-name.tsv
John Tom Peter ...	Wendell Sean Larry ...	Thompson Carter Smith ...
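A glossary file with one term per line could be read and sanity-checked as sketched below. The function names are hypothetical, and dropping blank lines and duplicates as well as rejecting tabs inside a term are assumptions about what a clean glossary looks like, not documented SpeechKit requirements.

```python
def parse_glossary(text: str) -> list[str]:
    """Parse glossary text with one term per line, keeping first occurrences."""
    terms = []
    seen = set()
    for line_number, line in enumerate(text.splitlines(), start=1):
        term = line.strip()
        if not term or term in seen:
            continue  # skip blank lines and repeated terms
        if "\t" in term:
            # Assumption: a tab inside a line means several terms share one line.
            raise ValueError(f"line {line_number}: expected one term per line")
        seen.add(term)
        terms.append(term)
    return terms


def load_glossary(path: str) -> list[str]:
    """Read a glossary file such as first-name.tsv."""
    with open(path, encoding="utf-8") as glossary_file:
        return parse_glossary(glossary_file.read())
```

For example, load_glossary("first-name.tsv") would return the first names in file order, which is convenient for counting terms before sending the data.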

If the templates use glossary terms in the possessive case, you need to create a separate glossary file for each form. For example, files with names in the possessive case will contain the following entries:

first-name-possessive.tsv	middle-name-possessive.tsv	last-name-possessive.tsv
John Tom Peter ...	Wendell Sean Larry ...	Thompson's Carter's Smith's ...

The templates.tsv file may then contain entries like these:

Hello, are you {first-name=first-name.tsv} {middle-name=middle-name.tsv} {last-name=last-name.tsv}?
Hello, can I talk to {first-name=first-name-possessive.tsv} {middle-name=middle-name-possessive.tsv} {last-name=last-name-possessive.tsv} representative?
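To preview what kind of text such templates produce, the substitution can be sketched locally. This is an illustration only: how SpeechKit actually generates training data is not described here, the function name is hypothetical, the glossaries are passed as an in-memory dictionary keyed by file name, and applying the 200-character limit to the expanded phrase is one possible reading of the rule above.

```python
import itertools
import re

# Matches placeholders of the form {variable=glossary-file.tsv}.
PLACEHOLDER = re.compile(r"\{([^={}]+)=([^{}]+)\}")
MAX_LENGTH = 200  # limit from the documentation, applied here to the expanded phrase


def expand_template(template: str, glossaries: dict) -> list:
    """Substitute every combination of glossary terms into the template."""
    files = [match.group(2) for match in PLACEHOLDER.finditer(template)]
    term_lists = [glossaries[name] for name in files]
    utterances = []
    for combination in itertools.product(*term_lists):
        terms = iter(combination)
        # Replace placeholders left to right with the terms of this combination.
        text = PLACEHOLDER.sub(lambda match: next(terms), template)
        if len(text) > MAX_LENGTH:
            raise ValueError(f"utterance exceeds {MAX_LENGTH} characters: {text!r}")
        utterances.append(text)
    return utterances


glossaries = {
    "first-name.tsv": ["John", "Tom"],
    "last-name.tsv": ["Smith"],
}
template = "Hello, are you {first-name=first-name.tsv} {last-name=last-name.tsv}?"
utterances = expand_template(template, glossaries)
# utterances == ["Hello, are you John Smith?", "Hello, are you Tom Smith?"]
```

Counting the results across all templates gives a quick estimate of whether the data reaches the recommended volume. Note that the full Cartesian product grows quickly with glossary size, so sampling may be preferable for large glossaries.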

Uploading model tuning data

To send the tuning data to the SpeechKit team, contact support.

Model availability dates

Changes to the general:rc model are usually rolled out over a period of 4 weeks, as in any standard release cycle.
