Extending a speech recognition model
SpeechKit provides two ways to improve speech recognition.
Auto-tuning
By default, SpeechKit does not save data provided by users. However, the most effective way to improve a speech recognition model is to train it on real user data.
To improve recognition quality, you can use auto-tuning. With auto-tuning enabled, the data transmitted in your requests is saved and can be used for further training. To enable it, specify the x-data-logging-enabled: true header in your API requests. For an example with logging enabled, see Troubleshooting in Yandex SpeechKit.
Auto-tuning improves recognition quality while the model is in use and requires no additional action on your part.
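As a minimal sketch of how the logging header might be attached to a request, the Python snippet below builds (but does not send) an HTTP request. The endpoint URL, the Api-Key authorization scheme, and the helper name build_recognition_request are assumptions made for illustration; check the SpeechKit API reference for the values that apply to your setup.

```python
import urllib.request

# Assumed endpoint for illustration; take the real one from the API reference.
STT_URL = "https://stt.api.cloud.yandex.net/speech/v1/stt:recognize"


def build_recognition_request(audio: bytes, api_key: str,
                              enable_logging: bool = True) -> urllib.request.Request:
    """Build a POST request, optionally with data logging enabled."""
    headers = {"Authorization": f"Api-Key {api_key}"}  # assumed auth scheme
    if enable_logging:
        # Header named in the text: allows the request data to be saved.
        headers["x-data-logging-enabled"] = "true"
    return urllib.request.Request(STT_URL, data=audio, headers=headers, method="POST")


request = build_recognition_request(b"<audio bytes>", api_key="<your API key>")
# To actually send it: urllib.request.urlopen(request)
```

Keeping the header behind a flag makes it easy to leave logging off for requests whose data should not be saved.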
Model retraining
The basic speech recognition model is designed to work with everyday language, but it may not be sufficient to recognize specific vocabulary. By tuning, you can train the model to recognize domain-specific terms from different fields:
- Medicine: Diagnoses, biological terms, and drug names.
- Business: Company names.
- Trade: Product ranges (jewelry, electronics, and so on).
- Finance: Banking terms and names of banking products.
Data required for tuning
The following data is required for model tuning:
- Glossary: Full list of terms. The glossary may contain words from audio recordings used for testing and other vocabulary. The glossary should be provided in a separate file, with each term placed on a separate line.
- Text patterns: Homogeneous phrases that the model will use to synthesize utterances. The length of a pattern together with variables must not exceed 200 characters.
The glossary and the text templates must be in TSV format, with the text written as follows:
- Numerals: Written as words.
- Latin words and characters: Transcribed.
- Abbreviations: Spelled out.
- Acronyms: Spelled out or transcribed.
Original text: We are giving away, i.e. as a free gift, 2 kilos of potatoes, a model DNA helix, and ABC magazine from 2020.
Text prepared for tuning: We're giving away, that is as a free gift, two kilos of potatoes, a model dee-en-a helix, and a-bee-cee magazine from twenty-twenty.
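The formatting rules above can be approximated by a simple pre-submission check. The sketch below is only a heuristic and not part of SpeechKit: the function name find_format_issues and the regular expressions are assumptions, and the check flags likely problems rather than rewriting the text.

```python
import re

MAX_PATTERN_LENGTH = 200  # limit stated in the text for patterns with variables


def find_format_issues(line: str) -> list[str]:
    """Return a list of likely formatting problems in one line of text."""
    issues = []
    if re.search(r"\d", line):
        issues.append("digits: write numerals as words")
    if re.search(r"\b(?:[A-Za-z]\.){2,}", line):
        issues.append("dotted abbreviation: spell it out")
    for acronym in re.findall(r"\b[A-Z]{2,}\b", line):
        issues.append(f"acronym {acronym}: spell out or transcribe")
    if len(line) > MAX_PATTERN_LENGTH:
        issues.append("longer than 200 characters")
    return issues


raw = ("We are giving away, i.e. as a free gift, 2 kilos of potatoes, "
       "a model DNA helix, and ABC magazine from 2020.")
for issue in find_format_issues(raw):
    print(issue)
```

A line that passes this check may still need manual review, since the heuristic cannot tell whether a term has been transcribed correctly.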
Text data will be generated from the received files. Glossary terms are inserted into the variable part of the templates. Fine-tuning will be effective if a sufficient amount of data is used:
- At least 1 thousand utterances.
- At least 3 to 5 phrases, preferably in proportion to the frequency of a term's use in real life.
For example, the first-name.tsv, middle-name.tsv, and last-name.tsv glossary files used for tuning a call center model may contain the first, middle, and last names of customers.
first-name.tsv | middle-name.tsv | last-name.tsv |
---|---|---|
John | Wendell | Thompson |
Tom | Sean | Carter |
Peter | Larry | Smith |
... | ... | ... |
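A glossary file with one term per line can be produced with plain file I/O. The sketch below is an illustration only: the helper names write_glossary and read_glossary are hypothetical, and dropping blank lines and duplicate terms is an assumption rather than a documented requirement.

```python
import tempfile
from pathlib import Path


def write_glossary(path: Path, terms: list[str]) -> None:
    """Write one term per line, skipping blanks and duplicates (assumed cleanup)."""
    unique_terms = list(dict.fromkeys(t.strip() for t in terms if t.strip()))
    path.write_text("\n".join(unique_terms) + "\n", encoding="utf-8")


def read_glossary(path: Path) -> list[str]:
    """Read a glossary file back into a list of terms."""
    lines = path.read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    glossary_path = Path(tmp) / "first-name.tsv"
    write_glossary(glossary_path, ["John", "Tom", "Peter", "Tom", " "])
    loaded = read_glossary(glossary_path)
```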
If the pattern phrases suggest that the glossary terms may be in possessive case forms, you need to create a separate glossary file for each form. For example, files with names in the possessive case will contain the following entries:
first-name-possessive.tsv | middle-name-possessive.tsv | last-name-possessive.tsv |
---|---|---|
John | Wendell | Thompson's |
Tom | Sean | Carter's |
Peter | Larry | Smith's |
... | ... | ... |
Then the templates.tsv file may contain entries like these:
Hello, are you {first-name=first-name.tsv} {middle-name=middle-name.tsv} {last-name=last-name.tsv}?
Hello, can I talk to {first-name=first-name-possessive.tsv} {middle-name=middle-name-possessive.tsv} representative?
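To preview what a template could produce, the placeholders can be expanded locally. The sketch below assumes the {variable=file.tsv} syntax shown in the entries above and fills every placeholder with every term from the referenced glossary. How SpeechKit actually generates text from the files is not described here, so the exhaustive combination, the expand_template name, and the in-memory glossaries dictionary are illustrative assumptions.

```python
import itertools
import re

PLACEHOLDER = re.compile(r"\{([^=}]+)=([^}]+)\}")  # matches {variable=file.tsv}


def expand_template(template: str, glossaries: dict[str, list[str]]) -> list[str]:
    """Return every phrase obtained by filling the placeholders with terms."""
    files = [match.group(2) for match in PLACEHOLDER.finditer(template)]
    phrases = []
    for combination in itertools.product(*(glossaries[name] for name in files)):
        terms = iter(combination)
        # Placeholders are replaced left to right, one term each.
        phrases.append(PLACEHOLDER.sub(lambda _: next(terms), template))
    return phrases


glossaries = {
    "first-name.tsv": ["John", "Tom"],
    "last-name.tsv": ["Thompson"],
}
phrases = expand_template(
    "Hello, are you {first-name=first-name.tsv} {last-name=last-name.tsv}?",
    glossaries,
)
```

The number of phrases grows as the product of the glossary sizes, so for large glossaries it may be more practical to sample a subset of combinations.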
Uploading model tuning data
To send the tuning data to the SpeechKit team, contact
Model availability dates
Changes to the general:rc model are usually rolled out over a period of 4 weeks, as in any standard release cycle.