Evaluating the quality of STT models
Speech-to-Text (STT) recognition results on the Yandex SpeechKit platform depend on the choice of recognition model. To evaluate the quality of speech recognition, use the common WER (Word Error Rate) metric: the number of word recognition errors (substitutions, deletions, and insertions) divided by the number of words in the reference text. The lower the metric value, the more accurately a speech fragment is recognized. In SpeechKit, the metric is calculated using a dedicated library named stt_metrics.
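The sketch below is a plain-Python illustration of how WER is defined; it is not the stt_metrics implementation that the notebook uses. It computes the word-level edit distance between a reference transcript and a recognition hypothesis and divides it by the number of reference words.

```python
# Conceptual WER calculation (illustration only, not the stt_metrics code):
# WER = (substitutions + deletions + insertions) / number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution or match
    return dist[len(ref)][len(hyp)] / len(ref)

print(wer("thank you very much", "thank you vary much"))  # 0.25: 1 error out of 4 words
```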
To calculate the WER metric in Yandex DataSphere using this library:
- Upload the library.
- Run the test case.
- See how to fix speech recognition errors.
- Evaluate the recognition quality for multiple audio recordings at once.
Before you start
- Create a project in DataSphere and open it.
- Clone the Git repository that contains the notebooks with the Yandex Cloud API usage examples:
  https://github.com/yandex-cloud/examples.git
  Wait until cloning is complete; it may take some time. Once the operation is complete, the folder of the cloned repository will appear in the File Browser section. One way to run the clone from a notebook cell is shown in the sketch after this list.
- Open the examples/speechkitpro/estimate_quality folder and review the contents of the estimate_quality.ipynb notebook. The beginning of the notebook describes the task of checking the quality of STT models and the WER (Word Error Rate) metric.
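If you prefer to run the clone from a notebook cell rather than through the DataSphere interface, a cell like the one below is one option. This is a sketch, not a required step; it assumes the project is allowed to access the internet.

```python
# Clone the examples repository into the project storage from a notebook cell.
# Alternative to cloning through the DataSphere interface; requires internet access.
!git clone https://github.com/yandex-cloud/examples.git
```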
Upload the library
- Select the cell with the code in the Evaluating the quality of STT models section:
  from stt_metrics import WER, ClusterReferences
  from stt_metrics.text_transform import Lemmatizer
- Run the selected cell. To do this, choose Run → Run Selected Cells or press Shift+Enter.
- Wait for the operation to complete.
As a result, modules for evaluating the quality of STT models are uploaded.
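To confirm that the import succeeded, you can run a quick check in the next cell. This is only a trivial sketch that verifies the names are now available in the notebook kernel.

```python
# Confirm that the classes imported in the previous cell are available in the kernel.
print(WER, ClusterReferences, Lemmatizer)
```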
Note
If you refresh or close the browser tab where the notebook is running, the notebook state is saved: variables and the results of previous computations are not reset.
Run the test case
Go to the WER metric usage example section. The following operations are performed there:
- Uploading examples of:
  - Recognized speech
  - Text with markup
- Creating an object named WER() for processing data and calculating the metric.
- Creating an object with information for WER calculation.
- Calculating the WER metric to determine the recognition quality.
- Displaying calculation results (illustrated conceptually in the sketch at the end of this section):
  - The number of recognition errors.
  - The number of words in the compared texts.
  - Text alignment results.
  - The WER metric value.
To calculate the WER metric:
- Select all the cells with the code in the WER metric usage example section and run them.
- Wait for the operation to complete.
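The quantities this section prints can be illustrated with standard-library tools. The sketch below is conceptual only and is not the stt_metrics API: it aligns a reference transcript with a recognition hypothesis using difflib and derives error and word counts from that alignment.

```python
# Conceptual illustration (not the stt_metrics API): align reference and
# hypothesis word sequences and count substitutions, deletions, and insertions.
from difflib import SequenceMatcher

reference = "turn on the lights in the living room".split()
hypothesis = "turn on lights in the leaving room".split()

substitutions = deletions = insertions = 0
for tag, i1, i2, j1, j2 in SequenceMatcher(a=reference, b=hypothesis).get_opcodes():
    if tag == "replace":
        substitutions += max(i2 - i1, j2 - j1)
    elif tag == "delete":
        deletions += i2 - i1
    elif tag == "insert":
        insertions += j2 - j1
    # Show how the two texts line up, operation by operation.
    print(f"{tag:>8}: {' '.join(reference[i1:i2]):<18} | {' '.join(hypothesis[j1:j2])}")

errors = substitutions + deletions + insertions
print("reference words:", len(reference), "| errors:", errors)
print("WER (approximate):", round(errors / len(reference), 2))
```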
See how to fix speech recognition errors
Speech recognition errors may occur for the following reasons:
- Markup artifacts. For example, spelling variants of the same word (such as realize and realise).
- Different spellings of phrases. For example, the phrase theater center can be marked up as theater center, theatre center, theater centre, or theatre centre.
- Variants of word forms. For example, gender and cases of pronouns, verb tenses, and so on.
To improve the value of the WER metric, fix the errors using the following techniques, illustrated conceptually in the sketch after this list:
- Preprocessing of the marked-up text. For example, you can delete markup artifacts.
- Uploading a set of synonyms into the metric calculation model using the ClusterReferences() method.
- Reducing words to their base form (lemmatization) using the Lemmatizer() method. The base form of a word is:
  - For nouns: the nominative case, singular.
  - For pronouns: the nominative case, singular.
  - For verbs: the infinitive.
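The sketch below shows the idea behind the last two techniques in plain Python; it is not the ClusterReferences() or Lemmatizer() API. It collapses spelling variants with a hand-made synonym map and reduces a few word forms to a base form, so that such variants no longer count as errors when the texts are compared.

```python
# Conceptual illustration only (not the stt_metrics ClusterReferences/Lemmatizer API):
# normalize spelling variants and word forms before comparing reference and hypothesis.

# Hand-made synonym clusters: every variant maps to one canonical spelling.
SYNONYMS = {"realise": "realize", "theatre": "theater", "centre": "center"}

# Tiny illustrative lemma table: word forms map to their base form.
LEMMAS = {"jumped": "jump", "jumps": "jump", "dogs": "dog"}

def normalize(text: str) -> str:
    words = []
    for word in text.lower().split():
        word = SYNONYMS.get(word, word)  # collapse spelling variants
        word = LEMMAS.get(word, word)    # reduce to the base form
        words.append(word)
    return " ".join(words)

# After normalization, "theatre centre" and "theater center" compare as equal,
# so the spelling difference no longer counts as a recognition error.
print(normalize("The dogs jumped near the theatre centre"))
# -> "the dog jump near the theater center"
```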
To test the suggested techniques:
- Select all the cells with the code in the Fixing errors section and run them.
- Wait for the operation to complete.
- Check that the WER metric value decreases from 0.27 to 0.2 and then to 0.13 with the sequential use of the methods.
Evaluate the recognition quality for multiple audio recordings at once
Go to the WER metric usage example (aggregate) section. It shows how to calculate the WER metric simultaneously for multiple fragments of marked-up text using the evaluate_wer method. The example contains two pairs of audio files with marked-up text and recognized speech.
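One common way to aggregate WER over several recordings is to pool the error and word counts rather than average the per-recording values. The sketch below shows that pooling idea in plain Python with hypothetical counts; it is not the evaluate_wer implementation, whose exact behavior is documented in the notebook.

```python
# Conceptual illustration (not the stt_metrics evaluate_wer API): aggregate WER
# is the total number of errors across recordings divided by the total number
# of reference words.
pairs = [
    # (recognition errors, reference word count) per recording; hypothetical values.
    (3, 25),
    (6, 40),
]

total_errors = sum(errors for errors, _ in pairs)
total_words = sum(words for _, words in pairs)
print("aggregate WER:", round(total_errors / total_words, 3))  # 9 / 65 ≈ 0.138
```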
To test the suggested method:
- Select all cells with code in the WER metric usage example (aggregate) section and run them.
- Wait for the operation to complete.
- Make sure that the WER metric is calculated for two marked-up texts.