Speech recognition (speech-to-text, STT) is the process of converting speech to text.
The service can recognize speech in several languages.
There are three recognition methods:
- Recognition of short audio fragments. This method is suited to short single-channel audio.
- Streaming mode for short audio recognition. This allows you to send audio fragments and get results, including intermediate recognition results, over a single connection.
- Recognition of long audio fragments. This allows you to recognize long multi-channel audio recordings, but the response may be slower.
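
The choice between the three methods can be sketched as a small decision helper. This is only an illustration of the trade-offs listed above, not part of the service API: the `AudioJob` type, the `choose_method` function, and the `SHORT_AUDIO_LIMIT_S` threshold are hypothetical names, and the threshold value is a placeholder rather than a documented limit.

```python
from dataclasses import dataclass


@dataclass
class AudioJob:
    """Hypothetical description of a recording to be recognized."""
    duration_s: float
    channels: int
    needs_partial_results: bool = False


# Assumption: placeholder boundary between "short" and "long" audio.
# Check the service limits for the real value.
SHORT_AUDIO_LIMIT_S = 30.0


def choose_method(job: AudioJob, short_limit_s: float = SHORT_AUDIO_LIMIT_S) -> str:
    """Return 'short', 'streaming', or 'long' for the given job."""
    # Multi-channel or long recordings go to long audio recognition,
    # accepting that the response may be slower.
    if job.channels > 1 or job.duration_s > short_limit_s:
        return "long"
    # Streaming mode returns intermediate results over a single connection.
    if job.needs_partial_results:
        return "streaming"
    # Otherwise a short single-channel fragment fits the basic method.
    return "short"


if __name__ == "__main__":
    print(choose_method(AudioJob(duration_s=5, channels=1)))
    print(choose_method(AudioJob(duration_s=5, channels=1, needs_partial_results=True)))
    print(choose_method(AudioJob(duration_s=600, channels=2)))
```

The helper checks the long-audio conditions first because multi-channel input is only mentioned for that method in the list above.
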
Audio is recognized in three stages:
- Words are detected. The recognizer usually produces several candidate words (hypotheses) for what was said.
- Hypotheses are checked using the language model. The model estimates how well each new word fits with the words that have already been recognized.
- The recognized text is processed: numbers are converted to digits, certain punctuation marks (such as hyphens) are added, and so on. The converted text is the final recognition result that is sent in the response body.
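
The three stages above can be illustrated with a toy pipeline. This is a minimal sketch under strong assumptions, not the service's actual algorithm: the candidate words and their scores, the bigram table standing in for the language model, and the greedy word-by-word selection are all invented for the example.

```python
# Stage 1 (assumed input): for each position in the utterance, candidate
# words with made-up acoustic scores.
HYPOTHESES = [
    [("I", 0.9), ("eye", 0.8)],
    [("have", 0.9), ("halve", 0.6)],
    [("two", 0.5), ("too", 0.6), ("to", 0.6)],
    [("cats", 0.9)],
]

# Toy "language model": how well a word follows the previous one.
BIGRAM = {
    ("<s>", "I"): 0.9,
    ("I", "have"): 0.9,
    ("have", "two"): 0.8,
    ("two", "cats"): 0.9,
}
DEFAULT_BIGRAM = 0.1  # score for word pairs the toy model has not seen

NUMBER_WORDS = {"one": "1", "two": "2", "three": "3"}


def pick_words(hypotheses):
    """Stage 2: greedily keep the candidate that best fits the previous word."""
    previous = "<s>"
    chosen = []
    for candidates in hypotheses:
        word, _ = max(
            candidates,
            key=lambda c: c[1] * BIGRAM.get((previous, c[0]), DEFAULT_BIGRAM),
        )
        chosen.append(word)
        previous = word
    return chosen


def normalize(words):
    """Stage 3: post-process the text, here only turning number words into digits."""
    return " ".join(NUMBER_WORDS.get(w, w) for w in words)


def recognize(hypotheses):
    return normalize(pick_words(hypotheses))


if __name__ == "__main__":
    print(recognize(HYPOTHESES))  # prints "I have 2 cats"
```

In the example, "too" and "to" have higher acoustic scores than "two", but the language model score for "have two" outweighs them, which is the kind of correction the second stage is meant to provide.
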
To increase recognition accuracy, specify the language model that the service should use. Choose the model that matches the topic of the speech.
The accuracy of speech recognition is also affected by:
- Original sound quality.
- Audio encoding quality.
- Speech intelligibility and rate.
- Utterance complexity and length.