Speech recognition

Speech recognition (speech-to-text, STT) is the process of converting speech to text. The service can recognize spontaneous speech in several languages.

Languages

  • Russian
  • English
  • Turkish

Language models

SpeechKit uses a two-stage approach to speech recognition. In the first stage, the audio signal is analyzed to detect sequences of sounds that could be interpreted as words. Each sequence of sounds usually has several candidate words (hypotheses).

The second stage applies a language model, which checks each hypothesis against the structure of the language and the context, that is, how consistent a given word is with the words that have already been recognized. The speech recognition system uses the language model as a dictionary to validate the hypotheses. Building such a dictionary is a computationally complex task that involves deep learning.
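The two stages described above can be illustrated with a toy sketch. It is not SpeechKit's implementation: the candidate words, the acoustic scores, the bigram table, and the function names `acoustic_best` and `rescore` are all invented for illustration, and a simple bigram lookup stands in for the neural language model.

```python
import math
from itertools import product

# Stage 1 (assumed output): for each detected sound sequence, a list of
# candidate words with made-up acoustic scores (higher is better).
ACOUSTIC = [
    [("go", 1.0)],
    [("two", 0.55), ("to", 0.45)],
    [("abbey", 1.0)],
    [("rode", 0.6), ("road", 0.4)],
]

# Stage 2 (toy language model): made-up probabilities that one word
# follows another. Pairs missing from the table get a small default.
BIGRAMS = {
    ("go", "to"): 0.5,
    ("go", "two"): 0.05,
    ("to", "abbey"): 0.3,
    ("two", "abbey"): 0.05,
    ("abbey", "road"): 0.6,
    ("abbey", "rode"): 0.01,
}
DEFAULT_BIGRAM = 0.001


def acoustic_best(slots):
    """Pick the top acoustic candidate in each slot, ignoring context."""
    return [max(slot, key=lambda cand: cand[1])[0] for slot in slots]


def rescore(slots):
    """Return the word sequence with the best combined acoustic + LM score."""
    best_words, best_score = None, -math.inf
    # Enumerate every combination of candidates (fine for a toy example).
    for combo in product(*slots):
        words = [word for word, _ in combo]
        score = sum(math.log(p) for _, p in combo)
        for prev, curr in zip(words, words[1:]):
            score += math.log(BIGRAMS.get((prev, curr), DEFAULT_BIGRAM))
        if score > best_score:
            best_words, best_score = words, score
    return best_words


print(acoustic_best(ACOUSTIC))  # acoustic evidence only
print(rescore(ACOUSTIC))        # acoustic evidence plus language model
```

With these invented numbers, the acoustic stage alone prefers "two" and "rode", while the combined score selects "go to abbey road" because that sequence is far more consistent with the surrounding words.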

A neural network is trained on speech samples typical of a particular domain, so each language model is optimized for recognizing speech from that domain. For example, the Numbers model is the best choice for recognizing phone numbers, while first and last names are best recognized with the Names model.

The supported language models are listed below for each language.

Russian

  • Queries (general): Short phrases containing 3 to 5 words on various topics, including search engine or website queries:
    • покажи следующий поворот
    • соединить с отделом продаж
    • еще чашку кофе и две мягких французских булочки
    • какая погода во владивостоке
    • напомни купить овощей и фруктов по дороге домой
  • Addresses (maps): Addresses and names of companies or geographical features:
    • поехали на улицу кирпичные выемки пять
    • сколько ехать от льва толстого до новой земли
    • покажи маршрут до музея маяковского
  • Dates (dates): Names of months, ordinal numbers, and cardinal numbers:
    • второго ноль седьмого две тысячи первого
    • двадцать седьмое апреля тысяча девятьсот девятнадцатого года
  • Names (names): First and last names and phone call requests:
    • щукин платон
    • соедините с людчиком
    • переговорить с васей васиным
  • Numbers (numbers): Cardinal numbers from 1 to 999 and delimiters (dot, comma, and dash). This model can be used to dictate phone numbers, account numbers, or document numbers:
    • два двенадцать восемьдесят пять ноль шесть
    • сто пятьдесят семь запятая пятнадцать сорок три

English

  • Queries (general): Short phrases containing 3 to 5 words on various topics, including search engine or website queries:
    • connect me to the sales department
    • another cup of coffee and two soft French rolls
  • Addresses (maps): Addresses and names of companies or geographical features:
    • go to Abbey Road

Turkish

  • Queries (general): Short phrases containing 3 to 5 words on various topics, including search engine or website queries:
    • satış departmanıyla görüşmek istiyorum
    • bir kahve daha ve iki küçük kurabiye
  • Addresses (maps): Addresses and names of companies or geographical features:
    • Atatürk Bulvarı'na git
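A minimal sketch of choosing a model for a given language follows. It assumes that the three groups of models in the list above correspond to Russian, English, and Turkish, in that order, as the example phrases suggest. The names `SUPPORTED_MODELS` and `choose_model` are hypothetical, and falling back to the general model is a design choice made for this example, not documented service behavior.

```python
# Model identifiers come from the list above; the mapping structure itself
# is an assumption made for this sketch.
SUPPORTED_MODELS = {
    "Russian": {"general", "maps", "dates", "names", "numbers"},
    "English": {"general", "maps"},
    "Turkish": {"general", "maps"},
}


def choose_model(language: str, preferred: str) -> str:
    """Return the preferred model if the language has it, else "general"."""
    models = SUPPORTED_MODELS.get(language)
    if models is None:
        raise ValueError(f"Unsupported language: {language}")
    # Hypothetical fallback: use the general-purpose model when the
    # domain-specific one is not listed for this language.
    return preferred if preferred in models else "general"


print(choose_model("Russian", "numbers"))
print(choose_model("English", "names"))
```

Under these assumptions, a request for the Names model in English falls back to the general model, because only Queries and Addresses are listed for English.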

The models are trained on large datasets generated by Yandex services and applications. This allows us to continually improve the quality of speech recognition.

Recognition quality

Recognition accuracy depends on the quality of the source audio, the audio encoding quality, the clarity and rate of speech, and the complexity and length of the phrases. Accuracy also improves when the topic of the speech matches the chosen language model.
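One common way to compare recognition results, for example when trying different language models on the same recordings, is the word error rate (WER). This page does not say which metric the service uses, so the sketch below is only an illustration. The function name `word_error_rate` is hypothetical, and the transcripts are assumed to be already lowercased and stripped of punctuation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("go to abbey road", "go two abbey road"))
```

A lower value means the recognized text is closer to the reference transcript.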
