Speech recognition

Speech recognition (speech-to-text, STT) is the process of converting speech to text. The service can recognize spontaneous speech in the languages listed below.

Languages

  • Russian
  • English

Language models

SpeechKit has a two-stage approach to speech recognition. At the first stage, the audio signal is analyzed to detect sequences of sounds that could be interpreted as words. For each sequence of sounds, there are usually several possible words (or hypotheses).

The second stage applies a language model, which checks each hypothesis against the language structure and context, that is, how consistent a given word is with the words that have already been recognized. The speech recognition system uses the language model as a dictionary for validating the hypotheses. Building a dictionary like this is a computationally complex task that involves deep learning.
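The two stages described above can be illustrated with a toy sketch. It is not SpeechKit's implementation: the bigram counts, the acoustic scores, the vocabulary size, and the names `bigram_log_prob` and `rescore` are all assumptions made for illustration, and a greedy word-by-word choice stands in for whatever search the real system uses.

```python
import math

# Toy bigram counts standing in for a trained language model
# (illustrative values only, not real SpeechKit data).
BIGRAM_COUNTS = {
    ("<s>", "go"): 8,
    ("go", "to"): 9,
    ("to", "abbey"): 4,
    ("abbey", "road"): 5,
    ("to", "abby"): 1,
}
VOCAB_SIZE = 1000  # assumed vocabulary size, used for add-one smoothing


def bigram_log_prob(prev, word):
    """Add-one smoothed log-probability of `word` following `prev`."""
    pair_count = BIGRAM_COUNTS.get((prev, word), 0)
    prev_count = sum(c for (p, _), c in BIGRAM_COUNTS.items() if p == prev)
    return math.log((pair_count + 1) / (prev_count + VOCAB_SIZE))


def rescore(hypotheses_per_slot):
    """Greedily pick the best word for each slot.

    `hypotheses_per_slot` is a list of slots. Each slot is a list of
    (word, acoustic_log_score) pairs, i.e. the first-stage hypotheses.
    The second stage adds the language model score given the previous word.
    """
    prev, result = "<s>", []
    for slot in hypotheses_per_slot:
        best_word, _ = max(
            slot, key=lambda h: h[1] + bigram_log_prob(prev, h[0])
        )
        result.append(best_word)
        prev = best_word
    return result


# Hypothetical first-stage output: "rode" sounds slightly more likely than
# "road", but the language model prefers "road" after "abbey".
slots = [
    [("go", -0.1)],
    [("to", -0.2), ("two", -0.3)],
    [("abbey", -0.4), ("abby", -0.4)],
    [("rode", -0.5), ("road", -0.6)],
]
print(rescore(slots))  # ['go', 'to', 'abbey', 'road']
```

In this sketch the acoustic scores alone would select "rode" for the last word. Adding the context score is what turns the result into a consistent phrase.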

Each neural network is trained on speech samples that are typical of a particular domain, so each language model is optimized for recognizing speech from that domain. For example, the Numbers model is the best choice for recognizing phone numbers, while a person's first and last names are best recognized using the Names model.

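The available language models are listed below. The first group of models is shown with example phrases in Russian, and the second group with example phrases in English.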

  • Queries (general) — Short phrases containing 3-5 words on various topics, including search engine or website queries. For example:
    • [покажи следующий поворот]
    • [соединить с отделом продаж]
    • [еще чашку кофе и две мягких французских булочки]
    • [какая погода во владивостоке]
    • [напомни купить овощей и фруктов по дороге домой]
  • Addresses (maps) — Addresses and names of companies or geographical features. For example:
    • [поехали на улицу кирпичные выемки пять]
    • [сколько ехать от льва толстого до новой земли]
    • [покажи маршрут до музея маяковского]
  • Dates (dates) — Names of months, ordinal numbers, and cardinal numbers. For example:
    • [второго ноль седьмого две тысячи первого]
    • [двадцать седьмое апреля тысяча девятьсот девятнадцатого года]
  • Names (names) — First and last names and phone call requests. For example:
    • [щукин платон]
    • [соедините с людчиком]
    • [переговорить с васей васиным]
  • Numbers (numbers) — Cardinal numbers from 1 to 999 and delimiters (dot, comma, and dash). This model can be used to dictate phone numbers, account numbers, or document numbers. For example:
    • [два двенадцать восемьдесят пять ноль шесть]
    • [сто пятьдесят семь запятая пятнадцать сорок три]

Models with example phrases in English:

  • Queries (general) — Short phrases containing 3-5 words on various topics, including search engine or website queries.
    • [connect me to the sales department]
    • [another cup of coffee and two soft French rolls]
  • Addresses (maps) — Addresses and names of companies or geographical features.
    • [go to Abbey Road]
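
Choosing a model for a given task can be sketched as a simple lookup. The identifiers (general, maps, dates, names, numbers) come from the list above. Everything else is an assumption: the task names, the helper `choose_model`, the fallback to the general model, and the reading that the models shown with English examples are the ones available for English.

```python
# Model identifiers grouped by the language of the examples given above.
# Treating this grouping as "availability" is an assumption.
MODELS_BY_LANGUAGE = {
    "Russian": {"general", "maps", "dates", "names", "numbers"},
    "English": {"general", "maps"},
}

# Hypothetical application tasks mapped to the model that fits them best.
TASK_TO_MODEL = {
    "search_query": "general",
    "navigation": "maps",
    "date_entry": "dates",
    "phone_call": "names",
    "phone_number": "numbers",
}


def choose_model(language, task):
    """Return the model identifier to use for `task` in `language`.

    Falls back to "general" when the preferred model is not listed
    for the language or the task is unknown (an assumed policy).
    """
    if language not in MODELS_BY_LANGUAGE:
        raise ValueError(f"unsupported language: {language}")
    preferred = TASK_TO_MODEL.get(task, "general")
    if preferred in MODELS_BY_LANGUAGE[language]:
        return preferred
    return "general"


print(choose_model("Russian", "phone_number"))  # numbers
print(choose_model("English", "phone_number"))  # general
```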

The models are trained on large datasets generated by Yandex services and applications. This allows us to continually improve the quality of speech recognition.

Recognition quality

Recognition accuracy depends on the quality of the source audio, the audio encoding quality, the clarity and rate of speech, and the complexity and length of the phrases. Accuracy also improves when the topic of the speech matches the chosen language model.
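
To compare how these factors or different language models affect accuracy on your own recordings, one common metric is word error rate (WER): the word-level edit distance between a reference transcript and the recognized text, divided by the number of reference words. The source does not say which metric SpeechKit uses, so the sketch below, including the function name `word_error_rate`, is an illustrative assumption.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("go to abbey road", "go to abby road"))  # 0.25
```

A lower value means a closer match to the reference. Running the same recordings through two models and comparing the results is one way to check which model suits a topic.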
