Optical character recognition (OCR)

This section describes how the Optical Character Recognition (OCR) feature works.

Preparing the recognition request

In your request, you specify a list of analysis features to be applied to the image. To recognize text, use the TEXT_DETECTION feature type and set the list of languages in the configuration.
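
A minimal feature specification might look like the sketch below. The analyze_specs and content wrapper fields and the language_codes field name are assumptions used for illustration; this section itself only names the TEXT_DETECTION type and the list of languages.

{
  "analyze_specs": [{
    "content": "<Base64-encoded image>",
    "features": [{
      "type": "TEXT_DETECTION",
      "text_detection_config": {
        "language_codes": ["*"]
      }
    }]
  }]
}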

Request configuration

In the configuration, you can specify:

  • The list of languages used to select the language model for recognition.

    If you don't know the text language, enter "*" so that the service can automatically select the most appropriate model.

  • The model to be used to detect text in the image. Available models:

    • page (default): Good for images with any number of lines of text.

    • line: Good for recognizing a single line of text. For example, if you don't want to send an entire image, you can cut out a single line and send it for recognition.

      The image must contain only one line of text, and the text height must be at least 80% of the image height; otherwise, the results of the line model may be unpredictable. For a sample configuration that selects this model, see the sketch after this list. Example of an appropriate image:

      [Image: a tightly cropped single line of text that fills most of the image height]
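
A possible configuration for a cropped single line is sketched below. The model values come from the list above; the language_codes field name is an assumption.

{
  "type": "TEXT_DETECTION",
  "text_detection_config": {
    "language_codes": ["en"],
    "model": "line"
  }
}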

Language model detection

The service provides text recognition based on a language model that is trained on a specific set of languages. The model is selected automatically based on the list of languages you specified in the configuration.

Only a single model is used each time you recognize text. For example, if an image contains text in Chinese and Japanese, only one of these languages will be recognized. To recognize both languages, specify several analysis features with different language lists in your request.
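
For instance, a request could carry two TEXT_DETECTION features, one per language list, as in this sketch (the features wrapper and the language_codes field name are assumptions):

{
  "features": [{
      "type": "TEXT_DETECTION",
      "text_detection_config": {
        "language_codes": ["zh"]
      }
    },
    {
      "type": "TEXT_DETECTION",
      "text_detection_config": {
        "language_codes": ["ja"]
      }
    }
  ]
}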

Tip

If your text is in Russian and English, the English-Russian model works best. To use this model, specify one of these languages or both of them in text_detection_config, but don't specify any other languages.
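
For example, a configuration limited to these two languages might look like this (language_codes is the same assumed field name as in the sketches above):

{
  "text_detection_config": {
    "language_codes": ["ru", "en"]
  }
}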

Image requirements

An image in a request must meet the following requirements:

  • Supported file formats: JPEG, PNG, PDF.

    You specify the MIME type of the file in the mime_type property. The default is image. For an example with an explicit MIME type, see the sketch after this list.

  • Maximum file size: 1 MB.

  • Image size should not exceed 20 MP (length × width).
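
Below is a sketch of a request that sends a PDF file, under the same assumptions about the analyze_specs, content, and language_codes fields as in the earlier examples:

{
  "analyze_specs": [{
    "content": "<Base64-encoded PDF, up to 1 MB>",
    "mime_type": "application/pdf",
    "features": [{
      "type": "TEXT_DETECTION",
      "text_detection_config": {
        "language_codes": ["*"]
      }
    }]
  }]
}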

Response with recognition results

The service highlights the text characters found in the image and groups them by level: words are grouped into lines, the lines into blocks, and the blocks into pages.

[Image: recognized characters grouped into words, words into lines, lines into blocks, and blocks into pages]

As a result, the service returns an object that specifies the following for each level (an abridged structure is sketched after the list):

  • pages[] — Page size.
  • blocks[] — Position of the text on the page.
  • lines[] — Position and recognition confidence.
  • words[] — Position, confidence, text, and language used for recognition.
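
An abridged sketch of the nesting is shown below, with ellipses standing for omitted fields. Only the pages[], blocks[], lines[], and words[] levels are taken from the list above; the width and height field names for the page size are assumptions.

{
  "pages": [{
    "width": "1920",
    "height": "1080",
    "blocks": [{
      "boundingBox": { "vertices": [ ... ] },
      "lines": [{
        "boundingBox": { "vertices": [ ... ] },
        "confidence": 0.9412244558,
        "words": [{
          "boundingBox": { "vertices": [ ... ] },
          "text": "you",
          "confidence": 0.9412244558,
          "languages": [{ "languageCode": "en" }]
        }]
      }]
    }]
  }]
}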

To show the position of the text, the service returns the coordinates of the rectangle that frames the text. Coordinates are the number of pixels from the upper-left corner of the image.

The coordinates of a rectangle are calculated from the upper-left corner and specified counterclockwise:

1←4
↓ ↑
2→3

Example of a recognized word with coordinates:

{
  "boundingBox": {
    "vertices": [{
        "x": "410",
        "y": "404"
      },
      {
        "x": "410",
        "y": "467"
      },
      {
        "x": "559",
        "y": "467"
      },
      {
        "x": "559",
        "y": "404"
      }
    ]
  },
  "languages": [{
    "languageCode": "en",
    "confidence": 0.9412244558
  }],
  "text": "you",
  "confidence": 0.9412244558
}

Recognition confidence

The recognition confidence shows the service's confidence in the result. For example, the value "confidence": 0.9412244558 for the line "we like you" means that the text was recognized correctly with a probability of about 94%.

Currently, the recognition confidence value is only calculated for lines. The confidence value for words and language is substituted with the line's confidence value.

What's next