Optical character recognition (OCR)

This section describes how the Optical Character Recognition (OCR) feature works in the service.

Text recognition process

Text in an image is recognized in two stages:

  1. Selecting the language model for text recognition.
  2. Detecting text in the image.

As a result of recognition, the service returns a JSON object with the recognized text, its position on the page, and the recognition confidence value.

Language model selection

The service recognizes text using a model trained on a specific set of languages. Languages that differ greatly from one another (such as Arabic and Chinese) are covered by different models.

The model is selected automatically based on the list of languages specified in the language_codes property. If you don't know the text language, specify "language_codes": ["*"] to let the service choose the best-fitting model.
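
For example, a request configuration might look like this (a minimal sketch: only the language_codes property is described above, so the surrounding structure and property names such as analyze_specs, features, type, and text_detection_config are assumptions about the request layout):

{
  "analyze_specs": [{
    "content": "<base64-encoded image>",
    "features": [{
      "type": "TEXT_DETECTION",
      "text_detection_config": {
        "language_codes": ["*"]
      }
    }]
  }]
}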

For each requested feature, only one model is used. For example, if the image contains text in both Chinese and Japanese, only one of these languages is recognized. To recognize languages covered by different models, specify several features in the request, as shown in the sketch below.
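
For example, to recognize both Chinese and Japanese text in the same image, you could specify two features, each with its own language list (a sketch reusing the assumed request layout above; zh and ja are the ISO 639-1 codes for Chinese and Japanese):

"features": [{
    "type": "TEXT_DETECTION",
    "text_detection_config": {
      "language_codes": ["zh"]
    }
  },
  {
    "type": "TEXT_DETECTION",
    "text_detection_config": {
      "language_codes": ["ja"]
    }
  }]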

See the examples in the Recognizing text in an image instructions.

Tip

If your text is in Russian and English, the English-Russian model works best. To use it, specify one or both of these languages, but don't specify other languages in the same configuration.
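
For example (a sketch reusing the assumed request layout above):

"text_detection_config": {
  "language_codes": ["ru", "en"]
}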

Detecting text in images

The service highlights the text characters found in the image and groups them by level: words are grouped into lines, the lines into blocks, and the blocks into pages.

As a result, the service returns a JSON object in which additional information is provided for each level (see the nesting sketch after this list):

  • pages[] — Page size.
  • blocks[] — Position of the text on the page.
  • lines[] — Position and recognition confidence.
  • words[] — Position, confidence, text, and language used for recognition.
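
The levels are nested as sketched below. This is schematic only: the word-level fields are shown in full in the example further down, and the page-size property names width and height are assumptions:

{
  "pages": [{
    "width": "...",
    "height": "...",
    "blocks": [{
      "boundingBox": { "vertices": [ ... ] },
      "lines": [{
        "boundingBox": { "vertices": [ ... ] },
        "confidence": ...,
        "words": [ ... ]
      }]
    }]
  }]
}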

To show the position of the text, the service returns the coordinates of the rectangle that frames the text. Coordinates are the number of pixels from the upper-left corner of the image.

The vertices of the rectangle are listed counterclockwise, starting from the upper-left corner:

1←4
↓ ↑
2→3

Example of a recognized word with coordinates:

{
  "boundingBox": {
    "vertices": [{
        "x": "410",
        "y": "404"
      },
      {
        "x": "410",
        "y": "467"
      },
      {
        "x": "559",
        "y": "467"
      },
      {
        "x": "559",
        "y": "404"
      }
    ]
  },
  "languages": [{
    "languageCode": "en",
    "confidence": 0.9412244558
  }],
  "text": "you",
  "confidence": 0.9412244558
}
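
In this example, the first vertex (410, 404) is the upper-left corner of the rectangle and the third vertex (559, 467) is the lower-right corner, so the framed word is 559 − 410 = 149 pixels wide and 467 − 404 = 63 pixels tall.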

Image requirements

An image in a request must meet the following requirements:

  • Supported file formats: JPEG, PNG, PDF.

    The MIME type of the file is specified in the mime_type property. The default value is image (see the sketch after this list).

  • Maximum file size: 1 MB.

  • Image size should not exceed 20 MP (length × width).
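
For example, to send a PDF file, you might set the MIME type explicitly (a sketch reusing the assumed request layout above; application/pdf is the standard MIME type for PDF files):

"analyze_specs": [{
  "content": "<base64-encoded PDF file>",
  "mime_type": "application/pdf",
  ...
}]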

Recognition confidence

The recognition confidence shows how certain the service is about the result. For example, the value "confidence": 0.9412244558 for the line "we like you" means that the text was recognized correctly with a probability of about 94%.

Currently, the recognition confidence value is only calculated for lines. The confidence value for words and for the language is substituted with the confidence value of the line that contains them. In the word example above, 0.9412244558 is therefore the confidence of the whole line rather than of the word "you" itself.

What's next