Digitizing archives in Yandex Vision

  • Before you start
    • Required paid resources
  • Create and set up a virtual machine
    • Create a virtual machine:
    • Configure Yandex CLI
    • Set up a service account
  • Set up the AWS CLI
  • Set up access to Object Storage
  • Create an archive with images
  • Create a script for digitizing and uploading images
    • Preparation
    • Writing a script
  • Check the digitized content
  • Delete the created cloud resources

Yandex Vision is a computer vision service for image analysis.

With this guide, you will:

  • Set up a Yandex.Cloud environment for Yandex Vision.
  • Recognize text in images using Yandex Vision.
  • Upload the results to Yandex Object Storage.

To digitize an archive:

  1. Before you start.
  2. Create and configure a virtual machine.
  3. Set up the AWS CLI.
  4. Set up access to Object Storage.
  5. Create an archive with images.
  6. Create a script for digitizing and uploading images.
  7. Check the digitized content.
  8. Delete the created cloud resources.

Before you start

Before creating a virtual machine, you need to sign up for Yandex.Cloud and create a billing account:

  1. Go to the management console. Then log in to Yandex.Cloud or sign up if you don't already have an account.
  2. On the billing page, make sure that a billing account is linked and that its status is ACTIVE or TRIAL_ACTIVE. If you don't have a billing account, create one.

If you have an active billing account, go to the Yandex.Cloud homepage and select or create a folder to run your VM in. Learn more about the resource hierarchy in Yandex.Cloud.

Required paid resources

Infrastructure costs for recognition and data storage include:

  • A fee for a continuously running VM (see Yandex Compute Cloud pricing).
  • A fee for using a dynamic or static external IP address (see Yandex Virtual Private Cloud pricing).
  • A fee for using object storage (see pricing for Yandex Object Storage).
  • A fee for using Yandex Vision (see pricing for Yandex Vision).

Create and set up a virtual machine

Create a virtual machine:

  1. On the folder page in the management console, click Create resource, and select Virtual machine.

  2. In the Name field, enter a name for the VM.

    • Length — from 3 to 63 characters.
    • The name may contain lowercase Latin letters, numbers, and hyphens.
    • The first character must be a letter. The last character can't be a hyphen.
  3. Select the availability zone to host the VM in.

  4. Under Images from Cloud Marketplace, select the CentOS 7 image.

  5. Under Disks, select:

    • SSD
    • 19 GB
  6. Under Computing resources:

    • Choose a platform for the VM.
    • Specify the number of vCPUs and amount of RAM:
      • Platform: Intel Cascade Lake.
      • Guaranteed vCPU share: 20%.
      • vCPU: 2.
      • RAM: 2 GB.
  7. In the Network settings section, select the network and subnet to connect the VM to. If you don't have a network or subnet yet, you can create them on the VM creation page.

  8. In the Public address field, keep the Automatically value so that the VM gets a random external IP address from the Yandex.Cloud pool. To keep the same external IP address after the VM is stopped, make the address static.

  9. Specify data required for accessing the VM:

    • Enter the username in the Login field.

    • In the SSH key field, paste the contents of the public key file.

      You need to create a key pair for the SSH connection yourself. Learn how to connect to VMs via SSH.

    Alert

    The IP address and host name (FQDN) for connecting to the VM will be assigned when it's created. If you selected the No address option in the Public address field, you won't be able to access the VM from the internet.

  10. Click Create VM.

Creating the VM may take several minutes.
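
The naming rules from step 2 can be checked before you fill in the form. The following is a minimal sketch, not part of the original guide: vm_name_is_valid is a hypothetical helper name, and the regular expression is one possible encoding of the three rules (3 to 63 characters, lowercase Latin letters, digits, and hyphens only, a leading letter, no trailing hyphen).

```shell
# Hypothetical helper: succeeds if NAME satisfies the VM naming rules.
vm_name_is_valid() {
    # First character: a lowercase letter.
    # Middle: 1 to 61 lowercase letters, digits, or hyphens.
    # Last character: a lowercase letter or digit (so never a hyphen).
    # LC_ALL=C keeps the [a-z] range limited to ASCII lowercase letters.
    printf '%s' "$1" | LC_ALL=C grep -Eq '^[a-z][a-z0-9-]{1,61}[a-z0-9]$'
}

if vm_name_is_valid "vision-vm"; then
    echo "vision-vm is a valid VM name"
fi
```

The function only mirrors the rules listed above; the management console remains the authority on which names are accepted.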

Configure Yandex CLI

  1. Log in to the VM over SSH.

  2. Install the YC CLI by following the instructions:

    1. Install the CLI.
    2. Initialize the CLI.
  3. Make sure that the YC CLI runs correctly:

    yc config list
    

    The command outputs the settings you specified during initialization.

Set up a service account

  1. Create a service account and assign it a name (like vision):

    yc iam service-account create --name vision --description "this is vision service account"
    
  2. Get the ID of the folder by following the instructions.

  3. Find the ID of your service account based on the folder ID:

    yc iam service-account get vision --folder-id <FOLDER-ID>
    

    The output includes a line with the service account ID:

    id: <SERVICE-ACCOUNT-ID>
    
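  4. Assign the editor role for the folder to your service account, substituting the ID from the previous step. In the command below, default is the folder name; replace it if your folder has a different name: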

    yc resource-manager folder add-access-binding default --role editor --subject serviceAccount:<SERVICE-ACCOUNT-ID>
    
  5. Create a static access key for your service account:

    yc iam access-key create --service-account-name vision --description "this key is for vision"
    

    Save the following values (they'll be used to set up the AWS CLI):

    • key_id
    • secret
  6. Get an IAM token for the service account in the CLI by following the instructions:

    yc iam key create --service-account-name vision --output key.json
    yc config profile create vision-profile
    yc config set service-account-key key.json
    yc iam create-token
    

    Save the IAM token value returned by the command yc iam create-token. You need this value to upload images to Yandex Vision.
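
Steps 3 and 4 above involve copying the service account ID by hand. As a rough sketch, the ID could instead be read from the command output with awk. Here extract_id is a hypothetical helper, and the code assumes the output contains a top-level line of the form "id: <value>", as shown in step 3.

```shell
# Hypothetical helper: prints the value of the first "id:" line read from stdin.
extract_id() {
    awk '$1 == "id:" { print $2; exit }'
}

# Intended use on the VM (left as a comment because it needs a configured yc CLI):
#   SA_ID=$(yc iam service-account get vision --folder-id <FOLDER-ID> | extract_id)
#   yc resource-manager folder add-access-binding default --role editor \
#       --subject "serviceAccount:${SA_ID}"
```

Only the line whose first field is exactly id: is matched, so other fields that merely end in id are ignored.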

Set up the AWS CLI

  1. Enable the EPEL repository:

    sudo yum install epel-release -y
    
  2. Install pip:

    sudo yum install python-pip -y
    
  3. Install the AWS CLI:

    sudo pip install awscli --upgrade
    
  4. Set up the AWS CLI:

    aws configure
    
    • AWS Access Key ID: The key_id value you saved when creating the static access key in the "Set up a service account" section.
    • AWS Secret Access Key: The secret value you saved in the same step.
    • Default region name: Enter ru-central1.
    • Default output format: Enter json.
  5. Check that the ~/.aws/credentials file contains the correct values:

    cat ~/.aws/credentials
    
  6. Check that the ~/.aws/config file contains the correct values:

    cat ~/.aws/config
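
The two checks above can also be scripted. This is a minimal sketch under the assumption that aws configure wrote the usual aws_access_key_id, aws_secret_access_key, and region keys; aws_files_look_ok is a hypothetical helper name, and the function only checks that the keys are present, not that the credentials are valid.

```shell
# Hypothetical helper: succeeds if DIR (default: ~/.aws) holds a credentials
# file with both key entries and a config file with the ru-central1 region.
aws_files_look_ok() {
    local dir="${1:-$HOME/.aws}"
    grep -qs '^aws_access_key_id' "$dir/credentials" &&
    grep -qs '^aws_secret_access_key' "$dir/credentials" &&
    grep -qs '^region *= *ru-central1' "$dir/config"
}

# Intended use on the VM:
#   aws_files_look_ok && echo "AWS CLI settings look complete"
```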
    

Set up access to Object Storage

  1. Create an Object Storage bucket by following the instructions:

    • Leave the default value for the maximum size.
    • Bucket access: Limited.
    • Storage class: Cold.
  2. Go to the Yandex.Cloud console and make sure that the bucket is in the list:

    https://console.cloud.yandex.com/folders/<FOLDER-ID>/storage
    

Create an archive with images

  1. Upload images with text to your bucket by following the instructions.

  2. Make sure that the images were uploaded:

    aws --endpoint-url=https://storage.yandexcloud.net s3 ls s3://<BUCKET-NAME>/
    

    <BUCKET-NAME>: The name of your bucket.

  3. Download the images to a directory on your virtual machine, for example, my_pictures:

    aws --endpoint-url=https://storage.yandexcloud.net s3 cp s3://<BUCKET-NAME>/ my_pictures --recursive
    
  4. Create an archive of the images and name it (for example, my_pictures.tar):

    tar -cf my_pictures.tar my_pictures/*
    
  5. Delete the image folder:

    rm -rf my_pictures
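
Because step 5 removes the only local copy of the images, it may be worth confirming that the archive is readable first. A minimal sketch, where archive_is_nonempty is a hypothetical helper name:

```shell
# Hypothetical helper: succeeds if ARCHIVE can be listed and contains
# at least one entry that is not a directory.
archive_is_nonempty() {
    local archive="$1"
    [ -n "$(tar -tf "$archive" 2>/dev/null | grep -v '/$' | head -n 1)" ]
}

# Intended use on the VM:
#   archive_is_nonempty my_pictures.tar && rm -rf my_pictures
```

With this guard, the image directory is deleted only when tar can list at least one file in the archive.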
    

Create a script for digitizing and uploading images

Preparation

  1. Install the jq package. The script will use it to process the results from Vision:

    sudo yum install jq -y
    
  2. Create the environment variables necessary for the script to run:

    export BUCKETNAME="<BUCKET-NAME>"
    export FOLDERID="<FOLDER-ID>"
    export IAMTOKEN="<IAM-TOKEN>"
    
    • BUCKETNAME: The name of your bucket.
    • FOLDERID: The ID of your folder.
    • IAMTOKEN: The IAM token you obtained in the "Set up a service account" section.
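
The script runs as a separate process, so it sees these variables only if they are exported in the current shell. The following sketch checks that before the script starts; require_vars is a hypothetical helper, not part of the original guide.

```shell
# Hypothetical helper: succeeds only if every named variable is exported
# and non-empty; reports each missing variable on stderr.
require_vars() {
    local name missing=0
    for name in "$@"; do
        # printenv prints nothing for a variable that is absent from the environment.
        if [ -z "$(printenv "$name")" ]; then
            echo "Variable $name is not set or not exported" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Intended use before running the script:
#   require_vars BUCKETNAME FOLDERID IAMTOKEN && ./vision.sh
```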

Writing a script

The script performs the following steps:

  1. Creates the necessary directories.
  2. Unpacks the archive with images.
  3. Processes images in a loop:
    1. Encodes images to be sent in a POST request to Vision.
    2. Forms the request body for the image.
    3. Sends the image to Yandex Vision for processing.
    4. Saves the result to output.json.
    5. Parses the text from output.json and saves it to a text file.
  4. Archives all the text files obtained after processing the images.
  5. Moves the digitized archive to Object Storage.
  6. Deletes any unnecessary files.

To make it easier to read, each step is commented in the script body.

  1. Create a file with any name (such as vision.sh). Open this file in a text editor (like vi):

    vi vision.sh
    
  2. Paste the script there:

    #!/bin/bash
    
    # Create a directory for the recognized text.
    echo "Creating directories..."
    mkdir -p my_pictures_text
    
    # Unpack the archive with images. The files are extracted to the my_pictures directory.
    echo "Extracting pictures to the my_pictures directory..."
    tar -xf my_pictures.tar
    
    # Digitize the images from the archive.
    # For each file in the directory, perform the following actions in a loop:
    for f in my_pictures/*
    do
        # Encode the image to base64 to upload it to the Yandex Vision server.
        # The -w 0 option turns off line wrapping in GNU base64, so the value fits in a JSON string.
        CODEIMG=$(base64 -w 0 "$f")
    
        # Create the body.json file to be uploaded in a POST request to the Yandex Vision server.
        cat <<EOF > body.json
    {
    "folderId": "$FOLDERID",
    "analyze_specs": [{
    "content": "$CODEIMG",
    "features": [{
    "type": "TEXT_DETECTION",
    "text_detection_config": {
    "language_codes": ["en","ru"]
    }
    }]
    }]
    }
    EOF
        # Send the image to the Yandex Vision server for processing and write the result to the output.json file.
        echo "Processing file $f in Vision..."
        curl -X POST --silent \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer ${IAMTOKEN}" \
        -d '@body.json' \
        https://vision.api.cloud.yandex.net/vision/v1/batchAnalyze > output.json
    
        # Get the name of the image without the extension for later use.
        IMAGE_BASE_NAME=$(basename -- "$f")
        IMAGE_NAME="${IMAGE_BASE_NAME%.*}"
    
        # Get the text data from the JSON file with the processing results and write the data to a text file with the same name as the image file, but with the .txt extension.
        jq -r '.results[].results[].textDetection.pages[].blocks[].lines[].words[].text' output.json | awk -v ORS=" " '{print}' > "my_pictures_text/${IMAGE_NAME}.txt"
    done
    
    # Archive the contents of the text file directory.
    echo "Packing text files to archive..."
    tar -cf my_pictures_text.tar my_pictures_text
    
    # Copy the text file archive to your bucket.
    echo "Sending archive to Object Storage Bucket..."
    aws --endpoint-url=https://storage.yandexcloud.net s3 cp my_pictures_text.tar "s3://${BUCKETNAME}/" > /dev/null
    
    # Delete any unnecessary files.
    echo "Cleaning up..."
    rm -f body.json
    rm -f output.json
    rm -rf my_pictures
    rm -rf my_pictures_text
    rm -f my_pictures_text.tar
    
  3. Set the permissions to run the script:

    chmod 755 vision.sh
    
  4. Run the script:

    ./vision.sh
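
The image has to reach Vision as a single-line base64 string: a raw line break inside a JSON string is not valid JSON, and GNU base64 wraps its output at 76 characters unless told otherwise. The sketch below shows one way to build the request body so that it stays on one line. The names encode_image and make_body are hypothetical helpers, and the body layout simply repeats the fields used in the script above.

```shell
# Hypothetical helper: prints FILE as base64 on a single line.
# Piping through tr removes any line wrapping that base64 adds.
encode_image() {
    base64 < "$1" | tr -d '\n'
}

# Hypothetical helper: prints a one-line request body for FOLDER_ID and FILE.
make_body() {
    printf '{"folderId": "%s", "analyze_specs": [{"content": "%s", "features": [{"type": "TEXT_DETECTION", "text_detection_config": {"language_codes": ["en", "ru"]}}]}]}\n' \
        "$1" "$(encode_image "$2")"
}

# Intended use inside the loop:
#   make_body "$FOLDERID" "$f" > body.json
```

Removing line breaks with tr works with any base64 implementation, so the same lines can be tested on a workstation before they are copied to the VM.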
    

Check the digitized content

  1. Go to Object Storage in the Yandex.Cloud console.
  2. Make sure that your bucket contains the my_pictures_text.tar archive.
  3. Download and unpack the archive.
  4. Make sure that the text in the <image name>.txt file matches the text in the image.
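
If you download the archive to a machine with a shell, steps 3 and 4 can start with a quick overview of what was recognized. A minimal sketch, where preview_texts is a hypothetical helper that unpacks the archive and prints a word count for each text file; a file with zero words suggests that no text was recognized in that image.

```shell
# Hypothetical helper: unpacks ARCHIVE into DEST and prints
# "<file name>: <N> words" for every .txt file found there.
preview_texts() {
    local archive="$1" dest="$2" f
    mkdir -p "$dest" && tar -xf "$archive" -C "$dest" || return 1
    find "$dest" -type f -name '*.txt' | sort | while IFS= read -r f; do
        printf '%s: %s words\n' "$(basename "$f")" "$(wc -w < "$f" | tr -d ' ')"
    done
}

# Intended use:
#   preview_texts my_pictures_text.tar unpacked
```

The word counts only flag empty results; comparing the text with the image still has to be done by eye.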

Delete the created cloud resources

If you no longer need the cloud resources you created to digitize the archive:

  • Delete the VM.
  • Delete the static IP address if you created one.
  • Delete the Object Storage bucket.
© 2021 Yandex.Cloud LLC