
Running Apache Hive jobs

  • Running jobs using the Yandex.Cloud CLI
  • Running jobs using the Hive CLI

Hive is a tool for accessing data storage in the Hadoop ecosystem. It lets you work with data in different formats and database management systems using an SQL-like query language, HiveQL. It is mainly used for working with data in HDFS, HBase, S3-compatible storage, and relational databases.

You can run Hive jobs from the Yandex.Cloud CLI or directly on the cluster's master host using the Hive CLI.

Running jobs using the Yandex.Cloud CLI

If you don't have the Yandex.Cloud command line interface yet, install and initialize it.

Jobs are run using the Yandex.Cloud CLI through the Data Proc agent installed on the cluster master host. Job parameters are passed to the agent through the Data Proc API.

An executable file and its dependencies must be located in storage that the Data Proc cluster service account has access to. The application being run must have access to storage where the source data set and run results are saved.

The query results are saved to the Yandex Object Storage bucket specified when the cluster was created, along with the service output.

There are two ways to send an SQL query to Hive:

  • In the run job command:

    yc dataproc job create-hive --cluster-id <cluster ID> --name <job name> --query-list "SELECT Month, COUNT(*) FROM flights GROUP BY Month;"
    
  • In an Object Storage object that the Data Proc cluster service account has read access to:

    yc dataproc job create-hive --cluster-id <cluster ID> --name <job name> --query-file-uri "s3a://<your bucket>/hive-query.sql"
    
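For the second option, the file referenced by --query-file-uri is a plain HiveQL script. A minimal sketch of preparing and uploading it (the bucket name is a placeholder, and the upload assumes the AWS CLI is configured with static keys for Yandex Object Storage):

```shell
# Write the query from the example above into a local file.
cat > hive-query.sql <<'EOF'
SELECT Month, COUNT(*) FROM flights GROUP BY Month;
EOF

# Show what will be uploaded.
cat hive-query.sql

# Upload it to a bucket the cluster service account can read
# (uncomment and substitute your bucket name):
# aws s3 cp hive-query.sql "s3://<your bucket>/hive-query.sql" \
#     --endpoint-url=https://storage.yandexcloud.net
```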

You can find the query execution results and additional diagnostic information in the Object Storage bucket that you specified when creating the cluster: s3://<your bucket>/dataproc/clusters/<cluster ID>/jobs/<job ID>/.
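As a sketch, you can assemble that prefix and inspect it with any S3-compatible client; the bucket, cluster, and job IDs below are placeholders, not real values:

```shell
# Placeholders -- substitute your own values.
BUCKET="my-bucket"
CLUSTER_ID="c9q000example"
JOB_ID="c9qjob000example"

# The prefix under which Data Proc stores this job's output.
PREFIX="dataproc/clusters/${CLUSTER_ID}/jobs/${JOB_ID}/"
echo "s3://${BUCKET}/${PREFIX}"

# List the objects with the AWS CLI configured for Object Storage
# (uncomment to run against a real bucket):
# aws s3 ls "s3://${BUCKET}/${PREFIX}" --endpoint-url=https://storage.yandexcloud.net
```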

The job ID is contained in the YC CLI job execution command output and in the API job execution response.

Running jobs using the Hive CLI

To run the Apache Hive command shell (CLI), connect to the master host over SSH and run the hive command.

Then run a Hive health check by executing the select 1; query. A successful run looks like this:

hive> select 1;
OK
1
Time taken: 0.077 seconds, Fetched: 1 row(s)
  1. Create an external table for the example data, which is stored in Parquet format. The table will contain a list of flights between US cities in 2018. Run the following query in the Hive CLI:

    CREATE EXTERNAL TABLE flights (
        Year bigint,
        Month bigint,
        FlightDate string,
        Flight_Number_Reporting_Airline bigint,
        OriginAirportID bigint,
        DestAirportID bigint
    )
    STORED AS PARQUET
    LOCATION 's3a://yc-mdb-examples/dataproc/example01/set01';
    
  2. Check the list of tables:

    show tables;
    
  3. Make sure that the list of tables looks like this:

    OK
    flights
    Time taken: 0.043 seconds, Fetched: 1 row(s)
    
  4. Request the number of flights by month:

    SELECT Month, COUNT(*) FROM flights GROUP BY Month;
    
  5. Example of the query results:

    Query ID = root_20200119195338_28049b67-4de9-4568-a4c4-3bbe500b78a1
    Total jobs = 1
    Launching Job 1 out of 1
    Status: Running (Executing on YARN cluster with App id application_1579251575762_0013)
    
    ----------------------------------------------------------------------------------------------
            VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
    ----------------------------------------------------------------------------------------------
    Map 1 .......... container     SUCCEEDED      6          6        0        0       0       0
    Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0
    ----------------------------------------------------------------------------------------------
    VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 11.52 s
    ----------------------------------------------------------------------------------------------
    OK
    1       570118
    2       520731
    3       611987
    4       596046
    5       616529
    6       626193
    7       645299
    8       644673
    9       585749
    10      616101
    11      586178
    12      593842
    Time taken: 12.137 seconds, Fetched: 12 row(s)
    
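The interactive steps above can also be collected into a single HiveQL script and run non-interactively. This sketch only writes the script to a local file; running it requires the hive binary, so the hive -f call is left commented out (the file name is a placeholder of our choosing):

```shell
# Collect the walkthrough queries into one script.
cat > flights-demo.sql <<'EOF'
CREATE EXTERNAL TABLE IF NOT EXISTS flights (
    Year bigint,
    Month bigint,
    FlightDate string,
    Flight_Number_Reporting_Airline bigint,
    OriginAirportID bigint,
    DestAirportID bigint
)
STORED AS PARQUET
LOCATION 's3a://yc-mdb-examples/dataproc/example01/set01';

SELECT Month, COUNT(*) FROM flights GROUP BY Month;
EOF

# On the master host, run the whole script in one go:
# hive -f flights-demo.sql
```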
© 2021 Yandex.Cloud LLC