Running Apache Hive jobs
Hive is a tool for accessing data stored in the Hadoop ecosystem. It lets you work with data in different formats and DBMSs using an SQL-like query language. It's mainly used for working with data in HDFS, HBase, S3-compatible storage, and relational databases.
You can run Hive jobs from the Yandex.Cloud CLI or directly on the cluster's master host using the Hive CLI.
Running jobs using the Yandex.Cloud CLI
If you don't have the Yandex.Cloud command line interface yet, install and initialize it.
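On Linux or macOS, this typically comes down to two commands; a minimal sketch, assuming the standard install script from the Yandex.Cloud documentation (yc init then walks you through authentication and profile setup):
curl -sSL https://storage.yandexcloud.net/yandexcloud-yc/install.sh | bash
yc init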
Jobs are run using the Yandex.Cloud CLI through the Data Proc agent installed on the cluster master host. Job parameters are passed to the agent through the Data Proc API.
The executable file and its dependencies must be located in storage that the Data Proc cluster's service account has access to. The application being run must itself have access to the storage where the source data set and the run results are kept.
The calculation results are saved to the Yandex Object Storage bucket together with the service output.
There are two ways to send an SQL query to Hive:
- Directly in the command that runs the job:
  yc dataproc job create-hive \
    --cluster-id <cluster ID> \
    --name <job name> \
    --query-list "SELECT Month, COUNT(*) FROM flights GROUP BY Month;"
- In an Object Storage object that the Data Proc cluster service account has read access to:
  yc dataproc job create-hive \
    --cluster-id <cluster ID> \
    --name <job name> \
    --query-file-uri "s3a://<your bucket>/hive-query.sql"
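For the second option, the query file has to be uploaded to the bucket first. A minimal sketch using the AWS CLI as an S3-compatible client (it assumes a static access key for the service account is already configured; the bucket name is a placeholder):
# Save the query to a local file
cat > hive-query.sql <<'EOF'
SELECT Month, COUNT(*) FROM flights GROUP BY Month;
EOF

# Object Storage is S3-compatible, so the AWS CLI only needs a custom endpoint
aws --endpoint-url=https://storage.yandexcloud.net \
    s3 cp hive-query.sql s3://<your bucket>/hive-query.sql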
You can find the query execution results and additional diagnostic information in the Object Storage bucket that you specified when creating the cluster: s3://<your bucket>/dataproc/clusters/<cluster ID>/jobs/<job ID>/.
The job ID is contained in the YC CLI job execution command output and in the API job execution response.
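If you need to look a job up later, you can list the cluster's jobs and then browse what a given job left in the bucket. A sketch, assuming the yc dataproc job list subcommand available in current CLI versions and the AWS CLI with the Object Storage endpoint:
# List jobs with their IDs and statuses
yc dataproc job list --cluster-id <cluster ID>

# Browse the output a specific job wrote to the bucket
aws --endpoint-url=https://storage.yandexcloud.net \
    s3 ls s3://<your bucket>/dataproc/clusters/<cluster ID>/jobs/<job ID>/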
Running jobs using the Hive CLI
To run the Apache Hive command shell (CLI), connect to the cluster's master host over SSH and run the hive command.
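A minimal sketch of the connection (the user name depends on the SSH settings you chose at cluster creation; <user name> and <master host FQDN> are placeholders):
ssh <user name>@<master host FQDN>
hive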
Then run a Hive health check by executing the select 1; command. The correct result looks like this:
hive> select 1;
OK
1
Time taken: 0.077 seconds, Fetched: 1 row(s)
- Create an external table for the data from the example in Parquet format. The table will contain a list of flights between US cities in 2018. Run the following query in the Hive CLI:
  CREATE EXTERNAL TABLE flights (
      Year bigint,
      Month bigint,
      FlightDate string,
      Flight_Number_Reporting_Airline bigint,
      OriginAirportID bigint,
      DestAirportID bigint
  )
  STORED AS PARQUET
  LOCATION 's3a://yc-mdb-examples/dataproc/example01/set01';
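  To double-check that Hive picked up the schema, you can also describe the table (plain HiveQL; an optional sanity check, not a required step):
  DESCRIBE flights;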
- Check the list of tables:
  show tables;
  The list of tables should look like this:
  OK
  flights
  Time taken: 0.043 seconds, Fetched: 1 row(s)
- Request the number of flights by month:
  SELECT Month, COUNT(*) FROM flights GROUP BY Month;
  Example of the query results:
  Query ID = root_20200119195338_28049b67-4de9-4568-a4c4-3bbe500b78a1
  Total jobs = 1
  Launching Job 1 out of 1
  Status: Running (Executing on YARN cluster with App id application_1579251575762_0013)

  ----------------------------------------------------------------------------------------------
          VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
  ----------------------------------------------------------------------------------------------
  Map 1 ..........      container     SUCCEEDED      6          6        0        0       0       0
  Reducer 2 ......      container     SUCCEEDED      1          1        0        0       0       0
  ----------------------------------------------------------------------------------------------
  VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 11.52 s
  ----------------------------------------------------------------------------------------------
  OK
  1       570118
  2       520731
  3       611987
  4       596046
  5       616529
  6       626193
  7       645299
  8       644673
  9       585749
  10      616101
  11      586178
  12      593842
  Time taken: 12.137 seconds, Fetched: 12 row(s)
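Note that GROUP BY does not guarantee the order of output rows. If you want the months listed in order, sort explicitly (plain HiveQL, a minor variation on the query above):
SELECT Month, COUNT(*) FROM flights GROUP BY Month ORDER BY Month;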