Working with MapReduce jobs
MapReduce is a parallel processing tool for large datasets (on the order of several dozen TB) on clusters in the Hadoop ecosystem. It can process data in a variety of formats. Job input and output are stored in Yandex Object Storage.
This article walks through a simple example of using MapReduce in Data Proc: with MapReduce, we compute the total population of the 500 largest cities in the world from a dataset on those cities.
To run MapReduce on Hadoop, we use the Streaming interface. With it, the data preprocessing (map) and final output computation (reduce) stages use programs that read from standard input (stdin) and write to standard output (stdout).
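As a rough sketch of this contract (illustrative only, not part of the job built below), a Streaming stage is just a script of the following shape; the actual map and reduce scripts used in this article appear in the job creation steps.

    import sys

    # Minimal shape of a Hadoop Streaming stage: read records line by line from
    # standard input and write the stage's result to standard output. The logic
    # here is only a placeholder (it counts input records); the real map and
    # reduce scripts below follow the same pattern.
    def process(lines):
        return sum(1 for _ in lines)

    if __name__ == "__main__":
        print(process(sys.stdin))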
Before you start
- Create a service account with the mdb.dataproc.agent role.
- In Object Storage, create buckets and configure access to them (one way to create the buckets from a script is sketched at the end of this section):
  - Create a bucket for the input data and grant the cluster service account READ permissions for this bucket.
  - Create a bucket for the processing output and grant the cluster service account READ and WRITE permissions for this bucket.
- Create a Data Proc cluster with the following configuration:
  - Services: HDFS, MAPREDUCE, YARN.
  - Service account: Select the service account with the mdb.dataproc.agent role that you created earlier.
  - Bucket name: Select the bucket to hold the processing output.
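If you prefer to script the bucket setup, the sketch below creates the two buckets through the Object Storage S3-compatible API with boto3. The bucket names are hypothetical and the credentials are assumed to be configured already; granting the cluster service account READ and READ-and-WRITE permissions is still a separate step (for example, in the management console).

    import boto3

    # Illustrative sketch: create the input and output buckets via the
    # S3-compatible API. Static access keys are expected in the environment or
    # in ~/.aws/credentials; the bucket names are hypothetical.
    s3 = boto3.client("s3", endpoint_url="https://storage.yandexcloud.net")

    s3.create_bucket(Bucket="mapreduce-input-data")    # bucket for the input data
    s3.create_bucket(Bucket="mapreduce-output-data")   # bucket for the processing output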
Create a MapReduce job
- For the input data, download the archived CSV file with the dataset on the cities and upload it to the input data bucket.
- Upload to the input data bucket the Python files mapper.py, which contains the code for the data preprocessing (map) stage, and reducer.py, which contains the code for the final output computation (reduce) stage; a sketch of scripting these uploads appears after this list:

  mapper.py:

      import sys

      # Sum the population column (field index 14 when splitting on tabs)
      # over all input lines and print one partial sum per map task.
      population = sum(int(line.split('\t')[14]) for line in sys.stdin)
      print(population)

  reducer.py:

      import sys

      # Sum the partial sums produced by the map tasks into the final total.
      population = sum(int(value) for value in sys.stdin)
      print(population)
- Create a MapReduce job with the following parameters (a rough manual hadoop-streaming equivalent of these parameters is sketched after this list):
  - Main class: org.apache.hadoop.streaming.HadoopStreaming
  - Job arguments:
    - -mapper mapper.py
    - -reducer reducer.py
    - -numReduceTasks 1
    - -input s3a://<input data bucket name>/cities500.txt.bz2
    - -output s3a://<output processing bucket name>/<output folder>
  - Files:
    - s3a://<input data bucket name>/mapper.py
    - s3a://<input data bucket name>/reducer.py
  - Settings:
    - mapreduce.job.maps: 6
    - yarn.app.mapreduce.am.resource.mb: 2048
    - yarn.app.mapreduce.am.command-opts: -Xmx2048m
- Wait for the job status to change to Done.
- Download and view the file with the processing output (one way to fetch it from the bucket is sketched below):

  part-00000:

      3157107417
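The uploads of the dataset and the Python scripts can also be scripted. The sketch below pushes all three files to the input data bucket through the S3-compatible API with boto3; it assumes local copies of the files and configured Object Storage credentials, and the bucket placeholder matches the one used in the job arguments.

    import boto3

    # Illustrative sketch: upload the dataset and the map/reduce scripts to the
    # input data bucket via the S3-compatible API. The object keys match the
    # s3a:// paths referenced by the job.
    s3 = boto3.client("s3", endpoint_url="https://storage.yandexcloud.net")

    bucket = "<input data bucket name>"
    for name in ("cities500.txt.bz2", "mapper.py", "reducer.py"):
        s3.upload_file(name, bucket, name)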
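To see how the job parameters relate to the Streaming interface, the sketch below assembles roughly the equivalent manual hadoop-streaming invocation. It is for orientation only: Data Proc builds and runs the actual command when the job is created, and the streaming jar path shown here is an assumption that may differ between images.

    import subprocess

    # Rough manual equivalent of the job parameters above (illustrative only).
    # Generic options (-D, -files) must come before the streaming options.
    streaming_jar = "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar"  # assumed path

    cmd = [
        "hadoop", "jar", streaming_jar,
        "-D", "mapreduce.job.maps=6",
        "-D", "yarn.app.mapreduce.am.resource.mb=2048",
        "-D", "yarn.app.mapreduce.am.command-opts=-Xmx2048m",
        # Ship the scripts into each task's working directory.
        "-files", "s3a://<input data bucket name>/mapper.py,"
                  "s3a://<input data bucket name>/reducer.py",
        "-mapper", "mapper.py",
        "-reducer", "reducer.py",
        "-numReduceTasks", "1",
        "-input", "s3a://<input data bucket name>/cities500.txt.bz2",
        "-output", "s3a://<output processing bucket name>/<output folder>",
    ]
    subprocess.run(cmd, check=True)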
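The output can also be fetched through the S3-compatible API. The sketch below downloads part-00000 with boto3 and prints it; the object key assumes the default part-file name written by the single reduce task into the output folder given in the job arguments.

    import boto3

    # Illustrative sketch: download the reduce output via the S3-compatible API
    # and print it. The placeholders match the job's -output argument.
    s3 = boto3.client("s3", endpoint_url="https://storage.yandexcloud.net")

    s3.download_file(
        "<output processing bucket name>",
        "<output folder>/part-00000",
        "part-00000",                      # local file name
    )

    with open("part-00000") as f:
        print(f.read().strip())            # the total population figure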
Delete the resources you created
If you no longer need these resources, delete them: