Working with MapReduce jobs

  • Before you start
  • Create a MapReduce job
  • Delete the resources you created

MapReduce is a parallel processing tool for large datasets (on the order of several dozen terabytes) that runs on clusters in the Hadoop ecosystem. It can process data in a variety of formats. Job input and output are stored in Yandex Object Storage.

In this article, a simple example shows how MapReduce is used in Data Proc: we compute the total population of the 500 largest cities in the world from a dataset of city records.

To run MapReduce on Hadoop, we use the Streaming interface: the data preprocessing (map) and final output computation (reduce) stages are implemented as programs that read from standard input (stdin) and write to standard output (stdout).
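
Before submitting anything to the cluster, you can sanity-check the same map and reduce logic locally by wiring the two scripts together over stdin and stdout. The sketch below is only an illustration and is not part of the Data Proc job (the file name local_check.py is arbitrary): it assumes that mapper.py and reducer.py (shown later in this article) and the archived dataset cities500.txt.bz2 have been downloaded to the current directory, and that python3 is on the PATH.

    local_check.py

    import bz2
    import subprocess
    
    # Decompress the archived dataset and feed it to the mapper over stdin,
    # the same way Hadoop Streaming does on the cluster.
    with bz2.open("cities500.txt.bz2", "rt", encoding="utf-8") as src:
        mapper = subprocess.run(
            ["python3", "mapper.py"],
            input=src.read(),
            capture_output=True,
            text=True,
            check=True,
        )
    
    # Pipe the mapper output into the reducer, again over stdin.
    reducer = subprocess.run(
        ["python3", "reducer.py"],
        input=mapper.stdout,
        capture_output=True,
        text=True,
        check=True,
    )
    
    # The reducer prints the total population to stdout.
    print(reducer.stdout.strip())

On the cluster, the only differences are that Hadoop splits the input across several map tasks (six in the job settings below) and runs a single reducer over their partial sums.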

Before you start

  1. Create a service account with the mdb.dataproc.agent role.

  2. In Object Storage, create buckets and configure access to them (see the sketch after this list for doing the same with an S3-compatible client):

    1. Create a bucket for the input data and grant the cluster service account READ permissions for this bucket.
    2. Create a bucket for the processing output and grant the cluster service account READ and WRITE permissions for this bucket.
  3. Create a Data Proc cluster with the following configuration:

    • Services:
      • HDFS
      • MAPREDUCE
      • YARN
    • Service account: Select the service account with the mdb.dataproc.agent role you created earlier.
    • Bucket name: Select a bucket to hold the processing output.
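
The buckets from step 2 can also be created programmatically: Object Storage exposes an S3-compatible API. The sketch below is a minimal, hedged example (the file name create_buckets.py is arbitrary) that assumes boto3 is installed and that you have static access keys for an account with storage permissions; the bucket names are the same placeholders used in the job arguments later in this article. Granting the cluster service account READ and READ and WRITE permissions is still configured as described in step 2.

    create_buckets.py

    import boto3
    
    # Object Storage speaks the S3 protocol at this endpoint; the static access
    # keys below are placeholders.
    s3 = boto3.client(
        "s3",
        endpoint_url="https://storage.yandexcloud.net",
        aws_access_key_id="<static key ID>",
        aws_secret_access_key="<secret key>",
    )
    
    # One bucket for the input data and one for the processing output.
    s3.create_bucket(Bucket="<input data bucket name>")
    s3.create_bucket(Bucket="<output processing bucket name>")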

Create a MapReduce job

  1. For the input data, download an archived CSV file with the city dataset and upload it to the input data bucket (a sketch for scripting the upload follows this list).

  2. Upload Python files to the input data bucket: mapper.py, which contains the code for data preprocessing (map stage), and reducer.py, which contains the code for the final output computation (reduce stage):

    mapper.py

    import sys
    
    # Sum the population column (tab-separated field 14) over all lines received
    # on stdin and emit a single partial sum to stdout.
    population = sum(int(line.split('\t')[14]) for line in sys.stdin)
    print(population)
    

    reducer.py

    import sys
    
    # Sum the partial results produced by the mappers (one number per line on
    # stdin) and print the total population to stdout.
    population = sum(int(value) for value in sys.stdin)
    print(population)
    
  3. Create a MapReduce job with the following parameters (the same job can also be submitted through the API; see the sketch after this list):

    • Main class: org.apache.hadoop.streaming.HadoopStreaming
    • Job arguments:
      • -mapper
      • mapper.py
      • -reducer
      • reducer.py
      • -numReduceTasks
      • 1
      • -input
      • s3a://<input data bucket name>/cities500.txt.bz2
      • -output
      • s3a://<output processing bucket name>/<output folder>
    • Files:
      • s3a://<input data bucket name>/mapper.py
      • s3a://<input data bucket name>/reducer.py
    • Settings:
      • mapreduce.job.maps: 6
      • yarn.app.mapreduce.am.resource.mb: 2048
      • yarn.app.mapreduce.am.command-opts: -Xmx2048m
  4. Wait for the job status to change to Done.

  5. Download and view the file with the processing output:

    part-00000

    3157107417
    
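
Steps 1, 2, and 5 above move files between the local machine and Object Storage. The hedged sketch below (the file name transfer_files.py is arbitrary) does the same transfers with boto3 against the S3-compatible endpoint, reusing the placeholder bucket names and output folder from the job arguments; run the download part only after the job status is Done.

    transfer_files.py

    import boto3
    
    # Same S3-compatible client setup as in the bucket-creation sketch above;
    # the access keys are placeholders.
    s3 = boto3.client(
        "s3",
        endpoint_url="https://storage.yandexcloud.net",
        aws_access_key_id="<static key ID>",
        aws_secret_access_key="<secret key>",
    )
    
    # Steps 1-2: upload the archived dataset and the two streaming scripts.
    for name in ("cities500.txt.bz2", "mapper.py", "reducer.py"):
        s3.upload_file(name, "<input data bucket name>", name)
    
    # Step 5 (after the job is Done): fetch and print the processing output.
    s3.download_file(
        "<output processing bucket name>", "<output folder>/part-00000", "part-00000"
    )
    with open("part-00000") as result:
        print(result.read())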
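
The job from step 3 can also be submitted through the API rather than the management console. The sketch below is an assumption-heavy illustration of the Job.create REST method listed in the API reference: the endpoint host, the mapreduceJob field names, the job name, and the <cluster ID> placeholder are assumptions to verify against the Job.create page before use, and an IAM token is expected in the IAM_TOKEN environment variable.

    submit_job.py

    import os
    
    import requests
    
    # Assumed REST endpoint for Job.create; <cluster ID> is a placeholder.
    url = "https://dataproc.api.cloud.yandex.net/dataproc/v1/clusters/<cluster ID>/jobs"
    
    # Mirrors the parameters from step 3; the field names are assumptions based
    # on the Job.create reference.
    body = {
        "name": "mapreduce-cities",  # arbitrary job name
        "mapreduceJob": {
            "mainClass": "org.apache.hadoop.streaming.HadoopStreaming",
            "fileUris": [
                "s3a://<input data bucket name>/mapper.py",
                "s3a://<input data bucket name>/reducer.py",
            ],
            "args": [
                "-mapper", "mapper.py",
                "-reducer", "reducer.py",
                "-numReduceTasks", "1",
                "-input", "s3a://<input data bucket name>/cities500.txt.bz2",
                "-output", "s3a://<output processing bucket name>/<output folder>",
            ],
            "properties": {
                "mapreduce.job.maps": "6",
                "yarn.app.mapreduce.am.resource.mb": "2048",
                "yarn.app.mapreduce.am.command-opts": "-Xmx2048m",
            },
        },
    }
    
    # The API authenticates with an IAM token in the Authorization header.
    response = requests.post(
        url,
        json=body,
        headers={"Authorization": f"Bearer {os.environ['IAM_TOKEN']}"},
    )
    response.raise_for_status()
    print(response.json())  # an Operation object describing the job creation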

Delete the resources you created

If you no longer need these resources, delete them:

  1. Delete the cluster.
  2. Delete buckets.
  3. Delete the service account.
