
Managing PySpark jobs

In this article:
  • Create a job
  • Cancel a job
  • Get a list of jobs
  • Get general information about the job
  • Get job execution logs

Create a job

Management console
CLI
API
  1. Go to the folder page and select Data Proc.

  2. Click on the name of the cluster and open the Jobs tab.

  3. Click Submit job.

  4. (optional) Enter a name for the job.

  5. In the Job type field, select PySpark.

  6. In the Main python file field, specify the path to the main Python application file in one of the following formats (a minimal example of such a file is shown after this list):

    • Instance file system: file:///<file path>
    • Distributed cluster file system: hdfs:///<file path>
    • Object Storage bucket: s3a://<bucket name>/<file path>
    • Internet: http://<file path> or https://<file path>

    Archives in standard Linux formats, such as zip, gz, xz, bz2, and others, are supported.

    The cluster service account needs read access to all the files in the bucket. Step-by-step instructions for setting up access to Object Storage are provided in Editing a bucket ACL.

  7. (optional) Specify the paths to additional Python files, if any.

  8. Specify job arguments.

    If an argument, variable, or property consists of several space-separated parts, specify each part separately while preserving the order in which the arguments, variables, and properties are declared.

    For instance, the -mapper mapper.py argument must be passed as two separate arguments, -mapper and mapper.py, in that order.

  9. (optional) Specify the paths to JAR files, if any.

  10. (optional) Configure advanced settings:

    • Paths to the necessary files and archives.
    • Settings as key-value pairs.
  11. Click Submit job.
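
The sketch below shows what a minimal main Python file for a PySpark job might look like. It is an illustration only: the word_count.py name, the bucket, and the input and output paths are hypothetical, and the paths are passed to the script as job arguments, as described in step 8.

# word_count.py -- a minimal PySpark application (hypothetical example).
# Upload it to Object Storage and specify its path, e.g.
# s3a://<bucket name>/word_count.py, as the main Python file of the job.
import sys

from pyspark.sql import SparkSession

def main():
    # The input and output paths arrive as job arguments.
    in_path, out_path = sys.argv[1], sys.argv[2]
    spark = SparkSession.builder.appName("word-count").getOrCreate()
    # Count word occurrences in the input text and write the result as CSV.
    counts = (
        spark.read.text(in_path).rdd
        .flatMap(lambda row: row.value.split())
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)
    )
    counts.toDF(["word", "count"]).write.csv(out_path)
    spark.stop()

if __name__ == "__main__":
    main()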

If you don't have the Yandex Cloud command line interface yet, install and initialize it.

The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.

To create a job:

  1. View a description of the CLI create command for PySpark jobs:

    yc dataproc job create-pyspark --help
    
  2. Create a job (the example doesn't show all the available parameters):

    yc dataproc job create-pyspark \
      --cluster-name <cluster name> \
      --name <job name> \
      --main-python-file-uri <path to main application py file> \
      --python-file-uris <paths to additional py files> \
      --jar-file-uris <paths to jar files> \
      --archive-uris <paths to archives> \
      --properties <key-value> \
      --args <arguments passed to job> \
      --packages <Maven coordinates of jar files as groupId:artifactId:version> \
      --repositories <additional repositories to search for packages> \
      --exclude-packages <packages to exclude as groupId:artifactId>
    

    Pass the paths to the files required for the job in one of the following formats:

    • Instance file system: file:///<file path>
    • Distributed cluster file system: hdfs:///<file path>
    • Object Storage bucket: s3a://<bucket name>/<file path>
    • Internet: http://<file path> or https://<file path>

    Archives in standard Linux formats, such as zip, gz, xz, bz2, and others, are supported.

    The cluster service account needs read access to all the files in the bucket. Step-by-step instructions for setting up access to Object Storage are provided in Editing a bucket ACL.

You can get the cluster ID and name from the list of clusters in the folder.

Use the create API method and pass the following information in the request:

  • The cluster ID in the clusterId parameter.
  • The job name in the name parameter.
  • The job properties in the pysparkJob parameter.

You can get the cluster ID from the list of clusters in the folder.
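
As an illustration, here is a minimal sketch of such a request in Python using the third-party requests library. The endpoint path and the camelCase field names follow the Data Proc REST reference for the create method, but verify them against that reference; the token, cluster ID, and file paths are placeholders to substitute with your own values.

# Sketch: create a PySpark job via the REST API.
# Assumptions: the IAM token was issued, e.g., with `yc iam create-token`,
# and the endpoint matches the Data Proc REST reference.
import requests

IAM_TOKEN = "<IAM token>"
CLUSTER_ID = "<cluster ID>"

resp = requests.post(
    f"https://dataproc.api.cloud.yandex.net/dataproc/v1/clusters/{CLUSTER_ID}/jobs",
    headers={"Authorization": f"Bearer {IAM_TOKEN}"},
    json={
        "name": "pyspark-word-count",
        "pysparkJob": {
            "mainPythonFileUri": "s3a://<bucket name>/word_count.py",
            "args": ["s3a://<bucket name>/input.txt", "s3a://<bucket name>/output"],
        },
    },
)
resp.raise_for_status()
print(resp.json())  # an Operation object describing the job creation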

Cancel a job

Note

You cannot cancel jobs with the status ERROR, DONE, or CANCELLED. To find out a job's status, retrieve a list of jobs in the cluster.

Management console
CLI
API
  1. Go to the folder page and select Data Proc.
  2. Click on the name of the cluster and open the Jobs tab.
  3. Click on the name of the job.
  4. In the upper right-hand corner of the page, click Cancel and confirm the action.

If you don't have the Yandex Cloud command line interface yet, install and initialize it.

The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.

To cancel a job, run the command below:

yc dataproc job cancel <job ID or name> \
   --cluster-name=<cluster name>

You can get the job name or ID from the list of cluster jobs, and the cluster name from the list of clusters in the folder.

Use the API cancel method and pass the following in the call:

  • The cluster ID in the clusterId parameter.
  • The job ID in the jobId parameter.

You can get the cluster ID from the list of clusters in the folder, and the job ID from the list of cluster jobs.
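
A minimal sketch of the same call in Python (requests), under the same assumptions as the create example above; the :cancel suffix on the job URL follows the Yandex Cloud REST conventions, so check it against the cancel reference:

# Sketch: cancel a running job via the REST API (IDs are placeholders).
import requests

IAM_TOKEN = "<IAM token>"
CLUSTER_ID = "<cluster ID>"
JOB_ID = "<job ID>"

resp = requests.post(
    f"https://dataproc.api.cloud.yandex.net/dataproc/v1/clusters/{CLUSTER_ID}"
    f"/jobs/{JOB_ID}:cancel",
    headers={"Authorization": f"Bearer {IAM_TOKEN}"},
)
resp.raise_for_status()
print(resp.json())  # the job description with its updated status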

Get a list of jobs

Management console
CLI
API
  1. Go to the folder page and select Data Proc.
  2. Click on the name of the cluster and open the Jobs tab.

If you don't have the Yandex Cloud command line interface yet, install and initialize it.

The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.

To get a list of jobs, run the command:

yc dataproc job list --cluster-name <cluster name>

You can get the cluster ID and name from the list of clusters in the folder.

Use the list API method and pass the cluster ID in the clusterId request parameter.

You can get the cluster ID from the list of clusters in the folder.
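
For example, a minimal Python (requests) sketch, with the same placeholder token and cluster ID as in the examples above:

# Sketch: list the jobs of a cluster via the REST API.
import requests

IAM_TOKEN = "<IAM token>"
CLUSTER_ID = "<cluster ID>"

resp = requests.get(
    f"https://dataproc.api.cloud.yandex.net/dataproc/v1/clusters/{CLUSTER_ID}/jobs",
    headers={"Authorization": f"Bearer {IAM_TOKEN}"},
)
resp.raise_for_status()
# Print the ID, name, and status of each job on the first page of results.
for job in resp.json().get("jobs", []):
    print(job["id"], job.get("name"), job["status"])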

Get general information about the job

Management console
CLI
API
  1. Go to the folder page and select Data Proc.
  2. Click on the name of the cluster and open the Jobs tab.
  3. Click on the name of the job.

If you don't have the Yandex Cloud command line interface yet, install and initialize it.

The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.

To get general information about the job, run the command:

yc dataproc job get \
   --cluster-name <cluster name> \
   --name <job name>

You can get the cluster ID and name from the list of clusters in the folder.

Use the get API method and pass the following in the request:

  • The cluster ID in the clusterId parameter. You can get it from the list of clusters in the folder.
  • The job ID in the jobId parameter. You can get it from the list of cluster jobs.
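
A minimal Python (requests) sketch of this call, with placeholder IDs as in the examples above:

# Sketch: get general information about a job via the REST API.
import requests

IAM_TOKEN = "<IAM token>"
CLUSTER_ID = "<cluster ID>"
JOB_ID = "<job ID>"

resp = requests.get(
    f"https://dataproc.api.cloud.yandex.net/dataproc/v1/clusters/{CLUSTER_ID}"
    f"/jobs/{JOB_ID}",
    headers={"Authorization": f"Bearer {IAM_TOKEN}"},
)
resp.raise_for_status()
print(resp.json())  # the job description: status, timestamps, pysparkJob settings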

Get job execution logs

Note

You can view the job logs and search data in them using Yandex Cloud Logging. For more information, see Working with logs.

Management console
CLI
API
  1. Go to the folder page and select Data Proc.
  2. Click on the name of the cluster and open the Jobs tab.
  3. Click on the name of the job.

If you don't have the Yandex Cloud command line interface yet, install and initialize it.

The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.

To get job execution logs, run the command:

yc dataproc job log \
   --cluster-name <cluster name> \
   --name <job name>

You can get the cluster ID and name from the list of clusters in the folder.

Use the API listLog method and pass the following in the call:

  • The cluster ID in the clusterId parameter. You can get it from the list of clusters in the folder.
  • The job ID in the jobId parameter. You can get it from the list of cluster jobs.
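
A minimal Python (requests) sketch of this call. Note that the :logs suffix on the job URL and the content field in the response are assumptions based on the Yandex Cloud REST conventions, so verify them against the listLog reference:

# Sketch: fetch job execution logs via the REST API.
# Assumption: listLog is exposed as GET .../jobs/<job ID>:logs and returns
# a page of log content plus a nextPageToken for pagination.
import requests

IAM_TOKEN = "<IAM token>"
CLUSTER_ID = "<cluster ID>"
JOB_ID = "<job ID>"

resp = requests.get(
    f"https://dataproc.api.cloud.yandex.net/dataproc/v1/clusters/{CLUSTER_ID}"
    f"/jobs/{JOB_ID}:logs",
    headers={"Authorization": f"Bearer {IAM_TOKEN}"},
)
resp.raise_for_status()
print(resp.json().get("content", ""))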
