
Managing PySpark jobs

In this article:
  • Create a job
  • Cancel a job
  • Get a list of jobs
  • Get general information about the job
  • Get job execution logs

Create a job

Management console
CLI
API
  1. Go to the folder page and select Data Proc.

  2. Click on the name of the cluster and open the Jobs tab.

  3. Click Submit job.

  4. (optional) Enter a name for the job.

  5. In the Job type field, select PySpark.

  6. In the Main python file field, specify the path to the main Python application file in one of the following formats (a minimal example of such a file is shown after this list):

    • Instance file system: file:///<file path>
    • Distributed cluster file system: hdfs:///<file path>
    • Object Storage bucket: s3a://<bucket name>/<file path>
    • Internet: http://<file path> or https://<file path>

    Archives in standard Linux formats, such as zip, gz, xz, bz2, and others, are supported.

    The cluster service account needs read access to all the files in the bucket. Step-by-step instructions for setting up access to Object Storage are provided in Editing a bucket ACL.

  7. (optional) Specify the paths to additional Python files, if any.

  8. Specify job arguments.

    If an argument, variable, or property consists of several space-separated parts, specify each part separately while preserving the order in which the arguments, variables, and properties are declared.

    For instance, the -mapper mapper.py argument must be passed as two separate arguments, -mapper and mapper.py, in that order.

  9. (optional) Specify the paths to JAR files, if any.

  10. (optional) Configure advanced settings:

    • Paths to the necessary files and archives.
    • Settings as key-value pairs.
  11. Click Submit job.
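
The sketch below shows what a minimal main Python file for a PySpark job might look like. It is an illustration only: the word_count.py name, the bucket, and the input and output paths are hypothetical, and the paths are passed to the script as job arguments, as described in step 8.

# word_count.py -- a minimal PySpark application (hypothetical example).
# Upload it to Object Storage and specify its path, e.g.
# s3a://<bucket name>/word_count.py, as the main Python file of the job.
import sys

from pyspark.sql import SparkSession

def main():
    # The input and output paths arrive as job arguments.
    in_path, out_path = sys.argv[1], sys.argv[2]
    spark = SparkSession.builder.appName("word-count").getOrCreate()
    # Count word occurrences in the input text and write the result as CSV.
    counts = (
        spark.read.text(in_path).rdd
        .flatMap(lambda row: row.value.split())
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)
    )
    counts.toDF(["word", "count"]).write.csv(out_path)
    spark.stop()

if __name__ == "__main__":
    main()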

If you don't have the Yandex Cloud command line interface yet, install and initialize it.

The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.

To create a job:

  1. View a description of the CLI create command for PySpark jobs:

    yc dataproc job create-pyspark --help
    
  2. Create a job (the example doesn't show all the available parameters):

    yc dataproc job create-pyspark \
      --cluster-name <cluster name> \
      --name <job name> \
      --main-python-file-uri <path to main application py file> \
      --python-file-uris <paths to additional py files> \
      --jar-file-uris <paths to jar files> \
      --archive-uris <paths to archives> \
      --properties <key-value> \
      --args <arguments passed to job> \
      --packages <Maven coordinates of jar files as groupId:artifactId:version> \
      --repositories <additional repositories to search for packages> \
      --exclude-packages <packages to exclude as groupId:artifactId>
    

    Pass the paths to the files required for the job in one of the following formats:

    • Instance file system: file:///<file path>
    • Distributed cluster file system: hdfs:///<file path>
    • Object Storage bucket: s3a://<bucket name>/<file path>
    • Internet: http://<file path> or https://<file path>

    Archives in standard Linux formats, such as zip, gz, xz, bz2, and others, are supported.

    The cluster service account needs read access to all the files in the bucket. Step-by-step instructions for setting up access to Object Storage are provided in Editing a bucket ACL.

You can get the cluster ID and name from the list of clusters in the folder.

Use the create API method and pass the following information in the request:

  • The cluster ID in the clusterId parameter.
  • The job name in the name parameter.
  • The job properties in the pysparkJob parameter.

You can get the cluster ID from the list of clusters in the folder.
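
As an illustration, here is a minimal sketch of such a request in Python using the third-party requests library. The endpoint path and the camelCase field names follow the Data Proc REST reference for the create method, but verify them against that reference; the token, cluster ID, and file paths are placeholders to substitute with your own values.

# Sketch: create a PySpark job via the REST API.
# Assumptions: the IAM token was issued, e.g., with `yc iam create-token`,
# and the endpoint matches the Data Proc REST reference.
import requests

IAM_TOKEN = "<IAM token>"
CLUSTER_ID = "<cluster ID>"

resp = requests.post(
    f"https://dataproc.api.cloud.yandex.net/dataproc/v1/clusters/{CLUSTER_ID}/jobs",
    headers={"Authorization": f"Bearer {IAM_TOKEN}"},
    json={
        "name": "pyspark-word-count",
        "pysparkJob": {
            "mainPythonFileUri": "s3a://<bucket name>/word_count.py",
            "args": ["s3a://<bucket name>/input.txt", "s3a://<bucket name>/output"],
        },
    },
)
resp.raise_for_status()
print(resp.json())  # an Operation object describing the job creation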

Cancel a job

Note

You cannot cancel jobs with the status ERROR, DONE, or CANCELLED. To find out a job's status, retrieve a list of jobs in the cluster.

Management console
CLI
API
  1. Go to the folder page and select Data Proc.
  2. Click on the name of the cluster and open the Jobs tab.
  3. Click on the name of the job.
  4. In the upper right-hand corner of the page, click Cancel and confirm the action.

If you don't have the Yandex Cloud command line interface yet, install and initialize it.

The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.

To cancel a job, run the command below:

yc dataproc job cancel <job ID or name> \
   --cluster-name=<cluster name>

You can get the job name or ID from the list of cluster jobs, and the cluster name from the list of clusters in the folder.

Use the API cancel method and pass the following in the call:

  • The cluster ID in the clusterId parameter.
  • The job ID in the jobId parameter.

You can get the cluster ID from the list of clusters in the folder, and the job ID from the list of cluster jobs.
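
A minimal sketch of the same call in Python (requests), under the same assumptions as the create example above; the :cancel suffix on the job URL follows the Yandex Cloud REST conventions, so check it against the cancel reference:

# Sketch: cancel a running job via the REST API (IDs are placeholders).
import requests

IAM_TOKEN = "<IAM token>"
CLUSTER_ID = "<cluster ID>"
JOB_ID = "<job ID>"

resp = requests.post(
    f"https://dataproc.api.cloud.yandex.net/dataproc/v1/clusters/{CLUSTER_ID}"
    f"/jobs/{JOB_ID}:cancel",
    headers={"Authorization": f"Bearer {IAM_TOKEN}"},
)
resp.raise_for_status()
print(resp.json())  # the job description with its updated status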

Get a list of jobs

Management console
CLI
API
  1. Go to the folder page and select Data Proc.
  2. Click on the name of the cluster and open the Jobs tab.

If you don't have the Yandex Cloud command line interface yet, install and initialize it.

The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.

To get a list of jobs, run the command:

yc dataproc job list --cluster-name <cluster name>

You can get the cluster ID and name from the list of clusters in the folder.

Use the list API method and pass the cluster ID in the clusterId request parameter.

You can get the cluster ID from the list of clusters in the folder.
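
For example, a minimal Python (requests) sketch, with the same placeholder token and cluster ID as in the examples above:

# Sketch: list the jobs of a cluster via the REST API.
import requests

IAM_TOKEN = "<IAM token>"
CLUSTER_ID = "<cluster ID>"

resp = requests.get(
    f"https://dataproc.api.cloud.yandex.net/dataproc/v1/clusters/{CLUSTER_ID}/jobs",
    headers={"Authorization": f"Bearer {IAM_TOKEN}"},
)
resp.raise_for_status()
# Print the ID, name, and status of each job on the first page of results.
for job in resp.json().get("jobs", []):
    print(job["id"], job.get("name"), job["status"])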

Get general information about the job

Management console
CLI
API
  1. Go to the folder page and select Data Proc.
  2. Click on the name of the cluster and open the Jobs tab.
  3. Click on the name of the job.

If you don't have the Yandex Cloud command line interface yet, install and initialize it.

The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.

To get general information about the job, run the command:

yc dataproc job get \
   --cluster-name <cluster name> \
   --name <job name>

You can get the cluster ID and name from the list of clusters in the folder.

Use the get API method and pass the following in the request:

  • The cluster ID in the clusterId parameter. You can get it from the list of clusters in the folder.
  • The job ID in the jobId parameter. You can get it from the list of cluster jobs.
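
A minimal Python (requests) sketch of this call, with placeholder IDs as in the examples above:

# Sketch: get general information about a job via the REST API.
import requests

IAM_TOKEN = "<IAM token>"
CLUSTER_ID = "<cluster ID>"
JOB_ID = "<job ID>"

resp = requests.get(
    f"https://dataproc.api.cloud.yandex.net/dataproc/v1/clusters/{CLUSTER_ID}"
    f"/jobs/{JOB_ID}",
    headers={"Authorization": f"Bearer {IAM_TOKEN}"},
)
resp.raise_for_status()
print(resp.json())  # the job description: status, timestamps, pysparkJob settings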

Get job execution logs

Note

You can view the job logs and search data in them using Yandex Cloud Logging. For more information, see Working with logs.

Management console
CLI
API
  1. Go to the folder page and select Data Proc.
  2. Click on the name of the cluster and open the Jobs tab.
  3. Click on the name of the job.

If you don't have the Yandex Cloud command line interface yet, install and initialize it.

The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.

To get job execution logs, run the command:

yc dataproc job log \
   --cluster-name <cluster name> \
   --name <job name>

You can get the cluster ID and name from the list of clusters in the folder.

Use the API listLog method and pass the following in the call:

  • The cluster ID in the clusterId parameter. You can get it from the list of clusters in the folder.
  • The job ID in the jobId parameter. You can get it from the list of cluster jobs.
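
A minimal Python (requests) sketch of this call. Note that the :logs suffix on the job URL and the content field in the response are assumptions based on the Yandex Cloud REST conventions, so verify them against the listLog reference:

# Sketch: fetch job execution logs via the REST API.
# Assumption: listLog is exposed as GET .../jobs/<job ID>:logs and returns
# a page of log content plus a nextPageToken for pagination.
import requests

IAM_TOKEN = "<IAM token>"
CLUSTER_ID = "<cluster ID>"
JOB_ID = "<job ID>"

resp = requests.get(
    f"https://dataproc.api.cloud.yandex.net/dataproc/v1/clusters/{CLUSTER_ID}"
    f"/jobs/{JOB_ID}:logs",
    headers={"Authorization": f"Bearer {IAM_TOKEN}"},
)
resp.raise_for_status()
print(resp.json().get("content", ""))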
