# Managing PySpark jobs
## Create a job

**Management console**

- Go to the folder page and select Data Proc.
- Click on the name of the cluster and open the Jobs tab.
- Click Submit job.
- (Optional) Enter a name for the job.
- In the Job type field, select `PySpark`.
- In the Main python file field, specify the path to the main PY application file in the following format:

  | File location | Path format |
  |---|---|
  | Instance file system | `file:///<file path>` |
  | Distributed cluster file system | `hdfs:///<file path>` |
  | Object Storage bucket | `s3a://<bucket name>/<file path>` |
  | Internet | `http://<file path>` or `https://<file path>` |

  Archives in standard Linux formats, such as `zip`, `gz`, `xz`, `bz2`, and others, are supported.

  The cluster service account needs read access to all the files in the bucket. Step-by-step instructions for setting up access to Object Storage are provided in Editing a bucket ACL.
- (Optional) Specify the paths to additional PY files, if any.
- Specify job arguments.

  If an argument, variable, or property consists of several space-separated parts, specify each part separately while preserving the order in which the arguments, variables, and properties are declared. For instance, the `-mapper mapper.py` argument must be split into two arguments, `-mapper` and `mapper.py`, in that order.

- (Optional) Specify the paths to JAR files, if any.
- (Optional) Configure advanced settings:
  - Paths to the required files and archives.
  - Settings as `key: value` pairs.
- Click Submit job.
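The argument-splitting rule above matches ordinary shell tokenization. A minimal sketch in plain Python (not part of the yc CLI) shows how a combined string breaks into the separate arguments you would declare:

```python
import shlex

# "-mapper mapper.py" is one string but two space-separated parts,
# so it must be declared as two separate job arguments, in this order.
argline = "-mapper mapper.py"
args = shlex.split(argline)
print(args)  # ['-mapper', 'mapper.py']
```

Each element of the resulting list corresponds to one argument field in the job submission form.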
**CLI**

If you don't have the Yandex Cloud command line interface yet, install and initialize it.

The folder specified in the CLI profile is used by default. You can specify a different folder using the `--folder-name` or `--folder-id` parameter.
To create a job:

- View the description of the CLI create command for PySpark jobs:

  ```bash
  yc dataproc job create-pyspark --help
  ```

- Create a job (the example doesn't show all the available parameters):

  ```bash
  yc dataproc job create-pyspark \
    --cluster-name <cluster name> \
    --name <job name> \
    --main-python-file-uri <path to main application py file> \
    --python-file-uris <paths to additional py files> \
    --jar-file-uris <paths to jar files> \
    --archive-uris <paths to archives> \
    --properties <key-value> \
    --args <arguments passed to job> \
    --packages <Maven coordinates of jar files as groupId:artifactId:version> \
    --repositories <additional repositories to search for packages> \
    --exclude-packages <packages to exclude as groupId:artifactId>
  ```
  Pass in the paths to the files required for the job in the following format:

  | File location | Path format |
  |---|---|
  | Instance file system | `file:///<file path>` |
  | Distributed cluster file system | `hdfs:///<file path>` |
  | Object Storage bucket | `s3a://<bucket name>/<file path>` |
  | Internet | `http://<file path>` or `https://<file path>` |

  Archives in standard Linux formats, such as `zip`, `gz`, `xz`, `bz2`, and others, are supported.

  The cluster service account needs read access to all the files in the bucket. Step-by-step instructions for setting up access to Object Storage are provided in Editing a bucket ACL.
You can find out the cluster ID and name in a list of clusters in the folder.
**API**

Use the create API method and pass the following in the request:

- The cluster ID in the `clusterId` parameter.
- The job name in the `name` parameter.
- The job properties in the `pysparkJob` parameter.
You can get the cluster ID with a list of clusters in the folder.
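As a sketch only, the request body for the create method might be assembled as below. The endpoint URL, field names, cluster ID, bucket, and file paths are illustrative assumptions based on the parameters listed above, not values confirmed by this page:

```python
import json

# Hypothetical values for illustration; substitute your own.
cluster_id = "c9q0example0id"  # assumed cluster ID format
url = f"https://dataproc.api.cloud.yandex.net/dataproc/v1/clusters/{cluster_id}/jobs"  # assumed endpoint

body = {
    "name": "my-pyspark-job",                      # the job name
    "pysparkJob": {                                # the job properties
        "mainPythonFileUri": "s3a://my-bucket/jobs/main.py",
        "args": ["-mapper", "mapper.py"],          # each space-separated part is its own element
        "properties": {"spark.executor.memory": "2g"},
    },
}
payload = json.dumps(body)
print(payload)
```

Note how the `args` list follows the same splitting rule as the CLI and console: one list element per space-separated part, in declaration order.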
## Cancel a job

Note

You cannot cancel jobs with the status `ERROR`, `DONE`, or `CANCELLED`. To find out a job's status, retrieve the list of jobs in the cluster.

**Management console**
- Go to the folder page and select Data Proc.
- Click on the name of the cluster and open the Jobs tab.
- Click on the name of the job.
- In the upper right-hand corner of the page, click Cancel and confirm the action.
**CLI**

If you don't have the Yandex Cloud command line interface yet, install and initialize it.

The folder specified in the CLI profile is used by default. You can specify a different folder using the `--folder-name` or `--folder-id` parameter.
To cancel a job, run the command below:

```bash
yc dataproc job cancel <job ID or name> \
  --cluster-name=<cluster name>
```
You can retrieve a job name or ID in the list of cluster jobs, and a cluster name in the list of folder clusters.
**API**

Use the API cancel method and pass the following in the call:

- The cluster ID in the `clusterId` parameter.
- The job ID in the `jobId` parameter.
You can retrieve a cluster ID in the list of folder clusters, and a job ID in the list of cluster jobs.
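A minimal sketch of how the two parameters above might address the cancel call, assuming a REST endpoint of the usual `clusters/{clusterId}/jobs/{jobId}` shape (the hostname, path, `:cancel` suffix, and IDs are all assumptions for illustration):

```python
# Hypothetical IDs for illustration; substitute your own.
cluster_id = "c9q0example0id"
job_id = "job0example0id"

# Assumed endpoint shape: the cancel method addressed by clusterId and jobId.
url = (
    "https://dataproc.api.cloud.yandex.net/dataproc/v1/"
    f"clusters/{cluster_id}/jobs/{job_id}:cancel"
)
print(url)
```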
## Get a list of jobs

**Management console**

- Go to the folder page and select Data Proc.
- Click on the name of the cluster and open the Jobs tab.
**CLI**

If you don't have the Yandex Cloud command line interface yet, install and initialize it.

The folder specified in the CLI profile is used by default. You can specify a different folder using the `--folder-name` or `--folder-id` parameter.

To get a list of jobs, run the command:

```bash
yc dataproc job list --cluster-name <cluster name>
```
You can find out the cluster ID and name in a list of clusters in the folder.
**API**

Use the list API method and pass the cluster ID in the `clusterId` request parameter.
You can get the cluster ID with a list of clusters in the folder.
## Get general information about the job

**Management console**

- Go to the folder page and select Data Proc.
- Click on the name of the cluster and open the Jobs tab.
- Click on the name of the job.
**CLI**

If you don't have the Yandex Cloud command line interface yet, install and initialize it.

The folder specified in the CLI profile is used by default. You can specify a different folder using the `--folder-name` or `--folder-id` parameter.

To get general information about the job, run the command:

```bash
yc dataproc job get \
  --cluster-name <cluster name> \
  --name <job name>
```
You can find out the cluster ID and name in a list of clusters in the folder.
**API**

Use the get API method and pass the following in the request:

- The cluster ID in the `clusterId` parameter. You can get it with the list of clusters in the folder.
- The job ID in the `jobId` parameter. You can retrieve it with the list of cluster jobs.
## Get job execution logs

Note

You can view the job logs and search data in them using Yandex Cloud Logging. For more information, see Working with logs.

**Management console**

- Go to the folder page and select Data Proc.
- Click on the name of the cluster and open the Jobs tab.
- Click on the name of the job.
**CLI**

If you don't have the Yandex Cloud command line interface yet, install and initialize it.

The folder specified in the CLI profile is used by default. You can specify a different folder using the `--folder-name` or `--folder-id` parameter.

To get job execution logs, run the command:

```bash
yc dataproc job log \
  --cluster-name <cluster name> \
  --name <job name>
```
You can find out the cluster ID and name in a list of clusters in the folder.
**API**

Use the listLog API method and pass the following in the call:

- The cluster ID in the `clusterId` parameter. You can retrieve it with the list of folder clusters.
- The job ID in the `jobId` parameter. You can retrieve it with the list of cluster jobs.