Integration with Apache Spark™
The DataSphere integration with Yandex Data Proc lets you run computations on Apache Spark™ clusters. Computations run in sessions created using Apache Livy.
Data Proc clusters
Setting up a DataSphere project to work with Data Proc clusters
To create Data Proc clusters from DataSphere or use existing ones, specify the following for the project:
- The service account for performing operations with Data Proc clusters.
- The subnet to create a new cluster in or to connect an existing Data Proc cluster from. For integration, you can only use subnets created in the `ru-central1-a` availability zone.
Specify these parameters in the additional project settings.
If a subnet is specified in the project settings, allocating computing resources may take longer.
Roles required for Data Proc clusters to run correctly
- To create a Data Proc cluster, you need permission to use the service account on whose behalf DataSphere will perform operations. This permission is included in the `editor` role and higher.
- To manage Data Proc clusters, the service account needs the following roles:
  - `vpc.user` to access the network specified in the project settings.
  - `mdb.all.admin` to create and use Data Proc clusters.
  - `mdb.dataproc.agent` to create and use Data Proc clusters.
Read more about access management.
Creating a cluster from a project in DataSphere
Specifics of a cluster created from a DataSphere project:
- The cluster is created in the project folder and in the subnet specified in the project settings.
- DataSphere monitors the cluster's lifetime and automatically deletes it after two hours of inactivity.
- A Data Proc cluster is considered active if computations are running on it or if there is an active notebook in the cluster's project. A notebook is considered active if the break between computations is less than 20 minutes.
Learn more about how to create a cluster from a project.
Creating a cluster in Data Proc
Specifics of a cluster created in Data Proc:
- You control the life cycle of your cluster.
- For a Data Proc cluster to run correctly, make sure its image version is at least 1.3 and the following services are enabled: LIVY, SPARK, YARN, and HDFS.
Learn more about how to create a cluster in the service.
In Data Proc clusters, your code is executed in sessions. A session stores the intermediate state until you delete the session or cluster. Each cluster has a default session. Its ID is the same as the project ID.
Use the following commands to manage sessions:
- `%create_livy_session --host $host --id $id` to create a session.
- `%delete_livy_session $id` to delete a session.
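For example, session management in a notebook might look like the sketch below. The Livy host URL and session ID here are placeholder values, not settings from this guide:

```python
# Create a Livy session on the cluster; the host and ID below are examples.
%create_livy_session --host http://10.0.0.8:8998 --id my_session

# Cells with the #!spark header can now target this session.

# Delete the session once the computation is done so its
# intermediate state no longer occupies cluster resources.
%delete_livy_session my_session
```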
Running Python code
Code is run in cells with the header:
`#!spark [--cluster <cluster>] [--session <session>] [--variables <variable>]`
`<cluster>` is the Data Proc cluster to perform computations on. This can be:
- An HTTP link to the Livy server, such as `http://10.0.0.8:8998`.
- The name of the cluster created through the notebook interface.
- The Data Proc cluster from the project settings in the management console, if the parameter is omitted.
`<session>` is the computing session ID. If this parameter is omitted, the default Data Proc cluster session is used.
`<variable>` is a variable imported into the cell from the notebook kernel. Supported types:
- `pandas.DataFrame` (converted to a Spark DataFrame).
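Putting the header options together, a cell that ships a pandas DataFrame to the cluster might look like the sketch below. It assumes a pandas DataFrame named `df` with a `city` column already exists in the notebook kernel; the cluster and session names are placeholders:

```python
#!spark --cluster my_cluster --session my_session --variables df
# Inside the cell, df has been converted to a Spark DataFrame,
# so Spark transformations and actions apply to it directly.
df.printSchema()

# An aggregation executed on the cluster (the `city` column
# is assumed here for illustration).
counts = df.groupBy("city").count()
counts.show(10)
```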