Setting up computations on Apache Spark™ clusters
To start computing on Apache Spark™ clusters, prepare a DataSphere project. To run computations, create a new Data Proc cluster or use an existing one.
Set up a project to work with a Data Proc cluster
To set up your project:
- In the management console, open the DataSphere section in the folder where you work with DataSphere projects.
- Click in the line with the project you want to set up.
- In the window that opens, click Edit.
- In the window that opens, click Additional settings.
- In the Service account field, select a service account with the appropriate roles.
- In the Subnet field, specify the subnet where new Data Proc clusters will be created or existing ones will be used.
Note
If you specified a subnet in the project settings, it may take longer to allocate computing resources.
Create a Data Proc cluster
Before creating a cluster, make sure you have sufficient resources in your cloud. You can check this in the management console under Quotas.
Create a cluster through the notebook interface in DataSphere
DataSphere monitors the lifetime of the cluster you created and automatically deletes it after two hours of inactivity.
To create a cluster using the notebook interface:
- In the management console, open the DataSphere section in the folder where you work with DataSphere projects.
- Select the project to create a Data Proc cluster for.
- In the top panel, click File and select Data Proc Clusters.
- In the Create new cluster section of the window that opens:
- In the Name field, enter the Data Proc cluster name.
- In the Size list, select the Data Proc cluster configuration.
- Click Create.
Information about the status of the created cluster will be displayed in the same window.
Data Proc cluster statuses
Once created, a cluster can have the following statuses:
- STARTING: The cluster is being created.
- UP: The cluster is created and ready to run computations.
- DOWN: There was a problem creating the cluster.
Create a cluster in Data Proc
You manage the life cycle of a manually created cluster yourself. To ensure correct integration, create the cluster with the following parameters:
- Version: 1.3 or higher.
- Enabled services: LIVY, SPARK, YARN, and HDFS.
- Cluster availability zone: ru-central1-a.
- In the management console, select the folder where you want to create a cluster.
- Click Create resource and select Data Proc cluster from the drop-down list.
- Enter the cluster name in the Cluster name field. The cluster name must be unique within the folder.
- In the Version field, select 1.3.
- In the Services field, select LIVY, SPARK, YARN, and HDFS.
- Enter the public part of your SSH key in the Public key field. For information about how to generate and use SSH keys, see the Yandex Compute Cloud documentation.
- Select or create a service account that you want to grant access to the cluster.
- In the Availability zone field, choose ru-central1-a.
- If necessary, set the properties of Hadoop and its components, for example:
```
hdfs:dfs.replication : 2
hdfs:dfs.blocksize : 1073741824
spark:spark.driver.cores : 1
```
The available properties are listed in the official documentation for the respective components. You can check the effective Spark settings from a notebook; see the sketch after the setup steps below.
- Select or create a network for the cluster.
- Enable the UI Proxy option to access the web interfaces of Data Proc components.
- Configure subclusters: no more than one main subcluster with a Master host, plus subclusters for data storage or computing.
Note
To run computations on clusters, make sure you have at least one Compute or Data subcluster.
The roles of the Compute and Data subclusters differ: you can deploy data storage components on Data subclusters and data processing components on Compute subclusters. Storage on a Compute subcluster is only used for temporary storage of processed files.
- For each subcluster, you can configure:
- The number of hosts.
- The host class: the platform and computing resources available to the host.
- Storage size and type.
- The subnet of the network where the cluster is located.
- For Compute subclusters, you can specify the autoscaling parameters.
- After you configure all the subclusters you need, click Create cluster.
Data Proc runs the create cluster operation. After the cluster status changes to Running, you can connect to any active subcluster using the specified SSH key.
Once the cluster is created, add it to the project settings:
- In the management console, open the DataSphere section in the folder where you work with DataSphere projects.
- Click in the line with the project you want to set up.
- In the window that opens, click Edit.
- In the window that opens, click Additional settings.
- In the Data Proc cluster field, specify the cluster you just created.
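With the cluster attached to the project, you can check which of the custom Spark properties from the example above actually took effect. The following is a minimal sketch, not part of the official steps: it assumes the #!spark cell magic exposes the Livy session's SparkContext as sc, and <cluster name> stands for your cluster's name.
```python
#!spark --cluster <cluster name>
# Verification sketch (assumption: the #!spark magic provides the Livy
# session's SparkContext as `sc`). Prints the effective value of a Spark
# property set in the cluster configuration (spark:spark.driver.cores).
print(sc.getConf().get("spark.driver.cores", "not set"))
# HDFS-side properties (hdfs:dfs.*) are not exposed through SparkConf;
# check them on the cluster hosts instead.
```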
Run a computing operation on a Data Proc cluster
To run computations on a cluster created from the notebook interface:
- In the management console, open the DataSphere section in the folder where you work with DataSphere projects.
- Click in the line with the project you want to run computations in.
- In the cell, insert the code to compute. For example:
```python
#!spark --cluster <cluster name>
import random

def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1

NUM_SAMPLES = 1_000_000
count = sc.parallelize(range(0, NUM_SAMPLES)) \
    .filter(inside).count()
print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))
```
Where #!spark --cluster <cluster name> is a mandatory system command to run computations on a cluster. You can view the cluster name in the Data Proc Clusters window under the File menu.
- Click to run computations.
Wait for the computation to start. While it is in progress, you'll see logs under the cell.
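The #!spark command works for any PySpark code, not just the RDD API. As a further illustration, here is a sketch of a DataFrame query; it assumes the Livy session also exposes a SparkSession as spark (available in Spark 2.x and later) and uses <cluster name> as a placeholder:
```python
#!spark --cluster <cluster name>
# Sketch (assumption: the Livy session exposes a SparkSession as `spark`,
# as it does for Spark 2.x and later): a small DataFrame computation.
df = spark.range(0, 100).toDF("n")            # one column "n", values 0..99
even_count = df.filter(df.n % 2 == 0).count()
print("Even numbers:", even_count)            # expected: 50
```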
Delete a Data Proc cluster
If you created a cluster using the notebook interface in DataSphere, it will be automatically deleted after two hours of idle time. You can also delete it manually through the notebook interface.
A cluster that you created yourself should be deleted in Data Proc.
Delete a cluster through the notebook interface in DataSphere
To delete a cluster using the notebook interface:
- In the management console, open the DataSphere section in the folder where you work with DataSphere projects.
- Select the project to delete the Data Proc cluster from.
- In the top panel, click File and select Data Proc Clusters.
- In the window that opens, click Destroy in the cluster line.
Delete a cluster in Data Proc
To delete a cluster:
- In the management console, open the folder with the cluster that you want to delete.
- Select Data Proc.
- Click for the necessary cluster and select Delete.
- Confirm cluster deletion.
Data Proc runs the delete cluster operation.