
Setting up computations on Apache Spark™ clusters

  • Set up a project to work with a Data Proc cluster
  • Create a Data Proc cluster
    • Create a cluster through the notebook interface in DataSphere
    • Create a cluster in Data Proc
  • Run a computing operation on a Data Proc cluster
  • Delete a Data Proc cluster
    • Delete a cluster through the notebook interface in DataSphere
    • Delete a cluster in Data Proc

To start computing on Apache Spark™ clusters, prepare a DataSphere project. To run computations, create a new Data Proc cluster or use an existing one.

Set up a project to work with a Data Proc cluster

To set up your project:

  1. In the management console, open the DataSphere section in the folder where you work with DataSphere projects.

  2. Click in the line with the project to set up.
  3. In the window that opens, click Edit.
  4. In the window that opens, click Additional settings.
  5. In the Service account field, select a service account with the appropriate roles.
  6. In the Subnet field, specify the subnet where new Data Proc clusters will be created or existing ones will be used.

Note

If you specified a subnet in the project settings, allocating computing resources may take longer.

Create a Data Proc cluster

Before creating a cluster, make sure you have sufficient resources in your cloud. You can check this in the management console under Quotas.

Create a cluster through the notebook interface in DataSphere

DataSphere monitors the lifetime of the cluster you created and automatically deletes it after it has been idle for two hours.

To create a cluster using the notebook interface:

  1. In the management console, open the DataSphere section in the folder where you work with DataSphere projects.

  2. Select the project to create a Data Proc cluster for.
  3. In the top panel, click File and select Data Proc Clusters.
  4. In the Create new cluster section of the window that opens:
    1. In the Name field, enter the Data Proc cluster name.
    2. In the Size list, select the Data Proc cluster configuration.
  5. Click Create.

Information about the status of the created cluster will be displayed in the same window.

Data Proc cluster statuses

Once created, a cluster can have the following statuses:

  • STARTING: The cluster is being created.
  • UP: The cluster is created and ready to run computations.
  • DOWN: An error occurred while creating the cluster.
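If you script around cluster startup, these statuses map naturally to a polling loop. A minimal sketch in Python, assuming a hypothetical `get_cluster_status()` callable that returns one of the statuses above (DataSphere itself only shows the status in the Data Proc Clusters window; no such function is part of any API):

```python
import time

def wait_until_up(get_cluster_status, poll_seconds=10, timeout_seconds=600):
    """Poll a hypothetical status source until the cluster is UP.

    get_cluster_status is a stand-in for however you obtain the
    status shown in the Data Proc Clusters window.
    """
    waited = 0
    while waited <= timeout_seconds:
        status = get_cluster_status()
        if status == "UP":
            return True  # cluster is created and ready to run computations
        if status == "DOWN":
            raise RuntimeError("problems creating the cluster")
        # STARTING: the cluster is still being created, keep waiting
        time.sleep(poll_seconds)
        waited += poll_seconds
    raise TimeoutError("cluster did not reach UP within the timeout")
```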

Create a cluster in Data Proc

You can manage the life cycle of a cluster that you created manually. To ensure correct integration, you need to create a cluster with the following parameters:

  • Version: 1.3 or higher.
  • Enabled services: LIVY, SPARK, YARN, and HDFS.
  • Cluster availability zone: ru-central1-a.
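These requirements can be checked before you create the cluster. A minimal sketch in Python; `validate_cluster_spec` is an illustrative name, not part of any API:

```python
# Services that must be enabled for DataSphere integration to work.
REQUIRED_SERVICES = {"LIVY", "SPARK", "YARN", "HDFS"}

def validate_cluster_spec(version, services, zone):
    """Return a list of problems; an empty list means the spec is compatible."""
    problems = []
    # Version must be 1.3 or higher.
    major, minor = (int(part) for part in version.split(".")[:2])
    if (major, minor) < (1, 3):
        problems.append("version must be 1.3 or higher")
    # All four services must be enabled.
    missing = REQUIRED_SERVICES - set(services)
    if missing:
        problems.append(f"missing services: {sorted(missing)}")
    # The cluster must live in the ru-central1-a availability zone.
    if zone != "ru-central1-a":
        problems.append("availability zone must be ru-central1-a")
    return problems
```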
How to create a Data Proc cluster
Management console
  1. In the management console, select the folder where you want to create a cluster.

  2. Click Create resource and select Data Proc cluster from the drop-down list.

  3. Enter the cluster name in the Cluster name field. The cluster name must be unique within the folder.

  4. In the Version field, select 1.3.

  5. In the Services field, select: LIVY, SPARK, YARN, and HDFS.

  6. Enter the public part of your SSH key in the Public key field. For information about how to generate and use SSH keys, see the Yandex Compute Cloud documentation.

  7. Select or create a service account that you want to grant access to the cluster.

  8. In the Availability zone field, choose ru-central1-a.

  9. If necessary, set the properties of Hadoop and its components, for example:

    hdfs:dfs.replication : 2
    hdfs:dfs.blocksize : 1073741824
    spark:spark.driver.cores : 1
    

    The available properties are listed in the official documentation for the components:

    • Hadoop
    • HDFS
    • YARN
    • MapReduce
    • Spark
    • Flume 1.8.0
    • HBASE
    • HIVE
    • SQOOP
    • Tez 0.9.1
    • Zeppelin 0.7.3
    • ZooKeeper 3.4.6
  10. Select or create a network for the cluster.

  11. Enable the UI Proxy option to access the web interfaces of the components of Data Proc.

  12. Configure subclusters: no more than one main subcluster with a Master host and subclusters for data storage or computing.

    Note

    To run computations on clusters, make sure you have at least one Compute or Data subcluster.

    The roles of the Compute and Data subclusters are different: you can deploy data storage components on Data and data processing components on Compute subclusters. Storage on a Compute subcluster is only used to temporarily store processed files.

  13. For each subcluster, you can configure:

    • The number of hosts.
    • The host class: the platform and computing resources available to the host.
    • Storage size and type.
    • The subnet of the network where the cluster is located.
  14. For Compute subclusters, you can specify the auto scaling parameters.

  15. After you configure all the subclusters you need, click Create cluster.

Data Proc runs the create cluster operation. After the cluster status changes to Running, you can connect to any active subcluster using the specified SSH key.
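The Hadoop properties set in step 9 use a `component:property : value` form, where the prefix names the component the property belongs to. A minimal sketch of grouping such lines by component, for instance when generating or reviewing a cluster spec (the parsing function is illustrative, not part of any tool):

```python
def parse_hadoop_properties(lines):
    """Group 'component:property : value' lines by component name.

    Example input line: 'hdfs:dfs.replication : 2'
    """
    props = {}
    for line in lines:
        key, value = (part.strip() for part in line.split(" : ", 1))
        component, prop = key.split(":", 1)  # e.g. 'hdfs', 'dfs.replication'
        props.setdefault(component, {})[prop] = value
    return props
```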

Once the cluster is created, add it to the project settings:

  1. In the management console, open the DataSphere section in the folder where you work with DataSphere projects.

  2. Click in the line with the project to set up.
  3. In the window that opens, click Edit.
  4. In the window that opens, click Additional settings.
  5. In the Data Proc cluster field, specify the cluster you just created.

Run a computing operation on a Data Proc cluster

To run computations on a cluster created from the notebook interface:

  1. In the management console, open the DataSphere section in the folder where you work with DataSphere projects.

  2. Click in the line with the project to run computations in.

  3. In the cell, insert the code to compute. For example:

    #!spark --cluster <cluster name>
    import random
    
    def inside(p):
        x, y = random.random(), random.random()
        return x*x + y*y < 1
    
    NUM_SAMPLES = 1_000_000
    
    count = sc.parallelize(range(0, NUM_SAMPLES)) \
                 .filter(inside).count()
    print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))
    

    Where:

    • #!spark --cluster <cluster name> is a mandatory system command for running computations on a cluster. You can view the cluster name in the Data Proc Clusters window in the File menu.
  4. Click to run computations.

Wait for the computation to start. While it is in progress, you'll see logs under the cell.
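The same Monte Carlo estimate can also be sanity-checked locally, without a cluster or the `#!spark` line, before submitting it to Spark. A minimal sketch with a fixed seed for reproducibility:

```python
import random

def inside(_):
    """One dart throw: is a random point inside the unit quarter circle?"""
    x, y = random.random(), random.random()
    return x * x + y * y < 1

def estimate_pi(num_samples, seed=42):
    """Estimate pi locally with the same logic as the Spark example."""
    random.seed(seed)  # fixed seed so repeated runs give the same estimate
    count = sum(1 for i in range(num_samples) if inside(i))
    return 4.0 * count / num_samples
```

On the cluster, `sc.parallelize` distributes the same per-sample check across executors; locally, a plain generator expression plays that role.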

Delete a Data Proc cluster

If you created a cluster using the notebook interface in DataSphere, it will be automatically deleted after two hours of idle time. You can also delete it manually through the notebook interface.

A cluster that you created yourself should be deleted in Data Proc.

Delete a cluster through the notebook interface in DataSphere

To delete a cluster using the notebook interface:

  1. In the management console, open the DataSphere section in the folder where you work with DataSphere projects.

  2. Select the project to delete the Data Proc cluster from.
  3. In the top panel, click File and select Data Proc Clusters.
  4. In the window that opens, click Destroy in the cluster line.

Delete a cluster in Data Proc

Management console

To delete a cluster:

  1. In the management console, open the folder with the cluster that you want to delete.
  2. Select Data Proc.
  3. Click for the necessary cluster and select Delete.
  4. Confirm cluster deletion.

Data Proc runs the delete cluster operation.

© 2021 Yandex.Cloud LLC