© 2022 Yandex.Cloud LLC

Creating a Data Proc cluster

Written by
Yandex Cloud
  • Configure a network
  • Configure security groups
  • Create a cluster

Configure a network

In the subnet that the subcluster with the Master role will connect to, enable NAT to the internet. This lets the subcluster interact with Yandex Cloud services and with hosts on other networks.

Configure security groups

Warning

Create and configure security groups before you create a cluster. If the selected security groups don't have the required rules, Yandex Cloud blocks cluster creation.

  1. Create one or more security groups for cluster service traffic.

  2. Add rules:

    • One rule for inbound and outbound service traffic:

      • Port range: 0-65535.
      • Protocol: Any.
      • Source type: Security group.
      • Destination: Current security group (Self).
    • A separate rule for outbound HTTPS traffic:

      • Port range: 443.
      • Protocol: TCP.
      • Source type: CIDR.
      • Destination: 0.0.0.0/0.

      This will enable you to use Object Storage buckets, UI Proxy, and cluster autoscaling.

If you plan to use multiple security groups for a cluster, enable all traffic between these groups.
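The rules above can also be written down as CLI rule strings. The following is a minimal sketch, not the documented procedure: it only prints a `yc vpc security-group create` command instead of running it. The rule key names (`direction`, `from-port`, `to-port`, `port`, `protocol`, `predefined`, `v4-cidrs`) and the group and network names are assumptions; check them against `yc vpc security-group create --help` before use.

```shell
# Rule for service traffic inside the group (all ports, any protocol).
# $1: ingress or egress. Key names are assumptions, see the lead-in.
self_rule() {
  printf 'direction=%s,from-port=0,to-port=65535,protocol=any,predefined=self_security_group' "$1"
}

# Rule for outbound HTTPS traffic to any address.
https_egress_rule() {
  printf 'direction=egress,port=443,protocol=tcp,v4-cidrs=[0.0.0.0/0]'
}

# Dry run: print the command instead of executing it.
# $1: security group name (hypothetical), $2: network name (hypothetical)
print_sg_command() {
  printf 'yc vpc security-group create --name %s --network-name %s' "$1" "$2"
  # printf repeats the format once per argument, giving three --rule flags.
  printf ' --rule "%s"' "$(self_rule ingress)" "$(self_rule egress)" "$(https_egress_rule)"
  printf '\n'
}

print_sg_command dataproc-sg dataproc-net
```

Printing the command first makes it easy to review the rules before anything is created in the folder.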

Note

You can set more detailed rules for security groups, such as allowing traffic in only specific subnets.

Security groups must be configured correctly for all subnets that will include cluster hosts.

You can set up security groups for connections to cluster hosts via an intermediate VM after creating a cluster.

Create a cluster

Management console
  1. In the management console, select the folder where you want to create a cluster.

  2. Click Create resource and select Data Proc cluster from the drop-down list.

  3. Name the cluster in the Cluster name field. The cluster name must be unique within the folder.

  4. Select a relevant image version and the components you want to use in the cluster.

    Note

    Some components require other components to work. For example, to use Spark, you need YARN.

  5. Enter the public part of your SSH key in the Public key field. For information about how to generate and use SSH keys, see the Yandex Compute Cloud documentation.

  6. Select or create a service account to be granted cluster access.

  7. Select the availability zone for the cluster.

  8. If necessary, configure the properties of cluster components, jobs, and the environment.

  9. Select or create a network for the cluster.

  10. Select security groups that have the required permissions.

    Warning

    When creating a cluster, security group settings are verified. If a cluster cannot function with these settings, a warning is issued. A sample functional configuration is provided above.

  11. Enable the UI Proxy option to access the web interfaces of Data Proc components.

  12. Cluster logs are saved in Yandex Cloud Logging. Select a log group from the list or create a new one.

    To enable this functionality, assign the cluster service account the logging.writer role. For more information, see the Yandex Cloud Logging documentation.

  13. Configure subclusters: no more than one main subcluster with a Master host and subclusters for data storage or computing.

    The Compute and Data subcluster roles differ: you can deploy data storage components on Data subclusters and data processing components on Compute subclusters. Storage on a Compute subcluster is only used to temporarily store processed files.

  14. For each subcluster, you can configure:

    • The number of hosts.

    • The host class, which dictates the platform and computing resources available to the host.

    • Size and type of storage.

    • The subnet of the network where the cluster is located.

      NAT to the internet must be enabled in the subnet for the subcluster with the Master role. For more information, see Configure a network.

  15. To access a cluster from the internet, select the Public access option in the settings of the subcluster with the Master role. In this case, you can only connect to the cluster over an SSL connection. For more information, see Connecting to Data Proc clusters.

    Warning

    You can't request public access after creating a cluster.

  16. For Compute subclusters, you can specify the autoscaling parameters.

    Note

    To enable automatic scaling, assign the following roles to the cluster service account:

    • editor
    • dataproc.agent

    1. Under Add subcluster, click Add.
    2. In the Roles field, select COMPUTENODE.
    3. Under Scalability, enable Automatic scaling.
    4. Set autoscaling parameters.
    5. The default metric used for autoscaling is yarn.cluster.containersPending. To enable scaling based on CPU usage, disable the Default scaling option and set the target CPU utilization level.
    6. Click Add.
  17. If necessary, configure additional cluster settings:

    Deletion protection: Manages cluster protection from accidental deletion by a user.

    Enabled protection will not prevent a manual connection to a cluster to delete data.

  18. Click Create cluster.
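
The roles mentioned in steps 12 and 16 (logging.writer, editor, dataproc.agent) have to be granted to the cluster service account. A minimal sketch of doing this from the CLI follows. It only prints the commands. The `yc resource-manager folder add-access-binding` invocation with `--id`, `--role`, and `--subject` is written from memory and should be checked against the CLI help, and both IDs are hypothetical placeholders.

```shell
# Print one access-binding command per role (dry run, nothing is executed).
# $1: folder ID, $2: service account ID, remaining arguments: role names.
print_role_bindings() {
  folder=$1
  sa=$2
  shift 2
  for role in "$@"; do
    printf 'yc resource-manager folder add-access-binding --id %s --role %s --subject serviceAccount:%s\n' \
      "$folder" "$role" "$sa"
  done
}

# Placeholder IDs; replace them with the real folder and service account IDs.
print_role_bindings example-folder-id example-sa-id logging.writer editor dataproc.agent
```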

CLI

If you don't have the Yandex Cloud command line interface yet, install and initialize it.

The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.

To create a cluster:

  1. View a description of the CLI's create cluster command:

    yc dataproc cluster create --help
    
  2. Specify cluster parameters in the create command (the list of supported parameters in the example is not exhaustive):

    yc dataproc cluster create <cluster name> \
        --zone <cluster availability zone> \
        --service-account-name <cluster service account name> \
        --version <image version> \
        --services <component list> \
        --subcluster name=<name of MASTERNODE subcluster>,`
                    `role=masternode,`
                    `resource-preset=<host class>,`
                    `disk-type=<storage type>,`
                    `disk-size=<storage size, GB>,`
                    `subnet-name=<subnet name>,`
                    `hosts-count=1 \
        --subcluster name=<name of DATANODE subcluster>,`
                    `role=datanode,`
                    `resource-preset=<host class>,`
                    `disk-type=<storage type>,`
                    `disk-size=<storage size, GB>,`
                    `subnet-name=<subnet name>,`
                    `hosts-count=<host count> \
        --bucket <bucket name> \
        --ssh-public-keys-file <path to public portion of SSH key> \
        --security-group-ids <security group ID list> \
        --deletion-protection=<cluster deletion protection: true or false> \
        --log-group-id <log group ID>
    

    Cluster deletion protection will not prevent a manual connection to a cluster to delete data.
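
Each `--subcluster` value in the command above is a single comma-separated string. The backtick pairs appear to be a shell trick for splitting that string across lines. As an alternative, the value can be assembled by a small helper, sketched below. The helper name and the example values (`s2.small`, `network-ssd`, `my-subnet`) are illustrative assumptions; the key names are the ones used in the command above.

```shell
# Join key=value pairs into one comma-separated --subcluster value.
# The subshell keeps the IFS change from leaking into the caller.
subcluster_spec() {
  ( IFS=,; printf '%s\n' "$*" )
}

# Example values are placeholders, not recommendations.
master_spec=$(subcluster_spec \
  name=master \
  role=masternode \
  resource-preset=s2.small \
  disk-type=network-ssd \
  disk-size=20 \
  subnet-name=my-subnet \
  hosts-count=1)

# Show the value that would be passed as: --subcluster "$master_spec"
echo "$master_spec"
```

Building the value in a variable keeps the `yc dataproc cluster create` call itself short and makes each subcluster definition easy to review on its own.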

  3. To create a cluster deployed on groups of dedicated hosts, specify the host group IDs as a comma-separated list in the --host-group-ids parameter:

    yc dataproc cluster create <cluster name> \
        ...
        --host-group-ids <IDs of dedicated host groups>
    

    Alert

    You cannot edit this setting after you create a cluster. The use of dedicated hosts significantly affects cluster pricing.

API

To create a cluster, use the API create method and pass the following in the request:

  • ID of the folder to host the cluster, in the folderId parameter.

  • The cluster name in the name parameter.

  • Cluster configuration in the configSpec parameter, including:

    • Image version, in the configSpec.versionId parameter.
    • Component list, in the configSpec.hadoop.services parameter.
    • Public portion of the SSH key, in the configSpec.hadoop.sshPublicKeys parameter.
    • Subcluster settings, in the configSpec.subclustersSpec parameter.
  • Cluster availability zone, in the zoneId parameter.

  • Service account ID, in the serviceAccountId parameter.

  • Bucket name, in the bucket parameter.

  • Cluster security group IDs, in the securityGroupIds parameter.

  • Cluster deletion protection settings in the deletionProtection parameter.

    Cluster deletion protection will not prevent a manual connection to a cluster to delete data.
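
To show how the parameters listed above nest, here is a minimal sketch that prints a partial JSON request body. It covers only part of the list: subcluster settings and security groups are left out because their inner structure is not described here, so the body is not complete enough to send as is. The function name, the image version, the component names, and the key text are placeholder assumptions.

```shell
# Print a partial request body for the create method.
# $1: folder ID, $2: cluster name, $3: availability zone,
# $4: service account ID, $5: bucket name
cluster_request_body() {
  cat <<EOF
{
  "folderId": "$1",
  "name": "$2",
  "configSpec": {
    "versionId": "2.0",
    "hadoop": {
      "services": ["HDFS", "YARN", "SPARK"],
      "sshPublicKeys": ["ssh-ed25519 AAAA... user@example"]
    }
  },
  "zoneId": "$3",
  "serviceAccountId": "$4",
  "bucket": "$5",
  "deletionProtection": false
}
EOF
}

# All arguments below are placeholders.
cluster_request_body example-folder-id my-cluster ru-central1-a example-sa-id my-bucket
```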

To create a cluster deployed on groups of dedicated hosts, pass a list of host group IDs in the hostGroupIds parameter.

Alert

You cannot edit this setting after you create a cluster. The use of dedicated hosts significantly affects cluster pricing.

Data Proc runs the create cluster operation. After the cluster status changes to Running, you can connect to any active subcluster using the specified SSH key.
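
Waiting for the Running status can be scripted. The sketch below is a generic polling helper: it repeatedly runs a command that prints the current status and stops once the wanted value appears. How the status is obtained is left to the caller. The commented usage line assumes a hypothetical wrapper `get_cluster_status` around `yc dataproc cluster get`, and assumes the status is reported as RUNNING; both are assumptions to verify.

```shell
# Poll until a status command prints the wanted value.
# $1: wanted status, $2: maximum attempts, $3: seconds between attempts,
# remaining arguments: command that prints the current status.
# Returns 0 on success and 1 if the attempts run out.
wait_for_status() {
  want=$1
  attempts=$2
  delay=$3
  shift 3
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if [ "$("$@")" = "$want" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Hypothetical usage, assuming get_cluster_status prints the cluster status:
# wait_for_status RUNNING 60 30 get_cluster_status my-cluster
```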
