Creating Data Proc clusters
In the management console, select the folder where you want to create a cluster.
Click Create resource and select Data Proc cluster from the drop-down list.
Enter a name for the cluster in the Cluster name field. The cluster name must be unique within the folder.
Select a relevant image version and the components you want to use in the cluster.
Note that some components require other components to work. For example, to use Spark, you need YARN.
Enter the public part of your SSH key in the Public key field. For information about how to generate and use SSH keys, see the Yandex Compute Cloud documentation.
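For example, you can generate a key pair locally and paste the contents of the public part into the field (a minimal sketch; the key type, file path, and comment are arbitrary):

```bash
# Generate an Ed25519 key pair; accepting the default path produces
# ~/.ssh/id_ed25519 (private) and ~/.ssh/id_ed25519.pub (public).
ssh-keygen -t ed25519 -C "dataproc-user"

# Print the public part to paste into the Public key field.
cat ~/.ssh/id_ed25519.pub
```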
Select or create a service account to grant access to the cluster.
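If you use the YC CLI, a service account can be created and granted access roughly like this (a sketch: the account name is an example, and the exact role Data Proc requires is an assumption; check the Data Proc documentation for the current role name):

```bash
# Create a service account for the cluster (the name is an example).
yc iam service-account create --name my-dataproc-sa

# Grant it a role in the folder; "dataproc.agent" is assumed here,
# verify the required role in the Data Proc documentation.
yc resource-manager folder add-access-binding <folder_name_or_ID> \
  --role dataproc.agent \
  --subject serviceAccount:<service_account_ID>
```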
Select the availability zone for the cluster.
If necessary, set the properties of Hadoop and its components, for example:
- hdfs:dfs.replication : 2
- hdfs:dfs.blocksize : 1073741824
- spark:spark.driver.cores : 1
The available properties are listed in the official documentation for the components.
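For instance, once the cluster is running, you can verify that a property took effect with standard Hadoop tooling (run on a cluster host; hdfs getconf is part of stock Hadoop):

```bash
# Print the effective value of dfs.replication from the active configuration.
hdfs getconf -confKey dfs.replication

# The same works for other HDFS properties, e.g. the block size.
hdfs getconf -confKey dfs.blocksize
```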
Select or create a network for the cluster.
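With the YC CLI, a network and subnet can be created beforehand, for example:

```bash
# Create a network and a subnet for the cluster; the names, zone,
# and CIDR range are examples.
yc vpc network create --name my-network
yc vpc subnet create --name my-subnet \
  --network-name my-network \
  --zone ru-central1-a \
  --range 10.1.0.0/24
```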
Enable the UI Proxy option to access the web interfaces of Data Proc components.
Configure subclusters: no more than one main subcluster with a Master host, and subclusters for data storage or computing.
The roles of Data and Compute subclusters are different: you can deploy data storage components on Data subclusters and data processing components on Compute subclusters. Storage on a Compute subcluster is only used to temporarily store processed files.
For each subcluster, you can configure the following (a CLI sketch is given after the list):
- The number of hosts.
- The host class: the platform and computing resources available to the host.
- Storage size and type.
- The subnet of the network where the cluster is located.
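As a sketch of how these settings map to the YC CLI, each subcluster can be described with a separate --subcluster flag (the key=value parameter names here are assumptions; verify them with `yc dataproc cluster create --help`):

```bash
# A hypothetical cluster with one Master host and two Data hosts.
yc dataproc cluster create \
  --name my-cluster \
  --zone ru-central1-a \
  --service-account-name my-dataproc-sa \
  --ssh-public-keys-file ~/.ssh/id_ed25519.pub \
  --services hdfs,yarn,spark \
  --subcluster name=master,role=masternode,resource-preset=s2.small,disk-type=network-ssd,disk-size=40,subnet-name=my-subnet,hosts-count=1 \
  --subcluster name=data,role=datanode,resource-preset=s2.small,disk-type=network-hdd,disk-size=128,subnet-name=my-subnet,hosts-count=2
```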
For Compute subclusters, you can specify autoscaling parameters.
To enable automatic scaling, make sure the cluster service account has the appropriate roles, then follow these steps (a CLI sketch is given after the list):
- Under Add subcluster, click Add.
- In the Roles field, select COMPUTENODE.
- Under Scalability, enable Automatic scaling.
- Set autoscaling parameters.
- The default metric used for autoscaling is yarn.cluster.containersPending. To enable scaling based on CPU usage, disable the Default scaling option and set the target CPU utilization level.
- Click Add.
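In CLI terms, autoscaling limits could be expressed as extra parameters of the Compute subcluster definition (a sketch: max-hosts-count and cpu-utilization-target are assumed parameter names, and omitting the CPU target would leave the default metric in effect; verify with `yc dataproc cluster create --help`):

```bash
# A hypothetical Compute subcluster that scales between 1 and 5 hosts,
# driven by a 70% target CPU utilization instead of the default metric.
yc dataproc cluster create \
  --name my-cluster \
  --zone ru-central1-a \
  --service-account-name my-dataproc-sa \
  --ssh-public-keys-file ~/.ssh/id_ed25519.pub \
  --services yarn,spark \
  --subcluster name=master,role=masternode,resource-preset=s2.small,disk-type=network-ssd,disk-size=40,subnet-name=my-subnet,hosts-count=1 \
  --subcluster name=compute-auto,role=computenode,resource-preset=s2.small,disk-type=network-ssd,disk-size=40,subnet-name=my-subnet,hosts-count=1,max-hosts-count=5,cpu-utilization-target=70
```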
After you configure all the subclusters you need, click Create cluster.
Data Proc runs the cluster creation operation. After the cluster status changes to Running, you can connect to the hosts of any active subcluster using the specified SSH key.
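For example (the default login on Data Proc hosts is assumed to be ubuntu; the host FQDN can be copied from the console):

```bash
# Connect to a cluster host with the private part of the key whose
# public part was specified at creation time.
ssh -i ~/.ssh/id_ed25519 ubuntu@<host_FQDN>
```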