Creating a Data Proc cluster
Configure a network
In the subnet that the Data Proc subcluster with the Master role will connect to, enable NAT to the internet. This lets the subcluster interact with Yandex Cloud services or hosts on other networks.
Configure security groups
Warning
Security groups must be created and configured before creating a cluster. If the selected security groups don't have the required rules, Yandex Cloud disables cluster creation.
-
Create one or more security groups for cluster service traffic.
-
Add rules:
-
One rule each for inbound and outbound service traffic:
- Port range: 0-65535.
- Protocol: Any.
- Source type: Security group.
- Destination: Current security group (Self).
-
A separate rule for outgoing HTTPS traffic:
- Port range: 443.
- Protocol: TCP.
- Source type: CIDR.
- Destination: 0.0.0.0/0.
This will enable you to use Object Storage buckets, UI Proxy, and cluster autoscaling.
-
If you plan to use multiple security groups for a cluster, enable all traffic between these groups.
Note
You can set more detailed rules for security groups, such as allowing traffic in only specific subnets.
Security groups must be configured correctly for all subnets that will include cluster hosts.
You can set up security groups for connections to cluster hosts via an intermediate VM after creating a cluster.
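The rules described in this section could be expressed on the command line roughly as follows. This is a minimal dry-run sketch that only prints a command. The group and network names are placeholders. The rule keys (direction, from-port, to-port, port, protocol, v4-cidrs, predefined) are assumptions that should be checked against the output of yc vpc security-group create --help before use.

```sh
#!/bin/sh
# Dry-run sketch: prints a `yc vpc security-group create` command; nothing is executed.
# SG_NAME and NETWORK_NAME are placeholder values, not names from the documentation.
SG_NAME="dataproc-sg"
NETWORK_NAME="dataproc-network"

# Rule keys below are assumptions; verify them with `yc vpc security-group create --help`.
# Service traffic inside the group: all ports, any protocol, both directions.
RULE_SELF_IN="direction=ingress,from-port=0,to-port=65535,protocol=any,predefined=self_security_group"
RULE_SELF_OUT="direction=egress,from-port=0,to-port=65535,protocol=any,predefined=self_security_group"
# Outgoing HTTPS traffic to any address.
RULE_HTTPS_OUT="direction=egress,port=443,protocol=tcp,v4-cidrs=[0.0.0.0/0]"

build_sg_command() {
  printf 'yc vpc security-group create --name %s --network-name %s' "$SG_NAME" "$NETWORK_NAME"
  for rule in "$RULE_SELF_IN" "$RULE_SELF_OUT" "$RULE_HTTPS_OUT"; do
    # Each rule goes on its own continuation line.
    printf ' \\\n  --rule "%s"' "$rule"
  done
  printf '\n'
}

build_sg_command
```

Review the printed command and run it by hand once the names and rule keys are confirmed for your CLI version.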
Create a cluster
-
In the management console, select the folder where you want to create a cluster.
-
Click Create resource and select Data Proc cluster from the drop-down list.
-
Name the cluster in the Cluster name field. The cluster name must be unique within the folder.
-
Select a relevant image version and the components you want to use in the cluster.
Note
Some components require other components to work. For example, to use Spark, you need YARN.
-
Enter the public part of your SSH key in the Public key field. For information about how to generate and use SSH keys, see the Yandex Compute Cloud documentation.
-
Select or create a service account to be granted cluster access.
-
Select the availability zone for the cluster.
-
If necessary, configure the properties of cluster components, jobs, and the environment.
-
Select or create a network for the cluster.
-
Select security groups that have the required permissions.
Warning
When creating a cluster, security group settings are verified. If a cluster cannot function with these settings, a warning is issued. A sample functional configuration is provided above.
-
Enable the UI Proxy option to access the web interfaces of Data Proc components.
-
Cluster logs are saved in Yandex Cloud Logging. Select a log group from the list or create a new one.
To enable this functionality, assign the cluster service account the logging.writer role. For more information, see the Yandex Cloud Logging documentation.
-
Configure subclusters: no more than one main subcluster with a Master host and subclusters for data storage or computing.
The roles of Compute and Data subclusters are different: you can deploy data storage components on Data subclusters and data processing components on Compute subclusters. Storage on a Compute subcluster is only used to temporarily store processed files.
-
For each subcluster, you can configure:
-
The number of hosts.
-
The host class, which dictates the platform and computing resources available to the host.
-
The size and type of storage.
-
The subnet of the network where the cluster is located.
NAT to the internet must be enabled in the subnet for the subcluster with the Master role. For more information, see Configure a network.
-
To access a cluster from the internet, select the Public access option in the primary subcluster settings. With this option, you can only connect to the cluster over an SSL connection. For more information, see Connecting to Data Proc clusters.
Warning
You can't request public access after creating a cluster.
-
For Compute subclusters, you can specify the autoscaling parameters.
Note
To enable automatic scaling, assign the following roles to the cluster service account: editor and dataproc.agent.
- Under Add subcluster, click Add.
- In the Roles field, select COMPUTENODE.
- Under Scalability, enable Automatic scaling.
- Set autoscaling parameters.
- The default metric used for autoscaling is yarn.cluster.containersPending. To enable scaling based on CPU usage, disable the Default scaling option and set the target CPU utilization level.
- Click Add.
-
If necessary, configure additional cluster settings:
Deletion protection: Manages cluster protection from accidental deletion by a user.
Enabled protection will not prevent a manual connection to a cluster to delete data.
-
Click Create cluster.
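The steps above mention three roles for the cluster service account: logging.writer for saving logs, and editor and dataproc.agent for autoscaling. As a rough sketch, they could be granted from the CLI with yc resource-manager folder add-access-binding. The folder and service account IDs are placeholders, and the script only prints the commands so that they can be reviewed first.

```sh
#!/bin/sh
# Dry-run sketch: prints role-binding commands instead of running them.
# FOLDER_ID and SA_ID are placeholders to replace with real identifiers.
FOLDER_ID="<folder ID>"
SA_ID="<service account ID>"

print_role_bindings() {
  # logging.writer: needed for Cloud Logging; editor and dataproc.agent: needed for autoscaling.
  for role in logging.writer editor dataproc.agent; do
    printf 'yc resource-manager folder add-access-binding %s --role %s --subject serviceAccount:%s\n' \
      "$FOLDER_ID" "$role" "$SA_ID"
  done
}

print_role_bindings
```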
If you don't have the Yandex Cloud command line interface yet, install and initialize it.
The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.
To create a cluster:
-
View a description of the CLI's create cluster command:
yc dataproc cluster create --help
-
Specify cluster parameters in the create command (the list of supported parameters in the example is not exhaustive):
yc dataproc cluster create <cluster name> \
  --zone <cluster availability zone> \
  --service-account-name <cluster service account name> \
  --version <image version> \
  --services <component list> \
  --subcluster name=<name of MASTERNODE subcluster>,`
              `role=masternode,`
              `resource-preset=<host class>,`
              `disk-type=<storage type>,`
              `disk-size=<storage size, GB>,`
              `subnet-name=<subnet name>,`
              `hosts-count=1 \
  --subcluster name=<name of DATANODE subcluster>,`
              `role=datanode,`
              `resource-preset=<host class>,`
              `disk-type=<storage type>,`
              `disk-size=<storage size, GB>,`
              `subnet-name=<subnet name>,`
              `hosts-count=<host count> \
  --bucket <bucket name> \
  --ssh-public-keys-file <path to public portion of SSH key> \
  --security-group-ids <security group ID list> \
  --deletion-protection=<cluster deletion protection: true or false> \
  --log-group-id <log group ID>
Cluster deletion protection will not prevent a manual connection to a cluster to delete data.
-
To create a cluster deployed on groups of dedicated hosts, specify the host group IDs as a comma-separated list in the --host-group-ids parameter:
yc dataproc cluster create <cluster name> \
  ...
  --host-group-ids <IDs of dedicated host groups>
Alert
You cannot edit this setting after you create a cluster. The use of dedicated hosts significantly affects cluster pricing.
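To make the parameter list above more concrete, the following sketch assembles the same create command from shell variables and prints it without running it. It uses only flags that appear in the command template above. Every value (cluster name, zone, image version, host class, disk type, and so on) is an illustrative placeholder and may not match what is available in your folder.

```sh
#!/bin/sh
# Dry-run sketch: prints the assembled command; nothing is sent to Yandex Cloud.
# All values below are illustrative placeholders, not values from the documentation.
CLUSTER_NAME="my-dataproc"
ZONE="ru-central1-a"
SERVICE_ACCOUNT="dataproc-sa"
IMAGE_VERSION="2.0"
SERVICES="hdfs,yarn,spark"
HOST_CLASS="s2.micro"
DISK_TYPE="network-ssd"
DISK_SIZE_GB="128"
SUBNET="dataproc-subnet"
BUCKET="my-dataproc-bucket"
SSH_KEY_FILE="<path to public portion of SSH key>"
SECURITY_GROUP_IDS="<security group ID list>"
LOG_GROUP_ID="<log group ID>"

# Builds one --subcluster value. $1 = subcluster name, $2 = role, $3 = host count.
subcluster_spec() {
  printf 'name=%s,role=%s,resource-preset=%s,disk-type=%s,disk-size=%s,subnet-name=%s,hosts-count=%s' \
    "$1" "$2" "$HOST_CLASS" "$DISK_TYPE" "$DISK_SIZE_GB" "$SUBNET" "$3"
}

build_create_command() {
  printf 'yc dataproc cluster create %s \\\n' "$CLUSTER_NAME"
  printf '  --zone %s \\\n' "$ZONE"
  printf '  --service-account-name %s \\\n' "$SERVICE_ACCOUNT"
  printf '  --version %s \\\n' "$IMAGE_VERSION"
  printf '  --services %s \\\n' "$SERVICES"
  # One master host, then a data subcluster with two hosts.
  printf '  --subcluster %s \\\n' "$(subcluster_spec master masternode 1)"
  printf '  --subcluster %s \\\n' "$(subcluster_spec data datanode 2)"
  printf '  --bucket %s \\\n' "$BUCKET"
  printf '  --ssh-public-keys-file %s \\\n' "$SSH_KEY_FILE"
  printf '  --security-group-ids %s \\\n' "$SECURITY_GROUP_IDS"
  printf '  --deletion-protection=%s \\\n' "true"
  printf '  --log-group-id %s\n' "$LOG_GROUP_ID"
}

build_create_command
```

Building each --subcluster value in one helper keeps the comma-separated key list in a single place, which makes it harder to leave a key out of one subcluster.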
To create a cluster, use the API create method and pass the following in the request:
-
ID of the folder to host the cluster, in the folderId parameter.
-
The cluster name, in the name parameter.
-
Cluster configuration, in the configSpec parameter, including:
- Image version, in the configSpec.versionId parameter.
- Component list, in the configSpec.hadoop.services parameter.
- Public portion of the SSH key, in the configSpec.hadoop.sshPublicKeys parameter.
- Subcluster settings, in the configSpec.subclustersSpec parameter.
-
Cluster availability zone, in the zoneId parameter.
-
Service account ID, in the serviceAccountId parameter.
-
Bucket name, in the bucket parameter.
-
Cluster security group IDs, in the securityGroupIds parameter.
-
Cluster deletion protection settings, in the deletionProtection parameter.
Cluster deletion protection will not prevent a manual connection to a cluster to delete data.
To create a cluster deployed on groups of dedicated hosts, pass a list of host group IDs in the hostGroupIds parameter.
Alert
You cannot edit this setting after you create a cluster. The use of dedicated hosts significantly affects cluster pricing.
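As an illustration of how the request parameters listed above fit together, the sketch below prints a JSON body from shell variables. The top-level field names come from the list above. The securityGroupIds name, the fields inside subclustersSpec (name, role, resources, subnetId, hostsCount), the assumption that diskSize is given in bytes, and the endpoint in the commented-out curl line are all assumptions to check against the API reference. All values are placeholders.

```sh
#!/bin/sh
# Sketch only: prints a request body; it does not call the API.
# Every value is a placeholder to replace with real data.
FOLDER_ID="<folder ID>"
CLUSTER_NAME="my-dataproc"
IMAGE_VERSION="2.0"
SSH_PUBLIC_KEY="<public portion of SSH key>"
HOST_CLASS="s2.micro"
DISK_TYPE="network-ssd"
DISK_SIZE_BYTES="137438953472"   # 128 GiB, assuming the API expects bytes
SUBNET_ID="<subnet ID>"
ZONE="ru-central1-a"
SERVICE_ACCOUNT_ID="<service account ID>"
BUCKET="my-dataproc-bucket"
SECURITY_GROUP_ID="<security group ID>"

request_body() {
  cat <<EOF
{
  "folderId": "$FOLDER_ID",
  "name": "$CLUSTER_NAME",
  "configSpec": {
    "versionId": "$IMAGE_VERSION",
    "hadoop": {
      "services": ["HDFS", "YARN", "SPARK"],
      "sshPublicKeys": ["$SSH_PUBLIC_KEY"]
    },
    "subclustersSpec": [
      {
        "name": "master",
        "role": "MASTERNODE",
        "resources": {
          "resourcePresetId": "$HOST_CLASS",
          "diskTypeId": "$DISK_TYPE",
          "diskSize": "$DISK_SIZE_BYTES"
        },
        "subnetId": "$SUBNET_ID",
        "hostsCount": 1
      }
    ]
  },
  "zoneId": "$ZONE",
  "serviceAccountId": "$SERVICE_ACCOUNT_ID",
  "bucket": "$BUCKET",
  "securityGroupIds": ["$SECURITY_GROUP_ID"],
  "deletionProtection": true
}
EOF
}

request_body

# Hypothetical submission step (endpoint URL is an assumption to verify):
# curl -X POST -H "Authorization: Bearer $(yc iam create-token)" \
#   -H "Content-Type: application/json" -d "$(request_body)" \
#   https://dataproc.api.cloud.yandex.net/dataproc/v1/clusters
```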
Data Proc runs the create cluster operation. After the cluster status changes to Running, you can connect to any active subcluster using the specified SSH key.
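Waiting for the Running status can be scripted with a simple polling loop. The sketch below is one possible approach, not part of the service tooling. wait_for_running is a hypothetical helper that takes the name of any function printing the current status. yc_cluster_status is an example of such a function, and it assumes that yc dataproc cluster get with --format json prints a "status" field whose value is RUNNING for a running cluster.

```sh
#!/bin/sh
# Placeholder cluster name.
CLUSTER_NAME="my-dataproc"

# Prints the first "status" value from the cluster description.
# Assumes JSON output with a line such as: "status": "RUNNING"
yc_cluster_status() {
  yc dataproc cluster get "$CLUSTER_NAME" --format json |
    sed -n 's/.*"status": *"\([A-Z_]*\)".*/\1/p' | head -n 1
}

# Polls until the status is RUNNING.
# $1 = function that prints the status, $2 = maximum attempts, $3 = delay in seconds.
# Returns 0 when RUNNING is seen, 1 if the attempts run out.
wait_for_running() {
  attempts=0
  while [ "$attempts" -lt "$2" ]; do
    status=$("$1")
    if [ "$status" = "RUNNING" ]; then
      return 0
    fi
    attempts=$((attempts + 1))
    sleep "$3"
  done
  return 1
}

# Example call (not executed here): check every 10 seconds, up to 60 times.
# wait_for_running yc_cluster_status 60 10 && echo "Cluster is running"
```

Passing the status function as an argument keeps the loop independent of the CLI, so it can be tried out with a stub before pointing it at a real cluster.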