Creating a Data Proc cluster
To create a Data Proc cluster, a user must be assigned the `editor` and `dataproc.agent` roles. For more information, see the role description.
Configure a network
Configure internet access from the subnet to which the Data Proc subcluster with a master host will be connected, e.g., using a NAT gateway. This will enable the Data Proc subcluster to interact with Yandex Cloud services or hosts in other networks.
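As an illustration, the gateway and routing could be set up with the CLI. This is only a sketch: the `dataproc-nat` and `dataproc-route-table` names are placeholders, and the exact `--route` flag syntax may differ between CLI versions, so verify it with `yc vpc gateway create --help` and `yc vpc route-table create --help` before running it.

```shell
# Sketch: create a NAT gateway and route outbound traffic through it.
# Resource names and the <...> values are placeholders.
yc vpc gateway create --name dataproc-nat

yc vpc route-table create \
  --name dataproc-route-table \
  --network-name <network_name> \
  --route destination=0.0.0.0/0,gateway-id=<gateway_ID>

# Attach the route table to the subnet of the subcluster with the master host.
yc vpc subnet update <subnet_name> --route-table-name dataproc-route-table
```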
Configure security groups
Warning
You need to create and configure security groups before creating a Data Proc cluster. If the selected security groups lack the required rules, Yandex Cloud will not let you create the Data Proc cluster.
- Create one or more security groups for the service traffic of the Data Proc cluster.
- Add the following rules to them:
  - One rule for inbound and another one for outbound service traffic:
    - Port range: `0-65535`
    - Protocol: `Any`
    - Source/Destination name: `Security group`
    - Security group: `Current`
  - A separate rule for outgoing HTTPS traffic. This will allow you to use Yandex Object Storage buckets, UI Proxy, and autoscaling of Data Proc clusters.
    You can set up this rule using one of two methods:
    - To all addresses:
      - Port range: `443`
      - Protocol: `TCP`
      - Destination name: `CIDR`
      - CIDR blocks: `0.0.0.0/0`
    - To the addresses used by Yandex Cloud:
      - Port range: `443`
      - Protocol: `TCP`
      - Destination name: `CIDR`
      - CIDR blocks:
        - `84.201.181.26/32`: Getting the Data Proc cluster status, running jobs, UI Proxy.
        - `158.160.59.216/32`: Monitoring the Data Proc cluster state, autoscaling.
        - `213.180.193.243/32`: Access to Object Storage.
  - A rule that allows access to NTP servers for time syncing:
    - Port range: `123`
    - Protocol: `UDP`
    - Destination name: `CIDR`
    - CIDR blocks: `0.0.0.0/0`
If you plan to use multiple security groups for your Data Proc cluster, allow all traffic between these groups.
Note
You can set more detailed rules for security groups, such as allowing traffic only in specific subnets.
You must configure security groups correctly for all subnets in which the Data Proc cluster hosts will reside.
You can set up security groups after creating a Data Proc cluster to connect to Metastore or Data Proc cluster hosts via the internet or an intermediate VM.
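For reference, the rules above might be expressed through the CLI as a single security group. This is a sketch rather than the official procedure: the group and network names are placeholders, and the `--rule` spec keys may vary between CLI versions (check `yc vpc security-group create --help`).

```shell
# Sketch: one security group covering service traffic, outgoing HTTPS
# (the "to all addresses" variant), and NTP. Names are placeholders.
yc vpc security-group create \
  --name dataproc-sg \
  --network-name <network_name> \
  --rule "direction=ingress,from-port=0,to-port=65535,protocols=any,predefined=self_security_group" \
  --rule "direction=egress,from-port=0,to-port=65535,protocols=any,predefined=self_security_group" \
  --rule "direction=egress,port=443,protocols=tcp,v4-cidrs=[0.0.0.0/0]" \
  --rule "direction=egress,port=123,protocols=udp,v4-cidrs=[0.0.0.0/0]"
```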
Create a Data Proc cluster
A Data Proc cluster must include a subcluster with a master host and at least one subcluster for data storage or processing.
If you want to create a Data Proc cluster copy, import its configuration to Terraform.
- In the management console, select the folder where you want to create a Data Proc cluster.
- Click Create resource and select Data Proc cluster from the drop-down list.
- Enter a name for the Data Proc cluster in the Cluster name field. The naming requirements are as follows:
  - It must be unique within the folder.
  - It must be from 3 to 63 characters long.
  - It may contain lowercase Latin letters, numbers, and hyphens.
  - The first character must be a letter and the last character cannot be a hyphen.
- Select a suitable image version and the services you want to use in the Data Proc cluster.
  With an image of version `2.0.39` or higher, you can create a lightweight cluster without HDFS and data storage subclusters. In this case, be sure to add one or more data processing subclusters and specify a bucket name.
  Tip
  To use the most recent image version, specify `2.0`.
- Enter the public part of your SSH key in the SSH key field. For information about how to generate and use SSH keys, see the Yandex Compute Cloud documentation.
- Select or create a service account to which you will grant access to the Data Proc cluster. Make sure to assign the `dataproc.agent` role to this service account.
- Select the availability zone for the Data Proc cluster.
- If required, configure the properties of Data Proc cluster components, jobs, and the environment.
- If necessary, specify custom initialization scripts for Data Proc cluster hosts. For each script, specify:
  - URI: Link to the initialization script, using the `https://`, `http://`, `hdfs://`, or `s3a://` scheme.
  - (Optional) Timeout: Script execution timeout, in seconds. If your initialization script runs longer than this time period, it will be terminated.
  - (Optional) Arguments: List of arguments for your initialization script, enclosed in square brackets and separated by commas, e.g., `["arg1","arg2",...,"argN"]`.
- Select the name of a bucket in Object Storage to store job dependencies and results.
- Select a network for the Data Proc cluster.
- Select security groups that have the required permissions.
  Warning
  When you create a Data Proc cluster, security group settings are verified. If the Data Proc cluster cannot operate properly with these settings, a warning will appear. A sample functional configuration is provided above.
- Enable the UI Proxy option to access the web interfaces of Data Proc components.
- Yandex Cloud Logging stores Data Proc cluster logs. Select a log group from the list or create a new one.
  To enable this functionality, assign the `logging.writer` role to the service account of the Data Proc cluster. For more information, see the Cloud Logging documentation.
- Configure Data Proc subclusters: at most one subcluster with a master host (Master) and subclusters for data storage or processing.
  Data storage and data processing subclusters play different roles: data storage components are deployed on storage subclusters, while computing components are deployed on processing subclusters. The storage of a data processing subcluster can only be used to temporarily store the files being processed.
  For each Data Proc subcluster, you can configure:
  - Number of hosts.
  - Host class: Platform and computing resources available to the host.
  - Storage size and type.
  - Subnet of the network where the Data Proc cluster resides.
    In the subnet, you need to set up a NAT gateway for the Data Proc subcluster with a master host. For more information, see Configure a network.
  - To access Data Proc subcluster hosts from the internet, select Public access. In this case, you can only connect to Data Proc subcluster hosts over SSL. For more information, see Connecting to a Data Proc cluster.
    Warning
    After you create a Data Proc cluster, you cannot request or disable public access to a subcluster. However, you can delete the Data Proc subcluster for data processing and create it again with the public access settings you need.
- In Data Proc subclusters for data processing, you can specify autoscaling parameters.
  Note
  To enable automatic scaling, assign the following roles to the cluster service account:
  - `dataproc.editor`
  - `dataproc.agent`
  To set up autoscaling:
  - Under **Add subcluster**, click Add.
  - In the Roles field, select `COMPUTENODE`.
  - Under Scaling, enable the Autoscaling setting.
  - Set autoscaling parameters.
  - The default metric used for autoscaling is `yarn.cluster.containersPending`. To enable scaling based on CPU usage, disable the Default scaling setting and specify the target CPU utilization level.
  - Click Add.
- If required, configure additional settings of the Data Proc cluster:
  Deletion protection manages protection of the Data Proc cluster from accidental deletion by a user. Enabled protection will not prevent a manual connection to the Data Proc cluster for deleting data.
- Click Create cluster.
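The cluster naming rules above can also be checked locally before you submit the form or a CLI command. This is a sketch: `is_valid_cluster_name` is a hypothetical helper, and the regular expression is our reading of the documented rules (uniqueness within the folder is not checked here), not an official validator.

```shell
#!/usr/bin/env bash
# Hypothetical helper: check a cluster name against the documented rules
# (3-63 characters; lowercase Latin letters, digits, hyphens; starts with
# a letter; does not end with a hyphen).
is_valid_cluster_name() {
  [[ "$1" =~ ^[a-z][a-z0-9-]{2,62}$ && ! "$1" =~ -$ ]]
}

is_valid_cluster_name "my-dataproc" && echo "my-dataproc: ok"
is_valid_cluster_name "My-Cluster"  || echo "My-Cluster: rejected (uppercase)"
is_valid_cluster_name "db-"         || echo "db-: rejected (trailing hyphen)"
```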
If you do not have the Yandex Cloud command line interface yet, install and initialize it.
The folder specified in the CLI profile is used by default. You can specify a different folder using the `--folder-name` or `--folder-id` parameter.
To create a Data Proc cluster:
- Check whether the folder has any subnets for the Data Proc cluster hosts:

  ```bash
  yc vpc subnet list
  ```

  If there are no subnets in the folder, create the required subnets in Yandex Virtual Private Cloud.

- View the description of the CLI command for creating a Data Proc cluster:

  ```bash
  yc dataproc cluster create --help
  ```

- Specify Data Proc cluster parameters in the create command (the list of supported parameters in the example is not exhaustive):
  ```bash
  yc dataproc cluster create <cluster_name> \
    --bucket=<bucket_name> \
    --zone=<availability_zone> \
    --service-account-name=<service_account_name> \
    --version=<image_version> \
    --services=<list_of_components> \
    --ssh-public-keys-file=<path_to_public_SSH_key> \
    --subcluster name=<name_of_subcluster_with_master_host>,role=masternode,resource-preset=<host_class>,disk-type=<storage_type>,disk-size=<storage_size_in_GB>,subnet-name=<subnet_name>,assign-public-ip=<public_access_to_subcluster_host> \
    --subcluster name=<name_of_data_storage_subcluster>,role=datanode,resource-preset=<host_class>,disk-type=<storage_type>,disk-size=<storage_size_in_GB>,subnet-name=<subnet_name>,hosts-count=<number_of_hosts>,assign-public-ip=<public_access_to_subcluster_host> \
    --deletion-protection=<cluster_deletion_protection> \
    --ui-proxy=<access_to_component_web_interfaces> \
    --log-group-id=<log_group_ID> \
    --security-group-ids=<list_of_security_group_IDs>
  ```
Note
The Data Proc cluster name must be unique within the folder. It may contain Latin letters, numbers, hyphens, and underscores. The name may be up to 63 characters long.
Where:

- `--bucket`: Name of the bucket in Object Storage that will store job dependencies and results. The service account of the Data Proc cluster must have `READ and WRITE` permissions for this bucket.
- `--zone`: Availability zone where the Data Proc cluster hosts will reside.
- `--service-account-name`: Name of the Data Proc cluster service account. Make sure to assign the `dataproc.agent` role to this service account.
- `--version`: Image version.
  With an image of version `2.0.39` or higher, you can create a lightweight cluster without HDFS and data storage subclusters. In this case, be sure to add one or more data processing subclusters and specify a bucket name.
  Tip
  To use the most recent image version, set the `--version` parameter value to `2.0`.
- `--services`: List of components that you want to use in the Data Proc cluster. If this parameter is omitted, the default set will be used: `hdfs`, `yarn`, `mapreduce`, `tez`, and `spark`.
- `--ssh-public-keys-file`: Full path to the file with the public part of the SSH key for access to the Data Proc cluster hosts. For information about how to generate and use SSH keys, see the Yandex Compute Cloud documentation.
- `--subcluster`: Parameters of Data Proc subclusters:
  - `name`: Data Proc subcluster name.
  - `role`: Data Proc subcluster role: `masternode`, `datanode`, or `computenode`.
  - `resource-preset`: Host class.
  - `disk-type`: Storage type (`network-ssd`, `network-hdd`, or `network-ssd-nonreplicated`).
  - `disk-size`: Storage size in GB.
  - `subnet-name`: Name of the subnet.
  - `hosts-count`: Number of hosts in the Data Proc subclusters for data storage or processing. The minimum value is `1` and the maximum value is `32`.
  - `assign-public-ip`: Access to Data Proc subcluster hosts from the internet, either `true` or `false`. If access is enabled, you can only connect to the Data Proc cluster over SSL. For more information, see Connecting to a Data Proc cluster.
    Warning
    After you create a Data Proc cluster, you cannot request or disable public access to a subcluster. However, you can delete the Data Proc subcluster for data processing and create it again with the public access settings you need.
- `--deletion-protection`: Deletion protection of the Data Proc cluster, either `true` or `false`.
  Cluster deletion protection will not prevent a manual connection to the cluster to delete data.
- `--ui-proxy`: Access to Data Proc component web interfaces, either `true` or `false`.
- `--log-group-id`: Log group ID.
- `--security-group-ids`: List of security group IDs.

To create a Data Proc cluster with multiple data storage or processing subclusters, provide the required number of `--subcluster` arguments in the `cluster create` command:

```bash
yc dataproc cluster create <cluster_name> \
  ... \
  --subcluster <subcluster_parameters> \
  --subcluster <subcluster_parameters> \
  ...
```
- To enable autoscaling in Data Proc subclusters for data processing, specify the following parameters:
  ```bash
  yc dataproc cluster create <cluster_name> \
    ... \
    --subcluster name=<subcluster_name>,role=computenode,...,hosts-count=<minimum_number_of_hosts>,max-hosts-count=<maximum_number_of_hosts>,preemptible=<use_preemptible_VMs>,warmup-duration=<VM_warmup_time>,stabilization-duration=<stabilization_period>,measurement-duration=<utilization_measurement_interval>,cpu-utilization-target=<target_CPU_utilization_level>,autoscaling-decommission-timeout=<decommissioning_timeout>
  ```
Where:

- `hosts-count`: Minimum number of hosts (VMs) in the Data Proc subcluster. The minimum value is `1` and the maximum value is `32`.
- `max-hosts-count`: Maximum number of hosts (VMs) in the Data Proc subcluster. The minimum value is `1` and the maximum value is `100`.
- `preemptible`: Indicates if preemptible VMs are used, either `true` or `false`.
- `warmup-duration`: Time required to warm up a VM instance, in `<value>s` format. The minimum value is `0s` and the maximum value is `600s` (10 minutes).
- `stabilization-duration`: Interval during which the required number of instances cannot be decreased, in `<value>s` format. The minimum value is `60s` (1 minute) and the maximum value is `1800s` (30 minutes).
- `measurement-duration`: Period for which utilization measurements are averaged for each instance, in `<value>s` format. The minimum value is `60s` (1 minute) and the maximum value is `600s` (10 minutes).
- `cpu-utilization-target`: Target CPU utilization level, in %. Use this setting to enable scaling based on CPU utilization. Otherwise, `yarn.cluster.containersPending` (the number of pending resources) will be used as the metric. The minimum value is `10` and the maximum value is `100`.
- `autoscaling-decommission-timeout`: Decommissioning timeout in seconds. The minimum value is `0` and the maximum value is `86400` (24 hours).
Note
To enable automatic scaling, assign the following roles to the cluster service account:
- `dataproc.editor`
- `dataproc.agent`
- To create a Data Proc subcluster residing on groups of dedicated hosts, specify their IDs separated by commas in the `--host-group-ids` parameter:

  ```bash
  yc dataproc cluster create <cluster_name> \
    ... \
    --host-group-ids=<IDs_of_groups_of_dedicated_hosts>
  ```
Alert
You cannot edit this setting after you create a cluster. The use of dedicated hosts significantly affects cluster pricing.
- To configure Data Proc cluster hosts using initialization scripts, specify them in one or multiple `--initialization-action` parameters:

  ```bash
  yc dataproc cluster create <cluster_name> \
    ... \
    --initialization-action uri=<initialization_script_URI>,timeout=<script_execution_timeout>,args=["arg1","arg2","arg3",...]
  ```

  Where:
  - `uri`: Link to the initialization script, using the `https://`, `http://`, `hdfs://`, or `s3a://` scheme.
  - (Optional) `timeout`: Script execution timeout, in seconds. If your initialization script runs longer than this time period, it will be terminated.
  - (Optional) `args`: Comma-separated arguments with which the initialization script must be executed.
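For illustration, the script referenced by `uri` might look like this. It is a hypothetical example: the upload location, the environment label, and the argument handling are placeholders you would adapt to your own setup.

```shell
#!/usr/bin/env bash
# Hypothetical initialization script, e.g. uploaded to an Object Storage
# bucket and referenced via s3a://. It runs on each cluster host; positional
# arguments come from the `args` parameter of --initialization-action.
set -euo pipefail

ENV_NAME="${1:-default}"   # arg1: environment label (placeholder)

echo "initializing host for environment: ${ENV_NAME}"
```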
Terraform
For more information about the provider resources, see the Terraform provider documentation.
If you change the configuration files, Terraform automatically detects which part of your configuration is already deployed and what should be added or removed.
To create a Data Proc cluster:
- Using the command line, navigate to the folder that will contain the Terraform configuration files with an infrastructure plan. Create the directory if it does not exist.
- If you don't have Terraform yet, install it and configure the Yandex Cloud provider.
- Create a configuration file describing the cloud network and subnets.
The Data Proc cluster resides in a cloud network. If you already have a suitable network, you do not need to describe it again.
Data Proc cluster hosts reside in subnets of the selected cloud network. If you already have suitable subnets, you do not need to describe them again.
Example structure of a configuration file that describes a cloud network with a single subnet:
  ```hcl
  resource "yandex_vpc_network" "test_network" {
    name = "<network_name>"
  }

  resource "yandex_vpc_subnet" "test_subnet" {
    name           = "<subnet_name>"
    zone           = "<availability_zone>"
    network_id     = yandex_vpc_network.test_network.id
    v4_cidr_blocks = ["<subnet>"]
  }
  ```
- Create a configuration file describing the service account to access the Data Proc cluster, as well as the static key and the Object Storage bucket to store jobs and results.
  ```hcl
  resource "yandex_iam_service_account" "data_proc_sa" {
    name        = "<service_account_name>"
    description = "<service_account_description>"
  }

  resource "yandex_resourcemanager_folder_iam_member" "dataproc" {
    folder_id = "<folder_ID>"
    role      = "dataproc.agent"
    member    = "serviceAccount:${yandex_iam_service_account.data_proc_sa.id}"
  }

  resource "yandex_resourcemanager_folder_iam_member" "bucket-creator" {
    folder_id = "<folder_ID>"
    role      = "dataproc.editor"
    member    = "serviceAccount:${yandex_iam_service_account.data_proc_sa.id}"
  }

  resource "yandex_iam_service_account_static_access_key" "sa_static_key" {
    service_account_id = yandex_iam_service_account.data_proc_sa.id
  }

  resource "yandex_storage_bucket" "data_bucket" {
    depends_on = [
      yandex_resourcemanager_folder_iam_member.bucket-creator
    ]
    bucket     = "<bucket_name>"
    access_key = yandex_iam_service_account_static_access_key.sa_static_key.access_key
    secret_key = yandex_iam_service_account_static_access_key.sa_static_key.secret_key
  }
  ```
- Create a configuration file describing the Data Proc cluster and its subclusters.
If required, here you can also specify the properties of the Data Proc cluster components, jobs, and the environment.
Below is an example of a configuration file structure that describes a Data Proc cluster consisting of a subcluster with a master host, a data storage subcluster, and a data processing subcluster:
  ```hcl
  resource "yandex_dataproc_cluster" "data_cluster" {
    bucket              = "<bucket_name>"
    name                = "<cluster_name>"
    description         = "<cluster_description>"
    service_account_id  = yandex_iam_service_account.data_proc_sa.id
    zone_id             = "<availability_zone>"
    security_group_ids  = ["<list_of_security_group_IDs>"]
    deletion_protection = <cluster_deletion_protection>

    cluster_config {
      version_id = "<image_version>"

      hadoop {
        services = ["<list_of_components>"] # Example of the list: ["HDFS", "YARN", "SPARK", "TEZ", "MAPREDUCE", "HIVE"].
        properties = {
          "<component_property>" = <value>
          ...
        }
        ssh_public_keys = [
          file("<path_to_public_SSH_key>")
        ]
      }

      subcluster_spec {
        name = "<name_of_subcluster_with_master_host>"
        role = "MASTERNODE"
        resources {
          resource_preset_id = "<host_class>"
          disk_type_id       = "<storage_type>"
          disk_size          = <storage_size_in_GB>
        }
        subnet_id   = yandex_vpc_subnet.test_subnet.id
        hosts_count = 1
      }

      subcluster_spec {
        name = "<name_of_data_storage_subcluster>"
        role = "DATANODE"
        resources {
          resource_preset_id = "<host_class>"
          disk_type_id       = "<storage_type>"
          disk_size          = <storage_size_in_GB>
        }
        subnet_id   = yandex_vpc_subnet.test_subnet.id
        hosts_count = <number_of_subcluster_hosts>
      }

      subcluster_spec {
        name = "<name_of_data_processing_subcluster>"
        role = "COMPUTENODE"
        resources {
          resource_preset_id = "<host_class>"
          disk_type_id       = "<storage_type>"
          disk_size          = <storage_size_in_GB>
        }
        subnet_id   = yandex_vpc_subnet.test_subnet.id
        hosts_count = <number_of_subcluster_hosts>
      }
    }
  }
  ```
Where `deletion_protection` is the deletion protection of the Data Proc cluster, either `true` or `false`. Cluster deletion protection will not prevent a manual connection to the cluster to delete data.

With an image of version `2.0.39` or higher, you can create a lightweight cluster without HDFS and data storage subclusters. In this case, be sure to add one or more data processing subclusters and specify a bucket name.

Tip
To use the most recent image version, set the `version_id` parameter to `2.0`.

To access web interfaces of Data Proc components, add the `ui_proxy` field set to `true` to the Data Proc cluster description:

```hcl
resource "yandex_dataproc_cluster" "data_cluster" {
  ...
  ui_proxy = true
  ...
}
```
To configure autoscaling parameters in Data Proc subclusters for data processing, add the `autoscaling_config` section with the required settings to the `subcluster_spec` description of the relevant subcluster:

```hcl
subcluster_spec {
  name = "<subcluster_name>"
  role = "COMPUTENODE"
  ...
  autoscaling_config {
    max_hosts_count        = <maximum_number_of_VMs_in_group>
    measurement_duration   = <utilization_measurement_interval>
    warmup_duration        = <warmup_time>
    stabilization_duration = <stabilization_period>
    preemptible            = <use_preemptible_VMs>
    cpu_utilization_target = <target_vCPU_utilization_level>
    decommission_timeout   = <decommissioning_timeout>
  }
}
```
Where:

- `max_hosts_count`: Maximum number of hosts (VMs) in the Data Proc subcluster. The minimum value is `1` and the maximum value is `100`.
- `measurement_duration`: Period for which utilization measurements are averaged for each instance, in `<value>s` format. The minimum value is `60s` (1 minute) and the maximum value is `600s` (10 minutes).
- `warmup_duration`: Time required to warm up a VM instance, in `<value>s` format. The minimum value is `0s` and the maximum value is `600s` (10 minutes).
- `stabilization_duration`: Period during which the required number of instances cannot be decreased, in `<value>s` format. The minimum value is `60s` (1 minute) and the maximum value is `1800s` (30 minutes).
- `preemptible`: Indicates if preemptible VMs are used, either `true` or `false`.
- `cpu_utilization_target`: Target CPU utilization level, in %. Use this setting to enable scaling based on CPU utilization. Otherwise, `yarn.cluster.containersPending` (the number of pending resources) will be used as the metric. The minimum value is `10` and the maximum value is `100`.
- `decommission_timeout`: Decommissioning timeout in seconds. The minimum value is `0` and the maximum value is `86400` (24 hours).
For more information about resources you can create using Terraform, see the provider documentation.

- Check that the Terraform configuration files are correct:

  - Using the command line, navigate to the folder that contains the up-to-date Terraform configuration files with an infrastructure plan.
  - Run the command:

    ```bash
    terraform validate
    ```

    If there are errors in the configuration files, Terraform will point them out.

- Create a Data Proc cluster:

  - Run the command to view the planned changes:

    ```bash
    terraform plan
    ```

    If the resource configuration descriptions are correct, the terminal will display a list of the resources to create and their parameters. This is a test step; no resources are actually changed.

  - If you are happy with the planned changes, apply them:

    - Run the command:

      ```bash
      terraform apply
      ```

    - Confirm the update of resources.
    - Wait for the operation to complete.

  All the required resources will be created in the specified folder. You can check resource availability and their settings in the management console.
To create a Data Proc cluster, use the create API method and include the following in the request:
-
ID of the folder where the Data Proc cluster must reside, in the
folderId
parameter. -
Data Proc cluster name in the
name
parameter. -
Data Proc cluster configuration in the
configSpec
parameter, including:-
Image version in the
configSpec.versionId
parameter.Using an image of version
2.0.39
or higher, you can create a lightweight cluster without HDFS and subclusters for data storage. In this case, be sure to add one or more data processing subclusters and specify a bucket name.Tip
To use the most recent image version, specify
2.0
. -
Component list in the
configSpec.hadoop.services
parameter. -
Public part of the SSH key in the
configSpec.hadoop.sshPublicKeys
parameter. -
Settings of the Data Proc subclusters in the
configSpec.subclustersSpec
parameter.
-
-
Availability zone of the Data Proc cluster in the
zoneId
parameter. -
ID of the Data Proc cluster's service account in the
serviceAccountId
parameter. -
Bucket name in the
bucket
parameter. -
IDs of the Data Proc cluster's security groups in the
hostGroupIds
parameter. -
Data Proc cluster deletion protection settings in the
deletionProtection
parameter.Cluster deletion protection will not prevent a manual connection to a cluster to delete data.
To assign a public IP address to all hosts of a Data Proc subcluster, provide the true
value in the configSpec.subclustersSpec.assignPublicIp
parameter.
To create a Data Proc cluster residing on groups of dedicated hosts, provide the list of the host group IDs in the hostGroupIds
parameter.
Alert
You cannot edit this setting after you create a cluster. The use of dedicated hosts significantly affects cluster pricing.
To configure Data Proc cluster hosts using initialization scripts, specify them in one or multiple `configSpec.hadoop.initializationActions` parameters.
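As an illustration only, such a request could be sent with `curl`. Treat everything here as an assumption to verify against the current API reference: the endpoint URL, the field spellings, and the representation of `diskSize` (shown here in bytes) are inferred, not confirmed by this document.

```shell
# Hypothetical sketch of the create request; verify the endpoint and field
# names against the API reference before use. All <...> values are placeholders.
IAM_TOKEN=$(yc iam create-token)

curl -X POST \
  -H "Authorization: Bearer ${IAM_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
        "folderId": "<folder_ID>",
        "name": "<cluster_name>",
        "zoneId": "<availability_zone>",
        "serviceAccountId": "<service_account_ID>",
        "bucket": "<bucket_name>",
        "configSpec": {
          "versionId": "2.0",
          "hadoop": {
            "services": ["SPARK", "YARN"],
            "sshPublicKeys": ["<public_SSH_key>"]
          },
          "subclustersSpec": [{
            "name": "master",
            "role": "MASTERNODE",
            "resources": {
              "resourcePresetId": "s2.micro",
              "diskTypeId": "network-ssd",
              "diskSize": "21474836480"
            },
            "subnetId": "<subnet_ID>"
          }]
        }
      }' \
  https://dataproc.api.cloud.yandex.net/dataproc/v1/clusters
```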
After the Data Proc cluster's status changes to Running, you can connect to the Data Proc subcluster hosts using the specified SSH key.
Create a Data Proc cluster copy
You can create a Data Proc cluster with the settings of another cluster created earlier. To do so, you need to import the configuration of the source Data Proc cluster to Terraform. Thus you can either create an identical copy or use the imported configuration as the baseline and modify it as needed. Importing is a convenient option when the source Data Proc cluster has lots of settings (e.g., it is an HDFS cluster) and you need to create a similar one.
To create a Data Proc cluster copy:
- If you do not have Terraform yet, install it.
- Get the authentication credentials. You can add them to environment variables or specify them later in the provider configuration file.
- Configure and initialize a provider. There is no need to create a provider configuration file manually: you can download it.
- Place the configuration file in a separate working directory and specify the parameter values. If you did not add the authentication credentials to environment variables, specify them in the configuration file.
- In the same working directory, place a file with the `.tf` extension and the following contents:

  ```hcl
  resource "yandex_dataproc_cluster" "old" { }
  ```
- Write the ID of the initial Data Proc cluster to an environment variable:

  ```bash
  export DATAPROC_CLUSTER_ID=<cluster_ID>
  ```

- Import the settings of the initial Data Proc cluster into the Terraform configuration:

  ```bash
  terraform import yandex_dataproc_cluster.old ${DATAPROC_CLUSTER_ID}
  ```

- Get the imported configuration:

  ```bash
  terraform show
  ```

- Copy it from the terminal and paste it into the `.tf` file.
- Place the file in a new `imported-cluster` directory.
Edit the copied configuration so that you can create a new Data Proc cluster from it:
-
Specify the name of the new Data Proc cluster in the
resource
string and thename
parameter. -
Delete the
created_at
,host_group_ids
,id
, andsubcluster_spec.id
parameters. -
Change the SSH key format in the
ssh_public_keys
parameter. Source format:ssh_public_keys = [ <<-EOT <key> EOT, ]
Required format:
ssh_public_keys = [ "<key>" ]
-
(Optional) Make further modifications if you need a customized copy rather than identical one.
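The SSH key format change can also be scripted. Below is a sketch: `fix_ssh_keys` is a hypothetical helper that assumes each key occupies a single line inside its `<<-EOT` block; review the resulting file by hand.

```shell
# Hypothetical helper: rewrite the imported heredoc form of ssh_public_keys
# into the quoted-string form. Assumes one single-line key per heredoc.
fix_ssh_keys() {
  awk '
    /<<-EOT/ {                            # heredoc opener: next line is the key
      getline key
      gsub(/^[ \t]+|[ \t]+$/, "", key)    # strip indentation around the key
      print "  \"" key "\""
      getline                             # drop the closing "EOT," line
      next
    }
    { print }
  '
}

# Demo on an inline snippet (the key value is a placeholder):
printf '%s\n' \
  'ssh_public_keys = [' \
  '  <<-EOT' \
  '    ssh-ed25519 AAAA user@host' \
  '  EOT,' \
  ']' | fix_ssh_keys
```

In practice you would run it over the copied `.tf` file, e.g. `fix_ssh_keys < imported.tf > fixed.tf` (hypothetical file names).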
- In the `imported-cluster` directory, get the authentication credentials.
- In the same directory, configure and initialize a provider. There is no need to create a provider configuration file manually: you can download it.
- Place the configuration file in the `imported-cluster` directory and specify the parameter values. If you did not add the authentication credentials to environment variables, specify them in the configuration file.
- Make sure the Terraform configuration files are correct using this command:

  ```bash
  terraform validate
  ```

  If there are any errors in the configuration files, Terraform will point them out.
- Create the required infrastructure:

  - Run the command to view the planned changes:

    ```bash
    terraform plan
    ```

    If the resource configuration descriptions are correct, the terminal will display a list of the resources to create and their parameters. This is a test step; no resources are actually changed.

  - If you are happy with the planned changes, apply them:

    - Run the command:

      ```bash
      terraform apply
      ```

    - Confirm the update of resources.
    - Wait for the operation to complete.

  All the required resources will be created in the specified folder. You can check resource availability and their settings in the management console.
Example
Creating a lightweight Data Proc cluster for Spark and PySpark jobs

Create a Data Proc cluster to run Spark jobs without HDFS and data storage subclusters, with the following test characteristics:

- Cluster name: `my-dataproc`
- Bucket name: `dataproc-bucket`
- Availability zone: `ru-central1-c`
- Service account: `dataproc-sa`
- Image version: `2.0`
- Components: `SPARK` and `YARN`
- Path to the public part of the SSH key: `/home/username/.ssh/id_rsa.pub`
- A `master` Data Proc subcluster for master hosts and a single `compute` subcluster for processing data, each with:
  - Class: `s2.micro`
  - Network SSD storage (`network-ssd`): 20 GB
  - Subnet: `default-ru-central1-c`
  - Public access: Allowed
- Security group: `enp6saqnq4ie244g67sb`
- Protection against accidental Data Proc cluster deletion: Enabled
Run the following command:
```bash
yc dataproc cluster create my-dataproc \
  --bucket=dataproc-bucket \
  --zone=ru-central1-c \
  --service-account-name=dataproc-sa \
  --version=2.0 \
  --services=SPARK,YARN \
  --ssh-public-keys-file=/home/username/.ssh/id_rsa.pub \
  --subcluster name="master",role=masternode,resource-preset=s2.micro,disk-type=network-ssd,disk-size=20,subnet-name=default-ru-central1-c,assign-public-ip=true \
  --subcluster name="compute",role=computenode,resource-preset=s2.micro,disk-type=network-ssd,disk-size=20,subnet-name=default-ru-central1-c,assign-public-ip=true \
  --security-group-ids=enp6saqnq4ie244g67sb \
  --deletion-protection=true
```
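After the command completes, you can verify the result. A short sketch; `yc dataproc cluster get` and `yc dataproc cluster list` are standard CLI calls, but the output layout may differ by CLI version:

```shell
# Check the new cluster's details and status; wait for it to become Running.
yc dataproc cluster get my-dataproc

# List all Data Proc clusters in the current folder.
yc dataproc cluster list
```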