Creating a Data Proc cluster

Updated at April 26, 2024

Configure a network
Configure security groups
Create a Data Proc cluster
Create a Data Proc cluster copy
Example
- Creating a lightweight Data Proc cluster for Spark and PySpark jobs

To create a Data Proc cluster, a user must be assigned the editor and dataproc.agent roles. For more information, see the role description.

Configure a network

Configure internet access from the subnet to which the Data Proc subcluster with a master host will be connected, e.g., using a NAT gateway. This will enable the Data Proc subcluster to interact with Yandex Cloud services or hosts in other networks.

Configure security groups

Warning

You need to create and configure security groups before creating a Data Proc cluster. If the selected security groups do not have the required rules, Yandex Cloud will block creation of the Data Proc cluster.

Create one or more security groups for service traffic of the Data Proc cluster.
Add rules:
- One rule for inbound and another one for outbound service traffic:
  - Port range: 0-65535
  - Protocol: Any
  - Source/Destination name: Security group
  - Security group: Current
- A separate rule for outgoing HTTPS traffic. This will allow you to use Yandex Object Storage buckets, UI Proxy, and autoscaling of Data Proc clusters.
  
  You can set up this rule using one of the two methods:
  
  To all addresses:
  - Port range: 443
  - Protocol: TCP
  - Destination name: CIDR
  - CIDR blocks: 0.0.0.0/0
  
  To the addresses used by Yandex Cloud:
  - Port range: 443
  - Protocol: TCP
  - Destination name: CIDR
  - CIDR blocks:
    - 84.201.181.26/32: Getting the Data Proc cluster status, running jobs, UI Proxy.
    - 158.160.59.216/32: Monitoring the Data Proc cluster state, autoscaling.
    - 213.180.193.243/32: Access to Object Storage.
- A rule that allows access to NTP servers for time syncing:
  - Port range: 123
  - Protocol: UDP
  - Destination name: CIDR
  - CIDR blocks: 0.0.0.0/0
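
As a rough illustration of how the outbound rules above combine, the following Python sketch models the restricted variant of the HTTPS rule plus the NTP rule, and checks whether traffic to a given address and port would be covered. The `EGRESS_RULES` list and the `egress_allowed` function are hypothetical names used only for this example. The sketch does not call any Yandex Cloud API and does not model the rule that refers to the current security group.

```python
import ipaddress

# Outbound CIDR rules from the list above, as (protocol, port, CIDR) tuples.
# Assumption: only the "addresses used by Yandex Cloud" variant of the
# HTTPS rule is modeled, plus the NTP rule.
EGRESS_RULES = [
    ("TCP", 443, "84.201.181.26/32"),    # cluster status, jobs, UI Proxy
    ("TCP", 443, "158.160.59.216/32"),   # monitoring, autoscaling
    ("TCP", 443, "213.180.193.243/32"),  # Object Storage
    ("UDP", 123, "0.0.0.0/0"),           # NTP time syncing
]


def egress_allowed(protocol, port, address, rules=EGRESS_RULES):
    """Return True if any rule permits outbound traffic to address:port."""
    ip = ipaddress.ip_address(address)
    return any(
        protocol == rule_protocol
        and port == rule_port
        and ip in ipaddress.ip_network(cidr)
        for rule_protocol, rule_port, cidr in rules
    )


print(egress_allowed("TCP", 443, "213.180.193.243"))  # Object Storage address
print(egress_allowed("TCP", 443, "8.8.8.8"))          # address outside the list
```

With the restricted variant, HTTPS traffic to any address outside the three listed ones is not covered, which is the trade-off compared to opening port 443 to 0.0.0.0/0.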

If you plan to use multiple security groups for your Data Proc cluster, allow all traffic between these groups.

Note

You can set more detailed rules for security groups, such as allowing traffic in only specific subnets.

You must configure security groups correctly for all subnets in which the Data Proc cluster hosts will reside.

You can set up security groups after creating a Data Proc cluster to connect to Metastore or Data Proc cluster hosts via the internet or an intermediate VM.

Create a Data Proc cluster

A Data Proc cluster must include a subcluster with a master host and at least one subcluster for data storage or processing.
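
The composition rule above can be checked before a cluster is requested. The sketch below is a minimal, hypothetical pre-check in Python: `check_subcluster_roles` is not part of any Yandex Cloud tool. It assumes the role names `masternode`, `datanode`, and `computenode` used by the CLI, and it treats the master subcluster as required exactly once.

```python
from collections import Counter


def check_subcluster_roles(roles):
    """Check a list of subcluster roles against the composition rule.

    Returns a list of problems; an empty list means the rule is satisfied.
    """
    counts = Counter(role.lower() for role in roles)
    problems = []
    if counts["masternode"] != 1:
        problems.append("exactly one masternode subcluster is required")
    if counts["datanode"] + counts["computenode"] < 1:
        problems.append(
            "at least one datanode or computenode subcluster is required"
        )
    return problems


print(check_subcluster_roles(["masternode", "computenode"]))
print(check_subcluster_roles(["masternode"]))
```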

If you want to create a Data Proc cluster copy, import its configuration to Terraform.

Management console


In the management console, select the folder where you want to create a Data Proc cluster.
Click Create resource and select Data Proc cluster from the drop-down list.
Enter a name for the Data Proc cluster in the Cluster name field. The naming requirements are as follows:
- It must be unique within the folder.
- The name must be from 3 to 63 characters long.
- It may contain lowercase Latin letters, numbers, and hyphens.
- The first character must be a letter and the last character cannot be a hyphen.
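
The naming requirements above, apart from uniqueness within the folder, can be expressed as a single regular expression. The following Python sketch is one possible local check. The names `CLUSTER_NAME_RE` and `is_valid_cluster_name` are illustrative and are not taken from Yandex Cloud tooling.

```python
import re

# 3 to 63 characters: a lowercase Latin letter first, then lowercase
# letters, digits, or hyphens, and no hyphen in the last position.
CLUSTER_NAME_RE = re.compile(r"[a-z][a-z0-9-]{1,61}[a-z0-9]")


def is_valid_cluster_name(name):
    """Return True if the name meets the length and character rules."""
    return CLUSTER_NAME_RE.fullmatch(name) is not None


for candidate in ("my-dataproc", "ab", "1cluster", "cluster-"):
    print(candidate, is_valid_cluster_name(candidate))
```

Uniqueness within the folder cannot be verified locally, so a name that passes this check may still be rejected.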
Select a suitable image version and the services you want to use in the Data Proc cluster.

Using an image of version 2.0.39 or higher, you can create a lightweight cluster without HDFS and subclusters for data storage. In this case, be sure to add one or more data processing subclusters and specify a bucket name.

Tip

To use the most recent image version, specify 2.0.
Enter the public part of your SSH key in the SSH key field. For information about how to generate and use SSH keys, see the Yandex Compute Cloud documentation.
Select or create a service account to which you will grant access to the Data Proc cluster. Make sure to assign the dataproc.agent role to the service account of the Data Proc cluster.
Select the availability zone for the Data Proc cluster.
If required, configure the properties of Data Proc cluster components, jobs, and the environment.
If necessary, specify custom initialization scripts for Data Proc cluster hosts. For each script, specify:
- URI: Link to the initialization script in the https://, http://, hdfs://, or s3a:// scheme.
- (Optional) Timeout: Script execution timeout, in seconds. If your initialization script runs longer than this time period, it will be terminated.
- (Optional) Arguments: List of arguments of your initialization script, enclosed in square brackets [] and separated by commas, such as:
```
["arg1","arg2",...,"argN"]
```
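
If the arguments come from a script or a configuration file, the bracketed list shown above can be produced with a JSON serializer. This is a small Python sketch, assuming the field accepts a compact JSON array of strings as in the example. `format_init_args` is a hypothetical helper name.

```python
import json


def format_init_args(args):
    """Render script arguments as a bracketed, comma-separated list."""
    return json.dumps([str(arg) for arg in args], separators=(",", ":"))


print(format_init_args(["arg1", "arg2", "arg3"]))  # ["arg1","arg2","arg3"]
```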
Select the name of a bucket in Object Storage to store job dependencies and results.
Select a network for the Data Proc cluster.
Select security groups that have the required permissions.

Warning

When you create a Data Proc cluster, security group settings are verified. If the Data Proc cluster cannot operate properly with these settings, a warning will appear. A sample functional configuration is provided above.
Enable the UI Proxy option to access the web interfaces of Data Proc components.
Yandex Cloud Logging stores Data Proc cluster logs. Select a log group from the list or create a new one.

To enable this functionality, assign the logging.writer role to the service account of the Data Proc cluster. For more information, see the Cloud Logging documentation.
Configure Data Proc subclusters: no more than one subcluster with a master host (Master) and subclusters for data storage or processing.

Data storage and data processing subclusters have different roles: you can deploy data storage components on data storage subclusters and computing components on data processing subclusters. Storage on a data processing subcluster is intended only for temporarily storing the files being processed.

For each Data Proc subcluster, you can configure:
- Number of hosts.
- Host class: Platform and computing resources available to the host.
- Storage size and type.
- Subnet of the network where the Data Proc cluster resides.
  
  In the subnet, you need to set up a NAT gateway for the Data Proc subcluster with a master host. For more information, see Configure a network.
- To access Data Proc subcluster hosts from the internet, select Public access. In this case, you can only connect to Data Proc subcluster hosts using SSL. For more information, see Connecting to a Data Proc cluster.
  
  Warning
  
  After you create a Data Proc cluster, you cannot request or disable public access to the subcluster. However, you can delete the Data Proc subcluster for data processing and create it again with the public access settings you need.
In Data Proc subclusters for data processing, you can specify autoscaling parameters.
Note
To enable automatic scaling, assign the following roles to the cluster service account:
- dataproc.editor
- dataproc.agent
1. Under Add subcluster, click Add.
2. In the Roles field, select COMPUTENODE.
3. Under Scaling, enable the Autoscaling setting.
4. Set autoscaling parameters.
5. The default metric used for autoscaling is yarn.cluster.containersPending. To enable scaling based on CPU usage, disable the Default scaling setting and specify the target CPU utilization level.
6. Click Add.
If required, configure additional settings of the Data Proc cluster:
- Deletion protection: Protects the Data Proc cluster from accidental deletion by a user.

  Enabling protection does not prevent a user from manually connecting to the Data Proc cluster and deleting data.
Click Create cluster.

CLI

If you do not have the Yandex Cloud command line interface yet, install and initialize it.

The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.

To create a Data Proc cluster:

Check whether the folder has any subnets for the Data Proc cluster hosts:
```
yc vpc subnet list
```
If there are no subnets in the folder, create the required subnets in Yandex Virtual Private Cloud.
View the description of the CLI command for creating a Data Proc cluster:
```
yc dataproc cluster create --help
```
Specify Data Proc cluster parameters in the create command (the list of supported parameters in the example is not exhaustive):
```
yc dataproc cluster create <cluster_name> \
  --bucket=<bucket_name> \
  --zone=<availability_zone> \
  --service-account-name=<service_account_name> \
  --version=<image_version> \
  --services=<list_of_components> \
  --ssh-public-keys-file=<path_to_public_SSH_key> \
  --subcluster name=<name_of_subcluster_with_master_host>,`
               `role=masternode,`
               `resource-preset=<host_class>,`
               `disk-type=<storage_type>,`
               `disk-size=<storage_size_in_GB>,`
               `subnet-name=<subnet_name>,`
               `assign-public-ip=<public_access_to_subcluster_host> \
  --subcluster name=<name_of_data_storage_subcluster>,`
               `role=datanode,`
               `resource-preset=<host_class>,`
               `disk-type=<storage_type>,`
               `disk-size=<storage_size_in_GB>,`
               `subnet-name=<subnet_name>,`
               `hosts-count=<number_of_hosts>,`
               `assign-public-ip=<public_access_to_subcluster_host> \
  --deletion-protection=<cluster_deletion_protection> \
  --ui-proxy=<access_to_component_web_interfaces> \
  --log-group-id=<log_group_ID> \
  --security-group-ids=<list_of_security_group_IDs>
```
Note

The Data Proc cluster name must be unique within the folder. It may contain Latin letters, numbers, hyphens, and underscores. The name may be up to 63 characters long.

Where:
- --bucket: Name of the bucket in Object Storage that will store job dependencies and results. The service account of the Data Proc cluster must have READ and WRITE permissions for this bucket.
- --zone: Availability zone where the Data Proc cluster hosts will reside.
- --service-account-name: Name of the Data Proc cluster service account. Make sure to assign the dataproc.agent role to the service account of the Data Proc cluster.
- --version: Image version.
  
  Using an image of version 2.0.39 or higher, you can create a lightweight cluster without HDFS and subclusters for data storage. In this case, be sure to add one or more data processing subclusters and specify a bucket name.
  
  Tip
  
  To use the most recent image version, set the --version parameter value to 2.0.
- --services: List of components that you want to use in the Data Proc cluster. If this parameter is omitted, the default set will be used: hdfs, yarn, mapreduce, tez, and spark.
- --ssh-public-keys-file: Full path to the file with the public part of the SSH key for access to the Data Proc cluster hosts. For information about how to generate and use SSH keys, see the Yandex Compute Cloud documentation.
- --subcluster: Parameters of Data Proc subclusters:
  - name: Data Proc subcluster name.
  - role: Data Proc subcluster role: masternode, datanode, or computenode.
  - resource-preset: Host class.
  - disk-type: Storage type (network-ssd, network-hdd, or network-ssd-nonreplicated).
  - disk-size: Storage size in GB.
  - subnet-name: Name of the subnet.
  - hosts-count: Number of hosts in the Data Proc subclusters for data storage or processing. The minimum value is 1 and the maximum value is 32.
  - assign-public-ip: Access to Data Proc subcluster hosts from the internet. It may take either the true or false value. If access is enabled, you can only connect to the Data Proc cluster using SSL. For more information, see Connecting to a Data Proc cluster.
    
    Warning
    
    After you create a Data Proc cluster, you cannot request or disable public access to the subcluster. However, you can delete the Data Proc subcluster for data processing and create it again with the public access settings you need.
- --deletion-protection: Deletion protection of the Data Proc cluster. It may take either the true or false value.
  
  Cluster deletion protection will not prevent a manual connection to a cluster to delete data.
- --ui-proxy: Access to Data Proc component web interfaces. It may take either the true or false value.
- --log-group-id: Log group ID.
- --security-group-ids: List of security group IDs.
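
When the command is generated from a script, the comma-separated value of the --subcluster parameter can be assembled from named settings. The sketch below is a hypothetical Python helper, `subcluster_arg`, which is not part of the yc CLI. It only builds strings and an argument list from the keys described above and does not run anything. The sample values are the ones used in the example at the end of this page.

```python
def subcluster_arg(**params):
    """Join subcluster settings into one comma-separated --subcluster value.

    Keyword names use underscores and are converted to the hyphenated
    keys shown in the command above (for example, disk_size -> disk-size).
    """
    parts = []
    for key, value in params.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append(f"{key.replace('_', '-')}={value}")
    return ",".join(parts)


master = subcluster_arg(
    name="master",
    role="masternode",
    resource_preset="s2.micro",
    disk_type="network-ssd",
    disk_size=20,
    subnet_name="default-ru-central1-c",
    assign_public_ip=True,
)

# Argument list in the form accepted by subprocess.run(); not executed here.
command = ["yc", "dataproc", "cluster", "create", "my-dataproc",
           "--subcluster", master]
print(master)
```

Passing the command as a list rather than a single string avoids shell quoting issues with the commas and equals signs.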
To create a Data Proc cluster with multiple data storage or processing subclusters, provide the required number of --subcluster arguments in the cluster create command:
```
yc dataproc cluster create <cluster_name> \
  ...
  --subcluster <subcluster_parameters> \
  --subcluster <subcluster_parameters> \
  ...
```
To enable autoscaling in Data Proc subclusters for data processing, specify the following parameters:
```
yc dataproc cluster create <cluster_name> \
  ...
  --subcluster name=<subcluster_name>,`
               `role=computenode,`
               `...`
               `hosts-count=<minimum_number_of_hosts>,`
               `max-hosts-count=<maximum_number_of_hosts>,`
               `preemptible=<use_preemptible_VMs>,`
               `warmup-duration=<VM_warmup_time>,`
               `stabilization-duration=<stabilization_period>,`
               `measurement-duration=<utilization_measurement_interval>,`
               `cpu-utilization-target=<target_CPU_utilization_level>,`
               `autoscaling-decommission-timeout=<decommissioning_timeout>
```
Where:
- hosts-count: Minimum number of hosts (VMs) in the Data Proc subcluster. The minimum value is 1 and the maximum value is 32.
- max-hosts-count: Maximum number of hosts (VMs) in the Data Proc subcluster. The minimum value is 1 and the maximum value is 100.
- preemptible: Indicates if preemptible VMs are used. It may take either the true or false value.
- warmup-duration: Time required to warm up a VM instance, in <value>s format. The minimum value is 0s and the maximum value is 600s (10 minutes).
- stabilization-duration: Interval in seconds, during which the required number of instances cannot be decreased, in <value>s format. The minimum value is 60s (1 minute) and the maximum value is 1800s (30 minutes).
- measurement-duration: Period in seconds, for which utilization measurements should be averaged for each instance, in <value>s format. The minimum value is 60s (1 minute) and the maximum value is 600s (10 minutes).
- cpu-utilization-target: Target CPU utilization level, in %. Use this setting to enable scaling based on CPU utilization. Otherwise, yarn.cluster.containersPending will be used as a metric (based on the number of pending resources). The minimum value is 10 and the maximum value is 100.
- autoscaling-decommission-timeout: Decommissioning timeout in seconds. The minimum value is 0 and the maximum value is 86400 (24 hours).
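
The ranges listed above lend themselves to a quick local check before the command is run. The following Python sketch is illustrative only: `AUTOSCALING_LIMITS`, `to_number`, and `check_autoscaling` are hypothetical names, and the comparison of hosts-count against max-hosts-count is an assumption that follows from their meaning rather than a documented rule.

```python
# Allowed ranges from the parameter list above (durations in seconds).
AUTOSCALING_LIMITS = {
    "hosts-count": (1, 32),
    "max-hosts-count": (1, 100),
    "warmup-duration": (0, 600),
    "stabilization-duration": (60, 1800),
    "measurement-duration": (60, 600),
    "cpu-utilization-target": (10, 100),
    "autoscaling-decommission-timeout": (0, 86400),
}


def to_number(value):
    """Accept plain numbers or strings in the <value>s format, e.g. "60s"."""
    if isinstance(value, str):
        return int(value[:-1] if value.endswith("s") else value)
    return value


def check_autoscaling(params):
    """Return a list of out-of-range settings; empty means all values fit."""
    problems = []
    for name, raw in params.items():
        low, high = AUTOSCALING_LIMITS[name]
        if not low <= to_number(raw) <= high:
            problems.append(f"{name}={raw} is outside {low}..{high}")
    # Assumption: the minimum host count should not exceed the maximum.
    if "hosts-count" in params and "max-hosts-count" in params:
        if to_number(params["hosts-count"]) > to_number(params["max-hosts-count"]):
            problems.append("hosts-count exceeds max-hosts-count")
    return problems


print(check_autoscaling({"hosts-count": 1, "max-hosts-count": 10,
                         "stabilization-duration": "30s"}))
```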
Note
To enable automatic scaling, assign the following roles to the cluster service account:
- dataproc.editor
- dataproc.agent
To create a Data Proc subcluster residing on groups of dedicated hosts, specify their IDs separated by commas in the --host-group-ids parameter:
```
yc dataproc cluster create <cluster_name> \
  ...
  --host-group-ids=<IDs_of_groups_of_dedicated_hosts>
```
Alert

You cannot edit this setting after you create a cluster. The use of dedicated hosts significantly affects cluster pricing.
To configure Data Proc cluster hosts using initialization scripts, specify them in one or multiple --initialization-action parameters:
```
yc dataproc cluster create <cluster_name> \
  ...
  --initialization-action uri=<initialization_script_URI>,`
                          `timeout=<script_execution_timeout>,`
                          `args=["arg1","arg2","arg3",...]
```
Where:
- uri: Link to the initialization script in the https://, http://, hdfs://, or s3a:// scheme.
- (Optional) timeout: Script execution timeout, in seconds. If your initialization script runs longer than this time period, it will be terminated.
- (Optional) args: Arguments separated by commas with which an initialization script must be executed.

Terraform

Terraform allows you to quickly create a cloud infrastructure in Yandex Cloud and manage it using configuration files. Configuration files store the infrastructure description in the HashiCorp Configuration Language (HCL). Terraform and its providers are distributed under the Business Source License.

For more information about the provider resources, see the documentation on the Terraform website or mirror website.

If you change the configuration files, Terraform automatically detects which part of your configuration is already deployed, and what should be added or removed.

To create a Data Proc cluster:

Using the command line, navigate to the folder that will contain the Terraform configuration files with an infrastructure plan. Create the directory if it does not exist.
If you don't have Terraform, install it and configure the Yandex Cloud provider.
Create a configuration file describing the cloud network and subnets.

The Data Proc cluster resides in a cloud network. If you already have a suitable network, you do not need to describe it again.

Data Proc cluster hosts reside in subnets of the selected cloud network. If you already have suitable subnets, you do not need to describe them again.

Example structure of a configuration file that describes a cloud network with a single subnet:
```
resource "yandex_vpc_network" "test_network" {
  name = "<network_name>"
}

resource "yandex_vpc_subnet" "test_subnet" {
  name           = "<subnet_name>"
  zone           = "<availability_zone>"
  network_id     = yandex_vpc_network.test_network.id
  v4_cidr_blocks = ["<subnet>"]
}
```

Create a configuration file describing the service account to access the Data Proc cluster, as well as the static key and the Object Storage bucket to store jobs and results.

```
resource "yandex_iam_service_account" "data_proc_sa" {
  name        = "<service_account_name>"
  description = "<service_account_description>"
}

resource "yandex_resourcemanager_folder_iam_member" "dataproc" {
  folder_id = "<folder_ID>"
  role      = "dataproc.agent"
  member    = "serviceAccount:${yandex_iam_service_account.data_proc_sa.id}"
}

resource "yandex_resourcemanager_folder_iam_member" "bucket-creator" {
  folder_id = "<folder_ID>"
  role      = "dataproc.editor"
  member    = "serviceAccount:${yandex_iam_service_account.data_proc_sa.id}"
}

resource "yandex_iam_service_account_static_access_key" "sa_static_key" {
  service_account_id = yandex_iam_service_account.data_proc_sa.id
}

resource "yandex_storage_bucket" "data_bucket" {
  depends_on = [
    yandex_resourcemanager_folder_iam_member.bucket-creator
  ]

  bucket     = "<bucket_name>"
  access_key = yandex_iam_service_account_static_access_key.sa_static_key.access_key
  secret_key = yandex_iam_service_account_static_access_key.sa_static_key.secret_key
}
```

Create a configuration file describing the Data Proc cluster and its subclusters.

If required, here you can also specify the properties of the Data Proc cluster components, jobs, and the environment.

Below is an example of a configuration file structure that describes a Data Proc cluster consisting of a subcluster with a master host, a data storage subcluster, and a data processing subcluster:
```
resource "yandex_dataproc_cluster" "data_cluster" {
  bucket              = "<bucket_name>"
  name                = "<cluster_name>"
  description         = "<cluster_description>"
  service_account_id  = yandex_iam_service_account.data_proc_sa.id
  zone_id             = "<availability_zone>"
  security_group_ids  = ["<list_of_security_group_IDs>"]
  deletion_protection = <cluster_deletion_protection>

  cluster_config {
    version_id = "<image_version>"

    hadoop {
      services   = ["<list_of_components>"]
      # Example of the list: ["HDFS", "YARN", "SPARK", "TEZ", "MAPREDUCE", "HIVE"].
      properties = {
        "<component_property>" = <value>
        ...
      }
      ssh_public_keys = [
        file("<path_to_public_SSH_key>")
      ]
    }

    subcluster_spec {
      name = "<name_of_subcluster_with_master_host>"
      role = "MASTERNODE"
      resources {
        resource_preset_id = "<host_class>"
        disk_type_id       = "<storage_type>"
        disk_size          = <storage_size_in_GB>
      }
      subnet_id   = yandex_vpc_subnet.test_subnet.id
      hosts_count = 1
    }

    subcluster_spec {
      name = "<name_of_data_storage_subcluster>"
      role = "DATANODE"
      resources {
        resource_preset_id = "<host_class>"
        disk_type_id       = "<storage_type>"
        disk_size          = <storage_size_in_GB>
      }
      subnet_id   = yandex_vpc_subnet.test_subnet.id
      hosts_count = <number_of_subcluster_hosts>
    }

    subcluster_spec {
      name = "<name_of_data_processing_subcluster>"
      role = "COMPUTENODE"
      resources {
        resource_preset_id = "<host_class>"
        disk_type_id       = "<storage_type>"
        disk_size          = <storage_size_in_GB>
      }
      subnet_id   = yandex_vpc_subnet.test_subnet.id
      hosts_count = <number_of_subcluster_hosts>
    }
  }
}
```
Where deletion_protection enables or disables protection of the Data Proc cluster against accidental deletion. It may take either the true or false value.

Cluster deletion protection will not prevent a manual connection to a cluster to delete data.

Using an image of version 2.0.39 or higher, you can create a lightweight cluster without HDFS and subclusters for data storage. In this case, be sure to add one or more data processing subclusters and specify a bucket name.

Tip

To use the most recent image version, set the version_id parameter to 2.0.

To access web interfaces of Data Proc components, add the ui_proxy field set to true to the Data Proc cluster description:
```
resource "yandex_dataproc_cluster" "data_cluster" {
  ...
  ui_proxy = true
  ...
}
```
To configure autoscaling parameters in Data Proc subclusters for data processing, add the autoscaling_config section with the required settings to the subcluster_spec description of the relevant subcluster:
```
subcluster_spec {
  name = "<subcluster_name>"
  role = "COMPUTENODE"
  ...
  autoscaling_config {
    max_hosts_count        = <maximum_number_of_VMs_in_group>
    measurement_duration   = <utilization_measurement_interval>
    warmup_duration        = <warmup_time>
    stabilization_duration = <stabilization_period>
    preemptible            = <use_preemptible_VMs>
    cpu_utilization_target = <target_vCPU_utilization_level>
    decommission_timeout   = <decommissioning_timeout>
  }
}
```
Where:
- max_hosts_count: Maximum number of hosts (VMs) in the Data Proc subcluster. The minimum value is 1 and the maximum value is 100.
- measurement_duration: Period, in seconds, for which utilization measurements are averaged for each instance, in <value>s format. The minimum value is 60s (1 minute) and the maximum value is 600s (10 minutes).
- warmup_duration: Time required to warm up a VM instance, in <value>s format. The minimum value is 0s and the maximum value is 600s (10 minutes).
- stabilization_duration: Period, in seconds, during which the required number of instances cannot be decreased, in <value>s format. The minimum value is 60s (1 minute) and the maximum value is 1800s (30 minutes).
- preemptible: Indicates if preemptible VMs are used. It may take either the true or false value.
- cpu_utilization_target: Target CPU utilization level, in %. Use this setting to enable scaling based on CPU utilization. Otherwise, yarn.cluster.containersPending will be used as a metric (based on the number of pending resources). The minimum value is 10 and the maximum value is 100.
- decommission_timeout: Decommissioning timeout in seconds. The minimum value is 0 and the maximum value is 86400 (24 hours).
For more information about resources you can create using Terraform, see the provider documentation.
Check that the Terraform configuration files are correct:
1. Using the command line, navigate to the folder that contains the up-to-date Terraform configuration files with an infrastructure plan.
2. Run the command:
```
terraform validate
```
  If there are errors in the configuration files, Terraform will point to them.
Create a Data Proc cluster:
1. Run the command to view planned changes:
```
terraform plan
```
  If the resource configuration descriptions are correct, the terminal will display a list of the resources to modify and their parameters. This is a test step. No resources are updated.
2. If you are happy with the planned changes, apply them:
  1. Run the command:
```
terraform apply
```
  2. Confirm the update of resources.
  3. Wait for the operation to complete.
All the required resources will be created in the specified folder. You can check resource availability and their settings in the management console.

API

To create a Data Proc cluster, use the create API method and include the following in the request:

ID of the folder where the Data Proc cluster must reside, in the folderId parameter.
Data Proc cluster name in the name parameter.
Data Proc cluster configuration in the configSpec parameter, including:
- Image version in the configSpec.versionId parameter.
  
  Using an image of version 2.0.39 or higher, you can create a lightweight cluster without HDFS and subclusters for data storage. In this case, be sure to add one or more data processing subclusters and specify a bucket name.
  
  Tip
  
  To use the most recent image version, specify 2.0.
- Component list in the configSpec.hadoop.services parameter.
- Public part of the SSH key in the configSpec.hadoop.sshPublicKeys parameter.
- Settings of the Data Proc subclusters in the configSpec.subclustersSpec parameter.
Availability zone of the Data Proc cluster in the zoneId parameter.
ID of the Data Proc cluster's service account in the serviceAccountId parameter.
Bucket name in the bucket parameter.
IDs of the Data Proc cluster's security groups in the securityGroupIds parameter.
Data Proc cluster deletion protection settings in the deletionProtection parameter.

Cluster deletion protection will not prevent a manual connection to a cluster to delete data.

To assign a public IP address to all hosts of a Data Proc subcluster, provide the true value in the configSpec.subclustersSpec.assignPublicIp parameter.

To create a Data Proc cluster residing on groups of dedicated hosts, provide the list of the host group IDs in the hostGroupIds parameter.

Alert

You cannot edit this setting after you create a cluster. The use of dedicated hosts significantly affects cluster pricing.

To configure Data Proc cluster hosts using initialization scripts, specify them in one or multiple configSpec.hadoop.initializationActions parameters.

After the Data Proc cluster's status changes to Running, you can connect to the Data Proc subcluster hosts using the specified SSH key.

Create a Data Proc cluster copy

You can create a Data Proc cluster with the settings of another cluster created earlier. To do so, import the configuration of the source Data Proc cluster to Terraform. This way, you can either create an identical copy or use the imported configuration as a baseline and modify it as needed. Importing is convenient when the source Data Proc cluster has many settings (e.g., it is an HDFS cluster) and you need to create a similar one.

To create a Data Proc cluster copy:

Terraform

If you do not have Terraform yet, install it.
Get the authentication credentials. You can add them to environment variables or specify them later in the provider configuration file.
Configure and initialize a provider. You do not need to create a provider configuration file manually: you can download it.
Place the configuration file in a separate working directory and specify the parameter values. If you did not add the authentication credentials to environment variables, specify them in the configuration file.
In the same working directory, place a file with a .tf extension and the following contents:
```
resource "yandex_dataproc_cluster" "old" { }
```
Write the ID of the initial Data Proc cluster to the environment variable:
```
export DATAPROC_CLUSTER_ID=<cluster_ID>
```
Import the settings of the initial Data Proc cluster into the Terraform configuration:
```
terraform import yandex_dataproc_cluster.old ${DATAPROC_CLUSTER_ID}
```
Get the imported configuration:
```
terraform show
```
Copy it from the terminal and paste it into the .tf file.
Place the file in the new imported-cluster directory.
Edit the copied configuration so that you can create a new Data Proc cluster from it:
- Specify the name of the new Data Proc cluster in the resource string and the name parameter.
- Delete the created_at, host_group_ids, id, and subcluster_spec.id parameters.
- Change the SSH key format in the ssh_public_keys parameter. Source format:
```
ssh_public_keys = [
  <<-EOT
    <key>
  EOT,
]
```
  Required format:
```
ssh_public_keys = [
  "<key>"
]
```
- (Optional) Make further modifications if you need a customized copy rather than an identical one.
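
The SSH key format change described in the list above can be scripted when the imported configuration contains several keys. The following Python sketch is a hypothetical helper that assumes the heredoc layout shown in the source format. The sample key is a placeholder, not a real key.

```python
import re

# Matches a heredoc entry such as:
#   <<-EOT
#     <key>
#   EOT,
HEREDOC_RE = re.compile(r"<<-EOT\s*\n\s*(.*?)\s*\n\s*EOT,?")


def flatten_ssh_keys(config_text):
    """Replace heredoc-style key entries with quoted single-line strings."""
    return HEREDOC_RE.sub(lambda m: f'"{m.group(1)}"', config_text)


imported = (
    "ssh_public_keys = [\n"
    "  <<-EOT\n"
    "    ssh-ed25519 AAAAC3 user@host\n"  # placeholder key
    "  EOT,\n"
    "]"
)
print(flatten_ssh_keys(imported))
```

Review the result before applying it, since an imported configuration may format keys differently from what this pattern expects.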
In the imported-cluster directory, get the authentication data.
In the same directory, configure and initialize a provider. You do not need to create a provider configuration file manually: you can download it.
Place the configuration file in the imported-cluster directory and specify the parameter values. If you did not add the authentication credentials to environment variables, specify them in the configuration file.
Make sure the Terraform configuration files are correct using this command:
```
terraform validate
```
If there are any errors in the configuration files, Terraform will point them out.
Create the required infrastructure:
1. Run the command to view planned changes:
```
terraform plan
```
  If the resource configuration descriptions are correct, the terminal will display a list of the resources to modify and their parameters. This is a test step. No resources are updated.
2. If you are happy with the planned changes, apply them:
  1. Run the command:
```
terraform apply
```
  2. Confirm the update of resources.
  3. Wait for the operation to complete.
All the required resources will be created in the specified folder. You can check resource availability and their settings in the management console.

Example

Creating a lightweight Data Proc cluster for Spark and PySpark jobs

CLI

Create a Data Proc cluster for running Spark jobs, without HDFS or data storage subclusters, with the following test settings:

Cluster name: my-dataproc
Bucket name: dataproc-bucket
Availability zone: ru-central1-c
Service account: dataproc-sa
Image version: 2.0
Components: SPARK and YARN
Path to the public part of the SSH key: /home/username/.ssh/id_rsa.pub
Subclusters: one with a master host and one for data processing, each with the following settings:
- Class: s2.micro
- Network SSD storage (network-ssd): 20 GB
- Subnet: default-ru-central1-c
- Public access: Allowed
Security group: enp6saqnq4ie244g67sb
Protection against accidental Data Proc cluster deletion: Enabled

Run the following command:

```
yc dataproc cluster create my-dataproc \
   --bucket=dataproc-bucket \
   --zone=ru-central1-c \
   --service-account-name=dataproc-sa \
   --version=2.0 \
   --services=SPARK,YARN \
   --ssh-public-keys-file=/home/username/.ssh/id_rsa.pub \
   --subcluster name="master",`
                `role=masternode,`
                `resource-preset=s2.micro,`
                `disk-type=network-ssd,`
                `disk-size=20,`
                `subnet-name=default-ru-central1-c,`
                `assign-public-ip=true \
   --subcluster name="compute",`
                `role=computenode,`
                `resource-preset=s2.micro,`
                `disk-type=network-ssd,`
                `disk-size=20,`
                `subnet-name=default-ru-central1-c,`
                `assign-public-ip=true \
   --security-group-ids=enp6saqnq4ie244g67sb \
   --deletion-protection=true
```
