Data Proc templates

Updated on April 18, 2024

A Data Proc template is a special resource for rapid deployment of Data Proc clusters in DataSphere projects. A template defines the cluster configuration, and DataSphere can reuse it to deploy a cluster any number of times.

To use Data Proc clusters, set the following project parameters:

- Default folder to enable integration with other Yandex Cloud services. A Data Proc cluster will be deployed in this folder, subject to the current cloud quotas. The fee for using the cluster will be debited from your cloud billing account.
- Service account to be used by DataSphere for creating and managing clusters. The service account needs the following roles:
  - dataproc.agent to use Data Proc clusters.
  - dataproc.admin to create clusters from Data Proc templates.
  - vpc.user to use the Data Proc cluster network.
  - iam.serviceAccounts.user to create resources in the folder on behalf of the service account.
- Subnet for DataSphere to communicate with the Data Proc cluster. Since the Data Proc cluster needs to access the internet, make sure to configure a NAT gateway in the subnet.
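
As a quick pre-flight check, the role list above can be compared against the roles a service account actually has. The following is a minimal sketch, not part of any Yandex Cloud SDK: the function name `missing_roles` and the hand-entered list of assigned roles are assumptions made for illustration.

```python
# Roles listed above as required for the DataSphere service account.
REQUIRED_ROLES = {
    "dataproc.agent",
    "dataproc.admin",
    "vpc.user",
    "iam.serviceAccounts.user",
}


def missing_roles(assigned_roles):
    """Return the required roles absent from assigned_roles, sorted by name."""
    return sorted(REQUIRED_ROLES - set(assigned_roles))


if __name__ == "__main__":
    # Hypothetical input: roles copied by hand from the service account page.
    assigned = ["dataproc.agent", "vpc.user"]
    for role in missing_roles(assigned):
        print(f"Missing role: {role}")
```

An empty result means that every role named in this article is present; it does not verify the folder, subnet, or NAT gateway settings.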

Note

If you specified a subnet in the project settings, allocating computing resources may take longer.

Information about Data Proc templates as a resource

The following information is stored about each template:

Resource name
Resource creator
Cluster configuration
Template creation date in UTC format, such as July 18, 2022, 14:23

You can view all Data Proc templates created in your project on the Data Proc resource page. The page also lists all Data Proc clusters available in the project: both temporary clusters based on Data Proc templates and connected clusters deployed in Yandex Data Proc. To view detailed information about a template or cluster, click it.

Specifics of a temporary cluster based on a Data Proc template

To create a cluster from a Data Proc template, activate the template in your project. When running a project in the IDE, DataSphere creates a temporary cluster in the Yandex Cloud folder and subnet specified in the project settings.

DataSphere tracks the cluster's lifetime and automatically deletes the cluster if no computations have run on it for two hours. The cluster is also deleted if you forcibly stop the computations running in the project.
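
The deletion rule can be illustrated with a short sketch. It models only what is stated above (a two-hour idle limit and deletion on a forced stop); the function name, its parameters, and the treatment of the exact two-hour boundary are assumptions, and DataSphere's internal logic may differ.

```python
from datetime import datetime, timedelta, timezone

# Idle period after which a temporary cluster is deleted, as stated above.
IDLE_LIMIT = timedelta(hours=2)


def is_due_for_deletion(last_computation_at, now, force_stopped=False):
    """Return True if the cluster would be deleted under the described rule.

    Assumption: reaching exactly two hours of idle time counts as expired.
    """
    return force_stopped or (now - last_computation_at) >= IDLE_LIMIT


if __name__ == "__main__":
    last_run = datetime(2024, 4, 18, 12, 0, tzinfo=timezone.utc)
    check_time = last_run + timedelta(minutes=90)
    print(is_due_for_deletion(last_run, check_time))
```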

Configurations of temporary clusters

Temporary Data Proc clusters are deployed on Yandex Compute Cloud VMs powered by Intel Cascade Lake (standard-v2).

You can calculate the total disk storage capacity, in GB, required for a cluster configuration using this formula:

<number_of_Data_Proc_hosts> × 256 + 128

| Cluster type | Number of hosts | Disk size | Host parameters |
|:------------:|:---------------:|-----------|-----------------|
| XS | 1 | 384 GB HDD | 4 vCPUs, 16 GB RAM |
| S | 4 | 1152 GB SSD | 4 vCPUs, 16 GB RAM |
| M | 8 | 2176 GB SSD | 16 vCPUs, 64 GB RAM |
| L | 16 | 4224 GB SSD | 16 vCPUs, 64 GB RAM |
| XL | 32 | 8320 GB SSD | 16 vCPUs, 64 GB RAM |
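
The formula can be checked against the table with a few lines of Python. This is an illustrative sketch: the names `HOSTS_BY_CLUSTER_TYPE` and `total_disk_gb` are introduced here and are not part of DataSphere.

```python
# Number of hosts per cluster type, taken from the table above.
HOSTS_BY_CLUSTER_TYPE = {"XS": 1, "S": 4, "M": 8, "L": 16, "XL": 32}


def total_disk_gb(host_count):
    """Total disk capacity in GB: <number_of_Data_Proc_hosts> x 256 + 128."""
    if host_count < 1:
        raise ValueError("a cluster needs at least one host")
    return host_count * 256 + 128


if __name__ == "__main__":
    for cluster_type, hosts in HOSTS_BY_CLUSTER_TYPE.items():
        print(f"{cluster_type}: {hosts} hosts, {total_disk_gb(hosts)} GB")
```

For example, the M configuration has 8 hosts, so it needs 8 × 256 + 128 = 2176 GB, which matches the table.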

Tip

Before running a project with an activated Data Proc template, check that the quotas for creating HDDs or SSDs allow you to create a disk of a sufficient size.

Running temporary clusters created from Data Proc templates is charged separately, according to the Yandex Data Proc pricing policy.

Statuses of temporary Data Proc clusters

DataSphere creates a temporary Data Proc cluster once you open your project in the IDE.

The created cluster appears in the list of available clusters on the Data Proc resource page. A temporary cluster can have one of the following statuses:

STARTING: The cluster is being created.
UP: The cluster has been created and is ready to run calculations.
DOWN: There have been issues while creating the cluster.
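
If you handle these statuses in your own tooling, the three values map naturally onto a small enumeration with a wait loop. The sketch below is hypothetical: `get_status` stands for whatever function you use to read the status shown on the Data Proc resource page, and this article does not describe an API for it.

```python
import time
from enum import Enum


class ClusterStatus(Enum):
    STARTING = "STARTING"  # The cluster is being created.
    UP = "UP"              # The cluster is ready to run calculations.
    DOWN = "DOWN"          # There were issues while creating the cluster.


def wait_until_ready(get_status, poll_seconds=30.0, max_polls=60):
    """Poll get_status() until the cluster leaves STARTING.

    get_status is a hypothetical callable returning "STARTING", "UP", or "DOWN".
    Returns True for UP and False for DOWN; raises TimeoutError if the
    cluster is still STARTING after max_polls checks.
    """
    for _ in range(max_polls):
        status = ClusterStatus(get_status())
        if status is ClusterStatus.UP:
            return True
        if status is ClusterStatus.DOWN:
            return False
        time.sleep(poll_seconds)
    raise TimeoutError("cluster is still STARTING")


if __name__ == "__main__":
    # Simulated status sequence standing in for a real status source.
    simulated = iter(["STARTING", "STARTING", "UP"])
    print(wait_until_ready(lambda: next(simulated), poll_seconds=0))
```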
