Data Proc resource relationships

Written by Yandex Cloud

Improved by Dmitry A.

Updated on April 14, 2024


Data Proc helps implement distributed data storage and processing using the Apache Hadoop service ecosystem.

Resources

The main entity of the service is a cluster. It groups together all resources available in Hadoop, including storage and computing capacity.

Each cluster consists of subclusters. They integrate hosts that perform identical functions:

Subcluster with a master host (masternode), which runs services such as NameNode for HDFS or ResourceManager for YARN. Each cluster can have only one subcluster with a master host.
Data storage subclusters (Data or datanode), which run services such as DataNode for HDFS.
Data processing subclusters (Compute or computenode), which run services such as NodeManager for YARN.

Subclusters of a single cluster must reside in the same cloud network and availability zone.
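The placement rules above can be expressed as a small pre-flight check. The sketch below is illustrative only: `Subcluster`, `validate_cluster`, and the role constants are hypothetical names made up for this example, not part of the Data Proc API. It also assumes a cluster needs exactly one subcluster with a master host, whereas the text only states the upper limit.

```python
from dataclasses import dataclass

# Hypothetical role labels that mirror the three subcluster kinds above.
MASTER, DATA, COMPUTE = "masternode", "datanode", "computenode"


@dataclass(frozen=True)
class Subcluster:
    name: str
    role: str        # one of MASTER, DATA, COMPUTE
    network_id: str  # cloud network the subcluster resides in
    zone: str        # availability zone, e.g. "ru-central1-a"


def validate_cluster(subclusters: list[Subcluster]) -> list[str]:
    """Return rule violations; an empty list means the layout is consistent."""
    errors = []

    # Assumption: exactly one subcluster with a master host is expected.
    masters = [s for s in subclusters if s.role == MASTER]
    if len(masters) != 1:
        errors.append(f"expected 1 master subcluster, found {len(masters)}")

    # All subclusters must share one cloud network and one availability zone.
    if len({s.network_id for s in subclusters}) > 1:
        errors.append("subclusters span more than one cloud network")
    if len({s.zone for s in subclusters}) > 1:
        errors.append("subclusters span more than one availability zone")

    return errors


layout = [
    Subcluster("main", MASTER, "net-1", "ru-central1-a"),
    Subcluster("storage", DATA, "net-1", "ru-central1-a"),
    Subcluster("workers", COMPUTE, "net-1", "ru-central1-a"),
]
print(validate_cluster(layout))
```

A check like this could run before a cluster specification is submitted, so that layout mistakes surface early rather than as a failed creation request.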

Warning

The ru-central1-c availability zone is being discontinued. If your cluster is hosted in this availability zone, create a new cluster and move the workload to it. Learn how to migrate lightweight clusters and HDFS clusters.

Hosts in each subcluster are created with the computing capacity defined by the specified host class. For a list of available host classes and their specs, see Host classes.

VMs for cluster hosts can be hosted on:

Regular Yandex Cloud hosts:

These are physical servers for hosting cluster VMs. These hosts are selected randomly from the available pool of hosts that meet the requirements of the selected subcluster configuration.
Dedicated Yandex Cloud hosts:

These are physical servers that only host your VMs. These VMs can run both the cluster and your other services that support dedicated hosts. The hosts are selected from the dedicated host groups specified when creating the cluster.

This placement option ensures that the VMs are physically isolated. A Data Proc cluster on dedicated hosts has all the features of a regular cluster.

For more information about dedicated hosts, see the Yandex Compute Cloud documentation.

For information about network configuration and network access to clusters, see Networking in Data Proc.

Warning

Changing host properties through the Yandex Compute Cloud interfaces may result in host failure. To change the cluster host settings, use the Data Proc interfaces, such as the management console, CLI, Terraform, or API.

Lightweight clusters

Starting from image version 2.0.39, you can use a lightweight cluster configuration without HDFS and data storage subclusters. For example, such clusters may include only YARN and SPARK. They are faster to create and use host computing resources more efficiently. We recommend using lightweight clusters to run single jobs for processing data in Spark or PySpark.

Benefits of lightweight clusters:

The Spark Driver runs on the subcluster with master hosts. This lets you allocate different resources to the subcluster with master hosts, which runs the Spark Driver, and to the data processing subclusters, which run Spark Executors.
In regular clusters, at least one Spark Driver and one Spark Executor instance run on each data processing subcluster. In lightweight clusters, the Spark Driver can use all free resources of the subcluster with master hosts, while Spark Executors can use all free resources of the data processing subclusters. This improves host performance.

Requirements for using lightweight clusters:

The HDFS component is not selected.
No data storage subclusters are used in a cluster.
The cluster contains one or more data processing subclusters.
The cluster settings specify a bucket in Yandex Object Storage.
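The requirements above, together with the minimum image version mentioned earlier, can be checked programmatically before a cluster is created. The following is a minimal sketch: `ClusterSpec`, its fields, and `lightweight_problems` are hypothetical names chosen for illustration and do not correspond to any Data Proc API object. Only the version number 2.0.39 and the four requirements come from the text.

```python
from dataclasses import dataclass
from typing import Optional

# Minimum image version for lightweight clusters, as stated in the text.
MIN_LIGHTWEIGHT_IMAGE = (2, 0, 39)


def parse_version(version: str) -> tuple[int, ...]:
    """Turn a dotted version such as '2.0.39' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


@dataclass
class ClusterSpec:
    # Hypothetical description of a planned cluster.
    image_version: str
    components: set[str]
    data_subclusters: int
    compute_subclusters: int
    bucket: Optional[str] = None


def lightweight_problems(spec: ClusterSpec) -> list[str]:
    """List the reasons why spec does not qualify as a lightweight cluster."""
    problems = []
    if parse_version(spec.image_version) < MIN_LIGHTWEIGHT_IMAGE:
        problems.append("image version is older than 2.0.39")
    if "HDFS" in {c.upper() for c in spec.components}:
        problems.append("the HDFS component is selected")
    if spec.data_subclusters > 0:
        problems.append("data storage subclusters are present")
    if spec.compute_subclusters < 1:
        problems.append("no data processing subcluster is defined")
    if not spec.bucket:
        problems.append("no Object Storage bucket is specified")
    return problems


spec = ClusterSpec("2.0.39", {"YARN", "SPARK"}, 0, 1, "my-bucket")
print(lightweight_problems(spec))
```

Comparing versions as integer tuples rather than strings avoids ordering errors such as treating "2.0.4" as newer than "2.0.39".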

For more information about resource allocation, see Spark jobs.

Security

Because you can run jobs on a Data Proc cluster without connecting to its hosts over SSH, the cluster writes job execution results to an S3 bucket, where you can review them. The logs are written on behalf of the service account specified during cluster creation. For more information, see Service accounts.

We recommend using at least two separate S3 buckets for a Data Proc cluster:

One for the source data, where the service account has read-only access.
Another one for the operation logs and results, where the service account has full access.

This minimizes the risk of accidentally modifying or deleting source data.
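The two-bucket recommendation above can be captured as a simple review of the access that a cluster's service account is planned to receive. This is only a sketch under stated assumptions: `BucketGrant`, `check_bucket_layout`, and the access labels are hypothetical and do not map to actual Object Storage roles or API calls.

```python
from dataclasses import dataclass

# Hypothetical access labels for this sketch, not Object Storage role names.
READ_ONLY, FULL = "read-only", "full"


@dataclass(frozen=True)
class BucketGrant:
    bucket: str
    purpose: str  # "source" for input data, "output" for logs and results
    access: str   # READ_ONLY or FULL


def check_bucket_layout(grants: list[BucketGrant]) -> list[str]:
    """Return warnings where the plan departs from the recommendation above."""
    warnings = []
    source = [g for g in grants if g.purpose == "source"]
    output = [g for g in grants if g.purpose == "output"]

    if not source or not output:
        warnings.append("need at least one source and one output bucket")
    for g in source:
        if g.access != READ_ONLY:
            warnings.append(f"source bucket {g.bucket!r} should be read-only")
    for g in output:
        if g.access != FULL:
            warnings.append(f"output bucket {g.bucket!r} needs full access")

    # Reusing one bucket for both purposes defeats the separation.
    shared = {g.bucket for g in source} & {g.bucket for g in output}
    if shared:
        warnings.append(f"buckets used for both purposes: {sorted(shared)}")
    return warnings


plan = [
    BucketGrant("raw-data", "source", READ_ONLY),
    BucketGrant("job-output", "output", FULL),
]
print(check_bucket_layout(plan))
```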
