Relationship between Data Proc service resources
Data Proc lets you store and process data in a distributed manner using Apache Hadoop ecosystem services.
The main entity used in the service is a cluster. It groups together all the resources available in Hadoop, including computing and storage capabilities.
Each cluster consists of subclusters. A subcluster groups hosts that perform identical functions:
- A subcluster with master hosts (for example, NameNode for HDFS or ResourceManager for YARN). Each cluster may have only one subcluster with master hosts.
- Subclusters for data storage (for example, DataNode for HDFS).
- Subclusters for data processing (for example, NodeManager for YARN).
All subclusters of a cluster must reside in the same cloud network and availability zone. Learn more about the geo scope of Yandex.Cloud.
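The rules above (at most one subcluster with master hosts, a single network and a single availability zone per cluster) can be illustrated with a small data model. This is a minimal sketch for illustration only and does not use the Data Proc API: the `Role`, `Subcluster`, and `Cluster` names, their fields, and the placeholder values such as `net-1`, `zone-a`, and `example-class` are all assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    MASTER = "master"    # for example, NameNode or ResourceManager
    STORAGE = "storage"  # for example, HDFS DataNode
    COMPUTE = "compute"  # for example, YARN NodeManager


@dataclass
class Subcluster:
    # Hypothetical fields; a real cluster specification may differ.
    name: str
    role: Role
    host_class: str
    network_id: str
    zone_id: str


@dataclass
class Cluster:
    name: str
    subclusters: list = field(default_factory=list)

    def validate(self):
        """Return a list of rule violations (empty if there are none)."""
        errors = []
        masters = [s for s in self.subclusters if s.role is Role.MASTER]
        if len(masters) > 1:
            errors.append("more than one subcluster with master hosts")
        if len({s.network_id for s in self.subclusters}) > 1:
            errors.append("subclusters are in different networks")
        if len({s.zone_id for s in self.subclusters}) > 1:
            errors.append("subclusters are in different availability zones")
        return errors


# A layout that follows all three rules.
ok = Cluster("demo", [
    Subcluster("main", Role.MASTER, "example-class", "net-1", "zone-a"),
    Subcluster("data", Role.STORAGE, "example-class", "net-1", "zone-a"),
    Subcluster("compute", Role.COMPUTE, "example-class", "net-1", "zone-a"),
])

# A layout with two master subclusters spread over two zones.
bad = Cluster("broken", [
    Subcluster("main-1", Role.MASTER, "example-class", "net-1", "zone-a"),
    Subcluster("main-2", Role.MASTER, "example-class", "net-1", "zone-b"),
])

print(ok.validate())
print(bad.validate())
```

The check runs before anything is created, so a misconfigured layout is reported as a list of messages rather than failing on the first problem.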
Hosts in each subcluster are created with the computing resources defined by the specified host class. For a list of available host classes and their characteristics, see Host classes.
For information about network configuration and network access to clusters, see Networks, clusters, and subclusters.
Because you can run jobs on a Data Proc cluster without connecting to its hosts over SSH, the cluster logs job execution results to an S3 bucket, where you can review them conveniently. Logging to the bucket is performed under the service account specified during cluster creation. For more information about the concept, see Service accounts.
We recommend using at least two different S3 buckets for the Data Proc cluster:
- For the source data that the service account has read-only access to.
- For the operation logs and results that the service account has full access to.
This separation reduces the risk of the source data being unexpectedly modified or deleted.
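The two-bucket recommendation can be expressed as a simple pre-flight check. The sketch below is an illustration under stated assumptions, not part of the service: `Access`, `BucketGrant`, `check_bucket_layout`, and the bucket names are hypothetical, and the code only models the intended permissions without calling any storage API.

```python
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    READ_ONLY = "read-only"
    FULL = "full"


@dataclass(frozen=True)
class BucketGrant:
    # What the cluster's service account is allowed to do with a bucket.
    bucket: str
    access: Access


def check_bucket_layout(source, output):
    """Return a list of deviations from the recommended two-bucket layout."""
    problems = []
    if source.bucket == output.bucket:
        problems.append("source data and job output share one bucket")
    if source.access is not Access.READ_ONLY:
        problems.append("service account can modify the source bucket")
    if output.access is not Access.FULL:
        problems.append("service account lacks full access to the output bucket")
    return problems


# Hypothetical bucket names, used only for the example.
recommended = check_bucket_layout(
    BucketGrant("example-source-data", Access.READ_ONLY),
    BucketGrant("example-job-output", Access.FULL),
)
risky = check_bucket_layout(
    BucketGrant("example-shared", Access.FULL),
    BucketGrant("example-shared", Access.FULL),
)

print(recommended)
print(risky)
```

With a single shared bucket, a job that writes or cleans up its results has the same permissions on the source data, which is the situation the recommendation is meant to avoid.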