Relationship between Data Proc service resources

Data Proc lets you store and process data in a distributed manner using Apache Hadoop ecosystem services.

The main entity used in the service is a cluster. It groups together all the resources available in Hadoop, including computing and storage capabilities.

Each cluster consists of subclusters. A subcluster groups hosts that perform the same function:

  • A subcluster with master hosts (for example, NameNode for HDFS or ResourceManager for YARN).

    Note

    Each cluster may have only one subcluster with master hosts.

  • Subclusters for data storage (for example, DataNode for HDFS).

  • Subclusters for data processing (for example, NodeManager for YARN).

All subclusters of a cluster must reside in the same cloud network and availability zone. Learn more about Yandex.Cloud geography.
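The cluster structure and its constraints can be illustrated with a small model. This is a minimal sketch and not the Data Proc API: the class names, fields, and the `validate` method are hypothetical, and the host class, network, and zone values are placeholders. It only checks the two rules stated above: at most one subcluster with master hosts, and a single cloud network and availability zone for all subclusters.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Role(Enum):
    """Function shared by all hosts of a subcluster (hypothetical names)."""
    MASTER = "master"    # e.g. HDFS NameNode, YARN ResourceManager
    DATA = "data"        # e.g. HDFS DataNode
    COMPUTE = "compute"  # e.g. YARN NodeManager


@dataclass
class Subcluster:
    name: str
    role: Role
    host_class: str   # placeholder value; see the list of host classes
    hosts_count: int
    network_id: str
    zone_id: str


@dataclass
class Cluster:
    name: str
    subclusters: List[Subcluster] = field(default_factory=list)

    def validate(self) -> List[str]:
        """Return a list of violated constraints (empty if none)."""
        errors = []
        masters = [s for s in self.subclusters if s.role is Role.MASTER]
        if len(masters) > 1:
            errors.append("more than one subcluster with master hosts")
        if len({s.network_id for s in self.subclusters}) > 1:
            errors.append("subclusters are in different cloud networks")
        if len({s.zone_id for s in self.subclusters}) > 1:
            errors.append("subclusters are in different availability zones")
        return errors
```

For example, a cluster with one master subcluster and one data storage subcluster in the same network and zone yields an empty list from `validate()`, while adding a second master subcluster in another zone yields two errors.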

Hosts in each subcluster are created with the computing resources defined by the specified host class. For a list of available host classes and their characteristics, see Host classes.

For information about network configuration and network access to clusters, see Networks, clusters, and subclusters.

Security

Because jobs can be run on a Data Proc cluster without connecting to its hosts directly over SSH, the cluster writes job execution results to an S3 bucket, where users can access them. Logs are written to the bucket under the service account specified during cluster creation. For more information about this concept, see the Service accounts page.

We recommend using at least two different S3 buckets for the Data Proc cluster:

  1. For the source data that the service account has read-only access to.
  2. For the operation logs and results that the service account has full access to.

This reduces the risk of the source data being unexpectedly modified or deleted.
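The recommended bucket layout can be expressed as a simple check. This is a sketch under stated assumptions: `BucketGrant` and `check_bucket_layout` are hypothetical helpers that describe what the cluster's service account is allowed to do with each bucket. They do not call any storage or access management API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BucketGrant:
    """What the service account may do with a bucket (hypothetical model)."""
    bucket: str
    can_read: bool
    can_write: bool


def check_bucket_layout(source: BucketGrant, output: BucketGrant) -> List[str]:
    """Return the ways a layout deviates from the recommendation above."""
    problems = []
    if source.bucket == output.bucket:
        problems.append("source data and logs/results share one bucket")
    if not source.can_read:
        problems.append("service account cannot read the source bucket")
    if source.can_write:
        problems.append("service account can modify the source bucket")
    if not (output.can_read and output.can_write):
        problems.append("service account lacks full access to the output bucket")
    return problems
```

A layout with a read-only source bucket and a separate, fully accessible bucket for logs and results produces an empty list. Reusing one writable bucket for both roles produces two problems: the buckets are shared, and the source data can be modified.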