
Relationship between Data Proc service resources


Data Proc lets you store and process data in a distributed way using Apache Hadoop ecosystem services.

The main entity used in the service is a cluster. It groups together all the resources available in Hadoop, including computing and storage capabilities.

Each cluster consists of subclusters. A subcluster groups hosts that perform identical functions:

  • A subcluster with master hosts (for example, NameNode for HDFS or ResourceManager for YARN).

    Note

    Each cluster may have only one subcluster with master hosts.

  • Subclusters for data storage (for example, DataNode for HDFS).

  • Subclusters for data processing (for example, NodeManager for YARN).

Subclusters for one cluster must reside in the same cloud network and availability zone. Learn more about Yandex.Cloud geography.

Hosts in each subcluster are created with the computing power defined by the specified host class. For a list of available host classes and their characteristics, see Host classes.
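
The cluster layout described above can be sketched as a small data model. This is an illustrative sketch only, not the Data Proc API: the names Role, Subcluster, Cluster, and validate, as well as the host class, network, and zone values in the usage note, are hypothetical. It checks the two rules stated in the text: a cluster has no more than one subcluster with master hosts, and all subclusters share one cloud network and availability zone.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    # Hypothetical role names chosen for this sketch.
    MASTER = "master"    # e.g. NameNode for HDFS, ResourceManager for YARN
    DATA = "data"        # e.g. DataNode for HDFS
    COMPUTE = "compute"  # e.g. NodeManager for YARN


@dataclass
class Subcluster:
    name: str
    role: Role
    host_class: str   # placeholder identifier, see Host classes
    hosts_count: int
    network: str
    zone: str


@dataclass
class Cluster:
    name: str
    subclusters: list = field(default_factory=list)

    def validate(self):
        """Return a list of rule violations (empty if the layout is valid)."""
        errors = []
        masters = [s for s in self.subclusters if s.role is Role.MASTER]
        if len(masters) > 1:
            errors.append("more than one subcluster with master hosts")
        placements = {(s.network, s.zone) for s in self.subclusters}
        if len(placements) > 1:
            errors.append("subclusters must share one network and availability zone")
        return errors
```

For example, a cluster built from one MASTER, one DATA, and one COMPUTE subcluster in the same network and zone yields an empty list from validate(); adding a second MASTER subcluster in another zone yields both violations.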

For information about network configuration and network access to clusters, see Networks, clusters, and subclusters.

Security

A Data Proc cluster can run jobs without anyone connecting to its hosts over SSH, so, for the user's convenience, the cluster logs job execution results to an S3 bucket. Logging to the bucket is performed under the service account specified during cluster creation. For more information, see Service accounts.

We recommend using at least two different S3 buckets for the Data Proc cluster:

  1. For the source data that the service account has read-only access to.
  2. For the operation logs and results that the service account has full access to.

This reduces the risk of the source data being unexpectedly modified or deleted.
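
The two-bucket recommendation can be expressed as a simple check. The sketch below is hypothetical and does not call any Yandex.Cloud or S3 API: the function check_bucket_layout, the purpose labels "source" and "output", and the access labels "read-only" and "full" are assumptions made for illustration. It takes a description of the buckets the service account can reach and reports where the layout departs from the recommendation.

```python
# Recommended access level of the service account per bucket purpose
# (labels are assumptions made for this sketch).
RECOMMENDED = {"source": "read-only", "output": "full"}


def check_bucket_layout(buckets):
    """Check a mapping of bucket name -> (purpose, access).

    Returns a list of human-readable problems; an empty list means the
    layout follows the two-bucket recommendation.
    """
    problems = []
    seen_purposes = set()
    for name, (purpose, access) in buckets.items():
        seen_purposes.add(purpose)
        expected = RECOMMENDED.get(purpose)
        if expected is not None and access != expected:
            problems.append(
                f"{name}: {purpose} bucket should grant {expected} access, not {access}"
            )
    for purpose in RECOMMENDED:
        if purpose not in seen_purposes:
            problems.append(f"no bucket for purpose '{purpose}'")
    return problems
```

Passing a single bucket that holds source data with full access reports both the overly broad access and the missing bucket for logs and results.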

© 2021 Yandex.Cloud LLC