Component properties

When creating a Data Proc cluster, you can specify properties of cluster components, jobs, and the environment in the following format:

<key>:<value>

The key can either be a simple string or contain a prefix indicating that it belongs to a specific component:

<key prefix>:<key body>:<value>

For example:

hdfs:dfs.replication : 2
hdfs:dfs.blocksize : 1073741824
spark:spark.driver.cores : 1
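
These properties can also be passed programmatically when the cluster is created. The sketch below is a minimal, illustrative example of doing so through the REST API's Cluster create method; the endpoint and field names follow the API reference and should be verified against it, all identifiers are placeholders, and required settings such as the image version, subclusters, and network configuration are omitted.

# Minimal sketch: create a cluster with component properties via the REST API.
# Placeholders and omitted fields must be filled in from the Cluster create reference.
import requests

IAM_TOKEN = "<IAM token>"          # for example, obtained with `yc iam create-token`
FOLDER_ID = "<folder ID>"

body = {
    "folderId": FOLDER_ID,
    "name": "dataproc-demo",
    "zoneId": "ru-central1-a",
    "serviceAccountId": "<service account ID>",
    "configSpec": {
        "hadoop": {
            "services": ["HDFS", "YARN", "SPARK"],
            "sshPublicKeys": ["<public SSH key>"],
            # Component properties in the <key prefix>:<key body> form shown above.
            "properties": {
                "hdfs:dfs.replication": "2",
                "hdfs:dfs.blocksize": "1073741824",
                "spark:spark.driver.cores": "1",
            },
        },
        # versionId, subclustersSpec, and other required fields are omitted here.
    },
}

response = requests.post(
    "https://dataproc.api.cloud.yandex.net/dataproc/v1/clusters",
    json=body,
    headers={"Authorization": f"Bearer {IAM_TOKEN}"},
)
response.raise_for_status()
print(response.json())  # an Operation object for the asynchronous create request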

The available properties are listed in the official documentation for the components:

Prefix             | Path to the configuration file          | Documentation
core               | /etc/hadoop/conf/core-site.xml          | Hadoop
hdfs               | /etc/hadoop/conf/hdfs-site.xml          | HDFS
yarn               | /etc/hadoop/conf/yarn-site.xml          | YARN
mapreduce          | /etc/hadoop/conf/mapred-site.xml        | MapReduce
capacity-scheduler | /etc/hadoop/capacity-scheduler.xml      | CapacityScheduler
resource-type      | /etc/hadoop/conf/resource-types.xml     | ResourceTypes
node-resources     | /etc/hadoop/conf/node-resources.xml     | NodeResources
spark              | /etc/spark/conf/spark-defaults.xml      | Spark
hbase              | /etc/hbase/conf/hbase-site.xml          | HBase
hbase-policy       | /etc/hbase/conf/hbase-policy.xml        | HBase
hive               | /etc/hive/conf/hive-site.xml            | Hive
hivemetastore      | /etc/hive/conf/hivemetastore-site.xml   | Hive Metastore
hiveserver2        | /etc/hive/conf/hiveserver2-site.xml     | HiveServer2
tez                | /etc/tez/conf/tez-site.xml              | Tez 0.9.2 and Tez 0.10.0
zeppelin           | /etc/zeppelin/conf/zeppelin-site.xml    | Zeppelin
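
To make the prefix-to-file mapping concrete, the following illustrative helper (not part of any Data Proc tooling) splits a prefixed key as described above and looks up the configuration file it is written to, using a subset of the paths from the table.

# Illustrative only: resolve which configuration file a prefixed property belongs to.
CONFIG_FILES = {
    "core": "/etc/hadoop/conf/core-site.xml",
    "hdfs": "/etc/hadoop/conf/hdfs-site.xml",
    "yarn": "/etc/hadoop/conf/yarn-site.xml",
    "mapreduce": "/etc/hadoop/conf/mapred-site.xml",
    "hive": "/etc/hive/conf/hive-site.xml",
}

def resolve_property(key: str, value: str):
    """Split '<key prefix>:<key body>' and return (config file, key body, value)."""
    prefix, _, body = key.partition(":")
    return CONFIG_FILES[prefix], body, value

print(resolve_property("hdfs:dfs.replication", "2"))
# ('/etc/hadoop/conf/hdfs-site.xml', 'dfs.replication', '2')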

Settings for running jobs are specified in dedicated properties (see the example after this list):

  • dataproc:version: The version of dataproc-agent, which runs jobs, relays cluster state properties, and proxies the UI. Used for debugging. Default value: latest.
  • dataproc:max-concurrent-jobs: The maximum number of concurrent jobs. Default value: auto (calculated from the min-free-memory-to-enqueue-new-job and job-memory-footprint properties).
  • dataproc:min-free-memory-to-enqueue-new-job: The minimum amount of free memory (in bytes) required to enqueue a new job. Default value: 1073741824 (1 GB).
  • dataproc:job-memory-footprint: The estimated memory footprint of a job on the MASTER cluster node; used to calculate the maximum number of concurrent jobs in the cluster. Default value: 536870912 (512 MB).
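
For example, to raise the concurrency limit and require more free memory before a new job is enqueued, these defaults could be overridden in the same property format (the values here are purely illustrative):

dataproc:max-concurrent-jobs : 10
dataproc:min-free-memory-to-enqueue-new-job : 2147483648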

Setting up Spark for Object Storage

The following settings are available for Apache Spark:

Configuration                   | Default value                                          | Description
fs.s3a.access.key               | —                                                      | Static key ID
fs.s3a.secret.key               | —                                                      | Secret key
fs.s3a.endpoint                 | storage.yandexcloud.net                                | Endpoint for connecting to Object Storage
fs.s3a.signing-algorithm        | Empty value                                            | Signature algorithm
fs.s3a.aws.credentials.provider | org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider  | Credentials provider

For more information, see the Apache Hadoop documentation.
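
Besides setting these keys as cluster-level properties, they can also be supplied per job. The sketch below is a minimal, illustrative PySpark example that passes the same fs.s3a.* settings through Spark's spark.hadoop.* configuration passthrough; the bucket name and static key values are placeholders.

# Minimal PySpark sketch: pass fs.s3a.* settings via spark.hadoop.* options.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("object-storage-example")
    .config("spark.hadoop.fs.s3a.endpoint", "storage.yandexcloud.net")
    .config("spark.hadoop.fs.s3a.access.key", "<static key ID>")
    .config("spark.hadoop.fs.s3a.secret.key", "<secret key>")
    .getOrCreate()
)

# Read a CSV file from an Object Storage bucket over the S3A connector.
df = spark.read.csv("s3a://<bucket name>/path/to/data.csv", header=True)
df.show()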

Installing Python packages

To install additional Python packages, you can use the conda or pip package managers. Pass the package name in the cluster properties as follows:

Package manager | Key                  | Value                                                        | Example
conda           | conda:<package name> | Package version number according to the conda specification | conda:koalas : 1.5.0
pip             | pip:<package name>   | Package version number according to the pip specification   | pip:psycopg2 : 2.7.0
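
As an illustrative sanity check after the cluster is created, a small PySpark job along these lines can confirm that a package installed through the pip property (psycopg2 from the example above) is importable on the executors.

# Illustrative check that a pip-installed package is available on the executors.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("package-check").getOrCreate()

def report_version(_):
    import psycopg2  # installed via the pip:psycopg2 : 2.7.0 property above
    return psycopg2.__version__

print(spark.sparkContext.parallelize([0], numSlices=1).map(report_version).collect())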
