Component properties
When creating a Data Proc cluster, you can specify properties of cluster components, jobs, and the environment in the following format:

`<key>:<value>`

The key can either be a simple string or contain a prefix indicating that it belongs to a specific component:

`<key prefix>:<key body>:<value>`

For example:

- `hdfs:dfs.replication : 2`
- `hdfs:dfs.blocksize : 1073741824`
- `spark:spark.driver.cores : 1`
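To make the prefix scheme concrete, here is an illustrative Python sketch (not the actual Data Proc implementation) of how a prefixed property string resolves to a component configuration file. The file paths come from the table below; the parsing helper itself is hypothetical:

```python
# Illustrative only: map a '<key prefix>:<key body>:<value>' property string
# onto the component configuration file it targets. The file paths are taken
# from the table below; the function itself is a hypothetical helper.
CONFIG_FILES = {
    "hdfs": "/etc/hadoop/conf/hdfs-site.xml",
    "spark": "/etc/spark/conf/spark-defaults.conf",
    # ...remaining prefixes from the table below
}

def parse_property(prop: str) -> tuple[str, str, str]:
    """Split '<prefix>:<key>:<value>' into (config file, key, value)."""
    prefix, key, value = prop.split(":", 2)
    return CONFIG_FILES[prefix], key, value

print(parse_property("hdfs:dfs.replication:2"))
# ('/etc/hadoop/conf/hdfs-site.xml', 'dfs.replication', '2')
```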
The available properties are listed in the official documentation for the components:
Prefix | Path to the configuration file | Documentation
---|---|---
`core` | `/etc/hadoop/conf/core-site.xml` | Hadoop
`hdfs` | `/etc/hadoop/conf/hdfs-site.xml` | HDFS
`yarn` | `/etc/hadoop/conf/yarn-site.xml` | YARN
`mapreduce` | `/etc/hadoop/conf/mapred-site.xml` | MapReduce
`capacity-scheduler` | `/etc/hadoop/conf/capacity-scheduler.xml` | CapacityScheduler
`resource-type` | `/etc/hadoop/conf/resource-types.xml` | ResourceTypes
`node-resources` | `/etc/hadoop/conf/node-resources.xml` | NodeResources
`spark` | `/etc/spark/conf/spark-defaults.conf` | Spark
`hbase` | `/etc/hbase/conf/hbase-site.xml` | HBase
`hbase-policy` | `/etc/hbase/conf/hbase-policy.xml` | HBase
`hive` | `/etc/hive/conf/hive-site.xml` | Hive
`hivemetastore` | `/etc/hive/conf/hivemetastore-site.xml` | Hive Metastore
`hiveserver2` | `/etc/hive/conf/hiveserver2-site.xml` | Hive Server2
`tez` | `/etc/tez/conf/tez-site.xml` | Tez 0.9.2 and Tez 0.10.0
`zeppelin` | `/etc/zeppelin/conf/zeppelin-site.xml` | Zeppelin
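To check which values a running cluster actually applied, a minimal PySpark sketch such as the one below can read them back. `spark.conf` covers properties set with the `spark` prefix; Hadoop-side properties (`core`, `hdfs`, `yarn`, and so on) are reachable through the Hadoop configuration object attached to the Spark context (`_jsc` is an internal but commonly used handle, so treat this as a convenience rather than a stable API):

```python
# Read back effective property values from a running cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A Spark property set via the `spark` prefix:
print(spark.conf.get("spark.driver.cores"))

# A Hadoop property set via the `hdfs` prefix (JVM-backed configuration):
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
print(hadoop_conf.get("dfs.replication"))
```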
Settings for running jobs are specified in special properties:

`dataproc:version`
: The version of the dataproc-agent that runs jobs, reports the cluster state, and proxies the UI. Used for debugging. Default value: `latest`.

`dataproc:max-concurrent-jobs`
: The number of concurrent jobs. Default value: `auto` (calculated from the `min-free-memory-to-enqueue-new-job` and `job-memory-footprint` properties; see the sketch after this list).

`dataproc:min-free-memory-to-enqueue-new-job`
: The minimum amount of free memory (in bytes) required to start a job. Default value: `1073741824` (1 GB).

`dataproc:job-memory-footprint`
: The amount of memory a job is assumed to consume on the `MASTER` cluster node; used to estimate the maximum number of concurrent jobs in the cluster. Default value: `536870912` (512 MB).
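The documentation does not spell out the exact formula behind `auto`, so the following Python sketch is only one plausible reading of how the two memory properties could combine into a job limit; treat it as an assumption, not the dataproc-agent's actual logic.

```python
# Hypothetical sketch: how `auto` might be derived from the two memory
# properties above. This is NOT the dataproc-agent's documented algorithm.
MIN_FREE = 1073741824      # dataproc:min-free-memory-to-enqueue-new-job (1 GB)
JOB_FOOTPRINT = 536870912  # dataproc:job-memory-footprint (512 MB)

def max_concurrent_jobs(master_memory_bytes: int) -> int:
    """Estimate how many jobs fit on the MASTER node at once."""
    # Keep the minimum free-memory reserve, then divide what remains
    # by the assumed per-job footprint.
    usable = master_memory_bytes - MIN_FREE
    return max(1, usable // JOB_FOOTPRINT)

# Example: a master node with 8 GB of RAM allows roughly 14 concurrent jobs.
print(max_concurrent_jobs(8 * 1024**3))  # 14
```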
Setting up Spark for Object Storage
The following settings are available for Apache Spark:
Configuration | Default value | Description
---|---|---
`fs.s3a.access.key` | — | Static key ID
`fs.s3a.secret.key` | — | Secret key
`fs.s3a.endpoint` | `storage.yandexcloud.net` | Endpoint to connect to Object Storage
`fs.s3a.signing-algorithm` | Empty value | Signature algorithm
`fs.s3a.aws.credentials.provider` | `org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider` | Credentials provider
For more information, see the Apache Hadoop documentation.
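As a minimal sketch, the same `fs.s3a.*` settings can also be supplied per job through the Spark session builder; Hadoop options are passed to Spark with the `spark.hadoop.` prefix, and the bucket name and credential placeholders below are assumptions for illustration:

```python
# Configure S3A access to Object Storage for a single Spark session.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Hadoop/S3A options take the `spark.hadoop.` prefix in Spark.
    .config("spark.hadoop.fs.s3a.endpoint", "storage.yandexcloud.net")
    .config("spark.hadoop.fs.s3a.access.key", "<static key ID>")
    .config("spark.hadoop.fs.s3a.secret.key", "<secret key>")
    .config(
        "spark.hadoop.fs.s3a.aws.credentials.provider",
        "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider",
    )
    .getOrCreate()
)

# Read a CSV file from a hypothetical bucket over the s3a:// scheme.
df = spark.read.csv("s3a://my-bucket/data.csv", header=True)
df.show()
```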
Installing Python packages
To install additional Python packages, you can use the conda or pip package managers. Pass the package name in the cluster properties as follows:
Package manager | Key | Value | Example
---|---|---|---
conda | `conda:<package name>` | Package version number according to the conda specification | `conda:koalas : 1.5.0`
pip | `pip:<package name>` | Package version number according to the pip specification | `pip:psycopg2 : 2.7.0`