Yandex Data Proc

A service for processing multi-terabyte datasets with open-source tools such as Apache Spark, Apache Hadoop®, Apache HBase, Apache Hive, Apache Zeppelin, and other Apache® ecosystem services.

Easy to use
You select the size of the cluster, node capacity, and a set of services, and Yandex Data Proc automatically creates and configures Spark and Hadoop clusters and other components. Collaborate by using Zeppelin notebooks and other web apps via UI Proxy.
Low cost
You can run a 10-node Data Proc cluster for just $0.23 per hour. You can also save up to 70% on VM costs by choosing preemptible instances.
Full control of your cluster
You get full control of your cluster with root permissions for each VM. Install your own applications and libraries on running clusters without having to restart them.
Autoscaling (Preview)
Yandex Data Proc uses Instance Groups to automatically increase or decrease the computing resources of compute subclusters based on CPU usage.
Secure data storage
Yandex Data Proc automatically replaces failed nodes, redistributes the load among the remaining nodes, and restarts jobs. The development and operation of Yandex Data Proc meet the requirements of local regulations, GDPR, and ISO industry standards.
Workflow automation
Save time building ETL pipelines, pipelines for training and developing models, and other iterative tasks. The Data Proc operator is already built into Apache Airflow.
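
The CPU-based autoscaling described above can be pictured with a small sketch. This is only an illustration of a generic target-tracking rule and not the algorithm Instance Groups actually uses; the function name `desired_node_count`, the 70% target, and the node limits are all assumptions made for the example.

```python
import math


def desired_node_count(current_nodes, avg_cpu_percent,
                       target_cpu_percent=70.0, min_nodes=1, max_nodes=10):
    """Hypothetical target-tracking rule: pick the node count that would
    bring average CPU usage back to the target, then clamp it to the
    allowed range. All defaults are illustrative assumptions."""
    if current_nodes < 1:
        raise ValueError("current_nodes must be at least 1")
    if not 0 < target_cpu_percent <= 100:
        raise ValueError("target_cpu_percent must be in (0, 100]")
    # Total load is proportional to nodes * average CPU; divide by the target
    # to get the number of nodes needed, rounding up to stay under the target.
    raw = math.ceil(current_nodes * avg_cpu_percent / target_cpu_percent)
    # Never go below the minimum or above the maximum subcluster size.
    return max(min_nodes, min(max_nodes, raw))
```

For example, a 4-node subcluster averaging 90% CPU would be grown, while the same subcluster at 35% CPU would be shrunk. Real autoscalers typically add stabilization and warm-up periods, which this sketch leaves out.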

Implement your projects using Yandex Data Proc

Analyze user behavior

Analyze events using a Hadoop cluster. Use analytics tools to categorize data and identify patterns and trends.

Process data in streaming mode

Process data streams in real time using Apache Spark clusters. Create metrics and save the necessary data slices by integrating Yandex Data Proc with Yandex Object Storage.
Works with: Object Storage
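
As a rough sketch of the streaming scenario above, the following PySpark Structured Streaming job counts events per minute and writes the result to a bucket as Parquet. It rests on several assumptions: the bucket name is a placeholder, the built-in `rate` source stands in for a real event stream, the `metrics/` and `checkpoints/` prefixes are made up, and the cluster is assumed to already have S3-compatible (`s3a://`) access to Object Storage configured.

```python
def object_storage_uri(bucket: str, prefix: str) -> str:
    """Build an s3a:// URI for a bucket and key prefix (hypothetical helper)."""
    return f"s3a://{bucket}/{prefix.strip('/')}"


def run_stream(bucket: str) -> None:
    """Count events per one-minute window and append the counts to the bucket."""
    # Imported here so the helper above can be used without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, window

    spark = SparkSession.builder.appName("events-per-minute").getOrCreate()

    # Assumption: the synthetic "rate" source replaces a real event stream.
    events = (spark.readStream.format("rate")
              .option("rowsPerSecond", 10)
              .load())

    # The watermark lets Spark finalize each window so it can be appended.
    per_minute = (events
                  .withWatermark("timestamp", "1 minute")
                  .groupBy(window(col("timestamp"), "1 minute"))
                  .count())

    query = (per_minute.writeStream
             .format("parquet")
             .outputMode("append")
             .option("path", object_storage_uri(bucket, "metrics/events_per_minute"))
             .option("checkpointLocation",
                     object_storage_uri(bucket, "checkpoints/events_per_minute"))
             .start())
    query.awaitTermination()
```

Calling `run_stream("my-bucket")` from a job submitted to the cluster would start the stream; `my-bucket` is a placeholder for a bucket you own.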

Retrieve, transform, and export data

Describe and process data streams using the Apache Oozie system. Automatically build data marts and business metrics.

We'll take care of most cluster maintenance

Processes compared for Yandex Data Proc and Apache Hadoop self‑installation:
Data access control
Cluster creation and updates
Network configurations
OS and software installation
Image version upgrade
Interfaces for running jobs
Automated scaling
Integration with Yandex.Cloud services
Monitoring tools

Legend: independent control; control on the Yandex.Cloud side

Getting started

Select the necessary computing capacity and Apache® services and create a ready-to-use Data Proc cluster.

Questions and answers

What Apache® services are available in Yandex Data Proc?

Spark, HDFS, YARN, Hive, HBase®, Oozie, Sqoop, Flume, Tez®, and Zeppelin.

Can anyone access my data?

Only you can manage access to your data using Yandex Resource Manager. Databases of different Yandex.Cloud customers are completely isolated from one another.

Get started with Yandex Data Proc

  1. Apache, Apache Hadoop, Apache Spark, and Apache Oozie are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.