Yandex Data Proc

A service for processing multi-terabyte datasets using open source tools such as Apache Spark, Apache Hadoop®, Apache HBase®, Apache Hive, Apache Zeppelin, and other Apache® ecosystem services.

Easy-to-use

You select the size of the cluster, node capacity, and a set of services, and Yandex Data Proc automatically creates and configures Spark and Hadoop clusters and other components. Collaborate by using Zeppelin notebooks and other web apps via UI Proxy.

Low costs

Launch Yandex Data Proc for as little as 18 RUB/hour. Save up to 70% on VMs by choosing preemptible instances.
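To get a feel for how preemptible instances affect the bill, here is a minimal Python sketch of the arithmetic. The per-VM rate, node counts, and the function name `hourly_vm_cost` are hypothetical and used only for illustration; the 70% figure is taken as the upper bound quoted above, so actual savings may be lower.

```python
def hourly_vm_cost(regular_rate: float, node_count: int,
                   preemptible_nodes: int, discount: float = 0.70) -> float:
    """Estimate the hourly VM cost of a cluster that mixes regular and preemptible nodes.

    regular_rate is a hypothetical price per VM per hour (not a published price);
    discount is the assumed preemptible discount, 0.70 being the "up to 70%" bound.
    """
    if not 0 <= preemptible_nodes <= node_count:
        raise ValueError("preemptible_nodes must be between 0 and node_count")
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0 and 1")
    regular_nodes = node_count - preemptible_nodes
    preemptible_rate = regular_rate * (1.0 - discount)
    return regular_nodes * regular_rate + preemptible_nodes * preemptible_rate


# Hypothetical cluster: 5 VMs at 10 RUB/hour each, 3 of them preemptible.
baseline = hourly_vm_cost(10.0, 5, 0)
mixed = hourly_vm_cost(10.0, 5, 3)
print(f"all regular: {baseline:.2f} RUB/hour")            # all regular: 50.00 RUB/hour
print(f"3 preemptible: {mixed:.2f} RUB/hour")             # 3 preemptible: 29.00 RUB/hour
print(f"VM savings: {1 - mixed / baseline:.0%}")          # VM savings: 42%
```

The discount applies only to the VMs that are actually preemptible, so the overall saving depends on how many nodes you are willing to have interrupted.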

Full control of your cluster

You get full control of your cluster with root permissions for each VM. Install your own applications and libraries on running clusters without having to restart them.

Autoscaling (Preview)

Yandex Data Proc uses Instance Groups to automatically scale the computing resources of compute subclusters up or down based on CPU utilization.
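As an illustration of CPU-based scaling in general, the sketch below applies a simple target-tracking rule: resize the subcluster so that average CPU load moves toward a target. This is an assumption made for explanation only, not the actual Instance Groups algorithm; the function name, the 70% target, and the host limits are all hypothetical.

```python
import math


def desired_hosts(current_hosts: int, avg_cpu_percent: float,
                  target_cpu_percent: float = 70.0,
                  min_hosts: int = 1, max_hosts: int = 10) -> int:
    """Return a host count that would bring average CPU close to the target.

    Illustrative target-tracking rule only; all thresholds are hypothetical.
    """
    if current_hosts < 1:
        raise ValueError("current_hosts must be at least 1")
    if target_cpu_percent <= 0:
        raise ValueError("target_cpu_percent must be positive")
    # Total load stays the same, so the host count scales with cpu / target.
    raw = math.ceil(current_hosts * avg_cpu_percent / target_cpu_percent)
    # Never go below the minimum or above the maximum subcluster size.
    return max(min_hosts, min(max_hosts, raw))


print(desired_hosts(4, 90.0))   # 6: overloaded, scale out
print(desired_hosts(4, 35.0))   # 2: underused, scale in
print(desired_hosts(8, 100.0))  # 10: capped at max_hosts
```

Rounding up with `math.ceil` errs on the side of extra capacity, which is usually the safer direction for batch workloads that would otherwise queue.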

Managing table metadata (Preview)

Data Proc allows you to create managed Hive Metastore clusters, which can reduce the probability of failures and losses caused by metadata unavailability.

Task automation

Save time building ETL pipelines, model training and development pipelines, and other iterative tasks. The Data Proc operator is already built into Apache Airflow.

Implement your projects with Data Proc

Primary data storage and preprocessing

Use Hive Metastore to manage table metadata for objects stored in Object Storage buckets. Prepare and clean data, and build full-fledged data warehouses and domain-oriented data marts.

Works with
Object Storage

Analyze user behavior

Analyze events using Hadoop clusters, and use analytics tools to categorize data and identify patterns and trends.

Process data in streaming mode

Process data streams in real time using Apache Spark clusters. Create metrics and save the necessary data slices by integrating Yandex Data Proc with Yandex Object Storage.

Works with
Object Storage
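To make the idea of streaming metrics concrete, here is a minimal plain-Python sketch of tumbling-window aggregation, the kind of per-window metric a streaming job typically computes before saving data slices. It is not Spark code: the event shape (timestamp in seconds, event type) and the function name are hypothetical, and on a real cluster this grouping would be done by Spark itself.

```python
from collections import Counter, defaultdict


def tumbling_window_counts(events, window_seconds=60):
    """Count events per (window start, event type) in fixed, non-overlapping windows.

    events is an iterable of (timestamp_seconds, event_type) pairs;
    this shape is an assumption made for the example.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    counts = defaultdict(Counter)
    for timestamp, event_type in events:
        # Align each event to the start of the window it falls into.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[window_start][event_type] += 1
    # Return plain dicts ordered by window start for easy inspection.
    return {start: dict(counter) for start, counter in sorted(counts.items())}


events = [(0, "view"), (15, "click"), (59, "view"), (60, "view"), (125, "click")]
print(tumbling_window_counts(events))
# {0: {'view': 2, 'click': 1}, 60: {'view': 1}, 120: {'click': 1}}
```

Each window's result is a self-contained slice, which is what makes it convenient to write out window by window to a bucket.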

We'll take care of most cluster maintenance

Processes compared for Yandex Data Proc and Apache Hadoop self‑installation:

Data access control
Cluster creation and updates
Network configurations
OS and software installation
Image version upgrade
Interfaces for running jobs
Automated scaling
Integration with Yandex Cloud services
Monitoring tools

Legend: independent control; control on the Yandex Cloud side

Getting started

Select the necessary computing capacity and Apache® services and create a ready-to-use Data Proc cluster.

FAQ

What Apache® services are available in Yandex Data Proc?

Spark, HDFS, YARN, Hive, HBase®, Oozie, Sqoop, Flume, Tez®, and Zeppelin.

Apache, Apache Hadoop, Apache Spark, and Apache Oozie are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.