Exchanging data with Yandex Managed Service for ClickHouse®
With Data Proc, you can:

- Upload data from Managed Service for ClickHouse® to a Spark DataFrame.
- Export data from a Spark DataFrame to Managed Service for ClickHouse®.
If you no longer need the resources you created, delete them.
Getting started
Prepare the infrastructure. To do this manually (CLI sketches of some of these steps, including the security group, follow the list):
- Create a service account named `dataproc-sa` and assign it the `dataproc.agent` role.
- In Object Storage, create buckets and configure access to them:
  - Create a bucket for the input data and grant the cluster service account `READ` permissions for this bucket.
  - Create a bucket for the processing output and grant the cluster service account `READ and WRITE` permissions for this bucket.
- Create a cloud network named `dataproc-network`.
- Create a subnet in any availability zone in `dataproc-network`.
- Set up a NAT gateway for the subnet you created.
- If you are using security groups, create a security group named `dataproc-sg` in `dataproc-network` and add the following rules to it:
  - One rule for inbound and another one for outbound service traffic:
    - Port range: `0-65535`
    - Protocol: `Any`
    - Source/Destination name: `Security group`
    - Security group: `Current` (`Self`)
  - Rule for outgoing HTTPS traffic:
    - Port range: `443`
    - Protocol: `TCP`
    - Destination name: `CIDR`
    - CIDR blocks: `0.0.0.0/0`
  - Rule for outgoing TCP traffic on port 8443 to access ClickHouse®:
    - Port range: `8443`
    - Protocol: `TCP`
    - Destination name: `CIDR`
    - CIDR blocks: `0.0.0.0/0`
- Create a Data Proc cluster in any suitable host configuration with the following settings:
  - Components: `SPARK`, `YARN`, `HDFS`
  - Service account: `dataproc-sa`
  - Bucket name: the bucket you created for output data
  - Network: `dataproc-network`
  - Security groups: `dataproc-sg`
- Create a Managed Service for ClickHouse® cluster in any suitable configuration with the following settings:
  - Public access to cluster hosts: allowed
  - Database: `db1`
  - User: `user1`
- If you are using security groups in your Managed Service for ClickHouse® cluster, make sure they are set up correctly and allow connecting to the cluster.
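For reference, here is a minimal sketch of the first few manual steps using the yc CLI. It assumes the CLI is installed and authenticated; the folder ID placeholder, the subnet name, the availability zone, and the CIDR range are illustrative values, not prescribed by this tutorial.

```bash
# Create the service account and assign it the dataproc.agent role
yc iam service-account create --name dataproc-sa
yc resource-manager folder add-access-binding <folder_ID> \
  --role dataproc.agent \
  --service-account-name dataproc-sa

# Create the cloud network and a subnet in one availability zone
yc vpc network create --name dataproc-network
yc vpc subnet create --name dataproc-subnet \
  --network-name dataproc-network \
  --zone ru-central1-a \
  --range 10.1.0.0/24

# Create a NAT gateway; you will also need a route table that sends
# 0.0.0.0/0 through this gateway and is attached to the subnet
yc vpc gateway create --name dataproc-nat
```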
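The security group rules can likewise be sketched with the CLI. Treat this as an assumption-laden example rather than the tutorial's own method: the `--rule` key=value syntax and the `self_security_group` predefined target come from the yc CLI's rule specification format and stand for the `Current` (`Self`) group in the console.

```bash
# Create dataproc-sg with the service, HTTPS, and ClickHouse® rules
yc vpc security-group create --name dataproc-sg \
  --network-name dataproc-network \
  --rule "direction=ingress,from-port=0,to-port=65535,protocol=any,predefined=self_security_group" \
  --rule "direction=egress,from-port=0,to-port=65535,protocol=any,predefined=self_security_group" \
  --rule "direction=egress,port=443,protocol=tcp,v4-cidrs=[0.0.0.0/0]" \
  --rule "direction=egress,port=8443,protocol=tcp,v4-cidrs=[0.0.0.0/0]"
```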
To prepare the same infrastructure using Terraform:

- If you do not have Terraform yet, install it.
- Get the authentication credentials. You can add them to environment variables or specify them later in the provider configuration file (a sketch of the environment variables follows this list).
- Configure and initialize a provider. There is no need to create a provider configuration file manually: you can download it.
- Place the configuration file in a separate working directory and specify the parameter values. If you did not add the authentication credentials to environment variables, specify them in the configuration file.
- Download the `data-proc-data-exchange-with-mch.tf` configuration file to the same working directory. This file describes:
  - Network.
  - Subnet.
  - NAT gateway and route table required for Data Proc.
  - Security groups required for the Data Proc and Managed Service for ClickHouse® clusters.
  - Service account required for the Data Proc cluster to run.
  - Service account required to create buckets in Object Storage.
  - Buckets for input and output data.
  - Data Proc cluster.
  - Managed Service for ClickHouse® cluster.
- In the `data-proc-data-exchange-with-mch.tf` file, specify:
  - `folder_id`: cloud folder ID, the same one specified in the provider settings.
  - `input_bucket`: name of the bucket for input data.
  - `output_bucket`: name of the bucket for output data.
  - `dp_ssh_key`: absolute path to the public key for the Data Proc cluster. For more information, see Connecting to a Data Proc host via SSH.
  - `ch_password`: ClickHouse® user password.
- Make sure the Terraform configuration files are correct using this command:

  ```bash
  terraform validate
  ```

  If there are any errors in the configuration files, Terraform will point them out.
- Create the required infrastructure:

  - Run the command to view the planned changes:

    ```bash
    terraform plan
    ```

    If the resource configuration descriptions are correct, the terminal will display a list of the resources to create and their parameters. This is a test step; no resources are created.

  - If you are happy with the planned changes, apply them:

    ```bash
    terraform apply
    ```

    Confirm the creation of the resources and wait for the operation to complete.

  All the required resources will be created in the specified folder. You can check resource availability and their settings in the management console.
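If you go the environment-variable route for the credentials, the Yandex Cloud Terraform provider reads them from `YC_TOKEN`, `YC_CLOUD_ID`, and `YC_FOLDER_ID`. A minimal sketch; run `terraform init` once in the working directory before validating or planning:

```bash
# Credentials for the yandex Terraform provider
export YC_TOKEN=<OAuth_or_IAM_token>
export YC_CLOUD_ID=<cloud_ID>
export YC_FOLDER_ID=<folder_ID>

# Initialize the working directory (downloads the provider plugin)
terraform init
```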
Upload data from Managed Service for ClickHouse®
Create a table in the Managed Service for ClickHouse® cluster
- Connect to the `db1` database of the Managed Service for ClickHouse® cluster as `user1`.
- Add test data to the database. This example uses a simple table with people's names and ages:

  - Create a table:

    ```sql
    CREATE TABLE persons (
        `name` String,
        `age` UInt8
    ) ENGINE = MergeTree()
    ORDER BY `name`;
    ```

  - Populate the table with data:

    ```sql
    INSERT INTO persons VALUES
        ('Anna', 19), ('Michael', 65), ('Alvar', 28), ('Lilith', 50),
        ('Max', 27), ('Jaimey', 34), ('Dmitry', 42), ('Qiang', 19),
        ('Augustyna', 20), ('Maria', 28);
    ```

  - Check the result (a curl sketch for querying over HTTPS follows this list):

    ```sql
    SELECT * FROM persons;
    ```
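As a quick connectivity check, you can query the new table over the ClickHouse® HTTPS interface on port 8443, the same port the Spark jobs below use. A sketch with curl; it assumes you have downloaded the Yandex Cloud CA certificate to a local `CA.pem` file:

```bash
# Read the persons table over the HTTPS interface
curl --cacert CA.pem \
  -H "X-ClickHouse-User: user1" \
  -H "X-ClickHouse-Key: <password_of_user1>" \
  "https://c-<ClickHouse®_cluster_ID>.rw.mdb.yandexcloud.net:8443/?database=db1&query=SELECT%20*%20FROM%20persons"
```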
Transfer the table from Managed Service for ClickHouse®
- Prepare a script file:

  - Create a local file named `ch-to-dataproc.py` and copy the following script to it:

    ```python
    from pyspark.sql import SparkSession

    # Creating a Spark session
    spark = SparkSession.builder.appName("ClickhouseDataproc").getOrCreate()

    # Setting the port and ClickHouse® cluster parameters
    jdbcPort = 8443
    jdbcHostname = "c-<ClickHouse®_cluster_ID>.rw.mdb.yandexcloud.net"
    jdbcDatabase = "db1"
    jdbcUrl = f"jdbc:clickhouse://{jdbcHostname}:{jdbcPort}/{jdbcDatabase}?ssl=true"

    # Transferring the persons table from ClickHouse® to a DataFrame
    df = spark.read.format("jdbc") \
        .option("url", jdbcUrl) \
        .option("user", "user1") \
        .option("password", "<password_of_user1>") \
        .option("dbtable", "persons") \
        .load()

    # Transferring the DataFrame to the bucket for checking
    df.repartition(1).write.mode("overwrite") \
        .csv(path="s3a://<output_bucket_name>/csv", header=True, sep=",")
    ```

  - Specify the following in the script:
    - Managed Service for ClickHouse® cluster ID.
    - Password of `user1`.
    - Output bucket name.

  - Create a `scripts` folder in the input bucket and upload the `ch-to-dataproc.py` file to it.
- Create a PySpark job by specifying the path to the script file (`s3a://<input_bucket_name>/scripts/ch-to-dataproc.py`) in the Main python file field (a CLI sketch follows this list).
- Wait for the job to complete and make sure the output bucket's `csv` folder contains the source table (a check using the AWS CLI also follows this list).
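If you prefer the command line to the console form, the same job can be submitted with the yc CLI. A sketch; the cluster name placeholder and the job name are illustrative values of this example, not defined by the tutorial:

```bash
# Submit the PySpark job to the Data Proc cluster
yc dataproc job create-pyspark \
  --cluster-name <Data_Proc_cluster_name> \
  --name ch-to-dataproc \
  --main-python-file-uri "s3a://<input_bucket_name>/scripts/ch-to-dataproc.py"
```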
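The output can also be checked without the console: the AWS CLI works with Object Storage through the storage.yandexcloud.net endpoint, assuming a static access key for a service account with access to the bucket is configured in your AWS CLI profile.

```bash
# List the CSV files the job wrote to the output bucket
aws --endpoint-url=https://storage.yandexcloud.net \
  s3 ls "s3://<output_bucket_name>/csv/"
```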
Export data to Managed Service for ClickHouse®
- Prepare a script file:

  - Create a local file named `dataproc-to-ch.py` and copy the following script to it:

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql.types import *

    # Creating a Spark session
    spark = SparkSession.builder.appName("DataprocClickhouse").getOrCreate()

    # Creating a data schema
    schema = StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True),
    ])

    # Creating a DataFrame
    df = spark.createDataFrame([
        ("Alim", 19), ("Fred", 65), ("Guanmin", 28), ("Till", 60),
        ("Almagul", 27), ("Mary", 34), ("Dmitry", 42),
    ], schema)

    # Specifying the port and ClickHouse® cluster parameters
    jdbcPort = 8443
    jdbcHostname = "c-<ClickHouse®_cluster_ID>.rw.mdb.yandexcloud.net"
    jdbcDatabase = "db1"
    jdbcUrl = f"jdbc:clickhouse://{jdbcHostname}:{jdbcPort}/{jdbcDatabase}?ssl=true"

    # Transferring the DataFrame to ClickHouse®
    df.write.format("jdbc") \
        .mode("error") \
        .option("url", jdbcUrl) \
        .option("dbtable", "people") \
        .option("createTableOptions", "ENGINE = MergeTree() ORDER BY age") \
        .option("user", "user1") \
        .option("password", "<ClickHouse®_database_password>") \
        .save()
    ```

  - Specify the following in the script:
    - Managed Service for ClickHouse® cluster ID.
    - Password of `user1`.

  - Create a `scripts` folder in the input bucket and upload the `dataproc-to-ch.py` file to it.
- Create a PySpark job by specifying the path to the script file (`s3a://<input_bucket_name>/scripts/dataproc-to-ch.py`) in the Main python file field.
- Wait for the job to complete and make sure the data has been transferred to Managed Service for ClickHouse®:

  - Connect to the `db1` database of the Managed Service for ClickHouse® cluster as `user1`.
  - Run the following query (a curl sketch of the same check follows this list):

    ```sql
    SELECT * FROM people;
    ```

    If the import was successful, the response will contain a table with the data.
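The same verification can be done over the HTTPS interface, for example by counting the exported rows (the `CA.pem` file and the host name pattern are the same assumptions as in the curl sketch above). The script inserts seven rows, so the query should return 7.

```bash
# Count the rows the job exported to the people table
curl --cacert CA.pem \
  -H "X-ClickHouse-User: user1" \
  -H "X-ClickHouse-Key: <ClickHouse®_database_password>" \
  "https://c-<ClickHouse®_cluster_ID>.rw.mdb.yandexcloud.net:8443/?database=db1&query=SELECT%20count()%20FROM%20people"
```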
Delete the resources you created
Some resources are not free of charge. To avoid paying for them, delete the resources you no longer need:
- Delete the objects from the buckets.
- In the terminal window, go to the directory containing the infrastructure plan.
- Delete the `data-proc-data-exchange-with-mch.tf` configuration file.
- Make sure the Terraform configuration files are correct using this command:

  ```bash
  terraform validate
  ```

  If there are any errors in the configuration files, Terraform will point them out.
- Run the command to view the planned changes:

  ```bash
  terraform plan
  ```

  If the resource configuration descriptions are correct, the terminal will display a list of the resources to delete and their parameters. This is a test step; no resources are deleted.

- If you are happy with the planned changes, apply them:

  ```bash
  terraform apply
  ```

  Confirm the deletion of the resources and wait for the operation to complete.

  This will delete all the resources described in the `data-proc-data-exchange-with-mch.tf` configuration file (an alternative using `terraform destroy` follows this list).
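Deleting the configuration file and re-applying, as described above, removes only the resources from that file. As an alternative, Terraform can tear down everything recorded in its state in one step:

```bash
# Destroy all resources tracked in the Terraform state
terraform destroy
```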
ClickHouse® is a registered trademark of ClickHouse, Inc.