Asynchronously replicating data from PostgreSQL to ClickHouse
You can migrate a database from PostgreSQL to ClickHouse using Yandex Data Transfer. To do this:
- Set up the transfer.
- Activate the transfer.
- Test the replication process.
- Select the data from the target.
If you no longer need these resources, delete them.
Before you begin
We'll create all the required resources for the example in Yandex Cloud. Prepare the infrastructure:
-
Create a source Managed Service for PostgreSQL cluster in any applicable configuration with publicly available hosts and the following settings:
- Database name:
db1
. - Username:
pg-user
. - Password:
<password for the source>
.
- Database name:
-
Create a Managed Service for ClickHouse target cluster in any applicable configuration with publicly available hosts and the following settings:
- Number of ClickHouse hosts: At least 2 (to enable replication in the cluster).
- Database name:
db1
. - Username:
ch-user
. - Password:
<password for the target>
.
-
Make sure that the cluster's security groups have been set up correctly and allow connecting to them:
-
Grant the
mdb_replication
role to thepg-user
in the Managed Service for PostgreSQL cluster.
-
If you don't have Terraform, install and configure it.
-
Download the file with provider settings. Place it in a separate working directory and specify the parameter values.
-
Download the postgresql-to-clickhouse.tf configuration file to the same working directory.
This file describes:
- Networks.
- Subnets.
- Security groups for making cluster connections.
- Managed Service for PostgreSQL source cluster.
- Managed Service for ClickHouse target cluster.
- Source endpoint.
- Target endpoint.
- Transfer.
-
In the
postgresql-to-clickhouse.tf
configuration file, specify the PostgreSQL and ClickHouse administrator passwords. -
Run the command
terraform init
in the directory with the configuration file. This command initializes the provider specified in the configuration files and enables you to use the provider resources and data sources. -
Make sure the Terraform configuration files are correct using the command:
terraform validate
If there are errors in the configuration files, Terraform will point to them.
-
Create the required infrastructure:
-
Run the command to view planned changes:
terraform plan
If the resource configuration descriptions are correct, the terminal will display a list of the resources to modify and their parameters. This is a test step. No resources are updated.
-
If you are happy with the planned changes, apply them:
-
Run the command:
terraform apply
-
Confirm the update of resources.
-
Wait for the operation to complete.
-
All the required resources will be created in the specified folder. You can check resource availability and their settings in the management console.
-
Set up the transfer
-
In the
db1
database, create a table namedx_tab
, and populate it with data:CREATE TABLE x_tab ( id NUMERIC PRIMARY KEY, name CHAR(5) ); CREATE INDEX ON x_tab (id); INSERT INTO x_tab (id, name) VALUES (40, 'User1'), (41, 'User2'), (42, 'User3'), (43, 'User4'), (44, 'User5');
-
Create a transfer:
ManuallyUsing Terraform-
Create a source endpoint of the
PostgreSQL
type and specify the cluster connection parameters in it:- Connection type:
Managed Service for PostgreSQL cluster
. - Cluster:
<PostgreSQL source cluster name>
from the drop-down list. - Database:
db1
. - Password:
pg-user
. - Password:
<user password>
.
- Connection type:
-
Create a target endpoint of the
ClickHouse
type and specify the cluster connection parameters in it.- Connection type:
Managed Service for ClickHouse cluster
. - Cluster:
<ClickHouse target cluster name>
from the drop-down list. - Database:
db1
. - User:
ch-user
. - Password:
<user password>
. - Cleanup policy:
DROP
.
- Connection type:
-
Create a transfer of the Snapshot and increment type that will use the created endpoints.
-
Set the
transfer_enabled
variable to1
in thepostgresql-to-clickhouse.tf
file. -
Make sure the Terraform configuration files are correct using the command:
terraform validate
If there are errors in the configuration files, Terraform will point to them.
-
Create the required infrastructure:
-
Run the command to view planned changes:
terraform plan
If the resource configuration descriptions are correct, the terminal will display a list of the resources to modify and their parameters. This is a test step. No resources are updated.
-
If you are happy with the planned changes, apply them:
-
Run the command:
terraform apply
-
Confirm the update of resources.
-
Wait for the operation to complete.
-
-
-
Activate the transfer
-
Activate the transfer and wait for its status to change to Replicating.
-
To check that the transfer has moved the replicated data to the target, connect to the target Yandex Managed Service for ClickHouse cluster and make sure that the
x_tab
table indb1
includes the same columns as thex_tab
table in the source database, as well as the timestamp columns:__data_transfer_commit_time
and__data_transfer_delete_time
:SELECT * FROM db1.x_tab WHERE id = 41;
┌─id─┬──name──┬─── __data-transfer_commit_time─┬───__data-transfer-delete_time─┐ │ 41 │ User2 │ 1633417594957267000 │ 0 │ └────┴────────┴────────────────────────────────┴───────────────────────────────┘
Test the replication process
-
Connect to the source cluster.
-
Delete the row with the
41
ID and edit the row with the42
ID in thex_tab
table of the PostgreSQL source database:DELETE FROM db1.public.x_tab WHERE id = 41; UPDATE db1.public.x_tab SET name = 'Key3' WHERE id = 42;
-
Make sure that you see the changes in the
x_tab
table on the ClickHouse target:SELECT * FROM db1.x_tab WHERE (id >= 41) AND (id <= 42);
┌─id─┬──name──┬─── __data-transfer_commit_time─┬───__data-transfer-delete_time─┐ │ 41 │ User2 │ 1633417594957267000 │ 1675417594957267000 │ │ 42 │ Key3 │ 1675417594957267000 │ 0 │ │ 42 │ User3 │ 1633417594957268000 │ 1675417594957267000 │ └────┴────────┴────────────────────────────────┴───────────────────────────────┘
Select the data from ClickHouse .
For table recovery, the ClickHouse target with replication enabled uses the ReplicatedReplacingMergeTree and the ReplacingMergeTree. The following columns are added automatically to each table:
-
__data_transfer_commit_time
: The time allowed for the row to change to this value, inTIMESTAMP
format. -
__data_transfer_delete_time
: The time allowed for deleting the row, inTIMESTAMP
format, if the row has been deleted from the source. If the row hasn't been deleted,0
is set.The
__data_transfer_commit_time
column is needed for the ReplicatedReplacedMergeTree engine. If a record is deleted or updated, a new row with the value in this column is inserted. The query by a single primary key returns multiple records with different values from the__data_transfer_commit_time
column.
If the transfer status is Replicating, the source data can be added or deleted. To ensure the standard behavior of SQL commands when a primary key points to a single record, add a construction that filters data by the __data_transfer_delete_time
column to your queries to tables moved to ClickHouse. For example, for the x_tab
table:
SELECT * FROM x_tab FINAL
WHERE __data_transfer_delete_time = 0;
To make it easier to run a SELECT
, create a view with filtering by the __data_transfer_delete_time
column and access it. For example, for the x_tab
table:
CREATE VIEW x_tab_view AS SELECT * FROM x_tab FINAL
WHERE __data_transfer_delete_time == 0;
Note
The FINAL
keyword noticeably decreases the query efficiency. Avoid it when working with large tables whenever possible.
Delete the resources you created
If you no longer need these resources, delete them:
-
Make sure that the transfer's status is Completed and delete it.
-
Delete the endpoints and clusters:
ManuallyUsing TerraformIf you created your resources using Terraform:
-
In the terminal window, change to the directory containing the infrastructure plan.
-
Delete the
postgresql-to-clickhouse.tf
configuration file. -
Make sure the Terraform configuration files are correct using the command:
terraform validate
If there are errors in the configuration files, Terraform will point to them.
-
Confirm the update of resources.
-
Run the command to view planned changes:
terraform plan
If the resource configuration descriptions are correct, the terminal will display a list of the resources to modify and their parameters. This is a test step. No resources are updated.
-
If you are happy with the planned changes, apply them:
-
Run the command:
terraform apply
-
Confirm the update of resources.
-
Wait for the operation to complete.
-
All the resources described in the
postgresql-to-clickhouse.tf
configuration file will be deleted. -
-