Migrating data to Managed Service for ClickHouse

To migrate your database to Managed Service for ClickHouse, you need to transfer the data directly, close the old database for writing, and then transfer the load to the database cluster in Yandex.Cloud.

To transfer data to a Managed Service for ClickHouse cluster, you can use Apache ZooKeeper and clickhouse-copier.

Transfer data to an intermediate virtual machine in Compute Cloud if:

  • The Managed Service for ClickHouse cluster isn't accessible from the internet.
  • Your network equipment or connection to the ClickHouse cluster in Yandex.Cloud isn't very reliable.
  • There is no environment to run clickhouse-copier.

Migration stages:

  1. Prepare for migration.
  2. Install ZooKeeper.
  3. Create a cluster Managed Service for ClickHouse.
  4. Create a task for clickhouse-copier.
  5. Add a task for clickhouse-copier in ZooKeeper.
  6. Run clickhouse-copier.

Prepare for migration

  1. Compatible software versions

    • ClickHouse versions must be the same in both clusters.
    • clickhouse-copier version must be the same (or higher) than the ClickHouse version in the Managed Service for ClickHouse cluster.
    • ZooKeeper ver. 3.4.10 and higher.
  2. Check that the source cluster is ready to migrate:

    • SSL is enabled for encrypting traffic.
    • The load on the database or shard that data will be copied from won't createproblems.
    • clickhouse-copier has access to the database, and the account used hasread-only access.
  3. If you are using a virtual machine in Yandex.Cloud for migration:

    • Create your VMs in the same cloud network as the Managed Service for ClickHouse cluster.
    • The computing power of the VM should be chosen based on the amount of data transferred.

Install Zookeeper

To migrate data, start a ZooKeeper node.

  1. Install Java Runtime Environment:

    $ sudo apt-get install default-jre
    
  2. Add the user that you want to run ZooKeeper under:

    $ sudo adduser hadoop
    
  3. Create a directory for ZooKeeper data:

    $ sudo mkdir -p /var/data/zookeeper
    $ sudo chown -R hadoop:hadoop /var/data
    
  4. Install ZooKeeper (single-node setup):

    1. Download a distribution, such as version 3.4.13:

      $ cd /opt
      $ sudo wget http://apache.is.co.za/zookeeper/zookeeper-3.4.13/zookeeper-3.4.13.tar.gz
      $ sudo tar -xvf zookeeper-3.4.13.tar.gz
      $ sudo chown hadoop:hadoop -R  zookeeper-3.4.13
      
    2. Switch to the user you created earlier to launch ZooKeeper:

      $ su hadoop
      
    3. Create the file /opt/zookeeper-3.4.13/conf/zoo.cfg with the following contents:

      tickTime=2000
      dataDir=/var/data/zookeeper
      clientPort=2181
      
    4. The master node must have a unique ID. To do this, create a /var/data/zookeeper/myid file with ID as its contents (for example, "1").

  5. To launch ZooKeeper for debugging:

    $ bash /opt/zookeeper-3.4.13/bin/zkServer.sh start-foreground
    
  6. To launch ZooKeeper as normal:

     $ bash /opt/zookeeper-3.4.13/bin/zkServer.sh start
    

Create a cluster Managed Service for ClickHouse

Make sure that the computing power and storage size of the cluster are appropriate for the environment,where the existing databases are deployed, and create a cluster.

Create a task for clickhouse-copier

To launch clickhouse-copier using ZooKeeper, you need to prepare:

  • A ZooKeeper configuration file (config.xml).
  • A file describing the task (cp-task.xml).

Instructions for clickhouse-copier can be found in the ClickHouse documentation.

Prepare a configuration file for ZooKeeper

The configuration file (config.xml) has to specify:

  • In the <zookeeper> element — The address of the host where you installed ZooKeeper.
  • In the <caConfig> element — The path to the certificate for connecting to Managed Service for ClickHouse.

You can download a certificate at https://storage.yandexcloud.net/cloud-certs/CA.pem.

Sample configuration:

<yandex>
  <zookeeper>
    <node>
      <host>127.0.0.1</host>
      <port>2181</port>
    </node>
  </zookeeper>

  <logger>
    <level>trace</level>
    <log>log.log</log>
    <errorlog>log.err.log</errorlog>
    <size>never</size>
  </logger>

  <openSSL>
    <client>
      <loadDefaultCAFile>true</loadDefaultCAFile>
      <caConfig>CA.pem</caConfig>
      <cacheSessions>true</cacheSessions>
      <disableProtocols>sslv2,sslv3</disableProtocols>
      <preferServerCiphers>true</preferServerCiphers>
      <invalidCertificateHandler>
        <name>RejectCertificateHandler</name>
      </invalidCertificateHandler>
    </client>
  </openSSL>
</yandex>

Describe the task

Example of data migration task description (cp-task.xml):

<yandex>
    <tcp_port_secure>9440</tcp_port_secure> <!-- optional -->
    <remote_servers>
        <source_cluster>
            <shard>
                <internal_replication>false</internal_replication>
                    <replica>
                        <host>your-clickhouse-server.com</host>
                        <port>9440</port>
                        <user>your-user</user>
                        <password>password</password>
                        <secure>1</secure>
                    </replica>
            </shard>
        </source_cluster>

        <destination_cluster>
               <shard>
                <internal_replication>false</internal_replication>
                    <replica>
                        <host>your-cloud-clickhouse.mdb.cloud.yandex.net</host>
                        <port>9440</port>
                        <user>your-user</user>
                        <password>password</password>
                        <secure>1</secure>
                    </replica>
               </shard>
        </destination_cluster>
    </remote_servers>

    <max_workers>2</max_workers>

    <settings_pull>
        <readonly>1</readonly>
    </settings_pull>

    <settings_push>
        <readonly>0</readonly>
    </settings_push>

    <settings>
        <connect_timeout>3</connect_timeout>
        <insert_distributed_sync>1</insert_distributed_sync>
    </settings>

    <tables>
        <table_hits>
            <cluster_pull>source_cluster</cluster_pull>
    	    <database_pull>your-db</database_pull>
            <table_pull>your-table</table_pull>

            <cluster_push>destination_cluster</cluster_push>
            <database_push>your-db</database_push>
            <table_push>your-table</table_push>

            <engine>
                <!-- Description of the table engine, the same as in the CREATE TABLE procedure
                in the source cluster. For an existing table, you can
                get a description with the query SHOW CREATE table_name -->
	            ENGINE = MergeTree() PARTITION BY Year ORDER BY (Year, FlightDate) SETTINGS index_granularity=8192
	        </engine>

    	    <sharding_key>jumpConsistentHash(intHash64(Year), 2)</sharding_key>

            <enabled_partitions>
                <partition>'2017'</partition>
	        </enabled_partitions>
        </table_hits>
    </tables>
</yandex>

Add a task for clickhouse-copier to Zookeeper

To add a task in ZooKeeper, run the following commands:

/opt/zookeeper-3.4.13/bin/zkCli.sh -server localhost:2181 rmr /cp-task.xml/description
/opt/zookeeper-3.4.13/bin/zkCli.sh -server localhost:2181 rmr /cp-task.xml/task_active_workers
/opt/zookeeper-3.4.13/bin/zkCli.sh -server localhost:2181 rmr /cp-task.xml

fc=$(cat ./cp-task.xml)
/opt/zookeeper-3.4.13/bin/zkCli.sh -server localhost:2181 create /cp-task.xml ""
/opt/zookeeper-3.4.13/bin/zkCli.sh -server localhost:2181 create /cp-task.xml/description "$fc"

Launch clickhouse-copier

Important

If you didn't create the folders that you specified in the --base-dir flag or for saving logs, clickhouse-copier might not start.

You can launch copier using the following command (to run it in daemon mode,add a --daemon flag):

$ clickhouse-copier
  --config ./config.xml \
  --task-path ./cp-task.xml \
  --base-dir ./clickhouse \
  --log-level debug

Once the operation is complete, check the logs to make sure that everything wascopied successfully.