Creating ClickHouse clusters
A ClickHouse cluster is one or more database hosts between which replication can be configured.
Warning
When creating a ClickHouse cluster with 2 or more hosts, Managed Service for ClickHouse automatically creates a cluster of 3 ZooKeeper hosts for managing replication and fault tolerance. These hosts are counted when calculating the resource quotas used by the cloud and the cost of the cluster. Read more about replication for ClickHouse.
The number of hosts that can be created with a ClickHouse cluster depends on the storage option selected:
-
When using network drives, you can request any number of hosts (from one to the current quota limit).
-
When using local SSD storage, you must create at least two replicas along with the cluster (a minimum of two replicas is required to ensure fault tolerance). If the available folder resources are still sufficient after creating a cluster, you can add extra replicas.
-
In the management console, select the folder where you want to create a DB cluster.
-
Select Managed Service for ClickHouse.
-
Click Create cluster.
-
Enter the cluster name in the Cluster name field. The cluster name must be unique within the folder.
-
Select the environment where you want to create the cluster (you can't change the environment once the cluster is created):
- PRODUCTION: For stable versions of your apps.
- PRESTABLE: For testing, including the Managed Service for ClickHouse service itself. The Prestable environment is the first to receive new features, improvements, and bug fixes. However, not every update ensures backward compatibility.
-
Select the host class that defines the technical specifications of the VMs where the DB hosts will be deployed. All available options are listed in Host classes. When you change the host class for the cluster, the characteristics of all existing instances change, too.
-
Under Storage size:
- Select the type of storage, either a more flexible network type (**network-hdd** or **network-ssd**) or faster local SSD storage (**local-ssd**). The size of the local storage can only be changed in 100 GB increments.
- Select the size to be used for data and backups. For more information about how backups take up storage space, see Backups.
-
Under Database, specify the DB attributes:
- DB name.
- Username.
- User password. At least 8 characters.
-
Under Hosts, specify the parameters for the database hosts created with the cluster (keep in mind that if you use local SSD storage when creating the ClickHouse cluster, you must set at least two hosts). To change an added host, place the cursor on the host line and click the icon shown on that line.
-
If necessary, configure additional cluster settings:
-
Backup start (UTC): The time in UTC when you want to start creating a backup of a cluster (in 24-hour format). If the time is not set, the backup will start at 22:00 UTC.
-
Maintenance window: Settings for the cluster maintenance window. Use them to specify the preferred start time for cluster host maintenance (for example, a time when the cluster is least loaded with requests):
- To specify the preferred start time for the maintenance window, select by schedule and set the desired day of the week and hour of day in UTC (Coordinated Universal Time) using the drop-down lists.
- To allow maintenance at any time, select arbitrary.
Maintenance may include updating the DBMS version, applying patches, and so on.
-
Access from DataLens: Enable this option to be able to analyze data from the cluster in Yandex DataLens. For more information about setting up a connection, see Connecting to DataLens.
-
Access from the management console: Select this option to be able to execute SQL queries to cluster databases from the Yandex.Cloud management console.
-
Access from Yandex.Metrica and AppMetrica: Enable this option to be able to import data from AppMetrica to the cluster.
-
If necessary, configure the DBMS settings:
-
Geobase uri: Address of the archive with the user geobase in Object Storage.
-
Keep alive timeout: Amount of time in seconds after the last request to ClickHouse, during which the server waits for a new request. If no requests are received during this time, ClickHouse breaks the connection. To learn more, see the ClickHouse documentation.
-
Log level: Event logging level. At each next level, the log will contain complete information from the previous one:
- ERROR: Information about errors in the cluster.
- WARNING: Information about events that may cause errors in the cluster.
- INFORMATION: Confirmations, information about events that don't lead to errors in the cluster.
- DEBUG: System information to be used later in debugging.
- TRACE: All available information about cluster performance.
For more information about log levels, see the ClickHouse documentation.
-
Mark cache size: Approximate size (in bytes) of the mark cache used by MergeTree table engines. The cache is shared across the server, and memory is allocated as needed. To learn more, see the ClickHouse documentation.
-
Max concurrent queries: Maximum number of queries processed simultaneously. To learn more, see the ClickHouse documentation.
-
Max connections: Maximum number of inbound connections. To learn more, see the ClickHouse documentation.
-
Max partition size to drop: Maximum size (in bytes) of a partition in a MergeTree table that you can delete using a DROP query.
-
Max table size to drop: Maximum size (in bytes) of a MergeTree table that you can delete using a DROP query. If 0, you can delete all tables without restrictions. To learn more, see the ClickHouse documentation.
-
Metric log enabled: Enables or disables logging of the history of metric values from the system.metrics and system.events tables to the system.metric_log table. Logging is enabled by default (true).
-
Metric log retention size: The maximum size in bytes that the system.metric_log table can reach before old records start being deleted from it. A value of 0 means that the old records aren't deleted as the table size grows. Default value: 536870912 (0.5 GB).
-
Metric log retention time: The period of time in milliseconds after which a record in the system.metric_log table is deleted. Time is counted as soon as the record is created in the table. A value of 0 means that records aren't deleted when the time elapses. The value must be a multiple of 1000. Default value: 2592000000 (30 days).
-
Part log retention size: The maximum size in bytes that the system.part_log table can reach before old records start being deleted from it. A value of 0 means that the old records aren't deleted as the table size grows. Default value: 536870912 (0.5 GB).
-
Part log retention time: The period of time in milliseconds after which a record in the system.part_log table is deleted. Time is counted as soon as the record is created in the table. A value of 0 means that records aren't deleted when the time elapses. The value must be a multiple of 1000. Default value: 2592000000 (30 days).
-
Query log retention size: The maximum size in bytes that the system.query_log table can reach before old records start being deleted from it. A value of 0 means that the old records aren't deleted as the table size grows. Default value: 1073741824 (1 GB).
-
Query log retention time: The period of time in milliseconds after which a record in the system.query_log table is deleted. Time is counted as soon as the record is created in the table. A value of 0 means that records aren't deleted when the time elapses. The value must be a multiple of 1000. Default value: 2592000000 (30 days).
-
Query thread log enabled: Enables or disables logging of information about the threads that execute requests, such as the name of the thread, time it was started, and how long a request was processed. Logs are written to the system.query_thread_log table. Logging is enabled by default (true).
-
Query thread log retention size: The maximum size in bytes that the system.query_thread_log table can reach before old records start being deleted from it. A value of 0 means that the old records aren't deleted as the table size grows. Default value: 536870912 (0.5 GB).
-
Query thread log retention time: The period of time in milliseconds after which a record in the system.query_thread_log table is deleted. Time is counted as soon as the record is created in the table. A value of 0 means that records aren't deleted when the time elapses. The value must be a multiple of 1000. Default value: 2592000000 (30 days).
-
Text log enabled: Enables or disables writing of system logs to the system.text_log table. Logging is disabled by default (false).
-
Text log level: The level of event logging in the system.text_log table. At each next level, the log will contain complete information from the previous one:
- ERROR: Information about errors in the DBMS.
- WARNING: Information about events that may cause errors in the DBMS.
- INFORMATION: Confirmation and information about events that don't lead to errors in the DBMS.
- DEBUG: System information to be used later in debugging.
- TRACE: All available information about DBMS performance.
-
Text log retention size: The maximum size in bytes that the system.text_log table can reach before old records start being deleted from it. A value of 0 means that the old records aren't deleted as the table size grows. Default value: 536870912 (0.5 GB).
-
Text log retention time: The period of time in milliseconds after which a record in the system.text_log table is deleted. Time is counted as soon as the record is created in the table. A value of 0 means that records aren't deleted when the time elapses. The value must be a multiple of 1000. Default value: 2592000000 (30 days).
-
Timezone: Server time zone. Specified by the IANA identifier as the UTC time zone or geographical location (for example, Africa/Abidjan). For more information, see the ClickHouse documentation.
-
Trace log enabled: Enables or disables logging of stack traces collected by the request profiler to the system.trace_log table. Logging is enabled by default (true).
-
Trace log retention size: The maximum size in bytes that the system.trace_log table can reach before old records start being deleted from it. A value of 0 means that the old records aren't deleted as the table size grows. Default value: 536870912 (0.5 GB).
-
Trace log retention time: The period of time in milliseconds after which a record in the system.trace_log table is deleted. Time is counted as soon as the record is created in the table. A value of 0 means that records aren't deleted when the time elapses. The value must be a multiple of 1000. Default value: 2592000000 (30 days).
-
Uncompressed cache size: Cache size in bytes for uncompressed data used by the MergeTree table engines. To learn more, see the ClickHouse documentation.
-
Compression: Rules for compressing data in MergeTree tables.
- Method: Compression method. Two methods are available: LZ4 and zstd.
- Min part size: Minimum size (in bytes) of a data part in a table. ClickHouse only applies the rule to tables with data parts greater than or equal to the Min part size value.
- Min part size ratio: Minimum ratio of table part size to total table size. ClickHouse only applies the rule to tables in which this ratio is greater than or equal to the Min part size ratio value.
You can add multiple compression rules. ClickHouse checks the Min part size and Min part size ratio conditions and applies the rules to those tables that meet both of them. If multiple rules can be applied to the same table, ClickHouse applies the first one. If none of the rules are applicable, ClickHouse uses the LZ4 compression method. To learn more, see the ClickHouse documentation.
-
Graphite rollup: GraphiteMergeTree engine configurations for thinning and aggregating/averaging (rollup) Graphite data. You can set up multiple configurations and use them for different tables.
To learn more about Graphite support in ClickHouse, see the documentation.
- Name: Configuration name.
- Patterns: Set of thinning rules. A rule applies if the metric name matches the Regexp parameter value and the age of the data matches the Retention parameter group value.
- Function: Aggregation function name.
- Regexp: Regular expression that the metric name must match.
- Retention: Retention parameters. The function applies to data whose age is in the range of [Age, Age + Precision]. You can set several groups of these parameters.
- Age: Minimum data age, in seconds.
- Precision: Accuracy of determining the age of the data, in seconds. Must be a divisor of 86,400 (the number of seconds in 24 hours).
-
Merge tree: MergeTree engine configuration. For more information, see the ClickHouse documentation.
- Max bytes to merge at min space in pool: Maximum total size of a data part to merge when the number of free threads in the background pool is at its minimum.
- Max replicated merges in queue: Maximum number of merge tasks that can be in the ReplicatedMergeTree queue at the same time.
- Number of free entries in pool to lower max size of merge: Threshold value of free entries in the pool. If the number of entries in the pool falls below this value, ClickHouse reduces the maximum size of a data part to merge. This helps handle small merges faster, rather than filling the pool with lengthy merges.
- Parts to delay insert: Threshold value of active data parts in a table. When exceeded, ClickHouse starts artificially reducing the rate of inserting data into the table.
- Parts to throw insert: Threshold value of active data parts in a table. When exceeded, ClickHouse throws the 'Too many parts ...' exception.
- Replicated deduplication window: Number of recent hash blocks that ZooKeeper stores (old ones are deleted).
- Replicated deduplication window seconds: Time during which ZooKeeper stores hash blocks (old ones are deleted).
-
Click Create cluster.
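Several of the DBMS settings described above take raw numbers: retention times are given in milliseconds and must be a multiple of 1000, retention sizes are given in bytes, and the Graphite Precision must be a divisor of 86,400. The shell sketch below shows one way to derive and sanity-check such values before entering them in the form. The function names are hypothetical helpers introduced for illustration; they are not part of any Yandex.Cloud tool.

```shell
# Hypothetical helpers for preparing values for the DBMS settings form.

# Convert a number of days to milliseconds (retention time fields).
days_to_ms() {
  echo $(( $1 * 24 * 60 * 60 * 1000 ))
}

# Succeed if a retention time in milliseconds is a multiple of 1000
# (0, which disables time-based deletion, also passes).
is_valid_retention_ms() {
  [ $(( $1 % 1000 )) -eq 0 ]
}

# Convert mebibytes to bytes (retention size fields).
mib_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

# Succeed if a Graphite rollup Precision (in seconds) divides 86400 evenly.
is_valid_precision() {
  [ "$1" -gt 0 ] && [ $(( 86400 % $1 )) -eq 0 ]
}

days_to_ms 30      # prints 2592000000, the documented 30-day default
mib_to_bytes 512   # prints 536870912, the documented 0.5 GB default
```

For example, `is_valid_precision 60` succeeds, while `is_valid_precision 7` fails because 86,400 is not divisible by 7.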
If you don't have the Yandex.Cloud command line interface yet, install and initialize it.
The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.
To create a cluster:
-
Check whether the folder has any subnets for the cluster hosts:
$ yc vpc subnet list
If there are no subnets in the folder, create the necessary subnets in VPC.
-
View a description of the CLI's create cluster command:
$ yc managed-clickhouse cluster create --help
-
Specify the cluster parameters in the create command (the example shows only mandatory flags):
$ yc managed-clickhouse cluster create \
--name <cluster name> \
--environment <prestable or production> \
--network-name <network name> \
--host type=<clickhouse or zookeeper>,zone-id=<availability zone>,subnet-id=<subnet ID> \
--resource-preset <host class> \
--clickhouse-disk-type <network-hdd | network-ssd | local-ssd> \
--clickhouse-disk-size <storage size in GB> \
--user name=<username>,password=<user password> \
--database name=<DB name>
Specify the subnet ID (subnet-id) if the selected availability zone contains two or more subnets.
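Because subnet-id is only needed when the availability zone contains two or more subnets, the value passed to --host can be assembled conditionally. The following is a minimal sketch under that assumption; host_spec is a hypothetical helper name, not a yc command, and it only builds the string without calling the CLI.

```shell
# Hypothetical helper: build the value for the --host flag.
# Usage: host_spec <clickhouse|zookeeper> <availability zone> [subnet ID]
host_spec() {
  spec="type=$1,zone-id=$2"
  # Append subnet-id only when a subnet ID was passed as the third argument.
  if [ -n "${3:-}" ]; then
    spec="${spec},subnet-id=$3"
  fi
  echo "$spec"
}

# The result can then be passed to the create command, for example:
#   --host "$(host_spec clickhouse <availability zone> <subnet ID>)"
```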
With Terraform, you can quickly create a cloud infrastructure in Yandex.Cloud. The infrastructure components are defined in configuration files that specify the required cloud resources and their parameters.
If you don't have Terraform yet, install it and configure the provider.
To create a cluster:
-
In the configuration file, describe the parameters of resources that you want to create:
- Database cluster: Description of the cluster and its hosts.
- Network: Description of the cloud network where the cluster will be located. If you already have a suitable network, you don't need to describe it again.
- Subnets: Description of the subnets to connect the cluster hosts to. If you already have suitable subnets, you don't need to describe them again.
Example configuration file structure:
resource "yandex_mdb_clickhouse_cluster" "<cluster name>" {
  name        = "<cluster name>"
  environment = "<environment>"
  network_id  = "<network ID>"

  clickhouse {
    resources {
      resource_preset_id = "<host class>"
      disk_type_id       = "<storage type>"
      disk_size          = "<storage size, GB>"
    }
  }

  database {
    name = "<DB name>"
  }

  user {
    name     = "<DB username>"
    password = "<password>"
    permission {
      database_name = "<name of the DB where the user is created>"
    }
  }

  host {
    type      = "CLICKHOUSE"
    zone      = "<availability zone>"
    subnet_id = "<subnet ID>"
  }
}

resource "yandex_vpc_network" "<network name>" {
  name = "<network name>"
}

resource "yandex_vpc_subnet" "<subnet name>" {
  name           = "<subnet name>"
  zone           = "<availability zone>"
  network_id     = "<network ID>"
  v4_cidr_blocks = ["<range>"]
}
For more information about resources that you can create using Terraform, see the provider's documentation.
-
Make sure that the configuration files are correct.
-
In the command line, go to the folder where you created the configuration file.
-
Run the check using the command:
terraform plan
If the configuration is described correctly, the terminal displays a list of the resources to be created and their parameters. If there are errors in the configuration, Terraform points them out. This is a test step. No resources are created.
-
Create a cluster.
-
If the configuration doesn't contain any errors, run the command:
terraform apply
-
Confirm that you want to create the resources.
After this, all the necessary resources will be created in the specified folder and the IP addresses of the VMs will be displayed in the terminal. You can check resource availability and their settings in the management console.
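The check-then-create sequence above can also be scripted. The sketch below is one possible variant, not the procedure from this page: it assumes the working directory has already been initialized with terraform init, and it saves the plan to a file (tfplan, an arbitrary name) so that the apply step uses exactly the plan that was reviewed. The TERRAFORM variable is a hypothetical convenience for substituting another binary, for example for a dry run.

```shell
# Hypothetical wrapper around the Terraform steps; override TERRAFORM to
# substitute another binary (for example, TERRAFORM=echo for a dry run).
TERRAFORM="${TERRAFORM:-terraform}"

plan_then_apply() {
  # Check the syntax and internal consistency of the configuration files.
  "$TERRAFORM" validate || return 1
  # Save the execution plan to a file; no resources are created at this step.
  "$TERRAFORM" plan -out=tfplan || return 1
  # Apply the saved plan.
  "$TERRAFORM" apply tfplan
}
```

Note that applying a saved plan file does not ask for interactive confirmation, so review the output of the plan step before the apply step runs.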
Examples
Creating a single-host cluster
To create a cluster with a single host, pass a single --host parameter.
Let's say we need to create a ClickHouse cluster with the following characteristics:
- Named mych.
- In the production environment.
- In the default network.
- With a single s2.micro class ClickHouse host in the b0rcctk2rvtr8efcch64 subnet and ru-central1-c availability zone.
- With 20 GB fast network storage (network-ssd).
- With one user, user1, with the password user1user1.
- With one database, db1.
Run the command:
$ yc managed-clickhouse cluster create \
--name mych \
--environment=production \
--network-name default \
--clickhouse-resource-preset s2.micro \
--host type=clickhouse,zone-id=ru-central1-c,subnet-id=b0rcctk2rvtr8efcch64 \
--clickhouse-disk-size 20 \
--clickhouse-disk-type network-ssd \
--user name=user1,password=user1user1 \
--database name=db1
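Before running a command like this, the constraints stated earlier on this page can be checked locally: the password must be at least 8 characters, and local SSD storage requires at least two hosts. The sketch below is a hypothetical pre-flight check, not part of the yc CLI. It also assumes that "100 GB increments" for local storage means the size must be a multiple of 100, which is an interpretation rather than a documented rule.

```shell
# Hypothetical pre-flight check; not part of the yc CLI.
# Usage: check_cluster_params <disk type> <disk size, GB> <host count> <password>
check_cluster_params() {
  local disk_type="$1" disk_size="$2" host_count="$3" password="$4"
  local status=0

  if [ "${#password}" -lt 8 ]; then
    echo "password must be at least 8 characters" >&2
    status=1
  fi

  if [ "$disk_type" = "local-ssd" ]; then
    if [ "$host_count" -lt 2 ]; then
      echo "local-ssd storage requires at least two hosts" >&2
      status=1
    fi
    # Assumption: "100 GB increments" means the size is a multiple of 100.
    if [ $(( disk_size % 100 )) -ne 0 ]; then
      echo "local-ssd storage size must be a multiple of 100 GB" >&2
      status=1
    fi
  fi

  return "$status"
}

# Values from the example above.
check_cluster_params network-ssd 20 1 user1user1 && echo "parameters look valid"
```

The function reports every violated constraint on stderr and returns a non-zero status if any check fails, so it can guard the create command with `&&`.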
Creating a single-host cluster with Terraform
Let's say we need to create a ClickHouse cluster and a network for it with the following characteristics:
- Named mych.
- In the PRESTABLE environment.
- In the cloud with ID b1gq90dgh25bebiu75o.
- In a folder named myfolder.
- In a new network named mynet.
- With a single s2.micro class host in the new subnet named mysubnet and the ru-central1-c availability zone. The mysubnet subnet will have a range of 10.5.0.0/24.
- With 32 GB of fast network storage.
- With the database name my_db.
- With the username user1 and password user1user1.
The configuration file for the cluster looks like this:
provider "yandex" {
  token     = "<OAuth or static key of service account>"
  cloud_id  = "b1gq90dgh25bebiu75o"
  # References a yandex_resourcemanager_folder data source named "myfolder",
  # which must be declared separately.
  folder_id = "${data.yandex_resourcemanager_folder.myfolder.id}"
  zone      = "ru-central1-c"
}

resource "yandex_mdb_clickhouse_cluster" "mych" {
  name        = "mych"
  environment = "PRESTABLE"
  network_id  = "${yandex_vpc_network.mynet.id}"

  clickhouse {
    resources {
      resource_preset_id = "s2.micro"
      disk_type_id       = "network-ssd"
      disk_size          = 32
    }
  }

  database {
    name = "my_db"
  }

  user {
    name     = "user1"
    password = "user1user1"
    permission {
      database_name = "my_db"
    }
  }

  host {
    type      = "CLICKHOUSE"
    zone      = "ru-central1-c"
    subnet_id = "${yandex_vpc_subnet.mysubnet.id}"
  }
}

resource "yandex_vpc_network" "mynet" {
  name = "mynet"
}

resource "yandex_vpc_subnet" "mysubnet" {
  name           = "mysubnet"
  zone           = "ru-central1-c"
  network_id     = "${yandex_vpc_network.mynet.id}"
  v4_cidr_blocks = ["10.5.0.0/24"]
}