Enabling machine learning models

Managed Service for ClickHouse lets you analyze data by applying CatBoost machine learning models without additional tools. To apply a model, add it to your cluster and call it in an SQL query using the built-in modelEvaluate() function. After running this query, you get model predictions for each row of input data. For more information about machine learning, ClickHouse see the documentation.

Before adding a model

Managed Service for ClickHouse works only with publicly readable models that are uploaded to Object Storage:

  1. Upload the trained model file to Object Storage.

  2. Set up public read access to the model file.

  3. Get a public link to the model.

Adding the model

  1. Select the cluster:

    1. Go to the folder page and select Managed Service for ClickHouse.
    2. Click on the name of the cluster and select the Machine learning tab from the left panel.
    3. Click Add model.
  2. Configure the model parameters:

    • Type: Model type. The only supported model type is CatBoost: ML_MODEL_TYPE_CATBOOST.
    • Name: Model name. The model name is one of the arguments of the modelEvaluate() function, which is used to call the model in ClickHouse.
    • URL: Model address in Object Storage.
  3. Click Add and wait for the model to be added.

Applying the model

To apply the model to data stored in a ClickHouse cluster:

  1. Connect to the cluster using the CLI client for ClickHouse or go to the SQL tab in the cluster management console.

  2. Execute an SQL query like:

    SELECT 
        modelEvaluate('<model name>', 
                      <name of column 1>,
                      <name of column 2>,
                      ...
                      <name of column N>)
    FROM <table name>
    

Specify the model name and the names of the columns with input data as the modelEvaluate() function arguments. The query results in a column with model predictions for each row of the source table.

Example

If you don't have a suitable data set or model to process it, you can test machine learning in Managed Service for ClickHouse using this example. We prepared a data file for it and trained a model to analyze it. You can upload data to ClickHouse and see model predictions for different rows of the table.

Note

In this example, we'll use public data from the Amazon Employee Access Challenge. The model is trained to predict values in the ACTION column. The same data and model are used in the examples provided in the ClickHouse documentation and on GitHub.

To upload data to ClickHouse and test the model:

  1. Install the ClickHouse CLI and set up a cluster connection as described in the documentation.

  2. Download the data file for analysis:

    $ wget https://storage.yandexcloud.net/managed-clickhouse/train.csv  
    
  3. Create a table for the data:

    $ clickhouse-client --host <host FQDN> \
                      --database <DB name>
                      --secure \
                      --user <DB username> \
                      --password <DB user password> \
                      --port 9440 \
                      -q 'CREATE TABLE ml_test_table (date Date MATERIALIZED today(), ACTION UInt8, RESOURCE UInt32, MGR_ID UInt32, ROLE_ROLLUP_1 UInt32, ROLE_ROLLUP_2 UInt32, ROLE_DEPTNAME UInt32, ROLE_TITLE UInt32, ROLE_FAMILY_DESC UInt32, ROLE_FAMILY UInt32, ROLE_CODE UInt32) ENGINE = MergeTree(date, date, 8192)'
    
  4. Upload the data to the table:

    $ clickhouse-client --host <host FQDN> \
                      --database <DB name>
                      --secure \
                      --user <DB username> \
                      --password <DB user password> \
                      --port 9440 \
                      -q 'INSERT INTO ml_test_table FORMAT CSVWithNames' \
                      < train.csv
    
  5. In the management console, add the test model:

  • Type: ML_MODEL_TYPE_CATBOOST.
  • Name: ml_test.
  • URL: https://storage.yandexcloud.net/managed-clickhouse/catboost_model.bin.
  1. Test the model:
  2. Connect to the cluster using the CLI client for ClickHouse or go to the SQL tab in the cluster management console.
  3. Test the model using queries:
    • Predicted values for first 10 rows of the ACTION column:

      SELECT 
          modelEvaluate('ml_test', 
                        RESOURCE,
                        MGR_ID,
                        ROLE_ROLLUP_1,
                        ROLE_ROLLUP_2,
                        ROLE_DEPTNAME,
                        ROLE_TITLE,
                        ROLE_FAMILY_DESC,
                        ROLE_FAMILY,
                        ROLE_CODE) > 0 AS prediction, 
          ACTION AS target
      FROM ml_test_table
      LIMIT 10
      
    • Predicted probability for the first 10 rows of the table:

      SELECT 
          modelEvaluate('ml_test', 
                        RESOURCE,
                        MGR_ID,
                        ROLE_ROLLUP_1,
                        ROLE_ROLLUP_2,
                        ROLE_DEPTNAME,
                        ROLE_TITLE,
                        ROLE_FAMILY_DESC,
                        ROLE_FAMILY,
                        ROLE_CODE) AS prediction,
          1. / (1 + exp(-prediction)) AS probability, 
          ACTION AS target
      FROM ml_test_table
      LIMIT 10