© 2022 Yandex.Cloud LLC

Configuring networks for Data Proc

Written by
Yandex Cloud
  • Prepare the infrastructure
    • Create a network and subnet for your Data Proc cluster with egress NAT
    • Create the other resources
  • Set up NAT for the Data Proc cluster
  • Delete the resources you created

To give a Data Proc cluster access to resources outside its VPC network, assign public IP addresses to the cluster hosts. If you don't want to use public IP addresses, you can set up egress NAT (Network Address Translation) for the subnet instead.

In this tutorial, you'll learn how to create a Data Proc cluster and set up subnets and a VM (a NAT instance).

To enable egress NAT for a Data Proc cluster:

  1. Prepare the infrastructure:
    1. Create a network and subnet for your Data Proc cluster with egress NAT.
    2. Create the other resources.
  2. Set up NAT for the Data Proc cluster.

If you no longer need these resources, delete them.

Prepare the infrastructure

You have to create:

  • Network.
  • Subnet for your Data Proc cluster.
  • Subnet for the NAT instance.
  • Security groups and rules for the cluster and NAT instance.
  • Service account for the cluster.
  • Cluster.
  • NAT instance.

Create a network and subnet for your Data Proc cluster with egress NAT

  1. Create a network named network-data-proc.

  2. In network-data-proc, create a subnet with the following parameters:

    • Name: subnet-cluster.

    • Zone: ru-central1-a.

    • CIDR: 192.168.1.0/24.

    • Advanced settings: Enable Egress NAT.

      Note

      This setting can only be enabled in the Management console.

  3. Save the IDs of network-data-proc and subnet-cluster: you'll need them later.
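
The network and subnet can also be created from the command line. The following is a minimal sketch, assuming the yc CLI is installed and initialized; the `create_cluster_network` function and the `YC` variable are hypothetical helpers introduced only for this example, and the flags should be checked against `yc vpc subnet create --help`. Egress NAT still has to be enabled in the management console afterwards, as the note above says.

```shell
#!/bin/sh
# Sketch only: assumes the yc CLI is installed and initialized.
# Set YC=echo to print the commands instead of running them.
YC="${YC:-yc}"

# Creates the tutorial network and the cluster subnet inside it.
create_cluster_network() {
    "$YC" vpc network create --name network-data-proc &&
    "$YC" vpc subnet create \
        --name subnet-cluster \
        --zone ru-central1-a \
        --range 192.168.1.0/24 \
        --network-name network-data-proc
}
```

Calling `create_cluster_network` runs both commands and stops if the first one fails. Each command prints a description of the created resource, which should include the ID to save for later.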

Create the other resources

Manually
Using Terraform
  1. In network-data-proc, create a subnet with the following parameters:

    • Name: subnet-nat.
    • Zone: ru-central1-b.
    • CIDR: 192.168.100.0/24.

    You don't need to enable Egress NAT for this subnet.

  2. Create and configure security groups for the Data Proc cluster.

  3. Create a security group for the NAT instance.

  4. In the security group for the NAT instance, create the following rules:

    For incoming traffic:

    • A rule that allows all traffic from the Data Proc cluster's security group:

      • Port range: 0-65535.
      • Protocol: Any.
      • Source: Security group.
      • Security group: From list. Select the Data Proc cluster security group.
    • A rule allowing an SSH connection to the NAT instance over the internet:

      • Port range: 22.
      • Protocol: TCP.
      • Source: CIDR.
      • CIDR blocks: 0.0.0.0/0.

    For outgoing traffic:

    • A rule allowing all outgoing traffic:

      • Port range: 0-65535.
      • Protocol: Any.
      • Destination: CIDR.
      • CIDR blocks: 0.0.0.0/0.
  5. Create a service account with the following roles:

    • mdb.dataproc.agent.
    • storage.uploader.
    • storage.viewer.
  6. Create a Yandex Object Storage bucket.

  7. Create a Data Proc cluster of any suitable configuration with the following settings:

    • Service account: Select the service account you created previously.
    • Bucket ID format: List.
    • Bucket name: Select a previously created bucket.
    • Network: network-data-proc.
    • Security groups: Select the previously created security groups.
  8. In the network-data-proc network, create a VM with a public IP address from the NAT instance image. Specify the security group that you configured for the NAT instance.

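  9. Open the NAT instance properties and note the VM's IP addresses: the internal address is used in the static route below, and the public address is used to connect over SSH.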

  10. In the network-data-proc network, create a routing table named route-table-nat and add a static route to it:

    • Destination prefix: 0.0.0.0/0.
    • Next hop: The internal IP address of the NAT instance.
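
The subnet for the NAT instance and the routing table from the steps above can be sketched with the yc CLI as follows. This assumes the CLI is installed and initialized; `create_nat_subnet`, `create_nat_route_table`, and the `YC` variable are hypothetical names used only here, and the flags should be verified with `yc vpc route-table create --help`.

```shell
#!/bin/sh
# Sketch only: assumes the yc CLI is installed and initialized.
# Set YC=echo to print the commands instead of running them.
YC="${YC:-yc}"

# Creates the subnet that hosts the NAT instance (no Egress NAT needed here).
create_nat_subnet() {
    "$YC" vpc subnet create \
        --name subnet-nat \
        --zone ru-central1-b \
        --range 192.168.100.0/24 \
        --network-name network-data-proc
}

# Creates route-table-nat with a default route through the NAT instance.
# $1: internal IP address of the NAT instance.
create_nat_route_table() {
    "$YC" vpc route-table create \
        --name route-table-nat \
        --network-name network-data-proc \
        --route "destination=0.0.0.0/0,next-hop=$1"
}
```

For example, `create_nat_route_table <internal IP address of the NAT instance>` adds the static route with the 0.0.0.0/0 destination prefix.
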
To create the infrastructure using Terraform:

  1. If you don't have Terraform, set up and configure it by following the instructions.

  2. Download the file with the provider settings. Place it in a separate working directory and specify the parameter values.

  3. Download the cluster and the NAT instance configuration file to the same working directory.

    The file describes:

    • Network.
    • Subnets.
    • Security groups.
    • Data Proc cluster.
    • Service account to access cloud resources.
    • NAT instance.
    • Bucket.
  4. In the configuration file, specify all the relevant parameters.

  5. Run the terraform init command in the working directory hosting the configuration files. This command initializes the provider specified in the configuration files and enables you to use the provider resources and data sources.

  6. Validate the Terraform configuration files using the following command:

    terraform validate
    

    If there are errors in the configuration files, Terraform points them out.

  7. Import the network and subnet that you created previously.

    Alert

    Only import the network and subnet created in this tutorial. Once a resource is imported, Terraform manages it: running terraform apply may change it, and running terraform destroy deletes it.

    Import the network-data-proc network:

    terraform import yandex_vpc_network.network-data-proc <network ID>
    

    Import the subnet-cluster subnet:

    terraform import yandex_vpc_subnet.subnet-cluster <subnet ID>
    
  8. Create the required infrastructure:

    1. Run the command to view planned changes:

      terraform plan
      

      If the resource configuration descriptions are correct, the terminal will display a list of the resources to modify and their parameters. This is a test step. No resources are updated.

    2. If you are happy with the planned changes, apply them:

      1. Run the command:

        terraform apply
        
      2. Confirm the update of resources.

      3. Wait for the operation to complete.

All the required resources will be created in the specified folder. You can use the management console to check that the resources exist and have the correct settings.
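
The Terraform steps above can be chained so that each command runs only if the previous one succeeded. This is a minimal sketch, not part of the tutorial files: the `deploy_infrastructure` function and the `TERRAFORM` variable are hypothetical, and the function expects the saved network and subnet IDs as arguments.

```shell
#!/bin/sh
# Sketch only: run it from the working directory with the configuration files.
# Set TERRAFORM=echo to print the commands instead of running them.
TERRAFORM="${TERRAFORM:-terraform}"

# $1: ID of network-data-proc, $2: ID of subnet-cluster.
deploy_infrastructure() {
    network_id="$1"
    subnet_id="$2"
    "$TERRAFORM" init &&
    "$TERRAFORM" validate &&
    "$TERRAFORM" import yandex_vpc_network.network-data-proc "$network_id" &&
    "$TERRAFORM" import yandex_vpc_subnet.subnet-cluster "$subnet_id" &&
    "$TERRAFORM" plan &&
    "$TERRAFORM" apply
}
```

Because terraform apply still asks for confirmation interactively, the planned changes can be reviewed before anything is created.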

Set up NAT for the Data Proc cluster

  1. Connect to the NAT instance over SSH.

  2. To enable routing, add the following lines to the end of the /etc/sysctl.conf file:

    net.ipv4.ip_forward = 1
    net.ipv4.conf.all.accept_redirects = 1
    net.ipv4.conf.all.send_redirects = 1
    
  3. To enable the execution of /etc/rc.local at OS startup, run the commands:

    sudo systemctl enable rc-local && \
        sudo touch /etc/rc.local && \
        sudo chmod 755 /etc/rc.local
    
  4. Add the following code to the /etc/rc.local file:

    #!/bin/sh
    
    iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
    
  5. Reboot the NAT instance OS:

    sudo reboot -f
    
  6. Check that NAT is configured properly. To do this, reconnect to the NAT instance over SSH and run the command:

    curl ifconfig.co
    

    If the configuration is correct, the command outputs the public IP address of the NAT instance.
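
The file edits from steps 2 and 4 can be scripted so that running them twice does not duplicate any lines. The following is a sketch under stated assumptions: `append_once`, `configure_nat_files`, and the `SYSCTL_CONF` and `RC_LOCAL` variables are hypothetical names introduced for this example, and /etc/rc.local is assumed to be either missing or empty before the first run, as it is after the touch command in step 3.

```shell
#!/bin/sh
# Sketch only: target paths can be overridden to try the script on copies.
SYSCTL_CONF="${SYSCTL_CONF:-/etc/sysctl.conf}"
RC_LOCAL="${RC_LOCAL:-/etc/rc.local}"

# Appends line $2 to file $1 unless the file already contains that exact line.
append_once() {
    file="$1"
    line="$2"
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Writes the routing settings and the masquerading rule from steps 2 and 4.
configure_nat_files() {
    append_once "$SYSCTL_CONF" 'net.ipv4.ip_forward = 1'
    append_once "$SYSCTL_CONF" 'net.ipv4.conf.all.accept_redirects = 1'
    append_once "$SYSCTL_CONF" 'net.ipv4.conf.all.send_redirects = 1'
    # Start rc.local with a shebang only when the file is missing or empty.
    [ -s "$RC_LOCAL" ] || printf '#!/bin/sh\n\n' > "$RC_LOCAL"
    append_once "$RC_LOCAL" 'iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE'
    chmod 755 "$RC_LOCAL"
}
```

On the NAT instance, `configure_nat_files` has to run with root privileges because it writes to /etc. Enabling the rc-local service and rebooting are still done as described in steps 3 and 5.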

Delete the resources you created

Manually
Using Terraform
  1. Delete the Data Proc cluster.
  2. Delete the VM.
  3. If you reserved public static IP addresses for the clusters, release and delete them.
  4. Delete the subnets.
  5. Delete the network.

To delete the infrastructure created with Terraform:

  1. In the terminal window, change to the directory containing the infrastructure plan.

  2. Delete the data-proc-nat.tf configuration file.

  3. Validate the Terraform configuration files using the following command:

    terraform validate
    

    If there are errors in the configuration files, Terraform points them out.

  4. Run the command to view planned changes:

    terraform plan
    

    If the resource configuration descriptions are correct, the terminal will display a list of the resources to modify and their parameters. This is a test step. No resources are updated.

  5. If you are happy with the planned changes, apply them:

    1. Run the command:

      terraform apply
      
    2. Confirm the update of resources.

    3. Wait for the operation to complete.

All resources described in the configuration file will be deleted.
