Getting started with Data Proc
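To get started with the service, prepare your cloud environment, create a cluster, connect to it, and then connect to the component interfaces.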
Getting started
- Go to the management console and log in to Yandex Cloud or create an account if you do not have one yet.
- If you do not have a folder yet, create one:
  - In the management console, select the appropriate cloud in the list on the left.
  - At the top right, click Create folder.
  - Enter the folder name. The naming requirements are as follows:
    - The name must be from 3 to 63 characters long.
    - It may contain lowercase Latin letters, numbers, and hyphens.
    - The first character must be a letter and the last character cannot be a hyphen.
  - (Optional) Enter a description of the folder.
  - Select Create a default network. This will create a network with subnets in each availability zone. Within this network, a default security group will be created, inside which all network traffic is allowed.
  - Click Create.
- Set up a NAT gateway in the subnet that will host the cluster.
- If you use security groups, configure them.
- You can connect to a Data Proc cluster from both inside and outside Yandex Cloud:
  - To connect from inside Yandex Cloud, create a Linux-based virtual machine, which must be in the same network as the cluster.
  - To be able to connect to the cluster from the internet, request public access to subclusters when creating the cluster.

  Note
  The next step assumes that you connect to the cluster from a Linux-based VM.
- Connect to the VM over SSH.
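
The folder naming requirements listed in this section (3 to 63 characters; lowercase Latin letters, numbers, and hyphens; first character a letter; last character not a hyphen) can be checked locally before you fill in the form. The following is a minimal Python sketch of those rules only. The function name `is_valid_folder_name` is hypothetical and is not part of any Yandex Cloud tool or SDK, and the sketch assumes the first character must be a lowercase letter because only lowercase letters are allowed.

```python
import re

# One lowercase letter, then 1 to 61 letters/digits/hyphens, then a final
# letter or digit: 3 to 63 characters in total, never ending in a hyphen.
_FOLDER_NAME_RE = re.compile(r"[a-z][a-z0-9-]{1,61}[a-z0-9]")


def is_valid_folder_name(name: str) -> bool:
    """Return True if name satisfies the naming rules stated in this guide."""
    return _FOLDER_NAME_RE.fullmatch(name) is not None


print(is_valid_folder_name("data-proc-1"))  # True
print(is_valid_folder_name("1-data-proc"))  # False: starts with a digit
print(is_valid_folder_name("data-proc-"))   # False: ends with a hyphen
```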
Create a cluster
To create a cluster:
- In the management console, open the folder to create your cluster in and select Data Proc.
- Click Create cluster.
- Set the cluster parameters and click Create cluster. For more information, see Creating clusters.
- Wait until the cluster is ready for use: its status will change to Alive. This may take some time.
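
If you script the waiting step instead of watching the console, a simple polling loop is usually enough. The sketch below is generic Python and makes several assumptions: `get_status` is a hypothetical callable that you supply (this guide does not describe how to query the status programmatically), the status string `"Alive"` matches what the console displays, and the timeout and polling interval are arbitrary defaults.

```python
import time
from typing import Callable


def wait_until_alive(
    get_status: Callable[[], str],   # hypothetical: returns the current cluster status
    timeout_s: float = 1800.0,       # arbitrary default, adjust to your cluster size
    poll_interval_s: float = 15.0,   # arbitrary default
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll get_status until it returns "Alive"; return False on timeout."""
    deadline = clock() + timeout_s
    while True:
        if get_status() == "Alive":
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real delays; in normal use you would leave them at their defaults.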
Connect to the cluster
To connect to a cluster:
- If you are using security groups for a cloud network, configure them to enable all relevant traffic between the cluster and the connecting host.
- Copy the SSH key that you specified when creating the Data Proc cluster to the VM.
- Connect to the cluster via SSH and make sure that Hadoop commands are executed. Depending on the image version, specify the username:
  - For version 2.0: ubuntu.
  - For version 1.4: root.
For more information about connecting to a Data Proc cluster, see Connecting to a cluster.
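
The username rule above can be captured in a small helper that assembles the `ssh` command line. This is a sketch under assumptions: the host name and key path are placeholders, the function name `build_ssh_command` is hypothetical, and `hadoop version` is used only as one example of a Hadoop command to confirm that the tools respond.

```python
import shlex

# Usernames stated in this guide, keyed by image version.
SSH_USER_BY_IMAGE_VERSION = {"2.0": "ubuntu", "1.4": "root"}


def build_ssh_command(host, image_version, key_path, remote_command=()):
    """Return the ssh argument list for a Data Proc host (hypothetical helper)."""
    if image_version not in SSH_USER_BY_IMAGE_VERSION:
        raise ValueError(f"no username known for image version {image_version!r}")
    user = SSH_USER_BY_IMAGE_VERSION[image_version]
    # -i selects the private key file; anything after user@host runs remotely.
    return ["ssh", "-i", key_path, f"{user}@{host}", *remote_command]


# Placeholder host and key path: replace them with your own values.
cmd = build_ssh_command(
    "cluster-host.example", "2.0", "/path/to/private_key", ("hadoop", "version")
)
print(shlex.join(cmd))
# ssh -i /path/to/private_key ubuntu@cluster-host.example hadoop version
```

Returning a list rather than a string means the result can be passed directly to `subprocess.run` without shell quoting concerns.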
Connect to component interfaces
To connect to the Data Proc component interfaces using the web interface:
- Enable the UI Proxy setting in the cluster.
- Get a list of interface URLs.
To connect to the Data Proc component interfaces via SSH with port forwarding:
- Create an intermediate VM with a public IP address in the same network as the cluster and with a security group that allows incoming and outgoing traffic through the component ports.
- Connect to the created VM via SSH with port forwarding to the appropriate ports of the Data Proc host. Depending on the image version, specify the username:
  - For version 2.0: ubuntu.
  - For version 1.4: root.
For more information about connecting to Data Proc cluster component interfaces, see Connecting to component interfaces.
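
Port forwarding through the intermediate VM uses the standard OpenSSH `-L` option (`local_port:target_host:target_port`). The sketch below only builds the command line. All host names and the key path are placeholders, the function name `build_tunnel_command` is hypothetical, and port 8088 (the default YARN ResourceManager web UI port) is used purely as an example, since the component ports depend on your cluster.

```python
import shlex


def build_tunnel_command(vm_host, vm_user, target_host, target_port, local_port, key_path):
    """Return ssh arguments that forward local_port to target_host:target_port via vm_host."""
    for port in (target_port, local_port):
        if not 1 <= port <= 65535:
            raise ValueError(f"invalid TCP port: {port}")
    return [
        "ssh",
        "-i", key_path,  # private key file
        "-N",            # do not run a remote command, only forward ports
        "-L", f"{local_port}:{target_host}:{target_port}",
        f"{vm_user}@{vm_host}",
    ]


# Placeholder values: replace them with your VM address, Data Proc host, and key.
tunnel = build_tunnel_command(
    "vm.example", "ubuntu", "dataproc-host.example", 8088, 8088, "/path/to/private_key"
)
print(shlex.join(tunnel))
```

While the tunnel is open, the forwarded interface is reachable on the local machine at `localhost` and the chosen local port.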
What's next
- Read about service concepts.
- Learn more about creating clusters and working with jobs.
- Create a Hive Metastore cluster.