# Setting up Time-Slicing GPUs

The Time-Slicing GPUs plugin in Kubernetes lets several pods share a single physical GPU by alternating their workloads on it.

To install the Time-Slicing GPUs plugin in Managed Service for Kubernetes:
1. Configure Time-Slicing GPUs.
2. Test Time-Slicing GPUs functionality.

If you no longer need the resources you created, delete them.
## Getting started

- If you do not have the Yandex Cloud command line interface yet, install and initialize it.

  The folder specified in the CLI profile is used by default. You can specify a different folder using the `--folder-name` or `--folder-id` parameter.
- Create a Managed Service for Kubernetes node group with the NVIDIA® Tesla® T4 GPU.
- Install `kubectl` and configure it to work with the created cluster.
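One way to configure `kubectl` for the new cluster is through the CLI. A sketch, assuming the cluster has a public endpoint (the cluster name is a placeholder):

```shell
# Fetch kubeconfig credentials for the cluster (name is a placeholder).
# --external uses the cluster's public endpoint; use --internal instead
# for access over the internal network.
yc managed-kubernetes cluster get-credentials <cluster_name> --external

# Verify the new context works (requires network access to the cluster).
kubectl cluster-info
```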
## Configure Time-Slicing GPUs

- Create a time-slicing configuration:
  - Prepare the `time-slicing-config.yaml` file with the following content:

    ```yaml
    ---
    kind: Namespace
    apiVersion: v1
    metadata:
      name: gpu-operator
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: time-slicing-config
      namespace: gpu-operator
    data:
      a100-80gb: |-
        version: v1
        sharing:
          timeSlicing:
            resources:
              - name: nvidia.com/gpu
                replicas: 5
      tesla-t4: |-
        version: v1
        sharing:
          timeSlicing:
            resources:
              - name: nvidia.com/gpu
                replicas: 5
    ```

  - Run this command:

    ```bash
    kubectl create -f time-slicing-config.yaml
    ```

    Result:

    ```text
    namespace/gpu-operator created
    configmap/time-slicing-config created
    ```
- Install the GPU Operator:

  ```bash
  helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && \
  helm repo update && \
  helm install \
    --namespace gpu-operator \
    --create-namespace \
    --set devicePlugin.config.name=time-slicing-config \
    gpu-operator nvidia/gpu-operator
  ```

- Apply the time-slicing configuration to the Managed Service for Kubernetes cluster or node group:

  **Cluster:**

  ```bash
  kubectl patch clusterpolicies.nvidia.com/cluster-policy \
    --namespace gpu-operator \
    --type merge \
    --patch='{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "tesla-t4"}}}}'
  ```

  **Node group:**

  ```bash
  yc managed-kubernetes node-group add-labels <node_group_name_or_ID> \
    --labels nvidia.com/device-plugin.config=tesla-t4
  ```

  You can get the ID and name of the Managed Service for Kubernetes node group with a list of node groups in your cluster.
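Once the device plugin picks up the configuration (the GPU Operator pods restart after the patch), each Tesla T4 node should advertise five `nvidia.com/gpu` replicas instead of one. A quick sanity check, assuming `kubectl` is configured for the cluster:

```shell
# Print the nvidia.com/gpu capacity advertised by each node; with the
# tesla-t4 profile applied, GPU nodes should report 5 rather than 1.
kubectl get nodes \
  --output jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.capacity.nvidia\.com/gpu}{"\n"}{end}'
```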
## Test Time-Slicing GPUs functionality
- Create a test app:
  - Save the following app creation specification to a YAML file named `nvidia-plugin-test.yml`.

    `Deployment` is the Kubernetes API object that manages a replicated application.

    ```yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nvidia-plugin-test
      labels:
        app: nvidia-plugin-test
    spec:
      replicas: 5
      selector:
        matchLabels:
          app: nvidia-plugin-test
      template:
        metadata:
          labels:
            app: nvidia-plugin-test
        spec:
          tolerations:
            - key: nvidia.com/gpu
              operator: Exists
              effect: NoSchedule
          containers:
            - name: dcgmproftester11
              image: nvidia/samples:dcgmproftester-2.0.10-cuda11.0-ubuntu18.04
              command: ["/bin/sh", "-c"]
              args:
                - while true; do /usr/bin/dcgmproftester11 --no-dcgm-validation -t 1004 -d 300; sleep 30; done
              resources:
                limits:
                  nvidia.com/gpu: 1
              securityContext:
                capabilities:
                  add: ["SYS_ADMIN"]
    ```

  - Run this command:

    ```bash
    kubectl apply -f nvidia-plugin-test.yml
    ```

    Result:

    ```text
    deployment.apps/nvidia-plugin-test created
    ```
- Make sure all five of the app's Managed Service for Kubernetes pods have the `Running` status:

  ```bash
  kubectl get pods | grep nvidia-plugin-test
  ```

- Run the `nvidia-smi` command in the running Managed Service for Kubernetes `nvidia-container-toolkit` pod:

  ```bash
  kubectl exec <nvidia-container-toolkit_pod_name> \
    --namespace gpu-operator -- nvidia-smi
  ```

  Result:

  ```text
  Defaulted container "nvidia-container-toolkit-ctr" out of: nvidia-container-toolkit-ctr, driver-validation (init)
  Thu Jan 26 09:42:51 2023
  +-----------------------------------------------------------------------------+
  | NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: N/A      |
  |-------------------------------+----------------------+----------------------+
  | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
  | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
  |                               |                      |               MIG M. |
  |===============================+======================+======================|
  |   0  Tesla T4            Off  | 00000000:8B:00.0 Off |                    0 |
  | N/A   72C    P0    70W /  70W |   1579MiB / 15360MiB |    100%      Default |
  |                               |                      |                  N/A |
  +-------------------------------+----------------------+----------------------+

  +-----------------------------------------------------------------------------+
  | Processes:                                                                  |
  |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
  |        ID   ID                                                   Usage      |
  |=============================================================================|
  |    0   N/A  N/A     43108      C   /usr/bin/dcgmproftester11        315MiB  |
  |    0   N/A  N/A     43211      C   /usr/bin/dcgmproftester11        315MiB  |
  |    0   N/A  N/A     44583      C   /usr/bin/dcgmproftester11        315MiB  |
  |    0   N/A  N/A     44589      C   /usr/bin/dcgmproftester11        315MiB  |
  |    0   N/A  N/A     44595      C   /usr/bin/dcgmproftester11        315MiB  |
  +-----------------------------------------------------------------------------+
  ```
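The test Deployment keeps the GPU fully loaded, so once you have confirmed that all five pods share the card, you may want to remove it (using the file name from the step above):

```shell
# Remove the test Deployment and its pods to free the GPU.
kubectl delete -f nvidia-plugin-test.yml
```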
## Delete the resources you created
Some resources are not free of charge. Delete the resources you no longer need to avoid paying for them:
- Delete the Managed Service for Kubernetes cluster.
- If you created any service accounts, delete them.