# Setting up Time-Slicing GPUs

The Time-Slicing GPUs plugin in Kubernetes lets several pods share a single physical GPU by alternating their workloads on it.

To install the Time-Slicing GPUs plugin in Managed Service for Kubernetes:
1. Configure Time-Slicing GPUs.
2. Test Time-Slicing GPUs functionality.

If you no longer need the resources you created, delete them.
## Getting started

- If you do not have the Yandex Cloud command line interface yet, install and initialize it.

  The folder specified in the CLI profile is used by default. You can specify a different folder using the `--folder-name` or `--folder-id` parameter.
- Create a Managed Service for Kubernetes node group with the NVIDIA® Tesla® T4 GPU.
- Install `kubectl` and configure it to work with the created cluster.
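One way to configure `kubectl` for the new cluster is through the CLI. A sketch, assuming the cluster has a public endpoint (the cluster name is a placeholder):

```shell
# Fetch kubeconfig credentials for the cluster (name is a placeholder).
# --external uses the cluster's public endpoint; use --internal instead
# for access over the internal network.
yc managed-kubernetes cluster get-credentials <cluster_name> --external

# Verify the new context works (requires network access to the cluster).
kubectl cluster-info
```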
## Configure Time-Slicing GPUs

- Create a time-slicing configuration:
  - Prepare the `time-slicing-config.yaml` file with the following content:

    ```yaml
    ---
    kind: Namespace
    apiVersion: v1
    metadata:
      name: gpu-operator
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: time-slicing-config
      namespace: gpu-operator
    data:
      a100-80gb: |-
        version: v1
        sharing:
          timeSlicing:
            resources:
              - name: nvidia.com/gpu
                replicas: 5
      tesla-t4: |-
        version: v1
        sharing:
          timeSlicing:
            resources:
              - name: nvidia.com/gpu
                replicas: 5
    ```

  - Run this command:

    ```bash
    kubectl create -f time-slicing-config.yaml
    ```

    Result:

    ```text
    namespace/gpu-operator created
    configmap/time-slicing-config created
    ```
- Install the GPU Operator:

  ```bash
  helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && \
  helm repo update && \
  helm install \
    --namespace gpu-operator \
    --create-namespace \
    --set devicePlugin.config.name=time-slicing-config \
    gpu-operator nvidia/gpu-operator
  ```

- Apply the time-slicing configuration to the Managed Service for Kubernetes cluster or node group:

  **Cluster:**

  ```bash
  kubectl patch clusterpolicies.nvidia.com/cluster-policy \
    --namespace gpu-operator \
    --type merge \
    --patch='{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "tesla-t4"}}}}'
  ```

  **Node group:**

  ```bash
  yc managed-kubernetes node-group add-labels <node_group_name_or_ID> \
    --labels nvidia.com/device-plugin.config=tesla-t4
  ```

  You can get the ID and name of the Managed Service for Kubernetes node group with a list of node groups in your cluster.
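Once the device plugin picks up the configuration (the GPU Operator pods restart after the patch), each Tesla T4 node should advertise five `nvidia.com/gpu` replicas instead of one. A quick sanity check, assuming `kubectl` is configured for the cluster:

```shell
# Print the nvidia.com/gpu capacity advertised by each node; with the
# tesla-t4 profile applied, GPU nodes should report 5 rather than 1.
kubectl get nodes \
  --output jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.capacity.nvidia\.com/gpu}{"\n"}{end}'
```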
## Test Time-Slicing GPUs functionality
- Create a test app:
  - Save the following app creation specification to a YAML file named `nvidia-plugin-test.yml`.

    `Deployment` is the Kubernetes API object that manages a replicated application.

    ```yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nvidia-plugin-test
      labels:
        app: nvidia-plugin-test
    spec:
      replicas: 5
      selector:
        matchLabels:
          app: nvidia-plugin-test
      template:
        metadata:
          labels:
            app: nvidia-plugin-test
        spec:
          tolerations:
            - key: nvidia.com/gpu
              operator: Exists
              effect: NoSchedule
          containers:
            - name: dcgmproftester11
              image: nvidia/samples:dcgmproftester-2.0.10-cuda11.0-ubuntu18.04
              command: ["/bin/sh", "-c"]
              args:
                - while true; do /usr/bin/dcgmproftester11 --no-dcgm-validation -t 1004 -d 300; sleep 30; done
              resources:
                limits:
                  nvidia.com/gpu: 1
              securityContext:
                capabilities:
                  add: ["SYS_ADMIN"]
    ```

  - Run this command:

    ```bash
    kubectl apply -f nvidia-plugin-test.yml
    ```

    Result:

    ```text
    deployment.apps/nvidia-plugin-test created
    ```
- Make sure all five of the app's Managed Service for Kubernetes pods have the `Running` status:

  ```bash
  kubectl get pods | grep nvidia-plugin-test
  ```

- Run the `nvidia-smi` command in the running Managed Service for Kubernetes `nvidia-container-toolkit` pod:

  ```bash
  kubectl exec <nvidia-container-toolkit_pod_name> \
    --namespace gpu-operator -- nvidia-smi
  ```

  Result:

  ```text
  Defaulted container "nvidia-container-toolkit-ctr" out of: nvidia-container-toolkit-ctr, driver-validation (init)
  Thu Jan 26 09:42:51 2023
  +-----------------------------------------------------------------------------+
  | NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: N/A      |
  |-------------------------------+----------------------+----------------------+
  | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
  | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
  |                               |                      |               MIG M. |
  |===============================+======================+======================|
  |   0  Tesla T4            Off  | 00000000:8B:00.0 Off |                    0 |
  | N/A   72C    P0    70W /  70W |   1579MiB / 15360MiB |    100%      Default |
  |                               |                      |                  N/A |
  +-------------------------------+----------------------+----------------------+

  +-----------------------------------------------------------------------------+
  | Processes:                                                                  |
  |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
  |        ID   ID                                                   Usage      |
  |=============================================================================|
  |    0   N/A  N/A     43108      C   /usr/bin/dcgmproftester11        315MiB  |
  |    0   N/A  N/A     43211      C   /usr/bin/dcgmproftester11        315MiB  |
  |    0   N/A  N/A     44583      C   /usr/bin/dcgmproftester11        315MiB  |
  |    0   N/A  N/A     44589      C   /usr/bin/dcgmproftester11        315MiB  |
  |    0   N/A  N/A     44595      C   /usr/bin/dcgmproftester11        315MiB  |
  +-----------------------------------------------------------------------------+
  ```
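The test Deployment keeps the GPU fully loaded, so once you have confirmed that all five pods share the card, you may want to remove it (using the file name from the step above):

```shell
# Remove the test Deployment and its pods to free the GPU.
kubectl delete -f nvidia-plugin-test.yml
```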
## Delete the resources you created
Some resources are not free of charge. Delete the resources you no longer need to avoid paying for them:
- Delete the Managed Service for Kubernetes cluster.
- If you created any service accounts, delete them.