ROSA with Nvidia GPU Workloads
This content is authored by Red Hat experts, but has not yet been tested on every supported configuration.
ROSA guide to running Nvidia GPU workloads.
Prerequisites
- A ROSA cluster (version 4.10 or higher)
- rosa CLI (logged in)
- oc CLI (logged in as a cluster-admin)
- jq
If you need to install a ROSA cluster, please read our ROSA Quickstart Guide. Be sure the cluster you install or reuse is version 4.10.x or higher.
As of OpenShift 4.10, it is no longer necessary to set up entitlements to use the Nvidia GPU Operator, which has greatly simplified preparing the cluster for GPU workloads.
Log in to the cluster as a cluster-admin user with the oc login command, using your cluster's API URL and credentials:
Example login:
oc login https://api.cluster_name.t6k4.i1.organization.org:6443 \
> --username cluster-admin \
> --password mypa55w0rd
Login successful.
You have access to 77 projects, the list has been suppressed. You can list all projects with 'oc projects'
Install jq if it is not already present.
Linux:
sudo dnf install jq
MacOS:
brew install jq
Helm Prerequisites
If you plan to use Helm to deploy the GPU operator, you will need to do the following:
- Add the MOBB chart repository to Helm:
helm repo add mobb https://rh-mobb.github.io/helm-charts/
- Update your repositories:
helm repo update
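To confirm the repository was added, you can search it; this is an optional check, and the chart list shown will depend on the current repository contents:
helm search repo mobb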
GPU Quota
- View the list of supported GPU instance types in ROSA:
rosa list instance-types | grep accelerated
- Select a GPU instance type. This guide uses g5.xlarge as an example; please be mindful of the cost of the type you choose.
export GPU_INSTANCE_TYPE='g5.xlarge'
- Log in to the AWS Console, type "quotas" in the search bar, then click "Service Quotas" -> "AWS services" -> "Amazon Elastic Compute Cloud (Amazon EC2)". Search for "Running On-Demand [instance-family] instances" (e.g. Running On-Demand G and VT instances).
Please remember that AWS quota for On-Demand instances is counted in vCPUs, not instances. For example, to run a single g5.xlarge (4 vCPUs) you will need quota for 4 vCPUs; to run a single g5.8xlarge (32 vCPUs) you will need quota for 32 vCPUs.
- Verify quota and request an increase if necessary (see the command-line check below).
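If you have the AWS CLI installed, you can also check the current quota value from the command line. This is a sketch, assuming the quota code L-DB2E81BA, which at the time of writing corresponds to "Running On-Demand G and VT instances"; verify the code in the Service Quotas console for your instance family:
aws service-quotas get-service-quota \
  --service-code ec2 \
  --quota-code L-DB2E81BA \
  --query 'Quota.Value'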
GPU Machine Pool
- Set environment variables:
export CLUSTER_NAME=<YOUR-CLUSTER>
export MACHINE_POOL_NAME=nvidia-gpu-pool
export MACHINE_POOL_REPLICA_COUNT=1
- Create the GPU machine pool:
rosa create machinepool --cluster=$CLUSTER_NAME \
  --name=$MACHINE_POOL_NAME \
  --replicas=$MACHINE_POOL_REPLICA_COUNT \
  --instance-type=$GPU_INSTANCE_TYPE
- Verify the GPU machine pool:
It may take 10-15 minutes to provision a new GPU machine. If this step fails, log in to the AWS Console and check whether you ran into availability issues. You can go to EC2 and search for instances by cluster name to see the instance state.
oc wait --for=jsonpath='{.status.readyReplicas}'=1 machineset \
  -l hive.openshift.io/machine-pool=$MACHINE_POOL_NAME \
  -n openshift-machine-api --timeout=600s
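You can also confirm the machine pool itself from the rosa CLI; the replica count shown for the pool should match what you requested once the node joins:
rosa list machinepools --cluster=$CLUSTER_NAME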
Install and Configure Nvidia GPU
This section configures the Node Feature Discovery Operator (to allow OpenShift to discover the GPU nodes) and the Nvidia GPU Operator.
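For reference, NFD identifies Nvidia GPUs by their PCI vendor ID (10de), and the GPU Operator then labels those nodes with nvidia.com/gpu.present. Once the install below is complete, you should be able to select GPU nodes by either label, for example:
oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true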
Helm
- Create the namespaces:
oc create namespace openshift-nfd
oc create namespace nvidia-gpu-operator
- Use the mobb/operatorhub chart to deploy the needed operators:
helm upgrade -n nvidia-gpu-operator nvidia-gpu-operator \
  mobb/operatorhub --install \
  --values https://raw.githubusercontent.com/rh-mobb/helm-charts/main/charts/nvidia-gpu/files/operatorhub.yaml
- Wait until the two operators are running:
oc rollout status deploy/nfd-controller-manager -n openshift-nfd --timeout=300s
oc rollout status deploy/gpu-operator -n nvidia-gpu-operator --timeout=300s
- Install the Nvidia GPU Operator chart:
helm upgrade --install -n nvidia-gpu-operator nvidia-gpu \
  mobb/nvidia-gpu --disable-openapi-validation
- Wait until the NFD instances are ready. NOTE: if you are deploying ROSA into a single AZ, change the expected nfd-master replica count from 3 to 1.
oc wait --for=jsonpath='{.status.availableReplicas}'=3 -l app=nfd-master deployment -n openshift-nfd
oc wait --for=jsonpath='{.status.numberReady}'=5 -l app=nfd-worker ds -n openshift-nfd
- Wait until the Cluster Policy is ready:
oc wait --for=jsonpath='{.status.state}'=ready clusterpolicy \
  gpu-cluster-policy -n nvidia-gpu-operator --timeout=600s
- Skip to Validate GPU.
Manually
Install Nvidia GPU Operator
- Create the Nvidia namespace:
oc create namespace nvidia-gpu-operator
- Create the Operator Group:
cat <<EOF | oc apply -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-gpu-operator-group
  namespace: nvidia-gpu-operator
spec:
  targetNamespaces:
    - nvidia-gpu-operator
EOF
- Get the latest Nvidia channel:
CHANNEL=$(oc get packagemanifest gpu-operator-certified -n openshift-marketplace -o jsonpath='{.status.defaultChannel}')
- Get the latest Nvidia package:
PACKAGE=$(oc get packagemanifests/gpu-operator-certified -n openshift-marketplace -ojson | jq -r '.status.channels[] | select(.name == "'$CHANNEL'") | .currentCSV')
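As a quick sanity check before creating the Subscription, you can echo both variables; each should be non-empty, and the exact values will depend on the current catalog:
echo "CHANNEL=$CHANNEL PACKAGE=$PACKAGE"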
- Create the Subscription:
envsubst <<EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator
spec:
  channel: "$CHANNEL"
  installPlanApproval: Automatic
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
  startingCSV: "$PACKAGE"
EOF
- Wait for the Operator to finish installing:
oc rollout status deploy/gpu-operator -n nvidia-gpu-operator --timeout=300s
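Optionally, you can also confirm that the operator's ClusterServiceVersion reached the Succeeded phase; the CSV name should match the $PACKAGE value from above:
oc get csv -n nvidia-gpu-operator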
Install Node Feature Discovery Operator
The Node Feature Discovery (NFD) operator will discover the GPUs on your nodes and label them appropriately so you can target them for workloads. We'll install the NFD operator into the openshift-nfd namespace, create the Subscription that installs it, and then create the NodeFeatureDiscovery instance that configures it.
Official Documentation for Installing Node Feature Discovery Operator
- Set up the namespace:
oc create namespace openshift-nfd
- Create the OperatorGroup:
cat <<EOF | oc apply -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  generateName: openshift-nfd-
  name: openshift-nfd
  namespace: openshift-nfd
EOF
- Create the Subscription:
cat <<EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nfd
  namespace: openshift-nfd
spec:
  channel: "stable"
  installPlanApproval: Automatic
  name: nfd
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
- Wait for Node Feature Discovery to complete installation:
oc rollout status deploy/nfd-controller-manager -n openshift-nfd --timeout=300s
- Create the NFD instance:
cat <<EOF | oc apply -f -
kind: NodeFeatureDiscovery
apiVersion: nfd.openshift.io/v1
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec:
  customConfig:
    configData: |
      #    - name: "more.kernel.features"
      #      matchOn:
      #      - loadedKMod: ["example_kmod3"]
      #    - name: "more.features.by.nodename"
      #      value: customValue
      #      matchOn:
      #      - nodename: ["special-.*-node-.*"]
  operand:
    image: >-
      registry.redhat.io/openshift4/ose-node-feature-discovery@sha256:07658ef3df4b264b02396e67af813a52ba416b47ab6e1d2d08025a350ccd2b7b
    servicePort: 12000
  workerConfig:
    configData: |
      core:
      #  labelWhiteList:
      #  noPublish: false
        sleepInterval: 60s
      #  sources: [all]
      #  klog:
      #    addDirHeader: false
      #    alsologtostderr: false
      #    logBacktraceAt:
      #    logtostderr: true
      #    skipHeaders: false
      #    stderrthreshold: 2
      #    v: 0
      #    vmodule:
      ##   NOTE: the following options are not dynamically run-time
      ##         configurable and require a nfd-worker restart to take effect
      ##         after being changed
      #    logDir:
      #    logFile:
      #    logFileMaxSize: 1800
      #    skipLogHeaders: false
      sources:
      #  cpu:
      #    cpuid:
      ##     NOTE: whitelist has priority over blacklist
      #      attributeBlacklist:
      #        - "BMI1"
      #        - "BMI2"
      #        - "CLMUL"
      #        - "CMOV"
      #        - "CX16"
      #        - "ERMS"
      #        - "F16C"
      #        - "HTT"
      #        - "LZCNT"
      #        - "MMX"
      #        - "MMXEXT"
      #        - "NX"
      #        - "POPCNT"
      #        - "RDRAND"
      #        - "RDSEED"
      #        - "RDTSCP"
      #        - "SGX"
      #        - "SSE"
      #        - "SSE2"
      #        - "SSE3"
      #        - "SSE4.1"
      #        - "SSE4.2"
      #        - "SSSE3"
      #      attributeWhitelist:
      #  kernel:
      #    kconfigFile: "/path/to/kconfig"
      #    configOpts:
      #      - "NO_HZ"
      #      - "X86"
      #      - "DMI"
        pci:
          deviceClassWhitelist:
            - "0200"
            - "03"
            - "12"
          deviceLabelFields:
      #      - "class"
            - "vendor"
      #      - "device"
      #      - "subsystem_vendor"
      #      - "subsystem_device"
      #  usb:
      #    deviceClassWhitelist:
      #      - "0e"
      #      - "ef"
      #      - "fe"
      #      - "ff"
      #    deviceLabelFields:
      #      - "class"
      #      - "vendor"
      #      - "device"
      #  custom:
      #    - name: "my.kernel.feature"
      #      matchOn:
      #        - loadedKMod: ["example_kmod1", "example_kmod2"]
      #    - name: "my.pci.feature"
      #      matchOn:
      #        - pciId:
      #            class: ["0200"]
      #            vendor: ["15b3"]
      #            device: ["1014", "1017"]
      #        - pciId:
      #            vendor: ["8086"]
      #            device: ["1000", "1100"]
      #    - name: "my.usb.feature"
      #      matchOn:
      #        - usbId:
      #            class: ["ff"]
      #            vendor: ["03e7"]
      #            device: ["2485"]
      #        - usbId:
      #            class: ["fe"]
      #            vendor: ["1a6e"]
      #            device: ["089a"]
      #    - name: "my.combined.feature"
      #      matchOn:
      #        - pciId:
      #            vendor: ["15b3"]
      #            device: ["1014", "1017"]
      #          loadedKMod: ["vendor_kmod1", "vendor_kmod2"]
EOF
- Wait until the NFD instances are ready (if you deployed ROSA into a single AZ, expect 1 nfd-master replica instead of 3):
oc wait --for=jsonpath='{.status.numberReady}'=3 -l app=nfd-master ds -n openshift-nfd
oc wait --for=jsonpath='{.status.numberReady}'=5 -l app=nfd-worker ds -n openshift-nfd
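To see the full set of feature labels NFD applied, you can dump a node's labels; this is a quick sketch using the jq prerequisite, and the exact labels will vary with your hardware:
oc get node -l node-role.kubernetes.io/worker -ojson \
  | jq '.items[0].metadata.labels | with_entries(select(.key | startswith("feature.node")))'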
Apply Nvidia Cluster Config
We'll now apply the Nvidia cluster config. Please read the Nvidia documentation on customizing this if you have your own private repos or specific settings. This is another process that takes a few minutes to complete.
- Create the cluster config:
cat <<EOF | oc create -f -
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  migManager:
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    deployGFD: true
  dcgm:
    enabled: true
  gfd: {}
  dcgmExporter:
    config:
      name: ''
  driver:
    licensingConfig:
      nlsEnabled: false
      configMapName: ''
    certConfig:
      name: ''
    kernelModuleConfig:
      name: ''
    repoConfig:
      configMapName: ''
    virtualTopology:
      config: ''
    enabled: true
    use_ocp_driver_toolkit: true
  devicePlugin: {}
  mig:
    strategy: single
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'true'
  nodeStatusExporter:
    enabled: true
  daemonsets: {}
  toolkit:
    enabled: true
EOF
- Wait until the Cluster Policy is ready:
oc wait --for=jsonpath='{.status.state}'=ready clusterpolicy \
  gpu-cluster-policy -n nvidia-gpu-operator --timeout=600s
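If the wait times out, a reasonable way to investigate is to check the ClusterPolicy state and the operator pods directly; pod names and counts will vary by cluster:
oc get clusterpolicy gpu-cluster-policy -n nvidia-gpu-operator -o jsonpath='{.status.state}'
oc get pods -n nvidia-gpu-operator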
Validate GPU
- Verify NFD can see your GPU(s):
oc describe node -l node.kubernetes.io/instance-type=$GPU_INSTANCE_TYPE \
  | egrep 'Roles|pci-10de' | grep -v master
You should see output like:
Roles: worker
feature.node.kubernetes.io/pci-10de.present=true
- Verify the GPU Operator added the node label to your GPU nodes:
oc get node -l nvidia.com/gpu.present
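Optionally, you can also confirm that the device plugin advertises the GPU as an allocatable resource; this sketch uses the jq prerequisite and assumes a single GPU node:
oc get node -l nvidia.com/gpu.present -ojson \
  | jq '.items[0].status.allocatable["nvidia.com/gpu"]'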
- [Optional] Test GPU access using Nvidia SMI:
oc project nvidia-gpu-operator
for i in $(oc get pod -lopenshift.driver-toolkit=true --no-headers | awk '{print $1}'); do
  echo $i
  oc exec -it $i -- nvidia-smi
  echo -e '\n'
done
You should see output showing the GPUs available on the host (varies depending on GPU worker type).
- Create a Pod to run a GPU workload:
oc project nvidia-gpu-operator
cat <<EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "nvidia/samples:vectoradd-cuda11.2.1"
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    nvidia.com/gpu.present: "true"
EOF
- View the logs:
oc logs cuda-vector-add --tail=-1
Please note, if you get the error "Error from server (BadRequest): container 'cuda-vector-add' in pod 'cuda-vector-add' is waiting to start: ContainerCreating", try running "oc delete pod cuda-vector-add" and then re-run the create statement above. We've seen cases where, if this step is run before the operators have finished their work, the pod can sit in this state indefinitely.
You should see output like the following (may vary depending on GPU):
[Vector addition of 5000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
- If successful, the pod can be deleted:
oc delete pod cuda-vector-add