Cannot modify the GPU Operator ClusterPolicy before deploying it #140

kpouget · 2021-05-05T12:04:39Z

Currently, the GPU Operator ClusterPolicy is fetched from the ClusterServiceVersion alm-examples annotation and instantiated right away.

However, in some cases the default content is not what we want. See for instance this unmerged commit, where we need to set the repoConfig stanza when running on OCP 4.8 (using RHEL beta repositories).

    toolbox/gpu-operator/deploy_from_operatorhub.sh 
    [...]

    if oc version | grep -q "Server Version: 4.8"; then
        echo "Running on OCP 4.8, enabling RHEL beta repository"
        ./toolbox/gpu-operator/set_repo-config.sh --rhel-beta
    fi

    toolbox/gpu-operator/wait_deployment.sh

Another example is when we want to point the operator or operand images at custom image paths.
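
To make the image case concrete, here is a minimal sketch of such a modification applied to the ClusterPolicy JSON before it is created. It assumes `jq` is available and that the driver image is described by `spec.driver.repository`, `spec.driver.image` and `spec.driver.version`; the function name `set_driver_image` and the example values are hypothetical and not part of the toolbox.

```shell
# Sketch only: rewrite the driver image coordinates of a ClusterPolicy.
# Reads the CR as JSON on stdin and prints the modified CR on stdout.
# Assumption: the driver image is described by spec.driver.repository,
# spec.driver.image and spec.driver.version.
set_driver_image() {
    # $1: repository, $2: image name, $3: version (tag)
    jq --arg repo "$1" --arg image "$2" --arg version "$3" \
       '.spec.driver.repository = $repo
        | .spec.driver.image = $image
        | .spec.driver.version = $version'
}
```

It could be used as a filter, for example `set_driver_image registry.example.com/psap driver-custom 1.0 < cluster_policy.json`, where the registry, image name and file name are placeholders.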

The GPU Operator DaemonSets are never updated once created, so if they are created with the wrong values, the DaemonSets will never be fixed.

The hack above works (hopefully) because the driver container fails to deploy without the right repoConfig configuration, so it is safe to manually delete it after the update. In the general case, however, the driver container should never be deleted once running, as the NVIDIA driver cannot be removed from the kernel while other processes (workload or operand) are using it.

We should find a way to allow patching the ClusterPolicy before deploying it. The solution should be generic, so that any kind of modification can be performed during the deployment.


kpouget · 2021-05-05T14:41:18Z

The way I see it:

1. deploy the operator but don't create the CR
2. fetch the CR from the CSV
3. modify the CR as needed
4. deploy the CR

Currently 1-4 are performed in one command, so it must be split so that 3 can be easily controlled.

2 is trivial.
4 is trivial as well, but we currently perform state capture in case of a failure (step 4 actually fails only if the CR is not available, so this can be moved to 1).
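
The split described above could look roughly like the sketch below. It is not the existing toolbox code: the function names, the `CSV_NAMESPACE` variable and its default, and the use of `jq` for the modification step are all assumptions made for illustration. Step 1 (deploying the operator without the CR) is left out.

```shell
# Sketch only: steps 2-4 as separate, composable functions.
# Assumptions: oc and jq are available; names below are hypothetical.
set -eu

CSV_NAMESPACE="${CSV_NAMESPACE:-openshift-operators}"  # assumed namespace

# Step 2: print the ClusterPolicy example stored in the CSV as JSON.
# $1: name of the ClusterServiceVersion
fetch_cr() {
    oc get csv "$1" -n "$CSV_NAMESPACE" \
        -o jsonpath='{.metadata.annotations.alm-examples}' \
        | jq '.[] | select(.kind == "ClusterPolicy")'
}

# Step 3: apply an arbitrary jq filter to the CR read from stdin.
# $1: jq filter (defaults to the identity filter, i.e. no change)
patch_cr() {
    jq "${1:-.}"
}

# Step 4: create the (possibly modified) CR read from stdin.
deploy_cr() {
    oc apply -f -
}
```

With this split, the repoConfig case becomes a single pipeline such as `fetch_cr "$csv_name" | patch_cr '.spec.driver.repoConfig.configMapName = "repo-config"' | deploy_cr`, where `$csv_name` and the ConfigMap name are placeholders. Keeping step 3 as a plain stdin/stdout filter is what makes the solution generic: any number of modifications can be chained before the CR is created.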

kpouget · 2021-05-05T14:42:07Z

/assign @omertuc

Omer, please keep this issue in mind when you work on the automation of Power image deployment

openshift-ci-robot assigned omertuc May 5, 2021

kpouget added the kind/feature label May 10, 2021

kpouget mentioned this issue May 25, 2021

WIP: Build GPU Operator operands to internal registry and deploy GPU operator with them #172

Closed

kpouget closed this as completed Apr 3, 2022
