This repository was archived by the owner on Aug 16, 2023. It is now read-only.

Cannot modify the GPU Operator ClusterPolicy before deploying it #140

Closed
kpouget opened this issue May 5, 2021 · 2 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

Collaborator

kpouget commented May 5, 2021

Currently, the GPU Operator ClusterPolicy is fetched from the alm-examples annotation of the ClusterServiceVersion (CSV) and instantiated right away.

However, in some cases the default content is not what we want. See for instance this unmerged commit, where we need to set the repoConfig stanza when running on OCP 4.8 (using RHEL beta repositories).

    toolbox/gpu-operator/deploy_from_operatorhub.sh 
    [...]

    if oc version | grep -q "Server Version: 4.8"; then
        echo "Running on OCP 4.8, enabling RHEL beta repository"
        ./toolbox/gpu-operator/set_repo-config.sh --rhel-beta
    fi

    toolbox/gpu-operator/wait_deployment.sh

Another example is customizing the operator or operand image paths to point to custom images.

The GPU Operator DaemonSets are never updated once created, so if they are created with the wrong values, the DaemonSets will never be fixed.

The hack above works (hopefully) because the driver container fails to deploy without the right repoConfig configuration, so it is safe to delete it manually after the update. In the general case, however, the driver container should never be deleted once it is running, because the NVIDIA driver cannot be removed from the kernel while other processes (workload or operand) are using it.


We should find a way to allow patching the ClusterPolicy before deploying it. The solution should be generic, so that any kind of modification can be performed during the deployment.

Collaborator Author

kpouget commented May 5, 2021

The way I see it:

  1. deploy the operator but don't create the CR
  2. fetch the CR from the CSV
  3. patch the CR as needed
  4. deploy the CR

Currently 1-4 are performed in one command, so it must be split so that 3 can be easily controlled.

2 is trivial.
4 is trivial as well, but we currently perform state capture in case of a failure (step 4 actually fails when the CR is not available, so this can be moved to 1).
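
To make the proposed split concrete, here is a minimal sketch of fetching the CR from the CSV, customizing it, and only then deploying it. It is not the toolbox's actual implementation: the function names, the NAMESPACE default, and the use of jq as the patching tool are all assumptions made for illustration.

```shell
#!/bin/bash
# Sketch only: function names and the NAMESPACE default are hypothetical,
# not part of the existing toolbox scripts. Requires oc and jq.

NAMESPACE="${NAMESPACE:-openshift-operators}"  # assumed install namespace

# Print the first example CR stored in the alm-examples annotation
# of the CSV named $1.
fetch_cr_from_csv() {
    local csv_json
    csv_json="$(oc get csv "$1" -n "$NAMESPACE" -o json)" || return 1
    # alm-examples is a JSON-encoded string holding an array of example CRs;
    # -e makes jq fail if the result is null (no example found).
    jq -e '.metadata.annotations["alm-examples"] | fromjson | .[0]' <<< "$csv_json"
}

# Apply the jq filter $1 to the CR read from stdin.
customize_cr() {
    jq "$1"
}

# Create (or update) the CR read from stdin.
deploy_cr() {
    oc apply -f -
}

# Fetch, customize, then deploy: $1 is the CSV name, $2 the jq filter.
deploy_custom_cr() {
    local csv_name="$1" jq_filter="$2" cr
    cr="$(fetch_cr_from_csv "$csv_name")" || return 1
    cr="$(customize_cr "$jq_filter" <<< "$cr")" || return 1
    deploy_cr <<< "$cr"
}
```

With this split, the OCP 4.8 case could be handled by passing a filter such as `.spec.driver.repoConfig.configMapName = "repo-config"` as the second argument of `deploy_custom_cr` (the ConfigMap name is illustrative). Because the filter is arbitrary, the same entry point would cover custom image paths as well.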

Collaborator Author

kpouget commented May 5, 2021

/assign @omertuc

Omer, please keep this issue in mind when you work on the automation of the Power image deployment.
