
OMap generator fails on PVCs created in a RADOS namespace #5140

Closed
SkalaNetworks opened this issue Feb 12, 2025 · 7 comments

@SkalaNetworks

Describe the bug

When setting up mirroring, one of the requirements is enabling the OMap generator sidecar. This sidecar runs in the provisioner pod and tries to generate the OMap of each PVC it detects. If one of those PVCs was created in a RADOS namespace within a pool, the OMap generator loops on it with error messages.

The capabilities needed to make it work also seem quite broad, which cancels out the benefit of RADOS namespaces (rook/rook#15277).

Environment details

  • Image/version of Ceph CSI driver : quay.io/cephcsi/cephcsi:v3.13.0
  • Helm chart version : Deployed by Rook using version 1.16.3 for both the operator chart and the cluster chart
  • Kernel version : Talos 1.9 (Linux 6.12.5)
  • Mounter used for mounting PVC (for CephFS it's fuse or kernel; for RBD it's krbd or rbd-nbd): krbd, I guess?
  • Kubernetes cluster version : v1.32.0 (Talos 1.9)
  • Ceph cluster version : 19.2.0 (16063ff2022298c9300e49a547a16ffda59baf13) squid (stable)

Steps to reproduce

Steps to reproduce the behavior:

  1. Deploy a Ceph/Rook cluster using the Helm Chart and the following values

Operator:

        logLevel: DEBUG
        useOperatorHostNetwork: true
        csi:
          # Necessary for volume replication and mirroring
          csiAddons:
            enabled: true
          # Necessary for volume replication and mirroring
          enableOMAPGenerator: true
          serviceMonitor:
            enabled: true
        monitoring:
          enabled: true
        # Discover new disks to show them on the dashboard
        enableDiscoveryDaemon: true
        discoveryDaemonInterval: 5m

Cluster:

        operatorNamespace: rook-system
        clusterName: "somethingsomething"
        toolbox:
          enabled: true
        monitoring:
          enabled: true

        ingress:
          dashboard:
            annotations:
              cert-manager.io/cluster-issuer: letsencrypt
            host:
              name: "something"
              path: "/"
            tls:
              - hosts:
                  - "something"
                secretName: ceph-dashboard-tls
            ingressClassName: something

        cephClusterSpec:
          dashboard:
            enabled: true
            ssl: false
            prometheusEndpoint: "http://prometheus-kube-prometheus-prometheus.prometheus-system:9090"
            prometheusEndpointSSLVerify: false

          mgr:
            modules:
              - name: rook
                enabled: true

          storage:
            useAllDevices: false
            deviceFilter: ""

          crashCollector:
            disable: false
            daysToRetain: 365

          network:
            # Use the underlying host network for performance and to expose the daemons to remote Ceph clients
            provider: host

            # We run Ceph only on IPv6 networks, bind the daemons on IPv6 addresses
            ipFamily: IPv6

            # This needs to be overriden for each cluster with the IPv6 ranges dedicated to handling storage traffic
            addressRanges:
              public:
                - xxxx
              cluster:
                - xxxx

        cephFileSystems: []
        cephObjectStores: []
        cephBlockPools:
        # Block storage to be used by remote SPX clusters
        # Device type: NVMe
        # Replication: 3x
        - name: rbd-nvme3x
          spec:
            failureDomain: host
            enableRBDStats: true
            replicated:
              size: 3
            deviceClass: nvme
            mirroring:
              enabled: true
              mode: image
              peers:
                secretNames:
                - rbd-primary-site-secret
          storageClass:
            enabled: false
  2. Create the StorageClass we'll use to trigger the bug, the peering relation with another cluster, and the RADOS namespace associated with the StorageClass:
---
apiVersion: ceph.rook.io/v1
kind: CephBlockPoolRadosNamespace
metadata:
  name: test-mirroring
  namespace: spx-storage
spec:
  blockPoolName: rbd-nvme3x

---
# This is used to peer to another Ceph cluster, probably irrelevant to this bug
apiVersion: v1
data:
  token: xxxx
kind: Secret
metadata:
  name: rbd-primary-site-secret
  namespace: spx-storage

---
apiVersion: ceph.rook.io/v1
kind: CephRBDMirror
metadata:
  name: rbd-mirror
  namespace: spx-storage
spec:
  count: 1

---
apiVersion: ceph.rook.io/v1
kind: CephClient
metadata:
  name: test-mirroring
  namespace: spx-storage
spec:
  caps: # The capabilities only allow access to the namespace, because the storage in that namespace will be consumed by customer K8s clusters that should not see other customers' storage in the pool
    mon: "profile rbd"
    mgr: "profile rbd pool=rbd-nvme3x namespace=test-mirroring"
    osd: "profile rbd pool=rbd-nvme3x namespace=test-mirroring"

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: test-mirroring
provisioner: rook-system.rbd.csi.ceph.com
parameters:
  clusterID: 64048cf2257334960538a62d68cdcd55 # Found by looking at the previously created CephBlockPoolRadosNamespace and extracting its ID (see the sketch after this manifest)
  pool: rbd-nvme3x
  imageFormat: "2"
  imageFeatures: layering,fast-diff,object-map,deep-flatten,exclusive-lock
  csi.storage.k8s.io/provisioner-secret-name: "rook-ceph-client-test-mirroring"
  csi.storage.k8s.io/provisioner-secret-namespace: spx-storage
  csi.storage.k8s.io/controller-expand-secret-name:  "rook-ceph-client-test-mirroring"
  csi.storage.k8s.io/controller-expand-secret-namespace: spx-storage
  csi.storage.k8s.io/node-stage-secret-name: "rook-ceph-client-test-mirroring"
  csi.storage.k8s.io/node-stage-secret-namespace: spx-storage
  csi.storage.k8s.io/fstype: ext4
allowVolumeExpansion: true
reclaimPolicy: Delete
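
For reference, the clusterID above identifies the RADOS namespace rather than the Ceph cluster itself. A minimal sketch of where it can be read from once Rook has reconciled the CR, assuming the status layout of current Rook releases (status.info.clusterID; field names may differ slightly):

---
apiVersion: ceph.rook.io/v1
kind: CephBlockPoolRadosNamespace
metadata:
  name: test-mirroring
  namespace: spx-storage
spec:
  blockPoolName: rbd-nvme3x
status:
  phase: Ready
  info:
    clusterID: 64048cf2257334960538a62d68cdcd55 # value consumed by the StorageClass above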
  3. We now have a StorageClass to deploy RBD PVCs inside the RADOS namespace "test-mirroring" in pool "rbd-nvme3x".
  4. Create a PVC and mount it:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-mirroring
  namespace: default
spec:
  storageClassName: "test-mirroring"
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi

---
apiVersion: v1
kind: Pod
metadata:
  name: debug
  namespace: default
spec:
  containers:
  - name: debug-container
    image: busybox:latest
    command: ["sh", "-c", "while true; do sleep 3600; done"]
    volumeMounts:
    - name: storage-volume
      mountPath: /mnt/storage
  volumes:
  - name: storage-volume
    persistentVolumeClaim:
      claimName: test-mirroring
  5. The PVC works, we can write to it, everything seems OK.
  6. Look at the csi-provisioner elected as leader; it will have the following errors:
     [screenshot: error logs from the csi-omap-generator sidecar]
  7. Apparently, the steps to create the OMap are performed using the capabilities of the account linked to the StorageClass. To debug this, let's give that account rights to the whole pool rather than just the namespace (this is bad, because it renders RADOS namespaces useless if customers have access to a Ceph account with capabilities on the pool and not just their namespace); the widened capabilities are sketched after this list.

     [screenshot]
  8. Restart the provisioner to relaunch the OMap generation.
  9. Now the logs are different: we managed to get past the operation where our capabilities weren't sufficient, but it now fails to list the RBD image, as if it were searching for it in the POOL and not in the NAMESPACE within that pool.

     [screenshot: provisioner logs showing the failed RBD image lookup]
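
For clarity, here is a minimal sketch of the widened capabilities referred to in step 7; the namespace restriction is simply dropped, so the client can reach the whole pool (which defeats the isolation RADOS namespaces are meant to provide):

---
# Sketch of the pool-wide capabilities used only while debugging step 7
apiVersion: ceph.rook.io/v1
kind: CephClient
metadata:
  name: test-mirroring
  namespace: spx-storage
spec:
  caps:
    mon: "profile rbd"
    mgr: "profile rbd pool=rbd-nvme3x"
    osd: "profile rbd pool=rbd-nvme3x"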

Actual results

The OMap isn't generated for the PVC, which will probably render mirroring impossible.

Expected behavior

We can create OMaps for volumes inside a RADOS namespace (and hopefully with capabilities for that namespace only)

Additional context

Full GitOps setup on Talos + Rook

@Madhu-1
Collaborator

Madhu-1 commented Feb 12, 2025

This should have been fixed by #5099. @iPraveenParihar, can you please check if it's the same and confirm?

@iPraveenParihar
Contributor

Yes, you are right, @Madhu-1. The reported issue is fixed by #5099 and backported (#5100) to v3.13.
@SkalaNetworks, the fix will be available in v3.13.1.

@SkalaNetworks
Author

Thanks @iPraveenParihar, do you have an ETA on the release? Is there a "latest" tag I can use to deploy the dev version and see if it fixes the problem? I still have an open question about the Ceph capabilities, which block the creation of the OMap in the pool if you only allow the user to access its RADOS namespace, and I'd like to verify that.

@iPraveenParihar
Contributor

iPraveenParihar commented Feb 12, 2025

Thanks @iPraveenParihar, do you have an ETA on the release?

@Rakshith-R, do we know when the v3.13.1 release is?

Is there a "latest" tag I can try to deploy the dev version and see if it fixes the problem?

You can use the canary image tag.
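
A minimal sketch of how that could be wired up through the Rook operator ConfigMap, assuming the ROOK_CSI_CEPH_IMAGE override documented by Rook and the rook-system operator namespace used above:

---
# Sketch: point Rook's CSI deployments at the cephcsi canary image
apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-ceph-operator-config
  namespace: rook-system
data:
  ROOK_CSI_CEPH_IMAGE: "quay.io/cephcsi/cephcsi:canary"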

I still have an unknown concerning the ceph capabilities which block the creation of the OMap in the pool if you only allow the user to access its RADOS namespace and I'd like to verify that.

The "profile rbd pool=rbd-nvme3x namespace=test-mirroring" OSD capability should work just fine.

@SkalaNetworks
Author

Right, I tested the canary version (only on the csi-omap-generator container) and there are no errors visible anymore.

Switching to debug mode (-v=5):

Without the "namespace=test-mirroring" capability restriction:

I0212 12:24:43.327132       1 omap.go:89] got omap values: (pool="rbd-nvme3x", namespace="test-mirroring", name="csi.volumes.default"): map[csi.volume.pvc-f195e544-10e4-4c36-99bb-c2ed334413e7:37836ab6-5a83-4ffe-986c-01e26a478eb6]

With the restriction:

I0212 12:26:49.754382       1 omap.go:89] got omap values: (pool="rbd-nvme3x", namespace="test-mirroring", name="csi.volumes.default"): map[csi.volume.pvc-f195e544-10e4-4c36-99bb-c2ed334413e7:37836ab6-5a83-4ffe-986c-01e26a478eb6]

I guess that's a win?
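
As an aside, a minimal sketch of how the -v=5 verbosity can be set cluster-wide through the Rook operator ConfigMap instead of editing container args by hand, assuming the CSI_LOG_LEVEL override documented by Rook:

---
# Sketch: raise the log verbosity of all Ceph CSI containers (equivalent to -v=5)
apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-ceph-operator-config
  namespace: rook-system
data:
  CSI_LOG_LEVEL: "5"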

@iPraveenParihar
Contributor

iPraveenParihar commented Feb 13, 2025

Yes, @SkalaNetworks, that's correct 👍.

If there is nothing more on this issue, feel free to close it 😄.

@SkalaNetworks
Author

SkalaNetworks commented Feb 13, 2025

Thanks, I'll wait for the 3.13.1 release to deploy the fix properly!
