Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Option to deploy Instrumentation resource when operator sub chart is disabled #1650

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

icicimov
Copy link

@icicimov icicimov commented Feb 8, 2025

Providing an option to deploy the Instrumentation resource when the internal operator sub chart is disabled, for example when managed by external opentelemetry-operator chart.

Motivation

We are using ArgoCD to deploy this chart to EKS and is not working. The operator Pod is stuck creating:

$ kc get pods -n otel
NAME                                                         READY   STATUS              RESTARTS   AGE
splunk-otel-collector-agent-6prc9                            1/1     Running             0          24h
splunk-otel-collector-agent-7mvkw                            1/1     Running             0          24h
splunk-otel-collector-agent-dnxlc                            1/1     Running             0          24h
splunk-otel-collector-k8s-cluster-receiver-b8844c489-b4rwl   1/1     Running             0          24h
splunk-otel-collector-operator-5b45ff497f-5lzqn              0/2     ContainerCreating   0          24h

because of missing splunk-otel-collector-operator-controller-manager-service-cert secret mount point:

Events:
  Type     Reason       Age                  From     Message
  ----     ------       ----                 ----     -------
  Warning  FailedMount  21s (x719 over 24h)  kubelet  MountVolume.SetUp failed for volume "cert" : secret "splunk-otel-collector-operator-controller-manager-service-cert" not found

which in turn is caused by the missing Certificate resource:

$ kc get cert -n otel
No resources found in otel namespace.

The reason for this are the Helm hooks that prevent the Certificate and Issuer from being deployed, see https://argo-cd.readthedocs.io/en/stable/user-guide/resource_hooks/ and https://argo-cd.readthedocs.io/en/stable/user-guide/helm/#helm-hooks to understand why and how the Helm hooks are being mapped into ArgoCD resource hooks.

In other words, the Helm hooks in the Chart are creating chicken and egg problem: the Issuer and the Certificate can not get deployed until all other resources are deployed and healthy which can not happen without them already being installed.

The certificateAnnotations and issuerAnnotations have hard coded Helm hooks in the Chart's values file:

operator:
  enabled: true
  crds:
    create: false
  admissionWebhooks:
    certManager:
      # Annotate the certificate and issuer to ensure they are created after the cert-manager CRDs have been installed.
      certificateAnnotations:
        "helm.sh/hook": post-install,post-upgrade
        "helm.sh/hook-weight": "1"
      issuerAnnotations:
        "helm.sh/hook": post-install,post-upgrade
        "helm.sh/hook-weight": "1"

that we can't override via our custom values file no matter what we do. Setting them as empty has no effect:

  admissionWebhooks:
    certManager:
      certificateAnnotations: {}
      issuerAnnotations: {}

the hooks still get rendered in the Certificate and Issuer manifests. And setting any other value in those objects only gets added to the hard coded ones instead of replacing the values. For example this settings:

  admissionWebhooks:
    certManager:
      certificateAnnotations:
        annotation1: foo
        annotation2: bar
      issuerAnnotations: {}

produces this Certificate manifest with the hard coded hooks still rendered:

---
# Source: splunk-otel-collector/charts/operator/templates/certmanager.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  annotations:
    annotation1: foo
    annotation2: bar
    helm.sh/hook: post-install,post-upgrade
    helm.sh/hook-weight: "1"
  labels:
    helm.sh/chart: operator-0.71.2
    app.kubernetes.io/name: operator
    app.kubernetes.io/version: "0.110.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/instance: splunk

    app.kubernetes.io/component: webhook
  name: splunk-operator-serving-cert
  namespace: otel
spec:
  dnsNames:
    - splunk-operator-webhook.otel.svc
    - splunk-operator-webhook.otel.svc.cluster.local
  issuerRef:
    kind: Issuer
    name: splunk-operator-selfsigned-issuer
  secretName: splunk-operator-controller-manager-service-cert
  subject:
    organizationalUnits:
      - splunk-operator

when running helm template --debug ... for the chart.

Workaround

To workaround this problem we deploy the operator and the otel-collector as two separate charts via dependencies:

dependencies:
  - name: splunk-otel-collector
    version: 0.118.0
    repository: https://signalfx.github.io/splunk-otel-collector-chart
  - name: opentelemetry-operator
    version: 0.79.0
    repository: https://open-telemetry.github.io/opentelemetry-helm-charts

and disable the operator sub chart in the otel-collector chart:

operator:
  enabled: false

This works well (because both problematic annotations are actually empty {} in the opentelemetry-operator chart) BUT because of that the Instrumentation resource does not get deployed because it depends on operator.enabled being true. Hence the PR.

Conclusion

Unless you have a suggestion on how to resolve the ArgoCD issues with the Helm hooks preventing resources from deploying I hope this change makes sense being simple as it is and non intrusive to the chart's overall operation.

Other info

There is a feature request open in the ArgoCD project argoproj/argo-cd#4331 providing for exactly the user case we have with this chart which is an option to tell ArgoCD to ignore some resources Helm hooks that may cause the issues we are seeing.

@icicimov icicimov requested review from a team as code owners February 8, 2025 10:46
Copy link
Contributor

github-actions bot commented Feb 8, 2025

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@icicimov
Copy link
Author

icicimov commented Feb 8, 2025

I have read the CLA Document and I hereby sign the CLA

srv-gh-o11y-gdi-cla added a commit to splunk/cla-agreement that referenced this pull request Feb 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant