
Operator ignores OperatorConfiguration changes #1315

Open
vladimirfx opened this issue Jan 15, 2021 · 5 comments

@vladimirfx

  • Which image of the operator are you using? - registry.opensource.zalan.do/acid/postgres-operator:v1.6.0
  • Where do you run it - cloud or metal? - Bare Metal K8s (microk8s)
  • Are you running Postgres Operator in production? - yes
  • Type of issue? - Bug report

The operator successfully reads its own OperatorConfiguration resource on startup but ignores subsequent updates to it. I can't find any params/env vars in the docs to configure this behavior.

As a workaround, we are forced to delete the operator pod.
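For example (assuming the operator Deployment is named postgres-operator, as in the Helm chart; the config file name here is illustrative):

```shell
# Re-apply the changed configuration, then force the operator pod to be replaced
kubectl apply -f operatorconfiguration.yaml
kubectl rollout restart deployment/postgres-operator
```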

@FxKu
Member

FxKu commented Jan 18, 2021

This is the expected behavior; changes are not dynamically loaded. Ideally you have a deployment pipeline that ships the operator together with its configuration and runs whenever there's a change, so that the operator pod is replaced.
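A common way to wire that up in Helm-based pipelines is a checksum annotation on the operator Deployment's pod template, so any change to the rendered configuration rolls the pod. A sketch of that pattern (the template path and annotation key are illustrative, not necessarily what the chart actually uses):

```yaml
spec:
  template:
    metadata:
      annotations:
        # Re-rendering the config changes the hash, which triggers a rollout
        checksum/config: {{ include (print $.Template.BasePath "/operatorconfiguration.yaml") . | sha256sum }}
```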

@FxKu FxKu added the wontfix label Jan 18, 2021
@vladimirfx
Author

Thank you for the response!
Actually, we have a GitOps pipeline which is declarative, so a change to a resource is enough to apply it. That works for 20+ operators/controllers, but not for Postgres Operator.
An operator (more generally, a controller: https://kubernetes.io/docs/concepts/architecture/controller/) is by definition a state-syncing mechanism:

A controller tracks at least one Kubernetes resource type. These objects have a spec field that represents the desired state. The controller(s) for that resource are responsible for making the current state come closer to that desired state.

So if postgres-operator is actually a controller for the OperatorConfiguration resource, it should sync the state of the resources it controls, or at least do so in a configurable way. Alternatively, this could be implemented with the (now deprecated) ConfigMap configuration mechanism.

Please consider implementing a control loop for the operator's configuration, gated by an env variable on the operator pod (disabled by default). I could provide a PR if desired.

@onelapahead

> This is the expected behavior; changes are not dynamically loaded. Ideally you have a deployment pipeline that ships the operator together with its configuration and runs whenever there's a change, so that the operator pod is replaced.

Within the Helm chart, https://github.com/zalando/postgres-operator/blob/master/charts/postgres-operator/templates/deployment.yaml#L23 appears to attempt to achieve that. But Helm does not order resource updates in a way that guarantees this actually works: https://github.com/helm/helm/blob/main/pkg/releaseutil/kind_sorter.go#L28-L68. Custom resources themselves are updated last.

We've observed the operator pod roll due to a combination of a config change and an operator image change, but when it starts up, the OperatorConfiguration is still in the middle of being updated (likely partly because our chart is a larger bundle of charts for bootstrapping, so there are lots of resources and custom resources). Because the OperatorConfiguration is outdated at that point, the operator never sees the config changes, and we have had very confusing, "non-deterministic" windows for when our Postgres clusters might upgrade: if the operator restarts for any reason (say, a Node going down), you're all of a sudden updating the Postgres clusters. Paired with a serious bug in newer images like #2747, this has nearly been catastrophic for us, and we were lucky to notice it all at once.

So I would assert:

  1. This design is very counterintuitive because the authors chose to use a CRD for configuration. If it had been a simple ConfigMap, no such ordering and dependency issues would exist, nor would it be counterintuitive to users that the operator does not dynamically load config changes. A CRD, as @vladimirfx mentioned, implies a control loop.
  2. The Helm chart has a "bug" because it depends on timing rather than properly guaranteeing order (which can't be done without hooks, which would be a breaking chart change and aren't really intended for this use case anyway).

I think a fix could be an env var to enable such new behavior, like @vladimirfx suggested, but rather than going to the trouble of a full control loop, the operator would simply exit gracefully and Kubernetes would restart it via the regular container lifecycle. On startup, the operator loads the new configuration through its regular startup procedure and doesn't have to account for dynamic config changes during in-flight reconciles. A sketch of what I mean is below.
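Roughly, the watch-and-exit piece could look like this (a minimal sketch, not the operator's actual code: the ENABLE_CONFIG_RELOAD_EXIT env var name is made up, and a real implementation would run alongside the operator's existing informers):

```go
package main

import (
	"log"
	"os"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// Proposed opt-in gate: disabled by default, enabled via the pod's env.
	if os.Getenv("ENABLE_CONFIG_RELOAD_EXIT") != "true" {
		return
	}

	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("in-cluster config: %v", err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("dynamic client: %v", err)
	}

	// GVR of the operator's OperatorConfiguration CRD.
	gvr := schema.GroupVersionResource{
		Group:    "acid.zalan.do",
		Version:  "v1",
		Resource: "operatorconfigurations",
	}

	// Resync of 0 and namespace "" = no periodic resync, all namespaces.
	factory := dynamicinformer.NewFilteredDynamicSharedInformerFactory(client, 0, "", nil)
	informer := factory.ForResource(gvr).Informer()

	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			oldU := oldObj.(*unstructured.Unstructured)
			newU := newObj.(*unstructured.Unstructured)
			if oldU.GetResourceVersion() == newU.GetResourceVersion() {
				return // informer resync, not a real change
			}
			// Exit gracefully; Kubernetes restarts the pod, which then
			// reads the updated configuration on its normal startup path.
			log.Println("OperatorConfiguration changed, exiting to pick up new config")
			os.Exit(0)
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	<-stop // in the real operator this would run beside the main sync loop
}
```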

Would that be an acceptable solution while the maintainers consider how a full control loop could/should be implemented?

@vladimirfx
Author

Unfortunately, I can't provide a PR even if the idea is accepted. We migrated off the Zalando operator mainly because of this issue (it can't be used with GitOps without ugly workarounds) and because of Spilo lock-in (it is a tough task to move away from Spilo-based deployments).
Overall, the Zalando operator is a robust and full-featured solution - thanks to its maintainers!

@onelapahead

Happy to make the contribution; I'll likely pose the same question in the form of a PR.
