This repository was archived by the owner on May 6, 2020. It is now read-only.

Controller worker thread times out during config set #652

Open
kmala opened this issue Apr 21, 2016 · 7 comments

@kmala
Contributor

kmala commented Apr 21, 2016

When we run config:set on an app with 30-40 pods, the controller worker thread times out because the operation takes more than 20 minutes (the default timeout of the worker thread). This leaves the cluster in an unstable state, with pods from both releases running.

@bacongobbler
Member

tagging at rc1 since this is something we need to fix before we cut a stable release.

@bacongobbler
Member

ping @helgi; has this been fixed recently?

@helgi
Contributor

helgi commented May 19, 2016

No, and it won't be done in RC. This isn't a Kubernetes problem; it's more that we are trying to execute every operation within the timeout of the gunicorn server, which we can't extend much without causing resource contention on the RC=1 controller setup. We'd need to move to background jobs to fix this.

@helgi helgi removed this from the v2.0-rc1 milestone May 19, 2016
@helgi helgi removed the k8s label May 19, 2016
@helgi
Contributor

helgi commented May 19, 2016

Oh, what has helped is doing deploys in batches, by default rolling as many pods at a time as there are available nodes, but that only mitigates the issue in some scenarios.
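
To make the batching arithmetic concrete, here is a small Python sketch. It assumes the batch size is simply the number of available nodes (with a floor of one); `rollout_batches` is a hypothetical helper, not a function from the controller.

```python
def rollout_batches(total_pods, available_nodes):
    """Split a rollout into batch sizes of at most `available_nodes` pods."""
    # Assumption: always roll at least one pod per batch, even with no
    # schedulable nodes reported, so the loop is guaranteed to finish.
    batch_size = max(1, available_nodes)
    batches = []
    remaining = total_pods
    while remaining > 0:
        step = min(batch_size, remaining)
        batches.append(step)
        remaining -= step
    return batches


print(rollout_batches(35, 10))  # [10, 10, 10, 5]
```

With 35 pods and 10 nodes the rollout takes four batches, so each batch is short, but the total time still grows with the pod count, which is why batching only mitigates the timeout.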

@bacongobbler
Member

bacongobbler commented May 19, 2016

@kmala are you still able to reproduce this in your environment with workflow-dev? If it's a matter of waiting for all the pods to come up, then perhaps we should rethink this deployment strategy eventually, since the old way of just destroying everything and starting everything worked for 99% of our use cases, including this one. Graceful rolling deploys are nice, but if they leave us in an unstable state any time we're dealing with a larger number of jobs, then perhaps we should go back to square one.

@mboersma
Member

mboersma commented Sep 6, 2016

This situation should be improved by the batching operation of the current controller, although probably not fixed definitively.

@Cryptophobia
Contributor

This issue was moved to teamhephy/controller#66


5 participants