
Add a healthcheck script #174

Closed · wants to merge 1 commit

Conversation

@tyx commented Jun 30, 2017

To allow Docker to know if our container is really alive.
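
For context, the proposed check is deliberately minimal: it only asks whether the node responds at all. A sketch of that kind of liveness-only script (an illustration, not the exact script from this PR; it assumes the image's CLI version ships `rabbitmq-diagnostics`):

```sh
#!/bin/sh
# Minimal liveness check: only verifies that the node process responds.
# A non-zero exit is what Docker reports as "unhealthy".
rabbitmq-diagnostics -q ping
```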
@michaelklishin
Collaborator

Note that this is a very basic check. rabbitmqctl node_health_check performs a more extensive (and opinionated) check. While the idea of what a "good enough" health check is varies from team to team and environment to environment, I'm leaning towards recommending that.

@tianon @yosifkit this sounds like a useful improvement, any reason why this PR hasn't received any feedback so far?
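
For reference, the more extensive check mentioned above can be run by hand against a container (the container name `rabbit` and the `docker exec` invocation are illustrative):

```sh
# Runs RabbitMQ's opinionated node-local health check suite inside the
# container; exits non-zero if any individual check fails.
docker exec rabbit rabbitmqctl node_health_check
```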

@yosifkit
Member

I still feel the same as I did last year (linked below). I have not seen a need for general health checks in the official-images; either the container is up or it is not, and anything more granular is up to the user to define for their environment.

See also docker-library/cassandra#76 (comment).

@ilude commented Oct 18, 2017

@yosifkit Docker Swarm zero-downtime deployments are going to require health checks to operate correctly. As it stands, these official images show a status of "running", not "healthy".
For docker service update to perform a rolling update, it needs to know that the new instance is up and healthy before it will drop the old instance and direct traffic to the new one.
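
For illustration, a rolling update along the lines described above (the image tag and service name are placeholders; `--update-order start-first` is the standard Swarm option that starts the new task before stopping the old one):

```sh
# Swarm only replaces the old task once the new one reports "healthy",
# which requires the image (or the service definition) to carry a health check.
docker service update \
  --image rabbitmq:3 \
  --update-order start-first \
  rabbit
```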

In addition, I would like to address each of your previous points:

  • Users "may" have their own idea of what healthy is, and Docker provides a number of ways for them to override the stock healthcheck with their own (see the sketch after this list).

  • All of the health checks I've seen, including this one, allow the deployer to specify a network interface other than localhost. Health checks are meant to verify that the application is healthy, not that your network/firewall and any number of other things outside the application's control are properly configured. As for "some... start in a localhost only mode": this image and the others you apply this blanket reasoning to do not do this, so it is not a valid argument against including a health check.

  • Changelogs exist for this very reason. And I hardly think that a process running every three seconds, checking that RabbitMQ is accepting socket connections as expected, will generate enough load to make a system unable to keep up with its existing traffic. But if it does, it can be disabled; and if a health check is the straw that breaks the camel's back, I think most folks would prefer to find out this way rather than during a DDoS that hits a bit more often than one request every 3 seconds.
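
A sketch of the override mentioned in the first point, using the standard `docker run` health flags (the command and intervals here are illustrative, not a recommendation):

```sh
# Replaces the image's stock HEALTHCHECK (if any) with the deployer's own.
docker run -d --name rabbit \
  --health-cmd='rabbitmq-diagnostics -q ping' \
  --health-interval=30s \
  --health-timeout=10s \
  --health-retries=3 \
  rabbitmq:3
```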

In closing, please check out Kelsey Hightower's talk here: https://vimeo.com/173610242

@tianon
Member

tianon commented Nov 1, 2018

See https://github.com/docker-library/faq#healthcheck for a slightly more expanded answer on our stance on HEALTHCHECK, especially for databases like RabbitMQ where a HEALTHCHECK is very likely to actively cause harm if implemented poorly.

@michaelklishin
Collaborator

michaelklishin commented Jan 7, 2019

> I hardly think that a process running every three seconds that checks to see that rabbitmq is accepting socket connections as expected will generate enough load to cause a system to be unable to keep up with its existing traffic

@ilude you would be surprised to learn how often we (the RabbitMQ core team) see unreasonably aggressive monitoring cause issues that are hard to predict and counterproductive: high connection churn that environments are not configured to handle, opinionated health checks that query every single channel and queue in the system (potentially many thousands) every few seconds, health checks that were only tested on localhost with a nearly blank database and assume a node always takes 5 seconds to start… I could continue. In the end such health checks are always revisited, and most sensible operators arrive at the same conclusion: the nitty-gritty aspects of distributed system monitoring are team- and system-specific.

Running this basic check every 30 seconds is sufficient for most environments. rabbitmqctl node_health_check definitely should not be executed more frequently than every 30 seconds, since it is more extensive than you might think with a large number of connections and/or queues; the RabbitMQ monitoring docs now recommend that. I assume Docker Swarm has a way to make this configurable, but defaults matter, as too many users read documentation only after they have their first incident in production (I've been watching this happen for years in multiple data service communities).

Possibly the most popular Kubernetes deployment example for RabbitMQ runs liveness probes every 60 seconds and we haven't seen any false positives or other complaints ever since.

I agree that this image should include an example that demonstrates how to set up health checks and where to learn more (it is the job of RabbitMQ's own Monitoring guide to cover that well). It's important, however, not to build into this image unreasonable monitoring policies that will produce false positives and waste resources.
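
A sketch following the intervals suggested above (the values are assumptions to be adjusted per environment; `--health-start-period` is the standard Docker flag for a startup grace period):

```sh
# Cheap basic check every 30 seconds, with a grace period so a slow-booting
# node is not marked unhealthy before it finishes starting.
docker run -d --name rabbit \
  --health-cmd='rabbitmqctl -q status' \
  --health-interval=30s \
  --health-start-period=60s \
  rabbitmq:3
```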

@michaelklishin
Collaborator

@gerhard FYI.

@michaelklishin
Collaborator

@tianon if you have made up your mind on this, I suggest closing this PR and perhaps revisiting the docs (to explain how to do health checks and where to learn more). Anything else would result in more and more iterations of the same discussion.

@gerhard
Contributor

gerhard commented Jan 14, 2019

I am picking this one up. A HEALTHCHECK is valuable in the context of Docker Swarm, monitoring systems and anything that needs to determine whether a RabbitMQ node is healthy.

ping docker-library/docs#1395

@gerhard mentioned this pull request Jan 14, 2019
@michaelklishin
Collaborator

FTR, I discussed #300 with @gerhard and we will be taking small steps for now. Once rabbitmq/rabbitmq-cli#292 provides a few more "tiered" health check commands and the docs are improved, Docker Swarm users can adopt more extensive/intrusive monitoring easily if they choose to. What the default should be for this image, we believe only time will tell. Those unknown unknowns, dammit.

@gerhard
Contributor

gerhard commented Jan 15, 2019

Based on the comments in #300, I think this PR should be closed.

@michaelklishin
Collaborator

Worth mentioning here: Kubernetes seems to be moving past the One True Health Check™ idea and towards a list of both generic and system-specific checks.

naphta added a commit to naphta/charts that referenced this pull request Jul 30, 2019
Updated to reflect general recommendations from the rabbitmq docker repository (and via docker-library/rabbitmq#174 (comment)), and to support non-Alpine versions of the image: the Ubuntu image does not include wget (or curl), whereas `rabbitmqctl status` is included in both images. Additionally, the check looks at the status of the service itself rather than the status of the management plugin.

Signed-off-by: Jake Hill <[email protected]>
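
The check that commit converged on can be exercised by hand (the container name is illustrative); `rabbitmqctl status` is present in both the Alpine and Ubuntu variants, so no wget or curl is needed:

```sh
# Queries the node's runtime status via the CLI shipped in both image
# variants; exits non-zero if the node is not running.
docker exec rabbit rabbitmqctl status
```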