
Add a healthcheck script #174

Closed · wants to merge 1 commit

Conversation

@tyx commented Jun 30, 2017

To allow Docker to know if our container is really alive.
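
For context, the proposed check is deliberately minimal: it only asks whether the node responds at all. A sketch of that kind of liveness-only script (an illustration, not the exact script from this PR; it assumes the image's CLI version ships `rabbitmq-diagnostics`):

```sh
#!/bin/sh
# Minimal liveness check: only verifies that the node process responds.
# A non-zero exit is what Docker reports as "unhealthy".
rabbitmq-diagnostics -q ping
```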
@michaelklishin
Collaborator

Note that this is a very basic check. rabbitmqctl node_health_check performs a more extensive (and opinionated) check. While the idea of what a "good enough" health check is varies from team to team and environment to environment, I'm leaning towards recommending that.

@tianon @yosifkit this sounds like a useful improvement, any reason why this PR hasn't received any feedback so far?
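
For reference, the more extensive check mentioned above can be run by hand against a container (the container name `rabbit` and the `docker exec` invocation are illustrative):

```sh
# Runs RabbitMQ's opinionated node-local health check suite inside the
# container; exits non-zero if any individual check fails.
docker exec rabbit rabbitmqctl node_health_check
```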

@yosifkit
Member

I still feel the same as I did last year (linked below). I have not seen a need for general health checks in the official-images; either the container is up or it is not, and anything more granular is up to the user to define for their environment.

See also docker-library/cassandra#76 (comment).

@ilude commented Oct 18, 2017

@yosifkit Docker Swarm zero-downtime deployments are going to require health checks to operate correctly. As it stands, these official images show a status of "running", not "healthy".
For docker service update to perform a rolling update, it needs to know that the new instance is up and healthy before it will drop the old instance and direct traffic to the new one.
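
For illustration, a rolling update along the lines described above (the image tag and service name are placeholders; `--update-order start-first` is the standard Swarm option that starts the new task before stopping the old one):

```sh
# Swarm only replaces the old task once the new one reports "healthy",
# which requires the image (or the service definition) to carry a health check.
docker service update \
  --image rabbitmq:3 \
  --update-order start-first \
  rabbit
```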

In addition, I would like to address each of your previous points:

  • Users "may" have their own idea of what healthy is, and Docker provides a number of ways for them to override the stock healthcheck with their own (see the sketch after this list).

  • All of the health checks I've seen, including this one, allow the deployer to specify a network interface other than localhost. Health checks are meant to verify that the application is healthy, not that your network/firewall and any number of other things outside the application's control are properly configured. As for "some... start in a localhost only mode": this image and the others you apply this blanket reasoning to do not do this, so it is not a valid argument against including a health check.

  • Changelogs exist for this very reason. And I hardly think that a process running every three seconds, checking that RabbitMQ is accepting socket connections as expected, will generate enough load to make a system unable to keep up with its existing traffic. But if it does, it can be disabled; and if a health check is the straw that breaks the camel's back, I think most folks would prefer to find out this way rather than during a DDoS that hits a bit more often than one request every 3 seconds.
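
A sketch of the override mentioned in the first point, using the standard `docker run` health flags (the command and intervals here are illustrative, not a recommendation):

```sh
# Replaces the image's stock HEALTHCHECK (if any) with the deployer's own.
docker run -d --name rabbit \
  --health-cmd='rabbitmq-diagnostics -q ping' \
  --health-interval=30s \
  --health-timeout=10s \
  --health-retries=3 \
  rabbitmq:3
```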

In closing, please check out Kelsey Hightower's talk here: https://vimeo.com/173610242

@tianon
Member

tianon commented Nov 1, 2018

See https://github.com/docker-library/faq#healthcheck for a slightly more expanded answer on our stance on HEALTHCHECK, especially for databases like RabbitMQ where a HEALTHCHECK is very likely to actively cause harm if implemented poorly.

@michaelklishin
Collaborator

michaelklishin commented Jan 7, 2019

> I hardly think that a process running every three seconds that checks to see that rabbitmq is accepting socket connections as expected will generate enough load to cause a system to be unable to keep up with its existing traffic

@ilude you would be surprised to learn how often we (the RabbitMQ core team) see unreasonably aggressive monitoring cause issues that are hard to predict and counterproductive: high connection churn that environments are not configured to handle, opinionated health checks that query every single channel and queue in the system (potentially many thousands) every few seconds, health checks that were only tested on localhost with a nearly blank database and assume a node always takes 5 seconds to start… I could continue. In the end such health checks are always revisited, and most sensible operators arrive at the same conclusion: the nitty-gritty aspects of distributed system monitoring are team- and system-specific.

Running this basic check every 30 seconds is sufficient for most environments. rabbitmqctl node_health_check definitely should not be executed more frequently than every 30 seconds, since it is more extensive than you might think with a large number of connections and/or queues; the RabbitMQ monitoring docs now recommend that. I assume Docker Swarm has a way to make this configurable, but defaults matter, as too many users read documentation only after they have their first incident in production (I've been watching this happen for years in multiple data service communities).

Possibly the most popular Kubernetes deployment example for RabbitMQ runs liveness probes every 60 seconds and we haven't seen any false positives or other complaints ever since.

I agree that this image should include an example that demonstrates how to set up health checks and where to learn more (it is the job of RabbitMQ's own Monitoring guide to cover that well). It's important, however, not to build into this image unreasonable monitoring policies that will produce false positives and waste resources.
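
A sketch following the intervals suggested above (the values are assumptions to be adjusted per environment; `--health-start-period` is the standard Docker flag for a startup grace period):

```sh
# Cheap basic check every 30 seconds, with a grace period so a slow-booting
# node is not marked unhealthy before it finishes starting.
docker run -d --name rabbit \
  --health-cmd='rabbitmqctl -q status' \
  --health-interval=30s \
  --health-start-period=60s \
  rabbitmq:3
```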

@michaelklishin
Collaborator

@gerhard FYI.

@michaelklishin
Collaborator

@tianon if you have made up your mind on this, I suggest closing this PR and perhaps revisiting the docs (to explain how to do health checks and where to learn more). Anything else would result in more and more iterations of the same discussion.

@gerhard
Contributor

gerhard commented Jan 14, 2019

I am picking this one up. A HEALTHCHECK is valuable in the context of Docker Swarm, monitoring systems and anything that needs to determine whether a RabbitMQ node is healthy.

ping docker-library/docs#1395

@gerhard mentioned this pull request Jan 14, 2019
@michaelklishin
Collaborator

FTR, I discussed #300 with @gerhard and we will be taking small steps for now. Once rabbitmq/rabbitmq-cli#292 provides a few more "tiered" health check commands and the docs are improved, Docker Swarm users can adopt more extensive/intrusive monitoring easily if they choose to. What the default should be for this image, we believe only time will tell. Those unknown unknowns, dammit.

@gerhard
Contributor

gerhard commented Jan 15, 2019

Based on the comments in #300, I think this PR should be closed.

@michaelklishin
Collaborator

Worth mentioning here: Kubernetes seems to be moving past the One True Health Check™ idea and towards a list of both generic and system-specific checks.

naphta added a commit to naphta/charts that referenced this pull request Jul 30, 2019
Updated to reflect general recommendations from the rabbitmq docker repository (and via docker-library/rabbitmq#174 (comment)), and to support non-Alpine versions of the image: the Ubuntu image does not include wget (or curl), whereas `rabbitmqctl status` is included in both images. Additionally, the check looks at the status of the service itself rather than the status of the management plugin.

Signed-off-by: Jake Hill <[email protected]>
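
The check that commit converged on can be exercised by hand (the container name is illustrative); `rabbitmqctl status` is present in both the Alpine and Ubuntu variants, so no wget or curl is needed:

```sh
# Queries the node's runtime status via the CLI shipped in both image
# variants; exits non-zero if the node is not running.
docker exec rabbit rabbitmqctl status
```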