Add healthcheck #300

Closed
wants to merge 8 commits into from

@gerhard gerhard commented Jan 14, 2019

RabbitMQ healthchecks are hard, as captured by @michaelklishin in
#174 (comment)
and followed up in rabbitmq/rabbitmq-cli#292

This is an attempt to simplify the inherent complexity by taking the
first step in what will likely become a new rabbitmqctl diagnostics
command. To follow up on this idea, join rabbitmq/rabbitmq-cli#292.

Things that we might want to discuss:

  • is 30s a reasonable amount of time to wait before starting to check a RabbitMQ node for healthiness?
  • should we account for nodes that can take a long time to boot?
  • should we retry longer before considering the node unhealthy?
  • is 3s a reasonable command timeout?
  • should we improve the active listeners check?
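
For readers following the discussion points above, here is a minimal sketch of how these timings map onto Docker's `HEALTHCHECK` options. It is not the PR's actual instruction, which is not shown in this thread: the `--interval` and `--retries` values and the choice of `rabbitmq-diagnostics -q ping` as the check command are assumptions for illustration.

```dockerfile
# Sketch only: not the HEALTHCHECK from this PR.
# --start-period=30s : failed checks in the first 30s do not count towards --retries
# --timeout=3s       : a single check that runs longer than 3s counts as a failure
# --interval, --retries : assumed values, not taken from the PR
# Assumes a RabbitMQ version whose CLI ships `rabbitmq-diagnostics ping`.
HEALTHCHECK --start-period=30s --interval=10s --timeout=3s --retries=3 \
    CMD rabbitmq-diagnostics -q ping
```

With these assumed values, a node that stops responding after the start period would be marked unhealthy after three consecutive failed checks, so slow-booting nodes would need a longer `--start-period` rather than more retries.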

Supersedes #174
Related to docker-library/docs#1395

I've based this on PR #297 since the current debian image won't docker build due to openssl failures, as captured by the CI.

This is a first commit; the feature as a whole is WIP. The next steps
are captured in a TODO at the tail of the Dockerfile. I am sharing this
early so that we can discuss the direction that this is going in.
Various decisions made in the Dockerfile have been captured inline in
comments, which should make for a good PR discussion.

The primary goal is to upgrade the Erlang/OTP version to latest stable,
which is v21.2.2 at the time of this commit. RabbitMQ v3.7.x will stop
supporting Erlang v20.x in September 2019 (~8 months from now). RabbitMQ
v3.8.x will only support Erlang v21.x. RabbitMQ Erlang/OTP release
support policy was announced on the rabbitmq-users mailing list in
October 2018:
https://groups.google.com/forum/#!msg/rabbitmq-users/G4UJ9zbIYHs/tyt_kDoFBgAJ

The secondary goal is to only ship the required artefacts in the final
image. For example, all Erlang/OTP applications & features which are not
required by RabbitMQ are disabled. I suspect that the final Erlang/OTP
release can be shrunk further (it currently stands at 130MB), but this is a
minor concern right now.

Related to the secondary goal, we enable certain features in Erlang/OTP
which are useful when debugging:
* extra microstate accounting is not known to negatively affect
  performance; this feature is exposed via the `rabbitmq-diagnostics
  runtime_thread_stats` command added in v3.7.10
* lock counting is only enabled if the Erlang VM is started in a
  specific mode; this feature doesn't impact the default beam.smp runtime
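
As a rough illustration of the two points above, the following sketch shows how such choices are typically expressed as Erlang/OTP `configure` flags. The flag selection and the source path are assumptions for illustration, not the list used in this PR's Dockerfile.

```dockerfile
# Sketch only: illustrative flag selection, not the PR's actual list.
# Assumes the Erlang/OTP source tree is unpacked in /usr/local/src/otp (hypothetical path).
RUN cd /usr/local/src/otp \
    && ./configure \
        # collect extra microstate accounting states
        --with-microstate-accounting=extra \
        # additionally build the lock-counting emulator; beam.smp stays the default
        --enable-lock-counter \
        # skip applications that RabbitMQ does not need (examples only)
        --without-javac \
        --without-wx \
        --without-odbc \
        --without-debugger \
        --without-observer \
    && make -j "$(nproc)" \
    && make install
```

The lock-counting emulator is a separate build flavour that, as far as I know, is selected at start time with `erl -emu_type lcnt`, which is why it leaves the default runtime untouched.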

The final goal is to be explicit about the OpenSSL version that
Erlang/OTP uses. Using a shared OpenSSL might be convenient, but it
has the following drawbacks:

* depending on the base image for OpenSSL updates
* not knowing which OpenSSL version we compile against
* not knowing how OpenSSL was configured
* not being able to change OpenSSL configuration

Compiling OpenSSL adds an extra concern and definitely complicates this
Dockerfile, but one possible mitigation would be to automate version
bumps when a new OpenSSL version gets published. I am also expecting
that images will be automatically built & published from this
Dockerfile. Since OpenSSL is compiled with all defaults, I do not expect
things to stop working and become a maintenance overhead - we are not
using any advanced compilation flags.
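
A minimal sketch of what pinning and compiling OpenSSL could look like follows. The `ARG` names, the version value, and all paths are hypothetical; only the install location is set, with no feature flags, in keeping with the "all defaults" approach described above.

```dockerfile
# Sketch only: version, paths and ARG names are assumptions for illustration.
ARG OPENSSL_VERSION=1.1.1a
ARG OPENSSL_PREFIX=/usr/local/openssl

# Assumes the verified source tarball for $OPENSSL_VERSION is already
# unpacked in /usr/local/src/openssl (hypothetical path).
RUN cd /usr/local/src/openssl \
    && ./config --prefix="$OPENSSL_PREFIX" --openssldir="$OPENSSL_PREFIX/etc/ssl" \
    && make -j "$(nproc)" \
    && make install_sw

# Later, when configuring Erlang/OTP, point it at this build:
#   ./configure --with-ssl="$OPENSSL_PREFIX" --disable-dynamic-ssl-lib
```

Keeping the version in a single `ARG` is one way to make automated version bumps a one-line change.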

I am including the full docker build log, as I always find the information
captured in build logs to be helpful.

Resources which I found helpful while putting this Dockerfile together:

* https://github.com/rabbitmq/rabbitmq-server-boshrelease/blob/816bb377a59975c461e1af72367f187edc39ad3d/packages/erlang-21.1/packaging
* https://github.com/erlang/docker-erlang-otp/blob/e2e804aeeb6e6bc5fd49f66481be1dff829428f5/21/Dockerfile
* https://github.com/erlang/docker-erlang-example#2-build-stage-1-create-a-minimal-docker-image
* https://bugs.erlang.org/browse/ERL-823
* http://erlang.org/pipermail/erlang-questions/2019-January/097012.html
* https://github.com/lrascao/erlang-ec2-build
* https://github.com/kerl/kerl/blob/master/kerl

@tianon I've left a few questions for you in the Dockerfile as TODOs.

A few highlights:

* I've captured the way I build this image locally in the build script
* ha.pool.sks-keyservers.net is not as stable as pgpkeys.eu,
  there are many unstable PGP keyservers https://sks-keyservers.net/status/
* GitHub SSL was failing in wget when grabbing gosu, curl is more reliable
* docker-entrypoint.sh fails if rabbitmq-plugins is not invoked with the -q flag,
  a fix since 3.7.10 rabbitmq/rabbitmq-server-boshrelease@2da9884#commitcomment-31470432
From my perspective, they are both outside-of-this-container concerns
@gerhard
Contributor Author

gerhard commented Jan 14, 2019

People who will be interested in this: @michaelklishin

People who might be interested in this: @lukebakken @acogoluegnes @MarcialRosales @mkuratczyk

@yosifkit
Member

Since there isn't a "one-size-fits-all" check, I'd still rather stick to documenting possible options instead of forcing one on everybody.

See also https://github.com/docker-library/faq#healthcheck.

@gerhard
Contributor Author

gerhard commented Jan 15, 2019

While some users might want a more comprehensive check, a coarse default which will be accurate in all except edge cases is better than no healthcheck.

Having read the Healthcheck FAQ, I understand the reasoning and will play within the constraints of the ecosystem.

@yosifkit is it worth contributing the rationale from this PR into https://github.com/docker-library/docs/tree/master/rabbitmq ?

@michaelklishin
Collaborator

FWIW RabbitMQ CLI tools are being extended so that a number of health checks become one-liners or a combination of a few one-liners. Docker and Kubernetes users will then be able to pick the "stage" they want and easily use it as their healthcheck/liveness probe.

@michaelklishin
Collaborator

All but one of the new health check commands (see rabbitmq-diagnostics --help) will be available as of RabbitMQ 3.7.11.
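
To illustrate the "pick a stage" idea for Docker users, here is a hedged sketch using commands from the `rabbitmq-diagnostics` health check family. Which commands exist depends on the RabbitMQ version, so treat the command names and the timeout as assumptions and confirm them with `rabbitmq-diagnostics --help`.

```dockerfile
# Sketch only: pick ONE stage. Docker honours only the last HEALTHCHECK in a Dockerfile.
# Command availability depends on the RabbitMQ version in the image.

# Basic stage: the node's runtime is up and reachable by CLI tools
# HEALTHCHECK CMD rabbitmq-diagnostics -q ping

# Stricter stage: the RabbitMQ application is running and no local alarms are in effect.
# The shell form runs under /bin/sh -c, so the two one-liners can be chained with &&.
# The 10s timeout is an assumed value that leaves room for two CLI invocations.
HEALTHCHECK --timeout=10s \
    CMD rabbitmq-diagnostics -q check_running && rabbitmq-diagnostics -q check_local_alarms
```

Users who prefer no baked-in default can apply the same command at run time through `docker run --health-cmd`, which fits the preference stated earlier in this thread for documenting options rather than forcing one.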

@gerhard gerhard deleted the healthcheck branch February 4, 2019 11:11
@michaelklishin
Collaborator

Worth mentioning here: Kubernetes seems to be moving past the One True Health Check™ idea and towards a list of both generic and system-specific checks.
