1000 elcap nodes are shown as "lost connection" on rank 0 but compute node thinks it is still connected #6626

Open
garlick opened this issue Feb 11, 2025 · 1 comment

Comments

@garlick
Member

garlick commented Feb 11, 2025

Problem: about 1000 nodes of el cap got into a state where the rank 0 broker thought they were disconnected, but from the point of view of the nodes, the rank 0 broker was just unresponsive.

Specifically, on rank 0

[root@elcap1:conf.d]# flux overlay status |grep elcap1124
├─ 820 elcap1124: lost lost connection

but on the node, everything seemed OK except flux commands that needed to contact rank 0 would hang.

The actual TCP connection to the node appeared to be in connected state. Here is elcap12119 (another node in that state):

tcp        0 138488 eelcap12119:46878       eelcap1:8050            ESTABLISHED 188356/broker

and the same connection on elcap1

tcp   246536      0 eelcap1:8050            eelcap12119:46878       ESTABLISHED 3180474/broker
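
For reference, assuming these lines follow the standard `netstat -tnp` column order (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State, PID/Program name), the compute node has 138488 bytes sitting in its send queue and rank 0 has 246536 bytes sitting unread in its receive queue. A minimal sketch for pulling those numbers out of such lines follows; the `TcpConn`, `parse_netstat_line` and `is_backlogged` names are made up for illustration and are not part of Flux:

```python
from typing import NamedTuple


class TcpConn(NamedTuple):
    recv_q: int   # bytes received by the kernel but not yet read by the application
    send_q: int   # bytes sent but not yet acknowledged by the remote end
    local: str
    remote: str
    state: str
    owner: str    # "PID/program" column


def parse_netstat_line(line: str) -> TcpConn:
    """Parse one TCP line, assuming standard `netstat -tnp` column order."""
    fields = line.split()
    if len(fields) < 7 or not fields[0].startswith("tcp"):
        raise ValueError(f"not a netstat TCP line: {line!r}")
    _proto, recv_q, send_q, local, remote, state, owner = fields[:7]
    return TcpConn(int(recv_q), int(send_q), local, remote, state, owner)


def is_backlogged(conn: TcpConn, threshold: int = 0) -> bool:
    """True if an established connection has data stuck in either queue."""
    return conn.state == "ESTABLISHED" and (
        conn.recv_q > threshold or conn.send_q > threshold
    )


# The two lines quoted above, compute node first, then rank 0.
compute = parse_netstat_line(
    "tcp        0 138488 eelcap12119:46878       eelcap1:8050            ESTABLISHED 188356/broker"
)
rank0 = parse_netstat_line(
    "tcp   246536      0 eelcap1:8050            eelcap12119:46878       ESTABLISHED 3180474/broker"
)

for name, conn in (("compute", compute), ("rank0", rank0)):
    print(name, "recv_q:", conn.recv_q, "send_q:", conn.send_q,
          "backlogged:", is_backlogged(conn))
```

If that reading of the columns is right, the socket is still established at the TCP level on both ends, but data queued toward rank 0 is not being consumed there.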

Stopping flux on the compute node ran into the systemd timeout, but problems immediately went away upon restart.

@garlick
Member Author

garlick commented Feb 12, 2025

Note that the "lost connection" status indicates that a send on rank 0 failed with EHOSTUNREACH.

Because we have ZMQ_ROUTER_MANDATORY set[1], if the destination UUID is unknown (presumably because the peer disconnected), zmq_send() fails with EHOSTUNREACH. We catch that and set SUBTREE_STATUS_LOST with the error "lost connection".

[1] https://libzmq.readthedocs.io/en/latest/zmq_setsockopt.html
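
To make the sequence above concrete, here is a toy model in plain Python. It is not libzmq and not the Flux broker code: `ToyRouter`, `send_and_track` and the `status` dict are hypothetical stand-ins, and the only behavior borrowed from the real thing is that a ROUTER socket with ZMQ_ROUTER_MANDATORY set fails a send to an unknown peer with EHOSTUNREACH instead of silently dropping the message.

```python
import errno


class ToyRouter:
    """Toy stand-in for a ROUTER socket's routing table (not libzmq)."""

    def __init__(self, mandatory: bool = False):
        self.mandatory = mandatory
        self.peers = {}  # peer UUID -> list of messages delivered to it

    def connect(self, uuid: str) -> None:
        self.peers[uuid] = []

    def disconnect(self, uuid: str) -> None:
        self.peers.pop(uuid, None)

    def send(self, uuid: str, msg: str) -> bool:
        if uuid not in self.peers:
            if self.mandatory:
                # Mirrors ZMQ_ROUTER_MANDATORY=1: unroutable send is an error.
                raise OSError(errno.EHOSTUNREACH, "no route to peer")
            return False  # Mirrors the default: message silently discarded.
        self.peers[uuid].append(msg)
        return True


def send_and_track(router: ToyRouter, uuid: str, msg: str, status: dict) -> None:
    """Send to a child; on EHOSTUNREACH record it as lost (hypothetical helper)."""
    try:
        router.send(uuid, msg)
    except OSError as e:
        if e.errno != errno.EHOSTUNREACH:
            raise
        # Stand-in for setting SUBTREE_STATUS_LOST with error "lost connection".
        status[uuid] = ("lost", "lost connection")


router = ToyRouter(mandatory=True)
status = {}
router.connect("uuid-820")
send_and_track(router, "uuid-820", "hello", status)    # delivered, no status change
router.disconnect("uuid-820")                          # router forgets the peer
send_and_track(router, "uuid-820", "hello again", status)
print(status)
```

Note that in this model the child only shows up as lost once rank 0 has forgotten its UUID and then tries to send to it; the model says nothing about why rank 0 would forget a peer whose TCP connection is still established.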
