Problem: about 1000 nodes of el cap got into a state where the rank 0 broker thought they were disconnected, but from the point of view of the nodes, the rank 0 broker was just unresponsive.
Specifically, on rank 0:
[root@elcap1:conf.d]# flux overlay status |grep elcap1124
├─ 820 elcap1124: lost lost connection
but on the node, everything seemed OK except flux commands that needed to contact rank 0 would hang.
The actual TCP connection to the node appeared to be in connected state. Here is elcap12119 (another node in that state):
tcp 0 138488 eelcap12119:46878 eelcap1:8050 ESTABLISHED 188356/broker
and the same connection on elcap1:
tcp 246536 0 eelcap1:8050 eelcap12119:46878 ESTABLISHED 3180474/broker
Stopping flux on the compute node ran into the systemd timeout, but problems immediately went away upon restart.
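For anyone reading the two socket lines above, here is a small sketch that parses them and pairs them up. It assumes they are Linux `netstat -tp` output, where the columns are Proto, Recv-Q, Send-Q, local address, foreign address, state, and PID/program. The helper names (`TcpLine`, `parse_netstat_line`, `same_connection`) are made up for illustration. On an ESTABLISHED socket, Recv-Q counts bytes the local program has not yet read and Send-Q counts bytes the remote host has not yet acknowledged, so the numbers suggest (but do not prove) that data was piling up unread on the rank 0 side.

```python
from dataclasses import dataclass


@dataclass
class TcpLine:
    """One row of `netstat -tp` style output (column layout is an assumption)."""
    proto: str
    recv_q: int   # bytes received by the kernel but not yet read by the program
    send_q: int   # bytes sent but not yet acknowledged by the remote host
    local: str
    remote: str
    state: str
    owner: str    # "PID/program"


def parse_netstat_line(line: str) -> TcpLine:
    """Split a whitespace-separated netstat row into typed fields."""
    proto, recv_q, send_q, local, remote, state, owner = line.split()
    return TcpLine(proto, int(recv_q), int(send_q), local, remote, state, owner)


def same_connection(a: TcpLine, b: TcpLine) -> bool:
    """True if a and b are the two ends of one TCP connection."""
    return a.local == b.remote and a.remote == b.local


# The two lines quoted in this issue.
node = parse_netstat_line(
    "tcp 0 138488 eelcap12119:46878 eelcap1:8050 ESTABLISHED 188356/broker"
)
rank0 = parse_netstat_line(
    "tcp 246536 0 eelcap1:8050 eelcap12119:46878 ESTABLISHED 3180474/broker"
)

if same_connection(node, rank0):
    print(f"node end:   {node.send_q} bytes sent but unacknowledged")
    print(f"rank 0 end: {rank0.recv_q} bytes received but unread by {rank0.owner}")
```

Both ends report ESTABLISHED, which matches the observation that the TCP connection itself looked healthy while the broker on rank 0 considered the peer lost.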
Note that the "lost connection" status indicates that a send on rank 0 failed with EHOSTUNREACH.
Because we have ZMQ_ROUTER_MANDATORY set[1], if the destination UUID is unknown (presumably because it disconnected), a zmq_send() fails with EHOSTUNREACH. We catch that and set SUBTREE_STATUS_LOST with the error "lost connection".
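To make that failure path concrete, here is a toy model in plain Python. It is not libzmq and not the flux-core code: `ToyRouter`, `ToyOverlay`, and `STATUS_CONNECTED` are hypothetical names, and the routing table is just a dict. It only mirrors the behavior described above: with the mandatory flag set, a send to an unknown peer UUID fails with EHOSTUNREACH, and the caller reacts by marking the subtree lost with the error "lost connection". Without the flag, a ROUTER socket silently drops unroutable messages, which the model also reflects.

```python
import errno

STATUS_CONNECTED = "connected"   # hypothetical placeholder for a healthy state
SUBTREE_STATUS_LOST = "lost"     # name taken from the comment above


class ToyRouter:
    """Stand-in for a ROUTER socket: routes by peer UUID."""

    def __init__(self, mandatory=True):
        self.mandatory = mandatory
        self.peers = {}              # uuid -> list of delivered payloads

    def connect(self, uuid):
        self.peers[uuid] = []

    def disconnect(self, uuid):
        self.peers.pop(uuid, None)

    def send(self, uuid, payload):
        if uuid not in self.peers:
            if self.mandatory:
                # Models ZMQ_ROUTER_MANDATORY: unroutable sends are an error.
                raise OSError(errno.EHOSTUNREACH, "No route to host")
            return                   # Without the flag: dropped silently.
        self.peers[uuid].append(payload)


class ToyOverlay:
    """Stand-in for the parent side of the overlay, tracking child status."""

    def __init__(self, router):
        self.router = router
        self.status = {}
        self.error = {}

    def add_child(self, uuid):
        self.router.connect(uuid)
        self.status[uuid] = STATUS_CONNECTED

    def send_to_child(self, uuid, payload):
        try:
            self.router.send(uuid, payload)
        except OSError as e:
            if e.errno != errno.EHOSTUNREACH:
                raise
            self.status[uuid] = SUBTREE_STATUS_LOST
            self.error[uuid] = "lost connection"


overlay = ToyOverlay(ToyRouter(mandatory=True))
overlay.add_child("uuid-820")            # made-up UUID for illustration
overlay.send_to_child("uuid-820", b"ping")
overlay.router.disconnect("uuid-820")    # the router forgets the peer
overlay.send_to_child("uuid-820", b"ping")
print(overlay.status["uuid-820"], overlay.error["uuid-820"])
```

In this model the child is only marked lost when the parent next tries to send to it, and nothing in the model ever tells the child that this happened.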