Describe the bug
My app runs ROS 2 DDS topics locally on host A (about 90 topics at peak) and also communicates with a cloud container on host B, which publishes about 10 topics continuously. The topics on host A originate from multiple Docker containers, all using the same DDS XML (the default DDS XML for Zenoh referenced here: https://github.com/eclipse-zenoh/zenoh-plugin-ros2dds?tab=readme-ov-file#usage).
In a steady state, zenoh-bridge-ros2dds passes messages cleanly between hosts A and B with no issue. However, while my app runs it transitions between various states/modes, and across a transition the number of topics can change greatly or not at all. This is where the issue emerges.
About 90% of these transitions (whether many topics are added/removed or none) leave the bridge on host A frozen. By "frozen" I mean it does not exit and shows no errors; it just hangs. I don't see any corresponding events in the logs on host B; it is as if B still considers A connected and healthy. Otherwise the bridge performs nominally; the issue occurs only during these stage/mode transitions. I also see the same freeze when I kill a Docker container running many topics (i.e. without using the app's stage/mode change feature).
To recover, I simply stop zenoh and restart it; the system recovers and resumes using the data from host B with no issue.
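Until the root cause is found, the manual stop/restart above could be automated. The following is only a minimal sketch of that idea, not part of the bridge: `StalenessWatchdog`, `restart_bridge`, the timeout value, and the restart command are all hypothetical names and assumptions. It assumes something on host A (for example a ROS 2 subscription callback on one of the continuous host B topics) calls `mark_received()` whenever data arrives.

```python
import subprocess
import time


class StalenessWatchdog:
    """Tracks when data from host B was last seen on host A (hypothetical helper)."""

    def __init__(self, timeout_s, now=None):
        self.timeout_s = timeout_s
        # Start the clock at construction so a bridge that never delivers is also caught.
        self.last_seen = time.monotonic() if now is None else now

    def mark_received(self, now=None):
        # Call this from whatever receives host B data on host A.
        self.last_seen = time.monotonic() if now is None else now

    def is_stale(self, now=None):
        current = time.monotonic() if now is None else now
        return (current - self.last_seen) > self.timeout_s


def restart_bridge(restart_cmd):
    """Run the site-specific restart command; the command itself is an assumption."""
    return subprocess.run(restart_cmd, check=False).returncode
```

A supervising loop would poll `is_stale()` every second or so and call `restart_bridge(...)` with whatever command restarts the bridge in a given deployment. The timeout should be well above the slowest host B topic period so that a quiet but healthy bridge is not restarted.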
I reviewed rosout to look for a difference between messages sent and received on A and B, but found nothing promising in the few samples I collected. I hoped it would reveal a problematic message being sent by my app on A, but found nothing.
I expect that across app stage/mode transitions, or when N topics are added or removed, the bridge continues to send and receive data, or that it reports errors when it is not sending/receiving data.
This occurs with and without custom Zenoh json5 configs.
In the attached debug logs I've replaced the customer name with 'my-company'.
Host A logs:
debug_start-wait5s-freeze-wait5s.txt
debug_start-freeze-wait10s.txt
To reproduce
- Start the zenoh bridge
- Add/remove topics, kill a Docker container running many topics, or trigger a stage change (sorry, this is ill-defined)
- The bridge freezes (no more updates from host B topics on A) with no crash or log output on host A
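Since the add/remove step is ill-defined, here is a minimal sketch of how it could be scripted with rclpy. It is untested against the bridge and is not confirmed to trigger the freeze; the node name, the `/churn` topic prefix, the topic count, and the timings are all assumptions chosen for illustration.

```python
import time


def topic_names(prefix, count):
    """Return `count` topic names such as /churn/t0, /churn/t1, ..."""
    return [f"{prefix}/t{i}" for i in range(count)]


def churn_topics(count=50, cycles=5, hold_s=5.0):
    """Repeatedly create, use, and destroy many publishers on one node."""
    # ROS 2 imports are local so the helper above can be used without ROS.
    import rclpy
    from std_msgs.msg import String

    rclpy.init()
    node = rclpy.create_node("topic_churn")  # hypothetical node name
    try:
        for _ in range(cycles):
            # Add many topics at once.
            pubs = [node.create_publisher(String, name, 10)
                    for name in topic_names("/churn", count)]
            deadline = time.monotonic() + hold_s
            while time.monotonic() < deadline:
                for pub in pubs:
                    pub.publish(String(data="ping"))
                time.sleep(0.1)
            # Remove them all at once, then pause before the next cycle.
            for pub in pubs:
                node.destroy_publisher(pub)
            time.sleep(hold_s)
    finally:
        node.destroy_node()
        rclpy.shutdown()
```

Calling `churn_topics()` from a sourced ROS 2 shell on host A, with the bridge already running, approximates a stage change. Watching whether host B topics keep updating on A after each cycle would show whether the freeze is tied to topic churn alone.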
System info
- Platform: linux/aarch64
  Distributor ID: Ubuntu
  Description: Ubuntu 22.04.4 LTS
  Release: 22.04
  Codename: jammy
- CPU:
  Architecture: aarch64
  CPU op-mode(s): 32-bit, 64-bit
  Byte Order: Little Endian
  CPU(s): 12
  On-line CPU(s) list: 0-11
  Vendor ID: ARM
  Model name: Cortex-A78AE
- Zenoh:
  zenoh-bridge-ros2dds v1.3.0 (also occurred on 1.21.1)