
[Bug] Freezes and does not recover - no error #435

@stngrz

Describe the bug

My app has ROS 2 DDS topics communicating locally on host A (about 90 topics at most) and with a cloud container on host B, which has about 10 continuously published topics. The topics on host A originate from multiple Docker containers, all using the same DDS XML configuration (the default DDS XML for Zenoh referenced here: https://github.com/eclipse-zenoh/zenoh-plugin-ros2dds?tab=readme-ov-file#usage).

In a steady state, zenoh-bridge-ros2dds sends messages cleanly between hosts A and B with no issue. However, while my app runs it transitions between various states/modes, and during a transition the number of topics can change greatly or not at all. This is where the issue emerges.

About 90% of these transitions (whether many topics are added/removed or none) leave the bridge on host A frozen. By "freeze" I mean that it does not exit and shows no errors; it just hangs. I don't see any corresponding events in the logs on host B, as if B still expects A to be connected and healthy. Otherwise the bridge performs nominally; the issue occurs only during these stage/mode transitions. I also see the same freeze when I kill a Docker container running many topics (i.e., without using the app's stage/mode change feature).

To recover, I simply stop Zenoh and restart it; the system recovers and picks up the data from host B with no issue.
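The manual stop/restart workaround above could be automated with a small watchdog. The following is only a minimal sketch under stated assumptions, not a fix for the underlying freeze: `get_last_rx` is a hypothetical callable (for example, fed by a ROS 2 subscriber on one of the continuous topics from host B), and the 10-second timeout and the bare `zenoh-bridge-ros2dds` command line are placeholder values.

```python
import subprocess
import time

# Placeholder values (assumptions): adjust the command and timeout to your setup.
BRIDGE_CMD = ["zenoh-bridge-ros2dds"]
STALE_AFTER_S = 10.0


def is_stale(last_rx: float, now: float,
             stale_after_s: float = STALE_AFTER_S) -> bool:
    """Return True when nothing was received within the timeout."""
    return (now - last_rx) > stale_after_s


def restart(proc: subprocess.Popen, cmd=BRIDGE_CMD) -> subprocess.Popen:
    """Stop the running bridge process and start a fresh one."""
    proc.terminate()
    try:
        proc.wait(timeout=5)
    except subprocess.TimeoutExpired:
        # The process ignored SIGTERM, so force it to stop.
        proc.kill()
        proc.wait()
    return subprocess.Popen(cmd)


def watch(get_last_rx, cmd=BRIDGE_CMD, poll_s: float = 1.0) -> None:
    """Restart the bridge whenever data from host B goes stale.

    get_last_rx is a hypothetical callable returning the time.monotonic()
    timestamp of the last message received from host B.
    """
    proc = subprocess.Popen(cmd)
    started = time.monotonic()
    while True:
        time.sleep(poll_s)
        now = time.monotonic()
        # Count from the later of the last receipt and the last (re)start,
        # so a freshly started bridge gets time to reconnect.
        if is_stale(max(get_last_rx(), started), now):
            proc = restart(proc, cmd)
            started = now
```

The staleness check is kept as a separate pure function so the timeout logic can be tested without launching the bridge.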

I've tried reviewing rosout to see whether the messages sent and received on A and B differ, but found nothing promising in the few samples I collected. I had hoped it would reveal a problematic message being sent by my app on A, but it did not.
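The sent-versus-received comparison described above can be sketched as follows, assuming the rosout samples from each host have already been exported into lists of `(node name, message text)` pairs. That export format and the function name `missing_on_b` are assumptions made for illustration.

```python
from collections import Counter


def missing_on_b(sent_on_a, received_on_b):
    """Return entries seen more often on A than on B, with the shortfall.

    Both arguments are iterables of hashable entries, for example
    (node_name, message_text) tuples extracted from rosout samples.
    """
    # Counter subtraction keeps only entries with a positive remainder,
    # i.e. messages that were logged on A but never (or less often) on B.
    shortfall = Counter(sent_on_a) - Counter(received_on_b)
    return dict(shortfall)
```

For example, if A logged `("nav", "start")` twice and B logged it once, the result reports a shortfall of 1 for that entry. An empty result means every sampled message seen on A also appeared on B.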

I expect that, across app stage/mode transitions or the addition/removal of N topics, the bridge continues to send and receive data, or that it reports errors when it is not sending/receiving data.

This occurs both with and without custom Zenoh json5 configs.

In the attached debug logs, I've removed the customer name and replaced it with 'my-company'.

Host A logs:
debug_start-wait5s-freeze-wait5s.txt
debug_start-freeze-wait10s.txt

To reproduce

  1. Start the Zenoh bridge.
  2. Add/remove topics, kill a Docker container running many topics, or trigger a stage change (sorry, this is ill-defined).
  3. The bridge freezes (no more updates from host B topics on A), with no crash or log output on host A.

System info

  • Platform: linux/aarch64
    Distributor ID: Ubuntu
    Description: Ubuntu 22.04.4 LTS
    Release: 22.04
    Codename: jammy

  • CPU:
    Architecture: aarch64
    CPU op-mode(s): 32-bit, 64-bit
    Byte Order: Little Endian
    CPU(s): 12
    On-line CPU(s) list: 0-11
    Vendor ID: ARM
    Model name: Cortex-A78AE

  • Zenoh:
    zenoh-bridge-ros2dds v1.3.0 (also occurred on 1.21.1)

Labels: bug
