qos :2:,10:,:,:,, not found in for topic type rcl_interfaces::srv::dds_::SetParametersAtomically_. Report this. #455

Open
Timple opened this issue Feb 7, 2025 · 10 comments
Labels
bug Something isn't working more-information-needed


@Timple
Contributor

Timple commented Feb 7, 2025

I'm getting a lot of the following:

[implement_selector-73] [ERROR] [1738933120.536669563] [rmw_zenoh_cpp]: qos :2:,10:,:,:,, not found in for topic type rcl_interfaces::srv::dds_::SetParametersAtomically_. Report this.
[rosbag_cleanup-92] [ERROR] [1738933209.023131536] [rmw_zenoh_cpp]: qos 2:2:,50:,:,:,, not found in for topic type sensor_msgs::msg::dds_::PointCloud2_. Report this.

So I'm reporting it as requested.

I'm launching a large number of nodes and topics at once.
How should I best debug this?

I'm running rmw_zenoh from source (latest jazzy branch).
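To get a first overview of how widespread these errors are, a small script can tally them per message type and QoS string from a saved launch log. This is a minimal sketch that assumes the log-line format of the two error lines above. The function name `tally_qos_errors` is hypothetical, and the QoS string is treated as an opaque token because its encoding is not described in this thread.

```python
import re
import sys
from collections import Counter

# Assumed format, taken from the error lines reported in this issue:
# "... [rmw_zenoh_cpp]: qos <qos> not found in for topic type <type>. Report this."
PATTERN = re.compile(
    r"\[rmw_zenoh_cpp\]: qos (?P<qos>\S*) not found in for topic type "
    r"(?P<type>\S+?)\. Report this\."
)


def tally_qos_errors(lines):
    """Count 'qos ... not found' errors per (topic type, qos string) pair."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[(match.group("type"), match.group("qos"))] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 tally_qos.py launch.log  (script name is hypothetical)
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for (msg_type, qos), n in tally_qos_errors(log).most_common():
            print(f"{n:6d}  {msg_type}  qos={qos}")
```

If only a handful of types dominate the count, those topics are natural candidates for a reduced reproducer.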

@Yadunund
Member

Yadunund commented Feb 7, 2025

@Timple this looks related to #408.

Apart from the errors printed, does your application run fine?

@Timple
Contributor Author

Timple commented Feb 7, 2025

I had lots of issues similar to those in #408. That's why I tried again today, now that it has been closed 🙂.

Unfortunately the application also does not run fine.

@Yadunund
Member

Yadunund commented Feb 7, 2025

That's a bummer.

Having a minimal reproducible example would greatly help the team debug the issue.
E.g., does it happen with hundreds of the same /talker node started from a launch file? That would be easier to investigate than a complex application. It also seems related to CPU availability, so knowing the hardware resource constraints is equally important.
Would you be able to provide these?
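As a starting point for such a reproducer, the launch file with many identical talkers could be generated rather than written by hand. This is a minimal sketch that assumes the `talker` executable from the `demo_nodes_cpp` package and the ROS 2 XML launch format; `make_talker_launch` is a hypothetical helper. Each node gets a unique name, so the node count is the only variable.

```python
import sys
import xml.etree.ElementTree as ET


def make_talker_launch(count, pkg="demo_nodes_cpp", executable="talker"):
    """Build a ROS 2 XML launch description with `count` identical talker nodes.

    Nodes are named talker_0, talker_1, ... to avoid duplicate node names;
    they all keep the executable's default topic.
    """
    root = ET.Element("launch")
    for i in range(count):
        ET.SubElement(root, "node", {"pkg": pkg, "exec": executable, "name": f"talker_{i}"})
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__" and len(sys.argv) > 1:
    # e.g. python3 make_talkers.py 200 > talkers.launch.xml  (names are hypothetical)
    print(make_talker_launch(int(sys.argv[1])))
```

The generated file can then be started with `ros2 launch talkers.launch.xml`, increasing the count until the errors appear.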

Unfortunately the application also does not run fine.

Are you able to elaborate on this further? Is the system stuck waiting for messages (with specific QoS settings), services, or actions? Any details would be helpful.

@Yadunund Yadunund added the bug Something isn't working label Feb 7, 2025
@Timple
Contributor Author

Timple commented Feb 7, 2025

That will have to wait until next week!

Since the word QoS is involved, we do use quite some transient local topics.

@Yadunund
Member

Yadunund commented Feb 7, 2025

@Timple sounds good!

Since the word QoS is involved, we do use quite some transient local topics.

Interesting. @YuanYuYuan and @JEnoch are updating the way we handle transient_local topics in #368, so it would be great to use your minimal example to evaluate whether things improve with that PR.

@Timple
Contributor Author

Timple commented Feb 21, 2025

Circling back on this: I realise I haven't done my homework so far on creating a minimal example, mostly because the example probably wouldn't be minimal 🙂.

But I just wanted to mention that today's jazzy branch (ecf21d4) didn't resolve the issues mentioned.

@Yadunund
Member

Thanks for following up! Looking forward to running the "minimal example" whenever it's ready!

@Yadunund
Member

@Timple in the meantime, I'm curious whether splitting all your nodes/processes into two or three launch files and launching them one after the other (starting the second file only after the first has fully come up) alleviates the problem.
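The staggered start-up suggested above can also be approximated inside a single launch file. This sketch assumes the ROS 2 XML launch frontend's `<timer period="...">` action (the XML form of `launch.actions.TimerAction`) and a hypothetical list of `(pkg, executable, name)` tuples; `make_staged_launch` is a hypothetical helper. A fixed delay only approximates "after the first batch has fully come up", so separate launch files started by hand remain the more faithful test.

```python
import xml.etree.ElementTree as ET


def make_staged_launch(node_specs, stages, delay_s):
    """Split node_specs into `stages` batches; batch k starts after k * delay_s seconds.

    node_specs: list of (pkg, executable, name) tuples.
    """
    if stages < 1:
        raise ValueError("stages must be >= 1")
    root = ET.Element("launch")
    batch_size = -(-len(node_specs) // stages)  # ceiling division
    for k in range(stages):
        batch = node_specs[k * batch_size:(k + 1) * batch_size]
        if not batch:
            continue
        # Stage 0 starts immediately; later stages are wrapped in a <timer> action.
        if k == 0:
            parent = root
        else:
            parent = ET.SubElement(root, "timer", {"period": str(k * delay_s)})
        for pkg, executable, name in batch:
            ET.SubElement(parent, "node", {"pkg": pkg, "exec": executable, "name": name})
    return ET.tostring(root, encoding="unicode")
```

For example, `make_staged_launch(specs, 3, 10.0)` would start roughly a third of the nodes immediately, the next third after 10 s, and the rest after 20 s.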

@Timple
Contributor Author

Timple commented Mar 7, 2025

Well, today was the day I was going to test your suggestions. But I thought it best to try the latest branch first, as I've seen some relevant PRs go by.

And good news: our software stack seems to be running with the latest changes 🥳.
So somewhere in this range is the last bit that we needed: 0.2.2...801ff66

Thank you for all your efforts!

I have only verified this on our hardware-in-the-loop simulator. I will close this issue once it is verified on the actual hardware!

@Timple
Contributor Author

Timple commented Mar 7, 2025

Unfortunately, take 2 on the HIL failed as well. One of the nodes crashed:

what(): rcl_wait unexpectedly timed out
2025-03-07T08:07:25.668853000Z [INFO] [diagnostics_summarizer-29]: process started with pid [662]
2025-03-07T08:07:28.528681000Z [diagnostics_summarizer-29] 2025-03-07T08:07:22.092433Z  INFO ThreadId(05) zenoh::net::runtime::orchestrator: Zenoh can be reached at: tcp/[::1]:52549
2025-03-07T08:07:28.822902000Z [ERROR] [diagnostics_summarizer-29]: process has died [pid 662, exit code -6, cmd '/install/harvey/lib/harvey/diagnostics_summarizer --ros-args -r __node:=diagnostics_summarizer --params-file /tmp/launch_params_i1lgx4ix'].
2025-03-07T08:07:33.852035000Z [diagnostics_summarizer-29] [ERROR] [1741334848.118082488] [rclcpp]: caught std::exception exception in GraphListener thread: rcl_wait unexpectedly timed out
2025-03-07T08:07:33.882358000Z [diagnostics_summarizer-29] terminate called after throwing an instance of 'std::runtime_error'
2025-03-07T08:07:33.882532000Z [diagnostics_summarizer-29]   what():  rcl_wait unexpectedly timed out
2025-03-07T08:07:33.884842000Z [diagnostics_summarizer-29] Stack trace (most recent call last) in thread 1024:
2025-03-07T08:07:33.891791000Z [diagnostics_summarizer-29] #11   Object "", at 0xffffffffffffffff, in 
2025-03-07T08:07:33.891994000Z [diagnostics_summarizer-29] #10   Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc25b62bc3b, in 
2025-03-07T08:07:33.894063000Z [diagnostics_summarizer-29] #9    Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc25b59eaa3, in 
2025-03-07T08:07:33.895319000Z [diagnostics_summarizer-29] #8    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.33", at 0x7fc25b830db3, in 
2025-03-07T08:07:33.922034000Z [diagnostics_summarizer-29] #7    Object "/opt/ros/jazzy/lib/librclcpp.so", at 0x7fc25bb96ab7, in 
2025-03-07T08:07:33.928509000Z [diagnostics_summarizer-29] #6    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.33", at 0x7fc25b7ff0c0, in std::rethrow_exception(std::__exception_ptr::exception_ptr)
2025-03-07T08:07:33.934079000Z [diagnostics_summarizer-29] #5    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.33", at 0x7fc25b7e9a54, in std::terminate()
2025-03-07T08:07:33.945943000Z [diagnostics_summarizer-29] #4    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.33", at 0x7fc25b7ff0d9, in 
2025-03-07T08:07:33.965063000Z [diagnostics_summarizer-29] #3    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.33", at 0x7fc25b7e9ff4, in 
2025-03-07T08:07:33.970482000Z [diagnostics_summarizer-29] #2    Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc25b52a8fe, in abort
2025-03-07T08:07:33.986060000Z [diagnostics_summarizer-29] #1    Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc25b54727d, in raise
2025-03-07T08:07:34.030215000Z [diagnostics_summarizer-29] #0    Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc25b5a0b2c, in pthread_kill
2025-03-07T08:07:34.064043000Z [diagnostics_summarizer-29] Aborted (Signal sent by tkill() 662 0)

Would you like me to rephrase the title of this issue, or create a new ticket for this one? The cause (high load / number of nodes) is likely the same, but the error is clearly different.
