bug: unsoundness of rmw_shutdown #170

Closed
YuanYuYuan opened this issue May 8, 2024 · 12 comments
@YuanYuYuan
Contributor

YuanYuYuan commented May 8, 2024

We noticed that the ApexAI performance test with rmw_zenoh keeps failing with the following error.

thread '<unnamed>' panicked at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/thread/local.rs:262:26:
cannot access a Thread Local Storage value during or after destruction: AccessError
stack backtrace:
   0:     0x7e09df3dfab6 - std::backtrace_rs::backtrace::libunwind::trace::h92ba5ac126ecc5ad
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/../../backtrace/src/backtrace/libunwind.rs:104:5
   1:     0x7e09df3dfab6 - std::backtrace_rs::backtrace::trace_unsynchronized::hfaa118b6e3a8f31a
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x7e09df3dfab6 - std::sys_common::backtrace::_print_fmt::h2075f185adf13bbd
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/sys_common/backtrace.rs:68:5
   3:     0x7e09df3dfab6 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h9c4bd387f9f3f544
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/sys_common/backtrace.rs:44:22
   4:     0x7e09df40ab20 - core::fmt::rt::Argument::fmt::h018ecebdf268b5bb
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/fmt/rt.rs:142:9
   5:     0x7e09df40ab20 - core::fmt::write::h938c332fdab924eb
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/fmt/mod.rs:1120:17
   6:     0x7e09df3dcc5f - std::io::Write::write_fmt::h4a694b02e44e6363
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/io/mod.rs:1846:15
   7:     0x7e09df3df894 - std::sys_common::backtrace::_print::h3595c0f4303f3ccd
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/sys_common/backtrace.rs:47:5
   8:     0x7e09df3df894 - std::sys_common::backtrace::print::ha888e6736b0bc71f
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/sys_common/backtrace.rs:34:9
   9:     0x7e09df3e1117 - std::panicking::default_hook::{{closure}}::he19a7f79f7beab5e
  10:     0x7e09df3e0e79 - std::panicking::default_hook::h67efe04e9a5d446e
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:292:9
  11:     0x7e09df3e15a8 - std::panicking::rust_panic_with_hook::h49021cdbc4b22349
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:781:13
  12:     0x7e09df3e1482 - std::panicking::begin_panic_handler::{{closure}}::hfbf601f3d8c62d13
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:659:13
  13:     0x7e09df3dffb6 - std::sys_common::backtrace::__rust_end_short_backtrace::h98dd020b6e913806
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/sys_common/backtrace.rs:171:18
  14:     0x7e09df3e11d4 - rust_begin_unwind
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:647:5
  15:     0x7e09de3978e5 - core::panicking::panic_fmt::h0d3f1893e38be419
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/panicking.rs:72:14
  16:     0x7e09de397e33 - core::result::unwrap_failed::h1356562ef8bc898c
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/result.rs:1649:5
  17:     0x7e09df3826d9 - core::result::Result<T,E>::expect::h6b8b29ca8643e857
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/result.rs:1030:23
  18:     0x7e09df343d8b - std::thread::local::LocalKey<T>::with::h1752a920e6b97eb1
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/thread/local.rs:262:9
  19:     0x7e09df364e6e - tokio::runtime::context::runtime_mt::current_enter_context::haf9beb3610bddfe5
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/context/runtime_mt.rs:5:5
  20:     0x7e09de6c3f24 - tokio::runtime::scheduler::multi_thread::worker::block_in_place::{{closure}}::h9c06599e5c088372
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/multi_thread/worker.rs:355:13
  21:     0x7e09de6c31aa - tokio::runtime::scheduler::multi_thread::worker::with_current::{{closure}}::hc2aa2ec0b046bb24
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/multi_thread/worker.rs:1223:14
  22:     0x7e09de786485 - tokio::runtime::context::with_scheduler::{{closure}}::h4358f2c1fed3380d
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/context.rs:183:33
  23:     0x7e09de7ada3f - core::result::Result<T,E>::unwrap_or_else::hf45d4956e923594f
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/result.rs:1426:23
  24:     0x7e09de7861af - tokio::runtime::context::with_scheduler::h60fbeb43725fdaf0
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/context.rs:182:9
  25:     0x7e09de6c2f9b - tokio::runtime::scheduler::multi_thread::worker::with_current::he3f8c8d1928a9152
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/multi_thread/worker.rs:1221:5
  26:     0x7e09de6c3541 - tokio::runtime::scheduler::multi_thread::worker::block_in_place::h812065477b42e82f
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/multi_thread/worker.rs:353:24
  27:     0x7e09de786afa - tokio::runtime::scheduler::block_in_place::block_in_place::he4fdf8f5613ebc2b
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/block_in_place.rs:20:5
  28:     0x7e09de5b9f8a - tokio::task::blocking::block_in_place::hd8175eae66af42f5
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/task/blocking.rs:78:9
  29:     0x7e09de7a6764 - zenoh_runtime::ZRuntime::block_in_place::hc7706bb648ae8ea0
                               at /home/zstw/Workspace/src/zenoh/commons/zenoh-runtime/src/lib.rs:130:9
  30:     0x7e09de83f367 - <zenoh_core::ResolveFuture<F,To> as zenoh_core::SyncResolve>::res_sync::hc9409ed43c1f44e0
                               at /home/zstw/Workspace/src/zenoh/commons/zenoh-core/src/lib.rs:166:9
  31:     0x7e09de45d0ea - zenoh_core::SyncResolve::res::hb45a81dafd5ac98f
                               at /home/zstw/Workspace/src/zenoh/commons/zenoh-core/src/lib.rs:57:9
  32:     0x7e09de46c518 - z_close
                               at /home/zstw/Workspace/benchmark_rmw_zenoh/build/zenoh_c_vendor/zenoh_c_vendor-prefix/src/zenoh_c_vendor/src/session.rs:144:11
  33:     0x7e09df9d2d5b - rmw_shutdown
  34:     0x7e09e15c2104 - rcl_shutdown
  35:     0x7e09e12fa41e - _ZN6rclcpp7Context8shutdownERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
  36:     0x7e09e12fac91 - _ZN6rclcpp7ContextD1Ev
  37:     0x7e09e12f3bea - <unknown>
  38:     0x7e09e0a45495 - __run_exit_handlers
                               at ./stdlib/exit.c:113:8
  39:     0x7e09e0a45610 - __GI_exit
                               at ./stdlib/exit.c:143:3
  40:     0x7e09e0a29d97 - __libc_start_call_main
                               at ./csu/../sysdeps/nptl/libc_start_call_main.h:74:3
  41:     0x7e09e0a29e40 - __libc_start_main_impl
                               at ./csu/../csu/libc-start.c:392:3
  42:     0x584c927db915 - _start
  43:                0x0 - <unknown>
thread '<unnamed>' panicked at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/thread/local.rs:262:26:
cannot access a Thread Local Storage value during or after destruction: AccessError
stack backtrace:
   0:     0x7e09df3dfab6 - std::backtrace_rs::backtrace::libunwind::trace::h92ba5ac126ecc5ad
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/../../backtrace/src/backtrace/libunwind.rs:104:5
   1:     0x7e09df3dfab6 - std::backtrace_rs::backtrace::trace_unsynchronized::hfaa118b6e3a8f31a
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x7e09df3dfab6 - std::sys_common::backtrace::_print_fmt::h2075f185adf13bbd
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/sys_common/backtrace.rs:68:5
   3:     0x7e09df3dfab6 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h9c4bd387f9f3f544
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/sys_common/backtrace.rs:44:22
   4:     0x7e09df40ab20 - core::fmt::rt::Argument::fmt::h018ecebdf268b5bb
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/fmt/rt.rs:142:9
   5:     0x7e09df40ab20 - core::fmt::write::h938c332fdab924eb
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/fmt/mod.rs:1120:17
   6:     0x7e09df3dcc5f - std::io::Write::write_fmt::h4a694b02e44e6363
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/io/mod.rs:1846:15
   7:     0x7e09df3df894 - std::sys_common::backtrace::_print::h3595c0f4303f3ccd
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/sys_common/backtrace.rs:47:5
   8:     0x7e09df3df894 - std::sys_common::backtrace::print::ha888e6736b0bc71f
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/sys_common/backtrace.rs:34:9
   9:     0x7e09df3e1117 - std::panicking::default_hook::{{closure}}::he19a7f79f7beab5e
  10:     0x7e09df3e0e79 - std::panicking::default_hook::h67efe04e9a5d446e
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:292:9
  11:     0x7e09df3e15a8 - std::panicking::rust_panic_with_hook::h49021cdbc4b22349
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:781:13
  12:     0x7e09df3e1482 - std::panicking::begin_panic_handler::{{closure}}::hfbf601f3d8c62d13
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:659:13
  13:     0x7e09df3dffb6 - std::sys_common::backtrace::__rust_end_short_backtrace::h98dd020b6e913806
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/sys_common/backtrace.rs:171:18
  14:     0x7e09df3e11d4 - rust_begin_unwind
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:647:5
  15:     0x7e09de3978e5 - core::panicking::panic_fmt::h0d3f1893e38be419
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/panicking.rs:72:14
  16:     0x7e09de397e33 - core::result::unwrap_failed::h1356562ef8bc898c
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/result.rs:1649:5
  17:     0x7e09df3826d9 - core::result::Result<T,E>::expect::h6b8b29ca8643e857
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/result.rs:1030:23
  18:     0x7e09df343d8b - std::thread::local::LocalKey<T>::with::h1752a920e6b97eb1
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/thread/local.rs:262:9
  19:     0x7e09df364e6e - tokio::runtime::context::runtime_mt::current_enter_context::haf9beb3610bddfe5
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/context/runtime_mt.rs:5:5
  20:     0x7e09de6c3f24 - tokio::runtime::scheduler::multi_thread::worker::block_in_place::{{closure}}::h9c06599e5c088372
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/multi_thread/worker.rs:355:13
  21:     0x7e09de6c31aa - tokio::runtime::scheduler::multi_thread::worker::with_current::{{closure}}::hc2aa2ec0b046bb24
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/multi_thread/worker.rs:1223:14
  22:     0x7e09de786485 - tokio::runtime::context::with_scheduler::{{closure}}::h4358f2c1fed3380d
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/context.rs:183:33
  23:     0x7e09de7ada3f - core::result::Result<T,E>::unwrap_or_else::hf45d4956e923594f
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/result.rs:1426:23
  24:     0x7e09de7861af - tokio::runtime::context::with_scheduler::h60fbeb43725fdaf0
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/context.rs:182:9
  25:     0x7e09de6c2f9b - tokio::runtime::scheduler::multi_thread::worker::with_current::he3f8c8d1928a9152
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/multi_thread/worker.rs:1221:5
  26:     0x7e09de6c3541 - tokio::runtime::scheduler::multi_thread::worker::block_in_place::h812065477b42e82f
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/multi_thread/worker.rs:353:24
  27:     0x7e09de786afa - tokio::runtime::scheduler::block_in_place::block_in_place::he4fdf8f5613ebc2b
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/block_in_place.rs:20:5
  28:     0x7e09de5b9f8a - tokio::task::blocking::block_in_place::hd8175eae66af42f5
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/task/blocking.rs:78:9
  29:     0x7e09de7a6764 - zenoh_runtime::ZRuntime::block_in_place::hc7706bb648ae8ea0
                               at /home/zstw/Workspace/src/zenoh/commons/zenoh-runtime/src/lib.rs:130:9
  30:     0x7e09de83f367 - <zenoh_core::ResolveFuture<F,To> as zenoh_core::SyncResolve>::res_sync::hc9409ed43c1f44e0
                               at /home/zstw/Workspace/src/zenoh/commons/zenoh-core/src/lib.rs:166:9
  31:     0x7e09de7e6e83 - <zenoh::session::Session as core::ops::drop::Drop>::drop::h19c93fde394c5c5e
                               at /home/zstw/Workspace/src/zenoh/zenoh/src/session.rs:2529:21
  32:     0x7e09de830457 - core::ptr::drop_in_place<zenoh::session::Session>::hdc7d9bce22baadd0
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/ptr/mod.rs:507:1
  33:     0x7e09de83900b - core::ptr::drop_in_place<zenoh::session::Session::close::{{closure}}>::h764a6fc98a47f868
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/ptr/mod.rs:525:28
  34:     0x7e09de82dcaf - core::ptr::drop_in_place<zenoh_runtime::ZRuntime::block_in_place<zenoh::session::Session::close::{{closure}},core::result::Result<(),alloc::boxed::Box<dyn core::error::Error+core::marker::Send+core::marker::Sync>>>::{{closure}}>::h229fbbc24e853948
                               at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/ptr/mod.rs:507:1
  35:     0x7e09de6c374d - tokio::runtime::scheduler::multi_thread::worker::block_in_place::h812065477b42e82f
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/multi_thread/worker.rs:442:1
  36:     0x7e09de786afa - tokio::runtime::scheduler::block_in_place::block_in_place::he4fdf8f5613ebc2b
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/block_in_place.rs:20:5
  37:     0x7e09de5b9f8a - tokio::task::blocking::block_in_place::hd8175eae66af42f5
                               at /home/zstw/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/task/blocking.rs:78:9
  38:     0x7e09de7a6764 - zenoh_runtime::ZRuntime::block_in_place::hc7706bb648ae8ea0
                               at /home/zstw/Workspace/src/zenoh/commons/zenoh-runtime/src/lib.rs:130:9
  39:     0x7e09de83f367 - <zenoh_core::ResolveFuture<F,To> as zenoh_core::SyncResolve>::res_sync::hc9409ed43c1f44e0
                               at /home/zstw/Workspace/src/zenoh/commons/zenoh-core/src/lib.rs:166:9
  40:     0x7e09de45d0ea - zenoh_core::SyncResolve::res::hb45a81dafd5ac98f
                               at /home/zstw/Workspace/src/zenoh/commons/zenoh-core/src/lib.rs:57:9
  41:     0x7e09de46c518 - z_close
                               at /home/zstw/Workspace/benchmark_rmw_zenoh/build/zenoh_c_vendor/zenoh_c_vendor-prefix/src/zenoh_c_vendor/src/session.rs:144:11
  42:     0x7e09df9d2d5b - rmw_shutdown
  43:     0x7e09e15c2104 - rcl_shutdown
  44:     0x7e09e12fa41e - _ZN6rclcpp7Context8shutdownERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
  45:     0x7e09e12fac91 - _ZN6rclcpp7ContextD1Ev
  46:     0x7e09e12f3bea - <unknown>
  47:     0x7e09e0a45495 - __run_exit_handlers
                               at ./stdlib/exit.c:113:8
  48:     0x7e09e0a45610 - __GI_exit
                               at ./stdlib/exit.c:143:3
  49:     0x7e09e0a29d97 - __libc_start_call_main
                               at ./csu/../sysdeps/nptl/libc_start_call_main.h:74:3
  50:     0x7e09e0a29e40 - __libc_start_main_impl
                               at ./csu/../csu/libc-start.c:392:3
  51:     0x584c927db915 - _start
  52:                0x0 - <unknown>
thread '<unnamed>' panicked at library/core/src/panicking.rs:163:5:
panic in a destructor during cleanup
thread caused non-unwinding panic. aborting.

This happens because we call z_close within __run_exit_handlers. We have concluded that terminating the tokio runtime within an atexit handler is unsound and leads to undefined behavior; see eclipse-zenoh/zenoh#973 and eclipse-zenoh/zenoh#1004 (comment). Basically, termination during the exit stage is beyond what the tokio runtime covers: Rust is designed not to drop any static variable before the end of the program, and running destructors in exit handlers is error-prone even in C/C++.
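
To illustrate the C/C++ side of this hazard, here is a minimal standalone sketch (not taken from rmw_zenoh; all names are illustrative). Within one translation unit the destruction order is well defined, but across translation units it is unspecified, which is the same class of hazard as calling z_close() (which touches tokio's thread-local state) from an exit handler:

#include <cstdio>
#include <string>

// Constructed first in this translation unit, therefore destroyed last.
// If it lived in a *different* translation unit, the relative destruction
// order against g_session below would be unspecified.
std::string g_log_prefix = "[session] ";

struct Session {
  ~Session() {
    // Runs inside __run_exit_handlers. If g_log_prefix had already been
    // destroyed at this point, this access would be undefined behavior --
    // analogous to z_close() touching tokio's already-destroyed TLS.
    std::printf("%sclosed\n", g_log_prefix.c_str());
  }
};

Session g_session;

int main() { return 0; }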

On the other hand, rmw_cyclonedds does not seem to perform any explicit termination during the exit stage: https://github.com/ros2/rmw_cyclonedds/blob/c6dbe24b2f2be87cf8e4750d89501657ab83566f/rmw_cyclonedds_cpp/src/rmw_node.cpp#L1629. I've run the same test with the following patch; the error disappears and rmw_zenoh seems fine.

diff --git a/rmw_zenoh_cpp/src/rmw_init.cpp b/rmw_zenoh_cpp/src/rmw_init.cpp
index edd9545..90ef360 100644
--- a/rmw_zenoh_cpp/src/rmw_init.cpp
+++ b/rmw_zenoh_cpp/src/rmw_init.cpp
@@ -371,11 +371,12 @@ rmw_shutdown(rmw_context_t * context)
   if (context->impl->shm_manager.has_value()) {
     z_drop(z_move(context->impl->shm_manager.value()));
   }
-  // Close the zenoh session
-  if (z_close(z_move(context->impl->session)) < 0) {
-    RMW_SET_ERROR_MSG("Error while closing zenoh session");
-    return RMW_RET_ERROR;
-  }
+
+  // // Close the zenoh session
+  // if (z_close(z_move(context->impl->session)) < 0) {
+  //   RMW_SET_ERROR_MSG("Error while closing zenoh session");
+  //   return RMW_RET_ERROR;
+  // }

   context->impl->is_shutdown = true;

In conclusion, we have a few possible solutions:

  1. Apply the patch above.
  2. Use a "Guard" to terminate the zenoh sessions properly. An example looks like this; a minimal sketch also follows below.
  3. Terminate rmw_zenoh proactively instead of during the exit phase.

TBH, I'm not sure if options 2 and 3 are feasible in the ROS RMW layer. Any comment is welcome! 🙂
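
For reference, option 2 could look something like the following C++ sketch. This is only an illustration under my assumptions -- ShutdownGuard is a made-up name, not an existing rmw_zenoh or zenoh-c API:

#include <functional>
#include <utility>

// A small RAII "guard": it runs the session teardown when it goes out of
// scope (e.g. at the end of main()), while the tokio runtime and its
// thread-local storage are still intact, instead of deferring z_close()
// to __run_exit_handlers.
class ShutdownGuard {
public:
  explicit ShutdownGuard(std::function<void()> teardown)
  : teardown_(std::move(teardown)) {}

  ~ShutdownGuard() {
    if (teardown_) {
      teardown_();
    }
  }

  // Non-copyable: the teardown must run exactly once.
  ShutdownGuard(const ShutdownGuard &) = delete;
  ShutdownGuard & operator=(const ShutdownGuard &) = delete;

private:
  std::function<void()> teardown_;
};

// Usage sketch, early in main():
//   ShutdownGuard guard([&] { z_close(z_move(context->impl->session)); });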

@Yadunund
Member

Yadunund commented May 8, 2024

@YuanYuYuan could you provide steps to reproduce the error?

With the patch above, are you suggesting that it is ok to not close the zenoh session? Wouldn't that lead to memory leaks?

@clalancette should we try moving this block back into rmw_context_fini?

@clalancette
Collaborator

@clalancette should we try moving this block back into rmw_context_fini?

We need to take a close look before making changes here again. I've moved this code twice already for two different bugs, so we need to go back, look at the contexts these are being called in (for both rclcpp and rclpy), and see what the previous bugs were.

@YuanYuYuan
Contributor Author

@YuanYuYuan could you provide steps to reproduce the error?

With the patch above, are you suggesting that it is ok to not close the zenoh session? Wouldn't that lead to memory leaks?

@clalancette should we try moving this block back into rmw_context_fini?

Please follow the scripts written by @evshary.

@evshary

evshary commented May 9, 2024

The issue disappeared after applying the patch. If you want to use the scripts, please note that you need to update the repository here. The issue occurs with rmw_zenoh regardless of whether the test is run intra-process or inter-process.

@takasehideki
Contributor

Same here. I'm trying to run ApexAI/performance_test and encountered the same issue. But I'd like to report that applying the patch suggested in this post works fine in my environment.

Question just in case: I cannot access evshary/benchmark_rmw_zenoh, which was mentioned in this thread. Is the method (scripts?) for dealing with this issue the same as the one suggested in this post?

@YuanYuYuan
Contributor Author

Same here. I'm trying to run ApexAI/performance_test and encountered the same issue. But I'd like to report that applying the patch suggested in this post works fine in my environment.

Question just in case: I cannot access evshary/benchmark_rmw_zenoh, which was mentioned in this thread. Is the method (scripts?) for dealing with this issue the same as the one suggested in this post?

Hi @takasehideki! The script has been moved here. And the fix is still the same.

@Yadunund
Member

@YuanYuYuan I'm looking into this issue more closely now and could benefit from your insights.

Summarizing some basic information first:

  • With the current implementation in the rolling branch, we call z_close() when rmw_shutdown() is called.
  • In ROS 2 nodes written with rclcpp, rmw_shutdown() is invoked within rcl_shutdown(), and rcl_shutdown() is called within Context::shutdown().
  • Context::shutdown() is called in three scenarios
    1. Within deferred_signal_handler which is a function that executes upon receiving a SIGINT/SIGTERM.
    2. Within rclcpp::shutdown() which the user should call before their program terminates.
    3. Within ~Context() in case the user forgot to do 2).
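
For concreteness, scenario 2 corresponds to the usual rclcpp pattern below (a minimal sketch; the node name is arbitrary):

#include <rclcpp/rclcpp.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("talker");
  rclcpp::spin(node);   // returns once the context is shut down (e.g. SIGINT)
  rclcpp::shutdown();   // scenario 2: explicit shutdown before main() returns
  return 0;
}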

Above you mentioned

This happens because we call z_close within __run_exit_handlers. We have concluded that terminating the tokio runtime within an atexit handler is unsound and leads to undefined behavior; see eclipse-zenoh/zenoh#973 and eclipse-zenoh/zenoh#1004 (comment). Basically, termination during the exit stage is beyond what the tokio runtime covers: Rust is designed not to drop any static variable before the end of the program, and running destructors in exit handlers is error-prone even in C/C++.

This would imply that anytime I terminate a ROS executable (written with rclcpp), I should expect to see the above panic/backtrace, since we will close the session as part of scenario 1 described above. But this does not seem to be the case: I can run ros2 run demo_nodes_cpp talker, hit CTRL+C, and I don't see the panic. So is it really related to exit handlers?

Or could it be that we see this unsoundness when scenario 3 occurs? I.e., the user does not call rclcpp::shutdown() and the ~Context() destructor is the one closing the Zenoh session. This destructor is called when the program terminates, at which point the Thread Local Storage has already been cleared.

I did a quick search for rclcpp::shutdown() in the Apex repo and couldn't find any references 🤔

@YuanYuYuan
Contributor Author

Hi @Yadunund, thanks for spending time on this issue. I think the panic depends on some race conditions that I'm still investigating. Here is one that reproduces it reliably: eclipse-zenoh/zenoh-cpp#198.

@takasehideki
Contributor

Hi @YuanYuYuan

Same here. I'm trying to run ApexAI/performance_test and encountered the same issue. But I'd like to report that applying the patch suggested in this post works fine in my environment.
Question just in case: I cannot access evshary/benchmark_rmw_zenoh, which was mentioned in this thread. Is the method (scripts?) for dealing with this issue the same as the one suggested in this post?

Hi @takasehideki! The script has been moved here. And the fix is still the same.

Thank you so much for pointing this out. But unfortunately I still cannot access the link (maybe because the repository is set to private). It is not a problem for me, though, since I now understand that the fix is the same.

@YuanYuYuan
Contributor Author

Thank you so much for pointing this out. But unfortunately I still cannot access the link (maybe because the repository is set to private). It is not a problem for me, though, since I now understand that the fix is the same.

My bad! I wasn't even aware that the repo had been made private 😅. Yup, the fix should be the same.

@Yadunund
Member

Hi @Yadunund, thanks for spending time on this issue. I think the panic depends on some race conditions that I'm still investigating. Here is one that reproduces it reliably: eclipse-zenoh/zenoh-cpp#198.

@YuanYuYuan I'm not sure if this is a race condition. I have a much simpler reproducible scenario that confirms my hypothesis above. See https://github.com/Yadunund/rmw_zenoh_shutdown_test.

Basically, if we let the rclcpp::Context destructor call Context::shutdown() (which in turn calls rmw_shutdown()) when the program terminates, tokio will panic because the Thread Local Storage is cleared before the destructor executes, perhaps due to the ordering of static initialization (statics are destroyed in reverse order of construction).

Therefore we need to ensure that the Context is shut down before termination. The signal handler thread does this, and so does explicitly calling rclcpp::shutdown() in the program.

This also explains why we see the panic every time if we close the Zenoh session in rmw_context_fini instead of rmw_shutdown: rmw_context_fini is only called when the custom deleter for the rcl_context executes, which always happens within ~Context(), as seen here.
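
To make the failing pattern concrete, a program like the following sketch (node name is arbitrary) skips rclcpp::shutdown() and therefore defers the session closure to ~Context() during program exit:

#include <rclcpp/rclcpp.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("shutdown_test");
  // ... do some work ...
  // rclcpp::shutdown() intentionally omitted: the default context's
  // ~Context() will call Context::shutdown() -> rcl_shutdown() ->
  // rmw_shutdown() -> z_close() inside __run_exit_handlers, after
  // tokio's thread-local storage has been torn down.
  return 0;
}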

Therefore, until the tokio behavior is fixed upstream, I propose the following workarounds:

  1. Add a "Known Issues" section to the README of rmw_zenoh_cpp informing users that they need to explicitly call rclcpp::shutdown() in their programs.
  2. Update the Apex test and other tests in ros2 to ensure rclcpp::shutdown() is invoked. Some of the tests currently rely implicitly on ~Context() to shut down the context.

cc: @JEnoch @clalancette @ahcorde

@Yadunund
Member

Yadunund commented Sep 20, 2024

With ros2/rclcpp#2633, the shutdown issue has disappeared for all tests in rclcpp. We should do the same for the Apex repo. @takasehideki would you be able to apply similar changes to the Apex repo and see if that resolves the issue there as well?
