v5.0.5 sm btl hanging and producing heap-buffer-overflow #12816

Open
judicaelclair opened this issue Sep 18, 2024 · 0 comments

OpenMPI v5.0.5 with the sm BTL is causing my workload to hang/crash.

  • All code (application & MPI) compiled with Clang 19.1.0.
  • Official tarballs were used to build the MPI library.
  • OS: Ubuntu 22.04.5 LTS.
  • Hardware: AMD Threadripper Pro 5995WX
  • MPI runs in funneled mode.
  • Parameters are being checked: --mca mpi_param_check=1.
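
For reference, the build and launch looked roughly like the following (a hypothetical sketch; the install prefix, process count, and binary name `./app` are placeholders, not taken from the actual setup):

```shell
# Build the official tarball with Clang 19.1.0 (paths are placeholders)
tar xf openmpi-5.0.5.tar.bz2 && cd openmpi-5.0.5
./configure CC=clang CXX=clang++ --prefix=/opt/openmpi-5.0.5
make -j && make install

# Launch with parameter checking enabled
mpirun --mca mpi_param_check 1 -n 8 ./app
```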

At a high level, my workload consists of an array of requests that is progressed via calls to MPI_Testsome. This array grows, shrinks, and shuffles over time. Requests don't necessarily belong to the same MPI_Comm. Requests are mainly related to p2p communication via MPI_Irecv, MPI_Recv_init, etc. Some requests are long-lived, whereas others are short-lived.
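
To make the pattern concrete, here is a minimal hypothetical sketch of that kind of workload: a request array mixing a persistent receive and a non-blocking send, progressed by polling with MPI_Testsome. All names (`reqs`, `nreqs`, `BUF_LEN`, `TAG`) and the specific operations are illustrative, not taken from the actual application:

```c
/* Minimal sketch of a Testsome-driven progress loop (assumptions:
 * names and message sizes are illustrative, not from the real app). */
#include <mpi.h>
#include <stdlib.h>

#define BUF_LEN 4096
#define TAG 42

int main(int argc, char **argv)
{
    int provided;
    /* The issue states MPI runs in funneled mode. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *rbuf = malloc(BUF_LEN);
    char *sbuf = malloc(BUF_LEN);
    MPI_Request reqs[2];
    int nreqs = 0;

    /* Long-lived persistent receive (MPI_Recv_init + MPI_Start)... */
    MPI_Recv_init(rbuf, BUF_LEN, MPI_BYTE, (rank + 1) % size, TAG,
                  MPI_COMM_WORLD, &reqs[nreqs]);
    MPI_Start(&reqs[nreqs]);
    nreqs++;

    /* ...alongside a short-lived non-blocking send. */
    MPI_Isend(sbuf, BUF_LEN, MPI_BYTE, (rank + size - 1) % size, TAG,
              MPI_COMM_WORLD, &reqs[nreqs]);
    nreqs++;

    /* Progress loop: repeatedly poll the whole array with MPI_Testsome.
     * A completed persistent request becomes inactive and is not
     * reported again; a completed Isend becomes MPI_REQUEST_NULL. */
    int done = 0;
    while (done < nreqs) {
        int outcount, indices[2];
        MPI_Testsome(nreqs, reqs, &outcount, indices, MPI_STATUSES_IGNORE);
        if (outcount != MPI_UNDEFINED)
            done += outcount;
    }

    MPI_Request_free(&reqs[0]); /* release the persistent request */
    free(rbuf);
    free(sbuf);
    MPI_Finalize();
    return 0;
}
```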

  • If I use sm (via --mca btl self,sm), the application starts hanging pretty quickly.
  • If I make most, if not all, sends synchronous (i.e. MPI_Issend), the application hangs immediately (i.e. synchronous sends make things worse).
  • To get rid of most of the hanging, I have to add an MPI_Testall before every call to MPI_Testsome, but even then the application usually hangs after a while.
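
The partial workaround in the last bullet amounts to something like the following hypothetical helper (the function name and parameters are illustrative; `reqs`/`nreqs` stand in for the application's request array):

```c
#include <mpi.h>

/* Hypothetical sketch of the workaround: an extra MPI_Testall pass,
 * whose result is discarded, before the real MPI_Testsome poll.
 * If MPI_Testall cannot complete every request, it completes none,
 * so the extra call only drives progress. */
static void progress_requests(MPI_Request *reqs, int nreqs,
                              int *indices, int *outcount)
{
    int all_done; /* ignored */
    MPI_Testall(nreqs, reqs, &all_done, MPI_STATUSES_IGNORE);
    MPI_Testsome(nreqs, reqs, outcount, indices, MPI_STATUSES_IGNORE);
}
```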

If I do any of the following, everything works perfectly fine, regardless of whether sends are synchronous (e.g. MPI_Issend) and whether MPI_Testall is injected before MPI_Testsome:

  • tcp is forced instead of sm (via --mca btl self,tcp).
  • or, OpenMPI v4.1.6 is used.
  • or, MPICH v4.2.2 is used.

If I sanitise my application with ASan+UBSan (I did not sanitise OpenMPI itself, as in the past that has given me lots of false positives), I get heap-buffer-overflow errors originating from OpenMPI's sm implementation details.

Example 1:

    #0 0x5de326394471 in memcpy sanitizer_common_interceptors_memintrinsics.inc:115:5
    #1 0x75daae589067 in sm_prepare_src btl_sm_module.c
    #2 0x75daaf20f6fa in mca_pml_ob1_send_request_schedule_once (/openmpi-5.0.5/lib/libmpi.so.40+0x20f6fa)
    #3 0x75daaf2083a7 in mca_pml_ob1_recv_frag_callback_ack (/openmpi-5.0.5/lib/libmpi.so.40+0x2083a7)
    #4 0x75daae58a231 in mca_btl_sm_component_progress btl_sm_component.c
    #5 0x75daae50fdec in opal_progress (/openmpi-5.0.5/lib/libopen-pal.so.80+0x21dec)
    #6 0x75daaf083a5d in ompi_request_default_test_all (/openmpi-5.0.5/lib/libmpi.so.40+0x83a5d)
    #7 0x75daaf0c5062 in MPI_Testall (/openmpi-5.0.5/lib/libmpi.so.40+0xc5062)

SUMMARY: AddressSanitizer: heap-buffer-overflow btl_sm_module.c in sm_prepare_src
Shadow bytes around the buggy address:
  0x5040000bb780: fa fa 00 00 00 00 00 fa fa fa 00 00 00 00 00 fa
  0x5040000bb800: fa fa 00 00 00 00 06 fa fa fa 00 00 00 00 00 fa
  0x5040000bb880: fa fa 00 00 00 00 00 fa fa fa 00 00 00 00 06 fa
  0x5040000bb900: fa fa fa fa fa fa fa fa fa fa 00 00 00 00 00 fa
  0x5040000bb980: fa fa 00 00 00 00 00 fa fa fa 00 00 00 00 00 00
=>0x5040000bba00:[fa]fa 00 00 00 00 03 fa fa fa 00 00 00 00 00 fa
  0x5040000bba80: fa fa 00 00 00 00 00 fa fa fa fd fd fd fd fd fa
  0x5040000bbb00: fa fa fd fd fd fd fd fd fa fa 00 00 00 00 00 fa
  0x5040000bbb80: fa fa 00 00 00 00 00 fa fa fa 00 00 00 00 00 fa
  0x5040000bbc00: fa fa fd fd fd fd fd fd fa fa fd fd fd fd fd fa
  0x5040000bbc80: fa fa fd fd fd fd fd fd fa fa 00 00 00 00 00 fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb

Example 2:

    #0 0x589dde322471 in memcpy sanitizer_common_interceptors_memintrinsics.inc:115:5
    #1 0x70e2eb5077e7 in mca_btl_sm_fbox_sendi btl_sm_sendi.c
    #2 0x70e2eb50742c in mca_btl_sm_sendi (/openmpi-5.0.5/lib/libopen-pal.so.80+0x9d42c)
    #3 0x70e2ec1fdebc in mca_pml_ob1_process_pending_packets (/openmpi-5.0.5/lib/libmpi.so.40+0x1fdebc)
    #4 0x70e2ec20939f in mca_pml_ob1_rget_completion pml_ob1_recvreq.c
    #5 0x70e2eb507ba5 in mca_btl_sm_get (/openmpi-5.0.5/lib/libopen-pal.so.80+0x9dba5)
    #6 0x70e2ec20a013 in mca_pml_ob1_recv_request_progress_rget (/openmpi-5.0.5/lib/libmpi.so.40+0x20a013)
    #7 0x70e2ec20c45e in mca_pml_ob1_recv_req_start (/openmpi-5.0.5/lib/libmpi.so.40+0x20c45e)
    #8 0x70e2ec200a87 in mca_pml_ob1_irecv (/openmpi-5.0.5/lib/libmpi.so.40+0x200a87)
    #9 0x70e2ec0b3a1b in PMPI_Irecv (/openmpi-5.0.5/lib/libmpi.so.40+0xb3a1b)

0x5290005092b8 is located 0 bytes after 16568-byte region [0x529000505200,0x5290005092b8)
allocated by thread T0 here:
    #0 0x589dde324160 in malloc /clang-19.1.0/compiler-rt/lib/asan/asan_malloc_linux.cpp:68:3
    #1 0x70e2eb48e5c9 in opal_free_list_grow_st (/openmpi-5.0.5/lib/libopen-pal.so.80+0x245c9)

SUMMARY: AddressSanitizer: heap-buffer-overflow btl_sm_sendi.c in mca_btl_sm_fbox_sendi
Shadow bytes around the buggy address:
  0x529000509000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x529000509080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x529000509100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x529000509180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x529000509200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x529000509280: 00 00 00 00 00 00 00[fa]fa fa fa fa fa fa fa fa
  0x529000509300: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x529000509380: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x529000509400: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x529000509480: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x529000509500: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb