Running libfabric 1.22, I still hit the same issue (probably only now because we had a GCC update):
# gdb --args ./simple
GNU gdb (GDB; SUSE Linux Enterprise 15) 13.2
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://bugs.opensuse.org/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./simple...
(gdb) r
Starting program: /root/simple
Missing separate debuginfos, use: zypper install glibc-debuginfo-2.38-150600.14.14.2.x86_64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[Detaching after fork from child process 3597]
[New Thread 0x7ffff79636c0 (LWP 3601)]
[New Thread 0x7ffff6f4c6c0 (LWP 3602)]
Thread 1 "simple" received signal SIGILL, Illegal instruction.
0x00007ffff59bdee3 in psm3_getenv_range () from /usr/lib64/libfabric.so.1
(gdb) bt
#0 0x00007ffff59bdee3 in psm3_getenv_range () from /usr/lib64/libfabric.so.1
#1 0x00007ffff59c0fad in psm3_getenv () from /usr/lib64/libfabric.so.1
#2 0x00007ffff59797c2 in psm3_getenv_bool () from /usr/lib64/libfabric.so.1
#3 0x00007ffff59518a1 in psmx3_param_get_bool () from /usr/lib64/libfabric.so.1
#4 0x00007ffff5952161 in fi_psm3_ini () from /usr/lib64/libfabric.so.1
#5 0x00007ffff5641765 in fi_ini () from /usr/lib64/libfabric.so.1
#6 0x00007ffff5641d1b in fi_getinfo () from /usr/lib64/libfabric.so.1
#7 0x00007ffff66eae27 in mca_btl_ofi_component_init (num_btl_modules=0x7fffffffd914, enable_progress_threads=<optimized out>, enable_mpi_threads=<optimized out>) at btl_ofi_component.c:357
#8 0x00007ffff7ac438f in mca_btl_base_select (enable_progress_threads=true, enable_mpi_threads=false) at base/btl_base_select.c:110
#9 0x00007ffff66f6512 in mca_bml_r2_component_init (priority=0x7fffffffd994, enable_progress_threads=<optimized out>, enable_mpi_threads=<optimized out>) at bml_r2_component.c:86
#10 0x00007ffff7f243f4 in mca_bml_base_init (enable_progress_threads=enable_progress_threads@entry=true, enable_mpi_threads=false) at base/bml_base_init.c:74
#11 0x00007ffff7f73976 in ompi_mpi_init (argc=<optimized out>, argv=<optimized out>, requested=0, provided=provided@entry=0x7fffffffdac4, reinit_ok=reinit_ok@entry=false) at runtime/ompi_mpi_init.c:613
#12 0x00007ffff7f0700e in PMPI_Init (argc=0x7fffffffdb1c, argv=0x7fffffffdb10) at pinit.c:67
#13 0x0000000000400909 in main ()
(gdb) frame 4
#4 0x00007ffff5952161 in fi_psm3_ini () from /usr/lib64/libfabric.so.1
(gdb) q
A debugging session is active.
Inferior 1 [process 3594] will be killed.
Quit anyway? (y or n) y
My best guess is that GCC added some unsupported vector instructions in some of the code called through PSMX3_INFO.
I've tried a quick and not too clean patch to check it out:
This only partially solved the issue. Because prov/psm3 uses __attribute__((constructor)) in many places, this test is not enough!
Program received signal SIGILL, Illegal instruction.
0x00007ffff7644341 in init_picos_per_cycle () from /usr/lib64/libfabric.so.1
Missing separate debuginfos, use: zypper install libefa1-debuginfo-54.0-1.1.x86_64 libibverbs1-debuginfo-54.0-1.1.x86_64 libnl3-200-debuginfo-3.11.0-1.1.x86_64 libnuma1-debuginfo-2.0.18.10.g6c14bd5-1.1.x86_64 libpsm2-2-debuginfo-12.0.1-2.2.x86_64 librdmacm1-debuginfo-54.0-1.1.x86_64 libucm0-debuginfo-1.17.0-3.1.x86_64 libucp0-debuginfo-1.17.0-3.1.x86_64 libuct0-debuginfo-1.17.0-3.1.x86_64 libuuid1-debuginfo-2.40.2-2.1.x86_64
(gdb) bt
#0 0x00007ffff7644341 in init_picos_per_cycle () from /usr/lib64/libfabric.so.1
#1 0x00007ffff7fc969e in call_init () from /lib64/ld-linux-x86-64.so.2
#2 0x00007ffff7fc979c in _dl_init () from /lib64/ld-linux-x86-64.so.2
#3 0x00007ffff7fc65fe in _dl_catch_exception () from /lib64/ld-linux-x86-64.so.2
#4 0x00007ffff7fd06ce in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#5 0x00007ffff7fc6571 in _dl_catch_exception () from /lib64/ld-linux-x86-64.so.2
#6 0x00007ffff7fd0b2c in _dl_open () from /lib64/ld-linux-x86-64.so.2
#7 0x00007ffff7c93a3c in dlopen_doit () from /lib64/libc.so.6
#8 0x00007ffff7fc6571 in _dl_catch_exception () from /lib64/ld-linux-x86-64.so.2
#9 0x00007ffff7fc66a3 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#10 0x00007ffff7c934e7 in _dlerror_run () from /lib64/libc.so.6
#11 0x00007ffff7c93b01 in dlopen@GLIBC_2.2.5 () from /lib64/libc.so.6
#12 0x00005555555551d5 in ?? ()
#13 0x00007ffff7c2a2ae in __libc_start_call_main () from /lib64/libc.so.6
#14 0x00007ffff7c2a379 in __libc_start_main_impl () from /lib64/libc.so.6
#15 0x00005555555550b5 in ?? ()
Dump of assembler code for function init_picos_per_cycle:
0x00007ffff7644340 <+0>: push %rbp
=> 0x00007ffff7644341 <+1>: vpxor %xmm0,%xmm0,%xmm0
Similar issue to #8933
Running libfabric 1.22, I still hit the same issue (probably only now because we had a GCC update):
My best guess is that GCC added some unsupported vector instructions in some of the code called through PSMX3_INFO.
I've tried a quick and not too clean patch to check it out:
And this works.
I haven't checked v2.0.0 but I'm guessing it is broken there as well