GNU bug report logs - #46229
rdma-core 33.x breaks InfiniBand support in Open MPI
Report forwarded to code <at> greghogan.com, florent.pruvost <at> inria.fr, efraim <at> flashner.co.il, bug-guix <at> gnu.org: bug#46229; Package guix. (Mon, 01 Feb 2021 08:56:01 GMT)
Acknowledgement sent to Ludovic Courtès <ludovic.courtes <at> inria.fr>: New bug report received and forwarded. Copy sent to code <at> greghogan.com, florent.pruvost <at> inria.fr, efraim <at> flashner.co.il, bug-guix <at> gnu.org. (Mon, 01 Feb 2021 08:56:01 GMT)
Message #5 received at submit <at> debbugs.gnu.org:
Hello,
We noticed that the recent rdma-core upgrade to 33.1¹ leads to segfaults
in InfiniBand related routines:
--8<---------------cut here---------------start------------->8---
$ guix time-machine --commit=23a5dcce1d893b8f5c5301ae3c1af863776ed3cf -- environment --pure --ad-hoc openmpi openssh intel-mpi-benchmarks --with-debug-info=rdma-core -- mpiexec -np 2 IMB-MPI1 PingPong
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 0 on node devel02 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
$ file core.20879
core.20879: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'IMB-MPI1 PingPong', real uid: 10218, effective uid: 10218, real gid: 11018, effective gid: 11018, execfn: '/gnu/store/ls8pkyi05iabk952x7gy545lc7zyr4cv-profile/bin/IMB-MPI1', platform: 'x86_64'
$ gdb /gnu/store/ls8pkyi05iabk952x7gy545lc7zyr4cv-profile/bin/IMB-MPI1 core.20879
(gdb) bt
#0 0x00007f93b2789e88 in ibv_cmd_create_cq ()
from /gnu/store/n52snxjsq25m1wgmm6h1v60myld8dyjr-rdma-core-33.1/lib/libibverbs.so.1
#1 0x00007f93b28c57bb in hfi1_create_cq ()
from /gnu/store/n52snxjsq25m1wgmm6h1v60myld8dyjr-rdma-core-33.1/lib/libibverbs/libhfi1verbs-rdmav33.so
#2 0x00007f93b2796331 in ibv_create_cq@@IBVERBS_1.1 ()
from /gnu/store/n52snxjsq25m1wgmm6h1v60myld8dyjr-rdma-core-33.1/lib/libibverbs.so.1
#3 0x00007f93b27c0a55 in opal_common_verbs_qp_test ()
from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/libmca_common_verbs.so.40
#4 0x00007f93b27f4e83 in btl_openib_component_init ()
from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/openmpi/mca_btl_openib.so
#5 0x00007f93b4516aaf in mca_btl_base_select ()
from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/libopen-pal.so.40
#6 0x00007f93b29552c2 in mca_bml_r2_component_init ()
from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/openmpi/mca_bml_r2.so
#7 0x00007f93b4b81b54 in mca_bml_base_init ()
from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/libmpi.so.40
#8 0x00007f93b4bc4ef8 in ompi_mpi_init ()
from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/libmpi.so.40
#9 0x00007f93b4b5ee55 in PMPI_Init_thread ()
from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/libmpi.so.40
#10 0x0000000000405b55 in main ()
--8<---------------cut here---------------end--------------->8---
Conversely, a pre-upgrade commit works fine:
--8<---------------cut here---------------start------------->8---
$ guix time-machine --commit=c2538db5617032788ac2f140496d00d8107579c8 -- environment --pure --ad-hoc openmpi openssh intel-mpi-benchmarks -- mpiexec -np 2 IMB-MPI1 PingPong
--8<---------------cut here---------------end--------------->8---
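To repeat the comparison, the two invocations above can be wrapped in a small script. This is only a sketch: the `pingpong` helper and the `RUN` variable are made-up names, and the `--with-debug-info=rdma-core` option used for the failing run is left out. By default it prints the commands instead of running them.

```shell
#!/bin/sh
# Commits taken from the report above.
bad=23a5dcce1d893b8f5c5301ae3c1af863776ed3cf
good=c2538db5617032788ac2f140496d00d8107579c8

# Hypothetical helper: build the reproducer command for one Guix commit.
# It only prints the command unless RUN=1 is set in the environment.
pingpong() {
    set -- guix time-machine --commit="$1" -- \
        environment --pure --ad-hoc openmpi openssh intel-mpi-benchmarks \
        -- mpiexec -np 2 IMB-MPI1 PingPong
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        echo "$@"
    fi
}

# Known-good commit first, then the one that segfaults.
for commit in "$good" "$bad"; do
    pingpong "$commit"
done
```

Running it with `RUN=1` on a machine with the affected hardware should execute both benchmarks in turn.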
Does that ring a bell?
Thanks,
Ludo’.
¹ https://git.savannah.gnu.org/cgit/guix.git/commit/?id=c2739c0801ebc5461564e862ce8f08405e2782dc
Information forwarded to bug-guix <at> gnu.org: bug#46229; Package guix. (Mon, 01 Feb 2021 09:15:02 GMT)
Message #8 received at 46229 <at> debbugs.gnu.org:
On Mon, Feb 01, 2021 at 09:55:19AM +0100, Ludovic Courtès wrote:
> Hello,
>
> We noticed that the recent rdma-core upgrade to 33.1¹ leads to segfaults
> in InfiniBand related routines:
I thought I built everything that depended on rdma-core, and
unfortunately I don't have a way to test it. As an actual user of the
package I trust you to revert the change if necessary.
I don't see anything on their mailing list pointing to this, or any
other bugs really.
http://vger.kernel.org/vger-lists.html#linux-rdma
--
Efraim Flashner <efraim <at> flashner.co.il> אפרים פלשנר
GPG key = A28B F40C 3E55 1372 662D 14F7 41AA E7DC CA3D 8351
Confidentiality cannot be guaranteed on emails sent or received unencrypted
Information forwarded to bug-guix <at> gnu.org: bug#46229; Package guix. (Mon, 01 Feb 2021 10:14:01 GMT)
Message #11 received at 46229 <at> debbugs.gnu.org:
Ludovic Courtès <ludovic.courtes <at> inria.fr> writes:
> $ guix time-machine --commit=23a5dcce1d893b8f5c5301ae3c1af863776ed3cf -- environment --pure --ad-hoc openmpi openssh intel-mpi-benchmarks --with-debug-info=rdma-core -- mpiexec -np 2 IMB-MPI1 PingPong
A workaround is to tell Open MPI not to use its openib BTL component, and thus the Verbs library, with:
mpiexec --mca btl ^openib …
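The same exclusion can presumably be set once through the environment, since Open MPI reads MCA parameters from variables prefixed with `OMPI_MCA_`. A minimal sketch, not tested on the affected machine:

```shell
# Exclude the openib BTL for every Open MPI job started from this shell.
# Quote the value so the caret reaches Open MPI literally.
export OMPI_MCA_btl='^openib'

# Show the setting that mpiexec will pick up.
echo "OMPI_MCA_btl=$OMPI_MCA_btl"

# Then run the benchmark as before, for example:
#   mpiexec -np 2 IMB-MPI1 PingPong
```

This avoids having to add `--mca btl ^openib` to each command line.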
Ludo’.
Information forwarded to bug-guix <at> gnu.org: bug#46229; Package guix. (Mon, 01 Feb 2021 11:11:01 GMT)
Message #14 received at 46229 <at> debbugs.gnu.org:
Ludovic Courtès <ludovic.courtes <at> inria.fr> writes:
> mpiexec noticed that process rank 1 with PID 0 on node devel02 exited on signal 11 (Segmentation fault).
Now with a nicer backtrace:
--8<---------------cut here---------------start------------->8---
(gdb) bt full
#0 attr_optional (attr=0x0) at include/infiniband/cmd_ioctl.h:239
No locals.
#1 ibv_icmd_create_cq (context=context@entry=0x1074890, cqe=cqe@entry=2, channel=channel@entry=0x0,
comp_vector=comp_vector@entry=0, flags=flags@entry=0, cq=cq@entry=0x1074c50, link=0x7ffe0a089690)
at /tmp/guix-build-rdma-core-33.A.drv-0/rdma-core-33.1/libibverbs/cmd_cq.c:63
cmdb = {{next = 0x7ffe0a089690, next_attr = 0x0, last_attr = 0x0, uhw_in_idx = 255 '\377',
uhw_out_idx = 255 '\377', uhw_in_headroom_dwords = 0 '\000', uhw_out_headroom_dwords = 0 '\000',
buffer_error = 0 '\000', fallback_require_ex = 0 '\000', fallback_ioctl_only = 0 '\000', hdr = {
length = 0, object_id = 0, method_id = 0, num_attrs = 0, reserved1 = 0, driver_id = 0, reserved2 = 0,
attrs = 0x7ffe0a0895f8}}}
priv = <optimized out>
handle = <optimized out>
async_fd_attr = <optimized out>
resp_cqe = <optimized out>
ret = 0
#2 0x00007f9ec83f2e4e in ibv_cmd_create_cq (context=context@entry=0x1074890, cqe=cqe@entry=2,
channel=channel@entry=0x0, comp_vector=comp_vector@entry=0, cq=cq@entry=0x1074c50, cmd=cmd@entry=0x0, cmd_size=0,
resp=0x7ffe0a089760, resp_size=16) at /tmp/guix-build-rdma-core-33.A.drv-0/rdma-core-33.1/libibverbs/cmd_cq.c:137
__cmdbtotal = 2
cmdb = {{next = 0x0, next_attr = 0x7ffe0a0896d8, last_attr = 0x7ffe0a0896e8, uhw_in_idx = 255 '\377',
uhw_out_idx = 0 '\000', uhw_in_headroom_dwords = 0 '\000', uhw_out_headroom_dwords = 2 '\002',
buffer_error = 0 '\000', fallback_require_ex = 0 '\000', fallback_ioctl_only = 0 '\000', hdr = {
length = 0, object_id = 3, method_id = 0, num_attrs = 0, reserved1 = 0, driver_id = 0, reserved2 = 0,
attrs = 0x7ffe0a0896c8}}, {next = 0x100081001, next_attr = 0x7ffe0a089768, last_attr = 0x6e0000005b,
uhw_in_idx = 124 '|', uhw_out_idx = 0 '\000', uhw_in_headroom_dwords = 0 '\000',
uhw_out_headroom_dwords = 0 '\000', buffer_error = 1 '\001', fallback_require_ex = 1 '\001',
fallback_ioctl_only = 1 '\001', hdr = {length = 0, object_id = 0, method_id = 0, num_attrs = 0,
reserved1 = 0, driver_id = 17, reserved2 = 15, attrs = 0x7ffe0a089700}}}
__cmdbdummy = <optimized out>
#3 0x00007f9ec85257bb in hfi1_create_cq (context=0x1074890, cqe=2, channel=0x0, comp_vector=0)
at /tmp/guix-build-rdma-core-33.A.drv-0/rdma-core-33.1/providers/hfi1verbs/verbs.c:184
cq = 0x1074c50
resp = {ibv_resp = {cq_handle = 0, cqe = 0, driver_data = 0x7ffe0a089768}, offset = 0}
ret = <optimized out>
size = <optimized out>
#4 0x00007f9ec83fde41 in __ibv_create_cq_1_1 (context=0x1074890, cqe=<optimized out>, cq_context=0x0, channel=0x0,
comp_vector=<optimized out>) at /tmp/guix-build-rdma-core-33.A.drv-0/rdma-core-33.1/libibverbs/verbs.c:509
cq = <optimized out>
#5 0x00007f9ec8426a55 in opal_common_verbs_qp_test ()
from /gnu/store/6gcssdkn8iaiagzdv0d9mi93gppc85r4-openmpi-4.0.5/lib/libmca_common_verbs.so.40
No symbol table info available.
#6 0x00007f9ec8454e83 in btl_openib_component_init ()
from /gnu/store/6gcssdkn8iaiagzdv0d9mi93gppc85r4-openmpi-4.0.5/lib/openmpi/mca_btl_openib.so
--8<---------------cut here---------------end--------------->8---
Version 29.2 is good and everything beyond that isn’t. This has to do
with those rdma-core changes:
--8<---------------cut here---------------start------------->8---
$ git log --oneline v26.4..v33.1 libibverbs/cmd_cq.c
317d8895 verbs: Enhance async FD usage
195c9191 verbs: Introduce verbs_cq for extended CQ operations
90a4d0cc verbs: Extend CQ KABI to get an async FD
--8<---------------cut here---------------end--------------->8---
(The first commit in the list above appeared in v30.)
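To check which release first shipped one of these commits, something along these lines should work in an rdma-core checkout. It is a sketch: `first_tag_containing` is a made-up helper name, the `./rdma-core` path is an assumption, and `sort -V` assumes GNU coreutils.

```shell
# Hypothetical helper: print the earliest tag (in version order) that
# contains commit $2 in the Git checkout located at directory $1.
first_tag_containing() {
    git -C "$1" tag --contains "$2" | sort -V | head -n 1
}

# Example, assuming an rdma-core checkout in ./rdma-core:
#   first_tag_containing ./rdma-core 317d8895
```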
I forgot to mention this happens with Omni-Path hardware:
--8<---------------cut here---------------start------------->8---
$ guix environment --ad-hoc rdma-core -- ibv_devinfo
hca_id: hfi1_0
transport: InfiniBand (0)
fw_ver: 1.27.0
node_guid: 0011:7509:0107:573e
sys_image_guid: 0011:7509:0107:573e
vendor_id: 0x1175
vendor_part_id: 9456
hw_ver: 0x11
board_id: Intel Omni-Path Host Fabric Interface Adapter 100 Series
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 4
port_lmc: 0x00
link_layer: InfiniBand
--8<---------------cut here---------------end--------------->8---
Ludo’.
Reply sent to Ludovic Courtès <ludo <at> gnu.org>: You have taken responsibility. (Mon, 01 Feb 2021 13:06:01 GMT)
Notification sent to Ludovic Courtès <ludovic.courtes <at> inria.fr>: bug acknowledged by developer. (Mon, 01 Feb 2021 13:06:01 GMT)
Message #19 received at 46229-done <at> debbugs.gnu.org:
Good news! This is fixed by:
https://git.savannah.gnu.org/cgit/guix.git/commit/?id=37e997bc7867901dc5eaf9060358dfddacae8dd6
Ludo’.
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 02 Mar 2021 12:24:07 GMT)