GNU bug report logs - #46229
rdma-core 33.x breaks InfiniBand support in Open MPI

Package: guix

Reported by: Ludovic Courtès <ludovic.courtes <at> inria.fr>

Date: Mon, 1 Feb 2021 08:56:01 UTC

Severity: normal

Done: Ludovic Courtès <ludo <at> gnu.org>

Bug is archived. No further changes may be made.


Report forwarded to code <at> greghogan.com, florent.pruvost <at> inria.fr, efraim <at> flashner.co.il, bug-guix <at> gnu.org:
bug#46229; Package guix. (Mon, 01 Feb 2021 08:56:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to Ludovic Courtès <ludovic.courtes <at> inria.fr>:
New bug report received and forwarded. Copy sent to code <at> greghogan.com, florent.pruvost <at> inria.fr, efraim <at> flashner.co.il, bug-guix <at> gnu.org. (Mon, 01 Feb 2021 08:56:01 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludovic.courtes <at> inria.fr>
To: <bug-guix <at> gnu.org>
Subject: rdma-core 33.x breaks InfiniBand support in Open MPI
Date: Mon, 01 Feb 2021 09:55:19 +0100
Hello,

We noticed that the recent rdma-core upgrade to 33.1¹ leads to segfaults
in InfiniBand related routines:

--8<---------------cut here---------------start------------->8---
$ guix time-machine --commit=23a5dcce1d893b8f5c5301ae3c1af863776ed3cf --  environment --pure --ad-hoc openmpi openssh intel-mpi-benchmarks --with-debug-info=rdma-core -- mpiexec -np 2 IMB-MPI1 PingPong
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 0 on node devel02 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
$ file core.20879 
core.20879: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'IMB-MPI1 PingPong', real uid: 10218, effective uid: 10218, real gid: 11018, effective gid: 11018, execfn: '/gnu/store/ls8pkyi05iabk952x7gy545lc7zyr4cv-profile/bin/IMB-MPI1', platform: 'x86_64'
$ gdb /gnu/store/ls8pkyi05iabk952x7gy545lc7zyr4cv-profile/bin/IMB-MPI1 core.20879 
(gdb) bt
#0  0x00007f93b2789e88 in ibv_cmd_create_cq ()
   from /gnu/store/n52snxjsq25m1wgmm6h1v60myld8dyjr-rdma-core-33.1/lib/libibverbs.so.1
#1  0x00007f93b28c57bb in hfi1_create_cq ()
   from /gnu/store/n52snxjsq25m1wgmm6h1v60myld8dyjr-rdma-core-33.1/lib/libibverbs/libhfi1verbs-rdmav33.so
#2  0x00007f93b2796331 in ibv_create_cq@@IBVERBS_1.1 ()
   from /gnu/store/n52snxjsq25m1wgmm6h1v60myld8dyjr-rdma-core-33.1/lib/libibverbs.so.1
#3  0x00007f93b27c0a55 in opal_common_verbs_qp_test ()
   from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/libmca_common_verbs.so.40
#4  0x00007f93b27f4e83 in btl_openib_component_init ()
   from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/openmpi/mca_btl_openib.so
#5  0x00007f93b4516aaf in mca_btl_base_select ()
   from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/libopen-pal.so.40
#6  0x00007f93b29552c2 in mca_bml_r2_component_init ()
   from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/openmpi/mca_bml_r2.so
#7  0x00007f93b4b81b54 in mca_bml_base_init ()
   from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/libmpi.so.40
#8  0x00007f93b4bc4ef8 in ompi_mpi_init ()
   from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/libmpi.so.40
#9  0x00007f93b4b5ee55 in PMPI_Init_thread ()
   from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/libmpi.so.40
#10 0x0000000000405b55 in main ()
--8<---------------cut here---------------end--------------->8---

Conversely, a pre-upgrade commit works fine:

--8<---------------cut here---------------start------------->8---
$ guix time-machine --commit=c2538db5617032788ac2f140496d00d8107579c8 --  environment --pure --ad-hoc openmpi openssh intel-mpi-benchmarks -- mpiexec -np 2 IMB-MPI1 PingPong
--8<---------------cut here---------------end--------------->8---

Does that ring a bell?

Thanks,
Ludo’.

¹ https://git.savannah.gnu.org/cgit/guix.git/commit/?id=c2739c0801ebc5461564e862ce8f08405e2782dc




Information forwarded to bug-guix <at> gnu.org:
bug#46229; Package guix. (Mon, 01 Feb 2021 09:15:02 GMT) Full text and rfc822 format available.

Message #8 received at 46229 <at> debbugs.gnu.org (full text, mbox):

From: Efraim Flashner <efraim <at> flashner.co.il>
To: Ludovic Courtès <ludovic.courtes <at> inria.fr>
Cc: Florent Pruvost <florent.pruvost <at> inria.fr>, 46229 <at> debbugs.gnu.org,
 Greg Hogan <code <at> greghogan.com>
Subject: Re: bug#46229: rdma-core 33.x breaks InfiniBand support in Open MPI
Date: Mon, 1 Feb 2021 11:13:17 +0200
On Mon, Feb 01, 2021 at 09:55:19AM +0100, Ludovic Courtès wrote:
> Hello,
> 
> We noticed that the recent rdma-core upgrade to 33.1¹ leads to segfaults
> in InfiniBand related routines:
> [...]

I thought I built everything that depended on rdma-core, and
unfortunately I don't have a way to test it. As an actual user of the
package I trust you to revert the change if necessary.

I don't see anything on their mailing list pointing to this, or any
other bugs really.
http://vger.kernel.org/vger-lists.html#linux-rdma

-- 
Efraim Flashner   <efraim <at> flashner.co.il>   אפרים פלשנר
GPG key = A28B F40C 3E55 1372 662D  14F7 41AA E7DC CA3D 8351
Confidentiality cannot be guaranteed on emails sent or received unencrypted

Information forwarded to bug-guix <at> gnu.org:
bug#46229; Package guix. (Mon, 01 Feb 2021 10:14:01 GMT) Full text and rfc822 format available.

Message #11 received at 46229 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: 46229 <at> debbugs.gnu.org
Cc: Florent Pruvost <florent.pruvost <at> inria.fr>,
 Efraim Flashner <efraim <at> flashner.co.il>, Greg Hogan <code <at> greghogan.com>
Subject: Re: bug#46229: rdma-core 33.x breaks InfiniBand support in
 Open MPI
Date: Mon, 01 Feb 2021 11:13:07 +0100
Ludovic Courtès <ludovic.courtes <at> inria.fr> skribis:

> $ guix time-machine --commit=23a5dcce1d893b8f5c5301ae3c1af863776ed3cf --  environment --pure --ad-hoc openmpi openssh intel-mpi-benchmarks --with-debug-info=rdma-core -- mpiexec -np 2 IMB-MPI1 PingPong

A workaround is to ask Open MPI to skip its openib BTL, the component
that goes through the Verbs library, with:

  mpiexec --mca btl ^openib …

Ludo’.
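
As a sketch of the workaround above applied to the reproduction command: `--mca btl ^openib` excludes Open MPI's openib BTL, and Open MPI also reads MCA parameters from environment variables prefixed with `OMPI_MCA_`. The helper name `run_pingpong` is hypothetical, the benchmark invocation is copied from the original report, and none of this was tried on the affected Omni-Path machine.

```shell
# Option 1: exclude the openib BTL for every Open MPI job started from
# this shell. The quotes protect the ^ in shells that treat it specially.
export OMPI_MCA_btl='^openib'

# Option 2: pass the exclusion on the command line, as in the workaround.
# run_pingpong is a hypothetical helper name, not part of Open MPI.
run_pingpong() {
    mpiexec --mca btl '^openib' -np 2 IMB-MPI1 PingPong
}
```

With `guix environment --pure`, variables from the calling shell are unset, so the export would have to happen inside the environment; the command-line form avoids that concern.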




Information forwarded to bug-guix <at> gnu.org:
bug#46229; Package guix. (Mon, 01 Feb 2021 11:11:01 GMT) Full text and rfc822 format available.

Message #14 received at 46229 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: 46229 <at> debbugs.gnu.org
Cc: Florent Pruvost <florent.pruvost <at> inria.fr>,
 Efraim Flashner <efraim <at> flashner.co.il>, Greg Hogan <code <at> greghogan.com>
Subject: Re: bug#46229: rdma-core 33.x breaks InfiniBand support in
 Open MPI
Date: Mon, 01 Feb 2021 12:10:34 +0100
Ludovic Courtès <ludovic.courtes <at> inria.fr> skribis:

> mpiexec noticed that process rank 1 with PID 0 on node devel02 exited on signal 11 (Segmentation fault).

Now with a nicer backtrace:

--8<---------------cut here---------------start------------->8---
(gdb) bt full
#0  attr_optional (attr=0x0) at include/infiniband/cmd_ioctl.h:239
No locals.
#1  ibv_icmd_create_cq (context=context@entry=0x1074890, cqe=cqe@entry=2, channel=channel@entry=0x0, 
    comp_vector=comp_vector@entry=0, flags=flags@entry=0, cq=cq@entry=0x1074c50, link=0x7ffe0a089690)
    at /tmp/guix-build-rdma-core-33.1.drv-0/rdma-core-33.1/libibverbs/cmd_cq.c:63
        cmdb = {{next = 0x7ffe0a089690, next_attr = 0x0, last_attr = 0x0, uhw_in_idx = 255 '\377', 
            uhw_out_idx = 255 '\377', uhw_in_headroom_dwords = 0 '\000', uhw_out_headroom_dwords = 0 '\000', 
            buffer_error = 0 '\000', fallback_require_ex = 0 '\000', fallback_ioctl_only = 0 '\000', hdr = {
              length = 0, object_id = 0, method_id = 0, num_attrs = 0, reserved1 = 0, driver_id = 0, reserved2 = 0, 
              attrs = 0x7ffe0a0895f8}}}
        priv = <optimized out>
        handle = <optimized out>
        async_fd_attr = <optimized out>
        resp_cqe = <optimized out>
        ret = 0
#2  0x00007f9ec83f2e4e in ibv_cmd_create_cq (context=context@entry=0x1074890, cqe=cqe@entry=2, 
    channel=channel@entry=0x0, comp_vector=comp_vector@entry=0, cq=cq@entry=0x1074c50, cmd=cmd@entry=0x0, cmd_size=0, 
    resp=0x7ffe0a089760, resp_size=16) at /tmp/guix-build-rdma-core-33.1.drv-0/rdma-core-33.1/libibverbs/cmd_cq.c:137
        __cmdbtotal = 2
        cmdb = {{next = 0x0, next_attr = 0x7ffe0a0896d8, last_attr = 0x7ffe0a0896e8, uhw_in_idx = 255 '\377', 
            uhw_out_idx = 0 '\000', uhw_in_headroom_dwords = 0 '\000', uhw_out_headroom_dwords = 2 '\002', 
            buffer_error = 0 '\000', fallback_require_ex = 0 '\000', fallback_ioctl_only = 0 '\000', hdr = {
              length = 0, object_id = 3, method_id = 0, num_attrs = 0, reserved1 = 0, driver_id = 0, reserved2 = 0, 
              attrs = 0x7ffe0a0896c8}}, {next = 0x100081001, next_attr = 0x7ffe0a089768, last_attr = 0x6e0000005b, 
            uhw_in_idx = 124 '|', uhw_out_idx = 0 '\000', uhw_in_headroom_dwords = 0 '\000', 
            uhw_out_headroom_dwords = 0 '\000', buffer_error = 1 '\001', fallback_require_ex = 1 '\001', 
            fallback_ioctl_only = 1 '\001', hdr = {length = 0, object_id = 0, method_id = 0, num_attrs = 0, 
              reserved1 = 0, driver_id = 17, reserved2 = 15, attrs = 0x7ffe0a089700}}}
        __cmdbdummy = <optimized out>
#3  0x00007f9ec85257bb in hfi1_create_cq (context=0x1074890, cqe=2, channel=0x0, comp_vector=0)
    at /tmp/guix-build-rdma-core-33.1.drv-0/rdma-core-33.1/providers/hfi1verbs/verbs.c:184
        cq = 0x1074c50
        resp = {ibv_resp = {cq_handle = 0, cqe = 0, driver_data = 0x7ffe0a089768}, offset = 0}
        ret = <optimized out>
        size = <optimized out>
#4  0x00007f9ec83fde41 in __ibv_create_cq_1_1 (context=0x1074890, cqe=<optimized out>, cq_context=0x0, channel=0x0, 
    comp_vector=<optimized out>) at /tmp/guix-build-rdma-core-33.1.drv-0/rdma-core-33.1/libibverbs/verbs.c:509
        cq = <optimized out>
#5  0x00007f9ec8426a55 in opal_common_verbs_qp_test ()
   from /gnu/store/6gcssdkn8iaiagzdv0d9mi93gppc85r4-openmpi-4.0.5/lib/libmca_common_verbs.so.40
No symbol table info available.
#6  0x00007f9ec8454e83 in btl_openib_component_init ()
   from /gnu/store/6gcssdkn8iaiagzdv0d9mi93gppc85r4-openmpi-4.0.5/lib/openmpi/mca_btl_openib.so
--8<---------------cut here---------------end--------------->8---

Version 29.2 is good and every later version is not.  This has to do
with these rdma-core changes:

--8<---------------cut here---------------start------------->8---
$ git log --oneline v26.4..v33.1 libibverbs/cmd_cq.c
317d8895 verbs: Enhance async FD usage
195c9191 verbs: Introduce verbs_cq for extended CQ operations
90a4d0cc verbs: Extend CQ KABI to get an async FD
--8<---------------cut here---------------end--------------->8---

(The first commit in the list above appeared in v30.)
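
One way to check which release first shipped a given commit, as a sketch: ask git for the tags that contain it and take the lowest version. This assumes a local rdma-core clone with its release tags fetched; `first_release_with` is a hypothetical helper name, and stable-branch tags may make the answer less clear-cut than it looks.

```shell
# Print the lowest-versioned tag that contains the given commit.
# first_release_with is a hypothetical helper, not a git subcommand.
first_release_with() {
    git tag --contains "$1" --sort=version:refname | head -n 1
}

# Example, run inside an rdma-core checkout:
#   first_release_with 317d8895
```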

I forgot to mention this happens with Omni-Path hardware:

--8<---------------cut here---------------start------------->8---
$ guix environment --ad-hoc rdma-core -- ibv_devinfo

hca_id: hfi1_0
        transport:                      InfiniBand (0)
        fw_ver:                         1.27.0
        node_guid:                      0011:7509:0107:573e
        sys_image_guid:                 0011:7509:0107:573e
        vendor_id:                      0x1175
        vendor_part_id:                 9456
        hw_ver:                         0x11
        board_id:                       Intel Omni-Path Host Fabric Interface Adapter 100 Series
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               4
                        port_lmc:               0x00
                        link_layer:             InfiniBand

--8<---------------cut here---------------end--------------->8---

Ludo’.




Reply sent to Ludovic Courtès <ludo <at> gnu.org>:
You have taken responsibility. (Mon, 01 Feb 2021 13:06:01 GMT) Full text and rfc822 format available.

Notification sent to Ludovic Courtès <ludovic.courtes <at> inria.fr>:
bug acknowledged by developer. (Mon, 01 Feb 2021 13:06:01 GMT) Full text and rfc822 format available.

Message #19 received at 46229-done <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: 46229-done <at> debbugs.gnu.org
Cc: Florent Pruvost <florent.pruvost <at> inria.fr>,
 Efraim Flashner <efraim <at> flashner.co.il>, Greg Hogan <code <at> greghogan.com>
Subject: Re: bug#46229: rdma-core 33.x breaks InfiniBand support in
 Open MPI
Date: Mon, 01 Feb 2021 14:05:25 +0100
Good news!  This is fixed by:

  https://git.savannah.gnu.org/cgit/guix.git/commit/?id=37e997bc7867901dc5eaf9060358dfddacae8dd6

Ludo’.




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 02 Mar 2021 12:24:07 GMT) Full text and rfc822 format available.
