GNU bug report logs - #59493
cuirass-remote-worker crash


Package: guix;

Reported by: Ludovic Courtès <ludovic.courtes <at> inria.fr>

Date: Tue, 22 Nov 2022 22:15:02 UTC

Severity: normal

Done: Ludovic Courtès <ludo <at> gnu.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 59493 in the body.
You can then email your comments to 59493 AT debbugs.gnu.org in the normal way.


Report forwarded to othacehe <at> gnu.org, bug-guix <at> gnu.org:
bug#59493; Package guix. (Tue, 22 Nov 2022 22:15:02 GMT)

Acknowledgement sent to Ludovic Courtès <ludovic.courtes <at> inria.fr>:
New bug report received and forwarded. Copy sent to othacehe <at> gnu.org, bug-guix <at> gnu.org. (Tue, 22 Nov 2022 22:15:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Ludovic Courtès <ludovic.courtes <at> inria.fr>
To: bug-guix <at> gnu.org
Subject: cuirass-remote-worker crash
Date: Tue, 22 Nov 2022 23:14:05 +0100

Hi,

In /var/log/cuirass-remote-worker.log on overdrive1.guix, I found this:

--8<---------------cut here---------------start------------->8---
2022-11-21 14:27:24 Backtrace:
2022-11-21 14:27:24 Backtrace:
2022-11-21 14:27:24 In ice-9/boot-9.scm:
2022-11-21 14:27:24 In ice-9/boot-9.scm:
2022-11-21 14:27:24   1752:10 10 (with-exception-handler _ _ #:unwind? _ # _)
2022-11-21 14:27:24 In unknown file:
2022-11-21 14:27:24            9 (apply-smob/0 #<thunk 3903a300>)
2022-11-21 14:27:24 In ice-9/boot-9.scm:
2022-11-21 14:27:24     724:2  8 (call-with-prompt _ _ #<procedure default-prompt-handle?>)
2022-11-21 14:27:24 In ice-9/eval.scm:
2022-11-21 14:27:24   1752:10 10 (with-exception-handler _ _ #:unwind? _ # _)
2022-11-21 14:27:24     619:8  7 (_ #(#(#<directory (guile-user) 3903dc80>)))
2022-11-21 14:27:24 In cuirass/ui.scm:
2022-11-21 14:27:24 In unknown file:
2022-11-21 14:27:24            9 (apply-smob/0 #<thunk 3903a300>)
2022-11-21 14:27:24    104:10  6 (run-cuirass-command _ . _)
2022-11-21 14:27:24 In ice-9/boot-9.scm:
2022-11-21 14:27:24 In ice-9/boot-9.scm:
2022-11-21 14:27:24     724:2  8 (call-with-prompt _ _ #<procedure default-prompt-handle?>)
2022-11-21 14:27:24   1752:10  5 (with-exception-handler _ _ #:unwind? _ # _)
2022-11-21 14:27:24 In ice-9/eval.scm:
2022-11-21 14:27:24 In cuirass/scripts/remote-worker.scm:
2022-11-21 14:27:24     619:8  7 (_ #(#(#<directory (guile-user) 3903dc80>)))
2022-11-21 14:27:24 In cuirass/ui.scm:
2022-11-21 14:27:24    104:10  6 (run-cuirass-command _ . _)
2022-11-21 14:27:24    435:12  4 (_)
2022-11-21 14:27:24 In srfi/srfi-1.scm:
2022-11-21 14:27:24 In ice-9/boot-9.scm:
2022-11-21 14:27:24   1752:10  5 (with-exception-handler _ _ #:unwind? _ # _)
2022-11-21 14:27:24     634:9  3 (for-each #<procedure 398a3510 at cuirass/scripts/remo?> ?)
2022-11-21 14:27:24 In cuirass/scripts/remote-worker.scm:
2022-11-21 14:27:24 In cuirass/scripts/remote-worker.scm:
2022-11-21 14:27:24    448:18  2 (_ _)
2022-11-21 14:27:24    435:12  4 (_)
2022-11-21 14:27:24 In srfi/srfi-1.scm:
2022-11-21 14:27:24     634:9  3 (for-each #<procedure 398a3510 at cuirass/scripts/remo?> ?)
2022-11-21 14:27:24    356:11  1 (start-worker _ _)
2022-11-21 14:27:24 In cuirass/scripts/remote-worker.scm:
2022-11-21 14:27:24 In ice-9/boot-9.scm:
2022-11-21 14:27:24    448:18  2 (_ _)
2022-11-21 14:27:24   1685:16  0 (raise-exception _ #:continuable? _)
2022-11-21 14:27:24
2022-11-21 14:27:24 ice-9/boot-9.scm:1685:16: In procedure raise-exception:
2022-11-21 14:27:24 Throw to key `match-error' with args `("match" "no matching pattern" (#vu8()))'.
2022-11-21 14:27:24    356:11  1 (start-worker _ _)
2022-11-21 14:27:24 In ice-9/boot-9.scm:
2022-11-21 14:27:24   1685:16  0 (raise-exception _ #:continuable? _)
2022-11-21 14:27:24
2022-11-21 14:27:24 ice-9/boot-9.scm:1685:16: In procedure raise-exception:
2022-11-21 14:27:24 Throw to key `match-error' with args `("match" "no matching pattern" (#vu8()))'.
--8<---------------cut here---------------end--------------->8---

(Stuttering is due to the unprotected use of ‘primitive-fork’: a
non-local exit in the child leads it to execute the same code as its
parent.  We should fix that, but should we really fork in the first
place?  :-))
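
A minimal sketch of what protecting the fork could look like, assuming plain Guile. `spawn-child` is a hypothetical helper, not code taken from Cuirass, and this is not necessarily how the problem was eventually fixed. The idea is that the child always terminates with `primitive-exit` and never unwinds into frames inherited from its parent:

```scheme
(use-modules (ice-9 match))

;; Hypothetical helper: run THUNK in a child process and return the
;; child's PID to the parent.
(define (spawn-child thunk)
  (match (primitive-fork)
    (0
     ;; Child process.
     (dynamic-wind
       (const #t)
       (lambda ()
         (catch #t
           (lambda ()
             (thunk)
             (primitive-exit 0))
           ;; Any exception raised by THUNK ends the child here instead
           ;; of reaching handlers installed by the parent.
           (lambda (key . args)
             (primitive-exit 1))))
       ;; Last line of defense against other non-local exits, such as
       ;; an abort to a prompt set up before the fork.
       (lambda ()
         (primitive-exit 127))))
    (pid pid)))
```

With this shape, a `match-error` raised in the child would make it exit with status 1 rather than print a second copy of the parent's backtrace.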

This comes from here:

--8<---------------cut here---------------start------------->8---
  (define (read-server-info socket)
    (request-info socket)
    (match (zmq-get-msg-parts-bytevector socket '())   ;<-- here
      ((empty info)
       (match (zmq-read-message (bv->string info))
         (('server-info
           ('worker-address worker-address)
           ('log-port log-port)
           ('publish-port publish-port))
          (list worker-address log-port publish-port))))))
--8<---------------cut here---------------end--------------->8---

This is the version being used:

--8<---------------cut here---------------start------------->8---
ludo <at> overdrive1 ~$ cat /proc/24019/cmdline |xargs -0
/gnu/store/zpir9n73amaxrwz2k7x46l73v21vxk6s-guile-3.0.8/bin/guile --no-auto-compile -e main -s /gnu/store/rlqdzmfyamjpn6lz07yqk2hsabv3l7g5-cuirass-1.1.0-11.9f08035/bin/.cuirass-real remote-worker --workers=2 --server=10.0.0.1:5555 --systems=armhf-linux,aarch64-linux --publish-port=5558 --substitute-urls=http://10.0.0.1
ludo <at> overdrive1 ~$ guix system describe
Generation 36   Sep 27 2022 09:06:48    (current)
  file name: /var/guix/profiles/system-36-link
  canonical file name: /gnu/store/m04qw6f0lfd0wpn1skiys4b56wqfc3b8-system
  label: GNU with Linux-Libre 5.19.11
  bootloader: grub-efi
  root device: /dev/sda3
  kernel: /gnu/store/09r4wbbabskmbrnwmshpdk7vh6g87gam-linux-libre-5.19.11/Image
  channels:
    guix:
      repository URL: https://git.savannah.gnu.org/git/guix.git
      commit: f15a141cf35bd4188767f0e91c0654991d4c49e0
  configuration file: /gnu/store/myvzd1kpw2pfzfj3krl4lzpcbqsdn48x-configuration.scm
--8<---------------cut here---------------end--------------->8---

The sequence leading to this seems to be:

--8<---------------cut here---------------start------------->8---
22340 eventfd2(0, EFD_CLOEXEC <unfinished ...>
[…]
22340 <... eventfd2 resumed>)           = 15
[…]
22340 ppoll([{fd=15, events=POLLIN}], 1, NULL, NULL, 0 <unfinished ...>
[…]
22340 <... ppoll resumed>)              = 1 ([{fd=15, revents=POLLIN}])
22343 epoll_pwait(8,  <unfinished ...>
22340 read(15, "\1\0\0\0\0\0\0\0", 8)   = 8
22340 ppoll([{fd=15, events=POLLIN}], 1, {tv_sec=0, tv_nsec=0}, NULL, 0) = 0 (Timeout)
22340 write(2, "Backtrace:\n", 11)      = 11
--8<---------------cut here---------------end--------------->8---

Does that ring a bell?  Perhaps that was fixed in the meantime?

Right now it cannot be restarted: it always fails at start up with the
error above.  10.0.0.1 is reachable though so I’m not sure what’s up.

Ludo’.




Information forwarded to bug-guix <at> gnu.org:
bug#59493; Package guix. (Wed, 23 Nov 2022 08:09:02 GMT)

Message #8 received at 59493 <at> debbugs.gnu.org:

From: Mathieu Othacehe <othacehe <at> gnu.org>
To: Ludovic Courtès <ludovic.courtes <at> inria.fr>
Cc: 59493 <at> debbugs.gnu.org
Subject: Re: bug#59493: cuirass-remote-worker crash
Date: Wed, 23 Nov 2022 09:08:32 +0100

Hello Ludo,

Thanks for gathering this information.

> 2022-11-21 14:27:24   1685:16  0 (raise-exception _ #:continuable? _)
> 2022-11-21 14:27:24
> 2022-11-21 14:27:24 ice-9/boot-9.scm:1685:16: In procedure raise-exception:
> 2022-11-21 14:27:24 Throw to key `match-error' with args `("match" "no matching pattern" (#vu8()))'.

Yes this is because a new remote-server is running on Berlin and it
sends an empty sequence at every connection:
https://git.savannah.gnu.org/cgit/guix/guix-cuirass.git/commit/?id=fc1641381d2a8a0472a71ef5ad2b64361faaaab4
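
To illustrate the failure mode, here is a small self-contained sketch that needs neither ZMQ nor Cuirass. `strict-parse` and `tolerant-parse` are hypothetical stand-ins: the first mimics the two-element pattern in `read-server-info`, the second shows one way a reader could tolerate a lone empty frame. The list `(#vu8())` is the value reported in the `match-error` above.

```scheme
(use-modules (ice-9 match)
             (rnrs bytevectors))

;; Hypothetical stand-in for the pattern in 'read-server-info': it only
;; accepts a two-part message (empty delimiter frame plus payload).
(define (strict-parse parts)
  (match parts
    ((empty info) (utf8->string info))))

;; Tolerant variant: a lone empty frame yields 'no-info instead of a
;; 'match-error'; any other shape yields 'unexpected.
(define (tolerant-parse parts)
  (match parts
    ((empty info) (utf8->string info))
    (((? bytevector? bv))
     (if (zero? (bytevector-length bv))
         'no-info
         'unexpected))
    (_ 'unexpected)))
```

Calling `(strict-parse (list #vu8()))` throws to key `match-error`, as in the log, while `(tolerant-parse (list #vu8()))` returns `no-info`. Whether an old worker should ignore such a frame or report a protocol mismatch is a design choice this sketch leaves open.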

All remote-workers must update, and I have deployed Cuirass
1.1.0-13.1341725 on all hydra workers + guix9p.

I have been trying to deploy that to overdrive1 for two days, but Berlin
offloads the builds to kreuzberg, which has some issues because a lot of
builds are timing out:

--8<---------------cut here---------------start------------->8---
\building of `/gnu/store/9jg75a8rvdz3qxcbbm95312rlc4hyi98-mrustc-0.10-2.597593a-checkout.drv' timed out after 3600 seconds of silence
build of /gnu/store/9jg75a8rvdz3qxcbbm95312rlc4hyi98-mrustc-0.10-2.597593a-checkout.drv failed
View build log at '/var/log/guix/drvs/9j/g75a8rvdz3qxcbbm95312rlc4hyi98-mrustc-0.10-2.597593a-checkout.drv.gz'.
cannot build derivation `/gnu/store/wavx7rl6h93fpmc46nggnhkyxm75lqa4-mrustc-0.10-2.597593a-checkout.drv': 1 dependencies couldn't be built
--8<---------------cut here---------------end--------------->8---

> (Stuttering is due to the unprotected use of ‘primitive-fork’: a
> non-local exit in the child leads it to execute the same code as its
> parent.  We should fix that, but should we really fork in the first
> place?  :-))

Right, this is problematic. I can't remember why I chose to fork.

In the meantime, this should be fixed by updating to 1.1.0-13.1341725 so
we can close this one I guess.

Mathieu




Information forwarded to bug-guix <at> gnu.org:
bug#59493; Package guix. (Wed, 23 Nov 2022 15:48:02 GMT)

Message #11 received at 59493 <at> debbugs.gnu.org:

From: Ludovic Courtès <ludo <at> gnu.org>
To: Mathieu Othacehe <othacehe <at> gnu.org>
Cc: 59493 <at> debbugs.gnu.org
Subject: Re: bug#59493: cuirass-remote-worker crash
Date: Wed, 23 Nov 2022 16:47:32 +0100

Hi,

Mathieu Othacehe <othacehe <at> gnu.org> skribis:

>> 2022-11-21 14:27:24   1685:16  0 (raise-exception _ #:continuable? _)
>> 2022-11-21 14:27:24
>> 2022-11-21 14:27:24 ice-9/boot-9.scm:1685:16: In procedure raise-exception:
>> 2022-11-21 14:27:24 Throw to key `match-error' with args `("match" "no matching pattern" (#vu8()))'.
>
> Yes this is because a new remote-server is running on Berlin and it
> sends an empty sequence at every connection:
> https://git.savannah.gnu.org/cgit/guix/guix-cuirass.git/commit/?id=fc1641381d2a8a0472a71ef5ad2b64361faaaab4

Oh I see.  It would be nice to avoid non-backward-compatible changes in
the protocol so we can upgrade more smoothly.

> All remote-workers must update, and I have deployed Cuirass
> 1.1.0-13.1341725 on all hydra workers + guix9p.
>
> I have been trying to deploy that to overdrive1 for two days, but Berlin
> offloads the builds to kreuzberg, which has some issues because a lot of
> builds are timing out:

Done now!

--8<---------------cut here---------------start------------->8---
ludo <at> overdrive1 ~$ guix system describe
Generation 37   Nov 23 2022 15:58:08    (current)
  file name: /var/guix/profiles/system-37-link
  canonical file name: /gnu/store/62dr875n7i30l375j87flbqfym78kddg-system
  label: GNU with Linux-Libre 6.0.9
  bootloader: grub-efi
  root device: /dev/sda3
  kernel: /gnu/store/p4impcxw8lba8600acrxs21lgzc06xzq-linux-libre-6.0.9/Image
  channels:
    guix:
      repository URL: https://git.savannah.gnu.org/git/guix.git
      commit: 78f03567f44f704dfbc03cb64368aa42a01e78ad
  configuration file: /gnu/store/myvzd1kpw2pfzfj3krl4lzpcbqsdn48x-configuration.scm
--8<---------------cut here---------------end--------------->8---

Running the Shepherd 0.9.3 and all, wonderful.

>> (Stuttering is due to the unprotected use of ‘primitive-fork’: a
>> non-local exit in the child leads it to execute the same code as its
>> parent.  We should fix that, but should we really fork in the first
>> place?  :-))

Fixed in Cuirass commit 9fb6f21d29c5398b35f4c1a77cf6c20f207c9ebb.

> Right, this is problematic. I can't remember why I chose to fork.

One concern is that, in the Avahi case, we create at least one thread
before forking, and as we know that doesn’t work (as in: it might work
sometimes).  ZMQ may also create threads behind our back.

The parent doesn’t call ‘waitpid’ on its children, which isn’t great.
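
For reference, a minimal sketch of a parent reaping its children with `waitpid`, assuming plain Guile. `spawn-children` and `reap-children` are hypothetical names, and the children here exit immediately where a real worker would run its loop:

```scheme
(use-modules (ice-9 match))

;; Hypothetical: fork one child per element of EXIT-CODES; each child
;; exits right away with the given code.  Return the list of PIDs.
(define (spawn-children exit-codes)
  (map (lambda (code)
         (match (primitive-fork)
           (0 (primitive-exit code))
           (pid pid)))
       exit-codes))

;; Wait for every child in PIDS and return an alist of
;; (PID . EXIT-VALUE), so no child is left as a zombie.
(define (reap-children pids)
  (map (lambda (pid)
         (match (waitpid pid)
           ((child . status)
            (cons child (status:exit-val status)))))
       pids))
```

For example, `(map cdr (reap-children (spawn-children '(0 3))))` collects the exit values of both children. A long-running parent would more likely reap from a SIGCHLD handler or a dedicated loop than block like this.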

To me, ideally this would be either multi-threaded or Fiberized.  The
latter would be more fruitful but what might be difficult is
guile-simple-zmq integration with Fibers (but maybe not: zmq_getsockopt
+ ZMQ_FD lets us get the file descriptor of a socket).

Something to consider…

Thanks,
Ludo’.




bug closed, send any further explanations to 59493 <at> debbugs.gnu.org and Ludovic Courtès <ludovic.courtes <at> inria.fr> Request was from Ludovic Courtès <ludo <at> gnu.org> to control <at> debbugs.gnu.org. (Wed, 23 Nov 2022 15:48:02 GMT)

Information forwarded to bug-guix <at> gnu.org:
bug#59493; Package guix. (Wed, 23 Nov 2022 16:04:01 GMT)

Message #16 received at 59493-done <at> debbugs.gnu.org:

From: Mathieu Othacehe <othacehe <at> gnu.org>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: 59493-done <at> debbugs.gnu.org
Subject: Re: bug#59493: cuirass-remote-worker crash
Date: Wed, 23 Nov 2022 17:03:32 +0100

Hey,

> Oh I see.  It would be nice to avoid non-backward-compatible changes in
> the protocol so we can upgrade more smoothly.

Right, sorry. We should introduce a protocol version to avoid that in
the future.
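
A rough sketch of what such a version check might look like, assuming messages remain s-expressions as in `read-server-info`. The `greeting` message, `%protocol-version`, `make-greeting` and `read-greeting` are all hypothetical and not part of Cuirass:

```scheme
(use-modules (ice-9 match))

;; Hypothetical protocol version number.
(define %protocol-version 2)

;; Build a hypothetical versioned greeting around PAYLOAD.
(define (make-greeting payload)
  `(greeting (version ,%protocol-version) (payload ,payload)))

;; Return the payload when the peer speaks our version.  Otherwise
;; return a descriptive value instead of raising 'match-error'.
(define (read-greeting message)
  (match message
    (('greeting ('version (? integer? version)) ('payload payload))
     (if (= version %protocol-version)
         payload
         `(version-mismatch (expected ,%protocol-version)
                            (got ,version))))
    (_ `(malformed ,message))))
```

An outdated worker receiving a newer greeting would then get a `version-mismatch` value it can log clearly, rather than crashing on an unexpected message shape.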

> Fixed in Cuirass commit 9fb6f21d29c5398b35f4c1a77cf6c20f207c9ebb.

Awesome, thanks :)

> To me, ideally this would be either multi-threaded or Fiberized.  The
> latter would be more fruitful but what might be difficult is
> guile-simple-zmq integration with Fibers (but maybe not: zmq_getsockopt
> + ZMQ_FD lets us get the file descriptor of a socket).

I would prefer the multi-threaded approach if possible. While the
concept of Fiber is nice it adds another layer of complexity and
instability to those programs which are already hard to debug.

Mathieu




Information forwarded to bug-guix <at> gnu.org:
bug#59493; Package guix. (Sat, 26 Nov 2022 15:05:01 GMT)

Message #19 received at 59493-done <at> debbugs.gnu.org:

From: Ludovic Courtès <ludo <at> gnu.org>
To: Mathieu Othacehe <othacehe <at> gnu.org>
Cc: 59493-done <at> debbugs.gnu.org
Subject: Re: bug#59493: cuirass-remote-worker crash
Date: Sat, 26 Nov 2022 16:04:20 +0100

Hi,

Mathieu Othacehe <othacehe <at> gnu.org> skribis:

>> To me, ideally this would be either multi-threaded or Fiberized.  The
>> latter would be more fruitful but what might be difficult is
>> guile-simple-zmq integration with Fibers (but maybe not: zmq_getsockopt
>> + ZMQ_FD lets us get the file descriptor of a socket).
>
> I would prefer the multi-threaded approach if possible. While the
> concept of Fiber is nice it adds another layer of complexity and
> instability to those programs which are already hard to debug.

I guess it’s not black and white.  Shared-state multithreading is an
endless source of bugs, regardless of the language being used;
message-passing (what Fibers is about) is more tractable.

Sure, Fibers can have bugs of its own (I’m well aware of that :-)) but
Fiber-using code can be simpler and less error-ridden than the
equivalent shared-state code.

Anyway, we’re not there yet.

Can you remember the rationale for forking in remote-worker.scm, or do
you think we might as well do it all in a single process?

Thanks,
Ludo’.




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 25 Dec 2022 12:24:11 GMT)

This bug report was last modified 1 year and 116 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.