GNU bug report logs - #59493: cuirass-remote-worker crash
Report forwarded to othacehe <at> gnu.org, bug-guix <at> gnu.org: bug#59493; Package guix. (Tue, 22 Nov 2022 22:15:02 GMT)
Acknowledgement sent to Ludovic Courtès <ludovic.courtes <at> inria.fr>: New bug report received and forwarded. Copy sent to othacehe <at> gnu.org, bug-guix <at> gnu.org. (Tue, 22 Nov 2022 22:15:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org:
Hi,
In /var/log/cuirass-remote-worker.log on overdrive1.guix, I found this:
--8<---------------cut here---------------start------------->8---
2022-11-21 14:27:24 Backtrace:
2022-11-21 14:27:24 Backtrace:
2022-11-21 14:27:24 In ice-9/boot-9.scm:
2022-11-21 14:27:24 In ice-9/boot-9.scm:
2022-11-21 14:27:24 1752:10 10 (with-exception-handler _ _ #:unwind? _ # _)
2022-11-21 14:27:24 In unknown file:
2022-11-21 14:27:24 9 (apply-smob/0 #<thunk 3903a300>)
2022-11-21 14:27:24 In ice-9/boot-9.scm:
2022-11-21 14:27:24 724:2 8 (call-with-prompt _ _ #<procedure default-prompt-handle?>)
2022-11-21 14:27:24 In ice-9/eval.scm:
2022-11-21 14:27:24 1752:10 10 (with-exception-handler _ _ #:unwind? _ # _)
2022-11-21 14:27:24 619:8 7 (_ #(#(#<directory (guile-user) 3903dc80>)))
2022-11-21 14:27:24 In cuirass/ui.scm:
2022-11-21 14:27:24 In unknown file:
2022-11-21 14:27:24 9 (apply-smob/0 #<thunk 3903a300>)
2022-11-21 14:27:24 104:10 6 (run-cuirass-command _ . _)
2022-11-21 14:27:24 In ice-9/boot-9.scm:
2022-11-21 14:27:24 In ice-9/boot-9.scm:
2022-11-21 14:27:24 724:2 8 (call-with-prompt _ _ #<procedure default-prompt-handle?>)
2022-11-21 14:27:24 1752:10 5 (with-exception-handler _ _ #:unwind? _ # _)
2022-11-21 14:27:24 In ice-9/eval.scm:
2022-11-21 14:27:24 In cuirass/scripts/remote-worker.scm:
2022-11-21 14:27:24 619:8 7 (_ #(#(#<directory (guile-user) 3903dc80>)))
2022-11-21 14:27:24 In cuirass/ui.scm:
2022-11-21 14:27:24 104:10 6 (run-cuirass-command _ . _)
2022-11-21 14:27:24 435:12 4 (_)
2022-11-21 14:27:24 In srfi/srfi-1.scm:
2022-11-21 14:27:24 In ice-9/boot-9.scm:
2022-11-21 14:27:24 1752:10 5 (with-exception-handler _ _ #:unwind? _ # _)
2022-11-21 14:27:24 634:9 3 (for-each #<procedure 398a3510 at cuirass/scripts/remo?> ?)
2022-11-21 14:27:24 In cuirass/scripts/remote-worker.scm:
2022-11-21 14:27:24 In cuirass/scripts/remote-worker.scm:
2022-11-21 14:27:24 448:18 2 (_ _)
2022-11-21 14:27:24 435:12 4 (_)
2022-11-21 14:27:24 In srfi/srfi-1.scm:
2022-11-21 14:27:24 634:9 3 (for-each #<procedure 398a3510 at cuirass/scripts/remo?> ?)
2022-11-21 14:27:24 356:11 1 (start-worker _ _)
2022-11-21 14:27:24 In cuirass/scripts/remote-worker.scm:
2022-11-21 14:27:24 In ice-9/boot-9.scm:
2022-11-21 14:27:24 448:18 2 (_ _)
2022-11-21 14:27:24 1685:16 0 (raise-exception _ #:continuable? _)
2022-11-21 14:27:24
2022-11-21 14:27:24 ice-9/boot-9.scm:1685:16: In procedure raise-exception:
2022-11-21 14:27:24 Throw to key `match-error' with args `("match" "no matching pattern" (#vu8()))'.
2022-11-21 14:27:24 356:11 1 (start-worker _ _)
2022-11-21 14:27:24 In ice-9/boot-9.scm:
2022-11-21 14:27:24 1685:16 0 (raise-exception _ #:continuable? _)
2022-11-21 14:27:24
2022-11-21 14:27:24 ice-9/boot-9.scm:1685:16: In procedure raise-exception:
2022-11-21 14:27:24 Throw to key `match-error' with args `("match" "no matching pattern" (#vu8()))'.
--8<---------------cut here---------------end--------------->8---
(Stuttering is due to the unprotected use of ‘primitive-fork’: a
non-local exit in the child leads it to execute the same code as its
parent. We should fix that, but should we really fork in the first
place? :-))
This comes from here:
--8<---------------cut here---------------start------------->8---
(define (read-server-info socket)
  (request-info socket)
  (match (zmq-get-msg-parts-bytevector socket '())  ;<-- here
    ((empty info)
     (match (zmq-read-message (bv->string info))
       (('server-info
         ('worker-address worker-address)
         ('log-port log-port)
         ('publish-port publish-port))
        (list worker-address log-port publish-port))))))
--8<---------------cut here---------------end--------------->8---
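To illustrate why the backtrace ends in `match-error': the outer `match' above has a single clause, `(empty info)', which only accepts a two-element list, while the logged value `(#vu8())' is a one-element list. The following is a minimal sketch, not the Cuirass code, of a matcher that reports unexpected shapes instead of throwing. The procedure name `classify-parts' and the result symbols are hypothetical; it works on a plain list of bytevectors so that it can be run without ZMQ.

```scheme
(use-modules (ice-9 match)
             (rnrs bytevectors))

;; Hypothetical helper: classify a list of message parts (bytevectors)
;; without raising 'match-error' when the shape is not the expected one.
(define (classify-parts parts)
  (match parts
    ((empty info)
     ;; The two-part shape that 'read-server-info' expects.
     (list 'server-info (utf8->string info)))
    ((single)
     ;; A lone part, such as the (#vu8()) seen in the log.
     (list 'unexpected-single-part (bytevector-length single)))
    (_
     ;; Anything else: zero parts, or three and more.
     (list 'unexpected))))

;; Two parts: an empty first part followed by a payload.
(classify-parts (list #vu8() (string->utf8 "payload")))

;; One empty part, the case that crashed the worker.
(classify-parts (list #vu8()))
```

A caller could log the `unexpected' results and retry or exit cleanly, rather than letting the exception escape from a forked child.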
This is the version being used:
--8<---------------cut here---------------start------------->8---
ludo <at> overdrive1 ~$ cat /proc/24019/cmdline |xargs -0
/gnu/store/zpir9n73amaxrwz2k7x46l73v21vxk6s-guile-3.0.8/bin/guile --no-auto-compile -e main -s /gnu/store/rlqdzmfyamjpn6lz07yqk2hsabv3l7g5-cuirass-1.1.0-11.9f08035/bin/.cuirass-real remote-worker --workers=2 --server=10.0.0.1:5555 --systems=armhf-linux,aarch64-linux --publish-port=5558 --substitute-urls=http://10.0.0.1
ludo <at> overdrive1 ~$ guix system describe
Generation 36 Sep 27 2022 09:06:48 (current)
file name: /var/guix/profiles/system-36-link
canonical file name: /gnu/store/m04qw6f0lfd0wpn1skiys4b56wqfc3b8-system
label: GNU with Linux-Libre 5.19.11
bootloader: grub-efi
root device: /dev/sda3
kernel: /gnu/store/09r4wbbabskmbrnwmshpdk7vh6g87gam-linux-libre-5.19.11/Image
channels:
guix:
repository URL: https://git.savannah.gnu.org/git/guix.git
commit: f15a141cf35bd4188767f0e91c0654991d4c49e0
configuration file: /gnu/store/myvzd1kpw2pfzfj3krl4lzpcbqsdn48x-configuration.scm
--8<---------------cut here---------------end--------------->8---
The sequence leading to this seems to be:
--8<---------------cut here---------------start------------->8---
22340 eventfd2(0, EFD_CLOEXEC <unfinished ...>
[…]
22340 <... eventfd2 resumed>) = 15
[…]
22340 ppoll([{fd=15, events=POLLIN}], 1, NULL, NULL, 0 <unfinished ...>
[…]
22340 <... ppoll resumed>) = 1 ([{fd=15, revents=POLLIN}])
22343 epoll_pwait(8, <unfinished ...>
22340 read(15, "\1\0\0\0\0\0\0\0", 8) = 8
22340 ppoll([{fd=15, events=POLLIN}], 1, {tv_sec=0, tv_nsec=0}, NULL, 0) = 0 (Timeout)
22340 write(2, "Backtrace:\n", 11) = 11
--8<---------------cut here---------------end--------------->8---
Does that ring a bell? Perhaps that was fixed in the meantime?
Right now it cannot be restarted: it always fails at start up with the
error above. 10.0.0.1 is reachable though so I’m not sure what’s up.
Ludo’.
Information forwarded to bug-guix <at> gnu.org: bug#59493; Package guix. (Wed, 23 Nov 2022 08:09:02 GMT)
Message #8 received at 59493 <at> debbugs.gnu.org:
Hello Ludo,
Thanks for gathering this information.
> 2022-11-21 14:27:24 1685:16 0 (raise-exception _ #:continuable? _)
> 2022-11-21 14:27:24
> 2022-11-21 14:27:24 ice-9/boot-9.scm:1685:16: In procedure raise-exception:
> 2022-11-21 14:27:24 Throw to key `match-error' with args `("match" "no matching pattern" (#vu8()))'.
Yes this is because a new remote-server is running on Berlin and it
sends an empty sequence at every connection:
https://git.savannah.gnu.org/cgit/guix/guix-cuirass.git/commit/?id=fc1641381d2a8a0472a71ef5ad2b64361faaaab4
All remote-workers must update, and I have deployed Cuirass
1.1.0-13.1341725 on all hydra workers + guix9p.
I have been trying to deploy that to overdrive1 for two days, but Berlin
offloads the builds to kreuzberg, which has some issues, as a lot of
builds are timing out:
--8<---------------cut here---------------start------------->8---
building of `/gnu/store/9jg75a8rvdz3qxcbbm95312rlc4hyi98-mrustc-0.10-2.597593a-checkout.drv' timed out after 3600 seconds of silence
build of /gnu/store/9jg75a8rvdz3qxcbbm95312rlc4hyi98-mrustc-0.10-2.597593a-checkout.drv failed
View build log at '/var/log/guix/drvs/9j/g75a8rvdz3qxcbbm95312rlc4hyi98-mrustc-0.10-2.597593a-checkout.drv.gz'.
cannot build derivation `/gnu/store/wavx7rl6h93fpmc46nggnhkyxm75lqa4-mrustc-0.10-2.597593a-checkout.drv': 1 dependencies couldn't be built
--8<---------------cut here---------------end--------------->8---
> (Stuttering is due to the unprotected use of ‘primitive-fork’: a
> non-local exit in the child leads it to execute the same code as its
> parent. We should fix that, but should we really fork in the first
> place? :-))
Right, this is problematic. I can't remember why I chose to fork.
In the meantime, this should be fixed by updating to 1.1.0-13.1341725 so
we can close this one I guess.
Mathieu
Information forwarded to bug-guix <at> gnu.org: bug#59493; Package guix. (Wed, 23 Nov 2022 15:48:02 GMT)
Message #11 received at 59493 <at> debbugs.gnu.org:
Hi,
Mathieu Othacehe <othacehe <at> gnu.org> skribis:
>> 2022-11-21 14:27:24 1685:16 0 (raise-exception _ #:continuable? _)
>> 2022-11-21 14:27:24
>> 2022-11-21 14:27:24 ice-9/boot-9.scm:1685:16: In procedure raise-exception:
>> 2022-11-21 14:27:24 Throw to key `match-error' with args `("match" "no matching pattern" (#vu8()))'.
>
> Yes this is because a new remote-server is running on Berlin and it
> sends an empty sequence at every connection:
> https://git.savannah.gnu.org/cgit/guix/guix-cuirass.git/commit/?id=fc1641381d2a8a0472a71ef5ad2b64361faaaab4
Oh I see. It would be nice to avoid non-backward-compatible changes in
the protocol so we can upgrade more smoothly.
> All remote-workers must update, and I have deployed Cuirass
> 1.1.0-13.1341725 on all hydra workers + guix9p.
>
> I have been trying to deploy that to overdrive1 for two days, but Berlin
> offloads the builds to kreuzberg, which has some issues, as a lot of
> builds are timing out:
Done now!
--8<---------------cut here---------------start------------->8---
ludo <at> overdrive1 ~$ guix system describe
Generation 37 Nov 23 2022 15:58:08 (current)
file name: /var/guix/profiles/system-37-link
canonical file name: /gnu/store/62dr875n7i30l375j87flbqfym78kddg-system
label: GNU with Linux-Libre 6.0.9
bootloader: grub-efi
root device: /dev/sda3
kernel: /gnu/store/p4impcxw8lba8600acrxs21lgzc06xzq-linux-libre-6.0.9/Image
channels:
guix:
repository URL: https://git.savannah.gnu.org/git/guix.git
commit: 78f03567f44f704dfbc03cb64368aa42a01e78ad
configuration file: /gnu/store/myvzd1kpw2pfzfj3krl4lzpcbqsdn48x-configuration.scm
--8<---------------cut here---------------end--------------->8---
Running the Shepherd 0.9.3 and all, wonderful.
>> (Stuttering is due to the unprotected use of ‘primitive-fork’: a
>> non-local exit in the child leads it to execute the same code as its
>> parent. We should fix that, but should we really fork in the first
>> place? :-))
Fixed in Cuirass commit 9fb6f21d29c5398b35f4c1a77cf6c20f207c9ebb.
> Right, this is problematic. I can't remember why I chose to fork.
One concern is that, in the Avahi case, we create at least one thread
before forking, and as we know that doesn’t work (as in: it might work
sometimes). ZMQ may also create threads behind our back.
The parent doesn’t call ‘waitpid’ on its children, which isn’t great.
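As an illustration of the two points above (the unprotected `primitive-fork' and the missing `waitpid'), here is a minimal sketch in plain Guile. It is not the code from the Cuirass commit mentioned earlier; the names `fork-and-run' and `wait-for-child' are hypothetical. The idea is that the child always leaves through `primitive-_exit', so a non-local exit cannot make it continue running the parent's code, and that the parent reaps the child.

```scheme
(use-modules (ice-9 match))

;; Hypothetical helper: fork, run THUNK in the child, and make sure the
;; child never returns into the code of its parent.
(define (fork-and-run thunk)
  (match (primitive-fork)
    (0
     ;; Child process.
     (dynamic-wind
       (const #t)
       (lambda ()
         (thunk)
         (primitive-_exit 0))           ;normal completion
       (lambda ()
         ;; Reached only on a non-local exit, for example an uncaught
         ;; exception: terminate instead of unwinding any further.
         (primitive-_exit 1))))
    (pid
     ;; Parent process: return the PID of the child.
     pid)))

;; Hypothetical helper: reap the child and return its exit value.
;; 'waitpid' returns a (PID . STATUS) pair.
(define (wait-for-child pid)
  (match (waitpid pid)
    ((_ . status) (status:exit-val status))))

;; Usage: run a trivial thunk in a child and collect its exit value.
(wait-for-child (fork-and-run (lambda () #t)))
```

This does nothing about the separate concern of forking after threads have been created; it only addresses the "stuttering" and the unreaped children.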
To me, ideally this would be either multi-threaded or Fiberized. The
latter would be more fruitful but what might be difficult is
guile-simple-zmq integration with Fibers (but maybe not: zmq_getsockopt
+ ZMQ_FD lets us get the file descriptor of a socket).
Something to consider…
Thanks,
Ludo’.
bug closed, send any further explanations to 59493 <at> debbugs.gnu.org and Ludovic Courtès <ludovic.courtes <at> inria.fr>. Request was from Ludovic Courtès <ludo <at> gnu.org> to control <at> debbugs.gnu.org. (Wed, 23 Nov 2022 15:48:02 GMT)
Information forwarded to bug-guix <at> gnu.org: bug#59493; Package guix. (Wed, 23 Nov 2022 16:04:01 GMT)
Message #16 received at 59493-done <at> debbugs.gnu.org:
Hey,
> Oh I see. It would be nice to avoid non-backward-compatible changes in
> the protocol so we can upgrade more smoothly.
Right, sorry. We should introduce a protocol version to avoid that in
the future.
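One possible shape for such a check, as a rough sketch only: the greeting format, the `hello' and `protocol-version' symbols, and the version number are all hypothetical and not part of the current Cuirass protocol. The point is that a worker could detect a mismatch and report it, instead of failing on an unexpected message.

```scheme
(use-modules (ice-9 match))

;; Hypothetical protocol version; nothing in Cuirass defines this yet.
(define %protocol-version 1)

;; Return 'ok when SEXP is a greeting carrying our protocol version,
;; a (version-mismatch THEIRS OURS) list when the versions differ, and
;; a (malformed SEXP) list for anything else.
(define (check-greeting sexp)
  (match sexp
    (('hello ('protocol-version (? integer? version)))
     (if (= version %protocol-version)
         'ok
         (list 'version-mismatch version %protocol-version)))
    (other
     (list 'malformed other))))

;; A peer speaking the same version, then one speaking a newer version.
(check-greeting '(hello (protocol-version 1)))
(check-greeting '(hello (protocol-version 2)))
```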
> Fixed in Cuirass commit 9fb6f21d29c5398b35f4c1a77cf6c20f207c9ebb.
Awesome, thanks :)
> To me, ideally this would be either multi-threaded or Fiberized. The
> latter would be more fruitful but what might be difficult is
> guile-simple-zmq integration with Fibers (but maybe not: zmq_getsockopt
> + ZMQ_FD lets us get the file descriptor of a socket).
I would prefer the multi-threaded approach if possible. While the
concept of Fiber is nice it adds another layer of complexity and
instability to those programs which are already hard to debug.
Mathieu
Information forwarded to bug-guix <at> gnu.org: bug#59493; Package guix. (Sat, 26 Nov 2022 15:05:01 GMT)
Message #19 received at 59493-done <at> debbugs.gnu.org:
Hi,
Mathieu Othacehe <othacehe <at> gnu.org> skribis:
>> To me, ideally this would be either multi-threaded or Fiberized. The
>> latter would be more fruitful but what might be difficult is
>> guile-simple-zmq integration with Fibers (but maybe not: zmq_getsockopt
>> + ZMQ_FD lets us get the file descriptor of a socket).
>
> I would prefer the multi-threaded approach if possible. While the
> concept of Fiber is nice it adds another layer of complexity and
> instability to those programs which are already hard to debug.
I guess it’s not black and white. Shared-state multithreading is an
endless source of bugs, regardless of the language being used;
message-passing (what Fibers is about) is more tractable.
Sure, Fibers can have bugs of its own (I’m well aware of that :-)), but
Fiber-using code can be simpler and less error-ridden than the
equivalent shared-state code.
Anyway, we’re not there yet.
Can you remember the rationale for forking in remote-worker.scm, or do
you think we might as well do it all in a single process?
Thanks,
Ludo’.
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 25 Dec 2022 12:24:11 GMT)
This bug report was last modified 1 year and 116 days ago.