GNU bug report logs -
#53463
ci.guix.gnu.org not building the 'guix' job
Reported by: Leo Famulari <leo <at> famulari.name>
Date: Sun, 23 Jan 2022 00:57:01 UTC
Severity: important
Done: Mathieu Othacehe <othacehe <at> gnu.org>
Bug is archived. No further changes may be made.
Report forwarded to bug-guix <at> gnu.org: bug#53463; Package guix. (Sun, 23 Jan 2022 00:57:01 GMT)
Acknowledgement sent to Leo Famulari <leo <at> famulari.name>: New bug report received and forwarded. Copy sent to bug-guix <at> gnu.org. (Sun, 23 Jan 2022 00:57:01 GMT)
Message #5 received at submit <at> debbugs.gnu.org:
As far as I can tell, ci.guix.gnu.org stopped building the 'guix'
job a couple of days ago:
https://ci.guix.gnu.org/jobset/guix
Information forwarded to bug-guix <at> gnu.org: bug#53463; Package guix. (Sun, 23 Jan 2022 23:01:02 GMT)
Message #8 received at 53463 <at> debbugs.gnu.org:
Also, the 'master' job hasn't been run in ~2 days:
https://ci.guix.gnu.org/jobset/master
I think the build farm is waiting to finish collecting garbage.
Information forwarded to bug-guix <at> gnu.org: bug#53463; Package guix. (Thu, 27 Jan 2022 22:14:01 GMT)
Message #11 received at 53463 <at> debbugs.gnu.org:
On Sun, Jan 23, 2022 at 06:00:40PM -0500, Leo Famulari wrote:
> Also, the 'master' job hasn't been run in ~2 days:
>
> https://ci.guix.gnu.org/jobset/master
>
> I think the build farm is waiting to finish collecting garbage.
Unfortunately, the 'master' jobset is broken again, and the 'guix'
jobset is still broken.
Added indication that bug 53463 blocks 53214. Request was from Leo Famulari <leo <at> famulari.name> to control <at> debbugs.gnu.org. (Sat, 29 Jan 2022 21:13:02 GMT)
Severity set to 'important' from 'normal'. Request was from Maxim Cournoyer <maxim.cournoyer <at> gmail.com> to control <at> debbugs.gnu.org. (Tue, 01 Feb 2022 15:19:02 GMT)
Information forwarded to bug-guix <at> gnu.org: bug#53463; Package guix. (Wed, 02 Feb 2022 18:42:01 GMT)
Message #18 received at 53463 <at> debbugs.gnu.org:
Hello,
The issue here seems to be that the evaluations of the 'guix' jobset are
never finishing, even when the GC is not running.
I ran strace on one of the stuck evaluation processes; it repeatedly
returns:
--8<---------------cut here---------------start------------->8---
[pid 36294] read(227, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 65536) = 96
[pid 36294] write(239, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 96) = 96
[pid 36294] rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
[pid 36294] pselect6(240, [40 227 239], [], [], NULL, NULL) = 1 (in [227])
[pid 36294] read(227, "gmlo\0\0\0\0G\0\0\0\0\0\0\0process 40190 acquired build slot '/var/guix/offload/localhost:2224/0'\n\0", 65536) = 88
[pid 36294] write(239, "gmlo\0\0\0\0G\0\0\0\0\0\0\0process 40190 acquired build slot '/var/guix/offload/localhost:2224/0'\n\0", 88) = 88
[pid 36294] rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
[pid 36294] pselect6(240, [40 227 239], [], [], NULL, NULL) = 1 (in [227])
[pid 36294] read(227, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 65536) = 96
[pid 36294] write(239, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 96) = 96
[pid 36294] rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
[pid 36294] pselect6(240, [40 227 239], [], [], NULL, NULL) = 1 (in [227])
[pid 36294] read(227, "gmlo\0\0\0\0G\0\0\0\0\0\0\0process 40190 acquired build slot '/var/guix/offload/localhost:2224/0'\n\0", 65536) = 88
[pid 36294] write(239, "gmlo\0\0\0\0G\0\0\0\0\0\0\0process 40190 acquired build slot '/var/guix/offload/localhost:2224/0'\n\0", 88) = 88
[pid 36294] rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
[pid 36294] pselect6(240, [40 227 239], [], [], NULL, NULL) = 1 (in [227])
[pid 36294] read(227, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 65536) = 96
[pid 36294] write(239, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 96) = 96
--8<---------------cut here---------------end--------------->8---
To be continued,
Thanks,
Mathieu
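A long capture like the one above is easier to read once the relayed messages are counted. The following is only a sketch: the function name is made up for illustration, and the sed pattern assumes each message sits between the last `\0` of the header and a literal `\n`, as in the excerpt. Only the read() calls are considered, since each write() forwards the same bytes.

```shell
#!/bin/sh
# summarize_relayed_messages (hypothetical helper, not part of Guix or Cuirass)
# Reads strace output on stdin and prints "<count> <message>" for each
# distinct message seen in a read() call.
summarize_relayed_messages() {
    # The pattern keeps the text between the last literal \0 of the
    # header and the literal \n that ends the message.
    sed -n '/ read(/s/.*\\0\([^\\]*\)\\n.*/\1/p' \
        | sort \
        | uniq -c \
        | sed 's/^ *//'
}
```

Usage, assuming the trace was saved to a file named eval.strace (a placeholder): `summarize_relayed_messages < eval.strace`. Applied to the excerpt above, it lists the "Connection refused" error and the "acquired build slot" notice with their counts.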
Information forwarded to bug-guix <at> gnu.org: bug#53463; Package guix. (Fri, 04 Feb 2022 08:59:02 GMT)
Message #21 received at 53463 <at> debbugs.gnu.org:
Hello!
Mathieu Othacehe <othacehe <at> gnu.org> skribis:
> I tried to strace one of the stuck evaluation process, it returns
> repeatedly:
>
> [pid 36294] read(227, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 65536) = 96
> [pid 36294] write(239, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 96) = 96
> [pid 36294] rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
> [pid 36294] pselect6(240, [40 227 239], [], [], NULL, NULL) = 1 (in [227])
> [pid 36294] read(227, "gmlo\0\0\0\0G\0\0\0\0\0\0\0process 40190 acquired build slot '/var/guix/offload/localhost:2224/0'\n\0", 65536) = 88
> [pid 36294] write(239, "gmlo\0\0\0\0G\0\0\0\0\0\0\0process 40190 acquired build slot '/var/guix/offload/localhost:2224/0'\n\0", 88) = 88
> [pid 36294] rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
> [pid 36294] pselect6(240, [40 227 239], [], [], NULL, NULL) = 1 (in [227])
> [pid 36294] read(227, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 65536) = 96
> [pid 36294] write(239, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 96) = 96
> [pid 36294] rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
> [pid 36294] pselect6(240, [40 227 239], [], [], NULL, NULL) = 1 (in [227])
> [pid 36294] read(227, "gmlo\0\0\0\0G\0\0\0\0\0\0\0process 40190 acquired build slot '/var/guix/offload/localhost:2224/0'\n\0", 65536) = 88
> [pid 36294] write(239, "gmlo\0\0\0\0G\0\0\0\0\0\0\0process 40190 acquired build slot '/var/guix/offload/localhost:2224/0'\n\0", 88) = 88
> [pid 36294] rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
> [pid 36294] pselect6(240, [40 227 239], [], [], NULL, NULL) = 1 (in [227])
> [pid 36294] read(227, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 65536) = 96
> [pid 36294] write(239, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 96) = 96
Oh! That indicates that it’s failing to offload to one of the
‘localhost’ build machines specified in /etc/guix/machines.scm.
Normally there’s an SSH tunnel set up for those, but I guess it broke.
Perhaps we can update /etc/guix/machines.scm to refer to armhf-linux
machines by their WireGuard IP?
Thanks,
Ludo’.
Information forwarded to bug-guix <at> gnu.org: bug#53463; Package guix. (Fri, 04 Feb 2022 09:55:02 GMT)
Message #24 received at 53463 <at> debbugs.gnu.org:
Hey,
> Oh! That indicates that it’s failing to offload to one of the
> ‘localhost’ build machines specified in /etc/guix/machines.scm.
> Normally there’s an SSH tunnel set up for those, but I guess it broke.
>
> Perhaps we can update /etc/guix/machines.scm to refer to armhf-linux
> machines by their WireGuard IP?
Seems like the right thing to do. This bit is also an unstaged change in
the berlin maintenance repository; we should commit it. Tobias, could
you have a look? :)
--8<---------------cut here---------------start------------->8---
+(define powerpc64le
+ (list
+ ;; A VM donated/hosted by OSUOSL & administered by nckx.
+ ;; XXX: SSH tunnel via overdrive1:
+ ;; ssh -L 2224:p9.tobias.gr:22 hydra <at> 10.0.0.3
+ #;(build-machine
+ ;;(name "p9.tobias.gr")
+ (name "localhost")
+ (port 2224)
+ (user "hydra")
+ (systems '("powerpc64le-linux"))
+ (host-key "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIJEbRxJ6WqnNLYEMNDUKFcdMtyZ9V/6oEfBFSHY8xE6A nckx"))))
--8<---------------cut here---------------end--------------->8---
I also found that other machines were unreachable and commented them out:
--8<---------------cut here---------------start------------->8---
;; CPU: 16 ARM Cortex-A72 cores
;; RAM: 32 GB
- (list (build-machine
+ (list #;(build-machine
;;kreuzberg
(name "10.0.0.9")
(user "hydra")
@@ -243,13 +256,13 @@
;; BeagleBoard X15 kindly hosted by Simon Josefsson.
;; CPU: Cortex A15 (2 cores)
;; RAM: 2 GB
- (build-machine
+ #;(build-machine
(name "10.0.0.5") ;guix-x15
(user "hydra")
(systems '("armhf-linux"))
(host-key "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOfXjwCAFWeGiUoOVXEgtIeXxbtymjOTg7ph1ObMAcJ0 root <at> beaglebone"))
- (build-machine
+ #;(build-machine
(name "10.0.0.6") ;guix-x15b
(user "hydra")
(systems '("armhf-linux"))
--8<---------------cut here---------------end--------------->8---
Nevertheless, we are hitting an offload issue here, maybe an occurrence
of #24496. The offload mechanism should time out when a machine is
unreachable instead of retrying over and over, which causes all
evaluation processes to hang.
Thanks,
Mathieu
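The behaviour asked for here, giving up after a bounded number of attempts instead of retrying forever, can be sketched in shell. This is an illustration of the idea only, not how `guix offload` is implemented; the function name and the attempt and timeout values are assumptions. It relies on the `timeout` command from GNU coreutils.

```shell
#!/bin/sh
# try_with_deadline ATTEMPTS SECONDS COMMAND...
# (hypothetical helper) Run COMMAND at most ATTEMPTS times, allowing
# SECONDS per attempt. Return 0 on the first success, 1 once all
# attempts are used up.
try_with_deadline() {
    attempts=$1
    per_try=$2
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        # timeout(1) kills COMMAND if it runs longer than $per_try seconds.
        if timeout "$per_try" "$@"; then
            return 0
        fi
        n=$((n + 1))
    done
    echo "giving up after $attempts attempts: $*" >&2
    return 1
}
```

For the tunnelled machine above, a probe might look like `try_with_deadline 3 5 ssh -p 2224 hydra@localhost true`: after three refused or timed-out connections the caller gets a non-zero status and can skip the machine rather than hang.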
Information forwarded to bug-guix <at> gnu.org: bug#53463; Package guix. (Tue, 08 Feb 2022 10:23:01 GMT)
Message #27 received at 53463 <at> debbugs.gnu.org:
Hi,
Mathieu Othacehe <othacehe <at> gnu.org> skribis:
>> Oh! That indicates that it’s failing to offload to one of the
>> ‘localhost’ build machines specified in /etc/guix/machines.scm.
>> Normally there’s an SSH tunnel set up for those, but I guess it broke.
>>
>> Perhaps we can update /etc/guix/machines.scm to refer to armhf-linux
>> machines by their WireGuard IP?
>
> Seems like the right thing to do. This bit is also an unstaged change in
> the berlin maintenance repository, we should commit it. Tobias, could
> you have a look :) ?
>
> +(define powerpc64le
> + (list
> + ;; A VM donated/hosted by OSUOSL & administered by nckx.
> + ;; XXX: SSH tunnel via overdrive1:
> + ;; ssh -L 2224:p9.tobias.gr:22 hydra <at> 10.0.0.3
> + #;(build-machine
> + ;;(name "p9.tobias.gr")
> + (name "localhost")
> + (port 2224)
> + (user "hydra")
> + (systems '("powerpc64le-linux"))
> + (host-key "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIJEbRxJ6WqnNLYEMNDUKFcdMtyZ9V/6oEfBFSHY8xE6A nckx"))))
IIRC this machine is now running WireGuard, Tobias? If so, could you
change this to refer to its WireGuard IP and commit it?
> I also found that other machines were unreachable and commented them:
>
> ;; CPU: 16 ARM Cortex-A72 cores
> ;; RAM: 32 GB
> - (list (build-machine
> + (list #;(build-machine
> ;;kreuzberg
> (name "10.0.0.9")
> (user "hydra")
Ricardo, could you check what’s wrong with kreuzberg?
> @@ -243,13 +256,13 @@
> ;; BeagleBoard X15 kindly hosted by Simon Josefsson.
> ;; CPU: Cortex A15 (2 cores)
> ;; RAM: 2 GB
> - (build-machine
> + #;(build-machine
> (name "10.0.0.5") ;guix-x15
> (user "hydra")
> (systems '("armhf-linux"))
> (host-key "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOfXjwCAFWeGiUoOVXEgtIeXxbtymjOTg7ph1ObMAcJ0 root <at> beaglebone"))
>
> - (build-machine
> + #;(build-machine
> (name "10.0.0.6") ;guix-x15b
> (user "hydra")
> (systems '("armhf-linux"))
Oops.
Note that it’s not necessary to comment them all out. As long as at
least one machine is available for a given system type, we’re fine:
‘guix offload’ will pick it up.
> Nevertheless we are hitting an offload issue here, maybe an occurrence
> of #24496. The offload mechanism should timeout when a machine is
> unreachable instead of retrying over and over, causing all evaluation
> processes to hang.
Yes, though the problem here is that some architectures were left with
zero machines IIRC, so it would have failed one way or another.
Thanks!
Ludo’.
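The condition described above (a system type left with zero usable machines) is easy to check mechanically before commenting entries out. A minimal sketch, assuming a made-up input format of one "SYSTEM HOST STATUS" line per machine, where STATUS is "up" or "down"; neither this format nor the function name comes from Guix.

```shell
#!/bin/sh
# systems_without_machines (hypothetical helper)
# Reads "SYSTEM HOST STATUS" lines on stdin and prints, sorted, every
# SYSTEM for which no machine has STATUS "up".
systems_without_machines() {
    awk '
        { seen[$1] = 1 }              # every system type that is listed
        $3 == "up" { up[$1]++ }       # system types with a reachable machine
        END {
            for (s in seen)
                if (!(s in up))
                    print s
        }
    ' | sort
}
```

Usage, with machine-status.txt as a placeholder file name: `systems_without_machines < machine-status.txt`. Any system type it prints has no machine left to offload to.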
Information forwarded to bug-guix <at> gnu.org: bug#53463; Package guix. (Tue, 08 Feb 2022 12:56:01 GMT)
Message #30 received at 53463 <at> debbugs.gnu.org:
Ludovic Courtès <ludo <at> gnu.org> writes:
> Hi,
>
> Mathieu Othacehe <othacehe <at> gnu.org> skribis:
>
>>> Oh! That indicates that it’s failing to offload to one of the
>>> ‘localhost’ build machines specified in /etc/guix/machines.scm.
>>> Normally there’s an SSH tunnel set up for those, but I guess it broke.
>>>
>>> Perhaps we can update /etc/guix/machines.scm to refer to armhf-linux
>>> machines by their WireGuard IP?
>>
>> Seems like the right thing to do. This bit is also an unstaged change in
>> the berlin maintenance repository, we should commit it. Tobias, could
>> you have a look :) ?
>>
>> +(define powerpc64le
>> + (list
>> + ;; A VM donated/hosted by OSUOSL & administered by nckx.
>> + ;; XXX: SSH tunnel via overdrive1:
>> + ;; ssh -L 2224:p9.tobias.gr:22 hydra <at> 10.0.0.3
>> + #;(build-machine
>> + ;;(name "p9.tobias.gr")
>> + (name "localhost")
>> + (port 2224)
>> + (user "hydra")
>> + (systems '("powerpc64le-linux"))
>> + (host-key "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIJEbRxJ6WqnNLYEMNDUKFcdMtyZ9V/6oEfBFSHY8xE6A nckx"))))
>
> IIRC this machine is now running WireGuard, Tobias? If so, could you
> change this to refer to its WireGuard IP and commit it?
>
>> I also found that other machines were unreachable and commented them:
>>
>> ;; CPU: 16 ARM Cortex-A72 cores
>> ;; RAM: 32 GB
>> - (list (build-machine
>> + (list #;(build-machine
>> ;;kreuzberg
>> (name "10.0.0.9")
>> (user "hydra")
>
> Ricardo, could you check what’s wrong with kreuzberg?
Oh, the usual…
--8<---------------cut here---------------start------------->8---
root <at> kreuzberg ~# guix shell wireguard-tools -- wg
interface: wg0
public key: f9WGJTXp8bozJb0KxePjkOclF5pJUy1AomHWJHy80y4=
private key: (hidden)
listening port: 51820
peer: wOIfhHqQ+JQmskRS2qSvNRgZGh33UxFDi8uuSXOltF0=
endpoint: 141.80.181.40:51820
allowed ips: 10.0.0.1/32
latest handshake: 2 days, 2 hours, 11 minutes, 13 seconds ago
transfer: 292.79 MiB received, 6.05 GiB sent
--8<---------------cut here---------------end--------------->8---
Whenever the build farm is awfully quiet (e.g. because of GC), the
WireGuard connection times out. I usually restart the
cuirass-remote-worker and everything’s fine again.
Today I got some additional SD cards for these machines, so I’m going to
reconfigure them (locally, because of the “guix deploy” bug) and then
move them to the data centre. Once reconfigured, they will keep the
WireGuard connection alive all by themselves, so no manual intervention
will be necessary.
I didn’t reconfigure them locally because I hoped we would be able to
make time for the “guix deploy” bug, but things turned out differently.
--
Ricardo
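A stale handshake like the one shown above can be detected without reading the `wg` output by eye. The sketch below filters the output of `wg show wg0 latest-handshakes`, which prints one "<public-key><TAB><unix-timestamp>" line per peer. The function name and the 180-second threshold in the usage line are assumptions for illustration.

```shell
#!/bin/sh
# stale_peers NOW MAX_AGE
# (hypothetical helper) Reads "<public-key><TAB><unix-timestamp>" lines
# on stdin and prints the public key of each peer whose last handshake
# is older than MAX_AGE seconds, or that never completed a handshake
# (timestamp 0).
stale_peers() {
    awk -v now="$1" -v max="$2" '$2 == 0 || now - $2 > max { print $1 }'
}
```

Usage: `wg show wg0 latest-handshakes | stale_peers "$(date +%s)" 180`. A non-empty result would be a cue to restart the worker, as described above, before the build farm notices the machine is gone.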
Information forwarded to bug-guix <at> gnu.org: bug#53463; Package guix. (Mon, 21 Mar 2022 08:39:01 GMT)
Message #33 received at 53463 <at> debbugs.gnu.org:
Hi there!
Looks like this bug is solved: the ‘guix’ jobset is getting built.
However, evaluations are marked as “failed”, even though their build log
shows they succeeded, and if you click on one of them, you see that all
its builds are there:
https://ci.guix.gnu.org/eval/168652
https://ci.guix.gnu.org/eval/168652/log/raw
https://ci.guix.gnu.org/jobset/guix?border-high=169749
Any idea what could be wrong?
Thanks,
Ludo’.
Information forwarded to bug-guix <at> gnu.org: bug#53463; Package guix. (Mon, 21 Mar 2022 08:56:01 GMT)
Message #36 received at 53463 <at> debbugs.gnu.org:
Hey Ludo,
> However, evaluations are marked as “failed”, even though their build log
> shows they succeeded, and if you click on one of them, you see that all
> its builds are there:
>
> https://ci.guix.gnu.org/eval/168652
> https://ci.guix.gnu.org/eval/168652/log/raw
> https://ci.guix.gnu.org/jobset/guix?border-high=169749
This started at the time we enabled the armhf architecture, so I guess
it is marked as failed because the guix specification could not be
evaluated for this architecture.
Thanks,
Mathieu
Reply sent to Mathieu Othacehe <othacehe <at> gnu.org>: You have taken responsibility. (Tue, 16 Aug 2022 07:58:01 GMT)
Notification sent to Leo Famulari <leo <at> famulari.name>: bug acknowledged by developer. (Tue, 16 Aug 2022 07:58:02 GMT)
Message #41 received at 53463-done <at> debbugs.gnu.org:
Hello,
> https://ci.guix.gnu.org/jobset/guix
It is now fixed for the following architectures: x86_64-linux,
i686-linux and aarch64-linux. I'll try to repair it for
powerpc64le-linux soon.
We can close this one I guess.
Thanks,
Mathieu
Bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 13 Sep 2022 11:24:07 GMT)