Package: guix;
Reported by: Mark H Weaver <mhw <at> netris.org>
Date: Sun, 7 Apr 2019 16:44:02 UTC
Severity: normal
Merged with 34157
Done: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 35181 in the body.
You can then email your comments to 35181 AT debbugs.gnu.org in the normal way.
bug-guix <at> gnu.org
:bug#35181
; Package guix
.
(Sun, 07 Apr 2019 16:44:02 GMT) Full text and rfc822 format available.Mark H Weaver <mhw <at> netris.org>
:bug-guix <at> gnu.org
.
(Sun, 07 Apr 2019 16:44:02 GMT) Full text and rfc822 format available.Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
From: Mark H Weaver <mhw <at> netris.org>
To: bug-guix <at> gnu.org
Subject: Hydra offloads often get stuck while exporting build requisites
Date: Sun, 07 Apr 2019 12:41:38 -0400

It has become extremely frequent for builds offloaded by hydra.gnu.org
to its x86 build slave hydra.gnunet.org to get stuck indefinitely while
exporting prerequisites for the build to the build slave.

As I write this, both of hydra.gnunet.org's build slots (one for
x86_64-linux, and one for i686-linux) are stuck in this way. Here are
the stuck builds:

  https://hydra.gnu.org/build/3432052
  https://hydra.gnu.org/build/3432472

and here are the tails of their nix build logs:

--8<---------------cut here---------------start------------->8---
performing build 3432052
these derivations will be built:
  /gnu/store/k27i3lkb38gr3mw0mridymhik3qsg6w7-font-fira-sans-4.202.drv
process 14769 acquired build slot '/var/guix/offload/hydra.gnunet.org/1'
load on machine 'hydra.gnunet.org' is 1.54 (normalized: 0.385)
sending 1 store item to 'hydra.gnunet.org'...
exporting path `/gnu/store/gzd2cisahj50nff16p8ji813p683p5r4-font-fira-sans-4.202-checkout'
--8<---------------cut here---------------end--------------->8---

--8<---------------cut here---------------start------------->8---
performing build 3432472
these derivations will be built:
  /gnu/store/5ivay4l7bn0sqsi7k53j4qv3kndrby17-font-google-material-design-icons-3.0.1.drv
process 8985 acquired build slot '/var/guix/offload/hydra.gnunet.org/0'
load on machine 'hydra.gnunet.org' is 1.98 (normalized: 0.99)
sending 1 store item to 'hydra.gnunet.org'...
exporting path `/gnu/store/kaj5xgnz04l4mzgj05sc5v6j7cvpbrrd-font-google-material-design-icons-3.0.1-checkout'
--8<---------------cut here---------------end--------------->8---

The first of these builds has been stuck for about 10.25 hours, and the
second has been stuck for 11 hours. In the recent past, I've discovered
them stuck for over 24 hours.

      Mark
(Sun, 07 Apr 2019 16:48:02 GMT) Full text and rfc822 format available.Message #8 received at 35181 <at> debbugs.gnu.org (full text, mbox):
From: Mark H Weaver <mhw <at> netris.org>
To: 35181 <at> debbugs.gnu.org
Cc: Ludovic Courtès <ludo <at> gnu.org>
Subject: Re: Hydra offloads often get stuck while exporting build requisites
Date: Sun, 07 Apr 2019 12:45:57 -0400

I wrote earlier:

> It has become extremely frequent for builds offloaded by hydra.gnu.org
> to its x86 build slave hydra.gnunet.org to get stuck indefinitely while
> exporting prerequisites for the build to the build slave.
>
> As I write this, both of hydra.gnunet.org's build slots (one for
> x86_64-linux, and one for i686-linux) are stuck in this way. Here are
> the stuck builds:
>
>   https://hydra.gnu.org/build/3432052
>   https://hydra.gnu.org/build/3432472

I'll leave these builds in their stuck state for at least the next 10
hours or so, in case someone wants to investigate.

      Mark
(Sun, 07 Apr 2019 17:32:02 GMT) Full text and rfc822 format available.Message #11 received at 35181 <at> debbugs.gnu.org (full text, mbox):
From: Efraim Flashner <efraim <at> flashner.co.il>
To: Mark H Weaver <mhw <at> netris.org>
Cc: 35181 <at> debbugs.gnu.org
Subject: Re: bug#35181: Hydra offloads often get stuck while exporting build requisites
Date: Sun, 7 Apr 2019 20:31:05 +0300

On Sun, Apr 07, 2019 at 12:45:57PM -0400, Mark H Weaver wrote:
> I wrote earlier:
>
> > It has become extremely frequent for builds offloaded by hydra.gnu.org
> > to its x86 build slave hydra.gnunet.org to get stuck indefinitely while
> > exporting prerequisites for the build to the build slave.
> >
> > As I write this, both of hydra.gnunet.org's build slots (one for
> > x86_64-linux, and one for i686-linux) are stuck in this way. Here are
> > the stuck builds:
> >
> >   https://hydra.gnu.org/build/3432052
> >   https://hydra.gnu.org/build/3432472
>
> I'll leave these builds in their stuck state for at least the next 10
> hours or so, in case someone wants to investigate.
>

For these two specifically it's possible that they are just really
really big.

--
Efraim Flashner <efraim <at> flashner.co.il> אפרים פלשנר
GPG key = A28B F40C 3E55 1372 662D 14F7 41AA E7DC CA3D 8351
Confidentiality cannot be guaranteed on emails sent or received unencrypted
(Mon, 08 Apr 2019 06:31:02 GMT) Full text and rfc822 format available.Message #14 received at 35181 <at> debbugs.gnu.org (full text, mbox):
From: Mark H Weaver <mhw <at> netris.org>
To: Efraim Flashner <efraim <at> flashner.co.il>
Cc: Ludovic Courtès <ludo <at> gnu.org>, 35181 <at> debbugs.gnu.org
Subject: Re: bug#35181: Hydra offloads often get stuck while exporting build requisites
Date: Mon, 08 Apr 2019 02:28:41 -0400

Hi Efraim,

Efraim Flashner <efraim <at> flashner.co.il> writes:

> For these two specifically it's possible that they are just really
> really big.

The source checkout currently being transferred for build 3432472
(/gnu/store/…-font-google-material-design-icons-3.0.1-checkout) is 176
megabytes uncompressed, as measured by "du -s --si", which is not
precisely the same as NAR size, but hopefully close enough for a rough
estimate. As I write this, build 3432472 has been stuck here for 24
hours 15 minutes. Even if the average transfer rate were 4 kilobytes per
second, it should have been done in half that time.

I should also note that this is the *10th* attempt for build 3432472. I
didn't realize until now, but I've manually aborted this same build 9
times before now, before filing this bug report. On its fourth attempt,
it ran for just over *48* hours before I aborted it.

Here's a full list of how long each of the build attempts has run:

  Nr  Duration        Machine  Status
  --------------------------------------------------
   1  20h 23m 27s              Aborted (log, raw, tail)
   2  5m 28s                   Aborted (log, raw, tail)
   3  1d 9h 54m 42s            Aborted (log, raw, tail)
   4  2d 0h 1m 17s             Aborted (log, raw, tail)
   5  8h 12m 4s                Aborted (log, raw, tail)
   6  1d 21h 37m 24s           Aborted (log, raw, tail)
   7  8h 26m 1s                Aborted (log, raw, tail)
   8  11h 21m 6s               Aborted (log, raw, tail)
   9  4h 38m 10s               Aborted (log, raw, tail)
  10  1d 0h 22m 19s            Building (log, raw, tail)
  --------------------------------------------------
  Source: <https://hydra.gnu.org/build/3432472#tabs-buildsteps>

The other build (3432052) has a very similar story. Its source checkout
is slightly larger at 287 megabytes, and it's currently on its 8th
attempt, which has been running for 23 hours 27 minutes as I write this.

  <https://hydra.gnu.org/build/3432052#tabs-buildsteps>

  Nr  Duration        Machine  Status
  --------------------------------------------------
   1  1d 4h 32m 24s            Aborted (log, raw, tail)
   2  2d 0h 1m 17s             Aborted (log, raw, tail)
   3  8h 12m 5s                Aborted (log, raw, tail)
   4  1d 10h 11m 48s           Aborted (log, raw, tail)
   5  8h 26m 11s               Aborted (log, raw, tail)
   6  10h 49m 44s              Aborted (log, raw, tail)
   7  4h 38m 20s               Aborted (log, raw, tail)
   8  22h 43m 9s               Building (log, raw, tail)
  --------------------------------------------------

So far, these two builds alone have consumed a total of 15 days of
Hydra's build slot time. One of the builds is on x86_64-linux and the
other on i686-linux.

As I recall, a similar thing happened with the mozjs-60 builds, although
I didn't look closely.

      Mark
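The two rough figures in the message above (about 15 days of build slot time, and a transfer that should finish in roughly half of 24 hours at 4 kilobytes per second) can be checked with a little arithmetic. The following Python sketch is not part of the original report: the helper names are hypothetical, the duration lists are copied from the two tables above, and SI units (1 MB = 10^6 bytes, 1 kB = 10^3 bytes) are assumed because the size was measured with "du -s --si".

```python
import re

# Seconds per unit in Hydra's duration strings, e.g. "1d 9h 54m 42s".
UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def parse_duration(text):
    """Convert a Hydra-style duration string to a number of seconds."""
    return sum(int(n) * UNIT_SECONDS[u]
               for n, u in re.findall(r"(\d+)([dhms])", text))


def total_days(durations):
    """Sum a list of duration strings and express the result in days."""
    return sum(parse_duration(d) for d in durations) / 86400


def transfer_hours(size_bytes, rate_bytes_per_second):
    """Time needed to move size_bytes at a constant rate, in hours."""
    return size_bytes / rate_bytes_per_second / 3600


# Durations copied from the build step tables for the two stuck builds.
BUILD_3432472 = ["20h 23m 27s", "5m 28s", "1d 9h 54m 42s", "2d 0h 1m 17s",
                 "8h 12m 4s", "1d 21h 37m 24s", "8h 26m 1s", "11h 21m 6s",
                 "4h 38m 10s", "1d 0h 22m 19s"]
BUILD_3432052 = ["1d 4h 32m 24s", "2d 0h 1m 17s", "8h 12m 5s",
                 "1d 10h 11m 48s", "8h 26m 11s", "10h 49m 44s",
                 "4h 38m 20s", "22h 43m 9s"]

if __name__ == "__main__":
    days = total_days(BUILD_3432472) + total_days(BUILD_3432052)
    print(f"build slot time consumed: {days:.1f} days")
    # 176 MB checkout at 4 kB/s, both in SI units (an assumption).
    print(f"transfer time at 4 kB/s: {transfer_hours(176e6, 4e3):.1f} hours")
```

Under these assumptions the total comes to a little over 15 days and the transfer estimate to a little over 12 hours, which agrees with the rough numbers given in the message.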
(Mon, 08 Apr 2019 07:16:01 GMT) Full text and rfc822 format available.Message #17 received at 35181 <at> debbugs.gnu.org (full text, mbox):
From: Mark H Weaver <mhw <at> netris.org>
To: Efraim Flashner <efraim <at> flashner.co.il>
Cc: Ludovic Courtès <ludo <at> gnu.org>, 35181 <at> debbugs.gnu.org
Subject: Re: bug#35181: Hydra offloads often get stuck while exporting build requisites
Date: Mon, 08 Apr 2019 03:13:39 -0400

The same jobs that are consistently getting stuck offloading to
hydra.gnunet.org (Hydra's only functional x86 build slave at present)
built successfully on armhf, with build times of 1-2 hours.

With only one x86 build slave and all armhf build slaves on the same
network and running the same ancient version of guix (circa 0.12.0),
there are several possibilities: (1) The problem might be specific to
hydra.gnunet.org or its network, or (2) it might depend on the version
of guix running on the build slave, or (3) it might depend on the
architecture of the build slave, or (4) something else?

Also, these same jobs built without incident back in early December.

      Mark
(Mon, 08 Apr 2019 08:20:01 GMT) Full text and rfc822 format available.Message #20 received at 35181 <at> debbugs.gnu.org (full text, mbox):
From: Ludovic Courtès <ludo <at> gnu.org>
To: Mark H Weaver <mhw <at> netris.org>
Cc: 35181 <at> debbugs.gnu.org, Efraim Flashner <efraim <at> flashner.co.il>
Subject: Re: bug#35181: Hydra offloads often get stuck while exporting build requisites
Date: Mon, 08 Apr 2019 10:19:18 +0200

Hi Mark,

Mark H Weaver <mhw <at> netris.org> skribis:

> The source checkout currently being transferred for build 3432472
> (/gnu/store/…-font-google-material-design-icons-3.0.1-checkout) is 176
> megabytes uncompressed, as measured by "du -s --si", which is not
> precisely the same as NAR size, but hopefully close enough for a rough
> estimate. As I write this, build 3432472 has been stuck here for 24
> hours 15 minutes. Even if the average transfer rate were 4 kilobytes per
> second, it should have been done in half that time.

This is weird, could it be that data transfers get stuck somehow?

Did you try to check the status of the ‘nix-store’ and ‘guix offload’
processes on the head node?

I just checked and apparently this package builds fine on berlin.

Thanks for investigating,
Ludo’.
(Mon, 08 Apr 2019 19:43:01 GMT) Full text and rfc822 format available.Message #23 received at 35181 <at> debbugs.gnu.org (full text, mbox):
From: Mark H Weaver <mhw <at> netris.org>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: 35181 <at> debbugs.gnu.org
Subject: Re: bug#35181: Hydra offloads often get stuck while exporting build requisites
Date: Mon, 08 Apr 2019 15:40:51 -0400

Hi Ludovic,

Ludovic Courtès <ludo <at> gnu.org> writes:

> Mark H Weaver <mhw <at> netris.org> skribis:
>
>> The source checkout currently being transferred for build 3432472
>> (/gnu/store/…-font-google-material-design-icons-3.0.1-checkout) is 176
>> megabytes uncompressed, as measured by "du -s --si", which is not
>> precisely the same as NAR size, but hopefully close enough for a rough
>> estimate. As I write this, build 3432472 has been stuck here for 24
>> hours 15 minutes. Even if the average transfer rate were 4 kilobytes per
>> second, it should have been done in half that time.
>
> This is weird, could it be that data transfers get stuck somehow?

As far as I can tell, that's what seems to happen.

> Did you try to check the status of the ‘nix-store’ and ‘guix offload’
> processes on the head node?

Here are the corresponding 'guix offload' processes:

--8<---------------cut here---------------start------------->8---
hydra@20121227-hydra:~$ ps auxwwf | head -1; ps auxwwf | egrep -B1 'off()load'
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      8984  0.0  0.0  30784  2248 ?        Ss   Apr07   0:00  |   \_ /root/.guix-profile/bin/guix-daemon 8983 --max-jobs=1 --build-users-group=guixbuild --disable-log-compression --gc-keep-outputs --gc-keep-derivations --no-substitutes --cache-failures
root      8985  0.0  0.2 145532 13976 ?        SLsl Apr07   0:10  |   |   \_ /gnu/store/yihvhxv3xyyvl1m2cy1lnf1lyi9h76fk-guile-2.2.2/bin/guile --no-auto-compile /gnu/store/fkkjhida23k612naa9d4q6avqj5v3b28-guix-0.13.0-8.357ab93/bin/.guix-real offload x86_64-linux 3600 1 72000
--
root     14768  0.0  0.0  30752  2356 ?        Ss   Apr07   0:00  |   \_ /root/.guix-profile/bin/guix-daemon 14767 --max-jobs=1 --build-users-group=guixbuild --disable-log-compression --gc-keep-outputs --gc-keep-derivations --no-substitutes --cache-failures
root     14769  0.0  0.2 145668 10912 ?        SLsl Apr07   0:16  |   |   \_ /gnu/store/yihvhxv3xyyvl1m2cy1lnf1lyi9h76fk-guile-2.2.2/bin/guile --no-auto-compile /gnu/store/fkkjhida23k612naa9d4q6avqj5v3b28-guix-0.13.0-8.357ab93/bin/.guix-real offload x86_64-linux 3600 1 72000
--8<---------------cut here---------------end--------------->8---

I tried attaching to both of these offload processes with 'strace', and
waited for several minutes for any system call activity. Both of them
are stuck sleeping within a system call, although I don't yet know which
system call:

--8<---------------cut here---------------start------------->8---
root@20121227-hydra:~# strace -p 8985
Process 8985 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>^C <unfinished ...>
Process 8985 detached
root@20121227-hydra:~# strace -p 14769
Process 14769 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>^C <unfinished ...>
Process 14769 detached
--8<---------------cut here---------------end--------------->8---

Here are the 'nix-store' processes:

--8<---------------cut here---------------start------------->8---
hydra@20121227-hydra:~$ ps auxwwf | head -1; ps auxwwf | egrep -A1 'hydra-()build'
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
hydra     8980  0.0  0.9 187332 46656 pts/5    S    Apr07   0:01  |   |   \_ /usr/local/bin/perl -w /usr/local/bin/hydra-build 3432472
hydra     8983  0.0  0.0  34228   464 pts/5    S    Apr07   0:00  |   |   |   \_ nix-store --realise /gnu/store/5ivay4l7bn0sqsi7k53j4qv3kndrby17-font-google-material-design-icons-3.0.1.drv --timeout 72000 --max-silent-time 3600 --option build-max-log-size 67108864 --keep-going --fallback --no-build-output --log-type flat --print-build-trace --add-root /nix/var/nix/gcroots/per-user/hydra/hydra-roots/y61f3cdhx31msdhkdw0kfs5pb75ycgfq-font-google-material-design-icons-3.0.1
hydra    14764  0.0  0.9 187336 46576 pts/5    S    Apr07   0:01  |   |   \_ /usr/local/bin/perl -w /usr/local/bin/hydra-build 3432052
hydra    14767  0.0  0.0  34228   352 pts/5    S    Apr07   0:00  |   |       \_ nix-store --realise /gnu/store/k27i3lkb38gr3mw0mridymhik3qsg6w7-font-fira-sans-4.202.drv --timeout 72000 --max-silent-time 3600 --option build-max-log-size 67108864 --keep-going --fallback --no-build-output --log-type flat --print-build-trace --add-root /nix/var/nix/gcroots/per-user/hydra/hydra-roots/28ncfjmplcwyzas2p3d4cy5xlzacjcnj-font-fira-sans-4.202
--8<---------------cut here---------------end--------------->8---

The 'nix-store' processes seem to be stuck sleeping in 'read', if I'm
interpreting the 'strace' output correctly:

--8<---------------cut here---------------start------------->8---
root@20121227-hydra:~# strace -p 8983
Process 8983 attached - interrupt to quit
read(3, ^C <unfinished ...>
Process 8983 detached
root@20121227-hydra:~# strace -p 14767
Process 14767 attached - interrupt to quit
read(3, ^C <unfinished ...>
Process 14767 detached
--8<---------------cut here---------------end--------------->8---

"netstat --inet --program" shows that the SSH connections are still
open:

--8<---------------cut here---------------start------------->8---
root@20121227-hydra:~# netstat --inet --program | grep 'hydra\.net\.in\.tum\.'
tcp        0      0 20121227-hydra.gn:53216 hydra.net.in.tum.de:ssh ESTABLISHED 14769/guile
tcp        0      0 20121227-hydra.gn:52434 hydra.net.in.tum.de:ssh ESTABLISHED 8985/guile
tcp        0      0 20121227-hydra.gnu.:www hydra.net.in.tum.:52104 TIME_WAIT   -
tcp        0      0 20121227-hydra.gnu.:www hydra.net.in.tum.:52103 TIME_WAIT   -
--8<---------------cut here---------------end--------------->8---

> I just checked and apparently this package builds fine on berlin.

Also, these same jobs (e.g. same versions) have been built successfully
on hydra.gnunet.org for years without difficulty. In the case of
'font-fira-sans-4.202.x86_64-linux', it has only ever been built on
hydra.gnunet.org, with the last successful build on 28 September 2018.

The packages haven't been updated in years, and so typically are only
rebuilt during 'core-updates' cycles. They only started aborting in late
March, when some other rarely-updated package apparently changed to
force a rebuild. However, the similar 'mozjs-60' failures happened
earlier.

FYI, here's the history of build attempts on Hydra:

--8<---------------cut here---------------start------------->8---
hydra=> select case when s.machine~'^(hydra|guix)\.' then s.machine
                    else substring(s.machine from '^[^.]*') end as machine,
               jobset, s.build, s.stepnr as step,
               case when s.busy=1 then 'busy'
                    when s.status=0 then NULL
                    when s.status=1 then 'fail'
                    when s.status=4 then 'abort'
                    when s.status=7 then 'timeout'
                    when s.status=8 then 'cfail'
                    else '?' end as stat,
               regexp_replace(substr(s.drvpath,1+strpos(s.drvpath,'-')),'\.drv$','') as what,
               date_trunc('day', to_timestamp(s.stoptime)) as finished
          from builds b, buildsteps s
         where b.id=s.build and b.job='font-fira-sans-4.202.x86_64-linux'
      order by s.stoptime;
     machine      |    jobset    |  build  | step | stat  |         what         |        finished
------------------+--------------+---------+------+-------+----------------------+------------------------
 hydra.gnunet.org | master       | 2362639 |    1 |       | font-fira-sans-4.202 | 2017-11-29 00:00:00+00
 hydra.gnunet.org | core-updates | 2407845 |    1 |       | font-fira-sans-4.202 | 2018-01-02 00:00:00+00
 hydra.gnunet.org | core-updates | 2674686 |    1 |       | font-fira-sans-4.202 | 2018-05-19 00:00:00+00
 hydra.gnunet.org | core-updates | 3075254 |    1 |       | font-fira-sans-4.202 | 2018-09-28 00:00:00+00
                  | master       | 3432052 |    1 | abort | font-fira-sans-4.202 | 2019-03-31 00:00:00+00
                  | master       | 3432052 |    2 | abort | font-fira-sans-4.202 | 2019-04-02 00:00:00+00
                  | master       | 3432052 |    3 | abort | font-fira-sans-4.202 | 2019-04-03 00:00:00+00
                  | master       | 3432052 |    4 | abort | font-fira-sans-4.202 | 2019-04-05 00:00:00+00
                  | master       | 3432052 |    5 | abort | font-fira-sans-4.202 | 2019-04-05 00:00:00+00
                  | master       | 3432052 |    6 | abort | font-fira-sans-4.202 | 2019-04-06 00:00:00+00
                  | master       | 3432052 |    7 | abort | font-fira-sans-4.202 | 2019-04-06 00:00:00+00
                  | master       | 3432052 |    8 | busy  | font-fira-sans-4.202 |
(12 rows)

hydra=> select case when s.machine~'^(hydra|guix)\.' then s.machine
                    else substring(s.machine from '^[^.]*') end as machine,
               jobset, s.build, s.stepnr as step,
               case when s.busy=1 then 'busy'
                    when s.status=0 then NULL
                    when s.status=1 then 'fail'
                    when s.status=4 then 'abort'
                    when s.status=7 then 'timeout'
                    when s.status=8 then 'cfail'
                    else '?' end as stat,
               regexp_replace(substr(s.drvpath,1+strpos(s.drvpath,'-')),'\.drv$','') as what,
               date_trunc('day', to_timestamp(s.stoptime)) as finished
          from builds b, buildsteps s
         where b.id=s.build and b.job='font-google-material-design-icons-3.0.1.i686-linux'
      order by s.stoptime;
     machine      |    jobset    |  build  | step | stat  |                      what                      |        finished
------------------+--------------+---------+------+-------+------------------------------------------------+------------------------
 chapters         | master       | 1834047 |    1 |       | font-google-material-design-icons-3.0.1        | 2017-02-13 00:00:00+00
 hydra.gnunet.org | core-updates | 1889434 |    1 |       | font-google-material-design-icons-3.0.1        | 2017-03-12 00:00:00+00
                  | master       | 2030520 |    2 | cfail | glibc-intermediate-2.25                        | 2017-04-30 00:00:00+00
                  | master       | 2030520 |    1 | cfail | glibc-intermediate-2.25                        | 2017-04-30 00:00:00+00
                  | master       | 2030520 |    3 | cfail | glibc-intermediate-2.25                        | 2017-04-30 00:00:00+00
 guix.sjd.se      | master       | 2035120 |    1 |       | font-google-material-design-icons-3.0.1        | 2017-05-04 00:00:00+00
 guix.sjd.se      | core-updates | 2111787 |    1 |       | font-google-material-design-icons-3.0.1        | 2017-06-25 00:00:00+00
 hydra.gnunet.org | master       | 2128849 |    1 |       | font-google-material-design-icons-3.0.1        | 2017-06-26 00:00:00+00
 guix.sjd.se      | core-updates | 2175161 |    1 |       | font-google-material-design-icons-3.0.1        | 2017-07-20 00:00:00+00
                  | master       | 2334641 |    1 |       | font-google-material-design-icons-3.0.1.tar.gz | 2017-10-23 00:00:00+00
 hydra.gnunet.org | master       | 2334641 |    2 |       | font-google-material-design-icons-3.0.1        | 2017-10-23 00:00:00+00
                  | core-updates | 2406391 |    1 |       | module-import                                  | 2018-01-02 00:00:00+00
                  | core-updates | 2406391 |    2 |       | module-import-compiled                         | 2018-01-02 00:00:00+00
 guix.sjd.se      | core-updates | 2406391 |    3 |       | font-google-material-design-icons-3.0.1        | 2018-01-02 00:00:00+00
 guix.sjd.se      | core-updates | 2667328 |    1 |       | font-google-material-design-icons-3.0.1        | 2018-05-18 00:00:00+00
 guix.sjd.se      | core-updates | 3073906 |    1 |       | font-google-material-design-icons-3.0.1        | 2018-09-25 00:00:00+00
                  | master       | 3432472 |    1 | abort | font-google-material-design-icons-3.0.1        | 2019-03-29 00:00:00+00
                  | master       | 3432472 |    2 | abort | font-google-material-design-icons-3.0.1        | 2019-03-30 00:00:00+00
                  | master       | 3432472 |    3 | abort | font-google-material-design-icons-3.0.1        | 2019-03-31 00:00:00+00
                  | master       | 3432472 |    4 | abort | font-google-material-design-icons-3.0.1        | 2019-04-02 00:00:00+00
                  | master       | 3432472 |    5 | abort | font-google-material-design-icons-3.0.1        | 2019-04-03 00:00:00+00
                  | master       | 3432472 |    6 | abort | font-google-material-design-icons-3.0.1        | 2019-04-05 00:00:00+00
                  | master       | 3432472 |    7 | abort | font-google-material-design-icons-3.0.1        | 2019-04-05 00:00:00+00
                  | master       | 3432472 |    8 | abort | font-google-material-design-icons-3.0.1        | 2019-04-06 00:00:00+00
                  | master       | 3432472 |    9 | abort | font-google-material-design-icons-3.0.1        | 2019-04-06 00:00:00+00
                  | master       | 3432472 |   10 | busy  | font-google-material-design-icons-3.0.1        |
(26 rows)
--8<---------------cut here---------------end--------------->8---

It seems that Hydra fails to record the machine name in build steps that
are aborted, but I know that all of the 'aborts' above are on
hydra.gnunet.org, because that's currently the only x86 build slave on
Hydra.

I could easily believe that this problem is specific to
hydra.gnunet.org, but even if that's the case, it would be good if
offloading would reliably time out before days have passed.

Ideally, the timeout would be a "max-silent-time" kind of timeout, so
that we don't impose an arbitrary limitation on total transfer time as
long as progress is being made, and so that the timeout can be
relatively short.

However, even a "total-transfer-time" kind of timeout would be welcome
at this point, to stop the profuse bleeding of Hydra's limited x86 build
capacity.

What do you think?

      Mark
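To illustrate the difference between the two kinds of timeout discussed above, here is a minimal Python sketch of a "max-silent-time" style limit on a data transfer. It is only an illustration under stated assumptions and not how 'guix offload' is implemented (Guix is written in Scheme): the names `SilentTimeout` and `copy_with_silent_timeout` are hypothetical, and plain pipes stand in for the SSH channel. The deadline is restarted every time data arrives, so a slow transfer that keeps making progress is never cut off, while a stalled one is aborted after a short wait.

```python
import os
import select


class SilentTimeout(Exception):
    """Raised when no data arrives within the allowed silent period."""


def copy_with_silent_timeout(read_fd, write_fd, max_silent_time,
                             chunk_size=65536):
    """Copy bytes from read_fd to write_fd until EOF.

    max_silent_time (seconds) bounds the time between two chunks, not the
    total transfer time.  Returns the number of bytes copied.
    """
    total = 0
    while True:
        # Wait until data is readable, but no longer than the silent limit.
        ready, _, _ = select.select([read_fd], [], [], max_silent_time)
        if not ready:
            raise SilentTimeout(
                f"no data for {max_silent_time} s after {total} bytes")
        chunk = os.read(read_fd, chunk_size)
        if not chunk:
            return total  # EOF: the sender closed its end
        # os.write may accept fewer bytes than offered, so loop until done.
        offset = 0
        while offset < len(chunk):
            offset += os.write(write_fd, chunk[offset:])
        total += len(chunk)


if __name__ == "__main__":
    src_r, src_w = os.pipe()
    dst_r, dst_w = os.pipe()
    os.write(src_w, b"store item data")
    # The sender now goes silent without closing src_w, like a stalled peer.
    try:
        copy_with_silent_timeout(src_r, dst_w, max_silent_time=0.5)
    except SilentTimeout as err:
        print("aborted:", err)
```

A "total-transfer-time" limit would instead compute one deadline before the loop and never reset it. Note that this sketch only guards the reading side; a stall while writing to the peer would need the same treatment on `write_fd`.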
(Tue, 09 Apr 2019 01:08:01 GMT) Full text and rfc822 format available.Message #26 received at 35181 <at> debbugs.gnu.org (full text, mbox):
From: Mark H Weaver <mhw <at> netris.org>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: 35181 <at> debbugs.gnu.org
Subject: Re: bug#35181: Hydra offloads often get stuck while exporting build requisites
Date: Mon, 08 Apr 2019 21:06:04 -0400

merge 35181 34157
thanks

I looked more closely at the 'mozjs-60' failures, and I'm convinced that
it's an instance of the same problem that's currently affecting these
large font builds.

Mozjs-60 was pushed to the master branch on 2019-01-18. It has _never_
successfully built on x86_64 or i686, although all builds were
successful on armhf. See below for the complete list of build attempts
of mozjs-60 on Hydra.

Also of note: So far, all known instances of this problem have occurred
while transferring a large directory, as opposed to a tarball.

We have several packages with source tarballs _much_ larger than these
problematic source checkouts, and which are updated much more
frequently, and yet I've *never* seen an instance of this problem while
exporting a plain file to a build slave. For example, the upstream
IceCat and Firefox ESR tarballs are ~270 megabytes compressed, whereas
'font-google-material-design-icons-3.0.1' source is only ~176 megabytes
_uncompressed_.

I have no explanation for why the superficial form of the store item
should matter here, but maybe it's a clue. I know that plain
non-executable files in the store are handled somewhat differently in
the Nix model than directories or executable files, the latter
associated with the word "recursive", and requiring an additional layer
of encoding for purposes of serialization, but I'm not sufficiently
familiar with the details or relevant code.

Ludovic, can you think of a reason why the file/directory distinction
could be relevant to this issue?

Finally: the problem seems to have been introduced into Hydra sometime
between September 2018 and January 2019. September 2018 is when the last
successful build of the problematic font packages was performed, and
January 2019 is the first known instance of the problem. I do not
currently know of any relevant data points in that time range. The last
'core-updates' merge into 'master' was on December 3rd.

      Mark

PS: Here's the complete history of 'mozjs-60' build attempts on Hydra:
    First are the 'armhf' attempts, followed by i686, and x86_64. Note
    that the two armhf aborts happened after only 2 seconds, and surely
    had a different cause than this issue.

--8<---------------cut here---------------start------------->8---
hydra=> select case when s.machine~'^(hydra|guix)\.' then s.machine
                    else substring(s.machine from '^[^.]*') end as machine,
               jobset, s.build, s.stepnr as step,
               case when s.busy=1 then 'busy'
                    when s.status=0 then NULL
                    when s.status=1 then 'fail'
                    when s.status=4 then 'abort'
                    when s.status=7 then 'timeout'
                    when s.status=8 then 'cfail'
                    else '?' end as stat,
               regexp_replace(substr(s.drvpath,1+strpos(s.drvpath,'-')),'\.drv$','') as what,
               date_trunc('second', to_timestamp(s.stoptime)) as finished,
               date_trunc('second', to_timestamp(s.stoptime) - to_timestamp(s.starttime)) as duration
          from builds b, buildsteps s
         where b.id=s.build and b.job='mozjs-60.2.3-2.armhf-linux'
      order by s.stoptime;
   machine    | jobset |  build  | step | stat  |          what           |        finished        | duration
--------------+--------+---------+------+-------+-------------------------+------------------------+----------
 hydra-slave2 | master | 3342804 |    1 |       | mozjs-60.2.3-2-checkout | 2019-01-19 12:58:52+00 | 00:23:55
 hydra-slave2 | master | 3342804 |    2 |       | mozjs-60.2.3-2          | 2019-01-19 15:49:37+00 | 02:50:42
              | master | 3367975 |    1 | abort | mozjs-60.2.3-2          | 2019-02-13 00:03:58+00 | 00:00:02
              | master | 3367975 |    2 | abort | mozjs-60.2.3-2          | 2019-02-15 15:35:45+00 | 00:00:02
 hydra-slave3 | master | 3367975 |    3 |       | mozjs-60.2.3-2          | 2019-02-18 16:38:08+00 | 02:46:42
(5 rows)

hydra=> select case when s.machine~'^(hydra|guix)\.' then s.machine
                    else substring(s.machine from '^[^.]*') end as machine,
               jobset, s.build, s.stepnr as step,
               case when s.busy=1 then 'busy'
                    when s.status=0 then NULL
                    when s.status=1 then 'fail'
                    when s.status=4 then 'abort'
                    when s.status=7 then 'timeout'
                    when s.status=8 then 'cfail'
                    else '?' end as stat,
               regexp_replace(substr(s.drvpath,1+strpos(s.drvpath,'-')),'\.drv$','') as what,
               date_trunc('second', to_timestamp(s.stoptime)) as finished,
               date_trunc('second', to_timestamp(s.stoptime) - to_timestamp(s.starttime)) as duration
          from builds b, buildsteps s
         where b.id=s.build and b.job='mozjs-60.2.3-2.i686-linux'
      order by s.stoptime;
 machine | jobset |  build  | step | stat  |      what      |        finished        |    duration
---------+--------+---------+------+-------+----------------+------------------------+-----------------
         | master | 3343511 |    1 | abort | mozjs-60.2.3-2 | 2019-01-20 20:05:16+00 | 12:11:12
         | master | 3343511 |    2 | abort | mozjs-60.2.3-2 | 2019-01-23 01:52:01+00 | 2 days 05:42:55
         | master | 3360985 |    1 | abort | mozjs-60.2.3-2 | 2019-02-15 19:59:42+00 | 09:31:25
         | master | 3360985 |    2 | abort | mozjs-60.2.3-2 | 2019-02-16 17:37:06+00 | 05:57:15
         | master | 3360985 |    3 | abort | mozjs-60.2.3-2 | 2019-02-17 17:39:49+00 | 16:06:14
         | master | 3360985 |    4 | abort | mozjs-60.2.3-2 | 2019-03-03 21:50:48+00 | 00:02:19
(6 rows)

hydra=> select case when s.machine~'^(hydra|guix)\.' then s.machine
                    else substring(s.machine from '^[^.]*') end as machine,
               jobset, s.build, s.stepnr as step,
               case when s.busy=1 then 'busy'
                    when s.status=0 then NULL
                    when s.status=1 then 'fail'
                    when s.status=4 then 'abort'
                    when s.status=7 then 'timeout'
                    when s.status=8 then 'cfail'
                    else '?' end as stat,
               regexp_replace(substr(s.drvpath,1+strpos(s.drvpath,'-')),'\.drv$','') as what,
               date_trunc('second', to_timestamp(s.stoptime)) as finished,
               date_trunc('second', to_timestamp(s.stoptime) - to_timestamp(s.starttime)) as duration
          from builds b, buildsteps s
         where b.id=s.build and b.job='mozjs-60.2.3-2.x86_64-linux'
      order by s.stoptime;
 machine | jobset |  build  | step | stat  |      what      |        finished        |    duration
---------+--------+---------+------+-------+----------------+------------------------+-----------------
         | master | 3342528 |    1 | abort | mozjs-60.2.3-2 | 2019-01-20 20:04:50+00 | 22:25:28
         | master | 3342528 |    2 | abort | mozjs-60.2.3-2 | 2019-01-23 01:51:48+00 | 2 days 05:19:35
         | master | 3366691 |    1 | abort | mozjs-60.2.3-2 | 2019-02-17 17:39:59+00 | 09:21:24
(3 rows)
--8<---------------cut here---------------end--------------->8---
Mark H Weaver <mhw <at> netris.org>
to control <at> debbugs.gnu.org
.
(Tue, 09 Apr 2019 01:08:02 GMT) Full text and rfc822 format available.
(Tue, 09 Apr 2019 10:55:02 GMT) Full text and rfc822 format available.Message #31 received at 35181 <at> debbugs.gnu.org (full text, mbox):
From: Ludovic Courtès <ludo <at> gnu.org> To: Mark H Weaver <mhw <at> netris.org> Cc: 35181 <at> debbugs.gnu.org Subject: Re: bug#35181: Hydra offloads often get stuck while exporting build requisites Date: Tue, 09 Apr 2019 12:54:20 +0200
Hi Mark, Mark H Weaver <mhw <at> netris.org> skribis: > Ludovic Courtès <ludo <at> gnu.org> writes: > >> Mark H Weaver <mhw <at> netris.org> skribis: >> >>> The source checkout currently being transferred for build 3432472 >>> (/gnu/store/…-font-google-material-design-icons-3.0.1-checkout) is 176 >>> megabytes uncompressed, as measured by "du -s --si", which is not >>> precisely same as NAR size, but hopefully close enough for a rough >>> estimate. As I write this, build 3432472 been stuck here for 24 hours >>> 15 minutes. Even if the average transfer rate were 4 kilobytes per >>> second, it should have been done in half that time. >> >> This is weird, could it be that data transfers get stuck somehow? > > As far as I can tell, that's what seems to happen. > >> Did you try to check the status of the ‘nix-store’ and ‘guix offload’ >> processes on the head node? > > Here are the corresponding 'guix offload' processes: > > hydra <at> 20121227-hydra:~$ ps auxwwf | head -1; ps auxwwf | egrep -B1 'off()load' [...] > root 14769 0.0 0.2 145668 10912 ? SLsl Apr07 0:16 | | \_ /gnu/store/yihvhxv3xyyvl1m2cy1lnf1lyi9h76fk-guile-2.2.2/bin/guile --no-auto-compile /gnu/store/fkkjhida23k612naa9d4q6avqj5v3b28-guix-0.13.0-8.357ab93/bin/.guix-real offload x86_64-linux 3600 1 72000 The problem is that this is an ancient Guix. In the meantime, offloading has seen relevant changes, in particular things like commit ed7b44370f71126087eb953f36aad8dc4c44109f which address stability issues with Guile-SSH (ssh dist node) that was previously used. I think we should upgrade Guix on hydra.gnu.org otherwise we’re likely to end up chasing old bugs. 
> The 'nix-store' processes seem to be stuck sleeping in 'read', if I'm
> interpreting the 'strace' output correctly:
>
> root <at> 20121227-hydra:~# strace -p 8983
> Process 8983 attached - interrupt to quit
> read(3, ^C <unfinished ...>
> Process 8983 detached
> root <at> 20121227-hydra:~# strace -p 14767
> Process 14767 attached - interrupt to quit
> read(3, ^C <unfinished ...>
> Process 14767 detached
>
>
> "netstat --inet --program" shows that the SSH connections are still
> open:
>
> root <at> 20121227-hydra:~# netstat --inet --program | grep 'hydra\.net\.in\.tum\.'
> tcp 0 0 20121227-hydra.gn:53216 hydra.net.in.tum.de:ssh ESTABLISHED 14769/guile
> tcp 0 0 20121227-hydra.gn:52434 hydra.net.in.tum.de:ssh ESTABLISHED 8985/guile
> tcp 0 0 20121227-hydra.gnu.:www hydra.net.in.tum.:52104 TIME_WAIT -
> tcp 0 0 20121227-hydra.gnu.:www hydra.net.in.tum.:52103 TIME_WAIT -

This could be the kind of issue that we had with (ssh dist node). It’s
hard to tell.

> I could easily believe that this problem is specific to
> hydra.gnunet.org, but even if that's the case, it would be good if
> offloading would reliably time out before days have passed.

That’s the case with commit a708de151c255712071e42e5c8284756b51768cd,
but again, the Guix installation on hydra may predate that. :-/

Thanks,
Ludo’.
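The symptom described above (a process sleeping in 'read' on a connection that is still ESTABLISHED) is what a blocking read with no timeout looks like when the peer simply stops sending. The following is a minimal, generic Python sketch of bounding such a read with a timeout. It is not how Guix or Guile-SSH implement their timeouts; `read_with_timeout` is a hypothetical helper, and a local socket pair stands in for the SSH connection.

```python
import socket


def read_with_timeout(sock, nbytes, timeout_s):
    """Return up to nbytes from sock, or None if nothing arrives within timeout_s."""
    sock.settimeout(timeout_s)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        # The connection is still open, but the peer sent nothing in time.
        return None


# A connected socket pair stands in for the SSH connection (an assumption
# made only so that the example is self-contained).
local_end, remote_end = socket.socketpair()

print(read_with_timeout(local_end, 4096, 0.2))  # None: the peer is open but silent
remote_end.sendall(b"data")
print(read_with_timeout(local_end, 4096, 0.2))  # b'data'

local_end.close()
remote_end.close()
```

Without the `settimeout` call, the first `recv` would block indefinitely, which is consistent with the 'strace' output quoted above.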
(Tue, 09 Apr 2019 10:57:01 GMT) Message #34 received at 35181 <at> debbugs.gnu.org:

From: Ludovic Courtès <ludo <at> gnu.org>
To: Mark H Weaver <mhw <at> netris.org>
Cc: 35181 <at> debbugs.gnu.org
Subject: Re: bug#35181: Hydra offloads often get stuck while exporting build requisites
Date: Tue, 09 Apr 2019 12:56:20 +0200

Mark H Weaver <mhw <at> netris.org> wrote:

> Also of note: So far, all known instances of this problem have occurred
> while transferring a large directory, as opposed to a tarball.
>
> We have several packages with source tarballs _much_ larger than these
> problematic source checkouts, and which are updated much more
> frequently, and yet I've *never* seen an instance of this problem while
> exporting a plain file to a build slave. For example, the upstream
> IceCat and Firefox ESR tarballs are ~270 megabytes compressed, whereas
> 'font-google-material-design-icons-3.0.1' source is only ~176 megabytes
> _uncompressed_.
>
> I have no explanation for why the superficial form of the store item
> should matter here, but maybe it's a clue. I know that plain
> non-executable files in the store are handled somewhat differently in
> the Nix model than directories or executable files, the latter
> associated with the word "recursive", and requiring an additional layer
> of encoding for purposes of serialization, but I'm not sufficiently
> familiar with the details or relevant code.
>
> Ludovic, can you think of a reason why the file/directory distinction
> could be relevant to this issue?

No, I can’t see why it could make a difference.

Ludo’.
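For readers unfamiliar with the "additional layer of encoding" mentioned in the quoted text, here is a simplified Python sketch of the Nix archive (NAR) layout as commonly described: a stream of length-prefixed strings, with directories serialized recursively. This is an illustration under stated assumptions, not the Nix or Guix code. It omits symlinks and the executable flag, the function names are hypothetical, and it says nothing about the cause of the hang.

```python
import struct


def nar_string(data: bytes) -> bytes:
    """Encode a byte string: 64-bit little-endian length, data, zero padding to 8 bytes."""
    padding = (-len(data)) % 8
    return struct.pack("<Q", len(data)) + data + b"\0" * padding


def nar_node(node) -> bytes:
    """Serialize bytes (a plain non-executable file) or a dict (a directory)."""
    out = nar_string(b"(") + nar_string(b"type")
    if isinstance(node, bytes):
        out += nar_string(b"regular") + nar_string(b"contents") + nar_string(node)
    else:
        out += nar_string(b"directory")
        # Entries are written in sorted name order so the result is deterministic.
        for name in sorted(node):
            out += (nar_string(b"entry") + nar_string(b"(")
                    + nar_string(b"name") + nar_string(name.encode())
                    + nar_string(b"node") + nar_node(node[name])
                    + nar_string(b")"))
    return out + nar_string(b")")


def nar(node) -> bytes:
    """Serialize a whole tree, prefixed with the archive magic string."""
    return nar_string(b"nix-archive-1") + nar_node(node)


print(len(nar(b"")))                   # 112: an empty regular file
print(len(nar({"a": b"", "b": b""})))  # a directory with two empty files
```

In this sketch a directory differs from a plain file only in how many small framed strings surround the file contents, so a large checkout produces many short records rather than one long one.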
(Tue, 09 Apr 2019 18:12:01 GMT) Message #37 received at 35181 <at> debbugs.gnu.org:

From: Mark H Weaver <mhw <at> netris.org>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: 35181 <at> debbugs.gnu.org
Subject: Re: bug#35181: Hydra offloads often get stuck while exporting build requisites
Date: Tue, 09 Apr 2019 14:09:41 -0400

Hi Ludovic,

Ludovic Courtès <ludo <at> gnu.org> writes:

> The problem is that this is an ancient Guix. In the meantime,
> offloading has seen relevant changes, in particular things like commit
> ed7b44370f71126087eb953f36aad8dc4c44109f which address stability issues
> with Guile-SSH (ssh dist node) that was previously used.
>
> I think we should upgrade Guix on hydra.gnu.org otherwise we’re likely
> to end up chasing old bugs.

Sure, that makes sense. I also noticed the old Guix after writing my
last messages, so yesterday I tried updating Hydra's Guix to 0.16.0-11,
which at the time was the latest version built by Hydra. After
updating, I quit and relaunched 'guix-daemon', as well as 'guix
publish', hydra-queue-runner, and hydra-evaluator.

With the new version of Guix, *all* offloads started failing in a
strange way: it got stuck in a loop, printing endlessly repeated
messages like this:

  process N acquired build slot '/var/guix/offload/hydra.gnunet.org/0'
  process N acquired build slot '/var/guix/offload/hydra.gnunet.org/0'
  process N acquired build slot '/var/guix/offload/hydra.gnunet.org/1'
  process N acquired build slot '/var/guix/offload/hydra.gnunet.org/2'
  process N acquired build slot '/var/guix/offload/hydra.gnunet.org/0'

This is from memory because after killing the queue-runner and
cancelling the 'mozjs-60' jobs (which I had intended to start building
as a test), the nix output above is no longer visible on those pages,
and I'm not sure offhand where to look for it.

Anyway, in every offloaded build, it printed a line like the above every
few seconds, with the build slot number at the end varying. I don't
remember if the process number varied.

This reminds me that I also ran into difficulties updating 'guix' on the
armhf build slaves, which are also currently stuck on an even more
ancient version of Guix (circa 0.12.0).

On both Hydra and its armhf build slaves, Guix is installed on top of a
Debian derivative, and both 'guix' and 'guix-daemon' are launched from
an environment without any Guix environment variable settings. This
apparently works in ancient versions of Guix, but not recent ones.

So, could the problem simply be that the 'guix' wrapper is not
installing enough environment variable settings for offloading to work?

Mark
Message #42 received at 35181-done <at> debbugs.gnu.org:

From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: Mark H Weaver <mhw <at> netris.org>
Cc: Ludovic Courtès <ludo <at> gnu.org>, 35181-done <at> debbugs.gnu.org
Subject: Re: bug#35181: Hydra offloads often get stuck while exporting build requisites
Date: Fri, 14 Apr 2023 09:15:27 -0400

Heya,

Mark H Weaver <mhw <at> netris.org> writes:

> merge 35181 34157
> thanks

I'm closing this old forgotten issue since we are no longer using hydra.
Let's focus on our guix deploy/offload and cuirass-related problems :-).

-- 
Thanks,
Maxim
:Debbugs Internal Request <help-debbugs <at> gnu.org>
to internal_control <at> debbugs.gnu.org
.
(Sat, 13 May 2023 11:24:08 GMT) Full text and rfc822 format available.