GNU bug report logs - #57827
Shepherd 0.9.2 possible regressions

Previous Next

Package: guix;

Reported by: Mathieu Othacehe <othacehe <at> gnu.org>

Date: Thu, 15 Sep 2022 13:06:02 UTC

Severity: important

Done: Ludovic Courtès <ludo <at> gnu.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 57827 in the body.
You can then email your comments to 57827 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-guix <at> gnu.org:
bug#57827; Package guix. (Thu, 15 Sep 2022 13:06:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Mathieu Othacehe <othacehe <at> gnu.org>:
New bug report received and forwarded. Copy sent to bug-guix <at> gnu.org. (Thu, 15 Sep 2022 13:06:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Mathieu Othacehe <othacehe <at> gnu.org>
To: bug-guix <at> gnu.org
Subject: Shepherd 0.9.2 possible regressions
Date: Thu, 15 Sep 2022 15:05:27 +0200
Hello,

Since Shepherd 0.9.2 the following tests are failing:

* cgit: https://ci.guix.gnu.org/build/1427375/details
* gitile https://ci.guix.gnu.org/build/1427377/details

It seems that an unexpected #<eof> object is received on the marionette
socket.

* gui-uefi-installed-os https://ci.guix.gnu.org/build/1431041/details
* gui-installed-os https://ci.guix.gnu.org/build/1431027/details
* gui-installed-os-encrypted https://ci.guix.gnu.org/build/1431040/details
* gui-installed-desktop-os-encrypted https://ci.guix.gnu.org/build/1431044/details

It seems that the Shepherd cannot be restarted in the install-system
procedure.

Thanks,

Mathieu




Severity set to 'important' from 'normal' Request was from Mathieu Othacehe <mathieu <at> meije.mail-host-address-is-not-set> to control <at> debbugs.gnu.org. (Thu, 15 Sep 2022 14:38:02 GMT) Full text and rfc822 format available.

Added indication that bug 57827 blocks53214 Request was from Mathieu Othacehe <mathieu <at> meije.mail-host-address-is-not-set> to control <at> debbugs.gnu.org. (Thu, 15 Sep 2022 14:38:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-guix <at> gnu.org:
bug#57827; Package guix. (Fri, 16 Sep 2022 07:42:01 GMT) Full text and rfc822 format available.

Message #12 received at 57827 <at> debbugs.gnu.org (full text, mbox):

From: Christopher Baines <mail <at> cbaines.net>
To: Mathieu Othacehe <othacehe <at> gnu.org>
Cc: 57827 <at> debbugs.gnu.org
Subject: Re: bug#57827: Shepherd 0.9.2 possible regressions
Date: Fri, 16 Sep 2022 09:35:58 +0200
[Message part 1 (text/plain, inline)]
Mathieu Othacehe <othacehe <at> gnu.org> writes:

> Since Shepherd 0.9.2 the following tests are failing:
>
> * cgit: https://ci.guix.gnu.org/build/1427375/details
> * gitile https://ci.guix.gnu.org/build/1427377/details
>
> It seems that an unexpected #<eof> object is received on the marionette
> socket.

I had a look at the cgit system test, and it seems like this #<eof> is
coming from the NGinx pid file being empty.

Since empty files is a possibility with wait-for-file, I've sent a patch
to [1] which prevents the eof issue, plus another change to make it
easier to debug.

1: https://issues.guix.gnu.org/57850

With that change, both of the above tests seem to pass for me.

This could be related to the Shepherd upgrade, but only indirectly, as I
think the failure at least for cgit was also timing dependent.

Chris
[signature.asc (application/pgp-signature, inline)]

Information forwarded to bug-guix <at> gnu.org:
bug#57827; Package guix. (Fri, 16 Sep 2022 08:37:02 GMT) Full text and rfc822 format available.

Message #15 received at 57827 <at> debbugs.gnu.org (full text, mbox):

From: Mathieu Othacehe <othacehe <at> gnu.org>
To: Christopher Baines <mail <at> cbaines.net>
Cc: 57827 <at> debbugs.gnu.org
Subject: Re: bug#57827: Shepherd 0.9.2 possible regressions
Date: Fri, 16 Sep 2022 10:35:45 +0200
Hey Chris,

> Since empty files is a possibility with wait-for-file, I've sent a patch
> to [1] which prevents the eof issue, plus another change to make it
> easier to debug.

Nice! I wonder if we should maybe block in wait-for-file until the file
has some content. In the cgit/gitile tests, we will keep going while the
pid file is empty so the nginx server is not maybe completely up yet.

> This could be related to the Shepherd upgrade, but only indirectly, as I
> think the failure at least for cgit was also timing dependent.

Yes looks like so.

Thanks,

Mathieu




Information forwarded to bug-guix <at> gnu.org:
bug#57827; Package guix. (Mon, 19 Sep 2022 06:42:02 GMT) Full text and rfc822 format available.

Message #18 received at 57827 <at> debbugs.gnu.org (full text, mbox):

From: Mathieu Othacehe <othacehe <at> gnu.org>
To: 57827 <at> debbugs.gnu.org
Subject: Re: bug#57827: Shepherd 0.9.2 possible regressions
Date: Mon, 19 Sep 2022 08:41:05 +0200
[Message part 1 (text/plain, inline)]
Hey,

> * cgit: https://ci.guix.gnu.org/build/1427375/details
> * gitile https://ci.guix.gnu.org/build/1427377/details

OK so those two are fixed by Chris patch.

> * gui-uefi-installed-os https://ci.guix.gnu.org/build/1431041/details
> * gui-installed-os https://ci.guix.gnu.org/build/1431027/details
> * gui-installed-os-encrypted https://ci.guix.gnu.org/build/1431040/details
> * gui-installed-desktop-os-encrypted https://ci.guix.gnu.org/build/1431044/details

Regarding those four, I was able to reproduce the issue this way:

--8<---------------cut here---------------start------------->8---
$ guix repl
(stop-service 'guix-daemon)
(start-service 'guix-daemon (list (number->string (getpid))))
--8<---------------cut here---------------end--------------->8---

The latter command hangs and Shepherd becomes unresponsive.  I collected
an (attached) strace dump of Shepherd showing that there is no response
on the socket when the service is started.

Note that, this works:

--8<---------------cut here---------------start------------->8---
$ guix repl
(stop-service 'guix-daemon)
(start-service 'guix-daemon) 
--8<---------------cut here---------------end--------------->8---

So the problem could be caused by the "container-excursion*" in the
"fork+exec-command/container" procedure.

Thanks,

Mathieu
[out (application/octet-stream, attachment)]

Reply sent to Ludovic Courtès <ludo <at> gnu.org>:
You have taken responsibility. (Tue, 20 Sep 2022 17:32:02 GMT) Full text and rfc822 format available.

Notification sent to Mathieu Othacehe <othacehe <at> gnu.org>:
bug acknowledged by developer. (Tue, 20 Sep 2022 17:32:02 GMT) Full text and rfc822 format available.

Message #23 received at 57827-done <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: Mathieu Othacehe <othacehe <at> gnu.org>
Cc: 57827-done <at> debbugs.gnu.org
Subject: Re: bug#57827: Shepherd 0.9.2 possible regressions
Date: Tue, 20 Sep 2022 19:30:53 +0200
Hi,

Mathieu Othacehe <othacehe <at> gnu.org> skribis:

> Regarding those four, I was able to reproduce the issue this way:
>
> $ guix repl
> (stop-service 'guix-daemon)
> (start-service 'guix-daemon (list (number->string (getpid))))

Or from the shell:

  herd stop guix-daemon
  herd start guix-daemon $$

I was able to reproduce it using a bare-bones.tmpl VM.

> The latter command hangs and Shepherd becomes unresponsive.  I collected
> an (attached) strace dump of Shepherd showing that there is no response
> on the socket when the service is started.
>
> Note that, this works:
>
> $ guix repl
> (stop-service 'guix-daemon)
> (start-service 'guix-daemon) 
>
> So the problem could be caused by the "container-excursion*" in the
> "fork+exec-command/container" procedure.

PID 1 gets stuck on read(16, …) forever, after reading the string “2866”
(a PID):

--8<---------------cut here---------------start------------->8---
[pid  2865] clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLDstrace: Process 2866 attached
, child_tidptr=0x7fccfbe00a10) = 2866
[pid  2866] set_robust_list(0x7fccfbe00a20, 24) = 0
[pid  2866] close(3)                    = 0
[pid  2865] write(39, "2866", 4 <unfinished ...>
[pid  2866] close(4 <unfinished ...>
[pid  2865] <... write resumed>)        = 4
[pid  2866] <... close resumed>)        = 0
[pid  2866] pipe2( <unfinished ...>
[pid  2865] close(39 <unfinished ...>
[pid  2866] <... pipe2 resumed>[3, 4], O_CLOEXEC) = 0
[pid  2865] <... close resumed>)        = 0
[pid  2865] exit_group(0)               = ?
[pid  2866] rt_sigaction(SIGCHLD, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fccfc304d80}, {sa_handler=0x7fccfc427d50, sa_mask=[], sa_flags=SA_RESTORER|SA_NOCLDSTOP, sa_restorer=0x7fccfc304d80}, 8) = 0
[pid  2866] rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fccfc304d80}, {sa_handler=0x7fccfc427d50, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fccfc304d80}, 8) = 0
[pid  2866] rt_sigaction(SIGHUP, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fccfc304d80}, {sa_handler=0x7fccfc427d50, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fccfc304d80}, 8) = 0
[pid  2866] rt_sigaction(SIGTERM, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fccfc304d80}, {sa_handler=0x7fccfc427d50, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fccfc304d80}, 8) = 0
[pid  2866] rt_sigprocmask(SIG_UNBLOCK, [HUP INT TERM CHLD], [HUP INT TERM CHLD], 8) = 0
[pid  2866] mkdir("/var", 0777)         = -1 EEXIST (File exists)
[pid  2866] mkdir("/var/run", 0777)     = -1 EEXIST (File exists)
[pid  2865] +++ exited with 0 +++
[pid     1] <... wait4 resumed>[{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 2865
[pid     1] close(39)                   = 0
[pid  2866] setsid( <unfinished ...>
[pid     1] read(16,  <unfinished ...>
[pid  2866] <... setsid resumed>)       = 2866
[pid     1] <... read resumed>"2866", 4096) = 4
[pid  2866] chdir("/")                  = 0
[pid     1] read(16,  <unfinished ...>
[pid  2866] prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=4*1024}) = 0
[pid  2866] close(0)                    = 0
[pid  2866] openat(AT_FDCWD, "/dev/null", O_RDONLY) = 0
[pid  2866] dup2(0, 0)                  = 0
[pid  2866] close(1)                    = 0
[pid  2866] close(2)                    = 0
[pid  2866] openat(AT_FDCWD, "/var/log/guix-daemon.log", O_WRONLY|O_CREAT|O_APPEND, 0640) = 1
[pid  2866] dup2(1, 1)                  = 1
[pid  2866] dup2(1, 2)                  = 2
[pid  2866] execve("/gnu/store/bxnkqnpbf4q4z6245b61wgpm8gkr9nj1-guix-1.3.0-29.9e46320/bin/guix-daemon", ["/gnu/store/bxnkqnpbf4q4z6245b61w"..., "--build-users-group", "guixbuild", "--max-silent-time", "0", "--timeout", "0", "--log-compression", "gzip", "--discover=yes", "--substitute-urls", "https://substitutes.nonguix.org "...], 0x7fccf71fa480 /* 3 vars */) = 0
--8<---------------cut here---------------end--------------->8---

This happens because the other end of the file descriptor happens to be
inherited by 2866, which will never close it because it just execs
guix-daemon.

This is fixed by 6abdcef4a68e98f538ab69fde096adc5f5ca4ff4; the log
contains extra details.

Thanks!

Ludo’.




Information forwarded to bug-guix <at> gnu.org:
bug#57827; Package guix. (Sat, 24 Sep 2022 12:05:01 GMT) Full text and rfc822 format available.

Message #26 received at 57827-done <at> debbugs.gnu.org (full text, mbox):

From: Mathieu Othacehe <othacehe <at> gnu.org>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: 57827-done <at> debbugs.gnu.org
Subject: Re: bug#57827: Shepherd 0.9.2 possible regressions
Date: Sat, 24 Sep 2022 14:04:08 +0200
[Message part 1 (text/plain, inline)]
Hey,

> This is fixed by 6abdcef4a68e98f538ab69fde096adc5f5ca4ff4; the log
> contains extra details.

Thanks for fixing it! Turns out we still have an issue on all four
installer tests.

The error messages look like:

--8<---------------cut here---------------start------------->8---
Sep 22 09:49:19 localhost installer[252]: running command ("guix" "system" "init" "--fallback" "--no-grafts" "--no-substitutes" "/mnt/etc/config.scm" "/mnt") 

Sep 22 09:49:21 localhost installer[252]: guix system: error: read error while loading '/mnt/etc/config.scm': /mnt/etc/config.scm:63:51: unexpected ")" 
--8<---------------cut here---------------end--------------->8---

Having a look at the configuration file (attached), it seems that the
problem comes from the edit-configuration-file function. As the
rewritten configuration is shorter than the initial because the comments
are stripped by the pretty-print function, we have some leftovers from
the initial config.

I do not really understand why because call-with-output-file is supposed
to use the O_TRUNC flag and remove the existing content of the
configuration file. Plus, calling edit-configuration-file from a guix
repl does the right thing.

Anyways, I pushed fe4663ae2476cb527d4f1f49ff8fa077d43f7251 which fixes
the issue by removing the file before rewriting it.

Mathieu
[configuration_before.scm (application/octet-stream, attachment)]
[configuration_after.scm (application/octet-stream, attachment)]

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 23 Oct 2022 11:24:06 GMT) Full text and rfc822 format available.

This bug report was last modified 1 year and 186 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.