GNU bug report logs - #52779
tests/no-home test failure in Shepherd

Previous Next

Package: guix;

Reported by: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>

Date: Fri, 24 Dec 2021 14:53:01 UTC

Severity: normal

Done: Ludovic Courtès <ludo <at> gnu.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 52779 in the body.
You can then email your comments to 52779 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-guix <at> gnu.org:
bug#52779; Package guix. (Fri, 24 Dec 2021 14:53:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to Maxim Cournoyer <maxim.cournoyer <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-guix <at> gnu.org. (Fri, 24 Dec 2021 14:53:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: bug-guix <bug-guix <at> gnu.org>
Subject: tests/no-home test failure in Shepherd
Date: Fri, 24 Dec 2021 09:52:07 -0500
Hello Guix,

I encountered the following (nondeterministic) test failure:

--8<---------------cut here---------------start------------->8---
# TOTAL: 13
# PASS:  12
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0

.. contents:: :depth: 2

FAIL: tests/no-home
===================

+ shepherd --version
shepherd (GNU Shepherd) 0.8.1
Copyright (C) 2020 the Shepherd authors
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
+ HOME=/nowhere
+ export HOME
+ test -w /nowhere
+ socket=t-socket-1651
+ pid=t-pid-1651
+ herd='herd -s t-socket-1651'
+ trap 'rm -f t-socket-1651;
      test -f t-pid-1651 && kill `cat t-pid-1651` || true; rm -f t-pid-1651' EXIT
+ shepherd_pid=1896
+ test -f t-pid-1651
+ shepherd -I -s t-socket-1651 -c /dev/null -l /dev/null --pid=t-pid-1651
+ kill -0 1896
+ sleep 0.3
Service root has been started.
+ test -f t-pid-1651
++ cat t-pid-1651
+ kill -0 1896
+ herd -s t-socket-1651 status root
Started:
 + root
+ herd -s t-socket-1651 stop root
++ cat t-pid-1651
+ kill 1896
+ exit 1
+ rm -f t-socket-1651
+ test -f t-pid-1651
++ cat t-pid-1651
+ kill 1896
+ rm -f t-pid-1651
FAIL tests/no-home.sh (exit status: 1)


error: in phase 'check': uncaught exception:
%exception #<&invoke-error program: "make" arguments: ("check" "-j" "24" "GUILE_AUTO_COMPILE=0") exit-status: 2 term-signal: #f stop-signal: #f> 
phase `check' failed after 15.3 seconds
command "make" "check" "-j" "24" "GUILE_AUTO_COMPILE=0" failed with status 2
--8<---------------cut here---------------end--------------->8---

Perhaps due to high load, as I can't seem to reproduce when building
Shepherd alone.

Thanks,

Maxim




Information forwarded to bug-guix <at> gnu.org:
bug#52779; Package guix. (Mon, 03 Jan 2022 21:10:01 GMT) Full text and rfc822 format available.

Message #8 received at 52779 <at> debbugs.gnu.org (full text, mbox):

From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: 52779 <at> debbugs.gnu.org
Subject: Re: bug#52779: tests/no-home test failure in Shepherd
Date: Mon, 03 Jan 2022 16:09:30 -0500
Hi,

Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:

> Hello Guix,
>
> I encountered the following (nondeterministic) test failure:
>
> # TOTAL: 13
> # PASS:  12
> # SKIP:  0
> # XFAIL: 0
> # FAIL:  1
> # XPASS: 0
> # ERROR: 0
>
> .. contents:: :depth: 2
>
> FAIL: tests/no-home
> ===================
>
> + shepherd --version
> shepherd (GNU Shepherd) 0.8.1
> Copyright (C) 2020 the Shepherd authors
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.
> + HOME=/nowhere
> + export HOME
> + test -w /nowhere
> + socket=t-socket-1651
> + pid=t-pid-1651
> + herd='herd -s t-socket-1651'
> + trap 'rm -f t-socket-1651;
>       test -f t-pid-1651 && kill `cat t-pid-1651` || true; rm -f t-pid-1651' EXIT
> + shepherd_pid=1896
> + test -f t-pid-1651
> + shepherd -I -s t-socket-1651 -c /dev/null -l /dev/null --pid=t-pid-1651
> + kill -0 1896
> + sleep 0.3
> Service root has been started.
> + test -f t-pid-1651
> ++ cat t-pid-1651
> + kill -0 1896
> + herd -s t-socket-1651 status root
> Started:
>  + root
> + herd -s t-socket-1651 stop root
> ++ cat t-pid-1651
> + kill 1896
> + exit 1
> + rm -f t-socket-1651
> + test -f t-pid-1651
> ++ cat t-pid-1651
> + kill 1896
> + rm -f t-pid-1651
> FAIL tests/no-home.sh (exit status: 1)
>
>
> error: in phase 'check': uncaught exception:
> %exception #<&invoke-error program: "make" arguments: ("check" "-j" "24" "GUILE_AUTO_COMPILE=0") exit-status: 2 term-signal: #f stop-signal: #f> 
> phase `check' failed after 15.3 seconds
> command "make" "check" "-j" "24" "GUILE_AUTO_COMPILE=0" failed with status 2
>
> Perhaps due to high load, as I can't seem to reproduce when building
> Shepherd alone.

I encountered the problem once more; the errors are slightly different
so I'm pasting the failed test output here:

--8<---------------cut here---------------end--------------->8---
FAIL: tests/no-home
===================

+ shepherd --version
shepherd (GNU Shepherd) 0.8.1
Copyright (C) 2020 the Shepherd authors
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
+ HOME=/nowhere
+ export HOME
+ test -w /nowhere
+ socket=t-socket-1606
+ pid=t-pid-1606
+ herd='herd -s t-socket-1606'
+ trap 'rm -f t-socket-1606;
      test -f t-pid-1606 && kill `cat t-pid-1606` || true; rm -f t-pid-1606' EXIT
+ shepherd_pid=1868
+ test -f t-pid-1606
+ shepherd -I -s t-socket-1606 -c /dev/null -l /dev/null --pid=t-pid-1606
+ kill -0 1868
+ sleep 0.3
Service root has been started.
+ test -f t-pid-1606
++ cat t-pid-1606
+ kill -0 1868
+ herd -s t-socket-1606 status root
Started:
 + root
+ herd -s t-socket-1606 stop root
++ cat t-pid-1606
+ kill 1868
+ exit 1
+ rm -f t-socket-1606
+ test -f t-pid-1606
++ cat t-pid-1606
+ kill 1868
./tests/no-home.sh: line 2: kill: (1868) - No such process
+ true
+ rm -f t-pid-1606
FAIL tests/no-home.sh (exit status: 1)


error: in phase 'check': uncaught exception:
%exception #<&invoke-error program: "make" arguments: ("check" "-j" "24" "GUILE_AUTO_COMPILE=0") exit-status: 2 term-signal: #f stop-signal: #f> 
phase `check' failed after 15.6 seconds
command "make" "check" "-j" "24" "GUILE_AUTO_COMPILE=0" failed with status 2
--8<---------------cut here---------------end--------------->8---

Thanks,

Maxim




Information forwarded to bug-guix <at> gnu.org:
bug#52779; Package guix. (Mon, 10 Jan 2022 18:18:02 GMT) Full text and rfc822 format available.

Message #11 received at 52779 <at> debbugs.gnu.org (full text, mbox):

From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: 52779 <at> debbugs.gnu.org
Subject: Re: bug#52779: tests/no-home test failure in Shepherd
Date: Mon, 10 Jan 2022 13:17:52 -0500
Hi,

Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:

> Hi,
>
> Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:
>
>> Hello Guix,
>>
>> I encountered the following (nondeterministic) test failure:
>>
>> # TOTAL: 13
>> # PASS:  12
>> # SKIP:  0
>> # XFAIL: 0
>> # FAIL:  1
>> # XPASS: 0
>> # ERROR: 0
>>
>> .. contents:: :depth: 2
>>
>> FAIL: tests/no-home
>> ===================

[...]

I tried to reproduce this is a controlled setting, using the 'stress'
command to effect a high load on the system, but in vain.

I could only trigger the problem on a highly parallel CPU (24 logical
cores) busy building 10s of Guix packages at the same time.

Thanks,

Maxim




Information forwarded to bug-guix <at> gnu.org:
bug#52779; Package guix. (Thu, 13 Jan 2022 15:38:02 GMT) Full text and rfc822 format available.

Message #14 received at 52779 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Cc: 52779 <at> debbugs.gnu.org
Subject: Re: bug#52779: tests/no-home test failure in Shepherd
Date: Thu, 13 Jan 2022 16:37:45 +0100
[Message part 1 (text/plain, inline)]
Hello,

Maxim Cournoyer <maxim.cournoyer <at> gmail.com> skribis:

> + herd -s t-socket-1651 status root
> Started:
>  + root
> + herd -s t-socket-1651 stop root
> ++ cat t-pid-1651
> + kill 1896
> + exit 1
> + rm -f t-socket-1651
> + test -f t-pid-1651
> ++ cat t-pid-1651
> + kill 1896
> + rm -f t-pid-1651
> FAIL tests/no-home.sh (exit status: 1)

What happens here is that the shepherd process is still alive after
‘herd stop root’ has completed, contrary to what’s expected:

--8<---------------cut here---------------start------------->8---
$herd stop root

if kill `cat "$pid"`
then
    exit 1
fi
--8<---------------cut here---------------end--------------->8---

The expectation is that shepherd has terminated by the time ‘herd stop
root’ exits; I wonder if that’s bogus.

‘herd stop root’ terminates when shepherd has closed its connection,
which normally happens when shepherd exits:

--8<---------------cut here---------------start------------->8---
28003 read(15, "(shepherd-command (version 0) (action stop) (service root) (arguments ()) (directory \"/data/src/shepherd\"))", 1024) = 107
28003 brk(0x1030000)                    = 0x1030000
28003 mmap(NULL, 262144, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0072be8000
28003 brk(0x100f000)                    = 0x100f000
28003 getcwd("/data/src/shepherd", 100) = 19
28003 chdir("/data/src/shepherd")       = 0
28003 newfstatat(AT_FDCWD, "/etc/localtime", {st_mode=S_IFREG|0444, st_size=2962, ...}, 0) = 0
28003 write(7, "2022-01-13 16:21:16 Exiting shepherd...\n", 40) = 40
28003 chdir("/data/src/shepherd")       = 0
28003 getuid()                          = 1000
28003 close(13)                         = 0
28003 unlink("test")                    = 0
28003 exit_group(0)                     = ?
28006 <... futex resumed>)              = ?
28008 <... read resumed> <unfinished ...>) = ?
28005 <... futex resumed>)              = ?
28004 <... futex resumed>)              = ?
28008 +++ exited with 0 +++
28006 +++ exited with 0 +++
28005 +++ exited with 0 +++
28004 +++ exited with 0 +++
28003 +++ exited with 0 +++
--8<---------------cut here---------------end--------------->8---

Maybe there’s a chance that the shell hasn’t processed the shepherd’s
SIGCHLD when it evaluates the “if kill `cat "$pid"`” condition; in that
case, the shepherd process still exists as a zombie.

A more robust approach might be to use the shell’s builtin ‘wait’,
because then I suppose the shell will be forced to process pending
SIGCHLDs:

[Message part 2 (text/x-patch, inline)]
diff --git a/tests/no-home.sh b/tests/no-home.sh
index 85b6116..5a8c278 100644
--- a/tests/no-home.sh
+++ b/tests/no-home.sh
@@ -1,5 +1,5 @@
 # GNU Shepherd --- Make sure shepherd doesn't fail when $HOME is not writable.
-# Copyright © 2014, 2016 Ludovic Courtès <ludo <at> gnu.org>
+# Copyright © 2014, 2016, 2022 Ludovic Courtès <ludo <at> gnu.org>
 #
 # This file is part of the GNU Shepherd.
 #
@@ -46,7 +46,4 @@ kill -0 `cat "$pid"`
 $herd status root
 $herd stop root
 
-if kill `cat "$pid"`
-then
-    exit 1
-fi
+wait `cat "$pid"`
[Message part 3 (text/plain, inline)]
I can’t get it to fail while waiting for a few minutes of:

  while make check TESTS=tests/no-home.sh ; do : ; done

… but I cannot get the original one to fail either.

Does it work for you?

Ludo’.

Information forwarded to bug-guix <at> gnu.org:
bug#52779; Package guix. (Thu, 13 Jan 2022 17:30:02 GMT) Full text and rfc822 format available.

Message #17 received at 52779 <at> debbugs.gnu.org (full text, mbox):

From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: 52779 <at> debbugs.gnu.org
Subject: Re: bug#52779: tests/no-home test failure in Shepherd
Date: Thu, 13 Jan 2022 12:29:14 -0500
Hi Ludo!

Ludovic Courtès <ludo <at> gnu.org> writes:

> Hello,
>
> Maxim Cournoyer <maxim.cournoyer <at> gmail.com> skribis:
>
>> + herd -s t-socket-1651 status root
>> Started:
>>  + root
>> + herd -s t-socket-1651 stop root
>> ++ cat t-pid-1651
>> + kill 1896
>> + exit 1
>> + rm -f t-socket-1651
>> + test -f t-pid-1651
>> ++ cat t-pid-1651
>> + kill 1896
>> + rm -f t-pid-1651
>> FAIL tests/no-home.sh (exit status: 1)
>
> What happens here is that the shepherd process is still alive after
> ‘herd stop root’ has completed, contrary to what’s expected:
>
> $herd stop root
>
> if kill `cat "$pid"`
> then
>     exit 1
> fi

Yes!

[...]

> Maybe there’s a chance that the shell hasn’t processed the shepherd’s
> SIGCHLD when it evaluates the “if kill `cat "$pid"`” condition; in that
> case, the shepherd process still exists as a zombie.
>
> A more robust approach might be to use the shell’s builtin ‘wait’,
> because then I suppose the shell will be forced to process pending
> SIGCHLDs:
>
> diff --git a/tests/no-home.sh b/tests/no-home.sh
> index 85b6116..5a8c278 100644
> --- a/tests/no-home.sh
> +++ b/tests/no-home.sh
> @@ -1,5 +1,5 @@
>  # GNU Shepherd --- Make sure shepherd doesn't fail when $HOME is not writable.
> -# Copyright © 2014, 2016 Ludovic Courtès <ludo <at> gnu.org>
> +# Copyright © 2014, 2016, 2022 Ludovic Courtès <ludo <at> gnu.org>
>  #
>  # This file is part of the GNU Shepherd.
>  #
> @@ -46,7 +46,4 @@ kill -0 `cat "$pid"`
>  $herd status root
>  $herd stop root
>  
> -if kill `cat "$pid"`
> -then
> -    exit 1
> -fi
> +wait `cat "$pid"`

As I wrote, I was also unable to reproduce this (but when I had a high
load of packages to build at the same time, I could get it to happen a
couple times upon retrying).  Your analysis (and the narrow window which
would allow for a failure) makes sense to me, along with the proposed
fix.

I think you should commit it and tentatively mark this bug as fixed :-).

Thank you for looking into it!

Maxim




Reply sent to Ludovic Courtès <ludo <at> gnu.org>:
You have taken responsibility. (Mon, 17 Jan 2022 13:45:01 GMT) Full text and rfc822 format available.

Notification sent to Maxim Cournoyer <maxim.cournoyer <at> gmail.com>:
bug acknowledged by developer. (Mon, 17 Jan 2022 13:45:01 GMT) Full text and rfc822 format available.

Message #22 received at 52779-done <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Cc: 52779-done <at> debbugs.gnu.org
Subject: Re: bug#52779: tests/no-home test failure in Shepherd
Date: Mon, 17 Jan 2022 14:43:55 +0100
Maxim Cournoyer <maxim.cournoyer <at> gmail.com> skribis:

> I think you should commit it and tentatively mark this bug as fixed :-).

Alright, pushed as Shepherd commit
e025f83c2072e4a6533a5747682810a885e20b2e!

Thanks,
Ludo’.




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 15 Feb 2022 12:24:12 GMT) Full text and rfc822 format available.

This bug report was last modified 2 years and 42 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.