GNU bug report logs - #40981
shepherd 0.8.0 race condition can lead to stopping itself

Previous Next

Package: guix;

Reported by: Mathieu Othacehe <m.othacehe <at> gmail.com>

Date: Thu, 30 Apr 2020 11:52:02 UTC

Severity: important

Merged with 41429

Done: Mathieu Othacehe <mathieu <at> meru.i-did-not-set--mail-host-address--so-tickle-me>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 40981 in the body.
You can then email your comments to 40981 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-guix <at> gnu.org:
bug#40981; Package guix. (Thu, 30 Apr 2020 11:52:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Mathieu Othacehe <m.othacehe <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-guix <at> gnu.org. (Thu, 30 Apr 2020 11:52:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Mathieu Othacehe <m.othacehe <at> gmail.com>
To: bug-guix <at> gnu.org
Subject: Graphical installer tests sometimes hang.
Date: Thu, 30 Apr 2020 13:51:49 +0200
Hello,

Graphical installer tests sometimes hang, before starting marionette
tests. See for instance:
https://ci.guix.gnu.org/log/d31s9sycixhvfak5lpzdg0mzvz5aa9av-gui-installed-os-encrypted.

Restarting tests just after a hang (on the same installed image),
sometimes work. I removed the "quiet" kernel argument to see what's
going on.

--8<---------------cut here---------------start------------->8---
[    0.862608] Freeing unused kernel image memory: 1964K
[    0.863177] Run /init as init process
GC Warning: pthread_getattr_np or pthread_attr_getstack failed for main thread
GC Warning: Couldn't read /proc/stat
Welcome, this is GNU's early boot Guile.
Use '--repl' for an initrd REPL.

loading kernel modules...
[    0.915640] usbcore: registered new interface driver usb-storage
[    0.917800] usbcore: registered new interface driver uas
[    0.919569] hidraw: raw HID events driver (C) Jiri Kosina
[    0.920519] usbcore: registered new interface driver usbhid
[    0.921177] usbhid: USB HID core driver
[    0.933506] isci: Intel(R) C600 SAS Controller Driver - version 1.2.0
[    0.951303] PCI Interrupt Link [LNKD] enabled at IRQ 11
[    0.970144] PCI Interrupt Link [LNKA] enabled at IRQ 10
[    0.974033] virtio_blk virtio1: [vda] 4505600 512-byte logical blocks (2.31 GB/2.15 GiB)
[    0.976186]  vda: vda1 vda2

;; hanging here.
--8<---------------cut here---------------end--------------->8---

It seems that the boot freezes, soon after the initrd is started, and
before loading the boot script.

Mathieu




Information forwarded to bug-guix <at> gnu.org:
bug#40981; Package guix. (Mon, 04 May 2020 11:44:01 GMT) Full text and rfc822 format available.

Message #8 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Mathieu Othacehe <m.othacehe <at> gmail.com>
To: bug-guix <at> gnu.org
Subject: Re: Graphical installer tests sometimes hang.
Date: Mon, 04 May 2020 13:42:59 +0200
[Message part 1 (text/plain, inline)]
Hello,

> It seems that the boot freezes, soon after the initrd is started, and
> before loading the boot script.

I made some progress on this one. Turns out, the system was started and
the tests are in progress when it hangs.

The problem occurs during the reset of the HTTP proxy ("guix-daemon
set-http-proxy action, clear" test). It seems that clearing the proxy
kills Shepherd.

Here's a screenshot of the console when it happens. It can also be
reproduced by spawning a VM and running this script:

--8<---------------cut here---------------start------------->8---
(use-modules (gnu services herd))
(with-shepherd-action 'guix-daemon
                                  ('set-http-proxy "http://localhost:8118")
                                  result
                                result)
(with-shepherd-action 'guix-daemon
                                  ('set-http-proxy)
                                  result
                                result)
--8<---------------cut here---------------end--------------->8---

I'll keep looking!

Thanks,

Mathieu
[crash.ppm (image/x-portable-pixmap, attachment)]

Information forwarded to bug-guix <at> gnu.org:
bug#40981; Package guix. (Mon, 04 May 2020 12:51:01 GMT) Full text and rfc822 format available.

Message #11 received at 40981 <at> debbugs.gnu.org (full text, mbox):

From: Mathieu Othacehe <m.othacehe <at> gmail.com>
To: 40981 <at> debbugs.gnu.org
Subject: Re: Graphical installer tests sometimes hang.
Date: Mon, 04 May 2020 14:50:43 +0200
> I'll keep looking!

Ok, getting closer. Here's a suspect part of Shepherd strace log:

--8<---------------cut here---------------start------------->8---
[pid     1] stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=2298, ...}) = 0
[pid     1] write(9, "shepherd[1]: changing HTTP/HTTPS"..., 86) = 86
[pid     1] getpgid(194)                = 194
[pid     1] kill(-194, SIGTERM)         = 0
--8<---------------cut here---------------end--------------->8---

I think the problem is introduced by commit
1e7a91d21f1cc5d02697680e19e3878ff8565710 in Shepherd.

"(getpgid <guix-daemon-pid>") returns 0, and calling "(kill 0 SIGTERM)"
kills all processes.

Now, I really don't get how guix-daemon pgid could be zero. Man page of
setpgid(2) says:

--8<---------------cut here---------------start------------->8---
   A child created via fork(2) inherits its parent's process group ID.  The PGID  is
   preserved across an execve(2).
--8<---------------cut here---------------end--------------->8---

WDYT?

Thanks,

Mathieu




Information forwarded to bug-guix <at> gnu.org:
bug#40981; Package guix. (Tue, 05 May 2020 10:01:02 GMT) Full text and rfc822 format available.

Message #14 received at 40981 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: Mathieu Othacehe <m.othacehe <at> gmail.com>
Cc: 40981 <at> debbugs.gnu.org
Subject: Re: bug#40981: Graphical installer tests sometimes hang.
Date: Tue, 05 May 2020 12:00:25 +0200
Hi!

Mathieu Othacehe <m.othacehe <at> gmail.com> skribis:

>> I'll keep looking!
>
> Ok, getting closer. Here's a suspect part of Shepherd strace log:
>
> [pid     1] stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=2298, ...}) = 0
> [pid     1] write(9, "shepherd[1]: changing HTTP/HTTPS"..., 86) = 86
> [pid     1] getpgid(194)                = 194
> [pid     1] kill(-194, SIGTERM)         = 0
>
>
> I think the problem is introduced by commit
> 1e7a91d21f1cc5d02697680e19e3878ff8565710 in Shepherd.

OK, but the trace above is “as expected”, isn’t it?

> "(getpgid <guix-daemon-pid>") returns 0, and calling "(kill 0 SIGTERM)"
> kills all processes.

What made you think of this scenario?

I don’t think getpgid(2) can return 0.  Or am I missing something?
Since guix-dameon doesn’t actually daemonize, getpgid(pid) = pid.

Running this (in a VM) works fine:

  while herd set-http-proxy guix-daemon foo ; do : ; done

Thanks for debugging!

Ludo’.




Information forwarded to bug-guix <at> gnu.org:
bug#40981; Package guix. (Tue, 05 May 2020 10:23:01 GMT) Full text and rfc822 format available.

Message #17 received at 40981 <at> debbugs.gnu.org (full text, mbox):

From: Mathieu Othacehe <m.othacehe <at> gmail.com>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: 40981 <at> debbugs.gnu.org
Subject: Re: bug#40981: Graphical installer tests sometimes hang.
Date: Tue, 05 May 2020 12:22:23 +0200
[Message part 1 (text/plain, inline)]
Hey!

> OK, but the trace above is “as expected”, isn’t it?

Oh sorry, I pasted the wrong part! Here's the part where getpgid returns
zero and the full log in attachment:

--8<---------------cut here---------------start------------->8---
[pid     1] stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=2298, ...}) = 0
[pid     1] write(9, "shepherd[1]: clearing HTTP/HTTPS"..., 59) = 59
[pid     1] getpgid(266)                = 0
--8<---------------cut here---------------end--------------->8---

Thanks,

Mathieu
[strace.log (application/octet-stream, attachment)]

Information forwarded to bug-guix <at> gnu.org:
bug#40981; Package guix. (Wed, 06 May 2020 10:03:02 GMT) Full text and rfc822 format available.

Message #20 received at 40981 <at> debbugs.gnu.org (full text, mbox):

From: Mathieu Othacehe <m.othacehe <at> gmail.com>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: 40981 <at> debbugs.gnu.org
Subject: Re: bug#40981: Graphical installer tests sometimes hang.
Date: Wed, 06 May 2020 12:02:47 +0200
[Message part 1 (text/plain, inline)]
Hello,

I found what happened here. Turns out, when a process is forked and
before "setsid" or "setpgid" is called, it shares the parent PGID. In
that case the parent is Shepherd, with the PGID 0.

When doing the following actions:

* stop guix-daemon
* start guix-daemon

* stop guix-daemon
* start guix-daemon

If the second stop occurs after "fork" has been done, but before
"setsid", then "(getpgid)" returns 0. The naive patch attached could fix
the situation.

WDYT?

Mathieu
[0001-service-Fix-make-kill-destructor-when-PGID-is-zero.patch (text/x-diff, inline)]
From 0e4167251a56d6baa4f51fe72250a6e3bffae8c3 Mon Sep 17 00:00:00 2001
From: Mathieu Othacehe <m.othacehe <at> gmail.com>
Date: Wed, 6 May 2020 11:48:26 +0200
Subject: [PATCH] service: Fix 'make-kill-destructor' when PGID is zero.

When a process is forked, and before its GID is changed in "exec-command",
it will share the parent GID, which is 0 for Shepherd. In that case, use
the PID instead of the PGID.

* modules/shepherd/service.scm (make-kill-destructor): Handle the case when
PGID is zero, between the process fork and exec.
---
 modules/shepherd/service.scm | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index 8604d2f..258992c 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -5,6 +5,7 @@
 ;; Copyright (C) 2016 Alex Kost <alezost <at> gmail.com>
 ;; Copyright (C) 2018 Carlo Zancanaro <carlo <at> zancanaro.id.au>
 ;; Copyright (C) 2019 Ricardo Wurmus <rekado <at> elephly.net>
+;; Copyright (C) 2020 Mathieu Othacehe <m.othacehe <at> gmail.com>
 ;;
 ;; This file is part of the GNU Shepherd.
 ;;
@@ -957,11 +958,15 @@ start."
   "Return a procedure that sends SIGNAL to the process group of the PID given
 as argument, where SIGNAL defaults to `SIGTERM'."
   (lambda (pid . args)
-    ;; Kill the whole process group PID belongs to.  Don't assume that PID
-    ;; is a process group ID: that's not the case when using #:pid-file,
-    ;; where the process group ID is the PID of the process that
-    ;; "daemonized".
-    (kill (- (getpgid pid)) signal)
+    ;; Kill the whole process group PID belongs to.  Don't assume that PID is
+    ;; a process group ID: that's not the case when using #:pid-file, where
+    ;; the process group ID is the PID of the process that "daemonized".  If
+    ;; this procedure is called, between the process fork and exec, the PGID
+    ;; will still be zero (the Shepherd PGID). In that case, use the PID.
+    (let ((pgid (getpgid pid)))
+      (if pgid
+          (kill (- pgid) signal)
+          (kill pid signal)))
     #f))
 
 ;; Produce a constructor that executes a command.
-- 
2.26.0


Severity set to 'important' from 'normal' Request was from Ludovic Courtès <ludo <at> gnu.org> to control <at> debbugs.gnu.org. (Thu, 07 May 2020 10:53:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-guix <at> gnu.org:
bug#40981; Package guix. (Thu, 07 May 2020 16:50:02 GMT) Full text and rfc822 format available.

Message #25 received at 40981 <at> debbugs.gnu.org (full text, mbox):

From: Mathieu Othacehe <m.othacehe <at> gmail.com>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: 40981 <at> debbugs.gnu.org
Subject: Re: bug#40981: Graphical installer tests sometimes hang.
Date: Thu, 07 May 2020 18:48:54 +0200
[Message part 1 (text/plain, inline)]
Hello,

The previous patch was not working. The reason is that, when a process
is forked and before execv is called, it shares the parent signal
handler.

So when sending a SIGTERM to the child process, we are stopping
root-service, if the signal is received before the child has forked.

The work-around here is to save the installed SIGTERM handler and reset
it. Then, after forking, the parent can restore the SIGTERM handler. The
child will use the default SIGTERM handler that terminates the process.

WDYT?

Thanks,

Mathieu
[0001-service-Fix-make-kill-destructor-when-PGID-is-zero.patch (text/x-diff, inline)]
From aa6f67068f1fdd622673ec0223f05fd8f8a96baa Mon Sep 17 00:00:00 2001
From: Mathieu Othacehe <m.othacehe <at> gmail.com>
Date: Thu, 7 May 2020 18:39:41 +0200
Subject: [PATCH] service: Fix 'make-kill-destructor' when PGID is zero.

When a process is forked, and before its GID is changed in "exec-command",
it will share the parent GID, which is 0 for Shepherd. In that case, use
the PID instead of the PGID.

Also make sure to reset the SIGTERM handler before forking a process. Failing
to do so, will result in stopping Shepherd if a SIGTERM is received between
fork and execv calls. Restore the SIGTERM handler once the process has been
forked.

* modules/shepherd/service.scm (fork+exec-command): Save the current SIGTERM
handler and reset it before forking. Then, restore it on the parent after
forking.
(make-kill-destructor): Handle the case when PGID is zero, between the process
fork and exec.
---
 modules/shepherd/service.scm | 36 +++++++++++++++++++++++++++++-------
 tests/forking-service.sh     | 18 ++++++++++++++++++
 2 files changed, 47 insertions(+), 7 deletions(-)

diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index 8604d2f..c68d7e0 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -5,6 +5,7 @@
 ;; Copyright (C) 2016 Alex Kost <alezost <at> gmail.com>
 ;; Copyright (C) 2018 Carlo Zancanaro <carlo <at> zancanaro.id.au>
 ;; Copyright (C) 2019 Ricardo Wurmus <rekado <at> elephly.net>
+;; Copyright (C) 2020 Mathieu Othacehe <m.othacehe <at> gmail.com>
 ;;
 ;; This file is part of the GNU Shepherd.
 ;;
@@ -884,7 +885,18 @@ its PID."
   (unless %sigchld-handler-installed?
     (sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP)
     (set! %sigchld-handler-installed? #t))
-  (let ((pid (primitive-fork)))
+
+  ;; When forking a process, the signal handlers are inherited, until it
+  ;; forks. If SIGTERM is received by the forked process, before it calls
+  ;; execv, the installed SIGTERM handler, stopping Shepherd will be called.
+  ;; To avoid this, save the SIGTERM handler, disable it, and restore it once,
+  ;; the process has been forked. This way, the forked process will use the
+  ;; default SIGTERM handler stopping the process.
+  (let ((term-handler (match (sigaction SIGTERM)
+                        ((proc . _)
+                         proc)))
+        (pid (and (sigaction SIGTERM SIG_DFL)
+                  (primitive-fork))))
     (if (zero? pid)
         (exec-command command
                       #:user user
@@ -893,7 +905,10 @@ its PID."
                       #:directory directory
                       #:file-creation-mask file-creation-mask
                       #:environment-variables environment-variables)
-        pid)))
+        (begin
+          ;; Restore the initial SIGTERM handler.
+          (sigaction SIGTERM term-handler)
+          pid))))
 
 (define* (make-forkexec-constructor command
                                     #:key
@@ -957,11 +972,18 @@ start."
   "Return a procedure that sends SIGNAL to the process group of the PID given
 as argument, where SIGNAL defaults to `SIGTERM'."
   (lambda (pid . args)
-    ;; Kill the whole process group PID belongs to.  Don't assume that PID
-    ;; is a process group ID: that's not the case when using #:pid-file,
-    ;; where the process group ID is the PID of the process that
-    ;; "daemonized".
-    (kill (- (getpgid pid)) signal)
+    ;; Kill the whole process group PID belongs to.  Don't assume that PID is
+    ;; a process group ID: that's not the case when using #:pid-file, where
+    ;; the process group ID is the PID of the process that "daemonized".  If
+    ;; this procedure is called, between the process fork and exec, the PGID
+    ;; will still be zero (the Shepherd PGID). In that case, use the PID.
+    (let ((current-pgid (getpgid 0))
+          (pgid (getpgid pid)))
+      (if (eq? pgid current-pgid)
+          (begin
+            (kill pid signal))
+          (begin
+            (kill (- pgid) signal))))
     #f))
 
 ;; Produce a constructor that executes a command.
diff --git a/tests/forking-service.sh b/tests/forking-service.sh
index 7126c3d..47b09a2 100644
--- a/tests/forking-service.sh
+++ b/tests/forking-service.sh
@@ -71,6 +71,17 @@ cat > "$conf"<<EOF
                                       #:pid-file "$PWD/$service2_pid")
    #:stop  (make-kill-destructor)
    #:respawn? #t))
+
+(define %command3
+  '("$SHELL" "-c" "sleep 600"))
+
+(register-services
+ (make <service>
+   ;; A service that forks into a different process.
+   #:provides '(test3)
+   #:start (make-forkexec-constructor %command3)
+   #:stop  (make-kill-destructor)
+   #:respawn? #t))
 EOF
 cat $conf
 
@@ -113,3 +124,10 @@ sleep 1;
 $herd status test2 | grep started
 test "`cat $PWD/$service2_started`" = "started
 started"
+
+# Try to trigger eventual race conditions, when killing a process between fork
+# and execv calls.
+for i in {1..50}
+do
+    $herd restart test3
+done
-- 
2.26.0


Information forwarded to bug-guix <at> gnu.org:
bug#40981; Package guix. (Sun, 10 May 2020 10:33:01 GMT) Full text and rfc822 format available.

Message #28 received at 40981 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: Mathieu Othacehe <m.othacehe <at> gmail.com>
Cc: 40981 <at> debbugs.gnu.org
Subject: Re: bug#40981: Graphical installer tests sometimes hang.
Date: Sun, 10 May 2020 12:32:00 +0200
Hi!

Mathieu Othacehe <m.othacehe <at> gmail.com> skribis:

> The previous patch was not working. The reason is that, when a process
> is forked and before execv is called, it shares the parent signal
> handler.
>
> So when sending a SIGTERM to the child process, we are stopping
> root-service, if the signal is received before the child has forked.

Woow, good catch!

> The work-around here is to save the installed SIGTERM handler and reset
> it. Then, after forking, the parent can restore the SIGTERM handler. The
> child will use the default SIGTERM handler that terminates the process.

OK, makes sense.  (Another occasion to see how something like
‘posix_spawn’ would be more appropriate than fork + exec…)

> From aa6f67068f1fdd622673ec0223f05fd8f8a96baa Mon Sep 17 00:00:00 2001
> From: Mathieu Othacehe <m.othacehe <at> gmail.com>
> Date: Thu, 7 May 2020 18:39:41 +0200
> Subject: [PATCH] service: Fix 'make-kill-destructor' when PGID is zero.
>
> When a process is forked, and before its GID is changed in "exec-command",
> it will share the parent GID, which is 0 for Shepherd. In that case, use
> the PID instead of the PGID.
>
> Also make sure to reset the SIGTERM handler before forking a process. Failing
> to do so, will result in stopping Shepherd if a SIGTERM is received between
> fork and execv calls. Restore the SIGTERM handler once the process has been
> forked.
>
> * modules/shepherd/service.scm (fork+exec-command): Save the current SIGTERM
> handler and reset it before forking. Then, restore it on the parent after
> forking.
> (make-kill-destructor): Handle the case when PGID is zero, between the process
> fork and exec.

[...]

> +    ;; Kill the whole process group PID belongs to.  Don't assume that PID is
> +    ;; a process group ID: that's not the case when using #:pid-file, where
> +    ;; the process group ID is the PID of the process that "daemonized".  If
> +    ;; this procedure is called, between the process fork and exec, the PGID
> +    ;; will still be zero (the Shepherd PGID). In that case, use the PID.
> +    (let ((current-pgid (getpgid 0))
> +          (pgid (getpgid pid)))
> +      (if (eq? pgid current-pgid)
> +          (begin
> +            (kill pid signal))
> +          (begin
> +            (kill (- pgid) signal))))

Shouldn’t it be:

  (let ((pgid (getpgid pid)))
    (if (= (getpgid 0) pgid)
        (kill pid signal)  ;don't kill ourself
        (kill (-p pgid) signal)))

?

Note: Use the most specific comparison procedure, ‘=’ in this case,
because we know we’re dealing with numbers (it enables proper type
checking, better compiler optimizations, etc.).  More importantly, ‘eq?’
performs pointer comparison, so it shouldn’t be used with numbers (in
practice it works with “fixnums” but not with bignums).

> +# Try to trigger eventual race conditions, when killing a process between fork
> +# and execv calls.
> +for i in {1..50}
> +do
> +    $herd restart test3
> +done

Would it reproduce the problem well enough?

Use `seq 1 50` to avoid relying on a Bash-specific construct (which I
think it is, right?).

Could you send an updated patch?

Thanks for the bug hunting and for the patch!

Ludo’.




Changed bug title to 'shepherd 0.8.0 race condition can lead to stopping itself' from 'Graphical installer tests sometimes hang.' Request was from Ludovic Courtès <ludo <at> gnu.org> to control <at> debbugs.gnu.org. (Sun, 10 May 2020 10:37:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-guix <at> gnu.org:
bug#40981; Package guix. (Sun, 10 May 2020 11:20:02 GMT) Full text and rfc822 format available.

Message #33 received at 40981 <at> debbugs.gnu.org (full text, mbox):

From: Mathieu Othacehe <othacehe <at> gnu.org>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: 40981 <at> debbugs.gnu.org
Subject: Re: bug#40981: Graphical installer tests sometimes hang.
Date: Sun, 10 May 2020 13:19:18 +0200
[Message part 1 (text/plain, inline)]
Hey Ludo,

> Woow, good catch!

Thanks :)

>> The work-around here is to save the installed SIGTERM handler and reset
>> it. Then, after forking, the parent can restore the SIGTERM handler. The
>> child will use the default SIGTERM handler that terminates the process.
>
> OK, makes sense.  (Another occasion to see how something like
> ‘posix_spawn’ would be more appropriate than fork + exec…)

Didn't know about that function, but it seems way easier to manipulate
and less error prone indeed!

> Shouldn’t it be:
>
>   (let ((pgid (getpgid pid)))
>     (if (= (getpgid 0) pgid)
>         (kill pid signal)  ;don't kill ourself
>         (kill (-p pgid) signal)))

Yes, I changed it.

> Note: Use the most specific comparison procedure, ‘=’ in this case,
> because we know we’re dealing with numbers (it enables proper type
> checking, better compiler optimizations, etc.).  More importantly, ‘eq?’
> performs pointer comparison, so it shouldn’t be used with numbers (in
> practice it works with “fixnums” but not with bignums).

Ok, noted!

>> +# Try to trigger eventual race conditions, when killing a process between fork
>> +# and execv calls.
>> +for i in {1..50}
>> +do
>> +    $herd restart test3
>> +done
>
> Would it reproduce the problem well enough?

On a slow machine one time out of two, and on a faster machine,
never. The 'reproducer' I used, was to add a 'sleep' between fork and
exec, it works way better!

Tell me if you think its better to drop it.

> Could you send an updated patch?

Here it is!

> Thanks for the bug hunting and for the patch!

Thanks for the fast review :)

Mathieu
[0001-service-Fix-make-kill-destructor-when-PGID-is-zero.patch (text/x-diff, inline)]
From 79d3603bf15b8f815136178be8c8a236734a7596 Mon Sep 17 00:00:00 2001
From: Mathieu Othacehe <m.othacehe <at> gmail.com>
Date: Thu, 7 May 2020 18:39:41 +0200
Subject: [PATCH] service: Fix 'make-kill-destructor' when PGID is zero.

When a process is forked, and before its GID is changed in "exec-command",
it will share the parent GID, which is 0 for Shepherd. In that case, use
the PID instead of the PGID.

Also make sure to reset the SIGTERM handler before forking a process. Failing
to do so, will result in stopping Shepherd if a SIGTERM is received between
fork and execv calls. Restore the SIGTERM handler once the process has been
forked.

* modules/shepherd/service.scm (fork+exec-command): Save the current SIGTERM
handler and reset it before forking. Then, restore it on the parent after
forking.
(make-kill-destructor): Handle the case when PGID is zero, between the process
fork and exec.
---
 modules/shepherd/service.scm | 33 ++++++++++++++++++++++++++-------
 tests/forking-service.sh     | 18 ++++++++++++++++++
 2 files changed, 44 insertions(+), 7 deletions(-)

diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index 8604d2f..45fcf32 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -5,6 +5,7 @@
 ;; Copyright (C) 2016 Alex Kost <alezost <at> gmail.com>
 ;; Copyright (C) 2018 Carlo Zancanaro <carlo <at> zancanaro.id.au>
 ;; Copyright (C) 2019 Ricardo Wurmus <rekado <at> elephly.net>
+;; Copyright (C) 2020 Mathieu Othacehe <m.othacehe <at> gmail.com>
 ;;
 ;; This file is part of the GNU Shepherd.
 ;;
@@ -884,7 +885,18 @@ its PID."
   (unless %sigchld-handler-installed?
     (sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP)
     (set! %sigchld-handler-installed? #t))
-  (let ((pid (primitive-fork)))
+
+  ;; When forking a process, the signal handlers are inherited, until it
+  ;; forks. If SIGTERM is received by the forked process, before it calls
+  ;; execv, the installed SIGTERM handler, stopping Shepherd will be called.
+  ;; To avoid this, save the SIGTERM handler, disable it, and restore it once,
+  ;; the process has been forked. This way, the forked process will use the
+  ;; default SIGTERM handler stopping the process.
+  (let ((term-handler (match (sigaction SIGTERM)
+                        ((proc . _)
+                         proc)))
+        (pid (and (sigaction SIGTERM SIG_DFL)
+                  (primitive-fork))))
     (if (zero? pid)
         (exec-command command
                       #:user user
@@ -893,7 +905,10 @@ its PID."
                       #:directory directory
                       #:file-creation-mask file-creation-mask
                       #:environment-variables environment-variables)
-        pid)))
+        (begin
+          ;; Restore the initial SIGTERM handler.
+          (sigaction SIGTERM term-handler)
+          pid))))
 
 (define* (make-forkexec-constructor command
                                     #:key
@@ -957,11 +972,15 @@ start."
   "Return a procedure that sends SIGNAL to the process group of the PID given
 as argument, where SIGNAL defaults to `SIGTERM'."
   (lambda (pid . args)
-    ;; Kill the whole process group PID belongs to.  Don't assume that PID
-    ;; is a process group ID: that's not the case when using #:pid-file,
-    ;; where the process group ID is the PID of the process that
-    ;; "daemonized".
-    (kill (- (getpgid pid)) signal)
+    ;; Kill the whole process group PID belongs to.  Don't assume that PID is
+    ;; a process group ID: that's not the case when using #:pid-file, where
+    ;; the process group ID is the PID of the process that "daemonized".  If
+    ;; this procedure is called, between the process fork and exec, the PGID
+    ;; will still be zero (the Shepherd PGID). In that case, use the PID.
+    (let ((pgid (getpgid pid)))
+      (if (= (getpgid 0) pgid)
+          (kill pid signal) ;don't kill ourself
+          (kill (- pgid) signal)))
     #f))
 
 ;; Produce a constructor that executes a command.
diff --git a/tests/forking-service.sh b/tests/forking-service.sh
index 7126c3d..bd9aac9 100644
--- a/tests/forking-service.sh
+++ b/tests/forking-service.sh
@@ -71,6 +71,17 @@ cat > "$conf"<<EOF
                                       #:pid-file "$PWD/$service2_pid")
    #:stop  (make-kill-destructor)
    #:respawn? #t))
+
+(define %command3
+  '("$SHELL" "-c" "sleep 600"))
+
+(register-services
+ (make <service>
+   ;; A service that forks into a different process.
+   #:provides '(test3)
+   #:start (make-forkexec-constructor %command3)
+   #:stop  (make-kill-destructor)
+   #:respawn? #t))
 EOF
 cat $conf
 
@@ -113,3 +124,10 @@ sleep 1;
 $herd status test2 | grep started
 test "`cat $PWD/$service2_started`" = "started
 started"
+
+# Try to trigger eventual race conditions, when killing a process between fork
+# and execv calls.
+for i in `seq 1 50`
+do
+    $herd restart test3
+done
-- 
2.26.2


Reply sent to Ludovic Courtès <ludo <at> gnu.org>:
You have taken responsibility. (Mon, 11 May 2020 21:10:02 GMT) Full text and rfc822 format available.

Notification sent to Mathieu Othacehe <m.othacehe <at> gmail.com>:
bug acknowledged by developer. (Mon, 11 May 2020 21:10:02 GMT) Full text and rfc822 format available.

Message #38 received at 40981-done <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: Mathieu Othacehe <othacehe <at> gnu.org>
Cc: 40981-done <at> debbugs.gnu.org
Subject: Re: bug#40981: Graphical installer tests sometimes hang.
Date: Mon, 11 May 2020 23:09:05 +0200
Hello,

Mathieu Othacehe <othacehe <at> gnu.org> skribis:

>>> The work-around here is to save the installed SIGTERM handler and reset
>>> it. Then, after forking, the parent can restore the SIGTERM handler. The
>>> child will use the default SIGTERM handler that terminates the process.
>>
>> OK, makes sense.  (Another occasion to see how something like
>> ‘posix_spawn’ would be more appropriate than fork + exec…)
>
> Didn't know about that function, but it seems way easier to manipulate
> and less error prone indeed!

Make sure to read “A fork() in the Road” on that topic:

  https://lwn.net/Articles/785430/

>>> +# Try to trigger eventual race conditions, when killing a process between fork
>>> +# and execv calls.
>>> +for i in {1..50}
>>> +do
>>> +    $herd restart test3
>>> +done
>>
>> Would it reproduce the problem well enough?
>
> On a slow machine one time out of two, and on a faster machine,
> never. The 'reproducer' I used, was to add a 'sleep' between fork and
> exec, it works way better!
>
> Tell me if you think its better to drop it.

It’s better than nothing, let’s keep it.

>>From 79d3603bf15b8f815136178be8c8a236734a7596 Mon Sep 17 00:00:00 2001
> From: Mathieu Othacehe <m.othacehe <at> gmail.com>
> Date: Thu, 7 May 2020 18:39:41 +0200
> Subject: [PATCH] service: Fix 'make-kill-destructor' when PGID is zero.
>
> When a process is forked, and before its GID is changed in "exec-command",
> it will share the parent GID, which is 0 for Shepherd. In that case, use
> the PID instead of the PGID.
>
> Also make sure to reset the SIGTERM handler before forking a process. Failing
> to do so, will result in stopping Shepherd if a SIGTERM is received between
> fork and execv calls. Restore the SIGTERM handler once the process has been
> forked.
>
> * modules/shepherd/service.scm (fork+exec-command): Save the current SIGTERM
> handler and reset it before forking. Then, restore it on the parent after
> forking.
> (make-kill-destructor): Handle the case when PGID is zero, between the process
> fork and exec.

I added a “Fixes” line and pushed it.

Thanks a lot!

I can roll a 0.8.1 release soonish (I’d like to add signalfd support
while at it, we’ll see.)

Ludo’.




Did not alter fixed versions and reopened. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 22 May 2020 20:16:01 GMT) Full text and rfc822 format available.

Merged 40981 41429. Request was from Ludovic Courtès <ludo <at> gnu.org> to control <at> debbugs.gnu.org. (Fri, 22 May 2020 20:16:02 GMT) Full text and rfc822 format available.

bug closed, send any further explanations to 40981 <at> debbugs.gnu.org and Mathieu Othacehe <m.othacehe <at> gmail.com> Request was from Mathieu Othacehe <mathieu <at> meru.i-did-not-set--mail-host-address--so-tickle-me> to control <at> debbugs.gnu.org. (Sat, 20 Jun 2020 10:07:02 GMT) Full text and rfc822 format available.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 18 Jul 2020 11:24:05 GMT) Full text and rfc822 format available.

This bug report was last modified 3 years and 254 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.