GNU bug report logs - #56674
[Shepherd] Use of ‘waitpid’, ‘system*’, etc. in service code can cause deadlocks

Package: guix

Reported by: Ludovic Courtès <ludo <at> gnu.org>

Date: Wed, 20 Jul 2022 21:40:01 UTC

Severity: important

Merged with 58926

Done: Ludovic Courtès <ludo <at> gnu.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 56674 in the body.
You can then email your comments to 56674 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-guix <at> gnu.org:
bug#56674; Package guix. (Wed, 20 Jul 2022 21:40:01 GMT)

Acknowledgement sent to Ludovic Courtès <ludo <at> gnu.org>:
New bug report received and forwarded. Copy sent to bug-guix <at> gnu.org. (Wed, 20 Jul 2022 21:40:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Ludovic Courtès <ludo <at> gnu.org>
To: bug-guix <at> gnu.org
Subject: [Shepherd] Use of ‘waitpid’, ‘system*’, etc. in service code can cause deadlocks
Date: Wed, 20 Jul 2022 23:39:08 +0200

Hi!

We’ve just had a bad experience with the nginx service on berlin, where
‘herd restart nginx’ would cause shepherd to get stuck forever in
‘waitpid’ on the process that was supposed to start nginx.

The details are unclear, but one thing is clear: using ‘waitpid’
(either directly or indirectly with ‘system*’, which is what
‘nginx-service-type’ does) is not great, for two reasons:

  1. In the best case, shepherd (as of 0.9.1) is stuck while ‘system*’
     is in ‘waitpid’ waiting for child process completion (“stuck” as
     in: doesn’t do anything, not even answering ‘herd’ requests or
     inetd connections.)

  2. I don’t think that can happen with ‘system*’ (because it’s in C),
     but generally speaking, there’s a possibility that shepherd’s event
     loop will handle child process termination before some other
     user-made ‘waitpid’ call does.

Anyway, that’s a bad situation.
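
The second point can be sketched in plain Guile. This is an illustration only, not Shepherd code: the procedure names are made up, the first ‘waitpid’ stands in for shepherd’s event loop and the second for service code that tries to reap the same child afterwards.

```scheme
(use-modules (ice-9 match))

;; Hypothetical helper: fork a child that exits at once; return its PID.
(define (spawn-short-lived-child)
  (match (primitive-fork)
    (0 (primitive-_exit 0))             ;child: exit without running hooks
    (pid pid)))                         ;parent: hand back the child's PID

;; Reap the same child twice and report what the second attempt sees.
(define (reap-twice)
  (let ((pid (spawn-short-lived-child)))
    ;; First reaper, standing in for the event loop.
    (waitpid pid)
    ;; Second reaper, standing in for user-made service code.
    (catch 'system-error
      (lambda ()
        (waitpid pid)
        'reaped)
      (lambda args
        (if (= ECHILD (system-error-errno args))
            'already-reaped
            'other-error)))))
```

Under these assumptions, (reap-twice) should evaluate to ‘already-reaped’: once one caller has collected the exit status, the other gets ECHILD instead of the status it was waiting for.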

So I can think of several ways to address it:

  1. Change the nginx service ‘stop’ method to just
     (make-kill-destructor), which should work just as well as invoking
     “nginx -s stop”.

  2. Have Shepherd provide a replacement for ‘system*’.
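
A rough sketch of what option 1 could look like in a Guix service definition. This is a hypothetical service, not the actual ‘nginx-service-type’ code: the variable name, the ‘requirement’, and the nginx command line (no configuration file is passed) are assumptions made for illustration.

```scheme
(use-modules (gnu services shepherd)
             (gnu packages web)
             (guix gexp))

;; Hypothetical service definition, for illustration only.
(define %nginx-sketch
  (shepherd-service
   (provision '(nginx))
   (requirement '(networking))
   (documentation "Run nginx in the foreground.")
   ;; Keep nginx in the foreground so shepherd tracks its PID directly.
   (start #~(make-forkexec-constructor
             (list #$(file-append nginx "/sbin/nginx")
                   "-g" "daemon off;")))
   ;; Signal the process from within shepherd (SIGTERM by default)
   ;; instead of running "nginx -s stop" through 'system*', so the
   ;; 'stop' method contains no 'waitpid' call.
   (stop #~(make-kill-destructor))))
```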

Thoughts?

Ludo’.




Severity set to 'important' from 'normal' Request was from Ludovic Courtès <ludo <at> gnu.org> to control <at> debbugs.gnu.org. (Wed, 20 Jul 2022 21:44:01 GMT)

Information forwarded to bug-guix <at> gnu.org:
bug#56674; Package guix. (Wed, 20 Jul 2022 23:49:02 GMT)

Message #10 received at 56674 <at> debbugs.gnu.org:

From: Maxime Devos <maximedevos <at> telenet.be>
To: Ludovic Courtès <ludo <at> gnu.org>, 56674 <at> debbugs.gnu.org
Subject: Re: bug#56674: [Shepherd] Use of ‘waitpid’, ‘system*’, etc. in service code can cause deadlocks
Date: Thu, 21 Jul 2022 01:48:02 +0200

On 20-07-2022 23:39, Ludovic Courtès wrote:
> Hi!
>
> We’ve just had a bad experience with the nginx service on berlin, where
> ‘herd restart nginx’ would cause shepherd to get stuck forever in
> ‘waitpid’ on the process that was supposed to start nginx.
>
> The details are unclear, but one thing is clear: using ‘waitpid’
> (either directly or indirectly with ‘system*’, which is what
> ‘nginx-service-type’ does) is not great, for two reasons:
>
>    1. In the best case, shepherd (as of 0.9.1) is stuck while ‘system*’
>       is in ‘waitpid’ waiting for child process completion (“stuck” as
>       in: doesn’t do anything, not even answering ‘herd’ requests or
>       inetd connections.)
>
>    2. I don’t think that can happen with ‘system*’ (because it’s in C),
>       but generally speaking, there’s a possibility that shepherd’s event
>       loop will handle child process termination before some other
>       user-made ‘waitpid’ call does.
>
> Anyway, that’s a bad situation.
>
> So I can think of several ways to address it:
>
>    1. Change the nginx service ‘stop’ method to just
>       (make-kill-destructor), which should work just as well as invoking
>       “nginx -s stop”.
>
>    2. Have Shepherd provide a replacement for ‘system*’.
Why Shepherd and not guile fibers? Is this a Shepherd-specific problem?
>
> Thoughts?

3. Make waitpid (or a variant that does what we need) interact well with 
guile-fibers, like how 'accept' doesn't inhibit switching to another 
fiber. There is some Linux API with signal handlers or pid fds or such that 
might be useful here, though I don't recall the name. Presumably 
something similar can be done for the Hurd, though some C glue may be 
needed to access the right Hurd APIs if the signal handler API isn't 
portable.

Alternatively:

4. Do the waitpid in a separate thread (needs work-around for the 
multi-threaded fork problem, probably C things? Or modifying Guile and 
maybe glibc to avoid async-unsafe things or make more things async-safe 
or whatever the appropriate ...-safe is here.)

If not a Guile Fibers interaction problem, then the asynchronous signal 
handler API might still be useful.

Greetings,
Maxime


Information forwarded to bug-guix <at> gnu.org:
bug#56674; Package guix. (Thu, 21 Jul 2022 15:40:02 GMT)

Message #13 received at 56674 <at> debbugs.gnu.org:

From: Ludovic Courtès <ludo <at> gnu.org>
To: Maxime Devos <maximedevos <at> telenet.be>
Cc: 56674 <at> debbugs.gnu.org
Subject: Re: bug#56674: [Shepherd] Use of ‘waitpid’, ‘system*’, etc. in service code can cause deadlocks
Date: Thu, 21 Jul 2022 17:39:39 +0200

Maxime Devos <maximedevos <at> telenet.be> wrote:

> Why Shepherd and not guile fibers? Is this a Shepherd-specific problem?

Blocking calls are a problem for Fibers in general, and ‘waitpid’ is no
exception.

The problem here is Shepherd-specific in the sense that we’re more
likely to use ‘system*’ and ‘waitpid’ in this context.  It’s also
Shepherd-specific because shepherd already runs an event loop that
tracks signal FDs and will thus “see” SIGCHLD events.

> 3. Make waitpid (or a variant that does what we need) interact well
> with guile-fibers, like how 'accept' doesn't inhibit switching to
> another fiber. There is some Linux API with signal handlers or pid fds
> or such that might be useful here, though I don't recall the
> name. Presumably something similar can be done for the Hurd, though
> some C glue may be needed to access the right Hurd APIs if the signal
> handler API isn't portable.

Yes, that’s roughly what I had in mind when I mentioned providing a
replacement for ‘system*’ (but you’re right, it’s a replacement for
‘waitpid’ at its core).
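
As a rough illustration of what a fiber-friendly ‘waitpid’ could look like, here is a minimal sketch that polls with WNOHANG and yields to other fibers between attempts. Polling is a deliberately simple stand-in, not the signal-based design discussed in this thread. The sketch assumes Guile-Fibers is installed, and ‘wait-for-child’, ‘run-and-wait’, and the polling interval are hypothetical names and values.

```scheme
(use-modules ((fibers) #:select (run-fibers))
             ;; Import the Fibers 'sleep' under another name so that it
             ;; does not shadow Guile's core 'sleep'.
             ((fibers timers) #:select ((sleep . fiber-sleep)))
             (ice-9 match))

;; Wait for PID without blocking the scheduler: poll with WNOHANG and
;; suspend only the current fiber between attempts.
(define* (wait-for-child pid #:key (interval 0.05))
  (let loop ()
    (match (waitpid pid WNOHANG)
      ((0 . _)                          ;child still running
       (fiber-sleep interval)
       (loop))
      ((_ . status)                     ;child terminated
       status))))

;; Fork, exec PROGRAM with ARGS, and return its exit status.
(define (run-and-wait program . args)
  (match (primitive-fork)
    (0
     ;; Child: exec, or exit with 127 if exec fails.
     (catch #t
       (lambda () (apply execlp program program args))
       (lambda _ (primitive-_exit 127))))
    (pid
     (wait-for-child pid))))

;; Usage: a single scheduler with no preemption, so that no extra POSIX
;; threads are created before forking.
(define (exit-code-of program . args)
  (run-fibers
   (lambda ()
     (status:exit-val (apply run-and-wait program args)))
   #:parallelism 1
   #:hz 0))
```

The trade-off is latency and wasted wake-ups: a child that exits right after a poll is only noticed one interval later, which is why an event-driven approach based on SIGCHLD is preferable in a real init system.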

> Alternatively:
>
> 4. Do the waitpid in a separate thread (needs work-around for the
> multi-threaded fork problem, probably C things? Or modifying Guile and
> maybe glibc to avoid async-unsafe things or make more things
> async-safe or whatever the appropriate ...-safe is here.)

For shepherd, multithreading is not an option due to the semantics of
fork in the presence of threads.

Ludo’.




Information forwarded to bug-guix <at> gnu.org:
bug#56674; Package guix. (Sat, 13 Aug 2022 15:01:01 GMT)

Message #16 received at 56674 <at> debbugs.gnu.org:

From: Maxime Devos <maximedevos <at> telenet.be>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: 56674 <at> debbugs.gnu.org
Subject: Re: bug#56674: [Shepherd] Use of ‘waitpid’, ‘system*’, etc. in service code can cause deadlocks
Date: Sat, 13 Aug 2022 16:59:55 +0200

On 21-07-2022 17:39, Ludovic Courtès wrote:
>> Alternatively:
>>
>> 4. Do the waitpid in a separate thread (needs work-around for the
>> multi-threaded fork problem, probably C things? Or modifying Guile and
>> maybe glibc to avoid async-unsafe things or make more things
>> async-safe or whatever the appropriate ...-safe is here.)
> For shepherd, multithreading is not an option due to the semantics of
> fork in the presence of threads.

From what I've read, multi-threaded fork is safe as long as you exec 
'immediately' afterwards, that is, without taking locks, allocating 
memory with malloc, or calling anything else that is not 
async-signal-safe between the fork and the exec. I don't think that is 
possible in Guile code, but that's what the C glue would be for.

Greetings,
Maxime.

Merged 56674 58926. Request was from Mathieu Othacehe <mathieu <at> meije.mail-host-address-is-not-set> to control <at> debbugs.gnu.org. (Sat, 12 Nov 2022 08:37:02 GMT)

Information forwarded to bug-guix <at> gnu.org:
bug#56674; Package guix. (Sun, 13 Nov 2022 23:17:01 GMT)

Message #21 received at 56674 <at> debbugs.gnu.org:

From: Ludovic Courtès <ludo <at> gnu.org>
To: 56674 <at> debbugs.gnu.org
Subject: Re: bug#56674: [Shepherd] Use of ‘waitpid’, ‘system*’, etc. in service code can cause deadlocks
Date: Mon, 14 Nov 2022 00:16:38 +0100

Hi,

Ludovic Courtès <ludo <at> gnu.org> wrote:

>   1. In the best case, shepherd (as of 0.9.1) is stuck while ‘system*’
>      is in ‘waitpid’ waiting for child process completion (“stuck” as
>      in: doesn’t do anything, not even answering ‘herd’ requests or
>      inetd connections.)
>
>   2. I don’t think that can happen with ‘system*’ (because it’s in C),
>      but generally speaking, there’s a possibility that shepherd’s event
>      loop will handle child process termination before some other
>      user-made ‘waitpid’ call does.
>
> Anyway, that’s a bad situation.
>
> So I can think of several ways to address it:
>
>   1. Change the nginx service ‘stop’ method to just
>      (make-kill-destructor), which should work just as well as invoking
>      “nginx -s stop”.
>
>   2. Have Shepherd provide a replacement for ‘system*’.

These fresh Shepherd commits install a non-blocking ‘system*’ replacement:

  975b0aa service: Provide a non-blocking replacement of 'system*'.
  039c7a8 service: Spawn a fiber responsible for process monitoring.

We’ll have to do more testing and probably go for a 0.9.3 release soon.

Protip: you can test the latest shepherd with:

--8<---------------cut here---------------start------------->8---
(operating-system
  ;; …
  (essential-services
   (modify-services (operating-system-default-essential-services
                     this-operating-system)
     (shepherd-root-service-type
      config =>
      (shepherd-configuration
       (shepherd (package
                   (inherit shepherd-0.9)
                   (version "0.9.3pre")
                   (source (git-checkout
                            (url "https://git.savannah.gnu.org/git/shepherd.git")))
                   (native-inputs
                    (modify-inputs (package-native-inputs shepherd-0.9)
                      (append autoconf automake help2man texinfo gnu-gettext))))))))))
--8<---------------cut here---------------end--------------->8---

Full example attached.

Ludo’.

[bare-bones.tmpl (text/plain, inline)]
;; This is an operating system configuration template
;; for a "bare bones" setup, with no X11 display server.

(use-modules (gnu) (guix) (guix git))
(use-service-modules networking ssh web vpn shepherd)
(use-package-modules linux screen ssh
                     admin autotools gettext man texinfo)

(operating-system
  (host-name "komputilo")
  (timezone "Europe/Berlin")
  (locale "en_US.utf8")

  ;; Boot in "legacy" BIOS mode, assuming /dev/sdX is the
  ;; target hard disk, and "my-root" is the label of the target
  ;; root file system.
  (bootloader (bootloader-configuration
               (bootloader grub-bootloader)
               (targets '("/dev/sdX"))))
  ;; It's fitting to support the equally bare bones ‘-nographic’
  ;; QEMU option, which also nicely sidesteps forcing QWERTY.
  (kernel-arguments (list "console=ttyS0,115200"))
  (file-systems (cons (file-system
                        (device (file-system-label "my-root"))
                        (mount-point "/")
                        (type "ext4"))
                      %base-file-systems))

  ;; This is where user accounts are specified.  The "root"
  ;; account is implicit, and is initially created with the
  ;; empty password.
  (users (cons (user-account
                (name "alice")
                (comment "Bob's sister")
                (group "users")

                ;; Adding the account to the "wheel" group
                ;; makes it a sudoer.  Adding it to "audio"
                ;; and "video" allows the user to play sound
                ;; and access the webcam.
                (supplementary-groups '("wheel"
                                        "audio" "video")))
               %base-user-accounts))

  ;; Globally-installed packages.
  (packages (append (list screen strace) %base-packages))

  (essential-services
   (modify-services (operating-system-default-essential-services
                     this-operating-system)
     (shepherd-root-service-type
      config =>
      (shepherd-configuration
       (shepherd (package
                   (inherit shepherd-0.9)
                   (version "0.9.3pre")
                   (source (git-checkout
                            (url "https://git.savannah.gnu.org/git/shepherd.git")))
                   (native-inputs
                    (modify-inputs (package-native-inputs shepherd-0.9)
                      (append autoconf automake help2man texinfo gnu-gettext)))))))))

  ;; Add services to the baseline: a DHCP client and
  ;; an SSH server.
  (services (append (list (service dhcp-client-service-type)
                          (service nginx-service-type
                                   (nginx-configuration
                                    (server-blocks
                                     (list (nginx-server-configuration
                                            (listen '("80"))
                                            (server-name '("www.example.org"))
                                            (root "/srv/whatever"))))))
                          (service wireguard-service-type
                                   (wireguard-configuration
                                    (addresses (list "10.0.0.2/24"))
                                    (dns '("10.0.0.50")))) ;does not exit


                          (service openssh-service-type
                                   (openssh-configuration
                                    (openssh openssh-sans-x)
                                    (port-number 2222))))
                    %base-services)))

Information forwarded to bug-guix <at> gnu.org:
bug#56674; Package guix. (Mon, 14 Nov 2022 16:33:02 GMT)

Message #24 received at 56674 <at> debbugs.gnu.org:

From: Ludovic Courtès <ludo <at> gnu.org>
To: 56674 <at> debbugs.gnu.org
Cc: Mathieu Othacehe <othacehe <at> gnu.org>, 58926 <at> debbugs.gnu.org
Subject: Re: bug#58926: Shepherd becomes unresponsive after an interrupt
Date: Mon, 14 Nov 2022 17:32:35 +0100

Hello!

Ludovic Courtès <ludo <at> gnu.org> wrote:

> These fresh Shepherd commits install a non-blocking ‘system*’ replacement:
>
>   975b0aa service: Provide a non-blocking replacement of 'system*'.
>   039c7a8 service: Spawn a fiber responsible for process monitoring.
>
> We’ll have to do more testing and probably go for a 0.9.3 release soon.

Shepherd commit ada88074f0ab7551fd0f3dce8bf06de971382e79 passes my
tests.  It definitely solves the wireguard example and similar things
(uses of ‘system*’ in service constructors/destructors); I can’t tell
for sure about nginx because I haven’t been able to reproduce it in a
VM.  I’m interested in ways to reproduce it.

It does look like we could go with 0.9.3 real soon now.

Ludo’.




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 15 Dec 2022 12:24:07 GMT)
