GNU bug report logs - #57922
Shepherd doesn't seem to correctly handle waitpid itself

Package: guix;

Reported by: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>

Date: Mon, 19 Sep 2022 04:30:03 UTC

Severity: normal

Done: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 57922 in the body.
You can then email your comments to 57922 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox

Report forwarded to bug-guix <at> gnu.org:
bug#57922; Package guix. (Mon, 19 Sep 2022 04:30:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Maxim Cournoyer <maxim.cournoyer <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-guix <at> gnu.org. (Mon, 19 Sep 2022 04:30:03 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: bug-guix <bug-guix <at> gnu.org>
Subject: Shepherd doesn't seem to correctly handle waitpid itself
Date: Mon, 19 Sep 2022 00:29:44 -0400

Hi,

I've tried to determine why a workaround in the jami-service-type is
required in the 'stop' slot to avoid failures in 'herd restart jami',
and haven't quite found the culprit, but it appears to me that:

1. waipid is only called in one place in Shepherd, which is in the
handle-SIGCHLD procedure in (shepherd service), which does not
specifically wait for an exact PID but rather does:

(waitpid* WAIT_ANY WNOHANG), which is waitpid with some special handling
in the case a system-error exception is thrown with an ECHILD or EINTR
error number.

This doesn't strike me as a strong guarantee that waitpid occurs when
stop is called, because:

1. It requires to be installed in the signal handlers for each
processes, with something like:

--8<---------------cut here---------------start------------->8---
  (unless %sigchld-handler-installed?
    (sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP)
    (set! %sigchld-handler-installed? #t))
--8<---------------cut here---------------end--------------->8---

Done for fork+exec-command and make-inetd-forkexec-constructor, but not
for make-forkexec-constructor/container, AFAICT;

2. it has the WNOHANG flag, which means the stop simply does a kill the
the signal handling weakly (because of WNOHANG) waits on it, which means
the start may begin before the process was actually completely
terminated.

Here's a small reproducer to apply on our code base:

--8<---------------cut here---------------start------------->8---
modified   gnu/services/telephony.scm
@@ -685,13 +685,7 @@ (define (archive-name->username archive)
 
                     ;; Finally, return the PID of the daemon process.
                     daemon-pid))
-               (stop
-                #~(lambda (pid . args)
-                    (kill pid SIGKILL)
-                    ;; Wait for the process to exit; this prevents overlapping
-                    ;; processes when issuing 'herd restart'.
-                    (waitpid pid)
-                    #f))))))))
+               (stop #~(make-kill-destructor))))))))
 
 (define jami-service-type
   (service-type
--8<---------------cut here---------------end--------------->8---

Then run 'make check-system TESTS=jami-provisioning' to see new
failures, or if you want to investigate manually the system:

--8<---------------cut here---------------start------------->8---
$ ./pre-inst-env guix system vm --no-grafts --no-offload --no-graphic \
   -e '(@@ (gnu tests telephony) %jami-os-provisioning)'

$ /gnu/store/rxi7c14hga62qslb0sr6nac9qnkxr0nn-run-vm.sh -m 1G -smp 4 \
  -nic user,model=virtio-net-pci,hostfwd=tcp::10022-:22

# Connect to the QEMU VM:
$ ssh root <at> localhost -p10022

root <at> jami ~# herd restart jami
Service jami has been stopped.
herd: exception caught while executing 'start' on service 'jami':
dbus "method failed with error" "org.freedesktop.DBus.Error.NoReply" ("Message recipient disconnected from message bus without replying")
root <at> jami ~# herd status jami
Status of jami:
  It is stopped.
  It is enabled.
  Provides (jami).
  Requires (jami-dbus-session).
  Conflicts with ().
  Will be respawned.
root <at> jami ~# pgrep jami
--8<---------------cut here---------------end--------------->8---

Thanks,

Maxim

Information forwarded to bug-guix <at> gnu.org:
bug#57922; Package guix. (Tue, 20 Sep 2022 07:33:02 GMT) Full text and rfc822 format available.

Message #8 received at 57922 <at> debbugs.gnu.org (full text, mbox):

From: Josselin Poiret <dev <at> jpoiret.xyz>
To: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>, 57922 <at> debbugs.gnu.org
Subject: Re: bug#57922: Shepherd doesn't seem to correctly handle waitpid
 itself
Date: Tue, 20 Sep 2022 09:31:57 +0200

Hi Maxim,

Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:

> Hi,
>
> I've tried to determine why a workaround in the jami-service-type is
> required in the 'stop' slot to avoid failures in 'herd restart jami',
> and haven't quite found the culprit, but it appears to me that:
>
> 1. waipid is only called in one place in Shepherd, which is in the
> handle-SIGCHLD procedure in (shepherd service), which does not
> specifically wait for an exact PID but rather does:
>
> (waitpid* WAIT_ANY WNOHANG), which is waitpid with some special handling
> in the case a system-error exception is thrown with an ECHILD or EINTR
> error number.
>
> This doesn't strike me as a strong guarantee that waitpid occurs when
> stop is called, because:
>
> 1. It requires to be installed in the signal handlers for each
> processes, with something like:
>
> --8<---------------cut here---------------start------------->8---
>   (unless %sigchld-handler-installed?
>     (sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP)
>     (set! %sigchld-handler-installed? #t))
> --8<---------------cut here---------------end--------------->8---
>
> Done for fork+exec-command and make-inetd-forkexec-constructor, but not
> for make-forkexec-constructor/container, AFAICT;

The signal handler is only installed once in PID 1 (in fact, you haven't
forked yet here), since it's the one that receives the SIGCHLD.

What I don't understand that well is that this signal handler could be
installed only once when shepherd starts, right?  That way, it wouldn't
need to depend on specific start actions being chosen.

> 2. it has the WNOHANG flag, which means the stop simply does a kill the
> the signal handling weakly (because of WNOHANG) waits on it, which means
> the start may begin before the process was actually completely
> terminated.
>
> Here's a small reproducer to apply on our code base:
>
> --8<---------------cut here---------------start------------->8---
> modified   gnu/services/telephony.scm
> @@ -685,13 +685,7 @@ (define (archive-name->username archive)
>  
>                      ;; Finally, return the PID of the daemon process.
>                      daemon-pid))
> -               (stop
> -                #~(lambda (pid . args)
> -                    (kill pid SIGKILL)
> -                    ;; Wait for the process to exit; this prevents overlapping
> -                    ;; processes when issuing 'herd restart'.
> -                    (waitpid pid)
> -                    #f))))))))
> +               (stop #~(make-kill-destructor))))))))
>  
>  (define jami-service-type
>    (service-type
> --8<---------------cut here---------------end--------------->8---

The real problem here is not really the WNOHANG flag (you could remove
that and still get issues) but rather that the waitpid is run inside a
signal handler, which in Guile means that it's run through asyncs.  You
have no guarantees wrt. when asyncs run, so they could run after or in
the middle of the next action.  I also think make-kill-destructor should
waitpid the processes it's killing, as you're implying, and leave the
signal handler only for unexpected service crashes.

Best,
-- 
Josselin Poiret

Information forwarded to bug-guix <at> gnu.org:
bug#57922; Package guix. (Fri, 23 Sep 2022 06:34:02 GMT) Full text and rfc822 format available.

Message #11 received at 57922 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: Josselin Poiret <dev <at> jpoiret.xyz>
Cc: 57922 <at> debbugs.gnu.org, Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Subject: Re: bug#57922: Shepherd doesn't seem to correctly handle waitpid
 itself
Date: Fri, 23 Sep 2022 08:33:28 +0200

Hi,

Josselin Poiret <dev <at> jpoiret.xyz> skribis:

> Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:

[...]

>> 1. It requires to be installed in the signal handlers for each
>> processes, with something like:
>>
>> --8<---------------cut here---------------start------------->8---
>>   (unless %sigchld-handler-installed?
>>     (sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP)
>>     (set! %sigchld-handler-installed? #t))
>> --8<---------------cut here---------------end--------------->8---
>>
>> Done for fork+exec-command and make-inetd-forkexec-constructor, but not
>> for make-forkexec-constructor/container, AFAICT;
>
> The signal handler is only installed once in PID 1 (in fact, you haven't
> forked yet here), since it's the one that receives the SIGCHLD.

Right.

> What I don't understand that well is that this signal handler could be
> installed only once when shepherd starts, right?  That way, it wouldn't
> need to depend on specific start actions being chosen.

The SIGCHLD handler is installed lazily since
f776de04e6702e18d95152072e78c43441d3ccc3.  The rationale was discussed
here:

  https://issues.guix.gnu.org/27553

That said, on GNU/Linux, SIGCHLD is actually blocked and instead we rely
on signalfd(2).  It’s from the main even loop in shepherd.scm that the
signal handler is called.

>> Here's a small reproducer to apply on our code base:
>>
>> --8<---------------cut here---------------start------------->8---
>> modified   gnu/services/telephony.scm
>> @@ -685,13 +685,7 @@ (define (archive-name->username archive)
>>  
>>                      ;; Finally, return the PID of the daemon process.
>>                      daemon-pid))
>> -               (stop
>> -                #~(lambda (pid . args)
>> -                    (kill pid SIGKILL)
>> -                    ;; Wait for the process to exit; this prevents overlapping
>> -                    ;; processes when issuing 'herd restart'.
>> -                    (waitpid pid)
>> -                    #f))))))))
>> +               (stop #~(make-kill-destructor))))))))

I think the main difference between these two is that the first one uses
SIGKILL while the second one uses SIGTERM.

You could try #~(make-kill-destructor SIGKILL) to get the same effect.

(Another difference is that ‘make-kill-destructor’ kills the process
group, not just the process itself.)

Anyway, the key point is that shepherd takes care of calling ‘waitpid’
for its child processes (services).  If you call it yourself as in the
snippet above, you’re racing with shepherd; in the case above it
probably doesn’t make any difference though because it will consider
that the service is stopped in any case.

HTH!

Ludo’.

Reply sent to Maxim Cournoyer <maxim.cournoyer <at> gmail.com>:
You have taken responsibility. (Fri, 23 Sep 2022 17:50:02 GMT) Full text and rfc822 format available.

Notification sent to Maxim Cournoyer <maxim.cournoyer <at> gmail.com>:
bug acknowledged by developer. (Fri, 23 Sep 2022 17:50:02 GMT) Full text and rfc822 format available.

Message #16 received at 57922-done <at> debbugs.gnu.org (full text, mbox):

From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: Josselin Poiret <dev <at> jpoiret.xyz>, 57922-done <at> debbugs.gnu.org
Subject: Re: bug#57922: Shepherd doesn't seem to correctly handle waitpid
 itself
Date: Fri, 23 Sep 2022 13:49:26 -0400

tags 57922 +notabug
thanks

Hi Ludo!

Ludovic Courtès <ludo <at> gnu.org> writes:

[...]

>> What I don't understand that well is that this signal handler could be
>> installed only once when shepherd starts, right?  That way, it wouldn't
>> need to depend on specific start actions being chosen.
>
> The SIGCHLD handler is installed lazily since
> f776de04e6702e18d95152072e78c43441d3ccc3.  The rationale was discussed
> here:
>
>   https://issues.guix.gnu.org/27553
>
> That said, on GNU/Linux, SIGCHLD is actually blocked and instead we rely
> on signalfd(2).  It’s from the main even loop in shepherd.scm that the
> signal handler is called.

I had missed that, thanks for explaining.

>>> Here's a small reproducer to apply on our code base:
>>>
>>> --8<---------------cut here---------------start------------->8---
>>> modified   gnu/services/telephony.scm
>>> @@ -685,13 +685,7 @@ (define (archive-name->username archive)
>>>
>>>                      ;; Finally, return the PID of the daemon process.
>>>                      daemon-pid))
>>> -               (stop
>>> -                #~(lambda (pid . args)
>>> -                    (kill pid SIGKILL)
>>> -                    ;; Wait for the process to exit; this prevents overlapping
>>> -                    ;; processes when issuing 'herd restart'.
>>> -                    (waitpid pid)
>>> -                    #f))))))))
>>> +               (stop #~(make-kill-destructor))))))))
>
> I think the main difference between these two is that the first one uses
> SIGKILL while the second one uses SIGTERM.
>
> You could try #~(make-kill-destructor SIGKILL) to get the same effect.

You are right, the important difference was SIGTERM vs SIGKILL.  I
thought I had tried that.  The problem only shows itself in the
'jami-provisioning' system test, not the 'jami' one.

Marking this one as notabug and closing.

Thanks again!

Maxim

Information forwarded to bug-guix <at> gnu.org:
bug#57922; Package guix. (Sat, 24 Sep 2022 03:34:02 GMT) Full text and rfc822 format available.

Message #19 received at 57922 <at> debbugs.gnu.org (full text, mbox):

From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: Josselin Poiret <dev <at> jpoiret.xyz>, 57922 <at> debbugs.gnu.org
Subject: Re: bug#57922: Shepherd doesn't seem to correctly handle waitpid
 itself
Date: Fri, 23 Sep 2022 23:32:52 -0400

reopen 57922
tags 57922 -notabug
thanks

Hi again,

[...]

>>> Here's a small reproducer to apply on our code base:
>>>
>>> --8<---------------cut here---------------start------------->8---
>>> modified   gnu/services/telephony.scm
>>> @@ -685,13 +685,7 @@ (define (archive-name->username archive)
>>>
>>>                      ;; Finally, return the PID of the daemon process.
>>>                      daemon-pid))
>>> -               (stop
>>> -                #~(lambda (pid . args)
>>> -                    (kill pid SIGKILL)
>>> -                    ;; Wait for the process to exit; this prevents overlapping
>>> -                    ;; processes when issuing 'herd restart'.
>>> -                    (waitpid pid)
>>> -                    #f))))))))
>>> +               (stop #~(make-kill-destructor))))))))
>
> I think the main difference between these two is that the first one uses
> SIGKILL while the second one uses SIGTERM.
>
> You could try #~(make-kill-destructor SIGKILL) to get the same effect.

> You are right, the important difference was SIGTERM vs SIGKILL.  I
> thought I had tried that.  The problem only shows itself in the
> 'jami-provisioning' system test, not the 'jami' one.

> Marking this one as notabug and closing.

I think I spoke too soon.  SIGKILL does fix the problem when *not* using
waitpid explicitly, but when using waitpid explicitly, SIGTERM can be
used just fine.  In other words, this works:

--8<---------------cut here---------------start------------->8---
@@ -687,7 +687,7 @@ (define (archive-name->username archive)
                     daemon-pid))
                (stop
                 #~(lambda (pid . args)
-                    (kill pid SIGKILL)
+                    (kill pid SIGTERM)
                     ;; Wait for the process to exit; this prevents overlapping
                     ;; processes when issuing 'herd restart'.
                     (waitpid pid)
--8<---------------cut here---------------end--------------->8---

but this doesn't:

--8<---------------cut here---------------start------------->8---
@@ -685,13 +685,7 @@ (define (archive-name->username archive)
 
                     ;; Finally, return the PID of the daemon process.
                     daemon-pid))
-               (stop
-                #~(lambda (pid . args)
-                    (kill pid SIGKILL)
-                    ;; Wait for the process to exit; this prevents overlapping
-                    ;; processes when issuing 'herd restart'.
-                    (waitpid pid)
-                    #f))))))))
+               (stop #~(make-kill-destructor))))))))
 
 (define jami-service-type
--8<---------------cut here---------------end--------------->8---

when exercised with 'make check-system TESTS=jami-provisioning':

--8<---------------cut here---------------start------------->8---
This is the GNU system.  Welcome.
jami login: Jami Daemon 13.4.0, by Savoir-faire Linux 2004-2019
https://jami.net/
[Video support enabled]
[Plugins support enabled]

23:29:05.375         os_core_unix.c !pjlib 2.12.1 for POSIX initialized
shepherd: Service jami has been stopped.
Caught signal Terminated, terminating...

Some deprecated features have been used.  Set the environment
variable GUILE_WARN_DEPRECATED to "detailed" and rerun the
program to get more information.  Set it to "no" to suppress
this message.
Jami Daemon 13.4.0, by Savoir-faire Linux 2004-2019
https://jami.net/
[Video support enabled]
[Plugins support enabled]

One does not simply initialize the client: Another daemon is detected
/gnu/store/2vcv1fyqfyym2zcyf5bvbj1pcgbcc515-shepherd-marionette.scm:1:1718: ERROR:
  1. &action-exception-error:
      service: jami
      action: start
      key: misc-error
      args: (#f "~A ~S ~S ~S" (dbus "method failed with error" "org.freedesktop.DBus.Error.NoReply" ("Message recipient disconnected from message bus without replying")) #f)
--8<---------------cut here---------------end--------------->8---
      
or manually through the test VM:

--8<---------------cut here---------------start------------->8---
$(./pre-inst-env guix system vm --no-graphic --no-grafts --no-offload \
  -e '(@@ (gnu tests telephony) %jami-os-provisioning)')  \
  -m 1G -smp $(nproc) "-nic" user,model=virtio-net-pci,hostfwd=tcp::10022-:22
--8<---------------cut here---------------end--------------->8---

This leads me to believe that Shepherd does not block until the process
is actually dead to mark the process as stopped (it just waitpid on the
group pid with WNOHANG), which means it won't block if the child process
hasn't exited yet, if I'm correct.

When we are in the stop slot, we know for sure that the process should
terminate completely, hence it'd make sense to call 'waitpid' *without*
WNOHANG there, to avoid 'herd restart' from starting the service while
its stopped process is not done terminating.

jamid can take quite some time to terminate cleanly because of the
networking threads in the opendht library that needs to be finalized,
which is probably the reason this problem can be observed here.

Thoughts?

Maxim

Information forwarded to bug-guix <at> gnu.org:
bug#57922; Package guix. (Sat, 24 Sep 2022 08:10:01 GMT) Full text and rfc822 format available.

Message #22 received at 57922 <at> debbugs.gnu.org (full text, mbox):

From: Josselin Poiret <dev <at> jpoiret.xyz>
To: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>, Ludovic Courtès <ludo <at> gnu.org>
Cc: 57922 <at> debbugs.gnu.org
Subject: Re: bug#57922: Shepherd doesn't seem to correctly handle waitpid
 itself
Date: Sat, 24 Sep 2022 10:09:00 +0200

Hi everyone,

Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:

> This leads me to believe that Shepherd does not block until the process
> is actually dead to mark the process as stopped (it just waitpid on the
> group pid with WNOHANG), which means it won't block if the child process
> hasn't exited yet, if I'm correct.
>
> When we are in the stop slot, we know for sure that the process should
> terminate completely, hence it'd make sense to call 'waitpid' *without*
> WNOHANG there, to avoid 'herd restart' from starting the service while
> its stopped process is not done terminating.
>
> jamid can take quite some time to terminate cleanly because of the
> networking threads in the opendht library that needs to be finalized,
> which is probably the reason this problem can be observed here.
>
> Thoughts?

I agree with you, make-kill-destructor should waitpid the processes it's
killing.  There shouldn't be any issues waitpid'ing before the
shepherd's signal handler, since stop actions are run with asyncs
disabled.  The signal handler will run once but won't get anything
because all the processes were already waitpid'd and it uses WNOHANG.

Best,
-- 
Josselin Poiret

Information forwarded to bug-guix <at> gnu.org:
bug#57922; Package guix. (Sat, 24 Sep 2022 16:31:02 GMT) Full text and rfc822 format available.

Message #25 received at 57922 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: Josselin Poiret <dev <at> jpoiret.xyz>
Cc: 57922 <at> debbugs.gnu.org, Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Subject: Re: bug#57922: Shepherd doesn't seem to correctly handle waitpid
 itself
Date: Sat, 24 Sep 2022 18:30:07 +0200

Hi,

Josselin Poiret <dev <at> jpoiret.xyz> skribis:

> Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:
>
>> This leads me to believe that Shepherd does not block until the process
>> is actually dead to mark the process as stopped (it just waitpid on the
>> group pid with WNOHANG), which means it won't block if the child process
>> hasn't exited yet, if I'm correct.

Correct: the service is marked as stopped as soon as ‘stop’ returns.

>> When we are in the stop slot, we know for sure that the process should
>> terminate completely, hence it'd make sense to call 'waitpid' *without*
>> WNOHANG there, to avoid 'herd restart' from starting the service while
>> its stopped process is not done terminating.
>>
>> jamid can take quite some time to terminate cleanly because of the
>> networking threads in the opendht library that needs to be finalized,
>> which is probably the reason this problem can be observed here.
>>
>> Thoughts?
>
> I agree with you, make-kill-destructor should waitpid the processes it's
> killing.  There shouldn't be any issues waitpid'ing before the
> shepherd's signal handler, since stop actions are run with asyncs
> disabled.  The signal handler will run once but won't get anything
> because all the processes were already waitpid'd and it uses WNOHANG.

I think we need an extra “stopping” state for services.  In general,
we’ll want to send SIGTERM, wait for some grace period or dead process
notification, then send SIGKILL, and finally change state to “stopped”.

This is not possible in 0.9 but is something I’d like to have in 0.10¹.

Ludo’.

¹ https://lists.gnu.org/archive/html/guix-devel/2022-06/msg00350.html

Information forwarded to bug-guix <at> gnu.org:
bug#57922; Package guix. (Mon, 26 Sep 2022 00:13:02 GMT) Full text and rfc822 format available.

Message #28 received at 57922 <at> debbugs.gnu.org (full text, mbox):

From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: Josselin Poiret <dev <at> jpoiret.xyz>, 57922 <at> debbugs.gnu.org
Subject: Re: bug#57922: Shepherd doesn't seem to correctly handle waitpid
 itself
Date: Sun, 25 Sep 2022 20:12:09 -0400

Hi,

Ludovic Courtès <ludo <at> gnu.org> writes:

> Hi,
>
> Josselin Poiret <dev <at> jpoiret.xyz> skribis:
>
>> Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:
>>
>>> This leads me to believe that Shepherd does not block until the process
>>> is actually dead to mark the process as stopped (it just waitpid on the
>>> group pid with WNOHANG), which means it won't block if the child process
>>> hasn't exited yet, if I'm correct.
>
> Correct: the service is marked as stopped as soon as ‘stop’ returns.
>
>>> When we are in the stop slot, we know for sure that the process should
>>> terminate completely, hence it'd make sense to call 'waitpid' *without*
>>> WNOHANG there, to avoid 'herd restart' from starting the service while
>>> its stopped process is not done terminating.
>>>
>>> jamid can take quite some time to terminate cleanly because of the
>>> networking threads in the opendht library that needs to be finalized,
>>> which is probably the reason this problem can be observed here.
>>>
>>> Thoughts?
>>
>> I agree with you, make-kill-destructor should waitpid the processes it's
>> killing.  There shouldn't be any issues waitpid'ing before the
>> shepherd's signal handler, since stop actions are run with asyncs
>> disabled.  The signal handler will run once but won't get anything
>> because all the processes were already waitpid'd and it uses WNOHANG.
>
> I think we need an extra “stopping” state for services.  In general,
> we’ll want to send SIGTERM, wait for some grace period or dead process
> notification, then send SIGKILL, and finally change state to “stopped”.
>
> This is not possible in 0.9 but is something I’d like to have in 0.10¹.

This sounds good.  Let's keep this ticket open until this goodness
lands, as a reminder.

Thank you!

Maxim

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 24 Oct 2022 11:24:05 GMT) Full text and rfc822 format available.

This bug report was last modified 2 years and 268 days ago.

Previous Next

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.

GNU bug report logs - #57922 Shepherd doesn't seem to correctly handle waitpid itself

GNU bug report logs - #57922
Shepherd doesn't seem to correctly handle waitpid itself