GNU bug report logs - #63736
[Shepherd] Loss of SIGCHLD notifications

Package: guix;

Reported by: Ludovic Courtès <ludo <at> gnu.org>

Date: Fri, 26 May 2023 12:15:02 UTC

Severity: important

Done: Ludovic Courtès <ludo <at> gnu.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 63736 in the body.
You can then email your comments to 63736 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox

Report forwarded to bug-guix <at> gnu.org:
bug#63736; Package guix. (Fri, 26 May 2023 12:15:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Ludovic Courtès <ludo <at> gnu.org>:
New bug report received and forwarded. Copy sent to bug-guix <at> gnu.org. (Fri, 26 May 2023 12:15:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: bug-guix <at> gnu.org
Subject: [Shepherd] Loss of SIGCHLD notifications
Date: Fri, 26 May 2023 14:14:09 +0200

I experienced, with the Shepherd 0.10.0, a situation where ‘herd
restart’ would get stuck after the end-of-grace-period expiration
(restarting nscd in this case):

--8<---------------cut here---------------start------------->8---
May 26 08:44:33 localhost shepherd[1]: [NetworkManager] Status of nscd: 
May 26 08:44:33 localhost shepherd[1]: [NetworkManager]   It is running since 08:43:57 (36 seconds ago). 
May 26 08:44:33 localhost shepherd[1]: [NetworkManager]   Running value is 23753. 
May 26 08:44:33 localhost shepherd[1]: [NetworkManager]   It is enabled. 
May 26 08:44:33 localhost shepherd[1]: [NetworkManager]   Provides (nscd). 
May 26 08:44:33 localhost shepherd[1]: [NetworkManager]   Requires (user-processes syslogd). 
May 26 08:44:33 localhost shepherd[1]: [NetworkManager]   Will be respawned. 
May 26 08:44:33 localhost shepherd[1]: Stopping service nscd... 
[...]
May 26 08:44:38 localhost shepherd[1]: Grace period of 5 seconds is over; sending -23753 SIGKILL. 
--8<---------------cut here---------------end--------------->8---

The ‘herd restart’ process is indeed waiting, and the nscd process is
still around as zombie, meaning that shepherd didn’t call waitpid(2):

--8<---------------cut here---------------start------------->8---
$ sudo herd status nscd
Status of nscd:
  It is being stopped.
  It is enabled.
  Provides (nscd).
  Requires (user-processes syslogd).
  Will be respawned.
$ ps aux|grep nscd
root     23753  0.0  0.0      0     0 ?        Zs   08:43   0:00 [nscd] <defunct>
root     23870  0.0  0.1  49968 18088 ?        Sl   08:44   0:00 /gnu/store/4gvgcfdiz67wv04ihqfa8pqwzsb0qpv5-guile-3.0.9/bin/guile --no-auto-compile /run/current-system/profile/bin/herd restart nscd
ludo     25280 33.3  0.0   6112  2304 pts/0    S+   09:32   0:00 grep --color=auto nscd
--8<---------------cut here---------------end--------------->8---

At that point, if I pick another service process and kill it, shepherd
again fails to react to SIGCHLD (or doesn’t receive it):

--8<---------------cut here---------------start------------->8---
$ sudo herd status ntpd
Status of ntpd:
  It is running since 08:44:31 AM (52 minutes ago).
  Running value is 23814.
  It is enabled.
  Provides (ntpd).
  Requires (user-processes networking).
  Will be respawned.
$ sudo kill 23814
$ sudo herd status ntpd
Status of ntpd:
  It is running since 08:44:31 AM (52 minutes ago).
  Running value is 23814.
  It is enabled.
  Provides (ntpd).
  Requires (user-processes networking).
  Will be respawned.
ludo <at> ribbon ~/src/guix$ ps 23814
  PID TTY      STAT   TIME COMMAND
23814 ?        Zs     0:00 [ntpd] <defunct>
--8<---------------cut here---------------end--------------->8---

If I repeat the same experiment while stracing shepherd, here’s what I
see:

--8<---------------cut here---------------start------------->8---
$ sudo herd status guix-publish
Status of guix-publish:
  It is running since 08:44:32 AM (55 minutes ago).
  Running value is 23822.
  It is enabled.
  Provides (guix-publish).
  Requires (user-processes guix-daemon avahi-daemon).
  Will be respawned.
$ sudo kill 23822
$ ps 23822
  PID TTY      STAT   TIME COMMAND
23822 ?        Zs     0:02 [guix publish] <defunct>
--8<---------------cut here---------------end--------------->8---

… and the strace:

--8<---------------cut here---------------start------------->8---
1     rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0
1     epoll_wait(13, [{events=EPOLLIN, data={u32=14, u64=14}}], 8, -1) = 1
1     accept4(14, {sa_family=AF_UNIX}, [112 => 2], SOCK_CLOEXEC|SOCK_NONBLOCK) = 25
1     accept4(14, 0x7ffe6f8f0be0, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
1     epoll_ctl(13, EPOLL_CTL_MOD, 14, {events=EPOLLIN|EPOLLRDHUP|EPOLLONESHOT, data={u32=14, u64=14}}) = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     read(25, "(shepherd-command (version 0) (action status) (service guix-publish) (arguments ()) (directory \"/home/ludo/src/guix\"))", 1024) = 118
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     write(25, "(reply (version 0) (result ((service (version 0) (provides (guix-publish)) (requires (user-processes guix-daemon avahi-daemon)) (respawn? #t) (docstring \"[No documentation.]\") (enabled? #t) (running 23822) (conflicts ()) (last-respawns ()) (status-changes ((running . 1685083472) (starting . 1685083472) (stopped . 1685083469) (stopping . 1685083469) (running . 1684924319) (starting . 1684924319) (stopped . 1684924319) (stopping . 1684924319) (running . 1684853459) (starting . 1684853459) (stopped"..., 686) = 686
1     close(25)                         = 0
1     rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0
1     epoll_wait(13, [{events=EPOLLHUP, data={u32=30, u64=30}}], 8, -1) = 1
1     read(30, "", 4096)                = 0
1     close(30)                         = 0
1     close(36)                         = 0
1     rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0
1     epoll_wait(13, 
--8<---------------cut here---------------end--------------->8---

The signal FD still exists but we didn’t see any activity on it:

--8<---------------cut here---------------start------------->8---
$ sudo ls -lrt /proc/1/fd |grep signalfd
lrwx------ 1 root root 64 May 26 09:41 10 -> anon_inode:[signalfd]
--8<---------------cut here---------------end--------------->8---

Here’s the signal handling status (notice two signals queued):

--8<---------------cut here---------------start------------->8---
$ sudo cat /proc/1/status|grep -i Sig
SigQ:   2/63467
SigPnd: 0000000000000000
SigBlk: 0000000000014003
SigIgn: 0000000000001000
SigCgt: 0000000120814423
--8<---------------cut here---------------end--------------->8---

With:

  (define (signal-mask->list mask)
    (let loop ((sig 32) (lst '()))
      (if (zero? sig)
          lst
          (loop (- sig 1)
                (if (bit-set? (- sig 1) mask)
                    (cons sig lst)
                    lst)))))

… we see that the block mask is equal to ‘%precious-signals’ (good):

--8<---------------cut here---------------start------------->8---
scheme@(shepherd system)> (signal-mask->list #x0000000000014003)
$33 = (1 2 15 17)
--8<---------------cut here---------------end--------------->8---

… while the ignored mask lists SIGPIPE:

--8<---------------cut here---------------start------------->8---
scheme@(shepherd system)> (signal-mask->list #x0000000000001000)
$32 = (13)
--8<---------------cut here---------------end--------------->8---

However, the process also has 3 GC marker threads and 1 finalization
thread:

--8<---------------cut here---------------start------------->8---
(gdb) info threads
  Id   Target Id                                     Frame 
* 1    Thread 0x7fdbd530f380 (LWP 1) "shepherd"      0x00007fdbd5415626 in epoll_wait (epfd=13, 
    events=0x7fdbd203a1a0, maxevents=8, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  2    Thread 0x7fdbd4cd9640 (LWP 118) "GC-marker-0" __futex_abstimed_wait_common64 (private=0, 
    cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7fdbd5857be8 <mark_cv+40>)
    at futex-internal.c:57
  3    Thread 0x7fdbd44d8640 (LWP 119) "GC-marker-1" __futex_abstimed_wait_common64 (private=0, 
    cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7fdbd5857be8 <mark_cv+40>)
    at futex-internal.c:57
  4    Thread 0x7fdbd3cd7640 (LWP 120) "GC-marker-2" __futex_abstimed_wait_common64 (private=0, 
    cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7fdbd5857be8 <mark_cv+40>)
    at futex-internal.c:57
  5    Thread 0x7fdbd3389640 (LWP 123) "shepherd"    __GI___libc_read (nbytes=1, 
    buf=0x7fdbd3388620, fd=6) at ../sysdeps/unix/sysv/linux/read.c:26
--8<---------------cut here---------------end--------------->8---

These threads seem to be blocking every signal, which is good:

--8<---------------cut here---------------start------------->8---
$ sudo cat /proc/118/status |grep ^Sig
SigQ:   2/63467
SigPnd: 0000000000000000
SigBlk: fffffffe5f7bfeff
SigIgn: 0000000000001000
SigCgt: 0000000120814423
$ sudo cat /proc/119/status |grep ^Sig
SigQ:   2/63467
SigPnd: 0000000000000000
SigBlk: fffffffe5f7bfeff
SigIgn: 0000000000001000
SigCgt: 0000000120814423
$ sudo cat /proc/120/status |grep ^Sig
SigQ:   2/63467
SigPnd: 0000000000000000
SigBlk: fffffffe5f7bfeff
SigIgn: 0000000000001000
SigCgt: 0000000120814423
$ sudo cat /proc/123/status |grep ^Sig
SigQ:   2/63467
SigPnd: 0000000000000000
SigBlk: fffffffe5ffbfeff
SigIgn: 0000000000001000
SigCgt: 0000000120814423
--8<---------------cut here---------------end--------------->8---

Then all of a sudden, as I’m conducting this experiment, I see:

--8<---------------cut here---------------start------------->8---
May 26 09:36:37 localhost ntpd[23814]: ntpd exiting on signal 15 (Terminated)
[…]
May 26 10:52:10 localhost shepherd[1]: Respawning nscd. 
May 26 10:52:10 localhost shepherd[1]: Service guix-publish (PID 23822) terminated with signal 15. 
May 26 10:52:10 localhost shepherd[1]: Respawning ntpd. 
May 26 10:52:10 localhost shepherd[1]: Respawning guix-publish. 
May 26 10:52:10 localhost shepherd[1]: Starting service nscd... 
May 26 10:52:10 localhost shepherd[1]: Starting service ntpd... 
May 26 10:52:10 localhost shepherd[1]: Service ntpd has been started. 
May 26 10:52:10 localhost shepherd[1]: Service ntpd started. 
May 26 10:52:10 localhost shepherd[1]: Service ntpd running with value 29148. 
--8<---------------cut here---------------end--------------->8---

10:52 is about the time I attached gdb to shepherd…

Interestingly, we get a “terminated with signal 15” message from
shepherd for guix-publish, but not for ntpd.

Long story short: there seems to be a problem with signal delivery.
Most likely, the initial grace period expiration above when stopping
nscd is a symptom of shepherd no longer receiving/processing SIGCHLD
rather than the cause.

To be continued…

Ludo’.

Information forwarded to bug-guix <at> gnu.org:
bug#63736; Package guix. (Sat, 27 May 2023 17:02:02 GMT) Full text and rfc822 format available.

Message #8 received at 63736 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: 63736 <at> debbugs.gnu.org
Subject: Re: bug#63736: [Shepherd] Loss of SIGCHLD notifications
Date: Sat, 27 May 2023 19:00:55 +0200

Ludovic Courtès <ludo <at> gnu.org> skribis:

> Long story short: there seems to be a problem with signal delivery.
> Most likely, the initial grace period expiration above when stopping
> nscd is a symptom of shepherd no longer receiving/processing SIGCHLD
> rather than the cause.

Another possibility is lockup: one of the relevant fibers is either gone
or stuck in ‘put-message’ or ‘get-message’.

I did two things:

  b9a37f3 shepherd: Make signal handling fiber an essential task.
  8ae2780 service: Do not attempt to restart transient services.

Commit 8ae2780 fixes a bug whereby ‘herd restart’ could end up
attempting to restart a transient service, which would lock up the
calling fiber because the service’s controlling fiber would first
receive the 'terminate message, so it would return and nobody would be
reading further messages send on its channel.

Commit b9a37f3 will allows us to ensure that the signal-handling fiber
never exits (and we’ll get a trace in the log if it tries to).

Ludo’.

Severity set to 'important' from 'normal' Request was from Ludovic Courtès <ludo <at> gnu.org> to control <at> debbugs.gnu.org. (Sun, 28 May 2023 09:26:02 GMT) Full text and rfc822 format available.

Reply sent to Ludovic Courtès <ludo <at> gnu.org>:
You have taken responsibility. (Thu, 23 Nov 2023 20:48:02 GMT) Full text and rfc822 format available.

Notification sent to Ludovic Courtès <ludo <at> gnu.org>:
bug acknowledged by developer. (Thu, 23 Nov 2023 20:48:02 GMT) Full text and rfc822 format available.

Message #15 received at 63736-done <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: 63736-done <at> debbugs.gnu.org
Subject: Re: bug#63736: [Shepherd] Loss of SIGCHLD notifications
Date: Thu, 23 Nov 2023 21:46:49 +0100

Ludovic Courtès <ludo <at> gnu.org> skribis:

> Another possibility is lockup: one of the relevant fibers is either gone
> or stuck in ‘put-message’ or ‘get-message’.
>
> I did two things:
>
>   b9a37f3 shepherd: Make signal handling fiber an essential task.
>   8ae2780 service: Do not attempt to restart transient services.
>
> Commit 8ae2780 fixes a bug whereby ‘herd restart’ could end up
> attempting to restart a transient service, which would lock up the
> calling fiber because the service’s controlling fiber would first
> receive the 'terminate message, so it would return and nobody would be
> reading further messages send on its channel.
>
> Commit b9a37f3 will allows us to ensure that the signal-handling fiber
> never exits (and we’ll get a trace in the log if it tries to).

Apparently these commits, which made it in 0.10.1 months ago, fixed this
particular bug.  Closing!

Ludo’.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 22 Dec 2023 12:24:08 GMT) Full text and rfc822 format available.

This bug report was last modified 1 year and 201 days ago.

Previous Next

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.

GNU bug report logs - #63736 [Shepherd] Loss of SIGCHLD notifications

GNU bug report logs - #63736
[Shepherd] Loss of SIGCHLD notifications