Package: guix;
Reported by: Ludovic Courtès <ludo <at> gnu.org>
Date: Fri, 26 May 2023 12:15:02 UTC
Severity: important
Done: Ludovic Courtès <ludo <at> gnu.org>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 63736 in the body.
You can then email your comments to 63736 AT debbugs.gnu.org in the normal way.
Toggle the display of automated, internal messages from the tracker.
View this report as an mbox folder, status mbox, maintainer mbox
bug-guix <at> gnu.org
:bug#63736
; Package guix
.
(Fri, 26 May 2023 12:15:02 GMT) Full text and rfc822 format available.Ludovic Courtès <ludo <at> gnu.org>
:bug-guix <at> gnu.org
.
(Fri, 26 May 2023 12:15:02 GMT) Full text and rfc822 format available.Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
From: Ludovic Courtès <ludo <at> gnu.org> To: bug-guix <at> gnu.org Subject: [Shepherd] Loss of SIGCHLD notifications Date: Fri, 26 May 2023 14:14:09 +0200
I experienced, with the Shepherd 0.10.0, a situation where ‘herd restart’ would get stuck after the end-of-grace-period expiration (restarting nscd in this case): --8<---------------cut here---------------start------------->8--- May 26 08:44:33 localhost shepherd[1]: [NetworkManager] Status of nscd: May 26 08:44:33 localhost shepherd[1]: [NetworkManager] It is running since 08:43:57 (36 seconds ago). May 26 08:44:33 localhost shepherd[1]: [NetworkManager] Running value is 23753. May 26 08:44:33 localhost shepherd[1]: [NetworkManager] It is enabled. May 26 08:44:33 localhost shepherd[1]: [NetworkManager] Provides (nscd). May 26 08:44:33 localhost shepherd[1]: [NetworkManager] Requires (user-processes syslogd). May 26 08:44:33 localhost shepherd[1]: [NetworkManager] Will be respawned. May 26 08:44:33 localhost shepherd[1]: Stopping service nscd... [...] May 26 08:44:38 localhost shepherd[1]: Grace period of 5 seconds is over; sending -23753 SIGKILL. --8<---------------cut here---------------end--------------->8--- The ‘herd restart’ process is indeed waiting, and the nscd process is still around as zombie, meaning that shepherd didn’t call waitpid(2): --8<---------------cut here---------------start------------->8--- $ sudo herd status nscd Status of nscd: It is being stopped. It is enabled. Provides (nscd). Requires (user-processes syslogd). Will be respawned. $ ps aux|grep nscd root 23753 0.0 0.0 0 0 ? Zs 08:43 0:00 [nscd] <defunct> root 23870 0.0 0.1 49968 18088 ? Sl 08:44 0:00 /gnu/store/4gvgcfdiz67wv04ihqfa8pqwzsb0qpv5-guile-3.0.9/bin/guile --no-auto-compile /run/current-system/profile/bin/herd restart nscd ludo 25280 33.3 0.0 6112 2304 pts/0 S+ 09:32 0:00 grep --color=auto nscd --8<---------------cut here---------------end--------------->8--- At that point, if I pick another service process and kill it, shepherd again fails to react to SIGCHLD (or doesn’t receive it): --8<---------------cut here---------------start------------->8--- $ sudo herd status ntpd Status of ntpd: It is running since 08:44:31 AM (52 minutes ago). Running value is 23814. It is enabled. Provides (ntpd). Requires (user-processes networking). Will be respawned. $ sudo kill 23814 $ sudo herd status ntpd Status of ntpd: It is running since 08:44:31 AM (52 minutes ago). Running value is 23814. It is enabled. Provides (ntpd). Requires (user-processes networking). Will be respawned. ludo <at> ribbon ~/src/guix$ ps 23814 PID TTY STAT TIME COMMAND 23814 ? Zs 0:00 [ntpd] <defunct> --8<---------------cut here---------------end--------------->8--- If I repeat the same experiment while stracing shepherd, here’s what I see: --8<---------------cut here---------------start------------->8--- $ sudo herd status guix-publish Status of guix-publish: It is running since 08:44:32 AM (55 minutes ago). Running value is 23822. It is enabled. Provides (guix-publish). Requires (user-processes guix-daemon avahi-daemon). Will be respawned. $ sudo kill 23822 $ ps 23822 PID TTY STAT TIME COMMAND 23822 ? Zs 0:02 [guix publish] <defunct> --8<---------------cut here---------------end--------------->8--- … and the strace: --8<---------------cut here---------------start------------->8--- 1 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0 1 epoll_wait(13, [{events=EPOLLIN, data={u32=14, u64=14}}], 8, -1) = 1 1 accept4(14, {sa_family=AF_UNIX}, [112 => 2], SOCK_CLOEXEC|SOCK_NONBLOCK) = 25 1 accept4(14, 0x7ffe6f8f0be0, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable) 1 epoll_ctl(13, EPOLL_CTL_MOD, 14, {events=EPOLLIN|EPOLLRDHUP|EPOLLONESHOT, data={u32=14, u64=14}}) = 0 1 epoll_wait(13, [], 8, 0) = 0 1 read(25, "(shepherd-command (version 0) (action status) (service guix-publish) (arguments ()) (directory \"/home/ludo/src/guix\"))", 1024) = 118 1 epoll_wait(13, [], 8, 0) = 0 1 epoll_wait(13, [], 8, 0) = 0 1 epoll_wait(13, [], 8, 0) = 0 1 epoll_wait(13, [], 8, 0) = 0 1 epoll_wait(13, [], 8, 0) = 0 1 epoll_wait(13, [], 8, 0) = 0 1 epoll_wait(13, [], 8, 0) = 0 1 epoll_wait(13, [], 8, 0) = 0 1 epoll_wait(13, [], 8, 0) = 0 1 epoll_wait(13, [], 8, 0) = 0 1 epoll_wait(13, [], 8, 0) = 0 1 epoll_wait(13, [], 8, 0) = 0 1 epoll_wait(13, [], 8, 0) = 0 1 epoll_wait(13, [], 8, 0) = 0 1 epoll_wait(13, [], 8, 0) = 0 1 epoll_wait(13, [], 8, 0) = 0 1 write(25, "(reply (version 0) (result ((service (version 0) (provides (guix-publish)) (requires (user-processes guix-daemon avahi-daemon)) (respawn? #t) (docstring \"[No documentation.]\") (enabled? #t) (running 23822) (conflicts ()) (last-respawns ()) (status-changes ((running . 1685083472) (starting . 1685083472) (stopped . 1685083469) (stopping . 1685083469) (running . 1684924319) (starting . 1684924319) (stopped . 1684924319) (stopping . 1684924319) (running . 1684853459) (starting . 1684853459) (stopped"..., 686) = 686 1 close(25) = 0 1 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0 1 epoll_wait(13, [{events=EPOLLHUP, data={u32=30, u64=30}}], 8, -1) = 1 1 read(30, "", 4096) = 0 1 close(30) = 0 1 close(36) = 0 1 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0 1 epoll_wait(13, --8<---------------cut here---------------end--------------->8--- The signal FD still exists but we didn’t see any activity on it: --8<---------------cut here---------------start------------->8--- $ sudo ls -lrt /proc/1/fd |grep signalfd lrwx------ 1 root root 64 May 26 09:41 10 -> anon_inode:[signalfd] --8<---------------cut here---------------end--------------->8--- Here’s the signal handling status (notice two signals queued): --8<---------------cut here---------------start------------->8--- $ sudo cat /proc/1/status|grep -i Sig SigQ: 2/63467 SigPnd: 0000000000000000 SigBlk: 0000000000014003 SigIgn: 0000000000001000 SigCgt: 0000000120814423 --8<---------------cut here---------------end--------------->8--- With: (define (signal-mask->list mask) (let loop ((sig 32) (lst '())) (if (zero? sig) lst (loop (- sig 1) (if (bit-set? (- sig 1) mask) (cons sig lst) lst))))) … we see that the block mask is equal to ‘%precious-signals’ (good): --8<---------------cut here---------------start------------->8--- scheme@(shepherd system)> (signal-mask->list #x0000000000014003) $33 = (1 2 15 17) --8<---------------cut here---------------end--------------->8--- … while the ignored mask lists SIGPIPE: --8<---------------cut here---------------start------------->8--- scheme@(shepherd system)> (signal-mask->list #x0000000000001000) $32 = (13) --8<---------------cut here---------------end--------------->8--- However, the process also has 3 GC marker threads and 1 finalization thread: --8<---------------cut here---------------start------------->8--- (gdb) info threads Id Target Id Frame * 1 Thread 0x7fdbd530f380 (LWP 1) "shepherd" 0x00007fdbd5415626 in epoll_wait (epfd=13, events=0x7fdbd203a1a0, maxevents=8, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 2 Thread 0x7fdbd4cd9640 (LWP 118) "GC-marker-0" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7fdbd5857be8 <mark_cv+40>) at futex-internal.c:57 3 Thread 0x7fdbd44d8640 (LWP 119) "GC-marker-1" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7fdbd5857be8 <mark_cv+40>) at futex-internal.c:57 4 Thread 0x7fdbd3cd7640 (LWP 120) "GC-marker-2" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7fdbd5857be8 <mark_cv+40>) at futex-internal.c:57 5 Thread 0x7fdbd3389640 (LWP 123) "shepherd" __GI___libc_read (nbytes=1, buf=0x7fdbd3388620, fd=6) at ../sysdeps/unix/sysv/linux/read.c:26 --8<---------------cut here---------------end--------------->8--- These threads seem to be blocking every signal, which is good: --8<---------------cut here---------------start------------->8--- $ sudo cat /proc/118/status |grep ^Sig SigQ: 2/63467 SigPnd: 0000000000000000 SigBlk: fffffffe5f7bfeff SigIgn: 0000000000001000 SigCgt: 0000000120814423 $ sudo cat /proc/119/status |grep ^Sig SigQ: 2/63467 SigPnd: 0000000000000000 SigBlk: fffffffe5f7bfeff SigIgn: 0000000000001000 SigCgt: 0000000120814423 $ sudo cat /proc/120/status |grep ^Sig SigQ: 2/63467 SigPnd: 0000000000000000 SigBlk: fffffffe5f7bfeff SigIgn: 0000000000001000 SigCgt: 0000000120814423 $ sudo cat /proc/123/status |grep ^Sig SigQ: 2/63467 SigPnd: 0000000000000000 SigBlk: fffffffe5ffbfeff SigIgn: 0000000000001000 SigCgt: 0000000120814423 --8<---------------cut here---------------end--------------->8--- Then all of a sudden, as I’m conducting this experiment, I see: --8<---------------cut here---------------start------------->8--- May 26 09:36:37 localhost ntpd[23814]: ntpd exiting on signal 15 (Terminated) […] May 26 10:52:10 localhost shepherd[1]: Respawning nscd. May 26 10:52:10 localhost shepherd[1]: Service guix-publish (PID 23822) terminated with signal 15. May 26 10:52:10 localhost shepherd[1]: Respawning ntpd. May 26 10:52:10 localhost shepherd[1]: Respawning guix-publish. May 26 10:52:10 localhost shepherd[1]: Starting service nscd... May 26 10:52:10 localhost shepherd[1]: Starting service ntpd... May 26 10:52:10 localhost shepherd[1]: Service ntpd has been started. May 26 10:52:10 localhost shepherd[1]: Service ntpd started. May 26 10:52:10 localhost shepherd[1]: Service ntpd running with value 29148. --8<---------------cut here---------------end--------------->8--- 10:52 is about the time I attached gdb to shepherd… Interestingly, we get a “terminated with signal 15” message from shepherd for guix-publish, but not for ntpd. Long story short: there seems to be a problem with signal delivery. Most likely, the initial grace period expiration above when stopping nscd is a symptom of shepherd no longer receiving/processing SIGCHLD rather than the cause. To be continued… Ludo’.
bug-guix <at> gnu.org
:bug#63736
; Package guix
.
(Sat, 27 May 2023 17:02:02 GMT) Full text and rfc822 format available.Message #8 received at 63736 <at> debbugs.gnu.org (full text, mbox):
From: Ludovic Courtès <ludo <at> gnu.org> To: 63736 <at> debbugs.gnu.org Subject: Re: bug#63736: [Shepherd] Loss of SIGCHLD notifications Date: Sat, 27 May 2023 19:00:55 +0200
Ludovic Courtès <ludo <at> gnu.org> skribis: > Long story short: there seems to be a problem with signal delivery. > Most likely, the initial grace period expiration above when stopping > nscd is a symptom of shepherd no longer receiving/processing SIGCHLD > rather than the cause. Another possibility is lockup: one of the relevant fibers is either gone or stuck in ‘put-message’ or ‘get-message’. I did two things: b9a37f3 shepherd: Make signal handling fiber an essential task. 8ae2780 service: Do not attempt to restart transient services. Commit 8ae2780 fixes a bug whereby ‘herd restart’ could end up attempting to restart a transient service, which would lock up the calling fiber because the service’s controlling fiber would first receive the 'terminate message, so it would return and nobody would be reading further messages send on its channel. Commit b9a37f3 will allows us to ensure that the signal-handling fiber never exits (and we’ll get a trace in the log if it tries to). Ludo’.
Ludovic Courtès <ludo <at> gnu.org>
to control <at> debbugs.gnu.org
.
(Sun, 28 May 2023 09:26:02 GMT) Full text and rfc822 format available.Ludovic Courtès <ludo <at> gnu.org>
:Ludovic Courtès <ludo <at> gnu.org>
:Message #15 received at 63736-done <at> debbugs.gnu.org (full text, mbox):
From: Ludovic Courtès <ludo <at> gnu.org> To: 63736-done <at> debbugs.gnu.org Subject: Re: bug#63736: [Shepherd] Loss of SIGCHLD notifications Date: Thu, 23 Nov 2023 21:46:49 +0100
Ludovic Courtès <ludo <at> gnu.org> skribis: > Another possibility is lockup: one of the relevant fibers is either gone > or stuck in ‘put-message’ or ‘get-message’. > > I did two things: > > b9a37f3 shepherd: Make signal handling fiber an essential task. > 8ae2780 service: Do not attempt to restart transient services. > > Commit 8ae2780 fixes a bug whereby ‘herd restart’ could end up > attempting to restart a transient service, which would lock up the > calling fiber because the service’s controlling fiber would first > receive the 'terminate message, so it would return and nobody would be > reading further messages send on its channel. > > Commit b9a37f3 will allows us to ensure that the signal-handling fiber > never exits (and we’ll get a trace in the log if it tries to). Apparently these commits, which made it in 0.10.1 months ago, fixed this particular bug. Closing! Ludo’.
Debbugs Internal Request <help-debbugs <at> gnu.org>
to internal_control <at> debbugs.gnu.org
.
(Fri, 22 Dec 2023 12:24:08 GMT) Full text and rfc822 format available.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.