GNU bug report logs - #76790
[Shepherd] Handling process termination before service is running

Previous Next

Package: guix;

Reported by: Ludovic Courtès <ludo <at> gnu.org>

Date: Thu, 6 Mar 2025 20:47:02 UTC

Severity: normal

Done: Ludovic Courtès <ludo <at> gnu.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 76790 in the body.
You can then email your comments to 76790 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-guix <at> gnu.org:
bug#76790; Package guix. (Thu, 06 Mar 2025 20:47:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Ludovic Courtès <ludo <at> gnu.org>:
New bug report received and forwarded. Copy sent to bug-guix <at> gnu.org. (Thu, 06 Mar 2025 20:47:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: bug-guix <at> gnu.org
Subject: [Shepherd] Handling process termination before service is running
Date: Thu, 06 Mar 2025 21:45:56 +0100
[Message part 1 (text/plain, inline)]
While on a quest for flaky tests in the Shepherd, I found a genuine bug
that would manifest with this ‘tests/basic.sh’ failure:

--8<---------------cut here---------------start------------->8---
+ herd -s t-socket-21679 status test-run-from-nonexistent-directory
+ sleep 0.5
+ herd -s t-socket-21679 status test-run-from-nonexistent-directory
+ grep 'exited with code 127'
+ sleep 0.5
+ herd -s t-socket-21679 status test-run-from-nonexistent-directory
+ grep 'exited with code 127'
[…]
2025-03-06 14:06:36 Service test-run-from-nonexistent-directory started.
2025-03-06 14:06:36 Failed to run "/gnu/store/3bg5qfsmjw6p7bh1xadarbaq246zis0d-coreutils-9.1/bin/pwd": In procedure chdir: No such file or directory
2025-03-06 14:06:36 Service test-run-from-nonexistent-directory running with value #<<process> id: 22431 command: ("/gnu/store/3bg5qfsmjw6p7bh1xadarbaq246zis0d-coreutils-9.1/bin/pwd")>.
2025-03-06 14:06:36 Service test-run-from-nonexistent-directory has been started.
2025-03-06 14:06:36 Service test-run-from-nonexistent-directory has been disabled.
2025-03-06 14:11:51 Stopping service root...
--8<---------------cut here---------------end--------------->8---

What happens is that the service is not marked as “exited with code
127”; instead, it is marked as having exited with code 0:

--8<---------------cut here---------------start------------->8---
● Status of test-run-from-nonexistent-directory:
  It is stopped since 14:06:36 (37 seconds ago).
  Process exited successfully.
  It is disabled.
  Provides: test-run-from-nonexistent-directory
  Will not be respawned.
--8<---------------cut here---------------end--------------->8---

This is due to a race condition: the process terminates before its
service goes from ‘starting’ to ‘running’.

By the time the service controller calls ‘monitor-service-process’, the
process has already terminated, so the process monitor replies 0 to the
'await request because that process no longer exists.

Attached is a test that reproduces the problem.

Ludo’.

[terminate-before-running.sh (text/plain, attachment)]

Information forwarded to bug-guix <at> gnu.org:
bug#76790; Package guix. (Sun, 09 Mar 2025 16:16:01 GMT) Full text and rfc822 format available.

Message #8 received at 76790 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: 76790 <at> debbugs.gnu.org
Subject: Re: bug#76790: [Shepherd] Handling process termination before
 service is running
Date: Sun, 09 Mar 2025 17:15:05 +0100
[Message part 1 (text/plain, inline)]
Ludovic Courtès <ludo <at> gnu.org> skribis:

> This is due to a race condition: the process terminates before its
> service goes from ‘starting’ to ‘running’.
>
> By the time the service controller calls ‘monitor-service-process’, the
> process has already terminated, so the process monitor replies 0 to the
> 'await request because that process no longer exists.

Fixed in 88cc5a43bf04c13b00b15a8d93cb635d9b64713c by introducing an
atomic fork + monitor primitive.

For the record, I also considered a hacky but less intrusive solution:
adding support for “zombies” in the process monitor, whereby it would
record the status of terminated processes and give that (instead of 0)
when somebody awaits them (patch attached).

It’s fragile though because unlike real zombies, we cannot guarantee
that the PID won’t be reused in the meantime; it’s just unlikely.  (PID
FDs would help, but that’s not portable so best if we can avoid it.)

Ludo’.

[Message part 2 (text/x-patch, inline)]
diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index 221998b..fec773c 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -2345,25 +2345,35 @@ may never terminate, even after sending it SIGKILL---e.g., kthreadd on Linux."
           (const #f)))
       (const #f)))
 
+(define %max-zombie-processes
+  ;; Number of processes recorded as "zombies" by the process
+  ;; monitor--processes that had no waiters when they terminated.
+  20)
+
 (define (process-monitor channel)
   "Run a process monitor that handles requests received over @var{channel}."
-  (let loop ((waiters vlist-null))
+  (let loop ((waiters vlist-null)
+             (zombies (ring-buffer %max-zombie-processes)))
     (match (get-message channel)
       (('handle-process-termination pid status)
-       ;; Notify any waiters.
-       (vhash-foldv* (lambda (waiter _)
-                       (put-message waiter status)
-                       #t)
-                     #t pid waiters)
-
-       ;; XXX: The call below is linear in the size of WAITERS, but WAITERS is
-       ;; usually empty or small.
-       (loop (vhash-fold (lambda (key value result)
-                           (if (= key pid)
-                               result
-                               (vhash-consv key value result)))
-                         vlist-null
-                         waiters)))
+       ;; Notify any waiters.  When there are none, record PID and STATUS in
+       ;; the ZOMBIES buffer in case somebody asks for it later--
+       (let ((zombie? (vhash-foldv* (lambda (waiter _)
+                                      (put-message waiter status)
+                                      #f)
+                                    #t pid waiters)))
+
+         ;; XXX: The call below is linear in the size of WAITERS, but WAITERS
+         ;; is usually empty or small.
+         (loop (vhash-fold (lambda (key value result)
+                             (if (= key pid)
+                                 result
+                                 (vhash-consv key value result)))
+                           vlist-null
+                           waiters)
+               (if zombie?
+                   (ring-buffer-insert (cons pid status) zombies)
+                   zombies))))
 
       (('spawn arguments service reply)
        ;; Spawn the command as specified by ARGUMENTS; send the spawn result
@@ -2378,22 +2388,31 @@ may never terminate, even after sending it SIGKILL---e.g., kthreadd on Linux."
          (put-message reply result)
          (match result
            (('exception . _)
-            (loop waiters))
+            (loop waiters zombies))
            (('success (pid))
-            (loop (vhash-consv pid reply waiters))))))
+            (loop (vhash-consv pid reply waiters) zombies)))))
 
       (('await pid reply)
        ;; Await the termination of PID and send its status on REPLY.
        (if (and (catch-system-error (kill pid 0))
                 (not (pseudo-process? pid)))
-           (loop (vhash-consv pid reply waiters))
-           (begin                             ;PID is gone or a pseudo-process
-             ;; This might be a race condition (the caller thinks PID is up
-             ;; when it's already dead) so log it.
-             (local-output (l10n "Awaiting PID ~a, which is already gone.")
-                           pid)
-             (put-message reply 0)
-             (loop waiters)))))))
+           (loop (vhash-consv pid reply waiters) zombies)
+           ;; PID is gone or a pseudo-process; in the former case, check if we
+           ;; have info about it in ZOMBIES (this is linear in the length of
+           ;; ZOMBIES, which is small.)
+           (let* ((zombies* (ring-buffer->list zombies))
+                  (status (assoc-ref zombies* pid)))
+             (unless status
+               ;; This might be a race condition (the caller thinks PID is up
+               ;; when it's already dead) so log it.
+               (local-output (l10n "Awaiting PID ~a, which is already gone.")
+                             pid))
+             (put-message reply (or status 0))
+             (loop waiters
+                   (if status
+                       (list->ring-buffer (alist-delete pid zombies*)
+                                          %max-zombie-processes)
+                       zombies))))))))
 
 (define spawn-process-monitor
   (essential-task-launcher 'process-monitor process-monitor))

bug closed, send any further explanations to 76790 <at> debbugs.gnu.org and Ludovic Courtès <ludo <at> gnu.org> Request was from Ludovic Courtès <ludo <at> gnu.org> to control <at> debbugs.gnu.org. (Sun, 09 Mar 2025 16:16:02 GMT) Full text and rfc822 format available.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 07 Apr 2025 11:24:06 GMT) Full text and rfc822 format available.

This bug report was last modified 31 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.