GNU bug report logs - #57922
Shepherd doesn't seem to correctly handle waitpid itself

Please note: This is a static page, with minimal formatting, updated once a day.

Package: guix; Reported by: Maxim Cournoyer <maxim.cournoyer@HIDDEN>; Done: Maxim Cournoyer <maxim.cournoyer@HIDDEN>; Maintainer for guix is bug-guix@HIDDEN.

Message received at 57922 <at> debbugs.gnu.org:


Received: (at 57922) by debbugs.gnu.org; 24 Sep 2022 16:30:20 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Sat Sep 24 12:30:20 2022
Received: from localhost ([127.0.0.1]:45112 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1oc82m-0004ZQ-Cm
	for submit <at> debbugs.gnu.org; Sat, 24 Sep 2022 12:30:20 -0400
Received: from eggs.gnu.org ([209.51.188.92]:51750)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <ludo@HIDDEN>) id 1oc82l-0004EQ-1p
 for 57922 <at> debbugs.gnu.org; Sat, 24 Sep 2022 12:30:19 -0400
Received: from fencepost.gnu.org ([2001:470:142:3::e]:42760)
 by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <ludo@HIDDEN>)
 id 1oc82f-0002vl-O1; Sat, 24 Sep 2022 12:30:13 -0400
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=gnu.org;
 s=fencepost-gnu-org; h=MIME-Version:In-Reply-To:Date:References:Subject:To:
 From; bh=got81TcrqCy98JpJ+p0OHfPAiY4xP2grH0geQAy4Xz4=; b=PQ5IgUkyodJVm2GBRwTn
 RXQTMHT/vhZp9nr8zTEb7F8PAhJTJVPGJ4xmeQeYmu7rEjJf4Cn/Gm8Ax3hi2fonL81rYRyA7QJsZ
 BbrwAlonOetQvGFn6v9TKxfEj4mQol4b7CRgcCREA1BARdyUXZoV7At7ZMrjYaC2Jaw1Nw8RR79t2
 hCuuOOQkzH9CMJ553rWZhoON/n2tOfzPwqDAfmfzYobV8k38t1Mqo86jJ7W8lm5Oflr9qGb3tmnJe
 SIIRJFtZ9Am/Wn0l7VJ/JQRRf7LHyGR6j58Dg+y/7ju9b0lKPO000wz+bVX4eZ6BGKDFXMctrEFMH
 A5pTvKvjxxlvQQ==;
Received: from 91-160-117-201.subs.proxad.net ([91.160.117.201]:63645
 helo=ribbon)
 by fencepost.gnu.org with esmtpsa (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <ludo@HIDDEN>)
 id 1oc82c-0001gd-2G; Sat, 24 Sep 2022 12:30:13 -0400
From: =?utf-8?Q?Ludovic_Court=C3=A8s?= <ludo@HIDDEN>
To: Josselin Poiret <dev@HIDDEN>
Subject: Re: bug#57922: Shepherd doesn't seem to correctly handle waitpid
 itself
References: <874jx4q953.fsf@HIDDEN> <87o7va33iq.fsf@HIDDEN>
 <87bkr6fvlz.fsf@HIDDEN> <878rm98n17.fsf@HIDDEN>
 <87sfkh8a8z.fsf@HIDDEN>
X-URL: http://www.fdn.fr/~lcourtes/
X-Revolutionary-Date: Tridi 3 =?utf-8?Q?Vend=C3=A9miaire?= an 231 de la
 =?utf-8?Q?R=C3=A9volution=2C?= jour de
 la =?utf-8?Q?Ch=C3=A2taigne?=
X-PGP-Key-ID: 0x090B11993D9AEBB5
X-PGP-Key: http://www.fdn.fr/~lcourtes/ludovic.asc
X-PGP-Fingerprint: 3CE4 6455 8A84 FDC6 9DB4  0CFB 090B 1199 3D9A EBB5
X-OS: x86_64-pc-linux-gnu
Date: Sat, 24 Sep 2022 18:30:07 +0200
In-Reply-To: <87sfkh8a8z.fsf@HIDDEN> (Josselin Poiret's message of "Sat, 
 24 Sep 2022 10:09:00 +0200")
Message-ID: <87zgeo68hc.fsf@HIDDEN>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.1 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Spam-Score: -0.3 (/)
X-Debbugs-Envelope-To: 57922
Cc: 57922 <at> debbugs.gnu.org, Maxim Cournoyer <maxim.cournoyer@HIDDEN>
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -1.3 (-)

Hi,

Josselin Poiret <dev@HIDDEN> skribis:

> Maxim Cournoyer <maxim.cournoyer@HIDDEN> writes:
>
>> This leads me to believe that Shepherd does not wait for the process to
>> actually die before marking the service as stopped (it just calls
>> waitpid on the group pid with WNOHANG), which means it won't block if
>> the child process hasn't exited yet, if I'm correct.

Correct: the service is marked as stopped as soon as ‘stop’ returns.

>> When we are in the stop slot, we know for sure that the process should
>> terminate completely, hence it'd make sense to call 'waitpid' *without*
>> WNOHANG there, to avoid 'herd restart' from starting the service while
>> its stopped process is not done terminating.
>>
>> jamid can take quite some time to terminate cleanly because of the
>> networking threads in the opendht library that need to be finalized,
>> which is probably the reason this problem can be observed here.
>>
>> Thoughts?
>
> I agree with you, make-kill-destructor should waitpid the processes it's
> killing.  There shouldn't be any issues waitpid'ing before the
> shepherd's signal handler, since stop actions are run with asyncs
> disabled.  The signal handler will run once but won't get anything
> because all the processes were already waitpid'd and it uses WNOHANG.

I think we need an extra “stopping” state for services.  In general,
we’ll want to send SIGTERM, wait for some grace period or dead process
notification, then send SIGKILL, and finally change state to “stopped”.

This is not possible in 0.9 but is something I’d like to have in 0.10¹.

Ludo’.

¹ https://lists.gnu.org/archive/html/guix-devel/2022-06/msg00350.html
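
The SIGTERM, grace period, SIGKILL sequence described above could be
sketched as follows in plain Guile.  This is only an illustration under
stated assumptions, not Shepherd code: 'stop-with-grace', its
#:grace-period keyword and the 100 ms polling interval are hypothetical,
and the helper blocks its caller while polling, which a real “stopping”
state would presumably avoid.

```scheme
(use-modules (ice-9 match))

;; Hypothetical helper, not part of the Shepherd API.
(define* (stop-with-grace pid #:key (grace-period 5))
  "Send SIGTERM to PID, wait up to GRACE-PERIOD seconds for it to exit,
then fall back to SIGKILL.  Return the exit status of PID."
  (kill pid SIGTERM)
  (let loop ((remaining (* 10 grace-period))) ;one poll every 100 ms
    (match (waitpid pid WNOHANG)
      ((0 . _)
       ;; PID has not exited yet.
       (if (positive? remaining)
           (begin
             (usleep 100000)
             (loop (- remaining 1)))
           (begin
             ;; Grace period elapsed: force termination, then reap.
             (kill pid SIGKILL)
             (cdr (waitpid pid)))))
      ((_ . status)
       ;; PID exited and has been reaped.
       status))))

;; Usage: a child that would otherwise sleep for a minute.
(define child
  (let ((pid (primitive-fork)))
    (if (zero? pid)
        (begin (sleep 60) (primitive-exit 0))
        pid)))

(define status (stop-with-grace child #:grace-period 2))
(format #t "terminated by signal ~a~%" (status:term-sig status))
```

Only once 'stop-with-grace' has returned would it be safe to consider
the service “stopped” and let 'herd restart' start a new process.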




Information forwarded to bug-guix@HIDDEN:
bug#57922; Package guix. Full text available.

Message received at 57922 <at> debbugs.gnu.org:


Received: (at 57922) by debbugs.gnu.org; 24 Sep 2022 08:09:07 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Sat Sep 24 04:09:07 2022
Received: from localhost ([127.0.0.1]:42243 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1oc0Di-0007xM-Op
	for submit <at> debbugs.gnu.org; Sat, 24 Sep 2022 04:09:06 -0400
Received: from jpoiret.xyz ([206.189.101.64]:40644)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <dev@HIDDEN>) id 1oc0Dg-0007xD-Ul
 for 57922 <at> debbugs.gnu.org; Sat, 24 Sep 2022 04:09:05 -0400
Received: from authenticated-user (jpoiret.xyz [206.189.101.64])
 by jpoiret.xyz (Postfix) with ESMTPA id 30AA1185310;
 Sat, 24 Sep 2022 08:09:01 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=jpoiret.xyz; s=dkim;
 t=1664006941;
 h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
 in-reply-to:in-reply-to:references:references;
 bh=talIB2iV9WeuRSS20bV8CkxBo8H3SOAfTcO7cdcc5qU=;
 b=SHlRxd6OVp7KWNNqtGToTvlCYm2Y1KWNqL0XKaOc0h26AFW3EGEvX+ygG55h4pdioAq4SX
 Hrnykmu+v+D3y6mqzfWU4OKzeG63yp10F9DacxSeN7Ja1AoRSCzaRcgjhpGji3OK5gzplL
 0rnoVZpOUWHTsWKHdUfvGSswrFC5JdAjnBMFAF0S/6UBuxOD5sszEwK4+/T3jbbwaEtoP1
 b01M3n36ze+pUyOk+gcuRcSeARs0kLGJEsfqKyehwUM6EZ8w+rCINUQzk5rgNE29aLXSaA
 /xpla6e5U4HZvBW5V/Cci87384kEey1TKhsjjjOytjW0dMI43tOADbA75GL01A==
From: Josselin Poiret <dev@HIDDEN>
To: Maxim Cournoyer <maxim.cournoyer@HIDDEN>, Ludovic =?utf-8?Q?Court?=
 =?utf-8?Q?=C3=A8s?= <ludo@HIDDEN>
Subject: Re: bug#57922: Shepherd doesn't seem to correctly handle waitpid
 itself
In-Reply-To: <878rm98n17.fsf@HIDDEN>
References: <874jx4q953.fsf@HIDDEN> <87o7va33iq.fsf@HIDDEN>
 <87bkr6fvlz.fsf@HIDDEN> <878rm98n17.fsf@HIDDEN>
Date: Sat, 24 Sep 2022 10:09:00 +0200
Message-ID: <87sfkh8a8z.fsf@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain
Authentication-Results: jpoiret.xyz;
 auth=pass smtp.auth=jpoiret@HIDDEN smtp.mailfrom=dev@HIDDEN
X-Spamd-Bar: /
X-Spam-Score: 2.0 (++)
X-Spam-Report: Spam detection software, running on the system "debbugs.gnu.org",
 has NOT identified this incoming email as spam.  The original
 message has been attached to this so you can view it or label
 similar future email.  If you have any questions, see
 the administrator of that system for details.
 Content preview:  Hi everyone,
 Maxim Cournoyer <maxim.cournoyer@HIDDEN> writes:
 > This leads me to believe that Shepherd does not block until the process
 > is actually dead to mark the process as stopped (it just waitpid on the
 > group pid with WNOHANG), which means it won't bloc [...] 
 Content analysis details:   (2.0 points, 10.0 required)
 pts rule name              description
 ---- ---------------------- --------------------------------------------------
 -0.0 SPF_PASS               SPF: sender matches SPF record
 2.0 PDS_OTHER_BAD_TLD      Untrustworthy TLDs
 [URI: jpoiret.xyz (xyz)]
 -0.0 SPF_HELO_PASS          SPF: HELO matches SPF record
 0.0 FROM_SUSPICIOUS_NTLD   From abused NTLD
X-Debbugs-Envelope-To: 57922
Cc: 57922 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>

Hi everyone,

Maxim Cournoyer <maxim.cournoyer@HIDDEN> writes:

> This leads me to believe that Shepherd does not wait for the process to
> actually die before marking the service as stopped (it just calls
> waitpid on the group pid with WNOHANG), which means it won't block if
> the child process hasn't exited yet, if I'm correct.
>
> When we are in the stop slot, we know for sure that the process should
> terminate completely, hence it'd make sense to call 'waitpid' *without*
> WNOHANG there, to avoid 'herd restart' from starting the service while
> its stopped process is not done terminating.
>
> jamid can take quite some time to terminate cleanly because of the
> networking threads in the opendht library that need to be finalized,
> which is probably the reason this problem can be observed here.
>
> Thoughts?

I agree with you, make-kill-destructor should waitpid the processes it's
killing.  There shouldn't be any issues waitpid'ing before the
shepherd's signal handler, since stop actions are run with asyncs
disabled.  The signal handler will run once but won't get anything
because all the processes were already waitpid'd and it uses WNOHANG.
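
As a rough illustration of that ordering, here is a sketch in plain
Guile.  It is not the actual make-kill-destructor or Shepherd's signal
handler: 'reap-any' is a hypothetical stand-in for a handler that calls
waitpid with WNOHANG, and the example does not model asyncs at all.

```scheme
;; Hypothetical stand-in for a WNOHANG-based SIGCHLD handler.
(define (reap-any)
  "Try to reap any child without blocking.  Return (0 . #f) when no
child process is left at all instead of raising an ECHILD error."
  (catch 'system-error
    (lambda ()
      (waitpid WAIT_ANY WNOHANG))
    (lambda args
      (if (= ECHILD (system-error-errno args))
          '(0 . #f)
          (apply throw args)))))

;; A child that would otherwise sleep for a minute.
(define child
  (let ((pid (primitive-fork)))
    (if (zero? pid)
        (begin (sleep 60) (primitive-exit 0))
        pid)))

;; What a waiting destructor would do: signal, then block until reaped.
(kill child SIGTERM)
(define reaped (waitpid child))

;; What the handler would see afterwards: nothing left to collect.
(define leftover (reap-any))

(format #t "reaped ~a, handler then got ~a~%" (car reaped) (car leftover))
```

The second call finds no process to collect because the blocking
waitpid already consumed the child's exit status.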

Best,
-- 
Josselin Poiret




Information forwarded to bug-guix@HIDDEN:
bug#57922; Package guix. Full text available.

Message received at 57922 <at> debbugs.gnu.org:


Received: (at 57922) by debbugs.gnu.org; 24 Sep 2022 03:33:02 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Fri Sep 23 23:33:02 2022
Received: from localhost ([127.0.0.1]:42030 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1obvuX-0000Rb-MF
	for submit <at> debbugs.gnu.org; Fri, 23 Sep 2022 23:33:02 -0400
Received: from mail-qk1-f170.google.com ([209.85.222.170]:41618)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <maxim.cournoyer@HIDDEN>) id 1obvuV-0000RK-Qt
 for 57922 <at> debbugs.gnu.org; Fri, 23 Sep 2022 23:33:00 -0400
Received: by mail-qk1-f170.google.com with SMTP id k12so1236856qkj.8
 for <57922 <at> debbugs.gnu.org>; Fri, 23 Sep 2022 20:32:59 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112;
 h=mime-version:user-agent:message-id:in-reply-to:date:references
 :subject:cc:to:from:from:to:cc:subject:date;
 bh=LjNAuB0RhYiOD+fxOVpRlEmQOnkKmFfbEzpFDVb+MWI=;
 b=GvkKm+pT500R4BrK5/xw/zjXmkByhwGpWGcG6zUy0wQEJUnc7DCApw7BSpO1OFwIyR
 Eu9PA2zSwkgxdnlQ72qpqlkXu6RHAQV96X0/Ddq5m4ZJBwuqO/Df+W7cFN/YRIRM/IKJ
 42pHHQayR4swFZox+WVg2+YiUQwYuUPRR3pTEfFRQ5NziV7ZxyKBQGN8VEpnNpWGtDxN
 1FRxRxLAr6Kd4YOrJ2xevUjEdiO6plXCEDad77Uqhi1SOzrTAt+ZM2AYxG2ebntSxQxu
 0xswavQS7VmmKaiEJQXEBlXQsO+ypDB+OYKdAJ8Xfcr7vTOMQ6uiV7jbZhXEF8nkWaUy
 VFdA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20210112;
 h=mime-version:user-agent:message-id:in-reply-to:date:references
 :subject:cc:to:from:x-gm-message-state:from:to:cc:subject:date;
 bh=LjNAuB0RhYiOD+fxOVpRlEmQOnkKmFfbEzpFDVb+MWI=;
 b=faFHxYX95Q7y6pa3/6KSOUs/ZoqGvDzkh97nHoEqpEYJZm+jQlEqZ8thaq5unlw5kS
 y01yzDzIR/olShUpXbyp7f0WLdJv+gX3FyguC1iqFFft4jEW0+r1p25/UUiT5WYcWRu9
 qMnPrDUOLLXhaV6oRlNbsfZT8GfxkF7BewVoWrdjOJpU7MWMvTDIHB45fIbyVGc9UEaJ
 CZKHaqMVbWtKTyH5HotGf8wPxjt9fzVr8l4AxdvaUmUf8Fo5jtIkwQOkjoEzmUdy0Kka
 wnLD/uPYkghRpVDNVqSFC96ZscK7JUbKTj6AYdJXbpo4+JdWFWJOEIFvFAXQfCb8duiR
 aOEw==
X-Gm-Message-State: ACrzQf0v3bLWB0U66gUfSN5Hhtaac9qaaX9n4Z4KMr+xe1omDXvj+J7i
 hTMzksSHcTebemI4yx5YTVPOdJZD46o=
X-Google-Smtp-Source: AMsMyM5Ym3Sy+McaVWduALVsD+jrBYZPrSMhMehb8pztlJQ2C1744S3H92eMHUv/zuzBoacqSJNQsg==
X-Received: by 2002:a05:620a:40c1:b0:6ce:a11a:7279 with SMTP id
 g1-20020a05620a40c100b006cea11a7279mr7888764qko.703.1663990373980; 
 Fri, 23 Sep 2022 20:32:53 -0700 (PDT)
Received: from hurd (dsl-10-130-64.b2b2c.ca. [72.10.130.64])
 by smtp.gmail.com with ESMTPSA id
 v11-20020a05622a014b00b0035cf0f50d7csm7483131qtw.52.2022.09.23.20.32.52
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Fri, 23 Sep 2022 20:32:53 -0700 (PDT)
From: Maxim Cournoyer <maxim.cournoyer@HIDDEN>
To: Ludovic =?utf-8?Q?Court=C3=A8s?= <ludo@HIDDEN>
Subject: Re: bug#57922: Shepherd doesn't seem to correctly handle waitpid
 itself
References: <874jx4q953.fsf@HIDDEN> <87o7va33iq.fsf@HIDDEN>
 <87bkr6fvlz.fsf@HIDDEN>
Date: Fri, 23 Sep 2022 23:32:52 -0400
In-Reply-To: <87bkr6fvlz.fsf@HIDDEN> ("Ludovic =?utf-8?Q?Court=C3=A8s=22'?=
 =?utf-8?Q?s?= message of "Fri, 23 Sep 2022 08:33:28 +0200")
Message-ID: <878rm98n17.fsf@HIDDEN>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.1 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
X-Spam-Score: 0.0 (/)
X-Debbugs-Envelope-To: 57922
Cc: Josselin Poiret <dev@HIDDEN>, 57922 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -1.0 (-)

reopen 57922
tags 57922 -notabug
thanks

Hi again,

[...]

>>> Here's a small reproducer to apply on our code base:
>>>
>>> --8<---------------cut here---------------start------------->8---
>>> modified   gnu/services/telephony.scm
>>> @@ -685,13 +685,7 @@ (define (archive-name->username archive)
>>>
>>>                      ;; Finally, return the PID of the daemon process.
>>>                      daemon-pid))
>>> -               (stop
>>> -                #~(lambda (pid . args)
>>> -                    (kill pid SIGKILL)
>>> -                    ;; Wait for the process to exit; this prevents overlapping
>>> -                    ;; processes when issuing 'herd restart'.
>>> -                    (waitpid pid)
>>> -                    #f))))))))
>>> +               (stop #~(make-kill-destructor))))))))
>
> I think the main difference between these two is that the first one uses
> SIGKILL while the second one uses SIGTERM.
>
> You could try #~(make-kill-destructor SIGKILL) to get the same effect.

> You are right, the important difference was SIGTERM vs SIGKILL.  I
> thought I had tried that.  The problem only shows itself in the
> 'jami-provisioning' system test, not the 'jami' one.

> Marking this one as notabug and closing.

I think I spoke too soon.  SIGKILL does fix the problem when *not* using
waitpid explicitly, but when using waitpid explicitly, SIGTERM can be
used just fine.  In other words, this works:

--8<---------------cut here---------------start------------->8---
@@ -687,7 +687,7 @@ (define (archive-name->username archive)
                     daemon-pid))
                (stop
                 #~(lambda (pid . args)
-                    (kill pid SIGKILL)
+                    (kill pid SIGTERM)
                     ;; Wait for the process to exit; this prevents overlapping
                     ;; processes when issuing 'herd restart'.
                     (waitpid pid)
--8<---------------cut here---------------end--------------->8---

but this doesn't:

--8<---------------cut here---------------start------------->8---
@@ -685,13 +685,7 @@ (define (archive-name->username archive)
 
                     ;; Finally, return the PID of the daemon process.
                     daemon-pid))
-               (stop
-                #~(lambda (pid . args)
-                    (kill pid SIGKILL)
-                    ;; Wait for the process to exit; this prevents overlapping
-                    ;; processes when issuing 'herd restart'.
-                    (waitpid pid)
-                    #f))))))))
+               (stop #~(make-kill-destructor))))))))
 
 (define jami-service-type
--8<---------------cut here---------------end--------------->8---

when exercised with 'make check-system TESTS=jami-provisioning':

--8<---------------cut here---------------start------------->8---
This is the GNU system.  Welcome.
jami login: Jami Daemon 13.4.0, by Savoir-faire Linux 2004-2019
https://jami.net/
[Video support enabled]
[Plugins support enabled]

23:29:05.375         os_core_unix.c !pjlib 2.12.1 for POSIX initialized
shepherd: Service jami has been stopped.
Caught signal Terminated, terminating...

Some deprecated features have been used.  Set the environment
variable GUILE_WARN_DEPRECATED to "detailed" and rerun the
program to get more information.  Set it to "no" to suppress
this message.
Jami Daemon 13.4.0, by Savoir-faire Linux 2004-2019
https://jami.net/
[Video support enabled]
[Plugins support enabled]

One does not simply initialize the client: Another daemon is detected
/gnu/store/2vcv1fyqfyym2zcyf5bvbj1pcgbcc515-shepherd-marionette.scm:1:1718: ERROR:
  1. &action-exception-error:
      service: jami
      action: start
      key: misc-error
      args: (#f "~A ~S ~S ~S" (dbus "method failed with error" "org.freedesktop.DBus.Error.NoReply" ("Message recipient disconnected from message bus without replying")) #f)
--8<---------------cut here---------------end--------------->8---
      
or manually through the test VM:

--8<---------------cut here---------------start------------->8---
$(./pre-inst-env guix system vm --no-graphic --no-grafts --no-offload \
  -e '(@@ (gnu tests telephony) %jami-os-provisioning)')  \
  -m 1G -smp $(nproc) "-nic" user,model=virtio-net-pci,hostfwd=tcp::10022-:22
--8<---------------cut here---------------end--------------->8---

This leads me to believe that Shepherd does not wait for the process to
actually die before marking the service as stopped (it just calls waitpid
on the group pid with WNOHANG), which means it won't block if the child
process hasn't exited yet, if I'm correct.

When we are in the stop slot, we know for sure that the process should
terminate completely, hence it'd make sense to call 'waitpid' *without*
WNOHANG there, to avoid 'herd restart' from starting the service while
its stopped process is not done terminating.
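
The difference between the two calls can be seen in a minimal sketch
using plain Guile POSIX procedures.  This is not Shepherd code, and
'spawn-slow-child' is a hypothetical helper standing in for a daemon
that takes a while to shut down.

```scheme
;; Hypothetical helper: fork a child that takes about a second to exit.
(define (spawn-slow-child)
  (let ((pid (primitive-fork)))
    (if (zero? pid)
        (begin (sleep 1) (primitive-exit 0))
        pid)))

(define child (spawn-slow-child))

;; Non-blocking: the child is still running, so waitpid returns at once
;; with 0 in the car of its result instead of the child's PID.
(define early (waitpid child WNOHANG))

;; Blocking: returns only once the child has actually terminated.
(define late (waitpid child))

(format #t "WNOHANG returned PID ~a; blocking call returned PID ~a~%"
        (car early) (car late))
```

A caller that only looks at the WNOHANG result has no way to know the
process is still alive, which matches the overlap seen on restart.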

jamid can take quite some time to terminate cleanly because of the
networking threads in the opendht library that need to be finalized,
which is probably the reason this problem can be observed here.

Thoughts?

Maxim




Information forwarded to bug-guix@HIDDEN:
bug#57922; Package guix. Full text available.

Message received at 57922-done <at> debbugs.gnu.org:


Received: (at 57922-done) by debbugs.gnu.org; 23 Sep 2022 17:49:35 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Fri Sep 23 13:49:35 2022
Received: from localhost ([127.0.0.1]:41691 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1obmnu-00079e-Ti
	for submit <at> debbugs.gnu.org; Fri, 23 Sep 2022 13:49:35 -0400
Received: from mail-qt1-f181.google.com ([209.85.160.181]:38459)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <maxim.cournoyer@HIDDEN>) id 1obmnt-00079R-Fg
 for 57922-done <at> debbugs.gnu.org; Fri, 23 Sep 2022 13:49:33 -0400
Received: by mail-qt1-f181.google.com with SMTP id y2so523527qtv.5
 for <57922-done <at> debbugs.gnu.org>; Fri, 23 Sep 2022 10:49:33 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112;
 h=content-transfer-encoding:mime-version:user-agent:message-id
 :in-reply-to:date:references:subject:cc:to:from:from:to:cc:subject
 :date; bh=m9QUAKp6OUKoJCi+0GmI0iq8xaNi5TvAMWD9W/uKWMU=;
 b=C7K9VNRyShds9BkCbrvaGSRFknp6pOcyrbq775oiZ3cg0Vghf3xs8ZOztwzWdMSAj8
 IGlamFmW+UdnE7UAwiT7tA+Ye1PKX7J5BDjWPZFHPJijnFIu/Tw5PPqc9mTo+mlFQF4U
 A0XM3t4eKQKEGvHe+vmLHGWh8dQSVwR7+MbmAZVAxfbT9K5GuNETZaZD3Tq2VDuCdt9D
 X8zjAsYioomdzpEOjO019+mdL/06n77uYXFO1hMOqEvrvaF83Rx2uhvhQt8w8Ev1hODP
 Gn+tAS8uBjynKwIt8MDPDa/FNaGm8J46jlzKuGkLnJaiAwbz+ddmf7kBQD3bbZvjQoTm
 JZ6w==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20210112;
 h=content-transfer-encoding:mime-version:user-agent:message-id
 :in-reply-to:date:references:subject:cc:to:from:x-gm-message-state
 :from:to:cc:subject:date;
 bh=m9QUAKp6OUKoJCi+0GmI0iq8xaNi5TvAMWD9W/uKWMU=;
 b=5cD+QeihetTvW3yleoe+X8QBz9WuI09TtOc2fVpipGDiqJekkyTzddMIacWg1X9Uee
 dq9iToFcyrPdLHVHd+9Je+mMD9lY+zMSlPXGB5bLs5d0woWG2BppQuMgjIPanNxQ9wy6
 EesIQglQDcyTPaduikWCk3NRqC126hfHsg+u1ivJqIEWygqQAAzdr/7dEyhy3dX59i8a
 vF8ZHXoIcRbu0eyRhf0jvLHNijLWM/uEeR4BVc+/GQpSWla9nj0OGEuNYiK/CXid7w90
 cOkPu62TF1KFYtuA2Oh/DLp1SkskEGCl8sNq8iGH5ShShI1AZPDAm0SgS/NrtO+Eg1qE
 kvxw==
X-Gm-Message-State: ACrzQf0U+r/deUvF9hRRV6zlYb0pmB7bRwc07hKTU3WjBC8ncH/EbSUA
 sy63puctda0vANynup2kzXw9CqvIhWs=
X-Google-Smtp-Source: AMsMyM5nm59h++wEY6AwUnYF+zmZIp+2KH5+xKJKDvyidaROFmLodBQK5LQtMElWcZmE2FxQQRlZHg==
X-Received: by 2002:ac8:5847:0:b0:35d:18b8:aa0f with SMTP id
 h7-20020ac85847000000b0035d18b8aa0fmr6784383qth.591.1663955367654; 
 Fri, 23 Sep 2022 10:49:27 -0700 (PDT)
Received: from hurd (dsl-10-130-64.b2b2c.ca. [72.10.130.64])
 by smtp.gmail.com with ESMTPSA id
 br30-20020a05620a461e00b006ceafb1aa92sm6562072qkb.96.2022.09.23.10.49.26
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Fri, 23 Sep 2022 10:49:27 -0700 (PDT)
From: Maxim Cournoyer <maxim.cournoyer@HIDDEN>
To: Ludovic =?utf-8?Q?Court=C3=A8s?= <ludo@HIDDEN>
Subject: Re: bug#57922: Shepherd doesn't seem to correctly handle waitpid
 itself
References: <874jx4q953.fsf@HIDDEN> <87o7va33iq.fsf@HIDDEN>
 <87bkr6fvlz.fsf@HIDDEN>
Date: Fri, 23 Sep 2022 13:49:26 -0400
In-Reply-To: <87bkr6fvlz.fsf@HIDDEN> ("Ludovic =?utf-8?Q?Court=C3=A8s=22'?=
 =?utf-8?Q?s?= message of "Fri, 23 Sep 2022 08:33:28 +0200")
Message-ID: <875yhe9e1l.fsf@HIDDEN>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.1 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Spam-Score: 0.0 (/)
X-Debbugs-Envelope-To: 57922-done
Cc: Josselin Poiret <dev@HIDDEN>, 57922-done <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -1.0 (-)

tags 57922 +notabug
thanks

Hi Ludo!

Ludovic Courtès <ludo@HIDDEN> writes:

[...]

>> What I don't understand that well is that this signal handler could be
>> installed only once when shepherd starts, right?  That way, it wouldn't
>> need to depend on specific start actions being chosen.
>
> The SIGCHLD handler is installed lazily since
> f776de04e6702e18d95152072e78c43441d3ccc3.  The rationale was discussed
> here:
>
>   https://issues.guix.gnu.org/27553
>
> That said, on GNU/Linux, SIGCHLD is actually blocked and instead we rely
> on signalfd(2).  It’s from the main event loop in shepherd.scm that the
> signal handler is called.

I had missed that, thanks for explaining.

>>> Here's a small reproducer to apply on our code base:
>>>
>>> --8<---------------cut here---------------start------------->8---
>>> modified   gnu/services/telephony.scm
>>> @@ -685,13 +685,7 @@ (define (archive-name->username archive)
>>>
>>>                      ;; Finally, return the PID of the daemon process.
>>>                      daemon-pid))
>>> -               (stop
>>> -                #~(lambda (pid . args)
>>> -                    (kill pid SIGKILL)
>>> -                    ;; Wait for the process to exit; this prevents overlapping
>>> -                    ;; processes when issuing 'herd restart'.
>>> -                    (waitpid pid)
>>> -                    #f))))))))
>>> +               (stop #~(make-kill-destructor))))))))
>
> I think the main difference between these two is that the first one uses
> SIGKILL while the second one uses SIGTERM.
>
> You could try #~(make-kill-destructor SIGKILL) to get the same effect.

You are right, the important difference was SIGTERM vs SIGKILL.  I
thought I had tried that.  The problem only shows itself in the
'jami-provisioning' system test, not the 'jami' one.

Marking this one as notabug and closing.

Thanks again!

Maxim




Notification sent to Maxim Cournoyer <maxim.cournoyer@HIDDEN>:
bug acknowledged by developer. Full text available.
Reply sent to Maxim Cournoyer <maxim.cournoyer@HIDDEN>:
You have taken responsibility. Full text available.

Message received at 57922 <at> debbugs.gnu.org:


Received: (at 57922) by debbugs.gnu.org; 23 Sep 2022 06:33:44 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Fri Sep 23 02:33:43 2022
Received: from localhost ([127.0.0.1]:39187 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1obcFr-0001NS-Hz
	for submit <at> debbugs.gnu.org; Fri, 23 Sep 2022 02:33:43 -0400
Received: from eggs.gnu.org ([209.51.188.92]:42014)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <ludo@HIDDEN>) id 1obcFp-0001NF-9Z
 for 57922 <at> debbugs.gnu.org; Fri, 23 Sep 2022 02:33:41 -0400
Received: from fencepost.gnu.org ([2001:470:142:3::e]:49122)
 by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <ludo@HIDDEN>)
 id 1obcFj-0002KU-NS; Fri, 23 Sep 2022 02:33:35 -0400
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=gnu.org;
 s=fencepost-gnu-org; h=MIME-Version:In-Reply-To:Date:References:Subject:To:
 From; bh=ROm0IsvWG+YODmu//zCK0zwGmAHR1MstxI/DI4pj51E=; b=W9dQV0u2JqEppHyYhfuD
 g43U1V3u0zGgftVitbix/vU/4YG4WHxy9FVBGYf1rbMjukKV6YexG90J309Ncar5vNsVKdFwK/+CU
 GfaI/uSfsmPLwtViMQyjhWSWDkVHsHFlfflnLRc6vn8ltawqcucvQjsUhGCcqxSonSpvamhsBS5qu
 ZWH5HzA2R6opiBPtvJbVBNMVWmp/W59ovX8ZRncjvLqWs8Gja2eJS6zj5/lYppltbrf4pbyjLkomC
 BC4oc34jTV1pcbDcZuBoTABP0zLAkKG1d3krPIlWrcjomfLyjCykPHRhL9qDhKs3ZaJrp4JnGVAT5
 qAXSIScGpoOMiA==;
Received: from [89.207.171.75] (port=39712 helo=ribbon)
 by fencepost.gnu.org with esmtpsa (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <ludo@HIDDEN>)
 id 1obcFf-00009Z-Kr; Fri, 23 Sep 2022 02:33:35 -0400
From: =?utf-8?Q?Ludovic_Court=C3=A8s?= <ludo@HIDDEN>
To: Josselin Poiret <dev@HIDDEN>
Subject: Re: bug#57922: Shepherd doesn't seem to correctly handle waitpid
 itself
References: <874jx4q953.fsf@HIDDEN> <87o7va33iq.fsf@HIDDEN>
Date: Fri, 23 Sep 2022 08:33:28 +0200
In-Reply-To: <87o7va33iq.fsf@HIDDEN> (Josselin Poiret's message of "Tue, 
 20 Sep 2022 09:31:57 +0200")
Message-ID: <87bkr6fvlz.fsf@HIDDEN>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.1 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Spam-Score: 3.3 (+++)
X-Spam-Report: Spam detection software, running on the system "debbugs.gnu.org",
 has NOT identified this incoming email as spam.  The original
 message has been attached to this so you can view it or label
 similar future email.  If you have any questions, see
 the administrator of that system for details.
 Content preview:  Hi,
 Josselin Poiret <dev@HIDDEN> skribis: > Maxim Cournoyer
 <maxim.cournoyer@HIDDEN> writes: 
 Content analysis details:   (3.3 points, 10.0 required)
 pts rule name              description
 ---- ---------------------- --------------------------------------------------
 -0.0 SPF_PASS               SPF: sender matches SPF record
 3.6 RCVD_IN_SBL_CSS        RBL: Received via a relay in Spamhaus SBL-CSS
 [89.207.171.75 listed in zen.spamhaus.org]
 2.0 PDS_OTHER_BAD_TLD      Untrustworthy TLDs
 [URI: jpoiret.xyz (xyz)]
 -0.0 SPF_HELO_PASS          SPF: HELO matches SPF record
 -2.3 RCVD_IN_DNSWL_MED      RBL: Sender listed at https://www.dnswl.org/,
 medium trust [209.51.188.92 listed in list.dnswl.org]
X-Debbugs-Envelope-To: 57922
Cc: 57922 <at> debbugs.gnu.org, Maxim Cournoyer <maxim.cournoyer@HIDDEN>
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>

Hi,

Josselin Poiret <dev@HIDDEN> skribis:

> Maxim Cournoyer <maxim.cournoyer@HIDDEN> writes:

[...]

>> 1. It needs to be installed as the signal handler for each process,
>> with something like:
>>
>> --8<---------------cut here---------------start------------->8---
>>   (unless %sigchld-handler-installed?
>>     (sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP)
>>     (set! %sigchld-handler-installed? #t))
>> --8<---------------cut here---------------end--------------->8---
>>
>> Done for fork+exec-command and make-inetd-forkexec-constructor, but not
>> for make-forkexec-constructor/container, AFAICT;
>
> The signal handler is only installed once in PID 1 (in fact, you haven't
> forked yet here), since it's the one that receives the SIGCHLD.

Right.

> What I don't understand that well is that this signal handler could be
> installed only once when shepherd starts, right?  That way, it wouldn't
> need to depend on specific start actions being chosen.

The SIGCHLD handler is installed lazily since
f776de04e6702e18d95152072e78c43441d3ccc3.  The rationale was discussed
here:

  https://issues.guix.gnu.org/27553

That said, on GNU/Linux, SIGCHLD is actually blocked and instead we rely
on signalfd(2).  It’s from the main event loop in shepherd.scm that the
signal handler is called.

>> Here's a small reproducer to apply on our code base:
>>
>> --8<---------------cut here---------------start------------->8---
>> modified   gnu/services/telephony.scm
>> @@ -685,13 +685,7 @@ (define (archive-name->username archive)
>>  
>>                      ;; Finally, return the PID of the daemon process.
>>                      daemon-pid))
>> -               (stop
>> -                #~(lambda (pid . args)
>> -                    (kill pid SIGKILL)
>> -                    ;; Wait for the process to exit; this prevents overlapping
>> -                    ;; processes when issuing 'herd restart'.
>> -                    (waitpid pid)
>> -                    #f))))))))
>> +               (stop #~(make-kill-destructor))))))))

I think the main difference between these two is that the first one uses
SIGKILL while the second one uses SIGTERM.

You could try #~(make-kill-destructor SIGKILL) to get the same effect.

(Another difference is that ‘make-kill-destructor’ kills the process
group, not just the process itself.)
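
As a sketch of that suggestion, the 'stop' slot from the reproducer
could read as follows.  It is only a fragment of the shepherd-service
definition in gnu/services/telephony.scm, and it assumes the other
fields stay unchanged:

```scheme
;; Fragment of a shepherd-service definition, not a standalone program.
;; Passing SIGKILL makes the destructor send the same signal as the
;; hand-written 'stop' lambda, instead of the default SIGTERM.  As noted
;; above, the destructor signals the whole process group.
(stop #~(make-kill-destructor SIGKILL))
```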

Anyway, the key point is that shepherd takes care of calling ‘waitpid’
for its child processes (services).  If you call it yourself as in the
snippet above, you’re racing with shepherd; in the case above it
probably doesn’t make any difference though because it will consider
that the service is stopped in any case.
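
One way such a race can show up, sketched in plain Guile outside of
shepherd: a child's exit status can be collected only once, so
whichever side calls 'waitpid' second gets ECHILD.  The procedure name
'reap-twice' is hypothetical and used only for this illustration:

```scheme
;; Minimal sketch: what happens when the same child is reaped twice.
;; 'reap-twice' is a hypothetical name used only for this illustration.
(define (reap-twice)
  (let ((pid (primitive-fork)))
    (if (zero? pid)
        (primitive-_exit 0)                 ;child: exit right away
        (let ((first (waitpid pid)))        ;parent: blocks, collects the status
          (catch 'system-error
            (lambda ()
              (waitpid pid)                 ;the status was already collected
              'reaped-twice)
            (lambda args
              (if (= ECHILD (system-error-errno args))
                  (cons (status:exit-val (cdr first)) 'echild)
                  (apply throw args))))))))
```

Here the first call plays the role of whoever reaps the process first
and the second call the role of the loser of the race.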

HTH!

Ludo’.




Information forwarded to bug-guix@HIDDEN:
bug#57922; Package guix. Full text available.

Message received at 57922 <at> debbugs.gnu.org:


Received: (at 57922) by debbugs.gnu.org; 20 Sep 2022 07:32:02 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Tue Sep 20 03:32:02 2022
Received: from localhost ([127.0.0.1]:55967 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1oaXje-0000EY-0y
	for submit <at> debbugs.gnu.org; Tue, 20 Sep 2022 03:32:02 -0400
Received: from jpoiret.xyz ([206.189.101.64]:54904)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <dev@HIDDEN>) id 1oaXjb-0000EL-Mb
 for 57922 <at> debbugs.gnu.org; Tue, 20 Sep 2022 03:32:00 -0400
Received: from authenticated-user (jpoiret.xyz [206.189.101.64])
 by jpoiret.xyz (Postfix) with ESMTPA id 225D6184F2B;
 Tue, 20 Sep 2022 07:31:58 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=jpoiret.xyz; s=dkim;
 t=1663659118;
 h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
 to:to:cc:mime-version:mime-version:content-type:content-type:
 in-reply-to:in-reply-to:references:references;
 bh=mT1YjZpcqepo2dF2FR0yNfjgn5RDzMuu0yfYCsoJb48=;
 b=aRMkyh37qQdSRt8ZUj5JuMGFuaF/+XFfUW98xBKHCcmFasFbwTJxrQaBS9KF5Yver7X4um
 9WSK+DxWgXTrKrCqGlUaXaNswGgv+DFKNgRkRHdYQHwXjIgnmLdg/bEFzx09yQzRL6wwMM
 sb1kYwvNPNMFn3gM7J/3qx+eFoGuqYo8etgzWSJEMUmzkrfBAZOTH7OtSQyPJhJ06d4Wdx
 QLBNiIoUCaRl4+9XX1MdMTJSCyY5bK6NXlqg3skXMfOXRK153KrMlmIkm7GMWsHPdGfP3H
 XMuAEpo6+h0gjw9yeQ74espP1QvzwLBkVVItioX4uD/649IRFF7wQej7CYKxeg==
From: Josselin Poiret <dev@HIDDEN>
To: Maxim Cournoyer <maxim.cournoyer@HIDDEN>, 57922 <at> debbugs.gnu.org
Subject: Re: bug#57922: Shepherd doesn't seem to correctly handle waitpid
 itself
In-Reply-To: <874jx4q953.fsf@HIDDEN>
References: <874jx4q953.fsf@HIDDEN>
Date: Tue, 20 Sep 2022 09:31:57 +0200
Message-ID: <87o7va33iq.fsf@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain
Authentication-Results: jpoiret.xyz;
 auth=pass smtp.auth=jpoiret@HIDDEN smtp.mailfrom=dev@HIDDEN
X-Spamd-Bar: /
X-Spam-Score: 2.0 (++)
X-Spam-Report: Spam detection software, running on the system "debbugs.gnu.org",
 has NOT identified this incoming email as spam.  The original
 message has been attached to this so you can view it or label
 similar future email.  If you have any questions, see
 the administrator of that system for details.
 Content preview:  Hi Maxim, Maxim Cournoyer <maxim.cournoyer@HIDDEN> writes:
 > Hi, > > I've tried to determine why a workaround in the jami-service-type
 is > required in the 'stop' slot to avoid failures in 'herd restart jami',
 > and haven't quite found the culprit, but it app [...] 
 Content analysis details:   (2.0 points, 10.0 required)
 pts rule name              description
 ---- ---------------------- --------------------------------------------------
 -0.0 SPF_PASS               SPF: sender matches SPF record
 2.0 PDS_OTHER_BAD_TLD      Untrustworthy TLDs
 [URI: jpoiret.xyz (xyz)]
 -0.0 SPF_HELO_PASS          SPF: HELO matches SPF record
 0.0 FROM_SUSPICIOUS_NTLD   From abused NTLD
X-Debbugs-Envelope-To: 57922
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>

Hi Maxim,

Maxim Cournoyer <maxim.cournoyer@HIDDEN> writes:

> Hi,
>
> I've tried to determine why a workaround in the jami-service-type is
> required in the 'stop' slot to avoid failures in 'herd restart jami',
> and haven't quite found the culprit, but it appears to me that:
>
> 1. waitpid is only called in one place in Shepherd, which is in the
> handle-SIGCHLD procedure in (shepherd service), which does not
> specifically wait for an exact PID but rather does:
>
> (waitpid* WAIT_ANY WNOHANG), which is waitpid with some special handling
> in the case a system-error exception is thrown with an ECHILD or EINTR
> error number.
>
> This doesn't strike me as a strong guarantee that waitpid occurs when
> stop is called, because:
>
> 1. It needs to be installed as the signal handler for each process,
> with something like:
>
> --8<---------------cut here---------------start------------->8---
>   (unless %sigchld-handler-installed?
>     (sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP)
>     (set! %sigchld-handler-installed? #t))
> --8<---------------cut here---------------end--------------->8---
>
> Done for fork+exec-command and make-inetd-forkexec-constructor, but not
> for make-forkexec-constructor/container, AFAICT;

The signal handler is only installed once in PID 1 (in fact, you haven't
forked yet here), since it's the one that receives the SIGCHLD.

What I don't understand that well is that this signal handler could be
installed only once when shepherd starts, right?  That way, it wouldn't
need to depend on specific start actions being chosen.

> 2. It has the WNOHANG flag, which means that stop simply does a kill,
> then the signal handler waits on it only weakly (because of WNOHANG),
> so the start may begin before the process has actually completely
> terminated.
>
> Here's a small reproducer to apply on our code base:
>
> --8<---------------cut here---------------start------------->8---
> modified   gnu/services/telephony.scm
> @@ -685,13 +685,7 @@ (define (archive-name->username archive)
>  
>                      ;; Finally, return the PID of the daemon process.
>                      daemon-pid))
> -               (stop
> -                #~(lambda (pid . args)
> -                    (kill pid SIGKILL)
> -                    ;; Wait for the process to exit; this prevents overlapping
> -                    ;; processes when issuing 'herd restart'.
> -                    (waitpid pid)
> -                    #f))))))))
> +               (stop #~(make-kill-destructor))))))))
>  
>  (define jami-service-type
>    (service-type
> --8<---------------cut here---------------end--------------->8---

The real problem here is not really the WNOHANG flag (you could remove
that and still get issues) but rather that the waitpid is run inside a
signal handler, which in Guile means that it's run through asyncs.  You
have no guarantees wrt. when asyncs run, so they could run after or in
the middle of the next action.  I also think make-kill-destructor should
waitpid the processes it's killing, as you're implying, and leave the
signal handler only for unexpected service crashes.
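
A minimal sketch of that idea in plain Guile, assuming nothing else
reaps the child in the meantime.  'kill-and-wait' and
'demo-kill-and-wait' are hypothetical names, not part of the Shepherd
API, and this is not how make-kill-destructor is actually written:

```scheme
;; Hypothetical helper, not part of the Shepherd API.
(define* (kill-and-wait pid #:optional (signal SIGTERM))
  "Send SIGNAL to PID, block until PID has exited, and return its status."
  (kill pid signal)
  (cdr (waitpid pid)))          ;synchronous: returns only once PID is gone

;; Usage: spawn a child that just sleeps, then stop it synchronously.
;; Returns the number of the signal that terminated the child.
(define (demo-kill-and-wait)
  (let ((pid (primitive-fork)))
    (if (zero? pid)
        (begin (sleep 60) (primitive-_exit 0))   ;child
        (status:term-sig (kill-and-wait pid SIGKILL)))))
```

Because the wait happens in the same call as the kill, a following
'start' cannot overlap with the old process.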

Best,
-- 
Josselin Poiret




Information forwarded to bug-guix@HIDDEN:
bug#57922; Package guix. Full text available.

Message received at submit <at> debbugs.gnu.org:


Received: (at submit) by debbugs.gnu.org; 19 Sep 2022 04:29:52 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Mon Sep 19 00:29:52 2022
Received: from localhost ([127.0.0.1]:51779 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1oa8Po-00083p-4j
	for submit <at> debbugs.gnu.org; Mon, 19 Sep 2022 00:29:52 -0400
Received: from lists.gnu.org ([209.51.188.17]:45858)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <maxim.cournoyer@HIDDEN>) id 1oa8Pl-00083h-FB
 for submit <at> debbugs.gnu.org; Mon, 19 Sep 2022 00:29:50 -0400
Received: from eggs.gnu.org ([2001:470:142:3::10]:45478)
 by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <maxim.cournoyer@HIDDEN>)
 id 1oa8Pl-0006b5-9x
 for bug-guix@HIDDEN; Mon, 19 Sep 2022 00:29:49 -0400
Received: from mail-qv1-xf30.google.com ([2607:f8b0:4864:20::f30]:41614)
 by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128)
 (Exim 4.90_1) (envelope-from <maxim.cournoyer@HIDDEN>)
 id 1oa8Pj-0006u9-PM
 for bug-guix@HIDDEN; Mon, 19 Sep 2022 00:29:49 -0400
Received: by mail-qv1-xf30.google.com with SMTP id l14so12634845qvq.8
 for <bug-guix@HIDDEN>; Sun, 18 Sep 2022 21:29:47 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112;
 h=mime-version:message-id:date:subject:to:from:from:to:cc:subject
 :date; bh=A6w1ujOl8PyJXJquYATugAQsz2k2jgChb30YmyLuPco=;
 b=lBRIrgfzrC9RSjYFs8o9OsLRrn9m+h3tMa0pUtNNP42bopBCXFfZJeUcNAk0e2i/JT
 MaXk5Yn3/JfWFXiziKEhHbDedtikdOWfxapQeXPsYFSUQrPh22S3GGwY4TrDTvbN+Q92
 QPc6rTkaVQg+2wjzkYpIXom87rAUwQVyd04mAjdCNLa7ectiAY7UW45lqIFxBIsYFs+g
 fQHHEPgREUHrWktO9jR2hoyX5ok4kZHXYLztpZv7RrGt4BdytF8uxr76T1QUGAavSb3c
 Qiqu6Gxjm7gdyrgq8ccpDaB/oEwWZgwUeky/Q3J9MFoFb4ivHM3DvaXV3cQLYaA4eI43
 5TUg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20210112;
 h=mime-version:message-id:date:subject:to:from:x-gm-message-state
 :from:to:cc:subject:date;
 bh=A6w1ujOl8PyJXJquYATugAQsz2k2jgChb30YmyLuPco=;
 b=5nTHqU9Q8fnU4UXvqb4OtgR/VdY8TUhTamLEXwKpo2X0WjJok/feZOZY+/d9tcEUPr
 6TQiwL8/EVVYJNd0UteuIzolK+Oj45oEt+FVvolxMLUfkSDi0b4ogHZKlpz3Zpz+vvdh
 fPd7K5HCb726ELBzHFmMWBUnQVEBk/TUhSFVeSms3exCFPIhPtZit5wDFjvsV4GDbmpu
 iRpHpP+wTDn3iRUrtdEYUBJlCpWcYPgtcU/W7yTU4/8Tg+37qd7w4fdFEq5vysUbuIzf
 3npWe5yjN+H8MWACjOtkBYPc+2TmB9WeqFDHxojY3pUcOVlRehWAP13nY9RcsJcuZhmJ
 Uq9g==
X-Gm-Message-State: ACrzQf3fkTGA6ZVNC5hmAaK6Iyfz+xlzfwCvGjlUrQY5m8AIErbTFzsf
 taVpwnmKaciZ3GZtvwTWWmSs8uypIGo=
X-Google-Smtp-Source: AMsMyM5m0SNaiKUywp0fj/sxPxq2OCYh+S4b8Cc8tCfPIKwJID+teukhNVLCm/BHYVqLMquWlmWO6g==
X-Received: by 2002:a05:6214:2aa4:b0:4ac:8848:b251 with SMTP id
 js4-20020a0562142aa400b004ac8848b251mr12932469qvb.55.1663561785980; 
 Sun, 18 Sep 2022 21:29:45 -0700 (PDT)
Received: from hurd (dsl-148-8.b2b2c.ca. [66.158.148.8])
 by smtp.gmail.com with ESMTPSA id
 m5-20020a05620a24c500b006bb366779a4sm12943248qkn.6.2022.09.18.21.29.45
 for <bug-guix@HIDDEN>
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Sun, 18 Sep 2022 21:29:45 -0700 (PDT)
From: Maxim Cournoyer <maxim.cournoyer@HIDDEN>
To: bug-guix <bug-guix@HIDDEN>
Subject: Shepherd doesn't seem to correctly handle waitpid itself
Date: Mon, 19 Sep 2022 00:29:44 -0400
Message-ID: <874jx4q953.fsf@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain
Received-SPF: pass client-ip=2607:f8b0:4864:20::f30;
 envelope-from=maxim.cournoyer@HIDDEN; helo=mail-qv1-xf30.google.com
X-Spam_score_int: -20
X-Spam_score: -2.1
X-Spam_bar: --
X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1,
 DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001,
 RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_NONE=0.001,
 SPF_PASS=-0.001 autolearn=ham autolearn_force=no
X-Spam_action: no action
X-Spam-Score: -1.3 (-)
X-Debbugs-Envelope-To: submit
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -2.3 (--)

Hi,

I've tried to determine why a workaround in the jami-service-type is
required in the 'stop' slot to avoid failures in 'herd restart jami',
and haven't quite found the culprit, but it appears to me that:

1. waitpid is only called in one place in Shepherd, which is in the
handle-SIGCHLD procedure in (shepherd service), which does not
specifically wait for an exact PID but rather does:

(waitpid* WAIT_ANY WNOHANG), which is waitpid with some special handling
in the case a system-error exception is thrown with an ECHILD or EINTR
error number.

This doesn't strike me as a strong guarantee that waitpid occurs when
stop is called, because:

1. It needs to be installed as the signal handler for each process,
with something like:

--8<---------------cut here---------------start------------->8---
  (unless %sigchld-handler-installed?
    (sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP)
    (set! %sigchld-handler-installed? #t))
--8<---------------cut here---------------end--------------->8---

Done for fork+exec-command and make-inetd-forkexec-constructor, but not
for make-forkexec-constructor/container, AFAICT;

2. It has the WNOHANG flag, which means that stop simply does a kill,
then the signal handler waits on it only weakly (because of WNOHANG),
so the start may begin before the process has actually completely
terminated.
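
To make the WNOHANG behavior concrete, here is a small sketch in plain
Guile.  'waitpid/no-hang' is a hypothetical wrapper written in the
spirit of the waitpid* call described above; it is not Shepherd's
actual definition, and 'demo-no-hang' is likewise only for
illustration:

```scheme
;; Hypothetical wrapper; not Shepherd's actual 'waitpid*'.
(define (waitpid/no-hang what)
  "Poll for an exited child matching WHAT without blocking.  Return
(0 . #f) when the call fails with ECHILD or EINTR."
  (catch 'system-error
    (lambda ()
      (waitpid what WNOHANG))
    (lambda args
      (if (memv (system-error-errno args) (list ECHILD EINTR))
          '(0 . #f)
          (apply throw args)))))

(define (demo-no-hang)
  (let ((pid (primitive-fork)))
    (if (zero? pid)
        (begin (sleep 60) (primitive-_exit 0))    ;child: stays alive
        (let ((early (waitpid/no-hang pid)))      ;child is still running
          (kill pid SIGKILL)
          (let ((late (waitpid pid)))             ;blocking wait reaps it
            (list (car early) (= (car late) pid)))))))
```

The non-blocking call returns immediately with a PID of 0 while the
child is still running; only the blocking call guarantees that the
process is gone when it returns.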

Here's a small reproducer to apply on our code base:

--8<---------------cut here---------------start------------->8---
modified   gnu/services/telephony.scm
@@ -685,13 +685,7 @@ (define (archive-name->username archive)
 
                     ;; Finally, return the PID of the daemon process.
                     daemon-pid))
-               (stop
-                #~(lambda (pid . args)
-                    (kill pid SIGKILL)
-                    ;; Wait for the process to exit; this prevents overlapping
-                    ;; processes when issuing 'herd restart'.
-                    (waitpid pid)
-                    #f))))))))
+               (stop #~(make-kill-destructor))))))))
 
 (define jami-service-type
   (service-type
--8<---------------cut here---------------end--------------->8---

Then run 'make check-system TESTS=jami-provisioning' to see new
failures, or, if you want to investigate the system manually:

--8<---------------cut here---------------start------------->8---
$ ./pre-inst-env guix system vm --no-grafts --no-offload --no-graphic \
   -e '(@@ (gnu tests telephony) %jami-os-provisioning)'

$ /gnu/store/rxi7c14hga62qslb0sr6nac9qnkxr0nn-run-vm.sh -m 1G -smp 4 \
  -nic user,model=virtio-net-pci,hostfwd=tcp::10022-:22

# Connect to the QEMU VM:
$ ssh root@localhost -p10022

root@jami ~# herd restart jami
Service jami has been stopped.
herd: exception caught while executing 'start' on service 'jami':
dbus "method failed with error" "org.freedesktop.DBus.Error.NoReply" ("Message recipient disconnected from message bus without replying")
root@jami ~# herd status jami
Status of jami:
  It is stopped.
  It is enabled.
  Provides (jami).
  Requires (jami-dbus-session).
  Conflicts with ().
  Will be respawned.
root@jami ~# pgrep jami
--8<---------------cut here---------------end--------------->8---

Thanks,

Maxim




Acknowledgement sent to Maxim Cournoyer <maxim.cournoyer@HIDDEN>:
New bug report received and forwarded. Copy sent to bug-guix@HIDDEN. Full text available.
Report forwarded to bug-guix@HIDDEN:
bug#57922; Package guix. Full text available.
Last modified: Sat, 24 Sep 2022 16:45:01 UTC

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997 nCipher Corporation Ltd, 1994-97 Ian Jackson.