From: ludo@HIDDEN (Ludovic Courtès)
To: Mark H Weaver <mhw@HIDDEN>
Cc: 33410 <at> debbugs.gnu.org
Subject: Re: bug#33410: Offloaded builds can get stuck indefinitely due to network issues
Date: Sat, 17 Nov 2018 15:21:32 +0100
Message-ID: <87efbjokcz.fsf@HIDDEN>
In-Reply-To: <87a7m8xs42.fsf@HIDDEN> (Mark H. Weaver's message of "Fri, 16 Nov 2018 23:08:50 -0500")

Hello,

Mark H Weaver <mhw@HIDDEN> skribis:

> I just discovered that 4 out of 5 armhf build slots on Hydra have been
> stuck for 24 hours, apparently after the network connections to the
> build slaves were lost, possibly due to a temporary network outage.
>
> I've seen this kind of thing happen periodically since we switched to
> using guile-ssh for offloaded builds.

Which guix-daemon version is hydra running?

Commit a708de151c255712071e42e5c8284756b51768cd adds a safeguard to make
sure timeouts are honored, though there might be some cases where it
doesn’t quite work as expected (I suspect libssh handles EINTR
internally by looping, in which case our signal handling async doesn’t
get a chance to run.)

> On Hydra I can monitor the builds and investigate when a given build
> seems to be taking far too long, and I can kill those jobs to free up
> the build slots. There's no way to kill the builds from Hydra's web
> interface, but I can kill them manually by logging into Hydra.
>
> This might become a more serious problem on Berlin, as we add ARM build
> slaves that are not on the same local network as Berlin itself, until
> the web interface allows for this kind of monitoring and intervention.

The current situation on berlin is suboptimal: I run ‘guix processes’
when I suspect something is wrong, and that’s how I found about
<https://issues.guix.info/issue/33239>.

Thanks,
Ludo’.
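The EINTR concern in the message above can be illustrated with a small sketch. This is not code from Guix, guile-ssh, or libssh: `call_with_deadline`, `make_flaky_read`, and `DeadlineExceeded` are hypothetical names, and the fake read only simulates a blocking call that a signal keeps interrupting. The point is that a loop which blindly retries on EINTR hides the interruption from its caller, whereas a loop that also checks a deadline still lets the timeout take effect.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when an operation is still being retried after its deadline."""


def call_with_deadline(operation, timeout, clock=time.monotonic):
    """Retry `operation` on EINTR, but give up after `timeout` seconds.

    `operation` is a hypothetical zero-argument callable standing in for one
    blocking read; it raises InterruptedError when a signal interrupts it.
    """
    deadline = clock() + timeout
    while True:
        try:
            return operation()
        except InterruptedError:
            # A bare `continue` here would retry forever and hide the signal
            # from the caller; checking the clock keeps the timeout effective.
            if clock() >= deadline:
                raise DeadlineExceeded(f"no result after {timeout} s")


def make_flaky_read(interruptions, result=b"done"):
    """Build a fake read that is interrupted `interruptions` times, then succeeds."""
    remaining = [interruptions]

    def read():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise InterruptedError("simulated EINTR")
        return result

    return read
```

For example, `call_with_deadline(make_flaky_read(3), timeout=60)` returns the simulated data after three retries, while an operation that is interrupted on every attempt ends with `DeadlineExceeded` instead of looping indefinitely.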
bug-guix@HIDDEN: bug#33410; Package guix.
From: Mark H Weaver <mhw@HIDDEN>
To: bug-guix@HIDDEN
Subject: Offloaded builds can get stuck indefinitely due to network issues
Date: Fri, 16 Nov 2018 23:08:50 -0500
Message-ID: <87a7m8xs42.fsf@HIDDEN>

I just discovered that 4 out of 5 armhf build slots on Hydra have been
stuck for 24 hours, apparently after the network connections to the
build slaves were lost, possibly due to a temporary network outage.

I've seen this kind of thing happen periodically since we switched to
using guile-ssh for offloaded builds.

On Hydra I can monitor the builds and investigate when a given build
seems to be taking far too long, and I can kill those jobs to free up
the build slots. There's no way to kill the builds from Hydra's web
interface, but I can kill them manually by logging into Hydra.

This might become a more serious problem on Berlin, as we add ARM build
slaves that are not on the same local network as Berlin itself, until
the web interface allows for this kind of monitoring and intervention.

Mark
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997 nCipher Corporation Ltd,
1994-97 Ian Jackson.