GNU bug report logs - #33410
Offloaded builds can get stuck indefinitely due to network issues

Please note: This is a static page, with minimal formatting, updated once a day.
Click here to see this page with the latest information and nicer formatting.

Package: guix; Reported by: Mark H Weaver <mhw@HIDDEN>; dated Sat, 17 Nov 2018 04:10:01 UTC; Maintainer for guix is bug-guix@HIDDEN.

Message received at 33410 <at> debbugs.gnu.org:


From: ludo@HIDDEN (Ludovic Courtès)
To: Mark H Weaver <mhw@HIDDEN>
Subject: Re: bug#33410: Offloaded builds can get stuck indefinitely due to
 network issues
References: <87a7m8xs42.fsf@HIDDEN>
Date: Sat, 17 Nov 2018 15:21:32 +0100
In-Reply-To: <87a7m8xs42.fsf@HIDDEN> (Mark H. Weaver's message of "Fri, 16
 Nov 2018 23:08:50 -0500")
Message-ID: <87efbjokcz.fsf@HIDDEN>
Cc: 33410 <at> debbugs.gnu.org

Hello,

Mark H Weaver <mhw@HIDDEN> skribis:

> I just discovered that 4 out of 5 armhf build slots on Hydra have been
> stuck for 24 hours, apparently after the network connections to the
> build slaves were lost, possibly due to a temporary network outage.
>
> I've seen this kind of thing happen periodically since we switched to
> using guile-ssh for offloaded builds.

Which guix-daemon version is hydra running?

Commit a708de151c255712071e42e5c8284756b51768cd adds a safeguard to make
sure timeouts are honored, though there might be some cases where it
doesn’t quite work as expected (I suspect libssh handles EINTR
internally by looping, in which case our signal handling async doesn’t
get a chance to run.)
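
To illustrate the suspicion above, here is a minimal Python simulation.
It is not libssh or Guile code, and every name in it (Deadline,
make_read, read_retrying, read_with_deadline) is hypothetical.  It
assumes that a signal only marks a timeout as pending, and that the
code which would act on it runs only once control returns to the
caller.  A loop that retries on EINTR internally never gives it that
chance.

```python
import errno


class Deadline:
    """Stands in for a timeout whose handler is queued but has not run yet."""

    def __init__(self):
        self.pending = False


def make_read(deadline, interruptions):
    """Build a fake blocking read that fails with EINTR `interruptions` times.

    Each interruption marks the deadline as pending, the way a low-level
    signal handler might only record that a timeout fired.
    """
    remaining = interruptions

    def read():
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            deadline.pending = True
            raise InterruptedError(errno.EINTR, "Interrupted system call")
        return b"data"

    return read


def read_retrying(read):
    """Library-style loop: retry on EINTR and consult nothing else."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return read(), attempts
        except InterruptedError:
            continue  # the pending timeout is never examined


def read_with_deadline(read, deadline):
    """Caller-aware loop: retry on EINTR unless the timeout has fired."""
    while True:
        try:
            return read()
        except InterruptedError:
            if deadline.pending:
                raise TimeoutError("deadline expired during read") from None
```

With read_retrying, the timeout is recorded on the first interruption
yet the call carries on until the read succeeds; if the read never
succeeded, as with a dead connection, it would never return.
read_with_deadline gives up as soon as it sees the pending timeout.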

> On Hydra I can monitor the builds and investigate when a given build
> seems to be taking far too long, and I can kill those jobs to free up
> the build slots.  There's no way to kill the builds from Hydra's web
> interface, but I can kill them manually by logging into Hydra.
>
> This might become a more serious problem on Berlin, as we add ARM build
> slaves that are not on the same local network as Berlin itself, until
> the web interface allows for this kind of monitoring and intervention.

The current situation on berlin is suboptimal: I run ‘guix processes’
when I suspect something is wrong, and that’s how I found out about
<https://issues.guix.info/issue/33239>.

Thanks,
Ludo’.




Information forwarded to bug-guix@HIDDEN:
bug#33410; Package guix. Full text available.

Message received at submit <at> debbugs.gnu.org:


From: Mark H Weaver <mhw@HIDDEN>
To: bug-guix@HIDDEN
Subject: Offloaded builds can get stuck indefinitely due to network issues
Date: Fri, 16 Nov 2018 23:08:50 -0500
Message-ID: <87a7m8xs42.fsf@HIDDEN>

I just discovered that 4 out of 5 armhf build slots on Hydra have been
stuck for 24 hours, apparently after the network connections to the
build slaves were lost, possibly due to a temporary network outage.

I've seen this kind of thing happen periodically since we switched to
using guile-ssh for offloaded builds.

On Hydra I can monitor the builds and investigate when a given build
seems to be taking far too long, and I can kill those jobs to free up
the build slots.  There's no way to kill the builds from Hydra's web
interface, but I can kill them manually by logging into Hydra.

This might become a more serious problem on Berlin, as we add ARM build
slaves that are not on the same local network as Berlin itself, until
the web interface allows for this kind of monitoring and intervention.

      Mark




Acknowledgement sent to Mark H Weaver <mhw@HIDDEN>:
New bug report received and forwarded. Copy sent to bug-guix@HIDDEN. Full text available.
Report forwarded to bug-guix@HIDDEN:
bug#33410; Package guix. Full text available.
Last modified: Mon, 25 Nov 2019 12:00:02 UTC

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997 nCipher Corporation Ltd, 1994-97 Ian Jackson.