GNU bug report logs - #63412
Topological sorting in cuirass

Previous Next

Package: guix;

Reported by: Andreas Enge <andreas <at> enge.fr>

Date: Wed, 10 May 2023 10:07:02 UTC

Severity: normal

To reply to this bug, email your comments to 63412 AT debbugs.gnu.org.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-guix <at> gnu.org:
bug#63412; Package guix. (Wed, 10 May 2023 10:07:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Andreas Enge <andreas <at> enge.fr>:
New bug report received and forwarded. Copy sent to bug-guix <at> gnu.org. (Wed, 10 May 2023 10:07:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Andreas Enge <andreas <at> enge.fr>
To: bug-guix <at> gnu.org
Subject: Topological sorting in cuirass
Date: Wed, 10 May 2023 12:06:28 +0200
This is a wishlist bug, but it is important for architectures where we
are currently short on build power, and where this issue can stall builds
and waste an arbitrary amount of build power.

Cuirass should sort builds and only offload derivations for which all
inputs are available.

In my current understanding, cuirass offloads arbitrary derivations, and
the machine to which they are offloaded then starts building recursively
all inputs. If this is true, then it is possible that at some point in time,
all build slots are taken by the same package built as many times as there
are machines; I have seen something like this when working on core-updates,
where several machines were building the main gcc compiler at the same time.
At worse, if cuirass asks every machine to build a leaf package, this may
result in a simultaneous full bootstrap on all of them.

The situation becomes worse when the package in question fails. Then as
I understand it, each machine may receive a request to build something
depending on the failing package and try the failing build and thus waste
build power that will not be available to build other packages successfully.

Solving this problem may also make reports of build failures more accurate
and legible. For instance, doxygen currently fails to build on aarch64:
   https://ci.guix.gnu.org/build/969427/details
and is reported as "Failed", and not as "Failed (dependency)".
However, looking at the build log
   https://ci.guix.gnu.org/build/969427/log/raw
shows this:
...
building path(s) `/gnu/store/p5vqrwywz053r1vkiyw54dp9gj7vw9xd-ninja-1.11.1'
...
builder for `/gnu/store/0zf7fqndzf2k595r4s6wblmpccdwr3nx-ninja-1.11.1.drv' failed with exit code 1
@ build-failed /gnu/store/0zf7fqndzf2k595r4s6wblmpccdwr3nx-ninja-1.11.1.drv - 1 builder for `/gnu/store/0zf7fqndzf2k595r4s6wblmpccdwr3nx-ninja-1.11.1.drv' failed with exit code 1
cannot build derivation `/gnu/store/hlscqram59id51hxg0fj15041v52h1kw-meson-1.1.0.drv': 1 dependencies couldn't be built
cannot build derivation `/gnu/store/w8qxkrwpffd9qs5w1jggy1yi27ycm0xr-jsoncpp-1.9.5.drv': 2 dependencies couldn't be built
cannot build derivation `/gnu/store/mss4yv015cil1vnjnglq506m83b7n3dy-cmake-bootstrap-3.24.2.drv': 1 dependencies couldn't be built
cannot build derivation `/gnu/store/w0irp6xn30nlmpizhcbjnvhqmsba41jn-cmake-minimal-3.24.2.drv': 2 dependencies couldn't be built
cannot build derivation `/gnu/store/rqk2rbnpjpcnqswz8hqari1rnw6r8v1m-doxygen-1.9.5.drv': 1 dependencies couldn't be built

So it is indeed a different package that fails (and the last few lines
give a list of dependencies between ninja and doxygen, each of which may
or may not fail once ninja is fixed).

Notice that this could be solved without a topological sorting of the
dependency graph: It would be enough to keep an array deriv in which
deriv[i] contains a list of derivations requiring i more inputs to be built,
together with the list of inputs; elements in deriv[0] are ready to be sent
to a build machine, and upon completion of a build, all derivations
depending on it should be moved from deriv[i] to deriv[i-1] if the input
has been built successfully, or marked as "Failed (dependency)" if the
input has failed. (But this could be expensive, and may require appropriate
data structures.)

Alternatively, build jobs could be sorted topologically and then be kept
in a list; then before sending out a job, all its inputs have been tried
to be built; the job should then be sent if all inputs are available, or
be marked as "Failed (dependency)" if any of them has failed.

Andreas





Information forwarded to bug-guix <at> gnu.org:
bug#63412; Package guix. (Wed, 10 May 2023 14:03:01 GMT) Full text and rfc822 format available.

Message #8 received at submit <at> debbugs.gnu.org (full text, mbox):

From: "Dr. Arne Babenhauserheide" <arne_bab <at> web.de>
To: Andreas Enge <andreas <at> enge.fr>
Cc: 63412 <at> debbugs.gnu.org, bug-guix <at> gnu.org
Subject: Re: bug#63412: Topological sorting in cuirass
Date: Wed, 10 May 2023 15:59:45 +0200
[Message part 1 (text/plain, inline)]
Andreas Enge <andreas <at> enge.fr> writes:

> Cuirass should sort builds and only offload derivations for which all
> inputs are available.
...
> Alternatively, build jobs could be sorted topologically and then be kept
> in a list; then before sending out a job, all its inputs have been tried
> to be built; the job should then be sent if all inputs are available, or
> be marked as "Failed (dependency)" if any of them has failed.

If you want to try this out quickly, you could use code from the
topological sorting SRFI I’m slowly finalizing:

https://srfi.schemers.org/srfi-234/srfi-234.html

For usage see: https://github.com/scheme-requests-for-implementation/srfi-234/blob/main/srfi-234-test.scm

Code: https://github.com/scheme-requests-for-implementation/srfi-234/blob/main/srfi/234-impl.scm

edgelist->graph should do the conversion from inputs per package to the
input format.

Best wishes,
Arne
-- 
Unpolitisch sein
heißt politisch sein,
ohne es zu merken.
draketo.de
[signature.asc (application/pgp-signature, inline)]

Information forwarded to bug-guix <at> gnu.org:
bug#63412; Package guix. (Wed, 10 May 2023 14:03:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-guix <at> gnu.org:
bug#63412; Package guix. (Sun, 29 Oct 2023 17:03:02 GMT) Full text and rfc822 format available.

Message #14 received at 63412 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: Andreas Enge <andreas <at> enge.fr>
Cc: 63412 <at> debbugs.gnu.org
Subject: Re: bug#63412: Topological sorting in cuirass
Date: Sun, 29 Oct 2023 18:01:20 +0100
Hi,

Andreas Enge <andreas <at> enge.fr> skribis:

> This is a wishlist bug, but it is important for architectures where we
> are currently short on build power, and where this issue can stall builds
> and waste an arbitrary amount of build power.
>
> Cuirass should sort builds and only offload derivations for which all
> inputs are available.

Cuirass has two build backends: talking to the build daemon, and using
the ZeroMQ-based “remote worker” protocol.  In practice we use the
latter, which fixes scalability issues with the former.

The worker protocol implements work stealing: workers periodically send
messages to “remote server” asking for work.  Said server replies with a
derivation for one of the systems the worker supports; that derivation
is chosen among the “builds” whose dependencies have all been
successfully built.

And here’s the trick: the server is doing the right thing, but it has a
partial view.  Namely, the server sees “builds” rather than
“derivations”.  “Builds” are the things explicitly declared in the
jobset.  If you have a declared build for GCC, but no build for MPC and
MPFR, then the server will consider that GCC has zero non-built
dependencies, even though MPFR and MPC may still need to be built.

Having ‘remote-worker’ operate on derivations rather than builds would
address this impedance mismatch, though there are complications (there
are bits of the database schema that amalgamate build/derivation).

Food for thought!

Ludo’.




This bug report was last modified 1 year and 29 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.