GNU bug report logs - #30644
Cuirass runs out of build users

Previous Next

Package: guix;

Reported by: Andreas Enge <andreas <at> enge.fr>

Date: Wed, 28 Feb 2018 09:08:01 UTC

Severity: normal

Done: Mathieu Othacehe <othacehe <at> gnu.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 30644 in the body.
You can then email your comments to 30644 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-guix <at> gnu.org:
bug#30644; Package guix. (Wed, 28 Feb 2018 09:08:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to Andreas Enge <andreas <at> enge.fr>:
New bug report received and forwarded. Copy sent to bug-guix <at> gnu.org. (Wed, 28 Feb 2018 09:08:01 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Andreas Enge <andreas <at> enge.fr>
To: bug-guix <at> gnu.org
Subject: Cuirass runs out of build users
Date: Wed, 28 Feb 2018 10:07:14 +0100
Hello,

in trying out the new bayfront machine, I raised the number of concurrent
build jobs to 10; the number of build accounts is programmatically set to
twice this number.

Looking at the build directories in /tmp, 9 build jobs were started on
February 26 at around 21:50, a 10th one at 22:24.

Later, I have been getting messages such as these:
2018-02-27T18:31:13 building 14491 derivations in batches of 200
2018-02-27T18:31:13 building batch of 200 jobs (0/14491)
2018-02-27T18:31:13 build failed: '/gnu/store/3qy3s1dy3nnb108ifz9yiscs687g0xkh-gfortran-5.5.0.drv'
2018-02-27T18:31:13 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
2018-02-27T18:31:14 build failed: '/gnu/store/w56wpx4jar88hfabipvf6gd6hz81p6wk-python2-gevent-1.2.2.drv'
2018-02-27T18:31:14 build failed: '/gnu/store/i2zmgbsi599pxskqh15s38fpms359jp7-gcc-cross-sans-libc-arm-linux-gnueabihf-5.5.0.drv'
2018-02-27T18:31:14 build failed: '/gnu/store/qd59gqza3ly4rk2xjf40q2539rjgzh5r-gcc-cross-sans-libc-powerpc-linux-gnu-5.5.0.drv'
2018-02-27T18:31:14 build failed: '/gnu/store/3qy3s1dy3nnb108ifz9yiscs687g0xkh-gfortran-5.5.0.drv'
2018-02-27T18:31:14 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
2018-02-27T18:31:15 build started: '/gnu/store/08pjxws1pd04syqf28d3hhsfrb946bpc-dzen-0.9.5-1.488ab66.drv'
2018-02-27T18:31:15 build failed: '/gnu/store/3qy3s1dy3nnb108ifz9yiscs687g0xkh-gfortran-5.5.0.drv'
2018-02-27T18:31:15 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
2018-02-27T18:31:15 build started: '/gnu/store/01r4xm1nqixcxjbbn6xakpg3m5wkr12v-po4a-0.47.drv'
2018-02-27T18:31:15 build failed: '/gnu/store/d545d4jzw0mfa61d9ml3vygchr93xp51-python-gevent-1.2.2.drv'
2018-02-27T18:31:15 build started: '/gnu/store/02chd3k5qc8p8fc46j4w955cfld77cw1-spinsim-0.75-1.66915a7ad.drv'
2018-02-27T18:31:15 build started: '/gnu/store/81b2340psschvvk7ybhnz6nx9jkpdlwa-Net-SMTP-SSL-1.04.tar.gz.drv'
2018-02-27T18:31:15 build started: '/gnu/store/rkkacywdj7yya3wf51i141x5khzld4zl-clx-0.0.0-1.tar.xz.drv'
2018-02-27T18:31:16 build started: '/gnu/store/286gkv0z6g9fb4frcx6yp2fg9v6iwk3b-ecl-alexandria-0.0.0-1.926a066.drv'
2018-02-27T18:31:16 build started: '/gnu/store/hmy3ds71gbv1mspc9p5yvvhz5pibs8wl-elixir-1.5.3.tar.xz.drv'
2018-02-27T18:31:16 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
2018-02-27T18:31:16 build started: '/gnu/store/y0p2myl0al8x3rlbxphgyjk83zl3kmb5-idris-mode-0.9.19.tar.drv'
2018-02-27T18:31:16 batch of builds (partially) failed:all build users are currently in use; consider creating additional users and adding them to the `guixbuild' group (status: 1)

So it looks as if build jobs get stalled, but somehow do not free their
build users.

Hm, the directories in /tmp are created by guixbuilder01 to guixbuilder10.
10 is the default value, but it should have been overwritten.

And in /etc/passwd, I see 20 users. But apparently, they are not used!

Andreas





Information forwarded to bug-guix <at> gnu.org:
bug#30644; Package guix. (Thu, 01 Mar 2018 14:02:01 GMT) Full text and rfc822 format available.

Message #8 received at 30644 <at> debbugs.gnu.org (full text, mbox):

From: ludo <at> gnu.org (Ludovic Courtès)
To: Andreas Enge <andreas <at> enge.fr>
Cc: 30644 <at> debbugs.gnu.org
Subject: Re: bug#30644: Cuirass runs out of build users
Date: Thu, 01 Mar 2018 15:01:30 +0100
Hello,

Andreas Enge <andreas <at> enge.fr> skribis:

> in trying out the new bayfront machine, I raised the number of concurrent
> build jobs to 10; the number of build accounts is programmatically set to
> twice this number.

Indeed, (sysadmin services) in guix-maintenance.git does that.

> 2018-02-27T18:31:16 batch of builds (partially) failed:all build users are currently in use; consider creating additional users and adding them to the `guixbuild' group (status: 1)

[...]

> Hm, the directories in /tmp are created by guixbuilder01 to guixbuilder10.
> 10 is the default value, but it should have been overwritten.
>
> And in /etc/passwd, I see 20 users. But apparently, they are not used!

There’s something fishy here.  What does “getent group guixbuild”
return?

Thanks,
Ludo’.




Information forwarded to bug-guix <at> gnu.org:
bug#30644; Package guix. (Thu, 01 Mar 2018 20:36:02 GMT) Full text and rfc822 format available.

Message #11 received at 30644 <at> debbugs.gnu.org (full text, mbox):

From: Andreas Enge <andreas <at> enge.fr>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: 30644 <at> debbugs.gnu.org
Subject: Re: bug#30644: Cuirass runs out of build users
Date: Thu, 1 Mar 2018 21:35:43 +0100
On Thu, Mar 01, 2018 at 03:01:30PM +0100, Ludovic Courtès wrote:
> > Hm, the directories in /tmp are created by guixbuilder01 to guixbuilder10.
> > 10 is the default value, but it should have been overwritten.
> >
> > And in /etc/passwd, I see 20 users. But apparently, they are not used!
> 
> There’s something fishy here.  What does “getent group guixbuild”
> return?

All 20 build users:
# getent group guixbuild
guixbuild:x:30000:guixbuilder01,guixbuilder02,guixbuilder03,guixbuilder04,guixbuilder05,guixbuilder06,guixbuilder07,guixbuilder08,guixbuilder09,guixbuilder10,guixbuilder11,guixbuilder13,guixbuilder14,guixbuilder15,guixbuilder16,guixbuilder17,guixbuilder18,guixbuilder20

I will deploy the new cuirass with the other bug fixed, and report back if
the problem described here is still present.

Andreas





Information forwarded to bug-guix <at> gnu.org:
bug#30644; Package guix. (Fri, 02 Mar 2018 14:09:02 GMT) Full text and rfc822 format available.

Message #14 received at 30644 <at> debbugs.gnu.org (full text, mbox):

From: Efraim Flashner <efraim <at> flashner.co.il>
To: Andreas Enge <andreas <at> enge.fr>
Cc: 30644 <at> debbugs.gnu.org
Subject: Re: bug#30644: Cuirass runs out of build users
Date: Fri, 2 Mar 2018 16:08:33 +0200
[Message part 1 (text/plain, inline)]
On Wed, Feb 28, 2018 at 10:07:14AM +0100, Andreas Enge wrote:
> Hello,
> 
> in trying out the new bayfront machine, I raised the number of concurrent
> build jobs to 10; the number of build accounts is programmatically set to
> twice this number.
> 
> Looking at the build directories in /tmp, 9 build jobs were started on
> February 26 at around 21:50, a 10th one at 22:24.
> 
> Later, I have been getting messages such as these:
> 2018-02-27T18:31:13 building 14491 derivations in batches of 200
> 2018-02-27T18:31:13 building batch of 200 jobs (0/14491)
> 2018-02-27T18:31:13 build failed: '/gnu/store/3qy3s1dy3nnb108ifz9yiscs687g0xkh-gfortran-5.5.0.drv'
> 2018-02-27T18:31:13 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-02-27T18:31:14 build failed: '/gnu/store/w56wpx4jar88hfabipvf6gd6hz81p6wk-python2-gevent-1.2.2.drv'
> 2018-02-27T18:31:14 build failed: '/gnu/store/i2zmgbsi599pxskqh15s38fpms359jp7-gcc-cross-sans-libc-arm-linux-gnueabihf-5.5.0.drv'
> 2018-02-27T18:31:14 build failed: '/gnu/store/qd59gqza3ly4rk2xjf40q2539rjgzh5r-gcc-cross-sans-libc-powerpc-linux-gnu-5.5.0.drv'
> 2018-02-27T18:31:14 build failed: '/gnu/store/3qy3s1dy3nnb108ifz9yiscs687g0xkh-gfortran-5.5.0.drv'
> 2018-02-27T18:31:14 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-02-27T18:31:15 build started: '/gnu/store/08pjxws1pd04syqf28d3hhsfrb946bpc-dzen-0.9.5-1.488ab66.drv'
> 2018-02-27T18:31:15 build failed: '/gnu/store/3qy3s1dy3nnb108ifz9yiscs687g0xkh-gfortran-5.5.0.drv'
> 2018-02-27T18:31:15 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-02-27T18:31:15 build started: '/gnu/store/01r4xm1nqixcxjbbn6xakpg3m5wkr12v-po4a-0.47.drv'
> 2018-02-27T18:31:15 build failed: '/gnu/store/d545d4jzw0mfa61d9ml3vygchr93xp51-python-gevent-1.2.2.drv'
> 2018-02-27T18:31:15 build started: '/gnu/store/02chd3k5qc8p8fc46j4w955cfld77cw1-spinsim-0.75-1.66915a7ad.drv'
> 2018-02-27T18:31:15 build started: '/gnu/store/81b2340psschvvk7ybhnz6nx9jkpdlwa-Net-SMTP-SSL-1.04.tar.gz.drv'
> 2018-02-27T18:31:15 build started: '/gnu/store/rkkacywdj7yya3wf51i141x5khzld4zl-clx-0.0.0-1.tar.xz.drv'
> 2018-02-27T18:31:16 build started: '/gnu/store/286gkv0z6g9fb4frcx6yp2fg9v6iwk3b-ecl-alexandria-0.0.0-1.926a066.drv'
> 2018-02-27T18:31:16 build started: '/gnu/store/hmy3ds71gbv1mspc9p5yvvhz5pibs8wl-elixir-1.5.3.tar.xz.drv'
> 2018-02-27T18:31:16 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-02-27T18:31:16 build started: '/gnu/store/y0p2myl0al8x3rlbxphgyjk83zl3kmb5-idris-mode-0.9.19.tar.drv'
> 2018-02-27T18:31:16 batch of builds (partially) failed:all build users are currently in use; consider creating additional users and adding them to the `guixbuild' group (status: 1)
> 
> So it looks as if build jobs get stalled, but somehow do not free their
> build users.
> 
> Hm, the directories in /tmp are created by guixbuilder01 to guixbuilder10.
> 10 is the default value, but it should have been overwritten.
> 
> And in /etc/passwd, I see 20 users. But apparently, they are not used!
> 
> Andreas
> 

When I use a shell script to build all the packages, if a package hangs
and gets eaten by the timout-limit then it holds the build user and
space in /tmp until all 10 builders are taken and building stops or all
the builds finish and then it releases them.

-- 
Efraim Flashner   <efraim <at> flashner.co.il>   אפרים פלשנר
GPG key = A28B F40C 3E55 1372 662D  14F7 41AA E7DC CA3D 8351
Confidentiality cannot be guaranteed on emails sent or received unencrypted
[signature.asc (application/pgp-signature, inline)]

Information forwarded to bug-guix <at> gnu.org:
bug#30644; Package guix. (Mon, 05 Mar 2018 09:04:02 GMT) Full text and rfc822 format available.

Message #17 received at 30644 <at> debbugs.gnu.org (full text, mbox):

From: Andreas Enge <andreas <at> enge.fr>
To: Efraim Flashner <efraim <at> flashner.co.il>
Cc: 30644 <at> debbugs.gnu.org
Subject: Re: bug#30644: Cuirass runs out of build users
Date: Mon, 5 Mar 2018 10:02:56 +0100
On Fri, Mar 02, 2018 at 04:08:33PM +0200, Efraim Flashner wrote:
> When I use a shell script to build all the packages, if a package hangs
> and gets eaten by the timout-limit then it holds the build user and
> space in /tmp until all 10 builders are taken and building stops or all
> the builds finish and then it releases them.

Right now, I observe something else: Only one package is built at a time,
taking space in /tmp. But it is not the same build user; I saw one package
with guixbuilder06, and now it is building one package with guixbuilder01.

Is there a way to check which builders are considered taken? A lock file
somewhere?

Andreas





Information forwarded to bug-guix <at> gnu.org:
bug#30644; Package guix. (Mon, 05 Mar 2018 15:55:02 GMT) Full text and rfc822 format available.

Message #20 received at 30644 <at> debbugs.gnu.org (full text, mbox):

From: Andreas Enge <andreas <at> enge.fr>
To: Efraim Flashner <efraim <at> flashner.co.il>
Cc: 30644 <at> debbugs.gnu.org
Subject: Re: bug#30644: Cuirass runs out of build users
Date: Mon, 5 Mar 2018 16:53:51 +0100
[Message part 1 (text/plain, inline)]
Right now, I do not seem to be running out of build users, but I think
that cuirass nevertheless behaves strangely; so I am asking here instead
of opening a new bug (maybe there is no bug, actually).

I reconfigured bayfront this morning with the latest git checkout.
Right now, there are only two packages built in /tmp, if at all:
$ ll /tmp/
total 8
drwx------ 4 guixbuilder02 guixbuild 4096 Mar  5 16:06 guix-build-guile-emacs-20150512.41120e0.drv-0/
drwx------ 5 guixbuilder01 guixbuild 4096 Mar  5 15:05 guix-build-r-mgcv-1.8-23.drv-0/
(It is now 16:54, but "top" shows emacs and r activity, so my guess
is these packages are being built.)
But in any case, we are far from the 10 jobs I configured.

I am attaching the result of
$ grep -v evaluate cuirass.log|grep -v ^\`/gnu/store|grep -v spec|grep -v "next evaluation"|grep -v heap > ~/purged.log
which contains the non-trivial lines of /var/log/cuirass.log.

There is still one occurrence of this:
In ice-9/boot-9.scm:
    829:9  7 (catch _ _ #<procedure 7f49ffb6c9e0 at cuirass/base.sc?> ?)
In cuirass/base.scm:
   616:26  6 (_)
    548:2  5 (build-packages _ #<<sqlite-db> pointer: #<pointer 0x4?> ?)
   426:10  4 (spawn-builds #<build-daemon 256.97 42d2320> #<<sqlite?> ?)
In srfi/srfi-1.scm:
    640:9  3 (for-each #<procedure update! (drv)> _)
In cuirass/database.scm:
    69:18  2 (%sqlite-exec _ _ . _)
In ice-9/eval.scm:
   293:34  1 (_ #(#(#<directory (sqlite3) 1b91820> #<<sqlite-stm?> ?)))
    619:8  0 (_ #(#(#(#(#<directory (sqlite3) 1b91820> #<?>) #) 5) 5))
ice-9/eval.scm:619:8: Throw to key `sqlite-error' with args `(#f 5 "database is locked")'.

There is this, which looks like a problem, but not a fatal one:
;;; Failed to autoload make-session in (gnutls):
;;; missing interface for module (gnutls)
;;; Failed to autoload connection-end/client in (gnutls):
;;; missing interface for module (gnutls)

And what strikes me most is that between two lines of the kind
   building batch of 200 jobs
there are not 200 lines of successful or failed packages.

And my impression is that too many packages fail. What are these
"status 100" and "status 101"?

Do things look like this on berlin? (Of course, one difference is that
right now the machine builds everything itself without offloading.)

Andreas

[purged.log (text/plain, attachment)]

Information forwarded to bug-guix <at> gnu.org:
bug#30644; Package guix. (Mon, 05 Mar 2018 19:56:01 GMT) Full text and rfc822 format available.

Message #23 received at 30644 <at> debbugs.gnu.org (full text, mbox):

From: Andreas Enge <andreas <at> enge.fr>
To: Efraim Flashner <efraim <at> flashner.co.il>
Cc: 30644 <at> debbugs.gnu.org
Subject: Re: bug#30644: Cuirass runs out of build users
Date: Mon, 5 Mar 2018 20:55:20 +0100
Okay, this evening, I see 10 packages being built and lots of messages
"build started"/"build succeeded"; so maybe there is no bug at work here.

Andreas





Information forwarded to bug-guix <at> gnu.org:
bug#30644; Package guix. (Wed, 07 Mar 2018 22:36:02 GMT) Full text and rfc822 format available.

Message #26 received at 30644 <at> debbugs.gnu.org (full text, mbox):

From: ludo <at> gnu.org (Ludovic Courtès)
To: Andreas Enge <andreas <at> enge.fr>
Cc: 30644 <at> debbugs.gnu.org, Efraim Flashner <efraim <at> flashner.co.il>
Subject: Re: bug#30644: Cuirass runs out of build users
Date: Wed, 07 Mar 2018 23:35:44 +0100
Andreas Enge <andreas <at> enge.fr> skribis:

> Right now, I do not seem to be running out of build users, but I think
> that cuirass nevertheless behaves strangely; so I am asking here instead
> of opening a new bug (maybe there is no bug, actually).

Good.  :-)

> There is still one occurrence of this:
> In ice-9/boot-9.scm:
>     829:9  7 (catch _ _ #<procedure 7f49ffb6c9e0 at cuirass/base.sc?> ?)
> In cuirass/base.scm:
>    616:26  6 (_)
>     548:2  5 (build-packages _ #<<sqlite-db> pointer: #<pointer 0x4?> ?)
>    426:10  4 (spawn-builds #<build-daemon 256.97 42d2320> #<<sqlite?> ?)
> In srfi/srfi-1.scm:
>     640:9  3 (for-each #<procedure update! (drv)> _)
> In cuirass/database.scm:
>     69:18  2 (%sqlite-exec _ _ . _)
> In ice-9/eval.scm:
>    293:34  1 (_ #(#(#<directory (sqlite3) 1b91820> #<<sqlite-stm?> ?)))
>     619:8  0 (_ #(#(#(#(#<directory (sqlite3) 1b91820> #<?>) #) 5) 5))
> ice-9/eval.scm:619:8: Throw to key `sqlite-error' with args `(#f 5 "database is locked")'.

Bah.  :-/

> There is this, which looks like a problem, but not a fatal one:
> ;;; Failed to autoload make-session in (gnutls):
> ;;; missing interface for module (gnutls)
> ;;; Failed to autoload connection-end/client in (gnutls):
> ;;; missing interface for module (gnutls)

That happens during the evaluation, probably when compiling (guix build
download).  Nothing to worry about.

> 2018-03-05T12:41:34 building batch of 200 jobs (0/14675)
> 2018-03-05T12:41:34 build failed: '/gnu/store/0cfs95nasmfgs5nhylm1fcrfwpwh81pc-gcompris-qt-0.70.drv'
> 2018-03-05T12:41:34 build failed: '/gnu/store/065zrkz0zmfwiwajrb9nanzwvdhmyrrd-python2-stem-1.6.0.drv'
> 2018-03-05T12:41:35 build failed: '/gnu/store/3qy3s1dy3nnb108ifz9yiscs687g0xkh-gfortran-5.5.0.drv'
> 2018-03-05T12:41:35 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-03-05T12:41:36 build failed: '/gnu/store/b7jfj61snk90pkg7dlhyd7jjqbpi2rni-htslib-1.6.drv'
> 2018-03-05T12:41:37 build failed: '/gnu/store/9jjifxnxbv7z54qlcik3j8gcdj6ndd3k-python2-astroid-1.5.3.drv'
> 2018-03-05T12:41:38 build failed: '/gnu/store/i2zmgbsi599pxskqh15s38fpms359jp7-gcc-cross-sans-libc-arm-linux-gnueabihf-5.5.0.drv'
> 2018-03-05T12:41:38 build failed: '/gnu/store/qd59gqza3ly4rk2xjf40q2539rjgzh5r-gcc-cross-sans-libc-powerpc-linux-gnu-5.5.0.drv'
> 2018-03-05T12:41:38 build failed: '/gnu/store/3qy3s1dy3nnb108ifz9yiscs687g0xkh-gfortran-5.5.0.drv'
> 2018-03-05T12:41:38 build failed: '/gnu/store/z7f4imd95xw5w86pv59lnqgq2apgg0rc-ghc-hslua-0.9.5.drv'
> 2018-03-05T12:41:39 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-03-05T12:41:39 build failed: '/gnu/store/b7jfj61snk90pkg7dlhyd7jjqbpi2rni-htslib-1.6.drv'
> 2018-03-05T12:41:40 build failed: '/gnu/store/d545d4jzw0mfa61d9ml3vygchr93xp51-python-gevent-1.2.2.drv'
> 2018-03-05T12:41:40 build failed: '/gnu/store/agrdbmvswi81q54riyin4355zcpy9yrr-python2-reno-2.7.0.drv'
> 2018-03-05T12:41:41 build failed: '/gnu/store/9l7f6zfbm3ckziy9wqddwy0jl9r9xj1q-gcc-7.3.0.drv'
> 2018-03-05T12:41:42 build failed: '/gnu/store/3qy3s1dy3nnb108ifz9yiscs687g0xkh-gfortran-5.5.0.drv'
> 2018-03-05T12:41:42 build failed: '/gnu/store/z7f4imd95xw5w86pv59lnqgq2apgg0rc-ghc-hslua-0.9.5.drv'
> 2018-03-05T12:41:42 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-03-05T12:41:42 build failed: '/gnu/store/b7jfj61snk90pkg7dlhyd7jjqbpi2rni-htslib-1.6.drv'
> 2018-03-05T12:41:43 build failed: '/gnu/store/d545d4jzw0mfa61d9ml3vygchr93xp51-python-gevent-1.2.2.drv'
> 2018-03-05T12:41:43 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-03-05T12:41:43 build started: '/gnu/store/y0p2myl0al8x3rlbxphgyjk83zl3kmb5-idris-mode-0.9.19.tar.drv'
> 2018-03-05T12:41:44 build failed: '/gnu/store/014v4q1xykd2gl04ildgg3nf30dp51pw-gcc-5.5.0.drv'
> 2018-03-05T12:41:44 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-03-05T12:41:45 build failed: '/gnu/store/014v4q1xykd2gl04ildgg3nf30dp51pw-gcc-5.5.0.drv'
> 2018-03-05T12:41:45 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-03-05T12:41:45 build failed: '/gnu/store/014v4q1xykd2gl04ildgg3nf30dp51pw-gcc-5.5.0.drv'
> 2018-03-05T12:41:45 build failed: '/gnu/store/9l7f6zfbm3ckziy9wqddwy0jl9r9xj1q-gcc-7.3.0.drv'
> 2018-03-05T12:41:45 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-03-05T12:41:45 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-03-05T12:41:45 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-03-05T12:41:45 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-03-05T12:41:45 build failed: '/gnu/store/y0p2myl0al8x3rlbxphgyjk83zl3kmb5-idris-mode-0.9.19.tar.drv'

The fact that all of these failed more or less instantaneously probably
means that these were cached failures.  See ‘guix gc --list-failures’.

Why they failed in the first place, I don’t know.  Are these x86_64/i686
builds?  (You can see that in the .drv file.)  If not, no wonder that
they fail.

Thanks,
Ludo’.




Information forwarded to bug-guix <at> gnu.org:
bug#30644; Package guix. (Wed, 07 Mar 2018 23:02:02 GMT) Full text and rfc822 format available.

Message #29 received at 30644 <at> debbugs.gnu.org (full text, mbox):

From: Danny Milosavljevic <dannym <at> scratchpost.org>
To: ludo <at> gnu.org (Ludovic Courtès)
Cc: Andreas Enge <andreas <at> enge.fr>, 30644 <at> debbugs.gnu.org
Subject: Re: bug#30644: Cuirass runs out of build users
Date: Thu, 8 Mar 2018 00:01:11 +0100
[Message part 1 (text/plain, inline)]
Hi Ludo,

> >     69:18  2 (%sqlite-exec _ _ . _)
> > In ice-9/eval.scm:
> >    293:34  1 (_ #(#(#<directory (sqlite3) 1b91820> #<<sqlite-stm?> ?)))
> >     619:8  0 (_ #(#(#(#(#<directory (sqlite3) 1b91820> #<?>) #) 5) 5))
> > ice-9/eval.scm:619:8: Throw to key `sqlite-error' with args `(#f 5 "database is locked")'.  
> 
> Bah.  :-/

sqlite3.h contains quite nice documentation about this case:

/*
** CAPI3REF: Unlock Notification
** METHOD: sqlite3
**
** ^When running in shared-cache mode, a database operation may fail with
** an [SQLITE_LOCKED] error if the required locks on the shared-cache or
** individual tables within the shared-cache cannot be obtained. See
** [SQLite Shared-Cache Mode] for a description of shared-cache locking. 
** ^This API may be used to register a callback that SQLite will invoke 
** when the connection currently holding the required lock relinquishes it.
** ^This API is only available if the library was compiled with the
** [SQLITE_ENABLE_UNLOCK_NOTIFY] C-preprocessor symbol defined.
**
** See Also: [Using the SQLite Unlock Notification Feature].
**
** ^Shared-cache locks are released when a database connection concludes
** its current transaction, either by committing it or rolling it back. 
**
** ^When a connection (known as the blocked connection) fails to obtain a
** shared-cache lock and SQLITE_LOCKED is returned to the caller, the
** identity of the database connection (the blocking connection) that
** has locked the required resource is stored internally. ^After an 
** application receives an SQLITE_LOCKED error, it may call the
** sqlite3_unlock_notify() method with the blocked connection handle as 
** the first argument to register for a callback that will be invoked
** when the blocking connections current transaction is concluded. ^The
** callback is invoked from within the [sqlite3_step] or [sqlite3_close]
** call that concludes the blocking connections transaction.
**
** ^(If sqlite3_unlock_notify() is called in a multi-threaded application,
** there is a chance that the blocking connection will have already
** concluded its transaction by the time sqlite3_unlock_notify() is invoked.
** If this happens, then the specified callback is invoked immediately,
** from within the call to sqlite3_unlock_notify().)^
**
** ^If the blocked connection is attempting to obtain a write-lock on a
** shared-cache table, and more than one other connection currently holds
** a read-lock on the same table, then SQLite arbitrarily selects one of 
** the other connections to use as the blocking connection.
**
** ^(There may be at most one unlock-notify callback registered by a 
** blocked connection. If sqlite3_unlock_notify() is called when the
** blocked connection already has a registered unlock-notify callback,
** then the new callback replaces the old.)^ ^If sqlite3_unlock_notify() is
** called with a NULL pointer as its second argument, then any existing
** unlock-notify callback is canceled. ^The blocked connections 
** unlock-notify callback may also be canceled by closing the blocked
** connection using [sqlite3_close()].
**
** The unlock-notify callback is not reentrant. If an application invokes
** any sqlite3_xxx API functions from within an unlock-notify callback, a
** crash or deadlock may be the result.
**
** ^Unless deadlock is detected (see below), sqlite3_unlock_notify() always
** returns SQLITE_OK.
**
...
*/

...
** When a call to [sqlite3_step()] returns SQLITE_LOCKED, it is almost 
** always appropriate to call sqlite3_unlock_notify(). There is however,
** one exception. When executing a "DROP TABLE" or "DROP INDEX" statement,
** SQLite checks if there are any currently executing SELECT statements
** that belong to the same connection. If there are, SQLITE_LOCKED is
** returned. In this case there is no "blocking connection", so invoking
** sqlite3_unlock_notify() results in the unlock-notify callback being
** invoked immediately. If the application then re-attempts the "DROP TABLE"
** or "DROP INDEX" query, an infinite loop might be the result.
**
** One way around this problem is to check the extended error code returned
** by an sqlite3_step() call. ^(If there is a blocking connection, then the
** extended error code is set to SQLITE_LOCKED_SHAREDCACHE. Otherwise, in
** the special "DROP TABLE/INDEX" case, the extended error code is just 
** SQLITE_LOCKED.)^
...
[Message part 2 (application/pgp-signature, inline)]

Information forwarded to bug-guix <at> gnu.org:
bug#30644; Package guix. (Thu, 08 Mar 2018 00:26:01 GMT) Full text and rfc822 format available.

Message #32 received at 30644 <at> debbugs.gnu.org (full text, mbox):

From: Danny Milosavljevic <dannym <at> scratchpost.org>
To: ludo <at> gnu.org (Ludovic Courtès)
Cc: Andreas Enge <andreas <at> enge.fr>, 30644 <at> debbugs.gnu.org
Subject: Re: bug#30644: Cuirass runs out of build users
Date: Thu, 8 Mar 2018 01:25:41 +0100
[Message part 1 (text/plain, inline)]
As for what's causing this, cuirass's bin/evaluate seems to connect to the sqlite database and then write stuff into it.

If a query is being done by the http server at the same time, you are going to be really careful with the sqlite open mode.

There would be ways to still be able to do this, but it's some seriously unnecessary locking complexity for something small like this.

Is that really necessary?  Can't the invoker of bin/evaluate write the result?

It even seems that src/cuirass/base.scm already does the db-add-derivation.

Why is bin/evaluate doing it again?

Same for db-add-evaluation...
[Message part 2 (application/pgp-signature, inline)]

Information forwarded to bug-guix <at> gnu.org:
bug#30644; Package guix. (Thu, 08 Mar 2018 00:30:02 GMT) Full text and rfc822 format available.

Message #35 received at 30644 <at> debbugs.gnu.org (full text, mbox):

From: Danny Milosavljevic <dannym <at> scratchpost.org>
To: ludo <at> gnu.org (Ludovic Courtès)
Cc: Andreas Enge <andreas <at> enge.fr>, 30644 <at> debbugs.gnu.org
Subject: Re: bug#30644: Cuirass runs out of build users
Date: Thu, 8 Mar 2018 01:29:23 +0100
[Message part 1 (text/plain, inline)]
Oops, I had an old bin/evaluate . bin/evaluate.in doesn't contain it anymore.  My bad...

So please disregard.

Anyway, sqlite contains some cool stuff to find out who's locking it.

But as things are it seems that sqlite access is now done only by one thread.  How can it then be locked?  Weird...
[Message part 2 (application/pgp-signature, inline)]

Information forwarded to bug-guix <at> gnu.org:
bug#30644; Package guix. (Thu, 08 Mar 2018 12:02:02 GMT) Full text and rfc822 format available.

Message #38 received at 30644 <at> debbugs.gnu.org (full text, mbox):

From: ludo <at> gnu.org (Ludovic Courtès)
To: Danny Milosavljevic <dannym <at> scratchpost.org>
Cc: Andreas Enge <andreas <at> enge.fr>, 30644 <at> debbugs.gnu.org
Subject: Re: bug#30644: Cuirass runs out of build users
Date: Thu, 08 Mar 2018 13:01:01 +0100
Danny Milosavljevic <dannym <at> scratchpost.org> skribis:

> Oops, I had an old bin/evaluate . bin/evaluate.in doesn't contain it anymore.  My bad...

Yes, see <https://bugs.gnu.org/30618>.

> Anyway, sqlite contains some cool stuff to find out who's locking it.
>
> But as things are it seems that sqlite access is now done only by one thread.  How can it then be locked?  Weird...

It’s accessed possibly from several threads because Fibers can be
thought of as an N×M threading library.

So what’s your take on this?

Ludo’.




Information forwarded to bug-guix <at> gnu.org:
bug#30644; Package guix. (Sun, 11 Mar 2018 06:07:01 GMT) Full text and rfc822 format available.

Message #41 received at 30644 <at> debbugs.gnu.org (full text, mbox):

From: Danny Milosavljevic <dannym <at> scratchpost.org>
To: ludo <at> gnu.org (Ludovic Courtès)
Cc: Andreas Enge <andreas <at> enge.fr>, 30644 <at> debbugs.gnu.org
Subject: Re: bug#30644: Cuirass runs out of build users
Date: Sun, 11 Mar 2018 07:06:10 +0100
[Message part 1 (text/plain, inline)]
Hi Ludo,

On Thu, 08 Mar 2018 13:01:01 +0100
ludo <at> gnu.org (Ludovic Courtès) wrote:

> So what’s your take on this?

sqlite has a sqlite-unlock-notify.  We should wrap it and use it when we get the
"locked" error code.

We should also enable extended error codes https://sqlite.org/c3ref/extended_result_codes.html .

See also https://sqlite.org/unlock_notify.html for a blocking version of _step (to the right).

Do we want cuirass to block until the lock is available?
[Message part 2 (application/pgp-signature, inline)]

Information forwarded to bug-guix <at> gnu.org:
bug#30644; Package guix. (Sun, 11 Mar 2018 21:44:02 GMT) Full text and rfc822 format available.

Message #44 received at 30644 <at> debbugs.gnu.org (full text, mbox):

From: ludo <at> gnu.org (Ludovic Courtès)
To: Danny Milosavljevic <dannym <at> scratchpost.org>
Cc: Andreas Enge <andreas <at> enge.fr>, 30644 <at> debbugs.gnu.org
Subject: Re: bug#30644: Cuirass runs out of build users
Date: Sun, 11 Mar 2018 22:43:48 +0100
Hi,

Danny Milosavljevic <dannym <at> scratchpost.org> skribis:

> On Thu, 08 Mar 2018 13:01:01 +0100
> ludo <at> gnu.org (Ludovic Courtès) wrote:
>
>> So what’s your take on this?
>
> sqlite has a sqlite-unlock-notify.  We should wrap it and use it when we get the
> "locked" error code.
>
> We should also enable extended error codes https://sqlite.org/c3ref/extended_result_codes.html .
>
> See also https://sqlite.org/unlock_notify.html for a blocking version of _step (to the right).

Interesting.

> Do we want cuirass to block until the lock is available?

The POSIX thread itself should not block.  Rather the fiber should yield
to other fibers.

So we could do the blocking-step within ‘non-blocking’ block as defined
in (cuirass utils), though ‘non-blocking’ as currently implemented is a
bit expensive.

Or we could simply poll: catch those exceptions and try again later.  If
this is rare enough, as seems to be the case, that might be an
acceptable approach.

Thoughts?

Ludo’.




Information forwarded to bug-guix <at> gnu.org:
bug#30644; Package guix. (Sat, 24 Mar 2018 13:16:01 GMT) Full text and rfc822 format available.

Message #47 received at 30644 <at> debbugs.gnu.org (full text, mbox):

From: Danny Milosavljevic <dannym <at> scratchpost.org>
To: ludo <at> gnu.org (Ludovic Courtès)
Cc: Andreas Enge <andreas <at> enge.fr>, 30644 <at> debbugs.gnu.org
Subject: Re: bug#30644: Cuirass runs out of build users
Date: Sat, 24 Mar 2018 14:15:47 +0100
[Message part 1 (text/plain, inline)]
Hi Ludo,

I've checked it some more.  The shared cache uses unlock_notify, and the actual
database uses sqlite3_busy_handler (sqlite3_busy_timeout) when BUSY is returned.

The 5 in

ice-9/eval.scm:619:8: Throw to key `sqlite-error' with args `(#f 5 "database is locked")'.  
                                                                 ^^

means SQLITE_BUSY, so that's the actual database.

So the easiest way to work around it would be to call

  sqlite3_busy_timeout(db, 10); // ms

which would automatically repeatedly retry on busy until 10 ms have accumulated,
and only then give up.

There's also sqlite3_busy_handler which can be used to install a busy handler,
and of course in the future we could also handle SQLITE_BUSY ourselves and yield.

See also https://notabug.org/civodul/guile-sqlite3/pulls/4 for the former.
[Message part 2 (application/pgp-signature, inline)]

Information forwarded to bug-guix <at> gnu.org:
bug#30644; Package guix. (Sun, 25 Mar 2018 12:41:02 GMT) Full text and rfc822 format available.

Message #50 received at 30644 <at> debbugs.gnu.org (full text, mbox):

From: ludo <at> gnu.org (Ludovic Courtès)
To: Danny Milosavljevic <dannym <at> scratchpost.org>
Cc: Andreas Enge <andreas <at> enge.fr>, 30644 <at> debbugs.gnu.org
Subject: Re: bug#30644: Cuirass runs out of build users
Date: Sun, 25 Mar 2018 14:40:20 +0200
Hi Danny,

Danny Milosavljevic <dannym <at> scratchpost.org> skribis:

> I've checked it some more.  The shared cache uses unlock_notify, and the actual
> database uses sqlite3_busy_handler (sqlite3_busy_timeout) when BUSY is returned.
>
> The 5 in
>
> ice-9/eval.scm:619:8: Throw to key `sqlite-error' with args `(#f 5 "database is locked")'.  
>                                                                  ^^
>
> means SQLITE_BUSY, so that's the actual database.
>
> So the easiest way to work around it would be to call
>
>   sqlite3_busy_timeout(db, 10); // ms
>
> which would automatically repeatedly retry on busy until 10 ms have accumulated,
> and only then give up.

Indeed, that’s actually what nix/libstore does.

> There's also sqlite3_busy_handler which can be used to install a busy handler,
> and of course in the future we could also handle SQLITE_BUSY ourselves and yield.
>
> See also https://notabug.org/civodul/guile-sqlite3/pulls/4 for the former.

Awesome.

I’ve merged these guile-sqlite3 bits and added a “PRAGMA busy_timeout”
in Cuirass (the pragma allows us to set the timeout even when using an
older guile-sqlite3.)  I’ve updated the ‘cuirass’ package and will test
soonish on berlin.

Thank you!

Ludo’.




Reply sent to Mathieu Othacehe <othacehe <at> gnu.org>:
You have taken responsibility. (Thu, 25 Mar 2021 13:18:01 GMT) Full text and rfc822 format available.

Notification sent to Andreas Enge <andreas <at> enge.fr>:
bug acknowledged by developer. (Thu, 25 Mar 2021 13:18:01 GMT) Full text and rfc822 format available.

Message #55 received at 30644-done <at> debbugs.gnu.org (full text, mbox):

From: Mathieu Othacehe <othacehe <at> gnu.org>
To: 30644-done <at> debbugs.gnu.org
Subject: Re: bug#30644: Cuirass runs out of build users
Date: Thu, 25 Mar 2021 14:17:32 +0100
Hello,

This should no longer be an issue with the new remote building mechanism
in Cuirass.

Thanks,

Mathieu




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 23 Apr 2021 11:24:09 GMT) Full text and rfc822 format available.

This bug report was last modified 2 years and 362 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.