GNU bug report logs - #51787
GC takes more than 9 hours on berlin

Package: guix;

Reported by: Mathieu Othacehe <othacehe <at> gnu.org>

Date: Fri, 12 Nov 2021 11:50:02 UTC

Severity: important

Done: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 51787 in the body.
You can then email your comments to 51787 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox

Report forwarded to bug-guix <at> gnu.org:
bug#51787; Package guix. (Fri, 12 Nov 2021 11:50:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Mathieu Othacehe <othacehe <at> gnu.org>:
New bug report received and forwarded. Copy sent to bug-guix <at> gnu.org. (Fri, 12 Nov 2021 11:50:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Mathieu Othacehe <othacehe <at> gnu.org>
To: bug-guix <at> gnu.org
Subject: GC takes more than 9 hours on berlin
Date: Fri, 12 Nov 2021 11:49:04 +0000

Hello,

On berlin, the daily GC command is still running whereas it was started
9 hours ago.

--8<---------------cut here---------------start------------->8---
guix processes
[...]
SessionPID: 37231
ClientPID: 37195
ClientCommand: /gnu/store/49vfv8l1j96bbd73ssbqanpnjz83arss-guix-1.3.0-11.014f1b6/libexec/guix/guile \ /gnu/store/49vfv8l1j96bbd73ssbqanpnjz83arss-guix-1.3.0-11.014f1b6/bin/guix gc -F10995116277760
--8<---------------cut here---------------end--------------->8---

and

--8<---------------cut here---------------start------------->8---
mathieu <at> berlin ~$ ps auxww|grep 37195
root      37195  0.0  0.0 183260 33440 ?        Sl   03:59   0:00 /gnu/store/49vfv8l1j96bbd73ssbqanpnjz83arss-guix-1.3.0-11.014f1b6/libexec/guix/guile \ /gnu/store/49vfv8l1j96bbd73ssbqanpnjz83arss-guix-1.3.0-11.014f1b6/bin/guix gc -F10995116277760
--8<---------------cut here---------------end--------------->8---

That's really problematic as it is blocking some other berlin services
such as Cuirass, which has 4564 packages in its fetch queue:

--8<---------------cut here---------------start------------->8---
mathieu <at> berlin ~$ less /var/log/cuirass-remote-server.log
[...]
2021-11-12T12:47:01 period update: 0 resumable, 0 failed builds.
2021-11-12T12:47:01 period update: 4564 items in the fetch queue.
--8<---------------cut here---------------end--------------->8---

Thanks,

Mathieu

Information forwarded to bug-guix <at> gnu.org:
bug#51787; Package guix. (Fri, 12 Nov 2021 19:18:01 GMT) Full text and rfc822 format available.

Message #8 received at 51787 <at> debbugs.gnu.org (full text, mbox):

From: Mathieu Othacehe <othacehe <at> gnu.org>
To: 51787 <at> debbugs.gnu.org
Subject: Re: bug#51787: GC takes more than 9 hours on berlin
Date: Fri, 12 Nov 2021 19:17:24 +0000

> mathieu <at> berlin ~$ ps auxww|grep 37195
> root      37195  0.0  0.0 183260 33440 ?        Sl   03:59   0:00 /gnu/store/49vfv8l1j96bbd73ssbqanpnjz83arss-guix-1.3.0-11.014f1b6/libexec/guix/guile \ /gnu/store/49vfv8l1j96bbd73ssbqanpnjz83arss-guix-1.3.0-11.014f1b6/bin/guix gc -F10995116277760

I just killed this process, it was running for more than 16 hours.

Mathieu

Information forwarded to bug-guix <at> gnu.org:
bug#51787; Package guix. (Mon, 22 Nov 2021 09:18:01 GMT) Full text and rfc822 format available.

Message #11 received at 51787 <at> debbugs.gnu.org (full text, mbox):

From: zimoun <zimon.toutoune <at> gmail.com>
To: Mathieu Othacehe <othacehe <at> gnu.org>
Cc: 51787 <at> debbugs.gnu.org
Subject: Re: bug#51787: GC takes more than 9 hours on berlin
Date: Mon, 22 Nov 2021 10:16:36 +0100

Hi,

On Fri, 12 Nov 2021 at 12:51, Mathieu Othacehe <othacehe <at> gnu.org> wrote:

> On berlin, the daily GC command is still running whereas it was started
> 9 hours ago.
>
> --8<---------------cut here---------------start------------->8---
> guix processes
> [...]
> SessionPID: 37231
> ClientPID: 37195
> ClientCommand: /gnu/store/49vfv8l1j96bbd73ssbqanpnjz83arss-guix-1.3.0-11.014f1b6/libexec/guix/guile \ /gnu/store/49vfv8l1j96bbd73ssbqanpnjz83arss-guix-1.3.0-11.014f1b6/bin/guix gc -F10995116277760
> --8<---------------cut here---------------end--------------->8---
>
> and
>
> --8<---------------cut here---------------start------------->8---
> mathieu <at> berlin ~$ ps auxww|grep 37195
> root      37195  0.0  0.0 183260 33440 ?        Sl   03:59   0:00 /gnu/store/49vfv8l1j96bbd73ssbqanpnjz83arss-guix-1.3.0-11.014f1b6/libexec/guix/guile \ /gnu/store/49vfv8l1j96bbd73ssbqanpnjz83arss-guix-1.3.0-11.014f1b6/bin/guix gc -F10995116277760
> --8<---------------cut here---------------end--------------->8---
>
> That's really problematic as it is blocking some other berlin services
> such as Cuirass, which has 4564 packages in its fetch queue:
>
> --8<---------------cut here---------------start------------->8---
> mathieu <at> berlin ~$ less /var/log/cuirass-remote-server.log
> [...]
> 2021-11-12T12:47:01 period update: 0 resumable, 0 failed builds.
> 2021-11-12T12:47:01 period update: 4564 items in the fetch queue.
> --8<---------------cut here---------------end--------------->8---

How is it possible to investigate?

Cheers,
simon

Severity set to 'important' from 'normal' Request was from Ludovic Courtès <ludo <at> gnu.org> to control <at> debbugs.gnu.org. (Tue, 23 Nov 2021 17:35:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-guix <at> gnu.org:
bug#51787; Package guix. (Tue, 23 Nov 2021 17:49:02 GMT) Full text and rfc822 format available.

Message #16 received at 51787 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: Mathieu Othacehe <othacehe <at> gnu.org>
Cc: Christopher Baines <mail <at> cbaines.net>, 51787 <at> debbugs.gnu.org
Subject: Re: bug#51787: GC takes more than 9 hours on berlin
Date: Tue, 23 Nov 2021 18:48:08 +0100

Hello!

Mathieu Othacehe <othacehe <at> gnu.org> skribis:

> On berlin, the daily GC command is still running whereas it was started
> 9 hours ago.

Some data points:

  • I deployed on berlin the new daemon featuring the faster “deleting
    unused links” phase from <https://issues.guix.gnu.org/24937> on
    Nov. 20.

    However, that part runs after the GC lock has been released, so it’s
    not really relevant (but it is relevant to I/O load and GC
    efficiency.)

  • When discussing together with Chris Baines yesterday on IRC, we
    found that the sqlite WAL file was 8 GiB.  I later ran:

      PRAGMA wal_checkpoint(TRUNCATE);

    which emptied it immediately.  However, GC time wasn’t particularly
    different today.

  • ‘db.sqlite’ weighs in at 19 GiB (!) so perhaps there’s something to
    do, like the “VACUUM” thing maybe.  Chris?

  • Stracing the session’s guix-daemon process during GC suggests that
    most of the time goes into I/O from ‘db.sqlite’.  It’s not
    surprising because that GC phase is basically about browsing the
    database, but it does seem to take a little too long for each store
    item.

  • I haven’t checked recently but I recall that ‘guix gc --list-roots’
    (or its C++ counterpart, ‘findRoots’) would take ages on berlin
    because of all the GC roots Cuirass registers.  It may be that an
    hour or so goes into enumerating GC roots.

Collecting garbage,
Ludo’.

Information forwarded to bug-guix <at> gnu.org:
bug#51787; Package guix. (Fri, 26 Nov 2021 12:11:01 GMT) Full text and rfc822 format available.

Message #19 received at 51787 <at> debbugs.gnu.org (full text, mbox):

From: Christopher Baines <mail <at> cbaines.net>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: Mathieu Othacehe <othacehe <at> gnu.org>, 51787 <at> debbugs.gnu.org
Subject: Re: bug#51787: GC takes more than 9 hours on berlin
Date: Thu, 25 Nov 2021 13:24:17 +0000

[Message part 1 (text/plain, inline)]

Ludovic Courtès <ludo <at> gnu.org> writes:

>   • When discussing together with Chris Baines yesterday on IRC, we
>     found that the sqlite WAL file was 8 GiB.  I later ran:
>
>       PRAGMA wal_checkpoint(TRUNCATE);
>
>     which emptied it immediately.  However, GC time wasn’t particularly
>     different today.

So, as I understand it, the WAL is made up of pages, and checking for
this db, I think they're the current default size of 4096 bytes.

sqlite> PRAGMA page_size;
4096

From looking at the code, the wal_autocheckpoint value is set to 40000:

    /* Increase the auto-checkpoint interval to 40000 pages.  This
       seems enough to ensure that instantiating the NixOS system
       derivation is done in a single fsync(). */
    if (mode == "wal" && sqlite3_exec(db, "pragma wal_autocheckpoint = 40000;", 0, 0, 0) != SQLITE_OK)
        throwSQLiteError(db, "setting autocheckpoint interval");

https://git.savannah.gnu.org/cgit/guix.git/tree/nix/libstore/local-store.cc#n253

This means you'd expect the WAL to be in the region of 40000*4096 bytes,
or ~160MB. Assuming the autocheckpointing is keeping up... it doesn't
look to be, since the file is now much larger than this.

As described here [1], the automatic checkpoints are PASSIVE ones, which
has the advantage of not interrupting any readers or writers, but can
also do nothing if it's being blocked by readers or writers.

1: https://www.sqlite.org/wal.html#application_initiated_checkpoints

What I've found while writing the Guix Build Coordinator, is that when
the service is busy (usually new builds being submitted, plus lots of
builds happening), the PASSIVE checkpoints aren't sufficient. To
supplement them, there's a regular check that looks at the size of the
WAL file, and runs a TRUNCATE checkpoint, which is a FULL checkpoint
(which blocks new writers), plus truncating the WAL file once it's
finished checkpointing the entire WAL. The truncating is mostly so that
it's easier to monitor the size of the WAL, by checking the size of the
file.

I feel like I need to defend SQLite at this point. Tuning the
configuration of a database to get acceptable performance is the norm, I
had to tune the PostgreSQL configuration for data.guix.gnu.org to
improve the performance. It's easier to get in to trouble with SQLite
because it's a lower level too, and requires you to actually initiate
things like checkpoints or periodic optimisation if you want them to
happen.

Unfortunately, I don't know enough about the internals of the daemon to
say anything specific though.

>   • ‘db.sqlite’ weighs in at 19 GiB (!) so perhaps there’s something to
>     do, like the “VACUUM” thing maybe.  Chris?

Doing a VACCUM might address some fragmentation and improve performance,
it's probably worth trying.

>   • Stracing the session’s guix-daemon process during GC suggests that
>     most of the time goes into I/O from ‘db.sqlite’.  It’s not
>     surprising because that GC phase is basically about browsing the
>     database, but it does seem to take a little too long for each store
>     item.

At least the way I've approached finding and fixing the poor performance
issues in the Guix Build Coordinator is through adding instrumentation,
so just recording the time that calling particular procedures takes, and
then logging if it's longer than some threshold.

Since this issue is about Cuirass, there's also the possibility of
avoiding the problems of a large store, by avoiding having a large
store. That's what bordeaux.guix.gnu.org does, and I thought it was part
of the plan for ci.guix.gnu.org (at least around a year ago)?

[signature.asc (application/pgp-signature, inline)]

Information forwarded to bug-guix <at> gnu.org:
bug#51787; Package guix. (Sat, 27 Nov 2021 11:12:01 GMT) Full text and rfc822 format available.

Message #22 received at 51787 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: Mathieu Othacehe <othacehe <at> gnu.org>
Cc: Christopher Baines <mail <at> cbaines.net>, 51787 <at> debbugs.gnu.org
Subject: Re: bug#51787: GC takes more than 9 hours on berlin
Date: Sat, 27 Nov 2021 12:11:04 +0100

Ludovic Courtès <ludo <at> gnu.org> skribis:

>   • Stracing the session’s guix-daemon process during GC suggests that
>     most of the time goes into I/O from ‘db.sqlite’.  It’s not
>     surprising because that GC phase is basically about browsing the
>     database, but it does seem to take a little too long for each store
>     item.

Stracing the client shows that the daemon spends several seconds on a
single store item occasionally:

--8<---------------cut here---------------start------------->8---
read(27, "gmlo\0\0\0\0", 8)             = 8 <0.363064>
read(27, "c\0\0\0\0\0\0\0", 8)          = 8 <0.000013>
read(27, "[95%] deleting '/gnu/store/p6r2jjy6frp682z3x94nvnmdh71p1p58-ecl-quicksearch-0.01"..., 104) = 104 <0.000010>
write(2, "[95%] deleting '/gnu/store/p6r2jjy6frp682z3x94nvnmdh71p1p58-ecl-quicksearch-0.01"..., 99) = 99 <0.000019>
read(27, "gmlo\0\0\0\0", 8)             = 8 <0.017863>
read(27, "^\0\0\0\0\0\0\0", 8)          = 8 <0.000019>
read(27, "[95%] deleting '/gnu/store/v6zd510kfmqd8j4w7q3zy9bid1fj96dk-shepherd-guix-daemon"..., 96) = 96 <0.000007>
write(2, "[95%] deleting '/gnu/store/v6zd510kfmqd8j4w7q3zy9bid1fj96dk-shepherd-guix-daemon"..., 94) = 94 <0.000012>
read(27, "gmlo\0\0\0\0", 8)             = 8 <5.861071>
read(27, "T\0\0\0\0\0\0\0", 8)          = 8 <0.000061>
read(27, "[95%] deleting '/gnu/store/0hpwig8cwdnzygjjzs9zjbxicvhif2vv-rust-bitvec-0.19.4.d"..., 88) = 88 <0.000087>
write(2, "[95%] deleting '/gnu/store/0hpwig8cwdnzygjjzs9zjbxicvhif2vv-rust-bitvec-0.19.4.d"..., 84) = 84 <0.000033>
--8<---------------cut here---------------end--------------->8---

(Notice ‘read’ taking 5.9s above.)

Ludo’.

Information forwarded to bug-guix <at> gnu.org:
bug#51787; Package guix. (Sat, 27 Nov 2021 11:24:02 GMT) Full text and rfc822 format available.

Message #25 received at 51787 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: Christopher Baines <mail <at> cbaines.net>
Cc: Mathieu Othacehe <othacehe <at> gnu.org>, 51787 <at> debbugs.gnu.org
Subject: Re: bug#51787: GC takes more than 9 hours on berlin
Date: Sat, 27 Nov 2021 12:23:00 +0100

Hi Chris,

Christopher Baines <mail <at> cbaines.net> skribis:

> So, as I understand it, the WAL is made up of pages, and checking for
> this db, I think they're the current default size of 4096 bytes.
>
> sqlite> PRAGMA page_size;
> 4096
>
> From looking at the code, the wal_autocheckpoint value is set to 40000:
>
>     /* Increase the auto-checkpoint interval to 40000 pages.  This
>        seems enough to ensure that instantiating the NixOS system
>        derivation is done in a single fsync(). */
>     if (mode == "wal" && sqlite3_exec(db, "pragma wal_autocheckpoint = 40000;", 0, 0, 0) != SQLITE_OK)
>         throwSQLiteError(db, "setting autocheckpoint interval");
>
> https://git.savannah.gnu.org/cgit/guix.git/tree/nix/libstore/local-store.cc#n253
>
> This means you'd expect the WAL to be in the region of 40000*4096 bytes,
> or ~160MB. Assuming the autocheckpointing is keeping up... it doesn't
> look to be, since the file is now much larger than this.
>
> As described here [1], the automatic checkpoints are PASSIVE ones, which
> has the advantage of not interrupting any readers or writers, but can
> also do nothing if it's being blocked by readers or writers.
>
> 1: https://www.sqlite.org/wal.html#application_initiated_checkpoints
>
> What I've found while writing the Guix Build Coordinator, is that when
> the service is busy (usually new builds being submitted, plus lots of
> builds happening), the PASSIVE checkpoints aren't sufficient. To
> supplement them, there's a regular check that looks at the size of the
> WAL file, and runs a TRUNCATE checkpoint, which is a FULL checkpoint
> (which blocks new writers), plus truncating the WAL file once it's
> finished checkpointing the entire WAL. The truncating is mostly so that
> it's easier to monitor the size of the WAL, by checking the size of the
> file.

OK.  That may well be what happens on berlin these days: the database is
kept busy all day long, so presumably checkpoints don’t happen and the
WAL file grows.

> I feel like I need to defend SQLite at this point. Tuning the
> configuration of a database to get acceptable performance is the norm, I
> had to tune the PostgreSQL configuration for data.guix.gnu.org to
> improve the performance. It's easier to get in to trouble with SQLite
> because it's a lower level too, and requires you to actually initiate
> things like checkpoints or periodic optimisation if you want them to
> happen.

Understood.  It’s really not about defending software X against Y, but
rather about finding ways to address the issues we experience.

>>   • ‘db.sqlite’ weighs in at 19 GiB (!) so perhaps there’s something to
>>     do, like the “VACUUM” thing maybe.  Chris?
>
> Doing a VACCUM might address some fragmentation and improve performance,
> it's probably worth trying.

Alright, let’s give it a try.

>>   • Stracing the session’s guix-daemon process during GC suggests that
>>     most of the time goes into I/O from ‘db.sqlite’.  It’s not
>>     surprising because that GC phase is basically about browsing the
>>     database, but it does seem to take a little too long for each store
>>     item.
>
> At least the way I've approached finding and fixing the poor performance
> issues in the Guix Build Coordinator is through adding instrumentation,
> so just recording the time that calling particular procedures takes, and
> then logging if it's longer than some threshold.

Yeah, that makes sense.  I think we always took these bits of the daemon
for granted because they’d been used on large stores even before Guix
existed.

> Since this issue is about Cuirass, there's also the possibility of
> avoiding the problems of a large store, by avoiding having a large
> store. That's what bordeaux.guix.gnu.org does, and I thought it was part
> of the plan for ci.guix.gnu.org (at least around a year ago)?

That’s indeed the case: the store is smaller than it used to be (but
still 27 TiB), it’s GC’d more aggressively than before, and instead we
rely on /var/cache/guix/publish for long-term storage.

Perhaps we should go further and keep the store smaller though.

Thanks for your feedback!

Ludo’.

Information forwarded to bug-guix <at> gnu.org:
bug#51787; Package guix. (Fri, 03 Dec 2021 09:47:02 GMT) Full text and rfc822 format available.

Message #28 received at 51787 <at> debbugs.gnu.org (full text, mbox):

From: Mathieu Othacehe <othacehe <at> gnu.org>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: Christopher Baines <mail <at> cbaines.net>, 51787 <at> debbugs.gnu.org
Subject: Re: bug#51787: GC takes more than 9 hours on berlin
Date: Fri, 03 Dec 2021 10:45:54 +0100

Hello,

> That’s indeed the case: the store is smaller than it used to be (but
> still 27 TiB), it’s GC’d more aggressively than before, and instead we
> rely on /var/cache/guix/publish for long-term storage.
>
> Perhaps we should go further and keep the store smaller though.

That's what I did with 93adf7aaa693d234ee13240e9f4ff22a2dfef599 on
maintenance. It increases the GC threshold to 15TiB. Let's see if it
brings some improvements.

Thanks,

Mathieu

Information forwarded to bug-guix <at> gnu.org:
bug#51787; Package guix. (Fri, 10 Dec 2021 06:25:02 GMT) Full text and rfc822 format available.

Message #31 received at 51787 <at> debbugs.gnu.org (full text, mbox):

From: Mathieu Othacehe <othacehe <at> gnu.org>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: 51787 <at> debbugs.gnu.org
Subject: Re: bug#51787: GC takes more than 9 hours on berlin
Date: Fri, 10 Dec 2021 07:24:15 +0100

Hey,

New GC recap. The process that has been started yesterday at 04:00 is
still running. I killed the GC that was started today at 04:00 to keep
things clear.

From yesterday 11:00 when I started monitoring it to today when I'm
writing this email, 20 hours have elapsed and the GC is still in the
same phase: removing recursively the /gnu/store/trash directory content.

It corresponds to the following snippet for those of you who would like
to have a look to the corresponding code:

--8<---------------cut here---------------start------------->8---
if (state.shouldDelete) {
    if (pathExists(state.trashDir)) deleteGarbage(state, state.trashDir); // > 20 hours
    try {
        createDirs(state.trashDir);
    } catch (SysError & e) {
        if (e.errNo == ENOSPC) {
            printMsg(lvlInfo, format("note: can't create trash directory: %1%") % e.msg());
            state.moveToTrash = false;
        }
    }
}--8<---------------cut here---------------end--------------->8---

This is an early phase of the garbage collecting, where store items that
were moved to the trash directory by previous GC runs are effectively
removed.

Stracing the guix-daemon process associated with the GC process clearly
shows what's going on:

--8<---------------cut here---------------start------------->8---
chmod("/gnu/store/trash/272ibwb38i0kcbcl3n9v0ka1rsmd1104-guix-web-site/de/packages/rust-syntex-0.58.1", 040755) = 0 <0.000012>
openat(AT_FDCWD, "/gnu/store/trash/272ibwb38i0kcbcl3n9v0ka1rsmd1104-guix-web-site/de/packages/rust-syntex-0.58.1", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 13 <0.000011>
fstat(13, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 <0.000007>
getdents64(13, 0x397a510 /* 3 entries */, 32768) = 80 <0.005059>
getdents64(13, 0x397a510 /* 0 entries */, 32768) = 0 <0.000007>
close(13)                               = 0 <0.000008>
statx(AT_FDCWD, "/gnu/store/trash/272ibwb38i0kcbcl3n9v0ka1rsmd1104-guix-web-site/de/packages/rust-syntex-0.58.1/index.html", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_MODE|STATX_NLINK|STATX_SIZE, {stx_mask=STATX_BASIC_STATS|0x1000, stx_attributes=0, stx_mode=S_IFREG|0444, stx_size=10265, ...}) = 0 <0.000023>
unlink("/gnu/store/trash/272ibwb38i0kcbcl3n9v0ka1rsmd1104-guix-web-site/de/packages/rust-syntex-0.58.1/index.html") = 0 <0.000013>
rmdir("/gnu/store/trash/272ibwb38i0kcbcl3n9v0ka1rsmd1104-guix-web-site/de/packages/rust-syntex-0.58.1") = 0 <0.000028>
statx(AT_FDCWD, "/gnu/store/trash/272ibwb38i0kcbcl3n9v0ka1rsmd1104-guix-web-site/de/packages/lofreq-2.1.5", AT_STATX_
--8<---------------cut here---------------end--------------->8---

Several syscalls are involved to clean the trash directory: chmod,
openat, statx, unlink and rmdir. This does not seem particularly wrong.

What is problematic though is that in 20 hours, the free space has
bumped from 9.6T to 9.7T in the store partition. As the GC lock is
preventing most of Berlin services from running, almost all the machine
IO is dedicated to removing this directory, as shown by iotop.

I'm not sure to understand why this removing process is so long, but if
someone has an idea, I'm all ears. In the meantime, I plan to let the GC
run and keep monitoring it.

Thanks,

Mathieu

Information forwarded to bug-guix <at> gnu.org:
bug#51787; Package guix. (Fri, 10 Dec 2021 10:23:01 GMT) Full text and rfc822 format available.

Message #34 received at 51787 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: Mathieu Othacehe <othacehe <at> gnu.org>
Cc: 51787 <at> debbugs.gnu.org
Subject: Re: bug#51787: GC takes more than 9 hours on berlin
Date: Fri, 10 Dec 2021 11:21:58 +0100

Hi,

Mathieu Othacehe <othacehe <at> gnu.org> skribis:

> What is problematic though is that in 20 hours, the free space has
> bumped from 9.6T to 9.7T in the store partition. As the GC lock is
> preventing most of Berlin services from running, almost all the machine
> IO is dedicated to removing this directory, as shown by iotop.
>
> I'm not sure to understand why this removing process is so long, but if
> someone has an idea, I'm all ears. In the meantime, I plan to let the GC
> run and keep monitoring it.

This is the first time GC runs since we’ve increased the threshold from
10 TiB to 15 TiB free.  There are at least 5 more TiBs to delete than
usual, so it’s expected to take more time.

Still, I’m surprised a mere ‘rm -rf’ can take this long.  ‘strace -T’ on
the child guix-daemon process doesn’t reveal anything obviously wrong,
pause times or similar.

Ludo’.

Information forwarded to bug-guix <at> gnu.org:
bug#51787; Package guix. (Fri, 10 Dec 2021 17:12:02 GMT) Full text and rfc822 format available.

Message #37 received at 51787 <at> debbugs.gnu.org (full text, mbox):

From: Mathieu Othacehe <othacehe <at> gnu.org>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: 51787 <at> debbugs.gnu.org
Subject: Re: bug#51787: GC takes more than 9 hours on berlin
Date: Fri, 10 Dec 2021 18:11:29 +0100

Hey,

> Still, I’m surprised a mere ‘rm -rf’ can take this long.  ‘strace -T’ on
> the child guix-daemon process doesn’t reveal anything obviously wrong,
> pause times or similar.

I noticed that the process isn't around anymore. Did anyone killed it,
or maybe it just crashed? Also noticed some trash removing commands were
running. Regardless of the root cause of the problem, getting rid of the
trash directory before the next evaluation seems like a good idea.

Thanks,

Mathieu

Information forwarded to bug-guix <at> gnu.org:
bug#51787; Package guix. (Fri, 10 Dec 2021 17:12:02 GMT) Full text and rfc822 format available.

Message #40 received at 51787 <at> debbugs.gnu.org (full text, mbox):

From: Mathieu Othacehe <othacehe <at> gnu.org>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: 51787 <at> debbugs.gnu.org
Subject: Re: bug#51787: GC takes more than 9 hours on berlin
Date: Fri, 10 Dec 2021 18:11:32 +0100

Hey,

> Still, I’m surprised a mere ‘rm -rf’ can take this long.  ‘strace -T’ on
> the child guix-daemon process doesn’t reveal anything obviously wrong,
> pause times or similar.

I noticed that the process isn't around anymore. Did anyone killed it,
or maybe it just crashed? Also noticed some trash removing commands were
running. Regardless of the root cause of the problem, getting rid of the
trash directory before the next evaluation seems like a good idea.

Thanks,

Mathieu

Information forwarded to bug-guix <at> gnu.org:
bug#51787; Package guix. (Fri, 10 Dec 2021 19:39:01 GMT) Full text and rfc822 format available.

Message #43 received at 51787 <at> debbugs.gnu.org (full text, mbox):

From: Ricardo Wurmus <rekado <at> elephly.net>
To: 51787 <at> debbugs.gnu.org
Subject: GC takes more than 9 hours on berlin
Date: Fri, 10 Dec 2021 19:38:11 +0000

> I noticed that the process isn't around anymore. Did anyone killed
> it,or maybe it just crashed?

I have not touched it.

-- 
Ricardo

Information forwarded to bug-guix <at> gnu.org:
bug#51787; Package guix. (Fri, 10 Dec 2021 21:56:01 GMT) Full text and rfc822 format available.

Message #46 received at 51787 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: Mathieu Othacehe <othacehe <at> gnu.org>
Cc: 51787 <at> debbugs.gnu.org
Subject: Re: bug#51787: GC takes more than 9 hours on berlin
Date: Fri, 10 Dec 2021 22:54:51 +0100

Mathieu Othacehe <othacehe <at> gnu.org> skribis:

>> Still, I’m surprised a mere ‘rm -rf’ can take this long.  ‘strace -T’ on
>> the child guix-daemon process doesn’t reveal anything obviously wrong,
>> pause times or similar.
>
> I noticed that the process isn't around anymore. Did anyone killed it,
> or maybe it just crashed? Also noticed some trash removing commands were
> running. Regardless of the root cause of the problem, getting rid of the
> trash directory before the next evaluation seems like a good idea.

Yeah I don’t know what happened to the GC process but it disappeared
earlier today.  Then I started things like this:

  unshare -m sh -c 'mount --bind -o remount,rw /gnu/store; rm -rf --one-file-system /gnu/store/trash/[abc]*'

which is equally slow (perhaps slightly slower: it seems to do more
syscalls per file to remove than the guix-daemon code).

Anyway, I’ll let it run hoping it’ll be done by the next GC.

Ludo’.

Information forwarded to bug-guix <at> gnu.org:
bug#51787; Package guix. (Sat, 11 Dec 2021 09:43:02 GMT) Full text and rfc822 format available.

Message #49 received at 51787 <at> debbugs.gnu.org (full text, mbox):

From: Mathieu Othacehe <othacehe <at> gnu.org>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: 51787 <at> debbugs.gnu.org
Subject: Re: bug#51787: GC takes more than 9 hours on berlin
Date: Sat, 11 Dec 2021 10:42:30 +0100

Hey,

> Yeah I don’t know what happened to the GC process but it disappeared
> earlier today.  Then I started things like this:
>
>   unshare -m sh -c 'mount --bind -o remount,rw /gnu/store; rm -rf --one-file-system /gnu/store/trash/[abc]*'
>
> which is equally slow (perhaps slightly slower: it seems to do more
> syscalls per file to remove than the guix-daemon code).
>
> Anyway, I’ll let it run hoping it’ll be done by the next GC.

OK, thanks. Looks it just finished removing the trash directory
content. I started a GC process from my session to monitor it closely.

Mathieu

Information forwarded to bug-guix <at> gnu.org:
bug#51787; Package guix. (Sun, 12 Dec 2021 17:10:01 GMT) Full text and rfc822 format available.

Message #52 received at 51787 <at> debbugs.gnu.org (full text, mbox):

From: Mathieu Othacehe <othacehe <at> gnu.org>
To: 51787 <at> debbugs.gnu.org
Subject: Re: bug#51787: GC takes more than 9 hours on berlin
Date: Sun, 12 Dec 2021 18:09:42 +0100

Hey,

> OK, thanks. Looks it just finished removing the trash directory
> content. I started a GC process from my session to monitor it closely.

Daily GC recap:

* The GC process I started yesterday, did collect 5.5TiB in
  approximately 24 hours, that are now in the /gnu/store/trash
  directory.
  
* The /gnu/store/trash directory contains 288910 entries. If those items
  are removed at the same rate than on the previous days, it will take
  days/months to delete them all.

* I noticed that the upstream Nix GC process can now operate without
  locking. I think it shouldn't be too hard to port it to our fork or
  maybe rewrite the process in Guile while we are at it.

  That will not fix the slow hard-drives issues though.

Thanks,

Mathieu

Information forwarded to bug-guix <at> gnu.org:
bug#51787; Package guix. (Mon, 13 Dec 2021 16:16:02 GMT) Full text and rfc822 format available.

Message #55 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Christian Thäter <ct.guix <at> pipapo.org>
To: bug-guix <at> gnu.org
Subject: Re: bug#51787: GC takes more than 9 hours on berlin
Date: Mon, 13 Dec 2021 17:13:33 +0100

On 2021-12-12 18:09, Mathieu Othacehe wrote:

>Hey,
>
>> OK, thanks. Looks it just finished removing the trash directory
>> content. I started a GC process from my session to monitor it
>> closely.  
>
>Daily GC recap:
>
>* The GC process I started yesterday, did collect 5.5TiB in
>  approximately 24 hours, that are now in the /gnu/store/trash
>  directory.
>  
>* The /gnu/store/trash directory contains 288910 entries. If those
>items
>  are removed at the same rate than on the previous days, it will take
>  days/months to delete them all.
>
>* I noticed that the upstream Nix GC process can now operate without
>  locking. I think it shouldn't be too hard to port it to our fork or
>  maybe rewrite the process in Guile while we are at it.
>
>  That will not fix the slow hard-drives issues though.

While discussing this issue on IRC I came up with some idea:

'rmrfd' a system daemon that deletes huge trees in the background where
'-rf' stands for --really --fast :)

Actually this is an use case that happens for on my backup system too.
With that idea I just started coding and ran some experiments. For me
this looks quite feasible now and I will continue next days on this
small project. Any feedback or help would be welcomed!

The initial ideas and experiments are at https://github.com/cehteh/rmrfd

Note that the important part is that it will put some efforts into
freeing as much space as possible at begin of the freeing process,
Unlike just 'rm -rf' where space may only freed really late when the
last link count of the data goes to zero.

	Cheers
		Christian

Information forwarded to bug-guix <at> gnu.org:
bug#51787; Package guix. (Tue, 14 Dec 2021 03:32:02 GMT) Full text and rfc822 format available.

Message #58 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Christian Thäter <ct <at> pipapo.org>
To: bug-guix <at> gnu.org
Subject: Re: bug#51787: GC takes more than 9 hours on berlin
Date: Tue, 14 Dec 2021 04:31:23 +0100

On 2021-12-12 18:09, Mathieu Othacehe wrote:

>Hey,
>
>> OK, thanks. Looks it just finished removing the trash directory
>> content. I started a GC process from my session to monitor it
>> closely.  
>
>Daily GC recap:
>
>* The GC process I started yesterday, did collect 5.5TiB in
>  approximately 24 hours, that are now in the /gnu/store/trash
>  directory.
>  
>* The /gnu/store/trash directory contains 288910 entries. If those
>items
>  are removed at the same rate than on the previous days, it will take
>  days/months to delete them all.

On another note: when 'guix gc' determines that objects are dead, are
they really dead then or can it be that they are 'locally' dead but in
case the store serves as substitutes/offload server some external
clients may still request dead objects?

In the either case would make sense to run the GC more frequently, but
for the later case a --min-age option to preserve objects that just
died recently could be helping. Further it may consider the atime of
objects for removal.

And finally while I had this Idea: You mount the
guix store with 'relatime' or 'nodiratme', if not that could explain
the slow deletion process as well!

	Christian

>
>* I noticed that the upstream Nix GC process can now operate without
>  locking. I think it shouldn't be too hard to port it to our fork or
>  maybe rewrite the process in Guile while we are at it.
>
>  That will not fix the slow hard-drives issues though.
>
>Thanks,
>
>Mathieu
>
>
>

Information forwarded to bug-guix <at> gnu.org:
bug#51787; Package guix. (Fri, 17 Dec 2021 10:56:01 GMT) Full text and rfc822 format available.

Message #61 received at 51787 <at> debbugs.gnu.org (full text, mbox):

From: Mathieu Othacehe <othacehe <at> gnu.org>
To: 51787 <at> debbugs.gnu.org
Cc: rekado <at> elephly.net
Subject: Re: bug#51787: GC takes more than 9 hours on berlin
Date: Fri, 17 Dec 2021 11:55:26 +0100

[Message part 1 (text/plain, inline)]

Hey,

New day, new benchmark. Berlin has two hard drives, which are roughly
used this way:

/dev/sda -> / (916G)
/dev/sdb -> /gnu (37T)

I ran the fio benchmark tool on both of them. See the reports
attached, and the following summary:

--8<---------------cut here---------------start------------->8---
|       | sda       | sdb       |
|-------+-----------+-----------|
| read  | 1565KiB/s | 9695KiB/s |
| write | 523KiB/s  | 3240KiB/s |
--8<---------------cut here---------------end--------------->8---

I'm not sure how slow those figures are relatively to the hard drives
technologies. Ricardo, any idea about that?

My own NVME hard drive has 294MiB/s read and 98.1MiB/s write with the
same test in comparison.

I also tried to benchmark file removal this way, but this is not really
conclusive:

--8<---------------cut here---------------start------------->8---
# 3.8s on sdb
time for i in {0..100000}; do echo 'test' > "fs_test/test${i}.txt"; done

# 2.7s on sdb
time rf -rf fs_test
--8<---------------cut here---------------end--------------->8---

Thanks,

Mathieu

[fio_sda (application/octet-stream, attachment)]

[fio_sdb (application/octet-stream, attachment)]

Information forwarded to bug-guix <at> gnu.org:
bug#51787; Package guix. (Fri, 17 Dec 2021 13:31:01 GMT) Full text and rfc822 format available.

Message #64 received at 51787 <at> debbugs.gnu.org (full text, mbox):

From: Ricardo Wurmus <rekado <at> elephly.net>
To: Mathieu Othacehe <othacehe <at> gnu.org>
Cc: 51787 <at> debbugs.gnu.org
Subject: Re: bug#51787: GC takes more than 9 hours on berlin
Date: Fri, 17 Dec 2021 14:06:51 +0100

Hi Mathieu,

> New day, new benchmark. Berlin has two hard drives, which are roughly
> used this way:
>
> /dev/sda -> / (916G)
> /dev/sdb -> /gnu (37T)

sda consists of two local hard disks that are combined to a RAID.
Here are the disk details:

--8<---------------cut here---------------start------------->8---
Disk.Bay.0:Enclosure.Internal.0-1:RAID.Slot.3-1
   Status                           = Ok                                       
   DeviceDescription                = Disk 0 in Backplane 1 of RAID Controller in Slot 3
   RollupStatus                     = Ok                                       
   Name                             = Physical Disk 0:1:0                      
   State                            = Online                                   
   OperationState                   = Not Applicable                           
   PowerStatus                      = Spun-Up                                  
   Size                             = 931.000 GB                               
   FailurePredicted                 = NO                                       
   RemainingRatedWriteEndurance     = Not Applicable                           
   SecurityStatus                   = Not Capable                              
   BusProtocol                      = SATA                                     
   MediaType                        = HDD                                      
   UsedRaidDiskSpace                = 931.000 GB                               
   AvailableRaidDiskSpace           = 0.001 GB                                 
   Hotspare                         = NO                                       
   Manufacturer                     = SEAGATE                                  
   ProductId                        = ST1000NX0443                             
   Revision                         = NB33                                     
   SerialNumber                     = W470QK7K                                 
   PartNumber                       = CN08DN1YSGW0076S00L8A00                  
   NegotiatedSpeed                  = 6.0 Gb/s                                 
   ManufacturedDay                  = 0                                        
   ManufacturedWeek                 = 0                                        
   ManufacturedYear                 = 0                                        
   ForeignKeyIdentifier             = null                                     
   SasAddress                       = 0x4433221106000000                       
   FormFactor                       = 2.5 Inch                                 
   RaidNominalMediumRotationRate    = 7200                                     
   T10PICapability                  = Not Capable                              
   BlockSizeInBytes                 = 512                                      
   MaxCapableSpeed                  = 6 Gb/s                                   
   RaidType                         = None                                     
   SystemEraseCapability            = CryptographicErasePD                     
   SelfEncryptingDriveCapability    = Not Capable                              
   EncryptionCapability             = Not Capable                              
   CryptographicEraseCapability     = Capable                                  
Disk.Bay.1:Enclosure.Internal.0-1:RAID.Slot.3-1
   Status                           = Ok                                       
   DeviceDescription                = Disk 1 in Backplane 1 of RAID Controller in Slot 3
   RollupStatus                     = Ok                                       
   Name                             = Physical Disk 0:1:1                      
   State                            = Online                                   
   OperationState                   = Not Applicable                           
   PowerStatus                      = Spun-Up                                  
   Size                             = 931.000 GB                               
   FailurePredicted                 = NO                                       
   RemainingRatedWriteEndurance     = Not Applicable                           
   SecurityStatus                   = Not Capable                              
   BusProtocol                      = SATA                                     
   MediaType                        = HDD                                      
   UsedRaidDiskSpace                = 931.000 GB                               
   AvailableRaidDiskSpace           = 0.001 GB                                 
   Hotspare                         = NO                                       
   Manufacturer                     = SEAGATE                                  
   ProductId                        = ST1000NX0443                             
   Revision                         = NB33                                     
   SerialNumber                     = W470SYTP                                 
   PartNumber                       = CN08DN1YSGW0077F00FQA00                  
   NegotiatedSpeed                  = 6.0 Gb/s                                 
   ManufacturedDay                  = 0                                        
   ManufacturedWeek                 = 0                                        
   ManufacturedYear                 = 0                                        
   ForeignKeyIdentifier             = null                                     
   SasAddress                       = 0x4433221107000000                       
   FormFactor                       = 2.5 Inch                                 
   RaidNominalMediumRotationRate    = 7200                                     
   T10PICapability                  = Not Capable                              
   BlockSizeInBytes                 = 512                                      
   MaxCapableSpeed                  = 6 Gb/s                                   
   RaidType                         = None                                     
   SystemEraseCapability            = CryptographicErasePD                     
   SelfEncryptingDriveCapability    = Not Capable                              
   EncryptionCapability             = Not Capable                              
   CryptographicEraseCapability     = Capable                
--8<---------------cut here---------------end--------------->8---

sdb is an external storage array (Dell MD3400) filled with 10 hard disks
(SAS) in a RAID 10 configuration (36.36 TB effective capacity).  There
are two hot spares that are currently unassigned.  They are used
automatically when the RAID is degraded.  The two RAID controllers have
read and write caches enabled.  The enclosure has two redundant host
interfaces.

Berlin has two host based adapter cards of which *one* is connected to
the array.  Why only one?  Because we don’t have multipathd configured
so that the system could *boot* off the external array with multipath.
Without multipath the storage would appear as one disk device per card,
but it would not be safe to mount them both at the same time.

If we wanted to make use of the redundant connection here: figure out
how to add multipathd to the initrd and set up multipath *before*
handing off control to Linux.  This would effectively double our
bandwidth to the storage.

My guess is that we’re not even close to saturating the available
bandwidth.

> I ran the fio benchmark tool on both of them. See the reports
> attached, and the following summary:
>
> |       | sda       | sdb       |
> |-------+-----------+-----------|
> | read  | 1565KiB/s | 9695KiB/s |
> | write | 523KiB/s  | 3240KiB/s |
>
>
> I'm not sure how slow those figures are relatively to the hard drives
> technologies. Ricardo, any idea about that?

It seems awfully slow.  Especially performance of sda is abysmal: this
is a local disk.  sdb is the fancy external disk array that’s hooked up
to two HBA cards.  It should not perform *better* than sda.

I’ll run this on a few of the build nodes to get some more comparisons.

-- 
Ricardo

Information forwarded to bug-guix <at> gnu.org:
bug#51787; Package guix. (Fri, 17 Dec 2021 14:15:01 GMT) Full text and rfc822 format available.

Message #67 received at 51787 <at> debbugs.gnu.org (full text, mbox):

From: Ricardo Wurmus <rekado <at> elephly.net>
To: Mathieu Othacehe <othacehe <at> gnu.org>
Cc: 51787 <at> debbugs.gnu.org
Subject: Re: bug#51787: GC takes more than 9 hours on berlin
Date: Fri, 17 Dec 2021 15:08:23 +0100

Ricardo Wurmus <rekado <at> elephly.net> writes:

> I’ll run this on a few of the build nodes to get some more comparisons.

I ran “guix deploy” for hydra-guix-107, a node that was bought at the
same time as the one that is now the head node of ci.guix.gnu.org.  I
copied over a current Guix and installed “fio”
(/gnu/store/qs9cyy5s95n2fbjmxs48iccqvsvj6wxr-fio-3.28/bin/fio) there.

Here’s the output:

--8<---------------cut here---------------start------------->8---
root <at> hydra-guix-107 ~# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.28
Starting 1 process
test: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [m(1)][100.0%][r=172MiB/s,w=56.9MiB/s][r=44.2k,w=14.6k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=45365: Fri Dec 17 14:50:41 2021
  read: IOPS=42.9k, BW=167MiB/s (176MB/s)(3070MiB/18331msec)
   bw (  KiB/s): min=21208, max=199928, per=100.00%, avg=171621.56, stdev=46624.05, samples=36
   iops        : min= 5302, max=49982, avg=42905.39, stdev=11656.01, samples=36
  write: IOPS=14.3k, BW=56.0MiB/s (58.7MB/s)(1026MiB/18331msec); 0 zone resets
   bw (  KiB/s): min= 7424, max=66720, per=100.00%, avg=57364.22, stdev=15612.42, samples=36
   iops        : min= 1856, max=16680, avg=14341.06, stdev=3903.10, samples=36
  cpu          : usr=6.27%, sys=24.78%, ctx=121626, majf=0, minf=11
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=167MiB/s (176MB/s), 167MiB/s-167MiB/s (176MB/s-176MB/s), io=3070MiB (3219MB), run=18331-18331msec
  WRITE: bw=56.0MiB/s (58.7MB/s), 56.0MiB/s-56.0MiB/s (58.7MB/s-58.7MB/s), io=1026MiB (1076MB), run=18331-18331msec

Disk stats (read/write):
  sda: ios=779469/260494, merge=0/6, ticks=932265/195191, in_queue=1127456, util=99.50%
--8<---------------cut here---------------end--------------->8---

Here sda is RAID of two SSDs:

--8<---------------cut here---------------start------------->8---
racadm>>storage get pdisks -o
Disk.Bay.0:Enclosure.Internal.0-1:RAID.Integrated.1-1
   Status                           = Ok                                       
   DeviceDescription                = Disk 0 in Backplane 1 of Integrated RAID Controller 1
   RollupStatus                     = Ok                                       
   Name                             = Solid State Disk 0:1:0                   
   State                            = Online                                   
   OperationState                   = Not Applicable                           
   PowerStatus                      = On                                       
   Size                             = 223.001 GB                               
   FailurePredicted                 = NO                                       
   RemainingRatedWriteEndurance     = 99 %                                     
   SecurityStatus                   = Not Capable                              
   BusProtocol                      = SATA                                     
   MediaType                        = SSD                                      
   UsedRaidDiskSpace                = 223.001 GB                               
   AvailableRaidDiskSpace           = 0.001 GB                                 
   Hotspare                         = NO                                       
   Manufacturer                     = INTEL                                    
   ProductId                        = SSDSC2KG240G8R                           
   Revision                         = XCV1DL67                                 
   SerialNumber                     = BTYG91520CHD240AGN                       
   PartNumber                       = CN0T1WH8PESIT95302LTA01                  
   NegotiatedSpeed                  = 6.0 Gb/s                                 
   ManufacturedDay                  = 0                                        
   ManufacturedWeek                 = 0                                        
   ManufacturedYear                 = 0                                        
   ForeignKeyIdentifier             = null                                     
   SasAddress                       = 0x4433221104000000                       
   WWN                              = 0x4433221104000000                       
   FormFactor                       = 2.5 Inch                                 
   RaidNominalMediumRotationRate    = 1                                        
   T10PICapability                  = Not Capable                              
   BlockSizeInBytes                 = 512                                      
   MaxCapableSpeed                  = 6 Gb/s                                   
   RaidType                         = Unknown                                  
   SystemEraseCapability            = CryptographicErasePD                     
   SelfEncryptingDriveCapability    = Not Capable                              
   EncryptionCapability             = Not Capable                              
   CryptographicEraseCapability     = Capable                                  
   Certified                        = Yes                                      
   NonRAIDDiskCachePolicy           = Not Applicable                           
   EncryptionProtocol               = None                                     
Disk.Bay.1:Enclosure.Internal.0-1:RAID.Integrated.1-1
   Status                           = Ok                                       
   DeviceDescription                = Disk 1 in Backplane 1 of Integrated RAID Controller 1
   RollupStatus                     = Ok                                       
   Name                             = Solid State Disk 0:1:1                   
   State                            = Online                                   
   OperationState                   = Not Applicable                           
   PowerStatus                      = On                                       
   Size                             = 223.001 GB                               
   FailurePredicted                 = NO                                       
   RemainingRatedWriteEndurance     = 99 %                                     
   SecurityStatus                   = Not Capable                              
   BusProtocol                      = SATA                                     
   MediaType                        = SSD                                      
   UsedRaidDiskSpace                = 223.001 GB                               
   AvailableRaidDiskSpace           = 0.001 GB                                 
   Hotspare                         = NO                                       
   Manufacturer                     = INTEL                                    
   ProductId                        = SSDSC2KG240G8R                           
   Revision                         = XCV1DL67                                 
   SerialNumber                     = BTYG915502D9240AGN                       
   PartNumber                       = CN0T1WH8PESIT95303BSA01                  
   NegotiatedSpeed                  = 6.0 Gb/s                                 
   ManufacturedDay                  = 0                                        
   ManufacturedWeek                 = 0                                        
   ManufacturedYear                 = 0                                        
   ForeignKeyIdentifier             = null                                     
   SasAddress                       = 0x4433221100000000                       
   WWN                              = 0x4433221100000000                       
   FormFactor                       = 2.5 Inch                                 
   RaidNominalMediumRotationRate    = 1                                        
   T10PICapability                  = Not Capable                              
   BlockSizeInBytes                 = 512                                      
   MaxCapableSpeed                  = 6 Gb/s                                   
   RaidType                         = Unknown                                  
   SystemEraseCapability            = CryptographicErasePD                     
   SelfEncryptingDriveCapability    = Not Capable                              
   EncryptionCapability             = Not Capable                              
   CryptographicEraseCapability     = Capable                                  
   Certified                        = Yes                                      
   NonRAIDDiskCachePolicy           = Not Applicable                           
   EncryptionProtocol               = None                                     
--8<---------------cut here---------------end--------------->8---

-- 
Ricardo

Information forwarded to bug-guix <at> gnu.org:
bug#51787; Package guix. (Mon, 20 Dec 2021 17:00:02 GMT) Full text and rfc822 format available.

Message #70 received at 51787 <at> debbugs.gnu.org (full text, mbox):

From: Mathieu Othacehe <othacehe <at> gnu.org>
To: Ricardo Wurmus <rekado <at> elephly.net>
Cc: 51787 <at> debbugs.gnu.org
Subject: Re: Disk performance on ci.guix.gnu.org
Date: Mon, 20 Dec 2021 17:59:33 +0100

Hey,

> This is still pretty bad, but better than the <1M performance suggested
> by previous runs.

Mmh interesting, I also have a x10 speed up on sdb by increasing the
block size from 4k to 512k. I'm not sure what conclusion should we draw
from this observation.

In particular for our most urging matter, /gnu/store/trash
removal. Moving to a faster hard drive would definitely help here, but I
still don't understand if that disk performance regression comes from
Linux, the file-system fragmentation, or the disk itself.

>    READ: bw=1547MiB/s (1622MB/s), 1547MiB/s-1547MiB/s (1622MB/s-1622MB/s), io=3055MiB (3203MB), run=1975-1975msec
>   WRITE: bw=527MiB/s (553MB/s), 527MiB/s-527MiB/s (553MB/s-553MB/s), io=1042MiB (1092MB), run=1975-1975msec

Wooh that's fast! On test could be to copy the /gnu/store/trash content
to the SAN an observe how long that it takes for this operating to
complete.

Thanks for your support on that complex topic :)

Mathieu

Information forwarded to bug-guix <at> gnu.org:
bug#51787; Package guix. (Mon, 20 Dec 2021 17:10:02 GMT) Full text and rfc822 format available.

Message #73 received at 51787 <at> debbugs.gnu.org (full text, mbox):

From: Ricardo Wurmus <rekado <at> elephly.net>
To: Mathieu Othacehe <othacehe <at> gnu.org>
Cc: 51787 <at> debbugs.gnu.org
Subject: Re: Disk performance on ci.guix.gnu.org
Date: Mon, 20 Dec 2021 18:05:08 +0100

Mathieu Othacehe <othacehe <at> gnu.org> writes:

>> This is still pretty bad, but better than the <1M performance suggested
>> by previous runs.
>
> Mmh interesting, I also have a x10 speed up on sdb by increasing the
> block size from 4k to 512k. I'm not sure what conclusion should we draw
> from this observation.

As a general rule, we want the block size to match that of the
configured disk layout — if we care about getting the best numbers in
our benchmarks.  With real workloads things are always going to be
slower anyway.

> In particular for our most urging matter, /gnu/store/trash
> removal. Moving to a faster hard drive would definitely help here, but I
> still don't understand if that disk performance regression comes from
> Linux, the file-system fragmentation, or the disk itself.
>
>>    READ: bw=1547MiB/s (1622MB/s), 1547MiB/s-1547MiB/s (1622MB/s-1622MB/s), io=3055MiB (3203MB), run=1975-1975msec
>>   WRITE: bw=527MiB/s (553MB/s), 527MiB/s-527MiB/s (553MB/s-553MB/s), io=1042MiB (1092MB), run=1975-1975msec
>
> Wooh that's fast! On test could be to copy the /gnu/store/trash content
> to the SAN an observe how long that it takes for this operating to
> complete.

Do you mean time the copy or time the removal from that storage?  You
know what, I’ll time both.  I’ll need to get more space first.  I think
the trash directory is larger than the 500G that I got for testing the
SAN.

> Thanks for your support on that complex topic :)

Hey, I’m just happy neither of us has to do this alone.  Thank you!

-- 
Ricardo

Information forwarded to bug-guix <at> gnu.org:
bug#51787; Package guix. (Mon, 20 Dec 2021 18:38:01 GMT) Full text and rfc822 format available.

Message #76 received at 51787 <at> debbugs.gnu.org (full text, mbox):

From: Bengt Richter <bokr <at> bokr.com>
To: Mathieu Othacehe <othacehe <at> gnu.org>
Cc: Ricardo Wurmus <rekado <at> elephly.net>, 51787 <at> debbugs.gnu.org
Subject: Re: bug#51787: Disk performance on ci.guix.gnu.org
Date: Mon, 20 Dec 2021 19:36:51 +0100

On +2021-12-20 17:59:33 +0100, Mathieu Othacehe wrote:
> 
> Hey,
> 
> > This is still pretty bad, but better than the <1M performance suggested
> > by previous runs.
> 
> Mmh interesting, I also have a x10 speed up on sdb by increasing the
> block size from 4k to 512k. I'm not sure what conclusion should we draw
> from this observation.
> 
> In particular for our most urging matter, /gnu/store/trash
> removal. Moving to a faster hard drive would definitely help here, but I
> still don't understand if that disk performance regression comes from
> Linux, the file-system fragmentation, or the disk itself.
> 
> >    READ: bw=1547MiB/s (1622MB/s), 1547MiB/s-1547MiB/s (1622MB/s-1622MB/s), io=3055MiB (3203MB), run=1975-1975msec
> >   WRITE: bw=527MiB/s (553MB/s), 527MiB/s-527MiB/s (553MB/s-553MB/s), io=1042MiB (1092MB), run=1975-1975msec
> 
> Wooh that's fast! On test could be to copy the /gnu/store/trash content
> to the SAN an observe how long that it takes for this operating to
> complete.

also might be interesting to copy to /dev/null
to see read rate alone on /gnu/store?

> 
> Thanks for your support on that complex topic :)
> 
> Mathieu
> 
> 
> 

-- 
Regards,
Bengt Richter

Information forwarded to bug-guix <at> gnu.org:
bug#51787; Package guix. (Mon, 20 Dec 2021 21:26:01 GMT) Full text and rfc822 format available.

Message #79 received at 51787 <at> debbugs.gnu.org (full text, mbox):

From: Ricardo Wurmus <rekado <at> elephly.net>
To: 51787 <at> debbugs.gnu.org
Subject: GC takes more than 9 hours on berlin
Date: Mon, 20 Dec 2021 22:12:46 +0100

My colleague extended the SAN slice to 5TB for more realistic testing.
I formatted the disk with btrfs, and mounted it like this:

    mount /dev/sdd /mnt_test/

Then I ran the test with block size 512k:

--8<---------------cut here---------------start------------->8---
root <at> berlin ~# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=/mnt_test/test --bs=512k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
test: (g=0): rw=randrw, bs=(R) 512KiB-512KiB, (W) 512KiB-512KiB, (T) 512KiB-512KiB, ioengine=libaio, iodepth=64
fio-3.6
Starting 1 process
test: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [m(1)][100.0%][r=802MiB/s,w=274MiB/s][r=1603,w=547 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=16949: Mon Dec 20 22:18:28 2021
   read: IOPS=1590, BW=795MiB/s (834MB/s)(3055MiB/3842msec)
   bw (  KiB/s): min=747520, max=857088, per=99.83%, avg=812763.43, stdev=44213.07, samples=7
   iops        : min= 1460, max= 1674, avg=1587.43, stdev=86.35, samples=7
  write: IOPS=542, BW=271MiB/s (284MB/s)(1042MiB/3842msec)
   bw (  KiB/s): min=262144, max=297984, per=100.00%, avg=278820.57, stdev=15115.88, samples=7
   iops        : min=  512, max=  582, avg=544.57, stdev=29.52, samples=7
  cpu          : usr=1.98%, sys=96.28%, ctx=1096, majf=0, minf=6
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=0.4%, >=64=99.2%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=6109,2083,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=795MiB/s (834MB/s), 795MiB/s-795MiB/s (834MB/s-834MB/s), io=3055MiB (3203MB), run=3842-3842msec
  WRITE: bw=271MiB/s (284MB/s), 271MiB/s-271MiB/s (284MB/s-284MB/s), io=1042MiB (1092MB), run=3842-3842msec
--8<---------------cut here---------------end--------------->8---

Because this is fun I reran it with the same arguments:

--8<---------------cut here---------------start------------->8---
root <at> berlin ~# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=/mnt_test/test --bs=512k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
test: (g=0): rw=randrw, bs=(R) 512KiB-512KiB, (W) 512KiB-512KiB, (T) 512KiB-512KiB, ioengine=libaio, iodepth=64
fio-3.6
Starting 1 process
Jobs: 1 (f=0): [f(1)][-.-%][r=756MiB/s,w=260MiB/s][r=1511,w=519 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=17488: Mon Dec 20 22:18:56 2021
   read: IOPS=1647, BW=824MiB/s (864MB/s)(3055MiB/3708msec)
   bw (  KiB/s): min=738304, max=929792, per=99.28%, avg=837485.71, stdev=73710.05, samples=7
   iops        : min= 1442, max= 1816, avg=1635.71, stdev=143.96, samples=7
  write: IOPS=561, BW=281MiB/s (295MB/s)(1042MiB/3708msec)
   bw (  KiB/s): min=234496, max=320512, per=99.79%, avg=287012.57, stdev=29009.60, samples=7
   iops        : min=  458, max=  626, avg=560.57, stdev=56.66, samples=7
  cpu          : usr=1.38%, sys=96.47%, ctx=1394, majf=0, minf=16420
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=0.4%, >=64=99.2%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=6109,2083,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=824MiB/s (864MB/s), 824MiB/s-824MiB/s (864MB/s-864MB/s), io=3055MiB (3203MB), run=3708-3708msec
  WRITE: bw=281MiB/s (295MB/s), 281MiB/s-281MiB/s (295MB/s-295MB/s), io=1042MiB (1092MB), run=3708-3708msec
--8<---------------cut here---------------end--------------->8---

Then I mounted with compression and space cache:

    mount /dev/sdd -o compress-force=zstd,space_cache=v2 /mnt_test/

The numbers don’t differ much at all.

--8<---------------cut here---------------start------------->8---
Run status group 0 (all jobs):
   READ: bw=882MiB/s (925MB/s), 882MiB/s-882MiB/s (925MB/s-925MB/s), io=3055MiB (3203MB), run=3464-3464msec
  WRITE: bw=301MiB/s (315MB/s), 301MiB/s-301MiB/s (315MB/s-315MB/s), io=1042MiB (1092MB), run=3464-3464msec
--8<---------------cut here---------------end--------------->8---


I then erased the file system and again put on a big ext4:

--8<---------------cut here---------------start------------->8---
root <at> berlin ~# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=/mnt_test/test --bs=512k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
test: (g=0): rw=randrw, bs=(R) 512KiB-512KiB, (W) 512KiB-512KiB, (T) 512KiB-512KiB, ioengine=libaio, iodepth=64
fio-3.6
Starting 1 process
test: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [m(1)][-.-%][r=1539MiB/s,w=526MiB/s][r=3078,w=1052 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=20672: Mon Dec 20 22:23:29 2021
   read: IOPS=3077, BW=1539MiB/s (1614MB/s)(3055MiB/1985msec)
   bw (  MiB/s): min= 1530, max= 1548, per=100.00%, avg=1539.33, stdev= 9.02, samples=3
   iops        : min= 3060, max= 3096, avg=3078.67, stdev=18.04, samples=3
  write: IOPS=1049, BW=525MiB/s (550MB/s)(1042MiB/1985msec)
   bw (  KiB/s): min=533504, max=557056, per=100.00%, avg=546133.33, stdev=11868.39, samples=3
   iops        : min= 1042, max= 1088, avg=1066.67, stdev=23.18, samples=3
  cpu          : usr=2.17%, sys=11.24%, ctx=4787, majf=0, minf=8
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=0.4%, >=64=99.2%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=6109,2083,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=1539MiB/s (1614MB/s), 1539MiB/s-1539MiB/s (1614MB/s-1614MB/s), io=3055MiB (3203MB), run=1985-1985msec
  WRITE: bw=525MiB/s (550MB/s), 525MiB/s-525MiB/s (550MB/s-550MB/s), io=1042MiB (1092MB), run=1985-1985msec

Disk stats (read/write):
  sdd: ios=5926/2087, merge=1/0, ticks=119183/3276, in_queue=122460, util=94.87%
--8<---------------cut here---------------end--------------->8---

No idea why btrfs performs so much worse in comparison.

I’ll copy over /gnu/store/trash next.

-- 
Ricardo

Information forwarded to bug-guix <at> gnu.org:
bug#51787; Package guix. (Mon, 20 Dec 2021 21:54:01 GMT) Full text and rfc822 format available.

Message #82 received at 51787 <at> debbugs.gnu.org (full text, mbox):

From: Mathieu Othacehe <othacehe <at> gnu.org>
To: Ricardo Wurmus <rekado <at> elephly.net>
Cc: 51787 <at> debbugs.gnu.org
Subject: Re: Disk performance on ci.guix.gnu.org
Date: Mon, 20 Dec 2021 22:53:45 +0100

Hey,

> Do you mean time the copy or time the removal from that storage?  You
> know what, I’ll time both.  I’ll need to get more space first.  I think
> the trash directory is larger than the 500G that I got for testing the
> SAN.

Yeah I meant removal time :) I found this article[1] that suggests that
over time the ext4 fragmentation can cause a performance drop that is
very noticeable on hard drives.

I'm trying to determine how fragmented is the sdb1 file-system, by
running e4defrag and e2freefrag[2], but I'm not sure if they will
complete soon.

Copying and removing /gnu/store/trash on the SAN will be interesting but
the ultimate test would be to be able to re-create the ext4 file-system
directly on Berlin's sdb drive to evaluate the fragmentation role in
this funny business.

Thanks,

Mathieu

[1]: https://www.usenix.org/system/files/hotstorage19-paper-conway.pdf
[2]: https://cromwell-intl.com/open-source/performance-tuning/file-systems.html

Information forwarded to bug-guix <at> gnu.org:
bug#51787; Package guix. (Tue, 21 Dec 2021 17:37:02 GMT) Full text and rfc822 format available.

Message #85 received at 51787 <at> debbugs.gnu.org (full text, mbox):

From: Ricardo Wurmus <rekado <at> elephly.net>
To: Mathieu Othacehe <othacehe <at> gnu.org>
Cc: 51787 <at> debbugs.gnu.org
Subject: Re: Disk performance on ci.guix.gnu.org
Date: Tue, 21 Dec 2021 18:26:03 +0100

Today we discovered a few more things and discussed them on IRC.  Here’s
a summary.

/var/cache sits on the same storage as /gnu.  We mounted the 5TB ext4
file system that’s hosted by the SAN at /mnt_test and started copying
over /var/cache to /mnt_test/var/cache.  Transfer speed was considerably
faster (not *great*, but reasonably fast) than the copy of
/gnu/store/trash to the same target.

This confirmed our suspicions that the problem is not with the storage
array but due to the fact that /gnu/store/trash (and also /gnu/store)
is an extremely large, flat directory.  /var/cache is not.

Here’s what we do now: continue copying /var/cache to the SAN, then
remount to serve substitutes from there.  This removes some pressure
from the file system as it will only be used for /gnu.  We’re
considering to dump the file system completely (i.e. reinstall the
server), thereby emptying /gnu, but leaving the stash of built
substitutes in /var/cache (hosted from the faster SAN).

We could take this opportunity to reformat /gnu with btrfs, which
performs quite a bit more poorly than ext4 but would be immune to
defragmentation.  It’s not clear that defragmentation matters here.  It
could just be that the problem is exclusively caused by having these
incredibly large, flat /gnu/store, /gnu/store/.links, and
/gnu/store/trash directories.

A possible alternative for this file system might also be XFS, which
performs well when presented with unreasonably large directories.

It may be a good idea to come up with realistic test scenarios that we
could test with each of these three file systems at scale.

Any ideas?

-- 
Ricardo

Information forwarded to bug-guix <at> gnu.org:
bug#51787; Package guix. (Tue, 21 Dec 2021 17:52:01 GMT) Full text and rfc822 format available.

Message #88 received at 51787 <at> debbugs.gnu.org (full text, mbox):

From: Leo Famulari <leo <at> famulari.name>
To: Ricardo Wurmus <rekado <at> elephly.net>
Cc: Mathieu Othacehe <othacehe <at> gnu.org>, 51787 <at> debbugs.gnu.org
Subject: Re: bug#51787: Disk performance on ci.guix.gnu.org
Date: Tue, 21 Dec 2021 12:51:35 -0500

On Tue, Dec 21, 2021 at 06:26:03PM +0100, Ricardo Wurmus wrote:
> We could take this opportunity to reformat /gnu with btrfs, which
> performs quite a bit more poorly than ext4 but would be immune to
> defragmentation.  It’s not clear that defragmentation matters here.  It
> could just be that the problem is exclusively caused by having these
> incredibly large, flat /gnu/store, /gnu/store/.links, and
> /gnu/store/trash directories.

My impression was that btrfs could also become fragmented. At least,
btrfs-progrs includes a command for defragmenting. Or do I
misunderstand?

Information forwarded to bug-guix <at> gnu.org:
bug#51787; Package guix. (Tue, 21 Dec 2021 18:24:01 GMT) Full text and rfc822 format available.

Message #91 received at 51787 <at> debbugs.gnu.org (full text, mbox):

From: Mathieu Othacehe <othacehe <at> gnu.org>
To: Ricardo Wurmus <rekado <at> elephly.net>
Cc: 51787 <at> debbugs.gnu.org
Subject: Re: Disk performance on ci.guix.gnu.org
Date: Tue, 21 Dec 2021 19:23:28 +0100

Hey,

> Today we discovered a few more things and discussed them on IRC.  Here’s
> a summary.

Nice summary :)

> We could take this opportunity to reformat /gnu with btrfs, which
> performs quite a bit more poorly than ext4 but would be immune to
> defragmentation.  It’s not clear that defragmentation matters here.  It
> could just be that the problem is exclusively caused by having these
> incredibly large, flat /gnu/store, /gnu/store/.links, and
> /gnu/store/trash directories.
>
> A possible alternative for this file system might also be XFS, which
> performs well when presented with unreasonably large directories.
>
> It may be a good idea to come up with realistic test scenarios that we
> could test with each of these three file systems at scale.

We could compare xfs, btrfs and ext4 performances on a store subset,
1TiB for instance that we would create on the SAN. Realistic test
scenario could be:

- Time the copy of new items to the test store.
- Time the removal of randomly picked items from the test store.
- Time the creation of nar archives from the test store.

That will allow us to choose the file-system that has the best
performances for our use-case, regardless of fragmentation.

Now fragmentation may or may not be a problem as you mentioned. What we
could do is repeat the same tests but on a test store that is created
and removed N times, to simulate file-system aging.

This is more or less what is done in this article[1] by "git pulling" N
times a repository and testing read performances. For them btrfs > xfs >
ext4 in term of performances, but we might draw different conclusions
for our specific use case.

Do you think it is realistic? If so, we can start working on some test
scripts.

Thanks,

Mathieu

[1]: https://www.usenix.org/system/files/hotstorage19-paper-conway.pdf

Information forwarded to bug-guix <at> gnu.org:
bug#51787; Package guix. (Tue, 21 Dec 2021 23:21:01 GMT) Full text and rfc822 format available.

Message #94 received at 51787 <at> debbugs.gnu.org (full text, mbox):

From: Bengt Richter <bokr <at> bokr.com>
To: Ricardo Wurmus <rekado <at> elephly.net>
Cc: Mathieu Othacehe <othacehe <at> gnu.org>, 51787 <at> debbugs.gnu.org
Subject: Re: bug#51787: Disk performance on ci.guix.gnu.org
Date: Wed, 22 Dec 2021 00:20:24 +0100

Hi Ricardo,

TL;DR: re: "Any ideas?" :)

Read this [0], and consider how file systems may be
interacting with with SSD wear-leveling algorithms.

Are some file systems dependent on successful speculative
transaction continuations, while others might slow down
waiting for signs that an SSD controller has committed one
of ITS transactions, e.g. in special cases where the user or
kernel file system wants to be sure metadata is
written/journaled for fs structural integrity, but maybe
cares less about data?

I guess this difference might show up in copying a large
file over-writing the same target file (slower) vs copying
to a series of new files (faster).

What happens if you use a contiguous file as swap space?

Or, if you use anonymous files as user data space buffers,
passing them to wayland as file handles, per its protocol,
can you do better than ignoring SSD controllers and/or
storage hardware altogether?

Reference [0] is from 2013, so probably much has happened
since then, and the paper mentions (which has probably not
gotten better), the following, referring to trade secrets
giving one manufacturer ability to produce longer-lasting
SSDs cheaper and better than others ...

--8<---------------cut here---------------start------------->8---
    This means that the SSD controller is dedicated to a
    single brand of NAND, and it means that the SSD maker
    can’t shop around among NAND suppliers for the best price.
    Furthermore, the NAND supplier won’t share this
    information unless it believes that there is some compelling
    reason to work the SSD manufacturer. Since there are
    hundreds of SSD makers it’s really difficult to get these
    companies to pay attention to you! The SSD manufacturers
    that have this kind of relationship with their flash
    suppliers are very rare and very special.
--8<---------------cut here---------------end--------------->8---

Well, maybe you will have to parameterize your file system
tuning with manufacturer ID and SSD controller firmware
version ;/

Mvh, Bengt

[0] https://www.snia.org/sites/default/files/SSSITECHNOTES_HowControllersMaximizeSSDLife.pdf

On +2021-12-21 18:26:03 +0100, Ricardo Wurmus wrote:
> Today we discovered a few more things and discussed them on IRC.  Here’s
> a summary.
> 
> /var/cache sits on the same storage as /gnu.  We mounted the 5TB ext4
> file system that’s hosted by the SAN at /mnt_test and started copying
> over /var/cache to /mnt_test/var/cache.  Transfer speed was considerably
> faster (not *great*, but reasonably fast) than the copy of
> /gnu/store/trash to the same target.
> 
> This confirmed our suspicions that the problem is not with the storage
> array but due to the fact that /gnu/store/trash (and also /gnu/store)
> is an extremely large, flat directory.  /var/cache is not.
> 
> Here’s what we do now: continue copying /var/cache to the SAN, then
> remount to serve substitutes from there.  This removes some pressure
> from the file system as it will only be used for /gnu.  We’re
> considering to dump the file system completely (i.e. reinstall the
> server), thereby emptying /gnu, but leaving the stash of built
> substitutes in /var/cache (hosted from the faster SAN).
> 
> We could take this opportunity to reformat /gnu with btrfs, which
> performs quite a bit more poorly than ext4 but would be immune to
> defragmentation.  It’s not clear that defragmentation matters here.  It
> could just be that the problem is exclusively caused by having these
> incredibly large, flat /gnu/store, /gnu/store/.links, and
> /gnu/store/trash directories.
> 
> A possible alternative for this file system might also be XFS, which
> performs well when presented with unreasonably large directories.
> 
> It may be a good idea to come up with realistic test scenarios that we
> could test with each of these three file systems at scale.
> 
> Any ideas?
> 
> -- 
> Ricardo
> 
> 
> 
(sorry, the top-post grew)
-- 
Regards,
Bengt Richter

Information forwarded to bug-guix <at> gnu.org:
bug#51787; Package guix. (Wed, 22 Dec 2021 01:35:02 GMT) Full text and rfc822 format available.

Message #97 received at 51787 <at> debbugs.gnu.org (full text, mbox):

From: Thiago Jung Bauermann <bauermann <at> kolabnow.com>
To: 51787 <at> debbugs.gnu.org
Cc: Ricardo Wurmus <rekado <at> elephly.net>, Mathieu Othacehe <othacehe <at> gnu.org>,
 Bengt Richter <bokr <at> bokr.com>, Leo Famulari <leo <at> famulari.name>
Subject: Re: bug#51787: Disk performance on ci.guix.gnu.org
Date: Tue, 21 Dec 2021 21:27:46 -0300

Hello,

Ricardo Wurmus <rekado <at> elephly.net> writes:

> Today we discovered a few more things and discussed them on IRC.  Here’s
> a summary.
>
> /var/cache sits on the same storage as /gnu.  We mounted the 5TB ext4
> file system that’s hosted by the SAN at /mnt_test and started copying
> over /var/cache to /mnt_test/var/cache.  Transfer speed was considerably
> faster (not *great*, but reasonably fast) than the copy of
> /gnu/store/trash to the same target.
>
> This confirmed our suspicions that the problem is not with the storage
> array but due to the fact that /gnu/store/trash (and also /gnu/store)
> is an extremely large, flat directory.  /var/cache is not.

There was an interesting thread in the Linux kernel mailing lists about this
very issue earlier this year:

https://lore.kernel.org/linux-fsdevel/206078.1621264018 <at> warthog.procyon.org.uk/

I’m not sure I completely understood all of the concerns discussed there, but
my understanding of it is that for workloads which don’t concurrently modify
the huge directory, it’s size isn’t a problem for btrfs and XFS and in fact
it’s even more efficient to have one big directory rather than
subdirectories¹. It’s should also be well handled even by ext4, IIUC².

The problem for all filesystems is concurrently modifying the directory
(e.g., adding or removing files), because the kernel serializes directory
operations at the VFS layer.

Also in that case XFS can also have allocation issues when adding new files
if one isn’t careful.³

-- 
Thanks
Thiago

¹ https://lore.kernel.org/linux-fsdevel/20210517232237.GE2893 <at> dread.disaster.area/
² https://lore.kernel.org/linux-fsdevel/6E4DE257-4220-4B5B-B3D0-B67C7BC69BB5 <at> dilger.ca/
³ https://lore.kernel.org/linux-fsdevel/20210519125743.GP2893 <at> dread.disaster.area/

Information forwarded to bug-guix <at> gnu.org:
bug#51787; Package guix. (Sat, 25 Dec 2021 22:24:01 GMT) Full text and rfc822 format available.

Message #100 received at 51787 <at> debbugs.gnu.org (full text, mbox):

From: Ricardo Wurmus <rekado <at> elephly.net>
To: Mathieu Othacehe <othacehe <at> gnu.org>
Cc: 51787 <at> debbugs.gnu.org
Subject: Re: Disk performance on ci.guix.gnu.org
Date: Sat, 25 Dec 2021 23:19:23 +0100

Ricardo Wurmus <rekado <at> elephly.net> writes:

> Today we discovered a few more things and discussed them on IRC.  Here’s
> a summary.
>
> /var/cache sits on the same storage as /gnu.  We mounted the 5TB ext4
> file system that’s hosted by the SAN at /mnt_test and started copying
> over /var/cache to /mnt_test/var/cache.  Transfer speed was considerably
> faster (not *great*, but reasonably fast) than the copy of
> /gnu/store/trash to the same target.

Turns out that space on the SAN is insufficient for a full copy of
/var/cache.  We’ve hit ENOSPC after 4.2TB.  The SAN enforces some
headroom to remain free, so it denies us full access to the 5TB slice.
Bummer.

I guess we’ll have to wait for the SAN extension some time early 2022
before we can relocate the substitutes cache.

Should we attempt to overwrite /gnu/store and rely exclusively on
substitutes from the cache?

No matter how we look at it, the huge store is a performance problem for
us.  Today I had to kill ’guix gc’ after the GC lock had been held for
about 24 hours.  We will keep having this problem.

-- 
Ricardo

Information forwarded to bug-guix <at> gnu.org:
bug#51787; Package guix. (Sun, 26 Dec 2021 08:54:02 GMT) Full text and rfc822 format available.

Message #103 received at 51787 <at> debbugs.gnu.org (full text, mbox):

From: Mathieu Othacehe <othacehe <at> gnu.org>
To: Ricardo Wurmus <rekado <at> elephly.net>
Cc: 51787 <at> debbugs.gnu.org
Subject: Re: Disk performance on ci.guix.gnu.org
Date: Sun, 26 Dec 2021 09:53:09 +0100

Hello Ricardo,

> Should we attempt to overwrite /gnu/store and rely exclusively on
> substitutes from the cache?

Yes, I don't see any other options. Before that, what might be nice
could be:

1. Ensure that all Berlin /var/cache/guix/publish directory is
synchronized on Bordeaux. We are now at 117G out of X. We could then
start a publish server on Bordeaux. As Bordeaux is already part of the
default substitute servers list, the transition could be smooth I guess.

2. Determine what file-system out of ext4, btrfs and xfs could be the
most suitable for Berlin's /gnu/store. I'm running some tests on an old
HDD to try to determine the fragmentation impact on those
file-systems. We can of course choose to be conservative and go for ext4
that did the job until now.

Regarding the /gnu/store re-creation, I wonder how can we do it without
reinstalling completely Berlin. Maybe we could save the system store
closure somewhere and restore it on the shining new file-system?

Thanks,

Mathieu

Information forwarded to bug-guix <at> gnu.org:
bug#51787; Package guix. (Thu, 30 Dec 2021 10:52:01 GMT) Full text and rfc822 format available.

Message #106 received at 51787 <at> debbugs.gnu.org (full text, mbox):

From: Ricardo Wurmus <rekado <at> elephly.net>
To: Mathieu Othacehe <othacehe <at> gnu.org>
Cc: 51787 <at> debbugs.gnu.org
Subject: Re: Disk performance on ci.guix.gnu.org
Date: Thu, 30 Dec 2021 11:44:12 +0100

Mathieu Othacehe <othacehe <at> gnu.org> writes:

> Hello Ricardo,
>
>> Should we attempt to overwrite /gnu/store and rely exclusively on
>> substitutes from the cache?
>
> Yes, I don't see any other options. Before that, what might be nice
> could be:
>
> 1. Ensure that all Berlin /var/cache/guix/publish directory is
> synchronized on Bordeaux. We are now at 117G out of X. We could then
> start a publish server on Bordeaux. As Bordeaux is already part of the
> default substitute servers list, the transition could be smooth I guess.

I had the SAN slice extended from 5TB to 10TB.  This is now also full
(at 9.2TB due to SAN configuration).  I suggest doing the rsync to
Bordeaux from /mnt_test/var/cache/guix/publish instead of the much
slower /var/cache/guix/publish.  It doesn’t hold *all* files, but 9+TB
should be enough to fuel the transfer to Bordeaux for a while.

> Regarding the /gnu/store re-creation, I wonder how can we do it without
> reinstalling completely Berlin. Maybe we could save the system store
> closure somewhere and restore it on the shining new file-system?

I don’t know.  I would want to take a copy of the root file system as a
backup of state (like the Lets Encrypt certs), and copy the closure of
the current operating system configuration somewhere.  We could copy it
to a dedicated build node (after stopping the GC cron job) and set it up
as an internal substitute server.  Then “guix system init” while
fetching the substitutes from that server.

But I guess we’d have to boot the installer image anyway so that we can
safely erase /gnu/store, or else we’d erase files that are currently in
use.

-- 
Ricardo

Reply sent to Maxim Cournoyer <maxim.cournoyer <at> gmail.com>:
You have taken responsibility. (Wed, 16 Aug 2023 10:54:02 GMT) Full text and rfc822 format available.

Notification sent to Mathieu Othacehe <othacehe <at> gnu.org>:
bug acknowledged by developer. (Wed, 16 Aug 2023 10:54:02 GMT) Full text and rfc822 format available.

Message #111 received at 51787-done <at> debbugs.gnu.org (full text, mbox):

From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: Mathieu Othacehe <othacehe <at> gnu.org>
Cc: 51787-done <at> debbugs.gnu.org
Subject: Re: bug#51787: GC takes more than 9 hours on berlin
Date: Wed, 16 Aug 2023 06:53:05 -0400

Hi,

Mathieu Othacehe <othacehe <at> gnu.org> writes:

> Hello,
>
> On berlin, the daily GC command is still running whereas it was started
> 9 hours ago.
>
> guix processes
> [...]
> SessionPID: 37231
> ClientPID: 37195
> ClientCommand: /gnu/store/49vfv8l1j96bbd73ssbqanpnjz83arss-guix-1.3.0-11.014f1b6/libexec/guix/guile \ /gnu/store/49vfv8l1j96bbd73ssbqanpnjz83arss-guix-1.3.0-11.014f1b6/bin/guix gc -F10995116277760
>
>
> and
>
> mathieu <at> berlin ~$ ps auxww|grep 37195
> root      37195  0.0  0.0 183260 33440 ?        Sl   03:59   0:00 /gnu/store/49vfv8l1j96bbd73ssbqanpnjz83arss-guix-1.3.0-11.014f1b6/libexec/guix/guile \ /gnu/store/49vfv8l1j96bbd73ssbqanpnjz83arss-guix-1.3.0-11.014f1b6/bin/guix gc -F10995116277760
>
>
> That's really problematic as it is blocking some other berlin services
> such as Cuirass, which has 4564 packages in its fetch queue:
>
> mathieu <at> berlin ~$ less /var/log/cuirass-remote-server.log
> [...]
> 2021-11-12T12:47:01 period update: 0 resumable, 0 failed builds.
> 2021-11-12T12:47:01 period update: 4564 items in the fetch queue.

I'm more than happy to close this old, no longer relevant bug :-).

Since it was reported, the storage was migrated to a much faster array,
and is using the Btrfs file system with zstd compression, which avoids
the inodes exhaustion issue we've had with ext4 in the past.

A full garbage collection (GC) is run daily to keep the store growth in
check and for the GC to always take a similar amount of time (less than
20 minutes) every day.

-- 
Thanks,
Maxim

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Wed, 13 Sep 2023 11:24:14 GMT) Full text and rfc822 format available.

This bug report was last modified 2 years and 92 days ago.

Previous Next

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.

GNU bug report logs - #51787 GC takes more than 9 hours on berlin

GNU bug report logs - #51787
GC takes more than 9 hours on berlin