GNU bug report logs - #59784
[version 1.4.0rc1] Retrying a failed install fails

Package: guix;

Reported by: "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de>

Date: Fri, 2 Dec 2022 17:53:02 UTC

Severity: normal

Done: "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 59784 in the body.
You can then email your comments to 59784 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-guix <at> gnu.org:
bug#59784; Package guix. (Fri, 02 Dec 2022 17:53:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de>:
New bug report received and forwarded. Copy sent to bug-guix <at> gnu.org. (Fri, 02 Dec 2022 17:53:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de>
To: bug-guix <at> gnu.org
Subject: [version 1.4.0rc1] Retrying a failed install fails
Date: Fri, 02 Dec 2022 18:52:39 +0100
I aborted graphical system installation (Ctrl-C), retried the
installation and got this:

shepherd: Service guix-daemon has been stopped.
shepherd: Service guix-daemon has been started.
guix system: error: opening file `/gnu/store/4z81a7njyvnwa4kn46ad6vhvi0lcnrhh-shadow-4.9.drv': No such file or directory
Command ("guix" "system" "init" "--fallback" "/mnt/etc/config.scm" "/mnt") exited with exit code 1
Press Enter to continue.

(It told me to press Enter to continue.)  I did so; retried; but again
it did not really retry the installation, I always get this same error
message.

Sorry in case this is a duplicate bug.

Regards,
Florian




Information forwarded to bug-guix <at> gnu.org:
bug#59784; Package guix. (Fri, 09 Dec 2022 09:43:01 GMT) Full text and rfc822 format available.

Message #8 received at 59784 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de>
Cc: 59784 <at> debbugs.gnu.org
Subject: Re: bug#59784: [version 1.4.0rc1] Retrying a failed install fails
Date: Fri, 09 Dec 2022 10:42:16 +0100
Hi,

"pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de> writes:

> I aborted graphical system installation (Ctrl-C), retried the
> installation and got this:
>
> shepherd: Service guix-daemon has been stopped.
> shepherd: Service guix-daemon has been started.
> guix system: error: opening file `/gnu/store/4z81a7njyvnwa4kn46ad6vhvi0lcnrhh-shadow-4.9.drv': No such file or directory
> Command ("guix" "system" "init" "--fallback" "/mnt/etc/config.scm" "/mnt") exited with exit code 1
> Press Enter to continue.
>
> (It told me to press Enter to continue.)  I did so; retried; but again
> it did not really retry the installation, I always get this same error
> message.

Related to that, I found this old bug:

  https://issues.guix.gnu.org/35543

I tried to reproduce it:

  0. I chose a basic installation to a fully-encrypted disk with a
     single partition.

  1. I hit Ctrl-C while ‘guix system init’ was downloading substitutes.

  2. That led me to a confusing error screen that says “Command cryptsetup
     failed” with Ignore/Abort/Retry buttons.  This should have been
     “Command guix system init failed”, no?

  3. I resumed starting with the “Configuration File” step, and there
     ‘guix system init’ ran to completion just fine.

Maybe the difference is that you hit Ctrl-C when ‘guix system init’ had
already started copying stuff to /mnt?

Thanks,
Ludo’.




Information forwarded to bug-guix <at> gnu.org:
bug#59784; Package guix. (Fri, 09 Dec 2022 11:12:02 GMT) Full text and rfc822 format available.

Message #11 received at 59784 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de>
Cc: 59784 <at> debbugs.gnu.org
Subject: Re: bug#59784: [version 1.4.0rc1] Retrying a failed install fails
Date: Fri, 09 Dec 2022 12:11:43 +0100
Ludovic Courtès <ludo <at> gnu.org> writes:

>   2. That led me to a confusing error screen that says “Command cryptsetup
>      failed” with Ignore/Abort/Retry buttons.

Actually it’s “External command ("cryptsetup" "close" "cryptroot")
exited with code 5” and “cryptroot device is busy”.

Ludo’.




Information forwarded to bug-guix <at> gnu.org:
bug#59784; Package guix. (Sat, 10 Dec 2022 08:41:02 GMT) Full text and rfc822 format available.

Message #14 received at 59784 <at> debbugs.gnu.org (full text, mbox):

From: "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: 59784 <at> debbugs.gnu.org
Subject: Re: bug#59784: [version 1.4.0rc1] Retrying a failed install fails
Date: Sat, 10 Dec 2022 09:39:54 +0100
Ludovic Courtès <ludo <at> gnu.org> writes:
> I tried to reproduce it:
>
>   0. I chose a basic installation to a fully-encrypted disk with a
>      single partition.
>
>   1. I hit Ctrl-C while ‘guix system init’ was downloading substitutes.
>
>   2. That led me to a confusing error screen that says “Command cryptsetup
>      failed” with Ignore/Abort/Retry buttons.  This should have been
>      “Command guix system init failed”, no?
>
>   3. I resumed starting with the “Configuration File” step, and there
>      ‘guix system init’ ran to completion just fine.

Yes, these were the steps, except I did not do encryption.  But I had
not told the whole story …  Sorry!

So what was missing is that the reason I pressed Ctrl-C was a rare
dropout by my Ethernet controller.  Because the dropout is so rare and
has not happened again since, I reproduced the issue with the following
substitute procedure:

 0. Use Ethernet for the installation.

 1. During substitute downloading, pull the Ethernet plug.

 2. Get lucky so the installation will crash with an error and not just
    pause.  Otherwise, if no crash, repeat.

 3. Press Ctrl-C.

 4. Resume the installation from the last step.

 5. It will fail now.

I now uploaded an installer-dump-bade9971 of me reproducing the issue.

> Maybe the difference is that you hit Ctrl-C when ‘guix system init’ had
> already started copying stuff to /mnt?

No, like you, I was in the substitute downloading step.

This issue is much rarer than I thought.

Thank you for investigating.

Regards,
Florian




Information forwarded to bug-guix <at> gnu.org:
bug#59784; Package guix. (Mon, 12 Dec 2022 12:08:02 GMT) Full text and rfc822 format available.

Message #17 received at 59784 <at> debbugs.gnu.org (full text, mbox):

From: "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de>
To: 59784 <at> debbugs.gnu.org
Subject: Re: bug#59784: [version 1.4.0rc1] Retrying a failed install fails
Date: Mon, 12 Dec 2022 13:07:45 +0100
"pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de> writes:
> shepherd: Service guix-daemon has been stopped.
> shepherd: Service guix-daemon has been started.
> guix system: error: opening file
> `/gnu/store/4z81a7njyvnwa4kn46ad6vhvi0lcnrhh-shadow-4.9.drv': No such
> file or directory
> Command ("guix" "system" "init" "--fallback" "/mnt/etc/config.scm" "/mnt") exited with exit code 1

Still happens with 1.4.0rc2.  I guess install-system in
gnu/installer/final.scm does not sync the disk on failure?

Regards,
Florian




Information forwarded to bug-guix <at> gnu.org:
bug#59784; Package guix. (Tue, 13 Dec 2022 09:41:02 GMT) Full text and rfc822 format available.

Message #20 received at 59784 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de>
Cc: Mathieu Othacehe <othacehe <at> gnu.org>, 59784 <at> debbugs.gnu.org
Subject: Re: bug#59784: [version 1.4.0rc1] Retrying a failed install fails
Date: Tue, 13 Dec 2022 10:40:16 +0100
Hi,

"pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de> writes:

> I now uploaded an installer-dump-bade9971 of me reproducing the issue.

Here’s the relevant syslog excerpt (this was with 1.4.0rc1) where we can
see the point where you unplugged the Ethernet connection:

--8<---------------cut here---------------start------------->8---
Dec 10 09:07:29 localhost installer[399]: running command ("guix" "system" "init" "--fallback" "/mnt/etc/config.scm" "/mnt") 
Dec 10 09:07:48 localhost installer[399]: 10.3 MB will be downloaded
Dec 10 09:07:49 localhost installer[399]: utf8proc-2.5.0  52KiB                716KiB/s 00:00 [##################] 100.0%  utf8proc-2.5.0  52KiB                594KiB/s 00:00 [##################] 100.0%

[...]

Dec 10 09:08:48 localhost installer[399]: retrying download of '/gnu/store/8zigz7afvz2rjrvrh7zq1d389qbl2684-dbus-1.12.20' with other substitute URLs...
Dec 10 09:08:48 localhost installer[399]: guix substitute: warning: bordeaux.guix.gnu.org: host not found: Name or service not known
Dec 10 09:08:48 localhost installer[399]: guix substitute: error: failed to find alternative substitute for '/gnu/store/8zigz7afvz2rjrvrh7zq1d389qbl2684-dbus-1.12.20'
Dec 10 09:08:48 localhost installer[399]: substitution of /gnu/store/8zigz7afvz2rjrvrh7zq1d389qbl2684-dbus-1.12.20 failed
Dec 10 09:08:49 localhost installer[399]: retrying download of '/gnu/store/mzfkrxd4w8vqrmyrx169wj8wyw7r8i37-bash' with other substitute URLs...
Dec 10 09:08:49 localhost installer[399]: guix substitute: warning: bordeaux.guix.gnu.org: host not found: Name or service not known
Dec 10 09:08:49 localhost installer[399]: guix substitute: error: failed to find alternative substitute for '/gnu/store/mzfkrxd4w8vqrmyrx169wj8wyw7r8i37-bash'
Dec 10 09:08:49 localhost installer[399]: substitution of /gnu/store/mzfkrxd4w8vqrmyrx169wj8wyw7r8i37-bash failed
Dec 10 09:08:49 localhost installer[399]: guix system: error: corrupt input while restoring archive from #<closed: file 7fa02f84d4d0>
Dec 10 09:08:49 localhost installer[399]: command ("guix" "system" "init" "--fallback" "/mnt/etc/config.scm" "/mnt") exited with value 1 
Dec 10 09:08:58 localhost vmunix: [ 1220.571986] r8169 0000:02:00.0 enp2s0: Link is Up - 1Gbps/Full - flow control off

[...]

Dec 10 09:09:12 localhost shepherd[1]: Service guix-daemon has been stopped. 
Dec 10 09:09:12 localhost shepherd[1]: Service guix-daemon has been started. 
Dec 10 09:09:17 localhost installer[274]: unmounting "/mnt/" 
Dec 10 09:09:17 localhost vmunix: [ 1239.111442] EXT4-fs (sda3): unmounting filesystem.
Dec 10 09:09:19 localhost installer[274]: running form #<newt-form 2499c90> ("Installation menu") with 0 clients 
Dec 10 09:09:22 localhost installer[274]: running step 'final' 
Dec 10 09:09:22 localhost installer[274]: proceeding with final step 
Dec 10 09:09:23 localhost installer[274]: mounting "/dev/sda3" on "/mnt/" 
Dec 10 09:09:23 localhost vmunix: [ 1245.890840] EXT4-fs (sda3): mounted filesystem with ordered data mode. Quota mode: none.
Dec 10 09:09:23 localhost vmunix: [ 1245.893304] Adding 3905532k swap on /dev/sda2.  Priority:-2 extents:1 across:3905532k SSFS
Dec 10 09:09:23 localhost installer[274]: running form #<newt-form 248c440> ("Configuration file") with 0 clients 
Dec 10 09:09:29 localhost installer[437]: install supported locale en_US.utf8. 
Dec 10 09:09:29 localhost shepherd[1]: Service guix-daemon has been stopped. 
Dec 10 09:09:29 localhost shepherd[1]: Service guix-daemon has been started. 
Dec 10 09:09:29 localhost installer[437]: running command ("guix" "system" "init" "--fallback" "/mnt/etc/config.scm" "/mnt") 
Dec 10 09:09:54 localhost installer[437]: 60.8 MB will be downloaded
Dec 10 09:09:54 localhost installer[437]: guix system: error: opening file `/gnu/store/igxf1b1l2b19h7mx2s6r117270dbi6iq-guix-1.4.0rc1.drv': No such file or directory
Dec 10 09:09:54 localhost installer[437]: command ("guix" "system" "init" "--fallback" "/mnt/etc/config.scm" "/mnt") exited with value 1 
Dec 10 09:10:21 localhost shepherd[1]: Service guix-daemon has been stopped. 
Dec 10 09:10:21 localhost shepherd[1]: Service guix-daemon has been started. 
Dec 10 09:10:21 localhost installer[274]: unmounting "/mnt/" 
Dec 10 09:10:21 localhost vmunix: [ 1303.398583] EXT4-fs (sda3): unmounting filesystem.
Dec 10 09:10:28 localhost installer[274]: crashing due to uncaught exception: %exception (#<&user-abort-error>) 
--8<---------------cut here---------------end--------------->8---

It looks like the store is in a broken state, with its database not
matching its actual contents.  The ‘install-system’ procedure is
supposed to protect against that by making a backup of the database
before starting the installation and restoring it afterwards.  (It
apparently worked for me when I interrupted ‘guix system init’ by
hitting C-c.)

I wonder how that failed here.  Mathieu, ideas?

Thanks,
Ludo’.




Information forwarded to bug-guix <at> gnu.org:
bug#59784; Package guix. (Tue, 13 Dec 2022 09:49:01 GMT) Full text and rfc822 format available.

Message #23 received at 59784 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de>
Cc: Mathieu Othacehe <othacehe <at> gnu.org>, 59784 <at> debbugs.gnu.org
Subject: Re: bug#59784: [version 1.4.0rc1] Retrying a failed install fails
Date: Tue, 13 Dec 2022 10:48:43 +0100
Hi again,

Ludovic Courtès <ludo <at> gnu.org> writes:

> It looks like the store is in a broken state, with its database not
> matching its actual contents.  The ‘install-system’ procedure is
> supposed to protect against that by making a backup of the database
> before starting the installation and restoring it afterwards.  (It
> apparently worked for me when I interrupted ‘guix system init’ by
> hitting C-c.)

Actually, look at the excerpt from final.scm:

         ;; Restart guix-daemon so that it does no keep the MNT namespace
         ;; alive.
         (restart-service 'guix-daemon)
         (copy-file saved-database database-file)

We’re restarting the daemon *before* we have restored the database,
which is wrong: depending on how lucky you are, guix-daemon might load
the old database (all this depends on what exactly happens when sqlite
opens the database, but I think there’s a possibility that it will load
or cache a few things and thus fail to see the changes ‘copy-file’
introduces.)

So my guess is that things will be much better if we swap these two
lines.

Florian, it would be great if you could try that and run a new image
generated from branch ‘version-1.4.0’ with these two lines swapped.  To
produce the image, run:

  ./pre-inst-env guix system image -t iso9660 --label=Guix \
    gnu/system/install.scm

Ludo’.




Information forwarded to bug-guix <at> gnu.org:
bug#59784; Package guix. (Tue, 13 Dec 2022 22:23:01 GMT) Full text and rfc822 format available.

Message #26 received at 59784 <at> debbugs.gnu.org (full text, mbox):

From: "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: Mathieu Othacehe <othacehe <at> gnu.org>, 59784 <at> debbugs.gnu.org
Subject: Re: bug#59784: [version 1.4.0rc1] Retrying a failed install fails
Date: Tue, 13 Dec 2022 23:22:22 +0100
Hi again.

Ludovic Courtès <ludo <at> gnu.org> writes:
> So my guess is that things will be much better if we swap these two
> lines.

This was helpful, but not enough.

Swapping them may have improved the likelihood of being able to retry,
but the issue is still there.  I uploaded as installer-dump-5f9f8dbe,
but it is pretty much the same as the previous dump.

Tomorrow, I will try to add an fsync call in between the two lines.

>   ./pre-inst-env guix system image -t iso9660 --label=Guix \
>     gnu/system/install.scm

Additionally, I had to do “GUIX_ALLOW_ME_TO_USE_PRIVATE_COMMIT=y
make update-guix-package”.  Or else the installer was using a Guix that
did not have the lines swapped.

Also, beforehand I did the GPG authorization dance: my x86 machine isn’t
worth putting my actual committer GPG keys on, so I made sure that its
dummy GPG key is in the keyring branch and in the .guix-authorizations
file, and that guix/channels.scm’s default guix channel points to the URL
/home/florian/src/guix and to the commit with the new authorization.
Then I guix pulled, so that building the installer succeeds.  I did
*not* use ./pre-inst-env.

Regards,
Florian




Information forwarded to bug-guix <at> gnu.org:
bug#59784; Package guix. (Tue, 13 Dec 2022 23:17:02 GMT) Full text and rfc822 format available.

Message #29 received at 59784 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de>
Cc: Mathieu Othacehe <othacehe <at> gnu.org>, 59784 <at> debbugs.gnu.org
Subject: Re: bug#59784: [version 1.4.0rc1] Retrying a failed install fails
Date: Wed, 14 Dec 2022 00:16:29 +0100
[Message part 1 (text/plain, inline)]
"pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de> writes:

> Ludovic Courtès <ludo <at> gnu.org> writes:
>> So my guess is that things will be much better if we swap these two
>> lines.
>
> This was helpful, but not enough.

Sorry, I think I wasn’t thinking at full speed.  There needs to be zero
daemons running while we copy the database.  So the real fix is more
like this:

[Message part 2 (text/x-patch, inline)]
diff --git a/gnu/installer/final.scm b/gnu/installer/final.scm
index 044f79372b..9a6bbad122 100644
--- a/gnu/installer/final.scm
+++ b/gnu/installer/final.scm
@@ -213,10 +213,13 @@ (define (assert-exit x)
 
              (set! ret (run-command install-command #:tty? #t)))
            (lambda ()
-             ;; Restart guix-daemon so that it does no keep the MNT namespace
+             ;; Stop guix-daemon so that it does no keep the MNT namespace
              ;; alive.
-             (restart-service 'guix-daemon)
+             (stop-service 'guix-daemon)
+
+             ;; Restore the database and restart it.
              (copy-file saved-database database-file)
+             (start-service 'guix-daemon)
 
              ;; Finally umount the cow-store and exit the container.
              (unmount-cow-store (%installer-target-dir) backing-directory)
[Message part 3 (text/plain, inline)]
>>   ./pre-inst-env guix system image -t iso9660 --label=Guix \
>>     gnu/system/install.scm
>
> Additionally, I had to do “GUIX_ALLOW_ME_TO_USE_PRIVATE_COMMIT=y
> make update-guix-package”.  Or else the installer was using a Guix that
> did not have the lines swapped.

Hmm, this is surprising because we’re already using (current-guix) in
(gnu installer).

> Also, beforehand I did the GPG authorization dance: my x86 machine isn’t
> worth putting my actual committer GPG keys on, so I made sure that its
> dummy GPG key is in the keyring branch and in the .guix-authorizations
> file, and that guix/channels.scm’s default guix channel points to the URL
> /home/florian/src/guix and to the commit with the new authorization.
> Then I guix pulled, so that building the installer succeeds.  I did
> *not* use ./pre-inst-env.

Ah yes, apologies.  You should be able to disable authentication with
this:

[Message part 4 (text/x-patch, inline)]
diff --git a/gnu/packages/package-management.scm b/gnu/packages/package-management.scm
index 5a09b1fcf8..374b187d8c 100644
--- a/gnu/packages/package-management.scm
+++ b/gnu/packages/package-management.scm
@@ -625,6 +625,7 @@ (define-public current-guix-package
                (inherit guix)
                (source source)
                (build-system channel-build-system)
+               (arguments '(#:authenticate? #f))
                (inputs '())
                (native-inputs '())
                (propagated-inputs '())))
[Message part 5 (text/plain, inline)]
Thanks a lot for patiently testing, this is very helpful!

Ludo’.

Information forwarded to bug-guix <at> gnu.org:
bug#59784; Package guix. (Wed, 14 Dec 2022 13:37:02 GMT) Full text and rfc822 format available.

Message #32 received at 59784 <at> debbugs.gnu.org (full text, mbox):

From: "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: Mathieu Othacehe <othacehe <at> gnu.org>, 59784 <at> debbugs.gnu.org
Subject: Re: bug#59784: [version 1.4.0rc1] Retrying a failed install fails
Date: Wed, 14 Dec 2022 14:36:16 +0100
Eventual success, partially.

First of all:

Ludovic Courtès <ludo <at> gnu.org> writes:
> "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de> writes:
>> Additionally, I had to do “GUIX_ALLOW_ME_TO_USE_PRIVATE_COMMIT=y
>> make update-guix-package”.  Or else the installer was using a Guix that
>> did not have the lines swapped.
> Hmm this is surprising because we’re already using (current-guix) in
> (gnu installer).

Apparently no.  If I commit only those two diffs from your mail, with
`./pre-inst-env guix system image -t iso9660 --label=Guix
gnu/system/install.scm`, then

guix gc --references /gnu/store/*-installer-real

prints a Guix package that does not contain any of the changes to
gnu/installer/final.scm.
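
A check along these lines could be scripted roughly as follows.  This is
only a sketch: `find_patched_final_scm` is a hypothetical helper, not part
of Guix, and it assumes that the referenced store items ship the installer
source as a file named final.scm.  The search string would be some text
that only exists in the patched file, such as the new comment from the
patch.

```shell
# Sketch only: print every final.scm under the given directories that
# contains PATTERN.  The helper name and its use on store items are
# assumptions made for illustration.
find_patched_final_scm() {
    pattern="$1"
    shift
    for item in "$@"; do
        # Skip anything that is not a directory (for example .drv files).
        [ -d "$item" ] || continue
        # grep -l prints only the names of the files that match.
        find "$item" -name final.scm -exec grep -l -e "$pattern" {} +
    done
}
```

It might be called as `find_patched_final_scm "Restore the database and
restart it" $(guix gc --references /gnu/store/*-installer-real)`; empty
output would then suggest that none of the referenced items contains the
patched file.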

Nonetheless I used it and ran the installer, with surprising failures
that make me doubt the health of my USB drive: `guix system
init --fallback` did not download substitutes but said the ACL seems to be
uninitialized and fell back to downloading/building the tar.xz
sources.  I pulled the Ethernet plug, resumed the installer to run
`guix system init` again, but this now complains that nss-certs is an
unknown package.  Sending a dump crashed the installer.  On TTY3, `ls
/tmp` tells me '-bash: ls: command not found'.

Another USB drive, another try, the installer again says there's no
ACL and downloads tar.xz, but otherwise behaves as rc2 and sometimes
bugs out when pulling Ethernet; final.scm does not contain the patch.

Is that second diff of yours perhaps really about ACLs?

I do the authorization dance, commit the diff about 'stop-service' and
the update-guix-package, then pull --branch=version-1.4.0.  I can now
resume happily, when pulling the Ethernet and even when pressing
Ctrl-C just for fun.

Except it is necessary to resume twice.  The first resume always fails
and the second resume resumes.  Does it confuse the two databases?

Except after a large number of resumes, not even the second resume
resumes anymore.  I sent an installer-dump-c82c7abf.

I shall try with fsync now.

Regards,
Florian




Information forwarded to bug-guix <at> gnu.org:
bug#59784; Package guix. (Wed, 14 Dec 2022 21:48:01 GMT) Full text and rfc822 format available.

Message #35 received at 59784 <at> debbugs.gnu.org (full text, mbox):

From: "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: Mathieu Othacehe <othacehe <at> gnu.org>, 59784 <at> debbugs.gnu.org
Subject: Re: bug#59784: [version 1.4.0rc1] Retrying a failed install fails
Date: Wed, 14 Dec 2022 22:47:14 +0100
[Message part 1 (text/plain, inline)]
"pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de> writes:
> I shall try with fsync now.

fsyncing the database had no effect.  (In addition to Ludo’s
'stop-service', I had done

[fsync.patch (text/x-patch, inline)]
diff --git a/gnu/installer/final.scm b/gnu/installer/final.scm
index ef487805f0..13deffef85 100644
--- a/gnu/installer/final.scm
+++ b/gnu/installer/final.scm
@@ -217,8 +217,16 @@ (define (assert-exit x)
              ;; alive.
              (stop-service 'guix-daemon)
 
-             ;; Restore the database and restart it.
+             ;; Restore the database.
              (copy-file saved-database database-file)
+
+             ;; Sync it to the filesystem.
+             (let* ((flags O_RDONLY)
+                    (fd (open database-file flags)))
+               (fsync fd)
+               (close fd))
+
+             ;; And restart guix-daemon.
              (start-service 'guix-daemon)
 
              ;; Finally umount the cow-store and exit the container.

[Message part 3 (text/plain, inline)]
The same two problems:

* If I resume a crashed installer, I need to resume twice because the
  first resume always fails immediately.

* With bad luck, it permanently fails; even the second, third, fourth,
  and fifth attempts fail.

This is the same as without the fsync.  Fsync had no effect.  Still I
uploaded installer-dump-194618fa.

Regards,
Florian

Information forwarded to bug-guix <at> gnu.org:
bug#59784; Package guix. (Wed, 14 Dec 2022 23:51:01 GMT) Full text and rfc822 format available.

Message #38 received at 59784 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de>
Cc: Mathieu Othacehe <othacehe <at> gnu.org>, 59784 <at> debbugs.gnu.org
Subject: Re: bug#59784: [version 1.4.0rc1] Retrying a failed install fails
Date: Thu, 15 Dec 2022 00:50:21 +0100
[Message part 1 (text/plain, inline)]
Grrr, I’m really silly: we have the same problem (copying the database
before the daemon has been stopped) just a few lines above.

How about this:

[Message part 2 (text/x-patch, inline)]
diff --git a/gnu/installer/final.scm b/gnu/installer/final.scm
index 044f79372b..360b34d8cb 100644
--- a/gnu/installer/final.scm
+++ b/gnu/installer/final.scm
@@ -1,6 +1,6 @@
 ;;; GNU Guix --- Functional package management for GNU
 ;;; Copyright © 2018, 2020 Mathieu Othacehe <m.othacehe <at> gmail.com>
-;;; Copyright © 2019, 2020 Ludovic Courtès <ludo <at> gnu.org>
+;;; Copyright © 2019, 2020, 2022 Ludovic Courtès <ludo <at> gnu.org>
 ;;;
 ;;; This file is part of GNU Guix.
 ;;;
@@ -196,14 +196,15 @@ (define (assert-exit x)
              ;; the loaded cow-store locale files will prevent umounting.
              (install-locale locale)
 
-             ;; Save the database, so that it can be restored once the
-             ;; cow-store is umounted.
+             ;; Stop the daemon and save the database, so that it can be
+             ;; restored once the cow-store is umounted.
+             (stop-service 'guix-daemon)
              (copy-file database-file saved-database)
+
              (mount-cow-store (%installer-target-dir) backing-directory))
            (lambda ()
              ;; We need to drag the guix-daemon to the container MNT
              ;; namespace, so that it can operate on the cow-store.
-             (stop-service 'guix-daemon)
              (start-service 'guix-daemon (list (number->string (getpid))))
 
              (setvbuf (current-output-port) 'none)
@@ -213,10 +214,13 @@ (define (assert-exit x)
 
              (set! ret (run-command install-command #:tty? #t)))
            (lambda ()
-             ;; Restart guix-daemon so that it does no keep the MNT namespace
+             ;; Stop guix-daemon so that it does no keep the MNT namespace
              ;; alive.
-             (restart-service 'guix-daemon)
+             (stop-service 'guix-daemon)
+
+             ;; Restore the database and restart it.
              (copy-file saved-database database-file)
+             (start-service 'guix-daemon)
 
              ;; Finally umount the cow-store and exit the container.
              (unmount-cow-store (%installer-target-dir) backing-directory)
[Message part 3 (text/plain, inline)]
?

This time, I believe we only ever copy the database when we’re sure no
guix-daemon process is accessing it.
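
As a rough shell analogue of the ordering this patch aims for (stop the
daemon, restore the database, start the daemon again), something like the
following could be used for experimenting outside the installer.  It is
only a sketch: the installer does this in Scheme with `stop-service`,
`copy-file` and `start-service`; the `restore_database` function, the
`SERVICE_CTL` variable and both path arguments are hypothetical names
introduced for illustration, and `herd` is assumed to be the Shepherd
client available on the system.

```shell
# Sketch only: restore a saved database while the daemon is stopped.
# SERVICE_CTL defaults to the Shepherd client; override it for testing.
SERVICE_CTL="${SERVICE_CTL:-herd}"

restore_database() {
    saved_db="$1"   # backup taken before the installation started
    live_db="$2"    # database file the daemon reads

    # Stop the daemon first so that no process has the database open.
    "$SERVICE_CTL" stop guix-daemon || return 1

    # Restore the backup while nothing is running; remember any failure.
    rc=0
    cp -- "$saved_db" "$live_db" || rc=1

    # Start the daemon again even if the copy failed.
    "$SERVICE_CTL" start guix-daemon || rc=1
    return "$rc"
}
```

The point of the ordering is that the copy happens strictly between the
stop and the start, so the daemon never opens a file that is being
replaced underneath it.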

Ludo’.

Information forwarded to bug-guix <at> gnu.org:
bug#59784; Package guix. (Thu, 15 Dec 2022 17:47:01 GMT) Full text and rfc822 format available.

Message #41 received at 59784 <at> debbugs.gnu.org (full text, mbox):

From: "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: Mathieu Othacehe <othacehe <at> gnu.org>, 59784 <at> debbugs.gnu.org
Subject: Re: bug#59784: [version 1.4.0rc1] Retrying a failed install fails
Date: Thu, 15 Dec 2022 18:46:16 +0100
Hi Ludo…

Ludovic Courtès <ludo <at> gnu.org> writes:
> This time, I believe we only ever copy the database when we’re sure no
> guix-daemon process is accessing it.

Failure.  In addition to your partially helpful patch from before
(with which a second resume now works most of the time), I now also
tried the new change:

diff --git a/gnu/installer/final.scm b/gnu/installer/final.scm
index 044f79372b..360b34d8cb 100644
--- a/gnu/installer/final.scm
+++ b/gnu/installer/final.scm
@@ -196,14 +196,15 @@ (define (assert-exit x)
              ;; the loaded cow-store locale files will prevent umounting.
              (install-locale locale)

-             ;; Save the database, so that it can be restored once the
-             ;; cow-store is umounted.
+             ;; Stop the daemon and save the database, so that it can be
+             ;; restored once the cow-store is umounted.
+             (stop-service 'guix-daemon)
              (copy-file database-file saved-database)
+
              (mount-cow-store (%installer-target-dir) backing-directory))
            (lambda ()
              ;; We need to drag the guix-daemon to the container MNT
              ;; namespace, so that it can operate on the cow-store.
-             (stop-service 'guix-daemon)
              (start-service 'guix-daemon (list (number->string (getpid))))

              (setvbuf (current-output-port) 'none)


No additional effect. :(  Perhaps at that time, the guix-daemon isn’t
doing anything anyway (though the addition makes sense in general and
may help some users).  There are the same two problems, needing to
resume twice each time and eventually not being able to resume at all
(perhaps some multi-core issue?).  I sent installer-dump-89be04d5.

I tried interrupting the Ethernet on the same machine but with an
installed 1.4.0rc2 Guix System during `guix system reconfigure`.
This has no issues…  There must be corruption in the installer.

Regards,
Florian




Information forwarded to bug-guix <at> gnu.org:
bug#59784; Package guix. (Thu, 15 Dec 2022 20:45:02 GMT) Full text and rfc822 format available.

Message #44 received at 59784 <at> debbugs.gnu.org (full text, mbox):

From: "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: Mathieu Othacehe <othacehe <at> gnu.org>, 59784 <at> debbugs.gnu.org
Subject: Re: bug#59784: [version 1.4.0rc1] Retrying a failed install fails
Date: Thu, 15 Dec 2022 21:44:37 +0100
[Message part 1 (text/plain, inline)]
Desperately I tried also adding fsync, to no avail, both issues remain.
Non-working patch attached.

Maybe dynamic-wind is an inappropriate pattern here?

If I interrupt installation using Ctrl-C (which I normally don’t,
instead I unplug Ethernet), then I have to press Ctrl-C twice.  Maybe
that could be related to why I need to resume twice?

I’m in the dark.

Regards,
Florian

[fsync-to-no-avail.patch (text/x-patch, attachment)]

Information forwarded to bug-guix <at> gnu.org:
bug#59784; Package guix. (Fri, 16 Dec 2022 13:56:02 GMT) Full text and rfc822 format available.

Message #47 received at 59784 <at> debbugs.gnu.org (full text, mbox):

From: Maxime Devos <maximedevos <at> telenet.be>
To: "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de>,
 Ludovic Courtès <ludo <at> gnu.org>
Cc: Mathieu Othacehe <othacehe <at> gnu.org>, 59784 <at> debbugs.gnu.org
Subject: Re: bug#59784: [version 1.4.0rc1] Retrying a failed install fails
Date: Fri, 16 Dec 2022 14:55:30 +0100
[Message part 1 (text/plain, inline)]

On 14-12-2022 22:47, pelzflorian (Florian Pelz) wrote:
> fsyncing the database had no effect.  (In addition to Ludo’s
> 'stop-service', I had done
> 
> 
> fsync.patch
> 
> diff --git a/gnu/installer/final.scm b/gnu/installer/final.scm
> index ef487805f0..13deffef85 100644
> --- a/gnu/installer/final.scm
> +++ b/gnu/installer/final.scm
> @@ -217,8 +217,16 @@ (define (assert-exit x)
>                ;; alive.
>                (stop-service 'guix-daemon)
>   
> -             ;; Restore the database and restart it.
> +             ;; Restore the database.
>                (copy-file saved-database database-file)
> +
> +             ;; Sync it to the filesystem.
> +             (let* ((flags O_RDONLY)
> +                    (fd (open database-file flags)))
> +               (fsync fd)
> +               (close fd))
> +

So, I'm nominally 'on hiatus', but I noticed this mail, and noticed you 
copied a file (and fsync'ed it), but forgot to fsync the directory it 
was copied to -- from what I've read (but I don't recall the source), 
fsyncing the contents of the file isn't enough, you also need to fsync 
the directory such that the new file entry is in the directory after 
crashing.

Greetings,
Maxime.
[OpenPGP_0x49E3EE22191725EE.asc (application/pgp-keys, attachment)]
[OpenPGP_signature (application/pgp-signature, attachment)]
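
A minimal sketch of the point Maxime raises above, written in Python
rather than the installer's Guile so that it is self-contained.  It
assumes a Linux system, where a directory can be opened read-only and
passed to fsync.  The helper name copy_durably and the file names are
hypothetical; none of this is taken from gnu/installer/final.scm.

```python
import os
import shutil
import tempfile


def copy_durably(src, dst):
    """Copy src to dst, then fsync both the file and its parent directory."""
    shutil.copyfile(src, dst)

    # Flush the file's contents and metadata to stable storage.
    fd = os.open(dst, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)

    # Flush the directory too, so that the new directory entry itself
    # survives a crash (the step the quoted patch left out).
    dir_fd = os.open(os.path.dirname(os.path.abspath(dst)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        saved = os.path.join(tmp, "saved.sqlite")   # hypothetical names
        target = os.path.join(tmp, "db.sqlite")
        with open(saved, "wb") as f:
            f.write(b"snapshot")
        copy_durably(saved, target)
        print(os.listdir(tmp))
```

The directory fsync only matters for surviving a crash or power loss; it
does not change what a running process sees.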

Information forwarded to bug-guix <at> gnu.org:
bug#59784; Package guix. (Fri, 16 Dec 2022 16:58:02 GMT)

Message #50 received at 59784 <at> debbugs.gnu.org:

From: Ludovic Courtès <ludo <at> gnu.org>
To: "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de>
Cc: Mathieu Othacehe <othacehe <at> gnu.org>, 59784 <at> debbugs.gnu.org
Subject: Re: bug#59784: [version 1.4.0rc1] Retrying a failed install fails
Date: Fri, 16 Dec 2022 17:57:12 +0100
Hi,

"pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de> skribis:

> Desperately I tried also adding fsync, to no avail, both issues remain.
> Non-working patch attached.
>
> Maybe dynamic-wind is an inappropriate pattern here?
>
> If I interrupt installation using Ctrl-C (which I normally don’t,
> instead I unplug Ethernet), then I have to press Ctrl-C twice.  Maybe
> that could be related to why I need to resume twice?

One finding: when hitting C-c, the dynamic-wind exit handler (the one
that restores the database and umounts the cow store) is *not* executed.

This is because ‘call-with-mnt-container’ sets a SIGINT handler that
terminates that process with SIGKILL (I’m not entirely sure of the
rationale, but said process cannot handle signals in Scheme while it’s
in ‘waitpid’, called from ‘run-command’).

I did reproduce the issue in a VM by running “ifconfig ens3 down” in a
tty, or by killing the ‘guix substitute’ process, to cause failure of
‘guix system init’.  In that case the database is indeed restored, but I
occasionally get errors like “/gnu/store/….drv: No such file or
directory”.

Ludo’.
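
The finding above (an exit handler that is skipped when the process is
terminated with SIGKILL) can be illustrated with a small sketch.  It is
written in Python, using try/finally as a stand-in for Guile's
dynamic-wind, and assumes a POSIX system.  The child script, the marker
file and the cleanup_ran helper are hypothetical and are not the
installer's code.

```python
import os
import signal
import subprocess
import sys
import tempfile

# Child process: a "finally" block plays the role of the exit handler.
CHILD = """
import os, signal, sys
marker, how = sys.argv[1], sys.argv[2]
try:
    if how == "sigkill":
        # SIGKILL cannot be caught, so the process dies here and the
        # stack is never unwound.
        os.kill(os.getpid(), signal.SIGKILL)
    raise RuntimeError("simulated command failure")
finally:
    open(marker, "w").close()   # stands in for "restore the database"
"""


def cleanup_ran(how):
    """Run the child and report whether its cleanup block executed."""
    with tempfile.TemporaryDirectory() as tmp:
        marker = os.path.join(tmp, "cleanup-ran")
        subprocess.run([sys.executable, "-c", CHILD, marker, how],
                       stderr=subprocess.DEVNULL)
        return os.path.exists(marker)


if __name__ == "__main__":
    print("ordinary failure:", cleanup_ran("error"))
    print("killed by SIGKILL:", cleanup_ran("sigkill"))
```

An ordinary failure unwinds through the cleanup block, while a process
killed with SIGKILL never reaches it.  That is consistent with the
database being restored after a failed command but not after C-c.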




Information forwarded to bug-guix <at> gnu.org:
bug#59784; Package guix. (Fri, 16 Dec 2022 20:18:02 GMT)

Message #53 received at 59784 <at> debbugs.gnu.org:

From: "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de>
To: Maxime Devos <maximedevos <at> telenet.be>
Cc: Mathieu Othacehe <othacehe <at> gnu.org>,
 Ludovic Courtès <ludo <at> gnu.org>, 59784 <at> debbugs.gnu.org
Subject: Re: bug#59784: [version 1.4.0rc1] Retrying a failed install fails
Date: Fri, 16 Dec 2022 21:17:42 +0100
Maxime Devos <maximedevos <at> telenet.be> writes:
> So, I'm nominally 'on hiatus', but I noticed this mail, and noticed
> you copied a file (and fsync'ed it), but forgot to fsync the directory
> it was copied to -- from what I've read (but I don't recall the
> source), fsyncing the contents of the file isn't enough, you also need
> to fsync the directory such that the new file entry is in the
> directory after crashing.

Ohh indeed!  The Linux manpage on fsync confirms it.  That invalidates
my fsync testing, which in any case was on a code path that, as Ludo
found out, did not even run.  But I will remember to fsync the directory
in the future.

Thank you very much Maxime!

Regards,
Florian




Information forwarded to bug-guix <at> gnu.org:
bug#59784; Package guix. (Fri, 16 Dec 2022 20:29:02 GMT)

Message #56 received at 59784 <at> debbugs.gnu.org:

From: "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: Mathieu Othacehe <othacehe <at> gnu.org>, 59784 <at> debbugs.gnu.org
Subject: Re: bug#59784: [version 1.4.0rc1] Retrying a failed install fails
Date: Fri, 16 Dec 2022 21:28:29 +0100
Ludovic Courtès <ludo <at> gnu.org> writes:
> One finding: when hitting C-c, the dynamic-wind exit handler (the one
> that restores the database and umounts the cow store) is *not* executed.

Impressive findings.

Now that you found the dynamic-wind’s out-guard does not even run: Uhh, I
had misdiagnosed things when I thought your 'stop-service' patch had made
a difference and caused a second resume to work.  A second resume was
already possible on rc2.  Eventually, though, resume stops working, and
on some install attempts with rc2 it stops working right away.

After seeing that you opened a bug#60116 on setsid(), I tested removing
the setsid call and it had no effect, but if the dynamic-wind’s
out-guard does not even run, that is to be expected.


> I did reproduce the issue in a VM by running “ifconfig ens3 down” in a
> tty, or by killing the ‘guix substitute’ process, to cause failure of
> ‘guix system init’.  In that case the database is indeed restored, but I
> occasionally get errors like “/gnu/store/….drv: No such file or
> directory”.

Yes, this is the error message that I get on failing resumes.

Regards,
Florian




Information forwarded to bug-guix <at> gnu.org:
bug#59784; Package guix. (Sat, 17 Dec 2022 11:02:01 GMT)

Message #59 received at 59784 <at> debbugs.gnu.org:

From: Ludovic Courtès <ludo <at> gnu.org>
To: "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de>
Cc: Mathieu Othacehe <othacehe <at> gnu.org>, 59784 <at> debbugs.gnu.org
Subject: Re: bug#59784: [version 1.4.0rc1] Retrying a failed install fails
Date: Sat, 17 Dec 2022 12:01:36 +0100
Moin!

"pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de> skribis:

> Ludovic Courtès <ludo <at> gnu.org> writes:
>> One finding: when hitting C-c, the dynamic-wind exit handler (the one
>> that restores the database and umounts the cow store) is *not* executed.
>
> Impressive findings.
>
> Now that you found the dynamic-wind’s out-guard does not even run:

It does not run on C-c, but it does run in other cases, typically if you
just press Enter after reading the message that says “command failed,
press Enter”.

I don’t see how to address the C-c issue so we’ll have to live with it.

Longer-term we may have to find a different strategy than the
‘call-with-mnt-container’ trick, but that’s difficult.

> After seeing that you opened a bug#60116 on setsid(), I tested removing
> the setsid call and it had no effect, but if the dynamic-wind’s
> out-guard does not even run, that is to be expected.

Right; #60116 is related, and it’s not great but it’s not critical.

Ludo’.




Information forwarded to bug-guix <at> gnu.org:
bug#59784; Package guix. (Sat, 17 Dec 2022 16:16:02 GMT)

Message #62 received at 59784 <at> debbugs.gnu.org:

From: Ludovic Courtès <ludo <at> gnu.org>
To: "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de>
Cc: Mathieu Othacehe <othacehe <at> gnu.org>, 59784 <at> debbugs.gnu.org
Subject: Re: bug#59784: [version 1.4.0rc1] Retrying a failed install fails
Date: Sat, 17 Dec 2022 17:15:21 +0100
Ludovic Courtès <ludo <at> gnu.org> skribis:

> I did reproduce the issue in a VM by running “ifconfig ens3 down” in a
> tty, or by killing the ‘guix substitute’ process, to cause failure of
> ‘guix system init’.  In that case the database is indeed restored, but I
> occasionally get errors like “/gnu/store/….drv: No such file or
> directory”.

The error message that’s haunting us:

  opening file `/gnu/store/….drv': No such file or directory

comes from guix-daemon.  It happens while the client is doing an
‘add-text-to-store’ RPC to add that .drv to the store.
‘LocalStore::addTextToStore’ supposedly creates the .drv file in
/gnu/store and then reads it back (‘registerValidPath’ -> ‘addValidPath’
-> ‘readDerivation’ -> ‘readFile’): this is where it gets ENOENT.

It would suggest that the database is consistent, but that somehow
writes don’t go through the overlay FS.

More investigation is needed, but we may have to live with this bug in
1.4.0.

Ludo’.




Information forwarded to bug-guix <at> gnu.org:
bug#59784; Package guix. (Sat, 17 Dec 2022 19:28:01 GMT)

Message #65 received at 59784 <at> debbugs.gnu.org:

From: "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: Mathieu Othacehe <othacehe <at> gnu.org>, 59784 <at> debbugs.gnu.org
Subject: Re: bug#59784: [version 1.4.0rc1] Retrying a failed install fails
Date: Sat, 17 Dec 2022 20:27:43 +0100
Ludovic Courtès <ludo <at> gnu.org> writes:
> The error message that’s haunting us:
>
>   opening file `/gnu/store/….drv': No such file or directory
>
> comes from guix-daemon.  It happens while the client is doing an
> ‘add-text-to-store’ RPC to add that .drv to the store.
> ‘LocalStore::addTextToStore’ supposedly creates the .drv file in
> /gnu/store and then reads it back (‘registerValidPath’ -> ‘addValidPath’
> -> ‘readDerivation’ -> ‘readFile’): this is where it gets ENOENT.
>
> It would suggest that the database is consistent, but that somehow
> writes don’t go through the overlay FS.

Most interesting.

I saw a comment
> void LocalStore::registerValidPaths(const ValidPathInfos & infos)
> {
>     /* SQLite will fsync by default, but the new valid paths may not be fsync-ed.
>      * So some may want to fsync them before registering the validity, at the
>      * expense of some speed of the path registering operation. */
>     if (settings.syncBeforeRegistering) sync();

In vain, I therefore tried

[sync-before-registering.patch (text/x-patch, inline)]
diff --git a/nix/libstore/globals.cc b/nix/libstore/globals.cc
index d4f9a46a74..5f8a3a3031 100644
--- a/nix/libstore/globals.cc
+++ b/nix/libstore/globals.cc
@@ -40,7 +40,7 @@ Settings::Settings()
     reservedSize = 8 * 1024 * 1024;
     fsyncMetadata = true;
     useSQLiteWAL = true;
-    syncBeforeRegistering = false;
+    syncBeforeRegistering = true;
     useSubstitutes = true;
     useChroot = false;
     impersonateLinux26 = false;
But it changes nothing.

Regards,
Florian

Information forwarded to bug-guix <at> gnu.org:
bug#59784; Package guix. (Sat, 17 Dec 2022 19:37:02 GMT)

Message #68 received at 59784 <at> debbugs.gnu.org:

From: "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: Mathieu Othacehe <othacehe <at> gnu.org>, 59784 <at> debbugs.gnu.org
Subject: Re: bug#59784: [version 1.4.0rc1] Retrying a failed install fails
Date: Sat, 17 Dec 2022 20:36:11 +0100
Ahoi. :)

Ludovic Courtès <ludo <at> gnu.org> writes:
>> Now that you found the dynamic-wind’s out-guard does not even run:
> It does not run on C-c, but it does run in other cases, typically if you
> just press Enter after reading the message that says “command failed,
> press Enter”.

Ahh.  Then would it be good if you at least pushed the partial fix that
replaces 'restart' with 'stop-service'?  I’m unsure now whether it affects
the likelihood that a second resume works.  But maybe it does, and it is
closer to correct.


> I don’t see how to address the C-c issue so we’ll have to live with it.

Yes.  Thank you for all investigations!

Regards,
Florian




Information forwarded to bug-guix <at> gnu.org:
bug#59784; Package guix. (Sat, 17 Dec 2022 21:31:01 GMT)

Message #71 received at 59784 <at> debbugs.gnu.org:

From: Ludovic Courtès <ludo <at> gnu.org>
To: "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de>
Cc: Mathieu Othacehe <othacehe <at> gnu.org>, 59784 <at> debbugs.gnu.org
Subject: Re: bug#59784: [version 1.4.0rc1] Retrying a failed install fails
Date: Sat, 17 Dec 2022 22:30:34 +0100
"pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de> skribis:

> I saw a comment
>> void LocalStore::registerValidPaths(const ValidPathInfos & infos)
>> {
>>     /* SQLite will fsync by default, but the new valid paths may not be fsync-ed.
>>      * So some may want to fsync them before registering the validity, at the
>>      * expense of some speed of the path registering operation. */
>>     if (settings.syncBeforeRegistering) sync();
>
> In vain, I therefore tried

Yeah, I don’t think this has much to do with syncing data on disk.  It’s
an inconsistency between the store database and the actual store.

Ludo’.




Information forwarded to bug-guix <at> gnu.org:
bug#59784; Package guix. (Sun, 18 Dec 2022 00:24:01 GMT)

Message #74 received at 59784 <at> debbugs.gnu.org:

From: Ludovic Courtès <ludo <at> gnu.org>
To: "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de>
Cc: Mathieu Othacehe <othacehe <at> gnu.org>, 59784 <at> debbugs.gnu.org
Subject: Re: bug#59784: [version 1.4.0rc1] Retrying a failed install fails
Date: Sun, 18 Dec 2022 01:23:18 +0100
After spending a few more hours on this, I became convinced that upon
restarting guix-daemon, even though we had restored
/var/guix/db/db.sqlite, the presence of stale db.sqlite-{wal,shm} files
could lead SQLite to behave as if the transactions in the WAL file had
been committed.

Commit 495c50008be91429ebea3805e161a1e385a2a572 deletes these two
files, and it appears to solve the problem for me.

I also pushed the patch previously shared in this thread, to make sure
db.sqlite is only copied when guix-daemon is stopped.

So we have this:

  495c50008b installer: final: Delete SQLite WAL and shm files upon completion.
  9b6703eabe installer: final: Stop guix-daemon before accessing store database.

I’ll go ahead and prepare for the release as planned, to be published on Monday.

Ludo’.
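
The mechanism described above can be reproduced in isolation with the
following sketch.  It uses Python's standard sqlite3 module rather than
guix-daemon, and the table, file names and helpers (read_rows, restore,
demo) are hypothetical.  It only shows how SQLite treats a leftover WAL
file next to a restored database; it does not claim to be what commit
495c50008b does internally.

```python
import os
import shutil
import sqlite3
import tempfile


def read_rows(db_path):
    """Open the database afresh, as a restarted daemon would, and list t."""
    conn = sqlite3.connect(db_path)
    try:
        return [r[0] for r in conn.execute("SELECT x FROM t ORDER BY rowid")]
    finally:
        conn.close()


def restore(saved_db, db_path, delete_sidecars):
    """Copy the snapshot back; optionally drop stale -wal and -shm files."""
    shutil.copyfile(saved_db, db_path)
    if delete_sidecars:
        for suffix in ("-wal", "-shm"):
            try:
                os.remove(db_path + suffix)
            except FileNotFoundError:
                pass


def demo():
    with tempfile.TemporaryDirectory() as top:
        live = os.path.join(top, "db.sqlite")
        saved = os.path.join(top, "saved.sqlite")
        conn = sqlite3.connect(live, isolation_level=None)  # autocommit
        try:
            conn.execute("PRAGMA journal_mode=WAL").fetchall()
            conn.execute("CREATE TABLE t (x TEXT)")
            conn.execute("INSERT INTO t VALUES ('before')")
            # Fold the WAL into the main file so the snapshot is complete.
            conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchall()
            shutil.copyfile(live, saved)

            # This commit lives only in db.sqlite-wal until a checkpoint.
            conn.execute("INSERT INTO t VALUES ('after')")

            results = []
            for delete_sidecars in (False, True):
                crash_dir = os.path.join(top, "crash-%s" % delete_sidecars)
                os.mkdir(crash_dir)
                db = os.path.join(crash_dir, "db.sqlite")
                # Files as a killed process would leave them: no clean
                # close, hence a WAL that is still populated.
                shutil.copyfile(live, db)
                shutil.copyfile(live + "-wal", db + "-wal")
                restore(saved, db, delete_sidecars)
                results.append(read_rows(db))
            return results
        finally:
            conn.close()


if __name__ == "__main__":
    with_stale_wal, without_sidecars = demo()
    print("restored, stale WAL kept:   ", with_stale_wal)
    print("restored, sidecars deleted: ", without_sidecars)
```

With the stale WAL left in place, the row written after the snapshot
reappears even though the snapshot was copied back.  With the sidecar
files deleted, only the snapshot's contents are visible.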




Reply sent to "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de>:
You have taken responsibility. (Sun, 18 Dec 2022 16:42:02 GMT)

Notification sent to "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de>:
bug acknowledged by developer. (Sun, 18 Dec 2022 16:42:02 GMT)

Message #79 received at 59784-done <at> debbugs.gnu.org:

From: "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: Mathieu Othacehe <othacehe <at> gnu.org>, 59784-done <at> debbugs.gnu.org
Subject: Re: bug#59784: [version 1.4.0rc1] Retrying a failed install fails
Date: Sun, 18 Dec 2022 17:41:33 +0100
"pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de> writes:
> * If I resume a crashed installer, I need to resume twice because the
>   first resume always fails immediately.

Hooray, you fixed it.  Ludo, your debugging speed is miraculous.  I did
not know SQLite uses multiple files per database.


> * With bad luck, it permanently fails, even a second, third, fourth,
>   fifth time fail.

It can still permanently fail to resume, e.g. sometimes when pressing
Ctrl-C during download of a substitute, it will continue to say nss-certs
is an unknown package, but that may be too rare to happen by chance and
is not what this bug was about.

Closing!

Regards,
Florian




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 16 Jan 2023 12:24:07 GMT)
