Package: guix;
Reported by: Markus Nilsson <markusnilsson890 <at> gmail.com>
Date: Tue, 12 Jul 2022 15:28:01 UTC
Severity: normal
To reply to this bug, email your comments to 56519 AT debbugs.gnu.org.
Toggle the display of automated, internal messages from the tracker.
View this report as an mbox folder, status mbox, maintainer mbox
bug-guix <at> gnu.org
:bug#56519
; Package guix
.
(Tue, 12 Jul 2022 15:28:02 GMT) Full text and rfc822 format available.Markus Nilsson <markusnilsson890 <at> gmail.com>
:bug-guix <at> gnu.org
.
(Tue, 12 Jul 2022 15:28:02 GMT) Full text and rfc822 format available.Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
From: Markus Nilsson <markusnilsson890 <at> gmail.com> To: bug-guix <at> gnu.org Subject: Shepherd non-deterministically fails to load the "guix-daemon" service after the "user-processes" service has been started Date: Tue, 12 Jul 2022 13:07:00 +0000
[Message part 1 (text/plain, inline)]
TITLE Shepherd non-deterministically fails to load the "guix-daemon" service after the "user-processes" service has been started EXPECTED BEHAVIOUR Shepherd will always start the "guix-daemon" service after the "user-processes" service has been started. OBSERVED BEHAVIOUR Shepherd will sometimes start the "guix-daemon" service after the "user-processes" service has been started and sometimes it won't. HARDWARE ENVIRONMENT Motherboard: Asus M5A88-M EVO CPU: AMD Athlon II X4 640 Processor SAS card: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 02) Hard drive: 9x assorted spinning rust sata hard drives SOFTWARE ENVIRONMENT See details below STEPS TO REPRODUCE Create the software environment detailed below on an equivalent hardware environment to the one listed above DETAILS I have a media server running NFS to host videos to computers on my home network. The media server has 9 hard drives in it. 5 hard drives are connected to the motherboard's sata ports and another four are connected to a HBA SAS controller card in IT mode sitting in the motherboard's pci-e slot. The hard-drives are in an LVM configuration like below: mark <at> streetkid ~$ lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS sda 8:0 0 465.8G 0 disk ├─sda1 8:1 0 2M 0 part ├─sda2 8:2 0 100G 0 part /gnu/store │ / └─sda3 8:3 0 365.8G 0 part └─streetkid_VG_storage-streetkid_LV_home 253:0 0 4.3T 0 lvm /home sdb 8:16 0 931.5G 0 disk └─sdb1 8:17 0 931.5G 0 part └─streetkid_VG_storage-streetkid_LV_home 253:0 0 4.3T 0 lvm /home sdc 8:32 0 149G 0 disk └─sdc1 8:33 0 149G 0 part └─streetkid_VG_storage-streetkid_LV_home 253:0 0 4.3T 0 lvm /home sdd 8:48 0 465.8G 0 disk └─sdd1 8:49 0 465.8G 0 part └─streetkid_VG_storage-streetkid_LV_home 253:0 0 4.3T 0 lvm /home sde 8:64 0 465.8G 0 disk └─sde1 8:65 0 465.8G 0 part └─streetkid_VG_storage-streetkid_LV_home 253:0 0 4.3T 0 lvm /home sdf 8:80 0 298.1G 0 disk └─sdf1 8:81 0 298.1G 0 part └─streetkid_VG_storage-streetkid_LV_home 253:0 0 4.3T 0 lvm /home sdg 8:96 0 931.5G 0 disk └─sdg1 8:97 0 931.5G 0 part └─streetkid_VG_storage-streetkid_LV_home 253:0 0 4.3T 0 lvm /home sdh 8:112 0 465.8G 0 disk └─sdh1 8:113 0 465.8G 0 part └─streetkid_VG_storage-streetkid_LV_home 253:0 0 4.3T 0 lvm /home sdi 8:128 0 298.1G 0 disk └─sdi1 8:129 0 298.1G 0 part └─streetkid_VG_storage-streetkid_LV_home 253:0 0 4.3T 0 lvm /home Recently I noticed that running guix commands would sometimes fail with the following error: mark <at> streetkid ~$ guix package -m ~/.config/guix/manifest.scm guix package: error: failed to connect to `/var/guix/daemon-socket/socket': Connection refused The NFS service would be working fine but guix would refuse to work. Not knowing what to do to fix guix I tried rolling back to previous system generations using the GRUB menu on bootup. Sometimes this would result in guix working but then on the next bootup - using the same system generation - guix would fail again. I found it was easy to work around the issue by manually getting shepherd to start the guix-daemon: mark <at> streetkid ~$ sudo herd start guix-daemon Service guix-daemon has been started. I looked through the kernel log and found that the harddrives connected to the SAS controller were taking a long while to be brought up which was blocking services that depend on the "device-mapping" service from being started. The following snippet from /var/log/messages.1.gz illustrates this: Jul 1 21:36:58 localhost vmunix: [ 18.538151] shepherd[1]: Service device-mapping-streetkid_VG_storage-streetkid_LV_home could not be started. Jul 1 21:36:58 localhost vmunix: [ 18.540000] shepherd[1]: Service file-system-/home depends on device-mapping-streetkid_VG_storage-streetkid_LV_home. Jul 1 21:36:58 localhost vmunix: [ 18.541809] shepherd[1]: Service file-system-/home could not be started. Jul 1 21:36:58 localhost vmunix: [ 18.543412] shepherd[1]: Service file-systems depends on file-system-/home. Jul 1 21:36:58 localhost vmunix: [ 18.545276] shepherd[1]: Service file-systems could not be started. Jul 1 21:36:58 localhost vmunix: [ 18.546921] shepherd[1]: Service user-processes depends on file-systems. Jul 1 21:36:58 localhost vmunix: [ 18.548774] shepherd[1]: Service user-processes could not be started. Jul 1 21:36:58 localhost vmunix: [ 18.550432] shepherd[1]: Service guix-daemon depends on user-processes. Jul 1 21:36:58 localhost vmunix: [ 18.552355] shepherd[1]: Service guix-daemon could not be started. Looking at the logs shows that during bootup the Shepherd will make a few attempts at starting services while waiting for the hard drives connected to the SAS card to become available. The "device-mapping" service needs the SAS card hard drives to be up and will fail when they are not. When they are not up the services that depend on the "device-mapping" service will fail. After a while the hard drives come online and then the "device-mapping" service starts, which then causes the services that depend on it to start up as well. I then noticed something very strange. On some boot ups the Shepherd would try and fail a few times to start services that depend on "device-mapping" and then when the SAS hard drives were up it would successfully load these dependent services including the "guix-daemon" service. On other boot ups it would try and fail a few times to start services that depend on "device-mapping" and then when the SAS hard drives were up it would successfully load these dependent services BUT WOULD NOT LOAD the "guix-daemon" service. To see this non-deterministic bug in the logs open the attached file "messages.1". Go to line 21664 which is the start of the boot process starting Jul 1 09:02:35. Search for the next string "Service user-processes" and you can see this service failing to start on line 22558 and again on line 22581. The "user-processes" service then succeeds on line 22657 and on the next line (22658) you can see that the "guix-daemon" starts. In summary: 21664 - System starts booting 22558 - "user-processes" service fails to start 22581 - "user-processes" service fails to start again 22657 - "user-processes" service starts successfully 22658 - "guix-daemon" service starts successfully 22837 - System shuts down On the next boot "guix-daemon" fails to start after "user-processes" has started: 22838 - System starts booting 23728 - "user-processes" service fails to start 23750 - "user-processes" service fails to start again 23762 - "user-processes" service fails to start a third time 23764 - "guix-daemon" service fails to start 23840 - "user-processes" service starts successfully 23938 - System shuts down On the next boot "guix-daemon" again fails to start after "user-processes" has started: 23939 - System starts booting 24826 - "user-processes" service fails to start 24853 - "user-processes" service fails to start again 24865 - "user-processes" service fails to start a third time 24867 - "guix-daemon" service fails to start 24941 - "user-processes" service starts successfully 25053 - System shuts down On the next boot "guix-daemon" successfully starts after "user-processes" has started: 25054 - System starts booting 25952 - "user-processes" service fails to start 25975 - "user-processes" service fails to start again 26064 - "user-processes" service starts successfully 26066 - "guix-daemon" service starts successfully 26169 - System shuts down Note that these four boots are using the same system generation so the software is the same on all boots. The generation info is shown below: Generation 27 Jun 29 2022 22:50:13 file name: /var/guix/profiles/system-27-link canonical file name: /gnu/store/xrn0lz22gnhhq9ywk30q2nmim3fgx5sd-system label: GNU with Linux 5.17.15 bootloader: grub root device: UUID: f75c515d-0ef7-4af1-950e-5486408d2451 kernel: /gnu/store/xbcwcv2a024n0vmncvhgllmz52d4rprj-linux-5.17.15/bzImage channels: nonguix: repository URL: https://gitlab.com/nonguix/nonguix branch: master commit: 1a122e06fe046caebf39395edc797515861acd3b guix: repository URL: https://git.savannah.gnu.org/git/guix.git branch: master commit: 0fe0c739f53bbb635eb5c4a3b172f4be45293c82 configuration file: /gnu/store/sniryjir3bhlswyarmdmrwdw8zqh8zgp-configuration.scm The config file for this generation is in the attachment "system-config.scm". I could find two clues which might be relevant: 1) It seems that on the occasions that the "guix-daemon" successfully starts after "user-processes" has started the Shepherd hasn't tried and then failed to start it earlier. Conversely, on the occasions when the "guix-daemon" service fails to start after "user-processes" has started the shepherd will have tried and failed to start the "guix-daemon" service. Does the "guix-daemon" service definition state that it should not attempt to start if it has failed to start before? 2) The earlier boots in the kernel log seem to not have the problem of the SAS hard drives not coming up in time to cause services that depend on them from failing to start. This might mean that the SAS card is starting to fail. On the other hand once the hard drives are up the system is stable and the NFS service works fine. A screenshot of the lvm error messages from a recent boot up showing the hard drives failing to be detected is shown in the "lvm-error-messages.jpg" attachment.
[Message part 2 (text/html, inline)]
[messages.1 (application/octet-stream, attachment)]
[system-config.scm (text/x-scheme, attachment)]
[lvm_error_messages.jpg (image/jpeg, attachment)]
bug-guix <at> gnu.org
:bug#56519
; Package guix
.
(Mon, 18 Jul 2022 12:55:02 GMT) Full text and rfc822 format available.Message #8 received at 56519 <at> debbugs.gnu.org (full text, mbox):
From: Ludovic Courtès <ludo <at> gnu.org> To: Markus Nilsson <markusnilsson890 <at> gmail.com> Cc: 56519 <at> debbugs.gnu.org Subject: Re: bug#56519: Shepherd non-deterministically fails to load the "guix-daemon" service after the "user-processes" service has been started Date: Mon, 18 Jul 2022 14:54:01 +0200
Hi Markus, From the /var/log/messages excerpt you sent, it would seem that the root cause is failing to device mapping for the “streetkid” storage: Markus Nilsson <markusnilsson890 <at> gmail.com> skribis: > Jul 1 21:39:35 localhost vmunix: [ 16.150396] sd 6:0:1:0: Attached scsi generic sg6 type 0 > Jul 1 21:39:35 localhost vmunix: [ 16.157673] sd 6:0:0:0: [sdf] 625142448 512-byte logical blocks: (320 GB/298 GiB) > Jul 1 21:39:35 localhost vmunix: [ 16.157689] sd 6:0:0:0: [sdf] 4096-byte physical blocks > Jul 1 21:39:35 localhost vmunix: [ 16.195016] shepherd[1]: Service device-mapping-streetkid_VG_storage-streetkid_LV_home could not be started. > Jul 1 21:39:35 localhost vmunix: [ 16.197048] shepherd[1]: Service file-system-/home depends on device-mapping-streetkid_VG_storage-streetkid_LV_home. > Jul 1 21:39:35 localhost vmunix: [ 16.199307] shepherd[1]: Service file-system-/home could not be started. > Jul 1 21:39:35 localhost vmunix: [ 16.203577] shepherd[1]: Service file-system-/dev/pts has been started. > Jul 1 21:39:35 localhost vmunix: [ 16.206789] shepherd[1]: Service file-system-/sys/kernel/debug has been started. > Jul 1 21:39:35 localhost vmunix: [ 16.209950] shepherd[1]: Service file-system-/dev/shm has been started. > Jul 1 21:39:35 localhost vmunix: [ 16.212595] shepherd[1]: Service file-system-/sys/firmware/efi/efivars has been started. > Jul 1 21:39:35 localhost vmunix: [ 16.215719] shepherd[1]: Service file-system-/gnu/store has been started. [...] > Jul 1 21:39:35 localhost vmunix: [ 17.481611] shepherd[1]: Service nscd could not be started. > Jul 1 21:39:35 localhost vmunix: [ 17.499327] ppdev: user-space parallel port driver > Jul 1 21:39:35 localhost vmunix: [ 17.691044] sdi: sdi1 > Jul 1 21:39:35 localhost vmunix: [ 17.949918] sd 6:0:3:0: [sdi] Attached SCSI disk > Jul 1 21:39:35 localhost vmunix: [ 17.956933] shepherd[1]: Service device-mapping-streetkid_VG_storage-streetkid_LV_home could not be started. > Jul 1 21:39:35 localhost vmunix: [ 17.958868] shepherd[1]: Service file-system-/home depends on device-mapping-streetkid_VG_storage-streetkid_LV_home. > Jul 1 21:39:35 localhost vmunix: [ 17.960699] shepherd[1]: Service file-system-/home could not be started. > Jul 1 21:39:35 localhost vmunix: [ 17.962778] shepherd[1]: Service file-systems depends on file-system-/home. > Jul 1 21:39:35 localhost vmunix: [ 17.964868] shepherd[1]: Service file-systems could not be started. > Jul 1 21:39:35 localhost vmunix: [ 17.966751] shepherd[1]: Service user-processes depends on file-systems. > Jul 1 21:39:35 localhost vmunix: [ 17.968643] shepherd[1]: Service user-processes could not be started. > Jul 1 21:39:35 localhost vmunix: [ 17.970712] shepherd[1]: Service guix-daemon depends on user-processes. > Jul 1 21:39:35 localhost vmunix: [ 17.971359] shepherd[1]: Service guix-daemon could not be started. That is: ‘device-mapping-streetkid_VG_storage-streetkid_LV_home’ fails to start; shepherd tries to start other services, but they depend on it so every time it fails with “Service X depends on device-mapping-streetkid_VG_storage-streetkid_LV_home” followed by “Service X could not be started”. Eventually, that propagates down to ‘user-processes’ and ‘guix-daemon’, both of which can’t be started. You can see the dependency graph of Shepherd services with: guix system shepherd-graph …/config.scm | xdot - So it might not be a Shepherd or Guix issue per se, but rather something related to that “streetkid” device. I couldn’t find hints in the log though. Ideas? Thanks, Ludo’.
bug-guix <at> gnu.org
:bug#56519
; Package guix
.
(Tue, 26 Jul 2022 04:52:01 GMT) Full text and rfc822 format available.Message #11 received at 56519 <at> debbugs.gnu.org (full text, mbox):
From: Markus Nilsson <markusnilsson890 <at> gmail.com> To: 56519 <at> debbugs.gnu.org Subject: Re: bug#56519: Shepherd non-deterministically fails to load the "guix-daemon" service after the "user-processes" service has been started Date: Tue, 26 Jul 2022 02:48:54 +0000
[Message part 1 (text/plain, inline)]
Hi Ludo, Thankyou for taking a look at this issue! I think I didn't make it clear in the original bug report what the actual problem was. I understand that in the beginning of the boot process the "device-mapping-streetkid_VG_storage-streetkid_LV_home" shepherd service is failing and this is causing shepherd services that depend on it to fail to start as a consequence. Later in the boot process the "device-mapping-streetkid_VG_storage-streetkid_LV_home" does successfully start though. What happens is that after the "device-mapping-streetkid_VG_storage-streetkid_LV_home" does start (after a couple of tries), other shepherd services will then start including the "user-processes" shepherd service. The problem is that the "guix-daemon" shepherd service doesn't always start after that. Sometimes it will, but sometimes it won't (see my first message for examples). Shouldn't the "guix-daemon" shepherd service ALWAYS start after the "user-processes" shepherd service starts? I also found that the "nscd" shepherd service is failing to start even though the "user-processes" shepherd service has started. The following terminal output was taken after my streetkid server had finished booting up. This should make things clearer: 1 mark <at> streetkid ~$ sudo herd status 2 Started: 3 + /proc/fs/nfsd 4 + console-font-tty1 5 + console-font-tty2 6 + console-font-tty3 7 + console-font-tty4 8 + console-font-tty5 9 + console-font-tty6 10 + device-mapping-streetkid_VG_storage-streetkid_LV_home 11 + file-system-/dev/pts 12 + file-system-/dev/shm 13 + file-system-/gnu/store 14 + file-system-/home 15 + file-system-/sys/firmware/efi/efivars 16 + file-system-/sys/kernel/debug 17 + file-systems 18 + idmap-daemon 19 + loopback 20 + mcron 21 + networking 22 + nfs 23 + ntpd 24 + root 25 + root-file-system 26 + rpc-pipefs 27 + rpc.mountd 28 + rpc.nfsd 29 + rpc.statd 30 + rpcbind-daemon 31 + ssh-daemon 32 + sshd-1 33 + syslogd 34 + term-tty1 35 + term-tty2 36 + term-tty3 37 + term-tty4 38 + term-tty5 39 + term-tty6 40 + udev 41 + urandom-seed 42 + user-file-systems 43 + user-processes 44 + virtual-terminal 45 Stopped: 46 - guix-daemon 47 - nscd 48 - term-console 49 One-shot: 50 * host-name 51 * sysctl 52 * user-homes As you can see on line 10 "device-mapping-streetkid_VG_storage-streetkid_LV_home" has been started and on line 43 "user-processes" has been started. The problem is that "guix-daemon" hasn't been started (line 46) and "nscd" also hasn't been started (line 47). Shouldn't Shepherd start these two services once "user-processes" has been started? I checked other services that depend on "user-processes" (see the attached shepherd dependency graph for my system). The "mcron" service and "ntpd" successfully start (lines 20 and 23). This still leaves the mystery of why "guix-daemon" and "nscd" won't start even though "user-processes" HAS been started. I hope this makes things clearer. Cheers Markus
[Message part 2 (text/html, inline)]
[streetkid-system-config-shepherd-graph.gv (text/vnd.graphviz, attachment)]
bug-guix <at> gnu.org
:bug#56519
; Package guix
.
(Mon, 01 Aug 2022 13:57:01 GMT) Full text and rfc822 format available.Message #14 received at 56519 <at> debbugs.gnu.org (full text, mbox):
From: Ludovic Courtès <ludo <at> gnu.org> To: Markus Nilsson <markusnilsson890 <at> gmail.com> Cc: 56519 <at> debbugs.gnu.org Subject: Re: bug#56519: Shepherd non-deterministically fails to load the "guix-daemon" service after the "user-processes" service has been started Date: Mon, 01 Aug 2022 15:56:00 +0200
Hi Markus, Markus Nilsson <markusnilsson890 <at> gmail.com> skribis: > I checked other services that depend on "user-processes" (see the attached > shepherd dependency graph for my system). The "mcron" service and "ntpd" > successfully start (lines 20 and 23). This still leaves the mystery of why > "guix-daemon" and "nscd" won't start even though "user-processes" HAS been > started. Could you check what /var/log/messages says for this specific occurrence? It will shed some light as to what services were started, in what order, how they failed, etc. Thanks, Ludo’.
bug-guix <at> gnu.org
:bug#56519
; Package guix
.
(Fri, 05 Aug 2022 06:24:02 GMT) Full text and rfc822 format available.Message #17 received at 56519 <at> debbugs.gnu.org (full text, mbox):
From: Markus Nilsson <markusnilsson890 <at> gmail.com> To: 56519 <at> debbugs.gnu.org Subject: Re: bug#56519: Shepherd non-deterministically fails to load the "guix-daemon" service after the "user-processes" service has been started Date: Fri, 5 Aug 2022 06:23:41 +0000
[Message part 1 (text/plain, inline)]
Hi Ludo, Ludovic Courtès <ludo <at> gnu.org> writes: >Could you check what /var/log/messages says for this specific occurrence? >It will shed some light as to what services were started, in what order, how they failed, etc. I have attached the kernel log information for today (August 5 2022) as well as the output of "herd status". You can see that on line 1020 "user-processes" successfully starts, but "guix-daemon" or "nscd" doesn't attempt to start afterwards for some unknown reason. Below is a summary of the relevant events from the log: 1 - System starts booting 873 - "user-processes" service fails to start 908 - "user-processes" service fails to start a 2nd time 910 - "nscd" service fails to start 920 - "user-processes" service fails to start a 3rd time 922 - "guix-daemon" service fails to start 943 - "user-processes" service fails to start a 4th time 1020 - "user-processes" service starts successfully 1071 - "ntpd" service starts successfully 1075 - "mcron" service starts successfully This seems to indicate that the Shepherd will attempt to start "user-processes" multiple times until it succeeds. But Shepherd won't try to start "guix-daemon" or "nscd" if it has failed once. Cheers Markus
[Message part 2 (text/html, inline)]
[Aug_5_kernel_log (application/octet-stream, attachment)]
[Aug_5_herd_status (application/octet-stream, attachment)]
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.