GNU bug report logs - #52051
[core-updates-frozen] cannot login ('org.freedesktop.login1' service times out)

Previous Next

Package: guix;

Reported by: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>

Date: Tue, 23 Nov 2021 06:48:02 UTC

Severity: important

Done: Leo Famulari <leo <at> famulari.name>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 52051 in the body.
You can then email your comments to 52051 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-guix <at> gnu.org:
bug#52051; Package guix. (Tue, 23 Nov 2021 06:48:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Maxim Cournoyer <maxim.cournoyer <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-guix <at> gnu.org. (Tue, 23 Nov 2021 06:48:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: bug-guix <bug-guix <at> gnu.org>
Subject: [core-updates-frozen] cannot login ('org.freedesktop.login1'
 service times out)
Date: Tue, 23 Nov 2021 01:47:07 -0500
Hello,

I'm not 100% sure this is the cause, but these are the last messages I
have before I rebooted:

--8<---------------cut here---------------start------------->8---
Nov 23 01:09:14 localhost dbus-daemon[383]: [system] Activating service name='org.freedesktop.login1' requested by ':1.17' (uid=0 pid=370
    comm="/gnu/store/ximad0zvg12r4x0x80mvym8hzg0n33jl-shadow") (using servicehelper)
Nov 23 01:09:14 localhost elogind[1189]: elogind is already running as PID 390
Nov 23 01:09:20 localhost shepherd[1]: Respawning term-tty1. 
Nov 23 01:09:20 localhost shepherd[1]: Service host-name has been started. 
Nov 23 01:09:20 localhost shepherd[1]: Service term-tty1 has been started.
Nov 23 01:09:39 localhost dbus-daemon[383]: [system] Failed to activate service 'org.freedesktop.login1': timed out (service_start_timeout=25000ms)
--8<---------------cut here---------------end--------------->8---

I don't remember if I saw the slim login screen; but in any case I
couldn't successfully login even via a ptty.

It may have to do with polkit.

To be investigated.

This happened on a system *not* using gdm (it uses slim) and with
ratpoison as the WM, on commit f42bc604547d9ee8e35fcd66d5db7786954cfac3
of the core-updates-frozen branch.

To be investigated.

Thanks,

Maxim




Information forwarded to bug-guix <at> gnu.org:
bug#52051; Package guix. (Wed, 24 Nov 2021 09:03:01 GMT) Full text and rfc822 format available.

Message #8 received at 52051 <at> debbugs.gnu.org (full text, mbox):

From: Josselin Poiret <dev <at> jpoiret.xyz>
To: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>, 52051 <at> debbugs.gnu.org
Cc: dev <at> jpoiret.xyz
Subject: Re: bug#52051: [core-updates-frozen] cannot login
 ('org.freedesktop.login1' service times out)
Date: Wed, 24 Nov 2021 09:02:16 +0000
Hello Maxim,

Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:

> --8<---------------cut here---------------start------------->8---
> Nov 23 01:09:14 localhost dbus-daemon[383]: [system] Activating service name='org.freedesktop.login1' requested by ':1.17' (uid=0 pid=370
>     comm="/gnu/store/ximad0zvg12r4x0x80mvym8hzg0n33jl-shadow") (using servicehelper)
> Nov 23 01:09:14 localhost elogind[1189]: elogind is already running as PID 390
> Nov 23 01:09:20 localhost shepherd[1]: Respawning term-tty1. 
> Nov 23 01:09:20 localhost shepherd[1]: Service host-name has been started. 
> Nov 23 01:09:20 localhost shepherd[1]: Service term-tty1 has been started.
> Nov 23 01:09:39 localhost dbus-daemon[383]: [system] Failed to activate service 'org.freedesktop.login1': timed out (service_start_timeout=25000ms)
> --8<---------------cut here---------------end--------------->8---
>
> I don't remember if I saw the slim login screen; but in any case I
> couldn't successfully login even via a ptty.
>
> It may have to do with polkit.
>
> To be investigated.
>
> This happened on a system *not* using gdm (it uses slim) and with
> ratpoison as the WM, on commit f42bc604547d9ee8e35fcd66d5db7786954cfac3
> of the core-updates-frozen branch.
>
> To be investigated.

I cannot reproduce in a fresh VM on commit
d5de4e163ccef80f78bc5fe330f568d8fe3a23ab, and can login just fine, with

  (services (cons* (service slim-service-type (slim-configuration))
                   (modify-services %desktop-services
                     (delete gdm-service-type))))

Is this still affecting you?

Best,
Josselin Poiret




Information forwarded to bug-guix <at> gnu.org:
bug#52051; Package guix. (Thu, 25 Nov 2021 03:04:02 GMT) Full text and rfc822 format available.

Message #11 received at 52051 <at> debbugs.gnu.org (full text, mbox):

From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: Josselin Poiret <dev <at> jpoiret.xyz>
Cc: 52051 <at> debbugs.gnu.org
Subject: Re: bug#52051: [core-updates-frozen] cannot login
 ('org.freedesktop.login1' service times out)
Date: Wed, 24 Nov 2021 22:03:13 -0500
Hello Josselin,

Josselin Poiret <dev <at> jpoiret.xyz> writes:

> Hello Maxim,
>
> Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:
>
>> --8<---------------cut here---------------start------------->8---
>> Nov 23 01:09:14 localhost dbus-daemon[383]: [system] Activating
>> service name='org.freedesktop.login1' requested by ':1.17' (uid=0
>> pid=370
>>     comm="/gnu/store/ximad0zvg12r4x0x80mvym8hzg0n33jl-shadow") (using servicehelper)
>> Nov 23 01:09:14 localhost elogind[1189]: elogind is already running as PID 390
>> Nov 23 01:09:20 localhost shepherd[1]: Respawning term-tty1. 
>> Nov 23 01:09:20 localhost shepherd[1]: Service host-name has been started. 
>> Nov 23 01:09:20 localhost shepherd[1]: Service term-tty1 has been started.
>> Nov 23 01:09:39 localhost dbus-daemon[383]: [system] Failed to
>> activate service 'org.freedesktop.login1': timed out
>> (service_start_timeout=25000ms)
>> --8<---------------cut here---------------end--------------->8---
>>
>> I don't remember if I saw the slim login screen; but in any case I
>> couldn't successfully login even via a ptty.
>>
>> It may have to do with polkit.
>>
>> To be investigated.
>>
>> This happened on a system *not* using gdm (it uses slim) and with
>> ratpoison as the WM, on commit f42bc604547d9ee8e35fcd66d5db7786954cfac3
>> of the core-updates-frozen branch.
>>
>> To be investigated.
>
> I cannot reproduce in a fresh VM on commit
> d5de4e163ccef80f78bc5fe330f568d8fe3a23ab, and can login just fine, with
>
>   (services (cons* (service slim-service-type (slim-configuration))
>                    (modify-services %desktop-services
>                      (delete gdm-service-type))))
>
> Is this still affecting you?

Yes!  It didn't occur in a 'guix system vm my-config.scm', but the exact
same config deployed on my machine fails at login.

Some symptoms:

1. Slim login screen comes up, but after entering credentials Xorg
resets (back to login screen)

2. going to a TTY and attempting to login there, it'd fail with a "Login
failed after 60 s timeout" or similar error.

3. I can login via SSH (thanks goodness!)

4. There are no errors (EE) in /var/log/Xorg.0.log

5. here's the tail of my /var/log/messages:

--8<---------------cut here---------------start------------->8---
Nov 24 21:23:54 localhost ntpd[346]: Soliciting pool server 216.197.156.83
Nov 24 21:23:55 localhost ntpd[346]: Soliciting pool server 206.108.0.133
Nov 24 21:23:56 localhost wpa_supplicant[343]: wlp4s0: CTRL-EVENT-BEACON-LOSS 
Nov 24 21:23:56 localhost ntpd[346]: Soliciting pool server 98.143.85.249
Nov 24 21:23:57 localhost ntpd[346]: Soliciting pool server 192.95.27.155
Nov 24 21:23:58 localhost dbus-daemon[341]: [system] Activating service name='org.freedesktop.login1' requested by ':1.16' (uid=0 pid=324 comm="/gnu/store/ximad0zvg12r4x0x80mvym8hzg0n33jl-shadow") (using servicehelper)
Nov 24 21:23:58 localhost elogind[1114]: elogind is already running as PID 355
Nov 24 21:24:11 localhost wpa_supplicant[343]: wlp4s0: CTRL-EVENT-BEACON-LOSS 
Nov 24 21:24:21 localhost last message repeated 5 times
Nov 24 21:24:23 localhost dbus-daemon[341]: [system] Failed to activate service 'org.freedesktop.login1': timed out (service_start_timeout=25000ms)
Nov 24 21:24:23 localhost shepherd[1]: Respawning term-tty2. 
Nov 24 21:24:23 localhost shepherd[1]: Service host-name has been started. 
Nov 24 21:24:23 localhost shepherd[1]: Service term-tty2 has been started. 
Nov 24 21:24:23 localhost wpa_supplicant[343]: wlp4s0: CTRL-EVENT-BEACON-LOSS 
Nov 24 21:24:27 localhost last message repeated 3 times
Nov 24 21:26:04 localhost dbus-daemon[341]: [system] Activating service name='org.freedesktop.login1' requested by ':1.17' (uid=0 pid=429 comm="/gnu/store/nvvmksc9pvahqmypaz3h8mqya82vnga8-slim-1") (using servicehelper)
Nov 24 21:26:04 localhost elogind[1127]: elogind is already running as PID 355
Nov 24 21:26:29 localhost dbus-daemon[341]: [system] Failed to activate service 'org.freedesktop.login1': timed out (service_start_timeout=25000ms)
Nov 24 21:26:29 localhost shepherd[1]: Respawning xorg-server. 
Nov 24 21:26:29 localhost shepherd[1]: Service host-name has been started. 
Nov 24 21:26:29 localhost shepherd[1]: Service xorg-server has been started. 
Nov 24 21:27:23 localhost ntpd[346]: Soliciting pool server 209.115.181.108
Nov 24 21:27:24 localhost ntpd[346]: Soliciting pool server 138.197.153.200
Nov 24 21:27:25 localhost ntpd[346]: Soliciting pool server 162.159.200.123
Nov 24 21:27:26 localhost ntpd[346]: Soliciting pool server 162.159.200.1
Nov 24 21:29:09 localhost ntpd[346]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Nov 24 21:29:42 localhost ntpd[346]: Soliciting pool server 199.182.221.110
Nov 24 21:35:23 localhost wpa_supplicant[343]: wlp4s0: CTRL-EVENT-SIGNAL-CHANGE above=0 signal=-87 noise=-95 txrate=270000
Nov 24 21:35:32 localhost wpa_supplicant[343]: wlp4s0: CTRL-EVENT-SIGNAL-CHANGE above=1 signal=-67 noise=-95 txrate=270000
Nov 24 21:42:50 localhost dbus-daemon[341]: [system] Activating service name='org.freedesktop.login1' requested by ':1.19' (uid=0 pid=1269 comm="sudo herd status dbus ") (using servicehelper)
Nov 24 21:42:50 localhost elogind[1273]: elogind is already running as PID 355
Nov 24 21:43:15 localhost dbus-daemon[341]: [system] Failed to activate service 'org.freedesktop.login1': timed out (service_start_timeout=25000ms)
Nov 24 21:47:25 localhost last message repeated 3 times
--8<---------------cut here---------------end--------------->8---

So it seems the issue is with some D-Bus service
(org.freedesktop.login1) failing to be started.  I'm not sure how it can
be explained that it works in a VM but not on the real machine...

Also, the dbus-system process (pid 341) as the following environment
variable (singular) defined:

--8<---------------cut here---------------start------------->8---
# cat /proc/341/environ | xargs -0 -n1
PATH=/run/current-system/profile/bin
--8<---------------cut here---------------end--------------->8---

which seems like not much.  Perhaps it needs some XDG_DATA_DIRS to find
that D-Bus service file?  Where would it come from?

To be investigated...

Thanks,

Maxim




Information forwarded to bug-guix <at> gnu.org:
bug#52051; Package guix. (Thu, 25 Nov 2021 04:08:02 GMT) Full text and rfc822 format available.

Message #14 received at 52051 <at> debbugs.gnu.org (full text, mbox):

From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: Josselin Poiret <dev <at> jpoiret.xyz>
Cc: 52051 <at> debbugs.gnu.org
Subject: Re: bug#52051: [core-updates-frozen] cannot login
 ('org.freedesktop.login1' service times out)
Date: Wed, 24 Nov 2021 23:07:13 -0500
Hello,

I've found a workaround: restarting elogind via SSH resolved the issue.

I guess itt may be a race between elogind and dbus-system (elogind gets
started before dbus-system is fully up, and the communication with the
session bus is somehow crippled from there?).

I'll experiment with the elogind service a bit, adding a dumb sleep to
its start slot to see if it prevents the race.

Thank you,

Maxim




Information forwarded to bug-guix <at> gnu.org:
bug#52051; Package guix. (Thu, 25 Nov 2021 22:08:02 GMT) Full text and rfc822 format available.

Message #17 received at 52051 <at> debbugs.gnu.org (full text, mbox):

From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: Josselin Poiret <dev <at> jpoiret.xyz>
Cc: 52051 <at> debbugs.gnu.org
Subject: Re: bug#52051: [core-updates-frozen] cannot login
 ('org.freedesktop.login1' service times out)
Date: Thu, 25 Nov 2021 17:07:30 -0500
[Message part 1 (text/plain, inline)]
Hello,

In case it is useful, here's what the shepherd-graph of my system looks
like:

[hurd.svg (application/octet-stream, attachment)]
[Message part 3 (text/plain, inline)]
My system configuration looks something like this (some parts redacted).

--8<---------------cut here---------------start------------->8---
(use-modules (gnu)
             (guix modules)
             (srfi srfi-1)
             ;; for mcron experiment
             (gnu packages autotools)
             (gnu packages guile-xyz)
             (gnu packages man)
             (gnu packages texinfo)
             (guix git)
             (guix packages))

(use-service-modules admin audio desktop docker linux mail mcron networking nfs
                     nix spice ssh sysctl telephony virtualization vpn xorg)
(use-package-modules android bootloaders certs gnome linux
                     nfs ratpoison rsync tls)

(define updatedb-job
  ;; Run 'updatedb' at 3AM every day.
  #~(job '(next-hour '(3))
         (lambda ()
           (execl #$(file-append findutils "/bin/updatedb")
                  "updatedb"
                  "--prunepaths=/gnu/store /tmp /var/tmp /var/lib"))
         "updatedb"))

(define btrfs-balance-job
  ;; Re-allocate chunks which are using less than 5% of their chunk
  ;; space, to regain Btrfs 'unallocated' space.  The usage is kept
  ;; low (5%) to minimize wear on the SSD.  Runs at 5 AM every 3 days.
  #~(job '(next-hour-from (next-day (range 1 31 3)) '(5))
         (lambda ()
           (execl #$(file-append btrfs-progs "/bin/btrfs")
                  "balance" "start" "-dusage=5" "/"))
         "btrfs-balance"))

(define i2c-udev-rule
  ;; Sets i2c group to i2c devices, so that we can access it as a
  ;; simple user, e.g. using ddcutil.
  (udev-rule
   "90-i2c.rules"
   (string-append "KERNEL==\"i2c-[0-9]*\", "
		  "GROUP=\"i2c\"")))

(define c920-webcam-udev-rule
  (udev-rule "90-c920-webcam.rules"
             (string-append "SUBSYSTEM==\"usb\", "
                            "ATTRS{idVendor}==\"046d\", "
                            "ATTRS{idProduct}==\"082d\", "
                            "GROUP=\"video\"")))

(define %my-desktop-services
  (remove (lambda (service)
            (eq? (service-kind service) gdm-service-type))
          %desktop-services))

;; TODO: Honor rootflags in the initrd,
;; then remove degraded.
(define %common-btrfs-options '(("compress-force" . "zstd")
                                ("space_cache" . "v2")
                                "degraded"))

(define %system
  (operating-system
    (host-name "hurd")
    (timezone "America/Montreal")
    (keyboard-layout (keyboard-layout "dvorak"))
    (bootloader (bootloader-configuration
                 (bootloader grub-bootloader)
                 (targets (list "/dev/sda" "/dev/sdb" "/dev/sdc"))
                 (terminal-outputs '(console))
		 (keyboard-layout keyboard-layout)))
    (kernel-arguments '("quiet" "snd_hda_intel.dmic_detect=0"
                        "modprobe.blacklist=rtl8187"))
    (mapped-devices
     (list (mapped-device
            (source "/dev/sda2")
            (target "cryptroot")
            (type luks-device-mapping))
           (mapped-device
            (source "/dev/sdb2")
            (target "cryptroot-mirror")
            (type luks-device-mapping))
           (mapped-device
            (source "/dev/sdc2")
            (target "cryptroot-mirror2")
            (type luks-device-mapping))))

    ;; Note: Using any of the LUKS encrypted drives exposed under
    ;; /dev/mapper is enough to reference the Btrfs RAID-1 array,
    ;; since the 'btrfs device scan' command is executed in the init
    ;; RAM disk and takes care of assembling the array.
    (file-systems (cons* (file-system
                           (mount-point "/")
                           (device "/dev/mapper/cryptroot")
                           (type "btrfs")
			   (options (alist->file-system-options
                                     (cons '("subvol" . "@root")
                                           %common-btrfs-options)))
			   (dependencies mapped-devices))
                         (file-system
                           (device "/dev/mapper/cryptroot")
                           (mount-point "/home")
                           (type "btrfs")
                           (options (alist->file-system-options
                                     (cons '("subvol" . "@home")
                                           %common-btrfs-options)))
                           (dependencies mapped-devices))
                         (file-system
                           (device "/dev/mapper/cryptroot")
                           (mount-point "/data")
                           (type "btrfs")
                           (options (alist->file-system-options
                                     (cons '("subvol" . "@data")
                                           %common-btrfs-options)))
                           (dependencies mapped-devices))
                         %base-file-systems))
    (users (cons (user-account
                  (name "me")
                  (group "users")
                  (supplementary-groups
		   '("lp"               ;for bluetooth
		     "dialout"		;for serial port
		     "adbusers"         ;for adb
		     "i2c"              ;for ddcutil
		     "realtime"         ;for jackd
                     "kvm" "libvirt" "docker"
		     "wheel" "netdev" "audio" "video"))
                  (home-directory "/home/maxim"))
                 %base-user-accounts))
    (groups
     (cons* (user-group (system? #t) (name "adbusers"))
	    (user-group (system? #t) (name "i2c"))
            (user-group (system? #t) (name "realtime"))
	    %base-groups))
    (services
     (cons*
      (service docker-service-type (docker-configuration
                                    (debug? #t)))
      (service qemu-binfmt-service-type
               (qemu-binfmt-configuration
                (platforms (lookup-qemu-platforms "arm" "aarch64" "ppc64le"))))
      (service earlyoom-service-type)
      (service zram-device-service-type (zram-device-configuration
                                         (size "6G")
                                         (compression-algorithm 'ZSTD)))
      (bluetooth-service #:auto-enable? #t)
      (pam-limits-service
       (list
        ;; Required in order for jackd to function properly.
        (pam-limits-entry "@realtime" 'both 'rtprio 99)
        (pam-limits-entry "@realtime" 'both 'memlock 'unlimited)))
      (service mcron-service-type
               (mcron-configuration
                (mcron (package/inherit mcron
                         (source (git-checkout
                                  (url "file:///home/maxim/src/mcron")))
                         (native-inputs (append
                                            `(("autoconf" ,autoconf)
                                              ("automake" ,automake)
                                              ("help2man" ,help2man)
                                              ("texinfo" ,texinfo))
                                            (package-native-inputs mcron)))))
                (jobs (list updatedb-job
                            btrfs-balance-job))))
      (service mpd-service-type
               (mpd-configuration
                (user "me")))
      ;; Virtualization services to use with GNOME Boxes.
      (service libvirt-service-type
               (libvirt-configuration
                (unix-sock-group "libvirt")))
      (service virtlog-service-type)
      (service opendht-service-type (opendht-configuration
                                     (enable-logging? #t)
                                     (peer-discovery? #t)
                                     (proxy-server-port 8282)))
      (service openssh-service-type
	       (openssh-configuration
	        (port-number 22)
	        (permit-root-login #f)
	        (password-authentication? #f)
	        (use-pam? #f))))
      (service slim-service-type
	       (slim-configuration
	        (auto-login? #f)
	        (default-user "me")
	        (xorg-configuration
	         (xorg-configuration
	      	  (keyboard-layout keyboard-layout)))))
      (service nix-service-type)
      (service nfs-service-type
               (nfs-configuration
                (nfs-versions '("4.2" "4.1" "4.0"
                                "3" "2")) ; for U-Boot
                (exports
                 '(("/some/path/to/rootfs"
                    "*(rw,no_root_squash,no_subtree_check)")))
                (nfsd-udp? #t)))        ;for U-Boot
      (service guix-publish-service-type
               (guix-publish-configuration
                (advertise? #t)
                (host "0.0.0.0")))      ;listen on all interfaces
      (modify-services %my-desktop-services
        ;; Ignore the silly HHKB2 power off key shortcut.
        (elogind-service-type config =>
                              (elogind-configuration
                               (inherit config)
			       (handle-power-key 'ignore)))
        ;; Enable using adb as a simple user with a multitude of devices.
        (udev-service-type config =>
			   (udev-configuration
			    (inherit config)
			    (rules (cons* android-udev-rules
					  i2c-udev-rule
                                          c920-webcam-udev-rule
					  (udev-configuration-rules config)))))
        (network-manager-service-type config =>
				      (network-manager-configuration
				       (inherit config)
				       (vpn-plugins
				        (list network-manager-openvpn
                                              network-manager-openconnect))))
	(guix-service-type config =>
			   (guix-configuration
			    (inherit config)
                            (substitute-urls '("http://127.0.0.1:8181"
					       "https://ci.guix.gnu.org"))
                            (discover? #t)
                            (authorized-keys
                             (cons*
                                keys...
			      %default-authorized-guix-keys))
                            (log-compression 'none)
			    (extra-options '("--max-jobs=4")))))))

    ;; Packages installed by default.
    (packages (append (map specification->package
                           '("ratpoison"
                             "linux-pam" ;for the pam_limits manpage
                             "nss-certs" ;for HTTPS access
		             "rsync"     ;for the backup script
                             "btrfs-progs"
                             "nix"
                             "nfs-utils"
                             "ntfs-3g")) ;for mount.ntfs
                  %base-packages))

    ;; Allow resolution of '.local' host names with mDNS.
    (name-service-switch %mdns-host-lookup-nss)))

%system
--8<---------------cut here---------------end--------------->8---

But virtualizing it with `guix system vm` doesn't trigger the bug, I'm
guessing because the NFS export doesn't work (it fails quickly rather
than take time like it would on an actual boot).

Thanks,

Maxim

Information forwarded to bug-guix <at> gnu.org:
bug#52051; Package guix. (Thu, 25 Nov 2021 22:14:02 GMT) Full text and rfc822 format available.

Message #20 received at 52051 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Cc: Josselin Poiret <dev <at> jpoiret.xyz>, 52051 <at> debbugs.gnu.org
Subject: Re: bug#52051: [core-updates-frozen] cannot login
 ('org.freedesktop.login1' service times out)
Date: Thu, 25 Nov 2021 23:13:42 +0100
Hi,

Maxim Cournoyer <maxim.cournoyer <at> gmail.com> skribis:

> Nov 24 21:23:58 localhost dbus-daemon[341]: [system] Activating service name='org.freedesktop.login1' requested by ':1.16' (uid=0 pid=324 comm="/gnu/store/ximad0zvg12r4x0x80mvym8hzg0n33jl-shadow") (using servicehelper)
> Nov 24 21:23:58 localhost elogind[1114]: elogind is already running as PID 355
> Nov 24 21:24:11 localhost wpa_supplicant[343]: wlp4s0: CTRL-EVENT-BEACON-LOSS 
> Nov 24 21:24:21 localhost last message repeated 5 times
> Nov 24 21:24:23 localhost dbus-daemon[341]: [system] Failed to activate service 'org.freedesktop.login1': timed out (service_start_timeout=25000ms)
> Nov 24 21:24:23 localhost shepherd[1]: Respawning term-tty2. 
> Nov 24 21:24:23 localhost shepherd[1]: Service host-name has been started. 
> Nov 24 21:24:23 localhost shepherd[1]: Service term-tty2 has been started. 
> Nov 24 21:24:23 localhost wpa_supplicant[343]: wlp4s0: CTRL-EVENT-BEACON-LOSS 
> Nov 24 21:24:27 localhost last message repeated 3 times
> Nov 24 21:26:04 localhost dbus-daemon[341]: [system] Activating service name='org.freedesktop.login1' requested by ':1.17' (uid=0 pid=429 comm="/gnu/store/nvvmksc9pvahqmypaz3h8mqya82vnga8-slim-1") (using servicehelper)
> Nov 24 21:26:04 localhost elogind[1127]: elogind is already running as PID 355
> Nov 24 21:26:29 localhost dbus-daemon[341]: [system] Failed to activate service 'org.freedesktop.login1': timed out (service_start_timeout=25000ms)

It looks like elogind things it’s already running as PID 355, where the
PID comes from its PID file; there’s indeed a process with this PID, but
is it really elogind?  Or could it be that there’s a stale PID file?

Does /var/log/messages shows the very first time elogind is started at
boot time?  With which PID?

We need to find out why dbus-daemon thinks it needs to restart elogind.

Ludo’.




Severity set to 'important' from 'normal' Request was from Maxim Cournoyer <maxim.cournoyer <at> gmail.com> to control <at> debbugs.gnu.org. (Fri, 26 Nov 2021 04:01:01 GMT) Full text and rfc822 format available.

Information forwarded to bug-guix <at> gnu.org:
bug#52051; Package guix. (Fri, 26 Nov 2021 09:36:02 GMT) Full text and rfc822 format available.

Message #25 received at 52051 <at> debbugs.gnu.org (full text, mbox):

From: Josselin Poiret <dev <at> jpoiret.xyz>
To: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Cc: dev <at> jpoiret.xyz, 52051 <at> debbugs.gnu.org
Subject: Re: bug#52051: [core-updates-frozen] cannot login
 ('org.freedesktop.login1' service times out)
Date: Fri, 26 Nov 2021 09:35:42 +0000
Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:

> Hello,
>
> I've found a workaround: restarting elogind via SSH resolved the issue.
>
> I guess itt may be a race between elogind and dbus-system (elogind gets
> started before dbus-system is fully up, and the communication with the
> session bus is somehow crippled from there?).

Does this happen with other login managers on your system, like LightDM?
The thing is that elogind depends on dbus-system, so I'm not sure there
should be a race there.

I noticed though that none of our desktop managers's shepherd services
require elogind, and in the case of SLiM not even dbus-system.  Maybe we
should add it there, since we want Shepherd to handle launching elogind,
and avoid dbus launching one by itself if the login1 service is used or
even by the PAM elogind module.  Can you try adding that to
slim-shepherd-service?

Also, maybe patching the dbus service to Exec=false instead would be a
good safeguard against dbus launching elogind itself.

Tangentially: SLiM is very old and unmaintained, so it may one day
simply break with newer versions of systemd/elogind.

Best,
Josselin Poiret




Information forwarded to bug-guix <at> gnu.org:
bug#52051; Package guix. (Tue, 07 Dec 2021 21:24:01 GMT) Full text and rfc822 format available.

Message #28 received at 52051 <at> debbugs.gnu.org (full text, mbox):

From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: Josselin Poiret <dev <at> jpoiret.xyz>
Cc: 52051 <at> debbugs.gnu.org
Subject: Re: bug#52051: [core-updates-frozen] cannot login
 ('org.freedesktop.login1' service times out)
Date: Tue, 07 Dec 2021 16:23:38 -0500
[Message part 1 (text/plain, inline)]
Hello!

Josselin Poiret <dev <at> jpoiret.xyz> writes:

> Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:
>
>> Hello,
>>
>> I've found a workaround: restarting elogind via SSH resolved the issue.
>>
>> I guess itt may be a race between elogind and dbus-system (elogind gets
>> started before dbus-system is fully up, and the communication with the
>> session bus is somehow crippled from there?).
>
> Does this happen with other login managers on your system, like LightDM?
> The thing is that elogind depends on dbus-system, so I'm not sure there
> should be a race there.

I've yet to try this, but it *doesn't* happen on another machine using
an almost-identical configuration (slim + ratpoison).

> I noticed though that none of our desktop managers's shepherd services
> require elogind, and in the case of SLiM not even dbus-system.  Maybe we
> should add it there, since we want Shepherd to handle launching elogind,
> and avoid dbus launching one by itself if the login1 service is used or
> even by the PAM elogind module.  Can you try adding that to
> slim-shepherd-service?

I've tried adding elogind in the requirements of the slim service type
as well as the upower-daemon service type, but it didn't help.

Based on my observations (/var/log/messages), it seems that the dbus
session bus is not yet ready when elogin starts, and thus the various
dbus services are not yet published/registered when other dependents
need them.

I noticed that dbus system session times out on every boot on that
machine.  I also notice that when the NTPD daemon starts, it rewinds
time by about 25 s on every boot

--8<---------------cut here---------------start------------->8---
[...]
Dec  7 15:55:02 localhost shepherd[1]: Service dbus-system has been started.
[...]
Dec  7 15:55:10 localhost shepherd[1]: Service ntpd has been started.
[...]
Dec  7 15:54:46 localhost ntpd[341]: ntpd 4.2.8p15 <at> 1.3728-o Thu Jan  1 12:00:01 AM UTC 1970 (1): Starting
[...]
Dec  7 15:55:14 localhost shepherd[1]: Service elogind has been started.
[...]
Dec  7 15:55:21 localhost shepherd[1]: Service upower-daemon has been started.
[...]
Dec  7 15:54:51 localhost dbus-daemon[335]: [system] Activating service name='org.freedesktop.login1' requested by ':1.1' (uid=0 pid=345 comm="/gnu/store/g1qlpzcfnk2r6186al2hfqjmq9yl7qkk-upower") (using servicehelper)
[...]
Dec  7 15:55:02 localhost elogind[352]: elogind is already running as PID 343
[...]
Dec  7 15:55:16 localhost dbus-daemon[335]: [system] Failed to activate service 'org.freedesktop.login1': timed out (service_start_timeout=25000ms)
[...]
Dec  7 15:55:24 localhost dbus-daemon[335]: [system] Connection has not authenticated soon enough, closing it (auth_timeout=30000ms, elapsed: 36684ms)
--8<---------------cut here---------------end--------------->8---


> Also, maybe patching the dbus service to Exec=false instead would be a
> good safeguard against dbus launching elogind itself.

Perhaps a good idea to try also!

Attached is the /var/log/messages of the problematic boot, nearly
unedited.

[messages.boot (text/plain, attachment)]
[Message part 3 (text/plain, inline)]
Thanks,

Maxim

Information forwarded to bug-guix <at> gnu.org:
bug#52051; Package guix. (Tue, 07 Dec 2021 22:17:01 GMT) Full text and rfc822 format available.

Message #31 received at 52051 <at> debbugs.gnu.org (full text, mbox):

From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: Josselin Poiret <dev <at> jpoiret.xyz>
Cc: 52051 <at> debbugs.gnu.org
Subject: Re: bug#52051: [core-updates-frozen] cannot login
 ('org.freedesktop.login1' service times out)
Date: Tue, 07 Dec 2021 17:15:45 -0500
Hi again,

Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:

[...]

> I noticed that dbus system session times out on every boot on that
> machine.  I also notice that when the NTPD daemon starts, it rewinds
> time by about 25 s on every boot

Not NTP related; the same occurs when removing the ntp-service-type from
%desktop-services.

Maxim




Information forwarded to bug-guix <at> gnu.org:
bug#52051; Package guix. (Wed, 08 Dec 2021 02:14:02 GMT) Full text and rfc822 format available.

Message #34 received at 52051 <at> debbugs.gnu.org (full text, mbox):

From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: Josselin Poiret <dev <at> jpoiret.xyz>
Cc: 52051 <at> debbugs.gnu.org
Subject: Re: bug#52051: [core-updates-frozen] cannot login
 ('org.freedesktop.login1' service times out)
Date: Tue, 07 Dec 2021 21:12:59 -0500
Hello,

Josselin Poiret <dev <at> jpoiret.xyz> writes:

> Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:
>
>> Hello,
>>
>> I've found a workaround: restarting elogind via SSH resolved the issue.
>>
>> I guess itt may be a race between elogind and dbus-system (elogind gets
>> started before dbus-system is fully up, and the communication with the
>> session bus is somehow crippled from there?).
>
> Does this happen with other login managers on your system, like LightDM?
> The thing is that elogind depends on dbus-system, so I'm not sure there
> should be a race there.

I have the same problem using GDM on that machine :-(.

I'll try poking the dbus session manually using dbus-send, or the Scheme
API for it under (guix build jami-service).

Thanks!

Maxim




Information forwarded to bug-guix <at> gnu.org:
bug#52051; Package guix. (Wed, 08 Dec 2021 22:19:01 GMT) Full text and rfc822 format available.

Message #37 received at 52051 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Cc: Josselin Poiret <dev <at> jpoiret.xyz>, 52051 <at> debbugs.gnu.org
Subject: Re: bug#52051: [core-updates-frozen] cannot login
 ('org.freedesktop.login1' service times out)
Date: Wed, 08 Dec 2021 23:18:47 +0100
Hi,

Maxim Cournoyer <maxim.cournoyer <at> gmail.com> skribis:

> Dec  7 15:55:02 localhost shepherd[1]: Service dbus-system has been started.
> [...]
> Dec  7 15:55:10 localhost shepherd[1]: Service ntpd has been started.
> [...]
> Dec  7 15:54:46 localhost ntpd[341]: ntpd 4.2.8p15 <at> 1.3728-o Thu Jan  1 12:00:01 AM UTC 1970 (1): Starting
> [...]
> Dec  7 15:55:14 localhost shepherd[1]: Service elogind has been started.
> [...]
> Dec  7 15:55:21 localhost shepherd[1]: Service upower-daemon has been started.
> [...]
> Dec  7 15:54:51 localhost dbus-daemon[335]: [system] Activating service name='org.freedesktop.login1' requested by ':1.1' (uid=0 pid=345 comm="/gnu/store/g1qlpzcfnk2r6186al2hfqjmq9yl7qkk-upower") (using servicehelper)

The key thing here is that dbus-daemon things elogind is not running and
thus considers it has to start it, which is bound to fail.

You can display the list of services known to the DBus system bus with:

  dbus-send --system --print-reply --dest=org.freedesktop.DBus \
    /org/freedesktop/DBus org.freedesktop.DBus.ListNames

I tried in a VM that had booted fine and it shows
“org.freedesktop.login1” as expected.  Could you check what that gives
for you when the bug above shows up?

I wonder if it’s possible for there to be a race, like elogind says
“hi!” to dbus-daemon but dbus-daemon is still sleepy and doesn’t notice.
Seems hard to believe though.

If it’s reproducible, could you instrument ‘elogind-service-type’ such
that elogind is started with “strace -f -o /elogind.log -s 700”?

HTH,
Ludo’.




Information forwarded to bug-guix <at> gnu.org:
bug#52051; Package guix. (Thu, 09 Dec 2021 14:19:01 GMT) Full text and rfc822 format available.

Message #40 received at 52051 <at> debbugs.gnu.org (full text, mbox):

From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: Josselin Poiret <dev <at> jpoiret.xyz>, 52051 <at> debbugs.gnu.org
Subject: Re: bug#52051: [core-updates-frozen] cannot login
 ('org.freedesktop.login1' service times out)
Date: Thu, 09 Dec 2021 09:18:13 -0500
[Message part 1 (text/plain, inline)]
Hi Ludovic!

Ludovic Courtès <ludo <at> gnu.org> writes:

> Hi,
>
> Maxim Cournoyer <maxim.cournoyer <at> gmail.com> skribis:
>
>> Dec  7 15:55:02 localhost shepherd[1]: Service dbus-system has been started.
>> [...]
>> Dec  7 15:55:10 localhost shepherd[1]: Service ntpd has been started.
>> [...]
>> Dec  7 15:54:46 localhost ntpd[341]: ntpd 4.2.8p15 <at> 1.3728-o Thu Jan  1 12:00:01 AM UTC 1970 (1): Starting
>> [...]
>> Dec  7 15:55:14 localhost shepherd[1]: Service elogind has been started.
>> [...]
>> Dec  7 15:55:21 localhost shepherd[1]: Service upower-daemon has been started.
>> [...]
>> Dec 7 15:54:51 localhost dbus-daemon[335]: [system] Activating
>> service name='org.freedesktop.login1' requested by ':1.1' (uid=0
>> pid=345 comm="/gnu/store/g1qlpzcfnk2r6186al2hfqjmq9yl7qkk-upower")
>> (using servicehelper)
>
> The key thing here is that dbus-daemon things elogind is not running and
> thus considers it has to start it, which is bound to fail.
>
> You can display the list of services known to the DBus system bus with:
>
>   dbus-send --system --print-reply --dest=org.freedesktop.DBus \
>     /org/freedesktop/DBus org.freedesktop.DBus.ListNames

After launching elogind (added a wait for it in its start slot), the
first time it prints just:

Dec  9 09:08:24 localhost shepherd[1]: ;;; (available-dbus-services
("org.freedesktop.DBus" "fi.w1.wpa_supplicant1"))

And on the last try (25 seconds of trying) it prints:
Dec  9 09:09:27 localhost shepherd[1]: ;;; (available-dbus-services ("org.freedesktop.DBus" "org.freedesktop.PolicyKit1" "org.freedesktop.ModemManager1" "fi.w1.wpa_supplicant1" "org.freedesktop.NetworkManager"))

login1 never shows up.

So elogind is essentially hung up when it is started that early during
boot.

Attached is the requested strace output attached to elogind.

I've tried adding a synchronization loop that tests for dbus-daemon
socket readiness (able to make a connection) to its shepherd service but
it didn't help.

If elogind is restarted post boot login1 shows up and all is happy.

[elogind.log (text/plain, attachment)]
[Message part 3 (text/plain, inline)]
Thanks!

Maxim

Information forwarded to bug-guix <at> gnu.org:
bug#52051; Package guix. (Thu, 09 Dec 2021 22:16:02 GMT) Full text and rfc822 format available.

Message #43 received at 52051 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Cc: Josselin Poiret <dev <at> jpoiret.xyz>, 52051 <at> debbugs.gnu.org
Subject: Re: bug#52051: [core-updates-frozen] cannot login
 ('org.freedesktop.login1' service times out)
Date: Thu, 09 Dec 2021 23:15:10 +0100
Hello!

Maxim Cournoyer <maxim.cournoyer <at> gmail.com> skribis:

> 374   connect(11, {sa_family=AF_UNIX, sun_path="/var/run/dbus/system_bus_socket"}, 34) = 0

[...]

> 374   epoll_wait(5, [{events=EPOLLIN|EPOLLOUT|EPOLLHUP, data={u32=24802800, u64=24802800}}], 20, -1) = 1
> 374   sendmsg(11, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\1\0\1\0\0\0\0\1\0\0\0m\0\0\0\1\1o\0\25\0\0\0/org/freedesktop/DBus\0\0\0\3\1s\0\5\0\0\0Hello\0\0\0\2\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\6\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0", iov_len=128}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = -1 EPIPE (Broken pipe)
> 374   gettid()                          = 374
> 374   epoll_ctl(5, EPOLL_CTL_MOD, 11, {events=0, data={u32=24802800, u64=24802800}}) = 0
> 374   timerfd_settime(12, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0
> 374   epoll_wait(5, [{events=EPOLLHUP, data={u32=24802800, u64=24802800}}, {events=EPOLLIN, data={u32=24764384, u64=24764384}}], 20, -1) = 2
> 374   read(12, "\1\0\0\0\0\0\0\0", 8)   = 8
> 374   gettid()                          = 374
> 374   timerfd_settime(12, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0
> 374   epoll_wait(5, [{events=EPOLLHUP, data={u32=24802800, u64=24802800}}, {events=EPOLLIN, data={u32=24764384, u64=24764384}}], 20, -1) = 2
> 374   read(12, "\1\0\0\0\0\0\0\0", 8)   = 8
> 374   gettid()                          = 374
> 374   timerfd_settime(12, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0
> 374   epoll_wait(5, [{events=EPOLLHUP, data={u32=24802800, u64=24802800}}, {events=EPOLLIN, data={u32=24764384, u64=24764384}}], 20, -1) = 2
> 374   read(12, "\1\0\0\0\0\0\0\0", 8)   = 8
> 374   epoll_ctl(5, EPOLL_CTL_DEL, 11, NULL) = 0
> 374   close(11)                         = 0
> 374   gettid()                          = 374
> 374   epoll_wait(5,  <unfinished ...>
> 391   <... close resumed>)              = 0
> 391   madvise(0x7fd6c83dc000, 8368128, MADV_DONTNEED) = 0
> 391   exit(0)                           = ?
> 391   +++ exited with 0 +++
> 374   <... epoll_wait resumed>[{events=EPOLLERR, data={u32=24768000, u64=24768000}}], 17, -1) = 1
> 374   lseek(7, 0, SEEK_SET)             = 0
> 374   read(7, "tty7\n", 63)             = 5

As you pointed out on IRC, the initially ‘Hello’ method call above leads
to EPIPE, and we can see that elogind eventually closes its socket to
dbus-daemon *but* keeps doing its thing.

Some interesting things to note…

First, to my surprise, elogind does not use the client library of the
‘dbus’ package:

--8<---------------cut here---------------start------------->8---
$ guix gc --references $(./pre-inst-env guix build elogind)|grep dbus
$ echo $?
1
--8<---------------cut here---------------end--------------->8---

(This is already the case in ‘master’ with v243.7.)  Instead, it has its
own implementation of the DBus protocol, in C, from systemd—we can’t
have enough sources of bugs and vulnerabilities.

Anyway, the “Hello” message is sent to the system bus asynchronously in
‘sd-bus.c’:

--8<---------------cut here---------------start------------->8---
static int bus_send_hello(sd_bus *bus) {
        _cleanup_(sd_bus_message_unrefp) sd_bus_message *m = NULL;
        int r;

        assert(bus);

        if (!bus->bus_client)
                return 0;

        r = sd_bus_message_new_method_call(
                        bus,
                        &m,
                        "org.freedesktop.DBus",
                        "/org/freedesktop/DBus",
                        "org.freedesktop.DBus",
                        "Hello");
        if (r < 0)
                return r;

        return sd_bus_call_async(bus, NULL, m, hello_callback, NULL, 0);
}
--8<---------------cut here---------------end--------------->8---

A callback is called when a reply is received or an error arises:

--8<---------------cut here---------------start------------->8---
static int hello_callback(sd_bus_message *reply, void *userdata, sd_bus_error *error) {

[...]

fail:
        /* When Hello() failed, let's propagate this in two ways: first we return the error immediately here,
         * which is the propagated up towards the event loop. Let's also invalidate the connection, so that
         * if the user then calls back into us again we won't wait any longer. */

        bus_set_state(bus, BUS_CLOSING);
        return r;
}
--8<---------------cut here---------------end--------------->8---

It’s not clear from that whether the authors intended for the thing to
keep going in case of failure.  In our case it’s not helpful.

But why does dbus-daemon drop the connection in the first place?

To know that, we could change ‘dbus-root-service-type’ to run
dbus-daemon from a ‘--enable-verbose-mode’ build, and with the
‘DBUS_VERBOSE’ environment set to 1.

Looking at ‘dbus-server-socket.c’ it would seem that the only sensible
way this can happen is if dbus-daemon doesn’t yet have a
‘new_connection_function’ set at the time it accepts the incoming
connection:

--8<---------------cut here---------------start------------->8---
static dbus_bool_t
handle_new_client_fd_and_unlock (DBusServer *server,
                                 DBusSocket  client_fd)
{

[...]

  /* See if someone wants to handle this new connection, self-referencing
   * for paranoia.
   */
  new_connection_function = server->new_connection_function;
  new_connection_data = server->new_connection_data;

  _dbus_server_ref_unlocked (server);
  SERVER_UNLOCK (server);

  if (new_connection_function)
    {
      (* new_connection_function) (server, connection,
                                   new_connection_data);
    }
  dbus_server_unref (server);

  /* If no one grabbed a reference, the connection will die. */
  _dbus_connection_close_if_only_one_ref (connection);
  dbus_connection_unref (connection);
--8<---------------cut here---------------end--------------->8---

We know that elogind is started after dbus-daemon has written its PID
file (there’s a Shepherd service dependency).  Is there a possibility
that dbus-daemon writes its PID file before it has set
‘new_connection_function’?

It would seem that ‘bus_context_new’ writes the PID file after it has
called ‘dbus_server_set_new_connection_function’ via ‘setup_server’ via
‘process_config_first_time_only’.

So not sure what happens.

That’s it for today.  Thoughts?

Ludo’.




Information forwarded to bug-guix <at> gnu.org:
bug#52051; Package guix. (Fri, 10 Dec 2021 05:10:02 GMT) Full text and rfc822 format available.

Message #46 received at 52051 <at> debbugs.gnu.org (full text, mbox):

From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: Josselin Poiret <dev <at> jpoiret.xyz>, 52051 <at> debbugs.gnu.org
Subject: Re: bug#52051: [core-updates-frozen] cannot login
 ('org.freedesktop.login1' service times out)
Date: Fri, 10 Dec 2021 00:09:28 -0500
[Message part 1 (text/plain, inline)]
Hello Ludovic!

Ludovic Courtès <ludo <at> gnu.org> writes:

> Hello!
>
> Maxim Cournoyer <maxim.cournoyer <at> gmail.com> skribis:
>
>> 374   connect(11, {sa_family=AF_UNIX, sun_path="/var/run/dbus/system_bus_socket"}, 34) = 0
>
> [...]
>
>> 374   epoll_wait(5, [{events=EPOLLIN|EPOLLOUT|EPOLLHUP, data={u32=24802800, u64=24802800}}], 20, -1) = 1
>> 374 sendmsg(11, {msg_name=NULL, msg_namelen=0,
>> msg_iov=[{iov_base="l\1\0\1\0\0\0\0\1\0\0\0m\0\0\0\1\1o\0\25\0\0\0/org/freedesktop/DBus\0\0\0\3\1s\0\5\0\0\0Hello\0\0\0\2\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\6\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0",
>> iov_len=128}], msg_iovlen=1, msg_controllen=0, msg_flags=0},
>> MSG_DONTWAIT|MSG_NOSIGNAL) = -1 EPIPE (Broken pipe)
>> 374   gettid()                          = 374
>> 374   epoll_ctl(5, EPOLL_CTL_MOD, 11, {events=0, data={u32=24802800, u64=24802800}}) = 0
>> 374   timerfd_settime(12, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0
>> 374   epoll_wait(5, [{events=EPOLLHUP, data={u32=24802800, u64=24802800}}, {events=EPOLLIN, data={u32=24764384, u64=24764384}}], 20, -1) = 2
>> 374   read(12, "\1\0\0\0\0\0\0\0", 8)   = 8
>> 374   gettid()                          = 374
>> 374   timerfd_settime(12, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0
>> 374   epoll_wait(5, [{events=EPOLLHUP, data={u32=24802800, u64=24802800}}, {events=EPOLLIN, data={u32=24764384, u64=24764384}}], 20, -1) = 2
>> 374   read(12, "\1\0\0\0\0\0\0\0", 8)   = 8
>> 374   gettid()                          = 374
>> 374   timerfd_settime(12, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0
>> 374   epoll_wait(5, [{events=EPOLLHUP, data={u32=24802800, u64=24802800}}, {events=EPOLLIN, data={u32=24764384, u64=24764384}}], 20, -1) = 2
>> 374   read(12, "\1\0\0\0\0\0\0\0", 8)   = 8
>> 374   epoll_ctl(5, EPOLL_CTL_DEL, 11, NULL) = 0
>> 374   close(11)                         = 0
>> 374   gettid()                          = 374
>> 374   epoll_wait(5,  <unfinished ...>
>> 391   <... close resumed>)              = 0
>> 391   madvise(0x7fd6c83dc000, 8368128, MADV_DONTNEED) = 0
>> 391   exit(0)                           = ?
>> 391   +++ exited with 0 +++
>> 374   <... epoll_wait resumed>[{events=EPOLLERR, data={u32=24768000, u64=24768000}}], 17, -1) = 1
>> 374   lseek(7, 0, SEEK_SET)             = 0
>> 374   read(7, "tty7\n", 63)             = 5
>
> As you pointed out on IRC, the initially ‘Hello’ method call above leads
> to EPIPE, and we can see that elogind eventually closes its socket to
> dbus-daemon *but* keeps doing its thing.
>
> Some interesting things to note…

Indeed, very interesting :-).  Thanks for diving in!

> First, to my surprise, elogind does not use the client library of the
> ‘dbus’ package:
>
> $ guix gc --references $(./pre-inst-env guix build elogind)|grep dbus
> $ echo $?
> 1
>
>
> (This is already the case in ‘master’ with v243.7.)  Instead, it has its
> own implementation of the DBus protocol, in C, from systemd—we can’t
> have enough sources of bugs and vulnerabilities.

Oh!  Marius found this interesting issue [0] that they shared on IRC
today; I wonder if it may be related.  sd-bus apparently pipeline things
aggressively, which is not always handled well by other D-bus
participants.

[0]  https://github.com/systemd/systemd/issues/16610

> Anyway, the “Hello” message is sent to the system bus asynchronously in
> ‘sd-bus.c’:
>
> static int bus_send_hello(sd_bus *bus) {
>         _cleanup_(sd_bus_message_unrefp) sd_bus_message *m = NULL;
>         int r;
>
>         assert(bus);
>
>         if (!bus->bus_client)
>                 return 0;
>
>         r = sd_bus_message_new_method_call(
>                         bus,
>                         &m,
>                         "org.freedesktop.DBus",
>                         "/org/freedesktop/DBus",
>                         "org.freedesktop.DBus",
>                         "Hello");
>         if (r < 0)
>                 return r;
>
>         return sd_bus_call_async(bus, NULL, m, hello_callback, NULL, 0);
> }
>
>
> A callback is called when a reply is received or an error arises:
>
> static int hello_callback(sd_bus_message *reply, void *userdata, sd_bus_error *error) {
>
> [...]
>
> fail:
>         /* When Hello() failed, let's propagate this in two ways: first we return the error immediately here,
>          * which is the propagated up towards the event loop. Let's also invalidate the connection, so that
>          * if the user then calls back into us again we won't wait any longer. */
>
>         bus_set_state(bus, BUS_CLOSING);
>         return r;
> }
>
>
> It’s not clear from that whether the authors intended for the thing to
> keep going in case of failure.  In our case it’s not helpful.

If we picture this in the systemd use case, I believe sd-bus must be
used as *the* d-bus daemon, perhaps?  So it should never die, and expect
users to call back into it to retry things?  In our case, it acts as a
D-Bus client, not a server (IIUC), so perhaps its behavior is not tuned
right for our use case.

> But why does dbus-daemon drop the connection in the first place?
>
> To know that, we could change ‘dbus-root-service-type’ to run
> dbus-daemon from a ‘--enable-verbose-mode’ build, and with the
> ‘DBUS_VERBOSE’ environment set to 1.
>
> Looking at ‘dbus-server-socket.c’ it would seem that the only sensible
> way this can happen is if dbus-daemon doesn’t yet have a
> ‘new_connection_function’ set at the time it accepts the incoming
> connection:

Interesting that you mention this; that's what I worked on yesterday
(see attached patches).  I managed to get elogind panicking the kernel
during a guix system reconfigure, which corrupted GRUB, so had to chroot
and reconfigure from there [1].  Not sure what happened, but it seems
that killing and restarting elogind is susceptible to cause hard locks.

[1]  https://lists.gnu.org/archive/html/guix-devel/2018-06/msg00202.html

[0001-gnu-Add-dbus-verbose.patch (text/x-patch, attachment)]
[0002-services-dbus-Add-a-VERBOSE-configuration-option.patch (text/x-patch, attachment)]
[0003-build-Move-D-Bus-parser-to-its-own-build-module.patch (text/x-patch, attachment)]
[0004-services-telephony-Factorize-bus-polling-logic.patch (text/x-patch, attachment)]
[0005-services-dbus-Add-synchronization-primitives-to-dbus.patch (text/x-patch, attachment)]
[Message part 7 (text/plain, inline)]

> static dbus_bool_t
> handle_new_client_fd_and_unlock (DBusServer *server,
>                                  DBusSocket  client_fd)
> {
>
> [...]
>
>   /* See if someone wants to handle this new connection, self-referencing
>    * for paranoia.
>    */
>   new_connection_function = server->new_connection_function;
>   new_connection_data = server->new_connection_data;
>
>   _dbus_server_ref_unlocked (server);
>   SERVER_UNLOCK (server);
>
>   if (new_connection_function)
>     {
>       (* new_connection_function) (server, connection,
>                                    new_connection_data);
>     }
>   dbus_server_unref (server);
>
>   /* If no one grabbed a reference, the connection will die. */
>   _dbus_connection_close_if_only_one_ref (connection);
>   dbus_connection_unref (connection);
>
> We know that elogind is started after dbus-daemon has written its PID
> file (there’s a Shepherd service dependency).  Is there a possibility
> that dbus-daemon writes its PID file before it has set
> ‘new_connection_function’?

Are PID files conventionally agreed to be synchronization primitives?
In the case of systemd, I feel this may be more of an afterthought,
given the system is designed to rely on socket-based activation.  Still
just guessing though, and your analysis above seems to suggest
otherwise.

> It would seem that ‘bus_context_new’ writes the PID file after it has
> called ‘dbus_server_set_new_connection_function’ via ‘setup_server’ via
> ‘process_config_first_time_only’.
>
> So not sure what happens.
>
> That’s it for today.  Thoughts?

Thank you!

I'm sorry for my lack of deep analysis here, I've spent most of my
evening attempting to fix my system just to boot ^^'.  I've at least
managed to collect the following verbose D-Bus log (54 MiB uncompressed)
which hopefully can shed some light onto how this failure came to be.

[dbus-daemon.log.xz (application/octet-stream, attachment)]
[Message part 9 (text/plain, inline)]
Happy Friday!

Maxim

Information forwarded to bug-guix <at> gnu.org:
bug#52051; Package guix. (Fri, 10 Dec 2021 10:18:01 GMT) Full text and rfc822 format available.

Message #49 received at 52051 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Cc: Josselin Poiret <dev <at> jpoiret.xyz>, 52051 <at> debbugs.gnu.org
Subject: Re: bug#52051: [core-updates-frozen] cannot login
 ('org.freedesktop.login1' service times out)
Date: Fri, 10 Dec 2021 11:17:30 +0100
Hi Maxim!

Maxim Cournoyer <maxim.cournoyer <at> gmail.com> skribis:

[...]

> Oh!  Marius found this interesting issue [0] that they shared on IRC
> today; I wonder if it may be related.  sd-bus apparently pipeline things
> aggressively, which is not always handled well by other D-bus
> participants.
>
> [0]  https://github.com/systemd/systemd/issues/16610

Interesting.

>> Anyway, the “Hello” message is sent to the system bus asynchronously in
>> ‘sd-bus.c’:

[...]

>> static int hello_callback(sd_bus_message *reply, void *userdata, sd_bus_error *error) {
>>
>> [...]
>>
>> fail:
>>         /* When Hello() failed, let's propagate this in two ways: first we return the error immediately here,
>>          * which is the propagated up towards the event loop. Let's also invalidate the connection, so that
>>          * if the user then calls back into us again we won't wait any longer. */
>>
>>         bus_set_state(bus, BUS_CLOSING);
>>         return r;
>> }
>>
>>
>> It’s not clear from that whether the authors intended for the thing to
>> keep going in case of failure.  In our case it’s not helpful.
>
> If we picture this in the systemd use case, I believe sd-bus must be
> used as *the* d-bus daemon, perhaps?  So it should never die, and expect
> users to call back into it to retry things?  In our case, it acts as a
> D-Bus client, not a server (IIUC), so perhaps its behavior is not tuned
> right for our use case.

I think systemd-logind is a separate process, just like elogind, that
talks over D-Bus; I don’t think there’s any difference here.  It just
seems that this corner case, where we don’t get a reply for ‘Hello’, is
not correctly handled.

> Interesting that you mention this; that's what I worked on yesterday
> (see attached patches).  I managed to get elogind panicking the kernel
> during a guix system reconfigure, which corrupted GRUB, so had to chroot
> and reconfigure from there [1].  Not sure what happened, but it seems
> that killing and restarting elogind is susceptible to cause hard locks.

Ouch!  This is weird.  Note that “everything” depends on elogind, as can
be seen with ‘guix system shepherd-graph’.  So “herd stop elogind” is
not something you should try at home.

Incidentally, can you reproduce the bug in a VM?  That would be much
nicer.

>> We know that elogind is started after dbus-daemon has written its PID
>> file (there’s a Shepherd service dependency).  Is there a possibility
>> that dbus-daemon writes its PID file before it has set
>> ‘new_connection_function’?
>
> Are PID files conventionally agreed to be synchronization primitives?

Yes.  Daemons are supposed to write their PID file once they’re ready to
do their job.

> I'm sorry for my lack of deep analysis here, I've spent most of my
> evening attempting to fix my system just to boot ^^'.  I've at least
> managed to collect the following verbose D-Bus log (54 MiB uncompressed)
> which hopefully can shed some light onto how this failure came to be.

Excerpt:

--8<---------------cut here---------------start------------->8---
366: 0x7f28f396e740: 1639108961.938358 [dbus-sysdeps-util-unix.c(237):_dbus_write_pid_to_file_and_pipe] writing pid file /var/run/dbus/pid
366: 0x7f28f396e740: 1639108961.938438 [dbus-sysdeps-util-unix.c(291):_dbus_write_pid_to_file_and_pipe] No pid pipe to write to
366: 0x7f28f396e740: 1639108961.938474 [dbus-userdb.c(177):_dbus_user_database_lookup] Using cache for UID 986 information
366: 0x7f28f396e740: 1639108961.938665 [dbus-sysdeps-unix.c(3514):_dbus_socketpair] full-duplex pipe 6 <-> 7
366: 0x7f28f396e740: 1639108961.938700 [main.c(719):main] We are on D-Bus...
366: 0x7f28f396e740: 1639108961.938725 [dbus-mainloop.c(884):_dbus_loop_run] Running main loop, depth 0 -> 1
366: 0x7f28f396e740: 1639108962.566557 [dbus-server-socket.c(182):socket_handle_watch] Handling client connection, flags 0x1
366: 0x7f28f396e740: 1639108962.566623 [dbus-sysdeps-unix.c(2344):_dbus_accept] client fd 8 accepted
366: 0x7f28f396e740: 1639108962.566665 [dbus-server-socket.c(94):handle_new_client_fd_and_unlock] Creating new client connection with fd 8
366: 0x7f28f396e740: 1639108962.566779 [dbus-connection.c(1360):_dbus_connection_new_for_transport] LOCK
366: 0x7f28f396e740: 1639108962.566824 [dbus-transport-socket.c(180):check_read_watch] fd = 8
366: 0x7f28f396e740: 1639108962.566869 [dbus-transport-socket.c(227):check_read_watch]   setting read watch enabled = 1

[...]

366: 0x7f28f396e740: 1639108962.568765 [dbus-transport-socket.c(974):socket_handle_watch] handling read watch 0x23c2b00 flags = 9
366: 0x7f28f396e740: 1639108962.568807 [dbus-transport-socket.c(348):exchange_credentials] exchange_credentials: do_reading = 1, do_writing = 0
366: 0x7f28f396e740: 1639108962.568862 [dbus-transport-socket.c(383):exchange_credentials] Failed to read credentials Failed to read credentials byte (zero-length read)
366: 0x7f28f396e740: 1639108962.568904 [dbus-transport.c(510):_dbus_transport_disconnect] start
366: 0x7f28f396e740: 1639108962.568945 [dbus-transport-socket.c(1041):socket_disconnect] 
366: 0x7f28f396e740: 1639108962.568986 [dbus-transport-socket.c(76):free_watches] start
366: 0x7f28f396e740: 1639108962.569029 [dbus-watch.c(423):_dbus_watch_list_remove_watch] Removing watch on fd 8
--8<---------------cut here---------------end--------------->8---

We see this sequence:

  1. Writes PID file.

  2. Accepts incoming connection.

  3. “Failed to read credentials”.

  4. Close connection.

So we need to understand #3.  Perhaps going back to strace logs could
help this time…

Thanks,
Ludo’.




Information forwarded to bug-guix <at> gnu.org:
bug#52051; Package guix. (Wed, 15 Dec 2021 19:29:01 GMT) Full text and rfc822 format available.

Message #52 received at 52051 <at> debbugs.gnu.org (full text, mbox):

From: Michael Rohleder <mike <at> rohleder.de>
To: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Cc: 52051 <at> debbugs.gnu.org
Subject: Re: bug#52051: [core-updates-frozen] cannot login
 ('org.freedesktop.login1' service times out)
Date: Wed, 15 Dec 2021 20:27:33 +0100
[Message part 1 (text/plain, inline)]
Hi Maxim,

I currently have this (very annoying) issue on _one_ of my machines (two
others work with nearly the same config).
I can't login at all not via console nor ssh or sddm.
I spend some time to reproduce it in a vm, but no success so far.

These are the relevant messages from syslog:
Dec 15 18:15:52 micha dbus-daemon[470]: [system] Failed to activate service 'org.freedesktop.login1': timed out (service_start_timeout=25000ms)
Dec 15 18:16:47 micha dbus-daemon[470]: [system] Activating service name='org.freedesktop.login1' requested by ':1.8' (uid=0 pid=899 comm="/gnu/store/ximad0zvg12r4x0x80mvym8hzg0n33jl-shadow") (using servicehelper)
Dec 15 18:16:47 micha elogind[935]: elogind is already running as PID 558
Dec 15 18:17:12 micha dbus-daemon[470]: [system] Failed to activate service 'org.freedesktop.login1': timed out (service_start_timeout=25000ms)

-- 
How much does it cost to entice a dope-smoking UNIX system guru to Dayton?
		-- Brian Boyle, UNIX/WORLD's First Annual Salary Survey
[signature.asc (application/pgp-signature, inline)]

Information forwarded to bug-guix <at> gnu.org:
bug#52051; Package guix. (Thu, 16 Dec 2021 04:45:01 GMT) Full text and rfc822 format available.

Message #55 received at 52051 <at> debbugs.gnu.org (full text, mbox):

From: Abhiseck Paira <abhiseckpaira <at> disroot.org>
To: 52051 <at> debbugs.gnu.org
Subject: [core-updates-frozen] cannot login ('org.freedesktop.login1'
 service times out)
Date: Thu, 16 Dec 2021 09:06:21 +0530
[Message part 1 (text/plain, inline)]
Hello,

I am getting the same errors others have mentioned. I can't login via
tty and GDM wouldn't launch either. This happened after I pulled this
commit 73eb941b8b3793aec6110a4ae28bdbfc3ab4f6c5 with guix pull and ran
system reconfigure.

-- 
Abhiseck Paira
E34E 825B 979E EB9F 8505  F80E E93D 353B 7740 0709
https://abhiseck.neocities.org
[signature.asc (application/pgp-signature, inline)]

Information forwarded to bug-guix <at> gnu.org:
bug#52051; Package guix. (Thu, 16 Dec 2021 13:18:02 GMT) Full text and rfc822 format available.

Message #58 received at submit <at> debbugs.gnu.org (full text, mbox):

From: ison <ison <at> airmail.cc>
To: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Cc: bug-guix <at> gnu.org, 52051 <at> debbugs.gnu.org
Subject: Re: bug#52051: [core-updates-frozen] cannot login
 ('org.freedesktop.login1' service times out)
Date: Thu, 16 Dec 2021 06:09:01 -0700
I may have found a clue about what's happening.

After updating I ran into this same issue (I login from TTY), and I see
the same "elogind is already running" error.
However, if I boot into an older generation (before the
core-updates-frozen merge), and then again reboot back into the newest
generation (after core-updates-frozen merge) then I am able to login.

I was able to reproduce those exact steps too. Reboot to new generation,
can't login again. Reboot to old generation, then reboot back to new
generation and I can log in.

Note that even when I'm able to login on the new generation I still see
the "elogind is already running" error printed, but only before the
login prompt shows up, not after.

Also I'm not sure if this matters, but another difference I notice
during the attempts when I can successfully login on the new generation
is that I see Herd restart the `nscd` service.
Perhaps someone else experiencing this issue should try to restart nscd
(via SSH) and see if that works for you.

Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:

> Hello,
>
> I'm not 100% sure this is the cause, but these are the last messages I
> have before I rebooted:
>
> Nov 23 01:09:14 localhost dbus-daemon[383]: [system] Activating service name='org.freedesktop.login1' requested by ':1.17' (uid=0 pid=370
>     comm="/gnu/store/ximad0zvg12r4x0x80mvym8hzg0n33jl-shadow") (using servicehelper)
> Nov 23 01:09:14 localhost elogind[1189]: elogind is already running as PID 390
> Nov 23 01:09:20 localhost shepherd[1]: Respawning term-tty1. 
> Nov 23 01:09:20 localhost shepherd[1]: Service host-name has been started. 
> Nov 23 01:09:20 localhost shepherd[1]: Service term-tty1 has been started.
> Nov 23 01:09:39 localhost dbus-daemon[383]: [system] Failed to activate service 'org.freedesktop.login1': timed out (service_start_timeout=25000ms)
>
> I don't remember if I saw the slim login screen; but in any case I
> couldn't successfully login even via a ptty.
>
> It may have to do with polkit.
>
> To be investigated.
>
> This happened on a system *not* using gdm (it uses slim) and with
> ratpoison as the WM, on commit f42bc604547d9ee8e35fcd66d5db7786954cfac3
> of the core-updates-frozen branch.
>
> To be investigated.
>
> Thanks,
>
> Maxim





Information forwarded to bug-guix <at> gnu.org:
bug#52051; Package guix. (Thu, 16 Dec 2021 13:18:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-guix <at> gnu.org:
bug#52051; Package guix. (Thu, 16 Dec 2021 18:12:02 GMT) Full text and rfc822 format available.

Message #64 received at 52051 <at> debbugs.gnu.org (full text, mbox):

From: Guillaume Le Vaillant <glv <at> posteo.net>
To: 52051 <at> debbugs.gnu.org
Subject: Re: bug#52051: [core-updates-frozen] cannot login
Date: Thu, 16 Dec 2021 17:59:33 +0000
[Message part 1 (text/plain, inline)]
Hi,

I just got this login error when updating an old machine with a HDD as
storage. On some other faster machines using SSD or NVME storage this
issue never happened, so I thought the error might be triggered by
slow IO.

Do some of you also see the issue on fast machines/storage?
[signature.asc (application/pgp-signature, inline)]

Information forwarded to bug-guix <at> gnu.org:
bug#52051; Package guix. (Thu, 16 Dec 2021 18:25:01 GMT) Full text and rfc822 format available.

Message #67 received at 52051 <at> debbugs.gnu.org (full text, mbox):

From: Josselin Poiret <dev <at> jpoiret.xyz>
To: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>, Ludovic Courtès <ludo <at> gnu.org>
Cc: dev <at> jpoiret.xyz, 52051 <at> debbugs.gnu.org
Subject: Re: bug#52051: [core-updates-frozen] cannot login
 ('org.freedesktop.login1' service times out)
Date: Thu, 16 Dec 2021 19:24:11 +0100
Hello Maxim and Ludovic,

Seeing all the new activity on this bug, I decided to take a closer
look.

It doesn't seem to me that the credential byte read is the problem, as
can be seen by the elogind strace: it sends

--8<---------------cut here---------------start------------->8---
374   sendmsg(11, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0AUTH EXTERNAL\r\nDATA\r\n", iov_len=22}, {iov_base="NEGOTIATE_UNIX_FD\r\n", iov_len=19}, {iov_base="BEGIN\r\n", iov_len=7}], msg_iovlen=3, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 48
--8<---------------cut here---------------end--------------->8---

to which dbus answers with

--8<---------------cut here---------------start------------->8---
374   recvmsg(11, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="DATA\r\nOK ea94265b9365f7158f1d81ee61b20dab\r\nAGREE_UNIX_FD\r\n", iov_len=256}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 58
--8<---------------cut here---------------end--------------->8---

so dbus doesn't seem to choke on the initial authentication (where the
credentials read happens) and it's clear that the byte 0 is properly
sent at the beginning of the handshake procedure.

Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:

> I'm sorry for my lack of deep analysis here, I've spent most of my
> evening attempting to fix my system just to boot ^^'.  I've at least
> managed to collect the following verbose D-Bus log (54 MiB uncompressed)
> which hopefully can shed some light onto how this failure came to be.

Would it be possible to get both the elogind strace and the dbus log for
the same session?  We could then simply `grep` the authentication cookie
sent back by dbus to elogind to track the corresponding fd in the dbus
log.

Best,
Josselin Poiret




Information forwarded to bug-guix <at> gnu.org:
bug#52051; Package guix. (Thu, 16 Dec 2021 18:54:01 GMT) Full text and rfc822 format available.

Message #70 received at 52051 <at> debbugs.gnu.org (full text, mbox):

From: Michael Rohleder <mike <at> rohleder.de>
To: 52051 <at> debbugs.gnu.org
Subject: Re: bug#52051: [core-updates-frozen] cannot login
 ('org.freedesktop.login1' service times out)
Date: Thu, 16 Dec 2021 19:53:30 +0100
[Message part 1 (text/plain, inline)]
Hi Guillaume,

Guillaume Le Vaillant <glv <at> posteo.net> writes:
> I just got this login error when updating an old machine with a HDD as
> storage. On some other faster machines using SSD or NVME storage this
> issue never happened, so I thought the error might be triggered by
> slow IO.
>
> Do some of you also see the issue on fast machines/storage?

I can confirm this. For me, this happens on a machine with a HDD, but
not on another with a SSD nor my laptop with NVME.

(It also "feels" like, it's about timing or so)

-- 
Q: How does a Unix guru have sex?
A: unzip;strip;touch;finger;mount;fsck;more;yes;umount;sleep

[signature.asc (application/pgp-signature, inline)]

Information forwarded to bug-guix <at> gnu.org:
bug#52051; Package guix. (Mon, 20 Dec 2021 00:41:01 GMT) Full text and rfc822 format available.

Message #73 received at 52051 <at> debbugs.gnu.org (full text, mbox):

From: Caleb Herbert <csh <at> bluehome.net>
To: 52051 <at> debbugs.gnu.org
Subject: [core-updates-frozen] cannot login ('org.freedesktop.login1' service
 times out)
Date: Sun, 19 Dec 2021 18:40:27 -0600
[Message part 1 (text/plain, inline)]
I am using an X200 Tablet with an HDD.  This happened on my system.  I 
am now on Fedora with Guix until this issue is fixed.
[OpenPGP_0x42E4FDF80F03CA7C.asc (application/pgp-keys, attachment)]
[OpenPGP_signature (application/pgp-signature, attachment)]

Information forwarded to bug-guix <at> gnu.org:
bug#52051; Package guix. (Tue, 21 Dec 2021 09:32:01 GMT) Full text and rfc822 format available.

Message #76 received at 52051 <at> debbugs.gnu.org (full text, mbox):

From: Timothy Sample <samplet <at> ngyro.com>
To: Josselin Poiret <dev <at> jpoiret.xyz>
Cc: Ludovic Courtès <ludo <at> gnu.org>,
 52051 <at> debbugs.gnu.org, Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Subject: Re: bug#52051: [core-updates-frozen] cannot login
 ('org.freedesktop.login1' service times out)
Date: Tue, 21 Dec 2021 04:31:27 -0500
Hello,

I dug out an old HDD, put it in my X200, and was able to reproduce this.
Eventually I was able to log in to the thing by fiddling with the
“auth_timeout” parameter in the D-Bus config.

Josselin Poiret <dev <at> jpoiret.xyz> writes:

> [...]
>
> Would it be possible to get both the elogind strace and the dbus log for
> the same session?  We could then simply `grep` the authentication cookie
> sent back by dbus to elogind to track the corresponding fd in the dbus
> log.

This is exactly what I did (thanks for posting the patches, Maxim).
While looking around in the D-Bus log, I noticed it complaining about
authentication expiry around where it mentions the PID of elogind.  This
is in Maxim’s log, too.  It counts down an authentication expiry with
messages like “Connection 0x23b9f50 authentication expires in ...”.
That brought me to the D-Bus NEWS file, which mentions adjusting
“auth_timeout” to fix a boot regression for users with older hardware
[1].  The NEWS file mentions a bug report [2] that discusses how this
might be related to hard disk speed.

[1] https://gitlab.freedesktop.org/dbus/dbus/-/blob/master/NEWS#L2487
[2] https://bugs.freedesktop.org/show_bug.cgi?id=86431#c3

After reading that I tried with the timeout bumped up to a minute, and
the X200 booted into GDM just fine, twice in row, and then failed again
when I removed the change.  (I should add that it still printed the
“elogind is already running” messages, but it worked anyway.)

What’s weird is that this bug is very old (2014), and the default
timeout was increased at the time from 5s to 30s to solve the bug.  So
it’s part of the story, but not the whole story.


-- Tim




Information forwarded to bug-guix <at> gnu.org:
bug#52051; Package guix. (Tue, 21 Dec 2021 16:14:01 GMT) Full text and rfc822 format available.

Message #79 received at 52051 <at> debbugs.gnu.org (full text, mbox):

From: Leo Famulari <leo <at> famulari.name>
To: Timothy Sample <samplet <at> ngyro.com>
Cc: Josselin Poiret <dev <at> jpoiret.xyz>, 52051 <at> debbugs.gnu.org,
 Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Subject: Re: bug#52051: [core-updates-frozen] cannot login
 ('org.freedesktop.login1' service times out)
Date: Tue, 21 Dec 2021 11:13:35 -0500
On Tue, Dec 21, 2021 at 04:31:27AM -0500, Timothy Sample wrote:
> [1] https://gitlab.freedesktop.org/dbus/dbus/-/blob/master/NEWS#L2487
> [2] https://bugs.freedesktop.org/show_bug.cgi?id=86431#c3
> 
> After reading that I tried with the timeout bumped up to a minute, and
> the X200 booted into GDM just fine, twice in row, and then failed again
> when I removed the change.  (I should add that it still printed the
> “elogind is already running” messages, but it worked anyway.)

Okay, great. I don't "auth_timeout" in the Guix source tree. Can you say
where you adjusted it?




Information forwarded to bug-guix <at> gnu.org:
bug#52051; Package guix. (Tue, 21 Dec 2021 16:37:02 GMT) Full text and rfc822 format available.

Message #82 received at 52051 <at> debbugs.gnu.org (full text, mbox):

From: Timothy Sample <samplet <at> ngyro.com>
To: Leo Famulari <leo <at> famulari.name>
Cc: Josselin Poiret <dev <at> jpoiret.xyz>, 52051 <at> debbugs.gnu.org,
 Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Subject: Re: bug#52051: [core-updates-frozen] cannot login
 ('org.freedesktop.login1' service times out)
Date: Tue, 21 Dec 2021 11:36:22 -0500
[Message part 1 (text/plain, inline)]
Hi Leo,

Leo Famulari <leo <at> famulari.name> writes:

> On Tue, Dec 21, 2021 at 04:31:27AM -0500, Timothy Sample wrote:
>> [1] https://gitlab.freedesktop.org/dbus/dbus/-/blob/master/NEWS#L2487
>> [2] https://bugs.freedesktop.org/show_bug.cgi?id=86431#c3
>> 
>> After reading that I tried with the timeout bumped up to a minute, and
>> the X200 booted into GDM just fine, twice in row, and then failed again
>> when I removed the change.  (I should add that it still printed the
>> “elogind is already running” messages, but it worked anyway.)
>
> Okay, great. I don't "auth_timeout" in the Guix source tree. Can you say
> where you adjusted it?

Good question!  :)  Here’s the patch:

[auth-timeout.patch (text/x-patch, inline)]
diff --git a/gnu/services/dbus.scm b/gnu/services/dbus.scm
index 85a4c3e..a680ed7 100644
--- a/gnu/services/dbus.scm
+++ b/gnu/services/dbus.scm
@@ -106,6 +106,7 @@ (define-syntax directives
         (define (services->sxml services)
           ;; Return the SXML 'includedir' clauses for DIRS.
           `(busconfig
+            (limit (@ (name "auth_timeout")) "60000")
             (servicehelper "/run/setuid-programs/dbus-daemon-launch-helper")
 
             ;; First, the '.service' files of services subject to activation.

Information forwarded to bug-guix <at> gnu.org:
bug#52051; Package guix. (Tue, 21 Dec 2021 16:53:02 GMT) Full text and rfc822 format available.

Message #85 received at 52051 <at> debbugs.gnu.org (full text, mbox):

From: Maxime Devos <maximedevos <at> telenet.be>
To: Timothy Sample <samplet <at> ngyro.com>, Leo Famulari <leo <at> famulari.name>
Cc: 52051 <at> debbugs.gnu.org, Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Subject: Re: bug#52051: [core-updates-frozen] cannot login
 ('org.freedesktop.login1' service times out)
Date: Tue, 21 Dec 2021 16:52:19 +0000
Timothy Sample schreef op di 21-12-2021 om 11:36 [-0500]:
> +            (limit (@ (name "auth_timeout")) "60000")


Are there any good reasons for having a timeout at all?
(Except for the local-user denial of service, but local users can do
"guix build -f something-that-allocates-almost-all-memory-and-melts-
the-cpu.scm" anyway ...)

If not, can the timeout be disabled/set to infinity?

A comment like 

> ;; Set timeout to a huge number (16.6 hours), because
> ;; upstream often sets timeouts low for spinning disks,
> ;; slow CPUs, etc.
> (limit [...] "60000")

could be useful (I'm assuming the timeout is in seconds here).

Greetings,
Maxime.





Information forwarded to bug-guix <at> gnu.org:
bug#52051; Package guix. (Tue, 21 Dec 2021 17:33:01 GMT) Full text and rfc822 format available.

Message #88 received at 52051 <at> debbugs.gnu.org (full text, mbox):

From: Leo Famulari <leo <at> famulari.name>
To: Maxime Devos <maximedevos <at> telenet.be>
Cc: Timothy Sample <samplet <at> ngyro.com>, 52051 <at> debbugs.gnu.org,
 Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Subject: Re: bug#52051: [core-updates-frozen] cannot login
 ('org.freedesktop.login1' service times out)
Date: Tue, 21 Dec 2021 12:32:21 -0500
On Tue, Dec 21, 2021 at 04:52:19PM +0000, Maxime Devos wrote:
> Are there any good reasons for having a timeout at all?
> (Except for the local-user denial of service, but local users can do
> "guix build -f something-that-allocates-almost-all-memory-and-melts-
> the-cpu.scm" anyway ...)
> 
> If not, can the timeout be disabled/set to infinity?

Could be. But I think that, based on a several years of reports, the
X200 with an HDD is the slowest machine used with Guix System. On my
X200 with HDD, I have personally experienced similar race-y bugs that
seem to crop up after major upgrades --- I assume that it's a case of
bad luck, where important programs for booting move to distant parts of
the disk and seeking is too slow.

> A comment like 
> 
> > ;; Set timeout to a huge number (16.6 hours), because
> > ;; upstream often sets timeouts low for spinning disks,
> > ;; slow CPUs, etc.
> > (limit [...] "60000")
> 
> could be useful (I'm assuming the timeout is in seconds here).

I suggest we wait until such drastic action is necessary. Otherwise we
might be banging our heads against the wall in a few years, trying to
debug something. Let's not rush to extremes :)




Information forwarded to bug-guix <at> gnu.org:
bug#52051; Package guix. (Tue, 21 Dec 2021 17:38:01 GMT) Full text and rfc822 format available.

Message #91 received at 52051 <at> debbugs.gnu.org (full text, mbox):

From: Leo Famulari <leo <at> famulari.name>
To: Timothy Sample <samplet <at> ngyro.com>
Cc: Josselin Poiret <dev <at> jpoiret.xyz>, 52051 <at> debbugs.gnu.org,
 Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Subject: Re: bug#52051: [core-updates-frozen] cannot login
 ('org.freedesktop.login1' service times out)
Date: Tue, 21 Dec 2021 12:37:30 -0500
On Tue, Dec 21, 2021 at 11:36:22AM -0500, Timothy Sample wrote:
> diff --git a/gnu/services/dbus.scm b/gnu/services/dbus.scm
> index 85a4c3e..a680ed7 100644
> --- a/gnu/services/dbus.scm
> +++ b/gnu/services/dbus.scm
> @@ -106,6 +106,7 @@ (define-syntax directives
>          (define (services->sxml services)
>            ;; Return the SXML 'includedir' clauses for DIRS.
>            `(busconfig
> +            (limit (@ (name "auth_timeout")) "60000")
>              (servicehelper "/run/setuid-programs/dbus-daemon-launch-helper")
>  
>              ;; First, the '.service' files of services subject to activation.


With this patch, I can boot and login to X on my X200s with HDD.
Obviously this failure and subsequent fix are not deterministic, but
it's a good sign! Let's get some more testing! And comrades, let's test
the xorg-server update while doing it:

https://issues.guix.gnu.org/issue/52562




Information forwarded to bug-guix <at> gnu.org:
bug#52051; Package guix. (Tue, 21 Dec 2021 17:45:02 GMT) Full text and rfc822 format available.

Message #94 received at 52051 <at> debbugs.gnu.org (full text, mbox):

From: Leo Famulari <leo <at> famulari.name>
To: Timothy Sample <samplet <at> ngyro.com>
Cc: Josselin Poiret <dev <at> jpoiret.xyz>, 52051 <at> debbugs.gnu.org,
 Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Subject: Re: bug#52051: [core-updates-frozen] cannot login
 ('org.freedesktop.login1' service times out)
Date: Tue, 21 Dec 2021 12:44:47 -0500
[Message part 1 (text/plain, inline)]
On Tue, Dec 21, 2021 at 12:37:30PM -0500, Leo Famulari wrote:
> With this patch, I can boot and login to X on my X200s with HDD.
> Obviously this failure and subsequent fix are not deterministic, but
> it's a good sign! Let's get some more testing!

I pushed the patch to a Savannah branch to ease testing:

https://git.savannah.gnu.org/cgit/guix.git/log/?h=wip-fix-52051

So, you can do:

`guix pull --branch=wip-fix-52051 && guix system reconfigure [...]`

Or similar with time-machine
[signature.asc (application/pgp-signature, inline)]

Information forwarded to bug-guix <at> gnu.org:
bug#52051; Package guix. (Tue, 21 Dec 2021 17:52:02 GMT) Full text and rfc822 format available.

Message #97 received at 52051 <at> debbugs.gnu.org (full text, mbox):

From: Maxime Devos <maximedevos <at> telenet.be>
To: Leo Famulari <leo <at> famulari.name>
Cc: Timothy Sample <samplet <at> ngyro.com>, 52051 <at> debbugs.gnu.org,
 Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Subject: Re: bug#52051: [core-updates-frozen] cannot login
 ('org.freedesktop.login1' service times out)
Date: Tue, 21 Dec 2021 17:51:28 +0000
Leo Famulari schreef op di 21-12-2021 om 12:32 [-0500]:
> > A comment like 
> > 
> > > ;; Set timeout to a huge number (16.6 hours), because
> > > ;; upstream often sets timeouts low for spinning disks,
> > > ;; slow CPUs, etc.
> > > (limit [...] "60000")
> > 
> > could be useful (I'm assuming the timeout is in seconds here).
> 
> I suggest we wait until such drastic action is necessary. Otherwise
> we
> might be banging our heads against the wall in a few years, trying to
> debug something. Let's not rush to extremes :)

I don't think commenting new code is a drastic and extreme action.
I assume you were referring you were referring to the ‘16.6 hours’?
If so, just replace ‘16.6 hours’ by ‘one minute’, because it turns
out that 'auth_timeout' is in milliseconds and not seconds,
see <https://dbus.freedesktop.org/doc/dbus-daemon.1.html>.

Greetings,
Maxime.






Information forwarded to bug-guix <at> gnu.org:
bug#52051; Package guix. (Tue, 21 Dec 2021 18:12:02 GMT) Full text and rfc822 format available.

Message #100 received at 52051 <at> debbugs.gnu.org (full text, mbox):

From: Leo Famulari <leo <at> famulari.name>
To: Maxime Devos <maximedevos <at> telenet.be>
Cc: Timothy Sample <samplet <at> ngyro.com>, 52051 <at> debbugs.gnu.org,
 Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Subject: Re: bug#52051: [core-updates-frozen] cannot login
 ('org.freedesktop.login1' service times out)
Date: Tue, 21 Dec 2021 13:11:08 -0500
On Tue, Dec 21, 2021 at 05:51:28PM +0000, Maxime Devos wrote:
> I don't think commenting new code is a drastic and extreme action.
> I assume you were referring you were referring to the ‘16.6 hours’?

Right. We don't want to use a timeout of 16.6 hours. From the
perspective of a user, that's not meaningfully longer than, say, 30
minutes. And really, if the system can't bring up the login interface
for a laptop in 5 minutes, it's totally broken.

> If so, just replace ‘16.6 hours’ by ‘one minute’, because it turns
> out that 'auth_timeout' is in milliseconds and not seconds,
> see <https://dbus.freedesktop.org/doc/dbus-daemon.1.html>.

I think I was confused about the units here.




Information forwarded to bug-guix <at> gnu.org:
bug#52051; Package guix. (Wed, 22 Dec 2021 01:47:01 GMT) Full text and rfc822 format available.

Message #103 received at 52051 <at> debbugs.gnu.org (full text, mbox):

From: Luis Felipe <luis.felipe.la <at> protonmail.com>
To: "52051 <at> debbugs.gnu.org" <52051 <at> debbugs.gnu.org>
Subject: [core-updates-frozen] cannot login ('org.freedesktop.login1' service
 times out)
Date: Wed, 22 Dec 2021 01:46:41 +0000
[Message part 1 (text/plain, inline)]
Hello,

Just wanted to inform that I ran

`guix pull --branch=wip-fix-52051 && guix system reconfigure [...]`

and I could log in without problems now. Thanks.
[publickey - luis.felipe.la@protonmail.com - 0x12DE1598.asc (application/pgp-keys, attachment)]
[signature.asc (application/pgp-signature, attachment)]

Information forwarded to bug-guix <at> gnu.org:
bug#52051; Package guix. (Wed, 22 Dec 2021 02:20:02 GMT) Full text and rfc822 format available.

Message #106 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Leo Famulari <leo <at> famulari.name>
To: Luis Felipe via Bug reports for GNU Guix <bug-guix <at> gnu.org>
Cc: "52051 <at> debbugs.gnu.org" <52051 <at> debbugs.gnu.org>
Subject: Re: bug#52051: [core-updates-frozen] cannot login
 ('org.freedesktop.login1' service times out)
Date: Tue, 21 Dec 2021 21:19:20 -0500
[Message part 1 (text/plain, inline)]
On Wed, Dec 22, 2021 at 01:46:41AM +0000, Luis Felipe via Bug reports for GNU Guix wrote:
> Hello,
> 
> Just wanted to inform that I ran
> 
> `guix pull --branch=wip-fix-52051 && guix system reconfigure [...]`
> 
> and I could log in without problems now. Thanks.

Awesome!

Would someone like to write a proper commit message and add a code
comment?
[signature.asc (application/pgp-signature, inline)]

Information forwarded to bug-guix <at> gnu.org:
bug#52051; Package guix. (Wed, 22 Dec 2021 02:20:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-guix <at> gnu.org:
bug#52051; Package guix. (Wed, 22 Dec 2021 21:17:01 GMT) Full text and rfc822 format available.

Message #112 received at 52051 <at> debbugs.gnu.org (full text, mbox):

From: Leo Famulari <leo <at> famulari.name>
To: Timothy Sample <samplet <at> ngyro.com>
Cc: "52051 <at> debbugs.gnu.org" <52051 <at> debbugs.gnu.org>
Subject: Re: bug#52051: [core-updates-frozen] cannot login
 ('org.freedesktop.login1' service times out)
Date: Wed, 22 Dec 2021 16:16:30 -0500
[Message part 1 (text/plain, inline)]
On Tue, Dec 21, 2021 at 09:19:20PM -0500, Leo Famulari wrote:
> Would someone like to write a proper commit message and add a code
> comment?

How about the attached patch? I'd like to push this soon, because it's a
severe problem for some users.
[0001-services-dbus-Wait-1-minute-for-elogind-to-get-ready.patch (text/plain, attachment)]
[signature.asc (application/pgp-signature, inline)]

Reply sent to Leo Famulari <leo <at> famulari.name>:
You have taken responsibility. (Wed, 22 Dec 2021 23:39:01 GMT) Full text and rfc822 format available.

Notification sent to Maxim Cournoyer <maxim.cournoyer <at> gmail.com>:
bug acknowledged by developer. (Wed, 22 Dec 2021 23:39:02 GMT) Full text and rfc822 format available.

Message #117 received at 52051-done <at> debbugs.gnu.org (full text, mbox):

From: Leo Famulari <leo <at> famulari.name>
To: Timothy Sample <samplet <at> ngyro.com>
Cc: "52051 <at> debbugs.gnu.org" <52051-done <at> debbugs.gnu.org>
Subject: Re: bug#52051: [core-updates-frozen] cannot login
 ('org.freedesktop.login1' service times out)
Date: Wed, 22 Dec 2021 18:38:26 -0500
[Message part 1 (text/plain, inline)]
On Wed, Dec 22, 2021 at 04:16:30PM -0500, Leo Famulari wrote:
> On Tue, Dec 21, 2021 at 09:19:20PM -0500, Leo Famulari wrote:
> > Would someone like to write a proper commit message and add a code
> > comment?
> 
> How about the attached patch? I'd like to push this soon, because it's a
> severe problem for some users.

> From ab9c4de76dea889e5d05bcf7fa173868357d5f44 Mon Sep 17 00:00:00 2001
> From: Timothy Sample <samplet <at> ngyro.com>
> Date: Tue, 21 Dec 2021 11:52:34 -0500
> Subject: [PATCH] services: dbus: Wait 1 minute for elogind to get ready.
> 
> Fixes <https://issues.guix.gnu.org/issue/52051>.
> 
> * gnu/services/dbus.scm (dbus-configuration-directory): Set a 60 second timeout
> in the dbus config.

Pushed as 488f1c589df00e802163af534294d93372e5c025
[signature.asc (application/pgp-signature, inline)]

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 20 Jan 2022 12:24:11 GMT) Full text and rfc822 format available.

This bug report was last modified 2 years and 68 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.