GNU bug report logs - #76516
[shepherd] Timer not executed

Package: guix

Date: Sun, 23 Feb 2025 22:06:02 UTC

Severity: normal

Done: Ludovic Courtès <ludo <at> gnu.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 76516 in the body.
You can then email your comments to 76516 AT debbugs.gnu.org in the normal way.

Report forwarded to bug-guix <at> gnu.org:
bug#76516; Package guix. (Sun, 23 Feb 2025 22:06:02 GMT)

Acknowledgement sent to Tomas Volf <~@wolfsden.cz>:
New bug report received and forwarded. Copy sent to bug-guix <at> gnu.org. (Sun, 23 Feb 2025 22:06:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Tomas Volf <~@wolfsden.cz>
To: bug-guix <at> gnu.org
Subject: [shepherd] Timer not executed
Date: Sun, 23 Feb 2025 23:05:08 +0100

Hi,

I might have found a bug in shepherd timers.  I have a timer scheduled to
run every 24 hours; the definition is as follows:

--8<---------------cut here---------------start------------->8---
(define %kerberos-log-in-refresh-service
  (let ((name 'kerberos-log-in-refresh))
    (simple-service
     name
     home-shepherd-service-type
     (list (shepherd-service
            (documentation "Refresh the kerberos ticket.")
            (provision (list name))
            (requirement '(kerberos-reachable?))
            (start #~(make-timer-constructor
                      (calendar-event #:hours '(12) #:minutes '(0))
                      (command (list #$%kerberos-log-in))))
            (stop #~(make-timer-destructor))
            (modules (cons '(shepherd service timer)
                           %default-modules))
            (actions (list (shepherd-action
                            (name 'trigger)
                            (documentation "Immediately refresh the ticket.")
                            (procedure #~trigger-timer)))))))))
--8<---------------cut here---------------end--------------->8---

This should run every 24 hours (at noon) and execute the
%kerberos-log-in script (a simple Guile program that authenticates
against Kerberos).

However that did not happen.  Here are the logs:

--8<---------------cut here---------------start------------->8---
2025-02-22 19:17:00 Service kerberos-log-in running with value #<<process> id: 730 command: ("/gnu/store/8m21cnqnllk6g1kcgyj91i5h05s7c0c4-krb-log-in")>.
2025-02-22 19:17:00 [8m21cnqnllk6g1kcgyj91i5h05s7c0c4-krb-log-in] <redacted>
2025-02-22 19:17:00 [8m21cnqnllk6g1kcgyj91i5h05s7c0c4-krb-log-in] <redacted>
2025-02-22 19:17:00 [8m21cnqnllk6g1kcgyj91i5h05s7c0c4-krb-log-in] <redacted>
2025-02-22 19:17:00 [8m21cnqnllk6g1kcgyj91i5h05s7c0c4-krb-log-in] <redacted>
2025-02-23 12:00:02 Waiting anew for timer 'kerberos-log-in-refresh' (resuming from sleep state?).
2025-02-23 22:00:01 Not rotating '/home/<redacted>/.local/state/shepherd/dbus.log', which is below the 8192 B threshold.
--8<---------------cut here---------------end--------------->8---

The ones from 19:17:00 are from 'kerberos-log-in service, which is
one-shot executed upon login.  That went fine.

However the 'kerberos-log-in-refresh is only at 12:00:02, and only as
"Waiting anew ...".  The message indicates that the computer might be
resuming from sleep, however that was not the case here.  It is a
desktop machine, and it was left running over night.

Here is herd status:

--8<---------------cut here---------------start------------->8---
$ herd status kerberos-log-in-refresh
● Status of kerberos-log-in-refresh:
  It is running since Sat 22 Feb 2025 07:17:00 PM CET (28 hours ago).
  Timed service.
  Periodically running: /gnu/store/8m21cnqnllk6g1kcgyj91i5h05s7c0c4-krb-log-in
  It is enabled.
  Provides: kerberos-log-in-refresh
  Requires: kerberos-reachable?
  Custom action: trigger
  Will be respawned.

Upcoming timer alarms:
  Mon 24 Feb 2025 12:00:00 PM CET (in 13 hours)
  Tue 25 Feb 2025 12:00:00 PM CET (in 37 hours)
  Wed 26 Feb 2025 12:00:00 PM CET (in 3 days)
  Thu 27 Feb 2025 12:00:00 PM CET (in 4 days)
  Fri 28 Feb 2025 12:00:00 PM CET (in 5 days)
--8<---------------cut here---------------end--------------->8---



Have a nice day,
Tomas

-- 
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.

Information forwarded to bug-guix <at> gnu.org:
bug#76516; Package guix. (Mon, 24 Feb 2025 16:23:01 GMT)

Message #8 received at 76516 <at> debbugs.gnu.org:

From: Ludovic Courtès <ludo <at> gnu.org>
To: Tomas Volf <~@wolfsden.cz>
Cc: 76516 <at> debbugs.gnu.org
Subject: Re: bug#76516: [shepherd] Timer not executed
Date: Mon, 24 Feb 2025 17:22:19 +0100

Hi Tomas,

Tomas Volf <~@wolfsden.cz> skribis:

> However that did not happen.  Here are the logs:
>
> 2025-02-22 19:17:00 Service kerberos-log-in running with value #<<process> id: 730 command: ("/gnu/store/8m21cnqnllk6g1kcgyj91i5h05s7c0c4-krb-log-in")>.
> 2025-02-22 19:17:00 [8m21cnqnllk6g1kcgyj91i5h05s7c0c4-krb-log-in] <redacted>
> 2025-02-22 19:17:00 [8m21cnqnllk6g1kcgyj91i5h05s7c0c4-krb-log-in] <redacted>
> 2025-02-22 19:17:00 [8m21cnqnllk6g1kcgyj91i5h05s7c0c4-krb-log-in] <redacted>
> 2025-02-22 19:17:00 [8m21cnqnllk6g1kcgyj91i5h05s7c0c4-krb-log-in] <redacted>
> 2025-02-23 12:00:02 Waiting anew for timer 'kerberos-log-in-refresh' (resuming from sleep state?).
> 2025-02-23 22:00:01 Not rotating '/home/<redacted>/.local/state/shepherd/dbus.log', which is below the 8192 B threshold.
>
>
> The ones from 19:17:00 are from 'kerberos-log-in service, which is
> one-shot executed upon login.  That went fine.
>
> However the 'kerberos-log-in-refresh is only at 12:00:02, and only as
> "Waiting anew ...".  The message indicates that the computer might be
> resuming from sleep, however that was not the case here.  It is a
> desktop machine, and it was left running over night.

What architecture is this on?

From the excerpt above, the ‘log-rotation’ timer did fire as expected.
Did it also have “Waiting anew” messages?

Ludo’.

Information forwarded to bug-guix <at> gnu.org:
bug#76516; Package guix. (Mon, 24 Feb 2025 16:28:02 GMT)

Message #11 received at 76516 <at> debbugs.gnu.org:

From: Ludovic Courtès <ludo <at> gnu.org>
To: Tomas Volf <~@wolfsden.cz>
Cc: 76516 <at> debbugs.gnu.org
Subject: Re: bug#76516: [shepherd] Timer not executed
Date: Mon, 24 Feb 2025 17:27:03 +0100

Ludovic Courtès <ludo <at> gnu.org> skribis:

>> 2025-02-23 12:00:02 Waiting anew for timer 'kerberos-log-in-refresh' (resuming from sleep state?).

The “Waiting anew” message happens when the timer fires 2 seconds or
more later than expected (see ‘sleep-operation/check’), which is indeed
the case here.
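
For illustration, the check described above can be sketched in plain
Guile as follows.  This is a simplified sketch, not Shepherd's actual
‘sleep-operation/check’: the procedure name ‘classify-wakeup’ and its
arguments are hypothetical, and only the 2-second threshold comes from
this thread.

```scheme
;; Hypothetical helper, for illustration only; not part of Shepherd.
;; EXPIRY and NOW are in internal time units, as returned by
;; 'get-internal-real-time'; MAX-DELAY is in seconds.
(define (classify-wakeup expiry now max-delay)
  "Return 'overslept if NOW is more than MAX-DELAY seconds past EXPIRY,
and 'timeout otherwise."
  (let ((delta (- now expiry)))
    (if (> delta (* max-delay internal-time-units-per-second))
        'overslept
        'timeout)))

;; Example: waking up 2.5 seconds late with a 2-second tolerance.
(define late (* 5/2 internal-time-units-per-second))
(display (classify-wakeup 0 late 2))
(newline)
```

With a 2-second tolerance, a wake-up 2.5 seconds late is classified as
'overslept, which corresponds to the “Waiting anew” message; a wake-up
one second late would be classified as 'timeout.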

It’s not supposed to happen normally.  Before we bump that to 10
seconds, say, it would be good to understand why the timer got late
here.

Are there services that could block shepherd somehow, for instance by
calling ‘waitpid’, or running computations at 12:00pm?

Ludo’.

Information forwarded to bug-guix <at> gnu.org:
bug#76516; Package guix. (Mon, 24 Feb 2025 19:07:01 GMT)

Message #14 received at 76516 <at> debbugs.gnu.org:

From: Tomas Volf <~@wolfsden.cz>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: 76516 <at> debbugs.gnu.org
Subject: Re: bug#76516: [shepherd] Timer not executed
Date: Mon, 24 Feb 2025 20:06:33 +0100

Ludovic Courtès <ludo <at> gnu.org> writes:

> Hi Tomas,
>
> Tomas Volf <~@wolfsden.cz> skribis:
>
>> However that did not happen.  Here are the logs:
>>
>> 2025-02-22 19:17:00 Service kerberos-log-in running with value #<<process> id: 730 command: ("/gnu/store/8m21cnqnllk6g1kcgyj91i5h05s7c0c4-krb-log-in")>.
>> 2025-02-22 19:17:00 [8m21cnqnllk6g1kcgyj91i5h05s7c0c4-krb-log-in] <redacted>
>> 2025-02-22 19:17:00 [8m21cnqnllk6g1kcgyj91i5h05s7c0c4-krb-log-in] <redacted>
>> 2025-02-22 19:17:00 [8m21cnqnllk6g1kcgyj91i5h05s7c0c4-krb-log-in] <redacted>
>> 2025-02-22 19:17:00 [8m21cnqnllk6g1kcgyj91i5h05s7c0c4-krb-log-in] <redacted>
>> 2025-02-23 12:00:02 Waiting anew for timer 'kerberos-log-in-refresh' (resuming from sleep state?).
>> 2025-02-23 22:00:01 Not rotating '/home/<redacted>/.local/state/shepherd/dbus.log', which is below the 8192 B threshold.
>>
>>
>> The ones from 19:17:00 are from 'kerberos-log-in service, which is
>> one-shot executed upon login.  That went fine.
>>
>> However the 'kerberos-log-in-refresh is only at 12:00:02, and only as
>> "Waiting anew ...".  The message indicates that the computer might be
>> resuming from sleep, however that was not the case here.  It is a
>> desktop machine, and it was left running over night.
>
> What architecture is this on?

x86_64-linux, AMD Ryzen 5 5600G with Radeon Graphics

>
> From the excerpt above, the ‘log-rotation’ timer did fire as expected.
> Did it also have “Waiting anew” messages?

No, no such message.  Actually, there are only 2 additional lines in the
log file.  So the following are the last 4 lines of
shepherd.log.1.zstd (you have already seen the first 2 lines):

--8<---------------cut here---------------start------------->8---
2025-02-23 12:00:02 Waiting anew for timer 'kerberos-log-in-refresh' (resuming from sleep state?).
2025-02-23 22:00:01 Not rotating '/home/<redacted>/.local/state/shepherd/dbus.log', which is below the 8192 B threshold.

2025-02-23 22:00:01 Rotating log.
--8<---------------cut here---------------end--------------->8---

The empty line is in the log, that is not a copy&paste error.  The next
log (shepherd.log) starts with:

--8<---------------cut here---------------start------------->8---
2025-02-23 22:00:01 Rotating '/home/<redacted>/.local/state/shepherd/shepherd.log' to '/home/<redacted>/.local/state/shepherd/shepherd.log.1'.
2025-02-23 22:00:01 Rotated '/home/<redacted>/.local/state/shepherd/shepherd.log'.
--8<---------------cut here---------------end--------------->8---

So there is not much indication of what happened.

Tomas

-- 
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.

Information forwarded to bug-guix <at> gnu.org:
bug#76516; Package guix. (Mon, 24 Feb 2025 19:25:02 GMT)

Message #17 received at 76516 <at> debbugs.gnu.org:

From: Tomas Volf <~@wolfsden.cz>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: 76516 <at> debbugs.gnu.org
Subject: Re: bug#76516: [shepherd] Timer not executed
Date: Mon, 24 Feb 2025 20:24:42 +0100

Ludovic Courtès <ludo <at> gnu.org> writes:

> Ludovic Courtès <ludo <at> gnu.org> skribis:
>
>>> 2025-02-23 12:00:02 Waiting anew for timer 'kerberos-log-in-refresh' (resuming from sleep state?).
>
> The “Waiting anew” message happens when the timer fires 2 seconds or
> more later than expected (see ‘sleep-operation/check’), which is indeed
> the case here.
>
> It’s not supposed to happen normally.  Before we bump that to 10
> seconds, say, it would be good to understand why the timer got late
> here.

I definitely agree on this.

(I wonder if there is a better way to detect the sleep.  I feel like
*any* number will be wrong for someone.  Do we know how, for example,
systemd's timers handle this?)

>
> Are there services that could block shepherd somehow, for instance by
> calling ‘waitpid’, or running computations at 12:00pm?

Not really (I think).  This is full shepherd status output:

--8<---------------cut here---------------start------------->8---
$ herd status
Started:
 + dbus
 + pulseaudio
 + root
 + timer
 + transient
Running timers:
 + kerberos-log-in-refresh
 + log-rotation
One-shot:
 * kerberos-log-in
 * kerberos-reachable?
--8<---------------cut here---------------end--------------->8---

I have already shared the definition of kerberos-log-in-refresh.  There
is no other timer scheduled (except for log rotation).  Other services
are from Guix, with the exception of pulseaudio:

--8<---------------cut here---------------start------------->8---
(define (home-pulseaudio-shepherd-services _)
  "Return a shepherd service to run a pulseaudio daemon.

Currently no configuration is supported."
  (list
   (shepherd-service
    (documentation "Run a pulseaudio daemon.")
    (provision '(pulseaudio))
    (start #~(make-forkexec-constructor
              '(#$(file-append pulseaudio "/bin/pulseaudio")
                "--daemonize=false")))
    (stop #~(make-kill-destructor)))))
--8<---------------cut here---------------end--------------->8---

There is a timer scheduled to run every 15 minutes in the system
shepherd, but it is not compute-heavy (it just checks error counts from
the root filesystem).  The machine has 12 cores, each at ~3GHz, 32GB of
RAM and SSD for /.  I am not aware of any significant resource use that
should happen at noon, but even if there were one, it is hard to
believe shepherd would not get a time slice on *any* core for 2 seconds.

For what it is worth, today the cronjob worked fine; however, even today
it was executed at :01, so a second later than it should have been.

--8<---------------cut here---------------start------------->8---
2025-02-24 12:00:01 Timer 'kerberos-log-in-refresh' spawned process 24129.
2025-02-24 12:00:01 Registering new logger for kerberos-log-in-refresh.
--8<---------------cut here---------------end--------------->8---

If you have any idea what additional information would be useful, I have
no problem deploying a patched shepherd with extra logging to this
machine (assuming you know what extra logs we need).

Tomas

-- 
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.

Information forwarded to bug-guix <at> gnu.org:
bug#76516; Package guix. (Mon, 24 Feb 2025 21:56:02 GMT)

Message #20 received at 76516 <at> debbugs.gnu.org:

From: Ludovic Courtès <ludo <at> gnu.org>
To: Tomas Volf <~@wolfsden.cz>
Cc: 76516 <at> debbugs.gnu.org
Subject: Re: bug#76516: [shepherd] Timer not executed
Date: Mon, 24 Feb 2025 22:55:05 +0100

[Message part 1 (text/plain, inline)]

Tomas Volf <~@wolfsden.cz> skribis:

> (I wonder if there is better way to detect the sleep.  I feel like *any*
> number will be wrong for someone.  Do we know how for example systemd's
> timers handle this?)

I believe systemd is the one initiating hibernation, so it has the
information first-hand; in our case this is initiated by elogind and
shepherd doesn’t know.  Probably something to fix.

Anyway, this time drift remains a mystery to me.  I would go for a hack
like this:

[Message part 2 (text/x-patch, inline)]

diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index adc4530..1587a02 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -2490,6 +2490,10 @@ keyword arguments as @code{fork+exec-command}: @code{#:user},
   "Make an operation that returns @var{timeout} when @var{seconds} have
 elapsed and @var{overslept} when many more seconds have elapsed--this can
 happen if the machine is suspended or put into hibernation mode."
+  (define max-delay
+    ;; Time after which we consider that we missed the deadline.
+    (if (> seconds 180) 10 2))
+
   (let ((expiry (+ (get-internal-real-time)
                    (inexact->exact
                     (round (* seconds internal-time-units-per-second))))))
@@ -2497,7 +2501,7 @@ happen if the machine is suspended or put into hibernation mode."
                     (lambda ()
                       (let* ((now (get-internal-real-time))
                              (delta (- now expiry)))
-                        (if (> delta (* 2 internal-time-units-per-second))
+                        (if (> delta (* max-delay internal-time-units-per-second))
                             overslept
                             timeout))))))

[Message part 3 (text/plain, inline)]

WDYT?

Ludo’.

Information forwarded to bug-guix <at> gnu.org:
bug#76516; Package guix. (Mon, 24 Feb 2025 22:58:03 GMT)

Message #23 received at 76516 <at> debbugs.gnu.org:

From: Tomas Volf <~@wolfsden.cz>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: 76516 <at> debbugs.gnu.org
Subject: Re: bug#76516: [shepherd] Timer not executed
Date: Mon, 24 Feb 2025 23:55:40 +0100

Ludovic Courtès <ludo <at> gnu.org> writes:

> Tomas Volf <~@wolfsden.cz> skribis:
>
>> (I wonder if there is better way to detect the sleep.  I feel like *any*
>> number will be wrong for someone.  Do we know how for example systemd's
>> timers handle this?)
>
> I believe systemd is the one initiating hibernation, so it has the
> information first-hand; in our case this is initiated by elogind and
> shepherd doesn’t know.  Probably something to fix.
>
> Anyway, this time drift remains a mystery to me.  I would go for a hack
> like this:
>
> diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
> index adc4530..1587a02 100644
> --- a/modules/shepherd/service.scm
> +++ b/modules/shepherd/service.scm
> @@ -2490,6 +2490,10 @@ keyword arguments as @code{fork+exec-command}: @code{#:user},
>    "Make an operation that returns @var{timeout} when @var{seconds} have
>  elapsed and @var{overslept} when many more seconds have elapsed--this can
>  happen if the machine is suspended or put into hibernation mode."
> +  (define max-delay
> +    ;; Time after which we consider that we missed the deadline.

I would extend the comment to describe why both 10 and 2 are used.

> +    (if (> seconds 180) 10 2))
> +
>    (let ((expiry (+ (get-internal-real-time)
>                     (inexact->exact
>                      (round (* seconds internal-time-units-per-second))))))
> @@ -2497,7 +2501,7 @@ happen if the machine is suspended or put into hibernation mode."
>                      (lambda ()
>                        (let* ((now (get-internal-real-time))

I have no idea how Shepherd works internally (and much less how Fibers
work), so maybe this comment is completely off, but this seems
suspicious.  Should this lambda not get the wake up time as an argument,
instead of calling get-internal-real-time to get the "now"?

I have no idea what guarantees Fibers makes regarding the delays
between detecting that time is up and calling the callback.  And after a
quick look at the source code I have decided that it is way beyond me to
try to figure it out.

Is there a way to enable logging of the events?  So we would know when
fibers decided the timer is up, and when the lambda was called?

>                               (delta (- now expiry)))
> -                        (if (> delta (* 2 internal-time-units-per-second))
> +                        (if (> delta (* max-delay internal-time-units-per-second))
>                              overslept
>                              timeout))))))
>  
>
>
> WDYT?

Well, in *this* particular case it would have resolved the problem, so
great for me I guess.  However I have left a suggestion above.

Out of curiosity, I have scheduled timer events for tomorrow at
23:0{0..5} to see if they will fire with a delay.  Testing with a short
timer (closest whole minute) did not bring any results (the timers were
executed exactly on time), so maybe the long wait is a factor?  Will
report tomorrow.

Tomas

PS: Looking into timer.scm, I see this comment

--8<---------------cut here---------------start------------->8---
;; Reached when resuming from sleep state: we slept
;; significantly more than the requested number of seconds.  To
;; avoid triggering every timer when resuming from sleep state,
;; sleep again to remain in sync.
--8<---------------cut here---------------end--------------->8---

Not sure I would call 2 (or even the 10) a "significantly more". :) If I
expect the cron to sleep for 86400 seconds, 10 more seems... minor.

Maybe (I did not put too much thought into this and the numbers are
completely thumb-sucked), the "overslept" could be if the sleep was
longer by more than 10% of the timer period, clipped to be at least 2,
and at most 30 minutes?

If I have a cron scheduled to run once a month, I would guess most
people would prefer to have it run 20 minutes late than to skip a month
completely.
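
A minimal sketch of that suggestion in Guile, assuming the timer period
is given in seconds.  The procedure name ‘proposed-max-delay’ is made up
for this example, and the constants (10%, 2 seconds, 30 minutes) are the
thumb-sucked numbers from above, not tested values.

```scheme
;; Hypothetical sketch of the tolerance suggested above; not Shepherd code.
(define (proposed-max-delay seconds)
  "Return the tolerated delay, in seconds, for a timer sleeping SECONDS:
10% of the sleep, but no less than 2 seconds and no more than 30 minutes."
  (max 2 (min (* 30 60) (/ seconds 10))))

;; A daily timer (86400 s) hits the 30-minute cap.
(display (proposed-max-delay 86400))
(newline)
;; A 10-second timer falls back to the 2-second floor.
(display (proposed-max-delay 10))
(newline)
```

With these numbers, a daily timer would tolerate a delay of up to 1800
seconds before being treated as overslept, while very short timers keep
the current 2-second limit.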

-- 
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.

Information forwarded to bug-guix <at> gnu.org:
bug#76516; Package guix. (Wed, 26 Feb 2025 10:10:02 GMT)

Message #26 received at 76516 <at> debbugs.gnu.org:

From: Ludovic Courtès <ludo <at> gnu.org>
To: Tomas Volf <~@wolfsden.cz>
Cc: 76516 <at> debbugs.gnu.org
Subject: Re: bug#76516: [shepherd] Timer not executed
Date: Wed, 26 Feb 2025 11:09:22 +0100

Hi,

Tomas Volf <~@wolfsden.cz> skribis:

> I have no idea how Shepherd works internally (and much less how Fibers
> work), so maybe this comment is completely off, but this seems
> suspicious.  Should this lambda not get the wake up time as an argument,
> instead of calling get-internal-real-time to get the "now"?

Yes, it would probably be nicer, but it wouldn’t make much of a
difference here (and it’s not related to the bug: the bug shows that we
sleep longer than asked for).

> Is there a way to enable logging of the events?  So we would know when
> fibers decided the timer is up, and when the lambda was called?

There’s no logging at the Fibers level; all we have is logging by
shepherd itself.

> PS: Looking into timer.scm, I see this comment
>
> ;; Reached when resuming from sleep state: we slept
> ;; significantly more than the requested number of seconds.  To
> ;; avoid triggering every timer when resuming from sleep state,
> ;; sleep again to remain in sync.
>
> Not sure I would call 2 (or even the 10) a "significantly more". :) If I
> expect the cron to sleep for 86400 seconds, 10 more seems... minor.
>
> Maybe (I did not put too much though into this and the numbers are
> completely thumb-sucked), the "overslept" could be if the sleep was
> longer by more than 10% of the timer period, clipped to be at least 2,
> and at most 30 minutes?

Yeah, though there’s no reason for sleeps to drift this much, it’s a
pretty fundamental assumption.  Maybe this:

  (define max-delay
    ;; Time after which we consider that we missed the deadline.  Tolerate a
    ;; slight drift, which can happen occasionally.
    (max (min (/ seconds 10.) 120) 2))
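
To see what this gives in practice, here is a small standalone check,
assuming ‘seconds’ is the requested sleep duration.  The wrapper
procedure ‘max-delay-for’ is hypothetical; only the expression inside it
is taken from the snippet above.

```scheme
;; Hypothetical wrapper around the expression above, for experimentation.
(define (max-delay-for seconds)
  ;; 10% of the requested sleep, clamped between 2 and 120 seconds.
  (max (min (/ seconds 10.) 120) 2))

;; Print the tolerance for a 10-second, a 10-minute, and a daily sleep.
(for-each (lambda (seconds)
            (format #t "sleep ~a s -> tolerate ~a s~%"
                    seconds (max-delay-for seconds)))
          '(10 600 86400))
```

The three sleeps get tolerances of 2, 60, and 120 seconds respectively,
so the 24-hour timer from the original report, which was logged about 2
seconds late, would stay well within the limit.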

Thanks,
Ludo’.

Information forwarded to bug-guix <at> gnu.org:
bug#76516; Package guix. (Fri, 28 Feb 2025 01:30:03 GMT)

Message #29 received at 76516 <at> debbugs.gnu.org:

From: Tomas Volf <~@wolfsden.cz>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: 76516 <at> debbugs.gnu.org
Subject: Re: bug#76516: [shepherd] Timer not executed
Date: Fri, 28 Feb 2025 02:29:50 +0100

Ludovic Courtès <ludo <at> gnu.org> writes:

> Hi,
>
> Tomas Volf <~@wolfsden.cz> skribis:
>
>> I have no idea how Shepherd works internally (and much less how Fibers
>> work), so maybe this comment is completely off, but this seems
>> suspicious.  Should this lambda not get the wake up time as an argument,
>> instead of calling get-internal-real-time to get the "now"?
>
> Yes, it would probably be nicer, but it wouldn’t make much of a
> difference here (and it’s not related to the bug: the bug shows that we
> sleep longer than asked for).

I am not sure this is correct.  What the bug shows is that the callback
is called later than expected.  We do not know how long the sleep was.
Am I missing something?

>
>> Is there a way to enable logging of the events?  So we would know when
>> fibers decided the timer is up, and when the lambda was called?
>
> There’s no logging at the Fibers level; all we have is logging by
> shepherd itself.
>
>> PS: Looking into timer.scm, I see this comment
>>
>> ;; Reached when resuming from sleep state: we slept
>> ;; significantly more than the requested number of seconds.  To
>> ;; avoid triggering every timer when resuming from sleep state,
>> ;; sleep again to remain in sync.
>>
>> Not sure I would call 2 (or even the 10) a "significantly more". :) If I
>> expect the cron to sleep for 86400 seconds, 10 more seems... minor.
>>
>> Maybe (I did not put too much though into this and the numbers are
>> completely thumb-sucked), the "overslept" could be if the sleep was
>> longer by more than 10% of the timer period, clipped to be at least 2,
>> and at most 30 minutes?
>
> Yeah, though there’s no reason for sleeps to drift this much, it’s a
> pretty fundamental assumption.

Does not seem to hold in this particular case (at least for the lower
bound).  ¯\_(ツ)_/¯

> Maybe this:
>
>   (define max-delay
>     ;; Time after which we consider that we missed the deadline.  Tolerate a
>     ;; slight drift, which can happen occasionally.
>     (max (min (/ seconds 10.) 120) 2))

That should work, yeah.  At least as a temporary measure. :)

A few additional data points: the timers I have scheduled for almost 24h
in the future fired exactly on time.  As for the kerberos-log-in-refresh
timer, twice it fired within the 2 seconds (12:00:01), once outside
(12:00:02).

I was thinking about this some more, and the right solution here
probably is to use netlink to listen for ACPI events, the same way acpid
does.  That should provide reliable information about the suspend and
resume events.

Tomas

-- 
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.

Information forwarded to bug-guix <at> gnu.org:
bug#76516; Package guix. (Sat, 01 Mar 2025 17:27:02 GMT)

Message #32 received at 76516 <at> debbugs.gnu.org:

From: Ludovic Courtès <ludo <at> gnu.org>
To: Tomas Volf <~@wolfsden.cz>
Cc: 76516 <at> debbugs.gnu.org
Subject: Re: bug#76516: [shepherd] Timer not executed
Date: Sat, 01 Mar 2025 18:25:33 +0100

Hi,

Tomas Volf <~@wolfsden.cz> skribis:

>>> I have no idea how Shepherd works internally (and much less how Fibers
>>> work), so maybe this comment is completely off, but this seems
>>> suspicious.  Should this lambda not get the wake up time as an argument,
>>> instead of calling get-internal-real-time to get the "now"?
>>
>> Yes, it would probably be nicer, but it wouldn’t make much of a
>> difference here (and it’s not related to the bug: the bug shows that we
>> sleep longer than asked for).
>
> I am not sure this is correct.  What the bug shows is that the callback
> is called later then expected.  We do not know how long the sleep was.
> Am I missing something?

The bug is that it slept longer than expected, not that it was late, if
you see what I mean.

>> Maybe this:
>>
>>   (define max-delay
>>     ;; Time after which we consider that we missed the deadline.  Tolerate a
>>     ;; slight drift, which can happen occasionally.
>>     (max (min (/ seconds 10.) 120) 2))
>
> That should work, yeah.  At least as a temporary measure. :)

Heh, agreed.  Pushed as 7a7b4e16f9697c4822b7693e63cc4ba0ace134a2.

> Few additional data-points: The timers I have scheduled for almost 24h
> in the future fired exactly on time.  As for the kerberos-log-in-refresh
> timer, twice it fired within the 2 seconds (12:00:01), once outside
> (12:00:02).

OK.

> I was thinking about this some more, and the right solution here
> probably is to use netlink to listen for ACPI events, the same way acpid
> does.  That should provide reliable information about the suspend and
> resume events.

Sounds like a good idea (though it’s a bit annoying to depend on
guile-netlink and low-level details).

Another thing I had in mind was to use an elogind hook so that shepherd
would know when we’re suspending; this is necessary for other things
such as locking LUKS devices on suspend.  But that’s a change for 1.1.x.

Thanks,
Ludo’.

bug closed, send any further explanations to 76516 <at> debbugs.gnu.org and Tomas Volf <~@wolfsden.cz> Request was from Ludovic Courtès <ludo <at> gnu.org> to control <at> debbugs.gnu.org. (Sat, 01 Mar 2025 17:27:02 GMT)

Information forwarded to bug-guix <at> gnu.org:
bug#76516; Package guix. (Sat, 01 Mar 2025 23:38:01 GMT)

Message #37 received at 76516 <at> debbugs.gnu.org:

From: Tomas Volf <~@wolfsden.cz>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: 76516 <at> debbugs.gnu.org
Subject: Re: bug#76516: [shepherd] Timer not executed
Date: Sun, 02 Mar 2025 00:37:01 +0100

Ludovic Courtès <ludo <at> gnu.org> writes:

> [..]
>
> Another thing I had in mind was to use an elogind hook so that shepherd
> would know when we’re suspending; this is necessary for other things
> such as locking LUKS devices on suspend.  But that’s a change for 1.1.x.

I see two possible problems here (both solvable).

1. AFAICT shepherd currently does not depend on elogind at all.  Having
it as a run-time dependency might be fine on Guix (assuming we move
elogind into %base-services), but could be annoying on foreign
distributions, especially from non-root user's point of view.

2. How will the hook know which processes it should notify?  There is
no global registry of all running shepherd processes, is there?

Though I am sure both of these are solvable.

Have a nice day (and thanks for the fix :) ),
Tomas

-- 
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 30 Mar 2025 11:24:18 GMT)

This bug report was last modified 102 days ago.
