GNU bug report logs - #43552
[PATCH] Add watchdog support.

Previous Next

Package: guix-patches;

Reported by: Mathieu Othacehe <othacehe <at> gnu.org>

Date: Mon, 21 Sep 2020 16:50:01 UTC

Severity: normal

Tags: patch

Done: Mathieu Othacehe <othacehe <at> gnu.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 43552 in the body.
You can then email your comments to 43552 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to guix-patches <at> gnu.org:
bug#43552; Package guix-patches. (Mon, 21 Sep 2020 16:50:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to Mathieu Othacehe <othacehe <at> gnu.org>:
New bug report received and forwarded. Copy sent to guix-patches <at> gnu.org. (Mon, 21 Sep 2020 16:50:01 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Mathieu Othacehe <othacehe <at> gnu.org>
To: guix-patches <at> gnu.org
Cc: Mathieu Othacehe <othacehe <at> gnu.org>
Subject: [PATCH] Add watchdog support.
Date: Mon, 21 Sep 2020 18:49:08 +0200
* src/cuirass/watchdog.scm: New file.
* Makefile.am (dist_pkgmodule_DATA): Add it.
* src/cuirass/utils.scm (with-timeout, get-message-with-timeout): Export them.
* bin/cuirass.in (main): Start the watchdog.
---

Hello,

I've noticed that during several hours all Cuirass fibers seem to be
stucked. I suspect that this is because a fiber is doing a blocking action,
hence preventing its scheduler to run other fibers.

This patch adds a watchdog that print warning messages if some Fibers
scheduler is hanging for more than 5 seconds. The next step will be to
identify the faulty fiber.

Thanks,

Mathieu

 Makefile.am              |  5 ++-
 bin/cuirass.in           |  3 +-
 src/cuirass/utils.scm    |  3 ++
 src/cuirass/watchdog.scm | 81 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 89 insertions(+), 3 deletions(-)
 create mode 100644 src/cuirass/watchdog.scm

diff --git a/Makefile.am b/Makefile.am
index 0e4d3c8..2c7e6e2 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -50,8 +50,9 @@ dist_pkgmodule_DATA =				\
   src/cuirass/metrics.scm			\
   src/cuirass/send-events.scm			\
   src/cuirass/ui.scm				\
-  src/cuirass/utils.scm             \
-  src/cuirass/templates.scm
+  src/cuirass/utils.scm		                \
+  src/cuirass/templates.scm                     \
+  src/cuirass/watchdog.scm
 
 nodist_pkgmodule_DATA = \
   src/cuirass/config.scm
diff --git a/bin/cuirass.in b/bin/cuirass.in
index ed21ed7..7f4793e 100644
--- a/bin/cuirass.in
+++ b/bin/cuirass.in
@@ -31,6 +31,7 @@ exec ${GUILE:-@GUILE@} --no-auto-compile -e main -s "$0" "$@"
              (cuirass logging)
              (cuirass metrics)
              (cuirass utils)
+             (cuirass watchdog)
              (guix ui)
              ((guix build utils) #:select (mkdir-p))
              (fibers)
@@ -142,7 +143,7 @@ exec ${GUILE:-@GUILE@} --no-auto-compile -e main -s "$0" "$@"
                  (if one-shot?
                      (process-specs (db-get-specifications))
                      (let ((exit-channel (make-channel)))
-
+                       (start-watchdog)
                        (if (option-ref opts 'web #f)
                            (begin
                              (spawn-fiber
diff --git a/src/cuirass/utils.scm b/src/cuirass/utils.scm
index 00cfef6..7ce4b83 100644
--- a/src/cuirass/utils.scm
+++ b/src/cuirass/utils.scm
@@ -37,6 +37,9 @@
             define-enumeration
             unwind-protect
 
+            with-timeout
+            get-message-with-timeout
+
             make-worker-thread-channel
             call-with-worker-thread
             with-worker-thread
diff --git a/src/cuirass/watchdog.scm b/src/cuirass/watchdog.scm
new file mode 100644
index 0000000..b912a9f
--- /dev/null
+++ b/src/cuirass/watchdog.scm
@@ -0,0 +1,81 @@
+;;; watchdog.scm -- Monitor fibers scheduling.
+;;; This file is part of Cuirass.
+;;;
+;;; Cuirass is free software: you can redistribute it and/or modify
+;;; it under the terms of the GNU General Public License as published by
+;;; the Free Software Foundation, either version 3 of the License, or
+;;; (at your option) any later version.
+;;;
+;;; Cuirass is distributed in the hope that it will be useful,
+;;; but WITHOUT ANY WARRANTY; without even the implied warranty of
+;;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+;;; GNU General Public License for more details.
+;;;
+;;; You should have received a copy of the GNU General Public License
+;;; along with Cuirass.  If not, see <http://www.gnu.org/licenses/>.
+
+(define-module (cuirass watchdog)
+  #:use-module (cuirass logging)
+  #:use-module (cuirass utils)
+  #:use-module (fibers)
+  #:use-module (fibers channels)
+  #:use-module (fibers internal)
+  #:use-module (fibers operations)
+  #:use-module (ice-9 match)
+  #:use-module (ice-9 threads)
+  #:export (start-watchdog))
+
+(define* (watchdog-fiber scheduler channel
+                         #:key
+                         timeout)
+  "Spawn a fiber running on SCHEDULER that sends over CHANNEL, every TIMEOUT
+seconds, the scheduler name and the current time."
+  (spawn-fiber
+   (lambda ()
+     (while #t
+       (put-message channel (list (scheduler-name scheduler)
+                                  (current-time)))
+       (sleep timeout)))
+   scheduler))
+
+(define* (start-watchdog #:key (timeout 5))
+  "Start a watchdog checking that each Fibers scheduler is not blocked for
+more than TIMEOUT seconds.
+
+The watchdog mechanism consists in spawning a dedicated fiber per running
+Fiber scheduler, using the above watchdog-fiber method.  Those fibers send a
+ping signal periodically to a separate thread.  If no signal is received from
+one of the schedulers for more than TIMEOUT seconds, a warning message is
+printed."
+  (define (check-timeouts pings)
+    (for-each
+     (match-lambda
+       ((scheduler . time)
+        (let* ((cur-time (current-time))
+               (diff-ping (- cur-time time)))
+          (when (> diff-ping timeout)
+            (log-message "Scheduler ~a blocked since ~a seconds."
+                         scheduler diff-ping)))))
+     pings))
+
+  (let ((watchdog-channel (make-channel)))
+    (parameterize (((@@ (fibers internal) current-fiber) #f))
+      (call-with-new-thread
+       (lambda ()
+         (let loop ((pings '()))
+           (let ((operation-timeout 2))
+             (match (perform-operation
+                     (with-timeout
+                      (get-operation watchdog-channel)
+                      #:seconds operation-timeout
+                      #:wrap (const 'timeout)))
+               ((scheduler ping)
+                (loop (assq-set! pings scheduler ping)))
+               ('timeout
+                (check-timeouts pings)
+                (loop pings))))))))
+    (fold-all-schedulers
+     (lambda (name scheduler seed)
+       (watchdog-fiber scheduler watchdog-channel
+                       #:timeout timeout))
+     '())))
-- 
2.28.0





Information forwarded to guix-patches <at> gnu.org:
bug#43552; Package guix-patches. (Tue, 22 Sep 2020 08:35:01 GMT) Full text and rfc822 format available.

Message #8 received at 43552 <at> debbugs.gnu.org (full text, mbox):

From: Mathieu Othacehe <othacehe <at> gnu.org>
To: 43552 <at> debbugs.gnu.org
Subject: Re: [bug#43552] [PATCH] Add watchdog support.
Date: Tue, 22 Sep 2020 10:34:03 +0200
Hello,

> I've noticed that during several hours all Cuirass fibers seem to be
> stucked. I suspect that this is because a fiber is doing a blocking action,
> hence preventing its scheduler to run other fibers.

I think the issue is that we are communicating with guix-daemon over a
blocking socket. When Cuirass calls "build-derivations", and the "read"
over guix-daemon socket blocks, the whole Fiber scheduler is also
blocked.

As the GC is running for 6 - 7 hours a day[1], a Fiber scheduler is also
blocked for that duration. There's still some activity happening on
other schedulers, but that's still quite unfortunate. For instance,
nothing has been fetched since 4:00 this morning.

Thanks,

Mathieu

[1]: https://issues.guix.gnu.org/24937




Reply sent to Mathieu Othacehe <othacehe <at> gnu.org>:
You have taken responsibility. (Tue, 29 Sep 2020 15:06:02 GMT) Full text and rfc822 format available.

Notification sent to Mathieu Othacehe <othacehe <at> gnu.org>:
bug acknowledged by developer. (Tue, 29 Sep 2020 15:06:02 GMT) Full text and rfc822 format available.

Message #13 received at 43552-done <at> debbugs.gnu.org (full text, mbox):

From: Mathieu Othacehe <othacehe <at> gnu.org>
To: 43552-done <at> debbugs.gnu.org
Subject: Re: [bug#43552] [PATCH] Add watchdog support.
Date: Tue, 29 Sep 2020 17:05:06 +0200
Pushed as eb8d1b88adcb63b12986855f40a454e2c59d0b9c.

Thanks,

Mathieu




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Wed, 28 Oct 2020 11:24:07 GMT) Full text and rfc822 format available.

This bug report was last modified 3 years and 179 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.