GNU bug report logs - #31925
'guix substitutes' sometimes hangs on glibc 2.27


Package: guix;

Reported by: ludo <at> gnu.org (Ludovic Courtès)

Date: Thu, 21 Jun 2018 11:46:01 UTC

Severity: serious

Tags: unreproducible

Done: Ludovic Courtès <ludo <at> gnu.org>

Bug is archived. No further changes may be made.



Report forwarded to bug-guix <at> gnu.org:
bug#31925; Package guix. (Thu, 21 Jun 2018 11:46:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to ludo <at> gnu.org (Ludovic Courtès):
New bug report received and forwarded. Copy sent to bug-guix <at> gnu.org. (Thu, 21 Jun 2018 11:46:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: ludo <at> gnu.org (Ludovic Courtès)
To: bug-guix <at> gnu.org
Subject: 'guix substitutes' sometimes hangs on glibc 2.27
Date: Thu, 21 Jun 2018 13:45:12 +0200
Hello Guix!

When downloading a number of substitutes, ‘guix substitute’ sometimes
hangs for me since the switch to glibc 2.27.  Anyone else experiencing
this?  It’s relatively frequent for me.

The backtrace shows this:

--8<---------------cut here---------------start------------->8---
(gdb) bt
#0  0x00007fbb34bf794d in __GI___pthread_timedjoin_ex (threadid=140441961314048, thread_return=thread_return <at> entry=0x0, abstime=abstime <at> entry=0x0, block=block <at> entry=true)
    at pthread_join_common.c:89
#1  0x00007fbb34bf773c in __pthread_join (threadid=<optimized out>, thread_return=thread_return <at> entry=0x0) at pthread_join.c:24
#2  0x00007fbb350d7548 in stop_finalization_thread () at finalizers.c:265
#3  0x00007fbb350d7759 in scm_i_finalizer_pre_fork () at finalizers.c:290
#4  0x00007fbb3514f256 in scm_fork () at posix.c:1222
#5  0x00007fbb351477fd in vm_regular_engine (thread=0x7fbb313739d0, vp=0x1569f30, registers=0x52de, resume=884963661) at vm-engine.c:784
#6  0x00007fbb3514ae5a in scm_call_n (proc=<error reading variable: ERROR: Cannot access memory at address 0x355c2030>0x7fbb355c2030, argv=argv <at> entry=0x7fff856ae7f8, nargs=nargs <at> entry=1)
    at vm.c:1257
#7  0x00007fbb350ceff7 in scm_primitive_eval (exp=exp <at> entry=<error reading variable: ERROR: Cannot access memory at address 0x31738cf8>0x164fb00) at eval.c:662
#8  0x00007fbb350cf053 in scm_eval (exp=<error reading variable: ERROR: Cannot access memory at address 0x31738cf8>0x164fb00, 
    module_or_state=module_or_state <at> entry=<error reading variable: ERROR: Cannot access memory at address 0x354750d8>0x162b140) at eval.c:696
#9  0x00007fbb3511a0b0 in scm_shell (argc=10, argv=0x15f0390) at script.c:454
#10 0x00007fbb350e5add in invoke_main_func (body_data=0x7fff856aed10) at init.c:340
#11 0x00007fbb350c82da in c_body (d=0x7fff856aec50) at continuations.c:422
#12 0x00007fbb351477fd in vm_regular_engine (thread=0x7fbb313739d0, vp=0x1569f30, registers=0x52de, resume=884963661) at vm-engine.c:784
#13 0x00007fbb3514ae5a in scm_call_n (proc=proc <at> entry=#<smob catch-closure 1564200>, argv=argv <at> entry=0x0, nargs=nargs <at> entry=0) at vm.c:1257
#14 0x00007fbb350cdef9 in scm_call_0 (proc=proc <at> entry=#<smob catch-closure 1564200>) at eval.c:481
#15 0x00007fbb3513a026 in catch (tag=tag <at> entry=#t, thunk=#<smob catch-closure 1564200>, handler=<error reading variable: ERROR: Cannot access memory at address 0x0>0x15641e0, 
    pre_unwind_handler=<error reading variable: ERROR: Cannot access memory at address 0x0>0x15641c0) at throw.c:137
#16 0x00007fbb3513a365 in scm_catch_with_pre_unwind_handler (key=key <at> entry=#t, thunk=<optimized out>, handler=<optimized out>, pre_unwind_handler=<optimized out>) at throw.c:254
#17 0x00007fbb3513a51f in scm_c_catch (tag=tag <at> entry=#t, body=body <at> entry=0x7fbb350c82d0 <c_body>, body_data=body_data <at> entry=0x7fff856aec50, handler=handler <at> entry=0x7fbb350c8560 <c_handler>, 
    handler_data=handler_data <at> entry=0x7fff856aec50, pre_unwind_handler=pre_unwind_handler <at> entry=0x7fbb350c83c0 <pre_unwind_handler>, pre_unwind_handler_data=0x1564b60) at throw.c:377
#18 0x00007fbb350c88c0 in scm_i_with_continuation_barrier (body=body <at> entry=0x7fbb350c82d0 <c_body>, body_data=body_data <at> entry=0x7fff856aec50, handler=handler <at> entry=0x7fbb350c8560 <c_handler>, 
    handler_data=handler_data <at> entry=0x7fff856aec50, pre_unwind_handler=pre_unwind_handler <at> entry=0x7fbb350c83c0 <pre_unwind_handler>, pre_unwind_handler_data=0x1564b60) at continuations.c:360
#19 0x00007fbb350c8955 in scm_c_with_continuation_barrier (func=<optimized out>, data=<optimized out>) at continuations.c:456
#20 0x00007fbb35138c3c in with_guile (base=base <at> entry=0x7fff856aecb8, data=data <at> entry=0x7fff856aece0) at threads.c:661
#21 0x00007fbb34e2afb8 in GC_call_with_stack_base (fn=fn <at> entry=0x7fbb35138bf0 <with_guile>, arg=arg <at> entry=0x7fff856aece0) at misc.c:1949
#22 0x00007fbb35138fd8 in scm_i_with_guile (dynamic_state=<optimized out>, data=data <at> entry=0x7fff856aece0, func=func <at> entry=0x7fbb350e5ac0 <invoke_main_func>) at threads.c:704
#23 scm_with_guile (func=func <at> entry=0x7fbb350e5ac0 <invoke_main_func>, data=data <at> entry=0x7fff856aed10) at threads.c:710
#24 0x00007fbb350e5c72 in scm_boot_guile (argc=argc <at> entry=7, argv=argv <at> entry=0x7fff856aee68, main_func=main_func <at> entry=0x400ce0 <inner_main>, closure=closure <at> entry=0x0) at init.c:323
#25 0x0000000000400b80 in main (argc=7, argv=0x7fff856aee68) at guile.c:101
(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0x7fbb355cab80 (LWP 21207) "guix substitute" 0x00007fbb34bf794d in __GI___pthread_timedjoin_ex (threadid=140441961314048, thread_return=thread_return <at> entry=0x0, 
    abstime=abstime <at> entry=0x0, block=block <at> entry=true) at pthread_join_common.c:89
  2    Thread 0x7fbb3342c700 (LWP 21208) ".guix-real" 0x00007fbb34bfc552 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fbb3504f6ec <mark_cv+44>)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  3    Thread 0x7fbb32c2b700 (LWP 21209) ".guix-real" 0x00007fbb34bfc552 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fbb3504f6ec <mark_cv+44>)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  4    Thread 0x7fbb3242a700 (LWP 21210) ".guix-real" 0x00007fbb34bfc552 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fbb3504f6ec <mark_cv+44>)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  5    Thread 0x7fbb2eb78700 (LWP 21212) ".guix-real" 0x00007fbb34bffaac in __libc_read (fd=9, buf=buf <at> entry=0x7fbb2eb77540, nbytes=nbytes <at> entry=1) at ../sysdeps/unix/sysv/linux/read.c:27
  6    Thread 0x7fbb31373700 (LWP 21214) "guix substitute" 0x00007fbb34bffaac in __libc_read (fd=5, buf=buf <at> entry=0x7fbb31372a30, nbytes=nbytes <at> entry=1) at ../sysdeps/unix/sysv/linux/read.c:27
--8<---------------cut here---------------end--------------->8---

The finalization thread is itself stuck reading from its pipe:

--8<---------------cut here---------------start------------->8---
(gdb) thread 6
[Switching to thread 6 (Thread 0x7fbb31373700 (LWP 21214))]
#0  0x00007fbb34bffaac in __libc_read (fd=5, buf=buf <at> entry=0x7fbb31372a30, nbytes=nbytes <at> entry=1) at ../sysdeps/unix/sysv/linux/read.c:27
27      in ../sysdeps/unix/sysv/linux/read.c
(gdb) bt
#0  0x00007fbb34bffaac in __libc_read (fd=5, buf=buf <at> entry=0x7fbb31372a30, nbytes=nbytes <at> entry=1) at ../sysdeps/unix/sysv/linux/read.c:27
#1  0x00007fbb350d74d7 in read_finalization_pipe_data (data=0x7fbb31372a30) at finalizers.c:199
#2  0x00007fbb34e30b63 in GC_do_blocking_inner (data=0x7fbb313729f0 "\300t\r5\273\177", context=<optimized out>) at pthread_support.c:1353
#3  0x00007fbb34e25389 in GC_with_callee_saves_pushed (fn=0x7fbb34e30b20 <GC_do_blocking_inner>, arg=arg <at> entry=0x7fbb313729f0 "\300t\r5\273\177") at mach_dep.c:322
#4  0x00007fbb34e2afec in GC_do_blocking (fn=fn <at> entry=0x7fbb350d74c0 <read_finalization_pipe_data>, client_data=client_data <at> entry=0x7fbb31372a30) at misc.c:2061
#5  0x00007fbb3513902a in scm_without_guile (func=0x7fbb350d74c0 <read_finalization_pipe_data>, data=0x7fbb31372a30) at threads.c:722
#6  0x00007fbb350d7887 in finalization_thread_proc (unused=<optimized out>) at finalizers.c:212
#7  0x00007fbb350c82da in c_body (d=0x7fbb31372e50) at continuations.c:422
#8  0x00007fbb351477fd in vm_regular_engine (thread=0x5, vp=0x172aea0, registers=0x1, resume=884996780) at vm-engine.c:784
#9  0x00007fbb3514ae5a in scm_call_n (proc=proc <at> entry=#<smob catch-closure 1c1ff60>, argv=argv <at> entry=0x0, nargs=nargs <at> entry=0) at vm.c:1257
#10 0x00007fbb350cdef9 in scm_call_0 (proc=proc <at> entry=#<smob catch-closure 1c1ff60>) at eval.c:481
#11 0x00007fbb3513a026 in catch (tag=tag <at> entry=#t, thunk=#<smob catch-closure 1c1ff60>, handler=<error reading variable: ERROR: Cannot access memory at address 0x0>0x1c1ff40, 
    pre_unwind_handler=<error reading variable: ERROR: Cannot access memory at address 0x0>0x1c1ff00) at throw.c:137
#12 0x00007fbb3513a365 in scm_catch_with_pre_unwind_handler (key=key <at> entry=#t, thunk=<optimized out>, handler=<optimized out>, pre_unwind_handler=<optimized out>) at throw.c:254
#13 0x00007fbb3513a51f in scm_c_catch (tag=tag <at> entry=#t, body=body <at> entry=0x7fbb350c82d0 <c_body>, body_data=body_data <at> entry=0x7fbb31372e50, handler=handler <at> entry=0x7fbb350c8560 <c_handler>, 
    handler_data=handler_data <at> entry=0x7fbb31372e50, pre_unwind_handler=pre_unwind_handler <at> entry=0x7fbb350c83c0 <pre_unwind_handler>, pre_unwind_handler_data=0x1564b60) at throw.c:377
#14 0x00007fbb350c88c0 in scm_i_with_continuation_barrier (body=body <at> entry=0x7fbb350c82d0 <c_body>, body_data=body_data <at> entry=0x7fbb31372e50, handler=handler <at> entry=0x7fbb350c8560 <c_handler>, 
    handler_data=handler_data <at> entry=0x7fbb31372e50, pre_unwind_handler=pre_unwind_handler <at> entry=0x7fbb350c83c0 <pre_unwind_handler>, pre_unwind_handler_data=0x1564b60) at continuations.c:360
#15 0x00007fbb350c8955 in scm_c_with_continuation_barrier (func=<optimized out>, data=<optimized out>) at continuations.c:456
#16 0x00007fbb35138c3c in with_guile (base=base <at> entry=0x7fbb31372eb8, data=data <at> entry=0x7fbb31372ee0) at threads.c:661
#17 0x00007fbb34e2afb8 in GC_call_with_stack_base (fn=fn <at> entry=0x7fbb35138bf0 <with_guile>, arg=arg <at> entry=0x7fbb31372ee0) at misc.c:1949
#18 0x00007fbb35138fd8 in scm_i_with_guile (dynamic_state=<optimized out>, data=<optimized out>, func=<optimized out>) at threads.c:704
#19 scm_with_guile (func=<optimized out>, data=<optimized out>) at threads.c:710
#20 0x00007fbb34bf6567 in start_thread (arg=0x7fbb31373700) at pthread_create.c:463
#21 0x00007fbb3351eeaf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
--8<---------------cut here---------------end--------------->8---

The ‘primitive-fork’ call itself presumably comes from
‘decompressed-port’, used in (guix scripts substitute).
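For reference, the fork-and-pipe pattern behind ‘decompressed-port’
looks roughly like this (a sketch with made-up names, not the actual
(guix utils) code): the child filters the data into a pipe and the
parent reads the other end, which is where ‘primitive-fork’ comes in.

(use-modules (ice-9 match))

(define (forking-filter-port input filter!)
  ;; Hypothetical helper: fork a child that runs FILTER! from INPUT
  ;; into the write end of a pipe, and hand the read end plus the
  ;; child's PID back to the parent.
  (match (pipe)
    ((read-end . write-end)
     (match (primitive-fork)
       (0                                        ;child
        (close-port read-end)
        (filter! input write-end)
        (close-port write-end)
        (primitive-_exit 0))
       (pid                                      ;parent
        (close-port write-end)
        (values read-end pid))))))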

Ludo’.




Information forwarded to bug-guix <at> gnu.org:
bug#31925; Package guix. (Thu, 21 Jun 2018 14:12:02 GMT) Full text and rfc822 format available.

Message #8 received at 31925 <at> debbugs.gnu.org (full text, mbox):

From: Ricardo Wurmus <rekado <at> elephly.net>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: 31925 <at> debbugs.gnu.org
Subject: Re: bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27
Date: Thu, 21 Jun 2018 16:10:45 +0200
Hi Ludo,

> When downloading a number of substitutes, ‘guix substitute’ sometimes
> hangs for me since the switch to glibc 2.27.  Anyone else experiencing
> this?  It’s relatively frequent for me.

It has never happened to me.

-- 
Ricardo





Severity set to 'serious' from 'normal'. Request was from ludo <at> gnu.org (Ludovic Courtès) to control <at> debbugs.gnu.org. (Mon, 25 Jun 2018 12:26:01 GMT) Full text and rfc822 format available.

Information forwarded to bug-guix <at> gnu.org:
bug#31925; Package guix. (Wed, 04 Jul 2018 07:05:03 GMT) Full text and rfc822 format available.

Message #13 received at 31925 <at> debbugs.gnu.org (full text, mbox):

From: ludo <at> gnu.org (Ludovic Courtès)
To: 31925 <at> debbugs.gnu.org
Subject: Re: bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27
Date: Wed, 04 Jul 2018 09:03:53 +0200
[Message part 1 (text/plain, inline)]
ludo <at> gnu.org (Ludovic Courtès) skribis:

> #0  0x00007fbb34bf794d in __GI___pthread_timedjoin_ex (threadid=140441961314048, thread_return=thread_return <at> entry=0x0, abstime=abstime <at> entry=0x0, block=block <at> entry=true)
>     at pthread_join_common.c:89
> #1  0x00007fbb34bf773c in __pthread_join (threadid=<optimized out>, thread_return=thread_return <at> entry=0x0) at pthread_join.c:24
> #2  0x00007fbb350d7548 in stop_finalization_thread () at finalizers.c:265
> #3  0x00007fbb350d7759 in scm_i_finalizer_pre_fork () at finalizers.c:290
> #4  0x00007fbb3514f256 in scm_fork () at posix.c:1222

Here’s a reproducer that works quite well (it hangs within a couple of
minutes):

[finalization-thread-proc.scm (text/x-scheme, inline)]
(use-modules (guix utils)
             (ice-9 ftw)
             (ice-9 match)
             (srfi srfi-1)
             (srfi srfi-26)
             (rnrs io ports))

(define infodir
  (string-append (getenv "HOME")
                 "/.guix-profile/share/info/"))

(define files
  (apply circular-list
         (map (cut string-append infodir <>)
              (scandir infodir (lambda (file)
                                 (string-suffix? ".gz" file))))))

(sigaction SIGALRM
  (lambda _
    (alarm 1)))
(alarm 1)

(let loop ((files files)
           (n 0))
  (match files
    ((file . tail)
     (call-with-input-file file
       (lambda (port)
         (call-with-decompressed-port 'gzip port
           (lambda (port)
             (let loop ()
               (unless (eof-object? (get-bytevector-n port 777))
                 (loop)))))))
     ;; (pk 'loop n file)
     (display ".")
     (loop tail (+ n 1)))))
[Message part 3 (text/plain, inline)]
Ludo’.

Information forwarded to bug-guix <at> gnu.org:
bug#31925; Package guix. (Wed, 04 Jul 2018 17:00:02 GMT) Full text and rfc822 format available.

Message #16 received at 31925 <at> debbugs.gnu.org (full text, mbox):

From: ludo <at> gnu.org (Ludovic Courtès)
To: 31925 <at> debbugs.gnu.org, Andy Wingo <wingo <at> igalia.com>
Subject: Re: bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27
Date: Wed, 04 Jul 2018 18:58:30 +0200
(+Cc: Andy as the ultimate authority for all these things.  :-))

ludo <at> gnu.org (Ludovic Courtès) skribis:

> (let loop ((files files)
>            (n 0))
>   (match files
>     ((file . tail)
>      (call-with-input-file file
>        (lambda (port)
>          (call-with-decompressed-port 'gzip port
>            (lambda (port)
>              (let loop ()
>                (unless (eof-object? (get-bytevector-n port 777))
>                  (loop)))))))
>      ;; (pk 'loop n file)
>      (display ".")
>      (loop tail (+ n 1)))))

One problem I’ve noticed is that the child process that
‘call-with-decompressed-port’ spawns would be stuck trying to get the
allocation lock:

--8<---------------cut here---------------start------------->8---
(gdb) bt
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f9fd8d5cb25 in __GI___pthread_mutex_lock (mutex=0x7f9fd91b3240 <GC_allocate_ml>) at ../nptl/pthread_mutex_lock.c:78
#2  0x00007f9fd8f8ef8f in GC_call_with_alloc_lock (fn=fn <at> entry=0x7f9fd92b0420 <do_copy_weak_entry>, client_data=client_data <at> entry=0x7ffe4b9a0d80) at misc.c:1929
#3  0x00007f9fd92b1270 in copy_weak_entry (dst=0x7ffe4b9a0d70, src=0x759ed0) at weak-set.c:124
#4  weak_set_remove_x (closure=0x8850c0, pred=0x7f9fd92b0440 <eq_predicate>, hash=3944337866184184181, set=0x70cf00) at weak-set.c:615
#5  scm_c_weak_set_remove_x (set=set <at> entry=#<weak-set 756df0>, raw_hash=<optimized out>, pred=pred <at> entry=0x7f9fd92b0440 <eq_predicate>, closure=closure <at> entry=0x8850c0) at weak-set.c:791
#6  0x00007f9fd92b13b0 in scm_weak_set_remove_x (set=#<weak-set 756df0>, obj=obj <at> entry=#<port 2 8850c0>) at weak-set.c:812
#7  0x00007f9fd926f72f in close_port (port=#<port 2 8850c0>, explicit=<optimized out>) at ports.c:884
#8  0x00007f9fd92ad307 in vm_regular_engine (thread=0x7f9fd91b3240 <GC_allocate_ml>, vp=0x7adf30, registers=0x0, resume=-657049556) at vm-engine.c:786
#9  0x00007f9fd92afb37 in scm_call_n (proc=<error reading variable: ERROR: Cannot access memory at address 0xd959b030>0x7f9fd959b030, argv=argv <at> entry=0x7ffe4b9a1018, nargs=nargs <at> entry=1) at vm.c:1257
#10 0x00007f9fd9233017 in scm_primitive_eval (exp=<optimized out>, exp <at> entry=<error reading variable: ERROR: Cannot access memory at address 0xd5677cf8>0x855280) at eval.c:662
#11 0x00007f9fd9233073 in scm_eval (exp=<error reading variable: ERROR: Cannot access memory at address 0xd5677cf8>0x855280, module_or_state=module_or_state <at> entry=<error reading variable: ERROR: Cannot access memory at address 0xd95580d8>0x83d140) at eval.c:696
#12 0x00007f9fd927e8d0 in scm_shell (argc=2, argv=0x7ffe4b9a1668) at script.c:454
#13 0x00007f9fd9249a9d in invoke_main_func (body_data=0x7ffe4b9a1510) at init.c:340
#14 0x00007f9fd922c28a in c_body (d=0x7ffe4b9a1450) at continuations.c:422
#15 0x00007f9fd92ad307 in vm_regular_engine (thread=0x7f9fd91b3240 <GC_allocate_ml>, vp=0x7adf30, registers=0x0, resume=-657049556) at vm-engine.c:786
#16 0x00007f9fd92afb37 in scm_call_n (proc=proc <at> entry=#<smob catch-closure 795120>, argv=argv <at> entry=0x0, nargs=nargs <at> entry=0) at vm.c:1257
#17 0x00007f9fd9231e69 in scm_call_0 (proc=proc <at> entry=#<smob catch-closure 795120>) at eval.c:481
#18 0x00007f9fd929e7b2 in catch (tag=tag <at> entry=#t, thunk=#<smob catch-closure 795120>, handler=<error reading variable: ERROR: Cannot access memory at address 0x400000000>0x7950c0, pre_unwind_handler=<error reading variable: ERROR: Cannot access memory at address 0x400000000>0x7950a0) at throw.c:137
#19 0x00007f9fd929ea95 in scm_catch_with_pre_unwind_handler (key=key <at> entry=#t, thunk=<optimized out>, handler=<optimized out>, pre_unwind_handler=<optimized out>) at throw.c:254
#20 0x00007f9fd929ec5f in scm_c_catch (tag=tag <at> entry=#t, body=body <at> entry=0x7f9fd922c280 <c_body>, body_data=body_data <at> entry=0x7ffe4b9a1450, handler=handler <at> entry=0x7f9fd922c510 <c_handler>, handler_data=handler_data <at> entry=0x7ffe4b9a1450, pre_unwind_handler=pre_unwind_handler <at> entry=0x7f9fd922c370 <pre_unwind_handler>, pre_unwind_handler_data=0x7a9bc0) at throw.c:377
#21 0x00007f9fd922c870 in scm_i_with_continuation_barrier (body=body <at> entry=0x7f9fd922c280 <c_body>, body_data=body_data <at> entry=0x7ffe4b9a1450, handler=handler <at> entry=0x7f9fd922c510 <c_handler>, handler_data=handler_data <at> entry=0x7ffe4b9a1450, pre_unwind_handler=pre_unwind_handler <at> entry=0x7f9fd922c370 <pre_unwind_handler>, pre_unwind_handler_data=0x7a9bc0) at continuations.c:360
#22 0x00007f9fd922c905 in scm_c_with_continuation_barrier (func=<optimized out>, data=<optimized out>) at continuations.c:456
#23 0x00007f9fd929d3ec in with_guile (base=base <at> entry=0x7ffe4b9a14b8, data=data <at> entry=0x7ffe4b9a14e0) at threads.c:661
#24 0x00007f9fd8f8efb8 in GC_call_with_stack_base (fn=fn <at> entry=0x7f9fd929d3a0 <with_guile>, arg=arg <at> entry=0x7ffe4b9a14e0) at misc.c:1949
#25 0x00007f9fd929d708 in scm_i_with_guile (dynamic_state=<optimized out>, data=data <at> entry=0x7ffe4b9a14e0, func=func <at> entry=0x7f9fd9249a80 <invoke_main_func>) at threads.c:704
#26 scm_with_guile (func=func <at> entry=0x7f9fd9249a80 <invoke_main_func>, data=data <at> entry=0x7ffe4b9a1510) at threads.c:710
#27 0x00007f9fd9249c32 in scm_boot_guile (argc=argc <at> entry=2, argv=argv <at> entry=0x7ffe4b9a1668, main_func=main_func <at> entry=0x400cb0 <inner_main>, closure=closure <at> entry=0x0) at init.c:323
#28 0x0000000000400b70 in main (argc=2, argv=0x7ffe4b9a1668) at guile.c:101
(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0x7f9fd972eb80 (LWP 15573) "guile" __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
--8<---------------cut here---------------end--------------->8---

So it seems quite clear that the thing has the alloc lock taken.  I
suppose this can happen if one of the libgc threads runs right when we
call fork and takes the alloc lock, right?

If that is correct, the fix would be to call fork within
‘GC_call_with_alloc_lock’.

How does that sound?

As a workaround on the Guix side, we might achieve the same effect by
calling ‘gc-disable’ right before ‘primitive-fork’.
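Concretely, that workaround would be little more than this (untested
sketch):

(define (fork/gc-disabled)
  ;; Disable GC before forking and re-enable it afterwards, on both the
  ;; parent and child sides.
  (gc-disable)
  (let ((pid (primitive-fork)))
    (gc-enable)
    pid))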

Ludo’.




Information forwarded to bug-guix <at> gnu.org:
bug#31925; Package guix. (Thu, 05 Jul 2018 03:36:02 GMT) Full text and rfc822 format available.

Message #19 received at 31925 <at> debbugs.gnu.org (full text, mbox):

From: Mark H Weaver <mhw <at> netris.org>
To: ludo <at> gnu.org (Ludovic Courtès)
Cc: Andy Wingo <wingo <at> igalia.com>, 31925 <at> debbugs.gnu.org
Subject: Re: bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27
Date: Wed, 04 Jul 2018 23:33:52 -0400
Hi Ludovic,

ludo <at> gnu.org (Ludovic Courtès) writes:

> (+Cc: Andy as the ultimate authority for all these things.  :-))
>
> ludo <at> gnu.org (Ludovic Courtès) skribis:
>
>> (let loop ((files files)
>>            (n 0))
>>   (match files
>>     ((file . tail)
>>      (call-with-input-file file
>>        (lambda (port)
>>          (call-with-decompressed-port 'gzip port
>>            (lambda (port)
>>              (let loop ()
>>                (unless (eof-object? (get-bytevector-n port 777))
>>                  (loop)))))))
>>      ;; (pk 'loop n file)
>>      (display ".")
>>      (loop tail (+ n 1)))))
>
> One problem I’ve noticed is that the child process that
> ‘call-with-decompressed-port’ spawns would be stuck trying to get the
> allocation lock:
>
> (gdb) bt
> #0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
> #1  0x00007f9fd8d5cb25 in __GI___pthread_mutex_lock (mutex=0x7f9fd91b3240 <GC_allocate_ml>) at ../nptl/pthread_mutex_lock.c:78
> #2  0x00007f9fd8f8ef8f in GC_call_with_alloc_lock (fn=fn <at> entry=0x7f9fd92b0420 <do_copy_weak_entry>, client_data=client_data <at> entry=0x7ffe4b9a0d80) at misc.c:1929
> #3  0x00007f9fd92b1270 in copy_weak_entry (dst=0x7ffe4b9a0d70, src=0x759ed0) at weak-set.c:124
> #4  weak_set_remove_x (closure=0x8850c0, pred=0x7f9fd92b0440 <eq_predicate>, hash=3944337866184184181, set=0x70cf00) at weak-set.c:615
> #5  scm_c_weak_set_remove_x (set=set <at> entry=#<weak-set 756df0>, raw_hash=<optimized out>, pred=pred <at> entry=0x7f9fd92b0440 <eq_predicate>, closure=closure <at> entry=0x8850c0) at weak-set.c:791
> #6  0x00007f9fd92b13b0 in scm_weak_set_remove_x (set=#<weak-set 756df0>, obj=obj <at> entry=#<port 2 8850c0>) at weak-set.c:812
> #7  0x00007f9fd926f72f in close_port (port=#<port 2 8850c0>, explicit=<optimized out>) at ports.c:884
> #8  0x00007f9fd92ad307 in vm_regular_engine (thread=0x7f9fd91b3240 <GC_allocate_ml>, vp=0x7adf30, registers=0x0, resume=-657049556) at vm-engine.c:786
> #9  0x00007f9fd92afb37 in scm_call_n (proc=<error reading variable: ERROR: Cannot access memory at address 0xd959b030>0x7f9fd959b030, argv=argv <at> entry=0x7ffe4b9a1018, nargs=nargs <at> entry=1) at vm.c:1257
> #10 0x00007f9fd9233017 in scm_primitive_eval (exp=<optimized out>, exp <at> entry=<error reading variable: ERROR: Cannot access memory at address 0xd5677cf8>0x855280) at eval.c:662
> #11 0x00007f9fd9233073 in scm_eval (exp=<error reading variable: ERROR: Cannot access memory at address 0xd5677cf8>0x855280, module_or_state=module_or_state <at> entry=<error reading variable: ERROR: Cannot access memory at address 0xd95580d8>0x83d140) at eval.c:696
> #12 0x00007f9fd927e8d0 in scm_shell (argc=2, argv=0x7ffe4b9a1668) at script.c:454
> #13 0x00007f9fd9249a9d in invoke_main_func (body_data=0x7ffe4b9a1510) at init.c:340
> #14 0x00007f9fd922c28a in c_body (d=0x7ffe4b9a1450) at continuations.c:422
> #15 0x00007f9fd92ad307 in vm_regular_engine (thread=0x7f9fd91b3240 <GC_allocate_ml>, vp=0x7adf30, registers=0x0, resume=-657049556) at vm-engine.c:786
> #16 0x00007f9fd92afb37 in scm_call_n (proc=proc <at> entry=#<smob catch-closure 795120>, argv=argv <at> entry=0x0, nargs=nargs <at> entry=0) at vm.c:1257
> #17 0x00007f9fd9231e69 in scm_call_0 (proc=proc <at> entry=#<smob catch-closure 795120>) at eval.c:481
> #18 0x00007f9fd929e7b2 in catch (tag=tag <at> entry=#t, thunk=#<smob catch-closure 795120>, handler=<error reading variable: ERROR: Cannot access memory at address 0x400000000>0x7950c0, pre_unwind_handler=<error reading variable: ERROR: Cannot access memory at address 0x400000000>0x7950a0) at throw.c:137
> #19 0x00007f9fd929ea95 in scm_catch_with_pre_unwind_handler (key=key <at> entry=#t, thunk=<optimized out>, handler=<optimized out>, pre_unwind_handler=<optimized out>) at throw.c:254
> #20 0x00007f9fd929ec5f in scm_c_catch (tag=tag <at> entry=#t, body=body <at> entry=0x7f9fd922c280 <c_body>, body_data=body_data <at> entry=0x7ffe4b9a1450, handler=handler <at> entry=0x7f9fd922c510 <c_handler>, handler_data=handler_data <at> entry=0x7ffe4b9a1450, pre_unwind_handler=pre_unwind_handler <at> entry=0x7f9fd922c370 <pre_unwind_handler>, pre_unwind_handler_data=0x7a9bc0) at throw.c:377
> #21 0x00007f9fd922c870 in scm_i_with_continuation_barrier (body=body <at> entry=0x7f9fd922c280 <c_body>, body_data=body_data <at> entry=0x7ffe4b9a1450, handler=handler <at> entry=0x7f9fd922c510 <c_handler>, handler_data=handler_data <at> entry=0x7ffe4b9a1450, pre_unwind_handler=pre_unwind_handler <at> entry=0x7f9fd922c370 <pre_unwind_handler>, pre_unwind_handler_data=0x7a9bc0) at continuations.c:360
> #22 0x00007f9fd922c905 in scm_c_with_continuation_barrier (func=<optimized out>, data=<optimized out>) at continuations.c:456
> #23 0x00007f9fd929d3ec in with_guile (base=base <at> entry=0x7ffe4b9a14b8, data=data <at> entry=0x7ffe4b9a14e0) at threads.c:661
> #24 0x00007f9fd8f8efb8 in GC_call_with_stack_base (fn=fn <at> entry=0x7f9fd929d3a0 <with_guile>, arg=arg <at> entry=0x7ffe4b9a14e0) at misc.c:1949
> #25 0x00007f9fd929d708 in scm_i_with_guile (dynamic_state=<optimized out>, data=data <at> entry=0x7ffe4b9a14e0, func=func <at> entry=0x7f9fd9249a80 <invoke_main_func>) at threads.c:704
> #26 scm_with_guile (func=func <at> entry=0x7f9fd9249a80 <invoke_main_func>, data=data <at> entry=0x7ffe4b9a1510) at threads.c:710
> #27 0x00007f9fd9249c32 in scm_boot_guile (argc=argc <at> entry=2, argv=argv <at> entry=0x7ffe4b9a1668, main_func=main_func <at> entry=0x400cb0 <inner_main>, closure=closure <at> entry=0x0) at init.c:323
> #28 0x0000000000400b70 in main (argc=2, argv=0x7ffe4b9a1668) at guile.c:101
> (gdb) info threads
>   Id   Target Id         Frame 
> * 1    Thread 0x7f9fd972eb80 (LWP 15573) "guile" __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
>
> So it seems quite clear that the thing has the alloc lock taken.  I
> suppose this can happen if one of the libgc threads runs right when we
> call fork and takes the alloc lock, right?

Does libgc spawn threads that run concurrently with user threads?  If
so, that would be news to me.  My understanding was that incremental
marking occurs within GC allocation calls, and marking threads are only
spawned after all user threads have been stopped, but I could be wrong.

The first idea that comes to my mind is that perhaps the finalization
thread is holding the GC allocation lock when 'fork' is called.  The
finalization thread grabs the GC allocation lock every time it calls
'GC_invoke_finalizers'.  All ports backed by POSIX file descriptors
(including pipes) register finalizers and therefore spawn the
finalization thread and make work for it to do.

Another possibility: both the finalization thread and the signal
delivery thread call 'scm_without_guile', which calls 'GC_do_blocking',
which also temporarily grabs the GC allocation lock before calling the
specified function.  See 'GC_do_blocking_inner' in pthread_support.c in
libgc.  You spawn the signal delivery thread by calling 'sigaction' and
you make work for it to do every second when the SIGALRM is delivered.

> If that is correct, the fix would be to call fork within
> ‘GC_call_with_alloc_lock’.
>
> How does that sound?

Sure, sounds good to me.

> As a workaround on the Guix side, we might achieve the same effect by
> calling ‘gc-disable’ right before ‘primitive-fork’.

I don't think this would help.

     Thanks,
       Mark




Information forwarded to bug-guix <at> gnu.org:
bug#31925; Package guix. (Thu, 05 Jul 2018 08:02:01 GMT) Full text and rfc822 format available.

Message #22 received at 31925 <at> debbugs.gnu.org (full text, mbox):

From: Andy Wingo <wingo <at> igalia.com>
To: Mark H Weaver <mhw <at> netris.org>
Cc: 31925 <at> debbugs.gnu.org, Ludovic Courtès <ludo <at> gnu.org>
Subject: Re: bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27
Date: Thu, 05 Jul 2018 10:00:52 +0200
Hi!

On Thu 05 Jul 2018 05:33, Mark H Weaver <mhw <at> netris.org> writes:

>> One problem I’ve noticed is that the child process that
>> ‘call-with-decompressed-port’ spawns would be stuck trying to get the
>> allocation lock:
>>
>> So it seems quite clear that the thing has the alloc lock taken.  I
>> suppose this can happen if one of the libgc threads runs right when we
>> call fork and takes the alloc lock, right?
>
> Does libgc spawn threads that run concurrently with user threads?  If
> so, that would be news to me.  My understanding was that incremental
> marking occurs within GC allocation calls, and marking threads are only
> spawned after all user threads have been stopped, but I could be wrong.

I think Mark is correct.

> The first idea that comes to my mind is that perhaps the finalization
> thread is holding the GC allocation lock when 'fork' is called.

So of course we agree you're only supposed to "fork" when there are no
other threads running, I think.

As far as the finalizer thread goes, "primitive-fork" calls
"scm_i_finalizer_pre_fork" which should join the finalizer thread,
before the fork.  There could be a bug obviously but the intention is
for Guile to shut down its internal threads.  Here's the body of
primitive-fork fwiw:

    {
      int pid;
      scm_i_finalizer_pre_fork ();
      if (scm_ilength (scm_all_threads ()) != 1)
        /* Other threads may be holding on to resources that Guile needs --
           it is not safe to permit one thread to fork while others are
           running.
    
           In addition, POSIX clearly specifies that if a multi-threaded
           program forks, the child must only call functions that are
           async-signal-safe.  We can't guarantee that in general.  The best
           we can do is to allow forking only very early, before any call to
           sigaction spawns the signal-handling thread.  */
        scm_display
          (scm_from_latin1_string
           ("warning: call to primitive-fork while multiple threads are running;\n"
            "         further behavior unspecified.  See \"Processes\" in the\n"
            "         manual, for more information.\n"),
           scm_current_warning_port ());
      pid = fork ();
      if (pid == -1)
        SCM_SYSERROR;
      return scm_from_int (pid);
    }

> Another possibility: both the finalization thread and the signal
> delivery thread call 'scm_without_guile', which calls 'GC_do_blocking',
> which also temporarily grabs the GC allocation lock before calling the
> specified function.  See 'GC_do_blocking_inner' in pthread_support.c in
> libgc.  You spawn the signal delivery thread by calling 'sigaction' and
> you make work for it to do every second when the SIGALRM is delivered.

The signal thread is a possibility though in that case you'd get a
warning; the signal-handling thread appears in scm_all_threads.  Do you
see a warning?  If you do, that is a problem :)

>> If that is correct, the fix would be to call fork within
>> ‘GC_call_with_alloc_lock’.
>>
>> How does that sound?
>
> Sure, sounds good to me.

I don't think this is necessary.  I think the problem is that other
threads are running.  If we solve that, then we solve this issue; if we
don't solve that, we don't know what else those threads are doing, so we
don't know what mutexes and other state they might have.

Andy




Information forwarded to bug-guix <at> gnu.org:
bug#31925; Package guix. (Thu, 05 Jul 2018 08:36:02 GMT) Full text and rfc822 format available.

Message #25 received at 31925 <at> debbugs.gnu.org (full text, mbox):

From: ludo <at> gnu.org (Ludovic Courtès)
To: Mark H Weaver <mhw <at> netris.org>
Cc: Andy Wingo <wingo <at> igalia.com>, 31925 <at> debbugs.gnu.org
Subject: Re: bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27
Date: Thu, 05 Jul 2018 10:34:38 +0200
[Message part 1 (text/plain, inline)]
Hello Mark,

Thanks for chiming in!

Mark H Weaver <mhw <at> netris.org> skribis:

> Does libgc spawn threads that run concurrently with user threads?  If
> so, that would be news to me.  My understanding was that incremental
> marking occurs within GC allocation calls, and marking threads are only
> spawned after all user threads have been stopped, but I could be wrong.

libgc launches mark threads as soon as it is initialized, I think.

> The first idea that comes to my mind is that perhaps the finalization
> thread is holding the GC allocation lock when 'fork' is called.  The
> finalization thread grabs the GC allocation lock every time it calls
> 'GC_invoke_finalizers'.  All ports backed by POSIX file descriptors
> (including pipes) register finalizers and therefore spawn the
> finalization thread and make work for it to do.

In 2.2 there’s scm_i_finalizer_pre_fork that takes care of shutting down
the finalization thread right before fork.  So the finalization thread
cannot be blamed, AIUI.

> Another possibility: both the finalization thread and the signal
> delivery thread call 'scm_without_guile', which calls 'GC_do_blocking',
> which also temporarily grabs the GC allocation lock before calling the
> specified function.  See 'GC_do_blocking_inner' in pthread_support.c in
> libgc.  You spawn the signal delivery thread by calling 'sigaction' and
> you make work for it to do every second when the SIGALRM is delivered.

That’s definitely a possibility: the signal thread could be allocating
stuff, and thereby taking the alloc lock just at that time.

>> If that is correct, the fix would be to call fork within
>> ‘GC_call_with_alloc_lock’.
>>
>> How does that sound?
>
> Sure, sounds good to me.

Here’s a patch:

[Message part 2 (text/x-patch, inline)]
diff --git a/libguile/posix.c b/libguile/posix.c
index b0fcad5fd..088e75631 100644
--- a/libguile/posix.c
+++ b/libguile/posix.c
@@ -1209,6 +1209,13 @@ SCM_DEFINE (scm_execle, "execle", 2, 0, 1,
 #undef FUNC_NAME
 
 #ifdef HAVE_FORK
+static void *
+do_fork (void *pidp)
+{
+  * (int *) pidp = fork ();
+  return NULL;
+}
+
 SCM_DEFINE (scm_fork, "primitive-fork", 0, 0, 0,
             (),
 	    "Creates a new \"child\" process by duplicating the current \"parent\" process.\n"
@@ -1236,7 +1243,13 @@ SCM_DEFINE (scm_fork, "primitive-fork", 0, 0, 0,
         "         further behavior unspecified.  See \"Processes\" in the\n"
         "         manual, for more information.\n"),
        scm_current_warning_port ());
-  pid = fork ();
+
+  /* Take the alloc lock to make sure it is released when the child
+     process starts.  Failing to do that the child process could start
+     in a state where the alloc lock is taken and will never be
+     released.  */
+  GC_call_with_alloc_lock (do_fork, &pid);
+
   if (pid == -1)
     SCM_SYSERROR;
   return scm_from_int (pid);
[Message part 3 (text/plain, inline)]
Thoughts?

Unfortunately my ‘call-with-decompressed-port’ reproducer doesn’t seem
to reproduce much today, so I can’t tell if this helps (I let it run for
more than 5 minutes with the supposedly-buggy Guile and nothing
happened…).

Thanks,
Ludo’.

Information forwarded to bug-guix <at> gnu.org:
bug#31925; Package guix. (Thu, 05 Jul 2018 10:08:01 GMT) Full text and rfc822 format available.

Message #28 received at 31925 <at> debbugs.gnu.org (full text, mbox):

From: Mark H Weaver <mhw <at> netris.org>
To: Andy Wingo <wingo <at> igalia.com>
Cc: 31925 <at> debbugs.gnu.org, Ludovic Courtès <ludo <at> gnu.org>
Subject: Re: bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27
Date: Thu, 05 Jul 2018 06:05:59 -0400
Hi,

Andy Wingo <wingo <at> igalia.com> writes:

> On Thu 05 Jul 2018 05:33, Mark H Weaver <mhw <at> netris.org> writes:
>
>>> One problem I’ve noticed is that the child process that
>>> ‘call-with-decompressed-port’ spawns would be stuck trying to get the
>>> allocation lock:
>>>
>>> So it seems quite clear that the thing has the alloc lock taken.  I
>>> suppose this can happen if one of the libgc threads runs right when we
>>> call fork and takes the alloc lock, right?
>>
>> Does libgc spawn threads that run concurrently with user threads?  If
>> so, that would be news to me.  My understanding was that incremental
>> marking occurs within GC allocation calls, and marking threads are only
>> spawned after all user threads have been stopped, but I could be wrong.
>
> I think Mark is correct.

Actually, looking at the libgc code more closely, it seems that Ludovic
is correct.  GC_init calls GC_start_mark_threads_inner if PARALLEL_MARK
is defined at compile time.

However, it's also the case that libgc uses 'pthread_atfork' (where
available) to arrange to grab the GC allocation lock as well as the
"mark locks" in the case where parallel marking is enabled.  See
fork_prepare_proc, fork_parent_proc, and fork_child_proc in
pthread_support.c.

As the libgc developers admit in the comment above 'fork_prepare_proc',
they are not strictly meeting the requirements for safe use of 'fork',
but they _are_ grabbing the allocation lock during 'fork', at least on
systems that support 'pthread_atfork'.

It looks like setting the GC_MARKERS environment variable to 1 should
result in 'available_markers_m1' being set to 0, and thus effectively
disable parallel marking.  In that case, if I understand the code
correctly, no marker threads will be spawned.

It would be interesting to see if this problem can be reproduced when
GC_MARKERS is set to 1.
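For instance, the reproducer from message #13 above could be relaunched
with parallel marking turned off, which amounts to running it with
GC_MARKERS=1 in the environment:

;; GC_MARKERS must be set before libgc initializes, so put it in the
;; environment of a fresh child process that runs the reproducer.
(setenv "GC_MARKERS" "1")
(system* "guile" "finalization-thread-proc.scm")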

>> The first idea that comes to my mind is that perhaps the finalization
>> thread is holding the GC allocation lock when 'fork' is called.
>
> So of course we agree you're only supposed to "fork" when there are no
> other threads running, I think.
>
> As far as the finalizer thread goes, "primitive-fork" calls
> "scm_i_finalizer_pre_fork" which should join the finalizer thread,
> before the fork.  There could be a bug obviously but the intention is
> for Guile to shut down its internal threads.

Ah, good!  I didn't know this.  So, I guess the problem is most likely
elsewhere.

We should probably arrange to join the signal delivery thread at the
same time, and then to respawn it in the parent and child if needed.
What do you think?

> Here's the body of primitive-fork fwiw:
>
>     {
>       int pid;
>       scm_i_finalizer_pre_fork ();
>       if (scm_ilength (scm_all_threads ()) != 1)

I think there's a race here.  I think it's possible for the finalizer
thread to be respawned after 'scm_i_finalizer_pre_fork' in two different
ways:

(1) 'scm_all_threads' performs allocation, which could trigger GC.

(2) another thread could perform heap allocation and trigger GC after
    'scm_i_finalizer_pre_fork' joins the thread.  It might then shut
    down before 'scm_all_threads' is called.

However, these are highly unlikely scenarios, and most likely not the
problem we're seeing here.

Still, I think the 'scm_i_finalizer_pre_fork' should be moved after the
'if', to avoid this race.

>> Another possibility: both the finalization thread and the signal
>> delivery thread call 'scm_without_guile', which calls 'GC_do_blocking',
>> which also temporarily grabs the GC allocation lock before calling the
>> specified function.  See 'GC_do_blocking_inner' in pthread_support.c in
>> libgc.  You spawn the signal delivery thread by calling 'sigaction' and
>> you make work for it to do every second when the SIGALRM is delivered.
>
> The signal thread is a possibility though in that case you'd get a
> warning; the signal-handling thread appears in scm_all_threads.  Do you
> see a warning?  If you do, that is a problem :)

Good point!

>>> If that is correct, the fix would be to call fork within
>>> ‘GC_call_with_alloc_lock’.
>>>
>>> How does that sound?
>>
>> Sure, sounds good to me.
>
> I don't think this is necessary.  I think the problem is that other
> threads are running.  If we solve that, then we solve this issue; if we
> don't solve that, we don't know what else those threads are doing, so we
> don't know what mutexes and other state they might have.

On second thought, I agree with Andy here.  Grabbing the allocation lock
shouldn't be needed, and moreover it's not sufficient.  I think we
really need to ensure that no other threads are running, in which case
grabbing the allocation lock is pointless.  Anyway, it seems that libgc
is already arranging to do this, at least on modern GNU/Linux systems.

     Thanks,
       Mark




Information forwarded to bug-guix <at> gnu.org:
bug#31925; Package guix. (Thu, 05 Jul 2018 12:28:02 GMT) Full text and rfc822 format available.

Message #31 received at 31925 <at> debbugs.gnu.org (full text, mbox):

From: ludo <at> gnu.org (Ludovic Courtès)
To: Andy Wingo <wingo <at> igalia.com>
Cc: Mark H Weaver <mhw <at> netris.org>, 31925 <at> debbugs.gnu.org
Subject: Re: bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27
Date: Thu, 05 Jul 2018 14:27:20 +0200
Hello,

Andy Wingo <wingo <at> igalia.com> skribis:

> On Thu 05 Jul 2018 05:33, Mark H Weaver <mhw <at> netris.org> writes:

[...]

>> Another possibility: both the finalization thread and the signal
>> delivery thread call 'scm_without_guile', which calls 'GC_do_blocking',
>> which also temporarily grabs the GC allocation lock before calling the
>> specified function.  See 'GC_do_blocking_inner' in pthread_support.c in
>> libgc.  You spawn the signal delivery thread by calling 'sigaction' and
>> you make work for it to do every second when the SIGALRM is delivered.
>
> The signal thread is a possibility though in that case you'd get a
> warning; the signal-handling thread appears in scm_all_threads.  Do you
> see a warning?  If you do, that is a problem :)

I don’t see a warning.

But as a Guile user, I shouldn’t see a warning just because there’s a
signal thread anyway; it’s not a thread I spawned myself.

The weird thing is that the signal thread always exists, and it’s
surprising IMO that it shows up in ‘scm_all_threads’ because it’s not a
“user thread”.  The other surprise is that the warning isn’t triggered:

--8<---------------cut here---------------start------------->8---
$ guile
GNU Guile 2.2.4
Copyright (C) 1995-2017 Free Software Foundation, Inc.

Guile comes with ABSOLUTELY NO WARRANTY; for details type `,show w'.
This program is free software, and you are welcome to redistribute it
under certain conditions; type `,show c' for details.

Enter `,help' for help.
scheme@(guile-user)> (all-threads)
$1 = (#<thread 140488743438080 (1b06c40)> #<thread 140488815434624 (1b06e00)>)
scheme@(guile-user)> (when (zero? (primitive-fork)) (primitive-_exit 0))
;; no warning
--8<---------------cut here---------------end--------------->8---

Ludo’, surprised.  :-)




Information forwarded to bug-guix <at> gnu.org:
bug#31925; Package guix. (Thu, 05 Jul 2018 14:05:01 GMT) Full text and rfc822 format available.

Message #34 received at 31925 <at> debbugs.gnu.org (full text, mbox):

From: Andy Wingo <wingo <at> igalia.com>
To: Mark H Weaver <mhw <at> netris.org>
Cc: 31925 <at> debbugs.gnu.org, Ludovic Courtès <ludo <at> gnu.org>
Subject: Re: bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27
Date: Thu, 05 Jul 2018 16:04:12 +0200
Hi,

On Thu 05 Jul 2018 12:05, Mark H Weaver <mhw <at> netris.org> writes:

> However, it's also the case that libgc uses 'pthread_atfork' (where
> available) to arrange to grab the GC allocation lock as well as the
> "mark locks" in the case where parallel marking is enabled.  See
> fork_prepare_proc, fork_parent_proc, and fork_child_proc in
> pthread_support.c.

I don't think this is enabled by default.  You have to configure your
libgc this way.  When investigating similar bugs, I proposed enabling it
by default a while ago:

  http://www.hpl.hp.com/hosted/linux/mail-archives/gc/2012-February/004958.html

I ended up realizing that pthread_atfork was just a bogus interface.
For one, it turns out that POSIX clearly says that if a multithreaded
program forks, the behavior of the child after the fork is undefined if
it calls any non-async-signal-safe function before calling exec():

  https://lists.gnu.org/archive/html/guile-devel/2012-02/msg00157.html

But practically, the only reasonable thing to do with atfork is to grab
all of the locks, then release them after forking, in both child and
parent.  However you can't do this without deadlocking from a library,
as the total lock order is a property of the program and not something a
library can decide.

There are thus two solutions: either ensure that there are no other
threads when you fork, or only call async-signal-safe functions before
you exec().  open-process does the latter.  fork will warn if the former
is not the case.  When last I looked into this, I concluded that
pthread_atfork doesn't buy us anything, though I could be wrong!
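At the Scheme level, the second option is what (ice-9 popen), built on
open-process, already gives you: the child sticks to async-signal-safe
calls until it execs.  For instance, decompression could go through an
external gzip like this (a sketch; the procedure name is made up):

(use-modules (ice-9 popen))

(define (gzip-decompressed-port file)
  ;; Spawn an external gzip via fork+exec and return a port from which
  ;; the decompressed contents of FILE can be read.
  (open-pipe* OPEN_READ "gzip" "-dc" file))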

>> Here's the body of primitive-fork fwiw:
>>
>>     {
>>       int pid;
>>       scm_i_finalizer_pre_fork ();
>>       if (scm_ilength (scm_all_threads ()) != 1)
>
> I think there's a race here.  I think it's possible for the finalizer
> thread to be respawned after 'scm_i_finalizer_pre_fork' in two different
> ways:
>
> (1) 'scm_all_threads' performs allocation, which could trigger GC.
>
> (2) another thread could perform heap allocation and trigger GC after
>     'scm_i_finalizer_pre_fork' joins the thread.  it might then shut
>     down before 'scm_all_thread' is called.
>
> However, these are highly unlikely scenarios, and most likely not the
> problem we're seeing here.
>
> Still, I think the 'scm_i_finalizer_pre_fork' should be moved after the
> 'if', to avoid this race.

Good point!  Probably we should use some non-allocating
scm_i_is_multi_threaded() or something.  We can't move the pre-fork call
after the 'if', though, because one of the effects we are looking for is
to reduce the thread count!

Cheers,

Andy




Information forwarded to bug-guix <at> gnu.org:
bug#31925; Package guix. (Thu, 05 Jul 2018 14:09:01 GMT) Full text and rfc822 format available.

Message #37 received at 31925 <at> debbugs.gnu.org (full text, mbox):

From: Andy Wingo <wingo <at> igalia.com>
To: ludo <at> gnu.org (Ludovic Courtès)
Cc: Mark H Weaver <mhw <at> netris.org>, 31925 <at> debbugs.gnu.org
Subject: Re: bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27
Date: Thu, 05 Jul 2018 16:08:45 +0200
On Thu 05 Jul 2018 14:27, ludo <at> gnu.org (Ludovic Courtès) writes:

> Hello,
>
> Andy Wingo <wingo <at> igalia.com> skribis:
>
>> The signal thread is a possibility though in that case you'd get a
>> warning; the signal-handling thread appears in scm_all_threads.  Do you
>> see a warning?  If you do, that is a problem :)
>
> I don’t see a warning.
>
> But as a Guile user, I shouldn’t see a warning just because there’s a
> signal thread anyway; it’s not a thread I spawned myself.

I understand, but that's how it works.  If we want to change this, we
probably need an interface similar to the one we have for finalization.

> The weird thing is that the signal thread always exists, and it’s
> surprising IMO that it shows up in ‘scm_all_threads’ because it’s not a
> “user thread”.  The other surprise is that the warning isn’t triggered:
>
> $ guile
> GNU Guile 2.2.4
> Copyright (C) 1995-2017 Free Software Foundation, Inc.
>
> Guile comes with ABSOLUTELY NO WARRANTY; for details type `,show w'.
> This program is free software, and you are welcome to redistribute it
> under certain conditions; type `,show c' for details.
>
> Enter `,help' for help.
> scheme@(guile-user)> (all-threads)
> $1 = (#<thread 140488743438080 (1b06c40)> #<thread 140488815434624 (1b06e00)>)
> scheme@(guile-user)> (when (zero? (primitive-fork)) (primitive-_exit 0))
> ;; no warning

Are you certain that this is the signal-handling thread and not the
finalizer thread?  I suspect it is the finalizer thread, and that it
gets properly shut down before the fork.

Regarding seeing the warning: do you make some other binding for the
default warning port in Guix?

Andy




Information forwarded to bug-guix <at> gnu.org:
bug#31925; Package guix. (Fri, 06 Jul 2018 15:36:02 GMT) Full text and rfc822 format available.

Message #40 received at 31925 <at> debbugs.gnu.org (full text, mbox):

From: ludo <at> gnu.org (Ludovic Courtès)
To: Andy Wingo <wingo <at> igalia.com>
Cc: Mark H Weaver <mhw <at> netris.org>, 31925 <at> debbugs.gnu.org
Subject: Re: bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27
Date: Fri, 06 Jul 2018 17:35:22 +0200
Hello Andy,

Andy Wingo <wingo <at> igalia.com> skribis:

> On Thu 05 Jul 2018 14:27, ludo <at> gnu.org (Ludovic Courtès) writes:
>
>> Hello,
>>
>> Andy Wingo <wingo <at> igalia.com> skribis:
>>
>>> The signal thread is a possibility though in that case you'd get a
>>> warning; the signal-handling thread appears in scm_all_threads.  Do you
>>> see a warning?  If you do, that is a problem :)
>>
>> I don’t see a warning.
>>
>> But as a Guile user, I shouldn’t see a warning just because there’s a
>> signal thread anyway; it’s not a thread I spawned myself.
>
> I understand, but that's how it works.  If we want to change this, we
> probably need an interface similar to the one we have for finalization.

Right, understood.

>> scheme@(guile-user)> (all-threads)
>> $1 = (#<thread 140488743438080 (1b06c40)> #<thread 140488815434624 (1b06e00)>)
>> scheme@(guile-user)> (when (zero? (primitive-fork)) (primitive-_exit 0))
>> ;; no warning
>
> Are you certain that this is the signal-handling thread and not the
> finalizer thread?  I suspect it is the finalizer thread, and that it
> gets properly shut down before the fork.

Oh, you must be right.

> Regarding seeing the warning: do you make some other binding for the
> default warning port in Guix?

No.

Thanks,
Ludo’.




Added tag(s) unreproducible. Request was from Ludovic Courtès <ludo <at> gnu.org> to control <at> debbugs.gnu.org. (Sat, 19 Oct 2019 20:05:02 GMT) Full text and rfc822 format available.

Bug closed; send any further explanations to 31925 <at> debbugs.gnu.org and ludo <at> gnu.org (Ludovic Courtès). Request was from Ludovic Courtès <ludo <at> gnu.org> to control <at> debbugs.gnu.org. (Sat, 19 Oct 2019 20:05:02 GMT) Full text and rfc822 format available.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 17 Nov 2019 12:24:11 GMT) Full text and rfc822 format available.
