GNU bug report logs -
#58631
[Shepherd] Indefinite heap growth (memory leak)
Previous Next
Reported by: Ludovic Courtès <ludo <at> gnu.org>
Date: Wed, 19 Oct 2022 12:18:02 UTC
Severity: important
Done: Ludovic Courtès <ludo <at> gnu.org>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 58631 in the body.
You can then email your comments to 58631 AT debbugs.gnu.org in the normal way.
Toggle the display of automated, internal messages from the tracker.
Report forwarded
to
bug-guix <at> gnu.org
:
bug#58631
; Package
guix
.
(Wed, 19 Oct 2022 12:18:02 GMT)
Full text and
rfc822 format available.
Acknowledgement sent
to
Ludovic Courtès <ludo <at> gnu.org>
:
New bug report received and forwarded. Copy sent to
bug-guix <at> gnu.org
.
(Wed, 19 Oct 2022 12:18:02 GMT)
Full text and
rfc822 format available.
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Earlier today, berlin was unreachable. From /var/log/messages, this
appears to be due to a crash of shepherd (PID 1); here are its last
words:
--8<---------------cut here---------------start------------->8---
Oct 19 06:32:36 localhost shepherd[1]: Respawning mumi-worker.
Oct 19 06:32:37 localhost shepherd[1]: Service mumi-worker has been started.
Oct 19 06:32:52 localhost ntpd[1837]: Soliciting pool server 193.141.27.6
Oct 19 06:33:56 localhost ntpd[1837]: Soliciting pool server 193.158.22.13
Oct 19 06:35:02 localhost ntpd[1837]: Soliciting pool server 62.75.236.38
Oct 19 06:35:43 localhost vmunix: [1586150.310354] shepherd[87089]: segfault at 0 ip 00007ff3018dc264 sp 00007ffeae066810 error 6 in crash-handler.so[7ff3018dc000+1000]
Oct 19 06:35:43 localhost vmunix: [1586150.322247] Code: ff ff ff e8 6e fe ff ff 48 8d 3d b7 0d 00 00 e8 12 fe ff ff bf 27 00 00 00 31 c0 e8 26 fe ff ff 89 ee 48 89 c7 e8 2c fe ff ff <c7> 04 25 00 00 00 00 00 00 00 00 0f 0b 48 89 c3 be 01 00 00 00 89
Oct 19 06:36:09 localhost ntpd[1837]: Soliciting pool server 51.75.67.47
Oct 19 06:36:40 localhost postgres[1389]: [73-1] 2022-10-19 04:36:40.036 GMT [1389] LOG: using stale statistics instead of current ones because stats collector is not responding
Oct 19 06:36:50 localhost postgres[87362]: [6-1] 2022-10-19 04:36:50.101 GMT [87362] LOG: using stale statistics instead of current ones because stats collector is not responding
Oct 19 06:37:00 localhost postgres[1389]: [74-1] 2022-10-19 04:37:00.062 GMT [1389] LOG: using stale statistics instead of current ones because stats collector is not responding
Oct 19 06:37:10 localhost postgres[87368]: [6-1] 2022-10-19 04:37:10.128 GMT [87368] LOG: using stale statistics instead of current ones because stats collector is not responding
Oct 19 06:37:16 localhost ntpd[1837]: Soliciting pool server 3.64.117.201
Oct 19 12:48:58 localhost syslogd (GNU inetutils 2.0): restart
--8<---------------cut here---------------end--------------->8---
The interesting part is that we have a core dump—see ‘crash-handler.so’
above, which is shepherd’s mechanism to ensure there’s a core dump
before the machine stops. Here’s what we get:
--8<---------------cut here---------------start------------->8---
ludo <at> berlin ~$ gdb -ix ~/.gdbinit /gnu/store/cnfsv9ywaacyafkqdqsv2ry8f01yr7a9-guile-3.0.7/bin/guile /core.shepherd-20221019
GNU gdb (GDB) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /gnu/store/cnfsv9ywaacyafkqdqsv2ry8f01yr7a9-guile-3.0.7/bin/guile...
Reading symbols from /gnu/store/4ws3vh3zrs1yi9lfaibha64chf3vn2rm-guile-3.0.7-debug/lib/debug//gnu/store/cnfsv9ywaacyafkqdqsv2ry8f01yr7a9-guile-3.0.7/bin/guile.debug...
warning: core file may not match specified executable file.
[New LWP 87089]
warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.
Core was generated by `/gnu/store/cnfsv9ywaacyafkqdqsv2ry8f01yr7a9-guile-3.0.7/bin/guile --no-auto-com'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007ff3018dc264 in ?? ()
from /gnu/store/cdc1gzbp3q15kdiwn2i5j3437jwx61ac-shepherd-0.9.2/lib/shepherd/crash-handler.so
(gdb) bt
#0 0x00007ff3018dc264 in ?? ()
from /gnu/store/cdc1gzbp3q15kdiwn2i5j3437jwx61ac-shepherd-0.9.2/lib/shepherd/crash-handler.so
#1 <signal handler called>
#2 0x00007ff3018dc264 in ?? ()
from /gnu/store/cdc1gzbp3q15kdiwn2i5j3437jwx61ac-shepherd-0.9.2/lib/shepherd/crash-handler.so
#3 <signal handler called>
#4 0x00007ff30b2d2030 in raise ()
from /gnu/store/5h2w4qi9hk1qzzgi1w83220ydslinr4s-glibc-2.33/lib/libc.so.6
#5 0x00007ff30b2bc526 in abort ()
from /gnu/store/5h2w4qi9hk1qzzgi1w83220ydslinr4s-glibc-2.33/lib/libc.so.6
#6 0x00007ff30b7bb263 in ?? () from /gnu/store/2lczkxbdbzh4gk7wh91bzrqrk7h5g1dl-libgc-8.0.4/lib/libgc.so.1
#7 0x00007ff30b7c2232 in ?? () from /gnu/store/2lczkxbdbzh4gk7wh91bzrqrk7h5g1dl-libgc-8.0.4/lib/libgc.so.1
#8 0x00007ff30b7c25a8 in ?? () from /gnu/store/2lczkxbdbzh4gk7wh91bzrqrk7h5g1dl-libgc-8.0.4/lib/libgc.so.1
#9 0x00007ff30b7c290f in ?? () from /gnu/store/2lczkxbdbzh4gk7wh91bzrqrk7h5g1dl-libgc-8.0.4/lib/libgc.so.1
#10 0x00007ff30b7c78da in GC_generic_malloc_many ()
from /gnu/store/2lczkxbdbzh4gk7wh91bzrqrk7h5g1dl-libgc-8.0.4/lib/libgc.so.1
#11 0x00007ff30b87fb45 in scm_inline_gc_alloc (kind=SCM_INLINE_GC_KIND_NORMAL, idx=6,
freelist=0x7ff30b246e48) at gc-inline.h:79
#12 allocate_words_with_freelist (thread=0x7ff30b246d80, freelist_idx=6) at intrinsics.c:470
#13 0x00007ff300039f95 in ?? ()
#14 0x00007ff30b246d80 in ?? ()
#15 0x00007ff302a2f7e8 in ?? ()
#16 0x0000000000000040 in ?? ()
#17 0x00007ff30b88bb1c in scm_jit_enter_mcode (thread=0x7ff30b246d80,
mcode=0x7ff30231e75c "B", <incomplete sequence \340>) at jit.c:6038
#18 0x00007ff30b8e80c5 in scm_call_n (proc=<optimized out>, argv=argv <at> entry=0x7ffeae067878,
nargs=nargs <at> entry=1) at vm.c:1602
#19 0x00007ff30b862ea7 in scm_primitive_eval (exp=<optimized out>) at eval.c:671
#20 0x00007ff30b88d729 in scm_primitive_load (filename=<optimized out>) at load.c:131
#21 0x00007ff30b8e5915 in vm_regular_engine (thread=0x7ff30b246d80) at vm-engine.c:972
#22 0x00007ff30b8e8029 in scm_call_n (proc=<optimized out>, argv=argv <at> entry=0x7ffeae067a58,
nargs=nargs <at> entry=1) at vm.c:1608
#23 0x00007ff30b862ea7 in scm_primitive_eval (exp=<optimized out>,
exp <at> entry=((@ (ice-9 control) %) (begin ((@@ (ice-9 command-line) load/lang) "/gnu/store/cdc1gzbp3q15kdiwn2i5j3437jwx61ac-shepherd-0.9.2/bin/shepherd") (quit)))) at eval.c:671
#24 0x00007ff30b862f06 in scm_eval (
exp=((@ (ice-9 control) %) (begin ((@@ (ice-9 command-line) load/lang) "/gnu/store/cdc1gzbp3q15kdiwn2i5j3437jwx61ac-shepherd-0.9.2/bin/shepherd") (quit))),
module_or_state=module_or_state <at> entry="#<struct module>" = {...}) at eval.c:705
#25 0x00007ff30b8bde76 in scm_shell (argc=5, argv=0x7ffeae0680e8) at script.c:357
#26 0x00007ff30b87b36d in invoke_main_func (body_data=0x7ffeae067f80) at init.c:313
#27 0x00007ff30b85cbea in c_body (d=0x7ffeae067ec0) at continuations.c:430
#28 0x00007ff30b8e5915 in vm_regular_engine (thread=0x7ff30b246d80) at vm-engine.c:972
#29 0x00007ff30b8e8029 in scm_call_n (proc=<optimized out>, argv=argv <at> entry=0x7ffeae067c80,
nargs=nargs <at> entry=2) at vm.c:1608
#30 0x00007ff30b861dfa in scm_call_2 (proc=<optimized out>, arg1=<optimized out>, arg2=<optimized out>)
at eval.c:503
#31 0x00007ff30b863529 in scm_c_with_exception_handler (type=type <at> entry=#t,
handler=handler <at> entry=0x7ff30b8dd750 <catch_post_unwind_handler>,
handler_data=handler_data <at> entry=0x7ffeae067df0, thunk=thunk <at> entry=0x7ff30b8dd890 <catch_body>,
thunk_data=thunk_data <at> entry=0x7ffeae067df0) at exceptions.c:170
#32 0x00007ff30b8dda8d in scm_c_catch (tag=tag <at> entry=#t, body=body <at> entry=0x7ff30b85cbe0 <c_body>,
body_data=body_data <at> entry=0x7ffeae067ec0, handler=handler <at> entry=0x7ff30b85ce80 <c_handler>,
handler_data=handler_data <at> entry=0x7ffeae067ec0,
pre_unwind_handler=pre_unwind_handler <at> entry=0x7ff30b85ccd0 <pre_unwind_handler>,
pre_unwind_handler_data=0x7ff3038c4b00) at throw.c:168
#33 0x00007ff30b85d238 in scm_i_with_continuation_barrier (body=0x7ff30b85cbe0 <c_body>,
body_data=0x7ffeae067ec0, handler=0x7ff30b85ce80 <c_handler>, handler_data=0x7ffeae067ec0,
pre_unwind_handler=0x7ff30b85ccd0 <pre_unwind_handler>, pre_unwind_handler_data=0x7ff3038c4b00)
at continuations.c:368
#34 0x00007ff30b85d295 in scm_c_with_continuation_barrier (func=<optimized out>, data=<optimized out>)
at continuations.c:464
#35 0x00007ff30b8dc549 in with_guile (base=0x7ffeae067f28, data=0x7ffeae067f50) at threads.c:645
#36 0x00007ff30b7b90ba in GC_call_with_stack_base ()
from /gnu/store/2lczkxbdbzh4gk7wh91bzrqrk7h5g1dl-libgc-8.0.4/lib/libgc.so.1
#37 0x00007ff30b8dc848 in scm_i_with_guile (dynamic_state=<optimized out>, data=data <at> entry=0x7ffeae067f30,
func=func <at> entry=0x7ff30b87b350 <invoke_main_func>) at threads.c:688
#38 scm_with_guile (func=func <at> entry=0x7ff30b87b350 <invoke_main_func>, data=data <at> entry=0x7ffeae067f80)
at threads.c:694
#39 0x00007ff30b87b4e2 in scm_boot_guile (argc=argc <at> entry=5, argv=argv <at> entry=0x7ffeae0680e8,
main_func=main_func <at> entry=0x401230 <inner_main>, closure=closure <at> entry=0x0) at init.c:296
#40 0x00000000004010f7 in main (argc=5, argv=0x7ffeae0680e8) at guile.c:94
(gdb) info proc status
unable to handle request
(gdb) info proc
exe = '/gnu/store/cnfsv9ywaacyafkqdqsv2ry8f01yr7a9-guile-3.0.7/bin/guile --no-auto-com'
--8<---------------cut here---------------end--------------->8---
Ludo’.
Severity set to 'important' from 'normal'
Request was from
Ludovic Courtès <ludo <at> gnu.org>
to
control <at> debbugs.gnu.org
.
(Wed, 19 Oct 2022 16:18:02 GMT)
Full text and
rfc822 format available.
Information forwarded
to
bug-guix <at> gnu.org
:
bug#58631
; Package
guix
.
(Wed, 19 Oct 2022 16:26:01 GMT)
Full text and
rfc822 format available.
Message #10 received at 58631 <at> debbugs.gnu.org (full text, mbox):
Same backtrace with debugging symbols for libgc and libc:
--8<---------------cut here---------------start------------->8---
(gdb) bt
#0 0x00007ff3018dc264 in ?? ()
from /gnu/store/cdc1gzbp3q15kdiwn2i5j3437jwx61ac-shepherd-0.9.2/lib/shepherd/crash-handler.so
#1 <signal handler called>
#2 0x00007ff3018dc264 in ?? ()
from /gnu/store/cdc1gzbp3q15kdiwn2i5j3437jwx61ac-shepherd-0.9.2/lib/shepherd/crash-handler.so
#3 <signal handler called>
#4 __GI_raise (sig=sig <at> entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
#5 0x00007ff30b2bc526 in __GI_abort () at abort.c:79
#6 0x00007ff30b7bb263 in GC_add_to_heap (bytes=<optimized out>, p=0x7fefd1f70000) at extra/../alloc.c:1232
#7 GC_expand_hp_inner (n=n <at> entry=2048) at extra/../alloc.c:1386
#8 0x00007ff30b7c2232 in GC_collect_or_expand (needed_blocks=needed_blocks <at> entry=1,
ignore_off_page=ignore_off_page <at> entry=0, retry=retry <at> entry=0) at extra/../alloc.c:1480
#9 0x00007ff30b7c25a8 in GC_allocobj (gran=gran <at> entry=7, kind=<optimized out>) at extra/../alloc.c:1543
#10 0x00007ff30b7c290f in GC_generic_malloc_inner (lb=lb <at> entry=112, k=k <at> entry=1) at extra/../malloc.c:191
#11 0x00007ff30b7c78da in GC_generic_malloc_many (lb=lb <at> entry=112, k=k <at> entry=1,
result=result <at> entry=0x7ff30b246e48) at extra/../mallocx.c:473
#12 0x00007ff30b87fb45 in scm_inline_gc_alloc (kind=SCM_INLINE_GC_KIND_NORMAL, idx=6,
freelist=0x7ff30b246e48) at gc-inline.h:79
#13 allocate_words_with_freelist (thread=0x7ff30b246d80, freelist_idx=6) at intrinsics.c:470
#14 0x00007ff300039f95 in ?? ()
#15 0x00007ff30b246d80 in ?? ()
#16 0x00007ff302a2f7e8 in ?? ()
#17 0x0000000000000040 in ?? ()
#18 0x00007ff30b88bb1c in scm_jit_enter_mcode (thread=thread <at> entry=0x7ff30b246d80,
mcode=0x7ff30231e75c "B", <incomplete sequence \340>,
mcode <at> entry=0x7ff302a2f7e8 "I\211\314I)\304I\203\374 \017\205\345\b") at jit.c:6038
#19 0x00007ff30b8e80c5 in scm_call_n (proc=<optimized out>, argv=argv <at> entry=0x7ffeae067878,
nargs=nargs <at> entry=1) at vm.c:1602
#20 0x00007ff30b862ea7 in scm_primitive_eval (exp=<optimized out>) at eval.c:671
#21 0x00007ff30b88d729 in scm_primitive_load (filename=<optimized out>) at load.c:131
#22 0x00007ff30b8e5915 in vm_regular_engine (thread=0x7ff30b246d80) at vm-engine.c:972
#23 0x00007ff30b8e8029 in scm_call_n (proc=<optimized out>, argv=argv <at> entry=0x7ffeae067a58,
nargs=nargs <at> entry=1) at vm.c:1608
#24 0x00007ff30b862ea7 in scm_primitive_eval (exp=<optimized out>,
exp <at> entry=((@ (ice-9 control) %) (begin ((@@ (ice-9 command-line) load/lang) "/gnu/store/cdc1gzbp3q15kdiwn2i5j3437jwx61ac-shepherd-0.9.2/bin/shepherd") (quit)))) at eval.c:671
--8<---------------cut here---------------end--------------->8---
This means we’re hitting this ‘abort’ call:
--8<---------------cut here---------------start------------->8---
GC_INNER void GC_add_to_heap(struct hblk *p, size_t bytes)
{
hdr * phdr;
word endp;
if (GC_n_heap_sects >= MAX_HEAP_SECTS) {
ABORT("Too many heap sections: Increase MAXHINCR or MAX_HEAP_SECTS");
}
--8<---------------cut here---------------end--------------->8---
Indeed, the core dump weighs 13 GiB, so it looks like a memory leak (too
bad ‘info proc stat’ in GDB doesn’t work).
I have not observed it on any other machine. The difference between
berlin and machines I have access to is that berlin is being hammered on
port 22, which means it gets to spawn sshd processes very often. This
could be where the leak is.
Ludo’.
Information forwarded
to
bug-guix <at> gnu.org
:
bug#58631
; Package
guix
.
(Thu, 20 Oct 2022 09:30:02 GMT)
Full text and
rfc822 format available.
Message #13 received at 58631 <at> debbugs.gnu.org (full text, mbox):
Heap usage of shepherd on berlin gathered roughly at T0 (a few hours
after booting), T0+8h, an T0+10h:
--8<---------------cut here---------------start------------->8---
ludo <at> berlin ~$ sudo herd eval root '(gc-stats)'
Evaluating user expression (gc-stats).
((gc-time-taken . 36730013844) (heap-size . 31494144) (heap-free-size . 8339456) (heap-total-allocated . 1198237472) (heap-allocated-since-gc . 7581824) (protected-objects . 0) (gc-times . 157))
ludo <at> berlin ~$ sudo cat /proc/1/maps |grep -A10 heap
016fe000-01810000 rw-p 00000000 00:00 0 [heap]
7f78a387d000-7f78a4000000 rw-p 00000000 00:00 0
7f78a4000000-7f78a4021000 rw-p 00000000 00:00 0
7f78a4021000-7f78a8000000 ---p 00000000 00:00 0
7f78a81d0000-7f78a8600000 rw-p 00000000 00:00 0
7f78a8600000-7f78a8640000 rwxp 00000000 00:00 0
7f78a8640000-7f78a8700000 rw-p 00000000 00:00 0
7f78a8700000-7f78a8740000 rwxp 00000000 00:00 0
7f78a8740000-7f78a8c6c000 rw-p 00000000 00:00 0
7f78a8c6c000-7f78a8cac000 rwxp 00000000 00:00 0
7f78a8cac000-7f78a8ccc000 r--p 00000000 00:17 167921 /gnu/store/9nb1nplyx4gx5pdrmmidngir9jmrzxi3-guile-netlink-1.1.1/lib/guile/3.0/site-ccache/ip/route.go
ludo <at> berlin ~$ sudo herd eval root '(gc-stats)'
Password:
Evaluating user expression (gc-stats).
((gc-time-taken . 2698835942232) (heap-size . 322998272) (heap-free-size . 56537088) (heap-total-allocated . 49605825968) (heap-allocated-since-gc . 67360208) (protected-objects . 0) (gc-times . 1356))
ludo <at> berlin ~$ sudo cat /proc/1/maps |grep -A12 heap
016fe000-01810000 rw-p 00000000 00:00 0 [heap]
7f788ce8d000-7f78a1bdd000 rw-p 00000000 00:00 0
7f78a1bdd000-7f78a1c1d000 rwxp 00000000 00:00 0
7f78a1c1d000-7f78a4000000 rw-p 00000000 00:00 0
7f78a4000000-7f78a4021000 rw-p 00000000 00:00 0
7f78a4021000-7f78a8000000 ---p 00000000 00:00 0
7f78a8000000-7f78a8600000 rw-p 00000000 00:00 0
7f78a8600000-7f78a8640000 rwxp 00000000 00:00 0
7f78a8640000-7f78a8700000 rw-p 00000000 00:00 0
7f78a8700000-7f78a8740000 rwxp 00000000 00:00 0
7f78a8740000-7f78a8c6c000 rw-p 00000000 00:00 0
7f78a8c6c000-7f78a8cac000 rwxp 00000000 00:00 0
7f78a8cac000-7f78a8ccc000 r--p 00000000 00:17 167921 /gnu/store/9nb1nplyx4gx5pdrmmidngir9jmrzxi3-guile-netlink-1.1.1/lib/guile/3.0/site-ccache/ip/route.go
ludo <at> berlin ~$ sudo herd eval root '(gc-stats)'
Password:
Evaluating user expression (gc-stats).
((gc-time-taken . 2909168084878) (heap-size . 348164096) (heap-free-size . 51949568) (heap-total-allocated . 53131500656) (heap-allocated-since-gc . 94062880) (protected-objects . 0) (gc-times . 1383))
ludo <at> berlin ~$ sudo cat /proc/1/maps |grep -A12 heap
016fe000-01810000 rw-p 00000000 00:00 0 [heap]
7f788b4bd000-7f78a1bdd000 rw-p 00000000 00:00 0
7f78a1bdd000-7f78a1c1d000 rwxp 00000000 00:00 0
7f78a1c1d000-7f78a4000000 rw-p 00000000 00:00 0
7f78a4000000-7f78a4021000 rw-p 00000000 00:00 0
7f78a4021000-7f78a8000000 ---p 00000000 00:00 0
7f78a8000000-7f78a8600000 rw-p 00000000 00:00 0
7f78a8600000-7f78a8640000 rwxp 00000000 00:00 0
7f78a8640000-7f78a8700000 rw-p 00000000 00:00 0
7f78a8700000-7f78a8740000 rwxp 00000000 00:00 0
7f78a8740000-7f78a8c6c000 rw-p 00000000 00:00 0
7f78a8c6c000-7f78a8cac000 rwxp 00000000 00:00 0
7f78a8cac000-7f78a8ccc000 r--p 00000000 00:17 167921 /gnu/store/9nb1nplyx4gx5pdrmmidngir9jmrzxi3-guile-netlink-1.1.1/lib/guile/3.0/site-ccache/ip/route.go
--8<---------------cut here---------------end--------------->8---
Heap keeps increasing, and quite quickly. That includes code arenas
allocated for JIT (the “rwx” mappings), but the biggest part seems to be
the rest of the heap (in particular the second mapping on the second and
third samples).
During that time shepherd is mostly receiving and logging messages from
‘guix publish’, and occasionally accepting a connection on port 22 and
spawning ‘sshd’:
--8<---------------cut here---------------start------------->8---
ludo <at> berlin ~$ sudo strace -p1 -Tt
strace: Process 1 attached
10:33:59 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0 <0.000009>
10:33:59 epoll_wait(13, [{events=EPOLLIN, data={u32=50, u64=50}}], 8, -1) = 1 <0.925330>
10:34:00 read(50, "GET /nar/lzip/z0l3fp12ck539qkzc2"..., 4096) = 76 <0.000012>
10:34:00 newfstatat(AT_FDCWD, "/etc/localtime", {st_mode=S_IFREG|0444, st_size=2298, ...}, 0) = 0 <0.000012>
10:34:00 write(51, "2022-10-20 10:34:00 GET /nar/lzi"..., 96) = 96 <0.000036>
10:34:00 read(50, 0x7f78a88d8020, 4096) = -1 EAGAIN (Resource temporarily unavailable) <0.000008>
10:34:00 epoll_ctl(13, EPOLL_CTL_MOD, 50, {events=EPOLLIN|EPOLLRDHUP|EPOLLONESHOT, data={u32=50, u64=50}}) = 0 <0.000009>
10:34:00 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0 <0.000007>
10:34:00 epoll_wait(13, [{events=EPOLLIN, data={u32=50, u64=50}}], 8, -1) = 1 <0.001968>
10:34:00 read(50, "GET /x2ii2b9zwxnaf020zinccckl2bh"..., 4096) = 46 <0.000010>
10:34:00 newfstatat(AT_FDCWD, "/etc/localtime", {st_mode=S_IFREG|0444, st_size=2298, ...}, 0) = 0 <0.000012>
10:34:00 write(51, "2022-10-20 10:34:00 GET /x2ii2b9"..., 66) = 66 <0.000019>
10:34:00 read(50, 0x7f78a88d8020, 4096) = -1 EAGAIN (Resource temporarily unavailable) <0.000007>
10:34:00 epoll_ctl(13, EPOLL_CTL_MOD, 50, {events=EPOLLIN|EPOLLRDHUP|EPOLLONESHOT, data={u32=50, u64=50}}) = 0 <0.000008>
10:34:00 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0 <0.000007>
10:34:00 epoll_wait(13, [{events=EPOLLIN, data={u32=50, u64=50}}], 8, -1) = 1 <0.097714>
10:34:00 read(50, "GET /lclbcq0jds63zal1p55g6v0mwz9"..., 4096) = 46 <0.000009>
10:34:00 newfstatat(AT_FDCWD, "/etc/localtime", {st_mode=S_IFREG|0444, st_size=2298, ...}, 0) = 0 <0.000011>
10:34:00 write(51, "2022-10-20 10:34:00 GET /lclbcq0"..., 66) = 66 <0.000018>
10:34:00 read(50, 0x7f78a88d8020, 4096) = -1 EAGAIN (Resource temporarily unavailable) <0.000009>
10:34:00 epoll_ctl(13, EPOLL_CTL_MOD, 50, {events=EPOLLIN|EPOLLRDHUP|EPOLLONESHOT, data={u32=50, u64=50}}) = 0 <0.000011>
--8<---------------cut here---------------end--------------->8---
Logging as performed by ‘%service-file-logger’ is quite GC-intensive
(but shouldn’t be leaky!); this bit:
(let ((prefix (strftime default-logfile-date-format
(localtime (current-time)))))
(format output "~a~a~%" prefix line)
(loop))
ends up allocating a couple of vectors (for ‘localtime’), a bunch of
strings, a string port (via (ice-9 format) it seems), etc. That puts GC
under stress.
Still, what’s leaking? Why are JIT code arenas being allocated?
Ludo’.
Changed bug title to '[Shepherd] Indefinite heap growth (memory leak)' from 'Shepherd crash on berlin'
Request was from
Ludovic Courtès <ludo <at> gnu.org>
to
control <at> debbugs.gnu.org
.
(Thu, 20 Oct 2022 09:31:01 GMT)
Full text and
rfc822 format available.
Information forwarded
to
bug-guix <at> gnu.org
:
bug#58631
; Package
guix
.
(Thu, 20 Oct 2022 17:50:02 GMT)
Full text and
rfc822 format available.
Message #18 received at 58631 <at> debbugs.gnu.org (full text, mbox):
Ludovic Courtès <ludo <at> gnu.org> writes:
> Indeed, the core dump weighs 13 GiB, so it looks like a memory leak (too
> bad ‘info proc stat’ in GDB doesn’t work).
>
> I have not observed it on any other machine. The difference between
> berlin and machines I have access to is that berlin is being hammered on
> port 22, which means it gets to spawn sshd processes very often. This
> could be where the leak is.
>
So, this is a hack, but perhaps berlin could start to use endlessh on
port 22, and then the real ssh port could be on another standard port.
https://issues.guix.gnu.org/39136
It would change most guix developers work loads. But it's a thought. :)
> Ludo’.
Information forwarded
to
bug-guix <at> gnu.org
:
bug#58631
; Package
guix
.
(Thu, 20 Oct 2022 21:47:01 GMT)
Full text and
rfc822 format available.
Message #21 received at 58631 <at> debbugs.gnu.org (full text, mbox):
Joshua Branson <jbranso <at> dismail.de> skribis:
> So, this is a hack, but perhaps berlin could start to use endlessh on
> port 22, and then the real ssh port could be on another standard port.
We should do that, I agree; we should still fix the actual bug though!
Ludo’.
Information forwarded
to
bug-guix <at> gnu.org
:
bug#58631
; Package
guix
.
(Fri, 21 Oct 2022 13:08:02 GMT)
Full text and
rfc822 format available.
Message #24 received at 58631 <at> debbugs.gnu.org (full text, mbox):
Am Donnerstag, dem 20.10.2022 um 11:29 +0200 schrieb Ludovic Courtès:
> Logging as performed by ‘%service-file-logger’ is quite GC-intensive
> (but shouldn’t be leaky!); this bit:
>
> (let ((prefix (strftime default-logfile-date-format
> (localtime (current-time)))))
> (format output "~a~a~%" prefix line)
> (loop))
(let ((prefix (strftime default-logfile-date-format
(localtime (current-time)))))
(format output "~a~a~%" prefix line)
(loop))
Would it make a difference if you instead wrote
(let ((prefix (strftime default-logfile-date-format
(localtime (current-time)))))
(format output "~a~a~%" prefix line))
(loop)
? Semantically, they're equivalent (perhaps with a wrapping begin
needed), but the compiler might fail to see this.
Cheers
Information forwarded
to
bug-guix <at> gnu.org
:
bug#58631
; Package
guix
.
(Sat, 22 Oct 2022 20:22:01 GMT)
Full text and
rfc822 format available.
Message #27 received at 58631 <at> debbugs.gnu.org (full text, mbox):
Liliana Marie Prikler <liliana.prikler <at> ist.tugraz.at> skribis:
> Am Donnerstag, dem 20.10.2022 um 11:29 +0200 schrieb Ludovic Courtès:
>> Logging as performed by ‘%service-file-logger’ is quite GC-intensive
>> (but shouldn’t be leaky!); this bit:
>>
>> (let ((prefix (strftime default-logfile-date-format
>> (localtime (current-time)))))
>> (format output "~a~a~%" prefix line)
>> (loop))
>
> (let ((prefix (strftime default-logfile-date-format
> (localtime (current-time)))))
> (format output "~a~a~%" prefix line)
> (loop))
>
> Would it make a difference if you instead wrote
>
> (let ((prefix (strftime default-logfile-date-format
> (localtime (current-time)))))
> (format output "~a~a~%" prefix line))
> (loop)
>
> ? Semantically, they're equivalent (perhaps with a wrapping begin
> needed), but the compiler might fail to see this.
I’m confident the compiler does the right thing, and it’s a tail call so
‘prefix’ doesn’t outlive one loop iteration.
Ludo’.
Information forwarded
to
bug-guix <at> gnu.org
:
bug#58631
; Package
guix
.
(Sat, 22 Oct 2022 20:22:02 GMT)
Full text and
rfc822 format available.
Message #30 received at 58631 <at> debbugs.gnu.org (full text, mbox):
Liliana Marie Prikler <liliana.prikler <at> ist.tugraz.at> skribis:
> Am Donnerstag, dem 20.10.2022 um 11:29 +0200 schrieb Ludovic Courtès:
>> Logging as performed by ‘%service-file-logger’ is quite GC-intensive
>> (but shouldn’t be leaky!); this bit:
>>
>> (let ((prefix (strftime default-logfile-date-format
>> (localtime (current-time)))))
>> (format output "~a~a~%" prefix line)
>> (loop))
>
> (let ((prefix (strftime default-logfile-date-format
> (localtime (current-time)))))
> (format output "~a~a~%" prefix line)
> (loop))
>
> Would it make a difference if you instead wrote
>
> (let ((prefix (strftime default-logfile-date-format
> (localtime (current-time)))))
> (format output "~a~a~%" prefix line))
> (loop)
>
> ? Semantically, they're equivalent (perhaps with a wrapping begin
> needed), but the compiler might fail to see this.
I’m confident the compiler does the right thing, and it’s a tail call so
‘prefix’ doesn’t outlive one loop iteration.
Ludo’.
Information forwarded
to
bug-guix <at> gnu.org
:
bug#58631
; Package
guix
.
(Sat, 22 Oct 2022 20:31:01 GMT)
Full text and
rfc822 format available.
Message #33 received at 58631 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
An update: I can reproduce in a VM running a slightly simplified version
of ‘hydra/berlin.scm’, with loop running:
while true; do wget -qO/dev/null http://localhost:3000/nix-cache-info; done
to trigger ‘guix publish’ logging.
Better, I can mostly reproduce the issue with the attached config file,
then starting shepherd:
rm -f sock && ./shepherd -s sock -c log-conf.scm -I
… monitoring heap usage:
./herd -s sock eval root '(gc-stats)'
… and triggering inetd service startup, which in turn triggers heap
growth:
while : ; do echo foo> /dev/tcp/localhost/4567 ; done
Then the heap size reported by ‘gc-stats’ seems to hit a threshold above
which is stop growing, or it grows too slowly (IOW the problem is not as
easy to observe as on berlin).
That’s pretty much all I have so far. :-/
Ludo’.
[log-conf.scm (text/plain, inline)]
;; https://issues.guix.gnu.org/58631
(define %command
(list "/bin/sh" "-c"
(string-append "while true; do "
(string-concatenate
(make-list 30
(string-append
"echo "
(string-concatenate
(make-list 8 "logging "))
"; ")))
"sleep 0.2; "
" done")))
(define %echo-server
;; Simple echo server.
'("/bin/sh" "-c" "echo hello; read line; echo line; echo done"))
(define loss
(make-vector (* 10 (expt 2 20))))
(register-services
(make <service>
#:provides '(test-logging)
#:start (make-forkexec-constructor %command
#:log-file "/tmp/service.log")
#:stop (make-kill-destructor))
(make <service>
#:provides '(test-inetd)
#:start (make-inetd-constructor %echo-server
(list
(endpoint (make-socket-address
AF_INET
INADDR_LOOPBACK
4567))))
#:stop (make-inetd-destructor)))
(start 'test-logging)
(start 'test-inetd)
(pk 'init-gc (gc-stats))
Information forwarded
to
bug-guix <at> gnu.org
:
bug#58631
; Package
guix
.
(Sat, 29 Oct 2022 10:03:01 GMT)
Full text and
rfc822 format available.
Message #36 received at 58631 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
The attached Fibers program illustrates the problem: heap grows even
though it’s not supposed to.
Ludo’.
[fibers-memory-consumption.scm (text/plain, inline)]
;; https://issues.guix.gnu.org/58631
(use-modules (fibers)
(fibers channels)
(ice-9 rdelim)
(statprof))
(run-fibers
(lambda ()
(define channel
(make-channel))
(define leave-channel
(make-channel))
(spawn-fiber
(lambda ()
(sleep 10)
(put-message leave-channel 'leave)))
(spawn-fiber
(lambda ()
(let loop ()
(put-message channel 'hi!)
(get-message channel)
(loop))))
(spawn-fiber
(lambda ()
(let loop ()
(get-message channel)
(put-message channel 'hey!)
(loop))))
(spawn-fiber
(lambda ()
(let loop ()
(pk 'heap-size (assoc-ref (gc-stats) 'heap-size))
(sleep 2)
(loop))))
(get-message leave-channel))
;; #:drain? #t
#:parallelism 1 ;don't create POSIX threads
#:hz 0)
Information forwarded
to
bug-guix <at> gnu.org
:
bug#58631
; Package
guix
.
(Sat, 29 Oct 2022 14:55:01 GMT)
Full text and
rfc822 format available.
Message #39 received at 58631 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On 29-10-2022 12:01, Ludovic Courtès wrote:
> The attached Fibers program illustrates the problem: heap grows even
> though it’s not supposed to.
If I add (gc) before the (pk 'heap-size ...), I can't reproduce:
;;; (heap-size 12742656)
;;; (heap-size 12808192)
;;; (heap-size 12808192)
;;; (heap-size 12808192)
;;; (heap-size 12808192)
(another run)
;;; (heap-size 4976640)
;;; (heap-size 8855552)
;;; (heap-size 12029952)
;;; (heap-size 12029952)
;;; (heap-size 12029952)
However, if I increase the number of seconds waited, I can eventually
reproduce some heap growth:
;;; (heap-size 12742656)
;;; (heap-size 12742656)
;;; (heap-size 12808192)
;;; (heap-size 12808192)
;;; (heap-size 12808192)
;;; (heap-size 12808192)
;;; (heap-size 12808192)
;;; (heap-size 12808192)
;;; (heap-size 12808192)
;;; (heap-size 12808192)
;;; (heap-size 12808192)
;; <--- here's the jump
;;; (heap-size 23359488)
;;; (heap-size 23359488)
;;; (heap-size 23359488)
;;; (heap-size 23359488)
;;; (heap-size 23359488)
;;; (heap-size 23359488)
;;; (heap-size 23359488)
;;; (heap-size 23359488)
;;; (heap-size 23359488)
(This is in a "guix shell guile guile-fibers" environment)
Greetings,
Maxime.
[OpenPGP_0x49E3EE22191725EE.asc (application/pgp-keys, attachment)]
[OpenPGP_signature (application/pgp-signature, attachment)]
Information forwarded
to
bug-guix <at> gnu.org
:
bug#58631
; Package
guix
.
(Sun, 30 Oct 2022 09:40:01 GMT)
Full text and
rfc822 format available.
Message #42 received at 58631 <at> debbugs.gnu.org (full text, mbox):
Ludovic Courtès <ludo <at> gnu.org> skribis:
> The attached Fibers program illustrates the problem: heap grows even
> though it’s not supposed to.
The saga continues at <https://github.com/wingo/fibers/issues/65>.
Information forwarded
to
bug-guix <at> gnu.org
:
bug#58631
; Package
guix
.
(Sun, 06 Nov 2022 22:46:01 GMT)
Full text and
rfc822 format available.
Message #45 received at 58631 <at> debbugs.gnu.org (full text, mbox):
Ludovic Courtès <ludo <at> gnu.org> skribis:
> Ludovic Courtès <ludo <at> gnu.org> skribis:
>
>> The attached Fibers program illustrates the problem: heap grows even
>> though it’s not supposed to.
>
> The saga continues at <https://github.com/wingo/fibers/issues/65>.
… and then at <https://issues.guix.gnu.org/59021>, which appears to have
a happy end: these Guix commits deploy the fix:
072fd1124a gnu: shepherd <at> 0.9: Use 'guile-3.0-latest' to address memory leak.
7138ba34fa gnu: guile <at> 3.0.8: Add patch to address continuation memory leak.
In my testing it fixes the problem. It’s now deployed it on
bayfront.guix.gnu.org and we’ll deploy to berlin.guix.gnu.org later on
and see how it goes in the coming days.
Ludo’.
Information forwarded
to
bug-guix <at> gnu.org
:
bug#58631
; Package
guix
.
(Mon, 07 Nov 2022 10:45:02 GMT)
Full text and
rfc822 format available.
Message #48 received at 58631 <at> debbugs.gnu.org (full text, mbox):
Ludovic Courtès <ludo <at> gnu.org> skribis:
> Ludovic Courtès <ludo <at> gnu.org> skribis:
>
>> Ludovic Courtès <ludo <at> gnu.org> skribis:
>>
>>> The attached Fibers program illustrates the problem: heap grows even
>>> though it’s not supposed to.
>>
>> The saga continues at <https://github.com/wingo/fibers/issues/65>.
>
> … and then at <https://issues.guix.gnu.org/59021>, which appears to have
> a happy end: these Guix commits deploy the fix:
>
> 072fd1124a gnu: shepherd <at> 0.9: Use 'guile-3.0-latest' to address memory leak.
> 7138ba34fa gnu: guile <at> 3.0.8: Add patch to address continuation memory leak.
>
> In my testing it fixes the problem. It’s now deployed it on
> bayfront.guix.gnu.org and we’ll deploy to berlin.guix.gnu.org later on
> and see how it goes in the coming days.
Turns out it’s neither the end nor happy! On bayfront, shepherd is
apparently still leaking:
--8<---------------cut here---------------start------------->8---
ludo <at> bayfront ~$ date && sudo herd eval root '(gc-stats)'
Sun Nov 6 11:50:48 PM CET 2022
Password:
Evaluating user expression (gc-stats).
((gc-time-taken . 2193894749) (heap-size . 16404480) (heap-free-size . 5505024) (heap-total-allocated . 119183424) (heap-allocated-since-gc . 4071216) (protected-objects . 0) (gc-times . 21))
ludo <at> bayfront ~$ date && sudo herd eval root '(gc-stats)'
Mon Nov 7 09:45:35 AM CET 2022
Password:
Evaluating user expression (gc-stats).
((gc-time-taken . 741717175797) (heap-size . 47304704) (heap-free-size . 10625024) (heap-total-allocated . 17295529056) (heap-allocated-since-gc . 10237808) (protected-objects . 0) (gc-times . 1700))
ludo <at> bayfront ~$ date && sudo herd eval root '(gc-stats)'
Mon Nov 7 11:18:05 AM CET 2022
Password:
Evaluating user expression (gc-stats).
((gc-time-taken . 1090097976826) (heap-size . 55693312) (heap-free-size . 5206016) (heap-total-allocated . 23190066736) (heap-allocated-since-gc . 18995744) (protected-objects . 0) (gc-times . 1996))
--8<---------------cut here---------------end--------------->8---
There seems to be an unusually high number of “program” objects in the
heap:
--8<---------------cut here---------------start------------->8---
ludo <at> bayfront ~$ sudo herd eval root '(begin (load "/home/ludo/heap-profiler.scm") (profile-heap))'
Password:
Evaluating user expression (begin (load "/home/ludo/heap-profiler.scm") (#)).
% type self avg obj size
48.0 program 1188240 32.1
21.8 pair 539904 16.0
7.4 bytevector 184368 784.5
5.6 vector 138373 7.1
4.6 struct 113328 85.8
3.5 unknown 86912 39.8
2.0 stringbuf 50144 60.5
1.3 symbol 33152 32.9
0.6 string 13728 34.1
0.5 smob 12656 39.9
0.5 variable 12432 22.6
0.4 frame 10720 34.6
0.4 pointer 10256 32.5
0.4 weak-vector 9968 32.8
0.4 bitvector 9936 34.4
0.4 hash-table 9840 34.5
0.3 weak-table 7504 34.0
0.3 vm-continuation 6976 31.7
0.3 heap-number 6848 35.3
0.2 atomic-box 5488 36.1
0.2 port 5248 36.4
0.2 dynamic-state 3760 30.1
0.1 fluid 3664 70.5
0.1 syntax 3312 48.0
0.1 keyword 2016 40.3
0.1 weak-set 1872 48.0
0.1 array 1776 48.0
0.1 partial-continuation 1760 39.1
0.0 primitive 528 16.0
0.0 foreign-program 448 44.8
0.0 continuation 336 48.0
0.0 primitive-generic 176 35.2
sampled heap: 2.36098 MiB (heap size: 53.11328 MiB)
--8<---------------cut here---------------end--------------->8---
(Note that ‘partial-continuation’, which are programs with the
SCM_F_PROGRAM_IS_PARTIAL_CONTINUATION flag, are counted separately. So
here we really talking about programs with flags == 0.)
Ludo’.
Information forwarded
to
bug-guix <at> gnu.org
:
bug#58631
; Package
guix
.
(Thu, 10 Nov 2022 08:52:02 GMT)
Full text and
rfc822 format available.
Message #51 received at 58631 <at> debbugs.gnu.org (full text, mbox):
With the Fibers fix for <https://github.com/wingo/fibers/issues/65>,
backported in Guix commit d9ca9cdd01bf1097343a047b51a1392131c7cf58, this
shepherd memory leak seems to be finally fixed (details in the issue
above).
Bayfront.guix has been running it for more than 12h and heap size hasn’t
changed in that time span.
I’ll wait a bit more before closing the issue, but we’ll soon be done.
Ludo’.
Reply sent
to
Ludovic Courtès <ludo <at> gnu.org>
:
You have taken responsibility.
(Fri, 11 Nov 2022 10:56:02 GMT)
Full text and
rfc822 format available.
Notification sent
to
Ludovic Courtès <ludo <at> gnu.org>
:
bug acknowledged by developer.
(Fri, 11 Nov 2022 10:56:02 GMT)
Full text and
rfc822 format available.
Message #56 received at 58631-done <at> debbugs.gnu.org (full text, mbox):
Hi!
Ludovic Courtès <ludo <at> gnu.org> skribis:
> Bayfront.guix has been running it for more than 12h and heap size hasn’t
> changed in that time span.
Heap size has remained unchanged for the past 48h. Berlin.guix was
reconfigured & rebooted 12h and shepherd’s heap size has remained
unchanged.
Officially closing now!
Ludo’.
bug archived.
Request was from
Debbugs Internal Request <help-debbugs <at> gnu.org>
to
internal_control <at> debbugs.gnu.org
.
(Fri, 09 Dec 2022 12:24:05 GMT)
Full text and
rfc822 format available.
This bug report was last modified 1 year and 136 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.