GNU bug report logs - #41357
28.0.50; GC may miss to mark calle safe register content


Package: emacs;

Reported by: Andrea Corallo <akrl <at> sdf.org>

Date: Sun, 17 May 2020 12:43:02 UTC

Severity: normal

Found in version 28.0.50

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 41357 in the body.
You can then email your comments to 41357 AT debbugs.gnu.org in the normal way.




Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#41357; Package emacs. (Sun, 17 May 2020 12:43:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Andrea Corallo <akrl <at> sdf.org>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Sun, 17 May 2020 12:43:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Andrea Corallo <akrl <at> sdf.org>
To: bug-gnu-emacs <at> gnu.org
Cc: Eli Zaretskii <eliz <at> gnu.org>, Paul Eggert <eggert <at> cs.ucla.edu>
Subject: 28.0.50; GC may miss to mark calle safe register content
Date: Sun, 17 May 2020 12:42:48 +0000
[Message part 1 (text/plain, inline)]
Hi all,

While debugging the native compiler, I've been chasing a bug in a
configuration where the .eln files are compiled at speed 2 (-O2) and
emacs-core is compiled at -O0.

What is going on is that in function A of a .eln, a Lisp_Object is
held in a register (r14).  Function A calls other functions in
emacs-core until garbage collection is triggered.

Because emacs-core is compiled with -O0, GCC does not select any
callee-saved register there, and therefore these registers never get
pushed.  The value stays in r14 until we enter 'flush_stack_call_func',
where we have to push all registers and identify the end of the stack
for marking.

We correctly push the callee-saved registers with
__builtin_unwind_init (), and on my machine we identify the top (end)
of the stack using __builtin_frame_address (0).

Here, I think, is where the issue arises: __builtin_frame_address on
GCC 7 and 10 for x86_64 returns the base pointer, not the stack
pointer [1].  As a consequence, the scanned range does not include the
callee-saved registers that we have just pushed.

In my case r14 gets pushed at address 0x7ffc47b95fa0, but in mark_stack
we scan the interval from 0x7ffc47b95fb0 (end) to 0x7ffc47b9a150
(bottom).  This is because __builtin_frame_address returned rbp
(0x7ffc47b95fb0 in this case).

The consequence is that the object originally referenced by r14 is never
marked, so it gets freed, which leads to a crash.
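
The address arithmetic above can be checked with a small sketch.  The
helper name 'slot_is_scanned' is hypothetical, and treating the scanned
interval as half-open is an assumption; the real mark_stack may handle
its bounds differently.  Either way, the slot holding r14 sits 16 bytes
below the scan end and is therefore never visited.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of Emacs: true when SLOT lies inside
   the scanned interval [END, BOTTOM).  On x86_64 the stack grows
   downward, so END is the lowest scanned address.  */
bool
slot_is_scanned (uint64_t slot, uint64_t end, uint64_t bottom)
{
  return end <= slot && slot < bottom;
}
```

With the addresses from the report, slot_is_scanned (0x7ffc47b95fa0,
0x7ffc47b95fb0, 0x7ffc47b9a150) is false, because the slot lies below
the end returned by __builtin_frame_address.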

I think we want the stack pointer rather than the base pointer;
unfortunately, what __builtin_frame_address does appears not to be
really portable:

https://gcc.gnu.org/onlinedocs/gcc/Return-Address.html

This bug is easy to observe in the native compiler with configurations
like this (speed 2 for .eln, -O0 for core), but I believe it can affect
stock Emacs too if any caller of flush_stack_call_func has a
callee-saved register holding a reference to a live object that is not
present on the stack.  This can get trickier, especially with LTO
enabled.

For now I'm testing the simple attached patch, which seems to do the
job for me.  It pushes the registers in 'flush_stack_call_func' and then
calls 'flush_stack_call_func1', whose frame address (rbp) now lies below
the slots where those registers were pushed, so the scanned range
includes them.
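
The two-function idea described above can be sketched roughly as
follows.  This is not the attached patch: the names
'flush_outer_sketch', 'flush_inner_sketch', 'scan_callback' and
'record_end' are hypothetical, and the callback signature is an
assumption made for illustration.  The empty asm statement is an extra
precaution (not discussed in this report) against the compiler turning
the inner call into a sibling call, which would pop the pushed
registers before the scan runs.

```c
#include <stddef.h>

/* Hypothetical callback type: receives the computed stack end.  */
typedef void (*scan_callback) (void *end, void *arg);

/* Inner function: kept out of line so that it has its own frame,
   located below everything the outer function pushed.  */
static __attribute__ ((noinline)) void
flush_inner_sketch (scan_callback fn, void *arg)
{
  void *end = __builtin_frame_address (0);
  fn (end, arg);
}

/* Outer function: forces the callee-saved registers onto the stack,
   then calls the inner function to compute the scan end.  */
void
flush_outer_sketch (scan_callback fn, void *arg)
{
  __builtin_unwind_init ();
  flush_inner_sketch (fn, arg);
  /* Assumed precaution: keep the call above from becoming a tail call.  */
  __asm__ volatile ("" : : : "memory");
}

/* Example callback: stores END into the 'void *' that ARG points to.  */
void
record_end (void *end, void *arg)
{
  *(void **) arg = end;
}
```

Calling flush_outer_sketch (record_end, &end) with a local 'void *end'
records the inner frame address; on a downward-growing stack such as
x86_64, that address should lie below the register slots pushed by the
outer function.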

I hope I'm not catastrophically wrong in this analysis; if I am,
I apologize for the noise.

Thanks

  Andrea

[1] Reduced example. GCC7 -O0

void *
foo (void)
{
  __builtin_unwind_init ();
  return __builtin_frame_address (0);
}

foo:
	push	rbp
	mov	rbp, rsp
	push	r15
	push	r14
	push	r13
	push	r12
	push	rbx
	mov	rax, rbp
	pop	rbx
	pop	r12
	pop	r13
	pop	r14
	pop	r15
	pop	rbp
	ret
[0001-Fix-Garbage-Collector-for-missing-calle-safe-registe.patch (text/x-diff, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#41357; Package emacs. (Sun, 17 May 2020 15:38:01 GMT) Full text and rfc822 format available.

Message #8 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Andrea Corallo <akrl <at> sdf.org>
Cc: bug-gnu-emacs <at> gnu.org, eggert <at> cs.ucla.edu
Subject: Re: 28.0.50; GC may miss to mark calle safe register content
Date: Sun, 17 May 2020 18:36:45 +0300
> From: Andrea Corallo <akrl <at> sdf.org>
> Cc: Paul Eggert <eggert <at> cs.ucla.edu>, Eli Zaretskii <eliz <at> gnu.org>
> Date: Sun, 17 May 2020 12:42:48 +0000
> 
> What is going on is that in a .eln in a function A a Lisp_Object is
> hold in a register (r14).  Function A is calling other functions into
> emacs-core till Garbage Collection is triggered.
> 
> Being emacs-core compiled with -O0 GCC is not selecting any callee safe
> register and therefore these gets never pushed.

Isn't this something for the infrastructure of calling
natively-compiled Lisp to solve?  The Emacs C code isn't prepared for
calling optimized C code when it calls Lisp, and I don't think it's
right for us to assume that, because it will make Emacs slower.  If
the natively-compiled Lisp needs some setup to be compatible with GC,
I think the calling framework should set that up.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#41357; Package emacs. (Sun, 17 May 2020 16:41:02 GMT) Full text and rfc822 format available.

Message #11 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Andrea Corallo <akrl <at> sdf.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: bug-gnu-emacs <at> gnu.org, eggert <at> cs.ucla.edu
Subject: Re: 28.0.50; GC may miss to mark calle safe register content
Date: Sun, 17 May 2020 16:40:09 +0000
[Message part 1 (text/plain, inline)]
Eli Zaretskii <eliz <at> gnu.org> writes:

>> From: Andrea Corallo <akrl <at> sdf.org>
>> Cc: Paul Eggert <eggert <at> cs.ucla.edu>, Eli Zaretskii <eliz <at> gnu.org>
>> Date: Sun, 17 May 2020 12:42:48 +0000
>>
>> What is going on is that in a .eln in a function A a Lisp_Object is
>> hold in a register (r14).  Function A is calling other functions into
>> emacs-core till Garbage Collection is triggered.
>>
>> Being emacs-core compiled with -O0 GCC is not selecting any callee safe
>> register and therefore these gets never pushed.
>
> Isn't this something for the infrastructure of calling
> natively-compiled Lisp to solve?  The Emacs C code isn't prepared for
> calling optimized C code when it calls Lisp, and I don't think it's
> right for us to assume that, because it will make Emacs slower.  If
> the natively-compiled Lisp needs some setup to be compatible with GC,
> I think the calling framework should set that up.

Hi Eli,

I think this is a real bug that we have in the codebase (emacs-27
included).

Usually it works because, with many big functions with high register
pressure being activated before reaching 'flush_stack_call_func',
statistically the callee-saved registers are very likely to have been
pushed.

But nothing prevents a caller of 'flush_stack_call_func' from storing a
Lisp object in a callee-saved register and triggering the bug.  This
obviously depends on the compiler, the flags used, etc., things we have
no control over.

This could also be an explanation for the instability of LTO
configurations.  Given that callers of 'flush_stack_call_func' are more
likely to be inlined, the exposed surface becomes considerably larger.

This bug should also be more likely to be observable if C files are
compiled with a mix of -O0 and -O2.

I think we should honor the calling convention and make sure we also
scan the content of callee-saved registers during garbage collection.
BTW, I guess that's the reason why we call '__builtin_unwind_init',
isn't it?

If we are concerned about performance, the attached patch should have
zero performance overhead.

Regards

  Andrea

--
akrl <at> sdf.org
[0001-Fix-Garbage-Collector-for-missing-calle-safe-registe.patch (text/x-diff, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#41357; Package emacs. (Sun, 17 May 2020 16:47:01 GMT) Full text and rfc822 format available.

Message #14 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Andrea Corallo <akrl <at> sdf.org>, Eli Zaretskii <eliz <at> gnu.org>
Cc: bug-gnu-emacs <at> gnu.org
Subject: Re: 28.0.50; GC may miss to mark calle safe register content
Date: Sun, 17 May 2020 09:46:23 -0700
On 5/17/20 9:40 AM, Andrea Corallo wrote:
> I think this is a real bug that we have in the codebase (emacs-27
> included).

Thanks for all the detective work! Your analysis is correct and your patch looks
good. I've always been suspicious of that code, and it looks like you've
confirmed my suspicions.

The only question in my mind is whether to install the patch into the emacs-27
branch or the master branch. Given Eli's problems with stability in emacs-27
(see Bug#41321), I'm inclined to think the former, as the bug could explain the
problems Eli is observing.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#41357; Package emacs. (Sun, 17 May 2020 17:02:01 GMT) Full text and rfc822 format available.

Message #17 received at 41357 <at> debbugs.gnu.org (full text, mbox):

From: Pip Cet <pipcet <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Eli Zaretskii <eliz <at> gnu.org>, 41357 <at> debbugs.gnu.org,
 Andrea Corallo <akrl <at> sdf.org>
Subject: Re: bug#41357: 28.0.50;
 GC may miss to mark calle safe register content
Date: Sun, 17 May 2020 17:00:38 +0000
On Sun, May 17, 2020 at 4:47 PM Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> On 5/17/20 9:40 AM, Andrea Corallo wrote:
> > I think this is a real bug that we have in the codebase (emacs-27
> > included).
> Thanks for all the detective work! Your analysis is correct and your patch looks
> good.

That's my impression as well.

> The only question in my mind is whether to install the patch into the emacs-27
> branch or the master branch. Given Eli's problems with stability in emacs-27
> (see Bug#41321), I'm inclined to think the former, as the bug could explain the
> problems Eli is observing.

I don't think that platform even has callee-saved registers? But I
think the fix should go on the emacs-27 branch. It's a bad bug and
sheer luck that Fgarbage_collect on my platform (using this specific
compiler, etc.) pushes all callee-saved registers. We shouldn't rely
on such luck on all platforms.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#41357; Package emacs. (Sun, 17 May 2020 17:05:02 GMT) Full text and rfc822 format available.

Message #20 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Andrea Corallo <akrl <at> sdf.org>
Cc: bug-gnu-emacs <at> gnu.org, eggert <at> cs.ucla.edu
Subject: Re: 28.0.50; GC may miss to mark calle safe register content
Date: Sun, 17 May 2020 20:04:01 +0300
> From: Andrea Corallo <akrl <at> sdf.org>
> Cc: bug-gnu-emacs <at> gnu.org, eggert <at> cs.ucla.edu
> Date: Sun, 17 May 2020 16:40:09 +0000
> 
> I think this is a real bug that we have in the codebase (emacs-27
> included).

Maybe it's so, but your explanation makes sense only in the context of
calling a machine-language function.  When we call Lisp or bytecode,
the machine-level operation is very different, and I cannot easily
correlate your description of using registers with what happens when
we call Lisp or bytecode.  Sorry for my misunderstanding.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#41357; Package emacs. (Sun, 17 May 2020 17:05:02 GMT) Full text and rfc822 format available.

Message #23 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: bug-gnu-emacs <at> gnu.org, akrl <at> sdf.org
Subject: Re: 28.0.50; GC may miss to mark calle safe register content
Date: Sun, 17 May 2020 20:04:46 +0300
> Cc: bug-gnu-emacs <at> gnu.org
> From: Paul Eggert <eggert <at> cs.ucla.edu>
> Date: Sun, 17 May 2020 09:46:23 -0700
> 
> The only question in my mind is whether to install the patch into the emacs-27
> branch or the master branch.

Definitely the master!




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#41357; Package emacs. (Sun, 17 May 2020 17:14:01 GMT) Full text and rfc822 format available.

Message #26 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Andrea Corallo <akrl <at> sdf.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: bug-gnu-emacs <at> gnu.org, eggert <at> cs.ucla.edu
Subject: Re: 28.0.50; GC may miss to mark calle safe register content
Date: Sun, 17 May 2020 17:13:26 +0000
Eli Zaretskii <eliz <at> gnu.org> writes:

>> From: Andrea Corallo <akrl <at> sdf.org>
>> Cc: bug-gnu-emacs <at> gnu.org, eggert <at> cs.ucla.edu
>> Date: Sun, 17 May 2020 16:40:09 +0000
>> 
>> I think this is a real bug that we have in the codebase (emacs-27
>> included).
>
> Maybe it's so, but your explanation makes sense only in the context of
> calling a machine-language function.  When we call Lisp or bytecode,
> the machine-level operation is very different, and I cannot easily
> correlate your description of using registers with what happens when
> we call Lisp or bytecode.  Sorry for my misunderstanding.

That is correct, but I don't think we need bytecode to come into play
here to have the problem.

If a C function that calls 'flush_stack_call_func' keeps a Lisp_Object
in a temporary variable, and the compiler decides to keep it in a
callee-saved register while 'flush_stack_call_func' is called, the
object will be garbage collected unexpectedly.

Am I wrong?

  Andrea

-- 
akrl <at> sdf.org




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#41357; Package emacs. (Sun, 17 May 2020 17:16:02 GMT) Full text and rfc822 format available.

Message #29 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Eli Zaretskii <eliz <at> gnu.org>, Andrea Corallo <akrl <at> sdf.org>
Cc: bug-gnu-emacs <at> gnu.org
Subject: Re: 28.0.50; GC may miss to mark calle safe register content
Date: Sun, 17 May 2020 10:08:00 -0700
On 5/17/20 10:04 AM, Eli Zaretskii wrote:
> I cannot easily
> correlate your description of using registers with what happens when
> we call Lisp or bytecode.

His description is generic: it applies regardless of whether the garbage
collector is called from C code (in his branch, generated from Lisp code) or
from the interpreter (either in his branch or in the emacs-27 branch) as it is
executing Lisp code or bytecode.

It's a low-level problem in which the garbage collector is not seeing some
objects that it should see, because at the machine level the object addresses
are in registers that the garbage collector hasn't saved and thus won't see when
it scans memory.

A serious and insidious bug in our existing system, in other words.
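
The failure mode described above can be illustrated with a toy model.
This is only a simulation under stated assumptions: 'toy_mark_range'
and 'toy_scan_demo' are hypothetical names, the "stack" and "register"
are ordinary variables, and the real Emacs marker is considerably more
involved.  The point is that a conservative scan marks only what it
finds in the memory it walks.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy conservative marker: every word in [lo, hi) that equals the
   address of a known object marks that object as live.  */
static void
toy_mark_range (const uintptr_t *lo, const uintptr_t *hi,
                const uintptr_t *objects, bool *marked, size_t nobjects)
{
  for (const uintptr_t *p = lo; p < hi; p++)
    for (size_t i = 0; i < nobjects; i++)
      if (*p == objects[i])
        marked[i] = true;
}

/* Returns a bitmask of marked objects: bit 0 for obj_a, bit 1 for
   obj_b.  obj_a is referenced from the simulated stack; obj_b is
   referenced only from a simulated register unless
   REGISTERS_SPILLED copies that register into scanned memory.  */
unsigned
toy_scan_demo (bool registers_spilled)
{
  static int obj_a, obj_b;
  uintptr_t objects[2] = { (uintptr_t) &obj_a, (uintptr_t) &obj_b };
  bool marked[2] = { false, false };

  uintptr_t fake_register = objects[1];
  uintptr_t fake_stack[4] = { 0, objects[0], 0, 0 };

  if (registers_spilled)
    fake_stack[3] = fake_register;

  toy_mark_range (fake_stack, fake_stack + 4, objects, marked, 2);
  return (marked[0] ? 1u : 0u) | (marked[1] ? 2u : 0u);
}
```

In this model toy_scan_demo (false) reports only obj_a as live, so
obj_b would be swept even though it is still referenced, while
toy_scan_demo (true) reports both.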




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#41357; Package emacs. (Sun, 17 May 2020 17:23:02 GMT) Full text and rfc822 format available.

Message #32 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Andrea Corallo <akrl <at> sdf.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: bug-gnu-emacs <at> gnu.org, Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: 28.0.50; GC may miss to mark calle safe register content
Date: Sun, 17 May 2020 17:21:54 +0000
Eli Zaretskii <eliz <at> gnu.org> writes:

>> Cc: bug-gnu-emacs <at> gnu.org
>> From: Paul Eggert <eggert <at> cs.ucla.edu>
>> Date: Sun, 17 May 2020 09:46:23 -0700
>> 
>> The only question in my mind is whether to install the patch into the emacs-27
>> branch or the master branch.
>
> Definitely the master!

It's not my responsibility, so I won't insist.

But I just wanted to point out that I think this is clearly a bug in our
code, and it can trigger depending on decisions of the toolchain that we
have no control over.

I think it should be treated as we would treat a verified "access out
of bounds" or "read after free".

My opinion :)

Regards

  Andrea

-- 
akrl <at> sdf.org




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#41357; Package emacs. (Sun, 17 May 2020 17:24:01 GMT) Full text and rfc822 format available.

Message #35 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Andrea Corallo <akrl <at> sdf.org>
Cc: bug-gnu-emacs <at> gnu.org, eggert <at> cs.ucla.edu
Subject: Re: 28.0.50; GC may miss to mark calle safe register content
Date: Sun, 17 May 2020 20:22:58 +0300
> From: Andrea Corallo <akrl <at> sdf.org>
> Cc: bug-gnu-emacs <at> gnu.org, eggert <at> cs.ucla.edu
> Date: Sun, 17 May 2020 17:13:26 +0000
> 
> If a C function caller of 'flush_stack_call_func' allocates a
> Lisp_Object in a temp variable and the compiler decide to keep this in a
> callee saved reg while 'flush_stack_call_func' is called this will be
> garbage collected unexpectedly.

Can you show me an example of this (as skeleton C code)?

Thanks.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#41357; Package emacs. (Sun, 17 May 2020 17:25:02 GMT) Full text and rfc822 format available.

Message #38 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: bug-gnu-emacs <at> gnu.org, akrl <at> sdf.org
Subject: Re: 28.0.50; GC may miss to mark calle safe register content
Date: Sun, 17 May 2020 20:24:23 +0300
> Cc: bug-gnu-emacs <at> gnu.org
> From: Paul Eggert <eggert <at> cs.ucla.edu>
> Date: Sun, 17 May 2020 10:08:00 -0700
> 
> It's a low-level problem in which the garbage collector is not seeing some
> objects that it should see, because at the machine level the object addresses
> are in registers that the garbage collector hasn't saved and thus won't see when
> it scans memory.

Since we write in C and in Lisp, not in assembly, I struggle to see
how a Lisp object could appear in a register without leaving any trace
on the stack.  I'm probably missing something.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#41357; Package emacs. (Sun, 17 May 2020 17:29:01 GMT) Full text and rfc822 format available.

Message #41 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Andrea Corallo <akrl <at> sdf.org>
Cc: bug-gnu-emacs <at> gnu.org, eggert <at> cs.ucla.edu
Subject: Re: 28.0.50; GC may miss to mark calle safe register content
Date: Sun, 17 May 2020 20:28:04 +0300
> From: Andrea Corallo <akrl <at> sdf.org>
> Cc: Paul Eggert <eggert <at> cs.ucla.edu>, bug-gnu-emacs <at> gnu.org
> Date: Sun, 17 May 2020 17:21:54 +0000
> 
> But I just wanted to point out that I think this is clearly a bug in our
> code, and it can trigger depending on decision of the tool-chain we have
> no control over.

I'm not saying it is not a bug.  I'm saying that we've lived with this
bug for a very long time, so if it is real, it is definitely very-very
rare.

If we put every bugfix we could think of into the release branch, we
will never release Emacs 27.  Never.  Because there's always one more
bug.

> I think it should be trated as we would do if we discover and verify an
> "access out of bounds" or "read after free".

Depends on the circumstances.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#41357; Package emacs. (Sun, 17 May 2020 17:46:01 GMT) Full text and rfc822 format available.

Message #44 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Andrea Corallo <akrl <at> sdf.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: bug-gnu-emacs <at> gnu.org, eggert <at> cs.ucla.edu
Subject: Re: 28.0.50; GC may miss to mark calle safe register content
Date: Sun, 17 May 2020 17:45:28 +0000
Eli Zaretskii <eliz <at> gnu.org> writes:

>> From: Andrea Corallo <akrl <at> sdf.org>
>> Cc: bug-gnu-emacs <at> gnu.org, eggert <at> cs.ucla.edu
>> Date: Sun, 17 May 2020 17:13:26 +0000
>> 
>> If a C function caller of 'flush_stack_call_func' allocates a
>> Lisp_Object in a temp variable and the compiler decide to keep this in a
>> callee saved reg while 'flush_stack_call_func' is called this will be
>> garbage collected unexpectedly.
>
> Can you show me an example of this (as skeleton C code)?
>
> Thanks.

Sure, something like

=====

Lisp_Object
foo (void)
{
  /* 'res' goes in a callee-saved register.  */
  Lisp_Object res = build_string ("bar");
  [...]
  /* With LTO, the following may be inlined as
     "flush_stack_call_func (mark_threads_callback, NULL);".  */
  mark_threads ();
  [...]
  gc_sweep ();

  /* The string pointed to by 'res' was garbage collected.  */
  return res;
}

=====

I'm not sure this is the only possible scenario, though.

  Andrea

-- 
akrl <at> sdf.org




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#41357; Package emacs. (Sun, 17 May 2020 17:58:01 GMT) Full text and rfc822 format available.

Message #47 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Andrea Corallo <akrl <at> sdf.org>
Cc: bug-gnu-emacs <at> gnu.org, eggert <at> cs.ucla.edu
Subject: Re: 28.0.50; GC may miss to mark calle safe register content
Date: Sun, 17 May 2020 20:57:28 +0300
> From: Andrea Corallo <akrl <at> sdf.org>
> Cc: bug-gnu-emacs <at> gnu.org, eggert <at> cs.ucla.edu
> Date: Sun, 17 May 2020 17:45:28 +0000
> 
> Lisp_Object
> foo (void)
> {
>   /* 'res' goes in a callee saved reg  */
>   Lisp_Object res = build_string ("bar");
>   [...]
>   /* LTO inline the following as "flush_stack_call_func (mark_threads_callback, NULL);" */
>   mark_threads ();
>   [...]
>   gc_sweep ();
> 
>   /* The string pointed by 'res' was garbage collected.  */
>   return res;
> }

But mark_threads etc. (GC in general) isn't called from functions like
your 'foo'.  It is more like this:

Lisp_Object
foo (void)
{
  /* 'res' goes in a callee saved reg  */
  Lisp_Object res = build_string ("bar");
  [...]
  call_something ();
  [...]

}

void
call_something (void)
{
  [...]
  garbage_collect ();
  [...]
}

Which is quite different, AFAIU, wrt stack usage.

Or maybe I don't understand how "callee saved registers" work.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#41357; Package emacs. (Sun, 17 May 2020 18:17:01 GMT) Full text and rfc822 format available.

Message #50 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Andrea Corallo <akrl <at> sdf.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: bug-gnu-emacs <at> gnu.org, eggert <at> cs.ucla.edu
Subject: Re: 28.0.50; GC may miss to mark calle safe register content
Date: Sun, 17 May 2020 18:16:24 +0000
Eli Zaretskii <eliz <at> gnu.org> writes:

>> From: Andrea Corallo <akrl <at> sdf.org>
>> Cc: bug-gnu-emacs <at> gnu.org, eggert <at> cs.ucla.edu
>> Date: Sun, 17 May 2020 17:45:28 +0000
>> 
>> Lisp_Object
>> foo (void)
>> {
>>   /* 'res' goes in a callee saved reg  */
>>   Lisp_Object res = build_string ("bar");
>>   [...]
>>   /* LTO inline the following as "flush_stack_call_func (mark_threads_callback, NULL);" */
>>   mark_threads ();
>>   [...]
>>   gc_sweep ();
>> 
>>   /* The string pointed by 'res' was garbage collected.  */
>>   return res;
>> }
>
> But mark_threads etc. (GC in general) isn't called from functions like
> your 'foo.  It is more like this:
>
> Lisp_Object
> foo (void)
> {
>   /* 'res' goes in a callee saved reg  */
>   Lisp_Object res = build_string ("bar");
>   [...]
>   call_something ();
>   [...]
>
> }
>
> call_something (void)
> {
>   [...]
>   garbage_collect ();
>   [...]
> }

Yes, my example was minimal; yours is certainly more realistic.

But this one can also be critical.  We have to hope that in
'call_something' or 'garbage_collect' there is sufficient register
pressure for the register holding 'res' to be pushed.


  Andrea

-- 
akrl <at> sdf.org




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#41357; Package emacs. (Sun, 17 May 2020 19:02:01 GMT) Full text and rfc822 format available.

Message #53 received at 41357 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Pip Cet <pipcet <at> gmail.com>
Cc: Eli Zaretskii <eliz <at> gnu.org>, 41357 <at> debbugs.gnu.org,
 Andrea Corallo <akrl <at> sdf.org>
Subject: Re: bug#41357: 28.0.50; GC may miss to mark calle safe register
 content
Date: Sun, 17 May 2020 12:01:40 -0700
On 5/17/20 10:00 AM, Pip Cet wrote:
> I don't think that platform even has callee-saved registers?

Eli's platform is 32-bit Microsoft Windows, and W32 has four callee-save
registers (ebx, esi, edi, ebp) not counting esp and eip which are of course
callee-save by definition. So the problem could at least in theory be occurring
on his platform, depending on the compiler and its options.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#41357; Package emacs. (Sun, 17 May 2020 19:06:02 GMT) Full text and rfc822 format available.

Message #56 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: bug-gnu-emacs <at> gnu.org, akrl <at> sdf.org
Subject: Re: 28.0.50; GC may miss to mark calle safe register content
Date: Sun, 17 May 2020 12:05:25 -0700
On 5/17/20 10:24 AM, Eli Zaretskii wrote:
> I struggle to see
> how a Lisp object could appear in a register without leaving any trace
> on the stack

Quite easily. It happens all the time. If I do something like this:

    Lisp_Object a = Fcons (b, c);
    f (x, y);
    return a;

The compiler might put 'a' into a callee-save register R, which means that while
f is running there's no trace of 'a' on the stack (unless f's code itself
decides to use R for whatever reason, but let's suppose it doesn't). This
situation can persist even if f calls g which calls h which calls the garbage
collector, and the garbage collector will then think the cons is garbage even
though it's not.

The proposed fix is harmless except it may execute a handful more instructions
per GC. So the cost of applying the fix is tiny, whereas the potential
reliability benefit is large.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#41357; Package emacs. (Sun, 17 May 2020 19:21:02 GMT) Full text and rfc822 format available.

Message #59 received at 41357 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 41357 <at> debbugs.gnu.org, pipcet <at> gmail.com, akrl <at> sdf.org
Subject: Re: bug#41357: 28.0.50; GC may miss to mark calle safe register
 content
Date: Sun, 17 May 2020 22:19:52 +0300
> Cc: Andrea Corallo <akrl <at> sdf.org>, Eli Zaretskii <eliz <at> gnu.org>,
>  41357 <at> debbugs.gnu.org
> From: Paul Eggert <eggert <at> cs.ucla.edu>
> Date: Sun, 17 May 2020 12:01:40 -0700
> 
> On 5/17/20 10:00 AM, Pip Cet wrote:
> > I don't think that platform even has callee-saved registers?
> 
> Eli's platform is 32-bit Microsoft Windows, and W32 has four callee-save
> registers (ebx, esi, edi, ebp) not counting esp and eip which are of course
> callee-save by definition. So the problem could at least in theory be occurring
> on his platform, depending on the compiler and its options.

I've seen the same problem on 64-bit Windows as well, in Emacs
compiled with a different (newer) version of GCC.  I don't think this
has anything to do with how many registers are there.  I also never
before saw these problems, so this is most definitely due to some
recent changes, I just cannot yet figure out which ones.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#41357; Package emacs. (Sun, 17 May 2020 19:27:01 GMT) Full text and rfc822 format available.

Message #62 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: bug-gnu-emacs <at> gnu.org, akrl <at> sdf.org
Subject: Re: 28.0.50; GC may miss to mark calle safe register content
Date: Sun, 17 May 2020 22:26:18 +0300
> Cc: akrl <at> sdf.org, bug-gnu-emacs <at> gnu.org
> From: Paul Eggert <eggert <at> cs.ucla.edu>
> Date: Sun, 17 May 2020 12:05:25 -0700
> 
> On 5/17/20 10:24 AM, Eli Zaretskii wrote:
> > I struggle to see
> > how a Lisp object could appear in a register without leaving any trace
> > on the stack
> 
> Quite easily. It happens all the time. If I do something like this:
> 
>     Lisp_Object a = Fcons (b, c);
>     f (x, y);
>     return a;

And where's GC in this picture?  If it's called directly from 'f', can
you show me such code in Emacs?  Then we could disassemble it and see
what we've got.

Usually the code that calls GC is much deeper, and thus the chance of
that temporary to stay in a register is very small, to say the least.

> The proposed fix is harmless

Yeah, right.  Sorry, I don't buy this.  Too many gray hairs from such
assumptions.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#41357; Package emacs. (Sun, 17 May 2020 19:47:01 GMT) Full text and rfc822 format available.

Message #65 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Andrea Corallo <akrl <at> sdf.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: bug-gnu-emacs <at> gnu.org, Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: 28.0.50; GC may miss to mark calle safe register content
Date: Sun, 17 May 2020 19:46:35 +0000
Eli Zaretskii <eliz <at> gnu.org> writes:

>> Cc: akrl <at> sdf.org, bug-gnu-emacs <at> gnu.org
>> From: Paul Eggert <eggert <at> cs.ucla.edu>
>> Date: Sun, 17 May 2020 12:05:25 -0700
>> 
>> On 5/17/20 10:24 AM, Eli Zaretskii wrote:
>> > I struggle to see
>> > how a Lisp object could appear in a register without leaving any trace
>> > on the stack
>> 
>> Quite easily. It happens all the time. If I do something like this:
>> 
>>     Lisp_Object a = Fcons (b, c);
>>     f (x, y);
>>     return a;
>
> And where's GC in this picture?

GC can be triggered by f or any of its callees; it does not matter.

> If it's called directly from 'f', can
> you show me such code in Emacs?  Then we could disassembly it and see
> what we've got.

I'm not sure what we can prove by disassembling; that would be just the
result of a specific .c + toolchain + invocation.  I think we want
code that is sufficiently portable, and safe because it is correct.

> Usually the code that calls GC is much deeper, and thus the chance of
> that temporary to stay in a register is very small, to say the least.

Probably yes, but I don't think we want to have code that works accidentally.

  Andrea

-- 
akrl <at> sdf.org




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#41357; Package emacs. (Sun, 17 May 2020 20:24:02 GMT) Full text and rfc822 format available.

Message #68 received at 41357 <at> debbugs.gnu.org (full text, mbox):

From: Andrea Corallo <akrl <at> sdf.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 41357 <at> debbugs.gnu.org, Paul Eggert <eggert <at> cs.ucla.edu>, pipcet <at> gmail.com
Subject: Re: bug#41357: 28.0.50;
 GC may miss to mark calle safe register content
Date: Sun, 17 May 2020 20:23:02 +0000
Eli Zaretskii <eliz <at> gnu.org> writes:

>> Eli's platform is 32-bit Microsoft Windows, and W32 has four callee-save
>> registers (ebx, esi, edi, ebp) not counting esp and eip which are of course
>> callee-save by definition. So the problem could at least in theory be occurring
>> on his platform, depending on the compiler and its options.
>
> I've seen the same problem on 64-bit Windows as well, in Emacs
> compiled with a different (newer) version of GCC.  I don't think this
> has anything to do with how many registers are there.  I also never
> before saw these problems, so this is most definitely due to some
> recent changes, I just cannot yet figure out which ones.

It's hard to say but I suspect the main gate that saved us till today is
'garbage_collect' that being quite big is likely to have calle-save regs
spilled.

You can disassemble your 'garbage_collect' and see whether all of these regs
are spilled (in my case they are at -O2).  This could give an indication
of the correlation between the two bugs (but nothing more).
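
Rather than relying on a large function happening to spill every
callee-save register, a collector entry point can force the spill
explicitly.  The following is only a minimal sketch of that idea and is
not necessarily the patch discussed in this thread.  The names
'spill_registers_and_call', 'call_with_stack_end', 'record_stack_end'
and 'stack_scanner' are hypothetical; it assumes GCC or Clang for
'__builtin_unwind_init' and falls back to 'setjmp' elsewhere, which is
common practice but not guaranteed by the C standard to capture every
register.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

#if defined __GNUC__
# define NO_INLINE __attribute__ ((noinline))
#else
# define NO_INLINE
#endif

/* Hypothetical callback type: receives an approximate end-of-stack
   address plus a user argument.  */
typedef void (*stack_scanner) (void *stack_end, void *arg);

/* Example scanner for illustration only: it just records the
   address it was given into the pointer that ARG points to.  */
static void
record_stack_end (void *stack_end, void *arg)
{
  *(void **) arg = stack_end;
}

/* Call SCAN with the address of a local in a fresh frame.  Letting
   the address escape keeps this frame alive across the call.  */
static NO_INLINE void
call_with_stack_end (stack_scanner scan, void *arg)
{
  volatile char marker = 0;
  scan ((void *) &marker, arg);
}

/* Force the callee-save registers into this frame, then call SCAN
   from a deeper frame so that a scan starting at the reported
   address covers the spilled values.  */
static void
spill_registers_and_call (stack_scanner scan, void *arg)
{
  volatile int done = 0;
  assert (scan != NULL);
#if defined __GNUC__
  /* Assumption: the compiler provides this builtin (GCC and Clang do).  */
  __builtin_unwind_init ();
#else
  /* Fallback: setjmp usually stores the callee-save registers in REGS.  */
  jmp_buf regs;
  setjmp (regs);
#endif
  call_with_stack_end (scan, arg);
  done = 1;   /* Work after the call prevents a tail call.  */
  (void) done;
}
```

A caller would pass its real marking routine in place of
'record_stack_end'.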

  Andrea

-- 
akrl <at> sdf.org




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#41357; Package emacs. (Sun, 17 May 2020 21:04:01 GMT) Full text and rfc822 format available.

Message #71 received at 41357 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 41357 <at> debbugs.gnu.org, pipcet <at> gmail.com, akrl <at> sdf.org
Subject: Re: bug#41357: 28.0.50; GC may miss to mark calle safe register
 content
Date: Sun, 17 May 2020 14:03:12 -0700
>>> I don't think that platform even has callee-saved registers?

>> Eli's platform is 32-bit Microsoft Windows, and W32 has four callee-save
>> registers (ebx, esi, edi, ebp) not counting esp and eip which are of course
>> callee-save by definition. So the problem could at least in theory be occurring
>> on his platform, depending on the compiler and its options.

> I've seen the same problem on 64-bit Windows as well, in Emacs
> compiled with a different (newer) version of GCC.  I don't think this
> has anything to do with how many registers are there. 

You're right that the number of registers doesn't matter, in the sense that the
problem can occur if any registers are callee-save. I was responding to Pip
Cet's comment, where he said he thought your platform (W32 in the original bug
report) had zero callee-saved registers. That would have meant the problem
couldn't occur on your platform. However, because your platform does have
callee-save registers the problem can occur there.

64-bit Windows also has callee-save registers (rbx, rbp, rdi, rsi, r12, r13,
r14, r15) so it can also have the problem. Most platforms do have callee-save
registers these days.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#41357; Package emacs. (Sun, 17 May 2020 21:22:01 GMT) Full text and rfc822 format available.

Message #74 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: bug-gnu-emacs <at> gnu.org, akrl <at> sdf.org
Subject: Re: 28.0.50; GC may miss to mark calle safe register content
Date: Sun, 17 May 2020 14:21:49 -0700
On 5/17/20 12:26 PM, Eli Zaretskii wrote:
> And where's GC in this picture?  If it's called directly from 'f', can
> you show me such code in Emacs?  Then we could disassemble it and see
> what we've got.
> 
> Usually the code that calls GC is much deeper, and thus the chance of
> that temporary to stay in a register is very small, to say the least.

The probability is not that small, unfortunately. Compilers often have a habit
of running through the same set of callee-save registers in the same order.
Let's say you're on the x86 and your compiler consumes the four callee-save
registers in the order ebx, esi, edi, ebp. Then if we call f which calls g which
calls h which calls the GC, it's likely that f will save just ebx, then g will
save just ebx, esi, edi, then h will save just ebx and esi. Hence if the caller
has assigned a local variable to ebp, the GC won't see the variable's contents.
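
The scenario above can be sketched as a small model.  This is an
illustration only, not Emacs code: the bit flags and the function names
'unsaved_registers' and 'example_chain_unsaved' are hypothetical, and the
model simply takes the union of what each frame in the f -> g -> h chain
pushes and reports which callee-save registers never reach the stack.

```c
#include <assert.h>

/* Hypothetical bit flags for the four x86 callee-save registers
   named in the message.  */
enum { EBX = 1 << 0, ESI = 1 << 1, EDI = 1 << 2, EBP = 1 << 3 };
#define ALL_CALLEE_SAVE (EBX | ESI | EDI | EBP)

/* Return the set of callee-save registers that no frame in CHAIN
   pushed onto the stack.  A value held only in one of these
   registers is invisible to a collector that scans just the stack.  */
static unsigned
unsaved_registers (const unsigned *chain, int nframes)
{
  unsigned saved = 0;
  assert (nframes >= 0);
  for (int i = 0; i < nframes; i++)
    saved |= chain[i];
  return ALL_CALLEE_SAVE & ~saved;
}

/* The chain from the message: f saves ebx; g saves ebx, esi, edi;
   h saves ebx and esi.  */
static unsigned
example_chain_unsaved (void)
{
  const unsigned chain[] = { EBX, EBX | ESI | EDI, EBX | ESI };
  return unsaved_registers (chain, 3);
}
```

In this model 'example_chain_unsaved' returns a set containing only EBP,
matching the case where the caller's local variable lives in ebp and is
never written to the stack before the GC runs.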

We should give Andrea a big round of applause for catching this bug.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#41357; Package emacs. (Sun, 17 May 2020 21:28:02 GMT) Full text and rfc822 format available.

Message #77 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Eli Zaretskii <eliz <at> gnu.org>, Andrea Corallo <akrl <at> sdf.org>
Cc: bug-gnu-emacs <at> gnu.org
Subject: Re: 28.0.50; GC may miss to mark calle safe register content
Date: Sun, 17 May 2020 14:27:55 -0700
On 5/17/20 10:28 AM, Eli Zaretskii wrote:
> I'm saying that we've lived with this
> bug for a very long time, so if it is real, it is definitely very-very
> rare.

Yes, it very much has the smell of a GC bug: rare and hard to reproduce, but
deadly when it occurs. I wouldn't be surprised if it's causing the rare and
hard-to-reproduce crashes you reported in Bug#41321. You might try installing
the patch into your copy of emacs-27 and see whether it affects Bug#41321.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#41357; Package emacs. (Sun, 17 May 2020 21:48:02 GMT) Full text and rfc822 format available.

Message #80 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Andrea Corallo <akrl <at> sdf.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Eli Zaretskii <eliz <at> gnu.org>, bug-gnu-emacs <at> gnu.org
Subject: Re: 28.0.50; GC may miss to mark calle safe register content
Date: Sun, 17 May 2020 21:47:12 +0000
Paul Eggert <eggert <at> cs.ucla.edu> writes:

> On 5/17/20 9:40 AM, Andrea Corallo wrote:
>> I think this is a real bug that we have in the codebase (emacs-27
>> included).
>
> Thanks for all the detective work! Your analysis is correct and your patch looks
> good. I've always been suspicious of that code, and it looks like you've
> confirmed my suspicions.
>
> The only question in my mind is whether to install the patch into the emacs-27
> branch or the master branch. Given Eli's problems with stability in emacs-27
> (see Bug#41321), I'm inclined to think the former, as the bug could explain the
> problems Eli is observing.

Hi Paul,

I'm glad you liked the investigation.

I've pushed the fix to master as Eli suggested; feel free to improve it
if needed.

I hope a different agreement can be found for 27; if so, I'll port it
there.

Thanks!

  Andrea

-- 
akrl <at> sdf.org




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#41357; Package emacs. (Mon, 25 May 2020 02:11:03 GMT) Full text and rfc822 format available.

Message #83 received at 41357 <at> debbugs.gnu.org (full text, mbox):

From: Tom Tromey <tom <at> tromey.com>
To: Andrea Corallo <akrl <at> sdf.org>
Cc: Eli Zaretskii <eliz <at> gnu.org>, 41357 <at> debbugs.gnu.org,
 Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: bug#41357: 28.0.50; GC may miss to mark calle safe register
 content
Date: Sun, 24 May 2020 20:09:58 -0600
Andrea> For now I'm testing the simple attached patch that seems to do the job
Andrea> for me.

It looks like this patch was checked in, so I think this bug can be
closed.  I didn't want to do it without verifying with you first though.

Tom




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#41357; Package emacs. (Mon, 25 May 2020 08:39:01 GMT) Full text and rfc822 format available.

Message #86 received at 41357 <at> debbugs.gnu.org (full text, mbox):

From: Andrea Corallo <akrl <at> sdf.org>
To: Tom Tromey <tom <at> tromey.com>
Cc: 41357 <at> debbugs.gnu.org, Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: bug#41357: 28.0.50;
 GC may miss to mark calle safe register content
Date: Mon, 25 May 2020 08:37:54 +0000
Tom Tromey <tom <at> tromey.com> writes:

> Andrea> For now I'm testing the simple attached patch that seems to do the job
> Andrea> for me.
>
> It looks like this patch was checked in, so I think this bug can be
> closed.  I didn't want to do it without verifying with you first though.

Hi Tom,

thanks.  Yes, the only remaining point was whether to apply it to emacs-27,
given the bug is present there too; I think Eli prefers not to do that, though.

I'm not sure what the state of the bug should be then; feel free to close it
if that's the correct state.

Thanks

  Andrea

-- 
akrl <at> sdf.org




Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Thu, 28 May 2020 22:09:01 GMT) Full text and rfc822 format available.

Notification sent to Andrea Corallo <akrl <at> sdf.org>:
bug acknowledged by developer. (Thu, 28 May 2020 22:09:02 GMT) Full text and rfc822 format available.

Message #91 received at 41357-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Andrea Corallo <akrl <at> sdf.org>, Tom Tromey <tom <at> tromey.com>
Cc: 41357-done <at> debbugs.gnu.org
Subject: Re: bug#41357: 28.0.50; GC may miss to mark calle safe register
 content
Date: Thu, 28 May 2020 15:08:35 -0700
On 5/25/20 1:37 AM, Andrea Corallo wrote:

> Not sure what should be the state of the bug then, feel free to close it
> if that's the correct state.

"Fixed in master" is good enough to close a bug report, so I'm closing it.
Thanks again.




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 26 Jun 2020 11:24:05 GMT) Full text and rfc822 format available.
