GNU bug report logs - #38748
28.0.50; crash on MacOS 10.15.2

Package: emacs;

Reported by: Andrii Kolomoiets <andreyk.mad <at> gmail.com>

Date: Thu, 26 Dec 2019 09:49:01 UTC

Severity: normal

Merged with 38822

Found in versions 27.0.60, 28.0.50

Fixed in version 27.1

Done: Lars Ingebrigtsen <larsi <at> gnus.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 38748 in the body.
You can then email your comments to 38748 AT debbugs.gnu.org in the normal way.

Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Thu, 26 Dec 2019 09:49:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to Andrii Kolomoiets <andreyk.mad <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Thu, 26 Dec 2019 09:49:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Andrii Kolomoiets <andreyk.mad <at> gmail.com>
To: bug-gnu-emacs <at> gnu.org
Subject: 28.0.50; crash on MacOS 10.15.2
Date: Thu, 26 Dec 2019 11:47:29 +0200

[Message part 1 (text/plain, inline)]

Unfortunately I have no recipe to reproduce this issue.  Emacs just
crashes from time to time.

See attached crash info.

Emacs is built from a fairly recent master (commit
7c5d6a2afc6c23a7fff8456f506ee2aa2d37a3b9)

In GNU Emacs 28.0.50 (build 2, x86_64-apple-darwin19.2.0, NS appkit-1894.20 Version 10.15.2 (Build 19C57))
Windowing system distributor 'Apple', version 10.3.1894
System Description:  Mac OS X 10.15.2

Configured using:
 'configure --disable-dependency-tracking --disable-silent-rules
 --enable-locallisppath=/usr/local/share/emacs/site-lisp
 --infodir=/usr/local/Cellar/emacs/dev/share/info/emacs
 --prefix=/usr/local/Cellar/emacs/dev --with-gnutls --without-x
 --with-xml2 --without-dbus --with-modules --disable-ns-self-contained
 --with-ns'

Configured features:
NOTIFY KQUEUE ACL GNUTLS LIBXML2 ZLIB TOOLKIT_SCROLL_BARS NS MODULES
THREADS JSON PDUMPER GMP

[emacs-crash.txt (text/plain, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Thu, 26 Dec 2019 13:05:02 GMT) Full text and rfc822 format available.

Message #8 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Alan Third <alan <at> idiocy.org>
To: Andrii Kolomoiets <andreyk.mad <at> gmail.com>
Cc: 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Thu, 26 Dec 2019 13:04:20 +0000

On Thu, Dec 26, 2019 at 11:47:29AM +0200, Andrii Kolomoiets wrote:
> Unfortunately I have no recipe to reproduce this issue.  Emacs just
> crashes from time to time.
> 
> See attached crash info.
> 
> Emacs is built from a fairly recent master (commit
> 7c5d6a2afc6c23a7fff8456f506ee2aa2d37a3b9)
> 
<snip>
> 
> Exception Type:        EXC_BAD_ACCESS (SIGABRT)
> Exception Codes:       KERN_INVALID_ADDRESS at 0x00000000434f4e44
> Exception Note:        EXC_CORPSE_NOTIFY
> 
<snip>
>
> 20  org.gnu.Emacs                 	0x00000001084a7c86 handle_sigsegv + 168
> 21  libsystem_platform.dylib      	0x00007fff6b73a42d _sigtramp + 29
> 22  ???                           	000000000000000000 0 + 0
> 23  org.gnu.Emacs                 	0x00000001084ddd80 mark_object + 272
> 24  org.gnu.Emacs                 	0x00000001084ddd80 mark_object + 272

Looks like a crash in GC.
-- 
Alan Third

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Thu, 26 Dec 2019 17:19:01 GMT) Full text and rfc822 format available.

Message #11 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Alan Third <alan <at> idiocy.org>
Cc: andreyk.mad <at> gmail.com, 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Thu, 26 Dec 2019 19:18:33 +0200

> Date: Thu, 26 Dec 2019 13:04:20 +0000
> From: Alan Third <alan <at> idiocy.org>
> Cc: 38748 <at> debbugs.gnu.org
> 
> > 20  org.gnu.Emacs                 	0x00000001084a7c86 handle_sigsegv + 168
> > 21  libsystem_platform.dylib      	0x00007fff6b73a42d _sigtramp + 29
> > 22  ???                           	000000000000000000 0 + 0
> > 23  org.gnu.Emacs                 	0x00000001084ddd80 mark_object + 272
> > 24  org.gnu.Emacs                 	0x00000001084ddd80 mark_object + 272
> 
> Looks like a crash in GC.

Yes, but why?

One possibility is stack overflow.  If that's not the reason, then one
needs to employ the technique described in etc/DEBUG to find out which
object got corrupted and why.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Fri, 27 Dec 2019 11:29:01 GMT) Full text and rfc822 format available.

Message #14 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Andrii Kolomoiets <andreyk.mad <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: Alan Third <alan <at> idiocy.org>, 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Fri, 27 Dec 2019 13:28:11 +0200

[Message part 1 (text/plain, inline)]

Eli Zaretskii <eliz <at> gnu.org> writes:

>> Date: Thu, 26 Dec 2019 13:04:20 +0000
>> From: Alan Third <alan <at> idiocy.org>
>> Cc: 38748 <at> debbugs.gnu.org
>> 
>> > 20  org.gnu.Emacs                 	0x00000001084a7c86 handle_sigsegv + 168
>> > 21  libsystem_platform.dylib      	0x00007fff6b73a42d _sigtramp + 29
>> > 22  ???                           	000000000000000000 0 + 0
>> > 23  org.gnu.Emacs                 	0x00000001084ddd80 mark_object + 272
>> > 24  org.gnu.Emacs                 	0x00000001084ddd80 mark_object + 272
>> 
>> Looks like a crash in GC.
>
> Yes, but why?
>
> One possibility is stack overflow.  If that's not the reason, then one
> needs to employ the technique described in etc/DEBUG to find out which
> object got corrupted and why.

I followed the steps described in etc/DEBUG.

Emacs is configured using:
'configure --without-xml2 --with-ns --with-modules
 --disable-ns-self-contained --enable-checking=yes,glyphs
 --enable-check-lisp-object-type 'CFLAGS=-O3 -g3''

See gdb session output attached.

Hope this will help.

[gdb-bt-full.txt (text/plain, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Fri, 27 Dec 2019 14:15:01 GMT) Full text and rfc822 format available.

Message #17 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Andrii Kolomoiets <andreyk.mad <at> gmail.com>
Cc: alan <at> idiocy.org, 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Fri, 27 Dec 2019 16:14:11 +0200

> From: Andrii Kolomoiets <andreyk.mad <at> gmail.com>
> Cc: Alan Third <alan <at> idiocy.org>,  38748 <at> debbugs.gnu.org
> Date: Fri, 27 Dec 2019 13:28:11 +0200
> 
> > One possibility is stack overflow.  If that's not the reason, then one
> > needs to employ the technique described in etc/DEBUG to find out which
> > object got corrupted and why.
> 
> I followed the steps described in etc/DEBUG.
> 
> See gdb session output attached.

The attachment just shows the output of "bt full", I see nothing there
that should have been produced by following the etc/DEBUG instructions
under "Debugging problems which happen in GC".

Are you sure you posted the file you intended to?

Thanks.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Sun, 29 Dec 2019 19:02:02 GMT) Full text and rfc822 format available.

Message #20 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Andrii Kolomoiets <andreyk.mad <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: alan <at> idiocy.org, 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Sun, 29 Dec 2019 21:01:42 +0200

Eli Zaretskii <eliz <at> gnu.org> writes:

>> From: Andrii Kolomoiets <andreyk.mad <at> gmail.com>
>> Cc: Alan Third <alan <at> idiocy.org>,  38748 <at> debbugs.gnu.org
>> Date: Fri, 27 Dec 2019 13:28:11 +0200
>> 
>> > One possibility is stack overflow.  If that's not the reason, then one
>> > needs to employ the technique described in etc/DEBUG to find out which
>> > object got corrupted and why.
>> 
>> I followed the steps described in etc/DEBUG.
>> 
>> See gdb session output attached.
>
> The attachment just shows the output of "bt full", I see nothing there
> that should have been produced by following the etc/DEBUG instructions
> under "Debugging problems which happen in GC".
>
> Are you sure you posted the file you intended to?

My bad, didn't read that section at all.  I read only "Configuring Emacs
for debugging" section because of this text in `report-emacs-bug'
letter: "If Emacs crashed, include the output from 'bt full' and 'xbacktrace'".

Now Emacs is built with -O0 and I need some help, please.

(gdb) bt full
#0  terminate_due_to_signal (sig=607650026, backtrace_limit=1116) at ../../emacs/src/emacs.c:370
No locals.
#1  0x0000000100a28660 in ?? ()
No symbol table info available.
#2  0x0000000000000000 in ?? ()
No symbol table info available.

Lisp Backtrace:
Cannot access memory at address 0xadf0

I can print the 'last_marked_index':

(gdb) p last_marked_index
$2 = 41

But what can I do with 'last_marked'?

(gdb) p last_marked[40]
'last_marked' has unknown type; cast it to its declared type

Give me some tips, please. TIA.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Sun, 29 Dec 2019 19:32:01 GMT) Full text and rfc822 format available.

Message #23 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Andrii Kolomoiets <andreyk.mad <at> gmail.com>
Cc: alan <at> idiocy.org, 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Sun, 29 Dec 2019 21:31:17 +0200

> From: Andrii Kolomoiets <andreyk.mad <at> gmail.com>
> Cc: alan <at> idiocy.org,  38748 <at> debbugs.gnu.org
> Date: Sun, 29 Dec 2019 21:01:42 +0200
> 
> I can print the 'last_marked_index':
> 
> (gdb) p last_marked_index
> $2 = 41
> 
> But what can I do with 'last_marked'?
> 
> (gdb) p last_marked[40]
> 'last_marked' has unknown type; cast it to its declared type

last_marked is an array of Lisp objects, arranged in circular order,
i.e. when the index reaches the last element, it is reset back to
zero.

To print the object at last_marked[i], for some i, you do

  (gdb) p last_marked[i]
  (gdb) xtype

The xtype command will tell you the type of the Lisp object.  You then
display it with the corresponding xTYPE command: xint for an integer,
xcons for a cons cell, xstring for a string, xvector for a vector,
xbuffer for a buffer, etc.  Here's a short example:

  (gdb) p last_marked_index
  $2 = 1
  (gdb) p last_marked[0]
  $3 = XIL(0x8000000006287630)
  (gdb) xtype
  Lisp_String
  (gdb) xstring
  $4 = (struct Lisp_String *) 0x6287630
  " *buffer-defaults*"

So in this example, the last marked object was a Lisp string whose
contents is " *buffer-defaults*".  GDB stores its C definition in
history slot $4, so we can look at its details:

  (gdb) p *$4
  $5 = {
    u = {
      s = {
	size = 18,
	size_byte = -2,
	intervals = 0x0,
	data = 0x19a1dea <DEFAULT_REHASH_SIZE+14054> " *buffer-defaults*"
      },
      next = 0x12,
      gcaligned = 18 '\022'
    }
  }

All of those commands are in src/.gdbinit; if GDB says it doesn't know
these commands, tell it to read that file:

  (gdb) source /path/to/emacs/src/.gdbinit

If last_marked_index is 41, you should print the objects starting from
last_marked[40], going back (39, 38, 37, etc.), trying to find the
object that is corrupted (e.g., the corresponding xTYPE command will
error out trying to display it).
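
To make the circular layout concrete, here is a minimal C sketch of
such a ring and of walking it backward from the most recent entry.  It
is only an illustration: the names ring, ring_index, ring_record and
ring_recent are hypothetical, the element type is int rather than
Lisp_Object, and the size 8 is arbitrary; the real code in Emacs may
differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: the real array holds Lisp_Object values and
   its size is LAST_MARKED_SIZE; the value 8 here is arbitrary.  */
#define RING_SIZE 8

static int ring[RING_SIZE];
static size_t ring_index;   /* Next slot to be written.  */

/* Record VALUE, wrapping to slot 0 after the last slot.  */
static void
ring_record (int value)
{
  ring[ring_index] = value;
  ring_index = (ring_index + 1) % RING_SIZE;
}

/* Return the entry written AGE steps ago; AGE 0 is the most recent.
   AGE must be less than RING_SIZE.  */
static int
ring_recent (size_t age)
{
  assert (age < RING_SIZE);
  return ring[(ring_index + RING_SIZE - 1 - age) % RING_SIZE];
}
```

After recording the values 1 through 10, ring_recent (0) yields 10 and
ring_recent (1) yields 9.  This corresponds to starting at
last_marked[last_marked_index - 1] and going back, wrapping around
past slot 0 to the end of the array.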

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Wed, 01 Jan 2020 20:43:01 GMT) Full text and rfc822 format available.

Message #26 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Andrii Kolomoiets <andreyk.mad <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: alan <at> idiocy.org, jguenther <at> gmail.com, 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Wed, 01 Jan 2020 22:42:19 +0200

Eli Zaretskii <eliz <at> gnu.org> writes:

>> From: Andrii Kolomoiets <andreyk.mad <at> gmail.com>
>> Cc: alan <at> idiocy.org,  38748 <at> debbugs.gnu.org
>> Date: Sun, 29 Dec 2019 21:01:42 +0200
>> 
>> I can print the 'last_marked_index':
>> 
>> (gdb) p last_marked_index
>> $2 = 41
>> 
>> But what can I do with 'last_marked'?
>> 
>> (gdb) p last_marked[40]
>> 'last_marked' has unknown type; cast it to its declared type
>
> last_marked is an array of Lisp objects, arranged in circular order,
> i.e. when the index reaches the last element, it is reset back to
> zero.
>
> To print the object at last_marked[i], for some i, you do
>
>   (gdb) p last_marked[i]
>   (gdb) xtype
>
> The xtype command will tell you the type of the Lisp object.  You then
> display it with the corresponding xTYPE command: xint for an integer,
> xcons for a cons cell, xstring for a string, xvector for a vector,
> xbuffer for a buffer, etc.  Here's a short example:
>
>   (gdb) p last_marked_index
>   $2 = 1
>   (gdb) p last_marked[0]
>   $3 = XIL(0x8000000006287630)
>   (gdb) xtype
>   Lisp_String
>   (gdb) xstring
>   $4 = (struct Lisp_String *) 0x6287630
>   " *buffer-defaults*"

I still have no luck printing a last_marked item:

(gdb) p last_marked_index
$1 = 278
(gdb) p last_marked[277]
'last_marked' has unknown type; cast it to its declared type

I don't know if it makes sense, but casting last_marked to Lisp_Object
gives me this:

(gdb) p (Lisp_Object)last_marked
$6 = XIL(0x102dc4203)
(gdb) xtype
Lisp_Cons
(gdb) xcons
$7 = (struct Lisp_Cons *) 0x102dc4200
{
  u = {
    s = {
      car = XIL(0x102a3aa15), 
      u = {
        cdr = XIL(0x102dc4213), 
        chain = 0x102dc4213
      }
    }, 
    gcaligned = 0x15
  }
}

But I found the commit after which the error occurs:
b2949d39261e82c33572ba8a250298ef0b165b95

Commenting out that 'ok = false;' line makes Emacs work without errors.

Justin, can you please check whether Emacs prior to that commit works fine
for you?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Thu, 02 Jan 2020 14:07:02 GMT) Full text and rfc822 format available.

Message #29 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Andrii Kolomoiets <andreyk.mad <at> gmail.com>
Cc: alan <at> idiocy.org, jguenther <at> gmail.com, 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Thu, 02 Jan 2020 16:06:23 +0200

> From: Andrii Kolomoiets <andreyk.mad <at> gmail.com>
> Cc: alan <at> idiocy.org,  38748 <at> debbugs.gnu.org,  jguenther <at> gmail.com
> Date: Wed, 01 Jan 2020 22:42:19 +0200
> 
> >   (gdb) p last_marked_index
> >   $2 = 1
> >   (gdb) p last_marked[0]
> >   $3 = XIL(0x8000000006287630)
> >   (gdb) xtype
> >   Lisp_String
> >   (gdb) xstring
> >   $4 = (struct Lisp_String *) 0x6287630
> >   " *buffer-defaults*"
> 
> I still have no luck printing a last_marked item:
> 
> (gdb) p last_marked_index
> $1 = 278
> (gdb) p last_marked[277]
> 'last_marked' has unknown type; cast it to its declared type

This looks like some compiler bug, or maybe a bug in GDB on your
platform?  Because the source clearly says

   Lisp_Object last_marked[LAST_MARKED_SIZE] EXTERNALLY_VISIBLE;

so the type should be known to GDB.  But this is just an aside.

> But I found the commit after which the error occurs:
> b2949d39261e82c33572ba8a250298ef0b165b95
>
> Commenting out that 'ok = false;' line makes Emacs work without errors.

I cannot explain how that change could cause any harm.  Here's the
relevant code fragment:

      if (CONSP (parent_face))
	{
	  Lisp_Object tail;
	  ok = false;
	  for (tail = parent_face; !NILP (tail); tail = XCDR (tail))
	    {
	      ok = get_lface_attributes (w, f, XCAR (tail), inherited_attrs,
					 false, named_merge_points);
	      if (!ok)
		break;
	      attr_val = face_inherited_attr (w, f, inherited_attrs, attr_idx,
					      named_merge_points);
	      if (!UNSPECIFIEDP (attr_val))
		break;
	    }
	  if (!ok)	/* bad face? */
	    break;  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<
	}
      else
	{
	  ok = get_lface_attributes (w, f, parent_face, inherited_attrs,
				     false, named_merge_points);
	  if (!ok)
	    break;
	  attr_val = inherited_attrs[attr_idx];
	}

Since parent_face is a cons cell, then we enter the for-loop (since a
cons cell cannot be nil), and then we immediately call
get_lface_attributes whose return value overwrites the initial value
of 'ok'.

So how could the initial value of 'ok' matter here?  What am I
missing?
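
The reasoning above can be checked on a small model.  The following C
sketch is a stand-in under stated assumptions, not the Emacs code:
struct node and walk_parents are hypothetical names, and the 'valid'
flag plays the role of the value returned by get_lface_attributes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a Lisp list of parent faces.  */
struct node
{
  bool valid;                /* Models get_lface_attributes succeeding.  */
  const struct node *next;
};

/* Walk LIST the way the quoted loop does, starting with OK set to
   INITIAL.  The body assigns OK before anything reads it, so INITIAL
   cannot affect the result for a non-empty list.  */
static bool
walk_parents (const struct node *list, bool initial)
{
  assert (list != NULL);     /* Mirrors the CONSP check guarding the loop.  */
  bool ok = initial;
  for (const struct node *tail = list; tail != NULL; tail = tail->next)
    {
      ok = tail->valid;
      if (!ok)
        break;
    }
  return ok;
}
```

For any non-empty list, walk_parents (list, true) and
walk_parents (list, false) return the same value in this model, which
is why the initial assignment looks harmless at the source level.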

Can you run the unmodified code with a breakpoint on the line
indicated by "<<<<<" above, and see if the breakpoint ever breaks?  If
it does break, can you show the face being merged in this case?

Also, if you build Emacs with exactly the same configure options, but
without optimizations, does the problem persist?

Thanks.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Sat, 04 Jan 2020 16:49:02 GMT) Full text and rfc822 format available.

Message #32 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Pieter van Oostrum <pieter-l <at> vanoostrum.org>
To: Andrii Kolomoiets <andreyk.mad <at> gmail.com>
Cc: Eli Zaretskii <eliz <at> gnu.org>, alan <at> idiocy.org, jguenther <at> gmail.com,
 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Sat, 04 Jan 2020 17:48:04 +0100

[Message part 1 (text/plain, inline)]

Andrii Kolomoiets <andreyk.mad <at> gmail.com> writes:

> But I found the commit after which the error occurs:
> b2949d39261e82c33572ba8a250298ef0b165b95
>
> Commenting out that 'ok = false;' line makes Emacs work without errors.
>
> Justin, can you please check whether Emacs prior to that commit works fine
> for you?

I had Emacs built from master a few days ago, and got the same crashes, about twice a day, often when Emacs was idle.
So I decided to compile from the parent of the commit mentioned above, which is 73f37da12d.

However, this one also crashed, albeit with a different crash. See the attachment.

[Emacs_2020-01-04-165858_Cochabamba.crash (text/plain, attachment)]

[Message part 3 (text/plain, inline)]

-- 
Pieter van Oostrum
www: http://pieter.vanoostrum.org/
PGP key: [8DAE142BE17999C4]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Sat, 04 Jan 2020 17:26:01 GMT) Full text and rfc822 format available.

Message #35 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Alan Third <alan <at> idiocy.org>
To: Pieter van Oostrum <pieter-l <at> vanoostrum.org>
Cc: Eli Zaretskii <eliz <at> gnu.org>, jguenther <at> gmail.com,
 Andrii Kolomoiets <andreyk.mad <at> gmail.com>, 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Sat, 4 Jan 2020 17:25:09 +0000

On Sat, Jan 04, 2020 at 05:48:04PM +0100, Pieter van Oostrum wrote:
> Andrii Kolomoiets <andreyk.mad <at> gmail.com> writes:
> 
> > But I found the commit after which the error occurs:
> > b2949d39261e82c33572ba8a250298ef0b165b95
> >
> > Commenting out that 'ok = false;' line makes Emacs work without errors.
> >
> > Justin, can you please check whether Emacs prior to that commit works fine
> > for you?
> 
> I had Emacs built from master a few days ago, and got the same crashes, about twice a day, often when Emacs was idle.
> So I decided to compile from the parent of the commit mentioned above, which is 73f37da12d.
> 
> However, this one also crashed, albeit with a different crash. See the attachment.
> 
> 8   org.gnu.Emacs                 	0x00000001011cdb58 handle_fatal_signal + 24
> 9   org.gnu.Emacs                 	0x00000001011cdbf2 deliver_thread_signal + 146
> 10  org.gnu.Emacs                 	0x00000001011cb3da deliver_fatal_thread_signal + 26
> 11  org.gnu.Emacs                 	0x00000001011cdc96 handle_sigsegv + 134
> 12  libsystem_platform.dylib      	0x00007fff756adf5a _sigtramp + 26
> 13  ???                           	000000000000000000 0 + 0
> 14  org.gnu.Emacs                 	0x0000000101053bab Fmouse_pixel_position + 187

Hmm, I made a change to the NS mouse position code recently

fbf9fea4fdad467429058077b8087dbd0758b964

Perhaps that’s related somehow.

-- 
Alan Third

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Sun, 05 Jan 2020 19:42:02 GMT) Full text and rfc822 format available.

Message #38 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Pieter van Oostrum <pieter-l <at> vanoostrum.org>
To: Alan Third <alan <at> idiocy.org>
Cc: jguenther <at> gmail.com, Andrii Kolomoiets <andreyk.mad <at> gmail.com>,
 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Sun, 05 Jan 2020 20:41:49 +0100

Alan Third <alan <at> idiocy.org> writes:

> On Sat, Jan 04, 2020 at 05:48:04PM +0100, Pieter van Oostrum wrote:
>> Andrii Kolomoiets <andreyk.mad <at> gmail.com> writes:
>> 
>> > But I found the commit after which the error occurs:
>> > b2949d39261e82c33572ba8a250298ef0b165b95
>> >
>> > Commenting out that 'ok = false;' line makes Emacs work without errors.
>> >
>> > Justin, can you please check whether Emacs prior to that commit works fine
>> > for you?
>> 
>> I had Emacs built from master a few days ago, and got the same
>> crashes, about twice a day, often when Emacs was idle.
>> So I decided to compile from the parent of the commit mentioned above, which is 73f37da12d.
>> 
>> However, this one also crashed, albeit with a different crash. See the attachment.
>> 
>> 8   org.gnu.Emacs                 	0x00000001011cdb58 handle_fatal_signal + 24
>> 9   org.gnu.Emacs                 	0x00000001011cdbf2 deliver_thread_signal + 146
>> 10  org.gnu.Emacs                 	0x00000001011cb3da deliver_fatal_thread_signal + 26
>> 11  org.gnu.Emacs                 	0x00000001011cdc96 handle_sigsegv + 134
>> 12  libsystem_platform.dylib      	0x00007fff756adf5a _sigtramp + 26
>> 13  ???                           	000000000000000000 0 + 0
>> 14  org.gnu.Emacs                 	0x0000000101053bab Fmouse_pixel_position + 187
>
> Hmm, I made a change to the NS mouse position code recently
>
> fbf9fea4fdad467429058077b8087dbd0758b964
>
> Perhaps that’s related somehow.

No. I compiled the version before that (9042ece787cf93665776ffb69893fcb1357aacbe) and it crashed with exactly the same crash. So, no, it must have been introduced before that.
On the other hand, before this I had been working with a version from Dec 1, 2019 (I think 9f2145f42daab13aed5cf89fdb6a7c5579819ec0), and I used that for quite some time without crashes, whereas the other versions crashed 1-2 times a day.
-- 
Pieter van Oostrum
www: http://pieter.vanoostrum.org/
PGP key: [8DAE142BE17999C4]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Wed, 08 Jan 2020 17:40:01 GMT) Full text and rfc822 format available.

Message #41 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Robert Pluim <rpluim <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: alan <at> idiocy.org, jguenther <at> gmail.com,
 Andrii Kolomoiets <andreyk.mad <at> gmail.com>, 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Wed, 08 Jan 2020 18:39:42 +0100

>>>>> On Thu, 02 Jan 2020 16:06:23 +0200, Eli Zaretskii <eliz <at> gnu.org> said:

Iʼm now seeing this as well on both master and emacs-27

    Eli> This looks like some compiler bug, or maybe a bug in GDB on your
    Eli> platform?  Because the source clearly says

    Eli>    Lisp_Object last_marked[LAST_MARKED_SIZE] EXTERNALLY_VISIBLE;

    Eli> so the type should be known to GDB.  But this is just an aside.

    >> But I found the commit after which the error occurs:
    >> b2949d39261e82c33572ba8a250298ef0b165b95
    >>
    >> Commenting out that 'ok = false;' line makes Emacs work without errors.

I can confirm this.

    Eli> I cannot explain how that change could cause any harm.  Here's the
    Eli> relevant code fragment:

    Eli>       if (CONSP (parent_face))
    Eli> 	{
    Eli> 	  Lisp_Object tail;
    Eli> 	  ok = false;
    Eli> 	  for (tail = parent_face; !NILP (tail); tail = XCDR (tail))
    Eli> 	    {
    Eli> 	      ok = get_lface_attributes (w, f, XCAR (tail), inherited_attrs,
    Eli> 					 false, named_merge_points);
    Eli> 	      if (!ok)
    Eli> 		break;
    Eli> 	      attr_val = face_inherited_attr (w, f, inherited_attrs, attr_idx,
    Eli> 					      named_merge_points);
    Eli> 	      if (!UNSPECIFIEDP (attr_val))
    Eli> 		break;
    Eli> 	    }
    Eli> 	  if (!ok)	/* bad face? */
    Eli> 	    break;  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    Eli> 	}
    Eli>       else
    Eli> 	{
    Eli> 	  ok = get_lface_attributes (w, f, parent_face, inherited_attrs,
    Eli> 				     false, named_merge_points);
    Eli> 	  if (!ok)
    Eli> 	    break;
    Eli> 	  attr_val = inherited_attrs[attr_idx];
    Eli> 	}

    Eli> Since parent_face is a cons cell, then we enter the for-loop (since a
    Eli> cons cell cannot be nil), and then we immediately call
    Eli> get_lface_attributes whose return value overwrites the initial value
    Eli> of 'ok'.

    Eli> So how could the initial value of 'ok' matter here?  What am I
    Eli> missing?

    Eli> Can you run the unmodified code with a breakpoint on the line
    Eli> indicated by "<<<<<" above, and see if the breakpoint ever breaks?  If
    Eli> it does break, can you show the face being merged in this case?

It never breaks there for me.

    Eli> Also, if you build Emacs with exactly the same configure options, but
    Eli> without optimizations, does the problem persist?

Yes. Iʼll note that when this happens there are over 9000 stackframes,
so perhaps itʼs stack exhaustion. macOS has a default stack of 8192
kB, Iʼll see if increasing it helps.
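
For reference, the soft stack limit a process actually runs with can
be queried with the POSIX getrlimit call.  A minimal sketch follows;
the helper name stack_soft_limit is hypothetical, and the value
reported depends on the shell and system settings.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>

/* Store the soft stack limit, in bytes, in *BYTES.  Return 0 on
   success and -1 on failure.  The stored value may be RLIM_INFINITY
   when no limit is set.  */
static int
stack_soft_limit (rlim_t *bytes)
{
  struct rlimit lim;

  assert (bytes != NULL);
  if (getrlimit (RLIMIT_STACK, &lim) != 0)
    return -1;
  *bytes = lim.rlim_cur;
  return 0;
}
```

If the limit is the 8192 kB mentioned above, the value stored would be
8388608.  Comparing it before and after raising the limit is one way
to confirm that the change took effect for the Emacs process.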

Iʼm running under lldb as well, perhaps that will work better with
'last_marked'.

Robert

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Wed, 08 Jan 2020 19:19:01 GMT) Full text and rfc822 format available.

Message #44 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Pip Cet <pipcet <at> gmail.com>
To: Robert Pluim <rpluim <at> gmail.com>
Cc: Eli Zaretskii <eliz <at> gnu.org>, Andrii Kolomoiets <andreyk.mad <at> gmail.com>,
 alan <at> idiocy.org, jguenther <at> gmail.com, 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Wed, 8 Jan 2020 19:18:15 +0000

On Wed, Jan 8, 2020 at 5:40 PM Robert Pluim <rpluim <at> gmail.com> wrote:
>     >> But I found the commit after which the error occurs:
>     >> b2949d39261e82c33572ba8a250298ef0b165b95
>     >>
>     >> Commenting out that 'ok = false;' line makes Emacs work without errors.
>
> I can confirm this.

I think we should disassemble the two versions and see where the
differences are, unless this is too difficult because of inlining. Can
you provide compiler details?

>     Eli> I cannot explain how that change could cause any harm.  Here's the
>     Eli> relevant code fragment:

>     Eli> So how could the initial value of 'ok' matter here?  What am I
>     Eli> missing?

I think it's likely to be the stack thing; the ok = false might make
the difference between allocating inherited_attrs on the stack once
and doing so once per recursion of face_inherited_attr. The latter
case might lead to a stack overflow more easily.

> Yes. Iʼll note that when this happens there are over 9000 stackframes,
> so perhaps itʼs stack exhaustion. macOS has a default stack of 8192
> kB, Iʼll see if increasing it helps.

That does sound like infinite recursion, or infinite recursion waiting
for something to change asynchronously that breaks the loop. If the
"ok = false" prevents the compiler from recognizing
face_inherited_attr is effectively tail-recursive, that might be it?

Changing the line to "ok = true" would be an interesting experiment.
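
To illustrate the distinction being drawn here, the sketch below
contrasts a function written in tail form with one that is not.  The
names sum_tail, sum_plain and check_forms_agree are hypothetical and
unrelated to face_inherited_attr.  Whether a compiler actually turns
the tail call into a loop depends on the compiler and optimization
level, and the C standard never guarantees it.

```c
#include <assert.h>

/* Tail form: the recursive call is the last action, so an optimizing
   compiler is allowed (not required) to reuse the current frame.  */
static unsigned long
sum_tail (unsigned n, unsigned long acc)
{
  if (n == 0)
    return acc;
  return sum_tail (n - 1, acc + n);
}

/* Non-tail form as written: the addition happens after the call
   returns, so each level keeps its frame unless the compiler
   rewrites the function.  */
static unsigned long
sum_plain (unsigned n)
{
  if (n == 0)
    return 0;
  return n + sum_plain (n - 1);
}

/* Both forms compute the same sum; only the call shape differs.  */
static void
check_forms_agree (unsigned n)
{
  assert (sum_tail (n, 0) == sum_plain (n));
}
```

Both functions return the same result; the difference is only in how
much stack a deep recursion may consume, which is the effect the
hypothesis above depends on.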

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Wed, 08 Jan 2020 19:59:01 GMT) Full text and rfc822 format available.

Message #47 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Pip Cet <pipcet <at> gmail.com>
Cc: rpluim <at> gmail.com, andreyk.mad <at> gmail.com, alan <at> idiocy.org,
 jguenther <at> gmail.com, 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Wed, 08 Jan 2020 21:58:42 +0200

> From: Pip Cet <pipcet <at> gmail.com>
> Date: Wed, 8 Jan 2020 19:18:15 +0000
> Cc: Eli Zaretskii <eliz <at> gnu.org>, alan <at> idiocy.org, jguenther <at> gmail.com, 
> 	Andrii Kolomoiets <andreyk.mad <at> gmail.com>, 38748 <at> debbugs.gnu.org
> 
> > Yes. Iʼll note that when this happens there are over 9000 stackframes,
> > so perhaps itʼs stack exhaustion. macOS has a default stack of 8192
> > kB, Iʼll see if increasing it helps.
> 
> That does sound like infinite recursion, or infinite recursion waiting
> for something to change asynchronously that breaks the loop.

No, GC is known to take many thousands of recursive calls to
mark_object.  9000 is not a particularly high number, and doesn't
necessarily signal infinite recursion.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Wed, 08 Jan 2020 20:41:02 GMT) Full text and rfc822 format available.

Message #50 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Pip Cet <pipcet <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: rpluim <at> gmail.com, andreyk.mad <at> gmail.com, alan <at> idiocy.org,
 jguenther <at> gmail.com, 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Wed, 8 Jan 2020 20:39:43 +0000

On Wed, Jan 8, 2020 at 7:58 PM Eli Zaretskii <eliz <at> gnu.org> wrote:
> > > Yes. Iʼll note that when this happens there are over 9000 stackframes,
> > > so perhaps itʼs stack exhaustion. macOS has a default stack of 8192
> > > kB, Iʼll see if increasing it helps.
> > That does sound like infinite recursion, or infinite recursion waiting
> > for something to change asynchronously that breaks the loop.
> No, GC is known to take many thousands of recursive calls to
> mark_object.  9000 is not a particularly high number, and doesn't
> necessarily signal infinite recursion.

In general, you're absolutely correct. But in this case, it still
sounds very likely: infinite recursion of a properly tail-recursive
function would loop rather than cause a stack overflow, which would
explain everything, except for why it's not actually an infinite loop;
I suspect the macOS code somewhere does modify things asynchronously.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Wed, 08 Jan 2020 21:44:02 GMT) Full text and rfc822 format available.

Message #53 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Robert Pluim <rpluim <at> gmail.com>
To: Pip Cet <pipcet <at> gmail.com>
Cc: alan <at> idiocy.org, jguenther <at> gmail.com,
 Andrii Kolomoiets <andreyk.mad <at> gmail.com>, 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Wed, 08 Jan 2020 22:43:30 +0100

[Message part 1 (text/plain, inline)]

>>>>> On Wed, 8 Jan 2020 19:18:15 +0000, Pip Cet <pipcet <at> gmail.com> said:

    Pip> On Wed, Jan 8, 2020 at 5:40 PM Robert Pluim <rpluim <at> gmail.com> wrote:
    >> >> But I found the commit after which the error occurs:
    >> >> b2949d39261e82c33572ba8a250298ef0b165b95
    >> >>
    >> >> Commenting out that 'ok = false;' line makes Emacs work without errors.
    >> 
    >> I can confirm this.

    Pip> I think we should disassemble the two versions and see where the
    Pip> differences are, unless this is too difficult because of inlining. Can
    Pip> you provide compiler details?

gcc --version
Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 10.0.1 (clang-1001.0.46.4)
Target: x86_64-apple-darwin18.7.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

Iʼve attached the disassembly of the two versions. They're very very
similar (this is with -g3 -O0).

    Eli> I cannot explain how that change could cause any harm.  Here's the
    Eli> relevant code fragment:

    Eli> So how could the initial value of 'ok' matter here?  What am I
    Eli> missing?

    Pip> I think it's likely to be the stack thing; the ok = false might make
    Pip> the difference between allocating inherited_attrs on the stack once
    Pip> and doing so once per recursion of face_inherited_attr. The latter
    Pip> case might lead to a stack overflow more easily.

The allocation of inherited_attrs is the same in both.

    >> Yes. Iʼll note that when this happens there are over 9000 stackframes,
    >> so perhaps itʼs stack exhaustion. macOS has a default stack of 8192
    >> kB, Iʼll see if increasing it helps.

    Pip> That does sound like infinite recursion, or infinite recursion waiting
    Pip> for something to change asynchronously that breaks the loop. If the
    Pip> "ok = false" prevents the compiler from recognizing
    Pip> face_inherited_attr is effectively tail-recursive, that might be it?

    Pip> Changing the line to "ok = true" would be an interesting experiment.

Hmm, yes. Iʼll try that.

BTW, running under lldb, last_marked can be accessed successfully, but
of course under lldb you donʼt get all the nice commands from
.gdbinit. Iʼd build a newer version of gdb, but signing binaries on
macOS is a real hassle.

Robert

[modified.txt (text/plain, attachment)]

[unmodified.txt (text/plain, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Wed, 08 Jan 2020 22:19:01 GMT) Full text and rfc822 format available.

Message #56 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Pip Cet <pipcet <at> gmail.com>
To: Robert Pluim <rpluim <at> gmail.com>
Cc: alan <at> idiocy.org, jguenther <at> gmail.com,
 Andrii Kolomoiets <andreyk.mad <at> gmail.com>, 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Wed, 8 Jan 2020 22:18:11 +0000

On Wed, Jan 8, 2020 at 9:43 PM Robert Pluim <rpluim <at> gmail.com> wrote:
> gcc --version
> Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
> Apple LLVM version 10.0.1 (clang-1001.0.46.4)
> Target: x86_64-apple-darwin18.7.0
> Thread model: posix
> InstalledDir: /Library/Developer/CommandLineTools/usr/bin
>
> Iʼve attached the disassembly of the two versions. They're very very
> similar (this is with -g3 -O0).

But wait, doesn't the bug happen in both unoptimized versions? I
should have been clearer: my suspicion is the bug only goes away if
tail calls are optimized, which happens only with optimizations
enabled.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Wed, 08 Jan 2020 22:24:01 GMT) Full text and rfc822 format available.

Message #59 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Robert Pluim <rpluim <at> gmail.com>
To: Pip Cet <pipcet <at> gmail.com>
Cc: alan <at> idiocy.org, jguenther <at> gmail.com,
 Andrii Kolomoiets <andreyk.mad <at> gmail.com>, 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Wed, 08 Jan 2020 23:23:48 +0100

>>>>> On Wed, 8 Jan 2020 22:18:11 +0000, Pip Cet <pipcet <at> gmail.com> said:

    Pip> On Wed, Jan 8, 2020 at 9:43 PM Robert Pluim <rpluim <at> gmail.com> wrote:
    >> gcc --version
    >> Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
    >> Apple LLVM version 10.0.1 (clang-1001.0.46.4)
    >> Target: x86_64-apple-darwin18.7.0
    >> Thread model: posix
    >> InstalledDir: /Library/Developer/CommandLineTools/usr/bin
    >> 
    >> Iʼve attached the disassembly of the two versions. They're very very
    >> similar (this is with -g3 -O0).

    Pip> But wait, doesn't the bug happen in both unoptimized versions? I
    Pip> should have been clearer: my suspicion is the bug only goes away if
    Pip> tail calls are optimized, which happens only with optimizations
    Pip> enabled.

No, it only happens with the initialisation of 'ok', optimised or not.

As another data point, Iʼm writing this from an emacs with 'ok =
true', which has not crashed yet....

Robert

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Thu, 09 Jan 2020 03:31:02 GMT) Full text and rfc822 format available.

Message #62 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Pip Cet <pipcet <at> gmail.com>
Cc: rpluim <at> gmail.com, andreyk.mad <at> gmail.com, alan <at> idiocy.org,
 jguenther <at> gmail.com, 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Thu, 09 Jan 2020 05:30:58 +0200

> From: Pip Cet <pipcet <at> gmail.com>
> Date: Wed, 8 Jan 2020 20:39:43 +0000
> Cc: rpluim <at> gmail.com, alan <at> idiocy.org, jguenther <at> gmail.com, 
> 	andreyk.mad <at> gmail.com, 38748 <at> debbugs.gnu.org
> 
> > No, GC is known to take many thousands of recursive calls to
> > mark_object.  9000 is not a particularly high number, and doesn't
> > necessarily signal infinite recursion.
> 
> In general, you're absolutely correct. But in this case, it still
> sounds very likely: infinite recursion of a properly tail-recursive
> function would loop rather than cause a stack overflow, which would
> explain everything, except for why it's not actually an infinite loop;
> I suspect the macOS code somewhere does modify things asynchronously.

The backtrace shows a very recursive GC, it doesn't show any other
function being deeply recursive.  So I'm not sure I understand which
tail-recursive function you had in mind.  Can you elaborate?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Thu, 09 Jan 2020 07:52:01 GMT) Full text and rfc822 format available.

Message #65 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Robert Pluim <rpluim <at> gmail.com>
To: Pip Cet <pipcet <at> gmail.com>
Cc: alan <at> idiocy.org, jguenther <at> gmail.com,
 Andrii Kolomoiets <andreyk.mad <at> gmail.com>, 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Thu, 09 Jan 2020 08:51:43 +0100

>>>>> On Wed, 08 Jan 2020 23:23:48 +0100, Robert Pluim <rpluim <at> gmail.com> said:
    Robert> As another data point, Iʼm writing this from an emacs with 'ok =
    Robert> true', which has not crashed yet....

scratch that, it crashed this morning.

Robert

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Thu, 09 Jan 2020 10:09:01 GMT) Full text and rfc822 format available.

Message #68 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: bug-gnu-emacs <at> gnu.org, Robert Pluim <rpluim <at> gmail.com>,
 Pip Cet <pipcet <at> gmail.com>
Cc: alan <at> idiocy.org, jguenther <at> gmail.com,
 Andrii Kolomoiets <andreyk.mad <at> gmail.com>, 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Thu, 09 Jan 2020 12:07:54 +0200

On January 9, 2020 9:51:43 AM GMT+02:00, Robert Pluim <rpluim <at> gmail.com> wrote:
> >>>>> On Wed, 08 Jan 2020 23:23:48 +0100, Robert Pluim <rpluim <at> gmail.com> said:
>     Robert> As another data point, Iʼm writing this from an emacs with 'ok =
>     Robert> true', which has not crashed yet....
> 
> scratch that, it crashed this morning.
> 
> Robert

Thanks for trying.

A stab in the dark: does it help to rename the variable 'ok' in face_inherited_attr to some other name, like 'ok1'?

Also, can I please see one backtrace with all the call-stack frames, starting from 'main' and ending at 'handle_fatal_signal'?  The original report shows only the top-most 511 frames, and the other one has a lot of ?? (missing symbols) in it.

And finally, are all the crashes inside GC, or do some happen outside it?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Thu, 09 Jan 2020 10:09:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Thu, 09 Jan 2020 10:32:01 GMT) Full text and rfc822 format available.

Message #74 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Robert Pluim <rpluim <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: alan <at> idiocy.org, andreyk.mad <at> gmail.com, jguenther <at> gmail.com,
 pipcet <at> gmail.com, 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Thu, 09 Jan 2020 11:31:25 +0100

[Message part 1 (text/plain, inline)]

>>>>> On Thu, 09 Jan 2020 12:07:54 +0200, Eli Zaretskii <eliz <at> gnu.org> said:

    Eli> A stab in the dark: does it help to rename the variable 'ok' in face_inherited_attr to some other name, like 'ok1'?

I can try that.

    Eli> Also, can I please see one backtrace with all the call-stack frames,
    Eli> starting from 'main' and ending at 'handle_fatal_signal'?  The
    Eli> original report shows only the top-most 511 frames, and the other one
    Eli> has a lot of ?? (missing symbols) in it.

'bt full' backtrace attached.

    Eli> And finally, are all the crashes inside GC, or do some happen outside it?

Iʼve only seen it inside GC.

Robert

[full-backtrace.txt.gz (application/octet-stream, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Thu, 09 Jan 2020 13:52:02 GMT) Full text and rfc822 format available.

Message #77 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Andrii Kolomoiets <andreyk.mad <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: jguenther <at> gmail.com, bug-gnu-emacs <at> gnu.org, alan <at> idiocy.org,
 Pip Cet <pipcet <at> gmail.com>, 38748 <at> debbugs.gnu.org,
 Robert Pluim <rpluim <at> gmail.com>
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Thu, 09 Jan 2020 15:51:24 +0200

[Message part 1 (text/plain, inline)]

Eli Zaretskii <eliz <at> gnu.org> writes:

> On January 9, 2020 9:51:43 AM GMT+02:00, Robert Pluim <rpluim <at> gmail.com> wrote:
>> <rpluim <at> gmail.com> said:
>>> As another data point, Iʼm writing this from an emacs with 'ok
>>> = true', which has not crashed yet....
>> 
>> scratch that, it crashed this morning.
>> 
>> Robert
>
> Thanks for trying.
>
> A stab in the dark: does it help to rename the variable 'ok' in
> face_inherited_attr to some other name, like 'ok1'?
>
> Also, can I please see one backtrace with all the call-stack frames,
> starting from 'main' and ending at 'handle_fatal_signal'?  The
> original report shows only the top-most 511 frames, and the other one
> has a lot of ?? (missing symbols) in it.
>
> And finally, are all the crashes inside GC, or do some happen outside
> it?

I assume that gdb is indeed working incorrectly for me, because:
- It can't print last_marked
- It shows a lot of ?? in the call stack
- Emacs does not crash when it is not running under gdb
- Emacs keeps working if I continue execution after gdb reaches the
  terminate_due_to_signal breakpoint

So I tried to use lldb.
Under lldb the crash does not occur on the commit with 'ok = false'.

I also came up with code that reproduces the crash under 'emacs -Q', at
least on my machine.

Here is the '~/emacs-crash.el' content:
(make-frame `((parent-frame . ,(window-frame))))
(make-frame `((parent-frame . ,(window-frame))))
(make-frame `((parent-frame . ,(window-frame))))
(make-frame `((parent-frame . ,(window-frame))))
(delete-frame)
(delete-frame)
(delete-frame)
(delete-frame)
(garbage-collect)

This code starts crashing at commit
bb42f6ef10cb250a9263b17a8794e950a563d5d0

Although I can't use the xTYPE commands under lldb, please see the
attached lldb output. It has all the call-stack frames, starting from 'main'.

[lldb.txt (text/plain, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Thu, 09 Jan 2020 13:52:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Thu, 09 Jan 2020 14:12:02 GMT) Full text and rfc822 format available.

Message #83 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Pip Cet <pipcet <at> gmail.com>
To: Robert Pluim <rpluim <at> gmail.com>
Cc: Eli Zaretskii <eliz <at> gnu.org>, andreyk.mad <at> gmail.com, alan <at> idiocy.org,
 jguenther <at> gmail.com, 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Thu, 9 Jan 2020 14:10:20 +0000

On Thu, Jan 9, 2020 at 10:31 AM Robert Pluim <rpluim <at> gmail.com> wrote:
>     Eli> Also, can I please see one backtrace with all the call-stack frames,
>     Eli> starting from 'main' and ending at 'handle_fatal_signal'?  The
>     Eli> original report shows only the top-most 511 frames, and the other one
>     Eli> has a lot of ?? (missing symbols) in it.
>
> 'bt full' backtrace attached.

At the risk of being wrong again, is it possible we're looking at two
different bugs? This looks like it might be a crash in mark_frame when
a "destroyed" frame's ->output_data.ns area is being accessed.

And, indeed, nsterm.m's ns_free_frame_resources contains:

  xfree (f->output_data.ns);

but not

  f->output_data.ns = NULL;
  f->output_method = output_initial;

or anything like them.
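
The pattern described above can be sketched in isolation. This is a minimal illustration only, not the Emacs code: `struct mock_frame`, `struct mock_output`, `init_frame`, `free_frame_resources` and `mark_frame_font` are hypothetical stand-ins, and the sketch assumes the marking code does nothing more than test the pointer for NULL before dereferencing it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the real frame structures.  */
struct mock_output { int font_id; };
struct mock_frame { struct mock_output *output_data; };

/* Give the frame freshly allocated output data.
   Returns 0 on success, -1 if allocation fails.  */
static int
init_frame (struct mock_frame *f, int font_id)
{
  assert (f != NULL);
  f->output_data = malloc (sizeof *f->output_data);
  if (!f->output_data)
    return -1;
  f->output_data->font_id = font_id;
  return 0;
}

/* Release the output data and clear the pointer, so that later
   readers see "no data" instead of a dangling pointer.  */
static void
free_frame_resources (struct mock_frame *f)
{
  free (f->output_data);
  f->output_data = NULL;
}

/* Mimics a guard of the form "if (output data) use it".
   Returns the font id, or -1 if the frame has no output data.  */
static int
mark_frame_font (const struct mock_frame *f)
{
  if (f->output_data)
    return f->output_data->font_id;
  return -1;
}
```

Without the assignment to NULL in `free_frame_resources`, the guard in `mark_frame_font` would still pass for a destroyed frame and the function would read freed memory, which is the kind of garbled pointer a later GC pass could trip over.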

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Thu, 09 Jan 2020 14:14:01 GMT) Full text and rfc822 format available.

Message #86 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Robert Pluim <rpluim <at> gmail.com>
To: Andrii Kolomoiets <andreyk.mad <at> gmail.com>
Cc: Eli Zaretskii <eliz <at> gnu.org>, 38748 <at> debbugs.gnu.org, jguenther <at> gmail.com,
 pipcet <at> gmail.com, alan <at> idiocy.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Thu, 09 Jan 2020 15:13:22 +0100

>>>>> On Thu, 09 Jan 2020 15:51:24 +0200, Andrii Kolomoiets <andreyk.mad <at> gmail.com> said:

    Andrii> I made an assumption that gdb is indeed working incorrectly for me
    Andrii> because:
    Andrii> - It can't print last_marked
    Andrii> - It shows a lot of ?? in call-stack
    Andrii> - Emacs is not crashing if running not under gdb
    Andrii> - Emacs keep working after continuing execution after gdb reaches
    Andrii>   terminate_due_to_signal breakpoint

Emacs crashes for me with or without gdb (and under lldb).

    Andrii> So I tried to use lldb.
    Andrii> Under lldb the crash is not occured on commit with 'ok = false'.

    Andrii> Also I came up with code to reproduce crash under 'emacs -Q' at least on
    Andrii> my machine.

    Andrii> Here is the '~/emacs-crash.el' content:
    Andrii> (make-frame `((parent-frame . ,(window-frame))))
    Andrii> (make-frame `((parent-frame . ,(window-frame))))
    Andrii> (make-frame `((parent-frame . ,(window-frame))))
    Andrii> (make-frame `((parent-frame . ,(window-frame))))
    Andrii> (delete-frame)
    Andrii> (delete-frame)
    Andrii> (delete-frame)
    Andrii> (delete-frame)
    Andrii> (garbage-collect)

That doesnʼt crash for me with 'emacs -Q', but Iʼm not on 10.15.2 yet;
Iʼm still on 10.14.

    Andrii> This code is start crashing on the commit
    Andrii> bb42f6ef10cb250a9263b17a8794e950a563d5d0

    Andrii> Though I can't use xTYPE commands under lldb please see attached lldb
    Andrii> output. It has all the call-stack frames starting from 'main'.

Thatʼs very different from the call stack I see. Perhaps we have two
bugs?

Robert

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Thu, 09 Jan 2020 14:16:02 GMT) Full text and rfc822 format available.

Message #89 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Robert Pluim <rpluim <at> gmail.com>
Cc: alan <at> idiocy.org, andreyk.mad <at> gmail.com, jguenther <at> gmail.com,
 pipcet <at> gmail.com, 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Thu, 09 Jan 2020 16:16:03 +0200

> From: Robert Pluim <rpluim <at> gmail.com>
> Cc: 38748 <at> debbugs.gnu.org,  pipcet <at> gmail.com,  alan <at> idiocy.org,
>   jguenther <at> gmail.com,  andreyk.mad <at> gmail.com
> Date: Thu, 09 Jan 2020 11:31:25 +0100
> 
>     Eli> Also, can I please see one backtrace with all the call-stack frames,
>     Eli> starting from 'main' and ending at 'handle_fatal_signal'?  The
>     Eli> original report shows only the top-most 511 frames, and the other one
>     Eli> has a lot of ?? (missing symbols) in it.
> 
> 'bt full' backtrace attached.

Thanks.

> Thread 2 received signal SIGSEGV, Segmentation fault.
> 0x0000000100221f88 in vector_marked_p (v=0x20a000000000) at alloc.c:3726
> 3726	  return XVECTOR_MARKED_P (v);
> (gdb) bt full
> #0  0x0000000100221f88 in vector_marked_p (v=0x20a000000000) at alloc.c:3726
> No locals.
> #1  0x00000001002255e5 in vectorlike_marked_p (header=0x20a000000000)
>     at alloc.c:3744
> No locals.
> #2  0x00000001002221c2 in mark_frame (ptr=0x164cc69a0) at alloc.c:6321
>         font = 0x20a000000000
>         f = 0x164cc69a0

This says that we were marking a frame, and its default font is a
garbled pointer.  Do all of the crashes you see happen because of a
faulty frame font in this snippet:

  static void
  mark_frame (struct Lisp_Vector *ptr)
  {
    struct frame *f = (struct frame *) ptr;
    mark_vectorlike (&ptr->header);
    mark_face_cache (f->face_cache);
  #ifdef HAVE_WINDOW_SYSTEM
    if (FRAME_WINDOW_P (f) && FRAME_OUTPUT_DATA (f))
      {
	struct font *font = FRAME_FONT (f);

	if (font && !vectorlike_marked_p (&font->header))  <<<<<<<<<<<<
	  mark_vectorlike (&font->header);
      }
  #endif
  }

I hope you still have this crashed session in the debugger.  If so,
please tell: do you have many frames in that session, or just a few
(perhaps even one)?  I'd like to see some more details about this
frame, if possible.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Thu, 09 Jan 2020 14:18:02 GMT) Full text and rfc822 format available.

Message #92 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Pip Cet <pipcet <at> gmail.com>
To: Andrii Kolomoiets <andreyk.mad <at> gmail.com>
Cc: jguenther <at> gmail.com, bug-gnu-emacs <at> gnu.org, alan <at> idiocy.org,
 38748 <at> debbugs.gnu.org, Eli Zaretskii <eliz <at> gnu.org>,
 Robert Pluim <rpluim <at> gmail.com>
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Thu, 9 Jan 2020 14:16:54 +0000

[Message part 1 (text/plain, inline)]

On Thu, Jan 9, 2020 at 1:51 PM Andrii Kolomoiets <andreyk.mad <at> gmail.com> wrote:
> Here is the '~/emacs-crash.el' content:
> (make-frame `((parent-frame . ,(window-frame))))
> (make-frame `((parent-frame . ,(window-frame))))
> (make-frame `((parent-frame . ,(window-frame))))
> (make-frame `((parent-frame . ,(window-frame))))
> (delete-frame)
> (delete-frame)
> (delete-frame)
> (delete-frame)
> (garbage-collect)

That sounds like Robert's bug, but not like the one that's related to
the "ok = false" thing.

Can you try the attached patch?

[38748.diff (text/x-patch, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Thu, 09 Jan 2020 14:18:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Thu, 09 Jan 2020 14:30:02 GMT) Full text and rfc822 format available.

Message #98 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Andrii Kolomoiets <andreyk.mad <at> gmail.com>
To: Pip Cet <pipcet <at> gmail.com>
Cc: jguenther <at> gmail.com, bug-gnu-emacs <at> gnu.org, alan <at> idiocy.org,
 38748 <at> debbugs.gnu.org, Eli Zaretskii <eliz <at> gnu.org>,
 Robert Pluim <rpluim <at> gmail.com>
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Thu, 9 Jan 2020 16:29:07 +0200

On 9 Jan 2020, at 16:16, Pip Cet <pipcet <at> gmail.com> wrote:
> 
> On Thu, Jan 9, 2020 at 1:51 PM Andrii Kolomoiets <andreyk.mad <at> gmail.com> wrote:
>> Here is the '~/emacs-crash.el' content:
>> (make-frame `((parent-frame . ,(window-frame))))
>> (make-frame `((parent-frame . ,(window-frame))))
>> (make-frame `((parent-frame . ,(window-frame))))
>> (make-frame `((parent-frame . ,(window-frame))))
>> (delete-frame)
>> (delete-frame)
>> (delete-frame)
>> (delete-frame)
>> (garbage-collect)
> 
> That sounds like Robert's bug, but not like the one that's related to
> the "ok = false" thing.
> 
> Can you try the attached patch?
> <38748.diff>

The patch resolves the latter crash.

Now I am going to build b2949d39261e82c33572ba8a250298ef0b165b95 again and try to catch the former crash.

Thanks!

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Thu, 09 Jan 2020 14:30:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Thu, 09 Jan 2020 14:57:01 GMT) Full text and rfc822 format available.

Message #104 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Robert Pluim <rpluim <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: alan <at> idiocy.org, andreyk.mad <at> gmail.com, jguenther <at> gmail.com,
 pipcet <at> gmail.com, 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Thu, 09 Jan 2020 15:56:01 +0100

>>>>> On Thu, 09 Jan 2020 16:16:03 +0200, Eli Zaretskii <eliz <at> gnu.org> said:

    >> From: Robert Pluim <rpluim <at> gmail.com>
    >> Cc: 38748 <at> debbugs.gnu.org,  pipcet <at> gmail.com,  alan <at> idiocy.org,
    >> jguenther <at> gmail.com,  andreyk.mad <at> gmail.com
    >> Date: Thu, 09 Jan 2020 11:31:25 +0100
    >> 
    Eli> Also, can I please see one backtrace with all the call-stack frames,
    Eli> starting from 'main' and ending at 'handle_fatal_signal'?  The
    Eli> original report shows only the top-most 511 frames, and the other one
    Eli> has a lot of ?? (missing symbols) in it.
    >> 
    >> 'bt full' backtrace attached.

    Eli> Thanks.

    >> Thread 2 received signal SIGSEGV, Segmentation fault.
    >> 0x0000000100221f88 in vector_marked_p (v=0x20a000000000) at alloc.c:3726
    >> 3726	  return XVECTOR_MARKED_P (v);
    >> (gdb) bt full
    >> #0  0x0000000100221f88 in vector_marked_p (v=0x20a000000000) at alloc.c:3726
    >> No locals.
    >> #1  0x00000001002255e5 in vectorlike_marked_p (header=0x20a000000000)
    >> at alloc.c:3744
    >> No locals.
    >> #2  0x00000001002221c2 in mark_frame (ptr=0x164cc69a0) at alloc.c:6321
    >> font = 0x20a000000000
    >> f = 0x164cc69a0

    Eli> This says that we were marking a frame, and its default font is a
    Eli> garbled pointer.  Are all of the crashes you see happen because of a
    Eli> faulty frame font in this snippet:

    Eli>   static void
    Eli>   mark_frame (struct Lisp_Vector *ptr)
    Eli>   {
    Eli>     struct frame *f = (struct frame *) ptr;
    Eli>     mark_vectorlike (&ptr->header);
    Eli>     mark_face_cache (f->face_cache);
    Eli>   #ifdef HAVE_WINDOW_SYSTEM
    Eli>     if (FRAME_WINDOW_P (f) && FRAME_OUTPUT_DATA (f))
    Eli>       {
    Eli> 	struct font *font = FRAME_FONT (f);

    Eli> 	if (font && !vectorlike_marked_p (&font->header))  <<<<<<<<<<<<
    Eli> 	  mark_vectorlike (&font->header);
    Eli>       }
    Eli>   #endif
    Eli>   }

    Eli> I hope you still have this crashed session in the debugger.  If so,
    Eli> please tell: do you have many frames in that session, or just a few
    Eli> (perhaps even one)?  I'd like to see some more details about this
    Eli> frame, if possible.

I donʼt have it right now, but itʼs easy enough to recreate the crash
(and yes, I tend to have half a dozen frames open). What details would
you like?

Robert

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Thu, 09 Jan 2020 15:16:01 GMT) Full text and rfc822 format available.

Message #107 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Robert Pluim <rpluim <at> gmail.com>
To: Pip Cet <pipcet <at> gmail.com>
Cc: jguenther <at> gmail.com, Andrii Kolomoiets <andreyk.mad <at> gmail.com>,
 bug-gnu-emacs <at> gnu.org, alan <at> idiocy.org, 38748 <at> debbugs.gnu.org,
 Eli Zaretskii <eliz <at> gnu.org>
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Thu, 09 Jan 2020 16:15:31 +0100

>>>>> On Thu, 9 Jan 2020 14:16:54 +0000, Pip Cet <pipcet <at> gmail.com> said:

    Pip> On Thu, Jan 9, 2020 at 1:51 PM Andrii Kolomoiets <andreyk.mad <at> gmail.com> wrote:
    >> Here is the '~/emacs-crash.el' content:
    >> (make-frame `((parent-frame . ,(window-frame))))
    >> (make-frame `((parent-frame . ,(window-frame))))
    >> (make-frame `((parent-frame . ,(window-frame))))
    >> (make-frame `((parent-frame . ,(window-frame))))
    >> (delete-frame)
    >> (delete-frame)
    >> (delete-frame)
    >> (delete-frame)
    >> (garbage-collect)

    Pip> That sounds like Robert's bug, but not like the one that's related to
    Pip> the "ok = false" thing.

    Pip> Can you try the attached patch?

    Pip> diff --git a/src/nsterm.m b/src/nsterm.m
    Pip> index 03754e5ae5..c1d1d41117 100644
    Pip> --- a/src/nsterm.m
    Pip> +++ b/src/nsterm.m
    Pip> @@ -1644,6 +1644,7 @@ Hide the window (X11 semantics)
    Pip>    [view release];
 
    Pip>    xfree (f->output_data.ns);
    Pip> +  f->output_data.ns = NULL;
 
    Pip>    unblock_input ();
    Pip>  }

That has fixed things for me: I have not been able to crash it with
Andrii's recipe (I had to increase the number of frames to get it to
crash before the patch).

Robert

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Thu, 09 Jan 2020 15:16:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Thu, 09 Jan 2020 17:08:01 GMT) Full text and rfc822 format available.

Message #113 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Robert Pluim <rpluim <at> gmail.com>
Cc: alan <at> idiocy.org, andreyk.mad <at> gmail.com, jguenther <at> gmail.com,
 pipcet <at> gmail.com, 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Thu, 09 Jan 2020 19:06:49 +0200

> From: Robert Pluim <rpluim <at> gmail.com>
> Cc: pipcet <at> gmail.com,  alan <at> idiocy.org,  jguenther <at> gmail.com,
>   andreyk.mad <at> gmail.com,  38748 <at> debbugs.gnu.org
> Date: Thu, 09 Jan 2020 15:56:01 +0100
> 
>     Eli> I hope you still have this crashed session in the debugger.  If so,
>     Eli> please tell: do you have many frames in that session, or just a few
>     Eli> (perhaps even one)?  I'd like to see some more details about this
>     Eli> frame, if possible.
> 
> I donʼt have it right now, but itʼs easy enough to recreate the crash
> (and yes, I tend to have half a dozen frames open). What details would
> you like?

The windows on that frame and buffers they display, and the frame
parameters.  Also, the faces.

Please keep in mind that GDB commands that invoke Emacs functions,
such as 'pp', are likely to crash during GC, so you will have to use
the x* commands instead.  For example, to show the members of a list,
you will have to use 'xcar', 'xcdr', and 'xcons'.  It's tedious, but
there's no other way of displaying Lisp objects during GC without
risking crashing the session.

TIA

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Fri, 10 Jan 2020 07:33:02 GMT) Full text and rfc822 format available.

Message #116 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Pip Cet <pipcet <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: rpluim <at> gmail.com, andreyk.mad <at> gmail.com, alan <at> idiocy.org,
 jguenther <at> gmail.com, 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Fri, 10 Jan 2020 07:32:07 +0000

On Thu, Jan 9, 2020 at 3:30 AM Eli Zaretskii <eliz <at> gnu.org> wrote:
> > > No, GC is known to take many thousands of recursive calls to
> > > mark_object.  9000 is not a particularly high number, and doesn't
> > > necessarily signal infinite recursion.
> >
> > In general, you're absolutely correct. But in this case, it still
> > sounds very likely: infinite recursion of a properly tail-recursive
> > function would loop rather than cause a stack overflow, which would
> > explain everything, except for why it's not actually an infinite loop;
> > I suspect the macOS code somewhere does modify things asynchronously.
>
> The backtrace shows a very recursive GC, it doesn't show any other
> function being deeply recursive.  So I'm not sure I understand what
> tail-recursive function did you have in mind.  Can you elaborate?

I can. I think we're looking at two bugs: the first is the simple
use-after-free of XFRAME (frame)->output_data.ns where `frame' is a
dead frame. I've confirmed on GNU/Linux that mark_frame is called for
a frame for which x_free_frame_resources has already been called, if
there's a global variable still referencing the frame. I think the
same thing happens on macOS.

The second one is very tricky, and a hypothesis at best:

1. I think face_inherited_attr is being optimized to tail-call itself
rather than calling itself in a new stack frame; thus, it loops
indefinitely for a faulty face setup which would otherwise lead to an
immediate crash.
1b. that optimization only works without the harmless initialization of "ok".

2. Our initial face setup is faulty in the sense above.

3. Something happens on a secondary thread which causes our face setup
to become non-faulty, possibly during GC.

That would explain the observed behavior, I think, including such
oddities as the bug happening more frequently when running in gdb
(which delays thread creation).
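
To make step 1 of the hypothesis concrete, here is a small sketch of a tail-recursive walk next to the loop it is equivalent to once a compiler applies tail-call optimization. Nothing here is taken from face_inherited_attr: `struct node`, `walk_rec` and `walk_loop` are hypothetical, and the `limit` parameter exists only so the example terminates on a cyclic chain.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical face-like node: each one may inherit from another.  */
struct node { const struct node *inherit; };

/* Tail-recursive walk: the recursive call is the last action, so an
   optimizing compiler may reuse the current stack frame for it.
   Returns the number of nodes visited, stopping at LIMIT.  */
static long
walk_rec (const struct node *n, long steps, long limit)
{
  assert (limit >= 0);
  if (n == NULL || steps == limit)
    return steps;
  return walk_rec (n->inherit, steps + 1, limit);
}

/* The loop the function above is equivalent to after tail-call
   optimization; it uses constant stack space.  */
static long
walk_loop (const struct node *n, long steps, long limit)
{
  assert (limit >= 0);
  while (n != NULL && steps != limit)
    {
      n = n->inherit;
      steps++;
    }
  return steps;
}
```

Without the limit, a cyclic chain would make `walk_loop` spin forever, while an unoptimized `walk_rec` would keep pushing stack frames until the stack is exhausted. That difference in failure mode is what the hypothesis relies on.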

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Fri, 10 Jan 2020 08:28:01 GMT) Full text and rfc822 format available.

Message #119 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Pip Cet <pipcet <at> gmail.com>
Cc: rpluim <at> gmail.com, andreyk.mad <at> gmail.com, alan <at> idiocy.org,
 jguenther <at> gmail.com, 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Fri, 10 Jan 2020 10:27:45 +0200

> From: Pip Cet <pipcet <at> gmail.com>
> Date: Fri, 10 Jan 2020 07:32:07 +0000
> Cc: rpluim <at> gmail.com, alan <at> idiocy.org, jguenther <at> gmail.com, 
> 	andreyk.mad <at> gmail.com, 38748 <at> debbugs.gnu.org
> 
> > The backtrace shows a very recursive GC, it doesn't show any other
> > function being deeply recursive.  So I'm not sure I understand what
> > tail-recursive function did you have in mind.  Can you elaborate?
> 
> I can. I think we're looking at two bugs: the first is the simple
> use-after-free of XFRAME (frame)->output_data.ns where `frame' is a
> dead frame. I've confirmed on GNU/Linux that mark_frame is called for
> a frame for which x_free_frame_resources has already been called, if
> there's a global variable still referencing the frame. I think the
> same thing happens on macOS.

This one doesn't depend on the 'ok's initialization in
face_inherited_attr in any way, does it?

> 1. I think face_inherited_attr is being optimized to tail-call itself
> rather than calling itself in a new stack frame; thus, it loops
> indefinitely for a faulty face setup which would otherwise lead to an
> immediate crash.
> 1b. that optimization only works without the harmless initialization of "ok".
> 
> 2. Our initial face setup is faulty in the sense above.
> 
> 3. Something happens on a secondary thread which causes our face setup
> to become non-faulty, possibly during GC.

What do you mean by "secondary thread"?  And how can GC modify Lisp
data structures? that'd be a terrible bug.

In any case, the full backtrace shows no trace of face_inherited_attr
call anywhere in the callstack, so if there is indeed infinite
recursion in that function, it was somehow exited long ago by the time
GC runs.

As for the tail-recursion part: do you see any sign of that in the
disassembly posted by Robert?  I didn't, but maybe I missed
something.  And such subtleties should only rear their ugly heads in
optimized code, whereas we already know that an unoptimized build
crashes in the same way.

I still think the shortest way to finding the culprit here is to
patiently and painfully go over the last_marked array, deciphering
the Lisp object we marked, until we succeed in identifying the Lisp
data structure which got corrupted.  Once we succeed in identifying
that data structure, it should be relatively easy to find who and
where corrupts it.  This may mean a lot of inconvenient drudgery,
exacerbated by the fact that having a functional GDB on macOS is not
easy, but I don't think we have a better way at this point.
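
To make the suggestion above concrete, here is a minimal sketch of the idea behind a log of recently marked objects: a fixed-size ring buffer that the marker appends to, which can be walked backwards after a crash. It is not the code from alloc.c; the names `mark_log`, `record_marked` and `marked_ago` and the tiny buffer size are hypothetical, and plain pointers stand in for Lisp objects.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the collector's log of recently
   marked objects; a real log would hold Lisp objects, not raw pointers.  */
#define MARK_LOG_SIZE 8

static const void *mark_log[MARK_LOG_SIZE];
static size_t mark_log_next;   /* slot the next record will overwrite */
static size_t mark_log_count;  /* number of valid entries, capped at size */

/* Record OBJ as the most recently marked object.  */
void
record_marked (const void *obj)
{
  mark_log[mark_log_next] = obj;
  mark_log_next = (mark_log_next + 1) % MARK_LOG_SIZE;
  if (mark_log_count < MARK_LOG_SIZE)
    mark_log_count++;
}

/* Return the object marked AGE steps ago (0 = most recent), or NULL if
   the log does not reach back that far.  */
const void *
marked_ago (size_t age)
{
  if (age >= mark_log_count)
    return NULL;
  size_t idx = (mark_log_next + MARK_LOG_SIZE - 1 - age) % MARK_LOG_SIZE;
  return mark_log[idx];
}
```

Walking such a log from age 0 upwards gives the objects in reverse marking order, which is the order one would decipher them in when looking for the first corrupted structure.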

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Fri, 10 Jan 2020 09:00:02 GMT) Full text and rfc822 format available.

Message #122 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Robert Pluim <rpluim <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: alan <at> idiocy.org, jguenther <at> gmail.com, andreyk.mad <at> gmail.com,
 Pip Cet <pipcet <at> gmail.com>, 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Fri, 10 Jan 2020 09:58:52 +0100

[Message part 1 (text/plain, inline)]

>>>>> On Fri, 10 Jan 2020 10:27:45 +0200, Eli Zaretskii <eliz <at> gnu.org> said:

    >> From: Pip Cet <pipcet <at> gmail.com>
    >> Date: Fri, 10 Jan 2020 07:32:07 +0000
    >> Cc: rpluim <at> gmail.com, alan <at> idiocy.org, jguenther <at> gmail.com, 
    >> andreyk.mad <at> gmail.com, 38748 <at> debbugs.gnu.org
    >> 
    >> > The backtrace shows a very recursive GC, it doesn't show any other
    >> > function being deeply recursive.  So I'm not sure I understand what
    >> > tail-recursive function did you have in mind.  Can you elaborate?
    >> 
    >> I can. I think we're looking at two bugs: the first is the simple
    >> use-after-free of XFRAME (frame)->output_data.ns where `frame' is a
    >> dead frame. I've confirmed on GNU/Linux that mark_frame is called for
    >> a frame for which x_free_frame_resources has already been called, if
    >> there's a global variable still referencing the frame. I think the
    >> same thing happens on macOS.

    Eli> This one doesn't depend on the 'ok's initialization in
    Eli> face_inherited_attr in any way, does it?

No, it doesnʼt.

    >> 1. I think face_inherited_attr is being optimized to tail-call itself
    >> rather than calling itself in a new stack frame; thus, it loops
    >> indefinitely for a faulty face setup which would otherwise lead to an
    >> immediate crash.
    >> 1b. that optimization only works without the harmless initialization of "ok".
    >> 
    >> 2. Our initial face setup is faulty in the sense above.
    >> 
    >> 3. Something happens on a secondary thread which causes our face setup
    >> to become non-faulty, possibly during GC.

    Eli> What do you mean by "secondary thread"?  And how can GC modify Lisp
    Eli> data structures? that'd be a terrible bug.

    Eli> In any case, the full backtrace shows no trace of face_inherited_attr
    Eli> call anywhere in the callstack, so if there is indeed infinite
    Eli> recursion in that function, it was somehow exited long ago by the time
    Eli> GC runs.

    Eli> As for the tail-recursion part: do you see any sign of that in the
    Eli> disassembly posted by Robert?  I didn't, but maybe I missed
    Eli> something.  And such subtleties should only rear their ugly heads in
    Eli> optimized code, whereas we already know that an unoptimized build
    Eli> crashes in the same way.

Iʼm attaching the disassembly of face_inherited_attr with -O2, with
and without the change to 'ok'. I canʼt see any tail recursion, and
modulo the use of r14 rather than r13, the only change I can see is
right at the end, where the return value is set up (disclaimer: Iʼm
not fluent in x86 assembler).

    Eli> I still think the shortest way to finding the culprit here is to
    Eli> patiently and painfully go over the last_marked array, deciphering
    Eli> the Lisp object we marked, until we succeed in identifying the Lisp
    Eli> data structure which got corrupted.  Once we succeed in identifying
    Eli> that data structure, it should be relatively easy to find who and
    Eli> where corrupts it.  This may mean a lot of inconvenient drudgery,
    Eli> exacerbated by the fact that having a functional GDB on macOS is not
    Eli> easy, but I don't think we have a better way at this point.

Itʼs possible that there is only one bug. The emacs Iʼve been using
with the change in nsterm.m suggested by Pip has been completely
stable. If it does crash again I can trawl through last_marked.

Robert

[unmodified-optimized.txt (text/plain, attachment)]

[modified-optimized.txt (text/plain, attachment)]
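
Purely to illustrate what the tail-call hypothesis in point 1 would mean, and not as a claim about the actual code in xfaces.c, here is a minimal sketch. `struct face`, `inherited_attr` and `inherited_attr_bounded` are hypothetical names. The first function calls itself in tail position, so a compiler may turn it into a loop; on a cyclic inheritance chain that loop would spin forever, whereas real recursion would eventually exhaust the stack. The second function shows the same walk written as an explicit loop with a step cap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical face: ATTR == 0 means "unspecified, ask the parent".  */
struct face
{
  const struct face *inherit;
  int attr;
};

/* Self-call in tail position.  With sibling-call optimization this may
   compile to a jump, so a cycle would loop instead of overflowing.  */
int
inherited_attr (const struct face *f)
{
  if (f == NULL)
    return 0;
  if (f->attr != 0)
    return f->attr;
  return inherited_attr (f->inherit);
}

/* Same walk as an explicit loop, giving up after MAX_STEPS faces.
   Return true and store the value in *OUT if one was found.  */
bool
inherited_attr_bounded (const struct face *f, int max_steps, int *out)
{
  for (int i = 0; f != NULL && i < max_steps; i++, f = f->inherit)
    if (f->attr != 0)
      {
        *out = f->attr;
        return true;
      }
  return false;
}
```

The step cap is only one way to make a cyclic setup fail fast; it is shown here because it keeps the example short.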

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Fri, 10 Jan 2020 09:23:02 GMT) Full text and rfc822 format available.

Message #125 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Robert Pluim <rpluim <at> gmail.com>
Cc: alan <at> idiocy.org, jguenther <at> gmail.com, andreyk.mad <at> gmail.com,
 pipcet <at> gmail.com, 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Fri, 10 Jan 2020 11:21:59 +0200

> From: Robert Pluim <rpluim <at> gmail.com>
> Cc: Pip Cet <pipcet <at> gmail.com>,  38748 <at> debbugs.gnu.org,  alan <at> idiocy.org,
>   andreyk.mad <at> gmail.com,  jguenther <at> gmail.com
> Date: Fri, 10 Jan 2020 09:58:52 +0100
> 
> Itʼs possible that there is only one bug.

I certainly hope so!

> The emacs Iʼve been using with the change in nsterm.m suggested by
> Pip has been completely stable. If it does crash again I can trawl
> through last_marked.

Thanks.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Fri, 10 Jan 2020 09:24:02 GMT) Full text and rfc822 format available.

Message #128 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Pip Cet <pipcet <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: rpluim <at> gmail.com, andreyk.mad <at> gmail.com, alan <at> idiocy.org,
 jguenther <at> gmail.com, 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Fri, 10 Jan 2020 09:22:30 +0000

On Fri, Jan 10, 2020 at 8:27 AM Eli Zaretskii <eliz <at> gnu.org> wrote:
> > I can. I think we're looking at two bugs: the first is the simple
> > use-after-free of XFRAME (frame)->output_data.ns where `frame' is a
> > dead frame. I've confirmed on GNU/Linux that mark_frame is called for
> > a frame for which x_free_frame_resources has already been called, if
> > there's a global variable still referencing the frame. I think the
> > same thing happens on macOS.
>
> This one doesn't depend on the 'ok's initialization in
> face_inherited_attr in any way, does it?

It doesn't, no.

> What do you mean by "secondary thread"?

It's my impression that macOS forces us to run in several threads,
even though we don't really want to do so. For example, changeFont in
nsterm.m appears not to assume it's run on the main thread, but calls
build_string, which sounds dangerous to me.

> And how can GC modify Lisp
> data structures? that'd be a terrible bug.

Yes, it would be, but if bug#2 is real it's going to be terrible in
one way or another (I hope it's not GC-related, but "just" a stack
overflow).

> In any case, the full backtrace shows no trace of face_inherited_attr
> call anywhere in the callstack, so if there is indeed infinite
> recursion in that function, it was somehow exited long ago by the time
> GC runs.

I don't think the full backtrace is bug#2, it's bug#1.

> As for the tail-recursion part: do you see any sign of that in the
> disassembly posted by Robert?

No, just in the backtrace which shows execution at xfaces.c:2226, with
the PC not saved in the stack frame.

> I didn't, but maybe I missed
> something.  And such subtleties should only rear their ugly heads in
> optimized code, whereas we already know that an unoptimized build
> crashes in the same way.

Do we, though? We know that an unoptimized build crashes, but we don't
know it's the (hypothetical, as I said) bug#2.
>
> I still think the shortest way to finding the culprit here is to
> patiently and painfully go over the last_marked array, deciphering
> the Lisp object we marked, until we succeed in identifying the Lisp
> data structure which got corrupted.  Once we succeed in identifying
> that data structure, it should be relatively easy to find who and
> where corrupts it.  This may mean a lot of inconvenient drudgery,
> exacerbated by the fact that having a functional GDB on macOS is not
> easy, but I don't think we have a better way at this point.

I disagree. The patch to nsterm.m is obviously harmless, and appears
to fix the one bug we have clear evidence of, in a way that seems
logical and necessary to me.

If there is a second bug, and the backtrace we saw wasn't just a
fluke, it's going to show up when people run emacs on macOS in gdb in
all-stop mode. The problem is I think that hardly ever happens, and I
don't have access to a macOS machine.
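
As a stand-alone illustration of why clearing the pointer after freeing it helps with the first bug, here is a minimal sketch in plain C. It is not Emacs source: `struct output_data`, `struct frame`, `free_frame_resources` and `mark_frame_data` are simplified, hypothetical stand-ins, and the NULL check in the marking routine is an assumption about how a caller could take advantage of the cleared pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the window-system data and the frame.  */
struct output_data
{
  int face_cache_size;
};

struct frame
{
  struct output_data *output_data;
};

/* Release the window-system resources of F.  Clearing the pointer after
   freeing it turns a later use-after-free into a detectable NULL.  */
void
free_frame_resources (struct frame *f)
{
  free (f->output_data);
  f->output_data = NULL;
}

/* A marking routine that may still be handed a dead frame.  Return true
   if there was window-system data to visit, false for a dead frame.  */
bool
mark_frame_data (const struct frame *f)
{
  if (f->output_data == NULL)
    return false;                          /* dead frame: nothing to mark */
  (void) f->output_data->face_cache_size;  /* stand-in for real marking */
  return true;
}
```

Without the assignment to NULL, the second call to the marking routine would read freed memory, which may appear to work or crash much later, far from the cause.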

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Fri, 10 Jan 2020 09:34:02 GMT) Full text and rfc822 format available.

Message #131 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Pip Cet <pipcet <at> gmail.com>
Cc: rpluim <at> gmail.com, andreyk.mad <at> gmail.com, alan <at> idiocy.org,
 jguenther <at> gmail.com, 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Fri, 10 Jan 2020 11:33:06 +0200

> From: Pip Cet <pipcet <at> gmail.com>
> Date: Fri, 10 Jan 2020 09:22:30 +0000
> Cc: rpluim <at> gmail.com, alan <at> idiocy.org, jguenther <at> gmail.com, 
> 	andreyk.mad <at> gmail.com, 38748 <at> debbugs.gnu.org
> 
> > I still think the shortest way to finding the culprit here is to
> > patiently and painfully go over the last_marked array, deciphering
> > the Lisp object we marked, until we succeed in identifying the Lisp
> > data structure which got corrupted.  Once we succeed in identifying
> > that data structure, it should be relatively easy to find who and
> > where corrupts it.  This may mean a lot of inconvenient drudgery,
> > exacerbated by the fact that having a functional GDB on macOS is not
> > easy, but I don't think we have a better way at this point.
> 
> I disagree. The patch to nsterm.m is obviously harmless, and appears
> to fix the one bug we have clear evidence of, in a way that seems
> logical and necessary to me.

I wasn't talking about that part (I agree that fix should be
installed), but again, it's unrelated to the initialization of 'ok'.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Fri, 10 Jan 2020 10:19:02 GMT) Full text and rfc822 format available.

Message #134 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Robert Pluim <rpluim <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: alan <at> idiocy.org, andreyk.mad <at> gmail.com, jguenther <at> gmail.com,
 pipcet <at> gmail.com, 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Fri, 10 Jan 2020 11:18:35 +0100

>>>>> On Fri, 10 Jan 2020 11:21:59 +0200, Eli Zaretskii <eliz <at> gnu.org> said:

    >> From: Robert Pluim <rpluim <at> gmail.com>
    >> Cc: Pip Cet <pipcet <at> gmail.com>,  38748 <at> debbugs.gnu.org,  alan <at> idiocy.org,
    >> andreyk.mad <at> gmail.com,  jguenther <at> gmail.com
    >> Date: Fri, 10 Jan 2020 09:58:52 +0100
    >> 
    >> Itʼs possible that there is only one bug.

    Eli> I certainly hope so!

    >> The emacs Iʼve been using with the change in nsterm.m suggested by
    >> Pip has been completely stable. If it does crash again I can trawl
    >> through last_marked.

Although of course that build is with '-O0', and if there is a 2nd bug
it would be optimization dependent. Iʼll rebuild with the default
'-O2' and run that.

Robert

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Sat, 11 Jan 2020 06:27:02 GMT) Full text and rfc822 format available.

Message #137 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Pankaj Jangid <p4j <at> j4d.net>
To: 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Sat, 11 Jan 2020 06:26:45 +0000

Yesterday, Emacs 27.0.60 (built from HEAD) crashed on my macOS
10.15.2. I could not reproduce it after many tries. But just in case
it happens again, what information should I share apart from steps to
reproduce?

Is there a crash dump created somewhere? I am not aware of it.

Regards
-- 
Pankaj Jangid

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Sat, 11 Jan 2020 08:09:01 GMT) Full text and rfc822 format available.

Message #140 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Pankaj Jangid <p4j <at> j4d.net>
Cc: 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Sat, 11 Jan 2020 10:08:29 +0200

> From: Pankaj Jangid <p4j <at> j4d.net>
> Date: Sat, 11 Jan 2020 06:26:45 +0000
> 
> Yesterday, Emacs 27.0.60 (built from HEAD) crashed on my macOS
> 10.15.2. I could not reproduce it after many tries. But just in case
> it happens again, what information should I share apart from steps to
> reproduce?

In general, if Emacs crashes from time to time, my advice is to run it
under a debugger at all times, and when it crashes, produce a
backtrace and post it together with the bug report.  If you can afford
leaving the crashed session under the debugger, please do, as we might
have some requests for you to look inside the crashed session and show
values of some variables.

> Is there a crash dump created somewhere? I am not aware of it.

That's a function of your OS.  I don't use macOS, but every modern OS records
some information about a crash of every program in some place, so
searching the Internet and/or your system documentation will certainly
reveal how to find that place and look up the crash info from there.

Thanks.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Sat, 11 Jan 2020 10:44:01 GMT) Full text and rfc822 format available.

Message #143 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Pankaj Jangid <p4j <at> j4d.net>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Sat, 11 Jan 2020 10:43:10 +0000

Eli Zaretskii <eliz <at> gnu.org> writes:
>> Yesterday, Emacs 27.0.60 (built from HEAD) crashed on my macOS
>> ...
> In general, if Emacs crashes from time to time, my advice is to run it
> under a debugger at all times, and when it crashes, produce a
> backtrace and post it together with the bug report.  If you can afford
> leaving the crashed session under the debugger, please do, as we might
> have some requests for you to look inside the crashed session and show
> values of some variables.

Thanks for this info. I'll follow the above steps.

>> Is there a crash dump created somewhere? I am not aware of it.
>
> That's a function of your OS.  I don't use macOS, but every modern OS records
> some information about a crash of every program in some place, so
> searching the Internet and/or your system documentation will certainly
> reveal how to find that place and look up the crash info from there.
>

Yes. About 10 mins back, my Emacs crashed again (Emacs-27.0.60
HEAD). Got the OS dump,

https://send.firefox.com/download/2efd11c5e13a4fd7/#AsR4tM-dV4cV4Cwig09pyA

Regards
Pankaj

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Sat, 11 Jan 2020 12:15:02 GMT) Full text and rfc822 format available.

Message #146 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Pankaj Jangid <p4j <at> j4d.net>
Cc: 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Sat, 11 Jan 2020 14:14:16 +0200

> From: Pankaj Jangid <p4j <at> j4d.net>
> Cc: 38748 <at> debbugs.gnu.org
> Date: Sat, 11 Jan 2020 10:43:10 +0000
> 
> >> Is there a crash dump created somewhere? I am not aware of it.
> >
> > That's a function of your OS.  I don't use macOS, but every modern OS records
> > some information about a crash of every program in some place, so
> > searching the Internet and/or your system documentation will certainly
> > reveal how to find that place and look up the crash info from there.
> >
> 
> Yes. About 10 mins back, my Emacs crashed again (Emacs-27.0.60
> HEAD). Got the OS dump,
> 
> https://send.firefox.com/download/2efd11c5e13a4fd7/#AsR4tM-dV4cV4Cwig09pyA

Looks like the other crashes reported here, so please stay tuned for a
possible solution.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Sat, 11 Jan 2020 14:00:02 GMT) Full text and rfc822 format available.

Message #149 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Alan Third <alan <at> idiocy.org>
To: Pip Cet <pipcet <at> gmail.com>
Cc: Eli Zaretskii <eliz <at> gnu.org>, andreyk.mad <at> gmail.com, rpluim <at> gmail.com,
 jguenther <at> gmail.com, 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Sat, 11 Jan 2020 13:59:20 +0000

On Fri, Jan 10, 2020 at 09:22:30AM +0000, Pip Cet wrote:
> On Fri, Jan 10, 2020 at 8:27 AM Eli Zaretskii <eliz <at> gnu.org> wrote:
> > What do you mean by "secondary thread"?
> 
> It's my impression that macOS forces us to run in several threads,
> even though we don't really want to do so. For example, changeFont in
> nsterm.m appears not to assume it's run on the main thread, but calls
> build_string, which sounds dangerous to me.

What makes you think it’s assuming it may not be run on the main
thread?

macOS does set up several threads, but it doesn’t force any of your
code to run in arbitrary threads.

One of the big TODOs in the NS port is making code that may be called
from lisp safe to run in any thread because at the moment it all
assumes it’s running in a single thread, but lisp can call from any
lisp thread (and then Emacs crashes).

-- 
Alan Third

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Sat, 11 Jan 2020 14:15:02 GMT) Full text and rfc822 format available.

Message #152 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Pip Cet <pipcet <at> gmail.com>
To: Alan Third <alan <at> idiocy.org>
Cc: Eli Zaretskii <eliz <at> gnu.org>, andreyk.mad <at> gmail.com, rpluim <at> gmail.com,
 jguenther <at> gmail.com, 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Sat, 11 Jan 2020 14:13:43 +0000

On Sat, Jan 11, 2020 at 1:59 PM Alan Third <alan <at> idiocy.org> wrote:
> > It's my impression that macOS forces us to run in several threads,
> > even though we don't really want to do so. For example, changeFont in
> > nsterm.m appears not to assume it's run on the main thread, but calls
> > build_string, which sounds dangerous to me.
>
> What makes you think it’s assuming it may not be run on the main
> thread?

The way it doesn't simply call Lisp, but sets up an event to be
handled in the event loop. How is changeFont actually called? Would it
be safe to call Lisp from it?

> macOS does set up several threads, but it doesn’t force any of your
> code to run in arbitrary threads.

That's good to know, thanks.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Sat, 11 Jan 2020 18:38:01 GMT) Full text and rfc822 format available.

Message #155 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Pieter van Oostrum <pieter-l <at> vanoostrum.org>
To: Robert Pluim <rpluim <at> gmail.com>
Cc: jguenther <at> gmail.com, andreyk.mad <at> gmail.com, alan <at> idiocy.org,
 Pip Cet <pipcet <at> gmail.com>, 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Sat, 11 Jan 2020 19:37:02 +0100

Robert Pluim <rpluim <at> gmail.com> writes:

>>>>>> On Thu, 9 Jan 2020 14:16:54 +0000, Pip Cet <pipcet <at> gmail.com> said:
>
>     Pip> On Thu, Jan 9, 2020 at 1:51 PM Andrii Kolomoiets <andreyk.mad <at> gmail.com> wrote:
>     >> Here is the '~/emacs-crash.el' content:
>     >> (make-frame `((parent-frame . ,(window-frame))))
>     >> (make-frame `((parent-frame . ,(window-frame))))
>     >> (make-frame `((parent-frame . ,(window-frame))))
>     >> (make-frame `((parent-frame . ,(window-frame))))
>     >> (delete-frame)
>     >> (delete-frame)
>     >> (delete-frame)
>     >> (delete-frame)
>     >> (garbage-collect)
>
>     Pip> That sounds like Robert's bug, but not like the one that's related to
>     Pip> the "x = false" thing.
>
>     Pip> Can you try the attached patch?
>
>     Pip> diff --git a/src/nsterm.m b/src/nsterm.m
>     Pip> index 03754e5ae5..c1d1d41117 100644
>     Pip> --- a/src/nsterm.m
>     Pip> +++ b/src/nsterm.m
>     Pip> @@ -1644,6 +1644,7 @@ Hide the window (X11 semantics)
>     Pip>    [view release];
>  
>     Pip>    xfree (f->output_data.ns);
>     Pip> +  f->output_data.ns = NULL;
>  
>     Pip>    unblock_input ();
>     Pip>  }
>
> That has fixed things for me, not been able to crash it with Andrii's
> recipe (I had to increase the number of frames to get it to crash).
>
> Robert

I compiled HEAD with this patch applied, and it still crashed but with the other crash cause (in Fmouse_pixel_position).
-- 
Pieter van Oostrum
www: http://pieter.vanoostrum.org/
PGP key: [8DAE142BE17999C4]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Sat, 11 Jan 2020 18:44:02 GMT) Full text and rfc822 format available.

Message #158 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Pieter van Oostrum <pieter-l <at> vanoostrum.org>
Cc: jguenther <at> gmail.com, andreyk.mad <at> gmail.com, rpluim <at> gmail.com,
 alan <at> idiocy.org, pipcet <at> gmail.com, 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Sat, 11 Jan 2020 20:43:47 +0200

> From: Pieter van Oostrum <pieter-l <at> vanoostrum.org>
> Date: Sat, 11 Jan 2020 19:37:02 +0100
> Cc: Pip Cet <pipcet <at> gmail.com>, jguenther <at> gmail.com, alan <at> idiocy.org,
>  andreyk.mad <at> gmail.com, 38748 <at> debbugs.gnu.org
> 
> Robert Pluim <rpluim <at> gmail.com> writes:
> 
> >     Pip> diff --git a/src/nsterm.m b/src/nsterm.m
> >     Pip> index 03754e5ae5..c1d1d41117 100644
> >     Pip> --- a/src/nsterm.m
> >     Pip> +++ b/src/nsterm.m
> >     Pip> @@ -1644,6 +1644,7 @@ Hide the window (X11 semantics)
> >     Pip>    [view release];
> >  
> >     Pip>    xfree (f->output_data.ns);
> >     Pip> +  f->output_data.ns = NULL;
> >  
> >     Pip>    unblock_input ();
> >     Pip>  }
> >
> > That has fixed things for me, not been able to crash it with Andrii's
> > recipe (I had to increase the number of frames to get it to crash).
> >
> > Robert
> 
> I compiled HEAD with this patch applied, and it still crashed but with the other crash cause (in Fmouse_pixel_position).

Can you show the values of variables I asked about regarding that
crash?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Sat, 11 Jan 2020 19:08:01 GMT) Full text and rfc822 format available.

Message #161 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Alan Third <alan <at> idiocy.org>
To: Pip Cet <pipcet <at> gmail.com>
Cc: Eli Zaretskii <eliz <at> gnu.org>, andreyk.mad <at> gmail.com, rpluim <at> gmail.com,
 jguenther <at> gmail.com, 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Sat, 11 Jan 2020 19:07:15 +0000

On Sat, Jan 11, 2020 at 02:13:43PM +0000, Pip Cet wrote:
> On Sat, Jan 11, 2020 at 1:59 PM Alan Third <alan <at> idiocy.org> wrote:
> > > It's my impression that macOS forces us to run in several threads,
> > > even though we don't really want to do so. For example, changeFont in
> > > nsterm.m appears not to assume it's run on the main thread, but calls
> > > build_string, which sounds dangerous to me.
> >
> > What makes you think it’s assuming it may not be run on the main
> > thread?
> 
> The way it doesn't simply call Lisp, but sets up an event to be
> handled in the event loop. How is changeFont actually called? Would it
> be safe to call Lisp from it?

changeFont is called during the NS run (event) loop which I don’t
think is safe for calling lisp.

Effectively Emacs requests the font panel to be opened and then any
changes made in it are handled as though they’re user input events. I
remember looking into it because it doesn’t work like on other
toolkits, but because it’s this detached thing that only communicates
through input events while Emacs continues running it makes it
difficult to match its behaviour.

-- 
Alan Third
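
The pattern described above, where a toolkit callback only records what happened and the main loop later turns it into an input event, can be sketched as a small queue. This is a minimal illustration under stated assumptions, not the code in nsterm.m: `struct event`, `enqueue_event`, `dequeue_event` and `font_panel_changed` are hypothetical names, and the queue is single-threaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event record; real input events carry much more.  */
enum event_kind { EVENT_FONT_CHANGED };

struct event
{
  enum event_kind kind;
  int font_size;
};

#define QUEUE_CAPACITY 16
static struct event queue[QUEUE_CAPACITY];
static size_t queue_head, queue_len;

/* Append EV to the queue.  Return false if the queue is full.  */
bool
enqueue_event (struct event ev)
{
  if (queue_len == QUEUE_CAPACITY)
    return false;
  queue[(queue_head + queue_len) % QUEUE_CAPACITY] = ev;
  queue_len++;
  return true;
}

/* Remove the oldest event into *OUT.  Return false if none is pending.
   Only the main loop calls this, at a point where running Lisp is safe.  */
bool
dequeue_event (struct event *out)
{
  if (queue_len == 0)
    return false;
  *out = queue[queue_head];
  queue_head = (queue_head + 1) % QUEUE_CAPACITY;
  queue_len--;
  return true;
}

/* Toolkit-side callback: it never calls into the interpreter, it only
   records the change so the main loop can handle it as user input.  */
void
font_panel_changed (int new_size)
{
  struct event ev = { EVENT_FONT_CHANGED, new_size };
  (void) enqueue_event (ev);   /* a full queue drops the event here */
}
```

The design choice is that the callback stays trivial and cannot re-enter the interpreter; all real work happens when the main loop drains the queue.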

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Sat, 11 Jan 2020 19:16:02 GMT) Full text and rfc822 format available.

Message #164 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Pip Cet <pipcet <at> gmail.com>
To: Pieter van Oostrum <pieter-l <at> vanoostrum.org>
Cc: Robert Pluim <rpluim <at> gmail.com>, 38748 <at> debbugs.gnu.org, jguenther <at> gmail.com,
 andreyk.mad <at> gmail.com, alan <at> idiocy.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Sat, 11 Jan 2020 19:14:38 +0000

[Message part 1 (text/plain, inline)]

On Sat, Jan 11, 2020 at 6:37 PM Pieter van Oostrum
<pieter-l <at> vanoostrum.org> wrote:
> I compiled HEAD with this patch applied, and it still crashed but with the other crash cause (in Fmouse_pixel_position).

Do you have a backtrace? I think it's a NULL pointer reference now.
The attached patch might help.

[38748b.diff (text/x-patch, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Sat, 11 Jan 2020 21:24:01 GMT) Full text and rfc822 format available.

Message #167 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Pieter van Oostrum <pieter-l <at> vanoostrum.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: jguenther <at> gmail.com, andreyk.mad <at> gmail.com, rpluim <at> gmail.com,
 alan <at> idiocy.org, pipcet <at> gmail.com, 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Sat, 11 Jan 2020 22:23:49 +0100

Eli Zaretskii <eliz <at> gnu.org> writes:

>> From: Pieter van Oostrum <pieter-l <at> vanoostrum.org>
>> Date: Sat, 11 Jan 2020 19:37:02 +0100
>> Cc: Pip Cet <pipcet <at> gmail.com>, jguenther <at> gmail.com, alan <at> idiocy.org,
>>  andreyk.mad <at> gmail.com, 38748 <at> debbugs.gnu.org
>> 
>> Robert Pluim <rpluim <at> gmail.com> writes:
>> 
>> >     Pip> diff --git a/src/nsterm.m b/src/nsterm.m
>> >     Pip> index 03754e5ae5..c1d1d41117 100644
>> >     Pip> --- a/src/nsterm.m
>> >     Pip> +++ b/src/nsterm.m
>> >     Pip> @@ -1644,6 +1644,7 @@ Hide the window (X11 semantics)
>> >     Pip>    [view release];
>> >  
>> >     Pip>    xfree (f->output_data.ns);
>> >     Pip> +  f->output_data.ns = NULL;
>> >  
>> >     Pip>    unblock_input ();
>> >     Pip>  }
>> >
>> > That has fixed things for me, not been able to crash it with Andrii's
>> > recipe (I had to increase the number of frames to get it to crash).
>> >
>> > Robert
>> 
>> I compiled HEAD with this patch applied, and it still crashed but with
>> the other crash cause (in Fmouse_pixel_position).
>
> Can you show the values of variables I asked about regarding that
> crash?

Sorry, no. I wasn't running under gdb when that crash occurred (now I am). And I wasn't aware that you had asked about some variables for this particular crash, only for the other one with all the mark-related stuff. So which variables would those be? I couldn't find them in the discussion.
-- 
Pieter van Oostrum
www: http://pieter.vanoostrum.org/
PGP key: [8DAE142BE17999C4]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Sat, 11 Jan 2020 21:37:02 GMT) Full text and rfc822 format available.

Message #170 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Pieter van Oostrum <pieter-l <at> vanoostrum.org>
To: Pip Cet <pipcet <at> gmail.com>
Cc: Robert Pluim <rpluim <at> gmail.com>, 38748 <at> debbugs.gnu.org, jguenther <at> gmail.com,
 andreyk.mad <at> gmail.com, alan <at> idiocy.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Sat, 11 Jan 2020 22:36:12 +0100

[Message part 1 (text/plain, inline)]

Pip Cet <pipcet <at> gmail.com> writes:

> On Sat, Jan 11, 2020 at 6:37 PM Pieter van Oostrum
> <pieter-l <at> vanoostrum.org> wrote:
>> I compiled HEAD with this patch applied, and it still crashed but with
>> the other crash cause (in Fmouse_pixel_position).
>
> Do you have a backtrace? I think it's a NULL pointer reference now.
> The attached patch might help.

I have a backtrace, but without debug info. I am now compiling with your patch and with debug info, as described in etc/DEBUG.

[Emacs_2020-01-11-144434_Cochabamba.crash (application/octet-stream, attachment)]

[Message part 3 (text/plain, inline)]

-- 
Pieter van Oostrum
www: http://pieter.vanoostrum.org/
PGP key: [8DAE142BE17999C4]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Sun, 12 Jan 2020 03:34:02 GMT) Full text and rfc822 format available.

Message #173 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Pieter van Oostrum <pieter-l <at> vanoostrum.org>
Cc: jguenther <at> gmail.com, andreyk.mad <at> gmail.com, rpluim <at> gmail.com,
 alan <at> idiocy.org, pipcet <at> gmail.com, 38748 <at> debbugs.gnu.org
Subject: Re: bug#38748: 28.0.50; crash on MacOS 10.15.2
Date: Sun, 12 Jan 2020 05:33:35 +0200

> From: Pieter van Oostrum <pieter-l <at> vanoostrum.org>
> Cc: rpluim <at> gmail.com,  pipcet <at> gmail.com,  jguenther <at> gmail.com,
>   alan <at> idiocy.org,  andreyk.mad <at> gmail.com,  38748 <at> debbugs.gnu.org
> Date: Sat, 11 Jan 2020 22:23:49 +0100
> 
> > Can you show the values of variables I asked about regarding that
> > crash?
> 
> Sorry, no. I wasn't running under gdb when that crash occurred (now I am). And I wasn't aware that you had asked about some variables for this particular crash, only for the other one with all the mark-related stuff. So which variables would those be? I couldn't find them in the discussion.

The values of f, FRAME_TERMINAL (f), and
FRAME_TERMINAL(f)->mouse_position_hook.

Also, can you show exactly, in terms of the C source, where it
crashes?

Thanks.
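
The three values requested above form a chain of pointers, and a NULL anywhere along it would produce the kind of NULL pointer reference suspected earlier in the thread. A minimal sketch of checking each link before following it, with hypothetical stand-in types (`struct terminal`, `struct hooked_frame`) and a hypothetical helper `query_mouse_position` rather than the real Emacs definitions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins: a terminal with an optional hook, and a frame
   that may have lost its terminal.  */
struct terminal
{
  void (*mouse_position_hook) (int *x, int *y);
};

struct hooked_frame
{
  struct terminal *terminal;
};

/* Ask F's terminal for the mouse position.  Return false, leaving *X and
   *Y untouched, if any link in the chain is missing.  */
bool
query_mouse_position (const struct hooked_frame *f, int *x, int *y)
{
  if (f == NULL
      || f->terminal == NULL
      || f->terminal->mouse_position_hook == NULL)
    return false;
  f->terminal->mouse_position_hook (x, y);
  return true;
}

/* Example hook used only for demonstration.  */
void
fixed_position_hook (int *x, int *y)
{
  *x = 10;
  *y = 20;
}
```

Printing the same three values in the debugger at the crash site shows which link, if any, is NULL.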

Merged 38748 38822. Request was from Robert Pluim <rpluim <at> gmail.com> to control <at> debbugs.gnu.org. (Mon, 20 Jan 2020 09:03:02 GMT) Full text and rfc822 format available.

bug Marked as fixed in versions 27.1. Request was from Robert Pluim <rpluim <at> gmail.com> to control <at> debbugs.gnu.org. (Mon, 20 Jan 2020 16:31:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#38748; Package emacs. (Sun, 20 Sep 2020 10:51:02 GMT) Full text and rfc822 format available.

Message #180 received at 38748 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Robert Pluim <rpluim <at> gmail.com>
Cc: Eli Zaretskii <eliz <at> gnu.org>, 38822 <at> debbugs.gnu.org, jguenther <at> gmail.com,
 38748 <at> debbugs.gnu.org
Subject: Re: bug#38822: 27.0.60; Crashes on MacOS 10.14 when quitting emacs,
 and intermittent crashes during normal usage
Date: Sun, 20 Sep 2020 12:50:15 +0200

Robert Pluim <rpluim <at> gmail.com> writes:

>     >> Thanks for that. Eli, should we apply this to emacs-27 with
>     >> attribution to Pip Cet?:
>
>     Eli> Fine with me, thanks.
>
> Done. Bug closed.

Hm -- it looks like this bug was left open?  Skimming it, it's somewhat
confusing, but I think the conclusion was that this had been fixed by
Pip's patch, so I'm now (re)closing the bug report.  If this is still a
problem to be fixed here, please send a message to the debbugs address,
and we'll reopen the bug report.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

bug closed, send any further explanations to 38748 <at> debbugs.gnu.org and Andrii Kolomoiets <andreyk.mad <at> gmail.com> Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Sun, 20 Sep 2020 10:51:03 GMT) Full text and rfc822 format available.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 18 Oct 2020 11:24:07 GMT) Full text and rfc822 format available.

This bug report was last modified 4 years and 269 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
