GNU bug report logs - #72145
rare Emacs screwups on x86 due to GCC bug 58416
Reported by: Paul Eggert <eggert <at> cs.ucla.edu>
Date: Tue, 16 Jul 2024 23:27:02 UTC
Severity: normal
Tags: patch
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 72145 in the body.
You can then email your comments to 72145 AT debbugs.gnu.org in the normal way.
Report forwarded to bug-gnu-emacs <at> gnu.org: bug#72145; Package emacs. (Tue, 16 Jul 2024 23:27:02 GMT)
Acknowledgement sent to Paul Eggert <eggert <at> cs.ucla.edu>: New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Tue, 16 Jul 2024 23:27:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
While testing GNU Emacs built on Fedora 40 with gcc (GCC) 14.1.1
20240607 (Red Hat 14.1.1-5) with -m32 for x86 and configured
--with-wide-int, I discovered that Emacs misbehaved in a hard-to-debug
way due to GCC bug 58416. This bug causes GCC to generate wrong x86
machine instructions when a C program accesses a union containing a
'double'.
The bug I observed is that if you have something like this:
union u { double d; long long int i; } u;
then GCC sometimes generates x86 instructions that copy u.i by using
fldl/fstpl instruction pairs to push the 64-bit quantity onto the 387
floating point stack, and then pop the stack into another memory
location. Unfortunately the fldl/fstpl trick fails in the unusual case
when the bit pattern of u.i, when interpreted as a double, is a NaN, as
that can cause the fldl/fstpl pair to store a different NaN with a
different bit pattern, which means the destination integer disagrees
with u.i.
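To make the NaN condition concrete, here is a minimal sketch (not taken
from Emacs or from the patch) of how one might test whether a 64-bit
integer has a bit pattern that reads as a NaN when viewed as an IEEE 754
double. The function names are hypothetical. The signaling-NaN check
rests on the assumption that the 387 quiets signaling NaNs when loading
them, which would make those the patterns at risk; that detail is not
established in this report.

```c
#include <stdbool.h>
#include <stdint.h>

/* An IEEE 754 binary64 value is a NaN when all 11 exponent bits are
   set and the 52-bit fraction is nonzero.  The sign bit is ignored.  */
static bool
bits_are_nan (uint64_t bits)
{
  uint64_t exponent = (bits >> 52) & 0x7ff;
  uint64_t fraction = bits & ((1ULL << 52) - 1);
  return exponent == 0x7ff && fraction != 0;
}

/* On x86 a NaN is signaling when the top fraction bit (bit 51, the
   "quiet" bit) is clear.  Assumption: these are the patterns that an
   fldl/fstpl round trip may rewrite.  */
static bool
bits_are_signaling_nan (uint64_t bits)
{
  return bits_are_nan (bits) && !(bits & (1ULL << 51));
}
```

For example, bits_are_nan (0x7ff0000000000001ULL) is true, whereas
0x7ff0000000000000ULL is infinity and so is not a NaN.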
The bug is obscure, since the bug's presence depends on the GCC version,
on the optimization options used, on the exact source code, and on the
exact integer value at runtime (the value is typically copied correctly
even when GCC has generated the incorrect machine code, since most long
long int values don't alias with NaNs).
In short the bug appears to be rare.
Here are some possible courses of action:
* Do nothing and hope x86 users won't run into this rare bug.
* Have the GCC folks fix the bug. However, given that the bug has been
reported multiple times over more than a decade without a fix, it seems
that fixing it is too difficult and/or too low priority for this aging
platform. Also, even if the bug is fixed in a future GCC, it will
still be present for people using older GCC versions.
* Build with Clang or some other compiler instead. We should be
encouraging GCC, though.
* Rewrite Emacs to never use 'double' (or 'float' or 'long double')
inside a union. This could be painful and hardly seems worthwhile.
* When using GCC to build Emacs on x86, compile with safer options that
make the bug impossible. The attached proposed patch does that, by
telling GCC not to use the 387 stack. (This patch fixed the Emacs
misbehavior in my experimental build.) The downside is that the
resulting Emacs executables need SSE2, introduced for the Pentium 4 in
2000 <https://en.wikipedia.org/wiki/SSE2>. Nowadays few users need to
run Emacs on non-SSE2 x86, so this may be good enough. Also, the
proposed patch gives the builder an option to compile Emacs without the
safer options, for people who want to build for older Intel-compatible
platforms and who don't mind an occasional wrong answer or crash.
[0001-Work-around-GCC-bug-58416-when-building-for-x86.patch (text/x-patch, attachment)]
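Whether a given compilation uses the 387 stack for floating point can be
checked at compile time. The sketch below assumes GCC's predefined macro
__SSE2_MATH__, which GCC defines when floating-point math is done with
SSE2 (for example with -msse2 -mfpmath=sse; these flags are given only as
an illustration and are not necessarily the ones in the attached patch).
The function name is hypothetical.

```c
/* Return 1 if this translation unit was compiled for 32-bit x86
   without SSE2 floating-point math, i.e. with 'double' arithmetic
   presumably going through the 387 stack; return 0 otherwise.
   Assumption: __i386__ and __SSE2_MATH__ are the GCC-style
   predefined macros; other compilers may not define them.  */
int
uses_387_math_on_x86 (void)
{
#if defined __i386__ && !defined __SSE2_MATH__
  return 1;
#else
  return 0;
#endif
}
```

A result of 0 does not prove the bug is absent in general; it only says
that this particular build is not in the 387 configuration discussed here.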
Added tag(s) patch. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Wed, 17 Jul 2024 00:16:02 GMT)
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#72145; Package emacs. (Wed, 17 Jul 2024 00:59:02 GMT)
Message #10 received at 72145 <at> debbugs.gnu.org (full text, mbox):
Paul Eggert <eggert <at> cs.ucla.edu> writes:
> * When using GCC to build Emacs on x86, compile with safer options
> that make the bug impossible. The attached proposed patch does that,
> by telling GCC not to use the 387 stack. (This patch fixed the Emacs
> misbehavior in my experimental build.) The downside is that the
> resulting Emacs executables need SSE2, introduced for the Pentium 4
> in 2000 <https://en.wikipedia.org/wiki/SSE2>. Nowadays few users
> need to run Emacs on non-SSE2 x86, so this may be good enough. Also,
> the proposed patch gives the builder an option to compile Emacs
> without the safer options, for people who want to build for older
> Intel-compatible platforms and who don't mind an occasional wrong
> answer or crash.
Wouldn't it be better if configure attempted to detect the presence of
SSE2 on the host system?
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#72145; Package emacs. (Wed, 17 Jul 2024 05:03:02 GMT)
Message #13 received at 72145 <at> debbugs.gnu.org (full text, mbox):
On 2024-07-16 17:57, Po Lu wrote:
> Wouldn't it be better if configure attempted to detect the presence of
> SSE2 on the host system?
We could add an AC_RUN_IFELSE test for SSE2, though I doubt whether it
would affect builds significantly in practice. Build systems invariably
support SSE2 nowadays and AC_RUN_IFELSE tests the build system, not the
host system.
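As a rough idea of what the body of such a run-time test could look like,
here is a minimal sketch that asks the CPU it runs on whether it reports
SSE2. It assumes the GCC/Clang built-ins __builtin_cpu_init and
__builtin_cpu_supports on x86; the function name is hypothetical and this
is not the test used in any patch in this report. Because it runs on the
machine doing the build, it says nothing about a different machine that
will later run the binary, and it does not check whether the operating
system has enabled the XMM registers.

```c
/* Return 1 if the CPU executing this code reports SSE2, 0 if it does
   not, and -1 if the question cannot be answered here (not x86, or
   no GCC-compatible built-ins).  */
int
cpu_has_sse2 (void)
{
#if (defined __i386__ || defined __x86_64__) && defined __GNUC__
  __builtin_cpu_init ();
  return __builtin_cpu_supports ("sse2") ? 1 : 0;
#else
  return -1;
#endif
}
```

A configure-style test program would call cpu_has_sse2 from main and
exit with status 0 only when the result is 1.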
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#72145; Package emacs. (Wed, 17 Jul 2024 21:58:02 GMT)
Message #16 received at 72145 <at> debbugs.gnu.org (full text, mbox):
On 2024-07-16 22:01, Paul Eggert wrote:
> We could add an AC_RUN_IFELSE test for SSE2, though I doubt whether it
> would affect builds significantly in practice.
On second thought the rare Arch or Gentoo user could still be building
Emacs for the Pentium III, and for such a user a run-time test on the
build host would be a win. This can be done via the attached revised
patch. It uses AC_LINK_IFELSE to compile and run a single program,
instead of AC_RUN_IFELSE which (when combined with AC_COMPILE_IFELSE)
would mean compiling two test programs and running one.
[0001-Work-around-GCC-bug-58416-when-building-for-x86.patch (text/x-patch, attachment)]
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#72145; Package emacs. (Thu, 18 Jul 2024 02:41:02 GMT)
Message #19 received at 72145 <at> debbugs.gnu.org (full text, mbox):
Paul Eggert <eggert <at> cs.ucla.edu> writes:
> On 2024-07-16 22:01, Paul Eggert wrote:
>> We could add an AC_RUN_IFELSE test for SSE2, though I doubt whether
>> it would affect builds significantly in practice.
>
> On second thought the rare Arch or Gentoo user could still be building
> Emacs for the Pentium III, and for such a user a run-time test on the
> build host would be a win. This can be done via the attached revised
> patch. It uses AC_LINK_IFELSE to compile and run a single program,
> instead of AC_RUN_IFELSE which (when combined with AC_COMPILE_IFELSE)
> would mean compiling two test programs and running one.
I'm thinking of the computer where I produce binaries for Windows 9X,
which, being a Windows 98 system, probably does not support SSE2.
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#72145; Package emacs. (Thu, 18 Jul 2024 03:25:02 GMT)
Message #22 received at 72145 <at> debbugs.gnu.org (full text, mbox):
[[[ To any NSA and FBI agents reading my email: please consider ]]]
[[[ whether defending the US Constitution against all enemies, ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]
> * Rewrite Emacs to never use 'double' (or 'float' or 'long double')
> inside a union. This could be painful and hardly seems worthwhile.
Where does Emacs use those types inside a union?
Maybe this is not difficult.
--
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#72145; Package emacs. (Thu, 18 Jul 2024 05:15:02 GMT)
Message #25 received at 72145 <at> debbugs.gnu.org (full text, mbox):
> Cc: 72145 <at> debbugs.gnu.org
> Date: Thu, 18 Jul 2024 10:39:42 +0800
> From: Po Lu via "Bug reports for GNU Emacs,
> the Swiss army knife of text editors" <bug-gnu-emacs <at> gnu.org>
>
> Paul Eggert <eggert <at> cs.ucla.edu> writes:
>
> > On 2024-07-16 22:01, Paul Eggert wrote:
> >> We could add an AC_RUN_IFELSE test for SSE2, though I doubt whether
> >> it would affect builds significantly in practice.
> >
> > On second thought the rare Arch or Gentoo user could still be building
> > Emacs for the Pentium III, and for such a user a run-time test on the
> > build host would be a win. This can be done via the attached revised
> > patch. It uses AC_LINK_IFELSE to compile and run a single program,
> > instead of AC_RUN_IFELSE which (when combined with AC_COMPILE_IFELSE)
> > would mean compiling two test programs and running one.
>
> I'm thinking of the computer where I produce binaries for Windows 9X,
> which, being a Windows 98 system, probably does not support SSE2.
Look at the Properties to see what kind of CPU it has. Then you can
establish whether it supports SSE2.
But I think the problem is not where you produce the binaries, the
problem is where people will run them. On Windows, it is very
frequently a completely different system, so a test on the build host
is insufficient. I think builds for Windows 9X should use the
'emacs_cv_SSE2_CFLAGS=no' thing regardless of what the build host
supports, because otherwise the binary will simply refuse to run on
the target.
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#72145; Package emacs. (Thu, 18 Jul 2024 12:39:02 GMT)
Message #28 received at 72145 <at> debbugs.gnu.org (full text, mbox):
On 2024-07-17 20:22, Richard Stallman wrote:
> > * Rewrite Emacs to never use 'double' (or 'float' or 'long double')
> > inside a union. This could be painful and hardly seems worthwhile.
>
> Where does Emacs use those types inside a union?
> Maybe this is not difficult.
I found the bug in src/timefns.c, which uses a union to represent
timestamp forms (one of which represents an Emacs float). Other uses
that come to mind are src/lisp.h's struct Lisp_Float, which uses a union
to save space when representing Lisp floats, and src/lread.c's and
src/print.c's use of <ieee754.h>'s unions to deal with NaNs when reading
and printing Lisp floats. Although I have not done an audit I expect
there are other places too, and I expect it would take some time to
audit, rewrite and thoroughly test Emacs to not use floating point in
these places, with runtime performance degraded somewhat as a result.
Although that effort might be worth it if the bug was likely and there
was no other workaround, the bug is quite rare (we've lived with it for
decades and I'm the first person to notice it, or at least track it
down), and with the proposed compiler-flag workaround the remaining
affected platforms are so obsolescent (decades-old CPUs) that they're
also rare. I doubt whether it's worth significantly contorting the C
code (possibly introducing bugs on mainstream platforms) to fix these
exceedingly rare bugs in obsolescent platforms.
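To give a feel for what the "no floating point inside a union" rewrite
would involve, here is a minimal sketch, assuming a hypothetical payload
type that always stores its 64 bits as an integer and converts to and
from 'double' with memcpy only when a float is really wanted. None of
these names come from Emacs, and whether this shape actually avoids the
miscompilation on a given GCC version is untested.

```c
#include <stdint.h>
#include <string.h>

_Static_assert (sizeof (double) == sizeof (uint64_t),
                "this sketch assumes a 64-bit double");

/* Hypothetical stand-in for union { double d; long long int i; }:
   copies of this struct only ever move an integer.  */
struct payload { uint64_t bits; };

/* Store the bit pattern of D without going through a union.  */
static struct payload
payload_from_double (double d)
{
  struct payload p;
  memcpy (&p.bits, &d, sizeof p.bits);
  return p;
}

/* Recover the double whose bit pattern is stored in P.  */
static double
payload_to_double (struct payload p)
{
  double d;
  memcpy (&d, &p.bits, sizeof d);
  return d;
}
```

Every access to the floating-point view now goes through a helper, which
hints at the auditing and rewriting effort described above.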
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#72145; Package emacs. (Thu, 18 Jul 2024 14:20:02 GMT)
Message #31 received at 72145 <at> debbugs.gnu.org (full text, mbox):
Paul Eggert <eggert <at> cs.ucla.edu> writes:
> While testing GNU Emacs built on Fedora 40 with gcc (GCC) 14.1.1
> 20240607 (Red Hat 14.1.1-5) with -m32 for x86 and configured
> --with-wide-int, I discovered that Emacs misbehaved in a hard-to-debug
> way due to GCC bug 58416. This bug causes GCC to generate wrong x86
> machine instructions when a C program accesses a union containing a
> 'double'.
Mmmh nice one :)
I asked GCC people if they have a suggestion on how to work around this
bug <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58416#c9>.
Thanks
Andrea
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#72145; Package emacs. (Thu, 18 Jul 2024 15:21:02 GMT)
Message #34 received at 72145 <at> debbugs.gnu.org (full text, mbox):
On Thursday, July 18th, 2024 at 12:38, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> On 2024-07-17 20:22, Richard Stallman wrote:
>
> > > * Rewrite Emacs to never use 'double' (or 'float' or 'long double')
> > > inside a union. This could be painful and hardly seems worthwhile.
> >
> > Where does Emacs use those types inside a union?
> > Maybe this is not difficult.
>
>
> I found the bug in src/timefns.c, which uses a union to represent
> timestamp forms (one of which represents an Emacs float). Other uses
> that come to mind are src/lisp.h's struct Lisp_Float, which uses a union
> to save space when representing Lisp floats, and src/lread.c's and
> src/print.c's use of <ieee754.h>'s unions to deal with NaNs when reading
> and printing Lisp floats. Although I have not done an audit I expect
> there are other places too, and I expect it would take some time to
> audit, rewrite and thoroughly test Emacs to not use floating point in
> these places, with runtime performance degraded somewhat as a result.
>
> Although that effort might be worth it if the bug was likely and there
> was no other workaround, the bug is quite rare (we've lived with it for
> decades and I'm the first person to notice it, or at least track it
> down), and with the proposed compiler-flag workaround the remaining
> affected platforms are so obsolescent (decades-old CPUs) that they're
> also rare. I doubt whether it's worth significantly contorting the C
> code (possibly introducing bugs on mainstream platforms) to fix these
> exceedingly rare bugs in obsolescent platforms.
It should be mentioned that this isn't just about the CPU: the OS also needs to enable the XMM register set, right? That means we might end up dropping support for many old platforms as well as old CPUs and emulators, and I'm not sure that's a good idea.
Pip
Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>: You have taken responsibility. (Fri, 19 Jul 2024 21:32:02 GMT)
Notification sent to Paul Eggert <eggert <at> cs.ucla.edu>: bug acknowledged by developer. (Fri, 19 Jul 2024 21:32:02 GMT)
Message #39 received at 72145-done <at> debbugs.gnu.org (full text, mbox):
On 2024-07-18 08:19, Pip Cet wrote:
> It should be mentioned that this isn't just about the CPU: the OS also needs to enable the XMM register set, right?
Right.
In <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58416#c10> GCC's
Richard Biener suggested a more portable workaround: use -fno-tree-sra
when generating 32-bit x86 code for which it is not known that SSE2 is
supported. (With SSE2, -mfpmath=sse is a better workaround.) Using
-fno-tree-sra means we needn't worry whether the build and host
platforms use different CPU types.
I did that by installing the attached patch to Emacs on savannah, and am
closing the bug report.
[0001-Work-around-GCC-bug-58416-on-32-bit-x86.patch (text/x-patch, attachment)]
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 17 Aug 2024 11:24:08 GMT)
bug unarchived. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Thu, 22 Aug 2024 06:43:02 GMT)
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#72145; Package emacs. (Thu, 22 Aug 2024 06:46:02 GMT)
Message #46 received at 72145 <at> debbugs.gnu.org (full text, mbox):
GCC bug 58416 has been fixed, and the fix should appear in the
forthcoming GCC 15. I installed the attached patch into GNU Emacs, so
that Emacs no longer attempts to work around the bug if GCC 15+ is being
used.
[0001-GCC-bug-58416-is-fixed-in-GCC-15.patch (text/x-patch, attachment)]
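The condition "GCC older than 15, 32-bit x86, no SSE2 math" can also be
expressed in C with predefined macros. The following is a minimal sketch
under that reading, not the logic of the installed patch; the function
name is hypothetical, and it cannot see whether -fno-tree-sra was passed,
since no predefined macro is known to reflect that option.

```c
/* Return 1 if this build matches the configuration in which GCC bug
   58416 was reported to matter: real GCC (not Clang, which also
   defines __GNUC__) before version 15, targeting 32-bit x86, with
   floating-point math not done in SSE2.  Return 0 otherwise.  */
int
gcc_bug_58416_possible (void)
{
#if (defined __GNUC__ && !defined __clang__ && __GNUC__ < 15 \
     && defined __i386__ && !defined __SSE2_MATH__)
  return 1;
#else
  return 0;
#endif
}
```

A result of 1 means only that a workaround is worth considering for this
build, not that the generated code is actually wrong.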
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 19 Sep 2024 11:24:10 GMT)