GNU bug report logs -
#80021
suboptimal Lisp_Object tag check (TAGGEDP)
Previous Next
To reply to this bug, email your comments to 80021 AT debbugs.gnu.org.
There is no need to reopen the bug first.
Toggle the display of automated, internal messages from the tracker.
Report forwarded
to
bug-gnu-emacs <at> gnu.org:
bug#80021; Package
emacs.
(Wed, 17 Dec 2025 13:59:02 GMT)
Full text and
rfc822 format available.
Acknowledgement sent
to
Mattias Engdegård <mattias.engdegard <at> gmail.com>:
New bug report received and forwarded. Copy sent to
bug-gnu-emacs <at> gnu.org.
(Wed, 17 Dec 2025 13:59:02 GMT)
Full text and
rfc822 format available.
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
The obvious way to test whether a Lisp_Object X has the tag bits T (assuming USE_LSB_TAG) would be
(X & 0b111) == T
but the current expression used in TAGGEDP is
(((unsigned)X - T) & 0b111) == 0
because for x86, GCC compiles the two expressions to
MOV r,X
AND r,7
CMP r,T
JE tag_matched
and
LEA r,[X-T]
TEST r,7
JE tag_matched
respectively, because the latter is shorter (in bytes) and so makes more efficient use of cache and i-fetch (and possibly decode) bandwidth. (The cast to unsigned just prods GCC to save a REX prefix byte in LEA.) Previous discussion at
https://lists.gnu.org/r/emacs-devel/2018-08/msg00828.html
However, once it's in cache the runtime should be the same: the MOV is free so in both cases we end up with a single ALU op before the fused compare-and-branch macro-op. More importantly, it's common to have multiple type tests for the same object in sequence and now the 'clever' code becomes less so:
if (TAGGEDP(X,T1)) goto is_T1;
else if (TAGGEDP(X,T2)) goto is_T2;
becomes
MOV r,X
AND r,7
CMP r,T1
JE is_T1
CMP r,T2
JE is_T2
and
LEA r,[X-T1]
TEST r,7
JE is_T1
LEA r,[X-T2]
TEST r,7
JE is_T2
and the second version is likely a cycle slower if the first test fails.
Of course this whole business is entirely x86-specific: on other platforms, the 'obvious' code is as good or better.
It's also compiler-specific: recent versions of Clang normalise and usually emit the 'obvious' code for both.
Thus we should keep the current 'clever' TAGGEDP for GCC on x86, for now, and use the 'obvious' form for other compilers and targets. Suggested patch attached.
Addendum: FIXNUMP can also be coded as
TAGGEDP(X,0b010) || TAGGEDP(X,0b010)
which seems daft but could be better when testing other types at the same time, as in FIXNUMP(X) || FLOATP(X), because it exposes the tag to CSE.
[taggedp.diff (application/octet-stream, attachment)]
Information forwarded
to
bug-gnu-emacs <at> gnu.org:
bug#80021; Package
emacs.
(Wed, 17 Dec 2025 14:42:02 GMT)
Full text and
rfc822 format available.
Message #8 received at submit <at> debbugs.gnu.org (full text, mbox):
> Thus we should keep the current 'clever' TAGGEDP for GCC on x86, for now,
> and use the 'obvious' form for other compilers and targets. Suggested
> patch attached.
FWIW, I'd prefer to use the same code for all architectures.
There's already way too many `#if`s in that code.
Stefan
Information forwarded
to
bug-gnu-emacs <at> gnu.org:
bug#80021; Package
emacs.
(Wed, 17 Dec 2025 19:50:02 GMT)
Full text and
rfc822 format available.
Message #11 received at submit <at> debbugs.gnu.org (full text, mbox):
Mattias Engdegård <mattias.engdegard <at> gmail.com> writes:
> The obvious way to test whether a Lisp_Object X has the tag bits T (assuming USE_LSB_TAG) would be
>
> (X & 0b111) == T
>
> but the current expression used in TAGGEDP is
>
> (((unsigned)X - T) & 0b111) == 0
>
> because for x86, GCC compiles the two expressions to
>
> MOV r,X
> AND r,7
> CMP r,T
> JE tag_matched
>
> and
>
> LEA r,[X-T]
> TEST r,7
> JE tag_matched
IIRC, the initial idea was that the untagged pointer would now be in
"r", ready to be used, but I'm not sure GCC performs that optimization;
today's major platforms, AArch64 and x86, both have unscaled memory
offsets (LDU on AArch64, ordinary loads on x86) so it doesn't matter
much how the tag bits are removed or whether that happens once or twice.
> respectively, because the latter is shorter (in bytes) and so makes
> more efficient use of cache and i-fetch (and possibly decode)
> bandwidth. (The cast to unsigned just prods GCC to save a REX prefix
> byte in LEA.) Previous discussion at
Casting to unsigned does of course prevent the re-use of r on 64-bit
machines, so maybe we should carefully inspect the assembler code around
the call sites with and without the cast.
> https://lists.gnu.org/r/emacs-devel/2018-08/msg00828.html
>
> However, once it's in cache the runtime should be the same: the MOV is
> free so in both cases we end up with a single ALU op before the fused
> compare-and-branch macro-op. More importantly, it's common to have
> multiple type tests for the same object in sequence and now the
> 'clever' code becomes less so:
>
> if (TAGGEDP(X,T1)) goto is_T1;
> else if (TAGGEDP(X,T2)) goto is_T2;
>
> becomes
>
> MOV r,X
> AND r,7
> CMP r,T1
> JE is_T1
> CMP r,T2
> JE is_T2
>
> and
>
> LEA r,[X-T1]
> TEST r,7
> JE is_T1
> LEA r,[X-T2]
> TEST r,7
> JE is_T2
>
> and the second version is likely a cycle slower if the first test fails.
A whole cycle?! Color me surprised. I fail to see the hazard which would
cause such a drastic slowdown. Does lea somehow not rename the register
as efficiently as mov does?
On x86, checking two tags at once is probably done most compactly by
generating a 32-bit mask, then left-shifting by the value of the lisp
word (which ignores all but the LSB 5 bits) and checking the MSB of the
result: (((mask) << (lisp_word)) < 0).
> Of course this whole business is entirely x86-specific: on other platforms, the 'obvious' code is as good or better.
I think of it as the equivalent of fsincos: you're usually going to need
both, so you might as well compute what we optimistically assume is the
untagged (aligned) pointer while checking the tag. I don't think current
CPUs care much about an aligned pointer being present in a register for
prefetching or speculation.
> It's also compiler-specific: recent versions of Clang normalise and usually emit the 'obvious' code for both.
That's probably not a bad decision on their part, as many such clever
optimizations are no longer worth it and normalizing them is a net gain.
> Thus we should keep the current 'clever' TAGGEDP for GCC on x86, for
> now, and use the 'obvious' form for other compilers and
> targets. Suggested patch attached.
Given the weird 32-bit cast, I agree, though maybe we should simply use
the version without the cast in the general case (or maybe in all cases:
GCC is now smart enough to use a bit test for FIXNUMP(x) || FLOATP(x),
so maybe its CSE has also improved and it can reuse the pointer?).
> Addendum: FIXNUMP can also be coded as
>
> TAGGEDP(X,0b010) || TAGGEDP(X,0b010)
>
> which seems daft but could be better when testing other types at the same time, as in FIXNUMP(X) || FLOATP(X), because it exposes the tag to CSE.
I think you mean TAGGEDP(X, 0b010) || TAGGEDP(X, 0b110)? I think that
might be easier for humans to understand, particularly if they're not
fully aware of the double fixnum tag.
(I'm still planning to distinguish conses and "fat conses" based on the
4th tag bit; it would not help to change all other users of TAGGEDP to
check two tags instead of one).
Pip
Information forwarded
to
bug-gnu-emacs <at> gnu.org:
bug#80021; Package
emacs.
(Thu, 18 Dec 2025 05:59:02 GMT)
Full text and
rfc822 format available.
Message #14 received at 80021 <at> debbugs.gnu.org (full text, mbox):
> Cc: 80021 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, pipcet <at> protonmail.com
> Date: Wed, 17 Dec 2025 09:40:51 -0500
> From: Stefan Monnier via "Bug reports for GNU Emacs,
> the Swiss army knife of text editors" <bug-gnu-emacs <at> gnu.org>
>
> > Thus we should keep the current 'clever' TAGGEDP for GCC on x86, for now,
> > and use the 'obvious' form for other compilers and targets. Suggested
> > patch attached.
>
> FWIW, I'd prefer to use the same code for all architectures.
> There's already way too many `#if`s in that code.
I tend to agree. In addition to the proliferation of #ifdef's, the
probability is quite low that we will always have experts on board who
understand the intricacies and fine points of this stuff on such a low
level. So this will become a maintenance headache, given the
relatively high rate of introducing new processors and versions of
processors into the industry.
So unless this change has a dramatic effect on performance in
real-life usage of Emacs (as opposed to synthetic benchmarks), I'd
rather we didn't install this kind of changes.
Information forwarded
to
bug-gnu-emacs <at> gnu.org:
bug#80021; Package
emacs.
(Thu, 18 Dec 2025 16:51:02 GMT)
Full text and
rfc822 format available.
Message #17 received at submit <at> debbugs.gnu.org (full text, mbox):
17 dec. 2025 kl. 20.48 skrev Pip Cet <pipcet <at> protonmail.com>:
> IIRC, the initial idea was that the untagged pointer would now be in
> "r", ready to be used, but I'm not sure GCC performs that optimization;
> today's major platforms, AArch64 and x86, both have unscaled memory
> offsets (LDU on AArch64, ordinary loads on x86) so it doesn't matter
> much how the tag bits are removed or whether that happens once or twice.
Indeed, the construct doesn't seem to be very useful that way, even if the truncating cast is dropped.
>> and the second version is likely a cycle slower if the first test fails.
>
> A whole cycle?!
No, sorry about that, there's indeed no dependence there. There is an extra instruction to be tracked which adds to the execution resource pressure, and the code becomes longer, but there is no latency directly due to dependence.
> On x86, checking two tags at once is probably done most compactly by
> generating a 32-bit mask, then left-shifting by the value of the lisp
> word (which ignores all but the LSB 5 bits) and checking the MSB of the
> result: (((mask) << (lisp_word)) < 0).
That's indeed what compilers typically generate once they understand what the test is about (with some variations) but, not with the 'clever' test.
> I think of it as the equivalent of fsincos: you're usually going to need
> both, so you might as well compute what we optimistically assume is the
> untagged (aligned) pointer while checking the tag.
But we often don't need the untagged pointer at that point, and it doesn't cost much (not even latency as there's no dependence; see above) even when we do.
>> It's also compiler-specific: recent versions of Clang normalise and usually emit the 'obvious' code for both.
>
> That's probably not a bad decision on their part, as many such clever
> optimizations are no longer worth it and normalizing them is a net gain.
Definitely, it shouldn't matter how we formulate the condition.
>> Thus we should keep the current 'clever' TAGGEDP for GCC on x86, for
>> now, and use the 'obvious' form for other compilers and
>> targets. Suggested patch attached.
>
> Given the weird 32-bit cast, I agree, though maybe we should simply use
> the version without the cast in the general case (or maybe in all cases:
> GCC is now smart enough to use a bit test for FIXNUMP(x) || FLOATP(x),
> so maybe its CSE has also improved and it can reuse the pointer?).
I wouldn't mind using (X&7)==T everywhere and trusting GCC to do its job, but the reason for the current code was apparently dissatisfaction with its ability to do just that. The current code is clearly smaller but not faster (on the old machine used for testing).
>> TAGGEDP(X,0b010) || TAGGEDP(X,0b010)
>>
>> which seems daft but could be better when testing other types at the same time, as in FIXNUMP(X) || FLOATP(X), because it exposes the tag to CSE.
>
> I think you mean TAGGEDP(X, 0b010) || TAGGEDP(X, 0b110)?
Yes, sorry about the typo.
Reply sent
to
Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility.
(Fri, 19 Dec 2025 17:14:02 GMT)
Full text and
rfc822 format available.
Notification sent
to
Mattias Engdegård <mattias.engdegard <at> gmail.com>:
bug acknowledged by developer.
(Fri, 19 Dec 2025 17:14:02 GMT)
Full text and
rfc822 format available.
Message #22 received at 80021-done <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On 2025-12-18 08:50, Mattias Engdegård wrote:
> I wouldn't mind using (X&7)==T everywhere and trusting GCC to do its job, but the reason for the current code was apparently dissatisfaction with its ability to do just that. The current code is clearly smaller but not faster (on the old machine used for testing).
The 2018 benchmarks[1] that prompted the tricky TAGGEDP business were
done on an AMD Phenom II X4 910e, a circa-2010 "Deneb" product that is
no longer a good representative of what Emacs currently runs on (though
it's still the CPU in my old office desktop at UCLA!). Also, they
predate native compilation.
I just now did similar benchmarks, this time with yesterday's Emacs
master (commit 5e6df127a96fe10b3f659626d64e9ba13fd12873) running Ubuntu
25.10 which uses gcc (Ubuntu 15.2.0-4ubuntu4), and on an Intel Xeon
W-1350, a circa-2021 "Rocket Lake-S" product that is a better
representative nowadays. Ubuntu's GCC, unlike the 2018 benchmarks, uses
-fstack-protector. And the new benchmarks are based on current (not
2018) Emacs sources and these use native compilation.
On these benchmarks the tricky TAGGEDP business still shrank the temacs
text segment (in this case by 0.6%), but on the "make compile-always"
benchmark they made Emacs 3.0% slower.
So let's just go back to using (X&7)==T everywhere, as you suggest. This
should address the comments on this bug report. I installed the attached
patch to do that, and am boldly closing the bug report.
[1]: https://lists.gnu.org/r/emacs-devel/2018-08/msg00845.html
[0001-Revert-to-simpler-and-we-hope-faster-TAGGEDP.patch (text/x-patch, attachment)]
This bug report was last modified today.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.