GNU bug report logs - #27270
display-raw-bytes-as-hex generates ambiguous output for Emacs strings

Package: emacs;

Reported by: Paul Eggert <eggert <at> cs.ucla.edu>

Date: Wed, 7 Jun 2017 03:59:01 UTC

Severity: wishlist

Tags: moreinfo

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 27270 in the body.
You can then email your comments to 27270 AT debbugs.gnu.org in the normal way.

Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Wed, 07 Jun 2017 03:59:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to Paul Eggert <eggert <at> cs.ucla.edu>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Wed, 07 Jun 2017 03:59:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Emacs bug reports and feature requests <bug-gnu-emacs <at> gnu.org>
Cc: Vasilij Schneidermann <v.schneidermann <at> gmail.com>
Subject: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
Date: Tue, 6 Jun 2017 20:57:51 -0700

With the default octal display format one can copy text out of a terminal window 
and into an Emacs string, reliably. With the new hex display this doesn't work 
any more, unfortunately. For example, if I run this shell script:

printf 'x\2205y\n' >foo.txt
LC_ALL=C emacs -nw --color=no --eval '(progn (setq display-raw-bytes-as-hex t) 
(find-file-literally "foo.txt"))'

then on the terminal display I see:

x\x905y

If I cut and paste this (using my windowing system) into an Emacs string, like this:

"x\x905y"

and then evaluate the string, the result is the string "xअy", that is, a 
3-character string with the characters "x", "अ", and "y", where the middle 
character is U+0905 DEVANAGARI LETTER A. This is an incorrect representation, as 
the buffer actually contains the four characters "x", "\x90", "5", and "y". The 
problem is that the string has glued together the representation of the 
character "\x90" to the representation of the character "5", resulting in the 
representation of the character "\x905" which is not accurate.

Please change the behavior of display-raw-bytes-as-hex so that it is not 
ambiguous in this way.

A simple solution would be to display this instead:

x\x90\x35y

though that is awkward because it means the ASCII 0-9, a-f, A-F would be 
displayed as hexadecimal escapes when they follow another hexadecimal escape. 
Perhaps we can think of a better approach. One possibility would be to define 
and use a new string escape \Xxx that contains at most two hex digits.

By the way, I expected display-raw-bytes-as-hex to affect how Emacs displays 
Emacs strings, too. Shouldn't it?
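
To make the ambiguity concrete, here is a minimal Python sketch of the reader rules at issue. It is only a model, not Emacs code: the name read_escapes is hypothetical, only \x, octal, and backslash-space escapes are handled, and a raw byte is represented by its plain byte value rather than by the way Emacs stores it internally.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")
OCT_DIGITS = set("01234567")


def read_escapes(text):
    """Decode a small subset of Emacs Lisp string escapes into code points.

    Modelled rules (a simplification of the real reader):
      \\x   followed by any number of hex digits
      \\NNN one to three octal digits
      "\\ " reads as nothing, but ends a preceding escape
    Any other backslash escape is taken as the escaped character itself.
    """
    codes = []
    i = 0
    while i < len(text):
        if text[i] != "\\" or i + 1 == len(text):
            codes.append(ord(text[i]))
            i += 1
            continue
        nxt = text[i + 1]
        if nxt == "x":
            j = i + 2
            # Hex escapes have no length limit: keep consuming hex digits.
            while j < len(text) and text[j] in HEX_DIGITS:
                j += 1
            if j == i + 2:
                raise ValueError("\\x with no hex digits is not modelled")
            codes.append(int(text[i + 2:j], 16))
            i = j
        elif nxt in OCT_DIGITS:
            j = i + 1
            # Octal escapes stop after at most three digits.
            while j < len(text) and j < i + 4 and text[j] in OCT_DIGITS:
                j += 1
            codes.append(int(text[i + 1:j], 8))
            i = j
        elif nxt == " ":
            i += 2
        else:
            codes.append(ord(nxt))
            i += 2
    return codes


print(read_escapes(r"x\x905y"))    # the hex escape swallows the "5"
print(read_escapes(r"x\2205y"))    # the octal escape stops after three digits
print(read_escapes(r"x\x90\x35y")) # escaping the "5" as well keeps it separate
```

Under these assumptions the pasted hex form yields three code points, while the octal form and the x\x90\x35y form both yield the four that are actually in the buffer.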

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Wed, 07 Jun 2017 05:18:02 GMT) Full text and rfc822 format available.

Message #8 received at 27270 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: v.schneidermann <at> gmail.com, 27270 <at> debbugs.gnu.org
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output for
 Emacs strings
Date: Wed, 07 Jun 2017 08:17:04 +0300

> From: Paul Eggert <eggert <at> cs.ucla.edu>
> Date: Tue, 6 Jun 2017 20:57:51 -0700
> Cc: Vasilij Schneidermann <v.schneidermann <at> gmail.com>
> 
> then on the terminal display I see:
> 
> x\x905y
> 
> If I cut and paste this (using my windowing system) into an Emacs string, like this:
> 
> "x\x905y"
> 
> and then evaluate the string, the result is the string "xअy"

display-raw-bytes-as-hex is a display-only feature, as its name tells,
it isn't supposed to affect evaluation or the Lisp reader.  So I'm
unsure why you expected it to affect evaluation.  It's the same if you
define a display table to display one character as another, and then
expect Emacs to perform the opposite transformation when it reads
characters or strings.

> A simple solution would be to display this instead:
> 
> x\x90\x35y

That would mean display-raw-bytes-as-hex is "viral", in that it
affects not just the raw byte, but also the next character.  That
sounds sub-optimal, as it makes reading the result harder.

> though that is awkward because it means the ASCII 0-9, a-f, A-F would be 
> displayed as hexadecimal escapes when they follow another hexadecimal escape. 

Exactly.

> By the way, I expected display-raw-bytes-as-hex to affect how Emacs displays 
> Emacs strings, too. Shouldn't it?

What do you mean by "Emacs strings"?  Buffer text is a string, isn't
it?

Added indication that bug 27270 blocks 24655 Request was from Glenn Morris <rgm <at> gnu.org> to control <at> debbugs.gnu.org. (Wed, 07 Jun 2017 17:45:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Thu, 08 Jun 2017 00:50:02 GMT) Full text and rfc822 format available.

Message #13 received at 27270 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: v.schneidermann <at> gmail.com, 27270 <at> debbugs.gnu.org
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output
 for Emacs strings
Date: Wed, 7 Jun 2017 17:49:41 -0700

On 06/06/2017 10:17 PM, Eli Zaretskii wrote:

> What do you mean by "Emacs strings"?

I meant that if I prefer hex to octal for buffer escapes, then when I 
type this into *scratch*:

  (format "J%cK" ?\u0080) C-j

I almost surely would prefer to see the result displayed as hexadecimal 
than as "J\200K" (the current behavior). People who prefer hex in one 
place are quite likely to prefer it in the other.

Here's another suggestion for the buffer problem: separate problematic 
character pairs by "\ " in the buffer display. That way, my test case 
would be displayed this way in a buffer:

  x\x90\ 5y

and this will work as expected when cut and pasted into a string, due to 
the backslash-space syntax already supported for strings. This buffer 
syntax would be less confusing than the "x\x905y" syntax that is 
currently used. Under this approach character pair XY is considered to 
be problematic if X is displayed with a hexadecimal escape and Y is a 
hexadecimal digit.
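
The backslash-space rule described above could be sketched as follows. This is a Python model under stated assumptions, not the Emacs display code: display_hex is a hypothetical name, only bytes 0x80 and above are treated as raw bytes, the \xNN spelling with two lower-case digits is assumed, and control characters are not handled.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def display_hex(data: bytes) -> str:
    """Render bytes the way the proposal describes.

    Bytes >= 0x80 become \\xNN.  When such an escape is immediately
    followed by an ASCII hex digit, a backslash-space separator is
    emitted first so the digit cannot be read as part of the escape.
    """
    out = []
    prev_was_escape = False
    for b in data:
        if b >= 0x80:
            out.append("\\x%02x" % b)
            prev_was_escape = True
        else:
            ch = chr(b)
            if prev_was_escape and ch in HEX_DIGITS:
                out.append("\\ ")  # problematic pair: separate it
            out.append(ch)
            prev_was_escape = False
    return "".join(out)


print(display_hex(b"x\x905y"))  # the test case from this report
print(display_hex(b"x\x90zy"))  # "z" is not a hex digit, so no separator
```

Only pairs where an escape is followed by a hex digit pay the cost of the extra two characters; everything else is displayed as it is today.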

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Thu, 08 Jun 2017 01:06:02 GMT) Full text and rfc822 format available.

Message #16 received at 27270 <at> debbugs.gnu.org (full text, mbox):

From: npostavs <at> users.sourceforge.net
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: v.schneidermann <at> gmail.com, Eli Zaretskii <eliz <at> gnu.org>,
 27270 <at> debbugs.gnu.org
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output
 for Emacs strings
Date: Wed, 07 Jun 2017 21:07:25 -0400

Paul Eggert <eggert <at> cs.ucla.edu> writes:

> On 06/06/2017 10:17 PM, Eli Zaretskii wrote:
>
>> What do you mean by "Emacs strings"?
>
> I meant that if I prefer hex to octal for buffer escapes, then when I
> type this into *scratch*:
>
>   (format "J%cK" ?\u0080) C-j
>
> I almost surely would prefer to see the result displayed as
> hexadecimal than as "J\200K" (the current behavior).

display-raw-bytes-as-hex does affect the result display for me (of
course, since the result goes into the buffer), doesn't it for you?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Thu, 08 Jun 2017 15:21:02 GMT) Full text and rfc822 format available.

Message #19 received at 27270 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: npostavs <at> users.sourceforge.net
Cc: v.schneidermann <at> gmail.com, 27270 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output
 for Emacs strings
Date: Thu, 08 Jun 2017 18:20:41 +0300

> From: npostavs <at> users.sourceforge.net
> Cc: Eli Zaretskii <eliz <at> gnu.org>,  27270 <at> debbugs.gnu.org,  v.schneidermann <at> gmail.com
> Date: Wed, 07 Jun 2017 21:07:25 -0400
> 
> > I meant that if I prefer hex to octal for buffer escapes, then when I
> > type this into *scratch*:
> >
> >   (format "J%cK" ?\u0080) C-j
> >
> > I almost surely would prefer to see the result displayed as
> > hexadecimal than as "J\200K" (the current behavior).
> 
> display-raw-bytes-as-hex does affect the result display for me (of
> course, since the result goes into the buffer), doesn't it for you?

Likewise here.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Thu, 08 Jun 2017 15:57:02 GMT) Full text and rfc822 format available.

Message #22 received at 27270 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: npostavs <at> users.sourceforge.net
Cc: v.schneidermann <at> gmail.com, Eli Zaretskii <eliz <at> gnu.org>,
 27270 <at> debbugs.gnu.org
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output
 for Emacs strings
Date: Thu, 8 Jun 2017 08:56:31 -0700

On 06/07/2017 06:07 PM, npostavs <at> users.sourceforge.net wrote:
> display-raw-bytes-as-hex does affect the result display for me (of
> course, since the result goes into the buffer), doesn't it for you?

Sorry, it didn't when I tried it earlier, but apparently I messed up. 
Yes, it does affect the display.

But this means the problem is even worse than I thought. If I evaluate 
this in *scratch* in a terminal session running emacs -nw:

(setq display-raw-bytes-as-hex t) C-j
(format "%c%c" ?\u0090 ?5) C-j

Emacs displays this:

"\x905"

which is the wrong string visually. And if I cut this string out of the 
terminal window and paste it into another terminal window running Emacs, 
I'll get "अ" (a string containing the single character U+0905 DEVANAGARI 
LETTER A), which is indeed the wrong string. The string should be 
displayed unambiguously, either like this:

"\x90\ 5"

or via some other means.

The bottom line is that the visual display of buffers and strings should 
continue to be unambiguous even when display-raw-bytes-as-hex is t.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Thu, 08 Jun 2017 16:12:02 GMT) Full text and rfc822 format available.

Message #25 received at 27270 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: v.schneidermann <at> gmail.com, 27270 <at> debbugs.gnu.org,
 npostavs <at> users.sourceforge.net
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output
 for Emacs strings
Date: Thu, 08 Jun 2017 19:11:19 +0300

> Cc: Eli Zaretskii <eliz <at> gnu.org>, 27270 <at> debbugs.gnu.org,
>  v.schneidermann <at> gmail.com
> From: Paul Eggert <eggert <at> cs.ucla.edu>
> Date: Thu, 8 Jun 2017 08:56:31 -0700
> 
> (setq display-raw-bytes-as-hex t) C-j
> (format "%c%c" ?\u0090 ?5) C-j
> 
> Emacs displays this:
> 
> "\x905"
> 
> which is the wrong string visually.

How is that different from "\2205" you get under the default settings?

> The string should be 
> displayed unambiguously, either like this:
> 
> "\x90\ 5"
> 
> or via some other means.

We do use "some other means": the raw byte has a different face.  But
if you evaluate the above in *scratch*, you won't see that because of
font-lock.  Turn off font-lock-mode, and you will clearly see where
the raw byte ends and "normal" text begins.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Thu, 08 Jun 2017 16:26:02 GMT) Full text and rfc822 format available.

Message #28 received at 27270 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: v.schneidermann <at> gmail.com, 27270 <at> debbugs.gnu.org,
 npostavs <at> users.sourceforge.net
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output
 for Emacs strings
Date: Thu, 8 Jun 2017 09:24:56 -0700

On 06/08/2017 09:11 AM, Eli Zaretskii wrote:
>> (setq display-raw-bytes-as-hex t) C-j
>> (format "%c%c" ?\u0090 ?5) C-j
>>
>> Emacs displays this:
>>
>> "\x905"
>>
>> which is the wrong string visually.
> How is that different from "\2205" you get under the default settings?

When I cut and paste "\2205" into another Emacs, it evaluates to the 
same two-character string that I started off with because octal escapes 
are limited to 3 octal digits. When I cut and paste "\x905" I get a 
one-character string because there is no limit to the length of 
hexadecimal escapes. This is a problem, because cut-and-paste should 
continue to copy text accurately even when I'm using terminal windows.

>> The string should be
>> displayed unambiguously, either like this:
>>
>> "\x90\ 5"
>>
>> or via some other means.
> We do use "some other means": the raw byte has a different face.

That doesn't help when --color=no is specified, or in terminal sessions 
that do not support colors. And the colors, even when present, do not 
survive cutting and pasting, which copies the text without colors. So 
this is a real problem.

> But if you evaluate the above in *scratch*, you won't see that because of
> font-lock.  Turn off font-lock-mode, and you will clearly see where
> the raw byte ends and "normal" text begins.

Turning off font-lock-mode doesn't help when colors are disabled. I 
often run with colors disabled, since my terminal color scheme disagrees 
with Emacs's and I prefer monochrome anyway. So this ambiguity will be a 
real pain for me.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Thu, 08 Jun 2017 19:01:02 GMT) Full text and rfc822 format available.

Message #31 received at 27270 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: v.schneidermann <at> gmail.com, 27270 <at> debbugs.gnu.org,
 npostavs <at> users.sourceforge.net
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output
 for Emacs strings
Date: Thu, 08 Jun 2017 21:59:56 +0300

> Cc: npostavs <at> users.sourceforge.net, 27270 <at> debbugs.gnu.org,
>  v.schneidermann <at> gmail.com
> From: Paul Eggert <eggert <at> cs.ucla.edu>
> Date: Thu, 8 Jun 2017 09:24:56 -0700
> 
> >> "\x905"
> >>
> >> which is the wrong string visually.
> > How is that different from "\2205" you get under the default settings?
> 
> When I cut and paste "\2205" into another Emacs, it evaluates to the 
> same two-character string that I started off with because octal escapes 
> are limited to 3 octal digits.

That's a different issue.  You said "\x905" was wrong visually, so I
asked how is that different, visually, from "\2205".

> When I cut and paste "\x905" I get a 
> one-character string because there is no limit to the length of 
> hexadecimal escapes. This is a problem, because cut-and-paste should 
> continue to copy text accurately even when I'm using terminal windows.

Same thing happens when you copy/paste from an Emacs window which uses
a display table: the pasted string will be different from the original
one.  I believe I already pointed that out in this discussion.

> >> "\x90\ 5"
> >>
> >> or via some other means.
> > We do use "some other means": the raw byte has a different face.
> 
> That doesn't help when --color=no is specified, or in terminal sessions 
> that do not support colors.

In those cases, the octal notation has the same visual problems.

> I prefer monochrome anyway. So this ambiguity will be a real pain
> for me.

I still don't understand how this is different from the octal
notation, but if it is, you can always stay with the default octal
display.  That's what I do.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Thu, 08 Jun 2017 19:44:02 GMT) Full text and rfc822 format available.

Message #34 received at 27270 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: v.schneidermann <at> gmail.com, 27270 <at> debbugs.gnu.org,
 npostavs <at> users.sourceforge.net
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output
 for Emacs strings
Date: Thu, 8 Jun 2017 12:43:38 -0700

On 06/08/2017 11:59 AM, Eli Zaretskii wrote:
> That's a different issue. You said "\x905" was wrong visually, so I
> asked how is that different, visually, from "\2205".

It's wrong visually, because I know the syntax for strings in Emacs 
Lisp, and I know that "\x905" is supposed to be a 1-character string 
whereas "\2205" is a two-character string.

> Same thing happens when you copy/paste from an Emacs window which uses
> a display table

The difference is that I don't use display tables and don't want to use 
them. In contrast, I would like to use hexadecimal display, if it worked 
as well as octal does (which it does not).

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Thu, 08 Jun 2017 19:57:01 GMT) Full text and rfc822 format available.

Message #37 received at 27270 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: v.schneidermann <at> gmail.com, 27270 <at> debbugs.gnu.org,
 npostavs <at> users.sourceforge.net
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output
 for Emacs strings
Date: Thu, 08 Jun 2017 22:56:21 +0300

> Cc: npostavs <at> users.sourceforge.net, 27270 <at> debbugs.gnu.org,
>  v.schneidermann <at> gmail.com
> From: Paul Eggert <eggert <at> cs.ucla.edu>
> Date: Thu, 8 Jun 2017 12:43:38 -0700
> 
> On 06/08/2017 11:59 AM, Eli Zaretskii wrote:
> > That's a different issue. You said "\x905" was wrong visually, so I
> > asked how is that different, visually, from "\2205".
> 
> It's wrong visually, because I know the syntax for strings in Emacs 
> Lisp, and I know that "\x905" is supposed to be a 1-character string 
> whereas "\2205" is a two-character string.

How do you know "\2205" is a two character string?

What about this:

  (aset printable-chars #x3fffc nil) C-j
  (format "%c%c" #x3fffc ?5) C-j

Where does the octal codepoint end now?

> > Same thing happens when you copy/paste from an Emacs window which uses
> > a display table
> 
> The difference is that I don't use display tables and don't want to use 
> them. In contrast, I would like to use hexadecimal display, if it worked 
> as well as octal does (which it does not).

Then we need to code a separate feature in the Lisp reader, I think.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Thu, 08 Jun 2017 20:36:01 GMT) Full text and rfc822 format available.

Message #40 received at 27270 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: v.schneidermann <at> gmail.com, 27270 <at> debbugs.gnu.org,
 npostavs <at> users.sourceforge.net
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output
 for Emacs strings
Date: Thu, 8 Jun 2017 13:35:45 -0700

On 06/08/2017 12:56 PM, Eli Zaretskii wrote:
> How do you know "\2205" is a two character string

Because I use Emacs out of the box, with the default printable-chars.

>
>> The difference is that I don't use display tables and don't want to use
>> them. In contrast, I would like to use hexadecimal display, if it worked
>> as well as octal does (which it does not).
> Then we need to code a separate feature in the Lisp reader, I think.

What do you think of using capital X for hexadecimal escapes with at 
most two digits? That way, "\X905" would be a two-character string, 
which is what is wanted here. Or we could use small h for hexadecimal, 
and "\h905".

If we were feeling ambitious and concise, we could use no character at 
all and upper-case hex digits for bytes in the range 0x80 through 0xFF; 
this would be unambiguous in strings (the example would be "\905"). This 
may be a bridge too far, though.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Fri, 09 Jun 2017 06:01:02 GMT) Full text and rfc822 format available.

Message #43 received at 27270 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: v.schneidermann <at> gmail.com, 27270 <at> debbugs.gnu.org,
 npostavs <at> users.sourceforge.net
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output
 for Emacs strings
Date: Fri, 09 Jun 2017 09:00:26 +0300

> Cc: npostavs <at> users.sourceforge.net, 27270 <at> debbugs.gnu.org,
>  v.schneidermann <at> gmail.com
> From: Paul Eggert <eggert <at> cs.ucla.edu>
> Date: Thu, 8 Jun 2017 13:35:45 -0700
> 
> On 06/08/2017 12:56 PM, Eli Zaretskii wrote:
> > How do you know "\2205" is a two character string
> 
> Because I use Emacs out of the box, with the default printable-chars.

That's just sheer luck, then, not a general solution that works for
everybody.  And it's not unimaginable that we will mark more
codepoints printable at some point, given some development in the
Unicode standard or in Emacs.

> >> The difference is that I don't use display tables and don't want to use
> >> them. In contrast, I would like to use hexadecimal display, if it worked
> >> as well as octal does (which it does not).
> > Then we need to code a separate feature in the Lisp reader, I think.
> 
> What do you think of using capital X for hexadecimal escapes with at 
> most two digits? That way, "\X905" would be a two-character string, 
> which is what is wanted here. Or we could use small h for hexadecimal, 
> and "\h905".

I'm okay, but I'm not sure I understand how does this fix your
problem.  Can you explain?

> If we were feeling ambitious and concise, we could use no character at 
> all and upper-case hex digits for bytes in the range 0x80 through 0xFF; 
> this would be unambiguous in strings (the example would be "\905"). This 
> may be a bridge too far, though.

Too far, I agree.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Fri, 09 Jun 2017 23:45:02 GMT) Full text and rfc822 format available.

Message #46 received at 27270 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: v.schneidermann <at> gmail.com, 27270 <at> debbugs.gnu.org,
 npostavs <at> users.sourceforge.net
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output
 for Emacs strings
Date: Fri, 9 Jun 2017 16:44:46 -0700

Eli Zaretskii wrote:
>> What do you think of using capital X for hexadecimal escapes with at
>> most two digits? That way, "\X905" would be a two-character string,
>> which is what is wanted here. Or we could use small h for hexadecimal,
>> and "\h905".
> I'm okay, but I'm not sure I understand how does this fix your
> problem.  Can you explain?
> 

The idea is to add a new \X escape for character constants and strings. This 
escape would allow at most two hexadecimal digits, rather than the unlimited 
number of digits that \x does. For example, the Lisp string "\XABC" would be 
equivalent to the Lisp string "\xAB\ C", that is, it would be a two-character 
string containing the character U+00AB LEFT POINTING GUILLEMET followed by the 
character U+0043 LATIN CAPITAL LETTER C.

Also, display-raw-bytes-as-hex would cause raw bytes to be displayed with this 
new X escape, rather than with the x escape.

This would fix my problem, since I would continue to be able to copy text 
displayed in a terminal window, and paste it into an Emacs string, and get the 
text unaltered even if display-raw-bytes-as-hex is t.
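
As an illustration of the proposed rule, here is a short Python model of a reader that accepts \X with at most two hex digits alongside the existing unlimited \x. It is a sketch only: \X is a proposal in this thread, not an existing Emacs escape, the names TOKEN and read_string are hypothetical, and all other backslash escapes are simply taken literally.

```python
import re

TOKEN = re.compile(
    r"\\X(?P<short>[0-9A-Fa-f]{1,2})"  # proposed: at most two hex digits
    r"|\\x(?P<long>[0-9A-Fa-f]+)"      # existing: unlimited hex digits
    r"|(?P<sep>\\ )"                   # backslash-space: reads as nothing
    r"|\\(?P<other>.)"                 # any other escape: taken literally here
    r"|(?P<plain>.)",                  # ordinary character
    re.DOTALL,
)


def read_string(text):
    """Return the code points a string body would read as in this model."""
    codes = []
    for m in TOKEN.finditer(text):
        if m.group("short") is not None:
            codes.append(int(m.group("short"), 16))
        elif m.group("long") is not None:
            codes.append(int(m.group("long"), 16))
        elif m.group("sep") is not None:
            continue
        elif m.group("other") is not None:
            codes.append(ord(m.group("other")))
        else:
            codes.append(ord(m.group("plain")))
    return codes


print(read_string(r"\XABC"))    # two characters: 0xAB, then "C"
print(read_string(r"\xAB\ C"))  # the equivalent spelling with today's syntax
print(read_string(r"\xABC"))    # one character: 0xABC
```

In this model "\XABC" and "\xAB\ C" read identically, which is the equivalence the proposal relies on.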

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Sat, 10 Jun 2017 07:25:01 GMT) Full text and rfc822 format available.

Message #49 received at 27270 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: v.schneidermann <at> gmail.com, 27270 <at> debbugs.gnu.org,
 npostavs <at> users.sourceforge.net
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output
 for Emacs strings
Date: Sat, 10 Jun 2017 10:24:25 +0300

> Cc: npostavs <at> users.sourceforge.net, 27270 <at> debbugs.gnu.org,
>  v.schneidermann <at> gmail.com
> From: Paul Eggert <eggert <at> cs.ucla.edu>
> Date: Fri, 9 Jun 2017 16:44:46 -0700
> 
> The idea is to add a new \X escape for character constants and strings. This 
> escape would allow at most two hexadecimal digits, rather than the unlimited 
> number of digits that \x does. For example, the Lisp string "\XABC" would be 
> equivalent to the Lisp string "\xAB\ C", that is, it would be a two-character 
> string containing the character U+00AB LEFT POINTING GUILLEMET followed by the 
> character U+0043 LATIN CAPITAL LETTER C.

So your proposal would mean a change to the Lisp reader to support
such escapes, right?  If so, isn't such a change
backward-incompatible?

> Also, display-raw-bytes-as-hex would cause raw bytes to be displayed with this 
> new X escape, rather than with the x escape.

It could only do that for codepoints below 256 decimal, so that
limitation should be taken into account when deciding on the proposal.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Sat, 10 Jun 2017 22:52:02 GMT) Full text and rfc822 format available.

Message #52 received at 27270 <at> debbugs.gnu.org (full text, mbox):

From: npostavs <at> users.sourceforge.net
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: v.schneidermann <at> gmail.com, Eli Zaretskii <eliz <at> gnu.org>,
 27270 <at> debbugs.gnu.org
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output
 for Emacs strings
Date: Sat, 10 Jun 2017 18:52:31 -0400

severity 27270 wishlist
quit

Paul Eggert <eggert <at> cs.ucla.edu> writes:

> But this means the problem is even worse than I thought. If I evaluate
> this in *scratch* in a terminal session running emacs -nw:
>
> (setq display-raw-bytes-as-hex t) C-j
> (format "%c%c" ?\u0090 ?5) C-j

I wonder what you do about low bytes, as in (format "^G%c" ?\a).  Do
those just not come up very much?  It's too bad there's no copying
counterpart to bracketed paste mode...

Severity set to 'wishlist' from 'normal' Request was from npostavs <at> users.sourceforge.net to control <at> debbugs.gnu.org. (Sat, 10 Jun 2017 22:52:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Sun, 11 Jun 2017 00:05:02 GMT) Full text and rfc822 format available.

Message #57 received at 27270 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: v.schneidermann <at> gmail.com, 27270 <at> debbugs.gnu.org,
 npostavs <at> users.sourceforge.net
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output
 for Emacs strings
Date: Sat, 10 Jun 2017 17:04:40 -0700

On 06/10/2017 12:24 AM, Eli Zaretskii wrote:
> So your proposal would mean a change to the Lisp reader to support
> such escapes, right?  If so, isn't such a change
> backward-incompatible?

Yes, but only in the sense that undocumented escapes evaluate to 
themselves, e.g., "\F" is currently the same as "F" in Emacs Lisp 
because there is no escape sequence \F currently defined for character 
constants. But there's nothing new here, e.g., when we added "\N{...}" 
last year we changed the interpretation of the formerly-undocumented \N 
escape.

>> Also, display-raw-bytes-as-hex would cause raw bytes to be displayed with this
>> new X escape, rather than with the x escape.
> It could only do that for codepoints below 256 decimal, so that
> limitation should be taken into account when deciding on the proposal.

Ouch, I hadn't thought of that.

Wait -- doesn't that mean that "display-raw-bytes-as-hex" is a 
misleading name, because it affects the display not only of raw bytes, 
but of other undisplayable characters? Shouldn't we change its name to 
something more generic and more accurate, like "display-characters-as-hex"?

Anyway, to address the point you raised: how about a different idea? We 
extend the existing \x syntax in strings so that \x{dddd} has the same 
meaning as "\xdddd", except that the "}" terminates the escape. This 
syntax is used by Perl and so is in the same family as \N{...}. We also 
change display-raw-bytes-as-hex to use this new syntax when a character 
is immediately followed by a hexadecimal digit. That way, most 
characters are displayed as before, but my problematic example is 
displayed as "x\x{90}5y", which is a good visual cue of the unusual 
situation.
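
A rough Python model of this idea, covering both the display side and a matching reader, might look like the following. The \x{...} syntax is only a proposal here, the names display_braced and read_braced are hypothetical, and the model treats bytes 0x80 and above as the only characters that get escaped.

```python
import re

HEX_DIGITS = set("0123456789abcdefABCDEF")


def display_braced(data: bytes) -> str:
    """Show bytes >= 0x80 as \\xNN, or as \\x{NN} when the next
    byte is an ASCII hex digit that would otherwise extend the escape."""
    out = []
    for i, b in enumerate(data):
        if b < 0x80:
            out.append(chr(b))
            continue
        nxt = data[i + 1] if i + 1 < len(data) else None
        if nxt is not None and nxt < 0x80 and chr(nxt) in HEX_DIGITS:
            out.append("\\x{%02x}" % b)
        else:
            out.append("\\x%02x" % b)
    return "".join(out)


TOKEN = re.compile(
    r"\\x\{([0-9A-Fa-f]+)\}"  # proposed braced form: "}" ends the escape
    r"|\\x([0-9A-Fa-f]+)"     # existing form: runs to the last hex digit
    r"|(.)",                  # ordinary character
    re.DOTALL,
)


def read_braced(text):
    """Read back the displayed form into a list of code points."""
    codes = []
    for braced, bare, plain in TOKEN.findall(text):
        if braced:
            codes.append(int(braced, 16))
        elif bare:
            codes.append(int(bare, 16))
        else:
            codes.append(ord(plain))
    return codes


shown = display_braced(b"x\x905y")
print(shown)
print(read_braced(shown) == list(b"x\x905y"))
```

With these rules the displayed text reads back to the original bytes, and braces appear only in the unusual case where a hex digit follows the escape.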

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Sun, 11 Jun 2017 00:11:01 GMT) Full text and rfc822 format available.

Message #60 received at 27270 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: npostavs <at> users.sourceforge.net
Cc: v.schneidermann <at> gmail.com, Eli Zaretskii <eliz <at> gnu.org>,
 27270 <at> debbugs.gnu.org
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output
 for Emacs strings
Date: Sat, 10 Jun 2017 17:10:38 -0700

On 06/10/2017 03:52 PM, npostavs <at> users.sourceforge.net wrote:
> I wonder what you do about low bytes, as in (format "^G%c" ?\a).  Do
> those just not come up very much?

They didn't in my examples. :-)  But yes, they do happen, it's just that 
when they mess things up it tends to be more obvious. It might be nice, 
I suppose, if there were an option to make them not happen.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Sun, 11 Jun 2017 14:49:01 GMT) Full text and rfc822 format available.

Message #63 received at 27270 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: v.schneidermann <at> gmail.com, 27270 <at> debbugs.gnu.org,
 npostavs <at> users.sourceforge.net
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output
 for Emacs strings
Date: Sun, 11 Jun 2017 17:48:04 +0300

> Cc: npostavs <at> users.sourceforge.net, 27270 <at> debbugs.gnu.org,
>  v.schneidermann <at> gmail.com
> From: Paul Eggert <eggert <at> cs.ucla.edu>
> Date: Sat, 10 Jun 2017 17:04:40 -0700
> 
> On 06/10/2017 12:24 AM, Eli Zaretskii wrote:
> > So your proposal would mean a change to the Lisp reader to support
> > such escapes, right?  If so, isn't such a change
> > backward-incompatible?
> 
> Yes, but only in the sense that undocumented escapes evaluate to 
> themselves, e.g., "\F" is currently the same as "F" in Emacs Lisp 
> because there is no escape sequence \F currently defined for character 
> constants. But there's nothing new here, e.g., when we added "\N{...}" 
> last year we changed the interpretation of the formerly-undocumented \N 
> escape.

Then maybe the new hex display should use the \N{U+nnn} format?

> >> Also, display-raw-bytes-as-hex would cause raw bytes to be displayed with this
> >> new X escape, rather than with the x escape.
> > It could only do that for codepoints below 256 decimal, so that
> > limitation should be taken into account when deciding on the proposal.
> 
> Ouch, I hadn't thought of that.
> 
> Wait -- doesn't that mean that "display-raw-bytes-as-hex" is a 
> misleading name, because it affects the display not only of raw bytes, 
> but of other undisplayable characters?

That's true, but since the chances of a _user_ changing the
printable-chars char-table are pretty slim, I didn't think it was
justified to obfuscate the name.

> Shouldn't we change its name to 
> something more generic and more accurate, like "display-characters-as-hex"?

Codepoints whose printable-chars entry is nil cannot in good faith be
called "characters", IMO.  "Codepoints", maybe?  But again, that makes
the discoverability harder, so I'm not sure it's worth the hassle.

> Anyway, to address the point you raised: how about a different idea? We 
> extend the existing \x syntax in strings so that \x{dddd} has the same 
> meaning as "\xdddd", except that the "}" terminates the escape. This 
> syntax is used by Perl and so is in the same family as \N{...}. We also 
> change display-raw-bytes-as-hex to use this new syntax when a character 
> is immediately followed by a hexadecimal digit. That way, most 
> characters are displayed as before, but my problematic example is 
> displayed as "x\x{90}5y", which is a good visual cue of the unusual 
> situation.

See above: why not \N{U+...}?  The only downside is that it's much
longer than \xNN.  Could be another option, perhaps.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Sun, 11 Jun 2017 17:27:02 GMT) Full text and rfc822 format available.

Message #66 received at 27270 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: v.schneidermann <at> gmail.com, 27270 <at> debbugs.gnu.org,
 npostavs <at> users.sourceforge.net
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output
 for Emacs strings
Date: Sun, 11 Jun 2017 10:26:28 -0700

Eli Zaretskii wrote:
> Then maybe the new hex display should use the \N{U+nnn} format?

If we're going to do that, we might as well use \unnnn, which is shorter. A 
downside of either syntax, though, is the implication that the raw byte is 
intended to be Unicode, which it typically is not. That is partly why I was 
thinking \x{nn} would be better: it'd be clearer to users.

Removed indication that bug 27270 blocks Request was from Eli Zaretskii <eliz <at> gnu.org> to control <at> debbugs.gnu.org. (Sat, 02 Sep 2017 13:26:01 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Sat, 02 Sep 2017 13:27:02 GMT) Full text and rfc822 format available.

Message #71 received at 27270 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 27270 <at> debbugs.gnu.org, v.schneidermann <at> gmail.com,
 npostavs <at> users.sourceforge.net
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output
 for Emacs strings
Date: Sat, 02 Sep 2017 16:25:33 +0300

unblock 24655 by 27270
thanks

> From: Paul Eggert <eggert <at> cs.ucla.edu>
> Date: Sun, 11 Jun 2017 10:26:28 -0700
> Cc: v.schneidermann <at> gmail.com, 27270 <at> debbugs.gnu.org,
>  npostavs <at> users.sourceforge.net
> 
> Eli Zaretskii wrote:
> > Then maybe the new hex display should use the \N{U+nnn} format?
> 
> If we're going to do that, we might as well use \unnnn, which is shorter. A 
> downside of either syntax, though, is the implication that the raw byte is 
> intended to be Unicode, which it typically is not. That is partly why I was 
> thinking \x{nn} would be better: it'd be clearer to users.

In any case, since this is a "wishlist" bug report, I don't think it
should block the release of Emacs 26.1 (or any other version).

Thanks.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Sat, 23 Apr 2022 14:01:02 GMT) Full text and rfc822 format available.

Message #74 received at 27270 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Eli Zaretskii <eliz <at> gnu.org>, v.schneidermann <at> gmail.com,
 27270 <at> debbugs.gnu.org, npostavs <at> users.sourceforge.net
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output
 for Emacs strings
Date: Sat, 23 Apr 2022 16:00:31 +0200

[Message part 1 (text/plain, inline)]

Paul Eggert <eggert <at> cs.ucla.edu> writes:

> The idea is to add a new \X escape for character constants and
> strings. This escape would allow at most two hexadecimal digits,
> rather than the unlimited number of digits that \x does. For example,
> the Lisp string "\XABC" would be equivalent to the Lisp string
> "\xAB\ C", that is, it would be a two-character string containing the
> character U+00AB LEFT POINTING GUILLEMET followed by the character
> U+0043 LATIN CAPITAL LETTER C.

This was four years ago, but I don't think any steps were taken in this
direction, beyond marking the raw bytes more clearly:

[Message part 2 (image/png, inline)]

[Message part 3 (text/plain, inline)]

Even in *scratch*, where font-locking overrode those, I think?

The issue still remains -- if you do this in emacs -nw:

(format "%c5" 128)
"5"

And cut and paste that to a different Emacs, you get the string

"\x805"
=> "ࠅ"

But...  we've had this format for half a decade now, and this doesn't
really seem to be a problem in practice, so while the format is somewhat
ambiguous, I tend to think that introducing a new syntax just to fix it
isn't worth it.  Especially a syntax like \x{80}, which was one of the
suggestions -- the idea, after all, is to make display prettier and more
readable.

Any further opinions?

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

Added tag(s) moreinfo. Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Sat, 23 Apr 2022 14:01:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Sun, 24 Apr 2022 07:11:01 GMT) Full text and rfc822 format available.

Message #79 received at 27270 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: Eli Zaretskii <eliz <at> gnu.org>, v.schneidermann <at> gmail.com,
 27270 <at> debbugs.gnu.org, npostavs <at> users.sourceforge.net
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output
 for Emacs strings
Date: Sun, 24 Apr 2022 00:10:44 -0700

On 4/23/22 07:00, Lars Ingebrigtsen wrote:
> we've had this format for half a decade now, and this doesn't
> really seem to be a problem in practice

Not surprising, since most people don't set display-raw-bytes-as-hex. 
But that doesn't mean it's not a problem. Quoting bugs can be issues 
even if they're unlikely to occur at random. (Think SQL injection. :-)

> I tend to think that introducing a new syntax just to fix it
> isn't worth it.

That's fine, so let's fix the problem as originally suggested. That is, 
display the string returned by (format "%c%c" #x9e #x66) as "\x9e\x66" 
(equivalent to (concat "\x9e" "\x66") which is correct) instead of as 
"\x9ef" (equivalent to "\N{BENGALI DIGIT NINE}" which is wrong).

This fixes the problem and doesn't introduce new syntax.
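
A minimal sketch of the ambiguity, in Python rather than Emacs Lisp.
It assumes, as described in this thread, that the reader lets \x
consume every hexadecimal digit that follows and that backslash-space
reads as nothing. The function name is hypothetical and all other
escapes are ignored.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def read_greedy_hex(text):
    """Return the code points a greedy \\x reader would produce.

    Only \\x escapes and the backslash-space no-op are modelled.
    """
    codepoints = []
    i = 0
    while i < len(text):
        if text.startswith("\\x", i):
            j = i + 2
            # Greedy rule: keep consuming while hex digits follow.
            while j < len(text) and text[j] in HEX_DIGITS:
                j += 1
            if j == i + 2:
                raise ValueError("\\x without hexadecimal digits")
            codepoints.append(int(text[i + 2:j], 16))
            i = j
        elif text.startswith("\\ ", i):
            i += 2  # backslash-space reads as nothing
        else:
            codepoints.append(ord(text[i]))
            i += 1
    return codepoints


print(read_greedy_hex(r"\x9ef"))      # one code point, 0x9ef
print(read_greedy_hex(r"\x9e\x66"))   # two code points, 0x9e and 0x66
print(read_greedy_hex(r"\x9e\ f"))    # also 0x9e and 0x66
```

Under that model the displayed text \x9ef reads back as a single
character, while the other two spellings read back as the intended
pair.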

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Sun, 24 Apr 2022 09:57:01 GMT) Full text and rfc822 format available.

Message #82 received at 27270 <at> debbugs.gnu.org (full text, mbox):

From: Vasilij Schneidermann <v.schneidermann <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Lars Ingebrigtsen <larsi <at> gnus.org>, 27270 <at> debbugs.gnu.org,
 Eli Zaretskii <eliz <at> gnu.org>, npostavs <at> users.sourceforge.net
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output
 for Emacs strings
Date: Sun, 24 Apr 2022 11:56:04 +0200

> > I tend to think that introducing a new syntax just to fix it
> > isn't worth it.
>
> That's fine, so let's fix the problem as originally suggested. That is,
> display the string returned by (format "%c%c" #x9e #x66) as "\x9e\x66"
> (equivalent to (concat "\x9e" "\x66") which is correct) instead of as
> "\x9ef" (equivalent to "\N{BENGALI DIGIT NINE}" which is wrong).
>
> This fixes the problem and doesn't introduce new syntax.

Wait, hold up. Under which conditions exactly does the bug happen? If I
use GUI Emacs, thanks to font-lock it's pretty obvious that the output
is three bytes, the first one displayed using the hex escape syntax and
the remaining two bytes using hex letters.  If I copy-paste those into
another GUI Emacs, it's still the same three bytes. I don't know about
terminal Emacs, but trying to work around terminals being bad doesn't
seem worth the extra effort.

Besides, suppose it is worth it, what exactly should the logic be here?
Detect if there's a preceding hex escaped byte and if yes, display
adjacent bytes that are formatted using hex characters using escaping,
too? That seems too involved for something run in redisplay.

The other proposed alternative of tightening up read syntax seems
incompatible, but saner to me overall. Emacs Lisp is the odd one out
here anyway. Only C and C++ consider such sequences as potentially
having a greater length than 2 and they error out with a compilation
error for me.

    len("\x1234") # Python, Go: 3

    "\x1234".length # Ruby, JavaScript: 3

    length("\x1234") # Perl: 3

    (string-length "\x1234") ; Guile, Racket, CHICKEN: 3

    ;; Common Lisp absent because it lacks a lot of string escapes and
    ;; using FORMAT neatly sidesteps these issues

    ;; Clojure only has octal/unicode string escapes
    (count (seq "\u12345678")) ; Clojure: 5

    (length "\x1234") ; Emacs Lisp: 1

    strlen("\x1234") /* C: compilation error */

    std::string("\x1234").length() // C++: compilation error

    "\x1234".len() // Rust: 3

Before deciding on such a change, there should be efforts to figure out
whether anything could actually break due to this. That is, code with
long hex escapes in strings, be it manually authored (unlikely, people
either use raw bytes in strings or unicode escapes) or automatically
generated (cannot comment on that, maybe the byte-code compiler emits
such code?). If not, then it would be an obvious candidate for the next
major release of Emacs.
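
The Python entry in the survey above can be checked directly. This
small sketch only illustrates the two-digit behaviour and says
nothing about the other languages listed:

```python
# In Python, \x takes exactly two hexadecimal digits, so "34" stays
# as two ordinary characters after the escape.
s = "\x1234"
codepoints = [hex(ord(ch)) for ch in s]
print(len(s), codepoints)   # 3 ['0x12', '0x33', '0x34']

# A code point above 0xFF needs a different escape, such as \u.
wide = "\u1234"
print(len(wide))            # 1
```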


Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Sun, 24 Apr 2022 10:27:01 GMT) Full text and rfc822 format available.

Message #85 received at 27270 <at> debbugs.gnu.org (full text, mbox):

From: Andreas Schwab <schwab <at> linux-m68k.org>
To: Vasilij Schneidermann <v.schneidermann <at> gmail.com>
Cc: Lars Ingebrigtsen <larsi <at> gnus.org>, Paul Eggert <eggert <at> cs.ucla.edu>,
 27270 <at> debbugs.gnu.org, npostavs <at> users.sourceforge.net
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output
 for Emacs strings
Date: Sun, 24 Apr 2022 12:26:45 +0200

On Apr 24 2022, Vasilij Schneidermann wrote:

>     strlen("\x1234") /* C: compilation error */

You need to use a wide string:

      wcslen(L"\x1234")

>     std::string("\x1234").length() // C++: compilation error

Likewise:

      std::wstring(L"\x1234").length()

-- 
Andreas Schwab, schwab <at> linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Sun, 24 Apr 2022 10:53:02 GMT) Full text and rfc822 format available.

Message #88 received at 27270 <at> debbugs.gnu.org (full text, mbox):

From: Vasilij Schneidermann <v.schneidermann <at> gmail.com>
To: Andreas Schwab <schwab <at> linux-m68k.org>
Cc: Lars Ingebrigtsen <larsi <at> gnus.org>, Paul Eggert <eggert <at> cs.ucla.edu>,
 27270 <at> debbugs.gnu.org, npostavs <at> users.sourceforge.net
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output
 for Emacs strings
Date: Sun, 24 Apr 2022 12:51:58 +0200

> You need to use a wide string:
>
>       wcslen(L"\x1234")
>
> >     std::string("\x1234").length() // C++: compilation error
>
> Likewise:
>
>       std::wstring(L"\x1234").length()

Thank you for pointing this out. This gives us three camps:

- Languages where "\x1234" is always one character (Emacs Lisp)
- Languages where "\x1234" is an error, but may become one character
when opting into this with wide literals (C, C++)
- Languages where "\x1234" is always multiple characters (everything
else under the sun)

I propose Emacs Lisp to move into camp 3 (not really a point in moving
to camp 2 as it requires new syntax for a hardly used feature). As
evidenced by the bug report, this is a footgun waiting to happen. We
already do have syntax in case one truly wants to specify a value
greater than #xFF using Unicode names/values. This would require an
amendment in `(info "(elisp) General Escape Syntax")`, point 3. Like
with oldstyle backquotes, a warning could be emitted if greater hex
values are used in a string.

I've checked Emacs sources for usage of such hex escapes and only
found org-entities.el to represent non-breaking space (nbsp) this way,
so breakage should be limited.

If there is interest, I could extend the survey to include whether
character syntax is/should be affected the same way and/or include
more languages.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Sun, 24 Apr 2022 11:02:02 GMT) Full text and rfc822 format available.

Message #91 received at 27270 <at> debbugs.gnu.org (full text, mbox):

From: Andreas Schwab <schwab <at> linux-m68k.org>
To: Vasilij Schneidermann <v.schneidermann <at> gmail.com>
Cc: Lars Ingebrigtsen <larsi <at> gnus.org>, Paul Eggert <eggert <at> cs.ucla.edu>,
 27270 <at> debbugs.gnu.org, npostavs <at> users.sourceforge.net
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output
 for Emacs strings
Date: Sun, 24 Apr 2022 13:01:01 +0200

On Apr 24 2022, Vasilij Schneidermann wrote:

> I propose Emacs Lisp to move into camp 3

This will break every use of \x in Emacs.

-- 
Andreas Schwab, schwab <at> linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Sun, 24 Apr 2022 11:26:02 GMT) Full text and rfc822 format available.

Message #94 received at 27270 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Eli Zaretskii <eliz <at> gnu.org>, v.schneidermann <at> gmail.com,
 27270 <at> debbugs.gnu.org, npostavs <at> users.sourceforge.net
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output
 for Emacs strings
Date: Sun, 24 Apr 2022 13:24:53 +0200

Paul Eggert <eggert <at> cs.ucla.edu> writes:

> Not surprising, since most people don't set
> display-raw-bytes-as-hex. But that doesn't mean it's not a
> problem. Quoting bugs can be issues even if they're unlikely to occur
> at random. (Think SQL injection. :-)

I don't think we're talking quite the same magnitude -- this is a
problem if you're cutting strings from a -nw Emacs and pasting into a
different Emacs and then using the Lisp reader to read it back.  And
then there's a raw byte in the string.

The likelihood of anybody actually encountering this issue is ... small.

>> I tend to think that introducing a new syntax just to fix it
>> isn't worth it.
>
> That's fine, so let's fix the problem as originally suggested. That
> is, display the string returned by (format "%c%c" #x9e #x66) as
> "\x9e\x66" (equivalent to (concat "\x9e" "\x66") which is correct)
> instead of as "\x9ef" (equivalent to "\N{BENGALI DIGIT NINE}" which is
> wrong).
>
> This fixes the problem and doesn't introduce new syntax.

You want to quote all %c as if they were raw bytes?  Or only following a
raw byte?  And what about

(format "%cf" #x9e)

which was the originally reported issue?

In any case, this would definitely be a regression, because it creates
very confusing displayed strings.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Sun, 24 Apr 2022 11:30:04 GMT) Full text and rfc822 format available.

Message #97 received at 27270 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Andreas Schwab <schwab <at> linux-m68k.org>
Cc: npostavs <at> users.sourceforge.net, 27270 <at> debbugs.gnu.org,
 Vasilij Schneidermann <v.schneidermann <at> gmail.com>,
 Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output
 for Emacs strings
Date: Sun, 24 Apr 2022 13:29:09 +0200

Andreas Schwab <schwab <at> linux-m68k.org> writes:

>> I propose Emacs Lisp to move into camp 3
>
> This will break every use of \x in Emacs.

As Vasilij says, it won't break much of the in-tree code (which usually
looks like "\x3c\x7e\xff\xff\xff\xff\x7e\x3c"), but nevertheless, it'll
break stuff in subtle ways, so I don't think it's an option.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Sun, 24 Apr 2022 22:37:02 GMT) Full text and rfc822 format available.

Message #100 received at 27270 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: Eli Zaretskii <eliz <at> gnu.org>, v.schneidermann <at> gmail.com,
 27270 <at> debbugs.gnu.org, npostavs <at> users.sourceforge.net
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output
 for Emacs strings
Date: Sun, 24 Apr 2022 15:35:53 -0700

On 4/24/22 04:24, Lars Ingebrigtsen wrote:

> The likelihood of anybody actually encountering this issue is ... small.

Sure, if strings are random. But strings from opponents aren't random.

I'll readily grant that it's a much smaller exposure than SQL injection. 
Still, like SQL injection it's an exposure and should be fixed.

> You want to quote all %c as if they were raw bytes?  Or only following a
> raw byte?

Closer to the latter, but even less than the latter. I am being 
conservative and am proposing that Emacs do what it does now unless the 
resulting output would be misinterpreted on input. So I wouldn't change 
how all characters are quoted; only how characters are quoted when the 
result would be interpreted incorrectly.

> what about (format "%cf" #x9e)

Since that returns a multibyte string, I suggest "\u009ef" which is 
multibyte. For its unibyte counterpart (encode-coding-string (format 
"%cf" #x9e) 'iso-latin-1) I suggest the syntax "\x9e\ f" which is 
unibyte. (These are not the only possibilities; for example, the former 
could be "\u009e\ f" if you think that's clearer.)

This string syntax is already supported by Emacs, so this wouldn't 
change the Lisp reader.

> it creates
> very confusing displayed strings.

These examples are not *that* confusing. And although they may not be 
beautiful, correct strings are less confusing than incorrect strings.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Sun, 24 Apr 2022 22:48:01 GMT) Full text and rfc822 format available.

Message #103 received at 27270 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Vasilij Schneidermann <v.schneidermann <at> gmail.com>
Cc: Lars Ingebrigtsen <larsi <at> gnus.org>, 27270 <at> debbugs.gnu.org,
 Eli Zaretskii <eliz <at> gnu.org>, npostavs <at> users.sourceforge.net
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output
 for Emacs strings
Date: Sun, 24 Apr 2022 15:46:55 -0700

On 4/24/22 02:56, Vasilij Schneidermann wrote:

> Under which conditions exactly does the bug happen?

I run into it with emacs -nw or equivalent, which I often use when I 
have a high-latency network connection so GUI Emacs is too slow. A few 
people even run Emacs from text consoles, with no graphics or windowing 
system at all, though I'm usually not that hard-core.

> trying to work around terminals being bad doesn't
> seem worth the extra effort.

Please bear with us poor users who don't always use GUIs...

> what exactly should the logic be here?
> Detect if there's a preceding hex escaped byte and if yes, display
> adjacent bytes that are formatted using hex characters using escaping,
> too?

Simpler than that. When hex-escaping a character, Emacs would look at 
the next character and if it's hexadecimal would print "\ " (or some 
similar escaping approach). This is a simple test and won't hurt 
printing performance much in the usual case.
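
The look-ahead test described above could be sketched as follows, in
Python rather than in Emacs's C display code. The function name is
hypothetical, and treating code points 0x80..0xFF as the ones that
get hex-escaped is an assumption that stands in for Emacs's real
notion of an undisplayable character.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def display_with_separator(codepoints):
    """Render code points, hex-escaping 0x80..0xFF (an assumption that
    stands in for "undisplayable") and adding a backslash-space after an
    escape whenever the next character is a hexadecimal digit."""
    pieces = []
    for index, cp in enumerate(codepoints):
        if 0x80 <= cp <= 0xFF:
            pieces.append("\\x%02x" % cp)
            # Look at most one character ahead.
            following = codepoints[index + 1:index + 2]
            if following and chr(following[0]) in HEX_DIGITS:
                pieces.append("\\ ")
        else:
            pieces.append(chr(cp))
    return "".join(pieces)


print(display_with_separator([0x78, 0x90, 0x35, 0x79]))  # x\x90\ 5y
print(display_with_separator([0x78, 0x90, 0x79]))        # x\x90y
```

Only the first call needs the separator, because "5" is a hex digit
and "y" is not, so the common case is displayed exactly as before.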

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Mon, 25 Apr 2022 07:41:02 GMT) Full text and rfc822 format available.

Message #106 received at 27270 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Eli Zaretskii <eliz <at> gnu.org>, v.schneidermann <at> gmail.com,
 27270 <at> debbugs.gnu.org, npostavs <at> users.sourceforge.net
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output
 for Emacs strings
Date: Mon, 25 Apr 2022 09:40:06 +0200

Paul Eggert <eggert <at> cs.ucla.edu> writes:

>> The likelihood of anybody actually encountering this issue is ... small.
>
> Sure, if strings are random. But strings from opponents aren't random.
>
> I'll readily grant that it's a much smaller exposure than SQL
> injection. Still, like SQL injection it's an exposure and should be
> fixed.

The opponent would have to get somebody to start an Emacs with -nw, then
cut and paste a string with the mouse, then get the user to use the Lisp
reader to read that string in again, and then end up with a string that
will somehow be a security issue.

Comparing this to SQL injection is far-fetched, to put it mildly.

We have a similar issue with the octal printer -- if you print something
out with it, and you end up with something displayed as foo\205bar, you
cut and paste that from -nw, and you save it into a file, you end up
with a file containing 10 characters instead of 7, and then you have
your exploit.

I.e., the Lisp reader and strings aren't the only things confusable here.

>> what about (format "%cf" #x9e)
>
> Since that returns a multibyte string, I suggest "\u009ef" which is
> multibyte. For its unibyte counterpart (encode-coding-string (format
> "%cf" #x9e) 'iso-latin-1) I suggest the syntax "\x9e\ f" which is
> unibyte. (These are not the only possibilities; for example, the
> former could be "\u009e\ f" if you think that's clearer.)

display-raw-bytes-as-hex is a display setting.  You want to change it so
that the data output will be different, which will break all kinds of
things, even if (when you use the Lisp reader) it'll end up being the
same.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Mon, 25 Apr 2022 16:50:02 GMT) Full text and rfc822 format available.

Message #109 received at 27270 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: Eli Zaretskii <eliz <at> gnu.org>, v.schneidermann <at> gmail.com,
 27270 <at> debbugs.gnu.org, npostavs <at> users.sourceforge.net
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output
 for Emacs strings
Date: Mon, 25 Apr 2022 09:49:15 -0700

On 4/25/22 00:40, Lars Ingebrigtsen wrote:

> Comparing this to SQL injection is far-fetched

Call me paranoid if you like. (Can you tell I used to work for a 
computer security company? :-) And to be honest my main motivation is 
irritation that cut-and-paste doesn't work, not security.

> We have a similar issue with the octal printer -- if you print something
> out with it, and you end up with something displayed as foo\205bar, you
> cut and paste that from -nw, and you save it into a file,

Nobody expects things to work if you output with one quoting scheme and 
input with a different one. But cutting and pasting from Emacs's 
read-eval-print-loop is expected to work.

> display-raw-bytes-as-hex is a display setting.  You want to change it so
> that the data output will be different

No, I would like to change only the display. (I had suggested otherwise 
in comment #5 of this bug report, but was mistaken and took that 
suggestion back in later comments.)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Tue, 26 Apr 2022 10:08:02 GMT) Full text and rfc822 format available.

Message #112 received at 27270 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Eli Zaretskii <eliz <at> gnu.org>, v.schneidermann <at> gmail.com,
 27270 <at> debbugs.gnu.org, npostavs <at> users.sourceforge.net
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output
 for Emacs strings
Date: Tue, 26 Apr 2022 12:06:56 +0200

Paul Eggert <eggert <at> cs.ucla.edu> writes:

>> display-raw-bytes-as-hex is a display setting.  You want to change it so
>> that the data output will be different
>
> No, I would like to change only the display. (I had suggested
> otherwise in comment #5 of this bug report, but was mistaken and took
> that suggestion back in later comments.)

Your last suggestion was to output

(format "%cf" 129)
=> "\x81\x66"

I think?  Which is changing the data output.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Tue, 26 Apr 2022 16:49:02 GMT) Full text and rfc822 format available.

Message #115 received at 27270 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: Eli Zaretskii <eliz <at> gnu.org>, v.schneidermann <at> gmail.com,
 27270 <at> debbugs.gnu.org, npostavs <at> users.sourceforge.net
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output
 for Emacs strings
Date: Tue, 26 Apr 2022 09:48:35 -0700

On 4/26/22 03:06, Lars Ingebrigtsen wrote:
> Your last suggestion was to output
> 
> (format "%cf" 129)
> => "\x81\x66"
> 
> I think?  Which is changing the data output.

Oh, right. Scratch that. Let's just use "\uXXXX" if multibyte, "\OOO" 
(octal) if unibyte. (This is only when the character precedes a hex 
digit.) That's simpler anyway.
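
A rough sketch of that selection rule, in Python for illustration
only. The function name and its arguments are hypothetical. It
assumes that \u is followed by exactly four hexadecimal digits and
that an octal escape takes at most three digits, so neither can
absorb a following hex digit.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def escape_undisplayable(cp, multibyte, next_char):
    """Pick an escape for one undisplayable code point.

    next_char is the character displayed right after it, or "" at the
    end of the string.
    """
    if next_char and next_char in HEX_DIGITS:
        # A hex digit follows: use a self-delimiting escape.
        return "\\u%04x" % cp if multibyte else "\\%03o" % cp
    return "\\x%02x" % cp  # unambiguous, keep the usual hex form


print(escape_undisplayable(0x9E, True, "f"))   # \u009e
print(escape_undisplayable(0x90, False, "5"))  # \220
print(escape_undisplayable(0x90, False, "y"))  # \x90
```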

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Wed, 27 Apr 2022 12:15:02 GMT) Full text and rfc822 format available.

Message #118 received at 27270 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Eli Zaretskii <eliz <at> gnu.org>, v.schneidermann <at> gmail.com,
 27270 <at> debbugs.gnu.org, npostavs <at> users.sourceforge.net
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output
 for Emacs strings
Date: Wed, 27 Apr 2022 14:13:43 +0200

Paul Eggert <eggert <at> cs.ucla.edu> writes:

>> Your last suggestion was to output
>> (format "%cf" 129)
>> => "\x81\x66"
>> I think?  Which is changing the data output.
>
> Oh, right. Scratch that. Let's just use "\uXXXX" if multibyte, "\OOO"
> (octal) if unibyte. (This is only when the character precedes a hex
> digit.) That's simpler anyway.

That will also change the output, which display-raw-bytes-as-hex is not
supposed to do.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Wed, 27 Apr 2022 17:22:01 GMT) Full text and rfc822 format available.

Message #121 received at 27270 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: Eli Zaretskii <eliz <at> gnu.org>, v.schneidermann <at> gmail.com,
 27270 <at> debbugs.gnu.org, npostavs <at> users.sourceforge.net
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output
 for Emacs strings
Date: Wed, 27 Apr 2022 10:21:02 -0700

On 4/27/22 05:13, Lars Ingebrigtsen wrote:
>> Oh, right. Scratch that. Let's just use "\uXXXX" if multibyte, "\OOO"
>> (octal) if unibyte. (This is only when the character precedes a hex
>> digit.) That's simpler anyway.
> That will also change the output, which display-raw-bytes-as-hex is not
> supposed to do.

Could you explain what you mean by "change the output"? (Sorry, I'm not 
seeing it.)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#27270; Package emacs. (Wed, 27 Apr 2022 17:24:01 GMT) Full text and rfc822 format available.

Message #124 received at 27270 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Eli Zaretskii <eliz <at> gnu.org>, v.schneidermann <at> gmail.com,
 27270 <at> debbugs.gnu.org, npostavs <at> users.sourceforge.net
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output
 for Emacs strings
Date: Wed, 27 Apr 2022 19:22:50 +0200

Paul Eggert <eggert <at> cs.ucla.edu> writes:

> Could you explain what you mean by "change the output"? (Sorry, I'm
> not seeing it.)

I said earlier:

> display-raw-bytes-as-hex is a display setting.  You want to change it so
> that the data output will be different, which will break all kinds of
> things, even if (when you use the Lisp reader) it'll end up being the
> same.


-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Thu, 28 Apr 2022 17:59:02 GMT) Full text and rfc822 format available.

Notification sent to Paul Eggert <eggert <at> cs.ucla.edu>:
bug acknowledged by developer. (Thu, 28 Apr 2022 17:59:02 GMT) Full text and rfc822 format available.

Message #129 received at 27270-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: Eli Zaretskii <eliz <at> gnu.org>, v.schneidermann <at> gmail.com,
 27270-done <at> debbugs.gnu.org, npostavs <at> users.sourceforge.net
Subject: Re: bug#27270: display-raw-bytes-as-hex generates ambiguous output
 for Emacs strings
Date: Thu, 28 Apr 2022 10:58:33 -0700

[Message part 1 (text/plain, inline)]

On 4/27/22 10:22, Lars Ingebrigtsen wrote:
> Paul Eggert <eggert <at> cs.ucla.edu> writes:
> 
>> Could you explain what you mean by "change the output"? (Sorry, I'm
>> not seeing it.)
> 
> I said earlier:
> 
>> display-raw-bytes-as-hex is a display setting.  You want to change it so
>> that the data output will be different

Still not quite following, as I had been thinking more recently of 
changing only how display-raw-bytes-as-hex displays.

That being said, I looked into the code and found that what I was asking 
for would be quite a pain to implement - more trouble than it's worth, 
anyway - so I withdraw the suggestion and am closing the bug report.

I installed the attached, which documents the situation.

[0001-Document-807-etc.-in-raw-byte-display.patch (text/x-patch, attachment)]

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 27 May 2022 11:24:06 GMT) Full text and rfc822 format available.

This bug report was last modified 3 years and 47 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
