GNU bug report logs - #53236
26.1; encode-coding-string does not encode the string as expected


Package: emacs;

Reported by: Markus Triska <triska <at> metalevel.at>

Date: Thu, 13 Jan 2022 19:47:01 UTC

Severity: normal

Found in version 26.1

Done: Markus Triska <triska <at> metalevel.at>

Bug is archived. No further changes may be made.




Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#53236; Package emacs. (Thu, 13 Jan 2022 19:47:01 GMT)

Acknowledgement sent to Markus Triska <triska <at> metalevel.at>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Thu, 13 Jan 2022 19:47:01 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Markus Triska <triska <at> metalevel.at>
To: bug-gnu-emacs <at> gnu.org
Subject: 26.1; encode-coding-string does not encode the string as expected
Date: Thu, 13 Jan 2022 20:45:57 +0100

Dear all,

please consider the UTF-8 encoding of the Unicode codepoint 0x80, which
is formed by two bytes. In hexadecimal notation, they are: 0xC2 0x80.

We can use decode-coding-string to verify that this byte sequence is
decoded to 0x80 when specifying utf-8, which works exactly as expected:

    (decode-coding-string "\xC2\x80" 'utf-8)

This yields "\200", which is the same as "\x80", as verified via:

    (string= "\200" "\x80") --> t

Correspondingly, I expect (encode-coding-string "\200" 'utf-8) to yield
a string equivalent to "\xC2\x80", but that seems not to be the case. I get:

    (encode-coding-string "\200" 'utf-8) --> "\200"

And therefore, unexpectedly:

    (string= (encode-coding-string "\200" 'utf-8) "\xC2\x80") --> nil

It appears that encode-coding-string does not encode the string in UTF-8
as expected. Is there any way to obtain the desired encoding with
encode-coding-string, i.e., the UTF-8-encoded string "\xC2\x80"?

Thank you and all the best!
Markus

In GNU Emacs 26.1 (build 3, x86_64-pc-linux-gnu, X toolkit, Xaw scroll bars)
 of 2019-04-09 built on mt-laptop
Windowing system distributor 'The X.Org Foundation', version 11.0.12004000
System Description:	Ubuntu 19.04

Configured features:
XPM JPEG GIF PNG SOUND GSETTINGS NOTIFY GNUTLS LIBXML2 FREETYPE XFT ZLIB
TOOLKIT_SCROLL_BARS LUCID X11 THREADS

Important settings:
  value of $LC_MONETARY: en_GB.UTF-8
  value of $LC_NUMERIC: en_GB.UTF-8
  value of $LC_TIME: en_GB.UTF-8
  value of $LANG: en_US.UTF-8
  value of $XMODIFIERS: @im=ibus
  locale-coding-system: utf-8-unix





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#53236; Package emacs. (Thu, 13 Jan 2022 20:24:02 GMT)

Message #8 received at 53236 <at> debbugs.gnu.org:

From: Philipp Stephani <p.stephani2 <at> gmail.com>
To: Markus Triska <triska <at> metalevel.at>
Cc: 53236 <at> debbugs.gnu.org
Subject: Re: bug#53236: 26.1; encode-coding-string does not encode the string as expected
Date: Thu, 13 Jan 2022 21:23:33 +0100

On Thu, 13 Jan 2022 at 21:14, Markus Triska <triska <at> metalevel.at> wrote:
>
> Dear all,
>
> please consider the UTF-8 encoding of the Unicode codepoint 0x80, which
> is formed by two bytes. In hexadecimal notation, they are: 0xC2 0x80.
>
> We can use decode-coding-string to verify that this byte sequence is
> decoded to 0x80 when specifying utf-8, which works exactly as expected:
>
>     (decode-coding-string "\xC2\x80" 'utf-8)
>
> This yields "\200", which is the same as "\x80", as verified via:
>
>     (string= "\200" "\x80") --> t

There are two possible interpretations of "\200":
1. The unibyte string containing the byte #x80
2. The multibyte string containing the Unicode character U+0080
The string literal "\200" gives you the former, while
(decode-coding-string "\xC2\x80" 'utf-8) gives you the latter. In
fact,
(string= (decode-coding-string "\xC2\x80" 'utf-8) "\200") ⇒ nil
but
(string= (decode-coding-string "\xC2\x80" 'utf-8) "\u0080") ⇒ t
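
To make the two interpretations concrete, here is a minimal sketch that uses only `multibyte-string-p`, `string=`, and the literals already discussed. The result comments restate what is reported in this thread, plus the reader rule that a `\u` escape forces a multibyte string.

```elisp
;; Octal and two-digit hex escapes below 256 read as a unibyte string:
;; a single raw byte #x80.
(multibyte-string-p "\200")    ; ⇒ nil
(multibyte-string-p "\x80")    ; ⇒ nil

;; A \u escape forces a multibyte string: one character, U+0080.
(multibyte-string-p "\u0080")  ; ⇒ t

;; Decoding the UTF-8 bytes C2 80 produces the multibyte form, so the
;; result equals "\u0080" but not the unibyte "\200".
(let ((decoded (decode-coding-string "\xC2\x80" 'utf-8)))
  (list (multibyte-string-p decoded)
        (string= decoded "\u0080")
        (string= decoded "\200")))  ; ⇒ (t t nil)
```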

>
> Correspondingly, I expect (encode-coding-string "\200" 'utf-8) to yield
> a string equivalent to "\xC2\x80", but that seems not to be the case. I get:
>
>     (encode-coding-string "\200" 'utf-8) --> "\200"

Here "\200" gives you the unibyte string that contains the byte #x80.
That can't be encoded as UTF-8 (since UTF-8 encodes Unicode scalar
values, not raw bytes), so it's left alone.
However,
(encode-coding-string "\u0080" 'utf-8) ⇒ "\302\200"
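
A small sketch of that difference, listing the bytes that come out of `encode-coding-string`. The helper name `my-utf8-bytes` is hypothetical and introduced here only for illustration.

```elisp
(defun my-utf8-bytes (string)
  "Return the bytes of STRING encoded as UTF-8, as a list of integers.
Hypothetical helper for illustration only."
  ;; `encode-coding-string' returns a unibyte string; `append' with nil
  ;; converts it into a list of its byte values.
  (append (encode-coding-string string 'utf-8) nil))

(my-utf8-bytes "\u0080")  ; ⇒ (194 128), i.e. #xC2 #x80
(my-utf8-bytes "\200")    ; ⇒ (128), the unibyte input comes back unchanged
```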

There's some background in the chapter "Text representations" in the
ELisp manual.
HTH




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#53236; Package emacs. (Fri, 14 Jan 2022 06:56:02 GMT)

Message #11 received at 53236 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Markus Triska <triska <at> metalevel.at>
Cc: 53236 <at> debbugs.gnu.org
Subject: Re: bug#53236: 26.1; encode-coding-string does not encode the string as expected
Date: Fri, 14 Jan 2022 08:55:30 +0200

> From: Markus Triska <triska <at> metalevel.at>
> Date: Thu, 13 Jan 2022 20:45:57 +0100
> 
> Correspondingly, I expect (encode-coding-string "\200" 'utf-8) to yield
> a string equivalent to "\xC2\x80", but that seems not to be the case. I get:
> 
>     (encode-coding-string "\200" 'utf-8) --> "\200"
> 
> And therefore, unexpectedly:
> 
>     (string= (encode-coding-string "\200" 'utf-8) "\xC2\x80") --> nil

"\200" is a unibyte string, and encoding unibyte strings returns those
strings without changing them.

This is not a bug, just a dark corner of encoding/decoding stuff.
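
Because the unibyte case passes through silently, code that wants to catch it could check the input up front. A minimal sketch, assuming a hypothetical helper `my-encode-utf8-strict` (not an Emacs function) that rejects unibyte strings containing bytes above 127:

```elisp
(require 'cl-lib)

(defun my-encode-utf8-strict (string)
  "Encode STRING as UTF-8, signaling an error for unibyte non-ASCII input.
Hypothetical helper; not part of Emacs."
  (when (and (not (multibyte-string-p string))
             ;; Every element of a unibyte string is a byte value 0..255.
             (cl-some (lambda (byte) (> byte 127)) string))
    (error "Unibyte string with non-ASCII bytes; decode it first"))
  (encode-coding-string string 'utf-8))

(my-encode-utf8-strict "\u0080")  ; ⇒ "\302\200"
(my-encode-utf8-strict "abc")     ; ⇒ "abc" (pure ASCII is fine either way)
;; (my-encode-utf8-strict "\200") signals an error instead of returning "\200".
```

Whether an error is the right reaction depends on the caller; the point is only that `multibyte-string-p` lets the two cases be distinguished before encoding.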




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#53236; Package emacs. (Fri, 14 Jan 2022 10:01:02 GMT)

Message #14 received at 53236 <at> debbugs.gnu.org:

From: Andreas Schwab <schwab <at> linux-m68k.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 53236 <at> debbugs.gnu.org, Markus Triska <triska <at> metalevel.at>
Subject: Re: bug#53236: 26.1; encode-coding-string does not encode the string as expected
Date: Fri, 14 Jan 2022 11:00:17 +0100

On Jan 14 2022, Eli Zaretskii wrote:

>> From: Markus Triska <triska <at> metalevel.at>
>> Date: Thu, 13 Jan 2022 20:45:57 +0100
>> 
>> Correspondingly, I expect (encode-coding-string "\200" 'utf-8) to yield
>> a string equivalent to "\xC2\x80", but that seems not to be the case. I get:
>> 
>>     (encode-coding-string "\200" 'utf-8) --> "\200"
>> 
>> And therefore, unexpectedly:
>> 
>>     (string= (encode-coding-string "\200" 'utf-8) "\xC2\x80") --> nil
>
> "\200" is a unibyte string, and encoding unibyte strings returns those
> strings without changing them.
>
> This is not a bug, just a dark corner of encoding/decoding stuff.

Or a dark corner of the string syntax.

ELISP> (multibyte-string-p "\200")
nil
ELISP> (multibyte-string-p "\x80")
nil
ELISP> (multibyte-string-p "\x0080")
t
ELISP> (encode-coding-string "\x0080" 'utf-8)
"\302\200"

-- 
Andreas Schwab, schwab <at> linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."




bug closed, send any further explanations to 53236 <at> debbugs.gnu.org and Markus Triska <triska <at> metalevel.at>. Request was from Markus Triska <triska <at> metalevel.at> to control <at> debbugs.gnu.org. (Sat, 15 Jan 2022 06:42:01 GMT)

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 12 Feb 2022 12:24:07 GMT)

This bug report was last modified 2 years and 74 days ago.


