GNU bug report logs - #12291
[rev 109796] wrong UTF-8 handling

Package: emacs

Reported by: Werner LEMBERG <wl <at> gnu.org>

Date: Tue, 28 Aug 2012 05:49:02 UTC

Severity: normal

Tags: moreinfo

Done: Lars Ingebrigtsen <larsi <at> gnus.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 12291 in the body.
You can then email your comments to 12291 AT debbugs.gnu.org in the normal way.

Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#12291; Package emacs. (Tue, 28 Aug 2012 05:49:02 GMT)

Acknowledgement sent to Werner LEMBERG <wl <at> gnu.org>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Tue, 28 Aug 2012 05:49:03 GMT)

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Werner LEMBERG <wl <at> gnu.org>
To: bug-gnu-emacs <at> gnu.org
Cc: Curtis Smith <smithcu <at> gvsu.edu>
Subject: [rev 109796] wrong UTF-8 handling
Date: Tue, 28 Aug 2012 07:47:20 +0200 (CEST)

[Message part 1 (text/plain, inline)]

[bzr revision 109796]

Have a look at the attached file, containing a single character.
(It's transmitted as binary to avoid e-mail encoding issues).  It
contains a single, four-byte UTF-8 encoded character (0xF4 0xB5 0x87
0x9E, which would map to the non-existent Unicode character code
U+1351DE).  If I load this file as UTF-8 encoded, Emacs gives this as
the output of `C-u C-x =':

               position: 1 of 2 (0%), column: 0
              character: 二 (displayed as 二) (codepoint 20108, #o47214, #x4e8c)
      preferred charset: unicode (Unicode (ISO10646))
  code point in charset: 0x4E8C
                 syntax: w 	which means: word
               category: .:Base, C:2-byte han, L:Left-to-right (strong), c:Chinese, h:Korean, j:Japanese, |:line breakable
               to input: type "C-x 8 RET HEX-CODEPOINT" or "C-x 8 RET NAME"
            buffer code: #xE4 #xBA #x8C
              file code: #xE4 #xBA #x8C (encoded by coding system utf-8-unix)
                display: by this font (glyph code)
      xft:-unknown-SimSun-normal-normal-normal-*-24-*-*-*-d-0-iso10646-1 (#x460)

  Character code properties: customize what to show
    name: CJK IDEOGRAPH-4E8C
    general-category: Lo (Letter, Other)
    decomposition: (20108) ('二')

Look what Emacs says about the file code.  If I save this
one-character file as UTF-8, the character code stays as-is.

This behaviour is clearly wrong.  I suspect that Emacs is using such a
high character code for internal representation of the `emacs-mule'
encoding.  However, the user must not see this.  Instead, such
characters must be converted to correct UTF-8.


    Werner


======================================================================

In GNU Emacs 24.2.50.1 (i686-pc-linux-gnu, GTK+ Version 2.24.9)
 of 2012-08-28 on linux-nvf0
Windowing system distributor `The X.Org Foundation', version 11.0.11004000
Configured using:
 `configure 'MAKEINFO=/usr/bin/makeinfo' '--with-x-toolkit=gtk''

Important settings:
  value of $LANG: de_DE.UTF-8
  value of $XMODIFIERS: @im=none
  locale-coding-system: utf-8-unix
  default enable-multibyte-characters: t

Major mode: Summary

Minor modes in effect:
  tooltip-mode: t
  mouse-wheel-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  blink-cursor-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  column-number-mode: t
  transient-mark-mode: t

Recent input:
<return> w b u g - e m <tab> <tab> <tab> <tab> <tab> 
<tab> <tab> <backspace> <backspace> <tab> <tab> C-c 
C-q y M-x w r i t e - e m <tab> C-g C-h a b u g <return> 
<M-next> C-x 1 M-x r e p r t <backspace> <backspace> 
o r t - e m <tab> <return>

Recent messages:
Saving file /home/wl/Mail/draft/11...
Wrote /home/wl/Mail/draft/11
Draft is prepared
No matching alias [7 times]
Kill draft message? (y or n)  y
Saving file /home/wl/Mail/draft/11...
Wrote /home/wl/Mail/draft/11
Draft was killed
Quit
Type C-x 4 C-o RET to restore the other window.  

Load-path shadows:
None found.

Features:
(shadow emacsbug message format-spec rfc822 mml mml-sec mm-decode
mm-bodies mm-encode mail-parse rfc2231 mailabbrev gmm-utils mailheader
sendmail rfc2047 rfc2045 ietf-drums mm-util mail-prsvr mail-utils
apropos descr-text latexenc preview prv-emacs byte-opt tex-buf
noutline outline font-latex warnings bytecomp byte-compile cconv
macroexp latex easy-mmode edmacro kmacro tex-style cus-edit wid-edit
cus-start cus-load pp mew-varsx mew-unix cal-menu calendar
cal-loaddefs mew-auth mew-config mew-imap2 mew-imap mew-nntp2 mew-nntp
mew-pop mew-smtp mew-ssl mew-ssh mew-net mew-highlight mew-sort
mew-fib mew-ext mew-refile mew-demo mew-attach mew-draft mew-message
mew-thread mew-virtual mew-summary4 mew-summary3 mew-summary2
mew-summary mew-search mew-pick mew-passwd mew-scan mew-syntax mew-bq
mew-smime mew-pgp mew-header mew-exec mew-mark mew-mime mew-edit
mew-decode mew-encode mew-cache mew-minibuf mew-complete mew-addrbook
mew-local mew-vars3 mew-vars2 mew-vars mew-env mew-mule3 mew-mule
mew-gemacs mew-key mew-func mew-blvs mew-const mew tex advice help-fns
advice-preload tex-site auto-loads quail help-mode easymenu cjktilde
disp-table time-date tooltip ediff-hook vc-hooks lisp-float-type
mwheel x-win x-dnd tool-bar dnd fontset image regexp-opt fringe
tabulated-list newcomment lisp-mode register page menu-bar rfn-eshadow
timer select scroll-bar mouse jit-lock font-lock syntax facemenu
font-core frame cham georgian utf-8-lang misc-lang vietnamese tibetan
thai tai-viet lao korean japanese hebrew greek romanian slovak czech
european ethiopic indian cyrillic chinese case-table epa-hook
jka-cmpr-hook help simple abbrev minibuffer loaddefs button faces
cus-face files text-properties overlay sha1 md5 base64 format env
code-pages mule custom widget hashtable-print-readable backquote
make-network-process dbusbind dynamic-setting system-font-setting
font-render-setting move-toolbar gtk x-toolkit x multi-tty emacs)

[emacs-problem.utf8 (application/octet-stream, attachment)]
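
The arithmetic behind the byte sequence quoted in the report can be checked with a short Python sketch. It applies the generic four-byte UTF-8 bit layout by hand, deliberately skipping the range check, and then shows that a strict decoder rejects the same bytes. The helper name `decode_utf8_scalar` is made up for illustration; this is not how Emacs decodes the file.

```python
def decode_utf8_scalar(seq: bytes) -> int:
    """Apply the 4-byte UTF-8 bit layout to seq, with no upper-bound check."""
    if len(seq) != 4 or (seq[0] & 0xF8) != 0xF0:
        raise ValueError("expected a 4-byte sequence with an 11110xxx lead byte")
    if any((b & 0xC0) != 0x80 for b in seq[1:]):
        raise ValueError("trailing bytes must have the form 10xxxxxx")
    # 3 payload bits from the lead byte, 6 from each trailing byte.
    return (
        ((seq[0] & 0x07) << 18)
        | ((seq[1] & 0x3F) << 12)
        | ((seq[2] & 0x3F) << 6)
        | (seq[3] & 0x3F)
    )


SAMPLE = bytes([0xF4, 0xB5, 0x87, 0x9E])  # the bytes quoted in the report

code_point = decode_utf8_scalar(SAMPLE)
print(f"U+{code_point:X}")        # U+1351DE
print(code_point > 0x10FFFF)      # True: beyond the last Unicode code point

# A strict UTF-8 decoder refuses the sequence outright.
try:
    SAMPLE.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)                  # False
```

The result, 0x1351DE (decimal 1266142), matches the code point quoted in the report.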

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#12291; Package emacs. (Tue, 28 Aug 2012 09:05:02 GMT)

Message #8 received at 12291 <at> debbugs.gnu.org (full text, mbox):

From: Andreas Schwab <schwab <at> linux-m68k.org>
To: Werner LEMBERG <wl <at> gnu.org>
Cc: 12291 <at> debbugs.gnu.org, Curtis Smith <smithcu <at> gvsu.edu>
Subject: Re: bug#12291: [rev 109796] wrong UTF-8 handling
Date: Tue, 28 Aug 2012 11:03:28 +0200

The code points above #x110000 are used for CJK unification.  The utf-8
decoder should probably reject all those codes.

Andreas.

-- 
Andreas Schwab, schwab <at> linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#12291; Package emacs. (Tue, 28 Aug 2012 15:00:02 GMT)

Message #11 received at 12291 <at> debbugs.gnu.org (full text, mbox):

From: Kenichi Handa <handa <at> gnu.org>
To: Werner LEMBERG <wl <at> gnu.org>
Cc: 12291 <at> debbugs.gnu.org, smithcu <at> gvsu.edu
Subject: Re: bug#12291: [rev 109796] wrong UTF-8 handling
Date: Tue, 28 Aug 2012 23:57:39 +0900

In article <20120828.074720.480105751.wl <at> gnu.org>, Werner LEMBERG <wl <at> gnu.org> writes:

> Have a look at the attached file, containing a single character.
> (It's transmitted as binary to avoid e-mail encoding issues).  It
> contains a single, four-byte UTF-8 encoded character (0xF4 0xB5 0x87
> 0x9E, which would map to the non-existent Unicode character code
> U+1351DE).  If I load this file as UTF-8 encoded, Emacs gives this as
> the output of `C-u C-x =':

>                position: 1 of 2 (0%), column: 0
>               character: 二 (displayed as 二) (codepoint 20108, #o47214, #x4e8c)
[...]
> Look what Emacs says about the file code.  If I save this
> one-character file as UTF-8, the character code stays as-is.

> This behaviour is clearly wrong.

Sure.

> I suspect that Emacs is using such a
> high character code for internal representation of the `emacs-mule'
> encoding.  However, the user must not see this.  

That higher character code area is used for two purposes.

One is for reading CJK characters of legacy encoding (euc,
sjis, big5, etc).  They are decoded into the utf-8-emacs
byte sequence corresponding to the higher character code
area.  But, on getting their character code, most of them
are unified into Unicode BMP characters.  But few are left
un-unified.  Those are private characters in each legacy
character set.

Another is for supporting non-Unicode characters.  The
biggest set is GB18030.

In both cases, user surely see them.

> Instead, such characters must be converted to correct
> UTF-8.

??? I don't understand what you means by "correct UTF-8".

I think the correct behaviour on reading such a file by
utf-8 is to treat each byte as raw-byte.

---
Kenichi Handa
handa <at> gnu.org

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#12291; Package emacs. (Tue, 28 Aug 2012 19:24:02 GMT)

Message #14 received at 12291 <at> debbugs.gnu.org (full text, mbox):

From: Werner LEMBERG <wl <at> gnu.org>
To: handa <at> gnu.org
Cc: 12291 <at> debbugs.gnu.org, smithcu <at> gvsu.edu
Subject: Re: bug#12291: [rev 109796] wrong UTF-8 handling
Date: Tue, 28 Aug 2012 21:22:26 +0200 (CEST)

> In both cases, user surely see them.

OK.  BTW, the real use-case is a bug in emacs 23.x which prevented
correct conversion from emacs-mule encoding to utf-8, creating such
funnily encoded utf-8 files (I can't repeat this problem with my
recently compiled emacs, so it seems that it has been fixed
meanwhile).

>> Instead, such characters must be converted to correct
>> UTF-8.
> 
> ??? I don't understand what you means by "correct UTF-8".

Sorry, I've meant correct Unicode.  U+1351DE is larger than the
largest valid Unicode value.  As my example demonstrates, the Chinese
character in the file is certainly *neither* a private character nor a
character from GB 18030, so it should be converted to a regular
Unicode value.

> I think the correct behaviour on reading such a file by utf-8 is to
> treat each byte as raw-byte.

Maybe.  I'm not sure how Emacs should behave in reading such files.


    Werner

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#12291; Package emacs. (Fri, 31 Aug 2012 10:43:01 GMT)

Message #17 received at 12291 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Werner LEMBERG <wl <at> gnu.org>
Cc: 12291 <at> debbugs.gnu.org, handa <at> gnu.org, smithcu <at> gvsu.edu
Subject: Re: bug#12291: [rev 109796] wrong UTF-8 handling
Date: Fri, 31 Aug 2012 13:40:44 +0300

> Date: Tue, 28 Aug 2012 21:22:26 +0200 (CEST)
> From: Werner LEMBERG <wl <at> gnu.org>
> Cc: 12291 <at> debbugs.gnu.org, smithcu <at> gvsu.edu
> 
> > I think the correct behaviour on reading such a file by utf-8 is to
> > treat each byte as raw-byte.
> 
> Maybe.  I'm not sure how Emacs should behave in reading such files.

We can either read them as raw bytes, or convert them to u+FFFD.  The
former sounds like a more useful behavior to me, FWIW.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#12291; Package emacs. (Mon, 03 Sep 2012 01:02:02 GMT)

Message #20 received at 12291 <at> debbugs.gnu.org (full text, mbox):

From: Kenichi Handa <handa <at> gnu.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 12291 <at> debbugs.gnu.org, smithcu <at> gvsu.edu, wl <at> gnu.org
Subject: Re: bug#12291: [rev 109796] wrong UTF-8 handling
Date: Mon, 03 Sep 2012 09:59:22 +0900

In article <83bohrqr83.fsf <at> gnu.org>, Eli Zaretskii <eliz <at> gnu.org> writes:

> > Date: Tue, 28 Aug 2012 21:22:26 +0200 (CEST)
> > From: Werner LEMBERG <wl <at> gnu.org>
> > Cc: 12291 <at> debbugs.gnu.org, smithcu <at> gvsu.edu
> > 
> > > I think the correct behaviour on reading such a file by utf-8 is to
> > > treat each byte as raw-byte.
> > 
> > Maybe.  I'm not sure how Emacs should behave in reading such files.

> We can either read them as raw bytes, or convert them to u+FFFD.  The
> former sounds like a more useful behavior to me, FWIW.

What to convert to U+FFFD?  Each byte, or the byte sequence?

Anyway, we can't simply convert them to U+FFFD because it
results in change of file contents just by reading and
writing.  We can add post-read-conversion and
pre-write-conversion functions to the coding system utf-8
to perform the conversion (and adding text properties for
reverting) and reverting (using the text properties attached
at the time of reading).  But, is it worth doing that?

I think converting each invalid byte to raw-byte is simpler
and equally useful.

---
Kenichi Handa
handa <at> gnu.org
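
The round-trip concern raised above can be illustrated outside Emacs with a small Python sketch. It is only an analogy: Python's `errors="replace"` stands in for converting invalid input to U+FFFD, and `errors="surrogateescape"` stands in for keeping each invalid byte as a raw byte. Neither is Emacs's actual mechanism, and the variable names are illustrative.

```python
DATA = bytes([0xF4, 0xB5, 0x87, 0x9E])  # the invalid sequence from the report

# Option 1: substitute U+FFFD on read. The original bytes are lost, so
# writing the text back produces different file contents.
replaced = DATA.decode("utf-8", errors="replace")
roundtrip_replace = replaced.encode("utf-8")

# Option 2: keep every undecodable byte as an escaped "raw byte".
# Encoding with the same handler restores the bytes exactly.
escaped = DATA.decode("utf-8", errors="surrogateescape")
roundtrip_escape = escaped.encode("utf-8", errors="surrogateescape")

print(roundtrip_replace == DATA)  # False: read-then-write changed the file
print(roundtrip_escape == DATA)   # True: contents preserved byte for byte
```

Only the second option leaves the file unchanged after a plain read and write, which is the property the raw-byte approach is meant to keep.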

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#12291; Package emacs. (Mon, 03 Sep 2012 02:42:02 GMT)

Message #23 received at 12291 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Kenichi Handa <handa <at> gnu.org>
Cc: 12291 <at> debbugs.gnu.org, smithcu <at> gvsu.edu, wl <at> gnu.org
Subject: Re: bug#12291: [rev 109796] wrong UTF-8 handling
Date: Mon, 03 Sep 2012 05:40:09 +0300

> From: Kenichi Handa <handa <at> gnu.org>
> Cc: wl <at> gnu.org, 12291 <at> debbugs.gnu.org, smithcu <at> gvsu.edu
> Date: Mon, 03 Sep 2012 09:59:22 +0900
> 
> > We can either read them as raw bytes, or convert them to u+FFFD.  The
> > former sounds like a more useful behavior to me, FWIW.
> 
> What to convert to U+FFFD?  Each byte, or the byte sequence?

The byte sequence.

> Anyway, we can't simply convert them to U+FFFD because it
> results in change of file contents just by reading and
> writing.

Yes, and that's why I prefer the raw-bytes way.

> I think converting each invalid byte to raw-byte is simpler
> and equally useful.

It's more useful, I think.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#12291; Package emacs. (Thu, 27 Jan 2022 16:34:02 GMT)

Message #26 received at 12291 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Werner LEMBERG <wl <at> gnu.org>
Cc: 12291 <at> debbugs.gnu.org, Curtis Smith <smithcu <at> gvsu.edu>,
 Eli Zaretskii <eliz <at> gnu.org>
Subject: Re: bug#12291: [rev 109796] wrong UTF-8 handling
Date: Thu, 27 Jan 2022 17:32:53 +0100

Werner LEMBERG <wl <at> gnu.org> writes:

> Have a look at the attached file, containing a single character.
> (It's transmitted as binary to avoid e-mail encoding issues).  It
> contains a single, four-byte UTF-8 encoded character (0xF4 0xB5 0x87
> 0x9E, which would map to the non-existent Unicode character code
> U+1351DE).  If I load this file as UTF-8 encoded, Emacs gives this as
> the output of `C-u C-x =':
>
>                position: 1 of 2 (0%), column: 0
>               character: 二 (displayed as 二) (codepoint 20108, #o47214, #x4e8c)
>       preferred charset: unicode (Unicode (ISO10646))

(I'm going through old bug reports that unfortunately weren't resolved
at the time.)

This has changed at some point between when this was reported and now:

             position: 1 of 2 (0%), column: 0
            character:  (displayed as ) (codepoint 1266142, #o4650736, #x1351de)
              charset: emacs (Full Emacs charset (excluding eight bit chars))
code point in charset: 0x1351DE
               syntax: w 	which means: word
             category: L:Strong L2R
             to input: type "C-x 8 RET 1351de"

So Emacs now displays more accurate information about the utf-8
sequence.

It was pointed out that this sequence is outside the Unicode range,
which only extends up to U+10FFFF, and that Emacs should perhaps display
this as a number of raw bytes instead.  Is that something we still want
to pursue, or is Emacs behaving like we want to here?  Eli?

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

Added tag(s) moreinfo. Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Thu, 27 Jan 2022 16:34:02 GMT)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#12291; Package emacs. (Thu, 27 Jan 2022 16:53:02 GMT)

Message #31 received at 12291 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: 12291 <at> debbugs.gnu.org, smithcu <at> gvsu.edu, wl <at> gnu.org
Subject: Re: bug#12291: [rev 109796] wrong UTF-8 handling
Date: Thu, 27 Jan 2022 18:52:26 +0200

> From: Lars Ingebrigtsen <larsi <at> gnus.org>
> Cc: 12291 <at> debbugs.gnu.org,  Curtis Smith <smithcu <at> gvsu.edu>, Eli Zaretskii
>  <eliz <at> gnu.org>
> Date: Thu, 27 Jan 2022 17:32:53 +0100
> 
>              position: 1 of 2 (0%), column: 0
>             character:  (displayed as ) (codepoint 1266142, #o4650736, #x1351de)
>               charset: emacs (Full Emacs charset (excluding eight bit chars))
> code point in charset: 0x1351DE
>                syntax: w 	which means: word
>              category: L:Strong L2R
>              to input: type "C-x 8 RET 1351de"
> 
> So Emacs now displays more accurate information about the utf-8
> sequence.
> 
> It was pointed out that this sequence is outside the Unicode range,
> which only extends up to U+10FFFF, and that Emacs should perhaps display
> this as a number of raw bytes instead.  Is that something we still want
> to pursue, or is Emacs behaving like we want to here?  Eli?

This is the expected behavior.  The raw bytes start at #x3FFF00, so
#x1351de is some character code reserved for characters not unified with
Unicode (some CJK encodings have them).  Interpreting them as raw
bytes would be counter-productive.

I'm not sure what was Werner's problem with this, so maybe let him
chime in and explain more.
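
The ranges mentioned in this reply can be sketched as a small Python classifier. The Unicode limit U+10FFFF and the raw-byte offset #x3FFF00 come from the thread. The exact raw-byte range (#x3FFF80 to #x3FFFFF, that is, the offset plus a byte value of #x80 to #xFF) is an assumption based on the Emacs Lisp manual's description of character codes. The function and constant names are hypothetical.

```python
MAX_UNICODE = 0x10FFFF       # last Unicode code point (from the thread)
RAW_BYTE_OFFSET = 0x3FFF00   # offset quoted in the reply above
RAW_BYTE_FIRST = RAW_BYTE_OFFSET + 0x80  # assumed: raw bytes cover 0x80..0xFF
MAX_EMACS_CHAR = 0x3FFFFF    # assumed upper end of the Emacs character range


def classify(code: int) -> str:
    """Return a rough label for an Emacs character code."""
    if not 0 <= code <= MAX_EMACS_CHAR:
        return "invalid"
    if code <= MAX_UNICODE:
        return "unicode"
    if code >= RAW_BYTE_FIRST:
        return "raw-byte"
    # Everything between the two ranges: characters not unified with Unicode.
    return "non-unicode"


print(classify(0x4E8C))                  # unicode
print(classify(0x1351DE))                # non-unicode
print(classify(RAW_BYTE_OFFSET + 0xF4))  # raw-byte
```

Under these assumptions #x1351de falls well below the raw-byte range, which is why it is reported as a character in the `emacs` charset.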

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#12291; Package emacs. (Fri, 25 Feb 2022 02:34:02 GMT)

Message #34 received at 12291 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 12291 <at> debbugs.gnu.org, smithcu <at> gvsu.edu, wl <at> gnu.org
Subject: Re: bug#12291: [rev 109796] wrong UTF-8 handling
Date: Fri, 25 Feb 2022 03:33:02 +0100

Eli Zaretskii <eliz <at> gnu.org> writes:

> This is the expected behavior.  The raw bytes start at #x3FFF00, so
> #x1351de is some character code reserved for characters not unified with
> Unicode (some CJK encodings have them).  Interpreting them as raw
> bytes would be counter-productive.
>
> I'm not sure what was Werner's problem with this, so maybe let him
> chime in and explain more.

This was a month ago, and there was no followup, so there doesn't seem
to be anything to be done on the Emacs side here, and I'm therefore
closing this bug report.  If there's something that should be changed in
Emacs, please respond to the debbugs address and we'll reopen.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

bug closed, send any further explanations to 12291 <at> debbugs.gnu.org and Werner LEMBERG <wl <at> gnu.org> Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Fri, 25 Feb 2022 02:34:02 GMT)

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 25 Mar 2022 11:24:09 GMT)

This bug report was last modified 3 years and 286 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.