GNU logs - #12291, boring messages


Message sent to bug-gnu-emacs@HIDDEN:


X-Loop: help-debbugs@HIDDEN
Subject: bug#12291: [rev 109796] wrong UTF-8 handling
Resent-From: Werner LEMBERG <wl@HIDDEN>
Original-Sender: debbugs-submit-bounces <at> debbugs.gnu.org
Resent-CC: bug-gnu-emacs@HIDDEN
Resent-Date: Tue, 28 Aug 2012 05:49:02 +0000
Resent-Message-ID: <handler.12291.B.134613291217358 <at> debbugs.gnu.org>
Resent-Sender: help-debbugs@HIDDEN
X-GNU-PR-Message: report 12291
X-GNU-PR-Package: emacs
X-GNU-PR-Keywords: 
To: 12291 <at> debbugs.gnu.org
Cc: Curtis Smith <smithcu@HIDDEN>
X-Debbugs-Original-To: bug-gnu-emacs@HIDDEN
Received: via spool by submit <at> debbugs.gnu.org id=B.134613291217358
          (code B ref -1); Tue, 28 Aug 2012 05:49:02 +0000
Received: (at submit) by debbugs.gnu.org; 28 Aug 2012 05:48:32 +0000
Received: from localhost ([127.0.0.1]:53298 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.72)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1T6Ef8-0004Vs-RC
	for submit <at> debbugs.gnu.org; Tue, 28 Aug 2012 01:48:31 -0400
Received: from eggs.gnu.org ([208.118.235.92]:36974)
	by debbugs.gnu.org with esmtp (Exim 4.72)
	(envelope-from <werner.lemberg@HIDDEN>) id 1T6Ef5-0004Vk-90
	for submit <at> debbugs.gnu.org; Tue, 28 Aug 2012 01:48:28 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <werner.lemberg@HIDDEN>) id 1T6Ee9-0005eB-3a
	for submit <at> debbugs.gnu.org; Tue, 28 Aug 2012 01:47:30 -0400
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org
X-Spam-Level: 
X-Spam-Status: No, score=-6.9 required=5.0 tests=BAYES_00,FREEMAIL_FROM,
	RCVD_IN_DNSWL_HI autolearn=unavailable version=3.3.2
Received: from lists.gnu.org ([208.118.235.17]:46748)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <werner.lemberg@HIDDEN>) id 1T6Ee9-0005e7-03
	for submit <at> debbugs.gnu.org; Tue, 28 Aug 2012 01:47:29 -0400
Received: from eggs.gnu.org ([208.118.235.92]:35595)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <werner.lemberg@HIDDEN>) id 1T6Ee7-00034T-TV
	for bug-gnu-emacs@HIDDEN; Tue, 28 Aug 2012 01:47:28 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <werner.lemberg@HIDDEN>) id 1T6Ee6-0005ds-BJ
	for bug-gnu-emacs@HIDDEN; Tue, 28 Aug 2012 01:47:27 -0400
Received: from mailout-de.gmx.net ([213.165.64.22]:42756)
	by eggs.gnu.org with smtp (Exim 4.71)
	(envelope-from <werner.lemberg@HIDDEN>) id 1T6Ee6-0005df-1y
	for bug-gnu-emacs@HIDDEN; Tue, 28 Aug 2012 01:47:26 -0400
Received: (qmail invoked by alias); 28 Aug 2012 05:47:23 -0000
Received: from 178-190-192-56.adsl.highway.telekom.at (EHLO localhost)
	[178.190.192.56]
	by mail.gmx.net (mp002) with SMTP; 28 Aug 2012 07:47:23 +0200
X-Authenticated: #54312696
X-Provags-ID: V01U2FsdGVkX18SRP1Uhl4S2B8VkSc8PDoPiqQvi21Bu0HwbmTVf5
	FSHgxwctFOMLy2
Date: Tue, 28 Aug 2012 07:47:20 +0200 (CEST)
Message-Id: <20120828.074720.480105751.wl@HIDDEN>
From: Werner LEMBERG <wl@HIDDEN>
X-Mailer: Mew version 6.4rc1 on Emacs 24.2.50.1 / Mule 6.0 (HANACHIRUSATO)
Mime-Version: 1.0
Content-Type: Multipart/Mixed;
	boundary="--Next_Part(Tue_Aug_28_07_47_20_2012_714)--"
Content-Transfer-Encoding: 7bit
X-Y-GMX-Trusted: 0
X-detected-operating-system: by eggs.gnu.org: Genre and OS details not
	recognized.
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6 (newer, 3)
X-Received-From: 208.118.235.17
X-Spam-Score: -5.8 (-----)
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.13
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>,
	<mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>,
	<mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Sender: debbugs-submit-bounces <at> debbugs.gnu.org
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
X-Spam-Score: -5.8 (-----)

----Next_Part(Tue_Aug_28_07_47_20_2012_714)--
Content-Type: Text/Plain; charset=iso-2022-jp
Content-Transfer-Encoding: 7bit


[bzr revision 109796]

Have a look at the attached file, containing a single character.
(It's transmitted as binary to avoid e-mail encoding issues).  It
contains a single, four-byte UTF-8 encoded character (0xF4 0xB5 0x87
0x9E, which would map to the non-existent Unicode character code
U+1351DE).  If I load this file as UTF-8 encoded, Emacs gives this as
the output of `C-u C-x =':

               position: 1 of 2 (0%), column: 0
              character: 二 (displayed as 二) (codepoint 20108, #o47214, #x4e8c)
      preferred charset: unicode (Unicode (ISO10646))
  code point in charset: 0x4E8C
                 syntax: w 	which means: word
               category: .:Base, C:2-byte han, L:Left-to-right (strong), c:Chinese, h:Korean, j:Japanese, |:line breakable
               to input: type "C-x 8 RET HEX-CODEPOINT" or "C-x 8 RET NAME"
            buffer code: #xE4 #xBA #x8C
              file code: #xE4 #xBA #x8C (encoded by coding system utf-8-unix)
                display: by this font (glyph code)
      xft:-unknown-SimSun-normal-normal-normal-*-24-*-*-*-d-0-iso10646-1 (#x460)

  Character code properties: customize what to show
    name: CJK IDEOGRAPH-4E8C
    general-category: Lo (Letter, Other)
    decomposition: (20108) ('二')

Look what Emacs says about the file code.  If I save this
one-character file as UTF-8, the character code stays as-is.

This behaviour is clearly wrong.  I suspect that Emacs is using such a
high character code for internal representation of the `emacs-mule'
encoding.  However, the user must not see this.  Instead, such
characters must be converted to correct UTF-8.
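
The arithmetic behind U+1351DE can be checked by hand.  The following
Python sketch is not part of the original report; it decodes the
attachment's base64 payload and unpacks the four bytes according to the
UTF-8 bit layout, without any range checking.  The helper name
`decode_four_byte' is made up for illustration.

```python
import base64

# Payload of the attached file "emacs-problem.utf8" as given in the report.
raw = base64.b64decode("9LWHngo=")
seq = raw[:4]  # the four-byte sequence; the fifth byte is a newline


def decode_four_byte(b: bytes) -> int:
    """Unpack 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx with no range checks
    (hypothetical helper, for illustration only)."""
    return (
        ((b[0] & 0x07) << 18)
        | ((b[1] & 0x3F) << 12)
        | ((b[2] & 0x3F) << 6)
        | (b[3] & 0x3F)
    )


code_point = decode_four_byte(seq)
print(seq.hex(" "))            # f4 b5 87 9e
print(f"U+{code_point:X}")     # U+1351DE
print(code_point > 0x10FFFF)   # True: beyond the last Unicode code point
```

The result agrees with the value quoted above and lies outside the
Unicode range, which ends at U+10FFFF.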


    Werner


======================================================================

In GNU Emacs 24.2.50.1 (i686-pc-linux-gnu, GTK+ Version 2.24.9)
 of 2012-08-28 on linux-nvf0
Windowing system distributor `The X.Org Foundation', version 11.0.11004000
Configured using:
 `configure 'MAKEINFO=/usr/bin/makeinfo' '--with-x-toolkit=gtk''

Important settings:
  value of $LANG: de_DE.UTF-8
  value of $XMODIFIERS: @im=none
  locale-coding-system: utf-8-unix
  default enable-multibyte-characters: t

Major mode: Summary

Minor modes in effect:
  tooltip-mode: t
  mouse-wheel-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  blink-cursor-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  column-number-mode: t
  transient-mark-mode: t

Recent input:
<return> w b u g - e m <tab> <tab> <tab> <tab> <tab> 
<tab> <tab> <backspace> <backspace> <tab> <tab> C-c 
C-q y M-x w r i t e - e m <tab> C-g C-h a b u g <return> 
<M-next> C-x 1 M-x r e p r t <backspace> <backspace> 
o r t - e m <tab> <return>

Recent messages:
Saving file /home/wl/Mail/draft/11...
Wrote /home/wl/Mail/draft/11
Draft is prepared
No matching alias [7 times]
Kill draft message? (y or n)  y
Saving file /home/wl/Mail/draft/11...
Wrote /home/wl/Mail/draft/11
Draft was killed
Quit
Type C-x 4 C-o RET to restore the other window.  

Load-path shadows:
None found.

Features:
(shadow emacsbug message format-spec rfc822 mml mml-sec mm-decode
mm-bodies mm-encode mail-parse rfc2231 mailabbrev gmm-utils mailheader
sendmail rfc2047 rfc2045 ietf-drums mm-util mail-prsvr mail-utils
apropos descr-text latexenc preview prv-emacs byte-opt tex-buf
noutline outline font-latex warnings bytecomp byte-compile cconv
macroexp latex easy-mmode edmacro kmacro tex-style cus-edit wid-edit
cus-start cus-load pp mew-varsx mew-unix cal-menu calendar
cal-loaddefs mew-auth mew-config mew-imap2 mew-imap mew-nntp2 mew-nntp
mew-pop mew-smtp mew-ssl mew-ssh mew-net mew-highlight mew-sort
mew-fib mew-ext mew-refile mew-demo mew-attach mew-draft mew-message
mew-thread mew-virtual mew-summary4 mew-summary3 mew-summary2
mew-summary mew-search mew-pick mew-passwd mew-scan mew-syntax mew-bq
mew-smime mew-pgp mew-header mew-exec mew-mark mew-mime mew-edit
mew-decode mew-encode mew-cache mew-minibuf mew-complete mew-addrbook
mew-local mew-vars3 mew-vars2 mew-vars mew-env mew-mule3 mew-mule
mew-gemacs mew-key mew-func mew-blvs mew-const mew tex advice help-fns
advice-preload tex-site auto-loads quail help-mode easymenu cjktilde
disp-table time-date tooltip ediff-hook vc-hooks lisp-float-type
mwheel x-win x-dnd tool-bar dnd fontset image regexp-opt fringe
tabulated-list newcomment lisp-mode register page menu-bar rfn-eshadow
timer select scroll-bar mouse jit-lock font-lock syntax facemenu
font-core frame cham georgian utf-8-lang misc-lang vietnamese tibetan
thai tai-viet lao korean japanese hebrew greek romanian slovak czech
european ethiopic indian cyrillic chinese case-table epa-hook
jka-cmpr-hook help simple abbrev minibuffer loaddefs button faces
cus-face files text-properties overlay sha1 md5 base64 format env
code-pages mule custom widget hashtable-print-readable backquote
make-network-process dbusbind dynamic-setting system-font-setting
font-render-setting move-toolbar gtk x-toolkit x multi-tty emacs)

----Next_Part(Tue_Aug_28_07_47_20_2012_714)--
Content-Type: Application/Octet-Stream
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="emacs-problem.utf8"

9LWHngo=

----Next_Part(Tue_Aug_28_07_47_20_2012_714)----




Message sent:


Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
X-Mailer: MIME-tools 5.428 (Entity 5.428)
Content-Type: text/plain; charset=utf-8
X-Loop: help-debbugs@HIDDEN
From: help-debbugs@HIDDEN (GNU bug Tracking System)
To: Werner LEMBERG <wl@HIDDEN>
Subject: bug#12291: Acknowledgement ([rev 109796] wrong UTF-8 handling)
Message-ID: <handler.12291.B.134613291217358.ack <at> debbugs.gnu.org>
References: <20120828.074720.480105751.wl@HIDDEN>
X-Gnu-PR-Message: ack 12291
X-Gnu-PR-Package: emacs
Reply-To: 12291 <at> debbugs.gnu.org
Date: Tue, 28 Aug 2012 05:49:03 +0000

Thank you for filing a new bug report with debbugs.gnu.org.

This is an automatically generated reply to let you know your message
has been received.

Your message is being forwarded to the package maintainers and other
interested parties for their attention; they will reply in due course.

Your message has been sent to the package maintainer(s):
 bug-gnu-emacs@HIDDEN

If you wish to submit further information on this problem, please
send it to 12291 <at> debbugs.gnu.org.

Please do not send mail to help-debbugs@HIDDEN unless you wish
to report a problem with the Bug-tracking system.

--=20
12291: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=3D12291
GNU Bug Tracking System
Contact help-debbugs@HIDDEN with problems


Message sent to bug-gnu-emacs@HIDDEN:


X-Loop: help-debbugs@HIDDEN
Subject: bug#12291: [rev 109796] wrong UTF-8 handling
Resent-From: Andreas Schwab <schwab@HIDDEN>
Original-Sender: debbugs-submit-bounces <at> debbugs.gnu.org
Resent-CC: bug-gnu-emacs@HIDDEN
Resent-Date: Tue, 28 Aug 2012 09:05:02 +0000
Resent-Message-ID: <handler.12291.B12291.13461446726203 <at> debbugs.gnu.org>
Resent-Sender: help-debbugs@HIDDEN
X-GNU-PR-Message: followup 12291
X-GNU-PR-Package: emacs
X-GNU-PR-Keywords: 
To: Werner LEMBERG <wl@HIDDEN>
Cc: 12291 <at> debbugs.gnu.org, Curtis Smith <smithcu@HIDDEN>
Received: via spool by 12291-submit <at> debbugs.gnu.org id=B12291.13461446726203
          (code B ref 12291); Tue, 28 Aug 2012 09:05:02 +0000
Received: (at 12291) by debbugs.gnu.org; 28 Aug 2012 09:04:32 +0000
Received: from localhost ([127.0.0.1]:53697 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.72)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1T6Hiq-0001c0-H7
	for submit <at> debbugs.gnu.org; Tue, 28 Aug 2012 05:04:32 -0400
Received: from mail-out.m-online.net ([212.18.0.10]:59242)
	by debbugs.gnu.org with esmtp (Exim 4.72)
	(envelope-from <whitebox@HIDDEN>) id 1T6Hio-0001bt-9i
	for 12291 <at> debbugs.gnu.org; Tue, 28 Aug 2012 05:04:31 -0400
Received: from frontend1.mail.m-online.net (frontend1.mail.intern.m-online.net
	[192.168.8.180])
	by mail-out.m-online.net (Postfix) with ESMTP id 3X5kXM741Yz3hhgN;
	Tue, 28 Aug 2012 11:03:30 +0200 (CEST)
X-Auth-Info: y20ZYa4KbDSeTD048vUzFo1wAcoK+azw9Ks/R4WcWno=
Received: from igel.home (ppp-93-104-145-159.dynamic.mnet-online.de
	[93.104.145.159])
	by mail.mnet-online.de (Postfix) with ESMTPA id 3X5kXL1Lt6zbbhg;
	Tue, 28 Aug 2012 11:03:30 +0200 (CEST)
Received: by igel.home (Postfix, from userid 501)
	id 7D667CA2A5; Tue, 28 Aug 2012 11:03:29 +0200 (CEST)
From: Andreas Schwab <schwab@HIDDEN>
References: <20120828.074720.480105751.wl@HIDDEN>
X-Yow: How do I get HOME?
Date: Tue, 28 Aug 2012 11:03:28 +0200
In-Reply-To: <20120828.074720.480105751.wl@HIDDEN> (Werner LEMBERG's message
	of "Tue, 28 Aug 2012 07:47:20 +0200 (CEST)")
Message-ID: <m21uirie1r.fsf@HIDDEN>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.2 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
X-Spam-Score: -1.9 (-)

The code points above #x110000 are used for CJK unification.  The utf-8
decoder should probably reject all those codes.
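
For comparison, this is what a strict UTF-8 decoder does.  The small
Python sketch below uses Python's built-in codec, not the Emacs decoder,
so it only illustrates the suggested behaviour: the sequence from the
report is rejected, while the largest valid four-byte sequence,
F4 8F BF BF, is accepted.  The helper name `try_decode' is hypothetical.

```python
def try_decode(data: bytes) -> str:
    """Return the decoded text, or a short description of the failure."""
    try:
        return data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError as exc:
        return f"rejected at byte offset {exc.start}"


largest_valid = bytes([0xF4, 0x8F, 0xBF, 0xBF])  # encodes U+10FFFF
from_report = bytes([0xF4, 0xB5, 0x87, 0x9E])    # would be U+1351DE

print(hex(ord(try_decode(largest_valid))))  # 0x10ffff
print(try_decode(from_report))              # a "rejected ..." message
```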

Andreas.

-- 
Andreas Schwab, schwab@HIDDEN
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."




Message sent to bug-gnu-emacs@HIDDEN:


X-Loop: help-debbugs@HIDDEN
Subject: bug#12291: [rev 109796] wrong UTF-8 handling
References: <20120828.074720.480105751.wl@HIDDEN>
Resent-From: Kenichi Handa <handa@HIDDEN>
Original-Sender: debbugs-submit-bounces <at> debbugs.gnu.org
Resent-CC: bug-gnu-emacs@HIDDEN
Resent-Date: Tue, 28 Aug 2012 15:00:02 +0000
Resent-Message-ID: <handler.12291.B12291.13461659468332 <at> debbugs.gnu.org>
Resent-Sender: help-debbugs@HIDDEN
X-GNU-PR-Message: followup 12291
X-GNU-PR-Package: emacs
X-GNU-PR-Keywords: 
To: Werner LEMBERG <wl@HIDDEN>
Cc: 12291 <at> debbugs.gnu.org, smithcu@HIDDEN
Received: via spool by 12291-submit <at> debbugs.gnu.org id=B12291.13461659468332
          (code B ref 12291); Tue, 28 Aug 2012 15:00:02 +0000
Received: (at 12291) by debbugs.gnu.org; 28 Aug 2012 14:59:06 +0000
Received: from localhost ([127.0.0.1]:54526 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.72)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1T6NFx-0002AK-Pt
	for submit <at> debbugs.gnu.org; Tue, 28 Aug 2012 10:59:06 -0400
Received: from fencepost.gnu.org ([208.118.235.10]:55445)
	by debbugs.gnu.org with esmtp (Exim 4.72)
	(envelope-from <handa@HIDDEN>) id 1T6NFv-0002AC-7j
	for 12291 <at> debbugs.gnu.org; Tue, 28 Aug 2012 10:59:04 -0400
Received: from 126.229.accsnet.ne.jp ([202.220.229.126]:52524 helo=ubuntu)
	by fencepost.gnu.org with esmtpsa (TLS1.0:DHE_RSA_AES_128_CBC_SHA1:16)
	(Exim 4.71) (envelope-from <handa@HIDDEN>)
	id 1T6NEw-0004dL-HT; Tue, 28 Aug 2012 10:58:03 -0400
From: Kenichi Handa <handa@HIDDEN>
In-Reply-To: <20120828.074720.480105751.wl@HIDDEN> (message from Werner
	LEMBERG on Tue, 28 Aug 2012 07:47:20 +0200 (CEST))
Date: Tue, 28 Aug 2012 23:57:39 +0900
Message-ID: <87a9xfdpy4.fsf@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-2022-jp
X-Spam-Score: -7.1 (-------)

In article <20120828.074720.480105751.wl@HIDDEN>, Werner LEMBERG <wl@HIDDEN> writes:

> Have a look at the attached file, containing a single character.
> (It's transmitted as binary to avoid e-mail encoding issues).  It
> contains a single, four-byte UTF-8 encoded character (0xF4 0xB5 0x87
> 0x9E, which would map to the non-existent Unicode character code
> U+1351DE).  If I load this file as UTF-8 encoded, Emacs gives this as
> the output of `C-u C-x =':

>                position: 1 of 2 (0%), column: 0
>               character: 二 (displayed as 二) (codepoint 20108, #o47214, #x4e8c)
[...]
> Look what Emacs says about the file code.  If I save this
> one-character file as UTF-8, the character code stays as-is.

> This behaviour is clearly wrong.

Sure.

> I suspect that Emacs is using such a
> high character code for internal representation of the `emacs-mule'
> encoding.  However, the user must not see this.  

That higher character code area is used for two purposes.

One is for reading CJK characters of legacy encodings (euc,
sjis, big5, etc).  They are decoded into the utf-8-emacs
byte sequence corresponding to the higher character code
area.  On getting their character code, most of them
are unified into Unicode BMP characters, but a few are left
un-unified.  Those are private characters in each legacy
character set.

The other is for supporting non-Unicode characters.  The
biggest set is GB18030.

In both cases, users surely see them.

> Instead, such characters must be converted to correct
> UTF-8.

??? I don't understand what you mean by "correct UTF-8".

I think the correct behaviour on reading such a file by
utf-8 is to treat each byte as raw-byte.

---
Kenichi Handa
handa@HIDDEN




Message sent to bug-gnu-emacs@HIDDEN:


X-Loop: help-debbugs@HIDDEN
Subject: bug#12291: [rev 109796] wrong UTF-8 handling
Resent-From: Werner LEMBERG <wl@HIDDEN>
Original-Sender: debbugs-submit-bounces <at> debbugs.gnu.org
Resent-CC: bug-gnu-emacs@HIDDEN
Resent-Date: Tue, 28 Aug 2012 19:24:02 +0000
Resent-Message-ID: <handler.12291.B12291.134618181631693 <at> debbugs.gnu.org>
Resent-Sender: help-debbugs@HIDDEN
X-GNU-PR-Message: followup 12291
X-GNU-PR-Package: emacs
X-GNU-PR-Keywords: 
To: handa@HIDDEN
Cc: 12291 <at> debbugs.gnu.org, smithcu@HIDDEN
Received: via spool by 12291-submit <at> debbugs.gnu.org id=B12291.134618181631693
          (code B ref 12291); Tue, 28 Aug 2012 19:24:02 +0000
Received: (at 12291) by debbugs.gnu.org; 28 Aug 2012 19:23:36 +0000
Received: from localhost ([127.0.0.1]:54806 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.72)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1T6RNv-0008F7-48
	for submit <at> debbugs.gnu.org; Tue, 28 Aug 2012 15:23:36 -0400
Received: from mailout-de.gmx.net ([213.165.64.22]:60433)
	by debbugs.gnu.org with smtp (Exim 4.72)
	(envelope-from <werner.lemberg@HIDDEN>) id 1T6RNs-0008Ey-Eq
	for 12291 <at> debbugs.gnu.org; Tue, 28 Aug 2012 15:23:33 -0400
Received: (qmail invoked by alias); 28 Aug 2012 19:22:31 -0000
Received: from 178-191-182-81.adsl.highway.telekom.at (EHLO localhost)
	[178.191.182.81]
	by mail.gmx.net (mp024) with SMTP; 28 Aug 2012 21:22:31 +0200
X-Authenticated: #54312696
X-Provags-ID: V01U2FsdGVkX1/CleBxAkLCrVMCneluIEaDPBJ6PZnTcpP3Q/62Ti
	nLdaVoymsX4x1G
Date: Tue, 28 Aug 2012 21:22:26 +0200 (CEST)
Message-Id: <20120828.212226.458921190.wl@HIDDEN>
From: Werner LEMBERG <wl@HIDDEN>
In-Reply-To: <87a9xfdpy4.fsf@HIDDEN>
References: <20120828.074720.480105751.wl@HIDDEN>
	<87a9xfdpy4.fsf@HIDDEN>
X-Mailer: Mew version 6.4rc1 on Emacs 24.2.50.1 / Mule 6.0 (HANACHIRUSATO)
Mime-Version: 1.0
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Y-GMX-Trusted: 0
X-Spam-Score: -1.9 (-)


> In both cases, users surely see them.

OK.  BTW, the real use case is a bug in emacs 23.x which prevented
correct conversion from emacs-mule encoding to utf-8, creating such
funnily encoded utf-8 files (I can't reproduce this problem with my
recently compiled emacs, so it seems that it has been fixed in the
meantime).

>> Instead, such characters must be converted to correct
>> UTF-8.
> 
> ??? I don't understand what you mean by "correct UTF-8".

Sorry, I meant correct Unicode.  U+1351DE is larger than the
largest valid Unicode value.  As my example demonstrates, the Chinese
character in the file is certainly *neither* a private character nor a
character from GB 18030, so it should be converted to a regular
Unicode value.

> I think the correct behaviour on reading such a file by utf-8 is to
> treat each byte as raw-byte.

Maybe.  I'm not sure how Emacs should behave in reading such files.


    Werner




Message sent to bug-gnu-emacs@HIDDEN:


X-Loop: help-debbugs@HIDDEN
Subject: bug#12291: [rev 109796] wrong UTF-8 handling
Resent-From: Eli Zaretskii <eliz@HIDDEN>
Original-Sender: debbugs-submit-bounces <at> debbugs.gnu.org
Resent-CC: bug-gnu-emacs@HIDDEN
Resent-Date: Fri, 31 Aug 2012 10:43:01 +0000
Resent-Message-ID: <handler.12291.B12291.134640972520273 <at> debbugs.gnu.org>
Resent-Sender: help-debbugs@HIDDEN
X-GNU-PR-Message: followup 12291
X-GNU-PR-Package: emacs
X-GNU-PR-Keywords: 
To: Werner LEMBERG <wl@HIDDEN>
Cc: 12291 <at> debbugs.gnu.org, handa@HIDDEN, smithcu@HIDDEN
Reply-To: Eli Zaretskii <eliz@HIDDEN>
Received: via spool by 12291-submit <at> debbugs.gnu.org id=B12291.134640972520273
          (code B ref 12291); Fri, 31 Aug 2012 10:43:01 +0000
Received: (at 12291) by debbugs.gnu.org; 31 Aug 2012 10:42:05 +0000
Received: from localhost ([127.0.0.1]:59101 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.72)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1T7Oft-0005Gw-1V
	for submit <at> debbugs.gnu.org; Fri, 31 Aug 2012 06:42:05 -0400
Received: from mtaout23.012.net.il ([80.179.55.175]:62597)
	by debbugs.gnu.org with esmtp (Exim 4.72)
	(envelope-from <eliz@HIDDEN>) id 1T7Ofp-0005GW-PL
	for 12291 <at> debbugs.gnu.org; Fri, 31 Aug 2012 06:42:03 -0400
Received: from conversion-daemon.a-mtaout23.012.net.il by
	a-mtaout23.012.net.il (HyperSendmail v2007.08) id
	<0M9M00L0088ZSU00@HIDDEN> for
	12291 <at> debbugs.gnu.org; Fri, 31 Aug 2012 13:40:45 +0300 (IDT)
Received: from HOME-C4E4A596F7 ([87.69.4.28]) by a-mtaout23.012.net.il
	(HyperSendmail v2007.08) with ESMTPA id
	<0M9M00L6G8BWSB30@HIDDEN>;
	Fri, 31 Aug 2012 13:40:45 +0300 (IDT)
Date: Fri, 31 Aug 2012 13:40:44 +0300
From: Eli Zaretskii <eliz@HIDDEN>
In-reply-to: <20120828.212226.458921190.wl@HIDDEN>
X-012-Sender: halo1@HIDDEN
Message-id: <83bohrqr83.fsf@HIDDEN>
References: <20120828.074720.480105751.wl@HIDDEN> <87a9xfdpy4.fsf@HIDDEN>
	<20120828.212226.458921190.wl@HIDDEN>
X-Spam-Score: -1.2 (-)

> Date: Tue, 28 Aug 2012 21:22:26 +0200 (CEST)
> From: Werner LEMBERG <wl@HIDDEN>
> Cc: 12291 <at> debbugs.gnu.org, smithcu@HIDDEN
> 
> > I think the correct behaviour on reading such a file by utf-8 is to
> > treat each byte as raw-byte.
> 
> Maybe.  I'm not sure how Emacs should behave in reading such files.

We can either read them as raw bytes, or convert them to u+FFFD.  The
former sounds like a more useful behavior to me, FWIW.




Message sent to bug-gnu-emacs@HIDDEN:


X-Loop: help-debbugs@HIDDEN
Subject: bug#12291: [rev 109796] wrong UTF-8 handling
References: <20120828.074720.480105751.wl@HIDDEN>
Resent-From: Kenichi Handa <handa@HIDDEN>
Original-Sender: debbugs-submit-bounces <at> debbugs.gnu.org
Resent-CC: bug-gnu-emacs@HIDDEN
Resent-Date: Mon, 03 Sep 2012 01:02:02 +0000
Resent-Message-ID: <handler.12291.B12291.134663406414999 <at> debbugs.gnu.org>
Resent-Sender: help-debbugs@HIDDEN
X-GNU-PR-Message: followup 12291
X-GNU-PR-Package: emacs
X-GNU-PR-Keywords: 
To: Eli Zaretskii <eliz@HIDDEN>
Cc: 12291 <at> debbugs.gnu.org, smithcu@HIDDEN, wl@HIDDEN
Received: via spool by 12291-submit <at> debbugs.gnu.org id=B12291.134663406414999
          (code B ref 12291); Mon, 03 Sep 2012 01:02:02 +0000
Received: (at 12291) by debbugs.gnu.org; 3 Sep 2012 01:01:04 +0000
Received: from localhost ([127.0.0.1]:35188 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.72)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1T8L2F-0003ts-Pt
	for submit <at> debbugs.gnu.org; Sun, 02 Sep 2012 21:01:04 -0400
Received: from fencepost.gnu.org ([208.118.235.10]:57514)
	by debbugs.gnu.org with esmtp (Exim 4.72)
	(envelope-from <handa@HIDDEN>) id 1T8L2C-0003tT-M8
	for 12291 <at> debbugs.gnu.org; Sun, 02 Sep 2012 21:01:01 -0400
Received: from [150.29.149.7] (port=64775 helo=ubuntu)
	by fencepost.gnu.org with esmtpsa (TLS1.0:DHE_RSA_AES_128_CBC_SHA1:16)
	(Exim 4.71) (envelope-from <handa@HIDDEN>)
	id 1T8L0j-0001oE-KW; Sun, 02 Sep 2012 20:59:30 -0400
From: Kenichi Handa <handa@HIDDEN>
In-Reply-To: <83bohrqr83.fsf@HIDDEN> (message from Eli Zaretskii on Fri,
	31 Aug 2012 13:40:44 +0300)
Date: Mon, 03 Sep 2012 09:59:22 +0900
Message-ID: <87392zvs45.fsf@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain
X-Spam-Score: -7.1 (-------)

In article <83bohrqr83.fsf@HIDDEN>, Eli Zaretskii <eliz@HIDDEN> writes:

> > Date: Tue, 28 Aug 2012 21:22:26 +0200 (CEST)
> > From: Werner LEMBERG <wl@HIDDEN>
> > Cc: 12291 <at> debbugs.gnu.org, smithcu@HIDDEN
> > 
> > > I think the correct behaviour on reading such a file by utf-8 is to
> > > treat each byte as raw-byte.
> > 
> > Maybe.  I'm not sure how Emacs should behave in reading such files.

> We can either read them as raw bytes, or convert them to u+FFFD.  The
> former sounds like a more useful behavior to me, FWIW.

What to convert to U+FFFD?  Each byte, or the byte sequence?

Anyway, we can't simply convert them to U+FFFD because it
results in change of file contents just by reading and
writing.  We can add post-read-conversion and
pre-write-conversion functions to the coding system utf-8
to perform the conversion (adding text properties for
reverting) and the reversion (using the text properties attached
at the time of reading).  But is it worth doing that?

I think converting each invalid byte to raw-byte is simpler
and equally useful.
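
The round-trip concern can be illustrated outside Emacs.  The Python
sketch below is only an analogy and says nothing about how Emacs
implements either option: Python's `surrogateescape' error handler
stands in for raw bytes, and `replace' stands in for U+FFFD
substitution.

```python
original = bytes([0xF4, 0xB5, 0x87, 0x9E, 0x0A])  # file contents from the report

# Option 1: keep each undecodable byte as a "raw byte" (a lone surrogate).
as_raw = original.decode("utf-8", errors="surrogateescape")
round_trip_raw = as_raw.encode("utf-8", errors="surrogateescape")

# Option 2: substitute U+FFFD for whatever cannot be decoded.
as_fffd = original.decode("utf-8", errors="replace")
round_trip_fffd = as_fffd.encode("utf-8")

print(round_trip_raw == original)   # True: read + write leaves the file intact
print(round_trip_fffd == original)  # False: the original bytes are lost
```

With the raw-byte style handler, reading and writing reproduces the
file byte for byte; with U+FFFD substitution the invalid bytes cannot
be recovered without extra bookkeeping such as the text properties
mentioned above.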

---
Kenichi Handa
handa@HIDDEN




Message sent to bug-gnu-emacs@HIDDEN:


X-Loop: help-debbugs@HIDDEN
Subject: bug#12291: [rev 109796] wrong UTF-8 handling
Resent-From: Eli Zaretskii <eliz@HIDDEN>
Original-Sender: debbugs-submit-bounces <at> debbugs.gnu.org
Resent-CC: bug-gnu-emacs@HIDDEN
Resent-Date: Mon, 03 Sep 2012 02:42:02 +0000
Resent-Message-ID: <handler.12291.B12291.134664011323654 <at> debbugs.gnu.org>
Resent-Sender: help-debbugs@HIDDEN
X-GNU-PR-Message: followup 12291
X-GNU-PR-Package: emacs
X-GNU-PR-Keywords: 
To: Kenichi Handa <handa@HIDDEN>
Cc: 12291 <at> debbugs.gnu.org, smithcu@HIDDEN, wl@HIDDEN
Reply-To: Eli Zaretskii <eliz@HIDDEN>
Received: via spool by 12291-submit <at> debbugs.gnu.org id=B12291.134664011323654
          (code B ref 12291); Mon, 03 Sep 2012 02:42:02 +0000
Received: (at 12291) by debbugs.gnu.org; 3 Sep 2012 02:41:53 +0000
Received: from localhost ([127.0.0.1]:35245 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.72)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1T8Mbp-00069S-0a
	for submit <at> debbugs.gnu.org; Sun, 02 Sep 2012 22:41:53 -0400
Received: from mtaout20.012.net.il ([80.179.55.166]:36283)
	by debbugs.gnu.org with esmtp (Exim 4.72)
	(envelope-from <eliz@HIDDEN>) id 1T8Mbj-00069H-Gb
	for 12291 <at> debbugs.gnu.org; Sun, 02 Sep 2012 22:41:48 -0400
Received: from conversion-daemon.a-mtaout20.012.net.il by
	a-mtaout20.012.net.il (HyperSendmail v2007.08) id
	<0M9R00D0061YJX00@HIDDEN> for
	12291 <at> debbugs.gnu.org; Mon, 03 Sep 2012 05:40:03 +0300 (IDT)
Received: from HOME-C4E4A596F7 ([87.69.4.28]) by a-mtaout20.012.net.il
	(HyperSendmail v2007.08) with ESMTPA id
	<0M9R00DVT62R9X50@HIDDEN>;
	Mon, 03 Sep 2012 05:40:03 +0300 (IDT)
Date: Mon, 03 Sep 2012 05:40:09 +0300
From: Eli Zaretskii <eliz@HIDDEN>
In-reply-to: <87392zvs45.fsf@HIDDEN>
X-012-Sender: halo1@HIDDEN
Message-id: <83627vg77a.fsf@HIDDEN>
References: <87392zvs45.fsf@HIDDEN>
X-Spam-Score: -1.2 (-)

> From: Kenichi Handa <handa@HIDDEN>
> Cc: wl@HIDDEN, 12291 <at> debbugs.gnu.org, smithcu@HIDDEN
> Date: Mon, 03 Sep 2012 09:59:22 +0900
> 
> > We can either read them as raw bytes, or convert them to u+FFFD.  The
> > former sounds like a more useful behavior to me, FWIW.
> 
> What to convert to U+FFFD?  Each byte, or the byte sequence?

The byte sequence.

> Anyway, we can't simply convert them to U+FFFD because it
> results in change of file contents just by reading and
> writing.

Yes, and that's why I prefer the raw-bytes way.

> I think converting each invalid byte to raw-byte is simpler
> and equally useful.

It's more useful, I think.





Last modified: Fri, 31 Oct 2014 17:00:04 UTC

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997 nCipher Corporation Ltd, 1994-97 Ian Jackson.