GNU bug report logs - #12291
[rev 109796] wrong UTF-8 handling

Please note: This is a static page, with minimal formatting, updated once a day.

Package: emacs; Reported by: Werner LEMBERG <wl@HIDDEN>; dated Tue, 28 Aug 2012 05:49:02 UTC; Maintainer for emacs is bug-gnu-emacs@HIDDEN.

Message received at 12291 <at> debbugs.gnu.org:


Received: (at 12291) by debbugs.gnu.org; 3 Sep 2012 02:41:53 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Sun Sep 02 22:41:53 2012
Received: from localhost ([127.0.0.1]:35245 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.72)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1T8Mbp-00069S-0a
	for submit <at> debbugs.gnu.org; Sun, 02 Sep 2012 22:41:53 -0400
Received: from mtaout20.012.net.il ([80.179.55.166]:36283)
	by debbugs.gnu.org with esmtp (Exim 4.72)
	(envelope-from <eliz@HIDDEN>) id 1T8Mbj-00069H-Gb
	for 12291 <at> debbugs.gnu.org; Sun, 02 Sep 2012 22:41:48 -0400
Received: from conversion-daemon.a-mtaout20.012.net.il by
	a-mtaout20.012.net.il (HyperSendmail v2007.08) id
	<0M9R00D0061YJX00@HIDDEN> for
	12291 <at> debbugs.gnu.org; Mon, 03 Sep 2012 05:40:03 +0300 (IDT)
Received: from HOME-C4E4A596F7 ([87.69.4.28]) by a-mtaout20.012.net.il
	(HyperSendmail v2007.08) with ESMTPA id
	<0M9R00DVT62R9X50@HIDDEN>;
	Mon, 03 Sep 2012 05:40:03 +0300 (IDT)
Date: Mon, 03 Sep 2012 05:40:09 +0300
From: Eli Zaretskii <eliz@HIDDEN>
Subject: Re: bug#12291: [rev 109796] wrong UTF-8 handling
In-reply-to: <87392zvs45.fsf@HIDDEN>
X-012-Sender: halo1@HIDDEN
To: Kenichi Handa <handa@HIDDEN>
Message-id: <83627vg77a.fsf@HIDDEN>
References: <87392zvs45.fsf@HIDDEN>
X-Spam-Score: -1.2 (-)
X-Debbugs-Envelope-To: 12291
Cc: 12291 <at> debbugs.gnu.org, smithcu@HIDDEN, wl@HIDDEN
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.13
Precedence: list
Reply-To: Eli Zaretskii <eliz@HIDDEN>
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>,
	<mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>,
	<mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Sender: debbugs-submit-bounces <at> debbugs.gnu.org
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
X-Spam-Score: -1.2 (-)

> From: Kenichi Handa <handa@HIDDEN>
> Cc: wl@HIDDEN, 12291 <at> debbugs.gnu.org, smithcu@HIDDEN
> Date: Mon, 03 Sep 2012 09:59:22 +0900
> 
> > We can either read them as raw bytes, or convert them to U+FFFD.  The
> > former sounds like a more useful behavior to me, FWIW.
> 
> What should be converted to U+FFFD?  Each byte, or the whole byte sequence?

The byte sequence.

> Anyway, we can't simply convert them to U+FFFD, because that
> would change the file contents merely by reading and writing
> the file.

Yes, and that's why I prefer the raw-bytes way.

> I think converting each invalid byte to raw-byte is simpler
> and equally useful.

It's more useful, I think.




Information forwarded to bug-gnu-emacs@HIDDEN:
bug#12291; Package emacs. Full text available.

Message received at 12291 <at> debbugs.gnu.org:


Received: (at 12291) by debbugs.gnu.org; 3 Sep 2012 01:01:04 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Sun Sep 02 21:01:04 2012
Received: from localhost ([127.0.0.1]:35188 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.72)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1T8L2F-0003ts-Pt
	for submit <at> debbugs.gnu.org; Sun, 02 Sep 2012 21:01:04 -0400
Received: from fencepost.gnu.org ([208.118.235.10]:57514)
	by debbugs.gnu.org with esmtp (Exim 4.72)
	(envelope-from <handa@HIDDEN>) id 1T8L2C-0003tT-M8
	for 12291 <at> debbugs.gnu.org; Sun, 02 Sep 2012 21:01:01 -0400
Received: from [150.29.149.7] (port=64775 helo=ubuntu)
	by fencepost.gnu.org with esmtpsa (TLS1.0:DHE_RSA_AES_128_CBC_SHA1:16)
	(Exim 4.71) (envelope-from <handa@HIDDEN>)
	id 1T8L0j-0001oE-KW; Sun, 02 Sep 2012 20:59:30 -0400
From: Kenichi Handa <handa@HIDDEN>
To: Eli Zaretskii <eliz@HIDDEN>
Subject: Re: bug#12291: [rev 109796] wrong UTF-8 handling
In-Reply-To: <83bohrqr83.fsf@HIDDEN> (message from Eli Zaretskii on Fri,
	31 Aug 2012 13:40:44 +0300)
Date: Mon, 03 Sep 2012 09:59:22 +0900
Message-ID: <87392zvs45.fsf@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain
X-Spam-Score: -7.1 (-------)
X-Debbugs-Envelope-To: 12291
Cc: 12291 <at> debbugs.gnu.org, smithcu@HIDDEN, wl@HIDDEN

In article <83bohrqr83.fsf@HIDDEN>, Eli Zaretskii <eliz@HIDDEN> writes:

> > Date: Tue, 28 Aug 2012 21:22:26 +0200 (CEST)
> > From: Werner LEMBERG <wl@HIDDEN>
> > Cc: 12291 <at> debbugs.gnu.org, smithcu@HIDDEN
> > 
> > > I think the correct behaviour when reading such a file as utf-8 is
> > > to treat each byte as raw-byte.
> > 
> > Maybe.  I'm not sure how Emacs should behave when reading such files.

> We can either read them as raw bytes, or convert them to U+FFFD.  The
> former sounds like a more useful behavior to me, FWIW.

What should be converted to U+FFFD?  Each byte, or the whole byte sequence?

Anyway, we can't simply convert them to U+FFFD, because that
would change the file contents merely by reading and writing
the file.  We could add post-read-conversion and
pre-write-conversion functions to the coding system utf-8:
the former would perform the conversion and attach text
properties recording the original bytes, and the latter
would use those text properties to revert the conversion on
writing.  But is it worth doing that?

I think converting each invalid byte to raw-byte is simpler
and equally useful.
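
As an illustration from outside Emacs, the round-trip concern can be
reproduced with Python's UTF-8 codec.  This is only a sketch of the
general problem, not of the Emacs coding-system machinery; the variable
names are chosen for this example.

```python
# Bytes of the attachment in this report: an over-range four-byte sequence
# followed by a newline.
data = b"\xf4\xb5\x87\x9e\n"

# Decoding with replacement substitutes U+FFFD for whatever the decoder
# considers invalid, discarding the original byte values.
text = data.decode("utf-8", errors="replace")

# Writing the text back out produces the UTF-8 encoding of U+FFFD
# (EF BF BD), not the bytes that were read.
rewritten = text.encode("utf-8")

print(repr(text))
print(rewritten == data)
```

The final comparison is false: once the bytes have been replaced, nothing
in the decoded text records what they were, which is why a reversible
scheme would need extra bookkeeping such as the text properties mentioned
above.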

---
Kenichi Handa
handa@HIDDEN




Information forwarded to bug-gnu-emacs@HIDDEN:
bug#12291; Package emacs. Full text available.

Message received at 12291 <at> debbugs.gnu.org:


Received: (at 12291) by debbugs.gnu.org; 31 Aug 2012 10:42:05 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Fri Aug 31 06:42:05 2012
Received: from localhost ([127.0.0.1]:59101 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.72)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1T7Oft-0005Gw-1V
	for submit <at> debbugs.gnu.org; Fri, 31 Aug 2012 06:42:05 -0400
Received: from mtaout23.012.net.il ([80.179.55.175]:62597)
	by debbugs.gnu.org with esmtp (Exim 4.72)
	(envelope-from <eliz@HIDDEN>) id 1T7Ofp-0005GW-PL
	for 12291 <at> debbugs.gnu.org; Fri, 31 Aug 2012 06:42:03 -0400
Received: from conversion-daemon.a-mtaout23.012.net.il by
	a-mtaout23.012.net.il (HyperSendmail v2007.08) id
	<0M9M00L0088ZSU00@HIDDEN> for
	12291 <at> debbugs.gnu.org; Fri, 31 Aug 2012 13:40:45 +0300 (IDT)
Received: from HOME-C4E4A596F7 ([87.69.4.28]) by a-mtaout23.012.net.il
	(HyperSendmail v2007.08) with ESMTPA id
	<0M9M00L6G8BWSB30@HIDDEN>;
	Fri, 31 Aug 2012 13:40:45 +0300 (IDT)
Date: Fri, 31 Aug 2012 13:40:44 +0300
From: Eli Zaretskii <eliz@HIDDEN>
Subject: Re: bug#12291: [rev 109796] wrong UTF-8 handling
In-reply-to: <20120828.212226.458921190.wl@HIDDEN>
X-012-Sender: halo1@HIDDEN
To: Werner LEMBERG <wl@HIDDEN>
Message-id: <83bohrqr83.fsf@HIDDEN>
References: <20120828.074720.480105751.wl@HIDDEN> <87a9xfdpy4.fsf@HIDDEN>
	<20120828.212226.458921190.wl@HIDDEN>
X-Spam-Score: -1.2 (-)
X-Debbugs-Envelope-To: 12291
Cc: 12291 <at> debbugs.gnu.org, handa@HIDDEN, smithcu@HIDDEN
Reply-To: Eli Zaretskii <eliz@HIDDEN>

> Date: Tue, 28 Aug 2012 21:22:26 +0200 (CEST)
> From: Werner LEMBERG <wl@HIDDEN>
> Cc: 12291 <at> debbugs.gnu.org, smithcu@HIDDEN
> 
> > I think the correct behaviour when reading such a file as utf-8 is
> > to treat each byte as raw-byte.
> 
> Maybe.  I'm not sure how Emacs should behave when reading such files.

We can either read them as raw bytes, or convert them to U+FFFD.  The
former sounds like a more useful behavior to me, FWIW.




Information forwarded to bug-gnu-emacs@HIDDEN:
bug#12291; Package emacs. Full text available.

Message received at 12291 <at> debbugs.gnu.org:


Received: (at 12291) by debbugs.gnu.org; 28 Aug 2012 19:23:36 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Tue Aug 28 15:23:36 2012
Received: from localhost ([127.0.0.1]:54806 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.72)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1T6RNv-0008F7-48
	for submit <at> debbugs.gnu.org; Tue, 28 Aug 2012 15:23:36 -0400
Received: from mailout-de.gmx.net ([213.165.64.22]:60433)
	by debbugs.gnu.org with smtp (Exim 4.72)
	(envelope-from <werner.lemberg@HIDDEN>) id 1T6RNs-0008Ey-Eq
	for 12291 <at> debbugs.gnu.org; Tue, 28 Aug 2012 15:23:33 -0400
Received: (qmail invoked by alias); 28 Aug 2012 19:22:31 -0000
Received: from 178-191-182-81.adsl.highway.telekom.at (EHLO localhost)
	[178.191.182.81]
	by mail.gmx.net (mp024) with SMTP; 28 Aug 2012 21:22:31 +0200
X-Authenticated: #54312696
X-Provags-ID: V01U2FsdGVkX1/CleBxAkLCrVMCneluIEaDPBJ6PZnTcpP3Q/62Ti
	nLdaVoymsX4x1G
Date: Tue, 28 Aug 2012 21:22:26 +0200 (CEST)
Message-Id: <20120828.212226.458921190.wl@HIDDEN>
To: handa@HIDDEN
Subject: Re: bug#12291: [rev 109796] wrong UTF-8 handling
From: Werner LEMBERG <wl@HIDDEN>
In-Reply-To: <87a9xfdpy4.fsf@HIDDEN>
References: <20120828.074720.480105751.wl@HIDDEN>
	<87a9xfdpy4.fsf@HIDDEN>
X-Mailer: Mew version 6.4rc1 on Emacs 24.2.50.1 / Mule 6.0 (HANACHIRUSATO)
Mime-Version: 1.0
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Y-GMX-Trusted: 0
X-Spam-Score: -1.9 (-)
X-Debbugs-Envelope-To: 12291
Cc: 12291 <at> debbugs.gnu.org, smithcu@HIDDEN


> In both cases, users will surely see them.

OK.  BTW, the real use case is a bug in Emacs 23.x which prevented
correct conversion from the emacs-mule encoding to utf-8, creating such
oddly encoded utf-8 files (I can't reproduce this problem with my
recently compiled Emacs, so it seems to have been fixed in the
meantime).

>> Instead, such characters must be converted to correct
>> UTF-8.
> 
> ??? I don't understand what you mean by "correct UTF-8".

Sorry, I meant correct Unicode.  U+1351DE is larger than the
largest valid Unicode value.  As my example demonstrates, the Chinese
character in the file is certainly *neither* a private character nor a
character from GB 18030, so it should be converted to a regular
Unicode value.

> I think the correct behaviour when reading such a file as utf-8 is
> to treat each byte as raw-byte.

Maybe.  I'm not sure how Emacs should behave when reading such files.


    Werner




Information forwarded to bug-gnu-emacs@HIDDEN:
bug#12291; Package emacs. Full text available.

Message received at 12291 <at> debbugs.gnu.org:


Received: (at 12291) by debbugs.gnu.org; 28 Aug 2012 14:59:06 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Tue Aug 28 10:59:06 2012
Received: from localhost ([127.0.0.1]:54526 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.72)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1T6NFx-0002AK-Pt
	for submit <at> debbugs.gnu.org; Tue, 28 Aug 2012 10:59:06 -0400
Received: from fencepost.gnu.org ([208.118.235.10]:55445)
	by debbugs.gnu.org with esmtp (Exim 4.72)
	(envelope-from <handa@HIDDEN>) id 1T6NFv-0002AC-7j
	for 12291 <at> debbugs.gnu.org; Tue, 28 Aug 2012 10:59:04 -0400
Received: from 126.229.accsnet.ne.jp ([202.220.229.126]:52524 helo=ubuntu)
	by fencepost.gnu.org with esmtpsa (TLS1.0:DHE_RSA_AES_128_CBC_SHA1:16)
	(Exim 4.71) (envelope-from <handa@HIDDEN>)
	id 1T6NEw-0004dL-HT; Tue, 28 Aug 2012 10:58:03 -0400
From: Kenichi Handa <handa@HIDDEN>
To: Werner LEMBERG <wl@HIDDEN>
Subject: Re: bug#12291: [rev 109796] wrong UTF-8 handling
In-Reply-To: <20120828.074720.480105751.wl@HIDDEN> (message from Werner
	LEMBERG on Tue, 28 Aug 2012 07:47:20 +0200 (CEST))
Date: Tue, 28 Aug 2012 23:57:39 +0900
Message-ID: <87a9xfdpy4.fsf@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-2022-jp
X-Spam-Score: -7.1 (-------)
X-Debbugs-Envelope-To: 12291
Cc: 12291 <at> debbugs.gnu.org, smithcu@HIDDEN

In article <20120828.074720.480105751.wl@HIDDEN>, Werner LEMBERG <wl@HIDDEN> writes:

> Have a look at the attached file, containing a single character.
> (It's transmitted as binary to avoid e-mail encoding issues).  It
> contains a single, four-byte UTF-8 encoded character (0xF4 0xB5 0x87
> 0x9E, which would map to the non-existent Unicode character code
> U+1351DE).  If I load this file as UTF-8 encoded, Emacs gives this as
> the output of `C-u C-x =':

>                position: 1 of 2 (0%), column: 0
>               character: 二 (displayed as 二) (codepoint 20108, #o47214, #x4e8c)
[...]
> Look at what Emacs says about the file code.  If I save this
> one-character file as UTF-8, the character code stays as-is.

> This behaviour is clearly wrong.

Sure.

> I suspect that Emacs is using such a
> high character code for internal representation of the `emacs-mule'
> encoding.  However, the user must not see this.  

That higher character code area is used for two purposes.

One is for reading CJK characters in legacy encodings (euc,
sjis, big5, etc.).  They are decoded into the utf-8-emacs
byte sequences corresponding to the higher character code
area.  When their character codes are retrieved, most of them
are unified into Unicode BMP characters, but a few are left
un-unified.  Those are private characters in each legacy
character set.

The other is for supporting non-Unicode characters.  The
biggest set is GB18030.

In both cases, users will surely see them.

> Instead, such characters must be converted to correct
> UTF-8.

??? I don't understand what you mean by "correct UTF-8".

I think the correct behaviour when reading such a file as
utf-8 is to treat each byte as raw-byte.
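
For comparison only, a similar per-byte approach exists in Python as the
`surrogateescape` error handler, which maps every undecodable byte to its
own placeholder and restores it on encoding.  The sketch below shows the
round-trip property that makes this attractive; it does not show how
Emacs represents raw bytes internally.

```python
# Bytes of the attachment in this report, including the trailing newline.
data = b"\xf4\xb5\x87\x9e\n"

# Each byte the strict decoder rejects becomes one lone surrogate
# (U+DC80..U+DCFF), so no byte value is lost.
text = data.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler turns the placeholders back into bytes.
restored = text.encode("utf-8", errors="surrogateescape")

print([hex(ord(ch)) for ch in text])
print(restored == data)
```

Because every invalid byte keeps its own placeholder, reading and then
writing the file leaves it byte-for-byte unchanged.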

---
Kenichi Handa
handa@HIDDEN




Information forwarded to bug-gnu-emacs@HIDDEN:
bug#12291; Package emacs. Full text available.

Message received at 12291 <at> debbugs.gnu.org:


Received: (at 12291) by debbugs.gnu.org; 28 Aug 2012 09:04:32 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Tue Aug 28 05:04:32 2012
Received: from localhost ([127.0.0.1]:53697 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.72)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1T6Hiq-0001c0-H7
	for submit <at> debbugs.gnu.org; Tue, 28 Aug 2012 05:04:32 -0400
Received: from mail-out.m-online.net ([212.18.0.10]:59242)
	by debbugs.gnu.org with esmtp (Exim 4.72)
	(envelope-from <whitebox@HIDDEN>) id 1T6Hio-0001bt-9i
	for 12291 <at> debbugs.gnu.org; Tue, 28 Aug 2012 05:04:31 -0400
Received: from frontend1.mail.m-online.net (frontend1.mail.intern.m-online.net
	[192.168.8.180])
	by mail-out.m-online.net (Postfix) with ESMTP id 3X5kXM741Yz3hhgN;
	Tue, 28 Aug 2012 11:03:30 +0200 (CEST)
X-Auth-Info: y20ZYa4KbDSeTD048vUzFo1wAcoK+azw9Ks/R4WcWno=
Received: from igel.home (ppp-93-104-145-159.dynamic.mnet-online.de
	[93.104.145.159])
	by mail.mnet-online.de (Postfix) with ESMTPA id 3X5kXL1Lt6zbbhg;
	Tue, 28 Aug 2012 11:03:30 +0200 (CEST)
Received: by igel.home (Postfix, from userid 501)
	id 7D667CA2A5; Tue, 28 Aug 2012 11:03:29 +0200 (CEST)
From: Andreas Schwab <schwab@HIDDEN>
To: Werner LEMBERG <wl@HIDDEN>
Subject: Re: bug#12291: [rev 109796] wrong UTF-8 handling
References: <20120828.074720.480105751.wl@HIDDEN>
X-Yow: How do I get HOME?
Date: Tue, 28 Aug 2012 11:03:28 +0200
In-Reply-To: <20120828.074720.480105751.wl@HIDDEN> (Werner LEMBERG's message
	of "Tue, 28 Aug 2012 07:47:20 +0200 (CEST)")
Message-ID: <m21uirie1r.fsf@HIDDEN>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.2 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
X-Spam-Score: -1.9 (-)
X-Debbugs-Envelope-To: 12291
Cc: 12291 <at> debbugs.gnu.org, Curtis Smith <smithcu@HIDDEN>

The code points above #x110000 are used for CJK unification.  The utf-8
decoder should probably reject all those codes.
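
As a point of comparison (Python, not the Emacs decoder), a strict UTF-8
decoder already behaves this way: it accepts sequences up to U+10FFFF and
refuses anything beyond.  The helper name `is_strict_utf8` is made up for
this sketch.

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True if DATA decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")  # error handling is "strict" by default
    except UnicodeDecodeError:
        return False
    return True

over_range = is_strict_utf8(b"\xf4\xb5\x87\x9e")  # sequence from this report
last_valid = is_strict_utf8(b"\xf4\x8f\xbf\xbf")  # U+10FFFF
ideograph = is_strict_utf8(b"\xe4\xba\x8c")       # U+4E8C

print(over_range, last_valid, ideograph)
```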

Andreas.

-- 
Andreas Schwab, schwab@HIDDEN
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."




Information forwarded to bug-gnu-emacs@HIDDEN:
bug#12291; Package emacs. Full text available.

Message received at submit <at> debbugs.gnu.org:


Received: (at submit) by debbugs.gnu.org; 28 Aug 2012 05:48:32 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Tue Aug 28 01:48:32 2012
Received: from localhost ([127.0.0.1]:53298 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.72)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1T6Ef8-0004Vs-RC
	for submit <at> debbugs.gnu.org; Tue, 28 Aug 2012 01:48:31 -0400
Received: from eggs.gnu.org ([208.118.235.92]:36974)
	by debbugs.gnu.org with esmtp (Exim 4.72)
	(envelope-from <werner.lemberg@HIDDEN>) id 1T6Ef5-0004Vk-90
	for submit <at> debbugs.gnu.org; Tue, 28 Aug 2012 01:48:28 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <werner.lemberg@HIDDEN>) id 1T6Ee9-0005eB-3a
	for submit <at> debbugs.gnu.org; Tue, 28 Aug 2012 01:47:30 -0400
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org
X-Spam-Level: 
X-Spam-Status: No, score=-6.9 required=5.0 tests=BAYES_00,FREEMAIL_FROM,
	RCVD_IN_DNSWL_HI autolearn=unavailable version=3.3.2
Received: from lists.gnu.org ([208.118.235.17]:46748)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <werner.lemberg@HIDDEN>) id 1T6Ee9-0005e7-03
	for submit <at> debbugs.gnu.org; Tue, 28 Aug 2012 01:47:29 -0400
Received: from eggs.gnu.org ([208.118.235.92]:35595)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <werner.lemberg@HIDDEN>) id 1T6Ee7-00034T-TV
	for bug-gnu-emacs@HIDDEN; Tue, 28 Aug 2012 01:47:28 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <werner.lemberg@HIDDEN>) id 1T6Ee6-0005ds-BJ
	for bug-gnu-emacs@HIDDEN; Tue, 28 Aug 2012 01:47:27 -0400
Received: from mailout-de.gmx.net ([213.165.64.22]:42756)
	by eggs.gnu.org with smtp (Exim 4.71)
	(envelope-from <werner.lemberg@HIDDEN>) id 1T6Ee6-0005df-1y
	for bug-gnu-emacs@HIDDEN; Tue, 28 Aug 2012 01:47:26 -0400
Received: (qmail invoked by alias); 28 Aug 2012 05:47:23 -0000
Received: from 178-190-192-56.adsl.highway.telekom.at (EHLO localhost)
	[178.190.192.56]
	by mail.gmx.net (mp002) with SMTP; 28 Aug 2012 07:47:23 +0200
X-Authenticated: #54312696
X-Provags-ID: V01U2FsdGVkX18SRP1Uhl4S2B8VkSc8PDoPiqQvi21Bu0HwbmTVf5
	FSHgxwctFOMLy2
Date: Tue, 28 Aug 2012 07:47:20 +0200 (CEST)
Message-Id: <20120828.074720.480105751.wl@HIDDEN>
To: bug-gnu-emacs@HIDDEN
Subject: [rev 109796] wrong UTF-8 handling
From: Werner LEMBERG <wl@HIDDEN>
X-Mailer: Mew version 6.4rc1 on Emacs 24.2.50.1 / Mule 6.0 (HANACHIRUSATO)
Mime-Version: 1.0
Content-Type: Multipart/Mixed;
	boundary="--Next_Part(Tue_Aug_28_07_47_20_2012_714)--"
Content-Transfer-Encoding: 7bit
X-Y-GMX-Trusted: 0
X-detected-operating-system: by eggs.gnu.org: Genre and OS details not
	recognized.
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6 (newer, 3)
X-Received-From: 208.118.235.17
X-Spam-Score: -5.8 (-----)
X-Debbugs-Envelope-To: submit
Cc: Curtis Smith <smithcu@HIDDEN>

----Next_Part(Tue_Aug_28_07_47_20_2012_714)--
Content-Type: Text/Plain; charset=iso-2022-jp
Content-Transfer-Encoding: 7bit


[bzr revision 109796]

Have a look at the attached file, containing a single character.
(It's transmitted as binary to avoid e-mail encoding issues).  It
contains a single, four-byte UTF-8 encoded character (0xF4 0xB5 0x87
0x9E, which would map to the non-existent Unicode character code
U+1351DE).  If I load this file as UTF-8 encoded, Emacs gives this as
the output of `C-u C-x =':

               position: 1 of 2 (0%), column: 0
              character: 二 (displayed as 二) (codepoint 20108, #o47214, #x4e8c)
      preferred charset: unicode (Unicode (ISO10646))
  code point in charset: 0x4E8C
                 syntax: w 	which means: word
               category: .:Base, C:2-byte han, L:Left-to-right (strong), c:Chinese, h:Korean, j:Japanese, |:line breakable
               to input: type "C-x 8 RET HEX-CODEPOINT" or "C-x 8 RET NAME"
            buffer code: #xE4 #xBA #x8C
              file code: #xE4 #xBA #x8C (encoded by coding system utf-8-unix)
                display: by this font (glyph code)
      xft:-unknown-SimSun-normal-normal-normal-*-24-*-*-*-d-0-iso10646-1 (#x460)

  Character code properties: customize what to show
    name: CJK IDEOGRAPH-4E8C
    general-category: Lo (Letter, Other)
    decomposition: (20108) ('二')

Look at what Emacs says about the file code.  If I save this
one-character file as UTF-8, the character code stays as-is.

This behaviour is clearly wrong.  I suspect that Emacs is using such a
high character code for internal representation of the `emacs-mule'
encoding.  However, the user must not see this.  Instead, such
characters must be converted to correct UTF-8.
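
The code point quoted above can be checked by hand.  The following Python
sketch is an illustration added for reference; the helper name
`decode_four_byte` is hypothetical.  It unpacks the payload bits of a
four-byte sequence without any range check and applies the same
arithmetic to the base64 attachment of this report.

```python
import base64

MAX_UNICODE = 0x10FFFF  # largest code point defined by Unicode

def decode_four_byte(seq: bytes) -> int:
    """Unpack the 21 payload bits of a four-byte UTF-8-style sequence.

    Deliberately performs no range check, so that out-of-range values
    remain visible.
    """
    if len(seq) != 4 or seq[0] & 0xF8 != 0xF0:
        raise ValueError("not a four-byte sequence")
    if any(b & 0xC0 != 0x80 for b in seq[1:]):
        raise ValueError("bad continuation byte")
    return ((seq[0] & 0x07) << 18 | (seq[1] & 0x3F) << 12
            | (seq[2] & 0x3F) << 6 | (seq[3] & 0x3F))

# Body of the attached file emacs-problem.utf8.
attachment = base64.b64decode("9LWHngo=")
code = decode_four_byte(attachment[:4])

print(attachment.hex(), hex(code), code > MAX_UNICODE)
```

The attachment consists of the four bytes named in the report plus a
trailing newline, and the unpacked value 0x1351DE lies above 0x10FFFF.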


    Werner


======================================================================

In GNU Emacs 24.2.50.1 (i686-pc-linux-gnu, GTK+ Version 2.24.9)
 of 2012-08-28 on linux-nvf0
Windowing system distributor `The X.Org Foundation', version 11.0.11004000
Configured using:
 `configure 'MAKEINFO=/usr/bin/makeinfo' '--with-x-toolkit=gtk''

Important settings:
  value of $LANG: de_DE.UTF-8
  value of $XMODIFIERS: @im=none
  locale-coding-system: utf-8-unix
  default enable-multibyte-characters: t

Major mode: Summary

Minor modes in effect:
  tooltip-mode: t
  mouse-wheel-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  blink-cursor-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  column-number-mode: t
  transient-mark-mode: t

Recent input:
<return> w b u g - e m <tab> <tab> <tab> <tab> <tab> 
<tab> <tab> <backspace> <backspace> <tab> <tab> C-c 
C-q y M-x w r i t e - e m <tab> C-g C-h a b u g <return> 
<M-next> C-x 1 M-x r e p r t <backspace> <backspace> 
o r t - e m <tab> <return>

Recent messages:
Saving file /home/wl/Mail/draft/11...
Wrote /home/wl/Mail/draft/11
Draft is prepared
No matching alias [7 times]
Kill draft message? (y or n)  y
Saving file /home/wl/Mail/draft/11...
Wrote /home/wl/Mail/draft/11
Draft was killed
Quit
Type C-x 4 C-o RET to restore the other window.  

Load-path shadows:
None found.

Features:
(shadow emacsbug message format-spec rfc822 mml mml-sec mm-decode
mm-bodies mm-encode mail-parse rfc2231 mailabbrev gmm-utils mailheader
sendmail rfc2047 rfc2045 ietf-drums mm-util mail-prsvr mail-utils
apropos descr-text latexenc preview prv-emacs byte-opt tex-buf
noutline outline font-latex warnings bytecomp byte-compile cconv
macroexp latex easy-mmode edmacro kmacro tex-style cus-edit wid-edit
cus-start cus-load pp mew-varsx mew-unix cal-menu calendar
cal-loaddefs mew-auth mew-config mew-imap2 mew-imap mew-nntp2 mew-nntp
mew-pop mew-smtp mew-ssl mew-ssh mew-net mew-highlight mew-sort
mew-fib mew-ext mew-refile mew-demo mew-attach mew-draft mew-message
mew-thread mew-virtual mew-summary4 mew-summary3 mew-summary2
mew-summary mew-search mew-pick mew-passwd mew-scan mew-syntax mew-bq
mew-smime mew-pgp mew-header mew-exec mew-mark mew-mime mew-edit
mew-decode mew-encode mew-cache mew-minibuf mew-complete mew-addrbook
mew-local mew-vars3 mew-vars2 mew-vars mew-env mew-mule3 mew-mule
mew-gemacs mew-key mew-func mew-blvs mew-const mew tex advice help-fns
advice-preload tex-site auto-loads quail help-mode easymenu cjktilde
disp-table time-date tooltip ediff-hook vc-hooks lisp-float-type
mwheel x-win x-dnd tool-bar dnd fontset image regexp-opt fringe
tabulated-list newcomment lisp-mode register page menu-bar rfn-eshadow
timer select scroll-bar mouse jit-lock font-lock syntax facemenu
font-core frame cham georgian utf-8-lang misc-lang vietnamese tibetan
thai tai-viet lao korean japanese hebrew greek romanian slovak czech
european ethiopic indian cyrillic chinese case-table epa-hook
jka-cmpr-hook help simple abbrev minibuffer loaddefs button faces
cus-face files text-properties overlay sha1 md5 base64 format env
code-pages mule custom widget hashtable-print-readable backquote
make-network-process dbusbind dynamic-setting system-font-setting
font-render-setting move-toolbar gtk x-toolkit x multi-tty emacs)

----Next_Part(Tue_Aug_28_07_47_20_2012_714)--
Content-Type: Application/Octet-Stream
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="emacs-problem.utf8"

9LWHngo=

----Next_Part(Tue_Aug_28_07_47_20_2012_714)----




Acknowledgement sent to Werner LEMBERG <wl@HIDDEN>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs@HIDDEN. Full text available.
Report forwarded to bug-gnu-emacs@HIDDEN:
bug#12291; Package emacs. Full text available.
Last modified: Fri, 31 Oct 2014 17:00:04 UTC

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997 nCipher Corporation Ltd, 1994-97 Ian Jackson.