GNU bug report logs - #34862
27.0.50; Trying to update pinyin.map

Please note: This is a static page, with minimal formatting, updated once a day.

Package: emacs; Reported by: Eric Abrahamsen <eric@HIDDEN>; dated Thu, 14 Mar 2019 21:52:01 UTC; Maintainer for emacs is bug-gnu-emacs@HIDDEN.

Message received at submit <at> debbugs.gnu.org:


Received: (at submit) by debbugs.gnu.org; 15 Mar 2019 18:32:06 +0000
To: bug-gnu-emacs@HIDDEN
From: Eric Abrahamsen <eric@HIDDEN>
Subject: Re: bug#34862: 27.0.50; Trying to update pinyin.map
Date: Fri, 15 Mar 2019 11:31:40 -0700
Message-ID: <871s38at0z.fsf@HIDDEN>
References: <87zhpxyvls.fsf@HIDDEN> <83ftro20gt.fsf@HIDDEN>
 <87o96cbrwp.fsf@HIDDEN> <83ef781uuh.fsf@HIDDEN>

Eli Zaretskii <eliz@HIDDEN> writes:

>> From: Eric Abrahamsen <eric@HIDDEN>
>> Cc: 34862 <at> debbugs.gnu.org
>> Date: Thu, 14 Mar 2019 22:58:14 -0700
>> 
>> > I'm not sure I understand: the encoding of which file would you like
>> > to change?  Could you please clarify?
>> 
>> Sorry, I'm trying to add more characters to ./leim/MISC-DIC/pinyin.map,
>> which is encoded as chinese-iso-8bit-dos, and it can't accept the new
>> characters with that current encoding. That's the file I'd like to
>> change.
>
> That file is imported from an external source, isn't it?  Are you
> saying we should stop synchronizing it with that source, and instead
> fork it, maintain our own separate copy, and never resync with that
> source again?  If so, then I see no reason not to recode it in UTF-8.

Near as I can tell that file was imported into Emacs in 2001 and not
touched since (apart from copyright and encoding stuff). The Debian
package from which it comes seems to have been orphaned in 2003[1]. So
there's not much to either synchronize or fork!

> Btw, I understand that the Google pinyin method is Apache licensed,
> but does this mean we can freely use its data for updating pinyin.map?
> IANAL.  Could you perhaps describe how you intend to extract the data
> from the Google input method for the purpose of updating our file?  I
> think someone will have to audit that process for being legal and
> compatible with both the Apache license and the GPL.

This[2] is the source file I used. I chopped off all the
multiple-character dictionary entries, and munged the remaining data
into the format we need. I.e., lines like this:

八 6677.54934466 0 ba
把 165484.231697 0 ba
吧 385205.434615 0 ba

Became this:

ba 吧把八

This is a straight rearrangement, with frequency of use translated into
a simple ordering of the characters, most frequent first. While this is
obviously pretty manual, and a bit of work, a file like this really only
needs to be updated every five years or so, if that, or whenever someone
thinks of it.
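
For illustration, the rearrangement described above could be scripted
roughly as follows. This is only a sketch, not the process actually
used: the name build_pinyin_map is hypothetical, and it assumes that
every single-character line has exactly four whitespace-separated
fields (character, frequency, a flag, pinyin), as in the sample lines.

```python
from collections import defaultdict


def build_pinyin_map(rawdict_lines):
    """Group single-character entries by pinyin, most frequent first."""
    by_syllable = defaultdict(list)
    for line in rawdict_lines:
        fields = line.split()
        # Assumed layout: character, frequency, flag, pinyin.
        # Multi-character entries carry more than one syllable and
        # therefore more than four fields, so they are skipped here.
        if len(fields) != 4:
            continue
        word, freq, _flag, pinyin = fields
        if len(word) != 1:
            continue
        by_syllable[pinyin].append((float(freq), word))

    result = []
    for pinyin in sorted(by_syllable):
        # Higher frequency sorts earlier in the output line.
        entries = sorted(by_syllable[pinyin], key=lambda e: e[0], reverse=True)
        result.append(pinyin + " " + "".join(ch for _freq, ch in entries))
    return result
```

Fed the three sample lines above, this returns the single line
"ba 吧把八".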

Regarding the license, I'm even less of a lawyer than you, but these[3]
are the terms that cover this data.

> (Also, I'm somewhat surprised that gbk isn't capable of covering the
> characters you want to add.  Or did you not try using it?)

I did not try using it! Mostly because the error message suggested
gb18030 first. gbk also works. I don't have any opinion about encoding,
apart from assuming utf8 unless there's a good reason not to.
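
One way to compare the candidate encodings outside Emacs is simply to
try each of them on the text. The sketch below uses Python's codecs,
on the assumption that they are close enough to the Emacs coding
systems of the same name for a rough check; the helper names are made
up for illustration.

```python
CANDIDATES = ("gb2312", "gbk", "gb18030", "utf-8")


def can_encode(text, codec):
    """Return True if every character of text is representable in codec."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


def covering_codecs(text, candidates=CANDIDATES):
    """List the candidate codecs able to encode text, in the given order."""
    return [codec for codec in candidates if can_encode(text, codec)]


if __name__ == "__main__":
    # Code point 23744 is the example from the original report, where it
    # is described as not encodable with gb2312.
    print(covering_codecs(chr(23744)))
```

For example, covering_codecs("國") leaves out gb2312 but keeps gbk,
gb18030 and utf-8.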

Thanks,
Eric

[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=189523;msg=18

[2]  https://android.googlesource.com/platform/packages/inputmethods/PinyinIME/+/refs/heads/master/jni/data/rawdict_utf16_65105_freq.txt

[3]  https://android.googlesource.com/platform/packages/inputmethods/PinyinIME/+/refs/heads/master/NOTICE






Information forwarded to bug-gnu-emacs@HIDDEN:
bug#34862; Package emacs. Full text available.

Message received at 34862 <at> debbugs.gnu.org:


Received: (at 34862) by debbugs.gnu.org; 15 Mar 2019 07:05:21 +0000
Date: Fri, 15 Mar 2019 09:04:54 +0200
Message-Id: <83ef781uuh.fsf@HIDDEN>
From: Eli Zaretskii <eliz@HIDDEN>
To: Eric Abrahamsen <eric@HIDDEN>
In-reply-to: <87o96cbrwp.fsf@HIDDEN> (message from Eric Abrahamsen
 on Thu, 14 Mar 2019 22:58:14 -0700)
Subject: Re: bug#34862: 27.0.50; Trying to update pinyin.map
References: <87zhpxyvls.fsf@HIDDEN> <83ftro20gt.fsf@HIDDEN>
 <87o96cbrwp.fsf@HIDDEN>
Cc: 34862 <at> debbugs.gnu.org

> From: Eric Abrahamsen <eric@HIDDEN>
> Cc: 34862 <at> debbugs.gnu.org
> Date: Thu, 14 Mar 2019 22:58:14 -0700
> 
> > I'm not sure I understand: the encoding of which file would you like
> > to change?  Could you please clarify?
> 
> Sorry, I'm trying to add more characters to ./leim/MISC-DIC/pinyin.map,
> which is encoded as chinese-iso-8bit-dos, and it can't accept the new
> characters with that current encoding. That's the file I'd like to
> change.

That file is imported from an external source, isn't it?  Are you
saying we should stop synchronizing it with that source, and instead
fork it, maintain our own separate copy, and never resync with that
source again?  If so, then I see no reason not to recode it in UTF-8.
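
If the file were recoded, the mechanical part would be small. A minimal
sketch, assuming the cookie appears literally as "coding: cn-gb-2312"
and that Python's gb2312 codec can read the existing file;
recode_to_utf8 is a hypothetical helper, and line endings are left
untouched:

```python
def recode_to_utf8(raw, old_cookie="coding: cn-gb-2312",
                   new_cookie="coding: utf-8"):
    """Decode GB2312 bytes, swap the coding cookie, return UTF-8 bytes."""
    text = raw.decode("gb2312")
    # Replace only the first occurrence, which is assumed to be the
    # cookie near the top of the file.
    text = text.replace(old_cookie, new_cookie, 1)
    return text.encode("utf-8")
```

The bytes would be read and written in binary mode, so that the DOS
line endings implied by the "-dos" suffix are not silently converted.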

Btw, I understand that the Google pinyin method is Apache licensed,
but does this mean we can freely use its data for updating pinyin.map?
IANAL.  Could you perhaps describe how you intend to extract the data
from the Google input method for the purpose of updating our file?  I
think someone will have to audit that process for being legal and
compatible with both the Apache license and the GPL.

(Also, I'm somewhat surprised that gbk isn't capable of covering the
characters you want to add.  Or did you not try using it?)

Thanks.




Information forwarded to bug-gnu-emacs@HIDDEN:
bug#34862; Package emacs. Full text available.

Message received at 34862 <at> debbugs.gnu.org:


Received: (at 34862) by debbugs.gnu.org; 15 Mar 2019 05:58:24 +0000
From: Eric Abrahamsen <eric@HIDDEN>
To: Eli Zaretskii <eliz@HIDDEN>
Subject: Re: bug#34862: 27.0.50; Trying to update pinyin.map
References: <87zhpxyvls.fsf@HIDDEN> <83ftro20gt.fsf@HIDDEN>
Date: Thu, 14 Mar 2019 22:58:14 -0700
In-Reply-To: <83ftro20gt.fsf@HIDDEN> (Eli Zaretskii's message of "Fri, 15 Mar
 2019 07:03:30 +0200")
Message-ID: <87o96cbrwp.fsf@HIDDEN>
Cc: 34862 <at> debbugs.gnu.org


On 03/15/19 07:03 AM, Eli Zaretskii wrote:
>> From: Eric Abrahamsen <eric@HIDDEN>
>> Date: Thu, 14 Mar 2019 14:49:51 -0700
>> 
>> 
>> As discussed in bug#34215, I'm trying to update the
>> romanization-to-Chinese-character mapping in the
>> file ./leim/MISC-DIC/pinyin.map to use the more complete mapping
>> provided by the Google pinyin input method, licensed under Apache 2.0.
>> This expands the number of characters recognized by Emacs from around
>> 7,000 to around 17,000. (And increases the size of the mapping file from
>> 18K to 53K.)
>> 
>> I'm running into encoding problems when adding the new characters --
>> Emacs says some of the characters can't be written using the existing
>> coding system. The original file has an encoding cookie reading coding:
>> cn-gb-2312, and describing the coding system gives me:
>> 
>> chinese-iso-8bit-dos (alias: cn-gb-2312-dos euc-china-dos euc-cn-dos
>>   cn-gb-dos gb2312-dos)
>> 
>> The characters *can* be encoded using gb18030, and of course utf8. The
>> Wikipedia page for gb18030 describes gb2312 as "legacy"[1], and says
>> gb18030 is a superset of gb2312.
>> 
>> Is there any reason not to go straight to utf8 for this file? If that's
>> not okay, would gb18030 be acceptable?
>
> I'm not sure I understand: the encoding of which file would you like
> to change?  Could you please clarify?

Sorry, I'm trying to add more characters to ./leim/MISC-DIC/pinyin.map,
which is encoded as chinese-iso-8bit-dos, and it can't accept the new
characters with that current encoding. That's the file I'd like to
change.

Thanks,
Eric




Information forwarded to bug-gnu-emacs@HIDDEN:
bug#34862; Package emacs. Full text available.

Message received at 34862 <at> debbugs.gnu.org:


Received: (at 34862) by debbugs.gnu.org; 15 Mar 2019 05:03:57 +0000
Date: Fri, 15 Mar 2019 07:03:30 +0200
Message-Id: <83ftro20gt.fsf@HIDDEN>
From: Eli Zaretskii <eliz@HIDDEN>
To: Eric Abrahamsen <eric@HIDDEN>
In-reply-to: <87zhpxyvls.fsf@HIDDEN> (message from Eric Abrahamsen
 on Thu, 14 Mar 2019 14:49:51 -0700)
Subject: Re: bug#34862: 27.0.50; Trying to update pinyin.map
References: <87zhpxyvls.fsf@HIDDEN>
Cc: 34862 <at> debbugs.gnu.org

> From: Eric Abrahamsen <eric@HIDDEN>
> Date: Thu, 14 Mar 2019 14:49:51 -0700
> 
> 
> As discussed in bug#34215, I'm trying to update the
> romanization-to-Chinese-character mapping in the
> file ./leim/MISC-DIC/pinyin.map to use the more complete mapping
> provided by the Google pinyin input method, licensed under Apache 2.0.
> This expands the number of characters recognized by Emacs from around
> 7,000 to around 17,000. (And increases the size of the mapping file from
> 18K to 53K.)
> 
> I'm running into encoding problems when adding the new characters --
> Emacs says some of the characters can't be written using the existing
> coding system. The original file has an encoding cookie reading coding:
> cn-gb-2312, and describing the coding system gives me:
> 
> chinese-iso-8bit-dos (alias: cn-gb-2312-dos euc-china-dos euc-cn-dos
>   cn-gb-dos gb2312-dos)
> 
> The characters *can* be encoded using gb18030, and of course utf8. The
> Wikipedia page for gb18030 describes gb2312 as "legacy"[1], and says
> gb18030 is a superset of gb2312.
> 
> Is there any reason not to go straight to utf8 for this file? If that's
> not okay, would gb18030 be acceptable?

I'm not sure I understand: the encoding of which file would you like
to change?  Could you please clarify?




Information forwarded to bug-gnu-emacs@HIDDEN:
bug#34862; Package emacs. Full text available.

Message received at submit <at> debbugs.gnu.org:


Received: (at submit) by debbugs.gnu.org; 14 Mar 2019 21:51:19 +0000
From: Eric Abrahamsen <eric@HIDDEN>
To: bug-gnu-emacs@HIDDEN
Subject: 27.0.50; Trying to update pinyin.map
Date: Thu, 14 Mar 2019 14:49:51 -0700
Message-ID: <87zhpxyvls.fsf@HIDDEN>


As discussed in bug#34215, I'm trying to update the
romanization-to-Chinese-character mapping in the
file ./leim/MISC-DIC/pinyin.map to use the more complete mapping
provided by the Google pinyin input method, licensed under Apache 2.0.
This expands the number of characters recognized by Emacs from around
7,000 to around 17,000. (And increases the size of the mapping file from
18K to 53K.)

I'm running into encoding problems when adding the new characters --
Emacs says some of the characters can't be written using the existing
coding system. The original file has an encoding cookie reading coding:
cn-gb-2312, and describing the coding system gives me:

chinese-iso-8bit-dos (alias: cn-gb-2312-dos euc-china-dos euc-cn-dos
  cn-gb-dos gb2312-dos)

The characters *can* be encoded using gb18030, and of course utf8. The
Wikipedia page for gb18030 describes gb2312 as "legacy"[1], and says
gb18030 is a superset of gb2312.

Is there any reason not to go straight to utf8 for this file? If that's
not okay, would gb18030 be acceptable?

Code point 23744 (U+5CC0) is an example of a character that can be
encoded with gb18030 but not gb2312. It also exercises my font engine.

I have two other questions, about reducing vc churn, and how to insert
the license at the top of the file, but I figured I'd ask this first.

Thanks,
Eric

[1]  https://en.wikipedia.org/wiki/GB_18030





Acknowledgement sent to Eric Abrahamsen <eric@HIDDEN>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs@HIDDEN. Full text available.
Report forwarded to bug-gnu-emacs@HIDDEN:
bug#34862; Package emacs. Full text available.
Last modified: Fri, 15 Mar 2019 18:45:02 UTC

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997 nCipher Corporation Ltd, 1994-97 Ian Jackson.