GNU bug report logs - #20789
auto-generate more Unicode data from sources


Package: emacs; Severity: wishlist; Reported by: Glenn Morris <rgm@HIDDEN>; dated Thu, 11 Jun 2015 22:06:02 UTC; Maintainer for emacs is bug-gnu-emacs@HIDDEN.

Message received at 20789 <at> debbugs.gnu.org:


Date: Sat, 27 Jun 2015 10:42:51 +0300
From: Eli Zaretskii <eliz@HIDDEN>
Subject: Re: bug#20789: Invalid script or charset
 name:	cuneiform-numbers-and-punctuation
In-reply-to: <awa8vldi2r.fsf@HIDDEN>
To: Glenn Morris <rgm@HIDDEN>
Message-id: <83a8vld2bo.fsf@HIDDEN>
References: <21zj45kiix.fsf@HIDDEN>
 <rek2v93mux.fsf@HIDDEN> <83y4jpqqjq.fsf@HIDDEN>
 <ozy4jkh58w.fsf@HIDDEN> <834mm7ogv3.fsf@HIDDEN>
 <4cegla7rnj.fsf@HIDDEN> <83eglamha2.fsf@HIDDEN>
 <6pp4qlzti.fsf@HIDDEN> <83mvzthzsr.fsf@HIDDEN>
 <awa8vldi2r.fsf@HIDDEN>
Cc: handa@HIDDEN, 20789 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
Reply-To: Eli Zaretskii <eliz@HIDDEN>
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>

> From: Glenn Morris <rgm@HIDDEN>
> Cc: Kenichi Handa <handa@HIDDEN>,  20789 <at> debbugs.gnu.org
> Date: Fri, 26 Jun 2015 22:02:36 -0400
> 
> Eli Zaretskii wrote:
> 
> >> The width 2 characters look like they might be the "W" and "F" characters,
> >
> > Yes.
> >
> >> but just doing that gives a list that has many differences to the list
> >> Emacs uses.
> 
> This is the list of "F" and "W" characters, compared to the 11 ranges that
> Emacs uses:

Looks good to me.  The 11 ranges we have now are either identical to
or coarser than the ranges in the UCD-derived list that you show.

> > I don't see any significant differences, except perhaps in unassigned
> > codepoints (see paragraph 6.1 of UAX#11 for the treatment of
> > unassigned CJK codepoints).
> 
> I don't know if this means that the above needs modifying?

I was saying that we need to augment the list with the 5 ranges of
unassigned codepoints that UAX#11 treats as wide (CJK blocks and
planes), as described in that section of UAX#11.  An unassigned
codepoint has its 'general-category' property set to 'Cn', and the
list of the 5 ranges could be in some defconst, because it will
probably never change.
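
For illustration only, the check described above might look like the
following Python sketch (Python rather than Emacs Lisp, so it can be
run standalone).  The five ranges are an assumption based on a reading
of UAX#11 and on the ranges quoted earlier in this thread, and the
names CJK_UNASSIGNED_WIDE_RANGES and is_wide_unassigned are
hypothetical; please verify the ranges against the current UAX#11
text.

```python
import unicodedata

# Assumed ranges (hypothetical constant): the CJK blocks and planes whose
# unassigned codepoints UAX#11 treats as wide.  Verify before relying on them.
CJK_UNASSIGNED_WIDE_RANGES = (
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x20000, 0x2FFFD),  # Plane 2
    (0x30000, 0x3FFFD),  # Plane 3
)


def is_wide_unassigned(cp):
    """Return True if cp is unassigned (Cn) and lies in one of the ranges."""
    if unicodedata.category(chr(cp)) != "Cn":
        return False
    return any(lo <= cp <= hi for lo, hi in CJK_UNASSIGNED_WIDE_RANGES)
```

The result depends on the Unicode version of the Python build, since
codepoints that are unassigned today may be assigned later.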

Thanks.




Information forwarded to bug-gnu-emacs@HIDDEN:
bug#20789; Package emacs. Full text available.

Message received at 20789 <at> debbugs.gnu.org:


From: Glenn Morris <rgm@HIDDEN>
To: Eli Zaretskii <eliz@HIDDEN>
Subject: Re: bug#20789: Invalid script or charset
 name:	cuneiform-numbers-and-punctuation
References: <21zj45kiix.fsf@HIDDEN>
 <rek2v93mux.fsf@HIDDEN> <83y4jpqqjq.fsf@HIDDEN>
 <ozy4jkh58w.fsf@HIDDEN> <834mm7ogv3.fsf@HIDDEN>
 <4cegla7rnj.fsf@HIDDEN> <83eglamha2.fsf@HIDDEN>
 <6pp4qlzti.fsf@HIDDEN> <83mvzthzsr.fsf@HIDDEN>
Date: Fri, 26 Jun 2015 22:02:36 -0400
In-Reply-To: <83mvzthzsr.fsf@HIDDEN> (Eli Zaretskii's message of "Sun, 21 Jun
 2015 18:00:20 +0300")
Message-ID: <awa8vldi2r.fsf@HIDDEN>
User-Agent: Gnus (www.gnus.org), GNU Emacs (www.gnu.org/software/emacs/)
MIME-Version: 1.0
Content-Type: text/plain
Cc: Kenichi Handa <handa@HIDDEN>, 20789 <at> debbugs.gnu.org

Eli Zaretskii wrote:

>> The width 2 characters look like they might be the "W" and "F" characters,
>
> Yes.
>
>> but just doing that gives a list that has many differences to the list
>> Emacs uses.

This is the list of "F" and "W" characters, compared to the 11 ranges that
Emacs uses:

(#x1100 . #x115F)
(#x2329 . #x232A)
(#x2E80 . #x2E99)
(#x2E9B . #x2EF3)
(#x2F00 . #x2FD5)
(#x2FF0 . #x2FFB)
(#x3000 . #x303E)
(#x3041 . #x3096)
(#x3099 . #x30FF)
(#x3105 . #x312D)
(#x3131 . #x318E)
(#x3190 . #x31BA)
(#x31C0 . #x31E3)
(#x31F0 . #x321E)
(#x3220 . #x3247)
(#x3250 . #x32FE)
(#x3300 . #x4DBF)
(#x4E00 . #xA48C)
(#xA490 . #xA4C6)
(#xA960 . #xA97C)
(#xAC00 . #xD7A3)
(#xF900 . #xFAFF)
(#xFE10 . #xFE19)
(#xFE30 . #xFE52)
(#xFE54 . #xFE66)
(#xFE68 . #xFE6B)
(#xFF01 . #xFF60)
(#xFFE0 . #xFFE6)
(#x1B000 . #x1B001)
(#x1F200 . #x1F202)
(#x1F210 . #x1F23A)
(#x1F240 . #x1F248)
(#x1F250 . #x1F251)
(#x20000 . #x2FFFD)
(#x30000 . #x3FFFD)

> I don't see any significant differences, except perhaps in unassigned
> codepoints (see paragraph 6.1 of UAX#11 for the treatment of
> unassigned CJK codepoints).

I don't know if this means that the above needs modifying?




Information forwarded to bug-gnu-emacs@HIDDEN:
bug#20789; Package emacs. Full text available.
Severity set to 'wishlist' from 'normal' Request was from Glenn Morris <rgm@HIDDEN> to control <at> debbugs.gnu.org. Full text available.
Changed bug title to 'auto-generate more Unicode data from sources' from 'Invalid script or charset name: cuneiform-numbers-and-punctuation' Request was from Glenn Morris <rgm@HIDDEN> to control <at> debbugs.gnu.org. Full text available.
Did not alter fixed versions and reopened. Request was from Debbugs Internal Request <help-debbugs@HIDDEN> to internal_control <at> debbugs.gnu.org. Full text available.

Message received at 20789 <at> debbugs.gnu.org:


Date: Sun, 21 Jun 2015 18:00:20 +0300
From: Eli Zaretskii <eliz@HIDDEN>
Subject: Re: bug#20789: Invalid script or charset
 name:	cuneiform-numbers-and-punctuation
In-reply-to: <6pp4qlzti.fsf@HIDDEN>
To: Glenn Morris <rgm@HIDDEN>, Kenichi Handa <handa@HIDDEN>
Message-id: <83mvzthzsr.fsf@HIDDEN>
References: <21zj45kiix.fsf@HIDDEN>
 <rek2v93mux.fsf@HIDDEN> <83y4jpqqjq.fsf@HIDDEN>
 <ozy4jkh58w.fsf@HIDDEN> <834mm7ogv3.fsf@HIDDEN>
 <4cegla7rnj.fsf@HIDDEN> <83eglamha2.fsf@HIDDEN>
 <6pp4qlzti.fsf@HIDDEN>
Cc: 20789 <at> debbugs.gnu.org

> From: Glenn Morris <rgm@HIDDEN>
> Cc: 20789 <at> debbugs.gnu.org
> Date: Sat, 20 Jun 2015 19:34:01 -0400
> 
> I spent some time looking at some of these.
> In no case could I see a clear path from the inputs to the outputs.

Thanks for looking into this.

Let me first make a general comment: we can always convert only
certain parts of the setup to an automated procedure, and leave the
rest in its present form, more or less.  That's especially true where
Emacs has specialized needs or defines properties not in Unicode.

> >   . characters.el:
> >
> >     . The modify-category-entry calls -- they basically can be derived
> >       from Blocks.txt
> 
> I looked at it briefly. I can see that they are somewhat related, but
> not precisely how. Eg:
> 
> Emacs: 2E80:312F and 3190:33FF are "line breakable".
> Which means that "Hangul Compatibility Jamo" isn't. I have no idea why.
> 
> Emacs: 3400:4DBF and 4E00:9FAF are "2-byte han".
> Which means that "Yijing Hexagram Symbols" aren't. Again, I have no idea why.
> 
> I didn't look any further.

When I said "derived from Blocks.txt", I meant the categories that are
related to script names, like ASCII, Latin, Arabic, Chinese, etc.
Sorry for not saying that explicitly.

Other categories need other sources.  Here's my attempt to decipher
some of them:

 . ?| -- "line breakable"

   The data seems to be in LineBreak.txt, described in detail in
   UAX#14 (http://unicode.org/reports/tr14/).  It looks like
   characters with the ?| category are those whose line-break
   properties are ID or CJ or NS.  Therefore, the exclusion of Hangul
   Compatibility Jamo is a mistake (or maybe an omission, since the
   comment says "Chinese"); in particular, UAX#14 explicitly says, in
   section 5.1 under "ID", that the characters in the range 3130..318F
   are treated as class ID.

   This category is currently used only by kinsoku.el, which has its
   own data (and sets the ?< and ?> categories).  So this will only
   become important if we ever implement in Emacs something more
   general, like the algorithm described in UAX#14.

 . "2-byte han" -- I think this is related to their legacy encoding; I
   don't see this used anywhere.  Likewise with other 2-byte
   categories.  Perhaps Handa-san (CC'ed) could comment on their
   necessity.  If this is still needed, we should probably leave these
   alone.

 . ?0 - ?9 -- I don't see how to get this data from the UCD or any
   other source.  Some of it seems to be in IndicSyllabicCategory.txt,
   FWIW.

 . ?R and ?L -- already set up using the Unicode data, so no change is
   needed.

 . ?^ -- should be set for any character whose general-category is
   Mn.  Since we already do this, the manual setting around line 820
   is redundant and should be deleted.

 . ?. -- already set using Unicode data, no change needed.
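
As a sketch of the ?| derivation described above (an assumption about
how it could be done, not existing Emacs code), the following Python
checks whether a codepoint falls in a LineBreak.txt entry of class ID,
CJ or NS.  The sample lines are illustrative entries in that file's
"codepoints;class" layout, and has_breakable_class is a hypothetical
name.

```python
LINE_BREAKABLE_CLASSES = {"ID", "CJ", "NS"}


def has_breakable_class(cp, lines):
    """Return True if cp is covered by an entry of class ID, CJ or NS."""
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop trailing comment
        if not data:
            continue
        cps, cls = (field.strip() for field in data.split(";"))
        start, _, end = cps.partition("..")
        lo = int(start, 16)
        hi = int(end, 16) if end else lo
        if lo <= cp <= hi:
            return cls in LINE_BREAKABLE_CLASSES
    return False  # not listed in the given lines


# Illustrative lines in the LineBreak.txt format (comments shortened).
SAMPLE = """\
0041..005A;AL # Lu
3005;NS # Lm
3041;CJ # Lo
3042;ID # Lo
3131..318E;ID # Lo
""".splitlines()
```

With these sample lines, the Hangul Compatibility Jamo letters come
out as breakable, which matches the UAX#14 statement cited above.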

> >     . The setup of char-width-table -- I think the information is in
> >       EastAsianWidth.txt, with background information described in
> >       UAX#11 (http://www.unicode.org/reports/tr11/)
> 
> Looks somewhat promising, but could you be more specific?
> There's nothing in that file that defines "zero width" characters, so I
> don't see where Emacs's width 0 characters come from.

The following rules regarding zero-width characters are due to Markus
Kuhn, and are excerpted from the description in comments to his
implementation of 'wcwidth' (http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c):

 . The null character (U+0000) has a column width of 0.
 . Non-spacing and enclosing combining characters (general category
   code Mn or Me in the Unicode database) have a column width of 0. 
 . ZERO WIDTH SPACE (U+200B) and format characters (general category
   code Cf in the Unicode database), except SOFT HYPHEN (U+00AD), have
   a column width of 0.
 . Hangul Jamo medial vowels and final consonants (U+1160-U+11FF) have
   a column width of 0.
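
The four rules above translate fairly directly into code.  Here is a
minimal Python sketch (not Markus Kuhn's C implementation, and not
what Emacs does today); is_zero_width is a hypothetical name, and the
general categories come from Python's unicodedata module, so results
depend on the Unicode version it ships.

```python
import unicodedata


def is_zero_width(cp):
    """Apply the zero-width rules listed above to codepoint cp."""
    if cp == 0:                    # null character
        return True
    if 0x1160 <= cp <= 0x11FF:     # Hangul Jamo medial vowels, final consonants
        return True
    if cp == 0x200B:               # ZERO WIDTH SPACE
        return True
    category = unicodedata.category(chr(cp))
    if category in ("Mn", "Me"):   # non-spacing and enclosing combining marks
        return True
    # Format characters, except SOFT HYPHEN.
    return category == "Cf" and cp != 0x00AD
```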

> The width 2 characters look like they might be the "W" and "F" characters,

Yes.

> but just doing that gives a list that has many differences to the list
> Emacs uses.

I don't see any significant differences, except perhaps in unassigned
codepoints (see paragraph 6.1 of UAX#11 for the treatment of
unassigned CJK codepoints).  I think any differences beyond that
should be treated as errors in Emacs in this case.

> >     . The setup of char-acronym-table: at least some of the data is in
> >       NameAliases.txt and NameList.txt
> 
> Looks somewhat promising.
> I can see how most of this comes from NameAliases.txt.
> But there are many oddities:
> 
> Why does Emacs not have anything for 0009 (HT or TAB) or 000A (LF, NL,
> or EOL)?

This table is set for the 'acronym' method of glyphless-char-display,
so I guess the omitted characters are ones that no one expected ever
to be displayed as glyphless.  I'd include them in the table anyway,
just in case, and also to keep our exceptions vs the UCD to the bare
minimum.

> 0019 is EOM in the source but EM in Emacs.

Typo, I think.

> 0080 is PAD in the source but XXX in Emacs.
> 0081 is HOP in the source but XXX in Emacs.
> 008F is SS3 in the source but SS1 in Emacs.
> 0099 is SGC in the source but XXX in Emacs.

I think these are typos and perhaps acronyms that whoever wrote this
didn't know.

> How does Emacs choose which entries to list? There are many more in the
> source. Could it do any harm to add more?

As long as you take only "abbreviations", i.e. they are short, I think
we should use all of them, given their use in Emacs.
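
To make that concrete, here is a small Python sketch of pulling the
"abbreviation" entries out of NameAliases.txt.  It assumes the
"codepoint;alias;type" layout of that file; the function name acronyms
and the max_len cutoff are hypothetical, and the sample lines are
illustrative entries for the characters discussed above.

```python
def acronyms(lines, max_len=None):
    """Map codepoint -> list of aliases whose type is 'abbreviation'.

    max_len is a hypothetical cutoff to keep only short abbreviations.
    """
    table = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()  # skip comments and blanks
        if not data:
            continue
        code, alias, kind = data.split(";")
        if kind != "abbreviation":
            continue
        if max_len is not None and len(alias) > max_len:
            continue
        table.setdefault(int(code, 16), []).append(alias)
    return table


# Illustrative lines in the NameAliases.txt format.
SAMPLE = """\
0009;CHARACTER TABULATION;control
0009;HT;abbreviation
0009;TAB;abbreviation
000A;LF;abbreviation
000A;NL;abbreviation
000A;EOL;abbreviation
0019;END OF MEDIUM;control
0019;EOM;abbreviation
""".splitlines()
```

Where a character has several abbreviations, something still has to
pick one for char-acronym-table; this sketch just keeps them all in
file order.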

> Where does "KIVAQ" come from? That appears nowhere in the source AFAICS.

AFAIK, that's the official name of that character.  At least that's
what I glean from Google; I know nothing about the Khmer script.

> Why does Emacs list two Khmer entries, and nothing else? There are loads
> more of them.

These are the only 2 that have such abbreviations; see
https://en.wikipedia.org/wiki/Khmer_alphabet (assuming by "loads more"
you meant the Khmer letters).

> >   . fontset.el:
> >
> >     . The setup of script-representative-chars
> 
> I don't see how. It seems to be "for some of, but not all, the entries
> in char-script-table, choose a single character somewhere in the range."

We should have a representative character for each entry in
char-script-table.  They are used with some font back-ends (xfont and
x?ftfont, AFAIR) to probe candidate fonts for coverage of the required
script, so we should have the full information about that.  I think
the only reason for the partial information we have now is that it is
maintained manually, so it includes whatever the people who worked on
that bothered to add.

> There seems to be no pattern to how the character is chosen within the
> range. Often the first one, but by no means always.

I think the rule is to choose the first one that is a letter, i.e. one
whose general-category is Lu, Ll, Lt, Lo, or Lm.
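
A minimal Python sketch of that rule, assuming the script's range is
known as a pair of codepoints; representative_char is a hypothetical
name, and the categories come from Python's unicodedata module rather
than from Emacs.

```python
import unicodedata

LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lo", "Lm"}


def representative_char(lo, hi):
    """Return the first codepoint in [lo, hi] that is a letter, or None."""
    for cp in range(lo, hi + 1):
        if unicodedata.category(chr(cp)) in LETTER_CATEGORIES:
            return cp
    return None
```

For example, representative_char(0x20, 0x7F) skips the space,
punctuation and digits and returns 0x41 (LATIN CAPITAL LETTER A).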

> >   . mule-cmds.el:
> >
> >     . The setting of locale-language-names -- the data is available in
> >       IANA's Language Subtag Registry
> >       (http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry)
> >       and in ISO 639-2 (http://www.loc.gov/standards/iso639-2/,
> >       http://www.loc.gov/standards/iso639-2/php/English_list.php)
> 
> Again, I don't see how. Eg nowhere in those source files do I see Welsh
> associated with iso-8859-14, and the comment in mule-cmds says that the
> last part is "implementation dependent".

The bulk of the data is the correspondence between the ISO 639
2-letter names and the country/culture name.  The few cases where we
also have the encoding could be set up with a very small database once
the main data is set, by adding the encoding to those few that need
it.
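
Roughly, and only as a sketch under stated assumptions: the Python
below reads records in the layout of the IANA registry (records
separated by "%%", one "Key: value" field per line), keeps the
2-letter language subtags, and then adds encodings from a small
override table.  The names language_names and EXTRA_ENCODINGS are
hypothetical, the sample registry text is illustrative, and the one
override shown is the Welsh example from this thread.

```python
# Hypothetical override table for the few entries that need an encoding.
EXTRA_ENCODINGS = {"cy": "iso-8859-14"}


def language_names(registry_text):
    """Return (subtag, description, encoding-or-None) for 2-letter languages."""
    entries = []
    for record in registry_text.split("%%"):
        fields = {}
        for line in record.strip().splitlines():
            if line[:1].isspace():
                continue  # continuation line; ignored in this sketch
            key, sep, value = line.partition(":")
            if sep:
                # Keep only the first value of a repeated field.
                fields.setdefault(key.strip(), value.strip())
        subtag = fields.get("Subtag", "")
        if fields.get("Type") == "language" and len(subtag) == 2:
            entries.append(
                (subtag, fields.get("Description"), EXTRA_ENCODINGS.get(subtag))
            )
    return entries


# Illustrative text in the registry's record format.
SAMPLE = """\
%%
Type: language
Subtag: cy
Description: Welsh
%%
Type: language
Subtag: de
Description: German
%%
Type: script
Subtag: Latn
Description: Latin
"""
```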

If by "last part" you mean IPA and "Nonstandard or obsolete language
codes", then these are very few and can be added manually.

> > P.S. It would be good to add to somewhere (admin/make-tarball.txt?) a
> > reminder to fetch all those reference files and regenerate their
> > dependencies, before we prepare a release.
> 
> admin/FOR-RELEASE contains that kind of thing.

Right, I will add the information there.

Thanks.




Information forwarded to bug-gnu-emacs@HIDDEN:
bug#20789; Package emacs. Full text available.

Message received at 20789 <at> debbugs.gnu.org:


From: Glenn Morris <rgm@HIDDEN>
To: Eli Zaretskii <eliz@HIDDEN>
Subject: Re: bug#20789: Invalid script or charset
 name:	cuneiform-numbers-and-punctuation
References: <21zj45kiix.fsf@HIDDEN>
 <rek2v93mux.fsf@HIDDEN> <83y4jpqqjq.fsf@HIDDEN>
 <ozy4jkh58w.fsf@HIDDEN> <834mm7ogv3.fsf@HIDDEN>
 <4cegla7rnj.fsf@HIDDEN> <83eglamha2.fsf@HIDDEN>
Date: Sat, 20 Jun 2015 19:34:01 -0400
Message-ID: <6pp4qlzti.fsf@HIDDEN>
User-Agent: Gnus (www.gnus.org), GNU Emacs (www.gnu.org/software/emacs/)
MIME-Version: 1.0
Content-Type: text/plain
Cc: 20789 <at> debbugs.gnu.org


I spent some time looking at some of these.
In no case could I see a clear path from the inputs to the outputs.

Eli Zaretskii wrote:

>   . characters.el:
>
>     . The modify-category-entry calls -- they basically can be derived
>       from Blocks.txt

I looked at it briefly. I can see that they are somewhat related, but
not precisely how. Eg:

Emacs: 2E80:312F and 3190:33FF are "line breakable".
Which means that "Hangul Compatibility Jamo" isn't. I have no idea why.

Emacs: 3400:4DBF and 4E00:9FAF are "2-byte han".
Which means that "Yijing Hexagram Symbols" aren't. Again, I have no idea why.

I didn't look any further.

>     . The modify-syntax-entry and set-case-syntax calls can be derived
>       from the values of the 'general-category' property returned by
>       'get-char-code-property', perhaps augmented by 'paired-bracket'
>       and 'paired-type' properties

I didn't look at this yet.

>     . The set-case-syntax-pair calls (perhaps use the data in
>       CaseFolding.txt, or even the case mapping information in
>       UnicodeData.txt)

I didn't look at this yet.

>     . The setup of char-width-table -- I think the information is in
>       EastAsianWidth.txt, with background information described in
>       UAX#11 (http://www.unicode.org/reports/tr11/)

Looks somewhat promising, but could you be more specific?
There's nothing in that file that defines "zero width" characters, so I
don't see where Emacs's width 0 characters come from.

The width 2 characters look like they might be the "W" and "F" characters,
but just doing that gives a list that has many differences to the list
Emacs uses.

>     . The setup of char-acronym-table: at least some of the data is in
>       NameAliases.txt and NameList.txt

Looks somewhat promising.
I can see how most of this comes from NameAliases.txt.
But there are many oddities:

Why does Emacs not have anything for 0009 (HT or TAB) or 000A (LF, NL,
or EOL)?
0019 is EOM in the source but EM in Emacs.

0080 is PAD in the source but XXX in Emacs.
0081 is HOP in the source but XXX in Emacs.
008F is SS3 in the source but SS1 in Emacs.
0099 is SGC in the source but XXX in Emacs.

How does Emacs choose which entries to list? There are many more in the
source. Could it do any harm to add more?

Where does "KIVAQ" come from? That appears nowhere in the source AFAICS.
Why does Emacs list two Khmer entries, and nothing else? There are loads
more of them.

>   . fontset.el:
>
>     . The setup of script-representative-chars

I don't see how. It seems to be "for some of, but not all, the entries
in char-script-table, choose a single character somewhere in the range."
There seems to be no pattern to how the character is chosen within the
range. Often the first one, but by no means always.

>   . mule-cmds.el:
>
>     . The setting of locale-language-names -- the data is available in
>       IANA's Language Subtag Registry
>       (http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry)
>       and in ISO 639-2 (http://www.loc.gov/standards/iso639-2/,
>       http://www.loc.gov/standards/iso639-2/php/English_list.php)

Again, I don't see how. Eg nowhere in those source files do I see Welsh
associated with iso-8859-14, and the comment in mule-cmds says that the
last part is "implementation dependent".

> P.S. It would be good to add to somewhere (admin/make-tarball.txt?) a
> reminder to fetch all those reference files and regenerate their
> dependencies, before we prepare a release.

admin/FOR-RELEASE contains that kind of thing.




Information forwarded to bug-gnu-emacs@HIDDEN:
bug#20789; Package emacs. Full text available.

Message received at 20789 <at> debbugs.gnu.org:


Date: Wed, 17 Jun 2015 19:27:49 +0300
From: Eli Zaretskii <eliz@HIDDEN>
Subject: Re: bug#20789: Invalid script or charset
 name:	cuneiform-numbers-and-punctuation
In-reply-to: <4cegla7rnj.fsf@HIDDEN>
To: Glenn Morris <rgm@HIDDEN>
Message-id: <83eglamha2.fsf@HIDDEN>
References: <21zj45kiix.fsf@HIDDEN>
 <rek2v93mux.fsf@HIDDEN> <83y4jpqqjq.fsf@HIDDEN>
 <ozy4jkh58w.fsf@HIDDEN> <834mm7ogv3.fsf@HIDDEN>
 <4cegla7rnj.fsf@HIDDEN>
Cc: 20789 <at> debbugs.gnu.org

> From: Glenn Morris <rgm@HIDDEN>
> Cc: 20789 <at> debbugs.gnu.org
> Date: Wed, 17 Jun 2015 02:52:48 -0400
> 
> Is there anything else in international/ that could benefit from being
> auto-generated?

Some.  Things I've spotted:

  . characters.el:

    . The modify-category-entry calls -- they basically can be derived
      from Blocks.txt

    . The modify-syntax-entry and set-case-syntax calls can be derived
      from the values of the 'general-category' property returned by
      'get-char-code-property', perhaps augmented by 'paired-bracket'
      and 'paired-type' properties

    . The set-case-syntax-pair calls (perhaps use the data in
      CaseFolding.txt, or even the case mapping information in
      UnicodeData.txt)

    . The setup of char-width-table -- I think the information is in
      EastAsianWidth.txt, with background information described in
      UAX#11 (http://www.unicode.org/reports/tr11/)

    . The setup of char-acronym-table: at least some of the data is in
      NameAliases.txt and NameList.txt

  . fontset.el:

    . The setup of script-representative-chars

  . mule-cmds.el:

    . The setting of locale-language-names -- the data is available in
      IANA's Language Subtag Registry
      (http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry)
      and in ISO 639-2 (http://www.loc.gov/standards/iso639-2/,
      http://www.loc.gov/standards/iso639-2/php/English_list.php)
      
TIA

P.S. It would be good to add to somewhere (admin/make-tarball.txt?) a
reminder to fetch all those reference files and regenerate their
dependencies, before we prepare a release.




Information forwarded to bug-gnu-emacs@HIDDEN:
bug#20789; Package emacs. Full text available.

Message received at 20789 <at> debbugs.gnu.org:


From: Glenn Morris <rgm@HIDDEN>
To: Eli Zaretskii <eliz@HIDDEN>
Subject: Re: bug#20789: Invalid script or charset
 name:	cuneiform-numbers-and-punctuation
References: <21zj45kiix.fsf@HIDDEN>
 <rek2v93mux.fsf@HIDDEN> <83y4jpqqjq.fsf@HIDDEN>
 <ozy4jkh58w.fsf@HIDDEN> <834mm7ogv3.fsf@HIDDEN>
X-Spook: Environmental terrorist Human to Animal Nerve agent
X-Ran: 8UYm$NVuSa.3ws,WkdUTE##`Wm`Sz|`@R0Pjj@*'`(^sed+uKwn.S)z5Q*I,G(ae%rGO+`
X-Hue: red
X-Debbugs-No-Ack: yes
X-Attribution: GM
Date: Wed, 17 Jun 2015 02:52:48 -0400
Message-ID: <4cegla7rnj.fsf@HIDDEN>
User-Agent: Gnus (www.gnus.org), GNU Emacs (www.gnu.org/software/emacs/)
MIME-Version: 1.0
Content-Type: text/plain
X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address
 (bad octet value).
X-Received-From: 2001:4830:134:3::e
X-Spam-Score: -5.6 (-----)
X-Debbugs-Envelope-To: 20789
Cc: 20789 <at> debbugs.gnu.org

Eli Zaretskii wrote:

> Well, "signwriting" is not a word, AFAIK, it's 2 words [...]

It's a word (in the OED), but in the sense of painting commercial signs.
I don't really care, it's just that ~ 50% of the script is transforming
the Unicode names to the (seemingly randomly chosen) Emacs names.
If the latter were more straightforwardly derived from the former,
things would be simpler. But one more special rule makes no difference.

> P.S. Does the script work with mawk?

Yes, and with Sun OS 5.10's /usr/xpg4/bin/awk (but not /usr/bin/awk).
I don't believe it uses any more features than admin/charsets/*.awk.


Is there anything else in international/ that could benefit from being
auto-generated?




Information forwarded to bug-gnu-emacs@HIDDEN:
bug#20789; Package emacs. Full text available.

Message received at 20789 <at> debbugs.gnu.org:


Received: (at 20789) by debbugs.gnu.org; 16 Jun 2015 14:42:05 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Tue Jun 16 10:42:05 2015
Received: from localhost ([127.0.0.1]:55937 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1Z4s3t-0002jd-KB
	for submit <at> debbugs.gnu.org; Tue, 16 Jun 2015 10:42:05 -0400
Received: from mtaout23.012.net.il ([80.179.55.175]:39146)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <eliz@HIDDEN>) id 1Z4s3n-0002jJ-8e
 for 20789 <at> debbugs.gnu.org; Tue, 16 Jun 2015 10:41:59 -0400
Received: from conversion-daemon.a-mtaout23.012.net.il by
 a-mtaout23.012.net.il (HyperSendmail v2007.08) id
 <0NQ100I00KOMV300@HIDDEN> for 20789 <at> debbugs.gnu.org;
 Tue, 16 Jun 2015 17:41:48 +0300 (IDT)
Received: from HOME-C4E4A596F7 ([87.69.4.28]) by a-mtaout23.012.net.il
 (HyperSendmail v2007.08) with ESMTPA id
 <0NQ100IM9KTOTR20@HIDDEN>;
 Tue, 16 Jun 2015 17:41:48 +0300 (IDT)
Date: Tue, 16 Jun 2015 17:41:36 +0300
From: Eli Zaretskii <eliz@HIDDEN>
Subject: Re: bug#20789: Invalid script or charset
 name:	cuneiform-numbers-and-punctuation
In-reply-to: <ozy4jkh58w.fsf@HIDDEN>
X-012-Sender: halo1@HIDDEN
To: Glenn Morris <rgm@HIDDEN>
Message-id: <834mm7ogv3.fsf@HIDDEN>
MIME-version: 1.0
Content-type: text/plain; charset=utf-8
Content-transfer-encoding: 8BIT
References: <21zj45kiix.fsf@HIDDEN>
 <rek2v93mux.fsf@HIDDEN> <83y4jpqqjq.fsf@HIDDEN>
 <ozy4jkh58w.fsf@HIDDEN>
X-Spam-Score: 1.0 (+)
X-Debbugs-Envelope-To: 20789
Cc: 20789 <at> debbugs.gnu.org
Reply-To: Eli Zaretskii <eliz@HIDDEN>

> From: Glenn Morris <rgm@HIDDEN>
> Cc: 20789 <at> debbugs.gnu.org
> Date: Mon, 15 Jun 2015 20:22:07 -0400
> 
> Eli Zaretskii wrote:
> 
> >> I don't suppose that big list can be auto-generated from the inputs?
> >
> > It's not trivial.  I describe below some of the issues, in the hope
> that Someone™ will volunteer:
> 
> Thanks. Script that processes Blocks.txt attached. Some questions:
> 
> 1. In Blocks.txt:
> 
>   FF00..FFEF; Halfwidth and Fullwidth Forms
> 
> In Emacs:
> 
>   (#xFF00 #xFF5F cjk-misc)
>   (#xFF61 #xFF9F kana)
>   (#xFFE0 #xFFEF cjk-misc)
> 
> Is ff60 (FULLWIDTH RIGHT WHITE PARENTHESIS) intentionally omitted?

AFAICT, there's a small mess around there.  Based on the names of the
pertinent characters, I think we should have this instead of the above
3 ranges:

  (#xFF00 #xFF60 cjk-misc)
  (#xFF61 #xFF9F kana)
  (#xFFA0 #xFFDF hangul)
  (#xFFE0 #xFFEF cjk-misc)
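
As a quick cross-check of the ranges discussed here, the following sketch
(Python, not part of Emacs) reports which code points of a block are not
covered by a list of sub-ranges. The function name `uncovered` is made up for
this example; the ranges are the ones quoted in this message.

```python
def uncovered(block, ranges):
    """Return the sub-ranges of `block` not covered by any of `ranges`."""
    lo, hi = block
    gaps = []
    nxt = lo                          # first code point not yet covered
    for start, end in sorted(ranges):
        if start > nxt:
            gaps.append((nxt, min(start - 1, hi)))
        nxt = max(nxt, end + 1)
        if nxt > hi:
            break
    if nxt <= hi:
        gaps.append((nxt, hi))
    return gaps


BLOCK = (0xFF00, 0xFFEF)              # Halfwidth and Fullwidth Forms
OLD = [(0xFF00, 0xFF5F), (0xFF61, 0xFF9F), (0xFFE0, 0xFFEF)]
NEW = [(0xFF00, 0xFF60), (0xFF61, 0xFF9F), (0xFFA0, 0xFFDF), (0xFFE0, 0xFFEF)]

for label, ranges in (("old", OLD), ("new", NEW)):
    gaps = ["#x%04X..#x%04X" % gap for gap in uncovered(BLOCK, ranges)]
    print(label, gaps)
```

With the old three ranges this reports both FF60 and FFA0..FFDF as uncovered;
with the four ranges proposed above it reports no gaps.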

> 2. In Emacs "olt-italic" looks like a typo ("old-italic"). Can it be renamed?

Yes, please.

> 3. In Blocks.txt, Anatolian Hieroglyphs ends at 1467F.
> In Emacs, it ends at 1457F. Typo?

Yes.

> 4. In Blocks.txt:
> 
>   20000..2A6DF; CJK Unified Ideographs Extension B
>   2A700..2B73F; CJK Unified Ideographs Extension C
>   2B740..2B81F; CJK Unified Ideographs Extension D
>   2B820..2CEAF; CJK Unified Ideographs Extension E
>   2F800..2FA1F; CJK Compatibility Ideographs Supplement
> 
> In Emacs:
> 
>   (#x20000 #x2CEAF han)
>   (#x2F800 #x2FFFF han)
> 
> Emacs adds the ranges 2a6e0:2a6ff and 2fa20:2ffff, which Blocks.txt does
> not cover. Intentional?

I don't know, but probably not intentional.  I think we had better
make it consistent with the UCD.
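
To see what "consistent with the UCD" would give for these blocks, here is a
small sketch (Python, illustrative only) that merges ranges only when they are
directly adjacent, so that gaps between blocks stay visible. The function name
is an assumption; the block boundaries are the ones quoted above.

```python
def merge_adjacent(ranges):
    """Merge ranges that touch or overlap; keep real gaps as separate ranges."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            # Extends the previous range without leaving a hole.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


HAN_BLOCKS = [
    (0x20000, 0x2A6DF),   # CJK Unified Ideographs Extension B
    (0x2A700, 0x2B73F),   # CJK Unified Ideographs Extension C
    (0x2B740, 0x2B81F),   # CJK Unified Ideographs Extension D
    (0x2B820, 0x2CEAF),   # CJK Unified Ideographs Extension E
    (0x2F800, 0x2FA1F),   # CJK Compatibility Ideographs Supplement
]

for start, end in merge_adjacent(HAN_BLOCKS):
    print("(#x%X #x%X han)" % (start, end))
```

Extensions C, D and E collapse into one range, while Extension B stays
separate because 2A6E0..2A6FF belongs to no block.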

> 5. Newly added "sutton-sign-writing" - should be "sutton-signwriting"?
> (The case-insensitive source says "Sutton SignWriting".)

Well, "signwriting" is not a word, AFAIK, it's 2 words (and the funny
camel-case seems to agree with me).  AFAIU, they used "SignWriting"
because it's the commercial name.  But if you insist, I won't...

Thank you for doing this.

P.S. Does the script work with mawk?  (Some systems have it as their
default Awk, I think.)




Information forwarded to bug-gnu-emacs@HIDDEN:
bug#20789; Package emacs. Full text available.

Message received at 20789 <at> debbugs.gnu.org:


Received: (at 20789) by debbugs.gnu.org; 16 Jun 2015 00:22:18 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Mon Jun 15 20:22:18 2015
Received: from localhost ([127.0.0.1]:55052 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1Z4edt-0003OK-PO
	for submit <at> debbugs.gnu.org; Mon, 15 Jun 2015 20:22:18 -0400
Received: from eggs.gnu.org ([208.118.235.92]:54862)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <rgm@HIDDEN>) id 1Z4edr-0003O6-I5
 for 20789 <at> debbugs.gnu.org; Mon, 15 Jun 2015 20:22:16 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <rgm@HIDDEN>) id 1Z4edl-00054k-Co
 for 20789 <at> debbugs.gnu.org; Mon, 15 Jun 2015 20:22:10 -0400
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org
X-Spam-Level: 
X-Spam-Status: No, score=0.6 required=5.0 tests=ALL_TRUSTED,BAYES_50,
 RP_MATCHES_RCVD,UNRESOLVED_TEMPLATE autolearn=disabled version=3.3.2
Received: from fencepost.gnu.org ([208.118.235.10]:36717)
 by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from <rgm@HIDDEN>)
 id 1Z4edl-00054g-8z
 for 20789 <at> debbugs.gnu.org; Mon, 15 Jun 2015 20:22:09 -0400
Received: from rgm by fencepost.gnu.org with local (Exim 4.82)
 (envelope-from <rgm@HIDDEN>)
 id 1Z4edk-0004sU-53; Mon, 15 Jun 2015 20:22:08 -0400
From: Glenn Morris <rgm@HIDDEN>
To: Eli Zaretskii <eliz@HIDDEN>
Subject: Re: bug#20789: Invalid script or charset
 name:	cuneiform-numbers-and-punctuation
References: <21zj45kiix.fsf@HIDDEN>
 <rek2v93mux.fsf@HIDDEN> <83y4jpqqjq.fsf@HIDDEN>
X-Spook: terrorism UOP Cloud fraud PLF National Operations Center
X-Ran: gTwYKK@H;z1<|;%LOYYgv'7Bt[;$y/iJM{Yv$#+/i{-2<0nEG\A"0BoelWd:lyK[e;2vye
X-Hue: cyan
X-Debbugs-No-Ack: yes
X-Attribution: GM
Date: Mon, 15 Jun 2015 20:22:07 -0400
Message-ID: <ozy4jkh58w.fsf@HIDDEN>
User-Agent: Gnus (www.gnus.org), GNU Emacs (www.gnu.org/software/emacs/)
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic]
X-Received-From: 208.118.235.10
X-Spam-Score: -4.7 (----)
X-Debbugs-Envelope-To: 20789
Cc: 20789 <at> debbugs.gnu.org

--=-=-=
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

Eli Zaretskii wrote:

>> I don't suppose that big list can be auto-generated from the inputs?
>
> It's not trivial.  I describe below some of the issues, in the hope
> that Someone™ will volunteer:

Thanks. Script that processes Blocks.txt attached. Some questions:

1. In Blocks.txt:

  FF00..FFEF; Halfwidth and Fullwidth Forms

In Emacs:

  (#xFF00 #xFF5F cjk-misc)
  (#xFF61 #xFF9F kana)
  (#xFFE0 #xFFEF cjk-misc)

Is ff60 (FULLWIDTH RIGHT WHITE PARENTHESIS) intentionally omitted?


2. In Emacs "olt-italic" looks like a typo ("old-italic"). Can it be renamed?


3. In Blocks.txt, Anatolian Hieroglyphs ends at 1467F.
In Emacs, it ends at 1457F. Typo?


4. In Blocks.txt:

  20000..2A6DF; CJK Unified Ideographs Extension B
  2A700..2B73F; CJK Unified Ideographs Extension C
  2B740..2B81F; CJK Unified Ideographs Extension D
  2B820..2CEAF; CJK Unified Ideographs Extension E
  2F800..2FA1F; CJK Compatibility Ideographs Supplement

In Emacs:

  (#x20000 #x2CEAF han)
  (#x2F800 #x2FFFF han)

Emacs adds the ranges 2a6e0:2a6ff and 2fa20:2ffff, which Blocks.txt does
not cover. Intentional?


5. Newly added "sutton-sign-writing" - should be "sutton-signwriting"?
(The case-insensitive source says "Sutton SignWriting".)



--=-=-=
Content-Type: application/octet-stream
Content-Disposition: attachment; filename=blocks.awk
Content-Transfer-Encoding: 7bit

#!/usr/bin/awk -f

## Copyright (C) 2015 Free Software Foundation, Inc.

## Author: Glenn Morris <rgm@HIDDEN>

## This file is part of GNU Emacs.

## GNU Emacs is free software: you can redistribute it and/or modify
## it under the terms of the GNU General Public License as published by
## the Free Software Foundation, either version 3 of the License, or
## (at your option) any later version.

## GNU Emacs is distributed in the hope that it will be useful,
## but WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
## GNU General Public License for more details.

## You should have received a copy of the GNU General Public License
## along with GNU Emacs.  If not, see <http://www.gnu.org/licenses/>.

### Commentary:

## This script takes as input Unicode's Blocks.txt
## (http://www.unicode.org/Public/UNIDATA/Blocks.txt)
## and produces output for Emacs's lisp/international/charscript.el.

## It lumps together all the blocks belonging to the same language.
## E.g., "Basic Latin", "Latin-1 Supplement", "Latin Extended-A",
## etc. are all lumped together under "latin".

## The Unicode blocks actually extend past some of these ranges with
## undefined codepoints.

## For additional details, see <http://debbugs.gnu.org/20789#11>.

### Code:

BEGIN {
    ## Hard-coded names.  See name2alias for the rest.
    alias["ipa extensions"] = "phonetic"
    alias["letterlike symbols"] = "symbol"
    alias["number forms"] = "symbol"
    alias["miscellaneous technical"] = "symbol"
    alias["control pictures"] = "symbol"
    alias["optical character recognition"] = "symbol"
    alias["enclosed alphanumerics"] = "symbol"
    alias["box drawing"] = "symbol"
    alias["block elements"] = "symbol"
    alias["miscellaneous symbols"] = "symbol"
    alias["cjk strokes"] = "cjk-misc"
    alias["cjk symbols and punctuation"] = "cjk-misc"
    alias["halfwidth and fullwidth forms"] = "cjk-misc"
    alias["common indic number forms"] = "north-indic-number"

    tohex["a"] = 10
    tohex["b"] = 11
    tohex["c"] = 12
    tohex["d"] = 13
    tohex["e"] = 14
    tohex["f"] = 15

    fix_start["0080"] = "00A0"
    fix_end["2A6DF"] = "2A6FF"
    fix_end["2FA1F"] = "2FFFF"
}

## From admin/charsets/.
## With gawk's --non-decimal-data switch we wouldn't need this.
function decode_hex(str   , n, len, i, c) {
  n = 0
  len = length(str)
  for (i = 1; i <= len; i++)
    {
      c = substr (str, i, 1)
      if (c >= "0" && c <= "9")
	n = n * 16 + (c - "0")
      else
	n = n * 16 + tohex[tolower(c)]
    }
  return n
}

function name2alias(name   , w, w2) {
    name = tolower(name)
    if (alias[name]) return alias[name]
    else if (name ~ /for symbols/) return "symbol"
    else if (name ~ /latin|combining .* marks|spacing modifier|tone letters|alphabetic presentation/) return "latin"
    else if (name ~ /cjk|yijing|enclosed ideograph|kangxi/) return "han"
    else if (name ~ /arabic/) return "arabic"
    else if (name ~ /^greek/) return "greek"
    else if (name ~ /^coptic/) return "coptic"
    else if (name ~ /cuneiform number/) return "cuneiform-numbers-and-punctuation"
    else if (name ~ /cuneiform/) return "cuneiform"
    else if (name ~ /mathematical alphanumeric symbol/) return "mathematical"
    else if (name ~ /punctuation|mathematical|arrows|currency|superscript|small form variants|geometric|dingbats|enclosed|alchemical|pictograph|emoticon|transport/) return "symbol"
    else if (name ~ /canadian aboriginal/) return "canadian-aboriginal"
    else if (name ~ /katakana|hiragana/) return "kana"
    else if (name ~ /myanmar/) return "burmese"
    else if (name ~ /hangul/) return "hangul"
    else if (name ~ /khmer/) return "khmer"
    else if (name ~ /braille/) return "braille"
    else if (name ~ /^yi /) return "yi"
    else if (name ~ /surrogates|private use|variation selectors/) return 0
    else if (name ~/^(specials|tags)$/) return 0
    else if (name ~ /linear b/) return "linear-b"
    else if (name ~ /aramaic/) return "aramaic"
    else if (name ~ /rumi num/) return "rumi-number"
    else if (name ~ /duployan|shorthand/) return "duployan-shorthand"
    else if (name ~ /sutton signwriting/) return "sutton-sign-writing"

    sub(/ (extended|extensions|supplement).*/, "", name)
    sub(/numbers/, "number", name)
    sub(/numerals/, "numeral", name)
    sub(/symbols/, "symbol", name)
    sub(/forms$/, "form", name)
    sub(/tiles$/, "tile", name)
    sub(/^new /, "", name)
    sub(/ (characters|hieroglyphs|cursive)$/, "", name)
    gsub(/ /, "-", name)

    return name
}

/^[0-9A-F]/ {
    sep = index($1, "..")
    len = length($1)
    s = substr($1,1,sep-1)
    e = substr($1,sep+2,len-sep-2)
    $1 = ""
    sub(/^ */, "", $0)
    i++
    start[i] = fix_start[s] ? fix_start[s] : s
    end[i] = fix_end[e] ? fix_end[e]: e
    name[i] = $0

    alt[i] = name2alias(name[i])

    if (!alt[i])
    {
        i--
        next
    }

    ## Combine adjacent ranges with the same name.
    if (alt[i] == alt[i-1] && decode_hex(start[i]) == 1 + decode_hex(end[i-1]))
    {
        end[i-1] = end[i]
        name[i-1] = (name[i-1] ", " name[i])
        i--
    }

    ## Some hard-coded splits.
    if (start[i] == "0370")
    {
        end[i] = "03E1"
        i++
        start[i] = "03E2"
        end[i] = "03EF"
        alt[i] = "coptic"
        i++
        start[i] = "03F0"
        end[i] = "03FF"
        alt[i] = "greek"
    }
    else if (start[i] == "FB00")
    {
        end[i] = "FB06"
        i++
        start[i] = "FB13"
        end[i] = "FB17"
        alt[i] = "armenian"
        i++
        start[i] = "FB1D"
        end[i] = "FB4F"
        alt[i] = "hebrew"
    }
    else if (start[i] == "FF00")
    {
        end[i] = "FF5F"
        i++
        start[i] = "FF61"
        end[i] = "FF9F"
        alt[i] = "kana"
        i++
        start[i] = "FFE0"
        end[i] = "FFEF"
        alt[i] = "cjk-misc"
    }
}

END {
    print ";;; charscript.el --- character script table -*- no-byte-compile: t -*-"
    print ";;; Automatically generated from admin/unidata/Blocks.txt"
    print "(let (script-list)"
    print "  (dolist (elt '("

    for (j=1;j<=i;j++)
    {
        printf("    (#x%s #x%s %s)", start[j], end[j], alt[j])
        ## Fuzz to decide whether worth printing original name as a comment.
        if (name[j] && alt[j] != tolower(name[j]) && alt[j] !~ /-/)
            printf(" ; %s", name[j])
        printf("\n")
    }

    print "    ))"
    print "    (set-char-table-range char-script-table"
    print "			  (cons (car elt) (nth 1 elt)) (nth 2 elt))"
    print "    (or (memq (nth 2 elt) script-list)"
    print "	(setq script-list (cons (nth 2 elt) script-list))))"
    print "  (set-char-table-extra-slot char-script-table 0 (nreverse script-list)))"
    print ""
    print "(provide 'charscript)"
}
--=-=-=--




Information forwarded to bug-gnu-emacs@HIDDEN:
bug#20789; Package emacs. Full text available.

Message received at 20789 <at> debbugs.gnu.org:


Received: (at 20789) by debbugs.gnu.org; 12 Jun 2015 08:28:24 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Fri Jun 12 04:28:24 2015
Received: from localhost ([127.0.0.1]:51250 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1Z3KK7-0000Mn-JQ
	for submit <at> debbugs.gnu.org; Fri, 12 Jun 2015 04:28:24 -0400
Received: from mtaout22.012.net.il ([80.179.55.172]:42347)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <eliz@HIDDEN>) id 1Z3KK5-0000MX-0t
 for 20789 <at> debbugs.gnu.org; Fri, 12 Jun 2015 04:28:22 -0400
Received: from conversion-daemon.a-mtaout22.012.net.il by
 a-mtaout22.012.net.il (HyperSendmail v2007.08) id
 <0NPT00500OTNVV00@HIDDEN> for 20789 <at> debbugs.gnu.org;
 Fri, 12 Jun 2015 11:28:14 +0300 (IDT)
Received: from HOME-C4E4A596F7 ([87.69.4.28]) by a-mtaout22.012.net.il
 (HyperSendmail v2007.08) with ESMTPA id
 <0NPT005OAOV1SU30@HIDDEN>;
 Fri, 12 Jun 2015 11:28:14 +0300 (IDT)
Date: Fri, 12 Jun 2015 11:28:09 +0300
From: Eli Zaretskii <eliz@HIDDEN>
Subject: Re: bug#20789: Invalid script or charset
 name:	cuneiform-numbers-and-punctuation
In-reply-to: <rek2v93mux.fsf@HIDDEN>
X-012-Sender: halo1@HIDDEN
To: Glenn Morris <rgm@HIDDEN>
Message-id: <83y4jpqqjq.fsf@HIDDEN>
MIME-version: 1.0
Content-type: text/plain; charset=utf-8
Content-transfer-encoding: 8BIT
References: <21zj45kiix.fsf@HIDDEN>
 <rek2v93mux.fsf@HIDDEN>
X-Spam-Score: 1.0 (+)
X-Debbugs-Envelope-To: 20789
Cc: 20789 <at> debbugs.gnu.org
Reply-To: Eli Zaretskii <eliz@HIDDEN>

> From: Glenn Morris <rgm@HIDDEN>
> Date: Thu, 11 Jun 2015 18:24:06 -0400
> 
> Glenn Morris wrote:
> 
> >   Error (initialization): Creation of the default fontsets failed: (error
> >   Invalid script or charset name: cuneiform-numbers-and-punctuation)
> 
> I fixed a typo that seems to have caused that.

Sorry about that.

> I don't suppose that big list can be auto-generated from the inputs?

It's not trivial.  I describe below some of the issues, in the hope
that Someone™ will volunteer:

  . Most of the script names come from the corresponding Unicode
    blocks, with trivial transformations (downcase words and replace
    blanks with a hyphen).  So basically, we will need to use the
    information in Blocks.txt, a file that is part of the Unicode
    Character Database (UCD), but with quirks described below.

  . The first quirk is that we lump together all the blocks that
    belong to the same script, like "Basic Latin", "Latin Extended-A",
    "Latin-1 Supplement", etc. -- these all go to the single script
    called 'latin'.  Likewise with other similar blocks that are
    either "SOMETHING Extended" or "Supplement" or whatever.

  . The second quirk is with the CJK characters: those are divided
    into several broad scripts like 'han', 'kana', and 'cjk-misc'
    whose exact rules I don't know.

  . The third quirk is with the 'symbol' pseudo-script: we lump there
    all punctuation characters and all symbol characters (those for
    which the General Category is one of Pc, Pd, Ps, Pe, Pi, Pf, Po,
    Sm, Sc, Sk, So), but with the following notable exception:
    punctuation characters that belong to blocks that include
    non-punctuation characters are left in those blocks -- those are
    punctuation characters used only with the scripts named by those
    blocks, like U+05BE HEBREW PUNCTUATION MAQAF, which is only used
    by the Hebrew script.

  . Another quirk is that mathematical alphanumerics (which are just
    letters from the Unicode POV) are lumped into a separate script
    'mathematical'.
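
The naming rules in the list above might be sketched roughly as follows
(Python, for illustration; this is not the code used by Emacs). The function
name, the regular expression and the tiny override table are assumptions: a
real generator would need a much larger table for the CJK and 'symbol' quirks.

```python
import re

# Hypothetical override table for blocks the simple rule gets wrong.
OVERRIDES = {
    "basic latin": "latin",
    "latin-1 supplement": "latin",
    "mathematical alphanumeric symbols": "mathematical",
}


def block_to_script(block_name):
    """Derive a script-style name from a Unicode block name."""
    name = block_name.lower()
    if name in OVERRIDES:
        return OVERRIDES[name]
    # Lump "X Extended-A", "X Extensions", "X Supplement" together under "x".
    name = re.sub(r"[ -](extended|extensions|supplement).*$", "", name)
    # The trivial transformation: blanks become hyphens.
    return name.replace(" ", "-")


for block in ("Basic Latin", "Latin Extended-A", "Old Italic",
              "Cuneiform Numbers and Punctuation",
              "Mathematical Alphanumeric Symbols"):
    print("%-36s -> %s" % (block, block_to_script(block)))
```

"Latin-1 Supplement" shows why the override table is needed: the suffix rule
alone would produce "latin-1" rather than "latin".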

Alternatively, one could use Scripts.txt from the UCD, and then the
only problem is to subdivide what they call "Common" into the scripts
we use.

For the general category of a character, one can do in Emacs:

      (get-char-code-property CHAR 'general-category)

Alternatively, one can search UnicodeData.txt directly: the General
Category is the 3rd field there.
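
Outside Emacs, the UnicodeData.txt route could look like this sketch (Python,
illustrative only). The helper names are made up; the category set is the one
listed for the 'symbol' pseudo-script above. The test covers only the first
half of that rule: whether a punctuation character stays with its own script
still depends on what else its block contains.

```python
# General Categories listed above for the 'symbol' pseudo-script.
SYMBOL_CATEGORIES = {"Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po",
                     "Sm", "Sc", "Sk", "So"}


def general_category(unicode_data_line):
    """Return the General Category: the 3rd semicolon-separated field."""
    return unicode_data_line.split(";")[2]


def is_symbol_like(unicode_data_line):
    """True if the character is punctuation or a symbol by General Category."""
    return general_category(unicode_data_line) in SYMBOL_CATEGORIES


# Sample lines in the UnicodeData.txt layout.
MAQAF = "05BE;HEBREW PUNCTUATION MAQAF;Pd;0;R;;;;;N;;;;;"
LATIN_A = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"

for line in (MAQAF, LATIN_A):
    code, name = line.split(";")[:2]
    print("U+%s %s: %s symbol-like=%s"
          % (code, name, general_category(line), is_symbol_like(line)))
```

U+05BE passes the category test, yet it is kept in 'hebrew' because its block
also contains non-punctuation characters.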

Patches are welcome to do all of the above automatically, perhaps with
some small database that expresses the more tricky of the above rules.




Information forwarded to bug-gnu-emacs@HIDDEN:
bug#20789; Package emacs. Full text available.
bug closed, send any further explanations to 20789 <at> debbugs.gnu.org and Glenn Morris <rgm@HIDDEN> Request was from Glenn Morris <rgm@HIDDEN> to control <at> debbugs.gnu.org. Full text available.

Message received at 20789 <at> debbugs.gnu.org:


Received: (at 20789) by debbugs.gnu.org; 11 Jun 2015 22:24:15 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Thu Jun 11 18:24:15 2015
Received: from localhost ([127.0.0.1]:51087 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1Z3AtT-00011Y-H6
	for submit <at> debbugs.gnu.org; Thu, 11 Jun 2015 18:24:15 -0400
Received: from eggs.gnu.org ([208.118.235.92]:40166)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <rgm@HIDDEN>) id 1Z3AtR-00011K-AR
 for 20789 <at> debbugs.gnu.org; Thu, 11 Jun 2015 18:24:14 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <rgm@HIDDEN>) id 1Z3AtL-0004SI-Am
 for 20789 <at> debbugs.gnu.org; Thu, 11 Jun 2015 18:24:08 -0400
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,T_RP_MATCHES_RCVD
 autolearn=disabled version=3.3.2
Received: from fencepost.gnu.org ([2001:4830:134:3::e]:50530)
 by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from <rgm@HIDDEN>)
 id 1Z3AtL-0004SE-7R
 for 20789 <at> debbugs.gnu.org; Thu, 11 Jun 2015 18:24:07 -0400
Received: from rgm by fencepost.gnu.org with local (Exim 4.82)
 (envelope-from <rgm@HIDDEN>)
 id 1Z3AtK-0002SE-LU; Thu, 11 Jun 2015 18:24:06 -0400
From: Glenn Morris <rgm@HIDDEN>
To: 20789 <at> debbugs.gnu.org
Subject: Re: bug#20789: Invalid script or charset name:
 cuneiform-numbers-and-punctuation
References: <21zj45kiix.fsf@HIDDEN>
X-Spook: Suicide bomber Trafficking CIDA UOP digicash Temblor
X-Ran: QWzIxK})m-=&aolblW9bx[=\"&r"e:MJ];!%<c:a0?^|bzmn,/Qf!@MXC;9=8"w`?X{Uq_
X-Hue: red
X-Debbugs-No-Ack: yes
X-Attribution: GM
Date: Thu, 11 Jun 2015 18:24:06 -0400
In-Reply-To: <21zj45kiix.fsf@HIDDEN> (Glenn Morris's message of
 "Thu, 11 Jun 2015 18:05:42 -0400")
Message-ID: <rek2v93mux.fsf@HIDDEN>
User-Agent: Gnus (www.gnus.org), GNU Emacs (www.gnu.org/software/emacs/)
MIME-Version: 1.0
Content-Type: text/plain
X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address
 (bad octet value).
X-Received-From: 2001:4830:134:3::e
X-Spam-Score: -5.0 (-----)
X-Debbugs-Envelope-To: 20789

Glenn Morris wrote:

>   Error (initialization): Creation of the default fontsets failed: (error
>   Invalid script or charset name: cuneiform-numbers-and-punctuation)

I fixed a typo that seems to have caused that.

I don't suppose that big list can be auto-generated from the inputs?

> A second bug: the *Warnings* buffer is not shown at startup, *scratch* is.




Information forwarded to bug-gnu-emacs@HIDDEN:
bug#20789; Package emacs. Full text available.

Message received at submit <at> debbugs.gnu.org:


Received: (at submit) by debbugs.gnu.org; 11 Jun 2015 22:05:54 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Thu Jun 11 18:05:53 2015
Received: from localhost ([127.0.0.1]:51074 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1Z3Abg-0000ag-5B
	for submit <at> debbugs.gnu.org; Thu, 11 Jun 2015 18:05:52 -0400
Received: from eggs.gnu.org ([208.118.235.92]:36046)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <rgm@HIDDEN>) id 1Z3Abc-0000aP-NH
 for submit <at> debbugs.gnu.org; Thu, 11 Jun 2015 18:05:49 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <rgm@HIDDEN>) id 1Z3AbW-0005Vy-Sl
 for submit <at> debbugs.gnu.org; Thu, 11 Jun 2015 18:05:43 -0400
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,T_RP_MATCHES_RCVD
 autolearn=disabled version=3.3.2
Received: from fencepost.gnu.org ([2001:4830:134:3::e]:50240)
 by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from <rgm@HIDDEN>)
 id 1Z3AbW-0005Vu-P7
 for submit <at> debbugs.gnu.org; Thu, 11 Jun 2015 18:05:42 -0400
Received: from rgm by fencepost.gnu.org with local (Exim 4.82)
 (envelope-from <rgm@HIDDEN>)
 id 1Z3AbW-0000nx-CG; Thu, 11 Jun 2015 18:05:42 -0400
From: Glenn Morris <rgm@HIDDEN>
To: submit <at> debbugs.gnu.org
Subject: Invalid script or charset name: cuneiform-numbers-and-punctuation
X-Spook: COSCO Mena CID Suspicious device BLU-114/B UN Consul
X-Ran: qf|z=uq:*6FdoEp:7oMzbob2XGSNe$[?)lw_vIQntMMZI_[VU'V3{[s=?d[ChKS;q!%j<C
X-Hue: magenta
X-Debbugs-No-Ack: yes
X-Attribution: GM
Date: Thu, 11 Jun 2015 18:05:42 -0400
Message-ID: <21zj45kiix.fsf@HIDDEN>
User-Agent: Gnus (www.gnus.org), GNU Emacs (www.gnu.org/software/emacs/)
MIME-Version: 1.0
Content-Type: text/plain
X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address
 (bad octet value).
X-Received-From: 2001:4830:134:3::e
X-Spam-Score: -5.0 (-----)
X-Debbugs-Envelope-To: submit
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -5.0 (-----)

Package: emacs
Version: 25.0.50

Current master on x86_64 RHEL 7.1.

emacs -Q: All looks fine, but there is a *Warnings* buffer with contents:

  Error (initialization): Creation of the default fontsets failed: (error
  Invalid script or charset name: cuneiform-numbers-and-punctuation)

A second bug: the *Warnings* buffer is not shown at startup, *scratch* is.




Report forwarded to bug-gnu-emacs@HIDDEN:
bug#20789; Package emacs. Full text available.
Last modified: Mon, 25 Nov 2019 12:00:02 UTC

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997 nCipher Corporation Ltd, 1994-97 Ian Jackson.