GNU bug report logs - #36923
Combining Diacritical Marks are not Latin only

Package: emacs;

Reported by: Juri Linkov <juri <at> linkov.net>

Date: Sun, 4 Aug 2019 20:50:02 UTC

Severity: normal

Done: Juri Linkov <juri <at> linkov.net>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 36923 in the body.
You can then email your comments to 36923 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#36923; Package emacs. (Sun, 04 Aug 2019 20:50:02 GMT)

Acknowledgement sent to Juri Linkov <juri <at> linkov.net>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Sun, 04 Aug 2019 20:50:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Juri Linkov <juri <at> linkov.net>
To: bug-gnu-emacs <at> gnu.org
Subject: Combining Diacritical Marks are not Latin only
Date: Sun, 04 Aug 2019 23:40:38 +0300

The generated file lisp/international/charscript.el
assigns the block “Combining Diacritical Marks” to the ‘latin’ script
on the assumption that these characters are used only in Latin.

But in fact according to e.g. https://en.wikipedia.org/wiki/Acute_accent
the acute accent marks the stressed vowel of a word in several languages
with alphabets based on the Latin, Cyrillic, and Greek scripts.
In particular https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode
mentions how characters from other blocks are used in Cyrillic script.
Moreover, the Combining Diacritical Marks block also
contains several characters from the Greek script:
COMBINING GREEK PERISPOMENI, COMBINING GREEK KORONIS
COMBINING GREEK DIALYTIKA TONOS, COMBINING GREEK YPOGEGRAMMENI

I noticed this problem recently while helping to develop char-fold where
GREEK SMALL LETTER IOTA combined with COMBINING GREEK DIALYTIKA TONOS was
alarmingly highlighted as “mixed scripts” by markchars-mode from GNU ELPA.

Of course, it's possible to add exceptions for characters in this block
in markchars-mode.  But before doing this, I'm asking for confirmation
whether the Unicode data should be fixed in ‘char-script-table’, so e.g.

  (aref char-script-table ?\N{COMBINING ACUTE ACCENT})

could return

  (latin greek cyrillic)

instead of the current

  latin




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#36923; Package emacs. (Mon, 05 Aug 2019 16:09:02 GMT)

Message #8 received at 36923 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Juri Linkov <juri <at> linkov.net>
Cc: 36923 <at> debbugs.gnu.org
Subject: Re: bug#36923: Combining Diacritical Marks are not Latin only
Date: Mon, 05 Aug 2019 19:08:21 +0300

> From: Juri Linkov <juri <at> linkov.net>
> Date: Sun, 04 Aug 2019 23:40:38 +0300
> 
> The generated file lisp/international/charscript.el
> assigns the block “Combining Diacritical Marks” to the ‘latin’ script
> on the assumption that these characters are used only in Latin.
> 
> But in fact according to e.g. https://en.wikipedia.org/wiki/Acute_accent
> the acute accent marks the stressed vowel of a word in several languages
> with alphabets based on the Latin, Cyrillic, and Greek scripts.
> In particular https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode
> mentions how characters from other blocks are used in Cyrillic script.
> Moreover, the Combining Diacritical Marks block also
> contains several characters from the Greek script:
> COMBINING GREEK PERISPOMENI, COMBINING GREEK KORONIS
> COMBINING GREEK DIALYTIKA TONOS, COMBINING GREEK YPOGEGRAMMENI
> 
> I noticed this problem recently while helping to develop char-fold where
> GREEK SMALL LETTER IOTA combined with COMBINING GREEK DIALYTIKA TONOS was
> alarmingly highlighted as “mixed scripts” by markchars-mode from GNU ELPA.
> 
> Of course, it's possible to add exceptions for characters in this block
> in markchars-mode.  But before doing this, I'm asking for confirmation
> whether the Unicode data should be fixed in ‘char-script-table’, so e.g.
> 
>   (aref char-script-table ?\N{COMBINING ACUTE ACCENT})
> 
> could return
> 
>   (latin greek cyrillic)
> 
> instead of the current
> 
>   latin

char-script-table is documented to yield a single symbol, so returning
a list would be an incompatible change, which we should avoid.

More generally, I think what you describe is a clear conceptual bug in
markchars-mode: it should only pay attention to the script of the base
characters, not to the script of combining accents.  The latter is
mostly irrelevant, certainly so for the purpose of detecting
confusables.

So I think this should be fixed in markchars-mode, and the fact that
we somewhat arbitrarily assign those diacritics to the latin script is
not a serious problem, if it is a problem at all.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#36923; Package emacs. (Mon, 05 Aug 2019 19:59:01 GMT)

Message #11 received at 36923 <at> debbugs.gnu.org:

From: Juri Linkov <juri <at> linkov.net>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 36923 <at> debbugs.gnu.org
Subject: Re: bug#36923: Combining Diacritical Marks are not Latin only
Date: Mon, 05 Aug 2019 22:41:59 +0300

>>   (aref char-script-table ?\N{COMBINING ACUTE ACCENT})
>>
>> could return
>>
>>   (latin greek cyrillic)
>>
>> instead of the current
>>
>>   latin
>
> char-script-table is documented to yield a single symbol, so returning
> a list would be an incompatible change, which we should avoid.

The docstring of char-script-table says:

  Char table of script symbols.
  It has one extra slot whose value is a list of script symbols.

So it seems char-script-table should yield a list of script symbols?

I searched more for char-script-table in the documentation, and one
place where it's used is forward-word.  But I don't understand why
forward-word doesn't stop between “COMBINING ACUTE ACCENT” (which is
assigned to the Latin script) and non-Latin letters.

It is good that it doesn't stop there, and I'm just trying to
understand why, so that the same logic could be used in markchars-mode.
Maybe it doesn't stop because of special script handling in
‘find-word-boundary-function-table’?  Or because it ignores all
combining characters?

BTW, while looking at forward-word and right-word I noticed an inconsistency:
there are left-word and right-word commands, but no left-sexp and right-sexp
to accompany forward-sexp.

> More generally, I think what you describe is a clear conceptual bug in
> markchars-mode: it should only pay attention to the script of the base
> characters, not to the script of combining accents.  The latter is
> mostly irrelevant, certainly so for the purpose of detecting
> confusables.

Could you suggest a proper function to strip all combining characters
from the string?




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#36923; Package emacs. (Tue, 06 Aug 2019 14:33:02 GMT)

Message #14 received at 36923 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Juri Linkov <juri <at> linkov.net>
Cc: 36923 <at> debbugs.gnu.org
Subject: Re: bug#36923: Combining Diacritical Marks are not Latin only
Date: Tue, 06 Aug 2019 17:32:33 +0300

> From: Juri Linkov <juri <at> linkov.net>
> Cc: 36923 <at> debbugs.gnu.org
> Date: Mon, 05 Aug 2019 22:41:59 +0300
> 
> >>   (aref char-script-table ?\N{COMBINING ACUTE ACCENT})
> >>
> >> could return
> >>
> >>   (latin greek cyrillic)
> >>
> >> instead of the current
> >>
> >>   latin
> >
> > char-script-table is documented to yield a single symbol, so returning
> > a list would be an incompatible change, which we should avoid.
> 
> The docstring of char-script-table says:
> 
>   Char table of script symbols.
>   It has one extra slot whose value is a list of script symbols.
> 
> So it seems char-script-table should yield a list of script symbols?

No, that's only in the extra slot.  The ELisp manual says:

 -- Variable: char-script-table
     The value of this variable is a char-table that specifies, for each
     character, a symbol whose name is the script to which the character
     belongs, according to the Unicode Standard classification of the
     Unicode code space into script-specific blocks.  This char-table
     has a single extra slot whose value is the list of all script
     symbols.
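
The distinction can be checked directly.  The following is only a
minimal sketch, assuming a stock Emacs in which char-script-table is
preloaded: indexing the table with a character yields one symbol,
while the list of all script symbols is reachable only through the
extra slot.

```elisp
;; Looking up a character yields a single script symbol.
(aref char-script-table ?a)   ; => latin

;; Per this report, the lookup below returned `latin' at the time;
;; other Emacs versions may classify the character differently.
(aref char-script-table ?\N{COMBINING ACUTE ACCENT})

;; The list of all script symbols lives in the table's only extra slot.
(char-table-extra-slot char-script-table 0)
```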

> I searched more for char-script-table in the documentation, and one
> place where it's used is forward-word.  But I don't understand why
> forward-word doesn't stop between “COMBINING ACUTE ACCENT” (which is
> assigned to the Latin script) and non-Latin letters.

See word-combining-categories: it causes word-movement commands to
ignore any script boundaries with characters whose category is
combining diacritic or mark.

> Maybe it doesn't stop because of special script handling in
> ‘find-word-boundary-function-table’?

Not by default, because find-word-boundary-function-table's entry for
any character is nil by default.

> BTW, while looking at forward-word and right-word I noticed an inconsistency:
> there are left-word and right-word commands, but no left-sexp and right-sexp
> to accompany forward-sexp.

Programming languages are all L2R, so there's no need to move by sexps
in R2L direction.

> > More generally, I think what you describe is a clear conceptual bug in
> > markchars-mode: it should only pay attention to the script of the base
> > characters, not to the script of combining accents.  The latter is
> > mostly irrelevant, certainly so for the purpose of detecting
> > confusables.
> 
> Could you suggest a proper function to strip all combining characters
> from the string?

Every base character has a canonical combining class of zero, so you
could use

   (get-char-code-property CHAR 'canonical-combining-class)

to filter out those CHARs for which the value is non-zero.

Alternatively, you could go by categories: base characters have the
?. category set, combining characters have the ?^ category set.

My recommendation is to use the canonical-combining-class property, as
it is a more direct way of doing this.
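
As a rough illustration of that filtering, here is a minimal sketch.
The function name my-strip-combining is hypothetical and used only
for this example, and a nil property value is assumed to mean "base
character".

```elisp
(require 'seq)

;; `my-strip-combining' is a hypothetical name used only for illustration.
(defun my-strip-combining (string)
  "Return a copy of STRING without its combining characters.
A character counts as combining when its canonical combining
class is non-zero."
  (concat
   (seq-filter
    (lambda (char)
      ;; Base characters have canonical combining class 0.
      ;; Assumption: a nil value is treated as 0 as well.
      (zerop (or (get-char-code-property char 'canonical-combining-class)
                 0)))
    string)))

;; GREEK SMALL LETTER IOTA followed by COMBINING GREEK DIALYTIKA TONOS:
;; only the iota survives.
(my-strip-combining
 "\N{GREEK SMALL LETTER IOTA}\N{COMBINING GREEK DIALYTIKA TONOS}")
```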





Reply sent to Juri Linkov <juri <at> linkov.net>:
You have taken responsibility. (Wed, 07 Aug 2019 22:03:03 GMT)

Notification sent to Juri Linkov <juri <at> linkov.net>:
bug acknowledged by developer. (Wed, 07 Aug 2019 22:03:03 GMT)

Message #19 received at 36923-done <at> debbugs.gnu.org:

From: Juri Linkov <juri <at> linkov.net>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 36923-done <at> debbugs.gnu.org
Subject: Re: bug#36923: Combining Diacritical Marks are not Latin only
Date: Thu, 08 Aug 2019 00:44:49 +0300

> Every base character has a canonical combining class of zero, so you
> could use
>
>    (get-char-code-property CHAR 'canonical-combining-class)
>
> to filter out those CHARs for which the value is non-zero.
>
> Alternatively, you could go by categories: base characters have the
> ?. category set, combining characters have the ?^ category set.
>
> My recommendation is to use the canonical-combining-class property, as
> it is a more direct way of doing this.

Thanks, I fixed markchars-mode by using canonical-combining-class.
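
The actual change to markchars-mode is not shown in this thread.  As a
minimal sketch of the approach only, a mixed-script check could
collect the scripts of base characters and skip everything with a
non-zero canonical combining class.  The helper name
my-base-char-scripts is hypothetical and is not the code committed to
markchars-mode.

```elisp
;; `my-base-char-scripts' is a hypothetical helper for illustration,
;; not the code committed to markchars-mode.
(defun my-base-char-scripts (string)
  "Return the distinct scripts of the base characters in STRING.
Characters with a non-zero canonical combining class are skipped."
  (let (scripts)
    (dolist (char (string-to-list string))
      ;; Assumption: a nil property value is treated as class 0.
      (when (zerop (or (get-char-code-property char 'canonical-combining-class)
                       0))
        (let ((script (aref char-script-table char)))
          (when (and script (not (memq script scripts)))
            (push script scripts)))))
    (nreverse scripts)))

;; The combining mark no longer contributes a script of its own.
(my-base-char-scripts
 "\N{GREEK SMALL LETTER IOTA}\N{COMBINING GREEK DIALYTIKA TONOS}")
;; => (greek)

;; Genuinely mixed base characters are still reported.
(my-base-char-scripts "a\N{CYRILLIC SMALL LETTER BE}")
;; => (latin cyrillic)
```

With such a helper, a string would be flagged as mixed-script only
when the returned list has more than one element.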




Bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 05 Sep 2019 11:24:08 GMT)
