GNU bug report logs - #79515
char-fold-to-regexp doesn't handle most Arabic & Persian diacritics


Package: emacs

Reported by: Alipour Alipour <alipoor90 <at> gmail.com>

Date: Thu, 25 Sep 2025 19:49:05 UTC

Severity: normal

Done: Eli Zaretskii <eliz <at> gnu.org>

To reply to this bug, email your comments to 79515 AT debbugs.gnu.org.
There is no need to reopen the bug first.



Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#79515; Package emacs. (Thu, 25 Sep 2025 19:49:07 GMT) Full text and rfc822 format available.

Acknowledgement sent to Alipour Alipour <alipoor90 <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Thu, 25 Sep 2025 19:49:07 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Alipour Alipour <alipoor90 <at> gmail.com>
To: bug-gnu-emacs <at> gnu.org
Subject: char-fold-to-regexp doesn't handle most Arabic & Persian diacritics
Date: Thu, 25 Sep 2025 23:17:15 +0330
[Message part 1 (text/plain, inline)]
As per manual I set search-default-mode to char-fold-to-regexp.

https://www.gnu.org/software/emacs/manual/html_node/emacs/Lax-Search.html

But when using isearch to look for the word امتحان in the text sample
below, only the first 4 words are found (considered as equivalents), but
not the rest.

امتحان
آمتحان
أمتحان
إمتحان
امتّحان
امتَحان
امتِحان
امتُحان
امتًحان
امتٍحان
امتٌحان
امتْحان

In essence hamzah (glottal stop mark) is handled properly.

But Ḥarakāt, Sukūn, Tanwīn and shadda, which are also Arabic & Persian
diacritics, are not handled properly (i.e. they are not ignored when
searching in char-fold-to-regexp mode).

https://en.wikipedia.org/wiki/Arabic_diacritics#Tashkīl

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79515; Package emacs. (Fri, 26 Sep 2025 07:21:01 GMT) Full text and rfc822 format available.

Message #8 received at 79515 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Alipour Alipour <alipoor90 <at> gmail.com>
Cc: 79515 <at> debbugs.gnu.org
Subject: Re: bug#79515: char-fold-to-regexp doesn't handle most Arabic &
 Persian diacritics
Date: Fri, 26 Sep 2025 10:19:45 +0300
> From: Alipour Alipour <alipoor90 <at> gmail.com>
> Date: Thu, 25 Sep 2025 23:17:15 +0330
> 
> As per manual I set search-default-mode to char-fold-to-regexp.
> 
> https://www.gnu.org/software/emacs/manual/html_node/emacs/Lax-Search.html
> 
> But when using isearch to look for the word امتحان in the text sample below, only the first 4 words are found
> (considered as equivalents), but not the rest.

> امتحان
> آمتحان
> أمتحان
> إمتحان
> امتّحان
> امتَحان
> امتِحان
> امتُحان
> امتًحان
> امتٍحان
> امتٌحان
> امتْحان

> In essence hamzah (glottal stop mark) is handled properly.
> 
> But Ḥarakāt, Sukūn, Tanwīn and shadda, which are part of Arabic & Persian diacritics are not handled
> properly. (I.e. not ignored when searching in char-fold-to-regexp mode).
> 
> https://en.wikipedia.org/wiki/Arabic_diacritics#Tashkīl 

Thank you for your report.

To help us investigate, would you please make the report more easily
understood by people who don't read Arabic, by telling, for the search
word and for each of the words you expected Emacs to find, the
sequence of characters that produce that word.  Please use U+NNNN
notation to make that easy to understand.  This will allow us to
analyze the issue vs what char-fold.el supports OOTB.

In general, if you expect char-fold to ignore any diacriticals in
Abjad alphabets (such as Arabic or Farsi), then it doesn't currently
do that, and AFAIU cannot do that without extensive customization of
the char-fold-include user option.  But maybe I don't understand the
problem well enough, thus my request above to show the sequences of
Unicode codepoints used to type those words.
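
For reference, the U+NNNN notation requested here can be produced mechanically. The following is a minimal Python sketch, not part of Emacs or of this report; the helper name `to_u_notation` is made up for illustration.

```python
def to_u_notation(text: str) -> str:
    """Return the code points of text as space-separated U+NNNN tokens."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# The search word from the report, written with escapes so the code points are explicit.
search_word = "\u0627\u0645\u062A\u062D\u0627\u0646"
print(to_u_notation(search_word))
# prints: U+0627 U+0645 U+062A U+062D U+0627 U+0646
```

Inside Emacs itself, `C-u C-x =` (or `M-x describe-char`) reports the code point of the character at point, which gives the same information one character at a time.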




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79515; Package emacs. (Fri, 26 Sep 2025 08:23:02 GMT) Full text and rfc822 format available.

Message #11 received at 79515 <at> debbugs.gnu.org (full text, mbox):

From: Alipour Alipour <alipoor90 <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 79515 <at> debbugs.gnu.org
Subject: Re: bug#79515: char-fold-to-regexp doesn't handle most Arabic &
 Persian diacritics
Date: Fri, 26 Sep 2025 11:51:45 +0330
[Message part 1 (text/plain, inline)]
I understand if not every diacritic can be handled, but I'm a bit surprised
that hamzah (which is often much more complicated) is handled while
Ḥarakāt, Sukūn, Tanwīn and shadda are not.

These are not only more important for this functionality but also easier
to handle, since they can be completely ignored when matching. Hamzah
(which seems to be handled properly at present), by contrast, requires
equivalence tables for equivalent characters.

Below you will find the code points for the search term and the sample text.
Currently only the first four lines are matched (first one being equivalent
to the search term).
Correct behavior is that all of them should be matched.

You will notice that the examples that currently don't match have an extra
Unicode character, and that character is the diacritic. (In the first four
lines, by contrast, the first character changes between its alternate forms.)

In essence, if the function responsible for the matching were to
strip/ignore these 8 characters from the buffer and from the search term (if
char-fold-symmetric is true) during matching, the functionality would be
vastly improved: U+0651U U+064EU U+0650U U+064FU U+064BU U+064DU U+064CU
U+0652U

You could perhaps make an exception for when the search term consists only
of these 8 characters (I.e. when the user is searching for a particular
diacritic rather than a word).

Search term and the sample text:

U+0627 U+0645 U+062A U+062D U+0627 U+0646

U+0627 U+0645 U+062A U+062D U+0627 U+0646
U+0622 U+0645 U+062A U+062D U+0627 U+0646
U+0623 U+0645 U+062A U+062D U+0627 U+0646
U+0625 U+0645 U+062A U+062D U+0627 U+0646
U+0627 U+0645 U+062A U+0651 U+062D U+0627 U+0646
U+0627 U+0645 U+062A U+064E U+062D U+0627 U+0646
U+0627 U+0645 U+062A U+0650 U+062D U+0627 U+0646
U+0627 U+0645 U+062A U+064F U+062D U+0627 U+0646
U+0627 U+0645 U+062A U+064B U+062D U+0627 U+0646
U+0627 U+0645 U+062A U+064D U+062D U+0627 U+0646
U+0627 U+0645 U+062A U+064C U+062D U+0627 U+0646
U+0627 U+0645 U+062A U+0652 U+062D U+0627 U+0646

امتحان

امتحان
آمتحان
أمتحان
إمتحان
امتّحان
امتَحان
امتِحان
امتُحان
امتًحان
امتٍحان
امتٌحان
امتْحان
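
The observation that each non-matching sample carries exactly one extra code point can be checked mechanically. This is a small Python sketch under the assumption that the words are given in the U+NNNN notation used in this message; the helper name `from_u_notation` is hypothetical.

```python
import re


def from_u_notation(spec: str) -> str:
    """Decode a run of U+NNNN tokens (with or without separators) into a string."""
    return "".join(chr(int(h, 16)) for h in re.findall(r"U\+([0-9A-Fa-f]{4,6})", spec))


search = from_u_notation("U+0627U+0645U+062AU+062DU+0627U+0646")
sample = from_u_notation("U+0627U+0645U+062AU+0651U+062DU+0627U+0646")

# Code points that occur in the sample but nowhere in the search term.
extra = [f"U+{ord(ch):04X}" for ch in sample if ch not in search]
print(extra)
# prints: ['U+0651']
```

Here the only difference between the fifth sample and the search term is U+0651 (shadda), inserted after the third letter.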


Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79515; Package emacs. (Fri, 26 Sep 2025 08:51:02 GMT) Full text and rfc822 format available.

Message #14 received at 79515 <at> debbugs.gnu.org (full text, mbox):

From: Alipour Alipour <alipoor90 <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 79515 <at> debbugs.gnu.org
Subject: Re: bug#79515: char-fold-to-regexp doesn't handle most Arabic &
 Persian diacritics
Date: Fri, 26 Sep 2025 12:19:39 +0330
[Message part 1 (text/plain, inline)]
In case it's not obvious, I left an extra U when copy-pasting the "eight
diacritic" characters (that need to be stripped/ignored when matching).

They are characters U+064B to U+0652

These are the most common diacritics in Arabic / Persian script.

U+064B  ً   ‎ Arabic Fathatan
U+064C  ٌ   ‎ Arabic Dammatan
U+064D  ٍ   ‎ Arabic Kasratan
U+064E  َ   ‎ Arabic Fatha
U+064F  ُ   ‎ Arabic Damma
U+0650  ِ   ‎ Arabic Kasra
U+0651  ّ   ‎ Arabic Shadda
U+0652  ْ   ‎ Arabic Sukun
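
The same table can be reproduced from the Unicode Character Database. A short Python sketch using the standard `unicodedata` module (the name `HARAKAT` is chosen here only for illustration) lists each of the eight code points with its official name and general category:

```python
import unicodedata

# The eight marks named in this message: U+064B through U+0652 inclusive.
HARAKAT = [chr(cp) for cp in range(0x064B, 0x0652 + 1)]

for ch in HARAKAT:
    # Category "Mn" means nonspacing mark; a nonzero combining class
    # means the character attaches to the preceding base letter.
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.combining(ch),
    )
```

All eight are nonspacing combining marks, which is why each appears as a separate code point after its base letter rather than as part of a precomposed character.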



Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79515; Package emacs. (Fri, 26 Sep 2025 08:59:02 GMT) Full text and rfc822 format available.

Message #17 received at 79515 <at> debbugs.gnu.org (full text, mbox):

From: Robert Pluim <rpluim <at> gmail.com>
To: Alipour Alipour <alipoor90 <at> gmail.com>
Cc: Eli Zaretskii <eliz <at> gnu.org>, 79515 <at> debbugs.gnu.org
Subject: Re: bug#79515: char-fold-to-regexp doesn't handle most Arabic &
 Persian diacritics
Date: Fri, 26 Sep 2025 10:58:33 +0200
>>>>> On Fri, 26 Sep 2025 12:19:39 +0330, Alipour Alipour <alipoor90 <at> gmail.com> said:

    Alipour> In case it's not obvious, I left an extra U when copy-pasting the "eight
    Alipour> diacritic" characters (that need to be stripped/ignored when matching).

    Alipour> They are characters U+064B to U+0652

    Alipour> These are the most common diacritics in Arabic / Persian script.

    Alipour> U+064B  ً   ‎ Arabic Fathatan
    Alipour> U+064C  ٌ   ‎ Arabic Dammatan
    Alipour> U+064D  ٍ   ‎ Arabic Kasratan
    Alipour> U+064E  َ   ‎ Arabic Fatha
    Alipour> U+064F  ُ   ‎ Arabic Damma
    Alipour> U+0650  ِ   ‎ Arabic Kasra
    Alipour> U+0651  ّ   ‎ Arabic Shadda
    Alipour> U+0652  ْ   ‎ Arabic Sukun


    Alipour> On Fri, Sep 26, 2025 at 11:51 AM Alipour Alipour <alipoor90 <at> gmail.com>
    Alipour> wrote:

    >> I understand if every diacritic can't be handled, but I'm a bit surprised
    >> that hamzah (which is often much more complicated) is handled but not
    >> Ḥarakāt, Sukūn, Tanwīn and shadda.
    >> 
    >> Since these are actually not only more important for this functionality,
    >> but also easier to handle, as these characters can be completely ignored
    >> when doing the matching.
    >> As opposed to hamzah (Which seems to be currently handled properly) which
    >> actually requires some equivalence tables for equivalent characters.
    >> 

The Unicode data files contain decomposition information, which tell
us that the precomposed codepoints which use hamzah map to a base
character + hamzah. We use that information when doing the matching,
but no such data exists for the other diacritics.

Robert
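
The distinction described above can be checked against the Unicode Character Database. The following Python sketch uses the standard `unicodedata` module; it only illustrates what the data contains and is not how char-fold.el reads it.

```python
import unicodedata

# Precomposed alef forms from the sample text: each has a canonical
# decomposition into U+0627 (alef) plus a combining mark.
decompositions = {
    f"U+{cp:04X}": unicodedata.decomposition(chr(cp))
    for cp in (0x0622, 0x0623, 0x0625)
}
print(decompositions)
# prints: {'U+0622': '0627 0653', 'U+0623': '0627 0654', 'U+0625': '0627 0655'}

# The eight marks U+064B..U+0652 have no decomposition mapping of their own.
marks_without_mapping = [
    f"U+{cp:04X}"
    for cp in range(0x064B, 0x0653)
    if unicodedata.decomposition(chr(cp)) == ""
]
print(len(marks_without_mapping))
# prints: 8
```

A letter followed by one of those eight marks is simply two separate code points with no precomposed form, so the decomposition data offers nothing that links the marked spelling to the unmarked one.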
-- 




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79515; Package emacs. (Fri, 26 Sep 2025 09:18:01 GMT) Full text and rfc822 format available.

Message #20 received at 79515 <at> debbugs.gnu.org (full text, mbox):

From: Alipour Alipour <alipoor90 <at> gmail.com>
To: Robert Pluim <rpluim <at> gmail.com>
Cc: Eli Zaretskii <eliz <at> gnu.org>, 79515 <at> debbugs.gnu.org
Subject: Re: bug#79515: char-fold-to-regexp doesn't handle most Arabic &
 Persian diacritics
Date: Fri, 26 Sep 2025 12:46:55 +0330
[Message part 1 (text/plain, inline)]
The reason precomposed code points for these eight don't exist is that,
unlike hamzah, they are not actual, independent letters of the alphabet.

And the fact that they can sit on pretty much any letter of the alphabet
means you would need several hundred combinations (compared to only a few
alternate forms for hamzah).

Native Arabic / Persian speakers typically omit these 8 diacritics when
writing and simply pronounce the words from memory (with help from
context).
But sometimes the material (often classical texts) is typed with the
diacritics included.

These 8 can be safely ignored during search (in char-fold-to-regexp mode).
There are more diacritics that should be ignored during matching, but these
8 are the most clear-cut ones, representing the 8 most common diacritics in
Arabic & Persian (though natives typically omit them and simply memorize
the pronunciation).


Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79515; Package emacs. (Fri, 26 Sep 2025 09:40:05 GMT) Full text and rfc822 format available.

Message #23 received at 79515 <at> debbugs.gnu.org (full text, mbox):

From: Alipour Alipour <alipoor90 <at> gmail.com>
To: Robert Pluim <rpluim <at> gmail.com>
Cc: Eli Zaretskii <eliz <at> gnu.org>, 79515 <at> debbugs.gnu.org
Subject: Re: bug#79515: char-fold-to-regexp doesn't handle most Arabic &
 Persian diacritics
Date: Fri, 26 Sep 2025 13:08:39 +0330
[Message part 1 (text/plain, inline)]
For example, if you save the sample text into a file, open it in Firefox
or Chromium, and search for the term, you will see that they match
all the instances, essentially ignoring/stripping those 8 characters from the
text during matching:

امتحان

امتحان
آمتحان
أمتحان
إمتحان
امتّحان
امتَحان
امتِحان
امتُحان
امتًحان
امتٍحان
امتٌحان
امتْحان

Again, there are more characters/diacritics that should be ignored
during matching (in char-fold-to-regexp mode), but these 8 are the most
clear-cut and common ones, which I can name with confidence without having
to look for some kind of standard (which most likely doesn't exist) or some
1000-page linguistics thesis, and without getting into the peculiarities of
the various languages that use the Arabic script.

The way other popular software with good internationalization (e.g. Firefox
and Chromium) handles these characters should make it clear that
these eight (U+064B to U+0652) should be ignored during matching.
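
The strip-then-compare behaviour described in this message can be sketched in a few lines of Python. This is only an illustration of the idea; it makes no claim about how Firefox or Chromium implement their find feature, and the names `strip_harakat` and `lax_contains` are hypothetical. The fallback for a search term made up solely of diacritics follows the exception suggested earlier in the thread.

```python
# The eight marks U+064B..U+0652 named in the report.
HARAKAT = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_harakat(text: str) -> str:
    """Remove the eight marks from text, leaving everything else untouched."""
    return "".join(ch for ch in text if ch not in HARAKAT)


def lax_contains(haystack: str, needle: str) -> bool:
    """Report whether needle occurs in haystack when the eight marks are ignored."""
    stripped = strip_harakat(needle)
    if not stripped:
        # The needle consists only of diacritics: search for it literally.
        return needle in haystack
    return stripped in strip_harakat(haystack)


search = "\u0627\u0645\u062A\u062D\u0627\u0646"
# One sample per mark, each inserted after the third letter as in the report.
samples = [search[:3] + mark + search[3:] for mark in sorted(HARAKAT)]
print(all(lax_contains(sample, search) for sample in samples))
# prints: True
```

This sketch ignores only the eight marks. It does not treat the hamzah and maddah forms of alef as equivalent, which is the part Emacs already handles through decomposition data.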




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79515; Package emacs. (Fri, 26 Sep 2025 11:10:02 GMT) Full text and rfc822 format available.

Message #26 received at 79515 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Alipour Alipour <alipoor90 <at> gmail.com>
Cc: rpluim <at> gmail.com, 79515 <at> debbugs.gnu.org
Subject: Re: bug#79515: char-fold-to-regexp doesn't handle most Arabic &
 Persian diacritics
Date: Fri, 26 Sep 2025 14:07:38 +0300
> From: Alipour Alipour <alipoor90 <at> gmail.com>
> Date: Fri, 26 Sep 2025 13:08:39 +0330
> Cc: Eli Zaretskii <eliz <at> gnu.org>, 79515 <at> debbugs.gnu.org
> 
> For example if you save the sample text into a file and open it in Firefox or Chromium, and try to find the
> search term, you will see that they match all the instances, and essentially ignore/strip those 8 characters
> from the text during matching:
> 
> امتحان
> 
> امتحان
> آمتحان
> أمتحان
> إمتحان
> امتّحان
> امتَحان
> امتِحان
> امتُحان
> امتًحان
> امتٍحان
> امتٌحان
> امتْحان
> 
> Now, again, there are more characters/diacritics that should be ignored during matching (in
> char-fold-to-regexp mode), but these 8 are the most definitive and common ones that I could confidently tell
> you without having to look for some kind of standard (that most likely doesn't exist) or some 1000-page
> linguistics thesis and getting into peculiarities of various languages that use the Arabic script.
> 
> And how other popular software (E.g. Firefox and Chromium) with good internationalization handle these
> characters should make it clear that these eight (U+064B to U+0652) should be ignored during matching. 

We are well aware of what the Unicode UTS#10 says about ignoring
diacriticals in search.  However, Emacs doesn't yet implement UTS#10
in its entirety, and char-fold.el is just an approximation that uses
regular expressions and some of the Unicode data.  It explicitly does
NOT strip diacriticals from the search string and the buffer text.

So if you expect Emacs to perform according to UTS#10, we don't yet
have these capabilities, sorry; patches to add that are very welcome.
Until this is implemented in Emacs, you can customize
char-fold-include to add at least some of the character combinations
you want Emacs to consider as equivalent during search.
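
To make the regular-expression approach concrete, here is a Python analogy that tolerates the eight marks without stripping any text. It is a sketch under stated assumptions, not the regexp that char-fold.el actually generates, and `fold_to_regex` is a hypothetical helper name.

```python
import re

# Character class for the eight marks named in the report; "*" makes them optional.
OPTIONAL_MARKS = "[\u064B-\u0652]*"


def fold_to_regex(word: str) -> str:
    """Build a pattern that tolerates any of the eight marks after each character."""
    return "".join(re.escape(ch) + OPTIONAL_MARKS for ch in word)


search = "\u0627\u0645\u062A\u062D\u0627\u0646"
pattern = re.compile(fold_to_regex(search))

with_shadda = "\u0627\u0645\u062A\u0651\u062D\u0627\u0646"
with_madda = "\u0622\u0645\u062A\u062D\u0627\u0646"
print(bool(pattern.fullmatch(with_shadda)))
# prints: True
print(bool(pattern.fullmatch(with_madda)))
# prints: False
```

The second result is False because this sketch covers only the optional marks and knows nothing about decomposition equivalents such as U+0622. It also matches in one direction only: a search term that itself contains a mark still requires that mark in the text.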




Reply sent to Eli Zaretskii <eliz <at> gnu.org>:
You have taken responsibility. (Sat, 11 Oct 2025 08:22:03 GMT) Full text and rfc822 format available.

Notification sent to Alipour Alipour <alipoor90 <at> gmail.com>:
bug acknowledged by developer. (Sat, 11 Oct 2025 08:22:03 GMT) Full text and rfc822 format available.

Message #31 received at 79515-done <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Robert Pluim <rpluim <at> gmail.com>
Cc: 79515-done <at> debbugs.gnu.org, alipoor90 <at> gmail.com
Subject: Re: bug#79515: char-fold-to-regexp doesn't handle most Arabic &
 Persian diacritics
Date: Sat, 11 Oct 2025 11:21:43 +0300
> From: Robert Pluim <rpluim <at> gmail.com>
> Cc: Eli Zaretskii <eliz <at> gnu.org>,  79515 <at> debbugs.gnu.org
> Date: Fri, 26 Sep 2025 10:58:33 +0200
> 
> >>>>> On Fri, 26 Sep 2025 12:19:39 +0330, Alipour Alipour <alipoor90 <at> gmail.com> said:
> 
>     Alipour> In case it's not obvious, I left an extra U when copy-pasting the "eight
>     Alipour> diacritic" characters (that need to be stripped/ignored when matching).
> 
>     Alipour> They are characters U+064B to U+0652
> 
>     Alipour> These are the most common diacritics in Arabic / Persian script.
> 
>     Alipour> U+064B  ً   ‎ Arabic Fathatan
>     Alipour> U+064C  ٌ   ‎ Arabic Dammatan
>     Alipour> U+064D  ٍ   ‎ Arabic Kasratan
>     Alipour> U+064E  َ   ‎ Arabic Fatha
>     Alipour> U+064F  ُ   ‎ Arabic Damma
>     Alipour> U+0650  ِ   ‎ Arabic Kasra
>     Alipour> U+0651  ّ   ‎ Arabic Shadda
>     Alipour> U+0652  ْ   ‎ Arabic Sukun
> 
> 
>     Alipour> On Fri, Sep 26, 2025 at 11:51 AM Alipour Alipour <alipoor90 <at> gmail.com>
>     Alipour> wrote:
> 
>     >> I understand if every diacritic can't be handled, but I'm a bit surprised
>     >> that hamzah (which is often much more complicated) is handled but not
>     >> Ḥarakāt, Sukūn, Tanwīn and shadda.
>     >> 
>     >> Since these are actually not only more important for this functionality,
>     >> but also easier to handle, as these characters can be completely ignored
>     >> when doing the matching.
>     >> As opposed to hamzah (Which seems to be currently handled properly) which
>     >> actually requires some equivalence tables for equivalent characters.
>     >> 
> 
> The Unicode data files contain decomposition information, which tell
> us that the precomposed codepoints which use hamzah map to a base
> character + hamzah. We use that information when doing the matching,
> but no such data exists for the other diacritics.

Thanks.

No further comments within 2 weeks, so I'm now closing this bug.




This bug report was last modified 27 days ago.
