GNU bug report logs - #79702
request: flag for visually identical but different unicode characters
To reply to this bug, email your comments to 79702 AT debbugs.gnu.org.
Report forwarded to bug-grep <at> gnu.org: bug#79702; Package grep. (Sun, 26 Oct 2025 13:54:02 GMT)
Acknowledgement sent to Dave <dj.2dixx <at> googlemail.com>: New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Sun, 26 Oct 2025 13:54:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Today, I realized that there are characters which are visually
identical, yet have different Unicode code points, so they can't be
matched against each other in grep.
Example #1:
احمدی
Example #2:
احمدى
The ى in both examples looks exactly the same, yet the first one is
U+06CC, and the second one U+0649.
From the user's perspective, it's impossible to tell which code point
the word is using. In fact, these two words, even though they are from
different languages/keyboards, match perfectly on the other letters,
and it's only ی/ى that escapes the match.
While not as important, this letter has other variants like ي (notice
the two dots below it, think of an umlaut), which is U+064A. If you
press Ctrl + F in your browser, you'll notice that you can match U+064A
with U+0649, but this is not the default behavior in grep either.
I understand there's no straightforward solution for this, so I'm
thinking of an extra flag which converts all visually similar
characters to the same code point and then looks for matches. Thoughts?
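To make the proposal concrete, here is a minimal Python sketch of the "fold, then match" idea. It is not a grep feature: the table `YEH_FOLD` and the helpers `fold` and `grep_folded` are hypothetical names invented for this illustration, and the table covers only the three code points mentioned above.

```python
# Hypothetical folding table, written by hand for this example only.
# It maps the yeh variants from the report onto one code point.
YEH_FOLD = str.maketrans({
    "\u06CC": "\u0649",  # ARABIC LETTER FARSI YEH -> ARABIC LETTER ALEF MAKSURA
    "\u064A": "\u0649",  # ARABIC LETTER YEH -> ARABIC LETTER ALEF MAKSURA
})


def fold(text: str) -> str:
    """Replace each listed look-alike with its chosen representative."""
    return text.translate(YEH_FOLD)


def grep_folded(pattern: str, lines):
    """Return the lines containing the pattern once both sides are folded."""
    needle = fold(pattern)
    return [line for line in lines if needle in fold(line)]


persian = "\u0627\u062D\u0645\u062F\u06CC"  # Example #1, ends in U+06CC
arabic = "\u0627\u062D\u0645\u062F\u0649"   # Example #2, ends in U+0649

print(persian in arabic)                       # False: plain substring search
print(grep_folded(persian, [arabic]) == [arabic])  # True: folded search
```

The sketch only handles fixed strings; a real option would also have to decide how folding interacts with regular expression syntax, bracket expressions, and the locale.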
Information forwarded to bug-grep <at> gnu.org: bug#79702; Package grep. (Sun, 26 Oct 2025 18:42:02 GMT)
Message #8 received at submit <at> debbugs.gnu.org (full text, mbox):
Hi Dave,
Dave via Bug reports for GNU grep <bug-grep <at> gnu.org> writes:
> Today, I realized that there are characters which are visually
> identical, yet have different Unicode code points, so they can't be
> matched against each other in grep.
A bit different from your example, but in some cases you can encode the
same character in multiple ways.
The character á (LATIN SMALL LETTER A WITH ACUTE) can be written as:
* Precomposed (NFC): U+00E1
* Decomposed (NFD): U+0061 U+0301
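
The two encodings of á can be compared with Python's standard `unicodedata` module. This is only an illustrative sketch; the helper name `same_after_nfc` is made up for the example. It also shows why this case differs from the reported one: the two yeh code points have no decomposition, so normalization leaves them distinct.

```python
import unicodedata

composed = "\u00E1"     # á as one precomposed code point
decomposed = "a\u0301"  # a followed by COMBINING ACUTE ACCENT


def same_after_nfc(left: str, right: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return (unicodedata.normalize("NFC", left)
            == unicodedata.normalize("NFC", right))


print(composed == decomposed)                # False: different code points
print(same_after_nfc(composed, decomposed))  # True: canonically equivalent
print(same_after_nfc("\u06CC", "\u0649"))    # False: normalization does not help
```
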
> Example #1:
> احمدی
>
> Example #2:
> احمدى
>
> The ى in both examples looks exactly the same, yet the first one is
> U+06CC, and the second one U+0649.
>
> From the user's perspective, it's impossible to tell which code point
> the word is using. In fact, these two words, even though they are from
> different languages/keyboards, match perfectly on the other letters,
> and it's only ی/ى that escapes the match.
>
> While not as important, this letter has other variants like ي (notice
> the two dots below it, think of an umlaut), which is U+064A. If you
> press Ctrl + F in your browser, you'll notice that you can match U+064A
> with U+0649, but this is not the default behavior in grep either.
What browser does that? Firefox and Chrome on my machine don't match the
other character.
Collin
Information forwarded to bug-grep <at> gnu.org: bug#79702; Package grep. (Sun, 26 Oct 2025 19:47:02 GMT)
Message #14 received at 79702 <at> debbugs.gnu.org (full text, mbox):
Isn't this what equivalence classes (like [[=e=]]) are supposed
to solve?
Can grep even use them?
Arnold
Information forwarded to bug-grep <at> gnu.org: bug#79702; Package grep. (Sun, 26 Oct 2025 22:09:02 GMT)
Message #17 received at submit <at> debbugs.gnu.org (full text, mbox):
bug-grep <at> gnu.org
----- Forwarded Message -----
From: David G. Pickett <dgpickett <at> aol.com>
To: Dave <dj.2dixx <at> googlemail.com>
Sent: Sunday, October 26, 2025 at 06:07:02 PM EDT
Subject: Re: bug#79702: request: flag for visually identical but different unicode characters
Even before hackers were using Cyrillic/Roman lookalikes for fake URLs (e.g., chase.com with a Cyrillic a), I recall Sybase offering insensitivity both to case and to Nordic diacritics in ISO-8859-1, like 'A' with an umlaut, 'Ä', in string indexes, so this is not a new idea! I am not sure of its utility in practical terms. Who gets to identify the look-alikes?
Information forwarded to bug-grep <at> gnu.org: bug#79702; Package grep. (Sun, 26 Oct 2025 22:37:02 GMT)
Message #20 received at 79702 <at> debbugs.gnu.org (full text, mbox):
On 2025-10-26 15:08, David G. Pickett via Bug reports for GNU grep wrote:
> Who gets to identify the look-alikes?
The Unicode Consortium has done this, and as is usual with characters,
it's complicated. See:
https://www.unicode.org/reports/tr39/#Confusable_Detection
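As a rough illustration of the approach that report describes, the sketch below implements a simplified "skeleton" comparison: normalize to NFD, replace each character with a prototype, normalize again, and treat two strings as confusable when their skeletons are equal. The table `CONFUSABLE_PROTOTYPES` is hand-written for this example and is an assumption, not an extract of the Consortium's data file; the real algorithm has further steps and a far larger table. The function names are also made up here.

```python
import unicodedata

# Hand-written, illustrative prototypes only (an assumption for this
# sketch). A real implementation would load the published confusables data.
CONFUSABLE_PROTOTYPES = {
    "\u0430": "a",       # CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
    "\u06CC": "\u0649",  # ARABIC LETTER FARSI YEH -> ARABIC LETTER ALEF MAKSURA
}


def skeleton(text: str) -> str:
    """Simplified skeleton: NFD, map to prototypes, then NFD again."""
    nfd = unicodedata.normalize("NFD", text)
    mapped = "".join(CONFUSABLE_PROTOTYPES.get(ch, ch) for ch in nfd)
    return unicodedata.normalize("NFD", mapped)


def confusable(left: str, right: str) -> bool:
    """Two strings count as confusable when their skeletons are equal."""
    return skeleton(left) == skeleton(right)


print(confusable("chase.com", "ch\u0430se.com"))  # True: Cyrillic a
print(confusable("\u0627\u062D\u0645\u062F\u06CC",
                 "\u0627\u062D\u0645\u062F\u0649"))  # True: the reported pair
print(confusable("chase.com", "chose.com"))       # False
```

Because the comparison works on skeletons rather than on the original text, it answers "could these be mistaken for each other?" and says nothing about which spelling is the intended one.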
This bug report was last modified 9 days ago.