GNU bug report logs - #79702
request: flag for visually identical but different unicode characters

Package: grep

Reported by: Dave <dj.2dixx <at> googlemail.com>

Date: Sun, 26 Oct 2025 13:54:02 UTC

Severity: normal

To reply to this bug, email your comments to 79702 AT debbugs.gnu.org.

Report forwarded to bug-grep <at> gnu.org:
bug#79702; Package grep. (Sun, 26 Oct 2025 13:54:02 GMT)

Acknowledgement sent to Dave <dj.2dixx <at> googlemail.com>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Sun, 26 Oct 2025 13:54:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Dave <dj.2dixx <at> googlemail.com>
To: bug-grep <at> gnu.org
Subject: request: flag for visually identical but different unicode characters
Date: Sun, 26 Oct 2025 11:00:28 +0330

Today, I realized that there are characters which are visually
identical, yet have different Unicode code points, so grep can't match
one against the other.

Example #1:
احمدی

Example #2:
احمدى

The final letter looks exactly the same in both examples, yet the first
one is U+06CC and the second one is U+0649.

From the user's perspective, it's impossible to tell which code point
the word is using. In fact, these two words, even though they are from
different languages/keyboards, match perfectly on the other letters,
and it's only the ی/ى that escapes the match.

While not as important, this letter has other variants like ي (notice
the two dots below it, think of an umlaut), which is U+064A. If you
press Ctrl+F in your browser, you'll notice that U+064A matches
U+0649, but this is not the default behavior in grep either.

I understand there's no straightforward solution for this, so I'm
thinking of having an extra flag which converts all visually similar
characters to the same code point and then looks for matches. Thoughts?

Information forwarded to bug-grep <at> gnu.org:
bug#79702; Package grep. (Sun, 26 Oct 2025 18:42:02 GMT)

Message #8 received at submit <at> debbugs.gnu.org:

From: Collin Funk <collin.funk1 <at> gmail.com>
To: Dave via Bug reports for GNU grep <bug-grep <at> gnu.org>
Cc: Dave <dj.2dixx <at> googlemail.com>, 79702 <at> debbugs.gnu.org
Subject: Re: bug#79702: request: flag for visually identical but different
 unicode characters
Date: Sun, 26 Oct 2025 11:40:55 -0700

Hi Dave,

Dave via Bug reports for GNU grep <bug-grep <at> gnu.org> writes:

> Today, I realized that there are characters which are visually
> identical, yet have different Unicode code points, so grep can't match
> one against the other.

A bit different from your example, but in some cases you can encode the
same character in multiple ways.

The character á (LATIN SMALL LETTER A WITH ACUTE) can be written as:

    * Precomposed (NFC): U+00E1
    * Decomposed (NFD):  U+0061 U+0301
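
A minimal sketch of the two encodings above, using Python's standard
unicodedata module (Python is only an assumption for illustration; nobody
in this thread proposed it). It also checks that normalization alone would
not make the two letters from the original report compare equal.

```python
import unicodedata

precomposed = "\u00e1"   # LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"   # LATIN SMALL LETTER A + COMBINING ACUTE ACCENT

# The two strings render the same but are different code point sequences.
same_raw = precomposed == decomposed

# Normalizing to a common form (NFC or NFD) makes them compare equal.
same_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
same_nfd = unicodedata.normalize("NFD", precomposed) == decomposed

# The letters from the report are separate characters with no
# decomposition, so even compatibility normalization keeps them apart.
farsi_yeh = "\u06cc"      # ARABIC LETTER FARSI YEH
alef_maksura = "\u0649"   # ARABIC LETTER ALEF MAKSURA
still_differ = (unicodedata.normalize("NFKC", farsi_yeh)
                != unicodedata.normalize("NFKC", alef_maksura))

print(same_raw, same_nfc, same_nfd, still_differ)
```

So normalization handles the á case, but the U+06CC/U+0649 case needs
some separate mapping of look-alike characters.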


> Example #1:
> احمدی
>
> Example #2:
> احمدى
>
> The final letter looks exactly the same in both examples, yet the first
> one is U+06CC and the second one is U+0649.
>
> From the user's perspective, it's impossible to tell which code point
> the word is using. In fact, these two words, even though they are from
> different languages/keyboards, match perfectly on the other letters,
> and it's only the ی/ى that escapes the match.
>
> While not as important, this letter has other variants like ي (notice
> the two dots below it, think of an umlaut), which is U+064A. If you
> press Ctrl+F in your browser, you'll notice that U+064A matches
> U+0649, but this is not the default behavior in grep either.

What browser does that? Firefox and Chrome on my machine don't match the
other character.

Collin

Information forwarded to bug-grep <at> gnu.org:
bug#79702; Package grep. (Sun, 26 Oct 2025 19:47:02 GMT)

Message #14 received at 79702 <at> debbugs.gnu.org:

From: arnold <at> skeeve.com
To: dj.2dixx <at> googlemail.com, 79702 <at> debbugs.gnu.org
Subject: Re: bug#79702: request: flag for visually identical but different
 unicode characters
Date: Sun, 26 Oct 2025 13:46:48 -0600

Isn't this what equivalence classes (like [[=e=]]) are supposed
to solve?

Can grep even use them?

Arnold

Information forwarded to bug-grep <at> gnu.org:
bug#79702; Package grep. (Sun, 26 Oct 2025 22:09:02 GMT)

Message #17 received at submit <at> debbugs.gnu.org:

From: "David G. Pickett" <dgpickett <at> aol.com>
To: "bug-grep <at> gnu.org" <bug-grep <at> gnu.org>
Subject: Fw: bug#79702: request: flag for visually identical but different
 unicode characters
Date: Sun, 26 Oct 2025 22:08:03 +0000 (UTC)

----- Forwarded Message -----
From: David G. Pickett <dgpickett <at> aol.com>
To: Dave <dj.2dixx <at> googlemail.com>
Sent: Sunday, October 26, 2025 at 06:07:02 PM EDT
Subject: Re: bug#79702: request: flag for visually identical but different unicode characters

Even before hackers were using Cyrillic/Roman look-alikes for fake URLs (e.g., chase.com with a Cyrillic a), I recall Sybase offering string indexes that were insensitive both to case and to Nordic diacritics in ISO-8859-1, such as 'A' with an umlaut ('Ä'), so this is not a new idea!  I am not sure of its practical utility, though.  Who gets to identify the look-alikes?


Information forwarded to bug-grep <at> gnu.org:
bug#79702; Package grep. (Sun, 26 Oct 2025 22:37:02 GMT)

Message #20 received at 79702 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: "David G. Pickett" <dgpickett <at> aol.com>
Cc: 79702 <at> debbugs.gnu.org
Subject: Re: bug#79702: Fw: bug#79702: request: flag for visually identical
 but different unicode characters
Date: Sun, 26 Oct 2025 15:36:42 -0700

On 2025-10-26 15:08, David G. Pickett via Bug reports for GNU grep wrote:
> Who gets to identify the look-alikes?

The Unicode Consortium has done this, and as is usual with characters, 
it's complicated. See:

https://www.unicode.org/reports/tr39/#Confusable_Detection

Information forwarded to bug-grep <at> gnu.org:
bug#79702; Package grep. (Thu, 06 Nov 2025 17:15:03 GMT)

Message #23 received at 79702 <at> debbugs.gnu.org:

From: "Dale R. Worley" <Dale.Worley <at> comcast.net>
To: 79702 <at> debbugs.gnu.org
Subject: Re: bug#79702: Fw: bug#79702: request: flag for visually identical but
 different unicode characters
Date: Thu, 06 Nov 2025 12:14:32 -0500

Paul Eggert <eggert <at> cs.ucla.edu> writes:
>> Who gets to identify the look-alikes?
>
> The Unicode Consortium has done this, and as is usual with characters, 
> it's complicated. See:
>
> https://www.unicode.org/reports/tr39/#Confusable_Detection

ISTM that trying to incorporate this functionality into grep would be an
endless maintenance chore.  Probably better is to have a separate
utility (project) that canonicalizes each confusable character into one
standard form.  Then you can use grep to do the search.  If I've got all
the shell constructions right, the one-line form would be:

    $ grep "$( canonicalize -opts <<<'pattern' )" \
        <(canonicalize -opts file) <(canonicalize -opts file) ...

Since surely there are variations on canonicalization, I've shown
"-opts".  Also, the "<<<" construction adds a newline at the end, but the
enclosing "$( ... )" strips trailing newlines, so the pattern needs no
special handling for it.
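
As a rough illustration of the canonicalize-then-search idea, here is a
minimal Python sketch. Everything in it is an assumption made for the
example: no "canonicalize" utility exists, the FOLD table is hypothetical
and covers only the three letters named in this report, and
matching_lines() stands in for the grep step with a plain substring
search. A real tool would need a far larger table, for instance one
derived from the Unicode confusables data linked above.

```python
import unicodedata

# Hypothetical folding table: only the letters discussed in this report.
FOLD = str.maketrans({
    "\u0649": "\u06cc",  # ARABIC LETTER ALEF MAKSURA -> ARABIC LETTER FARSI YEH
    "\u064a": "\u06cc",  # ARABIC LETTER YEH          -> ARABIC LETTER FARSI YEH
})


def canonicalize(text):
    """Normalize to NFC, then fold the look-alike letters to one code point."""
    return unicodedata.normalize("NFC", text).translate(FOLD)


def matching_lines(pattern, lines):
    """Return the original lines whose canonical form contains the
    canonical form of the pattern (fixed-string search, no regex)."""
    needle = canonicalize(pattern)
    return [line for line in lines if needle in canonicalize(line)]


lines = [
    "\u0627\u062d\u0645\u062f\u06cc",  # the name ending in U+06CC
    "\u0627\u062d\u0645\u062f\u0649",  # the name ending in U+0649
    "unrelated line",
]
# Search with the U+064A spelling; both spellings in `lines` should match.
hits = matching_lines("\u0627\u062d\u0645\u062f\u064a", lines)
print(len(hits))
```

Unlike the shell pipeline, this sketch reports the original lines rather
than their canonicalized form, which a real tool might also want so that
the output still matches the file on disk.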

Dale

This bug report was last modified 51 days ago.
