GNU bug report logs - #16631
Consideration of title case on case-insensitive matching


Package: grep;

Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>

Date: Mon, 3 Feb 2014 16:21:02 UTC

Severity: normal

Tags: patch

Done: Jim Meyering <jim <at> meyering.net>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 16631 in the body.
You can then email your comments to 16631 AT debbugs.gnu.org in the normal way.



Report forwarded to bug-grep <at> gnu.org:
bug#16631; Package grep. (Mon, 03 Feb 2014 16:21:03 GMT)

Acknowledgement sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Mon, 03 Feb 2014 16:21:03 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: submit <at> debbugs.gnu.org
Subject: Consideration of title case on case-insensitive matching
Date: Tue, 04 Feb 2014 01:20:23 +0900
Package: grep
Tags: patch

In a UTF-8 locale, a letter may have not only upper-case and lower-case
forms but also a title-case form.  grep-2.16 fails to match in the
following cases because it does not take title case into account.

  echo 'LJ' | LC_ALL=en_US.UTF-8 grep -i Lj
  echo 'Lj' | LC_ALL=en_US.UTF-8 grep -i LJ

We expect that LJ and Lj are returned, respectively.  But both return
nothing. 

This patch replaces `towupper' and `towlower' with `towctrans', after
which the commands above return the expected results.
[grep-ignore-icase.txt (application/octet-stream, attachment)]
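
As an illustration of the report above (not of the attached patch, which changes C code), here is a minimal Python sketch of the three case forms of the LJ digraph. It assumes the single code points U+01C7, U+01C8 and U+01C9 were intended. The helper names are hypothetical, and Python's str methods stand in for the C functions named in the report.

```python
import unicodedata

# The three single-code-point forms of the LJ digraph.
UPPER = "\u01c7"  # LATIN CAPITAL LETTER LJ
TITLE = "\u01c8"  # LATIN CAPITAL LETTER L WITH SMALL LETTER J
LOWER = "\u01c9"  # LATIN SMALL LETTER LJ


def upper_lower_variants(ch):
    """Case variants reachable with upper/lower mappings only (hypothetical model)."""
    return {ch, ch.upper(), ch.lower()}


def all_case_variants(ch):
    """Same set, extended with the title-case mapping of the character."""
    return upper_lower_variants(ch) | {ch.title()}


if __name__ == "__main__":
    # General category of the title-case form.
    print(unicodedata.category(TITLE))
    # Is the title-case form reachable from the upper-case form?
    print(TITLE in upper_lower_variants(UPPER))
    print(TITLE in all_case_variants(UPPER))
```

In this model, a pattern written in upper or lower case reaches the title-case form only when the title-case mapping is consulted as well.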

Information forwarded to bug-grep <at> gnu.org:
bug#16631; Package grep. (Mon, 03 Feb 2014 16:30:02 GMT)

Message #8 received at 16631 <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 16631 <at> debbugs.gnu.org
Subject: Re: bug#16631: Consideration of title case on case-insensitive
 matching
Date: Mon, 3 Feb 2014 08:28:44 -0800
On Mon, Feb 3, 2014 at 8:20 AM, Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> Package: grep
> Tags: patch
>
> In UTF-8 character set, an alphabet may have not only upper case and
> lower case but title case.  grep-2.16 fails in matching as following
> in order not to take it into consideration.
>
>   echo 'LJ' | LC_ALL=en_US.UTF-8 grep -i Lj
>   echo 'Lj' | LC_ALL=en_US.UTF-8 grep -i LJ
>
> We expect that LJ and Lj are returned, respectively.  But both return
> nothing.
>
> This patch replaces `towupper' and `towlower' to `towctrans'.
> And the above will return the expected results.

Thank you for working on this.  However, the attached patch is one
that has already been applied.




Information forwarded to bug-grep <at> gnu.org:
bug#16631; Package grep. (Mon, 03 Feb 2014 16:36:01 GMT)

Message #11 received at 16631 <at> debbugs.gnu.org:

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: 16631 <at> debbugs.gnu.org
Subject: Re: bug#16631: Consideration of title case on case-insensitive
 matching
Date: Tue, 04 Feb 2014 01:34:51 +0900
Sorry, I attached the wrong patch.
Here is the correct one.
[case-fold-title-case.txt (application/octet-stream, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16631; Package grep. (Mon, 03 Feb 2014 18:24:01 GMT)

Message #14 received at 16631 <at> debbugs.gnu.org:

From: Aharon Robbins <arnold <at> skeeve.com>
To: noritnk <at> kcn.ne.jp, 16631 <at> debbugs.gnu.org
Subject: Re: bug#16631: Consideration of title case on case-insensitive
 matching
Date: Mon, 03 Feb 2014 20:23:28 +0200
Hi.

I'm just wondering - does the regex code have the same issue with
title case characters?  This is an issue for gawk.  I will try to run
your test on gawk, but if you have time to check you can do so by
setting GAWK_NO_DFA in the environment and then gawk will bypass the
dfa matcher.

Thanks!

Arnold




Information forwarded to bug-grep <at> gnu.org:
bug#16631; Package grep. (Thu, 06 Feb 2014 23:42:01 GMT)

Message #17 received at 16631 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, 16631 <at> debbugs.gnu.org
Subject: Re: bug#16631: Consideration of title case on case-insensitive
 matching
Date: Thu, 06 Feb 2014 15:41:17 -0800
On 02/03/2014 08:20 AM, Norihiro Tanaka wrote:
>    echo 'LJ' | LC_ALL=en_US.UTF-8 grep -i Lj
>    echo 'Lj' | LC_ALL=en_US.UTF-8 grep -i LJ
>
> We expect that LJ and Lj are returned, respectively.  But both return
> nothing.
Both test cases worked for me. I expect that you meant the cases with 
single characters, as in "echo ǉ | LC_ALL=en_US.UTF-8 grep -i ǈ".

I have doubts about this patch, for several reasons.

1. It doesn't solve the problem from the ordinary user's point of view. 
For example, "echo lj | LC_ALL=en_US.UTF-8 src/grep -i ǈ" will still
output nothing, because the one-character pattern "ǈ" does not match the
two-character string "lj" even when the latter's two-letter case 
variants "Lj", "lJ", "LJ" are considered.

2. The characters in question are present in Unicode only for 
compatibility with previous standards; they're not intended to be used 
in new text. So this is a problem of the past, one that has mostly died 
out already.

3. Because of (2) the characters in question are rare, even in the 
languages where one might naively think they're useful. For example, the 
Croatian Wikipedia page for Ljubljana 
<http://hr.wikipedia.org/wiki/Ljubljana> consistently uses the 
two-character forms "Lj" and "lj", not the one-character forms "ǈ" and "ǉ".

4. The solution doesn't generalize to similar problems in 
more-complicated orthographies. For example, in polytonic Greek when 
ignoring case ordinary users would expect "ᾄ" (U+1F84) to match not only 
"ᾌ" (U+1F8C), but also "Α" (U+0391), "ΑΙ" (U+0391, U+0399; two 
characters) and "Αι" (U+0391, U+03B9). Worse, this depends on context: 
often "ᾄ" should not match "Αι" when ignoring case. For details on this, 
please see Nick Nicholas's discussion "Titlecase and Adscripts" 
<http://www.tlg.uci.edu/~opoudjis/unicode/unicode_adscript.html>.

5. When POSIX specifies how to match a regular expression while ignoring 
case, it talks only about "uppercase or lowercase" 
<http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_02>. 
If we change 'grep' along the lines being suggested, we'll either have 
to change POSIX, or have the change take effect only if POSIXLY_CORRECT 
is not set.

Taking all this into consideration, it sounds like we should let
sleeping dogs lie, i.e., that dfa.c should do the minimal work
needed to support traditional case-insensitive matching a la POSIX.
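
Points 1 and 4 can be checked with a short Python sketch. It is only an illustration: the helper names are hypothetical, Python's full case mappings stand in for whatever a C library provides, and the NFKC-based comparison is an assumption of this sketch, not something proposed in this thread.

```python
import unicodedata

TITLE_DIGRAPH = "\u01c8"  # one code point: capital L with small j
GREEK_SMALL = "\u1f84"    # small alpha with psili, oxia and ypogegrammeni


def per_char_variants(ch):
    """Single-character case variants: upper, lower and title forms."""
    return {ch, ch.upper(), ch.lower(), ch.title()}


def compat_fold(s):
    """Compatibility-normalize, then case-fold (an assumption of this sketch)."""
    return unicodedata.normalize("NFKC", s).casefold()


# Point 1: no case variant of the one-character digraph equals any
# two-character ASCII spelling.
two_char_spellings = {"lj", "Lj", "lJ", "LJ"}
no_overlap = per_char_variants(TITLE_DIGRAPH).isdisjoint(two_char_spellings)

# Only after compatibility normalization do the two spellings compare equal.
same_after_fold = compat_fold(TITLE_DIGRAPH) == compat_fold("lj")

# Point 4: the full upper-case mapping of the Greek letter expands to more
# than one code point, so a one-to-one case table cannot express it.
greek_upper_length = len(GREEK_SMALL.upper())

if __name__ == "__main__":
    print(no_overlap, same_after_fold, greek_upper_length)
```

The sketch says nothing about the context dependence mentioned in point 4. That part needs linguistic knowledge that no case table carries.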




Information forwarded to bug-grep <at> gnu.org:
bug#16631; Package grep. (Fri, 07 Feb 2014 16:50:05 GMT)

Message #20 received at 16631 <at> debbugs.gnu.org:

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 16631 <at> debbugs.gnu.org
Subject: Re: bug#16631: Consideration of title case on case-insensitive
 matching
Date: Sat, 08 Feb 2014 01:49:47 +0900
Paul Eggert wrote:
> 1. It doesn't solve the problem from the ordinary user's point of view.
> For example, "echo lj | LC_ALL=en_US.UTF-8 src/grep -i ǈ" will still
> output nothing, because the one-character pattern "ǈ" does not match
> the two-character string "lj" even when the latter's two-letter case
> variants "Lj", "lJ", "LJ" are considered.
> 
> 2. The characters in question are present in Unicode only for
> compatibility with previous standards; they're not intended to be used
> in new text. So this is a problem of the past, one that has mostly died
> out already.
> 
> 3. Because of (2) the characters in question are rare, even in the
> languages where one might naively think they're useful. For example,
> the Croatian Wikipedia page for Ljubljana <http://hr.wikipedia.org/wiki/Ljubljana>
> consistently uses the two-character forms "Lj" and "lj", not the
> one-character forms "ǈ" and "ǉ".
> 
> 4. The solution doesn't generalize to similar problems in more-complicated
> orthographies. For example, in polytonic Greek when ignoring case
> ordinary users would expect "ᾄ" (U+1F84) to match not only "ᾌ" (U+1F8C),
> but also "Α" (U+0391), "ΑΙ" (U+0391, U+0399; two characters) and "Αι"
> (U+0391, U+03B9). Worse, this depends on context: often "ᾄ" should
> not match "Αι" when ignoring case. For details on this, please see
> Nick Nicholas's discussion "Titlecase and Adscripts"
> <http://www.tlg.uci.edu/~opoudjis/unicode/unicode_adscript.html>.
> 
I think the problem is that glibc doesn't define a conversion between the
two-character string "lj" and the single-character "ǈ", between "ᾌ" (U+1F8C)
and "Α" (U+0391), etc.

For example, grep on HP-UX, which appears to be quite compliant with POSIX,
supports conversion between the single-character "ǉ" and the
single-character "ǈ", though it doesn't support the conversions above.

I believe the conversion rules should follow the locale data of libc.
It looks like the conversion between "Lj", "lJ" and "LJ" is defined in
UTF-8, but not between U+1F84 and U+0391, etc.

> 5. When POSIX specifies how to match a regular expression while ignoring 
> case, it talks only about "uppercase or lowercase" 
> <http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_02>. 
> If we change 'grep' along the lines being suggested, we'll either have 
> to change POSIX, or have the change take effect only if POSIXLY_CORRECT 
> is not set.

The upper case of the single-character "ǈ" is "Ǉ" and the lower case is "ǉ".
These conversions are also supported by the towupper and towlower functions.


Aharon Robbins wrote:
> This is an issue for gawk.

It seems that I misunderstood.  The problem doesn't reproduce on
grep-2.16.  It is caused by the patch for bug#16421, which removes the
GREP-oriented code in dfa.c.






Reply sent to Jim Meyering <jim <at> meyering.net>:
You have taken responsibility. (Sun, 09 Feb 2014 16:31:03 GMT)

Notification sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>:
bug acknowledged by developer. (Sun, 09 Feb 2014 16:31:04 GMT)

Message #25 received at 16631-done <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, 16631-done <at> debbugs.gnu.org
Subject: Re: bug#16631: Consideration of title case on case-insensitive
 matching
Date: Sun, 9 Feb 2014 08:30:22 -0800
Thank you both for helping with this issue.
Paul's argument has convinced me that we should not make this change now.
If anyone comes up with a good case for applying this patch, please let us
know and we'll be happy to revisit.




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 10 Mar 2014 11:24:10 GMT)

This bug report was last modified 10 years and 57 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.