GNU bug report logs - #16631: Consideration of title case on case-insensitive matching
Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Date: Mon, 3 Feb 2014 16:21:02 UTC
Severity: normal
Tags: patch
Done: Jim Meyering <jim <at> meyering.net>
Bug is archived. No further changes may be made.
Report forwarded to bug-grep <at> gnu.org: bug#16631; Package grep. (Mon, 03 Feb 2014 16:21:03 GMT)
Acknowledgement sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>: New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Mon, 03 Feb 2014 16:21:03 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Package: grep
Tags: patch
In a UTF-8 locale, a letter may have not only upper-case and lower-case
forms but also a title-case form. grep-2.16 fails to match in the
following cases because it does not take title case into consideration.
echo 'LJ' | LC_ALL=en_US.UTF-8 grep -i Lj
echo 'Lj' | LC_ALL=en_US.UTF-8 grep -i LJ
We expect LJ and Lj to be printed, respectively, but both commands print
nothing.
This patch replaces `towupper' and `towlower' with `towctrans'.
With it, the commands above return the expected results.
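As an illustration of what "title case" means here, the following is a minimal Python sketch, not grep's code. It assumes the report is about single-code-point digraphs such as U+01C7, U+01C8 and U+01C9, and it uses Python's built-in Unicode tables rather than the C library's `towupper', `towlower' and `towctrans'. The helper names `upper_lower_variants` and `all_case_variants` are hypothetical.

```python
import unicodedata

# The Latin digraph LJ is encoded as three single code points.
UPPER, TITLE, LOWER = "\u01C7", "\u01C8", "\u01C9"

for ch in (UPPER, TITLE, LOWER):
    # General category: Lu = uppercase, Lt = titlecase, Ll = lowercase.
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")


def upper_lower_variants(ch):
    """Hypothetical variant set built from upper- and lower-casing only."""
    return {ch, ch.upper(), ch.lower()}


def all_case_variants(ch):
    """The same set, extended with the title-case mapping."""
    return upper_lower_variants(ch) | {ch.title()}


# Starting from the upper- or lower-case form, upper/lower mappings alone
# never reach the title-case form.
print(TITLE in upper_lower_variants(UPPER))
print(TITLE in upper_lower_variants(LOWER))

# Adding the title-case mapping closes the gap.
print(TITLE in all_case_variants(UPPER))
```

Under these assumptions, a matcher that expands each pattern character only to its upper- and lower-case forms would miss title-case input, which is the kind of gap the report describes.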
[grep-ignore-icase.txt (application/octet-stream, attachment)]
Information forwarded to bug-grep <at> gnu.org: bug#16631; Package grep. (Mon, 03 Feb 2014 16:30:02 GMT)
Message #8 received at 16631 <at> debbugs.gnu.org (full text, mbox):
On Mon, Feb 3, 2014 at 8:20 AM, Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> Package: grep
> Tags: patch
>
> In a UTF-8 locale, a letter may have not only upper-case and lower-case
> forms but also a title-case form. grep-2.16 fails to match in the
> following cases because it does not take title case into consideration.
>
> echo 'LJ' | LC_ALL=en_US.UTF-8 grep -i Lj
> echo 'Lj' | LC_ALL=en_US.UTF-8 grep -i LJ
>
> We expect LJ and Lj to be printed, respectively, but both commands print
> nothing.
>
> This patch replaces `towupper' and `towlower' with `towctrans'.
> With it, the commands above return the expected results.
Thank you for working on this. However, the attached patch is one
that has already been applied.
Information forwarded to bug-grep <at> gnu.org: bug#16631; Package grep. (Mon, 03 Feb 2014 16:36:01 GMT)
Message #11 received at 16631 <at> debbugs.gnu.org (full text, mbox):
Sorry, I attached the wrong patch.
Here is the correct one.
[case-fold-title-case.txt (application/octet-stream, attachment)]
Information forwarded to bug-grep <at> gnu.org: bug#16631; Package grep. (Mon, 03 Feb 2014 18:24:01 GMT)
Message #14 received at 16631 <at> debbugs.gnu.org (full text, mbox):
Hi.
I'm just wondering - does the regex code have the same issue with
title case characters? This is an issue for gawk. I will try to run
your test on gawk, but if you have time to check you can do so by
setting GAWK_NO_DFA in the environment and then gawk will bypass the
dfa matcher.
Thanks!
Arnold
Information forwarded to bug-grep <at> gnu.org: bug#16631; Package grep. (Thu, 06 Feb 2014 23:42:01 GMT)
Message #17 received at 16631 <at> debbugs.gnu.org (full text, mbox):
On 02/03/2014 08:20 AM, Norihiro Tanaka wrote:
> echo 'LJ' | LC_ALL=en_US.UTF-8 grep -i Lj
> echo 'Lj' | LC_ALL=en_US.UTF-8 grep -i LJ
>
> We expect LJ and Lj to be printed, respectively, but both commands print
> nothing.
Both test cases worked for me. I expect that you meant the cases with
single characters, as in "echo ǉ | LC_ALL=en_US.UTF-8 grep -i ǈ".
I have doubts about this patch, for several reasons.
1. It doesn't solve the problem from the ordinary user's point of view.
For example, "echo lj | LC_ALL=en_US.UTF-8 src/grep -i ǈ" will still
output nothing, because the one-character pattern "ǈ" does not match the
two-character string "lj" even when the latter's two-letter case
variants "Lj", "lJ", "LJ" are considered.
2. The characters in question are present in Unicode only for
compatibility with previous standards; they're not intended to be used
in new text. So this is a problem of the past, one that has mostly died
out already.
3. Because of (2) the characters in question are rare, even in the
languages where one might naively think they're useful. For example, the
Croatian Wikipedia page for Ljubljana
<http://hr.wikipedia.org/wiki/Ljubljana> consistently uses the
two-character forms "Lj" and "lj", not the one-character forms "ǈ" and "ǉ".
4. The solution doesn't generalize to similar problems in
more-complicated orthographies. For example, in polytonic Greek when
ignoring case ordinary users would expect "ᾄ" (U+1F84) to match not only
"ᾌ" (U+1F8C), but also "Α" (U+0391), "ΑΙ" (U+0391, U+0399; two
characters) and "Αι" (U+0391, U+03B9). Worse, this depends on context:
often "ᾄ" should not match "Αι" when ignoring case. For details on this,
please see Nick Nicholas's discussion "Titlecase and Adscripts"
<http://www.tlg.uci.edu/~opoudjis/unicode/unicode_adscript.html>.
5. When POSIX specifies how to match a regular expression while ignoring
case, it talks only about "uppercase or lowercase"
<http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_02>.
If we change 'grep' along the lines being suggested, we'll either have
to change POSIX, or have the change take effect only if POSIXLY_CORRECT
is not set.
Taking all this into consideration, it sounds like we should let
sleeping dogs lie, i.e., that dfa.c should do the minimal work
needed to support traditional case-insensitive matching a la POSIX.
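Points 1 and 2 above can be checked against the Unicode Character Database. The following is a minimal Python sketch, assuming Python's bundled Unicode data as a stand-in for what a C library would provide; it is not part of grep or of the proposed patch.

```python
import unicodedata

TITLE_LJ = "\u01C8"  # single code point: capital L with small letter j

# Point 2: the character's decomposition carries the <compat> tag,
# which is consistent with its being a compatibility character.
print(unicodedata.decomposition(TITLE_LJ))

# Point 1: case folding maps the single code point to another single
# code point (U+01C9), never to the two-character string "lj".
folded = TITLE_LJ.casefold()
print(len(folded), folded == "lj")

# Only compatibility normalization (NFKC) turns it into two characters,
# and that is a separate operation from case conversion.
print(unicodedata.normalize("NFKC", TITLE_LJ) == "Lj")
```

So a matcher that only folds case, even with title case included, would still treat the one-character and two-character spellings as different strings.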
Information forwarded to bug-grep <at> gnu.org: bug#16631; Package grep. (Fri, 07 Feb 2014 16:50:05 GMT)
Message #20 received at 16631 <at> debbugs.gnu.org (full text, mbox):
Paul Eggert wrote:
> 1. It doesn't solve the problem from the ordinary user's point of view.
> For example, "echo lj | LC_ALL=en_US.UTF-8 src/grep -i ǈ" will still
> output nothing, because the one-character pattern "ǈ" does not match
> the two-character string "lj" even when the latter's two-letter case
> variants "Lj", "lJ", "LJ" are considered.
>
> 2. The characters in question are present in Unicode only for
> compatibility with previous standards; they're not intended to be used
> in new text. So this is a problem of the past, one that has mostly died
> out already.
>
> 3. Because of (2) the characters in question are rare, even in the
> languages where one might naively think they're useful. For example,
> the Croatian Wikipedia page for Ljubljana <http://hr.wikipedia.org/wiki/Ljubljana>
> consistently uses the two-character forms "Lj" and "lj", not the
> one-character forms "ǈ" and "ǉ".
>
> 4. The solution doesn't generalize to similar problems in more-complicated
> orthographies. For example, in polytonic Greek when ignoring case
> ordinary users would expect "ᾄ" (U+1F84) to match not only "ᾌ" (U+1F8C),
> but also "Α" (U+0391), "ΑΙ" (U+0391, U+0399; two characters) and "Αι"
> (U+0391, U+03B9). Worse, this depends on context: often "ᾄ" should
> not match "Αι" when ignoring case. For details on this, please see
> Nick Nicholas's discussion "Titlecase and Adscripts"
> <http://www.tlg.uci.edu/~opoudjis/unicode/unicode_adscript.html>.
>
I think the underlying problem is that glibc doesn't define conversions
between the two-character string "lj" and the single-character Lj, or
between "ᾌ" (U+1F8C) and "Α" (U+0391), etc.
For example, grep on HP-UX, which appears to be quite POSIX-compliant,
supports conversion between the single-character "lj" and the
single-character "Lj", though it doesn't support the conversions above.
I believe the conversion rules should follow libc's locale data.
It looks like conversion among "Lj", "lJ" and "LJ"
is defined in UTF-8, but not between U+1F84 and U+0391 etc.
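The distinction drawn here can be sketched in Python. This is an illustration only: it consults Python's Unicode tables, not glibc's locale data, and Python applies full case mappings (which may return several characters), whereas `towupper' and `towlower' return a single wide character. The helper `single_char_variants` is hypothetical.

```python
def single_char_variants(ch):
    """Close {ch} under upper/lower/title, keeping one-code-point results."""
    seen, todo = set(), [ch]
    while todo:
        c = todo.pop()
        if c in seen:
            continue
        seen.add(c)
        for mapped in (c.upper(), c.lower(), c.title()):
            if len(mapped) == 1:  # skip one-to-many mappings
                todo.append(mapped)
    return seen


DIGRAPH_TITLE = "\u01C8"  # title-case LJ digraph
GREEK_SMALL = "\u1F84"    # small alpha with psili, oxia and ypogegrammeni
GREEK_TITLE = "\u1F8C"    # capital alpha with psili, oxia and prosgegrammeni
PLAIN_ALPHA = "\u0391"    # capital alpha

# The three digraph forms map to one another.
print(sorted(f"U+{ord(c):04X}" for c in single_char_variants(DIGRAPH_TITLE)))

# The Greek letter reaches its title-case form, but no case mapping
# leads to plain capital alpha.
print(sorted(f"U+{ord(c):04X}" for c in single_char_variants(GREEK_SMALL)))
print(PLAIN_ALPHA in single_char_variants(GREEK_SMALL))

# Full upper-casing of U+1F84 yields more than one code point.
print(len(GREEK_SMALL.upper()))
```

Under these assumptions, the digraph forms are linked by the case mappings themselves, while matching U+1F84 against U+0391 would need something beyond case conversion.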
> 5. When POSIX specifies how to match a regular expression while ignoring
> case, it talks only about "uppercase or lowercase"
> <http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_02>.
> If we change 'grep' along the lines being suggested, we'll either have
> to change POSIX, or have the change take effect only if POSIXLY_CORRECT
> is not set.
The upper case of the single-character "Lj" is "LJ" and its lower case is "lj".
These conversions are also supported by the towupper and towlower functions.
Aharon Robbins wrote:
> This is an issue for gawk.
It seems that I misunderstood. The problem doesn't reproduce on
grep-2.16. It is caused by the patch for bug#16421, which removes the
GREP-oriented code in dfa.c.
Reply sent to Jim Meyering <jim <at> meyering.net>: You have taken responsibility. (Sun, 09 Feb 2014 16:31:03 GMT)
Notification sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>: bug acknowledged by developer. (Sun, 09 Feb 2014 16:31:04 GMT)
Message #25 received at 16631-done <at> debbugs.gnu.org (full text, mbox):
Thank you both for helping with this issue.
Paul's argument has convinced me that we should not make this change now.
If anyone comes up with a good case for applying this patch, please let us
know and we'll be happy to revisit.
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 10 Mar 2014 11:24:10 GMT)
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.