GNU bug report logs - #43577
wrong result for grep -io in turkish locale

Package: grep

Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>

Date: Wed, 23 Sep 2020 13:24:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 43577 in the body.
You can then email your comments to 43577 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-grep <at> gnu.org:
bug#43577; Package grep. (Wed, 23 Sep 2020 13:24:02 GMT)

Acknowledgement sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Wed, 23 Sep 2020 13:24:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: <bug-grep <at> gnu.org>
Subject: wrong result for grep -io in turkish locale
Date: Wed, 23 Sep 2020 22:23:09 +0900
In the Turkish locale, upper and lower case are mapped as follows.

  U0049 <-> U0131
  U0069 <-> U0130

Both of the following test cases are expected to output U0130, but the
latter outputs nothing.

$ printf '\304\260\n' >I  # U0130
$ env LC_ALL=tr_TR.utf8 grep -i i I
İ  # U0130
$ env LC_ALL=tr_TR.utf8 grep -oi i I
$ 

By the way, both of the following test cases work correctly.

$ printf '\304\261\n' >i  # U0131
$ env LC_ALL=tr_TR.utf8 grep -i I i
ı  # U0131
$ env LC_ALL=tr_TR.utf8 grep -oi I i
ı  # U0131
$
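
For anyone checking the byte sequences by hand, here is a small sketch (Python, not part of the original report) that prints the UTF-8 encoding of the two Turkish letters as printf-style octal escapes. The TURKISH_PAIRS table and the octal_escapes helper are names made up for this illustration; the pairs are written out explicitly because Python's own case conversion does not follow the Turkish locale.

```python
# Turkish case pairs from the report, as an explicit table (illustrative only;
# Python's str.upper()/str.lower() do not apply Turkish locale rules).
TURKISH_PAIRS = {
    "I": "\u0131",  # U+0049 <-> U+0131 (dotless small i)
    "i": "\u0130",  # U+0069 <-> U+0130 (capital I with dot above)
}


def octal_escapes(ch):
    """Return the UTF-8 bytes of ch as printf-style octal escapes."""
    return "".join(f"\\{b:03o}" for b in ch.encode("utf-8"))


for ascii_letter, turkish_letter in TURKISH_PAIRS.items():
    code_point = f"U+{ord(turkish_letter):04X}"
    print(f"{ascii_letter} <-> {code_point} = {octal_escapes(turkish_letter)}")
# prints:
#   I <-> U+0131 = \304\261
#   i <-> U+0130 = \304\260
```

Both letters take two bytes in UTF-8, while their ASCII counterparts take one.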





Information forwarded to bug-grep <at> gnu.org:
bug#43577; Package grep. (Wed, 23 Sep 2020 14:32:02 GMT)

Message #8 received at 43577 <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 43577 <at> debbugs.gnu.org
Subject: Re: bug#43577: wrong result for grep -io in turkish locale
Date: Wed, 23 Sep 2020 07:30:58 -0700
On Wed, Sep 23, 2020 at 6:24 AM Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
>
> In the Turkish locale, upper and lower case are mapped as follows.
>
>   U0049 <-> U0131
>   U0069 <-> U0130
>
> Both of the following test cases are expected to output U0130, but the
> latter outputs nothing.
>
> $ printf '\304\260\n' >I  # U0130
> $ env LC_ALL=tr_TR.utf8 grep -i i I
> İ  # U0130

Oh! We must have different code or systems.
When I run anything using -i and that locale on Fedora 32, it aborts:

$ LC_ALL=tr_TR.utf8 src/grep -i a
zsh: abort (core dumped)  LC_ALL=tr_TR.utf8 src/grep -i a




Information forwarded to bug-grep <at> gnu.org:
bug#43577; Package grep. (Wed, 23 Sep 2020 18:58:01 GMT)

Message #11 received at 43577 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>, Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 43577 <at> debbugs.gnu.org
Subject: Re: bug#43577: wrong result for grep -io in turkish locale
Date: Wed, 23 Sep 2020 11:57:25 -0700
On 9/23/20 7:30 AM, Jim Meyering wrote:
> $ LC_ALL=tr_TR.utf8 src/grep -i a
> zsh: abort (core dumped)  LC_ALL=tr_TR.utf8 src/grep -i a

I can reproduce this bug. There seems to be a performance regression too. I'll 
look into it.




Information forwarded to bug-grep <at> gnu.org:
bug#43577; Package grep. (Thu, 24 Sep 2020 01:48:01 GMT)

Message #14 received at 43577 <at> debbugs.gnu.org:

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: 43577 <at> debbugs.gnu.org
Subject: Re: bug#43577: wrong result for grep -io in turkish locale
Date: Thu, 24 Sep 2020 10:47:31 +0900
I attach the fix for the bug.  The regex issue was fixed by Paul, thank you.
[0001-grep-fix-ignore-case-Turkish-bug.patch (text/plain, attachment)]

Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Thu, 24 Sep 2020 02:58:01 GMT)

Notification sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>:
bug acknowledged by developer. (Thu, 24 Sep 2020 02:58:01 GMT)

Message #19 received at 43577-done <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 43577-done <at> debbugs.gnu.org
Subject: Re: bug#43577: wrong result for grep -io in turkish locale
Date: Wed, 23 Sep 2020 19:57:36 -0700
On 9/23/20 6:47 PM, Norihiro Tanaka wrote:
> I attach the fix for the bug.  The regex issue was fixed by Paul, thank you.
> 

Thanks, I had written a similar patch, and your patch helped me find a bug in
what I wrote. The patch I wrote uses an auxiliary ok_fold table that lets
fgrep_icase_charlen avoid calling mbrtowc for single-byte characters in the
pattern; this may help performance for long patterns. More important,
fgrep_icase_charlen does not return -1 for a character like 'a' in an
en_US.UTF-8 locale merely because 'a' has a case-folded counterpart 'A'; the
idea is that we should be OK if the case-folded counterparts are single-byte.
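
A rough illustration of the table idea described above, written in Python rather than grep's C and not taken from the patch. The names counterparts, build_ok_fold and icase_charlen are hypothetical, the counterpart rules are a deliberately tiny stand-in for real locale data, and the sketch simply rejects every multibyte pattern character, which is stricter than what the message describes.

```python
# Toy model of a per-byte "is byte-wise case folding safe?" table.
# Assumption: only the four Turkish i-letters get special treatment;
# real case folding has more counterparts than are modelled here.

def counterparts(ch, turkish):
    """Return the other-case forms of ch under the toy rules of this sketch."""
    if turkish:
        special = {"i": "\u0130", "\u0130": "i", "I": "\u0131", "\u0131": "I"}
        if ch in special:
            return {special[ch]}
    swapped = ch.swapcase()
    if swapped != ch and len(swapped) == 1:
        return {swapped}
    return set()


def build_ok_fold(turkish):
    """For each ASCII byte: 1 if every counterpart is one UTF-8 byte, else -1."""
    table = []
    for byte in range(128):
        others = counterparts(chr(byte), turkish)
        all_single = all(len(o.encode("utf-8")) == 1 for o in others)
        table.append(1 if all_single else -1)
    return table


def icase_charlen(pattern, ok_fold):
    """Return the byte length of the first character if it is safe, else -1."""
    first = pattern[0]
    if first < 128:
        return ok_fold[first]  # plain table lookup, no multibyte decoding
    return -1  # simplification: this sketch gives up on multibyte characters


en_table = build_ok_fold(turkish=False)
tr_table = build_ok_fold(turkish=True)
print(icase_charlen(b"a", en_table))  # 1: 'A' is a single byte
print(icase_charlen(b"i", tr_table))  # -1: U+0130 needs two bytes
```

The table is built once, so each single-byte pattern character costs only an index lookup.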

I had added more-extensive tests than were in your patch, and some of them found 
a crash in kwsinit that indicated a similar change is needed there. I assume 
this was because the patch I wrote had a more-generous fgrep_icase_charlen. As 
this simplifies kwsinit, this patch does that too.

While looking into this I found a performance glitch I recently introduced (I
double-counted some regular expressions, messing up later heuristics). I also
checked this on our old Solaris 10 box and fixed a couple of porting
glitches. I installed the attached patches into the master branch, to make
it easier for you to compare your changes to mine. Patch 0003 is the enhanced
version of the patch that you wrote.

Thanks again for working on this.
[0001-grep-fix-recently-introduced-performance-glitch.patch (text/x-patch, attachment)]
[0002-build-update-gnulib-submodule-to-latest.patch (text/x-patch, attachment)]
[0003-grep-fix-more-Turkish-eyes-bugs.patch (text/x-patch, attachment)]
[0004-grep-pacify-Sun-C-5.15.patch (text/x-patch, attachment)]
[0005-grep-don-t-assume-PCRE-in-tests.patch (text/x-patch, attachment)]

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 22 Oct 2020 11:24:05 GMT)

This bug report was last modified 3 years and 158 days ago.

