GNU bug report logs - #16893
[PATCH] Avoid matching line-by-line for case-insensitive with grep

Previous Next

Package: grep;

Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>

Date: Thu, 27 Feb 2014 16:03:02 UTC

Severity: normal

Tags: patch

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 16893 in the body.
You can then email your comments to 16893 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to help-debbugs <at> gnu.org:
bug#16893; Package debbugs.gnu.org. (Thu, 27 Feb 2014 16:03:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>:
New bug report received and forwarded. Copy sent to help-debbugs <at> gnu.org. (Thu, 27 Feb 2014 16:03:03 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: submit <at> debbugs.gnu.org
Subject: [PATCH] Avoid matching line-by-line for case-insensitive with grep
Date: Fri, 28 Feb 2014 01:02:40 +0900
[Message part 1 (text/plain, inline)]
Now grep and awk matchers doesn't waste buffer in case-sensisitive matching.
So I think that we can avoid line-by-line matching for them.

It enable to speed up case-sensitive matching with grep or awk matcher
without trivial_case_ignore as fast as when with it.

In bug#16232:
> The following times 2.16, 2.17 and 2.17+patch two ways:
> 
> $ yes jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj | head -10000000 > k
> $ for i in 16 17 18; do echo $i; env LC_ALL=en_US.UTF-8 time
> /p/p/grep-2.$i/bin/grep -i foobar k; done
> 16
>        15.96 real        14.57 user         0.12 sys
> 17
>         1.13 real         1.07 user         0.06 sys
> 18
>         1.96 real         1.89 user         0.06 sys
> 
> The above search takes more than 70% longer with the proposed patch.

Therefore, I think 30% slow-down is caused by the line-by-line matching
for them.
[avoid_line_by_line.txt (application/octet-stream, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16893; Package grep. (Thu, 27 Feb 2014 16:55:01 GMT) Full text and rfc822 format available.

Message #8 received at 16893 <at> debbugs.gnu.org (full text, mbox):

From: Glenn Morris <rgm <at> gnu.org>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 16893 <at> debbugs.gnu.org
Subject: Re: bug#16893: [PATCH] Avoid matching line-by-line for
 case-insensitive with grep
Date: Thu, 27 Feb 2014 11:54:39 -0500
This message was sent to the submit <at> debbugs address with no 
"Package:" specified in the body. So it ended up on the help-debbugs
mailing list rather than bug-grep. I have assigned it to grep and am
sending this mail, which will appear on the bug-grep list.

For new reports, either use the bug-grep address, or remember to use
Package: grep at the start of the body. They both have identical results.

Norihiro Tanaka wrote:

> Now grep and awk matchers doesn't waste buffer in case-sensisitive matching.
> So I think that we can avoid line-by-line matching for them.
>
> It enable to speed up case-sensitive matching with grep or awk matcher
> without trivial_case_ignore as fast as when with it.
>
> In bug#16232:
>> The following times 2.16, 2.17 and 2.17+patch two ways:
>> 
>> $ yes jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj | head -10000000 > k
>> $ for i in 16 17 18; do echo $i; env LC_ALL=en_US.UTF-8 time
>> /p/p/grep-2.$i/bin/grep -i foobar k; done
>> 16
>>        15.96 real        14.57 user         0.12 sys
>> 17
>>         1.13 real         1.07 user         0.06 sys
>> 18
>>         1.96 real         1.89 user         0.06 sys
>> 
>> The above search takes more than 70% longer with the proposed patch.
>
> Therefore, I think 30% slow-down is caused by the line-by-line matching
> for them.

[See attachment at http://debbugs.gnu.org/16893]




Information forwarded to bug-grep <at> gnu.org:
bug#16893; Package grep. (Fri, 28 Feb 2014 02:55:01 GMT) Full text and rfc822 format available.

Message #11 received at 16893 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Glenn Morris <rgm <at> gnu.org>
Cc: 16893 <at> debbugs.gnu.org, Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Subject: Re: bug#16893: [PATCH] Avoid matching line-by-line for
 case-insensitive with grep
Date: Thu, 27 Feb 2014 18:54:14 -0800
On Thu, Feb 27, 2014 at 8:54 AM, Glenn Morris <rgm <at> gnu.org> wrote:
>
> This message was sent to the submit <at> debbugs address with no
> "Package:" specified in the body. So it ended up on the help-debbugs
> mailing list rather than bug-grep. I have assigned it to grep and am
> sending this mail, which will appear on the bug-grep list.
>
> For new reports, either use the bug-grep address, or remember to use
> Package: grep at the start of the body. They both have identical results.
>
> Norihiro Tanaka wrote:
>
>> Now grep and awk matchers doesn't waste buffer in case-sensisitive matching.
>> So I think that we can avoid line-by-line matching for them.
>>
>> It enable to speed up case-sensitive matching with grep or awk matcher
>> without trivial_case_ignore as fast as when with it.
>>
>> In bug#16232:
>>> The following times 2.16, 2.17 and 2.17+patch two ways:
>>>
>>> $ yes jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj | head -10000000 > k
>>> $ for i in 16 17 18; do echo $i; env LC_ALL=en_US.UTF-8 time
>>> /p/p/grep-2.$i/bin/grep -i foobar k; done
>>> 16
>>>        15.96 real        14.57 user         0.12 sys
>>> 17
>>>         1.13 real         1.07 user         0.06 sys
>>> 18
>>>         1.96 real         1.89 user         0.06 sys
>>>
>>> The above search takes more than 70% longer with the proposed patch.
>>
>> Therefore, I think 30% slow-down is caused by the line-by-line matching
>> for them.
>
> [See attachment at http://debbugs.gnu.org/16893]

Thank you for forwarding that, Glenn.

Thank you for the patch, Norihiro.
However, your removal of the "MB_CUR_MAX == 1" disjunct
would cause unibyte grep -i with (-F or -P) to match line-by-line,
whereas currently it uses the buffer-matching code.
That would make a search like this take 3 times longer:

  seq 30000000 > in
  LC_ALL=C grep -iF foo in

Without your patch, best of 5 wall clock time is 0.31s,
yet with the patch, it takes 1.00s.

I presume you will want to retain that disjunct?
If you submit an adjusted patch, please also include in the
commit log some timing examples showing how the change affects
grep's performance.




Information forwarded to bug-grep <at> gnu.org:
bug#16893; Package grep. (Fri, 28 Feb 2014 15:03:02 GMT) Full text and rfc822 format available.

Message #14 received at 16893 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Jim Meyering <jim <at> meyering.net>
Cc: 16893 <at> debbugs.gnu.org, Glenn Morris <rgm <at> gnu.org>
Subject: Re: bug#16893: [PATCH] Avoid matching line-by-line for
 case-insensitive with grep
Date: Sat, 01 Mar 2014 00:02:27 +0900
[Message part 1 (text/plain, inline)]
Hi Jim,

I thank you for your review and pointing the bug for the patch.  You are
right.  I have written the wrong if conditions.  I think that behavior
shouldn't be changed for the pcre or fgrep matcher by the patch.  I have
fixed its bug, and re-send the patch and results of tests.

Norihiro
[avoid_line_by_line.txt (application/octet-stream, attachment)]
[tests.txt (application/octet-stream, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16893; Package grep. (Fri, 28 Feb 2014 15:13:01 GMT) Full text and rfc822 format available.

Message #17 received at 16893 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Jim Meyering <jim <at> meyering.net>
Cc: 16893 <at> debbugs.gnu.org
Subject: Re: bug#16893: [PATCH] Avoid matching line-by-line for
 case-insensitive with grep
Date: Sat, 01 Mar 2014 00:11:49 +0900
[Message part 1 (text/plain, inline)]
I used the attachment on this mail to test for "removal of
trivial_case_ignore".

Norihiro
[removal_of_trivial_case_ignore.txt (application/octet-stream, attachment)]

Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Sat, 01 Mar 2014 06:27:02 GMT) Full text and rfc822 format available.

Notification sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>:
bug acknowledged by developer. (Sat, 01 Mar 2014 06:27:04 GMT) Full text and rfc822 format available.

Message #22 received at 16893-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: 16893-done <at> debbugs.gnu.org
Subject: Re: bug#16893: [PATCH] Avoid matching line-by-line for
 case-insensitive with grep
Date: Fri, 28 Feb 2014 22:26:36 -0800
Thanks, I applied that.  It simplifies another patch I've been meaning 
to put in, yay!




Information forwarded to bug-grep <at> gnu.org:
bug#16893; Package grep. (Sat, 01 Mar 2014 06:35:02 GMT) Full text and rfc822 format available.

Message #25 received at 16893-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: 16893-done <at> debbugs.gnu.org
Subject: Re: bug#16893: [PATCH] Avoid matching line-by-line for
 case-insensitive with grep
Date: Fri, 28 Feb 2014 22:34:22 -0800
[Message part 1 (text/plain, inline)]
I also applied the attached followup patch to clean out macros no longer 
used.

[0001-grep-remove-lint.patch (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16893; Package grep. (Sat, 01 Mar 2014 23:18:02 GMT) Full text and rfc822 format available.

Message #28 received at 16893-done <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 16893-done <at> debbugs.gnu.org
Subject: Re: bug#16893: [PATCH] Avoid matching line-by-line for
 case-insensitive with grep
Date: Sat, 1 Mar 2014 15:16:51 -0800
Thank you both for all of the patches!




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 30 Mar 2014 11:24:07 GMT) Full text and rfc822 format available.

This bug report was last modified 11 years and 9 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.